CN113052030A - Double-current multi-scale hand posture estimation method based on single RGB image - Google Patents
- Publication number
- CN113052030A (application CN202110273215.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a dual-stream multi-scale hand pose estimation method based on a single RGB image, aimed at the problems of self-occlusion and prediction ambiguity between adjacent joints in a single RGB image. The method takes an RGB image as input, extracts features of the single image with a deep neural network, and obtains initial 2D pose coordinates of the hand joints; a two-branch network then performs 2D pose estimation, yielding two sets of 2D pose coordinates of the hand joints. From these two sets of 2D coordinates, a two-branch multi-scale semantic graph U-Net estimates two sets of 3D hand-joint coordinates, which are then added and averaged, and the final 3D coordinates of the hand joints are output. Built on two different topological structures of the hand, the method makes better use of the information shared among joints and finally achieves high-precision hand pose estimation.
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a hand pose estimation method for RGB images based on a dual-stream multi-scale network.
Background
In everyday human interaction, natural language, written language and body language are the three most important modes of expression. The first two are constrained by region, country, ethnicity and culture, whereas body language is flexible and expressive, intuitive, easy to understand and rarely ambiguous. Body language is therefore increasingly favoured by researchers in human-computer interaction. The human hand is one of the most important channels of body-language expression and can convey rich information, so enabling a computer to read the information conveyed by human hands is both valuable and necessary.
Gestures are a primary means by which humans transmit information to the outside world. Because of their flexibility, freedom, complexity and variability, hand movements carry a great deal of useful information, and the hand performs most of the work of communication and manipulation in daily life. Most machine operations are likewise carried out by hand. Whether for natural human-computer interaction or for transferring human manipulation skills to a robot, the pose of the human hand must first be estimated so that this pose information can be passed on to the robotic device.
Current hand pose estimation methods generally proceed in two stages: first the 2D pose of the hand is estimated from the input image, and then the 3D pose is regressed from the 2D pose. According to the type of input image, hand pose estimation can be roughly divided into three categories. 1) Estimation from depth images: depth-based methods have traditionally been the mainstream. A depth image contains explicit depth information, which helps recover the three-dimensional positions of the hand joints during 3D pose regression; however, present-day depth cameras have a very limited imaging range and insufficient quality, which strongly affects methods that depend on depth input, and in practice depth images are rarely available. 2) Estimation from multiple RGB images: multiple RGB images are easier to acquire than depth images, and images taken from different viewpoints contain rich 3D information, so some methods take multiple images as input to alleviate occlusion. Although such methods achieve higher accuracy and effectively address hand self-occlusion, they demand more training and testing resources, and suitable datasets are harder to collect. 3) Estimation from a single RGB image: compared with the two approaches above, a single RGB image is the easiest to obtain and the most practical, and single-RGB hand pose estimation is currently receiving wide attention; however, estimating a three-dimensional hand pose from only a single RGB image is considerably more challenging because the input lacks depth information.
Factors that hinder gesture pose estimation include self-occlusion in some hand configurations and prediction ambiguity between some adjacent joints during 3D pose regression.
Disclosure of Invention
The invention provides an improved hand pose estimation method addressing three problems: gesture self-occlusion, prediction ambiguity between adjacent joints, and the lack of semantic information caused by conventional graph convolutions sharing one weight across all nodes. Specifically: a dual-stream gesture pose estimation method based on two topological structures is proposed to mitigate gesture self-occlusion; a multi-scale U-Net 3D gesture regression method is proposed to reduce the prediction ambiguity of adjacent joints during regression; and semantic graph convolutional networks are introduced to hand pose estimation for the first time, giving each joint its own node weight, describing the semantic information of each joint more expressively, and improving the accuracy of both 2D pose estimation and 3D pose regression. The specific technical scheme is as follows:
Step 1) extract features from a single image and obtain initial 2D pose coordinates of the hand joints;
Step 2) perform 2D pose estimation with a two-branch network to obtain accurate 2D pose coordinates of the hand joints, the two branches having the same structure;
The first step yields an N × F feature matrix, where N is the number of hand joints and F the feature dimension, from which a graph can be built. In the second step, two graph structures are designed from this graph according to different connection relations of the hand, each represented by a different adjacency matrix, which gives the two-branch network structure. Each branch feeds the feature matrix and its corresponding adjacency matrix into a 2D pose refinement network composed of semantic graph convolution layers, so each branch produces one 2D hand pose.
Step 3) estimate the 3D coordinates of the hand joints with a multi-scale semantic graph U-Net, which likewise has two branches of identical structure; the input of each branch's U-Net is that branch's 2D pose coordinates from step 2) together with the corresponding adjacency matrix, and its output is a 3D hand pose. The 3D poses of the two branches are then added and averaged, and the final 3D coordinates of the hand joints are output.
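The add-and-average fusion that closes step 3) is simple enough to sketch directly. The function name and the (21, 3) array shapes below are illustrative assumptions, not the patent's own code:

```python
import numpy as np

def fuse_branches(pose3d_physical, pose3d_symmetric):
    """Average the (21, 3) 3D joint predictions of the two branches
    (physical-connection and symmetric-connection graphs) to obtain
    the final hand pose, as described in step 3)."""
    return 0.5 * (pose3d_physical + pose3d_symmetric)
```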
Advantageous effects
The invention proposes a novel dual-stream, multi-scale network model built on two topological structures of the hand joints. It addresses self-occlusion and adjacent-joint prediction ambiguity using only a single RGB image, and obtains high-precision three-dimensional hand-joint coordinates from that image. Most existing methods use depth images to guide the training of the network model in order to reach more accurate three-dimensional joint coordinates. In contrast, the present method exploits two different topologies of the hand to make better use of the information between hand joints, and designs a dual-stream multi-scale network model around the different characteristics that the features of different hand joints may exhibit, finally achieving high-precision hand pose estimation.
Drawings
FIG. 1 is a general schematic diagram of a hand pose estimation method;
FIG. 2 is a schematic diagram of two topologies of a hand;
FIG. 3 is a multi-scale graph U-net network model;
FIG. 4 is a schematic illustration of hand bone loss.
Detailed Description
The invention consists of three parts: 1) feature extraction and 2D pose initialization: the image is fed into a network that extracts its features and produces an initial 2D pose of the hand joints; 2) 2D pose estimation: using the features and initial 2D pose from part 1), two topological structures of the hand joints are built and fed into separate 2D pose refinement networks to refine the 2D pose; 3) 3D pose regression: exploiting the fact that different hand joints respond to features at different scales, a multi-scale feature-fusion graph U-Net 3D regression model is designed. The overall scheme of the hand pose estimation method is shown in fig. 1, the two topological structures in fig. 2, and the 3D regression model in fig. 3.
1) Feature extraction and 2D pose initialization
First, a ResNet50 network extracts a 2048-dimensional feature vector from the input image, and an additional fully connected layer maps this vector to initial 2D joint coordinates. The 2048-dimensional feature vector extracted by the ResNet50 network and the initial 2D joint coordinates are then spliced into an N × F feature matrix, where N is the number of hand joints (N = 21 in the invention: 1 wrist joint, 5 metacarpophalangeal joints, and 3 joints on each of the 5 fingers) and F is the feature dimension (F = 2050 in the invention: the 2048-dimensional image feature plus the 2D coordinates x and y of each joint point).
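As a sketch of this splicing step, assuming (as is common in such two-stage pipelines, though the patent does not spell it out) that the single 2048-dimensional global feature is replicated for every joint:

```python
import numpy as np

def build_feature_matrix(img_feat, init_2d):
    """Splice the ResNet50 image feature with the initial 2D joints.
    img_feat: (2048,) global image feature vector.
    init_2d:  (21, 2) initial 2D joint coordinates.
    Returns the N x F matrix with N = 21 and F = 2048 + 2 = 2050."""
    tiled = np.tile(img_feat[None, :], (init_2d.shape[0], 1))
    return np.concatenate([tiled, init_2d], axis=1)
```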
2) 2D pose estimation
From the graph obtained in the first step, two different graph structures of the hand are designed, giving a two-branch network; by exploiting different connection patterns of the hand, the two graph structures complement each other in the subsequent network and lead to a more accurate hand pose. In the invention, the two graph structures are represented by different adjacency matrices; the feature matrix N × F from the first step and each branch's adjacency matrix A are fed into a 2D pose refinement network composed of 3 semantic graph convolution layers to obtain the 2D hand pose. Because earlier hand pose estimation methods using graph convolutional networks could not express the correlations between hand joints well, the invention designs two different graph structures to express these relations better. The two graph structures are shown in fig. 2. The first graph structure, called the physical connection, links the physically adjacent joints of the hand; it is the topology commonly used in prior methods. It exploits the relations among the joints along each finger and is therefore better suited to representing simpler gestures.
The second graph structure, called the symmetric connection, links the same-level joints of the different fingers. Earlier methods did not consider the interaction and spatial relations among the same-level joints of each finger; connecting them exploits those relations and effectively mitigates self-occlusion. Because the same-level joints of the fingers move very similarly during hand motion, adjacent joints remain correlated throughout the movement, and this graph structure expresses that correlation well. For example, the hand self-occludes when grasping an object or making a fist; with this connection pattern, occluded joints can be better estimated from the visible ones. This structure is therefore better suited to gestures in which hand joints occlude one another. The two graph structures are represented by different adjacency matrices.
The symmetric-connection adjacency matrix is constructed as follows. Let G = {V, E} denote a graph, where V is the set of N hand joint points and E the set of edges. In the adjacency matrix A, a_ij = 1 when two joint points are connected and a_ij = 0 otherwise, where i and j index two joints of the hand. Each finger has three joints: the third joint is at the fingertip, and the second and first joints are the two joints below it; for example, the first and second joints of the middle finger are, respectively, the joint just above the middle finger's metacarpophalangeal joint and the joint just below its third joint. The first joints of adjacent fingers are connected to each other, as are the second joints and the third joints. The palm has 6 joints, one of which is the wrist joint; the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
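The symmetric-connection adjacency matrix can be built mechanically from this description. The joint indexing below (0 = wrist, 1 to 5 = metacarpophalangeal joints, and 6 + 3f + level for the first/second/third joints of finger f) is a hypothetical convention, not one fixed by the patent:

```python
import numpy as np

def symmetric_adjacency(n=21):
    """Adjacency matrix A of the symmetric connection: a_ij = 1 when
    joints i and j are connected, 0 otherwise."""
    A = np.zeros((n, n), dtype=int)

    def connect(i, j):
        A[i, j] = A[j, i] = 1

    for mcp in range(1, 6):        # wrist (0) to each of the 5 MCP joints
        connect(0, mcp)
    for mcp in range(1, 5):        # adjacent MCP joints of the palm
        connect(mcp, mcp + 1)
    for f in range(4):             # adjacent fingers f and f + 1
        for level in range(3):     # same-level first/second/third joints
            connect(6 + 3 * f + level, 6 + 3 * (f + 1) + level)
    return A
```

Note that, unlike the physical connection, this graph contains no edges along a finger: its 21 edges are the 5 wrist links, 4 palm links, and 12 same-level links between neighbouring fingers.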
Because the joints of a hand are correlated, their adjacency relations can be represented with an adjacency matrix; semantic graph convolution both describes the relations among joints well and extracts richer semantic information of the joint points, yielding better feature representations. A 2D pose refinement network composed of 3 semantic graph convolution layers is therefore adopted as the basic network model for 2D pose estimation.
The basic formula of conventional graph convolution is as follows:
X^(l+1) = σ(W X^(l) Â) (1)
where X^(l) is the input feature matrix of the l-th layer and X^(l+1) the output feature matrix of the l-th layer; N is the number of joints, K the input feature dimension of the graph convolution network, W a learnable weight matrix, and Â the symmetrically normalized form of the adjacency matrix A.
As can be seen, conventional graph convolution shares one weight across all nodes, yet each joint of the hand in fact influences the other joints differently, so weight sharing is not appropriate. To describe the semantic information of each node better, semantic graph convolution is introduced to hand pose estimation for the first time, in the expectation of a better result. The semantic graph convolution formula is as follows:
X^(l+1) = σ(W X^(l) ρ_i(M ⊙ A)) (2)
where ρ_i is a Softmax nonlinear transformation that normalizes the matrix elements, and ⊙ denotes the element-wise product: when a_ij = 1 the value of the learnable element M_ij in the matrix M is retained, and otherwise the ρ_i operation drives the corresponding entry toward 0. A is the adjacency matrix of the nodes and encodes their connection relations; σ is the ReLU nonlinear activation function; W is a learnable weight matrix. X^(0), the input of the network, is in the invention the N × F feature matrix obtained in the first stage.
In the invention, accurate 2D pose estimation is realized by 3 semantic graph convolution layers; the input is X^(0), the N × F feature matrix, and the output is the accurately estimated 2D coordinates of the N joint points. The input and output dimensions of each layer are as follows, with arrows denoting the semantic graph convolution operation: (21, 2050) → (21, 128) → (21, 16) → (21, 2). The first number in each bracket, 21, is the number of hand joints, and the second is the feature dimension of each joint.
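A minimal NumPy sketch of one semantic graph convolution layer from equation (2) follows. Implementing ρ_i(M ⊙ A) as a row-wise Softmax over M with non-edges masked to −∞ (so they normalize to approximately 0) matches the explanation above, but the exact masking used in the patent is an assumption:

```python
import numpy as np

def row_softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sem_gconv(X, A, M, W):
    """One semantic graph convolution: sigma(W X rho_i(M (.) A)).
    X: (K, N) features; A: (N, N) adjacency (assumed to include
    self-loops so every row has at least one edge); M: (N, N)
    learnable mask; W: (K_out, K) learnable weights."""
    masked = np.where(A > 0, M, -np.inf)  # keep M_ij only where a_ij = 1
    P = row_softmax(masked)               # rho_i: each row sums to 1
    return np.maximum(0.0, W @ X @ P)     # ReLU activation
```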
3) Multi-scale 3D hand pose estimation based on dual branches
Most of the literature uses a graph convolution module plus a non-local module as the 3D pose regressor. The invention instead adopts a U-Net as the basic network model and designs a multi-scale feature fusion process, which captures the relations among points more accurately and better removes the ambiguity that arises in lifting a 2D pose to 3D. The network is shown in fig. 3. The input is the 2D hand pose coordinates obtained in the second step; after downsampling, upsampling and multi-scale feature fusion, a semantic graph convolution layer outputs the three-dimensional coordinates of the hand joints.
Multi-scale features are widely used in computer vision, and their fusion has shown impressive performance. Since the features of each hand joint are not necessarily concentrated in the last feature layer, and joints of different parts may be distributed over features of different scales, the invention performs multi-scale feature fusion in the decoder of the U-Net and finally obtains the three-dimensional hand-joint coordinates with a semantic graph convolution layer. The network comprises three stages: downsampling, upsampling and multi-scale feature fusion. The network first captures the global information of the hand joints during downsampling, then restores the resolution by upsampling. To preserve low-level information, the features of the downsampling stage are added to the upsampling branch through skip connections. Finally, the features obtained at the multiple scales are combined to predict the three-dimensional coordinates of the hand joints. The specific process is as follows.
Downsampling stage: most of the literature performs graph pooling with a sigmoid function, but the sigmoid produces very small gradients during backpropagation, so the network never updates the randomly initialized node-selection weights throughout training and the benefit of the pooling layer is lost. The invention instead applies a fully connected layer to the transpose of the feature matrix; the fully connected layer acts as a kernel over the features and outputs the required number of nodes. The downsampling is computed as follows:
Y_0 = G_0(X) (3)
P_i = Pooling_i(Y_(i-1)) = FC(Y_(i-1)^T)^T, Y_i = G_i(P_i), i = 1, 2, …, 5 (4)
where X is the input of the network, the 2D hand-joint pose coordinate matrix obtained in the second stage, with N the number of joints and l the feature dimension (here N = 21 and l = 2); i indexes the downsampling steps; G_i(·) is the downsampling graph convolution operation and Y_i its output; Pooling_i(·) is the downsampling, i.e. pooling, computation and P_i the output of the pooling layer; FC(·) denotes the fully connected layer operation.
Upsampling stage: the unpooling layer in the invention uses a transposed-convolution-style method: a fully connected layer is applied to the transposed feature matrix to obtain the required number of output nodes, and the matrix is then transposed back. The unpooling is computed as follows:
U_i = G_i(Unpooling_i(U_(i-1)) ⊕ Y_(5-i)), i = 1, 2, …, 5, with U_0 = Y_5
where U_i is the output of each layer in the upsampling process, Unpooling_i(·) the unpooling computation (i.e. upsampling), ⊕ the feature fusion, that is, concatenation with the skip-connected features of the downsampling stage, and G_i(·) the graph convolution operation.
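The fully-connected pooling and unpooling can be sketched as one operation applied to the transposed feature matrix. The weights below are random stand-ins for the learned kernels, and the 21 → 10 → 21 node counts are illustrative, not taken from the patent:

```python
import numpy as np

def fc_resample(X, W):
    """Graph pooling or unpooling via a fully connected layer on the
    transposed feature matrix: (N_in, F) -> (N_out, F)."""
    return (X.T @ W).T  # FC acts across nodes, then transpose back

rng = np.random.default_rng(0)
X = rng.standard_normal((21, 2))        # stage-two 2D pose matrix
W_down = rng.standard_normal((21, 10))  # pooling kernel: 21 -> 10 nodes
W_up = rng.standard_normal((10, 21))    # unpooling kernel: 10 -> 21 nodes
pooled = fc_resample(X, W_down)
restored = fc_resample(pooled, W_up)
```

Because the fully connected layer is an ordinary learnable matrix, its gradients do not vanish the way a sigmoid-gated node selection can, which is the design motivation stated above.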
Multi-scale feature extraction stage: because a multi-scale strategy captures global context better, in the upsampling stage the nodes at each scale are directly upsampled back to 21 nodes, a graph convolution layer then reduces each node's features to 4 dimensions (21, 4), and after the features of all scales are concatenated, two further graph convolution layers produce the three-dimensional hand-joint coordinates. The computation is as follows:
F_i = G_i^ms(Unpool_i(U_i)), out = F_1 ⊕ F_2 ⊕ … ⊕ F_5
where G_i^ms(·) is the multi-scale graph convolution computation, F_i the output of the graph convolution layer at each scale, Unpool_i(·) the unpooling computation (i.e. upsampling), and out the multi-scale fused output.
To train the network better end-to-end, the following loss functions form the objective of the whole network.
In the 2D pose initialization and 2D pose estimation processes, the loss used is:
L_2D = Σ_(j=1)^N ‖H_j − Ĥ_j‖²
where H_j is the 2D pose label, i.e. the 2D coordinates provided by the public dataset, Ĥ_j the predicted 2D pose coordinates, and N the total number of hand joints (here N = 21).
The 3D pose estimation uses a pose loss L_pose, a bone-length loss L_len and a bone-orientation loss L_dir.
The pose loss L_pose is:
L_pose = Σ_(j=1)^N ‖P_j − P̂_j‖²
where P_j is the 3D pose label, i.e. the 3D coordinates provided by the public dataset, and P̂_j the predicted 3D pose coordinates.
The bone-length loss L_len and bone-orientation loss L_dir are:
L_len = Σ_((i,j)∈E) | ‖b_(i,j)‖ − ‖b̂_(i,j)‖ |
L_dir = Σ_((i,j)∈E) ‖ b_(i,j)/‖b_(i,j)‖ − b̂_(i,j)/‖b̂_(i,j)‖ ‖
where b_(i,j) is the bone vector between the i-th and j-th joints of the hand, namely:
b_(i,j) = P_j − P_i
The loss function L_3D of the 3D hand pose estimation network is:
L_3D = λ_pose L_pose + λ_len L_len + λ_dir L_dir (14)
where λ_pose, λ_len and λ_dir are the weighting hyperparameters among the 3D pose loss, the bone-length loss and the bone-orientation loss, set here to λ_pose = 1, λ_len = 0.01 and λ_dir = 0.1. A schematic of the bone losses is shown in fig. 4.
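The 3D loss of equation (14) can be sketched as follows. The edge list and the exact reductions (plain sums, squared L2 for the pose term) are assumptions consistent with the formulas above, not the patent's code:

```python
import numpy as np

def bone_losses(P, P_hat, edges, eps=1e-8):
    """Bone-length and bone-orientation losses over an edge list of
    (i, j) joint pairs, with bone vector b_ij = P[j] - P[i]."""
    L_len = L_dir = 0.0
    for i, j in edges:
        b, bh = P[j] - P[i], P_hat[j] - P_hat[i]
        nb, nbh = np.linalg.norm(b), np.linalg.norm(bh)
        L_len += abs(nb - nbh)                         # length term
        L_dir += np.linalg.norm(b / (nb + eps) - bh / (nbh + eps))
    return L_len, L_dir

def loss_3d(P, P_hat, edges, lam_pose=1.0, lam_len=0.01, lam_dir=0.1):
    """L_3D = lam_pose*L_pose + lam_len*L_len + lam_dir*L_dir."""
    L_pose = np.sum((P - P_hat) ** 2)                  # squared L2 pose loss
    L_len, L_dir = bone_losses(P, P_hat, edges)
    return lam_pose * L_pose + lam_len * L_len + lam_dir * L_dir
```

A useful sanity check: a rigid translation of the whole prediction leaves both bone terms at zero while the pose term grows, so the bone losses specifically penalize skeletal distortion rather than global offset.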
The overall objective function L is:
L = λ_init L_init + λ_2D L_2D + λ_3D L_3D (15)
where λ_init, λ_2D and λ_3D are the hyperparameters of the initial 2D pose loss, the 2D pose loss and the 3D pose loss, set here to λ_init = 0.01, λ_2D = 0.01 and λ_3D = 1.
The invention uses the public datasets STB and FreiHAND as training and test datasets for validating the model, and the public ObMan dataset for pre-training the network model.
Data set
FreiHAND dataset: this dataset consists of real images and contains samples both with and without object interaction. It was captured with a multi-view setup and contains 33,000 samples; recordings of gestures on 32 objects provide 3D annotations of 21 joint points. In the invention, the dataset is split 80% for training and 20% for testing.
STB dataset: the STB dataset is a real dataset containing 18,000 images with ground-truth positions of 21 three-dimensional hand joints and the corresponding depth images. In the experiments, 15,000 images are used as training samples and 3,000 as test samples. For consistency with the ObMan dataset, the root joint of the STB dataset is moved from the palm centre to the wrist.
ObMan dataset: ObMan is a large synthetic dataset for hand pose estimation. Its images were obtained by ShapeNet-based rendering. ObMan contains 141,550 training samples, 6,463 validation samples and 6,285 test samples. Despite the large amount of annotated data, models trained on these synthetic images do not generalize well to real images. It nevertheless benefits the training of the present model: the network can be pre-trained on the massive ObMan dataset and then fine-tuned on real images.
Evaluation index
The invention evaluates the model by the Euclidean distance between the 3D label coordinates and the estimated coordinates, computed as:
MPJPE = (1/(T·N)) Σ_(t=1)^T Σ_(i=1)^N ‖ J_i^(t) − Ĵ_i^(t) ‖
where MPJPE is the mean per-joint position error, T the number of samples, N the number of joint points (here N = 21), J_i^(t) the label coordinates of the i-th joint of the t-th image, and Ĵ_i^(t) the predicted coordinates of the i-th joint of the t-th image.
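The MPJPE evaluation reduces to a one-liner; the (T, N, 3) array layout is an assumption:

```python
import numpy as np

def mpjpe(labels, preds):
    """Mean per-joint position error: Euclidean distance between 3D
    label and predicted coordinates, averaged over all N joints of
    all T samples. labels, preds: (T, N, 3) arrays in millimetres."""
    return np.linalg.norm(labels - preds, axis=-1).mean()
```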
Experiment and results
The experiments of the invention were trained and tested on an Ubuntu 16.04 machine with an Intel Core i5 CPU, 11 GB of RAM and an NVIDIA GTX2080TI GPU, using PyTorch 1.3.1. The network model is first pre-trained on the synthetic ObMan dataset with an initial learning rate of 0.001, multiplied by 0.1 every 500 epochs, for 5,000 epochs in total. The model is then trained end-to-end on the FreiHAND dataset for 500 epochs with an initial learning rate of 0.001, multiplied by 0.1 every 100 epochs. All images are resized to 224 × 224 pixels before being passed into the ResNet network.
Two experiments were designed. The first compares against the baseline of the invention: the baseline adopts the method of Bardia et al., extracting image features with a ResNet-50 network and then performing 2D and 3D posture regression with an adaptive graph convolution and a graph U-Net network; ablation experiments were carried out separately on the dual-stream network and the multi-scale model of the invention. The second experiment compares against recent research methods for hand posture estimation from RGB images. The experimental results are shown in the following tables:
Table 1 Comparison with existing methods on the STB dataset
Method | Mean joint error (unit: mm) |
---|---|
Theodoridis et al. | 6.93 |
Spurr et al. | 8.56 |
Ge et al. | 6.37 |
Yang et al. | 7.05 |
The invention | 5.972 |
Table 2 Comparison with existing methods on the FreiHAND dataset
Method | Mean joint error (unit: mm) |
---|---|
Parelli M et al. | 11 |
Doosti B et al. | 8.887 |
Ge et al. | 15.36 |
The invention | 8.247 |
Table 3 Ablation experiments of the proposed method (mean joint error, unit: mm)
Method | STB dataset | FreiHAND dataset |
---|---|---|
Baseline | 8.573 | 8.887 |
Baseline + semantic graph convolution | 7.256 | 8.406 |
ResNet-50 + dual-stream network | 8.044 | 8.637 |
ResNet-50 + dual-stream network + semantic graph convolution | 7.202 | 8.358 |
ResNet-50 + multi-scale | 7.657 | 8.559 |
ResNet-50 + multi-scale + semantic graph convolution | 6.655 | 8.435 |
ResNet-50 + dual-stream + multi-scale | 6.085 | 8.401 |
ResNet-50 + dual-stream + multi-scale + semantic graph convolution | 5.972 | 8.247 |
In conclusion, the dual-stream, multi-scale network model of the invention outperforms traditional hand posture estimation methods: it achieves higher-precision hand posture estimation with only a single RGB image as input. The model uses two different graph structures of the hand, so it can better exploit the features shared between adjacent hand joints, and the two structures have complementary structural characteristics. The multi-scale features make better use of the semantic information of the hand joints, reducing the ambiguity between adjacent joints when regressing from a single RGB image; and introducing a semantic graph convolution network into hand posture estimation describes the semantic information of each joint more expressively. A more accurate 3D hand posture is thereby obtained.
Claims (7)
1. A dual-stream multi-scale hand posture estimation method based on a single RGB image, characterized by comprising the following steps:
step 1) extracting the features of a single image and obtaining initial 2D posture coordinates of the hand joints;
step 2) performing 2D posture estimation by using a double-branch network to obtain accurate 2D posture coordinates of hand joints, wherein the double-branch network has two branches with the same structure;
step 3) estimating the 3D coordinates of the hand joints using a multi-scale semantic graph U-Net network, wherein the multi-scale semantic graph U-Net network has two branches with the same structure; the input of each branch is the 2D posture coordinates obtained by that branch in step 2) together with the corresponding adjacency matrix, and the output is the 3D posture of the hand joints; the 3D postures obtained by the two branches are then averaged, and the final 3D coordinates of the hand joints are output.
2. The method of claim 1, wherein the method comprises the following steps: the step 1) is specifically as follows,
encoding the input single RGB image with a Resnet50 network, each input image producing a 2048-dimensional feature vector;
then generating initial predicted two-dimensional coordinates of the hand joint points with an additional fully connected layer, and splicing the obtained feature vector with the initial two-dimensional predicted coordinates of each joint point to generate a graph in which every node has F features, i.e. an N × F feature matrix, where N denotes the number of hand joints and F denotes the feature dimension.
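As a sketch of this step, assuming the feature dimension works out to F = 2048 + 2 = 2050 (the patent does not state F explicitly, so this reading is an assumption), the node-feature matrix could be assembled as:

```python
import numpy as np

N, D = 21, 2048  # 21 hand joints; ResNet-50 global feature dimension

feat_2048 = np.random.randn(D)   # image feature vector from the encoder
init_2d = np.random.randn(N, 2)  # initial 2D coordinates from the FC head

# Broadcast the global feature to every joint and append that joint's
# initial 2D coordinates, giving an N x F node-feature matrix, F = D + 2.
node_feats = np.concatenate([np.tile(feat_2048, (N, 1)), init_2d], axis=1)
print(node_feats.shape)  # (21, 2050)
```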
3. The method of claim 1, wherein the method comprises the following steps: the step 2) specifically comprises the following steps of,
obtaining the N × F feature matrix from step 1), where N denotes the number of hand joints and F the feature dimension, and constructing two graph structures, each represented by an adjacency matrix: the first, called the physical connection, represents the relations between the physical joints of the hand; the second, called the symmetric connection, represents the relations between the same joints of each finger;
inputting the N × F feature matrix and the physical-connection adjacency matrix into one branch of the dual-branch network, and the N × F feature matrix and the symmetric-connection adjacency matrix into the other branch, each branch consisting of 3 semantic graph convolution layers connected in series.
4. The method of claim 3, wherein the method comprises the following steps: the semantic graph convolution formula is as follows:
X^{(l+1)} = σ(W X^{(l)} ρ_i(M ⊙ A))  (2)
where ρ_i is a Softmax nonlinear transformation used to normalize the matrix elements; ⊙ denotes the element-wise (Hadamard) product: if element a_{ij} of matrix A is 1, the value of element m_{ij} of matrix M is retained, otherwise the ρ_i operation drives it to approximately 0; matrix A is the adjacency matrix of the nodes and represents the connection relations among them; σ denotes the ReLU nonlinear activation function; W is a learnable weight matrix; X^{(0)} is the input to the network, i.e. the N × F feature matrix obtained in step 1); and the output is the accurately estimated 2D coordinates of the N joint points.
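A minimal NumPy sketch of one such layer, following Eq. (2) literally (features stored as a D × N matrix; implementing ρ_i as a masked Softmax that suppresses non-neighbors is an assumption about the intended mechanism):

```python
import numpy as np

def rho(M, A):
    """Masked Softmax: normalize the learnable weights M over each
    node's neighbourhood; entries where A == 0 are pushed to ~0."""
    logits = np.where(A > 0, M, -np.inf)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sem_graph_conv(X, A, M, W):
    """One semantic graph convolution layer per Eq. (2):
    X_next = ReLU(W @ X @ rho(M, A)).
    X: D x N node features, A: N x N adjacency (with self-loops),
    M: N x N learnable mask, W: D_out x D learnable weights."""
    return np.maximum(W @ X @ rho(M, A), 0.0)
```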
5. The method of claim 3, wherein the method comprises the following steps: the symmetric-connection adjacency matrix is constructed as follows: let G = {V, E} denote a graph, where V is the set of N hand joint points and E denotes the edges; in the adjacency matrix A, a_{ij} = 1 when two joint points are connected and a_{ij} = 0 otherwise, with i and j indexing two hand joints; each finger has three joints, the fingertip carrying the third joint and the two joints below it being the second and first joints; the same joints of adjacent fingers are connected to each other; the palm has 6 joints, one of which is the wrist joint, the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
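To illustrate, the two adjacency matrices might be built as follows; the concrete joint indexing (0 = wrist, then MCP plus three finger joints per finger) and the inclusion of the MCP joints in the symmetric graph are assumptions, since the claim fixes only the connection rules:

```python
import numpy as np

# An assumed 21-joint indexing: 0 = wrist; each finger contributes
# MCP, first, second, third (tip) joints, giving joints 1..20 for
# thumb, index, middle, ring and pinky in order.
N = 21
fingers = [list(range(1 + 4 * f, 5 + 4 * f)) for f in range(5)]

def connect(A, i, j):
    A[i, j] = A[j, i] = 1  # undirected edge

# Physical connections: wrist to each MCP, the chain along each
# finger, and adjacent MCP joints connected to each other.
A_phys = np.zeros((N, N))
for chain in fingers:
    connect(A_phys, 0, chain[0])
    for a, b in zip(chain, chain[1:]):
        connect(A_phys, a, b)
for f in range(4):
    connect(A_phys, fingers[f][0], fingers[f + 1][0])

# Symmetric connections: the same joint of adjacent fingers.
A_symm = np.zeros((N, N))
for f in range(4):
    for k in range(4):
        connect(A_symm, fingers[f][k], fingers[f + 1][k])
```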
6. The method of claim 1, wherein the method comprises the following steps: the branch structure in the multi-scale semantic graph U-Net network sequentially comprises three stages: down-sampling, up-sampling, multi-scale feature fusion,
wherein the calculation process of the down-sampling stage is as follows:
Y_0 = G_0(X)  (3)
P_i = Pooling_i(Y_{i−1}),  i = 1, 2, …, 5  (4)
where X ∈ R^{N×l} is the input to the network, i.e. the 2D posture coordinate matrix of the hand joints obtained in step 2), N being the number of joints and l the feature dimension; i denotes the down-sampling index; G_i(·) denotes the graph convolution operation of the down-sampling path; Y_i denotes the output of the graph convolution; Pooling_i(·) denotes the down-sampling, i.e. pooling, calculation; P_i denotes the output of the pooling layer; and FC(·) denotes the fully connected layer operation;
an up-sampling stage: up-sampling uses a transposed operation, i.e. an inverse pooling layer: a fully connected layer is applied to the transpose of the feature matrix to obtain the required number of output nodes, and the matrix is then transposed back; the calculation process of the inverse pooling is as follows:
where U_i denotes the output of each layer in the up-sampling process, Unpooling_i(·) denotes the inverse pooling calculation, i.e. up-sampling, ⊕ denotes feature fusion, that is, feature splicing, and G_i(·) denotes the corresponding graph convolution operation;
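The pooling/unpooling trick described here, applying a fully connected (linear) map to the transposed feature matrix to change the number of nodes and transposing back, can be sketched as follows (the weight shapes and node counts are illustrative assumptions):

```python
import numpy as np

def graph_pool(X, W):
    """Pool an N x F node-feature matrix down to M nodes (M < N):
    apply a learned linear map to the transposed features,
    (F x N) @ (N x M) -> F x M, then transpose back to M x F."""
    return (X.T @ W).T

def graph_unpool(X, W):
    """Inverse pooling: the same trick with an M x N weight matrix
    grows the node dimension back to N."""
    return (X.T @ W).T

X = np.random.randn(21, 64)       # 21 joints, 64 features per joint
W_down = np.random.randn(21, 10)  # learnable pooling weights
W_up = np.random.randn(10, 21)    # learnable unpooling weights
P = graph_pool(X, W_down)
U = graph_unpool(P, W_up)
print(P.shape, U.shape)  # (10, 64) (21, 64)
```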
in the multi-scale feature extraction stage, the calculation process is as follows:
7. The method of claim 1, wherein the method comprises the following steps: the overall objective function L is:
L = λ_init L_init + λ_2D L_2D + λ_3D L_3D  (15)
where λ_init, λ_2D and λ_3D are the weights of the initial 2D posture loss, the 2D posture loss and the 3D posture loss respectively; L_init and L_2D are the losses of the 2D posture initialization and 2D posture estimation processes, as follows:
where H_j denotes the label of the 2D posture, i.e. the 2D coordinates provided in the public dataset, and Ĥ_j denotes the predicted 2D posture coordinates;
L_3D is the loss function of the 3D hand posture estimation network:
L_3D = λ_pose L_pose + λ_len L_len + λ_dir L_dir  (14)
where λ_pose, λ_len and λ_dir are the weight hyperparameters of the 3D posture loss, the bone-length loss and the bone-direction loss respectively,
the posture loss L_pose is:
L_pose = Σ_{i=1}^{N} ‖ J_i − Ĵ_i ‖₂²
where J_i denotes the label of the 3D posture, i.e. the 3D coordinates provided in the public dataset, and Ĵ_i denotes the predicted 3D posture coordinates;
the bone-length loss L_len and the bone-direction loss L_dir are:
L_len = Σ_{(i,j)} | ‖b̂_{i,j}‖₂ − ‖b_{i,j}‖₂ |
L_dir = Σ_{(i,j)} ‖ b̂_{i,j} / ‖b̂_{i,j}‖₂ − b_{i,j} / ‖b_{i,j}‖₂ ‖₂
where b_{i,j} denotes the bone vector between the i-th and j-th joints of the hand, namely:
b_{i,j} = J_i − J_j
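A hedged sketch of the bone-vector and bone-loss computations (the reduction over edges and the exact norms are assumptions, since the patent's original formulas did not survive extraction):

```python
import numpy as np

def bone_vectors(J, edges):
    # b_{i,j} = J_i - J_j for each connected pair of joints (i, j)
    return np.stack([J[i] - J[j] for i, j in edges])

def bone_losses(J_pred, J_gt, edges, eps=1e-8):
    """Bone-length and bone-direction losses: compare the lengths and
    unit directions of predicted vs. ground-truth bone vectors."""
    b_p = bone_vectors(J_pred, edges)
    b_g = bone_vectors(J_gt, edges)
    len_p = np.linalg.norm(b_p, axis=1)
    len_g = np.linalg.norm(b_g, axis=1)
    L_len = np.abs(len_p - len_g).mean()
    L_dir = np.linalg.norm(b_p / (len_p[:, None] + eps)
                           - b_g / (len_g[:, None] + eps), axis=1).mean()
    return L_len, L_dir

# Identical predicted and ground-truth poses give zero bone losses.
J = np.random.randn(21, 3)
edges = [(0, 1), (1, 2), (2, 3)]
L_len, L_dir = bone_losses(J, J, edges)
```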
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273215.0A CN113052030B (en) | 2021-03-11 | 2021-03-11 | Double-flow multi-scale hand gesture estimation method based on single RGB image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052030A true CN113052030A (en) | 2021-06-29 |
CN113052030B CN113052030B (en) | 2024-09-24 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762177A (en) * | 2021-09-13 | 2021-12-07 | 成都市谛视科技有限公司 | Real-time human body 3D posture estimation method and device, computer equipment and storage medium |
CN116958958A (en) * | 2023-07-31 | 2023-10-27 | 中国科学技术大学 | Self-adaptive class-level object attitude estimation method based on graph convolution double-flow shape prior |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175566A (en) * | 2019-05-27 | 2019-08-27 | 大连理工大学 | A kind of hand gestures estimating system and method based on RGBD converged network |
CN110427877A (en) * | 2019-08-01 | 2019-11-08 | 大连海事大学 | A method of the human body three-dimensional posture estimation based on structural information |
CN111428555A (en) * | 2020-01-17 | 2020-07-17 | 大连理工大学 | Joint-divided hand posture estimation method |
CN112329525A (en) * | 2020-09-27 | 2021-02-05 | 中国科学院软件研究所 | Gesture recognition method and device based on space-time diagram convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||