CN113052030A - Dual-stream multi-scale hand pose estimation method based on a single RGB image

Info

Publication number: CN113052030A (application CN202110273215.0A)
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN113052030B (granted)
Prior art keywords: hand, joints, matrix, coordinates
Inventors: 王立春, 马胜蕾, 李敬华, 孔德慧, 王少帆, 尹宝才
Original and current assignee: Beijing University of Technology
Application filed by Beijing University of Technology; priority date 2021-03-11
Publication of CN113052030A: 2021-06-29; publication of granted CN113052030B: 2024-09-24
Legal status: Active (granted)


Classifications

    • G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F17/16 — Digital computing adapted for specific functions; complex mathematical operations; matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06T3/4038 — Geometric image transformations; scaling of whole images or parts thereof; image mosaicing, e.g. composing plane images from plane sub-images
    • G06V20/647 — Scene-specific elements; three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G06V40/28 — Recognition of biometric, human-related or animal-related patterns; recognition of hand or arm movements, e.g. recognition of deaf sign language
    • Y02T10/40 — Climate change mitigation technologies related to transportation; engine management systems


Abstract

The invention relates to a dual-stream multi-scale hand pose estimation method based on a single RGB image, which addresses self-occlusion and the prediction ambiguity between adjacent joints in a single RGB image. The method takes an RGB image as input, extracts features from the single image with a deep neural network to obtain initial 2D pose coordinates of the hand joints, and performs 2D pose estimation with a dual-branch network to obtain two sets of 2D hand-joint pose coordinates. For the two sets of 2D pose coordinates, the 3D coordinates of the hand joints are estimated separately with a dual-branch multi-scale semantic graph U-Net; the two sets of 3D coordinates are then averaged, and the final 3D coordinates of the hand joints are output. The method builds on different topological structures of the hand, makes better use of the information between joints, and finally achieves high-precision hand pose estimation.

Description

Dual-stream multi-scale hand pose estimation method based on a single RGB image
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a hand pose estimation method for RGB images based on a dual-stream multi-scale network.
Background
In everyday interaction between people, natural language, written language and body language are the three most important modes of expression. The former two are constrained by region, country, ethnicity and culture, whereas body language is flexible and changeable, can express basic human intentions, and is intuitive, easy to understand and unlikely to cause ambiguity. Body language is therefore increasingly favored by interaction researchers. The human hand is one of the most important parts of body-language expression and can convey rich information, so enabling a computer to read the information conveyed by human hands is both valuable and necessary.
Gestures are a primary way for humans to transmit information to the outside world. Because of their flexibility, freedom, complexity and variability, hand movements carry a large amount of useful information, and the hands perform most of the work in daily life, such as communication and manipulation; indeed, most machine operations are performed by hand. Therefore, whether for natural human-computer interaction or for transferring human manipulation experience to robots, the pose of the human hand must first be estimated so that its pose information can be passed to the robotic equipment.
At present, hand pose estimation methods generally comprise two stages: the 2D pose of the hand is estimated from the input image, and the 3D pose is then regressed from the 2D pose. According to the kind of input image, hand pose estimation can be roughly divided into three categories. 1) Hand pose estimation from depth images: depth-image-based methods have traditionally been the main approach. A depth image contains some depth information, so three-dimensional information about the hand joints is easier to recover during 3D pose regression; however, current depth cameras have a very limited imaging range and insufficient quality, which strongly affects methods that rely on depth images as input, and in practice depth images are rarely available to ordinary users. 2) Hand pose estimation from multiple RGB images: multiple RGB images are easier to acquire than depth images, and RGB images taken from different views contain rich 3D information, so some methods take multiple images as input to alleviate occlusion. Although such methods can reach higher accuracy and effectively handle hand self-occlusion, they require more training and testing resources, and collecting the datasets is more complicated. 3) Hand pose estimation from a single RGB image: compared with the two approaches above, a single RGB image is easier to obtain and more practical, and single-RGB-image hand pose estimation is now receiving wide attention; however, estimating a three-dimensional hand pose from only a single RGB image is more challenging because the input lacks depth information.
Factors that affect hand pose estimation include the self-occlusion present in some gestures and the prediction ambiguity between some adjacent joints during 3D pose regression.
Disclosure of Invention
The invention provides an improved hand pose estimation method addressing three problems: gesture self-occlusion, prediction ambiguity between adjacent joints, and the lack of semantic information caused by conventional graph convolutions sharing one weight across all nodes. The method comprises: a dual-stream hand pose estimation scheme based on two topological structures, which addresses gesture self-occlusion; a multi-scale U-Net 3D pose regression scheme, which addresses the prediction ambiguity between adjacent joints during regression; and the first introduction of semantic graph convolution networks to hand pose estimation, so that the node weights of the joints differ, the semantic information of each joint is described more expressively, and the accuracy of both 2D pose estimation and 3D pose regression improves. The specific technical scheme is as follows:
step 1) extracting features from a single image and obtaining initial 2D pose coordinates of the hand joints;
step 2) performing 2D pose estimation with a dual-branch network to obtain accurate 2D pose coordinates of the hand joints, the dual-branch network having two branches of identical structure;
an N×F feature matrix is obtained in the first step, where N is the number of hand joints and F the feature dimension, from which a graph is built. In the second step, using the graph obtained in the first step, two graph structures are designed according to different connection relations of the hand, each represented by a different adjacency matrix, which yields the dual-branch network structure. Each branch feeds the feature matrix and its corresponding adjacency matrix into a 2D pose refinement network composed of semantic graph convolution layers, so that each branch produces one 2D pose of the hand.
step 3) estimating the 3D coordinates of the hand joints with a multi-scale semantic graph U-Net having two branches of identical structure; the input of each branch's multi-scale semantic graph U-Net is the 2D pose coordinates obtained by that branch in step 2) together with the corresponding adjacency matrix, and the output is the 3D pose of the hand joints; the 3D poses obtained by the two branches are then averaged, and the final 3D coordinates of the hand joints are output.
Advantageous effects
The invention provides a novel dual-stream, multi-scale network model based on two topological structures of the hand joints, solves the problems of self-occlusion and prediction ambiguity between adjacent joints with a single RGB image as input, and obtains high-precision three-dimensional coordinates of the hand joints from a single RGB image. Most existing methods use depth images to guide the training of the network model in order to obtain more accurate three-dimensional hand-joint coordinates. Compared with existing methods, the present method builds on two different topological structures of the hand, so the information between hand joints is better exploited, and a dual-stream multi-scale network model is designed around the different characteristics that the information of different hand joints may exhibit. High-precision hand pose estimation is finally achieved.
Drawings
FIG. 1 is a general schematic diagram of the hand pose estimation method;
FIG. 2 is a schematic diagram of the two topologies of the hand;
FIG. 3 is the multi-scale graph U-Net network model;
FIG. 4 is a schematic illustration of the hand bone losses.
Detailed Description
The invention consists of three parts: 1) feature extraction and 2D pose initialization, in which the image is fed into a network that extracts its features and produces an initial 2D pose of the hand joints; 2) 2D pose estimation, in which the features and initial 2D pose from part 1) yield two topological structures of the hand joints, each fed into a 2D pose refinement network to refine the 2D pose; 3) 3D pose regression, in which a multi-scale feature-fusion graph U-Net 3D regression model is designed, exploiting the fact that different hand joints may be characterized by features at different scales. The overall scheme of the hand pose estimation method is shown in FIG. 1, the two topological structures in FIG. 2, and the 3D regression model in FIG. 3.
1) Feature extraction and 2D pose initialization
First, a ResNet50 network extracts a 2048-dimensional feature vector from the input image; an additional fully connected layer maps this vector to initial 2D joint coordinates; the 2048-dimensional feature vector extracted by the ResNet50 network is then concatenated with the initial 2D joint coordinates to obtain an N×F feature matrix. Here N is the number of hand joints, N = 21 in the invention: 1 wrist joint, 5 metacarpophalangeal joints and 3 joints on each of the 5 fingers; F is the feature dimension, F = 2050 in the invention: the 2048-dimensional image feature plus the 2D joint coordinates x and y of each joint point.
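A minimal PyTorch sketch of this stage is given below for illustration only; it is not the patented implementation, and the module name `FeatureInit` and the way the `torchvision` ResNet50 head is removed are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureInit(nn.Module):
    """Stage 1 (sketch): ResNet50 feature + initial 2D pose, giving the
    N x F = 21 x 2050 per-joint feature matrix described above."""
    def __init__(self, num_joints=21, feat_dim=2048):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        self.init_2d = nn.Linear(feat_dim, num_joints * 2)  # extra FC layer
        self.num_joints = num_joints

    def forward(self, img):                                  # (B, 3, 224, 224)
        f = self.backbone(img).flatten(1)                    # (B, 2048)
        init2d = self.init_2d(f).view(-1, self.num_joints, 2)  # (B, 21, 2)
        # Concatenate the shared image feature onto each joint's 2D coords:
        f_joint = f.unsqueeze(1).expand(-1, self.num_joints, -1)
        return torch.cat([f_joint, init2d], dim=-1), init2d  # (B, 21, 2050)
```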
2) 2D pose estimation
From the graph obtained in the first step, two different graph structures of the hand are designed, and hence a dual-branch network. The two graph structures exploit different connection patterns of the hand so that their features complement each other in the subsequent network, yielding a more accurate hand pose. In the invention the two graph structures are represented by different adjacency matrices, and in each branch the N×F feature matrix from the first step and that branch's adjacency matrix A are fed into a 2D pose refinement network composed of 3 semantic graph convolution layers to obtain a 2D pose of the hand. Since previous hand pose estimation methods using graph convolutional networks do not express the correlations between hand joints well, the invention designs two different graph structures to express these relations better; they are shown in FIG. 2. The first graph structure, called the physical connection, links the physically adjacent joints of the hand; this is the topology commonly used in previous methods, it effectively exploits the relations between the joints along each finger, and it is therefore better suited to simpler gestures. The second graph structure, called the symmetric connection, links the corresponding joints of the different fingers; previous methods do not consider the interaction and spatial relations of corresponding joints across fingers. Because the corresponding joints of different fingers move very similarly during hand motion, there is a consistent relation between the joints of neighboring fingers, and this graph structure expresses that relation well, which effectively alleviates self-occlusion: for example, the hand self-occludes when grasping an object or making a fist, and with this connection pattern the occluded joints can be better estimated from the visible ones. This structure is therefore more advantageous for gestures with self-occluded hand joints. The two graph structures are represented by different adjacency matrices. The symmetric-connection adjacency matrix is constructed as follows: let G = {V, E} denote a graph, where V is the set of N hand joint points and E the set of edges; in the adjacency matrix A, $a_{ij} = 1$ when joint points i and j are connected and $a_{ij} = 0$ otherwise, with i and j indexing hand joints. Each finger has three joints: the third joint is at the fingertip, and the second and first joints are the two joints below it (for the middle finger, for instance, the first and second joints lie between its metacarpophalangeal joint and its fingertip joint). The first joints of adjacent fingers are connected to each other, as are the second joints and the third joints. The palm has 6 joints: one is the wrist joint, the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
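The construction of the two 21×21 adjacency matrices can be sketched as follows. The joint ordering (0 = wrist, then each finger's metacarpophalangeal joint followed by its first, second and third joints) and the self-loops are assumptions not fixed by the description.

```python
import numpy as np

# Joint ordering (an assumption): 0 = wrist, then for each of the 5 fingers
# its metacarpophalangeal (MCP) joint followed by joints 1, 2, 3.
N = 21
fingers = [list(range(1 + 4 * k, 5 + 4 * k)) for k in range(5)]  # [MCP, j1, j2, j3]

def physical_edges():
    """Physical connection: wrist-MCP links plus the chain along each finger."""
    edges = [(0, f[0]) for f in fingers]
    for f in fingers:
        edges += list(zip(f[:-1], f[1:]))
    return edges

def symmetric_edges():
    """Symmetric connection: wrist-MCP links, adjacent MCPs, and the
    same-level joints of adjacent fingers."""
    mcps = [f[0] for f in fingers]
    edges = [(0, m) for m in mcps] + list(zip(mcps[:-1], mcps[1:]))
    for level in (1, 2, 3):
        same = [f[level] for f in fingers]
        edges += list(zip(same[:-1], same[1:]))
    return edges

def adjacency(edges, n=N):
    A = np.eye(n)                       # self-loops (a common GCN convention)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

A_phys, A_sym = adjacency(physical_edges()), adjacency(symmetric_edges())
```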
Because the joints of a hand are correlated, their adjacency can be represented by an adjacency matrix, and semantic graph convolution describes the relations between joints well while extracting more expressive semantic information for the joint points, yielding better feature representations. A 2D pose refinement network composed of 3 semantic graph convolution layers is therefore adopted as the basic network model for 2D pose estimation.
The basic formula of conventional graph convolution is:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \hat{A}\right) \tag{1}$$

where $X^{(l)}$ is the input feature matrix of the l-th layer and $X^{(l+1)}$ the output feature matrix of the l-th layer; N denotes the number of joints and K the input feature dimension of the graph convolution network; $\hat{A}$ is the symmetric normalization of the adjacency matrix A; W is the weight matrix shared by all nodes; and $\sigma$ is the nonlinear activation.
As this shows, conventional graph convolution shares one weight matrix across all nodes. In reality each hand joint influences the other joints differently, so weight sharing is not appropriate; to describe the semantic information of each node better, semantic graph convolution is introduced to hand pose estimation for the first time, in the expectation of a better result. The semantic graph convolution formula is:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i(M \odot A)\right) \tag{2}$$

where $\rho_i$ is a Softmax nonlinearity used to normalize the matrix elements; $\odot$ denotes the element-wise (pixel-level) matrix operation: if element $a_{ij}$ of A is 1, the value of element $M_{ij}$ of the learnable matrix M is kept, otherwise the $\rho_i$ operation maps it to a value close to 0; A is the adjacency matrix of the nodes, representing the connection relations between them; $\sigma$ is the ReLU nonlinear activation; and W is a learnable weight matrix. $X^{(0)}$ is the input of the network; in the invention $X^{(0)}$ is the N×F feature matrix obtained in the first stage.

In the invention, accurate 2D pose estimation is achieved with 3 semantic graph convolution layers: the input is $X^{(0)}$, the N×F feature matrix, and the output is the refined 2D coordinates of the N joint points. The input and output dimensions of each layer are as follows, the arrows denoting semantic graph convolution operations: (21, 2050) → (21, 128) → (21, 16) → (21, 2), where the first number in each pair (21) is the number of hand joints and the second the feature dimension of each joint.
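A sketch of such a semantic graph convolution layer in the spirit of Eq. (2) is shown below, reusing `A_sym` from the adjacency sketch above. Masking with −∞ before the Softmax to realize $\rho_i(M \odot A)$, and leaving the last layer linear so that coordinates may be negative, are implementation assumptions, not the patented code.

```python
import torch
import torch.nn as nn

class SemGraphConv(nn.Module):
    """Sketch of Eq. (2): X' = sigma(W X rho_i(M ⊙ A)).
    M gives every node pair its own learnable weight, so nodes no longer
    share a single weight as in conventional graph convolution."""
    def __init__(self, in_dim, out_dim, adj, act=True):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # learnable W
        self.register_buffer("mask", adj > 0)             # fixed connectivity A
        self.M = nn.Parameter(torch.zeros_like(adj))      # learnable M
        self.act = act

    def forward(self, x):                                 # x: (B, N, in_dim)
        # rho_i(M ⊙ A): entries with a_ij = 0 are sent to -inf so the row
        # Softmax pushes them to ~0, keeping only connected joints.
        scores = self.M.masked_fill(~self.mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        out = attn @ self.W(x)                            # aggregate neighbours
        return torch.relu(out) if self.act else out       # sigma = ReLU

# The (21,2050) -> (21,128) -> (21,16) -> (21,2) refinement head.
adj = torch.tensor(A_sym, dtype=torch.float32)            # from the sketch above
refine = nn.Sequential(SemGraphConv(2050, 128, adj),
                       SemGraphConv(128, 16, adj),
                       SemGraphConv(16, 2, adj, act=False))
pose2d = refine(torch.randn(2, 21, 2050))                 # (2, 21, 2)
```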
3) Multi-scale 3D hand pose estimation based on two branches
Most of the literature adopts graph convolution modules and Non-local modules as the 3D pose regressor. The invention instead adopts a U-Net as the basic network model and designs a multi-scale feature fusion process, which captures the relations between points more accurately and better removes the ambiguity that arises when lifting the 2D pose to 3D. The network is shown in FIG. 3. The input is the 2D hand pose coordinates obtained in the second step; after downsampling, upsampling and multi-scale feature fusion, the three-dimensional coordinates of the hand joints are output through a semantic graph convolution layer.
Multi-scale features have long been used in computer vision, and fusing them has shown impressive performance. Because the features of the various hand joints are not necessarily concentrated in the last layer, and joints of different parts may be represented at different scales, the invention performs multi-scale feature fusion in the decoder of a U-Net and finally obtains the three-dimensional coordinates of the hand joints through a semantic graph convolution layer. The network comprises three stages: downsampling, upsampling and multi-scale feature fusion. It first captures global information about the hand joints during downsampling and then restores the resolution by upsampling. To preserve the low-level information, features from the downsampling stage are added to the upsampling branch through skip connections. Finally, the features obtained at multiple scales are combined to predict the three-dimensional coordinates of the hand joints. The specific process is as follows.
Downsampling stage: in most of the literature, graph pooling is performed with a sigmoid function; the sigmoid produces very small gradients during backpropagation, so the randomly initialized node selection is never updated over the whole training phase and the benefit of the pooling layer is lost. The invention instead uses a fully connected layer applied to the transpose of the feature matrix; the fully connected layer acts as a kernel over each feature and outputs the required number of nodes. The downsampling is computed as:

$$Y_0 = G_0(x) \tag{3}$$

$$P_i = \mathrm{Pooling}_i(Y_{i-1}), \quad i = 1, 2, \ldots, 5 \tag{4}$$

$$Y_i = G_i^{\mathrm{down}}(P_i) \tag{5}$$

where $x \in \mathbb{R}^{N \times l}$ is the input of the network, i.e. the 2D hand-joint pose coordinate matrix obtained in the second stage, N being the number of joints and l the feature dimension (here N = 21 and l = 2); i indexes the downsampling steps; $G_i^{\mathrm{down}}(\cdot)$ denotes the graph convolution of the downsampling stage, $Y_i$ the output of the graph convolution, $\mathrm{Pooling}_i(\cdot)$ the downsampling (i.e. pooling) computation, $P_i$ the output of the pooling layer, and $\mathrm{FC}(\cdot)$ the fully connected layer operation with which the pooling is implemented, $\mathrm{Pooling}_i(Y) = \mathrm{FC}(Y^{\top})^{\top}$.
Upsampling stage: the upsampling, i.e. unpooling, layer uses a transposed-convolution-style method: a fully connected layer is applied to the transposed feature matrix to obtain the required number of output nodes, and the matrix is then transposed back. The unpooling is computed as:

$$U_i = G_i^{\mathrm{up}}\left(\mathrm{Unpooling}_i(U_{i-1}) \oplus Y_j\right) \tag{6}$$

where $U_i$ is the output of each layer during the upsampling process, $\mathrm{Unpooling}_i(\cdot)$ the unpooling (i.e. upsampling) computation, $\oplus$ feature fusion, that is, feature concatenation with $Y_j$, the corresponding downsampling feature passed through the skip connection, and $G_i^{\mathrm{up}}(\cdot)$ the corresponding graph convolution operation.
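The transposed-FC pooling and unpooling of Eqs. (3)–(6) can be sketched as below; the node counts per scale are assumptions, since the patent does not list them.

```python
import torch
import torch.nn as nn

class GraphPool(nn.Module):
    """Graph pooling (Eq. (4)): a fully connected layer applied to the
    transposed feature matrix maps N_in nodes to N_out nodes."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)

    def forward(self, x):                                  # x: (B, N_in, C)
        return self.fc(x.transpose(1, 2)).transpose(1, 2)  # (B, N_out, C)

class GraphUnpool(GraphPool):
    """Unpooling (Eq. (6)): the same transposed-FC trick, with n_out > n_in."""
    pass

# One down/up step with a concatenation skip connection (node counts assumed).
pool, unpool = GraphPool(21, 10), GraphUnpool(10, 21)
y = torch.randn(2, 21, 128)                                # a down-branch feature Y_i
u = torch.cat([unpool(pool(y)), y], dim=-1)                # skip connection: (2, 21, 256)
```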
Multi-scale feature extraction stage: because a multi-scale strategy captures global context information better, in the upsampling stage the features at each scale are directly unpooled back to the 21 nodes, one graph convolution layer then reduces each node's features to 4 dimensions (21, 4), and after the features of all scales are concatenated, two graph convolution layers produce the three-dimensional coordinates of the hand joints. The computation is:

$$F_i = G_i^{\mathrm{ms}}\left(\mathrm{Unpooling}_i(U_i)\right) \tag{7}$$

$$\mathrm{out} = G\left(F_1 \oplus F_2 \oplus \cdots \oplus F_5\right) \tag{8}$$

where $G_i^{\mathrm{ms}}(\cdot)$ denotes the multi-scale graph convolution computation, $F_i$ the output of the graph convolution layer at each scale, $\mathrm{Unpooling}_i(\cdot)$ the unpooling (i.e. upsampling) computation, G the two output graph convolution layers, and $\mathrm{out}$ the multi-scale fused output.
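A sketch of the multi-scale head of Eqs. (7)–(8), reusing `SemGraphConv` and `GraphUnpool` from the sketches above; the number of scales and the channel widths are assumptions, not published values.

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Sketch of Eqs. (7)-(8): unpool each intermediate U-Net output straight
    back to the 21 joints, reduce it to 4 channels with one graph convolution,
    concatenate all scales, and regress 3D coordinates with two more graph
    convolutions. `scales` lists (num_nodes, channels) per scale."""
    def __init__(self, adj, scales=((21, 64), (14, 128), (7, 256)), n=21):
        super().__init__()
        self.unpools = nn.ModuleList(GraphUnpool(s, n) for s, _ in scales)
        self.reduce = nn.ModuleList(SemGraphConv(c, 4, adj) for _, c in scales)
        self.out = nn.Sequential(SemGraphConv(4 * len(scales), 16, adj),
                                 SemGraphConv(16, 3, adj, act=False))

    def forward(self, feats):            # feats: list of (B, N_i, C_i) tensors
        F_i = [red(up(f)) for f, up, red in zip(feats, self.unpools, self.reduce)]
        return self.out(torch.cat(F_i, dim=-1))            # (B, 21, 3)
```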
To train the network end-to-end more effectively, the following loss functions are used as the objective of the whole network.

In the 2D pose estimation and 2D pose initialization process, the loss used is:

$$L_{2D} = \sum_{j=1}^{N} \left\| H_j - \hat{H}_j \right\|_2^2 \tag{9}$$

where $H_j$ is the 2D pose label, i.e. the 2D coordinates provided with the public dataset, $\hat{H}_j$ the predicted 2D pose coordinates, and N the total number of hand joints (here N = 21).
3D pose estimation uses the pose loss $L_{pose}$, the bone length loss $L_{len}$ and the bone direction loss $L_{dir}$.

The pose loss $L_{pose}$ is:

$$L_{pose} = \sum_{j=1}^{N} \left\| Y_j - \hat{Y}_j \right\|_2^2 \tag{10}$$

where $Y_j$ is the 3D pose label, i.e. the 3D coordinates provided with the public dataset, and $\hat{Y}_j$ the predicted 3D pose coordinates.

The bone length loss $L_{len}$ and bone direction loss $L_{dir}$ are:

$$L_{len} = \sum_{(i,j) \in E} \left( \left\| b_{i,j} \right\|_2 - \left\| \hat{b}_{i,j} \right\|_2 \right)^2 \tag{11}$$

$$L_{dir} = \sum_{(i,j) \in E} \left\| \frac{b_{i,j}}{\left\| b_{i,j} \right\|_2} - \frac{\hat{b}_{i,j}}{\left\| \hat{b}_{i,j} \right\|_2} \right\|_2^2 \tag{12}$$

where $b_{i,j}$ is the skeletal (bone) vector between the i-th and j-th joints of the hand, namely:

$$b_{i,j} = Y_i - Y_j \tag{13}$$
The loss function $L_{3D}$ of the 3D hand pose estimation network is:

$$L_{3D} = \lambda_{pose} L_{pose} + \lambda_{len} L_{len} + \lambda_{dir} L_{dir} \tag{14}$$

where $\lambda_{pose}$, $\lambda_{len}$ and $\lambda_{dir}$ are the weight hyperparameters balancing the 3D pose loss, the bone length loss and the bone direction loss, set here to $\lambda_{pose} = 1$, $\lambda_{len} = 0.01$ and $\lambda_{dir} = 0.1$. A schematic of the bone losses is shown in FIG. 4.

The overall objective function L is:

$$L = \lambda_{init} L_{init} + \lambda_{2D} L_{2D} + \lambda_{3D} L_{3D} \tag{15}$$

where $\lambda_{init}$, $\lambda_{2D}$ and $\lambda_{3D}$ are the hyperparameters of the initial 2D pose loss, the 2D pose loss and the 3D pose loss respectively, set here to $\lambda_{init} = 0.01$, $\lambda_{2D} = 0.01$ and $\lambda_{3D} = 1$.
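The losses of Eqs. (9)–(15) translate directly to code. The sketch below assumes batched tensors and a bone list `bones` (e.g. the physical edges from the adjacency sketch above, as a LongTensor); the batch averaging is an assumption.

```python
import torch

def pose_loss(pred, gt):
    """Eqs. (9)/(10): summed squared L2 error over joints, averaged over batch."""
    return ((pred - gt) ** 2).sum(-1).sum(-1).mean()

def bone_vectors(joints, bones):
    """Eq. (13): b_ij = Y_i - Y_j for each bone (i, j). bones: (E, 2) LongTensor."""
    return joints[:, bones[:, 0]] - joints[:, bones[:, 1]]

def bone_losses(pred, gt, bones):
    """Eqs. (11)-(12): bone length and bone direction losses."""
    bp, bg = bone_vectors(pred, bones), bone_vectors(gt, bones)
    len_loss = ((bp.norm(dim=-1) - bg.norm(dim=-1)) ** 2).sum(-1).mean()
    unit = lambda b: b / b.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    dir_loss = ((unit(bp) - unit(bg)) ** 2).sum(-1).sum(-1).mean()
    return len_loss, dir_loss

def total_loss(init2d, pred2d, pred3d, gt2d, gt3d, bones):
    """Eqs. (14)-(15) with the weights stated in the description."""
    l_pose = pose_loss(pred3d, gt3d)
    l_len, l_dir = bone_losses(pred3d, gt3d, bones)
    l_3d = 1.0 * l_pose + 0.01 * l_len + 0.1 * l_dir       # Eq. (14)
    return (0.01 * pose_loss(init2d, gt2d)                 # lambda_init * L_init
            + 0.01 * pose_loss(pred2d, gt2d)               # lambda_2D  * L_2D
            + 1.0 * l_3d)                                  # lambda_3D  * L_3D
```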
The invention uses the public STB and FreiHAND datasets for training and testing the model, and the public ObMan dataset for pre-training the network model.

Datasets

FreiHAND dataset: this dataset consists of real images and contains samples both with and without object interaction. It was captured with a multi-view setup and contains 33,000 samples. Recordings of gestures involving 32 objects provide 3D annotations of 21 joint points. For the present invention the dataset was split into 80% for training and 20% for testing.

STB dataset: the STB dataset is a real dataset containing 18,000 images with ground-truth positions of 21 three-dimensional hand joints and corresponding depth images. In the experiments, 15,000 images were used for training and 3,000 for testing. For consistency with the ObMan dataset, the root joint of the STB dataset was moved from the palm center to the wrist.

ObMan dataset: the ObMan dataset is a large synthetic dataset for hand pose estimation. Its images were rendered using ShapeNet. ObMan contains 141,550 training samples, 6,463 validation samples and 6,285 test samples. Despite the large amount of annotated data, models trained on these synthetic images do not generalize well to real images. It is nevertheless helpful for training the present model: the model can be pre-trained on the massive ObMan dataset and the network then fine-tuned on real images.
Evaluation metric

The model is evaluated by the Euclidean distance between the 3D label coordinates and the estimated coordinates, the mean per-joint position error (MPJPE):

$$\mathrm{MPJPE} = \frac{1}{T} \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N} \left\| Y_i^{(t)} - \hat{Y}_i^{(t)} \right\|_2 \tag{16}$$

where T denotes the number of samples, N the number of joint points (here N = 21), $Y_i^{(t)}$ the label coordinates of the i-th joint of the t-th image, and $\hat{Y}_i^{(t)}$ the predicted coordinates of the i-th joint of the t-th image.
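Eq. (16) is a direct transcription in code:

```python
import torch

def mpjpe(pred, gt):
    """Eq. (16): mean Euclidean distance between predicted and ground-truth
    3D joints over all T samples and N joints. pred, gt: (T, N, 3), in mm."""
    return (pred - gt).norm(dim=-1).mean()
```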
Experiments and results
The experiments were trained and tested on an Ubuntu 16.04 computer with an Intel Core i5 CPU, 11 GB of RAM and an NVIDIA GTX 2080 Ti GPU. The framework used was PyTorch 1.3.1. The network model was first pre-trained on the synthetic ObMan dataset with an initial learning rate of 0.001, multiplying the learning rate by 0.1 every 500 epochs, for 5,000 epochs in total; the model was then trained end-to-end on the FreiHAND dataset for 500 epochs with an initial learning rate of 0.001, multiplying the learning rate by 0.1 every 100 epochs. All images were resized to 224 × 224 pixels and passed into the ResNet network.
Two experiments were designed. The first compares against the baseline of the invention: the baseline adopts the method of Bardia et al., extracting image features with a ResNet50 network and then performing 2D and 3D pose regression with adaptive graph convolutions and a graph U-Net; ablation experiments were performed on the dual-stream network and the multi-scale model of the invention. The second experiment compares against some of the latest RGB-image-based hand pose estimation methods. The experimental results are shown in the following tables:

Table 1  Comparison with existing methods on the STB dataset

Method                Mean joint error (mm)
Theodoridis et al.    6.93
Spurr et al.          8.56
Ge et al.             6.37
Yang et al.           7.05
The invention         5.972

Table 2  Comparison with existing methods on the FreiHAND dataset

Method                Mean joint error (mm)
Parelli M et al.      11
Doosti B et al.       8.887
Ge et al.             15.36
The invention         8.247

Table 3  Ablation experiments on the proposed method (mean joint error, mm)

Method                                                        STB     FreiHAND
Baseline                                                      8.573   8.887
Baseline + semantic graph convolution                         7.256   8.406
ResNet50 + dual-stream network                                8.044   8.637
ResNet50 + dual-stream network + semantic graph convolution   7.202   8.358
ResNet50 + multi-scale                                        7.657   8.559
ResNet50 + multi-scale + semantic graph convolution           6.655   8.435
ResNet50 + dual-stream + multi-scale                          6.085   8.401
ResNet50 + dual-stream + multi-scale + semantic graph conv.   5.972   8.247
In conclusion, the dual-stream, multi-scale network model proposed by the invention outperforms conventional hand pose estimation methods: it achieves higher-precision hand pose estimation with a single RGB image as input. The model uses two different graph structures of the hand, so it can better exploit the relations between neighboring hand joints, and the two structures have complementary structural characteristics; the multi-scale features make better use of the semantic information of the hand joints, reducing the ambiguity between adjacent joints during regression from a single RGB image; and the semantic graph convolution network introduced to hand pose estimation describes the semantic information of each joint more expressively. A more accurate 3D hand pose is thereby obtained.

Claims (7)

1. A dual-stream multi-scale hand pose estimation method based on a single RGB image, characterized by comprising the following steps:
step 1) extracting features from a single image and obtaining initial 2D pose coordinates of the hand joints;
step 2) performing 2D pose estimation with a dual-branch network to obtain accurate 2D pose coordinates of the hand joints, the dual-branch network having two branches of identical structure;
step 3) estimating the 3D coordinates of the hand joints with a multi-scale semantic graph U-Net having two branches of identical structure, wherein the input of each branch's multi-scale semantic graph U-Net is the 2D pose coordinates obtained by that branch in step 2) together with the corresponding adjacency matrix, and the output is the 3D pose of the hand joints; the 3D poses obtained by the two branches are then averaged, and the final 3D coordinates of the hand joints are output.
2. The method of claim 1, characterized in that step 1) is specifically:
the input single RGB image is encoded with a ResNet50 network, each input image yielding a 2048-dimensional feature vector;
an additional fully connected layer then generates the initial predicted two-dimensional coordinates of the hand joint points; the obtained feature vector is concatenated with each joint point's initial two-dimensional predicted coordinates, producing a graph with F features per node, i.e. an N×F feature matrix, where N is the number of hand joints and F the feature dimension.
3. The method of claim 1, characterized in that step 2) is specifically:
from the N×F feature matrix obtained in the first step, where N is the number of hand joints and F the feature dimension, two graph structures are built, each represented by an adjacency matrix: the first, called the physical connection, represents the relations between the physically adjacent joints of the hand; the second, called the symmetric connection, represents the relations between the corresponding joints of the different fingers;
the N×F feature matrix and the physical-connection adjacency matrix are fed into one branch of the dual-branch network, and the N×F feature matrix and the symmetric-connection adjacency matrix into the other branch, each branch consisting of 3 semantic graph convolution layers connected in series.
4. The method of claim 3, characterized in that the semantic graph convolution formula is:

$$X^{(l+1)} = \sigma\left(W X^{(l)} \rho_i(M \odot A)\right) \tag{2}$$

where $\rho_i$ is a Softmax nonlinearity used to normalize the matrix elements; $\odot$ denotes the element-wise (pixel-level) matrix operation: if element $a_{ij}$ of A is 1, the value of element $M_{ij}$ of the learnable matrix M is kept, otherwise the $\rho_i$ operation maps it to a value close to 0; A is the adjacency matrix of the nodes, representing the connection relations between them; $\sigma$ is the ReLU nonlinear activation; W is a learnable weight matrix; $X^{(0)}$ is the input of the network, i.e. the N×F feature matrix obtained in step 1); and the output is the accurately estimated 2D coordinates of the N joint points.
5. The method of claim 3, characterized in that the symmetric-connection adjacency matrix is constructed as follows: let G = {V, E} denote a graph, where V is the set of N hand joint points and E the set of edges; in the adjacency matrix A, $a_{ij} = 1$ when joint points i and j are connected and $a_{ij} = 0$ otherwise, i and j indexing hand joints; each finger has three joints, the third joint being at the fingertip and the second and first joints being the two joints below it; the corresponding joints of adjacent fingers are connected to each other; the palm has 6 joints, one of which is the wrist joint, the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
6. The method of claim 1, characterized in that each branch of the multi-scale semantic graph U-Net successively comprises three stages: downsampling, upsampling and multi-scale feature fusion,

the downsampling stage being computed as:

$$Y_0 = G_0(x) \tag{3}$$

$$P_i = \mathrm{Pooling}_i(Y_{i-1}), \quad i = 1, 2, \ldots, 5 \tag{4}$$

$$Y_i = G_i^{\mathrm{down}}(P_i) \tag{5}$$

where $x \in \mathbb{R}^{N \times l}$ is the input of the network, i.e. the 2D hand-joint pose coordinate matrix obtained in the second stage, N being the number of joints and l the feature dimension; i indexes the downsampling steps; $G_i^{\mathrm{down}}(\cdot)$ denotes the graph convolution of the downsampling stage, $Y_i$ the output of the graph convolution, $\mathrm{Pooling}_i(\cdot)$ the downsampling (i.e. pooling) computation, $P_i$ the output of the pooling layer, and $\mathrm{FC}(\cdot)$ the fully connected layer operation with which the pooling is implemented;

the upsampling (i.e. unpooling) stage using a transposed-convolution-style method: a fully connected layer is applied to the transposed feature matrix to obtain the required number of output nodes, after which the matrix is transposed back; the unpooling is computed as:

$$U_i = G_i^{\mathrm{up}}\left(\mathrm{Unpooling}_i(U_{i-1}) \oplus Y_j\right) \tag{6}$$

where $U_i$ is the output of each layer during the upsampling process, $\mathrm{Unpooling}_i(\cdot)$ the unpooling (i.e. upsampling) computation, $\oplus$ feature fusion, that is, feature concatenation with $Y_j$, the corresponding downsampling feature passed through the skip connection, and $G_i^{\mathrm{up}}(\cdot)$ the corresponding graph convolution operation;

the multi-scale feature extraction stage being computed as:

$$F_i = G_i^{\mathrm{ms}}\left(\mathrm{Unpooling}_i(U_i)\right) \tag{7}$$

$$\mathrm{out} = G\left(F_1 \oplus F_2 \oplus \cdots \oplus F_5\right) \tag{8}$$

where $G_i^{\mathrm{ms}}(\cdot)$ denotes the multi-scale graph convolution computation, $F_i$ the output of the graph convolution layer at each scale, $\mathrm{Unpooling}_i(\cdot)$ the unpooling (i.e. upsampling) computation, and $\mathrm{out}$ the multi-scale fused output.
7. The method of claim 1, characterized in that the overall objective function L is:

$$L = \lambda_{init} L_{init} + \lambda_{2D} L_{2D} + \lambda_{3D} L_{3D} \tag{15}$$

where $\lambda_{init}$, $\lambda_{2D}$ and $\lambda_{3D}$ are the hyperparameters of the initial 2D pose loss, the 2D pose loss and the 3D pose loss respectively; $L_{init}$ and $L_{2D}$, the losses used in 2D pose initialization and 2D pose estimation, are of the form:

$$L_{2D} = \sum_{j=1}^{N} \left\| H_j - \hat{H}_j \right\|_2^2 \tag{9}$$

where $H_j$ is the 2D pose label, i.e. the 2D coordinates provided with the public dataset, and $\hat{H}_j$ the predicted 2D pose coordinates;

$L_{3D}$ is the loss function of the 3D hand pose estimation network:

$$L_{3D} = \lambda_{pose} L_{pose} + \lambda_{len} L_{len} + \lambda_{dir} L_{dir} \tag{14}$$

where $\lambda_{pose}$, $\lambda_{len}$ and $\lambda_{dir}$ are the weight hyperparameters balancing the 3D pose loss, the bone length loss and the bone direction loss;

the pose loss $L_{pose}$ is:

$$L_{pose} = \sum_{j=1}^{N} \left\| Y_j - \hat{Y}_j \right\|_2^2 \tag{10}$$

where $Y_j$ is the 3D pose label, i.e. the 3D coordinates provided with the public dataset, and $\hat{Y}_j$ the predicted 3D pose coordinates;

the bone length loss $L_{len}$ and bone direction loss $L_{dir}$ are:

$$L_{len} = \sum_{(i,j) \in E} \left( \left\| b_{i,j} \right\|_2 - \left\| \hat{b}_{i,j} \right\|_2 \right)^2 \tag{11}$$

$$L_{dir} = \sum_{(i,j) \in E} \left\| \frac{b_{i,j}}{\left\| b_{i,j} \right\|_2} - \frac{\hat{b}_{i,j}}{\left\| \hat{b}_{i,j} \right\|_2} \right\|_2^2 \tag{12}$$

where $b_{i,j}$ is the skeletal (bone) vector between the i-th and j-th joints of the hand, namely:

$$b_{i,j} = Y_i - Y_j \tag{13}$$
CN202110273215.0A (priority and filing date 2021-03-11) — Dual-stream multi-scale hand pose estimation method based on a single RGB image — Active — granted as CN113052030B (en)

Priority Applications (1)

CN202110273215.0A — priority and filing date 2021-03-11 — Dual-stream multi-scale hand pose estimation method based on a single RGB image (granted as CN113052030B)

Publications (2)

CN113052030A — published 2021-06-29
CN113052030B — published 2024-09-24

Family ID: 76513189

Family Applications (1)

CN202110273215.0A (Active) — filed 2021-03-11 — Dual-stream multi-scale hand pose estimation method based on a single RGB image

Country Status (1)

CN: CN113052030B (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN113762177A * — priority 2021-09-13, published 2021-12-07 — Real-time human-body 3D pose estimation method and device, computer equipment and storage medium
CN116958958A * — priority 2023-07-31, published 2023-10-27 — Adaptive category-level object pose estimation method based on a graph-convolution dual-stream shape prior

Patent Citations (4)

* Cited by examiner, † Cited by third party

CN110175566A * — priority 2019-05-27, published 2019-08-27 — A hand pose estimation system and method based on an RGBD fusion network
CN110427877A * — priority 2019-08-01, published 2019-11-08 — A method of human-body three-dimensional pose estimation based on structural information
CN111428555A * — priority 2020-01-17, published 2020-07-17 — A joint-partitioned hand pose estimation method
CN112329525A * — priority 2020-09-27, published 2021-02-05 — Gesture recognition method and device based on a spatio-temporal graph convolutional neural network


Also Published As

Publication number: CN113052030B, published 2024-09-24


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant