CN113052030A - Double-current multi-scale hand posture estimation method based on single RGB image - Google Patents
- Publication number
- CN113052030A (application CN202110273215.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a dual-stream multi-scale hand pose estimation method based on a single RGB image, aimed at the problems of self-occlusion and prediction ambiguity between adjacent joints in a single RGB image. The method takes an RGB image as input, extracts features of the single image with a deep neural network, and obtains initial 2D pose coordinates of the hand joints; a two-branch network then performs 2D pose estimation, yielding two sets of 2D pose coordinates of the hand joints. From these two sets of 2D coordinates, a two-branch multi-scale semantic graph U-Net estimates two sets of 3D hand-joint coordinates, which are then added and averaged, and the final 3D coordinates of the hand joints are output. Built on two different topological structures of the hand, the method makes better use of the information shared among joints and finally achieves high-precision hand pose estimation.
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a hand pose estimation method for RGB images based on a dual-stream multi-scale network.
Background
In everyday human interaction, natural language, written language and body language are the three most important modes of expression. The first two are constrained by region, country, ethnicity and culture, whereas body language is flexible and expressive, intuitive, easy to understand and rarely ambiguous. Body language is therefore increasingly favoured by researchers in human-computer interaction. The human hand is one of the most important channels of body-language expression and can convey rich information, so enabling a computer to read the information conveyed by human hands is both valuable and necessary.
Gestures are a primary means by which humans transmit information to the outside world. Because of their flexibility, freedom, complexity and variability, hand movements carry a great deal of useful information, and the hand performs most of the work of communication and manipulation in daily life. Most machine operations are likewise carried out by hand. Whether for natural human-computer interaction or for transferring human manipulation skills to a robot, the pose of the human hand must first be estimated so that this pose information can be passed on to the robotic device.
Current hand pose estimation methods generally proceed in two stages: first the 2D pose of the hand is estimated from the input image, and then the 3D pose is regressed from the 2D pose. According to the type of input image, hand pose estimation can be roughly divided into three categories. 1) Estimation from depth images: depth-based methods have traditionally been the mainstream. A depth image contains explicit depth information, which helps recover the three-dimensional positions of the hand joints during 3D pose regression; however, present-day depth cameras have a very limited imaging range and insufficient quality, which strongly affects methods that depend on depth input, and in practice depth images are rarely available. 2) Estimation from multiple RGB images: multiple RGB images are easier to acquire than depth images, and images taken from different viewpoints contain rich 3D information, so some methods take multiple images as input to alleviate occlusion. Although such methods achieve higher accuracy and effectively address hand self-occlusion, they demand more training and testing resources, and suitable datasets are harder to collect. 3) Estimation from a single RGB image: compared with the two approaches above, a single RGB image is the easiest to obtain and the most practical, and single-RGB hand pose estimation is currently receiving wide attention; however, estimating a three-dimensional hand pose from only a single RGB image is considerably more challenging because the input lacks depth information.
Factors that hinder gesture pose estimation include self-occlusion in some hand configurations and prediction ambiguity between some adjacent joints during 3D pose regression.
Disclosure of Invention
The invention provides an improved hand pose estimation method addressing three problems: gesture self-occlusion, prediction ambiguity between adjacent joints, and the lack of semantic information caused by conventional graph convolutions sharing one weight across all nodes. Specifically: a dual-stream gesture pose estimation method based on two topological structures is proposed to mitigate gesture self-occlusion; a multi-scale U-Net 3D gesture regression method is proposed to reduce the prediction ambiguity of adjacent joints during regression; and semantic graph convolutional networks are introduced to hand pose estimation for the first time, giving each joint its own node weight, describing the semantic information of each joint more expressively, and improving the accuracy of both 2D pose estimation and 3D pose regression. The specific technical scheme is as follows:
Step 1) extract features from a single image and obtain initial 2D pose coordinates of the hand joints;
Step 2) perform 2D pose estimation with a two-branch network to obtain accurate 2D pose coordinates of the hand joints, the two branches having the same structure;
The first step yields an N × F feature matrix, where N is the number of hand joints and F the feature dimension, from which a graph can be built. In the second step, two graph structures are designed from this graph according to different connection relations of the hand, each represented by a different adjacency matrix, which gives the two-branch network structure. Each branch feeds the feature matrix and its corresponding adjacency matrix into a 2D pose refinement network composed of semantic graph convolution layers, so each branch produces one 2D hand pose.
Step 3) estimate the 3D coordinates of the hand joints with a multi-scale semantic graph U-Net, which likewise has two branches of identical structure; the input of each branch's U-Net is that branch's 2D pose coordinates from step 2) together with the corresponding adjacency matrix, and its output is a 3D hand pose. The 3D poses of the two branches are then added and averaged, and the final 3D coordinates of the hand joints are output.
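The add-and-average fusion that closes step 3) is simple enough to sketch directly. The function name and the (21, 3) array shapes below are illustrative assumptions, not the patent's own code:

```python
import numpy as np

def fuse_branches(pose3d_physical, pose3d_symmetric):
    """Average the (21, 3) 3D joint predictions of the two branches
    (physical-connection and symmetric-connection graphs) to obtain
    the final hand pose, as described in step 3)."""
    return 0.5 * (pose3d_physical + pose3d_symmetric)
```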
Advantageous effects
The invention proposes a novel dual-stream, multi-scale network model built on two topological structures of the hand joints. It addresses self-occlusion and adjacent-joint prediction ambiguity using only a single RGB image, and obtains high-precision three-dimensional hand-joint coordinates from that image. Most existing methods use depth images to guide the training of the network model in order to reach more accurate three-dimensional joint coordinates. In contrast, the present method exploits two different topologies of the hand to make better use of the information between hand joints, and designs a dual-stream multi-scale network model around the different characteristics that the features of different hand joints may exhibit, finally achieving high-precision hand pose estimation.
Drawings
FIG. 1 is a general schematic diagram of a hand pose estimation method;
FIG. 2 is a schematic diagram of two topologies of a hand;
FIG. 3 is a multi-scale graph U-net network model;
FIG. 4 is a schematic illustration of hand bone loss.
Detailed Description
The invention consists of three parts: 1) feature extraction and 2D pose initialization: the image is fed into a network that extracts its features and produces an initial 2D pose of the hand joints; 2) 2D pose estimation: using the features and initial 2D pose from part 1), two topological structures of the hand joints are built and fed into separate 2D pose refinement networks to refine the 2D pose; 3) 3D pose regression: exploiting the fact that different hand joints respond to features at different scales, a multi-scale feature-fusion graph U-Net 3D regression model is designed. The overall scheme of the hand pose estimation method is shown in fig. 1, the two topological structures in fig. 2, and the 3D regression model in fig. 3.
1) Feature extraction and 2D pose initialization
First, a ResNet50 network extracts a 2048-dimensional feature vector from the input image, and an additional fully connected layer maps this vector to initial 2D joint coordinates. The 2048-dimensional feature vector extracted by the ResNet50 network and the initial 2D joint coordinates are then spliced into an N × F feature matrix, where N is the number of hand joints (N = 21 in the invention: 1 wrist joint, 5 metacarpophalangeal joints, and 3 joints on each of the 5 fingers) and F is the feature dimension (F = 2050 in the invention: the 2048-dimensional image feature plus the 2D coordinates x and y of each joint point).
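As a sketch of this splicing step, assuming (as is common in such two-stage pipelines, though the patent does not spell it out) that the single 2048-dimensional global feature is replicated for every joint:

```python
import numpy as np

def build_feature_matrix(img_feat, init_2d):
    """Splice the ResNet50 image feature with the initial 2D joints.
    img_feat: (2048,) global image feature vector.
    init_2d:  (21, 2) initial 2D joint coordinates.
    Returns the N x F matrix with N = 21 and F = 2048 + 2 = 2050."""
    tiled = np.tile(img_feat[None, :], (init_2d.shape[0], 1))
    return np.concatenate([tiled, init_2d], axis=1)
```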
2) 2D pose estimation
From the graph obtained in the first step, two different graph structures of the hand are designed, giving a two-branch network; by exploiting different connection patterns of the hand, the two graph structures complement each other in the subsequent network and lead to a more accurate hand pose. In the invention, the two graph structures are represented by different adjacency matrices; the feature matrix N × F from the first step and each branch's adjacency matrix A are fed into a 2D pose refinement network composed of 3 semantic graph convolution layers to obtain the 2D hand pose. Because earlier hand pose estimation methods using graph convolutional networks could not express the correlations between hand joints well, the invention designs two different graph structures to express these relations better. The two graph structures are shown in fig. 2. The first graph structure, called the physical connection, links the physically adjacent joints of the hand; it is the topology commonly used in prior methods. It exploits the relations among the joints along each finger and is therefore better suited to representing simpler gestures.
The second graph structure, called the symmetric connection, links the same-level joints of the different fingers. Earlier methods did not consider the interaction and spatial relations among the same-level joints of each finger; connecting them exploits those relations and effectively mitigates self-occlusion. Because the same-level joints of the fingers move very similarly during hand motion, adjacent joints remain correlated throughout the movement, and this graph structure expresses that correlation well. For example, the hand self-occludes when grasping an object or making a fist; with this connection pattern, occluded joints can be better estimated from the visible ones. This structure is therefore better suited to gestures in which hand joints occlude one another. The two graph structures are represented by different adjacency matrices.
The symmetric-connection adjacency matrix is constructed as follows. Let G = {V, E} denote a graph, where V is the set of N hand joint points and E the set of edges. In the adjacency matrix A, a_ij = 1 when two joint points are connected and a_ij = 0 otherwise, where i and j index two joints of the hand. Each finger has three joints: the third joint is at the fingertip, and the second and first joints are the two joints below it; for example, the first and second joints of the middle finger are, respectively, the joint just above the middle finger's metacarpophalangeal joint and the joint just below its third joint. The first joints of adjacent fingers are connected to each other, as are the second joints and the third joints. The palm has 6 joints, one of which is the wrist joint; the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
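The symmetric-connection adjacency matrix can be built mechanically from this description. The joint indexing below (0 = wrist, 1 to 5 = metacarpophalangeal joints, and 6 + 3f + level for the first/second/third joints of finger f) is a hypothetical convention, not one fixed by the patent:

```python
import numpy as np

def symmetric_adjacency(n=21):
    """Adjacency matrix A of the symmetric connection: a_ij = 1 when
    joints i and j are connected, 0 otherwise."""
    A = np.zeros((n, n), dtype=int)

    def connect(i, j):
        A[i, j] = A[j, i] = 1

    for mcp in range(1, 6):        # wrist (0) to each of the 5 MCP joints
        connect(0, mcp)
    for mcp in range(1, 5):        # adjacent MCP joints of the palm
        connect(mcp, mcp + 1)
    for f in range(4):             # adjacent fingers f and f + 1
        for level in range(3):     # same-level first/second/third joints
            connect(6 + 3 * f + level, 6 + 3 * (f + 1) + level)
    return A
```

Note that, unlike the physical connection, this graph contains no edges along a finger: its 21 edges are the 5 wrist links, 4 palm links, and 12 same-level links between neighbouring fingers.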
Because the joints of a hand are correlated, their adjacency relations can be represented with an adjacency matrix; semantic graph convolution both describes the relations among joints well and extracts richer semantic information of the joint points, yielding better feature representations. A 2D pose refinement network composed of 3 semantic graph convolution layers is therefore adopted as the basic network model for 2D pose estimation.
The basic formula of conventional graph convolution is as follows:
X^(l+1) = σ(W X^(l) Â) (1)
where X^(l) is the input feature matrix of the l-th layer and X^(l+1) the output feature matrix of the l-th layer; N is the number of joints, K the input feature dimension of the graph convolution network, W a learnable weight matrix, and Â the symmetrically normalized form of the adjacency matrix A.
As can be seen, conventional graph convolution shares one weight across all nodes, yet each joint of the hand in fact influences the other joints differently, so weight sharing is not appropriate. To describe the semantic information of each node better, semantic graph convolution is introduced to hand pose estimation for the first time, in the expectation of a better result. The semantic graph convolution formula is as follows:
X^(l+1) = σ(W X^(l) ρ_i(M ⊙ A)) (2)
where ρ_i is a Softmax nonlinear transformation that normalizes the matrix elements, and ⊙ denotes the element-wise product: when a_ij = 1 the value of the learnable element M_ij in the matrix M is retained, and otherwise the ρ_i operation drives the corresponding entry toward 0. A is the adjacency matrix of the nodes and encodes their connection relations; σ is the ReLU nonlinear activation function; W is a learnable weight matrix. X^(0), the input of the network, is in the invention the N × F feature matrix obtained in the first stage.
In the invention, accurate 2D pose estimation is realized by 3 semantic graph convolution layers; the input is X^(0), the N × F feature matrix, and the output is the accurately estimated 2D coordinates of the N joint points. The input and output dimensions of each layer are as follows, with arrows denoting the semantic graph convolution operation: (21, 2050) → (21, 128) → (21, 16) → (21, 2). The first number in each bracket, 21, is the number of hand joints, and the second is the feature dimension of each joint.
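A minimal NumPy sketch of one semantic graph convolution layer from equation (2) follows. Implementing ρ_i(M ⊙ A) as a row-wise Softmax over M with non-edges masked to −∞ (so they normalize to approximately 0) matches the explanation above, but the exact masking used in the patent is an assumption:

```python
import numpy as np

def row_softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sem_gconv(X, A, M, W):
    """One semantic graph convolution: sigma(W X rho_i(M (.) A)).
    X: (K, N) features; A: (N, N) adjacency (assumed to include
    self-loops so every row has at least one edge); M: (N, N)
    learnable mask; W: (K_out, K) learnable weights."""
    masked = np.where(A > 0, M, -np.inf)  # keep M_ij only where a_ij = 1
    P = row_softmax(masked)               # rho_i: each row sums to 1
    return np.maximum(0.0, W @ X @ P)     # ReLU activation
```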
3) Multi-scale 3D hand pose estimation based on dual branches
Most of the literature uses a graph convolution module plus a non-local module as the 3D pose regressor. The invention instead adopts a U-Net as the basic network model and designs a multi-scale feature fusion process, which captures the relations among points more accurately and better removes the ambiguity that arises in lifting a 2D pose to 3D. The network is shown in fig. 3. The input is the 2D hand pose coordinates obtained in the second step; after downsampling, upsampling and multi-scale feature fusion, a semantic graph convolution layer outputs the three-dimensional coordinates of the hand joints.
Multi-scale features are widely used in computer vision, and their fusion has shown impressive performance. Since the features of each hand joint are not necessarily concentrated in the last feature layer, and joints of different parts may be distributed over features of different scales, the invention performs multi-scale feature fusion in the decoder of the U-Net and finally obtains the three-dimensional hand-joint coordinates with a semantic graph convolution layer. The network comprises three stages: downsampling, upsampling and multi-scale feature fusion. The network first captures the global information of the hand joints during downsampling, then restores the resolution by upsampling. To preserve low-level information, the features of the downsampling stage are added to the upsampling branch through skip connections. Finally, the features obtained at the multiple scales are combined to predict the three-dimensional coordinates of the hand joints. The specific process is as follows.
Downsampling stage: most of the literature performs graph pooling with a sigmoid function, but the sigmoid produces very small gradients during backpropagation, so the network never updates the randomly initialized node-selection weights throughout training and the benefit of the pooling layer is lost. The invention instead applies a fully connected layer to the transpose of the feature matrix; the fully connected layer acts as a kernel over the features and outputs the required number of nodes. The downsampling is computed as follows:
Y_0 = G_0(X) (3)
P_i = Pooling_i(Y_(i-1)) = FC(Y_(i-1)^T)^T, Y_i = G_i(P_i), i = 1, 2, …, 5 (4)
where X is the input of the network, the 2D hand-joint pose coordinate matrix obtained in the second stage, with N the number of joints and l the feature dimension (here N = 21 and l = 2); i indexes the downsampling steps; G_i(·) is the downsampling graph convolution operation and Y_i its output; Pooling_i(·) is the downsampling, i.e. pooling, computation and P_i the output of the pooling layer; FC(·) denotes the fully connected layer operation.
Upsampling stage: the unpooling layer in the invention uses a transposed-convolution-style method: a fully connected layer is applied to the transposed feature matrix to obtain the required number of output nodes, and the matrix is then transposed back. The unpooling is computed as follows:
U_i = G_i(Unpooling_i(U_(i-1)) ⊕ Y_(5-i)), i = 1, 2, …, 5, with U_0 = Y_5
where U_i is the output of each layer in the upsampling process, Unpooling_i(·) the unpooling computation (i.e. upsampling), ⊕ the feature fusion, that is, concatenation with the skip-connected features of the downsampling stage, and G_i(·) the graph convolution operation.
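The fully-connected pooling and unpooling can be sketched as one operation applied to the transposed feature matrix. The weights below are random stand-ins for the learned kernels, and the 21 → 10 → 21 node counts are illustrative, not taken from the patent:

```python
import numpy as np

def fc_resample(X, W):
    """Graph pooling or unpooling via a fully connected layer on the
    transposed feature matrix: (N_in, F) -> (N_out, F)."""
    return (X.T @ W).T  # FC acts across nodes, then transpose back

rng = np.random.default_rng(0)
X = rng.standard_normal((21, 2))        # stage-two 2D pose matrix
W_down = rng.standard_normal((21, 10))  # pooling kernel: 21 -> 10 nodes
W_up = rng.standard_normal((10, 21))    # unpooling kernel: 10 -> 21 nodes
pooled = fc_resample(X, W_down)
restored = fc_resample(pooled, W_up)
```

Because the fully connected layer is an ordinary learnable matrix, its gradients do not vanish the way a sigmoid-gated node selection can, which is the design motivation stated above.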
Multi-scale feature extraction stage: because a multi-scale strategy captures global context better, in the upsampling stage the nodes at each scale are directly upsampled back to 21 nodes, a graph convolution layer then reduces each node's features to 4 dimensions (21, 4), and after the features of all scales are concatenated, two further graph convolution layers produce the three-dimensional hand-joint coordinates. The computation is as follows:
F_i = G_i^ms(Unpool_i(U_i)), out = F_1 ⊕ F_2 ⊕ … ⊕ F_5
where G_i^ms(·) is the multi-scale graph convolution computation, F_i the output of the graph convolution layer at each scale, Unpool_i(·) the unpooling computation (i.e. upsampling), and out the multi-scale fused output.
To train the network better end-to-end, the following loss functions form the objective of the whole network.
In the 2D pose initialization and 2D pose estimation processes, the loss used is:
L_2D = Σ_(j=1)^N ‖H_j − Ĥ_j‖²
where H_j is the 2D pose label, i.e. the 2D coordinates provided by the public dataset, Ĥ_j the predicted 2D pose coordinates, and N the total number of hand joints (here N = 21).
The 3D pose estimation uses a pose loss L_pose, a bone-length loss L_len and a bone-orientation loss L_dir.
The pose loss L_pose is:
L_pose = Σ_(j=1)^N ‖P_j − P̂_j‖²
where P_j is the 3D pose label, i.e. the 3D coordinates provided by the public dataset, and P̂_j the predicted 3D pose coordinates.
The bone-length loss L_len and bone-orientation loss L_dir are:
L_len = Σ_((i,j)∈E) | ‖b_(i,j)‖ − ‖b̂_(i,j)‖ |
L_dir = Σ_((i,j)∈E) ‖ b_(i,j)/‖b_(i,j)‖ − b̂_(i,j)/‖b̂_(i,j)‖ ‖
where b_(i,j) is the bone vector between the i-th and j-th joints of the hand, namely:
b_(i,j) = P_j − P_i
The loss function L_3D of the 3D hand pose estimation network is:
L_3D = λ_pose L_pose + λ_len L_len + λ_dir L_dir (14)
where λ_pose, λ_len and λ_dir are the weighting hyperparameters among the 3D pose loss, the bone-length loss and the bone-orientation loss, set here to λ_pose = 1, λ_len = 0.01 and λ_dir = 0.1. A schematic of the bone losses is shown in fig. 4.
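The 3D loss of equation (14) can be sketched as follows. The edge list and the exact reductions (plain sums, squared L2 for the pose term) are assumptions consistent with the formulas above, not the patent's code:

```python
import numpy as np

def bone_losses(P, P_hat, edges, eps=1e-8):
    """Bone-length and bone-orientation losses over an edge list of
    (i, j) joint pairs, with bone vector b_ij = P[j] - P[i]."""
    L_len = L_dir = 0.0
    for i, j in edges:
        b, bh = P[j] - P[i], P_hat[j] - P_hat[i]
        nb, nbh = np.linalg.norm(b), np.linalg.norm(bh)
        L_len += abs(nb - nbh)                         # length term
        L_dir += np.linalg.norm(b / (nb + eps) - bh / (nbh + eps))
    return L_len, L_dir

def loss_3d(P, P_hat, edges, lam_pose=1.0, lam_len=0.01, lam_dir=0.1):
    """L_3D = lam_pose*L_pose + lam_len*L_len + lam_dir*L_dir."""
    L_pose = np.sum((P - P_hat) ** 2)                  # squared L2 pose loss
    L_len, L_dir = bone_losses(P, P_hat, edges)
    return lam_pose * L_pose + lam_len * L_len + lam_dir * L_dir
```

A useful sanity check: a rigid translation of the whole prediction leaves both bone terms at zero while the pose term grows, so the bone losses specifically penalize skeletal distortion rather than global offset.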
The overall objective function L is:
L = λ_init L_init + λ_2D L_2D + λ_3D L_3D (15)
where λ_init, λ_2D and λ_3D are the hyperparameters of the initial 2D pose loss, the 2D pose loss and the 3D pose loss, set here to λ_init = 0.01, λ_2D = 0.01 and λ_3D = 1.
The invention uses the public datasets STB and FreiHAND as training and test datasets for validating the model, and the public ObMan dataset for pre-training the network model.
Data set
FreiHAND dataset: this dataset consists of real images and contains samples both with and without object interaction. It was captured with a multi-view setup and contains 33,000 samples; recordings of gestures on 32 objects provide 3D annotations of 21 joint points. In the invention, the dataset is split 80% for training and 20% for testing.
STB dataset: the STB dataset is a real dataset containing 18,000 images with ground-truth positions of 21 three-dimensional hand joints and the corresponding depth images. In the experiments, 15,000 images are used as training samples and 3,000 as test samples. For consistency with the ObMan dataset, the root joint of the STB dataset is moved from the palm centre to the wrist.
ObMan dataset: ObMan is a large synthetic dataset for hand pose estimation. Its images were obtained by ShapeNet-based rendering. ObMan contains 141,550 training samples, 6,463 validation samples and 6,285 test samples. Despite the large amount of annotated data, models trained on these synthetic images do not generalize well to real images. It nevertheless benefits the training of the present model: the network can be pre-trained on the massive ObMan dataset and then fine-tuned on real images.
Evaluation index
The invention evaluates the model by the Euclidean distance between the 3D label coordinates and the estimated coordinates, computed as:
MPJPE = (1/(T·N)) Σ_(t=1)^T Σ_(i=1)^N ‖ J_i^(t) − Ĵ_i^(t) ‖
where MPJPE is the mean per-joint position error, T the number of samples, N the number of joint points (here N = 21), J_i^(t) the label coordinates of the i-th joint of the t-th image, and Ĵ_i^(t) the predicted coordinates of the i-th joint of the t-th image.
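The MPJPE evaluation reduces to a one-liner; the (T, N, 3) array layout is an assumption:

```python
import numpy as np

def mpjpe(labels, preds):
    """Mean per-joint position error: Euclidean distance between 3D
    label and predicted coordinates, averaged over all N joints of
    all T samples. labels, preds: (T, N, 3) arrays in millimetres."""
    return np.linalg.norm(labels - preds, axis=-1).mean()
```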
Experiment and results
The experiments of the invention were trained and tested on an Ubuntu 16.04 machine with an Intel Core i5 CPU, 11 GB of RAM and an NVIDIA GTX2080TI GPU, using PyTorch 1.3.1. The network model is first pre-trained on the synthetic ObMan dataset with an initial learning rate of 0.001, multiplied by 0.1 every 500 epochs, for 5,000 epochs in total. The model is then trained end-to-end on the FreiHAND dataset for 500 epochs with an initial learning rate of 0.001, multiplied by 0.1 every 100 epochs. All images are resized to 224 × 224 pixels before being passed into the ResNet network.
Two experiments were designed. The first compares against the baseline of the invention: the baseline adopts the method of Bardia et al., extracting image features with a ResNet-50 network and then performing 2D and 3D posture regression with an adaptive graph convolution and a graph U-Net network; ablation experiments were carried out separately on the dual-stream network and the multi-scale model of the invention. The second experiment compares against recent research methods for hand posture estimation from RGB images. The experimental results are shown in the following tables:
Table 1 Comparison with existing methods on the STB dataset
Method | Mean joint error (unit: mm) |
---|---|
Theodoridis et al. | 6.93 |
Spurr et al. | 8.56 |
Ge et al. | 6.37 |
Yang et al. | 7.05 |
The invention | 5.972 |
Table 2 Comparison with existing methods on the FreiHAND dataset
Method | Mean joint error (unit: mm) |
---|---|
Parelli M et al. | 11 |
Doosti B et al. | 8.887 |
Ge et al. | 15.36 |
The invention | 8.247 |
Table 3 Ablation experiments of the proposed method (mean joint error, unit: mm)
Method | STB dataset | FreiHAND dataset |
---|---|---|
Baseline | 8.573 | 8.887 |
Baseline + semantic graph convolution | 7.256 | 8.406 |
ResNet-50 + dual-stream network | 8.044 | 8.637 |
ResNet-50 + dual-stream network + semantic graph convolution | 7.202 | 8.358 |
ResNet-50 + multi-scale | 7.657 | 8.559 |
ResNet-50 + multi-scale + semantic graph convolution | 6.655 | 8.435 |
ResNet-50 + dual-stream + multi-scale | 6.085 | 8.401 |
ResNet-50 + dual-stream + multi-scale + semantic graph convolution | 5.972 | 8.247 |
In conclusion, the dual-stream, multi-scale network model of the invention outperforms traditional hand posture estimation methods: it achieves higher-precision hand posture estimation with only a single RGB image as input. The model uses two different graph structures of the hand, so it can better exploit the features shared between adjacent hand joints, and the two structures have complementary structural characteristics. The multi-scale features make better use of the semantic information of the hand joints, reducing the ambiguity between adjacent joints when regressing from a single RGB image; and introducing a semantic graph convolution network into hand posture estimation describes the semantic information of each joint more expressively. A more accurate 3D hand posture is thereby obtained.
Claims (7)
1. A dual-stream multi-scale hand posture estimation method based on a single RGB image, characterized by comprising the following steps:
step 1) extracting the features of a single image and obtaining initial 2D posture coordinates of the hand joints;
step 2) performing 2D posture estimation by using a double-branch network to obtain accurate 2D posture coordinates of hand joints, wherein the double-branch network has two branches with the same structure;
step 3) estimating the 3D coordinates of the hand joints using a multi-scale semantic graph U-Net network, wherein the multi-scale semantic graph U-Net network has two branches with the same structure; the input of each branch is the 2D posture coordinates obtained by that branch in step 2) together with the corresponding adjacency matrix, and the output is the 3D posture of the hand joints; the 3D postures obtained by the two branches are then averaged, and the final 3D coordinates of the hand joints are output.
2. The method of claim 1, wherein the method comprises the following steps: the step 1) is specifically as follows,
encoding the input single RGB image with a Resnet50 network, each input image producing a 2048-dimensional feature vector;
then generating initial predicted two-dimensional coordinates of the hand joint points with an additional fully connected layer, and splicing the obtained feature vector with the initial two-dimensional predicted coordinates of each joint point to generate a graph in which every node has F features, i.e. an N × F feature matrix, where N denotes the number of hand joints and F denotes the feature dimension.
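As a sketch of this step, assuming the feature dimension works out to F = 2048 + 2 = 2050 (the patent does not state F explicitly, so this reading is an assumption), the node-feature matrix could be assembled as:

```python
import numpy as np

N, D = 21, 2048  # 21 hand joints; ResNet-50 global feature dimension

feat_2048 = np.random.randn(D)   # image feature vector from the encoder
init_2d = np.random.randn(N, 2)  # initial 2D coordinates from the FC head

# Broadcast the global feature to every joint and append that joint's
# initial 2D coordinates, giving an N x F node-feature matrix, F = D + 2.
node_feats = np.concatenate([np.tile(feat_2048, (N, 1)), init_2d], axis=1)
print(node_feats.shape)  # (21, 2050)
```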
3. The method of claim 1, wherein the method comprises the following steps: the step 2) specifically comprises the following steps of,
obtaining the N × F feature matrix from step 1), where N denotes the number of hand joints and F the feature dimension, and constructing two graph structures, each represented by an adjacency matrix: the first, called the physical connection, represents the relations between the physical joints of the hand; the second, called the symmetric connection, represents the relations between the same joints of each finger;
inputting the N × F feature matrix and the physical-connection adjacency matrix into one branch of the dual-branch network, and the N × F feature matrix and the symmetric-connection adjacency matrix into the other branch, each branch consisting of 3 semantic graph convolution layers connected in series.
4. The method of claim 3, wherein the method comprises the following steps: the semantic graph convolution formula is as follows:
X^{(l+1)} = σ(W X^{(l)} ρ_i(M ⊙ A))  (2)
where ρ_i is a Softmax nonlinear transformation used to normalize the matrix elements; ⊙ denotes the element-wise (Hadamard) product: if element a_{ij} of matrix A is 1, the value of element m_{ij} of matrix M is retained, otherwise the ρ_i operation drives it to approximately 0; matrix A is the adjacency matrix of the nodes and represents the connection relations among them; σ denotes the ReLU nonlinear activation function; W is a learnable weight matrix; X^{(0)} is the input to the network, i.e. the N × F feature matrix obtained in step 1); and the output is the accurately estimated 2D coordinates of the N joint points.
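A minimal NumPy sketch of one such layer, following Eq. (2) literally (features stored as a D × N matrix; implementing ρ_i as a masked Softmax that suppresses non-neighbors is an assumption about the intended mechanism):

```python
import numpy as np

def rho(M, A):
    """Masked Softmax: normalize the learnable weights M over each
    node's neighbourhood; entries where A == 0 are pushed to ~0."""
    logits = np.where(A > 0, M, -np.inf)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sem_graph_conv(X, A, M, W):
    """One semantic graph convolution layer per Eq. (2):
    X_next = ReLU(W @ X @ rho(M, A)).
    X: D x N node features, A: N x N adjacency (with self-loops),
    M: N x N learnable mask, W: D_out x D learnable weights."""
    return np.maximum(W @ X @ rho(M, A), 0.0)
```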
5. The method of claim 3, wherein the method comprises the following steps: the symmetric-connection adjacency matrix is constructed as follows: let G = {V, E} denote a graph, where V is the set of N hand joint points and E denotes the edges; in the adjacency matrix A, a_{ij} = 1 when two joint points are connected and a_{ij} = 0 otherwise, with i and j indexing two hand joints; each finger has three joints, the fingertip carrying the third joint and the two joints below it being the second and first joints; the same joints of adjacent fingers are connected to each other; the palm has 6 joints, one of which is the wrist joint, the wrist joint is connected to the other 5 metacarpophalangeal joints, and adjacent metacarpophalangeal joints are connected to each other.
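To illustrate, the two adjacency matrices might be built as follows; the concrete joint indexing (0 = wrist, then MCP plus three finger joints per finger) and the inclusion of the MCP joints in the symmetric graph are assumptions, since the claim fixes only the connection rules:

```python
import numpy as np

# An assumed 21-joint indexing: 0 = wrist; each finger contributes
# MCP, first, second, third (tip) joints, giving joints 1..20 for
# thumb, index, middle, ring and pinky in order.
N = 21
fingers = [list(range(1 + 4 * f, 5 + 4 * f)) for f in range(5)]

def connect(A, i, j):
    A[i, j] = A[j, i] = 1  # undirected edge

# Physical connections: wrist to each MCP, the chain along each
# finger, and adjacent MCP joints connected to each other.
A_phys = np.zeros((N, N))
for chain in fingers:
    connect(A_phys, 0, chain[0])
    for a, b in zip(chain, chain[1:]):
        connect(A_phys, a, b)
for f in range(4):
    connect(A_phys, fingers[f][0], fingers[f + 1][0])

# Symmetric connections: the same joint of adjacent fingers.
A_symm = np.zeros((N, N))
for f in range(4):
    for k in range(4):
        connect(A_symm, fingers[f][k], fingers[f + 1][k])
```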
6. The method of claim 1, wherein the method comprises the following steps: the branch structure in the multi-scale semantic graph U-Net network sequentially comprises three stages: down-sampling, up-sampling, multi-scale feature fusion,
wherein the calculation process of the down-sampling stage is as follows:
Y_0 = G_0(X)  (3)
P_i = Pooling_i(Y_{i−1}),  i = 1, 2, …, 5  (4)
where X ∈ R^{N×l} is the input to the network, i.e. the 2D posture coordinate matrix of the hand joints obtained in step 2), N being the number of joints and l the feature dimension; i denotes the down-sampling index; G_i(·) denotes the graph convolution operation of the down-sampling path; Y_i denotes the output of the graph convolution; Pooling_i(·) denotes the down-sampling, i.e. pooling, calculation; P_i denotes the output of the pooling layer; and FC(·) denotes the fully connected layer operation;
an up-sampling stage: up-sampling uses a transposed operation, i.e. an inverse pooling layer: a fully connected layer is applied to the transpose of the feature matrix to obtain the required number of output nodes, and the matrix is then transposed back; the calculation process of the inverse pooling is as follows:
where U_i denotes the output of each layer in the up-sampling process, Unpooling_i(·) denotes the inverse pooling calculation, i.e. up-sampling, ⊕ denotes feature fusion, that is, feature splicing, and G_i(·) denotes the corresponding graph convolution operation;
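The pooling/unpooling trick described here, applying a fully connected (linear) map to the transposed feature matrix to change the number of nodes and transposing back, can be sketched as follows (the weight shapes and node counts are illustrative assumptions):

```python
import numpy as np

def graph_pool(X, W):
    """Pool an N x F node-feature matrix down to M nodes (M < N):
    apply a learned linear map to the transposed features,
    (F x N) @ (N x M) -> F x M, then transpose back to M x F."""
    return (X.T @ W).T

def graph_unpool(X, W):
    """Inverse pooling: the same trick with an M x N weight matrix
    grows the node dimension back to N."""
    return (X.T @ W).T

X = np.random.randn(21, 64)       # 21 joints, 64 features per joint
W_down = np.random.randn(21, 10)  # learnable pooling weights
W_up = np.random.randn(10, 21)    # learnable unpooling weights
P = graph_pool(X, W_down)
U = graph_unpool(P, W_up)
print(P.shape, U.shape)  # (10, 64) (21, 64)
```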
in the multi-scale feature extraction stage, the calculation process is as follows:
7. The method of claim 1, wherein the method comprises the following steps: the overall objective function L is:
L = λ_init L_init + λ_2D L_2D + λ_3D L_3D  (15)
where λ_init, λ_2D and λ_3D are the weights of the initial 2D posture loss, the 2D posture loss and the 3D posture loss respectively; L_init and L_2D are the losses of the 2D posture initialization and 2D posture estimation processes, as follows:
where H_j denotes the label of the 2D posture, i.e. the 2D coordinates provided in the public dataset, and Ĥ_j denotes the predicted 2D posture coordinates;
L_3D is the loss function of the 3D hand posture estimation network:
L_3D = λ_pose L_pose + λ_len L_len + λ_dir L_dir  (14)
where λ_pose, λ_len and λ_dir are the weight hyperparameters of the 3D posture loss, the bone-length loss and the bone-direction loss respectively,
the posture loss L_pose is:
L_pose = Σ_{i=1}^{N} ‖ J_i − Ĵ_i ‖₂²
where J_i denotes the label of the 3D posture, i.e. the 3D coordinates provided in the public dataset, and Ĵ_i denotes the predicted 3D posture coordinates;
the bone-length loss L_len and the bone-direction loss L_dir are:
L_len = Σ_{(i,j)} | ‖b̂_{i,j}‖₂ − ‖b_{i,j}‖₂ |
L_dir = Σ_{(i,j)} ‖ b̂_{i,j} / ‖b̂_{i,j}‖₂ − b_{i,j} / ‖b_{i,j}‖₂ ‖₂
where b_{i,j} denotes the bone vector between the i-th and j-th joints of the hand, namely:
b_{i,j} = J_i − J_j
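A hedged sketch of the bone-vector and bone-loss computations (the reduction over edges and the exact norms are assumptions, since the patent's original formulas did not survive extraction):

```python
import numpy as np

def bone_vectors(J, edges):
    # b_{i,j} = J_i - J_j for each connected pair of joints (i, j)
    return np.stack([J[i] - J[j] for i, j in edges])

def bone_losses(J_pred, J_gt, edges, eps=1e-8):
    """Bone-length and bone-direction losses: compare the lengths and
    unit directions of predicted vs. ground-truth bone vectors."""
    b_p = bone_vectors(J_pred, edges)
    b_g = bone_vectors(J_gt, edges)
    len_p = np.linalg.norm(b_p, axis=1)
    len_g = np.linalg.norm(b_g, axis=1)
    L_len = np.abs(len_p - len_g).mean()
    L_dir = np.linalg.norm(b_p / (len_p[:, None] + eps)
                           - b_g / (len_g[:, None] + eps), axis=1).mean()
    return L_len, L_dir

# Identical predicted and ground-truth poses give zero bone losses.
J = np.random.randn(21, 3)
edges = [(0, 1), (1, 2), (2, 3)]
L_len, L_dir = bone_losses(J, J, edges)
```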
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273215.0A CN113052030B (en) | 2021-03-11 | 2021-03-11 | Double-flow multi-scale hand gesture estimation method based on single RGB image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052030A true CN113052030A (en) | 2021-06-29 |
CN113052030B CN113052030B (en) | 2024-09-24 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762177A (en) * | 2021-09-13 | 2021-12-07 | 成都市谛视科技有限公司 | Real-time human body 3D posture estimation method and device, computer equipment and storage medium |
CN116958958A (en) * | 2023-07-31 | 2023-10-27 | 中国科学技术大学 | Self-adaptive class-level object attitude estimation method based on graph convolution double-flow shape prior |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175566A (en) * | 2019-05-27 | 2019-08-27 | 大连理工大学 | A kind of hand gestures estimating system and method based on RGBD converged network |
CN110427877A (en) * | 2019-08-01 | 2019-11-08 | 大连海事大学 | A method of the human body three-dimensional posture estimation based on structural information |
CN111428555A (en) * | 2020-01-17 | 2020-07-17 | 大连理工大学 | Joint-divided hand posture estimation method |
CN112329525A (en) * | 2020-09-27 | 2021-02-05 | 中国科学院软件研究所 | Gesture recognition method and device based on space-time diagram convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||