CN115588237A - Three-dimensional hand posture estimation method based on monocular RGB image - Google Patents

Three-dimensional hand posture estimation method based on monocular RGB image

Info

Publication number
CN115588237A
Authority
CN
China
Prior art keywords
joint
joint point
features
feature
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211255461.4A
Other languages
Chinese (zh)
Inventor
叶中付
田瑞田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211255461.4A priority Critical patent/CN115588237A/en
Publication of CN115588237A publication Critical patent/CN115588237A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses a monocular RGB image-based three-dimensional hand posture estimation method and designs a feature enhancement scheme that explicitly introduces the intrinsic skeleton structure of the hand and adaptively enhances the features of each joint point to be estimated using association information, ultimately improving the accuracy of hand pose estimation. The designed method works as follows: first, a convolutional neural network extracts joint-level and bone-level semantic features from the input hand image, and a feature fusion module aggregates the two kinds of features across semantics; a feature adaptive enhancement module then uses the association information to adaptively enhance the relevant features of each joint point; the enhanced features pass through an output layer to produce a two-dimensional joint heatmap and a relative depth map, which a multi-stage optimization strategy continuously refines to estimate more accurate two-dimensional hand joint coordinates and relative depths; finally, the camera parameters are used to compute the final three-dimensional hand joint coordinates.

Description

Three-dimensional hand posture estimation method based on monocular RGB image
Technical Field
The invention relates to a method for estimating hand pose from monocular RGB (red, green, blue) images that exploits the inherent association information among hand joint points, and belongs to the field of image processing.
Background
Three-dimensional hand pose estimation methods can be divided, according to the data source, into depth-image-based and RGB-image-based approaches. Because RGB cameras are low-cost and low-power, RGB-based solutions are more popular in many vision applications than depth-based ones. Compared with a depth image, however, a monocular RGB image suffers from inherent drawbacks such as depth ambiguity and sensitivity to illumination, so estimating a three-dimensional hand pose from a monocular RGB image remains a challenging problem.
Three-dimensional pose estimation from monocular RGB images can generally be divided into heatmap-detection-based methods, two-dimensional-to-three-dimensional pose lifting methods, and parametric model fitting methods. Heatmap-detection-based methods differ from regression methods in that they do not directly predict the three-dimensional position of each keypoint; instead they predict a three-dimensional Gaussian distribution for each joint point in a heatmap representation, from which the three-dimensional keypoint coordinates can easily be recovered by post-processing. Because many robust two-dimensional pose estimation methods already exist, many researchers focus on lifting two-dimensional poses to three-dimensional poses: the pipeline first estimates an accurate two-dimensional hand pose from the monocular RGB image and then regresses the final three-dimensional coordinates with a low-complexity regression model. Parametric-model-based methods differ from both of the above in that they approximate the hand shape with a parametric model; most studies adopt MANO-based modeling, mapping pose and shape parameters to a triangular mesh that represents the hand surface. Because the parametric model embeds rich structural priors of the human hand, most of this work integrates the hand model as a differentiable layer of a neural network and automatically learns to regress the MANO parameters to estimate the final keypoints.
Although existing two-dimensional hand pose estimation methods achieve high accuracy, monocular RGB images lack inherent depth information and the back-projection of a two-dimensional pose to a three-dimensional pose is not unique. Previous work treats all joint points equally, ignoring the differences between joint points and the contextual relations among hand joints, even though the joint association relations can alleviate depth ambiguity to some extent and improve prediction accuracy. It is therefore necessary to study a more robust method for three-dimensional pose estimation from monocular RGB images in order to obtain accurate pose estimates.
Disclosure of Invention
The technical problem solved by the invention: to overcome the shortcomings of the prior art, a hand pose estimation method based on monocular RGB images is provided so as to obtain accurate hand joint point coordinates. The method converts the three-dimensional hand pose estimation task into two subtasks: estimating the two-dimensional positions of the joint points in the image plane and estimating the three-dimensional relative depths of the joint points in space. The designed network consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature enhancement module, and a multi-stage optimization strategy is adopted to improve the accuracy of the final prediction.
The purpose of the invention is realized by the following technical scheme:
The invention discloses a monocular RGB image-based three-dimensional hand pose estimation method, which comprises the following steps:
Step 1: constructing a hand three-dimensional pose estimation network model, wherein the network model consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature adaptive enhancement module;
Step 2: inputting a monocular RGB image frame centered on the hand; the visual feature extraction module generates a joint point localization information map and a bone association information map, and a pre-trained ResNet18 is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed and supervised with the joint point localization information map and the bone association information map to obtain joint-level features containing hand joint point localization information and bone-level features containing joint point association information;
Step 3: sending the obtained joint-level features and bone-level features to the semantic feature aggregation module, which adaptively fuses the captured bone-level and joint-level features across semantics to obtain aggregated semantic features containing both;
Step 4: sending the aggregated semantic features to the joint point feature adaptive enhancement module, which explicitly introduces the intrinsic skeleton structure of the hand and uses the association information to adaptively enhance the relevant features of each hand joint point to be detected, obtaining the enhanced features of each joint point;
Step 5: for the enhanced features of each joint point, obtaining a predicted two-dimensional heatmap and a predicted relative depth map of the joint point through an output layer; continuously refining the joint point two-dimensional heatmaps and relative depth maps with a multi-stage iterative optimization method; obtaining the plane coordinates and relative depth values of the joint points through a decoding function; and finally computing the three-dimensional coordinates of the hand joint points from the camera parameters, thereby completing the hand pose estimation.
Further, step 2 is specifically implemented as follows: first, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network; then ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed, supervised by the joint point localization information map and the bone association information map, and simultaneously outputs bone-level features F_b and joint-level features F_j to provide richer semantic information.
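For illustration, the following is a minimal NumPy sketch of how such Gaussian supervision maps could be generated. The map size, the Gaussian sigma, the function names joint_heatmap and bone_map, and the example coordinates are illustrative assumptions, not values fixed by this disclosure.

```python
import numpy as np

def joint_heatmap(uv, size=64, sigma=2.0):
    """Gaussian joint-localization map centred on the 2-D joint position uv = (u, v)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - uv[0]) ** 2 + (ys - uv[1]) ** 2) / (2.0 * sigma ** 2))

def bone_map(uv_a, uv_b, size=64, sigma=2.0, steps=32):
    """Bone-association map: a Gaussian ridge along the segment joining two connected joints."""
    m = np.zeros((size, size), dtype=np.float32)
    for t in np.linspace(0.0, 1.0, steps):
        p = (1.0 - t) * np.asarray(uv_a) + t * np.asarray(uv_b)
        m = np.maximum(m, joint_heatmap(p, size, sigma))
    return m

# Example: supervision maps for one joint and one bone (coordinates are illustrative).
H_joint = joint_heatmap((20.5, 31.0))          # joint-level target
H_bone = bone_map((20.5, 31.0), (40.0, 12.0))  # skeleton-level target
```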
Further, in step 3, the semantic feature aggregation module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain a fused feature, denoted F_agg, that contains both semantics. The process is as follows:
(31) First, the bone-level features F_b and joint-level features F_j output by the hourglass-based binary decoder are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone-level and joint-level features before the semantic aggregation module, W_b and W_j denote the learned weights of the bone-level and joint-level features, σ denotes the sigmoid activation function, and cat denotes the concatenation (splicing) operation;
(32) The bone-level feature branch and the joint-level feature branch of the hourglass-based binary decoder each perform a cross-semantic adaptive feature fusion operation with a residual connection structure: using the joint-level weight W_j and the bone-level weight W_b obtained in step (31), the cross-semantic fused features are computed [formula given only as an image in the original], where F̂_b and F̂_j denote the residual-connected bone-level and joint-level features and ⊙ denotes the dot (element-wise) product;
(33) Finally, the cross-semantically fused bone features F̂_b and joint-level features F̂_j are concatenated and fed into a 1×1 convolution to obtain the final aggregated semantic feature F_agg:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network and c denotes the feature dimension.
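The following PyTorch-style sketch illustrates one possible implementation of this semantic feature aggregation module. The class name SemanticAggregation and the channel count are illustrative, and because the cross-semantic fusion formula of step (32) appears only as an image in the original, the residual cross-gating form used here is an assumption.

```python
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    """Sketch of the cross-semantic aggregation of bone-level (F_b) and joint-level (F_j) features.
    The residual cross-gating in forward() is an assumption; the exact fusion formula is given
    only as an image in the original document."""
    def __init__(self, channels=256):
        super().__init__()
        # One 3x3 conv predicts both weight maps W_b and W_j from the concatenated features.
        self.weight_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.out_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)  # final 1x1 aggregation

    def forward(self, f_b, f_j):
        w = torch.sigmoid(self.weight_conv(torch.cat([f_b, f_j], dim=1)))
        w_b, w_j = torch.chunk(w, 2, dim=1)            # W_b, W_j
        f_b_hat = f_b + w_j * f_j                      # assumed residual cross-fusion
        f_j_hat = f_j + w_b * f_b
        return self.out_conv(torch.cat([f_b_hat, f_j_hat], dim=1))  # aggregated feature F_agg

# Usage with illustrative shapes:
# agg = SemanticAggregation(256)
# f_agg = agg(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))  # -> (1, 256, 32, 32)
```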
Further, the joint point feature adaptive enhancement module of step 4 is implemented as follows:
First, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points and N denotes the number of joint points defined as having association information with each joint point. The semantic feature F_agg aggregated by the semantic aggregation module is then divided into J groups along the feature dimension, ensuring that each different joint point is assigned a unique feature f_j, j = 0, …, J-1, where "unique" means that the feature assigned to each joint point differs among all J joint points, C denotes the total feature dimension of all joint points, and c denotes the feature dimension of each joint point.
The formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i, i = 0, …, N-1 denote the learned association information between the joint point to be estimated and its associated joint points, w_j^i, i = 0, …, N-1 denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
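A PyTorch-style sketch of this adaptive enhancement module is given below. The 1×1-convolution association descriptor follows the formula above; the class name JointFeatureEnhancement, the shared convolution parameters, the sigmoid weighting, the residual aggregation and the parent-chain adjacency are assumptions, since the remaining formulas and the exact skeleton definition appear only as images in the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFeatureEnhancement(nn.Module):
    """Sketch of the adaptive joint-feature enhancement module.  The association descriptor
    follows f_j^i = conv1x1(cat(avgpool(f_j), avgpool(f_i)), theta); the gating and the
    residual aggregation below are assumptions."""
    def __init__(self, num_joints=21, joint_dim=16, neighbours=None):
        super().__init__()
        self.J, self.c = num_joints, joint_dim
        # Illustrative adjacency: each joint attends to its kinematic parent (joint 0 to itself).
        self.neighbours = neighbours or [[0]] + [[j - 1] for j in range(1, num_joints)]
        self.assoc = nn.Conv2d(2 * joint_dim, joint_dim, kernel_size=1)  # f_j^i descriptor

    def forward(self, f_agg):                        # f_agg: (B, J*c, H, W)
        parts = torch.chunk(f_agg, self.J, dim=1)    # per-joint features f_j, each (B, c, H, W)
        out = []
        for j, f_j in enumerate(parts):
            g_j = F.adaptive_avg_pool2d(f_j, 1)      # avgpool(f_j)
            f_hat = f_j
            for i in self.neighbours[j]:
                g_i = F.adaptive_avg_pool2d(parts[i], 1)
                a_ji = self.assoc(torch.cat([g_j, g_i], dim=1))  # association information f_j^i
                w_ji = torch.sigmoid(a_ji)                       # assumed weight coefficients w_j^i
                f_hat = f_hat + w_ji * parts[i]                  # assumed residual aggregation
            out.append(f_hat)
        return out                                    # enhanced features, one tensor per joint
```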
Further, in step 5, the multi-stage iterative optimization method is implemented as follows: from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively.
The predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network to learn refined joint point heatmaps and depth maps; a more accurate pose estimation result can then be learned from the refined features. In the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, whose structure is consistent with the joint point feature enhancement module of stage t and which can capture longer-range association information.
The two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value.
Finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix.
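As an illustration of the decoding and back-projection described above, the sketch below assumes a soft-argmax readout of the heatmap, a heatmap-weighted readout of the relative depth map, and a standard pinhole back-projection with the intrinsic matrix K. Since the exact decoding formulas appear only as images in the original, these choices and the function name decode_joint are assumptions.

```python
import torch

def decode_joint(H_j, D_j, K, z_root):
    """One plausible decoding for a single joint, assuming a soft-argmax readout and
    pinhole back-projection.
    H_j, D_j: (h, w) predicted heatmap and relative depth map.
    K: (3, 3) camera intrinsic matrix.  z_root: depth of the root joint."""
    h, w = H_j.shape
    p = torch.softmax(H_j.reshape(-1), dim=0).reshape(h, w)       # normalised heatmap
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    u = (p * xs).sum()                                            # expected pixel coordinates
    v = (p * ys).sum()
    z_rel = (p * D_j).sum()                                       # heatmap-weighted relative depth
    z = z_rel + z_root                                            # absolute depth (assumption)
    xyz = z * torch.linalg.inv(K) @ torch.stack([u, v, torch.tensor(1.0)])  # back-projection
    return (u, v, z_rel), xyz
```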
Compared with the prior art, the invention has the advantages that:
(1) Unlike prior methods that treat all hand joint points equally, the proposed method considers the differences between joint points, innovatively introduces the intrinsic skeleton structure of the hand, and adaptively extracts the relevant features of each joint point to be detected using its association information, thereby alleviating hand occlusion and depth ambiguity and improving the accuracy of hand pose estimation. The invention achieves this through the designed binary decoder, semantic aggregation module and joint point feature adaptive enhancement module. The binary decoder and the semantic aggregation module generate rich joint-level and bone-level semantics; when the two-dimensional plane coordinates and relative depth coordinates of the joint points are estimated, the joint point feature adaptive enhancement module adaptively extracts features relevant to the joint point to be predicted from the joint points having an association relationship with it. This means that each joint point is regressed within its own feature subspace while the features of associated joint points are fully exploited, and this decoupled prediction scheme greatly reduces the prediction difficulty and improves estimation accuracy.
(2) The invention also adopts a multi-stage iterative optimization strategy to capture long-range dependencies between hand joint points: guided by the coarse two-dimensional heatmaps and depth maps, it continuously refines the output heatmaps and depth maps over multiple iterations. The proposed iterative optimization strategy adaptively captures association information of different extents for joint points with different degrees of freedom, establishing relations among different joint points, alleviating depth ambiguity and self-occlusion, and ultimately improving the accuracy of hand pose estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of the method for estimating a three-dimensional hand pose based on monocular RGB images according to the present invention;
FIG. 2 is a model diagram of hand three-dimensional pose estimation for monocular RGB images in accordance with the present invention;
FIG. 3 is a visual display of the estimation results of the present invention;
FIG. 4 is a comparison of the results of the present invention with other methods on the public STB dataset.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The invention discloses a robust three-dimensional hand pose estimation method based on monocular RGB images for obtaining accurate hand keypoint coordinates. It designs a feature enhancement method that explicitly introduces the intrinsic skeleton structure of the hand joint points, adaptively enhances the features of the joint points to be estimated using association information while suppressing the features of irrelevant joint points, and at the same time adopts a multi-stage optimization strategy that continuously refines the two-dimensional heatmaps and depth maps to estimate more accurate two-dimensional hand joint coordinates and relative depths, ultimately improving the accuracy of hand pose estimation.
As shown in fig. 1, the method of the present invention comprises the following steps:
Step 1: monocular three-dimensional hand pose estimation is treated as two subtasks: estimating the two-dimensional position information of the joint points in the image plane and estimating the three-dimensional relative depth information of the joint points in space. First, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network, and ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain high-level abstract semantic features. A binary decoder is then designed based on the hourglass structure to simultaneously output bone-level features F_b and joint-level features F_j; the bone-level features F_b, which contain joint connection information, and the joint-level features F_j, which contain joint point localization information, complement each other, and the subsequent feature fusion module produces a fused feature F_agg rich in both semantics.
Step 2: the semantic feature fusion module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain the fused feature F_agg containing both semantics. The process is as follows: first, the bone features F_b and joint point features F_j are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone and joint point features before aggregation, W_b and W_j denote the learned weights of the bone features and joint point features, σ denotes the sigmoid activation function, and cat denotes the concatenation operation.
Each branch then performs a cross-semantic adaptive feature fusion operation with a residual connection structure [formula given only as an image in the original], where F̂_b and F̂_j denote the bone and joint point features after residual connection and ⊙ denotes the dot (element-wise) product.
Finally, the fused features are concatenated along the feature dimension and fed into a 1×1 convolution to obtain the final aggregated semantic feature:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network, F_agg denotes the final fused feature, and cat denotes the concatenation operation.
Step 3: a joint point feature adaptive enhancement module is designed that explicitly introduces the intrinsic skeleton structure of the hand and uses the association information among joint points to extract context information for the joint point to be detected, enhancing relevant features and suppressing irrelevant ones. First, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points (J = 21) and N denotes the number of joint points defined as having association information with each joint point. The aggregated semantic feature F_agg is divided into J groups along the feature dimension so that each different joint point is assigned its own feature f_j, j = 0, …, J-1, where C denotes the total feature dimension of all joint points and c denotes the feature dimension of each joint point. When the two-dimensional plane coordinates and the relative depth coordinate of a joint point are estimated, the features related to the joint point j to be predicted are adaptively extracted from the N joint points having an association relationship with it, so that each joint point j is regressed within its own feature subspace; the features of associated joint points are fully exploited while the features of irrelevant joint points are suppressed, and this decoupled prediction scheme greatly reduces the prediction difficulty and improves estimation accuracy. The formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i denote the learned association information, w_j^i denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
Step 4: a multi-stage optimization method continuously refines the joint point heatmaps and relative depth maps to obtain more accurate and robust pose estimation results. Specifically, from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively.
Step 5: the predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network, which learns refined joint point heatmaps and depth maps; the joint point association relations are continuously expanded in subsequent optimization stages to capture longer-range dependencies, and a more accurate pose estimation result can then be learned from the refined features. In the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, which has the same structure as the joint point feature enhancement module of stage t but captures longer-range association information.
The two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value.
Finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix; the hand pose estimation process is thereby completed.
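A minimal PyTorch-style sketch of one refinement stage of this multi-stage strategy is shown below. The class name RefineStage, the concrete layer composition of the aggregation process φ and generation process Φ, the channel counts and the number of stages are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    """One refinement stage: aggregates the previous feature with the previous heatmap/depth
    predictions and re-predicts them.  Layer composition is an assumption."""
    def __init__(self, feat_ch=336, num_joints=21):   # feat_ch = 21 joints x 16 channels, illustrative
        super().__init__()
        self.aggregate = nn.Conv2d(feat_ch + 2 * num_joints, feat_ch, 3, padding=1)  # phi_{t+1}
        self.head = nn.Conv2d(feat_ch, 2 * num_joints, 1)                            # Phi_{t+1}

    def forward(self, feat, heat, depth):
        feat = torch.relu(self.aggregate(torch.cat([feat, heat, depth], dim=1)))
        heat, depth = torch.chunk(self.head(feat), 2, dim=1)
        return feat, heat, depth

# Iterative refinement over several stages (shapes are illustrative):
# feat = torch.randn(1, 336, 32, 32)
# heat, depth = torch.randn(1, 21, 32, 32), torch.randn(1, 21, 32, 32)
# stages = nn.ModuleList(RefineStage() for _ in range(3))
# for stage in stages:
#     feat, heat, depth = stage(feat, heat, depth)
```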
As shown in fig. 2, which shows the overall network framework of the invention: first, an RGB image centered on the hand is input; the feature encoder and the binary feature decoder in the left box produce bone-level features and joint-level features respectively, which the semantic aggregation module then integrates semantically. The right box shows the joint point feature adaptive enhancement module: the aggregated semantic features are divided equally into 21 groups, and for each joint point to be detected the relevant features are adaptively enhanced using the association information; a two-dimensional heatmap and a depth map are obtained through the output layer; an N-stage iterative optimization strategy then continuously refines the output heatmaps and depth maps; finally, the two-dimensional pixel coordinates and spatial depth coordinates of the 21 joint points are decoded, and the three-dimensional coordinates of the hand joint points are computed from the camera parameters. In fig. 2, encode denotes the encoder, decode the decoder, fusion the feature aggregation module, enhance the feature adaptive enhancement module, and refine the multi-stage iterative optimization stage.
As shown in fig. 3, which gives a visual display of the estimation results of the invention, the left column shows the input hand image with the predicted two-dimensional joint coordinates overlaid, the middle column shows the predicted joint coordinates, and the right column shows the ground-truth joint coordinates. Fig. 3 visually demonstrates the accuracy of the detection results of the invention. In fig. 3, clr + Pred_kp2D indicates the predicted two-dimensional joint coordinates drawn on the original image, Pred 3D indicates the predicted joint coordinates, and 3D annotation indicates the ground-truth joint coordinates. Keypoint denotes a predicted hand joint point, shown as a circle in the figure, and thumb, index, middle, ring and pinky label the corresponding fingers.
Fig. 4 compares the prediction results of the invention with those of other methods on the public STB dataset; the results of the invention are better than those of the other existing methods. In fig. 4, Error Thresholds denotes the error threshold in mm; 3D PCK denotes the proportion of correctly estimated keypoints, i.e., the proportion of detected keypoints whose normalized distance to the corresponding ground-truth label is smaller than the set threshold (the larger, the better); AUC denotes the area under the 3D PCK curve; Ours denotes the method of the present invention; Ge denotes the algorithm proposed by Ge et al.; Yang et al. denotes the algorithm proposed by Yang et al.; Iqbal denotes the algorithm proposed by Iqbal et al.; and Spurr et al. denotes the algorithm proposed by Spurr et al.
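For reference, the 3D PCK and AUC metrics used in fig. 4 can be computed as in the following sketch, which follows common practice for the STB benchmark; the function name pck3d_curve, the 20 to 50 mm threshold range and the millimetre units are assumptions made for illustration.

```python
import numpy as np

def pck3d_curve(pred, gt, thresholds_mm):
    """3D PCK: fraction of joints whose Euclidean error is below each threshold, and the
    area under that curve (AUC) normalised to [0, 1].
    pred, gt: (num_samples, num_joints, 3) arrays in millimetres."""
    err = np.linalg.norm(pred - gt, axis=-1).reshape(-1)          # per-joint errors
    pck = np.array([(err <= t).mean() for t in thresholds_mm])
    auc = np.trapz(pck, thresholds_mm) / (thresholds_mm[-1] - thresholds_mm[0])
    return pck, auc

# Example over a 20-50 mm threshold range commonly reported for the STB benchmark:
# pck, auc = pck3d_curve(pred_kpts, gt_kpts, np.linspace(20, 50, 31))
```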

Claims (5)

1. A monocular RGB image-based three-dimensional hand posture estimation method, characterized by comprising the following steps:
step 1: constructing a hand three-dimensional pose estimation network model, wherein the network model consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature adaptive enhancement module;
step 2: inputting a monocular RGB image frame centered on the hand; the visual feature extraction module generates a joint point localization information map and a bone association information map, and a pre-trained ResNet18 is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed and supervised with the joint point localization information map and the bone association information map to obtain joint-level features containing hand joint point localization information and bone-level features containing joint point association information;
step 3: sending the obtained joint-level features and bone-level features to the semantic feature aggregation module, which adaptively fuses the captured bone-level and joint-level features across semantics to obtain aggregated semantic features containing both;
step 4: sending the aggregated semantic features to the joint point feature adaptive enhancement module, which explicitly introduces the intrinsic skeleton structure of the hand and uses the association information to adaptively enhance the relevant features of each hand joint point to be detected, obtaining the enhanced features of each joint point;
step 5: for the enhanced features of each joint point, obtaining a predicted two-dimensional heatmap and a predicted relative depth map of the joint point through an output layer; continuously refining the joint point two-dimensional heatmaps and relative depth maps with a multi-stage iterative optimization method; obtaining the plane coordinates and relative depth values of the joint points through a decoding function; and finally computing the three-dimensional coordinates of the hand joint points from the camera parameters, thereby completing the hand pose estimation.
2. The method of claim 1, wherein step 2 is specifically implemented as follows: first, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network; then ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed, supervised by the joint point localization information map and the bone association information map, and simultaneously outputs bone-level features F_b and joint-level features F_j to provide richer semantic information.
3. The method of claim 1, wherein in step 3 the semantic feature aggregation module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain a fused feature F_agg containing both semantics, as follows:
(31) first, the bone-level features F_b and joint-level features F_j output by the hourglass-based binary decoder are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone-level and joint-level features before the semantic aggregation module, W_b and W_j denote the learned weights of the bone-level and joint-level features, σ denotes the sigmoid activation function, and cat denotes the concatenation operation;
(32) the bone-level feature branch and the joint-level feature branch of the hourglass-based binary decoder each perform a cross-semantic adaptive feature fusion operation with a residual connection structure: using the joint-level weight W_j and the bone-level weight W_b obtained in step (31), the cross-semantic fused features are computed [formula given only as an image in the original], where F̂_b and F̂_j denote the residual-connected bone-level and joint-level features and ⊙ denotes the dot (element-wise) product;
(33) finally, the cross-semantically fused bone features F̂_b and joint-level features F̂_j are concatenated and fed into a 1×1 convolution to obtain the final aggregated semantic feature F_agg:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network and c denotes the feature dimension.
4. The method of claim 1, wherein the joint point feature adaptive enhancement module of step 4 is implemented as follows:
first, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points and N denotes the number of joint points defined as having association information with each joint point; the semantic feature F_agg aggregated by the semantic aggregation module is divided equally into J groups along the feature dimension, ensuring that each different joint point is assigned a unique feature f_j, where "unique" means that the feature assigned to each joint point differs among all J joint points, C denotes the total feature dimension of all joint points, and c denotes the feature dimension of each joint point;
the formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i denote the learned association information, w_j^i denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
5. The method of claim 1, wherein in step 5 the multi-stage iterative optimization method is implemented as follows: from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively;
the predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network to learn refined joint point heatmaps and depth maps; a more accurate pose estimation result is then learned from the refined features; in the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, whose structure is consistent with the joint point feature enhancement module of stage t and which can capture longer-range association information;
the two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value;
finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix.
CN202211255461.4A 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image Pending CN115588237A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211255461.4A CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211255461.4A CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Publications (1)

Publication Number Publication Date
CN115588237A true CN115588237A (en) 2023-01-10

Family

ID=84780008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211255461.4A Pending CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Country Status (1)

Country Link
CN (1) CN115588237A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434335A (en) * 2023-03-30 2023-07-14 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN116434335B (en) * 2023-03-30 2024-04-30 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN116880687A (en) * 2023-06-07 2023-10-13 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm
CN116880687B (en) * 2023-06-07 2024-03-19 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination