CN116959120A

CN116959120A - Hand gesture estimation method and system based on hand joints

Info

Publication number: CN116959120A
Application number: CN202311194384.0A
Authority: CN
Inventors: 刘李漫; 李生玲; 田金山; 韩逸飞; 胡怀飞; 唐奇伶
Original assignee: South Central University for Nationalities
Current assignee: South Central Minzu University
Priority date: 2023-09-15
Filing date: 2023-09-15
Publication date: 2023-10-27
Anticipated expiration: 2043-09-15
Also published as: CN116959120B

Abstract

The application provides a hand gesture estimation method and a hand gesture estimation system based on hand joints, comprising the following steps: s1, acquiring an initial image of a human hand, and preprocessing to obtain a plurality of hand joint images; s2, performing feature extraction on a plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images; s3, respectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the characteristic images of the hand joints to obtain a plurality of key point heat maps; s4, combining the plurality of key point heat maps to obtain a human hand distribution map, and optimizing to obtain a human hand joint posture feature map; s5, predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain the three-dimensional coordinates of the hand joints. According to the method, the characteristic extraction is carried out on the hand joint image through the HRNet network, so that the complexity and the calculated amount of the HRNet network are reduced, and meanwhile, the accuracy of the characteristic extraction is ensured.

Description

Hand gesture estimation method and system based on hand joints

Technical Field

The application relates to the technical field of computer vision, in particular to a hand gesture estimation method and system based on hand joints.

Background

In real life, human hand pose estimation is widely used in many fields. Such as human-machine interaction, gesture recognition, virtual reality and augmented reality, etc. Initially, the method of studying monocular human hand pose estimation was mainly implemented by depth maps. However, in view of the fact that RGB cameras are more readily available and more ubiquitous than depth cameras, most of the current research is based on monocular RGB images. But it suffers from a lack of depth information and serious hand-hand/object occlusion problems. At this stage, monocular hand pose estimation is roughly classified into a data-driven-based method and a model-based method. Zimmermann et al propose for the first time that three-dimensional pose estimation of a human hand in a monocular RGB image is performed by deep learning, and the method simulates different hand poses by rendering a synthetic human hand dataset, but the model adopted by the method is relatively simple, and the estimated 3D human hand pose still has a large room for improvement. Ge et al propose a point-to-point regression predictive joint point network and directly take the 3D point cloud as an input to the network while outputting a point-by-point estimate, but this method requires a large amount of 3D point cloud data, resulting in higher cost of data collection and processing. Romaro et al propose a MANO parametric model for 3D human hand reconstruction that learns a wide variety of hand poses by 1000 high resolution 3D scans of 31 subjects' hands, and that MANO models can generate arbitrary hand poses with only a small number of input model parameters, but the dataset lacks the epidermis portion, and early performance for human hand pose estimation is low. Boukhayma et al propose to predict hand and camera parameters using a depth convolutional encoder; the 3D hand grid generated by the MANO model is generated through a decoder, and the generated hand is projected into an image domain through a re-projection module, but the accuracy of predicted key points at the edge part of the mask is not high due to the fact that a hand mask cannot be accurately acquired in practice. Spurr et al propose to utilize a large number of self-supervised learning methods without labels to estimate three-dimensional gestures of human hands for the first time, and propose a contrast learning objective function which can have invariance to the appearance transformation and isovariability to the geometric transformation.

Chinese patent CN115170762a discloses a single-view human hand reconstruction method, apparatus, and readable storage medium, which uses convolutional neural network to obtain deep human hand features and two-dimensional joint thermal map, extracts human hand gesture features according to two-dimensional Guan Jiere map, upsamples the deep human hand features, fuses with human hand gesture features, until a three-dimensional human hand grid model with preset number is reconstructed.

In the above technical solution, the MANO grid model is used to output the human hand grid so as to perform three-dimensional reconstruction of human hand, but this method increases the calculation amount and complexity of the network model.

Disclosure of Invention

In view of the above, the application provides a hand gesture estimation method and system based on hand joints, which performs feature extraction on hand joint images through a bottleneck module and a basic module in an HRNet network, wherein the basic module adopts depth separable convolution to output hand gestures, namely three-dimensional joint positions, so that the complexity and the calculated amount of the HRNet network are reduced, and the accuracy of feature extraction is ensured.

The technical scheme of the application is realized as follows:

in a first aspect, the present application provides a hand gesture estimation method based on hand joints, comprising the steps of:

s1, acquiring a hand initial image, and preprocessing the hand initial image to obtain a plurality of hand joint images;

s2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images;

s3, predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;

s4, merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;

s5, predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joints, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joints.

On the basis of the above technical solution, preferably, the HRNet network includes a bottleneck module and a base module, and step S2 specifically includes:

s21, performing first-stage feature extraction on the hand joint images by using the bottleneck module to obtain a first-stage feature map of the hand joint images;

s22, performing second-stage feature extraction on the first-stage feature map by using the basic module to obtain a second-stage feature map of the hand joint image;

s23, performing third-stage feature extraction on the second-stage feature map by using the basic module to obtain a first feature map, and fusing the first feature map with the highest-resolution feature map in the second-stage feature map to obtain a third-stage feature map of the hand joint image;

and S24, performing fourth-stage feature extraction on the third-stage feature map by using the basic module to obtain a second feature map, and fusing the second feature map with the highest-resolution feature map in the third-stage feature map to obtain a hand joint feature map.

On the basis of the above technical solution, preferably, step S21 specifically includes:

performing first-stage feature extraction on the plurality of hand joint images by using a residual error network of the bottleneck module to obtain a plurality of third feature images;

integrating the channel and spatial information of the third feature maps by using a CBAM attention mechanism to obtain fourth feature maps;

and connecting the plurality of fourth feature images by using a connection formula to obtain a first-stage feature image of the hand joint image.

On the basis of the above technical solution, preferably, the connection formula is:

；

wherein ,fourth feature map representing ith hand joint image, +.>Features representing the connection of the ith fourth feature map,/->Representing the number of hand joint features->A feature representing the convolution extracted feature of the fourth feature map by means of a filter,/->Representing the 1 st fourth feature map, +.>And the feature which is obtained by convolving and extracting the feature which is connected with the (i-1) th fourth feature image and represents the (i) th hand joint image through a filter.

On the basis of the above technical solution, preferably, step S3 specifically includes:

generating probability density maps of a plurality of different joints according to a plurality of hand joint characteristic images by using a two-dimensional joint prediction network;

calculating confidence scores of pixel points in the probability density maps of the different joints;

taking the quarter offset position of the highest confidence coefficient score and the second highest confidence coefficient score direction in the probability density maps of different joints as key points of the probability density maps, wherein the key points form a key point heat map;

and mapping the key points to the hand joint characteristic image to obtain the two-dimensional coordinates of the key points.

Based on the above technical solution, preferably, in step S3, two-dimensional coordinates of the key point are calculated using a key point coordinate formula,

the key point coordinate formula is as follows:

；

wherein ,abscissa representing two-dimensional coordinates corresponding to the j-th joint pixel point,/->Representing the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point, x representing the abscissa of the pixel point corresponding to the highest confidence score, y representing the ordinate of the pixel point corresponding to the highest confidence score, and->Representing the x-coordinate of the pixel point corresponding to the highest confidence score in the probability density map,representing the y-coordinate of the pixel point corresponding to the highest confidence score in the probability density map, ++>Representing the confidence score on the right adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map; />Representing the confidence score on the left adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map,/the confidence score on the left adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map>Representing the profile of the j-th joint>Representing the coordinates +.>Corresponding pixel point, < >>Probability density map representing j-th joint, +.>And (5) representing a j-th joint key point heat map.

On the basis of the above technical solution, preferably, step S4 specifically includes:

combining the plurality of key point heat maps to obtain a human hand distribution map;

determining a distribution area of a human hand according to the human hand distribution diagram, and taking the distribution area of the human hand as an interested area;

and optimizing the region of interest by using a joint gesture encoder to obtain a human hand joint gesture feature map.

Still more preferably, in step S5, feature extraction is performed using two concatenated group convolutions, and a shuffle operation is used to predict the pose of a human hand.

In a second aspect, the present application further provides a hand gesture estimation system based on a hand joint, and the hand gesture estimation method based on the hand joint according to any one of the above aspects is adopted, including:

the acquisition module is used for acquiring a hand initial image, preprocessing the hand initial image according to hand joint characteristics and obtaining a plurality of hand joint images;

the feature extraction module is used for extracting features of the hand joint images by using the HRNet network to obtain a plurality of hand joint feature images;

the heat map prediction module is used for predicting a probability density map of the hand joint by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;

the merging module is used for merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;

the 3D joint prediction module is used for predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joint, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joint.

On the basis of the technical proposal, the feature extraction module preferably uses the HRNet network for feature extraction, wherein the HRNet network comprises a bottleneck module and a basic module,

the bottleneck module is used for extracting characteristics of a plurality of hand joint images to obtain a first-stage characteristic image;

the basic module is used for carrying out feature extraction on the first-stage feature image to obtain the hand joint feature image.

Compared with the prior art, the hand gesture estimation method based on the hand joints has the following beneficial effects:

(1) The bottleneck module in the HRNet network is used for extracting channel information of the hand joint images, the extracted third feature images are connected, and the feature extraction is carried out on the connected first-stage feature images through depth separable convolution in the base module of the HRNet network, so that complexity and calculated amount of the HRNet network are reduced, and meanwhile, accuracy of feature extraction is guaranteed.

(2) The confidence score of the pixel point in the probability density map is calculated, the position of the quarter offset in the highest confidence score and the second highest confidence score direction in the probability density map is used as the key point in the probability density map, the position of the key point in the key point heat map is further optimized by using the joint gesture encoder, the influence of background information in the initial image of the human hand is restrained, and further the accuracy of estimating the gesture of the human hand is improved.

(3) By using two cascaded group convolutions to conduct hand prediction on the hand distribution diagram and the hand joint gesture feature diagram, the HRNet network is lighter, meanwhile, the performance of the HRNet network is guaranteed, the feature expression capability is guaranteed, and the accuracy of hand gesture prediction is further improved.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a hand gesture estimation method based on hand joints of the present application;

fig. 2 is a block diagram of a hand gesture estimation method based on hand joints according to the present application.

Detailed Description

The following description of the embodiments of the present application will clearly and fully describe the technical aspects of the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to fall within the scope of the present application.

As shown in fig. 1, the application provides a hand gesture estimation method based on hand joints, which comprises the following steps:

s1, acquiring a hand initial image, and preprocessing the hand initial image to obtain a plurality of hand joint images.

In the embodiment of the application, the preprocessing is to uniformly cut the initial image of the hand into the image with the dimension of 128×128, so that a plurality of hand joint images can be obtained, and the characteristics of the hand joints can be better extracted.

S2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images。

It will be appreciated that since the human hand is non-rigid, it is divided into five parts, the carpophalangeal joint (CM), metacarpophalangeal joint (MCP), proximal interphalangeal joint (PIP), distal interphalangeal joint (DIP) and fingertip (TIP). Because the MCP and the motions of all the interphalangeal joints have a certain constraint relationship, the motions of the rest joints and the CM are inseparable.

In the embodiment of the application, five parts of the hand are set to be five joint features, and feature extraction is performed on hand joint images according to the five joint features. The HRNet network is a high-resolution network, and can effectively extract the characteristics in the hand joint image, so that the subsequent joint prediction is more accurate and reliable.

In the embodiment of the application, the bottleneck module of the HRNet network is divided into five blocks according to the joint characteristics of the hands, and the bottleneck module is used for extracting the characteristics of the joint images of the hands, so that the multi-layer characteristic extraction capability of the HRNet network is improved.

As shown in fig. 2, specifically, the HRNet network includes a bottleneck module and a base module, and step S2 specifically includes:

s21, performing first-stage feature extraction on the hand joint images by using the bottleneck module to obtain a first-stage feature map f of the hand joint images ₁ ；

As will be appreciated by those skilled in the art, the HRNet network comprises a plurality of parallel sub-networks, each having a different resolution, with information exchanged between the sub-networks through multiple feature fusion. The HRNet network adopts four stages to extract hand joint image characteristics, gradually downsamples the image resolution, and avoids the information loss of the characteristic images in the downsampling process.

Wherein the first stage comprises 4 bottleneck modules, and the 4 bottleneck modules are used for mapping a plurality of hand joint diagramsExtracting the image features, and extracting the first stage feature image f ₁ The image resolution of the image is reduced to 1/4 of the image of the hand joint, and the channel number of the bottleneck module is changed into a first-stage characteristic diagram f by repeatedly using the bottleneck module to change the channel number of the bottleneck module ₁ 2 times the lowest resolution of (a).

The bottleneck module uses a plurality of cascaded filters for enhancing the bottleneck module's ability to extract features while enhancing the expression of features in the hand joint image. Preferably, the filter performs channel information extraction using a filter with a convolution kernel of 3×3.

Further, the step S21 specifically includes:

performing first-stage feature extraction on the hand joint images by using a residual error network of a bottleneck module to obtain a plurality of third feature images F ₃ ；

Using a CBAM attention mechanism on said plurality of third feature maps F ₃ Integrating the channel and space information to obtain a plurality of fourth feature images F ₄ ；

Using a connection formula to said fourth plurality of feature maps F ₄ Connecting to obtain a first stage characteristic diagram f of the hand joint image ₁ 。

The connection formula is:

wherein ,fourth feature map F representing ith hand joint image ₄ ，/>Representing the ith fourth feature map F ₄ Characteristics of the connection->Representing the number of hand joint features->Representation ofFourth characteristic diagram F ₄ Features convolutionally extracted by means of a filter, < >>Representing the 1 st fourth feature map F ₄ ，/>Fourth feature map representing ith hand joint image and ith-1 fourth feature map F ₄ And the connected features are connected and then subjected to convolution extraction through a filter.

It will be appreciated that the number of components,i.e. there are 5 joint features of the hand, dividing the initial image of the hand into 5 groups. Wherein the first set of images is not characterized using a filter in the bottleneck module.

S22, using the basic module to perform a first stage characteristic diagram f ₁ Extracting features of the second stage to obtain a feature map f of the second stage of the hand joint image ₂ 。

S23, using the basic module to perform a second-stage feature map f ₂ Extracting the third stage feature to obtain a first feature map F ₁ The first characteristic diagram F ₁ And a second stage characteristic diagram f ₂ The feature images with highest resolution ratio are fused to obtain a third stage feature image f of the hand joint image ₃ 。

S24, using the basic module to perform a third-stage characteristic diagram f ₃ Extracting the fourth stage feature to obtain a second feature map F ₂ The second characteristic diagram F ₂ And third stage characteristic diagram f ₃ The feature images with highest resolution ratio are fused to obtain a hand joint feature image。

It will be appreciated that the first stage feature map f is plotted using the base module ₁ In feature extraction, the SE attention mechanism is also used, and each channel feature is further adjusted by combining the SE attention mechanismThe weight of the graph can inhibit unimportant channel information and strengthen important channel information, so that the accuracy of human hand posture estimation can be improved.

Extracting features by using a basic module to change the resolution of the obtained feature image into 1/2 of that of the original image, namely a first stage feature image f ₁ The image resolution is the second stage feature map f ₂ 2 times of image resolution, second stage feature map f ₂ The image resolution is the third stage feature map f ₃ 2 times of image resolution, third stage characteristic diagram f ₃ The image resolution is 2 times of the image resolution of the fourth-stage feature map, and at the same time, the number of channels in the second stage is 64, the number of channels in the third stage is 128, and the number of channels in the fourth stage is 256.

The base module further comprises an up-sampling operation and a down-sampling operation when extracting the features, the result of the up-sampling operation is fused with the result of the down-sampling operation, and then the feature extraction of the next stage is carried out, so that the information loss of the feature map in the down-sampling process can be effectively avoided, and the reliability of the HRNet network on the feature extraction is improved.

In the embodiment of the application, the first characteristic diagram F is used for ₁ And a second stage characteristic diagram f ₂ The highest resolution features of the three phases are fused to enable the third-phase feature map f ₃ The second stage characteristic diagram f is reserved ₂ The information of (2) can prevent the feature from being lost during the feature extraction, and improve the accuracy of the HRNet network feature extraction.

Preferably, the first feature map F is generated by using a multi-scale fusion method ₁ And a second stage characteristic diagram f ₂ The highest resolution features of (1) are fused.

It can be understood that the bottleneck module in the HRNet network uses a plurality of filters to perform feature extraction on the hand joint image, so as to obtain a first-stage feature map f of the hand joint image ₁ First stage feature map f of hand joint image using depth separable convolution in base module ₁ Feature extraction is performed, and the accuracy of feature extraction is ensured while the complexity and the calculated amount of the HRNet network are reduced.

S3, according to the multipleIndividual hand joint feature imagesAnd respectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network to obtain a plurality of key point heat maps.

In the embodiment of the application, two-dimensional information of a human hand image is predicted by using a two-dimensional joint prediction network, and 21 key point heat maps containing single key points can be obtained by predicting a probability density map of the hand joints, namely, heat map prediction.

It can be understood that the key point heat map is that there are one for each joint of the hand, and there are 21 total, and the pixel value of each key point heat map represents the probability value of the corresponding joint of the pixel point, wherein the pixel point with the largest value is most likely to be the joint. If 21 key point heat values containing key points are directly overlapped, pixel value areas are overlapped, and joint positions cannot be distinguished, the method only adopts the pixel point corresponding to the highest confidence score in each key point heat map as the key point, and other pixel points in the key point heat map are all set to 0, so that a single Zhang Guanjian Shan Retu indicating hand image 21 joint positions can be realized.

Specifically, step S3 specifically includes:

using a two-dimensional joint prediction network from multiple hand joint feature imagesGenerating probability density maps of a plurality of different joints;

mapping the key points to hand joint characteristic imagesAnd obtaining the two-dimensional coordinates of the key points.

In a further embodiment of the present application, the two-dimensional coordinates of the keypoint are calculated using the keypoint coordinate formula in step S3,

the key point coordinate formula is as follows:

wherein ,abscissa representing two-dimensional coordinates corresponding to the j-th joint pixel point,/->Representing the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point, x representing the abscissa of the pixel point corresponding to the highest confidence score, y representing the ordinate of the pixel point corresponding to the highest confidence score, and->Representing the x-coordinate of the pixel point corresponding to the highest confidence score in the probability density map,representing the y-coordinate of the pixel point corresponding to the highest confidence score in the probability density map, ++>Representing the confidence score on the right adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map; />Representing the position of the pixel point coordinate adjacent to the left corresponding to the highest confidence coefficient score in the j-th joint probability surface density mapConfidence score->Representing the profile of the j-th joint>Representing the coordinates +.>Corresponding pixel point, < >>Probability density map representing j-th joint, +.>And (5) representing a j-th joint key point heat map.

S4, combining the plurality of key point heat maps to obtain a human hand distribution mapFor the human hand profile->Optimizing to obtain a characteristic diagram of the gesture of the human hand joint +.>。

It can be understood that the total of 21 key point heat maps is 21, and the formula adopted in step S4 is as follows:

wherein ,21 key point heat maps->Human hand distribution map formed after combination->。

In a further embodiment of the present application, step S4 specifically includes:

combining the plurality of key point heat maps to obtain a human hand distribution map；

According to the human hand distribution diagramDetermining a distribution area of a human hand, and taking the distribution area of the human hand as an interested area;

optimizing the region of interest by using a joint gesture encoder to obtain a characteristic diagram of the gesture of the human hand joint。

It will be appreciated that the RoIWArp joint pose encoder is a deep learning model for optimizing an image, and combining regional awareness and pose estimation to identify and encode key points in the image.

In the embodiment of the application, the RoIWArp joint gesture encoder is adopted to carry out gesture encoding so as to inhibit the expression of background information, thereby better capturing the gesture information of the human hand joint in the initial image of the human hand and improving the accuracy of human hand gesture estimation.

S5, according to the human hand distribution diagramAnd human hand joint posture feature map->Predicting the hand gesture to obtain three-dimensional coordinates of the hand joints, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joints.

In the embodiment of the application, in step S5, two cascaded group convolutions are adopted to perform feature extraction, and a shuffle operation is used to predict the hand gesture.

Concatenating two group convolutions can make the HRNet network more lightweight, while the shuffle operation can haveThe problems that the parameters of the HRNet are reduced and the performance of the HRNet network is reduced by group convolution are effectively avoided. Wherein the convolution kernel of the two cascaded group convolutions is 1×1, and the two cascaded group convolutions are used for distributing the graph of the human handAnd human hand joint posture feature map->And performing dimension reduction and dimension increase.

Preferably, the two cascaded group convolutions comprise a first convolution layer and a second convolution layer, wherein the depth separable convolutions are used in the second convolution layer, so that the expressive capacity of the features can be ensured, and the key points in the initial image of the human hand can be predicted more accurately.

In the embodiment of the application, the bottleneck module in the HRNet network is used for extracting the characteristics of the hand joint image to obtain a plurality of third characteristic diagrams F ₃ Third characteristic diagram F ₃ After connection, the characteristic fusion is carried out through a filter, and a CBAM attention mechanism is utilized to integrate the channel and the space information, so that the multi-layer characteristic extraction capability of the HRNet network is improved; simultaneously, a basic module in the HRNet network is utilized to integrate a first stage characteristic diagram f of the hand joint image ₁ And performing feature extraction, wherein a basic module performs feature extraction by adopting depth separable convolution, so that the complexity and the calculated amount of the HRNet network are reduced, and the accuracy of feature extraction is ensured.

Calculating the confidence score of each pixel point in each probability density map, taking the quarter offset position of the highest confidence score and the next highest confidence score direction in the probability density map as the key point in the probability density map, and mapping the key point to the hand joint characteristic imageObtaining two-dimensional coordinates of the key points, further determining positions of the key points in the key point heat map, further optimizing the positions of the key points in the key point heat map by using a joint gesture encoder, inhibiting influence of background information in an initial image of a human hand, and further providingThe accuracy of human hand pose estimation is high.

Human hand profile by using two cascaded group convolutionsAnd human hand joint posture feature map->And the human hand prediction is carried out, so that the HRNet network is lighter, the performance of the HRNet network is ensured, the expression capability of the characteristics is ensured, and the accuracy of the human hand posture prediction is further improved.

The application also provides a hand gesture estimation system based on the hand joint, which adopts the hand gesture estimation method based on the hand joint, comprising the following steps:

the feature extraction module is used for extracting features of the hand joint images by using the HRNet network to obtain a plurality of hand joint feature images；

A heat map prediction module for predicting hand joint characteristic imagesRespectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network to obtain a plurality of key point heat maps;

the merging module is used for merging the plurality of key point heat maps to obtain a human hand distribution diagramFor the human hand profile->Optimizing to obtain a characteristic diagram of the gesture of the human hand joint +.>；

The 3D joint prediction module is used for performing prediction according to the human hand distribution diagramAnd human hand joint posture feature map->Predicting the hand gesture to obtain three-dimensional coordinates of the hand joints, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joints.

The feature extraction module performs feature extraction by using an HRNet network, wherein the HRNet network comprises a bottleneck module and a basic module, and the bottleneck module is used for performing feature extraction on a plurality of hand joint images to obtain a first-stage feature map f ₁ The method comprises the steps of carrying out a first treatment on the surface of the The basic module is used for the first stage characteristic diagram f ₁ Extracting features to obtain the hand joint feature image。

The 3D joint prediction module comprises two cascaded group convolutions for use in accordance with the human hand profileAnd human hand joint posture feature map->And extracting hand gesture features, wherein the two cascaded group convolutions comprise a first convolution layer and a second convolution layer, and the second convolution layer adopts depth separable convolutions.

In the embodiment of the application, the acquisition module acquires the initial image of the human hand and cuts the initial image of the human hand to obtain a plurality of hand joint images, the bottleneck module in the HRNet network in the feature extraction module is utilized to extract the features of the hand joint images, the multi-layer feature extraction capacity of the HRNet network is improved, and the depth separable convolution of the basic module is utilized to perform the first-stage feature map f of the bottleneck module ₁ Feature extraction is carried out, the complexity and the calculated amount of the HRNet network are reduced, meanwhile, the precision of feature extraction is guaranteed, key point heatmaps are combined and optimized through a combining module, the influence of background information in an initial image of a human hand is restrained, and the optimized human hand joint posture feature map is subjected to group convolution through two cascading in a 3D joint prediction moduleAnd the hand gesture is predicted, so that the accuracy of hand gesture estimation is improved.

The application also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory and the communication interface complete communication with each other through the bus; the memory stores program instructions executable by the processor, which the processor invokes to implement the hand joint-based hand pose estimation method according to any of the above.

The application also discloses a computer readable storage medium storing computer instructions for causing the computer to implement the hand gesture estimation method based on the hand joints according to any one of the above.

The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the application.

Claims

1. A hand gesture estimation method based on hand joints is characterized in that: the method comprises the following steps:

2. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the HRNet network comprises a bottleneck module and a basic module, and the step S2 specifically comprises:

and S24, performing fourth-stage feature extraction on the third-stage feature map by using the basic module to obtain a second feature map, and fusing the second feature map with the highest-resolution feature map in the third-stage feature map to obtain a hand joint feature image.

3. A hand gesture estimation method based on hand joints as set forth in claim 2, wherein: the step S21 specifically includes:

4. A hand gesture estimation method based on hand joints as set forth in claim 3, wherein: the connection formula is as follows:

；

5. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the step S3 specifically comprises the following steps:

6. The hand gesture estimation method based on hand joints according to claim 5, wherein: the two-dimensional coordinates of the keypoint are calculated using the keypoint coordinate formula in step S3,

the key point coordinate formula is as follows:

；

wherein ,abscissa representing two-dimensional coordinates corresponding to the j-th joint pixel point,/->Representing the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point, x represents the abscissa of the pixel point corresponding to the highest confidence score, and y representsThe highest confidence score corresponds to the ordinate of the pixel point,/->X-coordinate representing pixel point corresponding to highest confidence score in probability density map, ++>Representing the y-coordinate of the pixel point corresponding to the highest confidence score in the probability density map, ++>Representing the confidence score on the right adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map; />Representing the confidence score on the left adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map,/the confidence score on the left adjacent to the pixel point coordinate corresponding to the highest confidence score in the j-th joint probability surface density map>Representing the profile of the j-th joint>Representing the coordinates +.>Corresponding pixel point, < >>Probability density map representing j-th joint, +.>And (5) representing a j-th joint key point heat map.

7. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the step S4 specifically comprises the following steps:

8. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: in step S5, feature extraction is performed by adopting two cascaded group convolutions, and hand gestures are predicted by using a shuffle operation.

9. A hand gesture estimation system based on hand joints, which is characterized in that: a hand gesture estimation method based on hand joints according to any one of claims 1-8, comprising:

10. A hand joint based hand pose estimation system according to claim 9, wherein: the feature extraction module performs feature extraction by using an HRNet network, wherein the HRNet network comprises a bottleneck module and a base module,