CN116959120A - Hand gesture estimation method and system based on hand joints

Hand gesture estimation method and system based on hand joints

Info

Publication number
CN116959120A
CN116959120A (application CN202311194384.0A)
Authority
CN
China
Prior art keywords
hand
joint
feature
map
images
Prior art date
Legal status
Granted
Application number
CN202311194384.0A
Other languages
Chinese (zh)
Other versions
CN116959120B (en)
Inventor
刘李漫
李生玲
田金山
韩逸飞
胡怀飞
唐奇伶
Current Assignee
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN202311194384.0A
Publication of CN116959120A
Application granted
Publication of CN116959120B
Status: Active


Classifications

    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a hand gesture estimation method and system based on hand joints, comprising the following steps: S1, acquiring an initial image of a human hand and preprocessing it to obtain a plurality of hand joint images; S2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images; S3, respectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the hand joint feature images to obtain a plurality of key point heat maps; S4, combining the plurality of key point heat maps to obtain a human hand distribution map, and optimizing it to obtain a human hand joint posture feature map; S5, predicting the hand gesture according to the human hand distribution map and the human hand joint posture feature map to obtain the three-dimensional coordinates of the hand joints. By extracting features from the hand joint images through the HRNet network, the method reduces the complexity and computation of the HRNet network while ensuring the accuracy of feature extraction.

Description

Hand gesture estimation method and system based on hand joints
Technical Field
The application relates to the technical field of computer vision, and in particular to a hand gesture estimation method and system based on hand joints.
Background
In real life, human hand pose estimation is widely used in many fields, such as human-machine interaction, gesture recognition, virtual reality, and augmented reality. Early work on monocular human hand pose estimation relied mainly on depth maps. However, because RGB cameras are more readily available and more ubiquitous than depth cameras, most current research is based on monocular RGB images, which suffer from a lack of depth information and severe hand-hand/hand-object occlusion. At this stage, monocular hand pose estimation methods fall roughly into data-driven and model-based approaches. Zimmermann et al. first proposed estimating the three-dimensional pose of a human hand in a monocular RGB image by deep learning, simulating different hand poses by rendering a synthetic human hand dataset; but the adopted model is relatively simple, and the estimated 3D hand pose still leaves considerable room for improvement. Ge et al. proposed a point-to-point regression network that predicts joint points, taking a 3D point cloud directly as input and outputting point-wise estimates; but this method requires a large amount of 3D point cloud data, making data collection and processing costly. Romero et al. proposed the MANO parametric model for 3D human hand reconstruction, which learns a wide variety of hand poses from 1000 high-resolution 3D scans of the hands of 31 subjects; MANO can generate arbitrary hand poses from only a small number of input model parameters, but the dataset lacks the epidermis portion, and its early performance on human hand pose estimation was low. Boukhayma et al. proposed predicting hand and camera parameters with a deep convolutional encoder, generating a 3D hand mesh from the MANO model through a decoder, and projecting the generated hand into the image domain through a re-projection module; but because an accurate hand mask cannot be obtained in practice, the predicted key points at the edge of the mask are not accurate. Spurr et al. first proposed estimating three-dimensional hand poses with large-scale label-free self-supervised learning, introducing a contrastive learning objective that is invariant to appearance transformations and equivariant to geometric transformations.
Chinese patent CN115170762A discloses a single-view human hand reconstruction method, apparatus and readable storage medium, which uses a convolutional neural network to obtain deep human hand features and a two-dimensional joint heat map, extracts human hand gesture features according to the two-dimensional joint heat map, and upsamples the deep human hand features and fuses them with the human hand gesture features until a three-dimensional human hand mesh model with a preset number of vertices is reconstructed.
In the above technical solution, the MANO mesh model is used to output a human hand mesh for three-dimensional hand reconstruction, but this approach increases the computation and complexity of the network model.
Disclosure of Invention
In view of the above, the application provides a hand gesture estimation method and system based on hand joints, which extracts features from hand joint images through a bottleneck module and a basic module in an HRNet network, where the basic module adopts depth separable convolution and outputs the hand gesture, namely the three-dimensional joint positions; this reduces the complexity and computation of the HRNet network while ensuring the accuracy of feature extraction.
The technical scheme of the application is realized as follows:
in a first aspect, the present application provides a hand gesture estimation method based on hand joints, comprising the steps of:
s1, acquiring a hand initial image, and preprocessing the hand initial image to obtain a plurality of hand joint images;
s2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images;
s3, predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;
s4, merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;
s5, predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joints, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joints.
On the basis of the above technical solution, preferably, the HRNet network includes a bottleneck module and a base module, and step S2 specifically includes:
s21, performing first-stage feature extraction on the hand joint images by using the bottleneck module to obtain a first-stage feature map of the hand joint images;
s22, performing second-stage feature extraction on the first-stage feature map by using the basic module to obtain a second-stage feature map of the hand joint image;
s23, performing third-stage feature extraction on the second-stage feature map by using the basic module to obtain a first feature map, and fusing the first feature map with the highest-resolution feature map in the second-stage feature map to obtain a third-stage feature map of the hand joint image;
and S24, performing fourth-stage feature extraction on the third-stage feature map by using the basic module to obtain a second feature map, and fusing the second feature map with the highest-resolution feature map in the third-stage feature map to obtain a hand joint feature map.
On the basis of the above technical solution, preferably, step S21 specifically includes:
performing first-stage feature extraction on the plurality of hand joint images by using a residual error network of the bottleneck module to obtain a plurality of third feature images;
integrating the channel and spatial information of the third feature maps by using a CBAM attention mechanism to obtain fourth feature maps;
and connecting the plurality of fourth feature images by using a connection formula to obtain a first-stage feature image of the hand joint image.
On the basis of the above technical solution, preferably, the connection formula is:
$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$

where x_i denotes the fourth feature map of the i-th hand joint image, y_i denotes the connected feature of the i-th fourth feature map, s denotes the number of hand joint features, K_i(x_i) denotes the feature extracted from the fourth feature map by a filter convolution, x_1 denotes the 1st fourth feature map, and K_i(x_i + y_{i-1}) denotes the feature of the i-th hand joint image connected with the (i-1)-th connected feature and then extracted through a filter convolution.
On the basis of the above technical solution, preferably, step S3 specifically includes:
generating probability density maps of a plurality of different joints according to a plurality of hand joint characteristic images by using a two-dimensional joint prediction network;
calculating confidence scores of pixel points in the probability density maps of the different joints;
taking the position offset by one quarter from the pixel with the highest confidence score toward the pixel with the second-highest confidence score in each joint's probability density map as the key point of that probability density map, the key points forming a key point heat map;
and mapping the key points to the hand joint characteristic image to obtain the two-dimensional coordinates of the key points.
Based on the above technical solution, preferably, in step S3 the two-dimensional coordinates of the key point are calculated using a key point coordinate formula,
the key point coordinate formula is:

$$x_j^{2D} = x + \frac{1}{4}\operatorname{sign}(P_r - P_l), \qquad y_j^{2D} = y + \frac{1}{4}\operatorname{sign}(P_t - P_b)$$

$$H_j(p) = \begin{cases} D_j(p), & p = (x, y) \\ 0, & \text{otherwise}, \end{cases} \qquad p \in \Omega_j$$

where x_j^{2D} denotes the abscissa and y_j^{2D} the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point; x and y denote the abscissa and ordinate of the pixel point corresponding to the highest confidence score in the probability density map; P_r and P_l denote the confidence scores adjacent to the right and to the left of that pixel point in the j-th joint probability density map, and P_t and P_b the corresponding scores adjacent above and below; Ω_j denotes the profile of the j-th joint; p denotes the pixel point corresponding to the coordinates (x, y); D_j denotes the probability density map of the j-th joint; and H_j denotes the j-th joint key point heat map.
On the basis of the above technical solution, preferably, step S4 specifically includes:
combining the plurality of key point heat maps to obtain a human hand distribution map;
determining a distribution area of a human hand according to the human hand distribution diagram, and taking the distribution area of the human hand as an interested area;
and optimizing the region of interest by using a joint gesture encoder to obtain a human hand joint gesture feature map.
Still more preferably, in step S5, feature extraction is performed using two concatenated group convolutions, and a shuffle operation is used to predict the pose of a human hand.
In a second aspect, the present application further provides a hand gesture estimation system based on hand joints, adopting the hand gesture estimation method based on hand joints according to any one of the above aspects, and comprising:
the acquisition module is used for acquiring a hand initial image, preprocessing the hand initial image according to hand joint characteristics and obtaining a plurality of hand joint images;
the feature extraction module is used for extracting features of the hand joint images by using the HRNet network to obtain a plurality of hand joint feature images;
the heat map prediction module is used for predicting a probability density map of the hand joint by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;
the merging module is used for merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;
the 3D joint prediction module is used for predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joint, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joint.
On the basis of the technical proposal, the feature extraction module preferably uses the HRNet network for feature extraction, wherein the HRNet network comprises a bottleneck module and a basic module,
the bottleneck module is used for extracting characteristics of a plurality of hand joint images to obtain a first-stage characteristic image;
the basic module is used for carrying out feature extraction on the first-stage feature image to obtain the hand joint feature image.
Compared with the prior art, the hand gesture estimation method based on the hand joints has the following beneficial effects:
(1) The bottleneck module in the HRNet network is used for extracting channel information of the hand joint images, the extracted third feature images are connected, and the feature extraction is carried out on the connected first-stage feature images through depth separable convolution in the base module of the HRNet network, so that complexity and calculated amount of the HRNet network are reduced, and meanwhile, accuracy of feature extraction is guaranteed.
(2) The confidence score of each pixel point in the probability density map is calculated, and the position offset by one quarter from the highest confidence score toward the second-highest confidence score is taken as the key point of the probability density map; the joint gesture encoder then further optimizes the key point positions in the key point heat map, suppressing the influence of background information in the initial hand image and thereby improving the accuracy of hand gesture estimation.
(3) Hand prediction is performed on the hand distribution map and the hand joint gesture feature map with two cascaded group convolutions, which makes the HRNet network lighter while preserving its performance and feature expression capability, further improving the accuracy of hand gesture prediction.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a hand gesture estimation method based on hand joints of the present application;
Fig. 2 is a block diagram of the hand gesture estimation method based on hand joints according to the present application.
Detailed Description
The following description of the embodiments of the present application will clearly and fully describe the technical aspects of the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to fall within the scope of the present application.
As shown in fig. 1, the application provides a hand gesture estimation method based on hand joints, which comprises the following steps:
s1, acquiring a hand initial image, and preprocessing the hand initial image to obtain a plurality of hand joint images.
In the embodiment of the application, the preprocessing uniformly crops the initial hand image into images of size 128×128, yielding a plurality of hand joint images from which the features of the hand joints can be better extracted.
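For illustration, a minimal sketch of this preprocessing step is given below; the 128×128 crop size comes from the text, while the crop centers, the clipping logic, and the image layout are assumptions made for the example:

```python
import numpy as np

def preprocess(initial_image: np.ndarray, centers: list) -> list:
    """Crop the initial hand image into 128x128 hand joint images.

    `centers` is an assumed list of (cx, cy) crop centers, one per
    joint group; the patent does not specify how crops are placed.
    """
    crops = []
    h, w = initial_image.shape[:2]
    for cx, cy in centers:
        x0 = int(np.clip(cx - 64, 0, w - 128))   # keep the crop inside the image
        y0 = int(np.clip(cy - 64, 0, h - 128))
        crops.append(initial_image[y0:y0 + 128, x0:x0 + 128])
    return crops
```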
S2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images.
It will be appreciated that, since the human hand is non-rigid, it is divided into five parts: the carpometacarpal joint (CM), the metacarpophalangeal joint (MCP), the proximal interphalangeal joint (PIP), the distal interphalangeal joint (DIP), and the fingertip (TIP). The motion of the MCP and of each interphalangeal joint is subject to certain constraints, and the motions of the remaining joints are inseparable from the CM.
In the embodiment of the application, five parts of the hand are set to be five joint features, and feature extraction is performed on hand joint images according to the five joint features. The HRNet network is a high-resolution network, and can effectively extract the characteristics in the hand joint image, so that the subsequent joint prediction is more accurate and reliable.
In the embodiment of the application, the bottleneck module of the HRNet network is divided into five blocks according to the joint characteristics of the hands, and the bottleneck module is used for extracting the characteristics of the joint images of the hands, so that the multi-layer characteristic extraction capability of the HRNet network is improved.
As shown in fig. 2, specifically, the HRNet network includes a bottleneck module and a base module, and step S2 specifically includes:
s21, performing first-stage feature extraction on the hand joint images by using the bottleneck module to obtain a first-stage feature map f of the hand joint images 1
As will be appreciated by those skilled in the art, the HRNet network comprises a plurality of parallel sub-networks, each having a different resolution, with information exchanged between the sub-networks through multiple feature fusion. The HRNet network adopts four stages to extract hand joint image characteristics, gradually downsamples the image resolution, and avoids the information loss of the characteristic images in the downsampling process.
The first stage comprises 4 bottleneck modules, which extract features from the plurality of hand joint images; the extracted first-stage feature map f₁ has an image resolution reduced to 1/4 of that of the hand joint images, and by repeatedly applying the bottleneck module the number of channels is changed to 2 times that of the lowest-resolution branch of f₁.
The bottleneck module uses a plurality of cascaded filters to enhance its ability to extract features while strengthening the expression of features in the hand joint image. Preferably, each filter extracts channel information using a 3×3 convolution kernel.
Further, step S21 specifically includes:
performing first-stage feature extraction on the hand joint images by using the residual network of the bottleneck module to obtain a plurality of third feature maps F₃;
integrating the channel and spatial information of the plurality of third feature maps F₃ by using a CBAM attention mechanism to obtain a plurality of fourth feature maps F₄;
connecting the plurality of fourth feature maps F₄ by using a connection formula to obtain the first-stage feature map f₁ of the hand joint image.
The connection formula is:
$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$

where x_i denotes the fourth feature map F₄ of the i-th hand joint image, y_i denotes the connected feature of the i-th fourth feature map, s denotes the number of hand joint features, K_i(x_i) denotes the feature extracted from the fourth feature map F₄ by a filter convolution, x_1 denotes the 1st fourth feature map, and K_i(x_i + y_{i-1}) denotes the feature obtained by connecting the fourth feature map of the i-th hand joint image with the (i-1)-th connected feature and extracting it through a filter convolution.
It will be appreciated that s = 5, i.e. there are 5 hand joint features, dividing the initial hand image into 5 groups, where the first group of images is not filtered in the bottleneck module.
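A minimal PyTorch sketch of this grouped connection is given below, following the reconstructed formula above (the per-group 3×3 filters and the unfiltered first group come from the text; the module layout, channel counts, and class name are assumptions):

```python
import torch
import torch.nn as nn

class GroupedConnection(nn.Module):
    """Connect s per-joint fourth feature maps with cascaded 3x3 filters.

    The first group passes through unfiltered; the second group is
    filtered directly; every later group is summed with the previous
    connected feature before its own filter is applied.
    """
    def __init__(self, channels_per_group: int, groups: int = 5):
        super().__init__()
        self.filters = nn.ModuleList([
            nn.Conv2d(channels_per_group, channels_per_group, 3, padding=1)
            for _ in range(groups - 1)
        ])

    def forward(self, xs: list) -> torch.Tensor:
        ys = [xs[0]]                               # y_1 = x_1 (no filter)
        for i, k in enumerate(self.filters, start=1):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]
            ys.append(k(inp))                      # y_i = K_i(x_i + y_{i-1})
        return torch.cat(ys, dim=1)                # first-stage feature map f1
```

For example, `GroupedConnection(16)` applied to five (1, 16, 32, 32) tensors yields a (1, 80, 32, 32) first-stage feature map.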
S22, performing second-stage feature extraction on the first-stage feature map f₁ by using the basic module to obtain a second-stage feature map f₂ of the hand joint image.
S23, performing third-stage feature extraction on the second-stage feature map f₂ by using the basic module to obtain a first feature map F₁, and fusing F₁ with the highest-resolution feature map in f₂ to obtain a third-stage feature map f₃ of the hand joint image.
S24, performing fourth-stage feature extraction on the third-stage feature map f₃ by using the basic module to obtain a second feature map F₂, and fusing F₂ with the highest-resolution feature map in f₃ to obtain the hand joint feature image.
It will be appreciated that when the basic module extracts features from the first-stage feature map f₁, an SE attention mechanism is also used; by further adjusting the weight of each channel feature map with SE attention, unimportant channel information is suppressed and important channel information is strengthened, which improves the accuracy of human hand pose estimation.
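As a reference point, a standard squeeze-and-excitation block of the kind mentioned here might look as follows (the reduction ratio of 16 is a common default, not a value taken from the text):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Reweight channel feature maps: squeeze (global average pool),
    then excite (two fully connected layers ending in a sigmoid)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))    # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)      # suppress unimportant channels
```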
Each feature extraction by the basic module halves the resolution of the obtained feature map: the image resolution of the first-stage feature map f₁ is 2 times that of the second-stage feature map f₂, that of f₂ is 2 times that of the third-stage feature map f₃, and that of f₃ is 2 times that of the fourth-stage feature map; meanwhile, the number of channels is 64 in the second stage, 128 in the third stage, and 256 in the fourth stage.
The basic module further performs up-sampling and down-sampling operations when extracting features; the up-sampling result is fused with the down-sampling result before the next stage of feature extraction, which effectively avoids information loss of the feature map during down-sampling and improves the reliability of HRNet feature extraction.
In the embodiment of the application, the first feature map F₁ is fused with the highest-resolution features of the second-stage feature map f₂ so that the third-stage feature map f₃ retains the information of f₂, which prevents features from being lost during feature extraction and improves the accuracy of HRNet feature extraction.
Preferably, a multi-scale fusion method is used to fuse the first feature map F₁ with the highest-resolution features of the second-stage feature map f₂.
It can be understood that the bottleneck module in the HRNet network uses a plurality of filters to extract features from the hand joint image, obtaining the first-stage feature map f₁ of the hand joint image, and the basic module then applies depth separable convolution to f₁, reducing the complexity and computation of the HRNet network while ensuring the accuracy of feature extraction.
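The depth separable convolution referred to here splits a standard convolution into a per-channel spatial filter followed by a 1×1 pointwise mix, which is what cuts the parameter count and computation. A minimal sketch (channel sizes and the BatchNorm/ReLU tail are illustrative assumptions):

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise 3x3
        nn.Conv2d(in_ch, out_ch, 1),                          # pointwise 1x1
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

For a 3×3 kernel the multiply-add cost relative to a standard convolution is roughly 1/out_ch + 1/9, close to a 9× saving when out_ch is large.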
S3, respectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the plurality of hand joint feature images, to obtain a plurality of key point heat maps.
In the embodiment of the application, the two-dimensional joint prediction network is used to predict the two-dimensional information of the hand image; predicting the probability density maps of the hand joints, i.e. heat map prediction, yields 21 key point heat maps, each containing a single key point.
It can be understood that there is one key point heat map for each joint of the hand, 21 in total, and the pixel value in each key point heat map represents the probability that the pixel corresponds to that joint, the pixel with the largest value being most likely the joint. If the 21 key point heat maps containing key points were directly superimposed, their pixel-value regions would overlap and the joint positions could not be distinguished; the method therefore adopts only the pixel corresponding to the highest confidence score in each key point heat map as the key point and sets all other pixels in the key point heat map to 0, so that a single key point heat map indicating the 21 joint positions of the hand image can be obtained.
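The argmax masking and merging described above can be sketched directly (a toy illustration; the (21, H, W) tensor layout is an assumption):

```python
import torch

def single_keypoint_heatmaps(density: torch.Tensor) -> torch.Tensor:
    """density: (21, H, W) probability density maps, one per joint.

    Keep only the highest-confidence pixel in each map and zero the
    rest, so that summing the 21 maps still separates all 21 joints.
    """
    j, h, w = density.shape
    flat = density.reshape(j, -1)
    idx = flat.argmax(dim=1)
    keep = torch.zeros_like(flat)
    rows = torch.arange(j)
    keep[rows, idx] = flat[rows, idx]
    return keep.reshape(j, h, w)

# Merged human hand distribution map for step S4: M = sum_j H_j
# M = single_keypoint_heatmaps(density).sum(dim=0)
```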
Specifically, step S3 specifically includes:
generating probability density maps of a plurality of different joints from the plurality of hand joint feature images by using a two-dimensional joint prediction network;
calculating confidence scores of the pixel points in the probability density maps of the different joints;
taking the position offset by one quarter from the pixel with the highest confidence score toward the pixel with the second-highest confidence score in each joint's probability density map as the key point of that probability density map, the key points forming a key point heat map;
mapping the key points to the hand joint feature images to obtain the two-dimensional coordinates of the key points.
In a further embodiment of the present application, the two-dimensional coordinates of the keypoint are calculated using the keypoint coordinate formula in step S3,
the key point coordinate formula is:

$$x_j^{2D} = x + \frac{1}{4}\operatorname{sign}(P_r - P_l), \qquad y_j^{2D} = y + \frac{1}{4}\operatorname{sign}(P_t - P_b)$$

$$H_j(p) = \begin{cases} D_j(p), & p = (x, y) \\ 0, & \text{otherwise}, \end{cases} \qquad p \in \Omega_j$$

where x_j^{2D} denotes the abscissa and y_j^{2D} the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point; x and y denote the abscissa and ordinate of the pixel point corresponding to the highest confidence score in the probability density map; P_r and P_l denote the confidence scores adjacent to the right and to the left of that pixel point in the j-th joint probability density map, and P_t and P_b the corresponding scores adjacent above and below; Ω_j denotes the profile of the j-th joint; p denotes the pixel point corresponding to the coordinates (x, y); D_j denotes the probability density map of the j-th joint; and H_j denotes the j-th joint key point heat map.
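Following the formula above, the quarter-offset decoding for one joint might be implemented like this (the sign-based shift toward the higher-scoring neighbour is an assumption consistent with the "second-highest confidence score direction" wording):

```python
import torch

def decode_keypoint(d: torch.Tensor) -> tuple:
    """d: (H, W) probability density map of one joint.

    Return the sub-pixel key point: the argmax pixel shifted a quarter
    pixel toward the higher-scoring neighbour in each direction.
    """
    h, w = d.shape
    idx = int(d.argmax())
    y, x = divmod(idx, w)
    xj, yj = float(x), float(y)
    if 0 < x < w - 1:
        xj += 0.25 * torch.sign(d[y, x + 1] - d[y, x - 1]).item()
    if 0 < y < h - 1:
        yj += 0.25 * torch.sign(d[y + 1, x] - d[y - 1, x]).item()
    return xj, yj
```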
S4, combining the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint posture feature map.
It can be understood that there are 21 key point heat maps in total, and the formula adopted in step S4 is:

$$M = \sum_{j=1}^{21} H_j$$

where H_j denotes the j-th key point heat map and M denotes the human hand distribution map formed after the 21 key point heat maps are combined.
In a further embodiment of the present application, step S4 specifically includes:
combining the plurality of key point heat maps to obtain a human hand distribution map;
determining the distribution area of the human hand according to the human hand distribution map, and taking the distribution area of the human hand as the region of interest;
optimizing the region of interest by using a joint gesture encoder to obtain the human hand joint posture feature map.
It will be appreciated that the RoIWArp joint pose encoder is a deep learning model for optimizing an image, combining regional awareness with pose estimation to identify and encode key points in the image.
In the embodiment of the application, the RoIWArp joint gesture encoder is adopted to carry out gesture encoding so as to inhibit the expression of background information, thereby better capturing the gesture information of the human hand joint in the initial image of the human hand and improving the accuracy of human hand gesture estimation.
S5, predicting the hand gesture according to the human hand distribution map and the human hand joint posture feature map to obtain the three-dimensional coordinates of the hand joints, and obtaining the hand gesture estimation result according to the three-dimensional coordinates of the hand joints.
In the embodiment of the application, in step S5, two cascaded group convolutions are adopted to perform feature extraction, and a shuffle operation is used to predict the hand gesture.
Cascading two group convolutions makes the HRNet network more lightweight, while the shuffle operation effectively avoids the drop in network performance that group convolution would otherwise cause as it reduces the parameters of HRNet. The convolution kernel of the two cascaded group convolutions is 1×1, and the two cascaded group convolutions perform dimension reduction and dimension increase on the human hand distribution map and the human hand joint posture feature map.
Preferably, the two cascaded group convolutions comprise a first convolution layer and a second convolution layer, where depth separable convolution is used in the second convolution layer; this preserves the expressive capacity of the features so that key points in the initial hand image can be predicted more accurately.
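A sketch of this prediction head is given below, in the spirit of ShuffleNet; the group count, channel sizes, and the placement of the shuffle between the two 1×1 group convolutions are assumptions, and the depthwise 3×3 in the second layer stands in for the depth separable convolution mentioned above:

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels so information crosses group boundaries."""
    b, c, h, w = x.shape
    return (x.reshape(b, groups, c // groups, h, w)
             .transpose(1, 2).reshape(b, c, h, w))

class CascadedGroupConv(nn.Module):
    """Two cascaded 1x1 group convolutions with a shuffle in between:
    the first reduces the channel dimension, the second restores it,
    and the second layer is made depth separable via a depthwise 3x3."""
    def __init__(self, channels: int, reduced: int, groups: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1, groups=groups)
        self.depthwise = nn.Conv2d(reduced, reduced, 3, padding=1, groups=reduced)
        self.expand = nn.Conv2d(reduced, channels, 1, groups=groups)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = channel_shuffle(self.reduce(x), self.groups)
        return self.expand(self.depthwise(y))
```

For example, `CascadedGroupConv(256, 64)` maps a (1, 256, 32, 32) input to a (1, 256, 32, 32) output with far fewer parameters than a pair of dense 1×1 layers.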
In the embodiment of the application, the bottleneck module in the HRNet network extracts features from the hand joint images to obtain a plurality of third feature maps F₃; the third feature maps F₃ are connected and then fused through a filter, and a CBAM attention mechanism integrates the channel and spatial information, improving the multi-layer feature extraction capability of the HRNet network. Meanwhile, the basic module in the HRNet network extracts features from the first-stage feature map f₁ of the hand joint image using depth separable convolution, reducing the complexity and computation of the HRNet network while ensuring the accuracy of feature extraction.
The confidence score of each pixel point in each probability density map is calculated, the position offset by one quarter from the highest confidence score toward the second-highest confidence score is taken as the key point of the probability density map, and the key points are mapped to the hand joint feature images to obtain their two-dimensional coordinates, thereby determining the positions of the key points in the key point heat map. The joint gesture encoder further optimizes the key point positions in the key point heat map, suppressing the influence of background information in the initial hand image and improving the accuracy of human hand pose estimation.
Performing hand prediction on the human hand distribution map and the human hand joint posture feature map with two cascaded group convolutions makes the HRNet network lighter while preserving its performance and feature expression capability, further improving the accuracy of human hand pose prediction.
The application also provides a hand gesture estimation system based on hand joints, which adopts the above hand gesture estimation method based on hand joints and comprises:
the acquisition module is used for acquiring a hand initial image, preprocessing the hand initial image according to hand joint characteristics and obtaining a plurality of hand joint images;
the feature extraction module is used for extracting features of the plurality of hand joint images by using the HRNet network to obtain a plurality of hand joint feature images;
the heat map prediction module is used for respectively predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the plurality of hand joint feature images, to obtain a plurality of key point heat maps;
the merging module is used for merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint posture feature map;
the 3D joint prediction module is used for predicting the hand gesture according to the human hand distribution map and the human hand joint posture feature map to obtain the three-dimensional coordinates of the hand joints, and obtaining the hand gesture estimation result according to the three-dimensional coordinates of the hand joints.
The feature extraction module performs feature extraction by using the HRNet network, wherein the HRNet network comprises a bottleneck module and a basic module; the bottleneck module performs feature extraction on the plurality of hand joint images to obtain the first-stage feature map f₁, and the basic module performs feature extraction on f₁ to obtain the hand joint feature images.
The 3D joint prediction module comprises two cascaded group convolutions for extracting hand gesture features from the human hand distribution map and the human hand joint posture feature map, wherein the two cascaded group convolutions comprise a first convolution layer and a second convolution layer, and the second convolution layer adopts depth separable convolution.
In the embodiment of the application, the acquisition module acquires the initial hand image and crops it to obtain a plurality of hand joint images; the bottleneck module of the HRNet network in the feature extraction module extracts features from the hand joint images, improving the multi-layer feature extraction capability of the HRNet network, while the depth separable convolution of the basic module extracts features from the first-stage feature map f₁, reducing the complexity and computation of the HRNet network while ensuring the precision of feature extraction; the merging module merges and optimizes the key point heat maps, suppressing the influence of background information in the initial hand image; and the two cascaded group convolutions in the 3D joint prediction module predict the hand gesture from the optimized human hand joint posture feature map, improving the accuracy of hand gesture estimation.
The application also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory and the communication interface complete communication with each other through the bus; the memory stores program instructions executable by the processor, which the processor invokes to implement the hand joint-based hand pose estimation method according to any of the above.
The application also discloses a computer readable storage medium storing computer instructions for causing the computer to implement the hand gesture estimation method based on the hand joints according to any one of the above.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the application.

Claims (10)

1. A hand gesture estimation method based on hand joints is characterized in that: the method comprises the following steps:
s1, acquiring a hand initial image, and preprocessing the hand initial image to obtain a plurality of hand joint images;
s2, performing feature extraction on the plurality of hand joint images according to hand joint features by using an HRNet network to obtain a plurality of hand joint feature images;
s3, predicting probability density maps of the hand joints by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;
s4, merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;
s5, predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joints, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joints.
2. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the HRNet network comprises a bottleneck module and a basic module, and the step S2 specifically comprises:
s21, performing first-stage feature extraction on the hand joint images by using the bottleneck module to obtain a first-stage feature map of the hand joint images;
s22, performing second-stage feature extraction on the first-stage feature map by using the basic module to obtain a second-stage feature map of the hand joint image;
s23, performing third-stage feature extraction on the second-stage feature map by using the basic module to obtain a first feature map, and fusing the first feature map with the highest-resolution feature map in the second-stage feature map to obtain a third-stage feature map of the hand joint image;
and S24, performing fourth-stage feature extraction on the third-stage feature map by using the basic module to obtain a second feature map, and fusing the second feature map with the highest-resolution feature map in the third-stage feature map to obtain a hand joint feature image.
3. A hand gesture estimation method based on hand joints as set forth in claim 2, wherein: the step S21 specifically includes:
performing first-stage feature extraction on the plurality of hand joint images by using a residual error network of the bottleneck module to obtain a plurality of third feature images;
integrating the channel and spatial information of the third feature maps by using a CBAM attention mechanism to obtain fourth feature maps;
and connecting the plurality of fourth feature images by using a connection formula to obtain a first-stage feature image of the hand joint image.
4. A hand gesture estimation method based on hand joints as set forth in claim 3, wherein: the connection formula is as follows:
$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases}$$

where x_i denotes the fourth feature map of the i-th hand joint image, y_i denotes the connected feature of the i-th fourth feature map, s denotes the number of hand joint features, K_i(x_i) denotes the feature extracted from the fourth feature map by a filter convolution, x_1 denotes the 1st fourth feature map, and K_i(x_i + y_{i-1}) denotes the feature of the i-th hand joint image connected with the (i-1)-th connected feature and then extracted through a filter convolution.
5. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the step S3 specifically comprises the following steps:
generating probability density maps of a plurality of different joints according to a plurality of hand joint characteristic images by using a two-dimensional joint prediction network;
calculating confidence scores of pixel points in the probability density maps of the different joints;
taking the position offset by one quarter from the pixel with the highest confidence score toward the pixel with the second-highest confidence score in each joint's probability density map as the key point of that probability density map, the key points forming a key point heat map;
and mapping the key points to the hand joint characteristic image to obtain the two-dimensional coordinates of the key points.
6. The hand gesture estimation method based on hand joints according to claim 5, wherein: the two-dimensional coordinates of the keypoint are calculated using the keypoint coordinate formula in step S3,
the key point coordinate formula is:

$$x_j^{2D} = x + \frac{1}{4}\operatorname{sign}(P_r - P_l), \qquad y_j^{2D} = y + \frac{1}{4}\operatorname{sign}(P_t - P_b)$$

$$H_j(p) = \begin{cases} D_j(p), & p = (x, y) \\ 0, & \text{otherwise}, \end{cases} \qquad p \in \Omega_j$$

where x_j^{2D} denotes the abscissa and y_j^{2D} the ordinate of the two-dimensional coordinate corresponding to the j-th joint pixel point; x and y denote the abscissa and ordinate of the pixel point corresponding to the highest confidence score in the probability density map; P_r and P_l denote the confidence scores adjacent to the right and to the left of that pixel point in the j-th joint probability density map, and P_t and P_b the corresponding scores adjacent above and below; Ω_j denotes the profile of the j-th joint; p denotes the pixel point corresponding to the coordinates (x, y); D_j denotes the probability density map of the j-th joint; and H_j denotes the j-th joint key point heat map.
7. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: the step S4 specifically comprises the following steps:
combining the plurality of key point heat maps to obtain a human hand distribution map;
determining a distribution area of a human hand according to the human hand distribution diagram, and taking the distribution area of the human hand as an interested area;
and optimizing the region of interest by using a joint gesture encoder to obtain a human hand joint gesture feature map.
8. A hand gesture estimation method based on hand joints as set forth in claim 1, wherein: in step S5, feature extraction is performed by adopting two cascaded group convolutions, and hand gestures are predicted by using a shuffle operation.
9. A hand gesture estimation system based on hand joints, characterized in that it adopts the hand gesture estimation method based on hand joints according to any one of claims 1-8 and comprises:
the acquisition module is used for acquiring a hand initial image, preprocessing the hand initial image according to hand joint characteristics and obtaining a plurality of hand joint images;
the feature extraction module is used for extracting features of the hand joint images by using the HRNet network to obtain a plurality of hand joint feature images;
the heat map prediction module is used for predicting a probability density map of the hand joint by using a two-dimensional joint prediction network according to the hand joint characteristic images to obtain a plurality of key point heat maps;
the merging module is used for merging the plurality of key point heat maps to obtain a human hand distribution map, and optimizing the human hand distribution map to obtain a human hand joint gesture feature map;
the 3D joint prediction module is used for predicting the hand gesture according to the hand distribution diagram and the hand joint gesture feature diagram to obtain three-dimensional coordinates of the hand joint, and obtaining a hand gesture estimation result according to the three-dimensional coordinates of the hand joint.
10. A hand joint based hand pose estimation system according to claim 9, wherein: the feature extraction module performs feature extraction by using an HRNet network, wherein the HRNet network comprises a bottleneck module and a base module,
the bottleneck module is used for extracting characteristics of a plurality of hand joint images to obtain a first-stage characteristic image;
the basic module is used for carrying out feature extraction on the first-stage feature image to obtain the hand joint feature image.
CN202311194384.0A 2023-09-15 2023-09-15 Hand gesture estimation method and system based on hand joints Active CN116959120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311194384.0A CN116959120B (en) 2023-09-15 2023-09-15 Hand gesture estimation method and system based on hand joints


Publications (2)

Publication Number Publication Date
CN116959120A 2023-10-27
CN116959120B 2023-12-01

Family

ID=88458647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311194384.0A Active CN116959120B (en) 2023-09-15 2023-09-15 Hand gesture estimation method and system based on hand joints

Country Status (1)

Country Link
CN (1) CN116959120B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191627A (en) * 2020-01-06 2020-05-22 浙江工业大学 Method for improving accuracy of dynamic gesture motion recognition under multiple viewpoints
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
WO2020177498A1 (en) * 2019-03-04 2020-09-10 南京邮电大学 Non-intrusive human body thermal comfort detection method and system based on posture estimation
CN113158870A (en) * 2021-04-15 2021-07-23 华南理工大学 Countermeasure type training method, system and medium for 2D multi-person attitude estimation network
CN113298040A (en) * 2021-06-21 2021-08-24 清华大学 Key point detection method and device, electronic equipment and computer-readable storage medium
CN114519865A (en) * 2022-01-14 2022-05-20 宁波大学 2D human body posture estimation method fused with integrated attention
CN114627491A (en) * 2021-12-28 2022-06-14 浙江工商大学 Single three-dimensional attitude estimation method based on polar line convergence
WO2022142854A1 (en) * 2020-12-29 2022-07-07 深圳市优必选科技股份有限公司 Optimization method and apparatus for human pose recognition module, and terminal device
KR20220098895A (en) * 2021-01-05 2022-07-12 주식회사 케이티 Apparatus and method for estimating the pose of the human body
CN116091596A (en) * 2022-11-29 2023-05-09 南京龙垣信息科技有限公司 Multi-person 2D human body posture estimation method and device from bottom to top
CN116092190A (en) * 2023-01-06 2023-05-09 大连理工大学 Human body posture estimation method based on self-attention high-resolution network
CN116311518A (en) * 2023-03-20 2023-06-23 北京工业大学 Hierarchical character interaction detection method based on human interaction intention information


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUN ZHANG et al.: "Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation", arXiv, pages 1-10 *
GAO XU: "Balanced single-person pose estimation method based on the human skeleton", China Master's Theses Full-text Database, Information Science and Technology, no. 4, pages 138-1102 *

Also Published As

Publication number Publication date
CN116959120B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Fieraru et al. Three-dimensional reconstruction of human interactions
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111080670B (en) Image extraction method, device, equipment and storage medium
CN110210426B (en) Method for estimating hand posture from single color image based on attention mechanism
WO2021098576A1 (en) Hand posture estimation method and apparatus, and computer storage medium
CN112131965A (en) Human body posture estimation method and device, electronic equipment and storage medium
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
JP6052533B2 (en) Feature amount extraction apparatus and feature amount extraction method
CN111914595B (en) Human hand three-dimensional attitude estimation method and device based on color image
Liu et al. Hand pose estimation from rgb images based on deep learning: A survey
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN116740290B (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN114792401A (en) Training method, device and equipment of behavior recognition model and storage medium
CN116958958A (en) Self-adaptive class-level object attitude estimation method based on graph convolution double-flow shape prior
CN116959120B (en) Hand gesture estimation method and system based on hand joints
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
Zhang et al. A multi-cue guidance network for depth completion
CN113763536A (en) Three-dimensional reconstruction method based on RGB image
CN117953545B (en) Three-dimensional hand gesture estimation method, device and processing equipment based on color image
CN117576307A (en) Double-hand reconstruction method based on multi-scale color information and depth information fusion
Wang et al. Hourglass network for hand pose estimation with rgb images
WO2023273272A1 (en) Target pose estimation method and apparatus, computing device, storage medium, and computer program
Li et al. HRI: human reasoning inspired hand pose estimation with shape memory update and contact-guided refinement
Gong Application and Practice of Artificial Intelligence Technology in Interior Design
Farjadi et al. RGB Image-Based Hand Pose Estimation: A Survey on Deep Learning Perspective

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant