CN111191622B - Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector - Google Patents


Info

Publication number
CN111191622B
CN111191622B (application CN202010006031.3A)
Authority
CN
China
Prior art keywords
key points
thermodynamic diagram
gesture recognition
offset
thermodynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010006031.3A
Other languages
Chinese (zh)
Other versions
CN111191622A (en)
Inventor
肖菁
李海超
屈光卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN202010006031.3A priority Critical patent/CN111191622B/en
Publication of CN111191622A publication Critical patent/CN111191622A/en
Application granted granted Critical
Publication of CN111191622B publication Critical patent/CN111191622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]


Abstract

The invention discloses a gesture recognition method, system and storage medium based on thermodynamic diagrams (heatmaps) and offset vectors. The gesture recognition method comprises the following steps: acquiring a target image to be identified; extracting features of the target image to be identified; predicting the positions of key points from the extracted features; correcting the predicted key points to determine their final positions; and determining the gesture information of the target from the key points. By extracting image features, predicting the key point positions, correcting the predicted results and finally recognizing the pose, the invention obtains more accurate pose information, and can be widely applied in the technical field of deep learning.

Description

Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector
Technical Field
The invention relates to the technical field of deep learning, in particular to a gesture recognition method, a gesture recognition system and a storage medium based on thermodynamic diagrams and offset vectors.
Background
Thermodynamic diagram (heatmap): a probability map in which the probability of a pixel approaches 1 the closer the pixel is to the center point, and approaches 0 the farther it is from the center point; the distribution can be modelled by a suitable function, such as a Gaussian.
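The truncated-Gaussian heatmap described above can be sketched as follows (a minimal NumPy illustration; the radius R = 3 and standard deviation σ = 1.5 are illustrative values, not taken from the patent):

```python
import numpy as np

def gaussian_heatmap(height, width, center, radius=3.0, sigma=1.5):
    """Build a heatmap whose values approach 1 near `center` and 0 far away.

    Pixels farther than `radius` from the center are set to exactly 0,
    mirroring the truncated-Gaussian target used for keypoint heatmaps.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dist2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    heat = np.exp(-dist2 / (2.0 * sigma ** 2))
    heat[dist2 > radius ** 2] = 0.0
    return heat

hm = gaussian_heatmap(9, 9, center=(4, 4))
print(hm[4, 4])  # 1.0 at the center; neighbours fall off toward 0
```

In training, one such target map would be built per key point, centred at the labeled position.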
Offset vector: the displacement between a point and a reference point, giving both the distance and the direction from one to the other.
Posture estimation: the task of determining the pose of a person in an image (or stereo image, or image sequence) and reconstructing the person's joints and limbs.
People often record their daily lives by taking pictures. To better understand the people in a picture, we want to locate where they are and know what activity they are performing; achieving these goals is the central problem of human body posture estimation. Pose estimation, also called human body keypoint detection, mainly identifies the positions of the key parts of the human body, such as the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. Despite many years of research, it remains a very challenging problem in computer vision, with the difficulties arising mainly from complex backgrounds, blurring, occlusion, changes in illumination, and clothing color in natural scenes. Moreover, limb interactions between people cause strong interference, such as overlapping limbs and shadows cast between limbs.
Because there is often more than one person in practical application scenes, current pose estimation algorithms are mainly multi-person algorithms. There are two main approaches to multi-person pose estimation: top-down and bottom-up. The top-down method first obtains the detection boxes of the people in the image using an object detection method such as Faster-RCNN (Faster Region-based Convolutional Neural Networks) or SSD (Single Shot MultiBox Detector), then crops them from the original image and feeds each crop to a subsequent pose estimation network, which predicts the human key points for each cropped image separately. The top-down approach thus converts the multi-person pose estimation problem into single-person pose estimation. The bottom-up method first detects the key points of all persons, then clusters the key points, connecting the key points belonging to the same person so that the clustering yields the separate individuals. Bottom-up methods therefore focus on the key-point clustering step, i.e. how to model the relations between different key points.
With the rapid development of deep learning in the field of computer vision, a large amount of research in recent years has applied deep learning to human keypoint detection. However, most existing work focuses on how to design the data-flow paths in the network so as to capture the rich spatial and detail information in the picture, for example the feature pyramid network (Feature Pyramid Networks for Object Detection), the cascaded pyramid network (Cascaded Pyramid Network for Multi-Person Pose Estimation) and the stacked hourglass network (Stacked Hourglass Networks for Human Pose Estimation). These methods naturally improve the accuracy of human keypoint detection, but they ignore the small offset that occurs when predicted points are mapped from low resolution back to high resolution, which causes a certain loss of precision.
Disclosure of Invention
In view of this, embodiments of the present invention provide a high-precision gesture recognition method, system and storage medium based on a thermodynamic diagram and an offset vector.
The first aspect of the invention provides a gesture recognition method based on thermodynamic diagrams and offset vectors, comprising the following steps:
acquiring a target image to be identified;
extracting the features of the target image to be identified;
predicting the positions of key points according to the extracted features;
correcting the predicted key points and determining their final positions; and
determining the gesture information of the target to be identified according to the key points.
Further, the step of extracting the features of the target image to be identified includes:
cutting the obtained target image to be identified;
inputting each image obtained by cutting into a residual network; and
performing encoding processing through the residual network to obtain a first feature map.
Further, the residual network includes five groups of convolutional layers;
in addition, the step of obtaining the feature map through the encoding processing of the residual network comprises:
performing dimension-changing processing on each channel of the feature map through convolution kernels, wherein the dimension-changing processing comprises dimension increasing and dimension decreasing;
normalizing each channel; and
carrying out nonlinear activation processing on the normalization result.
Further, the step of extracting the features of the target image to be identified further includes a decoding step, which comprises:
inputting the obtained first feature map into a deconvolution structure;
decoding the first feature map through the deconvolution structure; and
acquiring the characteristic response maps of all the channels.
Further, the step of predicting the key point positions according to the extracted features comprises:
obtaining thermodynamic diagrams from the output results of the channels;
taking the maximum value of each thermodynamic diagram to obtain the position of each key point on the thermodynamic diagram; and
mapping the key point positions onto the target image to be identified according to the size relation between the target image and the thermodynamic diagram.
Further, the step of correcting the predicted key points and determining their final positions comprises:
determining an offset vector for each key point according to the output result of each channel; and
adding the offset vector to the position of the thermodynamic diagram maximum to determine the final position of the key point.
Further, the method also comprises the following steps:
training the thermodynamic diagrams using a mean square error loss function; and
training the offset vectors using a smooth loss function to penalize the gap between the true offset and the predicted offset.
A second aspect of the present invention provides a thermodynamic diagram and offset vector based gesture recognition system comprising:
the acquisition module is used for acquiring the target image to be identified;
the feature extraction module is used for extracting features of the target image to be identified;
the key point predicting module is used for predicting the position of the key point according to the extracted characteristics;
the key point correction module is used for correcting the predicted key points and determining the final positions of the key points; and
the gesture determining module is used for determining gesture information of the target to be recognized according to the key points.
A third aspect of the present invention provides a thermodynamic diagram and offset vector based gesture recognition system comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method.
A fourth aspect of the invention provides a storage medium having stored therein processor-executable instructions which, when executed by a processor, are for performing the method.
One or more of the above technical solutions in the embodiments of the present invention have the following advantage: by extracting the features of the image, predicting the positions of the key points, correcting the predicted results and finally recognizing the pose, the embodiments of the invention obtain more accurate gesture information.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart illustrating the overall steps of an embodiment of the present invention;
FIG. 2 is a first example flow chart of an embodiment of the present invention;
FIG. 3 is a second exemplary flow chart of an embodiment of the present invention;
FIG. 4 is a schematic diagram of coordinate position prediction by coordinate offset correction thermodynamic diagrams according to an embodiment of the present invention;
FIG. 5 is a graph showing the comparison of various algorithms on MSCOCO data sets according to embodiments of the present invention;
FIG. 6 is a graph showing the comparison of various algorithms on MPII datasets in accordance with an embodiment of the present invention;
FIG. 7 is a comparison of various algorithms of an embodiment of the present invention on the CrowdPose dataset;
FIG. 8 is a graph showing the detection of HOPE on MSCOCO data sets according to an embodiment of the invention;
FIG. 9 is a graph showing the detection of HOPE on an MPII dataset according to an embodiment of the invention;
fig. 10 shows the detection results of HOPE on the CrowdPose dataset according to an embodiment of the present invention.
Detailed Description
The invention is further explained and illustrated below with reference to the drawing and the specific embodiments of the present specification. The step numbers in the embodiments of the present invention are set for convenience of illustration, and the order of steps is not limited in any way, and the execution order of the steps in the embodiments can be adaptively adjusted according to the understanding of those skilled in the art.
Most prior-art methods based on thermodynamic diagrams innovate only on the network architecture and the loss function. However, thermodynamic-diagram-based methods include a coordinate mapping process, and ignoring the loss incurred when predicted points on the low-resolution thermodynamic diagram are mapped back to the original image limits the achievable accuracy.
Thus, the application provides a human body posture estimation method based on thermodynamic diagrams and coordinate offsets, which extracts features through a robust convolutional neural network, predicts the thermodynamic diagrams and offset vectors of the key points, predicts the key point coordinates from the thermodynamic diagrams, and corrects those coordinates with the offset vectors to obtain more accurate position information.
Referring to fig. 1, the specific implementation steps of the embodiment of the present application include:
s1: acquiring a target image to be identified;
s2: extracting the characteristics of the target image to be identified;
as shown in fig. 2 and fig. 3, feature extraction in the embodiment of the present application converts a picture into features, and the network structure of the model is divided into two parts, an encoding module and a decoding module. The encoding module adopts a 50-layer residual network with the last 1x1 convolution layer removed, and extracts features from the input image in a fully convolutional manner. Thanks to the residual design in particular, such networks perform very well on many computer vision tasks and have very strong feature expression capability.
The residual network in this embodiment is composed of five groups of convolutional layers, c1 through c5, each group containing N residual modules. A residual module consists of alternating convolution layers, BN layers and ReLUs. The 1x1 convolution kernels are mainly used to reduce or increase the channel dimension of the feature map; reducing the dimension with a 1x1 convolution before the 3x3 convolution effectively reduces the computation of the following kernel. The BN layer is a batch normalization layer; each channel has four corresponding parameters, a mean, a variance, a scaling coefficient and a bias, which are used to normalize the features input to the layer, preventing vanishing or exploding gradients caused by shifts in the data distribution of intermediate layers during training. The ReLU serves as the nonlinear activation function: on the one hand it improves the nonlinear expression capacity of the network, and on the other hand it avoids the slow parameter updates of the Sigmoid function in its saturation region. As shown in fig. 2, the encoding steps in the embodiments of the present application are: first, cutting the acquired image; second, inputting the image into the residual network; and third, acquiring the encoded feature map from the residual network.
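The compute saving from the 1x1 dimension reduction described above can be made concrete with a small multiply-accumulate count. The channel widths (256 in, 64 after reduction) are standard bottleneck values assumed for illustration; the patent does not state exact widths:

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

H, W = 64, 48          # example feature-map size (as at the network output)
C = 256                # assumed input/output channels of the residual module
C_mid = 64             # assumed channels after the 1x1 reduction

# Direct 3x3 convolution at full channel width:
direct = conv_flops(H, W, C, C, 3)

# Bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand:
bottleneck = (conv_flops(H, W, C, C_mid, 1)
              + conv_flops(H, W, C_mid, C_mid, 3)
              + conv_flops(H, W, C_mid, C, 1))

print(direct // bottleneck)  # 8: the reduction cuts the cost several-fold
```

This is why placing a 1x1 reduction before the 3x3 kernel "effectively reduces the calculated amount of the next convolution kernel".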
In addition, the embodiment of the application further includes a decoding step, as shown in fig. 2: in the first step, the obtained feature map is input into a deconvolution structure; in the second step, the feature map is decoded by the deconvolution structure; in the third step, a characteristic response map of 3n channels is obtained from a 1x1 convolution. As shown in fig. 3, the network outputs a thermodynamic diagram of n channels, used to predict the positions of the n keypoints, and offset vectors of 2n channels, used to predict the offsets of the keypoints at each position in the x and y directions. The final feature map at the end of the network has a size of 64x48, a quarter of the input image in both width and height.
S3: predicting the positions of key points according to the extracted features;
specifically, the present embodiment denotes the position of the kth key point by l_k. If a position x_i on the thermodynamic diagram lies within radius R of the key point l_k, the probability of that position being the true key point is made to follow a Gaussian distribution, which is more favourable for network learning; that is, h_k(x_i) = G(x_i - l_k) if ||x_i - l_k|| <= R, and h_k(x_i) = 0 otherwise, where G is a Gaussian function. Clearly, the closer a position on the thermodynamic diagram is to the key point l_k, the more likely it is to be the key point. The implementation steps for predicting the key points are: first, obtaining the thermodynamic diagrams from the output channels of the network; second, for each key point l_k with corresponding thermodynamic diagram h_k, taking the maximum of the diagram to obtain the position of the key point on it; and third, mapping the coordinates from the thermodynamic diagram to the input image according to the size ratio between the input image and the thermodynamic diagram.
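The argmax-and-rescale decoding of step S3 can be sketched as follows (NumPy; the stride of 4 follows from the thermodynamic diagram being a quarter of the input resolution):

```python
import numpy as np

def decode_keypoints(heatmaps, stride=4):
    """heatmaps: (n, H, W) array, one channel per key point.

    Returns the (x, y) position of each key point on the input image,
    obtained by taking the per-channel maximum and scaling by the stride.
    """
    n, h, w = heatmaps.shape
    flat = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (h, w))
    return np.stack([xs, ys], axis=1) * stride

hm = np.zeros((1, 64, 48))
hm[0, 10, 20] = 1.0             # peak at heatmap coordinate (x=20, y=10)
print(decode_keypoints(hm))     # [[80 40]] on the input image
```

Note that this decoding alone can only reach every fourth input-image pixel, which is exactly the precision loss the offset vectors of step S4 are meant to repair.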
S4: correcting the predicted key points, and determining the final positions of the key points;
specifically, in the embodiment of the present application, the key points suffer a loss of precision when mapped from a low-resolution image to a high-resolution image. As shown in fig. 4(b), each grid cell represents a pixel position, and the area framed by the rectangle in fig. 4(a) is the thermodynamic diagram predicting the position of the left wrist; when its predicted coordinates are mapped to the resolution of the input image, a larger precision loss occurs. As can be seen from fig. 4(b), one pixel in the thermodynamic diagram actually represents a 16-pixel region of the original image, because the width and height are both a quarter of the original; multiplying a coordinate on the thermodynamic diagram by 4 can only reach the first pixel of the corresponding region of the input image, i.e. the upper-left corner of the 16-cell grid in fig. 4(b). This is the source of the precision loss in the coordinate mapping process. Many works reduce this loss by manually shifting the predicted key point location by a quarter of a pixel on the thermodynamic diagram, i.e. by a distance of 1 pixel on the original input image; this does reduce the expected error between the mapped key points and the true key points and brings a slight accuracy improvement, but it does not fundamentally solve the precision loss problem.
Given this situation, the network of the present application predicts, in addition to the output thermodynamic diagrams, a two-dimensional offset vector o_k(x_i) for each location x_i relative to the input image, letting the neural network actively learn the offset between the mapped key points and the real key points. o_k(x_i) is the shift, relative to the kth key point on the input image, after a position x_i on the kth thermodynamic diagram is mapped, and it is aimed at correcting the predicted key point position. Because there are k key points, the network generates k such offset fields, solving a two-dimensional regression problem for each key point and its nearby locations.
Referring to fig. 2, the correction step is implemented as follows: in the first step, after the network generates the thermodynamic diagram and the offset vectors, the location of the thermodynamic diagram maximum is taken; in the second step, the offset vector is added to that maximum location to obtain the key point position finally mapped to the input image, i.e. keypoint position = heatmap position + offset vector.
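Under the assumption of one heatmap channel plus two offset channels per key point (as in fig. 3), the correction step can be sketched as:

```python
import numpy as np

def corrected_keypoint(heatmap, offset_x, offset_y, stride=4):
    """Combine one key point heatmap with its two offset channels:
    final position = heatmap argmax * stride + offset at the argmax."""
    y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)
    return (float(x * stride + offset_x[y, x]),
            float(y * stride + offset_y[y, x]))

hm = np.zeros((64, 48))
hm[10, 20] = 1.0                  # heatmap peak at (x=20, y=10)
ox = np.full((64, 48), 1.5)       # illustrative predicted x-offsets
oy = np.full((64, 48), -0.5)      # illustrative predicted y-offsets
print(corrected_keypoint(hm, ox, oy))  # (81.5, 39.5)
```

The learned offset thus restores the sub-stride precision that pure argmax decoding discards.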
S5: and determining the gesture information of the target to be identified according to the key points.
In addition, the embodiment of the application also provides the steps of model training and testing, in particular:
the classical mean square error loss function is adopted to train the thermodynamic diagram, and it is worth noting that the probability value of the region within the distance R near the key point is only calculated and lost, namely only those points near the key point are trained, so that the convergence of the network is facilitated, and the loss function is shown.
L_h(θ) = Σ_k Σ_{||x_i - l_k|| ≤ R} || ĥ_k(x_i) - h_k(x_i) ||²  (1)
where ĥ_k(x_i) is the thermodynamic diagram predicted by the network and h_k(x_i) is the Gaussian target defined above.
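The radius-restricted mean square error of equation (1) can be sketched as follows (the mask construction is an assumption consistent with the text: only positions within R of the key point contribute):

```python
import numpy as np

def heatmap_loss(pred, target, center, radius=3.0):
    """MSE between predicted and target heatmaps, computed only at
    positions within distance `radius` of the key point `center`."""
    h, w = pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

target = np.zeros((8, 8))
target[4, 4] = 1.0                 # ground-truth peak
pred = np.zeros((8, 8))            # an all-zero prediction is penalized
print(heatmap_loss(pred, target, center=(4, 4)) > 0)
```

Positions outside the mask receive no gradient, which is what eases convergence.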
For training the offset vectors, inspired by the regression of detection-box coordinates in the target detection domain, the present application employs a smooth loss function to penalize the gap between the true and predicted offsets, as shown in equation (2).
L_o(θ) = Σ_k Σ_{||x_i - l_k|| ≤ R} smooth_L1( ô_k(x_i) - o_k(x_i) )  (2)
where ô_k(x_i) is the predicted offset and o_k(x_i) the true offset.
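The smooth L1 penalty can be written explicitly as below; the transition point β = 1 is the conventional choice from the object-detection literature, not stated in the patent:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Quadratic for small residuals, linear for large ones, so that
    outlier offsets do not dominate the gradient."""
    absx = np.abs(x)
    return np.where(absx < beta, 0.5 * absx ** 2 / beta, absx - 0.5 * beta)

print(smooth_l1(np.array([0.5, 2.0])))  # [0.125 1.5 ]
```

The linear branch caps the gradient magnitude at 1, which is the robustness-to-outliers property the text refers to.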
Such a loss function makes the loss more robust to abnormal outliers, thereby better controlling the back propagation of gradients in the network. Likewise, the present application only computes the loss at those locations no more than R from the key point. After fusing the two losses, the final loss function is shown in equation (3), where λ_h and λ_o are the weights of the two losses, set in the ratio 4:1, and the optimizer used to train the model is Adam.
L(θ) = λ_h·L_h(θ) + λ_o·L_o(θ)  (3)
In addition, the present application selects three test datasets disclosed in the attitude estimation field to perform experimental measurements to further illustrate the advantages of the present application over the prior art:
the operating environment of the embodiment of the application: 6 cores, intel Xeon E5-2620 processor, 64GB memory, titan X graphics card, ubuntu 16.04 operating system.
The three datasets are: (1) MSCOCO: the MSCOCO dataset can be applied to tasks such as target detection, semantic segmentation and key point detection. This patent mainly uses the 2017 COCO dataset, in which the training set contains 118,287 pictures and the test set contains 5,000 pictures, with pictures annotated with multiple people.
(2) MPII: the MPII human posture dataset is a state-of-the-art benchmark for evaluating articulated human pose estimation. The dataset comprises about 25K images containing more than 40K people with annotated human joints. The images were collected according to a taxonomy of everyday human activities; the entire dataset covers 410 human activities and each image carries an activity label. Each image was extracted from a YouTube video. The data includes approximately 25,000 pictures and more than 40,000 annotated human keypoint instances, of which 28,000 instances are used for training the network and the remaining 12,000 samples for testing.
(3) CrowdPose: we also evaluated our method on the CrowdPose dataset, which contains 20,000 pictures and 80,000 human instances. The CrowdPose dataset is designed to improve performance in crowded situations, so that models become suitable for different scenes.
In order to evaluate the effectiveness of the algorithm, the AP and PCK performance indexes are adopted in the experiments of this embodiment: AP is used as the evaluation index on the COCO and CrowdPose datasets, and PCK on the MPII dataset. Object keypoint similarity (OKS) is used to calculate the similarity between predicted and labeled key points, and is formulated as follows:
OKS = Σ_i exp( -D_i² / (2 s² k_i²) ) δ(v_i > 0) / Σ_i δ(v_i > 0)
where D_i is the Euclidean distance between the predicted and labeled key points, s is the scale of the object, k_i is a per-keypoint constant controlling the attenuation, and v_i indicates whether the key point is visible. After a threshold s for OKS is given, the average precision over the test set can be calculated by the following formula:
AP^s = Σ_p δ(OKS_p > s) / Σ_p 1
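The OKS computation above can be sketched as follows (the per-keypoint constants k_i used here are illustrative, not the official COCO values):

```python
import numpy as np

def oks(pred, gt, visible, s, k):
    """Object keypoint similarity between predicted and ground-truth
    key points, averaged over the visible ones."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared distances D_i^2
    sim = np.exp(-d2 / (2.0 * s ** 2 * k ** 2))  # per-keypoint similarity
    return float(np.sum(sim * visible) / np.sum(visible))

gt = np.array([[10.0, 10.0], [20.0, 20.0]])
pred = gt.copy()                       # a perfect prediction
vis = np.array([1.0, 1.0])
print(oks(pred, gt, vis, s=1.0, k=np.array([0.1, 0.1])))  # 1.0
```

AP at threshold s is then just the fraction of predicted poses whose OKS exceeds s.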
another important evaluation criterion for keypoints is PCK, which is an indicator of the proportion of all predicted keypoints that fall a normalized distance around the corresponding labeled keypoint. This normalized distance is often related to the longest distance of the human torso in the picture. Usually denoted pck @ sigma, where sigma is a fraction between intervals 0,1, and multiplying sigma by the longest trunk distance yields the normalized distance in the evaluation index, by the following method:
PCK_k@σ = (1/N) Σ_i δ( d_{ki} ≤ σ·T_i )
where N represents the total number of samples and k represents the kth human keypoint, so the overall PCK is:
PCK@σ = (1/K) Σ_k PCK_k@σ
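The PCK@σ computation can be sketched as follows, with the torso-length normalization as described (the sample values are illustrative):

```python
import numpy as np

def pck(pred, gt, torso_len, sigma=0.5):
    """Fraction of predicted key points whose distance to the ground
    truth is within sigma * torso_len (PCK@sigma)."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= sigma * torso_len))

gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pred = np.array([[1.0, 0.0], [10.0, 0.0], [0.0, 30.0]])
print(pck(pred, gt, torso_len=10.0))  # 2 of 3 key points within 5 px
```

PCKh, used on MPII, follows the same formula with the head length in place of the torso length.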
the evaluation index used on the MPII dataset is PCKh, unlike PCK, which replaces the longest trunk distance used in normalizing distance with the longest head distance.
The present embodiments compare AP and PCK values with other algorithms on the MSCOCO, MPII and CrowdPose datasets respectively. These methods include a simple baseline for human pose estimation and tracking (Simple baselines for human pose estimation and tracking, SB), accurate multi-person pose estimation in natural environments (Towards accurate multi-person pose estimation in the wild, G-RMI), a cascaded pyramid network for multi-person pose estimation (Cascaded pyramid network for multi-person pose estimation, CPN), the stacked hourglass network for human pose estimation (Stacked Hourglass networks for human pose estimation, Hourglass), learning feature pyramids for human pose estimation (Learning feature pyramids for human pose estimation, FPN), quantized densely connected u-nets (Quantized densely connected u-nets for efficient landmark localization, Dense U-Net), and human pose estimation via convolutional part heatmap regression (Human pose estimation via convolutional part heatmap regression, vcph). The algorithm of the present application is referred to as HOPE.
FIG. 5 is a graph showing the results of the present application and the other algorithms on the MSCOCO dataset; FIG. 6 shows the results on the MPII dataset; FIG. 7 shows the results on the CrowdPose dataset.
As can be seen from fig. 5, 6 and 7, the AP and PCK values of the present application are superior to those of the other algorithms, in both sparse and crowded scenes. In addition, fig. 8 shows the detection results of HOPE on the MSCOCO dataset, fig. 9 the results on the MPII dataset, and fig. 10 the results on the CrowdPose dataset. As can be seen from figs. 8-10, the detected key points correspond closely to the true poses, which further illustrates that this patent achieves a good effect in key point detection.
The embodiment of the invention also provides a gesture recognition system based on the thermodynamic diagram and the offset vector, which comprises:
the acquisition module is used for acquiring the target image to be identified;
the feature extraction module is used for extracting features of the target image to be identified;
the key point predicting module is used for predicting the position of the key point according to the extracted characteristics;
the key point correction module is used for correcting the predicted key points and determining the final positions of the key points; and
the gesture determining module is used for determining gesture information of the target to be recognized according to the key points.
The embodiment of the invention also provides a gesture recognition system based on the thermodynamic diagram and the offset vector, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method.
Embodiments of the present invention also provide a storage medium having stored therein processor-executable instructions which, when executed by a processor, are for performing the method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (9)

1. A gesture recognition method based on thermodynamic diagrams and offset vectors, characterized by comprising the steps of:
acquiring a target image to be identified;
extracting the characteristics of the target image to be identified;
predicting the positions of key points according to the extracted features;
correcting the predicted key points, and determining the final positions of the key points; and
determining the gesture information of the target to be identified according to the key points;
the method further comprises the step of evaluating the validity of the gesture recognition method, the step comprising:
performing performance evaluation by adopting AP and PCK performance evaluation indexes, wherein the AP is used as an evaluation index in COCO and CROWPOSE data sets, and the PCK is used as an evaluation index in MPII data sets;
calculating the similarity between the predicted key points and the marked key points through the object key point similarity OKS, wherein the calculation formula of the similarity is as follows:
$$\mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / 2 s^2 k_i^2\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
wherein $d_i$ represents the Euclidean distance between the predicted and the labeled keypoint, $s$ is the scale of the object, $k_i$ is a keypoint-specific constant for controlling attenuation, $v_i$ indicates whether the keypoint is visible, and $\delta(v_i > 0)$ restricts both sums to the visible keypoints;
given a threshold $s$ on OKS, the average precision AP@$s$ on the test set is calculated by the following formula:
$$\mathrm{AP}@s = \frac{\sum_p \delta(\mathrm{OKS}_p > s)}{\sum_p 1}$$
where $p$ indexes the predicted poses on the test set;
the PCK index represents the proportion of all predicted key points falling into a certain standardized distance around the corresponding marked key points; this normalized distance is related to the longest distance of the torso in the picture, denoted pck @ sigma, where sigma is a fraction between intervals 0,1, and the normalized distance in the evaluation index is obtained by multiplying sigma by the longest distance of the torso, as follows:
$$\mathrm{PCK}_\sigma^k = \frac{1}{N}\sum_{i=1}^{N}\delta\!\left(\frac{\left\|\hat{p}_i^k - p_i^k\right\|_2}{d_i} \le \sigma\right)$$
where $N$ represents the total number of samples, $k$ denotes the $k$-th human keypoint, $\hat{p}_i^k$ represents the predicted keypoint, $p_i^k$ represents the true keypoint, and $d_i$ represents the true torso diameter (the longest torso distance) of sample $i$; the overall PCK is:
$$\mathrm{PCK}_\sigma = \frac{1}{K}\sum_{k=1}^{K}\mathrm{PCK}_\sigma^k$$
the evaluation index used on the MPII dataset was PCKh, which differs from PCK,
PCKh replaces the longest trunk distance used when normalizing distance with the longest head distance.
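The evaluation indexes defined in claim 1 can be sketched in Python (NumPy). The array shapes and helper names below are illustrative, not from the patent:

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """Object Keypoint Similarity: exp(-d_i^2 / (2 s^2 k_i^2)) averaged
    over the visible keypoints, i.e. those with delta(v_i > 0)."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared distances d_i^2, shape (K,)
    sim = np.exp(-d2 / (2.0 * s ** 2 * k ** 2))  # per-keypoint similarity
    m = vis > 0                                  # visible-keypoint mask
    return sim[m].sum() / m.sum()

def ap_at(oks_values, thr):
    """AP@thr: share of test poses whose OKS exceeds the threshold."""
    return float(np.mean(np.asarray(oks_values) > thr))

def pck(pred, gt, norm_len, sigma):
    """PCK@sigma: share of predicted keypoints within sigma * norm_len of
    the labeled ones; pass the longest torso distance per sample for PCK,
    or the longest head distance for PCKh."""
    d = np.linalg.norm(pred - gt, axis=2)        # (N, K) keypoint distances
    ok = d <= sigma * norm_len[:, None]          # inside the normalized radius
    return float(ok.mean())                      # overall PCK
```

As a sanity check, a perfect prediction (`pred == gt`) makes both OKS and PCK evaluate to 1.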
2. The method for recognizing a gesture based on thermodynamic diagrams and offset vectors according to claim 1, wherein the step of extracting features of the target image to be recognized comprises:
cutting the obtained target image to be identified;
inputting each image obtained by cutting into a residual error network; and
and performing coding processing through the residual error network to obtain a first characteristic diagram.
3. The thermodynamic diagram and offset vector based gesture recognition method of claim 2, wherein the residual network comprises five sets of convolution layers;
in addition, the step of obtaining the feature map through the encoding processing of the residual network comprises the following steps:
performing dimension-changing processing on each channel of the feature map through a convolution kernel, wherein the dimension-changing processing comprises dimension increasing and dimension decreasing;
normalizing each channel; and
and carrying out nonlinear activation processing on the normalization processing result.
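One such step from claim 3 — a 1×1 convolution that changes the channel dimension, per-channel normalization, then a nonlinear (ReLU) activation — can be sketched in NumPy. The shapes and the batch-free normalization are illustrative simplifications, not the patent's implementation:

```python
import numpy as np

def conv1x1_bn_relu(x, w, eps=1e-5):
    """1x1 conv (channel dimension change) + per-channel normalization
    + nonlinear activation, on a single feature map.

    x : (C_in, H, W) feature map
    w : (C_out, C_in) 1x1 convolution weights
    """
    y = np.einsum('oc,chw->ohw', w, x)            # 1x1 conv = per-pixel matmul
    mean = y.mean(axis=(1, 2), keepdims=True)     # per-channel statistics
    var = y.var(axis=(1, 2), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)           # normalize each channel
    return np.maximum(y, 0.0)                     # nonlinear activation (ReLU)
```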
4. The gesture recognition method based on thermodynamic diagrams and offset vectors according to claim 2, wherein the step of feature extraction of the target image to be recognized further comprises a decoding step including:
inputting the obtained first characteristic diagram into a deconvolution structure;
decoding the first feature map through a deconvolution structure; and
and acquiring characteristic response graphs of all the channels.
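A deconvolution (transposed-convolution) layer as used in claim 4 doubles spatial resolution. A toy single-channel stand-in — zero insertion followed by a 3×3 convolution — illustrates the mechanism; it is a sketch, not the patent's decoder:

```python
import numpy as np

def deconv2x(x, k):
    """Stride-2 transposed convolution on one channel: insert zeros
    between pixels, then convolve with a 3x3 kernel to interpolate.

    x : (H, W) one channel of the first feature map
    k : (3, 3) kernel
    """
    H, W = x.shape
    up = np.zeros((2 * H, 2 * W))
    up[::2, ::2] = x                              # zero insertion (upsampling)
    pad = np.pad(up, 1)
    out = np.zeros_like(up)
    for i in range(2 * H):                        # plain 3x3 convolution
        for j in range(2 * W):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out
```

With a bilinear-style kernel such as `[[.25,.5,.25],[.5,1,.5],[.25,.5,.25]]`, this reproduces bilinear upsampling; a learned kernel lets the network choose its own interpolation.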
5. The thermodynamic diagram and offset vector based gesture recognition method of claim 4, wherein predicting key point locations based on extracted features comprises:
obtaining thermodynamic diagrams from the output results of the channels;
calculating the maximum value of each thermodynamic diagram to obtain the position information of each key point on the thermodynamic diagram; and
and mapping the position information of the key points to the target image to be identified according to the size relation between the target image to be identified and the thermodynamic diagram.
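The decoding described in claim 5 — per-channel maxima of the thermodynamic diagrams mapped back by the size ratio between the target image and the thermodynamic diagram — can be sketched as follows (function name and shapes are illustrative):

```python
import numpy as np

def decode_heatmaps(heatmaps, img_h, img_w):
    """Locate each keypoint at its heatmap maximum and map it to image
    coordinates.

    heatmaps : (K, h, w), one heatmap per keypoint
    returns  : (K, 2) array of (x, y) positions in image space
    """
    K, h, w = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)  # index of each maximum
    ys, xs = np.unravel_index(flat, (h, w))
    scale_x, scale_y = img_w / w, img_h / h        # image / heatmap size ratio
    return np.stack([xs * scale_x, ys * scale_y], axis=1)
```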
6. The method for recognizing a gesture based on thermodynamic diagrams and offset vectors according to claim 4, wherein the step of correcting the predicted keypoints to determine final positions of the keypoints comprises the steps of:
determining an offset vector of the key point according to the output result of each channel; and
and adding the offset vector to the position of the maximum value of the thermodynamic diagram to determine the final position of the key point.
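The correction in claim 6 reads the offset vector at the heatmap maximum and adds it to the coarse position. A minimal sketch, with offsets stored as separate x/y channels (an assumed layout, not specified by the patent):

```python
import numpy as np

def refine_keypoints(heatmaps, offset_x, offset_y):
    """For each keypoint channel, locate the heatmap maximum, read the
    offset predicted at that cell, and add it to the coarse position.

    heatmaps           : (K, h, w)
    offset_x, offset_y : (K, h, w) per-cell offset components
    """
    K, h, w = heatmaps.shape
    idx = heatmaps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.unravel_index(idx, (h, w))
    ks = np.arange(K)
    fx = xs + offset_x[ks, ys, xs]                # coarse x + offset
    fy = ys + offset_y[ks, ys, xs]                # coarse y + offset
    return np.stack([fx, fy], axis=1)             # final sub-cell positions
```

The offsets recover the sub-cell precision that the discrete heatmap grid discards.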
7. The thermodynamic diagram and offset vector based gesture recognition method of claim 6, further comprising the steps of:
training a thermodynamic diagram by adopting a mean square error loss function; and
when training the offset vector, a smooth loss function is employed to handle the gap between the true offset and the predicted offset.
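The two training losses in claim 7 can be sketched in NumPy. The smooth loss is written here as the common smooth L1 (quadratic for small errors, linear for large ones) — an assumption, since the claim does not name the exact form:

```python
import numpy as np

def mse_loss(pred_heat, gt_heat):
    """Mean squared error used to train the thermodynamic diagrams."""
    return np.mean((pred_heat - gt_heat) ** 2)

def smooth_l1(pred_off, gt_off, beta=1.0):
    """Smooth loss on the gap between true and predicted offsets:
    0.5 d^2 / beta for |d| < beta, |d| - 0.5 beta otherwise."""
    diff = np.abs(pred_off - gt_off)
    return np.mean(np.where(diff < beta,
                            0.5 * diff ** 2 / beta,
                            diff - 0.5 * beta))
```

The linear tail keeps large offset errors from dominating the gradient the way a pure MSE would.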
8. A thermodynamic diagram and offset vector based gesture recognition system comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
9. A storage medium having stored therein processor executable instructions which, when executed by a processor, are for performing the method of any of claims 1-7.
CN202010006031.3A 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector Active CN111191622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010006031.3A CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010006031.3A CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Publications (2)

Publication Number Publication Date
CN111191622A CN111191622A (en) 2020-05-22
CN111191622B true CN111191622B (en) 2023-05-26

Family

ID=70708632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010006031.3A Active CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Country Status (1)

Country Link
CN (1) CN111191622B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680623B (en) * 2020-06-05 2023-04-21 北京百度网讯科技有限公司 Gesture conversion method and device, electronic equipment and storage medium
CN111695519B (en) 2020-06-12 2023-08-08 北京百度网讯科技有限公司 Method, device, equipment and storage medium for positioning key point
CN111814593A (en) * 2020-06-19 2020-10-23 浙江大华技术股份有限公司 Traffic scene analysis method and device, and storage medium
CN111914639A (en) * 2020-06-30 2020-11-10 吴�荣 Driving action recognition method of lightweight convolution space-time simple cycle unit model
CN111860276B (en) * 2020-07-14 2023-04-11 咪咕文化科技有限公司 Human body key point detection method, device, network equipment and storage medium
CN111860300A (en) * 2020-07-17 2020-10-30 广州视源电子科技股份有限公司 Key point detection method and device, terminal equipment and storage medium
CN111985556A (en) * 2020-08-19 2020-11-24 南京地平线机器人技术有限公司 Key point identification model generation method and key point identification method
CN111967406A (en) * 2020-08-20 2020-11-20 高新兴科技集团股份有限公司 Method, system, equipment and storage medium for generating human body key point detection model
CN112132131B (en) * 2020-09-22 2024-05-03 深兰科技(上海)有限公司 Measuring cylinder liquid level identification method and device
CN112417972A (en) * 2020-10-23 2021-02-26 奥比中光科技集团股份有限公司 Heat map decoding method, human body joint point estimation method and system
CN112446302B (en) * 2020-11-05 2023-09-19 杭州易现先进科技有限公司 Human body posture detection method, system, electronic equipment and storage medium
CN112101490B (en) * 2020-11-20 2021-03-02 支付宝(杭州)信息技术有限公司 Thermodynamic diagram conversion model training method and device
CN112528858A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Training method, device, equipment, medium and product of human body posture estimation model
CN112651316B (en) * 2020-12-18 2022-07-15 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112597955B (en) * 2020-12-30 2023-06-02 华侨大学 Single-stage multi-person gesture estimation method based on feature pyramid network
CN112837336B (en) * 2021-02-23 2022-02-22 浙大宁波理工学院 Method and system for estimating and acquiring room layout based on heat map correction of key points
CN112926648B (en) * 2021-02-24 2021-11-16 北京优创新港科技股份有限公司 Method and device for detecting abnormality of tobacco leaf tip in tobacco leaf baking process
CN113076891B (en) * 2021-04-09 2023-08-22 华南理工大学 Human body posture prediction method and system based on improved high-resolution network
CN113128436B (en) * 2021-04-27 2022-04-01 北京百度网讯科技有限公司 Method and device for detecting key points
CN113159198A (en) * 2021-04-27 2021-07-23 上海芯物科技有限公司 Target detection method, device, equipment and storage medium
CN113011402B (en) * 2021-04-30 2023-04-25 中国科学院自动化研究所 Primate gesture estimation system and method based on convolutional neural network
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113537234A (en) * 2021-06-10 2021-10-22 浙江大华技术股份有限公司 Quantity counting method and device, electronic device and computer equipment
CN114463534A (en) * 2021-12-28 2022-05-10 佳都科技集团股份有限公司 Target key point detection method, device, equipment and storage medium
CN114359974B (en) * 2022-03-08 2022-06-07 广东履安实业有限公司 Human body posture detection method and device and storage medium
CN114863237B (en) * 2022-03-25 2023-07-14 中国人民解放军国防科技大学 Method and system for recognizing swimming gesture
CN115272992B (en) * 2022-09-30 2023-01-03 松立控股集团股份有限公司 Vehicle attitude estimation method
CN115331153B (en) * 2022-10-12 2022-12-23 山东省第二人民医院(山东省耳鼻喉医院、山东省耳鼻喉研究所) Posture monitoring method for assisting vestibule rehabilitation training
CN116645699B (en) * 2023-07-27 2023-09-29 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium
CN117437433B (en) * 2023-12-07 2024-03-19 苏州铸正机器人有限公司 Sub-pixel level key point detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033946A (en) * 2018-06-08 2018-12-18 东南大学 Merge the estimation method of human posture of directional diagram
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033946A (en) * 2018-06-08 2018-12-18 东南大学 Merge the estimation method of human posture of directional diagram
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Tangbo; Yang Rui; Wang Wenwei; He Chu. Research on driver hand motion detection method based on pose estimation. Signal Processing. 2019, (12), pp. 136-143. *

Also Published As

Publication number Publication date
CN111191622A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN111191622B (en) Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector
He et al. Epipolar transformers
Moreno-Noguer 3d human pose estimation from a single image via distance matrix regression
CN108052896B (en) Human body behavior identification method based on convolutional neural network and support vector machine
CN112597941B (en) Face recognition method and device and electronic equipment
JP6639123B2 (en) Image processing apparatus, image processing method, and program
CN105844669B (en) A kind of video object method for real time tracking based on local Hash feature
Zhang et al. Actively learning human gaze shifting paths for semantics-aware photo cropping
Zhu et al. Convolutional relation network for skeleton-based action recognition
US9158963B2 (en) Fitting contours to features
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
Holmquist et al. Diffpose: Multi-hypothesis human pose estimation using diffusion models
CN112883896B (en) Micro-expression detection method based on BERT network
US20140099031A1 (en) Adjusting a Contour by a Shape Model
Yan et al. Monocular depth estimation with guidance of surface normal map
Etezadifar et al. A new sample consensus based on sparse coding for improved matching of SIFT features on remote sensing images
Gouidis et al. Accurate hand keypoint localization on mobile devices
CN113229807A (en) Human body rehabilitation evaluation device, method, electronic device and storage medium
Nguyen et al. Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Liu et al. Mean shift fusion color histogram algorithm for nonrigid complex target tracking in sports video
Zhang et al. Lightweight network for small target fall detection based on feature fusion and dynamic convolution
Kaviani et al. Semi-Supervised 3D hand shape and pose estimation with label propagation
CN113343762B (en) Human body posture estimation grouping model training method, posture estimation method and device
Zhang et al. Animal Pose Estimation Algorithm Based on the Lightweight Stacked Hourglass Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant