CN116704554A - Method, equipment and medium for estimating and identifying hand gesture based on deep learning - Google Patents

Method, equipment and medium for estimating and identifying hand gesture based on deep learning Download PDF

Info

Publication number
CN116704554A
Authority
CN
China
Prior art keywords
hand
feature
gesture recognition
network
target image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310739958.1A
Other languages
Chinese (zh)
Inventor
尹青山
冯落落
高岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202310739958.1A priority Critical patent/CN116704554A/en
Publication of CN116704554A publication Critical patent/CN116704554A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a method, device and medium for hand pose estimation and gesture recognition based on deep learning, wherein the method comprises the following steps: preprocessing a target image to obtain an intermediate image; inputting the intermediate image into a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image. By enhancing the hand features with an improved Transformer, the accuracy of gesture recognition can be improved. The gesture recognition task and the hand pose estimation task are solved simultaneously, and the two-dimensional joint point heat map output by the pose estimation task is used as an input to the gesture recognition task, which improves the gesture recognition capability of the network.

Description

Method, equipment and medium for estimating and identifying hand gesture based on deep learning
Technical Field
The application relates to the field of visual recognition, in particular to a hand gesture estimation and recognition method, device and medium based on deep learning.
Background
Gesture recognition refers to extracting information such as hand and body posture from an image or video stream through computer vision techniques and recognizing the gesture behavior of a user. Early gesture recognition methods relied mainly on manually extracted features such as hand shape and motion, requiring hand-designed feature extraction algorithms and rule-based classification; because manually designed features may not be comprehensive enough and cannot adapt to complex scenes, their accuracy and robustness are poor.
Gesture pose estimation refers to detecting and estimating the pose of a human hand from an image or video through computer vision techniques, relying mainly on depth images or RGB images acquired from one or more sensors and processed by machine learning or computer vision algorithms to obtain hand pose information. Early pose estimation methods were mainly sensor-based and have certain limitations, such as limited accuracy and poor comfort. These sensor devices must be in contact with or in close proximity to the human body and may affect the comfort and natural behavior of the user, causing discomfort during prolonged use. In addition, the sensor devices need to be calibrated to obtain accurate pose estimation results, which increases the difficulty of use and the time cost and is unfriendly to ordinary users.
Gesture recognition and gesture pose estimation are two important problems in the field of computer vision. In hand-object interaction scenes, the hand is often occluded by an object, so achieving gesture recognition and hand pose estimation from monocular RGB images remains a great challenge.
Disclosure of Invention
In order to solve the above problems, the present application proposes a method, device and medium for estimating and identifying hand gestures based on deep learning, wherein the method comprises:
acquiring a target image, and preprocessing the target image to obtain an intermediate image; inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
In one example, the preprocessing the target image specifically includes: extracting a region-of-interest picture from the target image according to a preset feature of interest; and cropping the region-of-interest picture to obtain the intermediate image with a first preset size.
In one example, the preset feature extraction module consists of a feature extractor with a residual neural network and a RoIAlign algorithm; the feature extractor adopts a ResNet-50 network with a residual connection structure. Inputting the intermediate image to the preset feature extraction module to obtain the first hand feature and the object feature corresponding to the target image specifically includes the following steps: feeding the intermediate image with the first preset size into the feature extractor to obtain an intermediate feature map with a second preset size; and processing the intermediate feature map with the RoIAlign algorithm to extract the feature maps of the hand and the object respectively, so as to obtain the first hand feature and the object feature at a third preset size.
In one example, the performing, by the hand-object interaction module, context reasoning on the first hand feature and the object feature to enhance the first hand feature and obtain a second hand feature includes: converting the first hand feature into a key embedding through a preset first parameter matrix, and converting the object feature into a query embedding and a value embedding through a second parameter matrix and a third parameter matrix; improving the self-attention mechanism in the Transformer model to improve the feature characterization capability of the improved Transformer model; and performing context reasoning on the first hand feature and the object feature through the improved Transformer model to enhance the first hand feature and obtain the second hand feature.
In one example, the performing, by the improved Transformer model, context reasoning on the first hand feature and the object feature to enhance the first hand feature and obtain a second hand feature specifically includes: using a k×k group convolution to context-encode all adjacent key embeddings within a k×k grid in space, so that the encoded key embeddings carry context information, and encoding the value embedding through a 1×1 convolution; concatenating the encoded key embedding with the query embedding, and then generating an attention matrix through two 1×1 convolutions and a softmax activation function; capturing local features of the first hand feature by means of a depthwise separable convolution, and then fusing the local features with the output of the attention module to obtain a key embedded feature; feeding the key embedded feature into a feedforward network consisting of a multi-layer perceptron and layer normalization; and fusing the output of the feedforward network with the key embedded feature to obtain the second hand feature.
In one example, the inputting the second hand feature into the multi-task joint learning module to obtain the hand pose estimation result and the gesture recognition result corresponding to the target image specifically includes: inputting the second hand feature into a two-dimensional feature point detection network and a depth regression network in the multi-task joint learning module to obtain a two-dimensional joint point heat map and the hand pose estimation result; and inputting the two-dimensional joint point heat map and the second hand feature into a gesture recognition network to obtain the gesture recognition result.
In one example, the inputting the second hand feature into the two-dimensional joint point positioning network and the depth regression network in the multi-task joint learning module to obtain the two-dimensional joint point heat map and the hand pose estimation result specifically includes: inputting the second hand feature into a stacked hourglass network in the two-dimensional joint point positioning network to determine the two-dimensional joint point heat map, and taking the distance between the predicted positions p_j and the true positions of the K hand joints as the loss function L_2D of the two-dimensional joint point positioning network; inputting the two-dimensional joint point heat map and the second hand feature into the depth regression network to obtain MANO-parameterized hand pose parameters; and determining the hand pose estimation result from the hand pose parameters.
In one example, the inputting the two-dimensional joint point heat map and the second hand feature into a gesture recognition network to obtain the gesture recognition result specifically includes: combining the two-dimensional joint point heat map and the second hand feature through a 1×1 convolution and inputting the result into the gesture recognition network; performing convolution operations along the time axis through temporal convolution in the gesture recognition network to extract dynamic features in the gesture sequence; and determining the gesture recognition result according to the dynamic features.
The application also provides a device for estimating and identifying hand gestures based on deep learning, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform: acquiring a target image, and preprocessing the target image to obtain an intermediate image; inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
The present application also provides a non-volatile computer storage medium storing computer-executable instructions configured to: acquire a target image, and preprocess the target image to obtain an intermediate image; input the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; perform context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and input the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
The method provided by the application has the following beneficial effects: by utilizing the improved Transformer to enhance the hand features, the hand can be positioned more accurately, thereby improving the accuracy of gesture recognition. The gesture recognition task and the hand pose estimation task can be solved simultaneously, two computer vision tasks are performed at the same time, and the two-dimensional joint point heat map output by the pose estimation task is used as an input to the gesture recognition task, which improves the gesture recognition capability of the network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a gesture recognition method according to the prior art in an embodiment of the present application;
FIG. 2 is a flow chart of a method for estimating and identifying hand gestures based on deep learning according to an embodiment of the application;
FIG. 3 is a schematic process diagram of a method for estimating and recognizing hand gestures based on deep learning according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a hand interaction model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a conventional Transformer model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a device for estimating and identifying hand gestures based on deep learning according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in fig. 1, the prior art consists of three main stages when performing gesture recognition: image processing, feature extraction and feature fusion. In the image processing stage, the input image data is first cropped to obtain a hand region and a human body posture region. Then, features are extracted from the two regions through several three-dimensional convolution layers and max pooling layers, and hand region features and human body posture region features are obtained through fully connected layers. Finally, in the feature fusion stage, the two features are fused to output a gesture recognition result. However, the prior art is limited to gesture recognition and cannot complete the hand pose estimation task. Hand pose estimation is of great value in improving the experience and efficiency of gesture interaction, limb interaction and intelligent human-computer interaction. Furthermore, in a hand-object interaction scenario, the hand is often occluded by objects, which makes gesture recognition from video streams and images acquired in real time by one or more cameras relatively ineffective.
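For concreteness, the following is a minimal sketch of such a three-stage pipeline (cropped regions, per-region 3D convolution with max pooling, fully connected feature extraction, then fusion). It is an illustration only; the layer counts, channel widths and pooling choices are assumptions rather than the actual prior-art network.

```python
# Hedged sketch of the fig. 1 prior-art pipeline; all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PriorArtGestureRecognizer(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        def branch() -> nn.Sequential:      # one branch per cropped region (hand / body pose)
            return nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 128))  # FC feature
        self.hand_branch, self.body_branch = branch(), branch()
        self.fusion = nn.Linear(256, num_classes)   # fuse both features, output gesture class

    def forward(self, hand_clip: torch.Tensor, body_clip: torch.Tensor) -> torch.Tensor:
        # clips: B x 3 x T x H x W video crops of the hand region and the body-pose region
        fused = torch.cat([self.hand_branch(hand_clip), self.body_branch(body_clip)], dim=1)
        return self.fusion(fused)
```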
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 2 is a flow chart of a method for estimating and identifying hand gestures based on deep learning according to one or more embodiments of the present application. The method can be applied to gesture recognition and hand pose estimation, especially when part of the hand is occluded by an object. The process may be performed by a relevant computing device, and some input parameters or intermediate results in the process allow manual intervention and adjustment to help improve accuracy.
The method according to the embodiments of the present application may be executed by a terminal device or a server, which is not particularly limited in the present application. For ease of understanding and description, the following embodiments are described in detail by taking a server as an example.
It should be noted that the server may be a single device, or may be a system formed by a plurality of devices, that is, a distributed server, which is not particularly limited in the present application.
As shown in fig. 2 and 3, an embodiment of the present application provides a method for estimating and identifying hand gestures based on deep learning, including:
s101: and acquiring a target image, and preprocessing the target image to obtain an intermediate image.
Firstly, a hand-related image is acquired as the target image, and the target image is then preprocessed to obtain an intermediate image, where the intermediate image is the picture that is input to the model.
The target image may be stored in a storage device of the computer device in advance, and when gesture recognition or hand pose estimation is required for the target image, the computer device may select the target image from the storage device. Of course, the computer device may acquire the target image from other external devices. For example, the target image is stored in the cloud, and when gesture recognition or hand gesture estimation is required to be performed on the target image, the computer device may acquire the target image from the cloud, and the acquisition mode of the target image is not limited in this embodiment.
In one embodiment, during preprocessing, a region-of-interest picture is first extracted from the target image according to a preset feature of interest, and the region-of-interest picture is then cropped to obtain an intermediate image with a first preset size, where the first preset size may be 512×512×3.
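As a concrete illustration of this preprocessing step, a minimal sketch is given below. It assumes the region of interest is supplied as a bounding box by some upstream detector and that the image is normalized to [0, 1]; the function name and the box values are hypothetical, not from the patent.

```python
import cv2
import numpy as np

def preprocess(target_image: np.ndarray, roi_box: tuple, size: int = 512) -> np.ndarray:
    """Crop the region of interest and resize it to the first preset size (512 x 512 x 3)."""
    x1, y1, x2, y2 = roi_box
    roi = target_image[y1:y2, x1:x2]                 # region-of-interest picture
    roi = cv2.resize(roi, (size, size))              # intermediate image, 512 x 512
    return roi.astype(np.float32) / 255.0            # simple [0, 1] normalization (assumed)

# Usage (box coordinates are hypothetical):
# intermediate = preprocess(cv2.imread("hand_object.jpg"), (80, 40, 560, 520))
```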
S102: and inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image.
In one embodiment, the feature extraction module consists of an encoder with a residual neural network and the RoIAlign algorithm; specifically, it consists of one feature extractor and one RoIAlign operation. The feature extractor adopts a ResNet-50 network with a residual connection structure; the input image of size 512×512×3 is fed into the ResNet-50 network to obtain an intermediate feature map with a second preset size. For the intermediate feature map, the feature maps of the hand and the object are extracted by the RoIAlign algorithm to obtain the first hand feature and the object feature, and the corresponding feature maps are sampled to a fixed size of 32×32×256.
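A minimal sketch of this feature extraction module follows, using a torchvision ResNet-50 backbone and RoIAlign. The 1×1 projection down to 256 channels, the box format and the spatial-scale handling are assumptions made so that the output matches the 32×32×256 size given above; they are not confirmed by the patent.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# ResNet-50 backbone without the average-pooling and classification head
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-2])
project = nn.Conv2d(2048, 256, kernel_size=1)   # assumed channel reduction to 256

def extract_features(intermediate: torch.Tensor, hand_box: torch.Tensor, obj_box: torch.Tensor):
    """intermediate: 1 x 3 x 512 x 512; hand_box / obj_box: 1 x 4 boxes in (x1, y1, x2, y2)."""
    feat = project(backbone(intermediate))                      # intermediate feature map
    boxes = [torch.cat([hand_box, obj_box], dim=0)]             # one box tensor per image
    rois = roi_align(feat, boxes, output_size=(32, 32),
                     spatial_scale=feat.shape[-1] / intermediate.shape[-1])
    return rois[0:1], rois[1:2]                                 # F_h, F_o: 1 x 256 x 32 x 32
```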
S103: and carrying out context reasoning on the first hand feature and the object feature through a hand-object interaction module so as to enhance the first hand feature and obtain a second hand feature.
In a hand-object interaction scene, the hand is usually in contact with an object, so the pose of the hand is closely related to the pose of the object, and the feature maps of the hand and the object share many similarities. Therefore, by performing context reasoning on the information in the hand and object feature maps, the hand features can be enhanced and the hand pose can be located and recognized more accurately. The hand-object feature interaction module can effectively improve the accuracy of gesture recognition and pose estimation, thereby providing a more reliable solution for computer vision applications in hand-object interaction scenes. Accordingly, the hand features and the object features are simultaneously input to the hand-object interaction module with a Transformer structure to obtain the enhanced hand features.
In one embodiment, the structure of the hand-object interaction module is shown in fig. 4, where we use the hand feature as the key and the object feature as the query and the value. Using the three parameter matrices W_k, W_q and W_v, we convert the hand feature map into a key embedding (Key) and the object feature map into a query embedding (Query) and a value embedding (Value). The conventional Transformer uses self-attention to obtain an attention matrix formed by the query and the key at each spatial location, but this leaves the rich context information between adjacent keys underutilized. Since we treat the hand features and the object features as keys and queries respectively, the hand features need to be enhanced by learning the context information of the hand and the object. Therefore, we improve the self-attention mechanism in the Transformer so that it can fully learn the context information.
In the conventional self-attention mechanism shown in fig. 5, all query-key pairs are learned independently without considering context information, which limits the ability of the self-attention mechanism to learn visual representations on the feature map.
To solve this problem, the self-attention in the Transformer is improved so that the context information can be fully utilized, the characterization capability of the features is improved, and the hand feature map can be better enhanced.
As shown in fig. 4, the first hand feature extracted by the RoIAlign algorithm serves as the key. Unlike the conventional self-attention mechanism, which encodes each key with a 1×1 convolution, we use a k×k group convolution to spatially encode all neighboring keys within a k×k grid, thereby exploiting the rich context information between neighboring keys. Meanwhile, we define the object features as the query and the value, and encode the value through a 1×1 convolution.
K = F_h W_k,  Q = F_o W_q,  V = F_o W_v
where K is the key embedding, F_h is the first hand feature, W_k is the first parameter matrix, Q is the query embedding, F_o is the object feature, W_q is the second parameter matrix, V is the value embedding, and W_v is the third parameter matrix. The keys carrying the context information and the query are concatenated, and an attention matrix is then generated through two 1×1 convolutions and a softmax activation function:
A = Softmax([K, Q] W_α) V
where A is the attention output and W_α is a parameter matrix.
At the same time, for the first hand feature of the original input, we use a depthwise separable convolution to better capture local features, which are then fused with the output of the attention module to obtain the enhanced key feature K′.
K′ = W_1[DW(W_1 F_h)] + A
where W_1[DW(W_1 F_h)] is the local feature and DW denotes the depthwise separable convolution.
The resulting feature K′ is fed into a feedforward network consisting of an MLP and layer normalization (LN). Finally, the feedforward network output is fused with the feature K′ to obtain the enhanced hand feature map F′_h, i.e., the second hand feature.
F′_h = K′ + MLP(LN(K′))
In this way, in the hand-object interaction module, the hand features and the object features are respectively used as keys and queries, and the hand features are enhanced through context coding and attention mechanisms, so that the hand gestures are more accurately positioned and recognized.
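To make the data flow above concrete, here is a hedged sketch of such an interaction block. The k×k group convolution over keys, the 1×1 value encoding, the [K, Q] concatenation followed by two 1×1 convolutions and a softmax, the depthwise-separable local branch, and the MLP + layer-normalization feed-forward follow the description; the channel widths, k = 3, the group count, and the way the softmax-normalized attention is applied to V (element-wise weighting here) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class HandObjectInteraction(nn.Module):
    def __init__(self, dim: int = 256, k: int = 3):
        super().__init__()
        self.key_ctx = nn.Conv2d(dim, dim, k, padding=k // 2, groups=4)   # contextual key encoding
        self.val_enc = nn.Conv2d(dim, dim, 1)                             # value encoding
        self.attn = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.ReLU(),
                                  nn.Conv2d(dim, dim, 1))                 # two 1x1 convs on [K, Q]
        self.local = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                   nn.Conv2d(dim, dim, 3, padding=1, groups=dim))  # depthwise branch
        self.proj = nn.Conv2d(dim, dim, 1)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_h: torch.Tensor, f_o: torch.Tensor) -> torch.Tensor:
        K = self.key_ctx(f_h)                                   # hand features -> contextual keys
        Q, V = f_o, self.val_enc(f_o)                           # object features -> query / value
        A = torch.softmax(self.attn(torch.cat([K, Q], dim=1)), dim=1) * V   # attention output
        K2 = self.proj(self.local(f_h)) + A                     # K' = W1[DW(W1 F_h)] + A
        x = K2.flatten(2).transpose(1, 2)                       # B x HW x C for LayerNorm / MLP
        x = x + self.mlp(self.norm(x))                          # F'_h = K' + MLP(LN(K'))
        return x.transpose(1, 2).reshape_as(K2)                 # enhanced hand feature map
```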
S104: and inputting the second hand features into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
In one embodiment, obtaining the hand pose estimation result and the gesture recognition result requires inputting the second hand feature into the two-dimensional feature point detection network and the depth regression network in the multi-task joint learning module to obtain the two-dimensional joint point heat map and the hand pose estimation result, and inputting the two-dimensional joint point heat map and the second hand feature into the gesture recognition network to obtain the gesture recognition result.
Specifically, the hand pose estimation task consists of two parts: a two-dimensional feature point detection network and a depth regression network. The two-dimensional feature point detection network uses a stacked hourglass network (Stacked Hourglass Network) to locate the two-dimensional hand joint points; its input is the second hand feature map output by the hand-object interaction module, and its output is a two-dimensional heat map for each joint with a resolution of 32×32. Through multi-level feature extraction and keypoint regression, the stacked hourglass network achieves high joint point localization accuracy and good robustness, adapting well to complex situations such as varied hand poses, deformation and occlusion, thereby improving the stability of joint point localization. Meanwhile, techniques such as multi-scale feature fusion and residual connections further improve the accuracy and stability of joint point localization. The loss function L_2D of the two-dimensional joint point positioning network measures the distance between the predicted position p_j and the true position of each of the K 2D joints.
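Since the exact formula for L_2D is not reproduced above, the sketch below illustrates only one plausible reading: a mean squared L2 distance between the K predicted joint positions and their ground truth, with an argmax decoding of the 32×32 heatmaps. Both choices are assumptions.

```python
import torch

def loss_2d(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    """pred_joints, gt_joints: B x K x 2 pixel coordinates (K = number of joints)."""
    return ((pred_joints - gt_joints) ** 2).sum(dim=-1).mean()   # assumed squared-distance form

def heatmap_to_joints(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: B x K x 32 x 32 -> B x K x 2 (x, y) coordinates via a simple argmax."""
    b, k, h, w = heatmaps.shape
    idx = heatmaps.flatten(2).argmax(dim=-1)          # B x K flat indices of the heatmap peaks
    return torch.stack([idx % w, idx // w], dim=-1).float()
```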
The depth regression network consists of four CNN layers and three fully connected (FC) layers. Its inputs are the second hand feature map output by the hand-object interaction module and the two-dimensional joint point heat map output by the joint point positioning network, and its output is the set of parameters of the parameterized MANO model. The MANO parameterization decomposes the hand into two parts: a shape parameter β and a pose parameter θ. The shape parameter describes the static shape of the hand, and the pose parameter describes the dynamic pose of the hand. MANO defines the hand model through the shape parameter β and the pose parameter θ as follows:
M(β, θ) = W(T_p(β, θ), J(β), θ, ω)
where W is the Linear Blend Skinning (LBS) function, T_p is the initial pose of the hand model, J denotes the joint coordinates of the hand model, and ω denotes the blending weights. Finally, the L2 distance between the predicted output and the ground truth (β, θ, J, ω) is computed and used as the loss function L_3D of the depth regression network.
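A minimal sketch of this depth regression loss follows, under the assumption that the predicted and ground-truth MANO-style quantities (β, θ, J, ω) are available as tensors and that their L2 distances are simply averaged; the MANO model itself is not reimplemented here.

```python
import torch

def loss_3d(pred: dict, gt: dict) -> torch.Tensor:
    """pred/gt: dicts with tensors for 'beta', 'theta', 'J', 'omega' (shapes assumed to match)."""
    keys = ("beta", "theta", "J", "omega")
    terms = [torch.norm(pred[k] - gt[k], p=2) for k in keys]   # L2 distance per quantity
    return torch.stack(terms).mean()                            # simple average (assumption)
```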
During gesture recognition, we combine the two-dimensional joint point heat map output by the hand pose estimation task and the second hand feature map through a 1×1 convolution and input the result into the gesture recognition network. Because gesture recognition is a temporal classification problem, the temporal information of the hand motion must be considered; temporal convolution can process the temporal relations in the input data and capture temporal changes and dynamic characteristics, so the gesture recognition network uses temporal convolution to perform convolution operations along the time axis, effectively extracting the dynamic features in the gesture sequence. A conventional temporal convolution convolves consecutive time frames, which may blur or lose key information at different time scales. We therefore add dilated convolution to the temporal convolution, which enlarges the receptive field while maintaining the temporal resolution and better captures key information at different time scales. The gesture recognition network outputs a predicted class y, and we use the standard classification cross entropy as its loss function L_g, defined as follows:
L_g = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})
where N is the number of samples, C is the number of categories, y_{i,c} indicates whether the true label of the i-th sample is category c (1 or 0), and p_{i,c} is the predicted probability of category c for the i-th sample.
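Below is a hedged sketch of such a gesture recognition head: the 2D heatmaps and the enhanced hand features are merged by a 1×1 convolution, dilated temporal convolutions run along the time axis, and a standard cross-entropy loss is used. The channel sizes, the number of joints, the number of temporal layers and the dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class GestureHead(nn.Module):
    def __init__(self, feat_ch: int = 256, joints: int = 21, num_classes: int = 10):
        super().__init__()
        self.fuse = nn.Conv2d(feat_ch + joints, 128, kernel_size=1)   # 1x1 fusion of F'_h and heatmaps
        self.temporal = nn.Sequential(                                 # dilated convs on the time axis
            nn.Conv1d(128, 128, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
        )
        self.cls = nn.Linear(128, num_classes)

    def forward(self, heatmaps: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # heatmaps: B x T x joints x 32 x 32, feats: B x T x feat_ch x 32 x 32
        b, t = feats.shape[:2]
        x = torch.cat([heatmaps, feats], dim=2).flatten(0, 1)          # (B*T) x C x 32 x 32
        x = self.fuse(x).mean(dim=(2, 3)).view(b, t, -1)               # spatial pooling -> B x T x 128
        x = self.temporal(x.transpose(1, 2)).mean(dim=-1)              # temporal convs + pooling
        return self.cls(x)                                             # gesture class logits

# criterion = nn.CrossEntropyLoss()   # the cross-entropy loss L_g from the text
```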
Total loss function: end-to-end training is performed with a single loss function, which combines the loss functions of the two-dimensional joint point positioning network, the depth regression network and the gesture recognition network and is defined as follows:
L = λ_1 L_2D + λ_2 L_3D + λ_3 L_g
where L is the total loss function and λ_1, λ_2 and λ_3 are preset weighting parameters.
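Putting the pieces together, here is a minimal sketch of the total objective; the λ values shown are placeholders only, since the patent does not give them.

```python
def total_loss(l_2d: float, l_3d: float, l_g: float,
               lambdas: tuple = (1.0, 1.0, 1.0)) -> float:
    """L = lambda_1 * L_2D + lambda_2 * L_3D + lambda_3 * L_g (weights are placeholders)."""
    return lambdas[0] * l_2d + lambdas[1] * l_3d + lambdas[2] * l_g
```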
As shown in fig. 6, the embodiment of the present application further provides a device for estimating and identifying hand gestures based on deep learning, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a target image, and preprocessing the target image to obtain an intermediate image; inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
The embodiment of the application also provides a nonvolatile computer storage medium, which stores computer executable instructions, wherein the computer executable instructions are configured to:
acquiring a target image, and preprocessing the target image to obtain an intermediate image; inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image; performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature; and inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the apparatus and medium embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the section of the method embodiments being relevant.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method for estimating and identifying hand gestures based on deep learning, comprising the steps of:
acquiring a target image, and preprocessing the target image to obtain an intermediate image;
inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image;
performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature;
and inputting the second hand features into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
2. The method according to claim 1, wherein the preprocessing the target image specifically comprises:
extracting a region-of-interest picture from the target image according to a preset feature of interest;
and cropping the region-of-interest picture to obtain the intermediate image with the first preset size.
3. The method of claim 1, wherein the preset feature extraction module consists of a feature extractor with a residual neural network and a RoIAlign algorithm; the feature extractor adopts a ResNet-50 network with a residual connection structure;
inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image, wherein the method specifically comprises the following steps:
transmitting the intermediate image with the first preset size into the feature extractor to obtain an intermediate feature map with the second preset size;
and processing the intermediate feature map by using a RoI Align algorithm to extract feature maps of the hand and the object in the intermediate feature map respectively so as to obtain the first hand feature and the object feature under a third preset size.
4. The method of claim 1, wherein the performing, by the hand-object interaction module, context reasoning on the first hand feature and the object feature to enhance the first hand feature to obtain a second hand feature, specifically includes:
converting the first hand feature into key embedding through a preset first parameter matrix, and converting the object feature into query embedding and value embedding through a second parameter matrix and a third parameter matrix;
improving a self-attention mechanism in the Transformer model to improve the feature characterization capability of the improved Transformer model;
and carrying out context reasoning on the first hand feature and the object feature through the improved Transformer model so as to enhance the first hand feature and obtain a second hand feature.
5. The method of claim 4, wherein the performing, by the improved Transformer model, the context reasoning on the first hand feature and the object feature to enhance the first hand feature and obtain a second hand feature comprises:
using a k×k group convolution to context-encode all adjacent key embeddings within a k×k grid in space, so that the encoded key embeddings carry context information, and encoding the value embedding through a 1×1 convolution;
concatenating the encoded key embedding with the query embedding, and then generating an attention matrix through two 1×1 convolutions and a softmax activation function;
capturing local features of the first hand feature by means of a depthwise separable convolution, and then fusing the local features with the output of the attention module to obtain a key embedded feature;
feeding the key embedded feature into a feedforward network consisting of a multi-layer perceptron and layer normalization;
and fusing the output of the feedforward network with the key embedded feature to obtain the second hand feature.
6. The method according to claim 1, wherein the step of inputting the second hand feature into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image specifically includes:
inputting the second hand features into a two-dimensional feature point detection network and a depth regression network in the multi-task joint learning module to obtain a two-dimensional joint point heat map and the hand pose estimation result;
and inputting the two-dimensional joint point heat map and the second hand feature into a gesture recognition network to obtain the gesture recognition result.
7. The method according to claim 6, wherein the inputting the second hand feature into the two-dimensional joint point positioning network and the depth regression network in the multi-task joint learning module to obtain the two-dimensional joint point heat map and the hand pose estimation result specifically comprises:
inputting the second hand feature into a stacked hourglass network in the two-dimensional joint point positioning network to determine the two-dimensional joint point heat map, and taking the distance between the predicted positions p_j and the true positions of the K hand joints as the loss function L_2D of the two-dimensional joint point positioning network;
inputting the two-dimensional joint point heat map and the second hand feature into a depth regression network to obtain MANO-parameterized hand pose parameters;
and determining the hand pose estimation result from the hand pose parameters.
8. The method according to claim 6, wherein the inputting the two-dimensional joint point heat map and the second hand feature into a gesture recognition network to obtain the gesture recognition result specifically comprises:
combining the two-dimensional joint point heat map and the second hand feature through a 1×1 convolution and inputting the result into the gesture recognition network;
performing convolution operations along the time axis through temporal convolution in the gesture recognition network to extract dynamic features in the gesture sequence;
and determining the gesture recognition result according to the dynamic features.
9. A device for estimating and identifying hand gestures based on deep learning, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:
acquiring a target image, and preprocessing the target image to obtain an intermediate image;
inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image;
performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature;
and inputting the second hand features into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
10. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions configured to:
acquiring a target image, and preprocessing the target image to obtain an intermediate image;
inputting the intermediate image to a preset feature extraction module to obtain a first hand feature and an object feature corresponding to the target image;
performing context reasoning on the first hand feature and the object feature through a hand-object interaction module to enhance the first hand feature and obtain a second hand feature;
and inputting the second hand features into a multi-task joint learning module to obtain a hand pose estimation result and a gesture recognition result corresponding to the target image.
CN202310739958.1A 2023-06-21 2023-06-21 Method, equipment and medium for estimating and identifying hand gesture based on deep learning Pending CN116704554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310739958.1A CN116704554A (en) 2023-06-21 2023-06-21 Method, equipment and medium for estimating and identifying hand gesture based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310739958.1A CN116704554A (en) 2023-06-21 2023-06-21 Method, equipment and medium for estimating and identifying hand gesture based on deep learning

Publications (1)

Publication Number Publication Date
CN116704554A true CN116704554A (en) 2023-09-05

Family

ID=87837216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310739958.1A Pending CN116704554A (en) 2023-06-21 2023-06-21 Method, equipment and medium for estimating and identifying hand gesture based on deep learning

Country Status (1)

Country Link
CN (1) CN116704554A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination