CN110929616B - Human hand identification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110929616B
Authority
CN
China
Prior art keywords
human hand
image
feature map
hand
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911114483.7A
Other languages
Chinese (zh)
Other versions
CN110929616A (en)
Inventor
张�雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911114483.7A priority Critical patent/CN110929616B/en
Publication of CN110929616A publication Critical patent/CN110929616A/en
Application granted granted Critical
Publication of CN110929616B publication Critical patent/CN110929616B/en



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a human hand recognition method, a device, an electronic device and a storage medium, wherein the method comprises the following steps: extracting features of the image to be detected through a feature extractor of the human hand recognition network model to obtain image features; processing the image features through a multi-task branch network layer to obtain a first edge feature map, a first region feature map and a first key point feature map; regression is carried out on the addition result of the first edge feature map, the first region feature map and the first key point feature map through a regression layer, so that a first posture parameter representing the posture of a human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected are obtained; based on the first posture parameter and the first shape parameter, a three-dimensional model of the human hand in the image to be detected is generated through the MANO network. By adopting the method and the device, the edge of the human hand, the region of the human hand and the two-dimensional key points of the human hand are identified through one network model, and the three-dimensional model of the human hand is obtained, so that the efficiency of human hand identification can be improved.

Description

Human hand identification method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to a method and a device for identifying a human hand, electronic equipment and a storage medium.
Background
With the development of internet technology, the identification of human hands in images is becoming more and more widely applied, for example in virtual/augmented reality, human-machine interaction, motion recognition, driving assistance, and the like.
The identification of a human hand in an image includes a plurality of identification tasks: identifying edges of the human hand in the image, identifying regions of the human hand in the image, identifying two-dimensional key points of the human hand in the image, and three-dimensionally modeling the human hand in the image. Currently, to complete these tasks, the prior art generally requires building a separate network model for each identification task. For example, the hand edge in an image is identified through a hand-edge recognition network model, the hand region through a hand-region recognition network model, the two-dimensional hand key points through a key-point recognition network model, and the three-dimensional model of the hand is generated through a three-dimensional reconstruction network model.
As can be seen, in the related art, a plurality of network models need to be constructed to perform human hand recognition, resulting in low recognition efficiency.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for identifying a human hand, wherein the human hand edge, the human hand region, and the two-dimensional human hand key points in an image are identified, and a three-dimensional model of the human hand in the image is generated, all by a single network model, so that the efficiency of human hand identification can be improved. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a human hand recognition method, including:
inputting an image to be detected containing a human hand into a pre-trained human hand recognition network model, wherein the human hand recognition network model comprises a feature extractor, a multi-task branch network layer, a regression layer and a MANO network;
extracting the characteristics of the image to be detected through the characteristic extractor to obtain the image characteristics of the image to be detected;
processing the image features through the multi-task branch network layer to obtain a first edge feature map representing the edge of a human hand in the image to be detected, a first area feature map representing the area of the human hand in the image to be detected and a first key point feature map representing two-dimensional key points of the human hand in the image to be detected;
Regression is carried out on the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer, so that a first posture parameter representing the posture of a human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected are obtained;
and generating a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter.
Optionally, the multi-tasking branching network layer comprises an encoder, an edge decoder, a region decoder and a heat map decoder;
the processing the image features through the multi-task branch network layer to obtain a first edge feature map representing the edge of a human hand in the image to be detected, a first area feature map representing the area of the human hand in the image to be detected, and a first key point feature map representing the two-dimensional key points of the human hand in the image to be detected, including:
encoding the image features through the encoder to obtain high-level image semantic information of the image to be detected;
decoding the high-level image semantic information through the edge decoder to obtain a first edge feature map representing the edge of the human hand in the image to be detected; decoding the high-level image semantic information through the region decoder to obtain a first area feature map representing the human hand area in the image to be detected; and decoding the high-level image semantic information through the heat map decoder to obtain a first key point feature map representing the two-dimensional human hand key points in the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
after the image features are processed through the multi-task branch network layer to obtain a first edge feature map representing edges of hands in the image to be detected, a first area feature map representing areas of hands in the image to be detected, and a first key point feature map representing two-dimensional key points of hands in the image to be detected, the method further comprises:
regression is carried out on the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer, so that a first camera parameter is obtained;
based on the first camera parameters, the three-dimensional model of the human hand in the image to be detected is projected through the differential rendering layer to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following: the human hand region projected by the image to be detected, the two-dimensional human hand key points projected by the image to be detected and the three-dimensional human hand key points projected by the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
The training step of the human hand recognition network model comprises the following steps:
inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature image representing the edge of the human hand in the sample image, a second area feature image representing the area of the human hand in the sample image and a second key point feature image representing the two-dimensional key points of the human hand in the sample image; the sample image is provided with a marked hand area, two-dimensional hand key points and three-dimensional hand key points;
regression is carried out on the addition result of the second edge feature map, the second region feature map and the second key point feature map through the regression layer, so that a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image and a second shape parameter representing the shape of the human hand in the sample image are obtained;
based on the second posture parameter and the second shape parameter, generating a three-dimensional model of a human hand in the sample image through the MANO network to serve as a sample three-dimensional model;
based on the second camera parameters, the sample three-dimensional model is projected through the differential rendering layer, so that second hand projection information is obtained, wherein the second hand projection information comprises at least one of the following: the human hand region projected by the sample image, the two-dimensional human hand key points projected by the sample image and the three-dimensional human hand key points projected by the sample image;
Training model parameters of the hand recognition network model according to the difference between the second hand projection information and the hand information corresponding to the marked sample image;
and when the human hand recognition network model converges, obtaining a trained human hand recognition network model.
Optionally, after the inputting the sample image including the human hand into the initial human hand recognition network model, obtaining a second edge feature map representing an edge of the human hand in the sample image, a second region feature map representing a region of the human hand in the sample image, and a second keypoint feature map representing two-dimensional human hand keypoints in the sample image, the method further includes:
predicting a human hand region in the sample image based on the second region feature map;
predicting two-dimensional human hand key points in the sample image based on the second key point feature map;
the training of the model parameters of the hand recognition network model according to the difference between the second hand projection information and the hand information corresponding to the noted sample image comprises the following steps:
and training model parameters of the hand recognition network model by combining the difference between the second hand projection information and the hand information corresponding to the marked sample image and the difference between the predicted hand information and the hand information corresponding to the marked sample image, wherein the predicted hand information comprises a predicted hand region in the sample image and/or a predicted two-dimensional hand key point in the sample image.
According to a second aspect of embodiments of the present disclosure, there is provided a human hand recognition apparatus comprising:
a first processing module configured to perform inputting an image to be detected including a human hand into a pre-trained human hand recognition network model, the human hand recognition network model including a feature extractor, a multi-tasking branching network layer, a regression layer, a MANO network;
the extraction module is configured to perform feature extraction on the image to be detected through the feature extractor to obtain image features of the image to be detected;
the second processing module is configured to execute processing of the image features through the multi-task branch network layer to obtain a first edge feature map representing edges of hands in the image to be detected, a first area feature map representing areas of the hands in the image to be detected and a first key point feature map representing two-dimensional key points of the hands in the image to be detected;
the regression module is configured to execute regression on the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected;
And the generation module is configured to generate a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first gesture parameter and the first shape parameter.
Optionally, the multi-tasking branching network layer comprises an encoder, an edge decoder, a region decoder and a heat map decoder;
the second processing module is specifically configured to perform encoding on the image features through the encoder to obtain high-level image semantic information of the image to be detected;
decoding the high-level image semantic information through the edge decoder to obtain a first edge feature map representing the edge of the human hand in the image to be detected; decoding the high-level image semantic information through the region decoder to obtain a first area feature map representing the human hand area in the image to be detected; and decoding the high-level image semantic information through the heat map decoder to obtain a first key point feature map representing the two-dimensional human hand key points in the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
the apparatus further comprises:
the third processing module is configured to execute regression on the addition result of the first edge feature map, the first area feature map and the first key point feature map through the regression layer to obtain a first camera parameter;
Based on the first camera parameters, the three-dimensional model of the human hand in the image to be detected is projected through the differential rendering layer to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following: the human hand region projected by the image to be detected, the two-dimensional human hand key points projected by the image to be detected and the three-dimensional human hand key points projected by the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
the apparatus further comprises:
the training module is configured to perform inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing the edge of the human hand in the sample image, a second area feature map representing the area of the human hand in the sample image and a second key point feature map representing the two-dimensional key points of the human hand in the sample image; the sample image is provided with a marked hand area, two-dimensional hand key points and three-dimensional hand key points;
regression is carried out on the addition result of the second edge feature map, the second region feature map and the second key point feature map through the regression layer, so that a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image and a second shape parameter representing the shape of the human hand in the sample image are obtained;
Based on the second posture parameter and the second shape parameter, generating a three-dimensional model of a human hand in the sample image through the MANO network to serve as a sample three-dimensional model;
based on the second camera parameters, the sample three-dimensional model is projected through the differential rendering layer, so that second hand projection information is obtained, wherein the second hand projection information comprises at least one of the following: the human hand region projected by the sample image, the two-dimensional human hand key points projected by the sample image and the three-dimensional human hand key points projected by the sample image;
training model parameters of the hand recognition network model according to the difference between the second hand projection information and the hand information corresponding to the marked sample image;
and when the human hand recognition network model converges, obtaining a trained human hand recognition network model.
Optionally, the apparatus further includes:
a prediction module configured to perform prediction of a human hand region in the sample image based on the second region feature map;
predicting two-dimensional human hand key points in the sample image based on the second key point feature map;
the training module is specifically configured to perform training on model parameters of the hand recognition network model by combining differences between the second hand projection information and hand information corresponding to the noted sample image and differences between predicted hand information and hand information corresponding to the noted sample image, wherein the predicted hand information includes a predicted hand region in the sample image and/or predicted two-dimensional hand key points in the sample image.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the human hand identification method as described in the first aspect above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the human hand recognition method as described in the first aspect above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product which, when executed by a processor of an electronic device, causes the electronic device to perform the human hand recognition method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects: the method comprises the steps of carrying out feature extraction on an image to be detected through a feature extractor of a human hand recognition network model to obtain image features, processing the image features through a multi-task branch network layer to obtain a first edge feature image, a first area feature image and a first key point feature image, carrying out regression on the addition result of the first edge feature image, the first area feature image and the first key point feature image through a regression layer to obtain a first posture parameter representing the posture of a human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected, and then generating a three-dimensional model of the human hand in the image to be detected through a MANO network based on the first posture parameter and the first shape parameter.
Based on the above processing, the edge, the area and the two-dimensional key points of the human hand in the image can be identified by one network model (namely the human hand identification network model in the embodiment of the disclosure), and a three-dimensional model of the human hand in the image is generated, so that the efficiency of human hand identification can be improved. In addition, the adoption of the multi-task branch network layer can fully utilize the marked information of the image, improves the generalization performance of the human hand recognition network model and the accuracy of the human hand recognition result, and can avoid ambiguity problems because the human hand recognition network model comprises the MANO network.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart illustrating a method of human hand identification according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a structure of a human hand recognition network model according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating a method of training a human hand recognition network model, according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a human hand recognition apparatus according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device for identifying a human hand, according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating an electronic device for identifying a human hand, according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a method for identifying a human hand according to an exemplary embodiment, and as shown in fig. 1, the method for identifying a human hand may be applied to an electronic device, which may be a terminal (e.g., a mobile phone, a computer, or a tablet computer) or a server. The method may comprise the steps of:
in step S101, an image to be detected including a human hand is input to a human hand recognition network model trained in advance.
The human hand recognition network model can comprise a feature extractor, a multi-task branch network layer, a regression layer and a MANO network. The image to be detected may be an RGB (Red, Green, Blue) image.
In step S102, feature extraction is performed on the image to be detected by the feature extractor, so as to obtain image features of the image to be detected.
Wherein the feature extractor may be constituted by a convolutional layer.
In one embodiment, the electronic device may perform a convolution operation on the image to be detected through the feature extractor formed by convolution layers, so as to extract the image features of the image to be detected. The image features of the image to be detected may be a feature map with a smaller size, so as to reduce the calculation amount of the network model.
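The patent discloses no source code, so the following is only a minimal sketch of what such a convolutional feature extractor could look like, written in PyTorch (an assumed framework); all channel counts, kernel sizes and strides are hypothetical choices for illustration:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Stacked strided convolutions that downsample the input image into a
    smaller feature map, reducing the computation of later layers.
    Layer sizes are illustrative assumptions, not taken from the patent."""

    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, 64, H/4, W/4): a smaller feature map of the image.
        return self.layers(image)
```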
In step S103, the image features are processed by the multi-task branch network layer to obtain a first edge feature map representing the edge of the human hand in the image to be detected, a first area feature map representing the area of the human hand in the image to be detected, and a first key point feature map representing the two-dimensional key points of the human hand in the image to be detected.
In one embodiment, the multi-task branch network layer may include a plurality of network layers, and the electronic device may process the image features of the image to be detected based on the plurality of network layers, to obtain a feature map representing the edge of a human hand in the image to be detected (i.e., the first edge feature map in the embodiments of the disclosure), a feature map representing the area of the human hand in the image to be detected (i.e., the first area feature map in the embodiments of the disclosure), and a feature map representing the two-dimensional key points of the human hand in the image to be detected (i.e., the first key point feature map in the embodiments of the disclosure).
Optionally, the multi-task branch network layer includes an Encoder, an Edge Decoder, a Region Decoder (Mask Decoder), and a Heat-map Decoder, and S103 may include the following steps:
Step one, encoding the image features through the encoder to obtain the high-level image semantic information of the image to be detected.
The encoder, a standard component in the field of deep learning, is used to extract high-level image semantic information of an image.
Step two, decoding the high-level image semantic information through the edge decoder to obtain a first edge feature map representing the edge of the human hand in the image to be detected; decoding the high-level image semantic information through the region decoder to obtain a first region feature map representing the human hand region in the image to be detected; and decoding the high-level image semantic information through the heat map decoder to obtain a first key point feature map representing the two-dimensional human hand key points in the image to be detected.
In one embodiment, the edge decoder may decode the high-level image semantic information of the image to be detected to obtain an edge feature map with a size of 256×256, used for predicting the edge of the human hand in the image to be detected. The region decoder may decode the high-level image semantic information to obtain a region feature map with a size of 256×256, used for predicting the region of the human hand in the image to be detected. The heat map decoder may decode the high-level image semantic information to obtain a plurality of key point feature maps with a size of 256×256, used for predicting the two-dimensional human hand key points in the image to be detected.
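A hedged sketch of this multi-task branch network layer, again in PyTorch: one shared encoder followed by three decoders producing an edge map, a region mask and 21 key-point heat maps. The single upsampling stage and all channel counts are assumptions; a decoder reaching the 256×256 maps described above would stack several such stages.

```python
import torch
import torch.nn as nn

def upsample_block(cin: int, cout: int) -> nn.Sequential:
    # Transposed convolution that doubles the spatial resolution.
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )

class MultiTaskBranch(nn.Module):
    """Shared encoder plus three decoders: edge map (1 channel), region
    mask (1 channel) and 21 key-point heat maps. Depths are illustrative."""

    def __init__(self, feat_channels: int = 64, num_keypoints: int = 21):
        super().__init__()
        self.encoder = nn.Sequential(  # extracts high-level semantics
            nn.Conv2d(feat_channels, 128, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.edge_decoder = nn.Sequential(upsample_block(128, 64), nn.Conv2d(64, 1, 1))
        self.mask_decoder = nn.Sequential(upsample_block(128, 64), nn.Conv2d(64, 1, 1))
        self.heatmap_decoder = nn.Sequential(
            upsample_block(128, 64), nn.Conv2d(64, num_keypoints, 1)
        )

    def forward(self, features: torch.Tensor):
        semantic = self.encoder(features)
        edge = self.edge_decoder(semantic)          # hand-edge feature map
        mask = self.mask_decoder(semantic)          # hand-region feature map
        heatmaps = self.heatmap_decoder(semantic)   # key-point feature maps
        return edge, mask, heatmaps
```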
In step S104, regression is performed on the addition result of the first edge feature map, the first region feature map and the first key point feature map through the regression layer, so as to obtain a first gesture parameter indicating the gesture of the human hand in the image to be detected and a first shape parameter indicating the shape of the human hand in the image to be detected.
The regression layer may be composed of a convolution layer and a fully connected layer.
In one embodiment, the electronic device may superimpose the first edge feature map, the first region feature map, and the first key point feature map, and then regress the superimposed result through a regression layer formed by a convolution layer and a fully connected layer to obtain a parameter representing the posture of the human hand in the image to be detected (i.e., the first posture parameter in the embodiments of the disclosure) and a parameter representing the shape of the human hand in the image to be detected (i.e., the first shape parameter in the embodiments of the disclosure).
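As an illustration of this addition-then-regression step, the sketch below sums the three branch outputs and regresses them through convolution and fully connected layers. The output dimensions (45 pose, 10 shape, 3 camera parameters) follow common MANO practice and are assumptions; the patent does not fix them.

```python
import torch
import torch.nn as nn

class RegressionLayer(nn.Module):
    """Adds the branch feature maps element-wise, then regresses pose, shape
    and camera parameters. All dimensions are hypothetical (45/10/3)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=4, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=4, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, 45 + 10 + 3)

    def forward(self, edge, mask, heatmaps):
        # The 21 heat maps are collapsed to one channel before the addition
        # so the three maps have matching shapes (an assumption).
        fused = edge + mask + heatmaps.sum(dim=1, keepdim=True)
        out = self.fc(self.conv(fused).flatten(1))
        pose, shape, cam = out[:, :45], out[:, 45:55], out[:, 55:]
        return pose, shape, cam
```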
In step S105, a three-dimensional model of the human hand in the image to be detected is generated through the MANO network based on the first pose parameter and the first shape parameter.
The MANO network is a parameterized human hand model proposed by the Perceiving Systems department of the Max Planck Institute, which can generate a three-dimensional model of a human hand from parameters of the human hand posture and parameters of the human hand shape.
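For readers who want to reproduce this step, the open-source manopth PyTorch port of MANO (an assumed dependency, not named in the patent; it also requires the MANO model files distributed by the model's authors) can generate the hand mesh from pose and shape parameters:

```python
import torch
from manopth.manolayer import ManoLayer  # assumed third-party MANO port

# The MANO model files must be downloaded separately from the authors' site;
# 'mano/models' is a hypothetical local path.
mano = ManoLayer(mano_root='mano/models', use_pca=False, flat_hand_mean=True)

pose = torch.zeros(1, 48)   # 3 global-rotation + 45 joint-angle parameters
shape = torch.zeros(1, 10)  # shape (beta) parameters

# Returns the hand mesh vertices and the 3D joint positions.
verts, joints = mano(pose, shape)
print(verts.shape, joints.shape)  # (1, 778, 3) vertices, (1, 21, 3) joints
```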
According to the human hand recognition method provided by the embodiments of the present disclosure, the human hand edges, the human hand regions and the two-dimensional human hand key points in an image can be recognized, and the three-dimensional model of the human hand in the image generated, through only one network model (namely the human hand recognition network model), so that the efficiency of human hand recognition can be improved. In addition, the multi-task branch network layer can make full use of the information with which images are annotated, improving the generalization performance of the human hand recognition network model and, correspondingly, the accuracy of the human hand recognition result. Moreover, generating the three-dimensional model of the human hand through the MANO network avoids the ambiguity problem that traditional methods cannot, namely the problem that the part of the image occluded by the human hand cannot be accurately mapped into three-dimensional space.
Optionally, the human hand recognition network model further comprises a differential rendering layer, and after S103, the method may further comprise the steps of:
and firstly, carrying out regression on the addition result of the first edge feature map, the first region feature map and the first key point feature map through a regression layer to obtain a first camera parameter.
In one implementation manner, when the electronic device regresses the superposition result of the first edge feature map, the first region feature map, and the first key point feature map, a corresponding camera parameter (i.e., the first camera parameter in the embodiment of the present disclosure) may also be obtained.
And secondly, based on the first camera parameters, projecting a three-dimensional model of a human hand in the image to be detected through the differential rendering layer to obtain first human hand projection information.
Wherein the first human hand projection information includes at least one of: the method comprises the steps of a human hand area projected by an image to be detected, a two-dimensional human hand key point projected by the image to be detected and a three-dimensional human hand key point projected by the image to be detected.
In one embodiment, after obtaining the first camera parameter, the electronic device may project, through the differential rendering layer, a three-dimensional model of a human hand in the image to be detected based on the first camera parameter.
According to actual needs, the electronic device may project any one or any combination of the human hand region in the image to be detected, the two-dimensional human hand key points in the image to be detected, and the three-dimensional human hand key points in the image to be detected.
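The patent only states that the differential rendering layer performs the projection; one common differentiable choice for projecting the three-dimensional key points is a weak-perspective camera, sketched below under the assumption that the camera parameters are a scale and a 2D translation:

```python
import torch

def project_keypoints(joints_3d: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """Weak-perspective projection of 3D hand key points onto the image plane.

    joints_3d: (B, 21, 3) key points from the hand mesh.
    cam:       (B, 3) camera parameters assumed to be [scale, tx, ty].
    Returns (B, 21, 2) projected two-dimensional key points.
    """
    scale = cam[:, 0].view(-1, 1, 1)
    trans = cam[:, 1:3].view(-1, 1, 2)
    return scale * joints_3d[..., :2] + trans
```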
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a structure of a human hand recognition network model according to an exemplary embodiment.
In fig. 2, the human hand recognition network model includes a feature extractor, a multi-tasking branching network layer, a regression layer, a MANO network, and a differential rendering layer.
The multi-task branch network layer may include an encoder, an edge decoder, a region decoder, and a heat map decoder.
θmesh output by the regression layer represents the pose parameter and the shape parameter, and θcam represents the camera parameter.
Optionally, referring to fig. 3, fig. 3 is a flowchart illustrating a method of training a human hand recognition network model according to an exemplary embodiment, and the method may include the following steps:
in step S301, a sample image including a human hand is input to an initial human hand recognition network model, and a second edge feature map representing the edge of the human hand in the sample image, a second region feature map representing the region of the human hand in the sample image, and a second keypoint feature map representing two-dimensional human hand keypoints in the sample image are obtained.
The sample image is provided with a marked hand area, two-dimensional hand key points and three-dimensional hand key points.
In one embodiment, the electronic device may obtain the initial human hand recognition network model shown in fig. 2, and input the sample images marked with the human hand region, the two-dimensional human hand key points and the three-dimensional human hand key points into the human hand recognition network model, and may obtain, through the feature extractor and the multi-tasking branch network layer, a feature map representing the edge of the human hand in the sample image (i.e., the second edge feature map in the embodiment of the present disclosure), a feature map representing the region of the human hand in the sample image (i.e., the second region feature map in the embodiment of the present disclosure), and a feature map representing the two-dimensional human hand key points in the sample image (i.e., the second key point feature map in the embodiment of the present disclosure).
In step S302, regression is performed on the addition result of the second edge feature map, the second region feature map, and the second key point feature map through the regression layer, so as to obtain a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image, and a second shape parameter representing the shape of the human hand in the sample image.
In one implementation, the electronic device may superimpose the second edge feature map, the second region feature map, and the second key point feature map, and then regress the superimposed result through a regression layer formed by a convolution layer and a full-connection layer to obtain a parameter for representing a posture of a human hand in the sample image (i.e., a second posture parameter in an embodiment of the disclosure), a parameter for representing a shape of the human hand in the sample image (i.e., a second shape parameter in an embodiment of the disclosure), and a camera parameter (i.e., a second camera parameter in an embodiment of the disclosure).
In step S303, a three-dimensional model of a human hand in the sample image is generated as a sample three-dimensional model through the MANO network based on the second posture parameter and the second shape parameter.
The method for generating the sample three-dimensional model is similar to the method for generating the three-dimensional model of the human hand in the image to be detected in the above embodiment, and will not be described again.
In step S304, based on the second camera parameter, the sample three-dimensional model is projected by the differential rendering layer, so as to obtain second hand projection information.
Wherein the second human hand projection information includes at least one of: the method comprises the steps of a human hand area projected by a sample image, a two-dimensional human hand key point projected by the sample image and a three-dimensional human hand key point projected by the sample image.
The method for generating the second hand projection information is similar to the method for generating the first hand projection information in the above embodiment, and will not be described again.
In step S305, model parameters of the hand recognition network model are trained according to differences between the second hand projection information and the hand information corresponding to the labeled sample image.
The difference between the human hand region obtained by projecting the sample three-dimensional model and the human hand region marked by the sample image can be called a first difference. The difference between the two-dimensional human hand key points obtained by projecting the sample three-dimensional model and the two-dimensional human hand key points marked by the sample image can be called as a second difference. The difference between the three-dimensional human hand key points obtained by projecting the sample three-dimensional model and the three-dimensional human hand key points marked by the sample image can be called as a third difference.
In one embodiment, the electronic device may train model parameters of the human hand recognition network model according to any one of the first difference, the second difference, and the third difference, or according to a combination of any of the differences.
In step S306, when the human hand recognition network model converges, a trained human hand recognition network model is obtained.
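A minimal sketch of how the projection differences of steps S304 and S305 could be combined into a training loss; the specific loss functions and weights are assumptions, since the patent only requires training on the differences between the projected and labeled hand information:

```python
import torch
import torch.nn.functional as F

def training_loss(proj: dict, labels: dict, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Combines the region, 2D key-point and 3D key-point differences.

    proj / labels hold 'mask' (B, 1, H, W) logits vs. binary masks,
    'kp2d' (B, 21, 2) and 'kp3d' (B, 21, 3). Loss choices are illustrative.
    """
    w_mask, w_2d, w_3d = weights
    loss = w_mask * F.binary_cross_entropy_with_logits(proj['mask'], labels['mask'])
    loss = loss + w_2d * F.l1_loss(proj['kp2d'], labels['kp2d'])   # 2D key points
    loss = loss + w_3d * F.l1_loss(proj['kp3d'], labels['kp3d'])   # 3D key points
    return loss
```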
Optionally, to further improve the accuracy of the network model for human hand recognition, the method may further include the following steps: and predicting the hand region in the sample image based on the second region feature map, and predicting the two-dimensional hand key points in the sample image based on the second key point feature map.
In one embodiment, after obtaining the second region feature map and the second keypoint feature map of the sample image through the multi-tasking branch network layer, the electronic device may predict a human hand region in the sample image based on the second region feature map and predict a two-dimensional human hand keypoint in the sample image based on the second keypoint feature map.
The second region feature map and the second key point feature map may include a plurality of feature values, where each feature value represents a feature of a corresponding position in the sample image.
In one embodiment, the electronic device may determine first feature values in the second region feature map that are capable of characterizing a human hand region, and then take the region of positions in the sample image corresponding to the first feature values as the human hand region in the sample image.
For example, the electronic device may determine a feature value in the second region feature map that is greater than a first preset threshold as the first feature value.
In addition, the electronic device may determine second feature values in the second key point feature map that are capable of characterizing two-dimensional human hand key points, and then take the positions in the sample image corresponding to the second feature values as the two-dimensional human hand key points in the sample image.
For example, the electronic device may determine a feature value in the second key point feature map that is greater than a second preset threshold as the second feature value.
The number of the two-dimensional human hand key points may be 21, and a gesture can be determined through the 21 two-dimensional human hand key points.
Accordingly, the number of the second key point feature maps may also be 21, and the electronic device may determine one two-dimensional human hand key point from each second key point feature map, thereby determining 21 two-dimensional human hand key points.
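A sketch of reading one two-dimensional key point per heat map. The patent describes comparing feature values against a preset threshold; taking the position of the maximum response, shown below, is a standard decoding that coincides with thresholding when each map contains exactly one key point:

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Decodes (B, 21, H, W) heat maps into (B, 21, 2) key points as (x, y)
    by locating the maximum response in each map."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1)
    idx = flat.argmax(dim=-1)                                 # flattened index
    xs = (idx % w).float()                                    # column -> x
    ys = torch.div(idx, w, rounding_mode='floor').float()     # row -> y
    return torch.stack([xs, ys], dim=-1)
```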
Correspondingly, the electronic device may train the model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the human hand information corresponding to the labeled sample image with the difference between the predicted human hand information and the human hand information corresponding to the labeled sample image.
Wherein the predicted human hand information comprises a human hand region in a predicted sample image and/or a two-dimensional human hand keypoint in the predicted sample image.
The difference between the predicted human hand region in the sample image and the human hand region noted in the sample image may be referred to as a fourth difference. The difference between the predicted two-dimensional human hand keypoints in the sample image and the two-dimensional human hand keypoints labeled in the sample image may be referred to as a fifth difference.
In one embodiment, the electronic device may obtain the fourth difference and/or the fifth difference, obtain any one of the first difference, the second difference, and the third difference, or a combination of any of the first difference, the second difference, and the third difference, and train model parameters of the human hand recognition network model according to the obtained differences.
In addition, the second edge feature map may include a plurality of feature values, each feature value representing a feature of a respective location in the sample image.
The electronic device may determine third feature values in the second edge feature map that are capable of characterizing the human hand edge, and then take the positions in the sample image corresponding to the third feature values as the edge of the human hand in the sample image.
For example, the electronic device may determine a feature value in the second edge feature map that is greater than a third preset threshold as a third feature value.
Fig. 4 is a block diagram illustrating a human hand recognition apparatus according to an exemplary embodiment. Referring to fig. 4, the apparatus includes a first processing module 401, an extraction module 402, a second processing module 403, a regression module 404, and a generation module 405.
A first processing module 401 configured to perform inputting an image to be detected containing a human hand into a pre-trained human hand recognition network model comprising a feature extractor, a multi-tasking branching network layer, a regression layer, a MANO network;
an extraction module 402, configured to perform feature extraction on the image to be detected by the feature extractor, so as to obtain image features of the image to be detected;
a second processing module 403, configured to perform processing on the image features through the multi-task branch network layer, to obtain a first edge feature map that represents edges of hands in the image to be detected, a first area feature map that represents areas of hands in the image to be detected, and a first key point feature map that represents two-dimensional key points of hands in the image to be detected;
the regression module 404 is configured to perform regression on the addition result of the first edge feature map, the first region feature map, and the first key point feature map through the regression layer, so as to obtain a first gesture parameter indicating a gesture of a human hand in the image to be detected and a first shape parameter indicating a shape of the human hand in the image to be detected;
A generating module 405 configured to generate a three-dimensional model of a human hand in the image to be detected through the MANO network based on the first pose parameter and the first shape parameter.
Optionally, the multi-tasking branching network layer comprises an encoder, an edge decoder, a region decoder and a heat map decoder;
the second processing module 403 is specifically configured to perform encoding of the image features by the encoder, so as to obtain high-level image semantic information of the image to be detected;
decoding the high-level image semantic information through the edge decoder to obtain a first edge feature map representing the edge of the human hand in the image to be detected; decoding the high-level image semantic information through the region decoder to obtain a first area feature map representing the human hand area in the image to be detected; and decoding the high-level image semantic information through the heat map decoder to obtain a first key point feature map representing the two-dimensional human hand key points in the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
the apparatus further comprises:
the third processing module is configured to execute regression on the addition result of the first edge feature map, the first area feature map and the first key point feature map through the regression layer to obtain a first camera parameter;
Based on the first camera parameters, the three-dimensional model of the human hand in the image to be detected is projected through the differential rendering layer to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following: the human hand region projected by the image to be detected, the two-dimensional human hand key points projected by the image to be detected and the three-dimensional human hand key points projected by the image to be detected.
Optionally, the human hand recognition network model further comprises a differential rendering layer;
the apparatus further comprises:
the training module is configured to perform inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing the edge of the human hand in the sample image, a second area feature map representing the area of the human hand in the sample image and a second key point feature map representing the two-dimensional key points of the human hand in the sample image; the sample image is provided with a marked hand area, two-dimensional hand key points and three-dimensional hand key points;
regression is carried out on the addition result of the second edge feature map, the second region feature map and the second key point feature map through the regression layer, so that a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image and a second shape parameter representing the shape of the human hand in the sample image are obtained;
Based on the second posture parameter and the second shape parameter, generating a three-dimensional model of a human hand in the sample image through the MANO network to serve as a sample three-dimensional model;
based on the second camera parameters, the sample three-dimensional model is projected through the differential rendering layer, so that second hand projection information is obtained, wherein the second hand projection information comprises at least one of the following: the human hand region projected by the sample image, the two-dimensional human hand key points projected by the sample image and the three-dimensional human hand key points projected by the sample image;
training model parameters of the hand recognition network model according to the difference between the second hand projection information and the hand information corresponding to the marked sample image;
and when the human hand recognition network model converges, obtaining a trained human hand recognition network model.
Optionally, the apparatus further includes:
a prediction module configured to perform prediction of a human hand region in the sample image based on the second region feature map;
predicting two-dimensional human hand key points in the sample image based on the second key point feature map;
the training module is specifically configured to perform training on model parameters of the hand recognition network model by combining differences between the second hand projection information and hand information corresponding to the noted sample image and differences between predicted hand information and hand information corresponding to the noted sample image, wherein the predicted hand information includes a predicted hand region in the sample image and/or predicted two-dimensional hand key points in the sample image.
The specific manner in which the various modules perform operations in the apparatus of the above embodiment has been described in detail in the embodiments of the method, and will not be described again here.
Fig. 5 is a block diagram illustrating an electronic device 500 for identifying a human hand, according to an example embodiment. For example, electronic device 500 may be provided as a server. Referring to fig. 5, electronic device 500 includes a processing component 522 that further includes one or more processors and memory resources represented by memory 532 for storing instructions, such as applications, executable by processing component 522. The application programs stored in the memory 532 may include one or more modules each corresponding to a set of instructions. Further, the processing component 522 is configured to execute instructions to perform the human hand recognition methods described above.
The electronic device 500 may also include a power component 526 configured to perform power management of the electronic device 500, a wired or wireless network interface 550 configured to connect the electronic device 500 to a network, and an input/output (I/O) interface 558. The electronic device 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar operating systems.
FIG. 6 is a block diagram illustrating an electronic device for identifying a human hand according to an exemplary embodiment. For example, the electronic device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, the electronic device may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or part of the steps of the human hand identification method described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 606 provides power to the various components of the electronic device. The power supply components 606 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic devices.
The multimedia component 608 includes a screen providing an output interface between the electronic device and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the electronic device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor assembly 614 includes one or more sensors for providing status assessments of various aspects of the electronic device. For example, the sensor assembly 614 may detect an on/off state of the electronic device and the relative positioning of components, such as the display and keypad of the electronic device. The sensor assembly 614 may also detect a change in position of the electronic device or of a component of the electronic device, the presence or absence of user contact with the electronic device, the orientation or acceleration/deceleration of the electronic device, and a change in the temperature of the electronic device. The sensor assembly 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the methods described above.
In an exemplary embodiment, a storage medium is also provided, such as the memory 604 including instructions executable by the processor 620 of the electronic device to perform the above-described human hand recognition method. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A human hand recognition method, comprising:
inputting an image to be detected containing a human hand into a pre-trained human hand recognition network model, wherein the human hand recognition network model comprises a feature extractor, a multi-task branch network layer, a regression layer and a MANO network;
performing feature extraction on the image to be detected through the feature extractor to obtain image features of the image to be detected;
processing the image features through the multi-task branch network layer to obtain a first edge feature map representing an edge of the human hand in the image to be detected, a first region feature map representing a human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected;
performing regression on the addition result of the first edge feature map, the first region feature map, and the first key point feature map through the regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected;
generating a three-dimensional model of a human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter;
wherein the multi-task branch network layer comprises an encoder, an edge decoder, a region decoder, and a heat map decoder;
the processing the image features through the multi-task branch network layer to obtain the first edge feature map representing the edge of the human hand in the image to be detected, the first region feature map representing the human hand region in the image to be detected, and the first key point feature map representing the two-dimensional human hand key points in the image to be detected comprises:
encoding the image features through the encoder to obtain high-level image semantic information of the image to be detected;
decoding the high-level image semantic information through the edge decoder to obtain the first edge feature map representing the edge of the human hand in the image to be detected; decoding the high-level image semantic information through the region decoder to obtain the first region feature map representing the human hand region in the image to be detected; and decoding the high-level image semantic information through the heat map decoder to obtain the first key point feature map representing the two-dimensional human hand key points in the image to be detected.
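For orientation, claim 1 describes a single forward pass. The following is a minimal PyTorch sketch of that pipeline, offered only as an illustration: the backbone depth, channel counts, and parameter dimensions (48 pose and 10 shape values are common MANO conventions) are assumptions of this sketch rather than values fixed by the claims, and `mano_layer` stands in for any differentiable MANO implementation that maps a pose vector and a shape vector to mesh vertices and joints.

```python
import torch
import torch.nn as nn


class HandRecognitionNet(nn.Module):
    """Sketch of claim 1: feature extractor -> multi-task branch
    (shared encoder plus edge / region / heatmap decoders) -> regression
    over the element-wise sum of the three maps -> MANO network."""

    def __init__(self, mano_layer, feat_ch=64, map_ch=21,
                 pose_dim=48, shape_dim=10):
        super().__init__()
        self.pose_dim = pose_dim
        # Feature extractor (a real system would use a deeper backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        # Shared encoder producing high-level image semantic information.
        self.encoder = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        # One decoder per task; identical output shapes let the maps be added.
        self.edge_dec = nn.ConvTranspose2d(feat_ch, map_ch, 4, 2, 1)
        self.region_dec = nn.ConvTranspose2d(feat_ch, map_ch, 4, 2, 1)
        self.heatmap_dec = nn.ConvTranspose2d(feat_ch, map_ch, 4, 2, 1)
        # Regression layer: pooled sum of the three maps -> (pose, shape).
        self.regress = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(map_ch, pose_dim + shape_dim))
        self.mano = mano_layer  # assumed: (pose, shape) -> (verts, joints)

    def forward(self, img):
        sem = self.encoder(self.backbone(img))  # high-level semantics
        edge = self.edge_dec(sem)               # first edge feature map
        region = self.region_dec(sem)           # first region feature map
        heat = self.heatmap_dec(sem)            # first key point feature map
        params = self.regress(edge + region + heat)
        pose = params[:, :self.pose_dim]        # first posture parameter
        shape = params[:, self.pose_dim:]       # first shape parameter
        verts, joints = self.mano(pose, shape)  # three-dimensional hand model
        return verts, joints, edge, region, heat
```

A placeholder `mano_layer` returning zero tensors of shape (N, 778, 3) and (N, 21, 3), MANO's usual vertex and key point counts, is enough to exercise the graph end to end.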
2. The human hand recognition method according to claim 1, wherein the human hand recognition network model further comprises a differentiable rendering layer;
wherein after the processing the image features through the multi-task branch network layer to obtain the first edge feature map representing the edge of the human hand in the image to be detected, the first region feature map representing the human hand region in the image to be detected, and the first key point feature map representing the two-dimensional human hand key points in the image to be detected, the method further comprises:
performing regression on the addition result of the first edge feature map, the first region feature map, and the first key point feature map through the regression layer to obtain a first camera parameter;
projecting the three-dimensional model of the human hand in the image to be detected through the differentiable rendering layer based on the first camera parameter to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following: a human hand region projected from the image to be detected, two-dimensional human hand key points projected from the image to be detected, and three-dimensional human hand key points projected from the image to be detected.
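The claim does not fix a camera model for this projection. In hand and body mesh recovery, a weak-perspective camera, a scale plus a two-dimensional translation, is a common convention for the regressed camera parameter; the sketch below assumes that convention and is an illustration, not the patent's definition of the rendering layer.

```python
import torch


def project_weak_perspective(points_3d: torch.Tensor,
                             cam: torch.Tensor) -> torch.Tensor:
    """Project (N, K, 3) model points to (N, K, 2) image coordinates.

    `cam` is assumed to hold (scale, tx, ty) per sample; this
    weak-perspective form is one common convention."""
    s = cam[:, 0].view(-1, 1, 1)        # per-sample scale
    t = cam[:, 1:3].unsqueeze(1)        # per-sample 2-D translation
    return s * points_3d[..., :2] + t   # drop depth, then scale and shift
```

A full differentiable rendering layer would additionally rasterize the mesh into a silhouette to produce the projected human hand region; differentiable rasterizers such as the one in PyTorch3D provide that operation.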
3. The human hand recognition method according to claim 1, wherein the human hand recognition network model further comprises a differentiable rendering layer;
the training step of the human hand recognition network model comprises the following steps:
inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing an edge of the human hand in the sample image, a second region feature map representing a human hand region in the sample image, and a second key point feature map representing two-dimensional human hand key points in the sample image, wherein the sample image is annotated with a human hand region, two-dimensional human hand key points, and three-dimensional human hand key points;
performing regression on the addition result of the second edge feature map, the second region feature map, and the second key point feature map through the regression layer to obtain a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image, and a second shape parameter representing the shape of the human hand in the sample image;
generating a three-dimensional model of the human hand in the sample image through the MANO network based on the second posture parameter and the second shape parameter, to serve as a sample three-dimensional model;
projecting the sample three-dimensional model through the differentiable rendering layer based on the second camera parameter to obtain second human hand projection information, wherein the second human hand projection information comprises at least one of the following: a human hand region projected from the sample image, two-dimensional human hand key points projected from the sample image, and three-dimensional human hand key points projected from the sample image;
training model parameters of the human hand recognition network model according to the difference between the second human hand projection information and the annotated human hand information of the sample image;
and when the human hand recognition network model converges, obtaining a trained human hand recognition network model.
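In code, the training procedure of claim 3 amounts to one supervised step per batch: run the model, project the resulting mesh, and penalize the projections against the annotations. The sketch below assumes that the model also returns the regressed camera parameter, that `project` and `render_silhouette` are differentiable functions along the lines sketched above, and that the three loss terms are summed with unit weights; none of these choices is fixed by the claim.

```python
import torch
import torch.nn.functional as F


def training_step(model, project, render_silhouette, batch, optimizer):
    """One iteration of the claim-3 training loop (unit loss weights assumed)."""
    # Assumed model outputs: mesh vertices, 3-D key points, camera parameter.
    verts, joints_3d, cam = model(batch["image"])
    joints_2d = project(joints_3d, cam)        # projected 2-D key points
    sil = render_silhouette(verts, cam)        # projected human hand region

    loss = (F.mse_loss(joints_2d, batch["kp2d"])    # 2-D key point term
            + F.mse_loss(joints_3d, batch["kp3d"])  # 3-D key point term
            + F.binary_cross_entropy(               # hand region term
                sil.clamp(0.0, 1.0), batch["mask"]))

    optimizer.zero_grad()
    loss.backward()   # gradients flow back through MANO and the renderer
    optimizer.step()
    return loss.item()
```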
4. The human hand recognition method according to claim 3, wherein after the inputting a sample image containing a human hand into an initial human hand recognition network model to obtain a second edge feature map representing the edge of the human hand in the sample image, a second region feature map representing the human hand region in the sample image, and a second key point feature map representing two-dimensional human hand key points in the sample image, the method further comprises:
predicting a human hand region in the sample image based on the second region feature map;
predicting two-dimensional human hand key points in the sample image based on the second key point feature map;
wherein the training model parameters of the human hand recognition network model according to the difference between the second human hand projection information and the annotated human hand information of the sample image comprises:
training the model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the annotated human hand information of the sample image with the difference between predicted human hand information and the annotated human hand information of the sample image, wherein the predicted human hand information comprises the predicted human hand region in the sample image and/or the predicted two-dimensional human hand key points in the sample image.
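Claim 4 adds direct supervision on the branch outputs themselves, which requires decoding key points from the second key point feature map. The claims leave the decoding scheme open; a differentiable soft-argmax over the heatmap, sketched below, is one common choice that lets this auxiliary term train the decoders jointly with the projection losses of claim 3.

```python
import torch


def soft_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
    """Decode (N, K, H, W) key point heatmaps into (N, K, 2) (x, y)
    coordinates as the expectation of a spatial softmax. One common
    decoding; the claims leave the scheme open."""
    n, k, h, w = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).view(n, k, h, w)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expected row coordinate
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expected column coordinate
    return torch.stack([x, y], dim=-1)
```

The combined objective of claim 4 can then add, for example, an `F.mse_loss(soft_argmax_2d(heat), batch["kp2d"])` term and a cross-entropy term on the decoded region map to the projection losses above.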
5. A human hand recognition device, comprising:
a first processing module configured to input an image to be detected containing a human hand into a pre-trained human hand recognition network model, wherein the human hand recognition network model comprises a feature extractor, a multi-task branch network layer, a regression layer, and a MANO network;
an extraction module configured to perform feature extraction on the image to be detected through the feature extractor to obtain image features of the image to be detected;
a second processing module configured to process the image features through the multi-task branch network layer to obtain a first edge feature map representing an edge of the human hand in the image to be detected, a first region feature map representing a human hand region in the image to be detected, and a first key point feature map representing two-dimensional human hand key points in the image to be detected;
a regression module configured to perform regression on the addition result of the first edge feature map, the first region feature map, and the first key point feature map through the regression layer to obtain a first posture parameter representing the posture of the human hand in the image to be detected and a first shape parameter representing the shape of the human hand in the image to be detected;
a generation module configured to generate a three-dimensional model of the human hand in the image to be detected through the MANO network based on the first posture parameter and the first shape parameter;
wherein the multi-task branch network layer comprises an encoder, an edge decoder, a region decoder, and a heat map decoder; and
the second processing module is specifically configured to encode the image features through the encoder to obtain high-level image semantic information of the image to be detected;
decode the high-level image semantic information through the edge decoder to obtain the first edge feature map representing the edge of the human hand in the image to be detected; decode the high-level image semantic information through the region decoder to obtain the first region feature map representing the human hand region in the image to be detected; and decode the high-level image semantic information through the heat map decoder to obtain the first key point feature map representing the two-dimensional human hand key points in the image to be detected.
6. The human hand recognition device of claim 5, wherein the human hand recognition network model further comprises a differentiable rendering layer;
The apparatus further comprises:
a third processing module configured to perform regression on the addition result of the first edge feature map, the first region feature map, and the first key point feature map through the regression layer to obtain a first camera parameter; and
project the three-dimensional model of the human hand in the image to be detected through the differentiable rendering layer based on the first camera parameter to obtain first human hand projection information, wherein the first human hand projection information comprises at least one of the following: a human hand region projected from the image to be detected, two-dimensional human hand key points projected from the image to be detected, and three-dimensional human hand key points projected from the image to be detected.
7. The human hand recognition device of claim 5, wherein the human hand recognition network model further comprises a differentiable rendering layer;
the apparatus further comprises:
the training module is configured to input a sample image containing a human hand into the initial human hand recognition network model, and a second edge feature map representing the edge of the human hand in the sample image, a second area feature map representing the area of the human hand in the sample image and a second key point feature map representing the two-dimensional key points of the human hand in the sample image are obtained; the sample image is provided with a marked hand area, two-dimensional hand key points and three-dimensional hand key points;
perform regression on the addition result of the second edge feature map, the second region feature map, and the second key point feature map through the regression layer to obtain a second camera parameter, a second posture parameter representing the posture of the human hand in the sample image, and a second shape parameter representing the shape of the human hand in the sample image;
generate a three-dimensional model of the human hand in the sample image through the MANO network based on the second posture parameter and the second shape parameter, to serve as a sample three-dimensional model;
project the sample three-dimensional model through the differentiable rendering layer based on the second camera parameter to obtain second human hand projection information, wherein the second human hand projection information comprises at least one of the following: a human hand region projected from the sample image, two-dimensional human hand key points projected from the sample image, and three-dimensional human hand key points projected from the sample image;
train model parameters of the human hand recognition network model according to the difference between the second human hand projection information and the annotated human hand information of the sample image; and
when the human hand recognition network model converges, obtain a trained human hand recognition network model.
8. A human hand recognition device according to claim 7, wherein the device further comprises:
a prediction module configured to predict a human hand region in the sample image based on the second region feature map, and to predict two-dimensional human hand key points in the sample image based on the second key point feature map;
wherein the training module is specifically configured to train the model parameters of the human hand recognition network model by combining the difference between the second human hand projection information and the annotated human hand information of the sample image with the difference between predicted human hand information and the annotated human hand information of the sample image, wherein the predicted human hand information comprises the predicted human hand region in the sample image and/or the predicted two-dimensional human hand key points in the sample image.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the human hand recognition method of any one of claims 1 to 4.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the human hand recognition method of any one of claims 1 to 4.
CN201911114483.7A 2019-11-14 2019-11-14 Human hand identification method and device, electronic equipment and storage medium Active CN110929616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911114483.7A CN110929616B (en) 2019-11-14 2019-11-14 Human hand identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110929616A CN110929616A (en) 2020-03-27
CN110929616B (en) 2023-07-04

Family

ID=69853860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911114483.7A Active CN110929616B (en) 2019-11-14 2019-11-14 Human hand identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110929616B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801064A (en) * 2021-04-12 2021-05-14 北京的卢深视科技有限公司 Model training method, electronic device and storage medium
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
CN114677572B (en) * 2022-04-08 2023-04-18 北京百度网讯科技有限公司 Object description parameter generation method and deep learning model training method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657545A (en) * 2018-11-10 2019-04-19 天津大学 A kind of pedestrian detection method based on multi-task learning
CN109978037A (en) * 2019-03-18 2019-07-05 腾讯科技(深圳)有限公司 Image processing method, model training method, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101526426B1 (en) * 2013-12-31 2015-06-05 현대자동차 주식회사 Gesture recognize apparatus and method
US20160078289A1 (en) * 2014-09-16 2016-03-17 Foundation for Research and Technology - Hellas (FORTH) (acting through its Institute of Computer Science) Gesture Recognition Apparatuses, Methods and Systems for Human-Machine Interaction
CN108549489B (en) * 2018-04-27 2019-12-13 哈尔滨拓博科技有限公司 gesture control method and system based on hand shape, posture, position and motion characteristics
CN110443286B (en) * 2019-07-18 2024-06-04 广州方硅信息技术有限公司 Training method of neural network model, image recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant