CN107066935B - Hand posture estimation method and device based on deep learning - Google Patents


Info

Publication number
CN107066935B
Authority
CN
China
Prior art keywords
hand
image
point cloud
node
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710061286.8A
Other languages
Chinese (zh)
Other versions
CN107066935A (en)
Inventor
张波
丛林
赵辰
李晓燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co.,Ltd.
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN201710061286.8A
Publication of CN107066935A
Application granted
Publication of CN107066935B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G06V 40/11 Hand-related biometrics; Hand pose recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The embodiment of the invention relates to the technical field of communication and computers, and provides a hand posture estimation method and device based on deep learning, which comprises the following steps: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; acquiring a three-dimensional point cloud image of the hand according to the hand image; and performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology. In the scheme, the hand posture estimation method based on the deep learning technology avoids the operation of manually extracting hand features, and greatly improves the robustness of the hand posture estimation effect.

Description

Hand posture estimation method and device based on deep learning
Technical Field
The embodiment of the invention relates to the technical field of communication and computers, in particular to a hand posture estimation method and device based on deep learning.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Human-computer interaction refers to the field that specializes in studying the interaction between a system and its users; it plays an increasingly important role in daily life and can greatly improve the user experience. Although traditional human-computer interaction means, such as the mouse and keyboard, can satisfy interaction needs to a certain degree, their convenience is greatly limited. Gesture recognition is an important technology in human-computer interaction and a popular research topic at present; it achieves interaction by recognizing gestures statically or dynamically, issuing recognition instructions, and having the system execute the related instructions.
The present disclosure discusses three-dimensional hand posture estimation within gesture recognition technology, which has also gained academic and industrial interest in recent years with the popularity of human-computer interaction technology. Three-dimensional hand posture estimation techniques fall into two broad categories: 1) methods based on discriminative models; 2) methods based on generative models. A discriminative model is a learning-based method that first performs feature extraction on the hand region image and then recognizes the hand posture by constructing a classifier. A generative model has difficulty recovering the posture after hand tracking fails, and its large amount of computation makes it slow and of limited practicality; the discriminative method is fast, but its estimation result contains a certain error and the recoverable postures are limited.
Chinese patent CN201510670919.6, published on March 9, 2016, discloses a method and a system for three-dimensional gesture posture estimation based on depth data, comprising the following steps: 1. extracting depth data and extracting the hand ROI (region of interest): (1) acquiring skeleton point information using the SDK and detecting the hand ROI from a certain skeleton point of the hand; (2) if the skeleton point information of the hand cannot be acquired, detecting the hand ROI in a skin-color-based manner; 2. preliminarily estimating the three-dimensional global direction of the hand: first extracting features, then regressing the global direction of the hand according to a trained classifier; 3. estimating the joint posture of the three-dimensional gesture: realizing hand posture estimation according to the trained classifier and finally correcting the posture. The method first completes the segmentation of the hand ROI data with the two mutually complementary modes, then completes the global direction estimation of the hand with a regression algorithm, and finally, taking these data as assistance, applies a regression algorithm again to realize three-dimensional gesture posture estimation.
Chinese patent CN201610321710.3, published on October 26, 2016, discloses a hand posture estimation method based on depth information and a correction method, comprising the following steps: 1. acquiring depth data of a hand and segmenting the hand region from the hand depth data; 2. detecting the palm posture from the hand region; 3. calculating the positions of all joint points of the hand by combining the palm posture with a standard hand skeleton model; 4. calculating the projection features of each joint point of the hand; 5. correcting the finger postures according to the projection features of each joint point of the hand. That invention works directly from depth data: it estimates the finger postures by segmenting the hand region, calculating the palm posture, and then correcting with the depth image and the posture.
Disclosure of Invention
However, the aforementioned patents CN201510670919.6 and CN201610321710.3 have several problems as follows:
1. In the posture estimation process, the former uses a regression method and the latter uses a random forest method; both manually design features and perform hand feature extraction. The feature design process is tedious, and at the same time the designed features cannot fully represent the characteristics of the hand, which greatly affects the final posture estimation result.
2. In detecting the hand ROI (region of interest): the hand skeleton points are detected directly with the SDK associated with the Kinect v2 sensor, and after switching to another sensor the skeleton points cannot necessarily be detected, so the applicability is not strong; meanwhile, segmenting the hand by skin color detection is strongly affected by external environmental factors and the like, requires special handling for the skin colors of different races, and is likewise not very applicable.
Therefore, an improved hand pose estimation method and device based on deep learning are needed to solve the defect of poor robustness of hand pose estimation effect caused by manual hand feature extraction in the prior art.
In this context, embodiments of the present invention are intended to provide a method and an apparatus for estimating hand pose based on deep learning.
In a first aspect of embodiments of the present invention, a method for estimating hand pose based on deep learning is provided, including:
detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest;
acquiring a three-dimensional point cloud image of the hand according to the hand image; and
performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology.
In some embodiments, according to the method of the above embodiments of the present invention, the detecting the region of interest of the hand on the depth image includes:
acquiring the depth image by adopting a depth sensor in image acquisition equipment; and
extracting the hand region of interest from the depth image by utilizing the relative depth relation of the foreground and the background in the depth image.
In some embodiments, the method according to any of the above embodiments of the invention, segmenting a hand image from the hand region of interest, comprises:
performing edge detection and contour detection on the hand region of interest to detect a hand region;
denoising the hand region to segment the hand image.
In some embodiments, according to the method of any one of the above embodiments of the present invention, the contour detection of the hand region of interest adopts an image concave-convex point detection method.
In some embodiments, the method according to any of the above embodiments of the invention, acquiring a three-dimensional point cloud image including a hand from the hand image, includes:
calibrating the internal parameters of the image acquisition equipment;
preliminarily acquiring a three-dimensional point cloud image containing a hand according to the internal parameters;
carrying out size normalization processing on the preliminarily acquired three-dimensional point cloud image of the hand to obtain the processed three-dimensional point cloud image containing the hand.
In some embodiments, the method according to any of the above embodiments of the present invention, wherein the step of performing hand pose estimation on the three-dimensional point cloud image of the hand by using a deep learning technique includes:
making a hand training data set, and acquiring hand node marking information of the hand training data set;
extracting a three-dimensional point cloud area of the hand from the hand training data set;
training by utilizing a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node marking information to form a hand posture model;
acquiring the joint node positions of the hand by utilizing the hand gesture model according to the three-dimensional point cloud image of the hand.
In some embodiments, the method of creating a hand training data set and obtaining hand node labeling information of the hand training data set according to any of the above embodiments of the present invention includes:
extracting three-dimensional point cloud data of a hand region in the depth image;
constructing an initialized three-dimensional hand model and a loss function L fitted by three-dimensional point cloud data of the hand area, and performing iterative optimization on the loss function L;
when the loss function L is subjected to iterative optimization and meets a preset convergence condition, a successfully-fitted three-dimensional hand model is obtained;
acquiring hand node marking information of the hand training data set according to the successfully fitted three-dimensional hand model.
In some embodiments, the method according to any of the above embodiments of the present invention, creating a hand training data set, and acquiring hand node labeling information of the hand training data set, further includes:
acquiring a color image and a depth image corresponding to the color image, wherein the color image comprises a hand wearing a colored wristband;
segmenting the hand region according to the position of the colored wristband in the color image.
In some embodiments, the convolutional neural network model comprises a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each convolutional layer and after each fully-connected layer.
In some embodiments, the method according to any of the above embodiments of the invention, wherein the plurality of convolutional layers comprises two convolutional layers with a convolutional kernel size of 5 × 5 and one convolutional layer with a convolutional kernel size of 3 × 3.
In some embodiments, according to the method of any of the above embodiments of the present invention, the plurality of pooling layers includes two pooling layers with a step size of 2 and one pooling layer with a step size of 1.
In some embodiments, the active layer is a ReLU function according to the method of any of the above embodiments of the present invention.
In some embodiments, the method according to any of the above embodiments of the present invention, acquiring joint node positions of a hand by using the hand pose model according to the three-dimensional point cloud image of the hand, includes:
inputting the three-dimensional point cloud image of the hand to the convolutional neural network model;
outputting, by the last full connection layer of the plurality of full connection layers, a hand gesture parameter with a preset number of degrees of freedom to the hand gesture model;
the hand gesture model outputs joint positions of each joint of the hand.
In some embodiments, the method according to any of the above embodiments of the invention, wherein the hand pose model outputs joint node positions of the hand, comprises:
designing a global loss function G (Ψ) in the hand pose model;
performing iterative optimization on the global loss function by adopting a preset method;
outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
In some embodiments, according to the method of any of the above embodiments of the present invention, the global loss function G (Ψ) comprises a node position loss function Gjoint(Ψ) and the node constraint loss function GDoFconstraint(Ψ), wherein the global loss function G (Ψ) satisfies the following equation:
G(Ψ)=Gjoint(Ψ)+λGconstraint(Ψ)
where λ is a weight adjustment factor for the global loss function G (Ψ).
In some embodiments, according to the method of any of the above embodiments of the present invention, the node position loss function G_joint(Ψ) is:
G_joint(Ψ) = ||F(Ψ) − Y_gt||²
where F(Ψ) is the forward kinematics function, Ψ = (ψ_1, …, ψ_D) is the hand gesture parameter with the preset number of degrees of freedom, ψ_i is the rotation angle of the hand joint node corresponding to the i-th degree of freedom, and Y_gt is the hand node marking information in the hand training data set.
In some embodiments, according to the method of any of the above embodiments of the invention, the node constraint loss function G_constraint(Ψ) is:
G_constraint(Ψ) = Σ_{i=1..D} [ max(ψ_i^min − ψ_i, 0) + max(ψ_i − ψ_i^max, 0) ]
where ψ_i^min is the lower bound of the degree of freedom of the node, and ψ_i^max is the upper bound of the degree of freedom of the node.
In a second aspect of embodiments of the present invention, there is provided a hand pose estimation device based on deep learning, comprising:
the hand image segmentation module is used for detecting a hand region of interest of the depth image and segmenting a hand image from the hand region of interest;
the point cloud image acquisition module is used for acquiring a three-dimensional point cloud image containing a hand according to the hand image; and
the hand posture estimation module is used for estimating the hand posture of the three-dimensional point cloud image of the hand by adopting a deep learning technology.
In some embodiments, according to the apparatus of the above-mentioned embodiments of the present invention, the hand image segmentation module includes an image acquisition device and a hand region-of-interest extraction unit, wherein:
the image acquisition equipment is used for acquiring the depth image through a depth sensor in the image acquisition equipment;
the hand region-of-interest extraction unit is used for extracting the hand region-of-interest from the depth image by using the relative depth relation of the foreground and the background in the depth image.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the hand image segmentation module comprises a hand region detection unit and a hand image segmentation unit, wherein:
the hand region detection unit is used for carrying out edge detection and contour detection on the hand region of interest to detect a hand region;
the hand image segmentation unit is used for denoising the hand region and segmenting the hand image.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the hand pose estimation module includes a training data set making sub-module, a point cloud region extraction sub-module, a hand pose model making sub-module, and a hand pose estimation sub-module, wherein:
the training data set making submodule is used for making a hand training data set and obtaining hand node marking information of the hand training data set;
the point cloud area extraction submodule is used for extracting a three-dimensional point cloud area of the hand from the hand training data set;
the hand gesture model making sub-module is used for training a hand gesture model by utilizing a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node marking information;
the hand posture estimation submodule is used for acquiring the joint node positions of the hand by utilizing the hand posture model according to the three-dimensional point cloud image of the hand.
In some embodiments, the apparatus according to any of the above embodiments of the present invention, wherein the convolutional neural network model comprises a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each of the convolutional layers and after each of the fully-connected layers.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the hand pose estimation sub-module comprises a point cloud image input unit, a pose parameter output unit, and a joint node position output unit, wherein:
the point cloud image input unit is used for inputting the three-dimensional point cloud image of the hand part into the convolutional neural network model;
the gesture parameter output unit is used for outputting hand gesture parameters with preset number of degrees of freedom to the hand gesture model by the last full connection layer in the full connection layers;
and the joint node position output unit is used for outputting each joint node position of the hand by the hand posture model.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the joint node position output unit includes a global loss function design subunit, an iterative optimization subunit, and a joint node position output subunit, wherein:
the global loss function design subunit is used for designing a global loss function G (Ψ) in the hand posture model;
the iterative optimization subunit performs iterative optimization on the global loss function by adopting a preset method;
the joint node position output subunit is used for outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the global loss function G(Ψ) comprises a node position loss function G_joint(Ψ) and a node constraint loss function G_constraint(Ψ), wherein the global loss function G(Ψ) satisfies the following equation:
G(Ψ) = G_joint(Ψ) + λG_constraint(Ψ)
where λ is a weight adjustment factor of the global loss function G(Ψ).
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the node position loss function G_joint(Ψ) is:
G_joint(Ψ) = ||F(Ψ) − Y_gt||²
where F(Ψ) is the forward kinematics function, Ψ is the hand gesture parameter with the preset number of degrees of freedom, ψ_i is the rotation angle of the hand joint node corresponding to the i-th degree of freedom, and Y_gt is the hand node marking information in the hand training data set.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the node constraint loss function G_constraint(Ψ) is:
G_constraint(Ψ) = Σ_{i=1..D} [ max(ψ_i^min − ψ_i, 0) + max(ψ_i − ψ_i^max, 0) ]
where ψ_i^min is the lower bound of the degree of freedom of the node, and ψ_i^max is the upper bound of the degree of freedom of the node.
The hand posture estimation method and device based on deep learning according to the embodiment of the invention: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; acquiring a three-dimensional point cloud image of the hand according to the hand image; and performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology. In the scheme, the hand posture estimation method based on the deep learning technology avoids the operation of manually extracting hand features, greatly improves the robustness of the hand posture estimation effect, and overcomes the defect of a poor hand posture estimation effect caused by manually extracting hand features in the prior art.
In addition, according to some embodiments, the relative depth relation between the foreground and the background in the depth image can be directly utilized to extract the hand region of interest and segment the hand image by the contour detection method, so that the hand detection effect is further improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a flow diagram of a method of deep learning based hand pose estimation according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart for acquiring a hand image from a depth image according to an embodiment of the present invention;
FIG. 3 schematically illustrates a flow chart of hand pose estimation for a three-dimensional point cloud image of a hand using a deep learning technique, according to an embodiment of the invention;
figure 4 schematically illustrates a flow chart for producing a hand training data set according to an embodiment of the present invention;
FIG. 5 schematically shows a schematic diagram of a method of hand pose estimation based on deep learning according to an embodiment of the invention;
FIG. 6 schematically shows a schematic diagram of a deep learning based hand pose estimation apparatus according to an embodiment of the present invention;
FIG. 7 schematically shows another schematic diagram of a deep learning based hand pose estimation apparatus according to an embodiment of the invention;
FIG. 8 schematically illustrates an exemplary diagram of a computer-readable storage medium according to an embodiment of the invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
The principle of the hand posture estimation method based on deep learning used in the present invention will be explained below.
Deep learning is a new field in machine learning research. Its motivation is to establish and simulate a neural network of the human brain for analytical learning; it imitates the mechanism of the human brain to interpret data such as images, sounds, and texts. In an exemplary embodiment of the present disclosure, the Convolutional Neural Network (CNN) technique is used as one of the deep learning techniques. A convolutional neural network combines the two-dimensional discrete convolution operation of image processing with an artificial neural network; "convolutional" refers to the convolution layers arranged at the front end of the network. Such convolution operations can be used to automatically extract features.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventor finds that, when estimating the hand posture based on deep learning, the following method can be adopted: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; acquiring a three-dimensional point cloud image of the hand according to the hand image; and performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology. In this scheme, the hand posture estimation method based on the deep learning technology avoids the operation of manually extracting hand features, greatly improves the robustness of the hand posture estimation effect, and overcomes the defect of a poor hand posture estimation effect caused by manually extracting hand features in the prior art.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
A method for hand pose estimation based on deep learning according to an exemplary embodiment of the present invention is described below with reference to fig. 1. It should be noted that the following exemplary embodiments are merely illustrated for the convenience of understanding the spirit and principle of the present invention, and the embodiments of the present invention are not limited in any way in this respect.
The hand posture estimation method based on deep learning comprises: first, accurately segmenting a complete hand region of interest from the acquired depth image using a foreground detection method and a contour detection method; then automatically constructing hand features using a convolutional neural network in deep learning and acquiring the 3D node positions of the hand J = {j_i}, i = 1, …, N, where j_i = (x_i, y_i, z_i) is the 3D position of the i-th hand joint in a single depth map, so as to achieve the purpose of hand posture estimation. The embodiment of the disclosure also discloses a method for making a hand training data set.
Fig. 1 schematically shows a flow diagram of a method for hand pose estimation based on deep learning according to an embodiment of the present invention. As shown in fig. 1, the method may include steps S10, S20, and S30.
The method starts in step S10 with hand region-of-interest (ROI) detection on the depth image and segmentation of the hand image from the ROI.
In the embodiment of the present invention, referring to fig. 2, the step S10 may further include steps S11, S12, S13, and S14.
Wherein in step S11, the depth image is acquired using a depth sensor in the image capture device.
For example, the depth sensor may be an Astra depth camera, but this disclosure does not limit this. The raw depth image may be acquired directly from the depth sensor.
The present embodiment is based primarily on depth data, with the objective of estimating the posture state of the hand in the depth data. It uses depth data as input; compared with a conventional camera, a depth sensor can obtain the distance information of the photographed object and easily segment the target from the background.
In step S12, the hand region of interest is extracted from the depth image using the relative depth relationship between the foreground and background in the depth image.
In the embodiment of the present invention, the ROI region including the hand may be extracted from the depth image by a background subtraction method, for example, a frame difference method, using the relative depth relationship between the foreground and the background in the depth image.
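As an illustration of this step, the following Python sketch extracts a hand ROI mask from a raw depth image, assuming the hand is the object nearest to the sensor; the band width near_band_mm and the validity threshold min_valid_mm are illustrative values, not parameters prescribed by the disclosure.

```python
import numpy as np

def extract_hand_roi(depth, near_band_mm=150, min_valid_mm=100):
    """Extract a hand ROI mask from a depth image (uint16, millimeters).

    Relies on the relative depth relationship between foreground and
    background: the nearest valid surface is taken as the hand candidate.
    """
    valid = depth > min_valid_mm                 # drop zero / invalid readings
    if not valid.any():
        return np.zeros_like(depth, dtype=bool)
    nearest = int(depth[valid].min())            # closest surface to the sensor
    # keep everything within a thin depth band behind the nearest point
    return valid & (depth < nearest + near_band_mm)
```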
In step S13, edge detection and contour detection are performed on the hand region of interest, and a hand region is detected.
In the embodiment of the invention, the outline detection of the hand region of interest can adopt an image concave-convex point detection method to detect the approximate outline region of the hand, and the hand segmentation operation is preliminarily completed.
In step S14, the hand region is subjected to noise reduction processing to segment the hand image.
After the hand segmentation operation is preliminarily completed, the detected hand region can be subjected to the following denoising processing to finally segment a complete hand image (a code sketch follows the list):
1) median filtering: eliminating the influence of the jitter and partial noise of the depth data;
2) morphological processing: the hand image is first dilated and then eroded, achieving the purposes of contour smoothing and edge enhancement;
3) hole filling processing: according to the neighborhood correlation of the image, the neighborhood of each black hole is filled using the surrounding depth values, suppressing the influence of black holes in the hand image on posture recognition.
It should be noted that the above denoising processing method is only for illustrative purposes, and the present disclosure does not limit the specific denoising processing method, and any image denoising processing method in the prior art may be adopted.
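As a concrete illustration of steps S13 and S14, the OpenCV sketch below combines contour detection with the three denoising operations listed above; the kernel sizes and the choice of the largest external contour as the hand region are assumptions made for this sketch.

```python
import cv2
import numpy as np

def segment_hand(roi_mask_u8):
    """Denoise a binary hand ROI mask (uint8, 0/255) and keep the hand contour."""
    # 1) median filtering: suppress depth jitter and speckle noise
    mask = cv2.medianBlur(roi_mask_u8, 5)
    # 2) morphology: dilate then erode (closing) to smooth the contour
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # 3) hole filling: fill the interior of the hand contour from its neighborhood
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return mask
    hand = max(contours, key=cv2.contourArea)    # largest contour = hand region
    filled = np.zeros_like(mask)
    cv2.drawContours(filled, [hand], -1, 255, thickness=cv2.FILLED)
    # concave/convex (convexity-defect) analysis of `hand` could refine this
    return filled
```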
After the step S10 is completed, a step S20 may be further performed to obtain a three-dimensional point cloud image of the hand from the hand image.
In the embodiment of the present invention, a three-dimensional (3D) point cloud image including a hand is obtained according to the hand image, and the following method may be adopted:
calibrating the internal parameters of the image acquisition equipment;
preliminarily acquiring a three-dimensional point cloud image containing a hand according to the internal parameters;
carrying out size normalization processing on the preliminarily acquired three-dimensional point cloud image of the hand to obtain the processed three-dimensional point cloud image containing the hand.
Specifically, the image acquisition equipment is calibrated to obtain its internal parameters, and the 3D point cloud of the hand image is preliminarily calculated from these parameters. The 3D point cloud image containing the hand is then converted to a preset size (for example, 128 × 128 pixels, although the disclosure is not limited to this size), and the depth values are normalized to [-1, 1] to suppress the influence of hand scale transformation on the result. The final processed 3D point cloud image can then be used as the input to a convolutional neural network, as sketched below.
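A minimal sketch of this back-projection and normalization, assuming the calibrated internal parameters fx, fy, cx, cy refer to the cropped hand image and using the example size of 128 × 128 pixels:

```python
import cv2
import numpy as np

def depth_to_normalized_cloud(depth_crop, fx, fy, cx, cy, out_size=128):
    """Back-project a cropped hand depth image into a 3D point cloud image
    (H x W x 3) and normalize it for input to the convolutional network."""
    h, w = depth_crop.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_crop.astype(np.float32)
    x = (u - cx) * z / fx                        # pinhole back-projection
    y = (v - cy) * z / fy
    cloud = np.stack([x, y, z], axis=-1)

    # resize to the preset input size (nearest neighbor keeps depth edges sharp)
    cloud = cv2.resize(cloud, (out_size, out_size), interpolation=cv2.INTER_NEAREST)

    # normalize depth to [-1, 1] to suppress the effect of hand scale changes
    zc = cloud[..., 2]
    valid = zc > 0
    if valid.any():
        center = zc[valid].mean()
        half_range = max(float(np.abs(zc[valid] - center).max()), 1e-6)
        cloud[..., 2] = np.where(valid, (zc - center) / half_range, 0.0)
    return cloud
```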
After the step S20 is completed, a step S30 may be further performed, in which the hand pose estimation is performed on the three-dimensional point cloud image of the hand by using a deep learning technique.
In the embodiment of the present invention, referring to fig. 3, performing hand pose estimation on the three-dimensional point cloud image of the hand by using a deep learning technique may include steps S31, S32, S33, and S34.
In step S31, a hand training data set is created, and hand node labeling information of the hand training data set is acquired.
In the embodiment of the present invention, referring to fig. 4, creating a hand training data set, and acquiring hand node labeling information of the hand training data set may include steps S311, S312, S313, S314, S315, S316, S317, and S318.
In step S311, a color image and a corresponding depth image thereof are collected, wherein the color image includes a hand with a colored wrist.
In the embodiment of the invention, the hand training data set can still be acquired using an Astra camera. Color images (e.g., RGB images) and their corresponding depth images are acquired by the Astra camera and may be aligned using Astra's own SDK.
In step S312, the hand region is segmented according to the position of the colored wristband in the color image.
The proportion of the hand in the acquired depth data is very small, and the similarity of depth values makes the hand and the arm difficult to distinguish, so both the background and the arm affect hand posture estimation. To reduce these interference factors, the wrist point positions are introduced by wearing a colored wristband on the wrist; assuming the wrist point positions are thus known, the position of the colored wristband in the color image is analyzed to segment the hand, so that the wrist and the hand can be separated more reliably. All irrelevant content other than the hand region is excluded, only the hand region is retained, and the hand region is segmented from the hand depth data.
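One possible way to locate the colored wristband in the aligned color image is a simple HSV threshold, sketched below; the HSV range assumes a blue wristband and is purely illustrative, since the disclosure only requires that the wristband color be distinctive.

```python
import cv2
import numpy as np

def wristband_position(bgr_image, hsv_low=(100, 120, 80), hsv_high=(130, 255, 255)):
    """Return the wristband mask and its centroid (the assumed wrist point).

    Pixels on the arm side of the wrist point can then be discarded from
    the aligned depth data, leaving only the hand region.
    """
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low), np.array(hsv_high))
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:
        return mask, None                        # wristband not found
    wrist = (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"]))
    return mask, wrist
```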
In step S313, three-dimensional point cloud data of the hand region in the depth image is extracted.
Extracting three-dimensional point clouds of the hand region in the depth image corresponding to the color image, wherein the specific implementation method may refer to step S20, which is not described herein again.
In step S314, a loss function L that is fitted to the initialized three-dimensional hand model and the three-dimensional point cloud data of the hand region is constructed, and iterative optimization is performed on the loss function L.
From the acquired 3D point cloud of the hand region and the three-dimensional hand model, a fitting loss function L(F_3Dcloud, F_model) between the two is constructed, where F_3Dcloud is the 3D point cloud data and F_model is the fitted three-dimensional hand model. The loss function L(F_3Dcloud, F_model) can be iteratively optimized, for example by the Gauss-Newton method, until it satisfies the following convergence condition:
L(F_3Dcloud, F_model) < λ_threshold (1)
where λ_threshold is a preset threshold.
In step S315, it is determined whether the loss function L converges; when the loss function L converges, go to step S317; otherwise, the process proceeds to step S316.
When the convergence condition is met, the three-dimensional hand model is considered successfully fitted, the iteration process ends, and the fitting operation continues with the three-dimensional point cloud data of the next frame's hand region.
In step S316, the three-dimensional hand model is reinitialized.
If the fitting of the three-dimensional hand model fails, i.e., formula (1) does not converge, the three-dimensional hand model needs to be reinitialized, and the fitting operation between the three-dimensional point cloud data of the hand region and the three-dimensional hand model is performed again.
In step S317, a successfully fitted three-dimensional hand model is obtained.
In step S318, hand node labeling information of the hand training data set is obtained according to the successfully fitted three-dimensional hand model.
If the three-dimensional hand model is successfully fitted, i.e., formula (1) converges, the hand node data of the hand training data set are output through the fitted three-dimensional hand model and labeled, and the hand node marking information of the hand training data set is acquired.
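The fitting loop of steps S314 to S318 can be sketched as follows; optimize_step and loss_fn are assumed placeholder helpers standing in for one Gauss-Newton update and the fitting loss L(F_3Dcloud, F_model), respectively.

```python
def fit_hand_model(cloud, model, optimize_step, loss_fn,
                   lam_threshold=1.0, max_iters=100):
    """Iteratively fit an initialized 3D hand model to the hand point cloud."""
    for _ in range(max_iters):
        model = optimize_step(model, cloud)        # one Gauss-Newton update
        if loss_fn(cloud, model) < lam_threshold:  # convergence condition (1)
            return model                           # fit succeeded (step S317)
    return None  # fit failed: re-initialize the model and retry (step S316)
```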
In step S32, a three-dimensional point cloud area of the hand is extracted from the hand training data set.
In step S33, a hand pose model is formed by training using a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node labeling information.
Although the deep learning technique of the convolutional neural network is only exemplified for hand pose estimation in the embodiment of the present invention, the method of the present disclosure may be applied to any deep learning technique, and the present disclosure does not limit this. The convolutional neural network is widely applied because the network avoids the complex preprocessing of the image and can directly input the original image.
A three-dimensional point cloud area of the hand is extracted from the produced hand training data set; then, using the hand point cloud and the hand node marking information in the hand training data set, the hand posture model is trained with a convolutional neural network.
Using the hand posture model trained by the CNN and the three-dimensional point cloud image of the hand obtained in step S20, the node information of the hand posture is acquired through the forward propagation process of the convolutional neural network, thereby achieving the purpose of hand posture estimation.
The network model of the convolutional neural network can be designed as follows:
in an embodiment of the present invention, the convolutional neural network model includes a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each convolutional layer and after each fully-connected layer.
In the embodiment of the invention, multilayer convolution is used, followed by full connection layers for training. The purpose of the multilayer convolution is that the features learned by one convolution layer are local; the higher the number of layers, the more global the learned features become.
Referring to fig. 5, the convolutional neural network model comprises three convolutional layers C1, C2 and C3, each followed by a pooling layer, in turn P1, P2 and P3, and finally three full connection layers connected in sequence, namely FC1, FC2 and FC3, wherein the first two full connection layers may contain 1024-dimensional features.
In an embodiment of the present invention, the plurality of convolutional layers includes two convolutional layers having a convolutional kernel size of 5 × 5 and one convolutional layer having a convolutional kernel size of 3 × 3.
The convolutional neural network can retain spatial information by using small convolution kernels while remaining robust to translation. The convolution layers with a convolution kernel size of 5 × 5 have a good feature extraction effect, and their small number of parameters keeps the amount of computation small, which is convenient to implement; the convolution layer with a convolution kernel size of 3 × 3 can enhance the extracted effective features at the end of the network, thereby increasing the training capability of the network.
Each convolution is a feature extraction method that screens out the parts of the image that satisfy its condition (the larger the activation value, the better the condition is satisfied). In other embodiments, more convolution kernels may be used; for example, 32 convolution kernels can learn 32 features. Each convolution kernel produces a new image from the input image; for example, two convolution kernels may generate two images that can be viewed as different channels of one image.
After the features are obtained by convolution, they are used for classification. The classifier could be trained with all the extracted features; for example, for a 96 × 96 pixel image, assuming that 400 features defined on 8 × 8 inputs have been learned, each convolution of a feature with the image yields a convolved feature of dimension (96 − 8 + 1) × (96 − 8 + 1) = 7921, and since there are 400 features, each example yields a convolved feature vector of dimension 7921 × 400 = 3,168,400. Learning a classifier with more than 3 million feature inputs is inconvenient and prone to overfitting.
In the embodiment of the present invention, the plurality of pooling layers includes two pooling layers with a step size of 2 and one pooling layer with a step size of 1.
To solve the overfitting problem when describing a large image, features at different locations can be aggregated statistically; for example, the average (or maximum) value of a particular feature over a region of the image can be computed. These summary features not only have much lower dimensionality (compared with using all extracted features) but also improve the result (they are less prone to overfitting). This aggregation operation is called pooling, sometimes average pooling or max pooling depending on the method by which the pooling is computed.
In the embodiment of the present invention, the active layer is a ReLU function.
The form of the ReLU function is shown below:
f(x)=max(0,x) (2)
in the embodiment, the ReLU function is selected for the active layer behind each convolutional layer, the ReLU function can zero out non-positive elements, a good effect is achieved in the aspect of keeping effective neurons, the problem of gradient explosion is effectively avoided, a hidden layer taking the ReLU function as the active function is added behind each convolutional layer, the neurons smaller than 0 can be removed by the active function, therefore, after the convolutional neural network model needing to be learned is built by screening out effective features, parameters of the network model are trained by continuously reducing the numerical value of the loss function, and the quality of images is improved. Training the convolutional neural network model to form a corresponding hand posture model so as to construct mapping of the input image to the positions of the hand joint points, and finally processing the corresponding image through the established effective mapping to obtain the positions of all the joint points of the hand.
According to the embodiment of the invention, convolutional layers and activation layers are introduced; good features are obtained by virtue of the learning capability of the convolutional layers and the screening capability of the activation layers, which greatly enhances the learning capability of the neural network. An accurate mapping from the input image to the output is learned, so that hand posture prediction and estimation can be carried out through the learned mapping.
The number of convolutional layers and the convolution kernel sizes selected in the convolutional neural network model established in the embodiment of the invention avoid problems such as gradient explosion, overfitting, and excessive computational complexity during training while preserving the capability of the neural network; the pooling layers introduced when training the convolutional neural network model facilitate training and provide sufficient capability to obtain good results.
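Under these design choices, a PyTorch sketch of the described network might look as follows. The channel counts (32, 64, 64), the padding, and the 3-channel 128 × 128 point cloud input are assumptions, since the text fixes only the kernel sizes, pooling strides, and 1024-dimensional fully-connected layers; the final regression layer FC3 is also left linear here, although the text places an activation after every fully-connected layer.

```python
import torch
import torch.nn as nn

class HandPoseNet(nn.Module):
    """Three conv layers (5x5, 5x5, 3x3), a pooling layer after each
    (strides 2, 2, 1), then three fully-connected layers, with a ReLU
    after every convolutional and hidden fully-connected layer."""

    def __init__(self, dof=26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),   # C1
            nn.MaxPool2d(kernel_size=2, stride=2),                   # P1: 128 -> 64
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),  # C2
            nn.MaxPool2d(kernel_size=2, stride=2),                   # P2: 64 -> 32
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # C3
            nn.MaxPool2d(kernel_size=2, stride=1),                   # P3: 32 -> 31
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 31 * 31, 1024), nn.ReLU(),  # FC1 (1024-dim)
            nn.Linear(1024, 1024), nn.ReLU(),          # FC2 (1024-dim)
            nn.Linear(1024, dof),                      # FC3: pose parameters
        )

    def forward(self, x):  # x: (B, 3, 128, 128) point cloud image
        return self.regressor(self.features(x))
```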
In step S34, joint node positions of the hand are acquired from the three-dimensional point cloud image of the hand using the hand posture model.
In the process of hand pose estimation, the three-dimensional point cloud image of the hand processed in step S20 is input, and the 3D node information of the hand is output. In the process of hand posture estimation, a convolutional neural network technology in a deep learning technology is used.
In the embodiment of the invention, acquiring the joint node positions of the hand by utilizing the hand gesture model according to the three-dimensional point cloud image of the hand may adopt the following manner:
inputting the three-dimensional point cloud image of the hand to the convolutional neural network model;
outputting, by the last full connection layer of the plurality of full connection layers, a hand gesture parameter with a preset number of degrees of freedom to the hand gesture model;
the hand gesture model outputs joint positions of each joint of the hand.
In the embodiment of the present invention, the hand gesture model outputs the positions of the joint nodes of the hand, and the following method may be adopted:
designing a global loss function G (Ψ) in the hand pose model;
performing iterative optimization on the global loss function by adopting a preset method;
outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
With continued reference to fig. 5, the third full connection layer FC3 outputs a hand posture parameter Ψ with, for example, 26 degrees of freedom (the specific number of degrees of freedom may be preset, and may also be 27, 30, or any larger or smaller number). This output is connected to the hand posture model HML, and the forward kinematics model F(Ψ) is used to output the joint node positions of the hand, i.e., the 3D node positions J = {j_i}, i = 1, …, N, where j_i = (x_i, y_i, z_i) is the 3D position in a single depth map. The forward kinematics model satisfies a chain structure of the following form:
j_u(Ψ) = ( ∏_{k ∈ anc(u)} T_k(ψ_k) ) · j_u^0 (3)
i.e., the 3D position of joint node u is obtained by composing the rigid transformations T_k of all its ancestor nodes k along the kinematic chain from the palm root, applied to the reference position j_u^0.
Let the hand posture parameter be Ψ ∈ R^D, where D = 26 may be set as the number of degrees of freedom (DoF) of the hand joint points: 3 degrees of freedom give the global hand (palm) position and 3 degrees of freedom give the global hand (palm) orientation. The remaining degrees of freedom are the rotation angles of the joint nodes (each finger has 4 degrees of freedom: the thumb has 3 degrees of freedom at its palm node and 1 degree of freedom between the finger nodes, while each remaining finger has 2 degrees of freedom at its palm node and two nodes between the finger joints with 1 degree of freedom each, giving 20 degrees of freedom in total). Each rotation angle satisfies ψ_i ∈ [ψ_i^min, ψ_i^max], where ψ_i^min and ψ_i^max are the constraints of its lower and upper bounds. Once all the joint points of the hand are obtained, the joint node positions of the whole hand are obtained.
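A minimal sketch of such a chain-structured forward kinematics function follows, assuming one rotation axis per degree of freedom and a known parent table in topological order; a full 26-DoF hand would additionally apply the 6 global palm degrees of freedom first.

```python
import numpy as np

def rot_x(angle):
    """Rotation about the x-axis: one single-axis joint rotation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def forward_kinematics(psi, bone_offsets, parents, dof_of_joint):
    """Compose each joint's ancestor transforms from the palm root outward.

    psi: pose parameters; bone_offsets[u]: bone vector from parent to joint u
    in the reference pose; parents[u] < u (topological order, -1 for root);
    dof_of_joint[u]: index into psi of the joint's rotation angle.
    """
    n = len(parents)
    positions = np.zeros((n, 3))
    rotations = [np.eye(3) for _ in range(n)]
    for u in range(n):
        angle = psi[dof_of_joint[u]]
        if parents[u] < 0:                       # palm root
            rotations[u] = rot_x(angle)
        else:
            p = parents[u]
            positions[u] = positions[p] + rotations[p] @ bone_offsets[u]
            rotations[u] = rotations[p] @ rot_x(angle)
    return positions                             # rows are j_i = (x_i, y_i, z_i)
```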
In the embodiment of the invention, for the hand posture model HML in the deep learning, the following global loss function G(Ψ) is designed.
The global loss function G(Ψ) comprises a node position loss function G_joint(Ψ) and a node constraint loss function G_constraint(Ψ), wherein the global loss function G(Ψ) satisfies the following equation:
G(Ψ) = G_joint(Ψ) + λG_constraint(Ψ) (4)
where λ is a weight adjustment factor of the global loss function G(Ψ), and Ψ ∈ R^D is the hand posture parameter.
In the embodiment of the invention, the node position loss function G_joint(Ψ) is:
G_joint(Ψ) = ||F(Ψ) − Y_gt||² (5)
where F(Ψ) is the forward kinematics function, Ψ is the hand posture parameter with the preset number of degrees of freedom, ψ_i is the rotation angle of the hand joint node corresponding to the i-th degree of freedom, and Y_gt is the hand node marking information in the hand training data set, i.e., the labeled hand node positions in the hand training data set.
In the embodiment of the invention, the node constraint loss function G_constraint(Ψ) is:
G_constraint(Ψ) = Σ_{i=1..D} [ max(ψ_i^min − ψ_i, 0) + max(ψ_i − ψ_i^max, 0) ] (6)
where ψ_i^min is the lower bound of the degree of freedom of the node and ψ_i^max is the upper bound of the degree of freedom of the node.
In the above iterative optimization process of the global loss function, the parameters can be optimized using the standard stochastic gradient descent method. It should be noted that the present invention is not limited to optimizing the parameters by standard stochastic gradient descent; any algorithm usable for parameter optimization may be adopted.
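Combining the two terms, a hedged PyTorch sketch of the global loss and its optimization by standard stochastic gradient descent is given below; the squared-error joint term, the hinge-style bound penalties, and λ = 0.1 are assumed forms consistent with equations (4) to (6), not values taken verbatim from the original filing.

```python
import torch

def global_loss(psi, y_gt, fk, psi_min, psi_max, lam=0.1):
    """G(psi) = G_joint(psi) + lambda * G_constraint(psi).

    fk is a differentiable forward-kinematics function returning the joint
    positions F(psi); psi_min / psi_max are the per-DoF lower / upper bounds.
    """
    g_joint = ((fk(psi) - y_gt) ** 2).sum()                  # node position loss
    g_constraint = (torch.clamp(psi_min - psi, min=0).sum()  # below lower bound
                    + torch.clamp(psi - psi_max, min=0).sum())  # above upper bound
    return g_joint + lam * g_constraint

# Iterative optimization with standard stochastic gradient descent, where
# `net` is the convolutional network predicting psi from the point cloud:
#   optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
#   psi = net(cloud_batch)
#   loss = global_loss(psi, y_gt, fk, psi_min, psi_max)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```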
The embodiment of the invention provides a hand posture estimation method based on deep learning, comprising: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; acquiring a three-dimensional point cloud image of the hand according to the hand image; and performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology. In this scheme, the hand posture estimation method based on the deep learning technology avoids the operation of manually extracting hand features and greatly improves the robustness of the hand posture estimation effect, thereby overcoming the defect in the prior art that the hand posture estimation effect is poor because hand features must be manually extracted with a traditional machine learning method (such as regression or random forests). Meanwhile, a node constraint condition is added when estimating the hand posture, which greatly improves the robustness of the method. The embodiment also provides a complete hand posture estimation pipeline, comprising hand segmentation, hand training data set production, and hand posture estimation.
In addition, according to some embodiments, the relative depth relation between the foreground and the background in the depth image can be directly utilized to extract the hand region of interest and segment the hand image by the contour detection method, so that the hand detection effect is further improved.
Exemplary device
Having described the method of an exemplary embodiment of the present invention, a deep learning based hand pose estimation apparatus of an exemplary embodiment of the present invention is described next with reference to fig. 6.
Fig. 6 schematically shows a schematic diagram of a hand pose estimation device 10 based on deep learning according to an embodiment of the present invention. As shown in fig. 6, the apparatus 10 may include:
the hand image segmentation module 100 may be configured to perform hand region-of-interest detection on the depth image, and segment a hand image from the hand region-of-interest;
a point cloud image obtaining module 110, configured to obtain a three-dimensional point cloud image including a hand according to the hand image; and
the hand pose estimation module 120 may be configured to perform hand pose estimation on the three-dimensional point cloud image of the hand by using a deep learning technique.
In this embodiment of the present invention, optionally, the hand image segmentation module 100 may include an image acquisition device and a hand region-of-interest extraction unit, where:
the image acquisition equipment is used for acquiring the depth image through a depth sensor in the image acquisition equipment;
the hand region-of-interest extraction unit is used for extracting the hand region-of-interest from the depth image by using the relative depth relation of the foreground and the background in the depth image.
In this embodiment of the present invention, optionally, the hand image segmentation module 100 may further include a hand region detection unit and a hand image segmentation unit, where:
the hand region detection unit is used for carrying out edge detection and contour detection on the hand region of interest to detect a hand region;
the hand image segmentation unit is used for denoising the hand region and segmenting the hand image.
In this embodiment of the present invention, optionally, the hand posture estimation module 120 may include a training data set making sub-module, a point cloud region extraction sub-module, a hand posture model making sub-module, and a hand posture estimation sub-module, where:
the training data set making submodule is used for making a hand training data set and obtaining hand node marking information of the hand training data set;
the point cloud area extraction submodule is used for extracting a three-dimensional point cloud area of the hand from the hand training data set;
the hand gesture model making sub-module is used for training a hand gesture model by utilizing a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node marking information;
the hand posture estimation submodule is used for acquiring the joint node positions of the hand by utilizing the hand posture model according to the three-dimensional point cloud image of the hand.
In an embodiment of the present invention, the convolutional neural network model may include a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each convolutional layer and after each fully-connected layer.
In the embodiment of the present invention, optionally, the hand pose estimation sub-module may include a point cloud image input unit, a pose parameter output unit, and a joint node position output unit, where:
the point cloud image input unit is used for inputting the three-dimensional point cloud image of the hand into the convolutional neural network model;
the pose parameter output unit is used for outputting, from the last fully connected layer of the plurality of fully connected layers, hand gesture parameters with a preset number of degrees of freedom to the hand gesture model;
and the joint node position output unit is used for causing the hand posture model to output each joint node position of the hand.
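The hand posture model converts the output pose parameters, i.e. joint rotation angles, into joint node positions through a forward kinematics function F(Ψ), discussed below. As a toy illustration only, a planar version for a single kinematic chain (one finger) might look like this; the function name and the planar simplification are assumptions:

    import numpy as np

    def forward_kinematics(angles, bone_lengths):
        """Toy planar forward kinematics for one finger chain (illustration only).

        angles: per-joint rotation angles in radians; bone_lengths: segment
        lengths. Returns the 2D position of every joint along the chain.
        """
        positions, theta, p = [], 0.0, np.zeros(2)
        for a, l in zip(angles, bone_lengths):
            theta += a                                    # accumulate rotation along the chain
            p = p + l * np.array([np.cos(theta), np.sin(theta)])
            positions.append(p.copy())
        return np.stack(positions)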
In this embodiment of the present invention, optionally, the joint node position output unit includes a global loss function design subunit, an iterative optimization subunit, and a joint node position output subunit, where:
the global loss function design subunit is used for designing a global loss function G (Ψ) in the hand posture model;
the iterative optimization subunit is used for performing iterative optimization on the global loss function by adopting a preset method (one possible loop is sketched after this list);
and the joint node position output subunit is used for outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
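The patent does not fix a particular preset method or convergence condition. Purely as an illustration of such an iterative optimization loop, the following sketch uses finite-difference gradient descent with a loss-change tolerance; the learning rate, tolerance, and gradient method are invented stand-ins:

    import numpy as np

    def iterate_until_converged(loss, psi0, lr=1e-2, tol=1e-6, max_iter=1000):
        """Finite-difference gradient descent on the global loss (sketch only)."""
        psi = psi0.astype(float)
        prev = loss(psi)
        for _ in range(max_iter):
            grad = np.zeros_like(psi)
            eps = 1e-5
            for i in range(psi.size):                 # numerical gradient, one DoF at a time
                step = np.zeros_like(psi)
                step[i] = eps
                grad[i] = (loss(psi + step) - loss(psi - step)) / (2 * eps)
            psi -= lr * grad
            cur = loss(psi)
            if abs(prev - cur) < tol:                 # preset convergence condition
                break
            prev = cur
        return psi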
In this embodiment of the present invention, optionally, the global loss function G(Ψ) includes a node position loss function G_joint(Ψ) and a node constraint loss function G_constraint(Ψ), wherein the global loss function G(Ψ) satisfies the following equation:

G(Ψ) = G_joint(Ψ) + λ·G_constraint(Ψ)

where λ is a weight adjustment factor for the global loss function G(Ψ).
In this embodiment of the present invention, optionally, the node position loss function G_joint(Ψ) is:

G_joint(Ψ) = ‖F(Ψ) − Y_gt‖²

where F(Ψ) is the forward kinematics function that maps the hand gesture parameters to the joint node positions of the hand, Ψ is the hand gesture parameter of the preset number of degrees of freedom, θ is the rotation angle of the joint node of the hand corresponding to the preset number of degrees of freedom, and Y_gt is the hand node marking information in the hand training data set.
In the embodiment of the present invention, optionally, the node constraint loss function G_constraint(Ψ) is:

G_constraint(Ψ) = Σ_i [ max(θ_i^low − θ_i, 0) + max(θ_i − θ_i^high, 0) ]

where θ_i is the rotation angle of the joint node of the hand corresponding to the i-th degree of freedom, θ_i^low is a lower bound on the degree of freedom of the node, and θ_i^high is an upper bound on the degree of freedom of the node.
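Putting the two terms together, a minimal sketch of the global loss follows. The original equations are rendered as images in the source, so the squared-norm and hinge forms used here are standard reconstructions from the textual definitions above, not verbatim formulas from the patent:

    import numpy as np

    def global_loss(psi, y_gt, theta_low, theta_high, lam, fk):
        """G(psi) = G_joint(psi) + lambda * G_constraint(psi) (reconstructed forms).

        fk plays the role of the forward kinematics function F; psi holds the
        joint rotation angles; theta_low/theta_high bound each degree of freedom.
        """
        g_joint = np.sum((fk(psi) - y_gt) ** 2)           # node position loss
        g_constraint = np.sum(np.maximum(theta_low - psi, 0.0)
                              + np.maximum(psi - theta_high, 0.0))  # bound violations
        return g_joint + lam * g_constraint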
In the embodiment of the invention, a hand posture estimation device based on deep learning is provided, which: detects a hand region of interest in the depth image and segments a hand image from the hand region of interest; acquires a three-dimensional point cloud image of the hand according to the hand image; and performs hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technique. In this scheme, hand posture estimation based on the deep learning technique avoids manually extracting hand features, greatly improves the robustness of the hand posture estimation, and overcomes the poor hand posture estimation caused by manually extracted hand features in the prior art.
In addition, according to some embodiments, the relative depth relation between the foreground and the background in the depth image can be used directly to extract the hand region of interest, and the hand image can be segmented by contour detection, further improving the hand detection performance.
Exemplary device
Having described the method and apparatus of an exemplary embodiment of the present invention, a deep learning based hand pose estimation apparatus according to another exemplary embodiment of the present invention is described next.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," a "module," or a "system."
In some possible embodiments, a deep learning based hand pose estimation apparatus according to the present invention may comprise at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the deep learning based hand pose estimation method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit may execute step S10 as shown in fig. 1: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; step S20: acquiring a three-dimensional point cloud image of the hand according to the hand image; and step S30: performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technique.
The deep learning-based hand pose estimation apparatus 50 according to this embodiment of the present invention is described below with reference to fig. 7. The apparatus 50 shown in fig. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 7, the deep learning based hand pose estimation apparatus 50 is embodied in the form of a general purpose computing device. The components of the deep learning based hand pose estimation device 50 may include, but are not limited to: the at least one processing unit 500, the at least one memory unit 510, and a bus 530 that couples the various system components including the memory unit 510 and the processing unit 500.
Bus 530 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
The storage unit 510 may include readable media in the form of volatile memory, such as random access memory (RAM) 512 and/or cache memory 514, and may further include read-only memory (ROM) 516.
Storage unit 510 may also include a program/utility 518 having a set (at least one) of program modules 5182, such program modules 5182 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The deep learning based hand pose estimation apparatus 50 may also communicate with one or more external devices 560 (e.g., a display device, a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the apparatus 50, and/or with any device (e.g., a router or a modem) that enables the apparatus 50 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 520. The apparatus 50 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 550. As shown, the network adapter 550 communicates with the other modules of the apparatus 50 over the bus 530. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the apparatus 50, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Exemplary program product
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product including program code. When the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the deep learning based hand pose estimation method according to various exemplary embodiments of the present invention described in the "exemplary methods" section above of this specification. For example, the terminal device may perform step S10 shown in fig. 1: detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest; step S20: acquiring a three-dimensional point cloud image of the hand according to the hand image; and step S30: performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technique.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 8, a program product 60 for deep learning based hand pose estimation according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., over the Internet using an Internet service provider).
It should be noted that although several modules or units of the deep learning based hand pose estimation apparatus are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided among a plurality of modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. The division into aspects above is for convenience of presentation only and does not mean that features in these aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (21)

1. A hand posture estimation method based on deep learning comprises the following steps:
detecting a hand region of interest of the depth image, and segmenting a hand image from the hand region of interest;
acquiring a three-dimensional point cloud image containing a hand according to the hand image; and
performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technology;
segmenting a hand image from the hand region of interest, comprising:
performing edge detection and contour detection on the hand region of interest to detect a hand region;
denoising the hand region to segment the hand image;
wherein performing hand posture estimation on the three-dimensional point cloud image of the hand by adopting a deep learning technique comprises:
making a hand training data set, and acquiring hand node marking information of the hand training data set;
extracting a three-dimensional point cloud area of the hand from the hand training data set;
training by utilizing a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node marking information to form a hand posture model;
acquiring joint node positions of the hand by utilizing the hand gesture model according to the three-dimensional point cloud image of the hand;
wherein making the hand training data set and acquiring the hand node marking information of the hand training data set further comprises:
acquiring a color image and the depth image corresponding to the color image, wherein the color image includes a hand with a colored wrist;
and segmenting the hand region according to the position of the colored wrist in the color image.
2. The method of claim 1, performing hand region of interest detection on the depth image, comprising:
acquiring the depth image by adopting a depth sensor in image acquisition equipment; and
extracting the hand region of interest from the depth image by utilizing the relative depth relation of the foreground and the background in the depth image.
3. The method as claimed in claim 1, wherein the contour detection of the hand region of interest adopts an image concave-convex point detection method.
4. The method of claim 2, acquiring a three-dimensional point cloud image containing a hand from the hand image, comprising:
calibrating the internal parameters of the image acquisition equipment;
preliminarily acquiring a three-dimensional point cloud image containing a hand according to the internal parameters;
and carrying out size normalization processing on the preliminarily acquired three-dimensional point cloud image of the hand to obtain the processed three-dimensional point cloud image containing the hand.
5. The method of claim 1, wherein creating a hand training dataset and obtaining hand node labeling information for the hand training dataset comprises:
extracting three-dimensional point cloud data of a hand region in the depth image;
constructing an initialized three-dimensional hand model and a loss function L for fitting the three-dimensional hand model to the three-dimensional point cloud data of the hand region, and performing iterative optimization on the loss function L;
obtaining a successfully fitted three-dimensional hand model when the iterative optimization of the loss function L meets a preset convergence condition;
and acquiring hand node marking information of the hand training data set according to the successfully fitted three-dimensional hand model.
6. The method of claim 1, the convolutional neural network model comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each of the convolutional layers and after each of the fully-connected layers.
7. The method of claim 6, the plurality of convolutional layers comprising two convolutional layers with a convolutional kernel size of 5 x 5 and one convolutional layer with a convolutional kernel size of 3 x 3.
8. The method of claim 6, the plurality of pooling layers comprising two pooling layers of step 2 and one pooling layer of step 1.
9. The method of claim 6, wherein the active layer is a ReLU function.
10. The method of claim 6, wherein obtaining joint node positions of a hand from the three-dimensional point cloud image of the hand using the hand pose model comprises:
inputting the three-dimensional point cloud image of the hand to the convolutional neural network model;
outputting, from the last fully connected layer of the plurality of fully connected layers, a hand gesture parameter with a preset number of degrees of freedom to the hand gesture model; and
outputting, by the hand gesture model, each joint node position of the hand.
11. The method of claim 10, the hand pose model outputting joint node positions of a hand, comprising:
designing a global loss function G (Ψ) in the hand pose model;
performing iterative optimization on the global loss function by adopting a preset method;
and outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
12. The method of claim 11, wherein the global loss function G(Ψ) comprises a node position loss function G_joint(Ψ) and a node constraint loss function G_constraint(Ψ), wherein the global loss function G(Ψ) satisfies the following equation:

G(Ψ) = G_joint(Ψ) + λ·G_constraint(Ψ)

where λ is a weight adjustment factor for the global loss function G(Ψ).
13. The method of claim 12, wherein the node position loss function G_joint(Ψ) is:

G_joint(Ψ) = ‖F(Ψ) − Y_gt‖²

wherein F(Ψ) is a forward kinematics function that maps the hand gesture parameters to the joint node positions of the hand, Ψ is the hand gesture parameter of the preset number of degrees of freedom, θ is the rotation angle of the joint node of the hand corresponding to the preset number of degrees of freedom, and Y_gt is the hand node marking information in the hand training data set.
14. The method of claim 12, wherein the node constraint loss function G_constraint(Ψ) is:

G_constraint(Ψ) = Σ_i [ max(θ_i^low − θ_i, 0) + max(θ_i − θ_i^high, 0) ]

wherein θ_i is the rotation angle of the joint node of the hand corresponding to the preset number of degrees of freedom, θ_i^low is a lower bound on the degree of freedom of the node, and θ_i^high is an upper bound on the degree of freedom of the node.
15. A hand pose estimation apparatus based on deep learning, comprising:
the hand image segmentation module is used for detecting a hand region of interest of the depth image and segmenting a hand image from the hand region of interest;
the point cloud image acquisition module is used for acquiring a three-dimensional point cloud image containing a hand according to the hand image; and
the hand posture estimation module is used for estimating the hand posture of the three-dimensional point cloud image of the hand by adopting a deep learning technology;
the hand image segmentation module comprises a hand region detection unit and a hand image segmentation unit, wherein:
the hand region detection unit is used for carrying out edge detection and contour detection on the hand region of interest to detect a hand region;
the hand image segmentation unit is used for denoising the hand region and segmenting the hand image;
the hand posture estimation module comprises a training data set making submodule, a point cloud area extraction submodule, a hand posture model making submodule and a hand posture estimation submodule, wherein:
the training data set making submodule is used for making a hand training data set and obtaining hand node marking information of the hand training data set;
the point cloud area extraction submodule is used for extracting a three-dimensional point cloud area of the hand from the hand training data set;
the hand gesture model making sub-module is used for training a hand gesture model by utilizing a convolutional neural network model according to the three-dimensional point cloud area of the hand and the hand node marking information;
the hand posture estimation submodule is used for acquiring joint node positions of the hand by utilizing the hand posture model according to the three-dimensional point cloud image of the hand;
wherein making the hand training data set and acquiring the hand node marking information of the hand training data set further comprises:
acquiring a color image and the depth image corresponding to the color image, wherein the color image comprises a hand with a color wrist;
and segmenting the hand area according to the positions of the color wrists in the color image.
16. The apparatus of claim 15, the convolutional neural network model comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of fully-connected layers, and an activation layer after each of the convolutional layers and after each of the fully-connected layers.
17. The apparatus of claim 15 or 16, the hand pose estimation sub-module comprising a point cloud image input unit, a pose parameter output unit, and a joint node position output unit, wherein:
the point cloud image input unit is used for inputting the three-dimensional point cloud image of the hand into the convolutional neural network model;
the pose parameter output unit is used for outputting, from the last fully connected layer of the plurality of fully connected layers, hand gesture parameters with a preset number of degrees of freedom to the hand gesture model;
and the joint node position output unit is used for causing the hand gesture model to output each joint node position of the hand.
18. The apparatus of claim 17, the joint node position output unit comprising a global loss function design subunit, an iterative optimization subunit, and a joint node position output subunit, wherein:
the global loss function design subunit is used for designing a global loss function G (Ψ) in the hand posture model;
the iterative optimization subunit performs iterative optimization on the global loss function by adopting a preset method;
and the joint node position output subunit is used for outputting the joint node positions of the hand when the global loss function reaches a preset convergence condition.
19. The apparatus of claim 18, the global loss function G(Ψ) comprising a node position loss function G_joint(Ψ) and a node constraint loss function G_constraint(Ψ), wherein the global loss function G(Ψ) satisfies the following equation:

G(Ψ) = G_joint(Ψ) + λ·G_constraint(Ψ)

where λ is a weight adjustment factor for the global loss function G(Ψ).
20. The apparatus of claim 19, wherein the node position loss function G_joint(Ψ) is:

G_joint(Ψ) = ‖F(Ψ) − Y_gt‖²

wherein F(Ψ) is a forward kinematics function that maps the hand gesture parameters to the joint node positions of the hand, Ψ is a hand gesture parameter of the preset number of degrees of freedom, θ is the rotation angle of the joint node of the hand corresponding to the preset number of degrees of freedom, and Y_gt is the hand node marking information in the hand training data set.
21. The apparatus of claim 19, wherein the node constraint loss function G_constraint(Ψ) is:

G_constraint(Ψ) = Σ_i [ max(θ_i^low − θ_i, 0) + max(θ_i − θ_i^high, 0) ]

wherein θ_i is the rotation angle of the joint node of the hand corresponding to the preset number of degrees of freedom, θ_i^low is a lower bound on the degree of freedom of the node, and θ_i^high is an upper bound on the degree of freedom of the node.
CN201710061286.8A 2017-01-25 2017-01-25 Hand posture estimation method and device based on deep learning Active CN107066935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710061286.8A 2017-01-25 2017-01-25 Hand posture estimation method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN107066935A (en) 2017-08-18
CN107066935B (en) 2020-11-24

Family

ID=59598426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710061286.8A Active CN107066935B (en) 2017-01-25 2017-01-25 Hand posture estimation method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN107066935B (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644423B (en) * 2017-09-29 2021-06-15 北京奇虎科技有限公司 Scene segmentation-based video data real-time processing method and device and computing equipment
WO2019061466A1 (en) * 2017-09-30 2019-04-04 深圳市大疆创新科技有限公司 Flight control method, remote control device, and remote control system
CN107977604B (en) * 2017-11-06 2021-01-05 浙江工业大学 Hand detection method based on improved aggregation channel characteristics
CN108196535B (en) * 2017-12-12 2021-09-07 清华大学苏州汽车研究院(吴江) Automatic driving system based on reinforcement learning and multi-sensor fusion
CN109934065B (en) * 2017-12-18 2021-11-09 虹软科技股份有限公司 Method and device for gesture recognition
CN108491752A (en) * 2018-01-16 2018-09-04 北京航空航天大学 A kind of hand gestures method of estimation based on hand Segmentation convolutional network
CN110060296A (en) * 2018-01-18 2019-07-26 北京三星通信技术研究有限公司 Estimate method, electronic equipment and the method and apparatus for showing virtual objects of posture
CN108460338B (en) * 2018-02-02 2020-12-11 北京市商汤科技开发有限公司 Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN108345869B (en) * 2018-03-09 2022-04-08 南京理工大学 Driver posture recognition method based on depth image and virtual data
CN108594997B (en) * 2018-04-16 2020-04-21 腾讯科技(深圳)有限公司 Gesture skeleton construction method, device, equipment and storage medium
CN108549489B (en) * 2018-04-27 2019-12-13 哈尔滨拓博科技有限公司 gesture control method and system based on hand shape, posture, position and motion characteristics
CN108830150B (en) * 2018-05-07 2019-05-28 山东师范大学 One kind being based on 3 D human body Attitude estimation method and device
CN109002837A (en) * 2018-06-21 2018-12-14 网易(杭州)网络有限公司 A kind of image application processing method, medium, device and calculate equipment
CN109446952A (en) * 2018-10-16 2019-03-08 赵笑婷 A kind of piano measure of supervision, device, computer equipment and storage medium
CN111222379A (en) * 2018-11-27 2020-06-02 株式会社日立制作所 Hand detection method and device
CN109635767B (en) * 2018-12-20 2019-11-22 北京字节跳动网络技术有限公司 A kind of training method, device, equipment and the storage medium of palm normal module
CN111382637B (en) * 2018-12-29 2023-08-08 深圳市优必选科技有限公司 Pedestrian detection tracking method, device, terminal equipment and medium
CN111460858B (en) * 2019-01-21 2024-04-12 杭州易现先进科技有限公司 Method and device for determining finger tip point in image, storage medium and electronic equipment
CN109919046B (en) * 2019-02-19 2020-10-13 清华大学 Three-dimensional point cloud feature learning method and device based on relational features
CN110135340A (en) * 2019-05-15 2019-08-16 中国科学技术大学 3D hand gestures estimation method based on cloud
CN110210426B (en) * 2019-06-05 2021-06-08 中国人民解放军国防科技大学 Method for estimating hand posture from single color image based on attention mechanism
CN110348359B (en) * 2019-07-04 2022-01-04 北京航空航天大学 Hand gesture tracking method, device and system
CN110413111B (en) * 2019-07-09 2021-06-01 南京大学 Target keyboard tracking system and method
CN112288798A (en) * 2019-07-24 2021-01-29 鲁班嫡系机器人(深圳)有限公司 Posture recognition and training method, device and system
CN110991237B (en) * 2019-10-30 2023-07-28 华东师范大学 Virtual hand natural gripping action generation method based on gripping taxonomy
WO2021098666A1 (en) * 2019-11-20 2021-05-27 Oppo广东移动通信有限公司 Hand gesture detection method and device, and computer storage medium
WO2021098441A1 (en) * 2019-11-20 2021-05-27 Oppo广东移动通信有限公司 Hand posture estimation method and apparatus, device and computer storage medium
WO2021098576A1 (en) * 2019-11-20 2021-05-27 Oppo广东移动通信有限公司 Hand posture estimation method and apparatus, and computer storage medium
CN112883757B (en) * 2019-11-29 2023-03-24 北京航空航天大学 Method for generating tracking attitude result
CN111178142A (en) * 2019-12-05 2020-05-19 浙江大学 Hand posture estimation method based on space-time context learning
CN111046948B (en) * 2019-12-10 2022-04-22 浙江大学 Point cloud simulation and deep learning workpiece pose identification and robot feeding method
CN111222486B (en) * 2020-01-15 2022-11-04 腾讯科技(深圳)有限公司 Training method, device and equipment for hand gesture recognition model and storage medium
CN111597976A (en) * 2020-05-14 2020-08-28 杭州相芯科技有限公司 Multi-person three-dimensional attitude estimation method based on RGBD camera
CN111666917A (en) * 2020-06-19 2020-09-15 北京市商汤科技开发有限公司 Attitude detection and video processing method and device, electronic equipment and storage medium
KR20210157470A (en) * 2020-06-19 2021-12-28 베이징 센스타임 테크놀로지 디벨롭먼트 컴퍼니 리미티드 Posture detection and video processing methods, devices, electronic devices and storage media
CN112446919A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Object pose estimation method and device, electronic equipment and computer storage medium
CN112975957A (en) * 2021-02-07 2021-06-18 深圳市广宁股份有限公司 Target extraction method, system, robot and storage medium
CN112597980B (en) * 2021-03-04 2021-06-04 之江实验室 Brain-like gesture sequence recognition method for dynamic vision sensor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102135417B (en) * 2010-12-26 2013-05-22 北京航空航天大学 Full-automatic three-dimension characteristic extracting method
CN105069423B (en) * 2015-07-29 2018-11-09 北京格灵深瞳信息技术有限公司 A kind of human body attitude detection method and device

Also Published As

Publication number Publication date
CN107066935A (en) 2017-08-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210514
Address after: 311200 Room 102, 6 Blocks, C District, Qianjiang Century Park, Xiaoshan District, Hangzhou City, Zhejiang Province
Patentee after: Hangzhou Yixian Advanced Technology Co.,Ltd.
Address before: 310052 Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province, 4, 7 stories
Patentee before: NETEASE (HANGZHOU) NETWORK Co.,Ltd.