CN110472507A - Human hand depth image pose estimation method and system based on deep residual network - Google Patents

Human hand depth image pose estimation method and system based on deep residual network

Info

Publication number
CN110472507A
CN110472507A
Authority
CN
China
Prior art keywords
human hand
hand
characteristic
pose
depth image
Prior art date
Legal status
Pending
Application number
CN201910629662.8A
Other languages
Chinese (zh)
Inventor
李勇波
赵涛
谢中朝
蔡文迪
朱正东
王畯翔
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN201910629662.8A priority Critical patent/CN110472507A/en
Publication of CN110472507A publication Critical patent/CN110472507A/en
Pending legal-status Critical Current

Classifications

    • G06N3/045 Computing arrangements based on specific computational models > biological models > neural networks > architecture > combinations of networks
    • G06N3/08 Computing arrangements based on specific computational models > biological models > neural networks > learning methods
    • G06V10/267 Image or video recognition or understanding > image preprocessing > segmentation of patterns in the image field > segmentation by performing operations on regions
    • G06V40/107 Recognition of biometric, human-related or animal-related patterns > human or animal bodies and body parts > static hand or arm


Abstract

The invention discloses a human hand depth image pose estimation method and system based on a deep residual network. The method and system first input a human hand depth image into a CNN model and use the model to extract features from the input image, obtaining a human hand feature map. The extracted feature map is then input into a trained region ensemble network, which performs the hand pose estimation. Within the region ensemble network, the extracted feature map is uniformly divided into several feature regions, each feature region is fed to a regression model for hand pose estimation, and the regression results of the individual feature regions are fused to regress the final hand pose of the depth image. The method and system extract more optimized and more representative hand features and achieve higher hand pose estimation accuracy than other methods.

Description

Human hand depth image pose estimation method and system based on deep residual network
Technical Field
The invention relates to the fields of machine learning and computer vision, and in particular to a method and a system for estimating the pose of a human hand from a depth image based on a deep residual convolutional network and a region ensemble network.
Background
With the continuous development of computer vision technology, people pursue more natural and harmonious modes of human-computer interaction. Hand motion is an important channel of human interaction: hands express not only semantic information but also quantitative spatial direction and position information, so they can support a more natural and efficient human-computer interaction environment. Vision-based estimation and motion analysis of the three-dimensional pose of the articulated human hand is therefore an important research direction; it aims to detect, in a non-contact manner and using computer vision methods, the three-dimensional pose of the human hand and its finger joints from images or image sequences. Three-dimensional pose estimation of the articulated hand is of great significance in fields such as augmented/virtual reality, intelligent robotics, driver assistance, and medical care. Human-computer interaction technology has gradually shifted from being computer-centered to being human-centered and has become a brand-new multimedia, multimodal interaction technology. The hand is the most flexible part of the human body; compared with other interaction modes, gestures are a more natural means of human-computer interaction, which makes gesture recognition a major research topic in this field.
Disclosure of Invention
The technical problem to be solved by the invention is the weak recognition capability of the prior art. To this end, a human hand depth image pose estimation method based on a deep residual convolutional network is provided; by introducing a pose-guided convolutional neural network structure, it overcomes the shortcoming of existing three-dimensional pose estimation methods, which regress the three-dimensional pose coordinates of the human hand from a single depth image.
The technical solution adopted by the invention to solve this technical problem is as follows: a human hand depth image pose estimation method based on a deep residual network is constructed, comprising the following steps:
S1, inputting a human hand depth image into a CNN model, and performing feature extraction on the input image using the CNN model to obtain a human hand feature map;
S2, taking the input human hand depth images as training samples, training a region ensemble network, inputting the extracted human hand feature map into the trained region ensemble network, and estimating the human hand pose through this network; during hand pose estimation, the extracted human hand feature map is uniformly divided into a plurality of feature regions within the region ensemble network, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
In this method, step S2 overcomes the shortcoming of regressing the three-dimensional pose coordinates of the human hand from a single depth map: features are extracted from the joint regions of the human hand, and the extracted feature regions are hierarchically fused using the fully connected layers of the region ensemble network, so that the estimated three-dimensional hand pose is more accurate.
Further, step S2 includes the following sub-steps:
S21, the CNN model comprises a plurality of convolutional layers, wherein the feature map extracted by the last convolutional layer is denoted F; according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a first feature region is extracted from the feature map F;
S22, at stage t, cropping the first feature region extracted in step S21 with a rectangular window to obtain a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
S23, the region ensemble network comprises a plurality of fully connected layers; the fully connected layers are used to fuse the plurality of rectangular regions containing hand joint points obtained by cropping in step S22 into a fused feature region covering the five finger joints, and for this fused feature region a regression model R is used to regress the hand pose $P^t$ of the human hand depth image.
Further, in step S23, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
Further, in step S23, all joint points belonging to the same finger are concatenated, the concatenation function being denoted concat; the concatenated neurons are further fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the feature regions of the five fingers, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger; FC(·) means that the input is passed through a fully connected layer to obtain the corresponding joint point coordinates;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
Further, in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, i.e. the number of input human hand depth images, $D_i$ is an input human hand depth image, $P_i^0$ is an initial estimate of the human hand pose, and $P_i^{gt}$ is the manually annotated three-dimensional coordinates of the true hand pose.
The invention further provides a human hand depth image pose estimation system based on a deep residual network, comprising the following modules:
a feature map extraction module, used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the CNN model, obtaining a human hand feature map;
a human hand pose estimation module, used to input the extracted human hand feature map into a trained region ensemble network and to estimate the human hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
Further, the human hand pose estimation module comprises the following sub-modules:
a feature region extraction module, used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
a cropping module, used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as $F_i^t = \mathcal{G}(F, b_i^t)$, where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
a hand pose calculation module, used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
Further, in the hand pose calculation module, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
In the human hand depth image pose estimation method and system based on the deep residual network, features are extracted from the joint regions of the human hand and, after hierarchical fusion, three-dimensional hand pose estimation is performed.
In the disclosed method and system, the pose-guided structured region ensemble network feeds the predicted hand pose estimate back to the feature map as guidance information; by continuously feeding back the error, better hand features can be learned.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a hand depth image pose estimation method disclosed by the invention;
FIG. 2 is a schematic diagram of a residual structure;
FIG. 3 is a diagram of a gesture-guided structured area integration network;
FIG. 4 is a residual convolutional network model;
FIGS. 5-10 are graphs of the average error of each joint point and of the projection of the estimated hand pose, on three different data sets;
FIG. 11 is a diagram of a pose estimation system for a human hand depth image disclosed in the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
The invention aims to provide a network structure that overcomes the shortcoming of existing three-dimensional pose estimation methods of regressing the three-dimensional pose coordinates of a human hand from a single depth map; the network structure performs feature extraction on the joint regions of the human hand, then hierarchically fuses the extracted feature maps, and finally performs three-dimensional hand pose estimation on the fused feature maps.
Please refer to fig. 1, which is a flowchart of a pose estimation method for a depth image of a human hand disclosed in the present invention, specifically including the following steps:
s1, inputting the hand depth image into a CNN model, and performing feature extraction on the input image by using the CNN model to obtain a hand feature map;
in the CNN model, the convolutional layers are used to extract features from the input human hand image, the hand features being extracted through parameter sharing and sparse connectivity;
In this embodiment, the network has 6 convolutional layers and 2 residual connections; each convolution kernel is of size 3 × 3, the numbers of convolution kernels are 16, 32 and 64, a nonlinear activation function ReLU follows each convolutional layer, a max pooling layer follows every two convolutional layers, and the residual connections lie between the max pooling layers in order to prevent the vanishing-gradient problem of deep networks; the residual structure is shown in detail in FIG. 2.
In this embodiment, the convolution kernel is represented by the function $f_k$ and is of size 3 × 3, so each convolution kernel makes 3 × 3 connections with the human hand image x; with μ and ν the length and width of the human hand image, the current convolutional layer outputs

$y(m, n) = \sum_{u=1}^{3} \sum_{v=1}^{3} f_k(u, v)\, x(m+u-1, n+v-1)$

for each valid position (m, n) of the kernel on the image.
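As an illustration of the feature extractor just described, the following is a minimal PyTorch sketch, assuming the residual connections each skip a pair of convolutions and using 1 × 1 projections so that channel counts match; class names and the exact placement of skips are assumptions, since the embodiment specifies only the layer counts, kernel sizes and channel numbers.

import torch
import torch.nn as nn

class HandFeatureExtractor(nn.Module):
    """Six 3x3 conv layers (16, 32, 64 kernels), ReLU after each convolution,
    max pooling after every two convolutions; the two residual connections
    each skip over a pair of convolutions between the pooling layers."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(inplace=True))
        self.pool1 = nn.MaxPool2d(2)
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.skip2 = nn.Conv2d(16, 32, 1)   # 1x1 projection so channels match
        self.pool2 = nn.MaxPool2d(2)
        self.block3 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.skip3 = nn.Conv2d(32, 64, 1)

    def forward(self, x):
        x = self.pool1(self.block1(x))
        # Residual connection 1: identity shortcut around the second conv pair.
        x = self.pool2(self.block2(x) + self.skip2(x))
        # Residual connection 2: identity shortcut around the third conv pair.
        return self.block3(x) + self.skip3(x)

# A 128x128 single-channel depth image yields a 64-channel feature map here;
# the 12x12x64 map of FIG. 4 implies stride/padding choices the patent
# does not fully specify.
F = HandFeatureExtractor()(torch.randn(1, 1, 128, 128))
print(F.shape)  # torch.Size([1, 64, 32, 32])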
S2, training the region ensemble network, inputting the extracted hand feature map into the trained region ensemble network, and estimating the hand pose through the network; wherein:
in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, $D_i$ is an input human hand depth image, $P_i^0$ is an initial value of the hand pose estimate, and $P_i^{gt}$ is the three-dimensional hand pose of the depth image. With this training set, each sample in the set is used for training, and the process is repeated until the maximum number of iterations T is reached.
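As an illustration, a minimal sketch of this training loop under assumed names: model is any module implementing the stage regressor R described below, and train_set yields (D_i, P_i^0, P_i^gt) triples.

import torch.nn as nn

def train_region_ensemble(model, optimizer, train_set, T):
    """Train on (D_i, P_i^0, P_i^gt) triples, repeating over the training
    set until the maximum number of iterations T is reached."""
    loss_fn = nn.MSELoss()
    for t in range(1, T + 1):
        for depth_image, pose_init, pose_gt in train_set:
            pose_pred = model(pose_init, depth_image)   # one stage: R(P^{t-1}, D)
            loss = loss_fn(pose_pred, pose_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model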
Referring to FIG. 3, which shows the pose-guided structured region ensemble network, this section explains the hand pose estimation process in detail in conjunction with the network structure. In the region ensemble network, the feature map extracted by the last convolutional layer of the CNN model is first denoted F, and according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a feature region map is extracted from the feature map F; before the next operation, the world coordinates of the i-th hand joint point are converted into pixel coordinates:

$(u_i, v_i) = (f_x X_i / Z_i + c_x,\ f_y Y_i / Z_i + c_y)$

where $(X_i, Y_i, Z_i)$ are the world coordinates of joint i, $(f_x, f_y)$ the focal lengths, and $(c_x, c_y)$ the principal point of the depth camera;
secondly, the extracted feature region map is uniformly divided into several grid regions using rectangular windows, each grid region is fed to a fully connected layer for fusion, and hand pose regression is then performed; the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
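A minimal sketch of the cropping function $\mathcal{G}$, assuming the window $b_i^t$ has already been converted to feature-map coordinates; the function name crop_joint_region is hypothetical.

import torch

def crop_joint_region(F, x, y, w, h):
    """G(F, b_i): cut the w-by-h window with upper-left corner (x, y) out of
    the feature map F of shape (channels, height, width)."""
    _, H, W = F.shape
    # Clamp the window so it stays inside the feature map.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(W, x + w), min(H, y + h)
    return F[:, y0:y1, x0:x1]

# Usage: a 12x12x64 feature map and a 6x6 window around one joint.
F = torch.randn(64, 12, 12)
region = crop_joint_region(F, x=3, y=4, w=6, h=6)
print(region.shape)  # torch.Size([64, 6, 6])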
Finally, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand; for this fused feature region, the hand pose $P^t$ of the human hand depth image is regressed using the regression model R. Wherein:
in the cascaded network, assuming the depth image is denoted D, the three-dimensional hand pose is represented as

$P = \{p_j\}_{j=1}^{J}$

where J is the number of human hand joint points; at stage t-1, the currently predicted hand pose estimate is $P^{t-1}$, and the prediction of the whole regression model R for the hand pose estimate at stage t is expressed as:

$P^t = R(P^{t-1}, D);$

over the whole training process, after T stages, the final hand pose estimate $P^T$ for the input depth image D is obtained:

$P^T = R(P^{T-1}, D).$
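The cascaded refinement itself reduces to a short loop; a sketch under the assumption that regression_model implements R(P^{t-1}, D):

def cascaded_pose_estimation(regression_model, depth_image, pose_init, T):
    """Refine the hand pose stage by stage: P^t = R(P^{t-1}, D), t = 1..T."""
    pose = pose_init
    for _ in range(T):
        # The previous estimate guides where feature regions are cropped next.
        pose = regression_model(pose, depth_image)
    return pose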
The above is the complete hand pose estimation process. The feature regions of the joints of the five fingers are each input to a fully connected layer FC; all joint points belonging to the same finger are concatenated using the concat connection function, and the concatenated neurons are further fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the per-finger feature regions, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
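A minimal sketch of this hierarchical fusion in PyTorch, assuming each cropped joint region has been flattened to a fixed-length vector and that every finger has the same number of joints; all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Per-finger fusion through l1, cross-finger fusion and regression through l2."""
    def __init__(self, region_dim, joints_per_finger, num_joints, hidden=2048):
        super().__init__()
        # l1 fuses the concatenated joint regions of one finger: f_i = FC(concat(...)).
        self.l1 = nn.Linear(region_dim * joints_per_finger, hidden)
        # l2 fuses the five per-finger features and regresses the 3*J coordinates.
        self.l2 = nn.Linear(hidden * 5, 3 * num_joints)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, finger_regions):
        # finger_regions: list of 5 tensors of shape (batch, joints_per_finger, region_dim).
        per_finger = [self.relu(self.l1(f.flatten(start_dim=1)))
                      for f in finger_regions]
        fused = torch.cat(per_finger, dim=1)   # concat(f_1, ..., f_5)
        return self.l2(fused)                  # P^t

# Usage: 3 joints per finger, 15 joints in total, 64*6*6-dim flattened regions.
net = HierarchicalFusion(region_dim=64 * 6 * 6, joints_per_finger=3, num_joints=15)
regions = [torch.randn(2, 3, 64 * 6 * 6) for _ in range(5)]
print(net(regions).shape)  # torch.Size([2, 45])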
As shown in FIG. 4, the convolutional network model used for feature extraction consists of six 3 × 3 convolutional layers; the network input is a 128 × 128 depth image of a human hand, a ReLU activation layer performs a nonlinear feature transformation after each convolutional layer, residual structures are connected between the pooling layers, and the feature map output by the hand feature extraction network has dimensions 12 × 12 × 64. For the regression task, this embodiment uses two 2048-dimensional fully connected layers, where the dropout rate of the neurons of each fully connected layer used for hand pose regression is 0.5, preventing the model from overfitting; the final regression result is a 3 × J vector representing the world coordinates of the hand joint points, where J denotes the number of hand joints.
In this embodiment, in order to demonstrate the effectiveness of the algorithm, three public data sets (ICVL, MSRA, NYU) are used for comparison.
The whole model is trained for 100 epochs with a batch size of 64. The deep residual network uses the Adam gradient descent algorithm with a learning rate of 0.0001; the region ensemble network uses the SGD gradient descent algorithm with a learning rate of 0.005, the learning rate divided by 10 every 2 epochs, a weight decay of 0.0005 and a momentum of 0.9. The pose-guided structured region ensemble network uses the SGD gradient descent algorithm with a learning rate of 0.001, the learning rate divided by 10 every 10 epochs, a weight decay of 0.0005 and a momentum of 0.9.
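These hyperparameters map directly onto standard PyTorch optimizers and schedulers; a sketch with placeholder modules standing in for the sub-networks:

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

feature_extractor = nn.Linear(8, 8)  # placeholder for the deep residual network
pose_guided_net = nn.Linear(8, 8)    # placeholder for the pose-guided region ensemble

# Deep residual network: Adam with learning rate 0.0001.
adam = optim.Adam(feature_extractor.parameters(), lr=1e-4)
# Pose-guided network: SGD with learning rate 0.001, weight decay 0.0005, momentum 0.9.
sgd = optim.SGD(pose_guided_net.parameters(), lr=1e-3,
                weight_decay=5e-4, momentum=0.9)
# Divide the SGD learning rate by 10 every 10 epochs.
scheduler = StepLR(sgd, step_size=10, gamma=0.1)

for epoch in range(100):    # 100 epochs, batch size 64 in the data loader
    pass                     # forward/backward passes over the batches go here
    scheduler.step()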
The predicted hand pose is compared with the ground-truth hand pose in terms of the per-joint error; the average joint point error is

$E = \frac{1}{N J} \sum_{i=1}^{N} \sum_{j=1}^{J} \| p_{i,j} - p_{i,j}^{gt} \|_2$

where N is the number of test samples, J the number of hand joints, and $p_{i,j}$ and $p_{i,j}^{gt}$ the predicted and ground-truth three-dimensional positions of joint j in sample i.
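A sketch of this metric, assuming predictions and ground truth are given as (N, J, 3) arrays of joint coordinates in millimetres:

import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean distance over all samples and joints, in mm.
    pred, gt: arrays of shape (N, J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example with random values: 100 samples, 15 joints.
pred = np.random.randn(100, 15, 3)
gt = np.random.randn(100, 15, 3)
print(f"mean joint error: {mean_joint_error(pred, gt):.2f} mm")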
for the above applied 3 different data sets, three different methods are used for hand pose estimation, the three different methods including: the method includes a deep residual convolution network (called ResNet-Hand for short), a regional integration network (called Multi-Region for short) and a Hand posture guidance structured regional integration network (called Pose-Guide for short), and the implementation effect diagrams are shown in FIG. 5-FIG. 10.
In fig. 5, the human hand pose estimation accuracy of the three methods is compared by using the NYU data set. The hand Pose estimation is obtained by using a hand central point as prior information, the average error of a depth residual convolution network method (ResNetHandd) is 13.89mm, the average error of a regional integrated network method (Multi-Region) is 12.63mm, and the average error of a hand posture guidance structured regional integrated network (Pose-Guide) is 11.49 mm. Currently, it is shown that the accuracy of hand pose estimation by using a hand pose guidance structured area integrated network is higher than that of other two methods, specifically because a deep residual convolution network has stronger feature extraction capability than a shallow network. The position and pose estimation is carried out by combining the characteristic diagram information in the area integrated network, compared with a single network, the area integrated network has stronger characteristic expression capability, and the Hand position and pose estimation precision is higher than that of a ResNet-Hand method. The area integration network of human hand pose guidance enables the network to learn better features by incorporating guidance information from previous human hand pose estimates into the feature map.
FIG. 6 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the NYU data set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
FIG. 7 compares the hand pose estimation accuracy of the three methods on the MSRA data set. The hand pose estimates are obtained using the hand center point as prior information; the average error of the deep residual convolutional network method (ResNet-Hand) is 9.79 mm, the average error of the region ensemble network method (Multi-Region) is 8.65 mm, and the average error of the pose-guided structured region ensemble network (Pose-Guide) is 8.58 mm.
Here too, the pose-guided structured region ensemble network estimates the hand pose more accurately than the other two methods.
FIG. 8 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the MSRA test set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
In FIG. 9, the hand pose estimation accuracy of the three methods is compared on the ICVL data set. The average error of the deep residual convolutional network method (ResNet-Hand) is 7.63 mm, the average error of the region ensemble network method (Multi-Region) is 7.31 mm, and the average error of the pose-guided structured region ensemble network (Pose-Guide) is 7.21 mm.
Here too, the pose-guided structured region ensemble network estimates the hand pose more accurately than the other two methods.
FIG. 10 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the ICVL test set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
Referring to FIG. 11, which shows the structure of the hand depth image pose estimation system disclosed by the invention, the system comprises a feature map extraction module L1 and a hand pose estimation module L2, wherein:
the feature map extraction module L1 is used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the model, obtaining a human hand feature map;
the hand pose estimation module L2 is used to input the extracted human hand feature map into a trained region ensemble network and to estimate the hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions. The hand pose estimation module L2 further comprises a feature region extraction module L21, a cropping module L22 and a hand pose calculation module L23, which perform the three-dimensional hand pose estimation from the input hand feature map; the functions of these modules are as follows:
the feature region extraction module L21 is used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
the cropping module L22 is used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as $F_i^t = \mathcal{G}(F, b_i^t)$, where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
the hand pose calculation module L23 is used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
The invention combines the strong feature extraction capability of the deep residual convolutional network with the feature fusion advantage of the region ensemble network and provides a pose-guided structured region ensemble network method. To further mine the feature information of the depth image, the pose-guided structured region ensemble network feeds the predicted hand pose estimate back to the feature map as guidance information and learns better hand features through the continuously fed-back error. The experimental results show that the pose-guided structured region ensemble network fully extracts more optimized and more representative hand features and achieves higher hand pose estimation accuracy than other methods.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A human hand depth image pose estimation method based on a deep residual network, characterized by comprising the following steps:
S1, inputting a human hand depth image into a CNN model, and performing feature extraction on the input image using the CNN model to obtain a human hand feature map;
S2, taking the input human hand depth images as training samples, training a region ensemble network, inputting the human hand feature map extracted in step S1 into the trained region ensemble network, and estimating the human hand pose through this network; during hand pose estimation, the extracted human hand feature map is uniformly divided into a plurality of feature regions within the region ensemble network, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
2. The human hand depth image pose estimation method according to claim 1, characterized in that step S2 comprises the following sub-steps:
S21, the CNN model comprises a plurality of convolutional layers, wherein the feature map extracted by the last convolutional layer is denoted F; according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a first feature region is extracted from the feature map F;
S22, at stage t, cropping the first feature region extracted in step S21 with a rectangular window to obtain a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
S23, the region ensemble network comprises a plurality of fully connected layers; the fully connected layers are used to fuse the plurality of rectangular regions containing hand joint points obtained by cropping in step S22 into a fused feature region covering the five finger joints, and for this fused feature region a regression model R is used to regress the hand pose $P^t$ of the human hand depth image.
3. The human hand depth image pose estimation method according to claim 2, characterized in that the region ensemble network has, connected in sequence after the last convolutional layer, a fully connected layer $l_1$ and a fully connected layer $l_2$; in step S23, for the joint points on the same finger, the feature regions obtained by cropping are all fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
4. The method according to claim 3, characterized in that in step S23, all joint points belonging to the same finger are concatenated, the concatenation function being denoted concat, and the concatenated neurons are fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the per-finger feature regions, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger; FC(·) means that the input is passed through a fully connected layer to obtain the corresponding joint point coordinates;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
5. The human hand depth image pose estimation method according to claim 1, characterized in that in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, i.e. the number of input human hand depth images, $D_i$ is an input human hand depth image, $P_i^0$ is an initial estimate of the human hand pose, and $P_i^{gt}$ is the manually annotated three-dimensional coordinates of the true hand pose.
6. A human hand depth image pose estimation system based on a deep residual network, characterized by comprising the following modules:
a feature map extraction module, used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the CNN model, obtaining a human hand feature map;
a human hand pose estimation module, used to input the extracted human hand feature map into a trained region ensemble network and to estimate the human hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
7. The human hand depth image pose estimation system according to claim 6, characterized in that the human hand pose estimation module comprises the following sub-modules:
a feature region extraction module, used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
a cropping module, used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
a hand pose calculation module, used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
8. The system according to claim 7, characterized in that in the hand pose calculation module, for the joint points on the same finger, the feature regions obtained by cropping are all fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
CN201910629662.8A 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network Pending CN110472507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910629662.8A CN110472507A (en) 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network


Publications (1)

Publication Number Publication Date
CN110472507A true CN110472507A (en) 2019-11-19

Family

ID=68508170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910629662.8A Pending CN110472507A (en) 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network

Country Status (1)

Country Link
CN (1) CN110472507A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150088116A (en) * 2014-01-23 2015-07-31 삼성전자주식회사 Parameter learning method for estimating posture of articulated object and posture estimating method of articulated object
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
CN105759967A (en) * 2016-02-19 2016-07-13 电子科技大学 Global hand gesture detecting method based on depth data
CN108960178A (en) * 2018-07-13 2018-12-07 清华大学 A kind of manpower Attitude estimation method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781492A (en) * 2020-06-10 2021-12-10 阿里巴巴集团控股有限公司 Target element content measuring method, training method, related device and storage medium
CN111950521A (en) * 2020-08-27 2020-11-17 深圳市慧鲤科技有限公司 Augmented reality interaction method and device, electronic equipment and storage medium
CN113763572A (en) * 2021-09-17 2021-12-07 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium
CN113763572B (en) * 2021-09-17 2023-06-27 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191119