CN110472507A - Human hand depth image pose estimation method and system based on deep residual network - Google Patents

Human hand depth image pose estimation method and system based on deep residual network

Info

Publication number
CN110472507A
CN110472507A
Authority
CN
China
Prior art keywords
human hand
hand
characteristic
pose
depth image
Prior art date
Legal status
Pending
Application number
CN201910629662.8A
Other languages
Chinese (zh)
Inventor
李勇波
赵涛
谢中朝
蔡文迪
朱正东
王畯翔
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN201910629662.8A priority Critical patent/CN110472507A/en
Publication of CN110472507A publication Critical patent/CN110472507A/en
Pending legal-status Critical Current

Classifications

    • G06N3/045 Computing arrangements based on specific computational models > biological models > neural networks > architecture > combinations of networks
    • G06N3/08 Computing arrangements based on specific computational models > biological models > neural networks > learning methods
    • G06V10/267 Image or video recognition or understanding > image preprocessing > segmentation of patterns in the image field > segmentation by performing operations on regions
    • G06V40/107 Recognition of biometric, human-related or animal-related patterns > human or animal bodies and body parts > static hand or arm


Abstract

The invention discloses a human hand depth image pose estimation method and system based on a deep residual network. The method and system first input a human hand depth image into a CNN model and use the model to extract features from the input image, obtaining a human hand feature map. The extracted feature map is then input into a trained region ensemble network, which performs the hand pose estimation. Within the region ensemble network, the extracted feature map is uniformly divided into several feature regions, each feature region is fed to a regression model for hand pose estimation, and the regression results of the individual feature regions are fused to regress the final hand pose of the depth image. The method and system extract more optimized and more representative hand features and achieve higher hand pose estimation accuracy than other methods.

Description

Human hand depth image pose estimation method and system based on deep residual network
Technical Field
The invention relates to the fields of machine learning and computer vision, and in particular to a method and a system for estimating the pose of a human hand from a depth image based on a deep residual convolutional network and a region ensemble network.
Background
With the continuous development of computer vision technology, people pursue more natural and harmonious modes of human-computer interaction. Hand motion is an important channel of human interaction: hands express not only semantic information but also quantitative spatial direction and position information, so they can support a more natural and efficient human-computer interaction environment. Vision-based estimation and motion analysis of the three-dimensional pose of the articulated human hand is therefore an important research direction; it aims to detect, in a non-contact manner and using computer vision methods, the three-dimensional pose of the human hand and its finger joints from images or image sequences. Three-dimensional pose estimation of the articulated hand is of great significance in fields such as augmented/virtual reality, intelligent robotics, driver assistance, and medical care. Human-computer interaction technology has gradually shifted from being computer-centered to being human-centered and has become a brand-new multimedia, multimodal interaction technology. The hand is the most flexible part of the human body; compared with other interaction modes, gestures are a more natural means of human-computer interaction, which makes gesture recognition a major research topic in this field.
Disclosure of Invention
The technical problem to be solved by the invention is the weak recognition capability of the prior art. To this end, a human hand depth image pose estimation method based on a deep residual convolutional network is provided; by introducing a pose-guided convolutional neural network structure, it overcomes the shortcoming of existing three-dimensional pose estimation methods, which regress the three-dimensional pose coordinates of the human hand from a single depth image.
The technical solution adopted by the invention to solve this technical problem is as follows: a human hand depth image pose estimation method based on a deep residual network is constructed, comprising the following steps:
S1, inputting a human hand depth image into a CNN model, and performing feature extraction on the input image using the CNN model to obtain a human hand feature map;
S2, taking the input human hand depth images as training samples, training a region ensemble network, inputting the extracted human hand feature map into the trained region ensemble network, and estimating the human hand pose through this network; during hand pose estimation, the extracted human hand feature map is uniformly divided into a plurality of feature regions within the region ensemble network, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
In this method, step S2 overcomes the shortcoming of regressing the three-dimensional pose coordinates of the human hand from a single depth map: features are extracted from the joint regions of the human hand, and the extracted feature regions are hierarchically fused using the fully connected layers of the region ensemble network, so that the estimated three-dimensional hand pose is more accurate.
Further, step S2 includes the following sub-steps:
S21, the CNN model comprises a plurality of convolutional layers, wherein the feature map extracted by the last convolutional layer is denoted F; according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a first feature region is extracted from the feature map F;
S22, at stage t, cropping the first feature region extracted in step S21 with a rectangular window to obtain a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
S23, the region ensemble network comprises a plurality of fully connected layers; the fully connected layers are used to fuse the plurality of rectangular regions containing hand joint points obtained by cropping in step S22 into a fused feature region covering the five finger joints, and for this fused feature region a regression model R is used to regress the hand pose $P^t$ of the human hand depth image.
Further, in step S23, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
Further, in step S23, all joint points belonging to the same finger are concatenated, the concatenation function being denoted concat; the concatenated neurons are further fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the feature regions of the five fingers, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger; FC(·) means that the input is passed through a fully connected layer to obtain the corresponding joint point coordinates;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
Further, in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, i.e. the number of input human hand depth images, $D_i$ is an input human hand depth image, $P_i^0$ is an initial estimate of the human hand pose, and $P_i^{gt}$ is the manually annotated three-dimensional coordinates of the true hand pose.
The invention further provides a human hand depth image pose estimation system based on a deep residual network, comprising the following modules:
a feature map extraction module, used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the CNN model, obtaining a human hand feature map;
a human hand pose estimation module, used to input the extracted human hand feature map into a trained region ensemble network and to estimate the human hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
Further, the human hand pose estimation module comprises the following sub-modules:
a feature region extraction module, used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
a cropping module, used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as $F_i^t = \mathcal{G}(F, b_i^t)$, where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
a hand pose calculation module, used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
Further, in the hand pose calculation module, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
In the human hand depth image pose estimation method and system based on the deep residual network, features are extracted from the joint regions of the human hand and, after hierarchical fusion, three-dimensional hand pose estimation is performed.
In the disclosed method and system, the pose-guided structured region ensemble network feeds the predicted hand pose estimate back to the feature map as guidance information; by continuously feeding back the error, better hand features can be learned.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a hand depth image pose estimation method disclosed by the invention;
FIG. 2 is a schematic diagram of a residual structure;
FIG. 3 is a diagram of a gesture-guided structured area integration network;
FIG. 4 is a residual convolutional network model;
FIGS. 5-10 are graphs of the average error of each joint point and of the projection of the estimated hand pose, on three different data sets;
FIG. 11 is a diagram of a pose estimation system for a human hand depth image disclosed in the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
The invention aims to provide a network structure that overcomes the shortcoming of existing three-dimensional pose estimation methods of regressing the three-dimensional pose coordinates of a human hand from a single depth map; the network structure performs feature extraction on the joint regions of the human hand, then hierarchically fuses the extracted feature maps, and finally performs three-dimensional hand pose estimation on the fused feature maps.
Please refer to fig. 1, which is a flowchart of a pose estimation method for a depth image of a human hand disclosed in the present invention, specifically including the following steps:
s1, inputting the hand depth image into a CNN model, and performing feature extraction on the input image by using the CNN model to obtain a hand feature map;
in the CNN model, the convolutional layers are used to extract features from the input human hand image, the hand features being extracted through parameter sharing and sparse connectivity;
In this embodiment, the network has 6 convolutional layers and 2 residual connections; each convolution kernel is of size 3 × 3, the numbers of convolution kernels are 16, 32 and 64, a nonlinear activation function ReLU follows each convolutional layer, a max pooling layer follows every two convolutional layers, and the residual connections lie between the max pooling layers in order to prevent the vanishing-gradient problem of deep networks; the residual structure is shown in detail in FIG. 2.
In this embodiment, the convolution kernel is represented by the function $f_k$ and is of size 3 × 3, so each convolution kernel makes 3 × 3 connections with the human hand image x; with μ and ν the length and width of the human hand image, the current convolutional layer outputs

$y(m, n) = \sum_{u=1}^{3} \sum_{v=1}^{3} f_k(u, v)\, x(m+u-1, n+v-1)$

for each valid position (m, n) of the kernel on the image.
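As an illustration of the feature extractor just described, the following is a minimal PyTorch sketch, assuming the residual connections each skip a pair of convolutions and using 1 × 1 projections so that channel counts match; class names and the exact placement of skips are assumptions, since the embodiment specifies only the layer counts, kernel sizes and channel numbers.

import torch
import torch.nn as nn

class HandFeatureExtractor(nn.Module):
    """Six 3x3 conv layers (16, 32, 64 kernels), ReLU after each convolution,
    max pooling after every two convolutions; the two residual connections
    each skip over a pair of convolutions between the pooling layers."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(inplace=True))
        self.pool1 = nn.MaxPool2d(2)
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.skip2 = nn.Conv2d(16, 32, 1)   # 1x1 projection so channels match
        self.pool2 = nn.MaxPool2d(2)
        self.block3 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.skip3 = nn.Conv2d(32, 64, 1)

    def forward(self, x):
        x = self.pool1(self.block1(x))
        # Residual connection 1: identity shortcut around the second conv pair.
        x = self.pool2(self.block2(x) + self.skip2(x))
        # Residual connection 2: identity shortcut around the third conv pair.
        return self.block3(x) + self.skip3(x)

# A 128x128 single-channel depth image yields a 64-channel feature map here;
# the 12x12x64 map of FIG. 4 implies stride/padding choices the patent
# does not fully specify.
F = HandFeatureExtractor()(torch.randn(1, 1, 128, 128))
print(F.shape)  # torch.Size([1, 64, 32, 32])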
S2, training the region ensemble network, inputting the extracted hand feature map into the trained region ensemble network, and estimating the hand pose through the network; wherein:
in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, $D_i$ is an input human hand depth image, $P_i^0$ is an initial value of the hand pose estimate, and $P_i^{gt}$ is the three-dimensional hand pose of the depth image. With this training set, each sample in the set is used for training, and the process is repeated until the maximum number of iterations T is reached.
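As an illustration, a minimal sketch of this training loop under assumed names: model is any module implementing the stage regressor R described below, and train_set yields (D_i, P_i^0, P_i^gt) triples.

import torch.nn as nn

def train_region_ensemble(model, optimizer, train_set, T):
    """Train on (D_i, P_i^0, P_i^gt) triples, repeating over the training
    set until the maximum number of iterations T is reached."""
    loss_fn = nn.MSELoss()
    for t in range(1, T + 1):
        for depth_image, pose_init, pose_gt in train_set:
            pose_pred = model(pose_init, depth_image)   # one stage: R(P^{t-1}, D)
            loss = loss_fn(pose_pred, pose_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model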
Referring to FIG. 3, which shows the pose-guided structured region ensemble network, this section explains the hand pose estimation process in detail in conjunction with the network structure. In the region ensemble network, the feature map extracted by the last convolutional layer of the CNN model is first denoted F, and according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a feature region map is extracted from the feature map F; before the next operation, the world coordinates of the i-th hand joint point are converted into pixel coordinates:

$(u_i, v_i) = (f_x X_i / Z_i + c_x,\ f_y Y_i / Z_i + c_y)$

where $(X_i, Y_i, Z_i)$ are the world coordinates of joint i, $(f_x, f_y)$ the focal lengths, and $(c_x, c_y)$ the principal point of the depth camera;
secondly, the extracted feature region map is uniformly divided into several grid regions using rectangular windows, each grid region is fed to a fully connected layer for fusion, and hand pose regression is then performed; the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
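A minimal sketch of the cropping function $\mathcal{G}$, assuming the window $b_i^t$ has already been converted to feature-map coordinates; the function name crop_joint_region is hypothetical.

import torch

def crop_joint_region(F, x, y, w, h):
    """G(F, b_i): cut the w-by-h window with upper-left corner (x, y) out of
    the feature map F of shape (channels, height, width)."""
    _, H, W = F.shape
    # Clamp the window so it stays inside the feature map.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(W, x + w), min(H, y + h)
    return F[:, y0:y1, x0:x1]

# Usage: a 12x12x64 feature map and a 6x6 window around one joint.
F = torch.randn(64, 12, 12)
region = crop_joint_region(F, x=3, y=4, w=6, h=6)
print(region.shape)  # torch.Size([64, 6, 6])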
Finally, for the joint points on the same finger, the feature regions obtained by cropping are fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand; for this fused feature region, the hand pose $P^t$ of the human hand depth image is regressed using the regression model R. Wherein:
in the cascaded network, assuming the depth image is denoted D, the three-dimensional hand pose is represented as

$P = \{p_j\}_{j=1}^{J}$

where J is the number of human hand joint points; at stage t-1, the currently predicted hand pose estimate is $P^{t-1}$, and the prediction of the whole regression model R for the hand pose estimate at stage t is expressed as:

$P^t = R(P^{t-1}, D);$

over the whole training process, after T stages, the final hand pose estimate $P^T$ for the input depth image D is obtained:

$P^T = R(P^{T-1}, D).$
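The cascaded refinement itself reduces to a short loop; a sketch under the assumption that regression_model implements R(P^{t-1}, D):

def cascaded_pose_estimation(regression_model, depth_image, pose_init, T):
    """Refine the hand pose stage by stage: P^t = R(P^{t-1}, D), t = 1..T."""
    pose = pose_init
    for _ in range(T):
        # The previous estimate guides where feature regions are cropped next.
        pose = regression_model(pose, depth_image)
    return pose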
The above is the complete hand pose estimation process. The feature regions of the joints of the five fingers are each input to a fully connected layer FC; all joint points belonging to the same finger are concatenated using the concat connection function, and the concatenated neurons are further fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the per-finger feature regions, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
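A minimal sketch of this hierarchical fusion in PyTorch, assuming each cropped joint region has been flattened to a fixed-length vector and that every finger has the same number of joints; all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Per-finger fusion through l1, cross-finger fusion and regression through l2."""
    def __init__(self, region_dim, joints_per_finger, num_joints, hidden=2048):
        super().__init__()
        # l1 fuses the concatenated joint regions of one finger: f_i = FC(concat(...)).
        self.l1 = nn.Linear(region_dim * joints_per_finger, hidden)
        # l2 fuses the five per-finger features and regresses the 3*J coordinates.
        self.l2 = nn.Linear(hidden * 5, 3 * num_joints)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, finger_regions):
        # finger_regions: list of 5 tensors of shape (batch, joints_per_finger, region_dim).
        per_finger = [self.relu(self.l1(f.flatten(start_dim=1)))
                      for f in finger_regions]
        fused = torch.cat(per_finger, dim=1)   # concat(f_1, ..., f_5)
        return self.l2(fused)                  # P^t

# Usage: 3 joints per finger, 15 joints in total, 64*6*6-dim flattened regions.
net = HierarchicalFusion(region_dim=64 * 6 * 6, joints_per_finger=3, num_joints=15)
regions = [torch.randn(2, 3, 64 * 6 * 6) for _ in range(5)]
print(net(regions).shape)  # torch.Size([2, 45])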
As shown in FIG. 4, the convolutional network model used for feature extraction consists of six 3 × 3 convolutional layers; the network input is a 128 × 128 depth image of a human hand, a ReLU activation layer performs a nonlinear feature transformation after each convolutional layer, residual structures are connected between the pooling layers, and the feature map output by the hand feature extraction network has dimensions 12 × 12 × 64. For the regression task, this embodiment uses two 2048-dimensional fully connected layers, where the dropout rate of the neurons of each fully connected layer used for hand pose regression is 0.5, preventing the model from overfitting; the final regression result is a 3 × J vector representing the world coordinates of the hand joint points, where J denotes the number of hand joints.
In this embodiment, in order to demonstrate the effectiveness of the algorithm, three public data sets (ICVL, MSRA, NYU) are used for comparison.
The whole model is trained for 100 epochs with a batch size of 64. The deep residual network uses the Adam gradient descent algorithm with a learning rate of 0.0001; the region ensemble network uses the SGD gradient descent algorithm with a learning rate of 0.005, the learning rate divided by 10 every 2 epochs, a weight decay of 0.0005 and a momentum of 0.9. The pose-guided structured region ensemble network uses the SGD gradient descent algorithm with a learning rate of 0.001, the learning rate divided by 10 every 10 epochs, a weight decay of 0.0005 and a momentum of 0.9.
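These hyperparameters map directly onto standard PyTorch optimizers and schedulers; a sketch with placeholder modules standing in for the sub-networks:

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

feature_extractor = nn.Linear(8, 8)  # placeholder for the deep residual network
pose_guided_net = nn.Linear(8, 8)    # placeholder for the pose-guided region ensemble

# Deep residual network: Adam with learning rate 0.0001.
adam = optim.Adam(feature_extractor.parameters(), lr=1e-4)
# Pose-guided network: SGD with learning rate 0.001, weight decay 0.0005, momentum 0.9.
sgd = optim.SGD(pose_guided_net.parameters(), lr=1e-3,
                weight_decay=5e-4, momentum=0.9)
# Divide the SGD learning rate by 10 every 10 epochs.
scheduler = StepLR(sgd, step_size=10, gamma=0.1)

for epoch in range(100):    # 100 epochs, batch size 64 in the data loader
    pass                     # forward/backward passes over the batches go here
    scheduler.step()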
The predicted hand pose is compared with the ground-truth hand pose in terms of the per-joint error; the average joint point error is

$E = \frac{1}{N J} \sum_{i=1}^{N} \sum_{j=1}^{J} \| p_{i,j} - p_{i,j}^{gt} \|_2$

where N is the number of test samples, J the number of hand joints, and $p_{i,j}$ and $p_{i,j}^{gt}$ the predicted and ground-truth three-dimensional positions of joint j in sample i.
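A sketch of this metric, assuming predictions and ground truth are given as (N, J, 3) arrays of joint coordinates in millimetres:

import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean distance over all samples and joints, in mm.
    pred, gt: arrays of shape (N, J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example with random values: 100 samples, 15 joints.
pred = np.random.randn(100, 15, 3)
gt = np.random.randn(100, 15, 3)
print(f"mean joint error: {mean_joint_error(pred, gt):.2f} mm")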
for the above applied 3 different data sets, three different methods are used for hand pose estimation, the three different methods including: the method includes a deep residual convolution network (called ResNet-Hand for short), a regional integration network (called Multi-Region for short) and a Hand posture guidance structured regional integration network (called Pose-Guide for short), and the implementation effect diagrams are shown in FIG. 5-FIG. 10.
In fig. 5, the human hand pose estimation accuracy of the three methods is compared by using the NYU data set. The hand Pose estimation is obtained by using a hand central point as prior information, the average error of a depth residual convolution network method (ResNetHandd) is 13.89mm, the average error of a regional integrated network method (Multi-Region) is 12.63mm, and the average error of a hand posture guidance structured regional integrated network (Pose-Guide) is 11.49 mm. Currently, it is shown that the accuracy of hand pose estimation by using a hand pose guidance structured area integrated network is higher than that of other two methods, specifically because a deep residual convolution network has stronger feature extraction capability than a shallow network. The position and pose estimation is carried out by combining the characteristic diagram information in the area integrated network, compared with a single network, the area integrated network has stronger characteristic expression capability, and the Hand position and pose estimation precision is higher than that of a ResNet-Hand method. The area integration network of human hand pose guidance enables the network to learn better features by incorporating guidance information from previous human hand pose estimates into the feature map.
FIG. 6 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the NYU data set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
FIG. 7 compares the hand pose estimation accuracy of the three methods on the MSRA data set. The hand pose estimates are obtained using the hand center point as prior information; the average error of the deep residual convolutional network method (ResNet-Hand) is 9.79 mm, the average error of the region ensemble network method (Multi-Region) is 8.65 mm, and the average error of the pose-guided structured region ensemble network (Pose-Guide) is 8.58 mm.
Here too, the pose-guided structured region ensemble network estimates the hand pose more accurately than the other two methods.
FIG. 8 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the MSRA test set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
In FIG. 9, the hand pose estimation accuracy of the three methods is compared on the ICVL data set. The average error of the deep residual convolutional network method (ResNet-Hand) is 7.63 mm, the average error of the region ensemble network method (Multi-Region) is 7.31 mm, and the average error of the pose-guided structured region ensemble network (Pose-Guide) is 7.21 mm.
Here too, the pose-guided structured region ensemble network estimates the hand pose more accurately than the other two methods.
FIG. 10 shows the projection of the three-dimensional hand pose estimates onto the two-dimensional depth images of the ICVL test set: the first row is the projection of the ground-truth hand joint coordinates (GT) on the hand images, the second row is the projection of the hand joint coordinates predicted by the deep residual convolutional network (ResNet-Hand), the third row is the projection of the hand joint coordinates predicted by the region ensemble network method (Multi-Region), and the fourth row is the projection of the hand joint coordinates predicted by the pose-guided structured region ensemble network (Pose-Guide).
Referring to FIG. 11, which shows the structure of the hand depth image pose estimation system disclosed by the invention, the system comprises a feature map extraction module L1 and a hand pose estimation module L2, wherein:
the feature map extraction module L1 is used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the model, obtaining a human hand feature map;
the hand pose estimation module L2 is used to input the extracted human hand feature map into a trained region ensemble network and to estimate the hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions. The hand pose estimation module L2 further comprises a feature region extraction module L21, a cropping module L22 and a hand pose calculation module L23, which perform the three-dimensional hand pose estimation from the input hand feature map; the functions of these modules are as follows:
the feature region extraction module L21 is used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
the cropping module L22 is used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as $F_i^t = \mathcal{G}(F, b_i^t)$, where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
the hand pose calculation module L23 is used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
The invention combines the strong feature extraction capability of the deep residual convolutional network with the feature fusion advantage of the region ensemble network and provides a pose-guided structured region ensemble network method. To further mine the feature information of the depth image, the pose-guided structured region ensemble network feeds the predicted hand pose estimate back to the feature map as guidance information and learns better hand features through the continuously fed-back error. The experimental results show that the pose-guided structured region ensemble network fully extracts more optimized and more representative hand features and achieves higher hand pose estimation accuracy than other methods.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A human hand depth image pose estimation method based on a deep residual network, characterized by comprising the following steps:
S1, inputting a human hand depth image into a CNN model, and performing feature extraction on the input image using the CNN model to obtain a human hand feature map;
S2, taking the input human hand depth images as training samples, training a region ensemble network, inputting the human hand feature map extracted in step S1 into the trained region ensemble network, and estimating the human hand pose through this network; during hand pose estimation, the extracted human hand feature map is uniformly divided into a plurality of feature regions within the region ensemble network, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
2. The human hand depth image pose estimation method according to claim 1, characterized in that step S2 comprises the following sub-steps:
S21, the CNN model comprises a plurality of convolutional layers, wherein the feature map extracted by the last convolutional layer is denoted F; according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, a first feature region is extracted from the feature map F;
S22, at stage t, cropping the first feature region extracted in step S21 with a rectangular window to obtain a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
S23, the region ensemble network comprises a plurality of fully connected layers; the fully connected layers are used to fuse the plurality of rectangular regions containing hand joint points obtained by cropping in step S22 into a fused feature region covering the five finger joints, and for this fused feature region a regression model R is used to regress the hand pose $P^t$ of the human hand depth image.
3. The human hand depth image pose estimation method according to claim 2, characterized in that the region ensemble network has, connected in sequence after the last convolutional layer, a fully connected layer $l_1$ and a fully connected layer $l_2$; in step S23, for the joint points on the same finger, the feature regions obtained by cropping are all fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
4. The method according to claim 3, characterized in that in step S23, all joint points belonging to the same finger are concatenated, the concatenation function being denoted concat, and the concatenated neurons are fused and connected through the fully connected layer $l_1$ to obtain the feature regions of the different fingers:

$f_i = \mathrm{FC}(\operatorname{concat}(F_{i,1}^t, \dots, F_{i,M_i}^t)), \quad i = 1, \dots, 5$

where $f_1, \dots, f_5$ are the per-finger feature regions, each obtained by inputting the concatenated joint regions to the fully connected layer $l_1$ to yield the joint point coordinates of that finger; M denotes the number of rectangular regions obtained by cropping; all joints on the i-th finger are denoted $F_{i,1}^t, \dots, F_{i,M_i}^t$, with $M_i$ the number of joints of the i-th finger; FC(·) means that the input is passed through a fully connected layer to obtain the corresponding joint point coordinates;
the feature regions $f_1, \dots, f_5$ of the different fingers are concatenated and input to the fully connected layer $l_2$, in which the final hand pose is regressed:

$P^t = \mathrm{FC}(\operatorname{concat}(f_1, \dots, f_5))$
5. The human hand depth image pose estimation method according to claim 1, characterized in that in the training process of the region ensemble network model, the training set is defined as

$T^0 = \{(D_i, P_i^0, P_i^{gt})\}_{i=1}^{N_T}$

where $N_T$ denotes the number of training samples, i.e. the number of input human hand depth images, $D_i$ is an input human hand depth image, $P_i^0$ is an initial estimate of the human hand pose, and $P_i^{gt}$ is the manually annotated three-dimensional coordinates of the true hand pose.
6. A human hand depth image pose estimation system based on a deep residual network, characterized by comprising the following modules:
a feature map extraction module, used to input the human hand depth image into a CNN model and to perform feature extraction on the input image using the CNN model, obtaining a human hand feature map;
a human hand pose estimation module, used to input the extracted human hand feature map into a trained region ensemble network and to estimate the human hand pose through this network; in the region ensemble network, the extracted human hand feature map is uniformly divided into a plurality of feature regions, each feature region is input to a regression model for hand pose estimation, and the hand pose of the human hand depth image is finally regressed by fusing the regression results of the individual feature regions.
7. The human hand depth image pose estimation system according to claim 6, characterized in that the human hand pose estimation module comprises the following sub-modules:
a feature region extraction module, used to denote the feature map extracted by the last convolutional layer as F and, according to the hand pose estimate $P^{t-1}$ obtained at stage t-1, i.e. at time t-1, to extract a first feature region from the feature map F;
a cropping module, used to crop, at stage t, the first feature region extracted by the feature region extraction module with a rectangular window, obtaining a plurality of rectangular regions containing human hand joint points, wherein the rectangular region is defined as $b_i^t = (x_i^t, y_i^t, w, h)$, with $(x_i^t, y_i^t)$ the coordinates of the upper-left corner of the rectangular region in which hand joint point i lies, and w and h respectively the width and height of the current rectangular region; the feature region containing hand joint point i is represented as:

$F_i^t = \mathcal{G}(F, b_i^t)$

where the function $\mathcal{G}(\cdot)$ crops, from the feature map F extracted from the human hand depth image, the feature region containing a hand joint point delimited by the rectangular window $b_i^t$;
a hand pose calculation module, used to fuse the plurality of rectangular regions containing hand joint points obtained by the cropping module into a fused feature region covering the five finger joints and, for this fused feature region, to regress the hand pose $P^t$ of the human hand depth image using a regression model R.
8. The system according to claim 7, characterized in that in the hand pose calculation module, for the joint points on the same finger, the feature regions obtained by cropping are all fused and connected through the fully connected layer $l_1$ to obtain a first fused feature region; then the first fused feature regions obtained for the individual fingers are jointly input to the fully connected layer $l_2$ for feature region fusion, obtaining the fused feature region covering the five finger joints of the human hand.
CN201910629662.8A 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network Pending CN110472507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910629662.8A CN110472507A (en) 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network


Publications (1)

Publication Number Publication Date
CN110472507A true CN110472507A (en) 2019-11-19

Family

ID=68508170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910629662.8A Pending CN110472507A (en) 2019-07-12 2019-07-12 Human hand depth image pose estimation method and system based on deep residual network

Country Status (1)

Country Link
CN (1) CN110472507A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150088116A (en) * 2014-01-23 2015-07-31 삼성전자주식회사 Parameter learning method for estimating posture of articulated object and posture estimating method of articulated object
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
CN105759967A (en) * 2016-02-19 2016-07-13 电子科技大学 Global hand gesture detecting method based on depth data
CN108960178A (en) * 2018-07-13 2018-12-07 清华大学 A kind of manpower Attitude estimation method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781492A (en) * 2020-06-10 2021-12-10 阿里巴巴集团控股有限公司 Target element content measuring method, training method, related device and storage medium
CN111950521A (en) * 2020-08-27 2020-11-17 深圳市慧鲤科技有限公司 Augmented reality interaction method and device, electronic equipment and storage medium
CN113763572A (en) * 2021-09-17 2021-12-07 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium
CN113763572B (en) * 2021-09-17 2023-06-27 北京京航计算通讯研究所 3D entity labeling method based on AI intelligent recognition and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191119