CN111709269A - Human hand segmentation method and device based on two-dimensional joint information in depth image - Google Patents

Human hand segmentation method and device based on two-dimensional joint information in depth image

Info

Publication number
CN111709269A
CN111709269A (application CN202010332317.0A)
Authority
CN
China
Prior art keywords
dimensional
dimensional joint
human hand
depth
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010332317.0A
Other languages
Chinese (zh)
Other versions
CN111709269B (en)
Inventor
左德鑫
邓小明
马翠霞
王宏安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202010332317.0A priority Critical patent/CN111709269B/en
Publication of CN111709269A publication Critical patent/CN111709269A/en
Application granted granted Critical
Publication of CN111709269B publication Critical patent/CN111709269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human hand segmentation method and device based on two-dimensional joint information in a depth image. The method comprises the following steps: acquiring the positions of the two-dimensional joint points of the human hand in the depth image with a two-dimensional joint point detection network; obtaining the three-dimensional key points of the hand from the two-dimensional joint points combined with the depth image; calculating a three-dimensional directional bounding box of the hand from the three-dimensional key points; and filtering the depth image with the three-dimensional directional bounding box to obtain the segmented hand region. The invention provides a human hand two-dimensional joint point detection method based on a deep neural network, a conversion method from two-dimensional joint points to three-dimensional key points, and a filtering scheme based on the three-dimensional bounding box and depth values. Practical use shows that the method is highly automated, accurate and fast, and can meet professional or general application requirements.

Description

Human hand segmentation method and device based on two-dimensional joint information in depth image
Technical Field
The invention belongs to the field of computer vision and computer image processing, and particularly relates to a human hand depth image segmentation method and device based on two-dimensional joint points.
Background
Hand pose estimation and gesture understanding are active problems in computer vision and human-computer interaction, with wide application in scenarios such as virtual reality, augmented reality and computer-aided design; accurate hand pose estimation and gesture understanding therefore have great application and research value. A human hand segmentation algorithm semantically separates the hand parts from the non-hand parts of an image and is an important preprocessing step for a computer to understand gestures; the present invention addresses this hand segmentation problem.
Currently, the mainstream human hand depth image data sets (e.g. NYU, HANDS 2017, ICVL, MSRA) generally provide depth images of the hand together with joint points, but only a few of them (e.g. NYU) provide segmentation masks for the hand, so the joint-point positions become the main basis for obtaining a hand mask. The hand joint points cover the key positions of the hand (finger joints, wrist, palm, etc.). A three-dimensional joint point is the coordinate of a joint in three-dimensional space, represented by three scalars; a two-dimensional joint point is the coordinate of a joint on the image plane, represented by two scalars. A three-dimensional bounding box of the hand, and hence the hand region, is easier to compute from three-dimensional joint points than from two-dimensional ones, but for unlabeled data accurate three-dimensional joint points are harder to obtain than two-dimensional ones. How to combine two-dimensional joint points with depth-map information is therefore the key to solving the hand segmentation problem on depth maps.
Disclosure of Invention
The invention provides a human hand segmentation method and a human hand segmentation device based on two-dimensional joint information in a depth image, and mainly solves the problem of how to segment a human hand region from a single depth image.
The invention discloses a human hand segmentation method based on two-dimensional joint information in a depth image, which comprises the following steps:
acquiring the position of a two-dimensional joint point of a human hand in the depth image by using a two-dimensional joint point detection network;
acquiring three-dimensional key points of the human hand by using the two-dimensional joint points and combining the depth image;
calculating a three-dimensional directional bounding box of the human hand by using the three-dimensional key points;
and filtering the depth image by using the three-dimensional directional bounding box to obtain the well-segmented human hand area.
Furthermore, the two-dimensional joint point detection network is mainly an hourglass network: global information and deep features are extracted by the convolution and down-sampling of the hourglass network, the required output is decoded by convolution and up-sampling, and skip connections are added so that the decoded features contain both deep semantic information and shallow morphological features.
Further, when the two-dimensional joint detection network is trained, the training data are first preprocessed, including scaling to a standard size, normalizing, and obtaining heat map labels; the two-dimensional joint point detection network takes the preprocessed image as input and obtains the specific positions of the two-dimensional joint points; the output of the two-dimensional joint detection network is a heat map.
Furthermore, the output of the two-dimensional joint point detection network is a heat map with J channels; each channel corresponds to one class of joint point, each pixel holds a scalar value reflecting the probability of that pixel being the j-th joint point, and the position of the point with the highest probability is taken as the coordinate of the joint point.
Further, the acquiring three-dimensional key points of the human hand by using the two-dimensional joint points and combining the depth image comprises the following steps: and estimating effective depth values of the adjacent areas of the two-dimensional joint points, and combining the two-dimensional joint points with the effective depth values to finish the conversion from the two-dimensional joint points to the three-dimensional key points.
Further, when calculating the effective depth value, the Gaussian mixture model is used to estimate the distribution of the foreground depth value, the background depth value and the segmented entity depth value, so as to eliminate the interference of the noise depth value.
Further, the principal axis directions of the three-dimensional directional bounding box are obtained by principal component analysis of the three-dimensional key points, and its lengths along the axes are determined by the extent of the projections of the three-dimensional key points onto those axes.
Further, when the depth map is filtered with the three-dimensional directed bounding box, each pixel of the original depth map is tested for whether it lies inside the box, and the computation is accelerated by GPU parallelism.
A human hand segmentation apparatus based on two-dimensional joint information in a depth image, comprising:
the two-dimensional joint point detection module is responsible for constructing a two-dimensional joint point detection network and obtaining the position of the two-dimensional joint point of the human hand in the depth image by utilizing the two-dimensional joint point detection network;
the key point acquisition module is responsible for acquiring three-dimensional key points of the human hand by utilizing the two-dimensional joint points and combining the depth image;
the bounding box calculation module is responsible for calculating the three-dimensional directed bounding box of the human hand by using the three-dimensional key points;
and the hand segmentation module is responsible for filtering the depth image by utilizing the three-dimensional directed bounding box to obtain a well segmented hand region.
Further, the apparatus further comprises:
the data preprocessing module is responsible for preprocessing the training data of the two-dimensional joint detection network, zooming the original depth map to a standard size, normalizing and acquiring a heat map label;
and the network construction and training module is responsible for constructing and training the two-dimensional joint detection network and is used for detecting the coordinates of the two-dimensional joint on the image plane.
The invention has the advantages and beneficial effects that:
the invention mainly solves the problem of human hand segmentation by using a human hand joint point prediction data set without Mask marks. The invention provides a segmentation algorithm based on two-dimensional joint point prediction and joint point region depth value clustering, which can eliminate the interference of foreground and background under the condition of large predicted two-dimensional joint point error and obtain accurate hand segmentation. Through practical use verification, the method has the advantages of high automation degree, high precision and real-time performance, and can meet professional or popular application requirements.
Compared with the directed bounding box directly calculated based on the three-dimensional joint points, the method is more accurate because the depth information of the joint labels of the partial data set is not particularly accurate, thereby resulting in incomplete segmentation. The method has advantages in some specific occasions, for example, when manual interactive annotation is carried out, the operation difficulty of annotating the three-dimensional joint points accurately is very high for annotators of the PC platform, the method can automatically combine the depth information of the image, and the depth information of the hand is deduced under the condition that the depth information annotation is missing.
The method obtains the area to be segmented by predicting the two-dimensional joint points, and estimates the distribution conditions of the foreground depth value, the background depth value and the segmented entity depth value by using a Gaussian mixture model to eliminate the interference of the noise depth value. Through practical use verification, the algorithm has high tolerance on errors of two-dimensional joint points and accurate segmentation.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention.
Fig. 2 is a general block diagram of a human hand two-dimensional joint detection network based on deep learning.
Figure 3 is a block diagram of an hourglass network.
Fig. 4 is a structural diagram of a residual module.
Fig. 5 is a diagram showing the results of the present invention in actual testing.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
FIG. 1 is a general flowchart of the human hand segmentation method based on two-dimensional joint information in depth images of the present invention. For an input depth map, the method first obtains the positions of the two-dimensional joint points of the hand with a two-dimensional joint point detection network; the two-dimensional joint points are then combined with the depth map to obtain three-dimensional key points of the hand; an oriented bounding box is computed from the three-dimensional key points; and the depth map is filtered with the three-dimensional oriented bounding box to obtain the segmented hand region.
The following describes, in order, the data preprocessing adopted by the invention, the specific structure of the two-dimensional joint detection network, the loss function used, cropping based on the two-dimensional joint points, removal of noisy depth values, computation of the bounding box, and computation of the mask.
Step 1: pre-processing of training data
The original depth map is scaled to a standard size, set in this method to height × width, and normalized (subtracting the mean of the depth map and dividing by the difference between the maximum and minimum depths of the depth camera).
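A minimal sketch of this preprocessing (Python), assuming an OpenCV resize and illustrative values for the standard size and the camera's depth range, neither of which is fixed by the description above:

import numpy as np
import cv2  # assumed dependency, used only for resizing

def preprocess_depth(depth, out_h=480, out_w=640, d_min=200.0, d_max=1500.0):
    """Scale a raw depth map to the standard (height, width) and normalize it.
    d_min / d_max stand for the depth camera's minimum and maximum depth;
    the values here are placeholders, not taken from the patent."""
    resized = cv2.resize(depth.astype(np.float32), (out_w, out_h),
                         interpolation=cv2.INTER_NEAREST)
    return (resized - resized.mean()) / (d_max - d_min)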
Obtaining labels: the label required to train the joint point detection network is a heat map, i.e. a three-dimensional tensor of size (height, width, J), whereas the label given by a typical data set is a set of joint point positions of size (J, 3) (J is the number of joints and equals the number of channels of the output), so the corresponding heat map has to be computed from the three-dimensional joint points. Let the label of a three-dimensional joint point be (u_gt, v_gt, d_gt), where u_gt is the horizontal coordinate of the joint on the picture, v_gt its vertical coordinate and d_gt its depth value, and denote an arbitrary position in the tensor by (u, v, j), where u is the horizontal coordinate of a pixel on the heat map, v its vertical coordinate and j the channel index. The value H_GT(u, v, j) at that position is given by:
[Equation image: Gaussian heat-map value H_GT(u, v, j) centred on (u_gt, v_gt); not reproduced in this text extraction.]
where σ is a fixed number, and s_x and s_y are computed as:
[Equation image: formulas for s_x and s_y; not reproduced in this text extraction.]
where f_x and f_y are the focal length parameters of the camera.
If only the two-dimensional joint points (u_gt, v_gt) are available, the formula for H_GT(u, v, j) becomes:
[Equation image: simplified heat-map formula for the two-dimensional-only case; not reproduced in this text extraction.]
The preprocessed samples that can be used for training are denoted {(D_i, H_GT^i)}, i = 1, ..., N (the exact set notation is an equation image not reproduced here), where N is the number of samples, and D_i and H_GT^i are respectively the normalized depth map and the corresponding heat map of the i-th sample.
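Since the heat-map equations themselves are present only as images, the sketch below assumes the plain two-dimensional Gaussian form described for the case where only two-dimensional joint points (u_gt, v_gt) are available; the depth-dependent s_x, s_y scaling of the full formula is omitted, and the function and variable names are illustrative:

import numpy as np

def make_heatmap_labels(joints_2d, height, width, sigma=2.0):
    """Build a (height, width, J) heat-map tensor from J two-dimensional
    joint labels (u_gt, v_gt), one Gaussian per channel."""
    u = np.arange(width)[None, :, None]           # (1, W, 1)
    v = np.arange(height)[:, None, None]          # (H, 1, 1)
    u_gt = joints_2d[None, None, :, 0]            # (1, 1, J)
    v_gt = joints_2d[None, None, :, 1]            # (1, 1, J)
    dist2 = (u - u_gt) ** 2 + (v - v_gt) ** 2     # broadcasts to (H, W, J)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

# Example: labels for two joints on a 120 x 160 heat map
joints = np.array([[40.0, 30.0], [100.0, 80.0]])  # rows are (u_gt, v_gt)
H_gt = make_heatmap_labels(joints, height=120, width=160)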
Step 2: construction and training of two-dimensional joint point detection network
The present invention proposes a convolutional neural network of the hourglass (hour glass) type for predicting two-dimensional joint points. The design principle of the network is that global information and deep features are extracted by utilizing convolution and down sampling of an hourglass network, required output is decoded by utilizing convolution and up sampling, and jumping connection is added to ensure that the decoded features comprise deep semantic information and shallow morphological features, namely bottom-layer features and high-layer features of an image are utilized.
The basic modules of the two-dimensional joint point detection network are a residual block (residual block) and a convolution block (convolution block), and each residual block comprises a convolution module and a jump connection. The convolution module is a repeated superposition of convolutional layers, batch regularization (BN) layers, and Linear rectification function (ReLU) layers. In the residual block, the input of the residual block enters the convolution block on the one hand, and is combined with the output of the convolution block in an additive manner on the other hand by means of jump connection.
Fig. 2 shows the overall structure of the two-dimensional joint detection network. The depth map first needs symmetric padding and pooling to meet the size requirement. Before entering the hourglass module, several convolution layers and a residual module perform preliminary feature extraction, and the output of the hourglass module passes through several further convolution layers to reach the desired number of channels. In the figure, Residual denotes a residual module, K the size of the layer's convolution kernel, c the number of output channels of the layer, S the stride of the layer, and pad how many pixels are padded in the height and width dimensions of the layer. (480, 640, 1) and (J, 2) are the shapes of the input and output tensors, respectively.
Figure 3 shows the structure of the hourglass network. Each cuboid consists of three residual modules connected in series, all with the same number of output channels (256); pooling and up-sampling operations are applied between cuboids of different sizes. A circle with a plus sign joins two inputs, denoting element-wise addition.
The residual block shown in Fig. 4 contains several convolution layers with batch normalization (BN) and rectified linear unit (ReLU) layers between them. c denotes the number of channels output by the convolution block, and the convolution kernel parameters are given in Fig. 4; "P='same'" means the layer is padded so that its output and input have equal height and width. "+" denotes element-wise addition.
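For illustration, a residual block of the kind just described (stacked convolution, batch-normalization and ReLU layers plus a skip connection added element-wise) might look as follows in PyTorch; the bottleneck layout and channel counts are assumptions borrowed from common hourglass implementations, not a transcription of Fig. 4:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Convolution block (BN-ReLU-Conv repeated) plus an identity/1x1 skip."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        mid = out_ch // 2
        self.conv_block = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),   # "same" padding
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
        )
        # Skip connection: identity if the shapes match, otherwise a 1x1 conv
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        return self.conv_block(x) + self.skip(x)   # element-wise addition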
The input of the two-dimensional joint detection network is a depth map of size (H, W); the output is a heat map of size (H/s, W/s, J), where s is the factor by which the image is reduced after passing through the network. The loss function of the network is:
[Equation image: loss comparing the predicted heat map H_pred with the ground-truth heat map H_GT over all pixels and channels; not reproduced in this text extraction.]
where H_pred is the heat map output by the network, H_GT is the ground-truth heat map, and H, W, J are respectively the height, width and number of channels of the output. The output of the network is a heat map with J channels; each channel corresponds to one class of joint point, and each pixel holds a scalar value reflecting the probability of that pixel being the j-th joint point. The method takes the position of the point with the highest probability as the coordinate of the joint, so the two-dimensional joint coordinates (u, v)_j are computed as:
[Equation image: (u, v)_j obtained from the arg-max of the j-th channel of H_pred, scaled by s; not reproduced in this text extraction.]
where s is the factor by which the image is reduced after passing through the network.
The optimizer used in the network of the present invention is Adam, the learning rate is initially set to 0.001, and decays exponentially as the number of training steps increases.
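Assuming the loss shown only as an equation image is the usual per-pixel squared error between predicted and ground-truth heat maps, and that decoding takes the per-channel arg-max scaled back by the reduction factor s, the two pieces could be sketched as follows (names are illustrative):

import numpy as np
import torch

def heatmap_loss(h_pred, h_gt):
    """Mean squared error over all pixels and channels (assumed form of the
    loss; the original equation image is not reproduced in the text)."""
    return torch.mean((h_pred - h_gt) ** 2)

def heatmaps_to_joints(h_pred, s):
    """Decode an (H/s, W/s, J) heat-map array (numpy) into J joint
    coordinates by taking the arg-max of each channel and scaling by s."""
    h, w, num_joints = h_pred.shape
    joints = np.zeros((num_joints, 2))
    for j in range(num_joints):
        v, u = np.unravel_index(np.argmax(h_pred[:, :, j]), (h, w))
        joints[j] = (u * s, v * s)                 # (u, v)_j in input-image pixels
    return joints

# Adam with an exponentially decaying learning rate, as described above
# (the decay factor is an assumption):
# optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)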
Step 3: depth value estimation
The two-dimensional coordinates J_2D of the joint points are obtained in step 2, but two-dimensional information alone cannot give an accurate three-dimensional bounding box, so J_2D has to be combined with the depth at the corresponding positions of the original depth map. Note that, because of occlusion, obtaining the three-dimensional joint points J_3D directly from J_2D and the depth map is difficult. However, J_2D can be used to obtain points that are important for segmentation: its advantage is that it contains the information of all joints and roughly follows the contour of the hand, so no important part is missed. The points extracted from the depth map with J_2D are called key points, denoted P_3D. P_3D is computed as follows.
The acquisition of P_3D follows these principles:
1. Experiments show that the mean depth over a region is easily influenced by the depth of surrounding points, so it is not used directly as the depth of a joint; to keep the depth value accurate, the depth of the pixel nearest to J_2D is used.
2. The depth of that nearest pixel is strongly affected by noisy depth values when the two-dimensional joint prediction error is large: when J_2D is inaccurate it easily lands on the foreground or background and picks up a foreground or background depth value, so the validity of the neighbouring depth has to be checked. If it is judged invalid, a valid point P_alt must be used instead.
The principle for removing noisy depth values is as follows: the probability distribution of the depth values in the neighbourhood is examined; the foreground, the entity to be segmented and the background can be described by a Gaussian mixture model with three components, and the depth value corresponding to the centre of the Gaussian component closest to the intended target is taken. In the implementation, k-means clustering can be used to approximate the Expectation-Maximization (EM) algorithm.
The picture area corresponding to the two-dimensional joint points is cropped, k-means clustering is performed on its depth values, a candidate depth is obtained with the rule defined below, and the candidate depth is combined with the two-dimensional joint point to form a three-dimensional key point. The specific implementation is as follows:
Use J_2D to compute the minimum bounding box of the joints, crop out that area, and compute the median depth of the crop, denoted d_avg.
For each two-dimensional joint point, find the nearest pixel on the original image and read off its depth value (the symbols in the original equation images are not reproduced here), where i denotes the index of the joint point.
Computation of the substitute (candidate) depth: for each joint point, crop a picture area of 50 × 50 pixels centred on it (the size can be adjusted according to the depth map resolution); if the cropped area contains no depth values, enlarge the crop by a fixed proportion until it does. Run k-means clustering with k = 3 on the depth values of the area, sort the cluster centres, take the middle centre, and record the depth value closest to it as the candidate depth (again, the exact symbols are in equation images that are not reproduced here).
If the nearest-pixel depth differs from the reference depth by more than d_threshold (the exact comparison is given by an unreproduced equation image), the nearest-pixel depth is treated as a noise depth value and the candidate depth is used as d_i instead; otherwise the nearest-pixel depth is used directly as d_i. In this example d_threshold is set to 100.
Finally, d_i is concatenated with the corresponding two-dimensional joint point to give the three-dimensional key point.
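One plausible reading of this procedure, sketched in Python with scikit-learn's KMeans standing in for the k = 3 clustering; the exact comparison in the unreproduced equation images is assumed here to be a threshold on the difference to the crop's median depth, and the function and symbol names are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def estimate_joint_depth(depth, u, v, d_avg, win=50, d_threshold=100.0):
    """Estimate a usable depth d_i for one predicted 2D joint (u, v).

    Crops a win x win window around the joint, clusters its valid depth
    values into k = 3 groups (foreground / target / background) and takes
    the middle cluster centre as the candidate depth; the nearest-pixel
    depth is kept only if it is close enough to the reference depth d_avg."""
    h, w = depth.shape
    u, v = int(round(u)), int(round(v))
    d_near = depth[min(max(v, 0), h - 1), min(max(u, 0), w - 1)]
    half, vals = win // 2, np.empty((0, 1))
    while len(vals) < 3 and half <= max(h, w):        # enlarge if crop has no depth
        crop = depth[max(v - half, 0):v + half, max(u - half, 0):u + half]
        vals = crop[crop > 0].reshape(-1, 1)
        half *= 2
    if len(vals) < 3:
        return d_near
    centres = KMeans(n_clusters=3, n_init=10).fit(vals).cluster_centers_.ravel()
    d_cand = np.sort(centres)[1]                      # middle cluster centre
    if abs(d_near - d_avg) > d_threshold:             # nearest-pixel depth looks like noise
        return d_cand
    return d_near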
Step 3 (accelerated variant): depth value estimation
For each point of the cropped region that has a depth value, D_j = (u_j, v_j, d_j) (j is the index of the pixel), two distances are computed; their defining equations are present only as images and are not reproduced in this text extraction. d_thresh is a fixed value, and the meaning of the first formula is that when d_j does not differ much from the average depth, the corresponding term is ignored. According to the imaging principle of the camera, a two-dimensional joint point determines a ray l_i in space; the second quantity is the distance from D_j to l_i, computed by the formula in the corresponding (unreproduced) equation image.
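Since the defining equations of this variant are only present as images, the following is a hedged sketch of the ray-distance part only, assuming a pinhole camera model in which the joint pixel and the camera centre determine the ray l_i; the intrinsics and names are illustrative:

import numpy as np

def backproject(u, v, d, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth d to a 3D point."""
    return np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])

def point_to_ray_distance(p, u_joint, v_joint, fx, fy, cx, cy):
    """Distance from 3D point p to the camera ray through the joint pixel
    (u_joint, v_joint); the ray direction follows from the pinhole model."""
    direction = np.array([(u_joint - cx) / fx, (v_joint - cy) / fy, 1.0])
    direction /= np.linalg.norm(direction)
    # The ray passes through the camera centre (origin); the distance is the
    # norm of the component of p orthogonal to the ray direction.
    return np.linalg.norm(p - np.dot(p, direction) * direction)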
and 4, step 4: computation of directed bounding boxes
Computing a directional bounding box requires computing the center point of the box and the direction and size of the three axes. Center P of the box3DIs calculated to obtain the average value of (1). By P3DThe three eigenvectors after Principal Component Analysis (PCA) are taken as three principal axes of the box, and the length of the principal axis is represented by P3DThe length of the projection interval of (a) is determined, and if necessary, each axis can be extended in an appropriate proportion.
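A numpy sketch of this computation: the centre is the mean of P_3D, the axes are the principal components, and the half-length along each axis comes from the projection interval, optionally stretched by a small factor (the padding factor is an assumption):

import numpy as np

def oriented_bounding_box(p3d, pad=1.1):
    """Oriented bounding box of (N, 3) key points P_3D: returns the centre,
    the three axes (rows of `axes`) and the half-lengths along each axis."""
    center = p3d.mean(axis=0)
    centered = p3d - center
    _, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    axes = eigvecs.T                              # each row is one principal axis
    proj = centered @ axes.T                      # projections onto the axes
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    half_lengths = pad * np.maximum(np.abs(lo), np.abs(hi))
    return center, axes, half_lengths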
Step 5: filtering of the depth map
The region of the depth map inside the bounding box has to be extracted; the method is to test, for each pixel of the original depth map, whether it lies inside the box. The pixels of the depth map are converted with the camera parameters into a point cloud of three-dimensional points, each point is tested against the bounding box, and only the pixels whose points lie inside the box are kept. The retained pixels form the segmented hand, and the remaining pixels of the depth map form the non-hand part.
The outward direction of each face of the bounding box is taken as the positive direction, and each face determines the parameters (a, b, c, d) of a plane equation:
ax + by + cz + d = 0
First, the pixel to be tested is converted, together with its depth value, into a point of the real-space coordinate system using the camera parameters; the point is then substituted into the left-hand side of the six equations determined by the six faces, and the signs of the results are compared. If all six results are positive, or all six are negative, the pixel lies inside the bounding box, that is, it is part of the hand.
The speed is accelerated by GPU parallel computation during the implementation of the step.
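A sketch of this filtering step under an assumed pinhole camera model; instead of substituting each point into the six plane equations, this version expresses the point in the box's own coordinate frame, which is an equivalent inside/outside test (intrinsics fx, fy, cx, cy and names are illustrative; box_center, axes, half_lengths are as in the sketch after step 4):

import numpy as np

def segment_hand(depth, fx, fy, cx, cy, box_center, axes, half_lengths):
    """Keep only depth pixels whose back-projected 3D points fall inside
    the oriented bounding box; returns a boolean hand mask."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)    # point cloud
    local = (points - box_center) @ axes.T                  # box coordinates
    inside = np.all(np.abs(local) <= half_lengths, axis=1)
    return (inside & (points[:, 2] > 0)).reshape(h, w)      # ignore missing depth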
Fig. 5 shows the segmentation results of the method. From left to right (columns 1 to 4) are the input depth map, the predicted two-dimensional joint points, the computed three-dimensional oriented bounding box, and the final segmentation result.
The scheme of the invention can be realized by software or hardware, such as:
in one embodiment, a human hand segmentation device based on two-dimensional joint information in a depth image is provided, which includes:
the two-dimensional joint point detection module is responsible for constructing a two-dimensional joint point detection network and obtaining the position of the two-dimensional joint point of the human hand in the depth image by utilizing the two-dimensional joint point detection network;
the key point acquisition module is responsible for acquiring three-dimensional key points of the human hand by utilizing the two-dimensional joint points and combining the depth image;
the bounding box calculation module is responsible for calculating the three-dimensional directed bounding box of the human hand by using the three-dimensional key points;
and the hand segmentation module is responsible for filtering the depth image by utilizing the three-dimensional directed bounding box to obtain a well segmented hand region.
In addition, the apparatus may further include:
and the data preprocessing module is responsible for preprocessing data before being input into the neural network (preprocessing training data of the two-dimensional joint detection network), zooming the original depth map to a standard size, normalizing and acquiring a heat map label.
And the network construction and training module is responsible for constructing a two-dimensional joint detection network and is used for detecting the coordinates of the two-dimensional joint on the image plane.
The two-dimensional joint point detection module, the key point acquisition module, the bounding box calculation module and the hand segmentation module can collectively be referred to as a two-dimensional-joint-point-based hand segmentation module. Together they are responsible for segmenting the hand region, covering joint point detection, mapping from joint points to key points, computation of the three-dimensional oriented bounding box and filtering of the depth map, and finally produce the hand region.
In another embodiment, an electronic device (computer, server, etc.) is provided comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
In another embodiment, a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) is provided, which stores a computer program that, when executed by a computer, implements the steps of the method of the present invention.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A human hand segmentation method based on two-dimensional joint information in a depth image is characterized by comprising the following steps:
acquiring the position of a two-dimensional joint point of a human hand in the depth image by using a two-dimensional joint point detection network;
acquiring three-dimensional key points of the human hand by using the two-dimensional joint points and combining the depth image;
calculating a three-dimensional directional bounding box of the human hand by using the three-dimensional key points;
and filtering the depth image by using the three-dimensional directional bounding box to obtain the well-segmented human hand area.
2. The method of claim 1, wherein the two-dimensional joint point detection network is mainly an hourglass network: global information and deep features are extracted by the convolution and down-sampling of the hourglass network, the required output is decoded by convolution and up-sampling, and skip connections are added so that the decoded features contain both deep semantic information and shallow morphological features.
3. The method of claim 1, wherein the two-dimensional joint detection network, when trained, first preprocesses the training data, including scaling to a standard size, normalizing, and obtaining heat map labels; the two-dimensional joint point detection network takes the preprocessed image as input and obtains the specific positions of the two-dimensional joint points; the output of the two-dimensional joint point detection network is a heat map; the loss function of the two-dimensional joint detection network is as follows:
[Equation image FDA0002465400590000011: loss comparing the predicted heat map H_pred with the ground-truth heat map H_GT; not reproduced in this text extraction.]
wherein H_pred is the heat map output by the network, H_GT is the ground-truth heat map, and H, W, J are respectively the height, width and number of channels of the output picture.
4. The method of claim 3, wherein the output of the two-dimensional joint detection network is a heat map with J channels, each channel corresponding to one class of joint point; each pixel holds a scalar value reflecting the probability of that pixel being the j-th joint point, and the position of the point with the highest probability is taken as the coordinate of the joint point; the two-dimensional joint coordinates (u, v)_j are computed as:
[Equation image FDA0002465400590000012: (u, v)_j obtained from the per-channel arg-max of the heat map, scaled by s; not reproduced in this text extraction.]
wherein u denotes the horizontal coordinate of a pixel on the heat map, v denotes its vertical coordinate, j denotes the channel index, and s is the factor by which the image is reduced after passing through the network.
5. The method of claim 1, wherein acquiring the three-dimensional key points of the human hand using the two-dimensional joint points in combination with the depth image comprises: estimating effective depth values in the regions adjacent to the two-dimensional joint points, and combining the two-dimensional joint points with the effective depth values to complete the conversion from two-dimensional joint points to three-dimensional key points; when calculating the effective depth values, a Gaussian mixture model is used to estimate the distributions of the foreground depth values, the background depth values and the depth values of the entity to be segmented, so as to eliminate the interference of noisy depth values.
6. The method according to claim 1, wherein the principal axis directions of the three-dimensional directional bounding box are obtained by principal component analysis of the three-dimensional key points, and its lengths correspond to the extent of the projections of the three-dimensional key points on the principal axes; when the depth map is filtered with the three-dimensional directed bounding box, each pixel of the original depth map is tested for whether it lies inside the box, and the computation is accelerated by GPU parallel computation.
7. A human hand segmentation device based on two-dimensional joint information in a depth image is characterized by comprising:
the two-dimensional joint point detection module is responsible for constructing a two-dimensional joint point detection network and obtaining the position of the two-dimensional joint point of the human hand in the depth image by utilizing the two-dimensional joint point detection network;
the key point acquisition module is responsible for acquiring three-dimensional key points of the human hand by utilizing the two-dimensional joint points and combining the depth image;
the bounding box calculation module is responsible for calculating the three-dimensional directed bounding box of the human hand by using the three-dimensional key points;
and the hand segmentation module is responsible for filtering the depth image by utilizing the three-dimensional directed bounding box to obtain a well segmented hand region.
8. The apparatus of claim 7, further comprising:
the data preprocessing module is responsible for preprocessing the training data of the two-dimensional joint detection network, zooming the original depth map to a standard size, normalizing and acquiring a heat map label;
and the network construction and training module is responsible for constructing and training the two-dimensional joint detection network and is used for detecting the coordinates of the two-dimensional joint on the image plane.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 6.
CN202010332317.0A 2020-04-24 2020-04-24 Human hand segmentation method and device based on two-dimensional joint information in depth image Active CN111709269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010332317.0A CN111709269B (en) 2020-04-24 2020-04-24 Human hand segmentation method and device based on two-dimensional joint information in depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010332317.0A CN111709269B (en) 2020-04-24 2020-04-24 Human hand segmentation method and device based on two-dimensional joint information in depth image

Publications (2)

Publication Number Publication Date
CN111709269A true CN111709269A (en) 2020-09-25
CN111709269B CN111709269B (en) 2022-11-15

Family

ID=72536830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010332317.0A Active CN111709269B (en) 2020-04-24 2020-04-24 Human hand segmentation method and device based on two-dimensional joint information in depth image

Country Status (1)

Country Link
CN (1) CN111709269B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529911A (en) * 2020-12-07 2021-03-19 重庆大学 Training method of pancreas image segmentation model, image segmentation method and device
CN113379755A (en) * 2021-04-09 2021-09-10 南京航空航天大学 3D point cloud object example segmentation method in disordered scene based on graph

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN109214282A (en) * 2018-08-01 2019-01-15 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109523552A (en) * 2018-10-24 2019-03-26 青岛智能产业技术研究院 Three-dimension object detection method based on cone point cloud
CN110222580A (en) * 2019-05-09 2019-09-10 中国科学院软件研究所 A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud
CN110443205A (en) * 2019-08-07 2019-11-12 北京华捷艾米科技有限公司 A kind of hand images dividing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN109214282A (en) * 2018-08-01 2019-01-15 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109523552A (en) * 2018-10-24 2019-03-26 青岛智能产业技术研究院 Three-dimension object detection method based on cone point cloud
CN110222580A (en) * 2019-05-09 2019-09-10 中国科学院软件研究所 A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud
CN110443205A (en) * 2019-08-07 2019-11-12 北京华捷艾米科技有限公司 A kind of hand images dividing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOMING DENG et al.: "Joint Hand Detection and Rotation Estimation Using CNN", IEEE *
ZHOU XIAOQIN et al.: "Gesture recognition and hand tracking based on Kinect in the Virtools environment", Computer Applications and Software *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529911A (en) * 2020-12-07 2021-03-19 重庆大学 Training method of pancreas image segmentation model, image segmentation method and device
CN112529911B (en) * 2020-12-07 2024-02-09 重庆大学 Pancreatic image segmentation model training method, image segmentation method and device
CN113379755A (en) * 2021-04-09 2021-09-10 南京航空航天大学 3D point cloud object example segmentation method in disordered scene based on graph
CN113379755B (en) * 2021-04-09 2024-03-12 南京航空航天大学 3D point cloud object instance segmentation method in out-of-order scene based on graph

Also Published As

Publication number Publication date
CN111709269B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
US20210232924A1 (en) Method for training smpl parameter prediction model, computer device, and storage medium
CN111126272B (en) Posture acquisition method, and training method and device of key point coordinate positioning model
CN108549873B (en) Three-dimensional face recognition method and three-dimensional face recognition system
CN108764048B (en) Face key point detection method and device
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN109934847B (en) Method and device for estimating posture of weak texture three-dimensional object
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
JP2007164720A (en) Head detecting device, head detecting method, and head detecting program
CN110910437B (en) Depth prediction method for complex indoor scene
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN113643329B (en) Twin attention network-based online update target tracking method and system
Yin et al. Virtual reconstruction method of regional 3D image based on visual transmission effect
CN114627438A (en) Target detection model generation method, target detection method, device and medium
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
CN112037282B (en) Aircraft attitude estimation method and system based on key points and skeleton
CN113808202A (en) Multi-target detection and space positioning method and system thereof
CN114608522A (en) Vision-based obstacle identification and distance measurement method
JP2023512359A (en) Associated object detection method and apparatus
CN112634331A (en) Optical flow prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant