CN109977834B - Method and device for segmenting human hand and interactive object from depth image - Google Patents

Method and device for segmenting human hand and interactive object from depth image

Info

Publication number
CN109977834B
Authority
CN
China
Prior art keywords
depth image
segmentation
human hand
data set
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910207311.8A
Other languages
Chinese (zh)
Other versions
CN109977834A (en)
Inventor
徐枫
薄子豪
雍俊海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910207311.8A priority Critical patent/CN109977834B/en
Publication of CN109977834A publication Critical patent/CN109977834A/en
Application granted granted Critical
Publication of CN109977834B publication Critical patent/CN109977834B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method and a device for segmenting a human hand and an interactive object from a depth image. The method includes the following steps: constructing a depth-image-based human hand segmentation data set using a color-image-based segmentation method; training a segmentation model on the depth-image-based human hand segmentation data set, the segmentation model consisting of an encoder, an attention transfer model and a decoder; and segmenting a depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is that pixel's type value. Because the segmentation model is trained on a depth-image-based human hand segmentation data set and applied directly to the depth image to be processed, the method achieves pixel-level segmentation of the hand and the object, improves environmental robustness, attains high segmentation accuracy, and can handle hand-object segmentation under complex interaction conditions.

Description

Method and device for segmenting human hand and interactive object from depth image
Technical Field
The application relates to the technical field of computer vision, in particular to a method and a device for segmenting a human hand and an interactive object from a depth image.
Background
Human hand segmentation is a fundamental problem in many research fields such as gesture recognition, human hand tracking and human hand reconstruction. Compared with the motion of a bare hand alone, the study of the hand while it interacts with an object is more important in the fields of human-computer interaction and virtual reality.
In recent years, general semantic segmentation models based on neural networks have become increasingly mature. However, existing models have low environmental robustness and poor segmentation accuracy, and cannot handle human hand segmentation under complex interaction conditions.
Disclosure of Invention
The application provides a method and a device for segmenting a human hand and an interactive object from a depth image, which are used for solving the problems that the existing human hand segmentation model in the prior art is low in environmental robustness, poor in segmentation precision and incapable of processing human hand segmentation under the condition of complex interaction.
An embodiment of an aspect of the present application provides a method for segmenting a human hand and an interactive object from a depth image, including:
constructing a human hand segmentation data set based on a depth image by using a segmentation method based on a color image;
training by using the human hand segmentation data set based on the depth image to obtain a segmentation model, wherein the segmentation model consists of an encoder, an attention transfer model and a decoder;
and segmenting the depth image to be processed by utilizing the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel point in the classification label map is the type value of each pixel point, and the type value is used for representing the type of the pixel point in the depth image to be processed.
According to the method for segmenting a human hand and an interactive object from a depth image of the embodiment of the application, a depth-image-based human hand segmentation data set is constructed using a color-image-based segmentation method, a segmentation model is trained on this data set, and the depth image to be processed is segmented with the segmentation model to obtain the corresponding classification label map. The value of each pixel in the classification label map is that pixel's type value, so the type of each pixel can be determined from its type value. Because the segmentation model is trained on a depth-image-based human hand segmentation data set and applied directly to the depth image to be processed, pixel-level segmentation of the hand and the object is achieved, environmental robustness is improved, segmentation accuracy is high, and hand-object segmentation under complex interaction conditions can be handled.
Another embodiment of the present application provides an apparatus for segmenting a human hand and an interactive object from a depth image, including:
the construction module is used for constructing a human hand segmentation data set based on the depth image by utilizing a segmentation method based on the color image;
the training module is used for training to obtain a segmentation model by utilizing the human hand segmentation data set based on the depth image, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;
the identification module is used for segmenting the depth image to be processed by utilizing the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel point in the classification label map is the type value of each pixel point, and the type value is used for representing the type of the pixel point in the depth image to be processed.
According to the device for segmenting a human hand and an interactive object from a depth image of the embodiment of the application, a depth-image-based human hand segmentation data set is constructed using a color-image-based segmentation method, and a segmentation model consisting of an encoder, an attention transfer model and a decoder is trained on this data set. The depth image to be processed is segmented with the segmentation model to obtain the corresponding classification label map, in which the value of each pixel is that pixel's type value, so the type of each pixel can be determined from its type value. Because the segmentation model is trained on a depth-image-based human hand segmentation data set and applied directly to the depth image to be processed, pixel-level segmentation of the hand and the object is achieved, environmental robustness is improved, segmentation accuracy is high, and hand-object segmentation under complex interaction conditions can be handled.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating a method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a segmentation model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an attention mechanism model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of another method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a training process of a segmentation model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an effect of using contour error according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an apparatus for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a method and an apparatus for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application with reference to the drawings.
Fig. 1 is a flowchart illustrating a method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.
As shown in fig. 1, the method for segmenting a human hand and an interactive object from a depth image includes:
and 101, constructing a human hand segmentation data set based on a depth image by using a segmentation method based on a color image.
Since a depth camera can acquire color images and depth images simultaneously, color and depth images of a human hand interacting with objects can be captured with the depth camera, yielding multiple pairs of color and depth images. The depth images are then labeled based on the color images, thereby obtaining a human hand segmentation data set for depth images.
In order to improve segmentation accuracy, in this embodiment, objects whose colors differ markedly from the color of human skin may be captured under a fixed light source with constant brightness and color temperature. For example, an image of a hand holding a blue pen is captured under the same brightness and light source.
And 102, training to obtain a segmentation model by utilizing a human hand segmentation data set based on the depth image.
After a human hand segmentation data set based on a depth image is obtained, an initial neural network model is trained by using the data set, and a segmentation model meeting requirements is obtained.
In the training process, the prediction performance of the segmentation model can be measured by using a loss function.
In this embodiment, the segmentation model is composed of an encoder, an attention transfer model and a decoder. The encoder uses a large convolutional network, and the decoder uses deconvolution layers to recover the high-level information back to the image pixel scale.
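As a concrete illustration, the following is a minimal PyTorch sketch of a convolutional encoder paired with a deconvolution decoder of the kind described above; the layer counts, channel widths and class count of three are illustrative assumptions, not the exact architecture of this embodiment.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Toy encoder-decoder: convolutions shrink the depth image, deconvolutions
    (transposed convolutions) recover the pixel scale and emit per-class logits."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, depth):                      # depth: (B, 1, H, W)
        return self.decoder(self.encoder(depth))   # logits: (B, num_classes, H, W)
```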
Fig. 2 is a schematic structural diagram of a segmentation model according to an embodiment of the present application. As shown in fig. 2, the segmentation model is composed of an encoder, an attention transfer model, and a decoder. In this embodiment, an attention mechanism is added between the encoder and the decoder, and an attention feature map is constructed by fusing multi-scale image features for enhancing the same-layer connection between the encoder and the decoder, so that the accuracy and the effectiveness of information transmission between the encoder and the decoder can be improved.
Fig. 3 is a schematic structural diagram of an attention mechanism model according to an embodiment of the present disclosure. In fig. 3, the feature maps of layers 1 through i-1 are multiplied together to obtain the low-level attention map (FineAtt); each of layers 1 through i-1 passes through a scaling network (SN), which normalizes the feature map dimensions, and a bilinear down-sampling layer (DS). The feature maps of layers i+1 through n are multiplied together to obtain the high-level attention map (CoarseAtt); each of layers i+1 through n passes through an SN and an up-sampling layer (US). DS and US are used to reduce and enlarge the scale of the feature maps, respectively. The resulting FineAtt and CoarseAtt attention maps are concatenated with the feature map of layer i and input to the decoder. This attention mechanism is applied at every layer scale from 1 to n in fig. 3.
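A sketch of this attention transfer is given below in PyTorch. It assumes that SN is a 1x1 convolution projecting every encoder feature map to a common channel width and that DS/US are bilinear resampling; these are illustrative assumptions, since the exact operators are not fixed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionTransfer(nn.Module):
    """Builds FineAtt from layers 1..i-1 and CoarseAtt from layers i+1..n, then
    concatenates both attention maps with the layer-i feature map for the decoder."""
    def __init__(self, in_channels, i, common_ch=32):
        super().__init__()
        self.i = i
        # SN: one 1x1 projection per encoder layer (assumed normalization form).
        self.sn = nn.ModuleList([nn.Conv2d(c, common_ch, 1) for c in in_channels])

    def forward(self, feats):                     # feats: list of n feature maps
        i, target_hw = self.i, feats[self.i].shape[-2:]

        def align(k):
            f = self.sn[k](feats[k])              # SN: normalize feature dimensions
            return F.interpolate(f, size=target_hw,        # DS (shrink) or US (enlarge)
                                 mode="bilinear", align_corners=False)

        fine = align(0)
        for k in range(1, i):                     # layers 1..i-1 -> FineAtt
            fine = fine * align(k)
        coarse = align(i + 1)
        for k in range(i + 2, len(feats)):        # layers i+1..n -> CoarseAtt
            coarse = coarse * align(k)
        return torch.cat([fine, coarse, feats[i]], dim=1)
```

In this sketch the element-wise products implement the multiplication of the resampled feature maps, and the concatenated result replaces the plain same-layer skip connection between encoder and decoder at layer i.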
And 103, segmenting the depth image to be processed by utilizing the segmentation model, and acquiring a classification label map corresponding to the depth image to be processed.
In this embodiment, the depth image to be processed may be acquired by the depth camera before it is segmented.
After the segmentation model is obtained, the depth image to be processed is input into the trained segmentation model, and the segmentation model outputs the classification label map corresponding to the depth image to be processed. The classification label map has the same size as the depth image to be processed, and the value of each pixel in the classification label map is that pixel's type value, which represents the type to which the pixel belongs in the depth image to be processed. In addition, pixel coordinates are implicit in the arrangement of the image pixels, and the value of each pixel in the input depth image is its depth value.
The types of the pixel points in the depth image to be processed can include human hands, objects and backgrounds. In specific implementation, three types of human hands, objects and backgrounds can be represented by different type values. For example, 0 represents a background, 1 represents a human hand, and 2 represents an object.
In this embodiment, according to the type value of each pixel point and the type corresponding to the type value, the segmentation result of the human hand and the object in the depth image to be processed can be obtained, and the segmentation of the human hand and the interactive object is realized.
As shown in fig. 2, the depth image to be processed is input into the deep network model: it first passes through the encoder, then through the attention transfer model, and finally through the decoder, which outputs the classification label map of the depth image to be processed. The positions of the human hand and the object are obtained from the type value of each pixel, so the human hand and the object are segmented.
In the embodiment of the application, the pixels belonging to the human hand and the pixels belonging to the object can be determined from the type value of each pixel in the classification label map output by the segmentation model and the type corresponding to that value. The interacting hand and object in the image to be processed are thereby segmented at the pixel level with high accuracy, and a hand and an object interacting in complex situations can be segmented.
In one embodiment of the present application, a depth image-based human hand segmentation training dataset may be constructed from color images. Fig. 4 is a flowchart illustrating another method for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.
As shown in fig. 4, the method for constructing a depth image-based human hand segmentation data set includes:
step 301, acquiring multiple pairs of color images and depth images under the scene of interaction between a human hand and an object.
In this embodiment, several objects whose colors differ from the color of human skin may be collected in advance. Then, a depth camera is used to capture images of a human hand interacting with each object, obtaining multiple pairs of color and depth images. In addition, to increase the amount of data, images of different interaction postures between the hand and the same object can be acquired.
When capturing images with a depth camera, the lighting environment is fixed, e.g. using fixed light sources of the same brightness and color temperature, to ensure that the captured color images are clear and shadow-free.
Step 302, performing object segmentation based on HSV color space on all color images, and obtaining a type value of each pixel point in each color image.
In this embodiment, all backgrounds in the color image and the depth image may first be removed with a depth threshold, retaining only the human hand and the object. Then, all acquired color images are converted into the HSV color space using the standard conversion formula from the RGB color space to the HSV color space. The parameters of the HSV color space are hue (H), saturation (S) and lightness (V).
Next, segmentation is performed in the HSV color space corresponding to each color image to obtain the type value of each pixel in each color image. Specifically, the distributions in HSV space of the pixels of multiple bare-hand samples and interaction samples are analyzed; the region where the samples overlap corresponds to human hand pixels, and several linear constraints are fitted to it. All color images are then analyzed: pixels lying inside the constraints are labeled as hand, and pixels lying outside the constraints are labeled as object.
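A minimal OpenCV/NumPy sketch of this labeling step is shown below. The depth threshold and the HSV constraints are illustrative placeholders, since the actual linear constraints are fitted to the collected hand samples.

```python
import cv2
import numpy as np

def label_color_image(color_bgr, depth, depth_thresh=800):
    """Returns a per-pixel label map: 0 = background, 1 = hand, 2 = object."""
    hsv = cv2.cvtColor(color_bgr, cv2.COLOR_BGR2HSV)        # RGB -> HSV conversion
    h, s, v = cv2.split(hsv)

    labels = np.zeros(depth.shape, dtype=np.uint8)          # 0 = background
    foreground = (depth > 0) & (depth < depth_thresh)       # depth-threshold background removal

    # Hypothetical linear constraints fitted to hand-skin pixels in HSV space.
    skin = (h < 25) & (s > 40) & (v > 60)

    labels[foreground & skin] = 1                            # hand
    labels[foreground & ~skin] = 2                           # interacting object
    return labels
```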
And 303, aiming at each pair of color images and depth images, mapping each pixel point in the color images to the corresponding pixel point in the depth images, and constructing a human hand segmentation training data set based on the depth images.
For each pair of color and depth images, the color image and the depth image are pixel-aligned: the intrinsic and extrinsic camera parameters of the depth sensor and the color sensor are estimated, and the depth point cloud is transformed by an affine transformation into the color camera space. The automatic labeling method based on the color image then generates a real classification label map, which is also the real classification label map of the depth image corresponding to that color image. In the real classification label map, the type value of each pixel may be represented with 0 for background, 1 for hand and 2 for object.
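The following sketch illustrates the alignment step: the depth map is back-projected with the depth camera intrinsics, transformed with the depth-to-color extrinsics, and re-projected with the color camera intrinsics. The intrinsic matrices K_d and K_c and the extrinsics R, t are placeholders assumed to come from a separate calibration.

```python
import numpy as np

def align_depth_to_color(depth, K_d, K_c, R, t):
    """depth: (H, W) depth map. Returns, for every valid depth pixel, its
    projected pixel coordinates in the color image (used to transfer labels)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.astype(np.float64)
    valid = z > 0

    # Back-project into the depth camera's 3D space (pinhole model).
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x[valid], y[valid], z[valid]], axis=0)   # (3, N) point cloud

    # Affine transform into the color camera space, then project with K_c.
    pts_c = R @ pts + t.reshape(3, 1)
    u_c = K_c[0, 0] * pts_c[0] / pts_c[2] + K_c[0, 2]
    v_c = K_c[1, 1] * pts_c[1] / pts_c[2] + K_c[1, 2]
    return u_c, v_c, pts_c[2]
```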
In this embodiment, all the depth images and the real classification label maps thereof constitute a human hand segmentation training data set based on the depth images.
Further, to improve segmentation accuracy, in an embodiment of the present application, the depth image may be preprocessed before the mapping: it is denoised using morphological operations and contour filtering, the background in the depth image is analyzed and removed, and only the human hand and the objects interacting with it are retained.
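A small OpenCV sketch of this preprocessing is given below: speckle noise is removed by morphological opening, and contour filtering keeps only the largest connected region (the hand and the interacting object). The kernel size and the largest-contour heuristic are illustrative assumptions.

```python
import cv2
import numpy as np

def denoise_depth(depth):
    """Morphological and contour-based cleanup of a raw depth map."""
    mask = (depth > 0).astype(np.uint8)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove small speckles

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    clean = np.zeros_like(mask)
    if contours:
        largest = max(contours, key=cv2.contourArea)         # keep the hand-object region
        cv2.drawContours(clean, [largest], -1, 1, thickness=-1)
    return depth * clean
```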
After the data set used for training the segmentation model is obtained, when the model is trained, the human hand segmentation training data set based on the depth images can be divided into a training data set and a test data set, wherein the number of the depth images in the training data set is far larger than that of the depth images in the test data set, the training data set is used for training, and the test data set is used for testing the trained model.
The initial segmentation model is then trained using the training data set, and a first loss function is calculated. The first loss function adopts the softmax cross-entropy loss, shown in the following formula (1):

loss = -\sum_i y_i \log( e^{x_i} / \sum_j e^{x_j} )    (1)

where y_i denotes the true result, x_i denotes the predicted value output by the segmentation model, and the indices i and j both range over the different types. For example, when the pixels have three types, the loss for type value i = 0 is computed first:

loss_0 = -y_0 \log( e^{x_0} / (e^{x_0} + e^{x_1} + e^{x_2}) )

then the loss for type value i = 1:

loss_1 = -y_1 \log( e^{x_1} / (e^{x_0} + e^{x_1} + e^{x_2}) )

and the loss for type value i = 2:

loss_2 = -y_2 \log( e^{x_2} / (e^{x_0} + e^{x_1} + e^{x_2}) )

The loss of the model is then

loss = loss_0 + loss_1 + loss_2

The first loss function may also be any other loss function capable of realizing the segmentation task.
Specifically, a depth image in the training data set is input into the initial neural network model, and the network outputs a predicted classification label map for that depth image. Then, according to the difference between the predicted classification label map and the real classification label map of the depth image, a gradient descent algorithm feeds the gradient back to all parameters in the network, and the network parameters are updated accordingly. When a depth image is input the next time, the predicted classification label map output by the network is closer to the real classification label map.
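A minimal sketch of one such training step is shown below, using PyTorch's CrossEntropyLoss (which applies the per-pixel softmax cross entropy of formula (1)) and stochastic gradient descent; the model refers to the illustrative encoder-decoder sketch above, and the learning rate and optimizer choice are assumptions.

```python
import torch
import torch.nn as nn

model = EncoderDecoder(num_classes=3)              # illustrative model from the sketch above
criterion = nn.CrossEntropyLoss()                  # softmax cross entropy, formula (1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(depth_batch, label_batch):
    # depth_batch: (B, 1, H, W) float depth values; label_batch: (B, H, W) long with values 0/1/2.
    logits = model(depth_batch)                    # predicted classification label map (logits)
    loss = criterion(logits, label_batch)          # difference from the real label map
    optimizer.zero_grad()
    loss.backward()                                # feed the gradient back to all parameters
    optimizer.step()                               # update the network parameters
    return loss.item()
```

Each call to train_step performs one update, so the predicted label map moves closer to the real label map over repeated iterations.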
When the value of the first loss function no longer drops, i.e., the model has been optimized as far as possible under the first loss function, training continues using the contour error as the loss function. The contour error is shown in the following formula (2):

E_contour = || B(S(M_logits)) - B(S(M_labels)) ||    (2)

where B is a blurring operation, for example Gaussian blurring with a 5x5 Gaussian kernel and σ = 2.121; S is contour extraction, for example using the Sobel operator; M_labels is the real classification label map, and M_logits is the predicted pixel-type map output by the network.
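The sketch below computes this contour error for a pair of label maps, under the assumption that formula (2) is the norm of the difference between the blurred Sobel contours of the predicted and real maps; the kernel size and sigma follow the values given above.

```python
import cv2
import numpy as np

def contour_error(m_logits, m_labels):
    """m_logits, m_labels: (H, W) arrays of pixel-type values (e.g. 0/1/2)."""
    def blurred_contour(m):
        m = m.astype(np.float32)
        gx = cv2.Sobel(m, cv2.CV_32F, 1, 0)                  # S: Sobel contour extraction
        gy = cv2.Sobel(m, cv2.CV_32F, 0, 1)
        edges = cv2.magnitude(gx, gy)
        return cv2.GaussianBlur(edges, (5, 5), 2.121)        # B: 5x5 Gaussian, sigma = 2.121
    diff = blurred_contour(m_logits) - blurred_contour(m_labels)
    return float(np.linalg.norm(diff))
```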
When the value of the contour error stabilizes and no longer decreases, training can be stopped to obtain the segmentation model. The segmentation model is then tested on the test set: the depth images in the test set are input into the segmentation model, the Intersection-over-Union (IoU) scores of all depth images in the test set are computed, and the IoU scores are used to judge whether the segmentation model meets the requirements.
The IoU is the ratio of an intersection to a union; in this embodiment, it is the ratio of the intersection of the model prediction and the real result to their union.
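A short sketch of this IoU computation is given below; averaging over the hand and object classes (1 and 2, as defined above) is an illustrative choice.

```python
import numpy as np

def mean_iou(pred, gt, classes=(1, 2)):
    """Mean Intersection-over-Union of the predicted and real label maps."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()     # intersection for class c
        union = np.logical_or(pred == c, gt == c).sum()      # union for class c
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```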
Fig. 5 is a schematic diagram of a training process of a segmentation model according to an embodiment of the present application. In fig. 5, the left side is a schematic diagram of a data construction process, and the right side is a schematic diagram of a model training process. When data is constructed, a color image acquired by a depth camera is aligned with a depth image, and an automatic labeling method is used for generating a real classification label map based on the color image, which is also a real classification label of the aligned corresponding depth image. All depth images and their true classification label images constitute a training data set of human hand segmentation based on the depth images.
During model training, a depth image from the data set is input into the attention segmentation network to obtain the classification label map predicted by the network model, the predicted map is compared with the real classification label map, the loss is calculated, and the network parameters are updated iteratively step by step.
Fig. 6 is a schematic diagram illustrating the effect of using the contour error according to an embodiment of the present application. In fig. 6, the left column shows the real labels of the object and the hand, the middle column shows the network output without the contour error, and the right column shows the network output after the contour error is used.
In the embodiment of the application, when the segmentation model is trained, the general loss function is used first; when the value of the general loss function stabilizes, i.e., the model is optimal under that loss function, training continues with the contour error as the loss function. Together with the attention mechanism model added to the segmentation model, this greatly improves the segmentation accuracy of the model.
Further, in order to enhance the generalization ability of the segmentation model, before the segmentation model is trained by using the training data set, a data augmentation operation may be performed on the training data set, and the depth image obtained by the data augmentation operation may be added to the training data set.
Wherein the data augmentation operation comprises at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.
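The sketch below applies the three augmentation operations listed above to a depth image and its label map; the rotation range and noise level are illustrative assumptions.

```python
import random
import cv2
import numpy as np

def augment(depth, labels):
    """Free rotation, random noise and random flipping of a depth/label pair."""
    h, w = depth.shape
    angle = random.uniform(-180.0, 180.0)                    # free rotation
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    depth = cv2.warpAffine(depth, M, (w, h), flags=cv2.INTER_NEAREST)
    labels = cv2.warpAffine(labels, M, (w, h), flags=cv2.INTER_NEAREST)

    depth = depth + np.random.normal(0.0, 2.0, depth.shape)  # random noise on depth values

    if random.random() < 0.5:                                # random flip
        depth, labels = np.fliplr(depth).copy(), np.fliplr(labels).copy()
    return depth, labels
```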
In order to implement the above embodiments, the present application further provides an apparatus for segmenting a human hand and an interactive object from a depth image. Fig. 7 is a schematic structural diagram of an apparatus for segmenting a human hand and an interactive object from a depth image according to an embodiment of the present application.
As shown in fig. 7, the apparatus for segmenting a human hand and an interactive object from a depth image comprises: a construction module 610, a training module 620, and an identification module 630.
A construction module 610 for constructing a depth image-based human hand segmentation dataset using a color image-based segmentation method;
a training module 620, configured to train, by using the human hand segmentation data set based on the depth image, to obtain a segmentation model, where the segmentation model is composed of an encoder, an attention transfer model, and a decoder;
the identifying module 630 is configured to utilize the segmentation model to segment the depth image to be processed, and obtain a classification label map corresponding to the depth image to be processed, where a value of each pixel in the classification label map is a type value of each pixel, and the type value is used to represent a type to which the pixel belongs in the depth image to be processed.
In a possible implementation manner of the embodiment of the present application, the building module 610 is specifically configured to:
collecting a plurality of pairs of color images and depth images under the condition of interaction between hands and objects;
carrying out object segmentation based on HSV color space on all color images to obtain the type value of each pixel point in each color image;
and aiming at each pair of color images and depth images, mapping each pixel point in the color images to the corresponding pixel point in the depth images, and constructing a human hand segmentation training data set based on the depth images.
In a possible implementation manner of the embodiment of the present application, the depth image is preprocessed, including noise and background removal.
In a possible implementation manner of the embodiment of the present application, the human-hand segmentation data set based on the depth image includes a training data set and a testing data set, and the training module 620 is specifically configured to:
training the initial neural network model by utilizing a training data set, and calculating a first loss function, wherein the first loss function adopts a softmax cross entropy loss function;
when the value of the first loss function no longer drops, training continues using the contour error as a loss function.
In a possible implementation manner of the embodiment of the present application, the apparatus further includes:
and the processing module is used for carrying out data augmentation operation on the training data set, wherein the data augmentation operation comprises at least one of freely rotating the depth image, adding random noise and randomly turning over the depth image.
It should be noted that the above explanation of the embodiment of the method for segmenting the human hand and the interactive object from the depth image is also applicable to the apparatus for segmenting the human hand and the interactive object from the depth image in this embodiment, and therefore, the explanation thereof is omitted here.
According to the device for segmenting a human hand and an interactive object from a depth image of the embodiment of the application, a depth-image-based human hand segmentation data set is constructed using a color-image-based segmentation method, and a segmentation model consisting of an encoder, an attention transfer model and a decoder is trained on this data set. The depth image to be processed is segmented with the segmentation model to obtain the corresponding classification label map, in which the value of each pixel is that pixel's type value, so the type of each pixel can be determined from its type value. Because the segmentation model is trained on a depth-image-based human hand segmentation data set and applied directly to the depth image to be processed, pixel-level segmentation of the hand and the object is achieved, environmental robustness is improved, segmentation accuracy is high, and hand-object segmentation under complex interaction conditions can be handled.

Claims (10)

1. A method of segmenting a human hand and an interactive object from a depth image, comprising:
constructing a human hand segmentation data set based on a depth image by using a segmentation method based on a color image;
training an initial neural network model by using the human hand segmentation data set based on the depth image to obtain a segmentation model, wherein the segmentation model consists of an encoder, an attention transfer model and a decoder;
and segmenting the depth image to be processed by utilizing the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel point in the classification label map is the type value of each pixel point, and the type value is used for representing the type of the pixel point in the depth image to be processed.
2. The method of claim 1, wherein constructing a depth image-based human hand segmentation dataset using a color image-based segmentation method comprises:
acquiring a plurality of pairs of color images and depth images under the condition of interaction between hands and objects;
carrying out object segmentation based on HSV color space on all color images to obtain the type value of each pixel point in each color image;
and for each pair of color images and depth images, mapping each pixel point in the color images to corresponding pixel points in the depth images, and constructing a human hand segmentation training data set based on the depth images.
3. The method of claim 2, wherein mapping each pixel point in the color image to a corresponding pixel point in the depth image further comprises:
and preprocessing the depth image, including noise and background removal.
4. The method of claim 2, wherein the depth image-based human hand segmentation data set comprises a training data set and a test data set, and wherein training a segmentation model using the depth image-based human hand segmentation data set comprises:
training an initial neural network model by using the training data set, and calculating a first loss function, wherein the first loss function adopts a softmax cross entropy loss function;
when the value of the first loss function no longer drops, training continues using the contour error as a loss function.
5. The method of claim 4, wherein prior to training the segmentation model using the training data set, further comprising:
and carrying out data augmentation operation on the training data set, wherein the data augmentation operation comprises at least one of free rotation of the depth image, addition of random noise and random inversion of the depth image.
6. An apparatus for segmenting a human hand and an interactive object from a depth image, comprising:
the construction module is used for constructing a human hand segmentation data set based on the depth image by utilizing a segmentation method based on the color image;
the training module is used for training an initial neural network model by utilizing the human hand segmentation data set based on the depth image to obtain a segmentation model, and the segmentation model is composed of an encoder, an attention transfer model and a decoder;
the identification module is used for segmenting the depth image to be processed by utilizing the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel point in the classification label map is the type value of each pixel point, and the type value is used for representing the type of the pixel point in the depth image to be processed.
7. The apparatus of claim 6, wherein the build module is specifically configured to:
acquiring a plurality of pairs of color images and depth images under the condition of interaction between hands and objects;
carrying out object segmentation based on HSV color space on all color images to obtain the type value of each pixel point in each color image;
and for each pair of color images and depth images, mapping each pixel point in the color images to corresponding pixel points in the depth images, and constructing a human hand segmentation training data set based on the depth images.
8. The apparatus of claim 7, further comprising:
and the preprocessing module is used for preprocessing the depth image, including noise and background removal.
9. The apparatus of claim 7, wherein the depth image-based human hand segmentation dataset comprises a training dataset and a testing dataset, and wherein the training module is specifically configured to:
training an initial neural network model by using the training data set, and calculating a first loss function, wherein the first loss function adopts a softmax cross entropy loss function;
when the value of the first loss function no longer drops, training continues using the contour error as a loss function.
10. The apparatus of claim 9, further comprising:
and the processing module is used for carrying out data augmentation operation on the training data set, wherein the data augmentation operation comprises at least one of freely rotating the depth image, adding random noise and randomly turning over the depth image.
CN201910207311.8A 2019-03-19 2019-03-19 Method and device for segmenting human hand and interactive object from depth image Active CN109977834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207311.8A CN109977834B (en) 2019-03-19 2019-03-19 Method and device for segmenting human hand and interactive object from depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910207311.8A CN109977834B (en) 2019-03-19 2019-03-19 Method and device for segmenting human hand and interactive object from depth image

Publications (2)

Publication Number Publication Date
CN109977834A CN109977834A (en) 2019-07-05
CN109977834B true CN109977834B (en) 2021-04-06

Family

ID=67079395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207311.8A Active CN109977834B (en) 2019-03-19 2019-03-19 Method and device for segmenting human hand and interactive object from depth image

Country Status (1)

Country Link
CN (1) CN109977834B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127535B (en) * 2019-11-22 2023-06-20 北京华捷艾米科技有限公司 Method and device for processing hand depth image
CN111568197A (en) * 2020-02-28 2020-08-25 佛山市云米电器科技有限公司 Intelligent detection method, system and storage medium
CN112396137B (en) * 2020-12-14 2023-12-15 南京信息工程大学 Point cloud semantic segmentation method integrating context semantics
CN113158774B (en) * 2021-03-05 2023-12-29 北京华捷艾米科技有限公司 Hand segmentation method, device, storage medium and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729326B (en) * 2017-09-25 2020-12-25 沈阳航空航天大学 Multi-BiRNN coding-based neural machine translation method
CN108647214B (en) * 2018-03-29 2020-06-30 中国科学院自动化研究所 Decoding method based on deep neural network translation model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469446A (en) * 2015-08-21 2017-03-01 小米科技有限责任公司 The dividing method of depth image and segmenting device
WO2017116879A1 (en) * 2015-12-31 2017-07-06 Microsoft Technology Licensing, Llc Recognition of hand poses by classification using discrete values
CN108898142A (en) * 2018-06-15 2018-11-27 宁波云江互联网科技有限公司 A kind of recognition methods and calculating equipment of handwritten formula
CN109272513A (en) * 2018-09-30 2019-01-25 清华大学 Hand and object interactive segmentation method and device based on depth camera
CN109448006A (en) * 2018-11-01 2019-03-08 江西理工大学 A kind of U-shaped intensive connection Segmentation Method of Retinal Blood Vessels of attention mechanism

Also Published As

Publication number Publication date
CN109977834A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant