CN111985535A

CN111985535A - Method and device for optimizing human body depth map through neural network

Info

Publication number: CN111985535A
Application number: CN202010690167.0A
Authority: CN
Inventors: 朱昊; 叶青云; 郭龙伟; 曹汛
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2020-11-24

Abstract

The invention discloses a method and a device for optimizing a human body depth map through a neural network. The depth neural network adopted by the method takes a rough depth map and a color image as input and takes an accurate depth map as output; in the training stage, the depth map with higher precision and the color picture aligned with the depth map are obtained as training data by carrying out fusion optimization on continuous multi-frame depth maps; and obtaining a neural network model capable of optimizing the rough depth map through training. The device comprises an image sequence and model acquisition module, a model and image sequence alignment module, a human body model extraction module, a network data preprocessing module, a network design module and a network training and prediction module. The depth map predicted by the invention has lower noise and smoother appearance, and has higher precision compared with the input rough depth map.

Description

Method and device for optimizing human body depth map through neural network

Technical Field

The invention relates to the field of computer vision, in particular to a method and a device for optimizing a human body depth map through a neural network.

Background

In recent years, with the development of computer vision and virtual reality technologies, the value of three-dimensional reconstruction (3D reconstruction) is gradually highlighted in a plurality of fields such as movie and television production, virtual reality, game production, modern medical treatment and the like, and how to optimize and obtain a smooth and low-noise three-dimensional model becomes a problem of wide attention.

The proposed algorithms for optimizing the human depth map can be roughly classified into two categories:

the first category of methods enhances the depth data by RGB images, usually making some heuristic assumptions about the correlation between color and depth, and then optimizing the original depth data using the RGB information. In the method, because the texture change is often mistaken for geometric details based on the heuristic assumption between color and depth, texture copy artifacts are unavoidable, and the improvement of quantization precision is limited.

The second method is to fuse multi-frame depth images by a conventional method, and integrate Truncated Signed Distance Function (TSDF) to reduce scanning noise. The method is a method which takes a plurality of frames as input and then optimizes a depth map by using a traditional method so as to obtain a smooth and dense three-dimensional model. However, the conventional method often needs to merge multiple frames to suppress noise, and although many advances have been made in reducing noise and enhancing geometric details, the requirement of real-time performance cannot be met. Meanwhile, the traditional method performs time-domain filtering on the original depth to reduce the noise of the sensor, but simple time-domain filtering can smooth a high-frequency structure, so that unrepairable details are degraded, and a fine three-dimensional appearance cannot be accurately represented.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a device for optimizing a human body depth map through a neural network in order to improve the quality of a reconstructed three-dimensional model.

In order to achieve the purpose of the invention, the method adopts the following technical scheme:

a method for optimizing a human depth map through a neural network comprises the following steps:

s1, acquiring the original data of the rough depth map and the color map of the human body, and reconstructing through a multi-frame depth optimization algorithm to obtain a high-precision human body model containing scene information;

s2, performing coordinate transformation, aligning the reconstructed model, the rough depth map and the color map in a camera coordinate system, and obtaining an accurate depth map through rendering;

s3, removing background information using a mask from the high-precision human body model obtained in step S1, extracting information of a human body part, and obtaining a rough depth map, a color map, and a precise depth map containing only human body information;

s4, preprocessing the rough depth map, the color map and the accurate depth map which only contain the human body information in the step S3, and using the preprocessed rough depth map, the color map and the accurate depth map as marking data to provide input data for the neural network for training;

s5, constructing a neural network optimized for the rough depth map only containing the human body information by using the coding and decoding network;

and S6, training the neural network to obtain a prediction model.

Further, the specific steps of step 2 include: reading in high-precision human body model data containing scene information obtained by reconstruction, converting a reconstructed model from a voxel coordinate system where multi-frame optimization is located into a camera coordinate system, and then obtaining an accurate depth map of the reconstructed model in a new coordinate system through perspective projection.

Further, in the step 2, when the coordinate transformation is performed, the voxel coordinate system takes the center of mass of the foreground part of the collected first frame depth map as an origin and faces the same direction as the camera coordinate system corresponding to the first frame depth map; the transformation between the coordinate systems is achieved by rotating and translating all vertices in the coordinate systems.

Further, the specific step of step S3 is: the intersection of the foreground portions of the coarse and fine depth maps containing scene information is used as a mask, and the pixel values outside the mask in the coarse and fine depth maps are set to 0 to avoid redundant depth information.

Further, in step S4, a square image in the range of the human body is cut out from the rough depth map, the precise depth map, and the color map that only include the human body information, and the image is scaled to a fixed resolution, so that the brightness value of the color map, the depth value of the rough depth map, and the depth value of the precise depth map all fall within the (-1, 1) interval.

Further, in step S5, the neural network is a network with a jump connection layer, and is formed by connecting 5 residual convolution modules and 5 deconvolution layers in series, and finally, a convolution layer with a convolution kernel of 1 is used as an output layer; the inputs of the first 4 residual convolution modules are connected with the outputs of the corresponding convolution layers, that is, the output results of the network coding layers are used as the inputs of the corresponding decoding layers, and the merging is realized by paralleling on the channel dimension.

The invention relates to a device for optimizing a human body depth map through a neural network, which comprises the following components: the image sequence and model acquisition module is used for acquiring the original data of a rough depth map and a color map of a human body and acquiring a reconstructed high-precision human body model through a multi-frame depth optimization algorithm; the model and image sequence alignment module is used for aligning the reconstructed model, the rough depth map and the color map under a camera coordinate system and obtaining an accurate depth map through rendering; the human body model extraction module is used for removing background information in the human body model and keeping information of a human body part; the network data preprocessing module is used for preprocessing the rough depth map, the color map and the accurate depth map; the network design module is used for constructing a neural network optimized aiming at the rough depth map; and the network training and predicting module is used for training the neural network to obtain a prediction model.

Compared with the prior art, the invention has the following remarkable advantages: by adopting a depth learning method and taking a rough depth map, an accurate depth map and a color image as input, the optimized depth map with less noise is generated, the depth smoothness is obviously improved, and the three-dimensional shape of a human body can be more accurately represented.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. The drawings in the following description are only some embodiments of the invention and other drawings may be derived by those skilled in the art without inventive effort.

Fig. 1 is a schematic diagram of an encoding-decoding network incorporating a hopping connection.

Fig. 2 is a comparison graph of (a) the original coarse depth, (b) the target exact depth, and (c) the effect of converting the predicted depth of the inventive network into a patch mesh model.

FIG. 3 is a flow chart of a method for optimizing a human depth map through a neural network according to the present invention.

Fig. 4 is a schematic structural diagram of the device for optimizing the human depth map through the neural network.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following will describe the method of the present invention in further detail with reference to the accompanying drawings.

The embodiment provides a method for optimizing a human body depth map through a neural network, and a flow chart is shown in fig. 3, which specifically includes:

(1) and acquiring rough depth map and color map original data of the human body by using a Kinect V2 depth camera, and reconstructing by using a multi-frame depth optimization algorithm to obtain a high-precision human body model. The multi-frame depth optimization algorithm may employ existing methods, such as newcomb, r.a., Fox, d., & Seitz, S.M. (2015). Dynamicfusion: reconfiguration and tracking of non-linear scenes in Real-time. in Proceedings of the IEEE con on computer vision and paper registration (pp.343-352), or Yu, t., Zheng, z., Guo, k., Zhao, j., Dai, q., Li, h., & Liu, Y. (2018). doubtless fusion: read-e focus of human dynamics front panel in which the IEEE 727287 is adopted.

(2) And aligning the reconstructed model and the depth map and the color map under a camera coordinate system. The method comprises the following specific steps: reading in high-precision model data reconstructed from the accurate depth map, converting the reconstructed model from a voxel coordinate system where the multi-frame optimization is located into a camera coordinate system, enabling the depth value of the human body part in the rough depth map to be closely attached to the accurate depth value of the human body part in the high-precision reconstructed model, and finally obtaining the accurate depth map of the reconstructed model in a new coordinate system through perspective projection.

(3) Human body information is extracted from the human body model. Since the original image taken contains all scene information, a mask is used to remove the background and only the human body part is retained. The method for extracting the human body information comprises the following steps: the intersection of the foreground portions of the coarse and fine depth maps is used as a mask, and the pixel values outside the mask in the coarse and fine depth maps are set to 0 to avoid redundant depth information. So as to obtain a rough depth map, a color map and an accurate depth map containing only human body information.

(4) And (4) preprocessing the rough depth map, the color map and the accurate depth map which only contain the human body information and are obtained in the step (3), and taking the rough depth map, the color map and the accurate depth map as marking data together to provide input data for a network for training. And respectively cutting out a square image of a human body range from the rough depth map, the accurate depth map and the color map, scaling the image to a fixed resolution, and enabling the brightness value of the color image, the depth value of the rough depth map and the depth value of the accurate depth map to fall in a (-1, 1) interval.

The specific steps are that the fourth channel in the color image, namely the transparency information, is abandoned at first.

The images are processed as squares to facilitate the input of the network and better training of the data. The specific operation comprises selecting the center of the mask as a cutting center, cutting a square image of a human body range, and reserving a blank area with the length of 10 pixels at the edge. The cropped image is scaled to 256 × 256 resolution. The same operations described above are performed for the coarse depth map, the fine depth map and the color map. On the sampling method adopted by image scaling, the coarse depth image and the accurate depth image adopt a nearest neighbor sampling method, and the color image adopts a bilinear sampling method. The color image reading is a tensor matrix, and the brightness numerical range is 0-255. And after the depth map is read as a tensor, performing average depth zeroing processing in the depth direction.

After the color image is read into a matrix, the value range is 0-255, the value range of the depth data is 1-3, and the prediction result of the network should be the same as the value range of the input rough depth data and is between 1-3. For training to proceed normally, data are unified in the same value range, the luminance value of the color image is divided by 128, and 1 is subtracted, so that the value falls in the (-1, 1) interval. For the rough depth map and the accurate depth map, subtracting 2 from the depth value of the depth effective area, and enabling the depth value to fall in a (-1, 1) interval.

(5) On the basis of the encoding and decoding network, a neural network optimized for the depth map is constructed, as shown in fig. 1. The neural network of this embodiment has a network with a jump connection layer, the network is formed by connecting 5 residual convolution modules and 5 deconvolution layers in series, and finally, a convolution layer with a convolution kernel of 1 is used as an output layer. The inputs of the first 4 residual convolution modules are connected with the outputs of the corresponding convolution layers, that is, the output results of the network coding layer are used as the inputs of the corresponding decoding layers, and the specific combination method is parallel in channel dimension.

(6) And training the neural network to obtain a prediction model. The training data volume is about 1.8 ten thousand groups, and the data volume is as follows: a scale of 1 randomly divides the input pictures into a training set and a test set. The number of data sets required for a single optimization during training is 32, and 10 times of iterative training of complete data is performed on the network. In the testing stage, a single-frame depth image and a color picture are used as input, Mean Squared Error (MSE) and Structural Similarity (SSIM) values are calculated, and the result is visualized.

The neural network used in the invention has the capability of optimizing the depth map after the training process, and the estimation result of the prediction accuracy is shown in table 1. The target value is an accurate depth map, the predicted value is an optimized depth map obtained through network prediction, and the original value is an input rough depth map.

TABLE 1 MSE and SSIM values for comparison of target values with predicted values and original values during single frame data testing

	MSE (unit: cm)²)	SSIM
			Target value-predicted value	0.11813	0.9963
Target value-original value	0.17841	0.9949

From experimental results, compared with the original rough depth, the MSE value of the predicted value obtained by the network is lower, and the SSIM value is higher, namely, from the numerical angle, the predicted value of the network is closer to the accurate target depth than the original rough depth. From the view of the effect of the patch mesh model, the three graphs in fig. 2 are respectively the original rough depth, the accurate target depth and the predicted depth of the network in the invention from left to right, which are converted into the effect graph of the patch mesh model. The surface patch mesh model of the original rough depth data has a large amount of noise points, and the surface of a character is not smooth enough; the patch grid model for predicting the depth by the network successfully reduces noise points, has smooth surface and can be well attached to the depth of a target true value. The depth error of the network predicted value of the invention on most positions of the human body is less than 2cm, only a small amount of errors exist at the positions with sudden depth change, such as the edge of the human body, and the like, and the error is 10cm at most and is lower than the 16cm depth error of the original rough depth value. Experiments prove that the method effectively improves the precision of the rough depth map.

The embodiment also provides a device for optimizing a human depth map through a neural network, as shown in fig. 4, including: and the image sequence and model acquisition module is used for acquiring the original data of the rough depth map and the color map of the human body and acquiring a high-precision human body model through a multi-frame depth optimization algorithm. And the model and image sequence alignment module is used for carrying out coordinate transformation, aligning the reconstructed model, the rough depth map and the color map under a camera coordinate system, and obtaining the accurate depth map through rendering. And the human body model extraction module is used for removing background information in the human body model and keeping the information of the human body part. And the network data preprocessing module is used for preprocessing the rough depth map, the color image and the accurate depth map and providing input data for the neural network for training. And the network design module is used for constructing a neural network optimized aiming at the depth map. And the network training and predicting module is used for training the neural network to obtain a prediction model.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for optimizing a human body depth map through a neural network is characterized by comprising the following steps:

and S6, training the neural network to obtain a prediction model.

2. The method for optimizing the human body depth map through the neural network as claimed in claim 1, wherein the specific steps of the step 2 comprise: reading in high-precision human body model data containing scene information obtained by reconstruction, converting a reconstructed model from a voxel coordinate system where multi-frame optimization is located into a camera coordinate system, and then obtaining an accurate depth map of the reconstructed model in a new coordinate system through perspective projection.

3. The method for optimizing the human body depth map through the neural network as claimed in claim 2, wherein in the step 2, when the coordinate transformation is performed, the voxel coordinate system takes the centroid of the foreground part of the acquired first frame depth map as the origin and is oriented to be the same as the camera coordinate system corresponding to the first frame depth map; the transformation between the coordinate systems is achieved by rotating and translating all vertices in the coordinate systems.

4. The method for optimizing the human depth map through the neural network as claimed in claim 1, wherein the specific steps of the step S3 are as follows: the intersection of the foreground portions of the coarse and fine depth maps containing scene information is used as a mask, and the pixel values outside the mask in the coarse and fine depth maps are set to 0 to avoid redundant depth information.

5. The method for optimizing the human depth map through the neural network as claimed in claim 1, wherein in the step S4, a square image of the human body range is cut out from the rough depth map, the precise depth map and the color map which only contain the human body information, the image is scaled to a fixed resolution, and the brightness value of the color map, the depth values of the rough depth map and the precise depth map are all in the (-1, 1) interval.

6. The method for optimizing the human depth map through the neural network as claimed in claim 1, wherein in the step S5, the neural network is a network with a jump connection layer, and is formed by connecting 5 residual convolution modules and 5 deconvolution layers in series, and finally, a convolution layer with a convolution kernel of 1 is used as an output layer; the inputs of the first 4 residual convolution modules are connected with the outputs of the corresponding convolution layers, that is, the output results of the network coding layers are used as the inputs of the corresponding decoding layers, and the merging is realized by paralleling on the channel dimension.

7. An apparatus for optimizing a depth map of a human body through a neural network, comprising:

the image sequence and model acquisition module is used for acquiring the original data of a rough depth map and a color map of a human body and acquiring a reconstructed high-precision human body model through a multi-frame depth optimization algorithm;

the model and image sequence alignment module is used for aligning the reconstructed model, the rough depth map and the color map under a camera coordinate system and obtaining an accurate depth map through rendering;

the human body model extraction module is used for removing background information in the human body model and keeping information of a human body part;

the network data preprocessing module is used for preprocessing the rough depth map, the color map and the accurate depth map;

the network design module is used for constructing a neural network optimized aiming at the rough depth map;

and the network training and predicting module is used for training the neural network to obtain a prediction model.