CN115731345A - Human body three-dimensional reconstruction method based on binocular vision - Google Patents

Human body three-dimensional reconstruction method based on binocular vision

Info

Publication number
CN115731345A
CN115731345A (application CN202211426574.6A)
Authority
CN
China
Prior art keywords
map
human body
image
depth
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211426574.6A
Other languages
Chinese (zh)
Inventor
樊养余
厉行
刘洋
何雯清
吕国云
郭哲
王毅
齐敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202211426574.6A
Publication of CN115731345A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the front of a human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map. By performing disparity estimation on the front left and right images through the stereo matching network and then converting the disparity map into a depth map, the invention addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map. By performing supervised learning with the normal map as an intermediate variable, it addresses the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.

Description

Human body three-dimensional reconstruction method based on binocular vision
Technical Field
The invention relates to the technical fields of image processing, computer graphics and three-dimensional reconstruction, and in particular to a three-dimensional virtual human reconstruction method.
Background
Human body three-dimensional reconstruction is an important and challenging research topic with wide application prospects, such as AR/VR, teleconferencing, film production, virtual fitting and virtual games.
In practice, however, several problems remain difficult to solve or yield unsatisfactory results:
(1) Although a high-fidelity virtual human can be obtained with high-end acquisition equipment and a carefully designed capture environment, such scanning equipment is expensive, complex to operate and difficult to popularize;
(2) When a single image is used for human body three-dimensional reconstruction, the lack of prior depth information causes the generated human geometry to lack detail and the texture resolution to be low.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map.
The invention reconstructs a three-dimensional virtual human from left and right images of the front of the human body, for which the acquisition cost is low. Predicting from the left and right images through the stereo matching network produces a more accurate disparity map, and using the depth map converted from the disparity map as a prior effectively improves the performance of the reconstruction method. The back color map and depth map generated by the human body completion network share the silhouette of the front color map and depth map, so the front and back point clouds fuse well into a complete human body three-dimensional model. Meanwhile, the completion network performs supervised learning with the normal map as an intermediate variable; the normal map expresses local depth changes well (such as clothing wrinkles and fine facial lines), so the estimated back depth map retains fine detail.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
In step 1, the training set is obtained as follows:
The three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images. First, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°, with different illumination assigned randomly to different viewing angles.
In step 2, the disparity map is estimated as follows:
A disparity map is estimated by the stereo matching method from the left and right images rendered in step 1. First, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; second, the extracted left and right feature maps are assembled into a 4D cost volume; then, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement;
The 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs (Convolutional Neural Networks) and ASPP (Atrous Spatial Pyramid Pooling), and the feature extraction modules for the left and right images share weights. The cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume;
3D feature matching aggregates features along the spatial and disparity dimensions; the 3D feature matching module consists of a series of hourglass networks and non-local attention modules. To keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin. Finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization.
In the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise. The overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
In step 3, the depth map is calculated as follows:
A depth map is computed from the disparity map obtained in step 2. The binocular vision imaging principle is shown in Fig. 3: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera. The stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is then computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
In step 4, the human body three-dimensional reconstruction network proceeds as follows:
First, a human body UV coordinate map is estimated by the DensePose network; the UV coordinate map reflects the positional relation between the 2D human image and the 3D model. Second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map. Then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud. Finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model, so that the network reconstructs the complete three-dimensional human body from the front color map and depth map.
the schematic network structure of the human body complement network is shown in fig. 5, and first, the front color map, the depth map and the UV coordinate map are respectively sent to the color generator
Figure BDA0003942867110000041
And depth generator
Figure BDA0003942867110000042
Estimating a back color map and a back depth map; next, the estimated back color map is input to a discriminator
Figure BDA0003942867110000043
Training a discriminator to distinguish whether the estimated back color map is true; then, the estimated depth map is converted into a normal phase map, and the normal phase map is sent to a discriminator
Figure BDA0003942867110000044
Training the discriminator to distinguish whether the estimated back-side phase diagram is true; the depth map is converted into a normal phase map and is represented as:
Figure BDA0003942867110000045
wherein N is pre (i) Representing the phase of point i, norm (-) representing the normalization of the input vector, P i The j point is the clockwise direction of the k point relative to the i point and belongs to the neighborhood of the i point;
The loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal. At point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
At point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
At point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
To improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
In step 5, prediction on the real data set proceeds as follows:
Real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction. First, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API, the captured images being shown in fig. 6; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
The method has the advantage that disparity estimation is performed on the front left and right images of the human body through the stereo matching network and the disparity map is then converted into a depth map; this addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map.
The method performs supervised learning with the normal map as an intermediate variable, addressing the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.
Drawings
Fig. 1 is a schematic diagram of the rendering results of the present invention, in which (a) is the left image at -30°, (b) the left image at 0°, and (c) the left image at 30°.
Fig. 2 is a schematic diagram of a network structure of the stereo matching method of the present invention.
Fig. 3 is a schematic view of the binocular vision imaging principle of the present invention.
FIG. 4 is a schematic diagram of generating a complete human body three-dimensional model from a front color map and a depth map according to the present invention.
Fig. 5 is a schematic diagram of a network structure of the human body complement network according to the present invention.
Fig. 6 shows the left and right images captured by the ZED camera of the present invention, in which (a) is the left image and (b) is the right image.
FIG. 7 is a flow chart of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
In order to overcome the defects of the prior art, the invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map.
The invention reconstructs a three-dimensional virtual human from left and right images of the front of the human body, for which the acquisition cost is low. Predicting from the left and right images through the stereo matching network produces a more accurate disparity map, and using the depth map converted from the disparity map as a prior effectively improves the performance of the reconstruction method. The back color map and depth map generated by the human body completion network share the silhouette of the front color map and depth map, so the front and back point clouds fuse well into a complete human body three-dimensional model. Meanwhile, the completion network performs supervised learning with the normal map as an intermediate variable; the normal map expresses local depth changes well (such as clothing wrinkles and fine facial lines), so the estimated back depth map retains fine detail.
The technical scheme of the invention comprises the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
In step 1, the training set is obtained as follows:
The three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images. First, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°. The rendering results are shown in fig. 1, which shows the left images at -30°, 0° and 30°; different views are given random illumination. A minimal sketch of the camera setup follows below.
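For illustration only, the following Python sketch shows one way the left/right camera extrinsics described above could be constructed with numpy. The baseline, viewing distance and all function names are assumptions made for this example, not values fixed by the invention.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the Y axis of a right-handed Cartesian frame."""
    t = np.deg2rad(deg)
    return np.array([[ np.cos(t), 0.0, np.sin(t)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(t), 0.0, np.cos(t)]])

def stereo_poses(yaw_deg, baseline=0.1, dist=2.0):
    """Camera-to-world pose matrices (4x4) for a left/right pair orbiting the model.

    The pair is rotated yaw_deg about the Y axis; each camera sits `dist` in
    front of the model at the origin, offset by +/- baseline/2 along camera X.
    """
    R = rot_y(yaw_deg)
    def pose(x_off):
        E = np.eye(4)
        E[:3, :3] = R
        E[:3, 3] = R @ np.array([x_off, 0.0, dist])  # camera center in world coords
        return E
    return pose(-baseline / 2), pose(+baseline / 2)

# One stereo pair per viewing angle, matching the -30°, 0°, 30° views of Fig. 1.
poses = {yaw: stereo_poses(yaw) for yaw in (-30, 0, 30)}
```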
In step 2, the disparity map is estimated as follows:
A disparity map is estimated by the stereo matching method from the left and right images rendered in step 1; the network structure of the stereo matching method is shown in fig. 2. First, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; the extracted left and right feature maps are then assembled into a 4D cost volume; next, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement;
The 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs (Convolutional Neural Networks) and ASPP (Atrous Spatial Pyramid Pooling), and the feature extraction modules for the left and right images share weights. The cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume, as sketched below.
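As a sketch of the cost-volume construction just described, the variance-based variant below compares left features with right features shifted by each candidate disparity, producing a (B, C, D, H, W) volume. Tensor shapes, the shifting convention and the function name are assumptions for illustration.

```python
import torch

def variance_cost_volume(feat_l, feat_r, max_disp):
    """Variance-based 4D cost volume from left/right feature maps of shape (B, C, H, W)."""
    B, C, H, W = feat_l.shape
    cost = feat_l.new_zeros(B, C, max_disp, H, W)
    for d in range(max_disp):
        fl = feat_l[..., d:]          # left columns where a match at disparity d can exist
        fr = feat_r[..., :W - d]      # right features shifted by d pixels
        mean = (fl + fr) / 2
        # Variance of the two views around their mean, per channel and pixel.
        cost[:, :, d, :, d:] = ((fl - mean) ** 2 + (fr - mean) ** 2) / 2
    return cost
```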
The 3D feature matching module aggregates features along the spatial and disparity dimensions and consists of a series of hourglass networks and non-local attention modules. To keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin. Finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization.
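The Soft Argmin regression mentioned above admits a very compact sketch; here `cost` is assumed to be the aggregated cost of shape (B, D, H, W), with lower values meaning better matches.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Differentiable disparity regression: expectation over a softmax of -cost."""
    prob = F.softmax(-cost, dim=1)                    # per-pixel distribution over D candidates
    d = torch.arange(cost.shape[1], dtype=cost.dtype, device=cost.device)
    return (prob * d.view(1, -1, 1, 1)).sum(dim=1)    # continuous disparity map, shape (B, H, W)
```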
In the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise. The overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
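A sketch of this loss in PyTorch, assuming `d_init` is a list of initial disparity maps, `d_ref` the refined map and `valid` a boolean mask of P_valid; the hyperparameter values are placeholders, not values taken from the invention.

```python
import torch.nn.functional as F

def stereo_loss(d_init, d_ref, d_gt, valid, lambdas=(0.5, 0.7, 1.0), beta=1.3):
    """L = sum_i lambda_i * Loss_i + beta * Loss_r, smooth L1 over valid pixels only."""
    loss_i = sum(w * F.smooth_l1_loss(d[valid], d_gt[valid])
                 for w, d in zip(lambdas, d_init))
    loss_r = F.smooth_l1_loss(d_ref[valid], d_gt[valid])
    return loss_i + beta * loss_r
```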
In step 3, the depth map is calculated as follows:
A depth map is computed from the disparity map obtained in step 2. The binocular vision imaging principle is shown in fig. 3: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera. The stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is then computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
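The conversion z = f × b / d in code, as a minimal numpy sketch; pixels with zero disparity (no match, or a point at infinity) are left at depth 0.

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline):
    """Depth map from a disparity map (in pixels); depth unit follows the baseline unit."""
    depth = np.zeros_like(disp, dtype=np.float64)
    valid = disp > 0
    depth[valid] = focal_px * baseline / disp[valid]
    return depth
```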
In step 4, the human body three-dimensional reconstruction network proceeds as follows:
First, a human body UV coordinate map is estimated by the DensePose network; the UV coordinate map reflects the positional relation between the 2D human image and the 3D model. Second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map. Then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud. Finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model; the generation of the complete human body three-dimensional model from the front color map and depth map is shown in FIG. 4, and a fusion sketch follows below.
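A sketch of the point-cloud fusion step. Orthographic projection is assumed (as set during rendering), so a foreground pixel (u, v) with depth z back-projects directly to (u·s, v·s, z); the pixel scale s and the sign convention for the back view are assumptions for this example.

```python
import numpy as np

def to_colored_cloud(depth, color, mask, scale=1.0, flip_z=False):
    """Back-project a masked depth map to a colored point cloud (orthographic)."""
    v, u = np.nonzero(mask)                          # foreground pixel coordinates
    z = depth[v, u] * (-1.0 if flip_z else 1.0)      # back view faces the opposite way
    pts = np.stack([u * scale, v * scale, z], axis=1)
    return pts, color[v, u]

# Direct accumulation of the two clouds, as described above:
# f_pts, f_rgb = to_colored_cloud(front_depth, front_color, mask)
# b_pts, b_rgb = to_colored_cloud(back_depth, back_color, mask, flip_z=True)
# cloud = np.concatenate([f_pts, b_pts]); rgb = np.concatenate([f_rgb, b_rgb])
```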
The network structure of the human body completion network is shown in fig. 5. First, the front color map, depth map and UV coordinate map are sent to the color generator G_C and the depth generator G_D to estimate the back color map and the back depth map. Next, the estimated back color map is input to the discriminator D_C, which is trained to distinguish whether the estimated back color map is real. Then, the estimated back depth map is converted into a normal map, which is sent to the discriminator D_N, trained to distinguish whether the estimated back normal map is real. The depth map is converted into a normal map as:
N_pre(i) = norm( Σ_{(j,k) ∈ P_i} (V_j − V_i) × (V_k − V_i) )
where N_pre(i) denotes the normal at point i, norm(·) denotes normalization of the input vector, V denotes the 3D point back-projected from the depth map, P_i denotes the set of neighbor pairs of point i, and in each pair the point j lies clockwise of the point k with respect to point i, both belonging to the neighborhood of i;
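A minimal sketch of the depth-to-normal conversion, using only the right and bottom neighbors of each pixel rather than the full clockwise pair set P_i; orthographic back-projection and the pixel scale are assumptions for this example.

```python
import numpy as np

def depth_to_normals(depth, scale=1.0):
    """Estimate a normal map from a depth map via cross products of neighbor offsets."""
    h, w = depth.shape
    # 3D position of every pixel under (assumed) orthographic projection.
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    V = np.stack([xs * scale, ys * scale, depth], axis=-1)
    dx = V[:, 1:, :] - V[:, :-1, :]     # offset to the right neighbor
    dy = V[1:, :, :] - V[:-1, :, :]     # offset to the bottom neighbor
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8   # the norm(.) step above
    return n                            # shape (h-1, w-1, 3)
```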
The loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal. At point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
At point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
At point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
To improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
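The generator-side training objective above can be sketched in PyTorch as follows; the non-saturating BCE form of the adversarial terms, the weight w_gan and the function name are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def completion_loss(c_pre, c_gt, d_pre, d_gt, n_pre, n_gt,
                    logits_dc, logits_dn, w_gan=0.01):
    """Generator loss: L_color + L_depth + L_normal + weighted adversarial terms."""
    l_color = F.l1_loss(c_pre, c_gt)
    l_depth = F.l1_loss(d_pre, d_gt)
    l_normal = F.l1_loss(n_pre, n_gt)
    # The generators try to make the discriminators score their outputs as real.
    l_gan = F.binary_cross_entropy_with_logits(logits_dc, torch.ones_like(logits_dc)) + \
            F.binary_cross_entropy_with_logits(logits_dn, torch.ones_like(logits_dn))
    return l_color + l_depth + l_normal + w_gan * l_gan
```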
In step 5, prediction on the real data set proceeds as follows:
Real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction. First, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API, the captured images being shown in fig. 6; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
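A minimal capture sketch with the ZED Python API (pyzed) installed as described above; the initialization parameters are left at their defaults and are assumptions for this example.

```python
import pyzed.sl as sl

zed = sl.Camera()
init = sl.InitParameters()                      # default resolution and depth mode
if zed.open(init) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("failed to open ZED camera")

left, right = sl.Mat(), sl.Mat()
if zed.grab(sl.RuntimeParameters()) == sl.ERROR_CODE.SUCCESS:
    zed.retrieve_image(left, sl.VIEW.LEFT)      # left sensor image
    zed.retrieve_image(right, sl.VIEW.RIGHT)    # right sensor image
    left_np, right_np = left.get_data(), right.get_data()   # numpy BGRA arrays
zed.close()
```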
According to the invention, disparity estimation is performed on the front left and right images of the human body through the stereo matching network and the disparity map is then converted into a depth map; this addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map.
The invention performs supervised learning with the normal map as an intermediate variable, addressing the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.

Claims (8)

1. A human body three-dimensional reconstruction method based on binocular vision is characterized by comprising the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
2. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 1, the training set is obtained as follows:
the three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images; first, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°, with different illumination assigned randomly to different viewing angles.
3. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 2, the disparity map is estimated as follows:
a disparity map is estimated by the stereo matching method from the left and right images rendered in step 1; first, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; second, the extracted left and right feature maps are assembled into a 4D cost volume; then, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement.
4. The binocular vision based human body three-dimensional reconstruction method according to claim 3, wherein:
the 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs and ASPP, and the feature extraction modules for the left and right images share weights; the cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume;
3D feature matching aggregates features along the spatial and disparity dimensions, and the 3D feature matching module consists of a series of hourglass networks and non-local attention modules; to keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin; finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization;
in the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise; the overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
5. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 3, the depth map is calculated as follows:
a depth map is computed from the disparity map obtained in step 2: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera; the stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
6. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 4, the human body three-dimensional reconstruction network proceeds as follows:
first, a human body UV coordinate map is estimated by the DensePose network, the UV coordinate map reflecting the positional relation between the 2D human image and the 3D model; second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map; then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud; finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model, reconstructing the complete three-dimensional human body from the front color map and depth map.
7. The binocular vision based human body three-dimensional reconstruction method of claim 6, wherein:
in the human body completion network, first, the front color map, depth map and UV coordinate map are sent to the color generator G_C and the depth generator G_D to estimate the back color map and the back depth map; next, the estimated back color map is input to the discriminator D_C, which is trained to distinguish whether the estimated back color map is real; then, the estimated back depth map is converted into a normal map, which is sent to the discriminator D_N, trained to distinguish whether the estimated back normal map is real; the depth map is converted into a normal map as:
N_pre(i) = norm( Σ_{(j,k) ∈ P_i} (V_j − V_i) × (V_k − V_i) )
where N_pre(i) denotes the normal at point i, norm(·) denotes normalization of the input vector, V denotes the 3D point back-projected from the depth map, P_i denotes the set of neighbor pairs of point i, and in each pair the point j lies clockwise of the point k with respect to point i, both belonging to the neighborhood of i;
the loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal; at point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
at point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
at point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
to improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
8. The binocular vision based human body three-dimensional reconstruction method of claim 1, wherein:
in step 5, prediction on the real data set proceeds as follows:
real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction; first, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
CN202211426574.6A (filed 2022-11-15): Human body three-dimensional reconstruction method based on binocular vision. Status: Pending. Publication: CN115731345A.

Priority Applications (1)

Application Number: CN202211426574.6A; Priority Date: 2022-11-15; Filing Date: 2022-11-15; Title: Human body three-dimensional reconstruction method based on binocular vision

Publications (1)

Publication Number: CN115731345A; Publication Date: 2023-03-03

Family ID: 85295736

Country Status (1)

CN: CN115731345A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315152A (en) * 2023-09-27 2023-12-29 杭州一隅千象科技有限公司 Binocular stereoscopic imaging method and binocular stereoscopic imaging system
CN117315152B (en) * 2023-09-27 2024-03-29 杭州一隅千象科技有限公司 Binocular stereoscopic imaging method and binocular stereoscopic imaging system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination