CN115731345A - Human body three-dimensional reconstruction method based on binocular vision - Google Patents

Human body three-dimensional reconstruction method based on binocular vision

Info

Publication number
CN115731345A
CN115731345A (application CN202211426574.6A)
Authority
CN
China
Prior art keywords
map
human body
image
depth
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211426574.6A
Other languages
Chinese (zh)
Inventor
樊养余
厉行
刘洋
何雯清
吕国云
郭哲
王毅
齐敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202211426574.6A
Publication of CN115731345A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the front of a human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map. By performing disparity estimation on the front left and right images through the stereo matching network and then converting the disparity map into a depth map, the invention addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map. By performing supervised learning with the normal map as an intermediate variable, it addresses the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.

Description

Human body three-dimensional reconstruction method based on binocular vision
Technical Field
The invention relates to the technical fields of image processing, computer graphics and three-dimensional reconstruction, and in particular to a three-dimensional virtual human reconstruction method.
Background
Human body three-dimensional reconstruction is an important and challenging research topic with wide application prospects, such as AR/VR, teleconferencing, film production, virtual fitting and virtual games.
In practice, however, several problems remain difficult to solve or yield unsatisfactory results:
(1) Although a high-fidelity virtual human can be obtained with high-end acquisition equipment and a carefully designed capture environment, such scanning equipment is expensive, complex to operate and difficult to popularize;
(2) When a single image is used for human body three-dimensional reconstruction, the lack of prior depth information causes the generated human geometry to lack detail and the texture resolution to be low.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map.
The invention reconstructs a three-dimensional virtual human from left and right images of the front of the human body, for which the acquisition cost is low. Predicting from the left and right images through the stereo matching network produces a more accurate disparity map, and using the depth map converted from the disparity map as a prior effectively improves the performance of the reconstruction method. The back color map and depth map generated by the human body completion network share the silhouette of the front color map and depth map, so the front and back point clouds fuse well into a complete human body three-dimensional model. Meanwhile, the completion network performs supervised learning with the normal map as an intermediate variable; the normal map expresses local depth changes well (such as clothing wrinkles and fine facial lines), so the estimated back depth map retains fine detail.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
In step 1, the training set is obtained as follows:
The three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images. First, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°, with different illumination assigned randomly to different viewing angles.
In step 2, the disparity map is estimated as follows:
A disparity map is estimated by the stereo matching method from the left and right images rendered in step 1. First, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; second, the extracted left and right feature maps are assembled into a 4D cost volume; then, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement;
The 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs (Convolutional Neural Networks) and ASPP (Atrous Spatial Pyramid Pooling), and the feature extraction modules for the left and right images share weights. The cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume;
3D feature matching aggregates features along the spatial and disparity dimensions; the 3D feature matching module consists of a series of hourglass networks and non-local attention modules. To keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin. Finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization.
In the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise. The overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
In step 3, the depth map is calculated as follows:
A depth map is computed from the disparity map obtained in step 2. The binocular vision imaging principle is shown in Fig. 3: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera. The stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is then computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
In step 4, the human body three-dimensional reconstruction network proceeds as follows:
First, a human body UV coordinate map is estimated by the DensePose network; the UV coordinate map reflects the positional relation between the 2D human image and the 3D model. Second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map. Then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud. Finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model, so that the network reconstructs the complete three-dimensional human body from the front color map and depth map.
the schematic network structure of the human body complement network is shown in fig. 5, and first, the front color map, the depth map and the UV coordinate map are respectively sent to the color generator
Figure BDA0003942867110000041
And depth generator
Figure BDA0003942867110000042
Estimating a back color map and a back depth map; next, the estimated back color map is input to a discriminator
Figure BDA0003942867110000043
Training a discriminator to distinguish whether the estimated back color map is true; then, the estimated depth map is converted into a normal phase map, and the normal phase map is sent to a discriminator
Figure BDA0003942867110000044
Training the discriminator to distinguish whether the estimated back-side phase diagram is true; the depth map is converted into a normal phase map and is represented as:
Figure BDA0003942867110000045
wherein N is pre (i) Representing the phase of point i, norm (-) representing the normalization of the input vector, P i The j point is the clockwise direction of the k point relative to the i point and belongs to the neighborhood of the i point;
The loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal. At point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
At point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
At point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
To improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
In step 5, prediction on the real data set proceeds as follows:
Real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction. First, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API, the captured images being shown in fig. 6; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
The method has the advantage that disparity estimation is performed on the front left and right images of the human body through the stereo matching network and the disparity map is then converted into a depth map; this addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map.
The method performs supervised learning with the normal map as an intermediate variable, addressing the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.
Drawings
Fig. 1 is a schematic diagram of the rendering results of the present invention, in which (a) is the left image at -30°, (b) the left image at 0°, and (c) the left image at 30°.
Fig. 2 is a schematic diagram of a network structure of the stereo matching method of the present invention.
Fig. 3 is a schematic view of the binocular vision imaging principle of the present invention.
FIG. 4 is a schematic diagram of generating a complete human body three-dimensional model from a front color map and a depth map according to the present invention.
Fig. 5 is a schematic diagram of a network structure of the human body complement network according to the present invention.
Fig. 6 shows the left and right images captured by the ZED camera of the present invention, in which (a) is the left image and (b) is the right image.
FIG. 7 is a flow chart of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
In order to overcome the defects of the prior art, the invention provides a human body three-dimensional reconstruction method based on binocular vision. First, disparity estimation is performed on the left and right images of the human body by a stereo matching method to generate a front disparity map; second, a front depth map is computed from the binocular vision imaging principle; finally, the back color map and depth map are estimated by a human body completion network, and a complete three-dimensional model is generated by combining them with the front color map and depth map.
The invention reconstructs a three-dimensional virtual human from left and right images of the front of the human body, for which the acquisition cost is low. Predicting from the left and right images through the stereo matching network produces a more accurate disparity map, and using the depth map converted from the disparity map as a prior effectively improves the performance of the reconstruction method. The back color map and depth map generated by the human body completion network share the silhouette of the front color map and depth map, so the front and back point clouds fuse well into a complete human body three-dimensional model. Meanwhile, the completion network performs supervised learning with the normal map as an intermediate variable; the normal map expresses local depth changes well (such as clothing wrinkles and fine facial lines), so the estimated back depth map retains fine detail.
The technical scheme of the invention comprises the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
In step 1, the training set is obtained as follows:
The three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images. First, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°. The rendering results are shown in fig. 1, which shows the left images at -30°, 0° and 30°; different views are given random illumination. A minimal sketch of the camera setup follows below.
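For illustration only, the following Python sketch shows one way the left/right camera extrinsics described above could be constructed with numpy. The baseline, viewing distance and all function names are assumptions made for this example, not values fixed by the invention.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the Y axis of a right-handed Cartesian frame."""
    t = np.deg2rad(deg)
    return np.array([[ np.cos(t), 0.0, np.sin(t)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(t), 0.0, np.cos(t)]])

def stereo_poses(yaw_deg, baseline=0.1, dist=2.0):
    """Camera-to-world pose matrices (4x4) for a left/right pair orbiting the model.

    The pair is rotated yaw_deg about the Y axis; each camera sits `dist` in
    front of the model at the origin, offset by +/- baseline/2 along camera X.
    """
    R = rot_y(yaw_deg)
    def pose(x_off):
        E = np.eye(4)
        E[:3, :3] = R
        E[:3, 3] = R @ np.array([x_off, 0.0, dist])  # camera center in world coords
        return E
    return pose(-baseline / 2), pose(+baseline / 2)

# One stereo pair per viewing angle, matching the -30°, 0°, 30° views of Fig. 1.
poses = {yaw: stereo_poses(yaw) for yaw in (-30, 0, 30)}
```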
In step 2, the disparity map is estimated as follows:
A disparity map is estimated by the stereo matching method from the left and right images rendered in step 1; the network structure of the stereo matching method is shown in fig. 2. First, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; the extracted left and right feature maps are then assembled into a 4D cost volume; next, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement;
The 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs (Convolutional Neural Networks) and ASPP (Atrous Spatial Pyramid Pooling), and the feature extraction modules for the left and right images share weights. The cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume, as sketched below.
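As a sketch of the cost-volume construction just described, the variance-based variant below compares left features with right features shifted by each candidate disparity, producing a (B, C, D, H, W) volume. Tensor shapes, the shifting convention and the function name are assumptions for illustration.

```python
import torch

def variance_cost_volume(feat_l, feat_r, max_disp):
    """Variance-based 4D cost volume from left/right feature maps of shape (B, C, H, W)."""
    B, C, H, W = feat_l.shape
    cost = feat_l.new_zeros(B, C, max_disp, H, W)
    for d in range(max_disp):
        fl = feat_l[..., d:]          # left columns where a match at disparity d can exist
        fr = feat_r[..., :W - d]      # right features shifted by d pixels
        mean = (fl + fr) / 2
        # Variance of the two views around their mean, per channel and pixel.
        cost[:, :, d, :, d:] = ((fl - mean) ** 2 + (fr - mean) ** 2) / 2
    return cost
```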
The 3D feature matching module aggregates features along the spatial and disparity dimensions and consists of a series of hourglass networks and non-local attention modules. To keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin. Finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization.
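The Soft Argmin regression mentioned above admits a very compact sketch; here `cost` is assumed to be the aggregated cost of shape (B, D, H, W), with lower values meaning better matches.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Differentiable disparity regression: expectation over a softmax of -cost."""
    prob = F.softmax(-cost, dim=1)                    # per-pixel distribution over D candidates
    d = torch.arange(cost.shape[1], dtype=cost.dtype, device=cost.device)
    return (prob * d.view(1, -1, 1, 1)).sum(dim=1)    # continuous disparity map, shape (B, H, W)
```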
In the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise. The overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
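A sketch of this loss in PyTorch, assuming `d_init` is a list of initial disparity maps, `d_ref` the refined map and `valid` a boolean mask of P_valid; the hyperparameter values are placeholders, not values taken from the invention.

```python
import torch.nn.functional as F

def stereo_loss(d_init, d_ref, d_gt, valid, lambdas=(0.5, 0.7, 1.0), beta=1.3):
    """L = sum_i lambda_i * Loss_i + beta * Loss_r, smooth L1 over valid pixels only."""
    loss_i = sum(w * F.smooth_l1_loss(d[valid], d_gt[valid])
                 for w, d in zip(lambdas, d_init))
    loss_r = F.smooth_l1_loss(d_ref[valid], d_gt[valid])
    return loss_i + beta * loss_r
```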
In step 3, the depth map is calculated as follows:
A depth map is computed from the disparity map obtained in step 2. The binocular vision imaging principle is shown in fig. 3: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera. The stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is then computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
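The conversion z = f × b / d in code, as a minimal numpy sketch; pixels with zero disparity (no match, or a point at infinity) are left at depth 0.

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline):
    """Depth map from a disparity map (in pixels); depth unit follows the baseline unit."""
    depth = np.zeros_like(disp, dtype=np.float64)
    valid = disp > 0
    depth[valid] = focal_px * baseline / disp[valid]
    return depth
```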
In step 4, the human body three-dimensional reconstruction network proceeds as follows:
First, a human body UV coordinate map is estimated by the DensePose network; the UV coordinate map reflects the positional relation between the 2D human image and the 3D model. Second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map. Then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud. Finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model; the generation of the complete human body three-dimensional model from the front color map and depth map is shown in FIG. 4, and a fusion sketch follows below.
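A sketch of the point-cloud fusion step. Orthographic projection is assumed (as set during rendering), so a foreground pixel (u, v) with depth z back-projects directly to (u·s, v·s, z); the pixel scale s and the sign convention for the back view are assumptions for this example.

```python
import numpy as np

def to_colored_cloud(depth, color, mask, scale=1.0, flip_z=False):
    """Back-project a masked depth map to a colored point cloud (orthographic)."""
    v, u = np.nonzero(mask)                          # foreground pixel coordinates
    z = depth[v, u] * (-1.0 if flip_z else 1.0)      # back view faces the opposite way
    pts = np.stack([u * scale, v * scale, z], axis=1)
    return pts, color[v, u]

# Direct accumulation of the two clouds, as described above:
# f_pts, f_rgb = to_colored_cloud(front_depth, front_color, mask)
# b_pts, b_rgb = to_colored_cloud(back_depth, back_color, mask, flip_z=True)
# cloud = np.concatenate([f_pts, b_pts]); rgb = np.concatenate([f_rgb, b_rgb])
```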
The network structure of the human body completion network is shown in fig. 5. First, the front color map, depth map and UV coordinate map are sent to the color generator G_C and the depth generator G_D to estimate the back color map and the back depth map. Next, the estimated back color map is input to the discriminator D_C, which is trained to distinguish whether the estimated back color map is real. Then, the estimated back depth map is converted into a normal map, which is sent to the discriminator D_N, trained to distinguish whether the estimated back normal map is real. The depth map is converted into a normal map as:
N_pre(i) = norm( Σ_{(j,k) ∈ P_i} (V_j − V_i) × (V_k − V_i) )
where N_pre(i) denotes the normal at point i, norm(·) denotes normalization of the input vector, V denotes the 3D point back-projected from the depth map, P_i denotes the set of neighbor pairs of point i, and in each pair the point j lies clockwise of the point k with respect to point i, both belonging to the neighborhood of i;
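A minimal sketch of the depth-to-normal conversion, using only the right and bottom neighbors of each pixel rather than the full clockwise pair set P_i; orthographic back-projection and the pixel scale are assumptions for this example.

```python
import numpy as np

def depth_to_normals(depth, scale=1.0):
    """Estimate a normal map from a depth map via cross products of neighbor offsets."""
    h, w = depth.shape
    # 3D position of every pixel under (assumed) orthographic projection.
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    V = np.stack([xs * scale, ys * scale, depth], axis=-1)
    dx = V[:, 1:, :] - V[:, :-1, :]     # offset to the right neighbor
    dy = V[1:, :, :] - V[:-1, :, :]     # offset to the bottom neighbor
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8   # the norm(.) step above
    return n                            # shape (h-1, w-1, 3)
```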
The loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal. At point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
At point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
At point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
To improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
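The generator-side training objective above can be sketched in PyTorch as follows; the non-saturating BCE form of the adversarial terms, the weight w_gan and the function name are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def completion_loss(c_pre, c_gt, d_pre, d_gt, n_pre, n_gt,
                    logits_dc, logits_dn, w_gan=0.01):
    """Generator loss: L_color + L_depth + L_normal + weighted adversarial terms."""
    l_color = F.l1_loss(c_pre, c_gt)
    l_depth = F.l1_loss(d_pre, d_gt)
    l_normal = F.l1_loss(n_pre, n_gt)
    # The generators try to make the discriminators score their outputs as real.
    l_gan = F.binary_cross_entropy_with_logits(logits_dc, torch.ones_like(logits_dc)) + \
            F.binary_cross_entropy_with_logits(logits_dn, torch.ones_like(logits_dn))
    return l_color + l_depth + l_normal + w_gan * l_gan
```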
In step 5, prediction on the real data set proceeds as follows:
Real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction. First, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API, the captured images being shown in fig. 6; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
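A minimal capture sketch with the ZED Python API (pyzed) installed as described above; the initialization parameters are left at their defaults and are assumptions for this example.

```python
import pyzed.sl as sl

zed = sl.Camera()
init = sl.InitParameters()                      # default resolution and depth mode
if zed.open(init) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("failed to open ZED camera")

left, right = sl.Mat(), sl.Mat()
if zed.grab(sl.RuntimeParameters()) == sl.ERROR_CODE.SUCCESS:
    zed.retrieve_image(left, sl.VIEW.LEFT)      # left sensor image
    zed.retrieve_image(right, sl.VIEW.RIGHT)    # right sensor image
    left_np, right_np = left.get_data(), right.get_data()   # numpy BGRA arrays
zed.close()
```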
According to the invention, disparity estimation is performed on the front left and right images of the human body through the stereo matching network and the disparity map is then converted into a depth map; this addresses the poor accuracy of depth estimation from human body images, so that the network can predict an accurate and realistic front human depth map.
The invention performs supervised learning with the normal map as an intermediate variable, addressing the difficulty the human body completion network has in capturing local geometric changes, so that the back depth map estimated by the network retains fine geometric detail.

Claims (8)

1. A human body three-dimensional reconstruction method based on binocular vision is characterized by comprising the following steps:
step 1, rendering a human body three-dimensional model by the OpenGL three-dimensional rendering technique to generate left images, right images and depth maps as the training set;
step 2, estimating a disparity map by a stereo matching method from the left and right images generated in step 1;
step 3, calculating a depth map from the disparity map obtained in step 2 using the binocular vision imaging principle;
step 4, obtaining a UV coordinate map through DensePose from the left image generated in step 1, then sending the left image from step 1, the depth map obtained in step 3 and this UV coordinate map into a human body three-dimensional reconstruction network to predict the human body three-dimensional model;
and step 5, acquiring left and right images of a real human body with a ZED binocular camera to generate a three-dimensional model of the real human body.
2. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 1, the training set is obtained as follows:
the three-dimensional model is rendered by the OpenGL three-dimensional rendering technique to generate the left and right images; first, the three-dimensional model to be rendered is read; second, the model is placed at the center of the field of view, the intrinsic and extrinsic parameter matrices of the left and right cameras are set, and the cameras are rotated about the Y axis of a right-handed Cartesian coordinate system within ±30°; third, the direction, position, color and type of the illumination are set; then, the projection mode is set to orthographic projection; finally, the three-dimensional information is converted into two-dimensional images, rendering left, right and depth images at -30°, 0° and 30°, with different illumination assigned randomly to different viewing angles.
3. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 2, the disparity map is estimated as follows:
a disparity map is estimated by the stereo matching method from the left and right images rendered in step 1; first, the left and right images are each sent to one of two weight-sharing 2D feature extraction modules to extract their respective feature maps; second, the extracted left and right feature maps are assembled into a 4D cost volume; then, 3D feature matching is performed on the cost volume; finally, the disparity map is predicted through disparity regression and geometric refinement.
4. The binocular vision based human body three-dimensional reconstruction method according to claim 3, wherein:
the 2D feature extraction module extracts features of the 2D image at multiple scales; it consists of 2D CNNs and ASPP, and the feature extraction modules for the left and right images share weights; the cost volume reflects the correlation between pixels of the left image and the corresponding right image, and consists of a correlation-based cost volume and a variance-based cost volume;
3D feature matching aggregates features along the spatial and disparity dimensions, and the 3D feature matching module consists of a series of hourglass networks and non-local attention modules; to keep the estimated disparity differentiable so that the network can be trained by back-propagation, a continuous disparity map is regressed using Soft Argmin; finally, a geometric refinement module improves the accuracy of the initial disparity map by sending the initial disparity map, the left image and semantic features to a CNN for optimization;
in the stereo matching method, the losses of the initial disparity maps and the refined disparity map are computed simultaneously, and the whole network is trained end-to-end with the smooth L1 loss, which is robust at disparity discontinuities and insensitive to outliers and noise; the overall loss function is expressed as:
L = Σ_i λ_i · Loss_i + β · Loss_r
where λ_i and β are hyperparameters, Loss_i denotes the loss of the i-th initial disparity map produced by the output module, and Loss_r denotes the loss of the refined disparity map after the geometric refinement module;
Loss_i = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_i(p))
Loss_r = (1/N) Σ_{p ∈ P_valid} SmoothL1(d(p) − d_r(p))
where P_valid denotes the valid region, p ∈ P_valid, d(p) denotes the ground-truth disparity at pixel p, d_i(p) the disparity estimated by the initial disparity map at pixel p, d_r(p) the disparity estimated by the refined disparity map at pixel p, and N the number of pixels in the valid region.
5. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 3, the depth map is calculated as follows:
a depth map is computed from the disparity map obtained in step 2: a point P in space appears as pixel p_L(x_L, y) in the image taken by the left camera and as pixel p_R(x_R, y) in the image taken by the right camera; the stereo matching network yields the disparity d = |x_L − x_R| of point P between the left and right images, and the depth map is computed by the formula z = f × b / d, where z denotes depth, f the camera focal length, b the baseline and d the estimated disparity.
6. The binocular vision based human body three-dimensional reconstruction method according to claim 1, wherein:
in step 4, the human body three-dimensional reconstruction network proceeds as follows:
first, a human body UV coordinate map is estimated by the DensePose network, the UV coordinate map reflecting the positional relation between the 2D human image and the 3D model; second, the left image from step 1, the depth map from step 3 and the UV coordinate map are sent into the human body completion network to estimate the back color map and back depth map; then, the front color map and depth map are converted into a colored front point cloud, and the back color map and depth map into a colored back point cloud; finally, the front and back point clouds are fused by direct accumulation to generate a textured human body 3D model, reconstructing the complete three-dimensional human body from the front color map and depth map.
7. The binocular vision based human body three-dimensional reconstruction method of claim 6, wherein:
in the human body completion network, first, the front color map, depth map and UV coordinate map are sent to the color generator G_C and the depth generator G_D to estimate the back color map and the back depth map; next, the estimated back color map is input to the discriminator D_C, which is trained to distinguish whether the estimated back color map is real; then, the estimated back depth map is converted into a normal map, which is sent to the discriminator D_N, trained to distinguish whether the estimated back normal map is real; the depth map is converted into a normal map as:
N_pre(i) = norm( Σ_{(j,k) ∈ P_i} (V_j − V_i) × (V_k − V_i) )
where N_pre(i) denotes the normal at point i, norm(·) denotes normalization of the input vector, V denotes the 3D point back-projected from the depth map, P_i denotes the set of neighbor pairs of point i, and in each pair the point j lies clockwise of the point k with respect to point i, both belonging to the neighborhood of i;
the loss function of the human body completion network consists of L_color, L_gan, L_depth and L_normal; at point i, the loss L_color between the estimated back color map C_pre(i) and the ground truth C_gt(i) is defined as:
L_color(i) = ||C_pre(i) − C_gt(i)||
at point i, the loss L_depth between the estimated back depth map D_pre(i) and the ground truth D_gt(i) is defined as:
L_depth(i) = ||D_pre(i) − D_gt(i)||
at point i, the loss L_normal between the estimated back normal map N_pre(i) and the ground truth N_gt(i) is defined as:
L_normal(i) = ||N_pre(i) − N_gt(i)||
to improve the accuracy of the predicted back color map and back depth map, adversarial loss functions L_gan are introduced:
L_gan^C = E[log D_C(C_gt)] + E[log(1 − D_C(C_pre))]
L_gan^N = E[log D_N(N_gt)] + E[log(1 − D_N(N_pre))]
where the color generator G_C strives to minimize the objective function L_gan^C while the discriminator D_C strives to maximize it, i.e. G_C* = arg min_{G_C} max_{D_C} L_gan^C; similarly, the depth generator G_D strives to minimize the objective function L_gan^N while the discriminator D_N strives to maximize it, i.e. G_D* = arg min_{G_D} max_{D_N} L_gan^N.
8. The binocular vision based human body three-dimensional reconstruction method of claim 1, wherein:
in step 5, prediction on the real data set proceeds as follows:
real left and right images of the human body, captured respectively by the left and right cameras of a ZED binocular camera, are used for prediction; first, the ZED SDK and its Python API pyzed are installed; second, the left and right human body images are read and saved through the ZED API; finally, the captured left and right images are input into the binocular-vision-based human body three-dimensional reconstruction network to generate the human body 3D model.
CN202211426574.6A (filed 2022-11-15): Human body three-dimensional reconstruction method based on binocular vision. Status: Pending. Publication: CN115731345A.

Priority Applications (1)

Application Number: CN202211426574.6A; Priority Date: 2022-11-15; Filing Date: 2022-11-15; Title: Human body three-dimensional reconstruction method based on binocular vision

Publications (1)

Publication Number: CN115731345A; Publication Date: 2023-03-03

Family ID: 85295736

Country Status (1)

CN: CN115731345A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315152A (en) * 2023-09-27 2023-12-29 杭州一隅千象科技有限公司 Binocular stereoscopic imaging method and binocular stereoscopic imaging system
CN117315152B (en) * 2023-09-27 2024-03-29 杭州一隅千象科技有限公司 Binocular stereoscopic imaging method and binocular stereoscopic imaging system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination