CN115330929A - Multi-view three-dimensional reconstruction method and device - Google Patents

Multi-view three-dimensional reconstruction method and device

Info

Publication number
CN115330929A
Authority
CN
China
Prior art keywords
map
depth
depth map
feature
value
Prior art date
Legal status
Pending
Application number
CN202210325207.0A
Other languages
Chinese (zh)
Inventor
庞大为
王江安
Current Assignee
Tudou Data Technology Group Co ltd
Original Assignee
Tudou Data Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Tudou Data Technology Group Co ltd filed Critical Tudou Data Technology Group Co ltd
Priority to CN202210325207.0A priority Critical patent/CN115330929A/en
Publication of CN115330929A publication Critical patent/CN115330929A/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a multi-view three-dimensional reconstruction method and device, comprising the following steps: processing a plurality of input images with multiple layers of ordinary convolution to obtain, for each input image, a plurality of first feature maps of different scales; processing the first feature maps of different scales with a plurality of deformable convolutions that do not share parameters and, after bilinear interpolation, obtaining a plurality of second feature maps of different scales for each input image; splicing the second feature maps of each input image to obtain an output feature map of each input image; constructing a cost body; regularizing the cost body to obtain a probability body; determining an initial depth map of a reference image; optimizing the initial depth map to obtain a depth optimization map; and fusing the depth optimization maps to generate a dense point cloud. The technical scheme provided by the application overcomes the technical problem that features of weakly textured surfaces are difficult to extract, so that weakly textured regions are also reconstructed well, and GPU memory consumption is reduced while accurate generation of the depth map is ensured.

Description

Multi-view three-dimensional reconstruction method and device
Technical Field
The application relates to the technical field of remote sensing, surveying and mapping, and geographic information, and in particular to a multi-view three-dimensional reconstruction method and device.
Background
Three-dimensional reconstruction describes a real scene, from single-view or multi-view images, as a mathematical model suitable for computer representation through depth data acquisition, preprocessing, point cloud registration and fusion, surface generation and other processes. It is widely applied in virtual reality, autonomous driving, game development, architectural design, clinical medicine and other fields.
Conventional multi-view stereo matching reconstruction methods use manually designed similarity metrics and photometric consistency to estimate depth maps and generate dense 3D point clouds. Although these methods reconstruct well in ideal Lambertian scenarios, they share some common limitations. For example, weakly textured, highlight and specular-reflection regions make dense matching difficult, resulting in incomplete reconstruction results. To overcome this limitation, deep learning has been introduced in recent years to improve stereo reconstruction methods. Deep-learning-based approaches achieve higher accuracy and completeness than traditional approaches on many MVS (Multi-view Stereo) benchmarks.
However, current deep-learning-based three-dimensional reconstruction methods still have problems. In particular, two-dimensional convolutional neural networks operating on a regular pixel grid have difficulty extracting features of weakly textured surfaces, which affects the completeness of three-dimensional reconstruction.
Disclosure of Invention
The embodiments of the application provide a multi-view three-dimensional reconstruction method and device, solving the technical problems in existing three-dimensional reconstruction methods that features of weakly textured surfaces are difficult to extract, which affects the completeness of three-dimensional reconstruction.
In a first aspect, an embodiment of the present application provides a multi-view three-dimensional reconstruction method, where the method includes:
processing a plurality of input images by using a plurality of layers of common convolution to obtain a plurality of first feature maps with different scales of each input image; the input images comprise a reference image, and the rest are source images;
processing a plurality of first feature maps with different scales by using a plurality of deformable convolutions which do not share parameters, and obtaining a plurality of second feature maps with different scales of each input image after bilinear interpolation; wherein the deformable convolution is defined as follows:
f(p) = Σ_k w_k · f(p + p_k + Δp_k) · Δm_k;
wherein f(p) represents the feature value of pixel p, w_k and p_k respectively represent the convolution kernel parameters and the fixed offsets defined in said ordinary convolution, and Δp_k and Δm_k respectively represent the offsets and weights of the deformable convolution generated by learning;
splicing a plurality of second feature maps of each input image to obtain an output feature map of each input image;
constructing a cost body according to the output feature map and the camera parameters of each input image;
regularizing the cost body to obtain a probability body;
determining an initial depth map of the reference image according to the probability volume;
optimizing the initial depth map to obtain a depth optimization map;
and fusing a plurality of depth optimization maps to generate dense point cloud.
With reference to the first aspect, in a possible implementation manner, the regularizing the cost body to obtain a probability body includes:
regularizing the cost body by using a hierarchical recursive convolutional network module to obtain a plurality of regularized cost graphs; the hierarchical recursive convolutional network module comprises a plurality of LSTMConvCells which are sequentially arranged, and a pooling layer and a deconvolution layer which are arranged between two LSTMConvCells;
and generating the corresponding probability body according to the plurality of cost graphs.
With reference to the first aspect, in a possible implementation manner, the determining an initial depth map of a reference image according to the probability volume includes:
and determining an initial depth map of the reference image by taking a depth expectation value as a depth estimation value of each pixel along the depth direction of the probability body.
With reference to the first aspect, in a possible implementation manner, the optimizing the initial depth map includes:
projecting pixel points on the reference image to the corresponding position of each source image through the initial depth map, and calculating the difference value between the value of the output characteristic map of the reference image and the value of the output characteristic map of the source image;
minimizing the difference value using the Gauss-Newton method;
calculating residual errors, wherein the calculation formula of the residual errors is as follows:
r_i(p) = F_i(p_i′) − F_0(p);
wherein F_i(p_i′) is the value of the output feature map of the source image and F_0(p) is the value of the output feature map of the reference image;
calculating the first derivative J_i(p) of each of the residuals with respect to the initial depth map, and determining the increment δ of the current depth according to the following formula:
δ = −(J^T J)^{-1} J^T r;
wherein J is the stack of the matrices {J_i(p)} and r is the stack of the residual vectors {r_i(p)};
and adding the increment of the current depth and the value of the initial depth map to obtain the optimized depth map.
With reference to the first aspect, in a possible implementation manner, the fusing the plurality of depth optimization maps to generate a dense point cloud includes:
fusing the dynamic matching consistency of all the depth optimization maps to obtain global dynamic multi-view geometric consistency; the dynamic matching consistency is defined as follows:
c_i(p) = e^{−(ε_p + λ·ε_d)};
the global dynamic multi-view geometric consistency is defined as follows:
C(p) = Σ_i c_i(p);
wherein ε_p and ε_d respectively represent the pixel reprojection error and the depth reprojection error, and λ represents a coefficient weighting the two different reprojection errors;
outliers are filtered using a preset filter coefficient.
With reference to the first aspect, in a possible implementation manner, a cross entropy loss function is used between the probability volume and a one-hot coded volume of a real depth map, where the cross entropy loss function is defined as:
Loss_0 = Σ_{x∈x_valid} Σ_i −G(i, x) · log P(i, x);
wherein x_valid represents the set of valid pixels, G(i, x) represents the one-hot encoding of the real depth map at depth i of pixel x, and P(i, x) represents a pixel in the probability volume;
when optimizing the initial depth map, taking the distance from the real depth map to the optimized depth map as a loss, namely:
Loss_r = Σ_{x∈x_valid} |d(x) − d_r(x)|;
wherein d(x) represents the pixel depth value of the real depth map and d_r(x) represents the pixel depth value of the optimized depth map;
the loss function when the method is trained is expressed as follows:
Loss = Loss_0 + λ · Loss_r;
wherein λ determines whether the method enables the depth map optimization module.
With reference to the first aspect, in a possible implementation manner, the constructing a cost body according to the output feature map and the camera parameters of each input image includes:
constructing cones at equal depth intervals over the reference image by a plane sweep method, with the principal optical axis of the reference image as the sweep direction;
projecting the output feature map of each source image onto each depth plane to form a feature body according to the differentiable homography, and enabling each projection to be the same in size by using an interpolation method;
and determining the cost body corresponding to the reference image based on variance by using a plurality of feature bodies corresponding to the reference image.
In a second aspect, an embodiment of the present application provides a multi-view three-dimensional reconstruction apparatus, including:
the first characteristic module is used for processing a plurality of input images by using multilayer common convolution to obtain a plurality of first characteristic maps with different scales of each input image; the input images comprise a reference image, and the rest are source images;
the second feature module is used for processing a plurality of first feature maps with different scales by using a plurality of deformable convolutions which do not share parameters, and obtaining a plurality of second feature maps with different scales of each input image after bilinear interpolation; wherein the deformable convolution is defined as follows:
f(p) = Σ_k w_k · f(p + p_k + Δp_k) · Δm_k;
wherein f(p) represents the feature value of pixel p, w_k and p_k respectively represent the convolution kernel parameters and the fixed offsets defined in said ordinary convolution, and Δp_k and Δm_k respectively represent the offsets and weights of the deformable convolution generated by learning;
the output characteristic module is used for splicing the second characteristic maps of the input images to obtain an output characteristic map of each input image;
the cost body module is used for constructing a cost body according to the output feature map and the camera parameters of each input image;
the probability body module is used for regularizing the cost body to obtain a probability body;
an initial depth module, configured to determine an initial depth map of the reference image according to the probability volume;
the depth optimization module is used for optimizing the initial depth map to obtain a depth optimization map;
and the fusion module is used for fusing a plurality of depth optimization maps to generate dense point cloud.
With reference to the second aspect, in a possible implementation manner, the probability body module is specifically configured to: regularizing the cost body by using a hierarchical recursive convolutional network module to obtain a plurality of regularized cost graphs; the hierarchical recursive convolutional network module comprises a plurality of LSTMConvCells which are sequentially arranged, and a pooling layer and a deconvolution layer which are arranged between two LSTMConvCells;
and generating the corresponding probability body according to the plurality of cost graphs.
With reference to the second aspect, in a possible implementation manner, the initial depth module is specifically configured to: and determining an initial depth map of the reference image by taking a depth expectation value as a depth estimation value of each pixel along the depth direction of the probability body.
With reference to the second aspect, in a possible implementation manner, the depth optimization module is specifically configured to: projecting pixel points on the reference image to the corresponding position of each source image through the initial depth map, and calculating the difference value between the value of the output characteristic map of the reference image and the value of the output characteristic map of the source image;
minimizing the difference value using the Gauss-Newton method;
calculating residual errors, wherein the calculation formula of the residual errors is as follows:
r_i(p) = F_i(p_i′) − F_0(p);
wherein F_i(p_i′) is the value of the output feature map of the source image and F_0(p) is the value of the output feature map of the reference image;
calculating the first derivative J_i(p) of each of the residuals with respect to the initial depth map, and determining the increment δ of the current depth according to the following formula:
δ = −(J^T J)^{-1} J^T r;
wherein J is the stack of the matrices {J_i(p)} and r is the stack of the residual vectors {r_i(p)};
and adding the increment of the current depth and the value of the initial depth map to obtain the optimized depth map.
With reference to the second aspect, in a possible implementation manner, the fusion module is specifically configured to:
fusing the dynamic matching consistency of all the depth optimization maps to obtain global dynamic multi-view geometric consistency; the dynamic matching consistency is defined as follows:
c_i(p) = e^{−(ε_p + λ·ε_d)};
the global dynamic multi-view geometric consistency is defined as follows:
C(p) = Σ_i c_i(p);
wherein ε_p and ε_d respectively represent the pixel reprojection error and the depth reprojection error, and λ represents a coefficient weighting the two different reprojection errors;
outliers are filtered using a preset filter coefficient.
With reference to the second aspect, in one possible implementation manner, a cross-entropy loss function is used between the probability volume and the one-hot coded volume of the real depth map, where the cross-entropy loss function is defined as:
Loss_0 = Σ_{x∈x_valid} Σ_i −G(i, x) · log P(i, x);
wherein x_valid represents the set of valid pixels, G(i, x) represents the one-hot encoding of the real depth map at depth i of pixel x, and P(i, x) represents a pixel in the probability volume;
in the optimizing the initial depth map, taking the distance of the real depth map to the optimized depth map as a penalty, namely:
Loss_r = Σ_{x∈x_valid} |d(x) − d_r(x)|;
wherein d(x) represents the pixel depth value of the real depth map and d_r(x) represents the pixel depth value of the optimized depth map;
the loss function when the method is trained is expressed as follows:
Loss = Loss_0 + λ · Loss_r;
wherein λ determines whether the method enables the depth map optimization module.
With reference to the second aspect, in a possible implementation manner, the cost body module is specifically configured to:
constructing cones at equal depth intervals over the reference image by a plane sweep method, with the principal optical axis of the reference image as the sweep direction;
projecting the output feature map of each source image onto each depth plane to form a feature body according to the differentiable homography, and enabling each projection to be the same in size by using an interpolation method;
and determining the cost body corresponding to the reference image based on the variance by using a plurality of feature bodies corresponding to the reference image.
In a third aspect, an embodiment of the present application provides a multi-view three-dimensional reconstruction apparatus, where the apparatus includes:
a memory for non-transitory storage of computer readable instructions;
a processor, configured to execute the computer-readable instructions, and when executed by the processor, the computer-readable instructions implement the multi-view three-dimensional reconstruction method according to the first aspect and the various possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where computer-readable instructions are stored, and when executed by a processor, the computer-readable instructions implement the multi-view three-dimensional reconstruction method according to the first aspect and the various possible implementations of the first aspect.
The technical scheme provided by the embodiment of the invention at least has the following technical effects or advantages:
the embodiment of the invention provides a multi-view three-dimensional reconstruction method, which adopts multilayer common convolution and a plurality of deformable convolutions without shared parameters to process an input image at a characteristic diagram obtaining stage. After multiple input images are processed by multilayer common convolution, multiple first feature maps with different scales are obtained. After a plurality of first feature maps with different scales are processed by a plurality of deformable convolutions without shared parameters, a plurality of second feature maps with different scales are obtained through bilinear interpolation, and an output feature map is obtained after the plurality of second feature maps are spliced, so that the multi-view three-dimensional reconstruction method realizes multi-scale image information acquisition, overcomes the technical problem of difficulty in extracting weak texture surface features, and enables weak texture areas to have a better reconstruction effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments of the present invention or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic diagram of generating a second feature map according to an embodiment of the present application;
fig. 2 is a flowchart of a multi-view three-dimensional reconstruction method according to an embodiment of the present application;
FIG. 3 is a flow chart of generating a probability volume according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a hierarchical recursive convolutional network module according to an embodiment of the present application;
fig. 5 is a flowchart for obtaining an optimized depth map according to an embodiment of the present application;
FIG. 6 is a flow chart of a fusion optimization depth map provided by an embodiment of the present application;
FIG. 7 is a flowchart for obtaining a cost body according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a multi-view three-dimensional reconstruction apparatus provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a multi-view three-dimensional reconstruction apparatus provided in an embodiment of the present application;
FIG. 10 is a comparison of a depth optimization map and an MVSNet depth map provided by an embodiment of the present application;
fig. 11A, 11B, and 11C are graphs showing the comparison between the reconstruction results provided by the embodiment of the present application and the R-MVSNet reconstruction results in different scenarios.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present application provides a multi-view three-dimensional reconstruction method, which is shown in fig. 2 and includes S201 to S208.
S201: and processing a plurality of input images by using a plurality of layers of common convolutions to obtain a plurality of first feature maps with different scales of each input image. The input images comprise a reference image, and the rest are source images.
For example, fig. 1 shows five layers of ordinary convolutions 10a, 10b, 10c, 10d, 10e, where the last two layers 10d, 10e have a stride of 2; the input image is processed by these five layers to obtain first feature maps of three different scales.
S202: and processing a plurality of first feature maps with different scales by using a plurality of deformable convolutions which do not share parameters, and obtaining a plurality of second feature maps with different scales of each input image after bilinear interpolation.
Wherein the deformable convolution is defined as follows:
f(p) = Σ_k w_k · f(p + p_k + Δp_k) · Δm_k
In the above formula, f(p) represents the feature value of pixel p, w_k and p_k respectively represent the convolution kernel parameters and the fixed offsets defined in the ordinary convolution, and Δp_k and Δm_k respectively represent the offsets and weights of the deformable convolution generated by learning.
In fig. 1, the multilayer ordinary convolution obtains first feature maps of three different scales, and then three deformable convolutions 20 which do not share parameters are used for processing the first feature maps of three different scales, and then bilinear interpolation is carried out to obtain second feature maps 30a, 30b and 30c of three different scales. The dimensions of the three second feature maps 30a, 30b, 30c in fig. 1 are H × W × 16, H/2 × W/2 × 8 and H/4 × W/4 × 8, respectively, where H and W are the dimensions of the input image.
S203: and splicing the second feature maps of the input images to obtain the output feature map of the input images.
In the configuration shown in fig. 1, three second feature maps 30a, 30b, and 30c with different scales can be obtained for each input image, and the second feature maps 30a, 30b, and 30c with scales H × W × 16, H/2 × W/2 × 8, and H/4 × W/4 × 8 can be merged into one output feature map with a scale H × W × 32 in S203.
Of course, fig. 1 is only one specific example provided in the present application, and the present application is not limited to the various quantities and dimensions shown in fig. 1, and are specifically described below. The number of ordinary convolutions is not limited to five layers, and may be other numbers of six layers, seven layers, eight layers, and the like. The number of the first feature maps with different scales is not limited to three, and the first feature maps with different scales can be four, five or other number of first feature maps with different scales. The number of the deformable convolutions not sharing the parameter is not limited to three, and may be other numbers; such as: when the number of the first feature maps with different scales is four, the deformable convolution without sharing the parameters is also four; when the number of first feature maps of different scales is five, the deformable convolution that does not share parameters is also five. The scale of the second feature map is not limited to H × W × 16, H/2 × W/2 × 8, or H/4 × W/4 × 8, the scale of the final output feature map is not limited to H × W × 32, and the scale of the second feature map and the scale of the output feature map may be set according to actual needs.
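For illustration only, the following PyTorch-style sketch shows one possible implementation of the feature extraction of S201 to S203 under the configuration of fig. 1. The channel widths, kernel sizes, activation functions and the small layers that predict the deformable-convolution offsets and modulation masks are assumptions for demonstration, and torchvision's DeformConv2d is used as a stand-in for the deformable convolution defined above; this is a sketch, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class DeformBlock(nn.Module):
    """One deformable convolution whose offsets and modulation masks are
    predicted from its own input (parameters are not shared across scales)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)  # (x, y) offset per sampling point
        self.mask = nn.Conv2d(in_ch, k * k, 3, padding=1)        # modulation weight per sampling point
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x), torch.sigmoid(self.mask(x)))


class FeatureNet(nn.Module):
    """Ordinary convolutions producing first feature maps at three scales,
    followed by three non-shared deformable convolutions, bilinear upsampling
    and channel-wise concatenation into one H x W x 32 output feature map."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(                       # three ordinary convolutions, stride 1
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(True))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(True))  # H/2
        self.stage3 = nn.Sequential(nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(True))  # H/4
        self.def1, self.def2, self.def3 = DeformBlock(16, 16), DeformBlock(16, 8), DeformBlock(16, 8)

    def forward(self, img):                                # img: [B, 3, H, W]
        f1 = self.stage1(img)                              # first feature map at H x W
        f2 = self.stage2(f1)                               # first feature map at H/2 x W/2
        f3 = self.stage3(f2)                               # first feature map at H/4 x W/4
        h, w = img.shape[-2:]
        s1 = self.def1(f1)                                                               # 16 channels
        s2 = F.interpolate(self.def2(f2), (h, w), mode="bilinear", align_corners=False)  # 8 channels
        s3 = F.interpolate(self.def3(f3), (h, w), mode="bilinear", align_corners=False)  # 8 channels
        return torch.cat([s1, s2, s3], dim=1)              # output feature map, 16 + 8 + 8 = 32 channels


features = FeatureNet()(torch.rand(1, 3, 64, 80))          # -> torch.Size([1, 32, 64, 80])
```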
Through S201 to S203, multi-scale feature extraction of the input image is realized, overcoming the technical problem that features of weakly textured regions are difficult to extract, so that weakly textured regions obtain a better reconstruction effect in the subsequent steps.
S204: and constructing a cost body according to the output characteristic diagram of each input image and the camera parameters. The specific steps of implementing S204 in the embodiment of the present application are shown in fig. 5, and include S701 to S703.
S701: and constructing cones at the same intervals on the reference image by using an over-plane scanning method with the main optical axis of the reference image as the scanning direction.
S702: and projecting the output feature map of each source image onto each depth plane to form a feature body according to the differentiable homography, and enabling each projection to be identical in size by utilizing an interpolation method. Wherein the micro-homography transformation is defined as follows:
H_i(θ) = K_i · R_i · (I − (t_0 − t_i) · n^T / θ) · R_0^T · K_0^{−1}
In the above formula, {K, R, t} are the camera parameters, namely the camera intrinsics, rotation and translation, respectively; n is the principal optical axis of the reference image and θ is the depth value.
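As an illustration of the differentiable homography above, the following sketch computes the 3x3 matrix that maps reference-image pixels onto a source image for one depth plane; the camera parameters in the example are arbitrary values chosen only for demonstration.

```python
import torch


def plane_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, n_ref, depth):
    """3x3 homography mapping reference pixels to a source image for the plane at
    the given depth along the reference principal axis n_ref, following
    H_i(theta) = K_i R_i (I - (t_0 - t_i) n^T / theta) R_0^T K_0^{-1}."""
    I = torch.eye(3, dtype=K_ref.dtype)
    plane = I - (t_ref - t_src).view(3, 1) @ n_ref.view(1, 3) / depth
    return K_src @ R_src @ plane @ R_ref.t() @ torch.inverse(K_ref)


# illustrative cameras: identity rotations, source translated by one unit along x
K = torch.tensor([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = torch.eye(3)
H = plane_homography(K, R, torch.zeros(3), K, R, torch.tensor([1.0, 0.0, 0.0]),
                     torch.tensor([0.0, 0.0, 1.0]), depth=10.0)
print(H)  # homography for the depth-10 plane
```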
S703: and determining a cost body corresponding to the reference image based on the variance by using a plurality of characteristic bodies corresponding to the reference image.
Specifically, assuming that the number of source images is N, theoretically, each reference image has N corresponding feature volumes, and a cost volume is constructed by using these feature volumes based on a variance form, and the calculation method is as follows:
C = (1/N) · Σ_{i=1}^{N} (V_i − V̄)²
In the above equation, V_i represents the feature volume corresponding to the i-th source image and V̄ is the mean of the N feature volumes.
S205: regularization cost bodies obtain probability bodies. The embodiment of the present application provides a specific implementation manner of obtaining a probability body, as shown in fig. 3, including S301 and S302.
S301: and regularizing the cost body by using a hierarchical recursive convolutional network module to obtain a plurality of regularized cost graphs. Referring to fig. 4, the hierarchical recursive convolutional network module 40 includes: a plurality of lstmconvcells 41, 43, 45, 47, 49 arranged in sequence, and a pooling layer 42, 44 and a deconvolution layer 46, 48 disposed between two lstmconvcells.
LSTM (Long Short-Term Memory) is a neural network with the ability to memorize long- and short-term information, and it can solve the long-term dependence problem of RNNs (Recurrent Neural Networks) when deep learning is used for sequential problems. The LSTMConvCells 41, 43, 45, 47 and 49 add convolution calculations to the LSTM, so that they not only capture temporal relationships but also extract spatial features like a convolutional layer, and the transitions between states also become convolution operations. The LSTMConvCells 41, 43, 45, 47 and 49 can therefore extract temporal and spatial features simultaneously, absorbing multi-scale context information well and processing the cost body efficiently.
The LSTM portion of each LSTMConvCell 41, 43, 45, 47, 49 generates the following four variables:
i_t = σ(W_xi * X_t + W_hi * H_{t−1} + b_i)
f_t = σ(W_xf * X_t + W_hf * H_{t−1} + b_f)
o_t = σ(W_xo * X_t + W_ho * H_{t−1} + b_o)
g_t = tanh(W_xg * X_t + W_hg * H_{t−1} + b_g)
Finally, each LSTMConvCell 41, 43, 45, 47, 49 outputs two variables:
C_t = f_t ⊙ C_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(C_t)
wherein h_t is the output of the cell.
Illustratively, the specific structure and parameters of the hierarchical recursive convolutional network module 40 may be as shown in the following table:
[Table: layer-by-layer structure and parameters of the hierarchical recursive convolutional network module 40]
The structure shown in fig. 4 may also incorporate a ResBlock module after the last LSTMConvCell 49, as shown in the above table.
S302: and generating a corresponding probability body according to the multiple cost graphs.
The cost body is divided into D layers in the depth direction, so it can be regarded as D sequentially connected 2D cost maps {C(i)}_{i=0,…,D−1}. These are processed in sequence and output as regularized cost maps {CH(i)}_{i=0,…,D−1}, and the corresponding probability body is finally generated through a softmax layer.
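The following sketch, reusing the ConvLSTMCell class above, illustrates the recurrent regularization of the D cost maps and the final softmax that produces the probability body. The real module is hierarchical, with several cells separated by pooling and deconvolution layers; the single-cell version here is only a simplifying assumption for demonstration.

```python
import torch
import torch.nn as nn


class RecurrentRegularizer(nn.Module):
    """Regularise the D 2-D cost maps {C(i)} of a cost body one depth plane at a
    time with a ConvLSTM cell, then apply a softmax over depth to obtain the
    probability body. ConvLSTMCell is the class sketched above."""
    def __init__(self, feat_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(feat_ch, feat_ch)
        self.to_cost = nn.Conv2d(feat_ch, 1, 3, padding=1)

    def forward(self, cost_body):                       # cost_body: [B, C, D, H, W]
        state, regularized = None, []
        for i in range(cost_body.shape[2]):             # sweep through the depth planes
            out, state = self.cell(cost_body[:, :, i], state)
            regularized.append(self.to_cost(out))       # regularized cost map CH(i)
        cost = torch.cat(regularized, dim=1)            # [B, D, H, W]
        return torch.softmax(cost, dim=1)               # probability body
```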
Besides S301 and S302 shown in fig. 3, other specific implementations may be adopted to regularize the cost body and obtain the probability body, for example regularizing the cost body directly with a 3D CNN, or using a stacked recurrent structure that relies on a state transfer mechanism to process the 3D cost body hierarchically as sequentially connected 2D cost maps.
Regularizing the cost body directly with a 3D CNN makes good use of the local information and multi-scale context information of the cost body, but because GPU (Graphics Processing Unit) memory is limited, the depth map for a dense point cloud cannot be regressed directly, especially for high-resolution images. This approach is therefore mainly applied to the reconstruction of small scene objects, such as MVSNet on the DTU data set.
The specific implementation of regularizing the cost body to obtain the probability body provided by the embodiment of the application absorbs multi-scale context information well and reduces GPU memory consumption, thereby improving processing accuracy and efficiency.
S206: and determining an initial depth map of the reference image according to the probability volume.
Specifically, the initial depth map could be obtained as follows: according to the winner-take-all principle, the initial depth map is obtained directly with the argmax operation. However, the argmax operation cannot estimate depth at the sub-pixel level, which may make the depth abrupt and unsmooth.
When obtaining the initial depth map in this embodiment, the depth expectation value along the depth direction of the probability body is instead taken as the depth estimate of each pixel to determine the initial depth map of the reference image, so that the initial depth map is smooth within the different parts of the scene.
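A minimal sketch of this depth-expectation (soft argmax) step, assuming a probability body of shape [B, D, H, W] and a vector of D depth hypotheses:

```python
import torch


def expected_depth(prob_body, depth_values):
    """Soft-argmax depth regression: the depth estimate of each pixel is the
    expectation of the depth hypotheses along the depth dimension of the
    probability body, which keeps the initial depth map smooth."""
    # prob_body: [B, D, H, W]; depth_values: [D] depth hypothesis of each plane
    return (prob_body * depth_values.view(1, -1, 1, 1)).sum(dim=1)   # [B, H, W]


prob = torch.softmax(torch.rand(1, 48, 60, 80), dim=1)
init_depth = expected_depth(prob, torch.linspace(2.0, 10.0, 48))     # initial depth map
```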
S207: and optimizing the initial depth map to obtain a depth optimization map. Fig. 5 shows a specific manner of implementing S207, including S501 to S505.
S501: and projecting pixel points on the reference image to the corresponding position of each source image through the initial depth map, and calculating the difference value between the value of the output characteristic map of the reference image and the value of the output characteristic map of the source image.
S502: the difference value is minimized using gauss-newton method.
S503: and calculating residual errors, wherein the calculation formula of the residual errors is as follows: ri (p) = F i (p i ′)-F 0 (p)。
In the above formula, F i (p i ') is the value of the output characteristic map of the source image, F 0 And (p) is the value of the output characteristic diagram of the reference image.
S504: calculating the first derivative J of each residual to the initial depth map i (p) and determining the increment δ of the current depth according to the following formula: δ = - (J) T J) -1 J T r。
Wherein J is the matrix { J i (p) and r is the residual directionSuperposition of quantities { ri (p) }.
S505: and adding the increment of the current depth and the value of the initial depth map to obtain an optimized depth map.
As shown in fig. 10, by comparing the depth map obtained in the MVSNet method with the depth optimization map of the embodiment of the present application, it can be seen that the depth optimization map finally obtained by performing S201 to S207 in the embodiment of the present application is closer to the real situation, and has fewer defects.
S208: and fusing a plurality of depth optimization maps to generate dense point cloud. A specific manner of implementing S208 in the embodiment of the present application is shown in fig. 6, and includes S601 and S602.
S601: and fusing the dynamic matching consistency of all the depth optimization maps to obtain the global dynamic multi-view geometric consistency.
The dynamic matching consistency is defined as follows:
c_i(p) = e^{−(ε_p + λ·ε_d)}
global dynamic multi-view geometric consistency is defined as follows:
C(p) = Σ_i c_i(p)
wherein ε_p and ε_d represent the pixel reprojection error and the depth reprojection error, respectively, and λ represents a coefficient weighting the two different reprojection errors.
S602: outliers are filtered using a preset filter coefficient.
In addition to the fusion implementation shown in fig. 6, other ways may be used to implement fusion, for example applying fixed geometric constraints when fusing depth maps to measure the depth estimation consistency of multiple views.
However, such fusion implementations use parameters fixed in advance, such as thresholds on the pixel reprojection error and the depth reprojection error. These fixed parameters are not reliable across different scenes, and may fail to filter out enough mismatched pixels. Therefore, in the embodiment of the application, a dynamic consistency check is applied in S601 and S602 to fuse the optimized depth maps, dynamically constraining the consistency of adjacent views, so that a more accurate and complete dense point cloud can be obtained.
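Assuming the dynamic consistency score takes the exponential form given above, the filtering of S601 and S602 can be sketched as follows; the weighting coefficient and the threshold are illustrative values only, not the preset filter coefficients of the embodiment.

```python
import torch


def dynamic_consistency_mask(pixel_err, depth_err, lam=200.0, threshold=1.8):
    """Per-view score c_i(p) = exp(-(eps_p + lam * eps_d)), summed over the N
    neighbouring views; pixels whose summed score falls below the preset
    threshold are treated as outliers and filtered out.
    pixel_err, depth_err: [N, H, W] reprojection errors against N source views."""
    per_view = torch.exp(-(pixel_err + lam * depth_err))  # dynamic matching consistency
    global_score = per_view.sum(dim=0)                    # global multi-view geometric consistency
    return global_score >= threshold                      # True where the depth estimate is kept
```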
Fig. 11A, 11B, and 11C show a comparison between the reconstruction results of the multi-view three-dimensional reconstruction method provided by the embodiment of the present application and those of the R-MVSNet method in different scenarios. It can be seen that the point cloud in the reconstruction result of the embodiment of the application is more accurate and complete, which is particularly evident in the boxed regions of the images.
Using a cross-entropy loss function between the probability volume and the one-hot encoded volume of the real depth map, the cross-entropy loss function being defined as:
Loss_0 = Σ_{x∈x_valid} Σ_i −G(i, x) · log P(i, x)
wherein x_valid represents the set of valid pixels, G(i, x) represents the one-hot encoding of the real depth map at depth i of pixel x, and P(i, x) represents a pixel in the probability volume.
When optimizing the initial depth map, taking the distance from the real depth map to the optimized depth map as a loss, namely:
Loss_r = Σ_{x∈x_valid} |d(x) − d_r(x)|
where d(x) denotes the pixel depth value of the real depth map and d_r(x) denotes the pixel depth value of the optimized depth map.
The loss function in the multi-view three-dimensional reconstruction method provided by the embodiment of the application during training is expressed as follows:
Loss = Loss_0 + λ · Loss_r
wherein λ determines whether the method enables the depth map optimization module.
At present, deep-learning-based three-dimensional reconstruction methods generally use a distance function as the loss function and treat training as a regression problem. In the embodiment of the application, the multi-view three-dimensional reconstruction method is split into two parts: a cross entropy loss function is used between the probability body and the real depth map, treated as a multi-class classification task; and a distance function is used between the optimized depth map and the real depth map, treated as a regression task. By using both the cross entropy loss function and the distance function as the loss function of the multi-view three-dimensional reconstruction method, the trained model produces more accurate results.
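A sketch of this combined training loss, assuming the cross-entropy term is computed from the probability body and the index of the ground-truth depth plane, and the regression term is an L1 distance over valid pixels; whether the terms are summed or averaged over pixels is an assumption here.

```python
import torch
import torch.nn.functional as F


def mvs_loss(prob_body, gt_index, refined_depth, gt_depth, valid_mask, lam=1.0):
    """Cross-entropy between the probability body and the one-hot ground-truth
    depth index (multi-class view) plus lam times the L1 distance between the
    refined and ground-truth depth maps (regression view); lam = 0 disables the
    depth refinement term.
    prob_body: [B, D, H, W]; gt_index, refined_depth, gt_depth, valid_mask: [B, H, W]."""
    ce = F.nll_loss(torch.log(prob_body.clamp(min=1e-8)), gt_index, reduction="none")
    loss_ce = ce[valid_mask].sum()                                   # Loss_0 over valid pixels
    loss_l1 = (gt_depth - refined_depth).abs()[valid_mask].sum()     # Loss_r over valid pixels
    return loss_ce + lam * loss_l1
```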
The multi-view three-dimensional reconstruction method provided by the embodiment of the application obviously reduces GPU memory consumption: its consumption is 60.1% of that of R-MVSNet and only 26.1% of that of MVSNet, so the method can be applied to high-resolution images to reconstruct large-scale scenes. The quantitative results are given in the table below.
[Table: quantitative comparison of GPU memory consumption with MVSNet and R-MVSNet]
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The sequence of steps recited in this embodiment is only one of many steps in execution sequence, and does not represent a unique order of execution. When the device or the client product in practice executes, it can execute sequentially or in parallel according to the method shown in the embodiment or the figures (for example, in the context of parallel processors or multi-thread processing).
The embodiment of the present application further provides a multi-view three-dimensional reconstruction apparatus, as shown in fig. 8, the apparatus includes a first feature module 81, a second feature module 82, an output feature module 83, a cost body module 84, a probability body module 85, an initial depth module 86, a depth optimization module 87, and a fusion module 88.
The first feature module 81 is configured to process a plurality of input images by using a multi-layer common convolution to obtain a plurality of first feature maps of different scales for each input image; the input images comprise a reference image, and the rest are source images.
The second feature module 82 is configured to process multiple first feature maps of different scales by using multiple deformable convolutions without shared parameters, and obtain multiple second feature maps of different scales for each input image after bilinear interpolation; wherein the deformable convolution is defined as follows:
f(p) = Σ_k w_k · f(p + p_k + Δp_k) · Δm_k
f(p) represents the feature value of pixel p, w_k and p_k respectively represent the convolution kernel parameters and the fixed offsets defined in the ordinary convolution, and Δp_k and Δm_k respectively represent the offsets and weights of the deformable convolution generated by learning.
The output feature module 83 is configured to splice multiple second feature maps of each input image to obtain an output feature map of each input image.
The cost entity module 84 is used for constructing a cost entity according to the output feature map and the camera parameters of each input image.
The probability body module 85 is used for regularizing the cost body to obtain a probability body.
The initial depth module 86 is configured to determine an initial depth map of the reference image based on the probability volume.
The depth optimization module 87 is configured to optimize the initial depth map to obtain a depth optimization map.
The fusion module 88 is configured to fuse the depth optimization maps to generate a dense point cloud.
The cost body module 84 is specifically configured to: constructing cones at equal depth intervals over the reference image by a plane sweep method, with the principal optical axis of the reference image as the sweep direction; projecting the output feature map of each source image onto each depth plane according to the differentiable homography to form a feature body, and making each projection the same size by interpolation; and determining the cost body corresponding to the reference image based on variance, using the plurality of feature bodies corresponding to the reference image.
The probability body module 85 is specifically configured to: regularizing the cost body by using a hierarchical recursive convolutional network module to obtain a plurality of regularized cost graphs; the hierarchical recursive convolutional network module comprises a plurality of LSTMConvCells which are sequentially arranged, and a pooling layer and a deconvolution layer which are arranged between two LSTMConvCells; and generating a corresponding probability body according to the multiple cost graphs.
The initial depth module 86 is specifically configured to: and determining an initial depth map of the reference image by taking the expected depth value as the depth estimated value of each pixel along the depth direction of the probability body.
The depth optimization module 87 is specifically configured to: projecting pixel points on the reference image to the corresponding position of each source image through the initial depth map, and calculating the difference value between the value of the output feature map of the reference image and the value of the output feature map of the source image; minimizing the difference value by using the Gauss-Newton method; calculating residuals according to r_i(p) = F_i(p_i′) − F_0(p), wherein F_i(p_i′) is the value of the output feature map of the source image and F_0(p) is the value of the output feature map of the reference image; calculating the first derivative J_i(p) of each residual with respect to the initial depth map and determining the increment δ of the current depth according to δ = −(J^T J)^{-1} J^T r, wherein J is the stack of the matrices {J_i(p)} and r is the stack of the residual vectors {r_i(p)}; and adding the increment of the current depth to the value of the initial depth map to obtain the optimized depth map.
The fusion module 88 is specifically configured to: fusing the dynamic matching consistency of all the depth optimization maps to obtain the global dynamic multi-view geometric consistency; the dynamic matching consistency is defined as follows:
c_i(p) = e^{−(ε_p + λ·ε_d)}
global dynamic multi-view geometric consistency is defined as follows:
C(p) = Σ_i c_i(p)
wherein ε_p and ε_d respectively represent the pixel reprojection error and the depth reprojection error, and λ represents a coefficient weighting the two different reprojection errors; outliers are filtered using a preset filter coefficient.
A cross entropy loss function is used between the probability volume and the one-hot coded volume of the real depth map; the cross entropy loss function is defined as:
Loss_0 = Σ_{x∈x_valid} Σ_i −G(i, x) · log P(i, x)
wherein x_valid represents the set of valid pixels, G(i, x) represents the one-hot encoding of the real depth map at depth i of pixel x, and P(i, x) represents a pixel in the probability volume.
When optimizing the initial depth map, the distance from the real depth map to the optimized depth map is taken as a loss, namely:
Loss_r = Σ_{x∈x_valid} |d(x) − d_r(x)|
where d(x) denotes the pixel depth value of the real depth map and d_r(x) denotes the pixel depth value of the optimized depth map.
The loss function in the multi-view three-dimensional reconstruction method provided by the embodiment of the application during training is expressed as follows:
Loss = Loss_0 + λ · Loss_r
wherein λ determines whether the method enables the depth map optimization module.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. The functionality of the modules may be implemented in the same one or more software and/or hardware implementations of the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or sub-units in combination.
The embodiment of the present application further provides a multi-view three-dimensional reconstruction device 90, as shown in fig. 9, the multi-view three-dimensional reconstruction device 90 includes a memory 91 and a processor 92 connected by a bus 93. The memory 91 is used to store computer readable instructions non-transiently. The processor 92 is configured to execute computer-readable instructions, and the computer-readable instructions, when executed by the processor, implement the multi-view three-dimensional reconstruction method provided by the embodiment of the present application.
The embodiment of the present application further provides a computer-readable storage medium, where computer-readable instructions are stored, and when executed by a processor, the computer-readable instructions implement the multi-view three-dimensional reconstruction method provided in the embodiment of the present application.
The storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache, a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions.
The methods, apparatus or modules described herein may be implemented by computer-readable program code in a controller implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application specific integrated circuits (ASICs), programmable logic controllers and embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely in computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component, or even as both software modules for performing the method and structures within the hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary hardware. Based on such understanding, the technical solutions of the present application may be embodied in the form of software products or in the implementation process of data migration, which essentially or partially contributes to the prior art. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to perform the methods described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. All or portions of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the present application; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the present disclosure.

Claims (10)

1. A multi-view three-dimensional reconstruction method, comprising:
processing a plurality of input images by using a plurality of layers of common convolution to obtain a plurality of first feature maps with different scales of each input image; the input images comprise a reference image, and the rest are source images;
processing a plurality of first feature maps with different scales by using a plurality of deformable convolutions which do not share parameters, and obtaining a plurality of second feature maps with different scales of each input image after bilinear interpolation; wherein the deformable convolution is defined as follows:
f(p) = Σ_k w_k · f(p + p_k + Δp_k) · Δm_k;
f(p) represents the feature value of pixel p, w_k and p_k respectively represent the convolution kernel parameters and the fixed offsets defined in said ordinary convolution, and Δp_k and Δm_k respectively represent the offsets and weights of the deformable convolution generated by learning;
splicing a plurality of second feature maps of each input image to obtain an output feature map of each input image;
constructing a cost body according to the output feature map and the camera parameters of each input image;
regularizing the cost body to obtain a probability body;
determining an initial depth map of the reference image according to the probability volume;
optimizing the initial depth map to obtain a depth optimization map;
and fusing a plurality of depth optimization maps to generate dense point cloud.
2. The multi-view three-dimensional reconstruction method of claim 1, wherein the regularizing the cost volume to obtain a probability volume comprises:
regularizing the cost body by using a hierarchical recursive convolutional network module to obtain a plurality of regularized cost graphs; the hierarchical recursive convolutional network module comprises a plurality of LSTMConvCells which are sequentially arranged, and a pooling layer and a deconvolution layer which are arranged between two LSTMConvCells;
and generating the corresponding probability body according to the plurality of cost graphs.
3. The method according to claim 1, wherein said determining an initial depth map of a reference image from the probability volume comprises:
and determining an initial depth map of the reference image by taking a depth expectation value as a depth estimation value of each pixel along the depth direction of the probability body.
4. The multi-view three-dimensional reconstruction method of claim 1, wherein said optimizing said initial depth map comprises:
projecting pixel points on the reference image to the corresponding position of each source image through the initial depth map, and calculating the difference value between the value of the output characteristic map of the reference image and the value of the output characteristic map of the source image;
minimizing the difference value using the Gauss-Newton method;
calculating residual errors, wherein the calculation formula of the residual errors is as follows:
r_i(p) = F_i(p_i′) − F_0(p);
wherein F_i(p_i′) is the value of the output feature map of the source image and F_0(p) is the value of the output feature map of the reference image;
calculating the first derivative J_i(p) of each of the residuals with respect to the initial depth map, and determining the increment δ of the current depth according to the following formula:
δ = −(J^T J)^{-1} J^T r;
wherein J is the stack of the matrices {J_i(p)} and r is the stack of the residual vectors {r_i(p)};
and adding the increment of the current depth and the value of the initial depth map to obtain the optimized depth map.
5. The multi-view three-dimensional reconstruction method of claim 1, wherein said fusing a plurality of said depth optimization maps to generate a dense point cloud comprises:
fusing the dynamic matching consistency of all the depth optimization maps to obtain global dynamic multi-view geometric consistency; the dynamic matching consistency is defined as follows:
c_i(p) = e^{−(ε_p + λ·ε_d)};
the global dynamic multi-view geometric consistency is defined as follows:
C(p) = Σ_i c_i(p);
wherein ε_p and ε_d respectively represent the pixel reprojection error and the depth reprojection error, and λ represents a coefficient weighting the two different reprojection errors;
outliers are filtered using a preset filter coefficient.
6. The multi-view three-dimensional reconstruction method according to claim 1, wherein a cross-entropy loss function is used between the probability volume and the one-hot coded volume of the real depth map, the cross-entropy loss function being defined as:
Loss_0 = Σ_{x∈x_valid} Σ_i −G(i, x) · log P(i, x);
wherein x_valid represents the set of valid pixels, G(i, x) represents the one-hot encoding of the real depth map at depth i of pixel x, and P(i, x) represents a pixel in the probability volume;
when optimizing the initial depth map, the distance from the real depth map to the optimized depth map is taken as a loss, namely:
Loss1 = ∑_{x ∈ x_valid} ‖d(x) − d_r(x)‖;
wherein d(x) denotes the pixel depth value of the real depth map, and d_r(x) denotes the pixel depth value of the optimized depth map;
the loss function used when training the method is expressed as follows:
Loss = Loss0 + λ · Loss1;
wherein λ determines whether the depth map optimization module of the method is enabled.
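A compact PyTorch sketch of the training loss in claim 6 follows, under the assumptions that the depth distance is an L1 norm, that invalid pixels are excluded by a boolean mask, and that the one-hot ground truth is represented by the index of the true depth plane; none of these details are fixed by the claim text.

```python
import torch
import torch.nn.functional as F

def training_loss(prob_volume, gt_index, depth_refined, depth_gt, valid_mask, lam=1.0):
    """Total loss = cross-entropy on the probability volume + lam * depth-map distance.

    prob_volume  : (B, D, H, W) probabilities along the depth dimension.
    gt_index     : (B, H, W) long tensor, index of the ground-truth depth plane per pixel
                   (argmax of the one-hot encoded real depth map).
    depth_refined: (B, H, W) optimized depth map; depth_gt: (B, H, W) real depth map.
    valid_mask   : (B, H, W) boolean mask of pixels with valid ground truth.
    lam          : 0 trains without the depth-map optimization term, 1 with it.
    """
    log_p = torch.log(prob_volume.clamp(min=1e-7))
    ce = F.nll_loss(log_p, gt_index, reduction="none")           # per-pixel cross-entropy
    loss0 = ce[valid_mask].mean()
    loss1 = (depth_refined - depth_gt).abs()[valid_mask].mean()  # L1 distance (assumed)
    return loss0 + lam * loss1
```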
7. The multi-view three-dimensional reconstruction method according to claim 1, wherein said constructing a cost volume from said output feature map and camera parameters of each of said input images comprises:
constructing a view frustum over the reference image by sampling depth planes at equal intervals with a plane sweeping method, taking the principal optical axis of the reference image as the sweeping direction;
projecting the output feature map of each source image onto each depth plane according to the differentiable homography to form a feature volume, and resampling each projection to the same size by interpolation;
and determining the cost volume corresponding to the reference image based on the variance of the plurality of feature volumes corresponding to the reference image.
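The construction in claim 7 relies on the plane-induced homography and on variance aggregation of the warped feature volumes. The sketch below illustrates both pieces; the sign convention assumes the extrinsics [R | t] map reference-camera coordinates to source-camera coordinates, and the bilinear resampling of each warped projection (e.g. with grid_sample) is omitted.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth):
    """Homography warping reference-image pixels onto a source image for one
    fronto-parallel sweep plane at the given depth along the principal axis
    (plane-induced homography in the Hartley-Zisserman convention)."""
    n = np.array([0.0, 0.0, 1.0])                      # sweep plane normal: principal axis
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def variance_cost_volume(feature_volumes):
    """Variance-based aggregation of the warped feature volumes.

    feature_volumes: (V, C, D, H, W) array, one warped volume per view
                     (reference plus sources), all resampled to the same size.
    Returns the (C, D, H, W) cost volume of the reference image.
    """
    mean = feature_volumes.mean(axis=0)
    return ((feature_volumes - mean) ** 2).mean(axis=0)
```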
8. A multi-view three-dimensional reconstruction apparatus, comprising:
a first feature module, used for processing a plurality of input images by using multilayer ordinary convolution to obtain a plurality of first feature maps with different scales of each input image; wherein the input images comprise a reference image, and the rest are source images;
a second feature module, used for processing the plurality of first feature maps with different scales by using a plurality of deformable convolutions which do not share parameters, and obtaining a plurality of second feature maps with different scales of each input image after bilinear interpolation; wherein the deformable convolution is defined as follows:
F(p) = ∑_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k;
wherein F(p) denotes the feature value of the pixel p, x denotes the input feature map, K denotes the number of sampling positions of the convolution kernel, w_k and p_k respectively denote the convolution kernel parameter and the fixed offset defined in said ordinary convolution, and Δp_k and Δm_k respectively denote the offset and the weight of the deformable convolution generated by learning;
an output feature module, used for splicing the plurality of second feature maps of each input image to obtain an output feature map of each input image;
a cost volume module, used for constructing a cost volume according to the output feature map and the camera parameters of each input image;
a probability volume module, used for regularizing the cost volume to obtain a probability volume;
an initial depth module, used for determining an initial depth map of the reference image according to the probability volume;
a depth optimization module, used for optimizing the initial depth map to obtain a depth optimization map;
and a fusion module, used for fusing a plurality of the depth optimization maps to generate a dense point cloud.
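An illustrative sketch of the second feature module of claim 8, using torchvision's DeformConv2d for the deformable convolution with learned offsets Δp_k and modulation weights Δm_k; the channel counts, the number of scales, the choice of the first (finest) scale as the interpolation target, and the concatenation order are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformableBranch(nn.Module):
    """One per-scale branch with its own parameters (no sharing across scales)."""
    def __init__(self, ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2 * k * k, k, padding=k // 2)  # learned offsets (delta p_k)
        self.mask = nn.Conv2d(ch, k * k, k, padding=k // 2)        # learned weights (delta m_k)
        self.deform = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x), torch.sigmoid(self.mask(x)))

class SecondFeatureModule(nn.Module):
    """Applies an independent deformable branch to each first feature map, upsamples
    every result to the finest resolution by bilinear interpolation, and concatenates."""
    def __init__(self, channels=(8, 16, 32)):
        super().__init__()
        self.branches = nn.ModuleList(DeformableBranch(c) for c in channels)

    def forward(self, first_feature_maps):
        # first_feature_maps: list of (B, C_s, H_s, W_s) tensors, finest scale first.
        target = first_feature_maps[0].shape[-2:]
        outs = [F.interpolate(branch(fmap), size=target, mode="bilinear", align_corners=False)
                for branch, fmap in zip(self.branches, first_feature_maps)]
        return torch.cat(outs, dim=1)  # the spliced output feature map
```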
9. A multi-view three-dimensional reconstruction device, comprising:
a memory for non-transitory storage of computer readable instructions;
a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, implement the multi-view three-dimensional reconstruction method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the multi-view three-dimensional reconstruction method according to any one of claims 1 to 7.
CN202210325207.0A 2022-03-30 2022-03-30 Multi-view three-dimensional reconstruction method and device Pending CN115330929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325207.0A CN115330929A (en) 2022-03-30 2022-03-30 Multi-view three-dimensional reconstruction method and device

Publications (1)

Publication Number Publication Date
CN115330929A true CN115330929A (en) 2022-11-11

Family

ID=83916326

Country Status (1)

Country Link
CN (1) CN115330929A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091712A (en) * 2023-04-12 2023-05-09 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment
CN117671163A (en) * 2024-02-02 2024-03-08 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN117671163B (en) * 2024-02-02 2024-04-26 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN117911631A (en) * 2024-03-19 2024-04-19 广东石油化工学院 Three-dimensional reconstruction method based on heterogeneous image matching
CN117911631B (en) * 2024-03-19 2024-05-28 广东石油化工学院 Three-dimensional reconstruction method based on heterogeneous image matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination