CN117830520A - Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning

Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning

Info

Publication number
CN117830520A
CN117830520A
Authority
CN
China
Prior art keywords
sdf
point
function
representing
points
Prior art date
Legal status
Pending
Application number
CN202311857090.1A
Other languages
Chinese (zh)
Inventor
Gao Fengjiao (高凤娇)
Wang Wei (王维)
Yan Yiming (闫奕名)
Current Assignee
Institute Of Intelligent Manufacturing Heilongjiang Academy Of Sciences
Original Assignee
Institute Of Intelligent Manufacturing Heilongjiang Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Institute Of Intelligent Manufacturing Heilongjiang Academy Of Sciences filed Critical Institute Of Intelligent Manufacturing Heilongjiang Academy Of Sciences
Priority to CN202311857090.1A priority Critical patent/CN117830520A/en
Publication of CN117830520A publication Critical patent/CN117830520A/en
Pending legal-status Critical Current


Landscapes

  • Image Generation (AREA)

Abstract

A multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning belongs to the technical field of three-dimensional reconstruction, and aims to solve the problem that existing surface reconstruction via neural rendering of images cannot achieve high-fidelity surface reconstruction in large-scale complex scenes. The method comprises the following steps: acquiring images of a plurality of view angles; generating a sparse point cloud and camera pose information using SFM; casting rays from the camera onto the scene surface to obtain three-dimensional coordinate information of a set of sampling points on the scene surface; generating point cloud information represented by an SDF function; extracting appearance information with ResNet-50 and calculating a color loss function, a weight constraint function, a photometric consistency loss function and a point SDF loss function; optimizing the SDF network by back-propagation and calculating the SDF value of each sampling voxel; and extracting the zero isosurface from the SDF values using the Marching Cubes algorithm to output the 3D surface and complete the reconstruction. The method is used for reconstructing the surfaces of large-scale complex scenes.

Description

Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning
Technical Field
The invention relates to a multi-view three-dimensional reconstruction method, and belongs to the technical field of three-dimensional reconstruction.
Background
The purpose of three-dimensional reconstruction is to recover accurate scene geometry information from multiple images observed from different perspectives, which can be applied to virtual reality scene representation or robot autonomous navigation environment mapping, etc. Meanwhile, the multi-view-based three-dimensional reconstruction technology can be used for the fields of digital reconstruction of cultural relics, analysis of traffic accidents, reproduction of other building sites and the like.
The conventional multi-view three-dimensional reconstruction algorithm combines Structure from Motion (SFM) and Multi-view Stereo (MVS). Although it achieves reasonable reconstruction results, its complicated pipeline inevitably introduces accumulated errors into the finally reconstructed geometric structure. Moreover, an inherent disadvantage of this conventional algorithm is its inability to handle sparse or ambiguous views, such as regions with large areas of uniform color, regions with complex texture, or remote sensing scenes photographed from a long distance.
To reduce the bias caused by the conventional multi-view reconstruction pipeline, existing three-dimensional reconstruction methods represent scene geometry as a neural implicit surface and optimize the surface using volume rendering, because volume rendering is more robust than surface rendering. Results show that such methods perform better on indoor datasets (DTU) or outdoor small-scene datasets shot at close range (for example, some scenes in the BlendedMVS dataset), and the deviation produced by the traditional pipeline is alleviated to some extent. However, optimizing the surface structure of a scene using only color information synthesized by volume rendering still struggles with data captured under extreme weather or lighting (cloudy or foggy, night or day) and with remote sensing scene data having distant, sparse views. Furthermore, existing approaches encounter challenges when capturing complex details in large-scale scenes, and the performance of neural rendering is significantly hampered by the view sparsity and the complex content inherent in such scenes.
Disclosure of Invention
The invention aims to solve the problem that high-fidelity surface reconstruction cannot be realized in a large-scale complex scene in the existing surface reconstruction process through the neural rendering of an image, and provides a multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning.
The invention relates to a multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning, which comprises the following steps:
s1, acquiring images of a plurality of view angles to be three-dimensionally reconstructed by adopting a camera;
s2, performing image feature extraction, feature matching and sparse point cloud reconstruction on the image acquired in the S1 by adopting a motion restoration structure SFM, generating sparse point cloud with three-dimensional coordinate information, and generating pose information of a camera;
s3, assuming that light rays emitted from a camera are incident to the scene surface, acquiring three-dimensional coordinate information of a sampling point set of the scene surface;
s4, inputting the three-dimensional coordinates of the sparse point cloud obtained in the S2, the pose information of the camera and the three-dimensional coordinate information of the sampling point set obtained in the S3 into an SDF function, and generating point cloud information represented by the SDF function;
s5, extracting appearance information of the image acquired in the S1 by adopting a ResNet-50 convolution block;
s6, inputting the appearance information acquired in the S5, the three-dimensional coordinates of the sparse point cloud acquired in the S2 and the three-dimensional coordinate information of the sampling point set acquired in the S3 into a Color function, and generating a Color value of the point cloud;
S7, setting constraint conditions according to the SDF function point cloud information acquired in S4 and the point cloud color values acquired in S6, and calculating a color loss function;
calculating a weight constraint function according to the weight constraint condition by using the SDF function point cloud information acquired in the step S4;
setting constraint conditions according to the point cloud color information obtained in the step S6, and calculating a photometric consistency loss function;
calculating the SDF loss function of the point according to the constraint condition of the set surface point SDF=0 by using the SDF function point cloud information acquired in the step S4;
s8, reversely optimizing the SDF network by adopting the loss function obtained in the S7, converting the SDF into mesh and texture, and directly calculating the SDF value of each sampling voxel;
S9, extracting the zero isosurface from the SDF values obtained in S8 by using the Marching Cubes algorithm, directly outputting the 3D surface, and completing the three-dimensional reconstruction.
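As an illustration of the ray sampling in S3, the sketch below generates sampling points p(t) = o + t·v along a single camera ray; it is a minimal sketch assuming a pinhole camera from SFM, and the function name, near/far range and sample count are illustrative, not part of the patent.

```python
import numpy as np

def sample_points_on_ray(origin, direction, near=0.5, far=4.0, n_samples=64):
    """Return n_samples 3D points p(t) = o + t*v along one camera ray."""
    direction = direction / np.linalg.norm(direction)          # unit direction vector v
    t = np.linspace(near, far, n_samples)                      # depth values along the ray
    return origin[None, :] + t[:, None] * direction[None, :]   # shape (n_samples, 3)

# Example: one ray leaving a camera at the origin and looking along +z.
points = sample_points_on_ray(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
print(points.shape)  # (64, 3)
```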
Preferably, the specific method for extracting the appearance information of the image acquired in S1 by using the ResNet-50 convolution block in S5 includes:
extracting shallow features of the appearance in the image: the image is input into a ResNet-50 convolution block, which outputs a 1×1×256 feature tensor, completing the appearance embedding;
the ResNet-50 convolution block introduces an inverted residual structure into the residual structure as a bottleneck layer, so as to reduce the number of parameters;
the bottleneck layer first passes through a 1×1 convolution kernel, then a 3×3 convolution kernel, and finally a 1×1 convolution kernel;
the 256-dimensional input passes through a 1×1×64 convolution layer, then a 3×3×64 convolution layer, and finally a 1×1×256 convolution layer, each convolution layer being followed by a ReLU activation function.
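A minimal sketch of such a bottleneck block is given below, assuming PyTorch; the class name, layer names and input size are illustrative and this is not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1x64 -> 3x3x64 -> 1x1x256 bottleneck with ReLU and a residual skip."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)      # 1x1 kernel, 64 filters
        self.conv = nn.Conv2d(mid, mid, kernel_size=3, padding=1)  # 3x3 kernel, 64 filters
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)      # 1x1 kernel, 256 filters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv(out))
        out = self.expand(out)
        return self.relu(out + x)  # residual connection

x = torch.randn(1, 256, 16, 16)
print(Bottleneck()(x).shape)  # torch.Size([1, 256, 16, 16])
```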
Preferably, constraints for three-dimensional reconstruction include color constraints, weight constraints, point constraints, and photometric consistency constraints.
Preferably, the specific method for calculating the weight constraint function according to the weight constraint condition by using the SDF function point cloud information obtained in S4 in S7 includes:
finding two adjacent sampling points near the curved surface along a ray emitted from the camera, whose SDF values satisfy:
f(p(t_i)) · f(p(t_{i+1})) < 0
where p(t_i) denotes the first of the two adjacent sampling points in the initial point set, f(p(t_i)) is the SDF value of p(t_i), p(t_{i+1}) denotes the other adjacent sampling point, and f(p(t_{i+1})) is the SDF value of p(t_{i+1});
the first intersection of the ray with the surface is approximated by linear interpolation as:
p̂(t_*) = o + [ ( f(p(t_j))·t_{j+1} − f(p(t_{j+1}))·t_j ) / ( f(p(t_j)) − f(p(t_{j+1})) ) ] · v
where p̂(t_*) denotes the point obtained by linear interpolation, t_{j+1} is the value of the ray parameter t at point p(t_{j+1}), f(p(t_{j+1})) is the SDF value of point p(t_{j+1}), t_j is the value of t at point p(t_j), f(p(t_j)) is the SDF value of point p(t_j), o is the camera origin, and v is the unit direction vector of the ray;
p̂(t_*) is added to the initial point set p(t_i) to obtain a new point set P:
P = p̂(t_*) ∪ p(t_i)
where P denotes the point set obtained after interpolation;
volume rendering is performed again with the newly generated point set P to generate the final color:
Ĉ_final = Σ_{i=1}^{n} w(t_i)·c(t_i) + ŵ(t_*)·ĉ(t_*)
where w(t_i) denotes the weights of the initial point set, c(t_i) denotes the pixel values of the initial point set, ŵ(t_*) denotes the weights of the point set obtained after linear interpolation, ĉ(t_*) denotes the pixel values of the point set obtained after linear interpolation, and n denotes the number of points;
the color deviation ΔC_final is:
ΔC_final = c(t_*) − Ĉ_final = ε_interp + ε_weight-final
where c(t_*) denotes the pixel value at the intersection of the ray with the surface, ε_interp denotes the deviation caused by the linear interpolation, and ε_weight-final denotes the weight deviation;
the weight constraint loss L_weight regularizes the resulting weight distribution so that it concentrates around the interpolated surface point.
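A minimal sketch of the linear zero-crossing interpolation described above, assuming the standard two-point formula; the variable names and toy numbers are illustrative.

```python
import numpy as np

def interpolate_surface_point(o, v, t_j, t_j1, f_j, f_j1):
    """Approximate the first ray/surface intersection between t_j and t_j1,
    given SDF values f_j = f(p(t_j)) and f_j1 = f(p(t_j1)) with f_j * f_j1 < 0."""
    t_star = (f_j * t_j1 - f_j1 * t_j) / (f_j - f_j1)  # linear interpolation of the zero crossing
    return o + t_star * v, t_star

o = np.zeros(3)
v = np.array([0.0, 0.0, 1.0])
p_hat, t_star = interpolate_surface_point(o, v, t_j=1.0, t_j1=1.1, f_j=0.03, f_j1=-0.02)
print(t_star)  # 1.06
```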
preferably, the specific method for calculating the photometric consistency loss function according to the constraint condition set by the point cloud color information acquired in S6 in S7 includes:
taking a small surface patch S, whose projection onto the source view is the pixel patch q; the image blocks corresponding to the surface patch S in different source views are geometrically consistent except for occluded parts;
S is represented in the camera coordinate system of the reference image I_r as:
n^T x = d
where d denotes the distance from the reference camera center to the plane along the normal direction, and n^T denotes the normal obtained by automatic differentiation of the SDF network at the point p̂(t_*);
a point x in the reference image is related, through the homography matrix H_i, to the corresponding point in the other images, giving the relation:
H_i = K_i ( R_i R_r^T + (t_i − R_i R_r^T t_r)·n^T / d ) K_r^{−1}
where K_r and K_i denote the camera intrinsic calibration matrices, R_r and R_i denote rotation matrices, t_i and t_r denote translation vectors, corresponding to the source view I_i and the reference view I_r respectively; H_i x denotes the corresponding source image pixel, and H_i denotes the homography matrix that maps a pixel of the reference image to the source image;
taking the rendered image as the reference image, the normalized cross-correlation between the reference image and a source view is calculated as:
NCC( I_r(x), I_i(H_i x) ) = Cov( I_r(x), I_i(H_i x) ) / sqrt( Var( I_r(x) ) · Var( I_i(H_i x) ) )
where Cov(·) denotes covariance, Var(·) denotes variance, and NCC(X, H_i X) denotes the NCC score between the sampled patch X (the sampled patch pixel values) and the corresponding patch on each source image;
the best four NCC scores are kept; NCC scores range from −1 to 1, where −1 indicates no correlation and 1 indicates full correlation, so a larger NCC score is better;
the photometric consistency loss of the corresponding views is calculated as:
L_pc = (1/4) · Σ_{i=1}^{4} ( 1 − NCC_i )
where i indexes the selected NCC scores.
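A sketch of the normalized cross-correlation used above, assuming the Cov/sqrt(Var·Var) definition on mean-centred patches; homography warping and patch extraction are omitted, and the names are illustrative.

```python
import numpy as np

def ncc(ref_patch, src_patch, eps=1e-8):
    """NCC in [-1, 1]; 1 means the two patches are perfectly correlated."""
    ref = ref_patch.ravel() - ref_patch.mean()
    src = src_patch.ravel() - src_patch.mean()
    cov = (ref * src).mean()                       # covariance of the two patches
    return cov / np.sqrt(ref.var() * src.var() + eps)

ref = np.random.rand(11, 11)
print(ncc(ref, ref))        # ~ 1.0 (identical patches)
print(ncc(ref, 1.0 - ref))  # ~ -1.0 (inverted patch)
```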
Preferably, the specific method for calculating the SDF loss function of the point according to the constraint condition that the sdf=0 of the surface point is set by the SDF function point cloud information obtained in S4 in S7 includes:
the point constraint loss is
L_points = (1/N) · Σ_{p∈P_0} | f(p) |
where P_0 denotes the sparse 3D points reconstructed by SFM, f(p) is the SDF value of P_0, p denotes the spatial position of the points, and N denotes the number of points.
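A minimal sketch of this point constraint, assuming the loss is the mean absolute SDF value over the SFM sparse points (whose true SDF should be 0); the analytic sphere SDF below stands in for the learned SDF network.

```python
import torch

def point_loss(sdf_network, sparse_points):
    """sparse_points: (N, 3) tensor of SFM-reconstructed 3D points P_0."""
    return sdf_network(sparse_points).abs().mean()

# Toy example with an analytic unit-sphere SDF in place of the MLP.
sdf_network = lambda p: p.norm(dim=-1) - 1.0
P0 = torch.nn.functional.normalize(torch.randn(100, 3), dim=-1)  # points lying on the sphere
print(point_loss(sdf_network, P0))  # ~ 0
```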
Preferably, the three-dimensional reconstruction in S7 includes calculating a total loss function, and the specific method includes:
the total loss function L is a weighted sum of the color constraint loss L_color, the weight constraint loss L_weight, the point constraint loss L_points and the photometric consistency constraint loss L_pc:
L = L_color + λ·L_weight + α·L_pc + β·L_points
where λ denotes the weight of the weight constraint loss function, α denotes the weight of the photometric consistency constraint loss function, and β denotes the weight of the point constraint loss function.
Preferably, the SDF function is:
the curved surface S of the scene is expressed as:
S = { p ∈ R^3 | f(p) = 0 }
where f(p) is a function equal to 0 only on the surface of the observed object, p denotes the spatial position of a point, and f returns the signed distance;
this function is characterized by a neural network, named the SDF network in NeuS; in NeuS the SDF network is associated with NeRF, and the loss function of NeRF is used to optimize the surface characterization of the SDF network.
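As a toy illustration of the zero level set and of obtaining surface normals by automatic differentiation of the SDF (as used by the photometric consistency constraint above), the sketch below uses an analytic sphere SDF in place of the learned SDF network.

```python
import torch

def sphere_sdf(p, radius=1.0):
    return p.norm(dim=-1) - radius  # f(p) = 0 exactly on the sphere surface

p = torch.tensor([[0.6, 0.8, 0.0]], requires_grad=True)  # a point on the unit sphere
f = sphere_sdf(p)
normal = torch.autograd.grad(f.sum(), p)[0]
normal = normal / normal.norm(dim=-1, keepdim=True)
print(f.item())  # ~ 0.0 -> the point lies on the zero level set S
print(normal)    # ~ [0.6, 0.8, 0.0], the outward surface normal
```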
The multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning provided by the invention has the following advantages:
1. The ResNet-50 network is adopted to extract depth features from the image, improving the capability of neural surface rendering and enabling high-quality three-dimensional surface reconstruction of large-scale scenes;
2. Appearance information is embedded into the neural radiance field, accurate pose information is learned, and the 3D surface is localized in a coarse-to-fine manner; surface rendering is added, improving the original single rendering framework so that the rendering process approaches unbiasedness, and the SDF network is optimized in reverse by reducing the difference between the rendered color and the real surface color;
3. interpolating points near the surface and optimizing their weights to capture finer 3D geometric details in the reconstruction result;
4. the accuracy of the three-dimensional curved surface is obviously improved by introducing the constraint of luminosity consistency and points; the multi-view stereo matching is introduced to constrain the geometry, so that the problem of geometric blurring existing in the process of optimizing the scene geometry by using only color information is solved;
5. the geometrical region of interest is automatically sampled, so that coarse 3D geometrical positioning is effectively realized, and efficient utilization of computing resources is realized.
Drawings
FIG. 1 is a schematic block diagram of a multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to the present invention;
FIG. 2 is a schematic block diagram of an appearance embedding algorithm;
FIG. 3 is a diagram of a ResNet-50 network architecture;
FIG. 4 is a DVGO geometric coarse sampling strategy;
FIG. 5 is a Res-NeuS geometric coarse sampling strategy;
FIG. 6 is a qualitative surface reconstruction result of indoor small scene data;
FIG. 7 is a quantitative surface reconstruction result for a large scale scene in a BlendedMVS dataset;
FIG. 8 is the surface reconstruction results of the ablation study.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
Example 1:
the following describes, with reference to fig. 1 to 3, a multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning according to the present embodiment, which includes:
s1, acquiring images of a plurality of view angles to be three-dimensionally reconstructed by adopting a camera;
S2, performing image feature extraction, feature matching and sparse point cloud reconstruction on the image acquired in the S1 by adopting a motion restoration structure SFM, generating sparse point cloud with three-dimensional coordinate information, and generating pose information of a camera;
s3, assuming that light rays emitted from a camera are incident to the scene surface, acquiring three-dimensional coordinate information of a sampling point set of the scene surface;
s4, inputting the three-dimensional coordinates of the sparse point cloud obtained in the S2, the pose information of the camera and the three-dimensional coordinate information of the sampling point set obtained in the S3 into an SDF function, and generating point cloud information represented by the SDF function;
s5, extracting appearance information of the image acquired in the S1 by adopting a ResNet-50 convolution block;
s6, inputting the appearance information acquired in the S5, the three-dimensional coordinates of the sparse point cloud acquired in the S2 and the three-dimensional coordinate information of the sampling point set acquired in the S3 into a Color function, and generating a Color value of the point cloud;
S7, setting constraint conditions according to the SDF function point cloud information acquired in S4 and the point cloud color values acquired in S6, and calculating a color loss function;
calculating a weight constraint function according to the weight constraint condition by using the SDF function point cloud information acquired in the step S4;
setting constraint conditions according to the point cloud color information obtained in the step S6, and calculating a photometric consistency loss function;
Calculating the SDF loss function of the point according to the constraint condition of the set surface point SDF=0 by using the SDF function point cloud information acquired in the step S4;
s8, reversely optimizing the SDF network by adopting the loss function obtained in the S7, converting the SDF into mesh and texture, and directly calculating the SDF value of each sampling voxel;
S9, extracting the zero isosurface from the SDF values obtained in S8 by using the Marching Cubes algorithm, directly outputting the 3D surface, and completing the three-dimensional reconstruction.
Further, the specific method for extracting the appearance information of the image acquired in step S1 by using the ResNet-50 convolution block in step S5 includes:
extracting shallow features of the appearance in the image: the image is input into a ResNet-50 convolution block, which outputs a 1×1×256 feature tensor, completing the appearance embedding;
the ResNet-50 convolution block introduces an inverted residual structure into the residual structure as a bottleneck layer, so as to reduce the number of parameters;
the bottleneck layer first passes through a 1×1 convolution kernel, then a 3×3 convolution kernel, and finally a 1×1 convolution kernel;
the 256-dimensional input passes through a 1×1×64 convolution layer, then a 3×3×64 convolution layer, and finally a 1×1×256 convolution layer, each convolution layer being followed by a ReLU activation function.
In this embodiment, the total number of parameters is: 256×1×1×64 + 64×3×3×64 + 64×1×1×256 = 69,632.
Still further, constraints for three-dimensional reconstruction include color constraints, weight constraints, point constraints, and photometric consistency constraints.
Still further, the specific method for calculating the weight constraint function according to the weight constraint condition by using the SDF function point cloud information obtained in S4 in S7 includes:
finding two adjacent sampling points near the curved surface along a ray emitted from the camera, whose SDF values satisfy:
f(p(t_i)) · f(p(t_{i+1})) < 0
where p(t_i) denotes the first of the two adjacent sampling points in the initial point set, f(p(t_i)) is the SDF value of p(t_i), p(t_{i+1}) denotes the other adjacent sampling point, and f(p(t_{i+1})) is the SDF value of p(t_{i+1});
the first intersection of the ray with the surface is approximated by linear interpolation as:
p̂(t_*) = o + [ ( f(p(t_j))·t_{j+1} − f(p(t_{j+1}))·t_j ) / ( f(p(t_j)) − f(p(t_{j+1})) ) ] · v
where p̂(t_*) denotes the point obtained by linear interpolation, t_{j+1} is the value of the ray parameter t at point p(t_{j+1}), f(p(t_{j+1})) is the SDF value of point p(t_{j+1}), t_j is the value of t at point p(t_j), f(p(t_j)) is the SDF value of point p(t_j), o is the camera origin, and v is the unit direction vector of the ray;
p̂(t_*) is added to the initial point set p(t_i) to obtain a new point set P:
P = p̂(t_*) ∪ p(t_i)
where P denotes the point set obtained after interpolation;
volume rendering is performed again with the newly generated point set P to generate the final color:
Ĉ_final = Σ_{i=1}^{n} w(t_i)·c(t_i) + ŵ(t_*)·ĉ(t_*)
where w(t_i) denotes the weights of the initial point set, c(t_i) denotes the pixel values of the initial point set, ŵ(t_*) denotes the weights of the point set obtained after linear interpolation, ĉ(t_*) denotes the pixel values of the point set obtained after linear interpolation, and n denotes the number of points;
the color deviation ΔC_final is:
ΔC_final = c(t_*) − Ĉ_final = ε_interp + ε_weight-final
where c(t_*) denotes the pixel value at the intersection of the ray with the surface, ε_interp denotes the deviation caused by the linear interpolation, and ε_weight-final denotes the weight deviation;
the weight constraint loss L_weight regularizes the resulting weight distribution so that it concentrates around the interpolated surface point.
still further, the specific method for calculating the photometric consistency loss function according to the constraint condition set by the point cloud color information acquired in S6 in S7 includes:
taking a small surface patch S, whose projection onto the source view is the pixel patch q; the image blocks corresponding to the surface patch S in different source views are geometrically consistent except for occluded parts;
S is represented in the camera coordinate system of the reference image I_r as:
n^T x = d
where d denotes the distance from the reference camera center to the plane along the normal direction, and n^T denotes the normal obtained by automatic differentiation of the SDF network at the point p̂(t_*);
a point x in the reference image is related, through the homography matrix H_i, to the corresponding point in the other images, giving the relation:
H_i = K_i ( R_i R_r^T + (t_i − R_i R_r^T t_r)·n^T / d ) K_r^{−1}
where K_r and K_i denote the camera intrinsic calibration matrices, R_r and R_i denote rotation matrices, t_i and t_r denote translation vectors, corresponding to the source view I_i and the reference view I_r respectively; H_i x denotes the corresponding source image pixel, and H_i denotes the homography matrix that maps a pixel of the reference image to the source image;
taking the rendered image as the reference image, the normalized cross-correlation between the reference image and a source view is calculated as:
NCC( I_r(x), I_i(H_i x) ) = Cov( I_r(x), I_i(H_i x) ) / sqrt( Var( I_r(x) ) · Var( I_i(H_i x) ) )
where Cov(·) denotes covariance, Var(·) denotes variance, and NCC(X, H_i X) denotes the NCC score between the sampled patch X (the sampled patch pixel values) and the corresponding patch on each source image;
the best four NCC scores are kept; NCC scores range from −1 to 1, where −1 indicates no correlation and 1 indicates full correlation, so a larger NCC score is better;
the photometric consistency loss of the corresponding views is calculated as:
L_pc = (1/4) · Σ_{i=1}^{4} ( 1 − NCC_i )
where i indexes the selected NCC scores.
Still further, the specific method for calculating the SDF loss function of the point according to the constraint condition that the sdf=0 of the surface point is set by the SDF function point cloud information obtained in S4 in S7 includes:
point constraint loss
Wherein SFM reconstructs sparse 3D point P 0 ,P 0 Is f (P), P represents the spatial position of the points, and N represents the number of points.
Still further, the three-dimensional reconstruction in S7 includes calculating a total loss function, and the specific method includes:
the total loss function L is a weighted sum of the color constraint loss L_color, the weight constraint loss L_weight, the point constraint loss L_points and the photometric consistency constraint loss L_pc:
L = L_color + λ·L_weight + α·L_pc + β·L_points
where λ denotes the weight of the weight constraint loss function, α denotes the weight of the photometric consistency constraint loss function, and β denotes the weight of the point constraint loss function.
Still further, the SDF function is:
the curved surface S of the scene is expressed as:
S = { p ∈ R^3 | f(p) = 0 }
where f(p) is a function equal to 0 only on the surface of the observed object, p denotes the spatial position of a point, and f returns the signed distance;
this function is characterized by a neural network, named the SDF network in NeuS; in NeuS the SDF network is associated with NeRF, and the loss function of NeRF is used to optimize the surface characterization of the SDF network.
The invention provides a multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning, which is used for reconstructing a high-fidelity surface of a multi-view complex scene.
The SDF network is adopted to locate the zero level set of the three-dimensional curved surface, the volume rendering color network is positively optimized through image appearance embedding, the surface rendering is also added, the original single rendering frame is improved, the rendering process is close to unbiased, and the SDF network is reversely optimized through reducing the difference between the rendering color and the real color of the surface. Meanwhile, in order to solve the problem of geometric blurring in optimizing the scene geometry by using only color information, multi-view stereoscopic matching is introduced to constrain the geometry. In addition, in order to efficiently utilize computing resources and view dependence, a coarse sampling mode of the point cloud of interest is also designed, and point cloud input which is not of interest is automatically filtered out.
In the invention, the deviation of volume rendering is analyzed; providing an appearance embedded optimized color field function; combining surface rendering and volume rendering to make the rendering result approach unbiased; introducing a multi-view stereo matching mechanism to restrict the three-dimensional geometric structure; a new geometric coarse sampling strategy is proposed.
Multi-view surface reconstruction is a complex process aimed at restoring the exact geometric surface of a three-dimensional object from images acquired from multiple perspectives. Early image-based photogrammetry techniques employed a method called the volume occupancy grid to represent a scene. This method examines each cube (or voxel) and marks it as occupied when strict color consistency conditions between the corresponding projected image pixels are met. However, the feasibility of this approach is limited by its photometric consistency assumption: in practical applications, automatic exposure variation and the presence of non-Lambertian materials mean that color consistency cannot be maintained.
The subsequent method adopts a three-dimensional point cloud generated by a multi-view stereo technology, and then dense surface reconstruction is carried out. However, reliance on the quality of the point cloud often results in loss of surface detail or the appearance of noise, as the point cloud is typically sparse. Recently, learning-based approaches have attempted to improve the point cloud generation process by training neural networks. These methods improve the quality and density of the point cloud by learning image features and constructing cost volumes, but they are still limited by cost volume resolution and cannot fully recover the geometric details of complex objects.
Surface rendering: the projected ray color depends only on the color of the intersection of the ray with the scene geometry. This allows the gradient to only back-propagate to a local area near the intersection point, and therefore, surface-based reconstruction methods have difficulty reconstructing complex objects with severe self-occlusion and large abrupt changes in depth, and require object masks as supervision.
Volume rendering: an image-based rendering method that renders a 3D scalar field into a 2D image by projecting light along a 3D volume. Methods such as NeRF can handle objects with abrupt depth changes and synthesize high-quality images by integrating the colors of the sampling points of each ray to render the image. However, since a density-based scene representation makes it difficult to constrain its level set, it is hard to extract high-fidelity surfaces from the learned implicit field. Thus, more direct surface modeling is preferred for the photogrammetric surface reconstruction problem.
Neural implicit fields are a new way of representing the geometry of an object: in the reconstruction process, a neural network is typically trained to fit an implicit function that accepts three-dimensional coordinates as input and outputs surface feature values, such as distance or color, which can be considered an implicit representation of the entire three-dimensional surface. For example [11,34-41], implicit functions such as occupancy grids [32,33] or signed distance functions are preferred over simple density fields, as they define the scene representation of 3D surfaces more accurately.
NeuS is a typical neural implicit surface reconstruction method, applying the volume rendering method [1,14-18] to learn implicit SDF representations. Notably, merely applying the standard volume rendering method to the density values associated with the SDF will result in severe deviations of the reconstructed geometric surface, i.e., the pixel weights at maximum volume density are not at or near the object surface. Thus, neuS constructs a new volume density function and expression of the weight function to satisfy that when the volume density is the same, the weight assigned to the point pixel should be different when the distance from the camera is different.
In NeuS, SDF-based volume rendering is very beneficial for recovering surfaces from 2D images, especially for small indoor scene datasets. However, for some outdoor low-visibility scenes and remote sensing scenes, achieving high-quality three-dimensional surface reconstruction is still challenging, because the view sparsity characteristic of these scenes can cause severe geometric distortions; furthermore, the deviations present in the aforementioned rendering paradigm (deviations induced by the sampling operation, and weight deviations) are greatly amplified when applied to such scenes.
The method builds on NeRF and its extension, NeuS.
The curved surface S of the scene is represented as follows:
S = { p ∈ R^3 | f(p) = 0 }
where f(p) is a function equal to 0 only on the surface of the observed object; it can be characterized by a neural network, named the SDF network in NeuS, and NeuS associates the SDF network function with NeRF and optimizes the surface-characterizing SDF network with NeRF's loss function.
The classical NeRF volume rendering formula is:
C(o, v) = ∫_0^{+∞} T(t)·σ(p(t))·c(p(t), v) dt, where T(t) = exp( −∫_0^t σ(p(u)) du ) and p(t) = o + t·v.
To make the volume density description accurate, the volume density should reach its maximum at or near the surface (i.e., σ(x) also reaches a maximum when f(x) = 0). NeuS therefore redefines the density with the logistic distribution φ_s(u) = s·e^(−su) / (1 + e^(−su))^2, where u = f(x); the expression φ_s(f(x)) is called the S-density, and substituting it into the rendering formula gives
C(o, v) = ∫_0^{+∞} w(t)·c(p(t), v) dt, where p(t) = o + t·v.
Let w(t) = T(t)·φ_s(f(p(t))). The w(t) function must also satisfy a second property: when the volume density is the same but the distance to the camera differs, the weights assigned to the point pixels should differ, otherwise ambiguity arises. That is, taking the influence of T(t) into account, the weight function is normalized:
w(t) = φ_s(f(p(t))) / ∫_0^{+∞} φ_s(f(p(u))) du.
Let w(t) = T(t)·ρ(t), with T(t) = exp( −∫_0^t ρ(u) du ); solving for T(t) and ρ(t) is how NeuS completes the combination of NeRF and surface reconstruction.
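A sketch of the normalized S-density weighting along one ray, assuming the logistic density φ_s(u) = s·e^(−su)/(1+e^(−su))^2 applied to the SDF values; the scale s, the toy SDF profile and the colors are illustrative, and occlusion-aware weighting via T(t) is omitted.

```python
import numpy as np

def s_density(u, s=64.0):
    e = np.exp(-s * u)
    return s * e / (1.0 + e) ** 2   # logistic density, peaked at u = f(p(t)) = 0

def render_ray(sdf_vals, colors, s=64.0):
    """sdf_vals: (n,) SDF at the samples; colors: (n, 3). Returns the ray color."""
    w = s_density(sdf_vals, s)
    w = w / (w.sum() + 1e-8)        # normalized weights, concentrated near the surface
    return (w[:, None] * colors).sum(axis=0)

t = np.linspace(0.0, 2.0, 128)
sdf_vals = 1.0 - t                  # toy ray: the SDF crosses zero at t = 1
colors = np.tile(np.array([1.0, 0.5, 0.2]), (128, 1))
print(render_ray(sdf_vals, colors))  # ~ [1.0, 0.5, 0.2]
```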
The NeuS scene representation is a pair of multi-layer perceptrons (MLPs): the first MLP receives sparse 3D points and camera position information x and outputs the S-density together with a feature vector, which is sent with the 2D viewing direction d to the second MLP, which outputs color. This architecture ensures that different colors are output for different viewing directions, so color constrains geometry, but the underlying shape representation is only a function of position. Therefore, only the feature codes corresponding to sparse 3D points are considered, and the interval length between sampling points (the sampling error) is ignored.
In the volume rendering process, it is assumed that the color weight distribution used to calculate the color integral reaches a maximum at the location where the SDF is 0, i.e. the surface location, as the ray approaches the surface, which is typically the case for datasets with simple geometry, such as DTU datasets. But facing some low visibility datasets with long distance sparse views and complex geometries, the location of greatest color weight tends to deviate from the location of SDF 0. Thus, color weight deviations must impair the geometric constraint ability.
Let C_S be the color at the intersection of the ray and the object surface, representing the geometric surface, and C_V be the view color synthesized after volume rendering. For neural rendering, the SDF is usually inferred by one MLP network and the color field by another MLP network, which can be expressed mathematically as:
sdf(p) = F_Θ(p)
Meanwhile, the view color synthesized after rendering is written in discrete form as:
C_V = Σ_{i=1}^{n} w(t_i)·c(t_i)
Assume that the first intersection point of the ray and the curved surface is p(t_*), with sdf(t_*) = 0; the surface color in rendering can then be expressed as:
C_S = c(t_*)
For synthesizing new views, the goal is to make the color of the synthesized view coincide with the target color; the deviation between the two is
ΔC = C_S − C_V = ( c(t_*) − c(t_j) ) + ( c(t_j) − C_V ) = ε_sample + ε_weight
where p(t_j) is the sampling point nearest to p(t_*), ε_sample denotes the deviation caused by the sampling operation, and ε_weight denotes the deviation caused by the volume rendering weighting.
For most neural rendering pipelines, the geometric information of an object is only constrained by the loss of a single view color in each iteration, and consistency of different views in the geometric optimization direction cannot be ensured, so that geometric blurring is caused. When the input view becomes complex and sparse, this ambiguity of geometric information can lead to inaccurate geometry, which is certainly a challenging problem for image sparse remote sensing scenes.
Given a set of multi-view images with known camera poses, the goal is to reconstruct a surface that combines the advantages of neural rendering and volume rendering without mask supervision. The three-dimensional spatial field of the object is represented by signed distances and the corresponding surface is extracted with the zero level set of the SDF; the objective is to optimize the signed distance function during rendering. A fused rendering mode is proposed to eliminate the color weight deviation and the geometric deviation: it combines the unbiased characteristic of surface rendering with the long-range spatial receptive field of volume rendering, introduces shallow image appearance embedding to constrain the ambiguity of the three-dimensional geometry, and finally introduces explicit SDF optimization to achieve geometric consistency. The method is shown in FIG. 1.
Considering that the data may be captured in different environments, shallow appearance features are extracted from each image to forward-optimize the color field function against the bias of single-view feature extraction, as shown in FIG. 2.
In the model training process, shallow features would normally vanish gradually as convolutions are stacked; ResNet-50 counteracts this by continually adding the earlier shallow features back during training, enriching the global features. Therefore, the useful features extracted from the image are fed into ResNet-50, output as a feature tensor of [1×1×256], and this tensor is passed into the second MLP network to complete the appearance embedding.
As shown in table 1, an image network convolution result diagram is:
TABLE 1
NeuS was taken as the baseline and the appearance embedding was added to it. The two methods were evaluated on the BlendedMVS dataset, assessing their mesh reconstruction quality and view synthesis performance respectively.
Performance of NeuS and of the appearance-embedding-only variant on the BlendedMVS dataset: compared with NeuS, merely adding the appearance embedding markedly reduces surface noise and improves reconstruction accuracy.
Table 2 is a display of the quantitative results of the reconstructed surface.
TABLE 2
Table 3 is a quantitative index of the rendering results.
TABLE 3
Method                 PSNR↑   SSIM↑   LPIPS↓
Baseline               28.24   0.909   0.077
Appearance embedding   33.99   0.982   0.039
To eliminate the deviation ε_sample caused by the sampling operation, two adjacent sampling points near the curved surface are first found along the rays emitted from the camera, whose SDF values satisfy:
f(p(t_i)) · f(p(t_{i+1})) < 0
The first intersection point of the ray and the curved surface is approximated by linear interpolation as
p̂(t_*) = o + [ ( f(p(t_j))·t_{j+1} − f(p(t_{j+1}))·t_j ) / ( f(p(t_j)) − f(p(t_{j+1})) ) ] · v
Then, the point p̂(t_*) obtained by linear interpolation is added to the initial point set p(t_i), giving a new point set P = p̂(t_*) ∪ p(t_i), and volume rendering is performed again with the new point set P to generate the final color:
Ĉ_final = Σ_{i=1}^{n} w(t_i)·c(t_i) + ŵ(t_*)·ĉ(t_*)
The color deviation becomes:
ΔC_final = c(t_*) − Ĉ_final = ε_interp + ε_weight-final
After interpolation an interpolation error ε_interp remains, but ε_interp, the deviation caused by the linear interpolation, is at least two orders of magnitude smaller than the original sampling error ε_sample.
Meanwhile, the weight error is reduced accordingly by regularizing the weight distribution: abnormal weight assignments, such as points far from the surface that nevertheless receive large weights, are removed, which indirectly pushes the weight distribution to converge toward the surface. Theoretically, when the weights tend to a distribution centered on the interpolated surface point p̂(t_*), the weight error ε_weight-final tends to 0 and the final volume rendering produces the same result as surface rendering.
Constraints of photometric consistency loss and of points are introduced to explicitly supervise the three-dimensional characterization of the SDF.
A small surface patch S is taken, whose projection onto the source view is the pixel patch q; the image blocks corresponding to the surface patch S in different source views should be geometrically consistent except for occluded parts. S is represented in the camera coordinate system of the reference image I_r as:
n^T x = d
At this time, a point x in the reference image can be related, through the homography matrix H_i, to the corresponding points in the other images:
H_i = K_i ( R_i R_r^T + (t_i − R_i R_r^T t_r)·n^T / d ) K_r^{−1}
where K_r and K_i are the camera intrinsic calibration matrices, R_r and R_i are rotation matrices, and t_i and t_r are translation vectors, corresponding to the source view I_i and the reference view I_r respectively.
To measure the photometric consistency of different views, the normalized cross-correlation of the reference image and a source view is used:
NCC( I_r(x), I_i(H_i x) ) = Cov( I_r(x), I_i(H_i x) ) / sqrt( Var( I_r(x) ) · Var( I_i(H_i x) ) )
where Cov denotes covariance and Var denotes variance; the rendered image is taken as the reference image, and NCC scores between the sampled patches and the corresponding patches on all source images are calculated. To handle occlusion, the best four of the calculated NCC scores are found for each sampled patch and used to calculate the photometric consistency loss of the corresponding views:
L_pc = (1/4) · Σ_{i=1}^{4} ( 1 − NCC_i )
in the previous data processing, images with known camera pose are required, the position information of the images is estimated by a motion recovery Structure (SFM), the SDF also reconstructs sparse 3D points, although the points inevitably have noise, but have certain accuracy, and the noise can be corrected to a certain extent due to the end-to-end architecture, so that the sparse 3D points P are utilized 0 Directly supervising f (p).
A scene is usually dominated by unoccupied space, while more computational resources and view dependencies are needed when reconstructing the target scene; based on this fact, a coarse 3D region of interest should be found efficiently before fine reconstruction takes place. This significantly reduces the number of query points per ray in the later fine reconstruction stage.
When the input data are processed, Structure from Motion (SFM) is used to obtain the camera poses and a sparse 3D point cloud. DVGO automatically selects the point cloud of interest according to the closest and farthest scene points intersected by the rays emitted from each camera, as shown in FIG. 4 and FIG. 5, where FIG. 4 is the DVGO geometric coarse sampling strategy and FIG. 5 is the Res-NeuS geometric coarse sampling strategy. Because the 3D point cloud area selected by DVGO is too large and therefore limited, the localization of fine scene structures is not accurate enough. A new automatic point cloud filtering method is therefore defined: the center of the point cloud is found from the camera pose information, the average distance from the center to the camera positions is calculated and taken as the radius, and the point cloud region surrounding the center through 360 degrees within this radius is selected as the region of interest; the radius r of the surrounding region is defined according to the shooting mode of the cameras (surrounding the scene, or long-distance panoramic coverage), as shown in FIG. 5.
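A sketch of the described coarse point-cloud filtering, assuming the sphere center is taken as the mean of the sparse point cloud and the radius as the mean camera-to-center distance; the function and variable names are illustrative.

```python
import numpy as np

def filter_point_cloud(points, camera_positions):
    """points: (N, 3) SFM sparse point cloud; camera_positions: (M, 3) camera centers."""
    center = points.mean(axis=0)                                  # point-cloud center
    r = np.linalg.norm(camera_positions - center, axis=1).mean()  # mean camera-to-center distance
    keep = np.linalg.norm(points - center, axis=1) <= r           # keep points inside the sphere
    return points[keep], center, r

pts = np.random.randn(1000, 3)
cams = np.random.randn(20, 3) * 3.0
roi_pts, center, r = filter_point_cloud(pts, cams)
print(len(roi_pts), r)
```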
The total loss is defined as the weighted sum of the losses:
L = L_color + λ·L_weight + α·L_pc + β·L_points
in the present invention, this method is evaluated by reconstructing the surface from 16 scenes of the blendedMVS dataset with remote sensing oriented scenes and other scenes with different classes of objects having image resolutions of 768 x 576 and view numbers varying from 56 to 333. The reconstructed surface was evaluated on the BlendedMVS dataset using the chamfer distance of the 3D space, showing the visual effect of the reconstructed surface for the DTU dataset.
The accuracy of the three-dimensional reconstructed surface is evaluated by distance measurement; the chamfer distance in 3D space, mainly used in reconstruction work, is defined as:
d_CD(S_1, S_2) = (1/|S_1|) · Σ_{x∈S_1} min_{y∈S_2} ||x − y||_2 + (1/|S_2|) · Σ_{y∈S_2} min_{x∈S_1} ||y − x||_2
where S_1 is the set of ground-truth sampling points and S_2 is the set of reconstructed surface sampling points. The chamfer distance from S_1 to S_2 is used as the evaluation index of reconstruction accuracy (Acc); correspondingly, the chamfer distance from S_2 to S_1 is taken as the evaluation index of reconstruction completeness (Comp). The overall score is defined as the average of accuracy and completeness. The smaller the distance, the better the reconstruction.
In addition, view synthesis performance is evaluated, similarly to NeRF, using image quality metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).
The method provided by the invention is compared with the state-of-the-art learning-based method NeuS and the traditional reconstruction method COLMAP, and the model reconstruction effects and evaluation indexes are compared and analyzed.
Assuming that the region to be reconstructed lies within a sphere, 2048 rays are sampled per batch and the model is trained on a single NVIDIA GeForce RTX 4090 GPU; training a model for 50,000 iterations takes approximately 4 hours and stays within memory limits. After network training, a mesh is extracted from the SDF within a predefined bounding box by the Marching Cubes algorithm, with a grid resolution of 512.
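A sketch of the mesh extraction step, assuming scikit-image's Marching Cubes and an analytic sphere SDF in place of the trained network; the bounding box and the reduced grid resolution (64 instead of 512) are illustrative.

```python
import numpy as np
from skimage import measure

res = 64
xs = np.linspace(-1.2, 1.2, res)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 1.0  # query the SDF on a voxel grid

# Extract the zero isosurface (level=0.0) as a triangle mesh.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0,
                                                  spacing=(xs[1] - xs[0],) * 3)
print(verts.shape, faces.shape)
```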
First, several reconstruction methods are used to test part of the indoor scene data in the DTU dataset and part of the outdoor small-scene data in the BlendedMVS dataset, and the reconstruction results are compared; as shown in FIG. 6, the test results show that the method proposed in the application is clearly superior to the baselines.
This shows that the proposed method reconstructs small scenes well. Next, the method is applied to some large scenes with low visibility and remote sensing scenes with sparse feature views in the BlendedMVS dataset, and the reconstruction results are compared and analyzed. As can be seen in FIG. 7 and Table 4, the COLMAP reconstructed surface is noisy, while NeuS, using only color constraints, produces severe deformation, distortion and holes in the geometric surface structure. In contrast, the proposed method can reconstruct precise geometric structures and remove surface noise smoothly; for example, it can reconstruct the geometric structure of scene 7 (low visibility) and recover the abrupt depth changes of scene 8.
TABLE 4
Scene      NeuS      COLMAP    The application
Scene 3    0.125     0.119     0.117
Scene 4    0.00128   0.0425    0.000810
Scene 5    0.297     1.75      0.0877
Scene 6    1.18      0.165     0.581
Scene 7    3.475     0.385     0.469
Scene 8    0.285     0.242     0.237
Scene 9    2.11      1.89      1.84
Scene 10   0.425     0.87      0.421
Scene 11   1.32      0.692     0.510
Scene 12   0.990     0.921     0.102
Scene 13   0.121     0.234     0.061
Scene 14   0.540     0.370     0.319
Scene 15   1.306     0.454     0.311
Scene 16   0.213     0.232     0.120
Average    0.884     0.597     0.369
The dome church data in the BlendedMVS dataset is used for the ablation experiments, with NeuS taken as the baseline and the modules added in turn. As shown in FIG. 8, the baseline geometry is distorted, the surface generates a lot of noise, and the reconstructed area is incomplete; model A reconstructs the complete area, but the surface noise is still very large. Model B not only reconstructs the complete region but also significantly optimizes the geometry. Model C perfects the geometry, with an error similar to model B. In contrast, the full model shows excellent results in reconstructing precise geometry and reducing surface noise.
Table 5 is the quantitative results of the ablation model.
TABLE 5
Method Appearance constraint Weight constraint Geometric constraints Integrity degree
Baseline 0.2134
model-A 0.1725
model-B 0.1289
model-C 0.1275
Full model 0.1203
In conclusion, the appearance embedding is more prone to capturing scene details, geometric reconstruction quality is improved to a certain extent by geometric constraint, and model accuracy is effectively improved by weight constraint.
In summary, the present application proposes Res-NeuS, a neural-surface photogrammetric reconstruction method. Res-NeuS achieves accurate surface reconstruction of large-scale scenes, unlocking ResNet-50's ability to extract deep image information for neural surface reconstruction modeled as an SDF. The application shows that Res-NeuS can reconstruct complex scene structures both in object-centered captures and in wide remote sensing scenes, with excellent fidelity. This capability enables detailed large-scale reconstruction from multi-view images. To mitigate randomness and ensure sufficiently detailed sampling, the application uses long training iterations.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.

Claims (8)

1. The multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning is characterized by comprising the following steps of:
s1, acquiring images of a plurality of view angles to be three-dimensionally reconstructed by adopting a camera;
s2, performing image feature extraction, feature matching and sparse point cloud reconstruction on the image acquired in the S1 by adopting a motion restoration structure SFM, generating sparse point cloud with three-dimensional coordinate information, and generating pose information of a camera;
s3, assuming that light rays emitted from a camera are incident to the scene surface, acquiring three-dimensional coordinate information of a sampling point set of the scene surface;
S4, inputting the three-dimensional coordinates of the sparse point cloud obtained in the S2, the pose information of the camera and the three-dimensional coordinate information of the sampling point set obtained in the S3 into an SDF function, and generating point cloud information represented by the SDF function;
s5, extracting appearance information of the image acquired in the S1 by adopting a ResNet-50 convolution block;
s6, inputting the appearance information acquired in the S5, the three-dimensional coordinates of the sparse point cloud acquired in the S2 and the three-dimensional coordinate information of the sampling point set acquired in the S3 into a Color function, and generating a Color value of the point cloud;
S7, setting constraint conditions according to the SDF function point cloud information acquired in S4 and the point cloud color values acquired in S6, and calculating a color loss function;
calculating a weight constraint function according to the weight constraint condition by using the SDF function point cloud information acquired in the step S4;
setting constraint conditions according to the point cloud color information obtained in the step S6, and calculating a photometric consistency loss function;
calculating the SDF loss function of the point according to the constraint condition of the set surface point SDF=0 by using the SDF function point cloud information acquired in the step S4;
s8, reversely optimizing the SDF network by adopting the loss function obtained in the S7, converting the SDF into mesh and texture, and directly calculating the SDF value of each sampling voxel;
S9, extracting the zero isosurface from the SDF values obtained in S8 by using the Marching Cubes algorithm, directly outputting the 3D surface, and completing the three-dimensional reconstruction.
2. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to claim 1, wherein the specific method for extracting the appearance information of the image acquired in S1 by using the ResNet-50 convolution block in S5 comprises:
extracting shallow features of the appearance in the image: the image is input into a ResNet-50 convolution block, which outputs a 1×1×256 feature tensor, completing the appearance embedding;
the ResNet-50 convolution block introduces an inverted residual structure into the residual structure as a bottleneck layer, so as to reduce the number of parameters;
the bottleneck layer first passes through a 1×1 convolution kernel, then a 3×3 convolution kernel, and finally a 1×1 convolution kernel;
the 256-dimensional input passes through a 1×1×64 convolution layer, then a 3×3×64 convolution layer, and finally a 1×1×256 convolution layer;
each convolution layer is followed by a ReLU activation function.
3. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to claim 1, wherein the constraints of three-dimensional reconstruction include color constraints, weight constraints, point constraints, and photometric consistency constraints.
4. The multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning according to claim 3, wherein the specific method for calculating the weight constraint function according to the weight constraint condition by using the SDF function point cloud information acquired in S4 in S7 comprises the following steps:
Finding two adjacent sampling points near the curved surface along the rays emitted from the camera, the SDF values of the two sampling points satisfy:
f(p(t i ))·f(p(t i+1 ))<0
wherein p (t) i ) Representing the first sample point t of two adjacent sample points i Is set of initial points, f (p (t) i ) P (t) i ) SDF value, p (t) i+1 ) Representing the other sampling point t of the two adjacent sampling points i+1 Is set of initial points, f (p (t) i+1 ) P (t) i+1 ) SDF value of (a);
the first intersection of the ray with the surface is approximated by linear interpolation as:
wherein,representing a set of points obtained by linear interpolation, t j+1 Represents p (t) j+1 ) The value of the point-time variable t, f (p (t j+1 ) P (t) j+1 ) SDF value, t of point j Represents p (t) j ) At the time of pointThe variable t takes on the value, f (p (t j ) P (t) j ) SDF value of the point;
v represents a unit direction vector of the ray;
will beAdded to p (t) i ) In (3) obtaining a new point set P:
representing a point set obtained after interpolation;
re-rendering the volume by using the newly generated point set P to generate the final color:
wherein w (t) i ) Weights representing the initial set of points, c (t i ) The pixel values representing the initial set of points,weights representing the set of points obtained after linear interpolation, +.>Representing pixel values of a point set obtained after linear interpolation, wherein n represents the number of points;
the color deviation $\Delta C_{final}$ is:

$$\Delta C_{final} = C_{final} - c(t^{*}) = \varepsilon_{interp} + \varepsilon_{weight\text{-}final}$$

wherein $c(t^{*})$ denotes the pixel value at the true intersection of the ray with the surface, $\varepsilon_{interp}$ denotes the deviation caused by the linear interpolation, and $\varepsilon_{weight\text{-}final}$ denotes the weight deviation;
the weight constraint loss $\mathcal{L}_{weight}$ is then constructed from the color deviation $\Delta C_{final}$ accumulated over the sampled rays.
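The surface-locating step of this claim, finding the first SDF sign change along each ray and approximating the intersection by linear interpolation, might look as follows in Python. The function name, tensor layout and the small epsilon added for numerical stability are assumptions; the subsequent re-rendering with the augmented point set and the resulting weight constraint loss are not shown.

```python
import torch

def interpolate_surface_points(t, sdf, origins, dirs):
    """Locate the first zero crossing of the SDF along each ray by linear
    interpolation between the adjacent samples whose SDF values change sign,
    i.e. f(p(t_i)) * f(p(t_{i+1})) < 0."""
    # t, sdf: (n_rays, n_samples) ray parameters and SDF values of the initial points
    # origins, dirs: (n_rays, 3) ray origins and unit direction vectors v
    sign_change = (sdf[:, :-1] * sdf[:, 1:]) < 0                # (n_rays, n_samples - 1)
    n = sign_change.shape[1]
    idx_grid = torch.arange(n, device=t.device).expand_as(sign_change)
    # Index of the first sign change on each ray (clamped if there is none).
    first = torch.where(sign_change, idx_grid, torch.full_like(idx_grid, n)).min(dim=1).values
    first = first.clamp(max=n - 1)
    rows = torch.arange(t.shape[0], device=t.device)
    f0, f1 = sdf[rows, first], sdf[rows, first + 1]
    t0, t1 = t[rows, first], t[rows, first + 1]
    # Zero crossing of the linear interpolant between (t0, f0) and (t1, f1).
    t_hat = t0 + (t1 - t0) * f0 / (f0 - f1 + 1e-8)
    p_hat = origins + t_hat.unsqueeze(-1) * dirs                # interpolated surface points
    has_hit = sign_change.any(dim=1)                            # rays that actually cross the surface
    return p_hat, t_hat, has_hit
```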
5. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to claim 4, wherein the specific method in S7 for calculating the photometric consistency loss function by setting constraint conditions on the point cloud color information acquired in S6 comprises:
taking a small surface patch S of the curved surface, whose projection onto a source view covers the pixels q; except for occluded regions, the image patches corresponding to S in different source views are geometrically consistent;
representing S in the camera coordinate system of the reference image $I_r$ by its tangent plane:

$$n^{T} p + d = 0$$

wherein $d$ denotes the offset of the tangent plane along the normal direction, and $n^{T}$ denotes the transpose of the surface normal obtained by automatic differentiation of the SDF network at the surface point;
a certain pixel $x$ in the reference image is related, through the homography matrix $H_i$, to the corresponding pixel $x_i$ in the other image,

obtaining the relation:

$$x_i \simeq H_i\, x, \qquad H_i = K_i \left( R_i R_r^{T} - \frac{\left(t_i - R_i R_r^{T} t_r\right) n^{T}}{d} \right) K_r^{-1}$$

wherein $K_r$ and $K_i$ denote the camera intrinsic calibration matrices, $R_r$ and $R_i$ denote the rotation matrices, and $t_r$ and $t_i$ denote the translation vectors of the reference view $I_r$ and the source view $I_i$ respectively; $H_i x$ denotes the corresponding pixel in the source image, and $H_i$ denotes the homography matrix mapping a pixel of the reference image to source view $I_i$;
taking the rendered image as the reference image, the normalized cross-correlation between the reference image and a source view is computed:

$$NCC(X, H_i X) = \frac{Cov\!\left(I_r(X),\, I_i(H_i X)\right)}{\sqrt{Var\!\left(I_r(X)\right)\, Var\!\left(I_i(H_i X)\right)}}$$

wherein $Cov(\cdot)$ denotes covariance, $Var(\cdot)$ denotes variance, $NCC(X, H_i X)$ denotes the NCC score between the sampled patch and the corresponding patch on the source images, and $X$ denotes the sampled patch pixel values;
the best four NCC scores are selected; NCC scores range from -1 to 1, where -1 indicates complete negative correlation and 1 indicates complete positive correlation, and a larger NCC score is better;
calculating the photometric consistency loss of the corresponding views:

$$\mathcal{L}_{photo} = \frac{1}{4}\sum_{i=1}^{4}\left(1 - NCC(X, H_i X)\right)$$

wherein $i$ indexes the four best NCC scores.
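A compact Python sketch of this photometric consistency measure follows. It assumes the reference patch and the homography-warped source patches have already been sampled into tensors; the top-4 selection mirrors the "best four NCC scores" above, while the specific (1 − NCC) form of the loss is an assumption consistent with "larger is better", not a formula quoted from the patent.

```python
import torch

def ncc_score(ref_patch, src_patch, eps=1e-5):
    """Normalized cross-correlation between a reference patch X and the
    corresponding patch H_i X sampled from a source view:
        NCC = Cov(X, H_i X) / sqrt(Var(X) * Var(H_i X))   in [-1, 1]."""
    # ref_patch, src_patch: (n_patches, patch_size) pixel values
    ref_c = ref_patch - ref_patch.mean(dim=-1, keepdim=True)
    src_c = src_patch - src_patch.mean(dim=-1, keepdim=True)
    cov = (ref_c * src_c).mean(dim=-1)
    var = ref_patch.var(dim=-1, unbiased=False) * src_patch.var(dim=-1, unbiased=False)
    return cov / torch.sqrt(var + eps)

def photometric_consistency_loss(ref_patch, src_patches, top_k=4):
    """Keep the best top_k NCC scores over all source views for each patch and
    penalise their deviation from perfect correlation (NCC = 1)."""
    # src_patches: (n_views, n_patches, patch_size)
    scores = torch.stack([ncc_score(ref_patch, s) for s in src_patches])   # (n_views, n_patches)
    best, _ = scores.topk(top_k, dim=0)                                    # best scores per patch
    return (1.0 - best).mean()
```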
6. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to claim 5, wherein the specific method in S7 for calculating the SDF loss function of the points from the SDF-function point cloud information acquired in S4, according to the constraint condition that surface points satisfy SDF = 0, comprises:
point constraint loss:

$$\mathcal{L}_{points} = \frac{1}{N}\sum_{P \in P_0} \left| f(P) \right|$$

wherein $P_0$ denotes the sparse 3D point set reconstructed by SFM, $f(P)$ denotes the SDF value of a point in $P_0$, $P$ denotes the spatial position of a point, and $N$ denotes the number of points.
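Read this way, the point constraint reduces to penalising the mean absolute SDF value at the sparse SFM points; a small Python sketch under that reading follows (the name sdf_network and the tensor shape of sfm_points are assumptions).

```python
import torch

def point_constraint_loss(sdf_network, sfm_points):
    """Points P in the sparse SFM reconstruction P0 lie on the surface, so
    they should satisfy f(P) = 0; penalise the mean absolute SDF value."""
    # sfm_points: (N, 3) spatial positions of the N sparse points
    sdf = sdf_network(sfm_points)      # (N, 1) signed distances f(P)
    return sdf.abs().mean()
```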
7. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to claim 6, wherein the three-dimensional reconstruction in S7 comprises calculating a total loss function, the specific method comprising:
the total loss function $\mathcal{L}$ is a weighted sum of the color constraint loss $\mathcal{L}_{color}$, the weight constraint loss $\mathcal{L}_{weight}$, the point constraint loss $\mathcal{L}_{points}$ and the photometric consistency constraint loss $\mathcal{L}_{photo}$:

$$\mathcal{L} = \mathcal{L}_{color} + \lambda\, \mathcal{L}_{weight} + \alpha\, \mathcal{L}_{photo} + \beta\, \mathcal{L}_{points}$$

wherein $\lambda$ denotes the weight of the weight constraint loss function, $\alpha$ denotes the weight of the photometric consistency constraint loss function, and $\beta$ denotes the weight of the point constraint loss function.
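Combining the four terms, the total loss of claim 7 is a straightforward weighted sum; in the sketch below the default values of lam, alpha and beta are illustrative hyper-parameters, not values taken from the patent.

```python
def total_loss(l_color, l_weight, l_photo, l_points, lam=0.1, alpha=0.5, beta=1.0):
    """Weighted sum of the color, weight, photometric consistency and point
    constraint losses, with lambda, alpha and beta as their respective weights."""
    return l_color + lam * l_weight + alpha * l_photo + beta * l_points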
8. The multi-view three-dimensional reconstruction method based on depth residual and neural implicit surface learning according to any one of claims 1 to 7, wherein the SDF function is:
the curved surface S of the scene is expressed as:
$$S = \{\, p \in \mathbb{R}^{3} \mid f(p) = 0 \,\}$$

wherein $f(p)$ denotes the signed distance function, which takes the value $f(p) = 0$ on the surface of the observed object, $p$ denotes the spatial position of a point, and $\mathbb{R}^{3}$ denotes three-dimensional space;
this function is characterized by a neural network, referred to as the SDF network in NeuS, and in NeuS the SDF network is coupled with the NeRF volume-rendering formulation;
so that the surface-characterizing SDF network can be optimized with the NeRF rendering loss.
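For reference, a minimal SDF network in PyTorch that realises the zero-level-set representation of claim 8 might look as follows; the positional encoding, geometric initialization and skip connections used in the actual NeuS SDF network are omitted, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """MLP representing the signed distance function f: R^3 -> R whose zero
    level set S = {p | f(p) = 0} is the reconstructed surface; surface normals
    are obtained by automatic differentiation of f."""

    def __init__(self, hidden: int = 256, n_layers: int = 8):
        super().__init__()
        dims = [3] + [hidden] * n_layers + [1]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.Softplus(beta=100))   # smooth activation, as used in NeuS
        self.mlp = nn.Sequential(*layers)

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        return self.mlp(p)                              # SDF value f(p)

    def gradient(self, p: torch.Tensor) -> torch.Tensor:
        """Normal direction n = grad f(p) via autograd, as used by the
        photometric consistency constraint."""
        p = p.requires_grad_(True)
        f = self.forward(p)
        (grad,) = torch.autograd.grad(f.sum(), p, create_graph=True)
        return grad
```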
CN202311857090.1A 2023-12-29 2023-12-29 Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning Pending CN117830520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311857090.1A CN117830520A (en) 2023-12-29 2023-12-29 Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311857090.1A CN117830520A (en) 2023-12-29 2023-12-29 Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning

Publications (1)

Publication Number Publication Date
CN117830520A true CN117830520A (en) 2024-04-05

Family

ID=90518627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311857090.1A Pending CN117830520A (en) 2023-12-29 2023-12-29 Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning

Country Status (1)

Country Link
CN (1) CN117830520A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212372A (en) * 2024-05-21 2024-06-18 成都信息工程大学 Mapping method for fusing implicit surface characterization and volume rendering of nerve


Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN111063021B (en) Method and device for establishing three-dimensional reconstruction model of space moving target
CN104346608B (en) Sparse depth figure denseization method and apparatus
CN110728671B (en) Dense reconstruction method of texture-free scene based on vision
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
CN108961410B (en) Three-dimensional wire frame modeling method and device based on image
CN107767339B (en) Binocular stereo image splicing method
CN110910437B (en) Depth prediction method for complex indoor scene
JP7119023B2 (en) Systems and methods for performing 3D imaging of objects
Yin et al. Towards accurate reconstruction of 3d scene shape from a single monocular image
CN117830520A (en) Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN113962858A (en) Multi-view depth acquisition method
CN114782628A (en) Indoor real-time three-dimensional reconstruction method based on depth camera
CN116958437A (en) Multi-view reconstruction method and system integrating attention mechanism
Choi et al. Balanced spherical grid for egocentric view synthesis
EP4292059A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
CN117115359B (en) Multi-view power grid three-dimensional space data reconstruction method based on depth map fusion
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN117274493A (en) Neural implicit surface reconstruction method and device integrating depth estimation
CN116310228A (en) Surface reconstruction and new view synthesis method for remote sensing scene
Chang et al. Depth estimation from indoor panoramas with neural scene representation
Yu et al. ImmersiveNeRF: Hybrid Radiance Fields for Unbounded Immersive Light Field Reconstruction
Wang et al. Contribution of 3D reconstruction techniques to the animated restoration of historical scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination