CN116824042A - Monocular three-dimensional scene reconstruction system suitable for large complex scene
- Publication number: CN116824042A
- Application number: CN202310569207.XA
- Authority: CN (China)
- Filing date: 2023-05-19
- Publication date: 2023-09-29
- Legal status: Pending
Abstract
The invention relates to a monocular three-dimensional scene reconstruction system suitable for large-scale complex scenes, and belongs to the technical field of systems that meet users' differing reconstruction requirements in such scenes. The system comprises three key technologies: block sparse data management, coarse-to-fine dynamic-range reconstruction, and grid generation from multi-resolution TSDFs. In terms of workflow, the user only needs to shoot a video with an ordinary mobile phone; existing open-source software recovers the relevant camera parameters, and a GPU-accelerated host then completes the three-dimensional reconstruction of the large complex scene, so that a wide scene can be reconstructed completely while the geometry of objects in the scene is adaptively refined according to the user's shooting behavior. Characteristically, the system can be used without any specialized expertise on the user's part.
Description
Technical Field
The invention relates to a monocular three-dimensional scene reconstruction system suitable for large-scale complex scenes, and belongs to the technical field of systems that meet users' differing reconstruction requirements in such scenes.
Background
Three-dimensional reconstruction (3D Reconstruction) is one of the important tasks in three-dimensional computer vision, and many real-world applications, such as augmented reality and autonomous navigation, rely on reconstructing real scenes. To approach the most faithful reconstruction, specialized ranging devices such as depth cameras and lidar are required. The depth map measured by such a device, combined with camera intrinsics and pose data, drives the three-dimensional reconstruction, whose result can be a point cloud (Point Cloud), a three-dimensional mesh (Mesh), and so on. One common procedure is to divide three-dimensional space regularly into many small cubes, each called a voxel (Voxel), and to compute a Truncated Signed Distance Function (TSDF) per voxel from the depth map; the TSDF represents the signed distance from the voxel to its nearby surface. Once the TSDF is available, the Marching Cubes algorithm finds the equipotential (zero-level) surface, reconstructs the surface, and generates a three-dimensional mesh.
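As a concrete illustration of this classical pipeline, the following is a minimal sketch of a single TSDF update from one depth map, in the KinectFusion style. It assumes known intrinsics `K` and a camera-to-world pose, uses a dense array of voxel centers for simplicity, and all names and the truncation distance are illustrative rather than taken from any particular system.

```python
import numpy as np

def tsdf_update(voxels, tsdf, weight, depth, K, cam_to_world, trunc=0.12):
    """One KinectFusion-style TSDF update from a single depth map.

    voxels: (N, 3) voxel centers in world coordinates.
    tsdf, weight: (N,) running TSDF values and fusion weights.
    depth: (H, W) depth map in meters; K: 3x3 intrinsics.
    """
    world_to_cam = np.linalg.inv(cam_to_world)
    # Transform voxel centers into the camera frame and project.
    pts = (world_to_cam[:3, :3] @ voxels.T + world_to_cam[:3, 3:4]).T
    z = pts[:, 2]
    uv = (K @ pts.T).T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    H, W = depth.shape
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(ok, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0)
    ok &= d > 0
    # Signed distance along the ray, truncated to [-1, 1].
    sdf = np.clip((d - z) / trunc, -1.0, 1.0)
    upd = ok & (sdf > -1.0)   # ignore voxels far behind the surface
    # Weighted running average of the TSDF per voxel.
    w_new = weight[upd] + 1.0
    tsdf[upd] = (tsdf[upd] * weight[upd] + sdf[upd]) / w_new
    weight[upd] = w_new
    return tsdf, weight
```

In practice a sparse voxel structure would replace the dense arrays here, which is precisely what the block sparse organization described later in this patent provides.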
Most users have no specialized ranging tools; currently the most accessible are a small number of high-end handsets equipped with lidar. To reduce the dependence on professional instruments and let ordinary users reconstruct with ordinary equipment (such as mobile phones), researchers have proposed many three-dimensional reconstruction methods based on monocular cameras. Their basic principle is to predict the depth map corresponding to each image of the captured video and then reconstruct from those depth maps. Because depth prediction has errors and predictions at consecutive times are only weakly correlated, reconstruction from predicted depth maps easily suffers from spatial inconsistency. Recently, another class of methods based on volumetric reconstruction (Volumetric Reconstruction) has been proposed: they skip depth-map prediction and directly predict the TSDF values of voxels in three-dimensional space, achieving better spatial coherence than depth-prediction methods.
Currently, existing volume-based three-dimensional reconstruction systems still have problems in large complex scenes. Take NeuralRecon, a typical volume-based monocular three-dimensional reconstruction algorithm, as an example. Such systems use deep learning to extract and learn visual features and rely on hardware such as a graphics processor (GPU) to accelerate computation. To guarantee spatial consistency of the results, intermediate features of the complete space must be maintained throughout the incremental reconstruction so that subsequent computation can connect to earlier results. The scale of these intermediate features grows as reconstruction proceeds, and in a large scene the hardware accelerator easily runs out of memory. Moreover, each reconstruction step predicts the voxels within a fixed spatial region (roughly a 4m×4m×4m cube) from several consecutive images, with the number and size of voxels also fixed, which cannot satisfy user requirements in a large complex scene.
In particular, the small reconstruction range forces the photographer to walk through the entire space of a wide scene to obtain a complete reconstruction, so incomplete reconstructions occur very easily. If the photographer does not actively traverse the whole space, as in (1) and (2) of FIG. 1, the reconstruction of real scene one is incomplete, with many gaps. The fixed voxel size, in turn, fixes the attainable fidelity of the reconstructed geometry; this shows up especially in the reconstruction of corners, edges, and small objects: as in (3) and (4) of FIG. 1, even when the objects in scene two are shot at close range, there is an upper limit on how well their geometry can be reconstructed.
Users therefore hope for a more capable three-dimensional reconstruction system: one that handles large complex scenes using only ordinary shooting equipment, reconstructs the scene completely at a small cost in time and physical effort, and restores the geometry of objects in the scene well.
Disclosure of Invention
The invention discloses a monocular three-dimensional scene reconstruction system suitable for large complex scenes, which mainly solves the problem from two aspects:
1. Meeting the reconstruction requirements of a large scene under limited GPU memory:
Current deep-learning-based three-dimensional reconstruction systems rely on the GPU for parallel acceleration, and GPU memory is very limited compared with central processing unit (CPU) memory; yet the existing methods maintain the global intermediate features and TSDF results on the GPU, so the occupied memory keeps growing with reconstruction and easily overflows GPU memory. A simple remedy is to store this data in the much larger CPU memory, but the CPU's parallel computing power is far below the GPU's, and the associated computation becomes very slow once the data grows. The system therefore has to solve how to use CPU memory as auxiliary storage to avoid GPU memory overflow while keeping deep learning model inference efficient and not losing too much computing performance.
2. Adapting to complex shooting scenes, so that the complete space can be rebuilt at low labor cost while the geometry of objects in the scene is restored well:
In the reconstruction process of the existing system, each step reconstructs only one space of fixed size, and the number and size of the voxels in that space are fixed, so distant scenery cannot be reconstructed and finer geometric features cannot be recovered on objects. To satisfy both requirements at once, one could enlarge the spatial range of each reconstruction while increasing the number of voxels and reducing their size, but the multiplied voxel count would make both the memory and the computational overhead unaffordable. How to adapt reconstruction to the differences between scenes at reasonable memory and computation cost is the other problem the system faces.
The specific technical scheme is as follows:
a monocular three-dimensional scene reconstruction system suitable for large complex scenes, comprising:
inputting a video composed of a picture sequence, together with the camera pose and camera intrinsics corresponding to the pictures;
selecting key frames for reconstruction from the video through a key frame selection algorithm, wherein a plurality of consecutive key frames form a fragment, and a plurality of fragments form a fragment sequence;
inputting the fragment sequence into a deep learning model fragment by fragment, and sequentially reconstructing the global TSDF of each fragment;
and generating grids from the global TSDF of each fragment to obtain the final reconstruction result.
Further, inputting the fragment sequence into the deep learning model fragment by fragment and sequentially reconstructing the global TSDF of each fragment includes:
inputting the images in each fragment into a feature extraction module of the deep learning model to extract image features;
back-projecting the two-dimensional image features onto the corresponding voxels in three-dimensional space, the three-dimensional space being divided into several resolution layers, where higher resolution means smaller voxels and the resolutions from high to low are 1 cm, 2 cm, 4 cm, 8 cm and 16 cm;
and obtaining the local TSDF values of the voxels of each resolution layer with a fusion algorithm.
Further, the method further comprises: the predicted local TSDF values of the lowest resolution layer are screened and combined by upsampling with the three-dimensional features of the voxels of the next higher resolution layer, and so on through the remaining resolution layers.
Further, obtaining the local TSDF values of the voxels of each resolution layer with a fusion algorithm includes:
each resolution layer maintains local intermediate features and global intermediate features; a gated recurrent unit (GRU) of a recurrent neural network (RNN) fuses the local and global intermediate features of each resolution layer, and the result updates the global intermediate features;
a multi-layer perceptron (MLP) then predicts from the three-dimensional features the local TSDF values of the voxels of each resolution layer, which are stored in the global TSDF space of that layer.
Further, the fusion process includes:
Firstly, three-dimensional space is divided regularly into several sub-blocks, yielding global sub-block position information. Each sub-block has its own position in space; its position and size determine the spatial range it covers, and the original sparse voxel data are stored into the corresponding sub-blocks by position. During reconstruction, once the reconstruction range of a fragment is known, the system can compute which sub-blocks overlap that range in space; when the voxel data within the range are needed, position tests are only required for the voxels inside those sub-blocks, which greatly reduces computation compared with scanning the complete global data. The voxel data inside each sub-block are likewise stored sparsely. The sub-block position information always resides in GPU memory, so the sub-block indices relevant to a reconstruction range can be computed quickly on the GPU. Throughout the process the CPU only stores data and performs no heavy computation, avoiding the overhead of its weak parallel computing power.
Knowing the reconstruction range of the current fragment, calculating with a graphics processor (GPU) the indices of the sub-blocks involved in the reconstruction range;
transmitting the data of the indexed sub-blocks from CPU memory to GPU memory;
judging with the GPU whether the voxel positions in the sub-blocks lie within the reconstruction range;
fusing the voxel data within the reconstruction range with the data of the current fragment;
judging the sub-block corresponding to each voxel and storing the voxels block-wise by sub-block;
finally, transferring the sub-blocks in GPU memory to the CPU for storage.
Further, if a sub-block is newly created during block-wise storage, the global sub-block position information is updated.
Advantageous effects
Compared with the prior art, the invention has the following advantages.
In terms of workflow, the user only needs to shoot a video with an ordinary mobile phone; existing open-source software recovers the relevant camera parameters, and a GPU-accelerated host then completes the three-dimensional reconstruction of the large complex scene, so that a wide scene can be reconstructed completely while the geometry of objects in the scene is adaptively refined according to the user's shooting behavior.
Characteristically, the system can be used without any specialized expertise on the user's part.
The system described in this patent comprises three key technologies:
Block sparse data management. Designed around the working principle of the reconstruction algorithm, this technique manages the global sparse data block-wise, uses CPU memory as auxiliary storage, and bounds the GPU memory overhead, enabling the system to reconstruct large scenes while keeping the speed penalty of CPU-side computation as small as possible.
Coarse-to-fine dynamic-range reconstruction. This technique gives each resolution space a different reconstruction range: the low-resolution space has a large range and large voxels and can reconstruct a wider, more complete space; the high-resolution space has a small range and small voxels and can reconstruct finer geometry.
Grid generation from multi-resolution TSDFs. When TSDFs of several resolutions coexist, this technique preserves the reconstruction advantages of each resolution as much as possible while generating the final mesh scene.
Drawings
FIG. 1 shows reconstruction results of NeuralRecon in complex scenes;
FIG. 2 shows the overall architecture of the system of the invention;
FIG. 3 shows the fusion flow after adding the block sparse data structure;
FIG. 4 shows the partitioning of the reconstruction space and the determination of the highest resolution present;
FIG. 5 shows the sub-block TSDF output strategy;
FIG. 6 shows the effect of the naive fusion scheme;
FIG. 7 shows TSDF values and their characteristics for voxels near a plane and near a corner;
FIG. 8 shows the weight-setting function;
FIG. 9 shows the reconstruction of a large scene and the GPU memory changes;
FIG. 10 shows the performance of the system on the ScanNet test set;
FIG. 11 shows the performance of the system in a real scene (a long-range scene);
FIG. 12 shows the performance of the system in a real scene (a close-range scene);
FIG. 13 shows the effect of the fusion algorithm at a resolution junction;
FIG. 14 shows partially rendered images of the Replica dataset.
Description of the embodiments
The invention will now be described in further detail with reference to the accompanying drawings.
In this embodiment, the existing monocular three-dimensional reconstruction algorithm NeuralRecon is improved into a full system. The overall system architecture is shown in FIG. 2, and the general reconstruction flow is as follows:
1. Input a video composed of a picture sequence, together with the camera pose and camera intrinsics corresponding to each picture.
2. The change between consecutive pictures of a really shot video is very small, and it is unnecessary to feed every picture to the model for prediction, so the pictures are screened to reduce cost. Key frames for reconstruction are selected from the video by a key frame selection algorithm (a sketch of one such rule follows this flow); several consecutive key frames form a fragment, several fragments form a fragment sequence, and each fragment reconstructs one region of the complete space through the deep learning model.
3. Input the images of each fragment into the feature extraction module of the deep learning model to extract image features.
4. Back-project the two-dimensional image features onto the corresponding voxels in three-dimensional space. The space is divided into several resolution layers; the higher the resolution, the smaller the voxels, with resolutions of 1 cm, 2 cm, 4 cm, 8 cm and 16 cm from high to low. Each resolution layer maintains local intermediate features and global intermediate features.
5. Use a gated recurrent unit (GRU) of a recurrent neural network (RNN) to fuse the local and global intermediate features of each resolution layer, updating the global intermediate features with the result;
then use a multi-layer perceptron (MLP) to predict from the three-dimensional features the local TSDF values of the voxels of each resolution layer, storing them in that layer's global TSDF space.
6. The predicted local TSDF values of the lowest resolution layer are screened and combined by upsampling with the three-dimensional features of the voxels of the next higher resolution layer, and so on through the remaining resolution layers.
7. Generate grids from the TSDFs of all resolution layers to obtain the final reconstruction result.
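Returning to step 2: the key frame selection algorithm is not spelled out above, so the following is only a minimal sketch in the NeuralRecon style, gating on the relative pose between frames. The thresholds and the fragment size of 9 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def select_keyframes(poses, t_thresh=0.1, r_thresh=np.deg2rad(15)):
    """Pick a frame as a keyframe when its pose differs enough from
    the last keyframe. poses: list of 4x4 camera-to-world matrices."""
    keys = [0]
    for i, pose in enumerate(poses[1:], start=1):
        last = poses[keys[-1]]
        rel = np.linalg.inv(last) @ pose
        t = np.linalg.norm(rel[:3, 3])                        # translation
        angle = np.arccos(np.clip((np.trace(rel[:3, :3]) - 1) / 2, -1, 1))
        if t > t_thresh or angle > r_thresh:
            keys.append(i)
    return keys

def make_fragments(keys, size=9):
    """Group consecutive keyframes into fixed-length fragments."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]
```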
Block sparse data management:
To limit GPU memory use in large scenes, the parts of the data that keep growing during reconstruction (including the global intermediate features and the global TSDF values of each layer) are moved from GPU memory into CPU memory. Taking the intermediate features as an example: during GRU fusion the system extracts, from the global intermediate features, the local part lying within the current fragment's range and fuses it with the fragment's local intermediate features. The existing reconstruction system arranges the global voxel intermediate features in a sparse data structure, so fetching the local features requires judging the spatial position of every voxel. The fusion of global TSDF values proceeds similarly. On a GPU this computation can be parallelized; on a CPU it would be very slow.
It is observed that each reconstruction step only takes data within a certain spatial range, so it is unnecessary to bring the whole space into the computation; a two-level block sparse structure is therefore adopted to organize the data. Firstly, three-dimensional space is divided regularly into several sub-blocks. Each sub-block has its own position in space; its position and size determine the spatial range it covers, and the original sparse voxel data are stored into the corresponding sub-blocks by position. During reconstruction, once the reconstruction range of a fragment is known, the system can compute which sub-blocks overlap that range; when the voxel data within the range are needed, position tests are required only for the voxels inside those sub-blocks, which greatly reduces computation compared with the complete global data. The voxel data inside each sub-block are likewise stored sparsely. The sub-block position information always resides in GPU memory, so the sub-block indices relevant to a reconstruction range can be computed quickly on the GPU. Throughout the process the CPU only stores data and performs no heavy computation, avoiding the overhead of its weak parallel computing power.
The complete fusion flow combined with CPU-memory auxiliary storage is shown in FIG. 3 (a structural sketch in code follows the list):
1. Knowing the reconstruction range of the current fragment, first compute on the GPU the indices of the sub-blocks involved in that range;
2. Transfer those sub-blocks' data from CPU memory to GPU memory;
3. Use the GPU to judge whether the voxel positions in the sub-blocks lie within the reconstruction range;
4. Fuse all voxel data within the range with the data of the current fragment;
5. After fusion, judge which sub-block each voxel lies in and store the voxels block-wise by sub-block; new sub-blocks may be created in this step;
6. Finally, transfer the sub-blocks in GPU memory to the CPU for storage; if new sub-blocks were created, update the global sub-block position information.
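The following sketch mirrors the six steps above using PyTorch tensors. It is a structural illustration only: `store` stands for the CPU-side sub-block dictionary, `gru_fuse` stands in for the model's GRU fusion, and the block size and all names are assumptions, not the patent's actual implementation.

```python
import torch

BLOCK = 32  # sub-block edge length in voxels (an illustrative choice)

def blocks_in_range(lo, hi):
    """Integer grid coordinates of all sub-blocks overlapping [lo, hi)."""
    b_lo = torch.div(lo, BLOCK, rounding_mode='floor')
    b_hi = torch.div(hi - 1, BLOCK, rounding_mode='floor')
    rng = [torch.arange(int(a), int(b) + 1) for a, b in zip(b_lo, b_hi)]
    return torch.cartesian_prod(*rng)

def fuse_fragment(store, frag_coords, frag_feat, lo, hi, gru_fuse):
    """store: dict sub-block key -> (coords, feats), kept in CPU memory."""
    # Steps 1-2: find overlapping sub-blocks, move their data to the GPU.
    keys = [tuple(k.tolist()) for k in blocks_in_range(lo, hi)]
    parts = [store[k] for k in keys if k in store]
    if parts:
        coords = torch.cat([c for c, _ in parts]).cuda()
        feats = torch.cat([f for _, f in parts]).cuda()
        # Step 3: keep only voxels actually inside the fragment's range.
        m = ((coords >= lo.cuda()) & (coords < hi.cuda())).all(dim=1)
        coords, feats = coords[m], feats[m]
    else:
        coords = torch.zeros((0, 3), dtype=torch.long, device='cuda')
        feats = torch.zeros((0, frag_feat.shape[1]), device='cuda')
    # Step 4: fuse the global voxels with the current fragment's data.
    coords, feats = gru_fuse(coords, feats, frag_coords, frag_feat)
    # Steps 5-6: rebucket voxels by sub-block, push back to CPU memory.
    bkeys = torch.div(coords, BLOCK, rounding_mode='floor')
    for k in torch.unique(bkeys, dim=0):
        m = (bkeys == k).all(dim=1)
        store[tuple(k.tolist())] = (coords[m].cpu(), feats[m].cpu())
    return store
```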
Coarse-to-fine dynamic range reconstruction:
Intuitively, when a photographer shoots a scene at close range, we take it that he wants the geometry of the photographed object reconstructed well; when he shoots a distant scene, we take it that he wants the scene reconstructed more completely, without particular emphasis on the geometry of faraway objects. Meanwhile, tests show that the higher the resolution, i.e. the smaller the voxel size, the better object geometry can be reconstructed. Based on this intuition and observation, the system departs from existing three-dimensional reconstruction systems: there, the reconstruction range of every fragment is fixed and, although a multi-resolution structure is used, only the TSDF of the highest-resolution space is ultimately kept as the result, whereas here every fragment can reconstruct TSDFs of different ranges and resolutions. Specifically, the number of voxels in the reconstruction range of each resolution layer is kept the same: the low-resolution space has large voxels, so its reconstruction is geometrically coarse but covers a larger range; correspondingly, the high-resolution space has small voxels, so its range is smaller but its reconstructed geometry is finer.
The position of the reconstruction range at different resolutions is controlled by varying the maximum depth of the view frustum. In graphics, the portion a camera can capture extends away from it in a pyramid shape, usually up to a maximum depth (i.e. furthest distance), so the captured portion forms a frustum, called the view frustum. The pictures in a fragment correspond to several different view frusta; a cuboid is used to contain all of them, and the center of the cuboid is taken as the center of the fragment's reconstruction range. The greater the frustum depth, the further the center of the reconstruction range lies from the camera. A larger frustum depth is set for the low-resolution space, so that from low resolution to high resolution the center of the reconstruction range moves closer to the camera while the range shrinks.
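A minimal sketch of how the center of a fragment's reconstruction range can be derived from the view frusta, assuming pinhole intrinsics `K` and camera-to-world poses; the near depth of 0.1 m and all names are illustrative. Setting `max_depth` larger for coarser layers reproduces the behavior described above: the bounding box grows and its center moves away from the cameras.

```python
import numpy as np

def frustum_corners(K, cam_to_world, max_depth, H, W):
    """Eight corners of one camera's view frustum in world coordinates."""
    pix = np.array([[0, 0], [W, 0], [0, H], [W, H]], dtype=float)
    rays = (np.linalg.inv(K) @ np.c_[pix, np.ones(4)].T).T  # unit-depth rays
    pts = np.vstack([0.1 * rays, max_depth * rays])         # near/far slices
    return (cam_to_world[:3, :3] @ pts.T).T + cam_to_world[:3, 3]

def fragment_center(Ks, poses, max_depth, H, W):
    """Center of the axis-aligned box containing all frusta in a fragment."""
    corners = np.vstack([frustum_corners(K, p, max_depth, H, W)
                         for K, p in zip(Ks, poses)])
    return (corners.min(axis=0) + corners.max(axis=0)) / 2
```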
The TSDF values of the voxels computed at a lower resolution are upsampled into the next higher-resolution space as one of the feature dimensions of the next layer. Before upsampling, the voxels are screened against the reconstruction range of the high-resolution space, and only the voxel data within that range are upsampled into the next layer.
When the photographer shoots toward the distance, only the low-resolution space with its larger range is reconstructed; the small near-range high-resolution space is skipped. There are two main reasons. First, the photographer is taken not to want finer geometric reconstruction of the near scene at that moment, and skipping the fragment's high-resolution reconstruction saves running cost. Second, in this situation the small range is likely to contain only an incomplete piece of the near scene, so high-resolution reconstruction would come out fragmented, i.e. both small and scattered; it is better to avoid this and skip the high-resolution reconstruction outright. It is observed that when shooting a distant scene the voxels predicted close-in occupy a small share of the whole reconstruction range, and a large share when shooting close-up. Based on this, a threshold is set on the upsampling step from a low-resolution to a high-resolution space: after the voxel TSDF values of one resolution layer are predicted, count how many of those voxels fall inside the next layer's reconstruction range; if the count exceeds the threshold, upsample and continue with the next layer, otherwise stop the current fragment's reconstruction at this layer. A code sketch of this gate follows.
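A minimal sketch of the threshold gate, under the assumption that voxel coordinates are integer grid indices; the ratio threshold is a free parameter here (the concrete values used by the implementation are given in the embodiment details below).

```python
import torch

def gate_upsample(coords, tsdf, next_lo, next_hi, ratio_thresh):
    """Decide whether to descend to the next (finer) resolution layer.

    coords/tsdf: predicted voxels of the current layer; next_lo/next_hi:
    reconstruction range of the next layer. Returns the voxels to
    upsample, or None if the occupancy ratio is below the threshold."""
    inside = ((coords >= next_lo) & (coords < next_hi)).all(dim=1)
    if inside.float().mean() < ratio_thresh:
        return None                       # stop: stay at the coarse layer
    return coords[inside], tsdf[inside]   # continue: upsample these voxels
```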
Grid generation from multi-resolution TSDFs:
Each resolution layer produces its own global TSDF. The high-resolution TSDF has a small reconstruction range and its reconstruction may well be incomplete, but it performs better geometrically, especially on the edges and corners of objects; the low-resolution TSDF has a larger range and a more complete reconstruction, but its geometry is smoother and struggles to bring out object shapes. When generating the grid, we want the best of both: preserve the geometry of the high-resolution TSDF and the completeness of the low-resolution TSDF.
To achieve this effect, several problems need to be solved:
1. Multiple resolutions may exist simultaneously within the complete reconstruction space, but the Marching Cubes algorithm can only process a TSDF of a single resolution to generate a grid. How is a grid generated for such a reconstruction space?
For this, a strategy of block-wise grid generation is adopted. The reconstruction space is divided into several sub-blocks, and the TSDF resolutions present in each sub-block are determined:
if only one resolution exists in the sub-block, Marching Cubes can directly generate the grid;
if TSDFs of several resolutions exist in the sub-block, they must be traded off against each other before the grid is generated.
Finally, merging the grids generated by all sub-blocks gives the final reconstruction result.
2. Consider how to perform the trade-off when TSDFs of several resolutions are present in a sub-block. If the TSDFs of the different resolutions in a block are each simply turned into three-dimensional grids by the Marching Cubes algorithm and the grids merged, the high-resolution grid is easily covered by the low-resolution grid. Intuitively one wants to choose the highest resolution wherever possible, but if each sub-block keeps only the highest-resolution TSDF it contains, then in some sub-blocks the reconstruction range of the high-resolution TSDF is too small to cover the whole sub-block, leaving that sub-block's reconstruction incomplete. How, then, to distinguish the sub-blocks that can be reconstructed with only the highest resolution?
For this, the sub-block TSDF output strategy is designed. Whether a sub-block can be reconstructed with only the highest resolution is determined from the resolutions present in the sub-block and in its surrounding sub-blocks.
3. If the highest-resolution TSDF in a sub-block cannot reconstruct the complete sub-block, it should be complemented by a lower-resolution TSDF in the same sub-block, whose reconstruction range is larger. How should the TSDFs of several resolutions in such a sub-block be processed so that the complement takes effect and Marching Cubes can still run?
For this, a fusion algorithm is designed. It fuses two TSDFs of different resolutions within a sub-block and finally outputs a TSDF of a single resolution, preserving both the geometry of the high-resolution TSDF and the reconstruction range of the low-resolution TSDF while making the junction between the two smoother.
1. Spatial blocking strategy
First the complete size of the required reconstruction space is determined; given the size of each sub-block, the reconstruction space is divided regularly into several sub-blocks, and the voxels and their TSDFs are assigned to sub-blocks by position. Then the resolutions present in each sub-block are determined. Since the high-resolution reconstruction range lies within the low-resolution one, if a high-resolution TSDF exists in a sub-block then the lower-resolution TSDFs necessarily exist there too, so only the highest TSDF resolution present in the block needs to be determined. FIG. 4 shows the partitioning of the reconstruction space and the determination of the highest resolution present.
During blocking, a voxel lying on a sub-block boundary must be shared by the adjacent sub-blocks; otherwise Marching Cubes cannot generate faces on the sub-block boundaries, leaving obvious breaks in the reconstruction result there. A sketch of this assignment follows.
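A minimal sketch of block assignment with shared boundary voxels, assuming integer voxel grid indices; global-edge handling is omitted and the names are illustrative.

```python
import numpy as np

def assign_subblocks(voxel_idx, block_size):
    """Map each voxel (integer grid index, shape (N, 3)) to every
    sub-block it belongs to. A voxel exactly on a sub-block boundary is
    duplicated into the neighboring sub-block as well, so Marching Cubes
    can close faces across the boundary."""
    owners = {}
    for i, v in enumerate(voxel_idx):
        lo = (v - 1) // block_size   # boundary voxels also belong to the
        hi = v // block_size         # previous block along that axis
        for bx in range(lo[0], hi[0] + 1):
            for by in range(lo[1], hi[1] + 1):
                for bz in range(lo[2], hi[2] + 1):
                    owners.setdefault((bx, by, bz), []).append(i)
    return owners
```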
2. Sub-block TSDF output strategy
Each sub-block should, as far as possible, generate its grid with the highest-resolution TSDF present, but the cases where a single resolution cannot reconstruct the sub-block completely must be distinguished. FIG. 5 illustrates the designed sub-block TSDF output strategy. When only the lowest resolution exists in a sub-block (the dark blue sub-blocks in FIG. 5), or when none of the surrounding sub-blocks holds a lower resolution than it does (the red and green sub-blocks in FIG. 5), the sub-block is considered to lie in the interior of a same-resolution region; the TSDF's reconstruction range should then cover the complete sub-block, and the highest-resolution TSDF present can be used directly for reconstruction. If sub-blocks of lower resolution surround it (the yellow and cyan sub-blocks in FIG. 5), it is considered to lie on the boundary of the reconstruction range; the highest-resolution TSDF alone may not reconstruct the complete sub-block, and fusion with the low-resolution TSDF within the sub-block is required.
In three-dimensional space, each sub-block must check the resolutions present in the 26 sub-blocks around it. In the fusion output strategy, only the two highest resolutions present are fused. For example, the yellow sub-block in FIG. 5 holds three resolutions at once, but in general the second-highest resolution can already reconstruct the sub-block completely, so the lower resolution need not be considered during fusion.
Algorithm 1 shows the flow of the spatial blocking and sub-block TSDF output strategy.
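The algorithm listing itself is not reproduced here; the following is only a minimal sketch of what the per-sub-block output decision could look like, assuming each sub-block records the finest resolution level it contains (level 0 being the finest). All names are illustrative.

```python
NEIGHBORS = [(dx, dy, dz)
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
             if (dx, dy, dz) != (0, 0, 0)]  # the 26 surrounding sub-blocks

def output_resolution(levels, key):
    """levels: dict sub-block key -> finest resolution level present
    (0 = finest, larger = coarser). Returns (level, needs_fusion): the
    level to output and whether it must first be fused with the
    next-coarser TSDF present in the same sub-block."""
    best = levels[key]
    coarsest = max(levels.values())
    if best == coarsest:
        return best, False          # only the lowest resolution: use it
    for d in NEIGHBORS:
        nk = tuple(k + o for k, o in zip(key, d))
        # A neighbor whose finest level is coarser than ours means we
        # sit on the boundary of the fine reconstruction range.
        if nk in levels and levels[nk] > best:
            return best, True       # fuse with the next-coarser TSDF
    return best, False              # interior: the finest level suffices
```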
3. Fusion of TSDFs of different resolutions
The goal of fusing TSDFs of different resolutions is to handle both resolutions within a sub-block: the high resolution restores geometry better, the low resolution complements the reconstruction range so that Marching Cubes can run, and the reconstruction result must look visually natural.
Firstly, to run Marching Cubes only a TSDF of a single resolution can be output, so the resolutions must be unified. The low-resolution TSDF is converted to high resolution by linear interpolation; because the truncation distances of the two resolutions differ, the interpolated values must be multiplied by the low-resolution truncation distance and divided by the high-resolution truncation distance. This process loses no information, that is, the three-dimensional grids generated by the Marching Cubes algorithm from the TSDF before and after interpolation are identical.
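A minimal sketch of this conversion on a dense per-sub-block grid (sparse data would be densified first); trilinear interpolation is the three-dimensional form of the linear interpolation described above, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def upsample_tsdf(tsdf_low, trunc_low, trunc_high, scale):
    """Trilinearly interpolate a coarse TSDF grid to a finer one and
    rescale it from the coarse truncation distance to the fine one.
    tsdf_low: dense (D, H, W) TSDF block."""
    t = tsdf_low[None, None]                   # add batch/channel dims
    t = F.interpolate(t, scale_factor=scale, mode='trilinear',
                      align_corners=True)
    # TSDF values are fractions of the truncation distance, so rescale:
    return (t * trunc_low / trunc_high)[0, 0]
```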
Two kinds of TSDF now exist: the high-resolution TSDF itself, and the high-resolution TSDF obtained by interpolating the low-resolution one, called the interpolated TSDF. The simplest fusion is, for each voxel of the sub-block, to use the high-resolution value where it exists and the interpolated value where it does not; this naive scheme does complement the range the high-resolution TSDF never reached. However, it was found to produce an obvious discontinuity at the junction between the two resolutions: as shown in FIG. 6, after fusion a clear fracture trace appears on the reconstructed wall. We want to reduce this phenomenon as much as possible and obtain a smoother junction.
It was found that when shooting, a photographer tends to capture an object completely, so the junctions between resolutions mostly fall on walls, floors, and other planes. Around such planes the interpolated TSDF is preferable to the high-resolution TSDF, because it is converted from low resolution and reconstructs planes more smoothly. Moreover, the geometric advantage of the high-resolution TSDF lies in corners and edges, which plane smoothing affects little. A weight-based fusion algorithm is therefore proposed whose goal is a fused TSDF that reconstructs more smoothly near planes: it fuses the two resolution TSDFs as a weighted sum, and for voxels near a plane the weighting leans toward the interpolated TSDF. The algorithm involves three parts: judging whether a voxel is near a plane, setting the weights, and fusing based on the weights; a combined sketch of the three parts follows the description of the fusion step below.
Measuring whether a voxel is near a plane:
FIG. 7 is a two-dimensional simplification of the three-dimensional TSDF; the three-dimensional case is analogous. The Marching Cubes algorithm builds the equipotential surface, i.e. the finally generated triangle faces, by finding equipotential points between voxels (positions where the TSDF value is 0), as in (1) of FIG. 7. By contrast with the TSDF of FIG. 7 (1), which produces a perfect plane, the TSDF of FIG. 7 (4) produces several uneven faces (angular, or curved when viewed macroscopically); the two must be distinguished. FIGS. 7 (2) and (5) show the change in TSDF between adjacent voxels along the x axis; FIGS. 7 (3) and (6) show the change along the y axis. When a voxel lies near a complete plane (e.g. the red voxel in FIGS. 7 (2) and (3)), it has these characteristics: 1. its own TSDF value is close to 0; 2. the change of the TSDF values of the voxels near it is the same in every direction (as in the red dashed range around the red voxel in FIGS. 7 (2) and (3)). When a voxel lies near an edge, there is some direction along which the nearby TSDF values change markedly (e.g. the red dashed range around the red voxels in FIGS. 7 (5) and (6)). Hence the variance of the directional change within a region around a voxel is used to measure how much it varies in that direction: around a perfect plane the variance should be zero, and a variance greater than zero in some direction indicates an edge nearby. For a three-dimensional TSDF, the variances of the changes along the x, y and z axes are computed separately, and the maximum of the three is taken for the judgment.
Weight setting:
After computing the variance, the preference between the two resolutions is expressed as a weight. In an actual reconstruction the variance never reaches zero in any direction even when the voxel is near a plane, because a prediction almost never yields a perfect plane. The variance is therefore not treated as simply zero versus non-zero: the smaller it is, the closer the surface near the voxel is to a plane. Two thresholds are set and the weight follows the function shown in FIG. 8. When the variance is below the first threshold, the voxel is considered near a plane and the interpolated TSDF is used directly, giving a smooth reconstruction; when the variance exceeds the second threshold, the voxel is considered at a non-planar corner and the high-resolution TSDF is used to restore the geometry; between the two thresholds the weight varies linearly.
Fusion is performed based on the weights:
For the TSDF finally output by the block, each voxel position is checked for the presence of a high-resolution TSDF value: if present, the two resolution TSDFs are fused by the weighted sum; if absent, the position lies in the part never reached by the high-resolution TSDF's reconstruction range and is complemented with the interpolated TSDF.
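The following sketch combines the three parts under strong simplifying assumptions: the variance is reduced to one value per sub-block (the description above evaluates it per voxel over a local neighborhood), the grids are dense, and all names and thresholds are illustrative.

```python
import torch

def direction_variance(tsdf):
    """Variance of the change between adjacent voxels, per axis, max over
    the three axes. Near zero for a plane, larger near edges. Simplified
    here to one scalar per sub-block. tsdf: dense (D, H, W) grid."""
    vs = []
    for dim in range(3):
        n = tsdf.shape[dim]
        diff = tsdf.narrow(dim, 1, n - 1) - tsdf.narrow(dim, 0, n - 1)
        vs.append(diff.var())
    return torch.stack(vs).max()

def fuse_weighted(tsdf_high, mask_high, tsdf_interp, t1, t2):
    """Weighted fusion of the high-resolution TSDF and the interpolated
    TSDF. mask_high marks voxels where a high-resolution value exists;
    elsewhere the interpolated TSDF fills the gap."""
    var = direction_variance(tsdf_interp)
    # Piecewise-linear weight: 0 below t1 (planar), 1 above t2 (edges).
    w = ((var - t1) / (t2 - t1)).clamp(0.0, 1.0)
    fused = w * tsdf_high + (1 - w) * tsdf_interp
    return torch.where(mask_high, fused, tsdf_interp)
```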
After fusion, sub-block boundaries must be aligned. Because boundary voxels are shared by several sub-blocks during blocking and the sub-blocks are reconstructed independently, inconsistent TSDF values at boundary voxels would shift the equipotential points and misalign the generated faces across the boundary. The output strategies of the sub-blocks around the one currently being fused are therefore checked, and where they differ, the TSDF values on the corresponding boundary are forced to agree.
Algorithm 2 shows the complete fusion algorithm and the grid generation flow.
In the concrete implementation, three TSDF layers of the model are used for grid generation, with resolutions of 1 cm, 4 cm and 8 cm; they serve, respectively, the close-range demand for geometric shape, the mid-range demand for the general appearance of the scene, and the long-range demand for reconstruction range and scene completeness. The reconstruction range of the 1 cm TSDF was found to be too small: even a 1 cm sub-block classified as interior by the output strategy could come out incomplete if output directly, so all 1 cm sub-blocks are output after fusion with the next lower-resolution (4 cm) TSDF.
Experiments show that the system can complete the reconstruction of a large scene while keeping GPU memory usage under control. Experiments quantitatively compare the scene reconstruction quality of this system and the existing system on a large-scale dataset; the reconstruction of near scenery and the completeness of the scene are both superior to the existing system. Experiments also qualitatively show the system's reconstruction in complex scenes.
Experimental environment:
CPU: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Memory: 251 GB
GPU: NVIDIA GeForce RTX 2080 Ti, 11 GB video memory, CUDA version 11.2
Deep learning framework: PyTorch 1.8.0
Specifically:
Reconstruction in a large-scale scene:
The reconstruction test is performed on a complete building and two conference rooms. FIG. 9 (1) shows the overall reconstruction, and FIG. 9 (2) compares the GPU memory overhead of different systems during reconstruction. Four systems are compared: the original NeuralRecon, NeuralRecon with the block sparse structure added, the present system without the block sparse structure, and the present system with the block sparse structure. The complete reconstruction comprises 1451 fragments, and the GPU memory that PyTorch reserves for tensors during reconstruction is recorded (the real GPU memory overhead additionally includes the PyTorch context, CUDA context and so on, fixed at about 3-4 GB). Without the block sparse structure, neither NeuralRecon nor the present system can complete the reconstruction; GPU memory overflows partway through. After the block sparse structure is added, the peak GPU memory occupied by PyTorch stays below 1 GB throughout, so the reconstruction of the large scene can be completed. The running speed of the system was also measured: to realize multi-resolution reconstruction the system increases the number of voxels per resolution layer, so its computation cost is higher than NeuralRecon's. Compared with the original global sparse data structure, the block sparse structure adds less than 10% extra overhead, mainly from CPU-GPU memory exchange and from regrouping block data during blocking, which is acceptable against the GPU memory benefit.
ScanNet test set:
Qualitative and quantitative evaluations of the system were performed on the ScanNet test set. FIG. 10 illustrates the reconstruction of certain scenes in the test set. For scene 1, the ground-truth meshes provided by the ScanNet dataset are preprocessed and not generated for the farther parts of the shot, so the system's reconstruction is more complete than both the NeuralRecon result and the ground truth. For scene 2, the system's reconstruction is geometrically better than NeuralRecon's in visual effect and closer to the ground truth.
The following table gives the quantitative evaluation on the ScanNet test set, divided into two-dimensional depth indexes and three-dimensional grid indexes. Among the two-dimensional depth indexes, the Comp index reflects the completeness of the reconstruction, and on it the system far exceeds NeuralRecon; this is the benefit the low-resolution layer brings to completeness. Among the three-dimensional grid indexes, some are inferior to NeuralRecon, mainly because the three-dimensional indexes measure the point-cloud similarity between the reconstruction and the ground truth: the low-resolution layer makes the reconstruction more complete, but precisely because the result is denser than the ground truth in point-cloud form (cf. scene 1 of FIG. 10), it also enlarges the difference from the ground truth and lowers those indexes.
Two-dimensional depth index:
three-dimensional grid index:
To quantify the benefit brought by the high-resolution layer, the table below evaluates only the depth maps of the test set that contain close-range (high-resolution) scenes; the system outperforms NeuralRecon on every index, reflecting the better reconstruction of the high-resolution layer.
Two-dimensional depth index:
Real scene tests:
Qualitative assessment is performed on the reconstruction of real complex scenes. FIG. 11 shows the reconstruction of scene 1 from FIG. 1: the 4 cm resolution layer used in NeuralRecon cannot reconstruct the entire scene completely and leaves holes and gaps, while the reconstruction of the 8 cm resolution layer is complete. After fusion the final result is also quite complete while retaining the geometry of the 4 cm resolution layer. FIG. 12 shows the reconstruction of scene 2 from FIG. 1: the 1 cm resolution layer restores object geometry best but has a small reconstruction range and easily yields incomplete reconstructions; the system fuses the layers of different resolutions well, retaining the edge and corner detail of most of the 1 cm layer while guaranteeing overall completeness. FIG. 13 shows the effect of the fusion algorithm at a resolution junction: compared with the naive scheme it makes the junction smoother.
The invention improves the model and the training strategy on the basis of the existing three-dimensional reconstruction algorithm NeuralRecon. The concrete implementation divides into two aspects:
model structure: a total of 5 resolution layers were used, 1 cm, 2 cm, 4 cm, 8 cm and 16 cm from top to bottom. The number of voxels in the reconstruction range is 48 cubes, except for 16 cm layers, and the number of the remaining layers is 96 cubes. That is, the reconstruction range of 16 cm and 8 cm layers is about 7.68 meters cubic space on a side, and the reconstruction range of 4 cm, 2 cm, 1 cm layers is about 3.84 meters, 1.92 meters, and 0.96 meters on a side. Since only 1 cm, 4 cm and 8 cm of TSDF are needed to be fused at last, when judging whether to enter the next layer, only two layers are needed to be processed: from 8 cm layer to 4 cm layer, the set threshold is that the reconstruction range of 4 cm layer contains one fourth of the reconstruction range of 8 cm layer. From 4 cm layer to 2 cm layer, the set threshold is that the reconstruction range of 1 cm layer contains one eighth of the reconstruction range of 4 cm layer, and if this condition is satisfied, 2 cm layer is entered, and 1 cm layer is not entered.
Model training: a layer-by-layer tuning regime is adopted. Starting from the NeuralRecon pre-trained model, the weights of the image feature extraction network are first frozen and kept fixed throughout training.
Next the 16 cm and 8 cm layers are trained. The ScanNet dataset used by NeuralRecon (indoor videos with corresponding depth maps) is unsuitable for these two layers, because its training data are almost all mid-to-close range and never reach the required image depth. The open-source Replica dataset is therefore used: it contains three-dimensional models of 18 indoor scenes, which are rendered with Pangolin (an open-source C++ rendering tool) to obtain RGB images and depth maps. Rendering requires a camera trajectory; to simulate a photographer shooting toward the distance, trajectories are generated that translate laterally while aiming into the distance, rendering a shot at fixed intervals, with each camera pose randomly perturbed within a small range to mimic reality. FIG. 14 shows the rendering of one of Replica's three-dimensional models; the three pictures are the start, middle and end of the simulated shoot. Rendering the whole Replica dataset this way yields images whose maximum depths range from 5 m to 8 m, essentially meeting the depth requirement of the training data. It was found that training on the Replica-rendered data alone causes, in actual use, spurious voxels to be predicted at far positions when a close-range scene is shot; presumably the constant presence of distant scenery in the Replica renderings overfits the model to such scenes. A portion of the ScanNet dataset is therefore mixed into the Replica renderings as a new hybrid dataset, with the two parts in comparable proportion, and the 16 cm and 8 cm layers are tuned on this mixture.
The 4 cm layer is tuned next; during its training the weights of the 16 cm and 8 cm layers are both frozen, since later training should not disturb the 8 cm layer's reconstruction. The 4 cm layer's training data is the ScanNet dataset, preprocessed as in NeuralRecon. The 2 cm and 1 cm layers are then trained in turn in the same way, with the weights of the previous resolution layers frozen during training. A sketch of the freezing step follows.
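A minimal sketch of the freezing step in PyTorch; the staged schedule in the comments uses illustrative attribute names, since the actual module names are not given above.

```python
import torch

def freeze(module: torch.nn.Module):
    """Freeze a sub-network so later tuning stages cannot change it."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes batch-norm statistics, if any

# Staged schedule sketched with illustrative attribute names:
# freeze(model.backbone2d)                # image feature extractor
# train(model, layers=['16cm', '8cm'])    # coarse layers, mixed dataset
# freeze(model.layer_16cm); freeze(model.layer_8cm)
# train(model, layers=['4cm'])            # then 4 cm on ScanNet,
#                                         # then 2 cm, then 1 cm in turn
```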
Taking the above preferred embodiments as illustration, persons skilled in the relevant art can make various changes and modifications without departing from the technical idea of the invention. The technical scope of the invention is not limited to the above description.
Claims (6)
1. A monocular three-dimensional scene reconstruction system suitable for large complex scenes, comprising:
inputting a video composed of a picture sequence, together with the camera pose and camera intrinsics corresponding to the pictures;
selecting key frames for reconstruction from the video through a key frame selection algorithm, wherein a plurality of consecutive key frames form a fragment, and a plurality of fragments form a fragment sequence;
inputting the fragment sequence into a deep learning model fragment by fragment, and sequentially reconstructing the global TSDF of each fragment;
and generating grids from the global TSDF of each fragment to obtain a final reconstruction result.
2. The monocular three-dimensional scene reconstruction system suitable for large complex scenes according to claim 1, wherein inputting the fragment sequence into the deep learning model fragment by fragment and sequentially reconstructing the global TSDF of each fragment comprises:
inputting the images in each fragment into a feature extraction module of the deep learning model to extract image features;
back-projecting the two-dimensional image features onto corresponding voxels in a three-dimensional space, wherein the three-dimensional space is divided into a plurality of resolution layers;
and obtaining the local TSDF values of the voxels of each resolution layer with a fusion algorithm.
3. The monocular three-dimensional scene reconstruction system suitable for large complex scenes according to claim 2, further comprising: screening the predicted local TSDF values of the lowest resolution layer and combining them by upsampling with the three-dimensional features of the voxels of the next higher resolution layer, and so on through the remaining resolution layers.
4. The monocular three-dimensional scene reconstruction system suitable for large complex scenes according to claim 2, wherein obtaining the local TSDF values of the voxels of each resolution layer with a fusion algorithm comprises:
each resolution layer maintains local intermediate features and global intermediate features; a gated recurrent unit (GRU) of a recurrent neural network (RNN) fuses the local and global intermediate features of each resolution layer, and the result updates the global intermediate features;
and a multi-layer perceptron (MLP) predicts from the three-dimensional features the local TSDF values of the voxels of each resolution layer, which are stored in the global TSDF space of that layer.
5. The monocular three-dimensional scene reconstruction system suitable for large complex scenes according to claim 4, wherein the fusion process comprises:
firstly, dividing a three-dimensional space regularly into a plurality of sub-blocks to obtain global sub-block position information;
knowing the reconstruction range of the current fragment, calculating with a graphics processor (GPU) the indices of the sub-blocks involved in the reconstruction range;
transmitting the data of the indexed sub-blocks from CPU memory to GPU memory;
judging with the GPU whether the voxel positions in the sub-blocks lie within the reconstruction range;
fusing the voxel data within the reconstruction range with the data of the current fragment;
judging the sub-block corresponding to each voxel and storing the voxels block-wise by sub-block;
and transferring the sub-blocks in GPU memory to the CPU for storage.
6. The monocular three-dimensional scene reconstruction system suitable for large complex scenes according to claim 5, wherein if a sub-block is newly created during the block-wise storage, the global sub-block position information is updated.