CN115880419A - Neural implicit surface generation and interaction method based on voxels - Google Patents

Neural implicit surface generation and interaction method based on voxels

Info

Publication number
CN115880419A
CN115880419A (application CN202211001790.6A)
Authority
CN
China
Prior art keywords
voxel
ray
blocks
geometric
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211001790.6A
Other languages
Chinese (zh)
Inventor
Guofeng Zhang (章国锋)
Hujun Bao (鲍虎军)
Hai Li (李海)
Xingrui Yang (杨兴锐)
Hongjia Zhai (翟宏佳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211001790.6A priority Critical patent/CN115880419A/en
Publication of CN115880419A publication Critical patent/CN115880419A/en
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

The invention discloses a voxel-based neural implicit surface generation and interaction method, belonging to the fields of computer vision and computer graphics. The invention decomposes a three-dimensional scene into geometric units at the granularity of voxel blocks, stores the geometric and texture information of the scene in the voxel blocks in the form of feature vectors, obtains the features of corresponding three-dimensional points by interpolation, and obtains a Signed Distance Field (SDF) and corresponding colors through a geometry analysis network and a texture analysis network. On this basis, the invention proposes progressive voxel culling and decomposition to further improve the surface and texture accuracy of the model, and surface-aware sampling to increase the number of samples at effective points. Through interactive editing of the generated voxel blocks, the surface and texture effects after editing can be rendered.

Description

Neural implicit surface generation and interaction method based on voxels
Technical Field
The invention relates to the field of computer vision and computer graphics, in particular to a neural implicit surface generation and interaction method based on voxels.
Background
Virtual content generation and interaction are important components of three-dimensional applications. Virtual content usually needs to be built by professional designers, which is complex and time-consuming. Automatic reconstruction of accurate surfaces from multi-view images is therefore crucial for virtual content generation, and it is also an important research topic in computer vision and computer graphics. Before the deep learning era, image-based surface reconstruction relied largely on multi-view stereo (MVS) techniques, which depend heavily on feature detection and matching. Although these methods are mature in both academia and industry, they are based on indirect feature matching and point cloud representations, so information is lost during reconstruction. This lost information poses challenges for the reconstruction of complex scenes. For example, in the case of weak texture, repetitive features, or inconsistent brightness, it is difficult to match features exactly, which produces erroneous three-dimensional points and ultimately defects in the reconstructed surface. Furthermore, discrete triangular meshes with inconsistent texture patches often cannot render a realistic scene, since the textures corresponding to the meshes are generated separately.
In the past two years, work representing scenes with neural networks has emerged and is rapidly becoming a research hotspot. Works such as Occupancy Networks and DeepSDF show that implicit surfaces, such as Signed Distance Fields (SDF) or occupancy fields, can be generated by learning and stored in a multi-layer perceptron (MLP). These networks can learn continuous scene representations from discrete three-dimensional samples. Based on this finding, DVR and IDR extend this representation to the task of image-based surface reconstruction. However, these methods only learn textures for points on the surface, and it is difficult for them to learn an accurate surface without sufficient observations.
With the advent of NeRF-based methods, the novel view synthesis task has improved considerably. NeRF and its extensions learn a neural radiance field of the scene through volume rendering and have achieved notable progress. However, such methods cannot accurately reconstruct the surface. Subsequently, methods such as NeuS, UNISURF and VolSDF propose combining the SDF with the radiance field to achieve surface reconstruction. These methods can be trained end-to-end directly from multi-view images without introducing additional representations, thereby minimizing information loss and achieving higher accuracy than conventional methods.
However, these methods reconstruct the entire space with a single network and cannot perform large-scale reconstruction due to limited network capacity. In addition, the scene is hidden inside the network, so interactive operations such as scene segmentation and editing cannot easily be performed.
Disclosure of Invention
To solve the above problems, the present invention employs a hybrid architecture consisting of an explicit voxel representation and an implicit surface representation. This architecture combines the advantages of both representations, allowing explicit manipulation of the scene while retaining implicit surface and texture representation capabilities. Vox-Surf is a voxel-based neural implicit surface rendering framework that combines voxel-based methods with image-based implicit surface reconstruction, and can be used for efficient surface reconstruction and rendering.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention firstly provides a neural implicit surface generation and interaction method based on voxels, which comprises the following steps:
step 1: dividing a scene in advance into a plurality of non-overlapping voxel blocks aligned with the coordinate axes, and establishing an octree structure corresponding to the voxel blocks; storing the geometric and texture information inside each voxel block in its 8 vertices in the form of fixed-length optimizable feature vectors;
step 2: inputting a plurality of RGB or RGBD images with known camera positions and orientations, generating rays from the camera center through the pixels of each image, calculating the intersections of each ray with the voxel blocks, performing three-dimensional point sampling in the intersection regions, and obtaining the feature vector of each sampled point from its three-dimensional coordinates within the voxel block; obtaining a Signed Distance Field (SDF) and intermediate information through a geometry analysis network, and obtaining a color from the intermediate information through a texture analysis network;
step 3: calculating the spatial density value corresponding to each three-dimensional point from its SDF, accumulating the weighted colors along each ray by volume rendering to obtain the predicted color of the pixel corresponding to the ray, comparing it with the true color to optimize the geometry analysis network, the texture analysis network and the feature vectors on the voxel blocks, and gradually generating the neural implicit surface of the scene through progressive training;
step 4: for the neural implicit surface of the scene obtained in step 3, whose voxel blocks contain geometric and texture feature vectors, rendering and interacting with the voxel blocks individually.
Further, the three-dimensional point sampling in step 2 specifically includes:
sampling the regions on a ray that intersect voxel blocks using a surface-aware sampling strategy, which is divided into three steps:
(1) first, sampling three-dimensional points p with uniform probability in the regions on the ray that intersect voxel blocks, and obtaining the feature vector e of each p through the feature extraction function;
(2) then, using the geometry analysis network F_σ to calculate the SDF of each sampling point; whether a segment contains the surface is determined by checking whether the SDFs of two consecutive three-dimensional points change from positive to negative along the ray direction, and the voxel blocks containing such points are marked as important voxels;
(3) finally, increasing the sampling probability inside important voxels, reducing the sampling probability in the other voxel blocks, resampling the regions on the ray that intersect voxel blocks, and keeping the total number of sampling points fixed.
Further, obtaining the predicted color of the pixel corresponding to a ray through volume rendering in step 3 specifically includes:
using an S-density function φ_s(σ) to convert the SDF of a three-dimensional point p into a density, where φ_s(σ) is a unimodal function of the signed distance σ of point p:
φ_s(σ) = s·e^(−sσ) / (1 + e^(−sσ))²
i.e., the derivative of the Sigmoid function Φ_s(σ) = (1 + e^(−sσ))^(−1); s is a scale parameter controlling the shape of the distribution, so that points close to the surface receive greater weight than distant points;
based on φ_s(σ), defining the opacity density ρ(t) as
ρ(t) = max( −(dΦ_s/dt)(σ(r(t))) / Φ_s(σ(r(t))), 0 )
thereby defining the discrete volume density (opacity) used in volume rendering as:
α_i = max( (Φ_s(σ_i) − Φ_s(σ_{i+1})) / Φ_s(σ_i), 0 )
and defining the discrete accumulated transmittance in volume rendering as:
T_i = ∏_{j=1}^{i−1} (1 − α_j)
thus, volume rendering over the N_p three-dimensional sampling points on a ray yields the accumulated color C(r):
C(r) = Σ_{i=1}^{N_p} T_i·α_i·c_i
where c_i is the color of point i on the ray.
According to a preferred embodiment of the present invention, the progressive training in step 3 specifically comprises:
the progressive training removes and decomposes voxel blocks from coarse to fine: voxel blocks that do not contain the surface are gradually culled, and the remaining voxel blocks are decomposed to obtain a finer surface;
first, a sufficient number of three-dimensional points are uniformly sampled in each voxel block; then the geometry analysis network F_σ is used to calculate the SDF of these points; to decide whether to retain or cull a voxel block V_i, a distance threshold τ is defined:
K_i = 1 if min over the sampled points p in V_i of |F_σ(Γ(p))[0]| < τ, and K_i = 0 otherwise,
where K_i ∈ {0,1} is a flag, with 1 denoting a voxel to be retained;
the method for decomposing the voxel blocks is: each remaining voxel block is decomposed into 8 sub-voxel blocks, and the feature vectors of the corner vertices of the newly generated sub-voxel blocks are computed using the feature extraction function Γ and subsequently optimized individually.
Compared with the prior art, the invention has the advantages that:
1) The three-dimensional representation proposed by the invention, named Vox-Surf, partitions and stores a three-dimensional scene in a number of disjoint voxel blocks. Vox-Surf combines the advantages of voxel representations and neural implicit surfaces, and can be learned end-to-end from multi-view images. Compared with the prior art, the voxel-block-based independent geometric rendering units of the invention are better suited for interactive editing of scenes.
2) The invention uses progressive training and a surface-aware sampling strategy to improve reconstruction quality without increasing memory overhead. Meanwhile, thanks to the Ray-AABB intersection detection strategy and the use of smaller networks, the method renders faster than existing methods.
Drawings
FIG. 1 is an overall flow chart of Vox-Surf reconstruction according to the present invention;
FIG. 2 is a schematic flow diagram of the surface-aware sampling of the present invention;
FIG. 3 is a schematic diagram of the progressive voxel culling/decomposition and surface reconstruction training process proposed by the present invention;
FIG. 4 is a schematic diagram of the interactive editing of the present invention.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings. The technical features of the embodiments of the present invention can be combined correspondingly without mutual conflict.
The invention discloses a voxel-based neural implicit surface generation and interaction method, which comprises the following steps:
step 1, dividing a scene in advance into a plurality of non-overlapping voxel blocks aligned with the coordinate axes, and establishing an octree structure corresponding to the voxel blocks; storing the geometric and texture information inside each voxel block in its 8 vertices in the form of fixed-length optimizable feature vectors;
Specifically, the scene is partitioned by a set of voxel blocks V = {V_1, V_2, …, V_n}, where each voxel block has 8 corner vertices containing encoded geometric and texture information; this information is represented by fixed-length optimizable feature vectors e ∈ R^{L_e}, where L_e is the length of the feature vector. Thus, for any voxel block V_i, the feature of any three-dimensional point p ∈ V_i inside it can be obtained by interpolating its corner features, and adjacent voxel blocks share the feature vectors of their 4 common corner vertices.
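For illustration, the following is a minimal sketch, in PyTorch, of how such a voxel representation might be organized; the names (VoxelGrid, corner_features, block_corner_idx) are hypothetical and not part of the invention:

```python
import torch

class VoxelGrid:
    """Sketch of the step-1 data structure: voxel blocks whose 8 corner
    vertices index into a shared table of optimizable feature vectors."""
    def __init__(self, num_corners: int, feat_len: int = 16):
        # One fixed-length feature vector per unique corner vertex;
        # adjacent blocks share corners by indexing the same rows.
        self.corner_features = torch.nn.Parameter(
            torch.zeros(num_corners, feat_len))
        self.block_corner_idx = {}  # block id -> LongTensor (8,) corner indices
        self.block_bounds = {}      # block id -> (min_xyz, max_xyz) octree leaf
```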
Step 2: inputting a plurality of RGB or RGBD images with known camera positions and orientations, generating rays from the camera center through the pixels of each image, calculating the intersections of each ray with the voxel blocks, performing three-dimensional point sampling in the intersection regions, and obtaining the feature vector of each sampled point from its three-dimensional coordinates within the voxel block; obtaining a Signed Distance Field (SDF) and intermediate information through a geometry analysis network, and obtaining the color from the intermediate information through a texture analysis network.
As shown in fig. 1, the ray passing from the camera center o through a pixel on the image in direction d is defined as r(t) = o + dt, where t is the depth along the ray direction. For each ray, the depths of its intersection points with the voxel blocks are calculated by a Ray-AABB intersection detection algorithm, thereby delimiting the regions on the ray that intersect voxel blocks.
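For reference, a minimal sketch of the standard slab-method Ray-AABB test that this step might use is given below; the function name and epsilon guard are illustrative assumptions:

```python
import torch

def ray_aabb_intersect(o, d, box_min, box_max, eps=1e-9):
    # Slab method for ray r(t) = o + d*t against an axis-aligned box:
    # intersect the three pairs of axis-aligned slabs.
    inv_d = 1.0 / (d + eps)                  # guard against division by zero
    t0 = (box_min - o) * inv_d
    t1 = (box_max - o) * inv_d
    t_near = torch.minimum(t0, t1).max()     # latest entry over the 3 slabs
    t_far = torch.maximum(t0, t1).min()      # earliest exit
    hit = (t_far >= t_near) & (t_far >= 0.0) # boolean tensor
    return t_near, t_far, hit
```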
In a preferred embodiment of the present invention, obtaining the Signed Distance Field (SDF) and intermediate information through the geometry analysis network in step 2, and then obtaining the color from the intermediate information through the texture analysis network, specifically includes: defining a feature extraction function Γ: R³ → R^{L_e} that maps a three-dimensional point p to a feature vector e ∈ R^{L_e} of length L_e. The feature extraction function is realized by trilinear interpolation: the feature vectors contained in the 8 corner vertices of the voxel block are interpolated according to the three-dimensional coordinates of p and its relative position inside the voxel block to obtain the feature vector of p.
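A minimal sketch of this trilinear feature interpolation, assuming the point has already been normalized to the unit cube of its voxel block (names and corner ordering are illustrative):

```python
import torch

def trilinear_features(p_local, corner_feats):
    # p_local: (3,) position normalized to [0,1]^3 within the block.
    # corner_feats: (8, L_e) corner features ordered by corner bits (x,y,z).
    x, y, z = p_local
    w = torch.stack([
        (1-x)*(1-y)*(1-z), (1-x)*(1-y)*z, (1-x)*y*(1-z), (1-x)*y*z,
        x*(1-y)*(1-z),     x*(1-y)*z,     x*y*(1-z),     x*y*z,
    ])                                      # trilinear weight of each corner
    return (w.unsqueeze(-1) * corner_feats).sum(dim=0)  # feature e of p, (L_e,)
```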
The invention adopts multi-layer perceptron networks (MLP) to represent the geometry analysis network F_σ and the texture analysis network F_c. The geometry analysis network F_σ: R^{L_e} → R × R^{L_f} maps the feature vector e of p to its signed distance σ ∈ R and a geometric feature vector f ∈ R^{L_f} of length L_f. The sign of σ indicates whether p is inside or outside the surface S. The surface S of the scene can be extracted as the zero-level set
S = { p | F_σ(Γ(p))[0] = 0 }
where the operation [0] means taking the first value from F_σ, which in the embodiment of the invention is the signed distance σ at position p. The geometric feature vector f of the three-dimensional point p, the ray direction d at p, and the feature vector e of p are concatenated as the input of the texture analysis network F_c to obtain the color c at p. In practice, the invention adopts the positional encoding algorithm PE proposed in the NeRF method: the feature vector e is encoded before entering the geometry analysis network, and the ray direction d is encoded before entering the texture analysis network.
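A minimal sketch, assuming PyTorch, of the positional encoding and the geometry analysis network F_σ; the layer sizes follow the embodiment described later (4 layers, 128 hidden units), while the activation choice and output-head layout are assumptions. The texture analysis network F_c is analogous, taking the concatenation [f, PE(d), e] and outputting RGB:

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    # NeRF-style PE: concatenated sin/cos at frequencies 2^k, k < num_freqs.
    out = []
    for k in range(num_freqs):
        out += [torch.sin((2 ** k) * torch.pi * x),
                torch.cos((2 ** k) * torch.pi * x)]
    return torch.cat(out, dim=-1)

class GeometryNet(nn.Module):
    # F_sigma: encoded feature PE(e) -> (signed distance, geometric feature f)
    def __init__(self, in_dim, hidden=128, feat_len=128, layers=4):
        super().__init__()
        mods, d = [], in_dim
        for _ in range(layers):
            mods += [nn.Linear(d, hidden), nn.Softplus(beta=100)]
            d = hidden
        self.body = nn.Sequential(*mods)
        self.head = nn.Linear(hidden, 1 + feat_len)   # [sigma | f]

    def forward(self, e):
        out = self.head(self.body(e))
        return out[..., 0], out[..., 1:]              # sigma, f
```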
In one embodiment of the invention, a surface-aware sampling strategy is proposed to sample three-dimensional points in the regions on a ray that intersect voxel blocks. The process can be roughly divided into three steps, as shown in fig. 2: (1) first, three-dimensional points p are sampled with uniform probability in the regions on the ray that intersect voxel blocks, and the feature vector e of each p is obtained through the feature extraction function; (2) then, the geometry analysis network F_σ is used to calculate the SDF of each sampling point; whether a segment contains the surface is determined by checking whether the SDFs of two consecutive three-dimensional points change from positive (outside) to negative (inside) along the ray direction, and the voxel blocks containing such points are marked as important voxels; (3) finally, the sampling probability inside important voxels is increased, the sampling probability in the other voxel blocks is reduced, the regions on the ray that intersect voxel blocks are resampled, and the total number of sampling points is kept fixed.
In practice, depending on whether only the first important voxel is used, the resampling is further divided into full surface-aware resampling (fig. 2, third panel) and first-surface-aware resampling (fig. 2, last panel). When the shape is not yet stable, the former is used to optimize all possible surfaces; the latter is then used to optimize the fine structure of the stabilized shape.
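The following is a minimal sketch of steps (2)-(3) on a single ray, under assumed tensor shapes; the sign-flip test mirrors the description above, while the fixed-budget reallocation heuristic (the boost fraction) is an illustrative assumption, not the invention's exact scheme:

```python
import torch

def surface_aware_budget(sdf, segment_id, n_total, boost=0.8):
    # sdf: (N,) SDF at uniform samples sorted by depth along the ray.
    # segment_id: (N,) index of the ray/voxel intersection segment per sample.
    # Mark segments where the SDF flips positive -> negative (surface inside).
    flips = (sdf[:-1] > 0) & (sdf[1:] < 0)
    n_seg = int(segment_id.max()) + 1
    important = torch.zeros(n_seg, dtype=torch.bool)
    important[segment_id[:-1][flips]] = True
    # Reallocate the fixed sample budget: most samples go to important
    # segments, the remainder is spread over the other segments.
    counts = torch.zeros(n_seg, dtype=torch.long)
    n_imp = int(important.sum())
    if n_imp > 0:
        counts[important] = int(boost * n_total) // n_imp
    rest = n_total - int(counts.sum())
    if n_seg > n_imp:
        counts[~important] = rest // (n_seg - n_imp)
    return counts   # per-segment resampling counts, total <= n_total
```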
Step 3: calculating the spatial density value corresponding to each three-dimensional point from its SDF, accumulating the weighted colors along each ray by volume rendering to obtain the predicted color of the pixel corresponding to the ray, comparing it with the true color to optimize the geometry analysis network, the texture analysis network and the feature vectors on the voxel blocks, and gradually generating the neural implicit surface of the scene through progressive training.
the step 3 of obtaining the color of the pixel corresponding to the predicted ray through volume rendering specifically includes:
the invention uses an S-density function phi s (σ) converting the SDF of the three-dimensional point p into a density, φ s (σ) is a unimodal function of the symbol distance σ with respect to point p, where
Figure BDA0003807676910000061
Is Sigmoid function phi s S is a scale parameter that controls the shape of the distribution, with points near the surface having a value that is greater than the weight of points farther away.
Based on phi s (σ) defining the opacity density ρ (t) as
Figure BDA0003807676910000062
Thereby defining the bulk density function in volume rendering as:
Figure BDA0003807676910000071
the discrete cumulative transmittance in the definition volume rendering is as follows:
Figure BDA0003807676910000072
thus, for N on a ray p And performing volume rendering on the three-dimensional sampling points to obtain an accumulated color C (r):
Figure BDA0003807676910000073
wherein c is i Is the color of point i on the ray.
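A minimal sketch of this discrete rendering on one ray, assuming PyTorch, depth-sorted samples, and small epsilon terms for numerical stability (the formulas follow the α_i, T_i and C(r) definitions above, with Φ_s(σ) = sigmoid(s·σ)):

```python
import torch

def render_color(sdf, colors, s, eps=1e-7):
    # sdf: (N,) signed distances at depth-sorted samples; colors: (N, 3).
    cdf = torch.sigmoid(s * sdf)                     # Phi_s(sigma_i)
    alpha = ((cdf[:-1] - cdf[1:]) / (cdf[:-1] + eps)).clamp(min=0.0)
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + eps])[:-1], dim=0)
    weights = trans * alpha                          # (N-1,)
    return (weights.unsqueeze(-1) * colors[:-1]).sum(dim=0)   # C(r)
```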
The progressive training in step 3 specifically comprises:
the progressive training removes and decomposes voxel blocks from coarse to fine: voxel blocks that do not contain the surface are gradually culled, and the remaining voxel blocks are decomposed to obtain a finer surface.
The voxel block culling step first uniformly samples a sufficient number of three-dimensional points within each voxel block, then uses the geometry analysis network F_σ to calculate the SDF of these points. To decide whether to retain or cull a voxel block V_i, the invention defines a distance threshold τ:
K_i = 1 if min over the sampled points p in V_i of |F_σ(Γ(p))[0]| < τ, and K_i = 0 otherwise,
where K_i ∈ {0,1} is a flag indicating whether the voxel is retained, with 1 denoting a retained voxel.
The voxel block decomposition step is as follows: each remaining voxel block is decomposed into 8 sub-voxel blocks, and the feature vectors of the corner vertices of the newly generated sub-voxel blocks are computed using the feature extraction function Γ and subsequently optimized individually.
The effect of each round of voxel block culling and decomposition is shown in the left 4 panels of fig. 3, and the resulting surface and texture are shown in the right 2 panels of fig. 3.
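A minimal sketch of one culling-and-decomposition round under the K_i rule above; the block bookkeeping (axis-aligned bounds) and the uniform sampler are illustrative assumptions:

```python
import torch

def cull_and_subdivide(blocks, sdf_fn, tau, n_samples=512):
    # blocks: list of (lo, hi) corner tensors, each (3,); sdf_fn: (N,3) -> (N,).
    kept = []
    for lo, hi in blocks:
        pts = lo + torch.rand(n_samples, 3) * (hi - lo)   # uniform samples
        if sdf_fn(pts).abs().min() < tau:                 # K_i = 1: near surface
            kept.append((lo, hi))
    children = []
    for lo, hi in kept:                                   # split into 8 octants
        mid = 0.5 * (lo + hi)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    off = torch.tensor([dx, dy, dz], dtype=lo.dtype)
                    children.append((lo + off * (mid - lo),
                                     mid + off * (hi - mid)))
    return children   # finer voxel blocks; corner features come from Gamma
```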
In order to optimize the feature vectors, the geometry analysis network and the texture analysis network, the invention uses the following loss functions. For each ray, the accumulated color C(r) of the ray is first calculated and then compared with the true color Ĉ(r) using an L1 loss:
L_color = (1/|R|) Σ_{r∈R} | C(r) − Ĉ(r) |
In order to constrain the SDF, the invention also adds an Eikonal loss term on the sampled three-dimensional points p, which maintains the stability of the SDF by constraining the gradient (normal vector) of the SDF at the sampled points to unit norm:
L_eik = (1/N) Σ_p ( ||∇_p σ(p)||₂ − 1 )²
The loss function finally used is
L = L_color + λ·L_eik
where λ is a weighting coefficient.
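A minimal sketch of this combined loss, assuming PyTorch; the weighting value and the autograd route to the SDF gradient are assumptions:

```python
import torch

def color_and_eikonal_loss(pred_rgb, gt_rgb, pts, sdf_fn, lam=0.1):
    # pred_rgb, gt_rgb: (R, 3) accumulated vs. true pixel colors.
    # pts: (N, 3) sampled points, created with requires_grad=True.
    l_color = (pred_rgb - gt_rgb).abs().mean()            # L1 color loss
    sdf = sdf_fn(pts)
    grad = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
    l_eik = (grad.norm(dim=-1) - 1.0).pow(2).mean()       # Eikonal term
    return l_color + lam * l_eik
```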
If depth information is input, the invention additionally uses a depth loss based on the occupancy field.
The occupancy field is defined as
occ(t) = Sigmoid( −scale · F_σ(Γ(r(t)))[0] )
Since the gradient of the occupancy field peaks only near the surface S, the invention divides a ray r(t) with depth information into three intervals with different corresponding losses:
(1) for points before the given depth t̂, i.e. t < t̂ − δt, where δt is a small noise-tolerant depth range, the invention always assumes that these points are outside the surface and penalizes their occupancy toward 0:
L_out = mean over these points of BCE( occ(t), 0 )
(2) for points behind the given depth t̂, i.e. t > t̂ + δt, the invention always assumes that these points are inside the surface and penalizes their occupancy toward 1:
L_in = mean over these points of BCE( occ(t), 1 )
It has been found in experiments that this loss remains valid even when a ray intersects multiple surfaces, as long as there are sufficient observations.
(3) for points in between, i.e. t̂ − δt ≤ t ≤ t̂ + δt, the invention considers them to be on the surface, so the SDF is directly constrained to 0:
L_surf = mean over these points of | F_σ(Γ(r(t)))[0] |
Finally, the total depth loss is the combination of the three losses above:
L_depth = L_out + L_in + L_surf
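A minimal sketch of such an occupancy-based depth loss on one ray; the binary-cross-entropy form of the outside/inside terms is an assumption consistent with the three-interval description above:

```python
import torch
import torch.nn.functional as F

def depth_loss(t, sdf, t_hat, delta, scale):
    # t: (N,) sample depths along the ray; sdf: (N,) SDF at those samples.
    # t_hat: measured depth; delta: noise-tolerant range around t_hat.
    occ = torch.sigmoid(-scale * sdf)          # occupancy field occ(t)
    before = t < t_hat - delta                 # assumed outside: occ -> 0
    after = t > t_hat + delta                  # assumed inside:  occ -> 1
    near = ~(before | after)                   # on the surface:  sdf -> 0
    loss = sdf.new_zeros(())
    if before.any():
        loss = loss + F.binary_cross_entropy(occ[before],
                                             torch.zeros_like(occ[before]))
    if after.any():
        loss = loss + F.binary_cross_entropy(occ[after],
                                             torch.ones_like(occ[after]))
    if near.any():
        loss = loss + sdf[near].abs().mean()
    return loss
```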
and 4, step 4: and (4) for the neural implicit surface of the scene obtained in the step (3), which contains the voxel blocks of the geometrical and textural information feature vectors, performing independent rendering and interaction on the voxel blocks. The individual rendering and interaction in the step 4 specifically include: each of the voxel blocks trained in the step 3 can be regarded as an independent geometric unit, interactive editing is directly carried out on the scene by changing the properties of the position, the size, the orientation and the like of the voxel block, the interactive freedom degree is improved, and the real texture effect under the current visual angle is directly generated through volume rendering in the step 3.
Examples
Experiments were carried out on two different types of datasets: the small-scale object dataset DTU and the indoor scene dataset ScanNet. For the DTU dataset, the invention first generates the initial voxel blocks and the corresponding octree within a unit cube, with a voxel size of 0.8. Voxel block culling is applied every 50,000 iterations, and voxel block decomposition is carried out at 20,000, 50,000, 100,000, 200,000 and 300,000 iterations, with a culling threshold of 0.01. Uniform voxel sampling is used before the second decomposition, the full surface-aware voxel resampling strategy is used from the second to the fourth decomposition, and after the fourth decomposition the first-surface-aware voxel resampling is used to continue refining detail. The voxel embedding length is 16, the geometry analysis network is a 4-layer MLP with 128 hidden units per layer, and the texture analysis network is a 4-layer MLP with 128 hidden units per layer. Before being input into the networks, the voxel features are encoded with 6 frequencies and the ray direction with 8 frequencies. Compared with COLMAP, DVR, IDR and NeuS, previously the most accurate method, the invention evaluates the accuracy between the real and reconstructed three-dimensional models using the Chamfer distance metric; the average accuracy of the invention is higher than that of NeuS and IDR.
For the ScanNet dataset, the invention uses data with depth for training: all depth observations are first back-projected into three-dimensional points, and these points are then voxelized with an initial voxel size of 0.4. Since RGB-D sensors are accurate only within a certain distance, the maximum depth range is limited to 5.0 to reduce noisy samples. The voxels are likewise progressively decomposed and culled twice during training, so that the minimum voxel size is 0.1. Compared with the COLMAP and TSDF methods on the Chamfer and F-score metrics, the results of the invention are significantly better than those of the traditional TSDF method.
The method can be applied to editing scenes and objects. As shown in fig. 4, a corresponding realistic scene can be rendered directly after modifying the voxel units of the method through operations such as alignment, copying, local scaling and segmentation.
The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can directly derive or infer from the disclosure of the present invention should be considered within the scope of the invention.

Claims (8)

1. A voxel-based neural implicit surface generation and interaction method, comprising the following steps:
step 1: dividing a scene in advance into a plurality of non-overlapping voxel blocks aligned with the coordinate axes, and establishing an octree structure corresponding to the voxel blocks; storing the geometric and texture information inside each voxel block in its 8 vertices in the form of fixed-length optimizable feature vectors;
step 2: inputting a plurality of RGB or RGBD images with known camera positions and orientations, generating rays from the camera center through the pixels of each image, calculating the intersections of each ray with the voxel blocks, performing three-dimensional point sampling in the intersection regions, and obtaining the feature vector of each sampled point from its three-dimensional coordinates within the voxel block; obtaining a Signed Distance Field (SDF) and intermediate information through a geometry analysis network, and obtaining a color from the intermediate information through a texture analysis network;
step 3: calculating the spatial density value corresponding to each three-dimensional point from its SDF, accumulating the weighted colors along each ray by volume rendering to obtain the predicted color of the pixel corresponding to the ray, comparing it with the true color to optimize the geometry analysis network, the texture analysis network and the feature vectors on the voxel blocks, and gradually generating the neural implicit surface of the scene through progressive training;
step 4: for the neural implicit surface of the scene obtained in step 3, whose voxel blocks contain geometric and texture feature vectors, rendering and interacting with the voxel blocks individually.
2. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein step 1 specifically comprises:
the scene is partitioned by a set of voxel blocks V = {V_1, V_2, …, V_n}, where each voxel block has 8 corner vertices containing encoded geometric and texture information; this information is represented by fixed-length optimizable feature vectors e ∈ R^{L_e}, where L_e is the length of the feature vector; thus, for any voxel block V_i, the feature of any three-dimensional point p ∈ V_i inside it can be obtained by interpolating its corner features, and adjacent voxel blocks share the feature vectors of their 4 common corner vertices.
3. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein calculating the intersections of a ray with the voxel blocks in step 2 specifically comprises:
the ray passing from the camera center o through a pixel on the image in direction d is defined as r(t) = o + dt, where t is the depth along the ray direction; for each ray, the depths of its intersection points with the voxel blocks are calculated by a Ray-AABB intersection detection algorithm, thereby delimiting the regions on the ray that intersect voxel blocks.
4. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein the three-dimensional point sampling in step 2 specifically comprises:
sampling the regions on a ray that intersect voxel blocks using a surface-aware sampling strategy, which is divided into three steps:
(1) first, sampling three-dimensional points p with uniform probability in the regions on the ray that intersect voxel blocks, and obtaining the feature vector e of each p through the feature extraction function;
(2) then, using the geometry analysis network F_σ to calculate the SDF of each sampling point; whether a segment contains the surface is determined by checking whether the SDFs of two consecutive three-dimensional points change from positive to negative along the ray direction, and the voxel blocks containing such points are marked as important voxels;
(3) finally, increasing the sampling probability inside important voxels, reducing the sampling probability in the other voxel blocks, resampling the regions on the ray that intersect voxel blocks, and keeping the total number of sampling points fixed.
5. The method according to claim 1, wherein obtaining a Signed Distance Field (SDF) and intermediate information through a geometry analysis network in step 2, and obtaining the color from the intermediate information through a texture analysis network, specifically comprises:
defining a feature extraction function Γ: R³ → R^{L_e} that maps a three-dimensional point p to a feature vector e ∈ R^{L_e} of length L_e; the feature extraction function is realized by trilinear interpolation, interpolating the feature vectors contained in the 8 corner vertices of the voxel block according to the three-dimensional coordinates of p and its relative position inside the voxel block to obtain the feature vector of p;
representing the geometry analysis network F_σ and the texture analysis network F_c using multi-layer perceptron networks (MLP); the geometry analysis network F_σ: R^{L_e} → R × R^{L_f} maps the feature vector e of p to its signed distance σ ∈ R and a geometric feature vector f ∈ R^{L_f} of length L_f; the sign of σ indicates whether p is inside or outside the surface S; the surface S of the scene can be extracted as the zero-level set
S = { p | F_σ(Γ(p))[0] = 0 }
where the operation [0] means taking from F_σ the signed distance σ at position p; the geometric feature vector f of the three-dimensional point p, the ray direction d at p, and the feature vector e of p are concatenated as the input of the texture analysis network F_c to obtain the color c at p.
6. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein obtaining the predicted color of the pixel corresponding to a ray through volume rendering in step 3 specifically comprises:
using an S-density function φ_s(σ) to convert the SDF of a three-dimensional point p into a density, where φ_s(σ) is a unimodal function of the signed distance σ of point p:
φ_s(σ) = s·e^(−sσ) / (1 + e^(−sσ))²
i.e., the derivative of the Sigmoid function Φ_s(σ) = (1 + e^(−sσ))^(−1), where s is a scale parameter controlling the shape of the distribution, so that points close to the surface receive greater weight than distant points;
based on φ_s(σ), defining the opacity density ρ(t) as
ρ(t) = max( −(dΦ_s/dt)(σ(r(t))) / Φ_s(σ(r(t))), 0 )
thereby defining the discrete volume density (opacity) used in volume rendering as:
α_i = max( (Φ_s(σ_i) − Φ_s(σ_{i+1})) / Φ_s(σ_i), 0 )
and defining the discrete accumulated transmittance in volume rendering as:
T_i = ∏_{j=1}^{i−1} (1 − α_j)
thus, volume rendering over the N_p three-dimensional sampling points on a ray yields the accumulated color C(r):
C(r) = Σ_{i=1}^{N_p} T_i·α_i·c_i
where c_i is the color of point i on the ray.
7. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein the progressive training in step 3 specifically comprises:
the progressive training removes and decomposes voxel blocks from coarse to fine: voxel blocks that do not contain the surface are gradually culled, and the remaining voxel blocks are decomposed to obtain a finer surface;
first, a sufficient number of three-dimensional points are uniformly sampled in each voxel block; then the geometry analysis network F_σ is used to calculate the SDF of these points; to decide whether to retain or cull a voxel block V_i, a distance threshold τ is defined:
K_i = 1 if min over the sampled points p in V_i of |F_σ(Γ(p))[0]| < τ, and K_i = 0 otherwise,
where K_i ∈ {0,1} is a flag, with 1 denoting a voxel to be retained;
the method for decomposing the voxel blocks is: each remaining voxel block is decomposed into 8 sub-voxel blocks, and the feature vectors of the corner vertices of the newly generated sub-voxel blocks are computed using the feature extraction function Γ and subsequently optimized individually.
8. The voxel-based neural implicit surface generation and interaction method according to claim 1, wherein the individual rendering and interaction in step 4 specifically comprise:
each voxel block trained in step 3 can be regarded as an independent geometric unit; interactive editing is performed directly on the scene by changing the position, size and orientation of a voxel block, which increases the degree of interactive freedom, and the realistic texture effect under the current viewing angle is generated directly through the volume rendering of step 3.
CN202211001790.6A 2022-08-20 2022-08-20 Neural implicit surface generation and interaction method based on voxels Pending CN115880419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211001790.6A CN115880419A (en) 2022-08-20 2022-08-20 Neural implicit surface generation and interaction method based on voxels


Publications (1)

Publication Number Publication Date
CN115880419A true CN115880419A (en) 2023-03-31

Family

ID=85769657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211001790.6A Pending CN115880419A (en) 2022-08-20 2022-08-20 Neural implicit surface generation and interaction method based on voxels

Country Status (1)

Country Link
CN (1) CN115880419A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456078A (en) * 2023-12-19 2024-01-26 北京渲光科技有限公司 Neural radiation field rendering method, system and equipment based on various sampling strategies
CN117456078B (en) * 2023-12-19 2024-03-26 北京渲光科技有限公司 Neural radiation field rendering method, system and equipment based on various sampling strategies

Similar Documents

Publication Publication Date Title
Li et al. Neuralangelo: High-fidelity neural surface reconstruction
CN110738697B (en) Monocular depth estimation method based on deep learning
CN112258618A (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN111105382B (en) Video repair method
CN110381268B (en) Method, device, storage medium and electronic equipment for generating video
JP2003077004A (en) Hierarchical image base representation of three- dimensional static or dynamic object, and method and device for using representation in rendering of object
CN113628348B (en) Method and equipment for determining viewpoint path in three-dimensional scene
Li et al. Vox-surf: Voxel-based implicit surface representation
CN113822993B (en) Digital twinning method and system based on 3D model matching
CN114424250A (en) Structural modeling
CN115115797B (en) Large-scene sparse light field semantic driving intelligent reconstruction method, system and device
Holzmann et al. Semantically aware urban 3d reconstruction with plane-based regularization
CN111462030A (en) Multi-image fused stereoscopic set vision new angle construction drawing method
Liu et al. High-quality textured 3D shape reconstruction with cascaded fully convolutional networks
CN114998515A (en) 3D human body self-supervision reconstruction method based on multi-view images
CN112562081A (en) Visual map construction method for visual layered positioning
CN111899295A (en) Monocular scene depth prediction method based on deep learning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN115880419A (en) Neural implicit surface generation and interaction method based on voxels
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
WO2020169959A1 (en) Image processing to determine object thickness
WO2023004559A1 (en) Editable free-viewpoint video using a layered neural representation
CN117635801A (en) New view synthesis method and system based on real-time rendering generalizable nerve radiation field
RU2710659C1 (en) Simultaneous uncontrolled segmentation of objects and drawing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination