CN116664782B - Neural radiation field three-dimensional reconstruction method based on fusion voxels - Google Patents
Neural radiation field three-dimensional reconstruction method based on fusion voxels
- Publication number
- CN116664782B CN116664782B CN202310947466.1A CN202310947466A CN116664782B CN 116664782 B CN116664782 B CN 116664782B CN 202310947466 A CN202310947466 A CN 202310947466A CN 116664782 B CN116664782 B CN 116664782B
- Authority
- CN
- China
- Prior art keywords
- radiation field
- frame
- volume
- voxels
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82—Arrangements for image or video recognition or understanding using neural networks
- Y02T10/40—Engine management systems
Abstract
The invention discloses a neural radiation field three-dimensional reconstruction method based on fusion voxels, comprising the following steps: extracting the two-dimensional features of an image with a convolutional neural network and generating a depth map; aggregating the two-dimensional features of adjacent images with the features computed by the coarse-stage MLP to generate a local radiation field represented by voxels; fusing, by a recurrent neural network, each local radiation field into the world coordinate system according to learned weights to produce a global radiation field, continuously updating the weights; feeding the generated global radiation field to a NeRF renderer to obtain the coordinates of each point and the nearby point density values; filtering the global radiation field by the depth map and a volume density threshold, then passing it to a volume renderer for volume rendering, and continuously optimizing the loss until training is completed, yielding a three-dimensional reconstruction model. By fusing the local radiation fields generated from each view, the invention strengthens the acquisition of global information; by screening voxels with the depth map and the voxel volume density, it removes redundant parts and improves training efficiency.
Description
Technical Field
The invention belongs to the field of three-dimensional reconstruction, and particularly relates to a neural radiation field three-dimensional reconstruction method based on fusion voxels.
Background
Three-dimensional reconstruction is the technique of recovering a three-dimensional model from extracted picture features, and is widely applied in fields such as virtual reality, medicine and games. With the recent rise of the metaverse in particular, expectations for the technique have grown: ever stronger characterization ability is demanded, so that reconstructed objects appear more vivid and lifelike, ultimately serving the goal of digital twins.
Early multi-view three-dimensional reconstruction matched sparse features with algorithms such as SIFT and ORB to relate the feature points of consecutive frames and thereby estimate the camera pose, recovered the three-dimensional coordinates of the feature points from the camera intrinsics, and finally produced explicit models such as dense point clouds and voxels. However, because of the discreteness of explicit representations, overlaps and artifacts arise during reconstruction; moreover, explicit representation at high resolution sharply increases memory consumption, which limits its application to high-resolution scenes. In 2020, Ben Mildenhall et al. proposed an implicit representation that synthesizes realistic views by combining neural radiance fields (NeRF) with volume rendering; it exhibits strong characterization ability, outputting high-resolution images with a small memory footprint, and unlike other implicit representations it requires no shape prior. In 2021, Xiaoshuai Zhang et al. proposed NeRFusion, a neural framework for large-scale scene reconstruction that fuses radiation fields: it first focuses on local reconstruction, building local radiation fields for the input key frames, and then fuses them into the world scene in frame order. This remedies the shortcoming that neural networks attend only to local information and enhances the global awareness of the system.
However, since NeRF must sample points densely over the whole scene, it requires far more computation than traditional methods, and training a single scene takes tens of hours; yet in most scenes the truly effective points occupy only about one fifth of the volume, and the invalid points in the background or outside the object greatly increase the computation and the training time of NeRF. Furthermore, NeRF produces errors when rendering objects with smooth surfaces: because no constraint is imposed on the object surface during rendering, the reconstructed surface easily becomes pitted, which also causes reconstruction errors. How to overcome NeRF's excessive training time and the pitted-surface phenomenon remains an open problem.
Disclosure of Invention
The invention aims to: provide a neural radiation field three-dimensional reconstruction method based on fusion voxels. Local voxels are generated from the two-dimensional features of the picture and the additional features obtained by a multi-layer perceptron; a recurrent neural network fuses the local voxels into global voxels, and the amount of computation is reduced by screening voxels whose density values lie within a certain range. In addition, a depth map is generated by a multi-view stereo (MVS) method to constrain the points rendered by NeRF, and points outside the object surface are pruned directly to improve surface smoothness.
The technical scheme is as follows: the invention discloses a neural radiation field three-dimensional reconstruction method based on fusion voxels, which is characterized by training a three-dimensional reconstruction model by executing the following steps of;
step 1, inputting an image into a two-dimensional convolutional neural network, acquiring two-dimensional characteristics of the image, and generating a depth map according to the two-dimensional characteristics of the image by using a multi-view stereo MVS method;
step 2, aggregating two-dimensional features of adjacent images in the depth map and additional features calculated based on a coarse-stage MLP (multi-layer perceptron), and generating a local radiation field represented by voxels;
step 3, based on a recurrent neural network, fusing the local radiation field generated by each frame to a world coordinate system according to the weight to generate a global radiation field, and continuously updating the weight;
step 4, inputting the generated global radiation field into a NeRF renderer to obtain coordinates of each point and a nearby point density value, and storing the coordinates and the nearby point density value into each voxel;
step 5, filtering the points in the global radiation field according to the depth map to remove redundant parts in the global radiation field;
step 6, filtering the voxel blocks according to the volume density threshold value, and reserving effective parts in the voxel blocks to obtain an updated global radiation field;
and 7, inputting the updated global radiation field into a volume renderer for volume rendering, calculating the loss function, and continuously optimizing the loss until training is completed, so as to obtain a three-dimensional reconstruction model; the MLP network is retained, and pictures input into the network generate a three-dimensional model of the object or scene, completing realistic new-view synthesis.
The MLP refers to a multi-layer perceptron, multiple MLPs are needed in the whole process of the NeRF three-dimensional reconstruction, wherein the MLP in the coarse stage is responsible for uniform sampling, and the MLP in the fine stage is responsible for sampling near the surface of an object.
Further, the step 1 specifically includes:
The parameters and images $\{I_i\}_{i=1}^{n}$ of $n$ cameras with known parameters are input into the two-dimensional convolutional neural network as a sequence; picture features are extracted from adjacent pictures and matched by disparity to obtain ordered disparity maps of the same size as the original images, from which depth maps in one-to-one pixel correspondence with the original images are generated. The disparity map is converted into a depth map by:

$$Z = \frac{B \cdot f}{d - (c_l - c_r)}$$

where $Z$ is the depth, $B$ the baseline length, $f$ the focal length, $d$ the disparity, and $c_l$, $c_r$ the column coordinates of the principal points of the left and right views.
Further, the step 2 specifically includes the following steps:
step 2.1, using a deep neural network to regress a local neural volume for the image $I_t$ of the $t$-th frame, extracting two-dimensional image features with the multi-view stereo MVS technique, and building a cost volume represented by voxels from these features;
step 2.2, using a two-dimensional convolutional neural network to map the image $I_t$ of the $t$-th frame into a feature map $F_t$ that stores the scene content of the image; the coarse-stage MLP supplies additional features, and the two-dimensional image features together with the additional features computed by the coarse-stage MLP are projected onto the corresponding local volume to obtain the single-frame feature volume:

$$V_t(v) = F_t\big(\pi_t(v)\big) \oplus h_t(v)$$

where $V_t(v)$ is the voxel feature of the $t$-th frame centred at voxel $v$, $F_t(\pi_t(v))$ is the projection of the two-dimensional feature of the $t$-th frame corresponding to the voxel centre $v$, $h_t(v)$ is the additional feature of the $t$-th frame view computed by the MLP, and $\oplus$ denotes feature concatenation;
step 2.3, aggregating the feature volumes of multiple frames using the mean and variance of the voxel features to regress the local volume $V_t^{l}$ of the $t$-th frame, which represents the local radiation field; the mean fuses the appearance information of multiple views, while the variance assists geometric reasoning:

$$V_t^{l} = G\Big(\operatorname{Mean}_{k\in\mathcal{N}_t}\big(V_k(v)\big),\ \operatorname{Var}_{k\in\mathcal{N}_t}\big(V_k(v)\big)\Big)$$

where $V_t^{l}$ is the local radiation field, $G$ a deep neural network, $\operatorname{Mean}$ the mean, $t$ the frame index, $\mathcal{N}_t$ the set of neighbouring views aggregated at frame $t$, $V_k$ the voxel features of frame $k$, and $\operatorname{Var}$ the variance.
Further, the step 3 specifically includes: at each frame $t$, the generated local radiation field $V_t^{l}$ is cyclically fused with the global radiation field $G_{t-1}$ produced by the previous frame, continuously updating the global radiation field $G_t$; a gated recurrent unit learns and fuses the local volume of each frame at each update:

$$z_t = \sigma\big(W_z\,[\,G_{t-1},\ V_t^{l}\,]\big)$$
$$r_t = \sigma\big(W_r\,[\,G_{t-1},\ V_t^{l}\,]\big)$$
$$\tilde G_t = \tanh\big(W\,[\,r_t \odot G_{t-1},\ V_t^{l}\,]\big)$$
$$G_t = (1-z_t)\odot G_{t-1} + z_t \odot \tilde G_t$$

where $z_t$ is the update gate and $W_z$ the neural network controlling the update gate, $r_t$ is the reset gate and $W_r$ the neural network controlling the reset gate, $\tilde G_t$ is the candidate global radiation field after fusing the current frame, and $W$ is the neural network controlling the sequential update of the whole model, used for sequentially updating the global reconstruction; $\odot$ denotes element-wise multiplication; $z_t$ and $r_t$ respectively control the contributions of the current frame's local radiation field $V_t^{l}$ and the previous frame's global radiation field $G_{t-1}$ during fusion; only the voxels where $V_t^{l}$ and $G_{t-1}$ coincide are updated, the other voxels remain unchanged.
Further, the step 4 specifically includes: the generated global radiation field is placed into the NeRF renderer to obtain the point density $\sigma$ and radiance $c$ of any point; the formula for regressing the volume density and radiance is:

$$(\sigma,\ c) = F_\Theta(x,\ y,\ z,\ \theta,\ \varphi)$$

where $x$, $y$ and $z$ are the horizontal, vertical and longitudinal coordinates of the point, $\theta$ is the azimuth angle and $\varphi$ the polar viewing angle.

After the points with point density $\sigma = 0$ are removed, the volume density in each voxel is determined and stored in the corresponding voxel, and the volume density value is dynamically updated according to:

$$\sigma_t^{g}(v) = (1-w)\,\sigma_{t-1}^{g}(v) + w\,\sigma_t^{l}(v)$$

where $\sigma_t^{g}(v)$ is the volume density of the voxels of the $t$-th frame global radiation field, $w$ is the weight controlling the update, and $\sigma_t^{l}(v)$ is the volume density of the local radiation field voxels generated from the $t$-th picture.
Further, the step 5 specifically includes: removing the part of the neural radiation field outside the object surface according to the depth map, and retaining the part inside the object surface.
Further, the step 6 specifically includes: removing the voxel blocks whose volume density falls below the threshold in the neural radiation field, and retaining the voxel blocks whose volume density lies within the required range.
Further, the step 7 specifically includes: repeating steps 2 to 6, putting the global radiation field filtered by the volume density threshold and the depth map into the volume renderer for volume rendering, obtaining the final rendered colour as the weighted sum of the colours of the points sampled on each ray, and computing and continuously optimizing the loss from the volume rendering result; the colour is computed by volume rendering as:

$$\alpha_i = 1 - \exp(-\sigma_i\,\delta_i)$$
$$T_i = \exp\Big(-\sum_{j=1}^{i-1}\sigma_j\,\delta_j\Big)$$
$$w_i = T_i\,\alpha_i$$
$$\hat C(r) = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i$$

where $\alpha_i$ is the opacity function of ray sampling point $i$, $\sigma_i$ the point density of sampling point $i$, $\delta_i$ the interval between sampling point $i$ and the next, $T_i$ the transmittance function of sampling point $i$, accumulated from the opacities of all sampling points before point $i$, $w_i$ the probability density of sampling point $i$, $\hat C(r)$ the final rendered colour of ray $r$, $N$ the number of sampling points on ray $r$, and $c_i$ the radiance of sampling point $i$; for NeRF, the difference between the rendered colour and the ground truth after volume rendering serves as the loss function, and the loss function formula is as follows:
$$\mathcal{L} = \sum_{r\in\mathcal{R}} \big\lVert \hat C(r) - C(r) \big\rVert_2^{2}$$

where $\mathcal{L}$ is the loss function, $\mathcal{R}$ the set of rays, and $C(r)$ the true colour of ray $r$.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantages:
1. The neural radiation field three-dimensional reconstruction method based on fusion voxels generates coarse dense voxels for each view according to the point density, and reduces the amount of computation by screening voxels whose density values lie within a certain range.
2. The implicit characterization method for multi-view fusion is provided, the acquisition of global information is enhanced by fusing local radiation fields generated by each view, redundant parts are reduced by screening voxels with the volume density within a required range, the calculated amount is reduced, and the training efficiency is improved.
3. A depth map generated by the MVS method constrains the points rendered by NeRF, and points outside isosurface 0 (the object surface) are pruned directly, improving surface smoothness.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The invention aims to provide a neural radiation field three-dimensional reconstruction method based on fusion voxels. The local voxel fields are generated according to the picture characteristics by utilizing the pre-trained network model, the generated local voxel fields are fused according to weights in sequence, so that a complete global radiation field is formed, and the calculated amount is reduced by screening voxels with density values within a certain range. In addition, a depth map is generated by a multi-view stereo (MVS) method to limit NeRF-rendered points, and pruning operation is directly performed on points outside the surface of the object to improve surface smoothness.
The technical scheme is as follows: the invention discloses a neural radiation field three-dimensional reconstruction method based on fusion voxels, which is characterized by training a three-dimensional reconstruction model by executing the following steps of;
step 1: inputting an image into a two-dimensional convolutional neural network, acquiring two-dimensional characteristics of the image, generating a corresponding depth map according to a parallax map of the image by using a multi-view stereo MVS method, wherein the image can use a data set, data in the data set has information such as shooting angle, camera parameters and the like of each picture, and a video shot by the user can also be put into a collmap to generate a corresponding camera pose, parameters and the like. Parameters of n Zhang Yizhi cameraImage +.>Inputting a two-dimensional convolutional neural network as a sequence, extracting picture features from adjacent pictures, performing parallax matching on the picture features to obtain orderly parallax images with the same size as the original image, and generating depth images corresponding to the original image pixels one by one according to the parallax images; disparity map conversionThe formula for the depth map is as follows:
;
wherein For depth->For baseline length,/->For focal length->For parallax (I)> and />Column coordinates of the main points of the left and right views.
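The disparity-to-depth conversion above can be sketched as follows; the function name, the test values and the principal-point arguments are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length,
                       cx_left=0.0, cx_right=0.0, eps=1e-6):
    """Z = B * f / (d - (c_l - c_r)); invalid (non-positive) disparities map to 0."""
    d = disparity - (cx_left - cx_right)
    # Guard against division by zero; mark invalid pixels with depth 0.
    return np.where(d > eps, baseline * focal_length / np.maximum(d, eps), 0.0)

disp = np.array([[8.0, 4.0],
                 [2.0, 0.0]])                      # toy rectified disparity map
depth = disparity_to_depth(disp, baseline=0.1, focal_length=800.0)
print(depth)  # 0.1*800/8=10, /4=20, /2=40, invalid -> 0
```

Nearer objects have larger disparity and hence smaller depth, as the toy map shows.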
Step 2: a local radiation field represented by voxels is generated from the depth maps of adjacent images and the additional features computed by the coarse-stage MLP.
Step 2.1, a deep neural network regresses a local neural volume for the image $I_t$ of the $t$-th frame; two-dimensional image features are extracted with the multi-view stereo MVS technique, and a cost volume represented by voxels is built from these features;
step 2.2, a two-dimensional convolutional neural network maps the image $I_t$ of the $t$-th frame into a feature map $F_t$ that stores the scene content of the image; the coarse-stage MLP supplies additional features, and the two-dimensional image features together with the additional features computed by the coarse-stage MLP are projected onto the corresponding local volume to obtain the single-frame feature volume:

$$V_t(v) = F_t\big(\pi_t(v)\big) \oplus h_t(v)$$

where $V_t(v)$ is the voxel feature of the $t$-th frame centred at voxel $v$, $F_t(\pi_t(v))$ is the projection of the two-dimensional feature of the $t$-th frame corresponding to the voxel centre $v$, $h_t(v)$ is the additional feature of the $t$-th frame view computed by the MLP, and $\oplus$ denotes feature concatenation;
step 2.3, the feature volumes of multiple frames are aggregated using the mean and variance of the voxel features to regress the local volume $V_t^{l}$ of the $t$-th frame, which represents the local radiation field; the mean fuses the appearance information of multiple views, while the variance assists geometric reasoning:

$$V_t^{l} = G\Big(\operatorname{Mean}_{k\in\mathcal{N}_t}\big(V_k(v)\big),\ \operatorname{Var}_{k\in\mathcal{N}_t}\big(V_k(v)\big)\Big)$$

where $V_t^{l}$ is the local radiation field, $G$ a deep neural network, $\operatorname{Mean}$ the mean, $t$ the frame index, $\mathcal{N}_t$ the set of neighbouring views aggregated at frame $t$, $V_k$ the voxel features of frame $k$, and $\operatorname{Var}$ the variance.
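The mean/variance aggregation of step 2.3 can be sketched as below; the deep network $G$ that regresses the local radiation field from these statistics is omitted, and all names and tensor shapes are assumptions:

```python
import numpy as np

def aggregate_local_volume(per_frame_volumes):
    """per_frame_volumes: (K, C, X, Y, Z) voxel features from K neighbouring views.
    Returns (2C, X, Y, Z): element-wise mean (fuses appearance across views)
    stacked with element-wise variance (a cue for geometric reasoning)."""
    mean = per_frame_volumes.mean(axis=0)
    var = per_frame_volumes.var(axis=0)
    return np.concatenate([mean, var], axis=0)

rng = np.random.default_rng(0)
vols = rng.normal(size=(4, 8, 16, 16, 16))  # 4 views, 8 channels, 16^3 voxels
agg = aggregate_local_volume(vols)
print(agg.shape)  # (16, 16, 16, 16)
```

In the patent the concatenated statistics would then be fed to $G$ to produce $V_t^{l}$.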
Step 3: based on a recurrent neural network, the local radiation field generated by each frame is fused into the world coordinate system according to the weights to produce the global radiation field, and the weights are continuously updated. At each frame $t$, the generated local radiation field $V_t^{l}$ is cyclically fused with the global radiation field $G_{t-1}$ produced by the previous frame, continuously updating the global radiation field $G_t$; a gated recurrent unit learns and fuses the local volume of each frame at each update:

$$z_t = \sigma\big(W_z\,[\,G_{t-1},\ V_t^{l}\,]\big)$$
$$r_t = \sigma\big(W_r\,[\,G_{t-1},\ V_t^{l}\,]\big)$$
$$\tilde G_t = \tanh\big(W\,[\,r_t \odot G_{t-1},\ V_t^{l}\,]\big)$$
$$G_t = (1-z_t)\odot G_{t-1} + z_t \odot \tilde G_t$$

where $z_t$ is the update gate and $W_z$ the neural network controlling the update gate, $r_t$ is the reset gate and $W_r$ the neural network controlling the reset gate, $\tilde G_t$ is the candidate global radiation field after fusing the current frame, and $W$ is the neural network controlling the sequential update of the whole model, used for sequentially updating the global reconstruction; $\odot$ denotes element-wise multiplication; $z_t$ and $r_t$ respectively control the contributions of the current frame's local radiation field $V_t^{l}$ and the previous frame's global radiation field $G_{t-1}$ during fusion; only the voxels where $V_t^{l}$ and $G_{t-1}$ coincide are updated, the other voxels remain unchanged.
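A minimal sketch of the gated-recurrent fusion step, assuming per-voxel feature vectors and plain weight matrices in place of the patent's (unspecified) gate networks; in practice only the voxels covered by the current frame would be passed in:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fuse(g_prev, v_local, w_z, w_r, w):
    """One GRU fusion step per voxel:
    z = sigmoid([G_{t-1}, V_t] @ w_z)      update gate
    r = sigmoid([G_{t-1}, V_t] @ w_r)      reset gate
    G~ = tanh([r * G_{t-1}, V_t] @ w)      candidate global field
    G_t = (1 - z) * G_{t-1} + z * G~"""
    x = np.concatenate([g_prev, v_local], axis=-1)
    z = sigmoid(x @ w_z)
    r = sigmoid(x @ w_r)
    cand = np.tanh(np.concatenate([r * g_prev, v_local], axis=-1) @ w)
    return (1.0 - z) * g_prev + z * cand

rng = np.random.default_rng(1)
c = 8                                     # feature channels per voxel
g = np.zeros((100, c))                    # 100 overlapping voxels, empty global field
v = rng.normal(size=(100, c))             # current frame's local field
w_z, w_r, w = (rng.normal(scale=0.1, size=(2 * c, c)) for _ in range(3))
g = gru_fuse(g, v, w_z, w_r, w)
print(g.shape)  # (100, 8)
```

With an empty global field the output reduces to $z_t \odot \tilde G_t$, so every fused feature is bounded by the tanh range.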
Step 4: the generated global radiation field is input into the NeRF renderer to obtain the coordinates of each point in the global radiation field and the nearby point density values; the points with point density 0 are removed, after which the volume density values of the voxels are obtained and stored in each voxel. At every view angle, NeRF samples points on the incoming rays of the radiation field and uses an MLP to obtain the volume density $\sigma$ and radiance $c$ of any point:

$$(\sigma,\ c) = F_\Theta(x,\ y,\ z,\ \theta,\ \varphi)$$

where $x$, $y$ and $z$ are the horizontal, vertical and longitudinal coordinates of the point, $\theta$ is the azimuth angle and $\varphi$ the polar viewing angle.

After the points with point density $\sigma = 0$ are removed, the volume density in each voxel is determined, stored in the corresponding voxel and dynamically updated, so that later uses of the volume density query the stored value instead of recomputing it with the MLP; the update formula is:

$$\sigma_t^{g}(v) = (1-w)\,\sigma_{t-1}^{g}(v) + w\,\sigma_t^{l}(v)$$

where $\sigma_t^{g}(v)$ is the volume density of the voxels of the $t$-th frame global radiation field, $w$ is the weight controlling the update, and $\sigma_t^{l}(v)$ is the volume density of the local radiation field voxels generated from the $t$-th picture.
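The per-voxel density update is a simple exponential moving average; a sketch with an illustrative weight $w$:

```python
def update_voxel_density(sigma_global_prev, sigma_local, w=0.2):
    """sigma_t^g = (1 - w) * sigma_{t-1}^g + w * sigma_t^l, applied per voxel.
    The stored value is queried at render time instead of re-running the MLP."""
    return (1.0 - w) * sigma_global_prev + w * sigma_local

s = update_voxel_density(10.0, 20.0, w=0.2)
print(s)  # 0.8*10 + 0.2*20 = 12.0
```

A small $w$ favours the accumulated global estimate; $w = 1$ would overwrite it with the current frame.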
Step 5: the part of the neural radiation field outside the object surface is removed according to the depth map, and the part inside the object surface is retained. The depth map information limits the positions of the NeRF sampling points during rendering: points outside the object surface are ignored, and only the information of sampling points meeting the requirement is stored during rendering.
Step 6: voxel blocks whose volume density falls below the threshold are removed from the neural radiation field, and voxel blocks whose volume density lies within the required range are retained; during NeRF rendering, points are sampled only in voxels whose volume density is within the required range.
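The volume-density screening of step 6 amounts to a per-voxel range mask; a sketch with assumed threshold values:

```python
import numpy as np

def density_mask(sigma_grid, low, high):
    """True for voxels whose stored volume density lies in [low, high];
    NeRF then samples points only inside the masked voxels."""
    return (sigma_grid >= low) & (sigma_grid <= high)

sigma = np.array([0.0, 0.05, 1.5, 80.0, 1e4])   # toy per-voxel densities
mask = density_mask(sigma, low=0.1, high=1e3)
print(mask.tolist())  # [False, False, True, True, False]
```

Empty voxels (near-zero density) and degenerate outliers are both excluded, which is what shrinks the sampling workload.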
Step 7: steps 2 to 6 are repeated; the global radiation field filtered by the volume density threshold and the depth map is put into the volume renderer for volume rendering, the final rendered colour is obtained as the weighted sum of the colours of the points sampled on each ray, and the loss is computed from the volume rendering result and continuously optimized. The colour is computed by volume rendering as:

$$\alpha_i = 1 - \exp(-\sigma_i\,\delta_i)$$
$$T_i = \exp\Big(-\sum_{j=1}^{i-1}\sigma_j\,\delta_j\Big)$$
$$w_i = T_i\,\alpha_i$$
$$\hat C(r) = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i$$

where $\alpha_i$ is the opacity function of ray sampling point $i$, $\sigma_i$ the point density of sampling point $i$, $\delta_i$ the interval between sampling point $i$ and the next, $T_i$ the transmittance function of sampling point $i$, accumulated from the opacities of all sampling points before point $i$, $w_i$ the probability density of sampling point $i$, $\hat C(r)$ the final rendered colour of ray $r$, $N$ the number of sampling points on ray $r$, and $c_i$ the radiance of sampling point $i$; for NeRF, the difference between the rendered colour and the ground truth after volume rendering serves as the loss function, and the loss function formula is as follows:
$$\mathcal{L} = \sum_{r\in\mathcal{R}} \big\lVert \hat C(r) - C(r) \big\rVert_2^{2}$$

where $\mathcal{L}$ is the loss function, $\mathcal{R}$ the set of rays, and $C(r)$ the true colour of ray $r$.
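The quadrature volume rendering and photometric loss above can be sketched as follows for a single ray; `render_ray`, `photometric_loss` and the sample values are illustrative names, not from the patent:

```python
import numpy as np

def render_ray(sigmas, deltas, colors):
    """alpha_i = 1 - exp(-sigma_i * delta_i)
    T_i     = exp(-sum_{j<i} sigma_j * delta_j)   (transmittance)
    C_hat   = sum_i T_i * alpha_i * c_i           (rendered colour)"""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance before each sample: shift the cumulative optical depth by one.
    accum = np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]])
    weights = np.exp(-accum) * alphas
    return weights @ colors

def photometric_loss(rendered, truth):
    """Squared error between rendered and true ray colours."""
    return float(np.sum((rendered - truth) ** 2))

sigmas = np.array([0.0, 5.0, 50.0])   # point densities along one ray
deltas = np.full(3, 0.1)              # sampling intervals
colors = np.eye(3)                    # RGB of the three sampling points
c_hat = render_ray(sigmas, deltas, colors)
print(np.round(c_hat, 4))
```

The empty first sample contributes nothing, and the dense last sample is attenuated by the transmittance accumulated in front of it.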
The experiments were carried out under Windows 10 with an i7-12700F processor, 32 GB of memory and an RTX 3080 12G graphics card. The experimental performance of the method is as follows:
in contrast to the neural radiation field reconstruction model IBRNet proposed by the paper IBRNet: learning multi-view image-based reconstruction in 2021, the model NeRF proposed by Ben Mildnhall et al in paper NeRF: representing scenes as neural radiance fields forview synthesis in 2020, and the model NSVF proposed by Lingjie Liu et al in paper Neural sparse voxel fields in 2020, 100 scenes were randomly selected as training data in the Scannet dataset. The peak signal-to-noise ratio PSNR, the structural similarity SSIM and the perception loss LPIPS are used as main indexes for evaluation, wherein the higher the PSNR value is, the less noise is represented, the higher the SSIM value is, the higher the structural similarity is represented, and the lower the LPIPS is, the better the perception effect of people is. Since NSVF and NeRF are both optimized for a single scenario, the experiment only shows experimental results of scenario-by-scenario optimization for fairness.
The above cited documents are as follows:
[1] Wang Q, Wang Z, Genova K, et al. IBRNet: Learning multi-view image-based rendering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 4690-4699.
[2] Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106.
[3] Liu L, Gu J, Zaw Lin K, et al. Neural sparse voxel fields[J]. Advances in Neural Information Processing Systems, 2020, 33: 15651-15663.
Table 1 presents the results of the tests performed on the ScanNet dataset. From the experimental results we can see that the method performs strongly on large real-scene datasets.
Table 1 quantitative comparison on ScanNet dataset
Table 2 presents the results of the tests performed on the NeRF synthetic dataset. From the experimental results we can see that the method also performs well on synthetic datasets.
Table 2 quantitative comparisons on NeRF synthetic datasets
Claims (5)
1. A neural radiation field three-dimensional reconstruction method based on fusion voxels, which is characterized in that a three-dimensional reconstruction model is trained by executing the following steps;
step 1, inputting an image into a two-dimensional convolutional neural network, acquiring two-dimensional characteristics of the image, and generating a depth map;
step 2, aggregating two-dimensional features of adjacent images in the depth map and additional features calculated based on the coarse-stage MLP to generate a local radiation field represented by voxels;
step 3, based on a recurrent neural network, fusing the local radiation field generated by each frame to a world coordinate system according to the weight to generate a global radiation field, and continuously updating the weight;
step 4, inputting the generated global radiation field into a NeRF renderer to obtain coordinates of each point and a nearby point density value, and storing the coordinates and the nearby point density value into each voxel;
step 5, filtering the points in the global radiation field according to the depth map to remove redundant parts of the global radiation field;
step 6, filtering the voxel blocks according to the volume density threshold value, and reserving effective parts in the voxel blocks to obtain an updated global radiation field;
step 7, inputting the updated global radiation field into a volume renderer for volume rendering, calculating the loss function, and continuously optimizing the loss until training is completed to obtain the three-dimensional reconstruction model; the MLP (multi-layer perceptron) network of the model is retained, and inputting pictures into this network generates a three-dimensional model of an object or scene, completing new-view synthesis;
the step 2 specifically comprises the following steps:
step 2.1, using a deep neural network to regress a local neural volume for the image $I_t$ of frame $t$, extracting two-dimensional image features with the multi-view stereo (MVS) technology, and establishing a voxel-represented cost volume according to the features;
step 2.2, using a two-dimensional convolutional neural network to map the image $I_t$ of frame $t$ into a feature map $F_t$ that stores the scene content of the image, obtaining additional features with the coarse-stage MLP, and projecting the two-dimensional image features and the additional features computed by the coarse-stage MLP onto the corresponding local volume to obtain a single-frame feature volume, generated as:

$$V_t(v) = F_t\big(\pi_t(v)\big) \oplus f_t(v) ;$$

where $V_t(v)$ is the voxel feature of frame $t$ centered at voxel $v$, $F_t(\pi_t(v))$ is the corresponding two-dimensional feature of frame $t$ projected at voxel center $v$, $f_t(v)$ is the additional feature computed by the MLP for the view of frame $t$, and $\oplus$ denotes feature concatenation;
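The projection-and-concatenation step for a single frame can be sketched as follows (a toy NumPy sketch under assumed pinhole conventions; the function name and nearest-pixel sampling are illustrative choices, not from the patent):

```python
import numpy as np

def single_frame_feature_volume(feat_map, K, pose_w2c, voxel_centers, extra_feats):
    """Project each voxel center into the frame with intrinsics K and a 3x4
    world-to-camera pose, sample the 2-D feature at the nearest pixel, and
    concatenate the MLP's extra feature: V_t(v) = F_t(pi_t(v)) (+) f_t(v)."""
    n = voxel_centers.shape[0]
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)
    cam = (pose_w2c @ homo.T).T[:, :3]               # voxel centers in camera space
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide -> pixels
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, feat_map.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, feat_map.shape[0] - 1)
    sampled = feat_map[v, u]                         # F_t(pi_t(v)), one row per voxel
    return np.concatenate([sampled, extra_feats], axis=1)  # feature concatenation

# One voxel directly in front of an identity camera samples pixel (0, 0).
feat_map = np.array([[[5.0, 6.0], [0.0, 0.0]],
                     [[0.0, 0.0], [0.0, 0.0]]])      # (H=2, W=2, C=2)
K = np.eye(3)
pose = np.eye(4)[:3]                                 # world-to-camera, 3x4
centers = np.array([[0.0, 0.0, 1.0]])
extra = np.array([[7.0]])
vol = single_frame_feature_volume(feat_map, K, pose, centers, extra)
```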
step 2.3, aggregating the feature volumes of multiple frames by the mean and variance of the voxel features to regress the local volume $V_t^{\mathrm{local}}$ of frame $t$, which represents the local radiation field, where the mean can fuse the appearance information of multiple views and the variance can help with geometric reasoning; the local radiation field is generated as:

$$V_t^{\mathrm{local}} = \psi\big(\mathrm{Mean}(\{V_t^{k}\}_{k \in K_t}),\ \mathrm{Var}(\{V_t^{k}\}_{k \in K_t})\big) ;$$

where $V_t^{\mathrm{local}}$ denotes the local radiation field, $\psi$ denotes a deep neural network, $\mathrm{Mean}$ denotes the mean, $t$ denotes frame $t$, $K_t$ denotes the multiple neighboring views aggregated at frame $t$, $V_t^{k}$ denotes the voxel features of frame $t$ from view $k$, and $\mathrm{Var}$ denotes the variance;
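The mean/variance aggregation can be sketched as follows (illustrative NumPy sketch; the network $\psi$ that consumes the concatenation is omitted):

```python
import numpy as np

def aggregate_views(view_volumes):
    """Aggregate per-view voxel features: the mean fuses appearance across
    views, the variance exposes photo-consistency for geometric reasoning.
    Returns their concatenation, the input a network psi would regress from."""
    stack = np.stack(view_volumes)        # (K views, n voxels, feat)
    mean = stack.mean(axis=0)
    var = stack.var(axis=0)
    return np.concatenate([mean, var], axis=-1)

# Two views that agree perfectly -> zero variance (a photo-consistent voxel).
a = np.array([[1.0, 2.0]])
b = np.array([[1.0, 2.0]])
agg = aggregate_views([a, b])
```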
the step 3 is specifically: at each frame $t$, the generated local radiation field $V_t^{\mathrm{local}}$ is cyclically fused with the global radiation field $G_{t-1}$ generated at the previous frame to continuously update the global radiation field $G_t$; during the update, a gated recurrent unit learns to fuse the local volume of each frame, and its specific formulas are:

$$z_t = \operatorname{sigmoid}\big(\mathrm{NN}_z([G_{t-1}, V_t^{\mathrm{local}}])\big) ;$$
$$r_t = \operatorname{sigmoid}\big(\mathrm{NN}_r([G_{t-1}, V_t^{\mathrm{local}}])\big) ;$$
$$\tilde{G}_t = \tanh\big(\mathrm{NN}_g([r_t \odot G_{t-1}, V_t^{\mathrm{local}}])\big) ;$$
$$G_t = (1 - z_t) \odot G_{t-1} + z_t \odot \tilde{G}_t ;$$

where $z_t$ is the update gate and $\mathrm{NN}_z$ is the neural network controlling the update gate, $r_t$ is the reset gate and $\mathrm{NN}_r$ is the neural network controlling the reset gate, $\tilde{G}_t$ is the candidate global radiation field after fusing the current frame, and $\mathrm{NN}_g$ is the neural network controlling the sequential update of the whole model, used for sequentially updating the global reconstruction; $\odot$ denotes element-wise multiplication; $z_t$ and $r_t$ respectively control the contribution, during fusion, of the local radiation field $V_t^{\mathrm{local}}$ of the current frame and the global radiation field $G_{t-1}$ of the previous frame; during fusion the update is applied only to voxels where the local radiation field $V_t^{\mathrm{local}}$ of the current frame and the global radiation field $G_{t-1}$ of the previous frame overlap, while all other voxels remain unchanged;
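The gated-recurrent-unit fusion of step 3 can be sketched as follows (a toy NumPy sketch; the plain matrices `Wz`, `Wr`, `Wg` stand in for the gate networks and are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fuse(g_prev, local, Wz, Wr, Wg):
    """One GRU step fusing the current frame's local radiation field into
    the running global one: z gates how much is updated, r gates how much
    of the old global field feeds the candidate."""
    x = np.concatenate([g_prev, local])
    z = sigmoid(Wz @ x)                                     # update gate
    r = sigmoid(Wr @ x)                                     # reset gate
    g_cand = np.tanh(Wg @ np.concatenate([r * g_prev, local]))
    return (1.0 - z) * g_prev + z * g_cand                  # fused global field

# With the update gate forced shut (large negative weights on positive
# inputs), the global field is left unchanged by the new frame.
g_prev = np.array([1.0, 1.0])
local = np.array([0.5, 0.5])
Wz = -100.0 * np.ones((2, 4))
Wr = np.zeros((2, 4))
Wg = np.zeros((2, 4))
g_next = gru_fuse(g_prev, local, Wz, Wr, Wg)
```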
the step 4 is specifically: the generated global radiation field is put into a NeRF renderer to obtain, for any point, the point density $\sigma$ and the radiance value $c$; the formula for regressing the volume density and radiance is:

$$(\sigma, c) = \mathrm{MLP}(x, y, z, \theta, \varphi) ;$$

where $x$, $y$ and $z$ respectively denote the horizontal, vertical and longitudinal coordinates of the point, $\theta$ denotes the azimuth angle, and $\varphi$ denotes the polar viewing angle;
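A toy stand-in for such a density-and-radiance MLP is sketched below (illustrative single-hidden-layer network; the real model's architecture is not specified here):

```python
import numpy as np

def radiance_mlp(point, angles, W1, b1, W2, b2):
    """Map (x, y, z) position and (theta, phi) viewing angles to a
    non-negative point density sigma and an RGB radiance in [0, 1]."""
    inp = np.concatenate([point, angles])     # (5,) = x, y, z, theta, phi
    h = np.maximum(0.0, W1 @ inp + b1)        # ReLU hidden layer
    out = W2 @ h + b2                         # 4 raw outputs
    sigma = np.maximum(0.0, out[0])           # density kept non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))      # color squashed to [0, 1]
    return sigma, rgb

# With zero weights the network outputs zero density and mid-gray color.
W1, b1 = np.zeros((8, 5)), np.zeros(8)
W2, b2 = np.zeros((4, 8)), np.zeros(4)
sigma, rgb = radiance_mlp(np.zeros(3), np.zeros(2), W1, b1, W2, b2)
```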
after removing the points of zero point density $\sigma$, the volume density in each voxel is determined and stored in the corresponding voxel, and the volume density value is dynamically updated according to the following formula:

$$\sigma_t = (1 - w)\,\sigma_{t-1} + w\,\sigma_t^{\,n} ;$$

where $\sigma_t$ denotes the volume density of a voxel of the frame-$t$ global radiation field, $w$ is used to control the update weight, and $\sigma_t^{\,n}$ denotes the volume density of the voxel of the local radiation field of frame $t$ generated from the $n$-th picture.
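The weighted density update is a simple exponential blend; a minimal sketch (the weight value 0.2 is an assumed example, not from the patent):

```python
def update_voxel_density(sigma_global_prev, sigma_local, w=0.2):
    """Blend the previous global density with the density observed in the
    current frame's local field: sigma_t = (1 - w)*sigma_{t-1} + w*sigma_local."""
    return (1.0 - w) * sigma_global_prev + w * sigma_local

# One step: old density 1.0, new observation 0.0, weight 0.2 -> 0.8.
s1 = update_voxel_density(1.0, 0.0)

# Repeated observations of 0.0 drive the stored density toward 0.
s = 1.0
for _ in range(100):
    s = update_voxel_density(s, 0.0)
```

Repeated application converges geometrically toward the newly observed value, which is what lets the global field forget stale densities.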
2. The method for three-dimensional reconstruction of a neural radiation field based on fused voxels according to claim 1, wherein step 1 specifically comprises:
inputting the $n$ images $I$ taken by cameras with known parameters as a sequence into a two-dimensional convolutional neural network, extracting picture features from adjacent pictures, performing disparity matching on the picture features to obtain ordered disparity maps of the same size as the original images, and generating, from the disparity maps, depth maps whose pixels correspond one-to-one to the original images; the formula for converting a disparity map into a depth map is:

$$D = \frac{f\,B}{d - (c_l - c_r)} ;$$

where $D$ is the depth, $B$ is the baseline length, $f$ is the focal length, $d$ is the disparity, and $c_l$ and $c_r$ are the column coordinates of the principal points of the left and right views.
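The disparity-to-depth conversion is a one-liner; a minimal sketch (the function name is illustrative):

```python
def disparity_to_depth(d, baseline, focal, c_left=0.0, c_right=0.0):
    """Depth from a rectified stereo pair: D = f*B / (d - (c_l - c_r)),
    correcting the raw disparity for the principal-point column offset."""
    d_corr = d - (c_left - c_right)
    if d_corr <= 0:
        raise ValueError("non-positive corrected disparity has no finite depth")
    return focal * baseline / d_corr

# 10 px disparity, 0.1 m baseline, 500 px focal length -> 5 m depth.
z = disparity_to_depth(10.0, 0.1, 500.0)
```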
3. The method for three-dimensional reconstruction of a neural radiation field based on fusion voxels according to claim 1, wherein step 5 is specifically: removing the part of the neural radiation field outside the object surface according to the depth map, and retaining the part inside the object surface.
4. The method for three-dimensional reconstruction of a neural radiation field based on fusion voxels according to claim 1, wherein step 6 is specifically: removing voxel blocks in the neural radiation field whose volume density falls below the threshold, and retaining the voxel blocks whose volume density lies within the required range.
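The density-threshold filtering of step 6 amounts to a range mask over voxel densities; a minimal sketch (illustrative function name and bounds):

```python
import numpy as np

def filter_voxels(densities, low, high):
    """Keep only voxel blocks whose volume density lies inside the required
    [low, high] range; everything else is discarded as empty or noisy."""
    mask = (densities >= low) & (densities <= high)
    return np.where(mask)[0], densities[mask]

# Densities 0.0 and 9.0 fall outside [0.3, 5.0] and are removed.
idx, kept = filter_voxels(np.array([0.0, 0.5, 2.0, 9.0]), 0.3, 5.0)
```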
5. The method for three-dimensional reconstruction of a neural radiation field based on fused voxels according to claim 1, wherein step 7 is specifically: repeating steps 2 to 6, putting the global radiation field filtered by the volume density threshold and the depth map into a volume renderer for volume rendering, weighting and summing the colors of the points sampled on each ray to obtain the final rendered color, and calculating the loss from the volume rendering result and continuously optimizing; the formulas by which volume rendering computes the color are:

$$\alpha_i = 1 - \exp(-\sigma_i \delta_i) ;$$
$$T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) ;$$
$$w_i = T_i\,\alpha_i ;$$
$$\hat{C}(r) = \sum_{i=1}^{N} w_i\, c_i ;$$

where $\alpha_i$ denotes the opacity function of ray sampling point $i$, $\sigma_i$ the point density at sampling point $i$, $\delta_i$ the interval of sampling point $i$, $T_i$ the transmittance function of sampling point $i$, accumulated from the opacity $\alpha_j$ of all sampling points $j$ before point $i$, $w_i$ the probability density function of sampling point $i$, $\hat{C}(r)$ the final rendered color of ray $r$, $N$ the number of samples on ray $r$, and $c_i$ the radiance value of sampling point $i$; for NeRF, the difference between the rendered color and the ground truth after volume rendering is used as the loss function:

$$\mathcal{L} = \sum_{r \in R} \left\lVert \hat{C}(r) - C(r) \right\rVert_2^2 ;$$

where $\mathcal{L}$ denotes the loss function, $R$ is the set of rays, and $C(r)$ is the true color of ray $r$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310947466.1A CN116664782B (en) | 2023-07-31 | 2023-07-31 | Neural radiation field three-dimensional reconstruction method based on fusion voxels |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116664782A CN116664782A (en) | 2023-08-29 |
CN116664782B true CN116664782B (en) | 2023-10-13 |