CN117036612A - Three-dimensional reconstruction method based on a neural radiance field

Three-dimensional reconstruction method based on a neural radiance field

Info

Publication number: CN117036612A
Application number: CN202311052945.3A
Authority: CN (China)
Prior art keywords: color, image, SDF, pixel, calculating
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 曲英杰, 安恒哲, 徐徐升
Current and original assignee: Wuhan Chuangsheng Infinite Digital Technology Co., Ltd. (the listed assignee may be inaccurate)
Priority and filing date: 2023-08-18 (the priority date is an assumption and is not a legal conclusion)
Publication date: 2023-11-10
Application filed by Wuhan Chuangsheng Infinite Digital Technology Co., Ltd.; priority to CN202311052945.3A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/00 - Image analysis
    • G06T 7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 - Image registration using feature-based methods
    • G06T 7/337 - Image registration using feature-based methods involving reference images or patches
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10004 - Still image; photographic image
    • G06T 2207/10012 - Stereo images
    • G06T 2207/30 - Subject of image; context of image processing
    • G06T 2207/30244 - Camera pose
    • Y04S 10/00 - Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 - Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The application discloses a three-dimensional reconstruction method based on a neural radiance field, which converts the volume density in the neural radiance field into an SDF and uses a volume rendering equation to regress the spatial SDF and color. Meanwhile, the application also provides a method for constraining the object surface, where the SDF equals 0, with multi-view geometric consistency: by optimizing the position and orientation of the object surface so that the surface texture is consistent across views, the quality of the generated mesh model is further improved. The method solves the problems of low geometric accuracy and high noise caused by the implicit expression of the neural radiance field in the prior art.

Description

Three-dimensional reconstruction method based on a neural radiance field
Technical Field
The application belongs to the field of image processing, relates to image-based three-dimensional reconstruction technology, and in particular relates to a three-dimensional reconstruction method based on a neural radiance field.
Background
Three-dimensional reconstruction refers to the conversion of an object or scene in the real world into a digitized three-dimensional model by computer technology. It has wide application in many fields including virtual reality, augmented reality, computer vision, robotics, etc. Three-dimensional reconstruction can be achieved by different methods and techniques, wherein human reconstruction is an important direction in the field of three-dimensional reconstruction.
Currently, there are many methods for three-dimensional reconstruction of the human body, including depth camera-based methods and multi-view image-based methods. The following is a brief description of some prior art schemes:
The structured light method: three-dimensional shape information of the human surface is obtained by projecting a structured light source and measuring its reflection using a structured light camera or projector and a depth sensor. This method is often used for human reconstruction in static scenes, but it is sensitive to illumination conditions, may be disturbed by ambient light, performs poorly in outdoor or unevenly lit scenes, and has difficulty handling transparent or reflective surfaces. Its cost is also far higher than that of multi-view methods because hardware such as three-dimensional sensors must be purchased.
The multi-view reconstruction method: the human body is photographed from different viewing angles using multiple cameras or a camera array, and the three-dimensional shape and posture of the human body are then recovered through image matching and three-dimensional reconstruction algorithms. The method is suitable for human body reconstruction in static scenes and can produce highly realistic three-dimensional models from multi-view images; the present patent also belongs to this category.
Multi-view reconstruction schemes can be further classified into feature-matching-based approaches and neural-radiance-field-based approaches, described in turn below:
Feature-matching-based methods:
Feature-matching-based multi-view reconstruction relies on extracting and matching image feature points across multiple views to recover the geometry and camera poses of a three-dimensional scene. The general flow is as follows:
Feature extraction and matching: key points and their descriptors are extracted from the images of each view, typically using a feature detection and description algorithm (e.g., SIFT, SURF, ORB). Feature descriptors from different viewing angles are compared to perform feature matching and establish correspondences between images.
Multi-view stereo matching: based on the sparse feature-point matches, matching is further performed at every pixel position of the image to acquire denser depth or disparity information. Dense matching comprises: matching cost calculation, where the matching cost at each pixel position measures the similarity between corresponding pixels in two images (common cost measures include gray-scale differences and disparity consistency); matching energy optimization, where an energy optimization algorithm (e.g., dynamic programming or graph cuts) infers the best match for each pixel position from the computed costs, i.e., determines the depth or disparity value of each pixel; and interpolation and filtering, which can be applied to the resulting dense depth or disparity map to improve its smoothness and accuracy.
Mesh reconstruction: a mesh is constructed from the generated point cloud and output as a mesh model, i.e., the points of the point cloud are connected into triangular patches representing the geometric structure of the scene; noise is removed, holes are filled, and a smooth surface is generated. Common methods include Poisson reconstruction and Marching Cubes.
Neural-radiance-field-based methods:
The neural radiance field has been a research hotspot in the 3D vision field in recent years. Trained on discrete multi-view images, a neural radiance field can output highly realistic images from new viewing angles, and this technical breakthrough can likewise be applied to three-dimensional model reconstruction, promising to solve the weak-texture and specular-reflection problems that are difficult for the prior art. A neural radiance field is a neural-network-based representation for modeling and reconstructing a scene. It learns an implicit representation of the scene by training a neural network and can capture information such as geometry and appearance. In the neural radiance field approach, the scene is treated as a function that maps input ray parameters to properties of the scene such as color and transparency. By training the neural network, this mapping can be learned, enabling reconstruction and rendering of the scene; the calculation process is shown in fig. 3. Neural radiance field methods can model complex geometry and detail, including surfaces, edges, and textures. Moreover, the neural radiance field is a continuous representation that can achieve high spatial resolution and thus more realistic visual effects. In addition, owing to the strong learning capacity of neural networks, neural radiance field methods can infer missing information from limited observations, completing and reconstructing incomplete data.
However, neural radiance field methods learn an implicit representation of the scene, typically expressing it as a volume density at every location in space rather than focusing on the object surface, so the extracted geometry has low accuracy and high noise. The present patent therefore proposes a volume rendering method better suited to surface reconstruction, based on the signed distance function (SDF), and uses a multi-view geometric image consistency constraint to improve the quality of the surface reconstruction.
Existing neural radiance field techniques can render highly realistic images, learning the volume density and color of space from multi-view colors based on a volume rendering equation. The generated volume density can be converted into a mesh model through the Marching Cubes algorithm, but the mesh model output by existing neural radiance fields suffers from high noise and low precision and is difficult to apply in practice.
Patent CN 114972632 A, an image processing method and device based on the neural radiance field, comprises: determining sampling position information, sampling viewing-angle information, and a target object image in response to an image processing instruction; inputting the sampling position information, sampling viewing-angle information, and target object image into a pre-trained neural radiance field model to obtain the density information, color feature value, and signed distance function value corresponding to each sampling point output by the model, where the neural radiance field model is a machine learning model; determining at least two subsets of sampling points based on the signed distance function value of each sampling point; and rendering according to the density information and color feature values of the sampling points in each subset to generate a target sub-object for each subset, and generating the target object from the target sub-objects. This method still performs three-dimensional reconstruction based on volume density, i.e., an implicit expression, so the problems of high noise and low precision are not well solved; moreover, a set of data usually requires about 8 hours of training, and network convergence consumes a large amount of GPU computing power.
Disclosure of Invention
The application aims to solve the above problems in the prior art and provides a three-dimensional reconstruction method based on a neural radiance field. It is an SDF-based volume rendering method in which the volume density is constrained to be distributed on the object surface in an unbiased manner: the volume density in the neural radiance field is converted into an SDF, and the volume rendering equation is used to regress the spatial SDF and color. Meanwhile, the application also provides a method for constraining the object surface, where the SDF equals 0, with multi-view geometric consistency: by optimizing the position and orientation of the object surface so that the surface texture is consistent across views, the quality of the generated mesh model is further improved. The method solves the problems of low geometric accuracy and high noise caused by the implicit expression of the neural radiance field in the prior art.
In order to solve the technical problems, the application adopts the following technical scheme:
in one aspect, the application provides a three-dimensional reconstruction method based on a neural radiance field, comprising the following steps:
collecting multi-view images;
acquiring the pose information of each image;
establishing a three-dimensional scene space enclosing the modeling object, randomly establishing m pixel rays across all images using volume rendering, randomly sampling n points on each pixel ray to obtain a set of m×n sampling point positions $\{X_i\}$, and recording the viewing angle $V_i$ of each sampling point, forming a data sample; establishing m pixel rays multiple times and sampling to obtain a sample set;
constructing a neural radiance field network model capable of predicting the SDF;
taking the position and viewing angle of each sampling point in the training sample as input, training the neural radiance field network model, and outputting the SDF value and color of each sampling point;
calculating the ray color loss: calculating the color of each pixel ray from the predicted sampling-point SDF values and colors, and calculating the difference between this color and the true color of the pixel ray in the multi-view image to obtain the color loss $L_{color}$;
image consistency constraint: while predicting the pixel ray color, obtaining or calculating the point on each pixel ray where the SDF value equals 0, i.e., the surface point, and calculating the similarity of the multi-view images to obtain the image consistency cost $L_{photo}$;
calculating the total cost from the color loss $L_{color}$ and the image consistency cost $L_{photo}$, and adjusting the parameters of the neural radiance field network model using the total cost;
selecting a new training sample from the sample set and training the adjusted neural radiance field network model again until the color loss $L_{color}$ and the image consistency cost $L_{photo}$ converge;
uniformly sampling the three-dimensional scene space, predicting the SDF values of all sampling points with the converged neural radiance field network model, and generating a mesh model from the SDF values of all sampling points.
Further, acquiring the pose information of each image comprises:
calculating the pose information of the camera at the time each image was captured using a structure-from-motion method, the pose information comprising the camera position and camera orientation.
Further, the neural radiance field network model comprises an SDF prediction module and a color prediction module, each sub-module comprising an input layer, several hidden layers, and an output layer;
the input parameter of the input layer of the SDF prediction module is the sampling point position, and the output parameters of its output layer comprise an output feature vector and the SDF value;
the input layer of the color prediction module has four input parameters: the sampling point viewing angle, the input feature vector, the normal vector, and the sampling point position;
the output feature vector of the SDF prediction module is used directly as the input feature vector of the color prediction module, the SDF value output by the SDF prediction module is used as an input of the color prediction module after the normal vector is computed from it by a normal vector calculation module, and the output of the color prediction module is the color of the sampling point.
Further, the SDF prediction module has 8 hidden layers and the color prediction module has 4 hidden layers; all hidden layers have the same number of neurons, and the input layer is connected to an intermediate hidden layer vector by a skip connection.
Further, the sampling point position input of the SDF prediction module and the sampling point viewing-angle input of the color prediction module are both expanded in dimension using positional encoding.
Further, the specific method of ray color prediction is as follows:
traversing the sampling points on the pixel ray to obtain the SDF values $d(X_i)$;
calculating the opacity from the SDF values $d(X_i)$;
calculating the transmittance $T_i$ from the opacity of the sampling points;
calculating the ray color $c(r)$ from the transmittance $T_i$ and color $c_i$ of all sampling points on the pixel ray;
traversing all pixel rays and calculating the color difference against the corresponding pixels in the original images to obtain the color loss $L_{color}$ of the training sample.
Further, the method for calculating the similarity of the multi-view images is as follows:
selecting a pixel ray, calculating the normal vector at the surface point, and obtaining the tangent plane of the surface point from the normal vector;
selecting a comparison frame on the tangent plane, centered on the surface point of the pixel ray, as the image block of the source image;
selecting several reference images in the three-dimensional scene space and selecting comparison frames of the same size as their image blocks;
calculating the NCC value between the source image and each reference image;
selecting the several largest NCC values to calculate the consistency cost $L_j$ of the source image;
averaging the consistency costs $L_j$ over all pixel rays as the consistency cost $L_{photo}$.
Further, the total cost is calculated as follows:
$L = L_{color} + \alpha L_{photo} + \beta L_{reg}$
where $L$ is the total cost, $\alpha$ is the weight of the image consistency cost, $L_{reg}$ is the Eikonal term at the sampling points, and $\beta$ is the weight of the Eikonal term.
In another aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the three-dimensional reconstruction method as described above when executing the program.
In another aspect, the application provides a computer program product comprising a computer program which, when executed by a processor, implements a three-dimensional reconstruction method as described above.
Compared with the prior art, the application has the following advantages:
In the method, an SDF is introduced on the basis of the neural radiance field technical framework and a functional relation between the SDF and the volume density is established, transferring the optimization target of volume rendering from the volume density to the SDF. Because the SDF is uniformly distributed in space, the color weights are constrained to concentrate on the object surface, avoiding the low precision and loss of fine detail caused by the implicit expression. In addition, the method proposes a multi-view geometric image consistency constraint on top of the SDF optimization, which exploits the similarity of multi-view image textures to further improve surface detail and accuracy.
Drawings
Fig. 1 is a schematic flow chart of the three-dimensional reconstruction method based on a neural radiance field.
Fig. 2 is a logic overview diagram of the three-dimensional reconstruction method based on a neural radiance field of the present application.
Fig. 3 is a diagram of the calculation process of a prior-art neural radiance field.
Fig. 4 is a diagram of the SDF-based neural radiance field calculation process in an embodiment of the present application.
Fig. 5 is a diagram of the neural radiance field network model for predicting the SDF in an embodiment of the present application.
FIG. 6 is a schematic diagram of image consistency constraints in an embodiment of the present application.
FIG. 7 is a process diagram of an image consistency constraint technique in an embodiment of the application.
FIG. 8 is a schematic diagram of the volume rendering principle of the present application.
100-input vector, 200-hidden layer, 300-output vector, 400-position coding, 500-three-dimensional scene space, 600-image, 700-modeled object, 800-pixel ray, 900-sample point.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the application but are not intended to limit the scope of the application.
In describing the present application, the terms used are first explained as follows:
Neural radiance field (Neural Radiance Field): a deep-learning-based model for modeling and rendering scenes. It represents the radiance properties of each point in the scene, such as color and illumination, by training a neural network. With a neural radiance field, high-quality, realistic images and videos can be generated.
Multi-view geometric consistency (Multi-view Geometry Consistency): multiview geometric consistency refers to maintaining geometric consistency between images or point clouds at multiple perspectives. By using multi-view geometric consistency, tasks such as scene reconstruction, camera attitude estimation, three-dimensional reconstruction and the like can be performed, so that the accuracy and consistency of results are improved.
Implicit surface (Implicit Surface): an implicit surface is a mathematical function that represents the surface of a three-dimensional object. It defines an equation that determines, from input coordinates, whether the coordinates lie on the surface. Implicit surfaces can represent a variety of shapes, including surfaces and volumes, without explicitly defining meshes or vertices.
Signed distance function (Signed Distance Function, SDF): the signed distance function is a function describing geometry. For a given point, the SDF returns the signed distance from that point to the nearest surface. It can indicate whether a point lies inside, outside, or on the surface of a geometric body, as well as the magnitude and direction of the point's distance from the surface.
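As a standard illustration (not specific to this application), the SDF of a sphere with center $c$ and radius $r$ is
$d(x) = \lVert x - c \rVert_2 - r$
which is negative inside the sphere, zero exactly on its surface, and positive outside.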
Volume Rendering (Volume Rendering): volume rendering is a technique for visualizing three-dimensional volumetric data. It calculates the color and transparency of each pixel by sampling and ray tracing the volume data, thereby generating a realistic image. Volume rendering is widely used in the fields of medical imaging, scientific visualization, computer graphics, and the like.
Mesh model (Mesh): a mesh model is a three-dimensional geometry consisting of vertices, edges, and faces. It consists of a series of triangular or quadrilateral patches for representing and processing complex geometries. The grid model is widely used in the fields of computer graphics, finite element analysis, physical simulation and the like.
The Marching Cubes method: Marching Cubes is an algorithm for generating a mesh model from volumetric data. It samples the volume data, determines how the surface passes through each cell according to the values around the sampling points, generates triangular patches accordingly, and finally constructs a mesh model representing the surface.
Eikonal term (Eikonal Term): the Eikonal term derives from the Eikonal equation used to solve a velocity field; the Eikonal equation describes the propagation speed along the shortest path from a start point to a given point. Herein, it constrains the SDF values to vary uniformly in space, i.e., the gradient of the SDF to have unit norm.
Positional encoding (Positional Encoding): positional encoding is a technique used in deep learning and natural language processing to embed the positional information of elements in a sequence into a model, helping the model understand the relative positions between elements. A positional encoding is typically a fixed vector computed from the positional order of the elements. The most common method uses sine and cosine functions and is called sinusoidal positional encoding: for each element, the encoding consists of a series of fixed sine and cosine function values, each corresponding to a different frequency, so that positional information can be encoded.
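As a reference formulation (the standard sinusoidal encoding, assumed here), each coordinate $p$ is mapped through $L$ frequencies:
$\gamma(p) = \left( \sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p) \right)$
Concatenated with the raw coordinate, a 3-vector expands to $3 + 6L$ dimensions: 39 for $L = 6$ and 27 for $L = 4$, matching the dimensions used in the embodiment below.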
Softplus: an activation function used in machine learning, defined as:
$\mathrm{Softplus}(x) = \log(1 + e^{x})$
the NCC (Normalized Cross Correlation) algorithm is a matching algorithm based on pixel gray value similarity. The optimal matching position and the parallax value are determined by calculating the gray value correlation coefficient in the vicinity of each pixel point in the left image and the right image.
The CMVS (Clustering Views for Multi-View Stereo) algorithm is a view-clustering algorithm that clusters and classifies images to reduce the time and space cost of dense matching.
The SfM (Structure from Motion) algorithm is an offline three-dimensional reconstruction algorithm that can automatically match and calibrate cameras for an unordered image set and generate a sparse point cloud.
As shown in fig. 1 and 2, the present application provides a three-dimensional reconstruction method based on a neural radiance field, comprising the following steps:
s1: collecting multi-view images;
specifically, multiple images of the same object from different angles are collected; the acquisition mode is not limited and may be surround shooting with a camera carried by an unmanned aerial vehicle, a handheld camera, or other equipment. The applicant has also developed dedicated image acquisition equipment, for example in patent applications CN 2023212468661 and CN 2023300368776, in which several vertically arranged cameras and a rotatable tray provide omnidirectional, multi-angle image acquisition of a human body. The device is further equipped with a controllable light source to ensure sufficient image brightness and clearer image quality.
In this embodiment, visual information of the human body at different angles and orientations, including surround images of the head, chest, abdomen, legs, and feet, is acquired through surround capture. 100 images are extracted from the surround capture for three-dimensional reconstruction, and a suitable overlap (e.g., 15°) is set between adjacent views to ensure the accuracy of the pose estimation and obtain a high-quality pose estimation result.
S2: acquiring the pose information of the camera at the time each image was captured, and establishing a three-dimensional scene space enclosing the modeling object;
In this embodiment, the COLMAP software calculates the human body image poses using the SfM algorithm to provide accurate and fine image pose estimation, thereby obtaining the pose information of the camera for each image together with the sparse points of the scanned object; the camera pose information comprises the camera position and camera orientation. Specifically, in this embodiment COLMAP takes the center of the scanned object (the modeling object) as the origin and establishes a three-dimensional coordinate system in space; the camera position is represented by the three-dimensional coordinate of the center of the camera's imaging aperture, and the camera orientation is represented by a unit vector in the shooting direction, specifically the unit vector connecting the center of the image with the center of the camera's imaging aperture. As shown in fig. 8, the outer bounding box of the sparse points is the three-dimensional scene space 500 enclosing the modeling object, which can be represented by plane equations in three-dimensional space.
S3: randomly establishing m pixel rays 800 across all images using volume rendering, a pixel ray being a ray emitted from a pixel point on an image that passes through the center of the camera's imaging aperture to reach the modeling object; randomly sampling n points on each pixel ray to obtain a set of m×n sampling point positions $\{X_i\}$, and recording the viewing angle $V_i$ of each sampling point, forming a data sample; establishing m pixel rays multiple times and sampling to obtain a sample set.
The specific mode in this embodiment is as follows:
as shown in fig. 8, a pixel point is selected in an image 600; the coordinates of the pixel point in the three-dimensional coordinate system can be calculated from the camera parameters, and combined with the camera coordinates, a ray from the pixel point through the center of the camera's imaging aperture to the modeling object 700, called a pixel ray 800, can be established. The pixel point used to establish a pixel ray is selected randomly within the image, and the image itself is also selected randomly; in this embodiment 512 random selections are made (m = 512), yielding 512 pixel rays. On each pixel ray, 128 points (n = 128) are randomly selected as sampling points 900, yielding 65536 sampling points; the sampling point viewing angle is the unit vector coincident with the pixel ray 800. The 65536 sampling points and their unit vectors together form one data sample, and later training proceeds data sample by data sample.
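A minimal sketch of this sampling step follows, assuming NumPy and a hypothetical per-camera record with intrinsics K, rotation R (world to camera), and center C; the data layout and the near/far bounds are assumptions, not fixed by this application:

    import numpy as np

    def sample_rays(images, cameras, m=512, n=128, t_near=0.0, t_far=4.0):
        """Randomly build m pixel rays and sample n points per ray.
        cameras[i] is assumed to hold 'K', 'R', and camera center 'C'."""
        rng = np.random.default_rng()
        origins, dirs = [], []
        for _ in range(m):
            idx = rng.integers(len(images))            # random image
            cam = cameras[idx]
            h, w = images[idx].shape[:2]
            px, py = rng.integers(w), rng.integers(h)  # random pixel
            # Back-project the pixel through the center of the imaging aperture.
            d_cam = np.linalg.inv(cam["K"]) @ np.array([px + 0.5, py + 0.5, 1.0])
            d = cam["R"].T @ d_cam                     # rotate into world frame
            dirs.append(d / np.linalg.norm(d))         # unit viewing angle V_i
            origins.append(cam["C"])
        origins, dirs = np.stack(origins), np.stack(dirs)
        t = np.sort(rng.uniform(t_near, t_far, (m, n)), axis=1)  # sorted depths
        points = origins[:, None, :] + t[..., None] * dirs[:, None, :]
        return points, dirs, t    # (m, n, 3) positions {X_i}, (m, 3) views V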
S4: constructing a neural radiance field network model (MLP) capable of predicting the SDF;
as shown in FIG. 5, the neural radiance field network model of the application comprises two sub-modules, an SDF prediction module ($\mathrm{MLP}_{sdf}$) and a color prediction module ($\mathrm{MLP}_{color}$), each comprising an input layer, several hidden layers, and an output layer; to enhance nonlinearity, all hidden layers use Softplus as the activation function.
In particular, the SDF prediction module ($\mathrm{MLP}_{sdf}$) has 8 hidden layers in total; the input parameter of its input layer is the sampling point position, and the output parameters of its output layer comprise an output feature vector and the SDF value of the sampling point.
The input layer of the color prediction module ($\mathrm{MLP}_{color}$) has four input parameters: the sampling point viewing angle, the input feature vector, the normal vector, and the sampling point position.
The output feature vector of the SDF prediction module is used directly as the input feature vector of the color prediction module, the SDF value output by the SDF prediction module is used as an input of the color prediction module after the normal vector is computed from it, and the output of the color prediction module is the color of the sampling point.
It should be noted that, to enhance the expressive power of low-dimensional vectors, the application uses positional encoding to encode the sampling point position X input to the SDF prediction module and the sampling viewing angle V input to the color prediction module. The sampling point position X uses six frequencies, extending it from 3 to 39 dimensions; the sampling point viewing angle V uses four frequencies, extending it from 3 to 27 dimensions. These values are only the preferred embodiment of the application and are not the only possibility.
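A sketch of this dimension expansion (3 + 3·2·6 = 39 and 3 + 3·2·4 = 27), assuming the standard sinusoidal form described in the terms above:

    import torch

    def positional_encoding(x: torch.Tensor, num_freqs: int) -> torch.Tensor:
        """Keep the raw input and append sin/cos at num_freqs octaves,
        so a 3-vector becomes 3 + 3*2*num_freqs dimensions."""
        out = [x]
        for k in range(num_freqs):
            out.append(torch.sin((2.0 ** k) * torch.pi * x))
            out.append(torch.cos((2.0 ** k) * torch.pi * x))
        return torch.cat(out, dim=-1)

    # position X: 6 frequencies -> 39 dims; viewing angle V: 4 -> 27 dims
    assert positional_encoding(torch.zeros(1, 3), 6).shape[-1] == 39
    assert positional_encoding(torch.zeros(1, 3), 4).shape[-1] == 27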
In this embodiment, all hidden layers have the same number of neurons, and to prevent gradients from vanishing or exploding, the input layer is connected to an intermediate hidden layer vector by a skip connection. Specifically, all hidden layers have 256 neurons. It should be emphasized that 256 neurons is only the preferred example of the application: in general, more hidden-layer neurons give better precision but require more computing power and longer modeling time, and hidden layers of 256 neurons strike a balance between modeling precision and computing cost.
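A sketch of the two sub-modules under the description above; the skip-connection layer index and the output feature width are assumptions, and PyTorch is used for illustration:

    import torch
    import torch.nn as nn

    class SDFNet(nn.Module):
        """MLP_sdf sketch: 8 hidden layers of 256 Softplus neurons, with the
        encoded input re-injected at a middle layer (index 4 is an assumption);
        outputs the SDF value d(X) plus a feature vector."""
        def __init__(self, in_dim=39, hidden=256, feat_dim=256, skip=4):
            super().__init__()
            self.skip = skip
            self.layers = nn.ModuleList()
            for i in range(8):
                d_in = in_dim if i == 0 else hidden
                if i == skip:
                    d_in += in_dim            # skip connection from the input
                self.layers.append(nn.Linear(d_in, hidden))
            self.head = nn.Linear(hidden, 1 + feat_dim)   # SDF value + feature
            self.act = nn.Softplus()

        def forward(self, x_enc):
            h = x_enc
            for i, layer in enumerate(self.layers):
                if i == self.skip:
                    h = torch.cat([h, x_enc], dim=-1)
                h = self.act(layer(h))
            out = self.head(h)
            return out[..., :1], out[..., 1:]             # d(X), feature

    class ColorNet(nn.Module):
        """MLP_color sketch: inputs are position, encoded viewing angle,
        feature vector, and normal vector; 4 hidden layers of 256 neurons."""
        def __init__(self, in_dim=3 + 27 + 256 + 3, hidden=256):
            super().__init__()
            layers, d = [], in_dim
            for _ in range(4):
                layers += [nn.Linear(d, hidden), nn.Softplus()]
                d = hidden
            layers += [nn.Linear(hidden, 3), nn.Sigmoid()]  # RGB color c_i
            self.net = nn.Sequential(*layers)

        def forward(self, x, v_enc, feat, normal):
            return self.net(torch.cat([x, v_enc, feat, normal], dim=-1))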
S5: taking the position and viewing angle of each sampling point in the training sample as input, training the neural radiance field network model, and outputting the SDF value and color of each sampling point. The SDF prediction module takes the sampling point position X as input and predicts the sampling point SDF value d(X) while also outputting a feature vector; the color prediction module takes the sampling point position, the sampling point viewing angle, the feature vector, and the normal vector N as input and predicts the sampling point color $c_i$. The normal vector N is the gradient of the SDF at the sampling point:
$N = \nabla d(X)$
s6: calculating the light color loss: calculating the light color of the pixel light according to the predicted sampling point SDF value and the sampling point color, and calculating the real color difference between the light color and the pixel light in the multi-view image to obtain color loss L color
In the volume rendering frame, a three-dimensional scene space is established, and an image pixel emits a beam of light rays which pass through the three-dimensional scene space and intersect with the boundary of the three-dimensional scene space at O near (hereinafter abbreviated as origin) and O far At O near And O far A series of points can be sampled, namely sampling points on the pixel light rays, and the representation forms are as follows:
X i =O near +t i v formula (1)
Wherein X is i Is the ith sampling point on the pixel light, is 3-dimensional coordinates, V is the light direction, is a unit vector, t i Is a constant representing the i-th sampling point and the original point O near A distance therebetween; then the rendered color c (r) of the pixel ray r passes through the predicted color c for each sample point along the line i The pixel light color calculation formula is as follows:
t in the above i Is the transmissivity factor of the i-th sample point, representing the probability of a ray reaching that point without encountering any obstruction. Delta i Is the opacity of the i-th sample point. The color weight of each sample point is affected by two factors: transmittance T i And opacity delta i All of them being bulk density sigma i The calculation formula is as follows:
δ i =1-exp(-σ i ·Δt i ) Formula (3)
Δt i Is the interval between two adjacent sampling points, deltat i =t i+1 -t i . To train the SDF network with the volume rendering method and constrain the volume density as close as possible to the object surface, a probability density function of S-density, denoted as φ, is introduced herein s (d(X i ) And d (X) i ) For sampling point X i The signed distance at, i.e. the SDF value at the sample point, which is related to the opacity delta i The functional relation between the two is as follows:
wherein phi is s The formula is as follows for a unimodal distribution of sigmoid functions:
s is the extent to which the learnable parameter controls the bulk density near the surface. According to the above formula, the optimization object of volume rendering changes from volume density to SDF so that the neural radiation field can directly predict the spatial SDF field. Finally, the SDF and color of the scene may be learned by minimizing the difference between the rendered color and the true color of the input image.
As shown in fig. 4, step S6 can be implemented as follows:
S6.1: traversing the sampling points on the pixel ray to obtain the SDF value $d(X_i)$ of each sampling point (output by the SDF prediction module);
S6.2: calculating the opacity $\delta_i$ from the SDF value $d(X_i)$ using formula (5);
S6.3: calculating the transmittance $T_i$ from the opacity $\delta_i$ of the sampling points using formula (4);
S6.4: calculating the ray color $c(r)$ from the transmittance $T_i$ and color $c_i$ of all sampling points on the pixel ray using formula (2);
S6.5: traversing all pixel rays and calculating the color difference against the corresponding pixels in the original images to obtain the color loss $L_{color}$ of the training sample:
$L_{color} = \frac{1}{|R|} \sum_{r \in R} \left\lVert c(r) - c_{GT}(r) \right\rVert_{2}$
where $R$ is the set of all pixel rays sampled in each training iteration, i.e., all pixel rays in one data sample (512 pixel rays in this embodiment, giving a set of 512 pixel rays); $c(r)$ is the predicted scene color of each pixel ray; $c_{GT}(r)$ is the pixel color corresponding to the ray in the original image; and $\lVert \cdot \rVert_{2}$ denotes the 2-norm.
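A sketch of steps S6.1 to S6.5, assuming the reconstructed forms of formulas (2), (4), and (5) above; the sharpness s is shown as a fixed value although the application treats it as learnable:

    import torch

    def render_ray_colors(sdf, colors, s=64.0):
        """sdf: (m, n) SDF values along each ray; colors: (m, n, 3).
        Returns the (m, 3) rendered ray colors c(r)."""
        phi = torch.sigmoid(s * sdf)                       # Phi_s(d(X_i))
        # Opacity from adjacent SDF values, clamped at zero (formula (5)).
        delta = ((phi[:, :-1] - phi[:, 1:]) / (phi[:, :-1] + 1e-7)).clamp(min=0)
        # Transmittance: product of (1 - delta_j) for j < i (formula (4)).
        T = torch.cumprod(torch.cat([torch.ones_like(delta[:, :1]),
                                     1.0 - delta[:, :-1]], dim=1), dim=1)
        w = T * delta                                      # color weights
        return (w[..., None] * colors[:, :-1]).sum(dim=1)  # formula (2)

    def color_loss(pred, gt):
        """L_color: mean 2-norm color difference over the sampled rays."""
        return torch.linalg.norm(pred - gt, dim=-1).mean()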
S7: image consistency constraint: while predicting the pixel ray color, the point on each pixel ray where the SDF value equals 0, i.e., the surface point, is obtained or calculated, and the similarity of the multi-view images is calculated to obtain the image consistency cost $L_{photo}$.
The principle of the image consistency constraint is shown in fig. 6. Because object materials differ and scene illumination changes, the pixel colors at the same object surface position are inconsistent across views. However, the intrinsic texture of the surface remains similar across views; on this basis, the application uses image consistency to supervise the SDF network learning, ensuring that the learned surface is geometrically consistent across views.
The scene is represented by the implicit SDF network, and the extracted object surface (the scanned object surface) is the zero level set of the implicit function:
$S = \left\{ X \in \mathbb{R}^{3} \mid d(X) = 0 \right\}$
The application constrains the surface through multi-view stereo consistency while learning the SDF and color of the scene from image rays. Surface points where the SDF equals zero can be found along each ray: n points are sampled along the ray, corresponding to three-dimensional coordinates X, and the SDF value of each point is denoted d(X), abbreviated d(t) for simplicity. A ray passing through the object surface always has two adjacent sampling points, one inside the object with SDF value < 0 and the other outside with SDF value > 0; such sampling points satisfy:
$M = \left\{ t_i \mid d(t_i) \cdot d(t_{i+1}) < 0 \right\}$    formula (9)
where M is the set of all sampling points satisfying the above condition, $t_i$ is the distance from the i-th sampling point to the origin, $t_{i+1}$ is the distance from the (i+1)-th sampling point to the origin, and $d(t_i)$ is the SDF value of the i-th sampling point.
The surface point is then obtained by linear interpolation:
$t^{*} = \frac{d(t_i)\, t_{i+1} - d(t_{i+1})\, t_i}{d(t_i) - d(t_{i+1})}$    formula (10)
where $t^{*}$ is the surface point at which the SDF equals 0;
$X^{*} = O_{near} + t^{*} V$    formula (11)
where $X^{*}$ is the point at which the pixel ray passes through the object surface (the scanned object); $O_{near}$ is the near intersection of the ray emitted from the image pixel with the boundary of the three-dimensional scene space, i.e., the origin; $t^{*}$ is the distance from the surface point $X^{*}$ to the origin; and V is the unit vector of the pixel ray (also called the ray direction). A ray may pass through the object several times, i.e., there may be several $X^{*}$; the outermost point, with the smallest distance $t^{*}$, is chosen for optimization because it has the smallest probability of being occluded.
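A sketch of the sign-change search and interpolation of formulas (9) to (11), keeping the outermost crossing per ray:

    import torch

    def find_surface_points(o, v, t, sdf):
        """o: (m, 3) origins O_near; v: (m, 3) unit directions; t: (m, n)
        sample distances; sdf: (m, n) SDF values. Returns the (m, 3) surface
        points X* (NaN where a ray never crosses the surface)."""
        crossing = sdf[:, :-1] * sdf[:, 1:] < 0            # set M, formula (9)
        # Linear zero crossing between t_i and t_{i+1} (formula (10)).
        t_star = (sdf[:, :-1] * t[:, 1:] - sdf[:, 1:] * t[:, :-1]) \
                 / (sdf[:, :-1] - sdf[:, 1:] + 1e-9)
        t_star = torch.where(crossing, t_star,
                             torch.full_like(t_star, float("inf")))
        t_min, _ = t_star.min(dim=1)                       # outermost crossing
        t_min = torch.where(torch.isinf(t_min),
                            torch.full_like(t_min, float("nan")), t_min)
        return o + t_min[:, None] * v                      # X* = O_near + t* V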
In the SDF field, the normal vector at point $X^{*}$ is the gradient of the SDF at that point:
$N^{*} = \nabla d(X^{*})$
i.e., the SDF value is differentiated at the point of interest $X^{*}$.
Thus, at the surface point $X^{*}$ there is a tangent plane, calculated as:
$N^{T} X^{*} + l = 0$    formula (12)
This is the point-normal representation of a plane, where $X^{*}$ is a point in the plane, N is the normal vector of the plane, and l is a constant.
When compared, surface point X * The image of the tangential plane is called a source image (i.e. an image on a pixel light), the object to be compared is called a reference image, and according to the pose parameter of the source image, the plane parameter (N, l) under the world coordinate system can be converted into the source image coordinate system, and the conversion formula is as follows:
N s =R s N
X s =R s (X s -C s )
l s =-N s X s
wherein R is s Is the source image I s Rotation matrix of C s Is the camera center coordinate of the source image, X s Is the surface point X * In source image I s Coordinates in the coordinate system, N s Is the surface point X * Coordinates of normal vector in source image coordinate system, l s Is a constant (N) s ,l s ) Is a planar parameter in the source image coordinate system.
It is assumed that the object surface locally approximates a plane. Reference image I r Pixel block P of (2) r Image points in (a) and source images I s Pixel block P of (2) s The homography relation H induced by the plane exists between the points in the formula as follows:
x=hx' formula (14)
K s And K r An internal reference matrix, R, of the source image and the reference image respectively sr And t sr Respectively representing the relative postures of the reference image and the source image; x is the source image pixel and x' is the reference image pixel.
The reference images are selected in the three-dimensional scene space according to certain rules; to select reference images with high overlap and good geometric conditions, the CMVS view selection method is used to score each image, and high-scoring images are preferentially used as reference images.
Therefore, as shown in fig. 7, the step S7 is specifically implemented as follows:
S7.1: selecting a pixel ray, calculating the normal vector at the surface point, and obtaining the tangent plane of the surface point from the normal vector;
the surface point is determined as follows:
S7.1.1: traversing all sampling points and finding the sampling points whose SDF value is 0 (surface points);
S7.1.2: traversing all sampling points and finding the sampling point pairs satisfying formula (9), obtaining a set of sampling point pairs;
S7.1.3: obtaining the points where the SDF equals 0 within the sampling point pairs using formula (10), forming a surface point set together with the sampling points from step S7.1.1, and finding the point in the surface point set closest to the emitting end of the pixel ray via formula (11) for optimization, ensuring that only one surface point is taken on each pixel ray;
S7.2: selecting a comparison frame on the tangent plane, centered on the surface point of the pixel ray, as the image block of the source image;
specifically comprising:
S7.2.1: calculating the tangent plane of the surface point using formula (12);
S7.2.2: selecting a comparison frame of suitable size on the tangent plane, centered on the surface point, as the image block of the source image for the image consistency comparison;
to balance comparison accuracy and speed, the image block must be of suitable size: if it is too small there is too little information for comparison, and if it is too large the computation is complex and comparison takes too long. Generally a square frame with a side of 4 to 20 pixels is suitable; this embodiment uses a comparison frame of 5×5 pixels.
S7.3: selecting several reference images in the three-dimensional scene space and selecting comparison frames of the same size as their image blocks;
to improve algorithm robustness, 9 reference images are selected via the CMVS view selection method for the image consistency calculation.
S7.4: calculating the NCC value between the source image and each reference image to compare the similarity of image blocks between the source image and the reference image:
$NCC\!\left( I_s(p_s), I_r(p_r) \right) = \frac{\mathrm{Cov}\!\left( I_s(p_s), I_r(p_r) \right)}{\sqrt{\mathrm{Var}\!\left( I_s(p_s) \right)\, \mathrm{Var}\!\left( I_r(p_r) \right)}}$    formula (15)
where Cov and Var are the covariance function and variance function between two pixel blocks respectively, $I_s(p_s)$ is the set of image gray values corresponding to pixel block $P_s$ in the source image $I_s$, and $I_r(p_r)$ is the set of image gray values corresponding to pixel block $P_r$ in the reference image $I_r$.
During comparison, the pixel coordinates in pixel block $P_s$ of the source image $I_s$ are converted to pixel block $P_r$ of the reference image $I_r$ through formulas (13) and (14); the NCC of the pixel ray is then calculated by comparison according to formula (15).
S7.5: selecting the several largest NCC values to calculate the consistency cost $L_j$ of the source image.
In the application, 9 reference images are selected via the CMVS view selection method, and the 4 largest NCC values are used to calculate the image consistency cost of the pixel ray:
$L_j = \sum_{i=1}^{4} \left( 1 - NCC_{rj}\!\left( I_s(p_s), I_r(p_{ri}) \right) \right)$
where $L_j$ is the image consistency cost of the j-th pixel ray; the subscript rj denotes the j-th pixel ray; $NCC_{rj}$ is an NCC value of the j-th pixel ray; $I_r(p_{ri})$ is the set of image gray values corresponding to pixel block $P_{ri}$ of the i-th reference image; and i ranges from 1 to 4, meaning the 4 largest of the NCC values over the 9 reference images are used in the cost calculation.
S7.6: calculating the mean of the consistency costs $L_j$ over all pixel rays as the consistency cost $L_{photo}$:
$L_{photo} = \frac{1}{M} \sum_{j=1}^{M} L_j$
where M is the total number of rays.
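A sketch of S7.4 to S7.6 on gray-value blocks that are assumed to have already been warped by the homography of formulas (13) and (14); the 1 - NCC cost form follows the reconstruction above:

    import torch

    def ncc(a, b, eps=1e-5):
        """NCC between flattened pixel blocks of shape (..., k) (formula (15))."""
        a = a - a.mean(dim=-1, keepdim=True)
        b = b - b.mean(dim=-1, keepdim=True)
        cov = (a * b).mean(dim=-1)
        var = (a * a).mean(dim=-1) * (b * b).mean(dim=-1)
        return cov / torch.sqrt(var + eps)

    def consistency_cost(src_blocks, ref_blocks, top_k=4):
        """src_blocks: (m, 25) source 5x5 blocks; ref_blocks: (m, 9, 25)
        blocks from the 9 reference images. Keeps the top_k largest NCC
        values per ray and averages the per-ray costs L_j into L_photo."""
        scores = ncc(src_blocks[:, None, :], ref_blocks)   # (m, 9)
        best, _ = scores.topk(top_k, dim=1)                # 4 largest NCCs
        L_j = (1.0 - best).sum(dim=1)                      # per-ray cost L_j
        return L_j.mean()                                  # L_photo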
Finally, by minimizing the image consistency cost, the positions and normal vectors of the surface points in the SDF field are optimized toward their exact positions and orientations.
S8: calculating the total cost from the color loss $L_{color}$ and the image consistency cost $L_{photo}$, and adjusting the parameters of the neural radiance field network model using the total cost.
The total cost is calculated as follows:
$L = L_{color} + \alpha L_{photo} + \beta L_{reg}$    formula (18)
where L is the total cost, $\alpha$ is the weight of the image consistency cost, $L_{reg}$ is the Eikonal term at the sampling points, and $\beta$ is the weight of the Eikonal term.
$L_{reg}$ is calculated as follows:
$L_{reg} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \left\lVert \nabla d(X_{j,i}) \right\rVert_{2} - 1 \right)^{2}$
where m is the number of rays, n is the number of sampling points per ray, and $\nabla d(X_{j,i})$ denotes the gradient of the SDF value at the i-th sampling point of the j-th ray.
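A sketch of this regularizer via automatic differentiation, assuming a network wrapper that maps raw 3D points to SDF values (applying the positional encoding internally):

    import torch

    def eikonal_loss(sdf_fn, points):
        """points: (m, n, 3) sampling positions. Penalizes deviation of the
        SDF gradient norm from 1 so SDF values vary uniformly in space."""
        points = points.detach().requires_grad_(True)
        d = sdf_fn(points)                                 # (m, n, 1) SDF values
        grad = torch.autograd.grad(d.sum(), points, create_graph=True)[0]
        return ((grad.norm(dim=-1) - 1.0) ** 2).mean()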
A new training sample is then selected from the sample set, and the neural radiance field network model with adjusted parameters is trained again until the color loss $L_{color}$ and the image consistency cost $L_{photo}$ converge. Convergence can be judged in two ways: the first is that the color loss $L_{color}$ and the image consistency cost $L_{photo}$ stabilize and no longer decrease; the second is that the number of training iterations reaches a set total. For example, in this embodiment the total number of training iterations is set to 300,000, i.e., after 300,000 training iterations the neural radiance field network model is deemed to have converged; after such training, the SDF of the spatial points of the three-dimensional scene space can essentially meet the requirements of high-precision modeling.
S9: uniformly sampling the three-dimensional scene space and predicting with the converged neural radiance field network model to obtain the SDF values of all points, which are input to the Marching Cubes method to generate a mesh model. The number of samples in the three-dimensional scene space may be 128×128×128.
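A sketch of S9 using scikit-image's Marching Cubes, assuming a unit-cube scene bound and the same point-to-SDF network wrapper as above:

    import numpy as np
    import torch
    from skimage import measure

    def extract_mesh(sdf_fn, resolution=128, bound=1.0):
        """Uniformly sample the scene cube, evaluate the converged SDF
        network, and run Marching Cubes on the zero level set."""
        xs = np.linspace(-bound, bound, resolution, dtype=np.float32)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
        with torch.no_grad():
            d = sdf_fn(torch.from_numpy(grid.reshape(-1, 3)))
            sdf = d.reshape(resolution, resolution, resolution).cpu().numpy()
        verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
        # Map voxel indices back to scene coordinates.
        verts = verts / (resolution - 1) * (2 * bound) - bound
        return verts, faces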
The above embodiments are only for illustrating the present application, and are not limiting of the present application. While the application has been described in detail with reference to the embodiments, those skilled in the art will appreciate that various combinations, modifications, and substitutions can be made thereto without departing from the spirit and scope of the application as defined in the appended claims.

Claims (10)

1. A three-dimensional reconstruction method based on a neural radiance field, characterized by comprising the following steps:
collecting multi-view images;
acquiring the pose information of the camera at the time each image was captured, and establishing a three-dimensional scene space enclosing the modeling object;
randomly establishing m pixel rays across all images using volume rendering, and randomly sampling n points on each pixel ray to obtain a set of m×n sampling point positions $\{X_i\}$ while recording the viewing angle $V_i$ of each sampling point, forming a data sample; establishing m pixel rays multiple times and sampling to obtain a sample set;
constructing a neural radiance field network model capable of predicting the SDF;
taking the position and viewing angle of each sampling point in the training sample as input, training the neural radiance field network model, and outputting the SDF value and color of each sampling point;
calculating the ray color loss: calculating the color of each pixel ray from the predicted sampling-point SDF values and colors, and calculating the difference between this color and the true color of the pixel ray in the multi-view image to obtain the color loss $L_{color}$;
image consistency constraint: while predicting the pixel ray color, calculating the point on each pixel ray where the SDF value equals 0, i.e., the surface point, and calculating the similarity of the multi-view images to obtain the image consistency cost $L_{photo}$;
calculating the total cost from the color loss $L_{color}$ and the image consistency cost $L_{photo}$, and adjusting the parameters of the neural radiance field network model using the total cost; selecting a new training sample from the sample set and training the adjusted neural radiance field network model again until the color loss $L_{color}$ and the image consistency cost $L_{photo}$ converge;
uniformly sampling the three-dimensional scene space, predicting the SDF values of all sampling points with the converged neural radiance field network model, and generating a mesh model from the SDF values of all sampling points.
2. The three-dimensional reconstruction method according to claim 1, wherein acquiring the pose information of the camera at the time each image was captured comprises:
calculating the pose information of the camera at the time each image was captured using a structure-from-motion method, the pose information comprising the camera position and camera orientation.
3. The three-dimensional reconstruction method according to claim 1, wherein the neural radiance field network model comprises two sub-modules, an SDF prediction module and a color prediction module, each sub-module comprising an input layer, several hidden layers, and an output layer;
the input parameter of the input layer of the SDF prediction module is the sampling point position, and the output parameters of its output layer comprise an output feature vector and the SDF value;
the input layer of the color prediction module has four input parameters: the sampling point viewing angle, the input feature vector, the normal vector, and the sampling point position;
the output feature vector of the SDF prediction module is used directly as the input feature vector of the color prediction module, the SDF value output by the SDF prediction module is used as an input of the color prediction module after the normal vector is computed from it by a normal vector calculation module, and the output of the color prediction module is the color of the sampling point.
4. The three-dimensional reconstruction method according to claim 3, wherein the SDF prediction module has 8 hidden layers and the color prediction module has 4 hidden layers, all hidden layers have the same number of neurons, and the input layer is connected to an intermediate hidden layer vector by a skip connection.
5. The three-dimensional reconstruction method according to claim 4, wherein the sampling point position input of the SDF prediction module and the sampling point viewing-angle input of the color prediction module are both expanded in dimension using positional encoding.
6. The three-dimensional reconstruction method according to claim 1, wherein the specific method of ray color prediction is as follows:
traversing the sampling points on the pixel ray to obtain the SDF values $d(X_i)$;
calculating the opacity from the SDF values $d(X_i)$;
calculating the transmittance $T_i$ from the opacity of the sampling points;
calculating the ray color $c(r)$ from the transmittance $T_i$ and color $c_i$ of all sampling points on the pixel ray;
traversing all pixel rays and calculating the color difference against the corresponding pixels in the original images to obtain the color loss $L_{color}$ of the training sample.
7. The three-dimensional reconstruction method according to claim 6, wherein the method for calculating the similarity of the multi-view images is as follows:
selecting a pixel ray, calculating the normal vector at the surface point, and obtaining the tangent plane of the surface point from the normal vector;
selecting a comparison frame on the tangent plane, centered on the surface point of the pixel ray, as the image block of the source image;
selecting several reference images in the three-dimensional scene space and selecting comparison frames of the same size as their image blocks;
calculating the NCC value between the source image and each reference image;
selecting the several largest NCC values to calculate the consistency cost $L_j$ of the source image;
averaging the consistency costs $L_j$ over all pixel rays as the consistency cost $L_{photo}$.
8. The three-dimensional reconstruction method according to claim 7, wherein the total cost is calculated as follows:
$L = L_{color} + \alpha L_{photo} + \beta L_{reg}$
where L is the total cost, $\alpha$ is the weight of the image consistency cost, $L_{reg}$ is the Eikonal term at the sampling points, and $\beta$ is the weight of the Eikonal term.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the three-dimensional reconstruction method according to any one of claims 1 to 8 when executing the program.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the three-dimensional reconstruction method according to any one of claims 1 to 8.
CN202311052945.3A 2023-08-18 2023-08-18 Three-dimensional reconstruction method based on a neural radiance field Pending CN117036612A (en)

Priority Applications (1)

Application Number: CN202311052945.3A; Priority Date / Filing Date: 2023-08-18; Title: Three-dimensional reconstruction method based on a neural radiance field

Publications (1)

Publication Number: CN117036612A; Publication Date: 2023-11-10

Family ID: 88629774

Country Link: CN (1) CN117036612A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333637A (en) * 2023-12-01 2024-01-02 北京渲光科技有限公司 Modeling and rendering method, device and equipment for three-dimensional scene
CN117333637B (en) * 2023-12-01 2024-03-08 北京渲光科技有限公司 Modeling and rendering method, device and equipment for three-dimensional scene
CN117456078A (en) * 2023-12-19 2024-01-26 北京渲光科技有限公司 Neural radiation field rendering method, system and equipment based on various sampling strategies
CN117456078B (en) * 2023-12-19 2024-03-26 北京渲光科技有限公司 Neural radiation field rendering method, system and equipment based on various sampling strategies


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination