CN115761178A - Multi-view three-dimensional reconstruction method based on implicit neural representation - Google Patents

Multi-view three-dimensional reconstruction method based on implicit neural representation Download PDF

Info

Publication number
CN115761178A
CN115761178A · CN202211232601.6A · CN202211232601A
Authority
CN
China
Prior art keywords
pixel
image
features
feature map
ray
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211232601.6A
Other languages
Chinese (zh)
Inventor
唐琳琳
黄鑫
苏敬勇
刘洋
漆舒汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211232601.6A priority Critical patent/CN115761178A/en
Publication of CN115761178A publication Critical patent/CN115761178A/en
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

The invention relates to the technical field of three-dimensional reconstruction from stereoscopic views, and in particular to a multi-view three-dimensional reconstruction method based on implicit neural representation.

Description

Multi-view three-dimensional reconstruction method based on implicit neural representation
Technical Field
The invention relates to the technical field of three-dimensional reconstruction of stereoscopic views, in particular to a multi-view three-dimensional reconstruction method based on implicit neural representation.
Background
Existing multi-view three-dimensional reconstruction methods can be divided into traditional methods and deep-learning-based methods. Traditional multi-view stereo reconstruction estimates depth by matching image features shared across pictures taken from multiple viewpoints and fuses the predicted depth maps into a dense point cloud; post-processing is then generally required to produce an object surface. The reconstruction quality of these methods depends heavily on the quality of feature matching: if a large number of feature matches fail, the three-dimensional model is easily left incomplete or riddled with artifacts. To reconstruct a realistic object, texture mapping is usually performed by unprojection after the object surface has been reconstructed, yielding a textured three-dimensional model. This chain of operations is not only limited by feature matching and by the resolution of the multi-view images; inevitable noise and camera quantization errors also blur the reconstruction and make the texture mapping look unnatural.
With the rise of deep learning, most deep-learning-based MVS (multi-view stereo) methods attempt to replace parts of the classical MVS pipeline, for example by fusing depth maps or by inferring depth maps from the multi-view images. They remain limited, however, by the inherent drawbacks of the three-dimensional representations they adopt: the complex topology of triangle-mesh representations, the discrete and unordered nature of point clouds, and the memory-bounded resolution of voxel grids. As a result, it is difficult to reconstruct high-resolution fine models or objects with complex structure.
Implicit neural representations can overcome the shortcomings of these three representations: a neural network learns the occupancy probability or the signed distance function of spatial points to represent the three-dimensional structure. Such representations describe three-dimensional information continuously, are not tied to a fixed resolution, and have a small memory footprint, so they have the potential to reconstruct objects that are difficult for classical methods, such as non-Lambertian surfaces and thin structures. However, most implicit neural multi-view reconstruction methods adopt a surface rendering technique and usually require additional constraints, such as depth information or object masks, to reconstruct a valid object surface. Although some recent works combine the advantages of surface rendering and volume rendering and can reconstruct accurate object geometry from multi-view images without masks, they can only reconstruct the surfaces of solid objects; the reconstructed geometry still suffers from problems such as shape ambiguity, and objects with thin structures still cannot be reconstructed correctly.
The present invention seeks to address these and other needs in the art.
Disclosure of Invention
In order to solve at least one technical problem mentioned in the background, a multi-view three-dimensional reconstruction algorithm based on pixel feature fusion is provided on top of implicit neural representation and volume rendering techniques. It takes into account the influence of incident light on the reconstructed geometry under different viewing angles, extracts pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improves the volume rendering process of the whole reconstruction model, further improves the fineness of the object surface, and reconstructs a high-resolution three-dimensional object model represented by a fine mesh.
In one aspect, the present invention is directed to a method for multi-view three-dimensional reconstruction based on implicit neural representation, comprising:
step one, obtaining a global feature map and local feature maps of an image through a residual convolutional neural network, splicing the global feature map and the local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
step two, extracting the image global features and local features from the pixel feature map and fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
step three, implicitly representing the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
step four, rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
and step five, after the signed distance of every three-dimensional spatial point has been obtained, using the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
In the first step, the local feature maps are the feature maps extracted from each intermediate layer of the network that produces the global feature map.
In the first step, for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map. The pixel feature map of the image can then be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
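For illustration only, the following PyTorch sketch shows one way such a pixel feature map f_I could be assembled. The ResNet-34 backbone, the four tapped stages, and the feature dimensions are assumptions made for the example, not values fixed by this description: the intermediate (local) feature maps and the last (global) feature map are upsampled to the input resolution by bilinear interpolation and concatenated channel-wise, so every pixel carries a feature vector of fixed dimension.

```python
# Hedged sketch of a pixel feature map encoder (assumed: ResNet-34 backbone,
# first three stages as "local" features, last stage as the "global" feature).
import torch
import torch.nn.functional as F
import torchvision


class PixelFeatureEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)  # torchvision >= 0.13
        # Stem plus the four residual stages of the backbone.
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.stages = torch.nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])

    def forward(self, image):                      # image: (B, 3, H, W)
        h, w = image.shape[-2:]
        feats, x = [], self.stem(image)
        for stage in self.stages:
            x = stage(x)
            # Bilinearly upsample every tapped map back to the input size.
            feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                       align_corners=False))
        # Channel-wise concatenation: f_I = (F_1, ..., F_N), one feature per pixel.
        return torch.cat(feats, dim=1)             # (B, C_total, H, W)
```

With such a map, the feature attached to a ray can later be read off directly at that ray's pixel coordinate.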
in a convolutional neural network, a shallow convolutional layer obtains more characteristic information, the receptive field is smaller, and characteristic data is generally closer to an original image, namely the shallow characteristic can contain more detailed characteristics of objects; with the increasing depth of the layer number, the obtained features are more and more abstract, and the receptive field is also larger and larger, and the image features at the moment can contain the global features of the object, such as the classification information, the shape structure and the like of the object. Shallow features are often referred to as local features and deep features as global features. Because the image pixel coordinate corresponding to each ray can be directly obtained in the volume rendering process, after the pixel feature map of the image is obtained, the pixel feature of the corresponding position can be directly indexed through the pixel coordinate, and the original pixel coordinate system is not required to be subjected to de-projection aiming at each space point, so that the calculation amount is reduced.
In step two, given a pixel, the ray from the camera center o through that pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel. The accumulated color C(r) along the ray is:
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
This fusion scheme accounts for the BRDF of common materials through normal-based surface encoding during volume rendering, while also taking into account the different wavelengths of different incident rays and the global surface characteristics, which effectively improves reconstruction accuracy.
In step two, the volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
Intuitively, the density σ models a homogeneous object of constant density α that falls off smoothly near the object boundary, where the amount of smoothing is controlled by β. Defining the volume density this way brings two advantages. First, it provides a useful inductive bias for the object surface: the zero level set of f_θ is where the volume density rises toward its maximum, so it can be regarded as the object surface, giving a principled way to reconstruct curved surfaces. Second, such a volume density function makes it possible to bound the opacity error of the rendered volume, which in turn allows the sampling strategy to be optimized so that the points sampled along each ray eventually converge to points on the object surface.
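As a non-authoritative sketch of equations (3)–(4), the helper below converts signed distance values into volume densities through a zero-mean Laplace CDF. The sign convention follows equation (3) as written; formulations such as VolSDF apply the CDF to the negated distance, so the sign should be adapted to whichever convention is adopted.

```python
# Sketch only: volume density sigma(x) = alpha * Psi_beta(f_theta(x)) from
# equations (3)-(4), with Psi_beta the CDF of a zero-mean Laplace distribution.
import torch

def laplace_cdf(s, beta):
    # CDF of a zero-mean Laplace distribution with scale beta.
    return torch.where(s <= 0,
                       0.5 * torch.exp(s / beta),
                       1.0 - 0.5 * torch.exp(-s / beta))

def sdf_to_density(sdf, alpha, beta):
    # alpha, beta > 0 are learnable scalars (kept positive here via abs()).
    # Note: VolSDF-style models use laplace_cdf(-sdf, beta); adapt the sign as needed.
    return alpha.abs() * laplace_cdf(sdf, beta.abs() + 1e-6)
```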
In step two, N samples x_i are taken along the ray {r(t) = o + td | t > 0}, and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
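The quadrature in equations (5)–(6) can be written compactly as below; this is an illustrative sketch that assumes the per-sample colors c_i and densities σ_i for one ray have already been produced by the networks.

```python
# Sketch of the discrete volume rendering in equations (5)-(6); fully differentiable.
import torch

def render_ray(colors, sigmas, t_vals):
    # colors: (N, 3), sigmas: (N,), t_vals: (N,) sorted sample depths along the ray.
    deltas = t_vals[1:] - t_vals[:-1]                      # delta_i between neighbours
    deltas = torch.cat([deltas, deltas[-1:]])              # pad the last interval
    alphas = 1.0 - torch.exp(-sigmas * deltas)             # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), accumulated transmittance.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = trans * alphas                               # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[:, None] * colors).sum(dim=0)          # C_hat(r), shape (3,)
```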
In step three, the signed distance function SDF implicitly represents the curved surface shape of the object:
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

That is, for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface; a positive sign indicates that the point lies outside the object surface, and a negative sign indicates that it lies inside. The surface S of the three-dimensional object is then determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function is approximated with a multilayer perceptron.
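Purely as an illustrative sketch of "approximating the signed distance function with a multilayer perceptron": the width, depth, activation and the extra global shape feature output below are assumptions for the example (positional encoding and skip connections are omitted for brevity).

```python
# Hedged sketch: an MLP f_theta mapping a 3D point to (signed distance, shape feature).
import torch

class SDFNetwork(torch.nn.Module):
    def __init__(self, hidden=256, n_layers=8, feat_dim=256):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(n_layers):
            layers += [torch.nn.Linear(in_dim, hidden), torch.nn.Softplus(beta=100)]
            in_dim = hidden
        self.body = torch.nn.Sequential(*layers)
        self.head = torch.nn.Linear(hidden, 1 + feat_dim)   # SDF value + global shape feature

    def forward(self, x):                                   # x: (..., 3)
        out = self.head(self.body(x))
        sdf, shape_feat = out[..., :1], out[..., 1:]
        return sdf, shape_feat
```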
In the fourth step, each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel. For each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
In step four, the training loss consists of two terms:
L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
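A hedged sketch of the two-term loss with λ = 0.1 follows. The l_1 color term matches the definition above; the gradient-norm regularizer is one concrete reading of the "energy loss encouraging f_θ to approximate a distance function" and is an assumption where the description leaves the exact expression to the figures.

```python
# Hedged sketch of the training loss: L = L_color + lambda * L_reg, lambda = 0.1.
import torch

def training_loss(pred_rgb, gt_rgb, sdf_net, reg_points, lam=0.1):
    # pred_rgb, gt_rgb: (m, 3) rendered and ground-truth pixel colors.
    # sdf_net: network returning (sdf, shape_feature); reg_points: (k, 3) points y in Y.
    color_loss = (pred_rgb - gt_rgb).abs().sum(dim=-1).mean()   # l1 per pixel, averaged

    # Encourage f_theta to behave like a signed distance function
    # (unit-norm gradient at the sampled points y).
    reg_points = reg_points.requires_grad_(True)
    sdf, _ = sdf_net(reg_points)
    grad = torch.autograd.grad(sdf.sum(), reg_points, create_graph=True)[0]
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    return color_loss + lam * eikonal
```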
In another aspect, the invention is directed to a multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
The multi-view three-dimensional reconstruction system based on implicit neural representation provided by the present application not only considers the influence of incident light on the reconstructed geometry under different viewing angles, but also extracts the pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improving the volume rendering process of the whole reconstruction model, further improving the fineness of the object surface, and reconstructing a high-resolution three-dimensional object model represented by a fine mesh.
In another aspect, a computer apparatus is provided, comprising a processor, a memory, and a program stored on the memory and executable by the processor, the program, when executed, performing at least one of the steps of the method as set forth above.
The above-described preferred conditions may be combined with each other to obtain a specific embodiment, in accordance with common knowledge in the art.
The beneficial effects of the invention are as follows: on top of implicit neural representation and volume rendering techniques, the invention provides a multi-view three-dimensional reconstruction algorithm based on pixel feature fusion. It not only considers the influence of incident light on the reconstructed geometry under different viewing angles, but also extracts the pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improving the volume rendering process of the whole reconstruction model, further improving the fineness of the object surface, and reconstructing a high-resolution three-dimensional object model represented by a fine mesh.
By adopting the above technical scheme, the invention remedies the defects of the prior art, and is reasonably designed and convenient to operate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a flow chart of a multi-view three-dimensional reconstruction method based on implicit neural representation;
FIG. 2 is a schematic diagram of a pixel profile encoder network architecture;
fig. 3 is a schematic diagram of an overview of a pixel feature fused volume rendering network model.
Detailed Description
Those skilled in the art can appropriately substitute and/or modify the process parameters to implement the present disclosure, but it is specifically noted that all similar substitutes and/or modifications will be apparent to those skilled in the art and are deemed to be included in the present invention. While the invention has been described in terms of preferred embodiments, it will be apparent to those skilled in the art that the technology can be practiced and applied by modifying or appropriately combining the embodiments described herein without departing from the spirit and scope of the invention.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is to be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the technical aspects of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The present invention is described in detail below.
Example 1:
a multi-view three-dimensional reconstruction method based on implicit neural representation is shown in figure 1 and specifically comprises the following steps.
Step one, designing the pixel feature map encoder structure:
obtaining a global feature map of the image through a residual convolutional neural network, extracting the feature map of each layer as the local feature maps of the image, splicing the global and local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension; for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map, and the pixel feature map of the image can be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
step two, establishing a pixel feature fused volume rendering mathematical model:
extracting the image global features and local features from the pixel feature map, fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model; that is, given a pixel, the ray from the camera center o through the pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel. The accumulated color C(r) along the ray is:
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
The volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
Along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
Step three, establishing the implicit neural surface model:
the curved surface shape of the object is implicitly represented using a signed distance function SDF,
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface; a positive sign indicates that the point lies outside the object surface, and a negative sign indicates that it lies inside. The surface S of the three-dimensional object is determined by the zero level set SDF(x) = 0,

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function is approximated with a multilayer perceptron, and this idea of approximating an implicit function with a neural network is commonly referred to as implicit neural representation.
Step four, designing a multi-view three-dimensional reconstruction network model with pixel feature fusion:
rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel; and for each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel;
the training loss consists mainly of two terms:
L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
And step five, after the signed distance of every three-dimensional spatial point has been obtained, the Marching Cubes algorithm is used to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
Example 2:
a multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
The details are as follows.
Pixel feature map encoder:
in the convolutional neural network, the feature information obtained by the convolutional layer of the shallow layer is more, the receptive field is smaller, and the feature data is generally closer to the original image, namely the shallow layer feature can contain more detailed features of objects; with the deeper and deeper layers, the obtained features are more and more abstract, and the receptive field is also more and more large, at the same time, the image features can contain the global features of the object, such as the classification information, the shape structure and the like of the object. Shallow features are often referred to as local features and deep features as global features.
In order to fully obtain the pixel features of the image under each viewing angle, the idea of a residual convolutional neural network is adopted to obtain the global feature map of the image; at the same time, the feature map of each layer is extracted as a local feature map of the image, the global and local feature maps are spliced by means of bilinear interpolation, and finally a pixel feature map with the same size as the original image is formed, so that each image pixel corresponds to a pixel feature of a specified dimension. We refer to a feature map generated in this way as a pixel feature map. That is, for an input image I, the extracted feature maps are F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map; the pixel feature map of the image can be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
because the image pixel coordinate corresponding to each ray can be directly obtained in the volume rendering process, after the pixel feature map of the image is obtained, the pixel feature of the corresponding position can be directly indexed through the pixel coordinate, and the original pixel coordinate system is not required to be subjected to de-projection aiming at each space point, so that the calculation amount is reduced. The specific pixel profile encoder network structure is therefore shown in fig. 2.
The volume rendering mathematical model of pixel feature fusion:
inspired by a plenoptic function in computer graphics considering the wavelengths of different incident lights, the characteristics that the incident lights under different visual angles have different wavelengths are fully considered, the method provides that the image global characteristic and the local characteristic are extracted from an image and fused to form the pixel characteristic corresponding to each ray to represent the characteristics of different wavelengths, and the characteristic is fused into the global shape characteristic in a volume rendering formula to improve the reconstruction performance of the whole model. That is, given a pixel, the ray from the camera center o to the pixel is denoted as { r (t) = o + td | t >0}, and d is the unit direction vector along the camera center o to the ray of the pixel. Cumulative color along light C (r):
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d).
The spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)). This fusion scheme accounts for the BRDF of common materials through normal-based surface encoding during volume rendering, while also taking into account the different wavelengths of different incident rays and the global surface characteristics, which effectively improves reconstruction accuracy.
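One possible, non-authoritative realization of the constrained light field c(r(t), n, g, d), with the spatial point feature g formed by concatenating the ray's pixel feature and the global shape feature, is sketched below; the layer sizes and concatenation order are assumptions.

```python
# Hedged sketch of the fused color field: c = c(x, n, g, d),
# where g concatenates the ray's pixel feature and the global shape feature.
import torch

class FusedColorField(torch.nn.Module):
    def __init__(self, pixel_dim, shape_dim, hidden=256):
        super().__init__()
        in_dim = 3 + 3 + 3 + pixel_dim + shape_dim           # x, normal n, view dir d, g
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 3), torch.nn.Sigmoid())   # RGB in [0, 1]

    def forward(self, x, normal, view_dir, pixel_feat, shape_feat):
        g = torch.cat([pixel_feat, shape_feat], dim=-1)       # fused spatial point feature
        return self.net(torch.cat([x, normal, view_dir, g], dim=-1))
```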
Unlike the volume density function σ(x) in conventional volume rendering, which represents the probability that a ray terminates at the spatial point x, the volume density σ(x) is here modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β. Intuitively, the density σ models a homogeneous object of constant density α that falls off smoothly near the object boundary, where the amount of smoothing is controlled by β. Defining the volume density this way brings two advantages. First, it provides a useful inductive bias for the object surface: the zero level set of f_θ is where the volume density rises toward its maximum, so it can be regarded as the object surface, giving a principled way to reconstruct curved surfaces. Second, such a volume density function makes it possible to bound the opacity error of the rendered volume, which in turn allows the sampling strategy to be optimized so that the points sampled along each ray eventually converge to points on the object surface.
As in the conventional volume rendering formula, the function T(t) represents the accumulated transmittance along the ray, i.e. the probability that the ray propagates from o to r(t) without hitting any other particle.
Thus, along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
Implicit neural surface model:
the method adopts an implicit neural representation mode to represent the curved surface of an object, namely, a Symbolic Distance Function (SDF) is used for implicitly representing the curved surface shape of the object:
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

Specifically, for any given spatial point x, SDF(x) outputs the distance from that point to the nearest surface, and its sign indicates whether the point lies inside (negative) or outside (positive) the object. The surface S of the three-dimensional object can therefore be determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function SDF is approximated here with a multilayer perceptron; this idea of approximating an implicit function with a neural network is commonly referred to as implicit neural representation.
Pixel-feature-fused multi-view three-dimensional reconstruction network model:
the goal of multi-view three-dimensional reconstruction is to reconstruct the fine surface shape of an object given the multi-view images of the object and the internal and external parameters of the corresponding cameras. To learn the implicit neural surface representation of a three-dimensional object, i.e., to learn the weight parameters of a neural network, an image at a given perspective and camera parameters is rendered using a volume rendering formula of pixel feature fusion, and the difference from the input image is minimized. Experiments prove that the volume rendering method can accurately reconstruct the surface of an object with a complex structure. The network model design and training process is illustrated in fig. 3.
During training, the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters. Each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel. For each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
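A hedged sketch of assembling one training batch D = {C_p, o_p, d_p} follows: m pixels are drawn at random and their world-space rays are recovered from the camera intrinsics K and a camera-to-world pose. The pinhole conventions (camera looking along +z, no axis flips) are assumptions of the example.

```python
# Hedged sketch: sample m pixels and build their world-space rays (o_p, d_p)
# from intrinsics K (3x3 tensor) and camera-to-world pose c2w (4x4 tensor).
import torch

def sample_ray_batch(image, K, c2w, m):
    # image: (H, W, 3) RGB image of one view.
    h, w, _ = image.shape
    idx = torch.randint(0, h * w, (m,))
    v = torch.div(idx, w, rounding_mode="floor")             # pixel rows
    u = idx % w                                              # pixel columns
    colors = image[v, u]                                     # C_p: (m, 3)

    # Back-project pixel centers to camera-space directions, then rotate to world space.
    dirs_cam = torch.stack([(u + 0.5 - K[0, 2]) / K[0, 0],
                            (v + 0.5 - K[1, 2]) / K[1, 1],
                            torch.ones(m)], dim=-1)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)   # unit d_p
    origins = c2w[:3, 3].expand(m, 3)                        # o_p: camera center
    return colors, origins, dirs_world
```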
The training loss consists mainly of two terms:
L = L_color + λ·L_reg

where the color loss is defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
Visualization of the three-dimensional object model:
after the symbolic distance of each three-dimensional space point is obtained, a Marching Cube algorithm can be used to obtain a three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a mesh with high resolution can be visualized.
Conventional techniques in the above embodiments are known to those skilled in the art, and thus will not be described in detail herein.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
While the invention has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope thereof.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The invention is not the best known technology.

Claims (10)

1. A multi-view three-dimensional reconstruction method based on implicit neural representation, characterized by comprising the following steps:
step one, obtaining a global feature map and local feature maps of an image through a residual convolutional neural network, splicing the global feature map and the local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
step two, extracting the image global features and local features from the pixel feature map and fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
step three, implicitly representing the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
step four, rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
and step five, after the signed distance of every three-dimensional spatial point has been obtained, using the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
2. The method of claim 1, wherein:
in step one, for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map, and the pixel feature map of the image can be written as:

f_I = (F_1, F_2, ..., F_N)   (1).
3. the method of claim 2, wherein:
in step two, given a pixel, the ray from the camera center o through that pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel; the accumulated color C(r) along the ray is

C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle; and the light field is constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
4. The method of claim 3, wherein:
in step two, the volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:

σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
5. The method of claim 4, wherein:
in step two, along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:

Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points; since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
6. The method of claim 5, wherein:
in step three, the signed distance function SDF implicitly represents the curved surface shape of the object:

SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

that is, for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, a positive sign indicating that the point lies outside the object surface and a negative sign indicating that it lies inside; the surface S of the three-dimensional object is determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

and the signed distance function is approximated with a multilayer perceptron.
7. The method of claim 6, wherein:
in step four, each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel; and for each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
8. The method of claim 7, wherein:
in step four, the training loss consists mainly of two terms:

L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6); the regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function, Y being formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )²;

and throughout the training process, the hyper-parameter λ is set to 0.1.
9. A multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters; and
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
10. A computer apparatus comprising a processor, a memory, and a program stored on the memory and executable by the processor, wherein: the program, when executed, performs at least one step of the method of any one of claims 1 to 8.
CN202211232601.6A 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation Pending CN115761178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211232601.6A CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211232601.6A CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Publications (1)

Publication Number Publication Date
CN115761178A true CN115761178A (en) 2023-03-07

Family

ID=85350972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211232601.6A Pending CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Country Status (1)

Country Link
CN (1) CN115761178A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468767A (en) * 2023-03-28 2023-07-21 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN116468767B (en) * 2023-03-28 2023-10-13 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN117274472A (en) * 2023-08-16 2023-12-22 武汉大学 Aviation true projection image generation method and system based on implicit three-dimensional expression
CN117274472B (en) * 2023-08-16 2024-05-31 武汉大学 Aviation true projection image generation method and system based on implicit three-dimensional expression
CN117689747A (en) * 2023-11-23 2024-03-12 杭州图科智能信息科技有限公司 Multi-view nerve implicit surface reconstruction method based on point cloud guidance
CN117496091A (en) * 2023-12-28 2024-02-02 西南石油大学 Single-view three-dimensional reconstruction method based on local texture
CN117496091B (en) * 2023-12-28 2024-03-15 西南石油大学 Single-view three-dimensional reconstruction method based on local texture

Similar Documents

Publication Publication Date Title
Liu et al. Meshdiffusion: Score-based generative 3d mesh modeling
Liu et al. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction
CN108510573B (en) Multi-view face three-dimensional model reconstruction method based on deep learning
CN114549731B (en) Method and device for generating visual angle image, electronic equipment and storage medium
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
Wynn et al. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models
CN111340944B (en) Single-image human body three-dimensional reconstruction method based on implicit function and human body template
CN110223370B (en) Method for generating complete human texture map from single-view picture
Guo et al. Streetsurf: Extending multi-view implicit surface reconstruction to street views
CN113313828B (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
CN116958453B (en) Three-dimensional model reconstruction method, device and medium based on nerve radiation field
WO2022198684A1 (en) Methods and systems for training quantized neural radiance field
CN114782634A (en) Monocular image dressing human body reconstruction method and system based on surface implicit function
CN115170741A (en) Rapid radiation field reconstruction method under sparse visual angle input
CN116993826A (en) Scene new view generation method based on local space aggregation nerve radiation field
CN115147709B (en) Underwater target three-dimensional reconstruction method based on deep learning
DE102022113244A1 (en) Joint shape and appearance optimization through topology scanning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN116797742A (en) Three-dimensional reconstruction method and system for indoor scene
CN114723884A (en) Three-dimensional face reconstruction method and device, computer equipment and storage medium
CN114549669B (en) Color three-dimensional point cloud acquisition method based on image fusion technology
Yang et al. Reconstructing objects in-the-wild for realistic sensor simulation
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
CN110675381A (en) Intrinsic image decomposition method based on serial structure network
CN116310228A (en) Surface reconstruction and new view synthesis method for remote sensing scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination