CN115761178A - Multi-view three-dimensional reconstruction method based on implicit neural representation - Google Patents

Multi-view three-dimensional reconstruction method based on implicit neural representation Download PDF

Info

Publication number
CN115761178A
CN115761178A · CN202211232601.6A · CN202211232601A
Authority
CN
China
Prior art keywords
pixel
image
features
feature map
ray
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211232601.6A
Other languages
Chinese (zh)
Inventor
唐琳琳
黄鑫
苏敬勇
刘洋
漆舒汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211232601.6A priority Critical patent/CN115761178A/en
Publication of CN115761178A publication Critical patent/CN115761178A/en
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

The invention relates to the technical field of three-dimensional reconstruction from stereoscopic views, and in particular to a multi-view three-dimensional reconstruction method based on implicit neural representation.

Description

Multi-view three-dimensional reconstruction method based on implicit neural representation
Technical Field
The invention relates to the technical field of three-dimensional reconstruction of stereoscopic views, in particular to a multi-view three-dimensional reconstruction method based on implicit neural representation.
Background
Existing multi-view three-dimensional reconstruction methods can be divided into traditional methods and deep-learning-based methods. Traditional multi-view stereo reconstruction estimates depth by matching image features shared across pictures taken from multiple viewpoints and fuses the predicted depth maps into a dense point cloud; post-processing is then generally required to produce an object surface. The reconstruction quality of these methods depends heavily on the quality of feature matching: if a large number of feature matches fail, the three-dimensional model is easily left incomplete or riddled with artifacts. To reconstruct a realistic object, texture mapping is usually performed by unprojection after the object surface has been reconstructed, yielding a textured three-dimensional model. This chain of operations is not only limited by feature matching and by the resolution of the multi-view images; inevitable noise and camera quantization errors also blur the reconstruction and make the texture mapping look unnatural.
With the rise of deep learning, most deep-learning-based MVS (multi-view stereo) methods attempt to replace parts of the classical MVS pipeline, for example by fusing depth maps or by inferring depth maps from the multi-view images. They remain limited, however, by the inherent drawbacks of the three-dimensional representations they adopt: the complex topology of triangle-mesh representations, the discrete and unordered nature of point clouds, and the memory-bounded resolution of voxel grids. As a result, it is difficult to reconstruct high-resolution fine models or objects with complex structure.
Implicit neural representations can overcome the shortcomings of these three representations: a neural network learns the occupancy probability or the signed distance function of spatial points to represent the three-dimensional structure. Such representations describe three-dimensional information continuously, are not tied to a fixed resolution, and have a small memory footprint, so they have the potential to reconstruct objects that are difficult for classical methods, such as non-Lambertian surfaces and thin structures. However, most implicit neural multi-view reconstruction methods adopt a surface rendering technique and usually require additional constraints, such as depth information or object masks, to reconstruct a valid object surface. Although some recent works combine the advantages of surface rendering and volume rendering and can reconstruct accurate object geometry from multi-view images without masks, they can only reconstruct the surfaces of solid objects; the reconstructed geometry still suffers from problems such as shape ambiguity, and objects with thin structures still cannot be reconstructed correctly.
The present invention seeks to address these and other needs in the art.
Disclosure of Invention
In order to solve at least one technical problem mentioned in the background, a multi-view three-dimensional reconstruction algorithm based on pixel feature fusion is provided on top of implicit neural representation and volume rendering techniques. It takes into account the influence of incident light on the reconstructed geometry under different viewing angles, extracts pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improves the volume rendering process of the whole reconstruction model, further improves the fineness of the object surface, and reconstructs a high-resolution three-dimensional object model represented by a fine mesh.
In one aspect, the present invention is directed to a method for multi-view three-dimensional reconstruction based on implicit neural representation, comprising:
step one, obtaining a global feature map and local feature maps of an image through a residual convolutional neural network, splicing the global feature map and the local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
step two, extracting the image global features and local features from the pixel feature map and fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
step three, implicitly representing the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
step four, rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
and step five, after the signed distance of every three-dimensional spatial point has been obtained, using the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
In the first step, the local feature maps are the feature maps extracted from each intermediate layer of the network that produces the global feature map.
In the first step, for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map. The pixel feature map of the image can then be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
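For illustration only, the following PyTorch sketch shows one way such a pixel feature map f_I could be assembled. The ResNet-34 backbone, the four tapped stages, and the feature dimensions are assumptions made for the example, not values fixed by this description: the intermediate (local) feature maps and the last (global) feature map are upsampled to the input resolution by bilinear interpolation and concatenated channel-wise, so every pixel carries a feature vector of fixed dimension.

```python
# Hedged sketch of a pixel feature map encoder (assumed: ResNet-34 backbone,
# first three stages as "local" features, last stage as the "global" feature).
import torch
import torch.nn.functional as F
import torchvision


class PixelFeatureEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights=None)  # torchvision >= 0.13
        # Stem plus the four residual stages of the backbone.
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.stages = torch.nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])

    def forward(self, image):                      # image: (B, 3, H, W)
        h, w = image.shape[-2:]
        feats, x = [], self.stem(image)
        for stage in self.stages:
            x = stage(x)
            # Bilinearly upsample every tapped map back to the input size.
            feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                       align_corners=False))
        # Channel-wise concatenation: f_I = (F_1, ..., F_N), one feature per pixel.
        return torch.cat(feats, dim=1)             # (B, C_total, H, W)
```

With such a map, the feature attached to a ray can later be read off directly at that ray's pixel coordinate.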
in a convolutional neural network, a shallow convolutional layer obtains more characteristic information, the receptive field is smaller, and characteristic data is generally closer to an original image, namely the shallow characteristic can contain more detailed characteristics of objects; with the increasing depth of the layer number, the obtained features are more and more abstract, and the receptive field is also larger and larger, and the image features at the moment can contain the global features of the object, such as the classification information, the shape structure and the like of the object. Shallow features are often referred to as local features and deep features as global features. Because the image pixel coordinate corresponding to each ray can be directly obtained in the volume rendering process, after the pixel feature map of the image is obtained, the pixel feature of the corresponding position can be directly indexed through the pixel coordinate, and the original pixel coordinate system is not required to be subjected to de-projection aiming at each space point, so that the calculation amount is reduced.
In step two, given a pixel, the ray from the camera center o through that pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel. The accumulated color C(r) along the ray is:
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
This fusion scheme accounts for the BRDF of common materials through normal-based surface encoding during volume rendering, while also taking into account the different wavelengths of different incident rays and the global surface characteristics, which effectively improves reconstruction accuracy.
In step two, the volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
Intuitively, the density σ models a homogeneous object of constant density α that falls off smoothly near the object boundary, where the amount of smoothing is controlled by β. Defining the volume density this way brings two advantages. First, it provides a useful inductive bias for the object surface: the zero level set of f_θ is where the volume density rises toward its maximum, so it can be regarded as the object surface, giving a principled way to reconstruct curved surfaces. Second, such a volume density function makes it possible to bound the opacity error of the rendered volume, which in turn allows the sampling strategy to be optimized so that the points sampled along each ray eventually converge to points on the object surface.
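As a non-authoritative sketch of equations (3)–(4), the helper below converts signed distance values into volume densities through a zero-mean Laplace CDF. The sign convention follows equation (3) as written; formulations such as VolSDF apply the CDF to the negated distance, so the sign should be adapted to whichever convention is adopted.

```python
# Sketch only: volume density sigma(x) = alpha * Psi_beta(f_theta(x)) from
# equations (3)-(4), with Psi_beta the CDF of a zero-mean Laplace distribution.
import torch

def laplace_cdf(s, beta):
    # CDF of a zero-mean Laplace distribution with scale beta.
    return torch.where(s <= 0,
                       0.5 * torch.exp(s / beta),
                       1.0 - 0.5 * torch.exp(-s / beta))

def sdf_to_density(sdf, alpha, beta):
    # alpha, beta > 0 are learnable scalars (kept positive here via abs()).
    # Note: VolSDF-style models use laplace_cdf(-sdf, beta); adapt the sign as needed.
    return alpha.abs() * laplace_cdf(sdf, beta.abs() + 1e-6)
```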
In step two, N samples x_i are taken along the ray {r(t) = o + td | t > 0}, and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
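The quadrature in equations (5)–(6) can be written compactly as below; this is an illustrative sketch that assumes the per-sample colors c_i and densities σ_i for one ray have already been produced by the networks.

```python
# Sketch of the discrete volume rendering in equations (5)-(6); fully differentiable.
import torch

def render_ray(colors, sigmas, t_vals):
    # colors: (N, 3), sigmas: (N,), t_vals: (N,) sorted sample depths along the ray.
    deltas = t_vals[1:] - t_vals[:-1]                      # delta_i between neighbours
    deltas = torch.cat([deltas, deltas[-1:]])              # pad the last interval
    alphas = 1.0 - torch.exp(-sigmas * deltas)             # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), accumulated transmittance.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = trans * alphas                               # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[:, None] * colors).sum(dim=0)          # C_hat(r), shape (3,)
```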
In step three, the signed distance function SDF implicitly represents the curved surface shape of the object:
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

That is, for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface; a positive sign indicates that the point lies outside the object surface, and a negative sign indicates that it lies inside. The surface S of the three-dimensional object is then determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function is approximated with a multilayer perceptron.
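Purely as an illustrative sketch of "approximating the signed distance function with a multilayer perceptron": the width, depth, activation and the extra global shape feature output below are assumptions for the example (positional encoding and skip connections are omitted for brevity).

```python
# Hedged sketch: an MLP f_theta mapping a 3D point to (signed distance, shape feature).
import torch

class SDFNetwork(torch.nn.Module):
    def __init__(self, hidden=256, n_layers=8, feat_dim=256):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(n_layers):
            layers += [torch.nn.Linear(in_dim, hidden), torch.nn.Softplus(beta=100)]
            in_dim = hidden
        self.body = torch.nn.Sequential(*layers)
        self.head = torch.nn.Linear(hidden, 1 + feat_dim)   # SDF value + global shape feature

    def forward(self, x):                                   # x: (..., 3)
        out = self.head(self.body(x))
        sdf, shape_feat = out[..., :1], out[..., 1:]
        return sdf, shape_feat
```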
In the fourth step, each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel. For each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
In step four, the training loss consists of two terms:
L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
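A hedged sketch of the two-term loss with λ = 0.1 follows. The l_1 color term matches the definition above; the gradient-norm regularizer is one concrete reading of the "energy loss encouraging f_θ to approximate a distance function" and is an assumption where the description leaves the exact expression to the figures.

```python
# Hedged sketch of the training loss: L = L_color + lambda * L_reg, lambda = 0.1.
import torch

def training_loss(pred_rgb, gt_rgb, sdf_net, reg_points, lam=0.1):
    # pred_rgb, gt_rgb: (m, 3) rendered and ground-truth pixel colors.
    # sdf_net: network returning (sdf, shape_feature); reg_points: (k, 3) points y in Y.
    color_loss = (pred_rgb - gt_rgb).abs().sum(dim=-1).mean()   # l1 per pixel, averaged

    # Encourage f_theta to behave like a signed distance function
    # (unit-norm gradient at the sampled points y).
    reg_points = reg_points.requires_grad_(True)
    sdf, _ = sdf_net(reg_points)
    grad = torch.autograd.grad(sdf.sum(), reg_points, create_graph=True)[0]
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    return color_loss + lam * eikonal
```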
In another aspect, the invention is directed to a multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
The multi-view three-dimensional reconstruction system based on implicit neural representation provided by the present application not only considers the influence of incident light on the reconstructed geometry under different viewing angles, but also extracts the pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improving the volume rendering process of the whole reconstruction model, further improving the fineness of the object surface, and reconstructing a high-resolution three-dimensional object model represented by a fine mesh.
In another aspect, a computer apparatus is provided, comprising a processor, a memory, and a program stored on the memory and executable by the processor, the program, when executed, performing at least one of the steps of the method as set forth above.
The above-described preferred conditions may be combined with each other to obtain a specific embodiment, in accordance with common knowledge in the art.
The beneficial effects of the invention are as follows: on top of implicit neural representation and volume rendering techniques, the invention provides a multi-view three-dimensional reconstruction algorithm based on pixel feature fusion. It not only considers the influence of incident light on the reconstructed geometry under different viewing angles, but also extracts the pixel features of the multi-view images with a purpose-designed pixel feature map encoder and fuses them into the global geometric features of surface points, improving the volume rendering process of the whole reconstruction model, further improving the fineness of the object surface, and reconstructing a high-resolution three-dimensional object model represented by a fine mesh.
By adopting the above technical scheme, the invention remedies the defects of the prior art, and is reasonably designed and convenient to operate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a flow chart of a multi-view three-dimensional reconstruction method based on implicit neural representation;
FIG. 2 is a schematic diagram of a pixel profile encoder network architecture;
fig. 3 is a schematic diagram of an overview of a pixel feature fused volume rendering network model.
Detailed Description
Those skilled in the art can appropriately substitute and/or modify the process parameters to implement the present disclosure, but it is specifically noted that all similar substitutes and/or modifications will be apparent to those skilled in the art and are deemed to be included in the present invention. While the invention has been described in terms of preferred embodiments, it will be apparent to those skilled in the art that the technology can be practiced and applied by modifying or appropriately combining the embodiments described herein without departing from the spirit and scope of the invention.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure herein. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is to be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the technical aspects of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The present invention is described in detail below.
Example 1:
a multi-view three-dimensional reconstruction method based on implicit neural representation is shown in figure 1 and specifically comprises the following steps.
Step one, designing the pixel feature map encoder structure:
obtaining a global feature map of the image through a residual convolutional neural network, extracting the feature map of each layer as the local feature maps of the image, splicing the global and local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension; for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map, and the pixel feature map of the image can be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
step two, establishing a pixel feature fused volume rendering mathematical model:
extracting the image global features and local features from the pixel feature map, fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model; that is, given a pixel, the ray from the camera center o through the pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel. The accumulated color C(r) along the ray is:
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
The volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
Along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
Step three, establishing the implicit neural surface model:
the curved surface shape of the object is implicitly represented using a signed distance function SDF,
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface; a positive sign indicates that the point lies outside the object surface, and a negative sign indicates that it lies inside. The surface S of the three-dimensional object is determined by the zero level set SDF(x) = 0,

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function is approximated with a multilayer perceptron, and this idea of approximating an implicit function with a neural network is commonly referred to as implicit neural representation.
Step four, designing a multi-view three-dimensional reconstruction network model with pixel feature fusion:
rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel; and for each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel;
the training loss consists mainly of two terms:
L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
And step five, after the signed distance of every three-dimensional spatial point has been obtained, the Marching Cubes algorithm is used to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
Example 2:
a multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
The details are as follows.
Pixel feature map encoder:
in the convolutional neural network, the feature information obtained by the convolutional layer of the shallow layer is more, the receptive field is smaller, and the feature data is generally closer to the original image, namely the shallow layer feature can contain more detailed features of objects; with the deeper and deeper layers, the obtained features are more and more abstract, and the receptive field is also more and more large, at the same time, the image features can contain the global features of the object, such as the classification information, the shape structure and the like of the object. Shallow features are often referred to as local features and deep features as global features.
In order to fully obtain the pixel features of the image under each viewing angle, the idea of a residual convolutional neural network is adopted to obtain the global feature map of the image; at the same time, the feature map of each layer is extracted as a local feature map of the image, the global and local feature maps are spliced by means of bilinear interpolation, and finally a pixel feature map with the same size as the original image is formed, so that each image pixel corresponds to a pixel feature of a specified dimension. We refer to a feature map generated in this way as a pixel feature map. That is, for an input image I, the extracted feature maps are F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map; the pixel feature map of the image can be written as:
f_I = (F_1, F_2, ..., F_N)   (1).
because the image pixel coordinate corresponding to each ray can be directly obtained in the volume rendering process, after the pixel feature map of the image is obtained, the pixel feature of the corresponding position can be directly indexed through the pixel coordinate, and the original pixel coordinate system is not required to be subjected to de-projection aiming at each space point, so that the calculation amount is reduced. The specific pixel profile encoder network structure is therefore shown in fig. 2.
The volume rendering mathematical model of pixel feature fusion:
inspired by a plenoptic function in computer graphics considering the wavelengths of different incident lights, the characteristics that the incident lights under different visual angles have different wavelengths are fully considered, the method provides that the image global characteristic and the local characteristic are extracted from an image and fused to form the pixel characteristic corresponding to each ray to represent the characteristics of different wavelengths, and the characteristic is fused into the global shape characteristic in a volume rendering formula to improve the reconstruction performance of the whole model. That is, given a pixel, the ray from the camera center o to the pixel is denoted as { r (t) = o + td | t >0}, and d is the unit direction vector along the camera center o to the ray of the pixel. Cumulative color along light C (r):
C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value. The light field is further constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d).
The spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)). This fusion scheme accounts for the BRDF of common materials through normal-based surface encoding during volume rendering, while also taking into account the different wavelengths of different incident rays and the global surface characteristics, which effectively improves reconstruction accuracy.
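One possible, non-authoritative realization of the constrained light field c(r(t), n, g, d), with the spatial point feature g formed by concatenating the ray's pixel feature and the global shape feature, is sketched below; the layer sizes and concatenation order are assumptions.

```python
# Hedged sketch of the fused color field: c = c(x, n, g, d),
# where g concatenates the ray's pixel feature and the global shape feature.
import torch

class FusedColorField(torch.nn.Module):
    def __init__(self, pixel_dim, shape_dim, hidden=256):
        super().__init__()
        in_dim = 3 + 3 + 3 + pixel_dim + shape_dim           # x, normal n, view dir d, g
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 3), torch.nn.Sigmoid())   # RGB in [0, 1]

    def forward(self, x, normal, view_dir, pixel_feat, shape_feat):
        g = torch.cat([pixel_feat, shape_feat], dim=-1)       # fused spatial point feature
        return self.net(torch.cat([x, normal, view_dir, g], dim=-1))
```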
Unlike the volume density function σ(x) in conventional volume rendering, which represents the probability that a ray terminates at the spatial point x, the volume density σ(x) is here modeled as a transformation of a learnable distance function f_θ:
σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β. Intuitively, the density σ models a homogeneous object of constant density α that falls off smoothly near the object boundary, where the amount of smoothing is controlled by β. Defining the volume density this way brings two advantages. First, it provides a useful inductive bias for the object surface: the zero level set of f_θ is where the volume density rises toward its maximum, so it can be regarded as the object surface, giving a principled way to reconstruct curved surfaces. Second, such a volume density function makes it possible to bound the opacity error of the rendered volume, which in turn allows the sampling strategy to be optimized so that the points sampled along each ray eventually converge to points on the object surface.
As in the conventional volume rendering formula, the function T(t) represents the accumulated transmittance along the ray, i.e. the probability that the ray propagates from o to r(t) without hitting any other particle.
Thus, along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points. Since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
Implicit neural surface model:
the method adopts an implicit neural representation mode to represent the curved surface of an object, namely, a Symbolic Distance Function (SDF) is used for implicitly representing the curved surface shape of the object:
SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

Specifically, for any given spatial point x, SDF(x) outputs the distance from that point to the nearest surface, and its sign indicates whether the point lies inside (negative) or outside (positive) the object. The surface S of the three-dimensional object can therefore be determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

The signed distance function SDF is approximated here with a multilayer perceptron; this idea of approximating an implicit function with a neural network is commonly referred to as implicit neural representation.
Pixel-feature-fused multi-view three-dimensional reconstruction network model:
the goal of multi-view three-dimensional reconstruction is to reconstruct the fine surface shape of an object given the multi-view images of the object and the internal and external parameters of the corresponding cameras. To learn the implicit neural surface representation of a three-dimensional object, i.e., to learn the weight parameters of a neural network, an image at a given perspective and camera parameters is rendered using a volume rendering formula of pixel feature fusion, and the difference from the input image is minimized. Experiments prove that the volume rendering method can accurately reconstruct the surface of an object with a complex structure. The network model design and training process is illustrated in fig. 3.
During training, the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters. Each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel. For each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
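A hedged sketch of assembling one training batch D = {C_p, o_p, d_p} follows: m pixels are drawn at random and their world-space rays are recovered from the camera intrinsics K and a camera-to-world pose. The pinhole conventions (camera looking along +z, no axis flips) are assumptions of the example.

```python
# Hedged sketch: sample m pixels and build their world-space rays (o_p, d_p)
# from intrinsics K (3x3 tensor) and camera-to-world pose c2w (4x4 tensor).
import torch

def sample_ray_batch(image, K, c2w, m):
    # image: (H, W, 3) RGB image of one view.
    h, w, _ = image.shape
    idx = torch.randint(0, h * w, (m,))
    v = torch.div(idx, w, rounding_mode="floor")             # pixel rows
    u = idx % w                                              # pixel columns
    colors = image[v, u]                                     # C_p: (m, 3)

    # Back-project pixel centers to camera-space directions, then rotate to world space.
    dirs_cam = torch.stack([(u + 0.5 - K[0, 2]) / K[0, 0],
                            (v + 0.5 - K[1, 2]) / K[1, 1],
                            torch.ones(m)], dim=-1)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)   # unit d_p
    origins = c2w[:3, 3].expand(m, 3)                        # o_p: camera center
    return colors, origins, dirs_world
```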
The training loss consists mainly of two terms:
L = L_color + λ·L_reg

where the color loss is defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6). The regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function; Y is formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )².

Throughout the training process, the hyper-parameter λ is set to 0.1.
Visualization of the three-dimensional object model:
after the symbolic distance of each three-dimensional space point is obtained, a Marching Cube algorithm can be used to obtain a three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a mesh with high resolution can be visualized.
Conventional techniques in the above embodiments are known to those skilled in the art, and thus will not be described in detail herein.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
While the invention has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope thereof.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The invention is not the best known technology.

Claims (10)

1. A multi-view three-dimensional reconstruction method based on implicit neural representation, characterized by comprising the following steps:
step one, obtaining a global feature map and local feature maps of an image through a residual convolutional neural network, splicing the global feature map and the local feature maps by means of bilinear interpolation, and finally forming a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
step two, extracting the image global features and local features from the pixel feature map and fusing them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fusing this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
step three, implicitly representing the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
step four, rendering images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimizing the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters;
and step five, after the signed distance of every three-dimensional spatial point has been obtained, using the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
2. The method of claim 1, wherein:
in step one, for an input image I, the extracted feature maps include F_1, F_2, ..., F_N, where F_1, F_2, ..., F_{N-1} are local feature maps and F_N is the global feature map, and the pixel feature map of the image can be written as:

f_I = (F_1, F_2, ..., F_N)   (1).
3. the method of claim 2, wherein:
in step two, given a pixel, the ray from the camera center o through that pixel is written as {r(t) = o + td | t > 0}, where d is the unit direction vector from the camera center o toward the pixel; the accumulated color C(r) along the ray is

C(r) = ∫ T(t) σ(r(t)) c(r(t), n, g, d) dt   (2)

where c(r, n, g, d) denotes the light field (also called the color field), i.e. the amount of light emitted by the spatial point r(t) in direction d, which can also be regarded as an RGB color value; the function T(t) denotes the accumulated transmittance along the ray, i.e. the probability that the ray travels from the camera center o to r(t) without hitting any other particle; and the light field is constrained by the surface normal n and the spatial point feature g, i.e.

c = c(r(t), n, g, d),

where the spatial point feature g is generated jointly from the pixel feature f_I of the image I corresponding to the ray r(t) and the global shape feature f_{θ,g}(r(t)).
4. The method of claim 3, wherein:
in step two, the volume density σ(x) is modeled as a transformation of a learnable distance function f_θ:

σ(x) = αΨ_β(f_θ(x))   (3)

Ψ_β(s) = (1/2)·exp(s/β) if s ≤ 0;  1 − (1/2)·exp(−s/β) if s > 0   (4)

where α, β > 0 are learnable parameters and Ψ_β is the cumulative distribution function of a Laplace distribution with zero mean and scale parameter β.
5. The method of claim 4, wherein:
in step two, along the ray {r(t) = o + td | t > 0}, N samples x_i are taken and the color of the ray r is approximated by numerical integration:

Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i   (5)

T_i = exp(−Σ_{j<i} σ_j(x_j) δ_j)   (6)

where δ_i = |x_{i+1} − x_i| is the distance between adjacent sample points; since this scheme only fuses pixel features into the global shape features, the whole process of computing Ĉ(r) from (c_i, σ_i) remains differentiable, so the model can be optimized in the manner of a neural network.
6. The method of claim 5, wherein:
in step three, the signed distance function SDF implicitly represents the curved surface shape of the object:

SDF: x ∈ ℝ³ ↦ SDF(x) ∈ ℝ   (7)

that is, for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, a positive sign indicating that the point lies outside the object surface and a negative sign indicating that it lies inside; the surface S of the three-dimensional object is determined by the zero level set SDF(x) = 0, i.e.

S = { x ∈ ℝ³ | SDF(x) = 0 }   (8)

and the signed distance function is approximated with a multilayer perceptron.
7. The method of claim 6, wherein:
in step four, each training iteration randomly samples a batch of m pixels and their corresponding rays in the world coordinate system, so the input data are D = {C_p, o_p, d_p}, where p ∈ [1, m]; C_p denotes the RGB color value of the p-th pixel, o_p the camera center position corresponding to the p-th pixel, and d_p the viewing direction of the p-th pixel; and for each ray, n spatial points {x_{p,i}}, i ∈ [1, n], are sampled with the sampling algorithm based on the opacity error bound, where x_{p,i} denotes the i-th sample point on the ray of the p-th pixel.
8. The method of claim 7, wherein:
in step four, the training loss consists mainly of two terms:

L = L_color + λ·L_reg

where L_color is the color loss, defined as

L_color = (1/m) Σ_{p=1}^{m} ‖ C_p − Ĉ_p ‖_1,

‖·‖_1 denotes the l_1 norm, and Ĉ_p = Ĉ(o_p, d_p) is the numerical approximation given by the volume rendering quadrature in equations (5)–(6); the regularization term L_reg is the energy loss that encourages f_θ to approximate a distance function, Y being formed, for each ray, by pairing a spatial point drawn by uniform random sampling in space with a spatial point produced by the sampling algorithm:

L_reg = E_{y∈Y} ( ‖ ∇f_θ(y) ‖_2 − 1 )²;

and throughout the training process, the hyper-parameter λ is set to 0.1.
9. A multi-view three-dimensional reconstruction system based on implicit neural representation, comprising:
a pixel feature map encoder configured to: obtain a global feature map and local feature maps of the image through a residual convolutional neural network, splice them by means of bilinear interpolation, and finally form a pixel feature map with the same size as the original image, so that each image pixel corresponds to a pixel feature of a specified dimension;
a pixel-feature-fused volume rendering mathematical model configured to: extract the image global features and local features from the pixel feature map, fuse them into the pixel feature corresponding to each ray to represent the characteristics of different wavelengths, and fuse this feature into the global shape feature in the volume rendering formula to improve the reconstruction performance of the volume rendering mathematical model;
an implicit neural surface model configured to: implicitly represent the curved surface shape of the object with a signed distance function SDF, i.e. for any given spatial point x, SDF(x) outputs the distance from the spatial point x to the nearest surface, the surface S of the three-dimensional object being determined by the zero level set SDF(x) = 0, and the signed distance function being approximated with a multilayer perceptron;
a pixel-feature-fused multi-view three-dimensional reconstruction network model configured to: render images under given viewing angles and camera parameters with the pixel-feature-fused volume rendering formula and minimize the difference from the input images, wherein during training the input data consist of a set of RGB images with camera intrinsic and extrinsic parameters; and
a three-dimensional object model visualization module configured to: after the signed distance of every three-dimensional spatial point has been obtained, use the Marching Cubes algorithm to extract the three-dimensional mesh surface corresponding to a threshold value, so that a 3D object model represented by a high-resolution mesh can be visualized.
10. A computer apparatus comprising a processor, a memory, and a program stored on the memory and executable by the processor, wherein: the program, when executed, performs at least one step of the method of any one of claims 1 to 8.
CN202211232601.6A 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation Pending CN115761178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211232601.6A CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211232601.6A CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Publications (1)

Publication Number Publication Date
CN115761178A true CN115761178A (en) 2023-03-07

Family

ID=85350972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211232601.6A Pending CN115761178A (en) 2022-10-10 2022-10-10 Multi-view three-dimensional reconstruction method based on implicit neural representation

Country Status (1)

Country Link
CN (1) CN115761178A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468767A (en) * 2023-03-28 2023-07-21 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN116468767B (en) * 2023-03-28 2023-10-13 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN117274472A (en) * 2023-08-16 2023-12-22 武汉大学 Aviation true projection image generation method and system based on implicit three-dimensional expression
CN117274472B (en) * 2023-08-16 2024-05-31 武汉大学 Aviation true projection image generation method and system based on implicit three-dimensional expression
CN117689747A (en) * 2023-11-23 2024-03-12 杭州图科智能信息科技有限公司 Multi-view nerve implicit surface reconstruction method based on point cloud guidance
CN117496091A (en) * 2023-12-28 2024-02-02 西南石油大学 Single-view three-dimensional reconstruction method based on local texture
CN117496091B (en) * 2023-12-28 2024-03-15 西南石油大学 Single-view three-dimensional reconstruction method based on local texture

Similar Documents

Publication Publication Date Title
Liu et al. Meshdiffusion: Score-based generative 3d mesh modeling
Liu et al. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction
CN108510573B (en) Multi-view face three-dimensional model reconstruction method based on deep learning
CN114549731B (en) Method and device for generating visual angle image, electronic equipment and storage medium
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
Wynn et al. Diffusionerf: Regularizing neural radiance fields with denoising diffusion models
CN111340944B (en) Single-image human body three-dimensional reconstruction method based on implicit function and human body template
CN110223370B (en) Method for generating complete human texture map from single-view picture
Guo et al. Streetsurf: Extending multi-view implicit surface reconstruction to street views
CN113313828B (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
CN116958453B (en) Three-dimensional model reconstruction method, device and medium based on nerve radiation field
WO2022198684A1 (en) Methods and systems for training quantized neural radiance field
CN114782634A (en) Monocular image dressing human body reconstruction method and system based on surface implicit function
CN115170741A (en) Rapid radiation field reconstruction method under sparse visual angle input
CN116993826A (en) Scene new view generation method based on local space aggregation nerve radiation field
CN115147709B (en) Underwater target three-dimensional reconstruction method based on deep learning
DE102022113244A1 (en) Joint shape and appearance optimization through topology scanning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN116797742A (en) Three-dimensional reconstruction method and system for indoor scene
CN114723884A (en) Three-dimensional face reconstruction method and device, computer equipment and storage medium
CN114549669B (en) Color three-dimensional point cloud acquisition method based on image fusion technology
Yang et al. Reconstructing objects in-the-wild for realistic sensor simulation
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
CN110675381A (en) Intrinsic image decomposition method based on serial structure network
CN116310228A (en) Surface reconstruction and new view synthesis method for remote sensing scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination