CN116977536A - Novel view synthesis method for borderless scenes based on a hybrid neural radiance field
- Publication number: CN116977536A (application CN202311018456.6A)
- Authority: CN (China)
- Prior art keywords: feature, sampling, color, pixel, borderless
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/08—Volume rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/06—Ray-tracing
Abstract
The invention relates to a novel view synthesis method for borderless scenes based on a hybrid neural radiance field, comprising the following steps: parameterize the borderless space into a bounded region and encode the scene information with a hash feature grid and a planar feature grid, building a color-and-volume-density decoder whose main body is an MLP; obtain the feature vectors of all sample points on each ray by spiral sampling on the surface of the viewing cone followed by linear interpolation; apply the volume rendering equation to obtain one feature vector and one mean depth per ray, and decode the corresponding pixel color with a shallow MLP; during optimization, supervise the color field with the true pixel colors and supervise the rendered depth with the sparse point cloud produced by SFM; finally, given any camera pose, render the image seen from that viewpoint. The invention enables novel view synthesis in 360-degree borderless scenes, strengthens the modeling capability of the neural radiance field under sparse viewpoints, and improves the quality of images rendered at novel viewpoints.
Description
Technical Field
The invention relates to the field of novel view synthesis, and in particular to a novel view synthesis method for borderless scenes based on a hybrid neural radiance field.
Background
Novel view synthesis is a computer graphics and computer vision technique that generates images from new viewpoints given a limited set of input images. It lets us observe a scene or object virtually from angles that are not present in the original input.
Traditional image synthesis techniques rely mainly on copying, cropping, stitching, and transforming images to create new viewpoints or scenes. These methods are limited by the input image quality, the range of viewing angles, and the scene complexity, so the generated images may be unrealistic or visually distorted. With the rapid development of virtual reality (VR), augmented reality (AR), and computer vision, the demand for higher-quality, more natural novel view synthesis keeps growing.
Novel view synthesis involves complex algorithms and models, such as methods based on generative adversarial networks (GANs), conditional generative models, and spatial transformer networks. These methods learn the transformation rules between input images to infer and generate additional viewpoints, making the synthesized results more diverse and natural. Despite significant progress, challenges remain in some applications, such as inaccuracies when processing complex scenes, motion blur, or low-quality images. Improving and optimizing novel view synthesis therefore remains important, especially in virtual reality, augmented reality, game development, and film production.
For novel view synthesis in borderless scenes, the shortcomings of the prior art are mainly the following: (1) a borderless scene spans a large range of scales, and a network struggles to learn continuous scene information; (2) a borderless scene contains rich information, so classical representation methods require long training times and suffer from forgetting; (3) existing rendering pipelines require many MLP forward passes, so synthesizing the color of a single ray is time-consuming.
Disclosure of Invention
The technical solution of the invention: it overcomes the shortcomings of the prior art by providing a novel view synthesis method for borderless scenes that improves the training speed and inference efficiency of the model from a limited set of known viewpoints, keeps the parameter count of the neural radiance field low, and achieves high-quality image rendering from any viewpoint.
The technical scheme of the invention is a novel view synthesis method for borderless scenes based on a hybrid neural radiance field, comprising the following steps:
(1) Parameterize the borderless space into a bounded region, encode the scene information with a multi-resolution hash feature grid and a planar feature grid, and build a color-and-volume-density decoder whose main body is an MLP;
(2) Using the feature grids built in step (1), obtain the feature vectors of all sample points on each ray by spiral sampling on the surface of the viewing cone followed by linear interpolation;
(3) Using the sample-point features obtained in step (2), apply the volume rendering equation to obtain one feature vector and one mean depth per ray, and decode the corresponding pixel color with a shallow MLP;
(4) For the rendering result of step (3), supervise the color field with the true pixel colors during optimization, and supervise the rendered depth with the sparse point cloud produced by SFM;
(5) With the model optimized in step (4), given any camera pose, render the image seen from that viewpoint.
In step (1), the borderless space is parameterized into a bounded region, the scene information is encoded with a multi-resolution 3D hash feature grid and 2D planar feature grids, and a color-and-volume-density decoder whose main body is an MLP is built, as follows:
The position of any three-dimensional point in space is transformed into a bounded spherical region with a coordinate transformation function f(x), where x is the three-dimensional coordinate of the sample point. Through f(x) the whole space is parameterized into a sphere of radius 2, on which a multi-resolution 3D hash feature grid is built to encode the scene information with few parameters:
N_l = N_min · b^l
where N_max is the number of nodes of the highest-resolution grid, N_min the number of nodes of the lowest-resolution grid, N_l the number of nodes of the l-th level grid, and b the scale ratio between adjacent resolutions. A 3D hash feature grid encodes scene information efficiently, but for a fixed hash-table length the higher-resolution levels suffer heavily from hash collisions, which can make colors and volume densities ambiguous. To alleviate collisions during queries, mutually orthogonal 2D planar feature grids are introduced to help represent scene detail. Finally, an MLP is built as the decoder responsible for regressing color and volume density from the high-dimensional feature vectors.
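The coordinate transformation function f(x) itself is not reproduced in this text. A minimal sketch of a contraction consistent with the radius-2 sphere described above, assuming the mip-NeRF 360-style form (which matches that description), is:

```python
import math

def contract(x):
    """Map an unbounded 3D point into a ball of radius 2 (assumed form of f(x)).

    Points with ||x|| <= 1 pass through unchanged; farther points are
    compressed so that infinity maps onto the sphere of radius 2.
    """
    norm = math.sqrt(sum(c * c for c in x))
    if norm <= 1.0:
        return list(x)
    scale = (2.0 - 1.0 / norm) / norm
    return [c * scale for c in x]

print(contract([1000.0, 0.0, 0.0]))  # ≈ [1.999, 0.0, 0.0]
```

Under this assumption every sample point, however distant, lands strictly inside the bounded region covered by the feature grids.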
In step (2), for the feature grids built in step (1), the feature vectors of all sampled truncated cones (frustums) on each ray are obtained by spiral sampling on the surface of the viewing cone followed by linear interpolation, as follows:
Through the camera pose, a viewing cone starting at the optical center and extending to infinity is generated for each pixel, and several truncated cones are sampled along it according to inverse depth, so that sampling is denser near the optical center and becomes gradually sparser farther away. For each sampled truncated cone, a spiral parametric equation is constructed on its lateral surface, and 7 points are sampled uniformly along the spiral to represent the truncated-cone region, where t is the sampling distance, r is the radius of the viewing cone on the normalized image plane, n is the total number of sample points (here n = 7), m is the number of spiral turns, and p is the coordinate of a sample point.
Each sample point is transformed into the spherical space by the coordinate transformation function; the eight nearest corner points are queried in the 3D hash feature grid, their feature vectors are indexed through the hash function, and the 3D feature of the sample point is obtained by trilinear interpolation. For the 2D planar feature grids, the sample point is projected onto three mutually orthogonal planes; in each plane the four corner points nearest to the projected point and their feature vectors are queried, the feature vector of the projected point is obtained by bilinear interpolation, and finally the feature vectors of the three projected points are fused to give the 2D feature of the sample point:
feature_3D = tri(f_1, f_2, ..., f_8)
where tri is the trilinear interpolation function, bil is the bilinear interpolation function, f_i is an indexed 3D feature, and g_i is an indexed 2D feature. The 3D and 2D features are concatenated as the feature representation of the sample point. After the feature vectors of the 7 sample points in each truncated cone are obtained by query and interpolation, the mean of their spatial coordinates and the distance of each sample point to that mean point are computed; the normalized inverse distances serve as per-point weights, and the seven feature vectors are fused into a single truncated-cone feature vector describing the whole truncated-cone region.
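The inverse-distance fusion of the seven sample features described above can be sketched as follows (the small epsilon guarding against a zero distance is an added assumption):

```python
def fuse_frustum_features(points, feats):
    """Fuse per-sample feature vectors into one truncated-cone feature.

    Weights are the normalized inverse distances of each sample point to
    the mean of all sample points, as the text describes.
    """
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(3)]
    dists = [sum((p[d] - mean[d]) ** 2 for d in range(3)) ** 0.5 for p in points]
    inv = [1.0 / (d + 1e-8) for d in dists]  # epsilon is an assumption
    total = sum(inv)
    weights = [w / total for w in inv]
    dim = len(feats[0])
    return [sum(weights[i] * feats[i][k] for i in range(n)) for k in range(dim)]

# Two symmetric points get equal weight, so the fused feature is the mean:
f = fuse_frustum_features([[0, 0, 0], [2, 0, 0]], [[1.0, 0.0], [0.0, 1.0]])
print(f)  # [0.5, 0.5]
```

Points nearer the centroid of the truncated cone thus dominate the fused descriptor.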
In step (3), for the truncated-cone feature vectors obtained in step (2), the volume rendering equation is used to obtain one color feature vector and one mean depth per ray, and the corresponding pixel color and depth, i.e. the rendering result, are decoded with a shallow MLP, as follows:
First, the truncated-cone feature vector from step (2) is passed through one fully connected layer to obtain the volume density and color feature vector at the truncated-cone mean point. The transmittance at every truncated-cone mean point is computed from the density distribution along the ray, giving the weight of each truncated cone in the rendering equation:
α_i = 1 − exp(−σ_i · δ_i)
T_i = ∏_{j&lt;i} (1 − α_j)
λ_i = T_i · α_i
where σ_i is the volume density, α_i the occupancy probability, δ_i the spacing between truncated cones, T_i the transmittance accumulated along the ray, and λ_i the rendering weight. All color feature vectors are weighted by these weights to produce a single feature per pixel; that feature is fed into the MLP to regress the pixel color, and the same weights applied to the sampling distances give the depth of the pixel:
feature_pixel = Σ_i λ_i · f_i
rgb = MLP(feature_pixel)
depth = Σ_i λ_i · t_i
where f_i is the truncated-cone feature vector, feature_pixel the pixel feature, t_i the sampling distance of the truncated cone, rgb the rendered pixel color, and depth the rendered depth, i.e. the rendering result.
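The weight computation and single-pass decoding above can be sketched as follows; the transmittance recursion is the standard volume-rendering accumulation, and the mlp argument is a placeholder for the shallow decoder:

```python
import math

def render_weights(sigmas, deltas):
    """Per-frustum rendering weights lambda_i = T_i * alpha_i,
    with alpha_i = 1 - exp(-sigma_i * delta_i) and T_i the
    transmittance accumulated over the preceding frustums."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def render_pixel(feats, dists, sigmas, deltas, mlp):
    """One fused feature per pixel, hence one MLP call per pixel."""
    lam = render_weights(sigmas, deltas)
    dim = len(feats[0])
    feature_pixel = [sum(l * f[k] for l, f in zip(lam, feats)) for k in range(dim)]
    depth = sum(l * t for l, t in zip(lam, dists))
    return mlp(feature_pixel), depth

# An opaque first frustum takes almost all the weight:
w = render_weights([50.0, 1.0], [1.0, 1.0])
print(w[0] > 0.99, sum(w) <= 1.0)  # True True
```

Because the features are fused before decoding, the MLP runs once per pixel rather than once per sample, which is the speed-up the text claims.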
In step (4), for the rendering result of step (3), the color field is supervised with the true pixel colors during optimization, and the rendered depth is supervised with the sparse point cloud produced by SFM, as follows:
A loss function is constructed for the algorithm. The first term is the color loss. Because the algorithm is a coarse-to-fine optimization, pixel colors are rendered in two stages, so the mean squared error is taken between the true pixel color and both rendered colors:
L_color = Σ_r ( ‖C(r) − C_coarse(r)‖² + ‖C(r) − C_fine(r)‖² )
where C(r) is the true color of the pixel, C_coarse(r) the pixel color rendered in the coarse stage, and C_fine(r) the pixel color rendered in the fine stage.
The second term is the depth loss. Training a neural radiance field requires fairly accurate camera poses; the SFM method recovers the poses and also yields a sparse point cloud of the scene. Projecting the sparse point cloud into each view gives sparse depths under the training poses, which are used to supervise the rendered depth:
L_depth = Σ_{r∈D} ‖ 1/D(r) − 1/D_fine(r) ‖²
where D is the set of rays generated by the projected point-cloud points, D(r) the point-cloud projection depth, and D_fine(r) the rendered depth. Inverse-depth supervision is used instead of supervising depth directly, to reduce the influence of outliers on the optimization.
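Assuming a mean-squared form for the inverse-depth term (the exact norm is not reproduced in this text), the effect of inverse-depth supervision can be sketched as:

```python
def inverse_depth_loss(proj_depths, rendered_depths):
    """Mean squared error between inverse depths. Inverse-depth
    supervision dampens the influence of distant outliers, per the text;
    the squared-error form and mean reduction are assumptions."""
    n = len(proj_depths)
    return sum((1.0 / d - 1.0 / dr) ** 2
               for d, dr in zip(proj_depths, rendered_depths)) / n

# The same 1-unit depth error contributes far less for a distant point
# than for a near one:
near = inverse_depth_loss([1.0], [2.0])     # (1 - 0.5)^2 = 0.25
far = inverse_depth_loss([100.0], [101.0])  # ~1e-8
print(near > far)  # True
```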
In step (5), with the model optimized in step (4), given any camera pose, the image seen from that viewpoint can be rendered, as follows:
An arbitrary viewpoint in the scene is generated according to the distribution of the training viewpoints, and viewing cones are generated for all pixels under that viewpoint. To achieve a better rendering effect than in the training stage, 49 points are spirally sampled on the surface of each truncated cone. Moreover, since the scene geometry has been discovered during training, the sampling range can be compressed from (0, +∞) to a neighborhood of the depth. Finally, the decoder regresses the color at every pixel position to produce the image for that viewpoint.
Compared with the prior art, the advantages of the invention are:
(1) The invention adopts a hybrid scene representation that fuses a 3D hash feature grid, 2D planar feature grids, and an MLP. While keeping the parameter count low, it accelerates the training and inference of the model and improves the novel-view synthesis quality of the neural radiance field in borderless scenes.
(2) The invention improves the rendering equation: after the density distribution along the ray is regressed, the sample-point features along the ray are weighted and summed, and only then is the color decoded by the MLP, which markedly reduces the computation in the rendering process.
In short, the method is simple in principle and achieves high-quality novel view synthesis in borderless scenes.
Drawings
FIG. 1 is a flow chart of the novel view synthesis method for borderless scenes based on a hybrid neural radiance field.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of protection of the present invention.
As shown in FIG. 1, the novel view synthesis method for borderless scenes based on a hybrid neural radiance field is implemented in the following steps:
Step 1. The position of any three-dimensional point in the borderless space is transformed into a bounded spherical region with the coordinate transformation function. A multi-resolution 3D hash feature grid is built on the spherical region according to the total parameter budget, encoding the scene information with few parameters:
N_l = N_min · b^l
where N_max is the number of nodes of the highest-resolution grid, N_min the number of nodes of the lowest-resolution grid, L the number of resolution levels, N_l the number of nodes of the l-th level grid, and b the scale ratio between adjacent resolutions.
A 3D hash feature grid can describe the density and color distributions of the scene effectively, but high-resolution levels suffer heavily from hash collisions, which can degrade rendering quality. Because a borderless scene spans a large range of scales, a feature grid at a single scale cannot learn all the information well, so 3D and 2D feature grids at different resolutions are built to strengthen the representation capability of the neural radiance field. Finally, the MLP is built as the color-and-volume-density decoder; combined with Fourier directional encoding, the MLP can regress view-dependent colors at the same spatial position.
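As an illustration of the multi-resolution schedule, the sketch below computes the per-level resolutions; the growth-factor formula b = exp((ln N_max − ln N_min)/(L − 1)) is the usual Instant-NGP-style choice and is an assumption here, since the text only states that b is the scale ratio between adjacent resolutions.

```python
import math

def level_resolutions(n_min, n_max, num_levels):
    """Per-level grid resolutions N_l = N_min * b**l for l = 0..L-1.

    b is chosen so that level 0 has N_min nodes and level L-1 has
    N_max nodes (an assumed, Instant-NGP-style convention).
    """
    b = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [int(round(n_min * b ** l)) for l in range(num_levels)]

print(level_resolutions(16, 2048, 8))  # [16, 32, 64, 128, 256, 512, 1024, 2048]
```

Coarse levels capture the large-scale layout of the borderless scene while fine levels add detail, at a parameter cost bounded by the hash-table length per level.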
Step 2. Using the camera pose, a viewing cone starting at the optical center and passing through the pixel center, extending to infinity, is generated for each pixel. With the total sample count specified, several truncated cones are sampled according to inverse depth, so that sampling is denser near the optical center and becomes gradually sparser farther away. For each sampled truncated cone, a spiral parametric equation is constructed on its lateral surface, and 7 points are sampled uniformly along the spiral to represent the truncated-cone region, where t is the sampling distance, r is the radius of the viewing cone on the normalized image plane, n is the total number of sample points, m is the number of spiral turns, and p is the coordinate of a sample point.
The sample point p is transformed into the bounded spherical space by the coordinate transformation function. Using the transformed coordinates, the eight nearest corner points are queried in the 3D hash feature grid, their feature vectors are indexed through the hash function, and the 3D feature of the sample point is obtained by trilinear interpolation. The sample point is also projected onto the three mutually orthogonal 2D planar feature grids; in each plane the four corner points nearest to the projected point and their feature vectors are queried, the feature vector of the projected point is obtained by bilinear interpolation, and finally the feature vectors of the three projected points are concatenated to give the 2D feature of the sample point:
feature_3D = tri(f_1, f_2, ..., f_8)
where tri is the trilinear interpolation function, bil is the bilinear interpolation function, f_i is an indexed 3D feature, and g_i is an indexed 2D feature. The 3D and 2D features are concatenated as the hybrid feature representation of the sample point.
After the feature vectors of the 7 sample points in each truncated cone are obtained by query and interpolation, the mean of their spatial coordinates and the distance of each sample point to that mean point are computed; the normalized inverse distances serve as per-point weights, and the seven feature vectors are merged into one truncated-cone feature vector that describes the whole truncated-cone region and participates in the subsequent rendering and regression operations.
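The spiral parametric equation itself is not reproduced in this text; one plausible construction consistent with the stated parameters (n points, m turns, cone radius r · t at distance t, points on the lateral surface) is sketched below. The linear spacing in t and the angle 2πms are illustrative assumptions.

```python
import math

def spiral_samples(origin, direction, u, v, t_near, t_far, r, n=7, m=2):
    """Sample n points on a helix over the lateral surface of a truncated cone.

    origin/direction define the ray; u, v are unit vectors orthogonal to
    the direction; r is the cone radius on the normalized image plane, so
    the cone radius at distance t is r * t. The exact parameterization in
    the patent is not reproduced here; this is a plausible stand-in.
    """
    points = []
    for k in range(n):
        s = k / (n - 1)                       # 0..1 along the frustum
        t = t_near + s * (t_far - t_near)     # distance along the ray
        theta = 2.0 * math.pi * m * s         # m full turns over the frustum
        radius = r * t                        # cone widens linearly with t
        points.append([
            origin[d] + t * direction[d]
            + radius * (math.cos(theta) * u[d] + math.sin(theta) * v[d])
            for d in range(3)
        ])
    return points

pts = spiral_samples([0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0],
                     t_near=1.0, t_far=2.0, r=0.01)
print(len(pts))  # 7
```

Each sampled point lies exactly on the cone surface at its distance, so the seven points together cover the truncated-cone region both along and around the ray.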
Step 3. The truncated-cone feature vectors along the ray are first passed through one fully connected layer; the first dimension of the result is the volume density at the truncated-cone mean point, and the remaining dimensions form the color feature vector. The transmittance at every truncated-cone mean point is computed from the density distribution along the ray, giving the weight of each truncated cone in the rendering equation:
α_i = 1 − exp(−σ_i · δ_i)
T_i = ∏_{j&lt;i} (1 − α_j)
λ_i = T_i · α_i
where σ_i is the volume density, α_i the occupancy probability, δ_i the spacing between truncated cones, T_i the transmittance, and λ_i the rendering weight. To accelerate the optimization of the neural radiance field, the color feature vectors are not fed into the MLP individually; instead, all color feature vectors are weighted by the truncated-cone weights along the ray to produce a single feature per pixel, which is then fed into the MLP to regress the pixel color. The same weights applied to the sampling distances t give the depth of the pixel:
feature_pixel = Σ_i λ_i · f_i
rgb = MLP(feature_pixel)
depth = Σ_i λ_i · t_i
where f_i is the truncated-cone feature vector, feature_pixel the pixel feature, rgb the rendered pixel color, and depth the rendered depth. The MLP forward pass is executed only once in the whole rendering process, which markedly reduces the computation and shortens the training time of the neural radiance field.
Step 4. A loss function is established for the algorithm. The first term is the color loss, the mean squared error between the true pixel color and the rendered colors:
L_color = Σ_r ( ‖C(r) − C_coarse(r)‖² + ‖C(r) − C_fine(r)‖² )
where C(r) is the true color of the pixel, C_coarse(r) the pixel color rendered in the coarse stage, and C_fine(r) the pixel color rendered in the fine stage.
The second term is the depth loss. The camera poses used in training are estimated by SFM, and the sparse point cloud obtained at the same time serves as a geometric prior of the scene. Projecting it into each view gives sparse depths under the training poses, used to supervise the rendered depth:
L_depth = Σ_{r∈D} ‖ 1/D(r) − 1/D_fine(r) ‖²
where D is the set of rays generated by the projected point-cloud points, D(r) the point-cloud projection depth, and D_fine(r) the rendered depth. Inverse-depth supervision is used instead of supervising depth directly, to reduce the influence of outliers on the optimization. Because the point cloud is sparse, only a few of the pixels in each iteration are point-cloud projections, so few points contribute to the depth loss; since fast convergence of the scene geometry is considered to help the colors converge correctly, the depth loss is given a larger weight factor. Meanwhile, so that the radiance field finds the correct direction early in training and converges stably later, a learning rate that varies with the training round is adopted: it rises gradually during the first five thousand rounds and then falls smoothly.
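The schedule just described (rising through the first five thousand rounds, then falling smoothly) can be sketched as follows; linear warmup, cosine decay, and all numeric defaults are illustrative assumptions beyond the text.

```python
import math

def learning_rate(step, base_lr=1e-2, warmup=5000, total=50000, final_lr=1e-4):
    """Warm up linearly for the first `warmup` steps, then decay smoothly.

    The text only says the rate rises over the first five thousand rounds
    and then falls smoothly; the exact shapes and constants here are
    illustrative assumptions.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(0) < learning_rate(4999))       # True: still warming up
print(learning_rate(5000) >= learning_rate(49999))  # True: decaying afterwards
```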
Step 5. An arbitrary viewpoint is generated in the scene according to the distribution of the known viewpoints, and viewing cones are generated for all pixels under that viewpoint; to achieve a better rendering effect, 49 points are spirally sampled on the surface of each truncated cone. Moreover, since the scene geometry has been discovered during training, sampling can be guided by the density distribution, compressing the sampling range from (0, +∞) to a neighborhood of the depth. Finally, the decoder regresses the color at every pixel position to produce the image for that viewpoint.
In summary, for borderless scenes, the invention uses a hybrid representation to improve the training speed and representation accuracy of the neural radiance field and to render high-quality novel-view images.
What is not described in detail in this specification belongs to the prior art known to those skilled in the art. The foregoing illustrative embodiments are provided to help those skilled in the art understand the present invention; the invention is not limited to the scope of these embodiments, and any changes apparent to those skilled in the art that remain within the spirit and scope of the invention as defined by the appended claims fall within its protection.
Claims (6)
1. A novel view synthesis method for borderless scenes based on a hybrid neural radiance field, characterized by comprising the following steps:
(1) Parameterize the borderless space into a bounded spherical region, encode the scene information with a 3D hash feature grid and 2D planar feature grids, and build a color-and-volume-density decoder whose main body is an MLP;
(2) Using the feature grids built in step (1), obtain the feature vectors of all sampled truncated cones on each ray by spiral sampling on the surface of the viewing cone followed by linear interpolation;
(3) Using the truncated-cone feature vectors obtained in step (2), apply the volume rendering equation to obtain one color feature vector and one mean depth per ray, and decode the corresponding pixel color and depth, i.e. the rendering result, with a shallow MLP;
(4) For the rendering result of step (3), supervise the color field with the true pixel colors during optimization, and supervise the rendered depth with the sparse point cloud produced by SFM;
(5) With the model optimized in step (4), given any camera pose, render the image seen from that viewpoint.
2. The novel view synthesis method for borderless scenes based on a hybrid neural radiance field according to claim 1, characterized in that step (1) is implemented as follows:
the position of any three-dimensional point in the borderless space is transformed into a bounded spherical region with a coordinate transformation function f(x), where x is the three-dimensional coordinate of the sample point; through f(x) the whole borderless space is parameterized into a bounded sphere of radius 2, on which a 3D hash feature grid is built:
N_l = N_min · b^l
where N_max is the number of nodes of the highest-resolution grid, N_min the number of nodes of the lowest-resolution grid, L the number of resolution levels, N_l the number of nodes of the l-th level grid, and b the scale ratio between adjacent resolutions;
2D planar feature grids are established on mutually orthogonal planes to help represent scene detail; the MLP is built as the color-and-volume-density decoder and, combined with Fourier directional encoding, can regress view-dependent colors at the same position in the borderless space.
3. The borderless scene new vision angle synthesis method based on mixed nerve radiation field of claim 1, characterized in that: the specific implementation method of the step (2) is as follows:
generating, from the camera pose, a light cone for each pixel that extends from the optical center to infinity, and sampling a number of truncated cones spaced by inverse depth, so that sampling is denser in the region near the optical center and becomes progressively sparser farther away; for each sampled truncated cone, a spiral parametric equation is constructed on its lateral surface, and n points are uniformly sampled along the spiral to represent the truncated-cone region:

where t is the distance from a sampling point to the optical center along the ray direction, r is the radius of the light cone on the normalized plane, n is the total number of sampling points (n = 7 here), m is the number of spiral turns, and p is the coordinate of a sampling point;
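The inverse-depth spacing and the spiral sampling above can be sketched as follows. The patent's exact spiral parametric equation was an image that did not survive extraction, so `spiral_points` is an assumed form consistent with the listed variables (t, r, n, m, p); the frame vectors o, d, u, v and the default m = 2 are likewise assumptions.

```python
import numpy as np

def frustum_bounds(near, far, n_frustums):
    """Truncated-cone boundaries spaced uniformly in inverse depth, so
    sampling is denser near the optical center and sparser far away."""
    s = np.linspace(0.0, 1.0, n_frustums + 1)
    return 1.0 / ((1.0 - s) / near + s / far)

def spiral_points(o, d, u, v, t0, t1, r, n=7, m=2):
    """n points on a spiral over the frustum's lateral surface: the angle
    completes m turns while the depth sweeps [t0, t1], and the lateral
    radius grows as t*r (r = cone radius on the normalized plane).
    Assumed parametric form, not the patent's exact equation."""
    t = np.linspace(t0, t1, n)
    theta = 2.0 * np.pi * m * np.linspace(0.0, 1.0, n)
    lateral = (t * r)[:, None] * (np.cos(theta)[:, None] * u
                                  + np.sin(theta)[:, None] * v)
    return o + t[:, None] * d + lateral
```

With o at the origin and d along +z, each sampled point sits exactly t·r away from the ray axis, i.e., on the cone's lateral surface.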
each sampling point is mapped into the bounded spherical region by the coordinate transformation function; the eight nearest corner points are queried in the 3D hash feature grid, the feature vectors of these corners are indexed through the hash function, and the 3D feature of the sampling point is then obtained by trilinear interpolation; for the 2D plane feature grids, the sampling point is projected onto the three mutually orthogonal planes, the four corner points nearest to each projection point and their corresponding feature vectors are queried in that plane, the feature vector of each projection point is obtained by bilinear interpolation, and finally the feature vectors of the three projection points are fused to obtain the 2D feature of the sampling point:

feature_3D = tri(f_1, f_2, ..., f_8)
feature_2D = bil(g_1, g_2, g_3, g_4)

where tri is the trilinear interpolation function, bil is the bilinear interpolation function, f_i are the indexed 3D features, and g_i are the indexed 2D features; the 3D feature is concatenated with the 2D feature as the feature-vector representation of the sampling point. After the feature vectors of the 7 sampling points in each truncated cone are obtained by query and interpolation, the mean of the spatial coordinates of the 7 points and the distance of each point to this mean point are computed; the normalized inverse distances serve as the weight of each sampling point, and the seven feature vectors are fused into a single truncated-cone feature vector that describes the whole truncated-cone region.
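The interpolation and inverse-distance fusion described above can be sketched as follows; a minimal illustration under assumptions, not the patent's implementation (the corner-indexing convention and the 1e-8 stabilizer are choices made here).

```python
import numpy as np

def trilinear(corner_feats, w):
    """Trilinear interpolation of 8 corner feature vectors; w in [0,1]^3 is
    the fractional position in the cell, and bit k of corner index i
    selects the low/high corner along axis k."""
    out = np.zeros_like(corner_feats[0], dtype=float)
    for i in range(8):
        weight = 1.0
        for axis in range(3):
            weight *= w[axis] if (i >> axis) & 1 else 1.0 - w[axis]
        out += weight * corner_feats[i]
    return out

def fuse_frustum(point_feats, points):
    """Fuse the per-point features of one truncated cone: weights are the
    normalized inverse distances of each point to the points' mean."""
    mean = points.mean(axis=0)
    inv_d = 1.0 / (np.linalg.norm(points - mean, axis=1) + 1e-8)
    w = inv_d / inv_d.sum()
    return (w[:, None] * point_feats).sum(axis=0)
```

Bilinear interpolation on the 2D plane grids is the same construction with 4 corners and a 2-component w.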
4. The novel view synthesis method for unbounded scenes based on a hybrid neural radiance field according to claim 1, characterized in that step (3) is specifically implemented as follows:
first, the truncated-cone feature vector obtained in step (2) is passed through a fully connected layer to obtain the volume density and the color feature vector at the truncated-cone mean point; the volume-density distribution along the ray is used to compute the light transmittance at all truncated-cone mean points, which yields the weight of each truncated cone on the ray in the rendering equation:

α_i = 1 − exp(−σ_i · δ_i)
T_i = ∏_{j<i} (1 − α_j)
λ_i = T_i · α_i

where σ_i is the volume density, α_i is the occupancy probability, δ_i is the spacing between truncated cones along the ray, T_i is the light transmittance, and λ_i is the rendering weight; all color feature vectors are weighted by these weights to generate one feature per pixel, which is fed into the MLP to regress the pixel color; applying the same weights to the sample depths gives the depth corresponding to the pixel:

feature_pixel = Σ_i λ_i · f_i
rgb = MLP(feature_pixel)
depth = Σ_i λ_i · t_i

where f_i is the truncated-cone feature vector, feature_pixel is the pixel feature, rgb is the rendered pixel color, and depth is the rendered depth.
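The weight computation and weighted sums of this step can be sketched for a single ray as follows; a minimal illustration of the rendering-weight arithmetic, with the shallow color MLP left out.

```python
import numpy as np

def render_ray(sigmas, deltas, feats, ts):
    """Compute per-frustum weights lambda_i = T_i * alpha_i along one ray,
    then the pixel feature and depth as weighted sums (the pixel color
    would come from the shallow MLP applied to the pixel feature)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # occupancy
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # T_i
    weights = trans * alphas                                        # lambda_i
    pixel_feat = (weights[:, None] * feats).sum(axis=0)
    depth = float((weights * ts).sum())
    return pixel_feat, depth, weights
```

An opaque frustum early on the ray absorbs nearly all the weight, so the rendered depth collapses to its depth; an empty ray yields zero weights and zero depth.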
5. The novel view synthesis method for unbounded scenes based on a hybrid neural radiance field according to claim 1, characterized in that step (4) is specifically implemented as follows:
the loss function of the algorithm is constructed. The first term is the color loss: the algorithm adopts a coarse-to-fine optimization process, the pixel color is rendered in each of the two stages, and a mean squared error is formed between the true pixel color and the two rendered colors:

L_color = Σ_r ( ‖C(r) − C_coarse(r)‖² + ‖C(r) − C_fine(r)‖² )

where C(r) is the true pixel color, C_coarse(r) is the pixel color rendered in the coarse stage, and C_fine(r) is the pixel color rendered in the fine stage;

the second term is the depth loss: the SfM method used to recover the camera poses also yields a sparse point cloud of the scene; projecting the sparse point cloud onto each view gives the sparse depth under that pose, which supervises the rendered depth:

L_depth = Σ_{r∈D} ‖D(r) − d_fine(r)‖²

where D is the set of rays generated by all point-cloud projection points, D(r) is the point-cloud projection depth, and d_fine(r) is the rendered depth.
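The two loss terms can be sketched as follows. The squared-error form of the depth penalty is an assumption, since the patent text does not state the exact norm; the coarse-to-fine color MSE follows the claim directly.

```python
import numpy as np

def color_loss(c_true, c_coarse, c_fine):
    """Coarse-to-fine MSE color loss over a batch of rays."""
    return float(np.mean((c_true - c_coarse) ** 2)
                 + np.mean((c_true - c_fine) ** 2))

def depth_loss(d_proj, d_fine):
    """Depth supervision on the rays that have an SfM point-cloud
    projection (squared error assumed; the claim only says the sparse
    depth 'supervises' the rendered depth)."""
    return float(np.mean((d_proj - d_fine) ** 2))
```

The depth term is evaluated only on rays in D, i.e., pixels that actually receive a projected sparse point; all other rays are supervised by color alone.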
6. The novel view synthesis method for unbounded scenes based on a hybrid neural radiance field according to claim 1, characterized in that step (5) is specifically implemented as follows:
an arbitrary observation view of the scene is generated; light cones are generated for all pixel positions under that view, 49 points are spirally sampled on the surface of each truncated cone, sampling is guided by the depth rendered during the optimization process, and finally the decoder regresses a color for each pixel position to obtain the imaging result under that view.
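One plausible reading of "sampling guided by the rendered depth" is to narrow the sampling interval around the depth obtained during optimization, so the final pass concentrates truncated cones near the visible surface. The interval form and the rel_width value below are assumptions, not the patent's stated scheme.

```python
def depth_guided_bounds(depth, rel_width=0.2):
    """Narrow the sampling interval around a previously rendered depth
    (hypothetical guidance scheme; rel_width is an illustrative choice).
    Returns (near, far) bounds clamped away from zero."""
    half = rel_width * depth
    return max(depth - half, 1e-3), depth + half
```

These bounds would then feed the same inverse-depth frustum sampling used in step (2), but over a much tighter range.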
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311018456.6A CN116977536A (en) | 2023-08-14 | 2023-08-14 | Novel visual angle synthesis method for borderless scene based on mixed nerve radiation field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116977536A true CN116977536A (en) | 2023-10-31 |
Family
ID=88484964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311018456.6A Pending CN116977536A (en) | 2023-08-14 | 2023-08-14 | Novel visual angle synthesis method for borderless scene based on mixed nerve radiation field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116977536A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117333609A (en) * | 2023-12-01 | 2024-01-02 | 北京渲光科技有限公司 | Image rendering method, network training method, device and medium |
CN117333609B (en) * | 2023-12-01 | 2024-02-09 | 北京渲光科技有限公司 | Image rendering method, network training method, device and medium |
CN117611492A (en) * | 2023-12-06 | 2024-02-27 | 电子科技大学 | Implicit expression and sharpening method for multispectral satellite remote sensing image |
CN117611492B (en) * | 2023-12-06 | 2024-06-04 | 电子科技大学 | Implicit expression and sharpening method for multispectral satellite remote sensing image |
CN117934728A (en) * | 2024-03-21 | 2024-04-26 | 海纳云物联科技有限公司 | Three-dimensional reconstruction method, device, equipment and storage medium |
CN117953137A (en) * | 2024-03-27 | 2024-04-30 | 哈尔滨工业大学(威海) | Human body re-illumination method based on dynamic surface reflection field |
CN118135122A (en) * | 2024-05-06 | 2024-06-04 | 浙江大学 | Unbounded scene reconstruction and new view angle synthesis method and system based on 3DGS |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |