CN117635801A - New view synthesis method and system based on real-time rendering generalizable neural radiance field - Google Patents

New view synthesis method and system based on real-time rendering generalizable neural radiance field

Info

Publication number
CN117635801A
Authority
CN
China
Prior art keywords
image
generalizable
radiation field
dimensional
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311694564.5A
Other languages
Chinese (zh)
Inventor
方力
黎雅诗
胡飞
叶龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202311694564.5A priority Critical patent/CN117635801A/en
Publication of CN117635801A publication Critical patent/CN117635801A/en
Pending legal-status Critical Current

Abstract

The invention provides a new view synthesis method and system based on a real-time rendering generalizable neural radiance field. The method comprises the following steps: constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks, where the model introduces global information of the input images through the similarity of image-block spatial features; training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device, the data including image, depth, and motion data of a real-world scene; and performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model. The invention can adapt to different scenes and tasks, accelerate the three-dimensional scene reconstruction process, improve rendering quality, provide finer visual effects, and reduce unnecessary consumption of computing resources.

Description

New view synthesis method and system based on real-time rendering generalizable neural radiance field
Technical Field
The invention relates to the technical field of new view synthesis, and in particular to a new view synthesis method and system based on a real-time rendering generalizable neural radiance field.
Background
With the continuous development of digital multimedia technology, people have higher requirements on video quality and content and pay more attention to video interactivity and visual sensory experience, and the explosive growth of the online entertainment industry has made the industry eager to unlock value through interactive experience. On this basis, people continually explore a better visual experience: at the image-quality level, visual media has developed from standard definition and high definition to 4K and 8K ultra-high definition; in terms of interactivity, from 2D to 3D and 4D, free-viewpoint video has become one of the directions of interactive multimedia development.
Free-viewpoint video combines dynamic and static viewing angles. Free movement of the viewpoint within the panoramic range gives the audience greater initiative and a better sense of immersion in both the visual dimension and the interactive spatial dimension, allowing viewers to flexibly choose the viewing angle and position rather than being limited to the viewpoint controlled by the content creator. It is widely applied in fields such as public safety, medical care, autonomous driving, cultural entertainment, and e-commerce, and has therefore become an important research area in virtual reality. Among these topics, how to efficiently model and render high-quality free-viewpoint video has become a research direction for many scholars.
Free-viewpoint video is mainly produced in two ways: image-based new view synthesis and model-based three-dimensional reconstruction. Image-based new view synthesis offers photo-realistic results and stronger immersion and has become one of the research hotspots. New view synthesis, a task in computer vision and computer graphics, refers to rendering a target image at an arbitrary target camera pose from a given set of source images and their camera poses. However, new view synthesis is an underdetermined problem: without priors or constraints there are multiple possible solutions, and a good solution requires a complete three-dimensional understanding of all objects visible in the unseen view, as well as handling of complications such as occlusion in the scene and the lack of textured surfaces. Therefore, conventional free-viewpoint video generation techniques generally require a large amount of computing resources and time, along with complex processing steps, to generate high-quality free-viewpoint video, and they suffer from problems such as low processing speed, high cost, and difficulty in achieving real-time interaction.
With the development of deep learning for image understanding tasks, information about a three-dimensional scene can be constructed well from two-dimensional images using deep-learning-based methods, and many new view synthesis methods combining deep learning with traditional techniques have emerged, such as NeRF (Neural Radiance Fields).
NeRF uses an implicit representation to achieve photo-realistic new view synthesis for the first time. It uses an MLP (Multilayer Perceptron) to fit a continuous function that implicitly learns a static three-dimensional scene: three-dimensional point coordinates and an observation direction are fed into the function to obtain the corresponding color and volume density, and pixel colors are obtained through differentiable neural volume rendering. Using only the input images as supervision, NeRF fits an accurate implicit function for high-resolution geometry and can achieve photo-realistic new view synthesis for complex scenes. The overall flow of the algorithm is shown in Fig. 1: a large number of views from different viewing angles are input, the neural radiance field of the three-dimensional model is constructed, and finally a view at a specified viewing angle is rendered.
NeRF adds a rendering step to the neural network by using a volume rendering method, so that the network can be trained directly from the error of the rendered image. Fig. 2 shows the differentiable rendering flow of NeRF. As shown in Fig. 2, NeRF fits a five-dimensional vector function, implemented with a multi-layer perceptron, that describes the geometric information and color information of the three-dimensional model. The input of the five-dimensional function consists of the three-dimensional coordinate vector x = (x, y, z) of a point in space and a two-dimensional view direction vector d = (θ, φ), and the output is the volume density σ of that point and its color c = (r, g, b) in the direction d. In the real world, an object's color is related to the illumination conditions, and different colors are observed when the same position of the same object is viewed from different perspectives; therefore, in the specific computation, σ depends only on the coordinate vector x, while c is determined by both x and d. Each pixel in the image is modeled by a corresponding ray emitted from the camera optical center o, written as r = o + t·d; after the σ and c of all spatial points on the ray are obtained, the corresponding pixel is rendered by volume rendering.
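As a purely illustrative sketch (not part of the original disclosure), the following Python (PyTorch) snippet shows the volume-rendering accumulation described above: given the densities σ and colors c of the samples on one ray r = o + t·d, it computes the rendered pixel color with the usual NeRF quadrature. The function name and tensor layout are assumptions made for the example.

```python
import torch

def render_ray(rgb, sigma, t_vals):
    """Accumulate per-sample colors along one ray (NeRF-style quadrature).

    rgb:    (S, 3) color c of each sample on the ray
    sigma:  (S,)   volume density of each sample
    t_vals: (S,)   depths t of the samples along r = o + t*d
    """
    deltas = t_vals[1:] - t_vals[:-1]                        # spacing between samples
    deltas = torch.cat([deltas, deltas.new_tensor([1e10])])  # last interval is open-ended
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # opacity of each sample
    trans = torch.cumprod(
        torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                        # transmittance up to each sample
    weights = alpha * trans
    pixel_color = (weights[:, None] * rgb).sum(dim=0)        # rendered pixel color
    return pixel_color, weights
```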
However, the function learned by NeRF represents only a single scene, and the method suffers from poor generalization, applicability only to static scenes, the need for a large number of multi-view input pictures, and slow training and inference, so it is difficult to generate free-viewpoint video for arbitrary scenes in real time.
Therefore, how to construct a network that supports generalization to any unseen static scene and achieves real-time rendering speed for new views has become one of the research directions in the field of new view synthesis for free-viewpoint video.
Disclosure of Invention
In view of the problems of low processing speed, high cost, and difficulty in realizing real-time interaction in current free-viewpoint video generation technology, the invention aims to provide a new view synthesis method and system based on a real-time rendering generalizable neural radiance field, so as to construct a network that supports generalization to any unseen static scene and achieves real-time rendering speed for new views.
In one aspect, the invention provides a new view synthesis method based on a real-time rendering generalizable neural radiance field, comprising the following steps:
S100: constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks; the generalizable neural radiance field algorithm network model introduces global information of the input images through the similarity of image-block spatial features;
S200: training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device; wherein the data collected by the data acquisition device comprises image data and motion data of a real-world scene;
S300: performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model.
Optionally, the generalizable neural radiance field algorithm network model includes:
a 2D feature extraction module for extracting multi-scale two-dimensional features of the input images;
a 3D feature extraction module for constructing a cost volume from the multi-scale two-dimensional features and extracting a depth probability volume and a three-dimensional feature volume from the cost volume through UNet-3D;
a sampling guidance module for screening sampling points in space based on the depth probability volume so as to retain the sampling points that meet a preset depth probability requirement;
a neural radiance field module for computing low-resolution target image features of the target image from the multi-scale two-dimensional features and the three-dimensional feature volume of the retained sampling points through a preset MLP network;
and an up-sampling module for up-sampling the low-resolution target image features to generate high-resolution target image features of the input images, and rendering a high-resolution image from the high-resolution target image features.
Optionally, the method by which the 2D feature extraction module extracts the multi-scale two-dimensional features comprises: for an input image I_i of size 3×H×W, downsampling to obtain a 2D low-resolution image feature F_{i,1} of size 32×(H/4)×(W/4), and then up-sampling the feature map of the 2D low-resolution feature F_{i,1} by bilinear interpolation to obtain a 2D feature F_{i,2} of size 32×(H/2)×(W/2); where H denotes the height of the input image, W denotes the width of the input image, N denotes the number of input images, and i denotes the i-th of the N images.
Optionally, the 2D feature extraction module further comprises a three-dimensional mapping unit for mapping a three-dimensional point in space onto the 2D features F_{i,2} of the input images to obtain pixel-aligned features {f_i | i = 1, ..., N}; and aggregating the pixel-aligned features over the different input views and performing a pooling operation to obtain the image feature of the three-dimensional point, f_img = ψ(f_1, ..., f_N).
Optionally, the 3D feature extraction module constructs the cost volume from the multi-scale two-dimensional features as follows:

the 2D features F_{i,2} are used to construct a cost volume based on the camera view frustum of the target view through differentiable homography warping, in which the N input images are projected onto a number of sampling planes {L_j | j = 1, ..., D} of the target view frustum, i.e. each F_{i,2} is mapped onto the D sampling planes to obtain the constructed cost volume; wherein,

given the camera intrinsic matrix K_i, rotation matrix R_i and translation matrix T_i of an input view I_i, and the camera intrinsic matrix K_t, rotation matrix R_t and translation matrix T_t of the target view I_t, the homography warping at depth z can be written (in the standard plane-sweep form) as

    H_i(z) = K_i · R_i · (I − (T_t − T_i)·a^T / z) · R_t^T · K_t^{−1}

where a^T is the transpose of the target-view camera principal axis a, I is the identity matrix, and H_i(z) projects a pixel (u, v) of the target view at depth z onto the input view I_i; the feature obtained by the homography warping is defined as

    f_{i,z}(u, v) = F_{i,2}( H_i(z) · [u, v, 1]^T )

and the cost volume is obtained by computing, for each voxel, the variance of the multi-view features mapped by the homography warping, where [u, v, 1]^T is the transpose of [u, v, 1].
Optionally, after extracting the three-dimensional feature volume of the cost volume, the 3D feature extraction module further performs tri-linear interpolation on the three-dimensional feature volume to obtain a voxel-aligned feature f_voxel carrying spatial geometric information.
Optionally, the method by which the neural radiance field module computes the low-resolution target image features from the multi-scale two-dimensional features and the three-dimensional feature volume through a preset MLP network comprises:

inputting the 2D feature and the 3D feature of a three-dimensional point into the preset MLP network to obtain the point feature and the volume density of the three-dimensional point, defined as

    f_p, σ = φ(f_img, f_voxel)

where φ is the MLP network, f_p is the point feature of the three-dimensional point, and σ is its volume density; from the point feature f_p of the three-dimensional point, its image feature f_img, and the viewing direction of the three-dimensional point in the target view relative to each input view, the mixing weights w_i of the input-view image colors are predicted, and the color feature observed when viewing the three-dimensional point from the given direction of the target view is then predicted from the mixing weights w_i as

    c = Σ_{i=1}^{N} w_i · f_i

where f_i is the two-dimensional image feature of input view i; the color features and the volume densities σ are then aggregated in 2D to obtain the final accumulated color feature of each ray.
Optionally, generating the high-resolution target image features of the input images by up-sampling the low-resolution target image features comprises: performing a sub-pixel convolution image-feature up-sampling operation on the low-resolution target image features to generate the high-resolution target image features of the input images.
Optionally, the sampling guidance module further comprises:

a first-level screening unit for obtaining, for a pixel (u, v) in the target image, the probability and standard deviation of the pixel lying on each depth plane by linear interpolation of the depth probability volume, thereby obtaining the depth range in which a surface exists, skipping blank regions in space according to this depth range and narrowing the sampling range;

and a second-level screening unit for guiding precise sampling within the narrowed sampling range using the cumulative density function of the depth probability volume, so as to screen out the sampling points to retain.
The invention also provides a new view synthesis system based on a real-time rendering generalizable neural radiance field, which performs new view synthesis of free-viewpoint video according to the new view synthesis method described above and comprises:

a network model construction unit for constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks, the model introducing global information of the input images through the similarity of image-block spatial features;

a network model training unit for training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device, the data comprising image data and motion data of a real-world scene;

and a new view synthesis unit for performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model.
The invention also provides an electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the steps of the new view synthesis method based on a real-time rendering generalizable neural radiance field described above.
According to the technical scheme above, the new view synthesis method and system based on a real-time rendering generalizable neural radiance field provided by the invention adopt an image-block-based, real-time-rendered generalizable neural radiance field and improve the generation of free-viewpoint video from three angles: data acquisition, network model, and rendering engine. Free-viewpoint video is produced with this new view synthesis technique by splitting the dynamic scene into frames, realizing a generalizable and real-time-rendering neural radiance field algorithm, constructing a network that supports generalization to any unseen static scene, and achieving real-time rendering speed for new views. As a result, free-viewpoint video is generated in a more efficient and real-time manner, the problems of complex computation and latency in traditional methods are eliminated, and a better user experience is provided for fields such as virtual reality, game development, and live broadcasting.
Drawings
Other objects and attainments together with a more complete understanding of the invention will become apparent and appreciated by referring to the following description taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 is a schematic flow chart of the neural radiance field algorithm;
FIG. 2 is a schematic diagram of the differentiable rendering flow of a neural radiance field;
FIG. 3 is a flow diagram of a new view synthesis method based on a real-time rendering generalizable neural radiance field according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the framework structure of a generalizable neural radiance field algorithm network model based on image-block spatial feature similarity according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the data training process of a generalizable neural radiance field algorithm network model based on image-block spatial feature similarity according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be evident, however, that such embodiment(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.
Aiming at the problems of the prior art, the invention provides a new view synthesis method and system based on a real-time rendering generalizable neural radiance field. In order to better explain the technical scheme of the invention, some of the technical terms involved are briefly explained below.
Neural radiance field: a neural radiance field (Neural Radiance Fields, NeRF for short) is a computer vision technique used to generate high-quality three-dimensional reconstruction models. It uses deep learning to extract geometric shape and texture information of an object from images taken at multiple viewing angles, and then uses this information to generate a continuous three-dimensional radiance field, so that a highly realistic three-dimensional model can be presented at any angle and distance. NeRF technology has broad application prospects in computer graphics, virtual reality, augmented reality, and related fields.
Multilayer perceptron: a Multi-Layer Perceptron (MLP) is a basic neural network architecture for deep learning tasks. An MLP is made up of multiple layers, each composed of neurons (or nodes) that are typically connected in a feed-forward manner: each neuron accepts the outputs of the neurons in the previous layer and passes its own output on to the next layer.
Volume density: the volume density in NeRF refers to the density value of each point in three-dimensional space. It characterizes whether an object or surface exists in the scene at a given point and can be seen as a measure of the probability that an actual object is present at that point.
Depth probability volume (Depth Probability Volume, DPV for short): the depth probability distribution of the pixels, obtained from a cost volume. The cost volume stores the matching degree of each pixel across the multiple views of a multi-view image set and has a size of B×C×D×H×W. The DPV also has dimensions B×C×D×H×W and stores the depth probability information of each pixel.
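As an illustrative sketch (not part of the original disclosure) of the relationship between a cost volume and a DPV, the following Python (PyTorch) snippet reduces a cost volume to per-depth matching scores and normalizes them over the depth dimension; the softmax normalization, the channel reduction, and all tensor names and shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a cost volume of size B x C x D x H x W, as described above.
B, C, D, H, W = 1, 8, 48, 128, 160
cost_volume = torch.randn(B, C, D, H, W)

# One illustrative way to obtain a depth probability distribution: reduce the
# feature channel to a single matching score per depth hypothesis, then
# normalize over the D depth planes so each pixel carries a distribution.
score = cost_volume.mean(dim=1)                   # B x D x H x W
dpv = F.softmax(score, dim=1)                     # probabilities sum to 1 along depth

# Expected depth per pixel (a common use of a DPV); depth planes are hypothetical.
depth_values = torch.linspace(2.0, 6.0, D).view(1, D, 1, 1)
expected_depth = (dpv * depth_values).sum(dim=1)  # B x H x W
```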
The new view synthesis method and system based on a real-time rendering generalizable neural radiance field provided by the invention adopt an image-block-based, real-time-rendered generalizable neural radiance field and improve new view synthesis from the following three angles:
1. Data acquisition equipment: cameras and other devices are used to efficiently capture images of the real-world scene, so that image data and motion data of the real-world scene can be acquired;
2. Network model: a highly optimized neural network model is designed and trained, which can convert the acquired data into a three-dimensional neural radiance field without requiring a large amount of computational resources;
3. Rendering engine: a real-time rendering engine composed of an up-sampling module and a differentiable surface rendering method is developed; an implicit representation of the scene is learned through deep learning and the scene is rendered by the real-time rendering engine, so that video with high-quality detail and a free-viewpoint effect can be presented in real time using the generated neural radiance field.
The technical scheme of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
It should be noted that the following description of the exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. Techniques and equipment known to those of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
To illustrate the new view synthesis method and system based on a real-time rendering generalizable neural radiance field provided by the invention, Fig. 3 and Fig. 4 respectively show, by way of example, the flow of the new view synthesis method and the framework structure of the generalizable neural radiance field algorithm network model based on image-block spatial feature similarity according to an embodiment of the invention.
As shown in Fig. 3 and Fig. 4 together, the new view synthesis method based on a real-time rendering generalizable neural radiance field provided by the invention mainly comprises the following steps:
S100: constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks; the generalizable neural radiance field algorithm network model introduces global information of the input images through the similarity of image-block spatial features.
The global information of an image refers to its overall attributes, such as color features, texture features, and shape features; global descriptions are not well suited to situations such as aliasing and occlusion. Local information is the feature information extracted from local regions of the image; the correlation between such features is small, and under occlusion the disappearance of some features does not affect the detection and matching of the others.
S200: training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device; wherein the data collected by the data acquisition device comprises image data and motion data of a real-world scene.
Specifically, as an example, the motion data of the real-world scene may be obtained through COLMAP: the captured image data of the real-world scene is fed into COLMAP, which yields the corresponding camera intrinsics, extrinsics, depth, motion, and related data.
S300: performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model.
In order to construct a network supporting generalization to any unseen static scene and realize real-time rendering speed of a new view, the invention firstly needs to construct a generalizable neural radiation field algorithm network model based on image block spatial feature similarity.
NeRF synthesizes new views of a fitted scene by encoding volume density and color, but it essentially overfits the geometric and color information of one scene and reproduces that scene from the network's memory, so it has no ability to generalize to new scenes. To address this generalization problem, several methods introduce image features, extracting features of the input images by convolution to generate a richer and more general scene representation.
Current multi-view reconstruction schemes are one way to address the generalization problem. Multi-view stereo (Multiple View Stereo, MVS) is a generalization of stereo vision that observes and acquires images of a three-dimensional scene from multiple viewpoints and performs matching and depth estimation. Deep-learning-based MVS methods mostly rely on a plane-sweep algorithm to generate a cost volume, the core of which is to verify depth hypotheses: pixels are projected onto different planes in space, and if the projections of a point are captured by different cameras with similar photometric values, the depth value of that point is valid; in this way the depth interval is divided into discrete values, and the most valid depth among all hypotheses is selected as the final depth estimate. This approach can estimate depth information in the scene and also enables generalization: given a three-dimensional point in space, it is projected onto each input view and the volume density is determined by judging whether the local features across the views are consistent, so the mapping between spatial points and input views can be learned and extended to any new scene.
During camera imaging, the acquired image data is discretized; limited by the photosensitive elements, each pixel on the imaging plane represents only the color in its vicinity. Pixels that are adjacent at the macroscopic level are separated by a certain distance at the microscopic level, and the pixels that lie between two actual physical pixels are called sub-pixels.
The sub-pixel algorithm is one method for improving image resolution, and the sub-pixel convolution layer involves no nonlinear operation, directly using the data in the feature map of the low-resolution image to produce a high-resolution image. Assuming a magnification factor of r, if the enlargement is implemented with conventional interpolation methods, the computation of the convolution layer occurs in the high-resolution space, which increases the amount of computation by a factor of r². If a deconvolution network is adopted, each input pixel is multiplied by the deconvolution kernel elements and the results are accumulated, which is even more computationally complex; moreover, because the deconvolution kernel slides with a stride of r, some pixel positions in the result are accumulated more times than others, which easily produces checkerboard noise and degrades edge and detail information. Therefore, adopting a sub-pixel convolution layer in the invention guarantees speed while giving a better image super-resolution effect.
There are two ways to apply a neural radiance field to dynamic scenes. One is dynamic NeRF, which additionally takes time-dimension information as input besides the viewing-angle and image information; the other is cross-scene generalizable NeRF, i.e. splitting the dynamic scene into frames and treating each frame image as a single scene. The invention adopts the second approach.
The invention processes the neural radiance field algorithm based on image-block spatial feature similarity in an image-block manner, and global information can be introduced through the similarity of spatial features. An image block is a patch in deep learning: when the resolution of the image to be processed is too large and resources (such as GPU memory and computing power) are limited, the image can be divided into small blocks, and these small image blocks are patches. In the invention, the image-block mode differs from the ray-by-ray processing of the original NeRF: if the size of an image block is h×w, then h×w rays are processed simultaneously, and the correlation among these rays can be introduced; a sketch of generating such a block of rays is given below.
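As a purely illustrative, non-limiting sketch of this image-block mode, the following Python (PyTorch) snippet builds the h×w rays of one image block so that they can be fed to the network together. The function name, the pinhole camera convention, and the camera-to-world pose format are assumptions made for the example, not details fixed by the invention.

```python
import torch

def patch_rays(K, c2w, u0, v0, h, w):
    """Build rays for an h x w image block whose top-left pixel is (u0, v0).

    K:   (3, 3) camera intrinsics
    c2w: (4, 4) camera-to-world pose
    Returns ray origins (h*w, 3) and directions (h*w, 3), one ray per pixel of
    the block, so the block's rays can be processed jointly.
    """
    vs, us = torch.meshgrid(
        torch.arange(v0, v0 + h, dtype=torch.float32),
        torch.arange(u0, u0 + w, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([us, vs, torch.ones_like(us)], dim=-1).reshape(-1, 3)
    dirs_cam = pix @ torch.inverse(K).T              # back-project pixels to camera rays
    dirs = dirs_cam @ c2w[:3, :3].T                  # rotate into the world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origins = c2w[:3, 3].expand_as(dirs)             # all rays share the camera center o
    return origins, dirs
```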
Fig. 5 is a schematic diagram of the data training process of a generalizable neural radiance field algorithm network model based on image-block spatial feature similarity according to an embodiment of the present invention.
As shown in Fig. 4 and Fig. 5 together, the generalizable neural radiance field algorithm network model 400 constructed based on image-block spatial feature similarity according to the embodiment of the invention mainly includes five parts: a 2D feature extraction module 410, a 3D feature extraction module 420, a neural radiance field module 430, an up-sampling module 440, and a sampling guidance module 450.
The 2D feature extraction module extracts multi-scale two-dimensional features (2D image features) of the multiple input images (multi-view images) through a 2D convolutional neural network (2D CNN). The 3D feature extraction module constructs a cost volume from the multi-scale two-dimensional features through a 3D convolutional neural network (3D CNN) and extracts a depth probability volume and a three-dimensional feature volume (3D image features) from the cost volume through UNet-3D, providing geometry-aware information for constructing the neural radiance field. The sampling guidance module screens the sampling points in space based on the depth probability volume so as to retain the sampling points that meet the preset depth probability requirement. The neural radiance field module feeds the interpolated two-dimensional and three-dimensional features into a preset MLP and computes the low-resolution target image features of the retained sampling points. The up-sampling module up-samples the low-resolution target image features obtained by the neural radiance field module to generate high-resolution target image features of the input images, and renders a high-resolution image from them.
Specifically, as an example, the method by which the 2D feature extraction module extracts the multi-scale two-dimensional features is as follows:

an input image I_i of size 3×H×W is fed into the 2D feature extraction module to obtain a 2D low-resolution feature F_{i,1} of size 32×(H/4)×(W/4); this feature map is then up-sampled by bilinear interpolation to obtain a 2D feature F_{i,2} of size 32×(H/2)×(W/2), where H denotes the height of the input image, W the width, N the number of input images, and i the i-th of the N images. The multi-scale features are used to construct the cost volume and the neural radiance field: F_{i,1} is used to construct the cost volume, and F_{i,2} is used to reconstruct the neural radiance field.

The four-times down-sampling above reduces the image size and therefore the GPU memory and the amount of computation, and the subsequent two-times up-sampling with bilinear interpolation yields feature information at different scales. Although the low-resolution feature map produced by the feature extraction module is down-sampled by a factor of 4 in each dimension compared with the original image, the neighborhood information of the retained pixels is encoded and stored in a 32-channel feature map, which contains rich semantic information.
In order to construct a network supporting generalization to any unseen static scene, the 2D feature extraction module further comprises a three-dimensional mapping unit: any three-dimensional point in space is mapped onto the 2D feature F_{i,2} of each input image (or onto the down-sampled 2D low-resolution feature F_{i,1}) to obtain pixel-aligned features {f_i | i = 1, ..., N}; these pixel features from the different input views are then aggregated and pooled to obtain the image feature of the three-dimensional point, f_img = ψ(f_1, ..., f_N). The image feature f_img of the three-dimensional point is used by the neural radiance field module to compute the low-resolution target image features.
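A purely illustrative sketch of the pixel-aligned feature lookup f_img = ψ(f_1, ..., f_N) described above, assuming a pinhole projection and mean pooling for ψ; the function name, matrix conventions, and intrinsics scaling are assumptions made for the example, not the actual implementation of the invention.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(pts, feats, K_list, w2c_list):
    """Project 3D points into every input view and sample its 2D feature map.

    pts:      (P, 3) 3D points in world coordinates
    feats:    list of N feature maps F_{i,2}, each (C, H2, W2)
    K_list:   list of N intrinsics (3, 3), assumed scaled to the feature resolution
    w2c_list: list of N world-to-camera matrices (4, 4)
    Returns f_img of shape (P, C): the view-pooled (mean) image feature.
    """
    per_view = []
    for F_i, K, w2c in zip(feats, K_list, w2c_list):
        C, H2, W2 = F_i.shape
        cam = (w2c[:3, :3] @ pts.T + w2c[:3, 3:4]).T            # world -> camera
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)             # perspective divide
        # normalize to [-1, 1] for grid_sample (x maps to width, y to height)
        grid = torch.stack([uv[:, 0] / (W2 - 1), uv[:, 1] / (H2 - 1)], dim=-1) * 2 - 1
        sampled = F.grid_sample(
            F_i[None], grid[None, :, None, :], align_corners=True
        )                                                        # 1 x C x P x 1
        per_view.append(sampled[0, :, :, 0].T)                   # P x C
    return torch.stack(per_view, dim=0).mean(dim=0)              # psi = mean pooling
```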
As an example, the specific method by which the 3D feature extraction module constructs the cost volume from the multi-scale two-dimensional features is as follows: the up-sampled 2D feature F_{i,2} is used to construct a cost volume based on the camera view frustum of the target view through differentiable homography warping. The N input images are projected onto a number of sampling planes {L_j | j = 1, ..., D} of the target view frustum, i.e. each F_{i,2} is mapped onto the D sampling planes to obtain the constructed cost volume.

Given the camera intrinsic matrix K_i, rotation matrix R_i and translation matrix T_i of an input view I_i, and the camera intrinsic matrix K_t, rotation matrix R_t and translation matrix T_t of the target view I_t, the homography warping at depth z can be defined (in the standard plane-sweep form) as

    H_i(z) = K_i · R_i · (I − (T_t − T_i)·a^T / z) · R_t^T · K_t^{−1}

where a^T is the transpose of the target-view camera principal axis a, I is the identity matrix, and H_i(z) projects a pixel (u, v) of the target view at depth z onto the input view I_i. The feature obtained by the homography warping is defined as

    f_{i,z}(u, v) = F_{i,2}( H_i(z) · [u, v, 1]^T )

and the cost volume is obtained by computing, for each voxel, the variance of the multi-view features mapped by the homography warping, where [u, v, 1]^T is the transpose of [u, v, 1]. The depth probability volume and the three-dimensional feature volume are then extracted from the cost volume through UNet-3D.
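The following is an illustrative sketch of the plane-sweep warping and variance-based cost volume described above. The exact sign and ordering of the homography depend on the pose convention, so the formula used here (a standard plane-sweep form) and all names, shapes, and helper choices are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def sweep_cost_volume(feats, Ks, Rs, Ts, K_t, R_t, T_t, depths):
    """Warp N input-view features onto the target view's depth planes and build
    a cost volume from the per-voxel variance across views.

    feats: (N, C, H, W) 2D features F_{i,2}; Ks/Rs/Ts: per-view K_i, R_i, T_i;
    K_t, R_t, T_t: target-view camera; depths: (D,) sampled depth planes.
    Returns a cost volume of shape (C, D, H, W).
    """
    N, C, H, W = feats.shape
    I3 = torch.eye(3)
    a = R_t[2:3, :]                    # target-view principal axis as a row vector (1, 3)
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)   # (H*W, 3)
    warped = []
    for i in range(N):
        per_depth = []
        for z in depths:
            # plane-sweep homography from the target view to view i (one common
            # convention; sign/ordering depend on how the poses are parameterized)
            Hmat = Ks[i] @ Rs[i] @ (I3 - (T_t - Ts[i])[:, None] @ a / z) @ R_t.T @ torch.inverse(K_t)
            uvw = pix @ Hmat.T
            uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
            grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
            per_depth.append(F.grid_sample(feats[i:i + 1], grid.view(1, H, W, 2),
                                           align_corners=True)[0])            # (C, H, W)
        warped.append(torch.stack(per_depth, dim=1))                           # (C, D, H, W)
    return torch.stack(warped, dim=0).var(dim=0)                               # variance over views
```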
The extraction of the depth probability volume and the three-dimensional feature volume is performed by UNet-3D and can be roughly divided into two parts: a backbone feature extraction part, in which 3D convolution layers and activation functions are stacked to obtain a number of effective features, and a feature enhancement part, in which the effective features obtained in the previous step are up-sampled and fused. Specifically, as an example, UNet-3D consists of encoder down-sampling, decoder up-sampling, and skip connections, where the skip connections transfer layers of the same resolution in the encoding path to the decoding path, providing high-resolution features for the decoding layers; finally the depth probability volume and the three-dimensional feature volume are obtained through two separate 3D convolution layers.
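An illustrative skeleton of such a UNet-3D, with encoder down-sampling, decoder up-sampling, a skip connection, and two 3D-convolution heads producing the depth probability volume and the three-dimensional feature volume; channel sizes, layer counts, and the softmax over depth are assumptions made for the example, not the actual network of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNet3D(nn.Module):
    """Tiny UNet-3D skeleton: encoder, decoder with a skip connection, two output heads."""
    def __init__(self, in_ch=8, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(in_ch, base, 3, padding=1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv3d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(True))
        self.dec1 = nn.Sequential(nn.Conv3d(base * 2, base, 3, padding=1), nn.ReLU(True))
        self.prob_head = nn.Conv3d(base, 1, 3, padding=1)      # -> depth probability volume
        self.feat_head = nn.Conv3d(base, base, 3, padding=1)   # -> 3D feature volume

    def forward(self, cost_volume):                             # B x C x D x H x W
        e1 = self.enc1(cost_volume)
        e2 = self.enc2(e1)
        d1 = F.interpolate(e2, size=e1.shape[2:], mode="trilinear", align_corners=True)
        d1 = self.dec1(d1) + e1                                 # skip connection
        dpv = torch.softmax(self.prob_head(d1).squeeze(1), dim=1)   # normalize over depth
        feat_vol = self.feat_head(d1)
        return dpv, feat_vol
```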
In addition, after extracting the three-dimensional feature volume of the cost volume, the 3D feature extraction module also performs tri-linear interpolation on the three-dimensional feature volume to obtain a voxel-aligned feature f_voxel carrying spatial geometric information. The voxel-aligned feature f_voxel is used by the neural radiance field module to compute the low-resolution target image features; as an input to the MLP network, it provides spatial geometric information for the network.
After the 2D feature and the 3D feature of a three-dimensional point are obtained, they are input into the MLP network, and the neural radiance field module obtains the point feature and the volume density of the three-dimensional point, defined as

    f_p, σ = φ(f_img, f_voxel)

where φ is the MLP network, f_p is the point feature of the three-dimensional point, and σ is its volume density. From the point feature f_p of the three-dimensional point, its image feature f_img, and the viewing direction of the three-dimensional point in the target view relative to each input view, the mixing weights w_i of the input-view image colors are predicted, and the color feature observed when viewing the three-dimensional point from the given direction of the target view is then predicted from w_i as

    c = Σ_{i=1}^{N} w_i · f_i

where f_i is the two-dimensional image feature of input view i. The color features and the volume densities predicted by the network are aggregated in 2D to obtain the final accumulated color feature of each ray; specifically, as an example, the color feature of a pixel can be obtained by a weighted sum, over the sampling points on one ray, of their color features and volume densities.

The mixing weight w_i is obtained as a function of Δd_i, where Δd_i is the difference between the viewing direction of the target view and that of the input view.
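A purely illustrative sketch of this radiance-field stage: it maps (f_img, f_voxel) to a point feature f_p and density σ, predicts per-view scores that are normalized into mixing weights w_i, and blends the input-view features into a color feature. The network sizes, the softmax normalization of the weights, and the use of Δd_i as a 3-vector are assumptions made for the example, not the actual formulation of the invention.

```python
import torch
import torch.nn as nn

class PointRadianceHead(nn.Module):
    """Sketch: (f_img, f_voxel) -> point feature and density; per-view blending
    weights mix the input-view features f_i into the observed color feature."""
    def __init__(self, img_dim=32, vox_dim=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim + vox_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)
        # score for each input view from the point feature, its image feature, and
        # the direction difference delta_d between the target and input viewing rays
        self.weight_mlp = nn.Sequential(nn.Linear(hidden + img_dim + 3, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, f_img, f_voxel, f_views, delta_d):
        # f_img: (P, img_dim) pooled image feature, f_voxel: (P, vox_dim)
        # f_views: (N, P, img_dim) per-view features f_i, delta_d: (N, P, 3)
        f_p = self.mlp(torch.cat([f_img, f_voxel], dim=-1))           # point feature
        sigma = torch.relu(self.sigma(f_p)).squeeze(-1)               # volume density
        N = f_views.shape[0]
        scores = self.weight_mlp(torch.cat(
            [f_p.unsqueeze(0).expand(N, -1, -1), f_views, delta_d], dim=-1))  # (N, P, 1)
        w = torch.softmax(scores, dim=0)                              # mixing weights w_i
        color_feat = (w * f_views).sum(dim=0)                         # c = sum_i w_i * f_i
        return f_p, sigma, color_feat
```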
The invention improves the final image quality through a sub-pixel-convolution image-feature up-sampling operation. The low-resolution target feature map is fed into the up-sampling module and processed to obtain a high-resolution RGB image as the high-resolution target output; placed at the end of the model, this module preserves more texture regions in the low-resolution space while guaranteeing speed. Moreover, up-sampling the feature map learns spatial correlation better and yields higher-quality images than interpolating a low-resolution RGB image to a high-resolution RGB image. Sub-pixel convolution is effectively an up-sampling by pixel rearrangement: for example, for a low-resolution image of size [H, W, C], a convolution first produces a feature map of size [H, W, C·r²], where r is the up-sampling factor, and a shuffle transform then rearranges it into a feature map of size [H·r, W·r, C], thereby up-sampling the image.
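An illustrative sketch of the sub-pixel convolution up-sampling described above, using a convolution that outputs C·r² channels followed by a pixel-shuffle rearrangement; the channel counts and the factor r are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Sub-pixel convolution upsampling: a conv produces out_ch*r^2 channels at low
    resolution, then PixelShuffle rearranges them into an r-times larger image."""
    def __init__(self, in_ch=32, out_ch=3, r=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, low_res_feat):            # B x in_ch x H x W
        x = self.conv(low_res_feat)             # B x (out_ch*r^2) x H x W
        return self.shuffle(x)                  # B x out_ch x (H*r) x (W*r)

# usage sketch with hypothetical sizes
up = SubPixelUpsampler(in_ch=32, out_ch=3, r=4)
rgb_hr = up(torch.randn(1, 32, 64, 80))         # -> 1 x 3 x 256 x 320
```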
The sub-pixel convolution does not operate on whole pixels and has low network complexity, so adopting sub-pixel convolution for up-sampling can greatly improve the overall training and inference speed of the network. The flow by which the up-sampling module obtains the high-resolution RGB image is shown in the accompanying drawings.
The sampling guidance module mainly uses the depth probability volume (Depth Probability Volume, DPV for short) obtained by the three-dimensional feature extraction module. The DPV is associated with the three-dimensional points in the scene and stores depth estimates together with probability information about the existence of a surface. Using the surface-existence probability in the DPV to guide sampling reduces the number of sampling points, which lowers the amount of computation and saves time, and also makes the sampling positions more accurate; the effect of this step is similar to NeRF's coarse sampling step, which obtains the densities used for fine sampling from several MLP layers. Meanwhile, since the depth probability volume contains the depth probability information of the sampling points, the cumulative density function is computed after the surfaces in the three-dimensional scene are estimated from this information, and the result decides which sampling points are retained and further refined and which are discarded. The sampling points meeting the preset depth probability requirement can be retained: for example, a number of sampling points with higher depth probability values are kept, processed, and fed into the network for further computation, while sampling points with small depth probability values can be discarded to reduce unnecessary sampling. By retaining the sampling points with higher depth probability values, more detailed and accurate rendering results can be obtained. Such a sampling guidance strategy can focus on regions where uncertainty exists without consuming significant computational resources, thereby reconstructing the scene more efficiently.
The sampling guidance module 450 extracts the depth probability volume from the cost volume through UNet-3D and obtains the probability density function of the sampling points from the depth probability volume, so as to retain a number of sampling points with larger depth probability values. Specifically, as an example, the sampling guidance module 450 further comprises:

a first-level screening unit for obtaining, for a pixel (u, v) in the target image, the probability and standard deviation of the pixel lying on each depth plane by linear interpolation of the depth probability volume, thereby obtaining the depth range in which a surface exists, skipping blank regions in space according to this depth range and narrowing the sampling range;

and a second-level screening unit for guiding precise sampling within the narrowed sampling range using the cumulative density function of the depth probability volume, so as to screen out the sampling points to retain.
Since the original NeRF samples 128 points on the ray of each pixel, most of the sampling points are located in empty space, which wastes computing resources. The invention uses the depth probability volume to estimate the probability that a surface exists in three-dimensional space; this surface estimate reduces the number of sampling points and increases the running speed, and because the sampling points are placed relatively accurately, it also improves the quality of the final reconstructed image of the scene.
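A purely illustrative sketch of DPV-guided sampling for a single pixel: the depth range holding most of the probability mass is kept (skipping empty space), and the remaining samples are placed by inverting the cumulative density function. The probability-mass threshold, the number of retained samples, and the function name are assumptions made for the example.

```python
import torch

def guided_samples(dpv_pixel, depth_planes, keep=8, mass=0.95):
    """Pick sample depths for one pixel from its depth probability distribution.

    dpv_pixel:    (D,) probability of the surface lying on each depth plane
    depth_planes: (D,) depth value of each plane
    First restrict sampling to the smallest depth range holding `mass` of the
    probability (skipping empty space), then place `keep` samples by inverting
    the cumulative density function inside that range.
    """
    cdf = torch.cumsum(dpv_pixel, dim=0)
    lo = torch.searchsorted(cdf, torch.tensor((1 - mass) / 2))
    hi = torch.searchsorted(cdf, torch.tensor(1 - (1 - mass) / 2))
    lo, hi = int(lo), min(int(hi) + 1, len(depth_planes) - 1)

    # inverse-CDF sampling restricted to the [lo, hi] depth range
    local = dpv_pixel[lo:hi + 1]
    local = local / local.sum()
    local_cdf = torch.cumsum(local, dim=0)
    u = (torch.arange(keep, dtype=torch.float32) + 0.5) / keep
    idx = torch.searchsorted(local_cdf, u).clamp(max=hi - lo)
    return depth_planes[lo + idx]
```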
As can be seen from the above embodiment, the new view synthesis method based on a real-time rendering generalizable neural radiance field provided by the invention is an efficient NeRF-style method for three-dimensional scene reconstruction and free-viewpoint video generation. The method uses the depth probability volume to estimate the probability that a surface exists in three-dimensional space and uses this estimate to intelligently guide the sampling process, so that suitable sampling positions are found more accurately and unnecessary waste of computing resources is reduced. In addition, the surface-probability estimate from the depth probability volume replaces the conventional coarse-sampling step, and fine-grained sampling is performed directly, which improves operating efficiency. The invention further introduces a sampling guidance strategy for the sampling points to improve efficiency, and the up-sampling module up-samples the low-resolution feature map in a way that exploits spatial correlation, learning the spatial correlation better and thereby obtaining higher-quality images, which also helps efficiency.
On the other hand, the invention also provides a new view synthesis system based on a real-time rendering generalizable neural radiance field, which performs new view synthesis of free-viewpoint video according to the new view synthesis method described above and comprises:

a network model construction unit for constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks, the model introducing global information of the input images through the similarity of image-block spatial features;

a network model training unit for training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device, the data comprising image data and motion data of a real-world scene;

and a new view synthesis unit for performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model.
The generalizable neural radiance field algorithm network model based on image-block spatial feature similarity constructed by the network model construction unit mainly includes the following five parts:
a 2D feature extraction module for extracting multi-scale two-dimensional features of the input images;
a 3D feature extraction module for constructing a cost volume from the multi-scale two-dimensional features and extracting a depth probability volume and a three-dimensional feature volume from the cost volume through UNet-3D;
a sampling guidance module for screening sampling points in space based on the depth probability volume so as to retain the sampling points that meet a preset depth probability requirement;
a neural radiance field module for computing low-resolution target image features of the retained sampling points from the multi-scale two-dimensional features and the three-dimensional feature volume through a preset MLP network;
and an up-sampling module for up-sampling the low-resolution target image features to generate high-resolution target image features of the input images, and rendering a high-resolution image from the high-resolution target image features.
For the specific implementation steps of the system corresponding to the new view synthesis method based on a real-time rendering generalizable neural radiance field, reference may be made to the specific embodiments of the method described above, which are not repeated here.
As can be seen from the above embodiments, compared with existing new view synthesis schemes, the new view synthesis method and system based on a real-time rendering generalizable neural radiance field provided by the invention have the following advantages:
1. High operating efficiency: one of the core goals of the invention is to improve the operating efficiency of NeRF-style new view synthesis schemes. Through the intelligent sampling guidance strategy, unnecessary computational overhead is reduced, making the operation more efficient.
2. More accurate three-dimensional scene reconstruction: surface-probability estimation with the depth probability volume locates the positions where surfaces exist more accurately, and up-sampling in the feature dimension learns spatial correlation, so higher-quality images are obtained and the accuracy of three-dimensional scene reconstruction is improved.
3. Reduced resource consumption: avoiding uniform coarse sampling of the entire scene reduces the waste of computing resources, which helps improve performance in computationally intensive tasks.
4. Wide applicability: the invention has application potential in many fields such as virtual reality, game development, augmented reality, and free-viewpoint video generation, and can improve the products and service quality in these fields.
By applying the new view synthesis method and system based on a real-time rendering generalizable neural radiance field described above, the following technical effects can be obtained:
1. Efficient three-dimensional reconstruction: the invention accelerates the three-dimensional scene reconstruction process, making it more suitable for real-time or interactive applications such as virtual reality experiences, games, and live broadcasting.
2. Improved rendering quality: through a more intelligent sampling guidance strategy, the method is expected to improve rendering quality and provide finer visual effects.
3. Resource saving: unnecessary consumption of computing resources is reduced, hardware requirements are lowered, and performance is improved.
4. Scalability: the invention can adapt to different scenes and tasks, and therefore has good scalability and adaptability in a variety of applications.
As another aspect of the present invention, as shown in fig. 6, the present invention also provides an electronic apparatus including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the steps of the new view synthesis method based on a real-time rendering generalizable neural radiance field described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is not limiting of the electronic device 1 and may include fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The new view synthesis program 12 based on a real-time rendering generalizable neural radiance field stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed by the processor 10, can implement:
S100: constructing a generalizable neural radiance field algorithm network model based on the spatial feature similarity of image blocks; the generalizable neural radiance field algorithm network model introduces global information of the input images through the similarity of image-block spatial features;
S200: training the generalizable neural radiance field algorithm network model with data acquired by a data acquisition device; wherein the data collected by the data acquisition device comprises image data and motion data of a real-world scene;
S300: performing new view synthesis of free-viewpoint video using the trained generalizable neural radiance field algorithm network model.
In particular, the specific implementation method of the above instruction by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 3 and fig. 4, which are not repeated herein.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The new view synthesis method and system based on a real-time rendering generalizable neural radiance field proposed by the invention have been described above by way of example with reference to the accompanying drawings. However, it will be appreciated by those skilled in the art that various modifications may be made to the new view synthesis method and system described above without departing from the scope of the present invention. Accordingly, the scope of the invention should be determined from the following claims.

Claims (11)

1. A new view synthesis method based on real-time rendering generalizable neural radiation field, comprising:
S100: constructing a generalizable neural radiation field algorithm network model based on the spatial feature similarity of image blocks; the generalizable neural radiation field algorithm network model introduces global information of the input images through the similarity of image-block spatial features;
S200: training the generalizable neural radiation field algorithm network model with data acquired by a data acquisition device; wherein the data acquired by the data acquisition device comprise image data and motion data of a real-world scene;
S300: performing new view synthesis of free-viewpoint video using the trained generalizable neural radiation field algorithm network model.
2. The method of claim 1, wherein the generalizable neural radiation field algorithm network model comprises:
the 2D feature extraction module is used for extracting multi-scale two-dimensional features of the input image;
the 3D feature extraction module is used for constructing a cost volume from the multi-scale two-dimensional features and extracting a depth probability body and a three-dimensional feature body from the cost volume through UNet-3D;
the sampling guidance module is used for screening sampling points in the space based on the depth probability body so as to reserve the sampling points meeting the preset depth probability requirement;
the neural radiation field module is used for calculating low-resolution target image features of the target image, via a preset MLP network, from the multi-scale two-dimensional features and the three-dimensional feature body at the retained sampling points;
and the up-sampling module is used for up-sampling the low-resolution target image features to generate high-resolution target image features of the input image, and rendering the high-resolution target image features to obtain a high-resolution image.
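To visualize how the five modules fit together, the following is a minimal PyTorch-style sketch of the forward pass; every module class name here (FeatureNet2D, UNet3D, DepthGuidedSampler, RadianceFieldMLP, SubPixelUpsampler) is an illustrative placeholder rather than an implementation from the patent.

# Sketch of the five-module pipeline of claim 2; all module classes are assumed placeholders.
import torch.nn as nn

class GeneralizableNeRFPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.feat2d = FeatureNet2D()         # multi-scale 2D features (assumed)
        self.feat3d = UNet3D()               # cost volume -> depth probability body + 3D features (assumed)
        self.sampler = DepthGuidedSampler()  # keeps points above a depth-probability threshold (assumed)
        self.nerf_mlp = RadianceFieldMLP()   # low-resolution target image features (assumed)
        self.upsampler = SubPixelUpsampler() # low-resolution -> high-resolution rendering (assumed)

    def forward(self, src_images, src_poses, tgt_pose):
        feats_2d = self.feat2d(src_images)
        depth_prob, feats_3d = self.feat3d(feats_2d, src_poses, tgt_pose)
        points = self.sampler(depth_prob, tgt_pose)
        lowres = self.nerf_mlp(points, feats_2d, feats_3d)
        return self.upsampler(lowres)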
3. The method for synthesizing the new view based on the real-time rendering generalizable neural radiation field according to claim 2, wherein the method for extracting the multi-scale two-dimensional features by the 2D feature extraction module comprises the following steps:
extracting, from an input image of size 3×H×W, a low-resolution 2D image feature F_{i,1}, and then up-sampling the feature map of F_{i,1} by bilinear interpolation to obtain a 2D feature F_{i,2}; where H denotes the height of the input image, W denotes the width of the input image, N denotes the number of input images, and i denotes the i-th of the N images.
4. The method of claim 3, wherein the 2D feature extraction module further comprises a three-dimensional mapping unit for mapping three-dimensional points in space onto the 2D features F_{i,2} of the input images to obtain the pixel-aligned features {f_i | i = 1, ..., N}; and for aggregating the pixel-aligned features over the different input views and performing a pooling operation to obtain the image feature of the three-dimensional point, f_img = ψ(f_1, ..., f_N).
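The projection-and-pooling step can be sketched in PyTorch as follows, assuming pinhole cameras with world-to-camera extrinsics (R_i, T_i) and mean pooling as the aggregation ψ; the function name pixel_aligned_features and these conventions are illustrative assumptions, not code from the patent.

import torch
import torch.nn.functional as F

def pixel_aligned_features(points, feats, K, R, T, H, W):
    # Project 3D points (P, 3) into each of N views and sample the 2D features F_{i,2}.
    # feats: (N, C, H, W); K, R: (N, 3, 3); T: (N, 3). Returns f_img of shape (P, C),
    # using mean pooling over views as the aggregation psi (an assumption).
    per_view = []
    for i in range(feats.shape[0]):
        cam = R[i] @ points.T + T[i][:, None]              # (3, P) camera coordinates
        uvw = K[i] @ cam                                    # (3, P) homogeneous pixel coordinates
        uv = uvw[:2] / uvw[2:].clamp(min=1e-6)              # (2, P) pixel coordinates
        grid = torch.stack([uv[0] / (W - 1), uv[1] / (H - 1)], dim=-1) * 2 - 1
        f_i = F.grid_sample(feats[i:i + 1], grid.view(1, 1, -1, 2),
                            align_corners=True)             # (1, C, 1, P)
        per_view.append(f_i[0, :, 0].T)                     # (P, C) pixel-aligned feature f_i
    return torch.stack(per_view).mean(dim=0)                # pooled f_img, shape (P, C)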
5. The method of claim 4, wherein the 3D feature extraction module constructs a cost volume from the multi-scale two-dimensional features, comprising:
using the 2D features F_{i,2} to construct a cost volume based on the camera view frustum of the target view through a differentiable homography warp; wherein the N input images are projected onto a plurality of sampling planes {L_j | j = 1, ..., D}, i.e. F_{i,2} is mapped onto the D sampling planes to construct the cost volume; wherein,
given the camera intrinsic matrix K_i, rotation matrix R_i and translation matrix T_i of the input view I_i, and the camera intrinsic matrix K_t, rotation matrix R_t and translation matrix T_t of the target view I_t, the homography warp is defined as:
H_i(z) = K_i · R_i · (I - (T_t - T_i) · a^T / z) · R_t^T · K_t^{-1}
wherein a^T denotes the transpose of the target-view camera principal axis a, I is the identity matrix, and H_i(z) projects the pixel (u, v) of the target view onto the input view I_i at depth z along the depth axis; the feature obtained by the homography warp is defined as:
F_{i,2}^{warp}(u, v, z) = F_{i,2}(H_i(z) · [u, v, 1]^T)
the cost volume is obtained by calculating, for each voxel, the variance of the multi-view features {F_{i,2}^{warp}} mapped by the homography warp, where [u, v, 1]^T denotes the transpose of [u, v, 1].
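As a concrete illustration of this plane-sweep construction, the following is a minimal PyTorch sketch of the homography H_i(z) and of the variance-based cost volume; the matrix conventions (world-to-camera rotation and translation, principal axis a of the target camera) and the function names homography and variance_cost_volume are assumptions for illustration, not code from the patent.

import torch

def homography(K_i, R_i, T_i, K_t, R_t, T_t, a, z):
    # Plane-sweep homography H_i(z) mapping a target-view pixel to source view i at depth z.
    I3 = torch.eye(3)
    return K_i @ R_i @ (I3 - (T_t - T_i).view(3, 1) @ a.view(1, 3) / z) @ R_t.T @ torch.inverse(K_t)

def variance_cost_volume(warped):
    # warped: (N, C, D, H, W) source features already warped onto the D target depth planes.
    # Returns the per-voxel variance over the N views, shape (C, D, H, W).
    return warped.var(dim=0, unbiased=False)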
6. The method of claim 4, wherein the 3D feature extraction module, after extracting the three-dimensional feature body of the cost volume, further comprises: performing tri-linear interpolation on the three-dimensional feature body to obtain the voxel-aligned feature f_voxel carrying spatial geometric information.
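A minimal PyTorch sketch of the tri-linear sampling step is given below, assuming the query points have already been normalized to the volume's [-1, 1] coordinate range; the function name voxel_aligned_features is an illustrative assumption.

import torch
import torch.nn.functional as F

def voxel_aligned_features(volume, points_norm):
    # Trilinearly sample a 3D feature volume at 3D points.
    # volume: (1, C, D, H, W); points_norm: (P, 3) with x, y, z already in [-1, 1].
    # Returns f_voxel of shape (P, C); mode="bilinear" on a 5D input performs trilinear interpolation.
    grid = points_norm.view(1, 1, 1, -1, 3)                 # (1, 1, 1, P, 3)
    sampled = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)
    return sampled[0, :, 0, 0].T                            # (P, C)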
7. The method of claim 6, wherein the neural radiation field module calculates the low-resolution target image features from the multi-scale two-dimensional features and the three-dimensional feature body via a preset MLP network, comprising:
inputting the 2D features and the 3D features of the three-dimensional points into the preset MLP network to obtain the point features and the volume density of the three-dimensional points, defined as:
f_p, σ = φ(f_img, f_voxel)
wherein φ is the MLP network, f_p is the point feature of the three-dimensional point, and σ is the volume density of the three-dimensional point; from the point feature f_p of the three-dimensional point, the image feature f_img of the three-dimensional point, and the viewing direction of the three-dimensional point in the target view, the blending weight w_i of the input-view image colors under each input view is predicted; the observed color feature when viewing the three-dimensional point from a given direction of the target view is then predicted from the blending weights w_i, defined as:
f_c = Σ_{i=1}^{N} w_i · f_i
wherein f_i is the two-dimensional image feature of input view i; the color features and the volume density σ are then aggregated to obtain the final accumulated color feature of each ray.
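To make the MLP stage more tangible, the following is a minimal PyTorch sketch of a point-wise network producing σ, f_p, and per-view blending weights w_i; the layer widths, the softmax over views, and the class name PointMLP are illustrative assumptions rather than the patent's actual network.

import torch
import torch.nn as nn

class PointMLP(nn.Module):
    # Predicts per-point density sigma, a point feature f_p, and per-view blend weights w_i.
    def __init__(self, c_img=32, c_voxel=16, c_hidden=64, n_views=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(c_img + c_voxel, c_hidden), nn.ReLU(),
                                  nn.Linear(c_hidden, c_hidden), nn.ReLU())
        self.sigma = nn.Linear(c_hidden, 1)
        self.point_feat = nn.Linear(c_hidden, c_hidden)
        self.blend = nn.Linear(c_hidden + 3, n_views)        # +3 for the viewing direction

    def forward(self, f_img, f_voxel, view_dir, f_views):
        h = self.body(torch.cat([f_img, f_voxel], dim=-1))
        sigma = torch.relu(self.sigma(h))                    # (P, 1) volume density
        f_p = self.point_feat(h)                             # (P, C) point feature
        w = torch.softmax(self.blend(torch.cat([f_p, view_dir], dim=-1)), dim=-1)  # (P, N)
        # Blend the N per-view 2D image features f_i into one observed color feature f_c.
        f_c = torch.einsum("pn,pnc->pc", w, f_views)
        return f_p, sigma, f_c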
8. The method for synthesizing the new view based on the real-time rendering generalizable neural radiation field according to claim 7, wherein generating the high-resolution target image features of the input image based on up-sampling the low-resolution target image features comprises: generating the high-resolution target image features of the input image by applying a sub-pixel-convolution image feature up-sampling operation to the low-resolution target image features.
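As an illustration of sub-pixel-convolution up-sampling, the following is a minimal PyTorch sketch using nn.PixelShuffle; the channel counts and the scale factor of 4 are assumptions, since the patent does not fix them here.

import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    # Sub-pixel convolution: a conv expands channels by scale*scale, PixelShuffle rearranges them into space.
    def __init__(self, channels=64, out_channels=3, scale=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, out_channels * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lowres_feat):                  # (B, channels, h, w)
        return self.shuffle(self.conv(lowres_feat))  # (B, out_channels, h*scale, w*scale)

x = torch.randn(1, 64, 128, 160)
print(SubPixelUpsampler()(x).shape)                  # torch.Size([1, 3, 512, 640])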
9. The method of claim 8, wherein the sampling guidance module further comprises:
the first-level screening unit is used for, for a pixel (u, v) in the target image, obtaining the probability and standard deviation of the pixel on each depth plane by linear interpolation of the depth probability body, so as to obtain the depth range in which the surface exists, skip the empty regions in space according to this depth range and narrow the sampling range;
and the second-level screening unit is used for guiding fine sampling within the narrowed sampling range by using the cumulative distribution function of the depth probability body, so as to screen out the sampling points to be retained.
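The two-stage screening can be illustrated with a short PyTorch sketch of inverse-CDF sampling along a single ray; the probability threshold prob_floor, the sample count, and the function name guided_samples are assumptions for illustration.

import torch

def guided_samples(depth_vals, depth_prob, n_samples=16, prob_floor=1e-3):
    # Two-stage sampling along one ray.
    # depth_vals: (D,) depth of each plane; depth_prob: (D,) per-plane surface probability.
    # Stage 1 keeps only the depth range where probability exceeds prob_floor (skips empty space);
    # stage 2 draws samples by inverting the cumulative distribution within that range.
    keep = depth_prob > prob_floor
    if keep.any():
        lo, hi = keep.nonzero().min(), keep.nonzero().max()
        depth_vals, depth_prob = depth_vals[lo:hi + 1], depth_prob[lo:hi + 1]
    pdf = depth_prob / depth_prob.sum().clamp(min=1e-8)
    cdf = torch.cumsum(pdf, dim=0)
    u = torch.rand(n_samples)
    idx = torch.searchsorted(cdf, u).clamp(max=len(depth_vals) - 1)
    return depth_vals[idx]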
10. A new view synthesis system based on real-time rendering generalizable neural radiation field, performing new view synthesis of free viewpoint video based on the new view synthesis method based on real-time rendering generalizable neural radiation field as claimed in any one of claims 1 to 9, comprising:
the network model construction unit is used for constructing a generalizable neural radiation field algorithm network model based on the spatial feature similarity of the image blocks; the generalizable neural radiation field algorithm network model introduces global information of an input image through the similarity of spatial features of image blocks;
the network model training unit is used for training the generalizable neural radiation field algorithm network model with the data acquired by the data acquisition device; wherein the data acquired by the data acquisition device comprise image data and motion data of a real-world scene;
and the new view synthesis unit is used for performing new view synthesis of free-viewpoint video using the trained generalizable neural radiation field algorithm network model.
11. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the steps in the new view synthesis method based on rendering generalizable neural radiation fields in real time as claimed in any one of claims 1 to 9.
CN202311694564.5A 2023-12-11 2023-12-11 New view synthesis method and system based on real-time rendering generalizable nerve radiation field Pending CN117635801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311694564.5A CN117635801A (en) 2023-12-11 2023-12-11 New view synthesis method and system based on real-time rendering generalizable nerve radiation field

Publications (1)

Publication Number Publication Date
CN117635801A true CN117635801A (en) 2024-03-01

Family

ID=90023296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311694564.5A Pending CN117635801A (en) 2023-12-11 2023-12-11 New view synthesis method and system based on real-time rendering generalizable nerve radiation field

Country Status (1)

Country Link
CN (1) CN117635801A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination