CN116883565A - Digital twin scene implicit and explicit model fusion rendering method and application - Google Patents

Digital twin scene implicit and explicit model fusion rendering method and application

Info

Publication number
CN116883565A
CN116883565A (application CN202310690300.6A)
Authority
CN
China
Prior art keywords
model
rendering
scene
implicit
sampling
Prior art date
Legal status
Pending
Application number
CN202310690300.6A
Other languages
Chinese (zh)
Inventor
谢远龙
罗庆良
王书亭
张鑫
徐磊
梁兆顺
肖瑞康
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202310690300.6A
Publication of CN116883565A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/61Scene description

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Generation (AREA)

Abstract

The invention belongs to the technical field of digital twin three-dimensional scene reconstruction, and discloses a method for fusion rendering of implicit and explicit models of a digital twin scene and its application, comprising the following steps: S1, constructing a JAXNeRF+ neural network model; S2, calculating the color and density corresponding to each sampling point with the neural network model, based on the obtained sampling points and the corresponding 2D viewing directions; S3, constructing a pre-rendered sparse voxel grid; S4, performing pixel-ray sampling and volume rendering with the pre-rendered voxel grid, and removing artifacts generated during scene construction with a score distillation sampling loss; S5, constructing a model pose visualization and alignment tool, and inserting the depth information between the real-time pose of the explicit model and the camera plane into the pixel-ray volume rendering path of the implicit scene model, so that the rendered image can distinguish the foreground and background of the implicit scene model. The invention improves rendering quality and editing flexibility.

Description

Digital twin scene implicit and explicit model fusion rendering method and application
Technical Field
The invention belongs to the technical field related to digital twin three-dimensional scene reconstruction, and particularly relates to a digital twin scene explicit and implicit model fusion rendering method and application.
Background
The digital twin five-dimensional model includes physical entities, virtual models, services, twin data, and the connection interactions between them. The twin model reflecting the physical object is one of the most critical dimensions in the digital twin five-dimensional model and the first step in constructing a digital twin system; its quality directly affects how accurately the digital twin system reflects the physical entity, and thus affects judgments and decisions about that entity. As the demands of various fields on intelligent digital twin systems increase, the digital twin system is required in some situations to construct virtual scenes quickly, with high requirements on the timeliness and accuracy of the three-dimensional virtual scene. Accurately and efficiently constructing a virtual digital model of a physical scene has therefore become one of the key factors restricting the development of digital twin systems.
For a digital twin scene, the hierarchy, rules and complexity of model construction affect the operational reliability of the system. Existing methods for constructing digital twin virtual models are basically based on traditional three-dimensional modeling techniques, which require a 3D modeling engineer to build the digital twin virtual model across the geometry-physics-behavior-rule dimensions according to the geometric dimensions of the actual physical scene. This approach has the advantages of a wide application range and an intuitive workflow, and is currently the mainstream method for constructing digital twin virtual models.
However, constructing a digital twin virtual model by manual scene modeling is very tedious and extremely inefficient; when the twin scene is too large, or contains complex content that is difficult to model by hand, manual scene modeling seriously delays the construction of the digital twin model and may even cause modeling failure. Meanwhile, traditional scene construction methods mainly build simple scene elements and struggle to restore the real scene visually, so they cannot meet the digital twin system's requirement for high-fidelity scenes.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a digital twin scene implicit and explicit model fusion rendering method and application. The method uses a neural radiance field (NeRF: Neural Radiance Fields) to construct a high-fidelity implicit model of the scene, and models the three-dimensional scene with a deep neural network: the network receives 2D images of the scene observed from different view angles as input, learns the three-dimensional geometric structure and illumination attributes of the scene through training, and the trained model can then infer high-quality three-dimensional scene renderings. To keep the scene editable, the explicit model and the NeRF implicit model are jointly volume-rendered, so that the reconstructed digital twin scene has both the high-fidelity, high-quality rendering effect of the NeRF implicit model and the flexible editability of the explicit model.
To achieve the above object, according to one aspect of the present invention, there is provided a method for fusion rendering of implicit and explicit models of a digital twin scene, the method comprising the following steps:
S1, preprocessing the acquired images of the scene to be reconstructed in an incremental processing mode, generating sampling rays based on the preprocessed images, sampling along the rays to obtain sampling points, and then constructing a JAXNeRF+ neural network model;
S2, calculating the color and density corresponding to each sampling point with the neural network model, based on the obtained sampling points and the corresponding 2D viewing directions, and generating any required viewing-angle image by volume rendering;
S3, constructing a pre-rendered sparse voxel grid to convert the neural network implicit model into a semi-explicit voxel feature model, wherein the neural network model is the implicit model and the pre-rendered sparse voxel grid is the semi-explicit voxel feature model;
S4, performing pixel-ray sampling and volume rendering with the pre-rendered voxel grid, and removing artifacts generated during scene construction with a score distillation sampling loss;
S5, constructing a model pose visualization and alignment tool to provide the pose transformation operations and corresponding data output functions required to align the explicit model with the implicit model; then inserting the depth information between the real-time pose of the explicit model and the camera plane into the pixel-ray volume rendering path of the implicit scene model, so that the rendered image can distinguish the foreground and background of the implicit scene model and the boundary between the implicit and explicit representations is blurred, yielding a fused implicit-explicit rendered view.
Further, the incremental processing mode comprises a data segmentation and fusion process that includes the following substeps: first, the whole image database is divided in sequence into small local data blocks; a structure-from-motion reconstruction is then performed on each data block using nonlinear optimization; finally, bundle adjustment (BA) optimization is performed on the whole image database using graph optimization, so that the motion-structure information of each data block is fused into a global motion-structure model.
Further, in step S1, the sampled rays are extended forward from the camera position, discrete points are sampled along each sampled ray using a form sampler and an SDF sampler, respectively, and then mapped to the corresponding positions of each point in the 3D scene.
Further, the neural network model introduces an adaptive activation function that aggregates feature information of multiple sampling points into a global feature.
Further, neural network models introduce multi-domain approximation and hierarchical sampling strategies.
Further, the radiance field is modeled using the neural network model; in the rendering process, the ray is divided equally into N cells, one point is randomly sampled in each cell, and the sampled colors are weighted and summed:
Ĉ(r) = Σ_{i=1}^{N} T_i · (1 − exp(−σ_i·δ_i)) · c_i,  with  T_i = exp(−Σ_{j=1}^{i−1} σ_j·δ_j)  and  t_i ~ U[t_s + (i−1)(t_e − t_s)/N, t_s + i(t_e − t_s)/N]
wherein δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points; i is the sampling point index; t_i is the position of the i-th sampling point; U is the uniform distribution; t_s is the initial sampling position; t_e is the terminal sampling position; T_i is the accumulated transmittance up to the i-th sampling point; σ_i is the voxel density at sampling point i; and c_i is the color information at sampling point i.
Further, a density threshold ξ is set; based on the information obtained by sampling each voxel, the sampled density σ(t) of the voxel is compared with ξ, and the voxel is discarded if σ(t) is smaller than the threshold; octree division is then performed on the remaining voxels to obtain the fully constructed pre-rendered sparse voxel grid.
Further, the explicit model information is aligned into the pre-rendered sparse grid coordinate system of PRNeRF through a transformation matrix T_e2i; the volume rendering ray of each pixel is then determined from the input rendering view information: r = o + t·d; then, according to the rendering view information, the explicit model information on the volume rendering ray path is converted into neural radiance field parameters; next, based on the depth information of the explicit model, the neural radiance field parameters of the explicit model are inserted into the volume rendering path.
The invention also provides a system for fusion rendering of the implicit and explicit models of a digital twin scene, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, carries out the above digital twin scene implicit and explicit model fusion rendering method.
The present invention also provides a computer-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the digital twin scene implicit and explicit model fusion rendering method described above.
In general, compared with the prior art, the digital twin scene implicit and explicit model fusion rendering method and application provided by the invention have the following advantages:
1. The invention uses the reconstructed PRNeRF (Pre-Rendered Neural Radiance Fields) as the scene environment modeling method, which greatly improves the visual simulation effect of the twin scene. Meanwhile, the training and rendering pipelines of the model are separated: by constructing the pre-rendered sparse grid, the speed of the rendering process, which has very high real-time requirements, is remarkably improved at the cost of increased computational complexity in the training process, whose real-time requirements are low.
2. The invention divides the construction of the digital twin scene into an explicit scene and an implicit scene: for low-dynamic content such as the background, a high-fidelity implicit model is constructed through NeRF, while for highly dynamic scene objects a highly editable explicit model is constructed. This divide-and-conquer approach to scene model construction gives the scene both a high-quality rendering effect and flexible editability.
3. The implicit-explicit hybrid rendering method breaks the data-representation barrier between implicit and explicit three-dimensional models: no conversion between the two is needed, and the explicit model participates directly in the NeRF volume rendering process, so the rendering can distinguish the foreground-background relationship between the two representations and achieve a seamless fusion rendering effect.
Drawings
FIG. 1 is a schematic flow diagram of a method for fusion rendering of explicit and implicit models of a digital twin scene;
FIG. 2 is a schematic diagram of PRNeRF implicit scene rendering in accordance with the present invention;
FIG. 3 is a schematic flow diagram of implicit and explicit scene fusion rendering in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is intended to illustrate the invention, and is not intended to limit the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Digital twinning is a technique, based on virtual reality technology, that combines the actual scene with a digital model. Through real-time monitoring, simulation and prediction of the real scene, it provides enterprises and organizations with a brand-new and more realistic interactive experience, and its emergence offers a new idea and mode for the digital transformation of real scenes. The main function of a digital twin scene is the multi-dimensional, multi-spatiotemporal-scale, multi-domain description and characterization of a physical scene or of all elements of a complex system. The twin model reflecting the physical object is one of the most critical dimensions in the digital twin five-dimensional model and the first step in constructing a digital twin system; its quality directly affects how accurately the digital twin system reflects the physical entity, and thus affects judgments and decisions about that entity.
Referring to fig. 1 and fig. 2, this embodiment takes the digital twin scene reconstruction of a tower crane construction site as an example to describe in detail the implicit and explicit model fusion rendering method of a digital twin scene provided by the invention. The method comprises a training stage with low real-time requirements, a pre-rendering stage for accelerating rendering, and a rendering stage with high real-time requirements in which the explicit and implicit models are fused for rendering.
The method mainly comprises the following steps:
s1, training stage.
The training stage builds a JAXNeRF+ neural network for training and storing physical scene information as detailed as possible, and is an acquisition source of the scene information used in the pre-rendering stage and the rendering stage.
The training phase mainly comprises the following steps:
s11, acquiring an image of a scene to be reconstructed.
In the process of image acquisition, attention needs to be paid to uniform ambient light, at least one third of field of view overlapping area exists between images, and all fields of view of a scene to be reconstructed are acquired as much as possible.
S12, preprocessing the acquired image by adopting an incremental processing mode.
The image preprocessing process adopts an incremental processing mode. In a large scene the amount of acquired image data is extremely large, and the traditional global optimization approach cannot be used for preprocessing because its memory overhead is too high. The incremental processing mode comprises a data segmentation and fusion process and requires the capture order of the image data to be known.
The data segmentation and fusion process includes the following substeps: first, the whole image database is divided in sequence into small local data blocks; a structure-from-motion reconstruction is then performed on each data block using nonlinear optimization; finally, bundle adjustment (BA) optimization is performed on the whole image database using graph optimization, so that the motion-structure information of each data block is fused into a global motion-structure model.
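By way of illustration only, the following sketch shows how an ordered image database might be divided into small, overlapping local data blocks before per-block structure-from-motion and global BA; the function name, block_size and overlap are illustrative choices, not values taken from the patent.

```python
from typing import List

def split_into_blocks(image_paths: List[str], block_size: int = 50, overlap: int = 10) -> List[List[str]]:
    """Divide the ordered image database into small local data blocks.

    Consecutive blocks share `overlap` images so that per-block
    structure-from-motion results can later be fused into a global
    motion-structure model by graph-optimization-based bundle adjustment.
    """
    blocks = []
    step = block_size - overlap
    for start in range(0, len(image_paths), step):
        blocks.append(image_paths[start:start + block_size])
        if start + block_size >= len(image_paths):
            break
    return blocks

# Example: 120 ordered frames -> blocks covering [0:50], [40:90], [80:120]
blocks = split_into_blocks([f"img_{i:04d}.jpg" for i in range(120)])
```

Each block would then be reconstructed independently by a nonlinear-optimization SfM pipeline, and the per-block results fused by global BA as described above.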
S13, generating sampling rays based on the preprocessed image, and sampling the sampling rays to obtain sampling points.
The sampled rays are extended forward from the camera position, discrete points are sampled along each sampled ray using a form sampler and an SDF sampler, respectively, and then mapped to the positions in the 3D scene to which each point corresponds.
S14, constructing a JAXNeRF+ neural network model.
The neural network model is similar to the standard NeRF but introduces an adaptive activation function. The adaptive activation function aggregates the feature information of multiple sampling points into one global feature to reduce noise and improve the robustness of the model. In addition, the model introduces a multi-domain approximation (Multi-field Approximation) technique and a hierarchical sampling strategy (Hierarchical Sampling) to improve training efficiency and scene reconstruction accuracy.
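As an illustration of the hierarchical sampling idea only (not the patent's exact implementation), the sketch below draws additional fine samples along a single ray in proportion to coarse-pass weights; sample_pdf is a hypothetical helper name.

```python
import numpy as np

def sample_pdf(bin_edges: np.ndarray, weights: np.ndarray, n_samples: int,
               rng: np.random.Generator) -> np.ndarray:
    """Inverse-transform sampling: place fine samples where coarse weights are large.

    bin_edges: (N+1,) cell boundaries along one ray; weights: (N,) coarse weights.
    """
    w = weights + 1e-5                        # avoid zero-probability cells
    pdf = w / w.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)           # uniform draws in [0, 1)
    idx = np.clip(np.searchsorted(cdf, u, side="right"), 1, len(weights))
    lo, hi = bin_edges[idx - 1], bin_edges[idx]
    frac = (u - cdf[idx - 1]) / (cdf[idx] - cdf[idx - 1])
    return lo + frac * (hi - lo)              # fine sample positions t

# Usage sketch:
# t_fine = sample_pdf(np.linspace(2.0, 6.0, 65), coarse_weights, 128, np.random.default_rng(0))
```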
S15, calculating the color and density corresponding to each sampling point with the neural network model, based on the obtained sampling points and the corresponding 2D viewing directions.
The obtained sampling points and the corresponding 2D viewing directions are taken as input, and a set of colors and densities is output. If a ray is emitted into the static space from some viewing angle, the density σ of every point (x, y, z) along the ray can be queried, together with the color c (c = (R, G, B)) that this position exhibits under the viewing angle (θ, φ), i.e. F(x, y, z, θ, φ) → (R, G, B, σ). Using the density as a weight, the pixel color is obtained by weighted accumulation of the (R, G, B) values of the sampling points along the ray; the pixel color formula is:
C(r) = ∫_{t_s}^{t_e} T(t)·σ(r(t))·c(r(t), d) dt,  with  T(t) = exp(−∫_{t_s}^{t} σ(r(s)) ds)
wherein T(t) is the accumulated transmittance and C(r) is the color information obtained by volume rendering; σ(r(t)) is the voxel density; c(r(t), d) is the color information; r(t) is the volume rendering ray; and d is the direction of the ray.
S16, generating any required viewing-angle image by volume rendering with the obtained colors and densities.
Volume rendering forms a ray by connecting the camera focus and the pixel, and the colors of all sampling points on the ray are weighted and cumulatively summed to obtain the color value of that pixel.
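A minimal sketch of forming the ray through a pixel, assuming a pinhole camera with intrinsics (fx, fy, cx, cy) and a camera-to-world rotation matrix; these names and conventions are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def pixel_ray(cam_pos: np.ndarray, cam_rot: np.ndarray,
              fx: float, fy: float, cx: float, cy: float,
              px: float, py: float):
    """Ray through pixel (px, py): origin o at the camera focus, unit direction d,
    so points along the ray are r(t) = o + t * d."""
    d_cam = np.array([(px - cx) / fx, (py - cy) / fy, 1.0])   # direction in camera frame
    d_world = cam_rot @ d_cam                                  # rotate into world frame
    return cam_pos, d_world / np.linalg.norm(d_world)
```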
The radiance field is modeled with the neural network model; since the neural network model is differentiable, the volume rendering procedure is also differentiable, so the mean squared error (MSE) between the volume-rendered image and the original image is used to construct the loss function:
L = Σ_{r∈R} ‖Ĉ(r) − C_gt(r)‖²
wherein R is the set of sampled pixel rays, Ĉ(r) is the volume-rendered pixel color and C_gt(r) is the corresponding ground-truth pixel color. The whole process can therefore be optimized end to end by gradient backpropagation.
In the actual rendering process, the ray is divided equally into N cells, one point is randomly sampled in each cell, and the sampled colors are weighted and summed:
Ĉ(r) = Σ_{i=1}^{N} T_i · (1 − exp(−σ_i·δ_i)) · c_i,  with  T_i = exp(−Σ_{j=1}^{i−1} σ_j·δ_j)  and  t_i ~ U[t_s + (i−1)(t_e − t_s)/N, t_s + i(t_e − t_s)/N]
wherein δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points; i is the sampling point index; t_i is the position of the i-th sampling point; U is the uniform distribution; t_s is the initial sampling position; t_e is the terminal sampling position; T_i is the accumulated transmittance up to the i-th sampling point; σ_i is the voxel density at sampling point i; and c_i is the color information at sampling point i.
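A minimal numerical sketch of this quadrature for a single ray (NumPy; the array shapes and the 1e10 padding of the last interval are implementation conventions assumed here, not taken from the patent):

```python
import numpy as np

def stratified_samples(t_s: float, t_e: float, n: int, rng: np.random.Generator) -> np.ndarray:
    """Divide [t_s, t_e] into n equal cells and draw one uniform sample per cell."""
    edges = np.linspace(t_s, t_e, n + 1)
    return edges[:-1] + rng.uniform(size=n) * (edges[1:] - edges[:-1])

def volume_render(sigma: np.ndarray, color: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Discrete volume rendering: weighted sum of sampled colors along one ray.

    sigma: (N,) densities, color: (N, 3) colors, t: (N,) sample positions.
    """
    delta = np.append(np.diff(t), 1e10)                        # delta_i = t_{i+1} - t_i (last padded)
    alpha = 1.0 - np.exp(-sigma * delta)                       # per-sample opacity
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])  # accumulated transmittance T_i
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0)              # pixel color C(r)

# Usage sketch: query the network at the stratified samples, then integrate.
# t = stratified_samples(2.0, 6.0, 64, np.random.default_rng(0))
# rgb = volume_render(sigma_at_t, color_at_t, t)
```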
S2, pre-rendering.
Referring to fig. 2, the PRNeRF rendering process first constructs a pre-rendered sparse voxel grid from the trained neural network model, and then samples the pre-rendered sparse grid to obtain the volume rendering sampling points together with each point's base color, density and direction feature vector. On one hand, the base-color image of the scene is obtained by accumulated volume rendering of the base color and density information; on the other hand, the density and direction feature vectors are accumulated along the ray, a decoder produces the corrected color information, and the two parts are finally combined into the final rendering result.
The pre-rendering stage mainly comprises the following steps:
S21, constructing a pre-rendered sparse voxel grid to convert the neural network implicit model into a semi-explicit voxel feature model, wherein the neural network model is the implicit model and the pre-rendered sparse voxel grid is the semi-explicit voxel feature model. This comprises the following substeps:
S211, constructing an initial voxel grid covering the whole effective scene. Since the construction scene is unbounded, two layers of voxel grids are constructed: the first layer stores the scene information at normal scale, and the second layer stores the compressed background scene information:
wherein x_j is the position of the sampling point.
S212, performing NeRF rendering for each voxel by sampling from the trained neural network implicit model:
σ(t), c_o(t), v(t) = MLP_Θ(r(t))
wherein σ(t) is the sampled density, c_o(t) is the base color of the sampling point, and v(t) is a direction-dependent feature vector; MLP_Θ(r(t)) denotes sampling from the implicit neural network (a multi-layer perceptron with parameters Θ) along the ray r(t).
S213, setting a density threshold ξ; based on the information obtained by sampling each voxel, the sampled density σ(t) is compared with ξ, and the voxel is discarded if σ(t) is smaller than the threshold.
S214, performing octree division on the remaining voxels, dividing each voxel into smaller regions and increasing the voxel resolution.
S215, repeating steps S212-S214 until the resolution reaches the set upper limit Rn or the subdivision level reaches the upper limit Fn, yielding the fully constructed pre-rendered sparse voxel grid.
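A minimal sketch of this prune-and-subdivide loop (Python; query_density stands in for sampling the trained implicit model, and the level bookkeeping is simplified relative to the patent's Rn/Fn limits):

```python
import numpy as np

def build_sparse_voxel_grid(query_density, centers: np.ndarray, size: float,
                            xi: float = 0.01, max_levels: int = 4):
    """Iteratively discard voxels whose sampled density is below xi and
    octree-subdivide the survivors into 8 children of half the size."""
    for level in range(max_levels):
        keep = query_density(centers) >= xi           # drop voxels with sigma(t) < xi
        centers = centers[keep]
        if level == max_levels - 1:                   # stop at the subdivision limit
            break
        offsets = np.array([[dx, dy, dz]
                            for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]) * (size / 4)
        centers = (centers[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
        size /= 2                                     # child voxels are half the size
    return centers, size
```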
S22, performing pixel-ray sampling and volume rendering using the pre-rendered voxel grid. This comprises the following substeps:
S221, sampling with the sampler to obtain the density, base color and direction feature vector of each sampling point.
S222, performing volume rendering by accumulating the sampling-point densities and base colors along the ray to obtain the model base-color image:
Ĉ_o(r) = Σ_k T(t_k)·(1 − exp(−σ̂(t_k)·δ_k))·ĉ_o(t_k)
wherein t_k is a sampling point; T(t_k) is the accumulated transmittance; σ̂(t_k) is the density at sampling point t_k; ĉ_o(t_k) is the color information at sampling point t_k; and δ_k is the spacing between adjacent sampling points.
S223, performing volume rendering by accumulating the sampling-point densities and direction feature vectors along the ray to obtain the accumulated direction feature of the pixel:
V(r) = Σ_k T(t_k)·(1 − exp(−σ̂(t_k)·δ_k))·v̂(t_k)
wherein v̂(t_k) is the direction feature vector at sampling point t_k.
S224, decoding the accumulated direction feature of the pixel to obtain the pixel correction information:
c_r(r) = MLP_Φ(V(r), d)
wherein d is the encoded direction vector and MLP_Φ is the decoding neural network.
S225, adding the pixel correction information to the model base-color image to obtain the model rendering image under this view angle:
Ĉ(r) = Ĉ_o(r) + c_r(r)
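A minimal sketch of this deferred rendering path for one pixel ray (NumPy; decode_fn stands in for the decoding network MLP_Φ, and the symbol names are chosen for illustration):

```python
import numpy as np

def deferred_pixel_render(sigma, base_color, feat, t, view_dir, decode_fn):
    """Accumulate base color and a direction feature along the ray, then add the
    decoded view-dependent correction to obtain the rendered pixel."""
    delta = np.append(np.diff(t), 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    w = (T * alpha)[:, None]
    pixel_base = (w * base_color).sum(axis=0)      # S222: model base-color term
    pixel_feat = (w * feat).sum(axis=0)            # S223: accumulated direction feature
    correction = decode_fn(pixel_feat, view_dir)   # S224: pixel correction information
    return pixel_base + correction                 # S225: rendered pixel at this view
```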
s23, removing artifacts generated during scene construction by using fractional distillation sampling loss. The method aims at the problem that artifacts are easy to generate in implicit NeRF scene construction of auxiliary scenes of a construction scene, and the artifacts in the NeRF optimization process are prevented by using a local 3D prior and a density-based fractional distillation sampling loss through a 3D diffusion method. The removing of the artifact comprises the following steps: firstly, a diffusion model is trained to learn the distribution of three-dimensional surface blocks, and then when a NeRF reconstructs a real three-dimensional scene, the density of local three-dimensional blocks in the scene corresponding to the pre-rendered sparse voxel grid is inquired, and the sampling density is normalized by using the density fraction distillation loss. Using the above local priors can improve reconstruction in sparse supervisory signal areas and eliminate artifacts.
S3, rendering.
Referring to fig. 3, the implicit-explicit fusion rendering method combines the implicit model's ability to efficiently represent three-dimensional scenes such as complex geometric shapes and large-scale scenes with the flexible editability of the explicit model, so that the constructed digital twin tower crane construction scene has both an extremely high visual simulation effect and flexible editing capability.
The implicit model PRNeRF provides a way to convert the neural radiance field into an explicit coarse three-dimensional mesh, which is used to align the position, pose and size relationship between the explicit model and the implicit model.
The rendering stage mainly comprises the following steps:
s31, a model pose visualization and alignment tool is constructed to provide pose transformation operation and corresponding data output functions required for aligning the display model and the implicit model.
According to the actual scene requirements and the physical-space pose data, the explicit and implicit models are aligned and the pose transformation matrix T_e2i between them is obtained, building the scene mapping bridge for fusion rendering.
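A minimal sketch of applying T_e2i to explicit-model geometry and computing the depth to the camera plane (the function name and the 4x4 homogeneous-coordinate convention are assumptions for illustration):

```python
import numpy as np

def align_and_depth(vertices: np.ndarray, T_e2i: np.ndarray, T_world_to_cam: np.ndarray):
    """Map explicit-model vertices (N, 3) into the implicit (PRNeRF grid) frame with
    the 4x4 transform T_e2i, then return each vertex's depth to the camera plane."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (N, 4)
    aligned = (T_e2i @ homo.T).T                    # explicit -> implicit coordinates
    in_cam = (T_world_to_cam @ aligned.T).T         # implicit -> camera coordinates
    depth = in_cam[:, 2]                            # distance along the camera axis
    return aligned[:, :3], depth
```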
S32, inserting the depth information between the real-time pose of the explicit model and the camera plane into the pixel-ray volume rendering path of the implicit scene model, so that the rendered image can distinguish the foreground and background of the implicit scene model and the boundary between the implicit and explicit representations is blurred. This comprises the following substeps:
S321, aligning the explicit model information into the pre-rendered sparse grid coordinate system of PRNeRF through the transformation matrix T_e2i.
S322, determining the volume rendering ray of each pixel according to the input rendering view information: r = o + t·d.
S323, converting the explicit model information on the volume rendering ray path into neural radiance field parameters according to the rendering view information. Specifically, a converter is set up that takes the explicit model's material and color information and the direction features on the path as inputs and outputs density and color information:
wherein the converted density is mainly related to the material transparency, and the remaining quantities are the explicit model's color information, density information and direction vector.
S324, inserting the neural radiance field parameters of the explicit model into the volume rendering path according to the depth information of the explicit model:
The fused implicit-explicit rendered view is then obtained by following the subsequent PRNeRF rendering flow;
wherein T(t_i) is the accumulated transmittance; the voxel density information and the color information of the sampling points are as defined above; T_ed is the accumulated transmittance of the sampled ray up to the explicit model; the voxel density information at the explicit model enters the path at that depth; T′(t_i) is the accumulated transmittance after the explicit model is inserted; and the direction feature vector at each sampling point is as defined above.
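A minimal sketch of inserting one explicit-model sample at its camera depth into the implicit pixel ray and volume rendering the merged sample list (NumPy; the single inserted sample and the simple density/color stand-ins are assumptions for illustration):

```python
import numpy as np

def fused_pixel_render(sigma, color, t, t_explicit, sigma_e, color_e):
    """Insert the explicit-model sample at depth t_explicit into the implicit
    model's ray samples, so the explicit surface occludes (or is occluded by)
    the implicit scene during volume rendering."""
    idx = np.searchsorted(t, t_explicit)
    t_all = np.insert(t, idx, t_explicit)
    sigma_all = np.insert(sigma, idx, sigma_e)
    color_all = np.insert(color, idx, color_e, axis=0)
    delta = np.append(np.diff(t_all), 1e10)
    alpha = 1.0 - np.exp(-sigma_all * delta)
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha
    return (weights[:, None] * color_all).sum(axis=0)
```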
The invention combines neural radiance fields, volume rendering and neural network techniques, uses the PRNeRF framework to reconstruct the digital twin scene, and restructures the traditional NeRF training and rendering pipeline: a pre-rendered sparse voxel grid built from the training stage remarkably accelerates the rendering process so that it runs in real time, and an implicit-explicit model fusion rendering method is designed for the rendering stage, so that the rendered scene combines the implicit model's ability to efficiently represent three-dimensional scenes such as complex geometric shapes and large-scale scenes with the flexible editability of the explicit model, giving the constructed digital twin scene an extremely high visual simulation effect and flexible editing capability.
The invention also provides a system for fusion rendering of the implicit and explicit models of a digital twin scene, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, carries out the above digital twin scene implicit and explicit model fusion rendering method.
The present invention also provides a computer-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the digital twin scene implicit and explicit model fusion rendering method described above.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A digital twin scene implicit and explicit model fusion rendering method is characterized by comprising the following steps:
S1, preprocessing the acquired images of the scene to be reconstructed in an incremental processing mode, generating sampling rays based on the preprocessed images, sampling along the rays to obtain sampling points, and then constructing a JAXNeRF+ neural network model;
S2, calculating the color and density corresponding to each sampling point with the neural network model, based on the obtained sampling points and the corresponding 2D viewing directions, and generating any required viewing-angle image by volume rendering;
S3, constructing a pre-rendered sparse voxel grid to convert the neural network implicit model into a semi-explicit voxel feature model, wherein the neural network model is the implicit model and the pre-rendered sparse voxel grid is the semi-explicit voxel feature model;
S4, performing pixel-ray sampling and volume rendering with the pre-rendered voxel grid, and removing artifacts generated during scene construction with a score distillation sampling loss;
S5, constructing a model pose visualization and alignment tool to provide the pose transformation operations and corresponding data output functions required to align the explicit model with the implicit model; then inserting the depth information between the real-time pose of the explicit model and the camera plane into the pixel-ray volume rendering path of the implicit scene model, so that the rendered image can distinguish the foreground and background of the implicit scene model and the boundary between the implicit and explicit representations is blurred, yielding a fused implicit-explicit rendered view.
2. The implicit and explicit model fusion rendering method of a digital twin scene according to claim 1, wherein: the incremental processing mode comprises a data segmentation and fusion process that includes the following substeps: first, the whole image database is divided in sequence into small local data blocks; a structure-from-motion reconstruction is then performed on each data block using nonlinear optimization; finally, bundle adjustment (BA) optimization is performed on the whole image database using graph optimization, so that the motion-structure information of each data block is fused into a global motion-structure model.
3. The explicit and implicit model fusion rendering method of a digital twin scene according to claim 1, wherein: in step S1, the sampled rays are extended forward from the camera position, discrete points are sampled along each sampled ray using a form sampler and an SDF sampler, respectively, and then mapped to the positions of each point in the 3D scene.
4. The explicit and implicit model fusion rendering method of a digital twin scene according to claim 1, wherein: the neural network model incorporates an adaptive activation function that aggregates feature information for a plurality of sample points into a global feature.
5. The explicit and implicit model fusion rendering method of a digital twin scene according to claim 4, wherein: the neural network model introduces a multi-domain approximation and hierarchical sampling strategy.
6. The implicit and explicit model fusion rendering method of a digital twin scene according to claim 1, wherein: the radiance field is modeled using the neural network model; in the rendering process, the ray is divided equally into N cells, one point is randomly sampled in each cell, and the sampled colors are weighted and summed:
Ĉ(r) = Σ_{i=1}^{N} T_i · (1 − exp(−σ_i·δ_i)) · c_i,  with  T_i = exp(−Σ_{j=1}^{i−1} σ_j·δ_j)  and  t_i ~ U[t_s + (i−1)(t_e − t_s)/N, t_s + i(t_e − t_s)/N]
wherein δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points; i is the sampling point index; t_i is the position of the i-th sampling point; U is the uniform distribution; t_s is the initial sampling position; t_e is the terminal sampling position; T_i is the accumulated transmittance up to the i-th sampling point; σ_i is the voxel density at sampling point i; and c_i is the color information at sampling point i.
7. The implicit and explicit model fusion rendering method of a digital twin scene according to claim 1, wherein: a density threshold ξ is set; based on the information obtained by sampling each voxel, the sampled density σ(t) of the voxel is compared with ξ, and the voxel is discarded if σ(t) is smaller than the threshold; octree division is then performed on the remaining voxels to obtain the fully constructed pre-rendered sparse voxel grid.
8. The method for fusion rendering of implicit and explicit models of a digital twin scene according to any one of claims 1-7, characterized in that: the explicit model information is aligned into the pre-rendered sparse grid coordinate system of PRNeRF through a transformation matrix T_e2i; the volume rendering ray of each pixel is then determined from the input rendering view information: r = o + t·d; then, according to the rendering view information, the explicit model information on the volume rendering ray path is converted into neural radiance field parameters; next, based on the depth information of the explicit model, the neural radiance field parameters of the explicit model are inserted into the volume rendering path.
9. A digital twin scene implicit and explicit model fusion rendering system, characterized in that: the system comprises a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, carries out the digital twin scene implicit and explicit model fusion rendering method according to any one of claims 1-8.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the digital twin scene implicit and explicit model fusion rendering method according to any one of claims 1-8.
CN202310690300.6A 2023-06-12 2023-06-12 Digital twin scene implicit and explicit model fusion rendering method and application Pending CN116883565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310690300.6A CN116883565A (en) 2023-06-12 2023-06-12 Digital twin scene implicit and explicit model fusion rendering method and application


Publications (1)

Publication Number Publication Date
CN116883565A (en) 2023-10-13

Family

ID=88265114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310690300.6A Pending CN116883565A (en) 2023-06-12 2023-06-12 Digital twin scene implicit and explicit model fusion rendering method and application

Country Status (1)

Country Link
CN (1) CN116883565A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333637A (en) * 2023-12-01 2024-01-02 北京渲光科技有限公司 Modeling and rendering method, device and equipment for three-dimensional scene
CN117333637B (en) * 2023-12-01 2024-03-08 北京渲光科技有限公司 Modeling and rendering method, device and equipment for three-dimensional scene

Similar Documents

Publication Publication Date Title
Wang et al. Hf-neus: Improved surface reconstruction using high-frequency details
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN115082639A (en) Image generation method and device, electronic equipment and storage medium
CN114863038B (en) Real-time dynamic free visual angle synthesis method and device based on explicit geometric deformation
CN115115797B (en) Large-scene sparse light field semantic driving intelligent reconstruction method, system and device
CN113450396A (en) Three-dimensional/two-dimensional image registration method and device based on bone features
CN106910215B (en) Super-resolution method based on fractional order gradient interpolation
CN116883565A (en) Digital twin scene implicit and explicit model fusion rendering method and application
CN115601511A (en) Three-dimensional reconstruction method and device, computer equipment and computer readable storage medium
Zuo et al. View synthesis with sculpted neural points
CN116721210A (en) Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
Cheng et al. GaussianPro: 3D Gaussian Splatting with Progressive Propagation
CN116152442B (en) Three-dimensional point cloud model generation method and device
CN114972619A (en) Single-image face three-dimensional reconstruction method based on self-alignment double regression
Zhang et al. Mffe: Multi-scale feature fusion enhanced net for image dehazing
CN116863053A (en) Point cloud rendering enhancement method based on knowledge distillation
CN116467948A (en) Digital twin model mechanism and appearance combined parameter learning method
Chen et al. Recovering fine details for neural implicit surface reconstruction
Zhao et al. NormalNet: Learning-based normal filtering for mesh denoising
Li et al. Intelligent combination of discrete LoD model for 3D visualization based on visual perception and information entropy fusion
Ge et al. 3D Reconstruction of Ancient Buildings Using UAV Images and Neural Radiation Field with Depth Supervision
Johnston et al. Single View 3D Point Cloud Reconstruction using Novel View Synthesis and Self-Supervised Depth Estimation
He et al. MMPI: a Flexible Radiance Field Representation by Multiple Multi-plane Images Blending
Li Design of 3D Image Visual Communication System for Automatic Reconstruction of Digital Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination