CN118015107A - Dynamic radiation field real-time streaming rendering method, system, device, chip and medium - Google Patents

Dynamic radiation field real-time streaming rendering method, system, device, chip and medium

Info

Publication number
CN118015107A
CN118015107A CN202410201507.7A
Authority
CN
China
Prior art keywords
rendering
characteristic
radiation field
mapping table
ray
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410201507.7A
Other languages
Chinese (zh)
Inventor
刘豪
郭程骋
王立翱
赵富强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yushen Digital Technology Co ltd
Original Assignee
Shanghai Yushen Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yushen Digital Technology Co ltd filed Critical Shanghai Yushen Digital Technology Co ltd
Priority to CN202410201507.7A priority Critical patent/CN118015107A/en
Publication of CN118015107A publication Critical patent/CN118015107A/en
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

A method, a system, a device, a chip and a medium for real-time streaming rendering of a dynamic radiation field, wherein the method comprises the following steps: S1, converting a frame of a radiation field into a characteristic image and generating a mapping table; S2, integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron; S3, performing refined sequence training on the characteristic images while maintaining spatial and temporal consistency; S4, compressing the trained characteristic images into video streams. According to the invention, the 4D radiation field is converted into a 2D characteristic image stream, so that the data processing amount in the rendering process is greatly reduced; by encoding the dynamic scene into a video stream, the requirement on storage space is remarkably reduced, and the data transmission process is optimized; the method is suitable for mobile devices, and solves the problem of real-time rendering of photorealistic dynamic scene rendering technology on a mobile platform.

Description

Dynamic radiation field real-time streaming rendering method, system, device, chip and medium
Technical Field
The invention relates to the field of computer vision, in particular to a method, a system, equipment, a chip and a medium for real-time streaming rendering of a dynamic radiation field.
Background
Currently, in the field of computer vision, particularly in rendering technology involving dynamic scenes, there have been several studies and developments. Mainstream technologies such as neural radiance fields (NeRF) perform well in the photorealistic rendering of static scenes. However, when dynamic, long-sequence radiation field rendering is involved, particularly on mobile devices, challenges of data transmission and computational constraints remain.
Free-viewpoint video (FVV) of dynamic scenes can provide an immersive experience in virtual reality and telepresence. In the prior art, the processing of dynamic scenes has mostly relied on treating the scene at each moment as an independent entity, or on maintaining a canonical space and matching it implicitly or explicitly with the live space of each frame. However, methods that rely on a canonical space are inefficient when handling sequences with large motion or topological changes. In addition, other methods such as 4D feature grids or temporal voxel features have achieved notable results in representing scenes with topological changes, but they are still limited by model capacity when representing longer sequences and present significant storage challenges for streaming.
The main problem with these prior art techniques is that they cannot achieve real-time, efficient rendering of dynamic scenes on mobile devices. Although existing NeRF compression methods may be employed, these methods introduce additional overhead in decoding or rendering, making them unsuitable for mobile platforms. Accordingly, the prior art is still deficient in achieving efficient rendering of dynamic scenes and optimization for mobile devices.
Disclosure of Invention
The invention aims to solve the existing problems and provides a dynamic radiation field real-time streaming rendering method, system, device, chip and medium, so as to solve the problems of existing dynamic scene rendering technology, namely that a large amount of data must be stored for long-sequence radiation field processing and that a heavy dependence on computing resources results.
In order to achieve the above purpose, the technical scheme adopted by the invention provides a real-time streaming rendering method of a dynamic radiation field, which comprises the following steps:
s1, converting each frame of a radiation field into a corresponding characteristic image, and generating a mapping table;
S2, integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
S3, performing refined sequence training on the characteristic images while maintaining spatial and temporal consistency, so as to achieve a high compression rate;
s4, compressing the trained characteristic images into video streams.
In S1 of some embodiments, for a given three-dimensional vertex position x, a mapping from each non-empty three-dimensional vertex to a corresponding two-dimensional pixel is achieved by retrieving its density σ and multi-dimensional feature f, with the equation:
σ,f=I(M(x));
Where M is a three-dimensional to two-dimensional mapping table.
In S1 of some embodiments, when generating the mapping table, pre-training in a coarse stage is first performed to generate a density grid, and an occupancy grid of each frame is generated based on the density grid; 3D-2D Morton ordering is then used to preserve the continuity of the three-dimensional space, and the mapping table is generated.
In some embodiments, each frame is also adaptively grouped during the conversion process, and a mapping table is created for each group.
In some embodiments, for a set of consecutive frames {i, i+1, …, i+n}, the maximum frame index α is determined such that the number of voxels occupied by the union of the occupancy grids from frame i to frame α does not exceed the pixel limit θ:
g(O_i ∪ O_{i+1} ∪ … ∪ O_α) ≤ θ
where O_t is the occupancy grid of frame t and g(·) represents the number of occupied voxels in the union occupancy grid;
frames i through α then form one frame group, and frame α+1 becomes the start frame of the new frame group.
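A minimal sketch of this adaptive grouping, assuming each frame's occupancy grid is available as a boolean NumPy array; the function and variable names are illustrative rather than taken from the invention:

import numpy as np

def adaptive_group_frames(occupancy_grids, pixel_limit):
    """Split per-frame occupancy grids into groups of frames (GOFs).

    occupancy_grids: list of boolean 3D arrays, one per frame (True = occupied voxel).
    pixel_limit: maximum number of occupied voxels (theta) that the union grid of a
                 group may hold, i.e. the pixel budget of one feature image.
    Returns a list of (start_frame, end_frame) index pairs, inclusive.
    """
    groups = []
    start = 0
    while start < len(occupancy_grids):
        union = np.zeros_like(occupancy_grids[start], dtype=bool)
        end = start
        for t in range(start, len(occupancy_grids)):
            candidate = union | occupancy_grids[t]
            if candidate.sum() > pixel_limit and t > start:
                break  # adding frame t would exceed the pixel budget
            union = candidate
            end = t
        groups.append((start, end))
        start = end + 1  # frame alpha+1 starts the next group
    return groups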
In some embodiments, in 2D Morton ordering, the feature image is also divided into a number of blocks B_i, and the ordered vertices are grouped into a number of chunks C_i accordingly; for each pixel p in block B_i, its relative position is ordered in a 2D Morton order; each chunk C_i is then mapped to a block B_i and arranged in 2D Morton order within the block to form a mapping table.
In S2 of some embodiments, for each sample point along ray r, a cumulative feature is first computed along the ray:
f̂(r) = Σ_{k=1}^{n_s} T_k · (1 − exp(−σ_k·δ_k)) · f_k,  with T_k = exp(−Σ_{j=1}^{k−1} σ_j·δ_j),
wherein n_s is the number of sampling points on ray r, T_k is a transparency accumulation term, and σ_k, f_k and δ_k are the density, the feature and the distance to the adjacent sampling point, respectively, of the k-th sampling point;
the color of the ray is then calculated using the small global MLP Φ shared across frames:
ĉ(r) = Φ(f̂(r), d)
where d is the direction of the ray r after the position encoding.
In S3 of some embodiments, a spatial consistency loss is applied to the feature images to reduce the storage requirement of the feature video by increasing the spatial sparsity of the images, which can be calculated with the total variation loss:
L_spatial = (1/|P|) Σ_{p∈P} √(Δ_u(p)² + Δ_v(p)²)
where P is the set of pixels of the feature image, and Δ_u(p) and Δ_v(p) represent the variation of pixel p in the horizontal and vertical directions, respectively.
In S3 of some embodiments, a temporal consistency loss is applied to the feature images by minimizing the feature differences between successive frames within a group, so that the residual between consecutive feature images of each frame t within the group remains small, as expressed by:
L_temporal = ||I_t - I_{t-1}||_1
In some embodiments, the total loss is as follows:
L_total = L_rgb + λ_s·L_spatial + λ_t·L_temporal
where λ_s and λ_t are the weights of the introduced regularization terms L_spatial and L_temporal; L_rgb is the photometric loss, expressed as follows:
L_rgb = Σ_{r∈R} ||ĉ(r) - c(r)||_2^2
wherein R is the collection of rays, and c(r) and ĉ(r) are the true color and the predicted color of ray r, respectively.
The invention also provides a dynamic radiation field real-time streaming rendering system, which comprises:
The conversion module is used for converting the 4D radiation field into a corresponding 2D characteristic image stream and generating a mapping table; the rendering module is used for integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
the training module is used for carrying out refined sequence training by applying space consistency loss and time consistency loss to the characteristic images;
and the compression module is used for compressing the trained characteristic images into a video stream.
In some embodiments, a cross-platform player is also included for streaming and rendering dynamic radiation fields on a mobile device, enabling real-time interactive scene playback across devices.
The invention also provides a dynamic radiation field real-time streaming rendering device, which comprises a memory and a processor, wherein the memory stores a computer program executable thereon, and the processor implements the above rendering method when executing the computer program.
The invention also provides a chip comprising one or more processors for invoking and running a computer program from memory, such that a device on which the chip is mounted performs the above-described rendering method.
The present invention also provides a storage medium containing computer executable instructions which, when executed by a computer processor, are used to perform the above-described rendering method.
Compared with the prior art, the invention has the following technical advantages:
1. The rendering efficiency and quality are improved: the invention converts the 4D radiation field into the 2D characteristic image stream, thereby greatly reducing the data processing amount in the rendering process. The conversion not only improves the rendering speed, but also maintains the high-quality rendering effect of the dynamic scene.
2. Reducing storage and transmission requirements: by encoding dynamic scenes as video streams, the need for storage space is significantly reduced and the data transmission process is optimized so that even long-time-sequence dynamic scenes can be efficiently streamed over the network.
3. Cross-platform compatibility and mobile device optimization: the technical scheme of the invention is particularly suitable for mobile devices, and solves the problem of real-time rendering of photorealistic dynamic scene rendering technology on a mobile platform. Under different hardware environments, including mobile devices with limited resources, efficient photorealistic dynamic scene rendering can be achieved.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of Morton ordering.
Detailed Description
Specific details of the invention will now be described in detail with reference to the accompanying drawings. Referring to fig. 1, fig. 1 shows a first embodiment of the present invention, a dynamic radiation field real-time streaming rendering method, mainly including:
s1, converting each frame of a radiation field into a corresponding characteristic image, and generating a mapping table;
S2, integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
S3, performing refined sequence training on the characteristic images while maintaining spatial and temporal consistency, so as to achieve a high compression rate;
s4, compressing the trained characteristic images into video streams.
In dynamic scene rendering, density and multi-dimensional feature data stored in a three-dimensional grid are typically required for each frame. However, such grid representations are usually highly sparse, and using them directly can take up a significant amount of memory. Therefore, to address this problem and optimize storage efficiency, we convert (or "bake") these data from a three-dimensional grid into a two-dimensional image.
Further, each frame of the radiation field in S1 is represented as a feature image I, wherein the first channel stores the density σ and the remaining channels store the multi-dimensional feature f. For a given three-dimensional vertex position x, its density σ and multi-dimensional feature f are retrieved by the following equation:
σ,f=I(M(x))
Where M is a three-dimensional to two-dimensional mapping table. The mapping table M not only effectively eliminates empty space to reduce the storage requirement, but also maps each non-empty three-dimensional vertex onto a corresponding 2D pixel. This lookup operation has a time complexity of O(1); moreover, the feature image sequence is in a 2D format, which is friendly to video codec hardware, making the query fast and convenient.
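A minimal sketch of this constant-time lookup, assuming the mapping table is stored as an integer array mapping voxel coordinates to (u, v) pixel coordinates, with negative entries marking empty voxels; the storage layout and names are assumptions for illustration:

import numpy as np

def lookup_density_and_feature(x_idx, mapping_table, feature_image):
    """O(1) retrieval of density and feature for a voxel via the 3D-to-2D mapping table.

    x_idx:          (i, j, k) integer voxel coordinates of the queried vertex.
    mapping_table:  int array of shape (X, Y, Z, 2) holding the (u, v) pixel of each
                    non-empty voxel (a negative entry marks an empty voxel).
    feature_image:  array of shape (H, W, 1 + F); channel 0 stores density sigma,
                    the remaining F channels store the multi-dimensional feature f.
    """
    u, v = mapping_table[x_idx]
    if u < 0:                      # empty space: nothing was baked for this voxel
        return 0.0, np.zeros(feature_image.shape[-1] - 1)
    pixel = feature_image[v, u]    # sigma, f = I(M(x))
    sigma, f = pixel[0], pixel[1:]
    return sigma, f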
Further, referring to fig. 1, the generation of the mapping table in S1 includes two steps, coarse pre-training and generation of the mapping table.
In coarse pre-training, pre-training in a coarse stage is first performed for each frame t of the radiation field to generate a density grid. In particular, this is achieved by processing the multi-view images using an existing neural network method (DVGO), with the aim of independently creating, for each frame, an explicit three-dimensional grid representing the voxel density. Then, based on this density grid, the occupancy grid O_t of each frame is generated by comparing the voxels in the density grid with a fixed opacity threshold γ; only voxels exceeding this threshold γ are preserved. This process ensures that the mapping table stores only grid points carrying effective information, thereby reducing the storage requirement and improving the compression efficiency, and lays a foundation for the subsequent adaptive grouping in the baking stage.
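A minimal sketch of this occupancy-grid construction, assuming the coarse density grid is a NumPy array and that density is converted to opacity with a constant step length before thresholding against γ; these details are assumptions, since the embodiment only specifies the threshold comparison:

import numpy as np

def occupancy_from_density(density_grid, gamma, step_length=1.0):
    """Build the per-frame occupancy grid O_t from the coarse density grid.

    density_grid: 3D array of per-voxel densities sigma from coarse pre-training (e.g. DVGO).
    gamma:        fixed opacity threshold; only voxels whose opacity exceeds it are kept.
    step_length:  step used to turn density into opacity (assumed constant here).
    """
    alpha = 1.0 - np.exp(-density_grid * step_length)  # per-voxel opacity
    return alpha > gamma                                # boolean occupancy grid O_t

def union_occupancy(occupancy_grids):
    """Union occupancy grid of a frame group, used to build the shared mapping table."""
    return np.logical_or.reduce(occupancy_grids)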
In the generation of the mapping table, the mapping table is generated using 3D-2D Morton ordering. After obtaining the union occupancy grid, the present embodiment applies Morton ordering to preserve its three-dimensional spatial continuity. Referring to fig. 2, Morton ordering, or Z-ordering, employs a binary representation of interleaved spatial coordinates, ensuring that spatially adjacent entities remain adjacent in linear space. 3D Morton ordering is carried out on the vertices according to their position coordinates in the union grid. Vertices with densities below the threshold γ are excluded. This ordering ensures that voxels adjacent in three-dimensional space remain spatially continuous when mapped to the two-dimensional image. Further, 2D block partitioning and 2D Morton ordering are also applied. To preserve 3D spatial continuity within the 2D framework, 2D Morton ordering is employed and the feature image is divided into blocks. This approach is consistent with the block-wise compression of frames in video codecs, where blocks with local smoothness lead to more efficient storage.
Specifically, the feature image is first divided into N 8×8 blocks, denoted as B_i, and the ordered vertices are correspondingly grouped into N 8×8 chunks, denoted as C_i. For each pixel p in block B_i, its relative position (u, v) is ranked in 2D Morton order. Each chunk C_i is then mapped to a block B_i and arranged in 2D Morton order within the block to form the mapping table M. Such 2D Morton ordering ensures that values adjacent in the 3D Morton ordering are positioned close together within each feature image block, facilitating efficient compression.
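The bit interleaving underlying the 3D and 2D Morton (Z-order) codes can be sketched as follows; the 10-bit-per-axis limit and the helper names are assumptions for illustration:

def part1by2(n):
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    n &= 0x3FF
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3d(i, j, k):
    """3D Morton (Z-order) code: interleave the bits of the voxel coordinates."""
    return part1by2(i) | (part1by2(j) << 1) | (part1by2(k) << 2)

def morton2d(u, v):
    """2D Morton code for the relative position (u, v) inside an 8x8 block."""
    code = 0
    for b in range(3):  # 8x8 block -> 3 bits per axis
        code |= (((u >> b) & 1) << (2 * b)) | (((v >> b) & 1) << (2 * b + 1))
    return code

Sorting the occupied voxels by morton3d of their coordinates, then laying each chunk of 64 consecutive vertices into an 8×8 block according to morton2d of its local rank, yields a mapping table of the kind described above.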
Further, since generating an independent mapping table for each frame t may break temporal continuity, the storage requirement after video codec compression would increase. Therefore, to further optimize storage and compression efficiency, the present embodiment also adaptively divides the entire sequence into different groups of frames (Groups of Frames, GOFs), with the frames within each group sharing the same mapping table.
For a set of consecutive frames {i, i+1, …, i+n}, the maximum frame index α is determined such that the number of voxels occupied by the union of the occupancy grids from frame i to frame α does not exceed the pixel limit θ:
g(O_i ∪ O_{i+1} ∪ … ∪ O_α) ≤ θ
where g(·) represents the number of occupied voxels in the union occupancy grid. Frames i through α then form one frame group, and frame α+1 becomes the start frame of the new frame group. This adaptive grouping maintains temporal continuity while ensuring that the mapping table corresponding to each group does not become excessively large, thereby optimizing storage efficiency.
A deferred rendering model is used in the rendering of S2: for each sample point along ray r, a cumulative feature is first computed along the ray:
f̂(r) = Σ_{k=1}^{n_s} T_k · (1 − exp(−σ_k·δ_k)) · f_k,  with T_k = exp(−Σ_{j=1}^{k−1} σ_j·δ_j),
where n_s is the number of sampling points on ray r, T_k is the transparency accumulation term, and σ_k, f_k and δ_k are the density, the feature and the distance to the adjacent sampling point, respectively, of the k-th sampling point.
The color of the ray is then calculated using the small global MLP Φ shared across frames:
ĉ(r) = Φ(f̂(r), d)
where d is the direction of ray r after position encoding (positional encoding). The advantage of the above S2 approach is that the number of passes through the MLP is significantly reduced: because the MLP is applied only once after the features of all sample points have been integrated, only one MLP decoding is required per ray, instead of one at every sample point. This strategy greatly reduces the amount of computation, enabling real-time rendering even on mobile devices with limited computing power, while maintaining rendering quality.
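A minimal PyTorch sketch of this deferred rendering step, assuming the per-ray samples have already been gathered into tensors, Φ is any small torch.nn.Module, and the view direction is already positionally encoded; shapes and helper names are illustrative, not taken from the patent:

import torch

def render_ray_features(sigmas, feats, deltas):
    """Accumulate the features of all sample points along a batch of rays.

    sigmas: (R, N) densities of the N samples on each of R rays.
    feats:  (R, N, F) per-sample features fetched from the feature image.
    deltas: (R, N) distances between adjacent samples.
    Returns the per-ray accumulated feature of shape (R, F).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)                 # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)        # transmittance after each sample
    T = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)  # T_k before sample k
    weights = (T * alphas).unsqueeze(-1)                       # (R, N, 1)
    return (weights * feats).sum(dim=-2)                       # integrate features along the ray

def shade(feat, viewdir_encoded, mlp):
    """Decode the accumulated feature into a color with the small global MLP, once per ray."""
    return mlp(torch.cat([feat, viewdir_encoded], dim=-1))     # c(r) = Phi(f(r), d)

Because shade is called once per ray rather than once per sample, the MLP cost no longer scales with the number of samples along the ray.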
The video-codec-friendly training method comprises two key steps: the baking and mapping table generation described above, and the refined sequence training described below; together they ensure that the generated 2D feature image stream is suitable for a video codec, thereby achieving efficient compression and decoding.
In refined sequential training, spatial consistency loss L spatial and temporal consistency loss L temporal are preferably applied.
In S3, a spatial consistency loss is applied to the feature image. In order to enhance the homogeneity of the 2D feature image, local smoothness is enforced for each channel of the feature image I through the following total variation loss:
L_spatial = (1/|P|) Σ_{p∈P} √(Δ_u(p)² + Δ_v(p)²)
where P is the set of pixels of the feature image, and Δ_u(p) and Δ_v(p) represent the variation of pixel p in the horizontal and vertical directions, respectively. By increasing the spatial sparsity, the storage of the feature video after video encoding is reduced at the same quality.
In S3, a temporal consistency loss is also applied to the feature images by minimizing the feature differences between successive frames within a group, so as to ensure that the residual between consecutive feature images of each frame t within the group remains small.
Except for the initial frame of each adaptive group, the similarity between frames is enhanced by regularizing the current feature image against its previous feature image, with the formula:
L_temporal = ||I_t - I_{t-1}||_1
By applying this temporal smoothing, the inter-frame residuals in the overall video codec process are further reduced, thereby facilitating storage savings.
Finally, the total loss is as follows:
L_total = L_rgb + λ_s·L_spatial + λ_t·L_temporal
where λ_s and λ_t are the weights of the introduced regularization terms L_spatial and L_temporal, and L_rgb is the photometric loss, expressed as follows:
L_rgb = Σ_{r∈R} ||ĉ(r) - c(r)||_2^2
where R is the collection of rays, and c(r) and ĉ(r) are the true color and the predicted color of ray r, respectively. Based on the above method, the present embodiment trains the model on a single NVIDIA GeForce RTX 3090 using the PyTorch framework. The newly captured dynamic dataset contains about 80 views with a resolution of 1920 × 1080 at 30 frames per second. To verify the robustness of the algorithm, results are also presented on the ReRF dataset and the HumanRF dataset. The present embodiment can render 360-degree immersive video sequences capturing human interaction with objects, in particular sequences with large movements and long durations.
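As an illustrative sketch only, the combined loss can be written in PyTorch as follows, assuming the feature image of the current frame is optimized as a (C, H, W) tensor; the isotropic form of the TV term and the default weights λ_s, λ_t are assumptions:

import torch
import torch.nn.functional as F

def total_loss(pred_rgb, gt_rgb, feature_image, prev_feature_image,
               lambda_s=0.01, lambda_t=0.01):
    """L_total = L_rgb + lambda_s * L_spatial + lambda_t * L_temporal.

    pred_rgb, gt_rgb:   (R, 3) predicted and ground-truth colors of the sampled rays.
    feature_image:      (C, H, W) feature image I_t being optimized for the current frame.
    prev_feature_image: (C, H, W) feature image I_{t-1}, or None for the first frame
                        of an adaptive group.
    """
    l_rgb = F.mse_loss(pred_rgb, gt_rgb)                           # photometric loss

    # total-variation style spatial consistency loss, per channel
    du = feature_image[:, :-1, 1:] - feature_image[:, :-1, :-1]    # horizontal differences
    dv = feature_image[:, 1:, :-1] - feature_image[:, :-1, :-1]    # vertical differences
    l_spatial = torch.sqrt(du ** 2 + dv ** 2 + 1e-8).mean()

    # temporal consistency loss: L1 residual against the previous feature image
    l_temporal = 0.0
    if prev_feature_image is not None:
        l_temporal = (feature_image - prev_feature_image.detach()).abs().mean()

    return l_rgb + lambda_s * l_spatial + lambda_t * l_temporal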
To facilitate dynamic radiation field rendering on mobile devices, the present embodiment adopts a representation consistent with video codec formats and keeps the computational complexity of shader-based rendering low. The embodiment converts the 4D neural radiation field into a serialized 2D feature image stream and performs rendering through an efficient rendering pipeline, thereby realizing real-time, efficient rendering and streaming of dynamic scenes on different platforms, significantly improving the rendering performance and usability of dynamic scenes, and maintaining a smooth rendering experience in resource-constrained environments.
A second embodiment of the present invention provides a dynamic radiation field real-time streaming rendering system, based on the foregoing method, mainly including:
The conversion module is used for converting the 4D radiation field into a corresponding 2D characteristic image stream and generating a mapping table; the rendering module is used for integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
the training module is used for carrying out refined sequence training by applying space consistency loss and time consistency loss to the characteristic images;
and the compression module is used for compressing the trained characteristic images into a video stream.
Further, a cross-platform player is included, aimed at streaming and rendering dynamic radiation fields on mobile devices. To achieve this, we quantize the feature image stream to the uint8 format and select H.264 as the video codec. The total resolution of the feature image should be kept below that of a 4K color image to ensure seamless real-time streaming and decoding even on mobile devices.
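A minimal sketch of the per-channel uint8 quantization performed before H.264 encoding, together with the inverse mapping applied by the player after decoding; the min-max scheme and names are assumptions, since the embodiment only specifies that uint8 and H.264 are used:

import numpy as np

def quantize_feature_image(feature_image):
    """Quantize a float (H, W, C) feature image to uint8 per channel.

    Returns the uint8 image plus the per-channel (min, scale) needed to dequantize
    in the player.
    """
    fmin = feature_image.min(axis=(0, 1), keepdims=True)
    fmax = feature_image.max(axis=(0, 1), keepdims=True)
    scale = np.maximum(fmax - fmin, 1e-8)
    q = np.round((feature_image - fmin) / scale * 255.0).astype(np.uint8)
    return q, fmin, scale

def dequantize_feature_image(q, fmin, scale):
    """Inverse mapping applied by the player after video decoding."""
    return q.astype(np.float32) / 255.0 * scale + fmin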
In terms of rendering, the present embodiment implements the rendering process through a fragment shader. To speed up ray marching, a multi-resolution occupancy grid hierarchy is employed for each group to skip empty space at different levels. Furthermore, splitting the large matrix into small 4×4 matrices in the shader speeds up the computation of the small MLP.
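A much-simplified, CPU-side Python sketch of empty-space skipping with a multi-resolution occupancy grid hierarchy; the real implementation runs in a fragment shader, and the level-dependent step sizes used here are assumptions:

import numpy as np

def march_with_occupancy(origin, direction, occupancy_levels, base_step, n_max):
    """Ray marching with coarse-to-fine empty-space skipping.

    occupancy_levels: boolean occupancy grids ordered coarse to fine, all assumed to
                      cover the unit cube; an empty cell at a coarse level lets the
                      ray take a proportionally larger step.
    Returns the parametric positions t of the samples that need shading.
    """
    kept = []
    t = 0.0
    for _ in range(n_max):
        p = origin + t * direction
        if np.any(p < 0.0) or np.any(p >= 1.0):
            break                                              # ray left the volume
        skipped = False
        for level, grid in enumerate(occupancy_levels):        # coarse levels first
            res = np.array(grid.shape)
            idx = tuple((p * res).astype(int))
            if not grid[idx]:
                # empty at this level: jump over the empty region with a larger step
                t += base_step * (2 ** (len(occupancy_levels) - 1 - level))
                skipped = True
                break
        if not skipped:
            kept.append(t)                                     # occupied: shade this sample
            t += base_step
    return kept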
In the cross-platform player, a user can drag, rotate, pause, play, fast-forward/rewind, seek within dynamic scenes, or switch between different resolutions, just like watching an online video, providing a high-quality free-viewpoint viewing experience. Furthermore, this capability extends to a variety of devices, from smartphones and tablets to laptops and desktop computers, broadening the accessibility and applicability of dynamic radiation fields.
A third embodiment of the present invention also provides a dynamic radiation field real-time streaming rendering device, including a memory and a processor, the memory storing a computer program executable thereon; the processor implements any of the rendering methods described above when executing the computer program.
The memory may include random access memory, flash memory, read-only memory, programmable read-only memory, non-volatile memory, registers, or the like. The processor may be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphic Processing Unit, GPU), or the like. The memory may store executable instructions, and the processor may execute the executable instructions stored in the memory to implement the various processes described herein.
It will be appreciated that the memory in this embodiment may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a ROM (Read-Only Memory), PROM (Programmable ROM), EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or flash memory, among others. The volatile memory may be a RAM (Random Access Memory), which serves as an external cache. By way of example, and not limitation, many forms of RAM are available, such as SRAM (Static RAM), DRAM (Dynamic RAM), SDRAM (Synchronous DRAM), DDR SDRAM (Double Data Rate SDRAM), ESDRAM (Enhanced SDRAM), SLDRAM (Synchlink DRAM), and DRRAM (Direct Rambus RAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory. In some embodiments, the memory stores the following elements, an upgrade package, an executable unit, or a data structure, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, a driving layer, and the like, and is used for realizing various basic services and processing hardware-based tasks. And the application programs comprise various application programs and are used for realizing various application services. The program for implementing the method of the embodiment of the invention can be contained in an application program.
In an embodiment of the present invention, the processor is configured to execute the method steps provided in the first aspect by calling a program or an instruction stored in the memory, in particular, a program or an instruction stored in the application program.
A fourth embodiment of the invention also provides a chip for performing the method of the first aspect. Specifically, the chip includes: a processor for calling and running a computer program from a memory, so that the device on which the chip is mounted is used for executing the dynamic radiation field real-time streaming rendering method in the first aspect. In addition, in a fifth aspect, the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the dynamic radiation field real-time streaming rendering in the first embodiment of the present invention.
For example, machine-readable storage media may include, but are not limited to, various known and unknown types of non-volatile memory.
A sixth embodiment of the present invention also provides a computer program product comprising computer program instructions for causing a computer to perform the above-mentioned method.
Those of skill in the art will appreciate that the elements and steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Claims (10)

1. A real-time streaming rendering method of dynamic radiation field is characterized by comprising the following steps:
s1, converting each frame of a radiation field into a corresponding characteristic image, and generating a mapping table;
S2, integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
S3, performing refined sequence training on the characteristic images while maintaining spatial and temporal consistency, so as to achieve a high compression rate;
s4, compressing the trained characteristic images into video streams.
2. The method according to claim 1, characterized in that: in S1, for a given three-dimensional vertex position x, mapping from each non-empty three-dimensional vertex to a corresponding two-dimensional pixel is achieved by retrieving its density σ and multidimensional feature f, with the equation:
σ,f=I(M(x));
Where M is a three-dimensional to two-dimensional mapping table.
3. The method according to claim 1, characterized in that: in S1, when the mapping table is generated, pre-training in a coarse stage is first performed to generate a density grid, and an occupancy grid of each frame is generated based on the density grid; 3D-2D Morton ordering is then used to preserve the continuity of the three-dimensional space, and the mapping table is generated.
4. A method according to claim 3, characterized in that: each frame is also adaptively grouped during the conversion process and a mapping table is created for each group.
5. The method according to claim 1, characterized in that: in S2, for each sample point along ray r, a cumulative feature is first computed along the ray:
f̂(r) = Σ_{k=1}^{n_s} T_k · (1 − exp(−σ_k·δ_k)) · f_k,  with T_k = exp(−Σ_{j=1}^{k−1} σ_j·δ_j),
wherein n_s is the number of sampling points on ray r, T_k is a transparency accumulation term, and σ_k, f_k and δ_k are the density, the feature and the distance to the adjacent sampling point, respectively, of the k-th sampling point;
the color of the ray is then calculated using the small global MLP Φ shared across frames:
ĉ(r) = Φ(f̂(r), d)
where d is the direction of the ray r after the position encoding.
6. The method according to claim 1, characterized in that: in S3, a spatial consistency loss is applied to the feature image to reduce the storage requirement of the feature video by increasing the spatial sparsity of the image, which can be calculated with the total variation loss:
L_spatial = (1/|P|) Σ_{p∈P} √(Δ_u(p)² + Δ_v(p)²)
where P is the set of pixels of the feature image, and Δ_u(p) and Δ_v(p) represent the variation of pixel p in the horizontal and vertical directions, respectively;
a temporal consistency loss is applied to the feature images by minimizing the feature differences between successive frames within a group, so that the residual between consecutive feature images of each frame t within the group remains small, with the formula:
L_temporal = ||I_t - I_{t-1}||_1
the total loss is as follows:
L_total = L_rgb + λ_s·L_spatial + λ_t·L_temporal
where λ_s and λ_t are the weights of the introduced regularization terms L_spatial and L_temporal;
L_rgb is the photometric loss, expressed as follows:
L_rgb = Σ_{r∈R} ||ĉ(r) - c(r)||_2^2
wherein R is the collection of rays, and c(r) and ĉ(r) are the true color and the predicted color of ray r, respectively.
7. A dynamic radiation field real-time streaming rendering system, comprising:
the conversion module is used for converting the 4D radiation field into a corresponding 2D characteristic image stream and generating a mapping table;
The rendering module is used for integrating the characteristic values of all sampling points on a ray, and then rendering the result through a small global multi-layer perceptron;
the training module is used for carrying out refined sequence training by applying space consistency loss and time consistency loss to the characteristic images;
and the compression module is used for compressing the trained characteristic images into a video stream.
8. A dynamic radiation field real-time streaming rendering device, characterized in that: comprising a memory and a processor, the memory comprising a computer program executable thereon, the processor implementing the rendering method of any of claims 1-6 when the computer program is executed.
9. A chip, characterized in that: comprising one or more processors for invoking and running a computer program from memory to cause a device on which the chip is installed to perform the rendering method of any of claims 1-6.
10. A storage medium containing computer executable instructions, which when executed by a computer processor are for performing the rendering method of any one of claims 1-6.
CN202410201507.7A 2024-02-23 2024-02-23 Dynamic radiation field real-time streaming rendering method, system, device, chip and medium Pending CN118015107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410201507.7A CN118015107A (en) 2024-02-23 2024-02-23 Dynamic radiation field real-time streaming rendering method, system, device, chip and medium

Publications (1)

Publication Number Publication Date
CN118015107A true CN118015107A (en) 2024-05-10

Family

ID=90944115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410201507.7A Pending CN118015107A (en) 2024-02-23 2024-02-23 Dynamic radiation field real-time streaming rendering method, system, device, chip and medium

Country Status (1)

Country Link
CN (1) CN118015107A (en)

Similar Documents

Publication Publication Date Title
Reiser et al. Merf: Memory-efficient radiance fields for real-time view synthesis in unbounded scenes
US11968372B2 (en) Layered scene decomposition CODEC method
US10776688B2 (en) Multi-frame video interpolation using optical flow
CN104616243B (en) A kind of efficient GPU 3 D videos fusion method for drafting
CN103098466B (en) Image processing apparatus and image processing method
US11941748B2 (en) Lightweight view dependent rendering system for mobile devices
US11496773B2 (en) Using residual video data resulting from a compression of original video data to improve a decompression of the original video data
CN109983757A (en) View relevant operation during panoramic video playback
CN111402380B (en) GPU compressed texture processing method
US11893676B2 (en) Parallel texture sampling
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
Díaz et al. Interactive spatio-temporal exploration of massive time-varying rectilinear scalar volumes based on a variable bit-rate sparse representation over learned dictionaries
US11418769B1 (en) Viewport adaptive volumetric content streaming and/or rendering
CN118015107A (en) Dynamic radiation field real-time streaming rendering method, system, device, chip and medium
DE102018127265A1 - MULTI-FRAME VIDEO INTERPOLATION WITH OPTICAL FLOW
JP2012060611A (en) Image processing system, image processing method, and data structure of dynamic image file
US11727536B2 (en) Method and apparatus for geometric smoothing
US11861788B1 (en) Resolution budgeting by area for immersive video rendering
CN116681818B (en) New view angle reconstruction method, training method and device of new view angle reconstruction network
KR102666871B1 (en) Method and apparatus for displaying massive 3d models for ar device
Zhang et al. Rendering and Transmission Method based on Model Resolution Expression and User Action Prediction
Hua et al. Compressing repeated content within large-scale remote sensing images
CN118158489A (en) Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product
CN117974864A (en) Method, device, equipment and storage medium for synthesizing dynamic new view image
CN116051746A (en) Improved method for three-dimensional reconstruction and neural rendering network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination