CN118158489A - Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product - Google Patents
Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product
- Publication number
- CN118158489A (Application number CN202410261768.8A)
- Authority
- CN
- China
- Prior art keywords
- gaussian
- neural network
- viewpoint video
- generating
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 60
- 238000013528 artificial neural network Methods 0.000 claims abstract description 35
- 238000009877 rendering Methods 0.000 claims abstract description 29
- 230000008859 change Effects 0.000 claims abstract description 19
- 238000006073 displacement reaction Methods 0.000 claims abstract description 19
- 238000012549 training Methods 0.000 claims description 30
- 238000004590 computer program Methods 0.000 claims description 20
- 230000008569 process Effects 0.000 claims description 9
- 239000013598 vector Substances 0.000 claims description 9
- 238000005315 distribution function Methods 0.000 claims description 4
- 238000005070 sampling Methods 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 description 12
- 230000001537 neural effect Effects 0.000 description 11
- 230000009466 transformation Effects 0.000 description 10
- 230000015572 biosynthetic process Effects 0.000 description 7
- 238000005457 optimization Methods 0.000 description 7
- 238000003786 synthesis reaction Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 6
- 238000013507 mapping Methods 0.000 description 5
- 238000010586 diagram Methods 0.000 description 4
- 230000001419 dependent effect Effects 0.000 description 3
- 230000007992 neural conversion Effects 0.000 description 3
- 230000002085 persistent effect Effects 0.000 description 3
- 230000002194 synthesizing effect Effects 0.000 description 3
- 230000001052 transient effect Effects 0.000 description 3
- 230000003044 adaptive effect Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 210000005036 nerve Anatomy 0.000 description 2
- 230000005855 radiation Effects 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 230000010485 coping Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000013138 pruning Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000009966 trimming Methods 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g 3D video
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44012—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4666—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Graphics (AREA)
- Processing Or Creating Images (AREA)
Abstract
The application relates to an efficient streaming free-viewpoint video generation method, a computer device and a program product, implemented on the basis of a 3D Gaussian model. The method comprises the following steps: obtaining a 3D Gaussian model of the previous frame of a three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point; constructing a neural network comprising a perceptron, wherein the position points are recorded as a positional hash encoding, and the perceptron receives the positional hash encoding and maps it to the attribute change of the 3D Gaussian, the attribute change comprising a first part representing the displacement of the 3D Gaussian and a second part representing the rotation of the 3D Gaussian; updating the 3D Gaussians in the next frame using the 3D Gaussian displacement and 3D Gaussian rotation, and rendering to obtain a reference image; and optimizing the neural network using the loss between the reference image and a sample image, and generating the subsequent frame image using the optimized neural network.
Description
Technical Field
The application relates to the field of computer vision and deep learning, in particular to a method, computer equipment and a program product for generating high-efficiency streaming free viewpoint video based on a 3D Gaussian model.
Background
Synthesizing views at new perspectives from a set of pictures of a scene with known camera poses is an important research topic in the fields of computer vision and graphics. Conventional Lumigraph or Light Field methods implement new view synthesis by interpolation.
In recent years, NeRF (Neural Radiance Fields), which represents a scene with a neural radiance field, has attracted attention from researchers for its photo-realistic rendering of scenes. A series of follow-up works improve NeRF in various aspects, such as handling challenging scenes, accelerating training, achieving real-time rendering, and improving reconstruction quality. However, since the original NeRF performs new view synthesis with costly volume rendering that requires neural network queries, these subsequent methods inevitably require compromises in terms of training time, rendering speed, memory footprint, image quality, and application range.
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (2022, abbreviated as I-NGP) greatly speeds up training by combining multi-resolution hash coding with a fully-fused multi-layer perceptron (MLP), but sacrifices reconstruction quality. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields (2023, abbreviated as Zip-NeRF) combines I-NGP with Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields (abbreviated as Mip-NeRF-360), which focuses on improving reconstruction quality and handling unbounded scenes, but still requires a long training time. Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields (2023, abbreviated as Tri-MipRF) speeds up training and rendering, but only handles bounded, inward-facing captured scenes. A series of approaches based on baking achieve real-time rendering, but require additional storage space.
Kerbl et al (3D Gaussian Splatting for Real-Time Radiance Field Rendering, 2023) propose using 3D Gaussian (3DG) primitives combined with differentiable point-based rendering to achieve fast radiance field reconstruction and real-time radiance field rendering. With acceptable memory occupancy, real-time, high-fidelity new view synthesis can be achieved through short training even in complex, large-scale, unbounded scenes, but they do not provide a solution for generating dynamic scenes.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a method for generating an efficient streaming free-viewpoint video based on a 3D gaussian model.
The application discloses a method for generating efficient streaming free-viewpoint video based on a 3D Gaussian model, which comprises generating subsequent frames sequentially from an initial frame, wherein generating a subsequent frame from its previous frame specifically comprises the following steps:
obtaining a 3D Gaussian model of the previous frame of the three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point;
Constructing a neural network comprising a perceptron, wherein the position points are recorded in a position hash coding mode, the perceptron receives the position hash coding and maps the position hash coding into attribute changes of the 3D Gaussian, and the attribute changes comprise a first part used for representing 3D Gaussian displacement and a second part used for representing 3D Gaussian rotation;
Updating the 3D Gaussian in the next frame by using the 3D Gaussian displacement and the 3D Gaussian rotation, and rendering to obtain a reference image;
And optimizing the neural network by using the losses of the reference image and the sample image, and generating a post-frame image by using the optimized neural network.
Optionally, the perceptron is a fully-connected neural network with sixteen or fewer layers.
Optionally, the 3D gaussian model of the initial frame is obtained by reconstruction;
The efficient streaming free-viewpoint video generation method further comprises: rendering the previous frame image using the 3D Gaussian model of the previous frame, and combining the previous frame images to generate the free-viewpoint video.
Optionally, the attribute change of the 3D Gaussian is a seven-dimensional vector, wherein the first part is the first three dimensions and the second part is the remaining dimensions.
Optionally, the remaining dimensions are used to represent a unit quaternion, and the unit quaternion is used to represent the rotation of the 3D Gaussian.
Optionally, the perceptron receives the positional hash encoding and maps it to the attribute change of the 3D Gaussian, specifically by:
dμ, dq = MLP(h(μ))
where:
μ represents the position of the 3D Gaussian;
dμ represents the displacement of the 3D Gaussian;
q represents a unit quaternion;
dq represents the spatial rotation angle of the 3D Gaussian;
h(μ) represents the hash encoding of μ;
MLP represents the perceptron.
Optionally, the location point attribute includes a color;
When the 3D Gaussian is updated, the 3D Gaussian position and the unit quaternion representing rotation are updated simultaneously, and the spherical harmonic coefficients are rotated according to the change of the unit quaternion to update the color of the 3D Gaussian.
Optionally, optimizing the neural network includes a plurality of training rounds, wherein gradient tracking is initiated in a last training round;
After the neural network is optimized, selecting the 3D Gaussians whose average view-space position gradient magnitude is higher than a threshold value, sampling the newly added Gaussian positions through a probability distribution function, generating a newly added set of 3D Gaussians, and updating to obtain the final 3D Gaussians;
and generating the post-frame image by using the final 3D Gaussians.
The application also provides a computer device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to realize the steps of the high-efficiency streaming free-viewpoint video generation method based on the 3D Gaussian model.
The present application also provides a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method for generating a high efficiency streaming free view video based on a 3D gaussian model according to the present application.
The high-efficiency streaming free viewpoint video generation method based on the 3D Gaussian model has at least the following effects:
The optimized neural network can be used to obtain a rendered image that meets expectations, which serves as the subsequent frame image. Each subsequent frame image is generated by the neural network optimized for that frame, and the neural network is optimized in real time at different time frames, so the method has the characteristic of on-the-fly training and immediate application.
The application can synthesize views at new viewing angles from a set of pictures of the scene with known camera poses, can be used to generate dynamic video at new viewing angles, and has good spatial and temporal consistency.
Drawings
Fig. 1 is a flow chart of a method for generating an efficient streaming free-viewpoint video based on a 3D gaussian model according to an embodiment of the present application;
FIG. 2 is a diagram of a model architecture of a method for generating an efficient streaming free-viewpoint video based on a 3D Gaussian model in an embodiment of the application;
FIG. 3 is a diagram of a model architecture of the neural network of FIG. 2;
Fig. 4 is a schematic diagram of an optimization process of a new 3D gaussian, according to an embodiment of the present application, of a method for generating a high-efficiency streaming free view video based on a 3D gaussian model;
fig. 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, an embodiment of the present application provides a method for generating an efficient streaming free view video based on a 3D gaussian model, including generating subsequent frames sequentially from an initial frame, wherein in the subsequent frames, generating the subsequent frames according to the previous frames specifically includes:
Step S100, obtaining a 3D Gaussian model of the previous frame of the three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point;
Step S200, constructing a neural network comprising a perceptron, recording position points in a position hash coding mode, receiving the position hash coding by the perceptron, and mapping the position hash coding into attribute change of 3D Gaussian, wherein the attribute change comprises a first part used for representing 3D Gaussian displacement and a second part used for representing 3D Gaussian rotation;
Step S300, updating the 3D Gaussian in the next frame by utilizing 3D Gaussian displacement and 3D Gaussian rotation, and rendering to obtain a reference image;
Step S400, optimizing the neural network by using the losses of the reference image and the sample image, and generating a post-frame image by using the optimized neural network.
In this embodiment, the previous and subsequent frames refer to different time nodes. Through the mapping of the positional hash encoding, the displacement change and rotation change of the 3D Gaussians are obtained and the 3D Gaussians are updated. At different time nodes, the corresponding rendered images are obtained according to the changes of the 3D Gaussians. It will be appreciated that the optimized neural network can be used to obtain a rendered image that better meets expectations, which serves as the subsequent frame image. Each subsequent frame image of this embodiment is generated by the neural network optimized for that frame, and the neural network is optimized in real time at different time frames, so the method of this embodiment has the characteristic of on-the-fly training.
The 3D Gaussian model of the initial frame in this embodiment is obtained by reconstruction. The efficient streaming free-viewpoint video generation method further comprises: rendering the previous frame image using the 3D Gaussian model of the previous frame, and combining the previous frame images to generate the free-viewpoint video.
The efficient streaming free-viewpoint video generation method based on the 3D Gaussian model can synthesize views at new viewing angles from a set of pictures of the scene with known camera poses, can be used to generate dynamic video at new viewing angles, and has good spatial and temporal consistency. A high-level sketch of the per-frame procedure is given below.
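The following is a minimal, hypothetical PyTorch sketch of the per-frame procedure (steps S100 to S400). The helper callables render_fn, loss_fn and transform_fn, the module ntc, and all parameter values are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch only: optimize the neural network (NTC) for one subsequent
# frame, then reuse it to transform the previous frame's 3D Gaussians.
import torch

def optimize_frame(ntc, positions, views, render_fn, loss_fn, transform_fn,
                   iters=150, lr=2e-3):
    """ntc maps (N, 3) positions to (d_mu, d_q); views is a list of (camera, sample_image)."""
    opt = torch.optim.Adam(ntc.parameters(), lr=lr)
    for _ in range(iters):
        cam, gt = views[torch.randint(len(views), (1,)).item()]
        d_mu, d_q = ntc(positions)                     # step S200: map hash-encoded positions
        gaussians = transform_fn(d_mu, d_q)            # step S300: update the 3D Gaussians
        loss = loss_fn(render_fn(gaussians, cam), gt)  # reference image vs. sample image
        opt.zero_grad()
        loss.backward()                                # step S400: optimize the neural network
        opt.step()
    return ntc
```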
Referring to fig. 2 to 3, an embodiment of the present application provides a method for generating efficient streaming free-viewpoint video based on a 3D Gaussian model. The method constructs a high-quality free-viewpoint video from multi-view 2D picture sequences/videos, processes pictures on the fly, and reconstructs quickly; it can render in real time and achieves fast, high-quality rendering of free-viewpoint video of a dynamic scene based on multiple views.
The efficient streaming free-viewpoint video generation method includes steps 1 to 3, which explain the above embodiment in detail. Step 1: reconstructing the 3D Gaussian model (3D Gaussians) of frame 0 of the three-dimensional scene. Step 2: caching the changes of the 3D Gaussian model. Step 3: an aggressive density control method.
Step 1, reconstructing the 3D Gaussian model (3D Gaussians) of frame 0 of the three-dimensional scene, which realizes scene reconstruction for the initial frame and also serves to explain the underlying working mechanism of the embodiments of the application.
Referring to fig. 2 and 3, the application is implemented on the basis of a 3D Gaussian model (3D Gaussians): it adopts an explicit three-dimensional representation based on 3D Gaussians and, following 3D Gaussian Splatting, uses anisotropic 3D Gaussians as the explicit scene representation, paired with a fast differentiable rasterizer for rendering the three-dimensional scene into a two-dimensional picture. This representation requires only a few minutes of training to achieve real-time novel view synthesis. The scene is first reconstructed with 3D Gaussians from the multi-view images of frame 0, which serves as the initialization for the subsequent frames.
One 3D Gaussian (3DG) is defined by a 3D covariance matrix Σ centered at a point μ:
G(x) = exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ)) (1)
Wherein,
x is a position in three-dimensional space;
μ is the center of the 3D Gaussian (i.e. the mean of the 3D Gaussian);
Σ is the 3D Gaussian covariance matrix, which can be decomposed into the scaling and rotation of the 3D Gaussian;
G is the Gaussian function;
T denotes the matrix transpose.
In order to preserve positive semi-definiteness during the optimization, the 3D covariance matrix Σ is decomposed into a rotation matrix R and a scaling matrix S:
Σ = R S S^T R^T (2)
The rotation is conveniently represented by a unit quaternion q, while the scaling uses a 3D vector. Each 3DG furthermore comprises a set of spherical harmonic (SH) coefficients for representing view-dependent color, and an opacity value α.
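As an illustration of equation (2), the following NumPy sketch assembles a 3D Gaussian's covariance from a unit quaternion q and a 3D scaling vector s; the (w, x, y, z) quaternion layout is an assumption, not a detail taken from the patent.

```python
# Minimal sketch: build Σ = R S Sᵀ Rᵀ from a quaternion and a scaling vector.
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_from_q_s(q, s):
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))   # scaling matrix
    return R @ S @ S.T @ R.T                  # symmetric, positive semi-definite by construction

# Example: an axis-aligned Gaussian stretched along x, with no rotation.
sigma = covariance_from_q_s(q=[1.0, 0.0, 0.0, 0.0], s=[2.0, 0.5, 0.5])
```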
To perform new view synthesis, Kerbl et al (prior art in the background) project the 3DG to a 2D Gaussian (2DG) splat:
Σ′ = J W Σ W^T J^T (3)
Wherein,
Σ′ is the covariance matrix in camera coordinates;
J is the Jacobian matrix of the affine approximation of the projective transformation;
W is the viewing transformation matrix;
Σ is the 3D Gaussian covariance matrix.
In the process of projecting the 3D Gaussian to a 2D Gaussian, a 2×2 matrix Σ_2d is obtained by skipping the third row and third column of Σ′. Furthermore, projecting the mean μ of the 3DG into image space yields the 2D mean μ_2d.
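A hedged NumPy sketch of equation (3) follows: the covariance is pushed through the viewing transformation W and the affine-approximation Jacobian J, and the top-left 2×2 block is kept as Σ_2d. The exact form of J shown here follows EWA-style splatting and is an assumption rather than a quotation from the patent.

```python
# Sketch: project a 3D Gaussian covariance to an image-space 2D covariance.
import numpy as np

def project_covariance(sigma, W, mu, fx, fy):
    """sigma: 3x3 covariance; W: 4x4 world-to-camera matrix; mu: 3D mean; fx, fy: focal lengths."""
    t = W[:3, :3] @ mu + W[:3, 3]             # Gaussian centre in camera coordinates
    tx, ty, tz = t
    J = np.array([                            # affine approximation of the perspective projection
        [fx / tz, 0.0,     -fx * tx / tz**2],
        [0.0,     fy / tz, -fy * ty / tz**2],
        [0.0,     0.0,      0.0],
    ])
    sigma_prime = J @ W[:3, :3] @ sigma @ W[:3, :3].T @ J.T   # Σ' = J W Σ Wᵀ Jᵀ
    return sigma_prime[:2, :2]                # Σ_2d: skip the third row and column
```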
Thus a 2D Gaussian G_2d(x; μ_2d, Σ_2d) can be defined in image space, wherein:
x is a position in two-dimensional space;
μ_2d is the center of the 2D Gaussian (i.e. the mean of the 2D Gaussian);
Σ_2d is the 2D Gaussian covariance matrix.
With Σ′, the color C of a pixel can be calculated by blending the N ordered points overlapping the pixel:
C = ∑_{i∈N} c_i α′_i ∏_{j=1}^{i−1} (1 − α′_j) (4)
Wherein:
c_i represents the view-dependent color of the i-th 3DG in viewing direction d_i;
α′_i is obtained by multiplying the opacity α_i of the i-th 3D Gaussian with the corresponding 2D Gaussian evaluated at the pixel;
α′_j is obtained by multiplying the opacity α_j of the j-th 3D Gaussian with the corresponding 2D Gaussian evaluated at the pixel.
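The front-to-back blending of equation (4) can be sketched as follows; the inputs are illustrative stand-ins, and the Gaussians overlapping the pixel are assumed to be already sorted by depth.

```python
# Sketch: accumulate pixel color with front-to-back alpha compositing.
import numpy as np

def blend_pixel(colors, alphas):
    """colors: (N, 3) view-dependent colors c_i; alphas: (N,) effective opacities α'_i."""
    C = np.zeros(3)
    transmittance = 1.0
    for c_i, a_i in zip(colors, alphas):
        C += c_i * a_i * transmittance        # contribution weighted by remaining transmittance
        transmittance *= (1.0 - a_i)          # Π_{j<i} (1 - α'_j)
    return C

pixel = blend_pixel(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                    np.array([0.6, 0.8]))
```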
In the rendering flow, training and rendering of the 3DGs are very fast thanks to a highly optimized rasterization pipeline and custom CUDA (Compute Unified Device Architecture) kernels. For example, for a real scene at the megapixel level, only a few minutes of optimization allow the 3DGs to achieve realistic visual quality and rendering speeds in excess of 100 fps.
The application applies 3D Gaussians to new view synthesis of dynamic scenes and further enables Efficient Free-View Video Streaming using on-the-fly training.
Step 2, caching the changes of the 3D Gaussians with a Neural Transformation Cache framework.
The present application provides a Neural Transformation Cache (NTC) framework to cache the changes of the 3D Gaussians; it can be regarded as at least part of a neural network. The NTC provides a compact, efficient and adaptive structure: compactness reduces the model footprint, efficiency improves training and inference speed, and adaptivity lets the model pay more attention to the spatial regions that move.
After step 1, the initialization of the frame-0 3D Gaussians of the scene has been achieved; their position, rotation and other attributes then need to be adjusted to generate the next frame's 3D Gaussians in the dynamic scene. A compact, efficient and adaptive structure is therefore needed to cache the changes of the 3D Gaussians. Furthermore, the structure should satisfy some priors about dynamic scenes, for example that adjacent parts of an object usually have consistent or similar motion, and that the motion of an object may involve objects moving over long distances. The present application uses multi-resolution hash coding with a shallow fully-fused MLP as the NTC (Neural Transformation Cache). In particular, the scene is partitioned with multi-resolution feature grids, and the voxel grid at each resolution is mapped into a hash table that stores d-dimensional learnable feature vectors. Specifically, for a 3D position x, its hash encoding at resolution l is the linear interpolation of the corresponding hash entries (feature vectors) of the eight grid corners surrounding it.
Multi-resolution hash coding meets all the requirements for the NTC structure: (1) Compact: multi-resolution hash coding successfully compresses the memory space required by the model through the hash tables. (2) Efficient: hash table lookup is O(1) and well suited to modern GPUs. (3) Adaptive: hash tables at fine resolutions experience hash collisions, which let the regions producing larger gradients (in this context, the dynamic parts of the scene) dominate the update of the hash entries. (4) Prior: the structure based on linear interpolation over voxel grids ensures the local consistency of the transformation, and the multiple resolutions effectively combine global and local information. A simplified sketch of the encoding is given below.
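The following simplified NumPy sketch illustrates multi-resolution hash coding of a 3D position: the eight voxel corners at each resolution are hashed into a table of learnable feature vectors and trilinearly interpolated, and the per-level results are concatenated. The spatial-hash primes and all sizes follow the I-NGP convention and are assumptions made for illustration.

```python
# Sketch: multi-resolution hash encoding h(μ) with trilinear interpolation.
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_corner(ijk, table_size):
    h = np.bitwise_xor.reduce(ijk.astype(np.uint64) * PRIMES)   # spatial hash of a grid corner
    return int(h) % table_size

def encode(pos, tables, resolutions):
    """pos: 3D point in [0,1]^3; tables: list of (T, d) feature arrays; resolutions: grid sizes."""
    feats = []
    for table, res in zip(tables, resolutions):
        x = pos * res
        base = np.floor(x).astype(np.int64)
        frac = x - base
        acc = np.zeros(table.shape[1])
        for corner in range(8):                                  # eight surrounding grid corners
            offset = np.array([(corner >> k) & 1 for k in range(3)])
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac)) # trilinear weight
            acc += w * table[hash_corner(base + offset, table.shape[0])]
        feats.append(acc)
    return np.concatenate(feats)                                 # per-level features concatenated

rng = np.random.default_rng(0)
tables = [rng.normal(size=(2**14, 2)) for _ in range(4)]         # 4 levels, 2-dim features each
print(encode(np.array([0.3, 0.5, 0.7]), tables, resolutions=[16, 32, 64, 128]))
```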
In addition, in order to enhance the performance of the Neural Transformation Cache (NTC) with as little overhead as possible, a highly optimized shallow perceptron (fully-fused MLP) is used, meaning a fully-connected neural network with sixteen or fewer layers. The shallow perceptron maps the positional hash encoding to a 7-dimensional output, where the first three dimensions represent the displacement of the 3DG and the remaining four dimensions represent the rotation of the 3DG as a unit quaternion q.
That is, corresponding to step S200, the attribute change of the 3D Gaussian is a seven-dimensional vector, where the first part is the first three dimensions and the second part is the remaining dimensions; the remaining dimensions represent the unit quaternion q, which represents the rotation of the 3D Gaussian.
Given the multi-resolution hash coding and the MLP, the NTC is formalized as:
dμ, dq = MLP(h(μ)) (5)
Wherein, μ represents the position of the 3DG and q represents a unit quaternion;
dμ represents the displacement of the 3DG, and dq represents the spatial rotation angle of the 3DG;
h(μ) represents the hash encoding of μ;
MLP denotes the perceptron.
This completes the process in which the perceptron receives the positional hash encoding and maps it to the attribute change of the 3D Gaussian.
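A hedged PyTorch sketch of the NTC forward pass dμ, dq = MLP(h(μ)) is given below; the layer widths, activation, and stand-in encoder are illustrative assumptions, and only the 3 + 4 split of the output follows the description above.

```python
# Sketch: shallow perceptron mapping hash-encoded positions to (dμ, dq).
import torch
import torch.nn as nn

class NTC(nn.Module):
    def __init__(self, encoder, enc_dim=32, hidden=64):
        super().__init__()
        self.encoder = encoder                      # multi-resolution hash encoding h(·)
        self.mlp = nn.Sequential(                   # shallow fully-connected perceptron
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7),                   # 3 dims for dμ, 4 dims for dq
        )

    def forward(self, mu):                          # mu: (N, 3) Gaussian positions
        out = self.mlp(self.encoder(mu))
        d_mu = out[:, :3]                           # displacement of each 3D Gaussian
        d_q = nn.functional.normalize(out[:, 3:])   # unit quaternion for the rotation change
        return d_mu, d_q

ntc = NTC(encoder=nn.Linear(3, 32))                 # stand-in encoder, for illustration only
d_mu, d_q = ntc(torch.rand(1024, 3))
```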
The 3DGs (the set of 3D Gaussians) are updated based on dμ and dq. Specifically, the following parameters of the 3DGs are updated:
Mean: μ′ = μ + dμ, where + denotes vector addition;
Rotation: q′ = norm(q) × norm(dq), where × denotes quaternion multiplication and norm denotes normalization.
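A small NumPy sketch of this per-Gaussian update follows; the (w, x, y, z) quaternion layout is an assumption.

```python
# Sketch: μ' = μ + dμ and q' = norm(q) ⊗ norm(dq) with the Hamilton product.
import numpy as np

def quat_mul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def update_gaussian(mu, q, d_mu, d_q):
    mu_new = mu + d_mu                                   # vector addition
    q_new = quat_mul(q / np.linalg.norm(q),              # normalise before multiplying
                     d_q / np.linalg.norm(d_q))
    return mu_new, q_new
```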
The position point attributes mentioned in step S100 specifically include color, transparency, spatial rotation and Gaussian scaling. In step S300, updating the 3D Gaussian further includes rotating the spherical harmonic coefficients to update the color of the 3D Gaussian.
Spherical harmonic coefficients: the spherical harmonic (SH) coefficients are updated according to the rotation. The SHs (the set of SH coefficients), which represent view-dependent color, should also be adjusted to stay consistent with the rotation of the 3DG when the 3DG is rotated; an SH rotation is applied directly to update the SHs, exploiting the rotational property of SH.
In the first stage of frame-by-frame training (Stage 1), the 3DGs of the previous frame are processed through the Neural Transformation Cache (NTC) to obtain the transformed 3DGs, which are used for rendering to obtain a reference image. The parameters of the NTC are optimized by the loss between the reference image and the real picture (Ground Truth, i.e. the sample image in step S400).
L = (1 − λ)·L_1 + λ·L_D-SSIM (6)
Wherein:
L represents the loss function;
λ is set to 0.2;
L_1 is the loss term computing the least absolute deviation between the rendered image and the real image;
L_D-SSIM is the loss term based on the structural similarity between the rendered image and the real image.
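A hedged sketch of equation (6) with λ = 0.2 follows. The SSIM used here is a simplified single-window version for illustration only; a practical implementation would use a windowed SSIM, and mapping it to L_D-SSIM as (1 − SSIM)/2 is an assumption.

```python
# Sketch: photometric loss L = (1 - λ)·L1 + λ·L_D-SSIM over a rendered/real image pair.
import numpy as np

def simple_ssim(a, b, c1=0.01**2, c2=0.03**2):
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2*mu_a*mu_b + c1) * (2*cov + c2)) / ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def photometric_loss(rendered, ground_truth, lam=0.2):
    l1 = np.abs(rendered - ground_truth).mean()          # L1: mean absolute deviation
    d_ssim = (1.0 - simple_ssim(rendered, ground_truth)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```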
Furthermore, for training stability and to introduce a smoothness prior, the present application uses a warmed-up NTC as the initial value of the NTC in all experiments. The loss used for warm-up is:
L_warm-up = ||dμ||_1 − cos^2(q, Q) (7)
Wherein:
L_warm-up represents the warm-up loss;
μ is the mean of a 3D Gaussian, representing the position of the 3D Gaussian;
||dμ||_1 is the L1 norm of the 3D Gaussian displacement change;
q is the quaternion of a 3D Gaussian, used to represent the rotation of the 3D Gaussian.
Here Q is the identity quaternion. The former term ||dμ||_1 uses the L1 norm to drive the estimated displacement towards 0, and the latter term drives the estimated rotation towards no rotation via cosine similarity; because −q and q represent the same rotation, the square of the cosine similarity is used. For each scene, the warm-up is performed only once, after the training of frame 0 is completed: the displacements of noise-perturbed frame-0 3DGs are used as input, and after 3000 iterations (about 20 s) of training the result is stored for initialization of the NTC at subsequent time steps.
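The warm-up loss of equation (7) can be sketched as follows; applying the cosine term to the predicted rotation and averaging over Gaussians are assumptions made for illustration.

```python
# Sketch: warm-up loss pushing dμ towards zero and the predicted rotation towards identity.
import numpy as np

IDENTITY_Q = np.array([1.0, 0.0, 0.0, 0.0])

def warm_up_loss(d_mu, d_q):
    """d_mu: (N, 3) predicted displacements; d_q: (N, 4) predicted quaternions."""
    l_disp = np.abs(d_mu).sum(axis=-1).mean()             # ||dμ||_1, averaged over Gaussians
    q = d_q / np.linalg.norm(d_q, axis=-1, keepdims=True)
    cos2 = (q @ IDENTITY_Q) ** 2                          # cos²(·, Q), invariant to the sign of q
    return l_disp - cos2.mean()
```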
Step 3, an aggressive density control method (Aggressive Density Control) for generating new 3D Gaussians in three-dimensional space to model newly emerging objects.
Refer to the detailed steps corresponding to step S400 and to fig. 4, which visualizes how the aggressive density control generates additional 3D Gaussians at potential locations and optimizes them, with periodic pruning and addition. In one embodiment, step S400 further includes: optimizing the neural network comprises a plurality of training rounds, in which gradient tracking is started in the last training round; after the neural network is optimized, the 3D Gaussians whose average view-space position gradient magnitude is above a threshold are selected, the positions of the newly added Gaussians are sampled through a probability distribution function to generate a newly added set of 3D Gaussians, and the update yields the final 3D Gaussians; and the post-frame image is generated using the final 3D Gaussians.
Considering only the transformation of the 3D Gaussians is enough to cover most real-life scenes, and occlusion and disappearance at the new time step can be handled effectively by the transformation. However, transformation alone cannot deal with objects that did not appear in frame 0, such as transient objects and new persistent objects. Meanwhile, in view of model storage requirements and training complexity, the newly added 3D Gaussians must not be too numerous, and they are not used in the initialization of the next frame. This means a small number of 3D Gaussians need to be generated quickly to model the new objects, to improve scene completeness at the current time step and reconstruct the scene with high quality.
To handle objects that do not appear in frame 0, such as some transient objects and new persistent objects, new 3DGs need to be generated to model these new objects. First, the positions of the new 3DGs must be determined. The view-space position gradient of the 3DGs is a very important indicator: for a new object in a new frame, the 3DGs close to it will have larger gradients, because the optimization tries to camouflage the new object by moving Gaussians towards it. However, since the Gaussian colors are not directly optimized in the first stage (the 3D Gaussian color is only updated by rotating the spherical harmonic coefficients), these 3DGs can hardly camouflage the new object and are still transformed to a suitable position, which comes with a large position gradient.
From the above, it is appropriate to add 3DGs around these 3DGs. The present application employs an aggressive Gaussian addition scheme in order to capture, as far as possible, the locations where new objects appear. Specifically, gradient tracking is started in the last training round of the first stage; after the first-stage training is finished, the 3DGs whose average view-space position gradient magnitude is above a threshold τ_grad are selected, a 3D normal distribution centered on each selected Gaussian is used as the probability distribution function (PDF) from which the positions (means) of the new Gaussians are sampled, and a set of random 3DGs is generated. In principle, no assumption should be made about the other attributes of the newly added 3DGs.
However, if the SH coefficients and scaling vectors of the newly added 3DGs are inappropriate, the optimization is more prone to reducing their opacity than to modifying their SH coefficients and scaling vectors, which would make the newly added 3DGs immediately become transparent and thus fail to model the newly emerging objects. Therefore, their SH coefficients and scaling vectors are inherited from the original 3DGs, their rotation quaternion is set to the identity q = [1, 0, 0, 0], and their opacity is set to 0.5. The present application uses such an aggressive Gaussian optimization scheme because most newly emerging objects are small, simple items. Specifically, the learning rate of these new Gaussians is fifteen times the default setting used for the initial Gaussians optimized at time step 0. Thanks to a good estimate of the new Gaussian positions and the subsequent pruning strategy, the new 3DGs can converge quickly and stably to a suitable state even at such a high learning rate.
To control the number of newly added Gaussians and avoid local minima, a higher threshold τ_α is set for the opacity value in the second stage. In each pass, 3DGs are regenerated around the Gaussians whose spatial position gradients are above τ_grad to compensate for imperfectly reconstructed regions; their other parameters are inherited from the original 3DGs except that the scaling is 0.8 of the original size. All new Gaussians with opacity below τ_α are then removed, to prevent unnecessary additions from degrading visual quality.
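A hedged NumPy sketch of this aggressive density control follows; the array layouts, the threshold values and the sampling standard deviation are illustrative assumptions.

```python
# Sketch: seed new Gaussians around high-gradient Gaussians, then prune low-opacity ones.
import numpy as np

rng = np.random.default_rng(0)

def densify(positions, grads, sh, scales, tau_grad=0.0002, std=0.01):
    """positions, grads: (N, 3); sh: (N, K) SH coefficients; scales: (N, 3) scaling vectors."""
    mask = np.linalg.norm(grads, axis=-1) > tau_grad            # large view-space gradients
    seeds = positions[mask]
    new_pos = seeds + rng.normal(scale=std, size=seeds.shape)   # sample positions around seeds
    new_sh = sh[mask].copy()                                    # inherit SH coefficients
    new_scales = 0.8 * scales[mask]                             # inherit scaling, shrunk to 0.8x
    new_q = np.tile([1.0, 0.0, 0.0, 0.0], (len(seeds), 1))      # identity rotation quaternion
    new_opacity = np.full(len(seeds), 0.5)                      # initial opacity 0.5
    return new_pos, new_q, new_sh, new_scales, new_opacity

def prune(opacities, tau_alpha=0.1):
    return opacities >= tau_alpha                               # keep-mask for the new Gaussians
```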
With the aggressive, adaptive 3DG supplementation of the present application, the newly added Gaussians can faithfully reconstruct transient objects at new time steps, as well as simple new persistent objects, after a short training time. Furthermore, no newly added Gaussian is fed into the training of the next time step, which effectively prevents the number of 3DGs trained in the first stage from growing over time.
Current novel-view generation methods produce good picture quality, but they generally require the complete video offline and cannot achieve real-time rendering. To solve these problems, the application adopts the latest insights from novel view synthesis of static scenes and can process pictures in real time, so that free-viewpoint video generation obtains a large improvement in training and rendering speed while keeping the video quality high. The method and device realize efficient streaming free-viewpoint video generation by using a 3D Gaussian model trained on the fly.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor do the order in which the sub-steps or stages are performed necessarily performed in sequence, but may be performed alternately or alternately with at least a portion of other steps or sub-steps of other steps.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store data for the neural network and the 3D Gaussian model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements the efficient streaming free-viewpoint video generation method based on a 3D Gaussian model.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for efficient streaming free-viewpoint video generation based on a 3D gaussian model implementation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
Step S100, obtaining a 3D Gaussian model of the previous frame of the three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point;
Step S200, constructing a neural network comprising a perceptron, recording position points in a position hash coding mode, receiving the position hash coding by the perceptron, and mapping the position hash coding into attribute change of 3D Gaussian, wherein the attribute change comprises a first part used for representing 3D Gaussian displacement and a second part used for representing 3D Gaussian rotation;
Step S300, updating the 3D Gaussian in the next frame by utilizing 3D Gaussian displacement and 3D Gaussian rotation, and rendering to obtain a reference image;
Step S400, optimizing the neural network by using the losses of the reference image and the sample image, and generating a post-frame image by using the optimized neural network.
In one embodiment, a computer program product is provided comprising computer instructions which, when executed by a processor, perform the steps of:
Step S100, obtaining a 3D Gaussian model of the previous frame of the three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point;
Step S200, constructing a neural network comprising a perceptron, recording position points in a position hash coding mode, receiving the position hash coding by the perceptron, and mapping the position hash coding into attribute change of 3D Gaussian, wherein the attribute change comprises a first part used for representing 3D Gaussian displacement and a second part used for representing 3D Gaussian rotation;
Step S300, updating the 3D Gaussian in the next frame by utilizing 3D Gaussian displacement and 3D Gaussian rotation, and rendering to obtain a reference image;
Step S400, optimizing the neural network by using the losses of the reference image and the sample image, and generating a post-frame image by using the optimized neural network.
In this embodiment, the computer program product comprises program code portions for performing the steps of the efficient streaming free-viewpoint video generation method implemented based on a 3D gaussian model in the embodiments of the present application when the computer program product is executed by one or more computing devices. The computer program product may be stored on a computer readable recording medium. The computer program product may also be provided for downloading via a data network, e.g. through the RAN, via the internet and/or through the RBS. Alternatively or additionally, the method may be encoded in a Field Programmable Gate Array (FPGA) and/or an Application Specific Integrated Circuit (ASIC), or the functionality may be provided by means of a hardware description language for downloading.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description. When technical features of different embodiments are embodied in the same drawing, the drawing can be regarded as a combination of the embodiments concerned also being disclosed at the same time.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (10)
1. A method for generating efficient streaming free-viewpoint video based on a 3D Gaussian model, characterized by comprising sequentially generating subsequent frames from an initial frame, wherein generating a subsequent frame from its previous frame specifically comprises the following steps:
obtaining a 3D Gaussian model of the previous frame of the three-dimensional scene, wherein the 3D Gaussian model is a set of 3D Gaussians and each 3D Gaussian comprises a position point in space and the attributes of that position point;
Constructing a neural network comprising a perceptron, wherein the position points are recorded in a position hash coding mode, the perceptron receives the position hash coding and maps the position hash coding into attribute changes of the 3D Gaussian, and the attribute changes comprise a first part used for representing 3D Gaussian displacement and a second part used for representing 3D Gaussian rotation;
Updating the 3D Gaussian in the next frame by using the 3D Gaussian displacement and the 3D Gaussian rotation, and rendering to obtain a reference image;
And optimizing the neural network by using the losses of the reference image and the sample image, and generating a post-frame image by using the optimized neural network.
2. The method of claim 1, wherein the perceptron is a fully-connected neural network with sixteen or fewer layers.
3. The efficient streaming free-viewpoint video generation method of claim 1, wherein the 3D gaussian model of the initial frame is obtained by reconstruction;
The efficient streaming free-viewpoint video generation method further comprises: rendering the previous frame image using the 3D Gaussian model of the previous frame, and combining the previous frame images to generate the free-viewpoint video.
4. The efficient streaming free-viewpoint video generation method of claim 1, wherein the attribute change of the 3D Gaussian is a seven-dimensional vector, wherein the first part is the first three dimensions and the second part is the remaining dimensions.
5. The efficient streaming free-viewpoint video generation method of claim 4, wherein the remaining dimensions are used to represent a unit quaternion, which is used to represent the rotation of the 3D Gaussian.
6. The efficient streaming free-viewpoint video generating method of claim 5, wherein the location point attribute comprises color;
When the 3D Gaussian is updated, the 3D Gaussian position and the unit quaternion representing rotation are updated simultaneously, and the spherical harmonic coefficients are rotated according to the change of the unit quaternion to update the color of the 3D Gaussian.
7. The method of claim 1, wherein the perceptron receives the positional hash encoding and maps it to the attribute change of the 3D Gaussian, specifically by:
dμ, dq = MLP(h(μ))
where:
μ represents the position of the 3D Gaussian;
dμ represents the displacement of the 3D Gaussian;
q represents a unit quaternion;
dq represents the spatial rotation angle of the 3D Gaussian;
h(μ) represents the hash encoding of μ;
MLP represents the perceptron.
8. The efficient streaming free-viewpoint video generation method of claim 1, wherein optimizing the neural network comprises a plurality of training rounds in which gradient tracking is initiated in a last training round;
and after the neural network is optimized, selecting the 3D Gaussians whose average view-space position gradient magnitude is higher than a threshold value, sampling the newly added Gaussian positions through a probability distribution function, generating a newly added set of 3D Gaussians, updating to obtain the final 3D Gaussians, and generating the post-frame image by using the final 3D Gaussians.
9. Computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the efficient streaming free-viewpoint video generating method according to any of claims 1 to 7.
10. Computer program product comprising computer instructions which, when executed by a processor, implement the steps of the efficient streaming free-viewpoint video generation method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410261768.8A CN118158489A (en) | 2024-03-07 | 2024-03-07 | Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410261768.8A CN118158489A (en) | 2024-03-07 | 2024-03-07 | Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118158489A true CN118158489A (en) | 2024-06-07 |
Family
ID=91286236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410261768.8A Pending CN118158489A (en) | 2024-03-07 | 2024-03-07 | Efficient streaming free view video generation method based on 3D Gaussian model, computer device and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118158489A (en) |
- 2024-03-07: Application CN202410261768.8A filed in China; publication CN118158489A, status pending.
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||