CN117315153A - Human body reconstruction and rendering method and device for cooperative light field and occupied field - Google Patents


Info

Publication number
CN117315153A
CN117315153A (application No. CN202311273506.5A)
Authority
CN
China
Prior art keywords: depth, color, ray, point, image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311273506.5A
Other languages
Chinese (zh)
Inventor
许威威
董政
高耀安
鲍虎军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311273506.5A priority Critical patent/CN117315153A/en
Publication of CN117315153A publication Critical patent/CN117315153A/en
Pending legal-status Critical Current

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06T15/005 General purpose rendering architectures
    • G06T15/04 Texture mapping
    • G06T2207/10012 Stereo images
    • G06T2207/10024 Color image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Generation (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a human body reconstruction and rendering method and device in which a light field and an occupancy field work cooperatively; the method can be applied directly to newly captured subjects. It integrates the PIFu and NeRF representations, letting the two respectively predict the occupancy field and the light field of the human body and cooperate with each other. PIFu relies on denoised depth images, which allows the model to introduce human-body priors and assists the novel-view rendering task. The invention introduces a network model, SRONet, that handles human geometry and rendering simultaneously and uses the occupancy field to assist light-field rendering. During training, both geometric and color supervision signals are applied to the network model to enhance its ability to capture high-quality texture details. The invention further introduces a ray up-sampling method based on neural feature fusion to efficiently up-sample low-resolution images to the target resolution, a depth image denoising model, and a two-layer tree data structure built from the denoised depth images for efficient sampling of ray rendering points.

Description

Human body reconstruction and rendering method and device for cooperative light field and occupied field
Technical Field
The invention relates to the fields of computer three-dimensional vision reconstruction and computer graphics rendering, and in particular to a human body reconstruction and rendering method and device based on a cooperative light field and occupancy field.
Background
Human three-dimensional reconstruction and free-viewpoint video creation for humans are important components of many applications, for example virtual reality and augmented reality, distance education, and virtual meetings. To provide an immersive experience, these applications need to capture high-quality human models through consumer-level capture devices in as close to real time as possible and to render them from free viewpoints. Recently, neural implicit representations have been widely used in human performance capture systems. The pixel-aligned implicit function (PIFu) can efficiently reconstruct the mesh and texture of a dynamic three-dimensional human model: the surface model is extracted from the trained implicit occupancy field, and the texture is obtained by predicting RGB values at surface points of the model with a trained network. The neural radiance field (NeRF) is a coordinate-based implicit network model that encodes volume density and a color field. This representation is popular because NeRF can render photorealistic images when points are sampled densely. However, for three-dimensional human reconstruction both representations have problems. First, results based on surface texture coloring (PIFu) are often blurred, cannot produce view-dependent rendering effects, and cannot handle translucent materials such as hair. Second, the rendering speed of the neural radiance field (NeRF) is often inadequate for real-time scenarios, and its generalization ability is poor. Even the latest generalizable NeRF variants cannot effectively reconstruct target objects outside the training set from sparse views; for new target objects, higher rendering quality is usually achieved only after online optimization. It therefore remains a challenge to provide a method that can perform high-quality free-viewpoint human rendering from sparse view inputs of consumer-level devices.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a human body reconstruction and rendering method in which a light field and an occupancy field work cooperatively; the method generalizes directly to rendering of unseen target humans and has low rendering latency.
When a PIFu based on depth data and the NeRF representation work cooperatively, NeRF can generate high-quality texture results using global light-field information, and the geometric ambiguity or noise in those texture results can be further reduced by the PIFu occupancy field (owing to the geometric constraint it introduces), which helps to improve the quality of surface reconstruction. In addition, when PIFu and NeRF cooperate, rendering quality places high accuracy requirements on depth, because the surface points have the highest contribution weight to the ray color; when the PIFu input relies only on features of the input RGB images, the geometric quality of reconstruction from sparse views cannot be guaranteed. Once accurate depth information is introduced, the geometric surface field constructed by PIFu can be constrained so that the image features of the light-field model are better aligned with the surface model, which facilitates NeRF learning.
Based on the above observations, the method of the present invention first introduces a novel depth image denoising model that follows the UNet architecture and outputs optimized depth images to reduce depth noise and fill holes in the original depth images. Then a novel network model, SRONet, is introduced to model the human body by combining an occupancy field with a light field. Specifically, SRONet predicts the human occupancy field from pixel-aligned denoised depth-map features to reconstruct the human body, and predicts the light field from geometric features and pixel-aligned RGB features to render view-dependent human textures. During training, geometric and color supervision signals work cooperatively to enhance the ability of SRONet to capture high-quality details in reconstruction and rendering. In addition, the invention constructs a novel two-layer tree structure from the denoised depth maps for compact geometric storage, fast ray-voxel intersection and three-dimensional point sampling. The invention also constructs a ray up-sampling method based on neural feature fusion, which can render novel views at 1K resolution with a small computational cost.
In order to achieve the above, the technical scheme of the present invention is as follows: a human body reconstruction and rendering method with a cooperative light field and occupancy field, the method comprising the steps of:
s1: construction of depth image denoising model F based on convolutional neural network structure d The input of the model comprises a human RGB image I and an original depth image D, and the input is output as a denoised depth image D rf The method comprises the steps of carrying out a first treatment on the surface of the Training a model through depth, normal and three-dimensional consistency loss functions, and using the model after training to perform a depth denoising task;
s2: based on the denoised multi-view depth image and camera parameters of each view, fusing a global human body point cloud, constructing a data structure of a two-layer tree from the point cloud, and storing father and son nodes of the tree as a global list; introducing voxel post-processing operation in the reasoning process to denoise the tree structure, and only preserving voxels near the surface of the human body;
s3: constructing a collaborative network model SRONet for reconstructing human body grids and rendering new view images, wherein the network model comprises two sub-network models, namely an occupied field network OCCNet and a light field network ColorNet; for a given input: multi-view RGBD imageAnd sampling points x and observing view angles d in the three-dimensional space, wherein N is the number of the view angles, and the sub-network model respectively realizes the following tasks:
Multi-view depth image for a given inputAnd sampling the point x, OCCNet predicts a voxel occupancy o for the point x based on a hidden function of pixel alignment (PIFU) x ∈[0,1]Representing the probability that the point is inside the human mesh model;
given input ofMulti-eye RGB image { I } i } i=1,...N ColorNet predicts the color vector c of RGB three channels in the viewing angle direction d for point x based on PIFU x
Occupancy o based on point x prediction x Calculating a human body grid model from the estimated occupied field by using an equivalent cube search algorithm marking cubes so as to reconstruct a three-dimensional human body; training SRONet by utilizing a multi-eye human body data set and combining a geometric-color cooperative loss function and a depth error loss function, wherein the model training is used for predicting an occupied field and a light field after finishing;
s4: calculating rays for each pixel point in the new view angle image according to corresponding camera parameters, and performing a ray Voxel intersection process according to a Voxel Traversal (Voxel Traversal) algorithm to determine voxels in the tree data structure that intersect the rays; recording depth values of far and near intersection points of all the intersecting voxels, designing sampling weight edge ray sampling points in the voxels according to the size of the voxels, and calculating the occupation value and color of the sampling points according to S3; calculating color fusion weight according to the occupation value by utilizing a volume rendering formula so as to fuse the color of each sampling point on the light ray and calculate the final color of the light ray;
S5: upsampling each ray to improve the resolution and quality of the rendered image; constructing a feature fusion network, wherein the input of the network is information shared by each sub-ray, and the information comprises original color, depth, ray features and input RGB images of two adjacent visual angles; outputting the final color of each sub-ray; the feature fusion network is trained through ray-by-ray color errors, structural similarity and feature loss functions, and the training is used for the ray up-sampling operation after the training is completed so as to obtain a target resolution rendered image.
Further, the depth denoising process is specifically designed as follows:
(1) The region of the person in the original input image is extracted with the BackgroundMatting-v2 algorithm as the human-region mask image ψ, and the RGBD images I and D of the person region are obtained at the same time; the input RGBD images are normalized to the interval [-1,1] and the depth maximum is recorded; the normalized RGB and depth images are concatenated as the input of the depth image denoising model F_d;
(2) F_d follows a UNet structure: two independent feature extraction networks (HRNetV2-W18-Small-v2) encode the RGB and depth images respectively, an atrous spatial pyramid pooling module (ASPP) and a residual attention module (ResCBAM) fuse the RGB and depth features, and the fused features are up-sampled and fed back into the feature extraction networks;
(3) The model F_d outputs an image in the interval [-1,1]; it is de-normalized to the original value range using the depth maximum recorded in (1) and post-processed with ψ so that only the depth values of the person region are kept, giving D_rf;
(4) The loss functions used to train model F_d are as follows:
Depth consistency loss L_D: penalizes the pixel-wise deviation between the real depth image D_i^* and the predicted depth image D_i^rf at view i, defined as
L_D = Σ_i Σ_p ‖D_i^rf(p) − D_i^*(p)‖_1
Normal consistency loss L_N: penalizes the deviation between the normal directions computed from D_i^rf and D_i^*, defined as
L_N = Σ_i Σ_p ‖1 − ⟨N_i^rf(p), N_i^*(p)⟩‖_2
where D_i^rf(p) and N_i^rf(p) denote the depth value and normal vector at pixel p of the predicted depth image D_i^rf at view i and of the normal map N_i^rf computed from it, D_i^*(p) and N_i^*(p) denote the corresponding true depth value and normal vector, ‖·‖_1 denotes the L1 loss function, ‖·‖_2 denotes the L2 loss function, and ⟨·,·⟩ denotes the vector inner product operation;
Three-dimensional consistency loss L_P: further constrains the consistency between the point cloud fused from {D_i^rf} and the real point cloud, so as to reduce depth-fusion noise when constructing the two-layer tree structure; it is defined as
L_P = CD(P^rf, P_gt), with P^rf = F({D_i^rf}, {K_i, RT_i})
where P^rf denotes the point cloud fused from {D_i^rf}, F is the truncated signed distance field fusion algorithm TSDF-Fusion, K_i and RT_i are the intrinsic and extrinsic parameters of camera i, P_gt is a point cloud sampled from the true three-dimensional human mesh model, and CD denotes the Chamfer distance loss function;
The total loss for depth denoising is expressed as L = L_D + λ_N·L_N + λ_P·L_P, where λ_N and λ_P are loss function weights; the loss is optimized with the ADAM algorithm.
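A minimal PyTorch-style sketch of this combined denoising loss is given below; the tensor shapes, the precomputed normals and the brute-force Chamfer term are assumptions made for illustration, not the patented implementation.

```python
import torch

def depth_denoising_loss(d_pred, d_gt, n_pred, n_gt, pc_fused, pc_gt,
                         lambda_n=0.5, lambda_p=0.01):
    """Sketch of L = L_D + lambda_N * L_N + lambda_P * L_P.

    d_pred, d_gt:     (B, 1, H, W) predicted / true depth images
    n_pred, n_gt:     (B, 3, H, W) normals computed from the depth maps
    pc_fused, pc_gt:  (B, M, 3) fused / ground-truth point clouds
    """
    # depth consistency: pixel-wise L1
    l_d = (d_pred - d_gt).abs().mean()

    # normal consistency: penalize deviation of the inner product from 1
    cos = (n_pred * n_gt).sum(dim=1)                     # (B, H, W)
    l_n = ((1.0 - cos) ** 2).mean()

    # three-dimensional consistency: symmetric Chamfer distance
    dist = torch.cdist(pc_fused, pc_gt)                  # (B, M, M)
    l_p = dist.min(dim=2).values.mean() + dist.min(dim=1).values.mean()

    return l_d + lambda_n * l_n + lambda_p * l_p
```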
Further, the data structure of the two-layer tree is specifically designed as follows:
(1) The denoised depth images of all views are fused with the truncated signed distance field fusion algorithm TSDF-Fusion, giving the global human point cloud P_rf and the fusion cube V_tsdf; V_tsdf is binarized into V_occ, whose resolution may be set to 128³ but is not limited thereto; the occupancy value V_occ(x) at spatial position x is
V_occ(x) = 1 if |V_tsdf(x)| < α·s_v, and 0 otherwise,
where s_v is the voxel size, α is the truncated signed distance threshold, and V_tsdf(x) is the truncated signed distance of V_tsdf at spatial position x;
(2) Voxel post-processing: using the OCCNet of S3, a cube of the set resolution is quickly constructed based on the real-time human reconstruction algorithm RealtimePIFu; it is binarized and fused with the V_occ constructed in (1) to eliminate the floating noise voxels in V_occ, giving the denoised cube; the fused occupancy value at spatial position x combines V_occ(x) with the binarized reconstruction cube through a logical OR, where B_β(·) is a binarization function based on the threshold β, γ is the threshold used to filter floating voxels, and | denotes the OR operation;
(3) The voxels of the denoised cube whose occupancy value is 1 are marked as valid voxels; parent voxels are fused at a preset number ratio (for example 64:1, i.e. every 4×4×4 block of valid voxels corresponds to one valid parent voxel), and all valid voxels are stored as a global list L_v, each node corresponding to a voxel and recording the index (list position), spatial position and size of its parent or child nodes; an index cube V_idx is constructed to store the index value of each valid parent node in the global list, with the index value of invalid parent nodes set to -1; constructing V_idx facilitates the efficient computation of ray-voxel intersection in S4.
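As an illustration of this storage scheme, a minimal NumPy sketch is given below; the Node fields, the grid origin/size handling and the resolution are assumptions for illustration, not the patented implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    index: int                # position in the global list L_v
    center: np.ndarray        # voxel center in world coordinates
    size: float               # voxel edge length
    parent: int = -1          # parent index in the list (-1 for parent voxels)
    children: list = field(default_factory=list)

def build_two_layer_tree(occ, origin, voxel_size, ratio=4):
    """occ: (R, R, R) binary occupancy cube (e.g. R = 128); child voxels are
    grouped ratio^3 : 1 under each parent (64:1 for ratio = 4)."""
    R = occ.shape[0]
    nodes = []
    v_idx = np.full((R // ratio,) * 3, -1, dtype=np.int64)   # index cube V_idx
    parent_occ = occ.reshape(R // ratio, ratio, R // ratio, ratio,
                             R // ratio, ratio).any(axis=(1, 3, 5))
    for px, py, pz in zip(*np.nonzero(parent_occ)):
        p_center = origin + (np.array([px, py, pz]) + 0.5) * voxel_size * ratio
        parent = Node(len(nodes), p_center, voxel_size * ratio)
        v_idx[px, py, pz] = parent.index
        nodes.append(parent)
        block = occ[px*ratio:(px+1)*ratio,
                    py*ratio:(py+1)*ratio,
                    pz*ratio:(pz+1)*ratio]
        for cx, cy, cz in zip(*np.nonzero(block)):
            c_center = origin + (np.array([px*ratio+cx, py*ratio+cy,
                                           pz*ratio+cz]) + 0.5) * voxel_size
            child = Node(len(nodes), c_center, voxel_size, parent=parent.index)
            parent.children.append(child.index)
            nodes.append(child)
    return nodes, v_idx
```

Keeping parent and child voxels in one flat list makes every node addressable by a single integer, which is what the ray-voxel intersection in S4 relies on.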
Further, the specific design of the SRONet is as follows:
(1) OCCNet: the occupancy field is based on depth information, and the feature encoder HRNetV2-W18 is used to encode the depth images; for a sampling point x, OCCNet predicts the occupancy value o_x by aggregating the pixel-aligned depth features of this point from each view; the occupancy field is defined as the function
o_x = f_2( Avg( { f_1( W_i(x), c_i(x) ) }_{i=1,…,N} ) )
where W_i is the depth feature map of view i after encoding, W_i(x) is the depth feature obtained by projecting point x into W_i at view i, and c_i(x) comprises the depth value of the projection of x in the camera coordinate system of view i and the truncated signed distance; the hidden function f_1, represented by a fully connected network, produces the geometric feature of each view, the fused global feature is obtained through the average pooling operation Avg, and the global feature is fed into a second hidden function f_2 to compute the occupancy value o_x;
(2) ColorNet: the light field is based on color features and geometric features; the same feature encoder as in OCCNet is used to encode the RGB images and obtain color features; for the sampling point x, ColorNet takes the additional inputs, namely the viewing direction d and the geometric feature g_x, and aggregates the color features of each view to predict the view-dependent color vector c_x, where the geometric feature is expressed as g_x = f_3(Avg({f_1(W_i(x), c_i(x))})), f_3 being a hidden function used for encoding; the light field is defined as the function
c_x = f_5( H( { f_4( M_i(x), rgb_i(x), g_x, d_i ) }_{i=1,…,N} ) )
where M_i is the color feature map of view i after encoding, M_i(x) and rgb_i(x) are the color feature and pixel color obtained by projecting point x into M_i at view i, f_4 and f_5 are hidden functions that further process the features, H is a feature fusion function implemented with a Transformer whose basic feature fusion unit uses Hydra Attention, and d_i = R_i·d is the viewing direction in the camera coordinate system, R_i being the rotation matrix of the camera extrinsics at view i;
In this way SRONet predicts the occupancy value and the view-dependent color for the sampling point x.
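A minimal PyTorch-style sketch of this per-point aggregation is shown below; the layer widths, the use of mean pooling for Avg, and the substitution of standard multi-head attention for Hydra Attention are simplifying assumptions, not the actual SRONet architecture.

```python
import torch
import torch.nn as nn

class SRONetHeads(nn.Module):
    """Sketch of the OCCNet / ColorNet aggregation for one sampling point.

    Inputs per view i (already pixel-aligned by projecting x):
      depth_feat: (B, N, Fd)  W_i(x)
      depth_cond: (B, N, 2)   c_i(x): projected depth + truncated signed distance
      color_feat: (B, N, Fc)  M_i(x)
      rgb:        (B, N, 3)   rgb_i(x)
      view_dir:   (B, N, 3)   d_i = R_i d
    """
    def __init__(self, fd=64, fc=64, hidden=128):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(fd + 2, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden))
        self.f2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())
        self.f3 = nn.Linear(hidden, hidden)
        self.f4 = nn.Linear(fc + 3 + hidden + 3, hidden)
        self.fuse = nn.MultiheadAttention(hidden, 4, batch_first=True)  # stand-in for Hydra Attention
        self.f5 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, depth_feat, depth_cond, color_feat, rgb, view_dir):
        per_view = self.f1(torch.cat([depth_feat, depth_cond], dim=-1))  # (B, N, H)
        global_feat = per_view.mean(dim=1)                               # Avg over views
        occ = self.f2(global_feat)                                       # o_x in [0, 1]
        g = self.f3(global_feat).unsqueeze(1).expand_as(per_view)        # geometric feature
        tokens = self.f4(torch.cat([color_feat, rgb, g, view_dir], dim=-1))
        fused, _ = self.fuse(tokens, tokens, tokens)                     # per-view fusion
        color = self.f5(fused.mean(dim=1))                               # c_x
        return occ, color
```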
Further, the ray-voxel intersection and point sampling in S4 are specifically designed as follows:
(1) For an emitted ray l, the intersected parent nodes along l are detected with the voxel traversal algorithm, and the depth values of the intersection points with the intersected parent voxels are recorded from the view of l; the far and near intersection depths are denoted D_far and D_near respectively; for each intersected parent voxel, the voxel traversal algorithm continues to detect the intersected child voxels, whose far and near depth values are recorded as D'_far and D'_near; all recorded depth values are collected into the global lists, denoted here D̃_far and D̃_near;
(2) The number of sampling points is distributed between the near and far intersection points of each voxel; the sampling weight w_i of the i-th voxel is computed as
w_i = s_i·(d_far(i) − d_near(i)) / Σ_{k=1}^{N_v} s_k·(d_far(k) − d_near(k))
where d_far(i) and d_near(i) are the far and near depths of voxel i in D̃_far and D̃_near, N_v is the number of all intersected voxels, and s_i is the scale of voxel i; the parent and child voxel scales may be set to 1 and 4 respectively so that more points are assigned to the child voxels; specifically, the number of sampling points m_i inside voxel i is
m_i = ⌊ w_i · M ⌋
where M is the total number of sampling points and ⌊·⌋ denotes rounding down; during sampling, one sampling point is first allocated to each voxel i along the ray direction, so that the voxel closest to the camera on the ray always receives a sample; if Σ_i m_i < M, the remaining M − Σ_i m_i points are re-allocated in sampling order; the depth value d_i(j) of the j-th sampling point in voxel i is obtained by placing the m_i points uniformly between d_near(i) and d_far(i),
d_i(j) = d_near(i) + (j + 0.5)·(d_far(i) − d_near(i)) / m_i
with j starting from 0; the sampling point x is computed as x = P_cam + d_i(j)·d, where P_cam is the camera position and d is the viewing direction.
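A small sketch of the per-voxel point allocation described above is given below (Python/NumPy); the uniform in-voxel placement and the handling of leftover points are illustrative assumptions consistent with the description rather than the exact patented formulas.

```python
import numpy as np

def sample_along_ray(near, far, scale, cam_pos, ray_dir, total_points=48):
    """near, far, scale: (Nv,) arrays for the intersected voxels, ordered
    along the ray; returns the sample positions, shape (sum(m_i), 3)."""
    seg = far - near
    w = seg * scale
    w = w / w.sum()                                  # sampling weight per voxel
    m = np.floor(w * total_points).astype(int)
    m = np.maximum(m, 1)                             # at least one point per voxel
    i = 0
    while m.sum() < total_points:                    # hand out the leftovers
        m[i % len(m)] += 1
        i += 1
    depths = np.concatenate([
        near[v] + (np.arange(m[v]) + 0.5) * seg[v] / m[v]
        for v in range(len(m))])
    return cam_pos[None, :] + depths[:, None] * ray_dir[None, :]
```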
Further, the volume rendering of S4 and the loss functions used to train the SRONet model of S3 are specifically designed as follows:
(1) To compute the final color of ray l, a normalized-surface (UniSurf) volume rendering technique is used: from the occupancy value o_x of each sampling point on l, color fusion weights are computed and the point colors c_x are fused into the ray color
Ĉ(l) = Σ_i ω_x(i)·c_x(i)
where the color fusion weight of the i-th sampling point is ω_x(i) = o_x(i)·Π_{j<i}(1 − o_x(j)); at the same time, the depth values d_i of the sampling points are weighted in the same way to obtain the depth value at the intersection of ray l with the human surface, D̂(l) = Σ_i ω_x(i)·d_i (a code sketch of this compositing follows this subsection);
(2) Based on the estimated ray color Ĉ(l) and depth value D̂(l), the following loss functions are designed to train and optimize the reconstruction and rendering parts of SRONet:
Geometric-color cooperative loss L_syn: following the PIFu supervision scheme, OCCNet is trained with sampling points y in space by penalizing the error between the estimated occupancy value o_y and the true occupancy value o_y^*; at the same time the per-ray error between the true color C^*(l) and the estimated color Ĉ(l) is penalized, and the two loss terms work cooperatively:
L_syn = μ_o·(1/|S|)·Σ_{y∈S} BCE(o_y, o_y^*) + μ_c·(1/|R|)·Σ_{l∈R} ‖Ĉ(l) − C^*(l)‖_1
where S and R denote the set of sampling points and the set of rays respectively, BCE denotes the cross-entropy loss function, ‖·‖_1 denotes the L1 loss function, and μ_o and μ_c are the weights of the occupancy loss term and the color loss term;
Depth error loss L_D′: penalizes the error between the estimated ray depth value D̂(l) and the true depth value D^*(l) to further improve the reconstruction and rendering details:
L_D′ = (1/|R|)·Σ_{l∈R} ‖D̂(l) − D^*(l)‖_2
where ‖·‖_2 denotes the L2 loss function;
The loss function for SRONet is expressed as L = L_syn + λ_D′·L_D′, where λ_D′ is a balancing term; the loss is optimized with the ADAM algorithm.
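As referenced in step (1) above, the occupancy-based compositing can be sketched in PyTorch as follows; the tensor layout is an assumption made for illustration.

```python
import torch

def composite_ray(occ, color, depth):
    """occ:   (R, M)    occupancy o_x(i) of the M samples on each of R rays
    color: (R, M, 3) per-sample colors c_x(i)
    depth: (R, M)    per-sample depths d_i
    Returns the ray colors, ray depths and fusion weights."""
    # transmittance up to (but excluding) sample i: prod_{j<i} (1 - o_x(j))
    trans = torch.cumprod(
        torch.cat([torch.ones_like(occ[:, :1]), 1.0 - occ[:, :-1]], dim=1), dim=1)
    w = occ * trans                                   # omega_x(i)
    ray_color = (w.unsqueeze(-1) * color).sum(dim=1)  # C_hat(l)
    ray_depth = (w * depth).sum(dim=1)                # D_hat(l)
    return ray_color, ray_depth, w
```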
Further, the ray up-sampling in S5 is specifically designed as follows:
(1) Ray up-sampling: for the ray l passing through pixel location (x, y), its color Ĉ(l), depth value D̂(l) and ray fusion feature ft_color are scattered to 4 sub-pixels at the positions (x, y), (x+0.5, y), (x, y+0.5) and (x+0.5, y+0.5), where ft_color is the output feature of the feature fusion function H; this produces a coarse up-sampling result;
(2) Feature fusion operation: the result of (1) is enhanced using the original-resolution RGB images of the two views n_0, n_1 adjacent to the target view; specifically, the two adjacent RGB images are encoded with a UNet; each sub-ray corresponds to one sub-pixel, and the surface position p hit by the sub-ray is computed from its depth value; the point p is projected into the adjacent images and feature maps to obtain the colors c_{n_0}, c_{n_1} and the features ft_{n_0}, ft_{n_1}, and the visibilities v_{n_0}, v_{n_1} are computed, the visibility at view i being determined from the difference between z_i, the depth value of the projection of p at view i, and the depth value read from the denoised depth image D_i^rf at view i, weighted by the preset visibility coefficient σ_v; the feature fusion network then computes fusion weights that blend the three colors of the sub-ray, namely the scattered color and the two adjacent-view colors, to obtain the final color of the sub-ray; it is defined with a hidden function f_6 that processes the features and takes the original-resolution RGB images of the adjacent views n_0, n_1 as input;
(3) The loss functions used to train the feature fusion network are:
Per-ray color error and structural similarity loss L_B: penalizes the error between the sampled color block B̂_r and the true color block B_r^*, defined as
L_B = (1/|R|)·Σ_{r∈R} [ μ_1·‖B̂_r − B_r^*‖_1 + μ_2·(1 − SSIM(B̂_r, B_r^*)) ]
where R denotes the set of rays, B̂_r is a color block of size S_patch × S_patch whose entry (i, j) is the final estimated color of ray r at (i, j), ‖·‖_1 denotes the L1 loss function, SSIM denotes the structural similarity function, and μ_1, μ_2 are loss balancing terms;
Feature loss function L_ft: penalizes the feature error between the color blocks B̂_r and B_r^* to further enhance the quality of the rendered image; it is computed with a pretrained VGG-16 network, using the 3 feature maps fed into the first three max-pooling layers (MaxPool2d):
L_ft = (1/|R|)·Σ_{r∈R} Σ_{k=1}^{3} ‖φ_k(B̂_r) − φ_k(B_r^*)‖_1
where φ_k denotes the k-th of these VGG-16 feature maps and ‖·‖_1 is the L1 loss between VGG features;
The loss function for the feature fusion network is expressed as L = L_B + μ_vgg·L_ft, where μ_vgg is a balancing term; the network parameters are trained independently with the ADAM algorithm.
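A minimal sketch of the 2x ray scattering in step (1) is shown below (PyTorch); the tensor shapes are assumptions, and the subsequent neural-fusion refinement with adjacent-view colors and visibilities is omitted.

```python
import torch
import torch.nn.functional as F

def scatter_rays_2x(color, depth, feat):
    """color: (3, H, W), depth: (1, H, W), feat: (C, H, W) per-ray outputs.
    Each ray at (x, y) shares its color, depth and fused feature with its
    4 sub-pixels, giving a coarse 2H x 2W result that the feature fusion
    network then refines."""
    up = lambda t: F.interpolate(t.unsqueeze(0), scale_factor=2,
                                 mode="nearest").squeeze(0)
    return up(color), up(depth), up(feat)
```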
Further, the specific design for the parallel accelerated rendering flow is as follows:
(1) The rendering process is accelerated with two GPUs: each graphics card processes half of the data (images and rays), the two batches of data are synchronized through CPU memory, and the rendering process is accelerated with a pipeline;
Specifically, the pipeline is divided into 3 parts, and each GPU is accelerated with 3 separate data streams: 1. the I/O process (CPU to GPU) and the depth denoising process; 2. construction of the two-layer tree data structure, encoding of the multi-view RGB and depth images in SRONet, and image encoding in the ray up-sampling; 3. computation of the occupancy values and colors of the ray sampling points with SRONet, and ray up-sampling based on the feature fusion network; finally, all computed rays are converted into an image for display;
(2) The depth image denoising model, the multi-view RGB and depth image encoding in SRONet, and the image encoding in the ray up-sampling are quantized to half precision and accelerated with TensorRT; all hidden functions and the Transformer-based feature fusion functions are accelerated with a fully-fused scheme through GPU shared memory; the accelerated model can render a novel-view image at 1K resolution in around 100 ms.
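The two-GPU split can be sketched with PyTorch CUDA streams as below; the simple half-and-half ray split and the run_stage callable are assumptions for illustration, not the patented three-stage pipeline.

```python
import torch

def render_on_two_gpus(rays, run_stage, devices=("cuda:0", "cuda:1")):
    """rays: (R, 6) ray origins+directions on the CPU; run_stage(chunk) is a
    placeholder that runs the denoising / SRONet / up-sampling stages for one
    chunk on its device and returns per-ray colors."""
    halves = rays.chunk(2)
    outputs, streams = [None, None], []
    for i, (dev, chunk) in enumerate(zip(devices, halves)):
        stream = torch.cuda.Stream(device=dev)
        streams.append(stream)
        with torch.cuda.device(dev), torch.cuda.stream(stream):
            outputs[i] = run_stage(chunk.to(dev, non_blocking=True))
    for dev in devices:
        torch.cuda.synchronize(dev)               # wait for both halves
    # gather both halves back through CPU memory and assemble the image
    return torch.cat([o.cpu() for o in outputs], dim=0)
```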
The invention also provides a human body reconstruction and rendering device of the collaborative light field and the occupied field, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the processors are used for realizing the human body reconstruction and rendering method of the collaborative light field and the occupied field when executing the executable codes.
The invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the above-described human body reconstruction and rendering method of collaborative light fields and occupied fields.
The applicant trained the model designed above on a synthetic human dataset and tested it on a public dataset. Compared with other methods, the proposed method achieves better results. In summary, the invention has the following main beneficial effects:
(1) A novel human capture method is provided that can create 1K-resolution free-viewpoint video from sparse RGBD input views in about 100 ms. The method generalizes to unseen performers without further optimization.
(2) A hybrid representation is presented, together with the novel network model SRONet. Based on the depth denoising result, the model uses pixel-aligned RGBD features and makes the occupancy field and the light field cooperate to perform accurate human body reconstruction and rendering.
(3) A two-layer tree-based data structure is proposed for effective point sampling, a neural fusion-based ray upsampling technique for rapidly increasing the resolution of rendered images, and a parallel computing pipeline for rendering acceleration.
Drawings
Fig. 1 is a schematic diagram of the human body reconstruction and rendering method with a cooperative light field and occupancy field in an embodiment of the present invention. Given RGBD stream images captured by sparse Azure Kinect sensors as input, (1) the depth image denoising model removes depth noise from the input RGBD images and fills the holes of the original depth; (2) given the denoised depth images, a novel two-layer tree structure is reconstructed to store the global geometric information in discretized form; (3) efficient ray-voxel intersection and ray point sampling are performed for rendering; (4) the novel network model SRONet makes the light field and the occupancy field cooperate to perform human reconstruction and free-viewpoint rendering; (5) the output of SRONet is raised to the target resolution by the neural-fusion-based ray up-sampling technique.
FIG. 2 is the specific process of building the two-layer tree structure in an embodiment of the present invention, comprising: (1) converting the denoised depth images into a whole-body point cloud; (2) building a cube from the point cloud, with the voxels stored on the GPU; (3) merging small voxels at a 64:1 ratio into parent voxels to construct the two-layer tree structure, and converting the voxels into a node list for storage; (4) at inference time, using OCCNet to quickly build a cube based on the real-time human reconstruction algorithm RealtimePIFu, which is binarized and fused with the first-layer cube V_occ of the tree structure to eliminate floating noise voxels, giving the denoised cube.
FIG. 3 is a specific process of ray-voxel intersection and ray-point sampling in an embodiment of the invention. A voxel traversal method is used to determine parent-child voxels in the tree structure that intersect the ray.
Fig. 4 is a block diagram of the proposed SRONet and its supervision signals. SRONet consists of OCCNet and ColorNet and predicts the occupancy value and color of the input point; the geometric and color loss functions supervise the training cooperatively.
Fig. 5 shows the proposed ray up-sampling and neural fusion process. The sub-rays share the color, depth and features of the principal emitted ray; for each sub-pixel, the neural fusion takes the adjacent-view features, the viewing direction and the visibility as input and predicts weights to fuse the sub-pixel color with the adjacent-view colors.
Detailed Description
The following description of the main module embodiments of the present invention is provided in connection with the accompanying drawings and with the validity verification experiments of the invention, to facilitate a better understanding by those skilled in the art.
As shown in fig. 1 (2) and fig. 2, the two-layer tree data structure of the invention is constructed from the denoised depth point cloud. Constructing the two-layer tree structure effectively limits the rendering range to the vicinity of the three-dimensional human surface and thus acts as a geometric constraint. In addition, the tree structure is stored on the GPU, which facilitates fast voxel lookup for ray-voxel intersection and efficient point sampling inside voxels. The tree is designed with two layers mainly because: (1) regions invisible in the point cloud cannot be covered by leaf-layer voxels, which would cause losses in rendering; (2) the two-layer structure covers the whole body region with large voxels, and a higher number of layers would add overhead without improving performance. For storage, the parent and child voxels of the tree structure are converted into a global list L_v on the GPU, and each node stores its index, size, center position, parent (child) voxel indices and other information.
As shown in fig. 1 (2) and 3, given an emitted ray l, a voxel traversal algorithm (fig. 3 (c), (d)) is first used to detect the valid parent voxels that intersect l and to record the depth values of the ray-voxel intersections at the target view t. In the example of fig. 3 (c), there are 4 valid voxels that intersect ray l, and the recorded near and far depth values are D_near = d({v_i}_{i=0,1,2,4,5}, RT_t) and D_far = d({v_i}_{i=1,2,3,5,6}, RT_t), where d(·, RT_t) is the projection function that gives the depth value of a three-dimensional point at view t, and RT_t is the camera extrinsic matrix at view t. Then the voxel traversal algorithm continues to detect all valid child voxels and records their near and far depth values as the lists D'_near and D'_far, as shown in fig. 3 (d). Finally, all near and far depths are merged into global lists, recorded as D̃_near and D̃_far, as the final ray-voxel intersection result, which determines the point sampling ranges. Point sampling is performed within all recorded near-far depth pairs, e.g. points are picked in {v_0, v_1} and {q_0, q_1}. Finally, all sampling points on ray l are used to compute the ray color.
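The parent-voxel traversal can be sketched with a standard grid-stepping (Amanatides-Woo style) loop as below (Python/NumPy); the bounds handling and the assumption that the ray origin lies inside the grid are simplifications for illustration.

```python
import numpy as np

def traverse_voxels(origin, direction, v_idx, grid_origin, voxel_size):
    """Walk a ray through the parent-level index cube v_idx (R, R, R) and
    return the list indices of the valid parent voxels it crosses
    (assumes the ray origin is already inside the grid)."""
    direction = np.where(np.abs(direction) < 1e-9, 1e-9, direction)
    cell = np.floor((origin - grid_origin) / voxel_size).astype(int)
    step = np.where(direction > 0, 1, -1)
    next_bound = grid_origin + (cell + (step > 0)) * voxel_size
    t_max = (next_bound - origin) / direction      # distance to the next boundary
    t_delta = voxel_size / np.abs(direction)       # distance between boundaries
    hits = []
    while np.all((cell >= 0) & (cell < v_idx.shape[0])):
        idx = v_idx[tuple(cell)]
        if idx >= 0:                               # valid parent voxel
            hits.append(int(idx))
        axis = int(np.argmin(t_max))               # step across the closest boundary
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return hits
```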
The training dataset used in the present invention consists of rendered human data.
Specifically, we use the THuman2.0 human model dataset introduced by the paper "Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors" to generate the training dataset of the present invention; it contains 500 high-quality three-dimensional human scan models. Real human RGBD images are rendered around the y-axis at 6-degree intervals with CUDA acceleration. We split this dataset into training and test sets at a 4:1 ratio. For the original depth image input, we follow the method of the paper "Kinect v2 for mobile robot navigation: Evaluation and modeling" and add sensor noise to the true rendered depth map D_gt, including noise related to the pixel depth value z given by a quadratic in z (with terms 1.5z², 1.5z and 1.375), Gaussian noise (1.5 cm on average) and holes (3 pixels wide on average), to cover as many of the possible noise cases as possible.
We implement the model of the present invention with PyTorch and CUDA. In the actual training process, two NVIDIA RTX 3090 graphics cards are used to train the depth image denoising model, SRONet and the ray up-sampling network respectively, and the ADAM optimizer is used for model parameter optimization, with the ADAM parameters β_1, β_2 set to 0.5 and 0.99. The batch size of the depth image denoising model is set to 8 and its learning rate to 1e-3, training for 10 epochs in total. The SRONet batch size is set to 4 and its learning rate to 1e-4, halved every 5 epochs, training for 20 epochs in total. The ray up-sampling network uses the same learning rate setting as SRONet, with a batch size of 2, training for 10 epochs. For the loss balancing terms, λ_N and λ_P of the depth image denoising model are set to 0.5 and 0.01 respectively; μ_o, μ_c and λ_D′ in SRONet are set to 0.5, 1.0 and 1.0 respectively; μ_1, μ_2 and μ_vgg of the ray up-sampling network are set to 0.4, 0.6 and 0.01 respectively; the visibility weight σ_v is set to 200. In addition, the hyper-parameters α and γ of the two-layer tree structure are set to 40 and 0.01 respectively. The number of sampling points M during training and inference is set to 48. The final resolution of the rendered image is set to 1K (1024 x 1024).
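The optimizer settings above can be summarized in a small PyTorch sketch; the module arguments are placeholders, and only the ADAM betas, learning rates and the SRONet schedule follow the stated configuration.

```python
import torch

def make_optimizers(denoiser, sronet, upsampler):
    """ADAM optimizers with the betas / learning rates stated above
    (the three networks are trained separately)."""
    betas = (0.5, 0.99)
    opts = {
        "denoiser": torch.optim.Adam(denoiser.parameters(), lr=1e-3, betas=betas),
        "sronet": torch.optim.Adam(sronet.parameters(), lr=1e-4, betas=betas),
        "upsampler": torch.optim.Adam(upsampler.parameters(), lr=1e-4, betas=betas),
    }
    # SRONet halves its learning rate every 5 epochs (20 epochs in total)
    sched = torch.optim.lr_scheduler.StepLR(opts["sronet"], step_size=5, gamma=0.5)
    return opts, sched
```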
To verify the effectiveness of the method of the present invention, we performed rendering quality tests on the THuman2.0 test set and on real captured data. The verification experiments compare against 6 current mainstream deep-learning-based methods on the test set: pixelNeRF (Neural Radiance Fields from One or Few Images), IBRNet (Learning Multi-View Image-Based Rendering), MPS-NeRF (Generalizable 3D Human Rendering from Multiview Images), NHP (Learning Generalizable Radiance Fields for Human Performance Rendering), NPBG++ (Accelerating Neural Point-Based Graphics) and PIFu (RGBD) (Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization), evaluated with four image similarity criteria: PSNR (peak signal-to-noise ratio), SSIM (structural similarity), LPIPS (learned perceptual image patch similarity) and MAE (mean absolute error). The comparison results are shown in Table 1 (test set) and Table 2 (real captured data):
TABLE 1
TABLE 2
Here, higher PSNR and SSIM values indicate better results, while lower LPIPS and MAE values indicate better results. The best result in each row is shown in bold and the second best is underlined. From the above results it can be seen that the proposed method (Ours) is superior to the other methods overall and is far ahead of them in average rendering time (Avg Time). This comparative experiment fully demonstrates the superiority of the method of the invention.
In addition, we performed ablation experiments on the test set (THuman2.0 Dataset) and the real captured data (Our Real Captured Dataset) to analyze the effectiveness of the loss functions introduced in SRONet (geometric loss, depth loss), the depth denoising process, the cooperative occupancy-field and light-field representation, the feature fusion operation, and the ray up-sampling operation. We removed or modified each of the above and retrained the whole network model; the experimental results are shown in Table 3:
TABLE 3
Here, w/o GT Depth denotes removing the depth consistency loss during SRONet training, and w/o GT Occ denotes removing the geometric supervision during SRONet training; w/o Denoised Depth denotes removing the depth denoising; Soft Occ.→Density denotes replacing the cooperative occupancy-field and light-field representation with the original NeRF light-field representation, i.e. OCCNet predicts the volume density of the sampling points instead of the occupancy value, and the rendering scheme used by the invention is replaced with NeRF volume rendering. OccMLP→DbMLP denotes OCCNet predicting the sampling-point volume density and occupancy value simultaneously, keeping the geometric supervision loss, and rendering with the NeRF volume rendering method. Hydra Att.→Self Att. denotes replacing the Hydra Attention module of the feature fusion function H with the Self-Attention module of the Transformer model. w/o Upsampling denotes removing the neural-fusion-based ray up-sampling module. The ablation comparison shows that each loss function or module introduced by the invention improves the overall rendering quality to a certain extent.
The embodiment of the invention also provides a human body reconstruction and rendering device of the collaborative light field and the occupied field, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the processors are used for realizing the human body reconstruction and rendering method of the collaborative light field and the occupied field when executing the executable codes.
The embodiment of the invention also provides a computer readable storage medium, wherein a program is stored on the computer readable storage medium, and when the program is executed by a processor, the human body reconstruction and rendering method of the cooperative light field and the occupied field is realized.
The foregoing description of the preferred embodiments is merely intended to illustrate the embodiments of the present invention and is not intended to limit the invention to the particular embodiments described.

Claims (10)

1. The human body reconstruction and rendering method for the cooperative light field and the occupied field is characterized by comprising the following steps:
s1: construction of depth image denoising model F based on convolutional neural network structure d The input of the model comprises a human RGB image I and an original depth image D, and the input is output as a denoised depth image D rf The method comprises the steps of carrying out a first treatment on the surface of the Training a model through depth, normal and three-dimensional consistency loss functions, and using the model after training to perform a depth denoising task;
s2: based on the denoised multi-view depth image and camera parameters of each view, fusing a global human body point cloud, constructing a data structure of a two-layer tree from the point cloud, and storing father and son nodes of the tree as a global list; introducing voxel post-processing operation in the reasoning process to denoise the tree structure, and only preserving voxels near the surface of the human body;
s3: constructing a collaborative network model SRONet for reconstructing human body grids and rendering new view images, wherein the network model comprises two sub-network models, namely an occupied field network OCCNet and a light field network ColorNet; for a given input: multi-view RGBD imageAnd sampling points x and observing view angles d in the three-dimensional space, wherein N is the number of the view angles, and the sub-network model respectively realizes the following tasks:
multi-view depth image for a given inputOCCNet based on pixels for sampling points xThe aligned hidden function PIFu predicts a voxel occupancy o for point x x ∈[0,1]Representing the probability that the point is inside the human mesh model;
given an input multi-view RGB image { I ] i } i=1,…N ColorNet predicts the color vector c of RGB three channels in the viewing angle direction d for point x based on PIFU x
Occupancy o based on point x prediction x Calculating a human body grid model from the estimated occupied field by using an equivalent cube search algorithm so as to reconstruct a three-dimensional human body; training SRONet by utilizing a multi-eye human body data set and combining a geometric-color cooperative loss function and a depth error loss function, wherein the model training is used for predicting an occupied field and a light field after finishing;
s4: calculating rays for each pixel point in the new view angle image according to the corresponding camera parameters, and executing a ray voxel intersection process according to a voxel traversal algorithm to determine voxels in the tree data structure which intersect the rays; recording depth values of far and near intersection points of all the intersecting voxels, designing sampling weight edge ray sampling points in the voxels according to the size of the voxels, and calculating the occupation value and color of the sampling points according to S3; calculating color fusion weight according to the occupation value by utilizing a volume rendering formula so as to fuse the color of each sampling point on the light ray and calculate the final color of the light ray;
s5: upsampling each ray to improve the resolution and quality of the rendered image; constructing a feature fusion network, wherein the input of the network is information shared by each sub-ray, and the information comprises original color, depth, ray features and input RGB images of two adjacent visual angles; outputting the final color of each sub-ray; the feature fusion network is trained through ray-by-ray color errors, structural similarity and feature loss functions, and the training is used for the ray up-sampling operation after the training is completed so as to obtain a target resolution rendered image.
2. The human body reconstruction and rendering method of a collaborative light field and an occupied field according to claim 1, characterized in that the depth denoising process is specifically designed as follows:
(1) extracting the region where the person is located in the original input image with the BackgroundMatting-v2 algorithm as the human-region mask image ψ, and obtaining the RGBD images I and D of the person region at the same time; normalizing the input RGBD images to the interval [-1,1] and recording the depth maximum; concatenating the normalized RGB and depth images as the input of the depth image denoising model F_d;
(2) the UNet-structured model F_d uses two independent feature extraction networks HRNetV2-W18-Small-v2 to encode the RGB and depth images respectively, fuses the RGB and depth features with an atrous spatial pyramid pooling module and a residual attention module, up-samples the fused features, and feeds them back into the feature extraction networks;
(3) the model F_d outputs an image in the interval [-1,1], which is de-normalized to the original value range using the depth maximum recorded in (1) and post-processed with ψ so that only the depth values of the person region are kept, denoted D_rf;
(4) the loss functions used to train the model F_d are as follows:
depth consistency loss L_D: penalizing the pixel-wise deviation between the real depth image D_i^* and the predicted depth image D_i^rf at view i, defined as
L_D = Σ_i Σ_p ‖D_i^rf(p) − D_i^*(p)‖_1
normal consistency loss L_N: penalizing the deviation between the normal directions computed from D_i^rf and D_i^*, defined as
L_N = Σ_i Σ_p ‖1 − ⟨N_i^rf(p), N_i^*(p)⟩‖_2
wherein D_i^rf(p) and N_i^rf(p) denote the depth value and normal vector at pixel p of the predicted depth image D_i^rf at view i and of the normal map N_i^rf computed from it, D_i^*(p) and N_i^*(p) denote the corresponding true depth value and normal vector, ‖·‖_1 denotes the L1 loss function, ‖·‖_2 denotes the L2 loss function, and ⟨·,·⟩ denotes the vector inner product operation;
three-dimensional consistency loss L_P: further constraining the consistency between the point cloud fused from {D_i^rf} and the real point cloud, so as to reduce depth-fusion noise when constructing the two-layer tree structure, defined as
L_P = CD(P^rf, P_gt), with P^rf = F({D_i^rf}, {K_i, RT_i})
wherein P^rf denotes the point cloud fused from {D_i^rf}, F is the truncated signed distance field fusion algorithm, K_i and RT_i are the intrinsic and extrinsic camera parameters, P_gt is the point cloud sampled from the true three-dimensional human mesh model, and CD denotes the Chamfer distance loss function;
the loss function for depth denoising is expressed as L = L_D + λ_N·L_N + λ_P·L_P, wherein λ_N and λ_P are loss function weights, and the loss function is optimized with the ADAM algorithm.
3. The method for reconstructing and rendering human bodies of a collaborative light field and an occupied field according to claim 1, wherein the data structure of the two-layer tree is specifically designed as follows:
(1) fusing the denoised depth images of all views with the truncated signed distance field fusion algorithm to obtain the global human point cloud P_rf and the fusion cube V_tsdf; binarizing V_tsdf into V_occ, the occupancy value V_occ(x) at spatial position x being
V_occ(x) = 1 if |V_tsdf(x)| < α·s_v, and 0 otherwise,
wherein s_v is the voxel size, α is the truncated signed distance threshold, and V_tsdf(x) is the truncated signed distance of V_tsdf at spatial position x;
(2) voxel post-processing: using the OCCNet of S3, quickly constructing a cube of the set resolution based on the real-time human reconstruction algorithm, binarizing it and fusing it with the V_occ constructed in (1) to eliminate the floating noise voxels in V_occ, giving the denoised cube; the fused occupancy value at spatial position x combines V_occ(x) with the binarized reconstruction cube through a logical OR, wherein B_β(·) is a binarization function based on the threshold β, γ is the threshold used to filter floating voxels, and | denotes the OR operation;
(3) marking the voxels of the denoised cube whose occupancy value is 1 as valid voxels, fusing parent voxels at a preset number ratio, and storing all valid voxels as a global list L_v, each node corresponding to a voxel and recording the index, spatial position and size information of its parent or child nodes; constructing an index cube V_idx to store the index value of each valid parent node in the global list, the index value of invalid parent nodes being set to -1.
4. The human body reconstruction and rendering method of a collaborative light field and an occupied field according to claim 1, wherein the SRONet is specifically designed as follows:
(1) OCCNet: the occupancy field is based on depth information, and the feature encoder HRNetV2-W18 is used to encode the depth images; for a sampling point x, OCCNet predicts the occupancy value o_x by aggregating the pixel-aligned depth features of this point from each view; the occupancy field is defined as the function
o_x = f_2( Avg( { f_1( W_i(x), c_i(x) ) }_{i=1,…,N} ) )
wherein W_i is the depth feature map of view i after encoding, W_i(x) is the depth feature obtained by projecting point x into W_i at view i, and c_i(x) comprises the depth value of the projection of x in the camera coordinate system of view i and the truncated signed distance; the hidden function f_1, represented by a fully connected network, produces the geometric feature of each view, the fused global feature is obtained through the average pooling operation Avg, and the global feature is fed into a second hidden function f_2 to compute the occupancy value o_x;
(2) ColorNet: the light field is based on color features and geometric features; the same feature encoder as in OCCNet is used to encode the RGB images and obtain color features; for the sampling point x, ColorNet takes the additional inputs, namely the viewing direction d and the geometric feature g_x, and aggregates the color features of each view to predict the view-dependent color vector c_x, wherein the geometric feature is expressed as g_x = f_3(Avg({f_1(W_i(x), c_i(x))})), f_3 being a hidden function used for encoding; the light field is defined as the function
c_x = f_5( H( { f_4( M_i(x), rgb_i(x), g_x, d_i ) }_{i=1,…,N} ) )
wherein M_i is the color feature map of view i after encoding, M_i(x) and rgb_i(x) are the color feature and pixel color obtained by projecting point x into M_i at view i, f_4 and f_5 are hidden functions that further process the features, H is a feature fusion function implemented with a Transformer, and d_i = R_i·d is the viewing direction in the camera coordinate system, R_i being the rotation matrix of the camera extrinsics at view i;
in this way SRONet predicts the occupancy value and the view-dependent color for the sampling point x.
5. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the specific design of the ray-voxel intersection and acquisition point in S4 is as follows:
(1) For an emitted ray l, detecting intersected father nodes along the ray l by using a voxel traversal algorithm, and recording depth values of the intersected father voxels at the intersection point under the angle of l, wherein the depth values of the far and near intersection points are respectively recorded as D far And D near The method comprises the steps of carrying out a first treatment on the surface of the For each intersected parent voxel, continuing to detect intersected child voxels by using a voxel traversal algorithm, and simultaneously recording far and near depth values, which are marked as D' far With D' near The method comprises the steps of carrying out a first treatment on the surface of the Summarizing the depth values of all records to beAnd->
(2) The sampling weight w of the ith voxel is calculated by distributing the number of sampling points between the far and near intersection points of each voxel i The following are provided:
wherein d far (i) And d near (i) Respectively representAnd->Far and near depth of middle voxel i, N v For all intersecting voxel numbers, s i Representing the scale size of voxel i; specifically, the number m of sampling points inside each voxel i i The method comprises the following steps:
wherein M is the total number of sampling points,representing a downward rounding; during sampling, a sampling point is allocated to each voxel i along the light emission direction so as to ensure that the voxel closest to the camera on the light is always sampled to the point; if sigma i m i < M, then reallocating the remaining M-sigma in sample order i m i Depth value d of the jth sampling point in the ith voxel i (j) The calculation formula is as follows:
wherein j starts from 0; the sampling point x is obtained as x = p_cam + d_i(j)·d, where p_cam is the camera position and d is the viewing direction (a minimal allocation sketch follows this claim).
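A minimal sketch, under stated assumptions, of the per-voxel sample allocation in this claim: the weight formula w_i and the intra-voxel depth formula d_i(j) appear in the patent only as figures, so the interval-length weight and the uniform spacing below are stand-ins; the function names are hypothetical.

import numpy as np

def allocate_samples(d_near, d_far, scales, M):
    """Distribute M sample points among intersected voxels.
    The weight below (interval length over voxel scale, normalized) is an assumed
    stand-in for the claimed w_i; the floor and re-allocation steps follow the claim."""
    d_near, d_far, scales = map(np.asarray, (d_near, d_far, scales))
    w = (d_far - d_near) / scales
    w = w / w.sum()
    m = np.floor(w * M).astype(int)        # m_i = floor(w_i * M)
    m = np.maximum(m, 1)                   # every intersected voxel gets at least one sample
    remaining = M - m.sum()
    i = 0
    while remaining > 0:                   # re-allocate leftover points in sampling order
        m[i % len(m)] += 1
        remaining -= 1
        i += 1
    return m

def sample_depths(d_near_i, d_far_i, m_i):
    """Depth of the j-th sample inside a voxel, j starting from 0; uniform spacing is an assumption."""
    return [d_near_i + (d_far_i - d_near_i) * j / max(m_i, 1) for j in range(m_i)]

def sample_points(p_cam, d, depths):
    """x = p_cam + d_i(j) * d along the viewing direction d."""
    p_cam, d = np.asarray(p_cam), np.asarray(d)
    return [p_cam + t * d for t in depths]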
6. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the volume rendering of S4 and the loss functions for training the SRONet model of S3 are specifically designed as follows:
(1) To compute the final color of ray l, a normalized surface-volume rendering technique is used: from the occupancy value o_x of each sample point on l, color fusion weights are computed, the point colors c_x are blended, and the ray color is computed as follows:
wherein the color fusion weight of the i-th sampling point is ω_x(i) = o_x(i) ∏_{j<i} (1 − o_x(j)); at the same time, the depth values d_i of the sample points are weighted in the same way to compute the depth value at the intersection of ray l with the human body surface (a minimal compositing sketch follows this claim);
(2) Based on the estimated ray color and the estimated ray depth, the following loss functions are designed to train the reconstruction and rendering part of SRONet:
geometric-color synergistic loss L_syn: following the supervision scheme of PIFu, OCCNet is trained with sampling points y in space by penalizing the error between the estimated occupancy value o_y and the ground-truth occupancy value o_y*; at the same time, the per-ray error between the ground-truth color C*(l) and the estimated color is penalized; the two loss terms work cooperatively:
wherein S and R denote the set of sampling points and the set of rays, respectively, the occupancy term uses a cross-entropy loss function, the color term uses an L1 loss function, and μ_o and μ_c are the weights of the occupancy loss term and the color loss term, respectively;
depth error loss L_D′: penalizes the error between the estimated ray depth and the ground-truth depth D*(l) to further improve the reconstruction and rendering details:
wherein the depth term uses an L2 loss function;
the loss function for SRONet is expressed as L_syn + λ_D′·L_D′, where λ_D′ is a balancing term; the loss function is optimized with the ADAM algorithm.
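A minimal sketch of the occupancy-based compositing in item (1) of this claim, assuming PyTorch tensors with samples ordered from near to far; the exclusive cumulative product implements ω_x(i) = o_x(i)·∏_{j<i}(1 − o_x(j)), and the same weights are reused for the depth estimate as stated above.

import torch

def composite_ray(occupancy, colors, depths):
    """occupancy: (N,) values in [0, 1]; colors: (N, 3); depths: (N,), ordered near-to-far.
    Returns the fused ray color and the surface depth estimate."""
    ones = torch.ones(1, dtype=occupancy.dtype, device=occupancy.device)
    # exclusive cumulative product of (1 - o_x(j)) over the preceding samples
    trans = torch.cumprod(torch.cat([ones, 1.0 - occupancy[:-1]]), dim=0)
    weights = occupancy * trans                                  # omega_x(i)
    ray_color = (weights.unsqueeze(-1) * colors).sum(dim=0)      # fused ray color
    ray_depth = (weights * depths).sum(dim=0)                    # weighted surface depth
    return ray_color, ray_depth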
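A sketch of the training objective in item (2), assuming binary cross-entropy as the cross-entropy term and illustrative values for the weights μ_o, μ_c, λ_D′; the function name and signature are hypothetical.

import torch
import torch.nn.functional as F

def sronet_loss(o_pred, o_gt, c_pred, c_gt, d_pred, d_gt,
                mu_o=1.0, mu_c=1.0, lambda_d=0.1):
    """o_pred/o_gt: (S,) occupancy at sampled points y; c_pred/c_gt: (R, 3) per-ray colors;
    d_pred/d_gt: (R,) per-ray depths. The weight values are illustrative assumptions."""
    l_occ = F.binary_cross_entropy(o_pred, o_gt)    # occupancy supervision (PIFu-style)
    l_col = F.l1_loss(c_pred, c_gt)                 # per-ray color term
    l_syn = mu_o * l_occ + mu_c * l_col             # geometric-color synergistic loss
    l_depth = F.mse_loss(d_pred, d_gt)              # depth error loss (L2)
    return l_syn + lambda_d * l_depth

As the claim specifies the ADAM algorithm, torch.optim.Adam would be used to minimize this loss.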
7. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the ray up-sampling in S5 is specifically designed as follows:
(1) Ray up-sampling: for the ray l passing through pixel location (x, y), its color, depth value, and ray fusion feature ft_color are propagated to 4 sub-pixels at the corresponding positions (x, y), (x+0.5, y), (x, y+0.5), (x+0.5, y+0.5), where ft_color is the output feature of the feature fusion function H; this produces a coarse up-sampled result;
(2) Feature fusion operation: the result of (1) is enhanced with two views n_0, n_1 adjacent to the target view and their original-resolution RGB images; specifically, the two adjacent RGB images are encoded with a UNet; each sub-ray corresponds to one sub-pixel, and the surface point p hit by the sub-ray is computed from the sub-ray's depth value; p is projected into the adjacent images and feature maps to obtain the colors and features of the two adjacent views, and the visibilities of the two views are computed from z_i, the depth of the projection of p under view i, the depth value taken from the denoised depth image of view i, and a preset visibility weight coefficient σ_v; a feature fusion network then computes fusion weights that blend the three colors of the sub-ray to obtain its final color; it is defined as follows (a minimal fusion sketch follows this claim):
wherein f_6 is an implicit function that processes the features, and the remaining inputs are the original-resolution RGB images of the adjacent views n_0 and n_1;
(3) The loss functions for training the feature fusion network are:
per-ray color-error and structural-similarity loss L_B: penalizes the color error between the sampled color patch and the ground-truth color patch, and is defined as follows:
wherein R denotes the set of rays, the sampled patch is a color block of size S_patch × S_patch whose entry at (i, j) is the final estimated color of ray r, L1 denotes the L1 loss function, SSIM denotes the structural similarity function, and μ_1, μ_2 are balancing terms of the loss function;
feature loss function L_ft: penalizes the feature error between the sampled color patch and the ground-truth color patch to further enhance the quality of the rendered image; the loss is computed with a pretrained VGG-16 network and is defined as follows:
wherein the term denotes the L1 loss between VGG features;
the loss function for the feature fusion network is expressed as L_B + μ_vgg·L_ft, where μ_vgg is a balancing term; the parameters of the feature fusion network are trained independently with the ADAM algorithm.
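A minimal sketch of the sub-pixel propagation and adjacent-view fusion in items (1)-(2) of this claim. The visibility formula and the weight normalization appear in the patent only as figures, so the exponential visibility and the softmax blending below are assumptions; all class and function names are hypothetical.

import torch
import torch.nn as nn

def upsample_ray(color, depth, feat):
    """Propagate a ray's color, depth, and fused feature to its 4 sub-pixels:
    (x, y), (x+0.5, y), (x, y+0.5), (x+0.5, y+0.5)."""
    return [(color, depth, feat) for _ in range(4)]     # coarse up-sampled result

def visibility(z_proj, z_denoised, sigma_v=0.05):
    """Assumed visibility form: closer agreement between the projected depth and the
    denoised depth map of view i gives visibility near 1 (the exact formula is a figure)."""
    return torch.exp(-torch.abs(z_proj - z_denoised) / sigma_v)

class SubRayFusion(nn.Module):
    """Hypothetical fusion head: predicts 3 blending weights for the propagated color
    and the two adjacent-view colors of a sub-ray."""
    def __init__(self, feat_dim=32):
        super().__init__()
        # f_6-style implicit function over [sub-ray feature, ft_n0, ft_n1, v_n0, v_n1]
        self.f6 = nn.Sequential(nn.Linear(3 * feat_dim + 2, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feat, ft_n0, ft_n1, v_n0, v_n1, c_ray, c_n0, c_n1):
        x = torch.cat([feat, ft_n0, ft_n1, v_n0, v_n1], dim=-1)
        w = torch.softmax(self.f6(x), dim=-1)                # fusion weights (softmax assumed)
        colors = torch.stack([c_ray, c_n0, c_n1], dim=-2)    # (..., 3 colors, 3 channels)
        return (w.unsqueeze(-1) * colors).sum(dim=-2)        # final sub-ray color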
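A sketch of the patch losses in item (3), assuming an L1 + (1 − SSIM) form for L_B with illustrative weights μ_1, μ_2, μ_vgg, an externally supplied ssim_fn (SSIM is not built into PyTorch), and torchvision's pretrained VGG-16 for the feature loss; the exact loss formulas in the claim are figures and are not reproduced here.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# frozen pretrained VGG-16 feature extractor (early convolutional blocks)
vgg_features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def fusion_loss(pred_patch, gt_patch, ssim_fn, mu1=1.0, mu2=0.2, mu_vgg=0.1):
    """pred_patch/gt_patch: (B, 3, S_patch, S_patch) estimated vs. ground-truth color blocks.
    ssim_fn is assumed to return a structural-similarity score in [0, 1]."""
    l_b = mu1 * F.l1_loss(pred_patch, gt_patch) + mu2 * (1.0 - ssim_fn(pred_patch, gt_patch))
    l_ft = F.l1_loss(vgg_features(pred_patch), vgg_features(gt_patch))   # VGG feature L1
    return l_b + mu_vgg * l_ft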
8. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, characterized in that the parallel accelerated rendering pipeline is specifically designed as follows:
(1) The rendering process is accelerated with two GPUs; each graphics card processes half of the data, the two batches of data are synchronized on the CPU through memory, and the rendering process is further accelerated by pipelining; specifically, the pipeline is divided into 3 stages and each GPU runs 3 separate data streams: 1. the I/O stage and the depth denoising stage; 2. construction of the two-layer tree data structure, encoding of the multi-view RGB and depth images in SRONet, and encoding of the images used in ray up-sampling; 3. computation of the occupancy values and colors of the ray sampling points through SRONet, and ray up-sampling based on the feature fusion network; finally, all computed rays are converted into an image for display (a minimal pipelining sketch follows this claim);
(2) The depth image denoising model, the multi-view RGB image encoding and depth image encoding in SRONet, and the image encoding in ray up-sampling are quantized to half precision and accelerated with TensorRT; all implicit functions and the Transformer-based feature fusion function are accelerated through GPU shared memory.
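A minimal sketch of the per-GPU, multi-stream pipelining idea in item (1), using PyTorch CUDA streams; the stage callables are hypothetical placeholders for the claim's three stages (I/O + depth denoising; tree construction + image encoding; SRONet evaluation + ray up-sampling), and the TensorRT/half-precision acceleration of item (2) is not shown.

import torch

def render_half(device, batch, stage1, stage2, stage3):
    """Run the three pipeline stages for one GPU's half of the data on separate CUDA streams.
    stage1/stage2/stage3 are hypothetical callables standing in for:
    1. I/O + depth denoising, 2. tree construction + image encoding, 3. SRONet + ray up-sampling."""
    s1, s2, s3 = (torch.cuda.Stream(device=device) for _ in range(3))
    with torch.cuda.stream(s1):
        x = stage1(batch)
    s2.wait_stream(s1)                  # stage 2 consumes stage 1 output
    with torch.cuda.stream(s2):
        x = stage2(x)
    s3.wait_stream(s2)                  # stage 3 consumes stage 2 output
    with torch.cuda.stream(s3):
        rays = stage3(x)
    torch.cuda.current_stream(device).wait_stream(s3)
    return rays

# Each of the two GPUs would process half of the data with render_half(...); the two halves
# are then gathered on the CPU and assembled into the displayed image.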
9. A human body reconstruction and rendering device for a cooperative light field and an occupied field, comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, is configured to implement the human body reconstruction and rendering method of a cooperative light field and an occupied field of any one of claims 1-8.
10. A computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the human body reconstruction and rendering method of a cooperative light field and an occupied field of any one of claims 1-8.
CN202311273506.5A 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field Pending CN117315153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311273506.5A CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311273506.5A CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Publications (1)

Publication Number Publication Date
CN117315153A true CN117315153A (en) 2023-12-29

Family

ID=89273264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311273506.5A Pending CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Country Status (1)

Country Link
CN (1) CN117315153A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745924A (en) * 2024-02-19 2024-03-22 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation
CN117745924B (en) * 2024-02-19 2024-05-14 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation

Similar Documents

Publication Publication Date Title
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN113706714B (en) New view angle synthesizing method based on depth image and nerve radiation field
Gadelha et al. 3d shape induction from 2d views of multiple objects
Musialski et al. A survey of urban reconstruction
Wang et al. Neuris: Neural reconstruction of indoor scenes using normal priors
CN104616345B (en) Octree forest compression based three-dimensional voxel access method
DE102019130889A1 (en) ESTIMATE THE DEPTH OF A VIDEO DATA STREAM TAKEN BY A MONOCULAR RGB CAMERA
CN105453139A (en) Sparse GPU voxelization for 3D surface reconstruction
CN106023147B (en) The method and device of DSM in a kind of rapidly extracting linear array remote sensing image based on GPU
WO2022198684A1 (en) Methods and systems for training quantized neural radiance field
CN114998515A (en) 3D human body self-supervision reconstruction method based on multi-view images
CN112991537B (en) City scene reconstruction method and device, computer equipment and storage medium
CN110517352A (en) A kind of three-dimensional rebuilding method of object, storage medium, terminal and system
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN115428027A (en) Neural opaque point cloud
Condorelli et al. A comparison between 3D reconstruction using nerf neural networks and mvs algorithms on cultural heritage images
CN116721210A (en) Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field
Yuan et al. Neural radiance fields from sparse RGB-D images for high-quality view synthesis
Liu et al. Creating simplified 3D models with high quality textures
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Tao et al. LiDAR-NeRF: Novel lidar view synthesis via neural radiance fields
Gu et al. Ue4-nerf: Neural radiance field for real-time rendering of large-scale scene
Song et al. Harnessing low-frequency neural fields for few-shot view synthesis
CN116071484B (en) Billion-pixel-level large scene light field intelligent reconstruction method and billion-pixel-level large scene light field intelligent reconstruction device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination