CN117315153A - Human body reconstruction and rendering method and device for cooperative light field and occupied field - Google Patents


Info

Publication number
CN117315153A
CN117315153A (application No. CN202311273506.5A)
Authority
CN
China
Prior art keywords: depth, color, ray, point, image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311273506.5A
Other languages
Chinese (zh)
Inventor
许威威
董政
高耀安
鲍虎军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311273506.5A priority Critical patent/CN117315153A/en
Publication of CN117315153A publication Critical patent/CN117315153A/en
Pending legal-status Critical Current

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06T15/005 General purpose rendering architectures
    • G06T15/04 Texture mapping
    • G06T2207/10012 Stereo images
    • G06T2207/10024 Color image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Generation (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a human body reconstruction and rendering method and device in which a light field and an occupancy field work cooperatively; the method can be applied directly to newly captured subjects. It integrates the PIFu and NeRF representations, letting the two respectively predict the occupancy field and the light field of the human body and cooperate with each other. PIFu relies on denoised depth images, which allows the model to introduce human-body priors and assists the novel-view rendering task. The invention introduces a network model, SRONet, that handles human geometry and rendering simultaneously and uses the occupancy field to assist light-field rendering. During training, both geometric and color supervision signals are applied to the network model to enhance its ability to capture high-quality texture details. The invention further introduces a ray up-sampling method based on neural feature fusion to efficiently up-sample low-resolution images to the target resolution, a depth image denoising model, and a two-layer tree data structure built from the denoised depth images for efficient sampling of ray rendering points.

Description

Human body reconstruction and rendering method and device for cooperative light field and occupied field
Technical Field
The invention relates to the fields of computer three-dimensional vision reconstruction and computer graphics rendering, and in particular to a human body reconstruction and rendering method and device based on a cooperative light field and occupancy field.
Background
Human three-dimensional reconstruction and free-viewpoint video creation for humans are important components of many applications, for example virtual reality and augmented reality, distance education, and virtual meetings. To provide an immersive experience, these applications need to capture high-quality human models through consumer-level capture devices in as close to real time as possible and to render them from free viewpoints. Recently, neural implicit representations have been widely used in human performance capture systems. The pixel-aligned implicit function (PIFu) can efficiently reconstruct the mesh and texture of a dynamic three-dimensional human model: the surface model is extracted from the trained implicit occupancy field, and the texture is obtained by predicting RGB values at surface points of the model with a trained network. The neural radiance field (NeRF) is a coordinate-based implicit network model that encodes volume density and a color field. This representation is popular because NeRF can render photorealistic images when points are sampled densely. However, for three-dimensional human reconstruction both representations have problems. First, results based on surface texture coloring (PIFu) are often blurred, cannot produce view-dependent rendering effects, and cannot handle translucent materials such as hair. Second, the rendering speed of the neural radiance field (NeRF) is often inadequate for real-time scenarios, and its generalization ability is poor. Even the latest generalizable NeRF variants cannot effectively reconstruct target objects outside the training set from sparse views; for new target objects, higher rendering quality is usually achieved only after online optimization. It therefore remains a challenge to provide a method that can perform high-quality free-viewpoint human rendering from sparse view inputs of consumer-level devices.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a human body reconstruction and rendering method in which a light field and an occupancy field work cooperatively; the method generalizes directly to rendering of unseen target humans and has low rendering latency.
When a PIFu based on depth data and the NeRF representation work cooperatively, NeRF can generate high-quality texture results using global light-field information, and the geometric ambiguity or noise in those texture results can be further reduced by the PIFu occupancy field (owing to the geometric constraint it introduces), which helps to improve the quality of surface reconstruction. In addition, when PIFu and NeRF cooperate, rendering quality places high accuracy requirements on depth, because the surface points have the highest contribution weight to the ray color; when the PIFu input relies only on features of the input RGB images, the geometric quality of reconstruction from sparse views cannot be guaranteed. Once accurate depth information is introduced, the geometric surface field constructed by PIFu can be constrained so that the image features of the light-field model are better aligned with the surface model, which facilitates NeRF learning.
Based on the above observations, the method of the present invention first introduces a novel depth image denoising model that follows the UNet architecture and outputs optimized depth images to reduce depth noise and fill holes in the original depth images. Then a novel network model, SRONet, is introduced to model the human body by combining an occupancy field with a light field. Specifically, SRONet predicts the human occupancy field from pixel-aligned denoised depth-map features to reconstruct the human body, and predicts the light field from geometric features and pixel-aligned RGB features to render view-dependent human textures. During training, geometric and color supervision signals work cooperatively to enhance the ability of SRONet to capture high-quality details in reconstruction and rendering. In addition, the invention constructs a novel two-layer tree structure from the denoised depth maps for compact geometric storage, fast ray-voxel intersection and three-dimensional point sampling. The invention also constructs a ray up-sampling method based on neural feature fusion, which can render novel views at 1K resolution with a small computational cost.
In order to achieve the above, the technical scheme of the present invention is as follows: a human body reconstruction and rendering method with a cooperative light field and occupancy field, the method comprising the steps of:
s1: construction of depth image denoising model F based on convolutional neural network structure d The input of the model comprises a human RGB image I and an original depth image D, and the input is output as a denoised depth image D rf The method comprises the steps of carrying out a first treatment on the surface of the Training a model through depth, normal and three-dimensional consistency loss functions, and using the model after training to perform a depth denoising task;
s2: based on the denoised multi-view depth image and camera parameters of each view, fusing a global human body point cloud, constructing a data structure of a two-layer tree from the point cloud, and storing father and son nodes of the tree as a global list; introducing voxel post-processing operation in the reasoning process to denoise the tree structure, and only preserving voxels near the surface of the human body;
s3: constructing a collaborative network model SRONet for reconstructing human body grids and rendering new view images, wherein the network model comprises two sub-network models, namely an occupied field network OCCNet and a light field network ColorNet; for a given input: multi-view RGBD imageAnd sampling points x and observing view angles d in the three-dimensional space, wherein N is the number of the view angles, and the sub-network model respectively realizes the following tasks:
Multi-view depth image for a given inputAnd sampling the point x, OCCNet predicts a voxel occupancy o for the point x based on a hidden function of pixel alignment (PIFU) x ∈[0,1]Representing the probability that the point is inside the human mesh model;
given input ofMulti-eye RGB image { I } i } i=1,...N ColorNet predicts the color vector c of RGB three channels in the viewing angle direction d for point x based on PIFU x
Occupancy o based on point x prediction x Calculating a human body grid model from the estimated occupied field by using an equivalent cube search algorithm marking cubes so as to reconstruct a three-dimensional human body; training SRONet by utilizing a multi-eye human body data set and combining a geometric-color cooperative loss function and a depth error loss function, wherein the model training is used for predicting an occupied field and a light field after finishing;
s4: calculating rays for each pixel point in the new view angle image according to corresponding camera parameters, and performing a ray Voxel intersection process according to a Voxel Traversal (Voxel Traversal) algorithm to determine voxels in the tree data structure that intersect the rays; recording depth values of far and near intersection points of all the intersecting voxels, designing sampling weight edge ray sampling points in the voxels according to the size of the voxels, and calculating the occupation value and color of the sampling points according to S3; calculating color fusion weight according to the occupation value by utilizing a volume rendering formula so as to fuse the color of each sampling point on the light ray and calculate the final color of the light ray;
S5: upsampling each ray to improve the resolution and quality of the rendered image; constructing a feature fusion network, wherein the input of the network is information shared by each sub-ray, and the information comprises original color, depth, ray features and input RGB images of two adjacent visual angles; outputting the final color of each sub-ray; the feature fusion network is trained through ray-by-ray color errors, structural similarity and feature loss functions, and the training is used for the ray up-sampling operation after the training is completed so as to obtain a target resolution rendered image.
Further, the depth denoising process is specifically designed as follows:
(1) The region of the person in the original input image is extracted with the BackgroundMatting-v2 algorithm as the human-region mask image ψ, and the RGBD images I and D of the person region are obtained at the same time; the input RGBD images are normalized to the interval [-1,1] and the depth maximum is recorded; the normalized RGB and depth images are concatenated as the input of the depth image denoising model F_d;
(2) F_d follows a UNet structure: two independent feature extraction networks (HRNetV2-W18-Small-v2) encode the RGB and depth images respectively, an atrous spatial pyramid pooling module (ASPP) and a residual attention module (ResCBAM) fuse the RGB and depth features, and the fused features are up-sampled and fed back into the feature extraction networks;
(3) The model F_d outputs an image in the interval [-1,1]; it is de-normalized to the original value range using the depth maximum recorded in (1) and post-processed with ψ so that only the depth values of the person region are kept, giving D_rf;
(4) The loss functions used to train model F_d are as follows:
Depth consistency loss L_D: penalizes the pixel-wise deviation between the real depth image D_i^* and the predicted depth image D_i^rf at view i, defined as
L_D = Σ_i Σ_p ‖D_i^rf(p) − D_i^*(p)‖_1
Normal consistency loss L_N: penalizes the deviation between the normal directions computed from D_i^rf and D_i^*, defined as
L_N = Σ_i Σ_p ‖1 − ⟨N_i^rf(p), N_i^*(p)⟩‖_2
where D_i^rf(p) and N_i^rf(p) denote the depth value and normal vector at pixel p of the predicted depth image D_i^rf at view i and of the normal map N_i^rf computed from it, D_i^*(p) and N_i^*(p) denote the corresponding true depth value and normal vector, ‖·‖_1 denotes the L1 loss function, ‖·‖_2 denotes the L2 loss function, and ⟨·,·⟩ denotes the vector inner product operation;
Three-dimensional consistency loss L_P: further constrains the consistency between the point cloud fused from {D_i^rf} and the real point cloud, so as to reduce depth-fusion noise when constructing the two-layer tree structure; it is defined as
L_P = CD(P^rf, P_gt), with P^rf = F({D_i^rf}, {K_i, RT_i})
where P^rf denotes the point cloud fused from {D_i^rf}, F is the truncated signed distance field fusion algorithm TSDF-Fusion, K_i and RT_i are the intrinsic and extrinsic parameters of camera i, P_gt is a point cloud sampled from the true three-dimensional human mesh model, and CD denotes the Chamfer distance loss function;
The total loss for depth denoising is expressed as L = L_D + λ_N·L_N + λ_P·L_P, where λ_N and λ_P are loss function weights; the loss is optimized with the ADAM algorithm.
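A minimal PyTorch-style sketch of this combined denoising loss is given below; the tensor shapes, the precomputed normals and the brute-force Chamfer term are assumptions made for illustration, not the patented implementation.

```python
import torch

def depth_denoising_loss(d_pred, d_gt, n_pred, n_gt, pc_fused, pc_gt,
                         lambda_n=0.5, lambda_p=0.01):
    """Sketch of L = L_D + lambda_N * L_N + lambda_P * L_P.

    d_pred, d_gt:     (B, 1, H, W) predicted / true depth images
    n_pred, n_gt:     (B, 3, H, W) normals computed from the depth maps
    pc_fused, pc_gt:  (B, M, 3) fused / ground-truth point clouds
    """
    # depth consistency: pixel-wise L1
    l_d = (d_pred - d_gt).abs().mean()

    # normal consistency: penalize deviation of the inner product from 1
    cos = (n_pred * n_gt).sum(dim=1)                     # (B, H, W)
    l_n = ((1.0 - cos) ** 2).mean()

    # three-dimensional consistency: symmetric Chamfer distance
    dist = torch.cdist(pc_fused, pc_gt)                  # (B, M, M)
    l_p = dist.min(dim=2).values.mean() + dist.min(dim=1).values.mean()

    return l_d + lambda_n * l_n + lambda_p * l_p
```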
Further, the data structure of the two-layer tree is specifically designed as follows:
(1) The denoised depth images of all views are fused with the truncated signed distance field fusion algorithm TSDF-Fusion, giving the global human point cloud P_rf and the fusion cube V_tsdf; V_tsdf is binarized into V_occ, whose resolution may be set to 128³ but is not limited thereto; the occupancy value V_occ(x) at spatial position x is
V_occ(x) = 1 if |V_tsdf(x)| < α·s_v, and 0 otherwise,
where s_v is the voxel size, α is the truncated signed distance threshold, and V_tsdf(x) is the truncated signed distance of V_tsdf at spatial position x;
(2) Voxel post-processing: using the OCCNet of S3, a cube of the set resolution is quickly constructed based on the real-time human reconstruction algorithm RealtimePIFu; it is binarized and fused with the V_occ constructed in (1) to eliminate the floating noise voxels in V_occ, giving the denoised cube; the fused occupancy value at spatial position x combines V_occ(x) with the binarized reconstruction cube through a logical OR, where B_β(·) is a binarization function based on the threshold β, γ is the threshold used to filter floating voxels, and | denotes the OR operation;
(3) The voxels of the denoised cube whose occupancy value is 1 are marked as valid voxels; parent voxels are fused at a preset number ratio (for example 64:1, i.e. every 4×4×4 block of valid voxels corresponds to one valid parent voxel), and all valid voxels are stored as a global list L_v, each node corresponding to a voxel and recording the index (list position), spatial position and size of its parent or child nodes; an index cube V_idx is constructed to store the index value of each valid parent node in the global list, with the index value of invalid parent nodes set to -1; constructing V_idx facilitates the efficient computation of ray-voxel intersection in S4.
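As an illustration of this storage scheme, a minimal NumPy sketch is given below; the Node fields, the grid origin/size handling and the resolution are assumptions for illustration, not the patented implementation.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    index: int                # position in the global list L_v
    center: np.ndarray        # voxel center in world coordinates
    size: float               # voxel edge length
    parent: int = -1          # parent index in the list (-1 for parent voxels)
    children: list = field(default_factory=list)

def build_two_layer_tree(occ, origin, voxel_size, ratio=4):
    """occ: (R, R, R) binary occupancy cube (e.g. R = 128); child voxels are
    grouped ratio^3 : 1 under each parent (64:1 for ratio = 4)."""
    R = occ.shape[0]
    nodes = []
    v_idx = np.full((R // ratio,) * 3, -1, dtype=np.int64)   # index cube V_idx
    parent_occ = occ.reshape(R // ratio, ratio, R // ratio, ratio,
                             R // ratio, ratio).any(axis=(1, 3, 5))
    for px, py, pz in zip(*np.nonzero(parent_occ)):
        p_center = origin + (np.array([px, py, pz]) + 0.5) * voxel_size * ratio
        parent = Node(len(nodes), p_center, voxel_size * ratio)
        v_idx[px, py, pz] = parent.index
        nodes.append(parent)
        block = occ[px*ratio:(px+1)*ratio,
                    py*ratio:(py+1)*ratio,
                    pz*ratio:(pz+1)*ratio]
        for cx, cy, cz in zip(*np.nonzero(block)):
            c_center = origin + (np.array([px*ratio+cx, py*ratio+cy,
                                           pz*ratio+cz]) + 0.5) * voxel_size
            child = Node(len(nodes), c_center, voxel_size, parent=parent.index)
            parent.children.append(child.index)
            nodes.append(child)
    return nodes, v_idx
```

Keeping parent and child voxels in one flat list makes every node addressable by a single integer, which is what the ray-voxel intersection in S4 relies on.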
Further, the specific design of the SRONet is as follows:
(1) OCCNet: the occupancy field is based on depth information, and the feature encoder HRNetV2-W18 is used to encode the depth images; for a sampling point x, OCCNet predicts the occupancy value o_x by aggregating the pixel-aligned depth features of this point from each view; the occupancy field is defined as the function
o_x = f_2( Avg( { f_1( W_i(x), c_i(x) ) }_{i=1,…,N} ) )
where W_i is the depth feature map of view i after encoding, W_i(x) is the depth feature obtained by projecting point x into W_i at view i, and c_i(x) comprises the depth value of the projection of x in the camera coordinate system of view i and the truncated signed distance; the hidden function f_1, represented by a fully connected network, produces the geometric feature of each view, the fused global feature is obtained through the average pooling operation Avg, and the global feature is fed into a second hidden function f_2 to compute the occupancy value o_x;
(2) ColorNet: the light field is based on color features and geometric features; the same feature encoder as in OCCNet is used to encode the RGB images and obtain color features; for the sampling point x, ColorNet takes the additional inputs, namely the viewing direction d and the geometric feature g_x, and aggregates the color features of each view to predict the view-dependent color vector c_x, where the geometric feature is expressed as g_x = f_3(Avg({f_1(W_i(x), c_i(x))})), f_3 being a hidden function used for encoding; the light field is defined as the function
c_x = f_5( H( { f_4( M_i(x), rgb_i(x), g_x, d_i ) }_{i=1,…,N} ) )
where M_i is the color feature map of view i after encoding, M_i(x) and rgb_i(x) are the color feature and pixel color obtained by projecting point x into M_i at view i, f_4 and f_5 are hidden functions that further process the features, H is a feature fusion function implemented with a Transformer whose basic feature fusion unit uses Hydra Attention, and d_i = R_i·d is the viewing direction in the camera coordinate system, R_i being the rotation matrix of the camera extrinsics at view i;
In this way SRONet predicts the occupancy value and the view-dependent color for the sampling point x.
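A minimal PyTorch-style sketch of this per-point aggregation is shown below; the layer widths, the use of mean pooling for Avg, and the substitution of standard multi-head attention for Hydra Attention are simplifying assumptions, not the actual SRONet architecture.

```python
import torch
import torch.nn as nn

class SRONetHeads(nn.Module):
    """Sketch of the OCCNet / ColorNet aggregation for one sampling point.

    Inputs per view i (already pixel-aligned by projecting x):
      depth_feat: (B, N, Fd)  W_i(x)
      depth_cond: (B, N, 2)   c_i(x): projected depth + truncated signed distance
      color_feat: (B, N, Fc)  M_i(x)
      rgb:        (B, N, 3)   rgb_i(x)
      view_dir:   (B, N, 3)   d_i = R_i d
    """
    def __init__(self, fd=64, fc=64, hidden=128):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(fd + 2, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden))
        self.f2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())
        self.f3 = nn.Linear(hidden, hidden)
        self.f4 = nn.Linear(fc + 3 + hidden + 3, hidden)
        self.fuse = nn.MultiheadAttention(hidden, 4, batch_first=True)  # stand-in for Hydra Attention
        self.f5 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, depth_feat, depth_cond, color_feat, rgb, view_dir):
        per_view = self.f1(torch.cat([depth_feat, depth_cond], dim=-1))  # (B, N, H)
        global_feat = per_view.mean(dim=1)                               # Avg over views
        occ = self.f2(global_feat)                                       # o_x in [0, 1]
        g = self.f3(global_feat).unsqueeze(1).expand_as(per_view)        # geometric feature
        tokens = self.f4(torch.cat([color_feat, rgb, g, view_dir], dim=-1))
        fused, _ = self.fuse(tokens, tokens, tokens)                     # per-view fusion
        color = self.f5(fused.mean(dim=1))                               # c_x
        return occ, color
```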
Further, the ray-voxel intersection and point sampling in S4 are specifically designed as follows:
(1) For an emitted ray l, the intersected parent nodes along l are detected with the voxel traversal algorithm, and the depth values of the intersection points with the intersected parent voxels are recorded from the view of l; the far and near intersection depths are denoted D_far and D_near respectively; for each intersected parent voxel, the voxel traversal algorithm continues to detect the intersected child voxels, whose far and near depth values are recorded as D'_far and D'_near; all recorded depth values are collected into the global lists, denoted here D̃_far and D̃_near;
(2) The number of sampling points is distributed between the near and far intersection points of each voxel; the sampling weight w_i of the i-th voxel is computed as
w_i = s_i·(d_far(i) − d_near(i)) / Σ_{k=1}^{N_v} s_k·(d_far(k) − d_near(k))
where d_far(i) and d_near(i) are the far and near depths of voxel i in D̃_far and D̃_near, N_v is the number of all intersected voxels, and s_i is the scale of voxel i; the parent and child voxel scales may be set to 1 and 4 respectively so that more points are assigned to the child voxels; specifically, the number of sampling points m_i inside voxel i is
m_i = ⌊ w_i · M ⌋
where M is the total number of sampling points and ⌊·⌋ denotes rounding down; during sampling, one sampling point is first allocated to each voxel i along the ray direction, so that the voxel closest to the camera on the ray always receives a sample; if Σ_i m_i < M, the remaining M − Σ_i m_i points are re-allocated in sampling order; the depth value d_i(j) of the j-th sampling point in voxel i is obtained by placing the m_i points uniformly between d_near(i) and d_far(i),
d_i(j) = d_near(i) + (j + 0.5)·(d_far(i) − d_near(i)) / m_i
with j starting from 0; the sampling point x is computed as x = P_cam + d_i(j)·d, where P_cam is the camera position and d is the viewing direction.
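A small sketch of the per-voxel point allocation described above is given below (Python/NumPy); the uniform in-voxel placement and the handling of leftover points are illustrative assumptions consistent with the description rather than the exact patented formulas.

```python
import numpy as np

def sample_along_ray(near, far, scale, cam_pos, ray_dir, total_points=48):
    """near, far, scale: (Nv,) arrays for the intersected voxels, ordered
    along the ray; returns the sample positions, shape (sum(m_i), 3)."""
    seg = far - near
    w = seg * scale
    w = w / w.sum()                                  # sampling weight per voxel
    m = np.floor(w * total_points).astype(int)
    m = np.maximum(m, 1)                             # at least one point per voxel
    i = 0
    while m.sum() < total_points:                    # hand out the leftovers
        m[i % len(m)] += 1
        i += 1
    depths = np.concatenate([
        near[v] + (np.arange(m[v]) + 0.5) * seg[v] / m[v]
        for v in range(len(m))])
    return cam_pos[None, :] + depths[:, None] * ray_dir[None, :]
```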
Further, the volume rendering of S4 and the loss functions used to train the SRONet model of S3 are specifically designed as follows:
(1) To compute the final color of ray l, a normalized-surface (UniSurf) volume rendering technique is used: from the occupancy value o_x of each sampling point on l, color fusion weights are computed and the point colors c_x are fused into the ray color
Ĉ(l) = Σ_i ω_x(i)·c_x(i)
where the color fusion weight of the i-th sampling point is ω_x(i) = o_x(i)·Π_{j<i}(1 − o_x(j)); at the same time, the depth values d_i of the sampling points are weighted in the same way to obtain the depth value at the intersection of ray l with the human surface, D̂(l) = Σ_i ω_x(i)·d_i (a code sketch of this compositing follows this subsection);
(2) Based on the estimated ray color Ĉ(l) and depth value D̂(l), the following loss functions are designed to train and optimize the reconstruction and rendering parts of SRONet:
Geometric-color cooperative loss L_syn: following the PIFu supervision scheme, OCCNet is trained with sampling points y in space by penalizing the error between the estimated occupancy value o_y and the true occupancy value o_y^*; at the same time the per-ray error between the true color C^*(l) and the estimated color Ĉ(l) is penalized, and the two loss terms work cooperatively:
L_syn = μ_o·(1/|S|)·Σ_{y∈S} BCE(o_y, o_y^*) + μ_c·(1/|R|)·Σ_{l∈R} ‖Ĉ(l) − C^*(l)‖_1
where S and R denote the set of sampling points and the set of rays respectively, BCE denotes the cross-entropy loss function, ‖·‖_1 denotes the L1 loss function, and μ_o and μ_c are the weights of the occupancy loss term and the color loss term;
Depth error loss L_D′: penalizes the error between the estimated ray depth value D̂(l) and the true depth value D^*(l) to further improve the reconstruction and rendering details:
L_D′ = (1/|R|)·Σ_{l∈R} ‖D̂(l) − D^*(l)‖_2
where ‖·‖_2 denotes the L2 loss function;
The loss function for SRONet is expressed as L = L_syn + λ_D′·L_D′, where λ_D′ is a balancing term; the loss is optimized with the ADAM algorithm.
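As referenced in step (1) above, the occupancy-based compositing can be sketched in PyTorch as follows; the tensor layout is an assumption made for illustration.

```python
import torch

def composite_ray(occ, color, depth):
    """occ:   (R, M)    occupancy o_x(i) of the M samples on each of R rays
    color: (R, M, 3) per-sample colors c_x(i)
    depth: (R, M)    per-sample depths d_i
    Returns the ray colors, ray depths and fusion weights."""
    # transmittance up to (but excluding) sample i: prod_{j<i} (1 - o_x(j))
    trans = torch.cumprod(
        torch.cat([torch.ones_like(occ[:, :1]), 1.0 - occ[:, :-1]], dim=1), dim=1)
    w = occ * trans                                   # omega_x(i)
    ray_color = (w.unsqueeze(-1) * color).sum(dim=1)  # C_hat(l)
    ray_depth = (w * depth).sum(dim=1)                # D_hat(l)
    return ray_color, ray_depth, w
```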
Further, the ray up-sampling in S5 is specifically designed as follows:
(1) Ray up-sampling: for the ray l passing through pixel location (x, y), its color Ĉ(l), depth value D̂(l) and ray fusion feature ft_color are scattered to 4 sub-pixels at the positions (x, y), (x+0.5, y), (x, y+0.5) and (x+0.5, y+0.5), where ft_color is the output feature of the feature fusion function H; this produces a coarse up-sampling result;
(2) Feature fusion operation: the result of (1) is enhanced using the original-resolution RGB images of the two views n_0, n_1 adjacent to the target view; specifically, the two adjacent RGB images are encoded with a UNet; each sub-ray corresponds to one sub-pixel, and the surface position p hit by the sub-ray is computed from its depth value; the point p is projected into the adjacent images and feature maps to obtain the colors c_{n_0}, c_{n_1} and the features ft_{n_0}, ft_{n_1}, and the visibilities v_{n_0}, v_{n_1} are computed, the visibility at view i being determined from the difference between z_i, the depth value of the projection of p at view i, and the depth value read from the denoised depth image D_i^rf at view i, weighted by the preset visibility coefficient σ_v; the feature fusion network then computes fusion weights that blend the three colors of the sub-ray, namely the scattered color and the two adjacent-view colors, to obtain the final color of the sub-ray; it is defined with a hidden function f_6 that processes the features and takes the original-resolution RGB images of the adjacent views n_0, n_1 as input;
(3) The loss functions used to train the feature fusion network are:
Per-ray color error and structural similarity loss L_B: penalizes the error between the sampled color block B̂_r and the true color block B_r^*, defined as
L_B = (1/|R|)·Σ_{r∈R} [ μ_1·‖B̂_r − B_r^*‖_1 + μ_2·(1 − SSIM(B̂_r, B_r^*)) ]
where R denotes the set of rays, B̂_r is a color block of size S_patch × S_patch whose entry (i, j) is the final estimated color of ray r at (i, j), ‖·‖_1 denotes the L1 loss function, SSIM denotes the structural similarity function, and μ_1, μ_2 are loss balancing terms;
Feature loss function L_ft: penalizes the feature error between the color blocks B̂_r and B_r^* to further enhance the quality of the rendered image; it is computed with a pretrained VGG-16 network, using the 3 feature maps fed into the first three max-pooling layers (MaxPool2d):
L_ft = (1/|R|)·Σ_{r∈R} Σ_{k=1}^{3} ‖φ_k(B̂_r) − φ_k(B_r^*)‖_1
where φ_k denotes the k-th of these VGG-16 feature maps and ‖·‖_1 is the L1 loss between VGG features;
The loss function for the feature fusion network is expressed as L = L_B + μ_vgg·L_ft, where μ_vgg is a balancing term; the network parameters are trained independently with the ADAM algorithm.
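A minimal sketch of the 2x ray scattering in step (1) is shown below (PyTorch); the tensor shapes are assumptions, and the subsequent neural-fusion refinement with adjacent-view colors and visibilities is omitted.

```python
import torch
import torch.nn.functional as F

def scatter_rays_2x(color, depth, feat):
    """color: (3, H, W), depth: (1, H, W), feat: (C, H, W) per-ray outputs.
    Each ray at (x, y) shares its color, depth and fused feature with its
    4 sub-pixels, giving a coarse 2H x 2W result that the feature fusion
    network then refines."""
    up = lambda t: F.interpolate(t.unsqueeze(0), scale_factor=2,
                                 mode="nearest").squeeze(0)
    return up(color), up(depth), up(feat)
```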
Further, the specific design for the parallel accelerated rendering flow is as follows:
(1) The rendering process is accelerated with two GPUs: each graphics card processes half of the data (images and rays), the two batches of data are synchronized through CPU memory, and the rendering process is accelerated with a pipeline;
Specifically, the pipeline is divided into 3 parts, and each GPU is accelerated with 3 separate data streams: 1. the I/O process (CPU to GPU) and the depth denoising process; 2. construction of the two-layer tree data structure, encoding of the multi-view RGB and depth images in SRONet, and image encoding in the ray up-sampling; 3. computation of the occupancy values and colors of the ray sampling points with SRONet, and ray up-sampling based on the feature fusion network; finally, all computed rays are converted into an image for display;
(2) The depth image denoising model, the multi-view RGB and depth image encoding in SRONet, and the image encoding in the ray up-sampling are quantized to half precision and accelerated with TensorRT; all hidden functions and the Transformer-based feature fusion functions are accelerated with a fully-fused scheme through GPU shared memory; the accelerated model can render a novel-view image at 1K resolution in around 100 ms.
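The two-GPU split can be sketched with PyTorch CUDA streams as below; the simple half-and-half ray split and the run_stage callable are assumptions for illustration, not the patented three-stage pipeline.

```python
import torch

def render_on_two_gpus(rays, run_stage, devices=("cuda:0", "cuda:1")):
    """rays: (R, 6) ray origins+directions on the CPU; run_stage(chunk) is a
    placeholder that runs the denoising / SRONet / up-sampling stages for one
    chunk on its device and returns per-ray colors."""
    halves = rays.chunk(2)
    outputs, streams = [None, None], []
    for i, (dev, chunk) in enumerate(zip(devices, halves)):
        stream = torch.cuda.Stream(device=dev)
        streams.append(stream)
        with torch.cuda.device(dev), torch.cuda.stream(stream):
            outputs[i] = run_stage(chunk.to(dev, non_blocking=True))
    for dev in devices:
        torch.cuda.synchronize(dev)               # wait for both halves
    # gather both halves back through CPU memory and assemble the image
    return torch.cat([o.cpu() for o in outputs], dim=0)
```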
The invention also provides a human body reconstruction and rendering device of the collaborative light field and the occupied field, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the processors are used for realizing the human body reconstruction and rendering method of the collaborative light field and the occupied field when executing the executable codes.
The invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the above-described human body reconstruction and rendering method of collaborative light fields and occupied fields.
The applicant trained the model designed above on a synthetic human dataset and tested it on a public dataset. Compared with other methods, the proposed method achieves better results. In summary, the invention has the following main beneficial effects:
(1) A novel human capture method is provided that can create 1K-resolution free-viewpoint video from sparse RGBD input views in about 100 ms. The method generalizes to unseen performers without further optimization.
(2) A hybrid representation is presented, together with the novel network model SRONet. Based on the depth denoising result, the model uses pixel-aligned RGBD features and makes the occupancy field and the light field cooperate to perform accurate human body reconstruction and rendering.
(3) A two-layer tree-based data structure is proposed for effective point sampling, a neural fusion-based ray upsampling technique for rapidly increasing the resolution of rendered images, and a parallel computing pipeline for rendering acceleration.
Drawings
Fig. 1 is a schematic diagram of the human body reconstruction and rendering method with a cooperative light field and occupancy field in an embodiment of the present invention. Given RGBD stream images captured by sparse Azure Kinect sensors as input, (1) the depth image denoising model removes depth noise from the input RGBD images and fills the holes of the original depth; (2) given the denoised depth images, a novel two-layer tree structure is reconstructed to store the global geometric information in discretized form; (3) efficient ray-voxel intersection and ray point sampling are performed for rendering; (4) the novel network model SRONet makes the light field and the occupancy field cooperate to perform human reconstruction and free-viewpoint rendering; (5) the output of SRONet is raised to the target resolution by the neural-fusion-based ray up-sampling technique.
FIG. 2 is the specific process of building the two-layer tree structure in an embodiment of the present invention, comprising: (1) converting the denoised depth images into a whole-body point cloud; (2) building a cube from the point cloud, with the voxels stored on the GPU; (3) merging small voxels at a 64:1 ratio into parent voxels to construct the two-layer tree structure, and converting the voxels into a node list for storage; (4) at inference time, using OCCNet to quickly build a cube based on the real-time human reconstruction algorithm RealtimePIFu, which is binarized and fused with the first-layer cube V_occ of the tree structure to eliminate floating noise voxels, giving the denoised cube.
FIG. 3 is a specific process of ray-voxel intersection and ray-point sampling in an embodiment of the invention. A voxel traversal method is used to determine parent-child voxels in the tree structure that intersect the ray.
Fig. 4 is a block diagram of the proposed SRONet and its supervision signals. SRONet consists of OCCNet and ColorNet and predicts the occupancy value and color of the input point; the geometric and color loss functions supervise the training cooperatively.
Fig. 5 shows the proposed ray up-sampling and neural fusion process. The sub-rays share the color, depth and features of the principal emitted ray; for each sub-pixel, the neural fusion takes the adjacent-view features, the viewing direction and the visibility as input and predicts weights to fuse the sub-pixel color with the adjacent-view colors.
Detailed Description
The following description of the main module embodiments of the present invention is provided in connection with the accompanying drawings and with the validity verification experiments of the invention, to facilitate a better understanding by those skilled in the art.
As shown in fig. 1 (2) and fig. 2, the two-layer tree data structure of the invention is constructed from the denoised depth point cloud. Constructing the two-layer tree structure effectively limits the rendering range to the vicinity of the three-dimensional human surface and thus acts as a geometric constraint. In addition, the tree structure is stored on the GPU, which facilitates fast voxel lookup for ray-voxel intersection and efficient point sampling inside voxels. The tree is designed with two layers mainly because: (1) regions invisible in the point cloud cannot be covered by leaf-layer voxels, which would cause losses in rendering; (2) the two-layer structure covers the whole body region with large voxels, and a higher number of layers would add overhead without improving performance. For storage, the parent and child voxels of the tree structure are converted into a global list L_v on the GPU, and each node stores its index, size, center position, parent (child) voxel indices and other information.
As shown in fig. 1 (2) and 3, given an emitted ray l, a voxel traversal algorithm (fig. 3 (c), (d)) is first used to detect the valid parent voxels that intersect l and to record the depth values of the ray-voxel intersections at the target view t. In the example of fig. 3 (c), there are 4 valid voxels that intersect ray l, and the recorded near and far depth values are D_near = d({v_i}_{i=0,1,2,4,5}, RT_t) and D_far = d({v_i}_{i=1,2,3,5,6}, RT_t), where d(·, RT_t) is the projection function that gives the depth value of a three-dimensional point at view t, and RT_t is the camera extrinsic matrix at view t. Then the voxel traversal algorithm continues to detect all valid child voxels and records their near and far depth values as the lists D'_near and D'_far, as shown in fig. 3 (d). Finally, all near and far depths are merged into global lists, recorded as D̃_near and D̃_far, as the final ray-voxel intersection result, which determines the point sampling ranges. Point sampling is performed within all recorded near-far depth pairs, e.g. points are picked in {v_0, v_1} and {q_0, q_1}. Finally, all sampling points on ray l are used to compute the ray color.
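The parent-voxel traversal can be sketched with a standard grid-stepping (Amanatides-Woo style) loop as below (Python/NumPy); the bounds handling and the assumption that the ray origin lies inside the grid are simplifications for illustration.

```python
import numpy as np

def traverse_voxels(origin, direction, v_idx, grid_origin, voxel_size):
    """Walk a ray through the parent-level index cube v_idx (R, R, R) and
    return the list indices of the valid parent voxels it crosses
    (assumes the ray origin is already inside the grid)."""
    direction = np.where(np.abs(direction) < 1e-9, 1e-9, direction)
    cell = np.floor((origin - grid_origin) / voxel_size).astype(int)
    step = np.where(direction > 0, 1, -1)
    next_bound = grid_origin + (cell + (step > 0)) * voxel_size
    t_max = (next_bound - origin) / direction      # distance to the next boundary
    t_delta = voxel_size / np.abs(direction)       # distance between boundaries
    hits = []
    while np.all((cell >= 0) & (cell < v_idx.shape[0])):
        idx = v_idx[tuple(cell)]
        if idx >= 0:                               # valid parent voxel
            hits.append(int(idx))
        axis = int(np.argmin(t_max))               # step across the closest boundary
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return hits
```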
The training dataset used in the present invention consists of rendered human data.
Specifically, we use the THuman2.0 human model dataset introduced by the paper "Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors" to generate the training dataset of the present invention; it contains 500 high-quality three-dimensional human scan models. Real human RGBD images are rendered around the y-axis at 6-degree intervals with CUDA acceleration. We split this dataset into training and test sets at a 4:1 ratio. For the original depth image input, we follow the method of the paper "Kinect v2 for mobile robot navigation: Evaluation and modeling" and add sensor noise to the true rendered depth map D_gt, including noise related to the pixel depth value z given by a quadratic in z (with terms 1.5z², 1.5z and 1.375), Gaussian noise (1.5 cm on average) and holes (3 pixels wide on average), to cover as many of the possible noise cases as possible.
We implement the model of the present invention with PyTorch and CUDA. In the actual training process, two NVIDIA RTX 3090 graphics cards are used to train the depth image denoising model, SRONet and the ray up-sampling network respectively, and the ADAM optimizer is used for model parameter optimization, with the ADAM parameters β_1, β_2 set to 0.5 and 0.99. The batch size of the depth image denoising model is set to 8 and its learning rate to 1e-3, training for 10 epochs in total. The SRONet batch size is set to 4 and its learning rate to 1e-4, halved every 5 epochs, training for 20 epochs in total. The ray up-sampling network uses the same learning rate setting as SRONet, with a batch size of 2, training for 10 epochs. For the loss balancing terms, λ_N and λ_P of the depth image denoising model are set to 0.5 and 0.01 respectively; μ_o, μ_c and λ_D′ in SRONet are set to 0.5, 1.0 and 1.0 respectively; μ_1, μ_2 and μ_vgg of the ray up-sampling network are set to 0.4, 0.6 and 0.01 respectively; the visibility weight σ_v is set to 200. In addition, the hyper-parameters α and γ of the two-layer tree structure are set to 40 and 0.01 respectively. The number of sampling points M during training and inference is set to 48. The final resolution of the rendered image is set to 1K (1024 x 1024).
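The optimizer settings above can be summarized in a small PyTorch sketch; the module arguments are placeholders, and only the ADAM betas, learning rates and the SRONet schedule follow the stated configuration.

```python
import torch

def make_optimizers(denoiser, sronet, upsampler):
    """ADAM optimizers with the betas / learning rates stated above
    (the three networks are trained separately)."""
    betas = (0.5, 0.99)
    opts = {
        "denoiser": torch.optim.Adam(denoiser.parameters(), lr=1e-3, betas=betas),
        "sronet": torch.optim.Adam(sronet.parameters(), lr=1e-4, betas=betas),
        "upsampler": torch.optim.Adam(upsampler.parameters(), lr=1e-4, betas=betas),
    }
    # SRONet halves its learning rate every 5 epochs (20 epochs in total)
    sched = torch.optim.lr_scheduler.StepLR(opts["sronet"], step_size=5, gamma=0.5)
    return opts, sched
```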
To verify the effectiveness of the method of the present invention, we performed rendering quality tests on the THuman2.0 test set and on real captured data. The verification experiments compare against 6 current mainstream deep-learning-based methods on the test set: pixelNeRF (Neural Radiance Fields from One or Few Images), IBRNet (Learning Multi-View Image-Based Rendering), MPS-NeRF (Generalizable 3D Human Rendering from Multiview Images), NHP (Learning Generalizable Radiance Fields for Human Performance Rendering), NPBG++ (Accelerating Neural Point-Based Graphics) and PIFu (RGBD) (Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization), evaluated with four image similarity criteria: PSNR (peak signal-to-noise ratio), SSIM (structural similarity), LPIPS (learned perceptual image patch similarity) and MAE (mean absolute error). The comparison results are shown in Table 1 (test set) and Table 2 (real captured data):
TABLE 1
TABLE 2
Here, higher PSNR and SSIM values indicate better results, while lower LPIPS and MAE values indicate better results. The best result in each row is shown in bold and the second best is underlined. From the above results it can be seen that the proposed method (Ours) is superior to the other methods overall and is far ahead of them in average rendering time (Avg Time). This comparative experiment fully demonstrates the superiority of the method of the invention.
In addition, we performed ablation experiments on the test set (THuman2.0 Dataset) and the real captured data (Our Real Captured Dataset) to analyze the effectiveness of the loss functions introduced in SRONet (geometric loss, depth loss), the depth denoising process, the cooperative occupancy-field and light-field representation, the feature fusion operation, and the ray up-sampling operation. We removed or modified each of the above and retrained the whole network model; the experimental results are shown in Table 3:
TABLE 3
Here, w/o GT Depth denotes removing the depth consistency loss during SRONet training, and w/o GT Occ denotes removing the geometric supervision during SRONet training; w/o Denoised Depth denotes removing the depth denoising; Soft Occ.→Density denotes replacing the cooperative occupancy-field and light-field representation with the original NeRF light-field representation, i.e. OCCNet predicts the volume density of the sampling points instead of the occupancy value, and the rendering scheme used by the invention is replaced with NeRF volume rendering. OccMLP→DbMLP denotes OCCNet predicting the sampling-point volume density and occupancy value simultaneously, keeping the geometric supervision loss, and rendering with the NeRF volume rendering method. Hydra Att.→Self Att. denotes replacing the Hydra Attention module of the feature fusion function H with the Self-Attention module of the Transformer model. w/o Upsampling denotes removing the neural-fusion-based ray up-sampling module. The ablation comparison shows that each loss function or module introduced by the invention improves the overall rendering quality to a certain extent.
The embodiment of the invention also provides a human body reconstruction and rendering device of the collaborative light field and the occupied field, which comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the processors are used for realizing the human body reconstruction and rendering method of the collaborative light field and the occupied field when executing the executable codes.
The embodiment of the invention also provides a computer readable storage medium, wherein a program is stored on the computer readable storage medium, and when the program is executed by a processor, the human body reconstruction and rendering method of the cooperative light field and the occupied field is realized.
The foregoing description of the preferred embodiments is merely intended to illustrate the embodiments of the present invention and is not intended to limit the invention to the particular embodiments described.

Claims (10)

1. The human body reconstruction and rendering method for the cooperative light field and the occupied field is characterized by comprising the following steps:
s1: construction of depth image denoising model F based on convolutional neural network structure d The input of the model comprises a human RGB image I and an original depth image D, and the input is output as a denoised depth image D rf The method comprises the steps of carrying out a first treatment on the surface of the Training a model through depth, normal and three-dimensional consistency loss functions, and using the model after training to perform a depth denoising task;
s2: based on the denoised multi-view depth image and camera parameters of each view, fusing a global human body point cloud, constructing a data structure of a two-layer tree from the point cloud, and storing father and son nodes of the tree as a global list; introducing voxel post-processing operation in the reasoning process to denoise the tree structure, and only preserving voxels near the surface of the human body;
s3: constructing a collaborative network model SRONet for reconstructing human body grids and rendering new view images, wherein the network model comprises two sub-network models, namely an occupied field network OCCNet and a light field network ColorNet; for a given input: multi-view RGBD imageAnd sampling points x and observing view angles d in the three-dimensional space, wherein N is the number of the view angles, and the sub-network model respectively realizes the following tasks:
multi-view depth image for a given inputOCCNet based on pixels for sampling points xThe aligned hidden function PIFu predicts a voxel occupancy o for point x x ∈[0,1]Representing the probability that the point is inside the human mesh model;
given an input multi-view RGB image { I ] i } i=1,…N ColorNet predicts the color vector c of RGB three channels in the viewing angle direction d for point x based on PIFU x
Occupancy o based on point x prediction x Calculating a human body grid model from the estimated occupied field by using an equivalent cube search algorithm so as to reconstruct a three-dimensional human body; training SRONet by utilizing a multi-eye human body data set and combining a geometric-color cooperative loss function and a depth error loss function, wherein the model training is used for predicting an occupied field and a light field after finishing;
s4: calculating rays for each pixel point in the new view angle image according to the corresponding camera parameters, and executing a ray voxel intersection process according to a voxel traversal algorithm to determine voxels in the tree data structure which intersect the rays; recording depth values of far and near intersection points of all the intersecting voxels, designing sampling weight edge ray sampling points in the voxels according to the size of the voxels, and calculating the occupation value and color of the sampling points according to S3; calculating color fusion weight according to the occupation value by utilizing a volume rendering formula so as to fuse the color of each sampling point on the light ray and calculate the final color of the light ray;
s5: upsampling each ray to improve the resolution and quality of the rendered image; constructing a feature fusion network, wherein the input of the network is information shared by each sub-ray, and the information comprises original color, depth, ray features and input RGB images of two adjacent visual angles; outputting the final color of each sub-ray; the feature fusion network is trained through ray-by-ray color errors, structural similarity and feature loss functions, and the training is used for the ray up-sampling operation after the training is completed so as to obtain a target resolution rendered image.
2. The human body reconstruction and rendering method of a collaborative light field and an occupied field according to claim 1, characterized in that the depth denoising process is specifically designed as follows:
(1) extracting the region where the person is located in the original input image with the BackgroundMatting-v2 algorithm as the human-region mask image ψ, and obtaining the RGBD images I and D of the person region at the same time; normalizing the input RGBD images to the interval [-1,1] and recording the depth maximum; concatenating the normalized RGB and depth images as the input of the depth image denoising model F_d;
(2) the UNet-structured model F_d uses two independent feature extraction networks HRNetV2-W18-Small-v2 to encode the RGB and depth images respectively, fuses the RGB and depth features with an atrous spatial pyramid pooling module and a residual attention module, up-samples the fused features, and feeds them back into the feature extraction networks;
(3) the model F_d outputs an image in the interval [-1,1], which is de-normalized to the original value range using the depth maximum recorded in (1) and post-processed with ψ so that only the depth values of the person region are kept, denoted D_rf;
(4) the loss functions used to train the model F_d are as follows:
depth consistency loss L_D: penalizing the pixel-wise deviation between the real depth image D_i^* and the predicted depth image D_i^rf at view i, defined as
L_D = Σ_i Σ_p ‖D_i^rf(p) − D_i^*(p)‖_1
normal consistency loss L_N: penalizing the deviation between the normal directions computed from D_i^rf and D_i^*, defined as
L_N = Σ_i Σ_p ‖1 − ⟨N_i^rf(p), N_i^*(p)⟩‖_2
wherein D_i^rf(p) and N_i^rf(p) denote the depth value and normal vector at pixel p of the predicted depth image D_i^rf at view i and of the normal map N_i^rf computed from it, D_i^*(p) and N_i^*(p) denote the corresponding true depth value and normal vector, ‖·‖_1 denotes the L1 loss function, ‖·‖_2 denotes the L2 loss function, and ⟨·,·⟩ denotes the vector inner product operation;
three-dimensional consistency loss L_P: further constraining the consistency between the point cloud fused from {D_i^rf} and the real point cloud, so as to reduce depth-fusion noise when constructing the two-layer tree structure, defined as
L_P = CD(P^rf, P_gt), with P^rf = F({D_i^rf}, {K_i, RT_i})
wherein P^rf denotes the point cloud fused from {D_i^rf}, F is the truncated signed distance field fusion algorithm, K_i and RT_i are the intrinsic and extrinsic camera parameters, P_gt is the point cloud sampled from the true three-dimensional human mesh model, and CD denotes the Chamfer distance loss function;
the loss function for depth denoising is expressed as L = L_D + λ_N·L_N + λ_P·L_P, wherein λ_N and λ_P are loss function weights, and the loss function is optimized with the ADAM algorithm.
3. The method for reconstructing and rendering human bodies of a collaborative light field and an occupied field according to claim 1, wherein the data structure of the two-layer tree is specifically designed as follows:
(1) fusing the denoised depth images of all views with the truncated signed distance field fusion algorithm to obtain the global human point cloud P_rf and the fusion cube V_tsdf; binarizing V_tsdf into V_occ, the occupancy value V_occ(x) at spatial position x being
V_occ(x) = 1 if |V_tsdf(x)| < α·s_v, and 0 otherwise,
wherein s_v is the voxel size, α is the truncated signed distance threshold, and V_tsdf(x) is the truncated signed distance of V_tsdf at spatial position x;
(2) voxel post-processing: using the OCCNet of S3, quickly constructing a cube of the set resolution based on the real-time human reconstruction algorithm, binarizing it and fusing it with the V_occ constructed in (1) to eliminate the floating noise voxels in V_occ, giving the denoised cube; the fused occupancy value at spatial position x combines V_occ(x) with the binarized reconstruction cube through a logical OR, wherein B_β(·) is a binarization function based on the threshold β, γ is the threshold used to filter floating voxels, and | denotes the OR operation;
(3) marking the voxels of the denoised cube whose occupancy value is 1 as valid voxels, fusing parent voxels at a preset number ratio, and storing all valid voxels as a global list L_v, each node corresponding to a voxel and recording the index, spatial position and size information of its parent or child nodes; constructing an index cube V_idx to store the index value of each valid parent node in the global list, the index value of invalid parent nodes being set to -1.
4. The human body reconstruction and rendering method of a collaborative light field and an occupied field according to claim 1, wherein the SRONet is specifically designed as follows:
(1) OCCNet: the occupancy field is based on depth information, and the feature encoder HRNetV2-W18 is used to encode the depth images; for a sampling point x, OCCNet predicts the occupancy value o_x by aggregating the pixel-aligned depth features of this point from each view; the occupancy field is defined as the function
o_x = f_2( Avg( { f_1( W_i(x), c_i(x) ) }_{i=1,…,N} ) )
wherein W_i is the depth feature map of view i after encoding, W_i(x) is the depth feature obtained by projecting point x into W_i at view i, and c_i(x) comprises the depth value of the projection of x in the camera coordinate system of view i and the truncated signed distance; the hidden function f_1, represented by a fully connected network, produces the geometric feature of each view, the fused global feature is obtained through the average pooling operation Avg, and the global feature is fed into a second hidden function f_2 to compute the occupancy value o_x;
(2) ColorNet: the light field is based on color features and geometric features; the same feature encoder as in OCCNet is used to encode the RGB images and obtain color features; for the sampling point x, ColorNet takes the additional inputs, namely the viewing direction d and the geometric feature g_x, and aggregates the color features of each view to predict the view-dependent color vector c_x, wherein the geometric feature is expressed as g_x = f_3(Avg({f_1(W_i(x), c_i(x))})), f_3 being a hidden function used for encoding; the light field is defined as the function
c_x = f_5( H( { f_4( M_i(x), rgb_i(x), g_x, d_i ) }_{i=1,…,N} ) )
wherein M_i is the color feature map of view i after encoding, M_i(x) and rgb_i(x) are the color feature and pixel color obtained by projecting point x into M_i at view i, f_4 and f_5 are hidden functions that further process the features, H is a feature fusion function implemented with a Transformer, and d_i = R_i·d is the viewing direction in the camera coordinate system, R_i being the rotation matrix of the camera extrinsics at view i;
in this way SRONet predicts the occupancy value and the view-dependent color for the sampling point x.
5. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the specific design of the ray-voxel intersection and acquisition point in S4 is as follows:
(1) For an emitted ray l, detecting intersected father nodes along the ray l by using a voxel traversal algorithm, and recording depth values of the intersected father voxels at the intersection point under the angle of l, wherein the depth values of the far and near intersection points are respectively recorded as D far And D near The method comprises the steps of carrying out a first treatment on the surface of the For each intersected parent voxel, continuing to detect intersected child voxels by using a voxel traversal algorithm, and simultaneously recording far and near depth values, which are marked as D' far With D' near The method comprises the steps of carrying out a first treatment on the surface of the Summarizing the depth values of all records to beAnd->
(2) The sampling weight w of the ith voxel is calculated by distributing the number of sampling points between the far and near intersection points of each voxel i The following are provided:
wherein d far (i) And d near (i) Respectively representAnd->Far and near depth of middle voxel i, N v For all intersecting voxel numbers, s i Representing the scale size of voxel i; specifically, the number m of sampling points inside each voxel i i The method comprises the following steps:
wherein M is the total number of sampling points,representing a downward rounding; during sampling, a sampling point is allocated to each voxel i along the light emission direction so as to ensure that the voxel closest to the camera on the light is always sampled to the point; if sigma i m i < M, then reallocating the remaining M-sigma in sample order i m i Depth value d of the jth sampling point in the ith voxel i (j) The calculation formula is as follows:
wherein j starts from 0; the sampling point x is obtained as x = p_cam + d_i(j)·d, where p_cam is the camera position and d is the viewing direction (a minimal allocation sketch follows this claim).
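A minimal sketch, under stated assumptions, of the per-voxel sample allocation in this claim: the weight formula w_i and the intra-voxel depth formula d_i(j) appear in the patent only as figures, so the interval-length weight and the uniform spacing below are stand-ins; the function names are hypothetical.

import numpy as np

def allocate_samples(d_near, d_far, scales, M):
    """Distribute M sample points among intersected voxels.
    The weight below (interval length over voxel scale, normalized) is an assumed
    stand-in for the claimed w_i; the floor and re-allocation steps follow the claim."""
    d_near, d_far, scales = map(np.asarray, (d_near, d_far, scales))
    w = (d_far - d_near) / scales
    w = w / w.sum()
    m = np.floor(w * M).astype(int)        # m_i = floor(w_i * M)
    m = np.maximum(m, 1)                   # every intersected voxel gets at least one sample
    remaining = M - m.sum()
    i = 0
    while remaining > 0:                   # re-allocate leftover points in sampling order
        m[i % len(m)] += 1
        remaining -= 1
        i += 1
    return m

def sample_depths(d_near_i, d_far_i, m_i):
    """Depth of the j-th sample inside a voxel, j starting from 0; uniform spacing is an assumption."""
    return [d_near_i + (d_far_i - d_near_i) * j / max(m_i, 1) for j in range(m_i)]

def sample_points(p_cam, d, depths):
    """x = p_cam + d_i(j) * d along the viewing direction d."""
    p_cam, d = np.asarray(p_cam), np.asarray(d)
    return [p_cam + t * d for t in depths]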
6. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the volume rendering of S4 and the loss functions for training the SRONet model of S3 are specifically designed as follows:
(1) To compute the final color of ray l, a normalized surface-volume rendering technique is used: from the occupancy value o_x of each sample point on l, color fusion weights are computed, the point colors c_x are blended, and the ray color is computed as follows:
wherein the color fusion weight of the i-th sampling point is ω_x(i) = o_x(i) ∏_{j<i} (1 − o_x(j)); at the same time, the depth values d_i of the sample points are weighted in the same way to compute the depth value at the intersection of ray l with the human body surface (a minimal compositing sketch follows this claim);
(2) Based on the estimated ray color and the estimated ray depth, the following loss functions are designed to train the reconstruction and rendering part of SRONet:
geometric-color synergistic loss L_syn: following the supervision scheme of PIFu, OCCNet is trained with sampling points y in space by penalizing the error between the estimated occupancy value o_y and the ground-truth occupancy value o_y*; at the same time, the per-ray error between the ground-truth color C*(l) and the estimated color is penalized; the two loss terms work cooperatively:
wherein S and R denote the set of sampling points and the set of rays, respectively, the occupancy term uses a cross-entropy loss function, the color term uses an L1 loss function, and μ_o and μ_c are the weights of the occupancy loss term and the color loss term, respectively;
depth error loss L_D′: penalizes the error between the estimated ray depth and the ground-truth depth D*(l) to further improve the reconstruction and rendering details:
wherein the depth term uses an L2 loss function;
the loss function for SRONet is expressed as L_syn + λ_D′·L_D′, where λ_D′ is a balancing term; the loss function is optimized with the ADAM algorithm.
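A minimal sketch of the occupancy-based compositing in item (1) of this claim, assuming PyTorch tensors with samples ordered from near to far; the exclusive cumulative product implements ω_x(i) = o_x(i)·∏_{j<i}(1 − o_x(j)), and the same weights are reused for the depth estimate as stated above.

import torch

def composite_ray(occupancy, colors, depths):
    """occupancy: (N,) values in [0, 1]; colors: (N, 3); depths: (N,), ordered near-to-far.
    Returns the fused ray color and the surface depth estimate."""
    ones = torch.ones(1, dtype=occupancy.dtype, device=occupancy.device)
    # exclusive cumulative product of (1 - o_x(j)) over the preceding samples
    trans = torch.cumprod(torch.cat([ones, 1.0 - occupancy[:-1]]), dim=0)
    weights = occupancy * trans                                  # omega_x(i)
    ray_color = (weights.unsqueeze(-1) * colors).sum(dim=0)      # fused ray color
    ray_depth = (weights * depths).sum(dim=0)                    # weighted surface depth
    return ray_color, ray_depth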
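A sketch of the training objective in item (2), assuming binary cross-entropy as the cross-entropy term and illustrative values for the weights μ_o, μ_c, λ_D′; the function name and signature are hypothetical.

import torch
import torch.nn.functional as F

def sronet_loss(o_pred, o_gt, c_pred, c_gt, d_pred, d_gt,
                mu_o=1.0, mu_c=1.0, lambda_d=0.1):
    """o_pred/o_gt: (S,) occupancy at sampled points y; c_pred/c_gt: (R, 3) per-ray colors;
    d_pred/d_gt: (R,) per-ray depths. The weight values are illustrative assumptions."""
    l_occ = F.binary_cross_entropy(o_pred, o_gt)    # occupancy supervision (PIFu-style)
    l_col = F.l1_loss(c_pred, c_gt)                 # per-ray color term
    l_syn = mu_o * l_occ + mu_c * l_col             # geometric-color synergistic loss
    l_depth = F.mse_loss(d_pred, d_gt)              # depth error loss (L2)
    return l_syn + lambda_d * l_depth

As the claim specifies the ADAM algorithm, torch.optim.Adam would be used to minimize this loss.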
7. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, wherein the ray up-sampling in S5 is specifically designed as follows:
(1) Ray up-sampling: for the ray l passing through pixel location (x, y), its color, depth value, and ray fusion feature ft_color are propagated to 4 sub-pixels at the corresponding positions (x, y), (x+0.5, y), (x, y+0.5), (x+0.5, y+0.5), where ft_color is the output feature of the feature fusion function H; this produces a coarse up-sampled result;
(2) Feature fusion operation: the result of (1) is enhanced with two views n_0, n_1 adjacent to the target view and their original-resolution RGB images; specifically, the two adjacent RGB images are encoded with a UNet; each sub-ray corresponds to one sub-pixel, and the surface point p hit by the sub-ray is computed from the sub-ray's depth value; p is projected into the adjacent images and feature maps to obtain the colors and features of the two adjacent views, and the visibilities of the two views are computed from z_i, the depth of the projection of p under view i, the depth value taken from the denoised depth image of view i, and a preset visibility weight coefficient σ_v; a feature fusion network then computes fusion weights that blend the three colors of the sub-ray to obtain its final color; it is defined as follows (a minimal fusion sketch follows this claim):
wherein f_6 is an implicit function that processes the features, and the remaining inputs are the original-resolution RGB images of the adjacent views n_0 and n_1;
(3) The loss functions for training the feature fusion network are:
per-ray color-error and structural-similarity loss L_B: penalizes the color error between the sampled color patch and the ground-truth color patch, and is defined as follows:
wherein R denotes the set of rays, the sampled patch is a color block of size S_patch × S_patch whose entry at (i, j) is the final estimated color of ray r, L1 denotes the L1 loss function, SSIM denotes the structural similarity function, and μ_1, μ_2 are balancing terms of the loss function;
feature loss function L_ft: penalizes the feature error between the sampled color patch and the ground-truth color patch to further enhance the quality of the rendered image; the loss is computed with a pretrained VGG-16 network and is defined as follows:
wherein the term denotes the L1 loss between VGG features;
the loss function for the feature fusion network is expressed as L_B + μ_vgg·L_ft, where μ_vgg is a balancing term; the parameters of the feature fusion network are trained independently with the ADAM algorithm.
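A minimal sketch of the sub-pixel propagation and adjacent-view fusion in items (1)-(2) of this claim. The visibility formula and the weight normalization appear in the patent only as figures, so the exponential visibility and the softmax blending below are assumptions; all class and function names are hypothetical.

import torch
import torch.nn as nn

def upsample_ray(color, depth, feat):
    """Propagate a ray's color, depth, and fused feature to its 4 sub-pixels:
    (x, y), (x+0.5, y), (x, y+0.5), (x+0.5, y+0.5)."""
    return [(color, depth, feat) for _ in range(4)]     # coarse up-sampled result

def visibility(z_proj, z_denoised, sigma_v=0.05):
    """Assumed visibility form: closer agreement between the projected depth and the
    denoised depth map of view i gives visibility near 1 (the exact formula is a figure)."""
    return torch.exp(-torch.abs(z_proj - z_denoised) / sigma_v)

class SubRayFusion(nn.Module):
    """Hypothetical fusion head: predicts 3 blending weights for the propagated color
    and the two adjacent-view colors of a sub-ray."""
    def __init__(self, feat_dim=32):
        super().__init__()
        # f_6-style implicit function over [sub-ray feature, ft_n0, ft_n1, v_n0, v_n1]
        self.f6 = nn.Sequential(nn.Linear(3 * feat_dim + 2, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feat, ft_n0, ft_n1, v_n0, v_n1, c_ray, c_n0, c_n1):
        x = torch.cat([feat, ft_n0, ft_n1, v_n0, v_n1], dim=-1)
        w = torch.softmax(self.f6(x), dim=-1)                # fusion weights (softmax assumed)
        colors = torch.stack([c_ray, c_n0, c_n1], dim=-2)    # (..., 3 colors, 3 channels)
        return (w.unsqueeze(-1) * colors).sum(dim=-2)        # final sub-ray color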
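A sketch of the patch losses in item (3), assuming an L1 + (1 − SSIM) form for L_B with illustrative weights μ_1, μ_2, μ_vgg, an externally supplied ssim_fn (SSIM is not built into PyTorch), and torchvision's pretrained VGG-16 for the feature loss; the exact loss formulas in the claim are figures and are not reproduced here.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# frozen pretrained VGG-16 feature extractor (early convolutional blocks)
vgg_features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def fusion_loss(pred_patch, gt_patch, ssim_fn, mu1=1.0, mu2=0.2, mu_vgg=0.1):
    """pred_patch/gt_patch: (B, 3, S_patch, S_patch) estimated vs. ground-truth color blocks.
    ssim_fn is assumed to return a structural-similarity score in [0, 1]."""
    l_b = mu1 * F.l1_loss(pred_patch, gt_patch) + mu2 * (1.0 - ssim_fn(pred_patch, gt_patch))
    l_ft = F.l1_loss(vgg_features(pred_patch), vgg_features(gt_patch))   # VGG feature L1
    return l_b + mu_vgg * l_ft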
8. The human body reconstruction and rendering method of a cooperative light field and an occupied field according to claim 1, characterized in that the parallel accelerated rendering pipeline is specifically designed as follows:
(1) The rendering process is accelerated with two GPUs; each graphics card processes half of the data, the two batches of data are synchronized on the CPU through memory, and the rendering process is further accelerated by pipelining; specifically, the pipeline is divided into 3 stages and each GPU runs 3 separate data streams: 1. the I/O stage and the depth denoising stage; 2. construction of the two-layer tree data structure, encoding of the multi-view RGB and depth images in SRONet, and encoding of the images used in ray up-sampling; 3. computation of the occupancy values and colors of the ray sampling points through SRONet, and ray up-sampling based on the feature fusion network; finally, all computed rays are converted into an image for display (a minimal pipelining sketch follows this claim);
(2) The depth image denoising model, the multi-view RGB image encoding and depth image encoding in SRONet, and the image encoding in ray up-sampling are quantized to half precision and accelerated with TensorRT; all implicit functions and the Transformer-based feature fusion function are accelerated through GPU shared memory.
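A minimal sketch of the per-GPU, multi-stream pipelining idea in item (1), using PyTorch CUDA streams; the stage callables are hypothetical placeholders for the claim's three stages (I/O + depth denoising; tree construction + image encoding; SRONet evaluation + ray up-sampling), and the TensorRT/half-precision acceleration of item (2) is not shown.

import torch

def render_half(device, batch, stage1, stage2, stage3):
    """Run the three pipeline stages for one GPU's half of the data on separate CUDA streams.
    stage1/stage2/stage3 are hypothetical callables standing in for:
    1. I/O + depth denoising, 2. tree construction + image encoding, 3. SRONet + ray up-sampling."""
    s1, s2, s3 = (torch.cuda.Stream(device=device) for _ in range(3))
    with torch.cuda.stream(s1):
        x = stage1(batch)
    s2.wait_stream(s1)                  # stage 2 consumes stage 1 output
    with torch.cuda.stream(s2):
        x = stage2(x)
    s3.wait_stream(s2)                  # stage 3 consumes stage 2 output
    with torch.cuda.stream(s3):
        rays = stage3(x)
    torch.cuda.current_stream(device).wait_stream(s3)
    return rays

# Each of the two GPUs would process half of the data with render_half(...); the two halves
# are then gathered on the CPU and assembled into the displayed image.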
9. A human body reconstruction and rendering device for a cooperative light field and an occupied field, comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, is configured to implement the human body reconstruction and rendering method of a cooperative light field and an occupied field of any one of claims 1-8.
10. A computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the human body reconstruction and rendering method of a cooperative light field and an occupied field of any one of claims 1-8.
CN202311273506.5A 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field Pending CN117315153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311273506.5A CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311273506.5A CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Publications (1)

Publication Number Publication Date
CN117315153A true CN117315153A (en) 2023-12-29

Family

ID=89273264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311273506.5A Pending CN117315153A (en) 2023-09-28 2023-09-28 Human body reconstruction and rendering method and device for cooperative light field and occupied field

Country Status (1)

Country Link
CN (1) CN117315153A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745924A (en) * 2024-02-19 2024-03-22 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation
CN117745924B (en) * 2024-02-19 2024-05-14 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation

Similar Documents

Publication Publication Date Title
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN113706714B (en) New view angle synthesizing method based on depth image and nerve radiation field
Gadelha et al. 3d shape induction from 2d views of multiple objects
Musialski et al. A survey of urban reconstruction
Wang et al. Neuris: Neural reconstruction of indoor scenes using normal priors
CN104616345B (en) Octree forest compression based three-dimensional voxel access method
DE102019130889A1 (en) ESTIMATE THE DEPTH OF A VIDEO DATA STREAM TAKEN BY A MONOCULAR RGB CAMERA
CN105453139A (en) Sparse GPU voxelization for 3D surface reconstruction
CN106023147B (en) The method and device of DSM in a kind of rapidly extracting linear array remote sensing image based on GPU
WO2022198684A1 (en) Methods and systems for training quantized neural radiance field
CN114998515A (en) 3D human body self-supervision reconstruction method based on multi-view images
CN112991537B (en) City scene reconstruction method and device, computer equipment and storage medium
CN110517352A (en) A kind of three-dimensional rebuilding method of object, storage medium, terminal and system
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN115428027A (en) Neural opaque point cloud
Condorelli et al. A comparison between 3D reconstruction using nerf neural networks and mvs algorithms on cultural heritage images
CN116721210A (en) Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field
Yuan et al. Neural radiance fields from sparse RGB-D images for high-quality view synthesis
Liu et al. Creating simplified 3D models with high quality textures
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Tao et al. LiDAR-NeRF: Novel lidar view synthesis via neural radiance fields
Gu et al. Ue4-nerf: Neural radiance field for real-time rendering of large-scale scene
Song et al. Harnessing low-frequency neural fields for few-shot view synthesis
CN116071484B (en) Billion-pixel-level large scene light field intelligent reconstruction method and billion-pixel-level large scene light field intelligent reconstruction device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination