CN118212337A - Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation - Google Patents

Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation Download PDF

Info

Publication number
CN118212337A
Authority
CN
China
Prior art keywords
viewpoint
feature map
rendering
gaussian point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410626849.3A
Other languages
Chinese (zh)
Inventor
张盛平
郑顺源
刘烨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Harbin Institute of Technology Weihai
Original Assignee
Tsinghua University
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Harbin Institute of Technology Weihai filed Critical Tsinghua University
Priority to CN202410626849.3A priority Critical patent/CN118212337A/en
Publication of CN118212337A publication Critical patent/CN118212337A/en
Pending legal-status Critical Current

Links

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation, which comprises the following steps: given a target viewpoint, selecting two adjacent viewpoints from the source viewpoints and performing stereo rectification on them; extracting features from the two source viewpoint images and estimating the source viewpoint depth; defining a 3D Gaussian point cloud on the source viewpoint two-dimensional image plane; fusing the multi-scale image features and depth features, and decoding the pixel-aligned Gaussian features into a rotation feature map, a scale feature map and a transparency feature map respectively; back-projecting the Gaussian feature maps defined at the two viewpoints into three-dimensional space and rendering them to the target viewpoint to obtain the final rendering result; training the neural network by minimizing the error and learning the model parameters. The pixel-aligned 3D Gaussian point cloud representation proposed by the invention greatly improves the quality and efficiency of novel viewpoint generation of human bodies under sparse viewpoints.

Description

Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation
Technical Field
The invention relates to the technical field of image processing and pattern recognition, and in particular to a human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation.
Background
Novel viewpoint rendering of the human body aims to take RGB images captured from fixed viewpoints in a scene as input and generate a rendering of the person in that scene from an arbitrary observation viewpoint. Existing methods fall mainly into two categories. The first is depth-warping-based novel viewpoint generation, which is efficient but requires relatively dense input viewpoints, and whose rendering quality depends on the quality of the geometric proxy it relies on. The second is based on implicit representations, typified by neural radiance fields, which achieve high rendering quality but can hardly run in real time and must be trained for each specific human subject. The 3D Gaussian point cloud, as an explicit representation, offers both high rendering quality and high rendering efficiency. However, existing 3D Gaussian point cloud methods still require a per-scene optimization process on the order of minutes, and their novel viewpoint rendering results are poor under sparse viewpoint settings.
Disclosure of Invention
The invention aims to provide a human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation. Given a new target viewpoint, two adjacent source viewpoint images are selected as input and the source viewpoint depth is recovered using binocular stereo geometry. 3D Gaussian point cloud parameter feature maps defined on the input viewpoint image planes are then inferred from the image features and depth features; these feature maps comprise a transparency map, a scale map and a rotation map. In addition, a Gaussian position feature map is computed from the predicted depth map, and a Gaussian color feature map is defined directly by the source viewpoint color image. With the five Gaussian point cloud parameter feature maps defined on the two source viewpoint 2D planes, all foreground pixels are obtained using a human mask, and the foreground pixels of the two source viewpoints are back-projected into 3D space, i.e. aggregated into a 3D Gaussian point cloud representation of the human body in the current scene, which is rendered to the target viewpoint to obtain the final rendering result.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation comprises the following steps:
Given a target viewpoint, selecting two adjacent viewpoints from the source viewpoints and performing stereo rectification on them;
Further, the inner product between the target viewpoint observation direction and each source viewpoint direction in the scene is computed, the two source viewpoints with the largest correlation are selected to build a binocular stereo vision model, and stereo rectification is performed on the images.
Extracting features of the two selected source viewpoint images;
Further, a multi-layer convolutional neural network is used to extract features from the two selected source viewpoint images, yielding multi-scale image features.
Estimating the source viewpoint depth by using a binocular depth estimation network to obtain a depth map;
Further, a cost volume is constructed from the encoded image features of the two source viewpoints, a sequence of depths is predicted with a recurrent neural unit, each prediction is used to query the cost volume and iteratively update the estimate toward higher correlation, and finally convex upsampling is used to obtain the depth prediction at the original image resolution.
Defining a 3D Gaussian point cloud on a source viewpoint two-dimensional image plane, wherein the 3D Gaussian point cloud comprises a position feature map, a color feature map, a rotation feature map, a scale feature map and a transparency feature map;
Further, for each foreground pixel coordinate $x$ in the source viewpoint two-dimensional image plane, a position feature map $\mu(x)$, a color feature map $c(x)$, a rotation feature map $r(x)$, a scale feature map $s(x)$ and a transparency feature map $\alpha(x)$ are defined respectively.
The position feature map is computed from the estimated depth map and the intrinsic and extrinsic camera parameters, and the color feature map is defined directly by the RGB values of the source viewpoint color image;
Further, according to the estimated depth map, the pixel at position $x$ is back-projected from the pixel plane into three-dimensional space using the back-projection matrix and the camera parameters. The RGB values of the source viewpoint color image are used directly as the color feature map.
Fusing the multi-scale image features and the depth features, and feeding them into a decoder with a U-Net structure to regress pixel-aligned Gaussian features at the original resolution;
Further, features are extracted from the depth map using the same multi-layer convolutional neural network as the image encoder, yielding multi-scale depth features. The image features and the multi-scale depth features are concatenated at each image scale and then fed into a decoder with a U-Net structure to obtain Gaussian features at the same resolution as the original image.
The pixel-aligned Gaussian features are decoded into a rotation feature map, a scale feature map and a transparency feature map respectively, through three prediction heads and corresponding activation functions;
Further, the Gaussian features are decoded into the corresponding Gaussian feature maps by three prediction heads, each of which is a convolutional neural network comprising two convolutional layers. Before finally being used to define the Gaussian representation, the rotation feature map is normalized so that it represents a rotation quaternion, and the scale feature map and transparency feature map are activated to fit their data ranges.
Back-projecting the Gaussian feature maps defined at the two source viewpoints into three-dimensional space, aggregating them into a 3D Gaussian point cloud representation of the human body in the current scene, and rendering this representation to the target viewpoint to obtain the final rendering result;
Further, a human mask is used to extract the Gaussian point cloud corresponding to the foreground pixels in each source viewpoint, the Gaussian point clouds are back-projected into three-dimensional space, and the point clouds obtained by back-projecting the two source viewpoints are aggregated into a whole that represents the human body in the current scene. This representation is rendered to the target viewpoint to obtain the novel viewpoint rendering result.
Calculating the error between the novel viewpoint rendering result and the ground truth, training the neural network by minimizing this error, and learning the model parameters.
Further, the error between the novel viewpoint rendering result obtained above and the ground truth is computed, and the feature extraction network, depth estimation network and Gaussian parameter decoding network are trained jointly by minimizing this error to learn the model parameters.
The effects described in this summary are only those of specific embodiments rather than all effects of the invention; the above technical solution has the following advantages or beneficial effects:
The invention provides a human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation, which addresses the inability of existing methods to generate highly realistic novel viewpoint renderings of the human body in real time, and overcomes their poor generation quality under sparse viewpoint camera settings. A Gaussian point cloud representation defined on the two-dimensional image plane is designed, which remedies the drawback of the original 3D Gaussian point cloud that it must be optimized frame by frame and per human subject. During training, a human body prior is learned from a large amount of human scan data, and at test time all attribute feature maps of the Gaussian point cloud are inferred directly from the network in a feed-forward pass. The proposed pixel-aligned 3D Gaussian point cloud representation greatly improves the quality and efficiency of novel viewpoint generation of human bodies under sparse viewpoints.
Drawings
Fig. 1 is a flowchart of the human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation.
Detailed Description
As shown in Fig. 1, the human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation includes the following steps:
S1, given a target viewpoint, selecting two adjacent viewpoints from the source viewpoints and performing stereo rectification on them;
S2, extracting features of the two selected source viewpoint images;
S3, estimating the source viewpoint depth with a binocular depth estimation network to obtain a depth map;
S4, defining a 3D Gaussian point cloud on the source viewpoint two-dimensional image plane, the point cloud comprising a position feature map, a color feature map, a rotation feature map, a scale feature map and a transparency feature map;
S5, computing the position feature map from the estimated depth map and the intrinsic and extrinsic camera parameters, the color feature map being defined directly by the RGB values of the source viewpoint color image;
S6, fusing the multi-scale image features and depth features and feeding them into a decoder with a U-Net structure to regress pixel-aligned Gaussian features at the original resolution;
S7, decoding the pixel-aligned Gaussian features into a rotation feature map, a scale feature map and a transparency feature map respectively, through three prediction heads and corresponding activation functions;
S8, back-projecting the Gaussian feature maps defined at the two source viewpoints into three-dimensional space, aggregating them into a 3D Gaussian point cloud representation of the human body in the current scene, and rendering this representation to the target viewpoint to obtain the final rendering result;
S9, calculating the error between the novel viewpoint rendering result and the ground truth, training the neural network by minimizing this error, and learning the model parameters.
In step S1, given $N$ source viewpoint images, denoted $\{I_i\}_{i=1}^{N}$, with camera positions $o_i$, the source viewpoint direction can be expressed as $d_i = (c - o_i)/\lVert c - o_i \rVert$, where $c$ is the scene center. Similarly, the target viewpoint direction is expressed as $d_t = (c - o_t)/\lVert c - o_t \rVert$, where $o_t$ is the target viewpoint position. The correlation is computed as the dot product of each input viewpoint direction with the new viewpoint direction, and the two source viewpoints $\{I_l, I_r\}$ with the largest correlation are selected to build a binocular stereo model, where $l$ and $r$ denote the left and right viewpoints respectively. Binocular stereo rectification is then applied to the images to obtain the rectified source viewpoint image pair of size $H \times W$, where $H$ and $W$ are the image height and width, both set to 1024 pixels in the implementation.
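For illustration only, the viewpoint-selection rule of step S1 might be sketched as follows (a minimal NumPy sketch under the reading above; the function name and array conventions are assumptions of this example, and the subsequent stereo rectification is not shown):

```python
import numpy as np

def select_source_views(source_positions, target_position, scene_center):
    """Select the two source viewpoints whose viewing directions correlate most
    with the target viewpoint direction (step S1, before stereo rectification).

    source_positions: (N, 3) source camera centers o_i
    target_position:  (3,)  target camera center o_t
    scene_center:     (3,)  scene center c
    """
    src_dirs = scene_center[None, :] - source_positions                # c - o_i
    src_dirs = src_dirs / np.linalg.norm(src_dirs, axis=1, keepdims=True)
    tgt_dir = scene_center - target_position                           # c - o_t
    tgt_dir = tgt_dir / np.linalg.norm(tgt_dir)
    correlation = src_dirs @ tgt_dir                                   # inner products d_i · d_t
    chosen = np.argsort(correlation)[-2:]                              # two largest correlations
    return chosen                                                      # which one is "left" is decided later by geometry
```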
In step S2, the stereo-rectified source viewpoint images are fed into a multi-layer convolutional neural network to obtain image features at multiple scales, denoted $\{F^{s}\}_{s=1}^{S}$; $S = 3$ is used in the implementation, i.e. image features at three scales are extracted.
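A minimal sketch of such a three-scale convolutional encoder is given below (PyTorch; the channel widths, strides and layer counts are assumptions of this example, not values stated in the patent):

```python
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Toy three-scale convolutional encoder producing features at 1/2, 1/4 and 1/8 resolution."""
    def __init__(self, in_ch=3, widths=(32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(w, w, 3, padding=1), nn.ReLU(inplace=True)))
            prev = w

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)      # feature map at this scale
        return feats             # [F^1, F^2, F^3]
```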
In step S3, the deepest image features $F_l, F_r$ extracted from the left and right views are correlated by dot product to compute a cost volume $\mathbf{C}$, where $D$ is the feature dimension (set to a fixed value in the implementation). The cost volume is expressed as $\mathbf{C}(i, j, k) = \sum_{d=1}^{D} F_l(i, j, d) \cdot F_r(i, k, d)$, where $i$ is the row index, $j$ is the column index of $F_l$, $k$ is the column index of $F_r$, and $d$ is the channel dimension index. A recurrent neural unit then predicts a sequence of depths, each prediction being used to query the cost volume and iteratively update the estimate toward higher correlation; finally, convex upsampling yields the depth prediction $\hat{Z}$ at the original image resolution.
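Under this reading, the cost volume computation could be sketched as follows (PyTorch; the tensor layout and function name are assumptions, and the recurrent update and convex upsampling stages are omitted):

```python
import torch

def stereo_cost_volume(feat_left, feat_right):
    """Correlation volume C(i, j, k) = sum_d F_l(i, j, d) * F_r(i, k, d)
    for a rectified stereo pair; row i indexes the shared epipolar line.

    feat_left, feat_right: (D, H, W) deepest feature maps of the two views.
    Returns an (H, W, W) volume comparing left column j with right column k.
    """
    D, H, W = feat_left.shape
    fl = feat_left.permute(1, 2, 0)      # (H, W, D)
    fr = feat_right.permute(1, 0, 2)     # (H, D, W)
    return torch.bmm(fl, fr)             # batched matmul over rows -> (H, W, W)
```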
In step S4, the Gaussian feature maps $\{\mu(x), c(x), r(x), s(x), \alpha(x)\}$ are defined, where $x$ denotes the coordinates of a foreground pixel in the image plane, and the five maps are the position feature map, color feature map, rotation feature map, scale feature map and transparency feature map respectively.
In step S5, given the depth map $\hat{Z}$, the pixel at position $x$ can be back-projected directly from the pixel plane into three-dimensional space using the back-projection matrix and the camera parameters, so the position feature map can be expressed as $\mu(x) = \Pi^{-1}(x, \hat{Z}(x))$, where $\Pi^{-1}$ denotes the back-projection determined by the camera intrinsics and extrinsics. Since human-centered scenes are largely diffuse, the RGB values are used directly as the color feature map, i.e. $c(x) = I(x)$, where $I$ is the stereo-rectified source viewpoint color image.
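A possible realization of this back-projection is shown below (NumPy sketch assuming a standard pinhole model with world-to-camera extrinsics; the exact camera convention used in the patent is not specified):

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Back-project every pixel of a depth map into world coordinates,
    giving the per-pixel Gaussian position map mu(x).

    depth: (H, W) estimated depth map
    K:     (3, 3) camera intrinsic matrix
    R, t:  world-to-camera rotation (3, 3) and translation (3,)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    cam_pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # camera rays scaled by depth
    world_pts = R.T @ (cam_pts - t.reshape(3, 1))               # camera frame -> world frame
    return world_pts.T.reshape(H, W, 3)
```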
In step S6, features are first extracted from the depth map using the same multi-layer convolutional neural network as the image encoder, denoted $E(\cdot)$, yielding multi-scale depth features. The image features and the multi-scale depth features are then concatenated at each image scale and fed into a decoder with a U-Net structure to obtain Gaussian features at the same resolution as the original image, expressed as $\Gamma = \mathrm{UNet}(E(I) \oplus E(\hat{Z}))$, where $I$ is the stereo-rectified source viewpoint color image, $\hat{Z}$ is the corresponding depth map, and $\oplus$ denotes feature concatenation at each scale.
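A schematic of this fusion step (PyTorch; `encoder` and `decoder` stand for the networks described above, and replicating the single-channel depth map to three channels is an assumption of this sketch, not a detail taken from the patent):

```python
import torch

def regress_gaussian_features(image, depth, encoder, decoder):
    """Sketch of step S6: run the shared encoder on the image and on the depth
    map, concatenate the features scale by scale, and decode them with a
    U-Net-style decoder into full-resolution pixel-aligned Gaussian features."""
    img_feats = encoder(image)                          # multi-scale image features
    dep_feats = encoder(depth.expand(-1, 3, -1, -1))    # depth replicated to 3 channels (assumption)
    fused = [torch.cat([fi, fd], dim=1)                 # channel-wise concatenation per scale
             for fi, fd in zip(img_feats, dep_feats)]
    return decoder(fused)                               # Gaussian feature map at input resolution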
In step S7, three prediction heads are defined, each taking the Gaussian feature $\Gamma$ as input and outputting the rotation feature map, the scale feature map and the transparency feature map respectively; each prediction head is a convolutional neural network comprising two convolutional layers. The rotation feature map is normalized because it represents a quaternion: $r(x) = h_r(\Gamma(x)) / \lVert h_r(\Gamma(x)) \rVert$, where $h_r$ is the rotation prediction head. The scale feature map is activated with Softplus: $s(x) = \mathrm{Softplus}(h_s(\Gamma(x)))$, where $h_s$ is the scale prediction head. The transparency feature map is activated with Sigmoid: $\alpha(x) = \mathrm{Sigmoid}(h_\alpha(\Gamma(x)))$, where $h_\alpha$ is the transparency prediction head.
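The three prediction heads and their activations could look as follows (PyTorch sketch; the kernel size and hidden width are assumptions beyond the two-convolution structure stated above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Two-convolution prediction head; the hidden width is an assumed value."""
    def __init__(self, in_ch, out_ch, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 3, padding=1))

    def forward(self, feat):
        return self.net(feat)

def decode_gaussian_maps(feat, rot_head, scale_head, alpha_head):
    """Activations of step S7: unit quaternion, positive scale, opacity in (0, 1)."""
    r = F.normalize(rot_head(feat), dim=1)    # per-pixel rotation quaternion (out_ch = 4)
    s = F.softplus(scale_head(feat))          # per-pixel positive scale (out_ch = 3)
    a = torch.sigmoid(alpha_head(feat))       # per-pixel transparency (out_ch = 1)
    return r, s, a
```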
In step S8, a human mask is used to extract the Gaussian point cloud corresponding to the foreground pixels in each source viewpoint, and the Gaussian point clouds obtained by back-projecting the two source viewpoints are aggregated into a whole that represents the human body in the current scene. This representation is rendered to the target viewpoint to obtain the novel viewpoint rendering result.
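The foreground extraction and aggregation could be expressed as below (PyTorch sketch; the dictionary layout and key names are assumptions, and the rasterization itself is performed by an external differentiable 3D Gaussian renderer not shown here):

```python
import torch

def aggregate_foreground_gaussians(maps_left, maps_right, mask_left, mask_right):
    """Keep only the Gaussians of foreground pixels in each source view and
    merge both views into one point cloud for the current scene.

    maps_*: dicts with keys 'position', 'color', 'rotation', 'scale', 'alpha',
            each tensor of shape (H, W, C); mask_*: (H, W) boolean human masks.
    The merged dictionary is then handed to the Gaussian rasterizer to render
    the target viewpoint.
    """
    merged = {}
    for key in maps_left:
        merged[key] = torch.cat([maps_left[key][mask_left],
                                 maps_right[key][mask_right]], dim=0)   # (N_l + N_r, C)
    return merged
```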
In step S9, the mean absolute error and the structural similarity error between the novel viewpoint rendering result obtained in the steps above and the ground truth are computed, denoted $\mathcal{L}_{1}$ and $\mathcal{L}_{\mathrm{SSIM}}$ respectively. The overall loss of the network is their weighted sum, i.e. $\mathcal{L} = \lambda_{1}\mathcal{L}_{1} + \lambda_{2}\mathcal{L}_{\mathrm{SSIM}}$, where the weights $\lambda_{1}$ and $\lambda_{2}$ are fixed in the implementation. The model parameters are learned by minimizing this loss while jointly training the feature extraction network, the depth estimation network and the Gaussian parameter decoding network.
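A sketch of the overall loss in step S9 (PyTorch; `ssim_fn`, the argument names and the way the SSIM term is turned into an error are assumptions of this example, and the concrete weight values are not reproduced here):

```python
import torch.nn.functional as F

def training_loss(render, target, ssim_fn, w_l1, w_ssim):
    """Weighted sum of mean absolute error and a structural-similarity term.

    ssim_fn is any differentiable SSIM implementation returning a similarity
    in [0, 1]; w_l1 and w_ssim correspond to the loss weights lambda_1 and
    lambda_2, whose values are fixed in the implementation.
    """
    l1 = F.l1_loss(render, target)             # mean absolute error
    ssim_err = 1.0 - ssim_fn(render, target)   # convert similarity into an error
    return w_l1 * l1 + w_ssim * ssim_err
```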
While the embodiments of the present invention have been described above in conjunction with the drawings, this description is not intended to limit the scope of the invention; all modifications or variations falling within the scope defined by the claims of the present invention are intended to be covered.

Claims (10)

1. A human body novel viewpoint rendering method based on a pixel-aligned 3D Gaussian point cloud representation, characterized by comprising the following steps:
step one, given a target viewpoint, selecting two adjacent viewpoints from the source viewpoints and performing stereo rectification on them;
step two, extracting features of the two selected source viewpoint images;
step three, estimating the source viewpoint depth using a binocular depth estimation network to obtain a depth map;
step four, defining a 3D Gaussian point cloud on the source viewpoint two-dimensional image plane, the point cloud comprising a position feature map, a color feature map, a rotation feature map, a scale feature map and a transparency feature map;
step five, computing the position feature map from the depth map estimated in step three and the intrinsic and extrinsic camera parameters, the color feature map being defined directly by the RGB values of the source viewpoint color image;
step six, fusing the multi-scale image features and depth features and feeding them into a decoder with a U-Net structure to regress pixel-aligned Gaussian features at the original resolution;
step seven, decoding the pixel-aligned Gaussian features into a rotation feature map, a scale feature map and a transparency feature map respectively, through three prediction heads and corresponding activation functions;
step eight, back-projecting the Gaussian feature maps defined at the two source viewpoints into three-dimensional space, aggregating them into a 3D Gaussian point cloud representation of the human body in the current scene, and rendering this representation to the target viewpoint to obtain the final rendering result;
and step nine, calculating the error between the novel viewpoint rendering result and the ground truth, training the neural network by minimizing this error, and learning the model parameters.
2. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step one comprises:
computing the inner product between the target viewpoint observation direction and each source viewpoint direction in the scene, selecting the two source viewpoints with the largest correlation to build a binocular stereo vision model, and performing stereo rectification on the images.
3. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step two comprises:
extracting features from the two selected source viewpoint images using a multi-layer convolutional neural network to obtain multi-scale image features.
4. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step three comprises:
constructing a cost volume from the encoded image features of the two source viewpoints, predicting a sequence of depths with a recurrent neural unit, using each prediction to query the cost volume and iteratively update the estimate toward higher correlation, and finally using convex upsampling to obtain the depth prediction at the original image resolution, yielding the depth map.
5. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step four comprises:
for each foreground pixel coordinate $x$ in the source viewpoint two-dimensional image plane, defining respectively a position feature map $\mu(x)$, a color feature map $c(x)$, a rotation feature map $r(x)$, a scale feature map $s(x)$ and a transparency feature map $\alpha(x)$.
6. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step five comprises:
back-projecting, according to the depth map estimated in step three, the pixel at position $x$ from the pixel plane into three-dimensional space according to the back-projection matrix and the camera parameters; the RGB values of the source viewpoint color image are used directly as the color feature map.
7. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step six comprises:
extracting features from the depth map using the same multi-layer convolutional neural network as the image encoder to obtain multi-scale depth features; and concatenating the image features and the multi-scale depth features at each image scale before feeding them into a decoder with a U-Net structure to obtain Gaussian features at the same resolution as the original image.
8. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step seven comprises:
decoding the Gaussian features into the corresponding Gaussian feature maps through three prediction heads, each of which is a convolutional neural network comprising two convolutional layers; before finally being used to define the Gaussian representation, the rotation feature map is normalized so that it represents a rotation quaternion, and the scale feature map and transparency feature map are activated to fit their data ranges.
9. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as set forth in claim 1, wherein step eight comprises:
extracting the Gaussian point cloud corresponding to the foreground pixels in each source viewpoint using a human mask, back-projecting it into three-dimensional space, and aggregating the Gaussian point clouds obtained by back-projecting the two source viewpoints into a whole that represents the human body in the current scene; and rendering this representation to the target viewpoint to obtain the novel viewpoint rendering result.
10. The human body novel viewpoint rendering method based on the pixel-aligned 3D Gaussian point cloud representation as claimed in claim 1, wherein step nine comprises:
calculating the error between the novel viewpoint rendering result obtained in step eight and the ground truth, and jointly training the feature extraction network, the depth estimation network and the Gaussian parameter decoding network by minimizing this error to learn the model parameters.
CN202410626849.3A 2024-05-21 2024-05-21 Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation Pending CN118212337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410626849.3A CN118212337A (en) 2024-05-21 2024-05-21 Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410626849.3A CN118212337A (en) 2024-05-21 2024-05-21 Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation

Publications (1)

Publication Number Publication Date
CN118212337A true CN118212337A (en) 2024-06-18

Family

ID=91453904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410626849.3A Pending CN118212337A (en) 2024-05-21 2024-05-21 Human body novel viewpoint rendering method based on pixel-aligned 3D Gaussian point cloud representation

Country Status (1)

Country Link
CN (1) CN118212337A (en)

Similar Documents

Publication Publication Date Title
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN112396703B (en) Reconstruction method of single-image three-dimensional point cloud model
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
Zhu et al. View extrapolation of human body from a single image
CN110223370B (en) Method for generating complete human texture map from single-view picture
CN110288697A (en) 3D face representation and method for reconstructing based on multiple dimensioned figure convolutional neural networks
CN115298708A (en) Multi-view neural human body rendering
CN114450719A (en) Human body model reconstruction method, reconstruction system and storage medium
AU2022231680B2 (en) Techniques for re-aging faces in images and video frames
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
Han et al. PIINET: A 360-degree panoramic image inpainting network using a cube map
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
CN115222917A (en) Training method, device and equipment for three-dimensional reconstruction model and storage medium
CN117422829A (en) Face image synthesis optimization method based on nerve radiation field
Ren et al. Facial geometric detail recovery via implicit representation
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
Hara et al. Enhancement of novel view synthesis using omnidirectional image completion
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction
CN117501313A (en) Hair rendering system based on deep neural network
CN116863053A (en) Point cloud rendering enhancement method based on knowledge distillation
CN114332321B (en) Dynamic face reconstruction method and device based on nerve texture
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication