CN117974867A - Monocular face avatar generation method based on Gaussian point rendering - Google Patents

Monocular face avatar generation method based on Gaussian point rendering

Info

Publication number
CN117974867A
CN117974867A
Authority
CN
China
Prior art keywords
gaussian
points
point
space
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410381197.1A
Other languages
Chinese (zh)
Other versions
CN117974867B (en)
Inventor
张盛平
陈宇凡
柳青林
孟权令
吕晓倩
王晨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai filed Critical Harbin Institute of Technology Weihai
Priority to CN202410381197.1A priority Critical patent/CN117974867B/en
Publication of CN117974867A publication Critical patent/CN117974867A/en
Application granted granted Critical
Publication of CN117974867B publication Critical patent/CN117974867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

A monocular face avatar generation method based on Gaussian point rendering comprises the following steps: extracting the expression parameters and pose parameters of FLAME from a monocular portrait video; defining an initialization space, a standard space and a deformation space; obtaining the Gaussian parameters of points in the deformation space from the position information of the points in the deformation space and the initialization space; inputting the Gaussian parameters of the points in the deformation space into a renderer and rendering an image; computing an image loss between the rendered image and the input monocular portrait video and training by minimizing this loss; applying a point addition and deletion strategy at each training iteration to grow the point set; and driving the trained avatar of a specific character with a driving video. The invention designs an iterative optimization strategy and a point addition and deletion strategy for the Gaussian point cloud, exploits the rendering speed and rendering quality of a Gaussian splatting renderer, and guides the training of the Gaussian parameter network and the point deformation network through a pre-trained linear blend skinning function, thereby improving the generation quality of the portrait avatar.

Description

Monocular face avatar generation method based on Gaussian point rendering
Technical Field
The invention relates to the technical field of image processing and pattern recognition, in particular to a monocular face avatar generation method based on Gaussian point rendering.
Background
Monocular avatar generation aims at generating a face avatar of a specific character that performs specified actions and expressions, and has wide application in fields such as human-computer interaction, virtual reality, and augmented reality. Most existing methods adopt implicit networks to solve this problem, but such networks require long training and rendering times, and the generated results have poor geometry. In recent years, the emergence of point rendering has injected new vitality into this field: by rendering points onto a two-dimensional image and computing a loss against the ground truth, high-precision rendered portraits can be obtained. In particular, the recently proposed Gaussian splatting rendering achieves faster rendering than point rendering while maintaining high rendering quality. However, existing Gaussian splatting rendering can only handle static scenes and cannot generate dynamic sequences according to the user's requirements. In addition, existing methods based on Gaussian splatting rendering are optimized iteratively per scene, cannot generate a specific face avatar according to the user's requirements, and cannot generate an action sequence from an input driving signal.
Disclosure of Invention
The invention aims to provide a monocular face avatar generation method based on Gaussian point rendering, which uses the high-fidelity image rendering capability of Gaussian splatting to generate a face avatar driven by FLAME parameters, and uses a Gaussian deformation field and a Gaussian parameter prediction network, together with a pre-trained linear blend skinning function, to establish the relation between point offsets and FLAME parameters, so as to improve the rendering quality, geometric quality, and driving quality of the face avatar.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a monocular face avatar generation method based on Gaussian point rendering comprises the following steps:
extracting the expression parameters and pose parameters of FLAME from the monocular portrait video;
Further, inputting the data of the monocular portrait training set into a FLAME fitting network, and predicting the facial expression parameters and pose parameters of the corresponding images.
Initializing parameters of points, and defining an initialization space according to the parameters;
further, the number of initialization points is set to 400, the positions of the initialization points are randomly set in the space, and this space is defined as an initialization space.
Predicting the Gaussian parameters and the offsets of the points from the spatial positions of the initialized points to obtain new positions, and defining a standard space;
Further, the Gaussian parameters and the offset of each point are predicted from the spatial position of the initialized point, and the offset is added to the position of the corresponding point to obtain a new position; the space formed by the new positions is defined as the standard space.
Deforming points in the standard space and defining a deformation space;
Further, the positions of the points in the standard space are input into a pre-trained linear blend skinning function to obtain the deformed points, and this stage is defined as the deformation space.
Acquiring Gaussian parameters of points in a deformation space from position information of the points in the deformation space and an initialization space;
Further, the difference between the position of a point in the deformation space and its position in the initialization space is computed and input into a Gaussian deformation field to obtain the deformation of the Gaussian parameters, thereby obtaining the Gaussian parameters of the point in the deformation space.
Inputting Gaussian parameters of points in a deformation space into a renderer, and rendering an image;
Further, the Gaussian parameters of the points in the deformation space and the positions of the points are input into a Gaussian splatting renderer to obtain the rendered image.
Computing an image loss between the rendered image and the input monocular portrait video, combining it with the FLAME loss and the image perceptual loss, and training by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss;
Further, an image loss is computed between the rendered image and the input monocular video and combined with the FLAME loss and the image perceptual loss; the implicit network is trained by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss, thereby learning the model parameters.
Applying a point addition and deletion strategy at each training iteration to grow the point set;
Further, at the end of each training period, points that do not meet the requirements are deleted and one point is randomly added around each remaining point. For example, once the target number of points has grown to 100,000, points that do not meet the requirements are deleted at the end of each training period, one point is randomly added around each remaining point, and the total number of points is replenished to 100,000.
And driving the trained specific character avatar through the driving video.
Further, the FLAME parameters extracted from the driving video are input into the trained network to obtain the point positions and their corresponding Gaussian parameters, which are input into a Gaussian splatting renderer to obtain the driven actions of the specific character avatar and the corresponding rendered images.
The effects provided in the summary of the invention are merely effects of embodiments, not all effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:
The invention provides a monocular face avatar generation method based on Gaussian point rendering, which solves the problem that existing methods cannot drive and render a face avatar in real time, and overcomes the problem that Gaussian splatting rendering cannot render dynamic scenes. An iterative optimization strategy and a point addition and deletion strategy for the Gaussian point cloud are designed to make full use of the rendering speed and rendering quality of the Gaussian splatting renderer, and the training of the Gaussian parameter network and the point deformation network is guided by a pre-trained linear blend skinning function, so as to improve the generation quality of the portrait avatar.
Drawings
FIG. 1 is a flow chart of a method for generating a monocular face avatar based on Gaussian point rendering.
Detailed Description
As shown in FIG. 1, a monocular face avatar generation method based on Gaussian point rendering includes the following steps:
S1, extracting the expression parameters and pose parameters of FLAME from the monocular portrait video;
S2, initializing the parameters of points, and defining an initialization space according to the parameters;
S3, predicting the Gaussian parameters and the offsets of the points from the spatial positions of the initialized points to obtain new positions, and defining a standard space;
S4, deforming the points in the standard space, and defining a deformation space;
S5, obtaining the Gaussian parameters of the points in the deformation space from the position information of the points in the deformation space and the initialization space;
S6, inputting the Gaussian parameters of the points in the deformation space into a renderer, and rendering an image;
S7, computing an image loss between the rendered image and the input monocular portrait video, combining it with the FLAME loss and the image perceptual loss, and training by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss;
S8, applying a point addition and deletion strategy at each training iteration to grow the point set;
S9, driving the trained specific character avatar through the driving video.
In step S1, the data of the monocular portrait training set are input into a FLAME fitting network, and the facial expression parameters and pose parameters of the corresponding image are predicted: for a given portrait video, an existing facial landmark detection method is combined with camera parameter estimation for FLAME, and the FLAME facial expression parameters and pose parameters are optimized so that the FLAME model fits the faces in the input images.
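Purely as an illustration of the fitting loop described above, the following sketch assumes a differentiable FLAME layer, a 2D landmark detector, and a camera projection function are available as callables; the names (flame_layer, detect_landmarks, project) and the parameter dimensions are placeholders, not components specified in the patent.

```python
import torch
import torch.nn.functional as F

def fit_flame(frames, flame_layer, detect_landmarks, project, n_iters=200, lr=1e-2):
    """Optimize FLAME expression/pose (and a simple camera) to match detected 2D landmarks."""
    expr = torch.zeros(len(frames), 50, requires_grad=True)   # expression coefficients (dimension illustrative)
    pose = torch.zeros(len(frames), 6, requires_grad=True)    # pose coefficients (dimension illustrative)
    cam = torch.tensor([5.0, 0.0, 0.0], requires_grad=True)   # scale + 2D translation of a weak-perspective camera
    target = torch.stack([detect_landmarks(f) for f in frames])  # (T, 68, 2) detected landmarks
    opt = torch.optim.Adam([expr, pose, cam], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        lmk3d = flame_layer(expression=expr, pose=pose)          # (T, 68, 3) model landmarks
        loss = F.mse_loss(project(lmk3d, cam), target)           # 2D reprojection loss
        loss.backward()
        opt.step()
    return expr.detach(), pose.detach(), cam.detach()
```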
In step S2, an initialization space is defined, in which the initial positions of the points are set: 400 points are randomly sampled on the surface of a sphere whose radius is 0.5 times the image size, and their positions are defined as the initial positions of the points.
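A minimal sketch of this initialization step, assuming uniform sampling over the sphere surface; the point count and radius follow the figures above, and the use of the radius directly as a length unit is an assumption.

```python
import numpy as np

def init_points_on_sphere(num_points: int = 400, radius: float = 0.5, seed: int = 0) -> np.ndarray:
    """Randomly sample initial point positions on a sphere surface.

    Directions are drawn from an isotropic Gaussian and normalized, which
    yields a uniform distribution over the sphere surface.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(num_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return radius * directions  # (400, 3) positions defining the initialization space

initial_positions = init_points_on_sphere()
print(initial_positions.shape)  # (400, 3)
```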
In step S3, according to the positions of the points in the initialization space, the Gaussian parameters of each point are predicted by a multi-layer perceptron network. The Gaussian parameters of a point are defined as $G_{init}=\{\mu_{init}, r_{init}, s_{init}, \alpha_{init}, c_{init}\}$, where $\mu_{init}$ is the position of the Gaussian point in the initialization space, $r_{init}$ is the rotation coefficient of the Gaussian point in the initialization space, $s_{init}$ is its scaling coefficient, $\alpha_{init}$ is its visibility coefficient, and $c_{init}$ is its color parameter. The prediction process is defined as $\{r_{init}, s_{init}, \alpha_{init}, c_{init}\} = F_{g}(\mu_{init})$.
On this basis, a learnable offset is added to each point, and the position information of the Gaussian points is converted into the standard space: $\mu_{can} = \mu_{init} + F_{offset}(\mu_{init})$, where $F_{g}$ and $F_{offset}$ are multi-layer perceptrons.
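A minimal PyTorch-style sketch of the two multi-layer perceptrons described above; the hidden sizes, depths, activation choice, and the quaternion/scale/opacity/color output split are illustrative assumptions rather than values specified in the patent.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Simple multi-layer perceptron used for both prediction heads."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 128, depth: int = 3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Per-point Gaussian parameters: rotation (4, quaternion), scale (3), opacity (1), color (3)
gaussian_head = MLP(in_dim=3, out_dim=4 + 3 + 1 + 3)   # F_g
offset_head = MLP(in_dim=3, out_dim=3)                 # F_offset: learnable offset to the standard space

mu_init = torch.randn(400, 3)                           # positions in the initialization space
rot, scale, opacity, color = gaussian_head(mu_init).split([4, 3, 1, 3], dim=-1)
mu_can = mu_init + offset_head(mu_init)                 # positions in the standard space
```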
In step S4, the positions of the points in the standard space are input into a pre-trained linear blend skinning function to obtain the deformed points:

$\mu_{def} = \mathrm{LBS}\big(\mu_{can} + B_P(\theta;\mathcal{P}) + B_E(\psi;\mathcal{E}),\ \theta,\ \mathcal{W}\big)$

where LBS is the pre-trained linear blend skinning function, $\mathcal{P}$ and $\mathcal{E}$ are the pose basis and the expression basis of the human head model, $\mathcal{W}$ are the blend skinning weights, $B_P$ and $B_E$ are the output linear blendshape offsets for the pose and the expression, and $\theta$ and $\psi$ are the pose coefficients and the expression coefficients.
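A sketch of how the linear blend skinning step might be applied to the standard-space points, assuming per-point skinning weights and per-joint rigid transforms are available; the function and variable names are illustrative, not taken from the patent.

```python
import torch

def linear_blend_skinning(points, weights, joint_transforms):
    """Deform standard-space points with per-joint rigid transforms.

    points:            (N, 3)    standard-space positions (after adding pose/expression blendshapes)
    weights:           (N, J)    blend skinning weights, rows sum to 1
    joint_transforms:  (J, 4, 4) rigid transform of each joint for the current pose
    returns:           (N, 3)    deformed positions
    """
    # Blend the joint transforms per point: (N, 4, 4)
    blended = torch.einsum('nj,jab->nab', weights, joint_transforms)
    points_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # homogeneous coordinates
    deformed_h = torch.einsum('nab,nb->na', blended, points_h)
    return deformed_h[:, :3]
```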
In step S5, the difference between the position of a point in the deformation space and its position in the initialization space is input into the Gaussian deformation field to obtain the deformation of the Gaussian parameters, thereby obtaining the Gaussian parameters in the deformation space. This process outputs the Gaussian parameters of the Gaussian points in the deformation space by feeding the positions of the points in the deformation space and in the initialization space to a multi-layer perceptron:

$\Delta G = F_{def}(\mu_{def} - \mu_{init})$

where $F_{def}$ is the Gaussian deformation field composed of multi-layer perceptrons, $\Delta G = \{\Delta r, \Delta s, \Delta \alpha, \Delta c\}$ is the predicted Gaussian deformation, and $G_{def}$ denotes the Gaussian parameters of a Gaussian point in the deformation space. The Gaussian points in the deformation space are expressed as $G_{def} = \{\mu_{def},\ r_{init}+\Delta r,\ s_{init}+\Delta s,\ \alpha_{init}+\Delta\alpha,\ c_{init}+\Delta c\}$.
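A minimal sketch of the Gaussian deformation field, continuing the step-S3 sketch (it reuses the MLP class and the initialization-space parameters defined there); adding the predicted deformations to the initialization-space parameters follows the reconstruction above and is an assumption about the exact composition.

```python
deform_field = MLP(in_dim=3, out_dim=4 + 3 + 1 + 3)  # Gaussian deformation field F_def

def deform_gaussians(mu_def, mu_init, rot, scale, opacity, color):
    """Predict parameter deformations from the per-point displacement and apply them."""
    displacement = mu_def - mu_init                                       # (N, 3)
    d_rot, d_scale, d_op, d_col = deform_field(displacement).split([4, 3, 1, 3], dim=-1)
    return rot + d_rot, scale + d_scale, opacity + d_op, color + d_col
```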
In step S6, the Gaussian points $G_{def}$ in the deformation space are used as input and rendered by a Gaussian splatting renderer to obtain the predicted image:

$C = \sum_{i\in N} c_i\,\alpha_i\,T_i,\qquad T_i = \prod_{j=1}^{i-1}(1-\alpha_j)$

where $G(x)=\exp\!\big(-\tfrac{1}{2}x^{T}\Sigma'^{-1}x\big)$ is the standard Gaussian function; $\Sigma$ is the covariance matrix of a Gaussian point, constructed from the scaling matrix $S$ and the rotation matrix $R$ as $\Sigma = R\,S\,S^{T}R^{T}$; the viewing transformation matrix $W$ and the Jacobian $J$ of the projective transformation convert the covariance matrix from three-dimensional world coordinates to two-dimensional camera coordinates, i.e. $\Sigma' = J\,W\,\Sigma\,W^{T}J^{T}$; $\alpha_i$ is the influence of each Gaussian point on the pixel; and $T_i$ is the transmittance term.
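A small NumPy sketch of the covariance construction and projection described above ($\Sigma = R S S^{T} R^{T}$ and $\Sigma' = J W \Sigma W^{T} J^{T}$); the quaternion-to-rotation helper and the matrix shapes follow standard 3D Gaussian splatting conventions and are not details taken from the patent text.

```python
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_3d(quat: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Sigma = R S S^T R^T for one Gaussian point."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def covariance_2d(sigma3d: np.ndarray, W: np.ndarray, J: np.ndarray) -> np.ndarray:
    """Project the 3D covariance to 2D camera coordinates: Sigma' = J W Sigma W^T J^T."""
    return J @ W @ sigma3d @ W.T @ J.T
```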
In step S7, an image loss is computed between the rendered image and the input monocular video, combined with the FLAME loss and the image perceptual loss, and training is performed by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss:

$\mathcal{L} = \lambda_{rgb}\,\mathcal{L}_{rgb} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{flame}\,\mathcal{L}_{flame} + \lambda_{ssim}\,\mathcal{L}_{ssim}$

where $I$ and $\hat{I}$ are the rendered image and the ground-truth image, and $\mathcal{L}_{rgb}=\lVert I-\hat{I}\rVert_1$ is the L1 loss between them; $\Phi$ denotes the features output by the first four layers of a pre-trained VGG network, and $\mathcal{L}_{perc}=\lVert \Phi(I)-\Phi(\hat{I})\rVert_1$ is obtained by extracting the features of the rendered image and the ground-truth image and taking their L1 loss; the FLAME loss $\mathcal{L}_{flame}$ is the sum of the L2 losses between the expression, the pose, and the skinning weights and their corresponding pseudo ground-truth values derived from the FLAME vertices; $\mathcal{L}_{ssim}$ is the structural similarity loss; and $\lambda_{rgb}$, $\lambda_{perc}$, $\lambda_{flame}$, and $\lambda_{ssim}$ are the weights of the respective losses in the final loss.
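A hedged PyTorch sketch of how the weighted loss could be assembled; the weight values, the VGG feature extraction, and the helper names are illustrative assumptions rather than values specified in the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(render, gt, vgg_feats, expr, pose, skin_w,
               expr_gt, pose_gt, skin_w_gt, ssim_fn,
               w_rgb=1.0, w_perc=0.1, w_flame=1.0, w_ssim=0.2):
    """Weighted sum of image, perceptual, FLAME, and SSIM losses."""
    l_rgb = F.l1_loss(render, gt)
    # Perceptual loss: L1 over features from the first four VGG blocks (vgg_feats returns a list).
    l_perc = sum(F.l1_loss(f_r, f_g) for f_r, f_g in zip(vgg_feats(render), vgg_feats(gt)))
    # FLAME loss: L2 against pseudo ground truth for expression, pose, and skinning weights.
    l_flame = (F.mse_loss(expr, expr_gt) + F.mse_loss(pose, pose_gt)
               + F.mse_loss(skin_w, skin_w_gt))
    l_ssim = 1.0 - ssim_fn(render, gt)
    return w_rgb * l_rgb + w_perc * l_perc + w_flame * l_flame + w_ssim * l_ssim
```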
In step S8, a point addition and deletion strategy is applied at each training iteration to grow the point set: a rendering radius and a sampling radius that decrease with the training period are defined, together with a target point count that increases with the period; at each period, points with transmittance less than 0.1 are deleted and the point set is replenished to the target point count; the point count, rendering radius, and sampling radius are updated every 5 periods.
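A simple sketch of the pruning-and-replenishing step under the schedule described above; the transmittance threshold of 0.1 follows the text, while the Gaussian jitter used to spawn new points around survivors is an assumption made for illustration.

```python
import numpy as np

def prune_and_replenish(points, transmittance, target_count, sample_radius, rng=None):
    """Delete low-transmittance points and add new ones near the survivors.

    points:        (N, 3) current point positions
    transmittance: (N,)   per-point transmittance accumulated during rendering
    target_count:  desired total number of points for this period
    sample_radius: scale of the random offset used when spawning new points
    """
    rng = rng or np.random.default_rng()
    keep = transmittance >= 0.1                      # prune points below the threshold
    points = points[keep]
    if len(points) == 0:
        raise ValueError("all points were pruned; re-initialize instead")
    while len(points) < target_count:
        n_new = min(target_count - len(points), len(points))
        parents = points[rng.choice(len(points), n_new, replace=False)]
        offsets = rng.normal(scale=sample_radius, size=(n_new, 3))
        points = np.concatenate([points, parents + offsets], axis=0)
    return points
```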
In step S9, the expression parameters and pose parameters of the driving video are extracted through the FLAME fitting network and input into the trained avatar network of the specific character; the displacement of the Gaussian points and the change of the Gaussian parameters are thereby realized, and the driven avatar is rendered.
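A high-level sketch of the driving stage, assuming the trained modules from the earlier sketches; flame_fit, avatar_network, and splat_render are hypothetical callables standing in for the FLAME fitting network, the trained avatar network, and the Gaussian splatting renderer.

```python
def drive_avatar(driving_frames, flame_fit, avatar_network, splat_render, camera):
    """Re-enact a trained avatar frame by frame from a driving video."""
    rendered = []
    for frame in driving_frames:
        expr, pose = flame_fit(frame)                     # expression / pose parameters per frame
        gaussians = avatar_network(expr, pose)            # point positions + Gaussian parameters
        rendered.append(splat_render(gaussians, camera))  # image of the driven avatar
    return rendered
```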
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. A monocular face avatar generation method based on Gaussian point rendering, characterized by comprising the following steps:
step one, extracting the expression parameters and pose parameters of FLAME from a monocular portrait video;
step two, initializing the parameters of points, and defining an initialization space according to the parameters;
step three, predicting the Gaussian parameters and the offsets of the points from the spatial positions of the initialized points to obtain new positions, and defining a standard space;
step four, deforming the points in the standard space, and defining a deformation space;
step five, obtaining the Gaussian parameters of the points in the deformation space from the position information of the points in the deformation space and the initialization space;
step six, inputting the Gaussian parameters of the points in the deformation space into a renderer, and rendering an image;
step seven, computing an image loss between the rendered image and the input monocular portrait video, combining it with the FLAME loss and the image perceptual loss, and training by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss;
step eight, applying a point addition and deletion strategy at each training iteration to grow the point set;
step nine, driving the trained specific character avatar through a driving video.
2. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein the first step is specifically as follows: inputting the data of the monocular portrait training set into a FLAME fitting network, and predicting the facial expression parameters and pose parameters of the corresponding image.
3. The method for generating a monocular face avatar based on gaussian point rendering according to claim 1, wherein the step two is specifically as follows: the number of initialization points is set, the positions of the initialization points are randomly set in a space, and this space is defined as an initialization space.
4. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein said step three specifically comprises: the Gaussian parameters and the offsets of the points are predicted from the spatial positions of the initialized points, and the offset is added to the position of the corresponding point to obtain a new position; this stage is defined as the standard space.
5. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein said step four specifically comprises: the positions of the points in the standard space are input into a pre-trained linear blend skinning function to obtain the deformed points, and this stage is defined as the deformation space.
6. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein the fifth step is specifically as follows: the difference between the position of a point in the deformation space and its position in the initialization space is input into a Gaussian deformation field to obtain the deformation of the Gaussian parameters, thereby obtaining the Gaussian parameters of the point in the deformation space.
7. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein said step six specifically comprises: inputting the Gaussian parameters of the points in the deformation space and the positions of the points into a Gaussian splatting renderer to obtain the rendered image.
8. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein said step seven specifically comprises: computing an image loss between the rendered image and the input monocular portrait video, combining it with the FLAME loss and the image perceptual loss, training the implicit network by minimizing the weighted sum of the image loss, the FLAME loss, and the image perceptual loss, and learning the model parameters.
9. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein the step eight is specifically as follows: at the end of each training period, points that do not meet the requirements are deleted, and one point is randomly added around each remaining point; once the number of points has grown to N, at the end of each training period the points that do not meet the requirements are deleted, one point is randomly added around each remaining point, and the total number of points is replenished to N.
10. The method for generating a monocular face avatar based on Gaussian point rendering according to claim 1, wherein said step nine specifically comprises: inputting the FLAME parameters extracted from the driving video into the trained network to obtain the point positions and their corresponding Gaussian parameters, and inputting them into a Gaussian splatting renderer to obtain the driven actions of the specific character avatar and the corresponding rendered images.
CN202410381197.1A 2024-04-01 2024-04-01 Monocular face avatar generation method based on Gaussian point rendering Active CN117974867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410381197.1A CN117974867B (en) 2024-04-01 2024-04-01 Monocular face avatar generation method based on Gaussian point rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410381197.1A CN117974867B (en) 2024-04-01 2024-04-01 Monocular face avatar generation method based on Gaussian point rendering

Publications (2)

Publication Number Publication Date
CN117974867A true CN117974867A (en) 2024-05-03
CN117974867B CN117974867B (en) 2024-06-21

Family

ID=90864895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410381197.1A Active CN117974867B (en) 2024-04-01 2024-04-01 Monocular face avatar generation method based on Gaussian point rendering

Country Status (1)

Country Link
CN (1) CN117974867B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101465973A (en) * 2008-11-04 2009-06-24 新奥特(北京)视频技术有限公司 Method for rendering subtitling based on curved profile closed loop domain and pixel mask matrix
CN104574432A (en) * 2015-02-15 2015-04-29 四川川大智胜软件股份有限公司 Three-dimensional face reconstruction method and three-dimensional face reconstruction system for automatic multi-view-angle face auto-shooting image
US20210065434A1 (en) * 2019-09-02 2021-03-04 Disney Enterprises, Inc. Techniques for performing point-based inverse rendering
CN112070896A (en) * 2020-09-07 2020-12-11 哈尔滨工业大学(威海) Portrait automatic slimming method based on 3D modeling
CN114332136A (en) * 2022-03-15 2022-04-12 南京甄视智能科技有限公司 Face attribute data labeling method, computer equipment and storage medium
CN116385577A (en) * 2023-02-24 2023-07-04 北京邮电大学 Virtual viewpoint image generation method and device
CN116777765A (en) * 2023-05-26 2023-09-19 深圳市瑞云科技股份有限公司 Method, device and storage medium for realizing offline image rendering based on Diffusion
CN117218300A (en) * 2023-11-08 2023-12-12 腾讯科技(深圳)有限公司 Three-dimensional model construction method, three-dimensional model construction training method and device
CN117315211A (en) * 2023-11-29 2023-12-29 苏州元脑智能科技有限公司 Digital human synthesis and model training method, device, equipment and storage medium thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FENG, Y. et al.: "Gaussian Splashing: Dynamic Fluid Synthesis with Gaussian Splatting", ARXIV, 15 February 2024 (2024-02-15) *
WANG, Han; XIA, Shihong: "Automatic Reconstruction of a Face Shape with Geometric Details from a Single Image", Journal of Computer-Aided Design & Computer Graphics, no. 07, 15 July 2017 (2017-07-15) *

Also Published As

Publication number Publication date
CN117974867B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN109063301B (en) Single image indoor object attitude estimation method based on thermodynamic diagram
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN108830913B (en) Semantic level line draft coloring method based on user color guidance
Li et al. Vox-surf: Voxel-based implicit surface representation
JPH07509081A (en) Method and device for computer graphic processing using memory
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN113421328B (en) Three-dimensional human body virtual reconstruction method and device
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN117422829A (en) Face image synthesis optimization method based on nerve radiation field
CN116134491A (en) Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture
CN115018989A (en) Three-dimensional dynamic reconstruction method based on RGB-D sequence, training device and electronic equipment
CN117315211B (en) Digital human synthesis and model training method, device, equipment and storage medium thereof
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN117974867B (en) Monocular face avatar generation method based on Gaussian point rendering
Di et al. Multi-agent reinforcement learning of 3d furniture layout simulation in indoor graphics scenes
CN114783039B (en) Motion migration method driven by 3D human body model
Yang et al. EnNeRFACE: improving the generalization of face reenactment with adaptive ensemble neural radiance fields
CN114758205A (en) Multi-view feature fusion method and system for 3D human body posture estimation
CN113763536A (en) Three-dimensional reconstruction method based on RGB image
CN117292040B (en) Method, apparatus and storage medium for new view synthesis based on neural rendering
CN116071831B (en) Human body image generation method based on UV space transformation
CN117011493B (en) Three-dimensional face reconstruction method, device and equipment based on symbol distance function representation
Dang et al. Generalizable Dynamic Radiance Fields For Talking Head Synthesis With Few-shot
Liu et al. Report on Methods and Applications for Crafting 3D Humans
Griffiths et al. Curiosity-driven 3D object detection without labels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant