CN117115331B - Virtual image synthesizing method, synthesizing device, equipment and medium

Virtual image synthesizing method, synthesizing device, equipment and medium

Info

Publication number
CN117115331B
CN117115331B (application CN202311387751.9A)
Authority
CN
China
Prior art keywords
rendering
loss function
picture
semantic
parameters
Prior art date
Legal status
Active
Application number
CN202311387751.9A
Other languages
Chinese (zh)
Other versions
CN117115331A (en)
Inventor
杨延东
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311387751.9A priority Critical patent/CN117115331B/en
Publication of CN117115331A publication Critical patent/CN117115331A/en
Application granted granted Critical
Publication of CN117115331B publication Critical patent/CN117115331B/en


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 15/00 3D [Three Dimensional] image rendering
            • G06T 15/02 Non-photorealistic rendering
          • G06T 9/00 Image coding
            • G06T 9/001 Model-based coding, e.g. wire frame
            • G06T 9/002 Image coding using neural networks
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
              • G06N 3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/70 Arrangements using pattern recognition or machine learning
              • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
                • G06V 10/761 Proximity, similarity or dissimilarity measures
              • G06V 10/77 Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
                • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06V 10/82 Arrangements using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T 10/00 Road transport of goods or passengers
            • Y02T 10/10 Internal combustion engine [ICE] based vehicles
              • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method, an apparatus, a device and a medium for synthesizing an avatar, wherein the synthesis method comprises the following steps: preprocessing sample object parameters: acquiring and preprocessing the sample object parameters; rendering a geometric outline through a sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional mesh of the avatar corresponding to the sample object parameters; and training and learning through a neural radiance field network model: obtaining an information difference, and taking the information difference result as the training target of the neural radiance field network model for optimization training. This technical scheme addresses the low quality and poor effect of conventional avatar synthesis techniques.

Description

Virtual image synthesizing method, synthesizing device, equipment and medium
Technical Field
The present invention relates to the technical field of virtual images, and in particular, to a method, an apparatus, a device and a medium for synthesizing virtual images.
Background
Digital avatar (head avatar) generation is a fundamental computer graphics technology that is widely used in many practical scenarios, such as live-stream retail hosting, teleconferencing, film production, and video games. However, synthesizing digital humans with realism and expressiveness is a very challenging task: geometrically dynamic details such as real-time lip synchronization and eye contact must be reproduced, and 3D supervision information is usually unavailable.
Digital human synthesis is an active research field; existing technical schemes can be divided into three categories: image-based models, implicit models, and explicit models.
Image-based models do not rely on any expression in 3D space; they transform an image with deformation fields (warping fields) to match a new pose or expression, or adopt an encoder-decoder structure in which the encoder extracts an identity code from a given source image and the decoder synthesizes an output image based on this identity code and input features such as facial keypoints or facial contours. Although this approach can synthesize high-quality results, distortion may occur when handling large pose or expression changes, and geometric and temporal consistency is lacking, possibly because the deformation information of the three-dimensional surface is obtained from 2D pictures.
Implicit-model-based methods typically employ implicit surface functions, such as signed distance functions (SDF), or voxel representations. One technical route represents the human face as a discrete implicit feature voxel grid to synthesize dynamic transformations. Schemes combining neural radiance fields (NeRF) with voxel rendering have also attracted great interest; they generally use low-dimensional parameters of a face model or audio signals to synthesize a digital speaker. Although the implicit-model approach can solve the consistency problem of spatial geometry and temporal sequence to some extent, it is limited to static scene reconstruction and is difficult to generalize to unseen expressions or poses.
Explicit-model synthesis methods mainly adopt explicit triangular-mesh feature representations. Here, deformable-model parameters are used as prior information for reconstructing the facial features of the digital speaker from incomplete (partially occluded) or noisy data (depth maps); the explicit deformable model is constructed by fitting a series of 3D head scans to provide statistical information on facial shape, motion expression, and geometric texture. Recently, techniques based on generative adversarial networks (GAN) have also been developed, and some methods employ 2D neural rendering to learn how to generate realistic digital humans. Although these methods can produce digital humans that are geometrically consistent and easy to edit, they are limited to the cranial structure and cannot synthesize hair information, or they suffer from spatiotemporal inconsistencies due to loose constraints on the torso geometry.
Therefore, the digital human synthesis methods, i.e. avatar synthesis methods, in the prior art generally have low synthesis quality and poor expressive effect.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method, an apparatus, a device and a medium for synthesizing an avatar, the synthesis method being used to solve the low quality and poor effect of existing avatar synthesis techniques.
In order to achieve the above object, the present invention provides a method for synthesizing an avatar, comprising: preprocessing sample object parameters: acquiring and preprocessing the sample object parameters; rendering a geometric outline through a sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional mesh of the avatar corresponding to the sample object parameters; and training and learning through a neural radiance field network model: obtaining an information difference, and taking the information difference result as the training target of the neural radiance field network model for optimization training.
Further, preprocessing the sample object parameters specifically includes: preprocessing an input video frame image: acquiring the input video frame images, and extracting the camera parameters, pose parameters and expression parameters of each frame image. Rendering the geometric outline through the sample object three-dimensional model specifically includes: rendering the geometric outline through a human head model: inputting the camera parameters, pose parameters and expression parameters of each frame image into the human head model for fitting training, and generating a three-dimensional mesh of the digital human head corresponding to the input video frame image. Training and learning through the neural radiance field network model specifically includes: obtaining each loss function between the rendered picture and the input real picture, and taking the weighted sum of all the loss functions as the training target of the neural radiance field network model for optimization training; the neural radiance field network model is a multi-resolution hash neural radiance field network model.
Further, obtaining each loss function between the rendered picture and the input real picture and weighting and summing all the loss functions specifically includes: acquiring an error loss function, a depth loss function and a semantic loss function between the rendered picture and the input real picture; and performing a weighted summation of the error loss function, the depth loss function and the semantic loss function to generate an overall loss function.
Further, the synthesis method further comprises: acquiring the face feature semantic code of the input video frame image, and concatenating the face feature semantic code with the pose parameters as a prior-information semantic vector; and calculating the cosine similarity with the rendered face feature vector to update and optimize the parameters of the neural radiance field network model.
Further, obtaining the face feature semantic code of the input video frame image specifically includes: extracting the face feature semantic code of the input video frame image through an image encoder.
Further, obtaining the depth loss function between the rendered picture and the input real picture specifically includes: obtaining a pairwise visual depth supervision loss function or a depth smoothing loss function between the rendered picture and the input real picture, wherein the visual depth supervision loss is defined over the pairwise ordering relationship between pixel depths in a monocular depth estimate.
Further, obtaining the semantic loss function between the rendered picture and the input real picture specifically includes: acquiring a sampling point on an incident ray of the three-dimensional mesh M of the digital human head, and acquiring, according to a mapping function, the corresponding point of the sampling point on a standard topological mesh; performing feature encoding on the corresponding point, inputting the encoded features together with the facial expression coefficient E into the neural radiance field network model, and rendering a new-view video frame picture; and generating, by an image encoder, the semantic information codes before and after rendering from the new-view video frame picture and the original input picture respectively, and then generating the semantic loss function from these codes.
Further, the error loss function is a Huber-based picture reconstruction loss between the rendered picture and the input real picture.
Further, generating the overall loss function specifically includes: calculating the overall loss function as the weighted sum of the error loss function, the depth loss function and the semantic loss function, each term being scaled by its corresponding weighting coefficient.
Further, before acquiring the corresponding point of the sampling point on the standard topological mesh, the synthesis method further includes constructing the mapping function, which includes: generating the mapping function as a weighted average of the mapping parameters over the neighborhood A of the triangular mesh face T containing the point p.
Further, generating the mapping function specifically includes: constructing the mapping function as a weighted average over this neighborhood, where the weighting coefficients are determined with respect to the center point of each face.
Further, performing feature encoding on the corresponding point, inputting the encoded features and the facial expression coefficient E into the neural radiance field network model, and rendering the new-view video frame picture specifically includes: inputting the encoded 3D coordinate point, the expression parameter code E and the position-encoded viewing direction d into a standard neural radiance field function, and rendering to obtain the color information and volume density information of the new-view video frame picture.
Further, rendering the volume density information and color information of the new-view video frame picture through the standard neural radiance field function specifically includes: generating the color and the volume density with the standard neural radiance field function, which maps the encoded coordinate point, the expression code and the viewing direction to a color value and a volume density.
further, rendering generates colorSum volume densityThe method specifically comprises the following steps: calculating RGB color information for generating a preset ray according to a voxel rendering equation; wherein the preset ray is from the center of the camera Rays emanating in direction d.
Further, calculating RGB color information for generating a preset ray according to a voxel rendering equation, specifically including: calculating RGB color information for generating the preset rays according to the formula of the preset rays and the voxel rendering equation; wherein the formula of the preset ray is thatThe formula of the voxel rendering equation is:
wherein the transmittance is accumulatedStep sizeIs constant.
Further, generating, by the image encoder, the semantic information codes before and after rendering from the new-view video frame picture and the original input picture respectively specifically includes: inputting the new-view video frame picture and the original input picture into the image encoder respectively, and normalizing the respective outputs to generate the semantic information codes before and after rendering.
Further, extracting the camera parameters, pose parameters and expression parameters of each frame image specifically includes: extracting the camera parameters, pose parameters and expression parameters of each frame image through a facial expression detail capture and animation technique and a facial keypoint fitting algorithm.
Further, obtaining the input video frame image specifically includes: acquiring the input video frame images at a preset sampling frame rate.
The invention also provides an avatar synthesis apparatus for implementing the aforementioned avatar synthesis method, the apparatus comprising a sample object parameter preprocessing unit, a geometric outline rendering unit and a model training and learning unit: the sample object parameter preprocessing unit is used for preprocessing the sample object parameters: acquiring and preprocessing the sample object parameters; the geometric outline rendering unit is used for rendering the geometric outline through the sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional mesh of the avatar corresponding to the sample object parameters; the model training and learning unit is used for training and learning through a neural radiance field network model: obtaining an information difference, and taking the information difference result as the training target of the neural radiance field network model for optimization training.
The present invention also provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the aforementioned method of avatar composition when the computer program is executed.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the aforementioned avatar composition method.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects: the avatar synthesis method is based on a neural radiance field technical framework and guides training through the neural radiance field to realize three-dimensional avatar synthesis, thereby effectively improving the expressiveness of the avatar.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a NeRFace method of the prior art;
FIG. 2 is a block flow diagram of a NHA scheme in the prior art;
fig. 3 is a flowchart illustrating a method of synthesizing an avatar in accordance with the first embodiment of the present invention;
fig. 4 is a general flow diagram illustrating a method of synthesizing an avatar in a practical embodiment of the present invention;
Fig. 5 is a block diagram showing a construction of an avatar composition apparatus in accordance with a second embodiment of the present invention;
fig. 6 is an internal structure diagram of a computer device in the second embodiment of the present invention.
Detailed Description
In the prior art, there are two digital human synthesis schemes, specifically as follows:
technical solution one
For the new-view synthesis task of digital humans, a digital human reconstruction and synthesis scheme based on dynamic neural radiance fields (Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction, NeRFace) was proposed to synthesize new expressions and poses.
As shown in fig. 1, the NeRFace method takes a given self-shot video and a background picture that does not contain a human face as inputs, and uses a three-dimensional morphable model (3D Morphable Models, 3DMM) for facial expression tracking and estimation. Then, based on the estimated facial pose and expression, a volume rendering method is adopted to synthesize new views and new expressions of the face. The sampling-point information on the view ray and learnable per-frame latent codes (Per-frame Learnable Codes) are input together into a dynamic neural radiance field for model training, and the final synthesized color and density are output. The scheme assumes that the background is static and that the color of the last sample on each ray defaults to the corresponding value of the background color.
The drawback of this technical scheme is that it cannot currently synthesize eye motion and blinking, and the rendering process is slow, mainly because the scheme omits fine modeling of the eyes, and the implicit voxel representation of the neural radiance field makes model training and rendering inefficient.
Technical solution two
As shown in fig. 2, a digital human face geometry and appearance modeling scheme based on monocular video (Neural Head Avatars, NHA) proposes a hybrid representation method: the face shape and expression are first modeled coarsely with the deformable model FLAME, and the dynamic texture and 3D mesh are then predicted with neural networks.
In this scheme, the low-dimensional shape, expression and pose parameters of the FLAME model are first roughly estimated based on a real-time face tracker; a Geometry Network is then adopted to modify the geometric mesh generated by the FLAME model, followed by rasterization; a Texture Network is then adopted to synthesize the mesh texture of the rasterized surface; finally, the parameters of the FLAME model and the neural networks are jointly optimized based on a differentiable optimization method.
The drawback of this technical scheme is that, although it still generates a head portrait with rich texture detail under large view-angle changes, a certain degree of distortion may occur when the mouth motion and expression change significantly (e.g., the mouth closing and opening).
To this end, the present invention provides a method, apparatus, device, and medium for synthesizing an avatar to solve the above problems.
In this application, GAN (Generative Adversarial Networks) denotes generative adversarial networks; SDF (Signed Distance Functions) denotes signed distance functions; NeRF (Neural Radiance Field) denotes the neural radiance field; CLIP (Contrastive Language-Image Pre-training) denotes a contrastive text-image pre-training model; FLAME (Faces Learned with an Articulated Model and Expressions) denotes a facial statistical model; and DECA (Detailed Expression Capture and Animation) denotes facial expression detail capture and animation.
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 3, an embodiment of the present invention provides a method for synthesizing an avatar, including: preprocessing sample object parameters: acquiring and preprocessing the sample object parameters; rendering a geometric outline through a sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional mesh of the avatar corresponding to the sample object parameters; and training and learning through a neural radiance field network model: obtaining an information difference, and taking the information difference result as the training target of the neural radiance field network model for optimization training.
Thus, the avatar synthesis method is based on a neural radiance field technical framework, guides training through the neural radiance field, realizes three-dimensional avatar synthesis, and effectively improves the expressiveness of the avatar.
In a preferred embodiment, preprocessing the sample object parameters specifically includes: preprocessing an input video frame image: acquiring the input video frame images, and extracting the camera parameters, pose parameters and expression parameters of each frame image.
Rendering the geometric outline through the sample object three-dimensional model specifically includes: rendering the geometric outline through a human head model: inputting the camera parameters, pose parameters and expression parameters of each frame image into the human head model for fitting training, and generating a three-dimensional mesh of the digital human head corresponding to the input video frame image. Training and learning through the neural radiance field network model specifically includes: obtaining each loss function between the rendered picture and the input real picture, and taking the weighted sum of all the loss functions as the training target of the neural radiance field network model for optimization training.
In a specific embodiment, the human head model may be, but is not limited to, the FLAME head model, and the above avatar synthesis method can implement a digital human synthesis method that mainly includes the following steps: preprocessing the input video frame images, rendering the geometric outline through the FLAME head model, and training and learning through the neural radiance field network model. After the geometric outline rendering is completed and the three-dimensional mesh of the digital human head corresponding to the input video frame image is generated, training and learning are carried out through the neural radiance field network model, with the weighted sum of each loss function between the rendered picture and the input real picture as the training target. The digital human synthesis method is therefore based on a neural radiance field technical framework, and guides training and realizes three-dimensional digital human synthesis through a multi-resolution hash-encoded neural radiance field, thereby effectively improving the expressiveness of the digital human; the multi-resolution hash acceleration technique can effectively improve the training and rendering efficiency of the neural radiance field.
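For readers unfamiliar with multi-resolution hash feature encoding, the following simplified Python sketch illustrates the idea in the spirit of Instant-NGP style hash grids; the level count, table size and hash function are illustrative assumptions and not the exact configuration used by the invention:

import torch

class HashEncoding(torch.nn.Module):
    PRIMES = (1, 2654435761, 805459861)  # common spatial-hashing primes

    def __init__(self, n_levels=8, n_features=2, log2_table_size=16,
                 base_res=16, max_res=512):
        super().__init__()
        self.n_levels, self.table_size = n_levels, 2 ** log2_table_size
        growth = (max_res / base_res) ** (1.0 / max(n_levels - 1, 1))
        self.resolutions = [int(base_res * growth ** i) for i in range(n_levels)]
        self.tables = torch.nn.Parameter(
            torch.randn(n_levels, self.table_size, n_features) * 1e-4)

    def _hash(self, coords):             # coords: (..., 3) integer corner indices
        h = coords[..., 0] * self.PRIMES[0]
        h ^= coords[..., 1] * self.PRIMES[1]
        h ^= coords[..., 2] * self.PRIMES[2]
        return h % self.table_size

    def forward(self, x):                # x: (N, 3) points in the unit cube
        feats = []
        for level, res in enumerate(self.resolutions):
            xs = x * res
            lo = torch.floor(xs).long()
            frac = xs - lo
            acc = 0.0                    # trilinear interpolation over 8 corners
            for corner in range(8):
                offset = torch.tensor([(corner >> d) & 1 for d in range(3)],
                                      device=x.device)
                idx = self._hash(lo + offset)
                w = torch.ones(x.shape[0], device=x.device)
                for d in range(3):
                    w = w * (frac[:, d] if offset[d] == 1 else 1 - frac[:, d])
                acc = acc + w[:, None] * self.tables[level][idx]
            feats.append(acc)
        return torch.cat(feats, dim=-1)  # (N, n_levels * n_features)

enc = HashEncoding()
features = enc(torch.rand(1024, 3))      # per-point multi-resolution features

Each 3D sample point is looked up at several grid resolutions, the hashed corner features are interpolated trilinearly, and the per-level features are concatenated before being fed to the radiance-field network.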
In a practical embodiment, the digital human synthesis method integrates prior-information guidance with a multi-resolution hash-encoded neural radiance field, which effectively improves the expressiveness of the digital human.
In practice, in order to realize three-dimensional new-view synthesis of a high-fidelity digital human, the digital human synthesis method fuses a pre-training model with the neural radiance field, and can guide the training and rendering process through prior information and a visual-depth geometric supervision loss function.
Further, the pre-training model may be, but is not limited to, the CLIP pre-training model.
Furthermore, the multi-resolution hash acceleration technique is adopted, which improves the training and rendering efficiency of the neural radiance field.
Given an input video, the whole pipeline can be trained, and a digital human at new viewing angles can be inferred.
In a preferred embodiment, obtaining each loss function between the rendered picture and the input real picture and weighting and summing all the loss functions specifically includes: acquiring an error loss function, a depth loss function and a semantic loss function between the rendered picture and the input real picture; and performing a weighted summation of these three loss functions to generate an overall loss function.
In a practical embodiment, the error loss, depth loss and semantic loss between the rendered picture and the input real picture are calculated, and these loss functions are weighted and summed as the training target of the multi-resolution hash neural radiance field framework for optimization.
In a preferred embodiment, the synthesis method further comprises: acquiring the face feature semantic code of the input video frame image, and concatenating the face feature semantic code with the pose parameters as a prior-information semantic vector; and calculating the cosine similarity with the rendered face feature vector to update and optimize the parameters of the neural radiance field network model.
In a preferred embodiment, acquiring the face feature semantic code of the input video frame image specifically includes: extracting the face feature semantic code of the input video frame image through an image encoder.
In a practical embodiment, to further improve the quality of reconstruction and synthesis, the image encoder may be, but is not limited to, the CLIP image encoder (CLIP Encoder), which extracts the face feature semantic code of the input video frame; this code is concatenated with the camera pose parameters P as a prior-information semantic vector, and the cosine similarity with the rendered face feature vector is then calculated to guide the parameter update and optimization of the neural radiance field network model.
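A minimal sketch of this prior-information semantic vector is given below; the openai clip package, the "ViT-B/32" weights and the helper names are assumptions for illustration only, not the exact implementation of the invention:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def prior_semantic_vector(frame: Image.Image, pose: torch.Tensor) -> torch.Tensor:
    # concatenate the normalized CLIP image code with the pose parameters P
    with torch.no_grad():
        img = preprocess(frame).unsqueeze(0).to(device)
        code = model.encode_image(img).float()
        code = code / code.norm(dim=-1, keepdim=True)
    return torch.cat([code.squeeze(0), pose.to(device)], dim=-1)

def semantic_similarity(rendered_code: torch.Tensor, input_code: torch.Tensor) -> torch.Tensor:
    # cosine similarity used to guide parameter updates of the radiance field
    return torch.nn.functional.cosine_similarity(rendered_code, input_code, dim=-1)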
Thus, the avatar synthesis method is based on a neural radiance field technical framework and provides a three-dimensional digital human synthesis scheme that fuses prior-information guidance with a multi-resolution hash-encoded neural radiance field, effectively improving the expressiveness of the digital human.
For the multi-view inconsistency problem of digital humans, the avatar synthesis method also provides a semantic-information feature extraction and fusion modeling strategy based on the CLIP pre-trained image encoder, which further improves the smoothness and realism of the three-dimensional visual representation of the digital human.
In a preferred embodiment, obtaining the depth loss function between the rendered picture and the input real picture specifically includes: obtaining a pairwise visual depth supervision loss function or a depth smoothing loss function between the rendered picture and the input real picture, wherein the visual depth supervision loss is defined over the pairwise ordering relationship between pixel depths in a monocular depth estimate.
In practice, relying only on the prior information provided by the pre-training model, without sufficient geometric regularization, generally makes it difficult to obtain high-quality 3D reconstruction results; a pairwise visual depth supervision loss or a depth smoothing loss may therefore be employed to keep the depth of the rendering result consistent with the depth of the reference-view picture.
In a practical embodiment, this depth loss function acts as additional geometric supervision.
In a preferred embodiment, obtaining the semantic loss function between the rendered picture and the input real picture specifically includes:
Acquiring a sampling point on an incident ray of the three-dimensional mesh M of the digital human head, and acquiring, according to the mapping function, the corresponding point of the sampling point on the standard topological mesh; performing multi-resolution hash feature encoding on the corresponding point, inputting the encoded features together with the facial expression coefficient E into the neural radiance field network model, and rendering a new-view video frame picture; and generating, by the CLIP image encoder, the semantic information codes before and after rendering from the new-view video frame picture and the original input picture respectively, and then generating the semantic loss function from the cosine similarity of the two codes.
In a preferred embodiment, obtaining the error loss function between the rendered picture and the input real picture specifically includes: constructing a Huber-based picture reconstruction error loss function between the rendered picture and the input real picture.
In a preferred embodiment, generating the overall loss function specifically includes: calculating the overall loss function as the weighted sum of the error loss function, the depth loss function and the semantic loss function, each term being scaled by its corresponding weighting coefficient.
In a practical embodiment, after the weighted sum of the loss terms is calculated, a standard stochastic gradient descent method can be used for error-minimization optimization training.
In a preferred embodiment, before acquiring the corresponding point of the sampling point on the standard topological mesh, the synthesis method further includes constructing the mapping function, which includes: generating the mapping function as a weighted average of the mapping parameters over the neighborhood A of the triangular mesh face T containing the point p.
In a preferred embodiment, generating the mapping function specifically includes: constructing the mapping function as a weighted average over this neighborhood, where the weighting coefficients are determined with respect to the center point of each face.
In a preferred embodiment, performing multi-resolution hash feature encoding on the corresponding point, inputting the encoded features and the facial expression coefficient E into the neural radiance field network model, and rendering the new-view video frame picture specifically includes: inputting the encoded 3D coordinate point, the expression parameter code E and the position-encoded viewing direction d into a standard neural radiance field function, and rendering to obtain the color information and volume density information of the new-view video frame picture.
In a preferred embodiment, rendering the volume density information and color information of the new-view video frame picture through the standard neural radiance field function specifically includes: generating the color and the volume density with the standard neural radiance field function, which maps the encoded coordinate point, the expression code and the viewing direction to a color value and a volume density.
in a preferred embodiment, rendering generates colorSum volume densityThe method specifically comprises the following steps:
calculating RGB color information for generating a preset ray according to a voxel rendering equation; wherein the preset ray is from the center of the cameraRays emanating in direction d.
In a preferred embodiment, calculating the RGB color information of the preset ray according to the voxel rendering equation specifically includes: calculating the RGB color information of the preset ray according to the formula of the preset ray and the voxel rendering equation, wherein the preset ray is r(t) = o + t·d and the voxel rendering equation takes the standard discrete volume rendering form C(r) = Σ_i T_i (1 - exp(-σ_i δ_i)) c_i, in which the accumulated transmittance is T_i = exp(-Σ_{j<i} σ_j δ_j) and the step size δ_i is constant.
In a preferred embodiment, generating, by the CLIP image encoder, the semantic information codes before and after rendering from the new-view video frame picture and the original input picture respectively specifically includes: inputting the new-view video frame picture and the original input picture into the CLIP image encoder respectively, and normalizing the respective outputs to generate the semantic information codes before and after rendering.
In a preferred embodiment, extracting the camera parameters, pose parameters and expression parameters of each frame image specifically includes: extracting the camera parameters, pose parameters and expression parameters of each frame image through a facial expression detail capture and animation technique and a facial keypoint fitting algorithm.
The facial expression detail capture and animation technique may be, but is not limited to, the DECA facial expression detail capture and animation technique.
In a preferred embodiment, acquiring the input video frame images specifically includes: acquiring the input video frame images at a preset sampling frame rate.
In a practical embodiment, the video may first be resampled to a frame rate of 25 FPS; the camera parameters, pose and expression parameters of each frame image are then extracted based on the common DECA and facial keypoint fitting algorithms, so as to synthesize the digital human corresponding to the video.
In summary, in order to further improve the synthesis quality and expressive effect of the digital human, the practical embodiment of the invention provides a technical scheme that uses a contrastive language-image pre-training model (Contrastive Language-Image Pre-training, CLIP) as prior information and fuses it with a multi-resolution hash neural radiance field. This framework learns and synthesizes, from monocular video, a high-fidelity digital human with dynamic facial expression changes; it can generate vivid facial detail expressions and geometric texture structures, and further improves the generalization performance and rendering efficiency of the model.
As shown in fig. 4, in the practical embodiment, the above avatar synthesis method is mainly implemented by three modules with the following functions: 1) preprocessing of the input video frame images; 2) facial prior semantic information extraction based on the CLIP pre-trained encoder model; and 3) training and learning of the neural radiance field based on multi-resolution hash encoding.
The specific implementation flow of each module is as follows:
1. preprocessing of input video frame images
A facial expression detail capture and animation module (Detailed Expression Capture and Animation, DECA) and a facial keypoint fitting module (Landmark fitting) are adopted to extract the camera parameters K, expression (Expression) coefficients E and pose (Pose) parameters P of each frame from the input video frames (monocular RGB face video or self-shot face video); these are then input into the FLAME head model for fitting training to obtain the three-dimensional mesh M corresponding to the head image of the video frame.
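A minimal preprocessing sketch under stated assumptions is given below; frames are resampled to roughly 25 FPS with OpenCV, and deca_estimate stands in for the DECA plus landmark-fitting step (its interface is hypothetical, not an actual DECA API):

import cv2

TARGET_FPS = 25

def deca_estimate(frame):
    # placeholder for the DECA + landmark-fitting step: returns the camera
    # parameters K, expression coefficients E and pose parameters P of a frame
    raise NotImplementedError("hook up a real DECA / landmark-fitting model here")

def preprocess_video(path: str):
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or TARGET_FPS
    step = max(int(round(src_fps / TARGET_FPS)), 1)
    per_frame_params = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                  # resample to roughly 25 FPS
            K, E, P = deca_estimate(frame)   # camera, expression, pose
            per_frame_params.append((frame, K, E, P))
        idx += 1
    cap.release()
    return per_frame_params                  # later fed to FLAME fitting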
2. Facial prior semantic information extraction based on the CLIP pre-trained encoder
CLIP is a widely used multimodal pre-training model based on images and text; it constructs its training objective by mapping input text and images into the same feature space through text and image encoder branches respectively and computing the similarity of the feature vectors. The model is trained on more than 400 million text-image pairs and achieves good results on common multimodal tasks such as image retrieval and action recognition, which verifies its powerful feature extraction and abstraction capability.
Three-dimensional face reconstruction and new-view synthesis are under-constrained problems due to the complex diversity of facial expressions and the shape variations of facial geometry. High-quality three-dimensional face reconstruction requires not only low-level perception (visual depth, texture, shading, etc.) but also high-level understanding (type, function, structure, etc.).
Therefore, to further improve the quality of reconstruction and synthesis, the CLIP image encoder (CLIP Encoder) is introduced to extract the face feature semantic code of the input video frame, which is concatenated with the camera pose parameters P as a prior-information semantic vector; the cosine similarity with the rendered face feature vector is then calculated to guide the parameter update and optimization of the neural radiance field network model.
3. Training and learning of the neural radiance field based on multi-resolution hash encoding
Stage one: obtaining the corresponding point of the sampling point p on the standard topological mesh (i.e. the corresponding point in standard space)
Consider a sampling point p on an incident ray of the given head three-dimensional mesh M. Since the sampling point p lies on a deformed triangular sub-mesh t of the face (note that the head three-dimensional mesh extracted with FLAME is tessellated from triangular sub-meshes), the corresponding point of the deformed point p on the standard topological mesh needs to be found through a mapping.
The mapping function is constructed as a weighted average over the neighborhood A (containing t) of the triangular mesh face t of the point p, with the weighting coefficients determined with respect to the face center points.
The Frenet frame method from differential geometry is adopted for this construction: given a triangular sub-mesh on the three-dimensional mesh M and the corresponding three-dimensional mesh and triangular sub-mesh in standard space, rotation matrices are computed from the tangent, bitangent and normal at the point p, and are combined with the translation vectors to form the Frenet frame systems of the deformed and standard meshes. A classical BVH (bounding volume hierarchy) algorithm is adopted to accelerate the search for triangular sub-meshes in the neighborhood of a mesh sampling point.
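The following Python sketch illustrates one way to express such a deformed-to-canonical mapping through per-triangle local coordinates (barycentric coordinates plus a signed normal offset); it is a geometric illustration under assumptions, not the patent's exact Frenet-frame formula:

import numpy as np

def triangle_normal(v0, v1, v2):
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n)

def barycentric(p, v0, v1, v2):
    # solve p_proj = v0 + b1*(v1 - v0) + b2*(v2 - v0) in the triangle plane
    e1, e2 = v1 - v0, v2 - v0
    A = np.array([[e1 @ e1, e1 @ e2], [e1 @ e2, e2 @ e2]])
    b = np.array([(p - v0) @ e1, (p - v0) @ e2])
    b1, b2 = np.linalg.solve(A, b)
    return 1.0 - b1 - b2, b1, b2

def map_to_canonical(p, deformed_tri, canonical_tri):
    d0, d1, d2 = deformed_tri
    c0, c1, c2 = canonical_tri
    n_d = triangle_normal(d0, d1, d2)
    offset = (p - d0) @ n_d                   # signed distance to the triangle plane
    p_proj = p - offset * n_d
    w0, w1, w2 = barycentric(p_proj, d0, d1, d2)
    n_c = triangle_normal(c0, c1, c2)
    return w0 * c0 + w1 * c1 + w2 * c2 + offset * n_c

# example: re-express a deformed sample point on a (scaled) canonical triangle
deformed = [np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 1., 0.])]
canonical = [np.array([0., 0., 0.]), np.array([2., 0., 0.]), np.array([0., 2., 0.])]
p_std = map_to_canonical(np.array([0.2, 0.2, 0.1]), deformed, canonical)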
Stage two: performing feature encoding with multi-resolution hashing and rendering the new-view picture
After the corresponding point in standard space is obtained, multi-resolution hash feature encoding is applied; the encoded features and the facial expression coefficient E are input into the neural radiance field network, which renders the volume density (Density) and color (RGB) information of the new video frame picture.
Specifically, the encoded 3D coordinate point of the given scene, the expression parameter code E and the position-encoded viewing direction d are input into the standard neural radiance field function, which outputs the color c and the volume density σ.
For a ray r(t) = o + t·d emitted from the camera center o in the direction d, the desired RGB color information can be calculated from the voxel rendering equation, which takes the standard discrete form C(r) = Σ_i T_i (1 - exp(-σ_i δ_i)) c_i, where the accumulated transmittance is T_i = exp(-Σ_{j<i} σ_j δ_j) and the step size δ_i is set to a constant.
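A minimal sketch of this discrete voxel rendering step is given below; tensor shapes and the constant step size are illustrative assumptions:

import torch

def volume_render(sigma: torch.Tensor,      # (n_rays, n_samples) densities
                  rgb: torch.Tensor,        # (n_rays, n_samples, 3) colors
                  delta: float = 0.01) -> torch.Tensor:
    alpha = 1.0 - torch.exp(-sigma * delta)                  # per-sample opacity
    # accumulated transmittance T_i = exp(-sum_{j<i} sigma_j * delta)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                                   # (n_rays, n_samples)
    return (weights[..., None] * rgb).sum(dim=-2)             # (n_rays, 3)

colors = volume_render(torch.rand(4, 64), torch.rand(4, 64, 3))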
Stage three: calculating semantic loss function, depth supervision loss function and error loss function
After the new-view picture is rendered, it and the corresponding original input picture are input separately into the CLIP image encoder; the outputs are normalized respectively to obtain the semantic information codes before and after rendering, and the semantic loss function is then calculated from the cosine similarity between the two codes.
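A hedged sketch of such a semantic loss is shown below; penalizing one minus the cosine similarity of the normalized codes is a common choice consistent with the description above, not necessarily the exact form used:

import torch

def semantic_loss(code_rendered: torch.Tensor, code_input: torch.Tensor) -> torch.Tensor:
    # normalize both CLIP image codes, then penalize their dissimilarity
    code_rendered = code_rendered / code_rendered.norm(dim=-1, keepdim=True)
    code_input = code_input / code_input.norm(dim=-1, keepdim=True)
    cos = (code_rendered * code_input).sum(dim=-1)
    return (1.0 - cos).mean()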
in practice, it is generally difficult to obtain high quality 3D reconstruction results only by relying on prior information provided by the pre-training model, but lacking sufficient geometric regularization; further, a pair of visual depth supervision loss or depth smoothing loss may be employed to keep the depth of the rendering result consistent with the depth of the reference view picture.
The depth supervision loss function is defined over the pairwise ordering relationship between pixel depths in the monocular depth estimate; it acts as additional geometric supervision.
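One common formulation of such a pairwise ranking loss is sketched below; the margin hinge and the random pair sampling are assumptions for illustration, not the patent's exact formula:

import torch

def depth_ranking_loss(rendered_depth: torch.Tensor,   # (n_pixels,)
                       mono_depth: torch.Tensor,       # (n_pixels,)
                       n_pairs: int = 1024,
                       margin: float = 1e-4) -> torch.Tensor:
    i = torch.randint(0, rendered_depth.numel(), (n_pairs,))
    j = torch.randint(0, rendered_depth.numel(), (n_pairs,))
    # +1 if the monocular estimate says pixel i is farther than pixel j, else -1
    order = torch.sign(mono_depth[i] - mono_depth[j])
    diff = rendered_depth[i] - rendered_depth[j]
    # hinge: the rendered depths should respect the same ordering
    return torch.clamp(margin - order * diff, min=0.0).mean()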
In addition, a Huber-based picture reconstruction error loss function is employed between the rendered picture and the input real picture; the Huber function penalizes small residuals quadratically and large residuals linearly, which makes the reconstruction term robust to outlier pixels.
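A minimal sketch of the Huber-based reconstruction loss, with an illustrative delta threshold, is given below:

import torch
import torch.nn.functional as F

def huber_reconstruction_loss(rendered: torch.Tensor,
                              target: torch.Tensor,
                              delta: float = 0.1) -> torch.Tensor:
    # quadratic for small residuals, linear for large ones (robust to outliers)
    return F.huber_loss(rendered, target, delta=delta)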
Stage four: error-minimization optimization training with the overall loss function
In summary, the overall loss function is the weighted sum of the picture reconstruction loss, the depth supervision loss and the semantic loss, with each term scaled by its corresponding weighting coefficient.
After the weighted-sum loss function is calculated, a standard stochastic gradient descent method can be adopted for error-minimization optimization training.
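A hedged sketch of one such training step is given below; it reuses the loss helpers sketched earlier in this description, and the model, optimizer choice and weighting coefficients are placeholders for illustration:

import torch

def train_step(model: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               rendered: torch.Tensor, target: torch.Tensor,
               rendered_depth: torch.Tensor, mono_depth: torch.Tensor,
               code_rendered: torch.Tensor, code_input: torch.Tensor,
               lambda_depth: float = 0.1, lambda_sem: float = 0.01) -> float:
    # weighted sum of the three loss terms, then one gradient-descent update
    loss = (huber_reconstruction_loss(rendered, target)
            + lambda_depth * depth_ranking_loss(rendered_depth, mono_depth)
            + lambda_sem * semantic_loss(code_rendered, code_input))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # or Adam in practice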
In a practical embodiment, the avatar synthesis method performs training and rendering of the digital human by means of the CLIP pre-trained image encoder and a hash-encoding-based neural radiance field framework; the selected neural radiance field network and encoder network are both common feed-forward neural networks (MLPs).
Taking common short-video data such as a celebrity speech video or an animal video as examples, the specific steps are as follows:
First embodiment
1. Preprocessing of input video frame images
Step A: sample the frame rate of the BBC speaker video to 25 FPS.
Step B: extract the camera parameters, pose and expression parameters of each frame image based on the common DECA and facial keypoint fitting algorithms.
2. Geometric outline rendering with the FLAME face model
Step C: input the camera parameters, expression and pose data obtained by preprocessing each frame into the FLAME model for fitting training to obtain the three-dimensional mesh of the digital human head.
3. Training and learning of the deformable neural radiance field
Step D: calculate the error loss, depth loss and semantic loss between the rendered picture and the input real picture, and take the weighted sum of these loss functions as the training target of the multi-resolution hash neural radiance field framework for optimization.
Second embodiment
1. Preprocessing of input video frame images
Step A: sample the frame rate of the animal head video to 25 FPS.
Step B: extract the camera parameters, pose and expression parameters of each frame image based on a common facial keypoint fitting and extraction algorithm.
2. Geometric outline rendering of the sample object three-dimensional statistical model
Step C: input the camera parameters, expression and pose data obtained by preprocessing each frame into the three-dimensional statistical model for fitting training to obtain the three-dimensional mesh data of the animal head.
3. Training and learning of the deformable neural radiance field
Step D: calculate the error loss, depth loss and semantic loss between the rendered picture and the input real picture, and take the weighted sum of these loss functions as the training target of the multi-resolution hash neural radiance field framework for optimization.
In summary, the avatar synthesis method provided by the practical embodiments of the invention is a digital human rendering scheme in which prior information guides the neural radiance field; the digital human finally generated by this scheme is better matched to real-scene presentation and broadcasting tasks, and the generated three-dimensional visual representation is more realistic and consistent.
Compared with the prior art, the method takes both the generalization and the effectiveness of the three-dimensional visual feature representation into account, and has important scientific significance and potential application value for research in related fields such as commercial live-streaming, animation production, and video games.
It should be noted that the above avatar synthesis method can also serve as a basic technical solution in fields such as commercial live-streaming, video media, video games, and digital content generation.
It should be noted that, although the steps in the flowchart are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Example two
As shown in fig. 5, an embodiment of the present invention further provides an avatar synthesis apparatus for implementing the aforementioned avatar synthesis method, the apparatus comprising a sample object parameter preprocessing unit, a geometric outline rendering unit, and a model training and learning unit: the sample object parameter preprocessing unit is used for preprocessing the sample object parameters: acquiring and preprocessing the sample object parameters; the geometric outline rendering unit is used for rendering the geometric outline through the sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional mesh of the avatar corresponding to the sample object parameters; the model training and learning unit is used for training and learning through a neural radiance field network model: obtaining an information difference, and taking the information difference result as the training target of the neural radiance field network model for optimization training.
In a preferred embodiment, the sample object parameter preprocessing unit is further configured to preprocess an input video frame image: acquire the input video frame images, and extract the camera parameters, pose parameters and expression parameters of each frame image.
The geometric outline rendering unit is further configured to render the geometric outline through the FLAME head model: input the camera parameters, pose parameters and expression parameters of each frame image into the FLAME head model for fitting training, and generate a three-dimensional mesh of the digital human head corresponding to the input video frame image.
The model training and learning unit is further configured to train and learn through the neural radiance field network model: obtain each loss function between the rendered picture and the input real picture, and take the weighted sum of all the loss functions as the training target of the neural radiance field network model for optimization training; the neural radiance field network model is a multi-resolution hash neural radiance field network model.
In a preferred embodiment, the model training and learning unit includes a loss function acquisition unit for acquiring the error loss function, the depth loss function and the semantic loss function between the rendered picture and the input real picture,
and an overall loss function generation unit for performing a weighted summation of the error loss function, the depth loss function and the semantic loss function to generate the overall loss function.
In a preferred embodiment, the synthesis apparatus further comprises: a prior-information semantic vector generation unit for acquiring the face feature semantic code of the input video frame image and concatenating it with the pose parameters as a prior-information semantic vector; and a model parameter optimization unit for calculating the cosine similarity with the rendered face feature vector and updating and optimizing the parameters of the neural radiance field network model.
In a preferred embodiment, the prior-information semantic vector generation unit is further configured to extract the face feature semantic code of the input video frame image through the CLIP image encoder.
In a preferred embodiment, the loss function acquisition unit is further configured to obtain a pairwise visual depth supervision loss function or a depth smoothing loss function between the rendered picture and the input real picture, wherein the visual depth supervision loss is defined over the pairwise ordering relationship between pixel depths in a monocular depth estimate.
In a preferred embodiment, the loss function acquisition unit is further configured to: acquire a sampling point on an incident ray of the three-dimensional mesh M of the digital human head, and acquire, according to the mapping function, the corresponding point of the sampling point on the standard topological mesh; perform multi-resolution hash feature encoding on the corresponding point, input the encoded features together with the facial expression coefficient E into the neural radiance field network model, and render a new-view video frame picture; and generate, by the CLIP image encoder, the semantic information codes before and after rendering from the new-view video frame picture and the original input picture respectively, and then generate the semantic loss function from the two codes.
In a preferred embodiment, the loss function obtaining unit is further configured to construct a Huber-based picture error loss function between the rendered picture and the input real picture, with δ = 0.1.
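The Huber formula itself is not reproduced above; by way of illustration, the standard Huber form with the stated δ = 0.1 is sketched below as an assumption.

import torch

def huber_color_loss(i_new, i_orig, delta=0.1):
    # pixel-wise Huber (smooth L1) error between the rendered and real pictures
    diff = (i_new - i_orig).abs()
    quadratic = 0.5 * diff ** 2
    linear = delta * (diff - 0.5 * delta)
    return torch.where(diff <= delta, quadratic, linear).mean()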
In a preferred embodiment, the overall loss function generating unit is further configured to calculate the overall loss function as: L = λ_color·L_color + λ_semantic·L_semantic + λ_depth·L_depth, where λ_color, λ_semantic and λ_depth are the corresponding weighting coefficients.
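By way of illustration, the weighted summation can be expressed directly; the weight values below are placeholders, as the disclosure does not fix them here.

def total_loss(l_color, l_semantic, l_depth,
               w_color=1.0, w_semantic=0.1, w_depth=0.1):
    # L = lambda_color*L_color + lambda_semantic*L_semantic + lambda_depth*L_depth
    return w_color * l_color + w_semantic * l_semantic + w_depth * l_depth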
In a preferred embodiment, the synthesizing device further comprises a mapping function construction unit for: generating the mapping function by carrying out a weighted average, according to the mapping parameter F_f, over the neighborhood A of the triangular mesh faces T around the point p.
In a preferred embodiment, the mapping function construction unit is further configured to construct the mapping function as a weighted average over the neighborhood A, where the weighting coefficient is ω_f = exp(−β·||c_f − p||²), β = 4, and c_f is the center point of the corresponding face.
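By way of illustration and not limitation, the mapping to the standard topology can be sketched as a Gaussian-weighted average over the neighboring faces; the tensor shapes and the way the per-face mapped values F_f(p) are obtained are assumptions of the sketch.

import torch

def map_to_canonical(p, face_centers, face_mapped, beta=4.0):
    # face_centers: (A, 3) centers c_f of faces in neighborhood A
    # face_mapped:  (A, 3) per-face mapped positions F_f(p) on the standard topology
    w = torch.exp(-beta * ((face_centers - p) ** 2).sum(dim=-1))   # omega_f
    w = w / w.sum()
    return (w.unsqueeze(-1) * face_mapped).sum(dim=0)              # corresponding point p'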
In a preferred embodiment, the loss function obtaining unit is further configured to: acquire the encoded 3D coordinate point p′, input it together with the expression parameter code E and the position-encoded viewing direction d into the standard neural radiation field function F, and render the color information and volume density information of the new view angle video frame picture I_new.
In a preferred embodiment, the loss function obtaining unit is further configured to: generate the color c and volume density σ by rendering with the standard neural radiation field function F; wherein the rendering generation formula is: F: (p′, E, d) → (c(p′, E, d), σ(p′, E)).
in a preferred embodiment, the loss function obtaining unit is further configured to: calculating RGB color information for generating a preset ray according to a voxel rendering equation; wherein the preset ray is from the center of the cameraRays emanating in direction d.
In a preferred embodiment, the loss function obtaining unit is further configured to: calculate the RGB color information of the preset ray according to the formula of the preset ray and the voxel rendering equation; wherein the formula of the preset ray is r(t) = o + t·d, and the voxel rendering equation is: Ĉ(r) = Σ_{n=1}^{N} T_n·α_n·c_n, where the cumulative transmittance T_n = Π_{m=1}^{n−1}(1 − α_m), α_n = 1 − exp(−σ_n·δ_n), and the step size δ_n is constant.
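By way of illustration and not limitation, the voxel rendering step can be sketched as the usual alpha-composited accumulation along the ray; the quadrature below follows the equation above, and the tensor shapes are assumptions.

import torch

def render_ray(colors, sigmas, delta):
    # colors: (N, 3) per-sample RGB, sigmas: (N,) densities, delta: constant step size
    alphas = 1.0 - torch.exp(-sigmas * delta)                                    # alpha_n
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)  # T_n
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                           # ray RGB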
In a preferred embodiment, the loss function obtaining unit is further configured to: input the new view angle video frame picture I_new and the original input picture I_orig respectively into the CLIP image encoder φ(·) and normalize them, to generate the semantic information encodings φ(I_new) and φ(I_orig) before and after rendering.
In a preferred embodiment, the sample object parameter preprocessing unit is further configured to: extract the camera parameters, gesture parameters and expression parameters of each frame image through the DECA facial expression detail capture and animation technique together with a facial key point fitting algorithm.
In a preferred embodiment, the sample object parameter preprocessing unit is further configured to: acquire the input video frame images according to a preset sampling frame rate.
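By way of illustration, sampling the input video at a preset frame rate can be done with OpenCV; the stride value below is a placeholder.

import cv2

def sample_frames(video_path, every_n=2):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:       # keep one frame every `every_n` frames
            frames.append(frame)
        idx += 1
    cap.release()
    return frames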
For specific limitations of the above apparatus, reference may be made to the limitations of the method described above, which are not repeated here.
Each of the modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
The computer device may be a terminal, as shown in fig. 6, which includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It is to be understood that the structures shown in the above figures are merely block diagrams of some of the structures associated with the present invention and do not limit the computer devices to which the present invention may be applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
All or part of the flow in the above-described embodiment methods may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the flow of the above-described embodiment methods.
Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (19)

1. A method of synthesizing an avatar, comprising:
preprocessing sample object parameters, including preprocessing an input video frame image: acquiring the input video frame images, and extracting camera parameters, gesture parameters and expression parameters of each frame image;
rendering geometric contours through a sample object three-dimensional model, comprising: geometric contour rendering by a human head model: inputting camera parameters, gesture parameters and expression parameters of each frame of image into the human head model for fitting training, and generating a three-dimensional grid of the digital human head corresponding to the input video frame image;
Training learning through a neural radiation field network model, comprising: obtaining each loss function between a rendering picture and an input real picture, and carrying out weighted summation on all the loss functions to serve as a training target of the neural radiation field network model for optimization training; the neural radiation field network model is a multi-resolution hash neural radiation field network model, and the loss function comprises a semantic loss function;
the method for obtaining the semantic loss function between the rendered picture and the input real picture comprises the following steps:
acquiring a sampling point p on an incident ray r of the digital human head three-dimensional grid M; acquiring, according to a mapping function, the corresponding point p' of the sampling point p on a standard topological grid;
performing feature coding on the corresponding point p', inputting the feature coding together with the facial expression coefficient E into the neural radiation field network model, and rendering to obtain a new view angle video frame picture I_new;
through the image encoder φ(·), generating the semantic information encodings φ(I_new) and φ(I_orig) before and after rendering from the new view angle video frame picture I_new and the original input picture I_orig respectively, and generating the semantic loss function; the formula of the semantic loss function is: L_semantic = φ(I_new)^T·φ(I_orig).
2. The avatar composition method of claim 1, wherein acquiring respective loss functions between the rendered picture and the input real picture and weighting and summing all the loss functions, comprises:
acquiring an error loss function, a depth loss function and a semantic loss function between the rendered picture and the input real picture;
and carrying out weighted summation on the error loss function, the depth loss function and the semantic loss function to generate an overall loss function.
3. The avatar composition method of claim 2, wherein the composition method further comprises:
acquiring a human face feature semantic code of the input video frame image, and splicing the human face feature semantic code and the gesture parameter together to serve as a priori information semantic vector;
and calculating cosine similarity according to the face feature vector which is rendered and output, and updating and optimizing parameters of the neural radiation field network model.
4. The avatar composition method of claim 3, wherein acquiring the face feature semantic code of the input video frame image comprises:
extracting the facial feature semantic codes of the input video frame images through an image encoder.
5. The avatar composition method of claim 4, wherein acquiring a depth loss function between the rendered picture and the input real picture comprises:
obtaining a pairwise visual depth supervision loss function or a depth smoothing loss function between the rendered picture and the input real picture; wherein, in the visual depth supervision loss function, r_k denotes the pairwise ordering relationship between a pair of pixels in the monocular depth estimation.
6. The avatar composition method of claim 5, wherein the error loss function is a Huber-based picture error loss function between the rendered picture and the input real picture, with δ = 0.1.
7. The avatar composition method of claim 6, wherein generating the overall loss function comprises:
the overall loss function is calculated as: L = λ_color·L_color + λ_semantic·L_semantic + λ_depth·L_depth, wherein λ_color, λ_semantic and λ_depth are respectively the corresponding weighting coefficients.
8. The avatar composition method of claim 7, wherein before acquiring the corresponding point p' of the sampling point p on the standard topology mesh, the composition method further comprises a mapping function construction method comprising:
carrying out a weighted average, according to the mapping parameter F_f, over the neighborhood A of the triangular mesh faces T of the point p to generate the mapping function.
9. The avatar composition method of claim 8, wherein generating the mapping function comprises:
constructing the mapping function as a weighted average over the neighborhood A, wherein the weighting coefficient ω_f = exp(−β·||c_f − p||²), β = 4, and c_f is the center point of the corresponding face.
10. The avatar composition method of claim 7, wherein performing feature coding on the corresponding point p', inputting it together with the facial expression coefficient E into the neural radiation field network model, and rendering to obtain the new view angle video frame picture I_new, comprises:
acquiring the encoded 3D coordinate point p', inputting it together with the expression parameter code E and the position-encoded viewing direction d into a standard neural radiation field function F, and rendering to obtain the color information and volume density information of the new view angle video frame picture I_new.
11. The avatar composition method of claim 10, wherein rendering, through the standard neural radiation field function F, the volume density information and color information of the new view angle video frame picture I_new comprises:
generating a color c and a volume density σ by rendering with the standard neural radiation field function F; wherein the rendering generation formula is: F: (p', E, d) → (c(p', E, d), σ(p', E)).
12. The avatar composition method of claim 11, wherein rendering the generated color c and volume density σ comprises:
calculating, according to a voxel rendering equation, the RGB color information of a preset ray; the preset ray is a ray emitted from the center o of the camera along the direction d.
13. The avatar composition method of claim 12, wherein calculating, according to the voxel rendering equation, the RGB color information of the preset ray comprises:
calculating the RGB color information of the preset ray according to the formula of the preset ray and the voxel rendering equation; wherein the formula of the preset ray is r(t) = o + t·d, and the voxel rendering equation is: Ĉ(r) = Σ_{n=1}^{N} T_n·α_n·c_n, where the cumulative transmittance T_n = Π_{m=1}^{n−1}(1 − α_m), α_n ≡ 1 − exp(−σ_n·δ_n), and the step size δ_n is constant.
14. The avatar composition method of claim 7, wherein generating, through the image encoder φ(·), the semantic information encodings φ(I_new) and φ(I_orig) before and after rendering from the new view angle video frame picture I_new and the original input picture I_orig respectively, comprises:
inputting the new view angle video frame picture I_new and the original input picture I_orig respectively into the image encoder φ(·) and normalizing them, to generate the semantic information encodings φ(I_new) and φ(I_orig) before and after rendering.
15. The avatar composition method of claim 1, wherein extracting camera parameters, pose parameters, and expression parameters of each frame of image comprises:
and extracting camera parameters, gesture parameters and expression parameters of each frame of image through a facial expression detail capturing and animation production technology and a facial key point fitting algorithm.
16. The avatar composition method of claim 15, wherein the acquiring the input video frame image comprises:
and acquiring the input video frame image according to a preset sampling frame rate.
17. An avatar composition apparatus for implementing the avatar composition method of any one of claims 1-16, comprising a sample object parameter preprocessing unit, a geometric contour rendering unit, and a model training learning unit:
the sample object parameter preprocessing unit is used for preprocessing sample object parameters: acquiring and preprocessing the sample object parameters;
the geometric outline rendering unit is used for performing geometric outline rendering through the sample object three-dimensional model: inputting the sample object parameters into the sample object three-dimensional model for fitting training, and generating a three-dimensional grid of an virtual image corresponding to the sample object parameters;
the model training learning unit is used for training and learning through a neural radiation field network model: obtaining information differences, and taking the information difference result as the training target of the neural radiation field network model for optimization training.
18. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the method for synthesizing an avatar according to any one of claims 1-16.
19. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the avatar composition method of any one of claims 1-16.
CN202311387751.9A 2023-10-25 2023-10-25 Virtual image synthesizing method, synthesizing device, equipment and medium Active CN117115331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311387751.9A CN117115331B (en) 2023-10-25 2023-10-25 Virtual image synthesizing method, synthesizing device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311387751.9A CN117115331B (en) 2023-10-25 2023-10-25 Virtual image synthesizing method, synthesizing device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117115331A CN117115331A (en) 2023-11-24
CN117115331B true CN117115331B (en) 2024-02-09

Family

ID=88811471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311387751.9A Active CN117115331B (en) 2023-10-25 2023-10-25 Virtual image synthesizing method, synthesizing device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117115331B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315211B (en) * 2023-11-29 2024-02-23 苏州元脑智能科技有限公司 Digital human synthesis and model training method, device, equipment and storage medium thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354079A (en) * 2020-03-11 2020-06-30 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction network training and virtual face image generation method and device
CN111402310A (en) * 2020-02-29 2020-07-10 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN112862901A (en) * 2021-02-20 2021-05-28 清华大学 Experimental animal view field simulation method based on multi-view video and space-time nerve radiation field
CN113706714A (en) * 2021-09-03 2021-11-26 中科计算技术创新研究院 New visual angle synthesis method based on depth image and nerve radiation field
CN115909015A (en) * 2023-02-15 2023-04-04 苏州浪潮智能科技有限公司 Construction method and device of deformable nerve radiation field network
CN116342782A (en) * 2023-03-31 2023-06-27 北京百度网讯科技有限公司 Method and apparatus for generating avatar rendering model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402310A (en) * 2020-02-29 2020-07-10 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111354079A (en) * 2020-03-11 2020-06-30 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction network training and virtual face image generation method and device
CN112862901A (en) * 2021-02-20 2021-05-28 清华大学 Experimental animal view field simulation method based on multi-view video and space-time nerve radiation field
CN113706714A (en) * 2021-09-03 2021-11-26 中科计算技术创新研究院 New visual angle synthesis method based on depth image and nerve radiation field
CN115909015A (en) * 2023-02-15 2023-04-04 苏州浪潮智能科技有限公司 Construction method and device of deformable nerve radiation field network
CN116342782A (en) * 2023-03-31 2023-06-27 北京百度网讯科技有限公司 Method and apparatus for generating avatar rendering model

Also Published As

Publication number Publication date
CN117115331A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
US11276231B2 (en) Semantic deep face models
Khakhulin et al. Realistic one-shot mesh-based head avatars
US10789453B2 (en) Face reenactment
CN111127304B (en) Cross-domain image conversion
US20210358197A1 (en) Textured neural avatars
Lin et al. 3d gan inversion for controllable portrait image animation
CN113272870A (en) System and method for realistic real-time portrait animation
CN115588224B (en) Virtual digital person generation method and device based on face key point prediction
EP3991140A1 (en) Portrait editing and synthesis
CN117115331B (en) Virtual image synthesizing method, synthesizing device, equipment and medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
US20230154111A1 (en) Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
CN117315211B (en) Digital human synthesis and model training method, device, equipment and storage medium thereof
Xia et al. Controllable continuous gaze redirection
Pei et al. Deepfake generation and detection: A benchmark and survey
Kirschstein et al. Diffusionavatars: Deferred diffusion for high-fidelity 3d head avatars
Song et al. Pareidolia face reenactment
He et al. EmoTalk3D: high-fidelity free-view synthesis of emotional 3D talking head
Cho et al. GaussianTalker: Real-Time Talking Head Synthesis with 3D Gaussian Splatting
Lin et al. Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head
Shin et al. Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF
RU2775825C1 (en) Neural-network rendering of three-dimensional human avatars
Zhou Research on 3D reconstruction based on 2D face images.
Ma et al. Decoupled Two-Stage Talking Head Generation via Gaussian-Landmark-Based Neural Radiance Fields

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant