CN116704084A - Training method of facial animation generation network, facial animation generation method and device


Info

Publication number
CN116704084A
Authority
CN
China
Prior art keywords
image
face
network
information
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310959636.8A
Other languages
Chinese (zh)
Other versions
CN116704084B (en)
Inventor
Sun Hongyan (孙红岩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310959636.8A
Publication of CN116704084A
Application granted
Publication of CN116704084B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems


Abstract

The application provides a training method for a face animation generation network, a face animation generation method, and a corresponding device. The training method includes the following steps: acquiring a plurality of training sample images and their sample label information; for any sample image, determining the corresponding hidden variable and feature points, where the hidden variable represents the geometric and texture information of the face and the feature points represent the semantic information of the face region; inputting the hidden variable and feature points into an initial network, which determines the overall feature of the sample image from them and performs volume rendering on the overall feature to obtain a face animation image; constructing a target loss function from the face animation image and the sample label information; and iteratively training the initial network with the plurality of training sample images, adjusting the network parameters until the target loss function converges, to obtain the face animation generation network. The application addresses the technical problem in the related art that a finely depicted animation image corresponding to a face cannot be generated.

Description

Training method of facial animation generation network, facial animation generation method and device
Technical Field
The embodiment of the application relates to the technical field of machine learning, in particular to a training method of a face animation generation network, a face animation generation method and a face animation generation device.
Background
In recent years, digital humans and virtual avatars have gradually become an emerging topic in virtual human technology. Beyond reproducing real human likenesses, digital human technology can make character expressions more vivid and enable interaction with audiences. Traditional virtual human animation generation requires three steps: 1. virtual human modeling; 2. virtual human rigging (binding); 3. virtual human driving. To generate a virtual human animation, a digital human asset must first be produced (virtual human modeling); after the asset is produced, key points are set on it, skinning coefficients are set, expressions are captured and weights are computed to complete the rigging; finally, the key points are driven by a real person and, after the skinning coefficients are computed, the digital human is driven through deformation animation.
With the rise of neural networks, their generalization and universality have brought convenience to virtual human animation production. At present there are two main ways in which a neural network drives a digital human: vertex-by-vertex driving and hidden-variable (latent) driving. Although vertex driving can, in theory, drive the digital human more precisely, a current high-fidelity digital human may require hundreds of thousands of vertices to model, and driving hundreds of thousands of vertices cannot be realized accurately in a practical neural network implementation. Hidden variables are therefore used to drive the digital human: they can represent the digital human more compactly and accurately, but because hidden variables are not interpretable, the resulting driving behaviour cannot be predicted, and reverse engineering is usually needed to decode them back into images and other features.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a training method of a human face animation generation network, a human face animation generation method and a human face animation generation device, which at least solve the technical problem that an animation image corresponding to a human face cannot be finely depicted in the related art.
According to an embodiment of the present application, there is provided a training method of a face animation generation network, the method including: acquiring a plurality of training sample images and determining sample label information corresponding to each training sample image; for any training sample image, determining a first hidden variable and first feature points corresponding to the training sample image, where the first hidden variable represents geometric information and texture information of the face in the training sample image and the first feature points represent first semantic information of the face region in the training sample image; inputting the first hidden variable and the first feature points into an initial network to obtain a first face animation image output by the initial network, where the initial network determines a first overall feature of the training sample image from the first hidden variable and the first feature points and performs volume rendering on the first overall feature to obtain the first face animation image; constructing a target loss function from the first face animation image and the sample label information corresponding to the training sample image; and iteratively training the initial network with the plurality of training sample images, where the network parameters of the initial network are adjusted until the target loss function converges, to obtain the face animation generation network.
In one exemplary embodiment, acquiring a plurality of training sample images includes: screening a plurality of first videos from a first video data set, where the videos in the first video data set are face speaking videos and the face frame images of the first videos include a target number of first feature points; filtering out, from the plurality of first videos, those whose definition does not meet a preset condition; and taking the face frame images in the retained first videos as training sample images.
In an exemplary embodiment, taking the face frame images in the retained first videos as training sample images includes: preprocessing the face frame images in each retained first video, where the preprocessing includes: for each face frame image, cropping an image of a region of a preset size centered on the face as a training sample image.
In an exemplary embodiment, determining a first hidden variable and a first feature point corresponding to a training sample image includes: determining a first hidden variable corresponding to the training sample image by utilizing a pre-trained inversion inference network; and determining a first characteristic point corresponding to the training sample image by utilizing the pre-trained characteristic point extraction network.
In one exemplary embodiment, the initial network is constructed before the first hidden variable and the first feature points are input into it, where the initial network at least includes: a variable fusion sub-network, an affine transformation sub-network, a transcoding sub-network, and a three-dimensional decoding sub-network.
In an exemplary embodiment, inputting the first hidden variable and the first feature points into the initial network to obtain the first face animation image includes: inputting the first hidden variable and the first feature points into the variable fusion sub-network for fusion to obtain a second hidden variable comprising facial geometry, texture and semantic information; inputting the second hidden variable into the affine transformation sub-network for affine transformation to obtain a first feature variable in a target space, where the affine transformation comprises a scale transformation and a translation transformation; inputting the first feature variable into the transcoding sub-network for transcoding to obtain a first overall feature of the training sample image; converting the first overall feature into a first three-dimensional spatial feature of the training sample image; inputting the first three-dimensional spatial feature into the three-dimensional decoding sub-network for decoupling to obtain first voxel information, first color information and second semantic information corresponding to the training sample image; and performing volume rendering according to the first voxel information, the first color information and the second semantic information to obtain the first face animation image.
In one exemplary embodiment, converting the first global feature into a first three-dimensional spatial feature of the training sample image includes: dividing the first global feature into first sub-features of three planes; and carrying out space feature fusion on the three first sub-features in a summation mode to obtain first three-dimensional space features of the training sample image.
In one exemplary embodiment, the second semantic information is used to characterize face region information, wherein the face region comprises at least one of: skin, left eyebrow, right eyebrow, left ear, right ear, mouth, upper lip, lower lip, hair, hat, neck, left eye, right eye, nose, glasses.
In an exemplary embodiment, performing volume rendering according to the first voxel information, the first color information and the second semantic information to obtain a first face animation image, including: performing volume rendering according to the first voxel information and the first color information to obtain a first image rendering result; performing volume rendering according to the first voxel information and the second semantic information to obtain a first semantic rendering result; and generating a first facial animation image according to the first image rendering result and the first semantic rendering result.
In one exemplary embodiment, determining sample label information corresponding to each training sample image includes: for each training sample image, semantic image information corresponding to the training sample image and first pixel information of the training sample image are respectively determined; and taking semantic image information, first pixel information and information of the first feature points of the training sample image as sample label information of the training sample image.
In one exemplary embodiment, constructing the target loss function according to sample tag information corresponding to the first face animation image and the training sample image includes: determining second pixel information of the first facial animation image, and determining information of second feature points corresponding to the first facial animation image by utilizing a feature point extraction network; constructing a semantic loss function according to the first semantic rendering result and the semantic image information; constructing a characteristic point loss function according to the information of the second characteristic point and the information of the first characteristic point; constructing a pixel loss function according to the second pixel information and the first pixel information; and determining an objective loss function according to the semantic loss function, the characteristic point loss function and the pixel loss function.
In one exemplary embodiment, determining the target loss function from the semantic loss function, the feature point loss function, and the pixel loss function includes: respectively determining weight values corresponding to the semantic loss function, the characteristic point loss function and the pixel loss function; and carrying out weighted summation on the semantic loss function, the characteristic point loss function and the pixel loss function according to the weight value to obtain a target loss function.
In one exemplary embodiment, iteratively training an initial network with a plurality of training sample images includes: determining the scale of the training batch, and determining a target optimizer and an initial learning rate; and sequentially inputting training sample images of the training batch scale into an initial network, and performing iterative training on the initial network according to the target optimizer and the initial learning rate.
In an exemplary embodiment, after the face animation generation network is obtained, a plurality of second videos are screened from a second video data set, wherein the videos in the second video data set are lip-reading videos, and the face frame image of the second video comprises a target number of first feature points; filtering second videos of which the definition does not meet preset conditions in the plurality of second videos, and taking face frame images in the reserved second videos as test sample images; and testing the human face animation generation network by using the test sample image.
According to another embodiment of the present application, there is provided a face animation generation method, including: acquiring a target face image of a target object; determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face region in the target face image; and inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, wherein the face animation generation network is trained according to the training method of the face animation generation network.
In an exemplary embodiment, determining a third hidden variable and a third feature point corresponding to the target face image includes: determining a third hidden variable corresponding to the target face image by utilizing a pre-trained inversion inference network; and determining a third feature point corresponding to the target face image by utilizing the pre-trained feature point extraction network.
In an exemplary embodiment, inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, including: inputting the third hidden variable and the third feature point into a variable fusion sub-network in a face animation generation network for fusion to obtain a fourth hidden variable comprising face geometric, texture and semantic information; inputting the fourth hidden variable into an affine transformation sub-network in a face animation generation network to carry out affine transformation to obtain a second characteristic variable in a target space, wherein the affine transformation comprises scale transformation and translation transformation; inputting the second characteristic variable into a conversion coding sub-network in a face animation generation network to perform coding conversion to obtain a second integral characteristic of the target face image; converting the second integral feature into a second three-dimensional spatial feature of the target face image; inputting the second three-dimensional space features into a three-dimensional decoding sub-network in a face animation generation network to be decoupled, and obtaining second voxel information, second color information and fourth semantic information corresponding to the target face image; and performing volume rendering according to the second voxel information, the second color information and the fourth semantic information to obtain a second face animation image.
In one exemplary embodiment, converting the second global feature into a second three-dimensional spatial feature of the target face image includes: dividing the second global feature into three planar second sub-features; and carrying out spatial feature fusion on the three second sub-features in a summation mode to obtain second three-dimensional spatial features of the target face image.
In an exemplary embodiment, performing volume rendering according to the second voxel information, the second color information and the fourth semantic information to obtain a second face animation image, including: performing volume rendering according to the second voxel information and the second color information to obtain a second image rendering result; performing volume rendering according to the second voxel information and the fourth semantic information to obtain a second semantic rendering result; and generating a second facial animation image according to the second image rendering result and the second semantic rendering result.
According to another embodiment of the present application, there is provided a training apparatus of a face animation generation network, including: the first acquisition module is used for acquiring a plurality of training sample images and determining sample label information corresponding to each training sample image; the first determining module is used for determining a first hidden variable and first feature points corresponding to any training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first feature points are used for representing first semantic information of a face area in the training sample image; the processing module is used for inputting the first hidden variable and the first feature points into an initial network to obtain a first face animation image, wherein the initial network is used for determining a first overall feature of the training sample image according to the first hidden variable and the first feature points and performing volume rendering according to the first overall feature to obtain the first face animation image; the construction module is used for constructing a target loss function according to sample label information corresponding to the first face animation image and the training sample image; and the training module is used for performing iterative training on the initial network by using a plurality of training sample images, wherein the network parameters of the initial network are adjusted until the target loss function converges, and the face animation generation network is obtained.
According to another embodiment of the present application, there is provided a facial animation generating apparatus including: the second acquisition module is used for acquiring a target face image of a target object; the second determining module is used for determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face area in the target face image; the generating module is used for inputting the third hidden variable and the third feature point into the face animation generating network to obtain a second face animation image output by the face animation generating network, wherein the face animation generating network is trained according to the training method of the face animation generating network.
According to another embodiment of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the above-described training method of a face animation generation network or steps of a face animation generation method.
According to another embodiment of the present application, there is also provided an electronic apparatus, characterized by including: the system comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the training method of the facial animation generating network or the steps of the facial animation generating method through the computer program.
In the embodiments of the application, the hidden variable representing the geometric and texture information of the face and the explicit feature points representing the semantic information of the face region are fused to obtain the overall feature of the face, and the overall feature is then decoupled and volume-rendered to obtain a fine face animation image. In practical applications, the face image of the user can be acquired in real time: when the user speaks, the hidden variable may not change noticeably, but the positions of the feature points do change, so the scheme of the application can finely drive the face animation image according to the feature points, effectively solving the technical problem in the related art that the animation image corresponding to a face cannot be finely depicted.
Drawings
Fig. 1 is a schematic structural view of a computer terminal according to an embodiment of the present application;
FIG. 2 is a flow chart of a training method of a facial animation generating network, according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a variable fusion subnetwork according to an embodiment of the present application;
fig. 4 is a schematic diagram of an affine transformation sub-network according to an embodiment of the application;
FIG. 5 is a schematic diagram of a transcoding subnetwork according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a three-dimensional decoding sub-network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the face regions corresponding to semantic information according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of an initial network according to an embodiment of the present application;
FIG. 9 is a flow chart of a face animation generation method according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a training device of a facial animation generating network, according to an embodiment of the application;
fig. 11 is a schematic structural view of a facial animation generating device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims and drawings of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the embodiments of the present application, some nouns or translations of terms that appear during the description of the embodiments of the present application are explained first as follows:
VoxCeleb2 dataset: contains clips of over 6,000 celebrities, with about one million utterances extracted from YouTube. The speakers cover different ages, genders and accents, and the recording scenarios are very diverse, including red-carpet shows, outdoor venues, indoor studio recordings and the like; the recording devices range from professional to handheld, and the background noise includes speech, laughter and various scene effects.
LRW (Lip Reading in the Wild) dataset: a speech recognition dataset comprising video clips of 500 English words, each spoken by a person facing the camera; each clip lasts about 3 seconds at 25 frames per second. The LRW dataset is widely used in research on speech recognition, lip (mouth-shape) recognition, audio-visual integration and the like, and provides a rich resource for developing and improving machine learning models on lip-reading tasks.
Tri-plane algorithm: an algorithm for use in image processing and computer vision for dividing pixels in an image into three planes representing intensities of red, green and blue channels, respectively. The algorithm can be used in applications such as image segmentation and color space conversion.
The following describes the application in connection with preferred embodiments.
According to an embodiment of the present application, there is provided a training method of a face animation generation network and a method embodiment of the face animation generation method, it being noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that herein.
The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal for implementing the training method of the face animation generation network or the face animation generation method. As shown in fig. 1, the computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a microprocessor (MCU) or a processing device such as a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10. As referred to in the embodiments of the application, the data processing circuit acts as a kind of processor control (e.g., selecting the path of a variable resistance termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the training method of the face animation generation network or the face animation generation method in the embodiments of the present application; the processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-described training method of the face animation generation network or the face animation generation method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10.
In the above operating environment, the embodiment of the present application first provides a training method for a face animation generation network, as shown in fig. 2, where the method includes the following steps:
step S202, a plurality of training sample images are obtained, and sample label information corresponding to each training sample image is determined;
step S204, for any training sample image, determining a first hidden variable and a first characteristic point corresponding to the training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first characteristic point is used for representing first semantic information of a face area in the training sample image;
Step S206, inputting the first hidden variable and the first characteristic point into an initial network to obtain a first face animation image output by the initial network, wherein the initial network is used for determining a first integral characteristic of a training sample image according to the first hidden variable and the first characteristic point and performing volume rendering according to the first integral characteristic to obtain the first face animation image;
step S208, constructing a target loss function according to sample label information corresponding to the first face animation image and the training sample image;
Step S210, performing iterative training on the initial network by using a plurality of training sample images, where the network parameters of the initial network are adjusted until the target loss function converges, and the face animation generation network is obtained.
The steps of the network training method are described below in connection with specific implementation procedures.
As an alternative embodiment, a plurality of training sample images may be acquired by: screening a plurality of first videos from a first video data set, wherein the videos in the first video data set are face speaking videos, and face frame images of the first videos comprise first feature points with target numbers; filtering first videos of which the definition does not meet preset conditions from the plurality of first videos; and taking the face frame image in the reserved first video as a training sample image.
The first feature points are key points in the face and can represent semantic information such as different areas in the face.
The first video data set may be the VoxCeleb2 data set. To ensure that enough first feature points are available in the training sample images for training, videos in which the face is oriented towards the camera with a head swing amplitude between -15 and 15 degrees may be selected during screening, so that the number of first feature points in the face frame images reaches the target number. The target number can be adjusted as required; in the embodiment of the application it is preferably 68.
Optionally, when the face frame images in the first videos retained after data cleaning are used as training sample images, the face frame images in each retained first video may be preprocessed as follows: for each face frame image, an image of a region of a preset size centered on the face is cropped as the training sample image, which improves the accuracy of image recognition and detection and thus the accuracy of the finally trained network. The preprocessed frame images may be saved back as a video-format file.
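As an illustration of this preprocessing step, the following sketch crops a fixed-size, face-centred window from one frame; the 512 x 512 crop size and the use of the landmark mean as the face centre are assumptions made for the example, not values fixed by the embodiment.

```python
# A minimal sketch of the face-centred crop described above (assumed values noted).
import numpy as np

def crop_face_region(frame: np.ndarray, landmarks: np.ndarray, size: int = 512) -> np.ndarray:
    """Crop a size x size region centred on the face from one video frame.

    frame:     H x W x 3 uint8 image (one face frame of a retained first video).
    landmarks: N x 2 array of detected first feature points (e.g. N = 68).
    """
    h, w = frame.shape[:2]
    cx, cy = landmarks.mean(axis=0)                 # face centre taken from the feature points
    half = size // 2
    # Clamp the crop window so it stays inside the frame.
    x0 = int(np.clip(cx - half, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - half, 0, max(h - size, 0)))
    return frame[y0:y0 + size, x0:x0 + size]
```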
As an alternative implementation, the first hidden variable and the first feature points corresponding to a training sample image may be determined as follows: the first hidden variable, which may be denoted as the latent Z, is determined with a pre-trained inversion inference network; and the first feature points, which may be denoted as landmarks, are determined with a pre-trained feature point extraction network.
The inversion inference network (Inverse Reasoning Network) is a neural network model for solving inference problems, whose goal is to infer possible causes or inputs from observed results. In the embodiment of the present application, the inversion inference network may be the pSp (pixel2style2pixel) image conversion framework, which is based on an encoder network that generates a series of style vectors; these vectors are fed into a pre-trained StyleGAN (style-based generative adversarial network) to form an extended latent space, i.e., the hidden variable.
The feature point extraction network may use a common CNN (convolutional neural network) structure and is trained to locate the key points in an image.
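A minimal sketch of this extraction stage is given below; `inversion_net` and `landmark_net` are hypothetical handles to a pre-trained pSp-style inversion encoder and a pre-trained landmark CNN, and are not APIs defined in this application.

```python
# Illustrative only: both networks are assumed to be already-loaded PyTorch modules.
import torch

@torch.no_grad()
def extract_inputs(image: torch.Tensor, inversion_net, landmark_net):
    """image: 1 x 3 x H x W normalised training sample image."""
    latent_z = inversion_net(image)      # first hidden variable, e.g. shape 1 x 512
    landmarks = landmark_net(image)      # first feature points, e.g. shape 1 x 68 x 2
    return latent_z, landmarks
```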
To train the face animation generation network, an initial network is constructed first. Optionally, the initial network in the embodiment of the present application at least includes: a variable fusion sub-network, an affine transformation sub-network, a transcoding sub-network, and a three-dimensional decoding sub-network.
The variable fusion sub-network may be a Mapping Network; an optional structure is shown in fig. 3. The network includes three Linear layers of 512, 392 and 512 dimensions, respectively, and is used to fuse the input first hidden variable (latent Z) and the first feature points (landmarks) into a second hidden variable W that includes facial geometry, texture and semantic information.
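A possible PyTorch realisation of this variable fusion sub-network is sketched below; only the 512/392/512 layer widths come from the description above, while the 512-dimensional latent Z, the flattening and concatenation of the 68 landmarks, and the LeakyReLU activations are assumptions.

```python
# Sketch of the variable fusion sub-network (Mapping Network) of fig. 3.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, z_dim: int = 512, lm_dim: int = 68 * 2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(z_dim + lm_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 392), nn.LeakyReLU(0.2),
            nn.Linear(392, 512),
        )

    def forward(self, latent_z: torch.Tensor, landmarks: torch.Tensor) -> torch.Tensor:
        # Fuse geometry/texture (latent Z) with explicit semantics (landmarks) into W.
        x = torch.cat([latent_z, landmarks.flatten(1)], dim=1)
        return self.layers(x)            # second hidden variable W, shape B x 512
```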
The affine transformation sub-network is denoted A; an optional structure is shown in FIG. 4. The network includes three Linear layers of 512, 256 and 128 dimensions, respectively, and applies an affine transformation to the input 512-dimensional second hidden variable W to obtain a 128-dimensional feature variable in the target space, where the first 64 components correspond to a scale transformation and the last 64 components correspond to a translation transformation.
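The following sketch mirrors that description; reading the 512/256/128 figures as the output widths of the three Linear layers and inserting ReLU activations between them are interpretive assumptions.

```python
# Sketch of the affine transformation sub-network A of FIG. 4.
import torch
import torch.nn as nn

class AffineSubNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128),
        )

    def forward(self, w: torch.Tensor):
        feat = self.net(w)                            # B x 128 feature variable in the target space
        scale, shift = feat[:, :64], feat[:, 64:]     # first 64: scale, last 64: translation
        return scale, shift
```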
The transcoding sub-network may be a Transformer network; an optional structure is shown in fig. 5. The network includes an embedding layer (Embedding), two normalization layers (Norm), an attention layer (Multi-Head Attention) and a multi-layer perceptron layer (MLP), where the two normalization layers may be self-modulated normalization layers (Self-Modulated LayerNorm) computed as

\mathrm{SLN}(h, w) = \gamma(w) \odot \dfrac{h - \mu}{\sigma} + \beta(w)

where h is the embedding-layer input, w is the second hidden variable, \mu and \sigma are the mean and standard deviation of the input to the network layer, and \gamma(w) and \beta(w) are the scale and shift derived from the second hidden variable w; their dimension may be 64, matching the scale and translation components output by the affine transformation sub-network. Through the two self-modulated normalization layers, the feature variable output by the affine transformation sub-network is injected into the transcoding sub-network, yielding the overall feature of the image.
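A compact sketch of such a self-modulated normalization layer is given below; the token layout (B x T x 64) is an assumption, and the scale/shift arguments are the outputs of the affine sub-network A(w).

```python
# Sketch of Self-Modulated LayerNorm: normalise h, then modulate with A(w).
import torch
import torch.nn as nn

class SelfModulatedLayerNorm(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # LayerNorm without its own affine parameters; modulation comes from A(w).
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, h: torch.Tensor, scale: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
        # h:            B x T x 64 token features inside the Transformer
        # scale, shift: B x 64 scale/translation produced by the affine sub-network A(w)
        return self.norm(h) * scale.unsqueeze(1) + shift.unsqueeze(1)
```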
The three-dimensional decoding sub-network may be a multi-layer perceptron that decouples the three-dimensional spatial feature into voxel (density) information, color information and semantic information. An optional structure is shown in fig. 6. The network includes two base Linear layers of 32 and 256 dimensions, respectively; two Linear layers of 128 and 3 dimensions corresponding to the color information; two Linear layers of 128 and 15 dimensions corresponding to the semantic information; and two Linear layers of 128 and 1 dimensions corresponding to the voxel information.
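A sketch of this decoder is given below, assuming the sampled 32-channel tri-plane feature as input; the exact layer ordering and the ReLU activations are interpretations of the figure rather than details fixed by the embodiment.

```python
# Sketch of the three-dimensional decoding sub-network (fig. 6).
import torch
import torch.nn as nn

class TriPlaneDecoder(nn.Module):
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, 256), nn.ReLU(),
        )
        self.sigma_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
        self.color_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 3))
        self.sem_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 15))

    def forward(self, feat: torch.Tensor):
        h = self.trunk(feat)                 # B x N x 256 for N sampled ray points
        sigma = self.sigma_head(h)           # voxel (density) information
        color = self.color_head(h)           # color information
        semantics = self.sem_head(h)         # semantic information (15 face regions)
        return sigma, color, semantics
```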
As an alternative embodiment, after the first hidden variable and the first feature points are input into the initial network, the overall processing procedure of the initial network is as follows: the first hidden variable and the first feature points are input into the variable fusion sub-network and fused to obtain a second hidden variable comprising facial geometry, texture and semantic information; the second hidden variable is input into the affine transformation sub-network for affine transformation to obtain a first feature variable in the target space, where the affine transformation comprises a scale transformation and a translation transformation; the first feature variable is input into the transcoding sub-network for transcoding to obtain a first overall feature of the training sample image; the first overall feature is converted into a first three-dimensional spatial feature of the training sample image; the first three-dimensional spatial feature is input into the three-dimensional decoding sub-network (a multi-layer perceptron) for decoupling to obtain the first voxel information, first color information and second semantic information corresponding to the training sample image; and volume rendering is performed according to the first voxel information, the first color information and the second semantic information to obtain the first face animation image.
Alternatively, when converting the first global feature into the first three-dimensional spatial feature of the training sample image, a tri-plane algorithm may be used to divide the first global feature into first sub-features of three planes, for example, dividing the first global feature of 256×256×96 into three first sub-features of 256×256×32; and then carrying out space feature fusion on the three first sub-features in a summation mode to obtain a first three-dimensional space feature of the training sample image.
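A minimal sketch of this conversion, following the text literally (channel-wise split of the 256 x 256 x 96 overall feature into three 256 x 256 x 32 planes, then fusion by summation):

```python
# Sketch of the tri-plane split-and-sum fusion described above.
import torch

def to_three_dimensional_feature(overall: torch.Tensor) -> torch.Tensor:
    """overall: B x 96 x 256 x 256 first overall feature (channel-first layout assumed)."""
    planes = torch.chunk(overall, chunks=3, dim=1)    # three B x 32 x 256 x 256 first sub-features
    return planes[0] + planes[1] + planes[2]          # fused spatial feature, B x 32 x 256 x 256
```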
Optionally, the decoupled second semantic information may characterize face region information, where the face region includes at least one of: skin, left eyebrow, right eyebrow, left ear, right ear, mouth, upper lip, lower lip, hair, hat, neck, left eye, right eye, nose, and glasses. Fig. 7 shows an example of such face region information.
Alternatively, when performing volume rendering according to the first voxel information, the first color information, and the second semantic information, the following manner may be performed: firstly, performing volume rendering according to first voxel information and first color information to obtain a first image rendering result, wherein the calculation formula is as follows:
\hat{I}(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)

where \hat{I}(r) is the image rendering result for the ray r(t) = o + t\,d cast from the camera origin o in direction d towards the object, t is the propagation distance along the ray, T(t) is the accumulated transparency, \sigma(r(t)) is the voxel density at the ray point, and c(r(t), d) is the color information at the ray point.
And then performing volume rendering according to the first voxel information and the second semantic information to obtain a first semantic rendering result, wherein the calculation formula is as follows:
\hat{S}(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,s(r(t))\,dt

where \hat{S}(r) is the semantic rendering result and s(r(t)) is the semantic information at the ray point; T(t), \sigma and r(t) are defined as above.
And finally, generating a first facial animation image according to the first image rendering result and the first semantic rendering result.
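In a discrete implementation the two integrals above are typically approximated by quadrature over sampled ray points; the following sketch assumes NeRF-style uniform sampling and is illustrative rather than the embodiment's exact procedure.

```python
# Discretised sketch of the image and semantic volume-rendering integrals.
import torch

def volume_render(sigma, color, semantics, deltas):
    """sigma: B x R x N x 1, color: B x R x N x 3, semantics: B x R x N x 15,
    deltas: B x R x N x 1 distances between consecutive samples on each of R rays."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # opacity of each sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=2)           # accumulated transparency T(t)
    trans = torch.cat([torch.ones_like(trans[:, :, :1]), trans[:, :, :-1]], dim=2)
    weights = alpha * trans                                     # discrete analogue of T(t) sigma dt
    image = (weights * color).sum(dim=2)                        # first image rendering result
    semantic = (weights * semantics).sum(dim=2)                 # first semantic rendering result
    return image, semantic
```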
Fig. 8 shows a schematic structural diagram of the complete initial network: the first hidden variable (latent Z) and the first feature points (landmarks) are input into the variable fusion sub-network (Mapping Network) and fused to obtain a second hidden variable W that includes facial geometry, texture and semantic information; the second hidden variable W is input into the affine transformation sub-network A for affine transformation to obtain the first feature variable in the target space; the first feature variable is input into the transcoding sub-network (Transformer Encoder) for transcoding to obtain the first overall feature of the training sample image; the first overall feature is converted into the first three-dimensional spatial feature of the training sample image; the first three-dimensional spatial feature is input into the multi-layer perceptron (MLP) for decoupling, yielding the first voxel information, the first color information and the second semantic information corresponding to the training sample image; volume rendering is performed according to the first voxel information, first color information and second semantic information to obtain a first semantic rendering result and a first image rendering result; the first semantic rendering result is up-sampled to obtain the semantic rendering result S, the first image rendering result is super-resolved to obtain the image rendering result I, and the first face animation image is generated from the semantic rendering result S and the image rendering result I.
As an alternative embodiment, in determining the sample label information corresponding to each training sample image, the following may be performed: for each training sample image, semantic image information corresponding to the training sample image and first pixel information of the training sample image are respectively determined; and taking semantic image information, first pixel information and information of the first feature points of the training sample image as sample label information of the training sample image.
Optionally, when the target loss function is constructed, the second pixel information of the first face animation image is determined first, and the information of the second feature points corresponding to the first face animation image is determined with the feature point extraction network. A semantic loss function L_{sem} is then constructed from the first semantic rendering result \hat{S} and the semantic image information S, normalized over the image, where H and W are the height and width of the image, respectively.
A feature point loss function L_{lm} is constructed from the information of the second feature points and the information of the first feature points, where the preset weight in this loss term is a hyperparameter.
A pixel loss function L_{pix} is constructed from the second pixel information of the first face animation image and the first pixel information of the training sample image.
Finally, a target loss function is determined from the semantic loss function, the feature point loss function and the pixel loss function.
Optionally, the target loss function may be obtained by direct summation:

L = L_{sem} + L_{lm} + L_{pix}

To further improve the reliability of network training, weight values \lambda_{sem}, \lambda_{lm} and \lambda_{pix} may instead be determined for the semantic loss function, the feature point loss function and the pixel loss function respectively, and the three losses weighted and summed according to these weight values to obtain the target loss function:

L = \lambda_{sem} L_{sem} + \lambda_{lm} L_{lm} + \lambda_{pix} L_{pix}
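A sketch of the weighted target loss is given below; the concrete forms of the three terms (L1 for pixels and feature points, cross-entropy for the 15-class semantics) and the weight values are illustrative assumptions, since the embodiment only fixes that the three losses are combined by (weighted) summation.

```python
# Sketch of the weighted target loss L = w_sem*L_sem + w_lm*L_lm + w_pix*L_pix.
import torch
import torch.nn.functional as F

def target_loss(pred_img, gt_img, pred_sem, gt_sem, pred_lm, gt_lm,
                w_sem=1.0, w_lm=1.0, w_pix=1.0):
    # pred_sem: B x 15 x H x W logits; gt_sem: B x H x W integer region labels.
    l_pix = F.l1_loss(pred_img, gt_img)            # pixel loss
    l_sem = F.cross_entropy(pred_sem, gt_sem)      # semantic loss over the 15 face regions
    l_lm = F.l1_loss(pred_lm, gt_lm)               # feature point loss
    return w_sem * l_sem + w_lm * l_lm + w_pix * l_pix
```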
when a plurality of training sample images are used for carrying out iterative training on an initial network, firstly, determining the batch size of training, and determining a target optimizer and an initial learning rate; and then sequentially inputting training sample images of the training batch scale into an initial network, and carrying out iterative training on the initial network according to the target optimizer and the initial learning rate. The target optimizer may use Adam, and the batch size and the initial learning rate are set according to the requirement, and in this embodiment, the batch size=64 and the initial learning rate is 0.0001.
As an optional implementation manner, after the face animation generating network is obtained, the training network may be tested by using a test data set, an acquisition method of the test data set is similar to that of the training data set, and a plurality of second videos may be screened out from the second video data set, where the videos in the second video data set are lip-reading videos, and a face frame image of the second video includes a target number of first feature points; filtering second videos of which the definition does not meet preset conditions in the plurality of second videos, and taking face frame images in the reserved second videos as test sample images; and finally, testing the face animation generation network by using the test sample image. Wherein the second video data set may be an LRW data set.
In the embodiments of the application, the hidden variable representing the geometric and texture information of the face and the explicit feature points representing the semantic information of the face region are fused to obtain the overall feature of the face, and the overall feature is then decoupled and volume-rendered to obtain a fine face animation image. In practical applications, the face image of the user can be acquired in real time: when the user speaks, the hidden variable may not change noticeably, but the positions of the feature points do change, so the scheme of the application can finely drive the face animation image according to the feature points, effectively solving the technical problem in the related art that the animation image corresponding to a face cannot be finely depicted.
On the basis of the training obtained face animation generation network, the embodiment of the application also provides a face animation generation method, as shown in fig. 9, comprising the following steps:
step S902, obtaining a target face image of a target object;
step S904, determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face region in the target face image;
step S906, inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, wherein the face animation generation network is trained according to the training method of the face animation generation network.
The steps of the face animation generation method are described below in connection with specific implementation procedures.
The obtained target face image of the target object may be a user face image collected in real time through a camera. As an optional implementation manner, when determining the third hidden variable and the third feature point corresponding to the target face image, the pre-trained inversion inference network may be used to determine the third hidden variable corresponding to the target face image, and then the pre-trained feature point extraction network may be used to determine the third feature point corresponding to the target face image.
After the third hidden variable and the third feature points are input into the face animation generation network, the overall processing procedure of the face animation generation network is as follows: the third hidden variable and the third feature points are input into the variable fusion sub-network of the face animation generation network and fused to obtain a fourth hidden variable comprising facial geometry, texture and semantic information; the fourth hidden variable is input into the affine transformation sub-network of the face animation generation network for affine transformation to obtain a second feature variable in the target space, where the affine transformation comprises a scale transformation and a translation transformation; the second feature variable is input into the transcoding sub-network of the face animation generation network for transcoding to obtain a second overall feature of the target face image; the second overall feature is converted into a second three-dimensional spatial feature of the target face image; the second three-dimensional spatial feature is input into the three-dimensional decoding sub-network (multi-layer perceptron) of the face animation generation network for decoupling to obtain the second voxel information, second color information and fourth semantic information corresponding to the target face image; and volume rendering is performed according to the second voxel information, the second color information and the fourth semantic information to obtain the second face animation image.
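Put together, inference then reduces to one forward pass, as in the following sketch (module names are the illustrative placeholders used in the earlier sketches, not APIs defined by this application):

```python
# End-to-end generation sketch using the trained face animation generation network.
import torch

@torch.no_grad()
def generate_face_animation(face_image, inversion_net, landmark_net, generation_net):
    latent_z = inversion_net(face_image)       # third hidden variable
    landmarks = landmark_net(face_image)       # third feature points
    image, semantic = generation_net(latent_z, landmarks)
    return image                               # second face animation image
```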
Optionally, when the second integral feature is converted into the second three-dimensional spatial feature of the target face image, the second integral feature may be divided into three planar second sub-features, and then the three second sub-features are subjected to spatial feature fusion in a summation manner, so as to obtain the second three-dimensional spatial feature of the target face image.
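A minimal sketch of this tri-plane step is given below, under the assumption that the overall feature has already been reshaped into three feature planes as above: each 3D sample point is projected onto the three planes, bilinearly sampled, and the three sampled sub-features are fused by summation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """planes: (B, 3, C, H, W) planar sub-features; points: (B, N, 3) coordinates in [-1, 1].
    Returns (B, N, C) features obtained by projecting each point onto the three planes
    and summing the sampled sub-features."""
    projections = (points[..., [0, 1]],   # projection onto the XY plane
                   points[..., [0, 2]],   # projection onto the XZ plane
                   points[..., [1, 2]])   # projection onto the YZ plane
    feats = 0
    for i, grid in enumerate(projections):
        # grid_sample expects (B, H_out, W_out, 2); treat the N points as a 1 x N grid
        sampled = F.grid_sample(planes[:, i], grid.unsqueeze(1),
                                mode='bilinear', align_corners=False)   # (B, C, 1, N)
        feats = feats + sampled.squeeze(2).permute(0, 2, 1)             # (B, N, C)
    return feats
```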
Optionally, when volume rendering is performed according to the second voxel information, the second color information and the fourth semantic information to obtain the second face animation image, volume rendering may first be performed according to the second voxel information and the second color information to obtain a second image rendering result; volume rendering is then performed according to the second voxel information and the fourth semantic information to obtain a second semantic rendering result; and the second face animation image is generated according to the second image rendering result and the second semantic rendering result.
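Both rendering passes can reuse the same density-derived weights, as in the following sketch of standard volume rendering by alpha compositing; the tensor layout is an assumption, and since the application does not detail how the two rendering results are finally combined into the output image, the sketch simply returns both.

```python
import torch

def volume_render(sigma, color, semantics, deltas):
    """sigma: (B, R, S, 1) densities, color: (B, R, S, 3), semantics: (B, R, S, K)
    per sample along each of R rays; deltas: (B, R, S, 1) spacing between samples."""
    alpha = 1.0 - torch.exp(-torch.relu(sigma) * deltas)               # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :, :1]),
                                     1.0 - alpha + 1e-10], dim=2), dim=2)[:, :, :-1]
    weights = alpha * trans                                            # contribution of each sample
    image_result = (weights * color).sum(dim=2)                        # image rendering result (B, R, 3)
    semantic_result = (weights * semantics).sum(dim=2)                 # semantic rendering result (B, R, K)
    return image_result, semantic_result
```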
In the embodiment of the application, the face image of the user can be collected in real time. When the user speaks, the hidden variables may not change obviously, but the positions of the feature points do change; the scheme of the application can therefore drive the facial animation image finely according to the feature points, effectively solving the technical problem in the related art that the animation image corresponding to the face cannot be finely depicted.
According to an embodiment of the present application, there is further provided a training device of a facial animation generating network for implementing the training method of the facial animation generating network, as shown in fig. 10, where the device at least includes:
a first obtaining module 101, configured to obtain a plurality of training sample images, and determine sample label information corresponding to each training sample image;
the first determining module 102 is configured to determine, for any training sample image, a first hidden variable and a first feature point corresponding to the training sample image, where the first hidden variable is used to represent geometric information and texture information of a face in the training sample image, and the first feature point is used to represent first semantic information of a face region in the training sample image;
the processing module 103 is configured to input a first hidden variable and a first feature point into an initial network to obtain a first face animation image, where the initial network is configured to determine a first overall feature of the training sample image according to the first hidden variable and the first feature point, and perform volume rendering according to the first overall feature to obtain the first face animation image;
a construction module 104, configured to construct a target loss function according to sample label information corresponding to the first face animation image and the training sample image;
the training module 105 is configured to perform iterative training on the initial network by using the plurality of training sample images, where the network parameters of the initial network are adjusted in a manner of converging the target loss function, so as to obtain the face animation generation network.
It should be noted that each module in the training device of the facial animation generation network in the embodiment of the present application corresponds one-to-one to the implementation steps of the training method of the facial animation generation network. Since the foregoing method embodiment has been described in detail, for details not covered in this embodiment, reference may be made to the method embodiment; they will not be repeated here.
According to an embodiment of the present application, there is also provided a facial animation generating device for implementing the above facial animation generating method, as shown in fig. 11, where the device at least includes:
a second acquiring module 111, configured to acquire a target face image of a target object;
a second determining module 112, configured to determine a third hidden variable and a third feature point corresponding to the target face image, where the third hidden variable is used to represent geometric information and texture information of a face in the target face image, and the third feature point is used to represent third semantic information of a face area in the target face image;
The generating module 113 is configured to input the third hidden variable and the third feature point into a face animation generating network, and obtain a second face animation image output by the face animation generating network, where the face animation generating network is trained according to the training method of the face animation generating network.
It should be noted that each module in the facial animation generating device in the embodiment of the present application corresponds one-to-one to the implementation steps of the facial animation generation method. Since the foregoing method embodiment has been described in detail, for details not covered in this embodiment, reference may be made to the method embodiment; they will not be repeated here.
Optionally, an embodiment of the present application further provides a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the method embodiments described above when run.
Specifically, the computer program when run performs the steps of: acquiring a plurality of training sample images, and determining sample label information corresponding to each training sample image; for any training sample image, determining a first hidden variable and a first characteristic point corresponding to the training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first characteristic point is used for representing first semantic information of a face area in the training sample image; inputting the first hidden variable and the first characteristic point into an initial network to obtain a first face animation image output by the initial network, wherein the initial network is used for determining a first integral characteristic of a training sample image according to the first hidden variable and the first characteristic point, and performing volume rendering according to the first integral characteristic to obtain the first face animation image; constructing a target loss function according to sample label information corresponding to the first face animation image and the training sample image; and performing iterative training on the initial network by utilizing a plurality of training sample images, wherein network parameters of the initial network are adjusted in a mode of converging the target loss function, and a face animation generation network is obtained.
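For illustration only, the following PyTorch sketch outlines a single training iteration consistent with the steps above, assuming `model` wraps the full pipeline (fusion, affine transformation, transcoding, tri-plane sampling, decoding and volume rendering) and returns a rendered image together with rendered semantic logits; the concrete loss forms (cross-entropy and L1) and the loss weights are assumptions, since only semantic, feature point and pixel losses combined by weighted summation are specified.

```python
import torch
import torch.nn.functional as F

def train_step(model, inversion_net, landmark_net, batch, optimizer,
               w_sem=1.0, w_pts=1.0, w_pix=1.0):
    """One illustrative iteration; `model(hidden, points)` is assumed to return a
    rendered image (B, 3, H, W) and rendered semantic logits (B, K, H, W)."""
    image, sem_label, pts_label = batch              # training sample image and its label information
    with torch.no_grad():
        hidden = inversion_net(image)                # first hidden variable
        points = landmark_net(image)                 # first feature points
    pred_img, pred_sem = model(hidden, points)       # first face animation image + semantic rendering
    pred_pts = landmark_net(pred_img)                # second feature points, taken from the output image
    loss = (w_sem * F.cross_entropy(pred_sem, sem_label)   # semantic loss (sem_label: (B, H, W) class ids)
            + w_pts * F.l1_loss(pred_pts, pts_label)       # feature point loss
            + w_pix * F.l1_loss(pred_img, image))          # pixel loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Consistent with the iterative-training step, the training batch scale, target optimizer and initial learning rate would be fixed beforehand, for example `torch.optim.Adam(model.parameters(), lr=1e-4)`; these values are again only placeholders.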
Optionally, the computer program when run performs the steps of: acquiring a target face image of a target object; determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face region in the target face image; and inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, wherein the face animation generation network is trained according to the training method of the face animation generation network.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a USB flash disk, a read-only memory, a random access memory, a removable hard disk, a magnetic disk or an optical disk.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor is configured to implement the following steps by computer program execution: acquiring a plurality of training sample images, and determining sample label information corresponding to each training sample image; for any training sample image, determining a first hidden variable and a first characteristic point corresponding to the training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first characteristic point is used for representing first semantic information of a face area in the training sample image; inputting the first hidden variable and the first characteristic point into an initial network to obtain a first face animation image output by the initial network, wherein the initial network is used for determining a first integral characteristic of a training sample image according to the first hidden variable and the first characteristic point, and performing volume rendering according to the first integral characteristic to obtain the first face animation image; constructing a target loss function according to sample label information corresponding to the first face animation image and the training sample image; and performing iterative training on the initial network by utilizing a plurality of training sample images, wherein network parameters of the initial network are adjusted in a mode of converging the target loss function, and a face animation generation network is obtained.
Optionally, the processor is configured to implement the following steps by computer program execution: acquiring a target face image of a target object; determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face region in the target face image; and inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, wherein the face animation generation network is trained according to the training method of the face animation generation network.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for any part that is not described in detail in a given embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described apparatus embodiments are merely exemplary; the division into units is only a logical function division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a read-only memory, a random access memory, a removable hard disk, a magnetic disk, an optical disk, or the like.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the protection scope of the present application.

Claims (23)

1. A training method for a facial animation generating network, comprising:
acquiring a plurality of training sample images, and determining sample label information corresponding to each training sample image;
for any training sample image, determining a first hidden variable and a first characteristic point corresponding to the training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first characteristic point is used for representing first semantic information of a face region in the training sample image;
inputting the first hidden variable and the first characteristic point into an initial network to obtain a first face animation image output by the initial network, wherein the initial network is used for determining a first integral characteristic of the training sample image according to the first hidden variable and the first characteristic point, and performing volume rendering according to the first integral characteristic to obtain the first face animation image;
Constructing a target loss function according to sample label information corresponding to the first face animation image and the training sample image;
and performing iterative training on the initial network by using the training sample images, wherein network parameters of the initial network are adjusted in a mode of converging the target loss function, and a face animation generation network is obtained.
2. The method of claim 1, wherein acquiring a plurality of training sample images comprises:
screening a plurality of first videos from a first video data set, wherein the videos in the first video data set are face speaking videos, and face frame images of the first videos comprise a target number of first feature points;
filtering first videos of which the definition does not meet preset conditions in the plurality of first videos;
and taking the face frame image in the retained first video as the training sample image.
3. The method according to claim 2, wherein taking the face frame image in the retained first video as the training sample image comprises:
preprocessing the face frame image in each retained first video, wherein the preprocessing comprises: for each frame of the face frame image, taking the face as a center, and intercepting an image of a region with a preset size as the training sample image.
4. The method of claim 1, wherein determining a first hidden variable and a first feature point corresponding to the training sample image comprises:
determining the first hidden variable corresponding to the training sample image by utilizing a pre-trained inversion inference network;
and determining the first characteristic points corresponding to the training sample images by utilizing a pre-trained characteristic point extraction network.
5. The method of claim 1, wherein prior to inputting the first hidden variable and the first feature point into an initial network, the method further comprises:
constructing the initial network, wherein the initial network at least comprises: a variable fusion sub-network, an affine transformation sub-network, a conversion coding sub-network, and a three-dimensional decoding sub-network.
6. The method of claim 5, wherein inputting the first hidden variable and the first feature point into an initial network to obtain a first facial animation image comprises:
inputting the first hidden variable and the first characteristic point into the variable fusion sub-network for fusion to obtain a second hidden variable comprising facial geometry, texture and semantic information;
inputting the second hidden variable into the affine transformation sub-network to carry out affine transformation to obtain a first characteristic variable in a target space, wherein the affine transformation comprises scale transformation and translation transformation;
Inputting the first characteristic variable into the conversion coding sub-network to perform coding conversion to obtain the first integral characteristic of the training sample image;
converting the first global feature into a first three-dimensional spatial feature of the training sample image;
inputting the first three-dimensional space features into the three-dimensional decoding sub-network for decoupling, and obtaining first voxel information, first color information and second semantic information corresponding to the training sample image;
and performing volume rendering according to the first voxel information, the first color information and the second semantic information to obtain the first facial animation image.
7. The method of claim 6, wherein converting the first global feature to a first three-dimensional spatial feature of the training sample image comprises:
dividing the first global feature into three planar first sub-features;
and carrying out space feature fusion on the three first sub-features in a summation mode to obtain a first three-dimensional space feature of the training sample image.
8. The method of claim 6, wherein,
the second semantic information is used for representing face region information, wherein the face region comprises at least one of the following: skin, left eyebrow, right eyebrow, left ear, right ear, mouth, upper lip, lower lip, hair, hat, neck, left eye, right eye, nose, and glasses.
9. The method of claim 6, wherein performing volume rendering based on the first voxel information, the first color information, and the second semantic information to obtain the first facial animation image comprises:
performing volume rendering according to the first voxel information and the first color information to obtain a first image rendering result;
performing volume rendering according to the first voxel information and the second semantic information to obtain a first semantic rendering result;
and generating the first facial animation image according to the first image rendering result and the first semantic rendering result.
10. The method of claim 9, wherein determining sample label information corresponding to each of the training sample images comprises:
for each training sample image, semantic image information corresponding to the training sample image and first pixel information of the training sample image are respectively determined;
and taking the semantic image information, the first pixel information and the information of the first feature points of the training sample image as sample label information of the training sample image.
11. The method of claim 10, wherein constructing an objective loss function from sample label information corresponding to the first face animation image and the training sample image comprises:
Determining second pixel information of the first facial animation image, and determining information of second feature points corresponding to the first facial animation image by utilizing a feature point extraction network;
constructing a semantic loss function according to the first semantic rendering result and the semantic image information;
constructing a characteristic point loss function according to the information of the second characteristic point and the information of the first characteristic point;
constructing a pixel loss function according to the second pixel information and the first pixel information;
and determining the target loss function according to the semantic loss function, the characteristic point loss function and the pixel loss function.
12. The method of claim 11, wherein determining the target loss function from the semantic loss function, the feature point loss function, and the pixel loss function comprises:
respectively determining weight values corresponding to the semantic loss function, the characteristic point loss function and the pixel loss function;
and carrying out weighted summation on the semantic loss function, the characteristic point loss function and the pixel loss function according to the weight value to obtain the target loss function.
13. The method of claim 1, wherein iteratively training the initial network using the plurality of training sample images comprises:
Determining the scale of the training batch, and determining a target optimizer and an initial learning rate;
and sequentially inputting the training sample images of the training batch scale into the initial network, and performing iterative training on the initial network according to the target optimizer and the initial learning rate.
14. The method of claim 1, wherein after obtaining the face animation generation network, the method further comprises:
screening a plurality of second videos from a second video data set, wherein the videos in the second video data set are lip-reading videos, and the face frame images of the second videos comprise a target number of first feature points;
filtering the second videos of which the definition does not meet the preset condition in the plurality of second videos, and taking the face frame images in the retained second videos as test sample images;
and testing the facial animation generating network by using the test sample image.
15. A face animation generation method, comprising:
acquiring a target face image of a target object;
determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face area in the target face image;
Inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, wherein the face animation generation network is trained according to the training method of the face animation generation network according to any one of claims 1 to 14.
16. The method of claim 15, wherein determining a third hidden variable and a third feature point corresponding to the target face image comprises:
determining the third hidden variable corresponding to the target face image by utilizing a pre-trained inversion inference network;
and determining the third characteristic point corresponding to the target face image by utilizing a pre-trained characteristic point extraction network.
17. The method of claim 15, wherein inputting the third hidden variable and the third feature point into a face animation generation network to obtain a second face animation image output by the face animation generation network, comprises:
inputting the third hidden variable and the third feature point into a variable fusion sub-network in the face animation generation network for fusion to obtain a fourth hidden variable comprising face geometry, texture and semantic information;
Inputting the fourth hidden variable into an affine transformation sub-network in the face animation generation network to carry out affine transformation to obtain a second characteristic variable in a target space, wherein the affine transformation comprises scale transformation and translation transformation;
inputting the second characteristic variable into a conversion coding sub-network in the face animation generation network to perform coding conversion to obtain a second integral characteristic of the target face image;
converting the second integral feature into a second three-dimensional spatial feature of the target face image;
inputting the second three-dimensional space features into a three-dimensional decoding sub-network in the face animation generation network to be decoupled, so as to obtain second voxel information, second color information and fourth semantic information corresponding to the target face image;
and performing volume rendering according to the second voxel information, the second color information and the fourth semantic information to obtain the second face animation image.
18. The method of claim 17, wherein converting the second global feature to a second three-dimensional spatial feature of the target face image comprises:
dividing the second global feature into three planar second sub-features;
And carrying out space feature fusion on the three second sub-features in a summation mode to obtain second three-dimensional space features of the target face image.
19. The method of claim 17, wherein performing volume rendering based on the second voxel information, the second color information, and the fourth semantic information to obtain the second facial animation image comprises:
performing volume rendering according to the second voxel information and the second color information to obtain a second image rendering result;
performing volume rendering according to the second voxel information and the fourth semantic information to obtain a second semantic rendering result;
and generating the second facial animation image according to the second image rendering result and the second semantic rendering result.
20. A training device for a facial animation generating network, comprising:
the first acquisition module is used for acquiring a plurality of training sample images and determining sample label information corresponding to each training sample image;
the first determining module is used for determining a first hidden variable and a first characteristic point corresponding to any training sample image, wherein the first hidden variable is used for representing geometric information and texture information of a face in the training sample image, and the first characteristic point is used for representing first semantic information of a face area in the training sample image;
The processing module is used for inputting the first hidden variable and the first characteristic point into an initial network to obtain a first face animation image, wherein the initial network is used for determining a first integral characteristic of the training sample image according to the first hidden variable and the first characteristic point and performing volume rendering according to the first integral characteristic to obtain the first face animation image;
the construction module is used for constructing a target loss function according to the sample label information corresponding to the first face animation image and the training sample image;
and the training module is used for carrying out iterative training on the initial network by utilizing the training sample images, wherein the network parameters of the initial network are adjusted in a mode of converging the target loss function, and the face animation generation network is obtained.
21. A facial animation generating device, comprising:
the second acquisition module is used for acquiring a target face image of a target object;
the second determining module is used for determining a third hidden variable and a third characteristic point corresponding to the target face image, wherein the third hidden variable is used for representing geometric information and texture information of a face in the target face image, and the third characteristic point is used for representing third semantic information of a face region in the target face image;
The generating module is configured to input the third hidden variable and the third feature point into a face animation generating network to obtain a second face animation image output by the face animation generating network, where the face animation generating network is trained according to the training method of the face animation generating network according to any one of claims 1 to 14.
22. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, wherein the computer program, when executed by a processor, implements the training method of the face animation generation network of any of claims 1 to 14 or the steps of the face animation generation method of any of claims 15 to 17.
23. An electronic device, comprising: a memory and a processor, wherein the memory has stored therein a computer program, the processor being configured to execute the steps of the training method of the face animation generation network of any one of claims 1 to 14 or the face animation generation method of any one of claims 15 to 17 by means of the computer program.
CN202310959636.8A 2023-08-01 2023-08-01 Training method of facial animation generation network, facial animation generation method and device Active CN116704084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310959636.8A CN116704084B (en) 2023-08-01 2023-08-01 Training method of facial animation generation network, facial animation generation method and device

Publications (2)

Publication Number Publication Date
CN116704084A true CN116704084A (en) 2023-09-05
CN116704084B CN116704084B (en) 2023-11-03

Family

ID=87836043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310959636.8A Active CN116704084B (en) 2023-08-01 2023-08-01 Training method of facial animation generation network, facial animation generation method and device

Country Status (1)

Country Link
CN (1) CN116704084B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815928A (en) * 2019-01-31 2019-05-28 中国电子进出口有限公司 A kind of face image synthesis method and apparatus based on confrontation study
CN111080511A (en) * 2019-11-18 2020-04-28 杭州时光坐标影视传媒股份有限公司 End-to-end face exchange method for high-resolution multi-feature extraction
CN113033442A (en) * 2021-03-31 2021-06-25 清华大学 StyleGAN-based high-freedom face driving method and device
CN113628635A (en) * 2021-07-19 2021-11-09 武汉理工大学 Voice-driven speaking face video generation method based on teacher and student network
CN113838176A (en) * 2021-09-16 2021-12-24 网易(杭州)网络有限公司 Model training method, three-dimensional face image generation method and equipment
CN115909430A (en) * 2021-09-29 2023-04-04 腾讯科技(深圳)有限公司 Image processing method, device, storage medium and computer program product
US20230154111A1 (en) * 2021-11-15 2023-05-18 Samsung Electronics Co., Ltd. Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
CN115205487A (en) * 2022-05-10 2022-10-18 清华大学 Monocular camera face reconstruction method and device
CN115223013A (en) * 2022-07-04 2022-10-21 深圳万兴软件有限公司 Model training method, device, equipment and medium based on small data generation network
CN115546011A (en) * 2022-10-08 2022-12-30 厦门美图之家科技有限公司 Image processing method, image processing device, computer equipment and storage medium
CN116385827A (en) * 2023-03-27 2023-07-04 中国科学技术大学 Parameterized face reconstruction model training method and key point tag data generation method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315211A (en) * 2023-11-29 2023-12-29 苏州元脑智能科技有限公司 Digital human synthesis and model training method, device, equipment and storage medium thereof
CN117315211B (en) * 2023-11-29 2024-02-23 苏州元脑智能科技有限公司 Digital human synthesis and model training method, device, equipment and storage medium thereof

Also Published As

Publication number Publication date
CN116704084B (en) 2023-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant