CN117808941A - Face driving method, device and storage medium - Google Patents

Face driving method, device and storage medium

Info

Publication number
CN117808941A
Authority
CN (China)
Prior art keywords
face
driving
image
target
dimensional
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Pending
Application number
CN202311831483.5A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Moore Thread Intelligent Technology Chengdu Co., Ltd.
Original Assignee
Moore Thread Intelligent Technology Chengdu Co., Ltd.
Application filed by Moore Thread Intelligent Technology Chengdu Co., Ltd.
Priority to CN202311831483.5A
Publication of CN117808941A

Landscapes

  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to a face driving method, apparatus, and storage medium. The method includes: acquiring input data based on a target task, where the target task includes at least two digital human driving tasks; and processing the input data with a neural network model based on the target task to determine a target face image, where the target face image presents the expression represented by the driving data. Because the input data is acquired according to the target task and a single neural network model processes it according to that task, one model can support different digital human driving tasks and solve the face driving problem for multiple kinds of digital humans at the same time, with the final image presenting the expression represented by the driving data. This yields higher processing efficiency, stronger model generalization, and better processing results.

Description

Face driving method, device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence (AI) technology, and in particular, to a face driving method, apparatus, and storage medium.
Background
Virtual digital humans are an important application of AI technology. Virtual digital humans include two-dimensional (2D) digital humans and three-dimensional (3D) digital humans and are typically implemented through face driving, in which the face of a virtual digital human is driven by the facial-expression changes indicated in input data so that it makes similar expressions. The expressions produced through face driving can be as varied and engaging as those a real person can make, so this capability can provide users with a brand-new interaction and presentation experience.
A virtual digital human may be 2D or 3D and may be driven with audio or video as the signal source, so that the 2D or 3D digital human displays corresponding facial-expression changes. However, because different digital human driving tasks have different processing requirements and modes, current schemes are implemented for a single task only, which is inefficient.
Disclosure of Invention
In view of this, the present disclosure proposes a face driving method, apparatus, and storage medium.
According to an aspect of the present disclosure, a face driving method is provided. The method comprises the following steps:
Acquiring input data based on a target task, wherein the target task comprises at least two digital human driving tasks;
Processing the input data by using a neural network model based on the target task to determine a target face image, wherein the target face image presents the expression represented by the driving data.
In one possible implementation manner, in a case where the target task is a two-dimensional digital human driving task, the input data includes driving data and a source face image, the target face image is a target two-dimensional face image, the input data is processed by using a neural network model based on the target task, and determining the target face image includes:
determining expression parameters of the face based on the driving data;
extracting features of the source face image to obtain first image features;
and carrying out two-dimensional face rendering based on the expression parameters and the first image characteristics of the face, and determining a target two-dimensional face image.
In one possible implementation, the driving data includes driving images, and determining expression parameters of the face based on the driving data includes:
determining a first expression parameter of a face based on the driving image;
based on the expression parameters and the first image characteristics of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, including:
And performing two-dimensional face rendering based on the first expression parameters and the first image characteristics, and determining a target two-dimensional face image.
In one possible implementation, the driving data includes driving audio, and determining the expression parameters of the face based on the driving data includes:
determining a first audio feature based on the driving audio;
determining a second expression parameter of the face based on the first audio feature;
based on the expression parameters and the first image characteristics of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, including:
and based on the second expression parameters and the first image characteristics, performing two-dimensional face rendering, and determining a target two-dimensional face image.
In one possible implementation, in a case where the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes driving audio, the neural network model processes the input data based on the target task, and determining the target face image includes:
determining a second audio feature based on the driving audio;
determining a third expression parameter of the face based on the second audio feature;
determining a first shape parameter and a first face texture map of a face based on a predetermined three-dimensional face model;
And rendering the three-dimensional face based on the third expression parameter, the first shape parameter and the first face texture map, and determining a target three-dimensional face image.
In one possible implementation, in a case where the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes a driving image, the input data is processed by using a neural network model based on the target task, and determining the target face image includes:
determining a second image feature based on the driving image;
determining a fourth expression parameter of the face based on the second image feature;
determining a second shape parameter and a second face texture map of the face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the fourth expression parameter, the second shape parameter and the second face texture map, and determining a target three-dimensional face image.
In one possible implementation, the neural network model includes one or more of the following modules: an audio feature extractor, an image feature extractor, an audio expression predictor, an image expression predictor, a face shape predictor, a texture predictor, a two-dimensional face renderer, and a three-dimensional face renderer;
The audio feature extractor is used for extracting audio features of the driving data to obtain audio features; the image feature extractor is used for extracting image features of the driving data to obtain image features; the audio expression predictor is used for determining expression parameters of the corresponding face based on the audio characteristics; the image expression predictor is used for determining expression parameters of the corresponding face based on the image characteristics; the face shape predictor is used for determining shape parameters of the corresponding face based on the image characteristics; the texture predictor is used for determining a corresponding face texture map based on the image characteristics; the two-dimensional face renderer is used for determining a two-dimensional face image based on the expression parameters and the image characteristics of the face; the three-dimensional face renderer is used for determining a three-dimensional face image based on the expression parameters of the face, the shape parameters of the face and the face texture map.
In one possible implementation, the neural network model is a trained neural network model, and the method further includes:
determining training data, wherein the training data comprises training videos and training audios corresponding to the training videos;
and training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
In one possible implementation, according to the training video and the training audio, training the initial neural network model based on the two-dimensional digital person-driven task and the three-dimensional digital person-driven task to obtain a trained neural network model, including:
training the initial neural network model based on a two-dimensional digital person driving task according to the training video and the training audio to obtain an intermediate neural network model;
and training the middle neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
In one possible implementation, according to the training video and the training audio, training the initial neural network model based on the two-dimensional digital person-driven task and the three-dimensional digital person-driven task to obtain a trained neural network model, including:
according to the training video and the training audio, a source face image sample and a driving image sample are determined from the training video, and a driving audio sample corresponding to the source face image sample is determined from the training audio;
and training the initial neural network model based on the two-dimensional digital human driving task and the three-dimensional digital human driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain a trained neural network model.
In one possible implementation, according to the source face image sample, the driving image sample, and the driving audio sample, training an initial neural network model based on a two-dimensional digital person driving task and a three-dimensional digital person driving task to obtain a trained neural network model, including:
determining a target face image by using an initial neural network model according to a source face image sample, a driving image sample and a driving audio sample, wherein the target face image comprises a target two-dimensional face image and a target three-dimensional face image;
in the two-dimensional digital human driving task, performing loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target two-dimensional face image and the driving image sample, and in the three-dimensional digital human driving task, performing loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target three-dimensional face image and the source face image sample, to obtain the trained neural network model.
In one possible implementation, the neural network model is used for fine-tuning training among other associated tasks to determine a fine-tuned neural network model.
According to another aspect of the present disclosure, a face driving apparatus is provided. The device comprises:
The acquisition module is used for acquiring input data based on a target task, wherein the target task comprises at least two digital person driving tasks;
the first determining module is used for processing the input data by utilizing the neural network model based on the target task to determine a target face image, and the target face image presents the expression represented by the driving data.
In one possible implementation manner, in a case that the target task is a two-dimensional digital human driving task, the input data includes driving data and a source face image, the target face image is a target two-dimensional face image, and the first determining module is configured to:
determining expression parameters of the face based on the driving data;
extracting features of the source face image to obtain first image features;
and carrying out two-dimensional face rendering based on the expression parameters and the first image characteristics of the face, and determining a target two-dimensional face image.
In one possible implementation, the driving data includes driving images, and determining expression parameters of the face based on the driving data includes:
determining a first expression parameter of a face based on the driving image;
based on the expression parameters and the first image characteristics of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, including:
And performing two-dimensional face rendering based on the first expression parameters and the first image characteristics, and determining a target two-dimensional face image.
In one possible implementation, the driving data includes driving audio, and determining the expression parameters of the face based on the driving data includes:
determining a first audio feature based on the driving audio;
determining a second expression parameter of the face based on the first audio feature;
based on the expression parameters and the first image characteristics of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, including:
and based on the second expression parameters and the first image characteristics, performing two-dimensional face rendering, and determining a target two-dimensional face image.
In one possible implementation manner, in the case that the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes driving audio, and the first determining module is configured to:
determining a second audio feature based on the driving audio;
determining a third expression parameter of the face based on the second audio feature;
determining a first shape parameter and a first face texture map of a face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the third expression parameter, the first shape parameter and the first face texture map, and determining a target three-dimensional face image.
In one possible implementation, in a case where the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes a driving image, the input data is processed by using a neural network model based on the target task, and determining the target face image includes:
determining a second image feature based on the driving image;
determining a fourth expression parameter of the face based on the second image feature;
determining a second shape parameter and a second face texture map of the face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the fourth expression parameter, the second shape parameter and the second face texture map, and determining a target three-dimensional face image.
In one possible implementation, the neural network model includes one or more of the following modules: an audio feature extractor, an image feature extractor, an audio expression predictor, an image expression predictor, a face shape predictor, a texture predictor, a two-dimensional face renderer, and a three-dimensional face renderer;
the audio feature extractor is used for extracting audio features of the driving data to obtain audio features; the image feature extractor is used for extracting image features of the driving data to obtain image features; the audio expression predictor is used for determining expression parameters of the corresponding face based on the audio characteristics; the image expression predictor is used for determining expression parameters of the corresponding face based on the image characteristics; the face shape predictor is used for determining shape parameters of the corresponding face based on the image characteristics; the texture predictor is used for determining a corresponding face texture map based on the image characteristics; the two-dimensional face renderer is used for determining a two-dimensional face image based on the expression parameters and the image characteristics of the face; the three-dimensional face renderer is used for determining a three-dimensional face image based on the expression parameters of the face, the shape parameters of the face and the face texture map.
In one possible implementation, the neural network model is a trained neural network model, and the apparatus further includes:
the second determining module is used for determining training data, wherein the training data comprises training videos and training audios corresponding to the training videos;
and the third determining module is used for training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
In one possible implementation manner, the third determining module is configured to:
training the initial neural network model based on a two-dimensional digital person driving task according to the training video and the training audio to obtain an intermediate neural network model;
and training the middle neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
In one possible implementation manner, the third determining module is configured to:
according to the training video and the training audio, a source face image sample and a driving image sample are determined from the training video, and a driving audio sample corresponding to the source face image sample is determined from the training audio;
And training the initial neural network model based on the two-dimensional digital human driving task and the three-dimensional digital human driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain a trained neural network model.
In one possible implementation, according to the source face image sample, the driving image sample, and the driving audio sample, training an initial neural network model based on a two-dimensional digital person driving task and a three-dimensional digital person driving task to obtain a trained neural network model, including:
determining a target face image by using an initial neural network model according to a source face image sample, a driving image sample and a driving audio sample, wherein the target face image comprises a target two-dimensional face image and a target three-dimensional face image;
in the two-dimensional digital human driving task, performing loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target two-dimensional face image and the driving image sample, and in the three-dimensional digital human driving task, performing loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target three-dimensional face image and the source face image sample, to obtain the trained neural network model.
In one possible implementation, the neural network model is used for fine-tuning training among other associated tasks to determine a fine-tuned neural network model.
According to another aspect of the present disclosure, there is provided a face driving apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
According to the embodiments of the present application, input data is acquired based on a target task that includes at least two digital human driving tasks, and the input data can be processed and the target face image determined by a neural network model according to the target task. The neural network model can thus support the processing of different digital human driving tasks, so that one model simultaneously solves the face driving problem for multiple kinds of digital humans, with the final image presenting the expression represented by the driving data. Processing efficiency is higher, model generalization is stronger, and the processing effect is better.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application.
Fig. 2 shows a flowchart of a face driving method according to an embodiment of the present application.
Fig. 3 shows a schematic structural diagram of a neural network model according to an embodiment of the present application.
Fig. 4 shows a flowchart of a face driving method according to an embodiment of the present application.
Fig. 5 shows a schematic diagram of training data according to an embodiment of the present application.
Fig. 6 shows a structural diagram of a face driving apparatus according to an embodiment of the present application.
Fig. 7 is a block diagram illustrating an apparatus 1900 for face driving according to an example embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Virtual digital humans are an important application of AI technology. Virtual digital humans, including 2D digital humans and 3D digital humans, are typically implemented through face driving, in which the face of a virtual digital human is driven by the facial-expression changes indicated in input data so that it makes similar expressions. The expressions produced through face driving can be as varied and engaging as those a real person can make, so this capability can provide users with a brand-new interaction and presentation experience. A virtual digital human may be 2D or 3D and may be driven with audio or video as the signal source, so that the 2D or 3D digital human displays corresponding facial-expression changes. However, because different digital human driving tasks have different processing requirements and modes, current schemes are implemented for a single task only, which is inefficient.
In view of this, the present application provides a face driving method. In this method, input data can be acquired based on a target task, where the target task includes at least two digital human driving tasks, and the input data can be processed and the target face image determined by a neural network model according to the target task. The neural network model can thus support the processing of different digital human driving tasks, so that one model simultaneously solves the face driving problem for multiple kinds of digital humans, with the final image presenting the expression represented by the driving data. Processing efficiency is higher, model generalization is stronger, and the processing effect is better.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application. As shown in fig. 1, the method of the embodiments of the present application may be applied to scenarios for producing face-driven animation. The method may be applied in a face driving system. The system can acquire different input data from a user according to different digital human tasks and, based on the input data, migrate the expression represented by the driving data (i.e., the expression the user expects the target face image to display) onto a two-dimensional or three-dimensional face image, which is output as the target face image; the expression of the face is thereby driven, and the expression expected by the user is displayed in the target face image. A video can also be generated by combining multiple frames of target face images so as to show corresponding facial-expression changes.
The face driving system can be applied to a terminal device or a server. The terminal device may be any one or more of a mobile phone, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), and a vehicle-mounted device. The embodiments of the present application do not particularly limit the specific type of the terminal device, which may have wired or wireless communication functions.
The server may be located locally or in the cloud, and may be a physical device or a virtual device such as a virtual machine or a container. It may have wireless communication functions, which may be provided on a chip (system) or another part or component of the server and realized by mobile communication technologies such as 2G/3G/4G/5G, Wi-Fi, Bluetooth, frequency modulation (FM), digital radio, and satellite communication. The server may also communicate with other devices via a wired connection.
The face driving method according to the embodiment of the present application is described below with reference to fig. 2 to 5.
Fig. 2 shows a flowchart of a face driving method according to an embodiment of the present application. The face driving method may be used in the face driving system, as shown in fig. 2, and the method may include:
step S201, based on the target task, input data is acquired.
Wherein the target task may comprise at least two digital human driving tasks. For example, the target tasks may include a two-dimensional digital human driving task and a three-dimensional digital human driving task. The goal of the 2D digital human driving task may be to generate a video in which a given two-dimensional face is driven by the driving data; the goal of the 3D digital human driving task may be to generate a video in which a three-dimensional face is driven by the driving data. Face driving may include controlling the facial expressions, actions, and the like in the generated video.
Wherein different input data may be acquired for different target tasks. In the case where the target task is a two-dimensional digital person driving task, the input data may include driving data and a source face image; in the case where the target task is a three-dimensional digital human driven task, the input data may include driving data.
The driving data may be used to predict facial expressions to generate realistic facial images in the video. The source face image may be a face image, or a video frame in a source video. The two-dimensional digital person driving task may include a two-dimensional digital person video driving task and a two-dimensional digital person audio driving task, and the three-dimensional digital person driving task may include a three-dimensional digital person video driving task and a three-dimensional digital person audio driving task. In the video driving task, the driving data may be a driving image, and the driving image may be any video frame in the driving video; in an audio driving task, the driving data may be driving audio.
Step S202, based on the target task, the input data are processed by using the neural network model, and a target face image is determined.
The target face image may present, among other things, an expression that is characteristic of the driving data, such as an expression of the digital person during speaking.
In the two-dimensional digital human driving task, the target face image and the source face image may correspond to the face of the same person. The content of the driving data is not limited as long as it can represent expressions; for example, the driving data may be a video containing a person's expression or audio containing a person's voice. The person in the driving data does not have to be the same person as in the target face image: the expression represented in the driving data is migrated onto the source face image to obtain the target face image.
In a three-dimensional digital person driving task, the shape, texture, etc. of a target face may be determined based on a predetermined three-dimensional face model, which may be determined by capturing relevant data of a real person through a sensor. According to the embodiment, the expression represented in the driving data can be migrated to the preset three-dimensional face model, and the target face image is obtained.
The target task may be any of the digital person driven tasks described above. For example, when the target task is a two-dimensional digital human video driving task or a three-dimensional digital human video driving task, the expression of the generated target face image may be controlled by the driving image; when the target task is a two-dimensional digital human audio driving task or a three-dimensional digital human audio driving task, the expression of the generated target face image can be controlled by the driving audio.
Different target tasks can be supported by the neural network model, and when the target task is a two-dimensional digital human driving task, the target face image can be a target two-dimensional face image; when the target task is a three-dimensional digital human driving task, the target face image may be a target three-dimensional face image.
A face-driven video, such as a digital human speaking video, can be generated from the multiple frames of target face images obtained by processing input data of consecutive or non-consecutive frames. If audio is included in the input data, the face-driven video may be a talking-face video synchronized with the audio.
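For illustration only (code is not part of the original disclosure), a minimal sketch of assembling per-frame outputs into such a video, assuming OpenCV and a hypothetical drive_face inference function:

```python
import cv2  # OpenCV, assumed available

def frames_to_video(frames, path, fps=25):
    """Write a list of HxWx3 uint8 BGR frames to an MP4 file."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()

# Hypothetical usage: drive_face() stands in for the neural network model of
# this disclosure and returns one target face image per driving frame.
# frames = [drive_face(source_img, f) for f in driving_video_frames]
# frames_to_video(frames, "talking_face.mp4", fps=25)
```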
According to the embodiments of the present application, input data is acquired based on a target task that includes at least two digital human driving tasks, and the input data can be processed and the target face image determined by a neural network model according to the target task. The neural network model can thus support the processing of different digital human driving tasks, so that one model simultaneously solves the face driving problem for multiple kinds of digital humans, with the final image presenting the expression represented by the driving data. Processing efficiency is higher, model generalization is stronger, and the processing effect is better.
Fig. 3 shows a schematic structural diagram of the neural network model according to an embodiment of the present application. As shown in fig. 3, the neural network model may include one or more of the following: an audio feature extractor, an image feature extractor, an audio expression predictor, an image expression predictor, a face shape predictor, a texture predictor, a two-dimensional face renderer, and a three-dimensional face renderer.
The audio feature extractor may be configured to extract audio features from the driving data. The audio feature extractor may be implemented based on an audio feature encoder such as a temporal convolutional network (TCN), wav2vec 2.0, a compressor, or a recurrent neural network (RNN).
The image feature extractor may be configured to extract image features from the driving data. The image feature extractor may be implemented based on an image encoder such as ResNet, MobileNet, or ViT (Vision Transformer).
The audio expression predictor may be used to determine expression parameters (expression blendshape coefficients) of the corresponding face based on the audio features. The expression parameters of the face can serve as coefficients of expression bases; an expression base can represent a specific expression shape or action, and the weight or intensity of different expression bases can be controlled and adjusted through the expression parameters. The facial expression can therefore be accurately controlled through the preset expression bases and the generated expression parameters.
The image expression predictor may be configured to determine expression parameters of a corresponding face based on image features.
The face shape predictor may be used to determine shape parameters of the corresponding face based on image features (e.g., the identity blendshape coefficients in a 3DMM model). The shape parameters of a face may be used to represent the shape characteristics of the face.
The texture predictor may be used to determine a corresponding face texture map based on the image features. The texture predictor may be implemented based on DECA (detailed expression capture and animation) or the like.
The two-dimensional face renderer may be used to determine a two-dimensional face image based on the expression parameters of the face and the image features. The two-dimensional face renderer may be implemented based on a face reenactment model, such as the decoder of a face animation generation algorithm, FOMM (First Order Motion Model), PIRender, etc.
The three-dimensional face renderer may be configured to determine a three-dimensional face image based on the expression parameters of the face, the shape parameters of the face, and the face texture map. The three-dimensional face renderer may be implemented based on a differentiable 3D face rendering algorithm such as DECA, HiFi3D, FaceVerse, or MICA (metric face).
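For illustration only (not part of the original disclosure), a minimal PyTorch-style sketch of how these modules and the blendshape idea could fit together; every class, dimension, and method name below is a hypothetical placeholder rather than the patented implementation:

```python
import torch
import torch.nn as nn

def apply_blendshapes(neutral, bases, coeffs):
    """V = V_neutral + sum_i coeffs[i] * bases[i]; coeffs are the expression
    parameters (blendshape coefficients), bases has shape (E, V, 3)."""
    return neutral + torch.einsum("e,evc->vc", coeffs, bases)

class FaceDrivingModel(nn.Module):
    """Toy skeleton: one network, four digital human driving tasks."""

    def __init__(self, feat_dim=256, n_exp=64, n_shape=100, tex_dim=3 * 32 * 32):
        super().__init__()
        # Stand-ins; real versions could be TCN/wav2vec 2.0 (audio) or ResNet/ViT (image).
        self.audio_encoder = nn.Sequential(nn.Linear(80, feat_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.audio_exp_predictor = nn.Linear(feat_dim, n_exp)   # expression coefficients
        self.image_exp_predictor = nn.Linear(feat_dim, n_exp)
        self.shape_predictor = nn.Linear(feat_dim, n_shape)     # identity/shape coefficients
        self.texture_predictor = nn.Linear(feat_dim, tex_dim)   # flattened texture map
        self.renderer_2d = nn.Linear(feat_dim + n_exp, tex_dim)           # FOMM/PIRender stand-in
        self.renderer_3d = nn.Linear(n_exp + n_shape + tex_dim, tex_dim)  # DECA-style stand-in

    def forward(self, task, source_img=None, drive_img=None, drive_audio=None):
        src_feat = self.image_encoder(source_img)
        # Expression from the driving frame if given, otherwise from driving audio.
        exp = (self.image_exp_predictor(self.image_encoder(drive_img))
               if drive_img is not None
               else self.audio_exp_predictor(self.audio_encoder(drive_audio)))
        if task.startswith("2d"):
            return self.renderer_2d(torch.cat([src_feat, exp], dim=-1))
        # 3D tasks: shape and texture predicted from the source image here;
        # at inference they may instead come from a preset 3D face model.
        shape = self.shape_predictor(src_feat)
        texture = self.texture_predictor(src_feat)
        return self.renderer_3d(torch.cat([exp, shape, texture], dim=-1))
```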
The structure of the neural network model shown in fig. 3 can be used for both the training phase and the inference phase.
The neural network model may be a trained neural network model; the process of training it is described below with reference to fig. 4. After the initial neural network model is trained, the resulting trained neural network model can be used for inference on the input data of fig. 2 to obtain the target face image.
Referring to fig. 4, a flow chart of a face driving method according to an embodiment of the present application is shown. As shown in fig. 4, the method further includes:
step S401, determining training data.
The training data may be unlabeled, and may include a training video and training audio corresponding to the training video. For example, the training video and the corresponding training audio may be taken from the same piece of audio-video-synchronized talking-face footage, with the training audio being the audio clip at the timestamps of the training video. The training video contains a face, which may be a real face or a cartoon face, and the original audio-video data may be acquired in any manner.
The raw audio-video data may be preprocessed to obtain the training data. For example, face detection may be performed on each video frame by a face detection algorithm (YOLO, RetinaFace, etc.), and the head region and its vicinity cropped out based on the detection result. Fig. 5 shows a schematic diagram of training data according to an embodiment of the present application; the cropped image shown in fig. 5 is a video frame of the training video.
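A minimal sketch of this preprocessing, using OpenCV's bundled Haar cascade as a toy stand-in for a YOLO/RetinaFace-style detector (the helper names are assumptions):

```python
import cv2

def detect_face(frame):
    """Hypothetical stand-in for a YOLO/RetinaFace-style detector.
    Returns one (x, y, w, h) box using OpenCV's Haar cascade as a toy example."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = cascade.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    return boxes[0] if len(boxes) else None

def crop_head_region(frame, margin=0.4, size=256):
    """Crop the head and its vicinity around the detected face, then resize."""
    box = detect_face(frame)
    if box is None:
        return None
    x, y, w, h = box
    m = int(max(w, h) * margin)
    crop = frame[max(0, y - m): y + h + m, max(0, x - m): x + w + m]
    return cv2.resize(crop, (size, size))
```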
Step S402, training an initial neural network model based on a two-dimensional digital person driving task and a three-dimensional digital person driving task according to training videos and training audios to obtain a trained neural network model.
The initial neural network model can be trained on the two-dimensional digital human driving task and the three-dimensional digital human driving task at the same time, or training can be performed for the different tasks separately.
Optionally, the step S402 may include:
according to the training video and the training audio, a source face image sample and a driving image sample are determined from the training video, and a driving audio sample corresponding to the source face image sample is determined from the training audio;
and training the initial neural network model based on the two-dimensional digital human driving task and the three-dimensional digital human driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain a trained neural network model.
The source face image sample and the driving image sample may be different video frames in the training video, and the driving audio sample may be an audio clip of a timestamp corresponding to the driving image sample. The source face image sample can be used for determining the shape of a face in a target face image in a two-dimensional digital face driving task, and the driving image sample and the driving audio sample can be used for determining the expression of the face in the target face image. Wherein the person involved in the driving image sample and the source face image sample is the same person.
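A sketch of drawing one training triplet from a preprocessed clip, under the stated assumption that source and driving frames show the same person; all names are illustrative:

```python
import random

def sample_training_pair(frames, audio, fps=25, sample_rate=16000, win=0.2):
    """Pick two distinct frames of one clip (source and driving),
    plus the audio clip at the driving frame's timestamp."""
    src_idx, drv_idx = random.sample(range(len(frames)), 2)
    t = drv_idx / fps  # timestamp of the driving frame in seconds
    half = int(win * sample_rate / 2)
    center = int(t * sample_rate)
    drive_audio = audio[max(0, center - half): center + half]
    return frames[src_idx], frames[drv_idx], drive_audio
```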
Training tasks may include two-dimensional digital person video driven tasks, two-dimensional digital person audio driven tasks, three-dimensional digital person video driven tasks, and three-dimensional digital person audio driven tasks.
Taking training the 4 tasks as an example, training an initial neural network model based on a two-dimensional digital person driving task and a three-dimensional digital person driving task according to a source face image sample, a driving image sample and a driving audio sample to obtain a trained neural network model, which can comprise:
and determining a target face image by utilizing the initial neural network model according to the source face image sample, the driving image sample and the driving audio sample, wherein the target face image comprises a target two-dimensional face image and a target three-dimensional face image.
For the two-dimensional digital human video driving task, the source face image sample may be input into the image feature extractor to obtain image features of the source face image sample; the driving image sample may be input into the image feature extractor to obtain image features of the driving image sample, which are then input into the image expression predictor to obtain expression parameters of the face; and the expression parameters of the face together with the image features of the source face image sample may be input into the two-dimensional face renderer to obtain the target two-dimensional face image.
For the two-dimensional digital human audio driving task, the source face image sample may be input into the image feature extractor to obtain image features of the source face image sample; the driving audio sample may be input into the audio feature extractor to obtain audio features of the driving audio sample, which are then input into the audio expression predictor to obtain expression parameters of the face; and the expression parameters of the face together with the image features of the source face image sample may be input into the two-dimensional face renderer to obtain the target two-dimensional face image.
For the three-dimensional digital human video driving task, the source face image sample may be input into the image feature extractor to obtain image features of the source face image sample; those image features may be input into the image expression predictor to obtain expression parameters of the face, into the face shape predictor to obtain shape parameters of the face, and into the texture predictor to obtain a face texture map; and the expression parameters of the face, the shape parameters of the face, and the face texture map may be input into the three-dimensional face renderer to obtain the target three-dimensional face image.
For the three-dimensional digital human audio driving task, the source face image sample may be input into the image feature extractor to obtain image features of the source face image sample; those image features may be input into the face shape predictor to obtain shape parameters of the face and into the texture predictor to obtain a face texture map; the driving audio sample may be input into the audio feature extractor to obtain audio features of the driving audio sample, which are then input into the audio expression predictor to obtain expression parameters of the face; and the expression parameters of the face, the shape parameters of the face, and the face texture map may be input into the three-dimensional face renderer to obtain the target three-dimensional face image.
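Continuing the hypothetical FaceDrivingModel sketch from earlier, the four training-time forward passes described above could look like this (illustrative only; src, drv, and aud are a source frame, a driving frame, and a driving audio-feature tensor from one clip):

```python
# model = FaceDrivingModel() from the earlier sketch.
# 2D tasks: expression comes from the driving frame or the driving audio.
pred_2d_video = model("2d_video", source_img=src, drive_img=drv)
pred_2d_audio = model("2d_audio", source_img=src, drive_audio=aud)

# 3D tasks are self-supervised during training: per the description above, the
# video variant predicts expression from the source frame itself, so the source
# frame is passed in the driving-image slot here.
pred_3d_video = model("3d_video", source_img=src, drive_img=src)
pred_3d_audio = model("3d_audio", source_img=src, drive_audio=aud)
```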
In the training process for the three-dimensional digital human driving tasks, label-free self-supervised training can be adopted, and the shape parameters of the face and the face texture map can be determined from the source face image sample during the training stage, which makes training convenient. Because the three-dimensional face renderer can be realized based on the units technology, in the subsequent inference stage the renderer can directly generate a target three-dimensional face image based on the input data and a predetermined three-dimensional face model.
In the two-dimensional digital human driving tasks, loss optimization of the relevant parameters of the initial neural network model can be performed based on the L1 distance between the target two-dimensional face image and the driving image sample; in the three-dimensional digital human driving tasks, loss optimization of the relevant parameters is performed based on the L1 distance between the target three-dimensional face image and the source face image sample, so as to obtain the trained neural network model.
The L1 distance may be a pixel distance between images.
In the training process for the two-dimensional digital human driving task, the obtained target two-dimensional face image can be understood as the face image obtained after the expression represented by the driving image sample is migrated onto the source face image sample. Because the driving image sample and the source face image sample involve the same person, the target two-dimensional face image can be understood as the driving image sample regenerated by the model, and loss optimization can therefore be performed based on the L1 distance between the target two-dimensional face image and the driving image sample.
In the two-dimensional digital human video driving task, parameters of modules such as the image feature extractor, the image expression predictor, and the two-dimensional face renderer in the neural network model can be loss-optimized based on the L1 distance between the target two-dimensional face image and the driving image sample.
In the two-dimensional digital human audio driving task, parameters of modules such as the image feature extractor, the audio expression predictor, and the two-dimensional face renderer in the neural network model can be loss-optimized based on the L1 distance between the target two-dimensional face image and the driving image sample.
In the three-dimensional digital human video driving task, an adversarial loss (GAN loss) can be computed based on the L1 distance between the target three-dimensional face image and the source face image sample, and parameters of modules such as the image feature extractor, the image expression predictor, the face shape predictor, the texture predictor, and the three-dimensional face renderer in the neural network model can be loss-optimized by minimizing the GAN loss.
In the three-dimensional digital human audio driving task, an adversarial loss (GAN loss) can likewise be computed based on the L1 distance between the target three-dimensional face image and the source face image sample, and parameters of modules such as the image feature extractor, the audio expression predictor, the face shape predictor, the texture predictor, and the three-dimensional face renderer can be loss-optimized by minimizing the GAN loss.
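A sketch of the loss terms as described, assuming all predictions and targets are tensors of identical shape; the equal task weighting is an assumption, and the adversarial (GAN) loss of the 3D tasks is simplified here to a plain L1 term:

```python
import torch.nn.functional as F

def training_loss(pred_2d_video, pred_2d_audio, pred_3d_video, pred_3d_audio,
                  drive_img, source_img):
    # 2D tasks: L1 pixel distance to the driving image sample.
    loss_2d = F.l1_loss(pred_2d_video, drive_img) + F.l1_loss(pred_2d_audio, drive_img)
    # 3D tasks: based on the L1 distance to the source face image sample
    # (the disclosure builds a GAN loss on this distance; omitted here).
    loss_3d = F.l1_loss(pred_3d_video, source_img) + F.l1_loss(pred_3d_audio, source_img)
    return loss_2d + loss_3d  # equal weighting is an assumption
```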
In this way, through multi-task learning, multi-modal learning over audio and video can be realized in the training process of the neural network model, with the two modalities reinforcing each other, so that each module of the neural network model learns under the influence of multiple tasks and multiple modalities. The inference capability of each module is therefore stronger, and a neural network model with better performance is obtained.
During training, multi-task learning can be performed on the above 4 tasks simultaneously. Optionally, to obtain a better training effect, training may first be performed based on the two-dimensional digital human driving tasks, so that the image feature extractor, the audio feature extractor, the image expression predictor, the audio expression predictor, and the two-dimensional face renderer in the initial neural network model acquire good capability, after which multi-task learning is performed on the 4 tasks together. In this example, step S402 may include:
training the initial neural network model based on a two-dimensional digital person driving task according to the training video and the training audio to obtain an intermediate neural network model; and training the middle neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
That is, based on the two-dimensional digital human driving tasks and in the manner described above, the image feature extractor, the audio feature extractor, the image expression predictor, the audio expression predictor, and the two-dimensional face renderer in the initial neural network model can be trained to obtain the intermediate neural network model; multi-task training is then performed based on the two-dimensional and three-dimensional digital human driving tasks, training all modules of the intermediate neural network model to obtain the trained neural network model.
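The two-stage schedule could then be sketched as follows; the optimizer choice, the loaders, and the batch fields are all assumptions (training_loss and F come from the previous sketch):

```python
import torch

model = FaceDrivingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: 2D driving tasks only -> intermediate neural network model.
for batch in stage1_loader:  # hypothetical loader yielding (src, drv, aud) batches
    pred_v = model("2d_video", source_img=batch.src, drive_img=batch.drv)
    pred_a = model("2d_audio", source_img=batch.src, drive_audio=batch.aud)
    loss = F.l1_loss(pred_v, batch.drv) + F.l1_loss(pred_a, batch.drv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: joint multi-task training on all four tasks -> trained model.
for batch in stage2_loader:
    loss = training_loss(
        model("2d_video", source_img=batch.src, drive_img=batch.drv),
        model("2d_audio", source_img=batch.src, drive_audio=batch.aud),
        model("3d_video", source_img=batch.src, drive_img=batch.src),
        model("3d_audio", source_img=batch.src, drive_audio=batch.aud),
        batch.drv, batch.src)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```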
Optionally, the neural network model may be used for fine-tuning training on other associated tasks to determine a fine-tuned neural network model. For example, one or several modules of the trained neural network model can be fine-tuned based on other associated tasks (any computer vision or audio task); the fine-tuned modules can then be used independently to execute those associated tasks during inference, giving the neural network model better generalization capability.
After the trained neural network model is obtained, referring back to fig. 2, the trained neural network model can be used during the inference phase to perform the corresponding digital human driving task.
In the case that the target task is a two-dimensional digital human driving task, the input data may include driving data and a source face image, and the target face image is a target two-dimensional face image, and the step S202 may include:
determining expression parameters of the face based on the driving data;
extracting features of the source face image to obtain first image features;
and carrying out two-dimensional face rendering based on the expression parameters and the first image characteristics of the face, and determining a target two-dimensional face image.
The image feature extractor may be used to extract features of the source face image, so as to obtain a first image feature.
In the case that the two-dimensional digital person driving task is a two-dimensional digital person video driving task, the driving data may include a driving image, the driving image may be any video frame in the driving video, and determining the expression parameter of the face based on the driving data may include:
determining a first expression parameter of a face based on the driving image;
based on the expression parameters and the first image features of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, which may include:
and performing two-dimensional face rendering based on the first expression parameters and the first image characteristics, and determining a target two-dimensional face image.
That is, the image feature extractor can be used to extract features of the driving image to obtain image features of the driving image, and the image expression predictor processes those image features to obtain the first expression parameters of the face. The two-dimensional face renderer can then perform two-dimensional face rendering based on the first expression parameters and the first image feature to determine the target two-dimensional face image.
In the case that the two-dimensional digital person driving task is a two-dimensional digital person audio driving task, the driving data may include driving audio, and determining expression parameters of the face based on the driving data includes:
determining a first audio feature based on the driving audio;
determining a second expression parameter of the face based on the first audio feature;
based on the expression parameters and the first image characteristics of the face, performing two-dimensional face rendering to determine a target two-dimensional face image, including:
and based on the second expression parameters and the first image characteristics, performing two-dimensional face rendering, and determining a target two-dimensional face image.
That is, the audio feature extractor may be used to extract features from the driving audio to obtain the first audio feature, and the audio expression predictor processes the first audio feature to obtain the second expression parameters of the face. The two-dimensional face renderer can then perform two-dimensional face rendering based on the expression parameters of the face and the first image feature to determine the target two-dimensional face image.
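For illustration only, the two 2D inference paths just described can be wrapped in one hypothetical helper over the earlier FaceDrivingModel sketch:

```python
def drive_2d(model, source_img, drive_img=None, drive_audio=None):
    """Video-driven if a driving frame is given, otherwise audio-driven."""
    if drive_img is not None:
        return model("2d_video", source_img=source_img, drive_img=drive_img)
    return model("2d_audio", source_img=source_img, drive_audio=drive_audio)

# e.g. one target face image per driving-video frame:
# out_frames = [drive_2d(model, source_img, drive_img=f) for f in driving_frames]
```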
In the case that the target task is a three-dimensional digital human audio driving task, the target face image may be a target three-dimensional face image, and the input data may include driving audio, and the step S202 may include:
determining a second audio feature based on the driving audio;
determining a third expression parameter of the face based on the second audio feature;
determining a first shape parameter and a first face texture map of a face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the third expression parameter, the first shape parameter and the first face texture map, and determining a target three-dimensional face image.
That is, the audio feature extractor can be used to extract features from the driving audio to obtain the second audio feature, and the audio expression predictor processes the second audio feature to obtain the third expression parameters of the face. The first shape parameter and the first face texture map of the face can be obtained from the relevant parameters of the predetermined three-dimensional face model, and the three-dimensional face renderer can perform three-dimensional face rendering based on the third expression parameters combined with the first shape parameter and the first face texture map to determine the target three-dimensional face image. The three-dimensional face model may be predetermined by capturing relevant data of a real person with sensors.
Optionally, in the case that the target task is a three-dimensional digital human audio driving task, the input data may further include a source face image, and the first shape parameter and the first face texture map of the face may be obtained by the neural network model processing the source face image. For example, the first shape parameter may be determined by the face shape predictor based on image features of the source face image, and the first face texture map may be determined by the texture predictor based on the same image features.
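A hedged sketch of this audio-driven three-dimensional path, covering both the predetermined-model route and the optional source-image route; the dictionary keys and function name are assumptions for illustration:

    def drive_3d_with_audio(modules, driving_audio, source_image=None,
                            preset_shape=None, preset_texture=None):
        # Second audio features and the third expression parameter from the audio.
        audio_features = modules["audio_feature_extractor"](driving_audio)
        expression = modules["audio_expression_predictor"](audio_features)
        if source_image is not None:
            # Optional route: predict shape and texture from a source face image.
            image_features = modules["image_feature_extractor"](source_image)
            shape = modules["face_shape_predictor"](image_features)
            texture = modules["texture_predictor"](image_features)
        else:
            # Default route: take shape and texture from the predetermined
            # three-dimensional face model.
            shape, texture = preset_shape, preset_texture
        # Three-dimensional rendering combines expression, shape, and texture map.
        return modules["renderer_3d"](expression, shape, texture)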
In the case that the target task is a three-dimensional digital human video driving task, the target face image may be a target three-dimensional face image, and the input data may include a driving image, and the step S202 may include:
determining a second image feature based on the driving image;
determining a fourth expression parameter of the face based on the second image feature;
determining a second shape parameter and a second face texture map of the face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the fourth expression parameter, the second shape parameter and the second face texture map, and determining a target three-dimensional face image.
An image feature extractor may be used to extract features from the driving image to obtain the second image features, and an image expression predictor may process the second image features to obtain the fourth expression parameter of the face. The second shape parameter and the second face texture map of the face are obtained from the parameters of the predetermined three-dimensional face model, and three-dimensional face rendering is performed based on the fourth expression parameter combined with the second shape parameter and the second face texture map to determine the target three-dimensional face image.
Optionally, when the target task is a three-dimensional digital human video driving task, the input data may further include a source face image, and the second shape parameter and the second face texture map of the face may be obtained by the neural network model processing the source face image. For example, the second shape parameter may be determined by the face shape predictor based on image features of the source face image, and the second face texture map may be determined by the texture predictor based on the same image features.
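The video-driven three-dimensional variant swaps only the driving modality; under the same assumptions as the audio sketch above:

    def drive_3d_with_video(modules, driving_image, source_image=None,
                            preset_shape=None, preset_texture=None):
        # Second image features and the fourth expression parameter from the image.
        image_features = modules["image_feature_extractor"](driving_image)
        expression = modules["image_expression_predictor"](image_features)
        # Shape and texture are resolved exactly as in the audio-driven case.
        if source_image is not None:
            source_features = modules["image_feature_extractor"](source_image)
            shape = modules["face_shape_predictor"](source_features)
            texture = modules["texture_predictor"](source_features)
        else:
            shape, texture = preset_shape, preset_texture
        return modules["renderer_3d"](expression, shape, texture)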
All data used in this application has been licensed or otherwise authorized.
Fig. 6 shows a structural diagram of a face driving apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
the acquiring module 601 is configured to acquire input data based on a target task, where the target task includes at least two digital person driving tasks;
the first determining module 602 is configured to process the input data by using a neural network model based on the target task and determine a target face image, where the target face image presents the expression represented by the driving data.
In a possible implementation manner, in a case where the target task is a two-dimensional digital human driving task, the input data includes driving data and a source face image, the target face image is a target two-dimensional face image, and the first determining module 602 is configured to:
determining expression parameters of the face based on the driving data;
extracting features of the source face image to obtain first image features;
and performing two-dimensional face rendering based on the expression parameters of the face and the first image features to determine a target two-dimensional face image.
In one possible implementation, the driving data includes driving images, and determining expression parameters of the face based on the driving data includes:
determining a first expression parameter of a face based on the driving image;
the performing two-dimensional face rendering based on the expression parameters of the face and the first image features to determine a target two-dimensional face image includes:
and performing two-dimensional face rendering based on the first expression parameters and the first image features to determine a target two-dimensional face image.
In one possible implementation, the driving data includes driving audio, and determining the expression parameters of the face based on the driving data includes:
determining a first audio feature based on the driving audio;
determining a second expression parameter of the face based on the first audio feature;
the performing two-dimensional face rendering based on the expression parameters of the face and the first image features to determine a target two-dimensional face image includes:
and performing two-dimensional face rendering based on the second expression parameters and the first image features to determine a target two-dimensional face image.
In a possible implementation manner, in a case that the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes driving audio, and the first determining module 602 is configured to:
determining a second audio feature based on the driving audio;
determining a third expression parameter of the face based on the second audio feature;
determining a first shape parameter and a first face texture map of a face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the third expression parameter, the first shape parameter and the first face texture map, and determining a target three-dimensional face image.
In one possible implementation, in a case where the target task is a three-dimensional digital human driving task, the target face image is a target three-dimensional face image, the input data includes a driving image, and the first determining module 602 is configured to:
determining a second image feature based on the driving image;
determining a fourth expression parameter of the face based on the second image feature;
determining a second shape parameter and a second face texture map of the face based on a predetermined three-dimensional face model;
and rendering the three-dimensional face based on the fourth expression parameter, the second shape parameter and the second face texture map, and determining a target three-dimensional face image.
In one possible implementation, the neural network model includes one or more of the following modules: an audio feature extractor, an image feature extractor, an audio expression predictor, an image expression predictor, a face shape predictor, a texture predictor, a two-dimensional face renderer, and a three-dimensional face renderer;
the audio feature extractor is used for extracting audio features of the driving data to obtain audio features; the image feature extractor is used for extracting image features of the driving data to obtain image features; the audio expression predictor is used for determining expression parameters of the corresponding face based on the audio characteristics; the image expression predictor is used for determining expression parameters of the corresponding face based on the image characteristics; the face shape predictor is used for determining shape parameters of the corresponding face based on the image characteristics; the texture predictor is used for determining a corresponding face texture map based on the image characteristics; the two-dimensional face renderer is used for determining a two-dimensional face image based on the expression parameters and the image characteristics of the face; the three-dimensional face renderer is used for determining a three-dimensional face image based on the expression parameters of the face, the shape parameters of the face and the face texture map.
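One way to read this module list is as a single multi-task model that routes its input through a task-dependent subset of the eight modules. The container below is a speculative sketch: the task strings, keyword arguments, and routing logic are assumptions, not the embodiment:

    import torch.nn as nn

    class UnifiedFaceDrivingModel(nn.Module):
        """Hypothetical single model covering 2D/3D, audio/video driving."""

        def __init__(self, **modules):
            # Expected keys (an assumption): audio_feature_extractor,
            # image_feature_extractor, audio_expression_predictor,
            # image_expression_predictor, face_shape_predictor,
            # texture_predictor, renderer_2d, renderer_3d.
            super().__init__()
            self.m = nn.ModuleDict(modules)

        def _expression(self, driving_audio, driving_image):
            # Expression parameters come from whichever modality is supplied.
            if driving_audio is not None:
                feats = self.m["audio_feature_extractor"](driving_audio)
                return self.m["audio_expression_predictor"](feats)
            feats = self.m["image_feature_extractor"](driving_image)
            return self.m["image_expression_predictor"](feats)

        def forward(self, task, source_image=None, driving_audio=None,
                    driving_image=None, preset_shape=None, preset_texture=None):
            expression = self._expression(driving_audio, driving_image)
            if task.startswith("2d"):
                source_features = self.m["image_feature_extractor"](source_image)
                return self.m["renderer_2d"](source_features, expression)
            # 3D tasks: shape/texture from a source image if given, else presets.
            if source_image is not None:
                feats = self.m["image_feature_extractor"](source_image)
                shape = self.m["face_shape_predictor"](feats)
                texture = self.m["texture_predictor"](feats)
            else:
                shape, texture = preset_shape, preset_texture
            return self.m["renderer_3d"](expression, shape, texture)

Keeping every task inside one module set is what allows a single parameter set to serve all four driving problems, matching the stated goal of handling multiple digital-person face driving tasks with one model.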
In one possible implementation, the neural network model is a trained neural network model, and the apparatus further includes:
the second determining module is used for determining training data, wherein the training data comprises training videos and training audios corresponding to the training videos;
and the third determining module is used for training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
In one possible implementation manner, the third determining module is configured to:
training the initial neural network model based on a two-dimensional digital person driving task according to the training video and the training audio to obtain an intermediate neural network model;
and training the intermediate neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
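This two-stage schedule can be written as a plain curriculum loop. The optimizer choice, learning rate, epoch counts, and the loss_fn callback below are illustrative assumptions:

    import torch

    def train_two_stage(model, loader_2d, loader_joint, loss_fn,
                        epochs_2d=10, epochs_joint=10, lr=1e-4):
        # loss_fn(model, batch) -> scalar loss for that batch's task(s).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        # Stage 1: two-dimensional driving task only -> intermediate model.
        for _ in range(epochs_2d):
            for batch in loader_2d:
                loss = loss_fn(model, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # Stage 2: mixed 2D + 3D task batches -> trained model.
        for _ in range(epochs_joint):
            for batch in loader_joint:
                loss = loss_fn(model, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model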
In one possible implementation manner, the third determining module is configured to:
according to the training video and the training audio, a source face image sample and a driving image sample are determined from the training video, and a driving audio sample corresponding to the source face image sample is determined from the training audio;
and training the initial neural network model based on the two-dimensional digital human driving task and the three-dimensional digital human driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain a trained neural network model.
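Sample construction from a talking-head training video and its soundtrack might look as follows; the frame/audio alignment scheme is an assumption for illustration:

    import random

    def sample_training_triplet(video_frames, audio_windows):
        # Assumes audio_windows[i] is the audio segment aligned with video_frames[i].
        source_idx = random.randrange(len(video_frames))
        driving_idx = random.randrange(len(video_frames))
        source_face_image = video_frames[source_idx]
        driving_image = video_frames[driving_idx]
        # Per the description, the driving audio sample corresponds to the
        # source face image sample.
        driving_audio = audio_windows[source_idx]
        return source_face_image, driving_image, driving_audio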
In one possible implementation, according to the source face image sample, the driving image sample, and the driving audio sample, training an initial neural network model based on a two-dimensional digital person driving task and a three-dimensional digital person driving task to obtain a trained neural network model, including:
determining a target face image by using an initial neural network model according to a source face image sample, a driving image sample and a driving audio sample, wherein the target face image comprises a target two-dimensional face image and a target three-dimensional face image;
in the two-dimensional digital human driving task, loss optimization is performed on the relevant parameters of the initial neural network model based on the L1 distance between the target two-dimensional face image and the driving image sample; in the three-dimensional digital human driving task, loss optimization is performed on the relevant parameters of the initial neural network model based on the L1 distance between the target three-dimensional face image and the source face image sample; the trained neural network model is thereby obtained.
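Both objectives are plain L1 reconstruction losses. A minimal sketch, reusing the unified-model call convention assumed earlier and assuming the renderers emit image-shaped tensors (torch.nn.functional.l1_loss computes the mean absolute error):

    import torch.nn.functional as F

    def compute_losses(model, source_face_image, driving_image, driving_audio):
        # 2D task: the rendered two-dimensional face should reproduce the driving image.
        pred_2d = model("2d_video", source_image=source_face_image,
                        driving_image=driving_image)
        loss_2d = F.l1_loss(pred_2d, driving_image)
        # 3D task: the rendered three-dimensional face should reproduce the source image.
        pred_3d = model("3d_audio", source_image=source_face_image,
                        driving_audio=driving_audio)
        loss_3d = F.l1_loss(pred_3d, source_face_image)
        return loss_2d + loss_3d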
In one possible implementation, the neural network model may be used for fine-tuning training on other associated tasks to obtain a fine-tuned neural network model.
According to the embodiment of the application, the input data are acquired based on the target task, wherein the target task comprises at least two digital person driving tasks, the input data can be processed and determined based on the target task by utilizing the neural network model, so that the neural network model can support the processing of different digital person driving tasks, the face driving problem of a plurality of digital persons can be solved by utilizing one model at the same time, the final image presents the expression represented by the driving data, the processing efficiency is higher, the generalization of the model is strong, and the processing effect is better.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
Fig. 7 is a block diagram illustrating an apparatus 1900 for face driving according to an example embodiment. For example, the apparatus 1900 may be provided as a server or terminal device. Referring to fig. 7, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A face driving method, the method comprising:
acquiring input data based on a target task, wherein the target task comprises at least two digital person driving tasks;
and processing the input data by utilizing a neural network model based on the target task to determine a target face image, wherein the target face image presents the expression represented by the driving data.
2. The method of claim 1, wherein, in the case where the target task is a two-dimensional digital human driving task, the input data includes driving data and a source face image, the target face image is a target two-dimensional face image, the processing the input data with a neural network model based on the target task to determine a target face image includes:
determining expression parameters of the face based on the driving data;
extracting features of the source face image to obtain first image features;
and performing two-dimensional face rendering based on the expression parameters of the face and the first image features, and determining a target two-dimensional face image.
3. The method of claim 2, wherein the driving data comprises a driving image, and wherein the determining expression parameters of the face based on the driving data comprises:
determining a first expression parameter of a face based on the driving image;
the performing two-dimensional face rendering based on the expression parameters of the face and the first image features, and determining a target two-dimensional face image includes:
and performing two-dimensional face rendering based on the first expression parameters and the first image features, and determining a target two-dimensional face image.
4. The method of claim 2, wherein the driving data comprises driving audio, and wherein the determining expression parameters of the face based on the driving data comprises:
determining a first audio feature based on the driving audio;
determining a second expression parameter of the face based on the first audio feature;
the performing two-dimensional face rendering based on the expression parameters of the face and the first image features, and determining a target two-dimensional face image includes:
and performing two-dimensional face rendering based on the second expression parameters and the first image features, and determining a target two-dimensional face image.
5. The method of claim 1, wherein in the case where the target task is a three-dimensional digital human driven task, the target face image is a target three-dimensional face image, the input data includes driving audio, the processing the input data with a neural network model based on the target task to determine a target face image includes:
determining a second audio feature based on the driving audio;
determining a third expression parameter of the face based on the second audio feature;
determining a first shape parameter and a first face texture map of a face based on a predetermined three-dimensional face model;
and performing three-dimensional face rendering based on the third expression parameter, the first shape parameter and the first face texture map, and determining a target three-dimensional face image.
6. The method of claim 1, wherein in the case where the target task is a three-dimensional digital human driven task, the target face image is a target three-dimensional face image, the input data includes a driving image, the processing the input data with a neural network model based on the target task to determine a target face image includes:
determining a second image feature based on the driving image;
determining a fourth expression parameter of the face based on the second image feature;
determining a second shape parameter and a second face texture map of the face based on a predetermined three-dimensional face model;
and performing three-dimensional face rendering based on the fourth expression parameter, the second shape parameter and the second face texture map, and determining a target three-dimensional face image.
7. The method of claim 1, wherein the neural network model comprises one or more of the following: an audio feature extractor, an image feature extractor, an audio expression predictor, an image expression predictor, a face shape predictor, a texture predictor, a two-dimensional face renderer, and a three-dimensional face renderer;
the audio feature extractor is used for extracting audio features of the driving data to obtain audio features; the image feature extractor is used for extracting image features of the driving data to obtain image features; the audio expression predictor is used for determining expression parameters of the corresponding face based on the audio characteristics; the image expression predictor is used for determining expression parameters of the corresponding face based on the image characteristics; the face shape predictor is used for determining shape parameters of the corresponding face based on the image characteristics; the texture predictor is used for determining a corresponding face texture map based on the image characteristics; the two-dimensional face renderer is used for determining a two-dimensional face image based on the expression parameters and the image characteristics of the face; the three-dimensional face renderer is used for determining a three-dimensional face image based on the expression parameters of the face, the shape parameters of the face and the face texture map.
8. The method of claim 1, wherein the neural network model is a trained neural network model, the method further comprising:
determining training data, wherein the training data comprises training videos and training audios corresponding to the training videos;
and training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
9. The method of claim 8, wherein the training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model comprises:
training an initial neural network model based on the two-dimensional digital human driving task according to the training video and the training audio to obtain an intermediate neural network model;
and training the intermediate neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model.
10. The method of claim 8, wherein the training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the training video and the training audio to obtain a trained neural network model comprises:
according to the training video and the training audio, a source face image sample and a driving image sample are determined from the training video, and a driving audio sample corresponding to the source face image sample is determined from the training audio;
and training an initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain the trained neural network model.
11. The method of claim 10, wherein the training the initial neural network model based on the two-dimensional digital person driving task and the three-dimensional digital person driving task according to the source face image sample, the driving image sample and the driving audio sample to obtain the trained neural network model comprises:
determining a target face image by using the initial neural network model according to the source face image sample, the driving image sample and the driving audio sample, wherein the target face image comprises a target two-dimensional face image and a target three-dimensional face image;
and in the two-dimensional digital human driving task, carrying out loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target two-dimensional human face image and the driving image sample, and carrying out loss optimization on relevant parameters of the initial neural network model based on the L1 distance between the target three-dimensional human face image and the source human face image sample in the three-dimensional digital human driving task to obtain the trained neural network model.
12. The method of claim 1, wherein the neural network model is used for fine-tuning training on other associated tasks to determine a fine-tuned neural network model.
13. A face driving apparatus, the apparatus comprising:
the acquisition module is used for acquiring input data based on a target task, wherein the target task comprises at least two digital person driving tasks;
the first determining module is used for processing the input data by utilizing a neural network model based on the target task to determine a target face image, and the target face image presents the expression represented by the driving data.
14. A face driving apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 12 when executing the instructions stored by the memory.
15. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 12.
CN202311831483.5A 2023-12-27 2023-12-27 Face driving method, device and storage medium Pending CN117808941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311831483.5A CN117808941A (en) 2023-12-27 2023-12-27 Face driving method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311831483.5A CN117808941A (en) 2023-12-27 2023-12-27 Face driving method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117808941A true CN117808941A (en) 2024-04-02

Family

ID=90419450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311831483.5A Pending CN117808941A (en) 2023-12-27 2023-12-27 Face driving method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117808941A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination