CN115409923A - Method, device and system for generating three-dimensional virtual image facial animation - Google Patents

Method, device and system for generating three-dimensional virtual image facial animation

Info

Publication number
CN115409923A
Authority
CN
China
Prior art keywords
face
processed
data
text
animation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211203194.6A
Other languages
Chinese (zh)
Inventor
朱灵杰
左琪
邱俊杰
王方
刘筱力
潘攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211203194.6A
Publication of CN115409923A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method, a device and a system for generating three-dimensional virtual image facial animation. The method comprises the following steps: acquiring a currently input text to be processed, wherein the text to be processed is used to drive a change in an initial facial animation of a three-dimensional avatar; performing text-to-speech processing on the text to be processed to obtain speech to be processed; performing speech-driven animation processing on the speech to be processed to obtain first face-pinching data; extracting expression keywords from the text to be processed and acquiring second face-pinching data associated with the expression keywords; and driving generation of a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data. The method and device solve the technical problem of the high cost of generating three-dimensional avatar facial animation in the related art and meet users' needs for virtual interaction.

Description

Method, device and system for generating three-dimensional virtual image facial animation
Technical Field
The application relates to the technical field of computers, in particular to a method, a device and a system for generating a three-dimensional virtual image facial animation.
Background
A digital human is a digitized character created with digital technology to closely resemble a real person. To meet the growing demand for virtual interaction, there is an urgent need to generate digital human animation with natural motion and synchronized lip movements.
In the related art, digital human animation that meets these requirements is generally obtained by regressing audio features directly to the vertices of a digital human face mesh, with data collection and processing carried out by professional actors and high-precision three-dimensional face scanning equipment. However, data acquisition under this scheme is expensive, the model key points must be manually aligned before the algorithm can drive other face models, the adaptation cost is high, and the generalization capability is poor.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a method, a device and a system for generating three-dimensional avatar facial animation, which at least solve the technical problem of the high cost of generating three-dimensional avatar facial animation in the related art and meet users' needs for virtual interaction.
According to one embodiment of the present application, a method for generating a three-dimensional avatar facial animation is provided, comprising the following steps: acquiring a currently input text to be processed, wherein the text to be processed is used to drive a change in an initial facial animation of the three-dimensional avatar; performing text-to-speech processing on the text to be processed to obtain speech to be processed; performing speech-driven animation processing on the speech to be processed to obtain first face-pinching data; extracting expression keywords from the text to be processed and acquiring second face-pinching data associated with the expression keywords; and driving generation of a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
There is also provided, in accordance with an embodiment of the present application, a method for generating a three-dimensional avatar facial animation, including: receiving a text to be processed from a client, wherein the text to be processed is used to drive a change in the initial facial animation of the three-dimensional avatar; performing text-to-speech processing on the text to be processed to obtain speech to be processed, performing speech-driven animation processing on the speech to be processed to obtain first face-pinching data, extracting expression keywords from the text to be processed, acquiring second face-pinching data associated with the expression keywords, and generating a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data; and feeding back the target facial animation to the client.
There is also provided, in accordance with an embodiment of the present application, a method for generating a three-dimensional avatar facial animation, including: acquiring a currently input text to be processed, wherein the content recorded in the text to be processed includes text information used for language recovery training, and the text to be processed is used to drive a change in the initial facial animation of the three-dimensional avatar; performing text-to-speech processing on the text to be processed to obtain speech to be processed for language recovery training; performing speech-driven animation processing on the speech to be processed to obtain first face-pinching data; extracting expression keywords from the text to be processed and acquiring second face-pinching data associated with the expression keywords; and generating a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
There is further provided, in accordance with an embodiment of the present application, an apparatus for generating a three-dimensional avatar facial animation, including: an acquisition module, configured to acquire a currently input text to be processed, wherein the text to be processed is used to drive a change in the initial facial animation of the three-dimensional avatar; a first processing module, configured to perform text-to-speech processing on the text to be processed to obtain speech to be processed; a second processing module, configured to perform speech-driven animation processing on the speech to be processed to obtain first face-pinching data; a third processing module, configured to extract expression keywords from the text to be processed and acquire second face-pinching data associated with the expression keywords; and a generating module, configured to generate a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
According to an embodiment of the present application, there is further provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the method for generating a three-dimensional avatar facial animation of any one of the embodiments of the present application.
There is also provided, in accordance with an embodiment of the present application, a system for generating a three-dimensional avatar facial animation, including: a processor; and a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps: acquiring a currently input text to be processed, wherein the text to be processed is used to drive a change in an initial facial animation of the three-dimensional avatar; performing text-to-speech processing on the text to be processed to obtain speech to be processed; performing speech-driven animation processing on the speech to be processed to obtain first face-pinching data; extracting expression keywords from the text to be processed and acquiring second face-pinching data associated with the expression keywords; and driving generation of a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
In the embodiments of the present application, the text to be processed that can drive a change in the initial facial animation of the three-dimensional avatar is acquired; text-to-speech processing is then performed on the text to be processed to obtain the speech to be processed; speech-driven animation processing is performed on the speech to be processed to obtain first face-pinching data; expression keywords are extracted from the text to be processed and second face-pinching data associated with the expression keywords are acquired; and finally the target facial animation of the three-dimensional avatar is generated by driving based on the first face-pinching data and the second face-pinching data. This achieves the purpose of generating, from a preset text, three-dimensional avatar facial animation that meets interaction requirements, achieves the technical effect of reducing the cost of generating three-dimensional avatar facial animation, solves the technical problem of the high generation cost of three-dimensional avatar facial animation in the related art, and meets users' needs for virtual interaction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a method of generating a three-dimensional avatar face animation according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of generating a three-dimensional avatar face animation according to an embodiment of the present application;
fig. 3 is a schematic diagram of a process of generating a predictive image according to an embodiment of the application;
FIG. 4 is a schematic diagram of an initial network model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a method of generating a three-dimensional avatar face animation according to an embodiment of the application;
FIG. 6 is a flow chart of yet another method of generating a three-dimensional avatar face animation according to an embodiment of the present application;
fig. 7 is a schematic diagram of a method for generating a three-dimensional avatar face animation at a cloud server according to an embodiment of the present application;
FIG. 8 is a flow chart of yet another method of generating a three-dimensional avatar face animation in accordance with an embodiment of the present application;
FIG. 9 is a schematic diagram of an apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application;
fig. 12 is a block diagram of another computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
Text-To-Speech (TTS) technology is a speech synthesis application that converts a given text into natural speech and outputs it.
Text-To-Animation (TTA) technology aims to generate digital human animation with natural motion and synchronized lip movements from a preset text.
Speech-driven Animation (STA) technology aims to drive the avatar to speak, and to convey emotion and motion, through speech.
Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that contain convolution computations and have a deep structure.
In the related art, digital human animation that meets these requirements is generally obtained by regressing audio features directly to the vertices of a digital human face mesh, with data acquisition and processing carried out by professional actors and high-precision three-dimensional face scanning equipment. The high-precision three-dimensional face scanning equipment may include multi-camera rigs, structured light, Red Green Blue Depth (RGB-D) cameras, laser point clouds and other devices. However, data acquisition under this scheme is expensive, the model key points must be manually aligned before the algorithm can drive other face models, the adaptation cost is high, and the generalization capability is poor. In addition, the related art may also adopt a scheme that regresses audio features to full-face blend shapes (Blend Shapes, BS) to generate digital human animation, reconstructing the mapping from speech to facial shape by acquiring a large number of precise three-dimensional face positions while many people speak. This scheme still requires data acquisition and processing by professional actors and professional software, and the resulting digital human animation is generally inferior to that of the first scheme.
Example 1
There is also provided, in accordance with an embodiment of the present application, an embodiment of a method for generating a three-dimensional avatar facial animation. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that presented herein.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a block diagram of a hardware configuration of a computer terminal (or mobile device) for implementing the method of generating a three-dimensional avatar facial animation. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a microcontroller unit (MCU), a field-programmable gate array (FPGA), or another processing device), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a cursor control device, a keyboard, a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method for generating a three-dimensional avatar face animation in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the above-mentioned method for generating a three-dimensional avatar face animation. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet via wireless.
The Display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with the user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the above operating environment, the present application provides a method of generating a three-dimensional avatar face animation as shown in fig. 2. FIG. 2 is a flow chart of a method of generating a three-dimensional avatar face animation according to an embodiment of the application, the method comprising the steps of:
Step S21, acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of the three-dimensional virtual image to change;
the text to be processed may include text contents of different languages, such as chinese text contents or english text contents. The text to be processed can be obtained by extracting the content of the preset text, and can also be obtained by detecting the content input by the text input module. For example, the pending text content may include chinese text: "only one or more months of time", may also include english text: "Itwa early spring".
The three-dimensional virtual image can be a digitalized figure image which is created by using a digital technology and is close to a human image, and can be called a digital person. The three-dimensional virtual image can be applied to application scenes in different fields of computer game making, movie and TV play making, on-line live broadcast, language teaching, language recovery training, rehabilitation and evaluation of auditory disorder and the like, and can meet the requirement of a user on virtual interaction.
The initial facial animation may be a default, expressionless facial animation. The text to be processed drives changes in the expression or motion of the three-dimensional avatar's face, so that digital human animation with natural motion and synchronized lip movements can be generated.
Step S22, performing text-to-speech processing on the text to be processed to obtain speech to be processed;
Specifically, in the process of performing text-to-speech processing on the text to be processed, text analysis is performed first: the text to be processed is analyzed linguistically, that is, lexical, grammatical and semantic analysis is performed sentence by sentence to determine the underlying structure of each sentence and the phoneme composition of each word, including sentence segmentation, word segmentation, and the handling of polyphonic characters, numbers and abbreviations. Next, speech synthesis is performed: the single words or phrases corresponding to the processed text are extracted from a speech synthesis library, and the linguistic description is converted into a speech waveform. Finally, prosody processing is carried out, and the synthetic speech quality (Quality of Synthetic Speech) of the speech to be processed is evaluated. Synthetic speech quality refers to the quality of the speech output by a TTS system and is generally evaluated subjectively in terms of clarity (or intelligibility), naturalness and coherence: clarity is the percentage of meaningful words that are correctly heard and distinguished; naturalness evaluates whether the synthesized speech is close to human speech and whether the tone of the synthesized words is natural; and coherence evaluates whether the synthesized sentences are fluent. Performing text-to-speech processing on the text to be processed increases the readability of the text to be processed.
Step S23, carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
the first face pinching data is used for driving mouth animations corresponding to the target face animations. Specifically, in the process of performing voice-driven animation processing on the voice to be processed, the deep learning neural network and computer graphics can be combined, so that the computer can understand the content of the voice to be processed and finely drive the lip movement of the three-dimensional virtual image based on the first face-pinching data. The first face pinching data can be used for representing BS data related to the mouth animation of the three-dimensional virtual image.
Step S24, extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords;
and the second face pinching data is used for driving the eye expression animation corresponding to the target face animation. Specifically, the second face-pinching data may be used to represent BS data related to the eye expression animation of the three-dimensional avatar. The expression keywords and the eye expressions of the three-dimensional virtual image have corresponding relations, and the corresponding relations between the expression keywords and the eye expressions are collected and mapped, so that the three-dimensional virtual image can change the eye expressions when the corresponding expression keywords are detected, and the vividness of the generated target facial animation is improved.
Step S25, generating a target facial animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data.
The first face-pinching data and the second face-pinching data are blend-shape data, namely BS data. BS data can be applied to a blend-shape (morph-target) deformer to produce local animation. Specifically, by interpolating between two adjacent meshes, a local region of the three-dimensional avatar can be morphed from one shape into another, thereby achieving the face-pinching effect.
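As an illustration of the blend-shape mechanism described above, the following Python sketch applies BS weights to a neutral mesh and interpolates between two weight vectors; the function names and array shapes are assumptions made for this example and are not specified by the embodiment.

```python
import numpy as np

def apply_blend_shapes(base_vertices, shape_deltas, weights):
    """Deform a neutral face mesh with blend-shape (BS) weights.

    base_vertices: (V, 3) neutral mesh vertex positions.
    shape_deltas:  (K, V, 3) per-blend-shape vertex offsets.
    weights:       (K,) BS coefficients, e.g. mouth or eye-expression face-pinching data.
    """
    base_vertices = np.asarray(base_vertices, dtype=np.float32)
    shape_deltas = np.asarray(shape_deltas, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    # Linear morph: each weight scales its offset field, and the offsets are added to the base.
    return base_vertices + np.tensordot(weights, shape_deltas, axes=(0, 0))

def interpolate_weights(w_start, w_end, t):
    """Linearly blend two BS weight vectors (t in [0, 1]) to morph one local shape into another."""
    return (1.0 - t) * np.asarray(w_start) + t * np.asarray(w_end)
```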
Based on steps S21 to S25, a text to be processed that can drive a change in the initial facial animation of the three-dimensional avatar is acquired; text-to-speech processing is performed on the text to be processed to obtain the speech to be processed; speech-driven animation processing is performed on the speech to be processed to obtain first face-pinching data; expression keywords are extracted from the text to be processed and second face-pinching data associated with the expression keywords are acquired; and finally a target facial animation of the three-dimensional avatar is generated by driving based on the first face-pinching data and the second face-pinching data.
The method for generating a three-dimensional avatar face animation proposed in the above embodiments is further described below.
In an alternative embodiment, in step S23, performing a voice-driven animation process on the voice to be processed to obtain the first face-pinching data includes:
inputting the voice to be processed into a target network model to obtain first face pinching data, wherein the target network model is obtained by a plurality of groups of data through machine learning training, and each group of data in the plurality of groups of data comprises: sample image frames and voice signals extracted from the sample video.
The sample video may be a monocular color video, and the target network model includes: a video encoder (Visual Encoder), an audio encoder (Audio Encoder), a speech content space (Speech Content Space), and multiple multilayer perceptrons (MLPs). Using the target network model to perform speech-driven animation inference on the speech to be processed makes it possible to quickly obtain the BS data related to the mouth animation of the three-dimensional avatar, further reducing the data acquisition cost.
In an alternative embodiment, the method of generating a three-dimensional avatar face animation further comprises:
inputting the sample image frame and the voice signal into an initial network model to obtain a multi-frame prediction image, wherein the initial network model is used for training the corresponding relation between the voice signal and the figure mouth shape of the sample image frame;
determining a plurality of loss functions using the multi-frame predicted images, wherein the plurality of loss functions includes: a first loss function for calculating an image-level loss of the multi-frame predicted images, a second loss function for calculating a perception-level loss of the multi-frame predicted images, a third loss function for calculating a first-part face key-point loss of the multi-frame predicted images, and a fourth loss function for calculating a second-part face key-point loss of the multi-frame predicted images;
and optimizing the model parameters of the initial network model based on a plurality of loss functions to obtain a target network model.
Specifically, the sample image frames may be extracted from the sample video, and the sample image frames and the speech signal are input into the initial network model to obtain multi-frame predicted images (Multi-image). The multi-frame predicted images may be used to determine the first loss function, the second loss function, the third loss function, and the fourth loss function. The first loss function is used to calculate the image-level loss (Image Level Loss) of the multi-frame predicted images, the second loss function is used to calculate the perception-level loss (Perception Level Loss), the third loss function is used to calculate the first-part face key-point loss (Audio Landmark Loss), and the fourth loss function is used to calculate the second-part face key-point loss (Other Landmark Loss). The model parameters of the initial network model are optimized based on the Image Level Loss, Perception Level Loss, Audio Landmark Loss and Other Landmark Loss to obtain the target network model.
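A minimal PyTorch-style training-step sketch for this multi-loss optimization is given below; the weighted-sum combination, the function signatures and the dictionary keys are assumptions, since the embodiment only names the individual losses.

```python
def training_step(model, optimizer, image_frames, audio, targets, loss_fns, loss_weights):
    """One optimization step over the four losses (image, perception, audio/other landmarks).

    loss_fns / loss_weights: dicts keyed e.g. by 'image', 'perception',
    'audio_landmark', 'other_landmark'; each loss_fn maps (prediction, targets)
    to a scalar tensor supporting backward().
    """
    prediction = model(image_frames, audio)   # multi-frame predicted images / coefficients
    losses = {name: fn(prediction, targets) for name, fn in loss_fns.items()}
    total = sum(loss_weights[name] * losses[name] for name in losses)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item(), {name: value.item() for name, value in losses.items()}
```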
Fig. 3 is a schematic diagram of a process of generating a predicted image according to an embodiment of the present application. As shown in fig. 3, skin texture estimation and key-point detection are performed on the original face image to obtain an estimated skin mask A and key-point coordinates q. Based on the skin mask A and the key-point coordinates q, the Image Level Loss can be determined, which includes a robust photometric loss (Robust Photometric Loss) and a key-point coordinate loss (Landmark Location Loss). The Robust Photometric Loss is used to remove the interference of hair, beards, eyes and the like with the texture estimation, and the Landmark Location Loss is used to structurally align the predicted image with the original face image. The original face image is analyzed using the Image Level Loss, the Perception Level Loss and a face recognition network, and a three-dimensional avatar face image can be obtained. The face recognition network may be an R-network (R-Net); the Perception Level Loss includes a deep identity feature loss (Deep Identity Feature Loss) used to make the predicted image and the original face image as similar as possible. Specifically, the Robust Photometric Loss can be calculated by the following formula (1):
L_{photo}(x) = \frac{\sum_{i} A_i \,\lVert I_i - I'_i(x) \rVert_2}{\sum_{i} A_i}    (1)
In formula (1), I is the original face image, x is the predicted coefficient, I'(x) is the predicted image, A is the skin mask corresponding to I, and the subscript i indexes image pixels.
The Perception Level Loss can be calculated by the following equation (2):
L_{per}(x) = 1 - \frac{\langle f(I),\, f(I'(x)) \rangle}{\lVert f(I) \rVert \cdot \lVert f(I'(x)) \rVert}    (2)
in formula (2), I is an original face image, x is a predicted coefficient, I' is a predicted image, and f is a network layer feature extracted through a face recognition network.
The first-part face key-point loss can be obtained by synthesizing the shape-and-texture parameterized model from the pose coefficients, identity information and speech BS data and computing the loss over the lower-half face key points, where the lower-half face key points specifically include the key points corresponding to the mouth and the jaw. The second-part face key-point loss can be obtained by synthesizing from the pose coefficients, identity information (identity) and other BS data and computing the loss over the upper-face key points, where the upper-face key points may specifically include the key points around the eyes and on both sides of the head. The identity information in the sample image frames remains unchanged, and the variance of the identity information within each training batch (Batch) can be constrained by adding a consistency loss (Consistency Loss).
The Landmark Location Loss can be calculated by the following formula (3):
L_{lan}(x) = \frac{1}{N} \sum_{n=1}^{N} \lVert q_n - q'_n(x) \rVert^2    (3)
in formula (3), x is a predicted coefficient, q is a face model key point coordinate corresponding to the original face image, q' is a face model key point coordinate corresponding to x, and N is the number of key points.
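A compact PyTorch sketch of the robust photometric loss, the deep identity feature loss and the landmark location loss is given below; the tensor shapes, reductions and optional weighting are assumptions for illustration rather than a verbatim transcription of the embodiment.

```python
import torch
import torch.nn.functional as F

def robust_photometric_loss(pred_img, target_img, skin_mask):
    """Skin-mask-weighted photometric error in the spirit of formula (1).

    pred_img, target_img: (B, 3, H, W) predicted and original face images.
    skin_mask:            (B, H, W) mask A suppressing hair, beard and eye regions.
    """
    per_pixel = torch.norm(target_img - pred_img, dim=1)        # L2 over RGB channels
    return (skin_mask * per_pixel).sum() / skin_mask.sum().clamp(min=1.0)

def deep_identity_feature_loss(feat_pred, feat_target):
    """Perception-level loss in the spirit of formula (2): 1 - cosine similarity of face-recognition features."""
    return 1.0 - F.cosine_similarity(feat_pred, feat_target, dim=-1).mean()

def landmark_location_loss(pred_lmk, target_lmk, weights=None):
    """Mean squared key-point distance in the spirit of formula (3); optional weights emphasize mouth/jaw points."""
    dist = ((pred_lmk - target_lmk) ** 2).sum(dim=-1)           # (B, N) squared distances
    if weights is not None:
        dist = dist * weights
    return dist.mean()
```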
With these loss settings, the speech learns to map only to mouth motion, while the motion of other facial parts (such as the eyes and eyebrows) is attributed to other coefficients, achieving the goal of decoupling. A target network model with better processing performance is thus obtained, which effectively improves the processing efficiency of three-dimensional avatar facial-animation data and reduces the data processing cost.
The face model is generally a three-dimensional morphable face model (3D Morphable Model, 3DMM), which specifically includes two parts: coefficients representing identity information and coefficients representing facial expression (Expression). The video-sequence-based BS decoupling learning and training framework provided in the embodiments of the present application starts at the training stage: on one hand, the identity information of the speaker in the video sequence is constrained so that the coefficients describing the identity information remain essentially unchanged; on the other hand, the deformation that fits expressions and mouth shapes is assigned to the coefficients describing facial expression. In this way, different BS data in a video sequence of a single speaker are partially decoupled, which enhances the lip-synchronization capability of the facial animation and makes it more natural and consistent with the context semantics.
Based on the optional embodiment, the sample image frames and the voice signals are input into the initial network model to obtain the multi-frame prediction image, the multi-frame prediction image is used for determining the loss functions, and finally, the model parameters of the initial network model are optimized based on the loss functions, so that the target network model with better performance can be obtained, the driving effect of the text-driven animation is improved, the method can be directly applied to various common BS data driving scenes, and the generalization capability is better.
In an alternative embodiment, the initial network model includes: a video encoder, an audio encoder, a speech content space, and a plurality of multilayer perceptrons. Inputting the sample image frames and the speech signal into the initial network model to obtain the multi-frame predicted images includes the following steps:
carrying out video coding processing on the sample image frame by using a video coder to obtain a first coding result, and carrying out audio coding processing on the voice signal by using an audio coder to obtain a second coding result;
acquiring a third face pinching parameter, a fourth face pinching parameter and a sample image frame attribute parameter through the first multi-layer perceptron by using the first encoding result, wherein the third face pinching parameter is face pinching data associated with the person identity of the sample image frame, the fourth face pinching parameter is face pinching data associated with the person expression of the sample image frame, and the sample image frame attribute parameter comprises: texture parameters, pose parameters and light parameters of the sample image frame;
inputting the first coding result into a voice content space through a second multilayer perceptron, inputting the second coding result into the voice content space, and outputting a synchronous result of the sample image frame and the voice signal;
inputting the synchronization result into a third multilayer perceptron to obtain a fifth face pinching parameter, wherein the fifth face pinching parameter is face pinching data associated with the object mouth shape;
and generating a multi-frame prediction image based on the third face pinching parameter, the fourth face pinching parameter, the sample image frame attribute parameter and the fifth face pinching parameter.
The third face-pinching parameter is BS data associated with the person Identity of the sample image frame, i.e., identity pinching data (Identity BS); the fourth pinch face parameter is BS data associated with the human expression of the sample image frame, that is, other pinch face data (Other BS), which is expression pinch face data unrelated to the mouth shape of the human face, for example, BS data of the upper half of the face; the sample image frame attribute parameters include: texture Parameters (Texture Parameters), pose Parameters (Pose Parameters), and Light Parameters (Light Parameters) of the sample image frames; the fifth pinching parameter may express BS data of a lower half of the face related to the utterance, for example, the fifth pinching parameter may be BS data associated with a human mouth shape of the sample image frame.
Fig. 4 is a schematic structural diagram of an initial network model according to an embodiment of the present application. As shown in fig. 4, the initial network model includes a video encoder, an audio encoder, a speech content space, a first multilayer perceptron, a second multilayer perceptron, and a third multilayer perceptron, whose hidden-layer dimensions differ. The sample image frames extracted from the sample video are video-encoded by the video encoder to obtain the first encoding result, and the speech signal is audio-encoded by the audio encoder to obtain the second encoding result. The first encoding result is then fed through the second multilayer perceptron into the speech content space, the second encoding result is fed into the speech content space, and the fifth face-pinching parameter is obtained through the third multilayer perceptron. Multi-frame predicted images are generated based on the third face-pinching parameter, the fourth face-pinching parameter, the sample image frame attribute parameters and the fifth face-pinching parameter. The first, second, third and fourth loss functions are determined using the multi-frame predicted images, and finally the model parameters of the initial network model are optimized based on the plurality of loss functions to obtain the target network model.
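To make the data flow in fig. 4 concrete, the following PyTorch sketch lays out the two encoders, the three multilayer perceptrons and the shared content space; all layer types and dimensions are assumptions for illustration (the embodiment does not fix them), and the renderer that turns the predicted coefficients into multi-frame images is omitted.

```python
import torch
import torch.nn as nn

class InitialNetworkModel(nn.Module):
    """Hedged sketch of the encoder/MLP layout shown in fig. 4."""

    def __init__(self, vid_dim=256, aud_dim=256, content_dim=128,
                 n_identity=80, n_other=30, n_attr=33, n_mouth=32):
        super().__init__()
        self.video_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vid_dim), nn.ReLU())
        self.audio_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(aud_dim), nn.ReLU())
        # First MLP: identity BS, other (expression) BS, and texture/pose/light attributes.
        self.mlp1 = nn.Sequential(nn.Linear(vid_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_identity + n_other + n_attr))
        # Second MLP: projects the video code into the shared speech content space.
        self.mlp2 = nn.Linear(vid_dim, content_dim)
        self.audio_to_content = nn.Linear(aud_dim, content_dim)
        # Third MLP: mouth-related BS from the content-space feature.
        self.mlp3 = nn.Sequential(nn.Linear(content_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_mouth))
        self.n_identity, self.n_other = n_identity, n_other

    def forward(self, frames, audio):
        v = self.video_encoder(frames)                       # first encoding result
        a = self.audio_encoder(audio)                        # second encoding result
        coeffs = self.mlp1(v)
        identity_bs = coeffs[:, :self.n_identity]            # third face-pinching parameter
        other_bs = coeffs[:, self.n_identity:self.n_identity + self.n_other]  # fourth parameter
        attrs = coeffs[:, self.n_identity + self.n_other:]   # texture / pose / light parameters
        v_content = self.mlp2(v)                             # video feature in the content space
        a_content = self.audio_to_content(a)                 # audio feature in the content space
        mouth_bs = self.mlp3(a_content)                      # fifth face-pinching parameter
        # A differentiable renderer would combine these coefficients into multi-frame predicted images.
        return identity_bs, other_bs, attrs, mouth_bs, (v_content, a_content)
```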
In the initial network model shown in fig. 4, the audio encoder, the second multilayer perceptron and the speech content space may form an audio-visual synchronization network, which may be pre-trained using speech signals. In this synchronization network, in order to ensure that the first encoding result and the second encoding result lie in the same content space and can be synchronized, a synchronization triplet loss (Synchronization Triplet Loss) for contrastive learning may be imposed as a constraint on the feature spaces at different moments. Specifically, the synchronization triplet loss function L_{sync}^{v2a} can be calculated by the following formula (4):

L_{sync}^{v2a} = \sum_{a^- \in N^-} \max\left(0,\; D(F_c^{v}, F_c^{a}) - D(F_c^{v}, F_c^{a^-}) + m\right)    (4)

In formula (4), F denotes a content-space feature, v denotes the first encoding result, a denotes the second encoding result, the subscript c denotes the content space, N^- denotes a set of speech frames corresponding to non-current frames (a fixed number of which can be randomly sampled), D denotes the cosine (Cosine) distance between two vectors, m is the triplet margin, and v2a denotes video-to-speech. Likewise, the synchronization triplet loss function can also be expressed as L_{sync}^{a2v}, which is calculated in the same way as formula (4); the only difference is that a2v denotes speech-to-video, with the roles of the video and audio features exchanged.
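A possible PyTorch form of the video-to-audio synchronization triplet loss is sketched below; the margin value and tensor shapes are assumptions, and the speech-to-video (a2v) direction is obtained by swapping the roles of the two feature sets.

```python
import torch
import torch.nn.functional as F

def sync_triplet_loss(content_v, content_a, content_a_neg, margin=0.2):
    """Video-to-audio synchronization triplet loss in the spirit of formula (4).

    content_v:     (B, C) video content features of the current frame.
    content_a:     (B, C) audio content features of the same frame (positive).
    content_a_neg: (B, M, C) audio features randomly sampled from non-current frames (N^-).
    """
    d_pos = 1.0 - F.cosine_similarity(content_v, content_a, dim=-1)                   # (B,)
    d_neg = 1.0 - F.cosine_similarity(content_v.unsqueeze(1), content_a_neg, dim=-1)  # (B, M)
    # Pull the matching audio frame closer to the video frame than every sampled negative.
    return F.relu(d_pos.unsqueeze(1) - d_neg + margin).mean()
```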
Based on the above optional embodiment, the first encoding result is obtained by video-encoding the sample image frames with the video encoder, the second encoding result is obtained by audio-encoding the speech signal with the audio encoder, and the third face-pinching parameter, the fourth face-pinching parameter, the sample image frame attribute parameters and the fifth face-pinching parameter are obtained through the multilayer perceptrons. Finally, based on these parameters, multi-frame predicted images can be rapidly generated for efficient training of the initial network model to obtain the target network model, further reducing the data-processing cost of generating three-dimensional avatar facial animation.
In an alternative embodiment, in step S24, extracting expression keywords from the text to be processed, and obtaining second face-pinching data associated with the expression keywords includes:
acquiring a preset correspondence, wherein the preset correspondence records correspondences, collected in advance, between different expression keywords in the training corpus and the expressions of different facial parts; extracting the expression keywords from the text to be processed; and searching the preset correspondence for the eye expression corresponding to the expression keywords to obtain the second face-pinching data.
Specifically, for the initial network model, the correspondence between keywords and eye expressions in the training corpus may be obtained from the sample videos, and the preset correspondence may be established by collecting these keyword-to-eye-expression pairs. During inference with the target network model, the eye expression corresponding to an expression keyword extracted from the text to be processed can be determined using the pre-collected correspondences between different expression keywords and different eye expressions in the training corpus, yielding the second face-pinching data, so that the three-dimensional avatar changes its eye expression when the corresponding expression keyword is detected.
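A toy version of this keyword lookup is shown below; the keyword list and the blend-shape names and weights are hypothetical and stand in for the correspondences actually collected from the training corpus.

```python
# Hypothetical expression-keyword table; a real table is collected from sample videos.
EXPRESSION_TABLE = {
    "happy":    {"eye_smile_left": 0.7, "eye_smile_right": 0.7, "brow_up": 0.2},
    "angry":    {"brow_down_left": 0.8, "brow_down_right": 0.8, "eye_squint": 0.4},
    "surprise": {"eye_wide_left": 0.9, "eye_wide_right": 0.9, "brow_up": 0.8},
}

def second_face_pinching_data(text, table=EXPRESSION_TABLE):
    """Extract expression keywords from the text and look up the associated eye-expression BS weights."""
    found = {}
    for keyword, bs_weights in table.items():
        if keyword in text.lower():
            # Keep the strongest weight per blend shape when several keywords match.
            for name, value in bs_weights.items():
                found[name] = max(found.get(name, 0.0), value)
    return found

print(second_face_pinching_data("She was happy, then suddenly surprised by the news."))
```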
Based on the optional embodiment, the preset corresponding relation is obtained, the expression keywords are further extracted from the text to be processed, finally, the eye expressions corresponding to the expression keywords are searched in the preset corresponding relation, second face pinching data are obtained, and the data acquisition cost for generating the three-dimensional virtual image facial animation can be further reduced.
In an alternative embodiment, the driving generation of the target face animation based on the first and second pinch face data at step S25 includes:
adjusting the control objects respectively corresponding to the first face-pinching data, the second face-pinching data, the sixth face-pinching data and the seventh face-pinching data to generate the target facial animation, wherein the control object corresponding to the first face-pinching data is the mouth animation of the three-dimensional avatar, the control object corresponding to the second face-pinching data is the eye-expression animation of the three-dimensional avatar, the control object corresponding to the sixth face-pinching data is the identity of the three-dimensional avatar, and the control object corresponding to the seventh face-pinching data is the naturalness of the head animation of the three-dimensional avatar.
Specifically, a control object of the first face pinching data in the initial face animation is a mouth animation of the three-dimensional virtual image, a control object of the second face pinching data in the initial face animation is an eye expression animation of the three-dimensional virtual image, a control object of the sixth face pinching data in the initial face animation is an animation identity of the three-dimensional virtual image, a control object of the seventh face pinching data in the initial face animation is a head animation naturalness of the three-dimensional virtual image, and finally the control object is adjusted based on the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data to generate the target face animation.
The sixth face-pinching data may be used to represent the BS data corresponding to the identity of the three-dimensional avatar, such as BS data corresponding to a teacher identity or BS data corresponding to the identity information of a speaker. The seventh face-pinching data may be used to represent BS data corresponding to random animation, which is obtained by randomly selecting facial movements over a period of time based on the head pose information and can effectively increase the naturalness and fidelity of the facial animation.
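The way the four face-pinching streams could be merged frame by frame is illustrated below; the additive merge-and-clip policy and the renderer interface are assumptions, not requirements of the embodiment.

```python
import numpy as np

def compose_frame_weights(mouth_bs, eye_bs, identity_bs, head_motion_bs):
    """Merge the four face-pinching streams into one per-frame BS weight dictionary.

    mouth_bs:        first face-pinching data (speech-driven mouth animation)
    eye_bs:          second face-pinching data (keyword-driven eye expression)
    identity_bs:     sixth face-pinching data (preset avatar identity)
    head_motion_bs:  seventh face-pinching data (random natural head/face motion)
    """
    combined = {}
    for stream in (identity_bs, head_motion_bs, eye_bs, mouth_bs):
        for name, value in stream.items():
            combined[name] = combined.get(name, 0.0) + value
    # Clamp to the usual [0, 1] blend-shape range.
    return {name: float(np.clip(value, 0.0, 1.0)) for name, value in combined.items()}

def drive_animation(frames_of_streams, renderer):
    """Apply the composed weights frame by frame to produce the target facial animation."""
    for mouth, eye, identity, head in frames_of_streams:
        renderer.set_blend_shape_weights(compose_frame_weights(mouth, eye, identity, head))
        renderer.render_frame()
```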
Based on the optional embodiment, the control objects of the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data in the initial facial animation are obtained, and the control objects corresponding to the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data are adjusted to generate the target facial animation.
In an alternative embodiment, in step S22, performing text-to-speech processing on the text to be processed to obtain the speech to be processed includes:
converting the morphemes contained in the text to be processed into phonemes to obtain a phoneme sequence; performing phoneme duration prediction and fundamental frequency prediction on the phoneme sequence to obtain a prediction result; and carrying out audio synthesis on the phoneme sequence and the prediction result and converting the phoneme sequence and the prediction result into a sound waveform to obtain the speech to be processed.
The morphemes contained in the text to be processed are the smallest sound-meaning units in a language, and morphemes take different forms in different language systems. A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes fall into two main categories, vowels and consonants. For example, the Chinese syllable "啊" (ā) has only one phoneme, "爱" (ài) has two phonemes, and "代" (dài) has three phonemes.
When the morphemes contained in the text to be processed are converted into phonemes, the phoneme sequence corresponding one-to-one with the text to be processed can be obtained by querying a standard phoneme dictionary or the Chinese pinyin table. For example, when the input text to be processed is ["It", "was", "early", "spring"], the phoneme sequence obtained after conversion is [[IH1, T], [W, AA1, Z], [ER1, L, IY0], [S, P, R, IH1, NG]]. When the input text to be processed is the Chinese sentence "仅一个多月的时间里" ("in only a little more than a month"), the phoneme sequence obtained after conversion is [jin3 yi1 ge4 duo1 yue4 de5 shi2 jian1 li3], where the numbers in the phoneme sequence corresponding to the Chinese text represent the tones of the Chinese characters.
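A dictionary-based grapheme-to-phoneme lookup of the kind described above can be sketched as follows; the tiny lexicon is illustrative only, standing in for a full phoneme dictionary or pinyin table.

```python
# Illustrative lexicon covering only the example sentence.
LEXICON = {
    "it": ["IH1", "T"],
    "was": ["W", "AA1", "Z"],
    "early": ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}

def text_to_phonemes(words, lexicon=LEXICON):
    """Convert a tokenized text into a phoneme sequence, one phoneme list per word."""
    sequence = []
    for word in words:
        phones = lexicon.get(word.lower().strip(".,!?"))
        if phones is None:
            phones = ["<unk>"]   # out-of-vocabulary words would fall back to a learned G2P model
        sequence.append(phones)
    return sequence

print(text_to_phonemes(["It", "was", "early", "spring"]))
# [['IH1', 'T'], ['W', 'AA1', 'Z'], ['ER1', 'L', 'IY0'], ['S', 'P', 'R', 'IH1', 'NG']]
```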
In the process of predicting the phoneme durations for the phoneme sequence, the duration of each phoneme may be obtained from the context information of each phoneme in the sequence. Specifically, the context in which each phoneme is uttered is first matched, and then the audio segment corresponding to the phoneme and its position in the audio are obtained through a neural-network-based Connectionist Temporal Classification loss (CTC loss) model, thereby realizing phoneme-duration prediction; the CTC loss can be used to align the audio with the text.
For example, when the phoneme sequence is [[IH1, T], [W, AA1, Z], [ER1, L, IY0], [S, P, R, IH1, NG]], performing phoneme-duration prediction on the phoneme sequence gives a prediction result such as: [IH1 (0.1 s), T (0.05 s), ...].
The pitch and intonation of each phoneme can also be obtained by performing fundamental frequency prediction on the phoneme sequence. Specifically, by performing phoneme duration and fundamental frequency prediction on the phoneme sequence, a duration data pair of each phoneme and a fundamental frequency data pair of the phoneme can be obtained, where the fundamental frequency data pair refers to a frequency spectrum of the phoneme at the current time point in the real-person pronunciation process. The pitch and intonation of each phoneme can be predicted based on the phoneme duration data pair and the fundamental frequency data pair.
For example, when the phoneme sequence is [[IH1, T], [W, AA1, Z], [ER1, L, IY0], [S, P, R, IH1, NG]], performing phoneme-duration prediction and fundamental-frequency prediction on the phoneme sequence gives a prediction result such as: [(IH, 0.05 s, 140 Hz), (T, 0.07 s, 141 Hz), ...].
Further, the phoneme sequence and the prediction result are subjected to audio synthesis and converted into sound waveforms, and the speech to be processed is obtained. Specifically, the prediction results corresponding to each phoneme are combined, and the phonemes, the duration and the fundamental frequency are combined, so that the phonemes, the duration and the fundamental frequency are converted into sound waveforms, and then the speech to be processed is output.
For example, the input when audio-synthesizing the phoneme sequence and the prediction result is: [IH1 (140 Hz, 0.5 s), T (142 Hz, 0.1 s), (not voiced, 0.2 s), W (140 Hz, 0.3 s), ...], from which the speech to be processed can be output.
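How the phoneme, duration and fundamental-frequency predictions might be assembled into audio is sketched below; a sine-tone generator stands in for the real vocoder/acoustic model, so the output is only a crude illustration of the data flow.

```python
import numpy as np

def synthesize_waveform(phoneme_plan, sample_rate=16000):
    """Turn (phoneme, f0_hz, duration_s) triples into a waveform.

    Voiced phonemes become a tone at the predicted fundamental frequency;
    unvoiced entries (f0_hz is None) become silence of the predicted duration.
    """
    pieces = []
    for phoneme, f0_hz, duration_s in phoneme_plan:
        n_samples = max(1, int(duration_s * sample_rate))
        t = np.arange(n_samples) / sample_rate
        if f0_hz is None:
            pieces.append(np.zeros(n_samples))                  # "not voiced" segment
        else:
            pieces.append(0.3 * np.sin(2 * np.pi * f0_hz * t))  # tonal stand-in for the phoneme
    return np.concatenate(pieces)

plan = [("IH1", 140, 0.5), ("T", 142, 0.1), ("pau", None, 0.2), ("W", 140, 0.3)]
audio = synthesize_waveform(plan)
print(audio.shape)  # (17600,) samples at 16 kHz, i.e. 1.1 seconds
```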
Based on the optional embodiment, the phoneme sequence is obtained by converting the morphemes contained in the text to be processed into the phonemes, the phoneme sequence is subjected to phoneme duration prediction and fundamental frequency prediction to obtain a prediction result, and finally, the phoneme sequence and the prediction result are subjected to audio synthesis and are converted into a sound waveform, so that the speech to be processed corresponding to the text to be processed can be accurately and quickly synthesized, and the readability of the text to be processed is further improved.
Fig. 5 is a schematic diagram of a method for generating a three-dimensional avatar facial animation according to an embodiment of the present application. As shown in fig. 5, the currently input text to be processed is first acquired, where the text to be processed is used to drive a change in the initial facial animation of the three-dimensional avatar, and text-to-speech processing is performed on the text to be processed to obtain the speech to be processed. Speech-driven animation processing is then performed on the speech to be processed to obtain the first face-pinching data, and expression keywords are extracted from the text to be processed to acquire the second face-pinching data associated with the expression keywords. The control objects respectively corresponding to the first face-pinching data, the second face-pinching data, the sixth face-pinching data and the seventh face-pinching data are adjusted to generate the target facial animation, where the control object corresponding to the first face-pinching data is the mouth animation of the three-dimensional avatar, the control object corresponding to the second face-pinching data is the eye-expression animation of the three-dimensional avatar, the control object corresponding to the sixth face-pinching data is the identity of the three-dimensional avatar (which can be set in advance), and the control object corresponding to the seventh face-pinching data is the naturalness of the head animation of the three-dimensional avatar.
With the method for generating three-dimensional avatar facial animation provided by the embodiments of the present application, complicated and costly scanning and reconstruction equipment is not needed, and the generated three-dimensional avatar facial animation has stronger applicability and generalization.
For example, the three-dimensional avatar face animation generated by the method of the embodiment of the application can be applied to the fields of electronic game production, movie and television play production, online live broadcast and the like. In addition, the text-driven digital human animation can provide visual information of the facial and lip movements during pronunciation, thereby being used for language teaching, language recovery training, and rehabilitation treatment and evaluation of auditory disorders. It should be noted that the method for generating a three-dimensional avatar face animation provided in the embodiment of the present application may also be applied in more related fields, and the embodiment of the present application does not form a specific limitation to the application field.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method described in the embodiments of the present application.
Example 2
There is also provided a method for generating a three-dimensional avatar face animation, according to an embodiment of the present application, where the method for generating a three-dimensional avatar face animation is executed on a cloud server, and fig. 6 is a flowchart of another method for generating a three-dimensional avatar face animation according to an embodiment of the present application, as shown in fig. 6, where the method for generating a three-dimensional avatar face animation includes:
step S61, receiving a text to be processed from a client, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change;
step S62, performing character synthesis voice processing on the text to be processed to obtain voice to be processed, performing voice-driven animation processing on the voice to be processed to obtain first face pinching data, extracting expression keywords from the text to be processed, acquiring second face pinching data associated with the expression keywords, and generating a target facial animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data;
and S63, feeding back the target facial animation to the client.
Based on the steps S61 to S63, the text to be processed from the client is received, then text-to-speech processing is performed on the text to be processed to obtain the speech to be processed, speech-driven animation processing is performed on the speech to be processed to obtain first face-pinching data, expression keywords are extracted from the text to be processed, second face-pinching data associated with the expression keywords are obtained, a target face animation of the three-dimensional avatar is generated based on the first face-pinching data and the second face-pinching data, and finally the target face animation is fed back to the client.
Optionally, fig. 7 is a schematic diagram of a method for generating a three-dimensional avatar facial animation at a cloud server according to an embodiment of the present application, and as shown in fig. 7, the cloud server acquires a to-be-processed text from a client through a network, further performs text-to-speech processing on the to-be-processed text to obtain to-be-processed speech, performs speech-driven animation processing on the to-be-processed speech to obtain first face-pinching data, extracts expression keywords from the to-be-processed text and acquires second face-pinching data associated with the expression keywords, and generates a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
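Purely as an illustrative sketch of this client/cloud-server exchange, the following Python snippet uses the standard-library http.server module; the port, the JSON payload fields and the generate_face_animation stub are hypothetical and merely stand in for the full pipeline described above.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def generate_face_animation(text):
        # Stub for the cloud-side pipeline: text-to-speech, speech-driven
        # animation, expression-keyword extraction and fusion of the two
        # groups of face-pinching data would run here.
        return {"text": text, "blendshape_frames": []}

    class AnimationHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            animation = generate_face_animation(payload.get("text", ""))
            body = json.dumps(animation).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            # Feed the target facial animation back to the client.
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), AnimationHandler).serve_forever()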
Example 3
There is also provided a method of generating a three-dimensional avatar face animation according to an embodiment of the present application, fig. 8 is a flowchart of yet another method of generating a three-dimensional avatar face animation according to an embodiment of the present application, as shown in fig. 8, the method of generating a three-dimensional avatar face animation including:
step S81, acquiring a currently input text to be processed, where the content recorded in the text to be processed includes: text information used for language recovery training, and the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change;
step S82, performing character synthesis voice processing on the text to be processed to obtain the voice to be processed for language recovery training;
step S83, carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
step S84, extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords;
and S85, generating a target face animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data.
Based on the steps S81 to S85, the currently input text to be processed is acquired, then character synthesis voice processing is performed on the text to be processed, the voice to be processed for language recovery training is obtained, then voice-driven animation processing is performed on the voice to be processed, first face pinching data is obtained, expression keywords are extracted from the text to be processed, second face pinching data associated with the expression keywords are obtained, and finally, a target facial animation of the three-dimensional avatar is generated based on the first face pinching data and the second face pinching data.
Example 4
There is also provided an apparatus for implementing the method for generating a three-dimensional avatar face animation according to an embodiment of the present application, and fig. 9 is a schematic structural diagram of an apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application, as shown in fig. 9, the apparatus including:
an obtaining module 901, configured to obtain a currently input text to be processed, where the text to be processed is used to drive an initial facial animation of a three-dimensional avatar to change;
the first processing module 902 is configured to perform text-to-speech processing on a text to be processed to obtain a speech to be processed;
the second processing module 903 is configured to perform voice-driven animation processing on the voice to be processed to obtain first face pinching data;
a third processing module 904, configured to extract the expression keywords from the text to be processed, and obtain second face pinching data associated with the expression keywords;
a generating module 905 configured to generate a target face animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
Optionally, the second processing module 903 is further configured to: inputting the voice to be processed into a target network model to obtain first face pinching data, wherein the target network model is obtained by multiple groups of data through machine learning training, and each group of data in the multiple groups of data comprises: sample image frames and voice signals extracted from the sample video.
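As a rough illustration of this speech-driven step, the sketch below splits the speech to be processed into frames and maps each frame to face pinching coefficients through a stand-in model; the 40 ms frame length, the 52-coefficient output and the random linear layer are assumptions and do not describe the trained target network model itself.

    import numpy as np

    def frame_audio(waveform, sr=16000, hop=0.04):
        # Split the speech into 40 ms frames so that one coefficient vector
        # is produced per animation frame (25 fps).
        step = int(sr * hop)
        n = max(len(waveform) // step, 1)
        return waveform[: n * step].reshape(n, step)

    class TargetNetworkModel:
        # Stand-in for the trained speech-to-face-pinching model: a single
        # random linear layer followed by a sigmoid, giving weights in [0, 1].
        def __init__(self, frame_len, n_coeffs=52, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(scale=0.01, size=(frame_len, n_coeffs))

        def __call__(self, frames):
            return 1.0 / (1.0 + np.exp(-frames @ self.w))

    frames = frame_audio(np.zeros(16000))      # one second of (silent) speech
    model = TargetNetworkModel(frames.shape[1])
    first_face_pinching_data = model(frames)   # per-frame mouth coefficients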
Optionally, the apparatus for generating a three-dimensional avatar face animation further comprises: a fourth processing module 906 (not shown in the figure), configured to input the sample image frame and the voice signal into an initial network model to obtain a multi-frame predicted image, wherein the initial network model is used to train a correspondence between the voice signal and the person mouth shape of the sample image frame; a determining module 907 (not shown in the figure), configured to determine a plurality of loss functions using the multi-frame predicted image, wherein the plurality of loss functions includes: a first loss function for calculating the loss of the image layer of the multi-frame predicted image, a second loss function for calculating the loss of the perception layer of the multi-frame predicted image, a third loss function for calculating the loss of a first part of face key points of the multi-frame predicted image, and a fourth loss function for calculating the loss of a second part of face key points of the multi-frame predicted image; and an optimization module 908 (not shown in the figure), configured to optimize model parameters of the initial network model based on the plurality of loss functions to obtain the target network model.
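A minimal sketch of how the four losses could be combined during training is given below (PyTorch); the equal weighting, the source of the perceptual features, and the mouth/eye landmark index ranges (taken from the common 68-point convention) are assumptions not fixed by this embodiment.

    import torch
    import torch.nn.functional as F

    def total_loss(pred_img, true_img, pred_feat, true_feat,
                   pred_kpts, true_kpts, part1_idx, part2_idx):
        l_image = F.l1_loss(pred_img, true_img)        # first loss: image layer
        l_percep = F.mse_loss(pred_feat, true_feat)    # second loss: perception layer
        l_part1 = F.mse_loss(pred_kpts[:, part1_idx],  # third loss: first part of
                             true_kpts[:, part1_idx])  # the face key points
        l_part2 = F.mse_loss(pred_kpts[:, part2_idx],  # fourth loss: second part of
                             true_kpts[:, part2_idx])  # the face key points
        return l_image + l_percep + l_part1 + l_part2

    # Toy usage with random tensors and assumed mouth/eye landmark indices.
    img, feat, kpts = torch.rand(2, 3, 64, 64), torch.rand(2, 128), torch.rand(2, 68, 2)
    loss = total_loss(img, torch.rand_like(img), feat, torch.rand_like(feat),
                      kpts, torch.rand_like(kpts),
                      part1_idx=list(range(48, 68)), part2_idx=list(range(36, 48)))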
Optionally, the initial network model comprises: a video encoder, an audio encoder, a speech content space and a plurality of multi-layer perceptrons, the fourth processing module 906 (not shown in the figure) being further configured to: carrying out video coding processing on the sample image frame by using a video coder to obtain a first coding result, and carrying out audio coding processing on the voice signal by using an audio coder to obtain a second coding result; acquiring a third face pinching parameter, a fourth face pinching parameter and a sample image frame attribute parameter through the first multi-layer perceptron by using the first encoding result, wherein the third face pinching parameter is face pinching data associated with the person identity of the sample image frame, the fourth face pinching parameter is face pinching data associated with the person expression of the sample image frame, and the sample image frame attribute parameter comprises: texture parameters, pose parameters and light parameters of the sample image frame; inputting the first coding result into a voice content space through a second multilayer perceptron, inputting the second coding result into the voice content space, and outputting a synchronous result of the sample image frame and the voice signal; inputting the synchronization result into a third multilayer perceptron to obtain a fifth face pinching parameter, wherein the fifth face pinching parameter is face pinching data related to the object mouth shape; and generating a multi-frame prediction image based on the third face pinching parameter, the fourth face pinching parameter, the sample image frame attribute parameter and the fifth face pinching parameter.
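The wiring of this initial network model can be sketched roughly as follows (PyTorch); all layer sizes, the simple averaging used as the synchronization result, and the 32-dimensional mouth-shape output are illustrative assumptions rather than the actual architecture.

    import torch
    import torch.nn as nn

    class InitialNetworkModel(nn.Module):
        def __init__(self, img_dim=512, aud_dim=128, content_dim=64):
            super().__init__()
            self.video_encoder = nn.Linear(img_dim, 256)    # stand-in video encoder
            self.audio_encoder = nn.Linear(aud_dim, 256)    # stand-in audio encoder
            self.mlp1 = nn.Linear(256, 80 + 64 + 27)        # identity / expression / texture+pose+light
            self.mlp2 = nn.Linear(256, content_dim)         # video branch into the speech content space
            self.audio_to_content = nn.Linear(256, content_dim)
            self.mlp3 = nn.Linear(content_dim, 32)          # content space -> mouth-shape parameters

        def forward(self, image_frame, speech_signal):
            v = torch.relu(self.video_encoder(image_frame))    # first encoding result
            a = torch.relu(self.audio_encoder(speech_signal))  # second encoding result
            identity, expression, attrs = torch.split(self.mlp1(v), [80, 64, 27], dim=-1)
            content_v = self.mlp2(v)                 # first encoding result into the content space
            content_a = self.audio_to_content(a)     # second encoding result into the content space
            sync = 0.5 * (content_v + content_a)     # crude stand-in for the synchronization result
            mouth = self.mlp3(sync)                  # fifth face-pinching parameter (mouth shape)
            return identity, expression, attrs, mouth

    outputs = InitialNetworkModel()(torch.rand(4, 512), torch.rand(4, 128))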
Optionally, the third processing module 904 is further configured to: acquire a preset corresponding relation, wherein the preset corresponding relation is obtained by collecting in advance the correspondence between different expression keywords in the training corpus and the expressions of different parts of the face; extract the expression keywords from the text to be processed; and search the preset corresponding relation for the eye expression corresponding to the expression keywords to obtain the second face pinching data.
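For illustration, the keyword lookup in this step might look like the sketch below; the keyword table, the blendshape names and the naive substring matching are hypothetical placeholders for the preset corresponding relation collected in advance from the training corpus.

    # Hypothetical table built in advance from a training corpus; the values
    # are eye-region blendshape weights and the names are assumptions.
    PRESET_CORRESPONDING_RELATION = {
        "happy": {"eyeSquintLeft": 0.6, "eyeSquintRight": 0.6},
        "surprised": {"eyeWideLeft": 0.9, "eyeWideRight": 0.9},
    }

    def extract_expression_keywords(text):
        # Naive keyword spotting; any keyword extractor could be used here.
        return [k for k in PRESET_CORRESPONDING_RELATION if k in text.lower()]

    def get_second_face_pinching_data(text):
        # Search the eye expression corresponding to each extracted keyword
        # and merge the associated face pinching data.
        weights = {}
        for keyword in extract_expression_keywords(text):
            weights.update(PRESET_CORRESPONDING_RELATION[keyword])
        return weights

    print(get_second_face_pinching_data("I am so happy to see you"))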
Optionally, the generating module 905 is further configured to: adjust the control objects respectively corresponding to the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data to generate the target face animation, wherein the control object corresponding to the first face pinching data is a mouth animation of the three-dimensional virtual image, the control object corresponding to the second face pinching data is an eye expression animation of the three-dimensional virtual image, the control object corresponding to the sixth face pinching data is the identity of the three-dimensional virtual image, and the control object corresponding to the seventh face pinching data is the head animation naturalness of the three-dimensional virtual image.
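A minimal sketch of this driving step is shown below; it assumes each group of face pinching data is a dictionary of controller weights for its own control object, so the four groups can simply be merged into one animation frame. The controller names are hypothetical.

    def drive_target_face_animation(first, second, sixth, seventh):
        # Each group only touches its own control object, so a plain merge
        # is enough for this sketch: identity and head naturalness form the
        # base, the speech-driven mouth and keyword-driven eyes go on top.
        frame = {}
        frame.update(sixth)    # identity of the avatar (can be set in advance)
        frame.update(seventh)  # head animation naturalness controllers
        frame.update(first)    # mouth animation driven by the speech
        frame.update(second)   # eye expression driven by the expression keywords
        return frame

    frame = drive_target_face_animation(
        first={"jawOpen": 0.4}, second={"eyeSquintLeft": 0.6},
        sixth={"faceWidth": 0.5}, seventh={"headNodAmplitude": 0.2})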
Optionally, the first processing module 902 is further configured to: converting the morphemes contained in the text to be processed into phonemes to obtain a phoneme sequence; carrying out phoneme duration prediction and fundamental frequency prediction on the phoneme sequence to obtain a prediction result; and carrying out audio synthesis on the phoneme sequence and the prediction result and converting the phoneme sequence and the prediction result into a sound waveform to obtain the speech to be processed.
It should be noted here that the acquiring module 901, the first processing module 902, the second processing module 903, the third processing module 904, and the generating module 905 correspond to steps S21 to S25 in embodiment 1, and the five modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules as a part of the apparatus may operate in the computer terminal 10 provided in embodiment 1.
Alternatively, fig. 10 is a schematic structural diagram of another apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application, as shown in fig. 10, the apparatus includes:
a receiving module 1001, configured to receive a to-be-processed text from a client, where the to-be-processed text is used to drive an initial facial animation of a three-dimensional avatar to change;
the processing module 1002 is configured to perform text-to-speech processing on the text to be processed to obtain the speech to be processed, perform voice-driven animation processing on the speech to be processed to obtain first face pinching data, extract expression keywords from the text to be processed and obtain second face pinching data associated with the expression keywords, and generate a target facial animation of the three-dimensional avatar based on the first face pinching data and the second face pinching data;
and a feedback module 1003, configured to feed back the target facial animation to the client.
It should be noted here that the receiving module 1001, the processing module 1002 and the feedback module 1003 correspond to steps S61 to S63 in embodiment 2, and the three modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure of embodiment 2.
Alternatively, fig. 11 is a schematic structural diagram of another apparatus for generating a three-dimensional avatar face animation according to an embodiment of the present application, as shown in fig. 11, the apparatus includes:
an obtaining module 1101, configured to obtain a currently input text to be processed, where content recorded in the text to be processed includes: text information used for language recovery training, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change;
the first processing module 1102 is configured to perform word synthesis speech processing on a text to be processed to obtain a speech to be processed for language recovery training;
the second processing module 1103 is configured to perform voice-driven animation processing on the voice to be processed to obtain first face pinching data;
the third processing module 1104 is configured to extract expression keywords from the text to be processed, and acquire second face pinching data associated with the expression keywords;
a generating module 1105 configured to generate a target face animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
It should be noted here that the acquiring module 1101, the first processing module 1102, the second processing module 1103, the third processing module 1104 and the generating module 1105 correspond to steps S81 to S85 in embodiment 3, and the five modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 3.
In the embodiment of the application, a text to be processed capable of driving an initial facial animation of a three-dimensional virtual image to change is obtained, text synthesis voice processing is performed on the text to be processed to obtain voice to be processed, voice-driven animation processing is performed on the voice to be processed to obtain first face pinching data, expression keywords are extracted from the text to be processed, second face pinching data associated with the expression keywords are obtained, and finally a target facial animation of the three-dimensional virtual image is driven and generated based on the first face pinching data and the second face pinching data.
Example 5
The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In the present embodiment, the computer terminal described above may execute program code for the following steps in the method of generating a three-dimensional avatar face animation: acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain voice to be processed; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; and driving to generate a target face animation of the three-dimensional avatar based on the first face pinching data and the second face pinching data.
In this embodiment, the computer terminal described above may further execute program code for the following steps in the method of generating a three-dimensional avatar face animation: receiving a text to be processed from a client, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change; performing character synthesis voice processing on a text to be processed to obtain a voice to be processed, performing voice-driven animation processing on the voice to be processed to obtain first face pinching data, extracting expression keywords from the text to be processed, acquiring second face pinching data associated with the expression keywords, and generating a target face animation of a three-dimensional virtual image based on the first face pinching data and the second face pinching data; and feeding back the target facial animation to the client.
In this embodiment, the computer terminal may further execute program codes of the following steps in the method for generating a three-dimensional avatar face animation: acquiring a currently input text to be processed, wherein the content recorded in the text to be processed comprises: the text information is used for language recovery training, and the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain the voice to be processed for language recovery training; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; generating a target face animation of the three-dimensional avatar based on the first and second pinch face data.
Optionally, fig. 12 is a block diagram of another structure of a computer terminal according to an embodiment of the present application, and as shown in fig. 12, the computer terminal may include: one or more processors 1202 (only one of which is shown), a memory 1204, and a peripheral interface 1206. The memory may be connected to the processor 1202 and the peripheral interface 1206 through the memory controller, and the peripheral interface 1206 may also be used to connect to a display screen, an audio module, and a radio frequency module.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for generating a three-dimensional avatar face animation in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the above-mentioned method for generating a three-dimensional avatar face animation. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor, which may be connected to the computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain voice to be processed; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; and driving and generating a target face animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data.
Optionally, the processor may further execute the program code of the following steps: inputting the voice to be processed into a target network model to obtain first face pinching data, wherein the target network model is obtained by a plurality of groups of data through machine learning training, and each group of data in the plurality of groups of data comprises: sample image frames and voice signals extracted from the sample video.
Optionally, the processor may further execute the program code of the following steps: inputting the sample image frame and the voice signal into an initial network model to obtain a multi-frame prediction image, wherein the initial network model is used for training the corresponding relation between the voice signal and the figure mouth shape of the sample image frame; determining a plurality of loss functions using the multi-frame predicted image, wherein the plurality of loss functions comprises: the method comprises the steps of calculating the loss of an image layer of a multi-frame prediction image according to a first loss function, a second loss function, a third loss function and a fourth loss function, wherein the first loss function is used for calculating the loss of an image layer of the multi-frame prediction image, the second loss function is used for calculating the loss of a perception layer of the multi-frame prediction image, the third loss function is used for calculating the loss of a first part of human face key points of the multi-frame prediction image, and the fourth loss function is used for calculating the loss of a second part of human face key points of the multi-frame prediction image; and optimizing the model parameters of the initial network model based on a plurality of loss functions to obtain a target network model.
Optionally, the processor may further execute the program code of the following steps: carrying out video coding processing on the sample image frame by using a video coder to obtain a first coding result, and carrying out audio coding processing on the voice signal by using an audio coder to obtain a second coding result; acquiring a third face pinching parameter, a fourth face pinching parameter and a sample image frame attribute parameter through the first multi-layer perceptron by using the first encoding result, wherein the third face pinching parameter is face pinching data associated with the person identity of the sample image frame, the fourth face pinching parameter is face pinching data associated with the person expression of the sample image frame, and the sample image frame attribute parameter comprises: texture parameters, pose parameters and light parameters of the sample image frame; inputting the first coding result into a voice content space through a second multilayer perceptron, inputting the second coding result into the voice content space, and outputting a synchronization result of the sample image frame and the voice signal; inputting the synchronization result into a third multilayer perceptron to obtain a fifth face pinching parameter, wherein the fifth face pinching parameter is face pinching data associated with the object mouth shape; and generating a multi-frame prediction image based on the third face pinching parameter, the fourth face pinching parameter, the sample image frame attribute parameter and the fifth face pinching parameter.
Optionally, the processor may further execute the program code of the following steps: acquiring a preset corresponding relation, wherein the preset corresponding relation is used for collecting the corresponding relation between different expression keywords in the training corpus and different part expressions on the face in advance; extracting expression keywords from the text to be processed; and searching the eye expression corresponding to the expression keyword in the preset corresponding relation to obtain second face pinching data.
Optionally, the processor may further execute the program code of the following steps: adjusting the control objects respectively corresponding to the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data to generate the target face animation, wherein the control object corresponding to the first face pinching data is a mouth animation of the three-dimensional virtual image, the control object corresponding to the second face pinching data is an eye expression animation of the three-dimensional virtual image, the control object corresponding to the sixth face pinching data is the identity of the three-dimensional virtual image, and the control object corresponding to the seventh face pinching data is the head animation naturalness of the three-dimensional virtual image.
Optionally, the processor may further execute the program code of the following steps: converting the morphemes contained in the text to be processed into phonemes to obtain a phoneme sequence; performing phoneme duration prediction and fundamental frequency prediction on the phoneme sequence to obtain a prediction result; and carrying out audio synthesis on the phoneme sequence and the prediction result and converting the phoneme sequence and the prediction result into a sound waveform to obtain the speech to be processed.
By adopting the method for generating a three-dimensional avatar face animation, the text to be processed that can drive the initial facial animation of the three-dimensional virtual image to change is obtained; character synthesis voice processing is then performed on the text to be processed to obtain the voice to be processed; voice-driven animation processing is performed on the voice to be processed to obtain first face pinching data; expression keywords are extracted from the text to be processed, and second face pinching data associated with the expression keywords are obtained; and finally, the target facial animation of the three-dimensional virtual image is generated based on the first face pinching data and the second face pinching data. The purpose of generating a three-dimensional avatar face animation that meets interaction requirements based on a preset text is thereby achieved, the technical effect of reducing the generation cost of the three-dimensional avatar face animation is achieved, the technical problem of high generation cost of three-dimensional avatar face animation in the related art is solved, and the requirement of people for virtual interaction is met.
It can be understood by those skilled in the art that the structure shown in fig. 12 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 12 does not limit the structure of the electronic device; for example, the computer terminal may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 12, or have a different configuration from that shown in fig. 12.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, read-Only memories (ROMs), random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 6
Embodiments of the present application also provide a storage medium. Alternatively, in the present embodiment, the storage medium may be configured to store program codes executed by the method for generating a three-dimensional avatar face animation provided in the above embodiment.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain voice to be processed; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; and driving to generate a target face animation of the three-dimensional avatar based on the first face pinching data and the second face pinching data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting the voice to be processed into a target network model to obtain first face pinching data, wherein the target network model is obtained by multiple groups of data through machine learning training, and each group of data in the multiple groups of data comprises: sample image frames and voice signals extracted from the sample video.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting the sample image frame and the voice signal into an initial network model to obtain a multi-frame predicted image, wherein the initial network model is used for training the correspondence between the voice signal and the person mouth shape of the sample image frame; determining a plurality of loss functions using the multi-frame predicted image, wherein the plurality of loss functions comprises: a first loss function for calculating the loss of the image layer of the multi-frame predicted image, a second loss function for calculating the loss of the perception layer of the multi-frame predicted image, a third loss function for calculating the loss of a first part of face key points of the multi-frame predicted image, and a fourth loss function for calculating the loss of a second part of face key points of the multi-frame predicted image; and optimizing the model parameters of the initial network model based on the plurality of loss functions to obtain the target network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: carrying out video coding processing on the sample image frame by using a video coder to obtain a first coding result, and carrying out audio coding processing on the voice signal by using an audio coder to obtain a second coding result; acquiring a third face pinching parameter, a fourth face pinching parameter and a sample image frame attribute parameter through the first multi-layer perceptron by using the first encoding result, wherein the third face pinching parameter is face pinching data associated with the person identity of the sample image frame, the fourth face pinching parameter is face pinching data associated with the person expression of the sample image frame, and the sample image frame attribute parameter comprises: texture parameters, pose parameters and light parameters of the sample image frame; inputting the first coding result into a voice content space through a second multilayer perceptron, inputting the second coding result into the voice content space, and outputting a synchronization result of the sample image frame and the voice signal; inputting the synchronization result into a third multilayer perceptron to obtain a fifth face pinching parameter, wherein the fifth face pinching parameter is face pinching data related to the object mouth shape; and generating a multi-frame prediction image based on the third face pinching parameter, the fourth face pinching parameter, the sample image frame attribute parameter and the fifth face pinching parameter.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a preset corresponding relation, wherein the preset corresponding relation is used for collecting the corresponding relation between different expression keywords in the training corpus and different part expressions on the face in advance; extracting expression keywords from the text to be processed; and searching the eye expression corresponding to the expression keyword in the preset corresponding relation to obtain second face pinching data.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: adjusting the control objects respectively corresponding to the first face pinching data, the second face pinching data, the sixth face pinching data and the seventh face pinching data to generate the target face animation, wherein the control object corresponding to the first face pinching data is a mouth animation of the three-dimensional virtual image, the control object corresponding to the second face pinching data is an eye expression animation of the three-dimensional virtual image, the control object corresponding to the sixth face pinching data is the identity of the three-dimensional virtual image, and the control object corresponding to the seventh face pinching data is the head animation naturalness of the three-dimensional virtual image.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: converting the morphemes contained in the text to be processed into phonemes to obtain a phoneme sequence; performing phoneme duration prediction and fundamental frequency prediction on the phoneme sequence to obtain a prediction result; and carrying out audio synthesis on the phoneme sequence and the prediction result and converting the phoneme sequence and the prediction result into a sound waveform to obtain the speech to be processed.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving a text to be processed from a client, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change; performing character synthesis voice processing on a text to be processed to obtain a voice to be processed, performing voice-driven animation processing on the voice to be processed to obtain first face pinching data, extracting expression keywords from the text to be processed, acquiring second face pinching data associated with the expression keywords, and generating a target face animation of a three-dimensional virtual image based on the first face pinching data and the second face pinching data; and feeding back the target facial animation to the client.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a currently input text to be processed, wherein the content recorded in the text to be processed comprises: text information used for language recovery training, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain the voice to be processed for language recovery training; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; generating a target face animation of the three-dimensional avatar based on the first and second pinch face data.
In the embodiment of the application, the text to be processed capable of driving the initial facial animation of the three-dimensional virtual image to change is obtained; character synthesis voice processing is then performed on the text to be processed to obtain the voice to be processed; voice-driven animation processing is performed on the voice to be processed to obtain first face pinching data; expression keywords are extracted from the text to be processed, and second face pinching data associated with the expression keywords are obtained; and finally, the target facial animation of the three-dimensional virtual image is driven and generated based on the first face pinching data and the second face pinching data. The purpose of generating a three-dimensional avatar face animation that meets interaction requirements based on a preset text is thereby achieved, the technical effect of reducing the generation cost of the three-dimensional avatar face animation is achieved, the technical problem of high generation cost of three-dimensional avatar face animation in the related art is solved, and the requirement of people for virtual interaction is met.
Example 7
Embodiments of the present application also provide a system for generating a three-dimensional avatar face animation, the system comprising: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change; performing character synthesis voice processing on the text to be processed to obtain voice to be processed; carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data; extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords; and driving and generating a target face animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data.
In the embodiment of the application, the text to be processed capable of driving the initial facial animation of the three-dimensional virtual image to change is obtained; character synthesis voice processing is then performed on the text to be processed to obtain the voice to be processed; voice-driven animation processing is performed on the voice to be processed to obtain first face pinching data; expression keywords are extracted from the text to be processed, and second face pinching data associated with the expression keywords are obtained; and finally, the target facial animation of the three-dimensional virtual image is driven and generated based on the first face pinching data and the second face pinching data. The purpose of generating a three-dimensional avatar face animation that meets interaction requirements based on a preset text is thereby achieved, the technical effect of reducing the generation cost of the three-dimensional avatar face animation is achieved, the technical problem of high generation cost of three-dimensional avatar face animation in the related art is solved, and the requirement of people for virtual interaction is met.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (12)

1. A method of generating a three-dimensional avatar face animation, comprising:
acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change;
performing character synthesis voice processing on the text to be processed to obtain voice to be processed;
carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords;
driving generation of a target face animation of the three-dimensional avatar based on the first pinch face data and the second pinch face data.
2. The method of claim 1, wherein performing speech-driven animation processing on the speech to be processed to obtain the first face pinching data comprises:
inputting the voice to be processed into a target network model to obtain the first face pinching data, wherein the target network model is obtained by performing machine learning training on multiple groups of data, and each group of data in the multiple groups of data comprises: sample image frames and voice signals extracted from the sample video.
3. The method of claim 2, further comprising:
inputting the sample image frame and the voice signal into an initial network model to obtain a multi-frame prediction image, wherein the initial network model is used for training the corresponding relation between the voice signal and the figure mouth shape of the sample image frame;
determining a plurality of loss functions using the multi-frame prediction image, wherein the plurality of loss functions comprises: a first loss function for calculating the loss of an image layer of the multi-frame prediction image, a second loss function for calculating the loss of a perception layer of the multi-frame prediction image, a third loss function for calculating the loss of a first part of face key points of the multi-frame prediction image, and a fourth loss function for calculating the loss of a second part of face key points of the multi-frame prediction image;
and optimizing the model parameters of the initial network model based on the loss functions to obtain the target network model.
4. The method of claim 3, wherein the initial network model comprises: a video encoder, an audio encoder, a speech content space and a plurality of multi-layer perceptrons, and inputting the sample image frame and the voice signal into the initial network model to obtain the multi-frame prediction image comprises:
carrying out video coding processing on the sample image frame by using the video coder to obtain a first coding result, and carrying out audio coding processing on the voice signal by using the audio coder to obtain a second coding result;
obtaining, by a first multi-layered perceptron, a third pinching face parameter, a fourth pinching face parameter, and a sample image frame attribute parameter using the first encoding result, wherein the third pinching face parameter is pinching face data associated with a person identity of the sample image frame, the fourth pinching face parameter is pinching face data associated with a person expression of the sample image frame, and the sample image frame attribute parameter comprises: texture parameters, pose parameters and light parameters of the sample image frame;
inputting the first coding result into the voice content space through a second multilayer perceptron, inputting the second coding result into the voice content space, and outputting a synchronization result of the sample image frame and the voice signal;
inputting the synchronization result into a third multilayer perceptron to obtain a fifth face pinching parameter, wherein the fifth face pinching parameter is face pinching data associated with the human mouth shape;
generating the multi-frame predicted image based on the third pinching parameter, the fourth pinching parameter, the sample image frame attribute parameter, and the fifth pinching parameter.
5. The method of claim 1, wherein extracting the expression keywords from the text to be processed to obtain the second face-pinching data associated with the expression keywords comprises:
acquiring a preset corresponding relation, wherein the preset corresponding relation is used for collecting the corresponding relation between different expression keywords in the training corpus and different part expressions on the face in advance;
extracting the expression key words from the text to be processed;
and searching the eye expression corresponding to the expression keyword in the preset corresponding relation to obtain the second face pinching data.
6. The method of claim 1, wherein driving generation of the target face animation of the three-dimensional avatar based on the first face pinching data and the second face pinching data comprises:
adjusting the control objects respectively corresponding to the first face pinching data, the second face pinching data, sixth face pinching data and seventh face pinching data to generate the target face animation, wherein the control object corresponding to the first face pinching data is a mouth animation of the three-dimensional avatar, the control object corresponding to the second face pinching data is an eye expression animation of the three-dimensional avatar, the control object corresponding to the sixth face pinching data is the identity of the three-dimensional avatar, and the control object corresponding to the seventh face pinching data is the head animation naturalness of the three-dimensional avatar.
7. The method of claim 1, wherein performing word synthesis speech processing on the text to be processed to obtain the speech to be processed comprises:
converting the morphemes contained in the text to be processed into phonemes to obtain a phoneme sequence;
performing phoneme duration prediction and fundamental frequency prediction on the phoneme sequence to obtain a prediction result;
and carrying out audio synthesis on the phoneme sequence and the prediction result and converting the phoneme sequence and the prediction result into a sound waveform to obtain the speech to be processed.
8. A method of generating a three-dimensional avatar face animation, comprising:
receiving a text to be processed from a client, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change;
performing character synthesis voice processing on the text to be processed to obtain voice to be processed, performing voice-driven animation processing on the voice to be processed to obtain first face pinching data, extracting expression keywords from the text to be processed, acquiring second face pinching data associated with the expression keywords, and generating target facial animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data;
and feeding back the target facial animation to the client.
9. A method of generating a three-dimensional avatar face animation, comprising:
acquiring a currently input text to be processed, wherein the content recorded in the text to be processed comprises: text information used for language recovery training, wherein the text to be processed is used for driving the initial facial animation of the three-dimensional virtual image to change;
performing character synthesis voice processing on the text to be processed to obtain voice to be processed for language recovery training;
carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords;
generating a target facial animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
10. An apparatus for generating a facial animation of a three-dimensional avatar, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a currently input text to be processed, and the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change;
the first processing module is used for carrying out character synthesis voice processing on the text to be processed to obtain voice to be processed;
the second processing module is used for carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
the third processing module is used for extracting expression keywords from the text to be processed and acquiring second face pinching data associated with the expression keywords;
a generating module for generating a target face animation of the three-dimensional avatar based on the first face-pinching data and the second face-pinching data.
11. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program is executed, a device on which the storage medium is located is controlled to perform the method of generating a three-dimensional avatar face animation according to any one of claims 1 to 9.
12. A system for generating a facial animation of a three-dimensional avatar, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
acquiring a currently input text to be processed, wherein the text to be processed is used for driving an initial facial animation of a three-dimensional virtual image to change;
performing character synthesis voice processing on the text to be processed to obtain voice to be processed;
carrying out voice-driven animation processing on the voice to be processed to obtain first face pinching data;
extracting expression keywords from the text to be processed, and acquiring second face pinching data associated with the expression keywords;
and driving and generating a target face animation of the three-dimensional virtual image based on the first face pinching data and the second face pinching data.
CN202211203194.6A 2022-09-29 2022-09-29 Method, device and system for generating three-dimensional virtual image facial animation Pending CN115409923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211203194.6A CN115409923A (en) 2022-09-29 2022-09-29 Method, device and system for generating three-dimensional virtual image facial animation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211203194.6A CN115409923A (en) 2022-09-29 2022-09-29 Method, device and system for generating three-dimensional virtual image facial animation

Publications (1)

Publication Number Publication Date
CN115409923A true CN115409923A (en) 2022-11-29

Family

ID=84168385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211203194.6A Pending CN115409923A (en) 2022-09-29 2022-09-29 Method, device and system for generating three-dimensional virtual image facial animation

Country Status (1)

Country Link
CN (1) CN115409923A (en)

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
Kucherenko et al. Analyzing input and output representations for speech-driven gesture generation
WO2022116977A1 (en) Action driving method and apparatus for target object, and device, storage medium, and computer program product
CN106653052B (en) Virtual human face animation generation method and device
CN111145322B (en) Method, apparatus, and computer-readable storage medium for driving avatar
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
CN109859736B (en) Speech synthesis method and system
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN110931042B (en) Simultaneous interpretation method and device, electronic equipment and storage medium
KR102098734B1 (en) Method, apparatus and terminal for providing sign language video reflecting appearance of conversation partner
WO2022106654A2 (en) Methods and systems for video translation
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
WO2023284435A1 (en) Method and apparatus for generating animation
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN115953521B (en) Remote digital person rendering method, device and system
Fan et al. Joint audio-text model for expressive speech-driven 3d facial animation
CN114882862A (en) Voice processing method and related equipment
Podder et al. Design of a sign language transformer to enable the participation of persons with disabilities in remote healthcare systems for ensuring universal healthcare coverage
Rastgoo et al. A survey on recent advances in Sign Language Production
CN117370534A (en) Virtual reality-oriented multisource fusion emotion support dialogue method
CN117292022A (en) Video generation method and device based on virtual object and electronic equipment
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination