CN118052919A - Face animation generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN118052919A
Application number: CN202410148891.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: animation, features, face animation, local, face
Inventors: 姜悦人, 王斌, 柴金祥
Applicants (original and current assignees): Shanghai Movu Technology Co Ltd; Mofa Shanghai Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Priority: CN202410148891.9A

Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
              • G10L21/10: Transforming into visible information
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00: Pattern recognition
            • G06F18/20: Analysing
              • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T13/00: Animation
            • G06T13/20: 3D [Three Dimensional] animation
              • G06T13/205: 3D [Three Dimensional] animation driven by audio data
              • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides a face animation generation method and device, an electronic device, and a storage medium. The method first acquires user speech, then extracts semantic features and acoustic features of the user speech, and finally generates a face animation based on the semantic features and the acoustic features. Extracting semantic features and acoustic features from the user speech decouples the speech features, so that even when speech from different users is input, the content and style of the generated face animation remain essentially consistent as long as the spoken content is the same and the language is similar, and the generated face animation is not disturbed by the speaker's timbre. The method requires no pre-built action library, which greatly reduces the cost of generating face animation. Because both semantic features and acoustic features are used, the content of the user speech and the emotion while speaking can be captured more accurately, which further improves the overall effect of the generated face animation, yields more accurate lip shapes, and increases richness.

Description

Face animation generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for generating a facial animation, an electronic device, and a storage medium.
Background
Existing face animation generation typically determines a face animation that matches a given speech input.
One scheme determines the face animation by matching against an action library: syllable features of the speech are used to retrieve the corresponding animation from the library. This scheme is constrained by the scale of the action library and is prone to defects such as poor overall effect, inaccurate lip shapes, and low richness.
Another scheme processes speech features with a deep learning model to predict the face animation. Although this scheme does not require building an action library in advance, the speech features it uses include accent, intonation, and the like; when these features differ, the predicted face animation is disturbed by the speaker's timbre, and when the speaker's timbre changes the face animation becomes unstable. The animation generation effect is therefore poor, which hinders wide application of face animation.
Disclosure of Invention
The invention provides a face animation generation method, a face animation generation device, electronic equipment and a storage medium, which are used for solving the defects in the prior art.
The invention provides a face animation generation method, which comprises the following steps:
Acquiring user voice;
Extracting semantic features and acoustic features of the user voice;
And generating a face animation based on the semantic features and the acoustic features.
According to the face animation generation method provided by the invention, the face animation comprises a first local face animation and a second local face animation;
the generating a face animation based on the semantic features and the acoustic features includes:
Based on the semantic features and the acoustic features, respectively predicting the first partial face animation and the second partial face animation;
And generating the face animation based on the first local face animation and the second local face animation.
According to the face animation generation method provided by the invention, the method for respectively predicting the first local face animation and the second local face animation based on the semantic features and the acoustic features comprises the following steps:
predicting the first partial face animation based on style features, the semantic features, and the acoustic features;
and/or,
Predicting the second partial facial animation based on style features, the semantic features, and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
According to the face animation generation method provided by the invention, the first local face animation is predicted based on the style characteristics, the semantic characteristics and the acoustic characteristics, and the method comprises the following steps:
inputting the style features, the semantic features and the acoustic features into a first local prediction model to obtain the first local facial animation output by the first local prediction model;
The first local prediction model is obtained based on training a first local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
According to the face animation generation method provided by the invention, the second local face animation is predicted based on the style characteristics, the semantic characteristics and the acoustic characteristics, and the face animation generation method comprises the following steps:
Inputting the style features, the semantic features and the acoustic features into a second local prediction model to obtain the second local facial animation output by the second local prediction model;
The second local prediction model is obtained based on training a second local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
According to the face animation generation method provided by the invention, the first local face animation is a lip-cheek animation and/or the second local face animation is an eyebrow animation.
According to the method for generating the facial animation provided by the invention, the facial animation is generated based on the semantic features and the acoustic features, and the method comprises the following steps:
Generating the face animation based on the style features, the semantic features and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
According to the method for generating the facial animation provided by the invention, the facial animation is generated based on the semantic features and the acoustic features, and the method comprises the following steps:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
and generating a third local face animation of the face animation based on the motion characteristics.
According to the face animation generation method provided by the invention, the third local face animation of the face animation is generated based on the motion characteristics, and the method comprises the following steps:
generating the third local face animation based on the style characteristics and the motion characteristics;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
According to the face animation generation method provided by the invention, the extraction of the motion characteristics in the body motion information comprises the following steps:
Acquiring lens position information;
Inputting the lens position information and the body motion information into a motion feature extraction model to obtain the motion features output by the motion feature extraction model;
the motion feature extraction model is obtained by training based on sample lens positions with motion feature labels and sample motion information.
According to the face animation generation method provided by the invention, the third local face animation is an eye-gaze animation.
According to the face animation generation method provided by the invention, the semantic features and acoustic features of the user voice are extracted, and the method comprises the following steps:
inputting the user voice into a semantic feature extraction model and an acoustic feature extraction model respectively to obtain the semantic features output by the semantic feature extraction model and the acoustic features output by the acoustic feature extraction model;
The semantic feature extraction model is obtained based on sample voice training with semantic feature labels, and the acoustic feature extraction model is obtained based on sample voice training with acoustic feature labels.
According to the face animation generation method provided by the invention, after the face animation is generated based on the semantic features and the acoustic features, the method further comprises:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
Generating a third local face animation based on the motion characteristics;
And optimizing the facial animation based on the third local facial animation.
The invention also provides a face animation generation method, which comprises the following steps:
Acquiring user voice;
Extracting voice characteristics of the user voice;
And respectively predicting a first local face animation and a second local face animation based on the voice characteristics, and generating the face animation based on the first local face animation and the second local face animation.
The invention also provides a face animation generation device, comprising:
The first voice acquisition module is used for acquiring user voices;
the first feature extraction module is used for extracting semantic features and acoustic features of the user voice;
and the first animation generation module is used for generating a face animation based on the semantic features and the acoustic features.
The invention also provides a face animation generation device, comprising:
The second voice acquisition module is used for acquiring the voice of the user;
The second feature extraction module is used for extracting the voice features of the user voice;
And the second animation generation module is used for respectively predicting a first local face animation and a second local face animation based on the voice characteristics and generating the face animation based on the first local face animation and the second local face animation.
The invention also provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the face animation generation method described above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a face animation generation method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a face animation generation method as described in any of the above.
In one embodiment of the face animation generation method and device, the electronic device, and the storage medium, user speech is acquired first; semantic features and acoustic features of the user speech are then extracted; and finally a face animation is generated based on the semantic features and the acoustic features. By extracting semantic features and acoustic features from the user speech, the speech features can be decoupled, so that even when speech from different users is input, the content and style of the generated face animation remain essentially consistent as long as the spoken content is the same and the language is similar, and the generated face animation is not disturbed by the speaker's timbre. In addition, the method does not require building an action library in advance, which greatly reduces the cost of generating face animation. Furthermore, because both semantic features and acoustic features are used, the content of the user speech and the emotion while speaking can be captured more accurately, which further improves the overall effect of the generated face animation, yields more accurate lip shapes, and increases richness.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below; for those skilled in the art, other drawings can also be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a face animation generation method provided by the invention;
FIG. 2 is a second flow chart of the face animation generation method according to the present invention;
FIG. 3 is a schematic diagram of a face animation generating device according to the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention;
FIG. 5 is a schematic diagram of a face animation generating device according to the present invention;
FIG. 6 is a schematic diagram of the physical structure of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description and claims, the terms "first", "second", and the like may explicitly or implicitly include one or more of the described features. In the description of the invention, unless otherwise indicated, "a plurality" means two or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Existing schemes for determining face animation have various defects: the action-library matching scheme suffers from poor overall effect, inaccurate lip shapes, and low richness, while the deep-learning scheme produces face animation that is variable and unstable because of interference from the speaker's timbre and may lose high-frequency details in the speech features, so the generated face animation is not vivid. To overcome these defects, an embodiment of the present invention provides a face animation generation method that can generate face animation free from interference by the speaker's timbre, face animation whose local regions are more coordinated, and face animation whose local regions are vivid and lively; it can also generate face animation that is simultaneously free from speaker-timbre interference and depicts local regions that are more coordinated, vivid, and lively.
Fig. 1 is a flow chart of a face animation generation method provided in an embodiment of the present invention, as shown in fig. 1, the method includes:
S11, acquiring user voice;
s12, extracting semantic features and acoustic features of the user voice;
and S13, generating a face animation based on the semantic features and the acoustic features.
Specifically, in the face animation generation method provided in the embodiment of the present invention, the execution subject is a face animation generation device, and the device may be configured in a terminal device such as a computer, a mobile phone, or a cloud terminal, which is not limited herein.
Firstly, step S11 is executed to obtain user voice, where the user voice may be acquired through a sound pickup device and transmitted to a facial animation generating device, or may be stored in advance in the facial animation generating device or in a third party device. The sound pickup apparatus refers to an apparatus for collecting and transmitting sound, and may be, for example, a microphone, a voice acquisition card, an intelligent voice acquisition unit, or the like. The third party device may be a voice database and may be connected to the facial animation generating means. When the face animation generation device has a face animation generation requirement, the user voice can be acquired by accessing a voice database.
Step S12 is then performed to extract speech features of the user' S speech, which may include semantic features as well as acoustic features.
The semantic features are used to represent the content information of the user speech, and can be extracted in the following ways: 1) Deep-learning-based feature extraction: models such as deep neural networks (Deep Neural Networks, DNN) and recurrent neural networks (Recurrent Neural Network, RNN) may be used to extract the semantic features from the user speech. These models can automatically learn high-level feature representations of the user speech and offer good robustness and recognition performance. 2) Speech-recognition-based feature extraction: speech recognition technology converts the user speech into text, and semantic analysis is then performed on the text to obtain the semantic features of the user speech.
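As a non-authoritative illustration of route 1), the sketch below uses a pretrained wav2vec 2.0 encoder from the transformers library as a stand-in semantic feature extractor; the patent does not prescribe any particular network, so the model name, sampling rate, and 768-dimensional output are assumptions made only for this example.

```python
# Illustrative sketch only: a pretrained speech encoder stands in for the
# semantic feature extraction model; the concrete model choice is assumed.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def extract_semantic_features(wav_path: str) -> torch.Tensor:
    """Return per-frame, content-oriented speech features of shape (frames, 768)."""
    speech, sr = librosa.load(wav_path, sr=16000)      # the encoder expects 16 kHz audio
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_values).last_hidden_state   # (1, frames, 768)
    return hidden.squeeze(0)
```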
Acoustic features include a series of parametric features related to the spectrum, amplitude, time, etc. of the user's speech, such as pitch, intensity, duration, formant frequency, bandwidth, and energy. The acoustic features can reflect changes in the quality, tone, and intensity of the user's speech.
The acoustic features may be extracted by, for example: 1) Short-time Fourier transform (Short-Time Fourier Transform, STFT): the user speech is first divided into short analysis windows, and a Fourier transform is applied to the signal in each window to obtain its spectral information. By analysing and processing the spectral information of the different windows, the acoustic features of the user speech can be extracted. 2) Mel-frequency cepstral coefficients (Mel-frequency Cepstral Coefficients, MFCC): the spectrum of the user speech is first mapped to the mel frequency scale and a discrete cosine transform (Discrete Cosine Transform, DCT) is applied to obtain cepstral coefficients, which represent the acoustic features of the user speech. 3) Linear predictive coding (Linear Predictive Coding, LPC): linear prediction of the user speech yields a set of prediction coefficients that represent the acoustic features of the user speech. 4) Perceptual linear prediction (Perceptual Linear Prediction, PLP): the user speech is pre-emphasised, framed, and windowed, and the processed signal is filtered by a perceptual filter bank to obtain a set of perceptual linear prediction coefficients that reflect the perceptual acoustic characteristics of the user speech. 5) Wavelet-transform-based methods. 6) Neural-network-based methods, and so on.
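By way of illustration, the sketch below computes a few of the acoustic descriptors named above (MFCCs, frame energy, and a pitch track) with librosa; the library, frame parameters, and feature layout are assumptions for the example, not requirements of the method.

```python
# Illustrative sketch: MFCCs, frame energy, and pitch extracted per frame with librosa.
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a (frames, n_mfcc + 2) matrix: MFCCs, frame energy, and pitch."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    energy = librosa.feature.rms(y=y)                        # (1, frames) intensity proxy
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # (frames,) pitch estimate
    frames = min(mfcc.shape[1], energy.shape[1], f0.shape[0])
    feats = np.vstack([mfcc[:, :frames], energy[:, :frames], f0[None, :frames]])
    return feats.T
```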
Here, both the semantic features and the acoustic features may be represented in the form of vectors.
And finally, executing step S13, and generating the face animation of the virtual digital person by utilizing the semantic features and the acoustic features. In this step, the semantic features and the acoustic features may be input to an animation generation model, which may generate and output a face animation of the virtual digital person through the semantic features and the acoustic features, so as to show details such as emotion, mouth shape, etc. in the voice of the user. The animation generation model can be obtained by training an initial generation model based on a human face animation sample, corresponding semantic features and acoustic features. The face animation may be a 3D face animation of a virtual digital person.
When training the initial generation model, semantic features and acoustic features corresponding to the face animation sample can be input into the initial generation model to obtain a generation result output by the initial generation model, then a loss value is calculated according to the generation result and the face animation sample, and finally the structural parameters of the initial generation model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times to obtain the animation generation model.
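The iterative training procedure described above can be sketched as a standard supervised loop; the sketch assumes PyTorch, a mean-squared-error loss over animation coefficients, and an Adam optimizer, none of which are fixed by the patent.

```python
import torch
from torch import nn

def train_animation_model(model: nn.Module, loader, epochs: int = 100, lr: float = 1e-4,
                          tol: float = 1e-4):
    """Repeat the input/loss/update cycle until the loss converges or the
    preset number of iterations is reached."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                         # assumed reconstruction loss
    for epoch in range(epochs):
        for semantic, acoustic, target_anim in loader:
            pred_anim = model(semantic, acoustic)    # generation result
            loss = criterion(pred_anim, target_anim)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                         # update the model's parameters
        if loss.item() < tol:                        # crude convergence check (assumed)
            break
    return model
```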
Thereafter, the generated face animation of the virtual digital person may be rendered for display or recording. Specialized computer graphics techniques, such as ray tracing or path tracing, may be used to improve the quality and realism of the rendering.
The embodiment of the invention provides a face animation generation method: first, user speech is acquired; then semantic features and acoustic features of the user speech are extracted; and finally a face animation is generated based on the semantic features and the acoustic features. By extracting semantic and acoustic features from the user speech, the speech features can be decoupled, so that even when speech from different users is input, the content and style of the generated face animation remain essentially consistent as long as the spoken content is the same and the language is similar, and the generated face animation is not disturbed by the speaker's timbre.
Moreover, generating the face animation through these steps requires no pre-built action library, which greatly reduces the cost of generating face animation.
In addition, because both semantic and acoustic features are used, the content of the user speech and the emotion while speaking can be captured more accurately, which further improves the overall effect of the generated face animation, yields more accurate lip shapes, and increases richness.
On the basis of the above embodiment, in the face animation generation method provided in the embodiment of the present invention, the face animation includes a first partial face animation and a second partial face animation;
the generating a face animation based on the semantic features and the acoustic features includes:
Based on the semantic features and the acoustic features, respectively predicting the first partial face animation and the second partial face animation;
And generating the face animation based on the first local face animation and the second local face animation.
Specifically, when the semantic features and the acoustic features are used to generate the face animation, different partial face animations can be predicted separately, because the motion patterns of different facial parts differ greatly. Here, the first partial face animation and the second partial face animation may each be predicted separately. The first partial facial animation and the second partial facial animation may be the animations of any two facial parts such as the lip-cheek region, the eyebrows, the nose, the ears, or the forehead.
The first partial facial animation may be predicted using the semantic features and the acoustic features. Here, the first partial facial animation of the known facial animation and the corresponding semantic features and acoustic features thereof can be used as samples, the mapping relation between the semantic features and the acoustic features and the first partial facial animation of the known facial animation can be pre-constructed, and the mapping relation is used to combine the semantic features and the acoustic features of the user voice, so that the first partial facial animation corresponding to the semantic features and the acoustic features can be obtained. In addition, a first local prediction model can be obtained through training by using a sample, and a first local face animation output by the first local prediction model can be obtained through inputting semantic features and acoustic features into the first local prediction model. The first local face animation can be used for describing first local information of the face of the user.
Likewise, a second partial face animation may also be predicted using semantic features as well as acoustic features. Here, the second partial facial animation of the known facial animation and the corresponding semantic features and acoustic features thereof can be used as samples, the mapping relationship between the semantic features and the acoustic features and the second partial facial animation of the known facial animation can be pre-constructed, and the second partial facial animation corresponding to the semantic features and the acoustic features can be obtained by combining the semantic features and the acoustic features of the user voice by using the mapping relationship. In addition, a second local prediction model can be obtained through training by using a sample, and a second local face animation output by the second local prediction model can be obtained through inputting semantic features and acoustic features into the second local prediction model. The second local face animation may be configured to depict second local information of the face of the user.
Finally, the face animation can be obtained by compositing the first local face animation, the second local face animation, and the global face animation. It can be understood that the global face animation may be generated by the animation generation model described above, by conventional means, or directly from the speech features of the user speech, which is not specifically limited here. The resulting face animation can thus contain local facial information that clearly and accurately matches the user speech.
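As a rough illustration of the compositing step, the sketch below assumes the global and local animations are expressed as per-frame blendshape coefficient matrices and that index sets for each facial region are known; both assumptions are made only for the example.

```python
import numpy as np

def compose_face_animation(global_anim, first_local, second_local, first_idx, second_idx):
    """Overwrite the globally predicted coefficients with the dedicated local predictions.

    All animations are (frames, n_coefficients); *_idx hold the coefficient indices
    of the first (e.g. lip-cheek) and second (e.g. eyebrow) facial regions.
    """
    face_anim = np.array(global_anim, copy=True)
    face_anim[:, first_idx] = first_local[:, first_idx]
    face_anim[:, second_idx] = second_local[:, second_idx]
    return face_anim
```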
In the embodiment of the invention, the first partial face animation and the second partial face animation are respectively predicted, the decoupling of the face parts is realized, and the finally synthesized face animation can restore the independent motion mode of the animation of each part of the face, so that the face animation is more coordinated.
Because the motion patterns of the lips and the eyebrows differ greatly, prior-art face animation generation performs particularly poorly on parts such as the lip-cheek region and the eyebrows, and these parts strongly affect the user's interaction experience.
To solve the above problems, on the basis of the above embodiments, in the face animation generation method provided in the embodiment of the present invention, the first partial face animation is a lip-cheek animation and/or the second partial face animation is an eyebrow animation.
Specifically, the lip-cheek animation may be used as the first partial face animation and the eyebrow animation as the second partial face animation, each predicted from the semantic features and the acoustic features.
The obtained lip-cheek animation may include lip information and cheek information. The lip information may include the lip shape and the mouth-corner state: the lip shape may cover structural characteristics such as thickness, length, width, lip peaks, and lip valleys, while the mouth-corner state describes the degree of opening and closing of the upper and lower lips, i.e. the shape of the mouth corners, which can express different emotions and expressions. The cheek information may include the tension of the cheek muscles, whose expression can reflect the emotional state.
The obtained eyebrow animation may include eyebrow shape information such as the eyebrow shape, thickness, and length; the eyebrow shape may include level eyebrows, raised eyebrows, lowered eyebrows, and the like.
The resulting face animation can therefore contain local facial information, namely lip information, cheek information, and eyebrow information, that clearly and accurately matches the user speech.
On the basis of the foregoing embodiment, the face animation generation method provided in the embodiment of the present invention predicts the first partial face animation and the second partial face animation based on the semantic features and the acoustic features, respectively, including:
predicting the first partial face animation based on style features, the semantic features, and the acoustic features;
and/or,
Predicting the second partial facial animation based on style features, the semantic features, and the acoustic features;
The style characteristics are stored in advance or are obtained by encoding identification information of the animation style selected by the user.
Specifically, when the first local face animation and the second local face animation are predicted separately, style features can be introduced, and the first local face animation and/or the second local face animation are predicted from the style features, the semantic features, and the acoustic features, so as to control the style and expressiveness of the finally generated face animation.
The style characteristics can be obtained by pre-storing, and the animation style of the generated facial animation is fixed. In addition, the style characteristics may be obtained by encoding identification information (Identity, ID) of the animation style selected by the user, and the generated face animation is determined by the animation style selected by the user. For the face animation driven by the same section of user voice, the user can switch between different animation styles according to the animation style selected by the user. Here, animation style may include gentle, robust, exciting, vivid, and the like.
Here, the identification information of the animation style selected by the user may be encoded by the style identification encoding model. The identification information of the animation style selected by the user is input into a style identification coding model, and the style identification coding model codes the identification information of the animation style selected by the user to obtain and output style characteristics.
The style identification coding model can be obtained by training the initial style identification coding model through sample style identification information with style characteristic labels.
When the initial style identification coding model is trained, sample style identification information can be input into the initial style identification coding model to obtain a coding result output by the initial style identification coding model, then a loss value is calculated according to the coding result and a style characteristic label, and finally the structural parameters of the initial style identification coding model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times, and obtaining the style identification coding model.
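A minimal sketch of such a style identification encoding model, assuming it is a learned embedding table keyed by the style ID; the style list and embedding size are illustrative assumptions.

```python
import torch
from torch import nn

ANIMATION_STYLES = ["gentle", "robust", "exciting", "vivid"]   # example style IDs

class StyleEncoder(nn.Module):
    """Maps the identification information of a selected animation style
    to a dense style feature vector."""
    def __init__(self, num_styles: int = len(ANIMATION_STYLES), dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(num_styles, dim)

    def forward(self, style_id: torch.Tensor) -> torch.Tensor:
        return self.embedding(style_id)              # (batch, dim) style features

# Usage: encode the animation style selected by the user.
encoder = StyleEncoder()
style_feature = encoder(torch.tensor([ANIMATION_STYLES.index("vivid")]))
```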
When the style features, the semantic features and the acoustic features are utilized to predict the first local facial animation, the first local facial animation of the known facial animation and the corresponding style features, semantic features and acoustic features thereof can be utilized as samples, the mapping relation between the style features, the semantic features, the acoustic features and the first local facial animation of the known facial animation is constructed in advance, and the mapping relation is utilized to combine the semantic features and the acoustic features of the user voice and the style features, so that the first local facial animation corresponding to the style features, the semantic features and the acoustic features can be obtained.
When the style features, the semantic features and the acoustic features are utilized to predict the second local facial animation, the second local facial animation of the known facial animation and the corresponding style features, semantic features and acoustic features thereof can be utilized as samples, the mapping relation between the style features, the semantic features, the acoustic features and the second local facial animation of the known facial animation is constructed in advance, and the second local facial animation corresponding to the style features, the semantic features and the acoustic features can be obtained by utilizing the mapping relation and combining the semantic features and the acoustic features of the user voice.
In the embodiment of the invention, on the basis of decoupling of the face part, the face animation can have a unique style by introducing the style characteristic. If the style characteristics are obtained by encoding the identification information of the animation style selected by the user, the facial animation can further meet the requirements of the user on the animation style.
On the basis of the foregoing embodiment, the face animation generation method provided in the embodiment of the present invention predicts the first local face animation based on the style feature, the semantic feature, and the acoustic feature, and includes:
inputting the style features, the semantic features and the acoustic features into a first local prediction model to obtain the first local facial animation output by the first local prediction model;
The first local prediction model is obtained based on training a first local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
Specifically, when predicting the first local face animation, the style features, the semantic features and the acoustic features can be input into the first local prediction model, the style features, the semantic features and the acoustic features are comprehensively analyzed by the first local prediction model, and the first local face animation is rendered and output.
The first local prediction model may be the generator of a Generative Adversarial Network (GAN), or may be another neural network model, which is not specifically limited here.
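A sketch of what such a first local prediction model could look like, assuming per-frame semantic and acoustic features, a broadcast style vector, and blendshape-coefficient output; the dimensions and the GRU backbone are assumptions for illustration, not an architecture prescribed by the patent.

```python
import torch
from torch import nn

class LipCheekPredictor(nn.Module):
    """Fuses style, semantic and acoustic features per frame and predicts
    lip-cheek animation coefficients (e.g. as a GAN generator)."""
    def __init__(self, sem_dim: int = 768, ac_dim: int = 15, style_dim: int = 64,
                 hidden: int = 256, out_dim: int = 52):
        super().__init__()
        self.rnn = nn.GRU(sem_dim + ac_dim + style_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, semantic, acoustic, style):
        # semantic: (B, T, sem_dim); acoustic: (B, T, ac_dim); style: (B, style_dim)
        style_seq = style.unsqueeze(1).expand(-1, semantic.size(1), -1)
        x = torch.cat([semantic, acoustic, style_seq], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                          # (B, T, out_dim) lip-cheek coefficients
```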
The first local prediction model can be obtained by training an initial first local prediction model through a first local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics thereof. When the initial first local prediction model is trained, semantic features and acoustic features corresponding to the first local face animation sample can be input into the initial first local prediction model to obtain a prediction result output by the initial first local prediction model, then a loss value is calculated according to the prediction result and the first local face animation sample, and finally structural parameters of the initial first local prediction model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times to obtain a first local prediction model.
In particular, if the first partial face animation is a lip cheek animation, the first partial predictive model is a lip cheek predictive model. Accordingly, the first partial face animation sample is a lip cheek animation sample.
In the embodiment of the invention, the first local facial animation is predicted by the model, so that the prediction efficiency can be ensured, and the obtained first local facial animation is more vivid and natural.
On the basis of the foregoing embodiment, the face animation generation method provided in the embodiment of the present invention predicts the second local face animation based on the style feature, the semantic feature, and the acoustic feature, and includes:
Inputting the style features, the semantic features and the acoustic features into a second local prediction model to obtain the second local facial animation output by the second local prediction model;
The second local prediction model is obtained based on training a second local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
Specifically, when predicting the second local face animation, the style features, the semantic features, and the acoustic features may be input into the second local prediction model, which jointly analyses them, renders, and outputs the second local facial animation. The second local prediction model may be a variational auto-encoder (VAE), or may be another neural network model, which is not specifically limited here.
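For illustration, a second local prediction model realised as a small conditional VAE might look like the sketch below, with the fused style/semantic/acoustic features as the condition and eyebrow coefficients as the output; all dimensions are assumed.

```python
import torch
from torch import nn

class EyebrowCVAE(nn.Module):
    """Conditional VAE sketch: the condition is the fused style/semantic/acoustic
    feature of a frame, and the decoder emits eyebrow animation coefficients."""
    def __init__(self, cond_dim: int = 847, latent_dim: int = 32, out_dim: int = 10):
        super().__init__()
        self.latent_dim = latent_dim
        self.enc = nn.Sequential(nn.Linear(out_dim + cond_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, eyebrow_gt, cond):
        # Training path: encode the ground-truth eyebrow frame together with the condition.
        h = self.enc(torch.cat([eyebrow_gt, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

    def predict(self, cond):
        # Inference path: sample the latent from the prior.
        z = torch.randn(cond.size(0), self.latent_dim, device=cond.device)
        return self.dec(torch.cat([z, cond], dim=-1))
```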
The second local prediction model can be obtained by training the initial second local prediction model through the second local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics thereof. When the initial second local prediction model is trained, semantic features and acoustic features corresponding to the second local face animation sample can be input into the initial second local prediction model to obtain a prediction result output by the initial second local prediction model, then a loss value is calculated according to the prediction result and the second local face animation sample, and finally structural parameters of the initial second local prediction model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times to obtain a second local prediction model.
In particular, if the second local facial animation is an eyebrow animation, the second local prediction model is an eyebrow prediction model. Correspondingly, the second local facial animation sample is an eyebrow animation sample.
In the embodiment of the invention, the second local facial animation is predicted by the model, so that the prediction efficiency can be ensured, and the obtained second local facial animation is more vivid and natural.
On the basis of the foregoing embodiment, the face animation generating method provided in the embodiment of the present invention generates a face animation based on the semantic features and the acoustic features, including:
Generating the face animation based on the style features, the semantic features and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
Specifically, when the facial animation is generated, style characteristics can be introduced, and the generation of the facial animation is realized by combining semantic characteristics and acoustic characteristics by utilizing the style characteristics.
The style characteristics can be obtained by pre-storing, and the animation style of the generated facial animation is fixed. In addition, the style characteristics can be obtained by encoding the identification information of the animation style selected by the user, and the generated face animation is determined by the animation style selected by the user. For the face animation driven by the same section of user voice, the user can switch between different animation styles according to the animation style selected by the user. Here, animation style may include gentle, robust, exciting, vivid, and the like.
Here, the identification information of the animation style selected by the user may be encoded by the style identification encoding model. The identification information of the animation style selected by the user is input into a style identification coding model, and the style identification coding model codes the identification information of the animation style selected by the user to obtain and output style characteristics.
The style identification coding model can be obtained by training the initial style identification coding model through sample style identification information with style characteristic labels. With specific reference to the foregoing embodiments, details are not repeated herein.
When the style feature, the semantic feature and the acoustic feature are utilized to predict the face animation, the known face animation and the corresponding style feature, semantic feature and acoustic feature can be utilized as samples, the mapping relation among the style feature, the semantic feature, the acoustic feature and the known face animation is constructed in advance, and the style feature, the semantic feature and the acoustic feature of the user voice and the style feature are combined by utilizing the mapping relation, so that the face animation corresponding to the style feature, the semantic feature and the acoustic feature can be obtained.
The style characteristics, the semantic characteristics and the acoustic characteristics can be input into the animation generation model, the style characteristics, the semantic characteristics and the acoustic characteristics are comprehensively analyzed by the animation generation model, and the facial animation is rendered and output.
The animation generation model can be obtained by training the initial generation model based on the human face animation sample and the corresponding style characteristics, semantic characteristics and acoustic characteristics thereof.
When training the initial generation model, style features, semantic features and acoustic features corresponding to the face animation sample can be input into the initial generation model to obtain a generation result output by the initial generation model, then a loss value is calculated according to the generation result and the face animation sample, and finally structural parameters of the initial generation model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times to obtain the animation generation model.
In the embodiment of the invention, the facial animation can have a unique style by introducing the style characteristics. If the style characteristics are obtained by encoding the identification information of the animation style selected by the user, the facial animation can further meet the requirements of the user on the animation style.
On the basis of the foregoing embodiment, the face animation generating method provided in the embodiment of the present invention generates a face animation based on the semantic features and the acoustic features, including:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
and generating a third local face animation of the face animation based on the motion characteristics.
Specifically, in the process of generating the face animation from the semantic features and the acoustic features, the generation of a third local face animation can be introduced; the generated third local face animation is then composited with the global face animation, or with the global face animation together with the first local face animation and the second local face animation, to obtain the face animation, so that the face animation is more vivid and the emotion it conveys is finer and richer. The third local face animation may be an eye-gaze animation, an expression animation, or the like.
When the third partial face animation is generated, body motion information can be acquired first, and the body motion information can be a numerical matrix of body skeleton node positions of a real user or a numerical matrix of body skeleton node positions of a virtual digital person which is generated in advance.
The body motion information may be input to the motion feature extraction model, and the motion features in the body motion information may be extracted by the motion feature extraction model. The motion features may characterize motion information of the user, and may include, for example, head motion features, gesture motion features, and the like. The motion feature extraction model can be obtained by training an initial motion feature extraction model through sample motion information with motion feature labels.
When the initial motion feature extraction model is trained, sample motion information can be input into the initial motion feature extraction model to obtain an extraction result output by the initial motion feature extraction model, then a loss value is calculated according to the extraction result and the motion feature label, and finally the structural parameters of the initial motion feature extraction model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times, and obtaining the action feature extraction model.
Thereafter, using the motion features, a third partial face animation may be generated. The third local face animation can be generated by introducing a third local prediction model, inputting the motion characteristics into the third local prediction model, analyzing the motion characteristics by the third local prediction model, rendering and outputting the third local face animation. The third local prediction model may be a multi-layer perceptron (Multilayer Perceptron, MLP) or may be another neural network model, which is not particularly limited herein.
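A compact sketch of such a third local prediction model as an MLP; the motion feature dimension and the choice of gaze parameters (yaw/pitch per eye) are assumptions made for the example, which assumes the third local face animation is the eye-gaze animation described later.

```python
import torch
from torch import nn

class GazePredictor(nn.Module):
    """MLP sketch mapping per-frame motion features to eye-gaze parameters
    (assumed here to be yaw/pitch for each eye)."""
    def __init__(self, motion_dim: int = 128, out_dim: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(motion_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim))

    def forward(self, motion_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(motion_features)             # (B, T, out_dim) gaze parameters
```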
The third local prediction model can be obtained by training an initial third local model through a third local face animation sample and corresponding motion characteristics thereof.
When the initial third local prediction model is trained, the motion characteristics corresponding to the third local face animation sample can be input into the initial third local prediction model to obtain a prediction result output by the initial third local prediction model, then a loss value is calculated according to the prediction result and the third local face animation sample, and finally the structural parameters of the initial third local prediction model are updated according to the loss value; and iteratively executing the input process and the calculation process until the loss value converges or reaches the preset iteration times to obtain a third local prediction model.
On the basis of the foregoing embodiment, the face animation generating method provided in the embodiment of the present invention generates a face animation based on the semantic features and the acoustic features, and then includes:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
Generating a third local face animation based on the motion characteristics;
And optimizing the facial animation based on the third local facial animation.
Specifically, after the face animation is generated from the semantic features and the acoustic features, the generated face animation can be optimized by the third local face animation: the third local face animation is composited with the face animation, and the corresponding local region of the face animation is refined by the third local face animation, so that the optimized face animation is more vivid and the emotion it conveys is finer and richer.
Because the eye gaze differs greatly from other local areas of the face and carries more emotional information, the prior art does not achieve an ideal animation generation effect for the eye region. Therefore, on the basis of the above embodiment, the eye-gaze animation may preferably be used as the third local face animation.
On the basis of the foregoing embodiment, the generating a third local face animation based on the motion feature includes:
generating the third local face animation based on the style characteristics and the motion characteristics;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
Specifically, when the third local face animation is generated, style characteristics can be introduced, and the style characteristics and the motion characteristics are combined. When the third local face animation is generated, the style characteristics and the motion characteristics can be input into the third local prediction model together, the style characteristics and the motion characteristics are comprehensively analyzed by the third local prediction model, and the third local face animation is rendered and output.
It can be understood that the third local prediction model can be obtained by training an initial third local model with third local face animation samples and their corresponding style features and motion features. The specific training process is as detailed in the above embodiment, the difference being that the input of the initial third local model here is the style features and motion features corresponding to the third local face animation sample.
The style features may be obtained as described in the above embodiments, which is not repeated here. In particular, if the third local face animation is an eye-gaze animation, the third local prediction model is an eye-gaze prediction model. Correspondingly, the third local face animation sample is an eye-gaze animation sample.
In the embodiment of the invention, the third local facial animation has a unique style by introducing the style characteristics. If the style characteristic is obtained by encoding the identification information of the animation style selected by the user, the third local facial animation can further meet the requirement of the user on the animation style.
On the basis of the above embodiment, the extracting the movement feature in the body motion information includes:
Acquiring lens position information;
Inputting the lens position information and the body motion information into a motion feature extraction model to obtain the motion features output by the motion feature extraction model;
the motion feature extraction model is obtained by training based on sample lens positions with motion feature labels and sample motion information.
Specifically, in the case where the third local face animation is an eye-gaze animation, lens (camera) position information may be introduced when the motion features are extracted from the body motion information, so that the generated eye-gaze animation does not deviate from the lens but follows it. The lens position information may be an absolute position in the world coordinate system and may be fixed.
Then, the lens position information and the body motion information can be input into the motion feature extraction model, and the motion feature extraction model performs joint analysis on the lens position information and the body motion information to obtain and output motion features.
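The joint analysis of lens position and body motion can be sketched as below, assuming the body motion information is a sequence of world-space skeleton joint positions and the lens position is a single 3D point broadcast over time; the GRU encoder and all dimensions are illustrative assumptions.

```python
import torch
from torch import nn

class MotionFeatureExtractor(nn.Module):
    """Encodes per-frame skeleton joint positions together with the lens
    (camera) position into motion features."""
    def __init__(self, num_joints: int = 24, cam_dim: int = 3, feat_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(num_joints * 3 + cam_dim, feat_dim, batch_first=True)

    def forward(self, joints: torch.Tensor, lens_pos: torch.Tensor) -> torch.Tensor:
        # joints: (B, T, num_joints, 3) world-space positions; lens_pos: (B, 3)
        B, T = joints.shape[:2]
        lens_seq = lens_pos.unsqueeze(1).expand(B, T, -1)
        x = torch.cat([joints.reshape(B, T, -1), lens_seq], dim=-1)
        feats, _ = self.rnn(x)
        return feats                                  # (B, T, feat_dim) motion features
```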
The motion feature extraction model can be obtained by training an initial motion feature extraction model by utilizing sample lens positions with motion feature labels and sample motion information.
When the initial motion feature extraction model is trained, the sample lens position and the sample motion information can be input into the initial motion feature extraction model to obtain an extraction result output by the model; a loss value is then calculated from the extraction result and the motion feature label, and the structural parameters of the initial model are updated according to the loss value. The input and calculation processes are executed iteratively until the loss value converges or a preset number of iterations is reached, yielding the motion feature extraction model.
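The iterative procedure just described might look roughly like the following training loop; the choice of MSE loss, the Adam optimizer, and a model that accepts the lens position and body motion as two joint inputs are illustrative assumptions rather than requirements of the embodiment.

```python
import torch
import torch.nn as nn

def train_motion_feature_extractor(model, dataloader, epochs=10, lr=1e-4):
    """Train an initial motion feature extraction model on
    (sample lens position, sample motion information, motion feature label) triples.
    MSE loss and the Adam optimizer are illustrative choices; `model` is assumed
    to accept the two inputs jointly."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):  # or stop once the loss value has converged
        for lens_pos, body_motion, feature_label in dataloader:
            pred = model(lens_pos, body_motion)        # joint analysis of both inputs
            loss = criterion(pred, feature_label)      # loss against the feature label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                           # update the structural parameters
    return model
```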
The sample motion information may be motion information collected when the user faces the lens, or motion information collected when the user is at a certain azimuth angle to the lens; this is not particularly limited herein.
In the embodiment of the invention, introducing the lens position information when the motion features are extracted from the body motion information prevents the subsequently generated gaze animation from deviating from the lens and makes it follow the lens, so that the resulting face animation is more natural.
On the basis of the above embodiment, the extracting the semantic features and the acoustic features of the user voice includes:
inputting the user voice into a semantic feature extraction model and an acoustic feature extraction model respectively to obtain the semantic features output by the semantic feature extraction model and the acoustic features output by the acoustic feature extraction model;
The semantic feature extraction model is obtained based on sample voice training with semantic feature labels, and the acoustic feature extraction model is obtained based on sample voice training with acoustic feature labels.
Specifically, in order to improve extraction efficiency and accuracy of the semantic features and the acoustic features, the extraction can be implemented through a semantic feature extraction model and an acoustic feature extraction model respectively.
The user voice can be input into the semantic feature extraction model, semantic features are obtained by the semantic feature extraction model, the user voice is input into the acoustic feature extraction model, and acoustic features are output by the acoustic feature extraction model. The specific structures of the semantic feature extraction model and the acoustic feature extraction model can be selected according to actual needs, and are not specifically limited herein.
The semantic feature extraction model can be obtained by training an initial semantic feature extraction model with sample voice carrying semantic feature labels. For example, the sample voice can be input into the initial semantic feature extraction model to obtain an extraction result output by the model; a loss value is then calculated from the extraction result and the semantic feature label, and the structural parameters of the initial model are updated according to the loss value. The input and calculation processes are executed iteratively until the loss value converges or a preset number of iterations is reached, yielding the semantic feature extraction model.
The acoustic feature extraction model can be obtained in the same way by training an initial acoustic feature extraction model with sample voice carrying acoustic feature labels: the sample voice is input into the initial acoustic feature extraction model to obtain an extraction result, a loss value is calculated from the extraction result and the acoustic feature label, and the structural parameters of the initial model are updated according to the loss value; the process is iterated until the loss value converges or a preset number of iterations is reached, yielding the acoustic feature extraction model.
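A minimal sketch of the two parallel extraction branches is given below; the use of MFCCs for the acoustic branch and a small recurrent placeholder for the semantic branch are assumptions made only for illustration — the embodiment deliberately leaves the concrete structures of both models open, and in practice the semantic branch would more likely be a pretrained speech-recognition-style encoder.

```python
import librosa
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Placeholder semantic feature extraction model; the structure is an assumption."""

    def __init__(self, in_dim=13, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):            # x: (B, T, in_dim)
        out, _ = self.rnn(x)
        return out                   # per-frame semantic features

def extract_features(wav_path, semantic_model):
    """Run the same user voice through the two branches in parallel."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Acoustic branch: low-level spectral features (MFCCs chosen only for illustration).
    acoustic = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T            # (T, 13)
    acoustic_t = torch.from_numpy(acoustic).float().unsqueeze(0)        # (1, T, 13)
    # Semantic branch: here fed with the same frames; a pretrained recognition-style
    # encoder over the raw audio would be a more typical choice in practice.
    semantic = semantic_model(acoustic_t)
    return semantic, acoustic_t
```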
As shown in fig. 2, the embodiment of the present invention further provides a method for generating a facial animation, which includes:
S21, acquiring user voice;
S22, extracting voice characteristics of the user voice;
S23, respectively predicting a first local face animation and a second local face animation based on the voice characteristics, and generating the face animation based on the first local face animation and the second local face animation.
Specifically, in the face animation generation method provided in the embodiment of the present invention, the execution subject is a face animation generation device, and the device may be configured in a terminal device such as a computer, a mobile phone, or a cloud terminal, which is not limited herein.
First, step S21 is executed to obtain the voice of the user, and the specific execution process of step S21 is the same as the execution process of step S11 described above, which is not repeated here.
Then, step S22 is performed to extract the voice features of the user voice. The voice features may include at least one of semantic features and acoustic features, and may also include timbre features and the like. The method used to extract the voice features may be a conventional extraction algorithm, which is not particularly limited herein.
Finally, step S23 is executed to predict the first partial face animation and the second partial face animation by using the voice features.
Here, the first local face animation may be predicted from the voice features. The first local face animation of a known face animation and its corresponding voice features can be taken as a sample, and a mapping relationship between voice features and the first local face animation can be constructed in advance; the first local face animation corresponding to the voice features of the user voice is then obtained from this mapping relationship. Alternatively, a first local prediction model can be trained with such samples, and the first local face animation output by the model is obtained by inputting the voice features into it.
Likewise, the second local face animation may be predicted from the voice features. The second local face animation of a known face animation and its corresponding voice features can be taken as a sample, and a mapping relationship between voice features and the second local face animation can be constructed in advance; the second local face animation corresponding to the voice features of the user voice is then obtained from this mapping relationship. Alternatively, a second local prediction model can be trained with such samples, and the second local face animation output by the model is obtained by inputting the voice features into it.
Finally, the first local face animation and the second local face animation are combined with a global face animation to synthesize the face animation. It will be appreciated that the global face animation may be generated from the voice features of the user voice or by conventional means, which is not limited herein. The resulting face animation thus carries local facial information that clearly and accurately matches the user voice.
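One way the combination step could be realized is sketched below, assuming the animations are represented as per-frame blendshape coefficients and that each local animation owns a fixed subset of channels; both assumptions are illustrative and not prescribed by the embodiment.

```python
import numpy as np

def combine_face_animation(global_anim, first_local, second_local, first_idx, second_idx):
    """Overwrite the corresponding regions of a global face animation with the
    predicted local animations.

    global_anim:  (T, C) global blendshape coefficients per frame (assumed format)
    first_local:  (T, len(first_idx))  e.g. lip-cheek channels
    second_local: (T, len(second_idx)) e.g. eyebrow channels
    """
    face_anim = global_anim.copy()
    face_anim[:, first_idx] = first_local      # lip-cheek region follows the speech
    face_anim[:, second_idx] = second_local    # eyebrow region follows the speech
    return face_anim

# Hypothetical channel layout: 52 blendshapes with assumed lip-cheek and eyebrow subsets.
T = 100
global_anim = np.zeros((T, 52))
first_idx = list(range(0, 20))     # assumed lip-cheek channels
second_idx = list(range(40, 48))   # assumed eyebrow channels
face = combine_face_animation(global_anim, np.random.rand(T, 20),
                              np.random.rand(T, 8), first_idx, second_idx)
```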
The face animation generation method provided by the embodiment of the invention first acquires the user voice, then extracts the voice features of the user voice, and finally predicts the first local face animation and the second local face animation respectively based on the voice features and generates the face animation from them. In this method, the voice features of the user voice are used to predict the first and second local face animations separately, which decouples the facial regions, so that the finally synthesized face animation restores the independent motion pattern of each facial region and is therefore more coordinated.
In summary, as shown in fig. 3, a method for generating a face animation according to an embodiment of the present invention includes:
User voice, body motion information and lens position information are acquired, and identification information of animation styles selected by a user is received.
The user voice is respectively input into a semantic feature extraction model and an acoustic feature extraction model, semantic features are output by the semantic feature extraction model, and acoustic features are output by the acoustic feature extraction model.
And inputting the identification information of the animation style selected by the user into a style identification coding model, and outputting style characteristics by the style identification coding model.
The lens position information and the body motion information are input into the motion feature extraction model, and the motion feature is output by the motion feature extraction model.
The semantic features, the acoustic features and the style features are fused; the obtained first fusion result is input into a lip-cheek prediction model, which outputs a lip-cheek animation, and into an eyebrow prediction model, which outputs an eyebrow animation.
The style features and the motion features are fused, the obtained second fusion result is input into the gaze prediction model, and the gaze prediction model outputs the gaze animation.
The fusion of the semantic, acoustic and style features, as well as the fusion of the style and motion features, may each be performed by splicing (concatenation).
Finally, the lip-cheek animation, the eyebrow animation and the gaze animation are combined to obtain the face animation.
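Putting the steps of fig. 3 together, the overall flow can be sketched as follows; every model is a placeholder callable, the feature shapes are assumed to be per-frame tensors with matching lengths, and the concatenation mirrors the splicing option mentioned above.

```python
import torch

def generate_face_animation(user_voice, body_motion, lens_pos, style_id, models):
    """End-to-end sketch of the fig. 3 flow. `models` is an assumed dict of trained
    sub-models (callables returning per-frame feature tensors); all interfaces and
    shapes are illustrative, not prescribed by the embodiment."""
    semantic = models["semantic"](user_voice)            # semantic feature extraction model
    acoustic = models["acoustic"](user_voice)            # acoustic feature extraction model
    style    = models["style"](style_id)                 # style identification coding model
    motion   = models["motion"](lens_pos, body_motion)   # motion feature extraction model

    # First fusion result: splice (concatenate) semantic, acoustic and style features.
    fused_speech = torch.cat([semantic, acoustic, style], dim=-1)
    lip_cheek = models["lip_cheek"](fused_speech)        # lip-cheek prediction model
    eyebrow   = models["eyebrow"](fused_speech)          # eyebrow prediction model

    # Second fusion result: splice style and motion features.
    fused_motion = torch.cat([style, motion], dim=-1)
    gaze = models["gaze"](fused_motion)                  # gaze prediction model

    # The face animation is obtained by combining the three local animations,
    # e.g. with a region-wise merge such as the one sketched earlier.
    return {"lip_cheek": lip_cheek, "eyebrow": eyebrow, "gaze": gaze}
```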
That is, by combining the user voice, the body motion information and the lens position information, and letting the user select the identification information of an animation style, a virtual digital human video clip with face animation can be generated with one click.
Existing virtual digital humans suffer from problems such as serious homogenization of character design and low quality and efficiency. Using the face animation generation method provided by the embodiment of the invention to generate a virtual digital human can improve its facial animation effect. The hyper-realistic virtual digital human obtained in this way can be used for virtual digital human live streaming, virtual digital human video, virtual digital human services and the like, giving the user a visual experience close to that of real-person interaction, real-person video and real-person service.
The face animation generation method provided by the embodiment of the invention can restore vivid and lifelike face animation in multiple styles based on the user voice, is not disturbed by changes in the speaker's timbre, and depicts the local facial regions so that the animation is more coordinated and vivid.
On this basis, for example, when the face animation is generated, some of the first local face animation, the second local face animation, the third local face animation and the global face animation can be generated from a pre-constructed action library to improve generation efficiency, while the remaining face animation is generated by the steps of the face animation generation method provided by the embodiment of the invention to improve the generation effect.
On this basis, when the face animation is generated, the face animation in a preset state can be generated from the pre-built action library, while the face animation in states other than the preset state is generated by the steps of the face animation generation method provided by the embodiment of the invention. This balances generation efficiency and reduces the processing load while still taking the generation quality of the face animation into account. The preset state may include a calm state, i.e. a neutral face animation, or a state under a specific emotion or a specific action, etc.
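One possible realization of this hybrid strategy is sketched below; the keying of the action library by a named preset state and the clip-loading helper are hypothetical details introduced only for illustration.

```python
# Hypothetical pre-built action library: preset state name -> pre-authored clip path.
ACTION_LIBRARY = {
    "neutral": "clips/neutral_face.anim",   # calm state, i.e. neutral face animation
    "smile":   "clips/smile_face.anim",     # example of a specific-emotion preset
}

def load_clip(path):
    # Placeholder loader; a real implementation would deserialize the clip format.
    return {"source": "action_library", "path": path}

def get_face_animation(state, user_voice, generate_fn):
    """Return the face animation for a preset state from the action library
    (cheap, low processor load); otherwise fall back to the model-based
    generation steps (higher quality, higher cost)."""
    if state in ACTION_LIBRARY:
        return load_clip(ACTION_LIBRARY[state])
    return generate_fn(user_voice)
```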
As shown in fig. 4, on the basis of the above embodiment, an embodiment of the present invention provides a facial animation generating device, including:
a first voice acquisition module 41, configured to acquire a user voice;
a first feature extraction module 42, configured to extract semantic features and acoustic features of the user speech;
A first animation generation module 43, configured to generate a facial animation based on the semantic features and the acoustic features.
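The module structure described above could be mirrored in code roughly as follows; the class name, constructor arguments and method names are illustrative and not part of the embodiment.

```python
class FaceAnimationGenerationDevice:
    """Sketch of the fig. 4 device with its three modules; the components are
    assumed callables wired in by the caller."""

    def __init__(self, voice_acquisition, feature_extraction, animation_generation):
        self.first_voice_acquisition_module = voice_acquisition        # module 41
        self.first_feature_extraction_module = feature_extraction      # module 42
        self.first_animation_generation_module = animation_generation  # module 43

    def run(self, audio_source):
        user_voice = self.first_voice_acquisition_module(audio_source)
        semantic, acoustic = self.first_feature_extraction_module(user_voice)
        return self.first_animation_generation_module(semantic, acoustic)
```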
On the basis of the above embodiment, in the facial animation generating device provided in the embodiment of the present invention, the face animation includes a first local face animation and a second local face animation;
the first animation generation module is specifically configured to:
Based on the semantic features and the acoustic features, respectively predicting the first partial face animation and the second partial face animation;
And generating the face animation based on the first local face animation and the second local face animation.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is specifically configured to:
predicting the first partial face animation based on style features, the semantic features, and the acoustic features;
And/or the number of the groups of groups,
Predicting the second partial facial animation based on style features, the semantic features, and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is specifically configured to:
inputting the style features, the semantic features and the acoustic features into a first local prediction model to obtain the first local facial animation output by the first local prediction model;
The first local prediction model is obtained based on training a first local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is further specifically configured to:
Inputting the style features, the semantic features and the acoustic features into a second local prediction model to obtain the second local facial animation output by the second local prediction model;
The second local prediction model is obtained based on training a second local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
On the basis of the above embodiment, in the facial animation generating device provided in the embodiment of the present invention, the first local face animation is a lip-cheek animation and/or the second local face animation is an eyebrow animation.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is further specifically configured to:
Generating the face animation based on the style features, the semantic features and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is further specifically configured to:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
And generating a third local face animation of the face animation based on the motion characteristics.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is further specifically configured to:
generating the third local face animation based on the style characteristics and the motion characteristics;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
On the basis of the above embodiment, in the facial animation generating device provided in the embodiment of the present invention, the third local face animation is a gaze animation.
On the basis of the foregoing embodiment, in the face animation generating device provided in the embodiment of the present invention, the first animation generating module is further specifically configured to:
Acquiring lens position information;
Inputting the lens position information and the body motion information into a motion feature extraction model to obtain the motion features output by the motion feature extraction model;
the motion feature extraction model is obtained by training based on sample lens positions with motion feature labels and sample motion information.
On the basis of the foregoing embodiment, in the facial animation generating device provided in the embodiment of the present invention, the first feature extraction module is specifically configured to:
inputting the user voice into a semantic feature extraction model and an acoustic feature extraction model respectively to obtain the semantic features output by the semantic feature extraction model and the acoustic features output by the acoustic feature extraction model;
The semantic feature extraction model is obtained based on sample voice training with semantic feature labels, and the acoustic feature extraction model is obtained based on sample voice training with acoustic feature labels.
On the basis of the above embodiment, the facial animation generating device provided in the embodiment of the present invention further includes an optimizing module, configured to:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
Generating a third local face animation based on the motion characteristics;
And optimizing the facial animation based on the third local facial animation.
Specifically, the functions of the modules in the facial animation generating device provided in the embodiment of the present invention correspond one-to-one to the operation flow of the steps in the method embodiment shown in fig. 1, and the achieved effects are likewise consistent.
As shown in fig. 5, on the basis of the above embodiment, the embodiment of the present invention further provides a facial animation generating device, including:
A second voice acquisition module 51, configured to acquire a user voice;
A second feature extraction module 52, configured to extract speech features of the user speech;
the second animation generation module 53 is configured to predict a first local face animation and a second local face animation based on the voice feature, and generate a face animation based on the first local face animation and the second local face animation.
Specifically, the functions of the modules in the facial animation generating device provided in the embodiment of the present invention correspond one-to-one to the operation flow of the steps in the method embodiment shown in fig. 2, and the achieved effects are likewise consistent.
Fig. 6 illustrates a physical schematic diagram of an electronic device. As shown in fig. 6, the electronic device may include: a processor (Processor) 610, a communication interface (Communications Interface) 620, a memory (Memory) 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other through the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the face animation generation methods provided in the embodiments described above.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can perform the face animation generation method provided in the foregoing embodiments.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the face animation generation method provided in the above embodiments.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A face animation generation method, characterized by comprising the following steps:
Acquiring user voice;
Extracting semantic features and acoustic features of the user voice;
And generating a face animation based on the semantic features and the acoustic features.
2. The face animation generation method of claim 1, wherein the face animation comprises a first partial face animation and a second partial face animation;
the generating a face animation based on the semantic features and the acoustic features includes:
Based on the semantic features and the acoustic features, respectively predicting the first partial face animation and the second partial face animation;
And generating the face animation based on the first local face animation and the second local face animation.
3. The face animation generation method according to claim 2, wherein predicting the first partial face animation and the second partial face animation based on the semantic features and the acoustic features, respectively, comprises:
predicting the first partial face animation based on style features, the semantic features, and the acoustic features;
And/or the number of the groups of groups,
Predicting the second partial facial animation based on style features, the semantic features, and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
4. A face animation generation method according to claim 3, wherein predicting the first partial face animation based on the style features, the semantic features, and the acoustic features comprises:
inputting the style features, the semantic features and the acoustic features into a first local prediction model to obtain the first local facial animation output by the first local prediction model;
The first local prediction model is obtained based on training a first local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
5. A face animation generation method according to claim 3, wherein predicting the second partial face animation based on the style features, the semantic features, and the acoustic features comprises:
Inputting the style features, the semantic features and the acoustic features into a second local prediction model to obtain the second local facial animation output by the second local prediction model;
The second local prediction model is obtained based on training a second local face animation sample and corresponding style characteristics, semantic characteristics and acoustic characteristics.
6. The face animation generation method according to claim 2, wherein the first partial face animation is a lip-cheek animation and/or the second partial face animation is an eyebrow animation.
7. The face animation generation method according to claim 1, wherein the generating a face animation based on the semantic features and the acoustic features comprises:
Generating the face animation based on the style features, the semantic features and the acoustic features;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
8. The face animation generation method according to any one of claims 1 to 7, wherein the generating a face animation based on the semantic features and the acoustic features comprises:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
and generating a third local face animation of the face animation based on the motion characteristics.
9. The face animation generation method of claim 8, wherein the generating a third partial face animation of the face animation based on the motion feature comprises:
generating the third local face animation based on the style characteristics and the motion characteristics;
the style characteristics are obtained by pre-storing or encoding the identification information of the animation style selected by the user.
10. The face animation generation method of claim 8, wherein the third partial face animation is a gaze animation.
11. The method of claim 10, wherein the extracting motion features in the body motion information comprises:
Acquiring lens position information;
Inputting the lens position information and the body motion information into a motion feature extraction model to obtain the motion features output by the motion feature extraction model;
the motion feature extraction model is obtained by training based on sample lens positions with motion feature labels and sample motion information.
12. The face animation generation method according to any one of claims 1-7, wherein the extracting semantic features and acoustic features of the user's voice comprises:
inputting the user voice into a semantic feature extraction model and an acoustic feature extraction model respectively to obtain the semantic features output by the semantic feature extraction model and the acoustic features output by the acoustic feature extraction model;
The semantic feature extraction model is obtained based on sample voice training with semantic feature labels, and the acoustic feature extraction model is obtained based on sample voice training with acoustic feature labels.
13. The face animation generation method according to any one of claims 1-7, wherein after the generating a face animation based on the semantic features and the acoustic features, the method further comprises:
acquiring body motion information;
Extracting movement characteristics in the body movement information;
Generating a third local face animation based on the motion characteristics;
And optimizing the facial animation based on the third local facial animation.
14. A face animation generation method, characterized by comprising the following steps:
Acquiring user voice;
Extracting voice characteristics of the user voice;
And respectively predicting a first local face animation and a second local face animation based on the voice characteristics, and generating the face animation based on the first local face animation and the second local face animation.
15. A facial animation generating device, comprising:
The first voice acquisition module is used for acquiring user voices;
the first feature extraction module is used for extracting semantic features and acoustic features of the user voice;
and the first animation generation module is used for generating a face animation based on the semantic features and the acoustic features.
16. A facial animation generating device, comprising:
The second voice acquisition module is used for acquiring the voice of the user;
The second feature extraction module is used for extracting the voice features of the user voice;
And the second animation generation module is used for respectively predicting a first local face animation and a second local face animation based on the voice characteristics and generating the face animation based on the first local face animation and the second local face animation.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the facial animation generation method of any of claims 1-14 when the computer program is executed by the processor.
18. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a face animation generation method according to any of claims 1-14.