CN116863038A - Method for generating digital human voice and facial animation by text

Method for generating digital human voice and facial animation by text

Info

Publication number
CN116863038A
CN116863038A (application number CN202310831606.9A)
Authority
CN
China
Prior art keywords
text
voice
emotion
animation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310831606.9A
Other languages
Chinese (zh)
Inventor
吴清强 (Wu Qingqiang)
罗晗月 (Luo Hanyue)
赵凯祥 (Zhao Kaixiang)
孟俊 (Meng Jun)
苏少岩 (Su Shaoyan)
李晓东 (Li Xiaodong)
洪清启 (Hong Qingqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongbo Future Artificial Intelligence Research Institute Xiamen Co ltd
Original Assignee
Dongbo Future Artificial Intelligence Research Institute Xiamen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongbo Future Artificial Intelligence Research Institute Xiamen Co ltd filed Critical Dongbo Future Artificial Intelligence Research Institute Xiamen Co ltd
Priority to CN202310831606.9A
Publication of CN116863038A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/055 Time compression or expansion for synchronising with other signals, e.g. video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a method for generating digital human voice and facial animation from text, comprising the following steps: S1, collecting text material and audio of the target speakers; S2, preprocessing the text; S3, analyzing the emotion of the text; S4, constructing an acoustic model; S5, decoding with the acoustic model's decoder; S6, generating the digital human facial animation; S7, synchronizing the digital human voice and animation; S8, rendering and presenting the result. Compared with the prior art, the invention has the following advantages: 1. text can be converted into speech; 2. facial expressions and lip movements are generated from the text content and tone entered by the user; 3. the digital human facial expressions convey emotion and intention, providing finer and more accurate emotional expression; 4. a more realistic and vivid human-computer interaction experience is achieved; 5. the text-to-speech and digital human face-driving method can be applied in many fields.

Description

Method for generating digital human voice and facial animation by text
Technical Field
The invention relates to the technical field of text-driven digital human generation, and in particular to a method for generating digital human voice and facial animation from text.
Background
With the rapid development of virtual reality (VR), augmented reality (AR) and artificial intelligence (AI), digital human technology has gradually become an important research direction in the field of human-computer interaction. A digital human is a computer-generated virtual character that exhibits an appearance, movements and interactive abilities similar to a real person. Digital human technology is widely used in virtual assistants, virtual characters, game characters and similar applications to provide users with a more immersive and personalized interactive experience. However, achieving a realistic digital human interaction experience remains challenging. In particular, with regard to speech synthesis and facial animation, the prior art has the following problems:
Speech synthesis problem: traditional speech synthesis techniques often lack naturalness and fluency when synthesizing digital human speech, and the result sounds mechanical and artificial. This kind of synthesis cannot provide quality comparable to real human speech.
Facial animation problem: existing facial animation techniques leave room for improvement in accuracy and expressiveness. Conventional methods often generate facial animation from simple motion rules or manual editing, which makes accurate matching with the speech difficult and fails to produce natural facial expressions.
The shortcoming of the prior art is therefore twofold: existing speech synthesis methods often cannot generate sufficiently natural and fluent digital human speech, the results sounding mechanical and artificial and lacking quality comparable to a real human voice; and some facial animation generation methods struggle to match the speech accurately, since rule-based and manually edited animation cannot achieve precise facial motion, leaving facial expression and speech mismatched or unnatural.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects and provide a method for generating digital human voice and facial animation by using texts.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: a method for text generation of digital human speech and facial animation, comprising the steps of:
S1, collecting text material and audio of the target speakers, wherein the purpose of collecting audio of the target speakers is to capture the pronunciation characteristics and individual differences of the voices to be synthesized; by collecting a large number of audio samples from different target speakers, individual characteristics such as pronunciation habits, timbre, speaking rate and intonation can be captured, and these audio samples are used to train the speech synthesis model so that it can imitate the voice and vocal characteristics of different speakers;
s2, preprocessing a text, specifically comprising the following steps:
(1) Deleting garbled and unrecognizable characters: for text containing garbled or unrecognizable characters, a filtering operation is used to delete them or replace them with appropriate characters, which ensures the consistency and readability of the text;
(2) Word segmentation: the text is segmented into units of words or subwords, using spaces or punctuation marks as separators; segmentation extracts semantically meaningful units and provides the input for subsequent text processing tasks;
(3) Punctuation handling: punctuation provides semantic and structural information, so punctuation is retained for use in the emotion analysis task;
(4) Case conversion: letters in the text are uniformly converted to upper or lower case to eliminate the influence of case on text classification;
(5) Stop-word removal;
s3, text emotion analysis, which specifically comprises the following steps:
(1) Data vectorization: the text is converted into a vector representation so that it can be fed to the convolutional neural network model; a word embedding model represents each word as a vector, and words are also converted to one-hot codes; because text lengths differ, the sequences are normalized, for example by padding or truncating all texts to the same length, and padding can be implemented by appending a special padding symbol at the end of the sequence;
(2) Constructing the convolutional neural network model: an appropriate convolutional neural network is chosen (implemented in Python), hyperparameters such as the number of layers and the number of neurons are selected according to the task requirements and the characteristics of the data set, and the model's loss function (cross-entropy loss), optimizer (Adam or SGD) and evaluation metrics (accuracy, precision, recall) are set;
(3) Classification: texts are divided into positive, neutral and negative emotion classes as required, and the trained model is used to predict the emotion class of new text; the new text goes through the same preprocessing steps, is converted into a vector representation, and is fed to the trained model for prediction;
s4, constructing an acoustic model, which specifically comprises the following steps:
(1) Preparing data;
(2) A text encoder module that converts the vectorized text into a representation in a latent semantic space; a convolutional neural network model performs sequence encoding of the word or character embeddings, and on top of the sequence encoding the encoded sequence is further context-encoded, by aggregating the hidden states of multiple sequence encodings or via an attention mechanism, which helps capture a more global and semantically rich text representation; the context-encoded text representation is then mapped to the latent semantic space through a fully connected layer to form the latent semantic representation, the dimension of the latent semantic space usually being kept low to reduce the dimensionality of the representation while capturing the main semantic information; the feed-forward network of the text encoder consists of two convolution layers, and these convolutions are equal-length ("same"-padded) convolutions;
(3) A duration predictor: the acoustic model and the vocoder are connected in series in the speech synthesis pipeline, and hidden variables rather than spectra are modeled stochastically; a stochastic duration predictor improves the diversity of the synthesized speech, so that the same input text can be synthesized into speech with different voices and rhythms;
(4) A variational autoencoder: the structure of a variational autoencoder is introduced into the acoustic model to achieve continuity and random sampling in the latent space; the autoencoder comprises an encoder network and a decoder network and is trained by maximizing the likelihood of the observed data and minimizing the KL divergence of the latent space;
(5) Adversarial training: to improve the generative capacity and naturalness of the acoustic model, an adversarial training mechanism is introduced; a discriminator network is constructed to distinguish generated acoustic features from real acoustic features, and the acoustic model is trained by minimizing the discriminator loss so that it generates more realistic acoustic features;
S5, using the decoder of the acoustic model, namely the HiFi-GAN V1 generator, which consists mainly of several groups of transposed convolutions, each followed by a multi-receptive-field fusion module composed of residual blocks of equal-size one-dimensional convolutions;
S6, generating the digital human facial animation: the generated audio data is converted, with the help of a facial key-point tracking algorithm, into weight data that drives the digital human's facial blendshapes in real time, comprising the following steps:
(1) A speech analysis layer extracts a time-varying speech feature sequence that then drives articulation; information is first extracted with a fixed-function autocorrelation analysis and then refined by 5 convolution layers, after which the trained network can extract short-term features of the human voice such as phonemes, intonation and accent;
(2) An emotion network consisting of 5 convolution layers analyzes the temporal evolution of the features and finally outputs an abstract feature vector describing the facial pose at the center of the audio window; this layer receives the emotional state as input to disambiguate between different expressions and speaking styles, the emotional state being represented as an e-dimensional vector that is concatenated directly to the output of each layer of the network so that subsequent layers can adjust their behavior accordingly; the convolutions are organized into two distinct stages to avoid overfitting;
(3) Key-point mapping: key points of the facial animation data are tracked, these key points being specific locations on the face such as the eyes, mouth and eyebrows, to generate the final 116 blendshape weights; the output network is implemented as a pair of fully connected layers that perform a simple linear transformation of the data, and the second layer is initialized with 150 pre-computed PCA components that together explain 99.9% of the variance seen in the training data;
S7, synchronizing the digital human voice and animation: the speech synthesis module and the mouth animation generation module are connected and coordinated to synchronize speech and mouth animation in real time; when a user enters text, the speech synthesis module generates the corresponding speech signal and passes it to the mouth animation generation module, which generates the corresponding mouth animation sequence in real time according to the acoustic characteristics and expression of the speech signal and presents it simultaneously with the speech, achieving synchronized presentation of the digital human voice and mouth animation;
and S8, rendering and synthesizing the voice and the facial animation after synchronous adjustment to generate final digital human voice and facial animation.
As an improvement, the text may contain noise, erroneous characters, garbled characters or special characters that would interfere with subsequent text processing tasks; text preprocessing cleans the text, removes such invalid information, and improves the quality and consistency of the data.
As an improvement, the text emotion analysis recognizes and understands the emotional tendency of the processed text content so that the synthesized speech conveys an appropriate emotional color; by recognizing the emotional tendency of the text, the parameters and models of the speech synthesis system can be adjusted so that the generated speech better expresses the corresponding emotion in intonation, speaking rate, volume and so on, and by analyzing the emotion of the text entered by the user, the speech synthesis system can generate a matching emotional response and provide a more emotional and personalized interaction experience.
As an improvement, the acoustic model is constructed with a variational autoencoder: the semantic vector of the text is taken as input and an acoustic feature representation is generated; the encoder maps the semantic vector to a Gaussian distribution in the latent space, a latent code is obtained by random sampling, and the decoder then maps the latent code to the acoustic feature representation; the pronunciation rules are fed to a pre-trained vocoder, which generates the feature representation of the speech signal from them.
As an improvement, stop words are common words that appear frequently in text but usually carry little semantic information, such as prepositions, conjunctions and pronouns; removing them reduces the feature dimensionality and improves model performance and computational efficiency.
As an improvement, the data preparation builds the text and speech data set used for training; the data set contains texts and the corresponding speech samples; the texts are those already classified after the processing described above, and the speech samples are likewise labeled as positive, neutral or negative in delivery, this labeling being done manually.
Compared with the prior art, the invention has the following advantages: 1. the invention converts text into speech and synchronizes facial expressions and lip movements through a digital human face-driving technique; this multimodal interaction provides users with a richer and more intuitive way of communicating and strengthens both the communication effect and user engagement.
2. Through the digital human face-driving technique, the invention generates facial expressions and lip movements that correspond to the text content and tone entered by the user, so that the appearance and demeanor of the digital human can be customized to better fit personal preferences and needs.
3. Digital human facial expressions convey emotion and intention, providing finer and more accurate emotional expression. When the digital human displays an appropriate facial expression for the entered text, the user can better understand and feel its response, strengthening the emotional resonance of the exchange.
4. By combining text-to-speech and digital human face driving, the invention achieves a more realistic and vivid human-computer interaction experience: the user hears the speech output while observing the digital human's facial expressions, which increases the realism and sense of presence of the interaction and improves user satisfaction and engagement.
5. The text-to-speech and digital human face-driving method can be applied in many fields, such as virtual assistants, virtual characters, games and entertainment, to enhance interaction in both real and virtual settings and to provide a more natural, more engaging human-computer interface.
Drawings
Fig. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of a text emotion analysis flow of the present invention.
FIG. 3 is a schematic flow chart of the present invention for constructing an acoustic model.
Fig. 4 is a schematic flow diagram of a duration predictor of the present invention.
Fig. 5 is a schematic diagram of a digital human face animation generation flow of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to Figs. 1-5, a method for generating digital human voice and facial animation from text comprises the following steps:
the method comprises the steps of firstly collecting text materials and audios of speakers to be synthesized, collecting the audios of the speakers to be synthesized to capture pronunciation characteristics and individual differences of the voices to be synthesized, collecting a large number of different audios of the speakers to be synthesized, capturing individual characteristics such as pronunciation habits, tones, speech speeds and intonation of different people, training a voice synthesis model through the audio samples, enabling the voice synthesis model to simulate the voice and the voice characteristics of different speakers, learning to convert input texts into natural smooth voice output through simultaneous use of the texts and the audios of the speakers to be synthesized, personalizing synthesized voices according to the characteristics of the audios of the speakers to be synthesized, and providing voice synthesis results with more individuality and naturalness, so that a voice synthesis system is more accurate and lifelike when simulating different speakers and adapting to different contexts.
Step two, text preprocessing. The text may contain noise, erroneous characters, garbled characters or special characters that would interfere with subsequent text processing tasks. Text preprocessing cleans the text, removes this invalid information, and improves the quality and consistency of the data. The following steps are performed (a minimal code sketch follows the list):
(1) Deleting garbled and unrecognizable characters. For text containing garbled or unrecognizable characters, a filtering operation may be used to delete them or replace them with appropriate characters, which ensures the consistency and readability of the text.
(2) Word segmentation. The text is segmented into units of words or subwords, using spaces or punctuation marks as separators. Segmentation extracts semantically meaningful units and provides the input for subsequent text processing tasks.
(3) Punctuation handling. Punctuation provides semantic and structural information, so at this stage punctuation is retained for use in the emotion analysis task.
(4) Case conversion. Letters in the text are uniformly converted to upper or lower case to remove the influence of case on tasks such as text classification.
(5) Stop-word removal. Stop words are common words that appear frequently in text but usually carry little semantic information, such as prepositions, conjunctions and pronouns. Removing them reduces the feature dimensionality and improves model performance and computational efficiency.
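For illustration, a minimal preprocessing sketch in plain Python is given below. The stop-word list, the regular expressions and the example sentence are assumptions chosen for the example, not the exact rules used by the invention.

```python
import re

# Illustrative stop-word list; a real system would use a fuller,
# language-appropriate list (and a proper tokenizer for Chinese text).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is"}

def preprocess(text: str, keep_punct: bool = True) -> list:
    """Clean, segment, lower-case and de-stopword a raw text string."""
    # (1) drop garbled / unrecognizable characters, keeping letters, digits,
    #     whitespace and basic punctuation
    text = re.sub(r"[^\w\s.,!?;:'\-]", " ", text)
    # (4) case conversion
    text = text.lower()
    # (2)/(3) split into word tokens, optionally keeping punctuation as tokens
    pattern = r"\w+|[.,!?;:]" if keep_punct else r"\w+"
    tokens = re.findall(pattern, text)
    # (5) stop-word removal
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The voice sounds natural, and the face matches it!"))
# ['voice', 'sounds', 'natural', ',', 'face', 'matches', 'it', '!']
```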
Step three, text emotion analysis: the emotional tendency of the processed text content is recognized and understood so that the synthesized speech conveys an appropriate emotional color. By recognizing the emotional tendency of the text, the parameters and models of the speech synthesis system can be adjusted so that the generated speech better expresses the corresponding emotion in intonation, speaking rate, volume and so on. In application scenarios such as virtual assistants and emotional companion robots, analyzing the emotion of the text entered by the user allows the speech synthesis system to generate a matching emotional response and provide a more emotional and personalized interaction experience. The specific steps are as follows:
(1) The text is converted into a vector representation so that it can be fed to the CNN model. A word embedding model represents each word as a vector, and words are also converted to one-hot codes. Because text lengths differ, the sequences are normalized, for example by padding or truncating all texts to the same length; padding can be implemented by appending a special padding symbol (e.g. 0) at the end of the sequence.
(2) An appropriate convolutional neural network (CNN) is chosen and implemented in Python, and hyperparameters such as the number of layers and the number of neurons are selected according to the task requirements and the characteristics of the data set. The model's loss function (cross-entropy loss), optimizer (Adam or SGD) and evaluation metrics (accuracy, precision, recall) are then set.
(3) Texts are divided into positive, neutral and negative emotion classes as required, and the trained model is used to predict the emotion class of new text. The new text goes through the same preprocessing, is converted into a vector representation, and is fed to the trained model for prediction. (A minimal model sketch follows this list.)
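As an illustration of steps (1)-(3), the sketch below defines a small three-class text-emotion CNN. PyTorch is used here only as one possible framework; the vocabulary size, layer widths, kernel sizes and batch shapes are illustrative assumptions rather than the configuration used by the invention.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Minimal 3-class (negative / neutral / positive) text-emotion classifier."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # parallel convolutions with different window sizes over the token axis
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 100, kernel_size=k) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(3 * 100, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)                 # (B, E, T)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))                   # (B, 3) logits

model = EmotionCNN(vocab_size=10000)
criterion = nn.CrossEntropyLoss()                                 # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)         # Adam optimizer

token_ids = torch.randint(1, 10000, (8, 50))                      # texts padded/truncated to 50 tokens
labels = torch.randint(0, 3, (8,))                                # 0=negative, 1=neutral, 2=positive
loss = criterion(model(token_ids), labels)
loss.backward()
optimizer.step()
```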
Step four, constructing the acoustic model. An acoustic model with a variational autoencoder (VAE) takes the semantic vector of the text as input and generates an acoustic feature representation. The encoder of the VAE maps the semantic vector to a Gaussian distribution in the latent space, and a latent code is obtained by random sampling; the decoder of the VAE then maps the latent code to the acoustic feature representation. The pronunciation rules are fed to a pre-trained vocoder, which generates the feature representation of the speech signal from them. The specific steps are as follows:
(1) Data preparation: the text and speech data set used for training is assembled. The data set contains texts and the corresponding speech samples and is used to establish the text-to-speech mapping. The texts are those already classified after the processing described above, and the speech samples are likewise labeled as positive, neutral or negative in delivery; this labeling is done manually (for example, positive speech tends to be faster and higher-pitched, while negative speech is the opposite).
(2) Text encoder: a text encoder module is built that converts the vectorized text into a representation in the latent semantic space. The CNN model performs sequence encoding of the word or character embeddings, and on top of this the encoded sequence is further context-encoded, by aggregating the hidden states of multiple sequence encodings or via an attention mechanism; context encoding helps capture a more global and semantically rich text representation. The context-encoded text representation is then mapped to the latent semantic space through a fully connected layer to form the latent semantic representation; the dimension of the latent semantic space is usually kept low to reduce the dimensionality of the representation while capturing the main semantic information. The feed-forward network of the text encoder consists of two convolution layers, and these convolutions are equal-length ("same"-padded) convolutions.
(3) Duration predictor: the acoustic model and the vocoder are connected in series in the speech synthesis pipeline, and hidden variables rather than spectra are modeled stochastically; a stochastic duration predictor improves the diversity of the synthesized speech, so that the same input text can be synthesized into speech with different voices and rhythms. The specific flow is as follows: the stochastic duration predictor takes the output of the text encoder as input and outputs the logarithm of the phoneme durations; the text encoding tensor first passes through a one-dimensional convolution in a preprocessing step and then enters a neural spline flow that outputs the phoneme durations. During speech synthesis, different text segments may have different durations, so the role of the stochastic duration predictor is to predict a duration for each text segment in order to align the text and acoustic features correctly. For a given text sequence it predicts the duration of each segment, achieving proper duration control in the acoustic model, and by introducing randomness it makes the generated speech more natural and fluent: randomly sampled durations let the model better capture the natural intonation and prosody of speech. This design helps address the duration-control challenge of traditional speech synthesis and improves the naturalness and fluency of the synthesized speech.
(4) Variational autoencoder (VAE): the VAE structure is introduced into the acoustic model to achieve continuity and random sampling in the latent space. The VAE includes an encoder network and a decoder network and is trained by maximizing the likelihood of the observed data and minimizing the KL divergence of the latent space (a minimal sketch of this encoder-decoder structure follows step (5)).
(5) Adversarial training: to improve the generative capacity and naturalness of the acoustic model, an adversarial training mechanism is introduced. A discriminator network is constructed to distinguish generated acoustic features from real acoustic features, and the acoustic model is trained by minimizing the discriminator loss so that it generates more realistic acoustic features.
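The sketch below compresses the VAE idea of steps (2) and (4): a convolutional text encoder produces a latent Gaussian, a sampled latent code is decoded into mel-spectrogram frames, and a KL term regularizes the latent space. PyTorch, the layer sizes and the mel dimension are assumptions made for the example; the duration predictor of step (3) and the adversarial discriminator of step (5) are omitted.

```python
import torch
import torch.nn as nn

class TextVAEAcousticModel(nn.Module):
    """Sketch of step S4: conv text encoder -> latent Gaussian -> acoustic decoder.
    The real model would also upsample by predicted phoneme durations."""
    def __init__(self, vocab_size: int, embed_dim: int = 192,
                 latent_dim: int = 64, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # two equal-length ("same"-padded) convolutions as the feed-forward part
        self.encoder = nn.Sequential(
            nn.Conv1d(embed_dim, embed_dim, 5, padding=2), nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, 5, padding=2), nn.ReLU(),
        )
        self.to_mu = nn.Conv1d(embed_dim, latent_dim, 1)
        self.to_logvar = nn.Conv1d(embed_dim, latent_dim, 1)
        self.decoder = nn.Sequential(                    # latent code -> mel frames
            nn.Conv1d(latent_dim, 256, 5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, 1),
        )

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids).transpose(1, 2))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        mel = self.decoder(z)
        # KL(q(z|x) || N(0, I)), the latent-space regularizer of the VAE objective
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mel, kl

model = TextVAEAcousticModel(vocab_size=256)
mel, kl = model(torch.randint(0, 256, (2, 40)))          # batch of 2 sequences, 40 tokens
target = torch.randn_like(mel)                           # stand-in for ground-truth mels
loss = nn.functional.l1_loss(mel, target) + 0.01 * kl    # reconstruction + weighted KL
```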
Step five, using the decoder of the acoustic model, namely the HiFi-GAN V1 generator, which consists mainly of several groups of transposed convolutions, each followed by a multi-receptive-field fusion module composed of residual blocks of equal-size one-dimensional convolutions. The HiFi-GAN generator adopts a deep convolutional neural network structure comprising multiple convolution and upsampling layers; this hierarchy helps the model learn the time- and frequency-domain characteristics of the audio signal and generate high-quality synthesized audio. Residual connections, similar to the design in WaveNet, are also introduced; these connections allow the model to capture detailed and subtle changes in the audio signal and help improve the quality and naturalness of the synthesized audio.
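A toy version of this decoder structure is sketched below: each transposed convolution upsamples the sequence and is followed by a small multi-receptive-field residual stage built from same-length one-dimensional convolutions. The channel counts, kernel sizes and upsampling factors are assumptions and are much smaller than in the actual HiFi-GAN V1 generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRFResBlock(nn.Module):
    """One residual block of a multi-receptive-field fusion (MRF) stage:
    stacked same-length 1-D convolutions with increasing dilation."""
    def __init__(self, channels: int, kernel: int = 3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel,
                      dilation=d, padding=(kernel - 1) * d // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))           # residual connection
        return x

class TinyVocoderGenerator(nn.Module):
    """Toy HiFi-GAN-style generator: transposed-conv upsampling plus MRF stages."""
    def __init__(self, n_mels: int = 80, base: int = 128, upsample=(8, 8, 4)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, base, 7, padding=3)
        ups, mrfs, ch = [], [], base
        for r in upsample:                               # each factor r stretches the time axis by r
            ups.append(nn.ConvTranspose1d(ch, ch // 2, r * 2, stride=r, padding=r // 2))
            mrfs.append(MRFResBlock(ch // 2))
            ch //= 2
        self.ups, self.mrfs = nn.ModuleList(ups), nn.ModuleList(mrfs)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):
        x = self.pre(mel)
        for up, mrf in zip(self.ups, self.mrfs):
            x = mrf(up(F.leaky_relu(x, 0.1)))
        return torch.tanh(self.post(x))                  # waveform in [-1, 1]

wave = TinyVocoderGenerator()(torch.randn(1, 80, 50))    # -> (1, 1, 50 * 8 * 8 * 4)
```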
Step six, generating the digital human facial animation: the generated audio data is converted, with the help of a facial key-point tracking algorithm, into weight data that drives the digital human's facial blendshapes in real time. The specific steps are as follows:
(1) The speech analysis layer extracts the time-varying speech feature sequence, which then drives articulation; information is first extracted with a fixed-function autocorrelation analysis and then refined by 5 convolution layers. After training, this part of the network can extract short-term features of the human voice such as phonemes, intonation and accent.
(2) The emotion network, consisting of 5 convolution layers, analyzes the temporal evolution of the features and finally outputs an abstract feature vector describing the facial pose at the center of the audio window. This layer receives the emotional state as input, represented as an e-dimensional vector that is concatenated directly to the output of each layer of the network, enabling subsequent layers to adjust their behavior accordingly and disambiguating between different expressions and speaking styles. The result is not very sensitive to the exact number of layers or feature maps, but organizing the convolutions into two distinct stages is necessary to avoid overfitting.
(3) Key-point mapping: key points of the facial animation data are tracked. These key points are specific locations on the face, such as the eyes, mouth and eyebrows, and yield the final 116 blendshape weights. The output network is implemented as a pair of fully connected layers that perform a simple linear transformation of the data; the second layer is initialized with 150 pre-computed PCA components that together explain 99.9% of the variance seen in the training data. (A small sketch of such an output network follows this list.)
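The sketch below compresses steps (1)-(3) into a single small network that maps a window of per-frame audio features plus an emotion vector to 116 blendshape weights. It is a simplified illustration under several assumptions: the fixed-function autocorrelation front end is replaced by random stand-in features, the emotion vector is concatenated only once before the output layers rather than at every layer, and the PCA-based initialization of the second fully connected layer is only indicated in a comment.

```python
import torch
import torch.nn as nn

class AudioToBlendshapes(nn.Module):
    """Sketch of step S6: audio-feature window + emotion vector -> 116 blendshape weights."""
    def __init__(self, feat_dim: int = 32, frames: int = 64,
                 emotion_dim: int = 16, n_blendshapes: int = 116):
        super().__init__()
        # 5 convolution layers analysing the temporal evolution of the features
        self.articulation = nn.Sequential(
            nn.Conv1d(feat_dim, 72, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(72, 108, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(108, 162, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(162, 243, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(243, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        flat = 256 * (frames // 32)                      # temporal length after 5 stride-2 convs
        # output network: a pair of fully connected layers; in the described design
        # the second layer would be initialized from ~150 pre-computed PCA components
        self.output = nn.Sequential(
            nn.Linear(flat + emotion_dim, 150),
            nn.Linear(150, n_blendshapes),
        )

    def forward(self, audio_feats, emotion):
        h = self.articulation(audio_feats).flatten(1)    # (B, flat)
        h = torch.cat([h, emotion], dim=1)               # append the e-dimensional emotion state
        return self.output(h)                            # (B, 116) blendshape weights

weights = AudioToBlendshapes()(torch.randn(1, 32, 64), torch.randn(1, 16))
print(weights.shape)                                     # torch.Size([1, 116])
```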
Step seven, synchronizing the digital human voice and animation: the speech synthesis module and the mouth animation generation module are connected and coordinated to synchronize speech and mouth animation in real time. When a user enters text, the speech synthesis module generates the corresponding speech signal and passes it to the mouth animation generation module, which generates the corresponding mouth animation sequence in real time according to the acoustic characteristics and expression of the speech signal and presents it simultaneously with the speech, achieving synchronized presentation of the digital human voice and mouth animation. This provides a more realistic and natural digital human interaction experience and strengthens the user's immersion and emotional connection.
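One simple way to keep the two modules in step is to resample the blendshape weight track onto the playback timeline of the synthesized audio, as sketched below with NumPy; the sample rate, frame rates and array shapes are assumptions made for the example.

```python
import numpy as np

def animation_frame_times(num_samples: int, sample_rate: int = 22050, fps: int = 30):
    """Timestamps of the animation frames that cover the synthesized audio."""
    duration = num_samples / sample_rate                 # seconds of speech
    n_frames = int(round(duration * fps))
    return np.arange(n_frames) / fps

def resample_blendshapes(frame_times, weight_track, track_fps: int = 100):
    """Resample a blendshape weight track of shape (T, 116) onto the playback frames."""
    src_times = np.arange(weight_track.shape[0]) / track_fps
    return np.stack(
        [np.interp(frame_times, src_times, weight_track[:, k])
         for k in range(weight_track.shape[1])],
        axis=1,
    )

audio = np.zeros(44100)                                  # 2 s of synthesized audio at 22.05 kHz
track = np.random.rand(200, 116)                         # 2 s of blendshape weights at 100 fps
times = animation_frame_times(len(audio))
aligned = resample_blendshapes(times, track)             # (60, 116): one row per video frame
```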
And step eight, rendering and synthesizing the voice and the facial animation after synchronous adjustment to generate final digital human voice and facial animation.
The invention and its embodiments are described above without limitation, and the actual construction is not restricted to what is shown in the drawings. In summary, if a person of ordinary skill in the art, enlightened by this disclosure, devises without creative effort a structure or embodiment similar to this technical solution without departing from the gist of the invention, it shall fall within the scope of protection of the invention.

Claims (6)

1. A method for text generation of digital human voice and facial animation, comprising the steps of:
S1, collecting text material and audio of the target speakers, wherein the purpose of collecting audio of the target speakers is to capture the pronunciation characteristics and individual differences of the voices to be synthesized; by collecting a large number of audio samples from different target speakers, individual characteristics such as pronunciation habits, timbre, speaking rate and intonation can be captured, and these audio samples are used to train the speech synthesis model so that it can imitate the voice and vocal characteristics of different speakers;
s2, preprocessing a text, specifically comprising the following steps:
(1) Deleting garbled and unrecognizable characters: for text containing garbled or unrecognizable characters, a filtering operation is used to delete them or replace them with appropriate characters, which ensures the consistency and readability of the text;
(2) Word segmentation: the text is segmented into units of words or subwords, using spaces or punctuation marks as separators; segmentation extracts semantically meaningful units and provides the input for subsequent text processing tasks;
(3) Punctuation handling: punctuation provides semantic and structural information, so punctuation is retained for use in the emotion analysis task;
(4) Case conversion: letters in the text are uniformly converted to upper or lower case to eliminate the influence of case on text classification;
(5) Stop-word removal;
s3, text emotion analysis, which specifically comprises the following steps:
(1) Data vectorization: the text is converted into a vector representation so that it can be fed to the convolutional neural network model; a word embedding model represents each word as a vector, and words are also converted to one-hot codes; because text lengths differ, the sequences are normalized, for example by padding or truncating all texts to the same length, and padding can be implemented by appending a special padding symbol at the end of the sequence;
(2) Constructing the convolutional neural network model: an appropriate convolutional neural network is chosen (implemented in Python), hyperparameters such as the number of layers and the number of neurons are selected according to the task requirements and the characteristics of the data set, and the model's loss function (cross-entropy loss), optimizer (Adam or SGD) and evaluation metrics (accuracy, precision, recall) are set;
(3) Classification: texts are divided into positive, neutral and negative emotion classes as required, and the trained model is used to predict the emotion class of new text; the new text goes through the same preprocessing steps, is converted into a vector representation, and is fed to the trained model for prediction;
s4, constructing an acoustic model, which specifically comprises the following steps:
(1) Preparing data;
(2) A text encoder module that converts the vectorized text into a representation in a latent semantic space; a convolutional neural network model performs sequence encoding of the word or character embeddings, and on top of the sequence encoding the encoded sequence is further context-encoded, by aggregating the hidden states of multiple sequence encodings or via an attention mechanism, which helps capture a more global and semantically rich text representation; the context-encoded text representation is then mapped to the latent semantic space through a fully connected layer to form the latent semantic representation, the dimension of the latent semantic space usually being kept low to reduce the dimensionality of the representation while capturing the main semantic information; the feed-forward network of the text encoder consists of two convolution layers, and these convolutions are equal-length ("same"-padded) convolutions;
(3) A duration predictor: the acoustic model and the vocoder are connected in series in the speech synthesis pipeline, and hidden variables rather than spectra are modeled stochastically; a stochastic duration predictor improves the diversity of the synthesized speech, so that the same input text can be synthesized into speech with different voices and rhythms;
(4) A variational autoencoder: the structure of a variational autoencoder is introduced into the acoustic model to achieve continuity and random sampling in the latent space; the autoencoder comprises an encoder network and a decoder network and is trained by maximizing the likelihood of the observed data and minimizing the KL divergence of the latent space;
(5) Adversarial training: to improve the generative capacity and naturalness of the acoustic model, an adversarial training mechanism is introduced; a discriminator network is constructed to distinguish generated acoustic features from real acoustic features, and the acoustic model is trained by minimizing the discriminator loss so that it generates more realistic acoustic features;
S5, using the decoder of the acoustic model, namely the HiFi-GAN V1 generator, which consists mainly of several groups of transposed convolutions, each followed by a multi-receptive-field fusion module composed of residual blocks of equal-size one-dimensional convolutions;
S6, generating the digital human facial animation: the generated audio data is converted, with the help of a facial key-point tracking algorithm, into weight data that drives the digital human's facial blendshapes in real time, comprising the following steps:
(1) A speech analysis layer extracts a time-varying speech feature sequence that then drives articulation; information is first extracted with a fixed-function autocorrelation analysis and then refined by 5 convolution layers, after which the trained network can extract short-term features of the human voice such as phonemes, intonation and accent;
(2) An emotion network consisting of 5 convolution layers analyzes the temporal evolution of the features and finally outputs an abstract feature vector describing the facial pose at the center of the audio window; this layer receives the emotional state as input to disambiguate between different expressions and speaking styles, the emotional state being represented as an e-dimensional vector that is concatenated directly to the output of each layer of the network so that subsequent layers can adjust their behavior accordingly; the convolutions are organized into two distinct stages to avoid overfitting;
(3) Key-point mapping: key points of the facial animation data are tracked, these key points being specific locations on the face such as the eyes, mouth and eyebrows, to generate the final 116 blendshape weights; the output network is implemented as a pair of fully connected layers that perform a simple linear transformation of the data, and the second layer is initialized with 150 pre-computed PCA components that together explain 99.9% of the variance seen in the training data;
S7, synchronizing the digital human voice and animation: the speech synthesis module and the mouth animation generation module are connected and coordinated to synchronize speech and mouth animation in real time; when a user enters text, the speech synthesis module generates the corresponding speech signal and passes it to the mouth animation generation module, which generates the corresponding mouth animation sequence in real time according to the acoustic characteristics and expression of the speech signal and presents it simultaneously with the speech, achieving synchronized presentation of the digital human voice and mouth animation;
and S8, rendering and synthesizing the voice and the facial animation after synchronous adjustment to generate final digital human voice and facial animation.
2. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the text may contain noise, erroneous characters, garbled characters or special characters that would interfere with subsequent text processing tasks; text preprocessing cleans the text, removes such invalid information, and improves the quality and consistency of the data.
3. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the text emotion analysis recognizes and understands the emotional tendency of the processed text content so that the synthesized speech conveys an appropriate emotional color; by recognizing the emotional tendency of the text, the parameters and models of the speech synthesis system can be adjusted so that the generated speech better expresses the corresponding emotion in intonation, speaking rate, volume and so on, and by analyzing the emotion of the text entered by the user, the speech synthesis system can generate a matching emotional response and provide a more emotional and personalized interaction experience.
4. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the acoustic model is constructed with a variational autoencoder, the semantic vector of the text is taken as input, and an acoustic feature representation is generated; the encoder of the autoencoder maps the semantic vector to a Gaussian distribution in the latent space, a latent code is obtained by random sampling, and the decoder of the autoencoder then maps the latent code to the acoustic feature representation; the pronunciation rules are fed to a pre-trained vocoder, which generates the feature representation of the speech signal from them.
5. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the stop words are common words that appear frequently in text but usually carry little semantic information, such as prepositions, conjunctions and pronouns; removing the stop words reduces the feature dimensionality and improves model performance and computational efficiency.
6. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the data preparation builds the text and speech data set used for training, the data set comprising texts and corresponding speech samples and being used to establish the text-to-speech mapping; the texts are those already classified after the processing described above, and the speech samples are likewise labeled as positive, neutral or negative in delivery, this labeling being done manually.
CN202310831606.9A 2023-07-07 2023-07-07 Method for generating digital human voice and facial animation by text Pending CN116863038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310831606.9A CN116863038A (en) 2023-07-07 2023-07-07 Method for generating digital human voice and facial animation by text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310831606.9A CN116863038A (en) 2023-07-07 2023-07-07 Method for generating digital human voice and facial animation by text

Publications (1)

Publication Number Publication Date
CN116863038A true CN116863038A (en) 2023-10-10

Family

ID=88224725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310831606.9A Pending CN116863038A (en) 2023-07-07 2023-07-07 Method for generating digital human voice and facial animation by text

Country Status (1)

Country Link
CN (1) CN116863038A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058286A (en) * 2023-10-13 2023-11-14 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person
CN117058286B (en) * 2023-10-13 2024-01-23 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person
CN117173294A (en) * 2023-11-03 2023-12-05 之江实验室科技控股有限公司 Method and system for automatically generating digital person
CN117173294B (en) * 2023-11-03 2024-02-13 之江实验室科技控股有限公司 Method and system for automatically generating digital person
CN117217807A (en) * 2023-11-08 2023-12-12 四川智筹科技有限公司 Bad asset valuation algorithm based on multi-mode high-dimensional characteristics
CN117217807B (en) * 2023-11-08 2024-01-26 四川智筹科技有限公司 Bad asset estimation method based on multi-mode high-dimensional characteristics
CN117274450A (en) * 2023-11-21 2023-12-22 长春职业技术学院 Animation image generation system and method based on artificial intelligence
CN117274450B (en) * 2023-11-21 2024-01-26 长春职业技术学院 Animation image generation system and method based on artificial intelligence
CN117576279A (en) * 2023-11-28 2024-02-20 世优(北京)科技有限公司 Digital person driving method and system based on multi-mode data
CN117576279B (en) * 2023-11-28 2024-04-19 世优(北京)科技有限公司 Digital person driving method and system based on multi-mode data
CN117710543A (en) * 2024-02-04 2024-03-15 淘宝(中国)软件有限公司 Digital person-based video generation and interaction method, device, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN116863038A (en) Method for generating digital human voice and facial animation by text
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN106653052A (en) Virtual human face animation generation method and device
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
CN107993665A (en) Spokesman role determines method, intelligent meeting method and system in multi-conference scene
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN113538636B (en) Virtual object control method and device, electronic equipment and medium
Wang et al. Comic-guided speech synthesis
CN115330911A (en) Method and system for driving mimicry expression by using audio
CN116311456A (en) Personalized virtual human expression generating method based on multi-mode interaction information
CN115662435A (en) Virtual teacher simulation voice generation method and terminal
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
KR20190135853A (en) Method and system of text to multiple speech
van Rijn et al. VoiceMe: Personalized voice generation in TTS
CN110956859A (en) VR intelligent voice interaction English method based on deep learning
CN116129868A (en) Method and system for generating structured photo
CN116092472A (en) Speech synthesis method and synthesis system
CN112242134A (en) Speech synthesis method and device
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN112992116A (en) Automatic generation method and system of video content
CN116580721B (en) Expression animation generation method and device and digital human platform
CN115547296B (en) Voice synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination