CN116863038A - Method for generating digital human voice and facial animation by text - Google Patents
- Publication number: CN116863038A (application number CN202310831606.9A)
- Authority: CN (China)
- Prior art keywords: text, voice, emotion, animation, model
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
- G06F16/355: Class or cluster creation or modification (information retrieval of unstructured textual data)
- G06F40/151: Transformation; use of codes for handling textual entities
- G06F40/279: Recognition of textual entities
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30: Semantic analysis
- G06N3/045: Combinations of networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/0475: Generative networks
- G06N3/094: Adversarial learning
- G06T13/00: Animation
- G10L21/055: Time compression or expansion for synchronising with other signals, e.g. video signals
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention discloses a method for generating digital human voice and facial animation from text, comprising the following steps: S1, collecting text material and audio of the speakers to be synthesized; S2, preprocessing the text; S3, analyzing the emotion of the text; S4, constructing an acoustic model; S5, decoding with the acoustic model; S6, generating the digital human facial animation; S7, synchronizing the digital human voice and animation; S8, presenting the result. Compared with the prior art, the invention has the following advantages: 1. text can be converted into speech; 2. facial expressions and lip movements are generated from the text content and tone supplied by the user; 3. the digital human facial expression can convey emotion and intention, providing finer and more accurate emotional expression; 4. a more realistic and vivid human-computer interaction experience is achieved; 5. the text-to-speech and digital human face driving method can be applied in many fields.
Description
Technical Field
The invention relates to the technical field of generating digital humans from text, and in particular to a method for generating digital human voice and facial animation from text.
Background
With the rapid development of Virtual Reality (VR), Augmented Reality (AR) and Artificial Intelligence (AI), digital human technology has gradually become an important research direction in the field of human-computer interaction. A digital human is a computer-generated virtual character that can exhibit an appearance, actions, and interactive capabilities similar to those of a real human. Digital human technology is widely used in virtual assistants, virtual characters, game characters and so on, providing users with a more immersive and personalized interactive experience. However, achieving a realistic digital human interaction experience remains challenging. In particular, with respect to speech synthesis and facial animation, the prior art has the following problems:
Speech synthesis problem: traditional speech synthesis techniques often lack naturalness and fluency when synthesizing digital human speech, which sounds comparatively mechanical and artificial. Such synthesis cannot provide quality comparable to real human speech.
Facial animation problem: existing facial animation technology has room for improvement in accuracy and expressiveness. Conventional methods often generate facial animation from simple motion rules or by manual editing, making it difficult to match the speech accurately or to present natural facial expressions.
In summary, existing speech synthesis methods often cannot generate sufficiently natural and fluent digital human speech: the results sound mechanical and artificial and lack quality comparable to real human speech. Likewise, some facial animation generation methods struggle to match the speech accurately; traditional rule-based and manually edited animation cannot achieve precise facial motion, so the facial expression and the voice end up mismatched or unnatural.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the above defects and provide a method for generating digital human voice and facial animation from text.
To solve this problem, the technical scheme provided by the invention is as follows: a method for generating digital human voice and facial animation from text, comprising the following steps:
S1, collecting text material and audio of the speakers to be synthesized. The purpose of collecting this audio is to capture the pronunciation characteristics and individual differences of the voices to be synthesized. By collecting a large number of recordings from different speakers, individual characteristics such as pronunciation habits, timbre, speaking rate and intonation can be captured; these audio samples are used to train a speech synthesis model so that it can imitate the voice and vocal characteristics of different speakers;
S2, preprocessing the text, specifically comprising the following steps:
(1) Deleting garbled and unrecognizable characters: for text containing garbled or unrecognizable characters, a filtering operation deletes them or replaces them with suitable characters, ensuring the consistency and readability of the text;
(2) Word segmentation: the text is split into a sequence of words or subwords, using spaces or punctuation marks as separators; segmentation extracts semantically meaningful units and provides the input for subsequent text processing tasks;
(3) Punctuation handling: punctuation provides semantic and structural information, so punctuation is retained for use in the emotion analysis task;
(4) Case conversion: letters in the text are uniformly converted to upper or lower case to eliminate the influence of case on text classification;
(5) Stop-word removal;
S3, text emotion analysis, specifically comprising the following steps:
(1) Data vectorization: the text is converted into a vector representation suitable as input to a convolutional neural network model. A word embedding model represents each word as a vector, and words are converted into one-hot encodings. Because texts may differ in length, they are normalized, for example by padding or truncating all texts to the same length; padding can be implemented by appending a special padding symbol at the end of the sequence;
(2) Constructing the convolutional neural network model: an appropriate convolutional neural network, implemented in Python, is selected, and hyperparameters such as the number of layers and the number of neurons are chosen according to the task requirements and the characteristics of the data set. The model is configured with a loss function (cross-entropy loss), an optimizer (Adam or SGD) and evaluation metrics (accuracy, precision, recall);
(3) Classification: the texts are divided into positive, neutral and negative emotion as required, and the trained model performs emotion classification on new texts; each new text goes through the same preprocessing steps, is converted into a vector representation, and is fed into the trained model for prediction;
S4, constructing the acoustic model, specifically comprising the following steps:
(1) Preparing the data;
(2) Text encoder: a text encoder module converts the vectorized text into a representation in a latent semantic space. A convolutional neural network performs sequence encoding of the word or character embeddings; on top of the sequence encoding, the encoded sequence is further context-encoded, by aggregating the hidden states of multiple sequence encodings or through an attention mechanism, which helps capture a more global and semantically rich text representation. The context-encoded representation is then mapped into the latent semantic space by a fully connected layer to form the latent semantic representation; the dimension of the latent semantic space is usually kept low so as to reduce the dimensionality of the representation while capturing the main semantic information. The feed-forward network of the text encoder consists of two convolutional layers, and the convolutions it uses are length-preserving convolutions;
(3) Duration predictor: the acoustic model and the vocoder of the speech synthesis pipeline are connected in series, and latent variables rather than spectra are modeled stochastically; a stochastic duration predictor improves the diversity of the synthesized speech, so that the same input text can be synthesized with different voices and rhythms;
(4) Variational autoencoder: the structure of a variational autoencoder is introduced into the acoustic model to obtain a continuous latent space that supports random sampling. The autoencoder comprises an encoder network and a decoder network and is trained by maximizing the likelihood of the observed data while minimizing the KL divergence in the latent space;
(5) Adversarial training: to improve the generative capacity and naturalness of the acoustic model, an adversarial training mechanism is introduced. A discriminator network is built to distinguish generated acoustic features from real acoustic features, and the acoustic model is trained against the discriminator's loss function so that it produces more realistic acoustic features;
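The variational autoencoder in step (4) rests on two standard pieces of machinery: the reparameterization trick for sampling the latent code, and the closed-form KL divergence to a standard normal. The following stdlib-only sketch is illustrative; the function names and dimensions are ours, not part of the patent:

```python
import math
import random

def reparameterize(mu, log_var, rng=random.Random(0)):
    """Sample a latent code z = mu + sigma * eps, with eps ~ N(0, 1).

    Writing the sample this way keeps it differentiable with respect to
    mu and log_var, which is what makes the VAE trainable end to end.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.

    This is the term minimized during training alongside the data likelihood.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

When mu = 0 and log_var = 0 the latent distribution already equals the prior and the KL term vanishes, which is why the penalty pulls the latent space toward a continuous, samplable region.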
s5, using a decoder of an acoustic model, using a generator of HiFiGANV1, wherein the generator mainly comprises a plurality of groups of transposed convolutions, and each group of transposed convolutions is followed by a multi-receptive field fusion module, and the multi-receptive field fusion module mainly comprises residual modules formed by equal-size one-dimensional convolutions;
s6, generating digital human face animation, converting the generated audio data into weight data for driving the digital human face Blendshape in real time by using a human face key point tracking algorithm, and specifically comprising the following steps of:
(1) The human language analysis layer extracts a voice characteristic sequence which changes along with time, and then drives pronunciation, firstly uses an autocorrelation analysis function with a fixed function to extract information, then uses 5 convolution layers to extract information, and after training, the network of the layer can extract short-time characteristics in human voice: phonemes, intonation, accents, and specific phonemes;
(2) An emotion network consisting of 5 convolution layers, analyzing the time evolution of the features, and finally outputting an abstract feature vector describing the facial pose in the center of the audio window, this layer receiving as input the emotional states, disambiguating between different expressions and speaking styles, the emotional states being represented as an e-dimensional vector, which we directly connect to the output of each layer in the connection network, enabling the subsequent layers to change their behavior accordingly, organizing the convolutions into two different phases to avoid overfitting;
(3) Modifying key points, tracking facial animation data, which are specific locations of the face, such as eyes, mouth, eyebrows, etc., to generate the final 116 blendhooks, the output network is implemented as a pair of fully connected layers that perform a simple linear transformation on the data, initializing the second layer to 150 pre-computed PCA components that together account for the 99.9% variance seen in the training data;
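The output network in step (3) is just a linear transformation from the abstract feature vector to per-frame Blendshape weights. A minimal sketch with toy dimensions (2 features and 2 weights rather than the 116 Blendshapes and 150 PCA components of the text) might look like:

```python
def linear_blendshapes(features, weights, bias):
    """Apply one fully connected layer: each output Blendshape weight is
    the dot product of the feature vector with one weight row, plus a bias.

    In the described system the second such layer would be initialized from
    precomputed PCA components rather than at random.
    """
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]
```

Because the transform is linear, initializing its rows with PCA components amounts to starting the network in the subspace that already explains almost all facial variation seen in training.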
s7, synchronizing the digital human voice and the animation, wherein the voice synthesis module and the mouth animation generation module are connected and coordinated with each other to realize real-time synchronization of the voice and the mouth animation, when a user inputs a text, the voice synthesis module generates a corresponding voice signal and transmits the corresponding voice signal to the mouth animation generation module, and the mouth animation generation module generates a corresponding mouth animation sequence in real time according to the voice characteristics and the voice expression of the voice signal and presents the corresponding mouth animation sequence and the voice simultaneously, so that synchronous expression of the digital human voice and the mouth animation is realized;
and S8, rendering and synthesizing the voice and the facial animation after synchronous adjustment to generate final digital human voice and facial animation.
As an improvement, the text may contain noise, erroneous characters, garbled characters or special characters that would interfere with subsequent text processing tasks; text preprocessing cleans the text, removes such invalid information, and improves the quality and consistency of the data.
As an improvement, the text emotion analysis recognizes and understands the emotional tendency of the processed text content so that the synthesized speech conveys the appropriate emotional color. By recognizing the emotional tendency of the text, the parameters and models of the speech synthesis system can be adjusted so that the generated speech better expresses the corresponding emotion in intonation, speaking rate, volume and so on; by analyzing the emotion of the text a user inputs, the speech synthesis system can generate a corresponding emotional response, providing a more emotional and personalized interaction experience.
As an improvement, the acoustic model is built with a variational autoencoder: the semantic vector of the text is taken as input and an acoustic feature representation is generated. The encoder maps the semantic vector to a Gaussian distribution in the latent space, a latent code is obtained by random sampling, and the decoder then maps the latent code to the acoustic feature representation. The pronunciation rules are fed to a pre-trained vocoder, which generates the feature representation of the speech signal according to those rules.
As an improvement, stop words are common words that appear frequently in text but usually carry little semantic information, such as prepositions, conjunctions and pronouns; removing them reduces the feature dimensionality and improves model performance and computational efficiency.
As an improvement, the data is prepared as a text-and-speech data set for training. The data set comprises texts and corresponding speech samples; the texts have already been classified by the processing described above, and the speech is likewise divided into positive, neutral and negative classes, this part of the speech being labeled manually.
Compared with the prior art, the invention has the following advantages: 1. The invention converts text into speech and synchronizes facial expression and lip movement through a digital human face driving technique. This multimodal interaction provides users with a richer and more intuitive mode of communication and enhances both communication effectiveness and user engagement.
2. Through the digital human face driving technique, the invention generates facial expressions and lip movements that correspond to the text content and tone supplied by the user. This allows users to customize the appearance and demeanor of the digital human to better fit personal preferences and needs.
3. Digital human facial expressions can convey emotion and intention, providing finer and more accurate emotional expression. When an appropriate facial expression is displayed for the input text, the user can better understand and feel the digital human's response, strengthening the emotional resonance of the exchange.
4. By combining text-to-speech with the digital human face driving technique, the invention achieves a more realistic and vivid human-computer interaction experience. The user hears the speech output while observing the digital human's facial expression, which increases the realism and immersion of the interaction and improves user satisfaction and engagement.
5. The text-to-speech and digital human face driving method can be applied in many fields, such as virtual assistants, virtual characters, games and entertainment. It can enhance interaction in both real and virtual settings and provide a more natural and more interactive human-computer interface.
Drawings
Fig. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of a text emotion analysis flow of the present invention.
FIG. 3 is a schematic flow chart of the present invention for constructing an acoustic model.
Fig. 4 is a schematic flow diagram of a duration predictor of the present invention.
Fig. 5 is a schematic diagram of a digital human face animation generation flow of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to Figs. 1-5, a method for generating digital human voice and facial animation from text comprises the following steps:
the method comprises the steps of firstly collecting text materials and audios of speakers to be synthesized, collecting the audios of the speakers to be synthesized to capture pronunciation characteristics and individual differences of the voices to be synthesized, collecting a large number of different audios of the speakers to be synthesized, capturing individual characteristics such as pronunciation habits, tones, speech speeds and intonation of different people, training a voice synthesis model through the audio samples, enabling the voice synthesis model to simulate the voice and the voice characteristics of different speakers, learning to convert input texts into natural smooth voice output through simultaneous use of the texts and the audios of the speakers to be synthesized, personalizing synthesized voices according to the characteristics of the audios of the speakers to be synthesized, and providing voice synthesis results with more individuality and naturalness, so that a voice synthesis system is more accurate and lifelike when simulating different speakers and adapting to different contexts.
Step two: text preprocessing. The text may contain noise, erroneous characters, garbled characters or special characters that would interfere with subsequent text processing tasks. Through preprocessing, the text can be cleaned, invalid information removed, and the quality and consistency of the data improved. The following steps are performed:
(1) Delete garbled and unrecognizable characters. For text containing garbled or unrecognizable characters, a filtering operation deletes them or replaces them with suitable characters. This ensures the consistency and readability of the text.
(2) Word segmentation. The text is split into a sequence of words or subwords, using spaces or punctuation marks as separators. Segmentation extracts semantically meaningful units and provides the input for subsequent text processing tasks.
(3) Punctuation handling. Punctuation provides semantic and structural information, so at this stage punctuation is retained for use in the emotion analysis task.
(4) Case conversion. Letters in the text are uniformly converted to upper or lower case to eliminate the influence of case on tasks such as text classification.
(5) Stop-word removal. Stop words are common words that appear frequently in text but usually carry little semantic information, such as prepositions, conjunctions and pronouns. Removing them reduces the feature dimensionality and improves model performance and computational efficiency.
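The preprocessing steps (1) to (5) can be sketched in a few lines of Python. The character filter, token pattern and stop-word list below are illustrative assumptions rather than part of the described method:

```python
import re

# Illustrative stop-word subset; a real system would use a much fuller list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def preprocess(text):
    # (1) drop characters outside the expected set (garbled / unrecognizable)
    text = re.sub(r"[^\w\s.,!?']", "", text)
    # (4) case-fold so that "Happy" and "happy" become a single feature
    text = text.lower()
    # (2) + (3) split into word tokens, keeping punctuation as separate
    # tokens because it carries signal for the emotion analysis stage
    tokens = re.findall(r"\w+|[.,!?]", text)
    # (5) remove stop words, which carry little emotional information
    return [t for t in tokens if t not in STOP_WORDS]
```

Note the ordering: punctuation must survive the cleaning filter in step (1) so that it is still available as a feature when emotion analysis runs.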
Step three: text emotion analysis, i.e. recognizing and understanding the emotional tendency of the processed text content so that the synthesized speech conveys the appropriate emotional color. By recognizing the emotional tendency of the text, the parameters and models of the speech synthesis system can be adjusted so that the generated speech better expresses the corresponding emotion in intonation, speaking rate, volume and so on. In application scenarios such as virtual assistants and emotional companion robots, analyzing the emotion of the text a user inputs lets the speech synthesis system generate a corresponding emotional response, providing a more emotional and personalized interactive experience. The specific steps are:
(1) Convert the text into a vector representation suitable as input to the CNN model. A word embedding model represents each word as a vector, and words are converted into one-hot encodings. Because texts may differ in length, a normalization step pads or truncates all texts to the same length; padding is achieved by appending a special padding symbol (e.g. 0) at the end of the sequence.
(2) Select an appropriate convolutional neural network (CNN), implemented in Python, and choose hyperparameters such as the number of layers and the number of neurons according to the task requirements and the characteristics of the data set. Configure the model's loss function (cross-entropy loss), optimizer (Adam or SGD) and evaluation metrics (accuracy, precision, recall).
(3) Divide the texts into positive, neutral and negative emotion as required, and use the trained model to perform emotion classification on new texts. Each new text goes through the same preprocessing, is converted into a vector representation, and is fed into the trained model for prediction.
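The length normalization and one-hot encoding in step (1) can be shown concretely. The pad id 0 follows the text's suggestion of a special padding symbol; the function names are our own:

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Normalize a token-id sequence to a fixed length, since the CNN's
    input layer requires every example to have the same shape."""
    return (list(token_ids) + [pad_id] * max_len)[:max_len]

def one_hot(token_ids, vocab_size):
    """Convert each token id into a one-hot vector of size vocab_size."""
    return [[1 if i == t else 0 for i in range(vocab_size)]
            for t in token_ids]
```

A short sequence gains trailing pad ids while an over-long one is cut off, so every example reaches the network as a fixed-size matrix of one-hot rows.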
And fourthly, constructing an acoustic model, using the acoustic model with a Variational Automatic Encoder (VAE), taking the semantic vector of the text as input, and generating an acoustic characteristic representation. The encoder of the VAE maps the semantic vector to a gaussian distribution in a potential space and obtains the potential encoding by random sampling. The decoder of the VAE then maps the potential codes to the echographic feature representation. The pronunciation rules are input to a pre-trained speech coder (vocoder) that generates a representation of the characteristics of the speech signal based on the pronunciation rules. The method comprises the following specific steps:
(1) Data preparation prepares the text and speech data sets for training. The data set contains text and corresponding speech samples for establishing a text-to-speech mapping. The text is classified after the series of processing, and the voice is classified into three voice speech speeds of positive, neutral and negative, and the voice of the part is classified into manual classification, for example, the voice speed of the positive voice is faster, the voice tone is higher, and the negative voice is opposite.
(2) Text encoder. Build a text encoder module that converts the vectorized text into a representation in a latent semantic space. A CNN model is used to sequence-encode the word or character embeddings. On top of the sequence encoding, the encoded sequence is further context-encoded, either by aggregating the hidden states of multiple sequence encodings or through an attention mechanism; context encoding helps capture a more global and semantically richer text representation. The context-encoded text representation is then mapped to the latent semantic space through a fully connected layer to form the latent semantic representation. The dimension of the latent semantic space is typically low, so as to reduce the dimensionality of the representation while capturing the primary semantic information. The feed-forward network of the text encoder consists of two convolution layers, and the convolutions it uses are equal-length (length-preserving) convolutions.
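The "equal-length" convolutions mentioned above preserve the sequence length via symmetric zero-padding. A minimal numpy sketch (kernel values are illustrative, not trained weights):

```python
import numpy as np


def equal_length_conv1d(x, kernel):
    """1-D convolution with 'same' zero-padding: output length equals input length."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, pad)  # zero-pad both ends symmetrically
    return np.array([xp[i:i + k] @ kernel for i in range(len(x))])


x = np.arange(8, dtype=float)                          # a toy encoded sequence
y = equal_length_conv1d(x, np.array([0.25, 0.5, 0.25]))  # smoothing kernel
```

Because the output has the same length as the input, two such layers can be stacked in a feed-forward block without changing the time resolution of the text encoding.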
(3) Duration predictor. By connecting the acoustic model and the vocoder in series within the speech synthesizer and stochastically modeling hidden variables rather than spectra, a stochastic duration predictor improves the diversity of the synthesized speech: from the same input text, speech with different voices and rhythms can be synthesized. The specific flow is as follows: the stochastic duration predictor takes the output of the text encoder as input and outputs the logarithm of each phoneme duration. The text encoding tensor first passes through a one-dimensional convolution in a preprocessing step and then enters a neural spline flow, which outputs the phoneme durations. During speech synthesis, different text segments may have different durations, so the function of the stochastic duration predictor is to predict a duration for each text segment in order to properly align the text with the acoustic features. For a given text sequence, the predictor estimates the duration of each segment to achieve proper duration control in the acoustic model. By introducing randomness, the stochastic duration predictor makes the generated speech more natural and fluent; by sampling durations, the model can better capture the natural intonation and prosody of speech. This design helps address the challenge of duration control in traditional speech synthesis and improves the naturalness and fluency of synthesized speech.
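Once the predictor has emitted log-durations, aligning text and acoustic features amounts to repeating each phoneme encoding for its predicted number of frames. A hedged sketch (the encodings and durations here are toy values, and the rounding policy is an assumption):

```python
import numpy as np


def expand_by_duration(phoneme_enc, log_durations):
    """Repeat each phoneme encoding for its (rounded) predicted frame count."""
    frames = np.maximum(1, np.round(np.exp(log_durations))).astype(int)
    return np.repeat(phoneme_enc, frames, axis=0), frames


enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 phoneme encodings
log_d = np.log(np.array([2.0, 3.0, 1.0]))              # predicted log-durations
expanded, frames = expand_by_duration(enc, log_d)      # frame-aligned sequence
```

Sampling `log_d` from a flow instead of using fixed values is what gives different rhythms for the same input text.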
(4) Variational autoencoder (VAE). The VAE structure is introduced into the acoustic model to achieve continuity of, and random sampling from, the latent space. The VAE comprises an encoder network and a decoder network, and is trained by maximizing the likelihood of the observed data while minimizing the KL divergence in the latent space.
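The encoder-to-Gaussian mapping, the reparameterized sample, and the KL term described above can be sketched in numpy. The random matrices stand in for trained encoder/decoder networks, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

SEM_DIM, LATENT_DIM, ACOUSTIC_DIM = 64, 16, 80  # illustrative sizes

# Random projections stand in for the trained encoder and decoder.
W_mu = rng.standard_normal((SEM_DIM, LATENT_DIM)) * 0.1
W_logvar = rng.standard_normal((SEM_DIM, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, ACOUSTIC_DIM)) * 0.1


def vae_forward(semantic_vec):
    """Encoder -> Gaussian parameters -> reparameterized sample -> decoder."""
    mu = semantic_vec @ W_mu
    logvar = semantic_vec @ W_logvar
    eps = rng.standard_normal(LATENT_DIM)
    z = mu + np.exp(0.5 * logvar) * eps   # random sampling in the latent space
    acoustic = z @ W_dec                  # decoder maps the code to features
    # KL divergence of the approximate posterior from the unit Gaussian
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return acoustic, kl


acoustic, kl = vae_forward(rng.standard_normal(SEM_DIM))
```

Minimizing `kl` alongside a reconstruction term keeps the latent space continuous, which is what makes random sampling during synthesis well-behaved.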
(5) Adversarial training. To improve the generation capacity and naturalness of the acoustic model, an adversarial training mechanism is introduced. A discriminator network is constructed to distinguish generated acoustic features from real acoustic features, and the acoustic model is trained against the discriminator's loss function so that it generates more realistic acoustic features.
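The two opposing objectives can be written down with a plain binary cross-entropy; the scores below are toy values, and real setups (HiFi-GAN included) typically use least-squares or feature-matching variants instead:

```python
import numpy as np


def bce(pred, target):
    """Binary cross-entropy between discriminator scores and labels."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))


# Discriminator scores in (0, 1): probability that the features are real.
d_real = np.array([0.9, 0.8])   # scores on real acoustic features
d_fake = np.array([0.2, 0.1])   # scores on generated acoustic features

# The discriminator pushes real scores toward 1 and fake scores toward 0 ...
d_loss = bce(d_real, np.ones(2)) + bce(d_fake, np.zeros(2))
# ... while the generator (acoustic model) tries to make fakes look real.
g_loss = bce(d_fake, np.ones(2))
```

Here the discriminator is already doing well (low `d_loss`), so the generator's loss is large, which is the gradient signal that drives it toward more realistic features.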
Step five, using the acoustic model decoder, which is a HiFiGAN V1 generator. It consists mainly of several groups of transposed convolutions, each followed by a multi-receptive-field fusion module; the multi-receptive-field fusion module is mainly a residual module composed of equal-size one-dimensional convolutions. The HiFiGAN generator employs a deep convolutional neural network structure comprising multiple convolution layers and upsampling layers. These layers help the model learn the time-domain and frequency-domain features of the audio signal so as to generate high-quality synthesized audio. At the same time, residual connections similar to those in WaveNet are introduced; these connections allow the model to capture detailed and subtle changes in the audio signal, helping to improve the quality and naturalness of the synthesized audio.
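Each group of transposed convolutions multiplies the sequence length by its stride, so the product of the per-stage strides is the overall frame-to-sample upsampling factor. A small sketch (the factors `[8, 8, 2, 2]` are an illustrative hop size of 256, not a value fixed by this method):

```python
def upsample_length(n_frames, factors):
    """Each transposed-convolution stage multiplies the sequence length by
    its stride; the product of the strides is the overall hop size."""
    length = n_frames
    for f in factors:
        length *= f
    return length


factors = [8, 8, 2, 2]                     # illustrative per-stage strides
samples = upsample_length(100, factors)    # 100 feature frames -> audio samples
```

Choosing the factors so that their product matches the feature extractor's hop size is what lets the generator emit exactly one waveform sample per input position.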
Step six, generating the digital human facial animation. Using a facial key point tracking algorithm, the generated audio data is converted into weight data that drives the digital human's facial blendshapes in real time. The specific steps are as follows:
(1) The speech analysis layer extracts a time-varying speech feature sequence that then drives articulation. Information is first extracted using a fixed-function autocorrelation analysis and then refined using 5 convolution layers. After training, this layer's network can extract short-term features of the human voice: phonemes, intonation, accents, and specific phonemes.
(2) The emotion network, consisting of 5 convolution layers, analyzes the temporal evolution of the features and finally outputs an abstract feature vector describing the facial pose at the center of the audio window. This layer receives the emotional state as input, represented as an E-dimensional vector that we connect directly to the output of each layer of the network, enabling subsequent layers to change their behavior accordingly and to disambiguate between different expressions and utterances. The result is not very sensitive to the exact number of layers or feature maps, but we have found it necessary to organize the convolutions into two distinct stages to avoid overfitting.
(3) Key point refinement. We track the key points of the facial animation data. These key points are specific locations on the face, such as the eyes, mouth, and eyebrows, yielding the final 116 blendshapes. The output network is implemented as a pair of fully connected layers that perform a simple linear transformation on the data. We initialize the second layer with 150 pre-computed PCA components that together explain 99.9% of the variance seen in the training data.
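The pair of fully connected layers amounts to two matrix multiplications, with the second matrix initialized from the pre-computed PCA basis. A shape-level sketch under assumed dimensions (the 256-dimensional input is illustrative; the 150 components and 116 blendshape weights follow the description, and the random matrices stand in for learned/pre-computed values):

```python
import numpy as np

rng = np.random.default_rng(1)

FEAT_DIM, N_PCA, N_BLENDSHAPES = 256, 150, 116  # FEAT_DIM is an assumption

W1 = rng.standard_normal((FEAT_DIM, N_PCA)) * 0.01   # learned first layer
# Stand-in for the 150 pre-computed PCA components that initialize layer two.
pca_components = rng.standard_normal((N_PCA, N_BLENDSHAPES))


def output_network(features):
    """Two fully connected layers applying a simple linear transformation."""
    coeffs = features @ W1             # project onto the PCA coefficient space
    return coeffs @ pca_components     # reconstruct blendshape weights


weights = output_network(rng.standard_normal(FEAT_DIM))
```

Initializing the second layer from a PCA basis means that, at the start of training, the network already outputs plausible combinations of facial poses rather than noise.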
Step seven, synchronizing the digital human's voice and animation. Real-time synchronization of voice and mouth animation is achieved by connecting and coordinating the speech synthesis module and the mouth animation generation module. When a user inputs text, the speech synthesis module generates the corresponding speech signal and transmits it to the mouth animation generation module. The mouth animation generation module generates the corresponding mouth animation sequence in real time according to the acoustic characteristics and expression of the speech signal and presents it simultaneously with the speech, achieving synchronized expression of the digital human's voice and mouth animation. This approach provides a more realistic and natural digital human interaction experience and enhances the user's immersion and emotional connection.
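At its core, keeping voice and mouth animation in step is a timestamp mapping between audio samples and animation frames. A minimal sketch (the 16 kHz sample rate and 30 fps frame rate are illustrative defaults, not values specified by the method):

```python
def frame_to_sample(frame_idx, fps=30, sample_rate=16000):
    """Index of the first audio sample an animation frame should align with."""
    return round(frame_idx * sample_rate / fps)


def n_animation_frames(n_samples, fps=30, sample_rate=16000):
    """Number of mouth-animation frames needed to cover an audio clip."""
    return -(-n_samples * fps // sample_rate)  # ceiling division


start = frame_to_sample(30)           # one second in -> sample index 16000
frames = n_animation_frames(16000)    # one second of audio -> 30 frames
```

In a streaming setting, the animation module would use this mapping to schedule each blendshape-weight frame against the playback clock of the synthesized audio.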
Step eight, rendering and synthesizing the synchronized voice and facial animation to generate the final digital human voice and facial animation.
The invention and its embodiments have been described above without limitation, and the actual construction is not limited to the embodiments shown in the drawings. In summary, if a person of ordinary skill in the art, informed by this disclosure, devises a structure or embodiment similar to this technical solution without departing from the gist of the present invention, it shall fall within the scope of protection of the invention.
Claims (6)
1. A method for text generation of digital human voice and facial animation, comprising the steps of:
s1, collecting text materials and audio of the speakers to be synthesized, wherein the purpose of collecting the audio of the speakers to be synthesized is to capture the pronunciation characteristics and individual differences of the voices to be synthesized; by collecting a large number of different audio samples of the speakers to be synthesized, individual characteristics of different people such as pronunciation habits, timbre, speech rate and intonation can be captured, and these audio samples are used for training a speech synthesis model so as to simulate the voice and vocal characteristics of different speakers;
s2, preprocessing a text, specifically comprising the following steps:
(1) Deleting garbled and unrecognizable characters, wherein for text containing garbled or unrecognizable characters, a filtering operation is used to delete them or replace them with appropriate characters, which ensures the consistency and readability of the text;
(2) Word segmentation, namely splitting the text into a sequence of words or sub-words using spaces or punctuation marks as separators, wherein word segmentation extracts semantically meaningful units and provides input for subsequent text processing tasks;
(3) Punctuation processing, wherein punctuation can provide semantic and structural information, and punctuation is retained for use in the emotion analysis task;
(4) Case conversion, which uniformly converts letters in the text into upper-case or lower-case form so as to eliminate the influence of case on text classification;
(5) Stop word removal;
s3, text emotion analysis, which specifically comprises the following steps:
(1) Data vectorization, converting the text into vector representations to facilitate input to the convolutional neural network model, using a word embedding model to represent each word as a vector and converting words into one-hot encodings, wherein, since text lengths may differ, standardization can be performed, for example by padding or truncating all texts to the same length, and padding can be realized by appending a special padding symbol at the end of the sequence;
(2) Constructing a convolutional neural network model, selecting a suitable convolutional neural network implemented in Python, configuring hyperparameters such as the number of layers and the number of neurons according to the requirements of the task and the characteristics of the dataset, and setting the model's loss function: cross-entropy loss, optimizer: Adam or SGD, and evaluation metrics: accuracy, precision and recall;
(3) Classifying, namely classifying texts into positive emotion, neutral emotion and negative emotion according to requirements, performing emotion classification prediction on the new texts by using a trained model, performing a preprocessing step on the new texts, converting the new texts into vector representations, and inputting the vector representations into the trained model for prediction;
s4, constructing an acoustic model, which specifically comprises the following steps:
(1) Preparing data;
(2) A text encoder module configured to convert the vectorized text into a representation in a latent semantic space, to sequence-encode word or character embeddings using a convolutional neural network model, to further context-encode the encoded sequence on top of the sequence encoding, by aggregating hidden states of multiple sequence encodings or through an attention mechanism, so as to capture a more global and semantically richer text representation, and to subsequently map the context-encoded text representation to the latent semantic space through a fully connected layer to form the latent semantic representation, the dimension of the latent semantic space being generally low so as to reduce the dimensionality of the representation and capture the primary semantic information, the feed-forward network of the text encoder consisting of two convolution layers, and the convolutions employed by the feed-forward network in the text encoder being equal-length convolutions;
(3) A duration predictor, which connects the acoustic model and the vocoder in the speech synthesizer in series and stochastically models hidden variables rather than spectra, the stochastic duration predictor being utilized to improve the diversity of the synthesized speech so that, from the same input text, speech with different voices and rhythms can be synthesized;
(4) A variational autoencoder, whose structure is introduced into the acoustic model to achieve continuity of, and random sampling from, the latent space, the variational autoencoder comprising an encoder network and a decoder network and being trained by maximizing the likelihood of the observed data and minimizing the KL divergence in the latent space;
(5) Adversarial training, namely improving the generation capacity and naturalness of the acoustic model by introducing an adversarial training mechanism, constructing a discriminator network for distinguishing generated acoustic features from real acoustic features, and training the acoustic model against the discriminator's loss function so as to generate more realistic acoustic features;
s5, using a decoder of an acoustic model, using a generator of HiFiGANV1, wherein the generator mainly comprises a plurality of groups of transposed convolutions, and each group of transposed convolutions is followed by a multi-receptive field fusion module, and the multi-receptive field fusion module mainly comprises residual modules formed by equal-size one-dimensional convolutions;
s6, generating digital human face animation, converting the generated audio data into weight data for driving the digital human face Blendshape in real time by using a human face key point tracking algorithm, and specifically comprising the following steps of:
(1) A speech analysis layer extracting a time-varying speech feature sequence that then drives articulation, first extracting information using a fixed-function autocorrelation analysis and then refining it using 5 convolution layers, such that after training the network of this layer can extract short-term features of the human voice: phonemes, intonation, accents and specific phonemes;
(2) An emotion network consisting of 5 convolution layers, analyzing the time evolution of the features, and finally outputting an abstract feature vector describing the facial pose in the center of the audio window, this layer receiving as input the emotional states, disambiguating between different expressions and speaking styles, the emotional states being represented as an e-dimensional vector, which we directly connect to the output of each layer in the connection network, enabling the subsequent layers to change their behavior accordingly, organizing the convolutions into two different phases to avoid overfitting;
(3) Key point refinement, tracking key points of the facial animation data, which are specific locations on the face, such as the eyes, mouth and eyebrows, to generate the final 116 blendshapes, wherein the output network is implemented as a pair of fully connected layers that perform a simple linear transformation on the data, the second layer being initialized with 150 pre-computed PCA components that together explain 99.9% of the variance seen in the training data;
s7, synchronizing the digital human voice and the animation, wherein the voice synthesis module and the mouth animation generation module are connected and coordinated with each other to realize real-time synchronization of the voice and the mouth animation, when a user inputs a text, the voice synthesis module generates a corresponding voice signal and transmits the corresponding voice signal to the mouth animation generation module, and the mouth animation generation module generates a corresponding mouth animation sequence in real time according to the voice characteristics and the voice expression of the voice signal and presents the corresponding mouth animation sequence and the voice simultaneously, so that synchronous expression of the digital human voice and the mouth animation is realized;
and S8, rendering and synthesizing the voice and the facial animation after synchronous adjustment to generate final digital human voice and facial animation.
2. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the text may contain noise, wrong characters, garbled characters or special characters that would interfere with subsequent text processing tasks, and the text is cleaned through text preprocessing so that invalid information is removed and the quality and consistency of the data are improved.
3. The method for generating digital human voice and facial animation from text according to claim 1, wherein: according to the text emotion analysis, emotion tendencies in the text are recognized and understood according to the processed text content so as to convey proper emotion colors in the synthesized voice, parameters and models of a voice synthesis system can be adjusted by recognizing the emotion tendencies in the text, so that the generated voice can better express corresponding emotion in terms of intonation, speech speed, volume and the like, and the voice synthesis system can generate corresponding emotion response by analyzing emotion according to the text content input by a user, so that more emotional and personalized interaction experience is provided.
4. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the acoustic model is constructed using an acoustic model with a variational autoencoder, the semantic vector of the text is taken as input and an acoustic feature representation is generated, the encoder of the variational autoencoder maps the semantic vector to a Gaussian distribution in a latent space, a latent code is obtained through random sampling, the decoder of the variational autoencoder then maps the latent code to the acoustic feature representation, pronunciation rules are input into a pre-trained vocoder, and the vocoder generates feature representations of the speech signal according to the pronunciation rules.
5. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the stop words are common words which frequently appear in the text but usually do not carry too much semantic information, such as prepositions, conjunctions, pronouns and the like, and feature dimensions can be reduced by removing the stop words, so that model effect and calculation efficiency are improved.
6. The method for generating digital human voice and facial animation from text according to claim 1, wherein: the data preparation prepares text and speech datasets for training, the dataset comprises texts and corresponding speech samples and is used for establishing a mapping relationship between text and speech, the texts at this stage have already been classified after the series of processing steps, the speech is likewise divided into positive, neutral and negative classes by speaking rate, and this part of the speech is classified manually.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310831606.9A CN116863038A (en) | 2023-07-07 | 2023-07-07 | Method for generating digital human voice and facial animation by text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116863038A true CN116863038A (en) | 2023-10-10 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117058286A (en) * | 2023-10-13 | 2023-11-14 | 北京蔚领时代科技有限公司 | Method and device for generating video by using word driving digital person |
CN117173294A (en) * | 2023-11-03 | 2023-12-05 | 之江实验室科技控股有限公司 | Method and system for automatically generating digital person |
CN117217807A (en) * | 2023-11-08 | 2023-12-12 | 四川智筹科技有限公司 | Bad asset valuation algorithm based on multi-mode high-dimensional characteristics |
CN117274450A (en) * | 2023-11-21 | 2023-12-22 | 长春职业技术学院 | Animation image generation system and method based on artificial intelligence |
CN117576279A (en) * | 2023-11-28 | 2024-02-20 | 世优(北京)科技有限公司 | Digital person driving method and system based on multi-mode data |
CN117710543A (en) * | 2024-02-04 | 2024-03-15 | 淘宝(中国)软件有限公司 | Digital person-based video generation and interaction method, device, storage medium, and program product |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||