CN117711042A - Method and device for generating broadcast video of digital person based on driving text - Google Patents

Method and device for generating a broadcast video of a digital person based on driving text

Info

Publication number
CN117711042A
Authority
CN
China
Prior art keywords
target
text
emotion
driving
data
Prior art date
Legal status
Pending
Application number
CN202311561042.8A
Other languages
Chinese (zh)
Inventor
周丹
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN202311561042.8A
Publication of CN117711042A

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application provides a method and a device for generating a broadcast video of a digital person based on a driving text. The method includes: acquiring a target voice based on the driving text, and inputting the target voice into a preset face driving model to obtain target face animation data, where the target voice expresses the emotion information contained in the driving text; acquiring the emotion information contained in the driving text; determining, in a preset expression database, target face base emotion data matched with the emotion information; and generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data. In this way, rich facial animation can be generated and the expressions of the virtual digital person in the broadcast video become more realistic and smooth, which improves the realism, expressiveness and emotional expression of the virtual digital person.

Description

Method and device for generating broadcast video of digital person based on driving text
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a method and a device for generating a broadcast video of a digital person based on a driving text.
Background
With the rise of concepts such as the metaverse, virtual reality and virtual digital humans, and the rapid development of artificial intelligence technology, virtual digital humans have been applied in fields such as voice broadcasting, news broadcasting and customer service introductions. For speech-driven animation, existing technologies rely on traditional linguistics-based models or neural network-based models. Although these technologies have made some progress, challenges remain. For example, one class of existing technologies first collects audio data produced by a biological subject, performs feature extraction on the audio data to obtain multi-modal features, generates target action data of the subject based on the multi-modal features, and drives the avatar corresponding to the subject based on the target action data; this approach requires collecting real speech data with emotion, which leads to a long production cycle and relatively high labor cost. Another class of existing technologies considers only the synchronization between the mouth shape and the text during lip-sync driving of the virtual human, without studying the consistency between the virtual human's emotional expression and the driving text; the speech synthesis technology used does not take the emotional factors in the driving text into account, so the generated speech has a flat, broadcast-style tone that differs from speech with genuine emotion and is less realistic.
Disclosure of Invention
The embodiments of the present application provide a method and a device for generating a broadcast video of a digital person based on a driving text, which are used to solve the technical problem in the related art that, when text is used to drive a digital person's broadcast, the emotion information of the text is not considered, so the final broadcast video is not realistic and lacks expressiveness.
In order to solve the above technical problems, the embodiments of the present application provide the following aspects:
in a first aspect, an embodiment of the present application provides a method for generating a broadcast video of a digital person based on driving text, the method including:
acquiring target voice based on a driving text, and inputting the target voice into a preset face driving model to obtain target face animation data; wherein, the target voice can express emotion information contained in the driving text;
acquiring emotion information contained in the driving text;
determining target face base emotion data matched with the emotion information in a preset expression database;
and generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
Optionally, obtaining the target voice based on the driving text includes:
Acquiring a phoneme sequence feature vector and a position feature vector of the driving text;
splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
extracting the context characteristics of the driving text to obtain a context characteristic vector;
performing feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
decoding the enhanced feature vector to obtain a mel frequency spectrum;
the target speech is generated based on the mel-frequency spectrum and the vocoder.
Optionally, acquiring emotion information contained in the driving text includes:
inputting the driving text into a preset emotion analysis model to obtain emotion information contained in the driving text;
before the driving text is input into a preset emotion analysis model to obtain emotion information contained in the driving text, the method further comprises the steps of:
inputting a text corpus of a source domain and a text corpus of a target domain into a latent Dirichlet allocation (LDA) model to obtain a source domain first topic feature corresponding to the source domain and a target domain first topic feature corresponding to the target domain;
calculating the information quantity and the information loss quantity between the source domain first topic feature and the target domain first topic feature;
optimizing the LDA model based on the information quantity and the information loss quantity, wherein the optimized LDA model better fits the data distribution of the target domain;
re-inputting the text corpus of the source domain and the text corpus of the target domain into the optimized LDA model to obtain a source domain second topic feature corresponding to the source domain and a target domain second topic feature corresponding to the target domain;
inputting the text corpus of the source domain into a bidirectional encoder representation (BERT) model to obtain semantic features of the source domain;
combining the semantic features, the source domain second topic feature and the target domain second topic feature to generate a word vector matrix;
training a preset emotion classifier based on the word vector matrix to obtain the emotion analysis model.
Optionally, before determining the target face base emotion data matched with the emotion information in a preset expression database, the method further includes:
The expression database is established, which comprises the following steps:
acquiring a text corpus used for model training and emotion tags corresponding to each text corpus in the text corpus one by one;
generating facial expression data corresponding to the emotion label according to the emotion label;
and storing the facial expression data in a database, and determining the database as the expression database.
Optionally, before inputting the target voice into a preset face driving model to obtain target face animation data, the method further includes:
training a deep learning network model based on a voice data set and a face model data set to obtain the face driving model;
wherein each voice in the voice data set and each face model data in the face model data set are in one-to-one correspondence.
Optionally, generating the target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data includes:
taking the target face base emotion data as base expression data;
taking the target voice as the timing reference, and performing fusion rendering on the base expression data and the target face animation data to obtain a mouth shape animation matched with the target voice and a facial expression matched with the emotion information;
and synthesizing the mouth shape animation, the facial expression and the target voice to obtain the target broadcast video.
In a second aspect, an embodiment of the present application provides an apparatus for generating a broadcast video of a digital person based on driving text, the apparatus including:
the first generation module is used for acquiring target voice based on the driving text, inputting the target voice into a preset face driving model and obtaining target face animation data; wherein, the target voice can express emotion information contained in the driving text;
the acquisition module is used for acquiring emotion information contained in the driving text;
the determining module is used for determining target face base emotion data matched with the emotion information in a preset expression database;
and the second generation module is used for generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
Optionally, the first generating module is further configured to obtain a phoneme sequence feature vector and a position feature vector of the driving text;
splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
Extracting the context characteristics of the driving text to obtain a context characteristic vector;
performing feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
decoding the enhanced feature vector to obtain a mel frequency spectrum;
the target speech is generated based on the mel-frequency spectrum and the vocoder.
Optionally, the obtaining module is further configured to input the driving text into a preset emotion analysis model to obtain emotion information contained in the driving text;
the apparatus further comprises:
the second determining module is used for inputting a text corpus of a source domain and a text corpus of a target domain into a latent Dirichlet allocation (LDA) model before the driving text is input into the preset emotion analysis model to obtain the emotion information contained in the driving text, so as to obtain a source domain first topic feature corresponding to the source domain and a target domain first topic feature corresponding to the target domain;
the computing module is used for computing the information quantity and the information loss quantity between the source domain first topic feature and the target domain first topic feature;
the model optimization module is used for optimizing the LDA model based on the information quantity and the information loss quantity, wherein the optimized LDA model better fits the data distribution of the target domain;
the second determining module is further configured to re-input the text corpus of the source domain and the text corpus of the target domain into the optimized LDA model, to obtain a source domain second topic feature corresponding to the source domain and a target domain second topic feature corresponding to the target domain;
the third determining module is used for inputting the text corpus of the source domain into a bidirectional encoder representation (BERT) model to obtain semantic features of the source domain;
the third generation module is used for combining the semantic features, the source domain second topic feature and the target domain second topic feature to generate a word vector matrix;
and a fourth determining module, configured to train a preset emotion classifier based on the word vector matrix, so as to obtain the emotion analysis model.
Optionally, the apparatus further includes:
the establishing module is used for establishing an expression database before determining target face base emotion data matched with the emotion information in the preset expression database, and comprises the following steps: acquiring a text corpus used for model training and emotion tags corresponding to each text corpus in the text corpus one by one; generating facial expression data corresponding to the emotion label according to the emotion label; and storing the facial expression data in a database, and determining the database as the expression database.
Optionally, the apparatus further includes:
a fifth determining module, configured to train a deep learning network model based on a speech data set and a face model data set before inputting the target speech into a preset face driving model to obtain target face animation data, so as to obtain the face driving model;
wherein each voice in the voice data set and each face model data in the face model data set are in one-to-one correspondence.
Optionally, the second generating module is configured to take the target face base emotion data as base expression data; take the target voice as the timing reference and perform fusion rendering on the base expression data and the target face animation data to obtain a mouth shape animation matched with the target voice and a facial expression matched with the emotion information; and synthesize the mouth shape animation, the facial expression and the target voice to obtain the target broadcast video.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method of generating a broadcast video of a digital person based on driving text as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of generating a digital person's broadcast video based on driving text according to the first aspect.
Therefore, in the method, the target voice is first obtained based on the driving text; emotion analysis is then performed on the driving text to obtain the face base emotion data corresponding to the emotion label; the target voice is passed through the face driving model to generate face animation data; and the face animation data is fused with the face base emotion data obtained from the emotion analysis of the driving text and with the target voice, finally generating facial animation consistent with the emotional style of the driving text. Compared with the related art, this has the following advantages: the emotion information of the driving text is fully utilized to generate the subsequent target voice, target face animation data and target face base emotion data, and these are fused to generate the broadcast video of the digital person. Rich facial animation can therefore be generated, the expressions of the virtual digital person become more realistic and smooth, and the realism, emotional expressiveness and overall expressive power of the virtual digital person are improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 2 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 3 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 4 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 5 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 6 is a flowchart of a method for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
fig. 7 is a block diagram of a device for generating a broadcast video of a digital person based on a driving text according to an embodiment of the present application;
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Fig. 1 shows a method for generating a broadcast video of a digital person based on driving text according to an embodiment of the present application, where, as shown in fig. 1, the method includes:
step S101, acquiring target voice based on a driving text, and inputting the target voice into a preset face driving model to obtain target face animation data;
the target voice can express emotion information contained in the driving text;
step S102, acquiring emotion information contained in a driving text;
step S103, determining target face base emotion data matched with emotion information in a preset expression database;
and step S104, generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
In step S101, the driving text may be subjected to operations such as feature extraction, encoding, and decoding based on the speech synthesis model, so as to generate a target speech matching the emotion scene of the text. And specifically, as shown in fig. 2, acquiring a target voice based on a driving text includes:
step S201, obtaining a phoneme sequence feature vector and a position feature vector of a driving text;
step S202, splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
step S203, extracting the context characteristics of the driving text to obtain a context characteristic vector;
step S204, carrying out feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
step S205, decoding the enhancement feature vector to obtain a Mel frequency spectrum;
step S206, generating target voice based on the Mel frequency spectrum and the vocoder.
In steps S201 to S202, the syllables, tones and pause durations corresponding to each character in the driving text are converted into a phoneme sequence, and the phoneme sequence is input into the phoneme embedding layer of the speech synthesis model to obtain the phoneme sequence feature vector. The position of each phoneme of the phoneme sequence in the driving text is marked; according to the marked position information and the phoneme sequence, a position-encoded text is obtained using one-hot encoding, and the position-encoded text is input into the positional encoding of the speech synthesis model to obtain the position feature vector. The phoneme sequence feature vector and the position feature vector are spliced and fused, and then input into the encoder for encoding; the resulting fused feature vector contains the prosodic features and latent position features of the driving text.
In step S203, the driving text is input into a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model for feature extraction to obtain a sentence vector for each sentence of text, where the sentence vector contains the prosodic features and emotion features corresponding to that sentence. The sentence vectors are input into a convolution layer for feature extraction and dimension mapping, and the dimension-mapped feature vectors are input into a bidirectional LSTM (Long Short-Term Memory) network for learning to obtain feature vectors carrying context features. These feature vectors are then input into a second convolution layer for feature extraction and dimension mapping to obtain the final context feature vector, so that the speech synthesis model can fully learn the emotion, prosodic features and other information in the text.
In step S204, the phoneme location feature vector and the context feature vector may be spliced, feature enhanced, and expanded to obtain an enhanced feature vector. Specifically, after the phoneme position feature vector and the context feature vector are spliced and fused, the feature vector is input into a variance adapter of a speech synthesis model to be subjected to feature enhancement and expansion, and then an enhanced feature vector is obtained.
In step S205 to step S206, the enhancement feature vector may be decoded to obtain a mel spectrum corresponding to the target speech, and the mel spectrum may be mapped and converted into the target speech. Specifically, the enhancement feature vector is input into a decoder of a speech synthesis model for decoding processing, a mel spectrum corresponding to the target speech is obtained, the mel spectrum is mapped into a sound waveform by a vocoder, and then the target speech is generated based on the sound waveform by using a speech synthesis technology.
In summary, with the method shown in fig. 2, the phoneme sequence and position information of the driving text are extracted, the driving text is input into the speech synthesis model for context feature extraction, the extracted features are spliced and enhanced to obtain an enhanced feature vector, the enhanced feature vector is input into the decoder for decoding to obtain the mel spectrum corresponding to the target voice, and the mel spectrum is then mapped and converted into the target voice, which conforms to the emotion, prosodic features and other information in the driving text. In this way, the context features extracted from the driving text, including its prosodic and emotion features, are fused, enhanced and applied in the generation of the target voice, so that the generated target voice better matches the emotional style of the text.
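To make the pipeline of steps S201 to S206 concrete, the following is a minimal sketch of such a text-to-mel front end in PyTorch. The layer sizes, module names and the simplified variance adapter are illustrative assumptions, not the patent's actual implementation; a neural vocoder would then map the predicted mel spectrum to the waveform.

```python
# Minimal sketch of the text-to-speech front end described above (FastSpeech2-style).
# All layer sizes, module names and the toy variance adapter are illustrative assumptions.
import torch
import torch.nn as nn

class TextToMel(nn.Module):
    def __init__(self, n_phonemes=80, max_len=512, d_model=256, n_mels=80, d_context=768):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)      # phoneme sequence feature vector
        self.pos_emb = nn.Embedding(max_len, d_model)             # position feature vector (from marked positions)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.context_proj = nn.Linear(d_context, d_model)         # maps the BERT/BiLSTM context vector
        self.variance_adapter = nn.Sequential(                    # feature enhancement / expansion (simplified)
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.mel_head = nn.Linear(d_model, n_mels)                # predicts the mel spectrum

    def forward(self, phoneme_ids, context_vec):
        # phoneme_ids: (B, T) int64; context_vec: (B, d_context) sentence-level feature
        positions = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device).unsqueeze(0)
        x = self.phoneme_emb(phoneme_ids) + self.pos_emb(positions)   # splice/fuse phoneme + position features
        x = self.encoder(x)
        x = x + self.context_proj(context_vec).unsqueeze(1)           # fuse context (emotion/prosody) features
        x = self.variance_adapter(x)                                  # enhanced feature vector
        x = self.decoder(x)
        return self.mel_head(x)                                       # (B, T, n_mels) mel spectrum

model = TextToMel()
mel = model(torch.randint(0, 80, (1, 32)), torch.randn(1, 768))
print(mel.shape)  # torch.Size([1, 32, 80]); a vocoder (e.g. HiFi-GAN) would map this to a waveform
```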
In addition, it should be noted that, in step S101, the preset speech synthesis model is pre-trained with sample text data and the corresponding sample mel spectrum data. Specifically, the sample text data is processed through steps S201 to S204 shown in fig. 2 to obtain the predicted mel spectrum corresponding to the text, a loss value between the real mel spectrum and the predicted mel spectrum is calculated based on a loss function, and the speech synthesis model is trained through a back propagation algorithm.
In step S101, after generating the target voice, the target voice is further input into a preset face driving model to obtain target face animation data. Before the target voice is input into the preset face driving model to obtain the target face animation data, the method further comprises the following steps: training the deep learning network model based on the voice data set and the face model data set to obtain a face driving model; wherein each voice in the voice data set and each face model data in the face model data set are in one-to-one correspondence.
Referring now to the process of model training and generating face animation data, fig. 3 is a flowchart of generating face animation data based on a voice-driven model by target voice according to an embodiment of the present application, as shown in fig. 3, where the method includes:
Step S301, performing multi-layer convolution operation on an audio digital signal data set to extract an audio feature vector;
step S302, inputting the audio feature vector into a linear interpolation layer, an encoding layer, a linear mapping layer and a decoding layer in sequence, and finally outputting the facial animation information;
step S303, calculating the face reconstruction loss based on a mean absolute error formula and the adversarial loss based on an adversarial network, and training to obtain the voice driving model;
and step S304, predicting, with the pre-trained voice driving model, the face animation data corresponding to the target voice.
In step S301, the audio digital signal data set may be input into a tone analysis layer and a pronunciation analysis layer in sequence, each trained with 5 convolution layers; after feature selection by a pooling layer, the audio feature vector is output through two fully connected layers, where the feature vector contains intonation, key phonemes, association features between adjacent frames, and the like.
In step S302, the encoding layer and the decoding layer each contain a multi-head attention layer and a feed-forward neural network layer, where a residual connection is provided between the input and the output of the feed-forward neural network layer, followed by a normalization operation. The audio feature vector is first processed by the linear interpolation layer and then input into the encoding layer to obtain encoded information; the encoded audio information is processed by the linear mapping layer and passed through the decoding layer, which finally outputs the facial animation information.
In step S303, the face reconstruction loss is calculated based on the mean absolute error formula, the adversarial loss is obtained based on an adversarial network, and the reconstruction loss and the adversarial loss are fused to obtain the face animation loss; the model parameters are adjusted so that the face animation loss converges, and the face animation model is finally obtained by training.
In step S304, the audio data set is input into the convolutional neural network for multi-layer convolution to extract features, the pooling layer performs feature selection, and the fully connected layers output the feature vectors, which are then input into the linear interpolation layer, the encoding layer, the linear mapping layer and the decoding layer in sequence to output the facial animation information. The face reconstruction loss is calculated based on the mean absolute error formula, the adversarial loss is obtained based on the adversarial network, and the two are fused to obtain the face animation loss; the face animation model is obtained by training, and the face animation data corresponding to the target voice is then predicted with the pre-trained voice driving model.
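As an illustration of steps S301 to S304, the following is a minimal sketch of a speech-driven face animation model and its training loss. The convolutional feature extractor, the attention encoder/decoder sizes, the blendshape output dimension and the LSGAN-style adversarial term are assumptions for the sketch and do not reproduce the patent's exact network.

```python
# Minimal sketch of the speech-driven face animation model (steps S301-S304).
import torch
import torch.nn as nn

class AudioToFace(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_blendshapes=52):
        super().__init__()
        # multi-layer convolution over the audio features (intonation / key-phoneme cues)
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # multi-head attention + FFN with residuals
        self.proj = nn.Linear(d_model, d_model)                     # linear mapping before decoding
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_blendshapes)               # per-frame facial animation parameters

    def forward(self, audio_feats):                  # audio_feats: (B, T, n_mels)
        x = self.conv(audio_feats.transpose(1, 2)).transpose(1, 2)
        x = self.decoder(self.proj(self.encoder(x)))
        return self.head(x)                          # (B, T, n_blendshapes)

# Training objective: face reconstruction loss (mean absolute error) plus an adversarial term.
def face_animation_loss(pred, target, disc_score_fake, adv_weight=0.1):
    recon = torch.mean(torch.abs(pred - target))      # mean absolute error (reconstruction loss)
    adv = torch.mean((disc_score_fake - 1.0) ** 2)    # LSGAN-style generator loss as the adversarial term
    return recon + adv_weight * adv

model = AudioToFace()
frames = model(torch.randn(1, 100, 80))
print(frames.shape)  # torch.Size([1, 100, 52])
```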
In one possible implementation manner, step S102, obtaining emotion information contained in the driving text includes: and inputting the driving text into a preset emotion analysis model to obtain emotion information contained in the driving text. Alternatively, the emotion analysis model may be a cross-domain text emotion analysis model, and before inputting the driving text into the preset emotion analysis model to obtain emotion information contained in the driving text, as shown in fig. 4, the method further includes:
Step S401, inputting a text corpus of a source domain and a text corpus of a target domain into an LDA (Latent Dirichlet Allocation) model to obtain a source domain first topic feature corresponding to the source domain and a target domain first topic feature corresponding to the target domain;
step S402, calculating the information quantity and the information loss quantity between the source domain first topic feature and the target domain first topic feature;
step S403, optimizing the LDA model based on the information quantity and the information loss quantity;
optionally, the LDA model can be optimized using a cost flow model, and the optimized LDA model better fits the data distribution of the target domain;
step S404, re-inputting the text corpus of the source domain and the text corpus of the target domain into the optimized LDA model to obtain a source domain second topic feature corresponding to the source domain and a target domain second topic feature corresponding to the target domain;
step S405, inputting the text corpus of the source domain into a BERT model to obtain semantic features of the source domain;
step S406, combining the semantic features, the source domain second topic feature and the target domain second topic feature to generate a word vector matrix;
step S407, training a preset emotion classifier based on the word vector matrix to obtain the emotion analysis model.
It should be noted that, by the method shown in fig. 4, an emotion analysis model may be obtained based on training the text corpus data set, then step S102 may be performed to obtain emotion information contained in the driving text, and then step S103 may be performed to determine, in a preset expression database, target face base emotion data matched with the emotion information, that is, perform emotion analysis on the driving text to obtain corresponding target face base emotion data in the expression database.
In step S401, the preprocessed text corpora may be fed into the LDA model to obtain the internal topic relations of the source domain and the target domain respectively. Specifically, parameter estimation can be performed on the source domain corpus and the target domain corpus based on the LDA model using the Gibbs sampling algorithm, so as to obtain the topic relations within each domain, namely a document-topic distribution matrix and a topic-word distribution matrix.
In steps S402-S403, a minimum-cost maximum-flow framework is established, the information quantity between cross-domain features and the information loss during cross-domain feature transfer are calculated, and the information of the target domain is fully utilized to optimize the source domain LDA model. The normalized pointwise mutual information is used as the information quantity between cross-domain features in the corpus, and the KL divergence D_KL(p_t || p_s) is used to measure the amount of information lost when information is transferred across domain features. The document layer, topic layer and feature layer of the source domain and target domain corpora are established under the minimum-cost maximum-flow model framework, and the cross-domain graph model is completed under this framework according to the pointwise mutual information and the information loss. The optimized topic vector μ is then obtained by solving the minimum-cost maximum-flow model.
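The two quantities used above can be sketched as follows: normalized pointwise mutual information as the cross-domain information quantity, and the KL divergence D_KL(p_t || p_s) as the information loss. The toy distributions are illustrative assumptions; building and solving the minimum-cost maximum-flow graph itself is omitted.

```python
# Sketch of the cross-domain measures: normalized PMI (information quantity between features)
# and KL divergence D_KL(p_t || p_s) (information loss when transferring across domains).
import numpy as np

def normalized_pmi(p_xy, p_x, p_y, eps=1e-12):
    """Normalized PMI of a feature co-occurrence: pmi(x, y) / (-log p(x, y)), in [-1, 1]."""
    pmi = np.log((p_xy + eps) / (p_x * p_y + eps))
    return pmi / (-np.log(p_xy + eps))

def kl_divergence(p_t, p_s, eps=1e-12):
    """D_KL(p_t || p_s) between target-domain and source-domain topic/word distributions."""
    p_t = np.asarray(p_t, dtype=float) + eps
    p_s = np.asarray(p_s, dtype=float) + eps
    p_t, p_s = p_t / p_t.sum(), p_s / p_s.sum()
    return float(np.sum(p_t * np.log(p_t / p_s)))

# Example: how far a topic's word distribution in the target domain drifts from the source domain.
p_source_topic = [0.4, 0.3, 0.2, 0.1]
p_target_topic = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p_target_topic, p_source_topic))   # information loss for this topic
print(normalized_pmi(p_xy=0.05, p_x=0.2, p_y=0.3))     # candidate edge weight for the flow graph
```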
In steps S404 to S407, the word vectors output by the pre-trained BERT model are combined with the topic vector output by the optimized cross-domain LDA model, and a BERT emotion classifier is constructed and trained. Specifically, the preprocessed text corpus is input into the BERT pre-training model, and each word is mapped into a word vector w_ij that is the sum of its token vector, segment vector and position vector. This word vector is fused with the LDA topic vector μ optimized through the cost flow network to obtain an improved word vector w_ij(ω+δ+ρ+μ). The fused word vectors are input into a bidirectional Transformer encoder (based on the self-attention mechanism), in which the multi-head attention mechanism and the feed-forward layer are connected through residual networks; the multi-head mechanism applies several linear transformations to the input vectors to obtain different values and then computes the attention weights. A single-layer neural network connected to the output vector corresponding to the Transformer [CLS] token is used as the classifier; after the Softmax output is obtained, the Masked LM task and the NSP task are executed in sequence, so that the classification model is obtained.
In summary, the embodiments of the present application provide an emotion analysis model (which may be a cross-domain text emotion analysis model). First, the text corpus of the source domain and the text corpus of the target domain are input into the LDA model to extract the topic features of the two domains. Then the information quantity between cross-domain features and the information loss during cross-domain feature transfer are calculated, and the parameters of the source domain LDA model are optimized through the cost flow model. Meanwhile, the source domain training set is fed into the BERT pre-training model to extract semantic features. Finally, the topic features extracted after cross-domain optimization of the LDA model and the semantic features extracted by BERT pre-training are combined into a word vector matrix, which is input into the constructed emotion classifier (whose base model may be BERT). The emotion analysis result corresponding to the driving text is obtained with this cross-domain text emotion analysis model, and the pre-configured corresponding target face base emotion data is obtained from the expression database (steps S102-S103).
Therefore, the cross-domain text emotion analysis model allows the training set and the test set to come from domains with different distributions, which further broadens the application domains of the driving text and improves the accuracy of text emotion classification.
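A hedged sketch of the feature combination described above is given below: LDA topic features are concatenated with BERT [CLS] semantic features and fed to an emotion classifier. Here scikit-learn's LDA and a logistic regression classifier stand in for the flow-optimized LDA model and the BERT-based classifier of the embodiment, and "bert-base-chinese" is an assumed checkpoint.

```python
# Simplified stand-in for the cross-domain feature combination: LDA topic features + BERT
# semantic features, then an emotion classifier. The texts, labels and checkpoint are examples.
import numpy as np
import torch
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

source_texts = ["今天非常开心", "这个产品太糟糕了", "服务很贴心", "体验令人失望"]
source_labels = [1, 0, 1, 0]                      # emotion labels of the source-domain corpus
target_texts = ["剧情感人至深", "特效粗糙乏味"]      # unlabeled target-domain corpus

# 1) Topic features from an LDA model fit on both domains (the flow-based optimization is omitted).
vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))
bow = vec.fit_transform(source_texts + target_texts)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(bow)
topic_feats = lda.transform(vec.transform(source_texts))

# 2) Semantic features from BERT (the [CLS] vector of each sentence).
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
with torch.no_grad():
    enc = tok(source_texts, return_tensors="pt", padding=True, truncation=True)
    sem_feats = bert(**enc).last_hidden_state[:, 0].numpy()

# 3) Combine into a feature matrix and train the emotion classifier.
features = np.concatenate([sem_feats, topic_feats], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, source_labels)
print(clf.predict(features))
```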
In step S103, target face base emotion data matched with emotion information may be determined in a preset expression database. In one possible implementation manner, before determining the target face base emotion data matched with the emotion information in the preset expression database in step S103, the method further includes: establishing an expression database, which comprises the following steps: acquiring a text corpus used for model training and emotion tags corresponding to each text corpus in the text corpus one by one; generating facial expression data corresponding to the emotion label according to the emotion label; the facial expression data is stored in a database, and the database is determined as an expression database.
It should be noted that the expression database can be built by collecting the text corpus, the emotion labels matched with the text sentences, the voice data, the face model data corresponding to the voice data, and the facial expression data paired with the emotion label data. Specifically, the emotion labels matched with the text sentences in the text corpus are obtained, and preprocessing such as text cleaning is performed on the text; the facial expression data corresponding to each emotion label is produced and stored in the expression database.
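A minimal sketch of such an expression database follows: a mapping from emotion labels to pre-produced facial base expression data. The label set and the named blendshape weights are illustrative assumptions; in practice the data would come from an artist-made or captured expression library.

```python
# Minimal sketch of the expression database: emotion label -> facial base expression data.
import json

def build_expression_database(labelled_corpus):
    """labelled_corpus: iterable of (text, emotion_label) pairs collected for model training."""
    # Pre-produced base expression data for each emotion label (assumed blendshape weights).
    base_expressions = {
        "happy":   {"mouthSmile": 0.7, "browInnerUp": 0.2, "eyeSquint": 0.3},
        "sad":     {"mouthFrown": 0.6, "browInnerUp": 0.5, "eyeLookDown": 0.4},
        "angry":   {"browDown": 0.8, "jawClench": 0.5, "noseSneer": 0.3},
        "neutral": {},
    }
    database = {}
    for _, label in labelled_corpus:
        if label in base_expressions:
            database[label] = base_expressions[label]
    return database

corpus = [("今天真开心", "happy"), ("这太让人难过了", "sad")]
expression_db = build_expression_database(corpus)
print(json.dumps(expression_db, ensure_ascii=False, indent=2))

# Look-up at inference time (step S103): emotion analysis of the driving text yields a label,
# which selects the matching target face base emotion data.
target_base_expression = expression_db.get("happy", {})
```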
It should be noted that, in step S104, the target broadcast video of the digital person may be generated according to the target voice, the target face animation data and the target face base emotion data. The specific refinement steps may be as shown in fig. 5, including:
step S501, taking the target face base emotion data as the base expression data;
step S502, taking the target voice as the timing reference, performing fusion rendering on the base expression data and the target face animation data to obtain a mouth shape animation matched with the target voice and a facial expression matched with the emotion information;
and step S503, synthesizing the mouth shape animation, the facial expression and the target voice to obtain the target broadcast video.
It should be noted that the method shown in fig. 5 illustrates how the target face base emotion data generated from the driving text and the face animation data generated from the target voice are fused, then rendered and combined with the audio data of the target voice to synthesize the target broadcast video. Specifically, the target face base emotion data generated from the driving text is used as the base expression; the base expression data and the face animation data generated from the target voice are fused, the mouth shape animation matched with the driving text and the facial expression matched with the emotion are rendered, and the animation and the target voice are synthesized into audio and video to finally obtain the target broadcast video. In this way, in the target video generated by this text-driven digital person broadcasting method, the digital person's mouth shape animation matches the driving text, and the facial expression and the voice emotion are consistent with the emotional style of the driving text.
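The fusion and synthesis step can be sketched as follows, assuming the facial animation is represented as per-frame blendshape weights. The blendshape names, the blending weight and the ffmpeg muxing call are assumptions for the sketch; rendering the fused frames is left to the digital-human rendering engine.

```python
# Sketch of the fusion step: speech-driven mouth animation blended with the emotion-matched
# base expression, with the target speech as the timing reference, then muxed with the audio.
import subprocess
import numpy as np

def fuse_expression(face_anim, base_expr, blendshape_names, weight=0.6):
    """face_anim: (T, N) speech-driven frames; base_expr: dict of blendshape -> value."""
    base = np.array([base_expr.get(name, 0.0) for name in blendshape_names])
    fused = face_anim.copy()
    mouth = np.array(["mouth" in n or "jaw" in n for n in blendshape_names])
    # keep speech-driven values on mouth/jaw channels (lip sync), blend the rest toward the emotion base
    fused[:, ~mouth] = (1 - weight) * face_anim[:, ~mouth] + weight * base[~mouth]
    return np.clip(fused, 0.0, 1.0)

def mux_audio_video(silent_video, audio, output):
    # assumes ffmpeg is installed; combines the rendered animation with the target speech
    subprocess.run(["ffmpeg", "-y", "-i", silent_video, "-i", audio,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", output], check=True)

names = ["mouthSmile", "jawOpen", "browInnerUp", "eyeSquint"]
anim = np.random.rand(100, len(names))                    # frames from the face driving model
fused = fuse_expression(anim, {"browInnerUp": 0.5, "eyeSquint": 0.3}, names)
# after rendering `fused` to frames.mp4 with the digital-human renderer:
# mux_audio_video("frames.mp4", "target_speech.wav", "broadcast.mp4")
```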
In general, the method provided in the embodiments of the present application includes the following steps, as shown in fig. 6 (a minimal end-to-end sketch follows the list):
step S601, based on a voice synthesis model, performing operations such as feature extraction, encoding processing, decoding processing and the like on the driving text to generate target voice conforming to a text emotion scene;
step S602, collecting a text corpus, emotion tags matched with text sentences, voice data, face model data corresponding to the voice data, facial expression data and emotion tag data pairs, and establishing an expression database;
step S603, training based on a text corpus data set to obtain a cross-domain text emotion analysis model, performing emotion analysis on the driving text, and further obtaining corresponding target face base emotion data in an expression database;
step S604, training a deep learning network model based on a voice data set and a face model data set, obtaining a face driving model after training, and inputting target voice into the face driving model to obtain target face animation data;
step S605, fusing the target face base emotion data generated by the driving text and the face animation data generated by the target voice, and rendering and synthesizing the target broadcasting video by combining the audio data of the target voice.
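For orientation, the end-to-end flow of steps S601 to S605 can be written as the following hedged pseudocode sketch; every component name is a stand-in for the corresponding module described above, not an interface defined by the patent.

```python
# End-to-end orchestration of steps S601-S605; all component objects are assumed stand-ins.
def generate_broadcast_video(driving_text, tts_model, emotion_model, expression_db,
                             face_driver, renderer):
    target_speech = tts_model.synthesize(driving_text)              # S601: emotion-aware TTS
    emotion_label = emotion_model.predict(driving_text)             # S603: cross-domain emotion analysis
    base_expression = expression_db[emotion_label]                  # S602/S603: look up base emotion data
    face_animation = face_driver.drive(target_speech)               # S604: speech-driven face animation
    return renderer.compose(target_speech, face_animation, base_expression)   # S605: fuse, render, mux
```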
Compared with the prior art, this text-driven digital person broadcast video generation method has the following advantages: 1. Compared with the traditional technical scheme of voice-driven facial mouth-shape animation, the emotion information of the driving text is analyzed to obtain the corresponding facial expression data, which is then fused with the face animation data generated from the target voice output by the speech synthesis technology to obtain the final facial animation video, so that the mouth movements are consistent with the target voice and the facial expression matches the emotional style of the driving text. 2. In the speech synthesis step that converts the driving text into the target voice, the emotion of the text is taken into account, so the generated target voice carries emotion information rather than the traditional flat broadcast-reading style; this emotion information is also carried into the subsequent voice driving model training, making the generated mouth-shape animation more vivid and realistic.
The method has a wide range of application scenarios. It can be applied in the computer technology field, the artificial intelligence field, the metaverse field and the digital human field, and in any scenario where a virtual digital person is required to broadcast text, such as digital-human customer service explanations on websites and digital-human video introductions at scenic spots and venues.
Fig. 7 shows an apparatus for generating a broadcast video of a digital person based on driving text according to an embodiment of the present application, and as shown in fig. 7, an apparatus 70 includes:
the first generating module 701 is configured to obtain a target voice based on a driving text, and input the target voice into a preset face driving model to obtain target face animation data; the target voice can express emotion information contained in the driving text;
an obtaining module 702, configured to obtain emotion information contained in a driving text;
a determining module 703, configured to determine, in a preset expression database, target face base emotion data that matches emotion information;
the second generating module 704 is configured to generate a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
In a possible implementation manner, the first generating module 701 is further configured to obtain a phoneme sequence feature vector and a position feature vector of the driving text;
splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
extracting the context characteristics of the driving text to obtain a context characteristic vector;
Performing feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
decoding the enhanced feature vector to obtain a mel frequency spectrum;
target speech is generated based on the mel-frequency spectrum and the vocoder.
In a possible implementation manner, the obtaining module 702 is further configured to input the driving text into a preset emotion analysis model to obtain emotion information contained in the driving text;
the apparatus 70 further comprises:
the second determining module is used for inputting the text corpus of the source domain and the text corpus of the target domain into a latent Dirichlet allocation (LDA) model before the driving text is input into the preset emotion analysis model to obtain the emotion information contained in the driving text, so as to obtain a source domain first topic feature corresponding to the source domain and a target domain first topic feature corresponding to the target domain;
the computing module is used for computing the information quantity and the information loss quantity between the source domain first topic feature and the target domain first topic feature;
the model optimization module is used for optimizing the LDA model based on the information quantity and the information loss quantity, wherein the optimized LDA model better fits the data distribution of the target domain;
the second determining module is further configured to re-input the text corpus of the source domain and the text corpus of the target domain into the optimized LDA model to obtain a source domain second topic feature corresponding to the source domain and a target domain second topic feature corresponding to the target domain;
the third determining module is used for inputting the text corpus of the source domain into a bidirectional encoder representation (BERT) model to obtain semantic features of the source domain;
the third generation module is used for combining the semantic features, the source domain second topic feature and the target domain second topic feature to generate a word vector matrix;
and the fourth determining module is used for training a preset emotion classifier based on the word vector matrix to obtain the emotion analysis model.
In one possible implementation, the apparatus 70 further includes:
the establishing module is used for establishing an expression database before determining target face base emotion data matched with emotion information in the preset expression database, and comprises the following steps: acquiring a text corpus used for model training and emotion tags corresponding to each text corpus in the text corpus one by one; generating facial expression data corresponding to the emotion label according to the emotion label; the facial expression data is stored in a database, and the database is determined as an expression database.
In one possible implementation, the apparatus 70 further includes:
the fifth determining module is used for training the deep learning network model based on the voice data set and the face model data set before inputting the target voice into the preset face driving model to obtain the target face animation data, so as to obtain the face driving model;
wherein each voice in the voice data set and each face model data in the face model data set are in one-to-one correspondence.
In one possible implementation, the second generating module 704 is configured to take the target face base emotion data as the base expression data; fuse and render the base expression data and the target face animation data with the target voice as the timing reference, so as to obtain a mouth shape animation matched with the target voice and a facial expression matched with the emotion information; and synthesize the mouth shape animation, the facial expression and the target voice to obtain the target broadcast video.
Therefore, the emotion information of the driving text can be fully utilized: the emotion information of the text is analyzed and extracted, and the face base emotion data is obtained from the expression database according to the extracted emotion label. The facial animation fused with the face base expression enriches the digital person's emotional expression and improves the realism of the virtual digital person. In the speech synthesis technology that generates the target voice from the driving text, the emotion, prosodic features and other information in the driving text are taken into account, so the generated target voice conforms to an emotional style consistent with the driving text; the subsequent voice driving model can therefore learn the emotion information and prosodic features in the target voice, and the facial animation generated from the target voice also carries these emotion and prosodic features, so that in the finally generated digital person broadcast video the mouth animation and facial expressions are more vivid and realistic.
The embodiments of the present application further provide an electronic device 80, as shown in fig. 8, including: a processor 801, a memory 802, and a program stored in the memory 802 and executable on the processor 801; when the program is executed by the processor, the method for generating a broadcast video of a digital person based on a driving text in the above embodiments is implemented.
The embodiment of the application also provides a computer readable storage medium, and a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the steps of the method for generating the broadcast video of the digital person based on the driving text shown in the embodiment are realized. And the same technical effects can be achieved, and in order to avoid repetition, the description is omitted here. Wherein the computer readable storage medium is selected from Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A method for generating a broadcast video of a digital person based on a driving text, the method comprising:
acquiring target voice based on a driving text, and inputting the target voice into a preset face driving model to obtain target face animation data; wherein, the target voice can express emotion information contained in the driving text;
acquiring emotion information contained in the driving text;
determining target face base emotion data matched with the emotion information in a preset expression database;
and generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
2. The method of claim 1, wherein obtaining the target speech based on the driving text comprises:
acquiring a phoneme sequence feature vector and a position feature vector of the driving text;
splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
extracting the context characteristics of the driving text to obtain a context characteristic vector;
performing feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
Decoding the enhanced feature vector to obtain a mel frequency spectrum;
the target speech is generated based on the mel-frequency spectrum and the vocoder.
3. The method of claim 1, wherein obtaining the emotion information contained in the driving text comprises:
inputting the driving text into a preset emotion analysis model to obtain emotion information contained in the driving text;
before the driving text is input into a preset emotion analysis model to obtain emotion information contained in the driving text, the method further comprises the steps of:
inputting a text corpus of a source domain and a text corpus of a target domain into a latent Dirichlet allocation (LDA) model to obtain a source domain first topic feature corresponding to the source domain and a target domain first topic feature corresponding to the target domain;
calculating the information quantity and the information loss quantity between the source domain first topic feature and the target domain first topic feature;
optimizing the LDA model based on the information quantity and the information loss quantity, wherein the optimized LDA model better fits the data distribution of the target domain;
re-inputting the text corpus of the source domain and the text corpus of the target domain into the optimized LDA model to obtain a source domain second topic feature corresponding to the source domain and a target domain second topic feature corresponding to the target domain;
inputting the text corpus of the source domain into a bidirectional encoder representation (BERT) model to obtain semantic features of the source domain;
combining the semantic features, the source domain second topic feature and the target domain second topic feature to generate a word vector matrix;
training a preset emotion classifier based on the word vector matrix to obtain the emotion analysis model.
4. The method of claim 1, wherein before determining the target face base emotion data matched with the emotion information in a preset expression database, the method further comprises:
the expression database is established, which comprises the following steps:
acquiring a text corpus used for model training and emotion tags corresponding to each text corpus in the text corpus one by one;
generating facial expression data corresponding to the emotion label according to the emotion label;
And storing the facial expression data in a database, and determining the database as the expression database.
5. The method of claim 1, wherein prior to inputting the target speech into a preset face driving model to obtain target face animation data, the method further comprises:
training a deep learning network model based on a voice data set and a face model data set to obtain the face driving model;
wherein each voice in the voice data set and each face model data in the face model data set are in one-to-one correspondence.
6. The method of claim 1, wherein generating the target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data comprises:
taking the target face base emotion data as base expression data;
taking the target voice as the timing reference, and performing fusion rendering on the base expression data and the target face animation data to obtain a mouth shape animation matched with the target voice and a facial expression matched with the emotion information;
and synthesizing the mouth shape animation, the facial expression and the target voice to obtain the target broadcast video.
7. An apparatus for generating a broadcast video of a digital person based on driving text, the apparatus comprising:
the first generation module is used for acquiring target voice based on the driving text, inputting the target voice into a preset face driving model and obtaining target face animation data; wherein, the target voice can express emotion information contained in the driving text;
the first acquisition module is used for acquiring emotion information contained in the driving text;
the determining module is used for determining target face base emotion data matched with the emotion information in a preset expression database;
and the second generation module is used for generating a target broadcast video of the digital person according to the target voice, the target face animation data and the target face base emotion data.
8. The apparatus of claim 7, wherein
the first generation module is further configured to obtain a phoneme sequence feature vector and a position feature vector of the driving text;
splicing and fusing the phoneme sequence feature vector and the position feature vector to obtain a phoneme position feature vector;
extracting the context characteristics of the driving text to obtain a context characteristic vector;
Performing feature processing on the phoneme position feature vector and the context feature vector to obtain an enhanced feature vector;
decoding the enhanced feature vector to obtain a mel frequency spectrum;
the target speech is generated based on the mel-frequency spectrum and the vocoder.
9. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor performs the steps of a method of generating a broadcast video of a digital person based on driving text as claimed in any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of a method of generating a broadcast video of a digital person based on driving text as claimed in any one of claims 1 to 6.
CN202311561042.8A (priority date 2023-11-22, filing date 2023-11-22) Method and device for generating broadcast video of digital person based on driving text - Pending - CN117711042A (en)

Priority Applications (1)

Application Number: CN202311561042.8A | Priority Date: 2023-11-22 | Filing Date: 2023-11-22 | Title: Method and device for generating broadcast video of digital person based on driving text

Publications (1)

Publication Number Publication Date
CN117711042A true CN117711042A (en) 2024-03-15

Family

ID=90145152



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination