WO2024078303A1 - Training method for face-driven model, video generation method and apparatus

Training method for face-driven model, video generation method and apparatus

Info

Publication number
WO2024078303A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
facial expression
data
sub
expression prediction
Application number
PCT/CN2023/120778
Other languages
English (en)
French (fr)
Inventor
杨春勇
蒋宁
刘敏
曾琳铖曦
Original Assignee
马上消费金融股份有限公司
Application filed by 马上消费金融股份有限公司
Publication of WO2024078303A1


Classifications

    • G06N3/0464: Physics; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Convolutional networks [CNN, ConvNet]
    • G06N3/08: Physics; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06V10/774: Physics; Image or video recognition or understanding using pattern recognition or machine learning; Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82: Physics; Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V40/16: Physics; Recognition of biometric, human-related or animal-related patterns in image or video data; Human faces, e.g. facial parts, sketches or expressions

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a training method, a video generation method and a device for a face-driven model.
  • virtual digital humans can be assistant-type digital humans such as virtual customer service agents, virtual tour guides and smart assistants; they can also be entertainment-type digital humans such as virtual singers and virtual spokespersons, or anchor-type digital humans such as virtual anchors and virtual hosts. However, in some cases the parameter accuracy of the face-driven model used to generate virtual digital human videos is low, resulting in low accuracy of the virtual digital human's expression prediction, which in turn leads to poor authenticity of the virtual digital human's expressions in the video.
  • the purpose of the embodiments of the present application is to provide a training method, a video generation method and a device for a face-driven model, which can improve the expression prediction accuracy of the facial expression prediction sub-model, and thereby improve the expression authenticity of the virtual digital human video generated using the trained face-driven model.
  • the present application provides a method for training a face-driven model, the method comprising: obtaining N video sample data, where each video sample data comprises a real face image and sample voice data of a sample user and N is an integer greater than 1; and inputting the N video sample data into a model to be trained for iterative model training to obtain a face-driven model; wherein the model to be trained comprises a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model, and each round of model training comprises: for each video sample data, the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector; the facial expression prediction sub-model performs facial expression prediction based on the first feature vector and the second feature vector to obtain first facial expression prediction data; a first loss value is determined based on the first facial expression prediction data and the real facial expression data extracted from the real face image; and the model parameters are updated based on the first loss value.
  • an embodiment of the present application provides a video generation method, which includes: obtaining target voice data; the target voice data includes the original voice data of the target user or the synthesized voice data of the target user; inputting the target voice data into a trained face drive model for face drive processing to obtain a target virtual digital human video; wherein the face drive model includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation method of the face drive processing is: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; image rendering is performed based on the second facial expression prediction data to obtain the target virtual digital human video.
  • an embodiment of the present application provides a training device for a face-driven model, the device comprising: a sample data acquisition module, configured to acquire N video sample data, where each video sample data comprises a real face image and sample voice data of a sample user and N is an integer greater than 1; and a model training module, configured to input the N video sample data into the model to be trained for iterative model training until the current model training result meets the preset model training end condition, to obtain a trained face driving model; wherein the model to be trained includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model, and each round of model training is as follows: for each video sample data, the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector; the facial expression prediction sub-model performs facial expression prediction based on the first feature vector and the second feature vector to obtain first facial expression prediction data; and a first loss value is determined based on the first facial expression prediction data and the real facial expression data to update the model parameters.
  • an embodiment of the present application provides a video generation device, which includes: a target data acquisition module, used to acquire target voice data; the target voice data includes the original voice data of the target user or the synthesized voice data of the target user; a video generation module, used to input the target voice data into a trained face drive model for face drive processing to obtain a target virtual digital human video; wherein the face drive model includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation method of the face drive processing is: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; image rendering is performed based on the second facial expression prediction data to obtain the target virtual digital human video.
  • an embodiment of the present application provides a computer device, the device comprising: a processor; and a memory arranged to store computer executable instructions, the executable instructions being configured to be executed by the processor, the executable instructions comprising instructions for executing steps in the above method.
  • an embodiment of the present application provides a storage medium, wherein the storage medium is used to store computer-executable instructions, and the executable instructions enable a computer to execute the steps in the above method.
  • an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the above method.
  • FIG1 is a schematic diagram of a flow chart of a method for training a face-driven model provided in an embodiment of the present application
  • FIG2 is a schematic diagram of a first flow chart of each model training process in the face driving model training method provided in an embodiment of the present application;
  • FIG3 is a schematic diagram of a first implementation principle of a training method for a face driving model provided in an embodiment of the present application
  • FIG4 is a schematic diagram of a second flow chart of each model training process in the face driving model training method provided in an embodiment of the present application;
  • FIG5 is a schematic diagram of a second implementation principle of the face driving model training method provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a third implementation principle of the face driving model training method provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of a flow chart of a video generation method provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the implementation principle of the video generation method provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of the module composition of a training device for a face driving model provided in an embodiment of the present application.
  • FIG10 is a schematic diagram of the module composition of a video generating device provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
  • One or more embodiments of the present application provide a training method, a video generation method and a device for a face-driven model. Consider an approach in which a first feature vector is extracted from the phoneme dimension of the speech data and only that first feature vector is input into the facial expression prediction sub-model of the face-driven model to obtain facial expression prediction data, after which the expression prediction data is adjusted using a parameter matrix T representing individual characteristics, so as to map expression parameters unrelated to individual characteristics to expression parameters related to individual characteristics.
  • if the sample users do not include the target user (i.e., the target speaker), a parameter matrix T corresponding to the target user cannot be accurately trained, so the mapping between expression parameters unrelated to individual characteristics and expression parameters related to individual characteristics cannot be achieved for the target user.
  • in view of this, the technical scheme uses, during the training of the face-driven model, a first feature vector obtained by performing text content recognition on the sample speech data and a second feature vector obtained by performing non-phoneme feature recognition as the input data of the facial expression prediction sub-model, which performs facial expression prediction to obtain facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data. In other words, not only is a first feature vector irrelevant to individual characteristics extracted from the phoneme dimension, but a second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the sub-model simultaneously learns expression parameters irrelevant to individual characteristics and expression parameters related to individual characteristics. This improves the accuracy of the model parameters of the facial expression prediction sub-model and thereby its expression prediction accuracy. In the application stage of the face-driven model (i.e., the model containing the trained facial expression prediction sub-model), the same two kinds of feature vectors are extracted from the target voice data for expression prediction, which improves the expression authenticity of the generated virtual digital human video.
  • the present application directly uses the second feature vector related to the individual characteristics extracted from the non-phoneme dimension as the model input on the basis of the first feature vector irrelevant to the individual characteristics extracted from the phoneme dimension as the model input, so that the expression prediction sub-model can not only learn the expression parameters irrelevant to the individual characteristics, but also learn the expression parameters related to the individual characteristics.
  • the facial expression prediction data output by the expression prediction sub-model can represent the common expression data of different target speakers predicted based on the third feature vector (i.e., the feature vector extracted from the phoneme dimension), and can also represent the expression difference data of different target speakers predicted based on the fourth feature vector (i.e., the feature vector extracted from the non-phoneme dimension). Since there is no need to train a corresponding parameter matrix T for the target speaker in advance, there is no need to obtain a large amount of video data of the target speaker, while it is still ensured that the facial expression prediction data reflects the expression differences between individuals. In this way, even when only a single image of the target speaker is available, the facial expression of the target speaker can still be accurately predicted, which reduces the amount of video data required for the target speaker and improves the applicability of the trained face-driven model.
  • FIG1 is a first flow chart of a method for training a face-driven model provided by one or more embodiments of the present application.
  • the method in FIG1 can be executed by an electronic device provided with a face-driven model training device, which can be a terminal device or a designated server, wherein the hardware device for training the face-driven model (i.e., an electronic device provided with a face-driven model training device) and the hardware device for generating virtual digital human videos (i.e., an electronic device provided with a virtual digital human video generating device) can be the same or different.
  • the face-driven model trained by the model training method provided by the embodiment of the present application can be applied to any specific application scenario where a virtual digital human video needs to be generated, for example, an application scenario for generating a question-answering video of a virtual customer service, another example, an application scenario for generating a song singing video of a virtual singer, and another example, an application scenario for generating a product introduction video of a virtual anchor.
  • the method includes at least the following steps:
  • S102, obtain N video sample data; each video sample data includes a real face image and sample voice data of a sample user, and N is an integer greater than 1.
  • specifically, the historical videos of M sample users in the preset application scenario are used as video sample data, where M is an integer greater than 1 and less than or equal to N. For example, if the preset application scenario is one for generating question-answering videos of a virtual customer service agent, the question-answering videos of M sample customer service staff are used as video sample data; if the preset application scenario is one for generating song singing videos of a virtual singer, the song singing videos of M sample singers are used as video sample data; and if the preset application scenario is one for generating product introduction videos of a virtual anchor, the product introduction videos of M sample anchors are used as video sample data.
  • S104 input the N video sample data into the model to be trained to perform iterative model training to obtain a face driving model.
  • specifically, based on the video sample data set, the model parameters of the model to be trained are iteratively updated until the current model training result meets the preset model training end condition, and a face driving model for generating virtual digital human videos is obtained; the preset model training end condition may include that the current number of training rounds equals the total number of training rounds and that the model loss function converges.
  • the specific implementation process of the model iterative training is described below. Since the processing process of each model training in the model iterative training process is the same, any model training is taken as an example for detailed description. Specifically, if the above model to be trained includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; as shown in FIG2 , the specific implementation method of each model training can include the following steps S1041 to S1043:
  • the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector
  • the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector
  • the facial expression prediction sub-model performs facial expression prediction based on the above-mentioned first feature vector and second feature vector to obtain first facial expression prediction data.
  • the above-mentioned first vector extraction sub-model can be a pre-trained speech recognition model, for example the deep-learning-based recurrent neural network model DeepSpeech, or another neural network model that learns the audio-to-text mapping. After the video sample data is obtained, the sample speech data in the video sample data is input into the first vector extraction sub-model, and the text content features in each frame of the speech signal are extracted from the phoneme dimension (i.e., speech-to-text processing) to obtain the first feature vector, for example an ASR (Automatic Speech Recognition) feature vector, which serves as the first input data of the facial expression prediction sub-model.
  • the second vector extraction sub-model can be a pre-trained speaker representation recognition model.
  • the speaker representation recognition model can be a model based on the voiceprint recognition algorithm voxceleb, or other neural network models that learn speaker identification and confirmation.
  • the sample speech data is input into a second vector extraction sub-model so that the second vector extraction sub-model can be used to extract speaker representation features in each frame of speech signal from a non-phoneme dimension based on the sample speech data to obtain a second feature vector (i.e., a speaker representation vector used to reflect the differences in individual facial expressions).
  • the representation vector that can characterize the speaker characteristics is used as the second input data of the facial expression prediction sub-model.
  • the speaker's speaking speed, emotion and other aspects of speaking style can affect the personalized differences in the speaker's facial expressions
  • the speaking speed of different speakers will affect the speaker's facial expressions such as the frequency of mouth opening and closing
  • the emotional differences between different speakers will affect the speaker's facial expressions such as the degree of upturned mouth corners
  • since voiceprint features can characterize the speaker's speaking speed and other aspects of speaking style, for the extraction of the second feature vector the above-mentioned non-phoneme feature recognition can include at least one of voiceprint feature recognition and emotion feature recognition.
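  • As an illustration of the two extraction branches described above, the following minimal PyTorch sketch treats the first and second vector extraction sub-models as frozen, pre-trained encoders: one maps a window of audio frames to per-frame text-content (ASR-style) features, the other maps the same audio to an utterance-level speaker/style vector. The encoder architectures, feature dimensions and the mel-spectrogram input are illustrative assumptions, not the patent's concrete DeepSpeech or voxceleb-based implementations.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Stand-in for the first vector extraction sub-model (phoneme/text content features)."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)   # DeepSpeech-like recurrent encoder (assumed)

    def forward(self, mel_frames):                          # (B, T, n_mels)
        feats, _ = self.rnn(mel_frames)
        return feats                                        # (B, T, dim): first feature vector per frame

class SpeakerEncoder(nn.Module):
    """Stand-in for the second vector extraction sub-model (non-phoneme speaker/style features)."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_mels, dim, 5, padding=2), nn.ReLU())

    def forward(self, mel_frames):                          # (B, T, n_mels)
        h = self.conv(mel_frames.transpose(1, 2))           # (B, dim, T)
        return h.mean(dim=2)                                 # (B, dim): utterance-level speaker vector

# Both encoders are assumed to be pre-trained and kept frozen during face-driven model training.
content_enc, speaker_enc = ContentEncoder().eval(), SpeakerEncoder().eval()
mel = torch.randn(2, 100, 80)                               # dummy batch: 2 clips x 100 frames x 80 mel bins
with torch.no_grad():
    first_vec = content_enc(mel)                            # text-content features (phoneme dimension)
    second_vec = speaker_enc(mel)                           # speaker representation (non-phoneme dimension)
print(first_vec.shape, second_vec.shape)
```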
  • the above-mentioned facial expression prediction sub-model is a neural network model to be trained for predicting facial expressions of human faces; after obtaining the first feature vector through the first vector extraction sub-model and obtaining the second feature vector through the second vector extraction sub-model, the first feature vector and the second feature vector are input into the facial expression prediction sub-model, and the output of the facial expression prediction sub-model is the first facial expression prediction data; wherein, the model parameters in the facial expression prediction sub-model are the model parameters that need to be iteratively trained.
  • after the first facial expression prediction data is obtained using the facial expression prediction sub-model, the real facial expression data corresponding to the video sample data is obtained; the real facial expression data is used as the real label and the first facial expression prediction data as the prediction label, and the expression prediction sub-loss value between the first facial expression prediction data and the real facial expression data is calculated, i.e., the expression prediction loss of the model is obtained based on the real label and the prediction label; the first loss value is then obtained based on the expression prediction sub-loss values corresponding to the individual video sample data. Specifically, for a given video sample data, the real facial expression data is extracted from the real face image in the video sample data as the real label, the first feature vector and the second feature vector are extracted from the sample voice data in the video sample data, facial expression prediction is performed based on these two feature vectors to obtain the first facial expression prediction data as the prediction label, and the expression prediction loss information between the real label and the prediction label is calculated to obtain the expression prediction sub-loss value.
  • the above-mentioned real facial expression data can be obtained in advance by using a facial feature extractor to extract facial feature data based on real face images in video sample data, or can be obtained in real time by using a facial feature extractor to extract facial feature data based on real face images in video sample data;
  • the facial feature extractor can be an existing three-dimensional face tracking (3D face tracking) tool or another facial feature extractor;
  • the facial feature extractor is used to extract facial feature data from real face images to obtain the user's real facial expression data, shape feature vectors and texture feature vectors; wherein the shape feature vectors and texture feature vectors can be used as basic data for facial image rendering based on facial expression prediction data.
  • the gradient descent method is used to adjust the parameters of the facial expression prediction sub-model based on the above-mentioned first loss value; since the first facial expression prediction data is determined based on the first feature vector and the second feature vector, the first loss value reflects both the expression prediction loss component considered from the phoneme feature dimension and the expression prediction loss component considered from the non-phoneme feature dimension. This improves the accuracy of the first loss value and hence the accuracy of the model parameter adjustment, so that the trained facial expression prediction sub-model predicts facial expressions with higher accuracy.
  • the model parameters of the model to be trained are iteratively updated based on the first loss value.
  • the trained face-driven model can be obtained by referring to the process of tuning the model parameters by back propagation using the gradient descent method in the related art, which will not be repeated here.
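  • A minimal sketch of one parameter-update step is shown below, assuming the expression labels and predictions are fixed-length coefficient vectors per frame, mean-squared error as the expression prediction sub-loss, and an Adam optimizer standing in for the gradient-descent update; the predictor module and all dimensions are assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

expr_dim = 64                                   # assumed size of the facial expression coefficient vector
predictor = nn.Sequential(nn.Linear(256 + 128, 256), nn.ReLU(), nn.Linear(256, expr_dim))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def training_step(first_vecs, second_vec, real_expr):
    """One update of the facial expression prediction sub-model from a batch of video sample data."""
    # Broadcast the utterance-level speaker vector to every audio frame and predict per-frame expressions.
    second_per_frame = second_vec.unsqueeze(1).expand(-1, first_vecs.size(1), -1)
    pred_expr = predictor(torch.cat([first_vecs, second_per_frame], dim=-1))   # first facial expression prediction data

    # Expression prediction sub-loss per sample, averaged into the first loss value.
    sub_losses = ((pred_expr - real_expr) ** 2).mean(dim=(1, 2))               # one value per video sample
    first_loss = sub_losses.mean()

    optimizer.zero_grad()
    first_loss.backward()                                                      # gradient-descent style update
    optimizer.step()
    return first_loss.item()

# Dummy batch: 2 samples, 100 frames; real_expr would come from the facial feature extractor.
loss = training_step(torch.randn(2, 100, 256), torch.randn(2, 128), torch.randn(2, 100, expr_dim))
print(loss)
```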
  • in summary, the first feature vector obtained by performing text content recognition on the sample voice data and the second feature vector obtained by performing non-phoneme feature recognition are used as input data of the facial expression prediction sub-model to predict facial expressions and obtain facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data. That is, not only is the first feature vector irrelevant to individual characteristics extracted from the phoneme dimension, but the second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both, so that the facial expression prediction sub-model can simultaneously learn the expression parameters irrelevant to individual characteristics and the expression parameters related to individual characteristics.
  • correspondingly, in the application stage, the third feature vector obtained by performing text content recognition on the target voice data and the fourth feature vector obtained by performing non-phoneme feature recognition are used as input data of the facial expression prediction sub-model to perform facial expression prediction and obtain facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video, which improves the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • as shown in FIG3, a schematic diagram of a specific implementation principle of the face-driven model training process is given, which specifically includes: obtaining video sample data of multiple sample users, each video sample data including a real face image and sample voice data; inputting the sample voice data in the video sample data into the first vector extraction sub-model for text content recognition to obtain a first feature vector, and inputting the sample voice data into the second vector extraction sub-model for non-phoneme feature recognition to obtain a second feature vector; inputting the first feature vector and the second feature vector into the facial expression prediction sub-model to be trained for facial expression prediction to obtain first facial expression prediction data; and inputting the real face image in the video sample data into the facial feature extractor for feature extraction to obtain the real facial expression data of the sample user, where the facial feature extractor can also be used to extract the shape feature vector and texture feature vector of the sample user, and the facial feature data extraction can be completed before model training or performed synchronously with the model training process; a first loss value is then determined based on the first facial expression prediction data and the real facial expression data, and the model parameters of the facial expression prediction sub-model are iteratively updated based on the first loss value.
  • the facial feature extractor may be independent of the trained face driving model. For example, when the facial feature data extraction process is completed before model training, the facial feature extractor and the face driving model are relatively independent.
  • the facial feature extractor may also belong to the trained face driving model. For example, when the facial feature data extraction process and the model training process are carried out synchronously, the trained face driving model includes the facial feature extractor.
  • in some cases, the lip shape in the generated video does not correspond to the audio and exhibits a delayed response.
  • therefore, in the process of adjusting the model parameters of the facial expression prediction sub-model, not only the facial expression loss obtained from the real expression label and the predicted expression label is considered, but also the synchronization loss between the lip shape in the predicted expression image frame and the text corresponding to the speech frame.
  • the loss value finally used for model parameter optimization includes not only the first loss value between the facial expression prediction data and the real facial expression data, but also the second loss value between the lip shape information and the text corresponding to the speech frame.
  • the above-mentioned model to be trained also includes a facial expression renderer and a lip shape and speech synchronization recognition sub-model; the specific implementation method of each model training may also include the following steps S1044 to S1045:
  • the facial expression renderer performs facial image rendering based on the above-mentioned first facial expression prediction data to obtain a face prediction image;
  • the lip-synch and speech synchronization recognition sub-model determines the synchronization sub-loss value corresponding to the video sample data based on the above-mentioned face prediction image and the sample speech data.
  • the facial expression renderer can be a 3D face renderer in the related art.
  • the 3D face renderer can be a differentiable renderer tf-mesh-render, or other facial expression renderers.
  • after obtaining the first facial expression prediction data, as well as the shape feature vector and the texture feature vector, the first facial expression prediction data, the shape feature vector and the texture feature vector are input into the facial expression renderer, and facial image rendering is performed to obtain a predicted face image, so that the predicted face image can be used as input data of the lip shape and speech synchronization recognition sub-model to perform synchronization recognition of lip shape and speech.
  • the above-mentioned lip shape and speech synchronization recognition sub-model can be a pre-trained neural network model for identifying whether the facial lip shape and the text corresponding to the speech frame are synchronized, for example a SyncNet model or another neural network model; after the face prediction image (including the predicted facial lip shape) is obtained, the face prediction image and the sample speech data are input into the lip shape and speech synchronization recognition sub-model to score the synchronization between the facial lip shape in the face prediction image frame and the text corresponding to the sample speech frame, to obtain a synchronization sub-loss value.
  • multiple synchronization sub-loss values are summed or weighted summed to obtain the second loss value.
  • based on the first loss value and the second loss value, a weighted loss value is determined; based on the weighted loss value, the model parameters of the facial expression prediction sub-model in the above-mentioned model to be trained are iteratively updated to obtain a trained face driving model. The trained face driving model may include the lip shape and speech synchronization recognition sub-model (used to recognize the synchronization between the facial lip shape in the face rendering image rendered based on the second facial expression prediction data and the text corresponding to the speech frame in the target speech data, so that the accuracy of the facial expression prediction sub-model can be evaluated based on the synchronization recognition results), or the trained face-driven model may not include the lip shape and speech synchronization recognition sub-model.
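  • The combination of the expression loss and the synchronization loss can be as simple as a weighted sum; the sketch below uses placeholder scalar losses and illustrative weights, since the patent does not fix concrete values.

```python
import torch

# Illustrative combination of the expression loss (first loss value) and the lip/speech
# synchronization loss (second loss value) into one weighted training objective.
first_loss = torch.tensor(0.42, requires_grad=True)     # placeholder expression prediction loss
second_loss = torch.tensor(0.17, requires_grad=True)    # placeholder synchronization loss
w_expr, w_sync = 1.0, 0.3                                # assumed weights; not specified by the patent
weighted_loss = w_expr * first_loss + w_sync * second_loss
weighted_loss.backward()                                 # in the real model these scalars would be graph
print(weighted_loss.item())                              # outputs of the sub-models, not constants
```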
  • the lip shape and speech synchronization recognition sub-model is mainly used to recognize the synchronization between the lip shape and the sample speech data in the facial expression prediction data, so as to obtain the second loss value representing the delay degree between the facial lip shape and the speech frame
  • the model parameters of the facial expression prediction sub-model are tuned based on the facial expression loss between the real expression and the predicted expression, and the synchronization loss between the facial lip shape and the speech frame.
  • the synchronization between the lip shape information in the predicted expression output by the facial expression prediction sub-model and the corresponding real speech frame is supervised, so that the facial expression prediction data output by the facial expression prediction sub-model can ensure both the authenticity of the facial expression and the synchronization between the facial lip shape and the speech frame.
  • the lip shape and speech synchronization recognition sub-model can be used only in the model training stage, and in the model application stage, the lip shape and speech synchronization recognition sub-model may not be needed, that is, the model to be trained may include the above-mentioned first vector extraction sub-model, the second vector extraction sub-model, the facial expression prediction sub-model, the facial expression renderer and the lip shape and speech synchronization recognition sub-model, and the trained face driving model may include the above-mentioned first vector extraction sub-model, the second vector extraction sub-model, the facial expression prediction sub-model and the facial expression renderer.
  • as shown in FIG5, a schematic diagram of the specific implementation principle of another face-driven model training process is given, which specifically includes: obtaining video sample data of multiple sample users, each video sample data including a real face image and sample voice data; inputting the sample voice data in the video sample data into the first vector extraction sub-model for text content recognition to obtain a first feature vector, and inputting the sample voice data into the second vector extraction sub-model for non-phoneme feature recognition to obtain a second feature vector; inputting the first feature vector and the second feature vector into the facial expression prediction sub-model to be trained for facial expression prediction to obtain first facial expression prediction data; inputting the real face image in the video sample data into the facial feature extractor for feature extraction to obtain the real facial expression data, shape feature vector and texture feature vector of the sample user, where the facial feature data extraction can be completed before model training or performed synchronously with the model training process; the first facial expression prediction data, shape feature vector and texture feature vector are then input into the facial expression renderer to render a face prediction image, and the face prediction image and the sample speech data are input into the lip shape and speech synchronization recognition sub-model to obtain the synchronization sub-loss value corresponding to the video sample data.
  • the lip-synch recognition sub-model may include a multi-layer neural network and a loss information output network.
  • the multi-layer neural network maps the predicted face image and sample speech data to a high-dimensional space, performs lip shape and speech synchronization comparison, and obtains a speech lip shape synchronization score.
  • the face prediction image frame and the corresponding sample speech frame are input into the multi-layer neural network together; the multi-layer neural network extracts lip shape features based on the face prediction image frame to obtain a high-dimensional feature vector of the image frame; and the multi-layer neural network extracts speech features based on the sample speech frame to obtain a high-dimensional feature vector of the speech frame; the high-dimensional feature vector of the image frame is correlated with the high-dimensional feature vector of the speech frame, and a speech lip synchronization score is obtained based on the correlation calculation result.
  • the loss information output network determines the synchronization sub-loss value corresponding to the video sample data based on the speech lip synchronization score.
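  • The sketch below illustrates the described two-branch structure in the spirit of SyncNet: one branch embeds a face prediction frame, the other embeds the matching speech frame, and their cosine similarity serves as the speech lip synchronization score from which a synchronization sub-loss is derived. The layer sizes, the mel-slice input and the use of cosine similarity are assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipSyncScorer(nn.Module):
    """Two-branch network: image frame and speech frame are mapped to a shared high-dimensional space."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_branch = nn.Sequential(                      # lip-shape features from the face prediction image
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.aud_branch = nn.Sequential(                      # speech features from the sample speech frame (mel slice)
            nn.Flatten(), nn.Linear(80 * 5, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, face_img, mel_slice):
        e_img = F.normalize(self.img_branch(face_img), dim=-1)
        e_aud = F.normalize(self.aud_branch(mel_slice), dim=-1)
        return (e_img * e_aud).sum(dim=-1)                    # cosine similarity = synchronization score

scorer = LipSyncScorer().eval()                               # assumed to be pre-trained and frozen
face = torch.randn(4, 3, 96, 96)                              # rendered face prediction frames
mel = torch.randn(4, 80, 5)                                   # matching 5-frame mel-spectrogram slices
score = scorer(face, mel)
sync_sub_loss = (1.0 - score).mean()                          # low similarity -> high synchronization sub-loss
print(sync_sub_loss.item())
```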
  • the facial expression prediction sub-model may include a vector concatenation network and a facial expression recognition network.
  • the above-mentioned vector concatenation network combines the above-mentioned first feature vector and second feature vector to obtain the target feature vector.
  • specifically, the first feature vector and the second feature vector are input into the vector concatenation network after the latest round of parameter updates, where the first feature vector and the second feature vector are combined by weighted summation.
  • the output of the vector concatenation network is the target feature vector.
  • the facial expression recognition network performs facial expression recognition based on the target feature vector to obtain first facial expression prediction data.
  • after the target feature vector corresponding to each video sample data is obtained, the target feature vector is input into the facial expression recognition network after the latest round of parameter updates to perform facial expression prediction.
  • the output of the facial expression recognition network is the first facial expression prediction data corresponding to the video sample data.
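  • A minimal sketch of this two-stage structure is given below, assuming the combination network projects both feature vectors to a common width and mixes them with learnable weights to form the target feature vector, after which the expression recognition network maps it to per-frame expression coefficients; all dimensions and the learnable mixing weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExpressionPredictor(nn.Module):
    """Facial expression prediction sub-model: vector combination network + expression recognition network."""
    def __init__(self, content_dim=256, speaker_dim=128, hidden=256, expr_dim=64):
        super().__init__()
        # Combination network: project both inputs to a shared width and mix them with learnable
        # weights, producing the target feature vector.
        self.proj_content = nn.Linear(content_dim, hidden)
        self.proj_speaker = nn.Linear(speaker_dim, hidden)
        self.mix = nn.Parameter(torch.tensor([0.5, 0.5]))
        # Expression recognition network: target feature vector -> expression coefficients per frame.
        self.recognizer = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, expr_dim))

    def forward(self, first_vecs, second_vec):
        second_per_frame = second_vec.unsqueeze(1).expand(-1, first_vecs.size(1), -1)
        target = self.mix[0] * self.proj_content(first_vecs) + self.mix[1] * self.proj_speaker(second_per_frame)
        return self.recognizer(target)                        # first facial expression prediction data

model = ExpressionPredictor()
pred = model(torch.randn(2, 100, 256), torch.randn(2, 128))
print(pred.shape)                                             # (2, 100, 64)
```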
  • in the training process of the face-driven model, mainly the model parameters of the facial expression prediction sub-model are iteratively updated; the above-mentioned first vector extraction sub-model, second vector extraction sub-model and lip shape and speech synchronization recognition sub-model are all preset neural network models pre-trained on the corresponding sample data sets, and the above-mentioned facial expression renderer is a pre-built three-dimensional face renderer. The pre-trained first vector extraction sub-model, second vector extraction sub-model, lip shape and speech synchronization recognition sub-model and the facial expression renderer output the basic data required during training of the face-driven model; the total loss value of the facial expression prediction sub-model is obtained based on this basic data, and the model parameters of the facial expression prediction sub-model are then iteratively updated based on the total loss value.
  • the training process of the above-mentioned first vector extraction sub-model can refer to the specific training process of the end-to-end speech recognition model in the relevant technology; and the construction process of the above-mentioned facial expression renderer can refer to the construction process of the differentiable renderer in the relevant technology, which will not be repeated here.
  • for the training process of the second vector extraction sub-model, before the above S102 of obtaining N video sample data, the method further includes:
  • Step A1 obtaining a first sample data set; wherein the first sample data set may include a speaker speech data set used for speaker classification training through non-phoneme feature recognition.
  • for example, the first sample data set may include a speaker speech data set for speaker classification training through voiceprint feature recognition and/or a speaker speech data set for speaker classification training through emotion feature recognition.
  • Step A2 Iteratively update the parameters of the first preset neural network model based on the first sample data set to obtain the trained second vector extraction sub-model.
  • the above-mentioned first preset neural network model can be a speaker representation recognition model to be trained.
  • the first sample data set is first input into the speaker representation recognition model to be trained for speaker classification to obtain a speaker classification prediction result, and then a speaker classification loss value is determined based on the speaker classification prediction result and the speaker's actual classification information.
  • based on the speaker classification loss value, the model parameters of the first preset neural network model are iteratively updated until the current model training result meets the preset model training end condition, and the trained first preset neural network model is obtained; the trained first preset neural network model is used as the trained second vector extraction sub-model, which performs non-phoneme feature recognition on the sample speech data to obtain the second feature vector.
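  • A sketch of how the second vector extraction sub-model could be pre-trained as a speaker classifier is given below: the classification head is used only during pre-training, and the encoder's utterance-level embedding later serves as the second feature vector. The encoder architecture, the number of speakers and the dummy batch are placeholder assumptions.

```python
import torch
import torch.nn as nn

n_speakers = 1000                                            # assumed label space of the speaker speech data set
encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 128))   # stand-in speaker encoder
classifier = nn.Linear(128, n_speakers)                      # classification head used only for pre-training
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def pretrain_step(mel_frames, speaker_ids):
    """Speaker classification step: the classification loss shapes the speaker representation."""
    emb = encoder(mel_frames).mean(dim=1)                    # utterance-level embedding (future second feature vector)
    loss = criterion(classifier(emb), speaker_ids)           # speaker classification loss value
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Dummy batch: 8 utterances x 100 frames x 80 mel bins, with ground-truth speaker labels.
print(pretrain_step(torch.randn(8, 100, 80), torch.randint(0, n_speakers, (8,))))
```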
  • for the training process of the lip shape and speech synchronization recognition sub-model, before the step S102 of obtaining N video sample data, the method further includes:
  • Step B1 obtaining a second sample data set; wherein the second sample data set may include a sample data set for classification training on whether lip shape and speech are synchronized, and each sample data includes a pair of image frames and speech frames.
  • Step B2 iteratively updating the parameters of the second preset neural network model based on the second sample data set to obtain the trained lip-sync recognition sub-model.
  • the above-mentioned second preset neural network model can be a neural network model to be trained for identifying whether the facial lip shape and speech frame are synchronized.
  • the second sample data set is first input into the second preset neural network model to be trained for binary classification to obtain a lip-speech synchronization classification prediction result, and then the lip-speech synchronization classification loss value is determined based on the lip-speech synchronization classification prediction result and the actual classification information of whether the lip-speech is synchronized or not.
  • the model parameters of the second preset neural network model are iteratively updated until the current model training result meets the preset model training end condition, thereby obtaining the trained second preset neural network model;
  • the trained second preset neural network model is used as the trained lip shape and speech synchronization recognition sub-model; the trained lip shape and speech synchronization recognition sub-model is used to determine the synchronization sub-loss value corresponding to the video sample data based on the face prediction image and the sample speech data.
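  • The binary classification pre-training described here can be sketched as follows: paired image and speech frames are labeled as synchronized (1) or not synchronized (0) and trained with binary cross-entropy. The two embedding branches, their sizes and the dummy data are illustrative assumptions, not the patent's concrete network.

```python
import torch
import torch.nn as nn

img_branch = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 256), nn.ReLU())   # lip-region embedding (stand-in)
aud_branch = nn.Sequential(nn.Flatten(), nn.Linear(80 * 5, 256), nn.ReLU())        # speech-frame embedding (stand-in)
head = nn.Linear(512, 1)                                                            # synchronized / not synchronized
params = list(img_branch.parameters()) + list(aud_branch.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def sync_pretrain_step(face_frames, mel_slices, is_synced):
    """Binary classification on (image frame, speech frame) pairs: 1 = lip shape matches the audio."""
    logits = head(torch.cat([img_branch(face_frames), aud_branch(mel_slices)], dim=-1)).squeeze(-1)
    loss = bce(logits, is_synced)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(sync_pretrain_step(torch.randn(8, 3, 96, 96), torch.randn(8, 80, 5), torch.randint(0, 2, (8,)).float()))
```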
  • a more realistic background picture can also be added to the virtual digital human video, that is, the face prediction image rendered by the facial expression renderer needs to be fused with the specified background picture.
  • the trained face-driven model may also include a background synthesis renderer, that is, the trained face-driven model includes the first vector extraction sub-model, the second vector extraction sub-model, the facial expression prediction sub-model, the facial expression renderer and the background synthesis renderer; wherein the background synthesis renderer may include a neural renderer or other three-dimensional renderers; the background synthesis renderer may be pre-constructed based on a third sample data set (i.e., a sample video data set containing the target background picture) before the training of the face-driven model, or may be constructed in real time based on the above-mentioned face prediction image and the target background picture during the training of the face-driven model.
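  • The patent describes the background synthesis renderer as a neural or other three-dimensional renderer; as a much simpler stand-in for the fusion it performs, the sketch below alpha-composites a rendered face image onto a specified background picture. The mask-based compositing and array shapes are illustrative assumptions only.

```python
import numpy as np

def composite_face_on_background(face_rgb, face_alpha, background_rgb):
    """Alpha-composite a rendered face image onto a background picture (H x W x 3 float arrays in [0, 1])."""
    alpha = face_alpha[..., None]                      # (H, W, 1) mask of the rendered face region
    return alpha * face_rgb + (1.0 - alpha) * background_rgb

# Dummy data: a 256x256 rendered face with its alpha mask and a target background picture.
face = np.random.rand(256, 256, 3)
mask = np.zeros((256, 256)); mask[64:192, 64:192] = 1.0
background = np.random.rand(256, 256, 3)
frame = composite_face_on_background(face, mask, background)
print(frame.shape)
```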
  • as shown in FIG6, another schematic diagram of the specific implementation principle of the face-driven model training process is given, which specifically includes:
  • a trained first vector extraction sub-model is obtained in advance based on sample data set 1; a trained second vector extraction sub-model is obtained in advance based on sample data set 2; a trained lip shape and speech synchronization recognition sub-model is obtained in advance based on sample data set 3; and a facial expression renderer is constructed in advance based on sample data set 4. A model to be trained is then generated from the trained first vector extraction sub-model, the second vector extraction sub-model, the lip shape and speech synchronization recognition sub-model, the facial expression renderer and the facial expression prediction sub-model to be trained. Based on the obtained video sample data of multiple sample users, the model parameters of the facial expression prediction sub-model to be trained are iteratively updated to obtain the facial expression prediction sub-model after parameter iteration; and based on the target background picture video and the face prediction images output by the facial expression renderer, the parameters of the initial background synthesis renderer are iteratively optimized to obtain the background synthesis renderer after parameter optimization. Finally, a trained face driving model is generated from the pre-trained first vector extraction sub-model, the second vector extraction sub-model, the facial expression renderer, the facial expression prediction sub-model after parameter iteration, and the background synthesis renderer after parameter optimization.
  • in summary, the training method of the face-driven model in the embodiment of the present application uses the first feature vector obtained by performing text content recognition on the sample speech data and the second feature vector obtained by performing non-phoneme feature recognition as input data of the facial expression prediction sub-model to perform facial expression prediction and obtain facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data. That is, not only is the first feature vector irrelevant to individual characteristics extracted from the phoneme dimension, but the second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters irrelevant to individual characteristics and expression parameters related to individual characteristics. In this way, the accuracy of the model parameters of the facial expression prediction sub-model can be improved, thereby improving the expression prediction accuracy of the facial expression prediction sub-model.
  • in the application stage, the third feature vector obtained by performing text content recognition on the target speech data and the fourth feature vector obtained by performing non-phoneme feature recognition are used as input data of the facial expression prediction sub-model to perform facial expression prediction and obtain facial expression prediction data.
  • Image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video. This can improve the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • FIG. 7 is a flow chart of the video generation method provided by the embodiment of the present application.
  • the method in Figure 7 can be executed by an electronic device provided with a video generation device, and the electronic device can be a terminal device or a designated server.
  • a pre-trained face driving model is deployed in the electronic device; wherein the trained face driving model includes a trained first vector extraction sub-model, a second vector extraction sub-model, a facial expression renderer, and a facial expression prediction sub-model.
  • the face-driven model trained by the model training method provided in the embodiment of the present application can be applied to any specific application scenario that requires the generation of a virtual digital human video, for example, an application scenario for generating a virtual customer service question-answering video, an application scenario for generating a virtual singer's song singing video, and an application scenario for generating a virtual anchor's product introduction video.
  • the above-mentioned video generation method includes at least the following steps:
  • the target voice data includes the original voice data of a target user or the synthesized voice data of the target user.
  • specifically, the target voice data may be voice data acquired directly (i.e., the original voice data of the target user) or voice data converted from preset text content (i.e., the synthesized voice data of the target user). In the case where the original voice data of the target user cannot be acquired directly, the target text data is first acquired; a preset text-to-speech model is then used to perform text-to-speech conversion on the target text data to obtain the synthesized voice data of the target user. The preset text-to-speech model may be pre-trained based on a sample data set, for example a model based on text-to-speech (TTS) technology; the text-to-speech model is used to obtain synthesized voice data containing the non-phoneme features of the target user, and the synthesized voice data converted from the target text data can reflect the voiceprint features of the target user (i.e., a segment of speech that sounds as if it were spoken by the target user).
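  • The branching between original and synthesized voice data can be sketched as follows; the synthesize_speech helper is a hypothetical stand-in for the preset text-to-speech model, not a real TTS API.

```python
from pathlib import Path
from typing import Optional

def synthesize_speech(text: str, out_path: str) -> str:
    """Hypothetical stand-in for the preset text-to-speech (TTS) model; replace with a real TTS system."""
    Path(out_path).write_bytes(b"")                    # placeholder: a real model would write synthesized audio here
    return out_path

def acquire_target_voice(original_audio: Optional[str], target_text: Optional[str]) -> str:
    """Use the target user's original voice data if available, otherwise synthesize it from target text data."""
    if original_audio is not None:
        return original_audio                          # original voice data of the target user
    if target_text is None:
        raise ValueError("either original audio or target text data is required")
    return synthesize_speech(target_text, "synthesized_target_voice.wav")

print(acquire_target_voice(None, "Hello, how can I help you today?"))
```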
  • the trained face drive model may include a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; wherein the specific implementation methods of the face drive processing are: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; image rendering is performed based on the second facial expression prediction data to obtain a target virtual digital human video.
  • in this way, a third feature vector obtained by performing text content recognition on the target voice data and a fourth feature vector obtained by performing non-phoneme feature recognition are used as input data of the facial expression prediction sub-model to perform facial expression prediction and obtain facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video. Therefore, when performing facial expression prediction, the facial expression prediction sub-model considers not only the third feature vector, which is unrelated to individual characteristics, but also the fourth feature vector, which is related to individual characteristics, thereby improving the prediction accuracy of the second facial expression prediction data; moreover, if the lip-speech synchronization loss is considered during the training of the facial expression prediction sub-model, the facial expression prediction sub-model can also ensure the synchronization between the facial lip shape in the second facial expression prediction data and the voice frames.
  • the above-mentioned face driving model also includes a facial expression renderer and a background synthesis renderer; specifically, the above-mentioned image rendering based on the second facial expression prediction data to obtain the target virtual digital human video specifically includes: the facial expression renderer performs facial image rendering based on the above-mentioned second facial expression prediction data to obtain a face rendering image; the background synthesis renderer performs background synthesis based on the above-mentioned face rendering image to obtain the target virtual digital human video.
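  • Tying the stages together, the sketch below expresses the face drive processing order as a plain function over the sub-models: target voice data is mapped to the third and fourth feature vectors, expressions are predicted per frame, each frame is rendered by the facial expression renderer and then fused with the background by the background synthesis renderer. The stand-in callables at the end only make the sketch runnable and are not the patent's components.

```python
import torch

def drive_face(target_mel, content_enc, speaker_enc, expr_predictor, face_renderer, bg_renderer):
    """Face drive processing: voice -> feature vectors -> expression prediction -> rendering -> video frames."""
    with torch.no_grad():
        third_vec = content_enc(target_mel)            # text-content features of the target voice data
        fourth_vec = speaker_enc(target_mel)           # non-phoneme speaker/style features
        expr = expr_predictor(third_vec, fourth_vec)   # second facial expression prediction data (per frame)
        frames = [bg_renderer(face_renderer(e)) for e in expr.unbind(dim=1)]   # render face, then add background
    return frames                                      # frame sequence of the target virtual digital human video

# Minimal stand-ins so the sketch runs end to end (the real sub-models are the trained components).
fake_content_enc = lambda mel: torch.randn(1, mel.size(1), 256)
fake_speaker_enc = lambda mel: torch.randn(1, 128)
fake_expr_predictor = lambda a, b: torch.randn(1, a.size(1), 64)
fake_face_renderer = lambda expr_coeffs: torch.rand(3, 256, 256)
fake_bg_renderer = lambda face_img: face_img
frames = drive_face(torch.randn(1, 100, 80), fake_content_enc, fake_speaker_enc,
                    fake_expr_predictor, fake_face_renderer, fake_bg_renderer)
print(len(frames), frames[0].shape)
```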
  • it should be noted that the voice data and image data in the finally generated target virtual digital human video can belong to the same user or to different users; that is, the target virtual digital human video can include the voice data and virtual digital human image of the target user, or the target user's voice data and another user's virtual digital human image; and the background image in the target virtual digital human video can also be set flexibly according to actual needs.
  • the target voice data under the preset application scenario is obtained; the target voice data is input into the trained face drive model for face drive processing to obtain a target virtual digital human video; for example, if the preset application scenario is an application scenario for generating a question-answering video of a virtual customer service, the target question-answering voice obtained based on the target question-answering text conversion is obtained, and the target question-answering voice is used as the target voice data; for another example, if the preset application scenario is an application scenario for generating a song singing video of a virtual singer, the target song singing voice obtained based on the target song lyrics text conversion is obtained, and the target song singing voice is used as the target voice data; for another example, if the preset application scenario is an application scenario for generating a product introduction video of a virtual anchor, the target product introduction voice obtained based on the target product introduction text conversion is obtained, and the target product introduction voice is used as the target voice data.
  • a schematic diagram of the specific implementation principle of a virtual digital human video generation process is given, which specifically includes: obtaining target text data under a preset application scenario; using a preset TTS technology to perform text-to-speech conversion on the target text data to obtain synthesized speech data of a first target user; inputting the synthesized speech data of the first target user into the first vector extraction sub-model for text content recognition to obtain a third feature vector; inputting the synthesized speech data of the first target user into the second vector extraction sub-model for non-phoneme feature recognition to obtain a fourth feature vector; inputting the third feature vector and the fourth feature vector into the facial expression prediction sub-model for facial expression prediction to obtain second facial expression prediction data; inputting the real face image of a second target user into the facial feature extractor for feature extraction to obtain the shape feature vector and texture feature vector of the second target user, where the second target user may be the same as or different from the first target user, and the facial feature extraction may be completed before the virtual digital human video is generated or may be performed in a synchronized manner with the generation process of the virtual digital human video; inputting the second facial expression prediction data, the shape feature vector and the texture feature vector into the facial expression renderer for facial image rendering to obtain a target face rendering image; and inputting the target face rendering image into the background synthesis renderer for background synthesis to obtain the target virtual digital human video.
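Tying the steps of this figure together, one possible end-to-end orchestration is sketched below. Every component passed in (the TTS engine, the two extraction sub-models, the prediction sub-model, the facial feature extractor, and the two renderers) is treated as an opaque callable, and all signatures are assumptions rather than interfaces defined in this document.

```python
def generate_virtual_digital_human_video(target_text, reference_face_image, background,
                                         tts, content_model, speaker_model, predictor,
                                         face_feature_extractor, face_renderer,
                                         background_renderer):
    """Illustrative end-to-end generation pipeline (hypothetical interfaces)."""
    speech = tts(target_text)                           # synthesized speech of the first target user
    content_vec = content_model(speech)                 # third feature vector (text content)
    speaker_vec = speaker_model(speech)                 # fourth feature vector (non-phoneme)
    expr_params = predictor(content_vec, speaker_vec)   # second facial expression prediction data

    # Shape / texture come from a (possibly different) second target user's face image.
    shape_vec, texture_vec = face_feature_extractor(reference_face_image)

    frames = []
    for expr in expr_params:                            # per-frame rendering
        face_img = face_renderer(expr, shape_vec, texture_vec)
        frames.append(background_renderer(face_img, background))
    return frames
```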
  • the video generation method in the embodiments of the present application uses the third feature vector obtained by performing text content recognition on the target voice data and the fourth feature vector obtained by performing non-phoneme feature recognition as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video; in the training process of the face driving model, the first feature vector obtained by performing text content recognition on the sample voice data and the second feature vector obtained by performing non-phoneme feature recognition are mainly used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data, and the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data; that is, not only is a first feature vector unrelated to individual characteristics extracted from the phoneme dimension, but a second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters unrelated to individual characteristics and expression parameters related to individual characteristics; this improves the accuracy of the model parameters of the facial expression prediction sub-model, thereby improving its expression prediction accuracy and, in turn, the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • the embodiment of the present application also provides a training device for the face driving model.
  • FIG. 9 is a schematic diagram of the module composition of the training device for the face-driven model provided in the embodiments of the present application; the device is used to execute the training method of the face-driven model described in FIGS. 1 to 6.
  • the device includes: a sample data acquisition module 902, used to obtain N video sample data, where each of the video sample data includes a real face image and sample voice data of a sample user, and N is an integer greater than 1; and a model training module 904, used to input the N video sample data into the model to be trained for iterative model training until the current model training result meets the preset model training end condition, obtaining the trained face-driven model; the model to be trained includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation of each model training is as follows: for each of the video sample data: the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector; the facial expression prediction sub-model performs facial expression prediction based on the first feature vector and the second feature vector to obtain first facial expression prediction data; a first loss value is determined based on the first facial expression prediction data and the real facial expression data corresponding to each of the video sample data, the real facial expression data being obtained from the real face image in the video sample data; and the parameters of the model to be trained are updated based on the first loss value.
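To make the per-sample training flow above concrete, here is a minimal sketch of one training iteration. It assumes PyTorch, assumes the two extraction sub-models are frozen pre-trained callables, assumes the optimizer was built over the prediction sub-model's parameters only, and uses an L1 distance as a stand-in for the first loss value; none of these choices are prescribed by the document.

```python
import torch
import torch.nn.functional as F

def train_step(batch, content_model, speaker_model, predictor, optimizer):
    """One illustrative training iteration over a batch of video sample data.

    `batch["voice"]` holds the sample voice data and `batch["real_expr"]` the
    real facial expression data extracted from the real face images.
    All interfaces here are assumptions, not the document's concrete API.
    """
    predictor.train()
    with torch.no_grad():                              # extraction sub-models stay frozen
        content_vec = content_model(batch["voice"])    # first feature vector (text content)
        speaker_vec = speaker_model(batch["voice"])    # second feature vector (non-phoneme)

    pred_expr = predictor(content_vec, speaker_vec)    # first facial expression prediction data

    # First loss value: distance between predicted and real facial expression data.
    loss = F.l1_loss(pred_expr, batch["real_expr"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # updates only the prediction sub-model (optimizer built over its params)
    return loss.item()
```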
  • the training device for the face-driven model in the embodiments of the present application uses the first feature vector obtained by performing text content recognition on the sample speech data and the second feature vector obtained by performing non-phoneme feature recognition as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data; that is, not only is a first feature vector unrelated to individual characteristics extracted from the phoneme dimension, but a second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters unrelated to individual characteristics and expression parameters related to individual characteristics; this improves the accuracy of the model parameters of the facial expression prediction sub-model and thereby its expression prediction accuracy; furthermore, in the application stage of the face-driven model (i.e., the model including the trained facial expression prediction sub-model), the third feature vector obtained by performing text content recognition on the target speech data and the fourth feature vector obtained by performing non-phoneme feature recognition are likewise used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video, which can improve the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • the embodiment of the training device for the face-driven model in the present application and the embodiment of the training method for the face-driven model in the present application are based on the same inventive concept. Therefore, the specific implementation of this embodiment can refer to the implementation of the corresponding training method for the face-driven model mentioned above, and the repeated parts will not be repeated.
  • FIG. 10 is a schematic diagram of the module composition of the video generation device provided in the embodiment of the present application.
  • the device is used to execute the video generation method described in Figures 7 to 8.
  • the device includes: a target data acquisition module 1002, which is used to acquire target voice data; the target voice data includes the original voice data of the target user or the synthesized voice data of the target user; a video generation module 1004, which is used to input the target voice data into the trained face drive model for face drive processing to obtain the target virtual digital human video.
  • the face drive model includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model;
  • the specific implementation methods of the face drive processing are: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; image rendering is performed based on the second facial expression prediction data to obtain the target virtual digital human video.
  • in the application stage of the face driving model (i.e., the model including the trained facial expression prediction sub-model), the video generation device in the embodiments of the present application uses the third feature vector obtained by performing text content recognition on the target voice data and the fourth feature vector obtained by performing non-phoneme feature recognition as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video; in the training process of the face-driven model, the first feature vector obtained by performing text content recognition on the sample voice data and the second feature vector obtained by performing non-phoneme feature recognition are mainly used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data, and the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data; that is, not only is a first feature vector unrelated to individual characteristics extracted from the phoneme dimension, but a second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters unrelated to individual characteristics and expression parameters related to individual characteristics; this improves the accuracy of the model parameters of the facial expression prediction sub-model, thereby improving its expression prediction accuracy and, in turn, the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • the embodiment of the training device for the face-driven model in the present application and the embodiment of the training method for the face-driven model in the present application are based on the same inventive concept. Therefore, the specific implementation of this embodiment can refer to the implementation of the corresponding training method for the face-driven model mentioned above, and the repeated parts will not be repeated.
  • an embodiment of the present application also provides a computer device, which is used to execute the above-mentioned face-driven model training method or video generation method, as shown in Figure 11.
  • Computer devices may vary greatly due to different configurations or performances, and may include one or more processors 1101 and memory 1102.
  • the memory 1102 may store one or more application programs or data.
  • the memory 1102 may be a temporary storage or a permanent storage.
  • the application program stored in the memory 1102 may include one or more modules (not shown in the figure), and each module may include a series of computer executable instructions in the computer device.
  • the processor 1101 may be configured to communicate with the memory 1102 to execute a series of computer executable instructions in the memory 1102 on the computer device.
  • the computer device may also include one or more power supplies 1103, one or more wired or wireless network interfaces 1104, one or more input and output interfaces 1105, one or more keyboards 1106, etc.
  • a computer device includes a memory and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer executable instructions in the computer device, and is configured to be executed by one or more processors.
  • the one or more programs include the following computer executable instructions: obtaining N video sample data; each of the video sample data includes a real face image and sample voice data of a sample user, and N is an integer greater than 1; inputting the N video sample data into a model to be trained for iterative model training to obtain a face-driven model; wherein the model to be trained includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation of each model training is as follows: for each of the video sample data: the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector; the facial expression prediction sub-model performs facial expression prediction based on the first feature vector and the second feature vector to obtain first facial expression prediction data; a first loss value is determined based on the first facial expression prediction data and the real facial expression data corresponding to each of the video sample data, the real facial expression data being obtained from the real face image in the video sample data; and the parameters of the model to be trained are updated based on the first loss value.
  • in another specific embodiment, the computer device includes a memory and one or more programs, wherein the one or more programs are stored in the memory, the one or more programs may include one or more modules, each module may include a series of computer executable instructions for the computer device, and the one or more programs are configured to be executed by one or more processors and include the following computer executable instructions: obtaining target voice data, where the target voice data includes the original voice data of the target user or the synthesized voice data of the target user; inputting the target voice data into the trained face drive model for face drive processing to obtain a target virtual digital human video; wherein the face drive model includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation of the face drive processing is as follows: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; and image rendering is performed based on the second facial expression prediction data to obtain the target virtual digital human video.
  • during training of the face-driven model, the computer device in the embodiments of the present application uses the first feature vector obtained by performing text content recognition on the sample voice data and the second feature vector obtained by performing non-phoneme feature recognition as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; it then updates the parameters of the model to be trained based on the facial expression prediction data and the real facial expression data; that is, not only is the first feature vector, which is unrelated to individual characteristics, extracted from the phoneme dimension, but the second feature vector, which is related to individual characteristics, is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters that are unrelated to individual characteristics and expression parameters that are related to individual characteristics; this improves the accuracy of the model parameters of the facial expression prediction sub-model and thereby its expression prediction accuracy. Furthermore, in the application stage of the face-driven model (i.e., the model including the trained facial expression prediction sub-model), the third feature vector obtained by performing text content recognition on the target speech data and the fourth feature vector obtained by performing non-phoneme feature recognition are also used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video. This can improve the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • the embodiment of the computer device in the present application and the embodiment of the training method of the face driving model in the present application are based on the same inventive concept. Therefore, the specific implementation of this embodiment can refer to the implementation of the aforementioned corresponding training method of the face driving model, and the repeated parts will not be repeated.
  • the embodiments of the present application also provide a storage medium for storing computer executable instructions.
  • the storage medium can be a USB flash drive, an optical disk, a hard disk, etc.
  • the computer executable instructions stored in the storage medium implement the following process when executed by the processor: obtaining N video sample data, where each of the video sample data includes a real face image and sample voice data of a sample user, and N is an integer greater than 1; inputting the N video sample data into the model to be trained for iterative model training to obtain a face driving model; wherein the model to be trained includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation of each model training is as follows: for each of the video sample data: the first vector extraction sub-model performs text content recognition on the sample voice data in the video sample data to obtain a first feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the sample voice data to obtain a second feature vector; the facial expression prediction sub-model performs facial expression prediction based on the first feature vector and the second feature vector to obtain first facial expression prediction data; a first loss value is determined based on the first facial expression prediction data and the real facial expression data corresponding to each of the video sample data, the real facial expression data being obtained from the real face image in the video sample data; and the parameters of the model to be trained are updated based on the first loss value.
  • the storage medium may be a USB flash drive, a CD, a hard disk, etc.
  • the computer executable instructions stored in the storage medium implement the following process when executed by the processor: obtaining target voice data, where the target voice data includes the original voice data of the target user or the synthesized voice data of the target user; inputting the target voice data into the trained face driving model for face drive processing to obtain a target virtual digital human video; wherein the face drive model includes a first vector extraction sub-model, a second vector extraction sub-model and a facial expression prediction sub-model; the specific implementation of the face drive processing is as follows: the first vector extraction sub-model performs text content recognition on the target voice data to obtain a third feature vector; the second vector extraction sub-model performs non-phoneme feature recognition on the target voice data to obtain a fourth feature vector; the facial expression prediction sub-model performs expression prediction based on the third feature vector and the fourth feature vector to obtain second facial expression prediction data; and image rendering is performed based on the second facial expression prediction data to obtain the target virtual digital human video.
  • when the computer executable instructions stored in the storage medium in the embodiments of the present application are executed by the processor, during training of the face-driven model the first feature vector obtained by performing text content recognition on the sample speech data and the second feature vector obtained by performing non-phoneme feature recognition are used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and the real facial expression data; that is, not only is the first feature vector, which is unrelated to individual characteristics, extracted from the phoneme dimension, but the second feature vector, which is related to individual characteristics, is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors, so that the facial expression prediction sub-model can simultaneously learn expression parameters that are unrelated to individual characteristics and expression parameters that are related to individual characteristics; this improves the accuracy of the model parameters of the facial expression prediction sub-model and thereby its expression prediction accuracy; furthermore, in the application stage of the face-driven model (i.e., the model including the trained facial expression prediction sub-model), the third feature vector obtained by performing text content recognition on the target speech data and the fourth feature vector obtained by performing non-phoneme feature recognition are also used as input data of the facial expression prediction sub-model to perform facial expression prediction, obtaining facial expression prediction data; image rendering is then performed based on the facial expression prediction data to obtain a virtual digital human video, which can improve the expression authenticity of the virtual digital human video generated by the trained face-driven model.
  • the embodiments of the present application may be provided as methods, systems or computer program products. Therefore, the embodiments of the present application may adopt the form of complete hardware embodiments, complete software embodiments, or embodiments in combination with software and hardware. Moreover, the application may adopt the form of a computer program product implemented on one or more computer-readable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information.
  • Information can be computer readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device.
  • as defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • One or more embodiments of the present application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices that are connected through a communication network.
  • program modules may be located in local and remote computer storage media including memory storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present application provide a training method for a face-driven model, a video generation method, and corresponding apparatuses. During training of the face-driven model, a first feature vector obtained by performing text content recognition on sample voice data and a second feature vector obtained by performing non-phoneme feature recognition are used as input data of a facial expression prediction sub-model to perform facial expression prediction, yielding facial expression prediction data; the parameters of the model to be trained are then updated based on the facial expression prediction data and real facial expression data. That is, not only is a first feature vector unrelated to individual characteristics extracted from the phoneme dimension, but a second feature vector related to individual characteristics is also extracted from the non-phoneme dimension, and the parameters of the facial expression prediction sub-model are updated based on both feature vectors.

Description

人脸驱动模型的训练方法、视频生成方法及装置
交叉引用
本申请要求在2022年10月09日提交中国专利局、申请号为202211226776.6、名称为“人脸驱动模型的训练方法、视频生成方法及装置”的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及一种人脸驱动模型的训练方法、视频生成方法及装置。
背景技术
目前,随着人工智能技术的快速发展,基于语音驱动生成虚拟数字人视频的应用越来越广泛,其中,虚拟数字人可以是虚拟客服、虚拟导游、智能助手等助手型数字人,还可以是虚拟歌手、虚拟代言人等娱乐型数字人,也可以是虚拟主播、虚拟主持人等主播型数字人;然而,在一些情形下,用于生成虚拟数字人视频的人脸驱动模型的参数准确度低,导致虚拟数字人的表情预测准确度低,从而导致视频中虚拟数字人的表情真实性差的问题。
发明内容
本申请实施例的目的是提供一种人脸驱动模型的训练方法、视频生成方法及装置,能够提高面部表情预测子模型的表情预测准确度,进而提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
为了实现上述技术方案,本申请实施例是这样实现的:
一方面,本申请实施例提供的一种人脸驱动模型的训练方法,所述方法 包括:获取N个视频样本数据;每个所述视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数;将所述N个视频样本数据输入至待训练模型进行模型迭代训练,得到人脸驱动模型;其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;每次模型训练的具体实现方式有:针对每个所述视频样本数据:所述第一向量提取子模型对所述视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于各所述视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述视频样本数据中的真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
一方面,本申请实施例提供的一种视频生成方法,所述方法包括:获取目标语音数据;所述目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据;将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的具体实现方式有:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
一方面,本申请实施例提供的一种人脸驱动模型的训练装置,所述装置包括:样本数据获取模块,用于获取N个视频样本数据;每个所述视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数;模 型训练模块,用于将所述N个视频样本数据输入至待训练模型进行模型迭代训练,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的人脸驱动模型;其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;每次模型训练的具体实现方式有:针对每个所述视频样本数据:所述第一向量提取子模型对所述视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于各所述视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述视频样本数据中的真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
一方面,本申请实施例提供的一种视频生成装置,所述装置包括:目标数据获取模块,用于获取目标语音数据;所述目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据;视频生成模块,用于将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的具体实现方式有:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
一方面,本申请实施例提供的一种计算机设备,所述设备包括:处理器;以及被安排成存储计算机可执行指令的存储器,所述可执行指令被配置由所述处理器执行,所述可执行指令包括用于执行如上述方法中的步骤。
一方面,本申请实施例提供的一种存储介质,其中,所述存储介质用于存储计算机可执行指令,所述可执行指令使得计算机执行如上述方法中的步骤。
一方面,本申请实施例提供了一种计算机程序产品,其中,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如上述方法。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请一个或多个中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的人脸驱动模型的训练方法的流程示意图;
图2为本申请实施例提供的人脸驱动模型的训练方法中每次模型训练过程的第一种流程示意图;
图3为本申请实施例提供的人脸驱动模型的训练方法的第一种实现原理示意图;
图4为本申请实施例提供的人脸驱动模型的训练方法中每次模型训练过程的第二种流程示意图;
图5为本申请实施例提供的人脸驱动模型的训练方法的第二种实现原理示意图;
图6为本申请实施例提供的人脸驱动模型的训练方法的第三种实现原理示意图;
图7为本申请实施例提供的视频生成方法的流程示意图;
图8为本申请实施例提供的视频生成方法的实现原理示意图;
图9为本申请实施例提供的人脸驱动模型的训练装置的模块组成示意图;
图10为本申请实施例提供的视频生成装置的模块组成示意图;
图11为本申请实施例提供的计算机设备的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请一个或多个中的技术方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一个或多个一部分实施例,而不是全部的实施例。基于本申请一个或多个中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请的保护范围。
需要说明的是,在不冲突的情况下,本申请中的一个或多个实施例以及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请实施例。
本申请一个或多个实施例提供了一种人脸驱动模型的训练方法、视频生成方法及装置,考虑到如果先从音素维度基于语音数据提取第一特征向量,并仅将第一特征向量输入至人脸驱动模型中的面部表情预测子模型进行面部表情预测,得到面部表情预测数据;利用表征个体特征的参数矩阵T对该表情预测数据进行调整处理,来实现个体特征无关的表情参数与个体特征相关的表情参数之间的映射,因此,针对样本用户不包含目标用户(即目标说话人)的情况,也需要预先基于目标用户的大量视频数据训练得到相应的参数矩阵T,才能够准确地对目标说话人的面部表情进行预测,也就是说,无论是针对每个样本说话人,还是针对目标说话人预先均需要分别训练得到一个对应的参数矩阵T,这样势必存在需要获取目标用户的大量视频数据的需求问题,而对于无法获取目标用户的大量视频数据的情况而言,无法准确地训练得到一个目标用户对应的参数矩阵T,从而无法实现针对目标用户进行个体特征无关的表情参数与个体特征相关的表情参数之间的映射,基于上述问 题,本技术方案通过在人脸驱动模型的训练过程中,将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量、以及进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性;本申请在模型训练阶段,在将从音素维度提取的与个体特征无关的第一特征向量作为模型输入的基础上,还直接将从非音素维度提取的与个体特征有关的第二特征向量作为模型输入,使得表情预测子模型不仅能够学习到与个体特征无关的表情参数,还能够学习到与个体特征有关的表情参数,因此,在模型应用阶段,表情预测子模型输出的面部表情预测数据既能够表征基于第三特征向量(即从音素维度提取的特征向量)预测的不同目标说话人的表情共性数据,也能够表征基于第四特征向量(即从非音素维度提取的特征向量)预测的不同目标说话人的表情差异性数据,因此,由于无需预先针对目标说话人训练得到一个对应的参数矩阵T,使得无需获取目标说话人的大量视频数据,同样能够确保面部表情预测数据反映出个体之间的表情差异性,这样在仅能够获取目标说话人的单张图像的情况下,仍能够准确地对目标说 话人的面部表情进行预测,降低了对目标说话人的视频数据的数量要求的限制,提高了训练后的人脸驱动模型的适用范围。
图1为本申请一个或多个实施例提供的人脸驱动模型的训练方法的第一种流程示意图,图1中的方法能够由设置有人脸驱动模型训练装置的电子设备执行,该电子设备可以是终端设备或者指定服务器,其中,用于人脸驱动模型训练的硬件装置(即设置有人脸驱动模型训练装置的电子设备)与用于虚拟数字人视频生成的硬件装置(即设置有虚拟数字人视频生成装置的电子设备)可以相同或不同。基于本申请实施例提供的模型训练方法训练得到的人脸驱动模型可以应用到任一需要生成虚拟数字人视频的具体应用场景,例如,用于生成虚拟客服的问题解答视频的应用场景,又如,用于生成虚拟歌手的歌曲演唱视频的应用场景,再如,用于生成虚拟主播的产品介绍视频的应用场景。
具体的,针对人脸驱动模型的训练过程,如图1所示,该方法至少包括以下步骤:
S102,获取N个视频样本数据;其中,每个视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数。
针对预设应用场景,将该预设应用场景下M个样本用户的历史语音数据作为视频样本数据,M为大于1且小于或等于N的整数;例如,若预设应用场景为用于生成虚拟客服的问题解答视频的应用场景,则将M个样本客服人员的问题解答视频作为视频样本数据;又如,若预设应用场景为用于生成虚拟歌手的歌曲演唱视频的应用场景,则将M个样本虚拟歌手的歌曲演唱视频作为视频样本数据;再如,若预设应用场景为用于生成虚拟主播的产品介绍视频的应用场景,则将M个样本主播的产品介绍视频作为视频样本数据。
S104,将上述N个视频样本数据输入至待训练模型进行模型迭代训练,得到人脸驱动模型。
在获取到视频样本数据集合后,基于该视频样本数据集合对待训练模型 中预设模型参数进行迭代更新,直到当前模型训练结果满足预设模型训练结束条件,得到用于生成虚拟数字人视频的人脸驱动模型;其中,上述预设模型训练结束条件可以包括:当前模型训练轮数等于总训练轮数、模型损失函数收敛中任一项。
针对上述步骤S104中的模型迭代训练过程,下述对模型迭代训练的具体实现过程进行说明,由于模型迭代训练过程中每次模型训练的处理过程相同,因此,以任意一次模型训练为例进行细化说明。具体的,若上述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;如图2所示,每次模型训练的具体实现方式均可以有如下步骤S1041至步骤S1043:
S1041,针对每个视频样本数据:第一向量提取子模型对该视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;第二向量提取子模型对该样本语音数据进行非音素特征识别,得到第二特征向量;面部表情预测子模型基于上述第一特征向量和第二特征向量进行面部表情预测,得到第一面部表情预测数据。
上述第一向量提取子模型可以是预先训练好的语音识别模型,例如,该语音识别模型可以是预训练模型“基于深度学习的循环神经网络DeepSpeech RNN”,也可以是学习音频到文本的映射的其他神经网络模型;在获取到视频样本数据后,将视频样本数据中的样本语音数据输入至第一向量提取子模型,基于样本语音数据从音素维度提取每一帧语音信号中文本内容特征(即语音转文本处理),得到第一特征向量,例如,ASR特征向量(语音识别ASR,Automatic Speech Recognition)作为面部表情预测子模型的第一输入数据。
上述第二向量提取子模型可以是预先训练好的说话人表示识别模型,例如,该说话人表示识别模型可以是基于声纹识别算法voxceleb的模型,也可以是学习说话人辨别和确认的其他神经网络模型;在获取到视频样本数据后,不仅将视频样本数据中的样本语音数据输入至第一向量提取子模型,还将该 样本语音数据输入至第二向量提取子模型,以便利用第二向量提取子模型,基于样本语音数据从非音素维度提取每一帧语音信号中说话人表示特征,得到第二特征向量(即用于反映个体面部表情差异性的说话人表示向量),例如,能够表征说话人特性的表示向量作为面部表情预测子模型的第二输入数据。
具体的,由于说话人的语速和情绪等说话方式均能够影响到说话人的面部表情的个性化差异,例如,不同说话人的语速快慢会影响到说话人嘴巴的张合频率等面部表情,不同说话人的情绪差异性会影响到说话人嘴角上扬程度等面部表情,又由于声纹特征能够表征说话人的语速等说话方式,因此,对于第二特征向量的提取过程,上述非音素特征识别可以包括声纹特征识别、情绪特征识别中至少一项。
上述面部表情预测子模型是待训练的用于预测人脸面部表情的神经网络模型;在通过第一向量提取子模型得到第一特征向量,以及通过第二向量提取子模型得到第二特征向量之后,将第一特征向量和第二特征向量输入至面部表情预测子模型,该面部表情预测子模型的输出即为第一面部表情预测数据;其中,面部表情预测子模型中的模型参数为需要迭代训练的模型参数。
S1042,基于上述各视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;上述真实面部表情数据是基于上述视频样本数据中的真实人脸图像得到的。
在利用面部表情预测子模型得到第一面部表情预测数据之后,获取视频样本数据对应的真实面部表情数据,将该真实面部表情数据作为真实标签,将该第一面部表情预测数据作为预测标签,计算第一面部表情预测数据和真实面部表情数据之间的表情预测子损失值,即基于真实标签和预测标签得到模型的表情预测损失,再基于各视频样本数据对应的表情预测子损失值,得到第一损失值;对于某一视频样本数据而言,从视频样本数据中的真实人脸图像中提取真实面部表情数据作为真实标签,以及从视频样本数据中的样本语音数据中,提取第一特征向量和第二特征向量,再基于第一特征向量和第 二特征向量进行面部表情预测,得到第一面部表情预测数据作为预测标签,进而计算真实标签与预测标签之间的表情预测损失信息。
上述真实面部表情数据可以预先利用脸部特征提取器基于视频样本数据中的真实人脸图像进行人脸特征数据提取得到,也可以实时利用脸部特征提取器,基于视频样本数据中的真实人脸图像进行人脸特征数据提取得到;例如,脸部特征提取器可以是已有的三维面部追踪(3-dimension face tracking,3D face tracking),也可以是其他脸部特征提取器;利用脸部特征提取器对真实人脸图像进行人脸特征数据提取,得到用户的真实面部表情数据、形状特征向量和纹理特征向量;其中,形状特征向量和纹理特征向量可以作为基于面部表情预测数据进行脸部图像渲染的基础数据。
S1043,基于上述第一损失值,对上述待训练模型进行参数更新;其中,由于待训练模型中第一向量提取子模型和第二向量提取子模型的模型参数是预先训练的,并且第一损失值主要用来表征真实面部表情数据与面部表情预测数据之间的表情预测损失信息,因此第一损失值主要用来对面部表情预测子模型的模型参数进行迭代更新。
在基于第一面部表情预测数据和真实面部表情数据得到第一损失值之后,利用梯度下降方法基于上述第一损失值对面部表情预测子模型进行参数调整;其中,由于第一面部表情预测数据是基于上述第一特征向量和第二特征向量这两部分确定的,因此,第一损失值不仅能够反映从音素特征维度考量的表情预测损失分量,还能够反映从非音素特征维度考量的表情预测损失分量,因此,能够提高第一损失值的准确度,从而提高模型参数调整的准确度,使得训练后的面部表情预测子模型的面部表情预测准确度更高。
基于待训练模型的第一损失值对模型参数进行迭代训练,得到训练后的人脸驱动模型可以参见相关技术中的利用梯度下降方法反向传播对模型参数进行调优的过程,在此不再赘述。
本申请实施例中,在人脸驱动模型的训练过程中,将通过对样本语音数 据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
如图3所示,给出了一种人脸驱动模型训练过程的具体实现原理示意图,具体包括:获取多个样本用户的视频样本数据;每个视频样本数据包括真实人脸图像和样本语音数据;将视频样本数据中的样本语音数据输入至第一向量提取子模型进行文本内容识别,得到第一特征向量;以及,将样本语音数据输入至第二向量提取子模型进行非音素特征识别,得到第二特征向量;将上述第一特征向量和上述第二特征向量输入至待训练的面部表情预测子模型进行面部表情预测,得到第一面部表情预测数据;将视频样本数据中的真实人脸图像输入至脸部特征提取器进行特征提取,得到样本用户的真实面部表情数据;其中,上述脸部特征提取器还可以用于提取样本用户的形状特征向量和纹理特征向量,脸部特征数据提取过程可以是在模型训练之前完成的,也可以是与模型训练的过程同步进行的;基于各视频样本数据对应的第一面 部表情预测数据和真实面部表情数据,确定第一损失值;基于上述第一损失值,对待训练模型中面部表情预测子模型的模型参数进行迭代更新,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的人脸驱动模型;其中,训练后的人脸驱动模型包括上述第一向量提取子模型、第二向量提取子模型和面部表情预测子模型。
上述脸部特征提取器可以是独立于训练后的人脸驱动模型的,例如,对于脸部特征数据提取过程在模型训练之前完成的情况,脸部特征提取器与人脸驱动模型是相对独立的;上述脸部特征提取器还可以是属于训练后的人脸驱动图像,例如,针对脸部特征数据提取过程与模型训练的过程同步进行的情况,训练后的人脸驱动模型包括脸部特征提取器。
考虑到可能存在利用面部表情预测子模型得到的面部表情预测数据中口型与语音不同步的问题,即口型与音频不对应,存在延迟反应,因此,在对面部表情预测子模型进行模型参数调整的过程中,不仅考虑了基于真实表情标签和预测表情标签得到的面部表情损失,还考虑了预测表情图像帧中的口型与语音帧对应的文字之间的同步损失,最终用于模型参数优化的损失值不仅包括面部表情预测数据与真实面部表情数据之间的第一损失值,还包括口型信息与语音帧对应的文字之间的第二损失值,基于此,如图4所示,在上述图2的基础上,上述待训练模型还包括面部表情渲染器和口型语音同步识别子模型;每次模型训练的具体实现方式还可以包括如下步骤S1044至步骤S1045:
S1044,针对每个视频样本数据:面部表情渲染器基于上述第一面部表情预测数据进行脸部图像渲染,得到人脸预测图像;口型语音同步识别子模型基于上述人脸预测图像和样本语音数据,确定视频样本数据对应的同步子损失值。
上述面部表情渲染器可以是相关技术中的三维人脸渲染器,例如,三维人脸渲染器可以是可微分渲染器tf-mesh-render,还可以是其他面部表情渲染 器;在获取到第一面部表情预测数据,以及获取到形状特征向量和纹理特征向量之后,将第一面部表情预测数据、形状特征向量和纹理特征向量输入到面部表情渲染器,进行脸部图像渲染即可得到人脸预测图像,以便将人脸预测图像作为口型语音同步识别子模型的输入数据进行口型与语音的同步性识别。
上述口型语音同步识别子模型可以是预先训练的用于识别面部口型与语音帧对应的文字是否同步的神经网络模型,例如,SycNet模型或者其他神经网络模型;在获取到人脸预测图像(包含预测的面部口型)后,将人脸预测图像和样本语音数据输入至口型语音同步识别子模型,以对人脸预测图像帧中面部口型与样本语音帧对应的文字之间的同步性进行打分,得到同步子损失值。
其中,上述S1044与上述S1042之间的先后顺序可以互换,本申请对此不做限制。
S1045,基于上述各视频样本数据对应的同步子损失值,确定第二损失值;上述第二损失值用于表征样本用户的口型与语音的延迟程度。
在获取到各视频样本数据对应的同步子损失值之后,对多个同步子损失值进行求和或者加权求和,即可得到第二损失值。
上述S1043,基于上述第一损失值,对上述待训练模型进行参数更新,具体包括:
S10431,基于上述第一损失值和第二损失值,对上述待训练模型进行参数更新。
基于上述第一损失值和所述第二损失值,确定加权损失值;基于该加权损失值,对上述待训练模型中面部表情预测子模型的模型参数进行迭代更新,得到训练后的人脸驱动模型;其中,训练后的人脸驱动模型可以包括口型语音同步识别子模型(用于对基于第二面部表情预测数据渲染得到的人脸渲染图像中的面部口型与目标语音数据中语音帧对应的文字的同步性进行识别, 从而能够基于同步性识别结果,对面部表情预测子模型的精度进行评估),训练后的人脸驱动模型也可以不包括口型语音同步识别子模型。
由于口型语音同步识别子模型主要是用于对面部表情预测数据中口型与样本语音数据的同步性进行识别,从而得到表征面部口型与语音帧延迟程度的第二损失值,进而使得同时基于表征真实表情和预测表情之间的面部表情损失,以及表征面部口型与语音帧之间的同步损失,对面部表情预测子模型的模型参数进行调优,在模型训练过程中,对面部表情预测子模型所输出的预测表情中的口型信息与对应的真实语音帧之间的同步性进行监督,使得面部表情预测子模型输出的面部表情预测数据既能够确保面部表情的真实性,又能够确保面部口型与语音帧的同步性,因此,口型语音同步识别子模型可以仅在模型训练阶段使用,而在模型应用阶段,可以不需要口型语音同步识别子模型,即待训练模型可以包括上述第一向量提取子模型、第二向量提取子模型、面部表情预测子模型、面部表情渲染器和口型语音同步识别子模型,而训练后的人脸驱动模型可以包括上述第一向量提取子模型、第二向量提取子模型、面部表情预测子模型和面部表情渲染器。
在上述图3的基础上,如图5所示,给出了另一种人脸驱动模型训练过程的具体实现原理示意图,具体包括:获取多个样本用户的视频样本数据;每个视频样本数据包括真实人脸图像和样本语音数据;将视频样本数据中的样本语音数据输入至第一向量提取子模型进行文本内容识别,得到第一特征向量;以及,将样本语音数据输入至第二向量提取子模型进行非音素特征识别,得到第二特征向量;将上述第一特征向量和上述第二特征向量输入至待训练的面部表情预测子模型进行面部表情预测,得到第一面部表情预测数据;将视频样本数据中的真实人脸图像输入至脸部特征提取器进行特征提取,得到样本用户的真实面部表情数据、形状特征向量和纹理特征向量;其中,脸部特征数据提取过程可以是在模型训练之前完成的,也可以是与模型训练的过程同步进行的;将第一面部表情预测数据、形状特征向量和纹理特征向量 输入至面部表情渲染器进行脸部图像渲染,得到人脸预测图像;然后,将各视频样本数据对应的人脸预测图像和样本语音数据输入至口型语音同步识别子模型进行口型语音同步性识别,得到视频样本数据对应的同步子损失值;基于各视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;以及,基于各视频样本数据对应的同步子损失值,确定第二损失值;基于上述第一损失值和第二损失值,对待训练模型中面部表情预测子模型的模型参数进行迭代更新,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的人脸驱动模型;其中,训练后的人脸驱动模型包括上述第一向量提取子模型、第二向量提取子模型、面部表情预测子模型和面部表情渲染器。
对于上述同步子损失值的确定过程,上述口型语音同步识别子模型可以包括多层神经网络和损失信息输出网络。
(1)上述多层神经网络将上述人脸预测图像和样本语音数据映射到高维空间,进行口型与语音同步对比,得到语音口型同步得分。
将人脸预测图像帧和对应的样本语音帧一并输入到多层神经网络;多层神经网络基于人脸预测图像帧进行口型特征提取,得到图像帧高维特征向量;以及多层神经网络基于样本语音帧进行语音特征提取,得到语音帧高维特征向量;将图像帧高维特征向量与语音帧高维特征向量进行相关性计算,基于相关性计算结果得到语音口型同步得分。
(2)上述损失信息输出网络基于上述语音口型同步得分,确定上述视频样本数据对应的同步子损失值。
在针对每一对人脸预测图像帧和样本语音帧,确定相应的语音口型同步得分之后,基于各人脸预测图像帧对应的语音口型同步得分,确定视频样本数据对应的同步子损失值。针对上述第一面部表情预测数据的生成过程,上述面部表情预测子模型可以包括向量拼接网络和面部表情识别网络。
(1)上述向量拼接网络对上述第一特征向量和第二特征向量进行拼接处 理,得到目标特征向量。
在获取到各视频样本数据对应的第一特征向量和第二特征向量之后,将第一特征向量和第二特征向量输入至上一轮参数更新后的向量拼接网络,对第一特征向量和第二特征向量进行加权求和处理,向量拼接网络的输出即为目标特征向量。
(2)上述面部表情识别网络基于上述目标特征向量进行面部表情识别,得到第一面部表情预测数据。
在获取到各视频样本数据对应的目标特征向量之后,将目标特征向量输入至上一轮参数更新后的面部表情识别网络,进行面部表情预测,面部表情识别网络的输出即为视频样本数据对应的第一面部表情预测数据。
考虑到在人脸驱动模型的训练过程中,主要是对面部表情预测子模型的模型参数进行迭代更新,而上述第一向量提取子模型、第二向量提取子模型和口型语音同步识别子模型均为预先基于相应的样本集数据训练好的预设神经网络模型;以及上述面部表情渲染器也应是预先构建好的三维人脸渲染器;利用预先训练好的第一向量提取子模型、第二向量提取子模型、口型语音同步识别子模型和面部表情渲染器,输出人脸驱动模型训练过程所需的基础数据,以便基于该基础数据得到面部表情预测子模型的总损失值,进而基于该总损失值对面部表情预测子模型的模型参数进行迭代更新。
其中,上述第一向量提取子模型的训练过程可以参照相关技术中的端到端的语音识别模型的具体训练过程;以及上述面部表情渲染器的构建过程可以参见相关技术中的可微分渲染器的构建过程,在此不再赘述。
针对上述第二向量提取子模型的训练过程,在上述S102,获取N个视频样本数据之前,还包括:
步骤A1,获取第一样本数据集;其中,第一样本数据集可以包括用于通过非音素特征识别进行说话人分类训练的说话人语音数据集。
上述第一样本数据集可以包括用于通过声纹特征识别进行说话人分类训 练的说话人语音数据集,和/或用于通过情绪特征识别进行说话人分类训练的说话人语音数据集。
步骤A2,基于上述第一样本数据集对第一预设神经网络模型进行参数迭代更新,得到上述训练后的第二向量提取子模型。
上述第一预设神经网络模型可以是待训练的说话人表示识别模型,先将第一样本数据集输入至待训练的说话人表示识别模型进行说话人分类,得到说话人分类预测结果,再基于说话人分类预测结果和说话人真实分类信息确定说话人分类损失值,再基于该说话人分类损失值对第一预设神经网络模型的模型参数进行迭代更新,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的第一预设神经网络模型;将训练后的第一预设神经网络模型作为训练后的第二向量提取子模型;其中,训练后的第二向量提取子模型用于对样本语音数据进行非音素特征识别,得到第二特征向量。
针对上述口型语音同步识别子模型的训练过程,在上述S102,获取N个视频样本数据之前,还包括:
步骤B1,获取第二样本数据集;其中,第二样本数据集可以包括用于对口型与语音是否同步进行分类训练的样本数据集,每个样本数据包括一对图像帧和语音帧。
步骤B2,基于上述第二样本数据集对第二预设神经网络模型进行参数迭代更新,得到上述训练后的口型语音同步识别子模型。
上述第二预设神经网络模型可以是待训练的用于识别面部口型与语音帧是否同步的神经网络模型,先将第二样本数据集输入至待训练的第二预设神经网络模型进行二分类,得到口型语音同步分类预测结果,再基于口型语音同步分类预测结果和口型语音同步与否的真实分类信息确定口型语音同步分类损失值,再基于该口型语音同步分类损失值对第二预设神经网络模型的模型参数进行迭代更新,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的第二预设神经网络模型;将训练后的第二预设神经网络模型作 为训练后的口型语音同步识别子模型;其中,训练后的口型语音同步识别子模型用于基于人脸预测图像和样本语音数据,确定视频样本数据对应的同步子损失值。
考虑到在利用人脸驱动模型生成虚拟数字人视频的过程中,为了提高虚拟数字人的播放效果,提升用户观看体验,还可以在虚拟数字人视频中增加更加真实的背景画面,即需要将利用面部表情渲染器进行渲染得到的人脸预测图像与指定背景画面进行融合,因此,上述训练后的人脸驱动模型还可以包括背景合成渲染器,即训练后的人脸驱动模型包括上述第一向量提取子模型、第二向量提取子模型、面部表情预测子模型、面部表情渲染器和背景合成渲染器;其中,背景合成渲染器可以包括神经渲染器Neural Render或者其他三维渲染器;背景合成渲染器可以是在人脸驱动模型的训练之前,基于第三样本数据集(即包含目标背景画面的样本视频数据集)预先构建的,也可以是在人脸驱动模型的训练过程中,基于上述人脸预测图像和目标背景画面实时构建的。
在上述图5的基础上,如图6所示,给出了又一种人脸驱动模型训练过程的具体实现原理示意图,具体包括:
在获取多个样本用户的视频样本数据之前,预先基于样本数据集1得到训练后的第一向量提取子模型;以及预先基于样本数据集2得到训练后的第二向量提取子模型;以及预先基于样本数据集3得到训练后的口型语音同步识别子模型;以及预先基于样本数据集4构建得到面部表情渲染器;然后,基于训练后的第一向量提取子模型、第二向量提取子模型、口型语音同步识别子模型、面部表情渲染器和待训练的面部表情预测子模型,生成待训练模型;然后,基于获取到的多个样本用户的视频样本数据,对待训练的面部表情预测子模型的模型参数进行迭代更新,得到参数迭代更新后的面部表情预测子模型;以及基于目标背景画面视频和面部表情渲染器输出的人脸预测图像,对初始的背景合成渲染器的参数进行迭代优化,得到参数优化后的背景 合成渲染器;基于预先训练后的第一向量提取子模型、第二向量提取子模型、面部表情渲染器、参数迭代更新后的面部表情预测子模型和参数优化后的背景合成渲染器,生成训练后的人脸驱动模型。
本申请实施例中的人脸驱动模型的训练方法,在人脸驱动模型的训练过程中,将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
对应上述图1至图6描述的视频生成方法,基于相同的技术构思,本申请实施例还提供了一种视频生成方法,图7为本申请实施例提供的视频生成方法的流程示意图,图7中的方法能够由设置有视频生成装置的电子设备执行,该电子设备可以是终端设备或者指定服务器,该电子设备中部署有预先训练后的人脸驱动模型;其中,训练后的人脸驱动模型包括训练后的第一向量提取子模型、第二向量提取子模型、面部表情渲染器、面部表情预测子模 型、背景合成渲染器中至少一项。基于本申请实施例提供的模型训练方法训练得到的人脸驱动模型可以应用到任一需要生成虚拟数字人视频的具体应用场景,例如,用于生成虚拟客服的问题解答视频的应用场景,又如,用于生成虚拟歌手的歌曲演唱视频的应用场景,再如,用于生成虚拟主播的产品介绍视频的应用场景。
针对利用训练后的人脸驱动模型生成虚拟数字人视频的具体实现过程,如图7所示,上述视频生成方法至少包括以下步骤:
S702,获取目标语音数据;其中,目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据。
上述目标语音数据可以是直接获取的语音数据(即目标用户的原声语音数据),还可以是基于预设文本内容转换得到的语音数据(即目标用户的合成语音数据);其中,针对无法直接获取目标用户的原声语音数据的情况,先获取目标文本数据;再利用预设文本转语音模型,对目标文本数据进行文本语音转换处理,得到目标用户的合成语音数据;其中,预设文本转语音模型可以是预先基于样本数据集训练得到的;例如,可以是基于文本转语音(Text-to-Speech,TTS)技术的文本转语音模型,利用该文本转语音模型得到合成语音数据中包含目标用户的非音素特征,其中,基于目标文本数据转换得到的合成语音数据能够反映目标用户的声纹特征(即利用TTS技术自动生成包含目标用户的声纹特征的语音帧的一段语音数据),因此,无论是基于原声语音数据,还是基于合成语音数据均能够准确地提取出与个体特征有关的第四特征向量,这样可以不要求必须采集目标用户的原声语音,而是仅基于目标文本数据即可触发生成任一目标用户的虚拟数字人视频,提高了虚拟数字人视频的生成灵活性。
S702,将上述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;其中,训练后的人脸驱动模型可以是基于上述人脸驱动模型的训练方法训练得到的。
上述训练后的人脸驱动模型可以包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;其中,上述人脸驱动处理的具体实现方式有:第一向量提取子模型对上述目标语音数据进行文本内容识别,得到第三特征向量;第二向量提取子模型对上述目标语音数据进行非音素特征识别,得到第四特征向量;面部表情预测子模型基于上述第三特征向量和第四特征向量进行表情预测,得到第二面部表情预测数据;基于上述第二面部表情预测数据进行图像渲染,得到目标虚拟数字人视频。
本申请实施例中,在目标虚拟数字人视频的生成过程,将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频;因此,面部表情预测子模型在进行面部表情预测时,不仅考虑了与个体特征无关的第三特征向量,还考虑了与个体特征有关的第三特征向量,这样能够提高第二面部表情预测数据的预测准确度;并且,如果在面部表情预测子模型的训练过程中考虑口型语音同步损失的话,面部表情预测子模型在进行面部表情预测时,还能够确保第二面部表情预测数据中面部口型与语音帧的同步性。
上述人脸驱动模型还包括面部表情渲染器和背景合成渲染器;具体的,上述基于第二面部表情预测数据进行图像渲染,得到目标虚拟数字人视频,具体包括:面部表情渲染器基于上述第二面部表情预测数据进行脸部图像渲染,得到人脸渲染图像;背景合成渲染器基于上述人脸渲染图像进行背景合成,得到目标虚拟数字人视频。
由于人脸驱动模型中面部表情预测子模型、面部表情渲染器和背景合成渲染器之间均是相互独立的,因此,最终生成的目标虚拟数字人视频中的语音数据和图像数据可以是同一个用户的,也可以是不同用户的,即上述目标虚拟数字人视频可以包括目标用户的语音数据和虚拟数字人图像,或者,上 述目标虚拟数字人视频可以包括目标用户的语音数据和其他用户的虚拟数字人图像;并且目标虚拟数字人视频中的背景图像也可以是根据实际需求进行灵活设置的。
针对预设应用场景,获取该预设应用场景下的目标语音数据;将目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;例如,若预设应用场景为用于生成虚拟客服的问题解答视频的应用场景,则获取基于目标问题解答文本转换得到的目标问题解答语音,将该目标问题解答语音作为目标语音数据;又如,若预设应用场景为用于生成虚拟歌手的歌曲演唱视频的应用场景,则获取基于目标歌曲歌词文本转换得到的目标歌曲演唱语音,将该目标歌曲演唱语音作为目标语音数据;再如,若预设应用场景为用于生成虚拟主播的产品介绍视频的应用场景,则获取基于目标产品介绍文本转换得到的目标产品介绍语音,将该目标产品介绍语音作为目标语音数据。
以上述训练后的人脸驱动模型包括预设文本转语音模型、脸部特征提取器、第一向量提取子模型、第二向量提取子模型、面部表情预测子模型、面部表情渲染器和背景合成渲染器为例;如图8所示,给出了一种虚拟数字人视频生成过程的具体实现原理示意图,具体包括:获取预设应用场景下的目标文本数据;利用预设的TTS技术对目标文本数据进行文本语音转换处理,得到第一目标用户的合成语音数据;将第一目标用户的合成语音数据输入至第一向量提取子模型进行文本内容识别,得到第三特征向量;以及,将第一目标用户的合成语音数据输入至第二向量提取子模型进行非音素特征识别,得到第四特征向量;将上述第三特征向量和上述第四特征向量输入至面部表情预测子模型进行面部表情预测,得到第二面部表情预测数据;将第二目标用户的真实人脸图像输入至脸部特征提取器进行特征提取,得到第二目标用户的形状特征向量和纹理特征向量;其中,第二目标用户可以与第一目标用户相同或不同,脸部特征数据提取过程可以是在虚拟数字人视频生成之前完 成的,也可以是与虚拟数字人视频的生成过程同步进行的;将第二面部表情预测数据、形状特征向量和纹理特征向量输入至面部表情渲染器进行脸部图像渲染,得到目标人脸渲染图像;将目标人脸渲染图像输入至背景合成渲染器进行背景合成,得到目标虚拟数字人视频。
本申请实施例中的视频生成方法,在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频;其中,由于在人脸驱动模型的训练过程中,主要是将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,因此,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
需要说明的是,本申请中该实施例与本申请中上一实施例基于同一发明构思,因此该实施例的具体实施可以参见前述人脸驱动模型的训练方法的实施,重复之处不再赘述。
对应上述图1至图6描述的人脸驱动模型的训练方法,基于相同的技术构思,本申请实施例还提供了一种人脸驱动模型的训练装置,图9为本申请 实施例提供的人脸驱动模型的训练装置的模块组成示意图,该装置用于执行图1至图6描述的人脸驱动模型的训练方法,如图9所示,该装置包括:样本数据获取模块902,用于获取N个视频样本数据;每个所述视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数;模型训练模块904,用于将所述N个视频样本数据输入至待训练模型进行模型迭代训练,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的人脸驱动模型;其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;每次模型训练的具体实现方式有:针对每个所述视频样本数据:所述第一向量提取子模型对所述视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于各所述视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述视频样本数据中的真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
本申请实施例中的人脸驱动模型的训练装置,在人脸驱动模型的训练过程中,将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进 而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
需要说明的是,本申请中关于人脸驱动模型的训练装置的实施例与本申请中关于人脸驱动模型的训练方法的实施例基于同一发明构思,因此该实施例的具体实施可以参见前述对应的人脸驱动模型的训练方法的实施,重复之处不再赘述。
对应上述图7至图8描述的视频生成方法,基于相同的技术构思,本申请实施例还提供了一种视频生成装置,图10为本申请实施例提供的视频生成装置的模块组成示意图,该装置用于执行图7至图8描述的视频生成方法,如图10所示,该装置包括:目标数据获取模块1002,用于获取目标语音数据;所述目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据;视频生成模块1004,用于将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的具体实现方式有:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
本申请实施例中的视频生成装置,在人脸驱动模型(即包含训练后的面 部表情预测子模型)的应用阶段,将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频;其中,由于在人脸驱动模型的训练过程中,主要是将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,因此,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
需要说明的是,本申请中关于人脸驱动模型的训练装置的实施例与本申请中关于人脸驱动模型的训练方法的实施例基于同一发明构思,因此该实施例的具体实施可以参见前述对应的人脸驱动模型的训练方法的实施,重复之处不再赘述。
进一步地,对应上述图1至图8所示的方法,基于相同的技术构思,本申请实施例还提供了一种计算机设备,该设备用于执行上述的人脸驱动模型的训练方法或者视频生成方法,如图11所示。
计算机设备可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上的处理器1101和存储器1102,存储器1102中可以存储有一个或一个以上存储应用程序或数据。其中,存储器1102可以是短暂存储或持久存储。 存储在存储器1102的应用程序可以包括一个或一个以上模块(图示未示出),每个模块可以包括对计算机设备中的一系列计算机可执行指令。更进一步地,处理器1101可以设置为与存储器1102通信,在计算机设备上执行存储器1102中的一系列计算机可执行指令。计算机设备还可以包括一个或一个以上电源1103,一个或一个以上有线或无线网络接口1104,一个或一个以上输入输出接口1105,一个或一个以上键盘1106等。
在一个具体的实施例中,计算机设备包括有存储器,以及一个或一个以上的程序,其中一个或者一个以上程序存储于存储器中,且一个或者一个以上程序可以包括一个或一个以上模块,且每个模块可以包括对计算机设备中的一系列计算机可执行指令,且经配置以由一个或者一个以上处理器执行该一个或者一个以上程序包含用于进行以下计算机可执行指令:获取N个视频样本数据;每个所述视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数;将所述N个视频样本数据输入至待训练模型进行模型迭代训练,得到人脸驱动模型;其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;每次模型训练的具体实现方式有:针对每个所述视频样本数据:所述第一向量提取子模型对所述视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于各所述视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述视频样本数据中的真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
在另一个具体的实施例中,计算机设备包括有存储器,以及一个或一个以上的程序,其中一个或者一个以上程序存储于存储器中,且一个或者一个以上程序可以包括一个或一个以上模块,且每个模块可以包括对计算机设备 中的一系列计算机可执行指令,且经配置以由一个或者一个以上处理器执行该一个或者一个以上程序包含用于进行以下计算机可执行指令:获取目标语音数据;所述目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据;将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的具体实现方式有:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
本申请实施例中的计算机设备,在人脸驱动模型的训练过程中,将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字 人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
需要说明的是,本申请中关于计算机设备的实施例与本申请中关于人脸驱动模型的训练方法的实施例基于同一发明构思,因此该实施例的具体实施可以参见前述对应的人脸驱动模型的训练方法的实施,重复之处不再赘述。
进一步地,对应上述图1至图8所示的方法,基于相同的技术构思,本申请实施例还提供了一种存储介质,用于存储计算机可执行指令,一种具体的实施例中,该存储介质可以为U盘、光盘、硬盘等,该存储介质存储的计算机可执行指令在被处理器执行时,能实现以下流程:获取N个视频样本数据;每个所述视频样本数据包括样本用户的真实人脸图像和样本语音数据,N为大于1的整数;将所述N个视频样本数据输入至待训练模型进行模型迭代训练,得到人脸驱动模型;其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;每次模型训练的具体实现方式有:针对每个所述视频样本数据:所述第一向量提取子模型对所述视频样本数据中样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于各所述视频样本数据对应的第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述视频样本数据中的真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
另一种具体的实施例中,该存储介质可以为U盘、光盘、硬盘等,该存储介质存储的计算机可执行指令在被处理器执行时,能实现以下流程:获取目标语音数据;所述目标语音数据包括目标用户的原声语音数据或者目标用户的合成语音数据;将所述目标语音数据输入至训练后的人脸驱动模型进行 人脸驱动处理,得到目标虚拟数字人视频;其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的具体实现方式有:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
本申请实施例中的存储介质存储的计算机可执行指令在被处理器执行时,在人脸驱动模型的训练过程中,将通过对样本语音数据进行文本内容识别得到的第一特征向量和进行非音素特征识别得到的第二特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据和真实面部表情数据对待训练模型进行参数更新;即不仅从音素维度提取得到与个体特征无关的第一特征向量,还从非音素维度提取得到与个体特征有关的第二特征向量,同时基于第一特征向量和第二特征向量对面部表情预测子模型进行参数更新,使得面部表情预测子模型能够同时学习到与个体特征无关的表情参数和与个体特征有关的表情参数,这样能够提高面部表情预测子模型的模型参数的精度,从而提高面部表情预测子模型的表情预测准确度,进而在人脸驱动模型(即包含训练后的面部表情预测子模型)的应用阶段,同样将通过对目标语音数据进行文本内容识别得到的第三特征向量和进行非音素特征识别得到的第四特征向量,作为面部表情预测子模型的输入数据进行面部表情预测,得到面部表情预测数据;再基于面部表情预测数据进行图像渲染得到虚拟数字人视频,这样能够提高利用训练后的人脸驱动模型生成的虚拟数字人视频的表情真实性。
需要说明的是,本申请中关于存储介质的实施例与本申请中关于人脸驱动模型的训练方法的实施例基于同一发明构思,因此该实施例的具体实施可 以参见前述对应的人脸驱动模型的训练方法的实施,重复之处不再赘述。
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
本领域内的技术人员应明白,本申请实施例可提供为方法、系统或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可读存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图 一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。
本申请实施例可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请的一个或多个实施例,在这些分布式计算环境中,由通过 通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。
本申请中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
以上所述仅为本文件的实施例而已,并不用于限制本文件。对于本领域技术人员来说,本文件可以有各种更改和变化。凡在本文件的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本文件的权利要求范围之内。

Claims (14)

  1. 一种人脸驱动模型的训练方法,所述方法包括:
    获取多个视频样本数据;所述视频样本数据包括样本用户的真实人脸图像和样本语音数据;
    将所述视频样本数据输入至待训练模型进行模型迭代训练,得到人脸驱动模型;
    其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述模型迭代训练中的每次模型训练包括:
    所述第一向量提取子模型对所述样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;
    基于所述第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述真实人脸图像得到的;
    基于所述第一损失值,对所述待训练模型进行参数更新。
  2. 根据权利要求1所述的方法,其中,所述待训练模型还包括面部表情渲染器和口型语音同步识别子模型;所述模型迭代训练中的每次模型训练还包括:
    所述面部表情渲染器基于所述第一面部表情预测数据进行脸部图像渲染,得到人脸预测图像;所述口型语音同步识别子模型基于所述人脸预测图像和所述样本语音数据,确定同步子损失值;
    基于所述同步子损失值,确定第二损失值;所述第二损失值用于表征样本用户的口型与语音的延迟程度;
    所述基于所述第一损失值,对所述待训练模型进行参数更新,包括:基于所述第一损失值和所述第二损失值,对所述待训练模型进行参数更新。
  3. 根据权利要求2所述的方法,其中,所述口型语音同步识别子模型包括多层神经网络和损失信息输出网络;
    所述多层神经网络将所述人脸预测图像和所述样本语音数据映射到高维空间,进行口型与语音同步对比,得到语音口型同步得分;
    所述损失信息输出网络基于所述语音口型同步得分,确定所述同步子损失值。
  4. 根据权利要求1所述的方法,其中,所述面部表情预测子模型包括向量拼接网络和面部表情识别网络;
    所述向量拼接网络对所述第一特征向量和所述第二特征向量进行拼接处理,得到目标特征向量;
    所述面部表情识别网络基于所述目标特征向量进行面部表情识别,得到第一面部表情预测数据。
  5. 根据权利要求1所述的方法,其中,在获取N个视频样本数据之前,还包括:
    获取第一样本数据集;所述第一样本数据集包括用于通过非音素特征识别进行说话人分类训练的说话人语音数据集;
    基于所述第一样本数据集对第一预设神经网络模型进行参数迭代更新,得到训练后的所述第二向量提取子模型。
  6. 根据权利要求2所述的方法,其中,在获取N个视频样本数据之前,还包括:
    获取第二样本数据集;所述第二样本数据集包括用于对口型与语音是否同步进行分类训练的样本数据集,每个样本数据包括一对图像帧和语音帧;
    基于所述第二样本数据集对第二预设神经网络模型进行参数迭代更新, 得到训练后的所述口型语音同步识别子模型。
  7. 根据权利要求1至6任一项所述的方法,其中,所述非音素特征识别包括声纹特征识别、情绪特征识别中至少一项。
  8. 一种视频生成方法,所述方法包括:
    获取目标语音数据;
    将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;
    其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的过程包括:
    所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
  9. 根据权利要求8所述的方法,其中,所述人脸驱动模型还包括面部表情渲染器和背景合成渲染器;所述基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频,包括:
    所述面部表情渲染器基于所述第二面部表情预测数据进行脸部图像渲染,得到人脸渲染图像;所述背景合成渲染器基于所述人脸渲染图像进行背景合成,得到所述目标虚拟数字人视频。
  10. 一种人脸驱动模型的训练装置,所述装置包括:
    样本数据获取模块,用于获取多个视频样本数据;所述视频样本数据包括样本用户的真实人脸图像和样本语音数据;
    模型训练模块,用于将所述视频样本数据输入至待训练模型进行模型迭代训练,直到当前模型训练结果满足预设模型训练结束条件,得到训练后的人脸驱动模型;
    其中,所述待训练模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述模型迭代训练中的每次模型训练包括:所述第一向量提取子模型对所述样本语音数据进行文本内容识别,得到第一特征向量;所述第二向量提取子模型对所述样本语音数据进行非音素特征识别,得到第二特征向量;所述面部表情预测子模型基于所述第一特征向量和所述第二特征向量进行面部表情预测,得到第一面部表情预测数据;基于所述第一面部表情预测数据和真实面部表情数据,确定第一损失值;所述真实面部表情数据是基于所述真实人脸图像得到的;基于所述第一损失值,对所述待训练模型进行参数更新。
  11. 一种视频生成装置,所述装置包括:
    目标数据获取模块,用于获取目标语音数据;
    视频生成模块,用于将所述目标语音数据输入至训练后的人脸驱动模型进行人脸驱动处理,得到目标虚拟数字人视频;
    其中,所述人脸驱动模型包括第一向量提取子模型、第二向量提取子模型和面部表情预测子模型;所述人脸驱动处理的过程包括:所述第一向量提取子模型对所述目标语音数据进行文本内容识别,得到第三特征向量;所述第二向量提取子模型对所述目标语音数据进行非音素特征识别,得到第四特征向量;所述面部表情预测子模型基于所述第三特征向量和所述第四特征向量进行表情预测,得到第二面部表情预测数据;基于所述第二面部表情预测数据进行图像渲染,得到所述目标虚拟数字人视频。
  12. 一种计算机设备,所述设备包括:
    处理器;以及
    被安排成存储计算机可执行指令的存储器,所述可执行指令被配置由所述处理器执行,所述可执行指令包括用于执行如权利要求1至7任一项或者权利要求8至9任一项所述的方法中的步骤。
  13. 一种存储介质,所述存储介质用于存储计算机可执行指令,所述可执行指令使得计算机执行如权利要求1至7任一项或者权利要求8至9任一项所述的方法。
  14. 一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如权利要求1至7任一项或者权利要求8至9任一项所述的方法。
PCT/CN2023/120778 2022-10-09 2023-09-22 人脸驱动模型的训练方法、视频生成方法及装置 WO2024078303A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211226776.6 2022-10-09
CN202211226776.6A CN117935323A (zh) 2022-10-09 2022-10-09 人脸驱动模型的训练方法、视频生成方法及装置

Publications (1)

Publication Number Publication Date
WO2024078303A1 true WO2024078303A1 (zh) 2024-04-18

Family

ID=90668707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/120778 WO2024078303A1 (zh) 2022-10-09 2023-09-22 人脸驱动模型的训练方法、视频生成方法及装置

Country Status (2)

Country Link
CN (1) CN117935323A (zh)
WO (1) WO2024078303A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634413A (zh) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 生成模型和生成3d动画的方法、装置、设备和存储介质
CN113380271A (zh) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 情绪识别方法、系统、设备及介质
CN114419702A (zh) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 数字人生成模型、模型的训练方法以及数字人生成方法
CN114581980A (zh) * 2022-03-03 2022-06-03 北京京东尚科信息技术有限公司 用于生成说话人像视频和训练人脸渲染模型的方法、装置
CN115116109A (zh) * 2022-04-27 2022-09-27 平安科技(深圳)有限公司 虚拟人物说话视频的合成方法、装置、设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634413A (zh) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 生成模型和生成3d动画的方法、装置、设备和存储介质
CN113380271A (zh) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 情绪识别方法、系统、设备及介质
CN114419702A (zh) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 数字人生成模型、模型的训练方法以及数字人生成方法
CN114581980A (zh) * 2022-03-03 2022-06-03 北京京东尚科信息技术有限公司 用于生成说话人像视频和训练人脸渲染模型的方法、装置
CN115116109A (zh) * 2022-04-27 2022-09-27 平安科技(深圳)有限公司 虚拟人物说话视频的合成方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN117935323A (zh) 2024-04-26

Similar Documents

Publication Publication Date Title
US20210256985A1 (en) System and method for creating timbres
KR102346046B1 (ko) 3차원 가상 인물 입모양 변화 제어 방법 및 장치
CN112823380A (zh) 将数字视频中的口形和动作与替代音频匹配
CN110675886B (zh) 音频信号处理方法、装置、电子设备及存储介质
CN113077537B (zh) 一种视频生成方法、存储介质及设备
US20210390945A1 (en) Text-driven video synthesis with phonetic dictionary
CN110446066A (zh) 用于生成视频的方法和装置
CN110264993A (zh) 语音合成方法、装置、设备及计算机可读存储介质
US20230039540A1 (en) Automated pipeline selection for synthesis of audio assets
CN111444379B (zh) 音频的特征向量生成方法及音频片段表示模型的训练方法
JP2020184100A (ja) 情報処理プログラム、情報処理装置、情報処理方法及び学習済モデル生成方法
CN112383721B (zh) 用于生成视频的方法、装置、设备和介质
WO2024078303A1 (zh) 人脸驱动模型的训练方法、视频生成方法及装置
CN116912375A (zh) 面部动画生成方法、装置、电子设备及存储介质
WO2023046016A1 (en) Optimization of lip syncing in natural language translated video
CN112995530A (zh) 视频的生成方法、装置及设备
CN116824650A (zh) 一种目标对象的视频生成方法及相关装置
CN113990295A (zh) 一种视频生成方法和装置
CN113299271B (zh) 语音合成方法、语音交互方法、装置及设备
CN114022597A (zh) 多风格唇形合成方法、装置、设备及存储介质
CN113362849A (zh) 一种语音数据处理方法以及装置
CN113129925B (zh) 一种基于vc模型的嘴部动作驱动模型训练方法及组件
JP4242676B2 (ja) 口形状ライブラリを作成するための分解方法
CN117726727A (zh) 面部驱动方法、装置、电子设备及可读存储介质
CN116310004A (zh) 虚拟人授课动画生成方法、装置、计算机设备和存储介质