CN112396182B - Method for training face driving model and generating face mouth shape animation

Info

Publication number
CN112396182B
CN112396182B
Authority
CN
China
Prior art keywords
data
face
sample
model
voice
Prior art date
Legal status
Active
Application number
CN202110068320.0A
Other languages
Chinese (zh)
Other versions
CN112396182A (en)
Inventor
蒋心为
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110068320.0A priority Critical patent/CN112396182B/en
Publication of CN112396182A publication Critical patent/CN112396182A/en
Application granted granted Critical
Publication of CN112396182B publication Critical patent/CN112396182B/en

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T13/00 Animation
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                            • G06V40/168 Feature extraction; Face representation
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
                        • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
                    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 Training
                    • G10L15/08 Speech classification or search
                        • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a method for training a face driving model and generating a facial mouth shape animation, and belongs to the technical field of computers and the Internet. The method comprises the following steps: acquiring sample voice data and sample video data while a sample object speaks; performing feature extraction on the sample voice data to obtain voice feature data of the sample voice data; performing feature extraction on the sample video data to obtain facial feature data of the sample video data; using the facial feature data as label information of the voice feature data to generate a training sample for the face driving model; and training the face driving model with the training sample. The application provides a way of driving a virtual face model based on a face driving model, which improves the driving efficiency of the virtual face model; because the influence of context on the facial mouth shape animation is considered during model training, the accuracy of the training result and the realism of the training effect of the face driving model are ensured.

Description

Method for training face driving model and generating face mouth shape animation
Technical Field
The application relates to the technical field of computers and internet, in particular to a method for training a face driving model and generating a face mouth shape animation.
Background
At present, with the continuous development of animation technology, the requirements on how closely the facial expression of a face model matches the accompanying voice information are gradually increasing.
In the related art, after acquiring the voice information, the computer device analyzes the voice information, determines each pronunciation in it, obtains the animation parameters corresponding to each pronunciation from a database, and drives the face model according to those animation parameters, so that the expression and movement of the face model are visually consistent with the voice information. The database stores the correspondence between each pronunciation and its animation parameters.
However, in the related art, the correspondence between each pronunciation and the animation parameters is a fixed relationship that must be set manually in advance, and it is difficult to set such a correspondence accurately and comprehensively.
Disclosure of Invention
The embodiment of the application provides a method for training a face driving model and generating a facial mouth shape animation, which can improve the driving efficiency of a virtual face model. The technical solution includes the following.
According to an aspect of an embodiment of the present application, there is provided a method for training a face-driven model, the method including:
acquiring sample voice data and sample video data when a sample object speaks;
performing feature extraction on the sample voice data to obtain voice feature data of the sample voice data, wherein the voice feature data is used for indicating pronunciation features of the sample object;
performing feature extraction on the sample video data to obtain facial feature data of the sample video data, wherein the facial feature data are used for indicating facial expression features of the sample object;
using the face feature data as label information of the voice feature data to generate a training sample of the face driving model;
and training the face driving model by adopting the training sample, wherein the face driving model is used for generating face characteristic data for controlling a virtual face model to make a face mouth shape animation matched with the voice data to be played based on the voice data to be played.
According to an aspect of the embodiments of the present application, there is provided a method for generating a facial mouth shape animation, the method including:
acquiring voice data to be played;
performing feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played, wherein the voice feature data is used for indicating pronunciation features;
generating facial feature data according to the voice feature data through a facial driving model, wherein the facial feature data are used for indicating facial expression features;
and controlling a virtual face model to make face mouth shape animation matched with the voice data to be played based on the face feature data.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a face-driven model, the apparatus including:
the data acquisition module is used for acquiring sample voice data and sample video data when a sample object speaks;
the first extraction module is used for extracting the characteristics of the sample voice data to obtain the voice characteristic data of the sample voice data, and the voice characteristic data is used for indicating the pronunciation characteristics of the sample object;
the second extraction module is used for performing feature extraction on the sample video data to obtain facial feature data of the sample video data, wherein the facial feature data are used for indicating facial expression features of the sample object;
the sample generation module is used for generating a training sample of the face driving model by taking the face characteristic data as the label information of the voice characteristic data;
and the model training module is used for training the face driving model by adopting the training sample, and the face driving model is used for generating face characteristic data for controlling the virtual face model to make face mouth shape animation matched with the voice data to be played based on the voice data to be played.
According to an aspect of the embodiments of the present application, there is provided an apparatus for generating a facial mouth shape animation, the apparatus including:
the voice acquisition module is used for acquiring voice data to be played;
the feature extraction module is used for extracting features of the voice data to be played to obtain voice feature data of the voice data to be played, and the voice feature data is used for indicating pronunciation features;
the data generation module is used for generating facial feature data according to the voice feature data through a face driving model, and the facial feature data are used for indicating facial expression features;
and the model control module is used for controlling the virtual face model to make face mouth shape animation matched with the voice data to be played based on the face characteristic data.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the training method of the above face-driven model or to implement the generation method of the above face-mouth animation.
Optionally, the computer device is a terminal or a server.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a code set, or a set of instructions stored therein, the at least one instruction, the at least one program, the code set, or the set of instructions being loaded and executed by a processor to implement the above-mentioned training method of the face driving model, or to implement the above-mentioned generation method of the face mouth shape animation.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the training method of the face driving model or realize the generation method of the face mouth shape animation.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps that a training sample for a face driving model is generated through sample voice data and sample video data when a sample object speaks, the face driving model is used for generating face characteristic data used for controlling a virtual face model to make face mouth shape animation matched with the voice data to be played based on the voice data to be played, namely, the method provides a mode for driving the virtual face model based on the face driving model, the problem that the relation between each pronunciation and animation parameters cannot be accurately and comprehensively preset in the related technology is avoided, the face driving model directly generates the face characteristic data matched with the voice data according to the voice data, and the driving efficiency and the accuracy rate of the virtual face model are improved; moreover, the face driving model is trained through sample voice data and sample video data acquired in the speaking process of the sample object, so that the training sample of the face driving model is acquired through the whole speaking process, the influence of the context on the face mouth shape animation is considered during model training, the accuracy of the training result of the face driving model is ensured, and the reality of the training effect of the face driving model is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a training system for a face driven model provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a system for generating a facial mouth animation according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for training a face-driven model according to an embodiment of the present application;
FIG. 4 illustrates a schematic diagram of a data alignment approach;
FIG. 5 illustrates a diagram of the advantages of a canonical term constraint model;
FIG. 6 is a flow chart of a method for generating a facial mouth animation according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating an acquisition manner of voice data to be played;
FIG. 8 is a diagram illustrating a driving manner of a virtual face model;
FIG. 9 is a block diagram of a training apparatus for a face driven model provided in an embodiment of the present application;
FIG. 10 is a block diagram of a training apparatus for a face-driven model according to another embodiment of the present application;
FIG. 11 is a block diagram of an apparatus for generating a facial mouth animation according to an embodiment of the present application;
fig. 12 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Please refer to fig. 1, which illustrates a schematic diagram of a training system of a face-driven model according to an embodiment of the present application. The training system of the face-driven model can comprise: a sample terminal 10 and a model training apparatus 20.
The sample terminal 10 is used to provide training data to the model training apparatus 20. The training data may include sample voice data and sample video data. The sample voice data refers to voice data acquired while a sample object speaks, and the sample video data refers to video data acquired while the sample object speaks; the two correspond to each other, that is, the voice data and the video data are acquired simultaneously while the sample object speaks. The sample object may be any object, such as a human, an animal, an intelligent robot, and the like.
In a possible embodiment, the sample terminal 10 may obtain the training data by recording. Optionally, the sample terminal 10 may be any electronic device capable of capturing voice data and video data simultaneously; alternatively, the sample terminal 10 may be a combination of any electronic device capable of capturing voice data and any electronic device capable of capturing video data. Illustratively, the sample terminal 10 includes a helmet-mounted expression capturer for acquiring video data and a microphone for acquiring audio data. When acquiring the training data, the sample terminal 10 uses an existing real-time digital human driving pipeline: the helmet-mounted expression capturer records, in real time, the facial animation data of the digital human driven by an actor as the video data, the microphone records the voice spoken by the actor during the performance as the audio data, and the video data and the audio data are then aligned in time to obtain the training data. Optionally, the sample terminal 10 may turn on the voice capturing function and the video capturing function to acquire training data only while the sample object is speaking.
In another possible embodiment, the sample terminal 10 may obtain the training data from a network environment, for example through web crawling. Optionally, the sample terminal 10 may be any electronic device having an information capturing function, and may search the network environment for training data at regular intervals.
The model training device 20 is used to train the face-driven model. Optionally, the model training device 20 may be a computer device, such as a server, a PC, or another electronic device. In the embodiment of the present application, the model training device 20 may obtain a trained face-driven model from the training data provided by the sample terminal 10.
Alternatively, the sample terminal 10 and the model training apparatus 20 may communicate with each other via a network.
Referring to fig. 2, a schematic diagram of a system for generating a facial mouth shape animation according to an embodiment of the present application is shown. The generation system of the face mouth shape animation can comprise: a user terminal 30 and a server 40.
The user terminal 30 is used to present a facial mouth-shape animation to the user. Alternatively, the user terminal 30 may be an electronic device such as a mobile phone, a tablet Computer, a game console, an electronic book reader, a multimedia player, a wearable device, a PC (Personal Computer), and the like. In the embodiment of the present application, a client of the application program may be installed in the user terminal 30. Alternatively, the application may be any application capable of playing facial animation, such as a video application, a gaming application, an information application, a shopping application, a reading application, and the like. The application program may be an application program that needs to be downloaded and installed, or may be an application program that is to be used on demand, which is not limited in this embodiment of the application.
The server 40 is used to provide background services for clients of applications in the user terminal 30. For example, the server 40 may be a backend server for the application described above. The server 40 may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center. Optionally, the server 40 provides background services for applications in multiple user terminals 30 simultaneously. In the embodiment of the present application, a face-driven model may be provided in the server 40, and the face-driven model is configured to generate, based on the voice data to be played, face feature data for controlling the virtual face model to make a facial mouth-shape animation matching the voice data to be played. Alternatively, the server 40 may obtain the face driven model through the model training device 20 in the embodiment of fig. 1. It should be noted that, in the embodiment of the present application, the server 40 and the model training device 20 may be the same computer device or different computer devices, and the embodiment of the present application is not limited thereto.
Of course, in practical applications, the face driving model may be directly installed in the user terminal 30, which is not limited in the embodiment of the present application.
The technical solution of the present application will be described in detail with reference to several embodiments.
Please refer to fig. 3, which illustrates a flowchart of a training method of a face-driven model according to an embodiment of the present application. The method can be applied to a computer device, and the execution subject of each step can be the model training device 20 in the training system of the face-driven model shown in fig. 1. The method can include the following steps (301-305).
Step 301, sample voice data and sample video data of a sample object during speaking are obtained.
The sample object refers to any object capable of speaking and expressing, such as a human, an animal, an artificial intelligent robot and the like. Sample speech data refers to speech data acquired while a sample object is speaking. Sample video data refers to video data acquired while a sample object is speaking. The sample voice data and the sample video data have a corresponding relationship, that is, the sample voice data and the sample video data are data obtained when the same sample object speaks.
Optionally, the sample voice data and the sample video data may be used to generate training samples for the face-driven model. In the embodiment of the present application, before training the face-driven model, the model training device obtains sample voice data and sample video data of the sample object during speaking. There may be one or more sample objects; that is, the sample voice data and sample video data may contain data from different sample objects.
In one possible embodiment, the sample voice data and the sample video data may be obtained by recording. For example, voice data and video data are recorded simultaneously by a particular device while a sample subject is speaking. Alternatively, the model training apparatus may acquire the sample voice data and the sample video data through a sample terminal having a voice data and video data collection function.
In another possible embodiment, the sample voice data and the sample video data are obtained from a search in a network environment. For example, sample voice data and sample video data of a sample object are obtained from a network environment through a web crawler technology. Alternatively, the model training apparatus may acquire the sample voice data and the sample video data through a sample terminal having an information capturing function.
Step 302, performing feature extraction on the sample voice data to obtain voice feature data of the sample voice data.
The speech feature data is used to indicate pronunciation features of the sample object. The pronunciation characteristics may include phonemes and pronunciation speed, among others. In this embodiment of the application, after obtaining the sample voice data, the model training device performs feature extraction on the sample voice data, so as to obtain voice feature data of the sample voice data.
Optionally, in this embodiment of the present application, the model training device may perform feature extraction after performing framing processing on the sample speech data. The voice feature data may include a vocal tract frequency feature and a pronunciation speed feature. Optionally, the step 302 includes the following steps.
3021. And performing framing processing on the sample voice data to obtain a plurality of sample voice data frames.
In the embodiment of the application, in order to ensure the accuracy of the voice feature data extracted for the sample voice data, the model training device performs framing processing on the sample voice data after acquiring the sample voice data, so as to obtain a plurality of sample voice data frames. The single frame duration of different sample voice data frames may be the same or different.
In one possible implementation, the model training device performs a random framing process on the sample speech data. Optionally, after obtaining the sample voice data, the model training device randomly performs framing processing on the sample voice data, so as to obtain a plurality of sample voice data frames.
In another possible implementation, the model training device frames the sample speech data according to a set number of frames. Optionally, after obtaining the sample voice data, the model training device obtains a set frame number of the sample voice data, and performs framing processing on the sample voice data based on the set frame number, thereby obtaining a plurality of sample voice data frames.
In yet another possible implementation, the model training device frames the sample speech data according to a single frame duration and an interval duration. Optionally, after obtaining the sample voice data, the model training device obtains a single frame duration and an interval duration of the sample voice data frame. The single-frame duration refers to the duration of a sample voice data frame, the interval duration refers to the time interval between the starting moments of two adjacent sample voice data frames, and the interval duration is smaller than the single-frame duration. Further, after the single frame duration and the interval duration are obtained, the sample voice data is subjected to framing processing according to the single frame duration and the interval duration, and a plurality of sample voice data frames are obtained. It should be noted that, in this case, overlapping contents are included between adjacent sample voice data frames. For example, if the duration of a single frame is 5s and the duration of an interval is 2s, the start time of the first sample voice data frame is 0s, the end time is 5s, the start time of the second sample voice data frame is 2s, the end time is 7s, and so on.
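As a minimal sketch of the overlapping framing described above (assuming the sample voice data has already been loaded as a 1-D NumPy waveform; the function name and parameters are illustrative, not part of the patent):

```python
import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int,
                      frame_duration: float, hop_duration: float) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames.

    frame_duration is the single-frame duration; hop_duration is the interval
    duration between the start times of two adjacent frames and is smaller
    than frame_duration, so neighbouring frames share overlapping content.
    """
    frame_len = int(frame_duration * sample_rate)
    hop_len = int(hop_duration * sample_rate)
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, hop_len)]
    return np.stack(frames) if frames else np.empty((0, frame_len))

# With frame_duration=5.0 and hop_duration=2.0 (the example above), frame 0
# covers 0 s to 5 s, frame 1 covers 2 s to 7 s, and so on.
```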
3022. And extracting the characteristics of the sample voice data frame to obtain the sound channel frequency characteristics of the sample voice data frame.
In this embodiment of the present application, after obtaining the plurality of sample voice data frames, the model training device performs feature extraction on each sample voice data frame to obtain its vocal tract frequency feature. Illustratively, the vocal tract frequency feature is an MFCC (Mel-Frequency Cepstral Coefficient) feature.
3023. And acquiring the pronunciation speed characteristic of the sample voice data frame based on the sound channel frequency characteristic of the sample voice data frame.
In the embodiment of the present application, after obtaining the vocal tract frequency features, the model training device obtains the pronunciation speed features of the sample voice data frames based on the vocal tract frequency features of the sample voice data frames. The pronunciation speed feature may include a first speed feature and a second speed feature.
Optionally, after obtaining the vocal tract frequency features, the model training device obtains the difference between adjacent vocal tract frequency features and takes it as the first speed feature; it then obtains the difference between adjacent first differences and takes it as the second speed feature. Adjacent vocal tract frequency features are those corresponding to voice data that are closest in time. For example, if the vocal tract frequency feature is an MFCC feature, the first speed feature is the first-order MFCC feature and the second speed feature is the second-order MFCC feature.
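For illustration only, the vocal tract frequency feature and the two pronunciation speed features could be computed with an off-the-shelf audio library such as librosa; note that librosa's delta is a smoothed difference rather than the plain adjacent-frame difference described above, and the sampling rate and number of coefficients are assumptions:

```python
import librosa
import numpy as np

def extract_speech_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Per-frame MFCCs plus first- and second-order dynamic features.

    The MFCCs stand in for the vocal tract frequency feature; the first- and
    second-order deltas play the role of the first and second speed features.
    """
    waveform, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    delta1 = librosa.feature.delta(mfcc, order=1)                  # change between adjacent frames
    delta2 = librosa.feature.delta(mfcc, order=2)                  # change of the change
    return np.concatenate([mfcc, delta1, delta2], axis=0).T        # (T, 3 * n_mfcc)
```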
Step 303, performing feature extraction on the sample video data to obtain facial feature data of the sample video data.
The facial feature data is indicative of facial expressive features of the sample object. Optionally, from the facial feature data, a current facial expression of the sample object can be determined. In this embodiment of the application, after the model training device acquires the sample video data, feature extraction is performed on the sample video data to obtain facial feature data of the sample video data.
Optionally, in this embodiment of the present application, the model training device may perform feature extraction after performing framing processing on the sample video data. Optionally, the step 303 includes the following steps:
3031. acquiring a sampling frequency for sample video data;
3032. sampling the sample video data based on the sampling frequency to obtain a plurality of image frames corresponding to the sample video data;
3033. and respectively carrying out feature extraction on the plurality of image frames to obtain face feature data of the sample video data.
In the embodiment of the application, when the model training device performs feature extraction on the sample video data, it obtains the sampling frequency for the sample video data, samples the sample video data based on that sampling frequency to obtain a plurality of image frames corresponding to the sample video data, and then performs image processing and feature extraction on each of the image frames to obtain the facial feature data of the sample video data.
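A hedged sketch of the sampling step using OpenCV to decode the video; how the facial feature data (for example facial landmarks or expression coefficients, which the text does not specify) is then extracted from each sampled image frame is left out:

```python
import cv2

def sample_video_frames(video_path: str, sampling_hz: float):
    """Decode the sample video and keep image frames at the given sampling
    frequency, together with the playback timestamp of each kept frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, round(fps / sampling_hz))   # keep every `step`-th decoded frame
    frames, timestamps = [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
            timestamps.append(index / fps)    # time of this frame in seconds
        index += 1
    cap.release()
    return frames, timestamps
```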
And step 304, generating a training sample of the face driving model by taking the face characteristic data as the label information of the voice characteristic data.
In this embodiment of the application, after obtaining the above-mentioned face feature data and voice feature data, the model training device uses the face feature data as the label information of the voice feature data, and further generates a training sample of the face-driven model.
Optionally, the sample voice data includes a plurality of sample voice data frames, and the facial feature data of the sample video data includes facial feature data corresponding to each of the plurality of image frames. When the model training device generates a training sample, it needs to align a sample voice data frame and a sample video data frame. Optionally, the step 304 includes the following steps:
3041. acquiring a playing time period corresponding to the sample voice data frame;
3042. determining n image frames corresponding to the playing time period based on the playing time period, wherein n is a positive integer;
3043. and taking the face feature data corresponding to the n image frames as the label information of the sample voice data frame to generate a training sample.
In this embodiment of the application, after obtaining the sample voice data frames and the image frames, the model training device obtains the playing time period corresponding to each sample voice data frame, determines the n image frames falling within that playing time period, and uses the facial feature data corresponding to those n image frames as the label information of the sample voice data frame to generate a training sample. Illustratively, referring to fig. 4, while a sample object is speaking, the model training device obtains sample voice data and sample video data of the sample object, performs feature extraction on each of them to obtain voice feature data and facial feature data, and aligns the two in time. In addition, after aligning the voice feature data and the facial feature data, the model training device may drive the virtual face model with the aligned data and verify, through the driving effect, whether the alignment is accurate.
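A small illustrative sketch of this alignment, assuming the sample voice data frames were produced with a fixed single-frame duration and interval duration and that each sampled image frame carries a timestamp and its extracted facial feature data; all names are hypothetical:

```python
def align_labels(speech_frames, frame_duration, hop_duration,
                 image_timestamps, face_features):
    """Label each sample voice data frame with the facial feature data of the
    n image frames whose timestamps fall inside its playing time period."""
    training_samples = []
    for i, speech_frame in enumerate(speech_frames):
        start = i * hop_duration                 # playing time period of frame i
        end = start + frame_duration
        labels = [face_features[j]
                  for j, t in enumerate(image_timestamps) if start <= t < end]
        if labels:                               # skip frames with no matching image frame
            training_samples.append((speech_frame, labels))
    return training_samples
```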
Step 305, training the face driving model by using the training sample.
The face driving model is used for generating face characteristic data used for controlling the virtual face model to make face mouth shape animation matched with the voice data to be played based on the voice data to be played.
In the embodiment of the application, after the model training device obtains the training sample, the training sample is adopted to train the face driving model, and then the trained face driving model is obtained.
To sum up, in the technical solution provided in the embodiment of the present application, a training sample for the face driving model is generated from the sample voice data and sample video data acquired while a sample object speaks, and the face driving model generates, based on voice data to be played, facial feature data for controlling the virtual face model to make a facial mouth shape animation matched with that voice data. In other words, the application provides a way of driving the virtual face model based on a face driving model, which avoids the problem in the related art that the relationship between each pronunciation and the animation parameters cannot be preset accurately and comprehensively by hand; the face driving model generates facial feature data matched with the voice data directly from the voice data, which improves the driving efficiency and accuracy of the virtual face model. Moreover, because the face driving model is trained with sample voice data and sample video data acquired over the whole speaking process of the sample object, the influence of context on the facial mouth shape animation is taken into account during model training, which ensures the accuracy of the training result and improves the realism of the trained face driving model.
Next, a training process of the face-driven model will be described.
In an exemplary embodiment, the above step 305 includes the following steps:
3051. processing the voice characteristic data through the face driving model to obtain face characteristic data output by the model;
3052. determining a regular term value and a loss function value of the face driving model based on the face characteristic data and the label information output by the model;
3053. and adjusting parameters of the face driving model based on the regular term value and the loss function value until the regular term meets the first condition and the loss function meets the second condition, and stopping training the face driving model.
The regular term is used for measuring the authenticity of the face driving effect corresponding to the output result of the face driving model, and the loss function is used for measuring the accuracy of the output result of the face driving model. Optionally, the model training device may determine whether the face-driven model is trained through the regularization term and the loss function during the model training process.
In this embodiment of the application, after obtaining the training sample, the model training device inputs the voice feature data in the training sample to the face driving model, processes the voice feature data through the face driving model to obtain face feature data output by the model, and further determines a regular term and a loss function of the face driving model based on the face feature data and the label information output by the model. Optionally, the sample speech data includes a plurality of sample speech data frames.
Illustratively, the regularization term M of the face-driven model is:
$$M = \frac{1}{N} \sum_{i=1}^{N} \left\| d\left[q_i(x)\right] - d\left[o_i(x)\right] \right\|^2$$
the loss function L of the face-driven model is:
$$L = \frac{1}{N} \sum_{i=1}^{N} \left\| q_i(x) - o_i(x) \right\|^2$$
where N denotes the number of individual training samples, q denotes the facial feature data output by the face driving model, o denotes the label information of the voice feature data, d denotes the variation between adjacent data, and x denotes the input voice feature data.
Optionally, in this embodiment of the application, when obtaining the regular term value, the model training device obtains the facial feature data output by the model and the label information corresponding to each sample voice data frame, and determines the first variation value d[q_i(x)] and the second variation value d[o_i(x)] for adjacent sample voice data frames. The first variation value is the change in the facial feature data output by the model, and the second variation value is the change in the label information. The model training device then substitutes the first and second variation values into the regular term formula to determine the regular term value of the face-driven model.
Optionally, in this embodiment of the application, when obtaining the loss function value, the model training device obtains the facial feature data output by the model and the corresponding label information for each sample voice data frame, substitutes them into the loss function formula, and determines the loss function value of the face-driven model.
In this embodiment of the application, after obtaining the regular term value and the loss function value, the model training device adjusts the parameters of the face driving model based on them, and stops training the face driving model once the regular term satisfies the first condition and the loss function satisfies the second condition. The first condition is that the regular term converges, and the second condition is that the loss function converges. Optionally, the model training device may determine that the regular term satisfies the first condition when the regular term value is minimal and that the loss function satisfies the second condition when the loss function value is minimal; alternatively, it may determine that the regular term satisfies the first condition when the regular term value tends toward a fixed value and that the loss function satisfies the second condition when the loss function value tends toward a fixed value.
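A minimal PyTorch training-step sketch consistent with the description above, combining a mean-squared-error loss on the output facial feature data with a regular term on the changes between adjacent frames; the exact formulas and the weighting between the two terms are assumptions rather than values taken from the patent:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, speech_feats, labels, reg_weight=1.0):
    """One optimisation step over N consecutive frames of one utterance.

    speech_feats: model input for frames 1..N, in playback order.
    labels:       (N, D_out) facial feature label information o_i(x).
    reg_weight:   assumed weighting between the loss function and the regular term.
    """
    optimizer.zero_grad()
    preds = model(speech_feats)                  # q_i(x): facial feature data output by the model
    loss = F.mse_loss(preds, labels)             # loss function: accuracy of the output
    d_pred = preds[1:] - preds[:-1]              # first variation value d[q_i(x)]
    d_label = labels[1:] - labels[:-1]           # second variation value d[o_i(x)]
    reg = F.mse_loss(d_pred, d_label)            # regular term: realism of the motion
    (loss + reg_weight * reg).backward()
    optimizer.step()
    return loss.item(), reg.item()
```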
Illustratively, the structure of the above face-driven model is briefly described in conjunction with the following table, in which the face-driven model includes three convolutional neural network layers and three fully-connected layers.
(Table: layer structure of the face-driven model, consisting of three convolutional neural network layers followed by three fully-connected layers.)
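Purely as an illustration of a network with three convolutional layers followed by three fully-connected layers, a PyTorch sketch is given below; the layer widths, kernel sizes and input/output dimensions are assumed, since the original table is not reproduced here:

```python
import torch
import torch.nn as nn

class FaceDrivingModel(nn.Module):
    """Three 1-D convolution layers followed by three fully-connected layers,
    mapping a window of speech feature frames to one facial feature vector.
    All layer sizes are illustrative."""

    def __init__(self, in_dim: int = 39, out_dim: int = 51, window: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(128 * window, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, in_dim) speech feature frames
        h = self.conv(x.transpose(1, 2))   # -> (batch, 128, window)
        return self.fc(h.flatten(1))       # -> (batch, out_dim) facial feature data
```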
It should be noted that, in the embodiment of the present application, the regular term is used to constrain the training process of the face driving model, so that the driving effect corresponding to the data output by the face driving model can be better and more real. Taking a face model drive as an example, referring to fig. 5, a real drive effect corresponding to certain voice feature data is a first virtual face 51, based on the voice feature data, a drive effect of a face drive model constrained by a regular term is a second virtual face 52, and a drive effect of a face drive model not constrained by the regular term is a third virtual face 53, as can be seen from fig. 5, the second virtual face 52 is closer to the first virtual face 51.
Referring to fig. 6, a flowchart of a method for generating a facial mouth shape animation according to an embodiment of the present application is shown. The method can be applied to a computer device, and the execution subject of each step can be the user terminal 30 in the generation system of the face mouth shape animation shown in fig. 2. The method can include the following steps (601-604).
Step 601, acquiring voice data to be played.
The voice data to be played refers to voice data waiting to be played on the virtual face model. In the embodiment of the application, when the user terminal drives the virtual face model, the voice data to be played for the virtual face model may be acquired first.
In one possible embodiment, the voice data to be played may be voice data input by a user. Illustratively, after acquiring voice data input by a user, the user terminal drives the virtual face model according to the voice data input by the user to increase interactivity.
In another possible implementation, the voice data to be played may be response voice data for target voice information, where the target voice information may be voice information acquired by the user terminal from a real environment, such as a speaking voice of the user. Illustratively, with reference to fig. 7, after acquiring the speech sound of the user, the user terminal performs speech detection on the speech sound, and sends a speech detection result to a server corresponding to the user terminal, and the server performs speech recognition on the speech detection result, acquires text content corresponding to the speech sound, performs natural language processing on the text content, generates a response text for the text content, and converts the response text into response speech data. And then, the user terminal drives the virtual human face model by taking the response voice data as the voice data to be played.
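A schematic sketch of that server-side response pipeline; the three callables stand in for whatever speech recognition, natural language processing and text-to-speech components are actually used and are purely hypothetical:

```python
def build_reply_speech(user_audio: bytes,
                       speech_to_text,      # hypothetical ASR component
                       generate_reply,      # hypothetical NLP component
                       text_to_speech):     # hypothetical TTS component
    """Turn the user's speaking voice into response voice data to be played."""
    text_content = speech_to_text(user_audio)   # speech recognition: speech to text content
    reply_text = generate_reply(text_content)   # natural language processing: response text
    return text_to_speech(reply_text)           # response text converted into response voice data
```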
Step 602, performing feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played.
The speech feature data is used to indicate pronunciation features. The pronunciation characteristics may include phonemes and pronunciation speed, among others. In the embodiment of the application, after the user terminal obtains the voice data to be played, the user terminal performs feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played.
Optionally, in this embodiment of the application, the user terminal may perform feature extraction after performing framing processing on the voice data to be played. The voice feature data may include a vocal tract frequency feature and a pronunciation speed feature. Optionally, the step 602 includes the following steps.
6021. And performing framing processing on the voice data to be played to obtain a plurality of voice data frames.
In the embodiment of the application, in order to ensure the accuracy of the voice feature data extracted from the voice data to be played, the user terminal performs framing processing on the voice data to be played after acquiring it, thereby obtaining a plurality of voice data frames. The single-frame duration of different voice data frames may be the same or different.
In a possible implementation manner, the user terminal performs random framing processing on the voice data to be played. Optionally, after acquiring the voice data to be played, the user terminal randomly frames the voice data to be played, and further acquires a plurality of voice data frames.
In another possible implementation manner, the user terminal performs framing processing on the voice data to be played according to the set frame number. Optionally, after acquiring the voice data to be played, the user terminal acquires a set frame number of the voice data to be played, and performs framing processing on the voice data to be played based on the set frame number, thereby acquiring a plurality of voice data frames.
In yet another possible implementation, the user terminal frames the voice data to be played according to the single-frame duration and the interval duration. Optionally, after acquiring the voice data to be played, the user terminal obtains the single-frame duration and the interval duration of the voice data frames. The single-frame duration refers to the duration of a voice data frame, the interval duration refers to the time interval between the starting moments of two adjacent voice data frames, and the interval duration is smaller than the single-frame duration. Further, after obtaining the single-frame duration and the interval duration, the user terminal frames the voice data to be played accordingly, thereby obtaining a plurality of voice data frames.
6022. And performing feature extraction on the voice data frame to acquire the sound channel frequency feature of the voice data frame.
In the embodiment of the present application, after acquiring the plurality of voice data frames, the user terminal performs feature extraction on each voice data frame to obtain its vocal tract frequency feature. Illustratively, the vocal tract frequency feature is an MFCC (Mel-Frequency Cepstral Coefficient) feature.
6023. And acquiring the pronunciation speed characteristic of the voice data frame based on the sound channel frequency characteristic of the voice data frame.
In the embodiment of the present application, after acquiring the vocal tract frequency feature, the user terminal acquires a pronunciation speed feature of the voice data frame based on the vocal tract frequency feature of the voice data frame. The pronunciation speed feature may include a first speed feature and a second speed feature.
Optionally, after obtaining the vocal tract frequency features, the user terminal obtains the difference between adjacent vocal tract frequency features and takes it as the first speed feature; it then obtains the difference between adjacent first differences and takes it as the second speed feature. Adjacent vocal tract frequency features are those corresponding to voice data that are closest in time. For example, if the vocal tract frequency feature is an MFCC feature, the first speed feature is the first-order MFCC feature and the second speed feature is the second-order MFCC feature.
Step 603, generating face feature data according to the voice feature data through the face driving model.
The facial feature data is indicative of facial expression features. The face-driven model is a deep learning model trained on a large number of training samples. Each training sample comprises voice feature data and facial feature data that correspond to each other; that is, the facial feature data indicates the facial expression features of the sample object when it utters the sound corresponding to the voice feature data.
In the embodiment of the application, after the user terminal acquires the voice feature data, the face driving model generates face feature data according to the voice feature data. The face driving model may be set in a user terminal, or may be set in a server corresponding to the user terminal, which is not limited in this embodiment of the application.
And step 604, controlling the virtual face model to make a face mouth shape animation matched with the voice data to be played based on the face feature data.
In the embodiment of the application, after the user terminal obtains the face feature data, the virtual face model is controlled to make the face mouth shape animation matched with the voice data to be played based on the face feature data.
It should be noted that the facial feature data indicate facial expression features and cannot directly drive the virtual face model; the facial feature data therefore need to be converted into facial mouth shape animation data that can directly drive the virtual face model. Optionally, the step 604 includes the following steps:
6041. converting the face feature data into face mouth shape animation data, wherein the face mouth shape animation data is used for controlling the face mouth shape animation of the virtual face model;
6042. and controlling the virtual face model to make face mouth shape animation matched with the voice data to be played by adopting the face mouth shape animation data.
Exemplarily, referring to fig. 8 in combination, after acquiring the voice data to be played, the user terminal performs feature extraction on the voice data to be played, acquires voice feature data of the voice data to be played, determines face feature data according to the voice feature data through the face driving model, converts the face feature data into face mouth animation data, and drives the virtual face by using the face mouth animation data, so that the virtual face model makes a face mouth animation matched with the voice data to be played.
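The following inference sketch ties the steps together, assuming speech features extracted as in the earlier training-side sketch and a model that maps a window of speech feature frames to one facial feature vector; reducing step 6041 to one named controller curve per output dimension is only an assumption about what the conversion to animation data might look like:

```python
import numpy as np
import torch

def drive_face_from_speech(model, speech_feats: np.ndarray,
                           window: int = 32, hop: int = 1) -> dict:
    """speech_feats: (T, D_in) per-frame features of the voice data to be
    played; returns one animation curve per facial feature dimension."""
    windows = [speech_feats[i:i + window]
               for i in range(0, len(speech_feats) - window + 1, hop)]
    batch = torch.tensor(np.stack(windows), dtype=torch.float32)
    with torch.no_grad():
        face_frames = model(batch).numpy()       # (T', D_out) facial feature data
    # Step 6041: convert the facial feature data into animation data, here one
    # named controller curve per output dimension; step 6042 then applies the
    # curves to the virtual face model so that the mouth shape follows the speech.
    return {f"control_{k}": face_frames[:, k] for k in range(face_frames.shape[1])}
```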
To sum up, in the technical solution provided by the embodiment of the application, the face driving model generates facial feature data from the voice data to be played, and the facial feature data enables the virtual face model to make a facial mouth shape animation matched with that voice data. This provides a scheme in which a deep learning model drives the virtual face model to make the corresponding mouth shapes directly from the voice data, avoids the human effort required in the related art to manually preset the relationship between each pronunciation and its animation parameters, and improves the driving efficiency of the virtual face model.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 9, a block diagram of a training apparatus for a face-driven model according to an embodiment of the present application is shown. The device has the function of realizing the training method of the face driving model, and the function can be realized by hardware or by hardware executing corresponding software. The device can be computer equipment, and can also be arranged in the computer equipment. The apparatus 900 may include: a data acquisition module 910, a first extraction module 920, a second extraction module 930, a sample generation module 940, and a model training module 950.
The data acquiring module 910 is configured to acquire sample voice data and sample video data of a sample object during speaking.
A first extraction module 920, configured to perform feature extraction on the sample voice data to obtain voice feature data of the sample voice data, where the voice feature data is used to indicate pronunciation features of the sample object.
A second extracting module 930, configured to perform feature extraction on the sample video data to obtain facial feature data of the sample video data, where the facial feature data is used to indicate facial expression features of the sample object.
A sample generating module 940, configured to use the facial feature data as tag information of the voice feature data, and generate a training sample of the face-driven model.
A model training module 950, configured to train the face-driven model with the training sample, where the face-driven model is configured to generate, based on the voice data to be played, face feature data for controlling a virtual face model to make a face-mouth animation matched with the voice data to be played.
In an exemplary embodiment, as shown in fig. 10, the model training module 950 includes: a feature determination unit 951, a numerical value determination unit 952, and a model adjustment unit 953.
A feature determining unit 951, configured to process the voice feature data through the face driving model to obtain face feature data output by the model.
A numerical value determining unit 952, configured to determine a regular term value and a loss function value of the face-driven model based on the label information and the face feature data output by the model; the regular term is used for measuring the authenticity of the face driving effect corresponding to the output result of the face driving model, and the loss function is used for measuring the accuracy of the output result of the face driving model.
A model adjusting unit 953, configured to adjust parameters of the face driving model based on the regular term value and the loss function value, and stop training the face driving model until the regular term satisfies a first condition and the loss function satisfies a second condition.
In an exemplary embodiment, the sample speech data comprises a plurality of frames of sample speech data; the numerical value determining unit 952 is further configured to obtain the facial feature data and the tag information output by the model, which correspond to each sample voice data frame; determining a first change value and a second change value corresponding to adjacent sample voice data frames, wherein the first change value is a change value of face feature data output by the model, and the second change value is a change value of the label information; determining a regularization term value of the face-driven model based on the first variance value and the second variance value.
In an exemplary embodiment, the sample voice data includes a plurality of sample voice data frames, and the facial feature data of the sample video data includes facial feature data corresponding to a plurality of image frames, respectively; as shown in fig. 10, the sample generating module 940 is configured to obtain a playing time period corresponding to the sample voice data frame; determining n image frames corresponding to the playing time period based on the playing time period, wherein n is a positive integer; and taking the facial feature data corresponding to the n image frames as the label information of the sample voice data frame to generate the training sample.
In an exemplary embodiment, as shown in fig. 10, the first extraction module 920 includes: a framing processing unit 921, a first extraction unit 922, and a speed acquisition unit 923.
A framing processing unit 921, configured to perform framing processing on the sample voice data to obtain a plurality of sample voice data frames.
A first extraction unit 922, configured to perform feature extraction on the sample voice data frame to obtain a vocal tract frequency feature of the sample voice data frame.
A speed acquisition unit 923, configured to acquire a pronunciation speed feature of the sample voice data frame based on the vocal tract frequency feature of the sample voice data frame; where the voice feature data includes the vocal tract frequency feature and the pronunciation speed feature.
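As a concrete but non-limiting illustration of units 921 to 923, MFCC-style coefficients could serve as the vocal tract frequency feature and their first-order differences as the pronunciation speed feature; the embodiments do not prescribe these exact features, and the librosa-based sketch below is only one possible realization.

```python
# Assumed realization: MFCCs as the vocal tract frequency feature,
# delta (first-order difference) coefficients as the pronunciation speed feature.
import librosa

def extract_voice_features(wav_path: str, frame_len_s: float = 0.025,
                           hop_len_s: float = 0.010, n_mfcc: int = 13):
    y, sr = librosa.load(wav_path, sr=None)
    vocal_tract = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_len_s * sr),      # single-frame duration, in samples
        hop_length=int(hop_len_s * sr))   # interval duration (< single-frame duration)
    speed = librosa.feature.delta(vocal_tract)  # frame-to-frame change of the frequency feature
    return vocal_tract, speed                   # each of shape (n_mfcc, n_frames)
```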
In an exemplary embodiment, the framing processing unit 921 is configured to obtain a single-frame duration and an interval duration of the sample voice data frame, where the interval duration refers to a time interval between starting times of two adjacent sample voice data frames, and the interval duration is smaller than the single-frame duration; and performing framing processing on the sample voice data according to the single-frame duration and the interval duration to obtain a plurality of sample voice data frames.
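The overlapping framing described above, in which the interval duration is smaller than the single-frame duration, can be sketched as follows; the 25 ms frame length and 10 ms interval are common defaults used here only for illustration, not values fixed by the embodiments.

```python
import numpy as np

def frame_signal(samples: np.ndarray, sample_rate: int,
                 frame_len_s: float = 0.025, hop_len_s: float = 0.010) -> np.ndarray:
    """Split a waveform into overlapping sample voice data frames."""
    frame_len = int(frame_len_s * sample_rate)   # single-frame duration, in samples
    hop_len = int(hop_len_s * sample_rate)       # interval duration, in samples
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    starts = np.arange(n_frames) * hop_len       # start time of each frame
    return np.stack([samples[s:s + frame_len] for s in starts])
```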
In an exemplary embodiment, as shown in fig. 10, the second extraction module 930 is configured to obtain a sampling frequency for the sample video data; sample the sample video data based on the sampling frequency to obtain a plurality of image frames corresponding to the sample video data; and perform feature extraction on the plurality of image frames respectively to obtain facial feature data of the sample video data.
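One possible realization of the second extraction module 930 is sketched below with OpenCV: the video is sampled at a chosen frequency and a face feature extractor is applied to each sampled image frame. The extractor extract_face_features is a hypothetical placeholder, since the embodiments do not specify a particular landmark or expression model.

```python
import cv2

def sample_and_extract(video_path: str, sampling_hz: float, extract_face_features):
    """Sample the video at sampling_hz and extract facial feature data per kept frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, round(fps / sampling_hz))   # keep one frame every `step` frames
    features, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            features.append(extract_face_features(frame))  # hypothetical extractor
        idx += 1
    cap.release()
    return features
```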
In summary, in the technical solutions provided in the embodiments of the present application, a training sample for the face driving model is generated from sample voice data and sample video data collected while the sample object is speaking, and the face driving model is used to generate, based on the voice data to be played, facial feature data for controlling the virtual face model to make a facial mouth shape animation matched with the voice data to be played. In other words, a method for driving the virtual face model based on the face driving model is provided, which avoids the problem in the related art that the relationship between each pronunciation and the animation parameters cannot be set accurately and comprehensively by hand; the face driving model directly generates facial feature data matched with the voice data, so the driving efficiency and the accuracy of the virtual face model are improved. Moreover, because the face driving model is trained with sample voice data and sample video data collected over the whole speaking process of the sample object, the influence of context on the facial mouth shape animation is taken into account during training, which ensures the accuracy of the training result and improves the realism of the face driving effect.
Referring to fig. 11, a block diagram of an apparatus for generating a facial mouth shape animation according to an embodiment of the present application is shown. The apparatus has the function of implementing the above method for generating a facial mouth shape animation, and the function may be implemented by hardware or by hardware executing corresponding software. The apparatus may be a computer device, or may be provided in a computer device. The apparatus 1100 may include: a voice acquisition module 1110, a feature extraction module 1120, a data generation module 1130, and a model control module 1140.
The voice acquisition module 1110 is configured to acquire voice data to be played.
The feature extraction module 1120 is configured to perform feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played, where the voice feature data is used to indicate pronunciation features.
A data generation module 1130, configured to generate facial feature data from the voice feature data through a face driving model, where the facial feature data is used to indicate facial expression features.
A model control module 1140, configured to control the virtual face model to make a facial mouth shape animation matched with the voice data to be played based on the facial feature data.
In an exemplary embodiment, the model control module 1140 is configured to convert the facial feature data into facial mouth shape animation data, where the facial mouth shape animation data is used to control the facial mouth shape animation of the virtual face model; and to control, using the facial mouth shape animation data, the virtual face model to make the facial mouth shape animation matched with the voice data to be played.
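For illustration, the conversion performed by the model control module 1140 might look like the sketch below if the facial feature data were blendshape-style coefficients and the virtual face model exposed a set_blendshape_weights() call; both the representation and the engine API are assumptions of this sketch, not details from the embodiments.

```python
import time
from typing import Sequence

def drive_virtual_face(face_model, facial_features: Sequence[Sequence[float]],
                       frame_interval_s: float) -> None:
    """Play one frame of mouth shape animation per facial feature vector."""
    for weights in facial_features:
        # convert facial feature data into facial mouth shape animation data
        anim_data = {f"blendshape_{i}": w for i, w in enumerate(weights)}
        face_model.set_blendshape_weights(anim_data)  # hypothetical engine API
        time.sleep(frame_interval_s)                  # keep the animation in sync with playback
```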
In an exemplary embodiment, the feature extraction module 1120 is configured to perform framing processing on the voice data to be played to obtain a plurality of voice data frames; perform feature extraction on the voice data frame to obtain a vocal tract frequency feature of the voice data frame; and acquire a pronunciation speed feature of the voice data frame based on the vocal tract frequency feature of the voice data frame; where the voice feature data includes the vocal tract frequency feature and the pronunciation speed feature.
In an exemplary embodiment, the feature extraction module 1120 is configured to obtain a single-frame duration and an interval duration of the voice data frame, where the interval duration refers to a time interval between start times of two adjacent voice data frames, and the interval duration is smaller than the single-frame duration; and according to the single-frame duration and the interval duration, performing framing processing on the voice data to be played to obtain a plurality of voice data frames.
To sum up, in the technical solution provided by the embodiments of the present application, the face driving model obtains facial feature data from the voice data to be played, and the facial feature data enables the virtual face model to make a facial mouth shape animation matched with the voice data to be played. That is, a technical solution is provided in which a deep learning model drives the virtual face model to make the corresponding mouth shape according to the voice data, which avoids the human resource consumption caused by manually presetting the relationship between each pronunciation and the animation parameters in the related art; the face driving model directly generates facial feature data matched with the voice data, so the driving efficiency of the virtual face model is improved.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the above functional modules is used only as an example for description; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of the specific implementation process, reference is made to the method embodiments, which are not repeated here.
Referring to fig. 12, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device can be used to implement the functions of the above method for training a face driving model or the above method for generating a facial mouth shape animation. Specifically:
the computer apparatus 1200 includes a Central Processing Unit (CPU) 1201, a system Memory 1204 including a Random Access Memory (RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the Central Processing Unit 1201. The computer device 1200 also includes a basic Input/Output system (I/O system) 1206, which facilitates transfer of information between various devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse or a keyboard, for the user to input information. The display 1208 and the input device 1209 are both connected to the central processing unit 1201 through an input/output controller 1210 that is connected to the system bus 1205. The basic input/output system 1206 may further include the input/output controller 1210 for receiving and processing input from a number of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 1210 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1200 may also be operated through a remote computer connected over a network such as the Internet. That is, the computer device 1200 may connect to the network 1212 through a network interface unit 1211 connected to the system bus 1205, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1211.
The memory further includes a computer program, which is stored in the memory and configured to be executed by one or more processors to implement the above method for training a face driving model or the above method for generating a facial mouth shape animation.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which at least one instruction, at least one program, a code set, or an instruction set is stored; when executed by a processor, it implements the above method for training a face driving model or the above method for generating a facial mouth shape animation.
Optionally, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions to cause the computer device to perform the above method for training a face driving model or the above method for generating a facial mouth shape animation.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method of training a face-driven model, the method comprising:
acquiring sample voice data and sample video data when a sample object speaks;
performing feature extraction on the sample voice data to obtain voice feature data of the sample voice data, wherein the voice feature data is used for indicating pronunciation features of the sample object;
performing feature extraction on the sample video data to obtain facial feature data of the sample video data, wherein the facial feature data are used for indicating facial expression features of the sample object;
using the face feature data as label information of the voice feature data to generate a training sample of the face driving model;
processing the voice feature data through a face driving model to obtain face feature data output by the model, wherein the face driving model is used for generating face feature data used for controlling a virtual face model to make face mouth shape animation matched with the voice data to be played based on the voice data to be played;
determining a regularization term value and a loss function value of the face driving model based on the face feature data output by the model and the label information; the regularization term is used for measuring the authenticity of a face driving effect corresponding to an output result of the face driving model, and the loss function is used for measuring the accuracy of the output result of the face driving model;
and adjusting parameters of the face driving model based on the regularization term value and the loss function value until the regularization term meets a first condition and the loss function meets a second condition, and stopping training the face driving model.
2. The method of claim 1, wherein the sample voice data comprises a plurality of sample voice data frames;
the determining a regularization term value of the face driving model based on the facial feature data output by the model and the label information includes:
acquiring the face feature data output by the model and the label information corresponding to each sample voice data frame;
determining a first change value and a second change value corresponding to adjacent sample voice data frames, wherein the first change value is a change value of face feature data output by the model, and the second change value is a change value of the label information;
determining the regularization term value of the face driving model based on the first change value and the second change value.
3. The method of claim 1, wherein the sample voice data comprises a plurality of sample voice data frames, and the facial feature data of the sample video data comprises facial feature data corresponding to a plurality of image frames respectively;
the generating a training sample of the face driving model by using the face feature data as the label information of the voice feature data includes:
acquiring a playing time period corresponding to the sample voice data frame;
determining n image frames corresponding to the playing time period based on the playing time period, wherein n is a positive integer;
and taking the facial feature data corresponding to the n image frames as the label information of the sample voice data frame to generate the training sample.
4. The method according to any one of claims 1 to 3, wherein the performing feature extraction on the sample voice data to obtain the voice feature data of the sample voice data comprises:
performing framing processing on the sample voice data to obtain a plurality of sample voice data frames;
performing feature extraction on the sample voice data frame to obtain a vocal tract frequency feature of the sample voice data frame;
acquiring a pronunciation speed feature of the sample voice data frame based on the vocal tract frequency feature of the sample voice data frame;
wherein the voice feature data comprises the vocal tract frequency feature and the pronunciation speed feature.
5. The method of claim 4, wherein the framing the sample speech data to obtain a plurality of frames of sample speech data comprises:
acquiring single-frame duration and interval duration of the sample voice data frames, wherein the interval duration refers to a time interval between starting moments of two adjacent sample voice data frames, and the interval duration is smaller than the single-frame duration;
and performing framing processing on the sample voice data according to the single-frame duration and the interval duration to obtain a plurality of sample voice data frames.
6. The method according to any one of claims 1 to 3, wherein the performing feature extraction on the sample video data to obtain facial feature data of the sample video data comprises:
obtaining a sampling frequency for the sample video data;
sampling the sample video data based on the sampling frequency to obtain a plurality of image frames corresponding to the sample video data;
and respectively carrying out feature extraction on the plurality of image frames to obtain facial feature data of the sample video data.
7. A method for generating a facial mouth shape animation, the method comprising:
acquiring voice data to be played;
performing feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played, wherein the voice feature data is used for indicating pronunciation features;
generating facial feature data according to the voice feature data through a face driving model, wherein the facial feature data are used for indicating facial expression features, the face driving model is obtained through training based on a regularization term value, and the regularization term is used for measuring the authenticity of a face driving effect corresponding to an output result of the face driving model;
and controlling a virtual face model to make face mouth shape animation matched with the voice data to be played based on the face feature data.
8. The method of claim 7, wherein controlling a virtual face model to animate a face mouth that matches the speech data to be played based on the facial feature data comprises:
converting the facial feature data into facial mouth shape animation data, wherein the facial mouth shape animation data is used for controlling the facial mouth shape animation of the virtual face model;
and controlling the virtual face model to make the face mouth shape animation matched with the voice data to be played by adopting the face mouth shape animation data.
9. The method according to claim 7, wherein the performing feature extraction on the voice data to be played to obtain voice feature data of the voice data to be played comprises:
performing framing processing on the voice data to be played to obtain a plurality of voice data frames;
performing feature extraction on the voice data frame to obtain a vocal tract frequency feature of the voice data frame;
acquiring a pronunciation speed feature of the voice data frame based on the vocal tract frequency feature of the voice data frame;
wherein the voice feature data comprises the vocal tract frequency feature and the pronunciation speed feature.
10. The method according to claim 9, wherein the framing the voice data to be played to obtain a plurality of voice data frames comprises:
acquiring single-frame duration and interval duration of the voice data frames, wherein the interval duration refers to a time interval between starting moments of two adjacent voice data frames, and the interval duration is smaller than the single-frame duration;
and according to the single-frame duration and the interval duration, performing framing processing on the voice data to be played to obtain a plurality of voice data frames.
11. A training apparatus for a face-driven model, the apparatus comprising:
the data acquisition module is used for acquiring sample voice data and sample video data when a sample object speaks;
the first extraction module is used for extracting the characteristics of the sample voice data to obtain the voice characteristic data of the sample voice data, and the voice characteristic data is used for indicating the pronunciation characteristics of the sample object;
the second extraction module is used for performing feature extraction on the sample video data to obtain facial feature data of the sample video data, wherein the facial feature data are used for indicating facial expression features of the sample object;
the sample generation module is used for generating a training sample of the face driving model by taking the face characteristic data as the label information of the voice characteristic data;
the model training module is used for processing the voice characteristic data through a face driving model to obtain face characteristic data output by the model, and the face driving model is used for generating face characteristic data used for controlling a virtual face model to make face mouth shape animation matched with the voice data to be played based on the voice data to be played; determining a regularization term value and a loss function value of the face driving model based on the face feature data output by the model and the label information; the regularization term is used for measuring the authenticity of a face driving effect corresponding to an output result of the face driving model, and the loss function is used for measuring the accuracy of the output result of the face driving model; and adjusting parameters of the face driving model based on the regularization term value and the loss function value until the regularization term meets a first condition and the loss function meets a second condition, and stopping training the face driving model.
12. An apparatus for generating a facial mouth animation, the apparatus comprising:
the voice acquisition module is used for acquiring voice data to be played;
the feature extraction module is used for extracting features of the voice data to be played to obtain voice feature data of the voice data to be played, and the voice feature data is used for indicating pronunciation features;
the data generation module is used for generating facial feature data according to the voice feature data through a face driving model, the facial feature data are used for indicating facial expression features, the face driving model is obtained through training based on a regularization term value, and the regularization term is used for measuring the authenticity of a face driving effect corresponding to an output result of the face driving model;
and the model control module is used for controlling the virtual face model to make face mouth shape animation matched with the voice data to be played based on the face characteristic data.
13. A computer device, characterized in that it comprises a processor and a memory in which at least one instruction, at least one program, set of codes or set of instructions is stored, which is loaded and executed by the processor to implement a method of training a face driven model according to any one of claims 1 to 6 or to implement a method of generating a facial mouth animation according to any one of claims 7 to 10.
14. A computer-readable storage medium, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the training method of the face driven model according to any one of claims 1 to 6, or to implement the generation method of the face mouth animation according to any one of claims 7 to 10.
CN202110068320.0A 2021-01-19 2021-01-19 Method for training face driving model and generating face mouth shape animation Active CN112396182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110068320.0A CN112396182B (en) 2021-01-19 2021-01-19 Method for training face driving model and generating face mouth shape animation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110068320.0A CN112396182B (en) 2021-01-19 2021-01-19 Method for training face driving model and generating face mouth shape animation

Publications (2)

Publication Number Publication Date
CN112396182A CN112396182A (en) 2021-02-23
CN112396182B true CN112396182B (en) 2021-04-16

Family

ID=74625657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110068320.0A Active CN112396182B (en) 2021-01-19 2021-01-19 Method for training face driving model and generating face mouth shape animation

Country Status (1)

Country Link
CN (1) CN112396182B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077535B (en) * 2021-04-16 2023-06-06 深圳追一科技有限公司 Model training and mouth motion parameter acquisition method, device, equipment and medium
CN113129925B (en) * 2021-04-20 2023-08-04 深圳追一科技有限公司 VC model-based mouth motion driving model training method and component
CN113314145A (en) * 2021-06-09 2021-08-27 广州虎牙信息科技有限公司 Sample generation method, model training method, mouth shape driving device, mouth shape driving equipment and mouth shape driving medium
CN114550075A (en) * 2022-04-25 2022-05-27 北京华科海讯科技有限公司 Parallel signal processing method and system based on video image recognition
CN115661005B (en) * 2022-12-26 2023-05-12 成都索贝数码科技股份有限公司 Custom digital person generation method and equipment
CN116665695B (en) * 2023-07-28 2023-10-20 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808191A (en) * 2017-09-13 2018-03-16 北京光年无限科技有限公司 The output intent and system of the multi-modal interaction of visual human
CN109447234A (en) * 2018-11-14 2019-03-08 腾讯科技(深圳)有限公司 A kind of model training method, synthesis are spoken the method and relevant apparatus of expression
CN109491564A (en) * 2018-10-18 2019-03-19 深圳前海达闼云端智能科技有限公司 Interaction method and device of virtual robot, storage medium and electronic equipment
CN110488975A (en) * 2019-08-19 2019-11-22 深圳市仝智科技有限公司 A kind of data processing method and relevant apparatus based on artificial intelligence
CN111598979A (en) * 2020-04-30 2020-08-28 腾讯科技(深圳)有限公司 Method, device and equipment for generating facial animation of virtual character and storage medium
WO2020241072A1 (en) * 2019-05-24 2020-12-03 日本電信電話株式会社 Data generation model learning device, latent variable generation model learning device, translation data generation device, data generation model learning method, latent variable generation model learning method, translation data generation method, and program

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808191A (en) * 2017-09-13 2018-03-16 北京光年无限科技有限公司 The output intent and system of the multi-modal interaction of visual human
CN109491564A (en) * 2018-10-18 2019-03-19 深圳前海达闼云端智能科技有限公司 Interaction method and device of virtual robot, storage medium and electronic equipment
CN109447234A (en) * 2018-11-14 2019-03-08 腾讯科技(深圳)有限公司 A kind of model training method, synthesis are spoken the method and relevant apparatus of expression
WO2020241072A1 (en) * 2019-05-24 2020-12-03 日本電信電話株式会社 Data generation model learning device, latent variable generation model learning device, translation data generation device, data generation model learning method, latent variable generation model learning method, translation data generation method, and program
CN110488975A (en) * 2019-08-19 2019-11-22 深圳市仝智科技有限公司 A kind of data processing method and relevant apparatus based on artificial intelligence
CN111598979A (en) * 2020-04-30 2020-08-28 腾讯科技(深圳)有限公司 Method, device and equipment for generating facial animation of virtual character and storage medium

Also Published As

Publication number Publication date
CN112396182A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112396182B (en) Method for training face driving model and generating face mouth shape animation
JP6876752B2 (en) Response method and equipment
CN111415677B (en) Method, apparatus, device and medium for generating video
CN110457432A (en) Interview methods of marking, device, equipment and storage medium
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
JP7230806B2 (en) Information processing device and information processing method
CN112418011A (en) Method, device and equipment for identifying integrity of video content and storage medium
CN109817244B (en) Spoken language evaluation method, device, equipment and storage medium
CN112863489B (en) Speech recognition method, apparatus, device and medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN110503941B (en) Language ability evaluation method, device, system, computer equipment and storage medium
CN108962226B (en) Method and apparatus for detecting end point of voice
CN112580669B (en) Training method and device for voice information
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN112673641A (en) Inline response to video or voice messages
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN112381926A (en) Method and apparatus for generating video
CN112837688B (en) Voice transcription method, device, related system and equipment
US11605388B1 (en) Speaker conversion for video games
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN112423000B (en) Data processing method, device, equipment and medium
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment
CN112381989A (en) Sorting method, device and system and electronic equipment
WO2021017302A1 (en) Data extraction method and apparatus, and computer system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40038803

Country of ref document: HK