CN116468826A - Training method of expression generation model, and method and device for expression generation - Google Patents


Info

Publication number
CN116468826A
Authority
CN
China
Prior art keywords
expression
mouth shape
shape control
features
driving parameters
Prior art date
Legal status
Granted
Application number
CN202310723506.4A
Other languages
Chinese (zh)
Other versions
CN116468826B (en)
Inventor
杜宗财
范锡睿
赵亚飞
郭紫垣
王志强
陈毅
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310723506.4A
Publication of CN116468826A
Application granted
Publication of CN116468826B
Legal status: Active


Classifications

    • G06T13/205: 3D [Three Dimensional] animation driven by audio data
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method of an expression generation model, and a method and a device for expression generation, and relates to the field of computer technology, in particular to technical fields such as virtual digital humans, artificial intelligence, augmented reality, virtual reality, mixed reality, the metaverse, and deep learning. The specific implementation scheme is as follows: generating a first mouth shape control feature and a first expression driving parameter according to a training sample; inputting the audio features and the first mouth shape control feature of the training sample into a first expression generation model and predicting a second expression driving parameter; obtaining a loss function according to the real expression driving parameters, the first expression driving parameters, and the second expression driving parameters of the training sample; and updating the first expression generation model according to the loss function to obtain a trained second expression generation model. The disclosure can improve the diversity of mouth shapes when a digital person is driven by audio to speak and obtain more personalized mouth shape styles.

Description

Training method of expression generation model, and method and device for expression generation
Technical Field
The disclosure relates to the field of computer technology, in particular to technical fields such as virtual digital humans, artificial intelligence, augmented reality, virtual reality, mixed reality, the metaverse, and deep learning.
Background
Audio-driven digital human speech is an important technology in scenes such as the metaverse and digital humans, and aims to generate digital human expressions and mouth shapes synchronized with the input audio. In general, under the influence of various factors that can change the mouth shape, different mouth shapes are produced when the same word is spoken in different states, so generating personalized mouth shape styles is an important step in making a digital person more vivid and anthropomorphic.
Disclosure of Invention
The disclosure provides a training method of an expression generation model, an expression generation method, an expression generation device and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of an expression generation model, including:
generating a first mouth shape control feature and a first expression driving parameter according to the training sample;
inputting the audio characteristics and the first mouth shape control characteristics of the training sample into a first expression generating model, and predicting to obtain second expression driving parameters;
obtaining a loss function according to the real expression driving parameters, the first expression driving parameters and the second expression driving parameters of the training sample; and
updating the first expression generating model according to the loss function to obtain a trained second expression generating model.
According to another aspect of the present disclosure, there is provided a method of expression generation, including:
extracting audio features according to the target audio data;
obtaining a target mouth shape control characteristic according to a preset style attribute parameter;
determining expression driving parameters according to the audio characteristics and the target mouth shape control characteristics by using a second expression generating model; the second expression generating model is obtained through training according to the training method of the expression generating model in any one of the embodiments of the disclosure; and
generating an expression image corresponding to the target audio data according to the expression driving parameters.
According to another aspect of the present disclosure, there is provided a training apparatus of an expression generating model, including:
the generating module is used for generating a first mouth shape control characteristic and a first expression driving parameter according to the training sample;
the prediction module is used for inputting the audio characteristics and the first mouth shape control characteristics of the training sample into the first expression generation model, and predicting to obtain second expression driving parameters;
the loss determination module is used for obtaining a loss function according to the real expression driving parameters, the first expression driving parameters and the second expression driving parameters of the training sample; and
the update module is used for updating the first expression generating model according to the loss function to obtain a trained second expression generating model.
According to another aspect of the present disclosure, there is provided an expression generating apparatus including:
the extraction module is used for extracting audio characteristics according to the target audio data;
the feature determining module is used for obtaining target mouth shape control features according to preset style attribute parameters;
the parameter determining module is used for determining expression driving parameters according to the audio characteristics and the target mouth shape control characteristics by using the second expression generating model; wherein the second expression generating model is obtained according to the training of the device of any one of the embodiments of the disclosure; and
and the image generation module is used for generating an expression image corresponding to the target audio data according to the expression driving parameters.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
The method and the device can improve the diversity of mouth shapes when a digital person is driven by audio to speak, and obtain more personalized mouth shape styles.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a training method of an expression generation model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a mouth style according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a training method of an expression generation model according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the structure of a deep learning network for attribute control according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of expression generation provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training device for expression generating models according to an embodiment of the present disclosure;
fig. 7 is a schematic structural view of an expression generating apparatus provided according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device used to implement an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, a digital person produces the same mouth shape whenever it speaks the same word. In practice, however, a person's mouth shape generally differs across states: as shown in (a) and (b) of FIG. 2, the mouth opens widely when the speaker is happy and only slightly when the speaker is sad. In addition, the speaking mouth shapes of people of different ages and genders often differ; as shown in (c) and (d) of FIG. 2, the mouth shape tends to be larger when a male speaks than when a female speaks. Therefore, generating personalized mouth shape styles is an important step in making a digital person more vivid and anthropomorphic.
Fig. 1 is a flowchart illustrating a training method of an expression generating model according to an embodiment of the disclosure. As shown in fig. 1, the method at least comprises the steps of:
s101, generating a first mouth shape control characteristic and a first expression driving parameter according to a training sample.
In the embodiment of the disclosure, the training sample is obtained from video recorded while a collected subject speaks. The training sample at least includes the audio data produced by the collected subject (for example, speaking or reading aloud) and the video data recording the expression changes of the facial area during the utterance.
The first mouth shape control feature may be understood as a feature that affects how the mouth shape changes when speaking. A mouth shape control feature can be derived from the corresponding mouth shape control parameters. The mouth shape control parameters, also referred to as semantic parameters, include both identity attributes and emotion styles. Identity attributes may include parameters such as the speaker's age, gender, and body type. The emotion style may include parameters such as the emotion at the time of speaking and the language used (which may include Chinese, Japanese, English, and dialects of various languages). By encoding the above parameters, the corresponding mouth shape control features can be obtained.
The first expression driving parameter may be understood as a parameter for controlling and driving expression changes of a two-dimensional or three-dimensional model. Such parameters are commonly used in facial animation and expression synthesis to achieve realistic and natural changes of the model's facial expression.
Expression driving parameters can be seen as a set of control parameters describing the shape change of the model under different facial expressions. Each parameter corresponds to a particular facial expression, such as smile, anger, surprise, etc. By adjusting the values of these parameters, the facial shape of the model can be changed, thereby achieving smooth transitions and changes between different expressions.
In one example, the expression driving parameters may use blendshape (blend deformation) parameters. Blendshape parameters are a morphing technique based on linear interpolation, commonly used to implement facial animation and expression changes. In a three-dimensional model, blendshape parameters represent different facial expressions or shape changes. Each blendshape parameter corresponds to a particular facial expression, such as smile, anger, surprise, etc. By adjusting the weights of the blendshape parameters, the shape of the model can be changed, thereby realizing different expressions and transitions.
The blendshape parameters are derived from the video data of the training samples. Specifically, facial motion data may be recorded using a data acquisition device (e.g., a facial capture system), and the blendshape parameters may be derived by analyzing the coordinate locations of a plurality of feature points of the face in the data.
In the blendshape parameters, the intensity of each expression can be controlled by adjusting the weight value. By blending different blendshapes, continuous facial animation can be generated or custom facial expressions can be achieved. This technique is widely used in the fields of movies, games, virtual reality, etc. to achieve realistic facial animation effects.
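For illustration only (not part of the patent itself), blendshape driving can be sketched as a weighted sum of per-expression vertex offsets; the mesh sizes, expression names, and weights below are hypothetical placeholders:

```python
import numpy as np

# A minimal blendshape sketch: the neutral mesh plus weighted expression offsets.
# All shapes and weights here are illustrative, not values from the patent.
neutral = np.zeros((468, 3))                    # N vertices, xyz coordinates
deltas = {                                      # per-expression vertex offsets
    "smile":    np.random.randn(468, 3) * 0.01,
    "jaw_open": np.random.randn(468, 3) * 0.01,
    "surprise": np.random.randn(468, 3) * 0.01,
}

def apply_blendshapes(weights: dict) -> np.ndarray:
    """Blend the neutral mesh with expression offsets scaled by their weights."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * deltas[name]
    return mesh

# e.g. a half-open mouth with a slight smile
frame = apply_blendshapes({"smile": 0.3, "jaw_open": 0.5})
```

In the setting of this disclosure, such per-frame weights are the expression driving parameters that the model predicts.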
It should be noted that, in order to increase the diversity of the training samples, the collected subjects may include subjects of different ages and genders. For example, several male and female subjects may be selected in each of a plurality of age groups, such as children, teenagers, young adults, the middle-aged, and the elderly. A collected subject may read a preset text, talk according to a preset script, or speak freely, and may speak under a plurality of different emotions, such as happiness, surprise, sadness, fear, anger, and disdain.
S102, inputting the audio features and the first mouth shape control features of the training sample into a first expression generating model, and predicting to obtain second expression driving parameters.
In embodiments of the present disclosure, the audio features may be obtained using any existing audio feature extraction method, for example a pre-trained feature model such as a wav2vec2 feature model, or traditional features such as mel-frequency cepstral coefficients (MFCC).
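As a purely illustrative sketch of the traditional-feature option, MFCC features could be extracted with a library such as librosa; the file path and frame parameters below are placeholders, not values from the patent:

```python
import librosa

# Hypothetical input file; 16 kHz mono is a common choice for speech models.
audio, sr = librosa.load("speaker_sample.wav", sr=16000)

# 13 MFCC coefficients per frame; window/hop sizes are illustrative only.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=320)  # 25 ms windows, 20 ms hop
print(mfcc.shape)  # (13, num_frames)
```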
The audio feature and the first mouth shape control feature are input into a first expression generating model, and the first expression generating model outputs second expression driving parameters. The second expression driving parameter is a predicted value of the expression driving parameter obtained by the first expression generating model according to the audio characteristic and the first mouth shape control characteristic.
S103, obtaining a loss function according to the real expression driving parameters, the first expression driving parameters and the second expression driving parameters of the training sample.
And determining a loss value according to the loss function through the first expression driving parameter, the second expression driving parameter and the real expression driving parameter.
The loss function is a function that measures the difference or error between the predicted outcome of the model on the training data and the actual label. In each training round, the model propagates forward according to the input data, generates prediction results, and then calculates the difference between the prediction results and the real labels, namely the loss value.
S104, updating the first expression generating model according to the loss function to obtain a trained second expression generating model.
Model training aims to minimize the loss function: the loss value is gradually reduced by continuously adjusting and updating the parameters of the first expression generation model. This process typically uses an optimization algorithm (e.g., gradient descent) for the parameter update. Model training is the process of optimizing a model by minimizing a loss function, and the loss values calculated by the loss function are the basis for evaluating the model's performance on the training data and guiding the optimization. Through multiple rounds of iterative updating of the first expression generation model, a trained second expression generation model is obtained once the model convergence condition is met.
In the embodiment of the disclosure, a plurality of mouth shape control features which can influence the speaking mouth shape are input into a first expression generating model to be trained, so that the model can learn the influence of the mouth shape control features on the mouth shape.
According to the scheme of the embodiment of the disclosure, the diversity of mouth shapes when a digital person is driven by audio to speak can be improved, and more personalized mouth shape styles can be obtained.
In one possible implementation, step S101 generates a first mouth shape control feature and a first expression driving parameter according to a training sample, and further includes the steps of:
S1011, reordering the original mouth shape control features and the real expression driving parameters of a plurality of training samples to obtain the rearranged mouth shape control features and the first expression driving parameters of each training sample in the plurality of training samples. The rearranged mouth shape control features and the rearranged expression driving parameters of the same training sample follow the same ordering.
S1012, obtaining a first mouth shape control characteristic according to the original mouth shape control characteristic and the rearranged mouth shape control characteristic for each training sample.
In the embodiment of the disclosure, when the model is trained, a plurality of training samples can be formed into a batch (batch). Each training sample in the batch has a corresponding sequence number, the original mouth shape control characteristics and the real expression driving parameters of the training samples are reordered, and the original index sequences (1, 2, …, batch) of the training samples can be randomly rearranged to obtain a second index sequence. And reordering the original mouth shape control features and the real expression driving parameters of the plurality of training samples according to the second index sequence.
In one example, the original mouth shape control feature of training sample 1 is A1, and its real expression driving parameter is B1; the original mouth shape control feature of training sample 2 is A2, and its real expression driving parameter is B2; the original mouth shape control feature of training sample 3 is A3, and its real expression driving parameter is B3. The original index sequence is 1, 2, 3, and the rearranged second index sequence is 3, 1, 2. After reordering the original mouth shape control features and real expression driving parameters of the training samples according to the second index sequence, the first mouth shape control feature of training sample 1 is determined from A1 and the rearranged mouth shape control feature A3, and its first expression driving parameter is B3; the first mouth shape control feature of training sample 2 is determined from A2 and the rearranged mouth shape control feature A1, and its first expression driving parameter is B1; the first mouth shape control feature of training sample 3 is determined from A3 and the rearranged mouth shape control feature A2, and its first expression driving parameter is B2.
The first mouth shape control characteristic of the training sample can be determined by a hidden space interpolation method according to the original mouth shape control characteristic and the corresponding rearranged mouth shape control characteristic.
According to the scheme of the embodiment of the disclosure, since there are countless possible speaking styles, it is difficult for recorded training sample data to cover every real-life situation. Rearranging the training samples and performing hidden space interpolation therefore improves the diversity of the training samples.
In one possible implementation, step S1012 obtains, for each training sample, a first mouth shape control feature from the original mouth shape control feature and the rearranged mouth shape control feature, further including:
and obtaining the fusion weight of each training sample according to the distribution condition of the reordered plurality of training samples.
And aiming at each training sample, obtaining a first mouth shape control characteristic according to the fusion weight, the original mouth shape control characteristic and the rearranged mouth shape control characteristic.
In the embodiment of the disclosure, after the original indexes are randomly reordered to obtain the second indexes, a fusion weight can be sampled from the uniform distribution U(0, 1). In the above example, assuming a fusion weight of 0.3, the first mouth shape control feature of training sample 1 is 0.3×A1 + (1-0.3)×A3.
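A minimal sketch of this rearrangement-and-interpolation step, assuming the batch is stored as PyTorch tensors (the tensor layout and variable names are assumptions):

```python
import torch

def interpolate_controls(ctrl: torch.Tensor, targets: torch.Tensor):
    """Shuffle the batch and blend each sample's control feature with its shuffled partner.

    ctrl:    (batch, feat_dim) original mouth shape control features
    targets: (batch, param_dim) real expression driving parameters
    Returns the first mouth shape control features, the rearranged targets
    (the first expression driving parameters), the fusion weight, and the permutation.
    """
    perm = torch.randperm(ctrl.size(0))          # second index sequence
    alpha = torch.rand(1).item()                 # fusion weight ~ U(0, 1)
    ctrl_first = alpha * ctrl + (1 - alpha) * ctrl[perm]
    return ctrl_first, targets[perm], alpha, perm
```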
According to the scheme of the embodiment of the disclosure, the continuity of the semantic parameters is increased through the fusion weight.
In one possible implementation manner, step S103 obtains a loss function according to the real expression driving parameter, the first expression driving parameter and the second expression driving parameter of the training sample, and further includes:
s1031, obtaining a first loss value according to the real expression driving parameters and the second expression driving parameters of the training samples.
S1032, obtaining a second loss value according to the first expression driving parameters and the second expression driving parameters of the training samples.
S1033, obtaining a loss function according to the first loss value, the second loss value and the loss function weight of the training samples.
In an embodiment of the disclosure, a plurality of training samples in a batch may be combined into a matrix, and for each training sample in the matrix, a first loss value of each training sample is determined according to a second expression driving parameter predicted by the model and a real expression driving parameter. And determining a second loss value of each training sample according to the second expression driving parameters and the first expression driving parameters obtained through rearrangement and interpolation. And mixing the two loss values according to the weight of the loss function to obtain the total loss value of each training sample. After determining the total loss value of each training sample, the matrices are summed to obtain the loss value of the batch of training samples. And updating the model parameters according to the loss value of the batch of training samples. Since the loss function in the embodiment of the present disclosure is formed by mixing two loss values, it may be referred to as a mixed style loss function.
The loss function may depend on the specific task and model characteristics; common loss functions include the mean squared error (MSE), the cross-entropy loss, and the contrastive loss.
According to the scheme of the embodiment of the disclosure, based on the hidden space interpolation method and the mixed style loss function, the continuity of semantic parameters and the diversity of model prediction mouth style are increased.
In one possible implementation, the method of the embodiment of the disclosure further includes the steps of:
the loss function weights are determined from the gaussian distribution with the fused weights as desired.
In the embodiment of the disclosure, after determining the fusion weight, the fusion weight may be distributed from gaussianMid-sampling to obtain a loss function weight +.>For->Do [0,1]And (5) cutting off the upper and lower boundaries.
According to the scheme of the embodiment of the disclosure, deriving the loss function weight from a Gaussian distribution improves the diversity of the generated mouth styles.
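Combining the sampled loss weight with the two loss values, a mixed style loss could look like the following sketch; the standard deviation of the Gaussian is not stated in the text and is a placeholder:

```python
import torch
import torch.nn.functional as F

def mixed_style_loss(pred, target_real, target_rearranged, alpha, sigma=0.1):
    """Blend the loss against the real targets and the rearranged (partner) targets.

    alpha: fusion weight used for the hidden space interpolation (mean of the Gaussian)
    sigma: assumed standard deviation; the patent text does not give its value
    """
    lam = torch.normal(mean=torch.tensor(alpha),
                       std=torch.tensor(sigma)).clamp(0.0, 1.0)
    loss_real = F.mse_loss(pred, target_real)               # first loss value
    loss_rearranged = F.mse_loss(pred, target_rearranged)   # second loss value
    return lam * loss_real + (1 - lam) * loss_rearranged
```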
In one possible implementation manner, step S102 inputs the audio feature and the first mouth shape control feature of the training sample into the first expression generating model, predicts the second expression driving parameter, and further includes:
s1021, gradually fusing the audio features of the training sample and the multiple sub-features contained in the first mouth shape control features by using the first expression generation model to obtain fusion features.
S1022, predicting to obtain a second expression driving parameter according to the fusion characteristic.
In embodiments of the present disclosure, the plurality of sub-features included in the first mouth shape control feature may be features obtained by independently encoding each of a plurality of mouth shape control parameters, such as an age feature, a gender feature, and an emotion feature. In one example, age may be expressed as a decimal in the range 0 to 1 (e.g., 5 to 10 years old corresponds to 0/14, and each subsequent five-year age group is mapped to the next evenly spaced value) and is encoded into the age feature through two fully connected layers. Gender is likewise represented by a decimal in [0, 1] (0 represents a fully male style and 1 a fully female style; the mouth opens and closes more widely when a male speaks and less widely when a female speaks, and values between 0 and 1 represent a transition of the mouth style, which can be understood as a transition of the opening width) and is encoded into the gender feature through two fully connected layers. Emotion is a 6-dimensional vector (representing, in order, the intensities of happiness, surprise, sadness, fear, anger, and disdain), each dimension in the range [0, 1], and is encoded into the emotion feature through two fully connected layers.
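A sketch of such per-attribute encoders, assuming each is a Linear + ReLU + Linear network; the hidden and output dimensions are placeholders rather than the patent's actual values:

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Linear + ReLU + Linear encoder for one scalar or vector style attribute."""
    def __init__(self, in_dim: int, hidden_dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

age_enc = AttributeEncoder(in_dim=1)       # age as a decimal in [0, 1]
gender_enc = AttributeEncoder(in_dim=1)    # 0 = male style, 1 = female style
emotion_enc = AttributeEncoder(in_dim=6)   # six emotion intensities in [0, 1]

age_feat = age_enc(torch.tensor([[0.2]]))                             # e.g. a teenager
emotion_feat = emotion_enc(torch.tensor([[1., 0., 0., 0., 0., 0.]]))  # e.g. happy
```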
Progressive fusion means that each convolution layer used for fusion receives only one of the plurality of sub-features, which is fused with the audio feature or with the output of the previous convolution layer; the next fusion convolution layer receives another sub-feature. That is, each sub-feature is input only once, in a preset order. The preset order may be to input the features related to identity attributes first and then the features related to emotion styles: age and gender are attributes strongly correlated with the digital human individual and are fused with shallow convolution features, whereas emotion is an attribute independent of the individual and is fused with semantically rich deep convolution features.
In the convolution layers used for fusion, the fusion can be performed with adaptive instance normalization (AdaIN), a normalization technique used in neural networks that helps the model better capture the feature information of each sample and reduces differences between batches.
AdaIN introduces a mechanism for adaptive adjustment on the basis of Instance Normalization (instance normalization). The mean value and the variance of each sample are adjusted by introducing additional adjustment parameters, so that the model can apply different normalization modes to different samples. In this way, adaIN can adaptively normalize the feature distribution of the input samples while maintaining feature independence for each sample.
AdaIN finds wide application in some computer vision tasks, particularly in the field of image style migration and image generation. Through self-adaptive normalization, the style information of the input sample can be transferred to the target sample, so that style conversion or generation of the target sample is realized. This allows for more realistic, versatile images to be generated with better style consistency.
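A minimal AdaIN sketch in this spirit, where each channel of the convolution feature is instance-normalized and then rescaled with the mean and standard deviation of the attribute (style) feature; a learned affine mapping is another common variant, and the tensor shapes are assumptions:

```python
import torch

def adain(x: torch.Tensor, s: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fuse an attribute feature s into a convolution feature map x via AdaIN.

    x: (batch, channels, h, w) output of a convolution layer
    s: (batch, style_dim) encoded attribute feature
    """
    mu_x = x.mean(dim=(2, 3), keepdim=True)        # per-channel spatial mean
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_s = s.mean(dim=1, keepdim=True)[:, :, None, None]
    sigma_s = s.std(dim=1, keepdim=True)[:, :, None, None]
    return sigma_s * (x - mu_x) / sigma_x + mu_s
```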
According to the scheme of the embodiment of the disclosure, the progressive fusion approach achieves better style control than common feature addition or feature concatenation.
In one possible implementation manner, step S1021 uses the first expression generation model to progressively fuse the audio feature of the training sample with the multiple sub-features included in the first mouth shape control feature to obtain a fused feature, and further includes:
the audio features of the training samples are input to a first layer of a plurality of fused convolution layers of the first expression generation model.
The multiple sub-features contained in the first mouth shape control feature are input into the multiple fusion convolution layers one by one according to a preset sequence, wherein the plurality of sub-features includes at least one of an age feature, a gender feature, an emotion feature, and a language feature.
And obtaining fusion characteristics according to the output result of the last fusion convolution layer of the first expression generation model.
In the embodiment of the disclosure, the number of fusion convolution layers may be the same as the number of sub-features, and each sub-feature is input into one fusion convolution layer to realize feature fusion. Specifically, the audio feature and the first sub-feature are input into the first of the plurality of fusion convolution layers, and the output is input into the next convolution layer. The next convolution layer may be a fusion convolution layer or a non-fusion convolution layer, i.e., a conventional convolution layer into which no sub-feature is input. The second fusion convolution layer receives the second sub-feature and the output of the preceding convolution layer connected to it, thereby completing the fusion of the second sub-feature, and so on, until all sub-features have been input and feature fusion is complete. At this point, the output of the last fusion convolution layer is the fusion feature obtained by progressively fusing the audio feature with the plurality of sub-features. In one example, the input order of the plurality of sub-features is the age feature, gender feature, language feature, and emotion feature in turn. In another example, the input order is the gender feature, age feature, emotion feature, and language feature in turn.
According to the scheme of the embodiment of the disclosure, through gradual fusion and control of the input sequence of multiple sub-features, the model can learn multiple semantic parameters better, and particularly the features related to emotion styles, which are input later.
In one possible implementation, at least one non-fused convolutional layer is provided between two adjacent fused convolutional layers.
In the embodiment of the disclosure, one or more non-fusion convolutional layers are arranged between two adjacent fusion convolutional layers, and a preferable mode is that one layer is arranged.
It should be noted that, two adjacent fusion convolution layers may be directly connected without a non-fusion convolution layer.
According to the scheme of the embodiment of the disclosure, at least one non-fusion convolution layer is separated between two adjacent fusion convolution layers, so that better learning of the mouth shape control characteristic of the model is facilitated, and therefore faster convergence is achieved.
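Putting the pieces together, a rough sketch of an attribute-controlled network that alternates fusion layers and plain (non-fusion) convolution layers; the channel sizes, layer count, and 20×392 audio-feature layout are assumptions, and only the 51-dimensional output follows the description:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Convolution followed by AdaIN-style fusion of one attribute sub-feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, style, eps=1e-5):
        x = self.conv(x)
        mu_x = x.mean(dim=(2, 3), keepdim=True)
        sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
        mu_s = style.mean(dim=1, keepdim=True)[:, :, None, None]
        sigma_s = style.std(dim=1, keepdim=True)[:, :, None, None]
        return sigma_s * (x - mu_x) / sigma_x + mu_s

class ExpressionNet(nn.Module):
    """Audio features in, blendshape driving parameters out, with progressive fusion:
    identity attributes (age, gender) at shallow layers, emotion at a deeper layer,
    and a plain (non-fusion) convolution between adjacent fusion layers."""
    def __init__(self, out_dim=51):
        super().__init__()
        self.fuse_age = FusionBlock(1, 32)
        self.plain1 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.fuse_gender = FusionBlock(32, 64)
        self.plain2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.fuse_emotion = FusionBlock(64, 64)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, out_dim))

    def forward(self, audio_feat, age_f, gender_f, emotion_f):
        x = audio_feat.unsqueeze(1)              # (batch, 1, 20, 392) audio window
        x = torch.relu(self.fuse_age(x, age_f))
        x = torch.relu(self.plain1(x))
        x = torch.relu(self.fuse_gender(x, gender_f))
        x = torch.relu(self.plain2(x))
        x = torch.relu(self.fuse_emotion(x, emotion_f))
        return self.head(x)
```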
Fig. 3 is a flowchart of a training method of an expression generating model according to another embodiment of the present disclosure. As shown in fig. 3, in one possible implementation, the method at least includes the following steps:
(1) Audio coding: the audio data is first divided into a number of audio windows (the expression generation model predicts the 51-dimensional facial parameters corresponding to the middle time point of each window), and the audio features of these windows are obtained with a pre-trained feature model (this scheme adopts a wav2vec2 feature model; traditional features such as MFCC can also be used). The wav2vec2 model predicts one 392-dimensional feature for every 20 ms of audio, and the middle 20 such 392-dimensional features of each window are taken to form the audio feature $A \in \mathbb{R}^{20 \times 392}$.
(2) Style attribute coding: the style attribute parameters (mouth shape control parameters) are encoded. Age is expressed as a decimal in [0, 1] (e.g., 5-10 years old, 10-15 years old, 15-20 years old, and so on, each mapped to a fixed value in this range) and is encoded into the age feature through two fully connected layers (Linear + ReLU + Linear, where the first layer maps to an intermediate dimension and the second to the feature dimension). Gender is likewise expressed as a decimal in [0, 1] (0 represents a fully male style, 1 a fully female style, and values in between represent a transition of the mouth style) and is encoded into the gender feature through two fully connected layers. Emotion is a 6-dimensional vector (representing, in order, the intensities of happiness, surprise, sadness, fear, anger, and disdain), each dimension in [0, 1], and is encoded into the emotion feature through two fully connected layers.
(3) Deep learning network with attribute control: to achieve age-, gender-, and emotion-based control of the speaking style, the feature codes of the style attributes need to be fused with the feature code of the audio.
The embodiment of the disclosure uses an AdaIN-based progressive fusion method (age and gender are attributes strongly correlated with the digital human individual and are fused with shallow convolution features; emotion is an attribute independent of the individual and is fused with semantically rich deep convolution features), which achieves better style control than common feature addition or feature concatenation. The input of the network is the audio feature $A$ and the style attribute features (the age, gender, and emotion features); the output is the blendshape driving parameters $\hat B$. The structure of the learning network is shown in FIG. 4, where the parameters of each two-dimensional convolution denote, respectively, the number of input features, the number of output features, the size of the convolution kernel, the convolution stride, and the number of edge pixels padded with 0. The numbers in the figure are only examples, and the actual values can be adjusted.
An attribute feature $s$ is fused through AdaIN into the output feature $x \in \mathbb{R}^{c \times h \times w}$ of a convolution layer ($h$ and $w$ are the height and width of the feature map), defined as follows:
$$\mathrm{AdaIN}(x_i, s) = \sigma(s)\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + \mu(s)$$
where $x_i$ denotes the $i$-th feature map of $x$, $\mu$ denotes the mean function, and $\sigma$ denotes the standard deviation function, computed over the elements $s_i$ of $s$ in $\mu(s)$ and $\sigma(s)$, and over the spatial dimensions in $\mu(x_i)$ and $\sigma(x_i)$.
(4) Hidden space interpolation and mixed style loss function: because of the countless possibilities of speaking styles, it is difficult to include all real-life situations in the recorded data (e.g., there are very many ways in which a 10-year-old boy might speak when happy). The embodiment of the disclosure uses a hidden space interpolation method and a mixed style loss function to increase the continuity of the semantic parameters and the diversity of the mouth styles predicted by the model. Let the audio features of the current training batch be $A$, the attribute features be $S$, and the corresponding ground-truth driving parameters be $B$. The hidden space interpolation and the mixed style loss function are computed as follows:
a) Randomly rearrange the index sequence $(1, 2, \ldots, \text{batch})$ to obtain an index $\pi$; sample a fusion weight $\alpha$ from the uniform distribution $U(0, 1)$; sample a loss function weight $\lambda$ from the Gaussian distribution $N(\alpha, \sigma^2)$ and truncate $\lambda$ to the interval $[0, 1]$.
b) Rearrange the attribute features and ground-truth values according to the index $\pi$ to obtain $S'$ and $B'$.
c) Compute the weighted sum of the attribute features with the fusion weight: $\tilde S = \alpha S + (1 - \alpha) S'$.
d) Feed the audio feature $A$ and the interpolated attribute feature $\tilde S$ into the attribute-controlled deep learning network and predict the driving parameters $\hat B$.
e) Compute the mixed style loss function based on the mean square error (MSE) loss: $L = \lambda\,\mathrm{MSE}(\hat B, B) + (1 - \lambda)\,\mathrm{MSE}(\hat B, B')$.
f) The loss function module updates the parameters of the network according to the computed loss value.
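A sketch of one training step following steps a) to f), assuming a model callable as model(audio_features, attribute_features) and a standard PyTorch optimizer; the Gaussian standard deviation is a placeholder:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, audio, attrs, targets, sigma=0.1):
    """One mixed-style training step.

    audio:   (batch, ...) audio features A
    attrs:   (batch, style_dim) attribute features S
    targets: (batch, 51) ground-truth driving parameters B
    """
    perm = torch.randperm(audio.size(0))                        # a) index rearrangement
    alpha = torch.rand(()).item()                                #    fusion weight ~ U(0, 1)
    lam = float(torch.normal(torch.tensor(alpha),
                             torch.tensor(sigma)).clamp(0, 1))   #    loss weight ~ N(alpha, sigma^2)
    attrs_perm, targets_perm = attrs[perm], targets[perm]        # b) rearranged S', B'
    attrs_mix = alpha * attrs + (1 - alpha) * attrs_perm         # c) interpolated attributes
    pred = model(audio, attrs_mix)                               # d) predicted driving parameters
    loss = (lam * F.mse_loss(pred, targets)
            + (1 - lam) * F.mse_loss(pred, targets_perm))        # e) mixed style loss
    optimizer.zero_grad()
    loss.backward()                                              # f) update network parameters
    optimizer.step()
    return loss.item()
```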
According to the scheme of the embodiment of the disclosure, the speaking mouth style of the digital person can be conveniently controlled by modifying the semantic parameters (age, gender, and emotion), which improves the controllability of the digital person's mouth style effect; hidden space interpolation and the mixed style loss function provide semantic parameter continuity and style diversity, so that a diverse range of mouth styles is available for the digital person.
Fig. 5 is a flowchart of a method for generating an expression according to an embodiment of the present disclosure. As shown in fig. 5, the method at least includes:
s501, extracting audio features according to the target audio data.
S502, obtaining target mouth shape control characteristics according to preset style attribute parameters.
S503, determining expression driving parameters according to the audio characteristics and the target mouth shape control characteristics by using the second expression generation model. The second expression generating model is trained according to the method of any embodiment.
S504, generating an expression image corresponding to the target audio data according to the expression driving parameters.
In the embodiment of the disclosure, the expression driving parameters can be obtained from the target audio data and the preset style attribute parameters, and corresponding expression images and videos are then generated from the expression driving parameters. The audio features and the target mouth shape control features are input into the second expression generation model, and the expression driving parameters are obtained from the model output. For the method of extracting the audio features and determining the target mouth shape control features, reference may be made to the description of the corresponding steps in the above method embodiments, which is not repeated here.
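An inference-time sketch of this flow, reusing the hypothetical attribute encoders and network from the earlier sketches; the style values chosen here are illustrative only:

```python
import torch

@torch.no_grad()
def generate_expression(model, audio_feat, age_enc, gender_enc, emotion_enc,
                        age=0.2, gender=1.0, emotion=(0, 0, 1, 0, 0, 0)):
    """Predict driving parameters for target audio with a chosen mouth style.

    The style can be changed at run time (e.g. a sad young female speaker here)
    without retraining or fine-tuning the model.
    """
    age_f = age_enc(torch.tensor([[age]], dtype=torch.float32))
    gender_f = gender_enc(torch.tensor([[gender]], dtype=torch.float32))
    emotion_f = emotion_enc(torch.tensor([list(emotion)], dtype=torch.float32))
    driving_params = model(audio_feat, age_f, gender_f, emotion_f)
    return driving_params  # e.g. 51 blendshape values per frame, fed to a renderer
```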
According to the scheme of the embodiment of the disclosure, the model is generated by utilizing the trained expression, so that the desired mouth style can be dynamically adjusted in real time without retraining or fine-tuning the model; when the target audio data is used for driving the digital person to speak, the speaking mouth style of the digital person can be conveniently controlled by modifying the style attribute parameters, and the controllability of the mouth style effect of the digital person is improved.
In one possible implementation, the preset style attribute parameter includes at least one of an age parameter, a gender parameter, an emotion parameter, and a language parameter.
Fig. 6 is a schematic structural diagram of a training device for expression generating models according to an embodiment of the present disclosure. As shown in fig. 6, the training device of the expression generating model includes:
the generating module 601 is configured to generate a first mouth shape control feature and a first expression driving parameter according to the training sample.
The prediction module 602 is configured to input the audio feature and the first mouth shape control feature of the training sample into the first expression generating model, and predict to obtain the second expression driving parameter.
The loss determination module 603 is configured to obtain a loss function according to the real expression driving parameter, the first expression driving parameter, and the second expression driving parameter of the training sample. And
and the update model 604 is configured to update the first expression generating model according to the loss function to obtain a trained second expression generating model.
In one possible implementation, the generating module 601 includes:
and the rearrangement sub-module is used for reordering the original mouth shape control characteristics and the real expression driving parameters of the plurality of training samples to obtain the rearranged mouth shape control characteristics and the first expression driving parameters of each training sample in the plurality of training samples. The rearrangement pattern control features and the rearrangement expression driving parameters of the same training sample have the same sequencing result.
The feature determination submodule is used for obtaining a first mouth shape control feature according to the original mouth shape control feature and the rearranged mouth shape control feature for each training sample.
In one possible implementation, the feature determination submodule is configured to:
and obtaining the fusion weight of each training sample according to the distribution condition of the reordered plurality of training samples.
And aiming at each training sample, obtaining a first mouth shape control characteristic according to the fusion weight, the original mouth shape control characteristic and the rearranged mouth shape control characteristic.
In one possible implementation, the loss determination module 603 is configured to:
and obtaining a first loss value according to the real expression driving parameters and the second expression driving parameters of the training samples.
And obtaining a second loss value according to the first expression driving parameters and the second expression driving parameters of the training samples.
And obtaining a loss function according to the first loss value, the second loss value and the loss function weight of the training samples.
In one possible implementation, the apparatus further includes:
and the loss function weight determining module is used for determining the loss function weight from the Gaussian distribution taking the fusion weight as the expected Gaussian distribution.
In one possible implementation, the prediction module 602 includes:
And the fusion sub-module is used for gradually fusing the audio features of the training sample and the multiple sub-features contained in the first mouth shape control features by using the first expression generation model so as to obtain fusion features.
And the prediction sub-module is used for predicting and obtaining a second expression driving parameter according to the fusion characteristics.
In one possible implementation, the fusion submodule is configured to:
the audio features of the training samples are input to a first layer of a plurality of fused convolution layers of the first expression generation model.
The multiple sub-features contained in the first mouth shape control feature are input into the multiple fusion convolution layers one by one according to a preset sequence, wherein the plurality of sub-features includes at least one of an age feature, a gender feature, an emotion feature, and a language feature.
And obtaining fusion characteristics according to the output result of the last fusion convolution layer of the first expression generation model.
In one possible implementation, at least one non-fused convolutional layer is provided between two adjacent fused convolutional layers.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
Fig. 7 is a schematic structural diagram of an expression generating apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
the extracting module 701 is configured to extract audio features according to the target audio data.
The feature determining module 702 is configured to obtain a target mouth shape control feature according to a preset style attribute parameter.
The parameter determining module 703 is configured to determine expression driving parameters according to the audio feature and the target mouth shape control feature by using the second expression generating model. The second expression generating model is obtained through training by the training device of the expression generating model.
The image generating module 704 is configured to generate an expression image corresponding to the target audio data according to the expression driving parameter.
In one possible implementation, the preset style attribute parameter includes at least one of an age parameter, a gender parameter, an emotion parameter, and a language parameter.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a training method of expression generation model, a method of expression generation. For example, in some embodiments, the training method of the expression generation model, the method of expression generation, may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the expression generation model, the method of expression generation described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the expression generation model, the method of expression generation, by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various flows shown above may be used with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. A training method of an expression generation model, comprising the following steps:
generating a first mouth shape control feature and a first expression driving parameter according to the training sample;
inputting the audio features of the training sample and the first mouth shape control feature into a first expression generation model to predict second expression driving parameters;
obtaining a loss function according to the real expression driving parameters, the first expression driving parameters and the second expression driving parameters of the training sample; and
updating the first expression generation model according to the loss function to obtain a trained second expression generation model.
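Read as an algorithm, claim 1 describes one supervised update: derive auxiliary targets from the training sample, predict, compare against both the real and the derived targets, and update. The PyTorch sketch below is only an illustrative reading of that step; the tensor names, the batch layout, and the helper functions make_first_features and expression_loss (sketched after claims 3 and 5) are assumptions rather than the patented implementation.

```python
def training_step(model, optimizer, batch, make_first_features, expression_loss):
    """One optimization step of the first expression generation model (claim 1)."""
    # Generate the first mouth shape control features, the first expression
    # driving parameters, and a per-sample fusion weight from the training
    # samples (refined in claims 2-3; sketched after claim 3).
    first_mouth, first_params, fusion_w = make_first_features(
        batch["mouth_shape"], batch["real_params"]
    )

    # Predict the second expression driving parameters from the audio
    # features and the first mouth shape control features.
    second_params = model(batch["audio"], first_mouth)

    # Loss built from the real, first, and second expression driving
    # parameters (refined in claims 4-5; sketched after claim 5).
    loss = expression_loss(second_params, batch["real_params"], first_params, fusion_w)

    # Update the first expression generation model; after training this
    # becomes the "second expression generation model" of claim 1.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```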
2. The method of claim 1, wherein generating the first mouth shape control feature and the first expression driving parameter from the training sample comprises:
reordering the original mouth shape control features and the real expression driving parameters of a plurality of training samples to obtain reordered mouth shape control features and first expression driving parameters of each training sample in the plurality of training samples; wherein, for a same training sample, the reordered mouth shape control features and the reordered expression driving parameters follow the same ordering result;
and for each training sample, obtaining the first mouth shape control feature according to the original mouth shape control features and the reordered mouth shape control features.
3. The method of claim 2, wherein for each training sample, deriving a first mouth shape control feature from the original mouth shape control feature and the reordered mouth shape control feature comprises:
obtaining a fusion weight for each training sample according to the distribution of the reordered plurality of training samples;
and for each training sample, obtaining the first mouth shape control feature according to the fusion weight, the original mouth shape control features and the reordered mouth shape control features.
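Claims 2 and 3 amount to a shuffle-and-fuse augmentation over a batch of training samples. Below is a minimal sketch assuming a mixup-style convex combination with a Beta-distributed fusion weight; the claims only tie the weight to the distribution of the reordered samples, so the Beta choice is purely a placeholder.

```python
import torch


def make_first_features(mouth_shape, real_params, alpha=0.2):
    """Shuffle-and-fuse step of claims 2-3 (an illustrative mixup-style reading).

    mouth_shape: (B, D_m) original mouth shape control features
    real_params: (B, D_e) real expression driving parameters
    Returns the first mouth shape control features, the first expression
    driving parameters, and the per-sample fusion weight.
    """
    batch_size = mouth_shape.size(0)

    # A single permutation shared by both tensors, so that for the same
    # training sample the reordered mouth shape control features and the
    # reordered expression driving parameters follow the same ordering
    # result (claim 2).
    perm = torch.randperm(batch_size)
    reordered_mouth = mouth_shape[perm]
    first_params = real_params[perm]

    # Per-sample fusion weight (claim 3); Beta(alpha, alpha), a common
    # mixup choice, stands in for the distribution-dependent weight.
    w = torch.distributions.Beta(alpha, alpha).sample((batch_size, 1))

    # First mouth shape control features as a convex combination of the
    # original and the reordered features.
    first_mouth = w * mouth_shape + (1.0 - w) * reordered_mouth
    return first_mouth, first_params, w
```

Sharing one permutation across both tensors is what keeps the reordered mouth shape control features and the first expression driving parameters aligned per sample, which is the condition stated in claim 2.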
4. The method of claim 3, wherein deriving a loss function from the real expression driving parameters, the first expression driving parameters, and the second expression driving parameters of the training sample comprises:
obtaining a first loss value according to the real expression driving parameters and the second expression driving parameters of the training samples;
obtaining a second loss value according to the first expression driving parameters and the second expression driving parameters of the training samples;
and obtaining a loss function according to the first loss value, the second loss value and the loss function weight of the training samples.
5. The method of claim 4, further comprising:
determining the loss function weight from a Gaussian distribution having the fusion weight as its expectation.
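Claims 4 and 5 combine two loss values under a weight drawn around the fusion weight. A minimal sketch, assuming per-sample L1 distances, a convex combination of the two loss values, frame-level tensors of shape (B, D), and a small fixed standard deviation for the Gaussian of claim 5:

```python
import torch


def expression_loss(second_params, real_params, first_params, fusion_w, sigma=0.1):
    """Loss of claims 4-5. The L1 distances, sigma, the clamp to [0, 1], and
    the convex combination of the two loss values are all assumptions."""
    # First loss value: second (predicted) vs. real expression driving
    # parameters (claim 4).
    first_loss = (second_params - real_params).abs().mean(dim=-1, keepdim=True)

    # Second loss value: second (predicted) vs. first expression driving
    # parameters (claim 4).
    second_loss = (second_params - first_params).abs().mean(dim=-1, keepdim=True)

    # Loss function weight drawn from a Gaussian whose expectation is the
    # per-sample fusion weight (claim 5).
    weight = (fusion_w + sigma * torch.randn_like(fusion_w)).clamp(0.0, 1.0)

    return (weight * first_loss + (1.0 - weight) * second_loss).mean()
```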
6. The method of any one of claims 1 to 5, wherein inputting the audio features of the training sample and the first mouth shape control feature into the first expression generation model to predict the second expression driving parameters comprises:
progressively fusing, by using the first expression generation model, the audio features of the training sample with a plurality of sub-features contained in the first mouth shape control feature to obtain fusion features;
and predicting the second expression driving parameters according to the fusion features.
7. The method of claim 6, wherein progressively fusing, by using the first expression generation model, the audio features of the training sample with the plurality of sub-features contained in the first mouth shape control feature to obtain the fusion features comprises:
inputting the audio features of the training sample to the first layer of a plurality of fusion convolution layers of the first expression generation model;
inputting the plurality of sub-features contained in the first mouth shape control feature into the fusion convolution layers one by one according to a preset sequence; wherein the plurality of sub-features includes at least one of age features, gender features, emotion features, and language features;
and obtaining the fusion features according to the output of the last fusion convolution layer of the first expression generation model.
8. The method of claim 7, wherein at least one non-fusion convolution layer is provided between two adjacent fusion convolution layers.
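Claims 6 to 8 describe the backbone as a stack of fusion convolution layers, each absorbing one sub-feature of the mouth shape control feature, separated by plain (non-fusion) convolution layers. The module below is an illustrative sketch of that topology; the 1-D convolutions, channel sizes, ReLU activations, and linear output head are all assumptions, and the mouth shape control feature is treated here as a list of its sub-features.

```python
import torch
from torch import nn


class ProgressiveFusionGenerator(nn.Module):
    """Progressive fusion backbone in the spirit of claims 6-8 (a sketch only)."""

    def __init__(self, audio_dim=128, sub_dims=(8, 8, 8, 8), hidden=128, param_dim=52):
        super().__init__()
        self.fusion_layers = nn.ModuleList()
        self.plain_layers = nn.ModuleList()
        in_dim = audio_dim
        for sub_dim in sub_dims:
            # Fusion convolution layer: mixes the running features with one
            # sub-feature broadcast over time (claim 7).
            self.fusion_layers.append(
                nn.Conv1d(in_dim + sub_dim, hidden, kernel_size=3, padding=1)
            )
            # Non-fusion convolution layer between adjacent fusion layers (claim 8).
            self.plain_layers.append(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
            )
            in_dim = hidden
        self.head = nn.Linear(hidden, param_dim)

    def forward(self, audio, sub_features):
        """audio: (B, audio_dim, T); sub_features: list of (B, sub_dim) tensors
        in the preset order age, gender, emotion, language."""
        x = audio
        for fuse, plain, sub in zip(self.fusion_layers, self.plain_layers, sub_features):
            sub_t = sub.unsqueeze(-1).expand(-1, -1, x.size(-1))
            x = torch.relu(fuse(torch.cat([x, sub_t], dim=1)))
            x = torch.relu(plain(x))
        # The features after the last fusion convolution layer (and its plain
        # follower, an extra assumption of this sketch) serve as the fusion
        # features used to predict the second expression driving parameters.
        return self.head(x.transpose(1, 2))
```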
9. A method of expression generation, comprising:
extracting audio features according to the target audio data;
obtaining target mouth shape control features according to preset style attribute parameters;
determining expression driving parameters according to the audio features and the target mouth shape control features by using a second expression generation model, wherein the second expression generation model is trained according to the method of any one of claims 1 to 8; and
generating an expression image corresponding to the target audio data according to the expression driving parameters.
10. The method of claim 9, wherein the preset style attribute parameters include at least one of an age parameter, a gender parameter, an emotion parameter, a language parameter.
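At inference time (claims 9 and 10), the preset style attribute parameters are converted into target mouth shape control features and combined with the audio features by the trained second expression generation model. A hypothetical end-to-end sketch, in which the audio feature extractor, the attribute embedding tables, and the renderer that turns driving parameters into expression images are all assumed components:

```python
import torch


@torch.no_grad()
def generate_expression(model, audio_features, style, embed_tables, renderer):
    """Inference path of claims 9-10; every helper name here is an assumption.

    audio_features: (1, audio_dim, T) features extracted from the target
                    audio data, e.g. by a pretrained speech encoder.
    style:          preset style attribute parameters, e.g.
                    {"age": 1, "gender": 0, "emotion": 2, "language": 0}.
    embed_tables:   dict mapping each attribute name to an nn.Embedding
                    that turns the parameter into a sub-feature.
    renderer:       callable turning expression driving parameters (e.g.
                    blendshape coefficients) into expression images.
    """
    # Target mouth shape control features assembled from the preset style
    # attribute parameters, in the preset order of claim 7.
    sub_features = [
        embed_tables[name](torch.tensor([style[name]]))
        for name in ("age", "gender", "emotion", "language")
    ]

    # Expression driving parameters from the trained second expression
    # generation model.
    driving_params = model(audio_features, sub_features)

    # Expression images corresponding to the target audio data.
    return renderer(driving_params)
```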
11. A training device of an expression generation model, comprising:
the generation module is used for generating a first mouth shape control feature and a first expression driving parameter according to the training sample;
the prediction module is used for inputting the audio features of the training sample and the first mouth shape control feature into a first expression generation model to predict second expression driving parameters;
the loss determination module is used for obtaining a loss function according to the real expression driving parameters, the first expression driving parameters and the second expression driving parameters of the training sample; and
the model updating module is used for updating the first expression generation model according to the loss function to obtain a trained second expression generation model.
12. The apparatus of claim 11, wherein the generation module comprises:
a reordering sub-module, configured to reorder the original mouth shape control features and the real expression driving parameters of a plurality of training samples to obtain reordered mouth shape control features and first expression driving parameters of each training sample in the plurality of training samples; wherein, for a same training sample, the reordered mouth shape control features and the reordered expression driving parameters follow the same ordering result;
and a feature determination submodule, configured to obtain, for each training sample, the first mouth shape control feature according to the original mouth shape control features and the reordered mouth shape control features.
13. The apparatus of claim 12, wherein the feature determination submodule is to:
obtaining a fusion weight for each training sample according to the distribution of the reordered plurality of training samples;
and for each training sample, obtaining the first mouth shape control feature according to the fusion weight, the original mouth shape control features and the reordered mouth shape control features.
14. The apparatus of claim 13, wherein the loss determination module is to:
obtaining a first loss value according to the real expression driving parameters and the second expression driving parameters of the training samples;
obtaining a second loss value according to the first expression driving parameters and the second expression driving parameters of the training samples;
and obtaining a loss function according to the first loss value, the second loss value and the loss function weight of the training samples.
15. The apparatus of claim 14, further comprising:
the loss function weight determining module is used for determining the loss function weight from a Gaussian distribution having the fusion weight as its expectation.
16. The apparatus of any of claims 11 to 15, wherein the prediction module comprises:
the fusion sub-module is used for progressively fusing, by using a first expression generation model, the audio features of the training sample with the plurality of sub-features contained in the first mouth shape control feature, so as to obtain fusion features;
and the prediction sub-module is used for predicting the second expression driving parameters according to the fusion features.
17. The apparatus of claim 16, wherein the fusion sub-module is to:
inputting the audio features of the training sample to the first layer of a plurality of fusion convolution layers of the first expression generation model;
inputting the plurality of sub-features contained in the first mouth shape control feature into the fusion convolution layers one by one according to a preset sequence; wherein the plurality of sub-features includes at least one of age features, gender features, emotion features, and language features;
and obtaining the fusion features according to the output of the last fusion convolution layer of the first expression generation model.
18. The apparatus of claim 17, wherein at least one non-fusion convolution layer is disposed between two adjacent fusion convolution layers.
19. An expression generating apparatus comprising:
the extraction module is used for extracting audio features according to the target audio data;
the feature determining module is used for obtaining target mouth shape control features according to preset style attribute parameters;
the parameter determining module is used for determining expression driving parameters according to the audio features and the target mouth shape control features by using a second expression generation model, wherein the second expression generation model is trained by the apparatus of any one of claims 11 to 18; and
the image generation module is used for generating an expression image corresponding to the target audio data according to the expression driving parameters.
20. The apparatus of claim 19, wherein the preset style attribute parameters include at least one of an age parameter, a gender parameter, an emotion parameter, a language parameter.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-10.
CN202310723506.4A 2023-06-16 2023-06-16 Training method of expression generation model, and method and device for expression generation Active CN116468826B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310723506.4A CN116468826B (en) 2023-06-16 2023-06-16 Training method of expression generation model, and method and device for expression generation

Publications (2)

Publication Number Publication Date
CN116468826A true CN116468826A (en) 2023-07-21
CN116468826B CN116468826B (en) 2023-10-27

Family

ID=87182878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310723506.4A Active CN116468826B (en) 2023-06-16 2023-06-16 Training method of expression generation model, and method and device for expression generation

Country Status (1)

Country Link
CN (1) CN116468826B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145322A (en) * 2019-12-26 2020-05-12 上海浦东发展银行股份有限公司 Method, apparatus and computer-readable storage medium for driving avatar
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN112634413A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for generating model and generating 3D animation
WO2022141894A1 (en) * 2020-12-31 2022-07-07 苏州源想理念文化发展有限公司 Three-dimensional feature emotion analysis method capable of fusing expression and limb motion
WO2023011221A1 (en) * 2021-08-06 2023-02-09 南京硅基智能科技有限公司 Blend shape value output method, storage medium and electronic apparatus
CN116091660A (en) * 2021-11-03 2023-05-09 华为技术有限公司 Virtual expression generation method and device
CN115906987A (en) * 2022-12-22 2023-04-04 北京百度网讯科技有限公司 Deep learning model training method, virtual image driving method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274450A (en) * 2023-11-21 2023-12-22 长春职业技术学院 Animation image generation system and method based on artificial intelligence
CN117274450B (en) * 2023-11-21 2024-01-26 长春职业技术学院 Animation image generation system and method based on artificial intelligence

Also Published As

Publication number Publication date
CN116468826B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
US9959657B2 (en) Computer generated head
CN111243626A (en) Speaking video generation method and system
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
WO2023284435A1 (en) Method and apparatus for generating animation
WO2020211820A1 (en) Method and device for speech emotion recognition
CN114895817B (en) Interactive information processing method, network model training method and device
CN113421547B (en) Voice processing method and related equipment
CN116468826B (en) Training method of expression generation model, and method and device for expression generation
CN115222856B (en) Expression animation generation method and electronic equipment
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
CN113633983B (en) Virtual character expression control method and device, electronic equipment and medium
CN116993876B (en) Method, device, electronic equipment and storage medium for generating digital human image
CN114429767A (en) Video generation method and device, electronic equipment and storage medium
CN115050354A (en) Digital human driving method and device
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN113035198B (en) Three-dimensional face lip movement control method, equipment and medium
CN112910761A (en) Instant messaging method, device, equipment, storage medium and program product
EP4152269B1 (en) Method and apparatus of training model, device, and medium
CN115376487A (en) Control method of digital human, model training method and device
CN117011435B (en) Digital human image AI generation method and device
CN117857892B (en) Data processing method, device, electronic equipment, computer program product and computer readable storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant