CN116206621B - Method and device for training mouth-shaped driving model, electronic equipment and storage medium - Google Patents

Method and device for training mouth-shaped driving model, electronic equipment and storage medium

Info

Publication number
CN116206621B
CN116206621B (application CN202310492252.XA)
Authority
CN
China
Prior art keywords
audio
learning network
matrix
time sequence
audio frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310492252.XA
Other languages
Chinese (zh)
Other versions
CN116206621A (en)
Inventor
杜宗财
范锡睿
赵亚飞
张世昌
郭紫垣
王志强
陈毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310492252.XA
Publication of CN116206621A
Application granted
Publication of CN116206621B

Classifications

    • G10L 25/24 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 25/57 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination, for processing of video signals
    • Y02T 10/40 — Climate change mitigation technologies related to transportation; road transport of goods or passengers; internal combustion engine [ICE] based vehicles; engine management systems

Abstract

The disclosure provides a method, a device, electronic equipment and a storage medium for training a mouth shape driving model, relates to the technical field of computers, and particularly relates to the fields of artificial intelligence, voice technology and digital human technology. The specific implementation scheme is as follows: extracting features of the sample audio stream to obtain time sequence features corresponding to each audio frame; respectively inputting the time sequence features of a continuous preset number of audio frames into a main learning network and an auxiliary learning network to obtain a first mouth shape driving parameter output by the main learning network and a second mouth shape driving parameter output by the auxiliary learning network; calculating a first loss function value, a second loss function value and a third loss function value; and training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as the mouth shape driving model. By applying the embodiments of the disclosure, the flexibility of the mouth shape of the three-dimensional face model can be improved.

Description

Method and device for training mouth-shaped driving model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to the fields of artificial intelligence, speech technology, and digital human technology.
Background
Audio-driven face mouth shape technology refers to inputting the audio features of an audio stream into a pre-trained deep learning network to obtain mouth shape driving parameters output by the deep learning network, and then driving the mouth shape of a three-dimensional face model based on the mouth shape driving parameters, so that the mouth shape of the three-dimensional face model is consistent with the audio stream.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, and storage medium for training a mouth shape driving model.
According to a first aspect of the present disclosure, there is provided a method of training a mouth shape driving model, comprising:
extracting features of the sample audio stream to obtain time sequence features corresponding to each audio frame;
respectively inputting the time sequence characteristics of a continuous preset number of audio frames into a main learning network and an auxiliary learning network, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network; the main learning network comprises a time sequence self-attention module;
calculating a first loss function value based on a first difference and a second difference, calculating a second loss function value based on the second difference, and calculating a third loss function value based on a difference between first mouth shape driving parameters corresponding to adjacent audio frames, wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame, and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame;
And training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as a mouth shape driving model.
According to a second aspect of the present disclosure, there is provided a mouth-shaped driving model training device comprising:
the extraction module is used for extracting the characteristics of the sample audio stream to obtain the time sequence characteristics corresponding to each audio frame;
the acquisition module is used for respectively inputting the time sequence characteristics of the continuous preset number of audio frames into the main learning network and the auxiliary learning network, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network; the main learning network comprises a time sequence self-attention module;
a calculation module, configured to calculate a first loss function value based on a first difference and a second difference, calculate a second loss function value based on the second difference, and calculate a third loss function value based on a difference between first mouth shape driving parameters corresponding to adjacent audio frames, wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame, and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame;
And the training module is used for training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as a mouth shape driving model.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any one of the first aspects.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the first aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a method for training a mouth shape driving model according to an embodiment of the disclosure;
fig. 2 is a schematic structural diagram of an assisted learning network according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a master learning network according to an embodiment of the present disclosure;
fig. 4 is an exemplary schematic diagram of a mouth shape driving model training process provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a training device for a mouth shape driving model according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a method of training a mouth shape driving model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, when training a deep learning network that outputs mouth shape driving parameters, a time sequence consistency loss function is used to adjust the parameters of the deep learning network in order to reduce jitter, so that the prediction results of the deep learning network for adjacent audio frames become closer; however, this makes the mouth shape of the three-dimensional face model overly smooth, sacrificing flexibility.
In addition, a method of mixed training with multiple mouth shape styles may currently be adopted. For example, when different speakers read the letter "a", some open their mouths wider and some less; the result predicted by the deep learning model is then an average of the mouth shapes of multiple speakers, so a three-dimensional face model driven according to this average result has a smoothed mouth shape with insufficient flexibility.
In view of the foregoing, a description is given below of a method for training a mouth shape driving model according to an embodiment of the present disclosure.
Referring to fig. 1, fig. 1 is a flow chart of a method for training a mouth shape driving model, which is provided by an embodiment of the present disclosure, and is applied to an electronic device, where the electronic device may be a terminal, a server, or other devices capable of supporting training a deep learning model, and the method includes:
S101, extracting features of the sample audio stream to obtain time sequence features corresponding to each audio frame.
The sample audio stream is an audio stream of a speaker acquired in advance, namely, the sample audio stream comprises human voice. One sample audio stream includes a plurality of audio frames, for example: if the timing length of each audio frame is specified to be 10ms, an 800ms sample audio stream may include 80 audio frames.
In addition, feature extraction may be performed on multiple sample audio streams, such as: if 128 sample audio streams are collected in advance, feature extraction can be sequentially performed on each sample audio stream to obtain time sequence features corresponding to each audio frame included in each sample audio stream.
S102, inputting the time sequence characteristics of a continuous preset number of audio frames into a main learning network and an auxiliary learning network respectively, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network.
The main learning network comprises a time sequence self-attention module, the auxiliary learning network does not comprise the time sequence self-attention module, and the time sequence self-attention module is used for processing time sequence characteristics based on a self-attention mechanism.
The first mouth shape driving parameter output by the main learning network and the second mouth shape driving parameter output by the auxiliary learning network can be three-dimensional vertex coordinates of a human face, and can also be a blend shape driving parameter.
In addition, the value of the preset number may be set empirically, for example, to 3 or 5. As an example, if the preset number is 3, the electronic device may input the time sequence features of 3 consecutive audio frames into the main learning network and the auxiliary learning network respectively, and then the main learning network outputs 3 first mouth shape driving parameters and the auxiliary learning network outputs 3 second mouth shape driving parameters.
Taking a preset number of values of 3 as an example, in one implementation, the time sequence features of the continuous 3 audio frames included in one sample audio stream may be input into the main learning network and the auxiliary learning network respectively. For example: each sample audio stream comprises 80 audio frames, firstly, for a first sample audio stream, the time sequence characteristics from the 0 th audio frame to the 2 nd audio frame in the sample audio stream are input into a main learning network and an auxiliary learning network, and then the time sequence characteristics from the 3 rd audio frame to the 5 th audio frame in the sample audio stream are input into the main learning network and the auxiliary learning network. I.e., the timing characteristics of 1*3 audio frames at a time are input into the primary and secondary learning networks. And so on until the timing characteristics of all audio frames in the sample audio stream are input into the primary and secondary learning networks. And repeating the process for the second sample audio stream until the time sequence characteristics of the audio frames in all the sample audio streams are input into the main learning network and the auxiliary learning network.
In another implementation, the timing characteristics of consecutive 3 audio frames included in each sample audio stream included in one batch may be simultaneously input into the primary learning network and the secondary learning network, respectively. For example: each batch may include 128 sample audio streams, each sample audio stream includes 80 audio frames, and the time sequence features from the 0 th audio frame to the 2 nd audio frame in each sample audio stream may be respectively input into the main learning network and the auxiliary learning network, and then the time sequence features from the 3 rd audio frame to the 5 th audio frame in each sample audio stream may be respectively input into the main learning network and the auxiliary learning network, i.e. the number of the time sequence features each time that are respectively input into the main learning network and the auxiliary learning network is 128×3. And so on, until the time sequence characteristics of all audio frames in the sample audio stream of the batch are respectively input into the main learning network and the auxiliary learning network. Processing of each sample audio stream included in the next batch may then continue until the master learning network converges.
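For illustration, a minimal sketch of the batching scheme described above is given below; the tensor shapes and the network objects main_net and aux_net are assumptions used only to show the slicing, not part of the patent text:

```python
import torch

batch_size, num_frames, timing_dim, feat_dim = 128, 80, 20, 392
features = torch.randn(batch_size, num_frames, timing_dim, feat_dim)   # timing features of one batch

preset_num = 3                                           # the preset number of consecutive audio frames
for start in range(0, num_frames, preset_num):
    chunk = features[:, start:start + preset_num]        # (128, 3, 20, 392)
    chunk = chunk.reshape(-1, 1, timing_dim, feat_dim)   # 128*3 timing features per step
    # first_params = main_net(chunk)                     # first mouth shape driving parameters
    # second_params = aux_net(chunk)                     # second mouth shape driving parameters
```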
S103, calculating a first loss function value based on the first difference and the second difference, calculating a second loss function value based on the second difference, and calculating a third loss function value based on the difference between the first mouth shape driving parameters corresponding to the adjacent audio frames.
Wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame, and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame.
The first loss function may be a contrast learning loss function, the second loss function may be a reconstruction loss function, and the third loss function may be a timing consistency loss function, with the formulation of the contrast learning loss function, the reconstruction loss function, and the timing consistency loss function being detailed in the following embodiments.
And S104, training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as a mouth shape driving model.
The first loss function value, the second loss function value and the third loss function value may be weighted and summed to obtain a total loss function value, and then the main learning network may be trained based on the total loss function value.
In one implementation, the total loss function value may be obtained by the following formula:

$L_{total} = a \cdot L_{2} + b \cdot L_{3} + c \cdot L_{1}$

wherein $L_{total}$ represents the total loss function value, $L_{2}$ represents the second loss function value, $L_{3}$ represents the third loss function value, and $L_{1}$ represents the first loss function value; $a$ is the weight value corresponding to $L_{2}$, $b$ is the weight value corresponding to $L_{3}$, and $c$ is the weight value corresponding to $L_{1}$. The weight values $a$, $b$ and $c$ may be empirical values obtained from experiments; as an example, $a$ is 1, $b$ is 0.5 and $c$ is 0.1.
In addition, after training is completed on the main learning network, the trained main learning network can be used as a mouth shape driving model to be applied online. In the process of online application, extracting time sequence characteristics of an audio stream to be predicted, inputting the time sequence characteristics into a mouth shape driving model, and driving a three-dimensional face model based on mouth shape driving parameters output by the mouth shape driving model.
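For illustration, a minimal sketch of the weighted total loss described above follows; the weights use the example values in the text, while the function name and the way the individual loss values are obtained are assumptions:

```python
import torch

def total_loss(l_rec: torch.Tensor, l_tc: torch.Tensor, l_cl: torch.Tensor,
               a: float = 1.0, b: float = 0.5, c: float = 0.1) -> torch.Tensor:
    """Weighted sum of the second (reconstruction), third (time sequence
    consistency) and first (contrastive learning) loss function values."""
    return a * l_rec + b * l_tc + c * l_cl

# e.g. total = total_loss(l_rec, l_tc, l_cl); total.backward() to train the main learning network
```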
By adopting the above technical scheme, since the auxiliary learning network does not include a time sequence self-attention module, the mouth shape of the three-dimensional face model is smoothed when it is driven based on the second mouth shape driving parameters. The main learning network includes a time sequence self-attention module, which can process the time sequence features based on a time sequence self-attention mechanism, so that when the main learning network predicts the mouth shape driving parameters of each audio frame it can pay more attention to the time sequence features that are important for that audio frame, and the problem of an over-smoothed mouth shape when the first mouth shape driving parameters act on the three-dimensional face model can be avoided.
In addition, the embodiment of the disclosure further introduces a first loss function value, and when the first loss function value is calculated, the difference between the first mouth shape driving parameter corresponding to each audio frame and the second mouth shape driving parameter and the difference between the first mouth shape driving parameter corresponding to each audio frame and the label mouth shape parameter are considered, so that the first mouth shape driving parameter is far away from the second mouth shape driving parameter and is close to the label mouth shape parameter, the mouth shape corresponding to the mouth shape driving parameter predicted by the main learning network after training can be prevented from being too smooth, and the flexibility is higher. And the second loss function value and the third loss function value are considered in the training process, so that the accuracy of the mouth shape driving parameters obtained by the prediction of the mouth shape driving model is ensured, the smoothness of the mouth shape obtained by the prediction is reduced, and the flexibility of the mouth shape is further improved.
In one embodiment of the present disclosure, the corresponding timing characteristics of each frame of audio may be obtained in the following manner.
Inputting the sample audio stream into a feature extraction model to obtain the audio feature of each audio frame included in the sample audio stream; and for each audio frame, splicing the audio features of the first number of audio frames before the audio frame, the audio features of the audio frame and the audio features of the second number of audio frames after the audio frame to obtain the corresponding time sequence features of the audio frame.
For example: if the duration of one frame of audio is 10ms, the sample audio stream with the time sequence length of 800ms can comprise 80 audio frames, and then the sample audio stream with the time sequence length of 800ms is input into the feature extraction model, so that the audio features of 80 audio frames can be obtained in total.
The feature extraction may be performed on the sample audio stream by a WAV2vec (WAV to Vector) model, a mel-frequency cepstral coefficient, or the like.
In one implementation, by way of example, the corresponding timing characteristics for each audio frame may be generated by means of a sliding window, such as: the window size of the sliding window may be set to 20 frames, a first number is defined as 9 frames, a second number is defined as 10 frames, and the first number plus the second number plus 1 is equal to the size of the sliding window.
Taking the 15th audio frame in the sample audio stream as an example, the audio features of the 9 audio frames before the 15th audio frame are acquired, the audio features of the 10 audio frames after the 15th audio frame are acquired, and the acquired audio features of the 9 preceding audio frames, the audio features of the 15th audio frame and the acquired audio features of the 10 following audio frames are spliced to obtain the time sequence feature corresponding to the 15th audio frame.
It will be appreciated that, for an audio frame at the beginning of the sample audio stream, the number of audio frames before it may be less than the first number, and for an audio frame at the end of the sample audio stream, the number of audio frames after it may be less than the second number; in such cases, the audio features of the missing audio frames may be padded with 0.
Assuming the audio frames are numbered from 0 and taking the 0th audio frame as an example, the 9 audio frames before the 0th audio frame do not exist, so their audio features can be padded with 0; the audio features of the 10 audio frames after the 0th audio frame are acquired, and the 9 zero-padded audio features, the audio features of the 0th audio frame and the acquired 10 audio features are spliced to obtain the time sequence feature corresponding to the 0th audio frame.
Similarly, taking the last audio frame as an example, the 10 audio frames after the last audio frame do not exist and their audio features can be padded with 0; the audio features of the 9 audio frames before the last audio frame are acquired, and the acquired 9 audio features, the audio features of the last audio frame and the 10 zero-padded audio features are spliced to obtain the time sequence feature corresponding to the last audio frame.
The time sequence feature obtained in this way can be denoted $F_i \in \mathbb{R}^{T_i \times C_i}$, wherein $F_i$ represents the time sequence feature of the i-th audio frame, $T_i$ represents the time sequence dimension of the time sequence feature of the i-th audio frame, whose value is the first number plus the second number plus 1 (for example, if the first number is 9 frames and the second number is 10 frames, the time sequence dimension $T_i$ of the i-th time sequence feature is 20), and $C_i$ represents the feature dimension of the time sequence feature of the i-th audio frame; if the time sequence feature of the i-th audio frame has 392 feature dimensions, $C_i$ is 392.
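As an illustrative sketch of the sliding-window splicing described above: the zero-padding at the boundaries follows the text, while the function name and the array shapes are assumptions; the per-frame features are assumed to come from a wav2vec- or MFCC-style extractor.

```python
import numpy as np

def timing_features(audio_feats: np.ndarray, before: int = 9, after: int = 10) -> np.ndarray:
    """Splice each frame's feature with its `before` preceding and `after`
    following frame features, zero-padding missing neighbours.
    audio_feats: (num_frames, feat_dim) -> (num_frames, before + 1 + after, feat_dim)."""
    num_frames, feat_dim = audio_feats.shape
    padded = np.concatenate([
        np.zeros((before, feat_dim), dtype=audio_feats.dtype),   # pad before frame 0
        audio_feats,
        np.zeros((after, feat_dim), dtype=audio_feats.dtype),    # pad after the last frame
    ])
    window = before + 1 + after
    return np.stack([padded[i:i + window] for i in range(num_frames)])

feats = np.random.randn(80, 392).astype(np.float32)   # e.g. 80 frames of one sample audio stream
seq = timing_features(feats)                          # shape (80, 20, 392)
```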
By adopting the technical scheme, for each audio frame, the corresponding time sequence feature of the audio frame comprises the audio features of the first number of audio frames before the audio frame, the audio features of the audio frame and the audio features of the second number of audio frames after the audio frame, that is to say, the corresponding time sequence feature of the audio frame not only comprises the audio features of the audio frame, but also comprises the audio features adjacent to the audio frame. That is, when the main learning network and the auxiliary learning network predict based on the time sequence characteristics corresponding to the audio frame, not only the audio characteristics of the audio frame are referred, but also the audio characteristics of the audio frames adjacent to the audio frame in the front and the rear are referred, and when the main learning network and the auxiliary learning network predict the first mouth shape driving parameter and the second mouth shape driving parameter based on the time sequence characteristics, the front and rear semantic relations in the sample audio stream are considered, so that the predicted mouth shape driving parameter accords with the front and rear semantics, and the accuracy is improved.
As an example, the auxiliary learning network may be a convolutional neural network, and the auxiliary learning network is configured as shown in fig. 2, and includes 4 modules in total in the auxiliary learning network. It should be noted that fig. 2 is only an example of the auxiliary learning network, and the auxiliary learning network in the embodiment of the disclosure may also be a convolutional neural network with other structures.
In fig. 2, the first module includes a convolution layer whose convolution function is Conv2d (2-dimensional convolution function), whose normalization function is BatchNorm2d (2-dimensional batch normalization function), and whose activation function is ReLU (Rectified Linear Unit, linear rectification function).
As an example, the parameters in the Conv2d function may be: ic=1, oc=128, k= (3, 1), s= (2, 1), where ic represents the number of two-dimensional convolution input channels, oc represents: the number of two-dimensional convolution output channels, which may be the same as the number of acquired sample audio streams, k represents: the convolution kernel size, s, represents the step size. The first module is for mapping time series features input into the auxiliary learning network into a feature space in the auxiliary learning network.
The second module may include a plurality of convolution layers, fig. 2 illustrates only 4 convolution layers by way of example, and embodiments of the present disclosure do not limit the number of convolution layers included in the second module. The convolution function of each convolution layer of the second module is Conv2d function, the normalization function is Batchnorm2d function, and the activation function is ReLU. As an example, the parameters in the Conv2d function include: ic=128, oc=128, k= (3, 1), s= (2, 1), the convolution kernel in the convolution layer in the second module is used to convolve the timing dimension in the timing feature.
The third module may include a plurality of convolution layers, fig. 2 illustrates only 9 convolution layers by way of example, and embodiments of the present disclosure do not limit the number of convolution layers included in the third module. The convolution function of each convolution layer of the third module is Conv2d function, the normalization function is Batchnorm2d function, and the activation function is ReLU. As an example, the parameters in the Conv2d function include: ic=128, oc=128, k= (1, 3), s= (1, 2), the convolution kernel in the convolution layer in the third module is used to convolve the feature dimension in the time sequential feature.
The fourth module includes 1 convolution layer, the convolution function of the convolution layer is a Conv2d function, and the parameters in the Conv2d function include: ic=128, oc=V, k=(1, 1), s=(1, 1), where V represents the dimension of the second mouth shape driving parameter output by the auxiliary learning network, and the fourth module is configured to map the time sequence feature from the feature space to the second mouth shape driving parameter.
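The layer descriptions above can be turned into the following hedged PyTorch sketch of the auxiliary learning network; the padding values, the output dimension V and the class name AuxiliaryNet are assumptions not given in the text:

```python
import torch
import torch.nn as nn

def conv_block(ic, oc, k, s, p):
    # Conv2d + BatchNorm2d + ReLU, as in each convolution layer described above
    return nn.Sequential(nn.Conv2d(ic, oc, k, s, p), nn.BatchNorm2d(oc), nn.ReLU())

class AuxiliaryNet(nn.Module):
    def __init__(self, v_dim: int = 32):  # v_dim: dimension of the driving parameter (assumed)
        super().__init__()
        self.module1 = conv_block(1, 128, (3, 1), (2, 1), (1, 0))  # map into feature space
        self.module2 = nn.Sequential(  # convolve the timing dimension
            *[conv_block(128, 128, (3, 1), (2, 1), (1, 0)) for _ in range(4)])
        self.module3 = nn.Sequential(  # convolve the feature dimension
            *[conv_block(128, 128, (1, 3), (1, 2), (0, 1)) for _ in range(9)])
        self.module4 = nn.Conv2d(128, v_dim, (1, 1), (1, 1))       # map to driving parameters

    def forward(self, x):                # x: (batch, 1, timing_dim, feature_dim)
        x = self.module4(self.module3(self.module2(self.module1(x))))
        return x.flatten(1)              # (batch, v_dim) second mouth shape driving parameters

out = AuxiliaryNet()(torch.randn(3, 1, 20, 392))   # e.g. 3 consecutive audio frames
```

With these assumed paddings, a 20×392 time sequence feature is reduced to 1×1 before the fourth module maps it to the driving parameter.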
Referring to fig. 3, the structure of the main learning network in fig. 3 is substantially the same as that of the auxiliary learning network, except that a specified convolution layer of the main learning network is connected to the time-series self-attention module.
In fig. 3, the first module includes a convolution layer whose convolution function is Conv2d, whose normalization function is BatchNorm2d, and whose activation function is ReLU. As an example, the parameters in the Conv2d function may be: ic=1, oc=128, k=(3, 1), s=(2, 1). The first module is for mapping time sequence features input into the main learning network into a feature space in the main learning network.
The second module may include a plurality of convolution layers, fig. 3 illustrates only 4 convolution layers by way of example, and embodiments of the present disclosure do not limit the number of convolution layers included in the second module. The convolution function of each convolution layer of the second module is Conv2d function, the normalization function is Batchnorm2d function, and the activation function is ReLU. As an example, the parameters in the Conv2d function include: ic=128, oc=128, k= (3, 1), s= (2, 1), the convolution kernel in the convolution layer in the second module is used to convolve the timing dimension in the timing feature.
In addition, each convolution layer in the second module is connected with the time sequence self-attention module, the convolution kernel in each convolution layer in the second module is used for convolving time sequence dimension in time sequence characteristics, and the output result of the convolution layer in the second module is further processed through the time sequence self-attention module, so that the main learning network can pay more attention to important time sequence characteristics.
The third module may include a plurality of convolution layers, fig. 3 illustrates only 9 convolution layers by way of example, and embodiments of the present disclosure do not limit the number of convolution layers included in the third module. The convolution function of each convolution layer of the third module is Conv2d function, the normalization function is Batchnorm2d function, and the activation function is ReLU. As an example, the parameters in the Conv2d function include: ic=128, oc=128, k= (1, 3), s= (1, 2), the convolution kernel in the convolution layer in the third module is used to convolve the feature dimension in the time sequential feature.
The fourth module includes 1 convolution layer, the convolution function of the convolution layer is a Conv2d function, and the parameters in the Conv2d function include: ic=128, oc=V, k=(1, 1), s=(1, 1), where V is the dimension of the first mouth shape driving parameter output by the main learning network, and the fourth module is configured to map the time sequence feature from the feature space to the first mouth shape driving parameter.
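As a hedged sketch of how the main learning network could differ from the auxiliary one, module 2 can interleave each convolution layer with a time sequence self-attention module; the class below takes a factory for such a module (for example the TemporalSelfAttention sketched after the self-attention description further below), and the padding and class names are assumptions:

```python
import torch.nn as nn

def conv_block(ic, oc, k, s, p):
    return nn.Sequential(nn.Conv2d(ic, oc, k, s, p), nn.BatchNorm2d(oc), nn.ReLU())

class MainNetModule2(nn.Module):
    """Module 2 of the main learning network: 4 convolution layers over the
    timing dimension, each followed by a time sequence self-attention module."""
    def __init__(self, make_attention):
        super().__init__()
        blocks = []
        for _ in range(4):
            blocks.append(conv_block(128, 128, (3, 1), (2, 1), (1, 0)))  # designated convolution layer
            blocks.append(make_attention())                               # time sequence self-attention
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):                # x: (batch, 128, timing_dim, feature_dim)
        return self.blocks(x)
```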
In one embodiment of the present disclosure, the electronic device may obtain the first mouth shape driving parameter output by the main learning network in the following manner.
Inputting the time sequence characteristics of the continuous preset number of audio frames into a main learning network, and obtaining the convolution characteristics of the preset number of audio frames output by a specified convolution layer; inputting the convolution characteristics of the preset number of audio frames into a time sequence self-attention module, and acquiring time sequence fusion enhancement characteristics of the preset number of audio frames output after the time sequence self-attention module performs time sequence self-attention processing on the convolution characteristics of the preset number of audio frames; and inputting the time sequence fusion enhancement features of a preset number of audio frames into a subsequent convolution layer of the main learning network, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by an output layer of the main learning network.
The specified convolution layer in the main learning network can process time sequence dimensions in the time sequence features to obtain the convolution features. A plurality of designated convolution layers may be included in the main learning network, each designated convolution layer being connected to one of the time-sequential self-attention modules.
Taking the main learning network shown in fig. 3 as an example, the assigned convolution layers are 4 convolution layers in the second module, and each assigned convolution layer is connected to one time sequence self-attention module. The method comprises the steps of inputting sequential characteristics of a continuous preset number of audio frames into a main learning network, processing the continuous preset number of sequential characteristics by a convolution layer in a first module of the main learning network, and inputting a processing result into a first convolution layer in a second module to obtain the preset number of convolution characteristics. Inputting a preset number of convolution features into a time sequence self-attention module connected with a first convolution layer in a second module to obtain a preset number of first time sequence fusion enhancement features, then processing the first time sequence fusion enhancement features through the second convolution layer in the second module and the time sequence self-attention module connected with the convolution layer to obtain second time sequence fusion enhancement features, processing the second time sequence fusion enhancement features through a third convolution layer in the second module and the time sequence self-attention module connected with the convolution layer to obtain third time sequence fusion enhancement features, and processing the third time sequence fusion enhancement features through a fourth convolution layer in the second module and the time sequence self-attention module connected with the convolution layer to obtain fourth time sequence fusion enhancement features. And finally, processing the fourth time sequence fusion enhancement characteristic through the convolution layers in the third module and the fourth module to acquire the first mouth shape driving parameters.
By adopting the technical scheme, the appointed convolution layer in the main learning network is connected with the time sequence self-attention module, so that after the convolution characteristic of the audio frame is acquired, the convolution characteristic of the audio frame can be input into the time sequence self-attention module connected with the appointed convolution layer, and then the time sequence self-attention processing can be carried out on the convolution characteristic of the audio frame based on the time sequence self-attention module, so as to obtain the time sequence fusion enhancement characteristic. The main learning network can accurately learn the time sequence characteristics which are more important to the audio frame by processing the time sequence fusion enhancement characteristics, so that the first mouth shape driving parameters can be generated more accurately, and mouth shape smoothness is avoided.
In one embodiment of the present disclosure, a time-series self-attention process may be performed on a convolution feature based on a time-series self-attention module by:
step 1, for the convolution characteristic of each audio frame, respectively inputting the convolution characteristic into three convolution layers of a time sequence self-attention module to obtain a query matrix, a key matrix and a value matrix corresponding to the audio frame.
Specifically, the convolution characteristics of an audio frame can be respectively used as the input of three convolution layers, and after the first convolution layer processes the convolution characteristics, a query matrix is output; after the second convolution layer processes the convolution characteristics, outputting a key matrix; the third convolution layer processes the convolution characteristics and outputs a matrix of values.
As an example, each of the three convolution layers in the sequential self-attention module may be a 1×1 convolution layer.
The convolution feature can be denoted $F_i \in \mathbb{R}^{B \times C_i \times T_i \times D_i}$, wherein $B$ is the number of convolution features input in one batch into a convolution layer of the time sequence self-attention module, $C_i$ represents the number of feature maps of the convolution feature of the i-th audio frame, $T_i$ represents the time sequence dimension of the convolution feature of the i-th audio frame, whose value is the first number plus the second number plus 1, and $D_i$ represents the feature dimension of the convolution feature of the i-th audio frame.
Taking a preset number of 3 as an example, if the convolution features of 3 consecutive audio frames included in each sample audio stream of one batch are simultaneously input into the three convolution layers of the time sequence self-attention module, and each batch includes, for example, 128 sample audio streams, then the number of convolution features input each time into the three convolution layers of the time sequence self-attention module is 128×3, that is, the value of $B$ is 128×3.
Alternatively, if the convolution features of 3 consecutive audio frames included in one audio stream are respectively input into the three convolution layers of the time sequence self-attention module, the number of convolution features input each time into the three convolution layers is 3×1, that is, the value of $B$ is 3×1.
For the convolution feature of each audio frame, the convolution feature can be respectively input into three convolution layers of size 1×1 to obtain a query matrix $Q_i$, a key matrix $K_i$ and a value matrix $V_i$, wherein $Q_i$ represents the query matrix corresponding to the i-th audio frame, $K_i$ represents the key matrix corresponding to the i-th audio frame, and $V_i$ represents the value matrix corresponding to the i-th audio frame.
And 2, performing dimension transformation on the query matrix, the key matrix and the value matrix to obtain a transformed query matrix, a transformed key matrix and a transformed value matrix.
The dimensions of the query matrix, the key matrix and the value matrix follow from the 1×1 convolution layers above. After the dimension-reduction operation is carried out on the query matrix, the key matrix and the value matrix, a transformed query matrix $Q'_i$, a transformed key matrix $K'_i$ and a transformed value matrix $V'_i$ are obtained.
Step 3, calculating a self-attention matrix based on the transformed query matrix, the transformed key matrix and the offset matrix; and multiplying the transformed value matrix by the self-attention matrix and performing dimension transformation to obtain the time sequence fusion enhancement feature of the audio feature.
In one implementation, the self-attention matrix may be calculated by the following formula:

$A = \mathrm{softmax}\left(Q'_i \, {K'_i}^{\top} + M_i\right)$

wherein $A$ represents the self-attention matrix, $\mathrm{softmax}$ is the activation function, $Q'_i$ represents the transformed query matrix corresponding to the i-th audio frame, ${K'_i}^{\top}$ represents the transpose of the transformed key matrix corresponding to the i-th audio frame, and $M_i$ represents the offset matrix corresponding to the i-th audio frame.
In one embodiment of the present disclosure, the offset matrix $M_i$ corresponding to the i-th audio frame may be calculated elementwise, wherein $M_{j,k}$ represents the value in row $j$, column $k$ of the offset matrix, $j$ is the row index and $k$ is the column index of an element in the offset matrix, and the maximum number of rows and the maximum number of columns are both $N$, whose value is the sum of the first number and the second number plus 1; the closer the row index $j$ and the column index $k$ are, the larger the value of $M_{j,k}$ is.
For example, the value of $N$ may be 20; for convenience in describing the generation process of the offset matrix, taking $N$ as 3, a 3×3 offset matrix $M$ can be generated.
The matrix $M$ is used to represent the degree of association between two audio features: the further apart two audio features are, the smaller the degree of association between them.
For each audio frame, the time sequence feature of the audio frame includes not only the audio feature of that audio frame but also the audio features of the audio frames adjacent to it. As can be seen from the calculation of the offset matrix, if another audio feature in the time sequence feature of the audio frame is adjacent to the audio feature of that audio frame, the value at the corresponding row and column of the offset matrix is larger, so that when the time sequence fusion enhancement feature is calculated after the offset matrix is added, more attention is paid to the audio features adjacent to the audio feature, and the first mouth shape driving parameters can thus be generated more accurately.
After computing the self-attention matrix, the time sequence fusion enhancement feature can be computed by the following formula:

$\tilde{F}_i = A \cdot V'_i$

wherein $A$ represents the self-attention matrix, $V'_i$ represents the transformed value matrix corresponding to the i-th audio frame, and $\tilde{F}_i$ represents the time sequence fusion enhancement feature corresponding to the i-th audio frame before dimension transformation. A dimension transformation is then performed on $\tilde{F}_i$ to obtain the time sequence fusion enhancement feature corresponding to the i-th audio frame after dimension transformation.
then fuse the timing with the enhanced featuresThe output result of the third module is input into the convolution layer of the fourth module after being processed by 9 convolution layers in the third module, and the fourth module in the main learning network outputs the first mouth shape driving parameters.
When calculating the self-attention matrix, the embodiment of the present disclosure makes use of the offset matrix, so that after the transformed value matrix is multiplied by the self-attention matrix and a dimension transformation is performed, the time sequence fusion enhancement feature of the audio feature is obtained. Therefore, when the first mouth shape driving parameter corresponding to a certain audio frame is predicted based on the main learning network, the audio features relatively adjacent to that audio frame, that is, the audio features relatively important to it, are used rather than all the audio features of the sample audio stream, so that the predicted first mouth shape driving parameter can be more accurate.
In one embodiment of the present disclosure, referring to fig. 4, which is an exemplary schematic diagram of a mouth shape driving model training process provided in an embodiment of the present disclosure, the process for training the mouth shape driving model is as follows:
the sample audio stream is divided into a plurality of audio frames of fixed timing length. As an example: an audio stream of samples of a time sequence length of 800ms may be divided into 80 audio frames of a time sequence length of 10 ms.
The obtained audio frames are input into a feature extraction model, wherein the feature extraction model can be a Mel frequency cepstrum coefficient, a wav2vec model and the like, and the corresponding time sequence feature of each audio frame is obtained.
The time sequence characteristics are respectively input into a main learning network in the main branch and an auxiliary learning network in the auxiliary branch, the main learning network outputs a first mouth shape driving parameter, and the auxiliary learning network outputs a second mouth shape driving parameter.
The first loss function value, the second loss function value and the third loss function value are calculated, the main learning network is trained through back propagation, and after training of the main learning network is completed, the trained main learning network is taken as the mouth shape driving model, which is then deployed online.
The contrastive learning loss function value $L_{1}$ may be calculated based on the Euclidean distances $\lVert P_i - G_i \rVert_2$ and $\lVert P_i - \hat{P}_i \rVert_2$, accumulated over the Batch time sequence features input in one batch into the main learning network and the auxiliary learning network and over the $V$ dimensions of the first and second mouth shape driving parameters, wherein $P_i$ represents the first mouth shape driving parameter corresponding to the i-th audio frame, $\hat{P}_i$ represents the second mouth shape driving parameter corresponding to the i-th audio frame, and $G_i$ represents the label mouth shape parameter corresponding to the i-th audio frame. $\lVert P_i - G_i \rVert_2$ is the Euclidean distance between the first mouth shape driving parameter corresponding to the i-th audio frame and the label mouth shape parameter: the larger its value, the larger the difference between the first mouth shape driving parameter and the label mouth shape parameter for that frame, and the smaller its value, the smaller the difference. Similarly, $\lVert P_i - \hat{P}_i \rVert_2$ is the Euclidean distance between the first and second mouth shape driving parameters corresponding to the i-th audio frame: the larger its value, the larger the difference between the first and second mouth shape driving parameters, and the smaller its value, the smaller the difference. The contrastive learning loss is constructed from these two distances so that it decreases as the first distance becomes smaller and the second distance becomes larger.
The reconstruction loss function value may be calculated based on the following formula:

$L_{2} = \frac{1}{\text{Batch}} \sum_{i} \lVert P_i - G_i \rVert_2$

wherein $L_{2}$ represents the reconstruction loss function value and $\lVert P_i - G_i \rVert_2$ is the Euclidean distance between the first mouth shape driving parameter corresponding to the i-th audio frame and the label mouth shape parameter, with $P_i$ representing the first mouth shape driving parameter corresponding to the i-th audio frame and $G_i$ representing the label mouth shape parameter corresponding to the i-th audio frame. The larger the value of $\lVert P_i - G_i \rVert_2$, the larger the difference between the first mouth shape driving parameter and the label mouth shape parameter for that frame; the smaller the value, the smaller the difference.
The timing consistency loss function value may be calculated based on the following formula:

$L_{3} = \sum_{i} \lVert P_{i+1} - P_i \rVert_2$

wherein $L_{3}$ represents the timing consistency loss function value and $\lVert P_{i+1} - P_i \rVert_2$ is the Euclidean distance between the first mouth shape driving parameter corresponding to the i-th audio frame and the first mouth shape driving parameter corresponding to the (i+1)-th audio frame, with $P_i$ representing the first mouth shape driving parameter corresponding to the i-th audio frame and $P_{i+1}$ representing the first mouth shape driving parameter corresponding to the (i+1)-th audio frame. $L_{3}$ is therefore used to accumulate the Euclidean distances between the first mouth shape driving parameters corresponding to adjacent audio frames: the larger its value, the larger the difference between the first mouth shape driving parameters corresponding to adjacent audio frames; the smaller its value, the smaller the difference.
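For illustration only, the three loss descriptions above can be sketched as follows; the reconstruction and timing-consistency terms follow the Euclidean-distance descriptions directly, while the way the two distances are combined in the contrastive term (a simple ratio here) and all function names are assumptions:

```python
import torch

def reconstruction_loss(first: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # mean Euclidean distance between first driving parameters and label parameters
    return (first - label).norm(dim=-1).mean()

def timing_consistency_loss(first: torch.Tensor) -> torch.Tensor:
    # mean Euclidean distance between first driving parameters of adjacent frames
    return (first[1:] - first[:-1]).norm(dim=-1).mean()

def contrastive_loss(first: torch.Tensor, second: torch.Tensor,
                     label: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    d_label = (first - label).norm(dim=-1)    # should become small (close to the label)
    d_aux = (first - second).norm(dim=-1)     # should become large (far from the auxiliary output)
    return (d_label / (d_label + d_aux + eps)).mean()

first = torch.randn(80, 32, requires_grad=True)       # first driving parameters, V = 32 (assumed)
second, label = torch.randn(80, 32), torch.randn(80, 32)
total = reconstruction_loss(first, label) \
        + 0.5 * timing_consistency_loss(first) \
        + 0.1 * contrastive_loss(first, second, label)
total.backward()
```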
By adopting the technical scheme, the first mouth shape driving parameter is far away from the second mouth shape driving parameter and is close to the label mouth shape parameter by calculating the first loss function value, so that when the three-dimensional face model is driven based on the first mouth shape driving parameter, the mouth shape is more accurate when the three-dimensional face model is speaking, the smoothness of the mouth shape of the three-dimensional face model can be reduced, and the flexibility of the mouth shape of the three-dimensional face model is improved.
Based on the same concept, fig. 5 is a schematic structural diagram of a training device for a mouth shape driving model according to an embodiment of the present disclosure, including:
the extracting module 501 is configured to perform feature extraction on the sample audio stream to obtain a time sequence feature corresponding to each audio frame;
the acquisition module 502 is configured to input time sequence features of a continuous preset number of audio frames into a main learning network and an auxiliary learning network respectively, and acquire a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network; the main learning network comprises a time sequence self-attention module;
a calculating module 503, configured to calculate a first loss function value based on a first difference and a second difference, calculate a second loss function value based on the second difference, and calculate a third loss function value based on a difference between first mouth shape driving parameters corresponding to adjacent audio frames, wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame, and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame;
A training module 504, configured to train the master learning network based on the first loss function value, the second loss function value, and the third loss function value, and use the trained master learning network as a mouth shape driving model.
Optionally, the extracting module 501 is specifically configured to:
inputting the sample audio stream into a feature extraction model to obtain the audio feature of each audio frame included in the sample audio stream;
and for each audio frame, splicing the audio features of the first number of audio frames before the audio frame, the audio features of the audio frame and the audio features of the second number of audio frames after the audio frame to obtain the corresponding time sequence features of the audio frame.
Optionally, the specified convolution layer of the main learning network is connected with a time sequence self-attention module; the obtaining module 502 includes:
the first acquisition submodule is used for inputting the time sequence characteristics of the continuous preset number of audio frames into the main learning network and acquiring the convolution characteristics of the preset number of audio frames output by the appointed convolution layer;
the second acquisition sub-module is used for inputting the convolution characteristics of the preset number of audio frames into the time sequence self-attention module, acquiring the time sequence fusion enhancement characteristics of the preset number of audio frames output after the time sequence self-attention module carries out time sequence self-attention processing on the convolution characteristics of the preset number of audio frames;
A third obtaining sub-module, configured to input the sequential fusion enhancement features of the preset number of audio frames into a subsequent convolution layer of the main learning network, and obtain a first mouth shape driving parameter corresponding to each audio frame output by an output layer of the main learning network;
and the fourth acquisition sub-module is used for inputting the time sequence characteristics of the continuous preset number of audio frames into the auxiliary learning network and acquiring the second mouth shape driving parameters of each audio frame output by the auxiliary learning network.
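A minimal sketch of how such a main learning network might be organized follows, assuming 1-D convolutions over the frame axis; the layer widths, kernel sizes, and class name MainLearningNetwork are illustrative assumptions. The temporal self-attention block is any module with a matching (batch, channels, frames) interface; the auxiliary learning network could reuse the same convolutional backbone without that block.

```python
import torch
import torch.nn as nn

class MainLearningNetwork(nn.Module):
    """Sketch: convolution layers with a temporal self-attention block
    inserted after a designated convolution layer (sizes are assumptions)."""

    def __init__(self, feat_dim, param_dim, attn):
        super().__init__()
        self.front = nn.Sequential(nn.Conv1d(feat_dim, 256, 3, padding=1), nn.ReLU())
        self.attn = attn                               # temporal self-attention module
        self.back = nn.Sequential(nn.Conv1d(256, 128, 3, padding=1), nn.ReLU())
        self.out = nn.Conv1d(128, param_dim, 1)        # output layer -> driving params

    def forward(self, x):                  # x: (B, feat_dim, T) time sequence features
        h = self.front(x)                  # convolution features of the frames
        h = self.attn(h)                   # time sequence fusion enhancement features
        return self.out(self.back(h))      # (B, param_dim, T) driving parameters
```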
Optionally, the second obtaining submodule is specifically configured to:
for the convolution characteristics of each audio frame, respectively inputting the convolution characteristics into three convolution layers of the time sequence self-attention module to obtain a query matrix, a key matrix and a value matrix corresponding to the audio frame;
performing dimension transformation on the query matrix, the key matrix and the value matrix to obtain a transformed query matrix, a transformed key matrix and a transformed value matrix;
calculating a self-attention matrix based on the transformed query matrix, the transformed key matrix, and an offset matrix;
and multiplying the transformation value matrix with the self-attention matrix, and performing dimension transformation to obtain the time sequence fusion enhancement characteristic of the audio frame.
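The four steps above can be sketched as follows. This is a minimal sketch under common assumptions: 1x1 convolutions for the three projections, a scaled dot product, and a softmax over the frame axis; the softmax and scaling are standard choices, not taken from the text. In this reading, the "dimension transformation" corresponds to the transposes between the (channels, frames) and (frames, channels) layouts.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Sketch of the time sequence self-attention described above."""

    def __init__(self, channels, offset):
        super().__init__()
        self.q = nn.Conv1d(channels, channels, 1)   # query projection
        self.k = nn.Conv1d(channels, channels, 1)   # key projection
        self.v = nn.Conv1d(channels, channels, 1)   # value projection
        self.register_buffer("offset", offset)      # offset matrix B, shape (T, T)

    def forward(self, x):                 # x: (B, C, T) convolution features
        q = self.q(x).transpose(1, 2)     # dimension transform -> (B, T, C)
        k = self.k(x).transpose(1, 2)
        v = self.v(x).transpose(1, 2)
        scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5 + self.offset
        attn = scores.softmax(dim=-1)     # self-attention matrix
        out = attn @ v                    # multiply with the value matrix
        return out.transpose(1, 2)        # transform back to (B, C, T)
```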
Optionally, the self-attention matrix is calculated based on the transformed query matrix, the transpose of the transformed key matrix, and the offset matrix;
wherein B(j, k) denotes the value in the j-th row and k-th column of the offset matrix, j is the row number of an element in the offset matrix, k is the column number of an element in the offset matrix, and the quantity used in the definition of the offset matrix takes the value of the sum of the first number and the second number plus 1.
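The exact definition of the offset matrix is not recoverable from the text above. Purely as an illustration, the sketch below assumes that the quantity first_number + second_number + 1 acts as a temporal window length, with entries outside the window masked out; both the masking form and the hypothetical helper name make_offset_matrix are assumptions.

```python
import torch

def make_offset_matrix(num_frames, first_num, second_num, fill=float("-inf")):
    """Hypothetical offset matrix: positions farther apart than the window
    (first_num + second_num + 1) are masked out; one plausible choice only."""
    window = first_num + second_num + 1
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs()     # |j - k| for every entry
    offset = torch.zeros(num_frames, num_frames)
    offset[dist >= window] = fill                  # B(j, k) outside the window
    return offset
```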
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
It should be noted that the sample audio stream in this embodiment is derived from a public data set.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the mouth shape driving model training method. For example, in some embodiments, the mouth shape driving model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the mouth shape driving model training method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the mouth shape driving model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method for training a mouth shape driving model, comprising:
extracting features of the sample audio stream to obtain time sequence features corresponding to each audio frame;
respectively inputting the time sequence characteristics of a continuous preset number of audio frames into a main learning network and an auxiliary learning network, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network; the main learning network comprises a time sequence self-attention module;
calculating a first loss function value based on a first difference and a second difference, calculating a second loss function value based on the second difference, and calculating a third loss function value based on a difference between first mouth shape driving parameters corresponding to adjacent audio frames, wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame; and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame;
and training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as a mouth shape driving model.
2. The method of claim 1, wherein the feature extraction of the sample audio stream to obtain the corresponding timing feature of each audio frame comprises:
inputting the sample audio stream into a feature extraction model to obtain the audio feature of each audio frame included in the sample audio stream;
and for each audio frame, splicing the audio features of the first number of audio frames before the audio frame, the audio features of the audio frame and the audio features of the second number of audio frames after the audio frame to obtain the corresponding time sequence features of the audio frame.
3. The method of claim 2, wherein a designated convolution layer of the main learning network is connected to the time sequence self-attention module;
the method includes respectively inputting the time sequence characteristics of a continuous preset number of audio frames into a main learning network and an auxiliary learning network, obtaining a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network, and comprises the following steps:
inputting the time sequence characteristics of the continuous preset number of audio frames into the main learning network, and acquiring the convolution characteristics of the preset number of audio frames output by the appointed convolution layer;
inputting the convolution characteristics of the preset number of audio frames into the time sequence self-attention module, and acquiring the time sequence fusion enhancement characteristics of the preset number of audio frames output after the time sequence self-attention module carries out time sequence self-attention processing on the convolution characteristics of the preset number of audio frames;
inputting the time sequence fusion enhancement features of the preset number of audio frames into a subsequent convolution layer of the main learning network, and obtaining a first mouth shape driving parameter corresponding to each audio frame output by an output layer of the main learning network;
And inputting the time sequence characteristics of the continuous preset number of audio frames into the auxiliary learning network, and obtaining second mouth shape driving parameters of each audio frame output by the auxiliary learning network.
4. The method of claim 3, wherein the inputting the convolution features of the preset number of audio frames into the time sequence self-attention module and obtaining the time sequence fusion enhancement features of the preset number of audio frames output by the time sequence self-attention module after performing time sequence self-attention processing on the convolution features of the preset number of audio frames comprises:
for the convolution characteristics of each audio frame, respectively inputting the convolution characteristics into three convolution layers of the time sequence self-attention module to obtain a query matrix, a key matrix and a value matrix corresponding to the audio frame;
performing dimension transformation on the query matrix, the key matrix and the value matrix to obtain a transformed query matrix, a transformed key matrix and a transformed value matrix;
calculating a self-attention matrix based on the transformed query matrix, the transformed key matrix, and an offset matrix;
and multiplying the transformation value matrix with the self-attention matrix, and performing dimension transformation to obtain the time sequence fusion enhancement characteristic of the audio frame.
5. The method of claim 4, wherein the self-attention matrix is calculated based on the transformed query matrix, the transpose of the transformed key matrix, and the offset matrix;
wherein B(j, k) denotes the value in the j-th row and k-th column of the offset matrix, j is the row number of an element in the offset matrix, k is the column number of an element in the offset matrix, and the quantity used in the definition of the offset matrix takes the value of the sum of the first number and the second number plus 1.
6. A mouth shape driving model training device, comprising:
the extraction module is used for extracting the characteristics of the sample audio stream to obtain the time sequence characteristics corresponding to each audio frame;
the acquisition module is used for respectively inputting the time sequence characteristics of the continuous preset number of audio frames into the main learning network and the auxiliary learning network, and acquiring a first mouth shape driving parameter corresponding to each audio frame output by the main learning network and a second mouth shape driving parameter corresponding to each audio frame output by the auxiliary learning network; the main learning network comprises a time sequence self-attention module;
a calculation module, configured to calculate a first loss function value based on a first difference and a second difference, calculate a second loss function value based on the second difference, and calculate a third loss function value based on a difference between first mouth shape driving parameters corresponding to adjacent audio frames, wherein the first difference is: the difference between the first mouth shape driving parameter and the second mouth shape driving parameter corresponding to each audio frame; and the second difference is: the difference between the first mouth shape driving parameter and the label mouth shape parameter corresponding to each audio frame;
And the training module is used for training the main learning network based on the first loss function value, the second loss function value and the third loss function value, and taking the trained main learning network as a mouth shape driving model.
7. The apparatus of claim 6, wherein the extraction module is specifically configured to:
inputting the sample audio stream into a feature extraction model to obtain the audio feature of each audio frame included in the sample audio stream;
and for each audio frame, splicing the audio features of the first number of audio frames before the audio frame, the audio features of the audio frame and the audio features of the second number of audio frames after the audio frame to obtain the corresponding time sequence features of the audio frame.
8. The apparatus of claim 7, wherein a designated convolution layer of the main learning network is connected with the time sequence self-attention module; the acquisition module comprises:
the first acquisition submodule is used for inputting the time sequence characteristics of the continuous preset number of audio frames into the main learning network and acquiring the convolution characteristics of the preset number of audio frames output by the appointed convolution layer;
the second acquisition sub-module is used for inputting the convolution characteristics of the preset number of audio frames into the time sequence self-attention module, acquiring the time sequence fusion enhancement characteristics of the preset number of audio frames output after the time sequence self-attention module carries out time sequence self-attention processing on the convolution characteristics of the preset number of audio frames;
a third obtaining sub-module, configured to input the time sequence fusion enhancement features of the preset number of audio frames into a subsequent convolution layer of the main learning network, and obtain a first mouth shape driving parameter corresponding to each audio frame output by an output layer of the main learning network;
and the fourth acquisition sub-module is used for inputting the time sequence characteristics of the continuous preset number of audio frames into the auxiliary learning network and acquiring the second mouth shape driving parameters of each audio frame output by the auxiliary learning network.
9. The apparatus of claim 8, wherein the second acquisition sub-module is specifically configured to:
for the convolution characteristics of each audio frame, respectively inputting the convolution characteristics into three convolution layers of the time sequence self-attention module to obtain a query matrix, a key matrix and a value matrix corresponding to the audio frame;
performing dimension transformation on the query matrix, the key matrix and the value matrix to obtain a transformed query matrix, a transformed key matrix and a transformed value matrix;
calculating a self-attention matrix based on the transformed query matrix, the transformed key matrix, and an offset matrix;
and multiplying the transformation value matrix with the self-attention matrix, and performing dimension transformation to obtain the time sequence fusion enhancement characteristic of the audio frame.
10. The apparatus of claim 9, wherein the self-attention matrix is calculated based on the transformed query matrix, the transpose of the transformed key matrix, and the offset matrix;
wherein B(j, k) denotes the value in the j-th row and k-th column of the offset matrix, j is the row number of an element in the offset matrix, k is the column number of an element in the offset matrix, and the quantity used in the definition of the offset matrix takes the value of the sum of the first number and the second number plus 1.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202310492252.XA 2023-05-04 2023-05-04 Method and device for training mouth-shaped driving model, electronic equipment and storage medium Active CN116206621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310492252.XA CN116206621B (en) 2023-05-04 2023-05-04 Method and device for training mouth-shaped driving model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310492252.XA CN116206621B (en) 2023-05-04 2023-05-04 Method and device for training mouth-shaped driving model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116206621A CN116206621A (en) 2023-06-02
CN116206621B true CN116206621B (en) 2023-07-25

Family

ID=86517649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310492252.XA Active CN116206621B (en) 2023-05-04 2023-05-04 Method and device for training mouth-shaped driving model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116206621B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863046B (en) * 2023-07-07 2024-03-19 广东明星创意动画有限公司 Virtual mouth shape generation method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device
CN115035604A (en) * 2022-08-10 2022-09-09 南京硅基智能科技有限公司 Audio-driven character mouth shape method, model and training method thereof
CN115101090A (en) * 2022-05-17 2022-09-23 科大讯飞股份有限公司 Voice content detection method, model training method and related device
CN115691544A (en) * 2022-10-31 2023-02-03 广州方硅信息技术有限公司 Training of virtual image mouth shape driving model and driving method, device and equipment thereof
CN115938352A (en) * 2022-10-20 2023-04-07 网易(杭州)网络有限公司 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699705B2 (en) * 2018-06-22 2020-06-30 Adobe Inc. Using machine-learning models to determine movements of a mouth corresponding to live speech
US20220215830A1 (en) * 2021-01-02 2022-07-07 International Institute Of Information Technology, Hyderabad System and method for lip-syncing a face to target speech using a machine learning model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device
CN115101090A (en) * 2022-05-17 2022-09-23 科大讯飞股份有限公司 Voice content detection method, model training method and related device
CN115035604A (en) * 2022-08-10 2022-09-09 南京硅基智能科技有限公司 Audio-driven character mouth shape method, model and training method thereof
CN115938352A (en) * 2022-10-20 2023-04-07 网易(杭州)网络有限公司 Model obtaining method, mouth shape coefficient generating device, mouth shape coefficient generating equipment and mouth shape coefficient generating medium
CN115691544A (en) * 2022-10-31 2023-02-03 广州方硅信息技术有限公司 Training of virtual image mouth shape driving model and driving method, device and equipment thereof

Also Published As

Publication number Publication date
CN116206621A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
EP3926623A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
WO2019201042A1 (en) Image object recognition method and device, storage medium, and electronic device
CN110444203B (en) Voice recognition method and device and electronic equipment
CN111243579B (en) Time domain single-channel multi-speaker voice recognition method and system
CN116206621B (en) Method and device for training mouth-shaped driving model, electronic equipment and storage medium
CN109558605B (en) Method and device for translating sentences
CN114612290B (en) Training method of image editing model and image editing method
CN113326852A (en) Model training method, device, equipment, storage medium and program product
CN115631261B (en) Training method of image generation model, image generation method and device
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN114693934B (en) Training method of semantic segmentation model, video semantic segmentation method and device
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN116778040B (en) Face image generation method based on mouth shape, training method and device of model
CN112507104A (en) Dialog system acquisition method, apparatus, storage medium and computer program product
CN109710939B (en) Method and device for determining theme
CN115292467B (en) Information processing and model training method, device, equipment, medium and program product
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN116452741B (en) Object reconstruction method, object reconstruction model training method, device and equipment
CN115482422B (en) Training method of deep learning model, image processing method and device
CN114841274B (en) Language model training method and device, electronic equipment and storage medium
US20230122373A1 (en) Method for training depth estimation model, electronic device, and storage medium
CN115641482A (en) Method and device for training image processing model and image processing
CN116486309A (en) Time action positioning method and device based on long memory transducer
CN117194696A (en) Content generation method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant