CN116013354A - Training method of deep learning model and method for controlling mouth shape change of virtual image - Google Patents

Training method of deep learning model and method for controlling mouth shape change of virtual image

Info

Publication number
CN116013354A
CN116013354A
Authority
CN
China
Prior art keywords
audio data
features
length
deep learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310306535.0A
Other languages
Chinese (zh)
Other versions
CN116013354B (en)
Inventor
杜宗财
范锡睿
赵亚飞
张世昌
郭紫垣
王志强
陈毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310306535.0A
Publication of CN116013354A
Application granted
Publication of CN116013354B
Legal status: Active
Anticipated expiration

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a training method of a deep learning model, and relates to the technical field of artificial intelligence, in particular to the technical fields of virtual digital humans, augmented reality, virtual reality, mixed reality, extended reality, the metaverse, and the like. The specific implementation scheme is as follows: determining audio data of random length from initial sample audio data of a specified length as valid data, and masking the audio data other than the valid data in the initial sample audio data to obtain target sample audio data; extracting features of the target sample audio data; inputting the features of the target sample audio data into a deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data; determining a loss of the deep learning model according to the output mouth shape parameters; and adjusting parameters of the deep learning model according to the loss. The present disclosure also provides a method, an apparatus, an electronic device, and a storage medium for controlling a change in an avatar mouth shape.

Description

Training method of deep learning model and method for controlling mouth shape change of virtual image
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of virtual digital humans, augmented reality, virtual reality, mixed reality, extended reality, the metaverse, and the like. More particularly, the present disclosure provides a training method of a deep learning model, a method of controlling an avatar mouth shape change, an apparatus, an electronic device, and a storage medium.
Background
With the rapid development of technologies such as the internet, augmented reality (AR), virtual reality (VR), and the metaverse, avatars are being applied ever more widely in live streaming, virtual social networking, entertainment media, and the like.
Disclosure of Invention
The present disclosure provides a training method of a deep learning model, a method of controlling a change in an avatar mouth shape, an apparatus, an electronic device, and a storage medium.
According to a first aspect, there is provided a training method of a deep learning model, the method comprising: determining audio data of random length from initial sample audio data of a specified length as valid data, and masking the audio data other than the valid data in the initial sample audio data to obtain target sample audio data; extracting features of the target sample audio data; inputting the features of the target sample audio data into a deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data; determining a loss of the deep learning model according to the output mouth shape parameters; and adjusting parameters of the deep learning model according to the loss.
According to a second aspect, there is provided a method of controlling a change in an avatar mouth shape, the method comprising: extracting features of audio data to be processed; inputting the features of the audio data to be processed into a deep learning model to obtain mouth shape parameters corresponding to the audio data to be processed; and controlling the mouth shape of the avatar to change according to the mouth shape parameters; wherein the deep learning model is trained according to the training method of the deep learning model described above.
According to a third aspect, there is provided a training apparatus of a deep learning model, the apparatus comprising: a sample processing module, used for determining audio data of random length from initial sample audio data of a specified length as valid data and masking the audio data other than the valid data in the initial sample audio data to obtain target sample audio data; a first feature extraction module, used for extracting features of the target sample audio data; a sample input module, used for inputting the features of the target sample audio data into the deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data; a first loss determination module, used for determining a loss of the deep learning model according to the output mouth shape parameters; and a first parameter adjustment module, used for adjusting parameters of the deep learning model according to the loss.
According to a fourth aspect, there is provided an apparatus for controlling a change in an avatar mouth shape, the apparatus comprising: a third feature extraction module, used for extracting features of audio data to be processed; an audio input module, used for inputting the features of the audio data to be processed into the deep learning model to obtain mouth shape parameters corresponding to the audio data to be processed; and a control module, used for controlling the mouth shape of the avatar to change according to the mouth shape parameters; wherein the deep learning model is trained by the training apparatus of the deep learning model described above.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a method of audio-driving an avatar in the related art;
FIG. 2 is a flowchart of a training method of a deep learning model according to one embodiment of the present disclosure;
FIG. 3A is a schematic diagram of processing initial sample data into target sample audio data according to one embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a training method of a deep learning model according to one embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method of a deep learning model according to another embodiment of the present disclosure;
FIG. 5A is a schematic diagram of processing an initial feature into a target feature according to one embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a training method of a deep learning model according to another embodiment of the present disclosure;
FIG. 6 is a flowchart of a method of controlling an avatar mouth shape change according to one embodiment of the present disclosure;
FIG. 7 is a block diagram of a training apparatus of a deep learning model according to one embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus for controlling an avatar mouth shape change according to one embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for at least one of a training method of a deep learning model and a method of controlling an avatar mouth shape change according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The avatar may be a three-dimensional virtual digital human, including a virtual anchor, a virtual customer service agent, a virtual idol, a virtual pilot, and the like. Avatars have a wide range of application scenarios and business needs, and audio-driven avatar generation has become one of the important research hotspots in the field of virtual digital human interaction.
Audio-driven avatar generation aims to produce speaking mouth shape information for the avatar that conforms to the input audio, thereby controlling how the avatar's mouth shape changes and enabling the generation of avatar animations and videos.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
The audio-driven three-dimensional avatar may be implemented based on a deep learning model. For example, audio data and the mouth shape parameters corresponding to the audio data are taken as samples to train the deep learning model, so that the model gains the ability to generate the corresponding mouth shape parameters for input audio; the mouth shape of the avatar is then driven to change based on the mouth shape parameters output by the model, thereby realizing avatar generation.
Fig. 1 is a schematic view of a method of audio driving an avatar in the related art.
As shown in fig. 1, the fixed-length audio data may be obtained by pre-processing the audio data to be processed. For example, a fixed-length time window (e.g., 520 ms or 1 s) may be slid over the audio data to be processed at a fixed stride (e.g., 20 ms) to sequentially extract fixed-length audio data. The center of each fixed-length segment corresponds approximately to the audio signal of one word.
The fixed-length audio data is input into the feature extraction model 110 to obtain audio features. The feature extraction model 110 may be a pre-trained deep-learning-based model, such as a Wav2Vec (audio-to-vector) model. It may also be a conventional feature extraction model, such as an MFCC (Mel-Frequency Cepstral Coefficients) feature extraction model.
The audio features are input into a deep learning model 120, which maps the audio features to mouth shape parameters. The mouth shape parameters characterize the mouth shape corresponding to the fixed-length audio data. Inputting the mouth shape parameters into the avatar generation model 130 drives the mouth shape of the avatar to change, generating an animation or video of the avatar. The avatar generation model 130 may be a processing module that performs post-processing such as smoothing and rendering.
In the related art, to facilitate model processing, the audio data to be processed is uniformly pre-processed into audio data of a fixed length. However, the duration of the audio signal differs from word to word, so predicting with audio data of the same fixed length for every word yields poor per-word accuracy.
For example, when a fixed-length time window (e.g., 520 ms or 1 s) is slid over the audio data to be processed at a fixed stride (e.g., 20 ms) to sequentially extract fixed-length audio data, every two adjacent fixed-length segments overlap. That is, each fixed-length segment contains information from adjacent audio data, so for the current word, the audio data of the adjacent words can serve as a reference to assist in predicting the mouth shape of the current word.
However, the audio data of adjacent words may also negatively affect the prediction for the current word. For example, when predicting the mouth shape of the word "door" in a phrase such as "open the door", the "door" word is a closed-mouth shape while the adjacent "open" word is an open-mouth shape, so introducing the audio signals of the adjacent words easily makes the predicted mouth shape of the current "door" word inaccurate.
Therefore, the duration of the audio signal differs from word to word, the influence (positive or negative) of the adjacent words' audio signals on the current word's mouth shape prediction differs, and the length of the audio data needed to predict each word's mouth shape parameters differs; predicting with audio data of the same fixed length therefore yields low accuracy.
Fig. 2 is a flow chart of a training method of a deep learning model according to one embodiment of the present disclosure.
As shown in FIG. 2, the training method 200 of the deep learning model includes operations S210-S250.
In operation S210, audio data of random length is determined as valid data from the initial sample audio data of a specified length, and the audio data other than the valid data in the initial sample audio data is masked to obtain target sample audio data.
For example, the initial sample audio data of the specified length is sample audio data of a specified time-series length, which may be a fixed time-series length consistent with the related art (e.g., a specified length of 1 s). In this way, the existing model structure can be reused without modification.
Audio data of a random length not greater than the specified length (e.g., a random length of 500 ms) may be determined as valid data from the initial sample audio data of the specified length, and the audio data other than the valid data in the initial sample audio data may be determined to be invalid data. This changes the length of the valid data in the initial sample audio data from the specified length (e.g., 1 s) to a random length (e.g., 500 ms).
The random length may be obtained by random uniform sampling within a preset length range. For example, the preset length range may be greater than or equal to a minimum length (e.g., 200 ms to 300 ms) and less than or equal to the specified length (1 s), where random uniform sampling means that every length within the range is sampled with equal probability.
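As an illustrative sketch only, not code from the patent, this uniform sampling of the random length could be written in Python as follows; the bounds are assumed values taken from the examples above.

```python
import random

MIN_MS = 300    # assumed minimum valid-data length (the text suggests roughly 200 ms to 300 ms)
SPEC_MS = 1000  # the specified length of the initial sample audio data (1 s)

def sample_valid_length_ms() -> int:
    """Uniformly sample a valid-data length between the minimum and the specified length."""
    return random.randint(MIN_MS, SPEC_MS)  # every length in the range is equally likely
```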
The audio data other than the valid data in the initial sample audio data can be masked out with a mask, yielding the target sample audio data. The target sample audio data has the same length as the initial sample audio data, so it fits the existing model structure and can be input directly, without extra processing caused by a change in length.
In operation S220, features of the target sample audio data are extracted.
For example, the target sample audio data is input into a feature extraction model, which may be a Wav2Vec model or an MFCC feature extraction model. The features of the target sample audio data are obtained through the processing of the feature extraction model.
In operation S230, the features of the target sample audio data are input into the deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data.
The features of the target sample audio data are input into the deep learning model, which maps the audio features to mouth shape parameters, yielding the output mouth shape parameters. The output mouth shape parameters are the mouth shape parameters that the deep learning model predicts for the initial sample audio data.
In operation S240, a loss of the deep learning model is determined according to the output mouth shape parameters.
In operation S250, parameters of the deep learning model are adjusted according to the loss.
The initial sample audio data is also annotated with label mouth shape parameters, which represent the actual mouth shape parameters corresponding to the sample audio data. The difference (e.g., mean square error or cross entropy) between the output mouth shape parameters and the label mouth shape parameters is computed as the loss of the deep learning model, and the parameters of the deep learning model are adjusted according to the loss to obtain an updated deep learning model.
For the next initial sample audio data, the method may return to operation S210, and operations S210 to S250 may be repeated until the model converges, yielding a trained deep learning model that can map features of audio data to mouth shape parameters.
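A minimal PyTorch-style sketch of one iteration of operations S210 to S250 is given below; it is not the patent's implementation. Here `mask_random_length`, `extract_features`, `model`, and `optimizer` are assumed stand-ins for the masking step, the feature extraction model, the deep learning model, and its optimizer, and mean square error is chosen from the differences the text mentions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, initial_audio, label_mouth_params,
               mask_random_length, extract_features):
    # S210: mask the initial sample audio data to obtain target sample audio data.
    target_audio = mask_random_length(initial_audio)
    # S220/S230: extract features and predict the output mouth shape parameters.
    features = extract_features(target_audio)
    output_mouth_params = model(features)
    # S240: the loss is the difference between output and label mouth shape parameters.
    loss = F.mse_loss(output_mouth_params, label_mouth_params)
    # S250: adjust the parameters of the deep learning model according to the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```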
Each repetition of operations S210 to S250 is called an iteration, and each iteration processes the initial sample audio data into target sample audio data containing valid data of a random length. Because the random length in each iteration is obtained by random uniform sampling within the preset length range, multiple iterations produce valid data of many different lengths, so the model can be trained with valid data of various lengths, which improves the generalization capability of the model and the accuracy of its predicted mouth shape parameters.
In addition, the length of the initial sample audio data in this embodiment is consistent with the fixed length in the related art, so the existing model can be reused. The step that processes the initial sample audio data into target sample audio data is inserted before the feature extraction operation and is plug-and-play: it improves model accuracy, and hence the accuracy of the predicted mouth shape parameters, without affecting the existing structure of the model.
The above-mentioned operation S210 includes: determining a window of random length; determining a mask array of the specified length according to the window of random length, wherein the center of the window of random length coincides with the center of the mask array of the specified length, the values of the elements of the mask array inside the window of random length centered on the center of the mask array are a first value, and the values of the elements of the mask array outside the window of random length are a second value; and determining, using the mask array of the specified length, the audio data of random length from the initial sample audio data of the specified length as valid data, and masking the audio data other than the valid data in the initial sample audio data to obtain the target sample audio data.
Fig. 3A is a schematic diagram of processing initial sample data into target sample audio data according to one embodiment of the present disclosure.
As shown in fig. 3A, the initial sample data 301 has the specified length, for example 1 s, and the audio signal may contain 50 time points within that 1 s. The mask array 302 also has the specified length; for example, mask array 302 includes 50 elements.
The random length is, for example, 500 ms, which may correspond to 25 time points of the audio signal, i.e., 25 elements. The center of the window of random length coincides with the center of the mask array 302; the values of the elements of mask array 302 inside the window of random length centered on the center of the mask array (shaded area) are a first value (e.g., 1), and the values of the elements outside the window of random length (blank area) are a second value (e.g., 0).
The initial sample data 301 and the mask array 302 are multiplied element-wise to obtain the target sample audio data 303. In the target sample audio data 303, the audio data within the random-length range centered on the center of the target sample audio data 303 is valid data (shaded), and the audio data other than the valid data is invalid data.
It will be appreciated that the audio data in the central region of the initial sample data 301 is the audio data most relevant to the current word, so determining the valid data centered on the center of the initial sample data 301 avoids losing the audio data most relevant to the current word.
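A minimal NumPy sketch of this masking, using the numbers from fig. 3A (a specified length of 50 time points, a 25-point valid window, a first value of 1 and a second value of 0), might look as follows; the audio array is a random stand-in.

```python
import numpy as np

def make_centered_mask(spec_len: int = 50, valid_len: int = 25) -> np.ndarray:
    """Mask array of the specified length: first value (1) inside a window of the
    random length centered on the array center, second value (0) outside it."""
    mask = np.zeros(spec_len, dtype=np.float32)
    start = (spec_len - valid_len) // 2
    mask[start:start + valid_len] = 1.0
    return mask

initial_sample = np.random.randn(50).astype(np.float32)  # stand-in for 1 s of audio (50 points)
target_sample = initial_sample * make_centered_mask()    # element-wise multiply masks the rest
```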
Fig. 3B is a schematic diagram of a training method of a deep learning model according to one embodiment of the present disclosure.
As shown in fig. 3B, the initial sample audio data is audio data of a specified time-series length (e.g., 1 s). It is masked with a mask array so that the audio data of random length within it is valid data and the audio data other than the valid data is invalid data, yielding the target sample audio data, which is likewise audio data of the specified time-series length.
The target sample audio data is input into the feature extraction model 310 to obtain the audio features of the target sample audio data, and the audio features are input into the deep learning model 320 to obtain output mouth shape parameters corresponding to the initial sample audio data. The difference (cross entropy, mean square error, etc.) between the output mouth shape parameters and the label mouth shape parameters is computed as the loss of the deep learning model 320, and the parameters of the deep learning model 320 are adjusted according to the loss.
The audio data in the central region of the target sample audio data is the audio data most relevant to the current word, so the features of the audio data in the central region may be referred to as local features; the features of the audio data in the regions adjacent to the central region may be referred to as global features. Through multiple iterations, the model can be trained with valid data of various lengths, that is, it can make full use of both local and global features, which enhances the generalization capability of the model and improves the accuracy of its predicted mouth shape parameters.
In addition, the masking of the initial sample audio data in this embodiment is inserted before the feature extraction operation of the feature extraction model 310 and is plug-and-play.
Another implementation of the training method of the deep learning model provided in the present disclosure is described below.
Fig. 4 is a flow chart of a training method of a deep learning model according to another embodiment of the present disclosure.
As shown in FIG. 4, the training method 400 of the deep learning model includes operations S410-S450.
In operation S410, features of the initial sample audio data are extracted, resulting in initial features of a specified size.
For example, this embodiment may directly extract the features of the initial sample audio data, without masking the initial sample audio data, to obtain initial features of a specified size. The initial features may be two-dimensional. One dimension of the specified size corresponds to the specified length of the initial sample audio data: for example, if the specified length is 1 s and the audio signal contains 50 time points, that dimension may be 50, corresponding to the 50 time points within the 1 s. The other dimension of the specified size may be determined by the feature extraction model, for example 128, so the specified size may be 50×128.
In operation S420, features of random size are determined as valid features from the initial features of the specified size, and the features other than the valid features in the initial features are masked to obtain target features.
For example, the initial features of the specified size may be masked so that the features of random size among them are valid features and the features other than the valid features are invalid features.
The random size may be obtained by random uniform sampling within a preset size range. For example, the preset size range may be greater than or equal to a minimum size (e.g., 2×128) and less than or equal to the specified size (e.g., 50×128), where random uniform sampling means that every size within the range is sampled with equal probability.
The features other than the valid features in the initial features can be masked out with a mask, yielding the target features. The target features have the same size as the initial features, so they fit the existing model structure and can be input directly, without extra processing caused by a change in size.
In operation S430, the target features are input into the deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data.
In operation S440, a loss of the deep learning model is determined according to the output mouth shape parameters.
In operation S450, parameters of the deep learning model are adjusted according to the loss.
For example, the target features are input into the deep learning model, which may map the target features to mouth shape parameters to obtain the output mouth shape parameters. The output mouth shape parameters are the mouth shape parameters that the deep learning model predicts for the initial sample audio data.
The difference (e.g., mean square error or cross entropy) between the output mouth shape parameters and the label mouth shape parameters can be used as the loss of the deep learning model, and the parameters of the deep learning model can be adjusted according to the loss to obtain an updated deep learning model. For the next initial sample audio data, the method may return to operation S410, and operations S410 to S450 may be repeated until the model converges, yielding a trained deep learning model that can map features of audio data to mouth shape parameters.
Each repetition of operations S410 to S450 is called an iteration, and each iteration processes the initial features of the initial sample audio data into target features containing valid features of a random size. Because the random size in each iteration is obtained by random uniform sampling within the preset size range, multiple iterations produce valid features of many different sizes, so the model can be trained with valid features of various sizes, which improves the generalization capability of the model and the accuracy of its predicted mouth shape parameters.
The effect achieved by masking the initial feature in this embodiment is identical to the effect achieved by masking the initial sample audio data.
In addition, the length of the initial sample audio data in this embodiment is consistent with the fixed length in the related art, so the existing model can be reused. The step that processes the initial features of the initial sample audio data into target features is inserted between feature extraction and the processing of the deep learning model, so model accuracy, and hence the accuracy of the predicted mouth shape parameters, can be improved without affecting the existing structure of the model.
Operation S420 includes: determining a window of random size; determining a mask matrix of the specified size according to the window of random size, wherein the center of the window of random size coincides with the center of the mask matrix of the specified size, the values of the elements of the mask matrix inside the window of random size centered on the center of the mask matrix are a first value, and the values of the elements of the mask matrix outside the window of random size are a second value; and determining, using the mask matrix of the specified size, the features of random size from the initial features of the specified size as valid features, and masking the features other than the valid features in the initial features to obtain the target features.
Fig. 5A is a schematic diagram of processing an initial feature into a target feature according to one embodiment of the present disclosure.
As shown in fig. 5A, the initial feature 501 is of the specified size (e.g., 50×128), and the mask matrix 502 is also of the specified size (e.g., 50×128). The random size is, for example, 25×128. The center of the window of random size coincides with the center of the mask matrix 502; the values of the elements of mask matrix 502 inside the window of random size centered on the center of the mask matrix (shaded area) are a first value (e.g., 1), and the values of the elements outside the window of random size (blank area) are a second value (e.g., 0).
The initial feature 501 and the mask matrix 502 are multiplied element-wise to obtain the target feature 503. In the target feature 503, the features within the random-size range centered on the center of the target feature 503 are valid features (hatched portion), and the features other than the valid features are invalid features.
It will be appreciated that the features in the central region of the initial feature 501 are the features most relevant to the current word, so determining the valid features centered on the center of the initial feature 501 avoids losing the features most relevant to the current word.
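The feature-level masking of fig. 5A can be sketched in the same way; the assumption here, matching the 25×128 example, is that the random-size window spans all 128 feature channels and is centered along the time dimension.

```python
import numpy as np

def make_centered_feature_mask(spec_size=(50, 128), valid_rows: int = 25) -> np.ndarray:
    """Mask matrix of the specified size: 1 inside a centered window of the random
    size (here valid_rows x all channels), 0 outside it."""
    rows, cols = spec_size
    mask = np.zeros((rows, cols), dtype=np.float32)
    start = (rows - valid_rows) // 2
    mask[start:start + valid_rows, :] = 1.0
    return mask

initial_feature = np.random.randn(50, 128).astype(np.float32)  # stand-in initial features
target_feature = initial_feature * make_centered_feature_mask()
```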
Fig. 5B is a schematic diagram of a training method of a deep learning model according to another embodiment of the present disclosure.
As shown in fig. 5B, the initial sample audio data is audio data of a specified time-series length (e.g., 1 s), and initial features of a specified size (e.g., 50×128) are obtained through the processing of the feature extraction model 510. The initial features are masked with a mask matrix so that the features of random size within them are valid features and the features other than the valid features are invalid features, yielding the target features.
The target features are input into the deep learning model 520 to obtain output mouth shape parameters corresponding to the initial sample audio data. The difference (cross entropy, mean square error, etc.) between the output mouth shape parameters and the label mouth shape parameters is computed as the loss of the deep learning model 520, and the parameters of the deep learning model 520 are adjusted according to the loss.
The features of the central region of the target features are the features most relevant to the current word and may be referred to as local features; the features of the regions adjacent to the central region may be referred to as global features. Through multiple iterations, the model can be trained with valid features of various sizes, that is, it can make full use of both local and global features, which enhances the generalization capability of the model and improves the accuracy of its predicted mouth shape parameters.
Further, the masking of the initial features in this embodiment is interposed between the feature extraction operation of the feature extraction model 510 and the mouth shape prediction of the deep learning model 520, and is plug-and-play.
The determination of the initial sample audio data and the label mouth shape parameters is described below.
The determination of the initial sample audio data and the label mouth shape parameters includes: acquiring, with an acquisition device and at a plurality of acquisition moments, mouth structure information of a target object uttering preset audio, and using the mouth structure information as label mouth shape parameters; and determining, centered on each acquisition moment, preset audio of the specified length uttered by the target object as initial sample audio data.
The sample audio data of the specified length can be obtained by cropping collected preset audio, and the preset audio can be obtained by recording the sound of a speaker. For example, audio is collected from the speaker while the speaker talks, yielding the preset audio, and a time window of the specified length (e.g., 1 s) is used to extract a plurality of audio segments of the specified length from the preset audio as sample audio data. The specified length may be preset; since the audio duration of a word is generally 200 ms to 300 ms, the specified length may be set larger than that, for example to 520 ms or 1 s.
While collecting audio from the speaker, an image acquisition device (e.g., a camera) can simultaneously scan the speaker's face to obtain face structure information. The face structure information may include the number and position information of a plurality of key points (or vertices) of the face, and the mouth structure information among them can serve as mouth shape parameters.
The speaker's mouth shape parameters during speech can be matched to the audio data to form sample pairs, and the mouth shape parameters in a sample pair serve as the label mouth shape parameters of the sample audio data.
For example, suppose the avatar needs to be driven to produce 50 frames of mouth shape motion within 1 s, i.e., 1 s of audio data drives 50 mouth shape frames. The 50 frames then correspond to a plurality of time points within the 1 s: the mouth shape of the first frame corresponds to the 20 ms time point, the mouth shape of the second frame corresponds to the 40 ms time point, and so on. Thus, the mouth shape parameters acquired at the 20 ms time point may serve as the mouth shape parameters of the first frame, the mouth shape parameters acquired at the 40 ms time point may serve as those of the second frame, and so on.
For each time point, a time window of the specified length centered at that time point may be used to extract audio data of the specified length as the sample audio data of that time point. The sample audio data at that time point and the mouth shape parameters at that time point form a sample pair.
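A sketch of this sample-pair construction is shown below, under the assumptions of a 16 kHz sampling rate, one mouth shape frame every 20 ms, and a 1 s window; windows that would extend past the audio boundaries are simply skipped here, though padding would also work.

```python
import numpy as np

def build_sample_pairs(audio: np.ndarray, mouth_params: np.ndarray,
                       sr: int = 16000, frame_ms: int = 20, window_ms: int = 1000):
    """Pair a fixed-length audio window centered at each acquisition time point
    with the mouth shape parameters acquired at that time point."""
    half = int(sr * window_ms / 1000) // 2
    pairs = []
    for i, params in enumerate(mouth_params):          # one frame every frame_ms
        center = int(sr * (i + 1) * frame_ms / 1000)   # time points at 20 ms, 40 ms, ...
        lo, hi = center - half, center + half
        if lo < 0 or hi > len(audio):
            continue                                   # skip windows that fall off the edges
        pairs.append((audio[lo:hi], params))
    return pairs
```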
The method of controlling the mouth shape change of the avatar provided in the present disclosure will be described in detail.
Fig. 6 is a flowchart of a method of controlling an avatar mouth shape change according to one embodiment of the present disclosure.
As shown in fig. 6, the method 600 for controlling the mouth shape change of the avatar includes operations S610 to S630.
In operation S610, features of audio data to be processed are extracted.
In operation S620, the features of the audio data to be processed are input into the deep learning model to obtain the mouth shape parameters corresponding to the audio data to be processed.
In operation S630, the mouth shape of the avatar is controlled to change according to the mouth shape parameters.
For example, the deep learning model may be trained using the training method of the deep learning model described above. The audio data to be processed may have the specified length (e.g., 1 s). Because a deep learning model obtained by the above training method can predict mouth shape parameters with high accuracy for audio of different valid lengths, the audio data to be processed can be used directly at the specified length, without any length adjustment, and mouth shape parameters of high accuracy can still be predicted.
A plurality of to-be-processed audio segments of the specified length are input into the deep learning model in sequence to obtain a mouth shape parameter sequence. Inputting the mouth shape parameter sequence into the avatar driving model drives the mouth shape of the avatar to change, realizing the generation of avatar animation or video.
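A sketch of this inference loop is given below, again assuming a 16 kHz sampling rate, a 1 s window, and a 20 ms stride, with `extract_features` and `model` as stand-ins for the feature extraction model and the trained deep learning model.

```python
import numpy as np

def predict_mouth_sequence(audio: np.ndarray, extract_features, model,
                           sr: int = 16000, stride_ms: int = 20, window_ms: int = 1000):
    """Slide a specified-length window over the audio to be processed and collect
    one set of mouth shape parameters per step."""
    win = int(sr * window_ms / 1000)
    step = int(sr * stride_ms / 1000)
    sequence = []
    for start in range(0, len(audio) - win + 1, step):
        features = extract_features(audio[start:start + win])
        sequence.append(model(features))               # one mouth shape parameter set per window
    return sequence
```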
Because the mouth shape parameters are predicted by the trained deep learning model, the accuracy of mouth shape parameter prediction can be improved, which in turn improves the quality of the generated avatar animation or video.
Fig. 7 is a block diagram of a training apparatus of a deep learning model according to one embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the deep learning model includes a sample processing module 701, a first feature extraction module 702, a sample input module 703, a first loss determination module 704, and a first parameter adjustment module 705.
The sample processing module 701 is used for determining audio data of random length as valid data from initial sample audio data of a specified length, and masking the audio data other than the valid data in the initial sample audio data to obtain target sample audio data.
The first feature extraction module 702 is configured to extract features of the target sample audio data.
The sample input module 703 is used for inputting the features of the target sample audio data into the deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data.
The first loss determination module 704 is used for determining a loss of the deep learning model according to the output mouth shape parameters.
The first parameter adjustment module 705 is used for adjusting parameters of the deep learning model according to the loss.
The sample processing module 701 includes a first window determination unit, a first mask determination unit, and a sample processing unit.
The first window determining unit is used for determining a window with random length.
The first mask determining unit is used for determining a mask array of the specified length according to the window of random length, wherein the center of the window of random length coincides with the center of the mask array of the specified length, the values of the elements of the mask array inside the window of random length centered on the center of the mask array are a first value, and the values of the elements of the mask array outside the window of random length are a second value.
The sample processing unit is used for determining, using the mask array of the specified length, the audio data of random length from the initial sample audio data of the specified length as valid data, and masking the audio data other than the valid data in the initial sample audio data to obtain the target sample audio data.
The training apparatus 700 of the deep learning model further comprises a length determination module.
The length determining module is used for determining a random length from a preset length range, wherein the preset length range is greater than or equal to a minimum length and less than or equal to a specified length.
The training apparatus 700 of the deep learning model further includes a second feature extraction module, a feature processing module, a feature input module, a second loss determination module, and a second parameter adjustment module.
The second feature extraction module is used for extracting features of the initial sample audio data to obtain initial features with specified sizes.
The feature processing module is used for determining features of random size as valid features from the initial features of the specified size, and masking the features other than the valid features in the initial features to obtain target features.
The feature input module is used for inputting the target features into the deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data.
The second loss determination module is used for determining a loss of the deep learning model according to the output mouth shape parameters.
The second parameter adjustment module is used for adjusting parameters of the deep learning model according to the loss.
The feature processing module includes a second window determining unit, a second mask determining unit, and a feature processing unit.
The second window determining unit is used for determining a window of random size.
The second mask determining unit is configured to determine a mask matrix of a specified size according to a window of a random size, where a center of the window of the random size coincides with a center of the mask matrix of the specified size, values of elements in the mask matrix within the window of the random size centered on the center of the mask matrix are a first numerical value, and values of elements in the mask matrix outside the window of the random size are a second numerical value.
The feature processing unit is used for determining, using the mask matrix of the specified size, the features of random size from the initial features of the specified size as valid features, and masking the features other than the valid features in the initial features to obtain the target features.
The training apparatus 700 of the deep learning model further comprises a size determination module.
The size determination module is used for determining the random size from a preset size range, wherein the preset size range is greater than or equal to a minimum size and less than or equal to the specified size.
The training apparatus 700 of the deep learning model further includes a label determination module and a sample determination module.
The label determination module is used for acquiring, at a plurality of acquisition moments, mouth structure information of a target object uttering preset audio, collected by an acquisition device, and using the mouth structure information as label mouth shape parameters.
The sample determination module is used for determining, centered on each acquisition moment, preset audio of the specified length uttered by the target object as initial sample audio data.
The first loss determination module is used for determining the loss of the deep learning model according to the difference between the label mouth shape parameters and the output mouth shape parameters.
The second loss determination module is used for determining the loss of the deep learning model according to the difference between the label mouth shape parameters and the output mouth shape parameters.
Fig. 8 is a block diagram of an apparatus for controlling an avatar mouth shape change according to one embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for controlling the mouth shape change of an avatar includes a third feature extraction module 801, an audio input module 802, and a control module 803.
The third feature extraction module 801 is configured to extract features of audio data to be processed.
The audio input module 802 is configured to input features of the audio data to be processed into a deep learning model, so as to obtain mouth shape parameters corresponding to the audio data to be processed.
The control module 803 is used for controlling the mouth shape of the avatar to change according to the mouth shape parameters.
The deep learning model is trained by the training apparatus of the deep learning model described above.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, at least one of the training method of the deep learning model and the method of controlling the avatar mouth shape change. For example, in some embodiments, at least one of the training method of the deep learning model and the method of controlling the avatar mouth shape change may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of at least one of the training method of the deep learning model and the method of controlling the avatar mouth shape change described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform at least one of the training method of the deep learning model and the method of controlling the avatar mouth shape change.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (22)

1. A training method of a deep learning model, comprising:
determining audio data of random length from initial sample audio data of a specified length as valid data, and masking the audio data other than the valid data in the initial sample audio data to obtain target sample audio data;
extracting features of the target sample audio data;
inputting the features of the target sample audio data into a deep learning model to obtain output mouth shape parameters corresponding to the initial sample audio data;
determining a loss of the deep learning model according to the output mouth shape parameters; and
adjusting parameters of the deep learning model according to the loss.
2. The method of claim 1, wherein the determining audio data of a random length from the initial sample audio data of the specified length as effective data and masking the audio data other than the effective data in the initial sample audio data to obtain the target sample audio data comprises:
determining a window of a random length;
determining a mask array of the specified length according to the window of the random length, wherein the center of the window of the random length coincides with the center of the mask array of the specified length, elements of the mask array inside the window centered on the center of the mask array take a first value, and elements of the mask array outside the window take a second value; and
determining, by using the mask array of the specified length, the audio data of the random length from the initial sample audio data of the specified length as the effective data, and masking the audio data other than the effective data in the initial sample audio data to obtain the target sample audio data.
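Claims 2 and 3 pin down how the mask is built: a window of random length, drawn from a range between a minimum and the specified length and centered on the mask array, marks the effective data. A NumPy sketch under those assumptions, taking 1 and 0 as the first and second values and the lengths as arbitrary placeholders:

```python
import numpy as np

def centered_mask(specified_len: int, window_len: int) -> np.ndarray:
    """Mask array of the specified length: elements inside the centered
    window of random length hold the first value (1), all others the
    second value (0)."""
    mask = np.zeros(specified_len)                 # second value everywhere
    start = (specified_len - window_len) // 2
    mask[start:start + window_len] = 1.0           # first value inside the window
    return mask

rng = np.random.default_rng()
specified_len, min_len = 8320, 1600                         # assumed lengths in samples
window_len = int(rng.integers(min_len, specified_len + 1))  # claim 3: random length in range
initial_audio = rng.standard_normal(specified_len)          # placeholder sample
target_audio = centered_mask(specified_len, window_len) * initial_audio
```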
3. The method of claim 1 or 2, further comprising:
determining the random length from a preset length range, wherein the preset length range is greater than or equal to a minimum length and less than or equal to the specified length.
4. The method of claim 1, further comprising:
extracting features of the initial sample audio data to obtain initial features of a specified size;
determining features of a random size from the initial features of the specified size as effective features, and masking the features other than the effective features in the initial features to obtain target features;
inputting the target features into the deep learning model to obtain an output mouth shape parameter corresponding to the initial sample audio data;
determining the loss of the deep learning model according to the output mouth shape parameter; and
adjusting parameters of the deep learning model according to the loss.
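Claim 4 keeps the same training loop but moves the masking after feature extraction: the model sees initial features of a specified size with only a random-size block left intact. A minimal sketch with assumed shapes (52 frames by 80 dimensions) and the same illustrative MLP and L2 loss as in the claim 1 sketch above:

```python
import torch
import torch.nn as nn

T, D, NUM_PARAMS = 52, 80, 32        # assumed specified feature size and output dimension
MIN_T, MIN_D = 8, 16                 # assumed minimum window size (cf. claim 6)

model = nn.Sequential(nn.Flatten(), nn.Linear(T * D, 256), nn.ReLU(),
                      nn.Linear(256, NUM_PARAMS))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step_on_features(initial_feats: torch.Tensor, label_params: torch.Tensor) -> float:
    # keep a centered block of random size as effective features, mask the rest
    h = int(torch.randint(MIN_T, T + 1, ()))
    w = int(torch.randint(MIN_D, D + 1, ()))
    mask = torch.zeros(T, D)
    mask[(T - h) // 2:(T - h) // 2 + h, (D - w) // 2:(D - w) // 2 + w] = 1.0
    target_feats = initial_feats * mask                       # broadcasts over the batch
    out_params = model(target_feats)                          # output mouth shape parameters
    loss = nn.functional.mse_loss(out_params, label_params)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```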
5. The method of claim 4, wherein the determining features of a random size from the initial features of the specified size as effective features and masking the features other than the effective features in the initial features to obtain the target features comprises:
determining a window of a random size;
determining a mask matrix of the specified size according to the window of the random size, wherein the center of the window of the random size coincides with the center of the mask matrix of the specified size, elements of the mask matrix inside the window centered on the center of the mask matrix take a first value, and elements of the mask matrix outside the window take a second value; and
determining, by using the mask matrix of the specified size, the features of the random size from the initial features of the specified size as the effective features, and masking the features other than the effective features in the initial features to obtain the target features.
6. The method of claim 4 or 5, further comprising:
determining the random size from a preset size range, wherein the preset size range is greater than or equal to a minimum size and less than or equal to the specified size.
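Claims 5 and 6 describe the feature mask as a matrix of the specified size whose centered window of random size carries the first value. A NumPy sketch, with the sizes and the 1/0 values as assumptions:

```python
import numpy as np

def centered_mask_2d(spec_shape: tuple, win_shape: tuple) -> np.ndarray:
    """Mask matrix of the specified size: a centered window of random size
    holds the first value (1); everything outside holds the second value (0)."""
    mask = np.zeros(spec_shape)
    r0 = (spec_shape[0] - win_shape[0]) // 2
    c0 = (spec_shape[1] - win_shape[1]) // 2
    mask[r0:r0 + win_shape[0], c0:c0 + win_shape[1]] = 1.0
    return mask

rng = np.random.default_rng()
spec_shape, min_shape = (52, 80), (8, 16)                         # assumed specified/minimum sizes
win_shape = (int(rng.integers(min_shape[0], spec_shape[0] + 1)),
             int(rng.integers(min_shape[1], spec_shape[1] + 1)))  # claim 6: random size in range
initial_feats = rng.standard_normal(spec_shape)
target_feats = centered_mask_2d(spec_shape, win_shape) * initial_feats
```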
7. The method of claim 1 or 4, further comprising:
acquiring mouth structure information of a target object uttering preset audio, the mouth structure information being collected by an acquisition device at a plurality of acquisition moments, and taking the mouth structure information as label mouth shape parameters; and
determining, with an acquisition moment as a center, preset audio of a specified length uttered by the target object as the initial sample audio data.
8. The method of claim 7, wherein the determining the loss of the deep learning model according to the output mouth shape parameter comprises:
determining the loss of the deep learning model according to a difference between the label mouth shape parameter and the output mouth shape parameter.
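Claims 7 and 8 describe how training pairs arise: the mouth structure captured at each acquisition moment becomes the label, the audio sample is the specified-length window centered on that moment, and the loss is a difference between label and output parameters. A sketch assuming a 16 kHz rate and an L2 difference; the claims do not fix a particular sample rate or distance, so both are assumptions:

```python
import numpy as np

SR, SPEC_LEN = 16000, 8320    # assumed sample rate and specified length (samples)

def audio_window_around(audio: np.ndarray, t_sec: float) -> np.ndarray:
    """Initial sample audio: a specified-length window centered on the
    acquisition moment t_sec, clamped to the recording boundaries.
    Assumes the recording is at least SPEC_LEN samples long."""
    center = int(t_sec * SR)
    start = min(max(0, center - SPEC_LEN // 2), len(audio) - SPEC_LEN)
    return audio[start:start + SPEC_LEN]

def mouth_shape_loss(label_params: np.ndarray, out_params: np.ndarray) -> float:
    """One possible 'difference' between label and output parameters (L2)."""
    return float(np.mean((label_params - out_params) ** 2))
```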
9. A method of controlling a mouth shape change of an avatar, comprising:
extracting features of audio data to be processed;
inputting the features of the audio data to be processed into a deep learning model to obtain a mouth shape parameter corresponding to the audio data to be processed; and
controlling a mouth shape of the avatar to change according to the mouth shape parameter;
wherein the deep learning model is trained according to the method of any one of claims 1 to 8.
10. The method of claim 9, wherein the audio data to be processed comprises audio data of a specified length.
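Claims 9 and 10 are the inference path: features of the incoming audio go through the trained model, and the resulting mouth shape parameters drive the avatar. A hedged sketch; the callables are injected because the patent names no concrete feature extractor or rendering API, so all of them are assumed to come from the surrounding system:

```python
import torch

def drive_avatar_mouth(model, extract_features, apply_to_avatar,
                       audio_chunk: torch.Tensor) -> torch.Tensor:
    """Audio of a specified length -> features -> deep learning model ->
    mouth shape parameters -> avatar mouth update."""
    feats = extract_features(audio_chunk)     # same features as used in training
    with torch.no_grad():                     # inference only, no gradient tracking
        mouth_params = model(feats)
    apply_to_avatar(mouth_params)             # e.g., blendshape weights in a renderer
    return mouth_params
```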
11. A training apparatus for a deep learning model, comprising:
a sample processing module configured to determine audio data of a random length from initial sample audio data of a specified length as effective data and mask the audio data other than the effective data in the initial sample audio data to obtain target sample audio data;
a first feature extraction module configured to extract features of the target sample audio data;
a sample input module configured to input the features of the target sample audio data into a deep learning model to obtain an output mouth shape parameter corresponding to the initial sample audio data;
a first loss determination module configured to determine a loss of the deep learning model according to the output mouth shape parameter; and
a first parameter adjustment module configured to adjust parameters of the deep learning model according to the loss.
12. The apparatus of claim 11, wherein the sample processing module comprises:
a first window determination unit configured to determine a window of a random length;
a first mask determination unit configured to determine a mask array of the specified length according to the window of the random length, wherein the center of the window of the random length coincides with the center of the mask array of the specified length, elements of the mask array inside the window centered on the center of the mask array take a first value, and elements of the mask array outside the window take a second value; and
a sample processing unit configured to determine, by using the mask array of the specified length, the audio data of the random length from the initial sample audio data of the specified length as the effective data, and mask the audio data other than the effective data in the initial sample audio data to obtain the target sample audio data.
13. The apparatus of claim 11 or 12, further comprising:
a length determination module configured to determine the random length from a preset length range, wherein the preset length range is greater than or equal to a minimum length and less than or equal to the specified length.
14. The apparatus of claim 11, further comprising:
a second feature extraction module configured to extract features of the initial sample audio data to obtain initial features of a specified size;
a feature processing module configured to determine features of a random size from the initial features of the specified size as effective features and mask the features other than the effective features in the initial features to obtain target features;
a feature input module configured to input the target features into the deep learning model to obtain an output mouth shape parameter corresponding to the initial sample audio data;
a second loss determination module configured to determine the loss of the deep learning model according to the output mouth shape parameter; and
a second parameter adjustment module configured to adjust parameters of the deep learning model according to the loss.
15. The apparatus of claim 14, wherein the feature processing module comprises:
a second window determination unit configured to determine a window of a random size;
a second mask determination unit configured to determine a mask matrix of the specified size according to the window of the random size, wherein the center of the window of the random size coincides with the center of the mask matrix of the specified size, elements of the mask matrix inside the window centered on the center of the mask matrix take a first value, and elements of the mask matrix outside the window take a second value; and
a feature processing unit configured to determine, by using the mask matrix of the specified size, the features of the random size from the initial features of the specified size as the effective features, and mask the features other than the effective features in the initial features to obtain the target features.
16. The apparatus of claim 14 or 15, further comprising:
a size determination module configured to determine the random size from a preset size range, wherein the preset size range is greater than or equal to a minimum size and less than or equal to the specified size.
17. The apparatus of claim 14, further comprising:
a label determination module configured to acquire mouth structure information of a target object uttering preset audio, the mouth structure information being collected by an acquisition device at a plurality of acquisition moments, and take the mouth structure information as label mouth shape parameters; and
a sample determination module configured to determine, with an acquisition moment as a center, preset audio of a specified length uttered by the target object as the initial sample audio data.
18. The apparatus of claim 17, wherein the first loss determination module is configured to determine the loss of the deep learning model according to a difference between the label mouth shape parameter and the output mouth shape parameter; and
the second loss determination module is configured to determine the loss of the deep learning model according to a difference between the label mouth shape parameter and the output mouth shape parameter.
19. An apparatus for controlling a mouth shape change of an avatar, comprising:
a third feature extraction module configured to extract features of audio data to be processed;
an audio input module configured to input the features of the audio data to be processed into a deep learning model to obtain a mouth shape parameter corresponding to the audio data to be processed; and
a control module configured to control a mouth shape of the avatar to change according to the mouth shape parameter;
wherein the deep learning model is trained by the apparatus of any one of claims 11 to 18.
20. The apparatus of claim 19, wherein the audio data to be processed comprises audio data of a specified length.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.
22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method of any one of claims 1 to 10.
CN202310306535.0A 2023-03-24 2023-03-24 Training method of deep learning model and method for controlling mouth shape change of virtual image Active CN116013354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310306535.0A CN116013354B (en) 2023-03-24 2023-03-24 Training method of deep learning model and method for controlling mouth shape change of virtual image

Publications (2)

Publication Number Publication Date
CN116013354A (en) 2023-04-25
CN116013354B (en) 2023-06-09

Family

ID=86037715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310306535.0A Active CN116013354B (en) 2023-03-24 2023-03-24 Training method of deep learning model and method for controlling mouth shape change of virtual image

Country Status (1)

Country Link
CN (1) CN116013354B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257762A (en) * 2023-05-16 2023-06-13 世优(北京)科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070287A1 (en) * 2005-04-19 2010-03-18 Shyh-Shiaw Kuo Adapting masking thresholds for encoding a low frequency transient signal in audio data
CN108538283A (en) * 2018-03-15 2018-09-14 上海电力学院 A kind of conversion method by lip characteristics of image to speech coding parameters
CN109697978A (en) * 2018-12-18 2019-04-30 百度在线网络技术(北京)有限公司 Method and apparatus for generating model
CN114187547A (en) * 2021-12-03 2022-03-15 南京硅基智能科技有限公司 Target video output method and device, storage medium and electronic device
CN114895817A (en) * 2022-05-24 2022-08-12 北京百度网讯科技有限公司 Interactive information processing method, and training method and device of network model
CN115082920A (en) * 2022-08-16 2022-09-20 北京百度网讯科技有限公司 Deep learning model training method, image processing method and device
CN115345968A (en) * 2022-10-19 2022-11-15 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115690382A (en) * 2022-12-27 2023-02-03 北京百度网讯科技有限公司 Training method of deep learning model, and method and device for generating panorama
CN115798453A (en) * 2021-09-10 2023-03-14 腾讯科技(深圳)有限公司 Voice reconstruction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN116013354B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
US20220147822A1 (en) Training method and apparatus for target detection model, device and storage medium
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN114511758A (en) Image recognition method and device, electronic device and medium
CN113870334B (en) Depth detection method, device, equipment and storage medium
CN112887789B (en) Video generation model construction method, video generation device, video generation equipment and video generation medium
CN116013354B (en) Training method of deep learning model and method for controlling mouth shape change of virtual image
JP7446359B2 (en) Traffic data prediction method, traffic data prediction device, electronic equipment, storage medium, computer program product and computer program
CN113705628A (en) Method and device for determining pre-training model, electronic equipment and storage medium
CN112528995A (en) Method for training target detection model, target detection method and device
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN113177466A (en) Identity recognition method and device based on face image, electronic equipment and medium
CN114627556B (en) Motion detection method, motion detection device, electronic apparatus, and storage medium
CN111276127A (en) Voice awakening method and device, storage medium and electronic equipment
CN113361519B (en) Target processing method, training method of target processing model and device thereof
CN113361575B (en) Model training method and device and electronic equipment
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN113591709B (en) Motion recognition method, apparatus, device, medium, and product
CN114898742A (en) Method, device, equipment and storage medium for training streaming voice recognition model
CN114078097A (en) Method and device for acquiring image defogging model and electronic equipment
CN114445668A (en) Image recognition method and device, electronic equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant