CN117012224A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN117012224A
Authority
CN
China
Prior art keywords
expression base
data
base control
control parameters
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210695394.1A
Other languages
Chinese (zh)
Inventor
陈雅静
黄永祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Cyber Shenzhen Co Ltd
Original Assignee
Tencent Cyber Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Cyber Shenzhen Co Ltd filed Critical Tencent Cyber Shenzhen Co Ltd
Priority to CN202210695394.1A
Publication of CN117012224A


Abstract

The application relates to the technical field of computers and provides a data processing method and device, electronic equipment and a storage medium. The data processing method comprises: obtaining audio data to be processed, and extracting semantic features and acoustic features of the audio data to be processed; fusing the semantic features and the acoustic features; performing expression base control parameter prediction on the fusion result within a preset dimension range to obtain target expression base control parameters; determining target expression base data according to the target expression base control parameters and the basic expression base data; and processing preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data. Because the semantic features and the acoustic features are fused and the target expression base control parameters are predicted within the preset dimension range, the prediction accuracy of the expression base control parameters is improved, multiple languages and multiple contexts are supported, the dimension of the predicted target expression base control parameters is reduced, and the data processing cost is lowered.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.
Background
FIG. 1 is a schematic diagram of an existing speech-based mouth shape scheme, in which a segment of speech is input into a neural network system to extract features, and the parameters of a template 3D face model or of a face binding system are learned from those features. However, the existing scheme has high training-data acquisition and data processing costs, poor model stability and robustness, and cannot support multilingual and multi-context scenarios.
Disclosure of Invention
In order to solve the prior-art problems of high cost and low accuracy when input audio data are converted into linear predictive coding features and fed into a neural network model that outputs three-dimensional mouth shape data, the application provides a data processing method, a data processing device, electronic equipment and a storage medium.
according to a first aspect of the present application, there is provided a data processing method comprising:
acquiring audio data to be processed, and extracting semantic features and acoustic features of the audio data to be processed;
carrying out fusion processing on the semantic features and the acoustic features;
carrying out prediction processing on expression base control parameters on the fusion processing result in a preset dimension range to obtain target expression base control parameters; the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is a dimension range corresponding to preset basic expression base data;
Determining target expression base data according to the target expression base control parameters and the basic expression base data;
and processing the preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
According to a second aspect of the present application, there is provided a data processing apparatus comprising:
the acquisition module is used for acquiring the audio data to be processed and extracting semantic features and acoustic features of the audio data to be processed;
the fusion processing module is used for carrying out fusion processing on the semantic features and the acoustic features;
the prediction processing module is used for performing prediction processing on the expression base control parameters on the fusion processing result in a preset dimension range to obtain target expression base control parameters; the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is a dimension range corresponding to preset basic expression base data;
the first determining module is used for determining target expression base data according to the target expression base control parameters and the basic expression base data;
and the second determining module is used for processing the preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
In another aspect, the data processing method is implemented based on a speech-generated three-dimensional mouth shape model;
the speech-generated three-dimensional mouth shape model comprises an encoder and a decoder;
the encoder is used for extracting semantic features and acoustic features of the audio data to be processed and carrying out fusion processing on the semantic features and the acoustic features;
the decoder is used for carrying out prediction processing on the expression base control parameters on the fusion processing result to obtain target expression base control parameters.
In another aspect, a data processing apparatus includes: a voice generating three-dimensional mouth model training module;
the voice generation three-dimensional mouth model training module comprises:
the acquisition sub-module is used for acquiring the sample audio data and the corresponding labeling information; the labeling information comprises labeling expression base control parameters, and the labeling expression base control parameters are obtained based on sample video data corresponding to the sample audio data;
the first input sub-module is used for inputting the sample audio data to an encoder of a preset neural network model, extracting sample semantic features and sample acoustic features of the sample audio data through the encoder, and carrying out fusion processing on the sample semantic features and the sample acoustic features;
the second input sub-module is used for inputting the fusion processing result to a decoder of a preset neural network model, and carrying out prediction processing on the expression base control parameters in a preset dimension range through the decoder to obtain predicted expression base control parameters;
The first determining submodule is used for determining a first sub-loss value according to the marked expression base control parameter, the predicted expression base control parameter and the basic expression base data; the first sub-loss value represents the difference value of the mouth key point data;
the second determining submodule is used for determining a second sub-loss value according to the marked expression base control parameter, the predicted expression base control parameter and the basic expression base data; the second sub-loss value represents the difference value of the mouth closing key point data;
the third determining submodule is used for determining a third sub-loss value according to the marked three-dimensional mouth shape data corresponding to the marked expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters; the third sub-loss value characterizes the difference of the mouth shape data;
the training sub-module is used for training the preset neural network model according to the first sub-loss value, the second sub-loss value and the third sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by the voice.
In another aspect, a speech generating three-dimensional mouth model training module includes:
a fourth determining sub-module, configured to determine a fourth sub-loss value according to a difference value between the labeled expression base control parameter and the predicted expression base control parameter;
A fifth determining submodule, configured to determine a fifth sub-loss value according to a difference value between the labeled three-dimensional face data corresponding to the labeled expression base control parameter and the predicted three-dimensional face data corresponding to the predicted expression base control parameter;
the training module is used for training the preset neural network model according to the first sub-loss value, the second sub-loss value, the third sub-loss value, the fourth sub-loss value and the fifth sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by the voice.
On the other hand, the first determining submodule is used for determining the marked expression base data according to the product of the marked expression base control parameter and the basic expression base data;
determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining a first sub-loss value according to the product of the difference value of the mouth key point data in the marked expression base data and the mouth key point data in the predicted expression base data and the mouth key point control parameter in the marked expression base control parameter.
On the other hand, the second determining submodule is used for determining the marked expression base data according to the product of the marked expression base control parameter and the basic expression base data;
Determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining a second sub-loss value according to the difference value between the mouth closing key point data in the marked expression base data and the mouth closing key point data in the predicted expression base data.
On the other hand, the third determining submodule is used for rendering the labeling three-dimensional mouth shape data corresponding to the labeling expression base control parameters and the prediction three-dimensional mouth shape data corresponding to the prediction expression base control parameters to obtain a labeling image and a prediction image;
feature extraction processing is carried out on the marked image and the predicted image, so that marked mouth shape features corresponding to the marked expression base control parameters and predicted mouth shape features corresponding to the predicted expression base control parameters are obtained;
and determining a third sub-loss value according to the difference value of the predicted mouth shape characteristic and the marked mouth shape characteristic.
According to a third aspect of the present application there is provided an electronic device comprising a processor and a memory in which at least one instruction or at least one program is stored, the at least one instruction or at least one program being loaded and executed by the processor to carry out the data processing method of the first aspect of the present application.
According to a fourth aspect of the present application there is provided a computer storage medium having stored therein at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by a processor to implement the data processing method of the first aspect of the present application.
According to a fifth aspect of the present application there is provided a computer program product comprising at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by a processor to implement the data processing method of the first aspect of the present application.
The data processing method, the device, the electronic equipment and the storage medium provided by the embodiment of the application have the following technical effects:
acquiring audio data to be processed, and extracting semantic features and acoustic features of the audio data to be processed; fusing the semantic features and the acoustic features; performing expression base control parameter prediction on the fusion result within a preset dimension range to obtain target expression base control parameters, wherein the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is the dimension range corresponding to preset basic expression base data; determining target expression base data according to the target expression base control parameters and the basic expression base data; and processing preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data. According to the embodiment of the application, the target expression base control parameters are predicted by fusing the semantic features and the acoustic features within the preset dimension range, which not only improves the prediction accuracy of the expression base control parameters and supports multiple languages and multiple contexts, but also reduces the dimension of the predicted target expression base control parameters and thereby lowers the data processing cost.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a framework of a prior speech-based mouth-shape scheme;
FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an interface for running functional software for generating a mouth shape based on linear predictive coding;
FIG. 4 is a schematic diagram of an interface for running functional software that generates a mouth shape based on phonemes;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing method according to an embodiment of the present application;
FIG. 7 is a flowchart of a training method for generating a three-dimensional mouth shape model by using voice according to an embodiment of the present application;
fig. 8 is a schematic diagram of a closing distance of a key point corresponding to a lip according to an embodiment of the present application;
FIG. 9 is a flowchart of a training method for generating a three-dimensional mouth shape model by using voice according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic hardware structure of an electronic device for implementing the data processing method provided by the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail with reference to the accompanying drawings. It will be apparent that the described embodiments are merely some embodiments of the application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic may be included in at least one implementation of the application. In describing embodiments of the present application, it should be understood that the terms "first," "second," and "third," etc. are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, features defining "first," "second," and "third," etc. may explicitly or implicitly include one or more such features. Moreover, the terms "first," "second," and "third," etc. are used to distinguish between similar objects and not necessarily to describe a particular order or sequence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprising," "having," and "being," and any variations thereof, are intended to cover a non-exclusive inclusion.
It will be appreciated that in the specific embodiments of the present application, where related data such as video data and audio data are involved, when the above embodiments of the present application are applied to specific products or technologies, user approval or consent is required, and the collection, use and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant country and region.
Referring to fig. 2, fig. 2 is a schematic diagram of an application environment provided in an embodiment of the present application, where the application environment may include an acquisition device 10 and a server 20. The acquisition device 10 and the server 20 may be directly or indirectly connected by wired or wireless communication.
In some possible embodiments, the acquisition device 10 may transmit the acquired audio data to be processed to the server via a network. The server can provide a data processing service: it extracts the semantic features and acoustic features of the audio data to be processed, fuses them, and inputs the fusion result into the decoder to output expression base control parameters, which can then be conveniently transferred to different animated characters to generate three-dimensional mouth shape data.
The acquisition device 10 may include at least one hardware device of a video recorder, a smart phone, a computer (such as a desktop computer, a tablet computer, a notebook computer), an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, an intelligent wearable device, an intelligent home appliance, and other hardware devices capable of supporting synchronous audio-video acquisition and recording. Or may be software, such as a computer program, running in a physical device. The operating system corresponding to the client may be an Android system, an iOS system (a mobile operating system developed by apple corporation), a linux system (an operating system), a Microsoft Windows system (microsoft windows operating system), and the like.
The server 20 may be an independent physical server, a service cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligent platform. The server may include a network communication unit, a processor, a memory, and the like. The server may provide an image backhaul service.
In some possible embodiments, both the client 10 and the server 20 may be node devices in the blockchain system, and may be capable of sharing the acquired and generated information to other node devices in the blockchain system, so as to implement information sharing between multiple node devices. The plurality of node devices in the blockchain system can be configured with the same blockchain, the blockchain consists of a plurality of blocks, and the blocks adjacent to each other in front and back have an association relationship, so that the data in any block can be detected through the next block when being tampered, thereby avoiding the data in the blockchain from being tampered, and ensuring the safety and reliability of the data in the blockchain.
The existing main software for generating mouth shapes from speech includes mouth shape software based on linear predictive coding and mouth shape software based on phonemes. Fig. 3 is a schematic diagram of the operation interface of functional software that generates mouth shapes based on linear predictive coding. The main algorithm of this software is based on the SIGGRAPH 2017 paper Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion: the input audio is first converted into linear predictive coding (LPC) features, the LPC features are then input into a neural network, and the vertices of a template 3D model are output. It uses 5-10 minutes of 4D scan data as training data, and its loss function mainly consists of a vertex loss and an inter-frame smoothing loss on emotion. Because this software uses 4D scan data as training data, a special capture device is required, and even 5 minutes of data capture is costly. Moreover, a large amount of training data cannot be captured, which easily leads to poor stability and robustness of the model. Although the software claims to support multiple languages and multiple contexts, actual testing showed that it is difficult to support languages other than English and that the generated mouth shapes exhibit obvious jitter. In addition, a method such as deformation transfer is required to transfer the motion corresponding to the 4D scan data to the model of the target character, and during this transfer, defects or inaccurate motion may result because the shapes of the template and the target character differ.
FIG. 4 is a schematic diagram of the operation interface of functional software that generates mouth shapes based on phonemes and that mainly provides mouth shape animation for games. The main algorithms of this software are based on JALI: An Animator-Centric Viseme Model for Expressive Lip Synchronization (SIGGRAPH 2016) and VisemeNet: Audio-Driven Animator-Centric Speech Animation (SIGGRAPH 2018). The phoneme-based mouth shape software designs a special facial animation binding system that includes a two-part controller for the chin and the lips; different mouth shapes are obtained by adjusting the parameters of the two-part controller, experienced animators then manually produce animation data, and the animation data are used to train a network model that takes phonemes as input and outputs mouth shape parameters. Because the phoneme-based mouth shape software requires professional animators to refine the animation data, the labor cost is high. Moreover, because the scheme relies on a network model trained on phonemes to output mouth shape parameters, and the phonemes of different languages differ greatly, it cannot support multiple languages. If a separate model is trained for each language, the workload and labor cost of professional animators increase further. Furthermore, a network model that outputs mouth shape parameters from phonemes is less vivid and can hardly change appropriately according to the emotion of the speech.
Next, a specific embodiment of a data processing method according to the present application is described, fig. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application, and fig. 6 is a schematic frame diagram of a data processing method according to an embodiment of the present application. The present specification provides method operational steps as illustrated by an example or flowchart, but may include more or fewer operational steps based on conventional or non-inventive labor. The sequence of steps recited in the embodiments is only one manner of a plurality of execution sequences, and does not represent a unique execution sequence, and when actually executed, may be executed sequentially or in parallel (e.g., in a parallel processor or a multithreaded environment) according to the method shown in the embodiments or the drawings.
In practical applications, the data processing method provided by the embodiment of the application can be applied to various avatar scenarios, such as broadcasting, commentary and building game characters, and can also be used to carry personalized services with an avatar, such as one-to-one services of a psychological counselor or a virtual assistant.
As shown in fig. 5 and 6 in particular, the data processing method may include:
S501: and acquiring the audio data to be processed, and extracting semantic features and acoustic features of the audio data to be processed.
In the embodiment of the application, the audio data to be processed recorded by the recording device or the voice synthesis device, namely the audio to be processed input by the user, can be obtained, and then the audio data to be processed can be subjected to semantic feature extraction processing to obtain the semantic features of the audio data to be processed, and simultaneously the audio data to be processed can be subjected to acoustic feature extraction processing to obtain the acoustic features of the audio data to be processed.
In some possible embodiments, an encoder in the speech-generated three-dimensional mouth shape model, namely a semantic recognition model, can be used to perform feature extraction on the audio to be processed input by the user, obtaining the 512-dimensional semantic features from the layer before the last layer. Meanwhile, another encoder in the speech-generated three-dimensional mouth shape model can be used to perform feature extraction on the audio to be processed input by the user, obtaining Mel-frequency cepstral coefficients (MFCC). Optionally, the speech recognition system Deep Speech developed by Baidu can be used to perform feature extraction on the audio to be processed input by the user, obtaining deep features carrying high-level semantic information. Meanwhile, pre-emphasis, framing, windowing, Fourier transform, Mel filtering and discrete cosine transform can be applied to the audio to be processed to obtain the MFCC carrying low-level audio information.
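For illustration only, the following sketch shows how the acoustic branch described above (pre-emphasis, framing, windowing, Fourier transform, Mel filtering and discrete cosine transform) might be approximated with an off-the-shelf MFCC implementation, while the semantic branch is left as a placeholder since the text does not fix a particular speech recognition network. The use of librosa, the 16 kHz sampling rate and the function names are assumptions, not part of the application.

```python
# Minimal sketch, not the application's reference implementation.
# Assumes: mono audio resampled to 16 kHz, librosa available for MFCC
# extraction, and an external speech-recognition encoder that exposes
# 512-dimensional frame-level semantic features.
import librosa

def extract_acoustic_features(wav_path, sr=16000, n_mfcc=13):
    """Low-level acoustic features: MFCCs per audio frame."""
    audio, _ = librosa.load(wav_path, sr=sr)
    audio = librosa.effects.preemphasis(audio)             # pre-emphasis
    # framing, windowing, FFT, Mel filter bank and DCT happen inside
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                          # (num_frames, n_mfcc)

def extract_semantic_features(wav_path):
    """High-level semantic features, e.g. the 512-dim output of the layer
    before the last layer of a speech recognition model (placeholder)."""
    raise NotImplementedError("plug in a speech recognition encoder here")
```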
In the embodiment of the application, the data processing method can be implemented based on a speech-generated three-dimensional mouth shape model. The speech-generated three-dimensional mouth shape model can comprise an encoder and a decoder, wherein the encoder can be used to extract the semantic features and the acoustic features of the audio data to be processed and to fuse the semantic features and the acoustic features. The decoder can receive the fusion result and perform expression base control parameter prediction on it, obtaining the expression base control parameters of the audio data to be processed, that is, ARKit parameters. By changing the structure of the speech-generated three-dimensional mouth shape model, the single encoder that takes linear predictive coding as input is replaced by two encoders that take semantic features and acoustic features as input, whose outputs are fused to predict the expression base control parameters; this not only improves the accuracy of model prediction but also supports multiple languages and multiple contexts.
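As a rough illustration of the two-encoder, one-decoder structure described above, the following PyTorch sketch fuses a semantic feature stream and an acoustic feature stream by concatenation and maps the result to 52 expression base control parameters. The layer types, hidden sizes and the sigmoid output range are assumptions; the application does not specify the network internals.

```python
# Illustrative sketch of a two-encoder / one-decoder network; layer choices
# and sizes are assumptions, not the patented architecture.
import torch
import torch.nn as nn

class SpeechToBlendshape(nn.Module):
    def __init__(self, sem_dim=512, ac_dim=13, hidden=256, n_params=52):
        super().__init__()
        self.sem_encoder = nn.GRU(sem_dim, hidden, batch_first=True)  # semantic branch
        self.ac_encoder = nn.GRU(ac_dim, hidden, batch_first=True)    # acoustic branch
        self.decoder = nn.Sequential(                                  # predicts expression
            nn.Linear(2 * hidden, hidden), nn.ReLU(),                  # base control params
            nn.Linear(hidden, n_params), nn.Sigmoid())                 # assumed range [0, 1]

    def forward(self, sem_feats, ac_feats):
        # assumes both feature streams are aligned to the same frame rate
        sem, _ = self.sem_encoder(sem_feats)      # (batch, frames, hidden)
        ac, _ = self.ac_encoder(ac_feats)         # (batch, frames, hidden)
        fused = torch.cat([sem, ac], dim=-1)      # fusion by concatenation
        return self.decoder(fused)                # (batch, frames, n_params)
```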
Fig. 7 is a flowchart of a training method for generating a three-dimensional mouth model by using voice according to an embodiment of the present application. The method comprises the following specific steps:
s701: and acquiring the sample audio data and the corresponding labeling information.
In the embodiment of the application, the labeling information may include labeled expression base control parameters, and the labeled expression base control parameters may be obtained based on the sample video data corresponding to the sample audio data. Optionally, the labeled expression base control parameters of each frame of image can be captured from the sample video data corresponding to the sample audio data, for example by a facial capture method based on RGB images or a facial capture method based on dynamic coordinates. The labeled expression base control parameters can drive a character model that conforms to the basic expression base data; that is, the labeled expression base control parameters can be combined with the basic expression base data to obtain the three-dimensional face data of each frame of image in the sample video data, and the three-dimensional face data may include three-dimensional mouth shape data.
Here, an expression base refers to the collection of all mesh vertices involved in a complete vertex animation. In practical applications, the sample audio data and the corresponding sample video data may be recorded with a synchronous audio-video acquisition device and may be encapsulated in an audio-video interleaved format (.wav). The labeled expression base control parameters can be represented by a matrix c_i = [c_1^i, c_2^i, ..., c_m^i]^T, where i denotes the video frame and m denotes the dimension of the labeled expression base control parameters, which is the same as the dimension of the preset basic expression base data. For example, the basic expression base data may be the 52 basic expression bases developed by Apple, i.e., the Apple ARKit blendshapes, including a closed-mouth basic expression base, a lips-open basic expression base, an upper-lip basic expression base, a blink basic expression base, and so on, i.e., m = 52. The basic expression base data can be represented by a matrix B = [b_1, b_2, ..., b_m]^T, where each b_m represents the key point coordinates of one basic expression base. By determining the 52 Apple ARKit blendshapes with the labeled expression base control parameters, the dimension of the output data can be reduced, and the data processing cost can be lowered.
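The following numerical sketch illustrates the linear combination implied above, u_i = c_i^T B: a 52-dimensional control parameter vector weights 52 basic expression bases to produce key point coordinates. The number of key points per base and the random values are assumptions used only to show the shapes involved.

```python
# Worked example of combining expression base control parameters with the
# basic expression base data: u_i = c_i^T B. Shapes are assumptions
# (m = 52 expression bases, each storing N key points with 3 coordinates).
import numpy as np

m, N = 52, 68                          # 52 ARKit-style bases, 68 key points (assumed)
B = np.random.rand(m, N, 3)            # basic expression base data, one row per base
c_i = np.random.rand(m)                # labeled control parameters for video frame i

u_i = np.tensordot(c_i, B, axes=1)     # weighted combination of the basic expression bases
print(u_i.shape)                       # (N, 3) key point coordinates
```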
S703: the method comprises the steps of inputting sample audio data to an encoder of a preset neural network model, extracting sample semantic features and sample acoustic features of the sample audio data through the encoder, and carrying out fusion processing on the sample semantic features and the sample acoustic features.
In the embodiment of the application, after the sample audio data is acquired, a preset neural network model can be constructed as a model to be trained, the sample audio data is input into one encoder in the preset neural network model, the sample semantic features of high-level semantic information are extracted, the sample audio data is input into the other encoder in the preset neural network model, and the sample acoustic features of low-level audio information are extracted. And then, carrying out fusion processing on the sample semantic features of the high-level semantic information and the sample acoustic features of the low-level audio information to obtain a fusion processing result. For example, the sample semantic features and the sample acoustic features may be spliced to obtain a result of the splicing process.
S705: inputting the fusion processing result to a decoder of a preset neural network model, and carrying out prediction processing on expression base control parameters in a preset dimension range through the decoder to obtain predicted expression base control parameters.
In the embodiment of the application, the fusion result can be input to a decoder in the constructed preset neural network model, and the decoder performs expression base control parameter prediction within the preset dimension range to obtain the predicted expression base control parameters of each audio frame in the sample audio data. In practical applications, the predicted expression base control parameters can be represented by a matrix a_i = [a_1^i, a_2^i, ..., a_m^i]^T, where i denotes the audio frame and m denotes the dimension of the predicted expression base control parameters, which is the same as the dimension of the preset basic expression base data.
S707: determining a first sub-loss value according to the marked expression base control parameter, the predicted expression base control parameter and the basic expression base data; the first sub-loss value characterizes a difference in mouth keypoint data.
In the embodiment of the application, the labeled expression base data can be determined according to the product of the labeled expression base control parameters and the basic expression base data, and the predicted expression base data can be determined according to the product of the predicted expression base control parameters and the basic expression base data. Then, a first sub-loss value representing the difference of the mouth key point data, also called the frame-weighted landmark loss, can be determined according to the product of the difference between the mouth key point data in the labeled expression base data and the mouth key point data in the predicted expression base data and the mouth key point control parameters in the labeled expression base control parameters. Specifically, the labeled expression base data may be determined using formula (1), the predicted expression base data may be determined using formula (2), and the first sub-loss value may be determined using formula (3):
u_i = c_i^T B    (1)
v_i = a_i^T B    (2)
L_land = ω Σ_{k∈S} (v_k - u_k)^T (v_k - u_k),  ω = Σ_j c_j^2    (3)
where u_i denotes the labeled expression base data, v_i denotes the predicted expression base data, B denotes the basic expression base data, L_land denotes the first sub-loss value, ω is a weight derived from the labeled expression base control parameters whose purpose is to make the preset neural network model focus on frames with larger motion amplitude and avoid learning the average of the data set, and S is the index set of the mouth key points.
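Under the same shape assumptions as the previous sketch, formula (3) could be computed as follows; the mouth key point index set S is an assumed list of indices.

```python
# Sketch of the frame-weighted landmark loss of formula (3); shapes and the
# mouth key point index set are assumptions for illustration only.
import numpy as np

def frame_weighted_landmark_loss(c_i, a_i, B, mouth_idx):
    u_i = np.tensordot(c_i, B, axes=1)       # labeled expression base data, (N, 3)
    v_i = np.tensordot(a_i, B, axes=1)       # predicted expression base data, (N, 3)
    w = np.sum(c_i ** 2)                     # omega = sum_j c_j^2 (frame weight)
    diff = v_i[mouth_idx] - u_i[mouth_idx]   # restrict to mouth key points k in S
    return w * np.sum(diff * diff)           # omega * sum_k (v_k - u_k)^T (v_k - u_k)
```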
S709: determining a second sub-loss value according to the labeled expression base control parameter, the predicted expression base control parameter and the basic expression base data; the second sub-loss value characterizes a difference in mouth closure keypoint data.
Since closed-mouth pronunciations are generally plosives and of very short duration, their proportion in the sample audio data is very low, and it is difficult for the preset neural network model to learn the closed-mouth shape. Therefore, the embodiment of the application introduces a second sub-loss value representing the difference of the mouth closure key point data, so that the preset neural network model focuses on the closure distance of the key points corresponding to the upper and lower lips, that is, the distance between the upper lip points and the lower lip points of the mouth. Fig. 8 is a schematic diagram of the closure distance of the key points corresponding to the lips according to an embodiment of the present application.
In the embodiment of the application, the labeled expression base data can be determined according to the product of the labeled expression base control parameters and the basic expression base data, and the predicted expression base data can be determined according to the product of the predicted expression base control parameters and the basic expression base data. Then, a second sub-loss value representing the difference of the mouth closure key point data, also called the mouth closure loss, can be determined according to the difference between the mouth closure key point data in the labeled expression base data and the mouth closure key point data in the predicted expression base data. Specifically, the second sub-loss value may be determined using formula (4):
L_mouth = (v_mouth - u_mouth)^T (v_mouth - u_mouth)    (4)
where v_mouth denotes the predicted mouth closure key point data, u_mouth denotes the labeled mouth closure key point data, and L_mouth denotes the second sub-loss value; the predicted mouth closure key point data v_mouth and the labeled mouth closure key point data u_mouth can be obtained according to the calculation methods shown in formulas (1) and (2).
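A sketch of formula (4) under the assumption that the mouth closure key point data are the distances between paired upper- and lower-lip key points, as Fig. 8 suggests; the index arrays are illustrative.

```python
# Sketch of the mouth closure loss of formula (4); upper_idx / lower_idx are
# assumed index arrays for the paired upper- and lower-lip key points.
import numpy as np

def mouth_closure_loss(u_i, v_i, upper_idx, lower_idx):
    u_mouth = u_i[upper_idx] - u_i[lower_idx]   # labeled closure distances
    v_mouth = v_i[upper_idx] - v_i[lower_idx]   # predicted closure distances
    diff = v_mouth - u_mouth
    return np.sum(diff * diff)                  # (v_mouth - u_mouth)^T (v_mouth - u_mouth)
```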
S711: determining a third sub-loss value according to the marked three-dimensional mouth shape data corresponding to the marked expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters; the third loss value characterizes a difference in the mouth shape data.
To make the output mouth shapes better synchronized with the audio and more stable over time, a network that detects audio-lip synchronization in video, SyncNet, can be used to provide a constraint; this network is dedicated to detecting whether the mouth shapes and the audio in a video are aligned. In practical applications, a differentiable renderer can be used to render the 3D vertices of the three-dimensional face data into 2D images with skin color, 5 frames can be concatenated into a video, and then the 5-frame video and the MFCC features of the corresponding duration can be input into SyncNet to obtain two features, denoted F_render and F_mfcc.
In the embodiment of the application, the labeled three-dimensional mouth shape data corresponding to the labeled expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters can be rendered to obtain a labeled image and a predicted image, and then feature extraction can be performed on the labeled image and the predicted image to obtain the labeled mouth shape features corresponding to the labeled expression base control parameters and the predicted mouth shape features corresponding to the predicted expression base control parameters. Further, a third sub-loss value, also called the lip-sync loss, may be determined based on the difference between the labeled and predicted mouth shape features. Specifically, the third sub-loss value may be determined using formula (5):
L_sync = (F_render - F_mfcc)^2    (5)
where F_render denotes the predicted mouth shape feature, F_mfcc denotes the labeled mouth shape feature, and L_sync denotes the third sub-loss value.
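Since the SyncNet-style feature extractor itself is not reproduced here, the sketch below only shows the squared-difference comparison of formula (5) between two feature tensors; averaging over the feature dimension is an added assumption.

```python
# Sketch of the lip-sync loss of formula (5); the features are assumed to be
# produced by a pretrained sync network from rendered mouth clips and MFCCs.
import torch

def lipsync_loss(f_render: torch.Tensor, f_mfcc: torch.Tensor) -> torch.Tensor:
    return torch.mean((f_render - f_mfcc) ** 2)   # L_sync = (F_render - F_mfcc)^2
```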
S713: training the preset neural network model according to the first sub-loss value, the second sub-loss value and the third sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by the voice.
In the embodiment of the application, after the first sub-loss value, the second sub-loss value and the third sub-loss value are determined, the preset neural network model can be trained according to the sum value of the first sub-loss value, the second sub-loss value and the third sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by voice. The preset training ending condition may be that the sum value is smaller than a preset threshold value. Specifically, the preset neural network model may be trained using equation (6):
L = ω_land · L_land + ω_mouth · L_mouth + ω_sync · L_sync    (6)
where ω_land denotes the weight corresponding to the first sub-loss value, ω_mouth denotes the weight corresponding to the second sub-loss value, and ω_sync denotes the weight corresponding to the third sub-loss value. The values of the weights may be arbitrary, and their relative magnitudes may be defined arbitrarily. In one example, ω_land = 10, ω_mouth = 20, ω_sync = 0.05.
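A sketch of the weighted combination of formula (6), using the example weights given above; the threshold used for the "preset training ending condition" is an assumed value.

```python
# Sketch of the total loss of formula (6) and a threshold-based stopping
# check; the example weights follow the text, the threshold is assumed.
def total_loss(l_land, l_mouth, l_sync, w_land=10.0, w_mouth=20.0, w_sync=0.05):
    return w_land * l_land + w_mouth * l_mouth + w_sync * l_sync

def training_should_end(loss_value, threshold=1e-3):
    return loss_value < threshold
```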
By adopting the training method for the speech-generated three-dimensional mouth shape model provided by the embodiment of the application, constraining the mouth key point data, the mouth closure key point data and the degree of matching between the mouth shape and the audio improves the accuracy of the speech-generated three-dimensional mouth shape model and thereby improves the accuracy of the three-dimensional mouth shape data it generates.
Fig. 9 is a flowchart of a training method of a three-dimensional mouth model for speech generation according to an embodiment of the present application. The method comprises the following specific steps:
s901: and acquiring the sample audio data and the corresponding labeling information.
In the embodiment of the application, the labeling information may include labeled expression base control parameters, and the labeled expression base control parameters may be obtained based on the sample video data corresponding to the sample audio data. Optionally, the labeled expression base control parameters of each frame of image can be captured from the sample video data corresponding to the sample audio data, for example by a facial capture method based on RGB images or a facial capture method based on dynamic coordinates. The labeled expression base control parameters can drive a character model that conforms to the basic expression base data; that is, the labeled expression base control parameters can be combined with the basic expression base data to obtain the three-dimensional face data of each frame of image in the sample video data, and the three-dimensional face data may include three-dimensional mouth shape data.
Here, an expression base refers to the collection of all mesh vertices involved in a complete vertex animation. In practical applications, the sample audio data and the corresponding sample video data may be recorded with a synchronous audio-video acquisition device and may be encapsulated in an audio-video interleaved format (.wav). The labeled expression base control parameters can be represented by a matrix c_i = [c_1^i, c_2^i, ..., c_m^i]^T, where i denotes the video frame and m denotes the dimension of the labeled expression base control parameters, which is the same as the dimension of the preset basic expression base data. For example, the basic expression base data may be the 52 basic expression bases developed by Apple, i.e., the Apple ARKit blendshapes, including a closed-mouth basic expression base, a lips-open basic expression base, an upper-lip basic expression base, a blink basic expression base, and so on, i.e., m = 52. The basic expression base data can be represented by a matrix B = [b_1, b_2, ..., b_m]^T, where each b_m represents the key point coordinates of one basic expression base.
S903: the method comprises the steps of inputting sample audio data to an encoder of a preset neural network model, extracting sample semantic features and sample acoustic features of the sample audio data through the encoder, and carrying out fusion processing on the sample semantic features and the sample acoustic features.
In the embodiment of the application, after the sample audio data is acquired, a preset neural network model can be constructed as a model to be trained, the sample audio data is input into one encoder in the preset neural network model, the sample semantic features of high-level semantic information are extracted, the sample audio data is input into the other encoder in the preset neural network model, and the sample acoustic features of low-level audio information are extracted. And then, carrying out fusion processing on the sample semantic features of the high-level semantic information and the sample acoustic features of the low-level audio information to obtain a fusion processing result. For example, the sample semantic features and the sample acoustic features may be spliced to obtain a result of the splicing process.
S905: inputting the fusion processing result to a decoder of a preset neural network model, and carrying out prediction processing on expression base control parameters in a preset dimension range through the decoder to obtain predicted expression base control parameters.
In the embodiment of the application, the fusion result can be input to a decoder in the constructed preset neural network model, and the decoder performs expression base control parameter prediction within the preset dimension range to obtain the predicted expression base control parameters of each audio frame in the sample audio data. In practical applications, the predicted expression base control parameters can be represented by a matrix a_i = [a_1^i, a_2^i, ..., a_m^i]^T, where i denotes the audio frame and m denotes the dimension of the predicted expression base control parameters, which is the same as the dimension of the preset basic expression base data.
S907: determining a first sub-loss value according to the marked expression base control parameter, the predicted expression base control parameter and the basic expression base data; the first sub-loss value characterizes a difference in mouth keypoint data.
In the embodiment of the application, the labeled expression base data can be determined according to the product of the labeled expression base control parameters and the basic expression base data, and the predicted expression base data can be determined according to the product of the predicted expression base control parameters and the basic expression base data. Then, a first sub-loss value representing the difference of the mouth key point data, also called the frame-weighted landmark loss, can be determined according to the product of the difference between the mouth key point data in the labeled expression base data and the mouth key point data in the predicted expression base data and the mouth key point control parameters in the labeled expression base control parameters. Specifically, the labeled expression base data may be determined using formula (1), the predicted expression base data may be determined using formula (2), and the first sub-loss value may be determined using formula (3), as given above,
where u_i denotes the labeled expression base data, v_i denotes the predicted expression base data, B denotes the basic expression base data, L_land denotes the first sub-loss value, ω is a weight derived from the labeled expression base control parameters whose purpose is to make the preset neural network model focus on frames with larger motion amplitude and avoid learning the average of the data set, and S is the index set of the mouth key points.
S909: determining a second sub-loss value according to the mouth closing expression base control parameter in the labeling expression base control parameter, the mouth closing expression base control parameter in the predicting expression base control parameter and the basic expression base data; the second sub-loss value characterizes a difference in mouth closure expression data.
In the embodiment of the application, the labeled mouth closure data can be determined according to the product of the mouth closure expression base control parameters in the labeled expression base control parameters and the basic expression base data, and the predicted mouth closure data can be determined according to the product of the mouth closure expression base control parameters in the predicted expression base control parameters and the basic expression base data. Then, the second sub-loss value representing the difference of the mouth closure expression data can be determined according to the difference between the labeled mouth closure data and the predicted mouth closure data, for example using formula (4) above.
S911: determining a third sub-loss value according to the marked three-dimensional mouth shape data corresponding to the marked expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters; the third loss value characterizes a difference in the mouth shape data.
To make the output mouth shapes better synchronized with the audio and more stable over time, a network that detects audio-lip synchronization in video, SyncNet, can be used to provide a constraint; this network is dedicated to detecting whether the mouth shapes and the audio in a video are aligned. In practical applications, a differentiable renderer can be used to render the 3D vertices of the three-dimensional face data into 2D images with skin color, 5 frames can be concatenated into a video, and then the 5-frame video and the MFCC features of the corresponding duration can be input into SyncNet to obtain two features, denoted F_render and F_mfcc.
In the embodiment of the application, the labeled three-dimensional mouth shape data corresponding to the labeled expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters can be rendered to obtain a labeled image and a predicted image, and then feature extraction can be performed on the labeled image and the predicted image to obtain the labeled mouth shape features corresponding to the labeled expression base control parameters and the predicted mouth shape features corresponding to the predicted expression base control parameters. Further, the third sub-loss value, also called the lip-sync loss, may be determined based on the difference between the labeled and predicted mouth shape features. Specifically, the third sub-loss value may be determined using formula (5):
L_sync = (F_render - F_mfcc)^2    (5)
where F_render denotes the predicted mouth shape feature, F_mfcc denotes the labeled mouth shape feature, and L_sync denotes the third sub-loss value.
S913: and determining a fourth sub-loss value according to the difference value between the marked expression base control parameter and the predicted expression base control parameter.
In the embodiment of the application, the labeled expression base control parameters obtained by facial capture and the predicted expression base control parameters output by the improved decoder can be constrained so that they are as consistent as possible. Optionally, the fourth sub-loss value, namely the parameter loss, may be determined based on the difference between the labeled expression base control parameters and the predicted expression base control parameters. Specifically, the fourth sub-loss value may be determined using formula (6):
L_para = (a_i - c_i)^T (a_i - c_i)    (6)
where a_i denotes the predicted expression base control parameters, c_i denotes the labeled expression base control parameters, and L_para denotes the fourth sub-loss value.
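A sketch of the parameter loss of formula (6) of this training flow; the inputs are the per-frame labeled and predicted control parameter vectors.

```python
# Sketch of the parameter loss: L_para = (a_i - c_i)^T (a_i - c_i).
import numpy as np

def parameter_loss(a_i, c_i):
    diff = np.asarray(a_i) - np.asarray(c_i)   # predicted minus labeled parameters
    return float(diff @ diff)
```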
S915: and determining a fifth sub-loss value according to the difference value between the marked three-dimensional face data corresponding to the marked expression base control parameters and the predicted three-dimensional face data corresponding to the predicted expression base control parameters.
In the embodiment of the application, after the labeled expression base data are determined according to the product of the labeled expression base control parameters and the basic expression base data and the predicted expression base data are determined according to the product of the predicted expression base control parameters and the basic expression base data, the resulting labeled three-dimensional face data and predicted three-dimensional face data can be constrained so that they are as consistent as possible, supplementing the part that the first sub-loss value cannot constrain. Optionally, the fifth sub-loss value, namely the vertex loss, may be determined according to the difference between the labeled three-dimensional face data corresponding to the labeled expression base control parameters and the predicted three-dimensional face data corresponding to the predicted expression base control parameters. Specifically, the fifth sub-loss value may be determined using formula (7):
L_ver = (v - u)^T (v - u)    (7)
where v denotes the predicted three-dimensional face data, u denotes the labeled three-dimensional face data, and L_ver denotes the fifth sub-loss value; the predicted three-dimensional face data v and the labeled three-dimensional face data u can be obtained according to the calculation methods shown in formulas (1) and (2).
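A sketch of the vertex loss of formula (7); the three-dimensional face data are flattened before taking the squared difference.

```python
# Sketch of the vertex loss: L_ver = (v - u)^T (v - u) over all face vertices.
import numpy as np

def vertex_loss(v, u):
    diff = (np.asarray(v) - np.asarray(u)).ravel()   # predicted minus labeled 3D face data
    return float(diff @ diff)
```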
S917: Training the preset neural network model according to the first sub-loss value, the second sub-loss value, the third sub-loss value, the fourth sub-loss value and the fifth sub-loss value until a preset training ending condition is met, so as to obtain the speech-generated three-dimensional mouth shape model.
In the embodiment of the application, after the first sub-loss value, the second sub-loss value, the third sub-loss value, the fourth sub-loss value and the fifth sub-loss value are determined, the preset neural network model can be trained according to the weighted sum of the five sub-loss values until the preset training ending condition is met, so as to obtain the speech-generated three-dimensional mouth shape model. Specifically, the preset neural network model may be trained using the loss defined in equation (8):
L = ω_{land} L_{land} + ω_{mouth} L_{mouth} + ω_{sync} L_{sync} + ω_{para} L_{para} + ω_{ver} L_{ver}    (8)
where ω_{land} may represent the weight corresponding to the first sub-loss value, ω_{mouth} the weight corresponding to the second sub-loss value, ω_{sync} the weight corresponding to the third sub-loss value, ω_{para} the weight corresponding to the fourth sub-loss value, and ω_{ver} the weight corresponding to the fifth sub-loss value. The weights may take arbitrary values, and their relative magnitudes may be defined arbitrarily. In one example, ω_{land} = 10, ω_{mouth} = 20, ω_{sync} = 0.05, ω_{para} = 1, ω_{ver} = 1.
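Equation (8) combines the five sub-loss values into one training objective; a minimal sketch that uses the example weights quoted above as defaults (they are tunable hyperparameters, not fixed constants):

```python
def total_loss(l_land: float, l_mouth: float, l_sync: float,
               l_para: float, l_ver: float,
               w_land: float = 10.0, w_mouth: float = 20.0,
               w_sync: float = 0.05, w_para: float = 1.0,
               w_ver: float = 1.0) -> float:
    """Overall training objective L (equation 8): weighted sum of the
    five sub-loss values."""
    return (w_land * l_land + w_mouth * l_mouth + w_sync * l_sync
            + w_para * l_para + w_ver * l_ver)
```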
By adopting the training method for the speech-generated three-dimensional mouth shape model provided by the embodiment of the application, constraining the mouth key point data, the mouth closing expression data, the degree of matching between the mouth shape and the audio, the expression base control parameters and the three-dimensional face data can further improve the precision of the speech-generated three-dimensional mouth shape model and the accuracy of the generated three-dimensional mouth shapes.
S503: and carrying out fusion processing on the semantic features and the acoustic features.
In the embodiment of the application, after the semantic features and the acoustic features of the audio data to be processed are extracted, the semantic features, which carry high-level semantic information, and the acoustic features, which carry low-level audio information, can be fused to obtain a fusion processing result. For example, the semantic features and the acoustic features may be concatenated, and the concatenation result is taken as the fusion processing result.
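One simple way to realize the fusion described here is frame-wise concatenation; a hedged sketch, assuming both feature streams are already aligned to the same audio frames:

```python
import numpy as np

def fuse_features(semantic: np.ndarray, acoustic: np.ndarray) -> np.ndarray:
    """Concatenate per-frame semantic features (high-level semantic
    information) and acoustic features (low-level audio information) along
    the feature dimension. Inputs are assumed to have shapes
    (num_frames, dim_semantic) and (num_frames, dim_acoustic)."""
    if semantic.shape[0] != acoustic.shape[0]:
        raise ValueError("semantic and acoustic features must cover the same frames")
    return np.concatenate([semantic, acoustic], axis=-1)
```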
S505: and carrying out prediction processing on the expression base control parameters on the fusion processing result in a preset dimension range to obtain target expression base control parameters.
In the embodiment of the application, the target expression base control parameters comprise control parameters corresponding to the mouth key points, and the preset dimension range is a dimension range corresponding to the preset basic expression base.
In the embodiment of the application, the fusion processing result can be input to the decoder of the speech-generated three-dimensional mouth shape model, and the decoder performs prediction processing on the expression base control parameters within the preset dimension range to obtain the target expression base control parameters for each frame of audio in the audio data to be processed.
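The text does not disclose the decoder architecture; the sketch below, written in PyTorch, only illustrates the key constraint that the output dimension is tied to the number of preset basic expression bases (the preset dimension range). The hidden size, activation, and [0, 1] output range are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExpressionBaseDecoder(nn.Module):
    """Hypothetical decoder sketch: maps each fused per-frame feature to a
    vector of expression base control parameters whose length equals the
    number of preset basic expression bases."""

    def __init__(self, fused_dim: int, num_expression_bases: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fused_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_expression_bases),
            nn.Sigmoid(),  # keep each control parameter in [0, 1]
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (num_frames, fused_dim) -> (num_frames, num_expression_bases)
        return self.net(fused)
```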
S507: and determining target expression base data according to the target expression base control parameters and the basic expression base data.
According to the embodiment of the application, the target expression base data can be determined according to the product of the target expression base control parameter and the basic expression base data.
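Read literally, this step is a weighted (blendshape-style) combination of the basic expression bases; a minimal sketch, assuming each basic base is stored as per-vertex offsets from a neutral face (an assumed representation):

```python
import numpy as np

def target_expression_base_data(control_params: np.ndarray,
                                basic_bases: np.ndarray) -> np.ndarray:
    """Target expression base data as the product of the target expression
    base control parameters and the basic expression base data.
    control_params: (num_bases,) one control parameter per basic base.
    basic_bases: (num_bases, num_vertices, 3) per-base vertex offsets."""
    return np.tensordot(control_params, basic_bases, axes=1)  # (num_vertices, 3)
```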
S509: and processing the preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
In the embodiment of the application, after the target expression base data is obtained, a character model conforming to the basic expression base data can be deformed based on the target expression base data, that is, the standard three-dimensional face model in the expression library is deformed to obtain the target three-dimensional mouth shape data. Alternatively, a face model acquired in real time can be deformed to obtain the target three-dimensional mouth shape data.
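Continuing the offset-based reading above, deforming the preset three-dimensional mouth data then reduces to adding the offsets to the standard model's vertices; whether the system uses a standard model from the expression library or a face model captured in real time, the arithmetic is the same in this sketch:

```python
import numpy as np

def deform_preset_mouth(neutral_vertices: np.ndarray,
                        expression_offsets: np.ndarray) -> np.ndarray:
    """Apply the target expression base data to preset three-dimensional
    mouth/face data: the standard (neutral) mesh vertices are displaced by
    the per-vertex offsets computed above, giving the target mouth shape.
    Both arrays are assumed to share the shape (num_vertices, 3)."""
    return neutral_vertices + expression_offsets
```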
By adopting the data processing method provided by the embodiment of the application, the structure of the speech-generated three-dimensional mouth shape model is changed: a single encoder that takes linear predictive coding as input is replaced by two encoders that take semantic features and acoustic features as input, whose outputs are fused to predict the expression base control parameters. This not only improves the accuracy of model prediction and supports multiple languages and multiple scenes, but also reduces the dimension of the output data and lowers the data processing cost. By constraining the mouth key point data, the mouth closing expression data, the degree of matching between the mouth shape and the audio, the expression base control parameters and the three-dimensional face data, the precision of the speech-generated three-dimensional mouth shape model and the accuracy of the generated three-dimensional mouth shapes can be further improved.
Fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where, as shown in fig. 10, the data processing apparatus may include:
an obtaining module 1001, configured to obtain audio data to be processed, and extract semantic features and acoustic features of the audio data to be processed;
a fusion processing module 1003, configured to perform fusion processing on the semantic features and the acoustic features;
The prediction processing module 1005 is configured to perform prediction processing on the expression base control parameters on the fusion processing result in a preset dimension range, so as to obtain target expression base control parameters; the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is a dimension range corresponding to preset basic expression base data;
a first determining module 1007, configured to determine target expression base data according to the target expression base control parameter and the basic expression base data;
the second determining module 1009 is configured to process the preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
In the embodiment of the application, the data processing method is implemented based on a speech-generated three-dimensional mouth shape model;
the speech-generated three-dimensional mouth shape model comprises an encoder and a decoder;
the encoder is used for extracting semantic features and acoustic features of the audio data to be processed and carrying out fusion processing on the semantic features and the acoustic features;
the decoder is used for carrying out prediction processing on the expression base control parameters on the fusion processing result to obtain target expression base control parameters.
In an embodiment of the present application, a data processing apparatus includes: a voice generating three-dimensional mouth model training module;
The voice generation three-dimensional mouth model training module comprises:
the acquisition sub-module is used for acquiring the sample audio data and the corresponding labeling information; the labeling information comprises labeling expression base control parameters, and the labeling expression base control parameters are obtained based on sample video data corresponding to the sample audio data;
the first input sub-module is used for inputting the sample audio data to an encoder of a preset neural network model, extracting sample semantic features and sample acoustic features of the sample audio data through the encoder, and carrying out fusion processing on the sample semantic features and the sample acoustic features;
the second input sub-module is used for inputting the fusion processing result to a decoder of a preset neural network model, and carrying out prediction processing on the expression base control parameters in a preset dimension range through the decoder to obtain predicted expression base control parameters;
the first determining submodule is used for determining a first sub-loss value according to the marked expression base control parameter, the predicted expression base control parameter and the basic expression base data; the first sub-loss value represents the difference value of the mouth key point data;
the second determining submodule is used for determining a second sub-loss value according to the mouth closing expression base control parameter in the labeling expression base control parameter, the mouth closing expression base control parameter in the predicting expression base control parameter and the basic expression base data; the second sub-loss value represents a difference value of mouth closing expression data;
The third determining submodule is used for determining a third sub-loss value according to the labeled three-dimensional mouth shape data corresponding to the labeled expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters; the third sub-loss value characterizes the difference of the mouth shape data;
the training sub-module is used for training the preset neural network model according to the first sub-loss value, the second sub-loss value and the third sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by the voice.
In the embodiment of the application, the voice generation three-dimensional mouth model training module comprises:
a fourth determining sub-module, configured to determine a fourth sub-loss value according to a difference value between the labeled expression base control parameter and the predicted expression base control parameter;
a fifth determining submodule, configured to determine a fifth sub-loss value according to a difference value between the labeled three-dimensional face data corresponding to the labeled expression base control parameter and the predicted three-dimensional face data corresponding to the predicted expression base control parameter;
the training module is used for training the preset neural network model according to the first sub-loss value, the second sub-loss value, the third sub-loss value, the fourth sub-loss value and the fifth sub-loss value until the preset training ending condition is met, so as to obtain the three-dimensional mouth shape model generated by the voice.
In the embodiment of the application, a first determining submodule is used for determining the marked expression base data according to the product of the marked expression base control parameter and the basic expression base data;
determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining a first sub-loss value according to the product of the difference value of the mouth key point data in the marked expression base data and the mouth key point data in the predicted expression base data and the mouth key point control parameter in the marked expression base control parameter.
In the embodiment of the application, the second determining submodule is used for determining the marked expression base data according to the product of the marked expression base control parameter and the basic expression base data;
determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining a second sub-loss value according to the difference value between the mouth closing key point data in the marked expression base data and the mouth closing key point data in the predicted expression base data.
In the embodiment of the application, a third determining submodule is used for rendering the labeling three-dimensional mouth shape data corresponding to the labeling expression base control parameters and the prediction three-dimensional mouth shape data corresponding to the prediction expression base control parameters to obtain a labeling image and a prediction image;
Feature extraction processing can be carried out on the marked image and the predicted image to obtain marked mouth shape features corresponding to the marked expression base control parameters and predicted mouth shape features corresponding to the predicted expression base control parameters;
and determining a third sub-loss value according to the difference value of the predicted mouth shape characteristic and the marked mouth shape characteristic.
The device embodiments and the method embodiments in the present application are based on the same inventive concept.
The embodiment of the application provides an electronic device, which comprises a processor and a memory, wherein at least one instruction or at least one section of program is stored in the memory, and the at least one instruction or the at least one section of program is loaded and executed by the processor to realize the data processing method provided by the embodiment of the method.
Fig. 11 is a schematic hardware structure diagram of an electronic device for implementing the data processing method provided by the embodiment of the present application, where the electronic device may participate in forming, or may include, the data processing apparatus provided by the embodiment of the present application. As shown in fig. 11, the electronic device may include one or more processors 1101 (shown as 1101a and 1101b in the figure; the processor 1101 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1103 for storing data, and a transmission device 1105 for communication functions. In addition, the electronic device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, and/or a power source. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 11 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the electronic device may also include more or fewer components than shown in fig. 11, or have a different configuration than shown in fig. 11.
It should be noted that the one or more processors 1101 and/or other data processing circuits described above may be referred to generally as "data processing circuits" in the present application. The data processing circuit may be embodied in whole or in part as software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or may be incorporated in whole or in part into any of the other elements in the electronic device (or mobile device). As referred to in the embodiment of the application, the data processing circuit may act as a kind of processor control (for example, the selection of a variable resistance termination path connected to an interface).
The memory 1103 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the data processing method in the embodiment of the present application, and the processor 1101 executes the software programs and modules stored in the memory 1103, thereby performing various functional applications and data processing, that is, implementing the data processing method described above. The memory 1103 may include high speed random access memory (RAM) and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some possible embodiments, the memory 1103 may further include memory that is remotely located relative to the processor 1101, and the remote memory may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1105 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the electronic device. In one example, the transmission device 1105 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices through a base station to communicate with the internet. In one example, the transmission device 1105 may be a Radio Frequency (RF) module for communicating wirelessly with the internet.
The display may be, for example, a touch screen type liquid crystal display (LCD) that may enable a user to interact with a user interface of the electronic device (or mobile device).
Embodiments of the present application further provide a computer readable storage medium, which may be disposed in an electronic device and stores at least one instruction or at least one program related to the data processing method; the at least one instruction or the at least one program is loaded and executed by a processor to implement the data processing method provided by the above method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server among a plurality of network servers of the computer network. Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that: the order in which the embodiments of the application are presented is intended to be illustrative only and is not intended to limit the application to the particular embodiments disclosed, and other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in a different order in a different embodiment and can achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or the sequential order shown, to achieve desirable results, and in some embodiments, multitasking parallel processing may be possible or advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the descriptions of the apparatus and electronic device embodiments are relatively brief, since they are similar to the method embodiments; for relevant details, refer to the partial description of the method embodiments.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the application, such changes and modifications are also intended to be within the scope of the application.

Claims (11)

1. A method of data processing, comprising:
acquiring audio data to be processed, and extracting semantic features and acoustic features of the audio data to be processed;
carrying out fusion processing on the semantic features and the acoustic features;
carrying out prediction processing on expression base control parameters on the fusion processing result in a preset dimension range to obtain target expression base control parameters; the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is a dimension range corresponding to preset basic expression base data;
determining target expression base data according to the target expression base control parameters and the basic expression base data;
and processing preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
2. The method of claim 1, wherein the data processing method is implemented based on a speech-generated three-dimensional mouth shape model;
the speech-generated three-dimensional mouth shape model comprises an encoder and a decoder;
the encoder is used for extracting the semantic features and the acoustic features of the audio data to be processed and carrying out fusion processing on the semantic features and the acoustic features;
And the decoder is used for carrying out prediction processing on the expression base control parameters on the fusion processing result to obtain the target expression base control parameters.
3. The method of claim 2, wherein the training step of the speech-generated three-dimensional mouth shape model comprises:
acquiring sample audio data and corresponding labeling information; the annotation information comprises annotation expression base control parameters which are obtained based on sample video data corresponding to the sample audio data;
inputting the sample audio data to an encoder of a preset neural network model, extracting sample semantic features and sample acoustic features of the sample audio data by the encoder, and carrying out fusion processing on the sample semantic features and the sample acoustic features;
inputting the fusion processing result to a decoder of the preset neural network model, and carrying out prediction processing on expression base control parameters in the preset dimension range through the decoder to obtain predicted expression base control parameters;
determining a first sub-loss value according to the labeled expression base control parameter, the predicted expression base control parameter and the basic expression base data; the first sub-loss value represents a difference value of mouth key point data;
Determining a second sub-loss value according to the labeled expression base control parameter, the predicted expression base control parameter and the basic expression base data; the second sub-loss value represents the difference value of the mouth closing key point data;
determining a third sub-loss value according to the marked three-dimensional mouth shape data corresponding to the marked expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters; the third sub-loss value characterizes the difference of the mouth shape data;
training the preset neural network model according to the first sub-loss value, the second sub-loss value and the third sub-loss value until a preset training ending condition is met, so as to obtain the speech-generated three-dimensional mouth shape model.
4. The method of claim 3, wherein the step of training the speech-generated three-dimensional mouth model further comprises:
determining a fourth sub-loss value according to the difference value between the marked expression base control parameter and the predicted expression base control parameter;
determining a fifth sub-loss value according to the difference value between the marked three-dimensional face data corresponding to the marked expression base control parameters and the predicted three-dimensional face data corresponding to the predicted expression base control parameters;
training the preset neural network model according to the first sub-loss value, the second sub-loss value, the third sub-loss value, the fourth sub-loss value and the fifth sub-loss value until a preset training ending condition is met, so as to obtain the speech-generated three-dimensional mouth shape model.
5. The method of claim 3, wherein the determining a first sub-loss value from the labeled expression base control parameter, the predicted expression base control parameter, and the base expression base data comprises:
determining labeled expression base data according to the product of the labeled expression base control parameters and the basic expression base data;
determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining the first sub-loss value according to the product of the difference value between the mouth key point data in the marked expression base data and the mouth key point data in the predicted expression base data and the mouth key point control parameter in the marked expression base control parameter.
6. The method of claim 3, wherein the determining a second sub-loss value from the labeled expression base control parameter, the predicted expression base control parameter, and the base expression base data comprises:
Determining labeled expression base data according to the product of the labeled expression base control parameters and the basic expression base data;
determining predicted expression base data according to the product of the predicted expression base control parameters and the basic expression base data;
and determining the second sub-loss value according to the difference value between the mouth closing key point data in the marked expression base data and the mouth closing key point data in the predicted expression base data.
7. The method of claim 3, wherein determining the third sub-loss value according to the labeled three-dimensional mouth shape data corresponding to the labeled expression base control parameter and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameter comprises:
rendering the marked three-dimensional mouth shape data corresponding to the marked expression base control parameters and the predicted three-dimensional mouth shape data corresponding to the predicted expression base control parameters to obtain marked images and predicted images;
performing feature extraction processing on the labeling image and the predicted image to obtain labeling mouth shape features corresponding to the labeling expression base control parameters and predicted mouth shape features corresponding to the predicted expression base control parameters;
And determining the third sub-loss value according to the difference value of the predicted mouth shape characteristic and the marked mouth shape characteristic.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring the audio data to be processed and extracting semantic features and acoustic features of the audio data to be processed;
the fusion processing module is used for carrying out fusion processing on the semantic features and the acoustic features;
the prediction processing module is used for performing prediction processing on the expression base control parameters on the fusion processing result in a preset dimension range to obtain target expression base control parameters; the target expression base control parameters comprise control parameters corresponding to mouth key points, and the preset dimension range is a dimension range corresponding to preset basic expression base data;
the first determining module is used for determining target expression base data according to the target expression base control parameters and the basic expression base data;
and the second determining module is used for processing the preset three-dimensional mouth shape data based on the target expression base data to obtain target three-dimensional mouth shape data.
9. An electronic device comprising a processor and a memory, wherein the memory has stored therein at least one instruction or at least one program that is loaded and executed by the processor to implement the data processing method of any of claims 1-7.
10. A computer storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the data processing method of any of claims 1-7.
11. A computer program product comprising at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the data processing method of any of claims 1-7.
CN202210695394.1A 2022-06-16 2022-06-16 Data processing method and device, electronic equipment and storage medium Pending CN117012224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210695394.1A CN117012224A (en) 2022-06-16 2022-06-16 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210695394.1A CN117012224A (en) 2022-06-16 2022-06-16 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117012224A true CN117012224A (en) 2023-11-07

Family

ID=88564184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210695394.1A Pending CN117012224A (en) 2022-06-16 2022-06-16 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117012224A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination