CN116992309A - Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116992309A
CN116992309A (application CN202311243365.2A)
Authority
CN
China
Prior art keywords
data
audio
features
training
mouth shape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311243365.2A
Other languages
Chinese (zh)
Other versions
CN116992309B (en)
Inventor
王宁
宋凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qingying Feifan Software Technology Co ltd
Original Assignee
Suzhou Qingying Feifan Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qingying Feifan Software Technology Co ltd filed Critical Suzhou Qingying Feifan Software Technology Co ltd
Priority to CN202311243365.2A priority Critical patent/CN116992309B/en
Publication of CN116992309A publication Critical patent/CN116992309A/en
Application granted granted Critical
Publication of CN116992309B publication Critical patent/CN116992309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of machine learning technologies, and in particular, to a training method for a voice mouth shape synchronous detection model, an electronic device, and a storage medium. The method comprises the following steps: establishing a training data set, wherein the training data set comprises a data sample consisting of audio data and face picture data; respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio features and the mouth shape features of the sample data; determining similarity scores of the audio data and the face picture data in the data sample based on the audio features and the mouth shape features of the data sample; training by using the loss function to obtain a synchronous detection model; and synchronously detecting the audio and video based on the trained voice mouth shape synchronous detection model. The training method of the voice mouth shape synchronous detection model has the advantages of high detection efficiency and high accuracy in detecting audio and video synchronization.

Description

Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a training method for a voice mouth shape synchronous detection model, an electronic device, and a storage medium.
Background
Much of the audio-visual content in the prior art, such as movies and television shows, is produced by combining independently recorded video pictures and audio tracks. Such synthesized audio and video often suffers from audio-video synchronization problems, and as technology advances and people's expectations of video quality rise, viewers perceive out-of-sync audio and video ever more easily.
Therefore, how to detect the degree of synchronization between the audio and the images in audio and video, and thereby improve audio and video quality, is a problem that needs to be solved.
Disclosure of Invention
In order to overcome the defects in the prior art, the application aims to provide a training method for a voice mouth shape synchronous detection model, an electronic device and a storage medium, so as to train a voice mouth shape synchronous detection model and detect the degree of synchronization between audio and pictures.
In order to achieve the above object, the present application provides a training method for a voice mouth shape synchronous detection model, comprising:
a training data set is established, wherein the training data set comprises a data sample consisting of audio data with a fixed time length in audio and video and face picture data with a fixed frame number corresponding to the fixed time length;
respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio features and the mouth shape features of the sample data;
determining similarity scores of the audio data and the face picture data in the data sample based on the audio features and the mouth shape features of the data sample;
training by using a loss function based on the expected value of the data sample and the similarity score to obtain a voice mouth shape synchronous detection model;
and synchronously detecting the audio and video based on the trained voice mouth shape synchronous detection model.
Further, the specific step of establishing the training data set includes:
setting the audio and video to a preset frame rate, segmenting the sound in the audio and video into audio data of fixed-length time periods, and exporting a fixed number of frames of the audio and video in the time period corresponding to the audio data as picture data;
and carrying out face recognition on the picture data to obtain face picture data corresponding to the audio data.
Further, the specific steps of respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio feature and the mouth shape feature of the sample data include:
reading the audio data, and converting the audio data into corresponding Fbank characteristics;
and performing feature extraction on the Fbank features by using a first feature extraction network to obtain audio features.
Further, the specific step of performing feature extraction on the Fbank features by using the first feature extraction network to obtain audio features includes:
mapping the Fbank features into feature representations in an embedding space using a convolutional layer;
carrying out dimension normalization on the feature representation to obtain normalized feature representation;
the normalized feature representation is encoded using a Transformer encoder, resulting in audio features.
Further, the method further comprises:
carrying out dropout operation on the normalized feature representation to obtain dropout features;
the dropout features are encoded using a Transformer encoder, resulting in audio features.
Further, the specific steps of respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio feature and the mouth shape feature of the sample data include:
splicing the face picture data with the fixed frame number, and taking the spliced face picture data as image data to be extracted;
and performing feature extraction on the image data to be extracted by using a second feature extraction network to obtain mouth shape features.
Further, the specific step of extracting the features of the image data to be extracted by using the second feature extraction network to obtain the mouth shape features includes:
and performing feature extraction on the image data to be extracted by using a preset convolutional neural network to obtain mouth shape features.
Further, the specific step of determining the similarity score of the audio data and the face picture data in the data sample based on the audio feature and the mouth shape feature of the data sample includes:
expanding dimensions to match the audio features with the mouth shape features;
splicing the audio features and the mouth shape features to obtain comprehensive features of the data sample;
and evaluating the comprehensive characteristics by using a convolutional neural network to obtain similarity scores of the audio data and the face picture data in the data samples.
Further, the loss function is a mean square error loss.
Further, the method further comprises:
carrying out negative sampling operation on partial data samples in the training data set at random to divide the data samples into negative samples and positive samples, wherein the synchronous state of the audio data of the positive samples and the face picture data is synchronous, and the synchronous state of the audio data of the negative samples and the face picture data is asynchronous; the expected value of the data sample is a quantized value of the synchronization state of the data sample.
In order to achieve the above object, the present application provides an electronic device, including:
a processor;
a memory having stored thereon one or more computer instructions that execute on the processor;
the processor, when executing the computer instructions, performs the steps of the training method of the voice-mouth-shape synchronous detection model as described above.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the training method of a speech mouth-shape synchronous detection model as described above.
According to the training method of the voice mouth shape synchronous detection model, the audio and video data set is constructed, the feature extraction and the synchronous detection training are carried out on the audio and video data, and the voice mouth shape synchronous detection model is generated to detect the synchronous condition of the audio and video, so that the detection efficiency is high, the accuracy is high, and the practicability is high.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification; they illustrate the application and, together with the embodiments of the application, serve to explain the application, and do not limit the application. In the drawings:
FIG. 1 is a flow chart of a training method of a voice mouth shape synchronous detection model;
FIG. 2 is a schematic flow chart for creating a training data set;
FIG. 3 is a flow chart of obtaining audio features;
FIG. 4 is a schematic flow diagram of feature extraction of Fbank features using a first feature extraction network;
FIG. 5 is a schematic flow chart of obtaining a mouth shape feature;
FIG. 6 is a performance index diagram of the voice mouth shape synchronous detection model of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the application are shown in the drawings, it is to be understood that the application may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the application will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the application are for illustration purposes only and are not intended to limit the scope of the present application.
It should be understood that the various steps recited in the method embodiments of the present application may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the application is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that references to "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise. "A plurality" is understood to mean two or more.
Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings.
Example 1
In one embodiment of the present application, a training method of a voice mouth shape synchronous detection model is provided, fig. 1 is a flow chart of the training method of the voice mouth shape synchronous detection model, and the training method of the voice mouth shape synchronous detection model of the present application will be described in detail with reference to fig. 1, and referring to fig. 1, the method includes:
step S101: and establishing a training data set, wherein the training data set comprises data samples consisting of audio data with a fixed time length in audio and video and face picture data with a fixed frame number corresponding to the fixed time length.
Specifically, a large amount of audio and video is first recorded or collected, with content covering a variety of mouth shapes, pronunciations, speech rates and other variations. The audio and video data is then cleaned and classified: recordings with background noise, unclear mouth shapes or blurred images are removed using noise recognition, image blur detection and similar techniques, and the recordings are classified by language and accent using speech recognition technology, so that models for different languages or accents can be trained.
It can be understood that the larger the number of recordings and the richer their content, the better the performance of the subsequently trained voice mouth shape synchronous detection model can be ensured.
It will be appreciated that all audio and video of the same language or the same accent is used to train a model that detects the degree of synchronization of audio and video in that language or accent. For example, a model trained on English audio and video can only be used to detect the synchronization degree of English audio and video.
After the above-mentioned tasks are completed, establishment of the training data set begins. Fig. 2 is a schematic flow chart of creating the training data set; referring to fig. 2, the specific steps include:
S1011: Setting the audio and video to a preset frame rate, segmenting the sound in the audio and video into audio data of fixed-length time periods, and exporting a fixed number of frames of the audio and video in the time period corresponding to the audio data as picture data.
In this embodiment, a transcoding tool is used to process the audio and video: the frame rate is adjusted to a preset frame rate (e.g. 30 frames per second), each frame is extracted in sequence, and each frame is exported as a picture in RGB color space format. Meanwhile, the audio of the audio and video is converted to a single channel at a sampling rate of 16 kHz and divided into fragments of 0.2 seconds in duration; a trailing fragment shorter than 0.2 seconds is discarded. Since the frame rate is 30, there are 30 pictures per second, and one audio clip lasts 0.2 seconds, so one audio clip corresponds to 6 pictures. The audio clip is the audio data, and the 6 pictures corresponding to it are the picture data.
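As a minimal illustration of this segmentation step, the following Python sketch pairs each 0.2-second mono audio clip with the 6 frame files of the same time period. It assumes the frames have already been exported at the preset frame rate; the function name, file-path handling and use of torchaudio for loading and resampling are illustrative choices, not details fixed by this embodiment.

```python
import torchaudio

FPS = 30                 # preset frame rate used in this embodiment
CLIP_SECONDS = 0.2       # fixed-length audio segment
SAMPLE_RATE = 16000      # target sampling rate (16 kHz)
FRAMES_PER_CLIP = int(FPS * CLIP_SECONDS)  # 6 pictures per audio clip

def build_samples(audio_path, frame_paths):
    """Pair each 0.2 s mono audio clip with its 6 corresponding frame files."""
    waveform, sr = torchaudio.load(audio_path)      # (channels, time)
    waveform = waveform.mean(dim=0, keepdim=True)   # down-mix to a single channel
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    clip_len = int(CLIP_SECONDS * SAMPLE_RATE)      # 3200 samples per clip
    samples = []
    n_clips = waveform.shape[1] // clip_len         # trailing remainder is discarded
    for i in range(n_clips):
        clip = waveform[:, i * clip_len:(i + 1) * clip_len]
        frames = frame_paths[i * FRAMES_PER_CLIP:(i + 1) * FRAMES_PER_CLIP]
        if len(frames) == FRAMES_PER_CLIP:
            samples.append((clip, frames))
    return samples
```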
S1012: and carrying out face recognition on the picture data to obtain face picture data corresponding to the audio data.
Specifically, the exported pictures are further processed using face recognition technology: the face area in each picture is cropped out and scaled to a uniform size to serve as the face picture data.
In the present embodiment, the size of the picture is 256×256 pixels.
In this embodiment, the method further includes:
carrying out negative sampling operation on partial data samples in the training data set at random to divide the data samples into negative samples and positive samples, wherein the synchronous state of the audio data of the positive samples and the face picture data is synchronous, and the synchronous state of the audio data of the negative samples and the face picture data is asynchronous; the expected value of the data sample is a quantized value of the synchronization state of the data sample.
Specifically, after sample data comprising audio data and corresponding face picture data is obtained, a negative sampling operation is performed on a randomly selected portion of the sample data: the face picture data in those samples is randomly shifted in time so that it no longer corresponds to its original audio data. The samples subjected to this random time shift, i.e. the negatively sampled data, are negative samples, and the remaining samples are positive samples; the synchronization state of a positive sample's audio data and face picture data is labeled as synchronous, and that of a negative sample is labeled as asynchronous.
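The negative sampling step described above might be sketched as follows; the 50% negative ratio, the maximum shift of 5 clips and the strategy of borrowing the frame group of a nearby clip are illustrative assumptions rather than values specified by this embodiment.

```python
import random

def negative_sample(samples, neg_ratio=0.5, max_shift=5):
    """samples: list of (audio_clip, frame_group) pairs in temporal order.
    Returns (audio_clip, frame_group, expected_value) triples, where the expected
    value is 1.0 for synchronous (positive) and 0.0 for asynchronous (negative)."""
    labeled = []
    n = len(samples)
    for idx, (audio, frames) in enumerate(samples):
        if n > 1 and random.random() < neg_ratio:
            # random time shift: pair the audio with the frames of a different clip
            shift = random.randint(1, max_shift)
            shifted_frames = samples[(idx + shift) % n][1]
            labeled.append((audio, shifted_frames, 0.0))   # negative sample
        else:
            labeled.append((audio, frames, 1.0))           # positive sample
    return labeled
```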
Through step S101, a training data set comprising positive and negative samples is created.
step S102: and respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio features and the mouth shape features of the sample data.
Fig. 3 is a schematic flow chart of obtaining audio features, referring to fig. 3, the specific steps for obtaining audio features include:
s201: reading audio data, and converting the audio data into corresponding Fbank characteristic data;
specifically, after reading the audio clip, each waveform is expanded in dimension and then multiplied by a scaling factor to increase its magnitude to a suitable range. Then, invoking an FBank function of the torchaudio, kaldi library to convert each waveform into corresponding primary FBank characteristics; the converted FBank features are normalized by subtracting the mean and dividing by twice the standard deviation. Finally, an extra dimension is added through the unscqueze operation, so that the input requirement of the feature extraction network is adapted to be used as final FBank feature data.
Note that Fbank (FilterBank) is an audio processing algorithm that processes audio in a manner similar to a human ear, which can improve the performance of speech recognition. The general steps for obtaining Fbank features of a speech signal are: pre-emphasis, framing, windowing, short Time Fourier Transform (STFT), mel filtering, de-averaging, etc.
In other embodiments, the feature data of the corresponding algorithm may also be obtained using algorithms such as FFT (fast fourier transform), pitch (Pitch frequency), MFCC (mel-frequency cepstral coefficient), and PCEN (channel energy normalization), so as to perform feature extraction using the first feature extraction network in a subsequent step, to obtain audio features.
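A minimal sketch of the FBank conversion described above is shown below. The scaling factor and the number of Mel bins are assumptions made for illustration; only the overall sequence (scale, Kaldi-compatible fbank, mean/2·std normalization, unsqueeze) follows this embodiment.

```python
import torchaudio.compliance.kaldi as kaldi

def waveform_to_fbank(clip, scale=32768.0, num_mel_bins=80):
    """clip: (1, time) mono waveform at 16 kHz. Returns (1, frames, mel_bins)."""
    wav = clip * scale                                  # raise magnitude to a suitable range
    feats = kaldi.fbank(wav, num_mel_bins=num_mel_bins,
                        sample_frequency=16000.0)       # (frames, mel_bins)
    feats = (feats - feats.mean()) / (2 * feats.std())  # subtract mean, divide by 2x std
    return feats.unsqueeze(0)                           # extra dim for the network input
```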
S202: and performing feature extraction on the Fbank feature data by using a first feature extraction network to obtain audio features.
Fig. 4 is a schematic flow chart of feature extraction on Fbank features by using a first feature extraction network, and specific steps include:
s2021: mapping the FBank feature data into a feature representation in embedded space using a convolutional layer;
specifically, patch embedding (patch embedding) is performed on the FBank feature data, that is, local features are extracted by a convolution layer (nn. Conv2 d), where the input channel of the convolution layer is 1, the output channel of the convolution layer is 256, the convolution kernel size is 16×16, and the stride is 16, and the FBank feature is mapped into a representation in an embedding space (embedding), so as to obtain a feature representation:
s2022: carrying out dimension normalization on the feature representation to obtain normalized feature representation;
in this embodiment, in order to make the feature smoother in the channel dimension, a dimension normalization (LayerNorm) operation is performed on the feature representation to obtain a normalized feature representation, where the formula is as follows:
x̂ = γ · (x − μ) / σ + β
where x̂ is the normalized feature representation, x is the feature representation, μ is the mean, σ is the standard deviation, γ is a learnable scaling parameter used to scale the normalized values, and β is a learnable shift parameter used to translate the normalized values.
s2023: and carrying out dropout operation on the normalized characteristic representation to obtain dropout characteristics.
In the present embodiment, in order to prevent overfitting, a dropout operation is performed on the normalized feature representation to obtain the dropout features, with the dropout rate set to 0.1.
During the training of a deep learning network, dropout is a technique that temporarily drops neural network units from the network with a certain probability.
S2024: The dropout features are encoded using a Transformer encoder to obtain the audio features.
Specifically, the dropout features are encoded by a Transformer encoder, where the encoder's embed_dim is 256, num_heads is 8 and the number of encoder layers is 8; the output of the Transformer encoder is the audio features.
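The first feature extraction network might be sketched as follows. The patch-embedding convolution, LayerNorm, dropout rate and Transformer hyper-parameters follow the values given above; how the convolution output is flattened into a token sequence is an assumption of this sketch.

```python
import torch.nn as nn

class AudioFeatureNet(nn.Module):
    """Patch embedding -> LayerNorm -> dropout -> Transformer encoder."""
    def __init__(self, embed_dim=256, num_heads=8, num_layers=8, dropout=0.1):
        super().__init__()
        # patch embedding: 1 input channel, 256 output channels, 16x16 kernel, stride 16
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        self.norm = nn.LayerNorm(embed_dim)      # dimension normalization
        self.drop = nn.Dropout(dropout)          # dropout rate 0.1
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, fbank):                    # fbank: (B, 1, frames, mel_bins)
        x = self.patch_embed(fbank)              # (B, 256, frames/16, mel_bins/16)
        x = x.flatten(2).transpose(1, 2)         # (B, tokens, 256), assumed flattening
        x = self.drop(self.norm(x))              # LayerNorm, then dropout
        return self.encoder(x)                   # audio features: (B, tokens, 256)
```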
FIG. 5 is a schematic flow chart of obtaining the mouth shape characteristics, and the specific steps include:
s301: splicing face picture data with a fixed frame number, and taking the spliced face picture data as image data to be extracted;
specifically, 6 pictures in the face picture data are spliced into 6×3×256×256-format pictures to be used as the image data to be extracted.
S302: and performing feature extraction on the image data to be extracted by using a second feature extraction network to obtain mouth shape features.
Specifically, feature extraction is performed on image data to be extracted by using a preset convolutional neural network, so that mouth shape features are obtained.
The preset convolutional neural network sequentially comprises the following components:
an initial convolution block, wherein the convolution kernel size of the initial convolution block is 7x7, the output channel number is 64, and padding is 3;
the first residual block, the output channel number of the first residual block is 64, the convolution kernel size is 3x3, and padding is 1;
the first downsampling block, the output channel number of the first downsampling block is 64, the convolution kernel size is 3x3, and padding is 1;
the first convolution block, the output channel number of the first convolution block is 128, the convolution kernel size is 3x3, and padding is 1;
the second residual block, the output channel number of the second residual block is 128, the convolution kernel size is 3x3, and padding is 1;
the second downsampling block, the output channel number of the second downsampling block is 128, the convolution kernel size is 3x3, and padding is 1;
the second convolution block, the output channel number of the second convolution block is 128, the convolution kernel size is 3x3, and padding is 1;
the third residual block, the output channel number of the third residual block is 128, the convolution kernel size is 3x3, and padding is 1;
a third downsampling block, the output channel number of the third downsampling block is 128, the convolution kernel size is 3x3, and padding is 1;
the output channel number of the third convolution block is 128, the convolution kernel size is 3x3, and padding is 1;
a fourth residual block, the number of output channels of the fourth residual block is 128, the convolution kernel size is 3x3, and padding is 1;
a fourth downsampling block, the output channel number of the fourth downsampling block is 128, the convolution kernel size is 3x3, and padding is 1;
the output channel number of the fourth convolution block is 128, the convolution kernel size is 3x3, and padding is 1;
a fifth residual block, the number of output channels of the fifth residual block is 128, the convolution kernel size is 3x3, and padding is 1;
a fifth downsampling block, the output channel number of the fifth downsampling block is 128, the convolution kernel size is 3x3, and padding is 1;
and a fifth convolution block, wherein the output channel number of the fifth convolution block is 256, the convolution kernel size is 3x3, and padding is 1.
It should be noted that the residual block, the downsampling block and the convolution block follow the ResNet design. The residual block has the following structure: convolution layer + ReLU activation function + convolution layer + ReLU activation function.
The structure of the downsampling block is as follows: convolution layer + pooling layer + ReLU activation function.
The structure of the convolution block is as follows: convolution layer + ReLU activation function. A sketch of these blocks and their arrangement is given below.
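The following compact sketch assembles the second feature extraction network from the block structures stated above. The input channel count of 18 (6 stacked RGB frames), the skip connection inside the residual block and the max-pooling choice in the downsampling block are assumptions of this sketch rather than details fixed by this embodiment.

```python
import torch.nn as nn

def conv_block(cin, cout, k=3, p=1):
    """Convolution block: convolution layer + ReLU."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=p), nn.ReLU(inplace=True))

def down_block(cin, cout, k=3, p=1):
    """Downsampling block: convolution layer + pooling layer + ReLU."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=p),
                         nn.MaxPool2d(2), nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    """Residual block: (conv + ReLU + conv + ReLU), with an assumed skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(conv_block(channels, channels),
                                  conv_block(channels, channels))
    def forward(self, x):
        return x + self.body(x)

class MouthFeatureNet(nn.Module):
    def __init__(self, in_channels=18):           # 6 frames x 3 RGB channels (assumption)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, padding=3), nn.ReLU(inplace=True),  # initial block
            ResidualBlock(64),  down_block(64, 64),   conv_block(64, 128),    # stage 1
            ResidualBlock(128), down_block(128, 128), conv_block(128, 128),   # stage 2
            ResidualBlock(128), down_block(128, 128), conv_block(128, 128),   # stage 3
            ResidualBlock(128), down_block(128, 128), conv_block(128, 128),   # stage 4
            ResidualBlock(128), down_block(128, 128), conv_block(128, 256),   # stage 5
        )
    def forward(self, faces):                      # faces: (B, 18, 256, 256)
        return self.net(faces)                     # mouth features: (B, 256, 8, 8)
```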
S103: determining similarity scores of the audio data and the face picture data in the sample data based on the audio features and the mouth shape features of the sample data;
in this embodiment, the steps specifically include:
expanding dimensions to match the audio features with the mouth shape features;
illustratively, the audio features are expanded by one dimension and replicated multiple times in each dimension to achieve a shape that matches the mouth shape features.
Splicing the audio features and the mouth shape features to obtain comprehensive features of the data sample;
and evaluating the comprehensive characteristics by using a convolutional neural network to obtain similarity scores of the audio data and the face picture data in the data samples.
Specifically, the convolutional neural network sequentially comprises two convolutional layers: the first convolutional layer has 512 input channels, 256 output channels, a 3x3 convolution kernel and padding of 1, and uses a LeakyReLU activation function; the second convolutional layer has 256 input channels, 1 output channel, a 3x3 convolution kernel and padding of 1.
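A sketch of this scoring step is given below. Pooling the audio tokens into a single 256-dimensional vector before broadcasting it over the spatial grid of the mouth features, and squashing the final score into [0, 1] with a sigmoid, are assumptions made so that the 512-channel concatenation and the comparison against the 0/1 expected values are well defined.

```python
import torch
import torch.nn as nn

class SyncScorer(nn.Module):
    """Concatenate audio and mouth features, then score them with two conv layers."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1),   # first conv layer: 512 -> 256
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(256, 1, 3, padding=1),     # second conv layer: 256 -> 1
        )

    def forward(self, audio_feats, mouth_feats):
        # audio_feats: (B, tokens, 256); mouth_feats: (B, 256, H, W)
        b, _, h, w = mouth_feats.shape
        a = audio_feats.mean(dim=1)                    # (B, 256), assumed token pooling
        a = a[:, :, None, None].expand(b, 256, h, w)   # expand dims to match mouth feats
        fused = torch.cat([a, mouth_feats], dim=1)     # (B, 512, H, W) combined features
        score_map = self.head(fused)                   # (B, 1, H, W)
        return torch.sigmoid(score_map.mean(dim=(1, 2, 3)))  # similarity score in [0, 1]
```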
S104: training by using a loss function based on the expected value of the data sample and the similarity score to obtain a voice mouth shape synchronous detection model;
in the present embodiment, the loss function used is a mean square error loss, and the formula is as follows:
n is the number of data samples for the same training batch, yi is the similarity score for the ith data sample,for the expected value of the i-th data sample, i.e., the quantized value of the sync state, the exemplary sync state is a quantized value of 1 for sync and the sync state is a quantized value of 0 for non-sync.
It should be noted that the optimization is also performed using an adaptive moment estimation (Adam) optimization algorithm.
Through multiple rounds of training, the model can gradually learn the complex corresponding relation between the voice and the mouth shape, thereby improving the accuracy of mouth shape synchronous detection.
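A minimal training step using the mean square error loss and the Adam optimizer, under the sketches above, might look as follows; the batch layout and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_step(audio_net, mouth_net, scorer, batch, optimizer):
    """batch: (fbank, faces, expected), with expected values 1.0 (sync) or 0.0 (not sync)."""
    fbank, faces, expected = batch
    criterion = nn.MSELoss()                     # mean square error loss
    optimizer.zero_grad()
    audio_feats = audio_net(fbank)               # (B, tokens, 256)
    mouth_feats = mouth_net(faces)               # (B, 256, H, W)
    scores = scorer(audio_feats, mouth_feats)    # similarity scores, shape (B,)
    loss = criterion(scores, expected)           # compare against expected values
    loss.backward()                              # backpropagation
    optimizer.step()                             # Adam update
    return loss.item()

# Adam optimizer over all three sub-networks (learning rate is an assumed value)
# optimizer = torch.optim.Adam(
#     list(audio_net.parameters()) + list(mouth_net.parameters())
#     + list(scorer.parameters()), lr=1e-4)
```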
S105: and synchronously detecting the audio and video based on the trained voice mouth shape synchronous detection model.
Specifically, the voice mouth-shape synchronous detection model can be used for video quality inspection.
In other embodiments, it can also be used for virtual digital human model training.
Fig. 6 is a performance index chart of the voice mouth shape synchronous detection model of the present application. As shown in fig. 6, the performance of the model is evaluated on the LRS2 data set and on an in-house data set; its accuracy, precision, recall and F1 score are all better than those of the color_sync_net model, and its accuracy is improved by approximately 4% and 5% on the two data sets, respectively, compared with color_sync_net.
Example 2
In this embodiment, an electronic device is further provided, including a processor and a memory. The memory is used to store non-transitory computer readable instructions. The processor is configured to execute non-transitory computer readable instructions that, when executed by the processor, may perform one or more steps of the training method of the voice-mouth-sync detection model described above. The memory and processor may be interconnected by a bus system and/or other forms of connection mechanisms.
For example, the processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or other form of processing unit having data processing and/or program execution capabilities, such as a Field Programmable Gate Array (FPGA), or the like; for example, the Central Processing Unit (CPU) may be an X86 or ARM architecture, or the like.
For example, the memory may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer program modules may be stored on the computer readable storage medium and executed by the processor to perform various functions of the electronic device. Various applications and various data, as well as various data used and/or generated by the applications, etc., may also be stored in the computer readable storage medium.
It should be noted that, in the embodiment of the present application, specific functions and technical effects of the electronic device may refer to the description of the training method related to the voice mouth shape synchronous detection model in the above, which is not repeated herein.
Example 3
In this embodiment, there is also provided a computer-readable storage medium for storing non-transitory computer-readable instructions. For example, non-transitory computer readable instructions, when executed by a computer, may perform one or more steps in a training method according to the voice-mouth-sync detection model described above.
For example, the storage medium may be applied to the above-described electronic device. For example, the storage medium may be the memory in the electronic device of embodiment 2. For example, the relevant description of the storage medium may refer to the corresponding description of the memory in the electronic device of embodiment 2, which is not repeated here.
The storage medium (computer readable medium) of the present application may be a computer readable signal medium, a non-transitory computer readable storage medium, or any combination of the two. The non-transitory computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the non-transitory computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this document, a non-transitory computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a non-transitory computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), or the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The names of the units do not, in some cases, constitute a limitation of the units themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), etc.
The above description is only illustrative of some of the embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present application.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (12)

1. The training method of the voice mouth shape synchronous detection model is characterized by comprising the following steps of:
a training data set is established, wherein the training data set comprises a data sample consisting of audio data with a fixed time length in audio and video and face picture data with a fixed frame number corresponding to the fixed time length;
respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample to obtain the audio features and the mouth shape features of the sample data;
determining similarity scores of the audio data and the face picture data in the data sample based on the audio features and the mouth shape features of the data sample;
training by using a loss function based on the expected value of the data sample and the similarity score to obtain a voice mouth shape synchronous detection model;
and synchronously detecting the audio and video based on the trained voice mouth shape synchronous detection model.
2. The training method of the voice mouth shape synchronous detection model according to claim 1, wherein the specific step of creating a training data set comprises:
setting the audio and video to a preset frame rate, segmenting the sound in the audio and video into audio data of fixed-length time periods, and exporting a fixed number of frames of the audio and video in the time period corresponding to the audio data as picture data;
and carrying out face recognition on the picture data to obtain face picture data corresponding to the audio data.
3. The training method of the voice mouth shape synchronous detection model according to claim 1, wherein the specific steps of respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample, and obtaining the audio feature and the mouth shape feature of the sample data comprise:
reading the audio data, and converting the audio data into corresponding Fbank characteristics;
and performing feature extraction on the Fbank features by using a first feature extraction network to obtain audio features.
4. The method for training a voice mouth shape synchronous detection model according to claim 3, wherein the specific step of extracting features of the Fbank features by using a first feature extraction network to obtain audio features comprises:
mapping the Fbank features into feature representations in an embedding space using a convolutional layer;
carrying out dimension normalization on the feature representation to obtain normalized feature representation;
the normalized feature representation is encoded using a Transformer encoder, resulting in audio features.
5. The method for training a speech mouth sync detection model according to claim 4, further comprising:
carrying out dropout operation on the normalized feature representation to obtain dropout features;
the dropout features are encoded using a Transformer encoder, resulting in audio features.
6. The training method of the voice mouth shape synchronous detection model according to claim 1, wherein the specific steps of respectively constructing a feature extraction network to perform feature extraction on the audio data and the face picture data of the data sample, and obtaining the audio feature and the mouth shape feature of the sample data comprise:
splicing the face picture data with the fixed frame number, and taking the spliced face picture data as image data to be extracted;
and performing feature extraction on the image data to be extracted by using a second feature extraction network to obtain mouth shape features.
7. The method for training a synchronous voice-mouth-shape detection model according to claim 6, wherein the specific step of extracting features of the image data to be extracted by using a second feature extraction network to obtain mouth-shape features comprises:
and performing feature extraction on the image data to be extracted by using a preset convolutional neural network to obtain mouth shape features.
8. The method for training a voice-mouth-shape synchronous detection model according to claim 1, wherein the specific step of determining the similarity score of the audio data and the face picture data in the data sample based on the audio feature and the mouth-shape feature of the data sample comprises:
expanding dimensions to match the audio features with the mouth shape features;
splicing the audio features and the mouth shape features to obtain comprehensive features of the data sample;
and evaluating the comprehensive characteristics by using a convolutional neural network to obtain similarity scores of the audio data and the face picture data in the data samples.
9. The method of claim 1, wherein the loss function is a mean square error loss.
10. The method for training a speech mouth sync detection model according to claim 1, further comprising:
carrying out negative sampling operation on partial data samples in the training data set at random to divide the data samples into negative samples and positive samples, wherein the synchronous state of the audio data of the positive samples and the face picture data is synchronous, and the synchronous state of the audio data of the negative samples and the face picture data is asynchronous; the expected value of the data sample is a quantized value of the synchronization state of the data sample.
11. An electronic device, comprising:
a processor;
a memory having stored thereon one or more computer instructions that execute on the processor;
the processor, when executing the computer instructions, performs the steps of the training method of the speech mouth shape synchronous detection model according to any one of claims 1-10.
12. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps of the training method of the speech mouth sync detection model according to any of claims 1-10.
CN202311243365.2A 2023-09-26 2023-09-26 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium Active CN116992309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311243365.2A CN116992309B (en) 2023-09-26 2023-09-26 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311243365.2A CN116992309B (en) 2023-09-26 2023-09-26 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116992309A true CN116992309A (en) 2023-11-03
CN116992309B CN116992309B (en) 2023-12-19

Family

ID=88532420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311243365.2A Active CN116992309B (en) 2023-09-26 2023-09-26 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116992309B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598343A1 (en) * 2018-07-17 2020-01-22 Nokia Technologies Oy Method and apparatus for processing audio data
US20210303866A1 (en) * 2020-03-31 2021-09-30 Hefei University Of Technology Method, system and electronic device for processing audio-visual data
CN113825005A (en) * 2021-09-30 2021-12-21 北京跳悦智能科技有限公司 Face video and audio synchronization method and system based on joint training
CN114422825A (en) * 2022-01-26 2022-04-29 科大讯飞股份有限公司 Audio and video synchronization method, device, medium, equipment and program product
CN115376211A (en) * 2022-10-25 2022-11-22 北京百度网讯科技有限公司 Lip driving method, lip driving model training method, device and equipment
CN116437068A (en) * 2022-12-12 2023-07-14 北京飞讯数码科技有限公司 Lip synchronization test method and device, electronic equipment and storage medium
CN116386670A (en) * 2023-04-24 2023-07-04 京东方科技集团股份有限公司 Sound port synchronous identification method, training method and device of sound port synchronous identification network
CN116778040A (en) * 2023-08-17 2023-09-19 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHEN Shikai et al., "Development and Application of Big Data Processing Technology Based on Cloud Computing" (基于云计算的大数据处理技术发展与应用), University of Electronic Science and Technology of China Press, pages 195-200 *

Also Published As

Publication number Publication date
CN116992309B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN110136698B (en) Method, apparatus, device and storage medium for determining mouth shape
JP6198872B2 (en) Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN112153460B (en) Video dubbing method and device, electronic equipment and storage medium
CN111370019A (en) Sound source separation method and device, and model training method and device of neural network
CN113421547B (en) Voice processing method and related equipment
CN112037822B (en) Voice emotion recognition method based on ICNN and Bi-LSTM
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN113077470B (en) Method, system, device and medium for cutting horizontal and vertical screen conversion picture
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
CN114461852A (en) Audio and video abstract extraction method, device, equipment and storage medium
CN113870892A (en) Conference recording method, device, equipment and storage medium based on voice recognition
CN115481283A (en) Audio and video feature extraction method and device, electronic equipment and computer readable storage medium
CN114581812B (en) Visual language identification method and device, electronic equipment and storage medium
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN117237359B (en) Conveyor belt tearing detection method and device, storage medium and electronic equipment
CN116992309B (en) Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
CN116309975A (en) Digital person driving method, device, storage medium and computer equipment
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN115035887A (en) Voice signal processing method, device, equipment and medium
US20200312322A1 (en) Electronic device, method and computer program
CN114329070A (en) Video feature extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant