CN113111812A - Mouth action driving model training method and assembly - Google Patents

Mouth action driving model training method and assembly

Info

Publication number
CN113111812A
Authority
CN
China
Prior art keywords
mouth
model
picture
speech
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110424518.8A
Other languages
Chinese (zh)
Inventor
陈泷翔
刘炫鹏
王鑫宇
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202110424518.8A
Publication of CN113111812A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G06V40/165 - Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 - Vocoder architecture

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a mouth action driving model training method and component. The mouth action driving model converts text data into acoustic features, encodes the acoustic features into an audio file, and determines the mouth action pictures corresponding to the audio file; in other words, the model learns both speech synthesis and encoding and the matching of speech to pictures. Because the text corresponding to the speech in a video is used as training data while the speech synthesis and encoding capability is learned, the model learns the intonation and pauses of the speech that accompanies the pictures, so the synthesized speech stays consistent with the speech in the video. The trained mouth action driving model can therefore synthesize speech with natural pauses, and the speech and pictures are aligned in time when they are matched, which improves the matching accuracy of speech and pictures. The mouth action driving model training component provided by the application has the same technical effects.

Description

Mouth action driving model training method and assembly
Technical Field
The application relates to the technical field of computers, in particular to a mouth action driving model training method and assembly.
Background
In fields such as character image generation and the rendering of human-like character actions in electronic animation, matching mouth actions to voice is essential for making on-screen characters look real and natural, and completing the mapping from voice to mouth actions is the key to this problem.
Existing technology can be divided into rule-based methods and deep-learning-based methods.
Rule-based methods record the correspondence between phonemes and mouth movements, as provided by linguists, in a dictionary-like structure, and complete the mapping from sound to mouth movements by table lookup at run time. This approach depends heavily on manual effort: building the expert database is expensive, and the result is tailored to specific cases and cannot be flexibly applied to multiple scenarios.
Deep-learning-based methods feed acoustic features directly into a neural network and obtain the corresponding mouth action parameters from its output. This approach requires training a TTS speech synthesis model before training the mouth action driving model: the TTS model converts text into speech, which serves as the input of the mouth action driving model at application time, while the mouth action driving model itself is trained on the images and sounds in a video. Because the TTS model and the mouth action driving model are trained separately, the speech in the training video carries intonation and pauses that the speech produced by the TTS model can hardly reproduce (TTS training never learns the intonation and pauses of the speech that accompanies the pictures). As a result, the trained mouth action driving model cannot align speech with pictures, and the matching accuracy of speech and pictures is limited.
The mouth movements obtained by either kind of method are then further processed, for example by pixel rendering, to finally obtain a video animation in which the character's mouth movements match the sound.
How to make the mouth action driving model learn to align speech with pictures, and thereby improve the matching accuracy of speech and pictures, is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a mouth movement driving model training method and component, so that the mouth movement driving model learns the alignment capability of the speech and the picture, and the matching accuracy of the speech and the picture is improved. The specific scheme is as follows:
in a first aspect, the present application provides a mouth motion driving model training method, including:
acquiring a target video;
extracting sound and images in the target video;
acquiring text data corresponding to the sound;
extracting mouth action features corresponding to the text data from the image;
converting the text data into acoustic features by using an initial deep learning model, coding the acoustic features into an audio file, and determining a mouth action picture corresponding to the audio file;
calculating loss values of the mouth motion picture and the mouth motion characteristics;
if the loss value meets the requirement of model convergence, determining the initial deep learning model as a mouth action driving model; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
Preferably, the extracting, from the image, mouth motion features corresponding to the text data includes:
extracting key point information of the mouth from the image by using a face detection algorithm to serve as the mouth action feature;
or
Extracting mouth contour information from the image as the mouth action feature by using a three-dimensional model;
or
Extracting key point information of a mouth from the image by using a face detection algorithm;
extracting mouth contour information from the image using a three-dimensional model;
and fusing the mouth key point information and the mouth outline information to obtain fused information, and taking the fused information as the mouth action characteristics.
Preferably, before converting the text data into acoustic features by using the initial deep learning model, encoding the acoustic features into an audio file, and determining a mouth motion picture corresponding to the audio file, the method further includes:
segmenting the text data to obtain a plurality of text segments;
and respectively converting each text segment into a corresponding pronunciation.
Preferably, the determining the mouth motion picture corresponding to the audio file includes:
outputting the audio file in segments, and simultaneously outputting the picture frames corresponding to the segments;
downsampling the picture frame corresponding to each segment according to a preset time length to obtain the mouth action picture;
or
And outputting the audio file in segments, and outputting the mouth action picture by taking a preset time length as a period.
Preferably, the preset time length is the reciprocal of the frame rate of the image.
Preferably, the mouth motion driving model includes a vocoder and a TTS model.
Preferably, the TTS model is Tacotron-2, and the vocoder is a Griffin-Lim-algorithm-based encoder, WavRNN or MelGAN.
In a second aspect, the present application provides a mouth motion driving model training device, comprising:
the first acquisition module is used for acquiring a target video;
the first extraction module is used for extracting sound and images in the target video;
the second acquisition module is used for acquiring text data corresponding to the sound;
a second extraction module, configured to extract mouth motion features corresponding to the text data from the image;
the processing module is used for converting the text data into acoustic features by using an initial deep learning model, coding the acoustic features into an audio file, and determining a mouth action picture corresponding to the audio file;
a calculation module, configured to calculate loss values of the mouth motion picture and the mouth motion feature;
the training module is used for determining the initial deep learning model as a mouth action driving model if the loss value meets the model convergence requirement; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
Preferably, the second extraction module comprises:
a first extraction unit, configured to extract, as the mouth action feature, mouth key point information from the image by using a face detection algorithm;
or
A second extraction unit configured to extract mouth contour information from the image as the mouth action feature using a three-dimensional model;
or
A first extraction unit, configured to extract key point information of a mouth from the image by using a face detection algorithm;
a second extraction unit for extracting mouth contour information from the image using a three-dimensional model;
and the fusion unit is used for fusing the mouth key point information and the mouth outline information to obtain fusion information, and taking the fusion information as the mouth action characteristics.
Preferably, the method further comprises the following steps:
the segmentation module is used for segmenting the text data to obtain a plurality of text segments;
and the conversion module is used for converting each text segment into corresponding pronunciations respectively.
Preferably, the processing module comprises:
the output unit is used for outputting the audio file in segments and outputting the picture frames corresponding to the segments simultaneously;
the sampling unit is used for downsampling the picture frames corresponding to each segment according to the preset time length to obtain the mouth action picture;
or
And the periodic output unit is used for outputting the audio file in segments and outputting the mouth action picture by taking a preset time length as a period.
Preferably, the preset time length is the reciprocal of the frame rate of the image.
Preferably, the mouth motion driving model includes a vocoder and a TTS model.
Preferably, the TTS model is Tacotron-2, and the vocoder is a Griffin-Lim-algorithm-based encoder, WavRNN or MelGAN.
In a third aspect, the present application provides a computer device comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the mouth motion driven model training method disclosed in the foregoing.
In a fourth aspect, the present application provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the mouth motion driven model training method disclosed in the foregoing.
According to the scheme, the application provides a mouth motion driving model training method, which comprises the following steps: acquiring a target video; extracting sound and images in the target video; acquiring text data corresponding to the sound; extracting mouth action features corresponding to the text data from the image; converting the text data into acoustic features by using an initial deep learning model, coding the acoustic features into an audio file, and determining a mouth action picture corresponding to the audio file; calculating loss values of the mouth motion picture and the mouth motion characteristics; if the loss value meets the requirement of model convergence, determining the initial deep learning model as a mouth action driving model; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
Therefore, the mouth action driving model in the application can convert text data into acoustic features, encode the acoustic features into an audio file, and determine the mouth action picture corresponding to the audio file; that is, the model learns both the speech synthesis and encoding capability and the capability of matching speech to pictures. Because the text corresponding to the speech in the video is used as training data while the speech synthesis and encoding capability is learned, the model learns the intonation and pauses of the speech that accompanies the pictures, so the synthesized speech can stay consistent with the speech in the video. The trained mouth action driving model can therefore synthesize speech with natural pauses, and the speech and the pictures are aligned in time when they are matched, which improves the matching accuracy of speech and pictures.
Accordingly, the present application provides a mouth movement driving model training assembly (i.e., apparatus, device and readable storage medium) having the same technical effects as described above.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a mouth movement driving model training method disclosed in the present application;
FIG. 2 is a schematic flow chart of a model application disclosed herein;
FIG. 3 is a schematic diagram of a mouth movement driving model training device according to the present disclosure;
FIG. 4 is a schematic diagram of a computer apparatus disclosed herein;
fig. 5 is a schematic diagram of an interactive system disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, the TTS speech synthesis model and the mouth action driving model are trained separately. The speech in the video used to train the mouth action driving model carries natural intonation and pauses, but the speech output by the TTS speech synthesis model can hardly stay consistent with it (TTS training never learns the intonation and pauses of the speech that accompanies the pictures), so the trained mouth action driving model cannot align speech with pictures, and the matching accuracy of speech and pictures is limited. The application therefore provides a mouth action driving model training scheme that enables the model to learn to align speech with pictures and improves the matching accuracy of speech and pictures.
referring to fig. 1, a mouth motion driving model training method provided in an embodiment of the present application is described below, and an embodiment of the present application discloses a mouth motion driving model training method, including:
and S101, acquiring a target video.
S102, extracting sound and images in the target video.
The target video may be an animated video or a live-recorded video, preferably the latter. The sound in the target video is the speech spoken by the character in the video, which may include a small amount of recording noise. The images in the target video are the image frames of the video while the character is speaking.
S103, acquiring text data corresponding to the sound.
The text data corresponding to the sound is the text spoken by the character in the video.
S104, extracting mouth action features corresponding to the text data from the image.
The mouth action features are the image features of the mouth while the character in the video is speaking.
S105, converting the text data into acoustic features by using the initial deep learning model, encoding the acoustic features into an audio file, and determining the mouth action picture corresponding to the audio file.
The deep learning model can have any structure, such as a recurrent neural network or a convolutional neural network. With the text data as training data and the mouth action features as the learning target, the deep learning model learns the mapping from text data to mouth action features, yielding a mouth action driving model that both synthesizes speech and matches speech to pictures.
Since the mouth action driving model has two functions, it can be regarded as containing two functional blocks: one for synthesizing speech and one for matching speech to pictures. In one embodiment, the mouth action driving model includes a TTS model and a vocoder. The TTS model synthesizes speech; the vocoder encodes the speech and matches it to pictures.
In one embodiment, the TTS model is Tacotron-2 and the vocoder is a Griffin-Lim-algorithm-based encoder, WavRNN or MelGAN. Of course, other configurations of the TTS model and vocoder are possible.
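As a rough illustration only, the following PyTorch-style sketch composes the two functional blocks into a single model; the class, method and argument names are assumptions made for this example, not details taken from the patent.

```python
import torch
import torch.nn as nn

class MouthActionDrivingModel(nn.Module):
    """Minimal sketch: a TTS block maps text to acoustic features, and a
    vocoder block encodes them into audio while emitting mouth action frames."""
    def __init__(self, tts_model: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.tts = tts_model      # e.g. a Tacotron-2-style text-to-spectrogram network
        self.vocoder = vocoder    # e.g. a WavRNN/MelGAN-style network with an action branch

    def forward(self, text_ids: torch.Tensor):
        acoustic_features = self.tts(text_ids)                   # text -> mel spectrogram
        audio, mouth_actions = self.vocoder(acoustic_features)   # spectrogram -> PCM + action frames
        return audio, mouth_actions
```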
S106, calculating the loss value between the mouth action picture and the mouth action features.
The loss value between the mouth action picture and the mouth action features can be calculated with any loss function, for example a cross-entropy loss or an exponential loss.
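A minimal sketch of such a loss computation is shown below, assuming the predicted mouth action picture and the target mouth action features have been arranged into tensors of the same shape; mean squared error is used here as one reasonable choice for continuous action parameters, but any of the loss functions named above would fill the same slot.

```python
import torch
import torch.nn.functional as F

def mouth_action_loss(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # predicted / target: mouth action frames of identical shape, e.g. [T, D].
    # MSE suits continuous landmark or contour parameters; a cross-entropy loss
    # would suit discretized action classes instead.
    return F.mse_loss(predicted, target)
```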
S107, judging whether the loss value meets the requirement of model convergence; if yes, executing S108; if not, S109 is executed.
S108, determining the initial deep learning model as the mouth action driving model.
S109, the model parameters of the initial deep learning model are updated, and then S101 is executed.
After the model parameters of the initial deep learning model are updated, S101 is executed to iteratively train the updated model until the loss value meets the model convergence requirement. The convergence requirement may be set with a loss threshold, for example: if the loss value between the mouth action picture and the mouth action features is smaller than the loss threshold, the current loss value is considered to meet the convergence requirement. It may also be set with a change in the loss value, for example: if the loss value changes by less than an expected amount compared with the previous loss value, the current loss value is considered to meet the convergence requirement.
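Both convergence criteria can be expressed as a simple check like the one below; the threshold values are illustrative assumptions, not values from the patent.

```python
from typing import Optional

def loss_converged(loss: float, prev_loss: Optional[float],
                   loss_threshold: float = 1e-3,
                   rel_change_threshold: float = 1e-4) -> bool:
    # Criterion 1: the loss itself falls below an absolute threshold.
    if loss < loss_threshold:
        return True
    # Criterion 2: the loss barely changes compared with the previous iteration.
    if prev_loss is not None:
        if abs(prev_loss - loss) / max(abs(prev_loss), 1e-12) < rel_change_threshold:
            return True
    return False
```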
It can be seen that the mouth action driving model in this embodiment converts text data into acoustic features, encodes the acoustic features into an audio file, and determines the mouth action picture corresponding to the audio file; that is, the model learns both the speech synthesis and encoding capability and the capability of matching speech to pictures. Because the text corresponding to the speech in the video is used as training data while the speech synthesis and encoding capability is learned, the model learns the intonation and pauses of the speech that accompanies the pictures, so the synthesized speech can stay consistent with the speech in the video. The trained mouth action driving model can therefore synthesize speech with natural pauses, and the speech and the pictures are aligned in time when they are matched, which improves the matching accuracy of speech and pictures.
Based on the above embodiments, it should be noted that extracting the mouth action features corresponding to the text data from the image includes: extracting mouth key point information (landmarks) from the image with a face detection algorithm and using it as the mouth action features; or extracting mouth contour information from the image with a three-dimensional model and using it as the mouth action features; or extracting mouth key point information (landmarks) from the image with a face detection algorithm, extracting mouth contour information from the image with a three-dimensional model, and fusing the key point information and the contour information, the fused information being used as the mouth action features.
The face detection algorithm can be any algorithm capable of identifying mouth key points; it usually extracts feature data in two-dimensional coordinates, so the extracted features lack three-dimensional information. Feature data extracted with a three-dimensional model contains three-dimensional information but is relatively less accurate. Therefore, to make the mouth action features more effective, the mouth key point information and the mouth contour information can be fused: during fusion only one copy of any repeated information is kept, while the non-repeated information is retained so that the two sources complement each other.
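The fusion step can be as simple as concatenating the two feature sets; the sketch below is a minimal illustration, and the array shapes and the suggested de-duplication strategy are assumptions rather than details from the patent.

```python
import numpy as np

def fuse_mouth_features(landmarks_2d: np.ndarray, contour_3d: np.ndarray) -> np.ndarray:
    """Fuse 2-D mouth key points from a face detector with mouth contour
    parameters from a 3-D model.

    landmarks_2d: (K, 2) pixel coordinates of mouth key points
    contour_3d:   (M,)   contour/shape parameters from the 3-D model
    """
    # Concatenation keeps the precise image-plane positions from the detector
    # and the depth-aware information from the 3-D model. A fuller system would
    # also drop one copy of any overlapping (repeated) information, for example
    # by projecting the 3-D contour and discarding redundant coordinates.
    return np.concatenate([landmarks_2d.reshape(-1), contour_3d.reshape(-1)])
```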
Based on the foregoing embodiments, before converting text data into acoustic features by using an initial deep learning model, encoding the acoustic features into an audio file, and determining a mouth motion picture corresponding to the audio file, the method further includes: segmenting text data to obtain a plurality of text segments; and respectively converting each text segment into a corresponding pronunciation.
Therefore, before the text data is processed, it can be segmented into text segments, each text segment can be converted into its corresponding pronunciation, and the pronunciations can then replace the text segments as the input to the initial deep learning model. The pronunciation may include pinyin, prosody, phonetic symbols, and the like.
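For Chinese text, one way to perform this conversion is with the pypinyin package, as sketched below; the function name is an assumption, and prosody marks and English phonetic symbols would need separate handling.

```python
from pypinyin import lazy_pinyin, Style

def segments_to_pronunciations(text_segments):
    # Convert each text segment to pinyin with tone numbers, e.g. "你好" -> "ni3 hao3".
    return [" ".join(lazy_pinyin(segment, style=Style.TONE3)) for segment in text_segments]

# Example: segments_to_pronunciations(["你好", "世界"]) -> ["ni3 hao3", "shi4 jie4"]
```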
Based on the above embodiments, it should be noted that determining a mouth motion picture corresponding to an audio file includes: outputting the audio file in segments, and simultaneously outputting the picture frames corresponding to the segments; downsampling the picture frame corresponding to each segment according to the preset time length to obtain a mouth action picture; or outputting the audio file in segments, and outputting the mouth action pictures with the preset time length as a period.
In one embodiment, the preset time length is the reciprocal of the frame rate of the image. For example, a frame rate of 50 fps means that 50 picture frames are transmitted per second, so each frame occupies 20 ms and every 20 ms of audio corresponds to one picture frame. Setting the preset time length to the reciprocal of the frame rate therefore makes the audio output of each segment correspond to a picture, i.e., the audio and the pictures are aligned in time.
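The alignment arithmetic is straightforward; the sample rate below is an assumed value used only for illustration.

```python
frame_rate = 50                # picture frames per second extracted from the video
sample_rate = 16000            # audio samples per second (assumed)

frame_duration_s = 1.0 / frame_rate            # 0.02 s, i.e. 20 ms per picture frame
samples_per_frame = sample_rate // frame_rate  # 320 audio samples per mouth action frame
```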
It should be noted that an autoregressive vocoder can match speech to pictures as described below. At each step the vocoder outputs one number, and the number output at the previous step is fed back as the current input, i.e. y_t = f(state, y_{t-1}); because the state is maintained internally during operation, from the outside this looks like y_t = f(y_{t-1}). If action parameters must be output at the same time, the action parameters form a vector, denoted z. One way is to output them at every step, i.e. (y_t, z_t) = f(y_{t-1}); the action parameters need no autoregression (the previous step's z_{t-1} is not fed back into the vocoder), and z is downsampled when finally used. For example, 1 s of audio yields 16000 action-parameter frames while only 25 are needed, so 25 frames are taken at equal intervals.
This method clearly wastes computation, so an alternative is to split off the last autoregressive layer of the vocoder: the audio-sample generation part is unchanged, but an extra branch, a small network z = g(state), is added to generate the action parameters. The small network only runs when the state is fed into it, so we can control how many audio samples are generated and then query g with the current state to obtain the action parameters matched to it, with no wasted computation.
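A minimal PyTorch sketch of this second approach follows; the layer sizes, the use of a GRU cell, and the assumption that the acoustic frames have already been upsampled to the audio sample rate are illustrative choices rather than details from the patent.

```python
import torch
import torch.nn as nn

class VocoderWithActionBranch(nn.Module):
    """Autoregressive audio path plus a small side network z = g(state) that is
    queried only once per fixed number of generated samples."""
    def __init__(self, feat_dim=80, state_dim=512, action_dim=25):
        super().__init__()
        self.action_dim = action_dim
        self.rnn = nn.GRUCell(feat_dim + 1, state_dim)  # f(state, y_{t-1}): previous sample + acoustic frame
        self.to_sample = nn.Linear(state_dim, 1)        # audio branch: next sample y_t
        self.to_action = nn.Sequential(                 # action branch: z = g(state)
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, acoustic_frames, samples_per_action=320):
        # acoustic_frames: (batch, steps, feat_dim), assumed upsampled to the audio rate.
        batch, steps, _ = acoustic_frames.shape
        state = acoustic_frames.new_zeros(batch, self.rnn.hidden_size)
        prev_sample = acoustic_frames.new_zeros(batch, 1)
        audio, actions = [], []
        for t in range(steps):
            rnn_input = torch.cat([acoustic_frames[:, t], prev_sample], dim=-1)
            state = self.rnn(rnn_input, state)
            prev_sample = self.to_sample(state)          # y_t
            audio.append(prev_sample)
            if (t + 1) % samples_per_action == 0:        # e.g. once per 20 ms of audio at 16 kHz
                actions.append(self.to_action(state))    # query g(state) only here
        audio = torch.cat(audio, dim=-1)                 # (batch, steps)
        actions = (torch.stack(actions, dim=1) if actions
                   else acoustic_frames.new_zeros(batch, 0, self.action_dim))
        return audio, actions
```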
Based on the above embodiments, it should be noted that after the mouth action driving model is obtained by training, it can be applied to match corresponding mouth action image data to any segment of speech. Refer to fig. 2, which illustrates the model application process.
S201, acquiring a text to be processed;
S202, inputting the text to be processed into the mouth action driving model to obtain the corresponding mouth action image data;
S203, displaying the text to be processed and the mouth action image data.
The mouth movement driving model and the related execution steps in this embodiment can refer to the related descriptions of the above embodiments, and are not described herein again.
Thus the mouth action driving model in this embodiment takes text as input, can synthesize speech with natural pauses from the text, and aligns the speech and the pictures in time when matching them, which improves the matching accuracy of speech and pictures.
Based on the deep learning approach, a training scheme and an application scheme for the mouth action driving model are given below. The training scheme comprises: recording video data; processing the video data to obtain acoustic features, mouth action parameters and text; and training the mouth action driving model. The application scheme comprises: processing text with the mouth action driving model to obtain action parameters, yielding speech and actions that are synchronized on the time axis. The details of processing the video data to obtain the acoustic features and mouth action parameters are as follows:
The recorded video data is split into an audio file and image frames. Acoustic features are obtained from the audio file by signal processing; the features can be an amplitude spectrum, a Mel spectrum, Mel-cepstral coefficients, and the like. Mouth action parameters are obtained from the image frames either by key point detection (landmarks) or by three-dimensional modeling. Meanwhile, the text is obtained from the annotation of the audio file; Chinese text can be converted into pinyin and prosody, and English text into phonetic symbols.
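A minimal example of the signal-processing step, computing a log-Mel spectrogram with librosa, is sketched below; the sample rate and n_mels values are assumptions rather than values from the patent, and the mouth-action-parameter extraction from the image frames is not shown.

```python
import librosa
import numpy as np

def extract_acoustic_features(audio_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    # Load the audio track split out of the video and compute a log-Mel spectrogram;
    # an amplitude spectrum or Mel-cepstral coefficients could be substituted here.
    audio, _ = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6).T   # shape (T, n_mels): one acoustic feature frame per row
```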
Training process: a TTS model (using the Tacotron-2 structure) is trained on the text, and a vocoder is trained synchronously on the output of the TTS model, so the TTS model and the vocoder are trained jointly and can be regarded externally as two components of one model. The vocoder may be the Griffin-Lim algorithm, which needs no model training, or a more sophisticated WavRNN or MelGAN; its input is acoustic features and its output is PCM code (from which an audio file can be generated). To match the action parameters with the speech in time, the vocoder is modified so that, while continuously outputting PCM code, it outputs one frame of action parameters at every fixed interval (for example, one action-parameter frame per 20 ms of generated audio when the frame rate is 50 fps; the interval depends on the required frame rate).
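One joint training step might look like the sketch below, assuming a trainable vocoder with an action branch as in the earlier sketch; the choice of loss terms and their equal weighting are assumptions, and optimizer setup and data loading are omitted.

```python
import torch.nn.functional as F

def joint_training_step(tts, vocoder, optimizer, text_ids, target_mel, target_actions):
    pred_mel = tts(text_ids)                    # text -> acoustic features (Tacotron-2-style)
    _pcm, pred_actions = vocoder(pred_mel)      # acoustic features -> PCM samples + action frames
    loss = F.mse_loss(pred_mel, target_mel) + F.mse_loss(pred_actions, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```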
The corresponding application process is: input the text into the TTS model and the vocoder to obtain the action parameters.
It should be noted that the software environment for training the mouth action driving model may be a Python environment supporting TensorFlow or PyTorch. When the model is applied, the software environment can be kept consistent with the training phase, or the model can be rewritten for another software framework better suited to the model, to reduce deployment cost. If the application stage has streaming requirements, the model structure should satisfy the following conditions: any recurrent neural network must be unidirectional, and any convolutional neural network must not have too large a receptive field (sliding window).
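These streaming constraints translate into module choices like the following; the layer sizes are illustrative only.

```python
import torch.nn as nn

# Recurrent parts must be unidirectional so that no future frames are needed.
streaming_rnn = nn.GRU(input_size=80, hidden_size=256, bidirectional=False, batch_first=True)

# Convolutional parts should keep a small receptive field (sliding window),
# e.g. a kernel covering 3 frames rather than a very wide context.
streaming_conv = nn.Conv1d(in_channels=80, out_channels=256, kernel_size=3)
```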
If multi-timbre support is needed, speech in multiple timbres must be collected and the Tacotron-2 structure modified so that it can control the timbre of the generated speech according to an input label. For example, if a condition controlling the timbre class of the input and output is added, the network at the input position must be widened: if the input was originally 80-dimensional, the input layer dimension is 80; supporting 10 timbres adds a 10-dimensional condition, widening the input to 90 dimensions.
A speaker-related vector such as an i-vector or x-vector can also be used as the condition; each person's i-vector is unique. The simplest condition is a one-hot vector: [1,0,0,0,0,0] indicates the first of 6 classes. Originally y = f(x), where x is the phoneme sequence and y is the spectrum; this now becomes y = f(x, c), where c is the condition. Rather than directly controlling the effect of individual input dimensions, the condition is applied indirectly through a multi-layer network.
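Concretely, the widening amounts to concatenating the condition vector with the original input, as in the sketch below; the tensor shapes follow the 80-dimensional / 10-timbre example above, and the variable names are assumptions.

```python
import torch

input_frame = torch.randn(1, 80)                         # x: the original 80-dimensional input
condition = torch.zeros(1, 10)                           # c: 10-dimensional one-hot timbre condition
condition[0, 2] = 1.0                                    # select the third of ten timbres
conditioned_input = torch.cat([input_frame, condition], dim=-1)   # shape (1, 90), fed to y = f(x, c)
```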
Therefore, the method and device need no hand-crafted rules, and the whole process can be automated. Training is end-to-end, simple to understand, and convenient to optimize. The online VC model is highly flexible and shortens the update cycle of the whole system; the speech and the corresponding action sequence are generated jointly, which solves the problem of the speech and the action sequence being out of sync on the time axis.
In the following, a mouth movement driving model training device provided by an embodiment of the present application is described, and a mouth movement driving model training device described below and a mouth movement driving model training method described above may be referred to each other.
Referring to fig. 3, an embodiment of the present application discloses a mouth motion driving model training device, including:
a first obtaining module 301, configured to obtain a target video;
a first extraction module 302, configured to extract sound and images in a target video;
a second obtaining module 303, configured to obtain text data corresponding to a sound;
a second extraction module 304, configured to extract mouth motion features corresponding to the text data from the image;
the processing module 305 is configured to convert the text data into acoustic features by using the initial deep learning model, encode the acoustic features into an audio file, and determine a mouth action picture corresponding to the audio file;
a calculating module 306, configured to calculate loss values of the mouth motion picture and the mouth motion feature;
a training module 307, configured to determine the initial deep learning model as a mouth motion driving model if the loss value meets the model convergence requirement; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
In one embodiment, the second extraction module includes:
the first extraction unit is used for extracting key point information of a mouth from the image as a mouth action characteristic by using a face detection algorithm;
or
A second extraction unit for extracting mouth contour information from the image as a mouth action feature using the three-dimensional model;
or
The first extraction unit is used for extracting key point information of a mouth from the image by using a face detection algorithm;
a second extraction unit for extracting mouth contour information from the image using the three-dimensional model;
and the fusion unit is used for fusing the key point information of the mouth and the contour information of the mouth to obtain fusion information, and taking the fusion information as the mouth action characteristics.
In a specific embodiment, the method further comprises the following steps:
the segmentation module is used for segmenting the text data to obtain a plurality of text segments;
and the conversion module is used for converting each text segment into corresponding pronunciations respectively.
In one embodiment, the processing module comprises:
the output unit is used for outputting the audio file fragment segments and outputting the picture frame corresponding to each segment;
the sampling unit is used for carrying out downsampling on the picture frame corresponding to each segment according to the preset time length so as to obtain a mouth part action picture;
or
And the periodic output unit is used for outputting the audio file in segments and outputting the mouth action picture with the preset time length as the period.
In one embodiment, the preset time length is the reciprocal of the frame rate of the image.
In one embodiment, the mouth motion driver models include vocoder and TTS models.
In one embodiment, the TTS model is Tacotron-2 and the vocoder is a Griffin-Lim-algorithm-based encoder, WavRNN or MelGAN.
For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.
Therefore, the embodiment provides a mouth motion driving model training device, and the mouth motion driving model trained by the device can synthesize the voice with speech pause, and accordingly, when the voice is matched with the picture, the voice and the picture are aligned in time, and the matching accuracy of the voice and the picture is improved.
In the following, a computer device provided by the embodiments of the present application is introduced, and a computer device described below and a mouth motion driving model training method and apparatus described above may be referred to each other.
Referring to fig. 4, an embodiment of the present application discloses a computer device, including:
a memory 401 for storing a computer program;
a processor 402 for executing the computer program to implement the mouth motion driving model training method disclosed in any of the foregoing embodiments.
A readable storage medium provided by the embodiments of the present application is described below, and a readable storage medium described below and a mouth motion driving model training method, apparatus and device described above may be referred to each other.
A readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the mouth motion driving model training method disclosed in the previous embodiment. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
The mouth motion driving model training method provided by the present application is described in detail below with reference to specific application scenarios, and it should be noted that the mouth motion driving model obtained by training may be used for animation production, specifically: the mouth movement of the character in the animation is controlled by using the model.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The mouth motion driving model training method provided by the embodiment of the application can be applied to an interactive system as shown in fig. 5. The interactive system comprises a terminal device 101 and a server 102, wherein the server 102 is in communication connection with the terminal device 101. The server 102 may be a conventional server or a cloud server, and is not limited herein.
The terminal device 101 may be various electronic devices that have a display screen, a mouth motion driving model training module, a shooting camera, an audio input/output function, and support data input, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a kiosk, a wearable electronic device, and the like. Specifically, the data input may be inputting voice based on a voice module provided on the electronic device, inputting characters based on a character input module, and the like.
The terminal device 101 may have a client application installed on it, and the user may trigger the training method through the client application (for example, an APP or a WeChat applet). A user may register a user account with the server 102 through the client application and communicate with the server 102 under that account: the user logs in to the account in the client application and inputs information through it, such as text or voice. After receiving the information input by the user, the client application sends it to the server 102, which receives, processes and stores the information and may return corresponding output information to the terminal device 101.
In some embodiments, the apparatus for implementing the training method may also be disposed on the terminal device 101, so that the terminal device 101 may implement interaction with the user without relying on the server 102 to establish communication, and at this time, the interactive system may only include the terminal device 101.
References in this application to "first," "second," "third," "fourth," etc., if any, are intended to distinguish between similar elements and not necessarily to describe a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, or apparatus.
It should be noted that the descriptions in this application referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present application.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A mouth motion driving model training method, comprising:
acquiring a target video;
extracting sound and images in the target video;
acquiring text data corresponding to the sound;
extracting mouth action features corresponding to the text data from the image;
converting the text data into acoustic features by using an initial deep learning model, coding the acoustic features into an audio file, and determining a mouth action picture corresponding to the audio file;
calculating loss values of the mouth motion picture and the mouth motion characteristics;
if the loss value meets the requirement of model convergence, determining the initial deep learning model as a mouth action driving model; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
2. The method of claim 1, wherein extracting mouth motion features corresponding to the text data from the image comprises:
extracting key point information of the mouth from the image by using a face detection algorithm to serve as the mouth action feature;
or
Extracting mouth contour information from the image as the mouth action feature by using a three-dimensional model;
or
Extracting key point information of a mouth from the image by using a face detection algorithm;
extracting mouth contour information from the image using a three-dimensional model;
and fusing the mouth key point information and the mouth outline information to obtain fused information, and taking the fused information as the mouth action characteristics.
3. The method of claim 1, wherein before converting the text data into acoustic features using the initial deep learning model, encoding the acoustic features into an audio file, and determining a mouth motion picture corresponding to the audio file, the method further comprises:
segmenting the text data to obtain a plurality of text segments;
and respectively converting each text segment into a corresponding pronunciation.
4. The method of claim 1, wherein the determining the mouth action picture corresponding to the audio file comprises:
outputting the audio file in segments, and simultaneously outputting the picture frames corresponding to the segments;
downsampling the picture frame corresponding to each segment according to a preset time length to obtain the mouth action picture;
or
And outputting the audio file in segments, and outputting the mouth action picture by taking a preset time length as a period.
5. The method of claim 4, wherein the preset time length is the reciprocal of the frame rate of the image.
6. The method of claim 1, wherein the mouth motion driver models comprise vocoder and TTS models.
7. The method of claim 6, wherein the TTS model is Tacotron-2 and the vocoder is a Griffin-Lim-algorithm-based encoder, WavRNN or MelGAN.
8. A mouth motion driven model training device, comprising:
the first acquisition module is used for acquiring a target video;
the first extraction module is used for extracting sound and images in the target video;
the second acquisition module is used for acquiring text data corresponding to the sound;
a second extraction module, configured to extract mouth motion features corresponding to the text data from the image;
the processing module is used for converting the text data into acoustic features by using an initial deep learning model, coding the acoustic features into an audio file, and determining a mouth action picture corresponding to the audio file;
a calculation module, configured to calculate loss values of the mouth motion picture and the mouth motion feature;
the training module is used for determining the initial deep learning model as a mouth action driving model if the loss value meets the model convergence requirement; otherwise, after the model parameters of the initial deep learning model are updated, iterative training is carried out on the updated initial deep learning model until the loss value meets the model convergence requirement.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method of any one of claims 1 to 7.
10. A readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the method of any one of claims 1 to 7.
CN202110424518.8A 2021-04-20 2021-04-20 Mouth action driving model training method and assembly Pending CN113111812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424518.8A CN113111812A (en) 2021-04-20 2021-04-20 Mouth action driving model training method and assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110424518.8A CN113111812A (en) 2021-04-20 2021-04-20 Mouth action driving model training method and assembly

Publications (1)

Publication Number Publication Date
CN113111812A true CN113111812A (en) 2021-07-13

Family

ID=76719182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424518.8A Pending CN113111812A (en) 2021-04-20 2021-04-20 Mouth action driving model training method and assembly

Country Status (1)

Country Link
CN (1) CN113111812A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN116051692A (en) * 2023-04-03 2023-05-02 成都索贝数码科技股份有限公司 Three-dimensional digital human face animation generation method based on voice driving
WO2024087337A1 (en) * 2022-10-24 2024-05-02 深圳先进技术研究院 Method for directly synthesizing speech from tongue ultrasonic images

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377539A (en) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN110047121A (en) * 2019-03-20 2019-07-23 北京字节跳动网络技术有限公司 Animation producing method, device and electronic equipment end to end
CN110347867A (en) * 2019-07-16 2019-10-18 北京百度网讯科技有限公司 Method and apparatus for generating lip motion video
CN110910479A (en) * 2019-11-19 2020-03-24 中国传媒大学 Video processing method and device, electronic equipment and readable storage medium
CN111583913A (en) * 2020-06-15 2020-08-25 深圳市友杰智新科技有限公司 Model training method and device for speech recognition and speech synthesis and computer equipment
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377539A (en) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 Method and apparatus for generating animation
US20190392625A1 (en) * 2018-11-06 2019-12-26 Beijing Baidu Netcom Science and Technology Co., Ltd Method and apparatus for generating animation
CN110047121A (en) * 2019-03-20 2019-07-23 北京字节跳动网络技术有限公司 Animation producing method, device and electronic equipment end to end
CN110347867A (en) * 2019-07-16 2019-10-18 北京百度网讯科技有限公司 Method and apparatus for generating lip motion video
CN110910479A (en) * 2019-11-19 2020-03-24 中国传媒大学 Video processing method and device, electronic equipment and readable storage medium
CN111583913A (en) * 2020-06-15 2020-08-25 深圳市友杰智新科技有限公司 Model training method and device for speech recognition and speech synthesis and computer equipment
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation
WO2024087337A1 (en) * 2022-10-24 2024-05-02 深圳先进技术研究院 Method for directly synthesizing speech from tongue ultrasonic images
CN116051692A (en) * 2023-04-03 2023-05-02 成都索贝数码科技股份有限公司 Three-dimensional digital human face animation generation method based on voice driving

Similar Documents

Publication Publication Date Title
CN113111812A (en) Mouth action driving model training method and assembly
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN114401438A (en) Video generation method and device for virtual digital person, storage medium and terminal
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN114255737B (en) Voice generation method and device and electronic equipment
CN114882862A (en) Voice processing method and related equipment
CN112837669A (en) Voice synthesis method and device and server
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN114245230A (en) Video generation method and device, electronic equipment and storage medium
CN113724683A (en) Audio generation method, computer device, and computer-readable storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113870838A (en) Voice synthesis method, device, equipment and medium
CN117275485B (en) Audio and video generation method, device, equipment and storage medium
CN114387945A (en) Voice generation method and device, electronic equipment and storage medium
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN112331184B (en) Voice mouth shape synchronization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination