WO2022033327A1 - Video generation method, generation model training method, apparatus, medium and device - Google Patents

Video generation method, generation model training method, apparatus, medium and device

Info

Publication number
WO2022033327A1
WO2022033327A1 PCT/CN2021/109460 CN2021109460W WO2022033327A1 WO 2022033327 A1 WO2022033327 A1 WO 2022033327A1 CN 2021109460 W CN2021109460 W CN 2021109460W WO 2022033327 A1 WO2022033327 A1 WO 2022033327A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
target
model
image sequence
posterior probability
Prior art date
Application number
PCT/CN2021/109460
Other languages
English (en)
French (fr)
Inventor
殷翔
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 filed Critical 北京字节跳动网络技术有限公司
Priority to US18/000,387 priority Critical patent/US20230223010A1/en
Publication of WO2022033327A1 publication Critical patent/WO2022033327A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present disclosure relates to the field of data processing, and relates, for example, to a video generation method, a generation model training method, an apparatus, a medium and a device.
  • voice-to-video generation is becoming a research hotspot.
  • an avatar can be driven to make head movements and body postures corresponding to the voice, so as to bring users an immersive experience.
  • One implementation is to extract acoustic features from speech (for example, Mel Frequency Cepstral Coefficients (MFCC)), generate an image sequence directly through an image model according to the acoustic features, and finally combine the image sequence with the speech to synthesize a video.
  • However, since the extracted acoustic features contain speaker-related information, an image model established in this way can only generate image sequences based on the speech of a specific speaker.
  • The present disclosure provides a video generation method, including: acquiring target audio data to be synthesized; extracting acoustic features of the target audio data as target acoustic features; determining a phoneme posterior probability corresponding to the target audio data according to the target acoustic features, and generating an image sequence corresponding to the target audio data according to the phoneme posterior probability, wherein the phoneme posterior probability is used to characterize the distribution probability of the phoneme to which each speech frame in the target audio data belongs; and performing video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data.
  • The present disclosure provides a training method for an image generation model, where the image generation model includes a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, wherein the variational autoencoder includes an encoding network and a decoding network;
  • the method includes: acquiring reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data; and performing model training by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • the present disclosure provides a video generation device, comprising:
  • a first acquisition module configured to acquire target audio data to be synthesized;
  • an extraction module configured to extract the acoustic features of the target audio data acquired by the first acquisition module as target acoustic features;
  • a determination module configured to determine the phoneme posterior probability corresponding to the target audio data according to the target acoustic features extracted by the extraction module, and to generate an image sequence corresponding to the target audio data according to the phoneme posterior probability, wherein the phoneme posterior probability is used to characterize the distribution probability of the phoneme to which each speech frame in the target audio data belongs;
  • a synthesis module configured to perform video synthesis on the target audio data acquired by the first acquisition module and the image sequence corresponding to the target audio data determined by the determination module, to obtain target video data.
  • The present disclosure provides an apparatus for training an image generation model, where the image generation model includes a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, wherein the variational autoencoder includes an encoding network and a decoding network;
  • the device includes:
  • a second acquisition module configured to acquire reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data;
  • a training module configured to perform model training by using the acoustic features of the reference audio data acquired by the second acquisition module as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • The present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the video generation method provided in the first aspect of the present disclosure or the training method for the image generation model provided in the second aspect of the present disclosure is implemented.
  • The present disclosure provides an electronic device, comprising:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device to implement the video generation method provided by the first aspect of the present disclosure.
  • The present disclosure provides an electronic device, comprising:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device to implement the training method for the image generation model provided by the second aspect of the present disclosure.
  • Fig. 1 is a flow chart of a video generation method according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram illustrating a process of generating an image sequence according to an exemplary embodiment.
  • Fig. 3 is a block diagram of an image generation model according to an exemplary embodiment.
  • Fig. 4 is a flowchart showing a training method of an image generation model according to an exemplary embodiment.
  • Fig. 5 is a block diagram of an image generation model according to another exemplary embodiment.
  • Fig. 6 is a block diagram of a video generating apparatus according to an exemplary embodiment.
  • Fig. 7 is a block diagram of an apparatus for training an image generation model according to an exemplary embodiment.
  • Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
  • The term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flow chart of a video generation method according to an exemplary embodiment. As shown in FIG. 1 , the method may include S101 to S104.
  • In S101, target audio data to be synthesized is acquired.
  • the target audio data to be synthesized may be the audio corresponding to any speaker, that is, the speech uttered by any speaker.
  • the target audio data may be the audio corresponding to the speaker's speech, or may be the audio corresponding to the speaker's singing.
  • the language of the target audio data is not limited in the present disclosure, and it may be Chinese, English, etc., for example.
  • In S102, the acoustic features of the target audio data are extracted as the target acoustic features.
  • For example, the acoustic features may be MFCC, Mel-scale Filter Bank (FBank) features, Linear Predictive Cepstral Coefficients (LPCC), cepstral coefficients, Perceptual Linear Predictive (PLP) coefficients, Fast Fourier Transform (FFT) amplitudes, and so on.
  • the acoustic features may be obtained by using at least one acoustic feature extraction algorithm.
  • An exemplary calculation method of the MFCC may be: first convert the time-domain signal to the frequency domain with an FFT, then convolve its logarithmic energy spectrum with a triangular filter bank distributed on the Mel scale, and finally apply a discrete cosine transform to the vector formed by the outputs of the multiple filters, taking the first N coefficients as the MFCC.
  • An exemplary calculation method of the FBank may be: follow the same steps as the MFCC calculation up to the filter bank stage, and use the outputs of the multiple filters directly as the FBank features.
  • An exemplary calculation method of the LPCC may be as follows: the LPCC can be obtained by minimizing the mean square error between the sample value of the target audio data and the linear prediction sample value.
  • An exemplary calculation method of the cepstral coefficients may be: using a homomorphic processing method, the target audio data signal is subjected to a discrete Fourier transform, the logarithm is taken, and the cepstral coefficients are then obtained by an inverse transform.
  • An exemplary calculation method of the PLP may be: using the Durbin method to calculate the linear prediction coefficient parameters, and using discrete cosine transform on the logarithmic energy spectrum of the auditory excitation when calculating the autocorrelation parameter to obtain the PLP.
  • An exemplary method for calculating the magnitude of the FFT may be: using an FFT algorithm to extract the FFT magnitude feature of the target audio data.
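  • As a hedged illustration of the feature extraction in S102, the following Python sketch computes MFCC, FBank, and FFT magnitude features with the librosa library; the sampling rate, frame parameters, and coefficient counts are illustrative assumptions rather than values specified by the present disclosure.

```python
# A minimal sketch of extracting several of the acoustic features listed above.
# Assumes librosa and numpy; all parameter values are illustrative only.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_fft=512, hop=160):
    y, sr = librosa.load(wav_path, sr=sr)

    # MFCC: FFT -> Mel-scale triangular filter bank -> log -> DCT, keep the first N coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)

    # FBank: same pipeline as MFCC but stop at the (log) Mel filter bank outputs.
    fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
    log_fbank = np.log(fbank + 1e-6)

    # FFT magnitude: per-frame amplitude spectrum from the short-time Fourier transform.
    fft_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    return mfcc, log_fbank, fft_mag
```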
  • In S103, a phoneme posterior probability corresponding to the target audio data is determined according to the target acoustic features, and an image sequence corresponding to the target audio data is generated according to the phoneme posterior probability.
  • The phoneme posterior probability (Phonetic PosteriorGram, PPG) is used to represent the distribution probability of the phoneme to which each speech frame in the audio data belongs, that is, the probability distribution over which phoneme the content of each speech frame belongs to.
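  • As a toy illustration (not part of the patent), a PPG can be thought of as a matrix with one row per speech frame and one column per phoneme, where each row is a probability distribution; the sizes below are arbitrary.

```python
# Toy phoneme posterior probability (PPG) matrix: rows are speech frames,
# columns are phonemes, and each row sums to 1. Sizes are hypothetical.
import numpy as np

num_frames, num_phonemes = 5, 4
logits = np.random.randn(num_frames, num_phonemes)
ppg = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # per-frame softmax

assert np.allclose(ppg.sum(axis=1), 1.0)  # each frame is a distribution over phonemes
```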
  • In S104, video synthesis is performed on the target audio data and the image sequence corresponding to the target audio data to obtain target video data.
  • the target video data can be obtained by synthesizing the voice frame and the image frame based on the time stamp corresponding to each voice frame in the target audio data and the time stamp corresponding to each image frame in the corresponding image sequence.
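  • The timestamp-based pairing described above could look roughly like the following sketch; the data layout (lists of (timestamp, payload) tuples) and the function name are assumptions for illustration, not the patent's implementation.

```python
# Rough sketch: pair each speech frame with the image frame whose timestamp is closest,
# assuming both lists are sorted by timestamp.
def pair_by_timestamp(speech_frames, image_frames):
    """speech_frames / image_frames: lists of (timestamp_seconds, payload) sorted by time."""
    pairs, j = [], 0
    for ts, audio in speech_frames:
        # advance while the next image frame is at least as close to this speech frame
        while j + 1 < len(image_frames) and \
                abs(image_frames[j + 1][0] - ts) <= abs(image_frames[j][0] - ts):
            j += 1
        pairs.append((audio, image_frames[j][1]))
    return pairs
```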
  • Through the above technical solution, the phoneme posterior probability corresponding to the target audio data can be determined according to the acoustic features, the image sequence corresponding to the target audio data can be generated according to the phoneme posterior probability, and then video synthesis can be performed on the target audio data and the corresponding image sequence to obtain the target video data. Since the phoneme posterior probability is information unrelated to the actual speaker, the influence of different speakers' pronunciation habits (accents), noise, and other factors on the subsequently generated image sequence can be avoided, thereby improving the accuracy of the head movements and body postures in the generated image sequence.
  • In addition, for the speech of any speaker, a corresponding image sequence can be generated to obtain video data.
  • the following describes in detail the implementation of determining the phoneme posterior probability corresponding to the target audio data according to the target acoustic feature, and generating the image sequence corresponding to the target audio data according to the phoneme posterior probability in S103.
  • In one embodiment, the target acoustic features can be input into an automatic speech recognition (ASR) model to obtain the phoneme posterior probability corresponding to the target audio data; the phoneme posterior probability is then input into a pre-trained recurrent neural network (RNN) to obtain the action features (including head movements and body postures) corresponding to the target audio data, where the RNN learns the mapping relationship between phoneme posterior probabilities and action features during training; finally, the action features generated by the RNN are synthesized into an image sequence through techniques such as head and body alignment, image fusion, and the optical flow method.
  • In another embodiment, the target acoustic features are input into an image generation model, so that the image generation model determines the phoneme posterior probability corresponding to the target audio data according to the target acoustic features, and generates the image sequence corresponding to the target audio data according to that phoneme posterior probability.
  • As shown in FIG. 3, the image generation model includes: a sequentially connected speech recognition sub-model, gated recurrent unit (GRU), and decoding network of a variational autoencoder (VAE).
  • the speech recognition sub-model is set to determine the phoneme posterior probability of the audio data according to the acoustic features of the input audio data.
  • For example, the speech recognition sub-model may be a Deep Feedforward Sequential Memory Network (DFSMN) model, a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), a Deep Neural Network-Hidden Markov Model (DNN-HMM), or the like.
  • the GRU is set to determine the feature vector based on the input phoneme posterior probability.
  • the decoding network of the VAE is set up to generate image sequences corresponding to the audio data based on the feature vectors. That is, the decoding network of the VAE decodes the feature vector to obtain an image sequence corresponding to the audio data.
  • the image generation model further includes an encoding network of VAE.
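  • A hedged PyTorch sketch of this structure is given below: the PPG produced by the speech recognition sub-model is fed to a GRU, and the per-frame feature vectors are decoded into image frames. The layer sizes, the number of phoneme classes, and the simple fully connected stand-in for the VAE decoding network are all assumptions for illustration.

```python
# Sketch of the generator path: PPG -> GRU -> (stand-in) VAE decoding network -> image frames.
import torch
import torch.nn as nn

class ImageGenerator(nn.Module):
    def __init__(self, n_phonemes=218, latent_dim=128, img_size=64):
        super().__init__()
        self.gru = nn.GRU(input_size=n_phonemes, hidden_size=latent_dim, batch_first=True)
        self.decoder = nn.Sequential(                  # stand-in for the VAE decoding network
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size), nn.Sigmoid(),
        )
        self.img_size = img_size

    def forward(self, ppg):
        # ppg: (batch, frames, n_phonemes), e.g. the output of the speech recognition sub-model
        z, _ = self.gru(ppg)                           # per-frame feature vectors
        frames = self.decoder(z)                       # decode each feature vector to an image
        return frames.view(ppg.size(0), ppg.size(1), 3, self.img_size, self.img_size)
```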
  • the image generation model can be obtained by training S401 and S402 shown in FIG. 4 .
  • In S401, reference video data is acquired, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data.
  • the text data corresponding to the reference audio data may be subtitle data in the reference video, or may be text data obtained by manually annotating the reference audio data.
  • a large amount of video data of the same speaker can be used as reference video data to train the image generation model. In this way, the virtual image in the image sequence generated by the trained image generation model is the image of the speaker.
  • In S402, model training is performed by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • In the model training process, the acoustic features of the reference audio data are input into the speech recognition sub-model to obtain the predicted text data corresponding to the reference audio data; then, the model parameters of the speech recognition sub-model are updated according to the comparison result between the predicted text data and the target output of the speech recognition sub-model (i.e., the text data corresponding to the reference audio data).
  • Meanwhile, the reference image sequence can be input into the encoding network of the VAE, so that features of the reference image sequence are extracted by the encoding network and resampled to form a new feature, that is, the reference feature vector corresponding to the reference image sequence; the reference feature vector is then input into the decoding network of the VAE, which decodes it to obtain the corresponding image sequence; next, the model parameters of the VAE are updated according to the comparison result between the image sequence output by the decoding network and the target output of the decoding network (i.e., the reference image sequence).
  • In addition, the above-mentioned speech recognition sub-model can determine the phoneme posterior probability corresponding to the reference audio data according to the acoustic features of the reference audio data, and this phoneme posterior probability is then input into the GRU to obtain a predicted feature vector; the model parameters of the GRU are updated according to the comparison result between the predicted feature vector and the target output of the GRU (i.e., the reference feature vector output by the encoding network), as sketched below.
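  • The sketch below illustrates one possible form of this joint update, assuming PyTorch modules named `encoder`, `decoder`, and `gru`, a precomputed reference image tensor, and a precomputed PPG tensor; the loss functions and weighting are assumptions, not the patent's exact objective.

```python
# Simplified training step for the VAE branch and the GRU branch described above.
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, gru, ref_images, ppg, optimizer):
    # 1) VAE: encode the reference image sequence into a reference feature vector, decode it back,
    #    and compare the decoded sequence with the reference sequence (the decoder's target output).
    ref_latent = encoder(ref_images)
    recon = decoder(ref_latent)
    vae_loss = F.mse_loss(recon, ref_images)

    # 2) GRU: predict the reference feature vector from the PPG; the encoder output is its target.
    pred_latent, _ = gru(ppg)
    gru_loss = F.mse_loss(pred_latent, ref_latent.detach())

    loss = vae_loss + gru_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```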
  • In one embodiment, the above-mentioned image generation model may further include a discriminator, wherein the image generation model is a generative adversarial network including a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, the decoding network of the VAE, and the encoding network of the VAE; the discriminator is configured to perform a true-or-false determination on the image sequence output by the decoding network during the model training phase, that is, to determine whether the image sequence is a real image sequence, and the obtained determination result is used to update the model parameters of the generator and the model parameters of the discriminator.
  • In this way, two adjacent frames in the image sequence generated by the generator can be made more similar, thereby ensuring the continuity of the image sequence, and the generated image sequence is closer to a real video image sequence, that is, it is more natural, which improves the continuity and naturalness of the subsequently synthesized video.
  • the present disclosure also provides a training method for an image generation model, wherein, as shown in FIG. 3 , the image generation model includes a speech recognition sub-model, a GRU, and a VAE, where the VAE includes an encoding network and a decoding network.
  • the training method includes S401 and S402.
  • In S401, reference video data is acquired, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data.
  • the text data corresponding to the reference audio data may be subtitle data in the reference video, or may be text data obtained by manually annotating the reference audio data.
  • a large amount of video data of the same speaker can be used as reference video data to train the image generation model. In this way, the virtual image in the image sequence generated by the trained image generation model is the image of the speaker.
  • In S402, model training is performed by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • In the model training process, the acoustic features of the reference audio data are input into the speech recognition sub-model to obtain the predicted text data corresponding to the reference audio data; then, the model parameters of the speech recognition sub-model are updated according to the comparison result between the predicted text data and the target output of the speech recognition sub-model (i.e., the text data corresponding to the reference audio data).
  • Meanwhile, the reference image sequence can be input into the encoding network of the VAE, so that features of the reference image sequence are extracted by the encoding network and resampled to form a new feature, that is, the reference feature vector corresponding to the reference image sequence; the reference feature vector is then input into the decoding network of the VAE, which decodes it to obtain the corresponding image sequence; next, the model parameters of the VAE are updated according to the comparison result between the image sequence output by the decoding network and the target output of the decoding network (i.e., the reference image sequence).
  • In addition, the above-mentioned speech recognition sub-model can determine the phoneme posterior probability corresponding to the reference audio data according to the acoustic features of the reference audio data, and this phoneme posterior probability is then input into the GRU to obtain a predicted feature vector; the model parameters of the GRU are updated according to the comparison result between the predicted feature vector and the target output of the GRU (i.e., the reference feature vector output by the encoding network).
  • In one embodiment, the above-mentioned image generation model may further include a discriminator, wherein the image generation model is a generative adversarial network including a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, the decoding network of the VAE, and the encoding network of the VAE.
  • In this case, the above training method further includes the following steps: the decoding network inputs the obtained image sequence to the discriminator; the discriminator performs a true-or-false determination on the image sequence obtained by the decoding network, that is, determines whether the image sequence is a real image sequence; and the obtained determination result is used to update the model parameters of the generator and the model parameters of the discriminator, as in the sketch below.
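  • A hedged sketch of such an adversarial update is shown below; the module names, the binary cross-entropy formulation, and the use of separate optimizers are illustrative assumptions rather than the patent's specific training procedure.

```python
# Sketch of one adversarial step: the discriminator judges decoded sequences against real ones,
# then both the discriminator and the generator parameters are updated.
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, ppg, ref_images, g_opt, d_opt):
    fake_images = generator(ppg)

    # Discriminator update: real sequences -> 1, generated sequences -> 0.
    d_real = discriminator(ref_images)
    d_fake = discriminator(fake_images.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: try to make the discriminator judge generated sequences as real.
    g_score = discriminator(fake_images)
    g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```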
  • In this way, two adjacent frames in the image sequence generated by the generator can be made more similar, thereby ensuring the continuity of the image sequence, and the generated image sequence is closer to a real video image sequence, that is, it is more natural, which improves the continuity and naturalness of the subsequently synthesized video.
  • Fig. 6 is a block diagram of a video generating apparatus according to an exemplary embodiment.
  • As shown in FIG. 6, the apparatus 600 includes: a first acquisition module 601 configured to acquire target audio data to be synthesized; an extraction module 602 configured to extract the acoustic features of the target audio data acquired by the first acquisition module 601 as target acoustic features; a determination module 603 configured to determine the phoneme posterior probability corresponding to the target audio data according to the target acoustic features extracted by the extraction module 602, and to generate an image sequence corresponding to the target audio data according to the phoneme posterior probability, wherein the phoneme posterior probability is used to represent the distribution probability of the phoneme to which each speech frame in the audio data belongs; and a synthesis module 604 configured to perform video synthesis on the target audio data acquired by the first acquisition module 601 and the image sequence corresponding to the target audio data determined by the determination module 603, to obtain target video data.
  • the target audio data to be synthesized may be the audio corresponding to any speaker, that is, the speech uttered by any speaker.
  • the target audio data may be the audio corresponding to the speaker's speech, or may be the audio corresponding to the speaker's singing.
  • the language of the target audio data is not limited in the present disclosure.
  • The phoneme posterior probability (Phonetic PosteriorGram, PPG) is used to represent the distribution probability of the phoneme to which each speech frame in the audio data belongs, that is, the probability distribution over which phoneme the content of each speech frame belongs to.
  • Through the above technical solution, the phoneme posterior probability corresponding to the target audio data can be determined according to the acoustic features, the image sequence corresponding to the target audio data can be generated according to the phoneme posterior probability, and then video synthesis can be performed on the target audio data and the corresponding image sequence to obtain the target video data. Since the phoneme posterior probability is information unrelated to the actual speaker, the influence of different speakers' pronunciation habits (accents), noise, and other factors on the subsequently generated image sequence can be avoided, thereby improving the accuracy of the head movements and body postures in the generated image sequence.
  • In addition, for the speech of any speaker, a corresponding image sequence can be generated to obtain video data.
  • the determination module 603 determines the phoneme posterior probability corresponding to the target audio data according to the target acoustic feature, and generates an image sequence corresponding to the target audio data according to the phoneme posterior probability.
  • In one embodiment, the determination module 603 includes: a determination sub-module configured to input the target acoustic features into an ASR model to obtain the phoneme posterior probability corresponding to the target audio data; a feature extraction sub-module configured to input the phoneme posterior probability into a pre-trained recurrent neural network (RNN) to obtain the action features (including head movements and body postures) corresponding to the target audio data, where the RNN learns the mapping relationship between phoneme posterior probabilities and action features during training; and a synthesis sub-module configured to synthesize the action features generated by the RNN into an image sequence through techniques such as head and body alignment, image fusion, and the optical flow method.
  • In another embodiment, the determination module 603 is configured to input the target acoustic features into an image generation model, so as to determine the phoneme posterior probability corresponding to the target audio data through the image generation model according to the target acoustic features, and to generate the image sequence corresponding to the target audio data according to the phoneme posterior probability corresponding to the target audio data.
  • the image sequence corresponding to the target audio data can be directly generated, which is convenient and quick.
  • In one embodiment, the image generation model includes: a sequentially connected speech recognition sub-model, gated recurrent unit, and decoding network of a variational autoencoder; wherein the speech recognition sub-model is configured to determine the phoneme posterior probability of the audio data according to the acoustic features of the input audio data; the gated recurrent unit is configured to determine a feature vector according to the input phoneme posterior probability; and the decoding network is configured to generate an image sequence corresponding to the audio data according to the feature vector.
  • In one embodiment, the image generation model further includes an encoding network of the variational autoencoder, and the image generation model can be obtained through training by an apparatus for training an image generation model.
  • The training apparatus 700 includes: a second acquisition module 701 configured to acquire reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data; and a training module 702 configured to perform model training by using the acoustic features of the reference audio data acquired by the second acquisition module 701 as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • In one embodiment, the image generation model further includes a discriminator, wherein the image generation model is a generative adversarial network including a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, the decoding network, and the encoding network; the discriminator is configured to perform a true-or-false determination on the image sequence output by the decoding network in the model training phase, wherein the obtained determination result is used to update the model parameters of the generator and the model parameters of the discriminator.
  • The present disclosure also provides a training apparatus for an image generation model, wherein the image generation model includes a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, and the variational autoencoder includes an encoding network and a decoding network. As shown in FIG. 7,
  • the device 700 includes: a second obtaining module 701, configured to obtain reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data;
  • a training module 702 configured to perform model training by using the acoustic features of the reference audio data obtained by the second obtaining module 701 as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • the text data corresponding to the reference audio data may be subtitle data in the reference video, or may be text data obtained by manually annotating the reference audio data.
  • In one embodiment, the image generation model further includes a discriminator, wherein the image generation model is a generative adversarial network including a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, and the variational autoencoder;
  • the apparatus 700 further includes: an input module configured to input the image sequence obtained by the decoding network to the discriminator; a determination module configured to perform a true-or-false determination, through the discriminator, on the image sequence obtained by the decoding network; and an updating module configured to update the model parameters of the generator and the model parameters of the discriminator using the obtained determination results.
  • the training apparatus 700 for the image generation model may be integrated into the video generation apparatus 600, or may be independent of the video generation apparatus 600, which is not limited in the present disclosure.
  • the manner in which each module performs operations has been described in detail in the embodiments of the method, and will not be described in detail here.
  • Referring to FIG. 8, it shows a schematic structural diagram of an electronic device (e.g., a terminal device or a server) 800 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers.
  • As shown in FIG. 8, the electronic device 800 may include a processing device (such as a central processing unit, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803.
  • In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored.
  • the processing device 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • An Input/Output (I/O) interface 805 is also connected to the bus 804 .
  • In general, the following devices can be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage devices 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809.
  • Communication means 809 may allow electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 8 shows an electronic device 800 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 809, or from the storage device 808, or from the ROM 802.
  • When the computer program is executed by the processing device 801, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • In some embodiments, the client and server can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries at least one program, and when the above-mentioned at least one program is executed by the electronic device, the electronic device: acquires target audio data to be synthesized; extracts the acoustic features of the target audio data as the target acoustic features; Determine the phoneme posterior probability corresponding to the target audio data according to the target acoustic feature, and generate an image sequence corresponding to the target audio data according to the phoneme posterior probability, wherein the phoneme posterior probability is used for Characterize the distribution probability of the phonemes to which each speech frame in the audio data belongs; perform video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data.
  • Alternatively, the above-mentioned computer-readable medium carries at least one program, and when the above-mentioned at least one program is executed by the electronic device, causes the electronic device to: acquire reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data, and wherein the image generation model includes a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, the variational autoencoder including an encoding network and a decoding network; and perform model training by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, connected via the Internet using an Internet service provider).
  • In this regard, each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. Wherein, the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the first acquisition module may also be described as "a module for acquiring target audio data to be synthesized".
  • For example, and without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • Example 1 provides a video generation method, comprising: acquiring target audio data to be synthesized; extracting acoustic features of the target audio data as target acoustic features; determining the phoneme posterior probability corresponding to the target audio data according to the target acoustic features, and generating an image sequence corresponding to the target audio data according to the phoneme posterior probability, wherein the phoneme posterior probability is used to characterize the distribution probability of the phoneme to which each speech frame in the audio data belongs; and performing video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data.
  • Example 2 provides the method of Example 1, wherein determining the phoneme posterior probability corresponding to the target audio data according to the target acoustic features, and generating the image sequence corresponding to the target audio data according to the phoneme posterior probability, includes: inputting the target acoustic features into an image generation model, so as to determine the phoneme posterior probability corresponding to the target audio data through the image generation model according to the target acoustic features, and to generate the image sequence corresponding to the target audio data according to the phoneme posterior probability corresponding to the target audio data.
  • Example 3 provides the method of Example 1, the image generation model comprising: a sequentially connected speech recognition sub-model, gated recurrent unit, and decoding network of a variational autoencoder; wherein the speech recognition sub-model is configured to determine the phoneme posterior probability of the audio data according to the acoustic features of the input audio data; the gated recurrent unit is configured to determine a feature vector according to the input phoneme posterior probability; and the decoding network is configured to generate an image sequence corresponding to the audio data according to the feature vector.
  • Example 4 provides the method of Example 3, wherein the image generation model further includes an encoding network of the variational autoencoder; the image generation model is obtained through training by: acquiring reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data; and performing model training by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • Example 5 provides the method of Example 4, wherein the image generation model further includes a discriminator, the image generation model is a generative adversarial network including a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, the decoding network, and the encoding network; the discriminator is configured to perform a true-or-false determination on the image sequence output by the decoding network in the model training phase, wherein the obtained determination result is used to update the model parameters of the generator and the model parameters of the discriminator.
  • Example 6 provides a method for training an image generation model, the image generation model including a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, wherein the variational autoencoder includes an encoding network and a decoding network; the method includes: acquiring reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data; and performing model training by using the acoustic features of the reference audio data as the input of the speech recognition sub-model, the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, the reference image sequence as the input of the encoding network, the reference image sequence as the target output of the decoding network, the phoneme posterior probability corresponding to the reference audio data (determined by the speech recognition sub-model according to the acoustic features of the reference audio data) as the input of the gated recurrent unit, and the output of the encoding network as the target output of the gated recurrent unit, so as to obtain the image generation model.
  • Example 7 provides the method of Example 6, the image generation model further comprising a discriminator, wherein the image generation model is a generative adversarial network comprising a generator and the discriminator, and the generator includes the speech recognition sub-model, the gated recurrent unit, and the variational autoencoder; the method further includes: the decoding network inputting the resulting image sequence to the discriminator; the discriminator determining whether the image sequence obtained by the decoding network is real or fake; and updating the model parameters of the generator and the model parameters of the discriminator using the obtained determination results.
  • Example 8 provides a video generation apparatus, comprising: a first acquisition module configured to acquire target audio data to be synthesized; an extraction module configured to extract the acoustic features of the target audio data acquired by the first acquisition module as target acoustic features; a determining module configured to determine, according to the target acoustic features extracted by the extraction module, the phoneme posterior probability corresponding to the target audio data, and to generate, according to the phoneme posterior probability, an image sequence corresponding to the target audio data, wherein the phoneme posterior probability is used to represent the distribution probability of the phoneme to which each speech frame in the audio data belongs; and a synthesis module configured to perform video synthesis on the target audio data acquired by the first acquisition module and the image sequence, corresponding to the target audio data, determined by the determining module, to obtain target video data.
  • Example 9 provides an apparatus for training an image generation model, where the image generation model includes a speech recognition sub-model, a gated recurrent unit, and a variational autoencoder, wherein the variational autoencoder includes an encoding network and a decoding network; the apparatus includes: a second acquisition module configured to acquire reference video data, wherein the reference video data includes reference audio data, a reference image sequence, and text data corresponding to the reference audio data; and a training module configured to perform model training by using the acoustic features of the reference audio data acquired by the second acquisition module as the input of the speech recognition sub-model, using the text data corresponding to the reference audio data as the target output of the speech recognition sub-model, using the reference image sequence as the input of the encoding network, using the reference image sequence as the target output of the decoding network, using the phoneme posterior probability corresponding to the reference audio data, determined by the speech recognition sub-model according to the acoustic features of the reference audio data, as the input of the gated recurrent unit, and using the output of the encoding network as the target output of the gated recurrent unit, to obtain the image generation model.
  • Example 10 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the method of any one of Examples 1-7.
  • Example 11 provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device, so as to implement the method of any one of Examples 1-5.
  • Example 12 provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device, so as to implement the method of Example 6 or 7.
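The pipeline named in Example 3 — a speech recognition sub-model feeding a gated recurrent unit feeding the decoding network of a variational autoencoder — can be sketched as follows. This is a minimal illustration only: the class name, the layer sizes, the phoneme inventory, and the fully connected decoder are assumptions for readability, not details fixed by this disclosure.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the Example 3 pipeline: ASR sub-model -> GRU -> VAE decoding network.

    All hyper-parameters below (feature dim, phoneme count, latent size, image size)
    are illustrative assumptions.
    """
    def __init__(self, feat_dim=40, num_phonemes=218, latent_dim=128, img_size=64):
        super().__init__()
        # Speech recognition sub-model: acoustic features -> per-frame phoneme posteriors (PPG)
        self.asr = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_phonemes))
        # Gated recurrent unit: PPG sequence -> per-frame feature vectors
        self.gru = nn.GRU(num_phonemes, latent_dim, batch_first=True)
        # VAE decoding network: feature vector -> image (flattened, then reshaped)
        self.decode = nn.Sequential(nn.Linear(latent_dim, img_size * img_size * 3),
                                    nn.Sigmoid())
        self.img_size = img_size

    def forward(self, acoustic_feats):                       # (B, T, feat_dim)
        ppg = torch.softmax(self.asr(acoustic_feats), dim=-1)  # (B, T, num_phonemes)
        z, _ = self.gru(ppg)                                  # (B, T, latent_dim)
        imgs = self.decode(z)                                 # (B, T, 3*H*W)
        b, t, _ = imgs.shape
        return imgs.view(b, t, 3, self.img_size, self.img_size)
```

Given batched acoustic features of shape (batch, frames, feature_dim), the sketch returns one image per speech frame, matching the per-frame correspondence described in Example 3.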

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure relates to a video generation method, a generation model training method, an apparatus, a medium, and a device. The method includes: acquiring target audio data to be synthesized; extracting acoustic features of the target audio data as target acoustic features; determining, according to the target acoustic features, a phoneme posterior probability corresponding to the target audio data, and generating, according to the phoneme posterior probability, an image sequence corresponding to the target audio data; and performing video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data.

Description

视频生成方法、生成模型训练方法、装置、介质及设备
本申请要求在2020年8月12日提交中国专利局、申请号为202010807940.7的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
技术领域
本公开涉及数据处理领域,例如一种视频生成方法、生成模型训练方法、装置、介质及设备。
背景技术
目前,语音到视频生成这一技术正在成为研究热点,例如针对一段任意说话人的语音,可以驱动一个虚拟形象做出该段语音对应的头部动作和身体姿态,以带给用户沉浸式的体验。一种实现方式是提取语音中的声学特征(例如,梅尔频率倒谱系数(Mel Frequency Cepstral Coefficient,MFCC)),然后根据该声学特征,通过图像模型直接生成图像序列,最后将该图像序列和语音合成为视频。然而,由于提取的声学特征中含有与说话人相关的信息,导致以此建立的图像模型只能根据特定说话人的语音,生成图像序列。
发明内容
第一方面,本公开提供一种视频生成方法,包括:
获取待合成的目标音频数据;
提取所述目标音频数据的声学特征作为目标声学特征;
根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征所述目标音频数据中的每一语音帧所属音素的分布概率;
将所述目标音频数据和所述目标音频数据对应的图像序列进行视频合成, 得到目标视频数据。
第二方面,本公开提供一种图像生成模型的训练方法,所述图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;
所述方法包括:
获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;
通过将所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
第三方面,本公开提供一种视频生成装置,包括:
获取模块,设置为获取待合成的目标音频数据;
提取模块,设置为提取所述第一获取模块获取到的所述目标音频数据的声学特征作为目标声学特征;
确定模块,设置为根据所述提取模块提取到的所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征所述目标音频数据中的每一语音帧所属音素的分布概率;
合成模块,设置为将所述第一获取模块获取到的所述目标音频数据和所述确定模块确定出的所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
第四方面,本公开提供一种图像生成模型的训练装置,所述图像生成模型 包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;
所述装置包括:
获取模块,设置为获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;
训练模块,设置为通过将所述第二获取模块获取到的所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
第五方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第一方面提供的所述视频生成方法或者本公开第二方面提供的所述图像生成模型的训练方法。
第六方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,设置为执行所述存储装置中的所述计算机程序,以实现本公开第一方面提供的所述视频生成方法。
第七方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,设置为执行所述存储装置中的所述计算机程序,以实现本公开第二方面提供的所述图像生成模型的训练方法。
附图说明
图1是根据一示例性实施例示出的一种视频生成方法的流程图。
图2是根据一示例性实施例示出的一种生成图像序列的过程的示意图。
图3是根据一示例性实施例示出的一种图像生成模型的框图。
图4是根据一示例性实施例示出的一种图像生成模型的训练方法的流程图。
图5是根据另一示例性实施例示出的一种图像生成模型的框图。
图6是根据一示例性实施例示出的一种视频生成装置的框图。
图7是根据一示例性实施例示出的一种图像生成模型的训练装置的框图。
图8是根据一示例性实施例示出的一种电子设备的框图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的, 本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“至少一个”。
本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。
图1是根据一示例性实施例示出的一种视频生成方法的流程图。如图1所示,该方法可以包括S101~S104。
在S101中,获取待合成的目标音频数据。
在本公开中，待合成的目标音频数据可以为任意说话人对应的音频，即任意说话人发出的语音。并且，目标音频数据可以是说话人讲话所对应的音频，也可以是说话人唱歌所对应的音频。另外，目标音频数据的语种在本公开中也不作限定，其可以例如是汉语、英语等。
在S102中,提取目标音频数据的声学特征作为目标声学特征。
在本公开中,该声学特征可以是MFCC、梅尔标度滤波器组(Mel-scale Filter Bank,FBank)、线性预测倒谱系数(Linear Predictive Cepstral Coding,LPCC)、倒谱系数、感知线性预测系数(Perceptual Linear Predictive,PLP)、快速傅立叶变换(Fast Fourier Transform,简称FFT)的幅值等等。
其中,声学特征可以是利用至少一种声学特征提取算法获取。例如,MFCC的示例性计算方法可以是:首先用FFT将时域信号转化成频域,之后对其对数能量谱用依照Mel刻度分布的三角滤波器组进行卷积,最后对多个滤波器的输出构成的向量进行离散余弦变换,取前N个系数作为MFCC。FBank的示例性计算方法可以是:与MFCC的计算方法一致,将多个滤波器输出作为FBank。LPCC的示例性计算方法可以是:通过使目标音频数据的采样值和线性预测采样值之间达到均方差最小,即可得到LPCC。倒谱系数的示例性计算方法可以是:利用同态处理方法,对目标音频数据信号求离散傅立叶变换后取对数,再求反 变换即可得到倒谱系数。PLP的示例性计算方法可以是:用德宾法去计算线性预测系数参数,在计算自相关参数时采用对听觉激励的对数能量谱进行离散余弦变换,以得到PLP。FFT的幅值的示例性计算方法可以是:采用FFT算法提取目标音频数据的FFT幅值特征。
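The paragraph above lists candidate acoustic features (MFCC, FBank, LPCC, PLP, FFT magnitudes) and outlines how MFCC and FBank are computed. A minimal extraction sketch using librosa is shown below; the helper name, sample rate, frame length, hop length, and filter counts are illustrative assumptions rather than values fixed by the disclosure.

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000, n_mfcc=13):
    """Extract MFCC and log-FBank features (25 ms frames, 10 ms hop at 16 kHz; illustrative)."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    # MFCC: FFT -> mel filter bank on the power spectrum -> log -> DCT, keep first n_mfcc coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=400, hop_length=160)
    # FBank: the same pipeline without the final DCT (log mel filter-bank energies)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-6)
    return mfcc.T, fbank.T  # each of shape (num_frames, feature_dim)
```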
在S103中,根据目标声学特征,确定目标音频数据对应的音素后验概率,并根据音素后验概率,生成目标音频数据对应的图像序列。
在本公开中,音素后验概率(Phonetic Posterior Grams,PPG)用于表征音频数据中的每一语音帧所属音素的分布概率,即语音帧内容属于哪一种音素的概率分布。
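A phonetic posteriorgram (PPG), as defined above, is simply a probability distribution over phoneme classes for every speech frame. Assuming an ASR acoustic model that emits per-frame phoneme logits (the function name is hypothetical), a PPG can be obtained by a frame-wise softmax:

```python
import numpy as np

def logits_to_ppg(frame_logits):
    """Turn per-frame phoneme logits of shape (num_frames, num_phonemes) into a PPG:
    a row-stochastic matrix giving, for each speech frame, the probability of each phoneme."""
    x = frame_logits - frame_logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)
```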
在S104中,将目标音频数据和目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
在本公开中,可以基于目标音频数据中每个语音帧对应的时间戳和相应图像序列中每一图像帧对应的时间戳,对语音帧和图像帧进行合成,得到目标视频数据。
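Pairing speech frames and image frames by timestamp, as described above, amounts to resampling the generated image sequence to the video frame rate and then muxing it with the audio track. A minimal sketch follows; the frame rate, the helper name, and the external muxing command are assumptions, not part of the disclosure.

```python
def align_frames_to_audio(num_images, audio_duration_s, fps=25.0):
    """Return (timestamp_seconds, image_index) pairs covering the whole audio clip.

    If the model produced more or fewer images than audio_duration_s * fps, the image
    sequence is resampled by nearest-neighbour lookup so timestamps stay consistent.
    """
    target_frames = int(round(audio_duration_s * fps))
    pairs = []
    for i in range(target_frames):
        t = i / fps
        src = min(num_images - 1, int(round(i * num_images / max(target_frames, 1))))
        pairs.append((t, src))
    return pairs

# The selected frames can then be muxed with the audio track by an external tool,
# e.g. (command shown as an assumption):
#   ffmpeg -framerate 25 -i frame_%05d.png -i speech.wav -c:v libx264 -c:a aac -shortest out.mp4
```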
在上述技术方案中,在提取到待合成的目标音频数据的声学特征后,可以根据该声学特征,确定目标音频数据对应的音素后验概率,并根据音素后验概率,生成目标音频数据对应的图像序列;之后,将目标音频数据和相应的图像序列进行视频合成,得到目标视频数据。由于音素后验概率为与实际说话人无关的信息,由此可以避免不同说话人发音习惯(口音)、噪声等因素对后续生成的图像序列的影响,从而可以提升生成的图像序列中头部动作和身体姿态的准确度。并且,针对任意说话人的语音数据,均可生成相应的图像序列,进而得到视频数据。
下面针对上述S103中的根据目标声学特征,确定目标音频数据对应的音素后验概率,并根据音素后验概率,生成目标音频数据对应的图像序列的实施方式进行详细说明。
在一种实施方式中,可以将目标声学特征输入至自动语音识别(Automatic Speech Recognition,ASR)模型中,以得到目标音频数据对应的音素后验概率;然后,将该音素后验概率输入至预先训练好的循环神经网络(Recurrent Neural Network,RNN)中,得到目标音频数据对应的动作特征(包括头部动作和身体姿态),其中,RNN在训练过程中设置为学习音素后验概率与动作特征之间的映射关系;最后,通过头部和身体对齐、图像融合、光流法等技术将RNN生成的动作特征合成为图像序列。
在另一种实施方式中,将目标声学特征输入至图像生成模型中,以通过图像生成模型根据目标声学特征,确定目标音频数据对应的音素后验概率,并根据目标音频数据对应的音素后验概率,生成目标音频数据对应的图像序列。这样,将目标声学特征输入到图像生成模型中,可以直接生成目标音频数据对应的图像序列,方便快捷。
示例性的,如图2所示,该图像生成模型包括:依次连接的语音识别子模型、门控递归单元(Gated Recurrent Unit,GRU)、以及变分自编码器(Variational Autoencoder,VAE)的解码网络。
其中,语音识别子模型设置为根据输入的音频数据的声学特征,确定音频数据的音素后验概率。示例地,该语音识别子模型可以为前馈神经网络(Deep-feedforward sequential memory networks,DFSMN)模型、高斯混合模型-隐马尔可夫(Gaussian Mixed Model-Hidden Markov Model,GMM-HMM)模型、深度神经网络-隐马尔可夫(Deep Neural Networks-Hidden Markov Model,DNN-HMM)模型等。
GRU设置为根据所输入的音素后验概率,确定特征向量。
VAE的解码网络设置为根据特征向量,生成与音频数据对应的图像序列。即,VAE的解码网络对该特征向量进行解码,得到与音频数据对应的图像序列。
下面针对上述图像生成模型的训练方法进行详细说明,其中,如图3所示, 图像生成模型还包括VAE的编码网络。示例性的,可以通过图4中所示的S401和S402来训练得到图像生成模型。
在S401中，获取参考视频数据，其中，参考视频数据包括参考音频数据、参考图像序列和参考音频数据对应的文本数据。
在本公开中,参考音频数据对应的文本数据可以是参考视频中的字幕数据,也可以是根据参考音频数据,进行人工标注所得的文本数据。另外,可以将同一个说话人的大量视频数据作为参考视频数据,以对图像生成模型进行训练,这样,训练得到的图像生成模型生成的图像序列中的虚拟形象即为该说话人的形象。
在S402中,通过将参考音频数据的声学特征作为语音识别子模型的输入,将参考音频数据对应的文本数据作为语音识别子模型的目标输出,将参考图像序列作为编码网络的输入,将参考图像序列作为解码网络的目标输出,将语音识别子模型根据参考音频数据的声学特征确定出的、参考音频数据对应的音素后验概率作为门控递归单元的输入,将编码网络的输出作为门控递归单元的目标输出的方式进行模型训练,以得到图像生成模型。
在本公开中,针对语音识别子模型,如图3中所示,将参考音频数据的声学特征输入至语音识别子模型中,可以得到参考音频数据对应的预测文本数据;之后,可以根据该预测文本数据与语音识别子模型的目标输出(即,参考音频数据对应的文本数据)的比较结果,对语音识别子模型的模型参数进行更新。
针对VAE,可以将参考图像序列输入至VAE的编码网络中,以通过该编码网络对参考图像序列进行特征提取,并重采样形成新特征,即参考图像序列对应的参考特征向量;之后,将该参考特征向量输入至VAE的解码网络中,以通过该解码网络对该参考特征向量进行解码,得到相应的图像序列;接下来,可以根据该解码网络输出的图像序列与该解码网络的目标输出(即参考图像序列)的比较结果,对VAE进行模型参数更新。
针对GRU,上述语音识别子模型根据参考音频数据的声学特征,可以确定出参考音频数据对应的音素后验概率,之后,可以将其输入至GRU中,得到预测特征向量;接下来,可以根据该预测特征向量与GRU的目标输出(即编码网络输出的参考特征向量)的比较结果,对GRU的模型参数进行更新。
由此,可以得到图像生成模型。
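The three training targets described above (the text data as the target output of the speech recognition sub-model, the reference image sequence as the target output of the VAE decoding network, and the encoding network's feature vectors as the target output of the GRU) can be realised with standard losses. The sketch below is only one possible reading: it assumes a model object exposing `asr`, `encode`, `decode` and `gru` components with compatible shapes, CTC supervision for the ASR sub-model, an L1 + KL objective for the VAE, and an MSE objective for the GRU; none of these loss choices are prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def training_losses(model, acoustic_feats, phoneme_targets, target_lengths, ref_images):
    """Illustrative per-component losses; phoneme_targets are padded phoneme index sequences."""
    # Speech recognition sub-model: acoustic features -> PPG, supervised against the text
    ppg = torch.softmax(model.asr(acoustic_feats), dim=-1)          # (B, T, P)
    log_probs = torch.log(ppg + 1e-8).transpose(0, 1)               # (T, B, P) for CTC
    input_lengths = torch.full((ppg.size(0),), ppg.size(1), dtype=torch.long)
    asr_loss = F.ctc_loss(log_probs, phoneme_targets, input_lengths, target_lengths)

    # VAE: reference images in, reference images out (reconstruction + KL)
    mu, logvar = model.encode(ref_images)                           # encoding network output
    z_ref = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterisation
    recon = model.decode(z_ref)
    vae_loss = F.l1_loss(recon, ref_images) \
        - 0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # GRU: PPG in, the encoder's feature vectors as the regression target
    z_pred, _ = model.gru(ppg.detach())
    gru_loss = F.mse_loss(z_pred, z_ref.detach())

    return asr_loss + vae_loss + gru_loss
```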
如图5所示,上述图像生成模型还可以包括判别器,其中,图像生成模型为包括生成器和判别器的生成式对抗网络,生成器包括语音识别子模型、门控递归单元、VAE的解码网络以及VAE的编码网络。判别器设置为在模型训练阶段,对解码网络输出的图像序列进行真假判定,即判定图像序列是否为真实的图像序列,其中,所得的真假判定结果用于对生成器的模型参数和判别器的模型参数进行更新。
通过判别器和生成器的对抗训练,可以使得生成器生成的图像序列中相邻两帧图像更加相似,从而保证图像序列的连续性,并且,生成的图像序列更加接近于真实视频的图像序列,即更加自然,进而提升了后续合成的视频的连续性和自然度。
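The adversarial training described above alternates a discriminator update on real and generated image sequences with a generator update that rewards sequences the discriminator judges to be real. A minimal sketch, assuming binary cross-entropy GAN losses (one common choice, not one mandated by the disclosure) and hypothetical `generator`/`discriminator` modules:

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, opt_g, opt_d, acoustic_feats, ref_images):
    """One real/fake training step: update the discriminator, then the generator."""
    fake = generator(acoustic_feats)

    # 1) discriminator: real sequences should score as real, generated ones as fake
    d_real = discriminator(ref_images)
    d_fake = discriminator(fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) generator: generated sequences should be judged as real
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones_like(d_fake))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```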
本公开还提供一种图像生成模型的训练方法,其中,如图3所示,该图像生成模型包括语音识别子模型、GRU以及VAE,其中,VAE包括编码网络和解码网络。如图4所示,该训练方法包括S401和S402。
在S401中，获取参考视频数据，其中，参考视频数据包括参考音频数据、参考图像序列和参考音频数据对应的文本数据。
在本公开中,参考音频数据对应的文本数据可以是参考视频中的字幕数据,也可以是根据参考音频数据,进行人工标注所得的文本数据。另外,可以将同一个说话人的大量视频数据作为参考视频数据,以对图像生成模型进行训练,这样,训练得到的图像生成模型生成的图像序列中的虚拟形象即为该说话人的形象。
在S402中,通过将参考音频数据的声学特征作为语音识别子模型的输入,将参考音频数据对应的文本数据作为语音识别子模型的目标输出,将参考图像序列作为编码网络的输入,将参考图像序列作为解码网络的目标输出,将语音识别子模型根据参考音频数据的声学特征确定出的、参考音频数据对应的音素后验概率作为门控递归单元的输入,将编码网络的输出作为门控递归单元的目标输出的方式进行模型训练,以得到图像生成模型。
在本公开中,针对语音识别子模型,如图3中所示,将参考音频数据的声学特征输入至语音识别子模型中,可以得到参考音频数据对应的预测文本数据;之后,可以根据该预测文本数据与语音识别子模型的目标输出(即,参考音频数据对应的文本数据)的比较结果,对语音识别子模型的模型参数进行更新。
针对VAE,可以将参考图像序列输入至VAE的编码网络中,以通过该编码网络对参考图像序列进行特征提取,并重采样形成新特征,即参考图像序列对应的参考特征向量;之后,将该参考特征向量输入至VAE的解码网络中,以通过该解码网络对该参考特征向量进行解码,得到相应的图像序列;接下来,可以根据该解码网络输出的图像序列与该解码网络的目标输出(即参考图像序列)的比较结果,对VAE进行模型参数更新。
针对GRU,上述语音识别子模型根据参考音频数据的声学特征,可以确定出参考音频数据对应的音素后验概率,之后,可以将其输入至GRU中,得到预测特征向量;接下来,可以根据该预测特征向量与GRU的目标输出(即编码网络输出的参考特征向量)的比较结果,对GRU的模型参数进行更新。
由此,可以得到图像生成模型。
如图5所示,上述图像生成模型还可以包括判别器,其中,图像生成模型为包括生成器和判别器的生成式对抗网络,生成器包括语音识别子模型、门控递归单元、VAE的解码网络以及VAE的编码网络。
上述训练方法还包括以下步骤:解码网络将所得的图像序列输入至判别器; 判别器对解码网络所得的图像序列进行真假判定,即判定图像序列是否为真实的图像序列;利用所得的真假判定结果对生成器的模型参数和判别器的模型参数进行更新。
通过判别器和生成器的对抗训练,可以使得生成器生成的图像序列中相邻两帧图像更加相似,从而保证图像序列的连续性,并且,生成的图像序列更加接近于真实视频的图像序列,即更加自然,进而提升了后续合成的视频的连续性和自然度。
图6是根据一示例性实施例示出的一种视频生成装置的框图。如图6所示,该装置600包括:第一获取模块601,设置为获取待合成的目标音频数据;提取模块602,设置为提取所述第一获取模块601获取到的所述目标音频数据的声学特征作为目标声学特征;确定模块603,设置为根据所述提取模块602提取到的所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征音频数据中的每一语音帧所属音素的分布概率;合成模块604,设置为将所述第一获取模块601获取到的所述目标音频数据和所述确定模块603确定出的所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
在本公开中,待合成的目标音频数据可以为任意说话人对应的音频,即任意说话人发出的语音。并且,目标音频数据可以是说话人讲话所对应的音频,也可以是说话人唱歌所对应的音频。另外,目标音频数据的语种在本公开中也不作限定。音素后验概率(Phonetic Posterior Grams,PPG)用于表征音频数据中的每一语音帧所属音素的分布概率,即语音帧内容属于哪一种音素的概率分布。
在上述技术方案中,在提取到待合成的目标音频数据的声学特征后,可以根据该声学特征,确定目标音频数据对应的音素后验概率,并根据音素后验概率,生成目标音频数据对应的图像序列;之后,将目标音频数据和相应的图像 序列进行视频合成,得到目标视频数据。由于音素后验概率为与实际说话人无关的信息,由此可以避免不同说话人发音习惯(口音)、噪声等因素对后续生成的图像序列的影响,从而可以提升生成的图像序列中头部动作和身体姿态的准确度。并且,针对任意说话人的语音数据,均可生成相应的图像序列,进而得到视频数据。
下面针对上述确定模块603根据目标声学特征,确定目标音频数据对应的音素后验概率,并根据音素后验概率,生成目标音频数据对应的图像序列的实施方式进行详细说明。
在一种实施方式中,确定模块603包括:确定子模块,设置为将目标声学特征输入至ASR模型中,以得到目标音频数据对应的音素后验概率;特征提取子模块,设置为将音素后验概率输入至预先训练好的循环神经网络RNN中,得到目标音频数据对应的动作特征(包括头部动作和身体姿态),其中,RNN在训练过程中设置为学习音素后验概率与动作特征之间的映射关系;合成子模块,设置为通过头部和身体对齐、图像融合、光流法等技术将RNN生成的动作特征合成为图像序列。
在另一种实施方式中,确定模块603是设置为将所述目标声学特征输入至图像生成模型中,以通过所述图像生成模型根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述目标音频数据对应的音素后验概率,生成所述目标音频数据对应的图像序列。这样,将目标声学特征输入到图像生成模型中,可以直接生成目标音频数据对应的图像序列,方便快捷。
可选地,所述图像生成模型包括:依次连接的语音识别子模型、门控递归单元、以及变分自编码器的解码网络;其中,所述语音识别子模型设置为根据输入的音频数据的声学特征,确定所述音频数据的音素后验概率;所述门控递归单元设置为根据所输入的音素后验概率,确定特征向量;所述解码网络设置为根据所述特征向量,生成与所述音频数据对应的图像序列。
可选地,所述图像生成模型还包括所述变分自编码器的编码网络;并且,可以通过图像生成模型的训练装置来训练得到。如图7所示,该训练装置700包括:第二获取模块701,设置为获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;训练模块702,设置为通过将所述第二获取模块701获取到的所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
可选地,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元、所述解码网络以及所述编码网络;所述判别器设置为在模型训练阶段,对所述解码网络输出的图像序列进行真假判定,其中,所得的真假判定结果用于对所述生成器的模型参数和所述判别器的模型参数进行更新。
本公开还提供一种图像生成模型的训练装置,其中,图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,变分自编码器包括编码网络和解码网络。如图7所示,该装置700包括:第二获取模块701,设置为获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;训练模块702,设置为通过将所述第二获取模块701获取到的所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作 为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
在本公开中,参考音频数据对应的文本数据可以是参考视频中的字幕数据,也可以是根据参考音频数据,进行人工标注所得的文本数据。
可选地,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元以及所述变分自编码器;所述装置700还包括:输入模块,设置为通过解码网络将所得的图像序列输入至所述判别器;判定模块,设置为通过所述判别器对所述解码网络所得的图像序列进行真假判定;更新模块,设置为利用所得的真假判定结果对所述生成器的模型参数和所述判别器的模型参数进行更新。
另外,需要说明的是,上述图像生成模型的训练装置700可以集成于视频生成装置600中,也可以独立于该视频生成装置600,在本公开中不作限定。另外,关于上述实施例中的装置,其中各个模块执行操作的方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
下面参考图8,其示出了适于用来实现本公开实施例的电子设备(例如终端设备或服务器)800的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、个人数字助理(Personal Digital Assistant,PDA)、PAD(平板电脑)、便携式多媒体播放器(Portable Media Player,PMP)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字电视(Television,TV)、台式计算机等等的固定终端。图8示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图8所示,电子设备800可以包括处理装置(例如中央处理器、图形处 理器等)801,其可以根据存储在只读存储器(Read-Only Memory,ROM)802中的程序或者从存储装置808加载到随机访问存储器(Random Access Memory,RAM)803中的程序而执行各种适当的动作和处理。在RAM 803中,还存储有电子设备800操作所需的各种程序和数据。处理装置801、ROM 802以及RAM803通过总线804彼此相连。输入/输出(Input/Output,I/O)接口805也连接至总线804。
通常,以下装置可以连接至I/O接口805:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置806;包括例如液晶显示器(Liquid Crystal Display,LCD)、扬声器、振动器等的输出装置807;包括例如磁带、硬盘等的存储装置808;以及通信装置809。通信装置809可以允许电子设备800与其他设备进行无线或有线通信以交换数据。虽然图8示出了具有各种装置的电子设备800,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置809从网络上被下载和安装,或者从存储装置808被安装,或者从ROM 802被安装。在该计算机程序被处理装置801执行时,执行本公开实施例的方法中限定的上述功能。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有至少一个导线的电连接、便携式计算机磁盘、硬盘、随机访 问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器((Erasable Programmable Read-Only Memory,EPROM)或闪存)、光纤、便携式紧凑磁盘只读存储器(Compact Disc-Read Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、射频(Radio Frequency,RF)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有至少一个程序,当上述至少一个程序被该电子设备执行时,使得该电子设备:获取待合成的目标音频数据;提取所述目标音频数据的声学特征作为目标声学特征;根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频 数据对应的图像序列,其中,所述音素后验概率用于表征音频数据中的每一语音帧所属音素的分布概率;将所述目标音频数据和所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
或者,上述计算机可读介质承载有至少一个程序,当上述至少一个程序被该电子设备执行时,使得该电子设备:获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;其中,图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;通过将所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)——连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图 中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含至少一个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,第一获取模块还可以被描述为“获取待合成的目标音频数据的模块”。
本文中以上描述的功能可以至少部分地由至少一个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(Field Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用标准产品(Application Specific Standard Parts,ASSP)、片上系统(System on Chip,SOC)、复杂可编程逻辑设备(Complex Programmable Logic Device,CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于至少一个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM 或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的至少一个实施例,示例1提供了一种视频生成方法,包括:获取待合成的目标音频数据;提取所述目标音频数据的声学特征作为目标声学特征;根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征音频数据中的每一语音帧所属音素的分布概率;将所述目标音频数据和所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
根据本公开的至少一个实施例,示例2提供了示例1的方法,所述根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,包括:将所述目标声学特征输入至图像生成模型中,以通过所述图像生成模型根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述目标音频数据对应的音素后验概率,生成所述目标音频数据对应的图像序列。
根据本公开的至少一个实施例,示例3提供了示例1的方法,所述图像生成模型包括:依次连接的语音识别子模型、门控递归单元、以及变分自编码器的解码网络;其中,所述语音识别子模型设置为根据输入的音频数据的声学特征,确定所述音频数据的音素后验概率;所述门控递归单元设置为根据所输入的音素后验概率,确定特征向量;所述解码网络设置为根据所述特征向量,生成与所述音频数据对应的图像序列。
根据本公开的至少一个实施例,示例4提供了示例3的方法,所述图像生成模型还包括所述变分自编码器的编码网络;所述图像生成模型是通过如下方式训练得到:获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;通过将所述参考音频数据 的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
根据本公开的至少一个实施例,示例5提供了示例4的方法,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元、所述解码网络以及所述编码网络;所述判别器设置为在模型训练阶段,对所述解码网络输出的图像序列进行真假判定,其中,所得的真假判定结果用于对所述生成器的模型参数和所述判别器的模型参数进行更新。
根据本公开的至少一个实施例,示例6提供了一种图像生成模型的训练方法,所述图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;所述方法包括:获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;通过将所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
根据本公开的至少一个实施例,示例7提供了示例6的方法,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的 生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元以及所述变分自编码器;所述方法还包括:所述解码网络将所得的图像序列输入至所述判别器;所述判别器对所述解码网络所得的图像序列进行真假判定;利用所得的真假判定结果对所述生成器的模型参数和所述判别器的模型参数进行更新。
根据本公开的至少一个实施例,示例8提供了一种视频生成装置,包括:第一获取模块,设置为获取待合成的目标音频数据;提取模块,设置为提取所述第一获取模块获取到的所述目标音频数据的声学特征作为目标声学特征;确定模块,设置为根据所述提取模块提取到的所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征音频数据中的每一语音帧所属音素的分布概率;合成模块,设置为将所述第一获取模块获取到的所述目标音频数据和所述确定模块确定出的所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
根据本公开的至少一个实施例,示例9提供了一种图像生成模型的训练装置,所述图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;所述装置包括:第二获取模块,设置为获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;训练模块,设置为通过将所述第二获取模块获取到的所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的 方式进行模型训练,以得到所述图像生成模型。
根据本公开的至少一个实施例,示例10提供了一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现示例1-7中任一项所述方法。
根据本公开的至少一个实施例,示例11提供了一种电子设备,包括:存储装置,其上存储有计算机程序;处理装置,设置为执行所述存储装置中的所述计算机程序,以实现示例1-5中任一项所述方法。
根据本公开的至少一个实施例,示例12提供了一种电子设备,包括:存储装置,其上存储有计算机程序;处理装置,设置为执行所述存储装置中的所述计算机程序,以实现示例6或7所述方法。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。

Claims (12)

  1. 一种视频生成方法,包括:
    获取待合成的目标音频数据;
    提取所述目标音频数据的声学特征作为目标声学特征;
    根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征所述目标音频数据中的每一语音帧所属音素的分布概率;
    将所述目标音频数据和所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
  2. 根据权利要求1所述的方法,其中,所述根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,包括:
    将所述目标声学特征输入至图像生成模型中,以通过所述图像生成模型根据所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述目标音频数据对应的音素后验概率,生成所述目标音频数据对应的图像序列。
  3. 根据权利要求2所述的方法,其中,所述图像生成模型包括:依次连接的语音识别子模型、门控递归单元、以及变分自编码器的解码网络;
    其中,所述语音识别子模型设置为根据输入的目标音频数据的声学特征,确定所述目标音频数据的音素后验概率;
    所述门控递归单元设置为根据所输入的音素后验概率,确定特征向量;
    所述解码网络设置为根据所述特征向量,生成与所述目标音频数据对应的图像序列。
  4. 根据权利要求3所述的方法,其中,所述图像生成模型还包括所述变分自编码器的编码网络;
    所述图像生成模型是通过如下方式训练得到:
    获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;
    通过将所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
  5. 根据权利要求4所述的方法,其中,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元、所述解码网络以及所述编码网络;
    所述判别器设置为在模型训练阶段,对所述解码网络输出的图像序列进行真假判定,其中,所得的真假判定结果用于对所述生成器的模型参数和所述判别器的模型参数进行更新。
  6. 一种图像生成模型的训练方法,所述图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;
    所述方法包括:
    获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;
    通过将所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像 生成模型。
  7. 根据权利要求6所述的方法,其中,所述图像生成模型还包括判别器,其中,所述图像生成模型为包括生成器和所述判别器的生成式对抗网络,所述生成器包括所述语音识别子模型、所述门控递归单元以及所述变分自编码器;
    所述方法还包括:
    所述解码网络将所得的图像序列输入至所述判别器;
    所述判别器对所述解码网络所得的图像序列进行真假判定;
    利用所得的真假判定结果对所述生成器的模型参数和所述判别器的模型参数进行更新。
  8. 一种视频生成装置,包括:
    获取模块,设置为获取待合成的目标音频数据;
    提取模块,设置为提取所述第一获取模块获取到的所述目标音频数据的声学特征作为目标声学特征;
    确定模块,设置为根据所述提取模块提取到的所述目标声学特征,确定所述目标音频数据对应的音素后验概率,并根据所述音素后验概率,生成所述目标音频数据对应的图像序列,其中,所述音素后验概率用于表征所述目标音频数据中的每一语音帧所属音素的分布概率;
    合成模块,设置为将所述第一获取模块获取到的所述目标音频数据和所述确定模块确定出的所述目标音频数据对应的图像序列进行视频合成,得到目标视频数据。
  9. 一种图像生成模型的训练装置,所述图像生成模型包括语音识别子模型、门控递归单元以及变分自编码器,其中,所述变分自编码器包括编码网络和解码网络;
    所述装置包括:
    获取模块,设置为获取参考视频数据,其中,所述参考视频数据包括参考音频数据、参考图像序列和所述参考音频数据对应的文本数据;
    训练模块,设置为通过将所述第二获取模块获取到的所述参考音频数据的声学特征作为所述语音识别子模型的输入,将所述参考音频数据对应的文本数据作为所述语音识别子模型的目标输出,将所述参考图像序列作为所述编码网络的输入,将所述参考图像序列作为所述解码网络的目标输出,将所述语音识别子模型根据所述参考音频数据的声学特征确定出的、所述参考音频数据对应的音素后验概率作为所述门控递归单元的输入,将所述编码网络的输出作为所述门控递归单元的目标输出的方式进行模型训练,以得到所述图像生成模型。
  10. 一种计算机可读介质,所述计算机可读介质上存储有计算机程序,所述计算机程序被处理装置执行时实现权利要求1-7中任一项所述方法。
  11. 一种电子设备,包括:
    存储装置,所述存储装置上存储有计算机程序;
    处理装置,设置为执行所述存储装置中的所述计算机程序,以实现权利要求1-5中任一项所述方法。
  12. 一种电子设备,包括:
    存储装置,所述存储装置上存储有计算机程序;
    处理装置,设置为执行所述存储装置中的所述计算机程序,以实现权利要求6或7所述方法。
PCT/CN2021/109460 2020-08-12 2021-07-30 视频生成方法、生成模型训练方法、装置、介质及设备 WO2022033327A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/000,387 US20230223010A1 (en) 2020-08-12 2021-07-30 Video generation method, generation model training method and apparatus, and medium and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010807940.7A CN111933110B (zh) 2020-08-12 2020-08-12 视频生成方法、生成模型训练方法、装置、介质及设备
CN202010807940.7 2020-08-12

Publications (1)

Publication Number Publication Date
WO2022033327A1 true WO2022033327A1 (zh) 2022-02-17

Family

ID=73312027

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109460 WO2022033327A1 (zh) 2020-08-12 2021-07-30 视频生成方法、生成模型训练方法、装置、介质及设备

Country Status (3)

Country Link
US (1) US20230223010A1 (zh)
CN (1) CN111933110B (zh)
WO (1) WO2022033327A1 (zh)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933110B (zh) * 2020-08-12 2021-10-29 北京字节跳动网络技术有限公司 视频生成方法、生成模型训练方法、装置、介质及设备
CN112735371B (zh) * 2020-12-28 2023-08-04 北京羽扇智信息科技有限公司 一种基于文本信息生成说话人视频的方法及装置
CN112634861A (zh) * 2020-12-30 2021-04-09 北京大米科技有限公司 数据处理方法、装置、电子设备和可读存储介质
CN112887789B (zh) * 2021-01-22 2023-02-21 北京百度网讯科技有限公司 视频生成模型的构建和视频生成方法、装置、设备及介质
CN112992189B (zh) * 2021-01-29 2022-05-03 青岛海尔科技有限公司 语音音频的检测方法及装置、存储介质及电子装置
CN113079327A (zh) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 视频生成方法和装置、存储介质和电子设备
CN113077819A (zh) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 发音评价方法和装置、存储介质和电子设备
CN113111813A (zh) * 2021-04-20 2021-07-13 深圳追一科技有限公司 基于asr声学模型的嘴部动作驱动模型训练方法及组件
CN113314104B (zh) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 交互对象驱动和音素处理方法、装置、设备以及存储介质
CN113408208B (zh) * 2021-06-25 2023-06-09 成都欧珀通信科技有限公司 模型训练方法、信息提取方法、相关装置及存储介质
CN113611265B (zh) * 2021-07-07 2022-09-23 湖南师范大学 一种人工智能作曲方法和系统
CN113703579B (zh) * 2021-08-31 2023-05-30 北京字跳网络技术有限公司 数据处理方法、装置、电子设备及存储介质
CN113935418A (zh) * 2021-10-15 2022-01-14 北京字节跳动网络技术有限公司 视频生成方法及设备
CN113936643B (zh) * 2021-12-16 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 语音识别方法、语音识别模型、电子设备和存储介质
CN114567693B (zh) * 2022-02-11 2024-01-30 维沃移动通信有限公司 视频生成方法、装置和电子设备
CN114938476B (zh) * 2022-05-31 2023-09-22 深圳市优必选科技股份有限公司 说话头视频合成方法、装置、终端设备及可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610717A (zh) * 2016-07-11 2018-01-19 香港中文大学 基于语音后验概率的多对一语音转换方法
CN110503942A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 一种基于人工智能的语音驱动动画方法和装置
CN110880315A (zh) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 一种基于音素后验概率的个性化语音和视频生成系统
CN110930981A (zh) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 多对一语音转换系统
CN111933110A (zh) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 视频生成方法、生成模型训练方法、装置、介质及设备
CN113079327A (zh) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 视频生成方法和装置、存储介质和电子设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018077244A1 (en) * 2016-10-27 2018-05-03 The Chinese University Of Hong Kong Acoustic-graphemic model and acoustic-graphemic-phonemic model for computer-aided pronunciation training and speech processing
CN109493846B (zh) * 2018-11-18 2021-06-08 深圳市声希科技有限公司 一种英语口音识别系统
CN110728203B (zh) * 2019-09-23 2022-04-12 清华大学 基于深度学习的手语翻译视频生成方法及系统
CN111292720B (zh) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 语音合成方法、装置、计算机可读介质及电子设备
CN111429894A (zh) * 2020-03-12 2020-07-17 南京邮电大学 基于SE-ResNet STARGAN的多对多说话人转换方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610717A (zh) * 2016-07-11 2018-01-19 香港中文大学 基于语音后验概率的多对一语音转换方法
CN110930981A (zh) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 多对一语音转换系统
CN110503942A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 一种基于人工智能的语音驱动动画方法和装置
CN110880315A (zh) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 一种基于音素后验概率的个性化语音和视频生成系统
CN111933110A (zh) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 视频生成方法、生成模型训练方法、装置、介质及设备
CN113079327A (zh) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 视频生成方法和装置、存储介质和电子设备

Also Published As

Publication number Publication date
US20230223010A1 (en) 2023-07-13
CN111933110B (zh) 2021-10-29
CN111933110A (zh) 2020-11-13

Similar Documents

Publication Publication Date Title
WO2022033327A1 (zh) 视频生成方法、生成模型训练方法、装置、介质及设备
CN111583900B (zh) 歌曲合成方法、装置、可读介质及电子设备
US20200335121A1 (en) Audio-visual speech separation
CN111369967B (zh) 基于虚拟人物的语音合成方法、装置、介质及设备
CN111899719A (zh) 用于生成音频的方法、装置、设备和介质
CN111798821B (zh) 声音转换方法、装置、可读存储介质及电子设备
WO2022151931A1 (zh) 语音合成方法、合成模型训练方法、装置、介质及设备
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN110246488B (zh) 半优化CycleGAN模型的语音转换方法及装置
CN112489621B (zh) 语音合成方法、装置、可读介质及电子设备
CN113205793B (zh) 音频生成方法、装置、存储介质及电子设备
CN110097890A (zh) 一种语音处理方法、装置和用于语音处理的装置
WO2022037388A1 (zh) 语音生成方法、装置、设备和计算机可读介质
CN111883107B (zh) 语音合成、特征提取模型训练方法、装置、介质及设备
WO2022237665A1 (zh) 语音合成方法、装置、电子设备和存储介质
CN109697978B (zh) 用于生成模型的方法和装置
CN111785247A (zh) 语音生成方法、装置、设备和计算机可读介质
CN114255740A (zh) 语音识别方法、装置、计算机设备和存储介质
CN114429658A (zh) 人脸关键点信息获取方法、生成人脸动画的方法及装置
WO2022037383A1 (zh) 语音处理方法、装置、电子设备和计算机可读介质
CN112785667A (zh) 视频生成方法、装置、介质及电子设备
CN111415662A (zh) 用于生成视频的方法、装置、设备和介质
CN111862933A (zh) 用于生成合成语音的方法、装置、设备和介质
CN114495901A (zh) 语音合成方法、装置、存储介质及电子设备
CN112652292A (zh) 用于生成音频的方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21855380

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21855380

Country of ref document: EP

Kind code of ref document: A1