CN113990295A - Video generation method and device - Google Patents

Video generation method and device

Info

Publication number
CN113990295A
Authority
CN
China
Prior art keywords
phoneme
face
audio
video
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130297.XA
Other languages
Chinese (zh)
Inventor
王愈
李健
武卫东
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202111130297.XA priority Critical patent/CN113990295A/en
Publication of CN113990295A publication Critical patent/CN113990295A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a video generation method and a video generation device. The method comprises the following steps: determining phoneme features in audio data through a preset speech recognition model, or determining phoneme features in text data through a preset speech synthesis model; processing each frame's phoneme feature vector through a preset face feature conversion model to obtain the corresponding face features of each frame; determining the face image corresponding to each frame's face features through a preset face reconstruction model; and packing the consecutive face images together with the audio data into a video file. The method has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. it is robust, because the simplified vector feature space helps the face feature conversion model learn a stable mapping between the phoneme features and the face image characteristics.

Description

Video generation method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a video generation method and apparatus.
Background
With the development of artificial intelligence technology, more and more artificial intelligence techniques are being applied in practice across various fields. The virtual anchor is a popular direction among them: a virtual anchor is a broadcasting character animation synthesized by a computer, in which the consistency and synchronization of mouth shape and pronunciation are emphasized. Application scenarios of the virtual anchor generally include audio driving and text driving: in audio driving, a real person records in the background and the virtual anchor animation video is generated in the foreground; in text driving, text is input directly and the virtual anchor is entirely responsible for generating synchronized audio and animation. Existing virtual anchor generation schemes generally use the acquired audio directly for subsequent processing and do not consider that audio feature distributions differ widely; for example, the acoustic features of male voices, female voices and speakers of different ages are very different.
Such schemes have the following defects: 1. they are speaker-specific, that is, audio and video of the same person are required, the features extracted from each are used to train the conversion model, a large amount of data from that same person is needed during training, and the resulting model can only be used for that person and cannot be applied to others; 2. they are not robust, because the speech feature space and the image feature space are not in one-to-one correspondence (specific pronunciation details and specific face images do not correspond precisely), so the trained conversion model is sensitive and prone to fluctuation.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a video generation method and a corresponding video generation apparatus that overcome or at least partially solve the above problems.
In order to solve the above problem, in one aspect, an embodiment of the present invention discloses a video generation method, including the following steps:
acquiring input information;
determining target audio data according to the input information, and determining phoneme characteristics of a plurality of audio frames in the target audio data by taking the frames as granularity;
respectively generating corresponding human face features according to the phoneme features of the audio frames;
generating a face image according to the face features;
and generating a video file by adopting the plurality of audio frames and the corresponding face images.
Further, the determining phoneme characteristics of a plurality of audio frames in the target audio data with frame granularity includes:
determining the posterior probability PPGs of the phonemes corresponding to a plurality of audio frames in the target audio data by using a preset phoneme recognition model and taking the frames as granularity;
and determining the phoneme characteristics of the audio frame according to the PPGs corresponding to the audio frame.
Further, the determining the phoneme characteristics of the audio frame according to the PPGs corresponding to the audio frame includes:
performing one-hot encoding processing on the PPGs;
and determining the phoneme with the probability of one after the one-hot coding process as the phoneme characteristic of the audio frame.
Further, the generating corresponding face features according to the phoneme features of the plurality of audio frames respectively includes:
and converting the phoneme characteristics of the audio frame into corresponding human face characteristics through a preset human face characteristic conversion model.
Further, the preset face feature conversion model is obtained by training in the following way:
acquiring a sample video, wherein the sample video is a video containing a face of a speaker;
extracting audio data and video data of the sample video;
extracting voice features from the audio data and extracting face features from the video data according to the same frame length;
and training the preset human face feature conversion model by adopting the voice features and the human face features.
Further, the time length of the sample video is greater than a preset time length.
Further, the input information includes input audio data or input text data;
the determining target audio data according to the input information includes:
when the input information is audio data, determining the target audio data through a preset voice recognition model;
and when the input information is text data, determining the target audio data through a preset speech synthesis model.
In another aspect, the present invention further provides a video generating apparatus, including:
the data acquisition module is used for acquiring input information;
the phoneme characteristic acquisition module is used for determining target audio data according to the input information and determining phoneme characteristics of a plurality of audio frames in the target audio data by taking the frames as granularity;
the face feature acquisition module is used for respectively generating corresponding face features according to the phoneme features of the audio frames;
the face image acquisition module is used for generating a face image according to the face features;
and the video file generation module is used for generating a video file by adopting the plurality of audio frames and the corresponding face images.
Further, the phoneme feature obtaining module comprises:
the PPGs acquisition submodule is used for determining the posterior probability PPGs of the phonemes corresponding to a plurality of audio frames in the target audio data by using a preset phoneme recognition model and taking the frame as granularity;
and the phoneme characteristic acquisition submodule is used for determining the phoneme characteristics of the audio frame according to the PPGs corresponding to the audio frame.
Further, the PPGs includes probabilities for respective phonemes in a preset phoneme list, and the phoneme feature acquisition submodule includes:
the one-hot coding unit is used for carrying out one-hot coding processing on the PPGs;
and the phoneme feature unit is used for determining the phoneme with the probability of one after the one-hot coding processing as the phoneme feature of the audio frame.
Further, the face feature acquisition module comprises:
and the face feature conversion submodule is used for converting the phoneme features of the audio frame into corresponding face features through a preset face feature conversion model.
Further, the preset face feature conversion model is obtained by training through the following modules:
the sample video acquisition module is used for acquiring a sample video, wherein the sample video contains a speaker's face;
the data extraction module is used for extracting audio data and video data of the sample video;
the face feature acquisition submodule is used for extracting speech features from the audio data and face features from the video data at the same frame length, and for training the preset face feature conversion model with the speech features and the face features.
Further, the time length of the sample video is greater than a preset time length.
Further, the input information includes input audio data or input text data; the phoneme feature acquisition module comprises:
the first target audio data acquisition submodule is used for determining the target audio data through a preset voice recognition model when the input information is audio data;
and the second target audio data acquisition submodule is used for determining the target audio data through a preset speech synthesis model when the input information is text data.
Meanwhile, the embodiment of the invention also provides an electronic device, which comprises a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the steps of the video generation method when being executed by the processor.
Meanwhile, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the video generation method.
According to the technical scheme, a phoneme recognition model is preset; the phoneme posterior probabilities (PPGs) corresponding to a plurality of audio frames in the target audio data are determined at frame granularity; the phoneme features of the audio frames are determined from the PPGs; corresponding face features are generated from the phoneme features; face images are generated from the face features; and finally a video file is generated from the target audio data and the corresponding face images. Compared with prior-art audio-to-video virtual anchor generation methods, the PPGs reflect only pronunciation content and do not contain the speaker's personal information, so the method has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. robustness: the simplified vector feature space helps the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
Drawings
Fig. 1 is a flowchart illustrating steps of a video generation method according to an embodiment of the present invention;
fig. 2 is a partial internal structure diagram of a UFANS neural network according to an embodiment of the present invention;
fig. 3 is a schematic diagram of another video generation method provided by the embodiment of the invention;
fig. 4 is a schematic diagram of another video generation method according to an embodiment of the present invention;
fig. 5 is a schematic diagram of another video generation method according to an embodiment of the present invention;
fig. 6 is a block diagram of a video generating apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Before describing the technical solution of the present invention in detail, the applicant first briefly introduces several terms used in the technical solution:
mel-frequency cepstral coefficients (MFCCs), in the field of speech recognition, are a set of feature vectors obtained by encoding and operating on speech physical information (spectral envelope and details).
Mel cepstral parameters (MCEPs) describe the details of pronunciation and contain the speaker's personal characteristics. Because each person's vocal cords and oral cavity are different, the sound waveforms produced by different people have different characteristics; MCEPs are the parameters that describe these differences.
Phoneme Posterior Probabilities (PPGs) are the output of the acoustic model in a speech recognition system. A phoneme is the smallest unit of speech divided according to pronunciation action; for example, the audio "ah" corresponds to the single phoneme (a), and the word "ai" (love) corresponds to the phonemes (a, i). Speech audio has widely differing feature distributions (for example, male voices, female voices and speakers of different ages differ greatly), and different audio features affect subsequent model processing differently. By splitting the original audio into phonemes, this application avoids the influence of such audio characteristics. For each frame of the input audio, the PPGs give the probability that the frame corresponds to each phoneme. For example, in a Chinese speech recognition system, if the preset phoneme list contains 70 phonemes (a, b, c, d, e, f, g, ... zh, ch, sh, silence), the PPGs output by the trained speech recognition model is a 70-dimensional vector [x1, x2, ..., x70], where each element is a probability between 0 and 1 and the 70 elements sum to 1.
One-Hot vector processing, also called one-hot encoding, changes the maximum element of a vector to 1 and all other elements to 0. For example, the one-hot conversion of [1, 3, 5, 2] is [0, 0, 1, 0].
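For illustration only, the following Python sketch applies the one-hot processing described above to a single frame's PPG vector; the truncated phoneme list and probability values are made-up examples rather than part of the embodiment.

```python
import numpy as np

# Made-up, truncated phoneme list; the example in this description uses
# 70 entries (a, b, c, ... zh, ch, sh, silence).
phoneme_list = ["a", "b", "c", "zh", "ch", "sh", "sil"]

# One frame's PPG: a probability for each phoneme, summing to 1.
ppg = np.array([0.05, 0.02, 0.01, 0.80, 0.07, 0.03, 0.02])

def one_hot_ppg(ppg):
    # Change the maximum element to 1 and every other element to 0.
    one_hot = np.zeros_like(ppg)
    one_hot[np.argmax(ppg)] = 1.0
    return one_hot

feature = one_hot_ppg(ppg)
print(feature)                                 # [0. 0. 0. 1. 0. 0. 0.]
print(phoneme_list[int(np.argmax(feature))])   # "zh": the frame's phoneme feature
```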
Fig. 1 is a flowchart of the steps of a video generation method provided by an embodiment of the present invention, which can be applied to virtual anchor scenarios. The virtual anchor has great potential in scenarios such as news broadcasting, virtual teachers, virtual interviews and virtual live broadcasting, where it can reduce labor costs. The video generation method comprises the following steps:
step 101, acquiring input information;
the input information includes input audio data or input text data.
Step 102, determining target audio data according to the input information, and determining phoneme characteristics of a plurality of audio frames in the target audio data by taking frames as granularity;
determining the posterior probability PPGs of the phonemes corresponding to a plurality of audio frames in the target audio data by using a preset phoneme recognition model and taking the frames as granularity;
the method comprises the steps that a user selects audio input or text input according to actual requirements, when the user selects the audio input, an audio recording file is input into a preset voice recognition model, and Mel-frequency cepstral coefficients (MFCC) corresponding to the audio recording file and the posterior Probability of Phonemes (PPGs) of each frame of phonemes corresponding to the MFCC are obtained through the voice recognition model; when a user selects text data input, the text data is input into a preset speech synthesis model, and the Mel cepstrum coefficient and the phoneme posterior probability corresponding to the text data are obtained through the speech synthesis model. In this embodiment, the PPGs includes probabilities for each phoneme in the preset phoneme list. The speech recognition model and the speech synthesis model are general techniques in the field of speech processing, and will not be described in detail here.
When the input information is audio data, two groups of outputs are obtained from the preset speech recognition model: one group is the normal recording data, the other group is the PPGs of each frame of the speech. The target audio data and the PPGs of each audio frame in it are thus determined by the preset speech recognition model; one-hot encoding (changing the maximum value of a vector to 1 and all other values to 0) is then applied to the PPGs, and the phoneme whose probability is 1 after one-hot encoding is taken as the phoneme feature of the audio frame.
When the input information is text data, two groups of outputs are generated by the speech synthesis model: one group is the normal synthesized speech, the other group is the PPGs of each frame. Because the speech synthesis model annotates the text with pinyin before generating pronunciation, it can directly output the specific phoneme corresponding to each frame of speech, i.e. one of the 70 phonemes {a, b, c, d, e, f, g, ... zh, ch, sh, silence}. The output PPGs is a 70-dimensional vector [x1, x2, ..., x70], where each element is a probability between 0 and 1 and the 70 elements sum to 1. The target audio data and the PPGs of each audio frame are thus determined by the preset speech synthesis model, one-hot encoding is applied to the PPGs, and the phoneme whose probability is 1 after one-hot encoding is taken as the phoneme feature of the audio frame. The one-hot processing simplifies the vector feature space, which helps the subsequent face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
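Purely as an illustrative sketch of the two branches described above, the following Python code dispatches audio or text input and one-hot encodes the resulting per-frame PPGs; asr_model and tts_model are stand-in stubs (returning random dummy PPGs) for the preset speech recognition and speech synthesis models, not real components.

```python
import numpy as np

NUM_PHONEMES = 70  # size of the preset phoneme list in the example above

def one_hot_ppg(ppg):
    # Keep only the most probable phoneme of the frame.
    out = np.zeros_like(ppg)
    out[np.argmax(ppg)] = 1.0
    return out

# Stand-in stubs: a real system would call the preset ASR / TTS models here.
def asr_model(audio):
    ppgs = np.random.dirichlet(np.ones(NUM_PHONEMES), size=100)  # 100 dummy frames
    return audio, ppgs

def tts_model(text):
    synthesized = np.zeros(16000)                                # dummy waveform
    ppgs = np.random.dirichlet(np.ones(NUM_PHONEMES), size=80)   # 80 dummy frames
    return synthesized, ppgs

def phoneme_features_from_input(input_data, is_text):
    target_audio, ppgs = (tts_model if is_text else asr_model)(input_data)
    # One-hot each frame's PPG to obtain the frame-level phoneme features.
    return target_audio, np.stack([one_hot_ppg(frame) for frame in ppgs])

audio, features = phoneme_features_from_input(np.zeros(48000), is_text=False)
print(features.shape)  # (100, 70): one one-hot phoneme feature per frame
```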
Step 103, respectively generating corresponding face features according to the phoneme features of the audio frames;
The face feature conversion model needs to be trained for a specific target person: video containing the target person's face while speaking is collected for longer than a preset duration. In this embodiment, the preset duration may be more than fifteen minutes; of course, a person skilled in the art may set the preset duration according to actual needs, which is not limited by the embodiments of the present invention.
The specific training procedure comprises the following steps: acquiring a sample video, wherein the sample video contains a speaker's face; extracting the audio data and the video data of the sample video; extracting speech features from the audio data and face features from the video data at the same frame length; and training the preset face feature conversion model with the speech features and the face features.
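A minimal sketch of this data preparation step is shown below; the two extract_* functions are hypothetical stand-ins that return dummy arrays, since the actual front ends (the phoneme recognizer and the face parameter extractor) are separate components.

```python
import numpy as np

def extract_speech_features(audio_path, n_frames):
    # Stand-in: a real pipeline would run the preset phoneme recognizer here
    # and return one 70-dimensional PPG (or one-hot phoneme feature) per frame.
    return np.random.rand(n_frames, 70)

def extract_face_features(video_path, n_frames):
    # Stand-in: a real pipeline would fit face parameters to each video frame.
    return np.random.rand(n_frames, 64)

def build_training_pairs(audio_path, video_path, n_frames):
    speech = extract_speech_features(audio_path, n_frames)
    face = extract_face_features(video_path, n_frames)
    # Both streams must be extracted at the same frame length so that
    # frame t of the speech features aligns with frame t of the face features.
    assert len(speech) == len(face)
    return list(zip(speech, face))

pairs = build_training_pairs("sample_audio.wav", "sample_video.mp4", n_frames=1500)
print(len(pairs))  # 1500 aligned (speech feature, face feature) training pairs
```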
The face feature conversion model adopts a UFANS neural network. UFANS (U-shaped Fully-parallel Acoustic Neural Structure) is a deep neural network structure for one-dimensional sequence modeling tasks. The structure has two main characteristics. First, it is U-shaped, in the spirit of the U-Net that is popular in the image field: inside the structure, each round of downsampling recursively halves the input length, and after each round the length is doubled again by deconvolution and the result is added as a residual to that round's input; each round can therefore be regarded as the sum of two paths, one path of basic convolution and one path of information recovered after one round of downsampling and upsampling, so the structure covers a wider receptive field. Second, it is fully convolutional: only convolution, deconvolution and pooling operations are performed inside the model, with no RNN substructure, so computation is fully parallelizable and the computation speed can be greatly improved.
As shown in fig. 2, fig. 2 is a partial internal structure diagram of a UFANS neural network, where A denotes a convolution operation in the dimension-reduction stage, B denotes a mean-pooling operation in the dimension-reduction stage, C denotes a deconvolution operation, D denotes a convolution operation in the dimension-increase stage, and E denotes the final convolution operation. Optionally, when the preset number of rounds is 2, the phoneme sequence vector corresponding to the PPGs undergoes 2 rounds of the length-halving operations A and B to obtain O_B2 of size [T/4, F], and is then restored to the original input size by 2 rounds of the length-doubling operations C and D to obtain the output Mel cepstral parameters (MCEPs). In training the face feature conversion model, the UFANS neural network transforms the input phoneme sequence vector, which enlarges the adaptive range of the face feature conversion model.
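As a rough, purely illustrative sketch of such a U-shaped, fully convolutional one-dimensional structure (not the exact UFANS architecture, layer sizes or feature dimensions of the embodiment), a two-round version could be written in PyTorch as follows; all layer names and dimensions are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class TinyUShape1D(nn.Module):
    def __init__(self, in_dim=70, hidden=128, out_dim=64):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.down1 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)        # A: dim-reduction conv
        self.down2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.AvgPool1d(2)                                             # B: mean pooling, halves length
        self.up2 = nn.ConvTranspose1d(hidden, hidden, kernel_size=2, stride=2)  # C: deconvolution, doubles length
        self.up1 = nn.ConvTranspose1d(hidden, hidden, kernel_size=2, stride=2)
        self.conv_up = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)      # D: dim-increase conv
        self.out = nn.Conv1d(hidden, out_dim, kernel_size=3, padding=1)         # E: final conv

    def forward(self, x):                # x: [batch, T, in_dim] phoneme-feature sequence
        x = x.transpose(1, 2)            # -> [batch, in_dim, T]
        h0 = self.inp(x)                 # [batch, hidden, T]
        h1 = self.pool(self.down1(h0))   # [batch, hidden, T/2]
        h2 = self.pool(self.down2(h1))   # [batch, hidden, T/4]
        u2 = self.up2(h2) + h1           # residual addition on the way back up
        u1 = self.up1(self.conv_up(u2)) + h0
        y = self.out(u1)                 # [batch, out_dim, T] output feature sequence
        return y.transpose(1, 2)         # face-feature (or acoustic-parameter) frames

seq = torch.randn(1, 128, 70)            # 128 frames of one-hot phoneme features
print(TinyUShape1D()(seq).shape)          # torch.Size([1, 128, 64])
```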
Step 104, generating a face image according to the face features;
The face features are converted into face images through a face reconstruction model. The face reconstruction model may optionally adopt any conventional scheme commonly used in the industry and is not specifically limited in this application.
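Since the face reconstruction model is not limited here, the following is only a toy stand-in to illustrate the interface: a small transposed-convolution decoder that maps one frame's face-feature vector to an image. Its architecture and dimensions are assumptions for illustration, not the embodiment's model.

```python
import torch
import torch.nn as nn

class ToyFaceDecoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),    # 32 -> 64
            nn.Sigmoid(),                                                     # pixel values in [0, 1]
        )

    def forward(self, face_feat):                  # face_feat: [batch, feat_dim]
        h = self.fc(face_feat).view(-1, 128, 8, 8)
        return self.deconv(h)                      # [batch, 3, 64, 64] image

img = ToyFaceDecoder()(torch.randn(1, 64))
print(img.shape)  # torch.Size([1, 3, 64, 64])
```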
Step 105, generating a video file from the plurality of audio frames and the corresponding face images.
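The packaging in step 105 is ordinary audio/video muxing. As an illustrative sketch only (file names, frame rate and codec flags are assumptions, not specified by the embodiment), the generated face frames could be written out with OpenCV and then muxed with the audio track using the ffmpeg command line:

```python
import subprocess
import cv2
import numpy as np

def pack_video(frames, audio_path, out_path, fps=25):
    # frames: list of HxWx3 uint8 BGR face images, one per audio frame step.
    h, w = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter("silent.mp4", fourcc, fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Mux the silent video with the audio track into the final file.
    subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)

frames = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(50)]  # dummy frames
# pack_video(frames, "speech.wav", "anchor.mp4")
```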
According to the technical scheme of this embodiment, a phoneme recognition model is preset; the phoneme posterior probabilities PPGs corresponding to a plurality of audio frames in the target audio data are determined at frame granularity; the phoneme features of the audio frames are determined from the PPGs; corresponding face features are generated from the phoneme features; face images are generated from the face features; and finally a video file is generated from the target audio data and the corresponding face images. Compared with prior-art audio-to-video virtual anchor generation methods, the PPGs represent only pronunciation content and do not contain the speaker's personal information, so the method has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. robustness: the simplified vector feature space helps the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
Fig. 3 is a schematic diagram of another video generation method provided by an embodiment of the invention. Specifically, the method includes: acquiring audio data; determining the phoneme features of a plurality of audio frames in the audio data through a preset speech recognition model; processing each frame's phoneme features through a preset face feature conversion model to obtain the corresponding face features of each frame; determining the face image corresponding to each frame's face features through a preset face reconstruction model; and packaging the consecutive face images together with the audio data into a video file.
Compared with prior-art audio-to-video virtual anchor generation methods, this method has the following advantages: 1. it is speaker-independent: the PPGs reflect only pronunciation content and contain no speaker-specific information, so the video generation method in this application supports speech input from any person, generates the equivalent PPGs through the speech recognition model, and generates the images in the subsequent steps; 2. robustness: the PPGs simplify the vector feature space, which makes it easy for the face feature conversion model to learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
Fig. 4 is a schematic diagram of another video generation method provided by an embodiment of the present invention. Building on the video generation method shown in fig. 3, this method is mainly intended for text data input; to further simplify the vector features and help the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, the method includes two parts, a training stage and a synthesis stage:
The core of the training stage is to train a face feature conversion model that converts PPGs into face features. The training stage requires collecting more than fifteen minutes of video of the target person's face while speaking, extracting the audio and the images, and then extracting speech features from the audio and face features from the images at equal frame lengths to serve as training data. The face feature conversion model adopts a UFANS neural network, and the training stage comprises the following steps:
acquiring audio data; determining the phoneme features of a plurality of audio frames in the audio data through a preset speech recognition model; vectorizing each frame's phoneme features to obtain the corresponding phoneme sequence vector, where the vectorization is one-hot processing; processing each frame's phoneme sequence vector through the preset face feature conversion model to obtain the corresponding face features of each frame; determining the face image corresponding to each frame's face features through a preset face reconstruction model; and packaging the consecutive face images together with the speech into a video file.
After the face feature conversion model is trained, the synthesis stage of the method comprises the following steps: acquiring text data; determining the phoneme features of a plurality of audio frames in the speech synthesized from the text data through a preset speech synthesis model; vectorizing each frame's phoneme features to obtain the corresponding phoneme sequence vector, where the vectorization is one-hot processing; processing each frame's phoneme sequence vector through the preset face feature conversion model to obtain the corresponding face features of each frame; determining the face image corresponding to each frame's face features through the preset face reconstruction model; and packaging the consecutive face images together with the speech into a video file.
Compared with prior-art audio-to-video virtual anchor generation methods, this method has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. robustness: the simplified vector feature space helps the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
In the two scenarios of the audio-driven method shown in fig. 3 and the text-driven method shown in fig. 4, the specific implementation steps are similar and can be integrated into a dual-mode video generation method. Fig. 5 is a schematic diagram of another video generation method provided by an embodiment of the present invention; the method includes a training stage and a usage stage, and the specific steps are not repeated here. In the training stage, more than fifteen minutes of video of the target person's face while speaking is collected; the audio and the images are extracted, and speech features are extracted from the audio and face features from the images at equal frame lengths to serve as training data. The face feature conversion model adopts a UFANS neural network. After training, the method supports both audio and text input. Compared with prior-art audio-to-video virtual anchor generation methods, it has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. robustness: the simplified vector feature space helps the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Fig. 6 is a block diagram of a video generating apparatus according to an embodiment of the present invention, where the video generating apparatus may include:
a data obtaining module 601, configured to obtain input information;
a phoneme feature obtaining module 602, configured to determine target audio data according to the input information, and determine phoneme features of multiple audio frames in the target audio data by using a frame as a granularity;
a face feature obtaining module 603, configured to generate corresponding face features according to the phoneme features of the multiple audio frames;
a face image obtaining module 604, configured to generate a face image according to the face features;
a video file generating module 605, configured to generate a video file by using the plurality of audio frames and the corresponding face images.
In an alternative embodiment, the phoneme feature obtaining module 602 may include:
the PPGs acquisition submodule is used for determining the posterior probability PPGs of the phonemes corresponding to a plurality of audio frames in the target audio data by using a preset phoneme recognition model and taking the frame as granularity;
and the phoneme characteristic acquisition submodule is used for determining the phoneme characteristics of the audio frame according to the PPGs corresponding to the audio frame.
In an alternative embodiment, the PPGs includes probabilities for each phoneme in a preset phoneme list, and the phoneme feature obtaining sub-module may include:
the one-hot coding unit is used for carrying out one-hot coding processing on the PPGs;
and the phoneme feature unit is used for determining the phoneme with the probability of one after the one-hot coding processing as the phoneme feature of the audio frame.
In an alternative embodiment, the facial feature obtaining module 603 may include:
and the face feature conversion submodule is used for converting the phoneme features of the audio frame into corresponding face features through a preset face feature conversion model.
In an optional embodiment, the preset face feature conversion model is obtained by training through the following modules:
the system comprises a sample video acquisition module, a video processing module and a video processing module, wherein the sample video acquisition module is used for acquiring a sample video, and the sample video is a video containing a face of a speaker;
the data extraction module is used for extracting audio data and video data of the sample video;
the face feature acquisition submodule is used for extracting speech features from the audio data and face features from the video data at the same frame length, and for training the preset face feature conversion model with the speech features and the face features.
In an alternative embodiment, the time length of the sample video is greater than a preset time length.
In an alternative embodiment, the input information comprises input audio data or input text data; the phoneme feature obtaining module 602 may include:
the first target audio data acquisition submodule is used for determining the target audio data through a preset voice recognition model when the input information is audio data;
and the second target audio data acquisition submodule is used for determining the target audio data through a preset speech synthesis model when the input information is text data.
Based on the above description, in the technical scheme of the present invention, the phoneme posterior probabilities PPGs corresponding to a plurality of audio frames in the target audio data are determined at frame granularity by a preset phoneme recognition model; the phoneme features of the audio frames are determined from the PPGs; corresponding face features are generated from the phoneme features; face images are generated from the face features; and finally a video file is generated from the target audio data and the corresponding face images. Compared with prior-art audio-to-video virtual anchor generation methods, the PPGs represent only pronunciation content and do not contain the speaker's personal information, so the scheme has the following advantages: 1. it is speaker-independent and supports voice data or text data input from any person; 2. robustness: the simplified vector feature space helps the face feature conversion model learn a stable mapping between the PPGs and the face image characteristics, so that the generated virtual anchor's pronunciation matches the face images.
The embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when being executed by the processor, the computer program implements each process of the video generation method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the video generation method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The video generation method and the video generation apparatus provided by the present invention are described in detail above, and the principle and the implementation of the present invention are explained in the present document by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of video generation, comprising:
acquiring input information;
determining target audio data according to the input information, and determining phoneme characteristics of a plurality of audio frames in the target audio data by taking the frames as granularity;
respectively generating corresponding human face features according to the phoneme features of the audio frames;
generating a face image according to the face features;
and generating a video file by adopting the plurality of audio frames and the corresponding face images.
2. The method of claim 1, wherein the determining phoneme characteristics for a plurality of audio frames in the target audio data at a frame granularity comprises:
determining the posterior probability PPGs of the phonemes corresponding to a plurality of audio frames in the target audio data by using a preset phoneme recognition model and taking the frames as granularity;
and determining the phoneme characteristics of the audio frame according to the PPGs corresponding to the audio frame.
3. The method according to claim 2, wherein the PPGs include probabilities for each phoneme in a preset phoneme list, and wherein the determining the phoneme characteristics of the audio frame according to the PPGs to which the audio frame corresponds comprises:
performing one-hot encoding processing on the PPGs;
and determining the phoneme with the probability of one after the one-hot coding process as the phoneme characteristic of the audio frame.
4. The method according to claim 1, wherein the generating corresponding face features according to the phoneme features of the plurality of audio frames comprises:
and converting the phoneme characteristics of the audio frame into corresponding human face characteristics through a preset human face characteristic conversion model.
5. The method of claim 4, wherein the predetermined face feature transformation model is trained by:
acquiring a sample video, wherein the sample video is a video containing a face of a speaker;
extracting audio data and video data of the sample video;
extracting voice features from the audio data and extracting face features from the video data according to the same frame length;
and training the preset human face feature conversion model by adopting the voice features and the human face features.
6. The method of claim 5, wherein the time length of the sample video is greater than a preset time length.
7. The method of claim 1, wherein the input information comprises input audio data or input text data;
the determining target audio data according to the input information includes:
when the input information is audio data, determining the target audio data through a preset voice recognition model;
and when the input information is text data, determining the target audio data through a preset speech synthesis model.
8. A video generation apparatus, comprising:
the data acquisition module is used for acquiring input information;
the phoneme characteristic acquisition module is used for determining target audio data according to the input information and determining phoneme characteristics of a plurality of audio frames in the target audio data by taking the frames as granularity;
the face feature acquisition module is used for respectively generating corresponding face features according to the phoneme features of the audio frames;
the face image acquisition module is used for generating a face image according to the face features;
and the video file generation module is used for generating a video file by adopting the plurality of audio frames and the corresponding face images.
9. An electronic device, comprising: processor, memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the video generation method according to any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the video generation method according to any one of claims 1 to 7.
CN202111130297.XA 2021-09-26 2021-09-26 Video generation method and device Pending CN113990295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130297.XA CN113990295A (en) 2021-09-26 2021-09-26 Video generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111130297.XA CN113990295A (en) 2021-09-26 2021-09-26 Video generation method and device

Publications (1)

Publication Number Publication Date
CN113990295A true CN113990295A (en) 2022-01-28

Family

ID=79736739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130297.XA Pending CN113990295A (en) 2021-09-26 2021-09-26 Video generation method and device

Country Status (1)

Country Link
CN (1) CN113990295A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581570A (en) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114581570B (en) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system

Similar Documents

Publication Publication Date Title
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN113077537A (en) Video generation method, storage medium and equipment
CN114267329B (en) Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
US20230343319A1 (en) speech processing system and a method of processing a speech signal
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113111812A (en) Mouth action driving model training method and assembly
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN111627420A (en) Specific-speaker emotion voice synthesis method and device under extremely low resources
Shimba et al. Talking heads synthesis from audio with deep neural networks
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
CN114743539A (en) Speech synthesis method, apparatus, device and storage medium
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN116129852A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN113990295A (en) Video generation method and device
CN112216293A (en) Tone conversion method and device
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN117935323A (en) Training method of face driving model, video generation method and device
Bittal et al. Speech to image translation framework for teacher-student learning
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
CN115731917A (en) Voice data processing method, model training method, device and storage medium
Zainkó et al. Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
d’Alessandro et al. Reactive statistical mapping: Towards the sketching of performative control with data
CN113223513A (en) Voice conversion method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination