CN110880315A - Personalized voice and video generation system based on phoneme posterior probability
- Publication number: CN110880315A
- Application number: CN201910991186.4A
- Authority: CN (China)
- Prior art keywords: lip, video, speaker, phoneme posterior probability
- Prior art date: 2019-10-17
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
- G10L15/25 — Speech recognition using non-acoustical features, using position of the lips, movement of the lips or face analysis
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
Abstract
The invention discloses a personalized voice and video generation system based on phoneme posterior probabilities, which mainly comprises the following steps: S1, extracting phoneme posterior probabilities with an automatic speech recognition system; S2, training a recurrent neural network to learn the mapping between phoneme posterior probabilities and lip features, so that audio from any target speaker can be input to the network to output the corresponding lip features; S3, synthesizing the lip features into corresponding face images using face alignment, image fusion, optical flow, and related techniques; and S4, generating the final video of the speaker talking from the generated face sequence using dynamic programming and related techniques. The invention relates to the technical fields of speech synthesis and voice conversion. By generating lip shapes from phoneme posterior probabilities, the invention greatly reduces the amount of video data required from the target speaker, and can generate the target speaker's video directly from text content without additionally recording the speaker's audio.
Description
Technical Field
The invention relates to the technical field of voice and video, and in particular to a personalized voice and video generation system based on phoneme posterior probabilities.
Background
With improvements in computing power, the collection of large amounts of Internet data, and breakthroughs in core algorithms, artificial intelligence has entered a new stage of development, and the modes of human-computer interaction are gradually changing. An important part of human-computer interaction is simulating a realistic human likeness to interact with the user; the key technology here is virtual-avatar generation, and combining it with speech synthesis and voice conversion enables personalized voice and video synthesis.
Speech synthesis is a technique that converts text to speech, and voice conversion can be used to customize the timbre of the synthesized speech. With the application of deep learning, the naturalness and fluency of synthesized and converted speech have greatly improved.
The current mainstream virtual-avatar generation technology changes the avatar's expression in real time based on facial recognition. This approach suits two-dimensional avatars but has difficulty producing avatars that resemble real people. In recent years, avatar generation based on modeling real people has been researched and developed in both academia and industry, but the results still need improvement: the lips look strange, the motion is stiff, the facial movements do not match the voice, and the resolution of the face, particularly of the lips, is low. In addition, this technology requires a substantial amount of video data from the target speaker; when the data are insufficient, the generation quality is hard to guarantee, which degrades the user experience, so the overall practicality is weak and operation is inconvenient for the user.
Disclosure of Invention
Technical problem to be solved
Aiming at the shortcomings of the prior art, the invention provides a personalized voice and video generation system based on phoneme posterior probabilities, which greatly reduces the amount of video data required from the target speaker and can generate the target speaker's video directly from text content without additionally recording the speaker's audio.
(II) Technical solution
In order to achieve the above purpose, the invention is realized by the following technical solution: a personalized voice and video generation system based on phoneme posterior probabilities, mainly comprising the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, second, a recurrent neural network (RNN) is trained to learn the mapping between phoneme posterior probabilities and lip features; through this network, audio from any target speaker can be input to output the corresponding lip features; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
Preferably, speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
Preferably, in S2, the inputs are shifted two steps in the RNN model. To generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network; the gating mechanism of the LSTM unit controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses. Thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
Preferably, the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
Preferably, in S4, the generated video supports further editing and modification.
(III) Advantageous effects
The invention provides a personalized voice and video generation system based on phoneme posterior probabilities, with the following beneficial effects:
(1) The system extracts phoneme posterior probabilities (PPGs) from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system (S1). Because the PPG representation is speaker-independent, the requirement on the amount of video data from the target speaker is greatly reduced.
(2) The system trains a recurrent neural network to learn the mapping between phoneme posterior probabilities and lip features (S2); through this network, audio from any target speaker can be input to output the corresponding lip features, and if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion before the lip features are output through the network. The target speaker's video can thus be generated directly from text content without additionally recording the speaker's audio.
Drawings
FIG. 1 is a diagram of the main steps of the practice of the present invention;
FIG. 2 is a schematic representation of the RNN model of the present invention;
FIG. 3 is a detailed flow chart of the practice of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-3, an embodiment of the present invention provides a technical solution: a personalized voice and video generation system based on phoneme posterior probabilities, mainly comprising the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system (an illustrative PPG-extraction sketch is given after step S4 below). The posterior-probability-based method rests in part on the following assumption: the posterior probabilities obtained from a speaker-independent speech recognition system are independent of the speaker and depend only on the content of the utterance. The method is divided into three stages: a first training stage (training stage 1), a second training stage (training stage 2), and a video generation stage. In training stage 1, the SI-ASR model is trained to obtain the PPG representation of the input speech; in training stage 2, a recurrent neural network (RNN) model is trained to model the mapping between PPG features and the target speaker's lip features for parameter generation; and in the video generation stage, the SI-ASR and RNN models generate the corresponding lip features for input text or speech, from which the corresponding faces and video are synthesized.
S2, next, the lip shape in each video frame is extracted and normalized by translation, rotation, scaling, etc., to serve as the lip-shape feature vector, yielding the time-ordered lip feature vector sequence {y_0, y_1, ..., y_t} used to train a recurrent neural network (RNN) model (an illustrative normalization sketch is likewise given after step S4 below). FIG. 2 schematically shows how the audio feature x_t at time t is input into the LSTM unit. It is noteworthy that lip movements usually precede voice production; for example, when we say "orange", the mouth has already opened before the sound of "o" is produced, so the inputs are shifted two steps in the model. Through the recurrent neural network, the mapping between the phoneme posterior probabilities of the audio features and the lip features is learned, so the corresponding lip features can be output by inputting any target speaker's audio; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
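For illustration only, and not as part of the claimed method, the following minimal Python sketch shows one way the PPG extraction of training stage 1 (step S1) could be realized: frame-level posteriors over phoneme classes obtained by a softmax over the outputs of a pretrained SI-ASR acoustic model. The `si_asr` object and its `forward` method are hypothetical placeholders.

```python
import numpy as np

def extract_ppg(acoustic_features: np.ndarray, si_asr) -> np.ndarray:
    """acoustic_features: (T, D) frame-wise features, e.g. MFCCs.
    Returns a (T, P) phonetic posteriorgram: each row is a posterior
    distribution over P phoneme classes and sums to 1."""
    logits = si_asr.forward(acoustic_features)           # (T, P) unnormalized scores (hypothetical API)
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # frame-wise softmax -> PPG
```

Because the SI-ASR model is trained over many speakers, these posteriors capture what is said rather than who says it, which is exactly the assumption stated in S1.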
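Similarly, the lip-feature normalization of step S2 can be sketched as follows (illustrative only: the mouth-corner landmark indices are assumptions, and landmark extraction itself is taken as given):

```python
import numpy as np

def normalize_lip_landmarks(pts: np.ndarray) -> np.ndarray:
    """pts: (K, 2) lip landmark coordinates for one video frame.
    Returns landmarks centered at the origin, rotated so the line through
    the mouth corners is horizontal, and scaled to unit mouth width."""
    centered = pts - pts.mean(axis=0)                    # remove translation
    left, right = centered[0], centered[6]               # assumed mouth-corner indices
    angle = np.arctan2(right[1] - left[1], right[0] - left[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rotated = centered @ np.array([[c, -s], [s, c]]).T   # remove rotation
    width = np.linalg.norm(rotated[6] - rotated[0])
    return rotated / width                               # remove scale
```

Applying this to every frame yields the lip feature vector sequence {y_0, y_1, ..., y_t} used as the RNN training targets.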
In the invention, speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
In the invention, in S2, the inputs are shifted two steps in the RNN model. To generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network; the gating mechanism of the LSTM unit controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses. Thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
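As a concrete illustration of the two-step shift, a minimal PyTorch sketch follows (the feature dimensions and layer sizes are assumptions, not values fixed by this disclosure): the prediction at frame t is compared against the lip target from two frames earlier, so the network has seen two extra frames of audio context before committing to a mouth shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPG2Lip(nn.Module):
    """Maps a PPG sequence (B, T, ppg_dim) to lip features (B, T, lip_dim)."""
    def __init__(self, ppg_dim=218, lip_dim=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(ppg_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, lip_dim)

    def forward(self, ppg):
        out, _ = self.lstm(ppg)   # LSTM gating retains long-term audio/pose context
        return self.proj(out)

def shifted_mse(model, ppg, lip, delay=2):
    """Align the prediction at frame t with the lip frame at t - delay,
    since lip motion leads the audio by roughly `delay` frames."""
    pred = model(ppg)
    return F.mse_loss(pred[:, delay:], lip[:, :-delay])
```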
In the invention, the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
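For the image-fusion step, one plausible realization (an assumption, not necessarily the exact method used here) is Poisson blending via OpenCV's `seamlessClone`, which produces the kind of seamless join between synthesized lip texture and face described above; the mask and center point would come from the aligned lip region.

```python
import cv2

def blend_lip_region(face_frame, lip_texture, lip_mask, center_xy):
    """face_frame: HxWx3 target frame; lip_texture: hxwx3 synthesized mouth patch;
    lip_mask: hxw uint8 mask (255 inside the mouth region);
    center_xy: (x, y) center position of the patch in the face frame."""
    return cv2.seamlessClone(lip_texture, face_frame, lip_mask,
                             center_xy, cv2.NORMAL_CLONE)
```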
In the invention, in S4, the generated video supports further editing and modification.
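The timeline readjustment by dynamic programming can be sketched as follows (a minimal illustration under assumed inputs: a precomputed mismatch cost between each candidate source frame and each audio-rate output step, a monotonicity constraint, and a small penalty for skipping frames):

```python
import numpy as np

def retime_frames(cost: np.ndarray, skip_penalty: float = 0.1) -> list:
    """cost: (T, N) mismatch cost of showing source frame j at output step t.
    Returns one monotone source-frame index per output step."""
    T, N = cost.shape
    dp = np.full((T, N), np.inf)
    back = np.zeros((T, N), dtype=int)
    dp[0] = cost[0]
    for t in range(1, T):
        for j in range(N):
            # hold the current frame, or advance by one or two source frames
            prev = [k for k in (j, j - 1, j - 2) if k >= 0]
            scores = [dp[t - 1, k] + skip_penalty * (j - k) for k in prev]
            best = int(np.argmin(scores))
            dp[t, j] = cost[t, j] + scores[best]
            back[t, j] = prev[best]
    path = [int(np.argmin(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Minimizing this cumulative cost keeps the frame sequence monotone and smooth while letting head and lip motion line up with the audio, which is the intent of the retiming step.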
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. A personalized voice and video generation system based on phoneme posterior probabilities, characterized in that it mainly comprises the following steps:
S1, first, phoneme posterior probabilities (PPGs) are extracted from the source speaker's speech using a speaker-independent automatic speech recognition (SI-ASR) system;
S2, second, a recurrent neural network (RNN) is trained to learn the mapping between phoneme posterior probabilities and lip features; through this network, audio from any target speaker can be input to output the corresponding lip features; if the input is text, the target speaker's audio is first produced by speech synthesis and voice conversion, and the lip features are then output through the network;
S3, the lip features generated by the trained recurrent neural network are synthesized into corresponding face images using face alignment, image fusion, optical flow, and related techniques, with the lip shape of the face kept synchronized with the audio;
and S4, the final video of the speaker talking is generated from the resulting face sequence using dynamic programming and related techniques.
2. The system of claim 1, characterized in that: speaker-independent automatic speech recognition is abbreviated SI-ASR, the recurrent neural network is abbreviated RNN, and the phoneme posterior probability is abbreviated PPG.
3. The system of claim 1, characterized in that: in S2, the inputs are shifted two steps in the RNN model; to generate smooth and natural lip motion, a long short-term memory (LSTM) network is used as the basic unit of the neural network, whose gating mechanism controls the necessary information storage and state transitions so that the model can simultaneously retain long-term dependencies on the audio and on previous lip and head poses; thus, after the RNN model is trained, it can generate speaker video whose lip and head motion is natural and consistent with the input audio.
4. The system of claim 1, characterized in that: the face-image synthesis in S3 uses multiple image processing algorithms; for example, the faces in the video are normalized by face alignment, the synthesized lip texture is seamlessly joined to the face by image fusion, the chin is corrected by optical flow, and the time axis of the video is readjusted by dynamic programming so that the head motion in the video matches the audio more naturally.
5. The system of claim 1, characterized in that: in S4, the generated video supports further editing and modification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910991186.4A CN110880315A (en) | 2019-10-17 | 2019-10-17 | Personalized voice and video generation system based on phoneme posterior probability |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110880315A (en) | 2020-03-13 |
Family
ID=69728108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910991186.4A | Personalized voice and video generation system based on phoneme posterior probability (CN110880315A, pending) | 2019-10-17 | 2019-10-17 |
Country Status (1)
Country | Link |
---|---|
CN | CN110880315A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101165679A (en) * | 2006-10-20 | 2008-04-23 | Toshiba TEC Corporation | Pattern matching device and method |
CN103021440A (en) * | 2012-11-22 | 2013-04-03 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for tracking audio streaming media |
CN103035236A (en) * | 2012-11-27 | 2013-04-10 | Hohai University, Changzhou Campus | High-quality voice conversion method based on modeling of signal timing characteristics |
US20180012613A1 (en) * | 2016-07-11 | 2018-01-11 | The Chinese University Of Hong Kong | Phonetic posteriorgrams for many-to-one voice conversion |
CN107610717A (en) * | 2016-07-11 | 2018-01-19 | The Chinese University of Hong Kong | Many-to-one voice conversion method based on phonetic posterior probabilities |
CN106599198A (en) * | 2016-12-14 | 2017-04-26 | SYSU-CMU Shunde International Joint Research Institute | Image description method for multi-stage connection recurrent neural network |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | TCL Corporation | Virtual human face animation generation method and device |
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | Zhejiang University | Voice-driven lip-synchronized face video synthesis algorithm based on cascaded convolutional LSTM |
Non-Patent Citations (6)
Title |
---|
Rick Parent, "Computer Animation: Algorithms and Techniques", Tsinghua University Press, 31 January 2018 * |
Samer Al Moubayed, "Expirement for Lips Synchronization Using Phone Lattice to Face Parameters", Leuven University * |
Xinjian Zhang et al., "A New Language Independent, Photo-realistic Talking Head Driven by Voice Only", Interspeech 2013 * |
Yilong Liu et al., "Video-audio driven real-time facial animation", ACM Transactions on Graphics * |
Zhang Pu et al., "Research and Application of Digital Chinese Language Teaching", Yuwen Press, 30 June 2006 * |
Xu Han, "Big Data, Artificial Intelligence and Online Public Opinion Governance", Wuhan University Press, 31 October 2018 * |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111432233A (en) * | 2020-03-20 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating video |
CN111666831B (en) * | 2020-05-18 | 2023-06-20 | 武汉理工大学 | Method for generating face video of speaker based on decoupling expression learning |
CN111666831A (en) * | 2020-05-18 | 2020-09-15 | 武汉理工大学 | Decoupling representation learning-based speaking face video generation method |
CN111933110A (en) * | 2020-08-12 | 2020-11-13 | 北京字节跳动网络技术有限公司 | Video generation method, generation model training method, device, medium and equipment |
WO2022033327A1 (en) * | 2020-08-12 | 2022-02-17 | 北京字节跳动网络技术有限公司 | Video generation method and apparatus, generation model training method and apparatus, and medium and device |
CN112634918A (en) * | 2020-09-29 | 2021-04-09 | 江苏清微智能科技有限公司 | Acoustic posterior probability based arbitrary speaker voice conversion system and method |
CN112634918B (en) * | 2020-09-29 | 2024-04-16 | 江苏清微智能科技有限公司 | System and method for converting voice of any speaker based on acoustic posterior probability |
CN112541956A (en) * | 2020-11-05 | 2021-03-23 | 北京百度网讯科技有限公司 | Animation synthesis method and device, mobile terminal and electronic equipment |
CN112735371B (en) * | 2020-12-28 | 2023-08-04 | 北京羽扇智信息科技有限公司 | Method and device for generating speaker video based on text information |
CN112735371A (en) * | 2020-12-28 | 2021-04-30 | 出门问问(苏州)信息科技有限公司 | Method and device for generating speaker video based on text information |
CN114578969A (en) * | 2020-12-30 | 2022-06-03 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for human-computer interaction |
CN114578969B (en) * | 2020-12-30 | 2023-10-20 | 北京百度网讯科技有限公司 | Method, apparatus, device and medium for man-machine interaction |
CN112766166A (en) * | 2021-01-20 | 2021-05-07 | 中国科学技术大学 | Lip-shaped forged video detection method and system based on polyphone selection |
CN112766166B (en) * | 2021-01-20 | 2022-09-06 | 中国科学技术大学 | Lip-shaped forged video detection method and system based on polyphone selection |
WO2022194044A1 (en) * | 2021-03-19 | 2022-09-22 | 北京有竹居网络技术有限公司 | Pronunciation assessment method and apparatus, storage medium, and electronic device |
CN113079327A (en) * | 2021-03-19 | 2021-07-06 | 北京有竹居网络技术有限公司 | Video generation method and device, storage medium and electronic equipment |
CN113077819A (en) * | 2021-03-19 | 2021-07-06 | 北京有竹居网络技术有限公司 | Pronunciation evaluation method and device, storage medium and electronic equipment |
CN113035235A (en) * | 2021-03-19 | 2021-06-25 | 北京有竹居网络技术有限公司 | Pronunciation evaluation method and apparatus, storage medium, and electronic device |
CN114338959A (en) * | 2021-04-15 | 2022-04-12 | 西安汉易汉网络科技股份有限公司 | End-to-end text-to-video synthesis method, system medium and application |
CN113314094A (en) * | 2021-05-28 | 2021-08-27 | 北京达佳互联信息技术有限公司 | Lip-shaped model training method and device and voice animation synthesis method and device |
CN113314094B (en) * | 2021-05-28 | 2024-05-07 | 北京达佳互联信息技术有限公司 | Lip model training method and device and voice animation synthesis method and device |
WO2022252890A1 (en) * | 2021-05-31 | 2022-12-08 | 上海商汤智能科技有限公司 | Interaction object driving and phoneme processing methods and apparatus, device and storage medium |
CN113760100A (en) * | 2021-09-22 | 2021-12-07 | 入微智能科技(南京)有限公司 | Human-computer interaction equipment with virtual image generation, display and control functions |
CN113760100B (en) * | 2021-09-22 | 2024-02-02 | 入微智能科技(南京)有限公司 | Man-machine interaction equipment with virtual image generation, display and control functions |
CN113838174B (en) * | 2021-11-25 | 2022-06-10 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN113838174A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200313 |