WO2023002511A1 - System and method for producing audio-video response to an input - Google Patents

System and method for producing audio-video response to an input

Info

Publication number
WO2023002511A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
response
audio
input
text
Prior art date
2021-07-23
Application number
PCT/IN2022/050662
Other languages
French (fr)
Inventor
Bhairav SHANKAR
Buvaneash D
Original Assignee
Avantari Technologies Private Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-23
Filing date
2022-07-23
Publication date
2023-01-26
Application filed by Avantari Technologies Private Limited
Publication of WO2023002511A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0475 - Generative networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/094 - Adversarial learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G10L2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Disclosed are systems and methods for producing audio-video responses to an input (102). The system comprises a terminal device (104), a response generation module (106), and a video generation module. The terminal device (104) is configured to receive one or more inputs (102) to be responded to, and the response generation module (106) is configured to process this input and generate a response text accordingly. The video generation module is configured to convert the generated response text into a response video. The response video comprises a plurality of individual audio frames (110) synthesized from the input (102), encoded with a plurality of corresponding facial lip-synced video frames (204). The plurality of individual audio frames (110) and the plurality of corresponding facial lip-synced video frames (204) are combined to form a single video output.

Description

SYSTEM AND METHOD FOR PRODUCING AUDIO-VIDEO RESPONSE TO AN
INPUT
FIELD OF THE INVENTION
[0001] The present invention provides a system and method for producing audio-video responses to input using one or more deep learning modules.
BACKGROUND OF THE INVENTION
[0002] The present invention is related to the field of deepfake technology. A deepfake is media, which may be an image, video, and/or audio, that was generated and/or modified using artificial intelligence. In some examples, a deepfake creator may combine and/or superimpose existing images and/or video onto a source image and/or video to generate the deepfake. As artificial intelligence techniques such as neural networks, deep learning, and machine learning advance, deepfake media has become increasingly realistic.
[0003] Deepfake videos can be a powerful means of conveying information to people, as they make it appear that a human is delivering the message. The human, however, is entirely computer generated, typically by an AI model that has been fed the person's likeness.
[0004] Moreover, deepfake technology has applications in healthcare and beyond. For example, at an ATM, a bank can use the opportunity to speak to the customer about other financial products while the customer is interacting with the machine. Possible applications include kiosks at hospitals where patients can interact with a display showing their doctor or nurse, allowing the patient to communicate with a known face rather than a machine. Multiple papers have shown that this improves compliance by the elderly regarding health, medication, and therapy.
[0005] Other prior art, such as US10628635B1 and US9721373B2, describes the creation of a virtual character whose lip movements and gestures are controlled by computer-generated instructions; however, that virtual character is built using computer-aided design, not deepfake technology. Because a deepfake is a manipulation of a real image, video, or audio, it sounds and looks realistic, whereas the realism of a 3D model depends on the virtual character's ability to cross the uncanny valley, i.e. the hypothesized relationship between the degree of an object's resemblance to a human being and the emotional response to such an object.
SUMMARY OF THE INVENTION
[0006] The present invention provides a method of producing an audio-video response to a text input. The method is characterized by receiving, at a terminal device, one or more inputs to be responded to; processing the input based on one or more response parameters to generate a response text to the input; and converting the response text into a response video via a video generation module. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames, combined as a single video output.
[0007] In one exemplary embodiment, the one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data.
[0008] In another exemplary embodiment, the terminal comprises a chatbot adapted to provide a video response to the input.
[0009] In another exemplary embodiment, the plurality of individual audio frames is generated from the response text using a text-to-audio generation module selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The plurality of facial lip-sync video frames is generated for each of the audio frames using an audio-to-video synthesis module selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region according to the corresponding audio frame.
[0010] In one embodiment, a computerized system for producing an audio-video response to an input is provided. The system comprises a terminal device adapted to receive one or more inputs to be responded to, a response generation module adapted to process the input based on one or more response parameters to generate a response text to the input, and a video generation module adapted to convert the response text into a response video. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames, combined as a single video output.
[0011] In another exemplary embodiment, the terminal device is a chatbot adapted to provide a video response to the input.
[0012] In another exemplary embodiment, the video generation module comprises a text-to-audio generation module adapted to generate the plurality of individual audio frames from the response text, the text-to-audio generation module being selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The video generation module further comprises an audio-to-video synthesis module adapted to generate facial lip-sync video frames corresponding to each of the plurality of audio frames, the audio-to-video synthesis module being selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention is further described in the detailed description that follows, by reference to the noted drawings by way of illustrative embodiments of the invention, in which like reference numerals represent similar parts throughout the drawings. The invention is not limited to the precise arrangements and illustrative examples shown in the drawings:
[0014] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames, in accordance with an exemplary embodiment of the present invention.
[0015] Figure 2 illustrates a block diagram of the audio-to-video synthesis module of the system, in accordance with an exemplary embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0016] With reference to the figures provided, embodiments of the present invention are now described in detail. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures, devices, activities, and methods are shown using schematics, use cases, and/or flow diagrams to avoid obscuring the invention. Although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to the suggested details are within the scope of the present invention. Similarly, although many of the features of the present invention are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the invention is set forth without any loss of generality to, and without imposing limitations upon, the invention.
[0017] The present invention provides a system and method for producing audio-video responses to input using one or more deep learning modules. The system comprises a chatbot adapted to receive input and generate response text based on one or more response parameters. The response text is converted into a plurality of audio frames. The plurality of audio frames is fed to an audio-to-video synthesis module to generate facial lip-synced video frames corresponding to each audio frame.
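For orientation, the following minimal Python sketch shows how the modules described above could be chained at inference time; the `chatbot`, `tts`, and `audio_to_video` objects and their method names are hypothetical placeholders, not interfaces defined by the patent.

```python
def respond_with_video(user_input: str, chatbot, tts, audio_to_video) -> str:
    """End-to-end flow described above: input -> response text -> speech audio
    -> lip-synced video. All three component interfaces are assumed."""
    response_text = chatbot.respond(user_input)        # response generation module
    speech_audio = tts.synthesize(response_text)       # text-to-audio generation module
    video_path = audio_to_video.render(speech_audio)   # audio-to-video synthesis module
    return video_path
```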
[0018] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames (110), in accordance with an exemplary embodiment of the present invention. The system comprises a terminal device (104), a response generation module (106), and a text-to-audio generation module (108). The terminal device (104) receives one or more inputs to be responded to. The input is audio, text, video, gestures, or any combination thereof. In one example, the terminal device comprises a chatbot adapted to provide a video response to the input. In one embodiment, the chatbot is trained using either an intent-based method or a conversational NLP-based method. The response generation module (106) may be configured to process the input based on the one or more response parameters to generate a response text. The one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data. The text-to-audio generation module (108) may be configured to generate a plurality of audio frames (110) synthesized from the response text. The text-to-audio generation module (108) is selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model.
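As a concrete illustration of the response-generation step for an intent-based chatbot, the sketch below fills a response template with response parameters such as user data or product information; the intent labels, templates, and field names are hypothetical and not taken from the patent.

```python
# Minimal sketch of an intent-based response-generation step (hypothetical
# intents, templates, and parameter names; the patent does not fix these).
RESPONSE_TEMPLATES = {
    "account_balance": "Hello {user_name}, your balance is {balance}.",
    "product_info": "Our {product_name} offers {product_benefit}.",
    "fallback": "Sorry, could you rephrase that?",
}

def generate_response_text(intent: str, response_params: dict) -> str:
    """Fill the template for the detected intent with response parameters
    (user data, product information, customer service information, etc.)."""
    template = RESPONSE_TEMPLATES.get(intent, RESPONSE_TEMPLATES["fallback"])
    try:
        return template.format(**response_params)
    except KeyError:
        return RESPONSE_TEMPLATES["fallback"]

print(generate_response_text("account_balance",
                             {"user_name": "Asha", "balance": "12,430"}))
```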
[0019] In one embodiment, the response text is fed to the Forward Transformer TTS model, which predicts the mel spectrogram features; the predicted mel spectrogram is in turn fed to the HiFi-GAN model, which generates the desired audio frame.
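A minimal sketch of that two-stage text-to-audio inference, assuming pre-trained `forward_tts` and `hifigan` PyTorch modules with the tensor shapes noted in the comments; the concrete model classes and method signatures are not specified in the patent.

```python
import torch

@torch.no_grad()
def synthesize_speech(phoneme_ids: torch.LongTensor,
                      forward_tts: torch.nn.Module,
                      hifigan: torch.nn.Module) -> torch.Tensor:
    """Text-to-audio stage: phonemes -> mel spectrogram -> 22.05 kHz waveform.

    `forward_tts` is assumed to return a mel spectrogram of shape
    (1, n_mels, frames); `hifigan` is assumed to map that to a waveform of
    shape (1, samples). Both interfaces are assumptions for illustration.
    """
    mel = forward_tts(phoneme_ids.unsqueeze(0))   # duration-aware acoustic model
    waveform = hifigan(mel)                       # neural vocoder
    return waveform.squeeze(0)
```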
[0020] The Forward Transformer TTS model is trained on a dataset consisting of phoneme texts as input and mel spectrogram features and duration prediction features as output pairs, from a single person's voice. The mel spectrogram features are calculated with a Short-Term Fourier Transform using a hop length of 275 and a frame length of 1100 on 22050 Hz audio, and the English text is converted to phoneme character text. After training, the Forward Transformer TTS model predicts the correct mel spectrogram features for the input phoneme text. The HiFi-GAN model is trained on mel spectrogram features as input and audio signals as output pairs, from a single person's voice, where the mel spectrogram features are calculated in the same way as in the Forward Transformer TTS training. After training, the HiFi-GAN model can generate 22050 Hz audio signals from input mel spectrogram features.
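The mel spectrogram targets described above can be reproduced with a standard STFT front end. The sketch below uses librosa with the stated hop length of 275, frame length of 1100, and 22050 Hz audio; the number of mel bands and the log compression are assumptions, since the patent does not state them.

```python
import librosa
import numpy as np

def mel_features(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Mel spectrogram features matching the stated training setup:
    22050 Hz audio, STFT hop length 275, frame length 1100."""
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1100, hop_length=275, n_mels=n_mels)
    return np.log(mel + 1e-5)  # log compression, a common (assumed) choice
```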
[0021] Figure 2 illustrates a block diagram of an audio-to-video synthesis module (202) of the system, in accordance with an exemplary embodiment of the present invention. The audio-to-video synthesis module (202) is adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110). The audio-to-video synthesis module (202) is selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region according to the corresponding audio frame.
[0022] In one exemplary embodiment, the audio frames are fed to the LipSync RNN, which predicts the lip-synced output encodings. These encodings are fed, along with the input face with the mouth, chin, and neck cropped/blacked out, to the Deep Neural Renderer UNet model, which synthesizes/renders the lip-synced mouth, chin, and neck in the input face image's cropped/blacked-out region. The required number of frames for the complete video is synthesized/rendered sequentially depending on the input audio duration. The individual synthesized/rendered frames are then stitched into a complete video along with the input video.
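A minimal sketch of that frame-by-frame rendering loop, assuming pre-trained `lipsync_rnn` and `renderer` (DNR UNet) modules with the interfaces noted in the comments; both interfaces are assumptions for illustration.

```python
import torch

@torch.no_grad()
def render_lipsync_frames(audio_features: torch.Tensor,
                          masked_face: torch.Tensor,
                          lipsync_rnn: torch.nn.Module,
                          renderer: torch.nn.Module) -> list[torch.Tensor]:
    """Audio-to-video stage: audio features -> latent codes -> rendered faces.

    `lipsync_rnn` is assumed to map (1, T, n_features) audio features to
    (1, T, 128) latent codes; `renderer` (the DNR UNet) is assumed to take the
    masked face image plus one latent code and return a full face image.
    """
    latents = lipsync_rnn(audio_features.unsqueeze(0)).squeeze(0)  # (T, 128)
    frames = []
    for z in latents:                                  # one latent per frame
        face = renderer(masked_face.unsqueeze(0), z.unsqueeze(0))
        frames.append(face.squeeze(0))
    return frames
```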
[0023] In another exemplary embodiment, the Adversarial Variational AutoEncoder (AVAE) model is trained on frontal face images of the desired person as both the input and the output of the model. The encoder part of the AutoEncoder learns a 128-dimensional latent-space vector that maps to each of the frontal face images, corresponding to different facial features along with the lip movements; likewise, the decoder part of the AutoEncoder learns to reconstruct the original input frontal face image from the 128-dimensional latent vector alone.
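A minimal PyTorch sketch of such an encoder-decoder with a 128-dimensional latent bottleneck is shown below; the layer sizes and 64x64 input resolution are assumptions, and the adversarial and variational components of the AVAE are omitted for brevity.

```python
import torch
import torch.nn as nn

class FaceAutoEncoder(nn.Module):
    """Toy stand-in for the face autoencoder: frontal face images in, a 128-D
    latent code in the middle, reconstructed face out. Layer sizes and the
    64x64 input resolution are illustrative assumptions; the adversarial and
    variational parts of the AVAE are not shown."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encoder(x)          # 128-D latent code
        return self.decoder(z), z
```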
[0024] In another exemplary embodiment, the LipSync RNN model is trained to predict lip-synced face encodings/latent vectors with respect to the input audio signal. It is trained on 100 Hz MFCC features, calculated from the 16000 Hz re-sampled audio of a video, as input and the 128-dimensional latent vector for each frame of the same video at 25 FPS as output pairs; the latent vectors are upsampled to 100 Hz to match the input MFCC feature frequency. After training, the LipSync RNN learns to map the input audio MFCC features to the 128-D latent vector that corresponds to a facial expression/lip movement learned by the AVAE. During video synthesis, the 100 Hz predicted 128-D latent vectors from the LipSync RNN are downsampled to 25 Hz to match the video frame rate.
[0025] In another exemplary embodiment, the Deep Neural Renderer (DNR) UNet model is trained, as a GAN model with two discriminators (i.e., spatial and temporal), on a face image with the mouth, chin, and neck cropped/blacked out, together with the latent vector of the same face's frontal image generated by the AVAE, as inputs and the complete face image as output. The Deep Neural Renderer is trained in a temporal recurrent generation manner, in which the input to the model is the current input pair (mouth-cropped face image + latent vector) and the model's output at time T-1 (the T-1 completely generated face and its corresponding latent vector from the AVAE). After training, the model learns to generate the mouth, lips, and neck regions in the cropped/masked-out regions of the input face image depending on the latent vector that corresponds to a specific mouth movement/expression.
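The 100 Hz alignment between audio features and 25 fps latent vectors implies a simple 4x relationship (100 / 25 = 4). The sketch below extracts 100 Hz MFCCs from 16 kHz audio (a hop of 160 samples) and repeats each per-frame latent four times to build aligned training pairs; the hop length and MFCC count are assumptions chosen to be consistent with those rates.

```python
import librosa
import numpy as np

def aligned_training_pair(wav_path: str, frame_latents: np.ndarray,
                          n_mfcc: int = 13):
    """Build (MFCC, latent) training pairs for the lip-sync RNN.

    frame_latents: (n_frames, 128) array of AVAE latents at 25 fps.
    MFCCs are computed at 100 Hz (hop of 160 samples at 16 kHz), and the
    25 fps latents are upsampled 4x by repetition to match that rate.
    """
    y, sr = librosa.load(wav_path, sr=16000)              # re-sample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=160).T          # (T_100Hz, n_mfcc)
    latents_100hz = np.repeat(frame_latents, 4, axis=0)    # 25 fps -> 100 Hz
    T = min(len(mfcc), len(latents_100hz))                 # trim to same length
    return mfcc[:T], latents_100hz[:T]
```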
[0026] In another exemplary embodiment, an input text is fed to the method, which in turn synthesizes speech audio at 22 kHz, speaking the words with correct pronunciation and appropriate pauses according to the given input text. The synthesized speech audio is downsampled to 16 kHz and fed as input to the LipSyncRNN, which predicts the latent vectors at a higher sample rate matching the input audio rather than the conventional video frame rate of 25 fps; these latent vectors correspond to the subject person's mouth movements, which are in sync with the input audio. The predicted latent vectors are downsampled to match the conventional video frame rate of 25 fps and fed as conditional input, along with masked-mouth face images from the continuous looping sequence, to the DNR for synthesizing face images with correct mouth movement that will be in sync with the synthesized audio. The DNR inference is performed individually on each per-frame masked-mouth face image and the corresponding latent vector in the predicted sequence from the LipSyncRNN.
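The rate conversions in that inference path are straightforward. A minimal sketch, assuming the synthesized speech arrives as a NumPy array at 22050 Hz and that `lipsync_rnn` maps 100 Hz MFCC features to one 128-D latent vector per feature frame (an assumed callable interface):

```python
import librosa
import numpy as np

def latents_for_video(audio_22k: np.ndarray, lipsync_rnn, n_mfcc: int = 13):
    """Downsample synthesized speech to 16 kHz, predict 100 Hz latent vectors,
    then keep every 4th latent to match the 25 fps video frame rate."""
    audio_16k = librosa.resample(audio_22k, orig_sr=22050, target_sr=16000)
    mfcc = librosa.feature.mfcc(y=audio_16k, sr=16000, n_mfcc=n_mfcc,
                                hop_length=160).T          # 100 Hz features
    latents_100hz = lipsync_rnn(mfcc)                      # (T, 128), assumed API
    return latents_100hz[::4]                              # 100 Hz -> 25 fps
```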
[0027] The synthesized face images are blended onto the body-and-head base frame images corresponding to their DNR input sequence number. Each synthesized face image is placed onto the body-and-head base frame image using its nose tip landmark point as the coordinate point and blended in place with the frame image's body and head. An alpha mask is used to place any desired background image behind the blended frame image. Finally, all the blended frame images are rendered into a single output video, along with the synthesized speech audio, by the method.
[0028] The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure.
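A minimal sketch of the per-frame compositing described in paragraph [0027], using plain NumPy alpha blending; the array shapes, the centring of the face patch on its own nose tip, and the mask conventions are all assumptions for illustration.

```python
import numpy as np

def composite_frame(base_frame: np.ndarray, face: np.ndarray,
                    face_alpha: np.ndarray, nose_tip_xy: tuple,
                    person_alpha: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Blend a synthesized face onto the body-and-head base frame at the
    nose-tip landmark, then composite the whole person over a background.

    Assumed conventions: base_frame and background are (H, W, 3) float arrays;
    person_alpha is (H, W, 1) with 1 on the person and 0 elsewhere; face and
    face_alpha are patches centred on the synthesized face's own nose tip.
    """
    out = base_frame.astype(np.float32)
    h, w = face.shape[:2]
    x, y = nose_tip_xy
    y0, x0 = y - h // 2, x - w // 2              # align the two nose-tip points
    region = out[y0:y0 + h, x0:x0 + w]
    out[y0:y0 + h, x0:x0 + w] = face_alpha * face + (1.0 - face_alpha) * region
    # The alpha mask places the blended person over any desired background.
    return person_alpha * out + (1.0 - person_alpha) * background
```

The blended frames would then be written to a video container and muxed with the synthesized speech track, for example with OpenCV's VideoWriter and an external tool such as ffmpeg; the patent does not name a particular encoder.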

Claims

CLAIMS:
1. A method of producing audio-video response to an input (102), the method characterizing: receiving at a terminal device (104), one or more inputs (102) to be responded; processing the input on the basis of one or more response parameters, to generate a response text to the input (102); and converting the response text into a response video via a video generation module; wherein the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
2. The method as claimed in claim 1, wherein the one or more response parameter is selected from but not limited to a user's data, application details, client information, product information, customer service information, and conversational data.
3. The method as claimed in claim 1, wherein the terminal device (104) comprises a chatbot adapted to provide a video response to the input.
4. The method as claimed in claim 1, wherein the plurality of individual audio frames (110) is generated from the response text using a text to audio generation module (108) selected from, without limitation, a Forward Transformer TTS Model and a HiFi-GAN model.
5. The method as claimed in claim 1, wherein the plurality of facial lip-sync video frames (204) is generated for each of the audio frames using an audio to video synthesis module (202) selected from, without limitation, a Lip-sync RNN, Deep Neural Renderer UNet model.
6. The method as claimed in claim 5, wherein the facial lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, neck in an input face image's cropped/blacked out region according to the corresponding audio frame.
7. The method as claimed in claim 1, wherein the input is audio, text, video, gestures, or any combination thereof.
8. A computerized system for producing audio-video response to one or more inputs (102), the system characterizing: a terminal device (104) adapted to receive one or more inputs (102) to be responded; a response generation module (106) adapted to process the input (102) on the basis of one or more response parameters, to generate a response text to the input (102); and a video generation module adapted to convert the response text into a response video; characterized in that, the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
9. The system as claimed in claim 8, wherein the terminal device (104) is a chatbot adapted to provide a video response to the input.
10. The system as claimed in claim 8, wherein the video generation module comprises a text to audio generation module (108) adapted to generate the plurality of individual audio frames (110) from the response text, the text to audio generation module (108) selected from but not limited to a Forward Transformer TTS Model and a HiFi-GAN Model.
11. The system as claimed in claim 8, wherein the video generation module comprises an audio-to-video synthesis module (202) adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110), the audio-to-video synthesis module selected from but not limited to a Lip-sync RNN, Deep Neural Renderer UNet Model.
12. The system as claimed in claim 8, wherein the input is audio, text, video, gestures, or any combination thereof.
PCT/IN2022/050662 2021-07-23 2022-07-23 System and method for producing audio-video response to an input WO2023002511A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202141033107 2021-07-23
IN202141033107 2021-07-23

Publications (1)

Publication Number Publication Date
WO2023002511A1 true WO2023002511A1 (en) 2023-01-26

Family

ID=84978827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2022/050662 WO2023002511A1 (en) 2021-07-23 2022-07-23 System and method for producing audio-video response to an input

Country Status (1)

Country Link
WO (1) WO2023002511A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628635B1 (en) * 2017-03-29 2020-04-21 Valyant AI, Inc. Artificially intelligent hologram

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
STAMATESCU LUCA: "Deepfake Virtual Assistant ", 18 May 2021 (2021-05-18), XP093027653, Retrieved from the Internet <URL:http://lucastamatescu.com/deepfake-virtual-assistant/> [retrieved on 20230228] *


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22845587

Country of ref document: EP

Kind code of ref document: A1