WO2023002511A1 - System and method for producing audio-video response to an input - Google Patents

System and method for producing audio-video response to an input

Info

Publication number
WO2023002511A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
response
audio
input
text
Prior art date
2021-07-23
Application number
PCT/IN2022/050662
Other languages
French (fr)
Inventor
Bhairav SHANKAR
Buvaneash D
Original Assignee
Avantari Technologies Private Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-23
Filing date
2022-07-23
Publication date
2023-01-26
Application filed by Avantari Technologies Private Limited
Publication of WO2023002511A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0475 - Generative networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/094 - Adversarial learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G10L2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Disclosed are systems and methods for producing audio-video responses to an input (102). The system comprises a terminal device (104), a response generation module (106), and a video generation module. The terminal device (104) is configured to receive one or more inputs (102) to be responded to, and the response generation module (106) is configured to process this input and generate a response text accordingly. The video generation module is configured to convert the generated response text into a response video. The response video comprises a plurality of individual audio frames (110) synthesized from the input (102), encoded with a plurality of corresponding facial lip-synced video frames (204). The plurality of individual audio frames (110) and the plurality of corresponding facial lip-synced video frames (204) are combined to form a single video output.

Description

SYSTEM AND METHOD FOR PRODUCING AUDIO-VIDEO RESPONSE TO AN
INPUT
FIELD OF THE INVENTION
[0001] The present invention provides a system and method for producing audio-video responses to input using one or more deep learning modules.
BACKGROUND OF THE INVENTION
[0002] The present invention is related to the field of deepfake technology. A deepfake is media, which may be an image, video, and/or audio, that was generated and/or modified using artificial intelligence. In some examples, a deepfake creator may combine and/or superimpose existing images and/or video onto a source image and/or video to generate the deepfake. As artificial intelligence techniques such as neural networks, deep learning, and machine learning advance, deepfake media has become increasingly realistic.
[0003] Deepfake videos can be a powerful means of conveying information to people, as they make it appear that a human is delivering the message. The human, however, is entirely computer generated, typically by an AI model that has been fed the person's likeness.
[0004] Moreover, deepfake technology has applications in healthcare and beyond. For example, at an ATM, a bank can use the opportunity to speak to the customer about other financial products while the customer is interacting with the machine. Possible applications include kiosks at hospitals where patients can interact with a display showing their doctor or nurse, allowing the patient to communicate with a known face rather than a machine. Multiple papers have shown that this improves compliance by the elderly regarding health, medication, and therapy.
[0005] Other prior art, such as US10628635B1 and US9721373B2, describes the creation of a virtual character whose lip movements and gestures are controlled by computer-generated instructions; however, that virtual character is built using computer-aided design, not deepfake technology. Because a deepfake is a manipulation of a real image, video, or audio, it sounds and looks realistic, whereas the realism of a 3D model depends on the virtual character's ability to cross the uncanny valley, i.e. the hypothesized relationship between the degree of an object's resemblance to a human being and the emotional response to such an object.
SUMMARY OF THE INVENTION
[0006] The present invention provides a method of producing an audio-video response to a text input. The method is characterized by receiving, at a terminal device, one or more inputs to be responded to; processing the input based on one or more response parameters to generate a response text to the input; and converting the response text into a response video via a video generation module. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames, combined as a single video output.
[0007] In one exemplary embodiment, the one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data.
[0008] In another exemplary embodiment, the terminal comprises a chatbot adapted to provide a video response to the input.
[0009] In another exemplary embodiment, the plurality of individual audio frames is generated from the response text using a text-to-audio generation module selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The plurality of facial lip-sync video frames is generated for each of the audio frames using an audio-to-video synthesis module selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region according to the corresponding audio frame.
[0010] In one embodiment, a computerized system for producing an audio-video response to an input is provided. The system comprises a terminal device adapted to receive one or more inputs to be responded to, a response generation module adapted to process the input based on one or more response parameters to generate a response text to the input, and a video generation module adapted to convert the response text into a response video. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames, combined as a single video output.
[0011] In another exemplary embodiment, the terminal device is a chatbot adapted to provide a video response to the input.
[0012] In another exemplary embodiment, the video generation module comprises a text-to-audio generation module adapted to generate the plurality of individual audio frames from the response text, the text-to-audio generation module being selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The video generation module further comprises an audio-to-video synthesis module adapted to generate facial lip-sync video frames corresponding to each of the plurality of audio frames, the audio-to-video synthesis module being selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention is further described in the detailed description that follows, by reference to the noted drawings by way of illustrative embodiments of the invention, in which like reference numerals represent similar parts throughout the drawings. The invention is not limited to the precise arrangements and illustrative examples shown in the drawings:
[0014] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames, in accordance with an exemplary embodiment of the present invention.
[0015] Figure 2 illustrates a block diagram of the audio-to-video synthesis module of the system, in accordance with an exemplary embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0016] With reference to the figures provided, embodiments of the present invention are now described in detail. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures, devices, activities, and methods are shown using schematics, use cases, and/or flow diagrams to avoid obscuring the invention. Although the following description contains many specifics for the purposes of illustration, anyone skilled in the art will appreciate that many variations and/or alterations to the suggested details are within the scope of the present invention. Similarly, although many of the features of the present invention are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the invention is set forth without any loss of generality to, and without imposing limitations upon, the invention.
[0017] The present invention provides a system and method for producing audio-video responses to input using one or more deep learning modules. The system comprises a chatbot adapted to receive input and generate response text based on one or more response parameters. The response text is converted into a plurality of audio frames. The plurality of audio frames is fed to an audio-to-video synthesis module to generate facial lip-synced video frames corresponding to each audio frame.
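For orientation, the following minimal Python sketch shows how the modules described above could be chained at inference time; the `chatbot`, `tts`, and `audio_to_video` objects and their method names are hypothetical placeholders, not interfaces defined by the patent.

```python
def respond_with_video(user_input: str, chatbot, tts, audio_to_video) -> str:
    """End-to-end flow described above: input -> response text -> speech audio
    -> lip-synced video. All three component interfaces are assumed."""
    response_text = chatbot.respond(user_input)        # response generation module
    speech_audio = tts.synthesize(response_text)       # text-to-audio generation module
    video_path = audio_to_video.render(speech_audio)   # audio-to-video synthesis module
    return video_path
```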
[0018] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames (110), in accordance with an exemplary embodiment of the present invention. The system comprises a terminal device (104), a response generation module (106), and a text-to-audio generation module (108). The terminal device (104) receives one or more inputs to be responded to. The input is audio, text, video, gestures, or any combination thereof. In one example, the terminal device comprises a chatbot adapted to provide a video response to the input. In one embodiment, the chatbot is trained using either an intent-based method or a conversational NLP-based method. The response generation module (106) may be configured to process the input based on the one or more response parameters to generate a response text. The one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data. The text-to-audio generation module (108) may be configured to generate a plurality of audio frames (110) synthesized from the response text. The text-to-audio generation module (108) is selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model.
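As a concrete illustration of the response-generation step for an intent-based chatbot, the sketch below fills a response template with response parameters such as user data or product information; the intent labels, templates, and field names are hypothetical and not taken from the patent.

```python
# Minimal sketch of an intent-based response-generation step (hypothetical
# intents, templates, and parameter names; the patent does not fix these).
RESPONSE_TEMPLATES = {
    "account_balance": "Hello {user_name}, your balance is {balance}.",
    "product_info": "Our {product_name} offers {product_benefit}.",
    "fallback": "Sorry, could you rephrase that?",
}

def generate_response_text(intent: str, response_params: dict) -> str:
    """Fill the template for the detected intent with response parameters
    (user data, product information, customer service information, etc.)."""
    template = RESPONSE_TEMPLATES.get(intent, RESPONSE_TEMPLATES["fallback"])
    try:
        return template.format(**response_params)
    except KeyError:
        return RESPONSE_TEMPLATES["fallback"]

print(generate_response_text("account_balance",
                             {"user_name": "Asha", "balance": "12,430"}))
```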
[0019] In one embodiment, the response text is fed to the Forward Transformer TTS model, which predicts the mel spectrogram features; the predicted mel spectrogram is in turn fed to the HiFi-GAN model, which generates the desired audio frame.
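A minimal sketch of that two-stage text-to-audio inference, assuming pre-trained `forward_tts` and `hifigan` PyTorch modules with the tensor shapes noted in the comments; the concrete model classes and method signatures are not specified in the patent.

```python
import torch

@torch.no_grad()
def synthesize_speech(phoneme_ids: torch.LongTensor,
                      forward_tts: torch.nn.Module,
                      hifigan: torch.nn.Module) -> torch.Tensor:
    """Text-to-audio stage: phonemes -> mel spectrogram -> 22.05 kHz waveform.

    `forward_tts` is assumed to return a mel spectrogram of shape
    (1, n_mels, frames); `hifigan` is assumed to map that to a waveform of
    shape (1, samples). Both interfaces are assumptions for illustration.
    """
    mel = forward_tts(phoneme_ids.unsqueeze(0))   # duration-aware acoustic model
    waveform = hifigan(mel)                       # neural vocoder
    return waveform.squeeze(0)
```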
[0020] The Forward Transformer TTS model is trained on a dataset consisting of phoneme texts as input and mel spectrogram features and duration prediction features as output pairs, from a single person's voice. The mel spectrogram features are calculated with a Short-Term Fourier Transform using a hop length of 275 and a frame length of 1100 on 22050 Hz audio, and the English text is converted to phoneme character text. After training, the Forward Transformer TTS model predicts the correct mel spectrogram features for the input phoneme text. The HiFi-GAN model is trained on mel spectrogram features as input and audio signals as output pairs, from a single person's voice, where the mel spectrogram features are calculated in the same way as in the Forward Transformer TTS training. After training, the HiFi-GAN model can generate 22050 Hz audio signals from input mel spectrogram features.
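The mel spectrogram targets described above can be reproduced with a standard STFT front end. The sketch below uses librosa with the stated hop length of 275, frame length of 1100, and 22050 Hz audio; the number of mel bands and the log compression are assumptions, since the patent does not state them.

```python
import librosa
import numpy as np

def mel_features(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Mel spectrogram features matching the stated training setup:
    22050 Hz audio, STFT hop length 275, frame length 1100."""
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1100, hop_length=275, n_mels=n_mels)
    return np.log(mel + 1e-5)  # log compression, a common (assumed) choice
```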
[0021] Figure 2 illustrates a block diagram of an audio-to-video synthesis module (202) of the system, in accordance with an exemplary embodiment of the present invention. The audio-to-video synthesis module (202) is adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110). The audio-to-video synthesis module (202) is selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region according to the corresponding audio frame.
[0022] In one exemplary embodiment, the audio frames are fed to the LipSync RNN, which predicts the lip-synced output encodings. These encodings are fed, along with the input face with the mouth, chin, and neck cropped/blacked out, to the Deep Neural Renderer UNet model, which synthesizes/renders the lip-synced mouth, chin, and neck in the input face image's cropped/blacked-out region. The required number of frames for the complete video is synthesized/rendered sequentially depending on the input audio duration. The individual synthesized/rendered frames are then stitched into a complete video along with the input video.
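A minimal sketch of that frame-by-frame rendering loop, assuming pre-trained `lipsync_rnn` and `renderer` (DNR UNet) modules with the interfaces noted in the comments; both interfaces are assumptions for illustration.

```python
import torch

@torch.no_grad()
def render_lipsync_frames(audio_features: torch.Tensor,
                          masked_face: torch.Tensor,
                          lipsync_rnn: torch.nn.Module,
                          renderer: torch.nn.Module) -> list[torch.Tensor]:
    """Audio-to-video stage: audio features -> latent codes -> rendered faces.

    `lipsync_rnn` is assumed to map (1, T, n_features) audio features to
    (1, T, 128) latent codes; `renderer` (the DNR UNet) is assumed to take the
    masked face image plus one latent code and return a full face image.
    """
    latents = lipsync_rnn(audio_features.unsqueeze(0)).squeeze(0)  # (T, 128)
    frames = []
    for z in latents:                                  # one latent per frame
        face = renderer(masked_face.unsqueeze(0), z.unsqueeze(0))
        frames.append(face.squeeze(0))
    return frames
```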
[0023] In another exemplary embodiment, the Adversarial Variational AutoEncoder (AVAE) model is trained on frontal face images of the desired person as both the input and the output of the model. The encoder part of the AutoEncoder learns a 128-dimensional latent-space vector that maps to each of the frontal face images, corresponding to different facial features along with the lip movements; likewise, the decoder part of the AutoEncoder learns to reconstruct the original input frontal face image from the 128-dimensional latent vector alone.
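A minimal PyTorch sketch of such an encoder-decoder with a 128-dimensional latent bottleneck is shown below; the layer sizes and 64x64 input resolution are assumptions, and the adversarial and variational components of the AVAE are omitted for brevity.

```python
import torch
import torch.nn as nn

class FaceAutoEncoder(nn.Module):
    """Toy stand-in for the face autoencoder: frontal face images in, a 128-D
    latent code in the middle, reconstructed face out. Layer sizes and the
    64x64 input resolution are illustrative assumptions; the adversarial and
    variational parts of the AVAE are not shown."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encoder(x)          # 128-D latent code
        return self.decoder(z), z
```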
[0024] In another exemplary embodiment, the LipSync RNN model is trained to predict lip-synced face encodings/latent vectors with respect to the input audio signal. It is trained on 100 Hz MFCC features, calculated from the 16000 Hz re-sampled audio of a video, as input and the 128-dimensional latent vector for each frame of the same video at 25 FPS as output pairs; the latent vectors are upsampled to 100 Hz to match the input MFCC feature frequency. After training, the LipSync RNN learns to map the input audio MFCC features to the 128-D latent vector that corresponds to a facial expression/lip movement learned by the AVAE. During video synthesis, the 100 Hz predicted 128-D latent vectors from the LipSync RNN are downsampled to 25 Hz to match the video frame rate.
[0025] In another exemplary embodiment, the Deep Neural Renderer (DNR) UNet model is trained, as a GAN model with two discriminators (i.e., spatial and temporal), on a face image with the mouth, chin, and neck cropped/blacked out, together with the latent vector of the same face's frontal image generated by the AVAE, as inputs and the complete face image as output. The Deep Neural Renderer is trained in a temporal recurrent generation manner, in which the input to the model is the current input pair (mouth-cropped face image + latent vector) and the model's output at time T-1 (the T-1 completely generated face and its corresponding latent vector from the AVAE). After training, the model learns to generate the mouth, lips, and neck regions in the cropped/masked-out regions of the input face image depending on the latent vector that corresponds to a specific mouth movement/expression.
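The 100 Hz alignment between audio features and 25 fps latent vectors implies a simple 4x relationship (100 / 25 = 4). The sketch below extracts 100 Hz MFCCs from 16 kHz audio (a hop of 160 samples) and repeats each per-frame latent four times to build aligned training pairs; the hop length and MFCC count are assumptions chosen to be consistent with those rates.

```python
import librosa
import numpy as np

def aligned_training_pair(wav_path: str, frame_latents: np.ndarray,
                          n_mfcc: int = 13):
    """Build (MFCC, latent) training pairs for the lip-sync RNN.

    frame_latents: (n_frames, 128) array of AVAE latents at 25 fps.
    MFCCs are computed at 100 Hz (hop of 160 samples at 16 kHz), and the
    25 fps latents are upsampled 4x by repetition to match that rate.
    """
    y, sr = librosa.load(wav_path, sr=16000)              # re-sample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=160).T          # (T_100Hz, n_mfcc)
    latents_100hz = np.repeat(frame_latents, 4, axis=0)    # 25 fps -> 100 Hz
    T = min(len(mfcc), len(latents_100hz))                 # trim to same length
    return mfcc[:T], latents_100hz[:T]
```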
[0026] In another exemplary embodiment, an input text is fed to the method, which in turn synthesizes speech audio at 22 kHz, speaking the words with correct pronunciation and appropriate pauses according to the given input text. The synthesized speech audio is downsampled to 16 kHz and fed as input to the LipSyncRNN, which predicts the latent vectors at a higher sample rate matching the input audio rather than the conventional video frame rate of 25 fps; these latent vectors correspond to the subject person's mouth movements, which are in sync with the input audio. The predicted latent vectors are downsampled to match the conventional video frame rate of 25 fps and fed as conditional input, along with masked-mouth face images from the continuous looping sequence, to the DNR for synthesizing face images with correct mouth movement that will be in sync with the synthesized audio. The DNR inference is performed individually on each per-frame masked-mouth face image and the corresponding latent vector in the predicted sequence from the LipSyncRNN.
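The rate conversions in that inference path are straightforward. A minimal sketch, assuming the synthesized speech arrives as a NumPy array at 22050 Hz and that `lipsync_rnn` maps 100 Hz MFCC features to one 128-D latent vector per feature frame (an assumed callable interface):

```python
import librosa
import numpy as np

def latents_for_video(audio_22k: np.ndarray, lipsync_rnn, n_mfcc: int = 13):
    """Downsample synthesized speech to 16 kHz, predict 100 Hz latent vectors,
    then keep every 4th latent to match the 25 fps video frame rate."""
    audio_16k = librosa.resample(audio_22k, orig_sr=22050, target_sr=16000)
    mfcc = librosa.feature.mfcc(y=audio_16k, sr=16000, n_mfcc=n_mfcc,
                                hop_length=160).T          # 100 Hz features
    latents_100hz = lipsync_rnn(mfcc)                      # (T, 128), assumed API
    return latents_100hz[::4]                              # 100 Hz -> 25 fps
```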
[0027] The synthesized face images are blended onto the body-and-head base frame images corresponding to their DNR input sequence number. Each synthesized face image is placed onto the body-and-head base frame image using its nose tip landmark point as the coordinate point and blended in place with the frame image's body and head. An alpha mask is used to place any desired background image behind the blended frame image. Finally, all the blended frame images are rendered into a single output video, along with the synthesized speech audio, by the method.
[0028] The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure.
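A minimal sketch of the per-frame compositing described in paragraph [0027], using plain NumPy alpha blending; the array shapes, the centring of the face patch on its own nose tip, and the mask conventions are all assumptions for illustration.

```python
import numpy as np

def composite_frame(base_frame: np.ndarray, face: np.ndarray,
                    face_alpha: np.ndarray, nose_tip_xy: tuple,
                    person_alpha: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Blend a synthesized face onto the body-and-head base frame at the
    nose-tip landmark, then composite the whole person over a background.

    Assumed conventions: base_frame and background are (H, W, 3) float arrays;
    person_alpha is (H, W, 1) with 1 on the person and 0 elsewhere; face and
    face_alpha are patches centred on the synthesized face's own nose tip.
    """
    out = base_frame.astype(np.float32)
    h, w = face.shape[:2]
    x, y = nose_tip_xy
    y0, x0 = y - h // 2, x - w // 2              # align the two nose-tip points
    region = out[y0:y0 + h, x0:x0 + w]
    out[y0:y0 + h, x0:x0 + w] = face_alpha * face + (1.0 - face_alpha) * region
    # The alpha mask places the blended person over any desired background.
    return person_alpha * out + (1.0 - person_alpha) * background
```

The blended frames would then be written to a video container and muxed with the synthesized speech track, for example with OpenCV's VideoWriter and an external tool such as ffmpeg; the patent does not name a particular encoder.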

Claims

CLAIMS:
1. A method of producing audio-video response to an input (102), the method characterizing: receiving at a terminal device (104), one or more inputs (102) to be responded; processing the input on the basis of one or more response parameters, to generate a response text to the input (102); and converting the response text into a response video via a video generation module; wherein the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
2. The method as claimed in claim 1, wherein the one or more response parameter is selected from but not limited to a user's data, application details, client information, product information, customer service information, and conversational data.
3. The method as claimed in claim 1, wherein the terminal device (104) comprises a chatbot adapted to provide a video response to the input.
4. The method as claimed in claim 1, wherein the plurality of individual audio frames (110) is generated from the response text using a text to audio generation module (108) selected from, without limitation, a Forward Transformer TTS Model and a HiFi-GAN model.
5. The method as claimed in claim 1, wherein the plurality of facial lip-sync video frames (204) is generated for each of the audio frames using an audio to video synthesis module (202) selected from, without limitation, a Lip-sync RNN, Deep Neural Renderer UNet model.
6. The method as claimed in claim 5, wherein the facial lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, neck in an input face image's cropped/blacked out region according to the corresponding audio frame.
7. The method as claimed in claim 1, wherein the input is audio, text, video, gestures, or any combination thereof.
8. A computerized system for producing audio-video response to one or more inputs (102), the system characterizing: a terminal device (104) adapted to receive one or more inputs (102) to be responded; a response generation module (106) adapted to process the input (102) on the basis of one or more response parameters, to generate a response text to the input (102); and a video generation module adapted to convert the response text into a response video; characterized in that, the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
9. The system as claimed in claim 8, wherein the terminal device (104) is a chatbot adapted to provide a video response to the input.
10. The system as claimed in claim 8, wherein the video generation module comprises a text to audio generation module (108) adapted to generate the plurality of individual audio frames (110) from the response text, the text to audio generation module (108) selected from but not limited to a Forward Transformer TTS Model and a HiFi-GAN Model.
11. The system as claimed in claim 8, wherein the video generation module comprises an audio-to-video synthesis module (202) adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110), the audio-to-video synthesis module selected from but not limited to a Lip-sync RNN, Deep Neural Renderer UNet Model.
12. The system as claimed in claim 8, wherein the input is audio, text, video, gestures, or any combination thereof.
PCT/IN2022/050662 2021-07-23 2022-07-23 System and method for producing audio-video response to an input WO2023002511A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202141033107 2021-07-23
IN202141033107 2021-07-23

Publications (1)

Publication Number Publication Date
WO2023002511A1 true WO2023002511A1 (en) 2023-01-26

Family

ID=84978827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2022/050662 WO2023002511A1 (en) 2021-07-23 2022-07-23 System and method for producing audio-video response to an input

Country Status (1)

Country Link
WO (1) WO2023002511A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628635B1 (en) * 2017-03-29 2020-04-21 Valyant AI, Inc. Artificially intelligent hologram

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
STAMATESCU LUCA: "Deepfake Virtual Assistant ", 18 May 2021 (2021-05-18), XP093027653, Retrieved from the Internet <URL:http://lucastamatescu.com/deepfake-virtual-assistant/> [retrieved on 20230228] *


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22845587

Country of ref document: EP

Kind code of ref document: A1