WO2023002511A1 - System and method for producing audio-video response to an input - Google Patents
- Publication number
- WO2023002511A1 (PCT/IN2022/050662)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- response
- audio
- input
- text
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Abstract
Disclosed are systems and methods for producing audio-video responses to an input (102). The system comprises a terminal device (104), a response generation module (106), and a video generation module. The terminal device (104) is configured to receive one or more inputs (102) to be responded to, and the response generation module (106) is configured to process the input and generate a response text accordingly. The video generation module is configured to convert the generated response text into a response video. The response video comprises a plurality of individual audio frames (110) synthesized from the input (102), encoded with a plurality of corresponding facial lip-synced video frames (204). The plurality of individual audio frames (110) and the plurality of corresponding facial lip-synced video frames (204) are combined to form a single video output.
Description
SYSTEM AND METHOD FOR PRODUCING AUDIO-VIDEO RESPONSE TO AN INPUT
FIELD OF THE INVENTION
[0001] The present invention provides a system and method for producing audio-video responses to input using one or more deep learning modules.
BACKGROUND OF THE INVENTION
[0002] The present invention is related to the field of deepfake technology. A deepfake is media, which may be an image, video, and/or audio, that has been generated and/or modified using artificial intelligence. In some examples, a deepfake creator may combine and/or superimpose existing images and/or video onto a source image and/or video to generate the deepfake. As artificial intelligence techniques such as, for example, neural networks, deep learning, and machine learning advance, deepfake media has become increasingly realistic.
[0003] These deepfake videos can be a powerful means of conveying information to people, because they can make it appear as if a human is delivering the message; in reality, the human is completely computer generated, typically created by an AI model that has been fed the human's likeness.
[0004] Moreover, deepfakes can have applications in healthcare and beyond. For example, at an ATM, a bank can use the opportunity to speak to the customer about other financial products while the customer is interacting with the machine. Another possible application is kiosks at hospitals that show patients their own doctor or nurse, allowing the patient to communicate with a known face rather than a machine. Multiple papers have shown that this improves compliance among the elderly regarding health, medication, and therapy.
[0005] Other prior art, such as US10628635B1 and US9721373B2, describes the creation of a virtual character whose lip movements and gestures are controlled by computer-generated instructions; however, that virtual character is built using computer-aided design rather than deepfake technology. Because a deepfake is the manipulation of a real image/video/audio, it inherently sounds and looks realistic, unlike 3D models, whose realism depends on the virtual character's ability to cross the uncanny valley, i.e. the hypothesized relationship between the degree of an object's resemblance to a human being and the emotional response to such an object.
SUMMARY OF THE INVENTION
[0006] The present invention provides a method of producing an audio-video response to a text input. The method comprises receiving, at a terminal device, one or more inputs to be responded to; processing the input based on one or more response parameters to generate a response text to the input; and converting the response text into a response video via a video generation module. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames and combined into a single video output.
[0007] In one exemplary embodiment, the one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data.
[0008] In another exemplary embodiment, the terminal device comprises a chatbot adapted to provide a video response to the input.
[0009] In another exemplary embodiment, the plurality of individual audio frames is generated from the response text using a text-to-audio generation module selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The plurality of facial lip-sync video frames is generated for each of the audio frames using an audio-to-video synthesis module selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region, according to the corresponding audio frame.
[0010] In one embodiment, a computerized system for producing an audio-video response to an input is provided. The system comprises a terminal device adapted to receive one or more inputs to be responded to, a response generation module adapted to process the input based on one or more response parameters and generate a response text to the input, and a video generation module adapted to convert the response text into a response video. The response video comprises a plurality of individual audio frames synthesized from the input, encoded with a plurality of corresponding facial lip-synced video frames and combined into a single video output.
[0011] In another exemplary embodiment, the terminal device is a chatbot adapted to provide a video response to the input.
[0012] In another exemplary embodiment, the video generation module comprises a text-to-audio generation module adapted to generate the plurality of individual audio frames from the response text, the text-to-audio generation module being selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model. The video generation module further comprises an audio-to-video synthesis module adapted to generate facial lip-sync video frames corresponding to each of the plurality of audio frames, the audio-to-video synthesis module being selected from, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention is further described in the detailed description that follows, by reference to the noted drawings by way of illustrative embodiments of the invention, in which like reference numerals represent similar parts throughout the drawings. The invention is not limited to the precise arrangements and illustrative examples shown in the drawings:
[0014] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames, in accordance with an exemplary embodiment of the present invention.
[0015] Figure 2 illustrates a block diagram of the audio-to-video synthesis module of the system, in accordance with an exemplary embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0016] With reference to the figures provided, embodiments of the present invention are now described in detail. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures, devices, activities, and methods are shown using schematics, use cases, and/or flow diagrams to avoid obscuring the invention. Although the following description contains many specifics for the purposes of illustration, one skilled in the art will appreciate that many variations and/or alterations to the suggested details are within the scope of the present invention. Similarly, although many of the features of the present invention are described in terms of each other, or in conjunction with each other, one skilled in the art will appreciate that many of these features can be provided independently of other features. Accordingly, this description of the invention is set forth without any loss of generality to, and without imposing limitations upon, the invention.
[0017] The present invention provides a system and method for producing audio-video responses to an input using one or more deep learning modules. The system comprises a chatbot adapted to receive the input and generate a response text based on one or more response parameters. The response text is converted into a plurality of audio frames. The plurality of audio frames is fed to an audio-to-video synthesis module to generate facial lip-synced video frames corresponding to each audio frame.
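For illustration only, the following minimal Python sketch shows how the described stages could be chained; the module interfaces (`responder`, `text_to_audio`, `audio_to_video`) are hypothetical placeholders and are not part of this disclosure.

```python
# Minimal orchestration sketch of the pipeline described above. The three
# module objects are hypothetical stand-ins for the chatbot/response module,
# the text-to-audio module, and the audio-to-video synthesis module.

def produce_video_response(user_input: str, responder, text_to_audio, audio_to_video) -> str:
    """Return the path of a lip-synced video that answers the user's input."""
    response_text = responder.generate(user_input)      # response text from response parameters
    audio = text_to_audio.synthesize(response_text)     # synthesized speech audio
    video_path = audio_to_video.render(audio)           # lip-synced talking-head video
    return video_path
```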
[0018] Figure 1 illustrates a block diagram of a system for generating synthesized audio frames (110), in accordance with an exemplary embodiment of the present invention. The system comprises a terminal device (104), a response generation module (106), and a text-to-audio generation module (108). The terminal device (104) receives one or more inputs to be responded to. The input is audio, text, video, gestures, or any combination thereof. In one example, the terminal device comprises a chatbot adapted to provide a video response to the input. In one embodiment, the chatbot is trained using either an intent-based method or a conversational NLP-based method. The response generation module (106) may be configured to process the input based on the one or more response parameters to generate a response text. The one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data. The text-to-audio generation module (108) may be configured to generate a plurality of audio frames (110) synthesized from the response text. The text-to-audio generation module (108) is selected from, but not limited to, a Forward Transformer TTS model and a HiFi-GAN model.
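As a hedged illustration of the intent-based option mentioned above, the sketch below matches an input against a small intent table and fills in user data from the response parameters; the intent names, keywords, and parameter names are assumptions, not taken from this disclosure.

```python
# Hypothetical intent-based response text generation. Intent names, keywords,
# and the "balance" parameter are illustrative assumptions only.

INTENT_RESPONSES = {
    "account_balance": "Your current account balance is {balance}.",
    "card_block": "Your card has been blocked and a replacement has been ordered.",
    "fallback": "I'm sorry, could you please rephrase that?",
}

def generate_response_text(user_text: str, response_params: dict) -> str:
    """Pick a canned response by keyword intent and fill in user-specific data."""
    text = user_text.lower()
    if "balance" in text:
        return INTENT_RESPONSES["account_balance"].format(
            balance=response_params.get("balance", "unavailable"))
    if "card" in text and "block" in text:
        return INTENT_RESPONSES["card_block"]
    return INTENT_RESPONSES["fallback"]
```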
[0019] In one embodiment, the response text is fed to the Forward Transformer TTS model, which predicts the Mel spectrogram features; the predicted Mel spectrogram is in turn fed to the HiFi-GAN model, which generates the desired audio frame.
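A minimal sketch of this two-stage inference is shown below, assuming pre-trained PyTorch modules for the acoustic model and vocoder; the call signatures, tensor shapes, and phoneme tokenization are assumptions rather than details prescribed by the disclosure.

```python
# Sketch of the two-stage text-to-audio step: acoustic model -> mel spectrogram
# -> neural vocoder -> waveform. `acoustic_model` and `vocoder` are assumed to
# be already-trained torch modules with these (hypothetical) call signatures.

import torch

@torch.no_grad()
def text_to_waveform(phoneme_ids: torch.LongTensor,
                     acoustic_model: torch.nn.Module,
                     vocoder: torch.nn.Module) -> torch.Tensor:
    """phoneme_ids: (1, T) tensor of phoneme token indices."""
    mel = acoustic_model(phoneme_ids)   # (1, n_mels, frames) predicted mel spectrogram
    waveform = vocoder(mel)             # (1, samples) speech audio at 22050 Hz
    return waveform.squeeze(0)
```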
[0020] The Forward Transformer TTS model is trained on a dataset consisting of phoneme texts as input and Mel spectrogram features and duration prediction features as output pairs of a single person's voice. The Mel spectrogram features are calculated with a Short-Term Fourier Transform using hop length 275 and frame length 1100 on 22050 Hz audio, and English text is converted to phoneme character texts. After training, the Forward Transformer TTS model is configured to predict the correct Mel spectrogram features for the input phoneme text. The HiFi-GAN model is trained on Mel spectrogram features as input and audio signals as output pairs of a single person's voice, where the Mel spectrogram features are calculated in the same way as for the Forward Transformer TTS training. After training, the HiFi-GAN model is able to generate 22050 Hz audio signals from input Mel spectrogram features.
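The mel feature extraction stated above (22050 Hz audio, hop length 275, frame length 1100) could be computed as in the sketch below; the use of librosa and the choice of 80 mel bands are assumptions for illustration.

```python
# Mel-spectrogram features with the STFT parameters stated in the description.
# librosa and n_mels=80 are illustrative choices, not specified by the patent.

import librosa
import numpy as np

def mel_features(wav_path: str) -> np.ndarray:
    audio, sr = librosa.load(wav_path, sr=22050)   # load and resample to 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1100, win_length=1100, hop_length=275, n_mels=80)
    return librosa.power_to_db(mel)                # (80, frames) log-mel features
```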
[0021] Figure 2 illustrates a block diagram of an audio-to-video synthesis module (202) of the system, in accordance with an exemplary embodiment of the present invention. The audio-to-video synthesis module (202) is adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110). The audio-to-video synthesis module (202) is selected from the group consisting of, but not limited to, a Lip-sync RNN and a Deep Neural Renderer UNet model. The lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked-out region, according to the corresponding audio frame.
[0022] In one exemplary embodiment, the audio frames are fed to the LipSync RNN, which predicts the lip-synced output encodings. These encodings, together with the input face image with the mouth, chin, and neck cropped/blacked out, are fed to the Deep Neural Renderer UNet model, which synthesizes/renders the lip-synced mouth, chin, and neck in the input face image's cropped/blacked-out region. The required number of frames for the complete video is synthesized/rendered sequentially, depending on the input audio duration. The individual synthesized/rendered frames are then stitched into a complete video along with the input video.
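A hedged sketch of this sequential rendering and stitching is given below; the model call signatures and the use of OpenCV plus an ffmpeg mux step are assumptions, not the disclosed implementation.

```python
# Per-frame rendering followed by stitching the frames and the speech audio
# into one video. `lipsync_rnn` and `renderer` are hypothetical callables.

import subprocess
import cv2

def render_and_stitch(audio_features, masked_face, lipsync_rnn, renderer,
                      audio_path="speech.wav", out_path="response.mp4", fps=25):
    encodings = lipsync_rnn(audio_features)        # one lip-sync encoding per frame
    h, w = masked_face.shape[:2]
    writer = cv2.VideoWriter("frames.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for enc in encodings:                          # frame count follows the audio duration
        frame = renderer(masked_face, enc)         # fill mouth/chin/neck region (uint8 BGR)
        writer.write(frame)
    writer.release()
    # mux the rendered frames with the synthesized speech audio
    subprocess.run(["ffmpeg", "-y", "-i", "frames.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", out_path], check=True)
    return out_path
```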
[0023] In another exemplary embodiment, the Adversarial Variational AutoEncoder (AVAE) model is trained on frontal face images of the desired person as both the input and the output of the model. The encoder part of the AutoEncoder learns a 128-dimensional latent space vector that maps to each of the frontal face images, corresponding to different facial features along with the lip movements; likewise, the decoder part of the AutoEncoder learns to reconstruct the original input frontal face image from the 128-dimensional latent vector alone.
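A minimal PyTorch sketch of an autoencoder with a 128-dimensional latent space of this kind is shown below; the convolutional layout, the 256x256 image size, and the omission of the adversarial/variational training terms are simplifying assumptions.

```python
# Encoder-decoder with a 128-D latent bottleneck, mirroring the latent-space
# idea above. Architecture details and image size are illustrative assumptions.

import torch
import torch.nn as nn

class FaceAutoEncoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(                              # 3x256x256 -> 128-D
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 32 * 32, latent_dim),
        )
        self.decoder = nn.Sequential(                              # 128-D -> 3x256x256
            nn.Linear(latent_dim, 128 * 32 * 32),
            nn.Unflatten(1, (128, 32, 32)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)          # 128-D latent vector per face image
        return self.decoder(z), z    # reconstructed face and its latent
```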
[0024] In another exemplary embodiment, the LipSync RNN model is trained to predict lip-synced face encodings / latent vectors with respect to the input audio signal. It is trained on 100 Hz MFCC features, calculated from the 16000 Hz re-sampled audio of a video, as input and the 128-dimensional latent vector for each frame of the same video at 25 FPS as output pairs; the latent vector sequence is upsampled to 100 Hz to match the input MFCC feature frequency. After training, the LipSync RNN learns to map the input audio MFCC features to the 128-D latent vector corresponding to a facial expression/lip movement learned by the AVAE. During video synthesis, the 100 Hz predicted 128-D latent vector sequence from the LipSync RNN is downsampled to 25 Hz to match the video frame rate.
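The rate arithmetic above (16000 Hz audio, 100 Hz MFCC frames, 25 fps latents upsampled by a factor of 4) could be realized as in the sketch below; the number of MFCC coefficients is an assumption.

```python
# 100 Hz MFCC features from 16 kHz audio (hop of 160 samples = 100 frames/s),
# and 25 fps latents repeated 4x to align with them. n_mfcc=13 is assumed.

import librosa
import numpy as np

def mfcc_100hz(wav_path: str) -> np.ndarray:
    audio, sr = librosa.load(wav_path, sr=16000)                       # re-sample to 16 kHz
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)
    return mfcc.T                                                      # (frames @ 100 Hz, 13)

def upsample_latents_25fps_to_100hz(latents_25fps: np.ndarray) -> np.ndarray:
    """Repeat each 128-D latent 4 times so one latent aligns with each MFCC frame."""
    return np.repeat(latents_25fps, 4, axis=0)
```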
[0025] In another exemplary embodiment, the Deep Neural Renderer (DNR) UNet model is trained as a GAN model with two discriminators (i.e., spatial and temporal), taking as inputs a face image with the mouth, chin, and neck cropped/blacked out along with the same face's frontal face image latent vector generated from the AVAE, and producing the complete whole face image as output. The Deep Neural Renderer is trained in a temporally recurrent generation scheme, in which the input to the model is the current input pair (mouth-cropped face image + latent vector) and the T-1 model output (the T-1 completely generated face and its corresponding latent vector from the AVAE). After training, the model learns to generate the mouth, lips, and neck regions in the cropped/masked-out regions of the input face image, depending on the latent vector that corresponds to a specific mouth movement/expression.
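A hedged sketch of such a recurrent inference loop is shown below, where each step conditions on the previous step's generated face and latent; the call signatures are assumptions.

```python
# Temporally recurrent generation: each frame is rendered from the current
# masked face and latent plus the previous frame's output. Interfaces assumed.

import torch

@torch.no_grad()
def recurrent_render(masked_faces, latents, renderer, encoder):
    """masked_faces, latents: per-frame sequences; returns the rendered faces."""
    prev_face, prev_latent = None, None
    rendered = []
    for masked, latent in zip(masked_faces, latents):
        face = renderer(masked, latent, prev_face, prev_latent)   # fill the masked region
        prev_face, prev_latent = face, encoder(face)              # fed back at the next step
        rendered.append(face)
    return rendered
```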
[0026] In another exemplary embodiment, an input is fed to the method, which in turn synthesizes speech audio at 22 kHz, speaking the words with correct pronunciation and appropriate pauses according to the given input text. The synthesized speech audio is downsampled to 16 kHz and fed as input to the LipSyncRNN, which predicts the latent vectors at a higher sample rate matching the input audio rather than the conventional video frame rate of 25 fps; these latent vectors correspond to the subject person's mouth movement, in sync with the input audio. The predicted latent vector sequence is downsampled to match the conventional video frame rate of 25 fps and fed as conditional input, along with masked-mouth face images from the continuous looping sequence, to the DNR for synthesizing face images with correct mouth movement in sync with the synthesized audio. The DNR inference is performed individually on each per-frame masked-mouth face image and the corresponding individual latent vector in the sequence predicted by the LipSyncRNN.
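The rate conversions in this inference path (22 kHz speech resampled to 16 kHz, 100 Hz latents thinned to 25 fps by keeping every fourth vector) could look like the sketch below; the library choices are assumptions.

```python
# Sample-rate and latent-rate conversion for the inference path described above.

import librosa
import numpy as np

def resample_speech_for_lipsync(speech_22k: np.ndarray) -> np.ndarray:
    """Resample synthesized 22050 Hz speech to 16 kHz for the lip-sync model."""
    return librosa.resample(speech_22k, orig_sr=22050, target_sr=16000)

def latents_100hz_to_25fps(latents_100hz: np.ndarray) -> np.ndarray:
    """Keep every 4th latent vector: 100 Hz -> 25 vectors per second."""
    return latents_100hz[::4]
```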
[0027] The synthesized face images are blended onto the body and head base frame images corresponding to their DNR input sequence numbers. Each synthesized face image is placed onto the body and head base frame image using its nose tip landmark point as the coordinate point and blended in place with the frame image's body and head. An alpha mask is used to place any desired background image into the blended frame image. Finally, all the blended frame images are rendered into a single output video along with the synthesized speech audio.
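A simplified sketch of this compositing step is given below; the nose-tip-centered placement, hard paste, and single alpha blend are illustrative assumptions rather than the disclosed blending method.

```python
# Paste the synthesized face onto the base body/head frame around the nose-tip
# landmark, then use an alpha mask to composite a chosen background.

import numpy as np

def composite_frame(face: np.ndarray, base_frame: np.ndarray, nose_xy: tuple,
                    alpha_mask: np.ndarray, background: np.ndarray) -> np.ndarray:
    """face: HxWx3 uint8; nose_xy: nose-tip coordinates in base_frame pixels."""
    fh, fw = face.shape[:2]
    x0, y0 = nose_xy[0] - fw // 2, nose_xy[1] - fh // 2   # top-left so the nose tip aligns
    frame = base_frame.copy()
    frame[y0:y0 + fh, x0:x0 + fw] = face                  # place the synthesized face
    alpha = alpha_mask[..., None].astype(np.float32)      # 1 = keep frame, 0 = background
    return (alpha * frame + (1.0 - alpha) * background).astype(np.uint8)
```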
[0028] The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure.
Claims
1. A method of producing audio-video response to an input (102), the method characterizing: receiving at a terminal device (104), one or more inputs (102) to be responded; processing the input on the basis of one or more response parameters, to generate a response text to the input (102); and converting the response text into a response video via a video generation module; wherein the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
2. The method as claimed in claim 1, wherein the one or more response parameters are selected from, but not limited to, a user's data, application details, client information, product information, customer service information, and conversational data.
3. The method as claimed in claim 1, wherein the terminal device (104) comprises a chatbot adapted to provide a video response to the input.
4. The method as claimed in claim 1, wherein the plurality of individual audio frames (110) is generated from the response text using a text to audio generation module (108) selected from, without limitation, a Forward Transformer TTS Model and a HiFi-GAN model.
5. The method as claimed in claim 1, wherein the plurality of facial lip-sync video frames (204) is generated for each of the audio frames using an audio to video synthesis module (202) selected from, without limitation, a Lip-sync RNN and a Deep Neural Renderer UNet model.
6. The method as claimed in claim 5, wherein the facial lip-sync video frame (204) comprises a face image rendered with the lip-synced mouth, chin, and neck in an input face image's cropped/blacked out region according to the corresponding audio frame.
7. The method as claimed in claim 1, wherein the input is audio, text, video, gestures, or any combination thereof.
8. A computerized system for producing audio-video response to one or more inputs (102), the system characterizing: a terminal device (104) adapted to receive one or more inputs (102) to be responded; a response generation module (106) adapted to process the input (102) on the basis of one or more response parameters, to generate a response text to the input (102); and a video generation module adapted to convert the response text into a response video; characterized in that, the response video comprising a plurality of individual audio frames synthesized (110) from the input encoded with a plurality of corresponding facial lip-synced video frames (204), combined together as a single video output.
9. The system as claimed in claim 8, wherein the terminal device (104) is a chatbot adapted to provide a video response to the input.
10. The system as claimed in claim 8, wherein the video generation module comprises a text to audio generation module (108) adapted to generate the plurality of individual audio frames (110) from the response text, the text to audio generation module (108) selected from but not limited to a Forward Transformer TTS Model and a HiFi-GAN Model.
11. The system as claimed in claim 8, wherein the video generation module comprises an audio-to-video synthesis module (202) adapted to generate facial lip-sync video frames (204) corresponding to each of the plurality of audio frames (110), the audio-to-video synthesis module selected from but not limited to a Lip-sync RNN, Deep Neural Renderer UNet Model.
12. The system as claimed in claim 8, wherein the input is audio, text, video, gestures, or any combination thereof.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
IN202141033107 | 2021-07-23 | |
Publications (1)
Publication Number | Publication Date
---|---
WO2023002511A1 (en) | 2023-01-26
Family
ID=84978827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/IN2022/050662 (WO2023002511A1, en) | System and method for producing audio-video response to an input | 2021-07-23 | 2022-07-23
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023002511A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10628635B1 (en) * | 2017-03-29 | 2020-04-21 | Valyant AI, Inc. | Artificially intelligent hologram |
- 2022-07-23: WO application PCT/IN2022/050662 filed as WO2023002511A1 (en), active Application Filing
Non-Patent Citations (1)
Title |
---|
STAMATESCU LUCA: "Deepfake Virtual Assistant ", 18 May 2021 (2021-05-18), XP093027653, Retrieved from the Internet <URL:http://lucastamatescu.com/deepfake-virtual-assistant/> [retrieved on 20230228] * |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 22845587; Country of ref document: EP; Kind code of ref document: A1