US20220375224A1 - Device and method for generating speech video along with landmark - Google Patents

Device and method for generating speech video along with landmark

Info

Publication number
US20220375224A1
US20220375224A1 (Application No. US 17/762,926)
Authority
US
United States
Prior art keywords
speech
person
landmark
speech video
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/762,926
Other languages
English (en)
Inventor
Gyeongsu CHAE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deepbrain AI Inc
Original Assignee
Deepbrain AI Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deepbrain AI Inc filed Critical Deepbrain AI Inc
Assigned to DEEPBRAIN AI INC. reassignment DEEPBRAIN AI INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHAE, Gyeongsu
Publication of US20220375224A1 publication Critical patent/US20220375224A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0356Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for synchronising with other signals, e.g. video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/055Time compression or expansion for synchronising with other signals, e.g. video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272Means for inserting a foreground image in a background image, i.e. inlay, outlay

Definitions

  • Embodiments of the present disclosure relate to a speech video generation technology.
  • For example, when there is a voice message to be delivered, it may be desirable to use an artificial intelligence technology to generate a speech video in which the voice message sounds as if it is uttered by a famous person (e.g., a president or the like) so as to attract the attention of people.
  • This is implemented by generating lip shapes or the like suitable for a specific message so that the lip shapes look as if a famous person utters the specific message in a video of the famous person.
  • The correct value of face landmark data in an image is obtained by a person labeling the image while viewing it; each person may apply a different criterion when multiple persons perform this operation, and even when the same person performs it, exactly corresponding points cannot be marked in every image frame, and thus annotation noise is unavoidable.
  • When a face landmark is predicted using a correct value containing such noise and a face image is synthesized using the predicted face landmark, deterioration of image quality, such as image shaking, occurs.
  • There is also a learning model which aligns a face landmark extracted from a speech image in a standard space and predicts a face landmark by using a voice as an input.
  • When a landmark is aligned in an inaccurate manner (e.g., based on an inaccurate estimated value, or with a simplified conversion because three-dimensional movement or rotation cannot be represented in two dimensions), information loss and distortion occur; as a result, lip shapes are not correctly synchronized, and unnecessary shaking or the like occurs.
  • When the reference point is located at a virtual position (e.g., an average position of the entire face landmarks or an average position of the lip part landmarks), it is difficult to control the model so that only the speaking part moves while the head of the corresponding person remains fixed.
  • Disclosed embodiments provide techniques for generating a speech video along with a landmark of the speech video.
  • a speech video generation device is a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors, the speech video generation device including: a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person, and extracts an image feature vector from the person background image; a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video, and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; a first decoder, which reconstructs the speech video of the person using the combined vector as an input; and a second decoder, which predicts a landmark of the speech video using the combined vector as an input.
  • a portion related to a speech of the person may be hidden by a mask in the person background image.
  • the speech audio signal may be an audio part of the same section as the person background image in the speech video of the person.
  • the first decoder may be a machine learning model trained to reconstruct the portion hidden by the mask in the person background image based on the voice feature vector.
  • The combining unit may generate the combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder.
  • the first decoder may receive the combined vector to generate the speech video of the person by reconstructing, based on the speech audio signal not related to the person background image, the portion related to the speech in the person background image
  • the second decoder may predict and output the landmark of the speech video.
  • the second decoder may include: an extraction module trained to extract a feature vector from the input combined vector; and a prediction module trained to predict landmark coordinates of the speech video based on the feature vector extracted by the extraction module.
  • An objective function L_prediction of the second decoder may be expressed as the equation below.
  • L_prediction = ∥K − G(I; θ)∥
  • ∥K − G(I; θ)∥: a function for deriving a difference between the labeled landmark coordinates and the predicted landmark coordinates of the speech video
  • the second decoder may include: an extraction module trained to extract a feature tensor from the input combined vector; and a prediction module trained to predict a landmark image based on the feature tensor extracted by the extraction module.
  • the landmark image may be an image indicating, by a probability value, whether each pixel corresponds to a landmark in an image space corresponding to the speech video.
  • An objective function L_prediction of the second decoder may be expressed as an equation below.
  • a speech video generation device is a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors, the speech video generation device including: a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person, and extracts an image feature vector from the person background image; a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video, and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; a decoder, which uses the combined vector as an input, and performs deconvolution and up-sampling on the combined vector; a first output layer, which is connected to the decoder and outputs a reconstructed speech video of the person based on up-sampled data; and a second output layer, which is connected to the decoder and outputs a predicted landmark of the speech video based on the up-sampled data.
  • a speech video generation method performed by a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors includes: receiving an input of a person background image that is a video part in a speech video of a predetermined person, and extracting an image feature vector from the person background image; receiving an input of a speech audio signal that is an audio part in the speech video, and extracting a voice feature vector from the speech audio signal; generating a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; reconstructing the speech video of the person using the combined vector as an input; and predicting a landmark of the speech video using the combined vector as an input.
  • the image feature vector is extracted from the person background image
  • the voice feature vector is extracted from the speech audio signal
  • the combined vector is generated by combining the image feature vector and the voice feature vector
  • a speech video and a landmark are predicted together based on the combined vector, and thus the speech video and the landmark may be more accurately predicted.
  • The reconstructed (predicted) speech video is trained so as to minimize its difference from the original speech video.
  • The predicted landmark is trained so as to minimize its difference from the labeled landmark extracted from the original speech video.
  • Since learning for simultaneously predicting an actual speech video and a landmark is performed in a state in which the face position of the corresponding person and the landmark position spatially match, a shape change due to an overall face motion and a shape change of the speaking part due to a speech may be learnt separately, without preprocessing for aligning the landmark in a standard space.
  • Since the neural network for reconstructing a speech video and the neural network for predicting a landmark are integrated into one component, a pattern of the motion of the speech-related portion may be shared and learnt, and thus noise may be efficiently eliminated from the predicted landmark.
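  • As an illustrative, non-limiting sketch of this joint training (PyTorch-style; the weighted-sum form and the weight lambda_lm are assumptions, since the embodiments only state that both differences are minimized):

```python
import torch.nn.functional as F

def joint_loss(reconstructed_video, original_video,
               predicted_landmarks, labeled_landmarks, lambda_lm=1.0):
    # Reconstruction term: reconstructed (predicted) speech video vs. original speech video.
    recon_loss = F.l1_loss(reconstructed_video, original_video)
    # Landmark term: predicted landmark vs. labeled landmark extracted from the original video.
    landmark_loss = F.mse_loss(predicted_landmarks, labeled_landmarks)
    # Combining the two terms into a single objective is an illustrative assumption.
    return recon_loss + lambda_lm * landmark_loss
```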
  • FIG. 1 is a diagram illustrating a configuration of a device for generating a speech video along with a landmark according to an embodiment of the present disclosure.
  • FIG. 2 is a diagram illustrating an example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
  • FIG. 3 is a diagram illustrating another example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
  • FIG. 4 is a diagram illustrating a state in which a speech video and a landmark are inferred through a speech video generation device according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure.
  • FIG. 7 is a block diagram illustrating a computing environment that includes a computing device suitable for use in example embodiments.
  • "Transmission" of a signal or information may include both a meaning in which the signal or information is directly transmitted from one element to another element and a meaning in which it is transmitted from one element to another element through an intervening element.
  • "Transmission" or "sending" of the signal or information to one element may indicate a final destination of the signal or information and may not imply a direct destination.
  • The statement that two or more pieces of data or information are "related" indicates that when any one piece of data (or information) is obtained, at least a portion of the other data (or information) may be obtained based thereon.
  • Terms such as "first" and "second" may be used for describing various elements, but the elements should not be construed as being limited by these terms. These terms may be used for distinguishing one element from another element. For example, a first element could be termed a second element and vice versa without departing from the scope of the present disclosure.
  • FIG. 1 is a diagram illustrating a configuration of a device for generating a speech video along with a landmark according to an embodiment of the present disclosure.
  • a speech video generation device 100 may include a first encoder 102 , a second encoder 104 , a combining unit 106 , a first decoder 108 , and a second decoder 110 .
  • the configuration of the speech video generation device 100 illustrated in FIG. 1 shows functional elements that are functionally differentiated, wherein the functional elements may be functionally connected to each other to perform functions according to the present disclosure, and one or more elements may be actually physically integrated.
  • the speech video generation device 100 may be implemented with a convolutional neural network (CNN)-based machine learning technology, but is not limited thereto, and other various machine learning technologies may be applied.
  • the following description is provided with a focus on a learning process for generating a speech video along with a landmark.
  • the first encoder 102 may be a machine learning model trained to extract an image feature vector using a person background image as an input.
  • The term "vector" may also be used to refer to a "tensor".
  • the person background image input to the first encoder 102 is an image in which a person utters (speaks).
  • the person background image may be an image including a face and upper body of a person. That is, the person background image may be an image including not only a face but also an upper body so as to show motions of the face, neck, and shoulder of the person when the person utters, but is not limited thereto.
  • a portion related to a speech in the person background image input to the first encoder 102 may be masked. That is, a portion (e.g., a mouth and a portion around the mouth) related to a speech in the person background image may be hidden by a mask M. Furthermore, during a masking process, portions related to a face motion, neck motion, and shoulder motion due to a person's speech may not be masked in the person background image. In this case, the first encoder 102 extracts an image feature vector of a portion excluding the portion related to a speech in the person background image.
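  • A minimal sketch of this masking step (assuming a (C, H, W) image tensor and a hypothetical mouth_box that locates the speech-related region; the disclosure does not specify how the region is determined):

```python
import torch

def mask_speech_region(person_background_image: torch.Tensor, mouth_box):
    """Hide the speech-related portion (mouth and its surroundings) with a mask M.
    mouth_box = (top, bottom, left, right) is a hypothetical region specification."""
    masked = person_background_image.clone()   # (C, H, W)
    top, bottom, left, right = mouth_box
    masked[:, top:bottom, left:right] = 0.0    # zero out the masked (speech-related) region
    return masked
```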
  • the first encoder 102 may include at least one convolutional layer and at least one pooling layer.
  • The convolutional layer may extract a feature value of the pixels corresponding to a filter of a preset size (e.g., 3×3 pixels) while moving the filter over the input person background image at a fixed interval.
  • the pooling layer may receive an output from the convolutional layer as an input and may perform down sampling thereon.
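  • For illustration only, the first encoder can be sketched as a small stack of convolutional and pooling layers (PyTorch-style; the channel sizes and depth are assumptions, not values from the disclosure):

```python
import torch.nn as nn

class FirstEncoder(nn.Module):
    """Sketch of the image encoder: 3x3 convolutions, each followed by a
    pooling layer that down-samples, ending in an image feature vector."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # down-sampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # collapse spatial dimensions
        )

    def forward(self, person_background_image):  # (B, C, H, W)
        return self.features(person_background_image).flatten(1)  # (B, 128) image feature vector
```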
  • the second encoder 104 may be a machine learning model trained to extract a voice feature vector using a speech audio signal as an input.
  • the speech audio signal corresponds to an audio part in the person background image (i.e., an image in which a person utters) input to the first encoder 102 .
  • a video part in a video in which a person utters may be input to the first encoder 102
  • an audio part may be input to the second encoder 104 .
  • the second encoder 104 may include at least one convolutional layer and at least one pooling layer, but a neural network structure of the second encoder 104 is not limited thereto.
  • the person background image input to the first encoder 102 and the speech audio signal input to the second encoder 104 may be synchronized in time. That is, in a section of the same time band in a video in which a person utters, a video may be input to the first encoder 102 , and an audio may be input to the second encoder 104 .
  • the person background image and the speech audio signal may be input to the first encoder 102 and the second encoder 104 every preset unit time (e.g., one frame or a plurality of successive frames).
  • the combining unit 106 may generate a combined vector by combining an image feature vector output from the first encoder 102 and a voice feature vector output from the second encoder 104 .
  • the combining unit 106 may generate the combined vector by concatenating the image feature vector and the voice feature vector, but the present disclosure is not limited thereto, and the combining unit 106 may generate the combined vector by combining the image feature vector and the voice feature vector in other various manners.
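  • A sketch of the combining unit using concatenation, which is the combining manner named above (other combinations remain possible):

```python
import torch

def combine(image_feature_vector: torch.Tensor, voice_feature_vector: torch.Tensor) -> torch.Tensor:
    # Concatenate along the feature dimension to form the combined vector.
    return torch.cat([image_feature_vector, voice_feature_vector], dim=1)
```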
  • the first decoder 108 may be a machine learning model trained to reconstruct a speech video of a person using the combined vector output from the combining unit 106 as an input.
  • the first decoder 108 may be a machine learning model trained to reconstruct a portion (i.e., a portion related to a speech) hidden by the mask M of the image feature vector (i.e., a feature of a video part, in which the speech-related portion is hidden by the mask, in a video in which a person utters) output from the first encoder 102 , based on the voice feature vector (i.e., a feature of an audio part in the video in which a person utters) output from the second encoder 104 . That is, the first decoder 108 may be a model trained to reconstruct a masked region using an audio signal, when a portion related to a speech is masked in the person background image.
  • the first decoder 108 may generate a speech video by performing up sampling after performing deconvolution on the combined vector obtained by combining the image feature vector output from the first encoder 102 and the voice feature vector output from the second encoder 104 .
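  • A sketch of the first decoder as deconvolution (transposed convolution) followed by up-sampling back to an image (PyTorch-style; the projection, channel counts, and output resolution are illustrative assumptions):

```python
import torch.nn as nn

class FirstDecoder(nn.Module):
    """Sketch of the reconstruction decoder: the combined vector is projected to a
    small feature map and deconvolved/up-sampled into a reconstructed frame."""
    def __init__(self, combined_dim=256, out_channels=3):
        super().__init__()
        self.project = nn.Linear(combined_dim, 128 * 4 * 4)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),                         # pixel values in [0, 1]
        )

    def forward(self, combined_vector):           # (B, combined_dim)
        x = self.project(combined_vector).view(-1, 128, 4, 4)
        return self.upsample(x)                   # (B, 3, 32, 32) reconstructed frame
```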
  • the first decoder 108 may compare a reconstructed speech video with an original speech video (i.e., a correct value), and may adjust a learning parameter (e.g., a loss function, a softmax function, etc.) so that the reconstructed speech video (i.e., a video in which a speech-related portion has been reconstructed through an audio part) approximates to the original speech video.
  • the second decoder 110 may be a machine learning model trained to predict a landmark of a speech video using the combined vector output from the combining unit 106 as an input.
  • the second decoder 110 may extract a feature vector (or feature tensor) from the combined vector, and may predict a landmark of a speech video based on the extracted feature vector (or feature tensor).
  • the second decoder 110 may compare the predicted landmark with a labeled landmark (landmark extracted from an original speech video), and may adjust a learning parameter (e.g., a loss function, a softmax function, etc.) so that the predicted landmark approximates to the labeled landmark.
  • the image feature vector is extracted from the person background image
  • the voice feature vector is extracted from the speech audio signal
  • the combined vector is generated by combining the image feature vector and the voice feature vector
  • a speech video and a landmark are predicted together based on the combined vector, and thus the speech video and the landmark may be more accurately predicted.
  • The reconstructed (predicted) speech video is trained so as to minimize its difference from the original speech video.
  • The predicted landmark is trained so as to minimize its difference from the labeled landmark extracted from the original speech video.
  • Since learning for simultaneously predicting an actual speech video and a landmark is performed in a state in which the face position of the corresponding person and the landmark position spatially match, a shape change due to an overall face motion and a shape change of the speaking part due to a speech may be learnt separately, without preprocessing for aligning the landmark in a standard space.
  • Since the neural network for reconstructing a speech video and the neural network for predicting a landmark are integrated into one component, a pattern of the motion of the speech-related portion may be shared and learnt, and thus noise may be efficiently eliminated from the predicted landmark.
  • FIG. 2 is a diagram illustrating an example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
  • the second decoder 110 may include an extraction module 110 a and a prediction module 110 b.
  • the extraction module 110 a may be trained to extract a feature vector from an input combined vector.
  • the extraction module 110 a may extract the feature vector from the combined vector through a plurality of convolutional neural network layers.
  • the prediction module 110 b may be trained to predict landmark coordinates of a speech video based on the feature vector extracted by the extraction module 110 a . That is, the prediction module 110 b may be trained to predict a coordinate value corresponding to a landmark in a coordinate system of a speech video based on the extracted feature vector.
  • landmark coordinates may be two-dimensionally or three-dimensionally expressed.
  • The landmark coordinates K may be expressed as Equation 1 below.
  • K = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}   (Equation 1)
  • x_n: x-axis coordinate value of the nth landmark
  • y_n: y-axis coordinate value of the nth landmark
  • Predicting landmark coordinates from the combined vector in the second decoder 110 may be expressed as Equation 2 below.
  • K′ = G(I; θ)   (Equation 2)
  • K′ denotes the landmark coordinates predicted by the second decoder 110
  • G denotes the neural network constituting the second decoder 110
  • I denotes the combined vector
  • θ denotes a parameter of the neural network G.
  • the second decoder 110 may be trained so as to minimize a difference between landmark coordinates predicted from the combined vector and labeled landmark coordinates.
  • An objective function L_prediction of the second decoder 110 may be expressed as Equation 3 below.
  • L_prediction = ∥K − G(I; θ)∥   (Equation 3)
  • K denotes the labeled landmark coordinates
  • ∥A − B∥ denotes a function for deriving a difference between A and B (e.g., the Euclidean distance (L2 distance) or the Manhattan distance (L1 distance) between A and B).
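  • An illustrative reading of Equation 3 (PyTorch-style sketch; the Euclidean (L2) distance is chosen here, and averaging over landmarks and batch is an assumption):

```python
import torch

def landmark_prediction_loss(predicted_coords: torch.Tensor, labeled_coords: torch.Tensor) -> torch.Tensor:
    # predicted_coords = G(I; theta), labeled_coords = K, both of shape (B, n, 2).
    # L_prediction = ||K - G(I; theta)|| with the per-landmark Euclidean (L2) distance.
    return torch.norm(labeled_coords - predicted_coords, p=2, dim=-1).mean()
```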
  • FIG. 3 is a diagram illustrating another example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
  • the second decoder 110 may include the extraction module 110 a and the prediction module 110 b.
  • the extraction module 110 a may be trained to extract a feature tensor from a combined vector.
  • the extraction module 110 a may extract a feature tensor so that a landmark is expressed as one point in an image space corresponding to a speech video.
  • the prediction module 110 b may be trained to predict a landmark image based on the feature tensor extracted by the extraction module 110 a .
  • The landmark image, which indicates whether each pixel corresponds to a landmark in the image space corresponding to the speech video, may be an image in which the value of a pixel is set to 1 if the pixel corresponds to a landmark and to 0 if it does not.
  • The prediction module 110 b may predict a landmark image by outputting a probability value (i.e., a probability value pertaining to the presence or absence of a landmark) between 0 and 1 for each pixel based on the extracted feature tensor. Outputting the probability value for each pixel from the prediction module 110 b may be expressed as Equation 4 below.
  • p(x_i, y_i) = P(F(x_i, y_i); θ)   (Equation 4)
  • p(x_i, y_i) denotes a probability value indicating whether the pixel (x_i, y_i) is a landmark
  • P denotes the neural network constituting the second decoder 110
  • F(x_i, y_i) denotes a feature tensor of the pixel (x_i, y_i)
  • θ denotes a parameter of the neural network P.
  • a sigmoid, Gaussian, or the like may be used as a probability distribution function, but the probability distribution function is not limited thereto.
  • The objective function L_prediction of the second decoder 110 may be expressed as Equation 5 below.
  • p_target(x_i, y_i) denotes a labeled landmark indication value of the pixel (x_i, y_i) of the speech video. That is, this value is labeled as 1 when the corresponding pixel is a landmark and as 0 when the corresponding pixel is not a landmark.
  • Through training, the probability value p(x_i, y_i) indicating whether the pixel (x_i, y_i) is a landmark increases when the labeled landmark indication value of the pixel (x_i, y_i) is 1, and decreases when the labeled landmark indication value of the pixel (x_i, y_i) is 0.
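  • The behavior just described (raising p(x_i, y_i) where the labeled value is 1 and lowering it where the labeled value is 0) is what a binary cross-entropy objective produces; the sketch below uses it as an assumed form of Equation 5, since the exact equation is not reproduced here:

```python
import torch.nn.functional as F

def landmark_image_loss(predicted_prob_map, target_landmark_map):
    # predicted_prob_map:  (B, H, W) probabilities p(x_i, y_i) in [0, 1]
    # target_landmark_map: (B, H, W) labeled indication values p_target(x_i, y_i) in {0, 1}
    # Binary cross-entropy drives p toward 1 where the label is 1 and toward 0 where it is 0.
    return F.binary_cross_entropy(predicted_prob_map, target_landmark_map.float())
```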
  • The term "module" used herein may represent a functional or structural combination of hardware for implementing the technical concept of the present disclosure and software for driving the hardware.
  • For example, a "module" may represent predetermined codes and logical units of hardware resources for executing the predetermined codes, but does not necessarily represent physically connected codes or a single type of hardware.
  • FIG. 4 is a diagram illustrating a state in which a speech video and a landmark are inferred through a speech video generation device according to an embodiment of the present disclosure.
  • the first encoder 102 receives a person background image.
  • the person background image may be one used in a learning process.
  • the person background image may be an image including a face and upper body of a person.
  • a portion related to a speech may be hidden by the mask M.
  • the first encoder 102 may extract an image feature vector from the person background image.
  • the second encoder 104 receives an input of a speech audio signal.
  • the speech audio signal may not be related to the person background image input to the first encoder 102 .
  • the speech audio signal may be a speech audio signal of a person different from the person in the person background image.
  • the speech audio signal is not limited thereto, and may be one uttered by the person in the person background image.
  • the speech of the person may be one uttered in a situation or background not related to the person background image.
  • the second encoder 104 may extract a voice feature vector from the speech audio signal.
  • the combining unit 106 may generate a combined vector by combining the image feature vector output from the first encoder 102 and the voice feature vector output from the second encoder 104 .
  • the first decoder 108 may reconstruct and output a speech video using the combined vector as an input. That is, the first decoder 108 may generate the speech video by reconstructing a speech-related portion of the person background image based on the voice feature vector output from the second encoder 104 .
  • Even when the speech audio signal input to the second encoder 104 is a speech not related to the person background image (e.g., a speech that was not uttered by the person in the person background image), the speech video is generated as if the person in the person background image utters it.
  • the second decoder 110 may predict and output a landmark of the speech video using the combined vector as an input.
  • Since the speech video generation device 100 is trained to predict the landmark of a speech video while reconstructing the speech video through the first decoder 108 and the second decoder 110 when the combined vector is input, the speech video generation device 100 may predict the landmark accurately and smoothly without a process of aligning the landmark of the speech video in a standard space.
  • FIG. 5 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure. Hereinafter, differences with the embodiment illustrated in FIG. 1 will be mainly described.
  • a speech video generation device 200 may further include a residual block 212 .
  • At least one residual block 212 may be provided between a combining unit 206 and the first and second decoders 208 and 210.
  • A plurality of residual blocks 212 may be sequentially connected (in series) between the combining unit 206 and the first and second decoders 208 and 210.
  • the residual block 212 may include at least one convolutional layer.
  • the residual block 212 may have a structure for performing convolution on an input value (i.e., combined vector output from the combining unit 206 ) and adding the input value to a result value of the convolution.
  • The residual block 212 may be trained to minimize a difference between the input value and the output value of the residual block 212. In this manner, the image feature vector and the voice feature vector may be organically combined and used as an input for the first decoder 208 and the second decoder 210.
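  • A sketch of one residual block (assuming the combined vector is handled as a feature map; the channel count and depth are illustrative):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Convolution applied to the input value, with the input added to the
    convolution result (skip connection)."""
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # input value added to the convolution result
```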
  • FIG. 6 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure. Hereinafter, differences with the embodiment illustrated in FIG. 1 will be mainly described.
  • a speech video generation device 300 may include a first encoder 302 , a second encoder 304 , a combining unit 306 , a decoder 314 , a first output layer 316 , and a second output layer 318 .
  • the first encoder 302 , the second encoder 304 , and the combining unit 306 are the same as or similar to those illustrated in FIG. 1 , and are thus not described in detail below.
  • the decoder 314 may use a combined vector output from the combining unit 306 as an input, and may perform up sampling after performing deconvolution on the combined vector.
  • the first output layer 316 which is one output layer connected to the decoder 314 , may output a reconstructed speech video based on data up-sampled by the decoder 314 .
  • the second output layer 318 which is another output layer connected to the decoder 314 , may output a predicted landmark of the speech video based on the data up-sampled by the decoder 314 .
  • a process of deconvoluting and up-sampling the combined vector may be shared through the decoder 314 , and only output layers may be differently configured so as to respectively output a reconstructed speech video and a predicted landmark.
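  • A sketch of this shared-decoder variant with two output layers (PyTorch-style; all sizes are illustrative assumptions):

```python
import torch.nn as nn

class SharedDecoderTwoHeads(nn.Module):
    """One decoder deconvolves and up-samples the combined vector; two output
    layers attached to it produce the reconstructed speech video frame and the
    predicted landmark image, respectively."""
    def __init__(self, combined_dim=256):
        super().__init__()
        self.project = nn.Linear(combined_dim, 128 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        self.first_output_layer = nn.Sequential(   # reconstructed video frame
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Sigmoid())
        self.second_output_layer = nn.Sequential(  # landmark probability image
            nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, combined_vector):
        shared = self.decoder(self.project(combined_vector).view(-1, 128, 4, 4))
        return self.first_output_layer(shared), self.second_output_layer(shared)
```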
  • FIG. 7 is a block diagram illustrating a computing environment 10 that includes a computing device suitable for use in example embodiments.
  • each component may have different functions and capabilities in addition to those described below, and additional components may be included in addition to those described below.
  • the illustrated computing environment 10 includes a computing device 12 .
  • the computing device 12 may be the speech video generation device 100 .
  • the computing device 12 includes at least one processor 14 , a computer-readable storage medium 16 , and a communication bus 18 .
  • the processor 14 may cause the computing device 12 to operate according to the above-described example embodiments.
  • the processor 14 may execute one or more programs stored in the computer-readable storage medium 16 .
  • the one or more programs may include one or more computer-executable instructions, which may be configured to cause, when executed by the processor 14 , the computing device 12 to perform operations according to the example embodiments.
  • the computer-readable storage medium 16 is configured to store computer-executable instructions or program codes, program data, and/or other suitable forms of information.
  • a program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14 .
  • the computer-readable storage medium 16 may be a memory (a volatile memory such as a random access memory, a non-volatile memory, or any suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that are accessible by the computing device 12 and store desired information, or any suitable combination thereof.
  • the communication bus 18 interconnects various other components of the computing device 12 , including the processor 14 and the computer-readable storage medium 16 .
  • the computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24 , and one or more network communication interfaces 26 .
  • the input/output interface 22 and the network communication interface 26 are connected to the communication bus 18 .
  • the input/output device 24 may be connected to other components of the computing device 12 via the input/output interface 22 .
  • the example input/output device 24 may include a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touch pad, a touch screen, or the like), a voice or sound input device, input devices such as various types of sensor devices and/or imaging devices, and/or output devices such as a display device, a printer, a speaker, and/or a network card.
  • the example input/output device 24 may be included inside the computing device 12 as a component constituting the computing device 12 , or may be connected to the computing device 12 as a separate device distinct from the computing device 12 .

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
US17/762,926 2020-08-28 2020-12-15 Device and method for generating speech video along with landmark Pending US20220375224A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
KR10-2020-0109173 2020-08-28
KR1020200109173A KR102501773B1 (ko) Device and method for generating speech video along with landmark
PCT/KR2020/018372 WO2022045485A1 (ko) Device and method for generating speech video along with landmark

Publications (1)

Publication Number Publication Date
US20220375224A1 true US20220375224A1 (en) 2022-11-24

Family

ID=80353398

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/762,926 Pending US20220375224A1 (en) 2020-08-28 2020-12-15 Device and method for generating speech video along with landmark

Country Status (3)

Country Link
US (1) US20220375224A1 (ko)
KR (2) KR102501773B1 (ko)
WO (1) WO2022045485A1 (ko)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116451A (zh) * 2022-06-15 2022-09-27 Tencent Technology (Shenzhen) Co., Ltd. Audio decoding and encoding methods and apparatuses, electronic device, and storage medium
KR102640603B1 (ko) * 2022-06-27 2024-02-27 SR Universe Inc. Lip sync network training method, mouth shape image generation method, and lip sync network system
CN117495750A (zh) * 2022-07-22 2024-02-02 Dell Products L.P. Method, electronic device, and computer program product for video reconstruction

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005031654A1 (en) * 2003-09-30 2005-04-07 Koninklijke Philips Electronics, N.V. System and method for audio-visual content synthesis
KR101378811B1 (ko) * 2012-09-18 2014-03-28 Kim Sang-chul Apparatus and method for changing lip shape based on automatic word translation
US10062198B2 (en) * 2016-06-23 2018-08-28 LoomAi, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
KR102091643B1 (ko) 2018-04-23 2020-03-20 ESTsoft Corp. Apparatus for generating an image of a person wearing glasses using an artificial neural network, method therefor, and computer-readable recording medium storing a program for performing the method
KR20200043660A (ko) * 2018-10-18 2020-04-28 KT Corporation Speech synthesis method and speech synthesis apparatus

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374637A1 (en) * 2021-05-20 2022-11-24 Nvidia Corporation Synthesizing video from audio using one or more neural networks

Also Published As

Publication number Publication date
WO2022045485A1 (ko) 2022-03-03
KR20230025824A (ko) 2023-02-23
KR20220028328A (ko) 2022-03-08
KR102501773B1 (ko) 2023-02-21

Similar Documents

Publication Publication Date Title
US20220375224A1 (en) Device and method for generating speech video along with landmark
US11775829B2 (en) Generative adversarial neural network assisted video reconstruction
US11610435B2 (en) Generative adversarial neural network assisted video compression and broadcast
US20220358703A1 (en) Method and device for generating speech video on basis of machine learning
US20220399025A1 (en) Method and device for generating speech video using audio signal
KR20200145700A (ko) 머신 러닝 기반의 발화 동영상 생성 방법 및 장치
US11972516B2 (en) Method and device for generating speech video by using text
KR102437039B1 (ko) 영상 생성을 위한 학습 장치 및 방법
US20220375190A1 (en) Device and method for generating speech video
US20230177663A1 (en) Device and method for synthesizing image capable of improving image quality
Chen et al. DualLip: A system for joint lip reading and generation
US20240055015A1 (en) Learning method for generating lip sync image based on machine learning and lip sync image generation device for performing same
CN113542758B (zh) Generative adversarial neural network assisted video compression and broadcast
CN113542759B (zh) Generative adversarial neural network assisted video reconstruction
US20230177664A1 (en) Device and method for synthesizing image capable of improving image quality
KR102612625B1 (ko) 신경망 기반의 특징점 학습 장치 및 방법
Koumparoulis et al. Audio-assisted image inpainting for talking faces
US20220343651A1 (en) Method and device for generating speech image
KR102649818B1 (ko) 3d 립싱크 비디오 생성 장치 및 방법
KR102584484B1 (ko) 발화 합성 영상 생성 장치 및 방법
US12045639B1 (en) System providing visual assistants with artificial intelligence
US20240046141A1 (en) Method for generating data using machine learning and computing device for executing the same
KR102540756B1 (ko) 발화 합성 영상 생성 장치 및 방법
US20230178072A1 (en) Apparatus and method for generating lip sync image

Legal Events

Date Code Title Description
AS Assignment

Owner name: DEEPBRAIN AI INC., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CHAE, GYEONGSU;REEL/FRAME:059376/0860

Effective date: 20220318

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER