US20220375224A1 - Device and method for generating speech video along with landmark - Google Patents
Device and method for generating speech video along with landmark
- Publication number: US20220375224A1 (Application No. US 17/762,926)
- Authority
- US
- United States
- Prior art keywords
- speech
- person
- landmark
- speech video
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0356—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
Definitions
- Embodiments of the present disclosure relate to a speech video generation technology.
- Such a speech video may be generated using an artificial intelligence technology. For example, when there is a voice message to be delivered, it may be desirable to generate a speech video in which the voice message sounds as if it were uttered by a famous person (e.g., a president) so as to attract the attention of people.
- This is implemented by generating, in a video of the famous person, lip shapes or the like suitable for a specific message so that it looks as if the famous person utters the message.
- The correct value of face landmark data in an image is obtained by a person labeling the image while viewing it. When multiple persons perform this operation, each person may apply a different criterion, and even when the same person performs it, exactly corresponding points cannot be marked in every image frame; thus, annotation noise is unavoidable.
- When a face landmark is predicted using a correct value containing such noise and a face image is then synthesized using the predicted face landmark, deterioration of image quality, such as image shaking, occurs.
- A related technique uses a learning model which aligns a face landmark extracted from a speech image in a standard space and predicts a face landmark by using a voice as an input.
- However, when a landmark is aligned in an inaccurate manner (e.g., based on an inaccurately estimated value, or through a simplified transformation because three-dimensional movement and rotation cannot be represented in two dimensions), information loss and distortion occur; as a result, lip shapes are not correctly synchronized, and unnecessary shaking or the like occurs.
- In addition, when the reference point is located at a virtual position (e.g., the average position of all face landmarks or the average position of the lip landmarks), it is difficult to control the model so that only the speaking part moves while the head of the corresponding person remains fixed.
- Disclosed embodiments provide techniques for generating a speech video along with a landmark of the speech video.
- a speech video generation device is a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors, the speech video generation device including: a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person, and extracts an image feature vector from the person background image; a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video, and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; a first decoder, which reconstructs the speech video of the person using the combined vector as an input; and a second decoder, which predicts a landmark of the speech video using the combined vector as an input.
- a portion related to a speech of the person may be hidden by a mask in the person background image.
- the speech audio signal may be an audio part of the same section as the person background image in the speech video of the person.
- the first decoder may be a machine learning model trained to reconstruct the portion hidden by the mask in the person background image based on the voice feature vector.
- the combining unit may generate the combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder.
- the first decoder may receive the combined vector to generate the speech video of the person by reconstructing, based on the speech audio signal not related to the person background image, the portion related to the speech in the person background image.
- the second decoder may predict and output the landmark of the speech video.
- the second decoder may include: an extraction module trained to extract a feature vector from the input combined vector; and a prediction module trained to predict landmark coordinates of the speech video based on the feature vector extracted by the extraction module.
- An objective function L_prediction of the second decoder may be expressed as L_prediction = ‖K - G(I; θ)‖, where ‖K - G(I; θ)‖ is a function for deriving a difference between the labeled landmark coordinates K and the landmark coordinates G(I; θ) predicted from the combined vector I.
- the second decoder may include: an extraction module trained to extract a feature tensor from the input combined vector; and a prediction module trained to predict a landmark image based on the feature tensor extracted by the extraction module.
- the landmark image may be an image indicating, by a probability value, whether each pixel corresponds to a landmark in an image space corresponding to the speech video.
- An objective function L_prediction of the second decoder may be expressed as an equation below.
- a speech video generation device is a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors, the speech video generation device including: a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person, and extracts an image feature vector from the person background image; a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video, and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; a decoder, which uses the combined vector as an input, and performs deconvolution and up-sampling on the combined vector; a first output layer, which is connected to the decoder and outputs a reconstructed speech video of the person based on up-sampled data; and a second output layer, which is connected to the decoder and outputs a predicted landmark of the speech video based on the up-sampled data.
- a speech video generation method performed by a computing device having one or more processors and a memory which stores one or more programs executed by the one or more processors includes: receiving an input of a person background image that is a video part in a speech video of a predetermined person, and extracting an image feature vector from the person background image; receiving an input of a speech audio signal that is an audio part in the speech video, and extracting a voice feature vector from the speech audio signal; generating a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; reconstructing the speech video of the person using the combined vector as an input; and predicting a landmark of the speech video using the combined vector as an input.
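To make the data flow described above concrete, the following is a minimal sketch of the two-encoder, two-decoder structure, written in PyTorch. All layer sizes, channel counts, the 68-point landmark count, and the audio front-end (mel-spectrogram-like frames) are illustrative assumptions; the disclosure does not fix these details, and the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechVideoGenerator(nn.Module):
    """Sketch of the two-encoder / two-decoder structure described above (assumed sizes)."""
    def __init__(self, img_channels=3, audio_dim=80, feat_dim=256):
        super().__init__()
        # First encoder: (masked) person background image -> image feature vector
        self.image_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Second encoder: speech audio features (e.g., mel frames) -> voice feature vector
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(audio_dim, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # First decoder: combined vector -> reconstructed speech video frame
        self.video_decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, img_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Second decoder: combined vector -> landmark coordinates (68 points, x/y assumed)
        self.landmark_decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 68 * 2),
        )

    def forward(self, person_bg_image, speech_audio):
        img_feat = self.image_encoder(person_bg_image)        # image feature vector
        voice_feat = self.audio_encoder(speech_audio)         # voice feature vector
        combined = torch.cat([img_feat, voice_feat], dim=1)   # combining unit
        frame = self.video_decoder(combined)                  # reconstructed speech video frame
        landmarks = self.landmark_decoder(combined).view(-1, 68, 2)  # predicted landmarks
        return frame, landmarks
```

As a usage sketch, `frame, landmarks = SpeechVideoGenerator()(torch.randn(1, 3, 64, 64), torch.randn(1, 80, 20))` returns a reconstructed frame of shape (1, 3, 32, 32) under these assumed sizes and landmark coordinates of shape (1, 68, 2).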
- the image feature vector is extracted from the person background image
- the voice feature vector is extracted from the speech audio signal
- the combined vector is generated by combining the image feature vector and the voice feature vector
- a speech video and a landmark are predicted together based on the combined vector, and thus the speech video and the landmark may be more accurately predicted.
- the reconstructed (predicted) speech video is trained so as to minimize its difference from the original speech video
- the predicted landmark is trained so as to minimize its difference from the labeled landmark extracted from the original speech video.
- Since learning for simultaneously predicting an actual speech video and a landmark is performed in a state in which the face position of the corresponding person and the landmark position spatially match, a shape change due to an overall face motion and a shape change of the speaking part due to a speech may be learnt separately, without preprocessing for aligning the landmark in a standard space.
- Since a neural network for reconstructing a speech video and a neural network for predicting a landmark are integrated into one component, a pattern regarding the motion of the speech-related portion may be shared and learnt, and thus noise may be efficiently eliminated from the predicted landmark.
- FIG. 1 is a diagram illustrating a configuration of a device for generating a speech video along with a landmark according to an embodiment of the present disclosure.
- FIG. 2 is a diagram illustrating an example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
- FIG. 3 is a diagram illustrating another example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
- FIG. 4 is a diagram illustrating a state in which a speech video and a landmark are inferred through a speech video generation device according to an embodiment of the present disclosure.
- FIG. 5 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure.
- FIG. 6 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure.
- FIG. 7 is a block diagram illustrating a computing environment that includes a computing device suitable for use in example embodiments.
- "Transmission" of a signal or information may include both direct transmission of the signal or information from one element to another element and transmission from one element to another element through an intervening element.
- "Transmission" or "sending" of a signal or information to an element may indicate the final destination of the signal or information and may not imply a direct destination.
- a meaning in which two or more pieces of data or information are “related” indicates that when any one piece of data (or information) is obtained, at least a portion of other data (or information) may be obtained based thereon.
- The terms "first", "second", and the like may be used for describing various elements, but the elements should not be construed as being limited by these terms. These terms may be used for distinguishing one element from another element. For example, a first element could be termed a second element and vice versa without departing from the scope of the present disclosure.
- FIG. 1 is a diagram illustrating a configuration of a device for generating a speech video along with a landmark according to an embodiment of the present disclosure.
- a speech video generation device 100 may include a first encoder 102 , a second encoder 104 , a combining unit 106 , a first decoder 108 , and a second decoder 110 .
- the configuration of the speech video generation device 100 illustrated in FIG. 1 shows functional elements that are functionally differentiated, wherein the functional elements may be functionally connected to each other to perform functions according to the present disclosure, and one or more elements may be actually physically integrated.
- the speech video generation device 100 may be implemented with a convolutional neural network (CNN)-based machine learning technology, but is not limited thereto, and other various machine learning technologies may be applied.
- the following description is provided with a focus on a learning process for generating a speech video along with a landmark.
- the first encoder 102 may be a machine learning model trained to extract an image feature vector using a person background image as an input.
- The term "vector" may also be used to refer to a "tensor".
- the person background image input to the first encoder 102 is an image in which a person utters (speaks).
- the person background image may be an image including a face and upper body of a person. That is, the person background image may be an image including not only a face but also an upper body so as to show motions of the face, neck, and shoulder of the person when the person utters, but is not limited thereto.
- a portion related to a speech in the person background image input to the first encoder 102 may be masked. That is, a portion (e.g., a mouth and a portion around the mouth) related to a speech in the person background image may be hidden by a mask M. Furthermore, during a masking process, portions related to a face motion, neck motion, and shoulder motion due to a person's speech may not be masked in the person background image. In this case, the first encoder 102 extracts an image feature vector of a portion excluding the portion related to a speech in the person background image.
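As a concrete illustration of this masking step, the sketch below hides a rectangular region in the lower part of each frame. The disclosure does not specify the mask shape, region, or fill value, so the choices here (bottom fraction of the image, zero fill) are assumptions; the function name is hypothetical.

```python
import torch

def mask_speech_region(person_bg_image: torch.Tensor, mask_ratio: float = 0.4) -> torch.Tensor:
    """Hide the assumed speech-related region (mouth and surroundings) of each frame.

    person_bg_image: tensor of shape (B, C, H, W).
    mask_ratio: fraction of the image height, measured from the bottom, to hide.
    """
    masked = person_bg_image.clone()
    h = person_bg_image.shape[-2]
    start = int(h * (1.0 - mask_ratio))
    masked[..., start:, :] = 0.0  # hidden portion to be reconstructed from the audio signal
    return masked
```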
- the first encoder 102 may include at least one convolutional layer and at least one pooling layer.
- The convolutional layer may extract a feature value of the pixels corresponding to a filter of a preset size (e.g., 3×3 pixels) while moving the filter at a fixed interval over the input person background image.
- the pooling layer may receive an output from the convolutional layer as an input and may perform down sampling thereon.
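A component-level sketch of such a first encoder (3×3 convolutions interleaved with pooling layers that down-sample their input), assuming PyTorch; the channel counts and output vector size are illustrative, not taken from the disclosure.

```python
import torch.nn as nn

class FirstEncoder(nn.Module):
    """Extracts an image feature vector from the (masked) person background image."""
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            # 3x3 filters moved at a fixed interval over the input image
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # pooling layer: down-samples the convolution output
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_vector = nn.Sequential(nn.Flatten(), nn.Linear(128, feat_dim))

    def forward(self, person_bg_image):
        return self.to_vector(self.features(person_bg_image))  # image feature vector
```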
- the second encoder 104 may be a machine learning model trained to extract a voice feature vector using a speech audio signal as an input.
- the speech audio signal corresponds to an audio part in the person background image (i.e., an image in which a person utters) input to the first encoder 102 .
- a video part in a video in which a person utters may be input to the first encoder 102
- an audio part may be input to the second encoder 104 .
- the second encoder 104 may include at least one convolutional layer and at least one pooling layer, but a neural network structure of the second encoder 104 is not limited thereto.
- the person background image input to the first encoder 102 and the speech audio signal input to the second encoder 104 may be synchronized in time. That is, in a section of the same time band in a video in which a person utters, a video may be input to the first encoder 102 , and an audio may be input to the second encoder 104 .
- the person background image and the speech audio signal may be input to the first encoder 102 and the second encoder 104 every preset unit time (e.g., one frame or a plurality of successive frames).
- the combining unit 106 may generate a combined vector by combining an image feature vector output from the first encoder 102 and a voice feature vector output from the second encoder 104 .
- the combining unit 106 may generate the combined vector by concatenating the image feature vector and the voice feature vector, but the present disclosure is not limited thereto, and the combining unit 106 may generate the combined vector by combining the image feature vector and the voice feature vector in other various manners.
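A minimal combining-unit sketch, assuming the concatenation variant mentioned above; other combination schemes would replace the single line inside forward. The class name is hypothetical.

```python
import torch
import torch.nn as nn

class CombiningUnit(nn.Module):
    """Combines the image feature vector and the voice feature vector into one combined vector."""
    def forward(self, image_feature: torch.Tensor, voice_feature: torch.Tensor) -> torch.Tensor:
        # Concatenation along the feature dimension; other fusion schemes are possible.
        return torch.cat([image_feature, voice_feature], dim=1)
```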
- the first decoder 108 may be a machine learning model trained to reconstruct a speech video of a person using the combined vector output from the combining unit 106 as an input.
- the first decoder 108 may be a machine learning model trained to reconstruct a portion (i.e., a portion related to a speech) hidden by the mask M of the image feature vector (i.e., a feature of a video part, in which the speech-related portion is hidden by the mask, in a video in which a person utters) output from the first encoder 102 , based on the voice feature vector (i.e., a feature of an audio part in the video in which a person utters) output from the second encoder 104 . That is, the first decoder 108 may be a model trained to reconstruct a masked region using an audio signal, when a portion related to a speech is masked in the person background image.
- the first decoder 108 may generate a speech video by performing up sampling after performing deconvolution on the combined vector obtained by combining the image feature vector output from the first encoder 102 and the voice feature vector output from the second encoder 104 .
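A sketch of a first decoder along these lines, assuming PyTorch: the combined vector is projected to a small spatial map, deconvolved, and up-sampled to an image. Sizes and the sigmoid output range are illustrative assumptions.

```python
import torch.nn as nn

class FirstDecoder(nn.Module):
    """Reconstructs the speech video frame (masked speech portion filled in) from the combined vector."""
    def __init__(self, combined_dim=512, out_channels=3):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(combined_dim, 128 * 8 * 8), nn.ReLU(),
                                     nn.Unflatten(1, (128, 8, 8)))
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),          # deconvolution
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),        # up-sampling
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, combined_vector):
        return self.upsample(self.project(combined_vector))  # reconstructed frame
```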
- the first decoder 108 may compare a reconstructed speech video with an original speech video (i.e., a correct value), and may adjust a learning parameter (e.g., a loss function, a softmax function, etc.) so that the reconstructed speech video (i.e., a video in which a speech-related portion has been reconstructed through an audio part) approximates to the original speech video.
- the second decoder 110 may be a machine learning model trained to predict a landmark of a speech video using the combined vector output from the combining unit 106 as an input.
- the second decoder 110 may extract a feature vector (or feature tensor) from the combined vector, and may predict a landmark of a speech video based on the extracted feature vector (or feature tensor).
- the second decoder 110 may compare the predicted landmark with a labeled landmark (landmark extracted from an original speech video), and may adjust a learning parameter (e.g., a loss function, a softmax function, etc.) so that the predicted landmark approximates to the labeled landmark.
- the image feature vector is extracted from the person background image
- the voice feature vector is extracted from the speech audio signal
- the combined vector is generated by combining the image feature vector and the voice feature vector
- a speech video and a landmark are predicted together based on the combined vector, and thus the speech video and the landmark may be more accurately predicted.
- the reconstructed (predicted) speech video is trained so as to minimize its difference from the original speech video
- the predicted landmark is trained so as to minimize its difference from the labeled landmark extracted from the original speech video.
- Since learning for simultaneously predicting an actual speech video and a landmark is performed in a state in which the face position of the corresponding person and the landmark position spatially match, a shape change due to an overall face motion and a shape change of the speaking part due to a speech may be learnt separately, without preprocessing for aligning the landmark in a standard space.
- Since a neural network for reconstructing a speech video and a neural network for predicting a landmark are integrated into one component, a pattern regarding the motion of the speech-related portion may be shared and learnt, and thus noise may be efficiently eliminated from the predicted landmark.
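A hedged sketch of one joint training step consistent with this description, assuming a model with the interface of the SpeechVideoGenerator sketch shown earlier: the reconstruction branch is penalized against the original speech video and the landmark branch against the labeled landmarks, and the two losses are summed. The L1 losses and the equal weighting are assumptions, not specifics stated in the disclosure.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, optimizer, masked_bg_image, speech_audio,
                        original_frame, labeled_landmarks, lambda_lmk=1.0):
    """One optimization step that trains video reconstruction and landmark prediction together."""
    optimizer.zero_grad()
    recon_frame, pred_landmarks = model(masked_bg_image, speech_audio)
    # Difference between the reconstructed (predicted) speech video and the original speech video.
    recon_loss = F.l1_loss(recon_frame, original_frame)
    # Difference between the predicted landmarks and the labeled landmarks from the original video.
    landmark_loss = F.l1_loss(pred_landmarks, labeled_landmarks)
    loss = recon_loss + lambda_lmk * landmark_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```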
- FIG. 2 is a diagram illustrating an example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
- the second decoder 110 may include an extraction module 110 a and a prediction module 110 b.
- the extraction module 110 a may be trained to extract a feature vector from an input combined vector.
- the extraction module 110 a may extract the feature vector from the combined vector through a plurality of convolutional neural network layers.
- the prediction module 110 b may be trained to predict landmark coordinates of a speech video based on the feature vector extracted by the extraction module 110 a . That is, the prediction module 110 b may be trained to predict a coordinate value corresponding to a landmark in a coordinate system of a speech video based on the extracted feature vector.
- landmark coordinates may be two-dimensionally or three-dimensionally expressed.
- landmark coordinates K may be expressed as Equation 1 below.
- K = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}   (Equation 1)
- x_n: x-axis coordinate value of the nth landmark
- y_n: y-axis coordinate value of the nth landmark
- Predicting landmark coordinates from the combined vector in the second decoder 110 may be expressed as Equation 2 below.
- K′ = G(I; θ)   (Equation 2)
- K′ denotes landmark coordinates predicted by the second decoder 110
- G denotes a neural network constituting the second decoder 110
- I denotes the combined vector
- θ denotes a parameter of the neural network G.
- the second decoder 110 may be trained so as to minimize a difference between landmark coordinates predicted from the combined vector and labeled landmark coordinates.
- an objective function L_prediction of the second decoder 110 may be expressed as Equation 3 below.
- L_prediction = ‖K - G(I; θ)‖   (Equation 3)
- K denotes labeled landmark coordinates
- ‖A - B‖ denotes a function for deriving a difference between A and B (e.g., the Euclidean (L2) distance or the Manhattan (L1) distance between A and B).
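Equation 3 maps directly to a distance between coordinate sets. A sketch in PyTorch, assuming landmark tensors of shape (B, N, 2) and either the L2 or the L1 distance per landmark; the function name is hypothetical.

```python
import torch

def landmark_coordinate_loss(pred_coords: torch.Tensor, labeled_coords: torch.Tensor,
                             norm: str = "l2") -> torch.Tensor:
    """L_prediction = ||K - G(I; theta)|| over landmark coordinates of shape (B, N, 2)."""
    diff = labeled_coords - pred_coords
    if norm == "l2":
        return diff.norm(dim=-1).mean()       # Euclidean (L2) distance per landmark
    return diff.abs().sum(dim=-1).mean()      # Manhattan (L1) distance per landmark
```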
- FIG. 3 is a diagram illustrating another example of predicting a landmark in a speech video generation device according to an embodiment of the present disclosure.
- the second decoder 110 may include the extraction module 110 a and the prediction module 110 b.
- the extraction module 110 a may be trained to extract a feature tensor from a combined vector.
- the extraction module 110 a may extract a feature tensor so that a landmark is expressed as one point in an image space corresponding to a speech video.
- the prediction module 110 b may be trained to predict a landmark image based on the feature tensor extracted by the extraction module 110 a .
- the landmark image, which indicates whether each pixel corresponds to a landmark in the image space corresponding to the speech video, may be an image in which the pixel value is set to 1 if the pixel corresponds to a landmark and set to 0 if the pixel does not correspond to a landmark.
- the prediction module 110b may predict a landmark image by outputting a probability value (i.e., a probability value pertaining to the presence or absence of a landmark) between 0 and 1 for each pixel based on the extracted feature tensor. Outputting the probability value for each pixel from the prediction module 110b may be expressed as Equation 4 below.
- p(x_i, y_i) = P(F(x_i, y_i); θ)   (Equation 4)
- p(x_i, y_i) denotes a probability value indicating whether a pixel (x_i, y_i) is a landmark
- P denotes a neural network constituting the second decoder 110
- F(x_i, y_i) denotes a feature tensor of the pixel (x_i, y_i)
- θ denotes a parameter of the neural network P.
- a sigmoid, Gaussian, or the like may be used as a probability distribution function, but the probability distribution function is not limited thereto.
- The objective function L_prediction of the second decoder 110 may be expressed as Equation 5 below.
- p_target(x_i, y_i) denotes a labeled landmark indication value of a pixel (x_i, y_i) of the speech video. That is, this value is labeled as 1 when the corresponding pixel is a landmark and as 0 when the corresponding pixel is not a landmark.
- Through training with Equation 5, the probability value p(x_i, y_i) indicating whether a pixel (x_i, y_i) is a landmark increases when the labeled landmark indication value of the pixel (x_i, y_i) is 1, and decreases when the labeled landmark indication value of the pixel (x_i, y_i) is 0.
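Equation 5 itself is not reproduced here. One formulation consistent with the behavior just described (pushing p(x_i, y_i) toward 1 at labeled landmark pixels and toward 0 elsewhere) is a per-pixel binary cross-entropy; the sketch below is an assumption, not the patent's exact equation, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def landmark_image_loss(pred_logits: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Per-pixel landmark objective.

    pred_logits: raw scores of shape (B, 1, H, W); a sigmoid maps them to
                 probabilities p(x_i, y_i) in [0, 1].
    target_mask: float tensor of labeled landmark indication values
                 (1.0 at landmark pixels, 0.0 elsewhere), same shape as pred_logits.
    """
    # Binary cross-entropy increases p where the label is 1 and decreases it where the label is 0.
    return F.binary_cross_entropy_with_logits(pred_logits, target_mask)
```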
- module used herein may represent a functional or structural combination of hardware for implementing the technical concept of the present disclosure and software for driving the hardware.
- module may represent predetermined codes and logical units of hardware resources for executing the predetermined codes, but does not necessarily represent physically connected codes or one type of hardware.
- FIG. 4 is a diagram illustrating a state in which a speech video and a landmark are inferred through a speech video generation device according to an embodiment of the present disclosure.
- the first encoder 102 receives a person background image.
- the person background image may be one used in a learning process.
- the person background image may be an image including a face and upper body of a person.
- a portion related to a speech may be hidden by the mask M.
- the first encoder 102 may extract an image feature vector from the person background image.
- the second encoder 104 receives an input of a speech audio signal.
- the speech audio signal may not be related to the person background image input to the first encoder 102 .
- the speech audio signal may be a speech audio signal of a person different from the person in the person background image.
- the speech audio signal is not limited thereto, and may be one uttered by the person in the person background image.
- the speech of the person may be one uttered in a situation or background not related to the person background image.
- the second encoder 104 may extract a voice feature vector from the speech audio signal.
- the combining unit 106 may generate a combined vector by combining the image feature vector output from the first encoder 102 and the voice feature vector output from the second encoder 104 .
- the first decoder 108 may reconstruct and output a speech video using the combined vector as an input. That is, the first decoder 108 may generate the speech video by reconstructing a speech-related portion of the person background image based on the voice feature vector output from the second encoder 104 .
- Even when the speech audio signal input to the second encoder 104 is a speech not related to the person background image (e.g., even if the speech audio signal was not uttered by the person in the person background image), the speech video is generated as if the person in the person background image utters it.
- the second decoder 110 may predict and output a landmark of the speech video using the combined vector as an input.
- Since the speech video generation device 100 is trained to predict a landmark of a speech video while also reconstructing the speech video through the first decoder 108 and the second decoder 110 when the combined vector is input, the speech video generation device 100 may predict the landmark accurately and smoothly without a process of aligning the landmark of the speech video in a standard space.
- FIG. 5 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure. Hereinafter, differences with the embodiment illustrated in FIG. 1 will be mainly described.
- a speech video generation device 200 may further include a residual block 212 .
- At least one residual block 212 may be provided between the combining unit 206 and the first decoder 208 and second decoder 210.
- a plurality of the residual blocks 212 may be sequentially connected (in series) between the combining unit 206 and the first decoder 208 and second decoder 210 .
- the residual block 212 may include at least one convolutional layer.
- the residual block 212 may have a structure for performing convolution on an input value (i.e., combined vector output from the combining unit 206 ) and adding the input value to a result value of the convolution.
- the residual block 212 may be trained to minimize the difference between its input value and its output value. In this manner, the image feature vector and the voice feature vector may be organically combined and used as an input for the first decoder 208 and the second decoder 210.
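A sketch of such a residual block, assuming the combined representation is kept as a spatial feature map (a vector-shaped variant would use Conv1d or Linear layers instead); the channel count is illustrative.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Performs convolution on the input and adds the input back to the convolution result."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection: input value added to the convolution output
```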
- FIG. 6 is a diagram illustrating a configuration of a speech video generation device according to another embodiment of the present disclosure. Hereinafter, differences with the embodiment illustrated in FIG. 1 will be mainly described.
- a speech video generation device 300 may include a first encoder 302 , a second encoder 304 , a combining unit 306 , a decoder 314 , a first output layer 316 , and a second output layer 318 .
- the first encoder 302 , the second encoder 304 , and the combining unit 306 are the same as or similar to those illustrated in FIG. 1 , and are thus not described in detail below.
- the decoder 314 may use a combined vector output from the combining unit 306 as an input, and may perform up sampling after performing deconvolution on the combined vector.
- the first output layer 316, which is one output layer connected to the decoder 314, may output a reconstructed speech video based on data up-sampled by the decoder 314.
- the second output layer 318, which is another output layer connected to the decoder 314, may output a predicted landmark of the speech video based on the data up-sampled by the decoder 314.
- a process of deconvoluting and up-sampling the combined vector may be shared through the decoder 314 , and only output layers may be differently configured so as to respectively output a reconstructed speech video and a predicted landmark.
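A sketch of this single-decoder variant, assuming PyTorch: the deconvolution/up-sampling trunk is shared, and two separate output layers produce the reconstructed frame and a per-pixel landmark image, respectively. Sizes and the landmark-image output head are illustrative assumptions.

```python
import torch.nn as nn

class SharedDecoder(nn.Module):
    """One decoder trunk with two output layers, as in the FIG. 6 embodiment."""
    def __init__(self, combined_dim=512, img_channels=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(combined_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # deconvolution + up-sampling
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        # First output layer: reconstructed speech video frame.
        self.video_head = nn.Sequential(nn.Conv2d(32, img_channels, 3, padding=1), nn.Sigmoid())
        # Second output layer: per-pixel landmark probability image (logits).
        self.landmark_head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, combined_vector):
        shared = self.trunk(combined_vector)               # shared up-sampled data
        return self.video_head(shared), self.landmark_head(shared)
```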
- FIG. 7 is a block diagram illustrating a computing environment 10 that includes a computing device suitable for use in example embodiments.
- each component may have different functions and capabilities in addition to those described below, and additional components may be included in addition to those described below.
- the illustrated computing environment 10 includes a computing device 12 .
- the computing device 12 may be the speech video generation device 100 .
- the computing device 12 includes at least one processor 14 , a computer-readable storage medium 16 , and a communication bus 18 .
- the processor 14 may cause the computing device 12 to operate according to the above-described example embodiments.
- the processor 14 may execute one or more programs stored in the computer-readable storage medium 16 .
- the one or more programs may include one or more computer-executable instructions, which may be configured to cause, when executed by the processor 14 , the computing device 12 to perform operations according to the example embodiments.
- the computer-readable storage medium 16 is configured to store computer-executable instructions or program codes, program data, and/or other suitable forms of information.
- a program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14 .
- the computer-readable storage medium 16 may be a memory (a volatile memory such as a random access memory, a non-volatile memory, or any suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that are accessible by the computing device 12 and store desired information, or any suitable combination thereof.
- the communication bus 18 interconnects various other components of the computing device 12 , including the processor 14 and the computer-readable storage medium 16 .
- the computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24 , and one or more network communication interfaces 26 .
- the input/output interface 22 and the network communication interface 26 are connected to the communication bus 18 .
- the input/output device 24 may be connected to other components of the computing device 12 via the input/output interface 22 .
- the example input/output device 24 may include a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touch pad, a touch screen, or the like), a voice or sound input device, input devices such as various types of sensor devices and/or imaging devices, and/or output devices such as a display device, a printer, a speaker, and/or a network card.
- the example input/output device 24 may be included inside the computing device 12 as a component constituting the computing device 12 , or may be connected to the computing device 12 as a separate device distinct from the computing device 12 .
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Image Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2020-0109173 | 2020-08-28 | ||
KR1020200109173A KR102501773B1 (ko) | 2020-08-28 | 2020-08-28 | 랜드마크를 함께 생성하는 발화 동영상 생성 장치 및 방법 |
PCT/KR2020/018372 WO2022045485A1 (ko) | 2020-08-28 | 2020-12-15 | 랜드마크를 함께 생성하는 발화 동영상 생성 장치 및 방법 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220375224A1 (en) | 2022-11-24 |
Family
ID=80353398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/762,926 Pending US20220375224A1 (en) | 2020-08-28 | 2020-12-15 | Device and method for generating speech video along with landmark |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220375224A1 (ko) |
KR (2) | KR102501773B1 (ko) |
WO (1) | WO2022045485A1 (ko) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220374637A1 (en) * | 2021-05-20 | 2022-11-24 | Nvidia Corporation | Synthesizing video from audio using one or more neural networks |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115116451A (zh) * | 2022-06-15 | 2022-09-27 | 腾讯科技(深圳)有限公司 | 音频解码、编码方法、装置、电子设备及存储介质 |
KR102640603B1 (ko) * | 2022-06-27 | 2024-02-27 | 주식회사 에스알유니버스 | 립싱크 네트워크 학습방법, 입모양 이미지 생성방법 및 립싱크 네트워크 시스템 |
CN117495750A (zh) * | 2022-07-22 | 2024-02-02 | 戴尔产品有限公司 | 用于视频重构的方法、电子设备和计算机程序产品 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005031654A1 (en) * | 2003-09-30 | 2005-04-07 | Koninklijke Philips Electronics, N.V. | System and method for audio-visual content synthesis |
KR101378811B1 (ko) * | 2012-09-18 | 2014-03-28 | 김상철 | 단어 자동 번역에 기초한 입술 모양 변경 장치 및 방법 |
US10062198B2 (en) * | 2016-06-23 | 2018-08-28 | LoomAi, Inc. | Systems and methods for generating computer ready animation models of a human head from captured data images |
KR102091643B1 (ko) | 2018-04-23 | 2020-03-20 | (주)이스트소프트 | 인공신경망을 이용한 안경 착용 영상을 생성하기 위한 장치, 이를 위한 방법 및 이 방법을 수행하는 프로그램이 기록된 컴퓨터 판독 가능한 기록매체 |
KR20200043660A (ko) * | 2018-10-18 | 2020-04-28 | 주식회사 케이티 | 음성 합성 방법 및 음성 합성 장치 |
- 2020-08-28: KR KR1020200109173A patent/KR102501773B1/ko active IP Right Grant
- 2020-12-15: WO PCT/KR2020/018372 patent/WO2022045485A1/ko active Application Filing
- 2020-12-15: US US17/762,926 patent/US20220375224A1/en active Pending
- 2023-02-08: KR KR1020230016986A patent/KR20230025824A/ko not_active Application Discontinuation
Also Published As
Publication number | Publication date |
---|---|
WO2022045485A1 (ko) | 2022-03-03 |
KR20230025824A (ko) | 2023-02-23 |
KR20220028328A (ko) | 2022-03-08 |
KR102501773B1 (ko) | 2023-02-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220375224A1 (en) | Device and method for generating speech video along with landmark | |
US11775829B2 (en) | Generative adversarial neural network assisted video reconstruction | |
US11610435B2 (en) | Generative adversarial neural network assisted video compression and broadcast | |
US20220358703A1 (en) | Method and device for generating speech video on basis of machine learning | |
US20220399025A1 (en) | Method and device for generating speech video using audio signal | |
KR20200145700A (ko) | 머신 러닝 기반의 발화 동영상 생성 방법 및 장치 | |
US11972516B2 (en) | Method and device for generating speech video by using text | |
KR102437039B1 (ko) | 영상 생성을 위한 학습 장치 및 방법 | |
US20220375190A1 (en) | Device and method for generating speech video | |
US20230177663A1 (en) | Device and method for synthesizing image capable of improving image quality | |
Chen et al. | DualLip: A system for joint lip reading and generation | |
US20240055015A1 (en) | Learning method for generating lip sync image based on machine learning and lip sync image generation device for performing same | |
CN113542758B (zh) | 生成对抗神经网络辅助的视频压缩和广播 | |
CN113542759B (zh) | 生成对抗神经网络辅助的视频重建 | |
US20230177664A1 (en) | Device and method for synthesizing image capable of improving image quality | |
KR102612625B1 (ko) | 신경망 기반의 특징점 학습 장치 및 방법 | |
Koumparoulis et al. | Audio-assisted image inpainting for talking faces | |
US20220343651A1 (en) | Method and device for generating speech image | |
KR102649818B1 (ko) | 3d 립싱크 비디오 생성 장치 및 방법 | |
KR102584484B1 (ko) | 발화 합성 영상 생성 장치 및 방법 | |
US12045639B1 (en) | System providing visual assistants with artificial intelligence | |
US20240046141A1 (en) | Method for generating data using machine learning and computing device for executing the same | |
KR102540756B1 (ko) | 발화 합성 영상 생성 장치 및 방법 | |
US20230178072A1 (en) | Apparatus and method for generating lip sync image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: DEEPBRAIN AI INC., KOREA, REPUBLIC OF; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: CHAE, GYEONGSU; REEL/FRAME: 059376/0860; Effective date: 20220318 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |