EP4010899A1 - Audio-driven speech animation using recurrent neural network - Google Patents
Audio-driven speech animation using recurrent neural network
Info
- Publication number
- EP4010899A1 EP4010899A1 EP20760367.1A EP20760367A EP4010899A1 EP 4010899 A1 EP4010899 A1 EP 4010899A1 EP 20760367 A EP20760367 A EP 20760367A EP 4010899 A1 EP4010899 A1 EP 4010899A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- animation
- coarticulation
- phonemes
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 230000000306 recurrent effect Effects 0.000 title claims abstract description 13
- 238000000034 method Methods 0.000 claims abstract description 30
- 238000013459 approach Methods 0.000 claims abstract description 23
- 238000013528 artificial neural network Methods 0.000 claims abstract description 19
- 230000002123 temporal effect Effects 0.000 claims abstract description 11
- 230000002457 bidirectional effect Effects 0.000 claims abstract description 10
- 230000000007 visual effect Effects 0.000 claims description 14
- 230000000694 effects Effects 0.000 claims description 13
- 230000000454 anticipatory effect Effects 0.000 claims description 6
- 230000004886 head movement Effects 0.000 claims description 6
- 238000013135 deep learning Methods 0.000 claims description 5
- 238000003058 natural language processing Methods 0.000 claims description 4
- 238000007670 refining Methods 0.000 claims description 4
- 230000033001 locomotion Effects 0.000 abstract description 15
- 238000011156 evaluation Methods 0.000 abstract description 5
- 230000001360 synchronised effect Effects 0.000 abstract description 4
- 230000001419 dependent effect Effects 0.000 abstract description 2
- 230000001815 facial effect Effects 0.000 description 16
- 238000012549 training Methods 0.000 description 13
- 238000003786 synthesis reaction Methods 0.000 description 9
- 230000015572 biosynthetic process Effects 0.000 description 8
- 238000012360 testing method Methods 0.000 description 8
- 230000006870 function Effects 0.000 description 7
- 238000004519 manufacturing process Methods 0.000 description 7
- 230000001537 neural effect Effects 0.000 description 7
- 238000000354 decomposition reaction Methods 0.000 description 6
- 238000000513 principal component analysis Methods 0.000 description 6
- 239000013598 vector Substances 0.000 description 6
- 238000012805 post-processing Methods 0.000 description 5
- 238000005457 optimization Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 238000009826 distribution Methods 0.000 description 3
- 230000008451 emotion Effects 0.000 description 3
- 230000015654 memory Effects 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000010200 validation analysis Methods 0.000 description 3
- 206010048865 Hypoacusis Diseases 0.000 description 2
- 230000002996 emotional effect Effects 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 230000036961 partial effect Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000013526 transfer learning Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 241000282412 Homo Species 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000010924 continuous production Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 230000009012 visual motion Effects 0.000 description 1
- 230000016776 visual perception Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- Fan et al. [2016] make use of a bidirectional LSTM to learn the coarticulation effect and predict the lower part of the face through an Active Appearance Model (AAM). These predictions are then used to drive a synthesis by concatenation. This model can use either text features (triphone information), audio features, or both.
- AAM Active Appearance Model
- Suwajanakorn et al. [2017] propose a solution based on a recurrent neural network and a complex image-processing pipeline. An RNN is trained to predict sparse mouth shapes from audio input; this prediction is used to synthesize the mouth texture, which is finally integrated into an existing video. The video sequences used for synthesis are carefully picked to take head movement during speech into account, and several clever tricks are used to remove artifacts on both the teeth and the jaw.
- Taylor et al. [2017] designed a system using phonetic sequence as input, to easily make the network speaker-invariant.
- a deep feed-forward neural network is used to map a window of phoneme inputs to an active appearance model, a representation of the lower part of the mouth containing both landmark and texture parameterizations. Due to the use of feed-forward neural networks and a sliding temporal window, a post-filtering step is needed to smooth the output and avoid producing jittery animation.
- a retargeting system is used to map learned articulation from a specific speaker onto an arbitrary face model.
- each window is converted into a mesh through a neural network.
- First layers of the network are convolutional, designed to extract time-varying features from the raw formant information.
- further convolutional layers are used to reduce this temporal window into a single feature vector, which is used to predict the mesh state corresponding to the center of the window.
- This mesh prediction is made by two dense layers: one produces a basis representation of the mesh, while the other is initialized using a principal component analysis of the mesh dataset and finally produces the whole mesh.
- This initialization trick is extended in this study to the use of any linear latent representation, with a demonstration on both PCA and blendshapes. Note that despite the loss function used to train the network, which contains a term that deals with motion dynamics and promotes a smooth output, a post-filtering step is still needed. We believe this to be inherent in the use of temporal sliding windows and feed-forward neural networks.
- Zhou et al. [2018b] have proposed a solution based on JALI parameters, an overlay of FACS rigs.
- the solution consists of a two-step neural network with a transfer-learning procedure. First, an LSTM network extracts landmark and phoneme information from audio features. This network is trained in a multi-task setting, using audiovisual datasets containing several speakers to ensure invariance. Then, another LSTM is stacked on top of the pretrained network, using the features previously learnt by the first network to generate the JALI parameters from audio. A parallel dataset containing audio and animation parameters is still needed to jointly train the whole network, which is an expensive and time-consuming operation, even for a minimal corpus.
- Pham et al. [2018a] also use a convolutional network to learn useful features from the raw speech signal, combined with a recurrent neural network to obtain smooth output trajectories.
- the network is composed of convolutions on the frequency axis, followed by convolutions on the temporal axis, and finally ends with a recurrent neural network. It outputs a set of blendshape weights directly used for the animation.
- Their solution also takes into account the emotional state hidden inside the speech signal, and should produce expressive facial animation.
- speaker invariance is only achieved by using a huge amount of data; using speech samples too far from these datasets may produce bad results in specific cases (e.g. children's voices).
- Model-based approaches can generate avatar animation from only a set of parameters [Li et al. 2016; Pham et al. 2017; Wang et al. 2011]. They are less data-demanding than image-based approaches and enjoy the flexibility of a deformable model. Also, with a 3D animation approach, it is possible to perform facial animation retargeting by transferring the recorded performance-capture data between different virtual characters. To do that, the source and target avatars should have corresponding blendshapes. The blendshapes in many existing rigs are inspired by the Facial Action Coding System (FACS) [Ekman and Friesen 1978; Sagar 2006] and are completed by more specific articulation visemes (visual representations of a phoneme) [Benoit et al.].
- FACS Facial Action Coding System
- Motion capture, sometimes called a performance-driven technique, is used to collect realistic movement while avoiding the uncanny valley effect [Seyama and Nagayama 2007].
- This effect is caused by the fact that human visual perception is highly centered on facial motion, and even the smallest incoherences can cause a feeling of disgust and rejection.
- high-quality audio-visual databases are obtained by recording an actor's performance, which is then captured and used in 3D model animations [Zell et al. 2017].
- the present invention fully considers the specificity of speech as addressed in speech-related fields (articulatory speech production, speech synthesis, phonetics and phonology, linguistics) during the different steps of the speech animation process.
- the present invention comprises:
- a method to build a lip synchronization engine comprising the steps of:
- RNN bidirectional gated recurrent neural network
- the sentences comprise several sentences with the highest phonetic variability coverage in several contexts, which implicitly also cover several coarticulation examples.
- linguistic criteria comprise the position of the phoneme and its context.
- the invention comprises the step of using a motion-capture system composed of several cameras to acquire the audiovisual corpus.
- the invention comprises the step of, during the acquisition, using sixty-three reflective markers glued on the face of a speaker.
- the invention comprises the step of computing absolute 3D spatial positions of the reflective markers for each frame acquired by the cameras, after removing the head movement of the speaker.
- the invention comprises the step of generating a sequence of 3D poses that reproduce the original performance by creating different morph targets.
- the invention comprises the step of refining the visual morph target set so that its representation is well adapted to speech articulation.
- the invention comprises the step of determining an optimized number of visemes between 10 and 20 without loss of animation quality.
- the invention comprises the step of using a deep learning approach based on a bidirectional gated recurrent neural network applied to the sequence of phonemes, taking into account two possible types of coarticulation, anticipatory and carry-over, with no particular assumption on the duration of the coarticulation effect.
- the invention includes the following steps, which are explained based on the French language but can be carried out with other languages:
- This speech animation technique is independent of the speaker and generates animation that can be retargeted to any animation rig. It can be integrated into existing pipelines.
- the aim of this analysis was to create a corpus with the highest phonetic coverage while keeping a reasonable number of sentences.
- Our approach was to collect French open-source textual corpora to create a first large corpus. This large corpus guarantees an initial maximum of language coverage, and is later processed to reduce its size.
- the first corpus we obtained contained about 7000 non-redundant sentences and is the result of merging freely available and in-house textual French corpora.
- the linguistic analysis consists of breaking all the sentences into a sequence of phonemes using an NLP (Natural Language Processing) module. After that, as we are dealing with a maximum-coverage problem, we used a greedy algorithm (a sketch of such a selection is given below).
- NLP Natural Language Processing
- This algorithm takes the phoneme sequence and a list of linguistic criteria, mainly the position of the phoneme and its context, as input.
- the list of extracted sentences provides the same coverage as the initial full list of sentences.
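As an illustration only, a minimal Python sketch of such a greedy maximum-coverage selection is given here; the `coverage_units` criterion (each phoneme with its immediate context) is a simplified stand-in for the linguistic criteria described above, and all names are hypothetical.

```python
def coverage_units(phonemes):
    """Units a sentence covers: each phoneme together with its immediate
    left/right context (a simplified stand-in for the linguistic criteria)."""
    padded = ["#"] + list(phonemes) + ["#"]
    return {(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)}


def greedy_select(corpus):
    """corpus: list of (sentence, phoneme_sequence) pairs.
    Greedily keep the sentence that adds the most uncovered units,
    until no sentence adds anything new."""
    units = {sentence: coverage_units(phones) for sentence, phones in corpus}
    covered, selected = set(), []
    while units:
        sentence, sent_units = max(units.items(),
                                   key=lambda kv: len(kv[1] - covered))
        gain = sent_units - covered
        if not gain:
            break  # remaining sentences bring no new coverage
        selected.append(sentence)
        covered |= gain
        del units[sentence]
    return selected
```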
- the present invention also concerns a computer program product comprising a computer-usable or -readable medium having a computer-readable program.
- the computer-readable program, when executed on a computing device, causes the computing device to perform various ones of, and combinations of, the operations outlined here with regard to the illustrative embodiments of the method.
- a system/apparatus may comprise one or more processors and a memory coupled to the one or more processors.
- the memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones of, and combinations of, the operations outlined here with regard to the illustrative embodiments of the method.
- Figure 1 is a diagram illustrating the number of occurrences of the 35 French phonemes in the sentences corpus according to the invention.
- Figure 2 illustrates the layout of the retro-reflective sensors.
- a set of 63 sensors, 3mm and 4mm in diameter, are glued on the speaker's face.
- Six other sensors of 9mm are attached to a hat to track head movements.
- Figure 3 is a view of curves of the trajectory of a sensor placed on the lower lip along the y-axis, for the original data and the reconstructed data.
- Figure 4 illustrates a bidirectional RNN.
- Figure 5 illustrates neural architectures.
- Figure 6 illustrates average performances for global and lips RMSE in mm.
- Figure 7 illustrates the critical segment analysis.
- Figure 8 is a view of curves illustrating keyshapes differences before and after fine-tuning.
- Figure 9 is a view of an architecture of the lip-synchronization application according to the invention.
- Figure 10 is a view of an implementation of the lip-synchronization application according to the invention.
- An Optitrack™ motion-capture system has been used to acquire the audiovisual corpus.
- This system is composed of eight cameras (Flex 13) with a frame rate of 120 images per second.
- the cameras are adapted to the face/head region as they are conceived for medium volume motion capture tasks.
- Sixty-three reflective markers of 3mm and 4mm diameters have been glued on the face of a French native speaker. The layout of the markers is presented in Fig. 2.
- To track the head movement we used 9mm sensors glued on a hat.
- the post-processing task consists of computing the absolute 3D spatial positions of the reflective markers for each frame, after removing the head movement.
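The patent does not detail how the head movement is removed; one common approach, shown here purely as an assumed sketch, is a rigid Kabsch alignment of the six hat markers onto a reference head pose, with the resulting transform applied to the face markers.

```python
import numpy as np


def remove_head_motion(face_pts, head_pts, head_ref):
    """Estimate the rigid transform (Kabsch) mapping the hat markers of the
    current frame onto a reference head pose, then apply it to the face
    markers. face_pts: (n_face, 3); head_pts, head_ref: (n_head, 3)."""
    mu_cur, mu_ref = head_pts.mean(axis=0), head_ref.mean(axis=0)
    H = (head_pts - mu_cur).T @ (head_ref - mu_ref)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (face_pts - mu_cur) @ R.T + mu_ref
```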
- Kaldi, a toolkit for speech recognition
- ESTER is a multi-speaker database of radio programs which has been phonetically annotated by an automatic system. This model was used to generate accurate phonetic alignments.
- the goal of the animation is to generate a sequence of 3D poses that reproduce the original performance.
- the open-source rigged 3D model of Mathilda character was used to create different morph targets for the animation module.
- the morph-targets correspond to real frames from our 3D corpus.
- the facial tracking data is decomposed into a weighted combination of the key-frames set: F_w = Σ_{i=1}^{k} W_i F_i (1)
- k is the size of the chosen visemes set.
- F_w is the computed frame at moment t, obtained by applying the weights resulting from solving equation (2) on an input frame F_t.
- F_i and W_i are a key-frame i from the key-frames set and its corresponding assigned weight. This decomposition is made using a non-negative least squares fit, resolving the problem: argmin_W ‖F_w − F_t‖², W ≥ 0 (2)
- F_t is the frame at moment t that we want to decompose into a vector of weights.
- F_w is the result of the reconstruction of F_t using the computed weights.
- the inverse task is the reconstruction of the 3D trajectories from the key-shape weights.
- the morphing algorithm used in our work is a linear composition (equation 1) where the weights W are already known. This reconstruction task was crucial to check the quality of the decomposition (see the sketch below).
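A minimal sketch of this decomposition and reconstruction using SciPy's non-negative least squares solver; the array shapes and function names are assumptions for illustration, not the actual pipeline.

```python
import numpy as np
from scipy.optimize import nnls


def decompose_frame(frame, keyframes):
    """Solve argmin_W ||F_W - F_t||^2 with W >= 0 (equation 2).
    frame: (n_markers, 3); keyframes: (k, n_markers, 3); returns (k,) weights."""
    A = keyframes.reshape(len(keyframes), -1).T     # (3 * n_markers, k)
    b = frame.reshape(-1)
    weights, _residual = nnls(A, b)
    return weights


def reconstruct_frame(weights, keyframes):
    """Linear composition F_W = sum_i W_i F_i (equation 1)."""
    return np.tensordot(weights, keyframes, axes=1)  # (n_markers, 3)
```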
- Table 1. RMSE error in millimeters and Pearson correlation between the original data and the reconstructed data. These measures were computed for the three axes (X, Y, Z); we then calculated the average value over the three axes.
- Table 2. List of visemes and their corresponding phonemes, along with their 3D representative blendshapes. The phonetic symbols are taken from the International Phonetic Alphabet (IPA) [Decker et al. 1999].
- IPA International Phonetic Alphabet
- GRU reduces the complexity of LSTM by removing one gate and the cell memory, thereby decreasing the number of parameters, which should simplify training.
- LSTM and its variations are well-known for their great performances in language and speech-related tasks for example, phoneme classification [Graves et al. 2006], machine translation [Bahdanau et al. 2015], and language modeling [Mulder et al. 2015].
- Neural Architectures. We have compared three different designs, presented in Fig. 5, mainly differentiated by the output layer on top of the networks. All three models start with two layers of gated recurrent units.
- in the first design, the output layer is a simple linear layer outputting the spatial trajectories of each sensor. This output then needs further processing to animate a facial model; for example, in this work we compute the blendshape weights with a non-negative least squares fit.
- in the second design, the neural network learns to directly predict a latent representation of the visual motion trajectories, previously computed on the training set of the corpus.
- blendshape weights must be non-negative
- the output layer is composed of a linear layer with a ReLU activation to ensure positivity, while an identity function is used for the PCA eigenvalues. Note that in the case of blendshape weights, this architecture could be used with a hand-crafted database, in cases where an animation workforce is easier to obtain than a motion-capture system.
- in the third design, the output layer is composed of two consecutive fully connected layers.
- The first linear layer generates a latent representation, which can be bounded using a differentiable function (e.g. ReLU for non-negativity), and the second layer reconstructs the 3D cloud from the latent representation.
- ReLU for non-negativity
- the last layer is carefully initialized, with either the cloud values of each keyshape or the eigenvectors of the 3D cloud. Moreover, we could easily add a penalty at the latent-representation level when desired (a sketch of this architecture is given below).
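A hypothetical PyTorch sketch of this third design: two bidirectional GRU layers, a bounded latent layer, and a decoder whose weights are initialized from a linear basis (PCA eigenvectors or keyshape clouds). Layer sizes (e.g. 35 phonemes, 63 markers × 3 = 189 outputs) and names are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn


class PhonemeToFace(nn.Module):
    def __init__(self, n_phonemes=35, hidden=128, latent=16, n_outputs=189,
                 basis=None):
        super().__init__()
        # Two bidirectional GRU layers over the phoneme sequence.
        self.rnn = nn.GRU(n_phonemes, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * hidden, latent)
        self.decoder = nn.Linear(latent, n_outputs)
        if basis is not None:
            # basis: (latent, n_outputs) linear basis, e.g. PCA eigenvectors
            # or flattened keyshape clouds (an assumption for illustration).
            with torch.no_grad():
                self.decoder.weight.copy_(
                    torch.as_tensor(basis, dtype=torch.float32).T)

    def forward(self, phonemes_onehot):          # (batch, time, n_phonemes)
        h, _ = self.rnn(phonemes_onehot)         # (batch, time, 2 * hidden)
        z = torch.relu(self.to_latent(h))        # ReLU bounds the latent code
                                                 # (identity would fit the PCA case)
        return self.decoder(z)                   # (batch, time, n_outputs)
```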
- the target output A is a sequence of n-dimensional vectors representing either the stacked spatial coordinates of each articulator (models 5.3.1 and 5.3.3) or the latent representation (model 5.3.2), while the input F is the encoded phoneme sequence: each f_t is a one-hot vector representing the articulated phoneme at time step t. This encoding preserves the duration of each phoneme without having to explicitly feed this information to the network, and can be seen as a multidimensional binary signal synchronized with the articulator trajectories.
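As an illustration, the frame-synchronous one-hot encoding could look like the following sketch; the alignment tuple format and the 120 fps rate are assumptions taken from the corpus description above.

```python
import numpy as np


def encode_phonemes(alignment, phoneme_to_idx, fps=120):
    """Turn a phonetic alignment [(phoneme, start_s, end_s), ...] into a
    frame-synchronous one-hot matrix of shape (n_frames, n_phonemes), so the
    duration of each phoneme is carried implicitly by repetition."""
    n_frames = int(round(alignment[-1][2] * fps))
    onehot = np.zeros((n_frames, len(phoneme_to_idx)), dtype=np.float32)
    for phoneme, start, end in alignment:
        onehot[int(start * fps):int(end * fps), phoneme_to_idx[phoneme]] = 1.0
    return onehot
```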
- we used the mean squared error as the loss function, defining the error as the Euclidean distance between the target a_t and the prediction â_t: L = (1/T) Σ_{t=1}^{T} ‖a_t − â_t‖² = (1/T) Σ_{t=1}^{T} Σ_{j=1}^{n} (a_t^j − â_t^j)², with T the sequence size and a_t^j the j-th dimension of a_t.
- the optimization method used to train the network was Adam, an adaptive-learning-rate extension of stochastic gradient descent with many benefits (e.g. appropriate for non-stationary objectives and sparse gradients, parameter updates invariant to gradient rescaling, intuitive hyper-parameters).
- Kingma and Ba [2015] claim that it combines both the advantages of RMSprop [Tieleman and Hinton [n. d.]] and AdaGrad [Duchi et al. 2011], two other well-known gradient-based optimization algorithms.
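A minimal training-loop sketch under these choices (MSE loss, Adam); batching, padding, and validation details are omitted and the names are hypothetical.

```python
import torch


def train(model, loader, epochs=50, lr=1e-3):
    """Adam on the mean squared error between predicted and ground-truth
    articulator trajectories (or latent representations)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for phonemes, targets in loader:   # (B, T, n_phonemes), (B, T, n_out)
            optimizer.zero_grad()
            loss = loss_fn(model(phonemes), targets)
            loss.backward()
            optimizer.step()
    return model
```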
- Test 1: computing the minimum of the mouth opening for each phoneme.
- Test 1 allows detecting if the model has captured the complete closure of the lips during bilabial sounds, and test 2, if the model has correctly learned the protrusion of the concerned vowels (mainly /u/).
- the mouth opening is defined as the Euclidean distance between the central sensor of the upper lip and the central sensor of the lower lip.
- Protrusion is defined as the position on the y-axis of the upper lip's central sensor. Both measures are sketched below.
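Both measures, and the per-segment minimum used in test 1, could be computed as in this assumed sketch; the marker indices, axis convention, and 120 fps rate are placeholders.

```python
import numpy as np


def mouth_opening(frames, upper_idx, lower_idx):
    """Euclidean distance between the central upper-lip and lower-lip markers.
    frames: (T, n_markers, 3); returns (T,)."""
    return np.linalg.norm(frames[:, upper_idx] - frames[:, lower_idx], axis=-1)


def protrusion(frames, upper_idx, y_axis=1):
    """Position of the upper lip's central marker along the y-axis."""
    return frames[:, upper_idx, y_axis]


def min_opening_per_segment(opening, alignment, fps=120):
    """Minimum mouth opening over each phoneme segment (test 1).
    alignment: [(phoneme, start_s, end_s), ...]."""
    return [(ph, opening[int(s * fps):int(e * fps)].min())
            for ph, s, e in alignment if int(e * fps) > int(s * fps)]
```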
- Figure 7 summarizes the distribution of minimal mouth opening for different models: learning spatial trajectories, learning a latent representation (both blendshape weights and principal components), and learning to reconstruct the spatial trajectories while fine-tuning the latent decoder.
- A lower median means the mouth closes well during production, so the figure clearly exhibits some issues with lip closing during bilabial production when learning from the raw spatial trajectories, as evidenced by a median of minimal mouth opening higher than the ground-truth median.
- Protrusion seems to be correctly learned by the different models (lower plot of Fig. 7), but all models present less variability than the ground-truth values, in particular the spatial-trajectories model. Fortunately, this lack of variability does not affect the quality of the final synthesis: the mouth should be closed for bilabial production and protrusion should be correctly perceived. Thus, it is more relevant to ensure a median closer to the ground-truth median than a perfect match of the distribution. For example, the generated protrusion seems to be more noticeable than the ground-truth protrusion, but a noticeable protrusion is more important than a barely visible one. Learning from a latent representation, or learning to reconstruct the spatial trajectories from a latent representation, greatly improves the results. The best performances are reached using principal components as the latent representation, which is not really surprising as the principal components are computed to cover more than 95% of the data variance, while the blendshape decomposition is hand-crafted to be meaningful for animators and linguistically inspired.
- the system takes a text and the corresponding audio signal as inputs and generates a 3D animation synchronized with that audio.
- the voice of the user is recorded while uttering a sentence in French.
- the alignment module extracts the phonetic and temporal information and passes them to the prediction module.
- the prediction result is a sequence of blendshape weight vectors or 3D frames that we transform into a vector of blendshape weights.
- speech animation is played synchronously with the recorded audio using a visualization player developed with the Unity game engine.
- Any other interactive 3D rendering system can be used. For instance, it is possible to render the speech animation with rendering software such as Maya 3D (a rough sketch of the overall glue logic is given below).
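Purely as an assumed illustration of how these modules chain together; the aligner, model, and helper functions refer to the sketches above, not to the actual application code.

```python
import torch


def animate(text, audio_path, aligner, model, phoneme_to_idx, keyframes):
    """Forced alignment -> frame-synchronous phoneme encoding -> trajectory
    prediction -> per-frame blendshape weights for the 3D player."""
    alignment = aligner(text, audio_path)                 # [(phoneme, start, end), ...]
    onehot = encode_phonemes(alignment, phoneme_to_idx)   # (T, n_phonemes)
    with torch.no_grad():
        traj = model(torch.from_numpy(onehot)[None])[0].numpy()  # (T, n_out)
    frames = traj.reshape(len(traj), -1, 3)               # (T, n_markers, 3)
    return [decompose_frame(f, keyframes) for f in frames]
```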
Abstract
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962884427P | 2019-08-08 | 2019-08-08 | |
PCT/EP2020/072272 WO2021023869A1 (en) | 2019-08-08 | 2020-08-07 | Audio-driven speech animation using recurrent neural network
Publications (1)
Publication Number | Publication Date |
---|---|
EP4010899A1 true EP4010899A1 (en) | 2022-06-15 |
Family
ID=72234805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20760367.1A Withdrawn EP4010899A1 (en) | 2019-08-08 | 2020-08-07 | Audio-driven speech animation using recurrent neural network
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4010899A1 (en) |
WO (1) | WO2021023869A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221840B (en) * | 2021-06-02 | 2022-07-26 | 广东工业大学 | Portrait video processing method |
CN113539240A (en) * | 2021-07-19 | 2021-10-22 | 北京沃东天骏信息技术有限公司 | Animation generation method and device, electronic equipment and storage medium |
CN113628635B (en) * | 2021-07-19 | 2023-09-15 | 武汉理工大学 | Voice-driven speaker face video generation method based on teacher student network |
CN114093025A (en) * | 2021-10-29 | 2022-02-25 | 济南大学 | Man-machine cooperation method and system for multi-mode intention reverse active fusion |
US11923899B2 (en) | 2021-12-01 | 2024-03-05 | Hewlett Packard Enterprise Development Lp | Proactive wavelength synchronization |
CN114202605B (en) * | 2021-12-07 | 2022-11-08 | 北京百度网讯科技有限公司 | 3D video generation method, model training method, device, equipment and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1511008A1 (en) * | 2003-08-28 | 2005-03-02 | Universität Stuttgart | Speech synthesis system |
US20060009978A1 (en) * | 2004-07-02 | 2006-01-12 | The Regents Of The University Of Colorado | Methods and systems for synthesis of accurate visible speech via transformation of motion capture data |
-
2020
- 2020-08-07 EP EP20760367.1A patent/EP4010899A1/en not_active Withdrawn
- 2020-08-07 WO PCT/EP2020/072272 patent/WO2021023869A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2021023869A1 (en) | 2021-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lu et al. | Live speech portraits: real-time photorealistic talking-head animation | |
Richard et al. | Meshtalk: 3d face animation from speech using cross-modality disentanglement | |
Suwajanakorn et al. | Synthesizing obama: learning lip sync from audio | |
Karras et al. | Audio-driven facial animation by joint end-to-end learning of pose and emotion | |
US11682153B2 (en) | System and method for synthesizing photo-realistic video of a speech | |
WO2021023869A1 (en) | Audio-driven speech animation using recurrent neural network | |
Xie et al. | Realistic mouth-synching for speech-driven talking face using articulatory modelling | |
Sargin et al. | Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation | |
KR20060090687A (en) | System and method for audio-visual content synthesis | |
US20210390945A1 (en) | Text-driven video synthesis with phonetic dictionary | |
Zhang et al. | Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary | |
Zhou et al. | An image-based visual speech animation system | |
Gururani et al. | Space: Speech-driven portrait animation with controllable expression | |
Chai et al. | Speech-driven facial animation with spectral gathering and temporal attention | |
Lavagetto | Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization | |
Deena et al. | Visual speech synthesis using a variable-order switching shared Gaussian process dynamical model | |
Liz-Lopez et al. | Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges | |
Hussen Abdelaziz et al. | Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models | |
Asadiabadi et al. | Multimodal speech driven facial shape animation using deep neural networks | |
Liu et al. | Real-time speech-driven animation of expressive talking faces | |
Hussen Abdelaziz et al. | Audiovisual speech synthesis using tacotron2 | |
Liu et al. | Optimization of an image-based talking head system | |
Mahavidyalaya | Phoneme and viseme based approach for lip synchronization | |
Zhang et al. | Realistic Speech-Driven Talking Video Generation with Personalized Pose | |
Deena | Visual speech synthesis by learning joint probabilistic models of audio and video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20220215 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20230523 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20231003 |