EP4010899A1 - Audio-driven speech animation using recurrent neural network - Google Patents
Audio-driven speech animation using recurrent neural network
Info
- Publication number
- EP4010899A1 EP4010899A1 EP20760367.1A EP20760367A EP4010899A1 EP 4010899 A1 EP4010899 A1 EP 4010899A1 EP 20760367 A EP20760367 A EP 20760367A EP 4010899 A1 EP4010899 A1 EP 4010899A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- animation
- coarticulation
- phonemes
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 230000000306 recurrent effect Effects 0.000 title claims abstract description 13
- 238000000034 method Methods 0.000 claims abstract description 30
- 238000013459 approach Methods 0.000 claims abstract description 23
- 238000013528 artificial neural network Methods 0.000 claims abstract description 19
- 230000002123 temporal effect Effects 0.000 claims abstract description 11
- 230000002457 bidirectional effect Effects 0.000 claims abstract description 10
- 230000000007 visual effect Effects 0.000 claims description 14
- 230000000694 effects Effects 0.000 claims description 13
- 230000000454 anticipatory effect Effects 0.000 claims description 6
- 230000004886 head movement Effects 0.000 claims description 6
- 238000013135 deep learning Methods 0.000 claims description 5
- 238000003058 natural language processing Methods 0.000 claims description 4
- 238000007670 refining Methods 0.000 claims description 4
- 230000033001 locomotion Effects 0.000 abstract description 15
- 238000011156 evaluation Methods 0.000 abstract description 5
- 230000001360 synchronised effect Effects 0.000 abstract description 4
- 230000001419 dependent effect Effects 0.000 abstract description 2
- 230000001815 facial effect Effects 0.000 description 16
- 238000012549 training Methods 0.000 description 13
- 238000003786 synthesis reaction Methods 0.000 description 9
- 230000015572 biosynthetic process Effects 0.000 description 8
- 238000012360 testing method Methods 0.000 description 8
- 230000006870 function Effects 0.000 description 7
- 238000004519 manufacturing process Methods 0.000 description 7
- 230000001537 neural effect Effects 0.000 description 7
- 238000000354 decomposition reaction Methods 0.000 description 6
- 238000000513 principal component analysis Methods 0.000 description 6
- 239000013598 vector Substances 0.000 description 6
- 238000012805 post-processing Methods 0.000 description 5
- 238000005457 optimization Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 238000009826 distribution Methods 0.000 description 3
- 230000008451 emotion Effects 0.000 description 3
- 230000015654 memory Effects 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000010200 validation analysis Methods 0.000 description 3
- 206010048865 Hypoacusis Diseases 0.000 description 2
- 230000002996 emotional effect Effects 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 230000036961 partial effect Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000013526 transfer learning Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 241000282412 Homo Species 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 230000006978 adaptation Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000010924 continuous production Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 230000009012 visual motion Effects 0.000 description 1
- 230000016776 visual perception Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- Fan et al. [2016] make use of a bidirectional LSTM to learn the coarticulation effect and predict the lower part of the face through an Active Appearance Model (AAM). These predictions are then used to drive a synthesis by concatenation. This model can use either text features (triphone information), audio features, or both.
- AAM Active Appearance Model
- Suwajanakorn et al. [2017] propose a solution based on a recurrent neural network and a complex image-processing pipeline. An RNN is trained to predict sparse mouth shapes from audio input; this prediction is used to synthesize the mouth texture, which is finally integrated into an existing video. The video sequences used for synthesis are carefully picked to take head movement during speech into account, and several clever tricks are used to remove artifacts on both the teeth and the jaw.
- Taylor et al. [2017] designed a system using phonetic sequence as input, to easily make the network speaker-invariant.
- a deep feed-forward neural network is used to map a window of phoneme inputs to an active appearance model, a representation of the lower part of the mouth containing both landmark and texture parameterizations. Due to the use of feed-forward neural networks and a sliding temporal window, a post-filtering step is needed to smooth the output and avoid producing jittery animation.
- a retargeting system is used to map learned articulation from a specific speaker onto an arbitrary face model.
- each window is converted into a mesh through a neural network.
- First layers of the network are convolutional, designed to extract time-varying features from the raw formant information.
- further convolutional layers are used to reduce this temporal window into a single feature vector, which is used to predict the mesh state corresponding to the center of the window.
- This mesh prediction is made by two dense layers: one produces a basis representation of the mesh, while the other is initialized using a principal component analysis of the mesh dataset and finally produces the whole mesh.
- This initialization trick is extended in this study to the use of any linear latent representation, with a demonstration on both PCA and blendshapes. Note that despite the loss function used to train the network, which contains a term that deals with motion dynamics and promotes a smooth output, a post-filtering step is still needed. We believe this to be inherent in the use of temporal sliding windows and feed-forward neural networks.
- Zhou et al. [2018b] have proposed a solution based on JALI parameters, an overlay of FACS rigs.
- the solution consists of a two-step neural network with a transfer-learning procedure. First, an LSTM network extracts landmark and phoneme information from audio features. This network is trained in a multi-task setting, using audiovisual datasets containing several speakers to ensure invariance. Then, another LSTM is stacked on top of the pretrained network, using the features previously learnt by the first network to generate the JALI parameters from audio. A parallel dataset containing audio and animation parameters is still needed to jointly train the whole network, which is an expensive and time-consuming operation, even for a minimal corpus.
- Pham et al. [2018a] also use a convolutional network to learn useful features from the raw speech signal, combined with a recurrent neural network to obtain smooth output trajectories.
- the network is composed of convolutions on the frequency axis, followed by convolutions on the temporal axis, and finally ends with a recurrent neural network. It outputs a set of blendshape weights directly used for the animation.
- Their solution also takes into account the emotional state hidden inside the speech signal, and should produce expressive facial animation.
- speaker invariance is only achieved by using a huge amount of data; using speech samples too far from these datasets may produce bad results in specific cases (e.g. children's voices).
- Model-based approaches can generate avatar animation from only a set of parameters [Li et al. 2016; Pham et al. 2017; Wang et al. 2011]. They are less data-demanding than image-based approaches and enjoy the flexibility of a deformable model. Also, with a 3D animation approach, it is possible to perform facial animation retargeting by transferring the recorded performance-capture data between different virtual characters. To do that, the source and target avatars should have corresponding blendshapes. The blendshapes in many existing rigs are inspired by the Facial Action Coding System (FACS) [Ekman and Friesen 1978; Sagar 2006] and are completed by more specific articulation visemes (visual representations of a phoneme) [Benoit et al.].
- FACS Facial Action Coding System
- Motion capture, sometimes called a performance-driven technique, is used to collect realistic movement while avoiding the uncanny valley effect [Seyama and Nagayama 2007].
- This effect is caused by the fact that human visual perception is highly centered on facial motion, and even the smallest incoherences can cause a feeling of disgust and rejection.
- high-quality audio-visual databases are obtained by recording an actor's performance, which is then captured and used in 3D model animations [Zell et al. 2017].
- the present invention fully considers the specificity of speech as addressed in speech-related fields (articulatory speech production, speech synthesis, phonetics and phonology, linguistics) during the different steps of the speech animation process.
- the present invention comprises:
- a method to build a lip synchronization engine comprising the steps of:
- RNN bidirectional gated recurrent neural network
- the sentences comprise several sentences with the highest phonetic variability coverage in several contexts, which implicitly also cover several coarticulation examples.
- linguistic criteria comprise the position of the phoneme and its context.
- the invention comprises the step of using a motion-capture system composed of several cameras to acquire the audiovisual corpus.
- the invention comprises the step of, during the acquisition, using sixty-three reflective markers glued on the face of a speaker.
- the invention comprises the step of computing absolute 3D spatial positions of the reflective markers for each frame acquired by the cameras, after removing the head movement of the speaker.
- the invention comprises the step of generating a sequence of 3D poses that reproduce the original performance by creating different morph targets.
- the invention comprises the step of refining the visual morph target set so that its representation is well adapted to speech articulation.
- the invention comprises the step of determining an optimized number of visemes between 10 and 20 without loss of animation quality.
- the invention comprises the step of using a deep learning approach based on a bidirectional gated recurrent neural network applied to the sequence of phonemes, taking into account two possible types of coarticulation, anticipatory and carry-over, with no particular assumption on the duration of the coarticulation effect.
- the invention includes the following steps, which are explained based on the French language but can be carried out with other languages:
- This speech animation technique is independent of the speaker and generates animation that can be retargeted to any animation rig. It can be integrated into existing pipelines.
- the aim of this analysis was to create a corpus with the highest phonetic coverage while keeping a reasonable number of sentences.
- Our approach was to collect French open-source textual corpora to create a first large corpus. This large corpus guarantees an initial maximum of language coverage, and is later processed to reduce its size.
- the first corpus we obtained contained about 7000 non-redundant sentences and is the result of merging freely available and in-house textual French corpora.
- the linguistic analysis consists of breaking all the sentences into a sequence of phonemes using an NLP (Natural Language Processing) module. After that, as we are dealing with a maximum-coverage problem, we used a greedy algorithm (a sketch of such a selection is given below).
- NLP Natural Language Processing
- This algorithm takes the phoneme sequence and a list of linguistic criteria, mainly the position of the phoneme and its context, as input.
- the list of extracted sentences provides the same coverage as the initial full list of sentences.
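As an illustration only, a minimal Python sketch of such a greedy maximum-coverage selection is given here; the `coverage_units` criterion (each phoneme with its immediate context) is a simplified stand-in for the linguistic criteria described above, and all names are hypothetical.

```python
def coverage_units(phonemes):
    """Units a sentence covers: each phoneme together with its immediate
    left/right context (a simplified stand-in for the linguistic criteria)."""
    padded = ["#"] + list(phonemes) + ["#"]
    return {(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)}


def greedy_select(corpus):
    """corpus: list of (sentence, phoneme_sequence) pairs.
    Greedily keep the sentence that adds the most uncovered units,
    until no sentence adds anything new."""
    units = {sentence: coverage_units(phones) for sentence, phones in corpus}
    covered, selected = set(), []
    while units:
        sentence, sent_units = max(units.items(),
                                   key=lambda kv: len(kv[1] - covered))
        gain = sent_units - covered
        if not gain:
            break  # remaining sentences bring no new coverage
        selected.append(sentence)
        covered |= gain
        del units[sentence]
    return selected
```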
- the present invention also concerns a computer program product comprising a computer-usable or -readable medium having a computer-readable program.
- the computer-readable program, when executed on a computing device, causes the computing device to perform various ones of, and combinations of, the operations outlined here with regard to the illustrative embodiments of the method.
- a system/apparatus may comprise one or more processors and a memory coupled to the one or more processors.
- the memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones of, and combinations of, the operations outlined here with regard to the illustrative embodiments of the method.
- Figure 1 is a diagram illustrating the number of occurrences of the 35 French phonemes in the sentences corpus according to the invention.
- Figure 2 illustrates the layout of the retro-reflective sensors.
- a set of 63 sensors, 3mm and 4mm in diameter, are glued on the speaker's face.
- Six other sensors of 9mm are attached to a hat to track head movements.
- Figure 3 is a view of curves of the trajectory of a sensor placed on the lower lip along the y-axis, for the original data and the reconstructed data.
- Figure 4 illustrates a bidirectional RNN.
- Figure 5 illustrates neural architectures.
- Figure 6 illustrates average performances for global and lips RMSE in mm.
- Figure 7 illustrates the critical segment analysis.
- Figure 8 is a view of curves illustrating keyshapes differences before and after fine-tuning.
- Figure 9 is a view of an architecture of the lip-synchronization application according to the invention.
- Figure 10 is a view of an implementation of the lip-synchronization application according to the invention.
- An Optitrack™ motion-capture system has been used to acquire the audiovisual corpus.
- This system is composed of eight cameras (Flex 13) with a frame rate of 120 images per second.
- the cameras are adapted to the face/head region as they are conceived for medium volume motion capture tasks.
- Sixty-three reflective markers of 3mm and 4mm diameters have been glued on the face of a French native speaker. The layout of the markers is presented in Fig. 2.
- To track the head movement we used 9mm sensors glued on a hat.
- the post-processing task consists of computing the absolute 3D spatial positions of the reflective markers for each frame, after removing the head movement.
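The patent does not detail how the head movement is removed; one common approach, shown here purely as an assumed sketch, is a rigid Kabsch alignment of the six hat markers onto a reference head pose, with the resulting transform applied to the face markers.

```python
import numpy as np


def remove_head_motion(face_pts, head_pts, head_ref):
    """Estimate the rigid transform (Kabsch) mapping the hat markers of the
    current frame onto a reference head pose, then apply it to the face
    markers. face_pts: (n_face, 3); head_pts, head_ref: (n_head, 3)."""
    mu_cur, mu_ref = head_pts.mean(axis=0), head_ref.mean(axis=0)
    H = (head_pts - mu_cur).T @ (head_ref - mu_ref)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (face_pts - mu_cur) @ R.T + mu_ref
```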
- Kaldi, a toolkit for speech recognition
- ESTER is a multi-speaker database of radio programs which has been phonetically annotated by an automatic system. This model was used to generate accurate phonetic alignments.
- the goal of the animation is to generate a sequence of 3D poses that reproduce the original performance.
- the open-source rigged 3D model of Mathilda character was used to create different morph targets for the animation module.
- the morph-targets correspond to real frames from our 3D corpus.
- the facial tracking data is decomposed into a weighted combination of the key-frames set: F_w = Σ_{i=1}^{k} W_i F_i (1)
- k is the size of the chosen visemes set.
- F_w is the computed frame at moment t, obtained by applying the weights resulting from solving equation (2) on an input frame F_t.
- F_i and W_i are a key-frame i from the key-frames set and its corresponding assigned weight. This decomposition is made using a non-negative least squares fit, resolving the problem: argmin_W ‖F_w − F_t‖², W ≥ 0 (2)
- F_t is the frame at moment t that we want to decompose into a vector of weights.
- F_w is the result of the reconstruction of F_t using the computed weights.
- the inverse task is the reconstruction of the 3D trajectories from the key-shape weights.
- the morphing algorithm used in our work is a linear composition (equation 1) where the weights W are already known. This reconstruction task was crucial to check the quality of the decomposition (see the sketch below).
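A minimal sketch of this decomposition and reconstruction using SciPy's non-negative least squares solver; the array shapes and function names are assumptions for illustration, not the actual pipeline.

```python
import numpy as np
from scipy.optimize import nnls


def decompose_frame(frame, keyframes):
    """Solve argmin_W ||F_W - F_t||^2 with W >= 0 (equation 2).
    frame: (n_markers, 3); keyframes: (k, n_markers, 3); returns (k,) weights."""
    A = keyframes.reshape(len(keyframes), -1).T     # (3 * n_markers, k)
    b = frame.reshape(-1)
    weights, _residual = nnls(A, b)
    return weights


def reconstruct_frame(weights, keyframes):
    """Linear composition F_W = sum_i W_i F_i (equation 1)."""
    return np.tensordot(weights, keyframes, axes=1)  # (n_markers, 3)
```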
- Table 1. RMSE error in millimeters and Pearson correlation between the original data and the reconstructed data. These measures were computed for the three axes (X, Y, Z); we then calculated the average value over the three axes.
- Table 2. List of visemes and their corresponding phonemes, along with their 3D representative blendshapes. The phonetic symbols are taken from the International Phonetic Alphabet (IPA) [Decker et al. 1999].
- IPA International Phonetic Alphabet
- GRU reduces the complexity of LSTM by removing one gate and the cell memory, thereby decreasing the number of parameters, which should simplify training.
- LSTM and its variations are well-known for their great performances in language and speech-related tasks for example, phoneme classification [Graves et al. 2006], machine translation [Bahdanau et al. 2015], and language modeling [Mulder et al. 2015].
- Neural Architectures. We have compared three different designs, presented in Fig. 5, mainly differentiated by the output layer on top of the networks. All three models start with two layers of gated recurrent units.
- in the first design, the output layer is a simple linear layer outputting the spatial trajectories of each sensor. This output then needs further processing to animate a facial model; for example, in this work we compute the blendshape weights with a non-negative least squares fit.
- in the second design, the neural network learns to directly predict a latent representation of the visual motion trajectories, previously computed on the training set of the corpus.
- blendshape weights must be non-negative
- the output layer is composed of a linear layer with a ReLU activation to ensure positivity, while an identity function is used for the PCA eigenvalues. Note that in the case of blendshape weights, this architecture could be used with a hand-crafted database, in cases where an animation workforce is easier to obtain than a motion-capture system.
- in the third design, the output layer is composed of two consecutive fully connected layers.
- The first linear layer generates a latent representation, which can be bounded using a differentiable function (e.g. ReLU for non-negativity), and the second layer reconstructs the 3D cloud from the latent representation.
- ReLU for non-negativity
- the last layer is carefully initialized, with either the cloud values of each keyshape or the eigenvectors of the 3D cloud. Moreover, we could easily add a penalty at the latent-representation level when desired (a sketch of this architecture is given below).
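A hypothetical PyTorch sketch of this third design: two bidirectional GRU layers, a bounded latent layer, and a decoder whose weights are initialized from a linear basis (PCA eigenvectors or keyshape clouds). Layer sizes (e.g. 35 phonemes, 63 markers × 3 = 189 outputs) and names are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn


class PhonemeToFace(nn.Module):
    def __init__(self, n_phonemes=35, hidden=128, latent=16, n_outputs=189,
                 basis=None):
        super().__init__()
        # Two bidirectional GRU layers over the phoneme sequence.
        self.rnn = nn.GRU(n_phonemes, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * hidden, latent)
        self.decoder = nn.Linear(latent, n_outputs)
        if basis is not None:
            # basis: (latent, n_outputs) linear basis, e.g. PCA eigenvectors
            # or flattened keyshape clouds (an assumption for illustration).
            with torch.no_grad():
                self.decoder.weight.copy_(
                    torch.as_tensor(basis, dtype=torch.float32).T)

    def forward(self, phonemes_onehot):          # (batch, time, n_phonemes)
        h, _ = self.rnn(phonemes_onehot)         # (batch, time, 2 * hidden)
        z = torch.relu(self.to_latent(h))        # ReLU bounds the latent code
                                                 # (identity would fit the PCA case)
        return self.decoder(z)                   # (batch, time, n_outputs)
```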
- the target output A is a sequence of n-dimensional vectors representing either the stacked spatial coordinates of each articulator (models 5.3.1 and 5.3.3) or the latent representation (model 5.3.2), while the input F is the encoded phoneme sequence: each f_t is a one-hot vector representing the articulated phoneme at time step t. This encoding preserves the duration of each phoneme without having to explicitly feed this information to the network, and can be seen as a multidimensional binary signal synchronized with the articulator trajectories.
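As an illustration, the frame-synchronous one-hot encoding could look like the following sketch; the alignment tuple format and the 120 fps rate are assumptions taken from the corpus description above.

```python
import numpy as np


def encode_phonemes(alignment, phoneme_to_idx, fps=120):
    """Turn a phonetic alignment [(phoneme, start_s, end_s), ...] into a
    frame-synchronous one-hot matrix of shape (n_frames, n_phonemes), so the
    duration of each phoneme is carried implicitly by repetition."""
    n_frames = int(round(alignment[-1][2] * fps))
    onehot = np.zeros((n_frames, len(phoneme_to_idx)), dtype=np.float32)
    for phoneme, start, end in alignment:
        onehot[int(start * fps):int(end * fps), phoneme_to_idx[phoneme]] = 1.0
    return onehot
```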
- we used the mean squared error as the loss function, defining the error as the Euclidean distance between the target a_t and the prediction â_t: L = (1/T) Σ_{t=1}^{T} ‖a_t − â_t‖² = (1/T) Σ_{t=1}^{T} Σ_{j=1}^{n} (a_t^j − â_t^j)², with T the sequence size and a_t^j the j-th dimension of a_t.
- the optimization method used to train the network was Adam, an adaptive-learning-rate extension of stochastic gradient descent with many benefits (e.g. appropriate for non-stationary objectives and sparse gradients, parameter updates invariant to gradient rescaling, intuitive hyper-parameters).
- Kingma and Ba [2015] claim that it combines both the advantages of RMSprop [Tieleman and Hinton [n. d.]] and AdaGrad [Duchi et al. 2011], two other well-known gradient-based optimization algorithms.
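A minimal training-loop sketch under these choices (MSE loss, Adam); batching, padding, and validation details are omitted and the names are hypothetical.

```python
import torch


def train(model, loader, epochs=50, lr=1e-3):
    """Adam on the mean squared error between predicted and ground-truth
    articulator trajectories (or latent representations)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for phonemes, targets in loader:   # (B, T, n_phonemes), (B, T, n_out)
            optimizer.zero_grad()
            loss = loss_fn(model(phonemes), targets)
            loss.backward()
            optimizer.step()
    return model
```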
- Test 1: computing the minimum of the mouth opening for each phoneme.
- Test 1 allows detecting if the model has captured the complete closure of the lips during bilabial sounds, and test 2, if the model has correctly learned the protrusion of the concerned vowels (mainly /u/).
- the mouth opening is defined as the Euclidean distance between the central sensor of the upper lip and the central sensor of the lower lip.
- Protrusion is defined as the position on the y-axis of the upper lip's central sensor. Both measures are sketched below.
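Both measures, and the per-segment minimum used in test 1, could be computed as in this assumed sketch; the marker indices, axis convention, and 120 fps rate are placeholders.

```python
import numpy as np


def mouth_opening(frames, upper_idx, lower_idx):
    """Euclidean distance between the central upper-lip and lower-lip markers.
    frames: (T, n_markers, 3); returns (T,)."""
    return np.linalg.norm(frames[:, upper_idx] - frames[:, lower_idx], axis=-1)


def protrusion(frames, upper_idx, y_axis=1):
    """Position of the upper lip's central marker along the y-axis."""
    return frames[:, upper_idx, y_axis]


def min_opening_per_segment(opening, alignment, fps=120):
    """Minimum mouth opening over each phoneme segment (test 1).
    alignment: [(phoneme, start_s, end_s), ...]."""
    return [(ph, opening[int(s * fps):int(e * fps)].min())
            for ph, s, e in alignment if int(e * fps) > int(s * fps)]
```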
- Figure 7 summarizes the distribution of minimal mouth opening for different models: learning spatial trajectories, learning a latent representation (both blendshape weights and principal components), and learning to reconstruct the spatial trajectories while fine-tuning the latent decoder.
- A lower median means the mouth closes well during production, so the figure clearly exhibits some issues with lip closing during bilabial production when learning from the raw spatial trajectories, as evidenced by a median of minimal mouth opening higher than the ground-truth median.
- Protrusion seems to be correctly learned by the different models (lower plot of Fig. 7), but all models present less variability than the ground-truth values, in particular the spatial-trajectories model. Fortunately, this lack of variability does not affect the quality of the final synthesis: the mouth should be closed for bilabial production and protrusion should be correctly perceived. Thus, it is more relevant to ensure a median closer to the ground-truth median than a perfect match of the distribution. For example, the generated protrusion seems to be more noticeable than the ground-truth protrusion, but a noticeable protrusion is more important than a barely visible one. Learning from a latent representation, or learning to reconstruct the spatial trajectories from a latent representation, greatly improves the results. The best performances are reached using principal components as the latent representation, which is not really surprising as the principal components are computed to cover more than 95% of the data variance, while the blendshape decomposition is hand-crafted to be meaningful for animators and linguistically inspired.
- the system takes a text and the corresponding audio signal as inputs and generates a 3D animation synchronized with that audio.
- the voice of the user is recorded while uttering a sentence in French.
- the alignment module extracts the phonetic and temporal information and passes them to the prediction module.
- the prediction result is a sequence of blendshape weight vectors or 3D frames that we transform into a vector of blendshape weights.
- speech animation is played synchronously with the recorded audio using a visualization player developed with the Unity game engine.
- Any other interactive 3D rendering system can be used. For instance, it is possible to render the speech animation with rendering software such as Maya 3D (a rough sketch of the overall glue logic is given below).
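Purely as an assumed illustration of how these modules chain together; the aligner, model, and helper functions refer to the sketches above, not to the actual application code.

```python
import torch


def animate(text, audio_path, aligner, model, phoneme_to_idx, keyframes):
    """Forced alignment -> frame-synchronous phoneme encoding -> trajectory
    prediction -> per-frame blendshape weights for the 3D player."""
    alignment = aligner(text, audio_path)                 # [(phoneme, start, end), ...]
    onehot = encode_phonemes(alignment, phoneme_to_idx)   # (T, n_phonemes)
    with torch.no_grad():
        traj = model(torch.from_numpy(onehot)[None])[0].numpy()  # (T, n_out)
    frames = traj.reshape(len(traj), -1, 3)               # (T, n_markers, 3)
    return [decompose_frame(f, keyframes) for f in frames]
```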
Abstract
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962884427P | 2019-08-08 | 2019-08-08 | |
PCT/EP2020/072272 WO2021023869A1 (en) | 2019-08-08 | 2020-08-07 | Audio-driven speech animation using recurrent neural network
Publications (1)
Publication Number | Publication Date |
---|---|
EP4010899A1 true EP4010899A1 (en) | 2022-06-15 |
Family
ID=72234805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20760367.1A Withdrawn EP4010899A1 (en) | 2019-08-08 | 2020-08-07 | Audio-driven speech animation using recurrent neural network
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4010899A1 (en) |
WO (1) | WO2021023869A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221840B (en) * | 2021-06-02 | 2022-07-26 | 广东工业大学 | Portrait video processing method |
CN113539240A (en) * | 2021-07-19 | 2021-10-22 | 北京沃东天骏信息技术有限公司 | Animation generation method and device, electronic equipment and storage medium |
CN113628635B (en) * | 2021-07-19 | 2023-09-15 | 武汉理工大学 | Voice-driven speaker face video generation method based on teacher student network |
CN114093025A (en) * | 2021-10-29 | 2022-02-25 | 济南大学 | Man-machine cooperation method and system for multi-mode intention reverse active fusion |
US11923899B2 (en) | 2021-12-01 | 2024-03-05 | Hewlett Packard Enterprise Development Lp | Proactive wavelength synchronization |
CN114202605B (en) * | 2021-12-07 | 2022-11-08 | 北京百度网讯科技有限公司 | 3D video generation method, model training method, device, equipment and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1511008A1 (en) * | 2003-08-28 | 2005-03-02 | Universität Stuttgart | Speech synthesis system |
US20060009978A1 (en) * | 2004-07-02 | 2006-01-12 | The Regents Of The University Of Colorado | Methods and systems for synthesis of accurate visible speech via transformation of motion capture data |
-
2020
- 2020-08-07 EP EP20760367.1A patent/EP4010899A1/en not_active Withdrawn
- 2020-08-07 WO PCT/EP2020/072272 patent/WO2021023869A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2021023869A1 (en) | 2021-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lu et al. | Live speech portraits: real-time photorealistic talking-head animation | |
Richard et al. | Meshtalk: 3d face animation from speech using cross-modality disentanglement | |
Suwajanakorn et al. | Synthesizing obama: learning lip sync from audio | |
Karras et al. | Audio-driven facial animation by joint end-to-end learning of pose and emotion | |
US11682153B2 (en) | System and method for synthesizing photo-realistic video of a speech | |
WO2021023869A1 (en) | Audio-driven speech animation using recurrent neural network | |
Xie et al. | Realistic mouth-synching for speech-driven talking face using articulatory modelling | |
Sargin et al. | Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation | |
KR20060090687A (en) | System and method for audio-visual content synthesis | |
US20210390945A1 (en) | Text-driven video synthesis with phonetic dictionary | |
Zhang et al. | Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary | |
Zhou et al. | An image-based visual speech animation system | |
Gururani et al. | Space: Speech-driven portrait animation with controllable expression | |
Chai et al. | Speech-driven facial animation with spectral gathering and temporal attention | |
Lavagetto | Time-delay neural networks for estimating lip movements from speech analysis: A useful tool in audio-video synchronization | |
Deena et al. | Visual speech synthesis using a variable-order switching shared Gaussian process dynamical model | |
Liz-Lopez et al. | Generation and detection of manipulated multimodal audiovisual content: Advances, trends and open challenges | |
Hussen Abdelaziz et al. | Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models | |
Asadiabadi et al. | Multimodal speech driven facial shape animation using deep neural networks | |
Liu et al. | Real-time speech-driven animation of expressive talking faces | |
Hussen Abdelaziz et al. | Audiovisual speech synthesis using tacotron2 | |
Liu et al. | Optimization of an image-based talking head system | |
Mahavidyalaya | Phoneme and viseme based approach for lip synchronization | |
Zhang et al. | Realistic Speech-Driven Talking Video Generation with Personalized Pose | |
Deena | Visual speech synthesis by learning joint probabilistic models of audio and video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20220215 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20230523 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20231003 |