BR102015020030A2

BR102015020030A2 - Expressive speech 2D facial animation synthesis method

Info

Publication number: BR102015020030A2
Application number: BR102015020030A
Authority: BR
Inventors: Mario De Martino José; Dornhofer Paro Costa Paula
Original assignee: Univ Estadual De Campinas - Unicamp
Priority date: 2015-08-20
Filing date: 2015-08-20
Publication date: 2017-02-21
Also published as: BR102015020030A8; WO2017027940A1

Abstract

2D EXPRESSIVE SPEECH FACIAL ANIMATION SYNTHESIS METHOD. The present invention relates to a method for synthesizing expressive speech facial animation based on photorealistic 2D images. It belongs to the field of information technology, more specifically to the areas of computer graphics and facial animation, with applications in the creation of characters for games and films, virtual agents, video compression (in videoconferencing), studies of audiovisual perception, and as a tool for training skills in the production, recognition and interpretation of speech and facial expressions.

Description

2D EXPRESSIVE SPEECH FACIAL ANIMATION SYNTHESIS METHOD

FIELD OF THE INVENTION

[1] The present invention relates to a method for synthesizing image-based (2D) facial animation in synchrony and in harmony with speech accompanied by the expression of emotion, that is, expressive speech.

[2] The invention belongs to the field of information technology, more specifically to the areas of computer graphics and facial animation, with applications in the creation of characters for games and films, virtual agents, video compression (in videoconferencing), studies of audiovisual perception, and as a tool for training skills in the production, recognition and interpretation of speech and facial expressions.

BACKGROUND OF THE INVENTION

[3] The term "ubiquitous computing" was coined by Weiser (1991) to describe a scenario in which computing devices would be integrated into people's natural actions and behaviors through a practically invisible interface. Almost two decades later, there is still a long path of development ahead before these devices become truly transparent to their users.

[4] One of the main initiatives toward this level of interaction is the so-called natural interfaces. Such interfaces are an alternative to interfaces based on interaction through windows, icons, mouse and pointing devices, also called WIMP interfaces ("windows, icons, mouse, pointing devices"). In this sense, speech-synchronized facial animation (or "talking head") is a key technology for developing natural interfaces based on the face-to-face communication mechanisms with which we have been familiar since birth.

[5] Combined with techniques from natural language processing, artificial intelligence, and speech synthesis and recognition, facial animation can be particularly advantageous for users who are unfamiliar with technology, such as children, individuals with a low level of literacy, or individuals with motor difficulties that prevent the use of, for example, a mouse and keyboard.

[6] The wide range of applications of facial animation includes virtual assistants, tutors, news presenters, tour guides, salespeople, game characters, special effects in cinema, video compression in videoconferencing applications, and tools for controlled experiments or therapy in psychology and the behavioral sciences.

[7] Since the pioneering work of Parke (1972), facial animation techniques have evolved thanks to advances in computer graphics, supported mainly by the increased capacity of graphics and general-purpose processors and by improved image and motion capture techniques, driven by the emergence of high-resolution cameras, sophisticated motion capture systems and, more recently, RGB-D sensors. This evolution has enabled solutions that use "talking heads" as integral components of commercial products.

[8] However, a still relevant and challenging aspect of facial animation techniques is the ability to generate realistic animations that combine the articulatory movements of speech with elements of nonverbal communication and the expression of emotions.

[9] Among the numerous efforts to develop computational models of emotions, the so-called "appraisal models" describe emotions in terms of the process of appraising, and consequently judging, the events and situations that trigger them.

[10] Ortony, Clore and Collins (1988) proposed an appraisal model that assigns cognitive meanings to the logical operations involved in the appraisal of an emotion and defines 22 different emotions in a structure known as the OCC model, which has frequently been adopted as a computational model of emotions. The structure defined and documented by the OCC model is often adopted by intelligent systems that reproduce the appraisal process triggered by an event, a user or an object.

[11] With regard to the modeling, representation and manipulation of the face in a facial animation, the main strategies for generating facial animations presented in the literature can be classified as model-based (3D facial animation) and image-based (2D facial animation).

[12] In 3D facial animation, the head and face are typically described as a three-dimensional polygonal mesh onto which the texture information of many visually distinct elements of the face is mapped, such as the skin texture, eyes, eyebrows, lips, hair, etc.

[13] Despite advances in 3D facial modeling techniques and their ability to synthesize high-quality images, a closer look at these "talking heads" reveals their artificial appearance. To give these faces a more natural look, 3D models require sophisticated control to reproduce, for example, the plastic deformations of the mouth during speech and changes in skin texture. However, such implementations typically involve special apparatus for capturing motion and skin texture, with high computational costs.

[14] Document EP2618311A1, for example, describes a 3D facial animation system that takes as input the prosodic properties of the new speech to be animated. However, besides having a recognizably synthetic appearance, it is unable to model expressive speech.

[15] Patent US7663628B2, also directed to 3D animation, focuses on expressive speech but does not describe the coarticulation effects of speech on the animated face, which impairs the realistic reproduction of speech movements. The work of Beskow (2005) also presents expressive speech synthesis on a 3D model; it uses motion capture data to build its model and covers only a small number of stereotyped emotions. The same applies to the work of Deng (2006).

[16] Patent US8624901B2 describes another type of approach: a system for cloning facial expressions from an input video and a set of "key models", which are three-dimensional polygonal meshes. As reported, the system can analyze the image of a face in an input video and transfer its emotions to an output "avatar", which is nevertheless still represented as a 3D model. Moreover, it does not demonstrate the ability to generate new facial animations simply from the content of the speech to be animated.

[17] Image-based, or 2D, facial animation is synthesized by processing, sequencing, concatenating and presenting image samples taken from a real face. It inherently yields animations in which the face looks photorealistic, but it offers limited control over other nonverbal aspects, such as head movement.

[18] Documents US6250928B1, US6735586B1 and PI0903935-0 A2, although they adopt the 2D approach, do not provide for associating speech with the expression of emotions. Document US6504546B1 likewise describes a facial animation system for neutral speech only; despite the photorealistic images, the generated, non-expressive animation still retains a synthetic character.

[19] Document WO2009114488A1 purportedly describes a photorealistic 2D facial animation system comprising: the mechanism for generating an extensive database of facial images in specific poses; the way these images are stored and the flow of information between the different system components; and the system for creating content from the image database for various applications. However, it does not specify the mechanisms for selecting images as a function of the sequence of phonemes to be animated, does not mention strategies for the transition between visemes, and does not include any model for the expression of speech accompanied by emotion. Visemes, or visual phonemes, are lip postures that contain observable cues distinguishing types of sounds.

[20] Thus, it is evident that none of the prior-art documents described provides the distinguishing features of the present invention, since they either do not provide for the modeling of speech accompanied by emotions, or rely on 3D methodology with synthetic-looking visual characteristics, or model only a small number of stereotyped emotions. The wide variety of applications further demonstrates the importance of the present invention to the state of the art.

BRIEF DESCRIPTION OF THE INVENTION

[21] The present invention relates to a method for synthesizing image-based (2D) facial animation in synchrony and in harmony with speech accompanied by the expression of emotion, that is, expressive speech.

[22] The method in question can be described by the following steps:

(A) Generation of a synthesis model of expressive facial images;
(A1) Capture of expressive speech samples;
(A2) Selection of expressive visemes and extraction of the shape parameters;
(A3) Alignment of the shapes to the mean shape;
(A3.1) Randomly select a shape from the database to serve as the standard shape for normalization, in order to guarantee convergence of the algorithm;
(A3.2) Align each of the other shapes in the database to the standard shape;
(A3.3) Compute the mean shape of the aligned shapes;
(A3.4) Normalize the orientation, scale and origin of the mean shape to the standard shape;
(A3.5) Realign all shapes to the mean shape;
(A3.6) Repeat steps (A3.3), (A3.4) and (A3.5) until convergence;
(A4) Obtaining the appearance vectors;
(A4.1) Delimit a region of interest (ROI);
(A4.2) Perform the "appearance alignment" by warping the images to their mean shape;
(A4.3) Store the RGB information of the pixels of this ROI in a vector a, obtaining the appearance vector for one image;
(A4.4) Compute the mean appearance vector by means of equation 4;
(A5) Obtaining the appearance/shape vectors;
(A6) Obtaining the data matrix;
(A7) Data standardization;
(A7.1) Obtaining $\bar{c}_j$, the mean value of the elements of column j, for j = 1, 2, ..., n;
(A7.2) Obtaining the standard deviation of this same set;
(A7.3) Standardizing the elements $c_{ij}$ according to the mean and standard deviations obtained previously, yielding the standardized data $h_{ij}$;
(A7.4) Composing the elements $h_{ij}$ into a matrix H (m x n);
(A8) PCA implementation;
(A8.1) Obtaining the covariance matrix Cov(H) of the matrix H; and
(A8.2) Determining the eigenvectors and eigenvalues of the covariance matrix obtained previously, resulting in the orthonormal vectors and the eigenvalues associated with them;
(A9) Generation of the shape and appearance models;
(A9.1) Obtaining the diagonal matrices $D_a$ and $D_s$, with the standard deviations obtained previously (equation 15) on their main diagonals;
(A9.2) Determining the principal components of the appearance and shape models, respectively $e_i$ and $f_i$, and defining the appearance models as presented in equations 20 and 21, where $\alpha_i$ and $\beta_i$ are the coefficients of the linear combination of the vectors $e_i$ and $f_i$;
(A10) Expressive speech model;
(B) Expressive speech synthesis process;
(B1) Processing of the timed phonetic transcription;
(B2) Extraction of the animation intervals;
(B3) Conversion of phonemes to context-dependent visemes;
(B4) Synthesis of the appearances of the key poses;
(B4.1) Synthesis of the appearance of the whole face;
(B4.2) Synthesis of the appearance of the "lips+cheeks" ROI and overlay onto the previous appearance;
(B4.3) Synthesis of the appearance of the "lips" ROI and overlay onto the previous appearance;
(B4.4) Joining of the previous appearance to a base face;
(B5) Synthesis of the final key poses;
(B6) Shape modulator;
(B7) Transition between two key poses; and
(B8) Composition and presentation.

[23] The present invention clearly defines the strategy for selecting images from two inputs: the phonetic transcription of the speech to be animated and the emotion to be expressed. The invention allows the automatic synthesis of facial animations with expressive speech, covering a comprehensive set of emotions. Because its input set of emotions is based on a model comprising 22 emotions, it offers an even greater contribution as an alternative to models that use simpler sets of stereotyped emotions, which makes those models limited for applications involving typically observed dialogue contexts, where a wide variety of emotional states and facial expressions is combined.
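Steps (A3.1) to (A3.6) listed above correspond to an iterative shape alignment of the kind usually called generalized Procrustes analysis. The following Python sketch illustrates that loop under stated assumptions: each shape is taken to be a k x 2 NumPy array of landmark coordinates, the alignment is assumed to be a similarity transform (translation, uniform scale, rotation), and the function names `align` and `generalized_procrustes`, the tolerance and the iteration cap are illustrative choices that do not come from the patent text.

```python
import numpy as np


def align(shape, ref):
    """Similarity-align `shape` (k x 2 landmarks) to `ref` in the least-squares sense."""
    a = shape - shape.mean(axis=0)          # remove translation
    b = ref - ref.mean(axis=0)
    # Optimal rotation from the SVD of the cross-covariance (orthogonal Procrustes).
    u, s, vt = np.linalg.svd(a.T @ b)
    if np.linalg.det(u @ vt) < 0:           # forbid reflections (assumption)
        u[:, -1] *= -1
        s[-1] *= -1
    rotation = u @ vt
    scale = s.sum() / (a ** 2).sum()        # optimal uniform scale
    return scale * a @ rotation + ref.mean(axis=0)


def generalized_procrustes(shapes, tol=1e-8, max_iter=100, seed=0):
    """Iteratively align all shapes (m x k x 2 array) to their mean shape."""
    rng = np.random.default_rng(seed)
    standard = shapes[rng.integers(len(shapes))]                   # (A3.1)
    aligned = np.array([align(s, standard) for s in shapes])       # (A3.2)
    mean = standard
    for _ in range(max_iter):
        # (A3.3) mean of the aligned shapes, (A3.4) normalized to the standard shape
        new_mean = align(aligned.mean(axis=0), standard)
        aligned = np.array([align(s, new_mean) for s in aligned])  # (A3.5)
        if np.linalg.norm(new_mean - mean) < tol:                  # (A3.6)
            break
        mean = new_mean
    return aligned, new_mean
```

Re-normalizing the mean to the fixed standard shape in every iteration keeps the mean from drifting in scale or orientation, which is one plausible reading of why step (A3.1) ties the choice of a standard shape to convergence.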

[24] Finally, the approach based on visemes that depend on the phonetic context implicitly models speech coarticulation, which ensures that the movements are modeled with a higher level of realism. The invention does not require motion capture equipment, and the input data can be generated with an ordinary video camera.
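Steps (A7) and (A8) of the model-generation stage, column-wise standardization of the data matrix followed by an eigen-decomposition of its covariance matrix, can be sketched in Python as shown below. This is a minimal illustration, not the patented implementation: the function names `standardize` and `pca`, the use of the sample standard deviation (`ddof=1`) and the guard for constant columns are assumptions, since the equations themselves are not part of this section.

```python
import numpy as np


def standardize(C):
    """(A7) Standardize each column of the m x n data matrix C."""
    mean = C.mean(axis=0)              # (A7.1) column means
    std = C.std(axis=0, ddof=1)        # (A7.2) column standard deviations (ddof assumed)
    std[std == 0] = 1.0                # guard for constant columns (assumption)
    H = (C - mean) / std               # (A7.3), (A7.4) standardized matrix H
    return H, mean, std


def pca(H):
    """(A8) Eigenvalues and orthonormal eigenvectors of Cov(H), largest first."""
    cov = np.cov(H, rowvar=False)               # (A8.1) n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # (A8.2) symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    return eigvals[order], eigvecs[:, order]
```

For real appearance vectors the number of columns n (pixel values) is far larger than the number of images m, so forming the n x n covariance matrix explicitly is costly; a practical implementation would more likely obtain the same components from a singular value decomposition of H.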

BRIEF DESCRIPTION OF THE FIGURES

[25] Figure 1. Sequence of phonemes and their corresponding visemes in a video excerpt recorded from an actress's performance.

[26] Figure 2. Characteristic points of the face, which determine its shape characteristics.

[27] Figure 3. Schematic of the process of building the expressive speech model, including the process of obtaining the shape and appearance models.

[28] Figure 4. Set of points of interest on a face and their division according to their stability during the articulatory movement of speech: "anchor points" (a), "intermediate points" (b) or "dynamic points" (c).

[29] Figure 5. Warping of the database images toward the mean shape: (a) an original image from the database, (b) a schematic of its original shape information, (c) the Delaunay triangulation and (d) the region of interest of the warped image, which is used to generate the appearance vector.

[30] Figure 6. Division of the face into 3 ROIs (regions of interest) according to the importance of each to the expressiveness of the face and its variability during expressive speech, which ensures that the regions undergoing the greatest variation in expressive speech carry the most relevant information in the synthesis process.

[31] Figure 7. Mean appearance image (a) and its three principal components, or eigenfaces: $h_1$ (b), $h_2$ (c) and $h_3$ (d), respectively.

[32] Figure 8. General flowchart of the expressive speech synthesis process.

[33] Figure 9. Transcription of an utterance represented by the sequence of k phonemes and their intervals.

[34] Figure 10. Two key visemes and the transition strategy between them.

[35] Figure 11. Synthesis of the appearances of the key poses.

[36] Figure 12. Synthesis of intermediate frames in the forward and reverse directions.

[37] Figure 13. Emotions animated by this method (bottom row) alongside the emotions extracted in the first step from the original video of real speech (top row).

[38] Figure 14. Emotions animated by this method (bottom row) alongside the emotions extracted in the first step from the original video of real speech (top row).

DETAILED DESCRIPTION OF THE INVENTION

[39] The present invention relates to a method for synthesizing image-based (2D) facial animation in synchrony and in harmony with speech accompanied by the expression of emotion, that is, expressive speech. The method in question can be described by the following steps:

(A) Generation of a synthesis model of expressive facial images;
(A1) Capture of expressive speech samples;
(A2) Selection of expressive visemes and extraction of the shape parameters;
(A3) Alignment of the shapes to the mean shape;
(A3.1) Randomly select a shape from the database to serve as the standard shape for normalization, in order to promote convergence of the algorithm;
(A3.2) Align each of the other shapes in the database to the standard shape;
(A3.3) Compute the mean shape of the aligned shapes;
(A3.4) Normalize the orientation, scale and origin of the mean shape to the standard shape;
(A3.5) Realign all shapes to the mean shape;
(A3.6) Repeat steps (A3.3), (A3.4) and (A3.5) until convergence;
(A4) Obtaining the appearance vectors;
(A4.1) Delimit a region of interest (ROI);
(A4.2) Perform the "appearance alignment" by warping the images to their mean shape;
(A4.3) Store the RGB information of the pixels of this ROI in a vector a, obtaining the appearance vector for one image;
(A4.4) Compute the mean appearance vector by means of equation 4;
(A5) Obtaining the appearance/shape vectors;
(A6) Obtaining the data matrix;
(A7) Data standardization;
(A7.1) Obtaining $\bar{c}_j$, the mean value of the elements of column j, for j = 1, 2, ..., n;
(A7.2) Obtaining the standard deviation of this same set;
(A7.3) Standardizing the elements $c_{ij}$ according to the mean and standard deviations obtained previously, yielding the standardized data $h_{ij}$;
(A7.4) Composing the elements $h_{ij}$ into a matrix H (m x n);
(A8) PCA implementation;
(A8.
1) Obtaining the covariance matrix ('m ■' (//) of the #f matrix (Α8.2) Determining the eigenvectors and eigenvalues of the previously obtained covariance matrix resulting in the orthonormal vectors and eigenvalues associated with these vectors (A9) Generation of shape and appearance models (A9.1) Obtaining the diagonal matrices D „and Ds, with the standard deviations obtained previously (equation 15) in their main diagonals (A9.2 ) Determination of the principal components of the appearance and shape models, respectively and ff and definition of the appearance models as presented by equations 20 and 21 where the coefficients up to /? {Represent the coefficients of the linear combination of the vectors e * and ff (A10) Expressive speech model (B) Expressive beech synthesis process (BI) Timed phonetic transcription processing (B2) Animation interval extraction (83) Conversion of phonemes to context-dependent visemas (84) Synthesis of appearances of key poses; (84.1) Synthesis of the appearance of the entire face; (B4.2) Synthesis of ROI "lips + cheeks" appearance and overlap with previous appearance; (B4.3) Synthesis of ROI "lips" appearance and overlap with previous appearance; (B4.4) Joining appearance prior to base face; (85) Summary of final key poses; (86) Shape modulator; (87) Transition between two key poses; and (68) Composition and presentation. 40) Lia has application in the creation of characters for games / movies, virtual agents, video compression (videoconferencing), studies of audiovisual perception and as a tool for the training of production skills, speech recognition and interpretation and facial expressions. [411] Next, the method proposed in the present invention for synthesis of image-based or 2D facial animation in sync and in harmony with speech accompanied by expression of emotion, or expressive fact, is described in detail for a better understanding of the invention.

Generation of an expressive facial image synthesis model

[42] In step (A), a facial image synthesis model is generated. Generating this model comprises the following substeps: (A1) Capture of expressive speech samples; (A2) Selection of expressive visemes and extraction of the shape parameters; (A3) Alignment of the shapes to the mean shape; (A4) Obtaining the appearance vectors; (A5) Obtaining the appearance/shape vectors; (A6) Obtaining the data matrix; (A7) Data standardization; (A8) PCA implementation; (A9) Generation of the shape and appearance models; (A10) Expressive speech model.

Capture of expressive speech samples

[43] The first substep consists of capturing facial images in specific articulatory postures, called visemes, performed during the expression of an emotion. Since there was no report of an annotated audiovisual corpus, public or otherwise, containing speech accompanied by emotions for Brazilian Portuguese (pt-br), one had to be created.

[44] For the present invention, the strategy adopted was to record on video, under controlled conditions, the speech of utterances chosen to cover all phonemes of Portuguese as spoken in Brazil, in specific phonetic contexts, characterizing the phonetic-context-dependent visemes defined by De Martino (2006).

[45] For the present invention, an actress was filmed portraying all 22 emotions of the OCC model, namely: happiness for someone, joy, hope, satisfaction, relief, pride, reward, gratitude, admiration, love, pity, sadness, fear, resentment, fears confirmed, shame, reproach, remorse, derision, disappointment, disgust and anger. Samples of neutral speech were also filmed.

[46] To this end, twenty-two recording scripts were designed to be consistent with the cognitive state described by the OCC model, while ensuring that every phonetic-context-dependent viseme of Brazilian Portuguese occurs in each speech. An example of the text written to represent the OCC emotion "fear" (translated here from Portuguese) can be read below: "Lucas... Tulha... I am very worried... If we do not get this contract, everything I have accomplished and fought for in this life may be destroyed. Without this contract I will have no money to pay what I owe to Liio and Juliano. They will take my house and my car. I will never again be able to look at my family with pride. And the worst part is that I have been through hard times before, and I know that at times like these many of those who call themselves my friends will simply vanish... I know I will be alone and will have no one to ask for help."

Selection of expressive visemes and extraction of the shape parameters

[47] Once the corpus is complete, the captured images are processed and the shape parameters of the expressive visemes are extracted.

[48] The audio is processed and segmented manually, resulting in a timed phonetic transcription of the utterances, which makes it possible to associate short video sequences with the production of specific phonemes. For each phoneme of interest, a representative viseme is selected by observing the inflection point of the articulatory movements of speech (lips, tongue and teeth). Figure 1 exemplifies a sequence of phones and their corresponding visemes. The fricative phoneme [f], for example, is produced by touching the lower lip to the upper teeth and letting air flow through the mouth. In the figure, frame (n + 2) represents the beginning of a new articulatory movement. Its final configuration is reached in frame (n + 4). Frame (n + 5) shows the lips releasing to prepare the following vowel sound. In this case, frame (n + 4) would be selected as the representative viseme for the phoneme [f] in the present phonetic context.

[49] Thirty-four expressive visemes (De Martino (2006)) are selected for each OCC emotion and also for neutral speech, so that the base corresponding to this actress is composed of a total of 782 face images. The thirty-four selected visemes correspond to 22 consonantal context-dependent visemes (VDC) (Table 1), 11 vowel visemes (Table 2) and 1 silence viseme.

Table 1: Consonantal visemes dependent on phonetic context (adapted from De Martino (2006)).

Table 2: Vowel visemes dependent on phonetic context (adapted from De Martino (2006)).

[50] Each facial image is associated with shape information, characterized by the x and y coordinates of 56 feature points of the face, as shown in Figure 2. These points are chosen to outline elements of the face and head, such as the eyebrows, nose, lips, ears and chin, that is, the "shape" of the face.

[51] The result is a motion capture database of expressive speech for pt-br, synchronized with the speech audio and with the corresponding video performances. This database, the first of its kind for Brazilian Portuguese, can be used in different applications, such as the analysis, recognition and synthesis of emotions or of expressive speech, the latter being the main objective of the present invention.

Alignment of the shapes to the mean shape

[52] Substep (A3) comprises the first of the steps required to extract features from the facial images in order to generate a model that allows new facial configurations to be synthesized. Starting from the base of expressive visemes obtained in the previous substep, Figure 3 shows the process of building the expressive speech model (substeps A3 to A10). Substep A3 in turn consists of the following substeps: (A3.1) Randomly select a shape from the database to be the standard shape for normalization, in order to promote convergence of the algorithm; (A3.2) Align each of the other shapes in the database to the standard shape; (A3.3) Compute the mean shape of the aligned shapes; (A3.4) Normalize the orientation, scale and origin of the mean shape to the standard shape; (A3.5) Realign all shapes to the mean shape; (A3.6) Repeat steps (A3.3), (A3.4) and (A3.5) until convergence.

[53] The synthesis model is based first on the creation of a base of parameters extracted from an active appearance model (AAM; "MAA" in the Portuguese original). It is derived from a training base formed by a specific set of expressive facial images extracted from the video footage recorded for the corpus described above. The parameters of this base serve as input to a model that synthesizes facial images, which later serve as the key poses of the animation. A statistical analysis is performed in order to obtain the shape and appearance models capable of expressing the diversity of face configurations present in the database.

[54] As stated earlier, the shape of a facial image can be described as a set of points of interest that describe its features, such as the contours of the eyes, eyebrows, nose, mouth and head. A "shape" vector s is therefore defined as the concatenation of the x and y coordinates of each of the k points of interest associated with a facial image, resulting in a column vector with 2k elements (equation 1).

(equation 1)

[55] The first step of the analysis is to align all shapes in the database to obtain a "mean shape" vector s̄, a step necessary to obtain a true representation of the distribution of the points of interest, based on computing the least-squares distance between two shapes.
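The shape vector of paragraph [54] and the corpus size of paragraph [49] can be illustrated with a minimal sketch. Equation 1 is not reproduced in this text, so the interleaved ordering (x1, y1, x2, y2, ...) used below is an assumption, and the function name `shape_vector` and the landmark coordinates are purely illustrative.

```python
def shape_vector(points):
    """Flatten k (x, y) points of interest into one list with 2k elements.

    Assumption: coordinates are interleaved as x1, y1, x2, y2, ...
    The source only states that x and y are concatenated.
    """
    vec = []
    for x, y in points:
        vec.extend([float(x), float(y)])
    return vec


K_POINTS = 56                      # feature points per face (from the text)
N_VISEMES = 34                     # expressive visemes per emotion (from the text)
N_CONDITIONS = 22 + 1              # 22 OCC emotions plus neutral speech
N_IMAGES = N_VISEMES * N_CONDITIONS

# Made-up landmark coordinates, only to exercise the function.
landmarks = [(i, 2 * i) for i in range(K_POINTS)]
s = shape_vector(landmarks)
```

With 56 points the vector has 112 elements, and 34 visemes over 23 speaking conditions give the 782 images stated in the text.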

[56] Consider s_i and s_j, two different shape vectors to be aligned, with s_j then rotated by θ, scaled by h and translated by (t_x, t_y), resulting in the vector s_j^trans. The squared distance between s_i and s_j^trans can be written as:

(equation 2)

[57] The alignment of s_i and s_j can be performed by choosing appropriate values of θ, h and (t_x, t_y) so that this distance is minimal. In some cases, some points of interest can be considered more stable than others, in the sense that they are not easily displaced during articulatory movement. This is the case for the points at the corners of the eyes and on the nose, in contrast to those around the lips, for example. For this reason, equation 2 above was adapted to include a diagonal weighting matrix W, as a mechanism to increase the weight of the most stable points when determining the rotation, scaling and translation parameters.

(equation 3)

[58] For the present invention, the full set of points was divided into 3 groups and weighted according to their characteristics, as shown in Figure 4. The so-called "anchor points" are the landmarks with the greatest weight, weighted with w = 1 in the alignment process. They are located in regions of the face that do not deform during the articulatory movement of speech or the expression of emotions, such as points on the ear, the tip of the nose, the corners of the eyes and the upper points that outline the face.

[59] The "intermediate points" are landmarks that are not strongly affected during expressive speech, such as points on the forehead, the eyebrows, around the nose and a pair of points on the jaw. They are weighted with w = 0.5.

[60] Finally, the "dynamic points" are those strongly affected by the dynamics of expressive speech, mainly those on the lips and around the nose. A small weight (w = 0.2) is assigned to them to ensure that they have limited influence on the shape alignment operation.
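The weighted alignment of two shapes described above can be sketched as follows. This is a minimal illustration and not the patent's own formulation: equations 2 and 3 are not reproduced in this text, so the closed-form weighted least-squares solution below (which represents 2D points as complex numbers) is a standard technique swapped in for illustration. The function name `weighted_align` and the example points are hypothetical; the weights 0.5 and 0.2 come from the text, and the anchor weight of 1.0 is assumed.

```python
import cmath
import math


def weighted_align(ref, shape, weights):
    """Align `shape` to `ref` with a rotation, uniform scale and translation.

    Both shapes are lists of (x, y) tuples; `weights` holds one weight per
    point. The transform minimizes sum_i w_i * |a * z_i + t - r_i|^2, where
    the complex number a encodes scale (|a|) and rotation (arg a).
    """
    z_ref = [complex(x, y) for x, y in ref]
    z = [complex(x, y) for x, y in shape]
    w_sum = sum(weights)
    # Weighted centroids of both shapes.
    c_ref = sum(w * p for w, p in zip(weights, z_ref)) / w_sum
    c = sum(w * p for w, p in zip(weights, z)) / w_sum
    # Closed-form optimum for the combined rotation and scale.
    num = sum(w * (r - c_ref) * (p - c).conjugate()
              for w, r, p in zip(weights, z_ref, z))
    den = sum(w * abs(p - c) ** 2 for w, p in zip(weights, z))
    a = num / den
    t = c_ref - a * c
    return [((a * p + t).real, (a * p + t).imag) for p in z]


# Hypothetical example: two anchor points, one intermediate, one dynamic.
ref = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
weights = [1.0, 1.0, 0.5, 0.2]   # anchor weight 1.0 is an assumption
a_true = cmath.rect(2.0, math.radians(30))   # scale 2, rotation 30 degrees
t_true = complex(5.0, -1.0)
moved = [((a_true * complex(x, y) + t_true).real,
          (a_true * complex(x, y) + t_true).imag) for x, y in ref]
aligned = weighted_align(ref, moved, weights)
```

Because `moved` is an exact rotated, scaled and translated copy of `ref`, the alignment recovers `ref` up to floating-point error. With real landmarks the low weight on dynamic points keeps lip motion from skewing the fitted transform.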

[61] Normalizing the mean shape in each iteration (step A3.4 above) is necessary to guarantee convergence, preventing the mean shape from shrinking, rotating or drifting to infinity. Convergence is tested using the sum of the differences between the mean shapes obtained in different iterations after step (A3.3), which must be less than 0.0001. These steps result in a mean shape vector s̄ and a set of aligned shapes, one for each image in the database.
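Steps (A3.1) to (A3.6) and the convergence test can be sketched as the loop below. It is an illustrative reading under stated assumptions: points are complex numbers, the per-point weights are omitted for brevity, the "sum of the differences" is taken as the sum of point-wise distances between the means of two consecutive iterations, and the names `align`, `mean_shape` and `align_all` are hypothetical.

```python
import cmath
import math
import random


def align(ref, shape):
    """Unweighted least-squares similarity alignment of `shape` onto `ref`."""
    n = len(ref)
    c_ref = sum(ref) / n
    c = sum(shape) / n
    num = sum((r - c_ref) * (p - c).conjugate() for r, p in zip(ref, shape))
    den = sum(abs(p - c) ** 2 for p in shape)
    a = num / den                      # complex: scale and rotation
    return [a * (p - c) + c_ref for p in shape]


def mean_shape(shapes):
    """Point-wise mean of a list of shapes."""
    n = len(shapes)
    return [sum(pts) / n for pts in zip(*shapes)]


def align_all(shapes, tol=1e-4, max_iter=100, seed=0):
    standard = random.Random(seed).choice(shapes)        # (A3.1)
    aligned = [align(standard, s) for s in shapes]       # (A3.2)
    prev = None
    mean = standard
    for _ in range(max_iter):                            # (A3.6)
        raw_mean = mean_shape(aligned)                   # (A3.3)
        change = None if prev is None else sum(
            abs(a - b) for a, b in zip(raw_mean, prev))
        prev = raw_mean
        mean = align(standard, raw_mean)                 # (A3.4)
        aligned = [align(mean, s) for s in aligned]      # (A3.5)
        if change is not None and change < tol:
            break
    return mean, aligned


def similar(shape, scale, degrees, shift):
    """Return a rotated, scaled and shifted copy of `shape` (test helper)."""
    a = cmath.rect(scale, math.radians(degrees))
    return [a * p + shift for p in shape]


# Hypothetical data: three copies of one shape under different transforms.
base = [complex(0, 0), complex(4, 0), complex(4, 3), complex(1, 5)]
shapes = [base,
          similar(base, 2.0, 30, complex(5, -1)),
          similar(base, 0.5, -45, complex(-2, 3))]
mean, aligned = align_all(shapes)
```

Since the three example shapes differ only by a similarity transform, every aligned shape coincides with the mean after alignment. Re-anchoring the mean to the standard shape in each pass is what keeps its scale and orientation fixed.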

Obtaining the appearance vectors

[62] The appearance, or texture, of a region of interest (ROI) of an image refers to the pixel values in that region. The appearance vectors and the mean appearance vector are obtained as follows: (A4.1) Delimit a region of interest (ROI); (A4.2) Perform the "appearance alignment" by warping the images to their mean shape; (A4.3) Store the RGB information of the pixels of this ROI in a vector a, obtaining the appearance vector for an image; and (A4.4) Compute the mean appearance vector using equation 5.
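Steps (A4.3) and (A4.4) can be illustrated with a small sketch. The pixel ordering within the vector and the use of the arithmetic mean are assumptions, since equations 4 and 5 are not reproduced in this text; the function names and the pixel values are hypothetical.

```python
def appearance_vector(roi_pixels):
    """Concatenate the (R, G, B) triples of q ROI pixels into 3q values."""
    vec = []
    for r, g, b in roi_pixels:
        vec.extend([r, g, b])
    return vec


def mean_vector(vectors):
    """Element-wise arithmetic mean of equally sized vectors (assumed form)."""
    m = len(vectors)
    return [sum(column) / m for column in zip(*vectors)]


# Two made-up, already shape-aligned ROIs with q = 2 pixels each.
roi_1 = [(255, 0, 0), (0, 128, 0)]
roi_2 = [(55, 0, 100), (0, 0, 0)]
a1 = appearance_vector(roi_1)
a2 = appearance_vector(roi_2)
a_mean = mean_vector([a1, a2])
```

The element-wise mean is only meaningful because every ROI has been warped to the same mean shape, so that position i in each vector refers to the same facial location.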

[63] To carry out the statistical analysis of the distribution of appearance in the database, an "appearance alignment" procedure is needed to remove any variations caused by variation in shape. Each image is therefore warped to the mean shape s̄ (Stegmann, 2000) using the piecewise affine transformation technique (Glasbey; Mardia, 1998) (Figure 5), taking as reference the triangle mesh produced by Delaunay triangulation, whose vertices are the points of interest.
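The core of a piecewise affine warp is the mapping of a point from a triangle of the mean-shape mesh to the corresponding triangle in the original image. The sketch below shows that single-triangle mapping through barycentric coordinates. It is an illustration under assumptions: the triangulation itself and the pixel sampling are omitted, and the names `barycentric` and `map_point` and the triangle coordinates are hypothetical.

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def map_point(p, tri_mean, tri_image):
    """Map p from a mean-shape triangle to the matching image triangle.

    The same barycentric weights are applied to the vertices of the image
    triangle, which is exactly the affine map between the two triangles.
    """
    l1, l2, l3 = barycentric(p, tri_mean)
    x = l1 * tri_image[0][0] + l2 * tri_image[1][0] + l3 * tri_image[2][0]
    y = l1 * tri_image[0][1] + l2 * tri_image[1][1] + l3 * tri_image[2][1]
    return x, y


# Made-up triangles: one in the mean-shape frame, one in the source image.
tri_mean = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tri_image = [(10.0, 10.0), (14.0, 10.0), (10.0, 12.0)]
src_x, src_y = map_point((0.25, 0.25), tri_mean, tri_image)
```

In a full warp, each pixel of the mean-shape frame would be located in its mesh triangle, mapped this way to the source image, and filled with the color sampled there.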

[64] Figure 5(a) shows an image from the database and Figure 5(b) shows its associated shape. Figure 5(c) shows the triangle mesh used as reference for the piecewise affine transformation, and Figure 5(d) shows the resulting image warped to the mean shape, together with the region of interest (ROI) of the appearance vector.

[65] As a strategy to improve the quality of the final synthesized images, the face is divided into three regions, according to their importance for the expressiveness of the face and their variability during expressive speech. A different active appearance model (AAM) is built for each region: the full face, cheeks and lips, and lips only (Figure 6). This approach ensures that the regions that vary most in expressive speech carry the most relevant information in the synthesis process, which is described later.

[66] Thus, for an ROI containing q pixels, the appearance vector a is built by concatenating its pixel values p_i, i = 1, 2, ..., q. For a color image, considering the RGB color channels, each pixel is expressed as a triple of values for the red, green and blue channels: p_i = (p_iR, p_iG, p_iB), i = 1, 2, ..., q. The resulting appearance vector, with 3q elements, therefore has the following structure:

(equation 4)

[67] As a result of this process, the shape-independent appearance vector can be computed for every image in the database. The mean appearance vector obtained from the analysis of the entire dataset is then defined as:

(equation 5)

Obtaining the appearance/shape vectors

[68] Next, the appearance/shape vectors must be computed for an ROI of an image, so that the PCA operation, discussed later, can be applied. This operation is simply the concatenation of the shape vector s and the appearance vector a for a given ROI, motivated by the correlation that exists between the two variables:

(equation 6)

[69] The number of elements of c is therefore given by equation 7 below, where q is the number of RGB pixels in an ROI and k is the number of points of interest in a shape, each represented by its x and y coordinates.

(equation 7)

Obtaining the data matrix

[70] Considering the vector c defined by equation 6, the vector c_i of the i-th image in the database can be represented using the notation of the equation below, where i = 1, 2, ..., m is the index of the image in the base and j = 1, 2, ..., n is the index that identifies the element of the appearance/shape vector:

(equation 8)

[71] Taking into account the appearance/shape vectors of all images in the base, the mean appearance/shape vector is defined by the equation below, where ā is the mean appearance vector and s̄ is the mean shape vector:

(equation 9)

[72] The notation c̄_j, for j = 1, 2, ..., n, is adopted to refer to an element of the vector c̄. Finally, rewriting equation 6, the data matrix C can be defined, in which the data samples are arranged in the rows and the variables in the columns.

(equation 10)

Data standardization

[73] The shape/appearance vectors combine variables that have different units of measurement and distinct variance characteristics. While the RGB pixels take values from 0 to 255, the range of values of the x and y coordinates depends on the dimensions of the original image. The problem is that PCA is sensitive to units of measurement, and large differences between the variances would cause one of the variables to dominate the first principal components over the other, which makes it necessary to standardize the variances of both.
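The construction of the data matrix and the column standardization that paragraph [73] calls for can be sketched together. Several details are assumptions, because equations 6 to 10 are not reproduced in this text: the order of concatenation (appearance first, then shape), the element count n = 3q + 2k inferred from the definitions of q and k, and the population form of the standard deviation (division by m). The function names and sample values are hypothetical.

```python
import math


def combined_vector(a, s):
    """Concatenate appearance vector a and shape vector s (order assumed)."""
    return list(a) + list(s)


def standardize(C):
    """Standardize each column of C to zero mean and unit variance.

    Returns the standardized matrix H plus the column means and standard
    deviations. Constant columns are mapped to 0.0 to avoid dividing by zero.
    """
    m = len(C)
    columns = list(zip(*C))
    means = [sum(col) / m for col in columns]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / m)
            for col, mu in zip(columns, means)]
    H = [[(row[j] - means[j]) / stds[j] if stds[j] else 0.0
          for j in range(len(columns))] for row in C]
    return H, means, stds


# Made-up samples with q = 1 pixel and k = 1 point, so n = 3*1 + 2*1 = 5.
Q_PIXELS, K_POINTS = 1, 1
samples = [
    ([0, 10, 20], [100.0, 200.0]),
    ([255, 20, 40], [110.0, 260.0]),
    ([128, 60, 90], [90.0, 230.0]),
]
C = [combined_vector(a, s) for a, s in samples]   # m rows, n columns
H, means, stds = standardize(C)
```

After this step the pixel columns (range 0 to 255) and the coordinate columns (range set by the image size) contribute on the same scale, which is the condition PCA needs.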

[74] For example, one of the columns of matrix C, which represents a set of m observations of the j-th variable, is expressed as the vector t: (equation 11)

[75] To standardize this vector, the operation described in equation 12 below is performed for i = 1, 2, ..., m, where t̄ and σ are, respectively, the mean and the standard deviation obtained from equations 13 and 14. (equation 12) (equation 13) (equation 14)

[76] The result is the vector t_std = [t_1std, t_2std, ..., t_mstd]^T of standardized data, with unit variance. The transformation of the data matrix C into a matrix H of standardized data therefore proceeds as follows:
(A7.1) Obtain c̄_j, the mean value of the elements of column j, for j = 1, 2, ..., n;
(A7.2) Obtain the standard deviation of the same set;
(A7.3) Standardize the elements c_ij according to the mean and standard deviation obtained previously, yielding the standardized data h_ij; and
(A7.4) Compose the elements h_ij into a matrix H (m x n).

[77] Thus, considering the data matrix C above, the standard deviation σ_j of a column j is given by: (equation 15)

[78] Combining equations 9 and 15, the standardized elements h_ij can be defined according to equation 16 below; together they make up the standardized data matrix H (m x n): (equation 16) (equation 17)

PCA implementation
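As a preliminary to the PCA, the standardization steps (A7.1) to (A7.4) above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the population standard deviation (division by m) is assumed because equation 15 is not reproduced here, and the guard for constant columns is an addition that the source does not discuss.

```python
import numpy as np

def standardize(C):
    """Return (H, mean, std): a column-wise standardized copy of C."""
    C = np.asarray(C, dtype=float)
    mean = C.mean(axis=0)            # step A7.1: column means
    std = C.std(axis=0, ddof=0)      # step A7.2: population std (assumption)
    # Constant columns have zero std; dividing by 1 leaves them at zero
    # instead of producing NaN. This guard is not part of the source text.
    safe_std = np.where(std == 0.0, 1.0, std)
    H = (C - mean) / safe_std        # steps A7.3 and A7.4
    return H, mean, std

# Two toy variables on different scales (a pixel value and a coordinate).
C = np.array([[0.0, 10.0], [128.0, 20.0], [255.0, 60.0]])
H, mean, std = standardize(C)
```

After this step every column of H has zero mean and unit variance, so neither the pixel values nor the coordinates dominate the covariance matrix.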

[79] Active appearance models (AAM), which take both appearance and shape information into account, are based on principal component analysis (PCA) of the training set (Jackson, 2003). PCA is a type of analysis that identifies, in a multidimensional space, the directions in which the data show the greatest variance. It also provides information about the orthogonal directions that characterize the principal components of the data, i.e. the projection axes that are most important for expressing the variability of the data.

[80] Applying PCA to the standardized data matrix H consists of determining the eigenvectors of the covariance matrix Cov(H) (n x n), which, in matrix notation, is specified by the equation below. (equation 18)

[81] The PCA analysis thus proceeds as follows:
(A8.1) Obtain the covariance matrix Cov(H) of the matrix H; and
(A8.2) Determine the eigenvectors and eigenvalues of the covariance matrix obtained previously, resulting in the orthonormal vectors and the eigenvalues associated with them.

[82] The PCA analysis of the matrix H results in a set of m orthonormal vectors h_j and their eigenvalues λ_j (j = 1, 2, ..., m). When the vectors h_j are taken in decreasing order of their eigenvalues, they indicate the principal components of the variability of the data. In other words, when λ_1 > λ_2 > ... > λ_m, h_1 indicates the direction of greatest variability of the data in the multidimensional space, h_2 the second direction, or second principal component, and so on.

[83] The vectors h_i can be decomposed into principal components of appearance and shape, where e_i are vectors with 3q elements, representing the principal components of appearance, and f_i are vectors with 2k elements, representing the principal components of shape. (equation 19)

Generation of the shape and appearance models

[84] From the results of the PCA analysis, the diversity of appearance and shape in the samples of the data set is expressed according to the expressions below, in which a represents the appearance model and s the shape model. Here ā and s̄ are the mean appearance and shape vectors defined previously.

[85] D_a (3q x 3q) and D_s (2k x 2k) are diagonal matrices, defined further below, with the standard deviations obtained from equation 15 on the main diagonal and all other elements zero; α_i and β_i are the coefficients of the linear combination of the principal components of appearance and shape; and e_i and f_i are the principal components of the appearance and shape models, respectively.
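The PCA steps (A8.1) and (A8.2) of paragraph [81], and the split of each eigenvector into its appearance part e_i and shape part f_i, can be sketched as follows. This is only an illustration: the function names are hypothetical, the 1/(m - 1) normalization of the covariance matrix is an assumption because equation 18 is not reproduced in this text, and the split assumes that the appearance elements come first in each vector c.

```python
import numpy as np

def pca(H):
    """Eigen-decomposition of Cov(H), sorted by decreasing eigenvalue."""
    H = np.asarray(H, dtype=float)   # H is assumed standardized (zero-mean columns)
    m = H.shape[0]
    cov = H.T @ H / (m - 1)          # step A8.1; the 1/(m - 1) factor is an assumption
    eigvals, eigvecs = np.linalg.eigh(cov)   # step A8.2; eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]        # eigh returns ascending eigenvalues
    return eigvals[order], eigvecs[:, order]

def split_components(eigvecs, n_appearance):
    """Split each eigenvector (column) into appearance rows and shape rows."""
    return eigvecs[:n_appearance, :], eigvecs[n_appearance:, :]

# Toy standardized data: m = 50 samples, n = 6 variables (4 appearance, 2 shape).
rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.1])
H = (raw - raw.mean(axis=0)) / raw.std(axis=0)
eigvals, eigvecs = pca(H)
E, F = split_components(eigvecs, n_appearance=4)
```

For real images n = 3q + 2k is large and the n x n covariance matrix becomes expensive to form; a singular value decomposition of H yields the same directions without building it.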

[86] The process of generating the shape and appearance models thus follows the steps and equations below:
(A9.1) Obtain the diagonal matrices D_a and D_s, with the standard deviations obtained previously (equation 15) on their main diagonals; and
(A9.2) Determine the principal components of the appearance and shape models, e_i and f_i respectively, and define the appearance and shape models as given by equations 20 and 21, where α_i and β_i are the coefficients of the linear combination of the vectors e_i and f_i. (equation 20) (equation 21) (equation 22) (equation 23)

[87] The vectors e_i of equation 20 can be regarded as the vectorized version of the prototype images, or eigenfaces. The name suggests what the vectors represent: images that, when suitably combined, yield new facial images that were not originally present in the database. Figure 7 illustrates the mean appearance image ā (a) and its first three principal components, or eigenfaces: e_1 (b), e_2 (c) and e_3 (d), respectively. Note that equations 20 and 21 apply both to the synthesis and to the analysis of images and shapes, respectively.
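Equations 20 to 23 are not reproduced in this text. A form consistent with the definitions given in paragraphs [84] to [86] would be the following; it is a reconstruction offered as an assumption, not the wording of the original equations, and the indexing of the standard deviations assumes that the appearance elements precede the shape elements in the vector c.

```latex
\begin{align}
a   &= \bar{a} + D_a \sum_{i} \alpha_i \, e_i  && \text{(cf. equation 20)} \\
s   &= \bar{s} + D_s \sum_{i} \beta_i \, f_i   && \text{(cf. equation 21)} \\
D_a &= \operatorname{diag}(\sigma_1, \ldots, \sigma_{3q})        && \text{(cf. equation 22)} \\
D_s &= \operatorname{diag}(\sigma_{3q+1}, \ldots, \sigma_{3q+2k}) && \text{(cf. equation 23)}
\end{align}
```

Read in this way, multiplying by D_a or D_s undoes the standardization, so the linear combination of principal components is expressed back in pixel values and image coordinates before the mean vector is added.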

[88] Adjusting the weight coefficients, for example, makes it possible to synthesize new images that are "shape-independent", i.e. warped to the mean shape. Conversely, facial images can be warped to the mean shape and projected onto the space defined by the orthonormal basis formed by the vectors e_i. In this case, the weight coefficients are the result of the analysis of an image, and can be seen as a "shape-independent" code for facial images.
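The two directions described above, analysis (image to coefficients) and synthesis (coefficients to image), can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the basis is taken to be orthonormal as stated in the text, and the standard deviations on the diagonal of D_a are applied as an element-wise scaling.

```python
import numpy as np

def analyze(a, a_mean, sigma_a, E):
    """Project a shape-independent appearance vector onto the basis E."""
    return E.T @ ((a - a_mean) / sigma_a)

def synthesize(alpha, a_mean, sigma_a, E):
    """Rebuild an appearance vector from its weight coefficients."""
    return a_mean + sigma_a * (E @ alpha)

rng = np.random.default_rng(2)
n = 12                                         # toy length, 3q with q = 4 pixels
E, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthonormal columns stand in for e_i
a_mean = rng.uniform(0, 255, size=n)           # stands in for the mean appearance
sigma_a = rng.uniform(1, 10, size=n)           # stands in for the diagonal of D_a
a = rng.uniform(0, 255, size=n)                # an image already warped to the mean shape

alpha = analyze(a, a_mean, sigma_a, E)
a_rebuilt = synthesize(alpha, a_mean, sigma_a, E)
```

When all components are kept, as in this toy case, synthesis recovers the analyzed vector exactly; changing the entries of alpha before calling synthesize yields appearance vectors that were not in the training set.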

Expressive speech model

[89] The expressive speech model is built by projecting the samples of pt-br (Brazilian Portuguese) expressive visemes, for all 22 emotions of the OCC model, onto the multidimensional vector space defined by equations 20 and 21. Note that the model can be built for any language or any set of expressive visemes.

[90] In summary, the completed expressive speech model is characterized by the combination of the following data sets:
• Shape and appearance parameters (equations 20 and 21), including their principal component vectors;
• A set of "appearance coefficients" (α_i in equation 20), for each region of interest, representing the projection of the images of all phonetic-context-dependent expressive visemes, for a database extracted from the corpus, onto the multidimensional space defined by the principal components of the appearance model;
• A set of aligned shapes associated with each image of the database;
• A base image of the face, obtained by choosing an image from among the aligned shapes and warping it toward the mean shape s̄. In the embodiment of this invention, the image chosen was the one that, after alignment, showed the minimum Euclidean distance to the mean shape.

[91] As stated earlier, PCA makes it possible to identify the principal components that are most relevant for expressing the variability of the data and to discard the less relevant ones, in a process called dimensionality reduction. The expressive speech model can be configured to meet different requirements regarding the desired animation quality versus the size of the expressive speech model database. The full-quality configuration of the model is the one in which no dimensionality reduction is performed and all principal components are kept. In this case, if m images are used to build the model, the number of prototype images generated in the model database is also m. An interesting aspect of this modeling, however, is the possibility of encoding the original images of the training set using a compact representation. Compression is achieved by choosing the p most relevant principal components of the model. The database images are then projected onto a reduced number of axes, resulting in a compact representation of the face model.
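The selection of the p most relevant principal components can be sketched as follows. The function names are hypothetical, and the retained-variance fraction is a common aid for choosing p that is added here for illustration; the text does not state how p is chosen.

```python
import numpy as np

def keep_top_components(eigvals, eigvecs, p):
    """Keep the p components (columns) with the largest eigenvalues."""
    order = np.argsort(eigvals)[::-1][:p]
    return eigvals[order], eigvecs[:, order]

def retained_variance(eigvals, p):
    """Fraction of the total variance carried by the p largest eigenvalues."""
    ordered = np.sort(eigvals)[::-1]
    return float(ordered[:p].sum() / ordered.sum())

# Toy spectrum with five components, deliberately unsorted.
eigvals = np.array([4.0, 0.5, 3.0, 0.5, 2.0])
eigvecs = np.eye(5)
top_vals, top_vecs = keep_top_components(eigvals, eigvecs, p=3)
fraction = retained_variance(eigvals, p=3)
```

In this toy case three of five components carry 90% of the variance, so each image would be stored as 3 coefficients instead of 5, at the cost of the detail carried by the discarded components.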

[92] The expressive speech model thus built is used in the next stage, which describes the expressive speech synthesis process; it therefore needs to be built only once for a given face and its phonetic-context-dependent visemes.

[93] The original database is used to obtain the shape and appearance models and, additionally, as a source of context-dependent viseme samples for implementing the expressive speech model. This approach ensures that the training set contains appearance and shape samples for all expressive visemes that are relevant to the synthesis process, avoiding visual quality problems in the synthesis that could be attributed to the lack of specific facial configurations in the original database.

[94] However, this strategy is not mandatory; it can alternatively be implemented using a set of training images for the model that differs from the set used to select the context-dependent visemes. Extra images of these visemes can thus be added at any time to improve the generated face model, or to obtain dimensionality reduction with a smaller training set. Another possible strategy is to obtain the shape and appearance models from image samples of one face and to use the visemes of a different face to build the expressive speech model.

Expressive speech synthesis process

[95] The second stage (B) of the method describes the synthesis process starting from the timed phonetic transcription of the speech to be animated, including the mapping to phonetic-context-dependent visemes and the synthesis of the facial images corresponding to the key poses of the animation. The flowchart of Figure 8 gives an overview of the expressive speech synthesis process, showing the main steps involved:
(B1) Processing of the timed phonetic transcription;
(B2) Extraction of the animation intervals;
(B3) Conversion of phonemes to context-dependent visemes;
(B4) Synthesis of the key-pose appearances;
(B4.1) Synthesis of the appearance of the whole face;
(B4.2) Synthesis of the appearance of the "lips+cheeks" ROI and overlay on the previous appearance;
(B4.3) Synthesis of the appearance of the "lips" ROI and overlay on the previous appearance;
(B4.4) Merging of the previous appearance into a base face;
(B5) Synthesis of the final key poses;
(B6) Shape modulator;
(B7) Transition between two key poses; and
(B8) Composition and presentation.

[96] As can be seen, the synthesis requires four main inputs:
• Emotion to be synthesized: in this invention, it is selected from the set of 22 emotions of the OCC model, such as "joy", "anger" or "hope".
• Speech audio: the file with the speech audio to be articulated by the facial animation, which can be recorded from real speech or synthesized.
• Timed phonetic transcription: an input file that specifies the sequence of phones that make up the speech audio.
• Shape control: optionally, the synthesis methodology can process shape control parameters used to implement more sophisticated control of head orientation and of other facial elements, such as eyes and eyebrows.

[97] In this stage, the present invention implements a rule-based keyframe animation synthesis strategy in which the facial control parameters at specific frames of the animation are obtained from the timed phonetic description of the speech and from the expression to be synthesized. The key poses are synthesized from the appearance model parameters in the expressive speech model database, together with the aligned shape information generated by the model or by an external "shape modulator". The intermediate frames between two adjacent poses are synthesized by means of a nonlinear image transition algorithm. The last step is to concatenate the animated frames and the speech audio to obtain the expressive speech facial animation.

Processing of the timed phonetic transcription

[98] The timed phonetic transcription can be obtained in several ways, such as manual segmentation, automatic segmentation of an input audio (using specific software), or as a by-product of automatic speech synthesis programs. It provides the sequence of phones (segments of speech sound) that will be used in the speech synthesis, together with their intervals and boundary times. Figure 9 illustrates the transcription of an utterance represented by a sequence of k phonemes, indexed i = 1, 2, ..., k. Each phoneme is delimited by a start instant and an end instant, and its duration is the difference between the two.

Extraction of the animation intervals

[99] From the timing information obtained in the preceding transcription step, it is possible to define the duration of the animation, the number of frames required, and the time instants associated with the keyframes. The keyframes, also called key visemes, are chosen to represent inflection points in the trajectory of the articulatory movement, mainly of the lips, teeth and tongue. In other words, a key viseme is synthesized to represent a given phone and corresponds to the final excursion of its characteristic articulatory movement, so that the frame following a key viseme marks the beginning of the trajectory toward a new key viseme (Figure 10).

[100] The dynamics of articulatory movement in real speech are complex, and coarticulation effects between neighboring speech segments considerably affect the typical articulatory pattern, so that inflection points may occur at any moment within the duration of a phoneme. As a simplification of these dynamics, the present invention adopts a strategy that associates each key viseme with the time instant corresponding to the midpoint of the duration interval of the respective phoneme.
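The midpoint rule and the frame bookkeeping described in paragraphs [99] and [100] can be sketched as follows. The class and function names are hypothetical, the transcription is made up, and the frame rate of 32 frames per second is a toy value chosen only so that the arithmetic is exact; the text does not specify a frame rate.

```python
from dataclasses import dataclass
import math

@dataclass
class Phone:
    label: str
    start: float   # seconds
    end: float     # seconds

def key_times(phones):
    """Key-viseme instant of each phone: the midpoint of its interval."""
    return [(p.start + p.end) / 2.0 for p in phones]

def frame_count(phones, fps):
    """Number of frames needed to cover the whole utterance."""
    duration = phones[-1].end - phones[0].start
    return math.ceil(duration * fps)

def key_frame_indices(phones, fps):
    """Nearest frame index for each key-viseme instant."""
    t0 = phones[0].start
    return [round((t - t0) * fps) for t in key_times(phones)]

# Hypothetical timed transcription (labels and times are made up).
phones = [Phone("k", 0.0, 0.125), Phone("a", 0.125, 0.375),
          Phone("z", 0.375, 0.5), Phone("a", 0.5, 0.75)]
times = key_times(phones)
n_frames = frame_count(phones, fps=32)
indices = key_frame_indices(phones, fps=32)
```

The frames between two consecutive indices are the intermediate frames that the transition step has to generate.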

Conversion of phonemes to context-dependent visemes

[101] The key poses of the animation correspond to the synthesized images of context-dependent expressive visemes. Synthesizing these poses depends on converting the phoneme sequence provided by the timed phonetic transcription into these visemes. The conversion comprises the conversion of the vowel phonemes and the conversion of the consonant phonemes.

[102] Table 2 shows that most vowel visemes are independent of the phonetic context, the only exception being the phoneme [i]. The remaining vowel phonemes are therefore converted according to the second column of that table, while the phoneme [i], when found in the sequence, is a special case whose triphone (the sequence in which it occurs) must be analyzed. If the phonetic context [tit] or [ʃiʃ] is identified, the phoneme [i] is converted to the viseme <i2>. For all other contexts, it is converted to the viseme <i1>.
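The vowel conversion rule of paragraph [102] can be sketched as follows. Everything specific in this sketch is a placeholder: the context pairs that select <i2>, the vowel table standing in for Table 2, the viseme labels other than <i1> and <i2>, and the silence marker are all illustrative assumptions, not values taken from the patent tables.

```python
# Placeholder contexts: (left neighbor, right neighbor) pairs that select <i2>.
# They are illustrative stand-ins, not the contexts listed in the patent tables.
I2_CONTEXTS = {("t", "t"), ("S", "S")}

# Placeholder vowel table standing in for the second column of Table 2.
VOWEL_VISEMES = {"a": "<a>", "e": "<e>", "o": "<o>", "u": "<u>"}

def convert_vowels(phonemes, silence="#"):
    """Map vowel phonemes to visemes; other phonemes are returned as None."""
    padded = [silence] + list(phonemes) + [silence]
    visemes = []
    for idx in range(1, len(padded) - 1):
        left, current, right = padded[idx - 1], padded[idx], padded[idx + 1]
        if current == "i":
            # Context-dependent case: inspect the triphone around [i].
            visemes.append("<i2>" if (left, right) in I2_CONTEXTS else "<i1>")
        else:
            # Context-independent vowels use a direct table lookup.
            visemes.append(VOWEL_VISEMES.get(current))
    return visemes

result = convert_vowels(["t", "i", "t", "a", "p", "i"])
```

Padding the sequence with a silence marker lets the first and last phonemes be analyzed with the same triphone logic as the others.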

[103] The third column of Table 1, in turn, lists the phonetic contexts for pt-br and their corresponding consonant phonemes, as well as the visemes used in the conversion. This invention adopts the pentaphone strategy (a sequence of 5 phonemes), analyzing two phonemes to the right and two to the left of the consonant phoneme to be converted. The conversion rules are those defined by Costa (2009).

Synthesis of the key-pose appearances

[104] In a manner similar to the synthesis of the appearance vectors, the synthesis of the key-pose appearances can be described by the following steps (see Figure 11):
(B4.1) Synthesis of the appearance of the whole face;
(B4.2) Synthesis of the appearance of the "lips+cheeks" ROI and overlay on the previous appearance;
(B4.3) Synthesis of the appearance of the "lips" ROI and overlay on the previous appearance;
(B4.4) Merging of the previous appearance into a base face.
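Returning to the pentaphone strategy of paragraph [103], the extraction of the five-phoneme window around each consonant can be sketched as follows. The function names, the silence marker, the rule table and its viseme label are hypothetical; the actual conversion rules are those of Costa (2009) and are not reproduced here.

```python
def pentaphones(phonemes, consonants, silence="#"):
    """Yield (index, window) per consonant: 2 left + the consonant + 2 right."""
    padded = [silence, silence] + list(phonemes) + [silence, silence]
    for idx, phoneme in enumerate(phonemes):
        if phoneme in consonants:
            # padded is shifted by 2, so this slice is centered on phonemes[idx].
            yield idx, tuple(padded[idx:idx + 5])

def convert_consonants(phonemes, consonants, rules, default="<?>"):
    """Look up each consonant's pentaphone in a rule table."""
    return {idx: rules.get(window, default)
            for idx, window in pentaphones(phonemes, consonants)}

# Hypothetical rule table with a single made-up entry.
rules = {("#", "#", "p", "a", "t"): "<p1>"}
windows = list(pentaphones(["p", "a", "t", "a"], consonants={"p", "t"}))
converted = convert_consonants(["p", "a", "t", "a"], {"p", "t"}, rules)
```

A real rule table would more likely match on classes of neighboring phonemes than on exact five-symbol sequences; the exact-match dictionary is used here only to keep the sketch short.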

[105] Figure 11(e) shows the final result of the synthesis of an appearance key pose. As in the creation of the synthesis model described earlier, the base face is a full-size video frame warped to the mean shape. Figure 11(d) shows the region in which the appearance key pose is inserted into the base face. The composition of two images I1 and I2 was implemented using an "alpha blending mask" operation: I1·α + I2·(1−α), with α varying gradually from 0 to 1 at the boundaries of the combined regions. As an example, Figure 11(f) shows this operation in the lips-and-cheeks region. The appearance model parameters, for each key pose and for each ROI, are retrieved from the expressive speech model database by means of a pair of index keys: emotion label and context-dependent viseme.

The appearance model parameters consist of:
• A set of l principal components of a prototype image, e_i, with i = 1, 2, ···, l;
• The original standard deviations of the data, Da;
• A set of l coefficients a_i;
• The mean appearance vector ā.
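The alpha blending operation of paragraph [105], I1·α + I2·(1−α), can be sketched as follows. This is a simplified illustration on single-channel images stored as nested lists; the helper names `alpha_blend` and `ramp_row` and the linear horizontal ramp are assumptions, since the source only states that α varies gradually from 0 to 1 at the region boundaries.

```python
def alpha_blend(i1, i2, alpha):
    """Per-pixel I1*alpha + I2*(1 - alpha) for same-sized 2D grids."""
    return [
        [p1 * a + p2 * (1.0 - a) for p1, p2, a in zip(row1, row2, row_a)]
        for row1, row2, row_a in zip(i1, i2, alpha)
    ]


def ramp_row(width, border):
    """Hypothetical mask row: rises linearly from 0 to 1 over `border` pixels."""
    return [min(1.0, x / border) for x in range(width)]


# Blend a bright ROI (I1) onto a dark base face (I2) across a 2-pixel border.
roi = [[100.0] * 4]
base = [[0.0] * 4]
mask = [ramp_row(4, 2)]
blended = alpha_blend(roi, base, mask)
```

Where the mask is 1 the ROI appearance is kept unchanged, and where it is 0 the base face shows through, so the seam between the two regions is smoothed.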

[106] From these definitions, the appearance A of the key pose of an ROI is computed according to the equation below: [equation 24]

[107] The result of the key-pose synthesis is the sequence of full-size video frames corresponding to the "shape-independent" expressive visemes dependent on the phonetic context, i.e. warped to the mean shape.

Synthesis of the final key poses

[108] This step consists of warping the appearance key poses obtained previously to the final shapes they must assume, as described further below. Its input is a sequence of shape vectors s_t associated with the occurrences of the key poses. When there is no external shape control, the final synthesized shapes of the key poses are the aligned shapes associated with each expressive viseme dependent on the phonetic context present in the expressive speech model database. In this synthesis mode, the expressive visemes are presented in the facial animation according to the specified emotion label; however, the facial animation does not present the head movement, eye blinking or eyebrow frowning that accompany expressive speech. In other words, no other non-verbal signals outside the lip region are programmed.
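Equation 24, referred to in paragraph [106], is not reproduced in this text. As a purely illustrative sketch, a form commonly used for PCA models built on standardized data, and compatible with the parameters listed in paragraph [105], is A = ā + Da·Σ a_i·e_i. This form and the function name `reconstruct_appearance` are assumptions, not the equation of the source.

```python
def reconstruct_appearance(mean, std, components, coeffs):
    """Assumed form: A = mean + std * sum_i(coeffs[i] * components[i]).

    mean, std : per-element mean appearance and standard deviations
    components: list of principal-component vectors e_i
    coeffs    : list of coefficients a_i, one per component
    """
    n = len(mean)
    # Linear combination of the principal components.
    combo = [sum(a * e[k] for a, e in zip(coeffs, components)) for k in range(n)]
    # Undo the standardization: rescale by the deviations and add the mean.
    return [m + s * c for m, s, c in zip(mean, std, combo)]


# Toy example with a two-element appearance vector.
appearance = reconstruct_appearance(
    mean=[10.0, 20.0],
    std=[2.0, 4.0],
    components=[[1.0, 0.0], [0.0, 1.0]],
    coeffs=[3.0, -1.0],
)
```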

Shape modulator

[109] Optionally, shape control parameters can be supplied by a shape modulator that processes external inputs to obtain specific commands for the orientation of the head or of other facial elements, such as the eyes and eyebrows, in order to lend more realism to the generated animation. In this invention, the shape modulator uses the shape model available in the expressive speech model; that is, the final shape of each key pose corresponds to the shape of the original viseme from the image database after it has been aligned with the mean shape.

Transition between two key poses

[110] The input to the step that synthesizes the final animation is the sequence of synthesized key poses (warped to the final shape) and their corresponding durations, as reported earlier. The transition between two key poses is performed by generating the animation frames that will be placed between them. Thus, given two adjacent key poses, the transition process determines the functions that warp the first pose toward the second, and vice versa, as indicated in Figure 12. For each direction, the transition is divided into intermediate steps that correspond to the frames to be placed between two adjacent key poses. The results of the transformation in the two directions are combined according to a proportion that is a function of time. Considering a normalized interval between two adjacent key poses, the interpolation operation can be defined as below, where F is the final synthesized animation frame, K_f(t) is the image warped in the forward direction at instant t, K_b(t) is the image warped in the opposite direction at instant t, and t is the normalized time variable for any interval between two key poses, 0 < t < 1. [equation 25]

[111] The warping is guided by the definition of a correspondence map of points of interest between the source and target images. From this map it is possible to compute a warping function that defines the spatial relationship between two points in both images, using the piecewise affine transformation strategy, based on Delaunay triangulation to generate a mesh of triangles whose vertices are the points of interest. The same triangulation is applied to the source shape and the target shape.
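Equation 25 is not reproduced in this text. The sketch below assumes the usual cross-dissolve form F(t) = (1−t)·K_f(t) + t·K_b(t), which matches the description of a combination "according to a proportion that is a function of time"; the exact weighting used by the invention may differ. The names `cross_dissolve` and `intermediate_times` are hypothetical, and the two inputs are assumed to be already warped images of the same size.

```python
def intermediate_times(n_frames):
    """Normalized instants for n_frames evenly spaced between two key poses."""
    return [j / (n_frames + 1) for j in range(1, n_frames + 1)]


def cross_dissolve(k_forward, k_backward, t):
    """Assumed blend: (1 - t) * K_f(t) + t * K_b(t), applied per pixel."""
    return [
        [(1.0 - t) * pf + t * pb for pf, pb in zip(row_f, row_b)]
        for row_f, row_b in zip(k_forward, k_backward)
    ]


# One-row toy images standing in for the two warped key poses.
times = intermediate_times(3)
frame = cross_dissolve([[0.0, 100.0]], [[100.0, 0.0]], times[0])
```

With this weighting the forward-warped image dominates near the first key pose and the backward-warped image dominates near the second one.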

[112] Next, for each triangle generated, a pixel map from the source to the target triangle mesh is computed. The trajectory described by the points of interest during the transition strongly affects the visual perception of the articulatory movements of speech, so they should follow a smooth, non-linear interpolation curve in the time domain in order to model the dynamics observed in real speech. The trajectories of the points during successive warps are therefore modeled by the Hermite parametric interpolation curve, which provides G1 continuity between two successive frames and ensures a zero derivative at the time instants associated with the key poses.
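The per-triangle pixel map mentioned at the start of paragraph [112] can be sketched with barycentric coordinates: a point keeps the same weights with respect to the triangle vertices before and after the warp, which is equivalent to an affine transformation per triangle. The function names are hypothetical, and the sketch maps a single point rather than a whole image.

```python
def barycentric(p, tri):
    """Barycentric weights of point p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def map_point(p, src_tri, dst_tri):
    """Map p from the source triangle to the corresponding target triangle."""
    w = barycentric(p, src_tri)
    x = sum(wi * v[0] for wi, v in zip(w, dst_tri))
    y = sum(wi * v[1] for wi, v in zip(w, dst_tri))
    return x, y


# The target triangle is the source triangle scaled by a factor of 2.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
mapped = map_point((0.25, 0.25), src, dst)
```

Because the same triangulation is applied to both shapes, each source triangle has exactly one counterpart in the target mesh, so the mapping is defined for every point inside the mesh.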

[113] For each coordinate x and y of a specific point of interest, the interpolation curve is obtained from the solution of the set of equations below, where x_i(t) and y_i(t) are the Hermite functions for the x and y coordinates of the i-th point of interest, i = 1, 2, ···, k; S_xi and S_yi are the x and y coordinates, respectively, of a point of interest in the source image, and T_xi and T_yi are those in the target image; and t is the independent parametric variable normalized with respect to the interval between two key poses.
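The set of equations referred to in paragraph [113] is not reproduced in this text. A minimal sketch, assuming the standard cubic Hermite basis with both end tangents set to zero (which yields the zero derivative at the key poses required by paragraph [112]), is given below for one coordinate of one point of interest. The function name `hermite_zero_tangent` is hypothetical.

```python
def hermite_zero_tangent(source, target, t):
    """Cubic Hermite interpolation between source and target with zero
    derivative at t = 0 and t = 1 (t is the normalized parameter)."""
    h00 = 2 * t**3 - 3 * t**2 + 1   # weight of the source coordinate
    h01 = -2 * t**3 + 3 * t**2      # weight of the target coordinate
    return h00 * source + h01 * target


# x coordinate of one point of interest moving from 0 to 10.
trajectory = [hermite_zero_tangent(0.0, 10.0, t) for t in (0.0, 0.25, 0.5, 1.0)]
```

Compared with linear interpolation, the point leaves the first key pose slowly, moves fastest at the midpoint and slows down again when approaching the second key pose.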

Composition and presentation

[114] Once the transition images between key poses have been defined, they are concatenated and associated with the speech audio, generating a video with the photorealistic facial animation of expressive speech.

[115] The present invention therefore allows the synthesis of a broad spectrum of expressive speech facial animations, combined with a robust speech articulation model that encompasses not only the modeling provided by the context-dependent visemes but also the non-linear transition between two key poses.

[116] A key aspect of this method is the "shape-independent" synthesis of the visemes, which makes it possible to work with relevant information on the articulatory movement of speech, such as the presence of teeth and tongue, the contour of the lips and the expression of the eyes, as well as subtle changes in the texture of the cheek or forehead. Additionally, and more importantly, this representation is able to express the variation of these elements across the most diverse expressions of emotion. Together with shape modulation, the developed method constitutes a flexible proposal for the synthesis of photorealistic expressive speech facial animation. Figures 13 and 14 illustrate the reach of the method by comparing images extracted from video (top row) with images corresponding to the same emotion/viseme pair synthesized by the present method (bottom row).

Claims

1. 2D image-based facial animation synthesis method, characterized in that the synthesized speech is accompanied by the expression of emotion and comprises the following steps: (A) Generation of an expressive facial image synthesis model: (A1) Sample capture; (A2) Selection of expressive visemes and extraction of shape parameters; (A3) Shape alignment operation with the mean shape; (A4) Obtaining the appearance vectors; (A5) Obtaining the appearance/shape vectors; (A6) Obtaining the data matrix; (A7) Data standardization; (A8) PCA implementation; (A9) Generation of the shape and appearance models; (A10) Expressive speech model; (B) Expressive speech synthesis process: (B1) Processing of the timed phonetic transcription; (B2) Extraction of the animation intervals; (B3) Conversion of phonemes to context-dependent visemes; (B4) Synthesis of the key-pose appearances; (B5) Synthesis of the final key poses; (B6) Shape modulator; (B7) Transition between two key poses; and (B8) Composition and presentation.

2. Method according to claim 1, characterized in that the expressive speech sample capture substep (A1) comprises the acquisition of facial images in artificial postures for each of the 22 emotions of the OCC model and in neutral speech.

3. Method according to claim 2, characterized in that it comprises all the phonemes of the Portuguese language spoken in Brazil.

4. Method according to claim 1, characterized in that the substep (A2) selects the phonemes of interest by observing the occurrence of inflection points of the articulatory speech movements observed in the lips, tongue and teeth, and associates with each facial image information comprising the x and y coordinates of 56 characteristic points of the face.

5. Method according to claim 1, characterized in that the substep (A3) of shape alignment with the mean shape comprises the following substeps: (A3.1) Randomly selecting a shape from the database, which will be the standard shape for normalization, in order to promote convergence of the algorithm; (A3.2) Aligning each of the other shapes in the database to the standard shape; (A3.3) Computing the mean shape of the aligned shapes; (A3.4) Normalizing the orientation, scale and origin of the mean shape to the standard shape; (A3.5) Realigning all shapes to the mean shape; (A3.6) Repeating steps (A3.3), (A3.4) and (A3.5) until convergence.

6. Method according to claim 1, characterized in that the substep (A4) for obtaining the appearance vectors comprises the following substeps: (A4.1) Delimiting a region of interest (ROI); (A4.2) Performing the "appearance alignment" by warping the images to their mean shape; (A4.3) Storing the RGB information of the pixels of this ROI in a vector ti, obtaining the appearance vector for an image; (A4.4) Computing the mean appearance vector by means of equation 4.

7. Method according to claim 1, characterized in that the substep (A5) comprises the concatenation of the shape and appearance vectors for a given ROI.

8. Method according to claim 1, characterized in that the rows of the data matrix obtained in substep (A6) comprise the data samples and the columns comprise the variables.

9. Method according to claim 1, characterized in that the data standardization substep (A7) comprises the following substeps: (A7.1) Obtaining c̄_j, the mean value of the elements of a column j, for j = 1, 2, ···, n; (A7.2) Obtaining the standard deviation of this same set; (A7.3) Standardization of the elements c_ij according to the means and standard deviations obtained previously, obtaining the standardized data h_ij; (A7.4) Composition of the elements h_ij into the matrix H (m × n).

10. Method according to claim 1, characterized in that the PCA implementation substep (A8) comprises the following substeps: (A8.1) Obtaining the covariance matrix Cov(H) from the matrix H; (A8.2) Determination of the eigenvectors and eigenvalues of the covariance matrix obtained previously, resulting in the orthonormal vectors and the eigenvalues associated with these vectors.

11. Method according to claim 1, characterized in that the substep (A9) of generation of the shape and appearance models comprises the following substeps: (A9.1) Obtaining the diagonal matrices Da and Ds, with the standard deviations obtained previously (equation 15) on their main diagonals; (A9.2) Determination of the principal components of the appearance and shape models, e_i and f_i respectively, and definition of the appearance models as presented by equations 20 and 21, where the coefficients a_i and β_i represent the coefficients of the linear combination of the vectors f and e.

12. Method according to claim 1, characterized in that the substep (B1) obtains the sequence of phonetic segments of the speech, their intervals and boundary times.

13. Method according to claim 1, characterized in that the substep (B2) defines the animation duration, the number of frames required and the time periods associated with the keyframes.

14. Method according to claim 1, characterized in that the substep (B3) adopts the pentaphoneme strategy (sequence of 5 phonemes) and analyzes two phonemes to the right and two to the left of the consonantal phoneme to be converted.

15. Method according to claim 1, characterized in that the substep (B4) for generating the shape and appearance models comprises the following substeps: (B4.1) Synthesis of the appearance of the whole face; (B4.2) Synthesis of the appearance of the "lips+cheeks" ROI and superposition onto the previous appearance; (B4.3) Synthesis of the appearance of the "lips" ROI and superposition onto the previous appearance; (B4.4) Joining of the previous appearance to a base face; (B5) Synthesis of the final key poses.

16. Method according to claim 1, characterized in that the substep (B5) comprises the warping of the appearance key poses obtained in the previous step to their final shapes.

17. Method according to claim 1, characterized in that the substep (B6) comprises the processing of external inputs for controlling the orientation of the head, eyes or eyebrows.

18. Method according to claim 1, characterized in that the substep (B7) generates intermediate animation frames between two key poses from the mutual warping of one key pose toward the other, the warped images being combined according to a proportion that is a function of time between said key poses.

19. Method according to claim 1, characterized in that the substep (B8) comprises the recognition of the final key poses and their intermediate frames, associating them with speech audio, and generating a video with facial animation of expressive speech.

20. Use of the 2D image-based facial animation synthesis method as defined in claims 1 to 19, characterized in that it is for the creation of characters for games/movies, virtual agents, video compression (in videoconferences) and studies of audiovisual perception.

21. Use of the 2D image-based facial animation synthesis method as defined in claims 1 to 19, characterized in that it is for training skills in the production, recognition and interpretation of speech and facial expressions.