FR2906056A1

FR2906056A1 - METHOD AND SYSTEM FOR ANIMATING A REAL-TIME AVATAR FROM THE VOICE OF AN INTERLOCUTOR

Info

Publication number: FR2906056A1
Application number: FR0608078A
Authority: FR
Inventors: Laurent Ach; Serge Viellescaze; Benoit Morel
Original assignee: LA CANTOCHE PRODUCTION SA
Current assignee: LA CANTOCHE PRODUCTION SA
Priority date: 2006-09-15
Filing date: 2006-09-15
Publication date: 2008-03-21
Anticipated expiration: 2026-09-15
Also published as: WO2008031955A2; US20090278851A1; EP2059926A2; FR2906056B1; WO2008031955A3

Abstract

A method and a system for animating, on a screen (3, 3', 3'') of a mobile device (4, 4', 4''), an avatar (2, 2', 2'') provided with a mouth (5, 5'), from a sound input signal (6) corresponding to the voice (7) of an interlocutor (8) in a telephone communication. The sound input signal is transformed in real time into an audio and video stream in which the movements of the avatar's mouth are synchronized with the phonemes detected in said sound input signal, and the avatar is animated consistently with said signal through changes of attitude and movements derived from analysis of said signal, so that the avatar appears to speak in real time, or substantially in real time, in place of the interlocutor.

Description

METHOD AND SYSTEM FOR ANIMATING AN AVATAR IN REAL TIME FROM THE VOICE OF AN INTERLOCUTOR

The present invention relates to a method for animating an avatar in real time from the voice of an interlocutor. It also relates to a system for animating such an avatar. The invention finds a particularly important, although not exclusive, application in the field of mobile devices such as mobile phones or, more generally, portable personal communication devices or PDAs (from the English Personal Digital Apparatus). Improving mobile phones, their appearance and the quality of the images and sound they convey is a constant concern of the manufacturers of this type of device.

Users, for their part, are particularly sensitive to the personalization of this tool, which has become an essential means of communication. However, even though its functions have multiplied, since it now allows the storage of sounds and images, in particular photographs, in addition to its primary function as a telephone, it nevertheless remains a limited platform. In particular, it cannot display high-definition images, which in any case could not be viewed properly because of the small size of its screen.

Furthermore, many services accessible from mobile phones, which until now operated only in audio mode, now have to meet a demand for video telephony (messaging services, customer call centers, ...). The providers of these services often have no ready solution for moving from audio to video and/or do not wish to broadcast the image of a real person. One solution to these problems is therefore to turn to avatars, that is, graphic images that are schematic and less complex, representing one or more users.

Such graphics can then be integrated into the telephone beforehand and called up when needed during a telephone conversation.

A system and a method are thus known (WO 2004/053799) for implementing avatars in a mobile telephone, allowing them to be created and modified using the XML standard (Extensible Markup Language). Such a system does not, however, make it possible to control the facial expressions of the avatar as a function of the interlocutor, in particular in a synchronized manner. At most, the prior art (EP 1 560 406) offers programs for modifying the state of an avatar in a simple manner on the basis of external information generated by a user, but without the finesse and speed required when the avatar must behave in perfect synchronization with the sound of a voice. Current conversational technologies and programs using avatars, such as those implementing the program developed by the American company Microsoft called Microsoft Agent, do not in fact make it possible to reproduce efficiently the behavior of an avatar in real time with respect to a voice on a portable device of limited capacity such as a mobile telephone.

The present invention aims to provide a method and a system for animating an avatar in real time that meet the requirements of practice better than those previously known, in particular in that it allows real-time animation of the mouth and body of an avatar on a mobile device of reduced capacity such as a mobile phone, with excellent synchronization of movements.

With the invention it becomes possible, while operating in the standard environment of computer or mobile communication terminals, and without installing specific software components in the mobile phone, to obtain a real-time or near-real-time animation of the avatar that is consistent with the input signal, solely by detecting and analyzing the voice. The aesthetic and artistic quality given to the avatars and their movements when they were created is also preserved, at low cost and with excellent reliability.

To this end, the present invention proposes in particular a method for animating, on a mobile device screen, an avatar provided with a mouth, from a sound input signal corresponding to the voice of an interlocutor in a telephone communication, characterized in that the sound input signal is transformed in real time into an audio and video stream in which the movements of the avatar's mouth are synchronized with the phonemes detected in said sound input signal, and the avatar is animated consistently with said signal through changes of attitude and movements derived from analysis of said signal, so that the avatar appears to speak in real time, or substantially in real time, in place of the interlocutor.
In advantageous embodiments, one and/or another of the following provisions is also used:

- the avatar is chosen and/or configured through an online service on the Internet;
- the mobile device is a mobile phone;
- in addition to the phonemes, the sound input signal is analyzed in order to detect, and use for the animation, one or more additional parameters called level 1 parameters, namely periods of silence, periods of speech and/or other elements contained in said sound signal, taken from prosody, intonation, rhythm and/or tonic accent;
- to animate the avatar, elementary sequences are used, consisting of images generated by a 3D rendering computation or generated from drawings;
- elementary sequences are loaded into memory at the start of the animation and kept in said memory for the whole duration of the animation, for several simultaneous and/or successive interlocutors;
- the elementary sequence to be played is selected in real time, according to previously calculated and/or determined parameters;
- the list of elementary sequences being common to all the avatars usable in the mobile device, an animation graph is defined in which each node represents a point or state of transition between two elementary sequences, each connection between two transition states being unidirectional, and all the elementary sequences connected through the same state having to be visually compatible with the passage from the end of one elementary sequence to the beginning of the other;
- each elementary sequence is duplicated so that the character can be shown speaking or silent depending on whether or not a voice sound is detected;
- the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, namely and in particular the slow, fast, jerky, joyful or sad character of the avatar, from which the animation of said avatar is carried out in whole or in part;
- the level 2 parameters being considered as dimensions along which a series of coefficients is defined, with values fixed for each state of the animation graph, the probability value $P_e = \sum_i P_i \times C_i$ is calculated for a state e, where $P_i$ is the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ is the coefficient of state e along dimension i, this calculation being carried out for all the states connected to the state at which the current sequence ends in the graph;
- while an elementary sequence is in progress, either the silent elementary sequence is left to run to its end, or the animation switches to the duplicated speaking sequence if a voice is detected, and vice versa; then, when the sequence ends and a new state is reached, the next target state is chosen according to a probability defined by the probability values calculated for the states connected to the current state.

The invention also proposes a system implementing the above method.

It also proposes a system for animating an avatar provided with a mouth from a sound input signal corresponding to the voice of an interlocutor in a telephone communication, characterized in that it comprises a mobile telecommunication device for receiving the sound input signal emitted by an external telephone source, a proprietary signal-receiving server comprising means for analyzing said signal and for transforming said sound input signal in real time into an audio and video stream, computing means arranged to synchronize the movements of the mouth of the avatar transmitted in said stream with the phonemes detected in said sound input signal and to animate the avatar consistently with said signal through changes of attitude and movements, and means for transmitting the images of the avatar and the corresponding sound signal, so that the avatar appears to speak in real time, or substantially in real time, in place of the interlocutor.
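The probability rule stated among the provisions above, in which each candidate state receives a value equal to the sum of the level 2 parameter values weighted by that state's coefficients, might be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout, the example dimensions ("fast", "joyful") and the choice to clamp negative values and draw in proportion to the resulting weights are hypothetical and are not taken from the patent text.

```python
import random


def state_scores(level2, coefficients):
    """Compute P_e = sum_i P_i * C_i for every candidate state e.

    level2:       {dimension: P_i}, e.g. {"fast": 0.5, "joyful": 1.0} (assumed layout)
    coefficients: {state: {dimension: C_i}} for the states reachable
                  from the state where the current sequence ends
    """
    return {
        state: sum(level2.get(dim, 0.0) * c for dim, c in coeffs.items())
        for state, coeffs in coefficients.items()
    }


def choose_next_state(level2, coefficients, rng=random):
    """Draw the next target state with probability proportional to P_e.

    Assumption: negative scores are clamped to zero and the scores are
    used as relative weights; the text does not spell out normalization.
    """
    scores = state_scores(level2, coefficients)
    states = list(scores)
    weights = [max(scores[s], 0.0) for s in states]
    if sum(weights) == 0:
        # No state is favoured by the voice analysis: pick uniformly.
        return rng.choice(states)
    return rng.choices(states, weights=weights, k=1)[0]


level2 = {"fast": 0.5, "joyful": 1.0}
coefficients = {
    "wave": {"fast": 2.0, "joyful": 1.0},
    "idle": {"fast": 0.0, "joyful": 0.0},
}
scores = state_scores(level2, coefficients)
next_state = choose_next_state(level2, coefficients)
```

In this example "idle" has a zero weight, so the draw always lands on "wave"; with several non-zero weights the choice becomes random but biased toward the states whose coefficients best match the detected voice characteristics.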
Advantageously, the system comprises means for configuring the avatar through an online service on the Internet and/or means for analyzing the sound input signal in order to detect, and use for the animation, one or more additional parameters called level 1 parameters, namely periods of silence, periods of speech and/or other elements contained in said sound signal, taken from prosody, intonation, rhythm and/or tonic accent.

In an advantageous embodiment, it comprises means for building and storing on a server elementary animated sequences for animating the avatar, consisting of images generated by a 3D rendering computation or generated from drawings.

Advantageously, it comprises means for selecting in real time the elementary sequence to be played, according to previously calculated and/or determined parameters.

Also advantageously, the list of elementary animated sequences being common to all the avatars usable in the mobile device, it comprises means for computing and implementing an animation graph in which each node represents a point or state of transition between two elementary sequences, each connection between two transition states being unidirectional, and all the sequences connected through the same state having to be visually compatible with the passage from the end of one elementary sequence to the beginning of the other.

In an advantageous embodiment, it comprises means for duplicating each elementary sequence so that the character can be shown speaking or silent depending on whether or not a voice is detected.

Advantageously, the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, which correspond to characteristics such as a slow, fast, jerky, joyful or sad character, or other characters of an equivalent type, and the avatar is animated at least in part from said level 2 parameters.
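The animation graph described above, with transition states as nodes, unidirectional connections and every elementary sequence duplicated in a speaking and a silent version, could be represented along the following lines. This is a sketch only: it assumes that each connection carries exactly one elementary sequence, and the class names, field names and clip file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sequence:
    """One elementary sequence, stored in two visually matching versions."""
    name: str
    source: str        # transition state where the sequence starts
    target: str        # transition state where the sequence ends
    talking_clip: str  # version in which the character speaks
    silent_clip: str   # version in which the character is silent

    def clip(self, voice_detected: bool) -> str:
        # The speaking or silent duplicate is chosen from voice detection.
        return self.talking_clip if voice_detected else self.silent_clip


@dataclass
class AnimationGraph:
    """Directed graph: nodes are transition states, edges are sequences."""
    outgoing: dict = field(default_factory=dict)

    def add(self, seq: Sequence) -> None:
        # Connections are unidirectional: only source -> target is recorded.
        self.outgoing.setdefault(seq.source, []).append(seq)
        self.outgoing.setdefault(seq.target, [])

    def candidates(self, state: str) -> list:
        """Sequences that may be played once `state` has been reached."""
        return list(self.outgoing[state])


graph = AnimationGraph()
nod = Sequence("nod", "neutral", "attentive", "nod_talk.clip", "nod_silent.clip")
graph.add(nod)
```

Because both versions of a sequence share the same start and end states, switching between them while a sequence is playing does not change where the avatar ends up in the graph.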
A parameter of a type equivalent to a level 2 parameter means a more complex parameter built from the level 1 parameters, which are themselves simpler. In other words, the level 2 parameters correspond to an analysis and/or a grouping of the level 1 parameters, which makes it possible to refine further the states of the characters by making them better suited to what one wishes to represent.

The level 2 parameters being considered as dimensions along which a series of coefficients is defined, with values fixed for each state of the animation graph, the computing means are arranged to calculate, for a state e, the probability value $P_e = \sum_i P_i \times C_i$, where $P_i$ is the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ is the coefficient of state e along dimension i, this calculation being carried out for all the states connected to the state at which the current sequence ends in the graph. They are also arranged, while an elementary sequence is in progress, to let the silent elementary sequence run to its end or to switch to the duplicated speaking sequence if a voice is detected, and vice versa, and then, when the sequence ends and a new state is reached, to choose the next target state according to a probability defined by the probability values calculated for the states connected to the current state.

The invention will be better understood on reading the following description of particular embodiments, given by way of non-limiting examples. The description refers to the accompanying drawings, in which:

Figure 1 is a block diagram showing an avatar animation system according to the invention.
Figure 2 gives a state graph as implemented according to the embodiment of the invention more particularly described here.
Figure 3 shows three types of image sequences, including the one obtained with the invention in relation to a sound input signal.
Figure 4 schematically illustrates another mode of implementation of the state graph used according to the invention.
Figure 5 schematically shows the method of selecting a state from the relative probabilities, according to one embodiment of the invention.

Figure 6 shows an example of a sound input signal from which a succession of states is built, to be used to construct the behavior of the avatar according to the invention.
Figure 7 shows an example of initial configuration carried out from the calling interlocutor's mobile phone.

Figure 1 shows schematically the principle of a system 1 for animating an avatar 2, 2' on a screen 3, 3', 3'' of a mobile device 4, 4', 4''.

The avatar 2 is provided with a mouth 5, 5' and is animated from a sound input signal 6 corresponding to the voice 7 of an interlocutor 8 communicating by means of a mobile telephone 9 or any other means of sound communication (landline telephone, computer, ...). The system 1 comprises, from a server 10 belonging to a network (telephone, Internet, ...), a proprietary server 11 for receiving the signals 6. This server comprises means 12 for analyzing the signal and transforming said signal in real time into a multiplexed audio and video stream 13, on two channels 14, 15; 14', 15' in the case of reception by 3D or 2D mobiles, or on a single channel 16 in the case of a so-called video mobile.

It further comprises computing means arranged to synchronize the movements of the mouth 5 of the avatar with the phonemes detected in the sound input signal and (in the case of 2D and 3D mobiles) to retransmit, on the one hand, the text data scripted at 17; 17', which are then transmitted at 18, 18' in the form of a script to the mobile telephone 4; 4', and, on the other hand, to download the 2D or 3D avatar, at 19, 19', to said mobile telephone.

When a so-called video telephony mobile is used, the text is scripted at 20 to be transmitted in the form of image and sound files 21, before compression at 22 and sending to the mobile 4'' in the form of the video stream 23.

The result obtained is that the avatar 2, and in particular its mouth 5, seems to speak in real time in place of the interlocutor 8, and that the behavior of the avatar (attitude, gestures) is consistent with the voice.

The invention will now be described further with reference to Figures 2 to 7, the method more particularly described making it possible to perform the following functions:
- exploit elementary animated sequences, consisting of images generated by a 3D rendering calculation or produced directly from drawings;
- choose and configure one's character through an online service that will produce new elementary sequences: 3D rendering on the server or selection of categories of sequences;
- load all the elementary sequences into memory when the application is launched and keep them in memory for the whole duration of the service, for several simultaneous and successive users;
- analyze the voice contained in the input signal in order to detect the periods of silence, the periods of speech and possibly other elements contained in the sound signal, such as phonemes and prosody (intonation of the voice, rhythm of speech, tonic accents);
- select in real time the elementary sequence to play, according to the previously calculated parameters.

The sound signal is analyzed from a buffer corresponding to a small time interval (approximately 10 milliseconds). The choice of the elementary sequences (by what is called the sequencer) is explained later.

More precisely, and to obtain the results sought by the invention, one begins by creating a list of elementary animation sequences for a set of characters.

Each sequence consists of a series of images produced by 3D or 2D animation software known per se, such as 3dsMax and Maya from the American company Autodesk and XSI from the French company Softimage, or by conventional proprietary 3D rendering tools, or else consists of digitized drawings. These sequences are generated beforehand and placed on the proprietary server that broadcasts the avatar video stream, or else generated by the online avatar configuration service and placed on that same server.

In the embodiment more particularly described here, the list of names of the available elementary sequences is common to all characters, but the images that compose them may represent very different animations. This makes it possible to define a state graph common to several avatars, but this arrangement is not mandatory.

A state graph 24 is then defined (see Figure 2), each node (or state) 26, 27, 28, 29, 30 of which is defined as a transition point between elementary sequences. The connection between two states is unidirectional, in one direction or the other (arrows 25).

More precisely, in the example of Figure 2, five states are defined, namely the start-of-sequence state 26, the neutral state 27, the excited state 28, the resting state 29 and the end-of-sequence state 30.

All the sequences connected through the same state of the graph must be visually compatible with the passage from the end of one animation to the beginning of the other. Compliance with this constraint is handled when creating the animations corresponding to the elementary sequences.

Each elementary sequence is duplicated so as to show either a character who is speaking or a character who is silent, depending on whether or not speech has been detected in the voice. This makes it possible to switch from one version to the other of the elementary sequence being played, so as to synchronize the animation of the character's mouth with the periods of speech.

Figure 3 shows a sequence of images as obtained with speech 32, the same sequence without speech 33 and, as a function of the sound input (curve 34) emitted by the interlocutor, the resulting sequence 35.

The principle for selecting animation sequences is now described. The analysis of the voice produces a number of so-called level 1 parameters, whose value varies over time and which are averaged over a certain interval, for example 100 milliseconds.

These parameters are, for example:
- speech activity (silence or speech signals);
- speech rhythm;
- pitch (high or low), in the case of a non-tonal language;
- vowel length;
- the more or less marked presence of tonic stress.

The speech activity parameter can be calculated, as a first approximation, from the power of the sound signal (integral of the squared signal), considering that there is speech above a certain threshold. The threshold can be calculated dynamically as a function of the signal-to-noise ratio. Frequency filtering may also be used, for example to avoid treating a passing truck as voice. The speech rhythm is calculated from the average frequency of the periods of silence and speech. Other parameters can also be calculated from a frequency analysis of the signal.
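The power-threshold approximation of speech activity described above might be sketched as follows. This is only an illustration under stated assumptions: the function names, the fixed threshold value, the 8 kHz sample rate and the synthetic test signal are all hypothetical, and the dynamic threshold and frequency filtering mentioned in the description are left out.

```python
import math


def frame_power(samples):
    """Mean of the squared samples, a discrete stand-in for the
    integral of the squared signal over one analysis buffer."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def split_frames(signal, sample_rate, frame_ms=10):
    """Cut the signal into consecutive buffers of about frame_ms milliseconds
    (an incomplete trailing buffer is dropped)."""
    n = max(1, int(sample_rate * frame_ms / 1000))
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def speech_activity(signal, sample_rate, threshold, frame_ms=10):
    """Return one boolean per buffer: True when the power exceeds the threshold."""
    return [frame_power(f) > threshold
            for f in split_frames(signal, sample_rate, frame_ms)]


# Hypothetical input: 100 ms of silence followed by 100 ms of a 200 Hz tone.
rate = 8000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * t / rate) for t in range(800)]
activity = speech_activity(silence + tone, rate, threshold=0.01)
```

With this input the first ten 10 ms buffers are classified as silence and the last ten as speech. A real deployment would presumably derive the threshold from an estimate of the background noise rather than fix it.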

According to the embodiment of the invention more particularly described here, simple mathematical formulas (linear combinations, threshold functions, Boolean functions) make it possible to go from these level 1 parameters to so-called level 2 parameters, which correspond to characteristics such as slow, fast, jerky, happy, sad, etc.

The level 2 parameters are treated as dimensions along which a series of coefficients C_i is defined, with fixed values for each state e of the animation graph. Examples of such a parameterization are given below.

At every instant, that is to say for example with a periodicity of 10 milliseconds, the level 1 parameters are calculated. When a new state has to be chosen, that is to say at the end of a sequence, the level 2 parameters deduced from them can therefore be calculated, and the following value can be calculated for a state e:

$P_e = \sum_i P_i \times C_i$

where the values P_i are those of the level 2 parameters and C_i the coefficients of the state e along said dimension i.

This sum constitutes a relative probability of the state e (relative to the other states) being selected.

When an elementary sequence is in progress, it is allowed to run to its end, that is to say to the state of the graph to which it leads, but one switches from one version of the sequence to the other (version with or without speech) at any time according to the detected speech signal.
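The relative probability defined above, and the draw that follows from it, can be sketched in a few lines. The coefficient values are those printed in Table I further below and the level 2 values are those given for instant T3 in Table II. The function names, the dictionary layout and the use of a seeded random generator are assumptions made for illustration, not part of the described method.

```python
import random

# Coefficients of each state along the four level 2 dimensions (Table I as printed).
COEFFS = {
    "Neutral":  {"IDLE": 0,    "SPEAK": 0,   "NEUTRAL": 1, "GREETING": 0},
    "Speak1":   {"IDLE": 0.05, "SPEAK": 1,   "NEUTRAL": 0, "GREETING": 0},
    "Speak2":   {"IDLE": 0,    "SPEAK": 1.2, "NEUTRAL": 0, "GREETING": 0},
    "Idle1":    {"IDLE": 2,    "SPEAK": 0,   "NEUTRAL": 0, "GREETING": 0},
    "Idle2":    {"IDLE": 1,    "SPEAK": 0,   "NEUTRAL": 0, "GREETING": 0},
    "Greeting": {"IDLE": 0,    "SPEAK": 0.5, "NEUTRAL": 0, "GREETING": 1},
}


def relative_probabilities(level2, coeffs=COEFFS):
    """P_e = sum over dimensions i of P_i * C_i, for every state e."""
    return {state: sum(level2[dim] * c for dim, c in row.items())
            for state, row in coeffs.items()}


def draw_state(scores, rng=random):
    """Draw one state with a chance proportional to its relative probability."""
    states = [s for s, value in scores.items() if value > 0]
    if not states:
        raise ValueError("no state has a positive relative probability")
    weights = [scores[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


# Level 2 parameters at instant T3 (Table II): a period of silence.
t3 = {"IDLE": 0.5, "SPEAK": 0, "NEUTRAL": 1, "GREETING": 0}
scores = relative_probabilities(t3)
next_state = draw_state(scores, random.Random(0))
```

With the T3 values this yields 1, 1 and 0.5 for Neutral, Idle1 and Idle2, as in Table III. The remaining entries should be read as illustrative, since the printed tables do not all follow exactly from the printed coefficients (Speak1, for instance, comes out at 0.025 here).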

When the sequence ends and a new state is reached, the next target state is chosen according to a probability defined by the preceding calculations. If the target state is the same as the current state, the avatar is kept there by playing a looping animation a certain number of times, which brings us back to the previous case.

Some sequences are loops that start from a state and return to it (arrow 31); they are used when the sequencer decides to keep the avatar in its current state, that is to say, chooses the current state itself as the next target state.

A pseudo-code description of an example of animation generation is given below, followed by a description of an example of how sequences unfold.

Example of animation generation:

initialize current state to a predefined start state
initialize target state to null
initialize current animation sequence to null sequence
while an incoming audio stream is received:
  o decode the incoming audio stream
  o calculate the level 1 parameters
  o if current animation sequence is finished:
      - current animation sequence = null sequence
      - target state = null state
  o if target state is null:
      - calculate the level 2 parameters from the level 1 parameters (and possibly their history)
      - select the states connected to the current state
      - calculate the probabilities of these connected states from their coefficients and the previously calculated level 2 parameters
      - draw the target state from among these connected states according to the previously calculated probabilities => a new target state is thus defined
  o if current animation sequence is null:
      - select in the graph the animation sequence from the current state to the target state => defines the current animation sequence
  o play the current animation sequence => selection of the corresponding precomputed images
  o match the portion of incoming audio stream with the images selected from the analysis of these portions of audio stream
  o generate a compressed audio and video stream from the selected images and the incoming audio stream

Example of how sequences unfold:

- the interlocutor says: "Hello, how are you?"
1. the level 1 parameters indicate the presence of speech
2. the level 2 parameters indicate: cheerful voice (corresponding to "Hello")
3. the probabilistic draw selects the happy target state
4. the animation sequence from the start state to the happy state is played (in its version with speech)
5. a period of silence begins, recognized through the level 1 parameters
6. the animation sequence is still in progress; it is not interrupted, but its version without speech is selected
7. the happy target state is reached
8. the silence leads to the selection of the neutral target state (through the calculation of the level 1 and 2 parameters and the probabilistic draw)
9. the animation sequence from the happy state to the neutral state is played (in its version without speech)
10. the neutral target state is reached
11. the silence again leads to the selection of the neutral target state
12. the neutral => neutral animation sequence (loop) is played in its version without speech
13. the level 1 parameters indicate the presence of speech (corresponding to "How are you?")
14. the level 2 parameters indicate a questioning voice
15. the neutral target state is reached again
16. the questioning target state is selected (through the calculation of the level 1 and 2 parameters and the probabilistic draw)
17. etc.
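The pseudo-code and walk-through above can be condensed into a small runnable sketch of the sequencer. Several points are assumptions rather than statements of the description: the class and field names are hypothetical, one call to `step` stands for one video frame, both versions of a sequence are assumed to have the same number of frames, the current state is updated explicitly when a sequence ends (the pseudo-code leaves this implicit), and the target-selection policy is injected as a function so that a probabilistic draw can be plugged in.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    source: str      # state the sequence starts from
    target: str      # state the sequence leads to
    talking: list    # frames of the version with speech
    silent: list     # frames of the version without speech


class Sequencer:
    def __init__(self, sequences, start_state, choose_target):
        self.sequences = {(s.source, s.target): s for s in sequences}
        self.state = start_state
        self.choose_target = choose_target  # hypothetical policy hook
        self.current = None                 # the "null sequence"
        self.position = 0

    def step(self, speaking, level2):
        """Advance by one frame and return the image to display."""
        if self.current is None:
            # Choose a target among the states connected to the current state.
            candidates = [t for (s, t) in self.sequences if s == self.state]
            target = self.choose_target(self.state, candidates, level2)
            self.current = self.sequences[(self.state, target)]
            self.position = 0
        seq = self.current
        # The sequence is never interrupted; only its version is switched.
        frames = seq.talking if speaking else seq.silent
        frame = frames[self.position]
        self.position += 1
        if self.position >= len(frames):
            self.state = seq.target  # assumption: made explicit here
            self.current = None
        return frame


# Hypothetical two-frame sequences and a trivial policy (first candidate).
sequences = [
    Sequence("start", "neutral", ["sn_talk0", "sn_talk1"], ["sn_mute0", "sn_mute1"]),
    Sequence("neutral", "neutral", ["nn_talk0", "nn_talk1"], ["nn_mute0", "nn_mute1"]),
]
sequencer = Sequencer(sequences, "start", lambda state, candidates, p: candidates[0])
speech = [True, False, False, True]
shown = [sequencer.step(speaking, {}) for speaking in speech]
```

In this run the avatar starts the start-to-neutral sequence in its speaking version, switches to the silent version mid-sequence when speech stops, then loops on the neutral state and switches back to the speaking version when speech resumes.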

The method for selecting a state from the relative probabilities is now described with reference to Figure 5, which gives a probability graph of the states 40 to 44.

The relative probability of the state 40 is determined with respect to the value calculated above.

If the value (arrow 45) is at a given level, the corresponding state is selected (state 42 in the figure).

With reference to Figure 4, another example of a state graph according to the invention is given. Here the following states have been defined:
- neutral state (Neutral): 46
- state appropriate to a first speech period (Speak 1): 47
- other state appropriate to a second speech period (Speak 2): 48
- state appropriate to a first period of silence (Idle 1): 49
- other state appropriate to a second period of silence (Idle 2): 50
- state appropriate to an introductory speech (Greeting): 51

The state graph, for its part, connects all these states unidirectionally (in both directions) in a star shape (link 52).

In other words, in the example more particularly described with reference to Figure 4, the dimensions used for calculating the relative probabilities (dimensions of the parameters and of the coefficients) are defined as follows:
- IDLE: values indicating a period of silence
- SPEAK: values indicating a period of speech
- NEUTRAL: values indicating a period of neutrality
- GREETING: values indicating a welcome or introduction phase

First-level parameters are then introduced, detected in the input signal and used as intermediate values for calculating the preceding parameters, namely:
- Speak: binary value indicating whether someone is speaking
- SpeakTime: time elapsed since the start of the speech period
- MuteTime: time elapsed since the start of the period of silence
- SpeakIndex: number of the speech period counted from a given instant

The formulas for going from the first-level parameters to the second-level ones are also defined:
- IDLE: NOT(Speak) x MuteTime
- SPEAK: Speak
- NEUTRAL: NOT(Speak)
- GREETING: Speak & (SpeakIndex = 1)

The coefficients associated with the states are given, for example, by Table I below:

TABLE I
State      IDLE   SPEAK   NEUTRAL   GREETING
Neutral    0      0       1         0
Speak1     0.05   1       0         0
Speak2     0      1.2     0         0
Idle1      2      0       0         0
Idle2      1      0       0         0
Greeting   0      0.5     0         1

Such a parameterization, with reference to Figure 6 and for four instants T1, T2, T3, T4, gives the current state and the values of the level 1 and level 2 parameters shown in Table II below.

TABLE II
Instant  Current state  Speak  SpeakTime  MuteTime  SpeakIndex  IDLE  SPEAK  NEUTRAL  GREETING
T1       Neutral        1      0.01 sec   0 sec     1           0     1      0        1
T2       Greeting       0      0 sec      0.01 sec  1           0.01  0      1        0
T3       Neutral        0      0 sec      1.5 sec   1           0.5   0      1        0
T4       Neutral        1      0.01 sec   0 sec     2           0     1      0        0

The relative probability of the next states is then given in Table III below.

TABLE III
Instant  Neutral  Speak1  Speak2  Greeting  Idle1  Idle2
T1       0        1       1.2     2.5       0      0
T2       1        0       0       0         0.02   0.01
T3       1        0       0       0         1      0.5
T4       0        1       1.2     0         0      0

This gives, in the chosen example, the probability draws corresponding to Table IV below:

TABLE IV
Instant  Current state  States in the draw        Next state
T1       Neutral        Speak1, Speak2, Greeting  Greeting
T2       Greeting       Neutral                   Neutral
T3       Neutral        Neutral, Idle1, Idle2     Neutral
T4       Neutral        Speak1, Speak2            Speak2

Finally, with reference to Figures 7 and 1, the schematic screen 52 of a mobile is shown, making it possible to configure the avatar in real time.

In step 1, the user 8 configures the parameters of the video sequence that he wishes to personalize. For example:
• Character 53
• Expression of the character (happy, sad, etc.) 54
• Line spoken by the character 55
• Background sound 56
• Telephone number of the recipient 57.

In step 2, the parameters are transmitted in the form of requests to the server application (server 11), which interprets them, creates the video and sends it (link 13) to the encoding application.

In step 3, the video sequences are compressed into the appropriate format, that is to say one readable by mobile terminals, before step 4, in which the compressed video sequences are transmitted (links 18, 19, 18', 19'; 23) to the recipient, for example by MMS.

It goes without saying, and follows from the foregoing, that the invention is not limited to the embodiment more particularly described but on the contrary encompasses all variants, in particular those in which broadcasting is deferred rather than in real time or near real time.

Claims

1. A method of animating, on a screen (3, 3', 3'') of a mobile device (4, 4', 4''), an avatar (2, 2', 2'') provided with a mouth (5, 5') from a sound input signal (6) corresponding to the voice (7) of a telephone communication interlocutor (8), characterized in that the sound input signal is transformed in real time into an audio and video stream in which the movements of the mouth of the avatar are synchronized with the phonemes detected in said sound input signal, and the avatar is animated in a manner consistent with said signal through changes of attitude and movements obtained by analyzing said signal, so that the avatar appears to speak in real time or substantially in real time in place of the interlocutor.

2. Method according to claim 1, characterized in that the avatar is chosen and/or configured through an online service on the Internet.

3. Method according to any one of the preceding claims, characterized in that the mobile device is a mobile phone.

4. Method according to any one of the preceding claims, characterized in that, in addition to the phonemes, the sound input signal is analyzed in order to detect and use for the animation one or more additional parameters, known as level 1 parameters, namely the periods of silence, the periods of speech and/or other elements contained in said sound signal, taken from prosody, intonation, rhythm and/or tonic accent.

5. Method according to any one of the preceding claims, characterized in that to animate the avatar, elementary sequences are used, consisting of images generated by a calculation of 3D rendering, or generated from drawings.

6. Method according to claim 5, characterized in that elementary sequences are loaded into memory at the beginning of the animation and stored in said memory for the duration of the animation for several simultaneous and/or successive interlocutors.

7. Method according to any one of claims 5 and 6, characterized in that the elementary sequence to play is selected in real time, according to previously calculated and/or determined parameters.

8. Method according to any one of claims 5 to 7, characterized in that, the elementary sequences being common to all avatars used in the mobile device, an animation graph is defined, each node of which represents a transition point or state between two elementary sequences, each connection between two transition states being unidirectional and all the elementary sequences connected through the same state having to be visually compatible with the passage from the end of one animation to the beginning of the other.

9. Method according to claim 8, characterized in that each elementary sequence is duplicated so as to show a character who speaks or who is silent, depending on whether or not a voice sound is detected.

10. Method according to claim 4, characterized in that the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, namely the slow, fast, jerky, happy or sad character of the avatar, from which the animation of said avatar is produced in whole or in part.

11. Method according to claim 10, characterized in that, the level 2 parameters being considered as dimensions along which a series of coefficients is defined, with values that are fixed for each state of the animation graph, the following probability value is calculated for a state e: $P_e = \sum_i P_i \times C_i$, with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of the state e along dimension i; then, when an elementary sequence is in progress, the elementary sequence is allowed to run to the end, or a switch is made to the other sequence, which speaks, if the voice is detected, and vice versa; then, when the sequence ends and a new state is reached, the next target state is chosen according to a probability defined by the calculation of the probability values of the states connected to the current state.

12. A system (1) for animating an avatar (2, 2') provided with a mouth (5, 5') from a sound input signal (6) corresponding to the voice (7) of an interlocutor (8) in a telephone communication, characterized in that it comprises a mobile telecommunication device (9) for receiving the sound input signal emitted by an external telephone source, a proprietary server (11) for receiving the signal, comprising means (12) for analyzing said signal and transforming in real time said sound input signal into an audio and video stream, and computing means arranged to synchronize the movements of the mouth of the avatar transmitted in said stream with the phonemes detected in said sound input signal, and to animate the avatar in a manner consistent with said signal by changes of attitude and movements, so that the avatar appears to speak in real time or substantially in real time in place of the interlocutor.

13. System according to claim 12, characterized in that it comprises means for configuring the avatar through an online service on the Internet.

14. System according to any one of claims 12 and 13, characterized in that it further comprises means for analyzing the sound input signal in order to detect and use for the animation one or more additional parameters, namely periods of silence, periods of speech and/or other elements contained in said sound signal, taken from among prosody, intonation, rhythm and/or tonic accent.

15. System according to any one of claims 12 to 14, characterized in that it comprises means for constituting and storing in a proprietary server elementary sequences for animating the avatar, consisting of images generated by a 3D rendering computation or generated from drawings.

16. System according to claim 15, characterized in that it comprises means for selecting in real time the elementary sequence to be played, according to previously calculated and/or determined parameters.

17. System according to any one of claims 12 and 16, characterized in that, the list of elementary sequences being common to all avatars used for sending to the mobile device, it comprises means for calculating and implementing an animation graph in which each node represents a transition point or state between two elementary sequences, each connection between two transition states being unidirectional, and all the sequences connected through the same state having to be visually compatible with the passage from the end of one animation to the beginning of the other.

18. System according to any one of claims 12 to 17, characterized in that it comprises means for duplicating each elementary sequence so as to make it possible to show a character who speaks or who is silent, depending on whether or not a voice sound is detected.

19. System according to any one of claims 12 to 18, characterized in that, the phonemes and/or the other parameters being considered as dimensions along which a series of coefficients is defined, with values that are fixed for each state of the animation graph, the calculation means are arranged to calculate for a state e the probability value $P_e = \sum_i P_i \times C_i$, with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of the state e along dimension i; then, when an elementary sequence is in progress, to let the silent elementary sequence run to the end, or to switch to the other sequence, which speaks, if the voice is detected, and vice versa; then, when the sequence ends and a new state is reached, to choose the next target state according to a probability defined by the calculation of the probability values of the states connected to the current state.
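
For illustration only, the state-selection step recited in the claims above (a weighted sum of level 2 parameters per state, followed by a probabilistic choice among the states connected to the current one) might be sketched as follows in Python. This is a minimal sketch and not the implementation of the patent: the function names, the example graph, the coefficient values and the "_talk"/"_silent" naming are all hypothetical, and turning the values Pe into selection weights by normalising them over the successor states is an assumption, since the claims do not specify how the probability is derived from those values.

```python
import random

# Level 2 dimensions, named after the characters listed in the claims.
DIMENSIONS = ("slow", "fast", "jerky", "happy", "sad")


def state_score(level2, coefficients):
    """Return Pe = sum over i of Pi * Ci for one state of the animation graph."""
    return sum(level2.get(d, 0.0) * coefficients.get(d, 0.0) for d in DIMENSIONS)


def choose_next_state(current, graph, coeffs, level2, rng=random):
    """Pick the next target state among the states connected to `current`.

    Assumption: the scores Pe are used as relative weights; negative scores
    are clamped to zero, and if every score is zero the choice is uniform.
    """
    successors = graph[current]  # unidirectional edges: current -> successor
    scores = [max(state_score(level2, coeffs[s]), 0.0) for s in successors]
    if sum(scores) == 0.0:
        return rng.choice(successors)
    return rng.choices(successors, weights=scores, k=1)[0]


def sequence_variant(sequence, voice_detected):
    """Select the speaking or silent copy of a duplicated elementary sequence.

    The "_talk" / "_silent" suffixes are a hypothetical naming scheme.
    """
    return f"{sequence}_talk" if voice_detected else f"{sequence}_silent"


# Hypothetical animation graph: each key is a state, each value lists the
# states reachable from it.
graph = {"idle": ["nod", "smile"], "nod": ["idle"], "smile": ["idle"]}

# Hypothetical fixed coefficients Ci for each state, one per dimension.
coeffs = {
    "idle": {"slow": 1.0},
    "nod": {"fast": 0.8, "jerky": 0.5},
    "smile": {"happy": 1.0},
}

# Hypothetical level 2 values Pi derived from the voice analysis.
level2 = {"slow": 0.1, "fast": 0.6, "jerky": 0.2, "happy": 0.7, "sad": 0.0}

next_state = choose_next_state("idle", graph, coeffs, level2)
playing = sequence_variant(next_state, voice_detected=True)
```

With these example values the "nod" state scores 0.6 × 0.8 + 0.2 × 0.5 = 0.58 and the "smile" state scores 0.7, so under the normalisation assumed here "smile" would be drawn somewhat more often than "nod" when leaving "idle".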