EP2059926A2

EP2059926A2 - Method and system for animating an avatar in real time using the voice of a speaker

Info

Publication number: EP2059926A2
Application number: EP07848234A
Authority: EP
Inventors: Laurent Ach; Serge Vieillescaze; Benoît MOREL
Original assignee: LA CANTOCHE PRODUCTION SA
Current assignee: LA CANTOCHE PRODUCTION SA
Priority date: 2006-09-15
Filing date: 2007-09-14
Publication date: 2009-05-20
Also published as: WO2008031955A2; FR2906056B1; FR2906056A1; US20090278851A1; WO2008031955A3

Abstract

This is a method and a system for animating on a screen (3, 3', 3'') of a mobile apparatus (4, 4', 4'') an avatar (2, 2', 2'') furnished with a mouth (5, 5') using an input sound signal (6) corresponding to the voice (7) of a speaker (8) having a telephone communication. The input sound signal is transformed in real time into an audio and video stream in which the movements of the mouth of the avatar are synchronized with the phonemes detected in said input sound signal, and the avatar is animated in a manner consistent with said signal by changes of posture and movements by analysing said signal, so that the avatar seems to talk in real time or substantially in real time instead of the speaker.

Description


METHOD AND SYSTEM FOR ANIMATING A REAL-TIME AVATAR FROM THE VOICE OF AN INTERLOCUTOR

The present invention relates to a method for animating an avatar in real time from the voice of an interlocutor.

It also relates to a system for animation of such an avatar.

The invention finds a particularly important, although not exclusive, application in the field of mobile devices such as mobile phones or, more generally, portable personal communication devices or PDAs (the English initials for Personal Digital Apparatus).

The improvement of mobile phones, their aesthetics and the quality of images and sound they convey is a constant concern for the manufacturers of this type of device.

Users are particularly sensitive to the customization of this tool, which has become an essential means of communication.

However, even though its functions have multiplied, since in addition to its primary function as a telephone it now allows the storage of sound and images, including photographs, it nevertheless remains a limited platform.

In particular, it cannot display high-definition images, which in any case could not be viewed properly because of the small size of its screen. In addition, many services accessible from mobile phones that until now operated only in audio mode (messaging services, call centers, ...) are now required to meet a demand in video-telephony mode.

The service providers at the origin of these services often do not have a ready solution for the transition from audio to video and / or do not wish to broadcast the image of a real person.

One of the solutions to these problems is therefore to move towards the use of avatars, that is to say the use of graphic images, schematic and less complex, representing one or more users.

Such graphics can then be integrated into the phone beforehand and called up when necessary during a telephone conversation.

A system and a method for implementing avatars in a mobile phone for creating and modifying them using the XML standard (Extensible Markup Language) are thus known (WO 2004/053799).

Such a system, however, does not solve the control of facial expressions of the avatar depending on the speaker, especially in a synchronized manner.

At most, there exist in the prior art (EP 1 560 406) programs for modifying the state of an avatar in a simple manner on the basis of external information generated by a user, but without the finesse and the speed sought in the case where the avatar must behave in a manner perfectly synchronized with the sound of a voice.

Current conversational technologies and programs using avatars, such as those implementing the program developed by the American company Microsoft called "Microsoft Agent", do not in fact make it possible to reproduce effectively the behavior of an avatar in real time in relation to a voice, on a portable device with limited capabilities such as a mobile phone.

Also known (GB 2 423 905) is a method of animating an entity on a mobile phone consisting of selecting and digitally processing the words of a message from which "visemes" are identified which are used to modify the mouth of the entity when the voice message is output.

Such a method, in addition to being based on the use of words, and not sounds as such, is limited and gives a mechanical appearance to the visual image of the entity.

The present invention aims to provide a method and a system for animating an avatar in real time that meet the requirements of practice better than those previously known, in particular in that they allow real-time animation not only of the mouth but also of the body of an avatar on a mobile device of reduced capacity such as a mobile phone, with excellent synchronization of movements.

With the invention it becomes possible, while operating in the standard environment of computer or mobile communication terminals and without installing specific software components in the mobile phone, to obtain an animation of the avatar in real time or near real time that is consistent with the input signal, solely by detection and analysis of the sound of the voice, i.e. of the phonemes.

A great aesthetic and artistic quality is thus conferred on the avatars and their movement during their creation and this while respecting the complexity of the timbre and finesse of the voice, for a low cost and with excellent reliability.

To do this, the invention starts with the idea of using the richness of sound and not just the words themselves.

For this purpose, the present invention notably proposes a method of animating, on the screen of a mobile device, an avatar provided with a mouth from a sound input signal corresponding to the voice of an interlocutor in a telephone communication, characterized in that the sound input signal is transformed in real time into an audio and video stream in which, on the one hand, the movements of the mouth of the avatar are synchronized with the phonemes detected in said sound input signal and, on the other hand, at least one other part of the avatar is animated in a manner coherent with said signal by changes of attitude and movements obtained by analysis of said signal, and in that, in addition to the phonemes, the sound input signal is analyzed in order to detect and use for the animation one or more additional parameters, called level 1 parameters, namely the periods of silence, the periods of speech and/or other elements contained in said sound signal taken from prosody, intonation, rhythm and/or tonic accent, so that the entire avatar moves and seems to speak in real time or substantially in real time in place of the interlocutor.

Other parts of the avatar include body and / or arms, neck, legs, eyes, eyebrows, hair, etc., other than the actual mouth. These are therefore not set in motion independently of the signal.

It is not a question here of detecting the (real) emotion of an interlocutor from his voice but of creating probable artificial reactions in a mechanical way, nevertheless credible and compatible with what could be the reality.

In advantageous embodiments, one and/or other of the following provisions are also used:

- the avatar is chosen and/or configured through an online service on the Internet;

- the mobile device is a mobile phone;

- to animate the avatar, elementary sequences are used, consisting of images generated by a 3D rendering calculation or generated from drawings;

- the elementary sequences are loaded into memory at the beginning of the animation and kept in said memory for the duration of the animation, for several simultaneous and/or successive interlocutors;

- the elementary sequence to be played is selected in real time, according to previously calculated and/or determined parameters;

- the list of elementary sequences being common to all the avatars that can be used in the mobile device, an animation graph is defined in which each node represents a point or transition state between two elementary sequences, each connection between two transition states being unidirectional and all the elementary sequences connected through the same state having to be visually compatible with the transition from the end of one elementary sequence to the beginning of the other;

- each elementary sequence is duplicated so as to show a character who speaks or is silent according to whether or not a voice sound is detected;

- the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, namely and in particular the slow, fast, jerky, joyful or sad character of the avatar, from which the animation of said avatar is produced in whole or in part;

- the level 2 parameters being considered as dimensions along which a series of coefficients is defined, with values that are fixed for each state of the animation graph, the following probability value is calculated for a state e:

$$P_e = \sum_i P_i \times C_i$$

with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of state e along dimension i, this calculation being carried out for all the states connected to the state at which the current sequence ends in the graph;

- when an elementary sequence is in progress, the elementary sequence is allowed to run to its end, or a switch is made to the duplicated speaking sequence when the voice is detected and vice versa; then, when the sequence ends and a new state is reached, the next target state is chosen according to a probability defined by the probability values calculated for the states connected to the current state.

The invention also proposes a system implementing the above method.

It also proposes a system for animating an avatar provided with a mouth from a sound input signal corresponding to the voice of an interlocutor in a telephone communication, characterized in that it comprises a mobile telecommunication device for receiving the sound input signal emitted by an external telephone source, a proprietary signal-receiving server comprising means for analyzing said signal and transforming said sound input signal in real time into an audio and video stream, and calculating means arranged on the one hand to synchronize the movements of the mouth of the avatar transmitted in said stream with the phonemes detected in said sound input signal and on the other hand to animate at least one other part of the avatar in a manner coherent with said signal by changes of attitude and movements, in that it comprises means for analyzing the sound input signal in order to detect and use for the animation one or more additional parameters, called level 1 parameters, namely the periods of silence, the periods of speech and/or other elements contained in said sound signal taken from prosody, intonation, rhythm and/or tonic accent, and in that it comprises means for transmitting the images of the avatar and the corresponding sound signal, so that the avatar seems to move and speak in real time or substantially in real time in place of the interlocutor.

These additional parameters number, for example, more than two, for example at least three and/or more than five.

Advantageously, the system comprises means for configuring the avatar through an online service on the Internet network.

In an advantageous embodiment, it comprises means for constituting and storing on a server, elementary animated sequences for animating the avatar, consisting of images generated by a 3D rendering calculation, or generated from drawings.

Advantageously, it comprises means for selecting in real time the elementary sequence to be played, according to parameters previously calculated and / or determined.

Also advantageously, the list of elementary animated sequences being common to all the avatars used in the mobile device, it comprises means for calculating and implementing an animation graph in which each node represents a point or transition state between two elementary sequences, each connection between two transition states being unidirectional and all the sequences connected through the same state having to be visually compatible with the transition from the end of one elementary sequence to the beginning of the other.

In an advantageous embodiment, it comprises means for duplicating each elementary sequence so as to make it possible to show a character who speaks or is silent according to the detection or not of a voice.

Advantageously, the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, which correspond to characteristics such as the slow, fast, jerky, happy or sad character of the avatar, or other characters of an equivalent type, and the avatar is animated at least in part from said level 2 parameters.

By parameter of type equivalent to a level 2 parameter, we mean a more complex parameter designed from the level 1 parameters, which are themselves simpler.

In other words, the level 2 parameters correspond to an analysis and/or a grouping of the level 1 parameters, which makes it possible to refine the states of the characters further by making them better suited to what one wishes to represent.

The level 2 parameters being considered as dimensions along which a series of coefficients is defined, with values that are fixed for each state of the animation graph, the computing means are arranged to calculate for a state e the probability value:

$$P_e = \sum_i P_i \times C_i$$

with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of state e along dimension i, this calculation being carried out for all the states connected to the state at which the current sequence ends in the graph.

When an elementary sequence is in progress, these means let the silent version of the sequence run to its end or switch to the duplicated speaking version when a voice is detected, and vice versa; then, when the sequence ends and a new state is reached, they choose the next target state according to a probability defined by the probability values calculated for the states connected to the current state.

The invention will be better understood on reading the following particular embodiments given below by way of non-limiting examples.

The description refers to the accompanying drawings in which:

FIG. 1 is a block diagram showing an animation system for an avatar according to the invention,

FIG. 2 gives a state graph as implemented according to the embodiment of the invention more particularly described here.

Figure 3 shows three types of image sequences, including that obtained with the invention in connection with a sound input signal.

FIG. 4 schematically illustrates another embodiment of the state graph used according to the invention.

Figure 5 shows schematically the method of selecting a state from the relative probabilities, according to an embodiment of the invention.

FIG. 6 shows an example of a sound input signal allowing the construction of a series of states, to be used for constructing the behavior of the avatar according to the invention.

Figure 7 shows an example of initial setting made from the mobile phone of the calling party.

FIG. 1 schematically shows the principle of a system 1 for animating an avatar 2, 2' on a screen 3, 3', 3'' of a mobile apparatus 4, 4', 4''.

The avatar 2 is provided with a mouth 5, 5' and is animated from a sound input signal 6 corresponding to the voice 7 of an interlocutor 8 communicating by means of a mobile phone 9 or any other means of sound communication (fixed telephone, computer, ...).

The system 1 comprises, from a server 10 belonging to a network (telephone, Internet ...), a proprietary server 11 for receiving signals 6.

This server comprises means 12 for analyzing the signal and transforming it in real time into a multiplexed audio and video stream 13, over two channels 14, 15; 14', 15' in the case of reception by 3D or 2D mobiles, or over a single channel 16 in the case of a so-called video mobile.

It further comprises calculation means arranged to synchronize the movements of the mouth 5 of the avatar with the phonemes detected in the sound input signal and (in the case of 2D and 3D mobiles) on the one hand to produce the scripted text data 17; 17', which are then transmitted at 18, 18' in script form to the mobile phone 4; 4', and on the other hand to download the 2D or 3D avatar, at 19, 19', to said mobile phone.

In the case of a so-called video-telephony mobile, the text is scripted at 20 and converted into image and sound files 21, which are compressed at 22 and sent to the mobile 4'' in the form of a video stream 23.

The result obtained is that the avatar 2, and in particular its mouth 5, seems to speak in real time in the place of the interlocutor 8 and that the behavior of the avatar (attitude, gestures) is coherent with the voice.

The invention will now be described in more detail with reference to FIGS. 2 to 7, the method more particularly described making it possible to perform the following functions:

- use elementary animated sequences, consisting of images generated by a 3D rendering calculation or produced directly from drawings;

- choose and configure one's character through an online service that will produce new elementary sequences: 3D rendering on the server or selection of categories of sequences;

- load all the elementary sequences into memory when the application is launched and keep them in memory for the duration of the service, for several simultaneous and successive users;

- analyze the voice contained in the input signal in order to detect the periods of silence, the periods of speech and possibly other elements contained in the sound signal, such as phonemes and prosody (intonation of the voice, rhythm of the speech, tonic accents);

- select in real time the elementary sequence to play, according to the previously calculated parameters.

The sound signal is analyzed from a buffer corresponding to a small time interval (approximately 10 milliseconds). The choice of the elementary sequences (by what is called the sequencer) is explained later.

More precisely, and to obtain the results sought by the invention, we begin by creating a list of elementary animation sequences for a set of characters.

Each sequence consists of a series of images produced by 3D or 2D animation software known per se, such as 3dsMax and Maya from the American company Autodesk and XSI from the French company Softimage, or by conventional proprietary 3D rendering tools, or consists of digitized drawings. These sequences are generated in advance and placed on the proprietary server that broadcasts the avatar video stream, or generated by the online avatar configuration service and placed on the same server.

In the embodiment more particularly described here, the list of names of the available elementary sequences is common to all the characters, but the images that compose them can represent very different animations.

This makes it possible to define a state graph common to several avatars but this provision is not mandatory.

A graph 24 of states is then defined (see FIG. 2) in which each node (or state) 26, 27, 28, 29, 30 is defined as a point of transition between elementary sequences.

The connection between two states is unidirectional, in one direction or the other (arrows 25).

More precisely, in the example of FIG. 2, five states have been defined, namely the sequence-start state 26, the neutral state 27, the excited state 28, the rest state 29 and the sequence-end state 30.

All the sequences connected through the same state of the graph must be visually compatible with the transition from the end of one animation to the beginning of the other. Compliance with this constraint is ensured when the animations corresponding to the elementary sequences are created.
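As a rough illustration of the graph described above, the sketch below models the states as nodes and the elementary sequences as directed edges. The state names follow FIG. 2, but the particular set of edges, the sequence names and the `AnimationGraph` class are assumptions made for this example; the text does not list the actual connections.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sequence:
    """One elementary sequence: a directed edge between two states."""
    name: str      # hypothetical sequence name
    source: str    # state where the sequence starts
    target: str    # state where the sequence ends


@dataclass
class AnimationGraph:
    """Directed graph whose nodes are transition states."""
    sequences: list = field(default_factory=list)

    def add(self, name, source, target):
        self.sequences.append(Sequence(name, source, target))

    def connected_states(self, state):
        """States reachable from `state` by playing one elementary sequence."""
        return sorted({s.target for s in self.sequences if s.source == state})

    def sequence_between(self, source, target):
        """First sequence leading from `source` to `target`, or None."""
        for s in self.sequences:
            if s.source == source and s.target == target:
                return s
        return None


# Hypothetical edges between the five states of FIG. 2.
graph = AnimationGraph()
graph.add("intro", "start", "neutral")
graph.add("neutral_loop", "neutral", "neutral")   # a sequence may loop on a state
graph.add("get_excited", "neutral", "excited")
graph.add("calm_down", "excited", "neutral")
graph.add("relax", "neutral", "rest")
graph.add("wake_up", "rest", "neutral")
graph.add("outro", "neutral", "end")

print(graph.connected_states("neutral"))
```

Because each edge has a single direction, a return path such as excited to neutral has to be declared as its own sequence, which matches the unidirectional connections described in the text.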

Each elementary sequence is duplicated to show a character who speaks or a character who is silent, depending on whether or not speech is detected in the voice.

This makes it possible to switch from one version to the other of the elementary sequence in progress, in order to synchronize the animation of the character's mouth with the speaking periods.

FIG. 3 shows a sequence of images as obtained with speech 32, the same sequence without speech 33, and depending on the sound input (curve 34) transmitted by the interlocutor, the resulting sequence 35.
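The combination shown in FIG. 3 can be pictured as a per-frame choice between the two versions. The sketch below assumes one speech flag per video frame and uses made-up frame labels in place of pre-rendered images; the function name `mix_versions` is hypothetical.

```python
def mix_versions(speaking_frames, silent_frames, speech_flags):
    """Build the resulting sequence frame by frame.

    speaking_frames / silent_frames: the two versions of one elementary
    sequence, with the same number of frames.
    speech_flags: one boolean per frame, True when speech is detected.
    """
    if not (len(speaking_frames) == len(silent_frames) == len(speech_flags)):
        raise ValueError("both versions and the flags must have the same length")
    return [
        talk if flag else mute
        for talk, mute, flag in zip(speaking_frames, silent_frames, speech_flags)
    ]


# Hypothetical frame labels standing in for pre-rendered images.
speaking = ["talk0", "talk1", "talk2", "talk3"]
silent = ["mute0", "mute1", "mute2", "mute3"]
speech_detected = [True, True, False, True]

result = mix_versions(speaking, silent, speech_detected)
print(result)
```

Since both versions share the same timing, the switch never changes the position in the sequence, only which set of images is shown.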

The principle of selection of the animation sequences is now described.

The analysis of the voice produces a number of so-called level 1 parameters whose value varies over time and whose average is calculated over a certain interval, for example 100 milliseconds. These parameters are, for example:

- the speech activity (silence or speech signal);

- the rhythm of the speech;

- the pitch (high or low), if the language is non-tonal;

- the length of the vowels;

- the more or less marked presence of tonic accents.

The speech activity parameter can be calculated, as a first approximation, from the power of the sound signal (integral of the squared signal), by considering that there is speech above a certain threshold. The threshold can be calculated dynamically according to the signal-to-noise ratio. Frequency filtering is also possible, to avoid for example treating the passage of a truck as voice. The rhythm of the speech is calculated from the average frequency of the periods of silence and of speech. Other parameters can also be calculated from a frequency analysis of the signal.

According to the embodiment of the invention more particularly described here, simple mathematical formulas (linear combinations, threshold functions, Boolean functions) make it possible to pass from these level 1 parameters to so-called level 2 parameters, which correspond to characteristics such as, for example, a slow, fast, jerky, happy or sad character.
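A first-approximation speech activity detector along the lines described above might look as follows. The fixed threshold, the frame length in samples and the function names are assumptions made for illustration; the dynamic threshold and the frequency filtering mentioned in the text are left out, and counting speech periods is only a crude starting point for a rhythm estimate.

```python
def frame_power(samples):
    """Mean of the squared samples, a discrete stand-in for the integral of the squared signal."""
    return sum(x * x for x in samples) / len(samples)


def speech_activity(signal, frame_len, threshold):
    """Return one boolean per frame: True when the frame power exceeds the threshold."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        flags.append(frame_power(frame) > threshold)
    return flags


def count_speech_periods(flags):
    """Number of silence-to-speech transitions in a list of activity flags."""
    periods = 0
    previous = False
    for flag in flags:
        if flag and not previous:
            periods += 1
        previous = flag
    return periods


# Synthetic signal: quiet, loud, quiet, loud (4 frames of 4 samples each).
signal = [0.01, -0.01, 0.02, 0.0,
          0.5, -0.6, 0.4, -0.5,
          0.0, 0.01, -0.02, 0.01,
          0.7, -0.3, 0.6, -0.4]
activity = speech_activity(signal, frame_len=4, threshold=0.05)
print(activity, count_speech_periods(activity))
```

In a real deployment the frame length would correspond to the analysis buffer (about 10 milliseconds of audio) and the threshold would follow the measured noise level.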

The level 2 parameters are considered as dimensions according to which one defines a series of coefficients Ci with fixed values for each state e of the graph of animation. Examples of such a parameterization are given below.

At any time, that is to say for example with a periodicity of 10 milliseconds, the level 1 parameters are calculated. When a new state must be chosen, that is to say at the end of a sequence, the level 2 parameters deduced from them can therefore be computed, and the following value calculated for a state e:

$$P_e = \sum_i P_i \times C_i$$

where the values $P_i$ are those of the level 2 parameters and $C_i$ the coefficients of state e along dimension i.

This sum is a relative probability of the state e (relative to the other states) of being selected.
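To make the notion of relative probability concrete, the sketch below computes the weighted sum for a few states and then divides by the total to obtain each state's share of the draw. The dimension names, coefficient values and parameter values are invented for illustration, and the normalization step is one natural reading of "relative", not something the text spells out.

```python
def relative_probability(params, coefficients):
    """P_e = sum over dimensions i of P_i * C_i for one state e."""
    return sum(p * coefficients.get(dim, 0.0) for dim, p in params.items())


def normalize(scores):
    """Turn relative probabilities into shares that sum to 1 (when possible)."""
    total = sum(scores.values())
    if total == 0:
        return {state: 0.0 for state in scores}
    return {state: score / total for state, score in scores.items()}


# Hypothetical level 2 parameters and per-state coefficients.
level2 = {"fast": 0.5, "happy": 1.0, "sad": 0.0}
state_coefficients = {
    "excited": {"fast": 2.0, "happy": 1.0},
    "neutral": {"happy": 1.0},
    "rest": {"sad": 1.5, "happy": 1.0},
}

scores = {
    state: relative_probability(level2, coeffs)
    for state, coeffs in state_coefficients.items()
}
shares = normalize(scores)
print(scores, shares)
```

Only the ratios between the sums matter for the draw, so the coefficients can be scaled freely without changing the behavior of the avatar.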

When an elementary sequence is in progress, it is then allowed to proceed to the end, that is to say until the state of the graph at which it ends, but we go from one version to another of the sequence (version with or without speech) at any time depending on the detected speech signal. When the sequence ends and we arrive at a new state, we choose the next target state following a probability defined by the previous calculations. If the target state is the same as the current state, it is maintained by playing a loop animation a certain number of times and thus we come back to the previous case.

Some sequences are loops that start from a state and return to it (arrow 31); they are used when the sequencer decides to keep the avatar in its current state, that is to say when it chooses the current state itself as the next target state.

An example of animation generation is described in pseudo-code below, followed by an example of sequence flow.

Example of animation generation:

initialize the current state to a predefined start state
initialize the target state to null
initialize the current animation sequence to null
as long as an incoming audio stream is received:
    o decode the incoming audio stream
    o calculate the level 1 parameters
    o if the current animation sequence is finished:
        - current animation sequence = null
        - target state = null
    o if the target state is null:
        - calculate the level 2 parameters according to the level 1 parameters (and possibly their history)
        - select the states connected to the current state
        - calculate the probabilities of these connected states according to their coefficients and the previously calculated level 2 parameters
        - draw the target state among these connected states according to the previously calculated probabilities => a new target state is thus defined
    o if the current animation sequence is null:
        - select in the graph the animation sequence from the current state to the target state => this defines the current animation sequence
    o play the current animation sequence => selection of the corresponding pre-calculated images
    o match the portions of the incoming audio stream with the images selected from the analysis of these portions of the audio stream
    o generate a compressed audio and video stream from the selected images and the incoming audio stream

Example of sequence flow: the interlocutor says "Hello, how are you?":

1. the level 1 parameters indicate the presence of speech

2. the level 2 parameters indicate a cheerful voice (corresponding to "Hello")

3. the probabilistic draw selects the joyful target state

4. the animation sequence from the initial state to the joyful state is played (in its version with speech)

5. a period of silence begins, recognized through the level 1 parameters

6. the animation sequence is still in progress; it is not interrupted, but its version without speech is selected

7. the joyful target state is reached

8. the silence leads to the selection of the neutral target state (through the calculation of the level 1 and 2 parameters and the probabilistic draw)

9. the animation sequence from the joyful state to the neutral state is played (in its version without speech)

10. the neutral target state is reached

11. the silence again leads to the selection of the neutral target state

12. the neutral => neutral animation sequence (loop) is played in its version without speech

13. the level 1 parameters indicate the presence of speech (corresponding to "How are you?")

14. the level 2 parameters indicate an interrogative voice

15. the neutral target state is reached again

16. the interrogative target state is selected (through the calculation of the level 1 and 2 parameters and the probabilistic draw).
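The generation loop and the example flow above can be sketched in Python as follows. This is a minimal sketch under several assumptions: audio decoding and level 1 analysis are taken to happen upstream (each chunk arrives as a dictionary with a `speak` flag), the graph is a plain dictionary, the current state is updated when a sequence finishes, and a uniform draw is used when all weights are zero. None of the names below come from the original text.

```python
import random


def run_sequencer(audio_chunks, graph, coefficients, level2_from_level1,
                  start_state, rng):
    """Yield (sequence name, version, frame index) for each incoming chunk.

    audio_chunks: iterable of level 1 parameter dicts, one per chunk.
    graph: dict mapping (source, target) -> (sequence name, frame count).
    coefficients: dict mapping state -> {dimension: coefficient}.
    level2_from_level1: function turning level 1 parameters into level 2 ones.
    """
    current_state = start_state
    target_state = None
    current_sequence = None
    frame = 0

    for level1 in audio_chunks:
        # A finished sequence means the target state has been reached.
        if current_sequence is not None and frame >= current_sequence[1]:
            current_state = target_state
            current_sequence = None
            target_state = None

        if target_state is None:
            level2 = level2_from_level1(level1)
            connected = [dst for (src, dst) in graph if src == current_state]
            weights = [
                sum(level2[d] * coefficients[s].get(d, 0.0) for d in level2)
                for s in connected
            ]
            if sum(weights) > 0:
                target_state = rng.choices(connected, weights=weights)[0]
            else:
                # No state stands out: uniform draw (an assumption).
                target_state = rng.choice(connected)

        if current_sequence is None:
            current_sequence = graph[(current_state, target_state)]
            frame = 0

        # Pick the speaking or silent version of the sequence for this chunk.
        version = "speaking" if level1["speak"] else "silent"
        yield current_sequence[0], version, frame
        frame += 1


# Tiny hypothetical setup: two sequences, one reachable state from each state.
demo_graph = {
    ("start", "neutral"): ("intro", 2),
    ("neutral", "neutral"): ("neutral_loop", 2),
}
demo_coefficients = {"neutral": {"SPEAK": 1.0, "NEUTRAL": 1.0}}


def demo_level2(level1):
    return {"SPEAK": level1["speak"], "NEUTRAL": 1 - level1["speak"]}


chunks = [{"speak": 1}, {"speak": 0}, {"speak": 1}]
timeline = list(run_sequencer(chunks, demo_graph, demo_coefficients,
                              demo_level2, "start", random.Random(0)))
print(timeline)
```

The demo shows the two behaviors highlighted in the example flow: the sequence in progress is never interrupted, only its version changes with the speech flag, and a new target state is drawn only once the sequence has ended.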

The method of selecting a state from relative probabilities is now described with reference to FIG. 5 which gives a probability graph of states 40 to 44.

The relative probability of the state 40 is determined with respect to the value calculated above. If the value (arrow 45) is at a certain level, the corresponding state is selected (in the figure, state 42).
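One common way to implement such a draw is roulette-wheel selection: the relative probabilities are laid end to end and a random value picks the segment it falls into. The sketch below follows that approach; whether arrow 45 of FIG. 5 works exactly this way is an assumption, and the numerical values are invented.

```python
import random


def draw_state(relative_probs, u):
    """Select a state by cumulative sums (roulette-wheel selection).

    relative_probs: dict mapping state -> non-negative relative probability.
    u: a number in [0, 1), normally drawn at random.
    """
    total = sum(relative_probs.values())
    if total <= 0:
        raise ValueError("at least one relative probability must be positive")
    threshold = u * total
    cumulative = 0.0
    for state, weight in relative_probs.items():
        cumulative += weight
        if threshold < cumulative:
            return state
    return state  # guards against floating-point rounding at the upper end


# Hypothetical relative probabilities for states 40 to 44.
probs = {"40": 1.0, "41": 0.5, "42": 2.0, "43": 0.0, "44": 0.5}
chosen = draw_state(probs, u=0.5)                    # fixed value for reproducibility
random_choice = draw_state(probs, random.random())   # normal use
print(chosen)
```

A state whose relative probability is zero occupies no segment and is therefore never selected.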

With reference to FIG. 4, another example of a state graph according to the invention is given. Here the following states have been defined:

- neutral state: 46

- state suitable for a first speech period (Speak1): 47

- other state suitable for a second speech period (Speak2): 48

- state suitable for a first period of silence (Idle1): 49

- other state suitable for a second period of silence (Idle2): 50

- state suitable for an introductory speech (Greeting): 51

The state graph connects all these states in a star pattern, by unidirectional links in both directions (links 52).

In other words, in the example more particularly described with reference to FIG. 4, the dimensions are thus defined, for the calculation of the relative probabilities (dimensions of the parameters and the coefficients):

- IDLE: values indicating a period of silence

- SPEAK: values indicating a period of speech

- NEUTRAL: values indicating a period of neutrality

- GREETING: values indicating a greeting or introduction phase

First level parameters, detected in the input signal and used as intermediate values for the calculation of the preceding parameters, are then introduced, namely:

Speak: binary value indicating whether speech is in progress

SpeakTime: time elapsed since the beginning of the speaking period

MuteTime: time elapsed since the beginning of the silence period

SpeakIndex: number of the speaking period counted from a given moment

Formulas for passing from first level to second level parameters are also defined:

- IDLE: NOT(Speak) × MuteTime

- SPEAK: Speak

- NEUTRAL: NOT(Speak)

- GREETING: Speak AND (SpeakIndex = 1)

The coefficients associated with the states are for example given by Table I below:

TABLE I

Such a parameterization, with reference to FIG. 6, and for four instants T1, T2, T3 and T4, gives the current state and the values of the level 1 and level 2 parameters shown in Table II below.

TABLE II

T1: Current state = Neutral
    Level 1: Speak = 1; SpeakTime = 0.01 sec; MuteTime = 0 sec; SpeakIndex = 1
    Level 2: IDLE = 0; SPEAK = 1; NEUTRAL = 0; GREETING = 1

T2: Current state = Greeting
    Level 1: Speak = 0; SpeakTime = 0 sec; MuteTime = 0.01 sec; SpeakIndex = 1
    Level 2: IDLE = 0.01; SPEAK = 0; NEUTRAL = 1; GREETING = 0

T3: Current state = Neutral
    Level 1: Speak = 0; SpeakTime = 0 sec; MuteTime = 1.5 sec; SpeakIndex = 1
    Level 2: IDLE = 0.5; SPEAK = 0; NEUTRAL = 1; GREETING = 0

T4: Current state = Neutral
    Level 1: Speak = 1; SpeakTime = 0.01 sec; MuteTime = 0 sec; SpeakIndex = 2
    Level 2: IDLE = 0; SPEAK = 1; NEUTRAL = 0; GREETING = 0
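The passage from first-level to second-level parameters can be checked with a few lines of code. The sketch below applies the four formulas given earlier to the first-level values listed for instants T1, T2 and T4; the function and argument names are hypothetical, and SpeakTime is omitted because none of the four formulas uses it.

```python
def level2_parameters(speak, mute_time, speak_index):
    """Apply the IDLE, SPEAK, NEUTRAL and GREETING formulas to first-level values.

    speak: 1 while speech is detected, else 0.
    mute_time: seconds elapsed since the silence period began.
    speak_index: number of the current speaking period.
    """
    not_speak = 1 - speak
    return {
        "IDLE": not_speak * mute_time,
        "SPEAK": speak,
        "NEUTRAL": not_speak,
        "GREETING": int(bool(speak) and speak_index == 1),
    }


# First-level values at instants T1, T2 and T4.
t1 = level2_parameters(speak=1, mute_time=0.0, speak_index=1)
t2 = level2_parameters(speak=0, mute_time=0.01, speak_index=1)
t4 = level2_parameters(speak=1, mute_time=0.0, speak_index=2)
print(t1, t2, t4)
```

The GREETING dimension is only active during the first speaking period, which is why T1 and T4 differ even though speech is detected at both instants.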

The relative probability of the following states is then given in Table III below.

TABLE III

State       T1     T2     T3     T4
Neutral     0      1      1      0
Speak1      1      0      0      1
Speak2      1.2    0      0      1.2
Greeting    2.5    0      0      0
Idle1       0      0.02   1      0
Idle2       0      0.01   0.5    0
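The relative probabilities for T1 and T2 can be reproduced with the weighted sum P_e = Σ P_i × C_i. Because the contents of Table I are not reproduced in this text, the coefficients below are hypothetical: they were chosen only so that the sums agree with the values quoted for T1 and T2, and should not be read as the original table.

```python
# Hypothetical coefficients (Table I is not available in this text); chosen
# only so that the weighted sums match the values quoted for T1 and T2.
COEFFICIENTS = {
    "Neutral":  {"NEUTRAL": 1.0},
    "Speak1":   {"SPEAK": 1.0},
    "Speak2":   {"SPEAK": 1.2},
    "Greeting": {"GREETING": 2.5},
    "Idle1":    {"IDLE": 2.0},
    "Idle2":    {"IDLE": 1.0},
}


def state_scores(level2):
    """Relative probability of each state: sum of P_i * C_i over its dimensions."""
    return {
        state: sum(level2[dim] * c for dim, c in coeffs.items())
        for state, coeffs in COEFFICIENTS.items()
    }


# Level 2 values at T1 and T2.
t1_scores = state_scores({"IDLE": 0, "SPEAK": 1, "NEUTRAL": 0, "GREETING": 1})
t2_scores = state_scores({"IDLE": 0.01, "SPEAK": 0, "NEUTRAL": 1, "GREETING": 0})
print(t1_scores)
print(t2_scores)
```

With these assumed coefficients, Greeting dominates at T1 and Neutral dominates at T2, which is consistent with the next states reported for those instants.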

This gives, in the example chosen, the probability draws shown in Table IV below:

TABLE IV

T1: Current state = Neutral
    States in the draw: Speak1, Speak2, Greeting
    Draw: Greeting
    Next state = Greeting

T2: Current state = Greeting
    States in the draw: Neutral
    Draw: Neutral
    Next state = Neutral

T3: Current state = Neutral
    States in the draw: Neutral, Idle1, Idle2
    Draw: Neutral
    Next state = Neutral

T4: Current state = Neutral
    States in the draw: Speak1, Speak2
    Draw: Speak2
    Next state = Speak2

Finally, with reference to FIGS. 7 and 1, the screen 52 of a mobile device used to configure the avatar in real time is shown schematically.

In step 1, the user 8 configures the parameters of the video that he wishes to personalize.

For example :

• Character 53

• Expression of the character (happy, sad ...) 54

• Line spoken by the character 55

• Background music 56

• Recipient's phone number 57.

In step 2, the parameters are transmitted in the form of requests to the server application (server 11) which interprets them, creates the video, and sends it (link 13) to the encoding application.

In step 3, the video sequences are compressed into the "right" format, that is to say one readable by the mobile terminals, before step 4, in which the compressed video sequences are transmitted (links 18, 19, 18', 19', 23) to the recipient, for example by MMS.

As is obvious, and as follows from the foregoing, the invention is not limited to the embodiment more particularly described but encompasses all variants, in particular those in which the broadcast is done offline and not in real time or near real time.

Claims

1. A method of animating, on a screen (3, 3', 3'') of a mobile device (4, 4', 4''), an avatar (2, 2', 2'') provided with a mouth (5, 5') from a sound input signal (6) corresponding to the voice (7) of an interlocutor (8) in a telephone communication, characterized in that the sound input signal is transformed in real time into an audio and video stream in which, on the one hand, the movements of the mouth of the avatar are synchronized with the phonemes detected in said sound input signal and, on the other hand, at least one other part of the avatar is animated in a manner coherent with said signal by changes of attitude and movements obtained by analysis of said signal, and in that, in addition to the phonemes, the sound input signal is analyzed in order to detect and use for the animation one or more additional parameters called level 1 parameters, namely the periods of silence, the periods of speech and/or other elements contained in said sound signal taken from prosody, intonation, rhythm and/or tonic accent, so that the whole avatar moves and seems to speak in real time or substantially in real time in place of the interlocutor.

2. Method according to claim 1, characterized in that one chooses and / or configures the avatar through an online service on the Internet.

3. Method according to any one of the preceding claims, characterized in that the mobile device is a mobile phone.

4. Method according to any one of the preceding claims, characterized in that, to animate the avatar, elementary sequences are used, consisting of images generated by a 3D rendering calculation or generated from drawings.

5. The method as claimed in claim 4, wherein elementary sequences are loaded into memory at the beginning of the animation and stored in said memory for the duration of the animation for several simultaneous and / or successive interlocutors.

6. Method according to any one of claims 4 and 5, characterized in that the elementary sequence to be played is selected in real time according to previously calculated and/or determined parameters.

7. Method according to any one of claims 4 to 6, characterized in that, the elementary sequences being common to all avatars used in the mobile device, an animation graph is defined in which each node represents a point or state of transition between two elementary sequences, each connection between two transition states being unidirectional and all the elementary sequences connected through the same state having to be visually compatible with the transition from the end of one animation to the beginning of the other.

8. Method according to claim 7, characterized in that each elementary sequence is duplicated so as to show a character who speaks or who is silent, depending on whether or not a voice sound is detected.

9. Method according to any one of the preceding claims, characterized in that the phonemes and/or the other level 1 parameters are used to calculate so-called level 2 parameters, namely the slow, fast, jerky, joyful or sad character of the avatar, from which the animation of said avatar is produced in whole or in part.

10. Method according to claim 9, characterized in that the level 2 parameters being considered as dimensions according to which a series of coefficients are defined with values which are fixed for each state of the animation graph, one calculates for a state e the probability value:

$P_e = \sum_i P_i \times C_i$

with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of state e along dimension i, then, when an elementary sequence is in progress,

. the elementary sequence is allowed to run to its end, or a switch is made to the other sequence, which speaks, if voice is detected, and vice versa, then, when the sequence ends and a new state is reached,

. the next target state is chosen according to a probability defined by the calculation of the probability values of the states connected to the current state.

11. System (1) for animating an avatar (2, 2') provided with a mouth (5, 5') from a sound input signal (6) corresponding to the voice (7) of a telephone communication interlocutor (8), characterized in that it comprises a mobile telecommunication device (9) for receiving the sound input signal emitted by an external telephone source, a proprietary signal receiving server (11) comprising means (12) for analyzing said signal and transforming said sound input signal in real time into an audio and video stream, computing means arranged, on the one hand, to synchronize the movements of the mouth of the avatar transmitted in said stream with the phonemes detected in said sound input signal and, on the other hand, to animate at least one other part of the avatar in a manner coherent with said signal by changes of attitude and movements, and in that it further comprises means for analyzing the sound input signal in order to detect and use for the animation one or more additional parameters, called level 1 parameters, namely the periods of silence, the periods of speech and/or other elements contained in the sound signal taken from among prosody, intonation, rhythm and/or tonic accent, so that the avatar moves and seems to speak in real time or substantially in real time in place of the interlocutor.

12. System according to claim 11, characterized in that it comprises means for configuring the avatar through an online service on the Internet.

13. System according to any one of claims 11 and 12, characterized in that it comprises means for constituting and storing, in a proprietary server, elementary sequences for animating the avatar, consisting of images generated by a 3D rendering calculation or generated from drawings.

14. System according to claim 13, characterized in that it comprises real-time selection means of the elementary sequence to be played, according to previously calculated and / or determined parameters.

15. System according to any one of claims 11 to 14, characterized in that, the list of elementary sequences being common to all avatars used for sending to the mobile device, it comprises means for calculating and setting up an animation graph in which each node represents a point or state of transition between two elementary sequences, each connection between two transition states being unidirectional and all the sequences connected through the same state having to be visually compatible with the transition from the end of one animation to the beginning of the other.

16. System according to any one of claims 11 to 15, characterized in that it comprises means for duplicating each elementary sequence so as to show a character who speaks or who is silent, depending on whether or not a voice sound is detected.

17. System according to any one of claims 11 to 16 characterized in that, the phonemes and / or the other parameters being considered as dimensions according to which a series of coefficients are defined with values which are fixed for each state of the animation graph, the calculation means are arranged to calculate for a state e the value of probability:

$P_e = \sum_i P_i \times C_i$

with $P_i$ the value of the level 2 parameter calculated from the level 1 parameters detected in the voice and $C_i$ the coefficient of state e along dimension i, then, when an elementary sequence is in progress, to let the elementary sequence that is silent run to its end or to switch to the other sequence, which speaks, if voice is detected, and vice versa, then, when the sequence ends and a new state is reached, to choose the next target state according to a probability defined by the calculation of the probability values of the states connected to the current state.
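By way of illustration, the state selection recited in claims 10 and 17 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the state names and the sample values are hypothetical, and the clamping of negative values, the normalization of the values $P_e$ into drawing weights and the uniform fallback are assumptions that the claims do not spell out.

```python
import random


def state_value(level2, coeffs):
    """P_e = sum over dimensions i of P_i * C_i."""
    return sum(p * coeffs.get(dim, 0.0) for dim, p in level2.items())


def choose_next_state(current, graph, state_coeffs, level2, rng=random):
    """Draw the next target state among the states connected to `current`."""
    candidates = graph[current]
    # Assumption: negative values are clamped to zero before weighting.
    weights = [max(state_value(level2, state_coeffs[s]), 0.0) for s in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)  # assumption: uniform fallback
    return rng.choices(candidates, weights=weights, k=1)[0]


def variant_to_play(sequence, voice_detected):
    """Each elementary sequence exists in a speaking copy and a silent copy."""
    return (sequence, "speaking" if voice_detected else "silent")


# Hypothetical level 2 parameters and per-state coefficients.
level2 = {"fast": 0.8, "joyful": 0.6, "sad": 0.0}
state_coeffs = {
    "wave": {"fast": 1.0, "joyful": 1.0},
    "slump": {"sad": 1.0},
}
graph = {"idle": ["wave", "slump"]}  # unidirectional connections from "idle"

next_state = choose_next_state("idle", graph, state_coeffs, level2, random.Random(0))
```

With these sample values only the "wave" state has a non-zero value, so it is always the one drawn. In a running animation the draw would be repeated each time a sequence ends and a new state is reached, while `variant_to_play` switches between the speaking and silent copies as the voice is detected or not.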