GB2423905A - Animated messaging

Animated messaging

Info

Publication number
GB2423905A
Authority
GB
United Kingdom
Prior art keywords
audio
entity
message
audio data
data
Prior art date
Legal status
Pending
Application number
GB0504383A
Other versions
GB0504383D0 (en)
Inventor
Sean Smith
Larry Ger
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to GB0504383A
Publication of GB0504383D0
Publication of GB2423905A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A data processing method and communication device for displaying an animated entity and outputting a message in audio form are described. Data representing a plurality of words in the message are received. The received data are processed to identify a plurality of visemes corresponding to the words in the message. The identified visemes are used to animate the entity such that the animated entity appears to be speaking the message as the message is output in audio form.

Description

Animated Messaging

The present invention relates to messaging, and in particular to providing an animated visual entity for delivering a message.
A message can be delivered to a user of a communications device in a number of ways.
For example, in a conventional telephone or cellular telephone, the user can listen to the message as it is delivered in real time, e.g. as part of a telephone conversation, or the message can be recorded, e.g. by a voice mail system or telephone answering machine. In both of these examples, the message is delivered in audio form only.
Therefore it would be advantageous to be able to provide a more immersive messaging format which can more actively engage a user and convey extra information beyond that discernable from audio alone.
According to a first aspect of the present invention, there is provided a data processing method for creating an animated entity to be displayed on a communication device while outputting a message in audio form. Data representing a plurality of words in the message can be processed to identify a plurality of visemes corresponding to the words in the message. The identified visemes can be used to create an animation of the entity. The animated entity can appear to speak the message as the message is output as audio.
Hence, by displaying the animated entity in synchrony with output audio, the animated entity will appear to be speaking the words of the message.
The image of the entity can be comprised of a plurality of portions. Separate portions can be provided for different anatomic parts of the entity. At least one portion can be provided for each eye of the entity. At least one portion can be provided for a mouth of the entity. Hence by animating the different portions independently, greater expressiveness can be provided. A greater or fewer number of portions can be used depending on the level of detail of animation required.
One of the portions can include at least a part of the mouth of the entity. Animating the entity can include selecting images of at least a part of the mouth of the entity based on visemes associated with the words currently being spoken by the entity.
The method can further comprise retrieving an image of a part of the entity for each portion of the image of the entity from a database of stored images. The images stored in the database can be associated with different entity characters. The characters can be different types of characters, such as imaginary characters and real characters. The images stored in the database can be associated with different moods of the entity.
The method can further comprise alternately storing audio data in one of a first buffer and a second buffer and passing audio data from the other of the first and second buffer for processing. More than two buffers can be used. Each buffer can receive sampled speech audio data. The audio data can come from a live source of data, such as a microphone, or from a recorded source of data, such as a file.
The method can further comprise pre-processing the audio data. The audio data can be pre-processed in a number of ways to enhance the accuracy of the viseme identification.
Pre-processing the data can include any one, or combination of, removing noise, amplifying or boosting the data and/or normalising the data.
The method can further comprise determining a property of a set or group of audio data items. The set or group of audio data items can be the contents of a buffer. The property can be classified or characterised to identify a viseme associated with the set of audio data items.
The property of the set of audio data items can be the average amplitude of the set of audio data.
The method can further comprise comparing the property with the same property determined for audio data corresponding to a plurality of different visemes to identify the viseme associated with the set of audio data items.
The property can be a complex property derived by combining a number of different properties of the set of audio data. The complex property can be numeric and can provide a signature representative of the character of the set of audio data. Identifying the viseme can comprise comparing the signature of the set of audio data items with a plurality of signatures each derived from a recording of a viseme.
The property can be selected from any one, or combination, of an average amplitude of the set of audio data items, a number of zero crossings of the set of audio data items, gradients of the set of audio data items and/or the number of discrete peaks of the set of audio data items.
The method can further comprise writing image data for a next frame in the animation of the entity to an off screen buffer. The data from the off screen buffer can then be transferred to the screen to provide a next frame in the animation of the entity.
The method can further comprise displaying the animated entity and playing the audio.
The method can further comprise transmitting the created animation and audio to a further communication device. The method can include displaying the animated entity and playing the audio on the further communication device.
The communication device can be a cellular telephony device. The device can be a cellular telephone, a smart phone, or a hybrid device, such as a PDA having a telephony function.
According to a further aspect of the invention, there is provided a communication device including a data processor and computer program code configuring the data processor to create an animated entity for display while outputting a message as audio. The computer program code can configure the data processor to: process received audio data representing a plurality of words in the message to identify a plurality of visemes corresponding to the words in the message; and use the identified visemes to create an animation of the entity such that the animated entity will appear to speak the message as the message is output as audio.
According to a further aspect of the invention, there is provided computer program code executable by a data processing device to provide any of the method aspects or communication device aspects of the invention. A computer program product comprising a computer readable medium bearing such computer program code is also provided as an aspect of the invention.
An embodiment of the invention will now be described, by way of example only, and with reference to the accompanying drawings, in which:

Figure 1 shows a schematic diagram illustrating a software architecture for a communication device, including the messaging application of the invention;
Figure 2 shows a schematic diagram illustrating the architecture of the messaging application of the invention;
Figure 3 shows a high level process flow chart illustrating the method of the invention;
Figures 4A and 4B respectively show an animated character and illustrate the partitioning of the features of the character;
Figure 5 shows a character data structure used by the method of the invention;
Figure 6 shows a process flow chart illustrating an audio data receiving stage of the method of the invention;
Figure 7 shows a high level process flow chart illustrating an audio data processing stage of the method of the invention;
Figure 8 shows a process flow chart illustrating the audio data processing stage;
Figure 9 shows a process flow chart illustrating an alternative audio data processing stage;
Figure 10 shows a process flow chart illustrating an animation stage of the method of the invention; and
Figure 11 shows a schematic block diagram illustrating a communication environment in which the method of the invention can be used.
Similar items in different Figures share common reference numerals unless indicated otherwise.
The invention can be applied to any communications device. However, in the following, an embodiment of the invention will be described in the context of a mobile or cellular communications device, such as a cellular telephone. Cellular telephones can be considered to include "SMART" phones and Personal Digital Assistant (PDA) devices having inbuilt telephony. However, the invention is not necessarily limited to wireless communication devices and can also be used with wire based telephony and telephony over non-telephony dedicated networks, such as Voice Over Internet Protocol (VOIP) telephony. The invention can be used with any receiving communications device having sufficient data processing power, a programmable display and programmable audio output.
Figure 1 shows a schematic block diagram of a software architecture 100 including a messaging software application 102 of the present invention. Messaging application 102 sits on top of operating system 104. An example of a current mobile phone operating system is the Symbian operating system. Mobile phone operating system 104 provides a number of lower level functionalities accessible to the messaging application 102. For example the operating system 104 can include a telephony module 106. The telephony module can start and stop calls and can convert sound meta data into raw sound data which can be fed to a speaker and also provides a speech data feed to the messaging application 102. The telephony module can also generate an incoming call event which can be used to trigger the messaging application as will be described in greater detail below.
The media module 108 can handle sound, video, the display of pictures, operation of a microphone and provides a hook into the display screen of the mobile phone.
The operating system 104 will provide other functionalities, which will be apparent to a person of ordinary skill in the art, however they have not been described in order not to obscure the nature of the present invention.
Figure 2 shows a schematic representation of the software architecture 110 of the messaging application 102. The messaging application takes speech data 112 as input.
As used herein, "speech" data is not limited to data representing words actually spoken but can include recorded speech, non-spoken words and generally refers to any data representing words which can be spoken, irrespective of their source or the origin of those words.
The messaging application 102 includes a first buffer 114 and a second buffer 116 for receiving speech data. Messaging application 102 also includes a voice processing module 118 which receives speech data from the buffers and processes the speech data in order to derive visemes corresponding to the speech data.
A viseme is a generic facial image that can be used to describe a particular sound. A viseme is the visual equivalent of a phoneme. It can be thought of as the appearance of a mouth when speaking a phoneme.
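By way of illustration only, the small viseme set used later in this description could be represented as a simple enumeration; the identifiers below are hypothetical and are not taken from this document:

    /* Illustrative only: a minimal viseme set matching the sounds named
     * later in the description, plus a closed mouth for silence. */
    typedef enum {
        VISEME_SILENCE = 0,   /* closed mouth                    */
        VISEME_MM,            /* lips pressed together           */
        VISEME_OH,            /* rounded, open mouth             */
        VISEME_SH,            /* pursed lips                     */
        VISEME_L,             /* tongue visible behind the teeth */
        VISEME_AH             /* wide open mouth (optional)      */
    } Viseme;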
Messaging application 102 also includes an animation engine 120. In one embodiment, animation engine 120 includes three controllers, each handling animation of a portion of a face. A first controller 122 is responsible for an upper left portion, a second controller 124 is responsible for an upper right portion and a third controller 126 is responsible for a lower portion. Different animation engines are pluggable to allow a wide variety of implementations. Each implementation of the animation engine will be dependent on its own meta-data defined and structured media data, e.g. pictures and video clips.
Messaging application 102 also includes a skin engine 130 in communication with a database 132 storing a plurality of images which can be assembled to provide an animated face. Skin engine 130 determines the appropriate subset of images to supply to the animation engine, depending on the character, and context settings for the character, to be animated and displayed. The controllers of the animation engine transfer the appropriate images to an off screen buffer 134. The contents of the off screen buffer 134 are periodically displayed on the display screen of the mobile phone in rapid succession to provide an animated character face in which the character animation is synchronised with the played audio track including the speech.
Figure 3 shows a process flowchart illustrating the method 140 of the invention as implemented in the software. When the messaging application is invoked, at step 142, the messaging application receives audio data from a source which includes the speech to be animated. At step 144, the audio data is processed by the audio processing module 118 to extract visemes. The visemes are passed to the animation engine 120 and at step 146, the animation engine, in co-operation with the skin engine 130, generates an animated character which is synchronised with the played audio track so as to give the impression of the animated character speaking the words within the audio track.
Before describing the data processing operations of the method of the invention in greater detail, the format of the characters and data structures used to animate the characters will be described with particular reference to Figures 4A, 4B and 5.
Figure 4A shows a schematic representation of a frame of an image 150 being displayed on a screen 152 of a mobile phone. The image includes a visual representation of a head of a character which is being animated to speak the words of an accompanying synchronised audio track.
The display 152 is partitioned into three segments: a first, upper left segment 154; a second, upper right segment 156; and a third lower segment 158. As will be appreciated, three segments is by way of example only and a greater or fewer number of segments can be used in other embodiments of the invention. Each segment includes a background portion, e.g. portion 160 in segment 154 and a portion in which a part of the character is displayed, e.g. portion 162 in segment 154. A background image can be displayed and masks can be used to define those portions of the overall image in which the character is displayed and those portions of the overall image in which the background image is displayed.
The boundaries between the upper left, upper right and lower portions are indicated by broken lines 164 and 166. The character is animated by displaying in rapid succession different mouth shapes in synchrony with the audio track, with corresponding different images for the upper portion of the face, depending on mood and other settings for the character which will be described in greater detail below.
Figure 4B shows an image frame of the same character as shown in Figure 4A as displayed on a screen 170 of a mobile phone. In this alternate embodiment, however, the portions of the display screen 170 have different boundaries and a mask based technique is used in the animation. As above, the image is partitioned into an upper left 172, upper right 174 and lower 176 region. The boundary of upper left region 172 is defined by broken lines 178 and 180, as is upper right portion 174, while the boundary of lower portion 176 is defined by broken line 182. A mask defined by the edges of the display screen, broken line 182 and solid line 184 is used to control the overlap of the upper left, upper right and lower images actually displayed.
Hence, in practice, an upper left and upper right portion of the face of the character is generated and displayed and the lower portion of the face of the character is also generated, but the mask is applied to the lower portion of the character face so that only the part of the character face below line 184 is displayed. Hence, in this way, it is easier to control animation of parts of the face more closely associated with speech, such as the nose region.
It will be appreciated that the face can be segmented in different ways, including into a greater or fewer number of different segments. It will also be appreciated that different mask shapes can be used for one or several of the display portions depending on which parts of the face it is intended to animate.
Messaging application 102 includes skin engine 130 which co-operates with database 132 which stores a plurality of bit map images which can be combined to provide the animated character. Figure 5 shows a data structure 190 illustrating the images stored in database 132 and the data entities maintained by skin engine 130 so as to define the character being animated. As illustrated in Figure 5, the messaging application can animate a plurality of different characters, character_1 to character_n, wherein each character has a different visual appearance. For example some of the characters may have a realistic human appearance, some may have a stylised human appearance and others may be realistic or stylised animals including real and imaginary ones. Indeed the characters can extend to living and non-living entities.
For any one character, there can be a plurality of different versions of that character, i.e. version_1 to version_n. Each version of a specific character can have different properties. For example a first version of a character may be a two dimensional representation whereas a second version of the character, version_2, may be a three dimensional representation of the same character. Other versions of the character may be black and white and colour versions of the character, and other variations of the versions of the character are envisaged. For each version of each character, a plurality of different mood types can be provided, as indicated by mood_1 to mood_n. For example, the mood types happy, sad and angry can be provided, although a fewer or greater number of moods and different types of moods are also envisaged.
For each mood of each version of each character a group of bit map images for each of the upper left, upper right and lower portions of the character face are stored in database 132. The upper left and upper right portions of the face are particularly useful in expressing the mood of the character. The upper left and upper right images can be mirror images of each other. Each upper left and upper right image can include a number of animated components, such as eyebrow movements, eye movements and eyelid movements. For example three different eyebrow positions, five different eyeball positions and three different eyelid positions can be used which provides 45 different possible permutations of upper left and upper right images. It will be appreciated that this is merely an example and that other combinations of features can also be used and different numbers of each feature.
The set of images required for each segment of the face and for each mood of the face are stored in the database. Hence, in the above example, each segment will have 45 images associated with it for each mood. Therefore, different moods will each have their own set of 45 images, for each of the face segments. As will be appreciated, different numbers of images may be used for different face segments and for different moods.
It has been found that approximately five to seven visemes can be used to accurately reproduce an animated speaking face and so typically five to seven images for the lower part of the face are provided, each corresponding to a different viseme, and with the image of the mouth reflecting the mood of the character.
With reference to Figure 6, there is shown a flowchart illustrating a speech or audio data reception stage 200 being part of the overall method 140 of the invention and corresponding generally to step 142 of Figure 3. At step 202, the method determines whether speech or audio data is being received by the messaging application. If there is no incoming audio or speech data then processing loops and the method waits until audio data is being received by the messaging application. When audio data is received, processing proceeds to step 204 at which it is determined whether the incoming speech data corresponds to a received telephone call. If it is determined that the incoming audio data does correspond to an incoming telephone call, then at step 206 the method detects the call event trigger from the telephony module of the operating system and it is determined whether to continue with the messaging application. At step 208 if it is determined that the call has been rejected by the user then processing terminates at step 210 and the messaging application can either shut down or return to being a background process. If a user does not reject the call then it is determined at step 212 whether the user decides to actually take the call. If the user does not take the call, then the messaging application again terminates at step 210 and the caller may either hang up or leave a message on an answering service. If the user then subsequently decides to listen to the message, then the incoming message is classified as a call and is processed according to the call handling path of the flowchart shown in Figure 6.
If the user decides to take the call in real time at step 212, then at step 214, the user can decide whether to display the animated character or not. If the user enters a command electing not to display the animated character then processing terminates at step 210.
Alternatively, if the user elects to use the messaging application then processing proceeds to step 216. In another embodiment, if the user decides to accept the call, then the animation is automatically begun.
If at step 204 it is determined that the audio data stream does not correspond to a call then processing proceeds to step 218 at which it is determined whether the incoming audio data stream is being generated by a microphone of the communication device. The user can enter a selection command to select to activate the animation so that it is driven by input from the microphone. Then the microphone voice data stream is obtained from the operating system at step 220 and processing proceeds to step 216.
If at step 218 the audio data stream is not originating from a microphone, then at step 222 the user can enter a selection command to open a prerecorded audio file. When it is determined that the audio data stream is originating from a file stored on, or being supplied to the device, then the audio file provides the input driving the animation.
If it is determined that the audio data is not from a file then either an exception can be thrown as the source of audio data is not recognised by the program or alternatively processing can return, as indicated by step 223 and the method again determines whether the incoming data corresponds to audio data or not. If the source of audio data is identified as being a file, e.g. by a user entering a command via a user interface of the messaging application, then at step 224 the messaging application reads the selected file and at step 226 the audio data from the file is streamed to the messaging application and processing proceeds to step 216.
At step 216, irrespective of the source of the audio data, the audio data is captured by a first of buffers 114 or 116. The size of each of buffers 114 and 116 can be configured via the messaging application 102 and in one embodiment each of buffers 114 and 116 can hold 512 bytes. The speech or audio data is stored in 16-bit pulse code modulation (PCM) data format. PCM is a sampling technique which can be used for digitizing analog audio signals. The signal can be sampled several thousand times a second and each sample can be represented by several bits. Each sample is a representation of the amplitude of the analog audio signal at the corresponding point in time.
If at step 228 it is determined that no further data is being received then process flow proceeds to step 230 at which the buffer in which the data has been stored is handed over to the audio processing module 118 for processing. If at step 228 it is determined that data is still being received then at step 232 it is determined whether the current buffer, e.g. buffer 1, is full. If not, then processing returns to step 216 and buffer 1 continues to receive audio data. When it is determined that the current buffer is full, then processing proceeds to step 234 at which the current buffer, buffer 1, is handed over to the audio processing module 118 and then at step 236 the other buffer 116, buffer 2, is swapped in for the first buffer and at step 216 the incoming audio data is captured in the second buffer 116. Hence, the messaging application continuously receives audio data and fills one of the buffers with the incoming audio data while the content of the other buffer is being processed by the audio processing module. As soon as the buffer receiving data is full, the buffers are swapped over and the audio processing module processes the data in the full buffer while the other buffer receives fresh audio data. This continues until the messaging application determines that no further audio data will be received, e.g. by receiving a call ended signal from the telephony module of the OS, or an expiry of a timeout, or an end of file, or an end of stream, at which time the messaging application can eventually terminate after processing any remaining audio data.
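A minimal sketch of this double-buffering loop in C follows. The read_audio() and process_buffer() functions are hypothetical stand-ins for the operating system audio feed and the audio processing module 118; they are not named in this document, and processing is shown sequentially for brevity, whereas in the device the two buffers would be filled and processed concurrently:

    #include <stddef.h>
    #include <stdint.h>

    #define BUFFER_BYTES 512
    #define SAMPLES_PER_BUFFER (BUFFER_BYTES / sizeof(int16_t))  /* 16-bit PCM */

    /* Hypothetical hooks standing in for the OS audio source and the
     * audio processing module. */
    extern size_t read_audio(int16_t *dst, size_t max_samples);  /* 0 = no more data */
    extern void   process_buffer(const int16_t *samples, size_t count);

    void capture_loop(void)
    {
        int16_t buffers[2][SAMPLES_PER_BUFFER];
        int     current = 0;   /* buffer currently being filled          */
        size_t  filled  = 0;   /* samples captured so far in that buffer */

        for (;;) {
            size_t got = read_audio(buffers[current] + filled,
                                    SAMPLES_PER_BUFFER - filled);
            if (got == 0) {                      /* end of call, file or stream */
                if (filled > 0)
                    process_buffer(buffers[current], filled);
                break;
            }
            filled += got;
            if (filled == SAMPLES_PER_BUFFER) {  /* buffer full: hand it over */
                process_buffer(buffers[current], SAMPLES_PER_BUFFER);
                current = 1 - current;           /* swap in the other buffer  */
                filled  = 0;
            }
        }
    }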
Figure 7 shows a high level flow chart illustrating the general steps of the audio data processing method 280. The audio data processing algorithm is pluggable and different audio data processing algorithms can be used dependent on the specific implementation of the invention. That shown in Figure 8 is by way of example. However, the audio data processing algorithms can share some general properties as illustrated in Figure 7. At step 282, audio data is obtained from a source, whether it be a file, a microphone, or telephony voice data. Then at step 284 some pre-processing of the data is done to clean up the data and/or to remove unreliable data and/or to put the data in a format more suitable for further analysis. Then at step 286, the audio data is analysed so as to determine what viseme the audio data is likely to correspond to. This can generally be accomplished by classifying the audio data, or determining what category the audio data falls into, wherein each class or category is associated with a viseme.
With reference to Figure 8, there is shown a process flowchart illustrating a first embodiment of an audio data processing method 240 corresponding generally to step 144 of the method 140 of the invention and to method 280 of Figure 7. At step 242, the audio processing module 118 accesses the audio data from the most recently filled buffer.
At step 244 noise data items are removed from the audio data. A configurable lower threshold is applied to the audio data and if the value of any data item is below the lower threshold then the value of that data item is set to zero. This helps to discard signals relating to silence or noise. The lower threshold value can be set so as to remove the bottom 10% of the amplitude range. For example, for 16-bit PCM, the maximum amplitude is 32,767. Therefore any data items with an amplitude less than 3,276 are discarded.
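A minimal sketch of this noise gate, using the 10% example just given; the function name is an assumption for illustration:

    #include <stdlib.h>   /* abs() */
    #include <stddef.h>
    #include <stdint.h>

    /* Zero out samples whose magnitude falls below the configurable lower
     * threshold, e.g. 3276 for 10% of the 16-bit PCM maximum of 32767. */
    void remove_noise(int16_t *samples, size_t count, int threshold)
    {
        for (size_t i = 0; i < count; i++) {
            if (abs(samples[i]) < threshold)
                samples[i] = 0;
        }
    }

    /* Example: remove_noise(buffer, num_samples, 3276); */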
Noise reduction can also be used later on in the process. For example, noise reduction can be used during the classification stage. If the difference between a signature of audio data and a pre-computed phoneme signature exceeds a threshold value, then the audio data can be considered to be undefinable noise rather than speech to be animated.
Then at step 246, it is determined whether the data requires amplifying, as quiet noises should be analysed in the same manner as louder noises. This step of the data processing helps to identify quiet noises which correspond to actual speech, such as whispering. A second threshold, higher than the lower threshold used for the preceding noise removal step, can be used. If the audio data amplitude is lower than the second threshold, then it can be regarded as a whisper which should be boosted. The second threshold can be set at 20% of the maximum value of the audio, which in the case of 16-bit PCM would be 6,553.
All the data in the buffer is multiplied by a configurable amount if the data falls between the first and second thresholds. This can be done by multiplying each sample by the highest possible amplitude divided by the average amplitude. The average amplitude is determined by keeping a running total of the average volume for each buffer. This can be computed using an average amplitude algorithm which is described below.
The thresholding can be applied on a sample by sample basis but in other embodiments the thresholding can be done over longer subsections of sample data. Further, in other embodiments a static threshold value is not used and instead an adaptive threshold is used. The value of the largest amplitude encountered thus far in the audio stream sample is maintained. The second threshold value is then calculated based on this running maximum. When the audio stream is being initially read, no previous analysis is available for the audio. Initially, default values of 10% and 20% of the absolute maximum amplitude can be used for the first and second thresholds. As each buffer is analysed, the maximum total for a buffer is stored. After at least a certain amount, e.g. 2 seconds, of non-silent buffered audio has been analysed, the new dynamic maximum value will be used to calculate the first and second thresholds (at 10% and 20% of the new dynamic maximum) instead of the absolute maximum value.
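A sketch of the boosting step under the assumptions above (noise threshold at 10% and whisper threshold at 20% of the running maximum amplitude); the helper name and the clamping are illustrative rather than prescribed by the text:

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Boost a quiet ("whisper") buffer whose average amplitude lies between
     * the noise threshold and the whisper threshold, by multiplying each
     * sample by the highest possible amplitude divided by the average. */
    void boost_if_whisper(int16_t *samples, size_t count, int running_max)
    {
        int noise_threshold   = running_max / 10;   /* 10% */
        int whisper_threshold = running_max / 5;    /* 20% */

        long total = 0;
        for (size_t i = 0; i < count; i++)
            total += abs(samples[i]);
        int average = count ? (int)(total / (long)count) : 0;

        if (average > noise_threshold && average < whisper_threshold && average > 0) {
            int factor = 32767 / average;
            for (size_t i = 0; i < count; i++) {
                int boosted = samples[i] * factor;
                if (boosted >  32767) boosted =  32767;  /* clamp to 16-bit range */
                if (boosted < -32768) boosted = -32768;
                samples[i] = (int16_t)boosted;
            }
        }
    }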
Then at step 248, the data set is normalised such that the amplitude with the largest peak corresponds to the maximum value of an arbitrary amplitude scale. Hence the data set from each buffer falls within the same range, making it easier to process the data. Further, each data item is represented as an integer value, thereby obviating the need to carry out floating point arithmetic operations and further increasing the speed of processing.
Normalisation can be important for a more flexible solution. Some samples will be louder than others. Situations can arise with lip-synching where a character whispers and, because the average volume is so low, the system picks the same mouth pose for the entire sample. Normalisation helps to adjust classification correctly, depending on the average values of the amplitude. The buffer is normalised to fit within a certain range, e.g. 0-255.
First the value of the largest amplitude is calculated by inspecting each value in the buffer. Each sample is then divided by the value of the largest amplitude and multiplied by a factor to fit into the predetermined range. This can be optimised by first computing the factor and finding appropriate bit shifts and additions to obtain a sufficiently close result.
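A minimal sketch of the normalisation step using integer arithmetic only; the choice of a signed output that preserves the sign of each sample (so that sign-based analysis such as zero crossings still works) and the function signature are assumptions for illustration:

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Rescale the buffer so that the largest magnitude maps to 255, keeping
     * the sign of each sample (values end up roughly in -255..+255). */
    void normalise_buffer(const int16_t *in, int16_t *out, size_t count)
    {
        int peak = 1;                          /* avoid division by zero */
        for (size_t i = 0; i < count; i++) {
            int mag = abs(in[i]);
            if (mag > peak) peak = mag;
        }
        for (size_t i = 0; i < count; i++)
            out[i] = (int16_t)((in[i] * 255) / peak);
    }

On a processor without fast division, the division by the peak could be replaced by the pre-computed factor and the shift-and-add approximation mentioned above.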
Then at step 250, a numerical value which is characteristic of the data in the buffer is determined so as to provide a quantitative characteristic or signature for the data in the buffer. Either a single algorithm or a number of different algorithms can be applied to the data to obtain a single or multiple quantitative values and those quantitative values can be used independently or in combination in order to derive an overall quantitative signature for the data set. Results from an amplitude averaging, a maxima pattern, a gradient and a zero crossings algorithm can be combined to produce a phoneme signature which is then compared with prestored phoneme signatures for classification. In some embodiments only one, or a combination of fewer than all, of the algorithms can be used. The algorithms used in step 250 will be described in greater detail below.
The method then carries out a number of operations to determine what viseme, if any, the quantitative signature corresponds to.
Prior to installation of the messaging application, quantitative signatures are derived for each viseme by processing empirical data. Five to seven visemes can be used and in one embodiment the invention uses five visemes for the "SH", "MM", "OH", "L" and silence sounds. In other embodiments the viseme corresponding to the "AH" sound can also be used. For each of these sounds, audio data for approximately five to ten different speakers of that sound is captured. The algorithms used at step 250 to derive the numerical signature are then applied to each of those sounds so as to provide a plurality of numerical signatures for each sound. Hence data items representing a plurality of numerical signatures 256 for each sound are stored in database 132.
Returning to Figure 8, at step 252 a search is initiated of the signatures stored in database 132 and at step 254 the quantitative signature for the current buffer is compared with a first one of the stored signatures and at step 258 it is determined whether the quantitative signature matches the current stored signature sufficiently closely. Hence the match does not need to be exact and a threshold can be used to determine whether the signatures match sufficiently closely. If at step 258 it is determined that the quantitative signatures do match then a viseme ID data item is looked up at step 260 so as to identify the corresponding viseme. The viseme ID is then passed to the animation engine at step 262. Hence the viseme for the current data buffer has been identified and process flow loops, as indicated by line 264, so that the audio processing module 118 is ready for processing the next full data buffer.
If at step 258 the quantitative signature for the current data buffer is not found to match the current stored quantitative signature then at step 266 a next stored signature is selected for comparison and processing loops as indicated by line 268. Steps 252, 254, 258 and 266 repeat until the quantitative signature for the current data buffer has been compared with all the stored quantitative signatures. If as a result of that process, no matching quantitative signature has been identified then at step 270, the animation engine can be notified either that the data buffer corresponds to a silence, or pause in speech, or the viseme ID for the previously identified viseme is passed to the animation engine so as to maintain the facial expression.
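A sketch of the matching loop, assuming the signature is held as a short fixed-length vector of integer properties and that closeness is judged by a summed absolute difference against a threshold; this representation is an assumption for illustration rather than a definitive implementation:

    #include <stdlib.h>
    #include <stddef.h>

    #define SIGNATURE_LEN 4   /* e.g. average amplitude, zero crossings,
                                 peak count, mean gradient              */

    typedef struct {
        int values[SIGNATURE_LEN];
        int viseme_id;        /* viseme associated with this stored signature */
    } StoredSignature;

    /* Return the viseme ID of the first stored signature that matches the
     * current buffer's signature closely enough, or -1 for no match
     * (treated as silence, or as "keep the previous viseme"). */
    int match_viseme(const int current[SIGNATURE_LEN],
                     const StoredSignature *stored, size_t num_stored,
                     int max_difference)
    {
        for (size_t s = 0; s < num_stored; s++) {
            int diff = 0;
            for (int k = 0; k < SIGNATURE_LEN; k++)
                diff += abs(current[k] - stored[s].values[k]);
            if (diff <= max_difference)
                return stored[s].viseme_id;
        }
        return -1;
    }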
Hence method 240 continues looping for as long as audio data is supplied to the audio processing module and supplies viseme IDs to the animation engine which uses the viseme ID information and skin information from the skin engine 130 in order to generate the animated character.
The various algorithms that can be used at step 250 of process 240 will now be described.
Given a set of audio PCM samples corresponding to sequential audio data, during processing each algorithm will analyse the audio in subsets corresponding to the data in one of the buffers. That is, buffer B will contain X samples of audio, each represented by sample(I). The sample is generally an integer between -255 and +255, although this depends on any normalisation applied and on the particular algorithms or CPUs used.
An average amplitudes algorithm can be used to determine the average amplitude of the audio sample data in a buffer. It can be calculated from the cumulative addition of each sample divided by the number of samples. It can be expressed in pseudo code as:

    total = 0;
    sampleCount = 0;
    for each sample I in buffer B, starting with I = 0:
        total = total + sample(I);
        sampleCount = sampleCount + 1;
    return aveAmp = total / sampleCount;

A zero crossing algorithm can be used to determine the number of times that the audio data, which can be visualised like a wave form, would cross over the horizontal time axis. It can be expressed in pseudo code as:

    crossings = 0;
    aboveZero = (sample(0) > 0);
    for each sample I in buffer B, starting with I = 0:
        if (sample(I) > 0 AND aboveZero is false) {
            aboveZero = true;
            crossings = crossings + 1;
        } else if (sample(I) < 0 AND aboveZero is true) {
            aboveZero = false;
            crossings = crossings + 1;
        }
    return crossings;

A gradients algorithm can be used to identify a list of deltas representing the change in amplitude between any two samples which are X samples apart. The list represents the individual amplitude changes over the buffer. The gradients are therefore represented by an array. Generally X is 1, so the gradients will indicate the difference between every sample in the buffer. This is not necessary, and for efficiency every 4th sample could be used, for example. It can be expressed in pseudo code as:

    gradientDeltas[sizeOfBuffer] = {0, 0};
    for each sample I in buffer B, starting with I = 1:
        gradientDeltas[I] = sample(I) - sample(I - 1);

A further algorithm, described next, determines the number of samples that are either a maximum or a minimum amplitude in a localised set of values, i.e. a peak or trough in the sample data.
A maxima pattern algorithm can be used to derive data which indicates the number of discrete peaks in amplitude in a given buffer. That is, the algorithm provides an indication of how volatile a set of amplitudes is with respect to relative maximum amplitudes in a localised subset of amplitudes. The normalised speech audio data has a general waveform shape, but is actually a set of discrete data items each having their own amplitude. For the purposes of the algorithm, a subset of the buffer is analysed for processing. The analysis compares individual amplitudes with neighbouring values of up to X samples in both left and right directions.
Due to this behaviour, any amplitudes processed that are close to the beginning or end of the wave form will not be able to be adequately compared against X values as these values would not exist. For example, the sample at position 0 would have no sample to its left. Thus, a subset window is taken from the buffer. Given that the algorithm will want to compare with up to X adjacent values in both directions, the window will comprise values at indices X to BUFFER_SIZE - X. Assuming X = 64 and BUFFER_SIZE = 512, this would be amplitudes at indices 64 to 448 of the total buffer of 512 samples.
Classification is done with an effective and efficient algorithm that computes a peak count in the signal waveform. This is done by checking the magnitude of a range of points R around a point P and incrementing a property counter each time a pair of points R equidistant from the point P have a lower magnitude than P. The point P is then incremented along the waveform, making this point the new point P, and the process is repeated for the entire buffer. This is expressed by the following pseudocode:

    for (j = 1; j < MAXPROPS; j++) P[j] = 0;   // empty property array
    for (i = MAXPROPS; i < BUFSIZE - MAXPROPS; i++) {
        for (j = 1; j < MAXPROPS; j++) {
            if (B[i] > B[i+j] && B[i] > B[i-j]) P[j]++;
        }
    }

The counter of peak values is stored in a separate property buffer of size PROP_BUFF_SIZE. The buffer is re-used for each iteration throughout processing. The property buffer size and the maximum size of the sampled buffer are important considerations. In this case PROP_BUFF_SIZE = 64 and MAX_BUFF_SIZE = 512. It should be noted that PROP_BUFF_SIZE << MAX_BUFF_SIZE / 2. This avoids any out of bounds memory access in the buffer array. For this reason, the algorithm begins processing the magnitude with the index = PROP_BUFF_SIZE and completes processing upon reaching the magnitude with the index of MAX_BUFF_SIZE - PROP_BUFF_SIZE.
The algorithm begins at element 64 (e64) of the buffer. E64 is compared to e63 and e65 (e64-1 and e64+1). If the value of e64 is higher in magnitude than both values, 1 is added to the property buffer at position 1 (p1). Otherwise p1 is ignored and the property buffer is left untouched. By way of example of a first iteration at e64, if e64 has a magnitude of 20 which is greater than the value at e63 but not greater than the value at e65, then no value is added to p1.
Next, e64 is compared to e62 and e66 (e64-2 and e64+2). If it is higher than these equidistant points then 1 is added to the property buffer at position p2 (as it is higher than 2 of its neighbours).
Eventually, e64 is compared to e1 and e128 (e64-64 and e64+64). After computation at point P at e64 is completed, the same computation is applied to the following amplitude (P = e65).
This is continued until element e is at index MAX_BUFF_SIZE - PROP_BUFF_SIZE (512 - 64 = 448).
After e64 has been processed, all values from e64 to e[BUFSIZE - 64] are processed using the same property buffer.
The result of this process contains the 'volatility classification' represented by the number of amplitude peaks in the input buffer. It also provides data for the ratio of volatility of a certain order in relation to volatility present in the wave of another order.
This is compared with previously trained property buffers which are stored as phoneme comparison data. A few phoneme comparison sets should be stored for each phoneme.
To identify a phoneme in the input buffer, the property buffer is first derived. It is then absolutely subtracted from all the property buffers representing a phoneme. This is applied to all available phoneme sets. So, for example, the phoneme 'aahh' will have 10 pre-computed property buffers for comparison with the current input property buffer. All values for each of the 10 will be computed by subtraction. If any of these total subtractions results in a difference smaller than a predefined threshold, a vote is added to the phoneme 'aahh'. After comparison with other phoneme sets, the phoneme with the most votes is selected as the classified phoneme to animate. Such voting may be weighted for minimal differences between the property buffer input and the phoneme sets.
Similarly, a threshold may be used to determine that the difference is so great that the sound is just undefinable noise.
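A sketch of the voting scheme just described, assuming ten trained property buffers per phoneme and the same absolute-difference measure; the type and function names are illustrative:

    #include <stdlib.h>
    #include <stddef.h>

    #define PROP_BUFF_SIZE   64
    #define SETS_PER_PHONEME 10

    typedef struct {
        int sets[SETS_PER_PHONEME][PROP_BUFF_SIZE]; /* trained property buffers */
        int viseme_id;                              /* viseme to animate        */
    } PhonemeModel;

    /* Vote for the phoneme whose trained property buffers most often lie
     * within vote_threshold of the input property buffer. Returns -1 when
     * no phoneme receives any votes, which the caller can treat as silence
     * or undefinable noise. */
    int classify_property_buffer(const int input[PROP_BUFF_SIZE],
                                 const PhonemeModel *models, size_t num_models,
                                 int vote_threshold)
    {
        int best_viseme = -1;
        int best_votes  = 0;

        for (size_t m = 0; m < num_models; m++) {
            int votes = 0;
            for (int s = 0; s < SETS_PER_PHONEME; s++) {
                int diff = 0;
                for (int i = 0; i < PROP_BUFF_SIZE; i++)
                    diff += abs(input[i] - models[m].sets[s][i]);
                if (diff < vote_threshold)
                    votes++;
            }
            if (votes > best_votes) {
                best_votes  = votes;
                best_viseme = models[m].viseme_id;
            }
        }
        return best_viseme;
    }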
With reference to Figure 9 there is shown a process flow chart illustrating another embodiment of an audio data processing method 290 which can be used in the invention.
A number of the steps in the method are similar to those described above with reference to Figure 8 and so will not be described in detail. The method is based on the visemes being determined by volume rather than on significant signal analysis. This approach is efficient while still providing acceptable animation. No pre-stored phoneme signatures are required in this approach.
Data is received from the latest full buffer at step 242 and then noise data is removed using a threshold at step 244. Then at step 292 the audio data is normalised.
Normalisation is done by assigning amplitude thresholds that map to visemes, purely based on the threshold value of the current audio being analysed. The average amplitude from the buffer is used to determine which threshold value is relevant.
In this normalisation technique, at least some or all of the audio data in the current audio sample being analysed is pre-processed before the thresholds are determined. In one embodiment, as these thresholds are not hard coded values but are instead dynamically based on percentages of the maximum value of the audio in the sample, they automatically provide for normalisation. That is, if the audio was generally soft, the maximum volume would be low and the thresholds for determining a viseme are adjusted accordingly to a lower value as they are a percentage of the maximum volume. Conversely, a louder sample would produce higher valued thresholds, and have buffered averages relatively louder to match to the respective threshold. Hence, because the thresholds are percentages of the maximum volume in a data set, there is some inherent normalisation.
Focussing on step 292, the audio data is pre-processed to determine these threshold values. In the case where the audio data is known before the lip-synching is required, the application can process all of the data prior to rendering the animation and prior to step 294. In the case of realtime lip-synching, which will be required, for example, for lip-synching an incoming call, the maximum value for the thresholds will be derived from a beginning subset of the speech data. This subset should not be too large, else the pre-processing will delay the forthcoming lip-synching. Thus sample values during an initial time period of about 500 milliseconds, or less, may be used to approximate the maximum volume in the forthcoming sample. This approach may be enhanced by keeping the thresholds dynamic throughout the realtime streaming scenario. In this case the maximum volume is constantly computed and the thresholds similarly adjusted.
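A sketch of how the running maximum and the percentage thresholds might be maintained; the structure and function names are assumptions for illustration:

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Track the loudest amplitude seen so far and derive thresholds from it
     * as percentages. Before enough audio has been seen (e.g. the first
     * 500 ms), max_amplitude can simply be seeded with 32767 so that the
     * default percentages of the absolute maximum apply. */
    typedef struct {
        int max_amplitude;
    } ThresholdState;

    void update_max(ThresholdState *st, const int16_t *samples, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            int mag = abs(samples[i]);
            if (mag > st->max_amplitude)
                st->max_amplitude = mag;
        }
    }

    /* Threshold for a given percentage of the current maximum, e.g. the
     * 12%, 26%, 47%, 58% and 80% bin boundaries used below. */
    int threshold_for(const ThresholdState *st, int percent)
    {
        return (st->max_amplitude * percent) / 100;
    }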
For the simpler algorithm just described, the dynamic normalisation process may be skipped by hardcoding threshold values for the most common audio expectation values.
This provides acceptable results, and avoids the possible inaccuracies just described, but at the risk of not being sufficiently dynamic to support as wide a range of audio samples as could be with normalisation.
At step 294 the average amplitude algorithm (as described above) is used to provide a measure of the average amplitude for the current buffer of audio data. Then at step 296 the average amplitude data is used to classify the buffer as corresponding to one of five visemes or to a silent sound, which has a viseme of a closed mouth. The average amplitude for a buffer is compared to 5 different bins having minimum and maximum bounds. The average amplitude percentage values are compared against thresholds to directly choose a viseme for each of 5 mouth shapes. For example, the thresholds between the different bins can be 12%, 26%, 47%, 58% and 80%. An average amplitude of less than 12% can be considered silence and is mapped to the closed mouth viseme. The particular viseme identified is then notified to the animation engine (step 262) and is used by the animation engine to animate the lip-synched lower portion of the character's face.
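A sketch of this amplitude-to-viseme mapping using the example thresholds above; the returned identifiers, and the assumption that the bins are ordered by increasing mouth openness, are illustrative only:

    /* Map the buffer's average amplitude, expressed as a percentage of the
     * maximum, onto the closed-mouth (silence) viseme or one of five mouth
     * shapes, using the example bin boundaries from the text. */
    int viseme_for_average(int average_percent)
    {
        if (average_percent < 12) return 0;  /* silence: closed mouth */
        if (average_percent < 26) return 1;
        if (average_percent < 47) return 2;
        if (average_percent < 58) return 3;
        if (average_percent < 80) return 4;
        return 5;
    }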
With reference to Figure 10, there is shown a process flowchart illustrating an animation method 500 corresponding generally to step 146 of general method 140 of the invention.
Animation method 500 begins at step 502 by the animation engine 120 looking up the current character context in the skin engine 130. The current character context includes a number of data items specifying the character to be used in the animation. The current character context identifies the character type, the version type and the mood type for the animated entity to be displayed. The character, version and mood can be selected by the user and, in the absence of user selection, default values can be used. Having determined the character context, at step 504 a list of bit maps for each of the upper right, upper left and lower parts of the character image is loaded into memory. For example, with reference to Figure 5, if the current context is character_1, version_1, mood_1 then, for each of the upper left, upper right and lower portions, a list of each of the visemes recognised by the system (five in the exemplary embodiment) together with the address of, or a pointer to, the bit map for each viseme is returned so that a look-up table can be provided.
At step 506 the animation engine receives the viseme ID data item 508 from the audio processing module 118. Then at step 510, the animation engine uses the viseme ID data item to look up in the lists in memory the address or location of the bit maps for each of the upper left, upper right and lower portions of the character. If, for the buffer having just been processed, it was determined that the buffer corresponded to silence then at step 512 a count C is incremented, otherwise the count is reset to zero. Processing then proceeds to step 514 at which it is determined whether the current count exceeds a threshold value indicating that the animated display has been idle for a significant period of time owing to silence in the audio data. If it is determined that the animation has been idle for a significant period of time then processing proceeds to step 516 at which a pre-determined idle animation routine is called and a fixed sequence of frames is displayed.
Processing then returns, as indicated by line 518, to determine whether a viseme ID has been received.
If it is determined that the idle threshold has not been exceeded, then processing proceeds to step 520 at which it is determined whether no viseme was identified for the buffer and whether a random flag has been set true. A user can select to set the random flag as being true or not, or alternatively a default setting can be used when the messaging application is first installed. If no viseme has been identified and the random flag is set at true then at step 522 a random animation routine can be called and a fixed sequence of images displayed on the screen. Processing then loops as indicated by line 518 to step 506 as described above.
If at step 524 it is determined that no viseme was identified but that a user input has been received then processing proceeds to step 526 at which a user selected animation routine is called. For example the user could select to cause the character to carry out a particular animated action, such as blinking, smiling, a change of mood or indeed a change of version of character type. After the user selected animation routine has been called at step 526 processing loops as indicated by line 518 and processing proceeds again as described above.
If a viseme ID has been identified which does not correspond to silence then processing proceeds from step 506 to step 528 and the animation engine passes control to each of the upper left, upper right and lower controllers. Each controller is passed the address or other identifier locating the bit map corresponding to the current viseme ID and copies its bit map to the off screen buffer 134. The separate bit maps are merged on the off screen buffer at step 530, using any masks associated with any of the image portions. Then at step 532 a configurable delay is introduced into the animation processing so as to allow for re-synchronisation with the audio being played by the device and at step 534 the frame of image data is transferred from the off screen buffer to the screen by bit blitting. Processing then loops as indicated by line 518 and the process repeats for the next viseme ID to provide an animated character which appears to speak the words of the audio output of the device.
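A sketch of the per-frame compositing step, assuming byte-per-pixel bitmaps, an illustrative screen size and a hypothetical blit_to_screen() platform call; none of these details are specified in the text:

    #include <stddef.h>
    #include <stdint.h>

    #define SCREEN_W 176
    #define SCREEN_H 208   /* illustrative screen dimensions only */

    /* Copy one face-segment bitmap into the off screen buffer at the
     * segment's origin, honouring an optional mask (non-zero mask byte
     * means "draw this pixel"). The caller is assumed to pass coordinates
     * that keep the segment within the screen bounds. */
    void compose_segment(uint8_t *offscreen,
                         const uint8_t *segment, const uint8_t *mask,
                         int x, int y, int w, int h)
    {
        for (int row = 0; row < h; row++) {
            for (int col = 0; col < w; col++) {
                size_t src = (size_t)row * w + col;
                size_t dst = (size_t)(y + row) * SCREEN_W + (x + col);
                if (mask == NULL || mask[src] != 0)
                    offscreen[dst] = segment[src];
            }
        }
    }

    /* Hypothetical platform call transferring the finished frame to screen. */
    extern void blit_to_screen(const uint8_t *offscreen, size_t bytes);

In the method described, each of the three controllers would perform a copy of this kind for its own portion, the configurable delay would then re-synchronise the frame with the audio, and the whole off screen buffer would be transferred to the display in a single blit.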
With reference to Figure 11, there is shown a communications environment 550 including a communication device 552 according to the invention on which the messaging application 102 can be or has been installed. There are a number of mechanisms by which the messaging application and/or new skins or characters can be downloaded or installed on communication device 552. A web server 554 can have access to a storage device 556 storing versions of the messaging application and/or groups of image bit maps providing a new skin for use with the messaging application. In order to install the messaging application or a new skin on device 552, a user of device 552 can download the messaging application or skin over the Internet using a web browser running on a computing device connected to the Internet. The user can then transfer the messaging application and/or skin from their computer to the communication device 552 via a hard wire or wireless communication channel, such as Bluetooth or infrared. Alternatively, if the communication device 552 is Internet enabled, then the communication device 552 can download the messaging application and/or skins via a web server 558 of the cellular network of which the user is a subscriber. The user can download the messaging application and/or skins over GPRS or other third generation or later generations of communications protocols.
In use, if a caller uses a calling communications device 560 to call the user of communication device 552 then the telephone call is routed via the cellular network 562 to the user's communication device 552. If the user 552 takes the call and selects to use the animated messaging, then the user can listen to the caller's conversation on their device while also watching an animated character speak the message in synchrony and real time. Alternatively, if the user elects not to take the call then the caller may leave a voice message and when the user retrieves the voice message then the user can again opt to take the voice message by listening to the message and also viewing the animated character speaking the message. The animated character can also be used in conjunction with stored sound files, for example to sing a song or deliver other speech.
Other improvements and modifications to the basic embodiment described above are envisaged. For example a text to speech module could be included in messaging application 102 so that an e-mail or SMS or MMS message received by the communication device is converted into speech data. The speech data can then be supplied to the messaging application in order to drive the animated character and also to the audio output of the communication device so that the user can listen to the message converted into speech and also view the animated character. Similarly, an MMS message to speech conversion module can be included.
In other embodiments, an instant messenger type service can be provided in which the animated entity provides an avatar for the person sending the message on the recipient's communication device and in which the avatar is animated to speak the content of the instant message which is converted from a text format into a speech format for audio output. In another embodiment of the instant messaging avatar, rather than receiving a text message which is converted into speech, the received speech from a caller is used to provide the audio output and also to drive the animated avatar of the caller on the recipient's communication device.
The invention can also be used in an application to generate and deliver animated recorded or captured audio and character animation. This application of the invention makes use of the lip-synching and animation generation with recorded audio.
This application is based on the methodology described above. However, instead of displaying the animation in real time to the screen, the images and audio are captured as multimedia animation data. This multimedia data may be captured while the real-time animation is being displayed to the user, or the capture can be performed as a background process at the instruction of the user. Hence, the user can record an audio message, an animated character apparently speaking the message is generated, and the animated character and audio data are stored and bundled for transmission and replay as a multimedia data file.
The multimedia data file can then be sent as an animated, playable message to other devices.
The most common form this multimedia data will take is that of a video file. In the implementation for mobile phone devices, presently the most widely supported video format is 3GPP. Therefore, the implementation will, instead of or in addition to animating the audio in real time, generate a 3GPP video file of the lip-synched animated sequence. The capture will also include any non-lip-synched animations that may be driven by user input or randomly generated to provide a more captivating result, e.g. the character may blink or wink, or follow a predetermined animation sequence. Once the video is stored in 3GPP format on the phone it may be viewed as a video on the phone or on any other device to which it is transferred. As an additional feature, this embodiment of the invention also allows the user to automatically package the 3GPP video file into an MMS message and send the MMS message to another MMS receiving device via traditional MMS send/receive channels and methods. In this scenario a user may generate an animated, lip-synched MMS message from an audio file, microphone or phone conversation and send the message to another device.
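One way to produce such a file offline is to hand the captured frames and recorded audio to an external encoder. The sketch below assumes an ffmpeg build with H.263 and AMR-NB support is available on the authoring device; the codec, frame rate and QCIF size are illustrative choices of the example, not details taken from the description.

    import subprocess

    def encode_3gpp(frame_pattern, audio_wav, out_path="message.3gp"):
        """Encode captured animation frames plus the recorded audio into a 3GPP
        video file suitable for MMS delivery."""
        subprocess.run([
            "ffmpeg", "-y",
            "-framerate", "15", "-i", frame_pattern,   # e.g. "frames/%04d.png"
            "-i", audio_wav,
            "-s", "176x144",                           # QCIF, a size H.263 supports
            "-c:v", "h263",
            "-c:a", "libopencore_amrnb", "-ar", "8000", "-ac", "1",
            out_path,
        ], check=True)
        return out_path

The resulting 3GPP file can then be attached to an MMS message and sent through the usual MMS channels.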
As explained above, new skins, corresponding to new characters, new versions of characters or new moods of characters, can be supplied to the communication device for download over the communication network or via the Internet. In order to reduce the amount of data downloaded to the communication device, in one embodiment only one of the upper left and upper right portions of the face, together with the lower portion of the face, is compressed into a zip file before transmission. The received file is then decompressed and the other upper image can be generated as the mirror image of the received image.
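A minimal sketch of such an unpacking step is shown below, assuming a ZIP archive whose entries are PNG bit maps named by portion; the naming scheme is invented for the example.

    import io
    import zipfile
    from PIL import Image, ImageOps

    def unpack_skin(zip_path):
        """Rebuild a full skin from a downloaded archive that carries only one of
        the two upper portions; the missing upper portion is regenerated as a
        mirror image to keep the download small."""
        skin = {}
        with zipfile.ZipFile(zip_path) as archive:
            for name in archive.namelist():                  # e.g. "upper_left_03.png"
                image = Image.open(io.BytesIO(archive.read(name))).convert("RGBA")
                skin[name] = image
                if name.startswith("upper_left"):
                    skin[name.replace("upper_left", "upper_right")] = ImageOps.mirror(image)
        return skin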
The messaging application also includes a user interface providing a number of options by which the user can configure the messaging application. For example, the user can select the current skin to use. The user can select to use different skins depending on the origin of the audio data, e.g. a call, the microphone or a file. A user's contact list can be integrated with the skinning engine such that a different skin is associated with different incoming callers; when an incoming call is detected, the system automatically selects the skin associated with the calling number and displays the animated character with the appropriate skin. The user can also enter a command to set the current mood of the current skin and also to initiate random animations or prompt the displayed character to carry out a particular animation, such as frowning, blinking or making some other facial expression.
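For illustration, the skin selection for an incoming call could be as simple as a lookup keyed on the calling number; the numbers and skin names below are placeholders, and in the application the mapping would be populated from the user's contact list rather than hard coded.

    DEFAULT_SKIN = "default"

    # Placeholder association of callers with skins, normally built from the
    # user's contact list rather than written out by hand.
    SKIN_BY_CALLER = {
        "+441632960001": "robot",
        "+441632960002": "puppy",
    }

    def skin_for_incoming_call(calling_number, skin_by_caller=SKIN_BY_CALLER):
        """Return the skin to display for an incoming call, falling back to the
        user's currently selected skin when the caller is not recognised."""
        return skin_by_caller.get(calling_number, DEFAULT_SKIN)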
Hence the above described invention provides a more immersive messaging environment in which a user receives both audio and visual output and in which the animated character appears to speak the message to the user.
In practice, the method of the invention will be implemented by a suitable software program executed by a data processing device forming part of the hardware of the communication device 552. Coding of suitable software will be apparent to a person of ordinary skill in the art in view of the preceding description of the present invention.
Hence, computer program code and computer program products embodying that software are also aspects of the invention.
Generally, embodiments of the present invention, and in particular the processes involved in creating and displaying an animated audio message, employ various processes involving data stored in or transferred through one or more computer systems.
Embodiments of the present invention also relate to an apparatus for performing these operations. This apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer selectively activated or reconfigured by a computer program and/or data structure stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines will appear from the description given below.
In addition, embodiments of the present invention relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media; semiconductor memory devices; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The data and program instructions of this invention may also be embodied on a carrier wave or other transport medium. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
It will be appreciated that the flowcharts are by way of illustration only of the data processing operations carried out, and that the invention should not be considered to be limited to the specific steps or operations illustrated in the flowcharts. For example, some of the operations illustrated in the flowcharts may be decomposed into sub-steps and/or those operations may be combined into more general operations. Further, the sequence of those operations may not necessarily be that shown in the flowcharts and some of the operations may be carried out in parallel or in other sequences.

Claims (20)

CLAIMS:
1. A data processing method for creating an animated entity to be displayed on a communication device while outputting a message in audio form on the communication device, the method comprising: processing data representing a plurality of words in the message to identify a plurality of visemes corresponding to the words in the message; and using the identified visemes to create an animation of the entity such that the animated entity will appear to speak the message as the message is output as audio.
2. A method as claimed in claim 1, wherein an image of the entity is comprised of a plurality of portions.
3. A method as claimed in claim 2, wherein a one of the plurality of portions includes a mouth of the entity and wherein the entity is animated by selecting images of the mouth of the entity based on visemes associated with the words currently being spoken by the entity.
4. A method as claimed in claim 2 or 3, and further comprising retrieving an image for each portion of the image of the entity from a database of stored images.
5. A method as claimed in claim 4, wherein the images stored in the database are associated with different entity characters.
6. A method as claimed in claim 4 or 5, wherein the images stored in the database are associated with different moods of the entity.
7. A method as claimed in any preceding claim, and further comprising alternately storing audio data in one of a first buffer and a second buffer and passing audio data from the other of the first and second buffer for processing.
8. A method as claimed in any preceding claim, and further comprising preprocessing the audio data.
9. A method as claimed in any preceding claim, and further comprising determining a property of a set of audio data items and classifying the property in order to identify a viseme associated with the set of audio data items.
10. A method as claimed in claim 9, wherein the property of the set of audio data items is the average amplitude of a set of audio samples.
11. A method as claimed in claim 9, and further comprising comparing the property with the same property determined for audio data corresponding to a plurality of different visemes to identify the viseme associated with the set of audio data items.
12. A method as claimed in claim 10 or 11, wherein the property is selected from the group comprising: an average amplitude of the set of audio data items; a number of zero crossings of the set of audio data items; gradients of the set of audio data items; the number of discrete peaks of the set of audio data items; and combinations thereof.
13. A method as claimed in any preceding claim, and further comprising writing image data for a next frame in the animation of the entity to an off screen buffer.
14. A method as claimed in any preceding claim, and further comprising displaying the animated entity and playing the audio.
15. A method as claimed in any preceding claim, and further comprising transmitting the created animation and audio to a communication device.
16. A method as claimed in any preceding claim, wherein the communication device is a cellular telephony device.
17. A communication device, the communication device including a data processor and computer program code configuring the data processor to create an animated entity for display while outputting a message as audio, the computer program code configuring the data processor to: process received audio data representing a plurality of words in the message to identify a plurality of visemes corresponding to the words in the message; and use the identified visemes to create an animation of the entity such that the animated entity will appear to speak the message as the message is output as audio.
18. Computer program code executable by a data processing device to provide the method of any of claims 1 to 16 or the communication device of claim 17.
19. A computer program product comprising a computer readable medium bearing computer program code as claimed in claim 18.
20. A data processing method for creating an animated entity on a communication device to output a message in audio form substantially as hereinbefore described.
GB0504383A 2005-03-03 2005-03-03 Animated messaging Pending GB2423905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0504383A GB2423905A (en) 2005-03-03 2005-03-03 Animated messaging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0504383A GB2423905A (en) 2005-03-03 2005-03-03 Animated messaging

Publications (2)

Publication Number Publication Date
GB0504383D0 GB0504383D0 (en) 2005-04-06
GB2423905A true GB2423905A (en) 2006-09-06

Family

ID=34430555

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0504383A Pending GB2423905A (en) 2005-03-03 2005-03-03 Animated messaging

Country Status (1)

Country Link
GB (1) GB2423905A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6839672B1 (en) * 1998-01-30 2005-01-04 At&T Corp. Integration of talking heads and text-to-speech synthesizers for visual TTS
US6366885B1 (en) * 1999-08-27 2002-04-02 International Business Machines Corporation Speech driven lip synthesis using viseme based hidden markov models
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
WO2001075805A1 (en) * 2000-03-31 2001-10-11 Telecom Italia Lab S.P.A. Method of animating a synthesised model of a human face driven by an acoustic signal
US20040068408A1 (en) * 2002-10-07 2004-04-08 Qian Richard J. Generating animation from visual and audio input
WO2004100128A1 (en) * 2003-04-18 2004-11-18 Unisay Sdn. Bhd. System for generating a timed phomeme and visem list

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008031955A2 (en) * 2006-09-15 2008-03-20 La Cantoche Production, S.A. Method and system for animating an avatar in real time using the voice of a speaker
FR2906056A1 (en) * 2006-09-15 2008-03-21 Cantoche Production Sa METHOD AND SYSTEM FOR ANIMATING A REAL-TIME AVATAR FROM THE VOICE OF AN INTERLOCUTOR
WO2008031955A3 (en) * 2006-09-15 2008-06-05 Cantoche Production S A Method and system for animating an avatar in real time using the voice of a speaker
WO2008087621A1 (en) * 2007-01-16 2008-07-24 Mobilesolid Ltd. An apparatus and method for animating emotionally driven virtual objects
WO2014190178A3 (en) * 2013-05-22 2015-02-26 Alibaba Group Holding Limited Method, user terminal and server for information exchange communications
DE102019001775B4 (en) 2018-06-22 2024-07-18 Adobe Inc. Using machine learning models to determine mouth movements according to live speech

Also Published As

Publication number Publication date
GB0504383D0 (en) 2005-04-06

Similar Documents

Publication Publication Date Title
US20180182415A1 (en) Augmented multi-tier classifier for multi-modal voice activity detection
US9412371B2 (en) Visualization interface of continuous waveform multi-speaker identification
US6766299B1 (en) Speech-controlled animation system
US7738637B2 (en) Interactive voice message retrieval
US8125485B2 (en) Animating speech of an avatar representing a participant in a mobile communication
US8582804B2 (en) Method of facial image reproduction and related device
US20050159958A1 (en) Image processing apparatus, method and program
US20220108510A1 (en) Real-time generation of speech animation
KR20070020252A (en) Method of and system for modifying messages
WO2008087621A1 (en) An apparatus and method for animating emotionally driven virtual objects
CN111787986B (en) Speech effect based on facial expression
CN107623830B (en) A kind of video call method and electronic equipment
GB2423905A (en) Animated messaging
CN110990534A (en) Data processing method and device and data processing device
US9942389B2 (en) Indicating the current demeanor of a called user to a calling user
WO2007076279A2 (en) Method for classifying speech data
TWI377559B (en) Singing system with situation sound effect and method thereof
JP5164911B2 (en) Avatar generating apparatus, method and program
CN108334806B (en) Image processing method and device and electronic equipment
CN111986690A (en) Voice noise reduction method and device for video
CN112820307B (en) Voice message processing method, device, equipment and medium
CN113066513B (en) Voice data processing method and device, electronic equipment and storage medium
JP4294502B2 (en) Telephone terminal and image generation method
US20240146876A1 (en) Audio visualization
CN113873324A (en) Audio processing method, device, storage medium and equipment