GB2468140A - A character animation tool which associates stress values with the locations of vowels - Google Patents

A character animation tool which associates stress values with the locations of vowels

Info

Publication number
GB2468140A
GB2468140A (application GB0903270A)
Authority
GB
United Kingdom
Prior art keywords
vowel
speech
character
stress
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0903270A
Other versions
GB0903270D0 (en)
Inventor
Charles Cullen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dublin Institute of Technology
Original Assignee
Dublin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dublin Institute of Technology filed Critical Dublin Institute of Technology
Priority to GB0903270A priority Critical patent/GB2468140A/en
Publication of GB0903270D0 publication Critical patent/GB0903270D0/en
Priority to PCT/EP2010/052445 priority patent/WO2010097452A1/en
Publication of GB2468140A publication Critical patent/GB2468140A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Character animation is key to successful and engaging computer animation, whether for media such as movies or for computer games. A known difficulty in animation is linking the motions of an animated character with spoken words. The present application solves this problem by detecting the locations of vowels in a piece of speech, determining a stress value for each detected vowel, and then animating the character at the vowel locations in a manner consistent with the determined stress values. The locations of vowels are used as a trigger for a character's motion.

Description

A CHARACTER ANIMATION TOOL
Field
The present application is directed to the field of computer animation, in particular to software tools and production workflow solutions for computer animation.
Background
Character animation is key to successful and engaging computer animation, whether for media such as movies or for computer games. A known difficulty in animation is linking the motions of an animated character with spoken words. Software is known that animates a character's mouth in response to a speech signal, as a result of which an animated character appears to utter the words. Whilst this is useful, the results tend to be regarded by viewers as unnatural. Other techniques have been employed which attempt to process speech to match an animated character's mouth. Again, the results tend to be unnatural.
Some systems have investigated the role of more extensive face and body movements (notably the MIT BEAT prototype). The BEAT system performs linguistic analysis of synthesized text-to-speech (TTS) audio output in an attempt to predict the formal structure of the associated gestures and movements. However, the benefits of this system are limited and artificial insofar as the system only operates on synthesized speech.
Summary
To date, however, none of the prior art has considered the overarching importance of speech rhythm in relation to these gestures and movements, and none has considered the prioritization of speech events in relation to their prominence within the signal. In this regard, it has been identified by the inventor that the prior art methods, whilst somewhat effective, are lacking. In particular, the inventor has appreciated that in human communication linguistic content only accounts for about 7%, with the acoustic properties of speech (rhythm and prosody) accounting for a further 38% or so. Moreover, he has appreciated that the majority of human communication relies on subtle movements and more expansive gestures that comprise 55% of our interactions. The present application focuses on providing these subtle movements and more expansive gestures and relies upon the rhythm and prosody of the speech signal rather than the linguistic content (as with current speech recognition and lip-synching algorithms) to provide a simple system for assigning these movements and gestures to an animated character. Thus the technology places the emphasis of animation on the same criteria that humans use in communication. The approach of the "stress tagging animation" technique may be compared with human-operated characters such as the Muppets, which adopt a similar approach of concentrating on the rhythms of hand/head movements rather than lip-synching accuracy. The system presented herein, by providing a simple list of events prioritized by rhythm and prosody, allows developers to easily match speech with movements, in contrast to most animations, which are built from scratch. With "stress tagging", content may be re-used and characters and voices easily changed, as the timing and priority of animation events resides with the speech signal. This allows for automated tools to be provided that allocate animation to speech events, rather than the converse. In addition, as the "stress tagging framework" focuses on acoustic attributes, it is completely language independent. Thus, tools are envisaged where a particular character may be developed to respond to "stress tags", allowing it to be re-purposed in any language desired as often as needed.
The present application employs a pre-defined library of movements and gestures for several distinct characters (as examples), which may be quickly allocated to the prioritized speech events on a manual, semi-automatic or fully automated basis as required by the production.
A first embodiment provides a speech analysis system for assisting in the animation of at least one character in response to a piece of speech. The system comprises a memory for storing the piece of speech and a vowel locator for identifying the locations of vowels within the piece of speech. A vowel stress detector identifies the degree of stress associated with each identified vowel and stores the associated degree of stress for each location.
The vowel locator may determine the duration of each vowel. Suitably, the piece of speech is stored in a database in the memory. The database may store timestamps indicating locations of vowels and durations of vowels and/or a stress value for each vowel.
The vowel stress detector may score at least one characteristic of each vowel against a reference value for the characteristic. This reference value may be determined by averaging the characteristic over a windowed section of the piece of speech. The windowed section may comprise the entire piece of speech. The characteristic may comprise one or more of the following: a) pitch, b) intensity, c) duration, d) voice quality, e) jitter, and f) voice breaks.
Preferably, the at least one characteristic comprises the following characteristics: a) pitch, b) intensity and c) duration.
The animation tool may provide a character animation feature employing the locations of vowels as a trigger for a character's motion. The motion of a character selected at a particular location may be determined with reference to the degree of stress for that location. The motion of the character may be automatically selected based upon the degree of stress. The animation tool may allow an animator to select a particular motion from a list presented, suitably where the list is populated with possible character motions based upon the degree of stress. The list may be presented for each vowel location, allowing an animator to select an animated character's motion at each vowel location.
Description of Drawings
Figure 1 is a block diagram of an exemplary system according to the present application, Figure 2 is a flow chart for exemplary methods according to the present application, and Figure 3 is a graphical user interface for use with the system or method of Figures 1 or 2.
Detailed Description
The present invention will now be described with reference to some exemplary methods and systems, in which speech data is provided to a voice analysis system 2 which in turn analyses the speech data to identify the locations of vowels and the corresponding stress levels of those vowels. The inputted speech is desirably monophonic in nature. The speech 1 may be directly inputted, for example by means of a microphone. Alternatively, a pre-recorded piece of speech may be employed. A database 3 stored in local memory or external memory may be employed to store different items of speech content. It will be appreciated that such a database may be readily constructed by one skilled in the art. In addition to storing the items of speech content, the database may store the results of analysis performed upon the items of speech content by a vowel locator engine 4 and a vowel stress detector 5, the operation of which will be explained in greater detail below. The voice analysis system may be any general purpose computer, including those operating under the Windows™, Macintosh™ or Linux™ operating systems. The analysis stage of the system may be performed by any suitable set of DSP audio analysis algorithms, such as those provided within MATLAB™ as provided by The MathWorks, Inc., Natick, USA, the specific speech software Praat (Boersma, Paul & Weenink, David (2009). Praat: doing phonetics by computer (Version 5.1) [Computer program]. Retrieved January 31, 2009, from http://www.praat.org/), or purpose-built SDKs such as MS Speech. The animation tool may be implemented by any suitably configured animation engine, such as Adobe Flash in AS3.
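By way of illustration only, the following sketch shows one way in which the prosodic measurements discussed herein (pitch, intensity and duration) might be obtained programmatically; it assumes the Python parselmouth bridge to Praat, and the file name and function names are illustrative assumptions rather than part of the system described.

```python
# Illustrative sketch (assumption): extracting pitch, intensity and duration
# via parselmouth, a Python interface to Praat. The file name is hypothetical.
import parselmouth

def prosodic_summary(wav_path):
    sound = parselmouth.Sound(wav_path)    # load the (monophonic) speech piece
    pitch = sound.to_pitch()               # Praat pitch analysis
    intensity = sound.to_intensity()       # Praat intensity analysis
    return {
        "duration_s": sound.duration,
        "pitch_hz": pitch.selected_array["frequency"],   # 0 where unvoiced
        "intensity_db": intensity.values[0],
    }

summary = prosodic_summary("speech_clip.wav")
```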
Once the voice analysis system has performed an analysis, the system can provide vowel stress information 6 to an animation tool 7. The manner and mode of use of the vowel stress information by the animation tool is explained below.
The animation tool 7 may operate on the same computing system as the voice analysis system 2 or operate on a separate computing system. Similarly, the animation tool may be provided within the same software program as the vowel locator and vowel stress detectors or separate programs may be employed for each.
The mode and manner of operation of the system 2 and animation tool 7 will now be explained with reference to some exemplary modes of operation, shown in Figure 2, in which the analysis steps 20 are shown separate to the animation steps 23.
The method commences with a recorded piece of speech content which is to be used with an animated character. The piece of speech may be a single item, e.g. a sentence, or it may comprise an entire vocabulary for the character, in which different phrases are combined into an overall speech recording.
This overall speech recording may be used for example as a library of speech from which different pieces may be retrieved as required.
A preliminary step in the method, where necessary, may be employed to convert the piece of speech from stereo to monophonic speech. It will become apparent that, whilst stereo speech may be employed by analysing the left and right channels, for the present purposes it is simpler and more efficient to use a monophonic form of speech. A variety of techniques are known for creating a monophonic signal from a stereo signal, including the abandonment of one channel or the simple addition of the two channels.
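A minimal sketch of the two conversion approaches just mentioned, assuming the speech is held as a NumPy array of samples, is given below; the function and parameter names are illustrative only.

```python
# Illustrative sketch: stereo-to-mono conversion by either dropping one channel
# or combining both, assuming `stereo` is an array of shape (n_samples, 2).
import numpy as np

def to_mono(stereo, method="combine"):
    if method == "drop":
        return stereo[:, 0]          # abandon the right channel, keep the left
    # average the two channels (a scaled addition) to avoid clipping
    return stereo.mean(axis=1)
```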
The monophonic speech piece is then passed through a vowel detector which is employed to detect the positions of vowels in the piece of speech. Where a vowel is located, its position is marked with a time stamp. Each time stamp suitably identifies the location and duration of the associated vowel. The piece of speech and the associated time stamps may be stored together in the database. Vowel detection techniques are well known in the art. One exemplary technique would employ a simple intensity derivative detector, which takes the differential of the input wave to obtain maxima (vowel peaks). The vowel analysis may, for example, be performed using the FFT algorithm provided as part of the Flash AS3 core sound classes available from Adobe Systems Incorporated.
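One possible realisation of such an intensity-based detector is sketched below; it computes a short-time intensity envelope and takes its maxima as candidate vowel peaks. The frame size, threshold and use of a generic peak finder are assumptions for illustration, not the specific detector employed.

```python
# Illustrative sketch of a simple intensity-based vowel peak detector.
# Frame size and threshold are arbitrary assumptions.
import numpy as np
from scipy.signal import find_peaks

def locate_vowel_peaks(mono, sample_rate, frame_ms=10):
    frame = int(sample_rate * frame_ms / 1000)
    n_frames = len(mono) // frame
    # short-time RMS intensity envelope
    env = np.array([
        np.sqrt(np.mean(mono[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    # maxima of the envelope are taken as candidate vowel peaks
    peaks, _ = find_peaks(env, height=env.mean())
    return [p * frame_ms / 1000.0 for p in peaks]   # timestamps in seconds
```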
Each vowel is analysed for a number of prosodic characteristics. Each prosodic characteristic is then compared with an overall mean for the particular prosodic characteristic for the entire speech clip.
Exemplary prosodic characteristics which are employed include pitch, intensity and duration. These characteristics have been identified as being particularly important prosodic attributes in human speech.
Other characteristics that may be employed would include, for example but not limited to, voice quality, jitter and voice breaks.
The exemplary method described herein uses a simple scoring system and applies it to the characteristics of each vowel. This scoring system ignores interrelationships between characteristics and treats individual characteristics separately and evenly, i.e. each characteristic is scored identically. It will be appreciated that the scoring system may however be adapted to include a weighted scoring formula.
In the exemplary method, however, the individual characteristics (pitch, intensity and duration) of each vowel are compared with the means for the piece of speech as a whole. Where one characteristic for a vowel has a value which exceeds the average, the vowel receives a score of 1; where two characteristics exceed their mean values, the vowel receives a score of 2, and so on. Thus, where the pitch of a vowel is above the average pitch for the piece of speech, the duration of the vowel exceeds the average duration and the intensity exceeds the average intensity, the vowel would receive a score of 3.
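The unweighted scoring just described can be expressed compactly as follows; the sketch assumes each located vowel already carries measured pitch, intensity and duration values, and the field names are hypothetical.

```python
# Illustrative sketch of the 0-3 scoring scheme: one point for each prosodic
# characteristic that exceeds the mean for the whole piece of speech.
CHARACTERISTICS = ("pitch", "intensity", "duration")

def score_vowels(vowels):
    means = {c: sum(v[c] for v in vowels) / len(vowels) for c in CHARACTERISTICS}
    for v in vowels:
        v["score"] = sum(1 for c in CHARACTERISTICS if v[c] > means[c])
    return vowels
```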
This score is stored with the timestamp for the vowel in the database. As a result, the speech, vowel locations and importance (score) of each vowel location are stored or related together within the database.
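The application does not prescribe a particular database layout; the table below is merely one plausible shape, given as an assumption, for relating a speech item to its vowel timestamps, durations and stress scores.

```python
# Illustrative sketch (assumption): an SQLite table relating each piece of
# speech to its vowel timestamps, durations and stress scores.
import sqlite3

conn = sqlite3.connect("speech_analysis.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS vowel_events (
        speech_id   TEXT,     -- identifies the stored piece of speech
        start_s     REAL,     -- timestamp of the vowel
        duration_s  REAL,     -- duration of the vowel
        score       INTEGER   -- stress score (0 to 3 in the exemplary scheme)
    )
""")
conn.commit()
```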
The values stored in the database may then be employed with a character animation tool by automatically or semi-automatically linking gestures to the locations of the time-stamped vowels. In particular, the analysis tool may export an XML file for a piece of speech to the animation tool in which the speech is embedded along with information identifying the locations and scores of vowels.
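The exact XML schema is not specified herein; the fragment below is a hypothetical example of the kind of export the analysis tool might produce, with element and attribute names invented for illustration.

```python
# Illustrative sketch: exporting vowel locations and scores as XML for the
# animation tool. Element and attribute names are hypothetical.
import xml.etree.ElementTree as ET

def export_stress_tags(speech_file, vowels, out_path):
    root = ET.Element("speech", attrib={"file": speech_file})
    for v in vowels:
        ET.SubElement(root, "vowel", attrib={
            "start": f"{v['start_s']:.3f}",
            "duration": f"{v['duration_s']:.3f}",
            "score": str(v["score"]),
        })
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
```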
Character animation tools are well known in the art and the techniques employed would be readily familiar to the skilled person. One common technique is the use of games physics to animate characters based on particular inputs as provided, for example, by an animator. These inputs are converted into motion of the character on the screen. The advantage of these animation tools is that the animator does not have to specify the precise movements for a character between frames. Instead, for example, the start and end points might be detailed over a particular time span and the animation tool, using appropriate mathematics, can effectively interpolate the character's movements for each frame between the start and end points.
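By way of a simplified example of such interpolation, and purely as an assumption about how an engine might fill in the intermediate frames, a linear tween between two keyframe values could be computed as follows.

```python
# Illustrative sketch: linear interpolation ("tweening") of a character
# parameter between start and end keyframes. Frame rate is an assumption.
def tween(start_value, end_value, start_s, end_s, fps=25):
    n_frames = max(1, int(round((end_s - start_s) * fps)))
    return [
        start_value + (end_value - start_value) * i / n_frames
        for i in range(n_frames + 1)
    ]

# e.g. raise a hand from rest (0.0) to raised (1.0) over a 0.2 s stressed vowel
hand_positions = tween(0.0, 1.0, 10.0, 10.2)
```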
The present system employs such a tool and provides the timestamp and scoring data with the speech data to the character animation tool. The character animation tool employs the scoring data as an input at each identified time stamp. In an automated mode of operation, different scores may be associated with different character actions or character features. For example, a score of one might be associated with a character winking, whereas a score of two might be associated with movement of the hands and a score of three might be associated with head movement. The character's action is timed to occur at the timestamp and for the duration of the timestamp. An exemplary screen shot from an animation tool using the present methods is shown in Figure 3, in which a section of speech content is represented along an abbreviated time line 64. The section of speech is selectable from the entire piece of speech content, which is represented at a smaller scale (graphical section 53). One or more slider features 55a, 55b allow a user to select a section of speech from an overall time line 52 for the speech content. Other features, including for example a moving window, allow a user to select the region of the speech content to be represented by the abbreviated time line. The vowel stress information is represented by dots 57 for the complete item of speech and by diamonds 60, 58, 56, 54 in a separate region 50 for the abbreviated time line.
The character to be animated is represented in a character region 62 above the time line with a variety of different actions (in this example, hand movements). Each diamond represents a vowel, with the degree of stress indicated by differently marked diamonds. In the exemplary screenshot shown, the scoring system described above was used with a maximum score of 3. The stress is thus represented by the relative height of the diamonds on the screen, with diamonds having a score of 3 placed higher than diamonds having a score of 2, and so on. In addition, the diamonds contain a numeric representation of the score.
Similarly, the colours of the diamonds may differ to identify different scores, e.g. a diamond representing a score of three could be red, one with a score of two could be blue and one with a score of one might be coloured green. To assist the animator, sections with no speech may also be represented 54. When an animator is using the tool, they may move along the time line selecting individual diamonds. As a diamond is selected, using a mouse for example, a motion selection tool may appear, e.g. a drop-down list, allowing the animator to select an action for a character. Different actions can be pre-assigned to each drop-down list at different levels, i.e. minor actions assigned to lower stress levels and major actions assigned to higher stress levels. The animator can thus select a major action from the list of major actions for a diamond with a value of 3 and a minor action from a list of minor actions presented for a diamond with a value of 1.
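The mapping from stress scores to candidate actions may be held in a simple look-up, as sketched below: in the automated mode an action is selected from the pre-assigned entries for the score, while in the semi-automatic mode the same entries populate the drop-down list presented to the animator. The action names follow the examples above, supplemented by invented placeholders.

```python
# Illustrative sketch: actions pre-assigned to each stress score. Example
# actions follow the description above; the remainder are placeholders.
import random

ACTIONS_BY_SCORE = {
    1: ["wink", "eyebrow raise"],            # minor actions for low stress
    2: ["hand movement", "shoulder shrug"],
    3: ["head movement", "full gesture"],    # major actions for high stress
}

def auto_select_action(score):
    # automated mode: choose any pre-assigned action for this stress level
    return random.choice(ACTIONS_BY_SCORE.get(score, ["idle"]))

def dropdown_options(score):
    # semi-automatic mode: list presented to the animator at this vowel
    return ACTIONS_BY_SCORE.get(score, [])
```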
The animation tool generates and stores the character actions in response to the animator's selection. It will be appreciated that the speed with which the animation may be completed is extremely fast, since the animator does not need to focus on timing or content. Suitably, the animation tool is one that allows for layering; thus the animator may use one layer to store the character's actions resulting from the speech above, with other layers employed to account for a character's general movements about a scene.
Whilst this approach may appear relatively primitive compared to animation generally, the reality is that lip/mouth movement is only used by humans for linguistic information, which in itself accounts for a very small percentage of communication (approx 7%), with the large majority of communication hence being performed by motion of other features (55%). More importantly, the context of the exact gesture is less important than the rhythm of the gestures, and the present method, by tying the gestures into vowel locations and into the relative importance of vowels in the speech, provides an effective animation tool. The automatic animation tool is obviously of importance in situations where an animator is not involved in producing the final piece of content, e.g. in a video game, where a character's actions, whilst depending on pre-recorded speech content, may have other inputs, e.g. from a player.
In a semi-automatic arrangement, the tool allows a user to select from different actions for each time stamp. Thus an animator can select different actions from a dropdown box for each timestamp. In this scenario, the contents of the drop down list may be selected based on the associated score for the timestamp.
This character animation technique employs the use of acoustic, linguistic and emotional speech analysis to semi-automatically generate gestures and body movements in response to the acoustic parameters in a character's voice.
The invention is a platform that enables the creation of computer animations for use in a wide number of applications. It is cutting edge in that, instead of basing animation on lip-synch, it uses speech events (acoustic, linguistic and emotional) to both manually and automatically define character movements, gestures and facial positions. The techniques have been demonstrated to work in practice.
A software front end as described above has been implemented that takes in user data (speech) and produces a corresponding animation that is close to half complete in a fraction of the time that would be required by an animator using traditional techniques.
The techniques described herein may be used to produce cheaper, faster and more effective character animations in films, games, children's TV programmes and advertisements.
The advantages include lower costs: the overall production overhead is reduced because character animation events may be characterised by non-animators based on a speech clip, freeing animators to work on other aspects of the animation process. Moreover, the animation process is faster since it is semi-automated using pre-defined libraries that allow up to 70% of the animation to be achieved without customization by an animator. The system is character independent, so that the gesture and movement libraries and characters may easily be changed.
In contrast to prior art methods, the system is largely language independent in that the techniques may be used to semi-automate characters in any spoken language.
The technology is character and language independent, and the use of re-usable and pre-defined gesture/movement libraries makes it a cheap, fast and effective alternative to conventional character animation techniques.
The systems of the present application have been implemented with a variety of different characters, tested in various languages and with various voices, and the potential to reduce production costs, save time and streamline workflows has been clearly demonstrated. The process and resulting system is essentially a labour-saving device that allows animators to achieve better production values in a shorter period of time, given that it takes care of 70% of the ground work, allowing animators to focus on the nuance and detail of the overall animated output.

Claims (29)

  1. A speech analysis system for assisting in the animation of at least one character to a piece of speech, the system comprising: a memory for storing the piece of speech, a vowel locator for identifying the locations of vowels within the piece of speech, and a vowel stress detector for identifying the degree of stress associated with each identified vowel and storing the associated degree of stress for each location.
  2. A speech analysis system according to claim 1, wherein the vowel locator identifies the duration of each vowel.
  3. A system according to claim 1 or claim 2, wherein the piece of speech is stored in a database in the memory.
  4. A system according to claim 3, wherein the database stores timestamps indicating locations of vowels and durations of vowels.
  5. A system according to claim 4, wherein the database further stores a stress value for each vowel.
  6. A system according to any preceding claim, wherein the vowel stress detector scores at least one characteristic of each vowel against a reference value for the characteristic.
  7. A system according to claim 6, wherein the reference value is determined by averaging the characteristic over a windowed section of the piece of speech.
  8. A system according to claim 7, wherein the windowed section comprises the entire piece of speech.
  9. A system according to any one of claims 6 to 8, wherein the at least one characteristic comprises one or more of the following: a) pitch, b) intensity, c) duration, d) voice quality, e) jitter, and f) voice breaks.
  10. A system according to any one of claims 6 to 9, wherein the at least one characteristic comprises the following characteristics: a) pitch, b) intensity and c) duration.
  11. An animation system comprising the system according to any preceding claim, wherein the animation tool provides a character animation feature employing the locations of vowels as a trigger for a character's motion.
  12. A system according to claim 11, wherein the motion of a character selected at a particular location is determined with reference to the degree of stress for that location.
  13. A system according to claim 12, wherein the motion of the character is automatically selected based upon the degree of stress.
  14. A system according to claim 12, wherein the animation tool allows an animator to select a particular motion from a list presented.
  15. A system according to claim 14, wherein the list is populated with possible character motions based upon the degree of stress.
  16. A system according to claim 14 or claim 15, wherein the list is presented for each vowel location allowing an animator to select an animated character's motion at each vowel location.
  17. A computer implemented method of animating a character's actions to a piece of speech, the method comprising the steps of: analysing the piece of speech to identify at least one location of a vowel, determining the degree of stress associated with the at least one identified vowel location, and selecting the character's action at the at least one location based on the determined degree of stress.
  18. A method according to claim 17, wherein the duration of the at least one vowel is determined.
  19. A method according to claim 17 or claim 18, wherein the degree of stress is determined by comparing at least one characteristic of each vowel against a reference value for the characteristic.
  20. A method according to claim 19, wherein the reference value is determined by averaging the characteristic over a windowed section of the piece of speech.
  21. A method according to claim 20, wherein the windowed section comprises the entire piece of speech.
  22. A method according to any one of claims 17 to 21, wherein the at least one characteristic comprises one or more of the following: a) pitch, b) intensity, c) duration, d) voice quality, e) jitter, and f) voice breaks.
  23. A method according to any one of claims 17 to 22, wherein the at least one characteristic comprises the following characteristics: a) pitch, b) intensity and c) duration.
  24. A method wherein the locations of vowels are used as a trigger for a character's motion in the animation.
  25. A method according to claim 24, wherein the character's motion at a location is determined with reference to the degree of stress for that location.
  26. A method according to claim 25, wherein the motion of the character is automatically selected based upon the degree of stress.
  27. A method according to claim 25, further comprising presenting an animator with a list of possible character motions and allowing the animator to select a particular motion from the list.
  28. A method according to claim 27, wherein the list is populated with possible character motions based upon the degree of stress.
  29. A method according to claim 27 or claim 28, wherein the list is presented for each vowel location allowing an animator to select an animated character's motion at each vowel location.
GB0903270A 2009-02-26 2009-02-26 A character animation tool which associates stress values with the locations of vowels Withdrawn GB2468140A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0903270A GB2468140A (en) 2009-02-26 2009-02-26 A character animation tool which associates stress values with the locations of vowels
PCT/EP2010/052445 WO2010097452A1 (en) 2009-02-26 2010-02-25 A character animation tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0903270A GB2468140A (en) 2009-02-26 2009-02-26 A character animation tool which associates stress values with the locations of vowels

Publications (2)

Publication Number Publication Date
GB0903270D0 GB0903270D0 (en) 2009-04-08
GB2468140A 2010-09-01

Family

ID=40565755

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0903270A Withdrawn GB2468140A (en) 2009-02-26 2009-02-26 A character animation tool which associates stress values with the locations of vowels

Country Status (2)

Country Link
GB (1) GB2468140A (en)
WO (1) WO2010097452A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5111409A (en) * 1989-07-21 1992-05-05 Elon Gasper Authoring and use systems for sound synchronized animation
KR100317036B1 (en) 1999-10-27 2001-12-22 최창석 Automatic and adaptive synchronization method of image frame using speech duration time in the system integrated with speech and face animation
DE60224776T2 (en) * 2001-12-20 2009-01-22 Matsushita Electric Industrial Co., Ltd., Kadoma-shi Virtual Videophone
FR2906056B1 (en) * 2006-09-15 2009-02-06 Cantoche Production Sa METHOD AND SYSTEM FOR ANIMATING A REAL-TIME AVATAR FROM THE VOICE OF AN INTERLOCUTOR

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997036288A1 (en) * 1996-03-26 1997-10-02 British Telecommunications Plc Image synthesis
WO2007076279A2 (en) * 2005-12-29 2007-07-05 Motorola Inc. Method for classifying speech data
WO2008025918A1 (en) * 2006-09-01 2008-03-06 Voxler Procedure for analyzing the voice in real time for the control in real time of a digital device and associated device

Also Published As

Publication number Publication date
GB0903270D0 (en) 2009-04-08
WO2010097452A1 (en) 2010-09-02

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)