GB2510200A - Animating a computer generated head based on information to be output by the head - Google Patents

Animating a computer generated head based on information to be output by the head

Info

Publication number
GB2510200A
Authority
GB
United Kingdom
Prior art keywords
expression
model
cluster
head
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1301583.9A
Other versions
GB2510200B (en)
GB201301583D0 (en)
Inventor
Javier Latorre-Martinez
Vincent Ping Leung Wan
Bjorn Stenger
Robert Anderson
Roberto Cipolla
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1301583.9A priority Critical patent/GB2510200B/en
Publication of GB201301583D0 publication Critical patent/GB201301583D0/en
Priority to JP2014014924A priority patent/JP2014146339A/en
Priority to US14/167,238 priority patent/US9959657B2/en
Priority to CN201410050837.7A priority patent/CN103971393A/en
Priority to EP14153137.6A priority patent/EP2760023A1/en
Publication of GB2510200A publication Critical patent/GB2510200A/en
Priority to JP2015194171A priority patent/JP6109901B2/en
Application granted granted Critical
Publication of GB2510200B publication Critical patent/GB2510200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/802D [Two Dimensional] animation, e.g. using sprites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/2053D [Three Dimensional] animation driven by audio data
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

A method of animating a computer generation of a head, the head having a mouth which moves in accordance with inputted speech or text to be output by the head, said method comprising: providing an input related to the speech which is to be output by the movement of the mouth, dividing said input into a sequence of acoustic units, selecting an expression to be output by said head and converting said sequence, of acoustic units to a sequence of image vectors using a statistical model. The model has a plurality of parameters describing probability distributions which relate an acoustic unit to an image vector for a selected expression and the sequence of image vectors is then output as video such that the mouth of said head moves to mime the speech associated with the input text with the selected expression. A parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and the weighting used is expression dependent and means that converting a sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for the selected expression. The parameters involved are provided in clusters, each comprising at least one sub-cluster with the expression dependent weights being retrieved for each cluster such that there is one weight per sub-cluster.

Description

A Computer Generated Head
FIELD
Embodiments of the present invention as generally described herein relate to a computer generated head and a method for animating such a head.
BACKGROUND
Computer generated talking heads can be used in a number of different situations. For example, for providing information via a public address system, for providing information to the user of a computer etc. Such computer generated animated heads may also be used in computer games and to allow computer generated figures to "talk".
However, there is a continuing need to make such a head seem more realistic.
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which: Figure 1 is a schematic of a system for computer generating a head; Figure 2 is a flow diagram showing the basic steps for rendering and animating a generated head in accordance with an embodiment of the invention; Figure 3(a) is an image of the generated head with a user interface and figure 3(b) is a line drawing of the interface; Figure 4 is a schematic of a system showing how the expression characteristics may be selected; Figure 5 is a variation on the system of figure 4; Figure 6 is a further variation on the system of figure 4; Figure 7 is a schematic of a Gaussian probability function; Figure 8 is a schematic of the clustering data arrangement used in a method in accordance with an embodiment of the present invention; Figure 9 is a flow diagram demonstrating a method of training a head generation system in accordance with an embodiment of the present invention; Figure 10 is a schematic of decision trees used by embodiments in accordance with the present invention; Figure 11 is a flow diagram showing the adapting of a system in accordance with an embodiment of the present invention; Figure 12 is a flow diagram showing the adapting of a system in accordance with a further embodiment of the present invention; Figure 13 is a flow diagram showing the training of a system for a head generation system where the weightings are factorised; Figure 14 is a flow diagram showing in detail the sub-steps of one of the steps of the flow diagram of figure 13; Figure 15 is a flow diagram showing in detail the sub-steps of one of the steps of the flow diagram of figure 13; Figure 16 is a flow diagram showing the adaptation of the system described with reference to figure 13; Figure 17 is an image model which can be used with methods and systems in accordance with embodiments of the present invention; Figure 18(a) is a variation on the model of figure 17; Figure 18(b) is a variation on the model of figure 18(a); Figure 19 is a flow diagram showing the training of the model of figures 18(a) and (b); Figure 20 is a schematic showing the basics of the training described with reference to figure 19; Figure 21(a) is a plot of the error against the number of modes used in the image models described with reference to figures 17, 18(a) and (b), and figure 21(b) is a plot of the number of sentences used for training against the errors measured in the trained model; Figures 22(a) to (d) are confusion matrices for the emotions displayed in test data; and Figure 23 is a table showing preferences for the variations of the image model.
DETAILED DESCRIPTION
In an embodiment, a method of animating a computer generation of a head is provided, the head having a mouth which moves in accordance with speech to be output by the head, said method comprising: providing an input related to the speech which is to be output by the movement of the lips; dividing said input into a sequence of acoustic units; selecting expression characteristics for the inputted text; converting said sequence of acoustic units to a sequence of image vectors using a statistical model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to an image vector, said image vector comprising a plurality of parameters which define a face of said head; and outputting said sequence of image vectors as video such that the mouth of said head moves to mime the speech associated with the input text with the selected expression, wherein a parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and wherein the weighting used is expression dependent, such that converting said sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for said selected expression, wherein the parameters are provided in clusters, and each cluster comprises at least one sub-cluster, wherein said expression dependent weights are retrieved for each cluster such that there is one weight per sub-cluster.
It should be noted that the mouth means any part of the mouth, for example, the lips, jaw, tongue etc. In a further embodiment, the lips move to mime said input speech.
The above head can output speech visually from the movement of the lips of the head. In a further embodiment, said model is further configured to convert said acoustic units into speech vectors, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector, the method further comprising outputting said sequence of speech vectors as audio which is synchronised with the lip movement of the head. Thus the head can output both audio and video.
The input may be a text input which is divided into a sequence of acoustic units. In a further embodiment, the input is a speech input which is an audio input, the speech input being divided into a sequence of acoustic units and output as audio with the video of the head. Once divided into acoustic units the model can be run to associate the acoustic units derived from the speech input with image vectors such that the head can be generated to visually output the speech signal along with the audio speech signal.
In an embodiment, each sub-cluster may comprise at least one decision tree, said decision tree being based on questions relating to at least one of linguistic, phonetic or prosodic differences.
There may be differences in the structure between the decision trees of the clusters and between trees in the sub-clusters. The probability distributions may be selected from a Gaussian distribution, Poisson distribution, Gamma distribution, Student-t distribution or Laplacian distribution.
The expression characteristics may be selected from at least one of different emotions, accents or speaking styles. Variations to the speech will often cause subtle variations to the expression displayed on a speaker's face when speaking, and the above method can be used to capture these variations to allow the head to appear natural. In one embodiment, selecting an expression characteristic comprises providing an input to allow the weightings to be selected via the input. Also, selecting an expression characteristic may comprise predicting, from the speech to be outputted, the weightings which should be used. In a yet further embodiment, selecting an expression characteristic comprises predicting, from external information about the speech to be output, the weightings which should be used.
It is also possible for the method to adapt to a new expression characteristic. For example, selecting an expression may comprise receiving a video input containing a face and varying the weightings to simulate the expression characteristics of the face of the video input.
Where the input data is an audio file containing speech, the weightings which are to be used for controlling the head can be obtained from the audio speech input.
In a further embodiment, selecting an expression characteristic comprises randomly selecting a set of weightings from a plurality of pre-stored sets of weightings, wherein each set of weightings comprises the weightings for all sub-clusters.
The image vector comprises parameters which allow a face to be reconstructed from these parameters. In one embodiment, said image vector comprises parameters which allow the face to be constructed from a weighted sum of modes, and wherein the modes represent reconstructions of a face or part thereof. In a further embodiment, the modes comprise modes to represent shape and appearance of the face. The same weighting parameter may be used for a shape mode and its corresponding appearance mode.
The modes may be used to represent pose of the face, deformation of regions of the face, blinking etc. Static features of the head may be modelled with a fixed shape and texture.
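A minimal sketch of the weighted-sum-of-modes reconstruction is given below. It illustrates the general idea only; the number of modes, the number of tracked facial points and the mode matrices are placeholder assumptions rather than values from the patent.

```python
import numpy as np

K = 10                                  # number of modes (assumed)
num_points = 47                         # tracked facial points (assumed)

mean_shape = np.zeros((num_points, 2))              # mean face shape
shape_modes = np.random.randn(K, num_points, 2)     # one shape mode per parameter
params = 0.1 * np.random.randn(K)                   # mode weights from the image vector

# Face shape = mean shape + weighted sum of shape modes
shape = mean_shape + np.tensordot(params, shape_modes, axes=1)
print(shape.shape)   # (47, 2): x, y position of every tracked point
```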
In a further embodiment, a method of adapting a system for rendering a computer generated head to a new expression is provided, the head having a mouth which moves in accordance with speech to be output by the head, the system comprising: an input for receiving data relating to the speech which is to be output by the movement of the mouth; a processor configured to: divide said input data into a sequence of acoustic units; allow selection of expression characteristics for the inputted text; convert said sequence of acoustic units to a sequence of image vectors using a statistical model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to an image vector, said image vector comprising a plurality of parameters which define a face of said head; and output said sequence of image vectors as video such that the mouth of said head moves to mime the speech associated with the input text with the selected expression, wherein a parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and wherein the weighting used is expression dependent, such that converting said sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for said selected expression, wherein the parameters are provided in clusters, and each cluster comprises at least one sub-cluster, wherein said expression dependent weights are retrieved for each cluster such that there is one weight per sub-cluster, the method comprising: receiving a new input video file; and calculating the weights applied to the clusters to maximise the similarity between the generated image and the new video file.
The above method may further comprise creating a new cluster using the data from the new video file; and calculating the weights applied to the clusters including the new cluster to maximise the similarity between the generated image and the new video file.
In an embodiment, a system for rendering a computer generated head is provided, the head having a mouth which moves in accordance with speech to be output by the head, the system comprising: an input for receiving data relating to the speech which is to be output by the movement of the mouth; a processor configured to: divide said input data into a sequence of acoustic units; allow selection of expression characteristics for the inputted text; convert said sequence of acoustic units to a sequence of image vectors using a statistical model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to an image vector, said image vector comprising a plurality of parameters which define a face of said head; and output said sequence of image vectors as video such that the lips of said head move to mime the speech associated with the input text with the selected expression, wherein a parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and wherein the weighting used is expression dependent, such that converting said sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for said selected expression, wherein the parameters are provided in clusters, and each cluster comprises at least one sub-cluster, wherein said expression dependent weights are retrieved for each cluster such that there is one weight per sub-cluster.
In an embodiment, an adaptable system for rendering a computer generated head is provided, the head having a mouth which moves in accordance with speech to be output by the head, the system comprising: an input for receiving data relating to the speech which is to be output by the movement of the mouth; a processor configured to: divide said input data into a sequence of acoustic units; allow selection of expression characteristics for the inputted text; convert said sequence of acoustic units to a sequence of image vectors using a statistical model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to an image vector, said image vector comprising a plurality of parameters which define a face of said head; and output said sequence of image vectors as video such that the lips of said head move to mime the speech associated with the input text with the selected expression, wherein a parameter of a predetermined type of each probability distribution in said selected expression is expressed as a weighted sum of parameters of the same type, and wherein the weighting used is expression dependent, such that converting said sequence of acoustic units to a sequence of image vectors comprises retrieving the expression dependent weights for said selected expression, wherein the parameters are provided in clusters, and each cluster comprises at least one sub-cluster, wherein said expression dependent weights are retrieved for each cluster such that there is one weight per sub-cluster, the system further comprising a memory configured to store the said parameters provided in clusters and sub-clusters and the weights for said sub-clusters, the system being further configured to receive a new input video file, the processor being configured to re-calculate the weights applied to the sub-clusters to maximise the similarity between the generated image and the new video file.
The above generated head may be rendered in 2D or 3D. For 3D, the image vectors define the head in 3 dimensions. In 3D, variations in pose are compensated for in the 3D data. However, blinking and static features may be treated as explained above.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Figure 1 is a schematic of a system for the computer generation of a head which can talk. The system 1 comprises a processor 3 which executes a program 5. System 1 further comprises storage or memory 7. The storage 7 stores data which is used by program 5 to render the head on display 19. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input for data relating to the speech to be output by the head and the emotion or expression with which the text is to be output. The type of data which is input may take many forms which will be described in more detail later. The input 15 may be an interface which allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is an audiovisual output 17. The output 17 comprises a display 19 which will display the generated head.
In use, the system 1 receives data through data input 15. The program 5 executed on processor 3 converts inputted data into speech to be output by the head and the expression which the head is to display. The program accesses the storage to select parameters on the basis of the input data. The program renders the head. The head when animated moves its lips in accordance with the speech to be output and displays the desired expression. The head also has an audio output which outputs an audio signal containing the speech. The audio speech is synchronised with the lip movement of the head.
Figure 2 is a schematic of the basic process for animating and rendering the head. In step S201, an input is received which relates to the speech to be output by the talking head and will also contain information relating to the expression that the head should exhibit while speaking the text.
In this specific embodiment, the input which relates to speech will be text. In figure 2 the text is separated from the expression input. However, the input related to the speech does not need to be a text input; it can be any type of signal which allows the head to be able to output speech.
For example, the input could be selected from speech input, video input, or combined speech and video input. Another possible input would be any form of index that relates to a set of face/speech already produced, or to a predefined text/expression, e.g. an icon to make the system say "please" or "I'm sorry". For the avoidance of doubt, it should be noted that by outputting speech, the lips of the head move in accordance with the speech to be outputted. However, the volume of the audio output may be silent. In an embodiment, there is just a visual representation of the head miming the words where the speech is output visually by the movement of the lips. In further embodiments, this may or may not be accompanied by an audio output of the speech.
When text is received as an input, it is then converted into a sequence of acoustic units which may be phonemes, graphemes, context dependent phonemes or graphemes and words or part thereof.
In one embodiment, additional information is given in the input to allow expression to be selected in step S205. This then allows the expression weights, which will be described in more detail with relation to figure 9, to be derived in step S207.
In some embodiments, steps S205 and S207 are combined. This may be achieved in a number of different ways. For example, Figure 3 shows an interface for selecting the expression. Here, a user directly selects the weighting using, for example, a mouse to drag and drop a point on the screen, a keyboard to input a figure etc. In figure 3(b), a selection unit 251 which comprises a mouse, keyboard or the like selects the weightings using display 253. Display 253, in this example, has a radar chart which shows the weightings. The user can use the selecting unit 251 in order to change the dominance of the various clusters via the radar chart. It will be appreciated by those skilled in the art that other display methods may be used in the interface.
In some embodiments, the user can directly enter text, weights for emotions, and weights for pitch, speed and depth.
Pitch and depth can affect the movement of the face since the movement of the face is different when the pitch goes too high or too low, and in a similar way varying the depth varies the sound of the voice between that of a big person and a little person. Speed can be controlled as an extra parameter by modifying the number of frames assigned to each model via the duration distributions.
Figure 3(a) shows the overall unit with the generated head. The head is partially shown as a mesh without texture. In normal use, the head will be fully textured.
In a further embodiment, the system is provided with a memory which saves predetermined sets of weighting vectors. Each vector may be designed to allow the text to be outputted via the head using a different expression. The expression is displayed by the head and is also manifested in the audio output. The expression can be selected from happy, sad, neutral, angry, afraid, tender etc. In further embodiments the expression can relate to the speaking style of the user, for example, whispering, shouting etc., or the accent of the user.
A system in accordance with such an embodiment is shown in Figure 4. Here, the display 253 shows different expressions which may be selected by selecting unit 251.
In a further embodiment, the user does not separately input information relating to the expression. Here, as shown in figure 2, the expression weightings which are derived in S207 are derived directly from the text in step S203.
Such a system is shown in figure 5. For example, the system may need to output speech via the talking head corresponding to text which it recognises as being a command or a question. The system may be configured to output an electronic book. The system may recognise from the text when something is being spoken by a character in the book as opposed to the narrator, for example from quotation marks, and change the weighting to introduce a new expression to be used in the output. Similarly, the system may be configured to recognise if the text is repeated.
In such a situation, the voice characteristics may change for the second output. Further, the system may be configured to recognise if the text refers to a happy moment, or an anxious moment, and the text outputted with the appropriate expression. This is shown schematically in step S211 where the expression weights are predicted directly from the text.
In the above system as shown in figure 5, a memory 261 is provided which stores the attributes and rules to be checked in the text. The input text is provided by unit 263 to memory 261. The rules for the text are checked and information concerning the type of expression is then passed to selector unit 265. Selection unit 265 then looks up the weightings for the selected expression.
The above system and considerations may also be applied for the system to be used in a computer game where a character in the game speaks.
In a further embodiment, the system receives information about how the head should output speech from a further source. An example of such a system is shown in figure 6. For example, in the case of an electronic book, the system may receive inputs indicating how certain parts of the text should be outputted.
In a computer game, the system will be able to determine from the game whether a character who is speaking has been injured, is hiding so has to whisper, is trying to attract the attention of someone, has successfully completed a stage of the game etc. In the system of figure 6, the further information on how the head should output speech is received from unit 271. Unit 271 then sends this information to memory 273. Memory 273 then retrieves information concerning how the voice should be output and sends this to unit 275.
Unit 275 then retrieves the weightings for the desired output from the head.
In a further embodiment, speech is directly input at step S209. Here, step S209 may comprise three sub-blocks: an automatic speech recognizer (ASR) that detects the text from the speech, an aligner that synchronizes text and speech, and an automatic expression recognizer. The recognised expression is converted to expression weights in S207. The recognised text then flows to text input 203. This arrangement allows an audio input to the talking head system which produces an audio-visual output. This allows, for example, real expressive speech to be used and the appropriate face to be synthesized from it.
In a further embodiment, input text that corresponds to the speech could be used to improve the performance of module S209 by removing or simplifying the job of the ASR sub-module.
In step S213, the text and expression weights are input into an acoustic model which in this embodiment is a cluster adaptive trained HMM or CAT-HMM.
The text is then converted into a sequence of acoustic units. These acoustic units may be phonemes or graphemes. The units may be context dependent, e.g. triphones, quinphones etc., which take into account not only the phoneme which has been selected but the preceding and following phonemes, the position of the phone in the word, the number of syllables in the word the phone belongs to, etc. The text is converted into the sequence of acoustic units using techniques which are well-known in the art and will not be explained further here.
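As a rough illustration of what a context dependent unit looks like, the sketch below builds simple triphone labels from a phoneme sequence. It is only schematic: real systems attach far richer context (position in word, syllable counts, prosodic and linguistic features), and the label format shown is an assumption, not the one used in the patent.

```python
def to_triphones(phonemes):
    """Turn a phoneme sequence into left-phone - phone + right-phone labels."""
    padded = ["sil"] + phonemes + ["sil"]        # pad with silence at the edges
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["h", "e", "l", "ou"]))
# ['sil-h+e', 'h-e+l', 'e-l+ou', 'l-ou+sil']
```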
There are many models available for generating a face. Some of these rely on a parameterisation of the face in terms of, for example, key points/features, muscle structure etc. Thus, a face can be defined in terms of a "face" vector of the parameters used in such a face model to generate a face. This is analogous to the situation in speech synthesis where output speech is generated from a speech vector. In speech synthesis, a speech vector has a probability of being related to an acoustic unit; there is not a one-to-one correspondence. Similarly, a face vector only has a probability of being related to an acoustic unit. Thus, a face vector can be manipulated in a similar manner to a speech vector to produce a talking head which can output both speech and a visual representation of a character speaking. Thus, it is possible to treat the face vector in the same way as the speech vector and train it from the same data.
The probability distributions are looked up which relate acoustic units to image parameters. In this embodiment, the probability distributions will be Gaussian distributions which are defined by means and variances, although it is possible to use other distributions such as the Poisson, Student-t, Laplacian or Gamma distributions, some of which are defined by variables other than the mean and variance.
Considering just the image processing at first, in this embodiment each acoustic unit does not have a definitive one-to-one correspondence to a "face vector" or "observation", to use the terminology of the art. Said face vector consists of a vector of parameters that define the gesture of the face at a given frame. Many acoustic units are pronounced in a similar manner, are affected by surrounding acoustic units and their location in a word or sentence, or are pronounced differently depending on the expression, emotional state, accent, speaking style etc. of the speaker. Thus, each acoustic unit only has a probability of being related to a face vector and text-to-speech systems calculate many probabilities and choose the most likely sequence of observations given a sequence of acoustic units.
A Gaussian distribution is shown in figure 7. Figure 7 can be thought of as being the probability distribution of an acoustic unit relating to a face vector. For example, the speech vector shown as X has a probability P1 of corresponding to the phoneme or other acoustic unit which has the distribution shown in figure 7.
The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during the training of the system.
These parameters are then used in a model in step S213 which will be termed a "head model".
The "head model" is a visual or audio visual version of the acoustic models which are used in speech synthesis. In this description, the head model is a Hidden Markov Model (HMM).
However, other models could also be used.
The memory of the talking head system will store many probability density functions relating an acoustic unit, i.e. phoneme, grapheme, word or part thereof, to speech parameters. As the Gaussian distribution is generally used, these are generally referred to as Gaussians or components.
In a Hidden Markov Model or other type of head model, the probability of all potential face vectors relating to a specific acoustic unit must be considered. Then the sequence of face vectors which most likely corresponds to the sequence of acoustic units will be taken into account. This implies a global optimization over all the acoustic units of the sequence, taking into account the way in which two units affect each other. As a result, it is possible that the most likely face vector for a specific acoustic unit is not the best face vector when a sequence of acoustic units is considered.
In the flow chart of figure 2, a single stream is shown for modelling the image vector as a "compressed expressive video model". In some embodiments, there will be a plurality of different states which will each be modelled using a Gaussian. For example, in an embodiment, the talking head system comprises multiple streams. Such streams might represent parameters for only the mouth, or only the tongue or the eyes, etc. The streams may also be further divided into classes such as silence (sil), short pause (pau) and speech (spe) etc. In an embodiment, the data from each of the streams and classes will be modelled using an HMM. The HMM may comprise different numbers of states; for example, in an embodiment, 5 state HMMs may be used to model the data from some of the above streams and classes. A Gaussian component is determined for each HMM state.
The above has concentrated on the head outputting speech visually. However, the head may also output audio in addition to the visual output. Returning to figure 2, the "head model" is used to produce the image vector via one or more streams and in addition produce speech vectors via one or more streams. In figure 2, 3 audio streams are shown which are: spectrum, Log F0 and BAP. Cluster adaptive training is an extension to hidden Markov model text-to-speech (HMM-TTS).
HMM-TTS is a parametric approach to speech synthesis which models context dependent speech units (CDSU) using HMMs with a finite number of emitting states, usually five.
Concatenating the HMMs and sampling from them produces a set of parameters which can then be re-synthesized into synthetic speech. Typically, a decision tree is used to cluster the CDSU to handle sparseness in the training data. For any given CDSU the means and variances to be used in the HMMs may be looked up using the decision tree.
CAT uses multiple decision trees to capture style- or emotion-dependent information. This is done by expressing each parameter in terms of a sum of weighted parameters where the weighting is derived from step S207. The parameters are combined as shown in figure 8.
Thus, in an embodiment, the mean of a Gaussian with a selected expression (for either speech or face parameters) is expressed as a weighted sum of independent means of the Gaussians:

$$\mu_m^{(s)} = \sum_{i=1}^{P} \lambda_i^{(s)} \mu_{c(m,i)} \qquad \text{Eqn. 1}$$

where $\mu_m^{(s)}$ is the mean of component m with a selected expression s, $i \in \{1, \dots, P\}$ is the index for a cluster with P the total number of clusters, $\lambda_i^{(s)}$ is the expression dependent interpolation weight of the i-th cluster for the expression s, and $\mu_{c(m,i)}$ is the mean for component m in cluster i. In an embodiment, for one of the clusters, for example cluster i=1, all the weights are always set to 1.0. This cluster is called the 'bias cluster'. Each cluster comprises at least one decision tree. There will be a decision tree for each component in the cluster. In order to simplify the expression, $c(m,i) \in \{1, \dots, N\}$ indicates the general leaf node index for the component m in the mean vectors decision tree for cluster i, with N the total number of leaf nodes across the decision trees of all the clusters. The details of the decision trees will be explained later.
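A minimal numerical sketch of Eqn. 1 is given below, assuming a single component with one mean per cluster already selected via the decision trees; the dimensions and weight values are illustrative only.

```python
import numpy as np

P = 5    # number of clusters; cluster 1 is the bias cluster
D = 3    # dimensionality of the image (face) vector

# mu_{c(m,i)}: one cluster mean per cluster for a single component m
cluster_means = np.random.randn(P, D)

# lambda_i^(s): expression dependent interpolation weights;
# the bias cluster weight is always 1.0
weights = np.array([1.0, 0.7, 0.1, 0.0, 0.2])

# Eqn. 1: mu_m^(s) = sum_i lambda_i^(s) * mu_{c(m,i)}
expression_mean = weights @ cluster_means
print(expression_mean)
```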
For the head model, the system looks up the means and variances which will be stored in an accessible manner. The head model also receives the expression weightings from step S207. It will be appreciated by those skilled in the art that the voice characteristic dependent weightings may be looked up before or after the means are looked up.
The expression dependent means, i.e. using the means and applying the weightings, are then used in a head model in step S213.
The face characteristic independent means are clustered. In an embodiment, each cluster comprises at least one decision tree; the decisions used in said trees are based on linguistic, phonetic and prosodic variations. In an embodiment, there is a decision tree for each component which is a member of a cluster. Prosodic, phonetic, and linguistic contexts affect the facial gesture. Phonetic contexts typically affect the position and movement of the mouth, while prosodic (e.g. syllable) and linguistic (e.g. part of speech of words) contexts affect prosody such as duration (rhythm) and other parts of the face, e.g. the blinking of the eyes. Each cluster may comprise one or more sub-clusters where each sub-cluster comprises at least one of the said decision trees.
The above can either be considered to retrieve a weight for each sub-cluster or a weight vector for each cluster, the components of the weight vector being the weightings for each sub-cluster.
The following configuration may be used in accordance with an embodiment of the present invention. To model this data, in this embodiment, 5 state HMMs are used. The data is separated into three classes for this example: silence, short pause, and speech. In this particular embodiment, the allocation of decision trees and weights per sub-cluster are as follows.

In this particular embodiment the following streams are used per cluster:

Spectrum: 1 stream, 5 states, 1 tree per state x 3 classes
LogF0: 3 streams, 5 states per stream, 1 tree per state and stream x 3 classes
BAP: 1 stream, 5 states, 1 tree per state x 3 classes
VID: 1 stream, 5 states, 1 tree per state x 3 classes
Duration: 1 stream, 5 states, 1 tree x 3 classes (each tree is shared across all states)
Total: 3x31 = 93 decision trees

For the above, the following weights are applied to each stream per expression characteristic:

Spectrum: 1 stream, 5 states, 1 weight per stream x 3 classes
LogF0: 3 streams, 5 states per stream, 1 weight per stream x 3 classes
BAP: 1 stream, 5 states, 1 weight per stream x 3 classes
VID: 1 stream, 5 states, 1 weight per stream x 3 classes
Duration: 1 stream, 5 states, 1 weight per state and stream x 3 classes
Total: 3x11 = 33 weights.
As shown in this example, it is possible to allocate the same weight to different decision trees (VID) or more than one weight to the same decision tree (duration) or any other combination.
As used herein, decision trees to which the same weighting is to be applied are considered to form a sub-cluster.
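The tree and weight totals quoted in the listing above can be checked with a few lines of arithmetic; the per-stream figures below are taken directly from the listing, everything else is just bookkeeping.

```python
classes = 3  # silence, short pause, speech

trees_per_class = {
    "Spectrum": 1 * 5,   # 1 stream x 5 states, 1 tree per state
    "LogF0":    3 * 5,   # 3 streams x 5 states, 1 tree per state and stream
    "BAP":      1 * 5,
    "VID":      1 * 5,
    "Duration": 1,       # single tree shared across all states
}
weights_per_class = {
    "Spectrum": 1,       # 1 weight per stream
    "LogF0":    3,
    "BAP":      1,
    "VID":      1,
    "Duration": 1 * 5,   # 1 weight per state and stream
}

print(sum(trees_per_class.values()) * classes)    # 31 x 3 = 93 decision trees
print(sum(weights_per_class.values()) * classes)  # 11 x 3 = 33 weights
```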
In one embodiment, the audio streams (spectrum, logF0) are not used to generate the video of the talking head during synthesis but are needed during training to align the audio-visual stream with the text.
The following table shows which streams are used for alignment, video and audio in accordance with an embodiment of the present invention.
Stream    | Used for alignment | Used for video synthesis | Used for audio synthesis
Spectrum  | Yes                | No                       | Yes
LogF0     | Yes                | No                       | Yes
BAP       | No                 | No                       | Yes (but may be omitted)
VID       | No                 | Yes                      | No
Duration  | Yes                | Yes                      | Yes

In an embodiment, the mean of a Gaussian distribution with a selected voice characteristic is expressed as a weighted sum of the means of a Gaussian component, where the summation uses one mean from each cluster, the mean being selected on the basis of the prosodic, linguistic and phonetic context of the acoustic unit which is currently being processed.
The training of the model used in step S213 will be explained in detail with reference to figures 9 to 11. Figure 2 shows a simplified model with four streams: 3 related to producing the speech vector (1 spectrum, 1 Log F0 and 1 duration) and one related to the face/VID parameters.
(However, it should be noted from above that many embodiments will use additional streams and multiple streams may be used to model each speech or video parameter. For example, in this figure the BAP stream has been removed for simplicity. This corresponds to a simple pulse/noise type of excitation. However the mechanism to include it or any other video or audio stream is the same as for the represented streams.) These produce a sequence of speech vectors and a sequence of face vectors which are output at step S215.
The speech vectors are then fed into the speech generation unit in step S217 which converts these into a speech sound file at step S219. The face vectors are then fed into the face image generation unit at step S221 which converts these parameters to video in step S223. The video and sound files are then combined at step S225 to produce the animated talking head.
Next, the training of a system in accordance with an embodiment of the present invention will be described with reference to figure 9.
In image processing systems which are based on Hidden Markov Models (HMMs), the HMM is often expressed as:

$$\mathcal{M} = (A, B, \Pi) \qquad \text{Eqn. 2}$$

where $A = \{a_{ij}\}_{i,j=1}^{N}$ is the state transition probability distribution, $B = \{b_j(o)\}_{j=1}^{N}$ is the state output probability distribution and $\Pi = \{\pi_i\}_{i=1}^{N}$ is the initial state probability distribution, and where N is the number of states in the HMM.
As noted above, the face vector parameters can be derived from an HMM in the same way as the speech vector parameters.
In the current embodiment, the state transition probability distribution A and the initial state probability distribution are determined in accordance with procedures well known in the art.
Therefore, the remainder of this description will be concerned with the state output probability distribution.
Generally in talking head systems the state output vector or image vector o(t) from the m-th Gaussian component in a model set $\mathcal{M}$ is

$$P(o(t) \mid m, s, \mathcal{M}) = \mathcal{N}\!\left(o(t); \mu_m^{(s)}, \Sigma_m^{(s)}\right) \qquad \text{Eqn. 3}$$

where $\mu_m^{(s)}$ and $\Sigma_m^{(s)}$ are the mean and covariance of the m-th Gaussian component for speaker s. The aim when training a conventional talking head system is to estimate the model parameter set $\mathcal{M}$ which maximises the likelihood for a given observation sequence. In the conventional model, there is one single speaker from which data is collected and the emotion is neutral; therefore the model parameter set is $\mu_m^{(s)} = \mu_m$ and $\Sigma_m^{(s)} = \Sigma_m$ for all components m.
As it is not possible to obtain the above model set based on so-called Maximum Likelihood (ML) criteria purely analytically, the problem is conventionally addressed by using an iterative approach known as the expectation maximisation (EM) algorithm, which is often referred to as the Baum-Welch algorithm. Here, an auxiliary function (the "Q" function) is derived:

$$Q(\mathcal{M}, \mathcal{M}') = \sum_{m,t,s} \gamma_m(t) \log p\!\left(o(t), m \mid \mathcal{M}\right) \qquad \text{Eqn. 4}$$

where $\gamma_m(t)$ is the posterior probability of component m generating the observation o(t) given the current model parameters $\mathcal{M}'$ and $\mathcal{M}$ is the new parameter set. After each iteration, the parameter set $\mathcal{M}'$ is replaced by the new parameter set $\mathcal{M}$ which maximises $Q(\mathcal{M}, \mathcal{M}')$. $p(o(t), m \mid \mathcal{M})$ is a generative model such as a GMM, HMM etc. In the present embodiment an HMM is used which has a state output vector of:

$$P(o(t) \mid m, s, \mathcal{M}) = \mathcal{N}\!\left(o(t); \hat{\mu}_m^{(s)}, \hat{\Sigma}_{v(m)}^{(s)}\right) \qquad \text{Eqn. 5}$$

where $m \in \{1, \dots, M\}$, $t \in \{1, \dots, T\}$ and $s \in \{1, \dots, S\}$ are indices for component, time and expression respectively, and where M, T, and S are the total number of components, frames, and speaker expressions respectively. Here data is collected from one speaker, but the speaker will exhibit different expressions.
The exact form of $\hat{\mu}_m^{(s)}$ and $\hat{\Sigma}_m^{(s)}$ depends on the type of expression dependent transforms that are applied. In the most general way the expression dependent transforms include:
- a set of expression dependent weights $\lambda_i^{(s)}$
- an expression-dependent cluster $\mu_{c(m,x+1)}^{(s)}$
- a set of linear transforms $\left[ A_{r(m)}^{(s)}, b_{r(m)}^{(s)} \right]$

After applying all the possible expression dependent transforms in step S211, the mean vector $\hat{\mu}_m^{(s)}$ and covariance matrix $\hat{\Sigma}_m^{(s)}$ of the probability distribution m for expression s become

$$\hat{\mu}_m^{(s)} = A_{r(m)}^{(s)\,-1} \left( \sum_i \lambda_i^{(s)} \mu_{c(m,i)} + \mu_{c(m,x+1)}^{(s)} - b_{r(m)}^{(s)} \right) \qquad \text{Eqn. 6}$$

$$\hat{\Sigma}_m^{(s)} = \left( A_{r(m)}^{(s)\,T} \Sigma_{v(m)}^{-1} A_{r(m)}^{(s)} \right)^{-1} \qquad \text{Eqn. 7}$$

where $\mu_{c(m,i)}$ are the means of cluster i for component m as described in Eqn. 1, $\mu_{c(m,x+1)}^{(s)}$ is the mean vector for component m of the additional cluster for the expression s, which will be described later, and $A_{r(m)}^{(s)}$ and $b_{r(m)}^{(s)}$ are the linear transformation matrix and the bias vector associated with regression class r(m) for the expression s.
R is the total number of regression classes and $r(m) \in \{1, \dots, R\}$ denotes the regression class to which the component m belongs.
If no linear transformation is applied, $A_{r(m)}^{(s)}$ and $b_{r(m)}^{(s)}$ become an identity matrix and zero vector respectively.
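A small sketch of how Eqns. 6 and 7 combine the cluster means, the additional expression cluster and a CMLLR-style transform for a single component is given below. The matrix shapes and values are placeholders; with an identity transform and a zero bias, as noted above, it reduces to Eqn. 1.

```python
import numpy as np

D, P = 3, 4
cluster_means = np.random.randn(P, D)       # mu_{c(m,i)}
lam = np.array([1.0, 0.5, 0.2, 0.3])        # lambda^(s)
extra_mean = np.zeros(D)                    # mu^(s)_{c(m,x+1)} (additional cluster)
A = np.eye(D)                               # A^(s)_{r(m)}; identity if no CMLLR
b = np.zeros(D)                             # b^(s)_{r(m)}
Sigma = np.diag(np.random.rand(D) + 0.1)    # Sigma_{v(m)}

# Eqn. 6: expression adapted mean
mu_hat = np.linalg.inv(A) @ (lam @ cluster_means + extra_mean - b)

# Eqn. 7: expression adapted covariance
Sigma_hat = np.linalg.inv(A.T @ np.linalg.inv(Sigma) @ A)

print(mu_hat)
print(Sigma_hat)
```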
For reasons which will be explained later, in this embodiment, the covariances are clustered and arranged into decision trees, where $v(m) \in \{1, \dots, V\}$ denotes the leaf node in a covariance decision tree to which the covariance matrix of the component m belongs and V is the total number of variance decision tree leaf nodes.
Using the above, the auxiliary function can be expressed as:

$$Q(\mathcal{M}, \mathcal{M}') = -\frac{1}{2} \sum_{m,t,s} \gamma_m(t,s) \left\{ \log \left| \Sigma_{v(m)} \right| + \left(o(t) - \hat{\mu}_m^{(s)}\right)^T \Sigma_{v(m)}^{-1} \left(o(t) - \hat{\mu}_m^{(s)}\right) \right\} + C \qquad \text{Eqn. 8}$$

where C is a constant independent of $\mathcal{M}$. Thus, using the above and substituting equations 6 and 7 in equation 8, the auxiliary function shows that the model parameters may be split into four distinct parts.
The first part are the parameters of the canonical model, i.e. the expression independent means $\{\mu_n\}$ and the expression independent covariances $\{\Sigma_k\}$; the above indices n and k indicate leaf nodes of the mean and variance decision trees which will be described later. The second part are the expression dependent weights $\{\lambda_i^{(s)}\}$, where s indicates expression and i the cluster index parameter. The third part are the means of the expression dependent cluster $\mu_{c(m,x+1)}$, and the fourth part are the CMLLR constrained maximum likelihood linear regression transforms $\{A_d^{(s)}, b_d^{(s)}\}$, where s indicates expression and d indicates the component or expression regression class to which component m belongs.
In detail, for determining the ML estimate of the mean, the following procedure is performed.
To simplify the following equations it is assumed that no linear transform is applied.
If a linear transform is applied, the original observation vectors $\{o(t)\}$ have to be substituted by the transformed vectors

$$\hat{o}_m(t) = A_{r(m)}^{(s)} o(t) + b_{r(m)}^{(s)} \qquad \text{Eqn. 9}$$

Similarly, it will be assumed that there is no additional cluster. The inclusion of that extra cluster during the training is just equivalent to adding a linear transform for which $A_{r(m)}^{(s)}$ is the identity matrix and $b_{r(m)}^{(s)} = \mu_{c(m,x+1)}^{(s)}$.

First, the auxiliary function of equation 4 is differentiated with respect to $\mu_n$ as follows:

$$\frac{\partial Q(\mathcal{M}, \mathcal{M}')}{\partial \mu_n} = k_n - G_{nn} \mu_n - \sum_{v \neq n} G_{nv} \mu_v \qquad \text{Eqn. 10}$$

where

$$G_{nv} = \sum_{m,i,j \,:\, c(m,i)=n,\; c(m,j)=v} G_{ij}^{(m)}, \qquad k_n = \sum_{m,i \,:\, c(m,i)=n} k_i^{(m)} \qquad \text{Eqn. 11}$$

with $G_{ij}^{(m)}$ and $k_i^{(m)}$ accumulated statistics

$$G_{ij}^{(m)} = \sum_{t,s} \gamma_m(t,s)\, \lambda_i^{(s)} \Sigma_{v(m)}^{-1} \lambda_j^{(s)}, \qquad k_i^{(m)} = \sum_{t,s} \gamma_m(t,s)\, \lambda_i^{(s)} \Sigma_{v(m)}^{-1} o(t) \qquad \text{Eqn. 12}$$

By maximizing the equation in the normal way by setting the derivative to zero, the following formula is achieved for the ML estimate of $\mu_n$, i.e.

$$\hat{\mu}_n = G_{nn}^{-1} \left( k_n - \sum_{v \neq n} G_{nv} \mu_v \right) \qquad \text{Eqn. 13}$$

It should be noted that the ML estimate of $\mu_n$ also depends on $\mu_k$ where k does not equal n. The index n is used to represent leaf nodes of decision trees of mean vectors, whereas the index k represents leaf nodes of covariance decision trees. Therefore, it is necessary to perform the optimization by iterating over all $\mu_n$ until convergence.
This can be performed by optimizing all $\mu_n$ simultaneously by solving the following equations:

$$\begin{bmatrix} G_{11} & \cdots & G_{1N} \\ \vdots & \ddots & \vdots \\ G_{N1} & \cdots & G_{NN} \end{bmatrix} \begin{bmatrix} \hat{\mu}_1 \\ \vdots \\ \hat{\mu}_N \end{bmatrix} = \begin{bmatrix} k_1 \\ \vdots \\ k_N \end{bmatrix} \qquad \text{Eqn. 14}$$

However, if the training data is small or N is quite large, the coefficient matrix of equation 14 cannot have full rank. This problem can be avoided by using singular value decomposition or other well-known matrix factorization techniques.
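The joint update of Eqn. 14 can be sketched as assembling the accumulated blocks into one large linear system and solving it with an SVD-based pseudo-inverse, which tolerates the rank deficiency mentioned above. All statistics here are synthetic placeholders rather than values accumulated from real data.

```python
import numpy as np

N, D = 4, 3                                  # leaf nodes, face-vector dimension

# Synthetic accumulated statistics: G[n, v] is a D x D block, k[n] a D vector
G = np.random.rand(N, N, D, D)
G = (G + G.transpose(1, 0, 3, 2)) / 2        # enforce symmetry of the block matrix
k = np.random.rand(N, D)

# Stack the blocks into an (N*D) x (N*D) coefficient matrix and solve Eqn. 14
coeff = G.transpose(0, 2, 1, 3).reshape(N * D, N * D)
rhs = k.reshape(N * D)
mu = np.linalg.pinv(coeff) @ rhs             # SVD-based pseudo-inverse
mu = mu.reshape(N, D)                        # one updated mean per leaf node
print(mu)
```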
The same process is then performed in order to perform an ML estimate of the covariances, i.e. the auxiliary function shown in equation (8) is differentiated with respect to $\Sigma_k$ to give:

$$\hat{\Sigma}_k = \frac{\displaystyle\sum_{t,s,m \,:\, v(m)=k} \gamma_m(t,s)\, \bar{o}(t)\, \bar{o}(t)^T}{\displaystyle\sum_{t,s,m \,:\, v(m)=k} \gamma_m(t,s)} \qquad \text{Eqn. 15}$$

where

$$\bar{o}(t) = o(t) - \hat{\mu}_m^{(s)} \qquad \text{Eqn. 16}$$

The ML estimate for the expression dependent weights and the expression dependent linear transform can also be obtained in the same manner, i.e. differentiating the auxiliary function with respect to the parameter for which the ML estimate is required and then setting the value of the differential to 0.
For the expression dependent weights this yields

$$\hat{\lambda}^{(s)} = \left( \sum_{t,m} \gamma_m(t,s)\, M_m^T \Sigma_{v(m)}^{-1} M_m \right)^{-1} \sum_{t,m} \gamma_m(t,s)\, M_m^T \Sigma_{v(m)}^{-1} o(t) \qquad \text{Eqn. 17}$$

where $M_m$ denotes the matrix whose columns are the cluster mean vectors $\mu_{c(m,i)}$ for component m.
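A sketch of the weight update of Eqn. 17 for one expression is given below, assuming $M_m$ stacks the cluster means of component m as columns and that the occupancies $\gamma_m(t,s)$ are already available; all statistics are synthetic.

```python
import numpy as np

D, P, T, M_COMP = 3, 4, 10, 5                       # dims, clusters, frames, components

M = [np.random.randn(D, P) for _ in range(M_COMP)]  # M_m: columns are mu_{c(m,i)}
Sigma_inv = [np.eye(D) for _ in range(M_COMP)]      # Sigma_{v(m)}^{-1}
gamma = np.random.rand(M_COMP, T)                   # gamma_m(t, s) for one expression s
obs = np.random.randn(T, D)                         # observations o(t)

lhs = np.zeros((P, P))
rhs = np.zeros(P)
for m in range(M_COMP):
    for t in range(T):
        lhs += gamma[m, t] * M[m].T @ Sigma_inv[m] @ M[m]
        rhs += gamma[m, t] * M[m].T @ Sigma_inv[m] @ obs[t]

lam = np.linalg.solve(lhs, rhs)                     # lambda^(s), one weight per cluster
print(lam)
```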
In a preferred embodiment, the process is performed in an iterative manner. This basic system is explained with reference to the flow diagram of figure 9.
In step S301, a plurality of inputs of video image are received. In this illustrative example, 1 speaker is used, but the speaker exhibits 3 different emotions when speaking and also speaks with a neutral expression. The data, both audio and video, is collected so that there is one set of data for the neutral expression and three further sets of data, one for each of the three expressions.
Next, in step S303, an audiovisual model is trained and produced for each of the 4 data sets.
The input visual data is parameterised to produce training data. Possible methods are explained in relation to the training for the image model with respect to figure 19. The training data is collected so that there is an acoustic unit which is related to both a speech vector and an image vector. In this embodiment, each of the 4 models is only trained using data from one face.
A cluster adaptive model is initialised and trained as follows: In step S305, the number of clusters P is set to V+1, where V is the number of expressions (4).
In step S307, one cluster (cluster 1) is determined as the bias cluster. In an embodiment, this will be the cluster for neutral expression. The decision trees for the bias cluster and the associated cluster mean vectors are initialised using the expression which in step S303 produced the best model. In this example, each face is given a tag: "Expression A (neutral)", "Expression B", "Expression C" and "Expression D". The covariance matrices, space weights for multi-space probability distributions (MSD) and their parameter sharing structure are also initialised to those of the Expression A (neutral) model.
Each binary decision tree is constructed in a locally optimal fashion starting with a single root node representing all contexts. In this embodiment, by context, the following bases are used, phonetic, linguistic and prosodic. As each node is created, the next optimal question about the context is selected. The question is selected on the basis of which question causes the maximum increase in likelihood and the terminal nodes generated in the training examples.
Then, the set of terminal nodes is searched to find the one which can be split using its optimum question to provide the largest increase in the total likelihood to the training data. Providing that this increase exceeds a threshold, the node is divided using the optimal question and two new terminal nodes are created. The process stops when no new terminal nodes can be formed since any further splitting will not exceed the threshold applied to the likelihood split.
This process is shown for example in figure 10. The n-th terminal node in a mean decision tree is divided into two new terminal nodes $n_+^q$ and $n_-^q$ by a question q. The likelihood gain achieved by this split can be calculated as follows:

$$\mathcal{L}(n) = -\frac{1}{2} \mu_n^T \sum_{m \in S(n)} G_{ii}^{(m)} \mu_n + \mu_n^T \sum_{m \in S(n)} \left( k_i^{(m)} - \sum_{j \neq i} G_{ij}^{(m)} \mu_{c(m,j)} \right) + C \qquad \text{Eqn. 18}$$

where S(n) denotes the set of components associated with node n. Note that the terms which are constant with respect to $\mu_n$ are not included.
Here C is a constant term independent of $\mu_n$. The maximum likelihood estimate of $\mu_n$ is given by equation 13. Thus, the above can be written as:

$$\mathcal{L}(n) = \frac{1}{2} \hat{\mu}_n^T \sum_{m \in S(n)} \left( k_i^{(m)} - \sum_{j \neq i} G_{ij}^{(m)} \mu_{c(m,j)} \right) \qquad \text{Eqn. 19}$$

Thus, the likelihood gained by splitting node n into $n_+^q$ and $n_-^q$ is given by:

$$\Delta\mathcal{L}(n; q) = \mathcal{L}(n_+^q) + \mathcal{L}(n_-^q) - \mathcal{L}(n) \qquad \text{Eqn. 20}$$

Using the above, it is possible to construct a decision tree for each cluster where the tree is arranged so that the optimal question is asked first in the tree and the decisions are arranged in hierarchical order according to the likelihood of splitting. A weighting is then applied to each cluster.
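The greedy construction described above can be summarised in a short schematic: repeatedly split the terminal node whose best question gives the largest gain, stopping when the gain drops below the threshold. The gain argument stands in for Eqn. 20 (the same loop applies to the covariance trees described next); the question representation is an assumption for illustration.

```python
def build_tree(root_contexts, questions, gain, threshold):
    """Greedy decision-tree construction over context sets.

    questions: list of predicates over a single context
    gain(parent, yes, no): likelihood gain of the split (e.g. Eqn. 20)
    """
    leaves = [root_contexts]
    while True:
        best = None
        for leaf in leaves:
            for q in questions:
                yes = [c for c in leaf if q(c)]
                no = [c for c in leaf if not q(c)]
                if not yes or not no:
                    continue                      # question does not split this leaf
                g = gain(leaf, yes, no)
                if best is None or g > best[0]:
                    best = (g, leaf, yes, no)
        if best is None or best[0] < threshold:
            return leaves                         # no split exceeds the threshold
        _, leaf, yes, no = best
        leaves.remove(leaf)
        leaves.extend([yes, no])
```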
Decision trees might also be constructed for variance. The covariance decision trees are constructed as follows: if the k-th terminal node in a covariance decision tree is divided into two new terminal nodes $k_+^q$ and $k_-^q$ by question q, the cluster covariance matrix and the gain by the split are expressed as follows:

$$\hat{\Sigma}_k = \frac{\displaystyle\sum_{t,s,m \,:\, v(m)=k} \gamma_m(t,s)\, \bar{o}(t)\, \bar{o}(t)^T}{\displaystyle\sum_{t,s,m \,:\, v(m)=k} \gamma_m(t,s)} \qquad \text{Eqn. 21}$$

$$\mathcal{L}(k) = -\frac{1}{2} \sum_{t,s,m \,:\, v(m)=k} \gamma_m(t,s) \log \left| \hat{\Sigma}_k \right| + D \qquad \text{Eqn. 22}$$

where D is a constant independent of $\{\hat{\Sigma}_k\}$. Therefore the increment in likelihood is

$$\Delta\mathcal{L}(k; q) = \mathcal{L}(k_+^q) + \mathcal{L}(k_-^q) - \mathcal{L}(k) \qquad \text{Eqn. 23}$$

In step S309, a specific expression tag is assigned to each of clusters 2, ..., P, e.g. clusters 2, 3, 4, and 5 are for expressions B, C, D and A respectively. Note, because expression A (neutral) was used to initialise the bias cluster it is assigned to the last cluster to be initialised.
In step S311, a set of CAT interpolation weights are simply set to 1 or 0 according to the assigned expression (referred to as "voicetag" below) as:

$$\lambda_i^{(s)} = \begin{cases} 1.0 & \text{if } i = 1 \text{ (bias cluster)} \\ 1.0 & \text{if voicetag}(s) = i \\ 0.0 & \text{otherwise} \end{cases}$$

In this embodiment, there are global weights per expression, per stream. For each expression/stream combination 3 sets of weights are set: for silence, image and pause.
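Step S311 amounts to a one-hot initialisation plus a fixed bias weight. A tiny sketch follows, assuming cluster 1 is the bias cluster and each of the clusters 2..P carries one expression; the indexing convention is an assumption.

```python
def init_cat_weights(num_clusters, voicetag_cluster):
    """Initial CAT interpolation weights for one expression (step S311)."""
    weights = [0.0] * num_clusters
    weights[0] = 1.0                     # bias cluster (cluster 1) always 1.0
    weights[voicetag_cluster - 1] = 1.0  # cluster assigned to this expression
    return weights

print(init_cat_weights(5, voicetag_cluster=3))  # e.g. expression C -> cluster 3
# [1.0, 0.0, 1.0, 0.0, 0.0]
```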
In step S313, for each cluster 2 to (P-1) in turn the clusters are initialised as follows. The face data for the associated expression, e.g. expression B for cluster 2, is aligned using the mono-speaker model for the associated face trained in step S303. Given these alignments, the statistics are computed and the decision tree and mean values for the cluster are estimated. The mean values for the cluster are computed as the normalised weighted sum of the cluster means using the weights set in step S311, i.e. in practice this results in the mean values for a given context being the weighted sum (weight 1 in both cases) of the bias cluster mean for that context and the expression B model mean for that context in cluster 2.
In step S315, the decision trees are then rebuilt for the bias cluster using all the data from all 4 faces, and the associated means and variance parameters re-estimated.
After adding the clusters for expressions B, C and D the bias cluster is re-estimated using all 4 expressions at the same time. In step S317, Cluster P (Expression A) is now initialised as for the other clusters, described in step S313, using data only from Expression A. Once the clusters have been initialised as above, the CAT model is then updated/trained as follows.
In step S319 the decision trees are re-constructed cluster-by-cluster from cluster 1 to P, keeping the CAT weights fixed. In step S321, new means and variances are estimated in the CAT model. Next, in step S323, new CAT weights are estimated for each cluster. In an embodiment, the process loops back to S321 until convergence. The parameters and weights are estimated using maximum likelihood calculations performed by using the auxiliary function of the Baum-Welch algorithm to obtain a better estimate of said parameters.
As previously described, the parameters are estimated via an iterative process.
In a further embodiment, at step S323, the process loops back to step S319 so that the decision trees are reconstructed during each iteration until convergence.
In a further embodiment, expression dependent transforms as previously described are used.
Here, the expression dependent transforms are inserted after step S323 such that the transforms are applied and the transformed model is then iterated until convergence. In an embodiment, the transforms would be updated on each iteration.
Figure 10 shows clusters 1 to P which are in the form of decision trees. In this simplified example, there are just four terminal nodes in cluster 1 and three terminal nodes in cluster P. It is important to note that the decision trees need not be symmetric, i.e. each decision tree can have a different number of terminal nodes. The number of terminal nodes and the number of branches in the tree is determined purely by the log likelihood splitting, which achieves the maximum split at the first decision and then asks the questions in order of the question which causes the larger split. Once the split achieved is below a threshold, the splitting of a node terminates.
The above produces a canonical model which allows the following synthesis to be performed:
1. Any of the 4 expressions can be synthesised using the final set of weight vectors corresponding to that expression.
2. A random expression can be synthesised from the audiovisual space spanned by the CAT model by setting the weight vectors to arbitrary positions.
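As a minimal illustration of point 2, the sketch below (not from the patent; the array shapes and names are assumptions) shows how a component mean is obtained under the CAT model as a weighted sum of cluster means, so that any point in the weight space yields an expression.

```python
import numpy as np

def component_mean(cluster_means, weights):
    """Mean of one Gaussian component under the CAT model: a weighted sum
    of the cluster mean vectors selected for that component.
    cluster_means: (P, D) array, one mean per cluster; weights: (P,)."""
    return np.dot(weights, cluster_means)

# hypothetical example: 3 clusters, 4-dimensional observation vectors
cluster_means = np.random.randn(3, 4)
expression_weights = np.array([1.0, 0.2, 0.8])   # an arbitrary point in the expression space
print(component_mean(cluster_means, expression_weights))
```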
In a further example, the assistant is used to synthesise an expression characteristic where the system is given an input of a target expression with the same characteristic.
In a further example, the assistant is used to synthesise an expression where the system is given an input of the speaker exhibiting the expression.
Figure 11 shows one example. First, the input target expression is received at step S501. Next, the weightings of the canonical model, i.e. the weightings of the clusters which have been previously trained, are adjusted to match the target expression in step S503.
The face video is then outputted using the new weightings derived in step S503.
In a further embodiment, a more complex method is used where a new cluster is provided for the new expression. This will be described with reference to figure 12.
As in figure 11, first, data of the speaker speaking and exhibiting the target expression is received in step S501. The weightings are then adjusted to best match the target expression in step S503.
Then, a new cluster is added to the model for the target expression in step S507. Next, the decision tree is built for the new expression cluster in the same manner as described with reference to figure 9.
Then, the model parameters, i.e. in this example the means, are computed for the new cluster in step S511.
Next, in step S513, the weights are updated for all clusters. Then, in step S515, the structure of the new cluster is updated.
As before, the speech vector and face vector with the new target expression are outputted using the new weightings with the new cluster in step S505.
Note that in this embodiment, in step S515, the other clusters are not updated at this time, as this would require the training data to be available at synthesis time.
In a further embodiment the clusters are updated after step S515 and thus the flow diagram loops back to step S509 until convergence.
Finally, in an embodiment, a linear transform such as CMLLR can be applied on top of the model to further improve the similarity to the target expression. The regression classes of this transform can be global or expression dependent.
In the second case the tying structure of the regression classes can be derived from the decision tree of the expression dependent cluster, or from a clustering of the distributions obtained after applying the expression dependent weights to the canonical model and adding the extra cluster.
At the start, the bias cluster represents expression independent characteristics, whereas the other clusters represent their associated voice data set. As the training progresses, the assignment of clusters to expressions becomes less precise. The clusters and CAT weights now represent a broad acoustic space.
The above embodiments refer to the clustering using just one attribute i.e. expression.
However, it is also possible to factorise voice and facial attributes to obtain further control. In the following embodiment, expression is subdivided into speaking style(s) and emotion(e) and the model is factorised for these two types of expressions or attributes. Here, the state output vector or vector comprised of the model parameters o(t) from an mth Gaussian component in a model set M is

$$P\big(o(t)\mid m, s, e, \mathcal{M}\big) = \mathcal{N}\big(o(t);\, \hat{\mu}^{(s,e)}_m,\, \hat{\Sigma}^{(s,e)}_m\big) \qquad \text{Eqn. 24}$$

where $\hat{\mu}^{(s,e)}_m$ and $\hat{\Sigma}^{(s,e)}_m$ are the mean and covariance of the mth Gaussian component for speaking style s and emotion e.
In this embodiment, s will refer to speaking style/voice. Speaking style can be used to represent styles such as whispering, shouting etc. It can also be used to refer to accents etc. Similarly, in this embodiment only two factors are considered, but the method could be extended to other speech factors, or these factors could be subdivided further and factorisation performed for each subdivision.
The aim when training a conventional text-to-speech system is to estimate the model parameter set M which maximises the likelihood for a given observation sequence. In the conventional model, there is one style and one expression/emotion, therefore the model parameter set is $\hat{\mu}^{(s,e)}_m = \mu_m$ and $\hat{\Sigma}^{(s,e)}_m = \Sigma_m$ for all components m.
As it is not possible to obtain the above model set based on so-called Maximum Likelihood (ML) criteria purely analytically, the problem is conventionally addressed by using an iterative approach known as the expectation maximisation (EM) algorithm, which is often referred to as the Baum-Welch algorithm. Here, an auxiliary function (the "Q" function) is derived:

$$Q(\mathcal{M},\hat{\mathcal{M}}) = \sum_m \sum_t \gamma_m(t)\,\log p\big(o(t), m \mid \hat{\mathcal{M}}\big) \qquad \text{Eqn. 25}$$

where γ_m(t) is the posterior probability of component m generating the observation o(t) given the current model parameters M′, and M is the new parameter set. After each iteration, the parameter set M′ is replaced by the new parameter set M which maximises Q(M, M′). p(o(t), m | M) is a generative model such as a GMM, HMM etc. In the present embodiment an HMM is used which has a state output vector of:

$$P\big(o(t)\mid m, s, e, \mathcal{M}\big) = \mathcal{N}\big(o(t);\, \hat{\mu}^{(s,e)}_m,\, \hat{\Sigma}^{(s,e)}_m\big) \qquad \text{Eqn. 26}$$

where m ∈ {1,...,MN}, t ∈ {1,...,T}, s ∈ {1,...,S} and e ∈ {1,...,E} are indices for component, time, speaking style and expression/emotion respectively, and where MN, T, S and E are the total number of components, frames, speaking styles and expressions respectively.
The exact form of $\hat{\mu}^{(s,e)}_m$ and $\hat{\Sigma}^{(s,e)}_m$ depends on the type of speaking style and emotion dependent transforms that are applied. In the most general way the style dependent transforms include:
- a set of style-emotion dependent weights $\lambda^{(s,e)}_{q(m)}$
- a style-emotion-dependent cluster $\mu^{(s,e)}_{c(m,x)}$
- a set of linear transforms $\big[A^{(s,e)}_{r(m)}, b^{(s,e)}_{r(m)}\big]$, whereby these transforms could depend just on the style, just on the emotion or on both.
After applying all the possible style dependent transforms, the mean vector $\hat{\mu}^{(s,e)}_m$ and covariance matrix $\hat{\Sigma}^{(s,e)}_m$ of the probability distribution m for style s and emotion e become

$$\hat{\mu}^{(s,e)}_m = A^{(s,e)-1}_{r(m)}\Big(\sum_i \lambda^{(s,e)}_{i,q(m)}\,\mu_{c(m,i)} + \mu^{(s,e)}_{c(m,x)} - b^{(s,e)}_{r(m)}\Big) \qquad \text{Eqn. 27}$$

$$\hat{\Sigma}^{(s,e)}_m = \Big(A^{(s,e)\top}_{r(m)}\,\Sigma^{-1}_{v(m)}\,A^{(s,e)}_{r(m)}\Big)^{-1} \qquad \text{Eqn. 28}$$

where $\mu_{c(m,i)}$ are the means of cluster i for component m, $\mu^{(s,e)}_{c(m,x)}$ is the mean vector for component m of the additional cluster for style s and emotion e, which will be described later, and $A^{(s,e)}_{r(m)}$ and $b^{(s,e)}_{r(m)}$ are the linear transformation matrix and the bias vector associated with regression class r(m) for style s, expression e.
R is the total number of regression classes and r(m) ∈ {1,...,R} denotes the regression class to which the component m belongs.
If no linear transformation is applied, $A^{(s,e)}_{r(m)}$ and $b^{(s,e)}_{r(m)}$ become an identity matrix and zero vector respectively.
For reasons which will be explained later, in this embodiment, the covariances are clustered and arranged into decision trees where v(m) ∈ {1,...,V} denotes the leaf node in a covariance decision tree to which the covariance matrix of the component m belongs, and V is the total number of variance decision tree leaf nodes.
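As a rough illustration of equations 27 and 28, the sketch below (a simplification under assumed array shapes; the function name and argument layout are not from the patent) computes the adapted mean and covariance of one component given the weights, cluster means, additional-cluster mean and a CMLLR transform.

```python
import numpy as np

def adapted_gaussian(lam, cluster_means, extra_mean, A, b, Sigma_v):
    """Style/emotion dependent mean and covariance of one component,
    in the spirit of equations 27 and 28.
    lam:           (P,)  weight vector for this style/emotion
    cluster_means: (P, D) cluster means selected for this component
    extra_mean:    (D,)  mean of the additional style-emotion cluster
    A, b:          CMLLR transform of the component's regression class
    Sigma_v:       (D, D) canonical covariance of the component's leaf node"""
    A_inv = np.linalg.inv(A)
    mean = A_inv @ (lam @ cluster_means + extra_mean - b)      # eqn 27
    cov = np.linalg.inv(A.T @ np.linalg.inv(Sigma_v) @ A)      # eqn 28
    return mean, cov
```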
Using the above, the auxiliary function can be expressed as:

$$Q(\mathcal{M},\hat{\mathcal{M}}) = -\frac{1}{2}\sum_{m,t,s,e} \gamma_m(t,s,e)\Big\{\log\big|\hat{\Sigma}^{(s,e)}_m\big| + \big(o(t)-\hat{\mu}^{(s,e)}_m\big)^\top\,\hat{\Sigma}^{(s,e)-1}_m\,\big(o(t)-\hat{\mu}^{(s,e)}_m\big)\Big\} + C \qquad \text{Eqn. 29}$$

where C is a constant independent of M. Thus, using the above and substituting equations 27 and 28 into equation 29, the auxiliary function shows that the model parameters may be split into four distinct parts.
The first part is the parameters of the canonical model, i.e. the style and expression independent means {μ_n} and the style and expression independent covariances {Σ_k}; the above indices n and k indicate leaf nodes of the mean and variance decision trees, which will be described later. The second part is the style-expression dependent weights $\{\lambda^{(s,e)}_i\}$, where s indicates speaking style, e indicates expression and i the cluster index parameter. The third part is the means of the style-expression dependent cluster $\mu^{(s,e)}_{c(m,x)}$, and the fourth part is the CMLLR constrained maximum likelihood linear regression transforms $\{A^{(s,e)}_d, b^{(s,e)}_d\}$, where s indicates style, e expression and d indicates the component or style-emotion regression class to which component m belongs.
Once the auxiliary function is expressed in the above manner, it is then maximised with respect to each of the variables in turn in order to obtain the ML values of the style and emotion/expression characteristic parameters, the style dependent parameters and the expression/emotion dependent parameters.
In detail, for determining the ML estimate of the mean, the following procedure is performed. To simplify the following equations it is assumed that no linear transform is applied.
If a linear transform is applied, the original observation vectors {o(t)} have to be substituted by the transformed ones

$$\hat{o}^{(s,e)}_{r(m)}(t) = A^{(s,e)}_{r(m)}\,o(t) + b^{(s,e)}_{r(m)} \qquad \text{Eqn. 30}$$

Similarly, it will be assumed that there is no additional cluster. The inclusion of that extra cluster during the training is just equivalent to adding a linear transform for which $A^{(s,e)}_{r(m)}$ is the identity matrix and $b^{(s,e)}_{r(m)} = -\mu^{(s,e)}_{c(m,x)}$.

First, the auxiliary function of equation 29 is differentiated with respect to μ_n as follows:

$$\frac{\partial Q(\mathcal{M},\hat{\mathcal{M}})}{\partial \mu_n} = k_n - G_{nn}\,\mu_n - \sum_{v\neq n} G_{nv}\,\mu_v \qquad \text{Eqn. 31}$$

where

$$G_{nv} = \sum_{m,i,j:\; c(m,i)=n,\; c(m,j)=v} G^{(m)}_{ij}, \qquad k_n = \sum_{m,i:\; c(m,i)=n} k^{(m)}_i \qquad \text{Eqn. 32}$$

with $G^{(m)}_{ij}$ and $k^{(m)}_i$ the accumulated statistics

$$G^{(m)}_{ij} = \sum_{t,s,e}\gamma_m(t,s,e)\,\lambda^{(s,e)}_{i,q(m)}\,\Sigma^{-1}_{v(m)}\,\lambda^{(s,e)}_{j,q(m)}, \qquad k^{(m)}_i = \sum_{t,s,e}\gamma_m(t,s,e)\,\lambda^{(s,e)}_{i,q(m)}\,\Sigma^{-1}_{v(m)}\,o(t) \qquad \text{Eqn. 33}$$

By maximising the equation in the normal way by setting the derivative to zero, the following formula is achieved for the ML estimate of μ_n, i.e. $\hat{\mu}_n$:

$$\hat{\mu}_n = G^{-1}_{nn}\Big(k_n - \sum_{v\neq n} G_{nv}\,\mu_v\Big) \qquad \text{Eqn. 34}$$

It should be noted that the ML estimate of μ_n also depends on μ_k where k does not equal n. The index n is used to represent leaf nodes of decision trees of mean vectors, whereas the index k represents leaf nodes of covariance decision trees. Therefore, it is necessary to perform the optimization by iterating over all μ_n until convergence.
This can be performed by optimizing all μ_n simultaneously by solving the following equations.
$$\begin{bmatrix} G_{11} & \cdots & G_{1N} \\ \vdots & \ddots & \vdots \\ G_{N1} & \cdots & G_{NN} \end{bmatrix}\begin{bmatrix} \hat{\mu}_1 \\ \vdots \\ \hat{\mu}_N \end{bmatrix} = \begin{bmatrix} k_1 \\ \vdots \\ k_N \end{bmatrix} \qquad \text{Eqn. 35}$$

However, if the training data is small or N is quite large, the coefficient matrix of equation 35 cannot have full rank. This problem can be avoided by using singular value decomposition or other well-known matrix factorization techniques.
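The following sketch (an illustrative assumption about how the block statistics might be stored; not taken from the patent) solves the joint system of equation 35 with an SVD-based pseudo-inverse, so that a rank-deficient coefficient matrix is still handled.

```python
import numpy as np

def solve_cluster_means(G_blocks, k_blocks):
    """Solve the joint system of equation 35 for all leaf-node means.
    G_blocks: (N, N, D, D) accumulated cross statistics G_nv
    k_blocks: (N, D) accumulated statistics k_n"""
    N, _, D, _ = G_blocks.shape
    A = G_blocks.transpose(0, 2, 1, 3).reshape(N * D, N * D)  # assemble the block matrix
    b = k_blocks.reshape(N * D)
    mu = np.linalg.pinv(A) @ b          # SVD-based pseudo-inverse handles rank deficiency
    return mu.reshape(N, D)
```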
The same process is then performed in order to obtain an ML estimate of the covariances, i.e. the auxiliary function shown in equation 29 is differentiated with respect to Σ_k to give:

$$\hat{\Sigma}_k = \frac{\sum_{t,s,e,m:\; v(m)=k} \gamma_m(t,s,e)\,\bar{o}_m(t)\,\bar{o}_m(t)^\top}{\sum_{t,s,e,m:\; v(m)=k} \gamma_m(t,s,e)} \qquad \text{Eqn. 36}$$

where

$$\bar{o}_m(t) = o(t) - \hat{\mu}^{(s,e)}_m \qquad \text{Eqn. 37}$$
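As a simple illustration of equations 36-37 (the data layout and function name are assumptions made for this sketch, not part of the patent), the ML covariance of one leaf node is a gamma-weighted outer product of residuals:

```python
import numpy as np

def ml_covariance(obs, gammas, adapted_means):
    """ML estimate of one leaf-node covariance (cf. equations 36-37).
    obs:           (T, D) observation vectors assigned to this leaf node
    gammas:        (T,)   occupancy probabilities
    adapted_means: (T, D) adapted component means for each frame"""
    dim = obs.shape[1]
    num = np.zeros((dim, dim))
    den = 0.0
    for o_t, gamma, mu in zip(obs, gammas, adapted_means):
        resid = o_t - mu                       # o(t) - mu_hat_m, equation 37
        num += gamma * np.outer(resid, resid)
        den += gamma
    return num / den
```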
The ML estimate for style dependent weights and the style dependent linear transform can also be obtained in the same manner i.e. differentiating the auxiliary function with respect to the parameter for which the ML estimate is required and then setting the value of the differential to 0.
For the expression/emotion dependent weights this yields:

$$\hat{\lambda}^{(s,e)}_q = \Big(\sum_{t,m:\; q(m)=q} \gamma_m(t,s,e)\,M_m^\top\,\Sigma^{-1}_{v(m)}\,M_m\Big)^{-1} \sum_{t,m:\; q(m)=q} \gamma_m(t,s,e)\,M_m^\top\,\Sigma^{-1}_{v(m)}\,\hat{o}_{q(m)}(t) \qquad \text{Eqn. 38}$$

where

$$\hat{o}_{q(m)}(t) = o(t) - \mu^{(s,e)}_{c(m,x)}$$

and M_m is the matrix whose columns are the cluster mean vectors for component m. An analogous expression is obtained for the style-dependent weights by accumulating the corresponding statistics over the style in question.

In a preferred embodiment, the process is performed in an iterative manner. This basic system is explained with reference to the flow diagrams of figures 13 to 15.
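Before turning to those flow diagrams, the closed-form weight update of equation 38 can be illustrated by the following sketch (the structure of the `stats` argument is an assumption made for illustration; it is not part of the patent).

```python
import numpy as np

def update_weights(stats):
    """Closed-form weight re-estimation in the spirit of equation 38.
    Each entry of `stats` is (gamma, M, Sigma_inv, o_hat) for one
    component/frame pair assigned to the weight class being updated:
    gamma     occupancy probability
    M         (D, P) matrix whose columns are the cluster means
    Sigma_inv (D, D) inverse leaf-node covariance
    o_hat     (D,) observation with the extra-cluster mean removed."""
    _, P = stats[0][1].shape
    lhs = np.zeros((P, P))
    rhs = np.zeros(P)
    for gamma, M, Sigma_inv, o_hat in stats:
        lhs += gamma * M.T @ Sigma_inv @ M
        rhs += gamma * M.T @ Sigma_inv @ o_hat
    return np.linalg.solve(lhs, rhs)
```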
In step S401, a plurality of inputs of audio and video are received. In this illustrative example, 4 styles are used.
Next, in step S403, an acoustic model is trained and produced for each of the 4 voices/styles, each speaking with neutral emotion. In this embodiment, each of the 4 models is only trained using data with one speaking style. Step S403 will be explained in more detail with reference to the flow chart of figure 14.
In step S805 of figure 14, the number of clusters P is set to V + 1, where V is the number of voices (4).
In step S807, one cluster (cluster 1) is determined as the bias cluster. The decision trees for the bias cluster and the associated cluster mean vectors are initialised using the voice which in step S303 produced the best model. In this example, each voice is given a tag "Style A", "Style B", "Style C" and "Style D"; here Style A is assumed to have produced the best model. The covariance matrices, space weights for multi-space probability distributions (MSD) and their parameter sharing structure are also initialised to those of the Style A model.
Each binary decision tree is constructed in a locally optimal fashion starting with a single root node representing all contexts. In this embodiment, by context, the following bases are used: phonetic, linguistic and prosodic. As each node is created, the next optimal question about the context is selected. The question is selected on the basis of which question causes the maximum increase in likelihood and the terminal nodes generated in the training examples.
Then, the set of terminal nodes is searched to find the one which can be split using its optimum question to provide the largest increase in the total likelihood to the training data as explained above with reference to figures 9 to 12.
Decision trees may also be constructed for variance as explained above.
In step S809, a specific voice tag is assigned to each of clusters 2,...,P, e.g. clusters 2, 3, 4 and 5 are for styles B, C, D and A respectively. Note that because Style A was used to initialise the bias cluster, it is assigned to the last cluster to be initialised.
In step S811, a set of CAT interpolation weights are simply set to 1 or 0 according to the assigned voice tag as:

$$\lambda_i^{(s)} = \begin{cases} 1.0 & \text{if } i = 0 \\ 1.0 & \text{if voicetag}(s) = i \\ 0.0 & \text{otherwise} \end{cases}$$

In this embodiment, there are global weights per style, per stream.
In step S813, for each cluster 2,...,(P-1) in turn the clusters are initialised as follows. The voice data for the associated style, e.g. style B for cluster 2, is aligned using the mono-style model for the associated style trained in step S303. Given these alignments, the statistics are computed and the decision tree and mean values for the cluster are estimated. The mean values for the cluster are computed as the normalised weighted sum of the cluster means using the weights set in step S811, i.e. in practice this results in the mean values for a given context being the weighted sum (weight 1 in both cases) of the bias cluster mean for that context and the style B model mean for that context in cluster 2. In step S815, the decision trees are then rebuilt for the bias cluster using all the data from all 4 styles, and the associated means and variance parameters re-estimated.
After adding the clusters for styles B, C and D the bias cluster is re-estimated using all 4 styles at the same time.
In step S817, cluster P (style A) is now initialised as for the other clusters, described in step S813, using data only from style A. Once the clusters have been initialised as above, the CAT model is then updated/trained as follows: in step S819 the decision trees are re-constructed cluster-by-cluster from cluster 1 to P, keeping the CAT weights fixed. In step S821, new means and variances are estimated in the CAT model. Next, in step S823, new CAT weights are estimated for each cluster. In an embodiment, the process loops back to S821 until convergence. The parameters and weights are estimated using maximum likelihood calculations performed by using the auxiliary function of the Baum-Welch algorithm to obtain a better estimate of said parameters.
As previously described, the parameters are estimated via an iterative process.
In a further embodiment, at step S823, the process loops back to step S819 so that the decision trees are reconstructed during each iteration until convergence.
The process then returns to step S405 of figure 13, where the model is then trained for different emotions, both vocal and facial.
In this embodiment, emotion is modelled using cluster adaptive training in the same manner as described for modelling the speaking style in step S403. First, "emotion clusters" are initialised in step S405. This will be explained in more detail with reference to figure 15.
Data is then collected for at least one of the styles where in addition the input data is emotional, either in terms of the facial expression or the voice. It is possible to collect data from just one style, where the speaker provides a number of data samples in that style, each exhibiting a different emotion, or from the speaker providing a plurality of styles and data samples with different emotions. In this embodiment, it will be presumed that the speech samples provided to train the system to exhibit emotion come from the style used to collect the data to train the initial CAT model in step S403. However, the system can also train to exhibit emotion using data collected with different speaking styles for which data was not used in S403.
In step S451, the non-neutral emotion data is then grouped into Ne groups. In step S453, Ne additional clusters are added to model emotion. A cluster is associated with each emotion group; for example, a cluster is associated with "Happy", etc. These emotion clusters are provided in addition to the neutral style clusters formed in step S403.
In step S455, a binary vector is initialised for the emotion cluster weighting such that, if speech data is to be used for training exhibiting one emotion, the cluster associated with that emotion is set to "1" and all other emotion clusters are weighted at "0".
During this initialisation phase the neutral emotion speaking style clusters are set to the weightings associated with the speaking style for the data.
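A minimal sketch of this initialisation step, assuming the full weight vector is simply the style weights concatenated with the emotion weights (an illustrative assumption, not a statement of the patent's exact layout), is:

```python
import numpy as np

def init_emotion_weights(num_style_clusters, num_emotion_clusters,
                         style_weights, emotion_index):
    """Initial weighting for one training utterance: the emotion clusters
    form a binary vector with a single 1 at the utterance's emotion, and
    the neutral style clusters keep the weights learned for its style."""
    emotion = np.zeros(num_emotion_clusters)
    emotion[emotion_index] = 1.0
    assert len(style_weights) == num_style_clusters
    return np.concatenate([style_weights, emotion])
```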
Next, the decision trees are built for each emotion cluster in step S457. Finally, the weights are re-estimated based on all of the data in step S459.
After the emotion clusters have been initialised as explained above, the Gaussian means and variances are re-estimated for all clusters, bias, style and emotion, in step S407.
Next, the weights for the emotion clusters are re-estimated as described above in step S409.
The decision trees are then re-computed in step S411. Next, the process loops back to step S407 and the model parameters, followed by the weightings in step S409, followed by reconstructing the decision trees in step S411, are estimated until convergence. In an embodiment, the loop S407-S409 is repeated several times.
Next, in step S413, the model variances and means are re-estimated for all clusters, bias, styles and emotion. In step S415 the weights are re-estimated for the speaking style clusters and the decision trees are rebuilt in step S417. The process then loops back to step S413 and this loop is repeated until convergence. Then the process loops back to step S407 and the loop concerning emotions is repeated until convergence. The process continues until convergence is reached for both loops jointly.
In a further embodiment, the system is used to adapt to a new attribute such as a new emotion.
This will be described with reference to figure 16.
A target voice is received in step S601, and data is collected for the voice speaking with the new attribute. First, the weightings for the neutral style clusters are adjusted to best match the target voice in step S603.
Then, a new emotion cluster is added to the existing emotion clusters for the new emotion in step S607. Next, the decision tree for the new cluster is initialised as described with relation to figure 12 from step S455 onwards. The weightings, model parameters and trees are then re-estimated and rebuilt for all clusters as described with reference to figure 13.
The above methods demonstrate a system which allows a computer generated head to output speech in a natural manner as the head can adopt and adapt to different expressions. The clustered form of the data allows a system to be built with a small footprint as the data to run the system is stored in a very efficient manner, also the system can easily adapt to new expressions as described above while requiring a relatively small amount of data.
The above has explained in detail how CAT-HMM is applied to render and animate the head.
As explained above, the face vector is comprised of a plurality of face parameters. One suitable model for supporting a vector is an active appearance model (AAM), although other statistical models may be used.
An AAM is defined on a mesh of V vertices. The shape of the model, s = (x1; y1; x2; y2; ...; xV; yV)^T, defines the 2D position (xi; yi) of each mesh vertex and is a linear model given by:

$$s = s_0 + \sum_{i=1}^{M} c_i\,s_i \qquad \text{Eqn. 2.1}$$

where s0 is the mean shape of the model, si is the ith mode of M linear shape modes and ci is its corresponding parameter, which can be considered to be a "weighting parameter". The shape modes and how they are trained will be described in more detail with reference to figure 19.
However, the shape modes can be thought of as a set of facial expressions. A shape for the face may be generated by a weighted sum of the shape modes where the weighting is provided by parameter ci. By defining the outputted expression in this manner it is possible for the face to express a continuum of expressions.
Colour values are then included in the appearance of the model, a = (r1; g1; b1; r2; g2; b2; ...; rP; gP; bP)^T, where (ri; gi; bi) is the RGB representation of the ith of the P pixels which project into the mean shape s0. Analogous to the shape model, the appearance is given by:

$$a = a_0 + \sum_{i=1}^{M} c_i\,a_i \qquad \text{Eqn. 2.2}$$

where a0 is the mean appearance vector of the model, and ai is the ith appearance mode.
In this embodiment, a combined appearance model is used and the parameters ci in equations 2.1 and 2.2 are the same and control both shape and appearance.
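As a minimal sketch of this combined model (array shapes and the function name are illustrative assumptions), a single parameter vector c drives both equations 2.1 and 2.2:

```python
import numpy as np

def generate_face(c, s0, S, a0, A):
    """Combined AAM: one parameter vector c drives both shape and appearance.
    s0: (2V,) mean shape,      S: (M, 2V) shape modes
    a0: (3P,) mean appearance, A: (M, 3P) appearance modes"""
    shape = s0 + c @ S          # equation 2.1
    appearance = a0 + c @ A     # equation 2.2
    return shape, appearance
```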
Figure 17 shows a schematic of such an AAM. Input into the model are the parameters ci in step S1001. These weights are then directed into both the shape model 1003 and the appearance model 1005.
Figure 17 demonstrates the modes s0, s1, ..., sM of the shape model 1003 and the modes a0, a1, ..., aM of the appearance model. The output 1007 of the shape model 1003 and the output 1009 of the appearance model are combined in step S1011 to produce the desired face image. The parameters which are input into this model can be used as the face vector referred to above in the description accompanying figure 2.
The global nature of AAMs leads to some of the modes handling variations which are due to both 3D pose change as well as local deformation.
In this embodiment AAM modes are used which correspond purely to head rotation or to other physically meaningful motions. This can be expressed mathematically as:

$$s = s_0 + \sum_{i=1}^{K} c_i\,s_i^{pose} + \sum_{i=K+1}^{M} c_i\,s_i^{deform} \qquad \text{Eqn. 2.3}$$

In this embodiment, a similar expression is also derived for appearance. However, the coupling of shape and appearance in AAMs makes this a difficult problem. To address this, during training, first the shape components which model {s_i^pose} are derived by recording a short training sequence of head rotation with a fixed neutral expression and applying PCA to the observed mean normalized shapes ŝ = s - s0. Next ŝ is projected into the pose variation space spanned by {s_i^pose} to estimate the parameters {c_i}, i = 1,...,K, in equation 2.3 above:

$$c_i = \frac{\hat{s}^\top s_i^{pose}}{\lVert s_i^{pose}\rVert^2} \qquad \text{Eqn. 2.4}$$

Having found these parameters, the pose component is removed from each training shape to obtain a pose normalized training shape s*:

$$s^{*} = s - \sum_{i=1}^{K} c_i\,s_i^{pose} \qquad \text{Eqn. 2.5}$$

If shape and appearance were indeed independent, then the deformation components could be found using principal component analysis (PCA) of a training set of shape samples normalized as in equation 2.5, ensuring that only modes orthogonal to the pose modes are found.
However, there is no guarantee that the parameters calculated using equation 2.4 are the same for the shape and appearance modes, which means that it may not be possible to reconstruct training examples using the model derived from them.
To overcome this problem, the mean of each ci of the appearance and shape parameters is computed using:

$$c_i = \frac{1}{2}\left(\frac{\hat{s}^\top s_i^{pose}}{\lVert s_i^{pose}\rVert^2} + \frac{\hat{a}^\top a_i^{pose}}{\lVert a_i^{pose}\rVert^2}\right) \qquad \text{Eqn. 2.6}$$

The model is then constructed by using these parameters in equation 2.5 and finding the deformation modes from samples of the complete training set.
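A rough sketch of the pose projection and removal of equations 2.4 and 2.5, under assumed flattened shape vectors (the function name and shapes are illustrative assumptions), is:

```python
import numpy as np

def remove_pose(s, s0, pose_modes):
    """Project a shape onto the pose modes (eqn 2.4) and subtract the
    pose component to obtain a pose-normalised shape (eqn 2.5).
    s, s0:      (2V,) shape and mean shape as flattened vectors
    pose_modes: (K, 2V) array of K pose shape modes."""
    s_hat = s - s0
    c = np.array([s_hat @ p / (p @ p) for p in pose_modes])   # eqn 2.4
    s_norm = s - c @ pose_modes                                # eqn 2.5
    return c, s_norm
```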
In further embodiments, the model is adapted to accommodate local deformations such as eye blinking. This can be achieved by a modified version of the method described above, in which modes that model blinking are learned from a video containing blinking with no other head motion.
Directly applying the method taught above for isolating pose to remove these blinking modes from the training set may introduce artifacts. The reason for this is apparent when considering the shape mode associated with blinking, in which the majority of the movement is in the eyelid.
This means that if the eyes are in a different position relative to the centroid of the face (for example if the mouth is open, lowering the centroid), then the eyelid is moved toward the mean eyelid position, even if this artificially opens or closes the eye. Instead of computing the parameters from absolute coordinates as in equation 2.6, relative shape coordinates are implemented using a Laplacian operator:

$$c_i = \frac{1}{2}\left(\frac{L(\hat{s})^\top L(s_i^{blink})}{\lVert L(s_i^{blink})\rVert^2} + \frac{\hat{a}^\top a_i^{blink}}{\lVert a_i^{blink}\rVert^2}\right) \qquad \text{Eqn. 2.7}$$

The Laplacian operator L() is defined on a shape sample such that the relative position, δi, of each vertex i within the shape can be calculated from its original position pi using

$$\delta_i = \sum_{j\in\mathcal{N}_i} \frac{p_i - p_j}{\lVert d_{ij}\rVert^2} \qquad \text{Eqn. 2.8}$$

where N_i is a one-neighbourhood defined on the AAM mesh and dij is the distance between vertices i and j in the mean shape. This approach correctly normalizes the training samples for blinking, as relative motion within the eye is modelled instead of the position of the eye within the face.
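A minimal sketch of the Laplacian operator of equation 2.8 (the data layout and function name are assumptions made for illustration) is:

```python
import numpy as np

def laplacian(shape, neighbours, mean_shape):
    """Relative vertex coordinates as in equation 2.8: each vertex is
    expressed relative to its mesh neighbours, normalised by the squared
    distance between the vertices in the mean shape.
    shape, mean_shape: (V, 2) float arrays of vertex positions
    neighbours: list of index lists, one per vertex."""
    delta = np.zeros_like(shape, dtype=float)
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            d_ij = np.linalg.norm(mean_shape[i] - mean_shape[j])
            delta[i] += (shape[i] - shape[j]) / (d_ij ** 2)
    return delta
```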
Further embodiments also accommodate the fact that different regions of the face can be moved nearly independently. It has been explained above that the modes are decomposed into pose and deformation components. This allows further separation of the deformation components according to the local region they affect. The model can be split into R regions and its shape can be modelled according to:
$$s = s_0 + \sum_{i=1}^{K} c_i\,s_i^{pose} + \sum_{j=1}^{R}\sum_{i\in I_j} c_i\,s_i^{j} \qquad \text{Eqn. 2.9}$$

where Ij is the set of component indices associated with region j. In one embodiment, modes for each region are learned by only considering a subset of the model's vertices according to manually selected boundaries marked in the mean shape. Modes are iteratively included up to a maximum number, by greedily adding the mode corresponding to the region which allows the model to represent the greatest proportion of the observed variance in the training set.
An analogous model is used for appearance. Linear blending is applied locally near the region boundaries. This approach is used to split the face into an upper and lower half. The advantage of this is that changes in mouth shape during synthesis cannot lead to artefacts in the upper half of the face. Since global modes are used to model pose, there is no risk of the upper and lower halves of the face having a different pose.
Figure 18 demonstrates the enhanced AAM as described above. As for the AAM of figure 17, the input weightings for the AAM of figure 18(a) can form a face vector to be used in the algorithm described with reference to figure 2.
However, here the input parameters ci are divided into parameters for pose which are input at S1051, parameters for blinking input at S1053, and parameters to model deformation in each region input at S1055. In figure 18, regions 1 to R are shown.
Next, these parameters are fed into the shape model 1057 and appearance model 1059. Here: the pose parameters are used to weight the pose modes 1061 of the shape model 1057 and the pose modes 1063 of the appearance model; the blink parameters are used to weight the blink mode 1065 of the shape model 1057 and the blink mode 1067 of the appearance model; and the regional deformation parameters are used to weight the regional deformation modes 1069 of the shape model 1057 and the regional deformation modes 1071 of the appearance model.
As for figure 17, a generated shape is output in step S1073 and a generated appearance is output in step S1075. The generated shape and generated appearance are then combined in step S1077 to produce the generated image.
Since the teeth and tongue are occluded in many of the training examples, the synthesis of these regions may cause significant artefacts. To reduce these artefacts, a fixed shape and texture for the upper and lower teeth is used. The displacements of these static textures are given by the displacement of a vertex at the centre of the upper and lower teeth respectively. The teeth are rendered before the rest of the face, ensuring that the correct occlusions occur.
Figure 18(b) shows an amendment to figure 18(a) where the static artefacts are rendered first.
After the shape and appearance have been generated in steps S1073 and S1075 respectively, the position of the teeth is determined in step S1081. In an embodiment, the teeth are determined to be at a position which is relative to a fixed visible point on the face. The teeth are then rendered by assuming a fixed shape and texture for the teeth in step S1083. Next, the rest of the face is rendered in step S1085.
Figure 19 is a flow diagram showing the training of the system in accordance with an embodiment of the present invention. Training images are collected in step S1301. In one embodiment, the training images are collected covering a range of expressions. For example, audio and visual data may be collected by using cameras arranged to capture the speaker's facial expression and microphones to collect audio. The speaker can read out sentences and will receive instructions on the emotion or expression which needs to be used when reading a particular sentence.
The data is selected so that it is possible to select a set of frames from the training images which correspond to a set of common phonemes in each of the emotions. In some embodiments, about 7000 training sentences are used. However, much of this data is used to train the speech model to produce the speech vector as previously described.
In addition to the training data described above, further training data is captured to isolate the modes due to pose change. For example, video of the speaker rotating their head may be captured while keeping a fixed neutral expression.
Also, video is captured of the speaker blinking while keeping the rest of their face still.
In step S1303, the images for building the AAM are selected. In an embodiment, only a relatively small number of frames are required to build the AAM. The images are selected which allow data to be collected over a range of frames where the speaker exhibits a wide range of emotions. For example, frames may be selected where the speaker demonstrates different expressions such as different mouth shapes, eyes open, closed, wide open etc. In one embodiment, frames are selected which correspond to a set of common phonemes in each of the emotions to be displayed by the head.
In further embodiments, a larger number of frames could be used, for example all of the frames in a long video sequence. In a yet further embodiment, frames may be selected where the speaker has performed a set of facial expressions which roughly correspond to separate groups of muscles being activated.
In step S1305, the points of interest on the frames selected in step S1303 are labelled. In an embodiment this is done by visually identifying key points on the face, for example eye corners, mouth corners and moles or blemishes. Some contours may also be labelled (for example, face and hair silhouette and lips) and key points may be generated automatically from these contours by equidistant subdivision of the contours into points.
In other embodiments, the key points are found automatically using trained key point detectors.
In a yet further embodiment, key points are found by aligning multiple face images automatically. In a yet further embodiment, two or more of the above methods can be combined with hand labelling, so that a semi-automatic process is provided by inferring some of the missing information from labels supplied by a user during the process.
In step S1307, the frames which were captured to model pose change are selected and an AAM is built to model pose alone.
Next, in step S1309, the frames which were captured to model blinking are selected and AAM modes are constructed to model blinking alone.
Next, a further AAM is built using all of the frames selected, including the ones used to model pose and blink, but before building the model the effect of these modes is removed from the data as described above.
Frames where the AAM has performed poorly are selected. These frames are then hand labelled and added to the training set. The process is repeated until there is little further improvement from adding new images.
The AAM has been trained once all AAM parameters for the modes (pose, blinking and deformation) have been established.
Figure 20 is a schematic of how the AAM is constructed. The training images 1361 are labelled and a shape model 1363 is derived. The texture 1365 is also extracted for each face model.
Once the AAM modes and parameters are calculated as explained above, the shape model 1363 and the texture model 1365 are combined to generate the face 1367.
In one embodiment, the AAM parameters and their first time derivatives are used as the input for a CAT-HMM training algorithm as previously described.
In a further embodiment, the spatial domain of a previously trained AAM is extended to further domains without affecting the existing model. For example, it may be employed to extend a model that was trained only on the face region to include hair and ear regions in order to add more realism.
A set of N training images for an existing AAM are known, as are the original model coefficient vectors $\{c_i\}_{i=1}^{N}$ for these images. The regions to be included in the model are then labelled, resulting in a new set of N training shapes $\{\tilde{s}_i^{ext}\}_{i=1}^{N}$ and appearances $\{\tilde{a}_i^{ext}\}_{i=1}^{N}$. Given the original model with M modes, the new shape modes $\{s_i^{ext}\}_{i=0}^{M}$ should satisfy the following constraint:

$$\begin{bmatrix} (\tilde{s}_1^{ext})^\top \\ \vdots \\ (\tilde{s}_N^{ext})^\top \end{bmatrix} = \begin{bmatrix} 1 & c_1^\top \\ \vdots & \vdots \\ 1 & c_N^\top \end{bmatrix}\begin{bmatrix} (s_0^{ext})^\top \\ \vdots \\ (s_M^{ext})^\top \end{bmatrix} \qquad \text{Eqn. 2.10}$$

which states that the new modes can be combined, using the original model coefficients, to reconstruct the extended training shapes $\tilde{s}_i^{ext}$. Assuming that the number of training samples N is larger than the number of modes M, the new shape modes can be obtained as the least-squares solution. New appearance modes are found analogously.
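A minimal sketch of this least-squares solution (the matrix layout and function name are assumptions made for illustration, not the patent's implementation) is:

```python
import numpy as np

def extend_modes(C, S_ext):
    """Least-squares solution for extended shape modes (equation 2.10).
    C:     (N, M) original model coefficients, one row per training image
    S_ext: (N, 2V') labelled extended training shapes
    Returns an (M+1, 2V') array of extended modes, row 0 being the new mean shape."""
    C1 = np.hstack([np.ones((C.shape[0], 1)), C])        # prepend the mean's coefficient of 1
    modes, *_ = np.linalg.lstsq(C1, S_ext, rcond=None)   # least-squares fit
    return modes
```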
To illustrate the above, an experiment was conducted. Here, a corpus of 6925 sentences divided between 6 emotions (neutral, tender, angry, afraid, happy and sad) was used. From the data, 300 sentences were held out as a test set and the remaining data was used to train the speech model.
The speech data was parameterized using a standard feature set consisting of 45-dimensional Mel-frequency cepstral coefficients, log-F0 (pitch) and 25 band aperiodicities, together with the first and second time derivatives of these features. The visual data was parameterized using the different AAMs described below. Several AAMs were trained in order to evaluate the improvements obtained with the proposed extensions. In each case the AAM was controlled by 17 parameters and the parameter values and their first time derivatives were used in the CAT model.
The first model used, AAMbase, was built from 71 training images in which 47 facial keypoints were labeled by hand. Additionally, contours around both eyes, the inner and outer lips, and the edge of the face were labeled, and points were sampled at uniform intervals along their length.
The second model, AAMdecomp, separates both 3D head rotation (modeled by two modes) and blinking (modeled by one mode) from the deformation modes. The third model, AAMregions, is built in the same way as AAMdecomp except that 8 modes are used to model the lower half of the face and 6 to model the upper half. The final model, AAMfull, is identical to AAMregions except for the mouth region, which is modified to handle static shapes differently.
In the first experiment the reconstruction error of each AAM was quantitatively evaluated on the complete data set of 6925 sentences, which contains approximately 1 million frames. The reconstruction error was measured as the L2 norm of the per-pixel difference between an input image warped onto the mean shape of each AAM and the generated appearance.
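For clarity, this error metric can be sketched as follows (a simplified illustration assuming both inputs are flattened pixel arrays in the mean-shape frame; the function name is an assumption):

```python
import numpy as np

def reconstruction_error(warped_image_pixels, generated_appearance):
    """Per-frame reconstruction error: L2 norm of the per-pixel difference
    between the input image warped onto the mean shape and the appearance
    generated by the AAM."""
    return np.linalg.norm(warped_image_pixels - generated_appearance)
```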
Figure 21(a) shows how reconstruction errors vary with the number of AAM modes. It can be seen that while with few modes AAMbase has the lowest reconstruction error, as the number of modes increases the difference in error decreases. In other words, the flexibility that semantically meaningful modes provide does not come at the expense of reduced tracking accuracy. In fact, the modified models were found to be more robust than the base model, having a lower worst case error on average, as shown in figure 21(b). This is likely due to AAMregions and AAMdecomp being better able to generalize to unseen examples as they do not overfit the training data by learning spurious correlations between different face regions.
A number of large-scale user studies were performed in order to evaluate the perceptual quality of the synthesized videos. The experiments were distributed via a crowd sourcing website, presenting users with videos generated by the proposed system.
In the first study the ability of the proposed VTTS system to express a range of emotions was evaluated. Users were presented either with video or audio clips of a single sentence from the test set and were asked to identify the emotion expressed by the speaker, selecting from a list of six emotions. The synthetic video data for this evaluation was generated using the AAMregions model. It is also compared with versions of synthetic video only and synthetic audio only, as well as cropped versions of the actual video footage. In each case 10 sentences in each of the six emotions were evaluated by 20 people, resulting in a total sample size of 1200.
The average recognition rates are 73% for the captured footage, 77% for our generated video (with audio), 52% for the synthetic video only and 68% for the synthetic audio only. These results indicate that the recognition rates for synthetically generated results are comparable with, and even slightly higher than, those for the real footage. This may be due to the stylization of the expression in the synthesis. Confusion matrices between the different expressions are shown in figure 22. Tender and neutral expressions are most easily confused in all cases. While some emotions are better recognized from audio only, the overall recognition rate is higher when using both cues.
To determine the qualitative effect of the AAM on the final system, preference tests were performed on systems built using the different AAMs. For each preference test, 10 sentences in each of the six emotions were generated with two models rendered side by side. Each pair of AAMs was evaluated by 10 users, who were asked to select between the left model, the right model or having no preference (the order of the model renderings was switched between experiments to avoid bias), resulting in a total of 600 pairwise comparisons per preference test.
In this experiment the videos were shown without audio in order to focus on the quality of the face model. From table 1, shown in figure 23, it can be seen that AAMfull achieved the highest score, and that AAMregions is also preferred over the standard AAM. This preference is most pronounced for expressions such as angry, where there is a large amount of head motion, and less so for emotions such as neutral and tender which do not involve significant movement of the head.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.
GB1301583.9A 2013-01-29 2013-01-29 A computer generated head Active GB2510200B (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
GB1301583.9A GB2510200B (en) 2013-01-29 2013-01-29 A computer generated head
EP14153137.6A EP2760023A1 (en) 2013-01-29 2014-01-29 A computer generated head
US14/167,238 US9959657B2 (en) 2013-01-29 2014-01-29 Computer generated head
CN201410050837.7A CN103971393A (en) 2013-01-29 2014-01-29 Computer generated head
JP2014014924A JP2014146339A (en) 2013-01-29 2014-01-29 Computer generation head
JP2015194171A JP6109901B2 (en) 2013-01-29 2015-09-30 Computer generated head

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1301583.9A GB2510200B (en) 2013-01-29 2013-01-29 A computer generated head

Publications (3)

Publication Number Publication Date
GB201301583D0 GB201301583D0 (en) 2013-03-13
GB2510200A true GB2510200A (en) 2014-07-30
GB2510200B GB2510200B (en) 2017-05-10

Family

ID=47890966

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1301583.9A Active GB2510200B (en) 2013-01-29 2013-01-29 A computer generated head

Country Status (5)

Country Link
US (1) US9959657B2 (en)
EP (1) EP2760023A1 (en)
JP (2) JP2014146339A (en)
CN (1) CN103971393A (en)
GB (1) GB2510200B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2516965B (en) 2013-08-08 2018-01-31 Toshiba Res Europe Limited Synthetic audiovisual storyteller
US9378735B1 (en) * 2013-12-19 2016-06-28 Amazon Technologies, Inc. Estimating speaker-specific affine transforms for neural network based speech recognition systems
CN106327555A (en) * 2016-08-24 2017-01-11 网易(杭州)网络有限公司 Method and device for obtaining lip animation
JP6767224B2 (en) 2016-09-29 2020-10-14 株式会社東芝 Communication devices, communication methods, and communication programs
US10554957B2 (en) * 2017-06-04 2020-02-04 Google Llc Learning-based matching for active stereo systems
US10586368B2 (en) * 2017-10-26 2020-03-10 Snap Inc. Joint audio-video facial animation system
CN107977674B (en) * 2017-11-21 2020-02-18 Oppo广东移动通信有限公司 Image processing method, image processing device, mobile terminal and computer readable storage medium
CN112823380A (en) * 2018-05-24 2021-05-18 华纳兄弟娱乐公司 Matching mouth shapes and actions in digital video with substitute audio
KR102079453B1 (en) * 2018-07-31 2020-02-19 전자부품연구원 Method for Audio Synthesis corresponding to Video Characteristics
CN110288077B (en) * 2018-11-14 2022-12-16 腾讯科技(深圳)有限公司 Method and related device for synthesizing speaking expression based on artificial intelligence
CN110035271B (en) * 2019-03-21 2020-06-02 北京字节跳动网络技术有限公司 Fidelity image generation method and device and electronic equipment
US10957304B1 (en) * 2019-03-26 2021-03-23 Audible, Inc. Extracting content from audio files using text files
WO2020256471A1 (en) * 2019-06-21 2020-12-24 주식회사 머니브레인 Method and device for generating speech video on basis of machine learning
CN110347867B (en) * 2019-07-16 2022-04-19 北京百度网讯科技有限公司 Method and device for generating lip motion video
US11151979B2 (en) 2019-08-23 2021-10-19 Tencent America LLC Duration informed attention network (DURIAN) for audio-visual synthesis
CN111415677B (en) * 2020-03-16 2020-12-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN113468883A (en) * 2020-03-30 2021-10-01 株式会社理光 Fusion method and device of position information and computer readable storage medium
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium
CN112907706A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Multi-mode-based sound-driven animation video generation method, device and system
WO2023287416A1 (en) * 2021-07-15 2023-01-19 Hewlett-Packard Development Company, L.P. Rendering avatar to have viseme corresponding to phoneme within detected speech

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0992933A2 (en) * 1998-10-09 2000-04-12 Mitsubishi Denki Kabushiki Kaisha Method for generating realistic facial animation directly from speech utilizing hidden markov models

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1144172C (en) * 1998-04-30 2004-03-31 松下电器产业株式会社 Sounder based on eigenfunction sound including maxium likelihood method and environment adaption thereof
US6343267B1 (en) * 1998-04-30 2002-01-29 Matsushita Electric Industrial Co., Ltd. Dimensionality reduction for speaker normalization and speaker and environment adaptation using eigenvoice techniques
US6366885B1 (en) 1999-08-27 2002-04-02 International Business Machines Corporation Speech driven lip synthesis using viseme based hidden markov models
JP3822828B2 (en) * 2002-03-20 2006-09-20 沖電気工業株式会社 Three-dimensional image generation apparatus, image generation method thereof, and computer-readable recording medium recording the image generation program
CN1320497C (en) * 2002-07-03 2007-06-06 中国科学院计算技术研究所 Statistics and rule combination based phonetic driving human face carton method
EP1671277A1 (en) * 2003-09-30 2006-06-21 Koninklijke Philips Electronics N.V. System and method for audio-visual content synthesis
US7613613B2 (en) * 2004-12-10 2009-11-03 Microsoft Corporation Method and system for converting text to lip-synchronized speech in real time
JP2007006182A (en) 2005-06-24 2007-01-11 Fujifilm Holdings Corp Image processing apparatus and method therefor, and program
US7784580B2 (en) 2005-11-18 2010-08-31 Toyota Jidosha Kabushiki Kaisha Fuel supply system component protective construction
JP4543263B2 (en) 2006-08-28 2010-09-15 株式会社国際電気通信基礎技術研究所 Animation data creation device and animation data creation program
US8224652B2 (en) * 2008-09-26 2012-07-17 Microsoft Corporation Speech and text driven HMM-based body animation synthesis
KR101541907B1 (en) 2008-10-14 2015-08-03 삼성전자 주식회사 Apparatus and method for generating face character based on voice
US8260038B2 (en) 2009-02-25 2012-09-04 Seiko Epson Corporation Subdivision weighting for robust object model fitting
US8204301B2 (en) 2009-02-25 2012-06-19 Seiko Epson Corporation Iterative data reweighting for balanced model learning
WO2010142928A1 (en) * 2009-06-10 2010-12-16 Toshiba Research Europe Limited A text to speech method and system
US9728203B2 (en) * 2011-05-02 2017-08-08 Microsoft Technology Licensing, Llc Photo-realistic synthesis of image sequences with lip movements synchronized with speech
WO2012154618A2 (en) 2011-05-06 2012-11-15 Seyyer, Inc. Video generation based on text
GB2501062B (en) 2012-03-14 2014-08-13 Toshiba Res Europ Ltd A text to speech method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0992933A2 (en) * 1998-10-09 2000-04-12 Mitsubishi Denki Kabushiki Kaisha Method for generating realistic facial animation directly from speech utilizing hidden markov models

Also Published As

Publication number Publication date
GB2510200B (en) 2017-05-10
JP2016042362A (en) 2016-03-31
US20140210830A1 (en) 2014-07-31
JP6109901B2 (en) 2017-04-05
GB201301583D0 (en) 2013-03-13
US9959657B2 (en) 2018-05-01
CN103971393A (en) 2014-08-06
JP2014146339A (en) 2014-08-14
EP2760023A1 (en) 2014-07-30

Similar Documents

Publication Publication Date Title
US9959657B2 (en) Computer generated head
US9361722B2 (en) Synthetic audiovisual storyteller
US11144597B2 (en) Computer generated emulation of a subject
US20140210831A1 (en) Computer generated head
Busso et al. Rigid head motion in expressive speech animation: Analysis and synthesis
US9911218B2 (en) Systems and methods for speech animation using visemes with phonetic boundary context
JP2022518721A (en) Real-time generation of utterance animation
Wan et al. Photo-realistic expressive text to talking head synthesis.
Dahmani et al. Conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis
Xie et al. A statistical parametric approach to video-realistic text-driven talking avatar
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
Cosker et al. Laughing, crying, sneezing and yawning: Automatic voice driven animation of non-speech articulations
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
WO2022197569A1 (en) Three-dimensional face animation from speech
Busso et al. Learning expressive human-like head motion sequences from speech
D’alessandro et al. Reactive statistical mapping: Towards the sketching of performative control with data
Sato et al. Synthesis of photo-realistic facial animation from text based on HMM and DNN with animation unit
Melenchón et al. Emphatic visual speech synthesis
Edge et al. Model-based synthesis of visual speech movements from 3D video
Rademan et al. Improved visual speech synthesis using dynamic viseme k-means clustering and decision trees.
Wei et al. Speech animation based on Chinese mandarin triphone model
Xu et al. Low-rank Active Learning for Generating Speech-drive Human Face Animation
Sakhare et al. Intelligent Conversational Agents Based Custom Question Answering System
Shinozaki A Study on 2D Photo-Realistic Facial Animation Generation Using 3D Facial Feature Points and Deep Neural Networks