WO2009030636A1 - Method and system for synthesis of non-primary facial expressions - Google Patents

Method and system for synthesis of non-primary facial expressions

Info

Publication number
WO2009030636A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial
expression
primary
facial image
face
Prior art date
Application number
PCT/EP2008/061319
Other languages
French (fr)
Inventor
John Ghent
John Mcdonald
Original Assignee
National University Of Ireland, Maynooth
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University Of Ireland, Maynooth filed Critical National University Of Ireland, Maynooth
Publication of WO2009030636A1 publication Critical patent/WO2009030636A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/175 Static expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships


Abstract

The present invention relates to a method for synthesis of a non-primary facial expression. The method comprises acquiring a facial image and calculating a shape and a texture for each of a plurality of primary facial expressions for the facial image. The method further comprises the step of generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures. The method also comprises selecting an expression vector corresponding to a non-primary facial expression to be synthesised and applying the expression vector to the model to synthesise a facial image having the selected non-primary facial expression. The invention also relates to a system for synthesis of a non-primary facial expression and to a method for generating an expression look-up table.

Description

Title
METHOD AND SYSTEM FOR SYNTHESIS OF NON-PRIMARY FACIAL EXPRESSIONS
Field of the Invention
The present invention relates to methods and systems for synthesising images of facial expressions, and in particular, to a method and system for synthesising photo-realistic images of facial expressions of an individual from a neutral image of the individual.
Background to the Invention
Techniques for generating images representing different facial expressions of an individual are known. These techniques have applications in many areas of technology, in particular, in animation and computer gaming applications.
One technique used to synthesise facial expressions is to manually alter the facial expression of an image of an individual using image processing software packages. However, the process of altering an expression can take several hours and requires a highly skilled graphic artist. Furthermore, it is difficult to output a photo-realistic image using this technique.
Several attempts have been made to automate facial expression synthesis over the past 10 years. One approach is described in "Geometric driven photorealistic facial expression synthesis", Q. Zhang, Z. Liu, B. Guo, and H. Shum, SIGGRAPH Symposium on Computer Animation, 2003. This approach takes several images of a single face, warps each image to a mean shape, divides the face into 14 regions and uses these regions as examples of texture change for each expression. A difference vector of feature points between neutral and non-neutral expressions is then calculated. This difference vector is used to synthesise a shape depicting a specific expression. However, this approach has limitations, as it is subject- or face-specific and computationally expensive.
Parameterisation techniques change the appearance of an individual's expression by varying independent parameter values as described in "Parameterized facial expression synthesis based on mpeg-4", A. Raouzaiou, N. Tsapatsoulis, and K. Karpouzis, Journal on Applied Signal Processing, 2002. The advantage of this approach is the ability to synthesise numerous expressions by combining different parameter values. However, this technique is dependent on the facial mesh topology and therefore completely generic parameterisation is not feasible.
Realistic 3D synthetic facial expressions can be modelled using a physics-based approach. This technique creates geometric models of the eyes, eyelids, teeth (incisor, canine and molar teeth), and hair and neck as described in "Realistic modelling for facial animation", Y. Lee, D. Terzopoulos, and K. Waters, ACM, 1995. This approach can produce excellent results. However, the images are not photorealistic and it requires intensive calculations to complete a synthetic expression. Similarly, the vast majority of 3D approaches to facial expression synthesis are computationally complex and can require extensive calculations.
There have been several attempts to model muscular behaviour for facial expression synthesis. However, these approaches lack the ability to model the exact structure of the human face. Muscle movements are simulated in the form of splines, wires and free form deformations. An example of free form deformations can be seen in "Model based face reconstruction for animation", Y. Lee, P. Kalra and N. Magnenat-Thalmann, Proceedings of Multimedia Modelling, 1997. A disadvantage of this approach is that the synthesised images are not photo-realistic.
US Patent No. 6,940,454 describes a method for generating facial expression values. However, the outputs are not photo-realistic images but a coded description of the audio and visual data contained in the input sequence. US Patent No. 6,088,040 describes a facial image conversion method. This method takes images of a number of expressions from one face and synthesises expressions for that face. In US Patent No. 7,123,263, a method for modelling a 3-dimensional facial system is described. However, this method requires information from the user to guess which 3-dimensional model of a face to use and hence is not fully automatic. Furthermore, this approach is more complex due to its 3-dimensional nature. A method for synthesis of primary facial expressions (fear, anger, surprise, happiness, sadness and disgust) is described in "A Computational Model of Facial Expression", J. Ghent, PhD Thesis, 2005 and "Photo-realistic facial expression synthesis", J. Ghent and J. McDonald, Image and Vision Computing Journal, 2005. This method involves generation of a subject-independent matrix-based shape and texture model for each of the primary facial expressions.
A shape model of a facial expression describes how the outline shape of the face, as well as the position and shape of the features on the face, should change to portray that expression. A texture model of a facial expression describes how the image pixel values (that is, the intensity or darkness of each pixel of the facial image) should change to portray the expression. Thus, a shape and texture model may be used to accurately predict how the shape of the face, the shape and position of the features on the face and the shadows, wrinkles and creases on the face will change for a particular expression.
When a facial expression is to be generated for an individual face, the subject- independent model is applied to a neutral image of the face to generate an image with the required facial expression. However, this subject-independent model is only useful for synthesising primary facial expressions, and is inadequate for synthesis of subtle, non-primary facial expressions (that is, expressions other than the six primary expressions). Furthermore, because the models for each of the expressions are entirely separate, this approach provides no means of interpolating between expressions, and synthesis of hybrid or dynamic facial expressions is therefore not possible.
Hybrid facial expressions are a collection of expressions at different intensities and/or expressions with a mixture of two or more of the primary facial expressions. The term hybrid expression includes intermediate expressions exhibited during the transition from one primary expression to another. Dynamic facial expressions are facial expressions which change over time, for example, in an animation or movie. For example, to create a movie of an individual forming a smile from a surprised expression would require a method of synthesising dynamic and hybrid facial expressions.
Summary of the Invention
According to a first aspect of the present invention, there is provided a method for synthesis of a non-primary facial expression, comprising: acquiring a facial image; calculating a shape and a texture for each of a plurality of primary facial expressions for the facial image; generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures; selecting an expression vector corresponding to a non-primary facial expression to be synthesised; applying the expression vector to the model to synthesise a facial image having the selected non-primary facial expression.
An advantage of this arrangement is that photo-realistic non-primary facial expressions may be synthesised using a single acquired facial image. The term non-primary facial expression includes hybrid and dynamic facial expressions. Because this method provides a single shape and texture model for each subject, rather than for each expression, it is possible to interpolate between the primary expressions, thereby allowing synthesis of non-primary facial expressions.
Preferably, the acquired facial image portrays a substantially neutral expression. This allows an accurate result to be achieved.
Preferably, the subject-specific model of the face is a matrix-based model. The step of applying the expression vector to the model may comprise multiplying the expression vector by the matrix to obtain an output vector that describes the selected non-primary facial expression. Each of the columns of the matrix may correspond to a primary facial expression. The output vector may be a weighted linear combination of the columns of the matrix. The output vector thus represents a weighted combination of each of the primary facial expressions.

The method may include the step of generating an estimate of the size and location of the face within the acquired facial image. The method may further include the step of scaling and aligning the estimate to an initial mean face shape. The scaling and aligning may be done using Procrustes alignment. Procrustes alignment is a standard algorithm for aligning two shapes to each other by adjusting only the scale, rotation, and translation, so that the actual shape remains unchanged. The initial mean face shape may be pre-calculated based on a collection of facial images.
The method may include the step of locating at least one feature of the face. This step may include locating features such as the eyes, eyebrows, pupils, nose, and/or mouth. The locating step may include identifying a plurality of landmark points associated with at least one feature of the face.
The step of calculating the shape and texture of the plurality of primary facial expressions for the facial image may comprise applying subject-independent Facial Expression Shape Model and Facial Expression Texture Model algorithms to the facial image to calculate a shape and texture for the face for each of the primary facial expressions.
The method may further include the step of aligning the calculated shapes for each of the primary expressions to each other. The alignment may be done using Procrustes alignment. A new mean face shape may be calculated from the aligned shapes. The method may also include the step of warping each of the calculated textures to the new mean face shape. The warping may be done using a piece-wise affine transformation algorithm.
The method may further include the step of scaling the synthesised face shape to its original size (thus creating a new synthesised shape vector) and warping the synthesised texture to the new synthesised shape vector. The method may further comprise combining the synthesised facial image with the acquired facial image. The step of combining the synthesised facial image with the acquired facial image may comprise: superimposing the synthesised facial image over the acquired facial image; and blending the synthesised facial image with the acquired facial image.
According to another aspect of the invention, there is provided a method for generating an expression look-up table, comprising: analysing a plurality of images, each of which depicts a facial expression; for each image, determining the relationship between a neutral expression and the facial expression depicted therein to create an expression vector associated with the facial expression; storing each expression vector indexed by an expression description descriptive of the associated facial expression.
The expression look-up table may be used to synthesise a facial expression in accordance with the method described above. An expression vector corresponding to the expression to be synthesised may be selected from the expression look-up table and applied to the subject-specific model to synthesise a facial image having the desired facial expression.
The expression look-up table may be generated once, during a set-up phase, using images of a single subject (or of a plurality of subjects). The expression vectors generated from these images may then be applied to subject- specific models of other subjects in accordance with the method described above.
According to a further aspect of the present invention, there is provided a system for synthesis of a non-primary facial expression, comprising: means for acquiring a facial image; means for calculating a shape and texture of a plurality of primary facial expressions for the facial image; means for generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures; means for selecting an expression vector corresponding to a non-primary facial expression to be synthesised; means for applying the expression vector to the model to synthesise a facial image having the selected non-primary facial expression.
Brief Description of the Drawings
Figure 1 is a flow chart for a method of synthesis of a non-primary facial expression in accordance with the present invention;
Figure 2 shows a model of a face for use in the method of the invention;
Figure 3 is a flow chart for a method of generation of a facial expression shape model according to the present invention; and
Figure 4 is a flow chart for a method of generation of a facial expression texture model according to the present invention.
Detailed Description of the Drawings
A method for synthesis of a non-primary facial expression is shown in Figure 1. Step 1 comprises acquiring a facial image. The image acquisition is performed using a standard digital camera. The acquired image preferably portrays the subject with a neutral facial expression. In step 2, an estimate of the size and location of the face within the image is generated. The face is assumed to be free from occlusions and located so that it is the most prominent object in the image. These constraints ensure that it can be found using standard statistical methods.
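By way of illustration only, the following Python sketch shows one way the size-and-location estimate of step 2 might be obtained. The patent does not name a detector ("standard statistical methods"), so the use of OpenCV's Haar-cascade face detector here is an assumption rather than part of the invention:

    # Hypothetical sketch of step 2: estimate the size and location of the face.
    # The Haar cascade is a stand-in for the unspecified "standard statistical methods".
    import cv2

    def estimate_face_box(image_path):
        image = cv2.imread(image_path)  # acquired facial image (step 1)
        grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        boxes = detector.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            raise ValueError("no face found; the face is assumed to be unoccluded and prominent")
        # Keep the largest detection, consistent with the prominence assumption.
        x, y, w, h = max(boxes, key=lambda box: box[2] * box[3])
        return x, y, w, h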
In step 3, the estimate is scaled and aligned to a predetermined mean face shape using Procrustes alignment. The mean face shape is an average face shape calculated based on a collection of facial images. 122 landmark points located around the eye, the eyebrow, the pupil, the nose, and the mouth represent each shape, i.e. the acquired face shape and the mean face shape. Procrustes alignment is used to minimise the weighted sum of the squared distances between the corresponding landmark points. In step 4 of the method, an active shape model search is performed to locate all features of the face (eyes, eyebrows, pupils, nose, and mouth). The features of the face are located using the landmark points discussed above. For example, eight landmark points around the eye accurately describe the location of the eye on the face. The landmark points are used as inputs to the subsequent steps of the method.
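By way of illustration, an ordinary (unweighted) Procrustes alignment of one 122-point landmark set to another can be sketched as follows; the weighted variant described above additionally weights each squared distance per landmark. The sketch assumes landmarks stored as N x 2 NumPy arrays and is not taken from the patent itself:

    import numpy as np

    def procrustes_align(shape, target):
        """Align `shape` (N x 2 landmarks) to `target` using only scale,
        rotation and translation, minimising the sum of squared distances.
        Unweighted sketch; the patent uses a weighted sum over 122 points."""
        mu_s, mu_t = shape.mean(axis=0), target.mean(axis=0)
        s_c, t_c = shape - mu_s, target - mu_t
        # Optimal rotation from the SVD of the cross-covariance matrix.
        u, _, vt = np.linalg.svd(s_c.T @ t_c)
        rotation = u @ vt
        if np.linalg.det(rotation) < 0:  # avoid reflections
            u[:, -1] *= -1
            rotation = u @ vt
        scale = np.trace((s_c @ rotation).T @ t_c) / np.trace(s_c.T @ s_c)
        return scale * (s_c @ rotation) + mu_t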
Step 5 comprises calculating the shape and texture of a plurality of primary facial expressions for the facial image. This is done using the techniques described in "A Computational Model of Facial Expression", J. Ghent, PhD Thesis, 2005 and "Photo-realistic facial expression synthesis", J. Ghent and J. McDonald, Image and Vision Computing Journal, 2005.
As shown in Figures 3 and 4, subject-independent shape and texture models, denoted the Facial Expression Shape Model (FESM) and the Facial Expression Texture Model (FETM) respectively, are used to calculate the shape and the texture of the six primary facial expressions (fear, anger, surprise, happiness, sadness and disgust) for the face. The FESM and FETM are generated (step 14) from a database of images of faces depicting each of the primary facial expressions as described by the Facial Action Coding System (FACS). In steps 5a and 5b, the FESM and FETM algorithms are applied to the acquired facial image to calculate a shape and texture for the face for each of the primary facial expressions.
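The construction of the FESM and FETM themselves is given in the cited Ghent publications rather than in this document. Purely as a hedged illustration of the general idea of a statistical model learned from a database of FACS-coded expression images, the following builds a generic PCA point-distribution model over aligned landmark vectors; it is a stand-in sketch under that assumption, not the published FESM algorithm:

    import numpy as np

    def build_shape_model(aligned_shapes, variance_kept=0.95):
        """Generic PCA point-distribution model over aligned landmark vectors.
        `aligned_shapes` is an (n_images, 244) array: 122 (x, y) points per face.
        Illustrative stand-in for the subject-independent shape model of step 14."""
        mean_shape = aligned_shapes.mean(axis=0)
        centred = aligned_shapes - mean_shape
        _, singular_values, components = np.linalg.svd(centred, full_matrices=False)
        explained = (singular_values ** 2) / np.sum(singular_values ** 2)
        n_modes = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
        return mean_shape, components[:n_modes]  # mean + principal modes of variation

    def project_shape(shape, mean_shape, modes):
        """Express a face shape as coefficients of the model's modes."""
        return modes @ (shape - mean_shape)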
In step 6, the calculated shapes for each of the primary expressions are aligned to each other using Procrustes alignment and a new mean face shape is calculated. In step 7, each of the calculated textures is warped to the newly calculated mean face shape using a piece-wise affine transformation algorithm.
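A minimal sketch of the piece-wise affine warp of step 7, assuming scikit-image is available (the patent does not name an implementation) and that landmarks are stored as 122 x 2 arrays in (x, y) order:

    import numpy as np
    from skimage.transform import PiecewiseAffineTransform, warp

    def warp_texture_to_mean(image, shape_points, mean_shape_points):
        """Warp `image` so that its landmarks `shape_points` move onto
        `mean_shape_points`. Piece-wise affine: each triangle of the landmark
        mesh is mapped by its own affine transform."""
        transform = PiecewiseAffineTransform()
        # warp() expects the inverse mapping (output coordinates to input
        # coordinates), so the transform is estimated from the mean shape
        # back to the subject's shape.
        transform.estimate(mean_shape_points, shape_points)
        return warp(image, transform, output_shape=image.shape[:2])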
Step 8 comprises generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures. As shown in Figures 3 and 4, this is achieved by generating a shape and texture model that contains only the neutral expression and the 6 primary facial expressions for the individual face. The model is a matrix-based model denoted as iSpace (or identity-Space). The iSpace is generated using the warped textures and the aligned shapes generated in steps 6 and 7. iSpace is a unified expression model of the face which allows primary and non-primary (that is, hybrid or dynamic) expressions to be generated for a specific face.
Thus, a subject-specific FESM and FETM are generated for the face, denoted iFESM and iFETM respectively. Together, the iFESM and iFETM form the iSpace for the face as shown in Figure 2. The iSpace allows for the synthesis of non-primary facial expressions that retain the identity of an individual and, by interpolation, the generation of dynamic expression image sequences of that individual. The primary facial expressions form the basis of the iSpace and the iSpace consists of all linear combinations of this basis. Particular combinations will result in particular non-primary facial expressions.
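By way of illustration, the iSpace matrix can be pictured as one column per expression, each column a concatenated shape-and-texture vector built from the aligned shapes and warped textures of steps 6 and 7. The ordering below (neutral followed by the six primary expressions) is an assumption made for the sketch and is not specified in the patent:

    import numpy as np

    PRIMARY_ORDER = ["neutral", "fear", "anger", "surprise", "happiness", "sadness", "disgust"]

    def build_ispace(shapes, textures):
        """Stack one concatenated shape-and-texture column per expression.
        `shapes` and `textures` are dicts keyed by the names in PRIMARY_ORDER;
        the result is the M x 7 iSpace matrix S_I whose columns are the basis
        vectors of the subject-specific expression space."""
        columns = [np.concatenate([shapes[name].ravel(), textures[name].ravel()])
                   for name in PRIMARY_ORDER]
        return np.column_stack(columns)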
In step 9, an expression vector corresponding to a non-primary facial expression to be synthesised is selected. The expression vector is selected from an Expression Code Book (ECB). The ECB is a look-up table that relates high-level expression descriptions (e.g. pleasantly surprised) to 7-dimensional expression vectors. The ECB is generated by analysing a plurality of images, each of which depicts a facial expression. For each image, the relationship between a neutral expression and the facial expression depicted therein is determined to create an expression vector associated with the facial expression. The expression vectors are stored so that they may be indexed by an expression description descriptive of the associated facial expression.
Essentially, the ECB provides coordinates of particular expressions within iSpace.
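A toy illustration of such a look-up table is given below; the descriptions and the 7-dimensional weight values are invented purely for illustration and do not come from the patent:

    # Toy Expression Code Book: description -> 7-D expression vector.
    # Component order assumed: neutral, fear, anger, surprise, happiness, sadness, disgust.
    expression_code_book = {
        "neutral":              [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        "broad smile":          [0.1, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0],
        "pleasantly surprised": [0.1, 0.0, 0.0, 0.5, 0.4, 0.0, 0.0],
        "mild concern":         [0.5, 0.3, 0.0, 0.0, 0.0, 0.2, 0.0],
    }

    def lookup_expression_vector(description):
        return expression_code_book[description]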
The expression vector for a particular non-primary expression describes the amount of each primary expression in the non-primary expression. Each of the columns of the iSpace matrix corresponds to a primary facial expression (or the neutral expression). A particular output expression is generated as a weighted linear combination of the column vectors of the iSpace matrix, where the weights are given by the particular expression vector. Thus, the ECB can be used in conjunction with iSpace to generate dynamic, hybrid, and non-neutral facial expressions. This is shown in Equation 1 below, where V_o is an output vector, S_I is an M x N iSpace matrix (i.e. the column vectors of this matrix are the basis vectors) and E_v is the expression vector.

V_o = S_I E_v    (Equation 1)

Thus, the output vector V_o is a linear combination of the columns of S_I. Any non-primary facial expression may be approximated by a linear combination of the basis vectors (i.e. the primary expressions). There are potentially an infinite number of weighted linear combinations that may be generated, and thus there is no limit to the degree of subtlety of the hybrid expressions that may be synthesised using the method of the present invention.
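Equation 1 is a single matrix-vector product, as in the following sketch; the split of the output vector back into a shape part and a texture part assumes the concatenated column layout illustrated earlier:

    import numpy as np

    def synthesise(ispace, expression_vector, shape_length):
        """Equation 1: V_o = S_I @ E_v. The first `shape_length` entries of the
        output are taken as the synthesised shape, the rest as the texture
        (layout assumed from the concatenation used when building S_I)."""
        output = ispace @ np.asarray(expression_vector)
        return output[:shape_length], output[shape_length:]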
The ECB is generated by using images portraying specific facial expressions to calculate the expression vector as follows:
S_I^inv V_t = E_t    (Equation 2)
where S_I^inv is the pseudo-inverse of the iSpace matrix, V_t is the shape-and-texture vector extracted from a training image, and E_t is the resulting training expression vector. The ECB is generated only once using a single exemplar individual and may then be applied to other faces according to the method of the invention.
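A minimal NumPy sketch of Equation 2, assuming the shape-and-texture vector V_t has already been measured from each exemplar training image:

    import numpy as np

    def train_expression_vector(ispace, training_vector):
        """Equation 2: E_t = pinv(S_I) @ V_t, where V_t is the shape-and-texture
        vector measured from a training image of the exemplar individual."""
        return np.linalg.pinv(ispace) @ training_vector

    def build_ecb(ispace, labelled_training_vectors):
        """Build the Expression Code Book once from exemplar images:
        {description: training vector} -> {description: 7-D expression vector}."""
        return {label: train_expression_vector(ispace, vector)
                for label, vector in labelled_training_vectors.items()}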
In step 10, the expression vector is applied to the iSpace model to synthesise a facial image having the selected non-primary facial expression. In step 11, the shape and textures are projected out of iSpace to generate the output vector V_o as given by Equation 1. The synthesised shape is rescaled to its original size and the synthesised texture is warped to the new synthesised shape vector.
In steps 12 and 13, the synthesised facial image is combined with the acquired facial image. Step 12 comprises superimposing the synthesised facial image over the acquired facial image. In step 13, the synthesised image is blended with the acquired image using standard image processing techniques for added realism. The method and system of the present invention may be implemented within a mobile telephone, personal digital assistant (PDA), camera or other similar device and used to generate images and/or animations portraying a variety of facial expressions of a subject. The background of the image may also be varied, so as to depict the subject in various locations or situations. Where the invention is implemented in a mobile telephone, PDA or other mobile telecommunications device, the created images or animations may be shared or sent via the Internet or transmitted using wireless telecommunication networks.
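Steps 12 and 13 can be illustrated with a simple feathered alpha-blend; the Gaussian feathering used below is an assumption standing in for the unspecified "standard image processing techniques":

    import numpy as np
    import cv2

    def blend_face(acquired, synthesised, face_mask, feather=15):
        """Superimpose the synthesised face region onto the acquired colour image
        and blend the seam. `face_mask` is a float mask in [0, 1] marking the
        synthesised face region; feathering softens the boundary."""
        alpha = cv2.GaussianBlur(face_mask.astype(np.float32), (0, 0), feather)
        alpha = np.clip(alpha, 0.0, 1.0)[..., None]  # broadcast over colour channels
        blended = (alpha * synthesised.astype(np.float32)
                   + (1.0 - alpha) * acquired.astype(np.float32))
        return blended.astype(np.uint8)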
The words "comprises/comprising" and the words "having/including" when used herein with reference to the present invention are used to specify the presence of stated features, integers, steps or components but do not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.

Claims

Claims
1. A method for synthesis of a non-primary facial expression, comprising: acquiring a facial image; calculating a shape and a texture for each of a plurality of primary facial expressions for the facial image; generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures; selecting an expression vector corresponding to a non-primary facial expression to be synthesised; and applying the expression vector to the model to synthesise a facial image having the selected non-primary facial expression.
2. A method as claimed in claim 1, wherein the subject-specific model of the face is a matrix-based model.
3. A method as claimed in claim 2, wherein the step of applying the expression vector to the model comprises multiplying the expression vector by the matrix to obtain an output vector that describes the selected non-primary facial expression.
4. A method as claimed in claim 2 or claim 3, wherein each of the columns of the matrix corresponds to a primary facial expression.
5. A method as claimed in claim 4, wherein the output vector is a weighted linear combination of the columns of the matrix.
6. A method as claimed in any preceding claim, further comprising: generating an estimate of the size and location of the face within the acquired facial image.
7. A method as claimed in claim 6, further comprising: scaling and aligning the estimate to a mean face shape.
8. A method as claimed in claim 7, wherein the steps of scaling and aligning are done using Procrustes alignment.
9. A method as claimed in any preceding claim, further comprising: locating at least one feature of the face.
10. A method as claimed in claim 9, wherein the locating step includes locating at least one of: an eye, an eyebrow, a pupil, a nose, or a mouth.
11. A method as claimed in claim 9 or claim 10, wherein the locating step comprises: identifying a plurality of landmark points associated with at least one feature of the face.
12. A method as claimed in any preceding claim, wherein the step of calculating the shape of the plurality of primary facial expressions for the facial image comprises: applying a subject-independent Facial Expression Shape Model algorithm to the facial image to calculate a shape for the face for each of the primary facial expressions.
13. A method as claimed in any preceding claim, wherein the step of calculating the texture of the plurality of primary facial expressions for the facial image comprises: applying a subject-independent Facial Expression Texture Model algorithm to the facial image to calculate a texture for the face for each of the primary facial expressions.
14. A method as claimed in any preceding claim, further comprising: aligning the calculated shapes for each of the primary expressions to each other.
15. A method as claimed in claim 14, wherein the step of aligning the calculated shapes is done using Procrustes alignment.
16. A method as claimed in claim 14 or claim 15, further comprising: calculating a new mean face shape from the aligned shapes.
17. A method as claimed in claim 16, further comprising: warping each of the calculated textures to the new mean face shape.
18. A method as claimed in claim 17, wherein the step of warping is done using a piece-wise affine transformation algorithm.
19. A method as claimed in any preceding claim, further comprising: combining the synthesised facial image with the acquired facial image.
20. A method as claimed in claim 19, wherein the step of combining the synthesised facial image with the acquired facial image comprises: superimposing the synthesised facial image over the acquired facial image; and blending the synthesised facial image with the acquired facial image.
21. A system for synthesis of a non-primary facial expression, comprising: means for acquiring a facial image; means for calculating a shape and texture for each of a plurality of primary facial expressions for the facial image; means for generating a subject-specific model of the face including each of the primary facial expressions using the calculated shapes and textures; means for selecting an expression vector corresponding to a non-primary facial expression to be synthesised; and means for applying the expression vector to the model to synthesise a facial image having the selected non-primary facial expression.
22. A method for generating an expression look-up table, comprising: analysing a plurality of images, each of which depicts a facial expression; for each image, determining the relationship between a neutral expression and the facial expression depicted therein to create an expression vector associated with the facial expression; storing each expression vector indexed by an expression description descriptive of the associated facial expression.
23. A method for synthesis of a non-primary facial expression substantially as hereinbefore described with reference to and/or as illustrated in the accompanying drawings.
24. A method for generating an expression look-up table substantially as hereinbefore described with reference to and/or as illustrated in the accompanying drawings.
PCT/EP2008/061319 2007-09-05 2008-08-28 Method and system for synthesis of non-primary facial expressions WO2009030636A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IE2007/0634 2007-09-05
IE20070634A IE20070634A1 (en) 2007-09-05 2007-09-05 Method and system for synthesis of non-primary facial expressions

Publications (1)

Publication Number Publication Date
WO2009030636A1 (en)

Family

ID=40225272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/061319 WO2009030636A1 (en) 2007-09-05 2008-08-28 Method and system for synthesis of non-primary facial expressions

Country Status (2)

Country Link
IE (1) IE20070634A1 (en)
WO (1) WO2009030636A1 (en)


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
COOTES T F ET AL: "ACTIVE APPEARANCE MODELS", EUROPEAN CONFERENCE ON COMPUTER VISION, BERLIN, DE, vol. 2, no. 1, 1 January 1998 (1998-01-01), pages 484 - 498, XP000884426 *
DU Y ET AL: "Emotional facial expression model building", PATTERN RECOGNITION LETTERS, ELSEVIER, AMSTERDAM, NL, vol. 24, no. 16, 1 December 2003 (2003-12-01), pages 2923 - 2934, XP004463254, ISSN: 0167-8655 *
GHENT ET AL: "Photo-realistic facial expression synthesis", IMAGE AND VISION COMPUTING, GUILDFORD, GB, vol. 23, no. 12, 1 November 2005 (2005-11-01), pages 1041 - 1050, XP005060790, ISSN: 0262-8856 *
GROSS ET AL: "Generic vs. person specific active appearance models", IMAGE AND VISION COMPUTING, GUILDFORD, GB, vol. 23, no. 12, 1 November 2005 (2005-11-01), pages 1080 - 1093, XP005060793, ISSN: 0262-8856 *
IAIN MATTHEWS ET AL: "Active Appearance Models Revisited", INTERNATIONAL JOURNAL OF COMPUTER VISION, KLUWER ACADEMIC PUBLISHERS, BO, vol. 60, no. 2, 1 November 2004 (2004-11-01), pages 135 - 164, XP019216428, ISSN: 1573-1405 *
LEI XIONG ET AL: "Facial Expression Sequence Synthesis Based on Shape and Texture Fusion Model", IMAGE PROCESSING, 2007. ICIP 2007. IEEE INTERNATIONAL CONFERENCE ON, IEEE, PI, 1 September 2007 (2007-09-01), pages IV - 473, XP031158758, ISBN: 978-1-4244-1436-9 *
XIAOGUANG LU ET AL: "Integrating Range and Texture Information for 3D Face Recognition", APPLICATION OF COMPUTER VISION, 2005. WACV/MOTIONS '05 VOLUME 1. SEVENTH IEEE WORKSHOPS ON, IEEE, PI, 1 January 2005 (2005-01-01), pages 156 - 163, XP031059084, ISBN: 978-0-7695-2271-5 *
ZHOU ET AL: "Facial expressional image synthesis controlled by emotional parameters", PATTERN RECOGNITION LETTERS, ELSEVIER, AMSTERDAM, NL, vol. 26, no. 16, 1 December 2005 (2005-12-01), pages 2611 - 2627, XP005136270, ISSN: 0167-8655 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130002669A1 (en) * 2011-06-30 2013-01-03 Samsung Electronics Co., Ltd. Method and apparatus for expressing rigid area based on expression control points
US9454839B2 (en) * 2011-06-30 2016-09-27 Samsung Electronics Co., Ltd. Method and apparatus for expressing rigid area based on expression control points
US8593452B2 (en) 2011-12-20 2013-11-26 Apple Inc. Face feature vector construction
AU2012227166B2 (en) * 2011-12-20 2014-05-22 Apple Inc. Face feature vector construction

Also Published As

Publication number Publication date
IE20070634A1 (en) 2009-04-15

Similar Documents

Publication Publication Date Title
Pyun et al. An example-based approach for facial expression cloning
Pighin et al. Modeling and animating realistic faces from images
US8624901B2 (en) Apparatus and method for generating facial animation
CN111833236B (en) Method and device for generating three-dimensional face model for simulating user
Sharma et al. 3d face reconstruction in deep learning era: A survey
US11587288B2 (en) Methods and systems for constructing facial position map
WO2021228183A1 (en) Facial re-enactment
KR100900823B1 (en) An efficient real-time skin wrinkle rendering method and apparatus in character animation
Ahmed et al. Automatic generation of personalized human avatars from multi-view video
KR20230085931A (en) Method and system for extracting color from face images
Song et al. A generic framework for efficient 2-D and 3-D facial expression analogy
CN115393480A (en) Speaker synthesis method, device and storage medium based on dynamic nerve texture
Theobald et al. Real-time expression cloning using appearance models
JP2024506170A (en) Methods, electronic devices, and programs for forming personalized 3D head and face models
Xu et al. Efficient 3d articulated human generation with layered surface volumes
WO2009030636A1 (en) Method and system for synthesis of non-primary facial expressions
CN116863044A (en) Face model generation method and device, electronic equipment and readable storage medium
KR100792704B1 (en) A Method of Retargeting A Facial Animation Based on Wire Curves And Example Expression Models
CN115023742A (en) Facial mesh deformation with detailed wrinkles
Park et al. A feature‐based approach to facial expression cloning
Sun et al. SSAT $++ $: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint
Wang et al. Uncouple generative adversarial networks for transferring stylized portraits to realistic faces
Jiang et al. Animating arbitrary topology 3D facial model using the MPEG-4 FaceDefTables
Tu et al. Expression detail mapping for realistic facial animation
Zhang et al. Fast individual face modeling and animation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08787552

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08787552

Country of ref document: EP

Kind code of ref document: A1