WO2020260862A1 - Facial behaviour analysis - Google Patents

Facial behaviour analysis

Info

Publication number
WO2020260862A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial images
facial
predicted
action unit
neural network
Prior art date
Application number
PCT/GB2020/051503
Other languages
French (fr)
Inventor
Stefanos ZAFEIRIOU
Dimitrios KOLLIAS
Original Assignee
Facesoft Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Facesoft Ltd. filed Critical Facesoft Ltd.
Priority to CN202080044948.0A priority Critical patent/CN113994341A/en
Publication of WO2020260862A1 publication Critical patent/WO2020260862A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/175Static expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

This specification relates to methods for analysing facial behaviour in facial images using neural networks. According to a first aspect of this disclosure, there is described a method of training a neural network for facial behaviour analysis, the method comprising: inputting, to the neural network, a plurality of facial images, the plurality of facial images comprising: one or more first facial images from a first training dataset, the first training dataset comprising facial images each with a known emotion label; and one or more second facial images from a second training dataset, the second training dataset comprising facial images each with known action unit activations; generating, for each of the plurality of facial images and using the neural network, a predicted emotion label and predicted action unit activations; and updating parameters of the neural network in dependence on a comparison of: predicted emotion labels of the one or more first facial images to the known emotion labels of the one or more first facial images; and predicted action unit activations of the one or more second facial images to the known action unit activations of the one or more second facial images.

Description

Facial Behaviour Analysis
Field
This specification relates to methods for analysing facial behaviour in facial images using neural networks, and methods of training neural networks for facial behaviour analysis by processing of facial images.
Background
Automatic facial behaviour analysis relates to determining human affective states (e.g. emotion) from facial visual data. Facial behaviour analysis has wide-ranging
applications such as, for example, improved human-computer/human-robot interaction. Moreover, facial behaviour analysis systems may be used to automatically annotate data (for example, visual/audio data), e.g. by determining a human reaction to the data instead of requiring a manual annotation. Therefore, improving system(s) and methods for facial behaviour analysis may improve systems directed at these and other applications.
Developing methods and systems for facial behaviour analysis of facial images recorded under unconstrained conditions (e.g. ‘in-the-wild’) is a difficult task. The labels required to annotate in-the-wild datasets may be costly to acquire as they may require skilled annotators to provide these manual annotations, and so datasets with these labels may have fewer examples than is desired for developing facial behaviour analysis systems with a desired level of performance. Additionally, facial behaviour analysis comprises various different tasks which reflect different, yet inter-connected, aspects of human affective states.
Summary
According to a first aspect of this disclosure, there is described a method of training a neural network for facial behaviour analysis, the method comprising: inputting, to the neural network, a plurality of facial images, the plurality of facial images comprising: one or more first facial images from a first training dataset, the first training dataset comprising facial images each with a known emotion label; and one or more second facial images from a second training dataset, the second training dataset comprising facial images each with known action unit activations; generating, for each of the plurality of facial images and using the neural network, a predicted emotion label and predicted action unit activations; and updating parameters of the neural network in dependence on a comparison of: predicted emotion labels of the one or more first facial images to the known emotion labels of the one or more first facial images; and predicted action unit activations of the one or more second facial images to the known action unit activations of the one or more second facial images.
The comparison may be performed by a multi-task objective function, the multi-task objective function comprising: an emotion loss comparing predicted emotion labels to known emotion labels; and an activation loss comparing predicted action unit activations to known action unit activations. The emotion loss and/or activation loss may comprise a cross entropy loss.
The plurality of facial images may further comprise one or more third facial images from a third training dataset, the third training dataset comprising facial images each with a known valence and/or arousal value. Updating the parameters of the neural network may be further in dependence on comparison of predicted valence and/or arousal values of the one or more third facial images to the known valence and/or arousal values of the one or more third facial images. The comparison may be performed by a multi-task objective function, the multi-task objective function comprising a continuous loss comparing predicted valence and/or arousal values to known valence and/or arousal values. The continuous loss may comprise a measure of a concordance correlation coefficient between the predicted valence and/or arousal values and the known valence and/or arousal values.
The facial images of the first dataset may each be associated with derived action unit activations, the derived action unit activations determined based on the known emotion label of said image. The parameters of the neural network may be updated in dependence on a comparison of the predicted action unit activations of the one or more first facial images to the derived action unit activations of the one or more first facial images. One or more facial images of the second dataset may each be associated with a derived emotion label, each derived emotion label determined based on the known action unit activations of said facial image. The parameters of the neural network may be updated in dependence on a comparison of the predicted emotion labels of one or more second facial images to the corresponding derived emotion labels of the one or more second facial images. The derived action unit activations and the derived emotion labels may be determined based on a set of prototypical action unit activations for each emotion label and a set of weighted action unit activations for each emotional label. The derived emotion labels may be a distribution over a set of possible emotion labels. The predicted emotion labels may comprise a probability measure across the set of possible emotions. The parameters of the neural network may be updated in dependence on a comparison of derived emotional labels to the predicted emotion labels.
The parameters of the neural network may be updated in dependence on a comparison of a distribution of the predicted action unit activations to an expected distribution of action unit activations, the expected distribution of action unit activations being determined based on the predicted emotional labels of the plurality of facial images. The expected distribution of action unit activations may further be determined based on a modelled relationship between emotion labels and action unit activations.
The method may be iterated until a threshold condition is met. According to a further aspect of this disclosure, there is described a method of facial behaviour analysis, the method comprising: inputting a facial image into a neural network; processing the image using the neural network; and outputting, from the neural network, a predicted emotion label for the facial image, predicted action unit activations for the facial image and/or a predicted valence and/or arousal value for the facial image, wherein the neural network comprises a plurality of parameters determined using the training methods described herein.
According to a further aspect of this disclosure, there is described a system comprising: one or more processors; and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform one or more of the methods described herein.
According to a further aspect of this disclosure, there is described a computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform one or more of the methods described herein.
Brief Description of the Drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which: Figures 1a and 1b show an overview of example methods of facial behaviour analysis of facial images using a trained neural network;
Figure 2 shows an overview of an example method of training a neural network for facial behaviour analysis of facial images;
Figure 3 shows an overview of an example structure of a neural network for facial behaviour analysis of facial images;
Figure 4 shows a flow diagram of an example method of training a neural network for facial behaviour analysis of facial images;
Figure 5 shows a flow diagram of an example method of facial behaviour analysis of facial images using a trained neural network; and
Figure 6 shows a schematic example of a system/apparatus for performing any of the methods described herein.
Detailed Description
Example implementations provide system(s) and methods for facial behaviour analysis of facial images.
Improved facial behaviour analysis may be achieved by various example
implementations comprising a facial behaviour analysis method utilising a neural network trained with a multi-task objective function. Use of such a multi-task loss function in training a neural network can result in a higher performance for facial behaviour analysis by the neural network when compared to other facial behaviour analysis methods. For example, the neural network may have a lower error rate in recognising facial emotions/behaviour.
Methods and systems disclosed herein enable the implementation of a single multi-task, multi-domain and multi-label network that can be trained jointly end-to-end on various tasks relating to facial behaviour analysis. The methods of jointly training a neural network end-to-end for face behaviour analysis disclosed herein may obviate the need to utilise pre-trained neural networks, which may require fine-tuning to perform well on the new task(s) and/or domain(s). Therefore, the trained neural network may generalise better to unseen facial images captured in-the-wild than previous
approaches. The trained neural network simultaneously predicts different aspects of facial behaviour analysis, and may outperform single-task neural networks as a result of enhanced emotion recognition capabilities. Additionally, the enhanced emotion recognition capabilities may allow the neural network to generate useful feature representations of input facial images, and so the neural network may be successfully applied to perform tasks beyond the ones it has been trained for. The multiple tasks may include, for example, automatic recognition of expressions, estimation of continuous emotions (e.g. valence and/or arousal), and detection of facial unit activations (activations of e.g. upper/inner eyebrows, nose wrinkles; facial unit activations are also referred to as action unit activations herein). Facial images used to train the neural network may be from multiple domains. For example, facial images may be captured from a user operating a camera of a mobile device, they may be extracted from video frames, and they may be captured in a controlled, lab-based, recording environment (while still allowing for spontaneous expressions of the subjects). A facial image used to train the neural network may have multiple labels; one or more of these labels may be derived from known labels as will be described below in relation to Figure 2.
Figures 1a and 1b show an overview of example methods of facial behaviour analysis of facial images using a trained neural network. The method 100 takes a facial image 102 as input and outputs a predicted emotion label 106, predicted action unit (AU) activations 108, and optionally, predicted valence and/or arousal values 110 using a trained neural network 104.
The facial image 102, x, is an image comprising one or more faces. For example, in a colour image, x ∈ R^(H×W×3), where H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB).
The facial images 102, 108 may, in some embodiments, be in black-and-white/greyscale. Additionally or alternatively, any visual data relating to faces may be input to the systems and methods described herein, such as, for example, 3D facial scans and UV representations of 3D facial visual data.
The predicted emotion label 106 is a label describing the predicted emotion/expression of the face present in the facial image 102. The emotion label may be a discrete variable, which can take on one value out of a possible set of values. For example, the set of possible emotions may include: angry, disgust, fear, happy, sad, surprise, and neutral. The predicted emotion label 106 may be represented by a one-hot vector with dimension equal to the number of possible emotions, with all entries zero apart from the entry at the index corresponding to the predicted emotion. Additionally or alternatively, the predicted emotion label 106 may take on more than one value, which may represent compound emotions. Additionally or alternatively, the predicted emotion label 106 may represent a probability distribution over the set of possible emotions representing how confident the trained neural network 104 is in its predictions. For example, the predicted emotion label 106 may comprise, for each emotion in the set of emotions used, a probability that said emotion is present in the input image 102, x, i.e. p(y_emo | x) for each emotion label y_emo. The predicted action unit activations 108 represent the predicted activation of facial muscles of the face in the facial image 102. Each action unit activation represents a part of a facial expression. For example, one action unit may be activated when a person wrinkles their nose, another may be activated when the lip corner is pulled. The action unit (AU) coding system is a way of coding facial motion with respect to facial muscles, adopted as a common standard towards systematically categorising physical manifestations of complex facial expressions. Acquiring labels for action unit activations may be costly as it may require skilled annotators with expert knowledge of the AU coding system to manually label facial images with the action unit activations. The activation of each action unit may be modelled as a binary variable.
The predicted valence and/or arousal values 110 are predictions of continuous emotions depicted in the facial image 102. Generally, valence values may measure how negative/positive a person is, and arousal values may measure how active/passive a person is. These values may be configured to lie in a standardised range, for example, they may lie in a continuous range from -1 to 1.
The trained neural network 104 is a neural network configured to process the input facial image 102 to output a predicted emotion label 106, predicted action unit (AU) activations 108, and optionally, predicted valence and/or arousal values 110. The trained neural network may be configured to output any label/target output that relates to facial behaviour analysis. Examples of neural network architectures are described below in relation to Figure 3.
The trained neural network 104 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the trained neural network 104 may be convolutional layers.
Figure 2 shows an overview of an example method of training a neural network, such as the neural network in Figures 1a and 1b, for facial behaviour analysis of facial images. The method 200 comprises training a neural network 212 jointly to perform multiple facial behaviour analysis tasks; the neural network may be trained using a plurality of datasets which may be from different domains and have different labels/target outputs. The objective of the neural network is to generate predictions for facial visual data which are similar to the labels/target outputs in the training dataset, while generalising well (e.g. by accurately capturing various aspects of emotion recognition) on unseen examples of facial visual data.
The neural network 212 is trained by processing a training batch of facial images 210, generating predictions for the training batch 210, and updating the parameters of the neural network 212 based on a multi-task objective function 220 comprising a comparison of: (i) the generated predictions and (ii) the corresponding labels/target outputs of the examples in the training batch 210. Example structures of the neural network 212 are discussed below in relation to Figure 3.
The training batch of facial images 210 includes a plurality of batches of labelled training data. The plurality of batches of labelled training data comprises a first batch 202-1 of facial images, a second batch 202-2 of facial images, and optionally, a third batch 202-3 of facial images. Each batch is associated with different types of facial emotion labels, though facial images may have more than one type of label and be present in more than one of the batches. The batches may be taken from a plurality of different datasets. The size of each batch may be different. For example, the first batch 202-1 may contain a higher number of facial images than the second batch 202-2. The second batch 202-2 may contain a higher number of facial images than the third batch 202-3.
The first batch 202-1 includes one or more facial images with corresponding known emotion labels y_emo 204. The first batch 202-1 may comprise a plurality of facial images. Generally, the emotion label may be a discrete variable, which can take on one value out of a possible set of values. For example, the set of possible emotions may include: angry, disgust, fear, happy, sad, surprise, and neutral, and so y_emo ∈ {1, 2, ..., 7} for these 7 emotions. It will be appreciated that there may be more or fewer than 7 emotions in the set of possible emotions: some of these emotions may be omitted and other emotions may be included in the set of possible emotions.
The second batch 202-2 includes one or more facial images with corresponding known action unit activations y_au 206. The second batch 202-2 may comprise a plurality of facial images. The action unit activations represent the activations of different facial muscles; the AU coding system codes facial motion with respect to facial muscles. The action unit activations may be represented as y_au ∈ {0, 1}^17, where the activations of 17 action units are modelled as binary variables. It will be appreciated that there may be more or fewer action unit activations in the batch 202-2. The third batch 202-3, which may optionally be included in the training batch 210, includes one or more facial images with corresponding known valence and/or arousal values 208. The third batch 202-3 may comprise a plurality of facial images. Valence and arousal values are continuous emotions. Generally, valence values y_v may measure how negative/positive a person is, and arousal values y_a may measure how
active/passive a person is. These values may be configured to lie in a standardised range, for example, they may lie in a continuous range from -1 to 1. Where both valence and arousal values are included, this may be represented as y_va ∈ [−1, 1]^2. Additionally or alternatively, other continuous emotion values may be included in the batch 202-3. The training images 210 used in each batch 202 may be extracted from sets of images containing faces with one or more known labels. ResNet may be used to extract bounding boxes of facial images and facial landmarks. The facial landmarks may be used to align the extracted facial images. The extracted facial images may be resized to a fixed image size (e.g. n-by-n-by-3 for a colour image, where n is the width/height of the image). The intensity of the extracted facial images may, in some embodiments, be normalised to the range [-1, 1].
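As a concrete illustration of this preprocessing, the following Python/PyTorch sketch resizes an aligned face crop and normalises its intensities to [-1, 1]. It is a hedged example only: the crop size n = 96, the use of torchvision transforms and the helper name preprocess_face are assumptions rather than details taken from this disclosure.

```python
import torchvision.transforms as T

def preprocess_face(crop, n=96):
    """Resize an aligned face crop to n x n x 3 and scale pixel intensities to [-1, 1]."""
    transform = T.Compose([
        T.Resize((n, n)),
        T.ToTensor(),                       # scales pixels to [0, 1], shape (3, n, n)
        T.Normalize(mean=[0.5, 0.5, 0.5],   # (x - 0.5) / 0.5 maps [0, 1] to [-1, 1]
                    std=[0.5, 0.5, 0.5]),
    ])
    return transform(crop)
```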
During training, at each iteration the plurality of batches is concatenated and input into the neural network 212 (or input sequentially into the neural network 212). A plurality of sets of output data are obtained, and compared to corresponding known labels of the facial images in the training batch 210. The parameters of the neural network are updated in dependence on this comparison.
By inputting a training batch 210 that comprises facial images from each of the plurality of batches 202-1, 202-2, 202-3, the network processes images from each category of labels, and thus the parameters of the neural network 212 are updated based on all the types of label present in the training data. In examples where each batch 202-1, 202-2, 202-3 comprises a plurality of facial images, the problem of “noisy gradients” in the parameter updates is avoided, leading to better convergence behaviour. Furthermore, some types of cost/objective function (such as the CCC cost function described below) use a sequence of predictions to perform the comparison of predicted labels and ground truth labels.
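One possible way of assembling such a mixed training batch is sketched below. The iterator names (emo_iter, au_iter, va_iter) and the slice bookkeeping are illustrative assumptions; the disclosure only requires that images carrying each type of label are present in the concatenated batch.

```python
import torch

def next_mixed_batch(emo_iter, au_iter, va_iter):
    # Draw a sub-batch from each dataset: emotion-labelled, AU-labelled and
    # valence/arousal-labelled facial images respectively.
    x_emo, y_emo = next(emo_iter)
    x_au, y_au = next(au_iter)
    x_va, y_va = next(va_iter)
    # Concatenate the images into a single batch, remembering which slice of
    # the batch carries which type of annotation so that each loss term is
    # only evaluated on the images that actually have the matching labels.
    x = torch.cat([x_emo, x_au, x_va], dim=0)
    slices = {
        "emo": slice(0, len(x_emo)),
        "au": slice(len(x_emo), len(x_emo) + len(x_au)),
        "va": slice(len(x_emo) + len(x_au), len(x)),
    }
    return x, (y_emo, y_au, y_va), slices
```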
The parameters of the neural network 212 may be updated using a multi-task objective function 220, which may comprise an emotion loss 222 comparing predicted emotion labels 214 to known emotion labels 204; and an action unit activation loss 224 comparing predicted action unit activations 216 to known action unit activations 206. Optionally, the multi-task objective function 220 may further comprise a continuous loss 226 comparing predicted valence and/or arousal values 218 to known valence and/or arousal values 208. Additionally or alternatively, the multi-task objective function 220 may compare predictions with labels derived from known labels in order to update the parameters of the neural network 212, as will be described in more detail below. In some embodiments, the multi-task objective function 220 may comprise an emotion loss 222. An emotion loss 222 compares predictions of emotions 214, p(y_emo | x), to known emotion labels 204. Additionally or alternatively, the emotion loss 222 may compare predicted emotion labels 214 with derived emotion labels as described below. The emotion loss 222 may comprise a cross-entropy loss. In embodiments where the output of the neural network is a probability distribution over the set of possible emotions, the emotion loss 222 may be given by an expectation value over the set of emotions present in the input facial images:
L_Emo = E[ −log p(y_emo | x) ]
However, it will be appreciated that any type of classification loss may alternatively be utilised, for example, a hinge loss, a square loss, or an exponential loss. In some embodiments, the multi-task objective function 220 may comprise an action unit activation loss 224. An action unit activation loss 224 compares predictions of action unit activations 216, p(y_au | x), to known action unit activations 206. Additionally or alternatively, the action unit activation loss 224 may compare predicted action unit activations 216 with derived action unit activations as described below. The action unit activation loss 224 may comprise a binary cross-entropy loss. An action unit activation loss 224 may, in some embodiments, be given by:
L_AU = E[ ( Σ_{i=1..17} δ_i · ℓ_i(x) ) / ( Σ_{i=1..17} δ_i ) ]
where the negative log likelihood may be given by:
ℓ_i(x) = −[ y_au,i · log p(AU_i | x) + (1 − y_au,i) · log(1 − p(AU_i | x)) ]
The δ_i ∈ {0, 1} in the equation above denotes whether the image x contains an annotation for the i-th action unit AU_i. However, it will be appreciated that any type of classification loss may be utilised, for example, a hinge loss, a square loss, or an exponential loss.
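A sketch of these two classification losses in PyTorch is given below. It assumes the network outputs emotion logits of shape (B, 7) and action unit probabilities of shape (B, 17), and that delta is a (B, 17) mask marking which action units are annotated; these shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def emotion_loss(emo_logits, y_emo):
    # Cross-entropy between predicted emotion logits and known emotion labels.
    return F.cross_entropy(emo_logits, y_emo)

def au_loss(au_probs, y_au, delta):
    # Binary cross-entropy per action unit, accumulated only where an
    # annotation exists (delta == 1) and averaged over the annotated units.
    bce = F.binary_cross_entropy(au_probs, y_au, reduction="none")
    return (bce * delta).sum() / delta.sum().clamp(min=1.0)
```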
In some embodiments, a multi-task objective function 220 that may be used to train the neural network may be given by:
L_MT = L_Emo + λ_1 · L_AU
The real number λ_1 is typically non-negative, and controls the relative contribution of the action unit activation loss 224 to the emotion loss 222. In some embodiments, the multi-task objective function 220 may further comprise a continuous loss 226. The continuous loss 226 compares predicted valence/arousal values with known valence/arousal values. The continuous loss 226 may measure the agreement between the predicted valence/arousal values and the ground truth valence/arousal values. This may be measured by a concordance correlation coefficient. For example, the continuous loss 226 may be given by L_VA = 1 − (ρ_v + ρ_a) / 2, where ρ_v and ρ_a are the concordance correlation coefficients between the predicted and known valence values and between the predicted and known arousal values, respectively. An example of a multi-task objective function 220 that may be used to train the neural network when valence and arousal labels are present may be given by:
L_MT = L_Emo + λ_1 · L_AU + λ_2 · L_VA
One or more of the elements of the multi-task objective function 220 may be omitted. The real numbers λ_1 and λ_2 are typically non-negative and control the relative contributions of the individual loss functions to the multi-task objective function 220. Additional losses may be included in the multi-task objective function, such as a distribution matching loss, as will be discussed in further detail below.
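The concordance-correlation-based continuous loss and the weighted combination above may, for example, be sketched as follows. The default weights λ_1 = λ_2 = 1 are placeholders; the disclosure only states that the weights are typically non-negative.

```python
import torch

def ccc(pred, target, eps=1e-8):
    # Concordance correlation coefficient between two 1-D sequences of values.
    pred_mean, target_mean = pred.mean(), target.mean()
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2 * cov / (pred.var(unbiased=False) + target.var(unbiased=False)
                      + (pred_mean - target_mean) ** 2 + eps)

def va_loss(pred_v, y_v, pred_a, y_a):
    # One minus the mean CCC of valence and arousal: perfect agreement gives 0.
    return 1 - 0.5 * (ccc(pred_v, y_v) + ccc(pred_a, y_a))

def multi_task_loss(l_emo, l_au, l_va, lambda_1=1.0, lambda_2=1.0):
    # L_MT = L_Emo + lambda_1 * L_AU + lambda_2 * L_VA
    return l_emo + lambda_1 * l_au + lambda_2 * l_va
```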
The different tasks that the neural network 212 is trained to perform may be interconnected in relation to facial behaviour analysis. For example, a facial image depicting a certain expression may also result in certain action units being activated in substantially all examples of facial images depicting the certain expression. These action units may be referred to as prototypical action units. Action units activated in a significant proportion of examples of facial images depicting a certain expression may be referred to as observational action units. Prototypical and observational action units may be derived from an empirical model. For example, sets of images with known emotion labels may be annotated with action unit activations. Observational action unit activations, and associated weights, may be determined from action unit activations that a fraction of annotators observe. Table 1 below shows examples of emotions and their corresponding prototypical and observational action units.
Table 1: Examples of emotions and their prototypical and observational action units (AUs). The weight is the fraction of examples in which the activation of the AU was observed.
As a result of the different tasks being interconnected, it may be beneficial to couple together two or more of these tasks. This may lead to a trained neural network with enhanced emotion recognition capabilities as a result of generating feature
representations of facial images which better capture the different aspects of facial behaviour analysis. Training the neural network 212 with co-annotated images which have labels derived from known labels is one way to couple together different tasks. Another way to couple the tasks together is to use a distribution matching loss which aligns the predictions of the emotions and action units tasks during training.
For facial images in the batch 202-1 with known emotion labels 204, additional action unit activations can be derived using associations between the emotion label and action units. Given an image x with the ground truth annotation of emotion y_emo, the prototypical and observational AUs of this emotion may be provided as an additional label. The facial image may be co-annotated with derived action unit activations y_au that contain only the prototypical and observational AUs. The activation of
observational action units may be weighted using an empirically derived weight, for example, by using the weights from Table 1. The weights relate to the probability of an action unit being activated given a particular emotion label. Additionally or alternatively, observational action units may have equal weighting to the activations of prototypical action units. The co-annotated facial image may be included in the training batch 210 twice, once with the emotion label and once with the derived action unit activations.
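A sketch of this co-annotation step is shown below. The MODELLED_AUS index mapping and the PROTOTYPICAL and OBSERVATIONAL tables are small illustrative stand-ins for Table 1 (only the 0.51 weight for AU6 under happy appears in the text; the remaining entries are hypothetical).

```python
# The 17 action units assumed to be modelled by the network; the mapping from
# AU number to output index is an assumption for illustration.
MODELLED_AUS = [1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 45]
AU_INDEX = {au: i for i, au in enumerate(MODELLED_AUS)}

# Illustrative stand-ins for Table 1: prototypical AUs per emotion and
# observational AUs with weights (hypothetical except for the 0.51 for AU6).
PROTOTYPICAL = {"happy": [12, 25], "surprise": [1, 2, 25, 26]}
OBSERVATIONAL = {"happy": {6: 0.51}, "surprise": {5: 0.5}}

def derive_au_label(emotion):
    """Derive a weighted AU activation target for an image labelled only with an emotion."""
    y_au = [0.0] * len(MODELLED_AUS)
    for au in PROTOTYPICAL.get(emotion, []):
        y_au[AU_INDEX[au]] = 1.0       # prototypical AUs treated as fully activated
    for au, weight in OBSERVATIONAL.get(emotion, {}).items():
        y_au[AU_INDEX[au]] = weight    # observational AUs weighted (or 1.0 if unweighted)
    return y_au
```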
Similarly, for facial images in the batch 202-2 with known action unit activations 206, additional emotion labels can be derived using associations between the emotion label and action units. For an image x with the ground truth annotation of the action units y_au, it can be determined whether it can be co-annotated with an emotion label. For example, an emotion may be present when all the associated prototypical and observational AUs are present in the ground truth annotation of action units. In cases when more than one emotion is possible, the derived emotion label y_emo may be assigned to the emotion requiring the largest number of prototypical and observational AUs. The co-annotated facial image may be included in the training batch 210 twice, once with the known action unit activations and once with the derived emotion label.
Additionally or alternatively, derived emotion labels may be soft labels, forming a distribution over a set of possible emotion labels. More specifically, for each emotion, a score can be computed over its prototypical and observational AUs being present, e.g. a distribution over the emotion labels can be determined based on a comparison of the action unit labels present to the prototypical and/or observational action units for each emotion label. For example, for the emotion happy, the score (y_au(AU12) + y_au(AU25) + 0.51 · y_au(AU6)) / (1 + 1 + 0.51) can be computed. Additionally or alternatively, all weights may be set equal to 1 if no reweighting is applied. The scores over emotion categories may be normalised to form a probability distribution over emotion labels. The normalisation may, for example, be performed by a softmax operation over the scores for each emotion label.
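The soft-label derivation may be sketched as follows, reusing the illustrative PROTOTYPICAL, OBSERVATIONAL and AU_INDEX tables above; the per-emotion score matches the example given for happy, and the softmax normalisation is the one mentioned in the text.

```python
import torch

def derive_soft_emotion_label(y_au, emotions, prototypical, observational, au_index):
    # Score each emotion by how many of its prototypical and (weighted)
    # observational AUs are active in the known AU annotation y_au.
    scores = []
    for emotion in emotions:
        numerator, denominator = 0.0, 0.0
        for au in prototypical.get(emotion, []):
            numerator += y_au[au_index[au]]
            denominator += 1.0
        for au, weight in observational.get(emotion, {}).items():
            numerator += weight * y_au[au_index[au]]
            denominator += weight
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    # Normalise the per-emotion scores into a probability distribution.
    return torch.softmax(torch.tensor(scores), dim=0)
```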
In various embodiments, a distribution matching loss may also be included in the multi-task objective function 220. A distribution matching loss aligns the predictions of the emotions and action units tasks during training. This may be performed by a comparison between the probability distribution of predictions and an expected probability distribution. The expected distribution of action unit activations may be determined based on the predicted emotional labels of the plurality of facial images. This may be determined based on a modelled relationship between emotion labels and action unit activations, such as in Table 1. For example, the action unit activations may be modelled as a mixture over the emotion categories. An expected action unit activation distribution may be given as:
q(y_au | x) = Σ_{y_emo} p(y_au | y_emo) · p(y_emo | x)
The conditional probability p(y_au | y_emo) may be defined deterministically from an empirical model, such as the one provided in Table 1. For example, p(y_au | y_emo) = 1 for prototypical and observational action units, and zero otherwise. For example, if AU2 is prototypical for the emotion surprised and observational for the emotion fearful, then q(AU2 | x) = p(surprised | x) + p(fearful | x). Additionally or alternatively, the conditional probability for observational action units may be weighted such that p(y_au | y_emo) = w, with weights w which may be taken from Table 1.
A distribution matching loss for action units may be given as:
L_DM = E[ −Σ_i ( q(AU_i | x) · log p(AU_i | x) + (1 − q(AU_i | x)) · log(1 − p(AU_i | x)) ) ]
Similarly, a distribution matching loss for emotion categories may be given using the derived soft emotion labels q(y_emo | x) described above. When distribution matching is used, an example multi-task objective function 220 may be given as:
L_MT = L_Emo + λ_1 · L_AU + λ_2 · L_VA + λ_3 · L_DM
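The expected action unit distribution and a distribution matching term may, under the mixture model above, be sketched as follows. Writing the matching term as a binary cross-entropy between q(AU_i | x) and the predicted p(AU_i | x), and detaching q from the gradient, are assumptions of this sketch rather than requirements of the disclosure.

```python
import torch
import torch.nn.functional as F

def expected_au_distribution(emo_probs, cond):
    # emo_probs: (B, num_emotions) predicted emotion probabilities p(y_emo | x).
    # cond:      (num_emotions, num_aus) table of p(AU_i | y_emo), e.g. 1 (or a
    #            weight w) for prototypical/observational AUs and 0 otherwise.
    return emo_probs @ cond            # (B, num_aus), the mixture q(AU_i | x)

def distribution_matching_loss(au_probs, emo_probs, cond):
    q = expected_au_distribution(emo_probs, cond).detach()
    # Binary cross-entropy between the expected and predicted AU distributions.
    return F.binary_cross_entropy(au_probs, q)
```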
One or more of L_AU and L_VA may be omitted. The parameters of the neural network 212 may be updated using an optimisation procedure in order to determine a setting of the parameters that substantially optimise (e.g. minimise) the multi-task objective function 220. The optimisation procedure may be stochastic gradient descent, for example. One or more of the datasets used to populate batches 202-1, 202-2, 202-3 may be optional, and other datasets relating to facial behaviour analysis tasks may be included in the training batch 210 when training the neural network 212. In some embodiments, the training batch 210 may comprise a number of training examples with sufficiently varied labels/output target types such that all of the components of the multi-task objective function 220 contribute to the objective function. In this way, the weight updates of the neural network 212 may be based on gradients which are not noisy, thus allowing better and/or faster convergence of the neural network 212 during training. Faster convergence of the neural network 212 may reduce the computational/network resources required to train the neural network 212 to an appropriate level of performance, e.g. by reducing the number of calculations performed by a processor, or by reducing the amount of data transmitted over a network, for example, if the training datasets are stored on a remote data storage server.
Figure 3 shows an overview of an example structure of a neural network for facial behaviour analysis of facial images. In this example, the neural network 104 is in the form of a convolutional neural network, comprising a plurality of convolutional layers 302 and a plurality of subsampling layers 304.
Each convolutional layer 302 is operable to apply one or more convolutional filters to the input of said convolutional layer 302. For example, one or more of the
convolutional layers 302 may apply a two-dimensional convolutional block with kernel size three, a stride of one, and a padding size of one. However, other kernel sizes, strides and padding sizes may alternatively or additionally be used. In the example shown, there are a total of thirteen convolutional layers 302 in the neural network 104. Other numbers of convolutional layers 302 may alternatively be used.
Interlaced with the convolutional layers 302 are a plurality of subsampling layers 304 (also referred to herein as down-sampling layers). One or more convolutional layers 302 may be located between each subsampling layer 304. In the example shown, either two or three convolutional layers 302 are applied between each application of a subsampling layer 304. Each subsampling layer 304 is operable to reduce the dimension of the input to that subsampling layer. For example, one or more of the subsampling layers may apply average two-dimensional pooling with kernel and stride sizes of two. Other subsampling methods and/or subsampling parameters may alternatively or additionally be used.
One or more fully connected layers 306 may also be present in the neural network, for example three are shown in Figure 3. The fully connected layers 306 may be directly after the last subsampling layer 304, as shown in Figure 3. Each fully connected layer may have a dimension of 4096, although other dimension sizes are possible. The last fully connected layer may be a layer with no activation function. All of the predictions generated by the neural network 104 may be generated from this output layer. In this way, the predictions for all tasks are pooled from the same feature space.
A classification layer 310 may follow the last fully connected layer, in order to generate the predicted emotion labels 312. This may be a softmax layer.
A plurality of sigmoid units may be applied to the last fully connected layer in order to generate predictions for the action unit activations. In this example, there are 17 sigmoid units in order to generate predictions for 17 action units.
The direct output of the last fully connected layer may be used in order to generate predictions for valence/arousal values, which are continuous variables.
One or more activation functions are used in the layers of the neural network 104. For example, the ReLU activation function may be used. Alternatively or additionally, an ELU activation function may be used in one or more of the layers. Other activation functions may alternatively or additionally be used.
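A compact sketch of a network with this general shape is given below: thirteen 3×3 convolutions interlaced with 2×2 average pooling, three fully connected layers with no activation on the last one, and the emotion, action unit and valence/arousal predictions all taken from that shared output layer. The input resolution (96×96), the channel widths and the exact head layout are assumptions for illustration, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    # n_convs 3x3 convolutions (stride 1, padding 1) followed by 2x2 average pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.AvgPool2d(kernel_size=2, stride=2))
    return layers

class FaceBehaviourNet(nn.Module):
    def __init__(self, num_emotions=7, num_aus=17):
        super().__init__()
        self.num_emotions, self.num_aus = num_emotions, num_aus
        self.features = nn.Sequential(                      # 13 convolutional layers in total
            *conv_block(3, 64, 2), *conv_block(64, 128, 2),
            *conv_block(128, 256, 3), *conv_block(256, 512, 3),
            *conv_block(512, 512, 3))
        self.fc = nn.Sequential(                            # three fully connected layers
            nn.Flatten(),
            nn.Linear(512 * 3 * 3, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_emotions + num_aus + 2))    # shared output layer, no activation

    def forward(self, x):                                   # x: (B, 3, 96, 96)
        out = self.fc(self.features(x))
        emo_logits = out[:, :self.num_emotions]             # softmax applied by the classifier/loss
        au_probs = torch.sigmoid(
            out[:, self.num_emotions:self.num_emotions + self.num_aus])  # 17 sigmoid units
        va = out[:, -2:]                                    # valence and arousal used directly
        return emo_logits, au_probs, va
```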
Figure 4 shows a flow diagram of an example method of training a neural network for facial behaviour analysis of facial images. The flow diagram corresponds to the methods described above in relation to Figure 2.
At operation 4.1, a plurality of facial images is input into a neural network. The neural network is described by a set of neural network parameters (e.g. the weights and biases of various layers of the neural network).
The plurality of facial images comprise one or more first facial images from a first training dataset, the first training dataset comprising facial images each with a known emotion label; and one or more second facial images from a second training dataset, the second training dataset comprising facial images each with known action unit activations. The facial images of the first dataset may each be associated with derived action unit activations. The derived action unit activations may be determined based on the known emotion label of said image. One or more facial images of the second dataset may each be associated with a derived emotion label. Each derived emotion label may be determined based on the known action unit activations of said facial image. The derived action unit activations and the derived emotion labels may be determined based on a set of prototypical action unit activations for each emotion label and a set of weighted action unit activations for each emotional label. The derived emotion labels may be a distribution over a set of possible emotion labels. The plurality of facial images may additionally comprise one or more third facial images from a third training dataset, the third training dataset comprising facial images each with a known valence and/or arousal value.
At operation 4.2, a predicted emotion label and predicted action unit activations are generated for each of the plurality of facial images using the neural network. The predicted emotion labels may comprise a probability measure across the set of possible emotions. In some embodiments, predicted valence and/or arousal values are additionally generated. The neural network processes an input facial image through a plurality of neural network layers to output the predicted emotion label and predicted action unit activations (and optionally the predicted valence and arousal values).
At operation 4.3, the parameters of the neural network are updated. The updating is in dependence on a comparison of predicted emotion labels of the one or more first facial images to the known emotion labels of said one or more first facial images; and predicted action unit activations of the one or more second facial images to the known action unit activations of said one or more second facial images. Updating the parameters of the neural network may further be in dependence on comparison of predicted valence and/or arousal values of the one or more third facial images to the known valence and/or arousal values of said one or more third facial images.
In some embodiments, the parameters of the neural network may be updated in dependence on a comparison of the predicted action unit activations of the one or more first facial images to derived action unit activations of the one or more first facial images, as described above in relation to Figure 2. The parameters of the neural network may be updated in dependence on a comparison of the predicted emotion labels of one or more second facial images to corresponding derived emotion labels of the one or more second facial images, as described above in relation to Figure 2. The parameters of the neural network may be updated in dependence on a comparison of derived emotional labels to the predicted emotion labels. The parameters of the neural network may be updated in dependence on a comparison of a distribution of the predicted action unit activations to an expected distribution of action unit activations, as described above in relation to Figure 2. The expected distribution of action unit activations may be determined based on the predicted emotional labels of the plurality of facial images. The expected distribution of action unit activations may further be determined based on a modelled relationship between emotion labels and action unit activations.
The comparison may be performed by a multi-task objective function. The multi-task objective function may comprise: an emotion loss comparing predicted emotion labels to known emotion labels; and an activation loss comparing predicted action unit activations to known action unit activations. The emotion loss and/or activation loss may each comprise a cross entropy loss. The multi-task objective function may further comprise a continuous loss comparing predicted valence and/or arousal values to known valence and/or arousal values. The continuous loss may comprise a measure of a concordance correlation coefficient between the predicted valence and/or arousal values and the known valence and/or arousal values.
An optimisation procedure may be used to update the parameters of the neural network. An example of such an optimisation procedure is a gradient descent algorithm, though other methods may alternatively be used.
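A single parameter update with the multi-task objective might then look like the sketch below, reusing the illustrative helpers from the earlier sketches (FaceBehaviourNet, emotion_loss, au_loss, va_loss and the batch slices); the SGD settings are placeholders.

```python
import torch

model = FaceBehaviourNet()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def training_step(x, labels, slices, delta, lambda_1=1.0, lambda_2=1.0):
    y_emo, y_au, y_va = labels
    emo_logits, au_probs, va = model(x)
    # Evaluate each loss only on the slice of the batch that carries its labels.
    l_emo = emotion_loss(emo_logits[slices["emo"]], y_emo)
    l_au = au_loss(au_probs[slices["au"]], y_au, delta)
    l_va = va_loss(va[slices["va"], 0], y_va[:, 0], va[slices["va"], 1], y_va[:, 1])
    loss = l_emo + lambda_1 * l_au + lambda_2 * l_va
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```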
Operations 4.1 to 4.3 may be iterated until a threshold condition is met. The threshold condition may be a predetermined number of iterations or training epochs.
Alternatively, the threshold condition may be that a change in the value of the multi-task loss function between iterations falls below a predetermined threshold value.
Other examples of threshold conditions for terminating the training procedure may alternatively be used. Figure 5 shows a flow diagram of an example method of facial behaviour analysis of facial images using a trained neural network. The flow diagram corresponds to the methods described above in relation to Figures 1a and 1b. The parameters of the neural network may be determined using any of the training methods described herein (i.e. the neural network is trained using any of the training methods described herein).
At operation 5.1, a facial image is input into a neural network. At operation 5.2, the image is processed using a neural network. The facial image is processed through a plurality of neural network layers. At operation 5.3, a predicted emotion label for the facial image, predicted action unit activations for the facial image and/or a predicted valence and/or arousal value for the facial image is output from the neural network.
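Operations 5.1 to 5.3 may, for example, be realised as in the sketch below, reusing the illustrative preprocess_face helper and FaceBehaviourNet model from the earlier sketches; the emotion name list and the 0.5 threshold on the AU probabilities are assumptions.

```python
import torch
from PIL import Image

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

@torch.no_grad()
def analyse_face(model, image_path):
    # Operation 5.1: preprocess the facial image and input it to the network.
    x = preprocess_face(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # Operation 5.2: process the image through the network layers.
    emo_logits, au_probs, va = model(x)
    # Operation 5.3: output the predicted emotion label, AU activations and
    # valence/arousal values.
    emo_probs = torch.softmax(emo_logits, dim=1).squeeze(0)
    return {
        "emotion": EMOTIONS[int(emo_probs.argmax())],
        "emotion_probabilities": emo_probs.tolist(),
        "action_units": (au_probs.squeeze(0) > 0.5).int().tolist(),
        "valence": float(va[0, 0]),
        "arousal": float(va[0, 1]),
    }
```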
Figure 6 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system. The apparatus (or system) 600 comprises one or more processors 602. The one or more processors control operation of other components of the system/apparatus 600. The one or more processors 602 may, for example, comprise a general purpose processor. The one or more processors 602 may be a single core device or a multiple core device. The one or more processors 602 may comprise a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). Alternatively, the one or more processors 602 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included. The system/apparatus comprises a working or volatile memory 604. The one or more processors may access the volatile memory 604 in order to process data and may control the storage of data in memory. The volatile memory 604 may comprise RAM of any type, for example Static RAM (SRAM) or Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
The system/apparatus comprises a non-volatile memory 606. The non-volatile memory 606 stores a set of operation instructions 608 for controlling the operation of the processors 602 in the form of computer readable instructions. The non-volatile memory 606 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory. The one or more processors 602 are configured to execute operating instructions 608 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 608 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 600, as well as code relating to the basic operation of the system/apparatus 600. Generally speaking, the one or more processors
602 execute one or more instructions of the operating instructions 608, which are stored permanently or semi-permanently in the non-volatile memory 606, using the volatile memory 604 to temporarily store data generated during execution of said operating instructions 608.
Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, or Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 6, cause the computer to perform one or more of the methods described herein. Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.
Although several embodiments have been shown and described, it would be
appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

Claims

1. A computer implemented method of training a neural network for facial behaviour analysis, the method comprising:
inputting, to the neural network, a plurality of facial images, the plurality of facial images comprising:
one or more first facial images from a first training dataset, the first training dataset comprising facial images each with a known emotion label; and
one or more second facial images from a second training dataset, the second training dataset comprising facial images each with known action unit activations;
generating, for each of the plurality of facial images and using the neural network, a predicted emotion label and predicted action unit activations; and
updating parameters of the neural network in dependence on a comparison of: predicted emotion labels of the one or more first facial images to the known emotion labels of the one or more first facial images; and
predicted action unit activations of the one or more second facial images to the known action unit activations of the one or more second facial images.
2. The method of claim 1, wherein the comparison is performed by a multi-task objective function, the multi-task objective function comprising:
an emotion loss comparing predicted emotion labels to known emotion labels; and
an activation loss comparing predicted action unit activations to known action unit activations.
3. The method of claim 2, wherein the emotion loss and/or activation loss comprise a cross entropy loss.
4. The method of any preceding claim, wherein the plurality of facial images further comprises one or more third facial images from a third training dataset, the third training dataset comprising facial images each with a known valence and/or arousal value, wherein the method further comprises generating a predicted valence and/or arousal value, and wherein updating the parameters of the neural network is further in dependence on comparison of predicted valence and/or arousal values of the one or more third facial images to the known valence and/or arousal values of the one or more third facial images.
5. The method of claim 4, wherein the comparison is performed by a multi-task objective function, the multi-task objective function comprising a continuous loss comparing predicted valence and/or arousal values to known valence and/or arousal values.
6. The method of claim 5, wherein the continuous loss comprises a measure of a concordance correlation coefficient between the predicted valence and/or arousal values and the known valence and/or arousal values.
7. The method of any preceding claim:
wherein the facial images of the first dataset are each associated with derived action unit activations, the derived action unit activations determined based on the known emotion label of said image; and
wherein the parameters of the neural network are updated in dependence on a comparison of the predicted action unit activations of the one or more first facial images to the derived action unit activations of the one or more first facial images.
8. The method of claim 7:
wherein one or more facial images of the second dataset are each associated with a derived emotion label, each derived emotion label determined based on the known action unit activations of said facial image; and
wherein the parameters of the neural network are updated in dependence on a comparison of the predicted emotion labels of one or more second facial images to the corresponding derived emotion labels of the one or more second facial images.
9. The method of claim 8, wherein the derived action unit activations and the derived emotion labels are determined based on a set of prototypical action unit activations for each emotion label and a set of weighted action unit activations for each emotional label.
10. The method of any of claims 8 or 9, wherein the derived emotion labels are a distribution over a set of possible emotion labels.
11. The method of claim 10, wherein the predicted emotion labels comprise a probability measure across the set of possible emotions;
wherein the parameters of the neural network are updated in dependence on a comparison of derived emotional labels to the predicted emotion labels.
12. The method of any preceding claim, wherein the parameters of the neural network are updated in dependence on a comparison of a distribution of the predicted action unit activations to an expected distribution of action unit activations, the expected distribution of action unit activations being determined based on the predicted emotional labels of the plurality of facial images.
13. The method of claim 12, wherein the expected distribution of action unit activations is further determined based on a modelled relationship between emotion labels and action unit activations.
14. The method of any preceding claim, wherein:
the one or more first facial images from a first dataset comprises a plurality of images from the first dataset; and
the one or more second facial images from a second dataset comprises a plurality of images from the second dataset.
15. The method of any preceding claim, wherein the method is iterated until a threshold condition is met.
16. A computer implemented method of facial behaviour analysis, the method comprising:
inputting a facial image into a neural network;
processing the image using the neural network; and
outputting, from the neural network, a predicted emotion label for the facial image, predicted action unit activations for the facial image, and/or a predicted valence and/or arousal value for the facial image,
wherein the neural network comprises a plurality of parameters determined using the training method of any preceding claim.
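For illustration only (not part of the claims), the sketch below shows the inference path of claim 16 with a trained multi-task network exposing three prediction heads. The model interface, head ordering and pre-processing are assumptions made for the example.

```python
import torch

def analyse_face(model: torch.nn.Module, image: torch.Tensor) -> dict:
    """image: (3, H, W) tensor, assumed already cropped and normalised to the model's input size."""
    model.eval()
    with torch.no_grad():
        # Assumes the network returns (emotion logits, AU logits, valence/arousal) as a 3-tuple.
        emotion_logits, au_logits, valence_arousal = model(image.unsqueeze(0))
    return {
        "emotion_probs": torch.softmax(emotion_logits, dim=-1).squeeze(0),
        "au_probs": torch.sigmoid(au_logits).squeeze(0),
        "valence": valence_arousal[0, 0].item(),
        "arousal": valence_arousal[0, 1].item(),
    }
```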
17. A system comprising:
one or more processors; and
a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform the method of any preceding claim.
18. A computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform the method of any of claims 1-16.
PCT/GB2020/051503 2019-06-28 2020-06-22 Facial behaviour analysis WO2020260862A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202080044948.0A CN113994341A (en) 2019-06-28 2020-06-22 Facial behavior analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1909300.4A GB2588747B (en) 2019-06-28 2019-06-28 Facial behaviour analysis
GB1909300.4 2019-06-28

Publications (1)

Publication Number Publication Date
WO2020260862A1 true WO2020260862A1 (en) 2020-12-30

Family

ID=67540031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2020/051503 WO2020260862A1 (en) 2019-06-28 2020-06-22 Facial behaviour analysis

Country Status (3)

Country Link
CN (1) CN113994341A (en)
GB (1) GB2588747B (en)
WO (1) WO2020260862A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395922A (en) * 2019-08-16 2021-02-23 杭州海康威视数字技术股份有限公司 Face action detection method, device and system
CN112488214A (en) * 2020-12-02 2021-03-12 浙江大华技术股份有限公司 Image emotion analysis method and related device
CN112949708B (en) * 2021-02-26 2023-10-24 平安科技(深圳)有限公司 Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN115497146B (en) * 2022-10-18 2023-04-07 支付宝(杭州)信息技术有限公司 Model training method and device and identity verification method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8462996B2 (en) * 2008-05-19 2013-06-11 Videomining Corporation Method and system for measuring human response to visual stimulus based on changes in facial expression
US20140316881A1 (en) * 2013-02-13 2014-10-23 Emotient Estimation of affective valence and arousal with automatic facial expression measurement
CN109344760A (en) * 2018-09-26 2019-02-15 江西师范大学 A kind of construction method of natural scene human face expression data collection
CN109508654B (en) * 2018-10-26 2021-01-05 中国地质大学(武汉) Face analysis method and system fusing multitask and multi-scale convolutional neural network
CN109919047A (en) * 2019-02-18 2019-06-21 山东科技大学 A kind of mood detection method based on multitask, the residual error neural network of multi-tag

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHANG WEI-YI ET AL: "FATAUVA-Net: An Integrated Deep Learning Framework for Facial Attribute Recognition, Action Unit Detection, and Valence-Arousal Estimation", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 21 July 2017 (2017-07-21), pages 1963 - 1971, XP033145988, DOI: 10.1109/CVPRW.2017.246 *
GERARD PONS ET AL: "Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 February 2018 (2018-02-19), XP081216547 *
MICHAEL A SAYETTE ET AL: "A Psychometric Evaluation of the Facial Action Coding System for Assessing Spontaneous Expression", JOURNAL OF NONVERBAL BEHAVIOR, 1 September 2001 (2001-09-01), New York, pages 167 - 185, XP055728010, Retrieved from the Internet <URL:https://link.springer.com/content/pdf/10.1023/A:1010671109788.pdf> [retrieved on 20200907], DOI: 10.1023/A:1010671109788 *
SEAN WELLECK ET AL: "Loss Functions for Multiset Prediction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 November 2017 (2017-11-14), XP081419785 *
SEBASTIAN RUDER: "An Overview of Multi-Task Learning in Deep Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 June 2017 (2017-06-15), XP080770244 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI811605B (en) * 2020-12-31 2023-08-11 宏碁股份有限公司 Method and system for mental index prediction
CN113822183A (en) * 2021-09-08 2021-12-21 北京科技大学 Zero-sample expression recognition method and system based on AU-EMO association and graph neural network
CN113822183B (en) * 2021-09-08 2024-02-27 北京科技大学 Zero sample expression recognition method and system based on AU-EMO association and graph neural network
WO2023086585A1 (en) * 2021-11-12 2023-05-19 Covera Health Re-weighted self-influence for labeling noise removal in medical imaging data
CN116721457A (en) * 2023-08-09 2023-09-08 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Multi-task facial expression recognition method guided by emotion priori topological graph
CN116721457B (en) * 2023-08-09 2023-10-24 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Multi-task facial expression recognition method guided by emotion priori topological graph

Also Published As

Publication number Publication date
CN113994341A (en) 2022-01-28
GB201909300D0 (en) 2019-08-14
GB2588747A (en) 2021-05-12
GB2588747B (en) 2021-12-08

Similar Documents

Publication Publication Date Title
WO2020260862A1 (en) Facial behaviour analysis
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
AU2019451948B2 (en) Real-time video ultra resolution
WO2020186703A1 (en) Convolutional neural network-based image processing method and image processing apparatus
Glauner Deep convolutional neural networks for smile recognition
US11798145B2 (en) Image processing method and apparatus, device, and storage medium
US20190114532A1 (en) Apparatus and method for convolution operation of convolution neural network
US20220157041A1 (en) Image classification method and apparatus
CN111832592B (en) RGBD significance detection method and related device
EP4006777A1 (en) Image classification method and device
CN111368672A (en) Construction method and device for genetic disease facial recognition model
CN110738102A (en) face recognition method and system
WO2022002943A1 (en) Semantic Relation Preserving Knowledge Distillation For Image-To-Image Translation
CN110958469A (en) Video processing method and device, electronic equipment and storage medium
CN112232355A (en) Image segmentation network processing method, image segmentation device and computer equipment
Güçlü et al. End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks
CN113407820A (en) Model training method, related system and storage medium
CN110163049B (en) Face attribute prediction method, device and storage medium
Uddin et al. A convolutional neural network for real-time face detection and emotion & gender classification
CN114913339B (en) Training method and device for feature map extraction model
WO2023087063A1 (en) Method and system for analysing medical images to generate a medical report
Shukla et al. Deep Learning Model to Identify Hide Images using CNN Algorithm
CN113674383A (en) Method and device for generating text image
Sharma et al. Solving image processing critical problems using machine learning
Xu Analysis of Emoji Generation Based on DCGAN Model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20734616
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 20734616
Country of ref document: EP
Kind code of ref document: A1