GB2585261A - Methods for generating modified images


Publication number
GB2585261A
Authority
GB
United Kingdom
Prior art keywords
emotion
image
discrete
images
facial landmarks
Prior art date
Legal status
Granted
Application number
GB2000377.8A
Other versions
GB2585261B (en)
GB2585261A8 (en)
GB202000377D0 (en)
Inventor
Toisoul Antoine
Kossaifi Jean
Bulat Adrian
Tzimiropoulos Georgios
Pantic Maja
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of GB202000377D0
Priority to PCT/KR2020/004106 (WO2020204460A1)
Publication of GB2585261A
Publication of GB2585261A8
Application granted
Publication of GB2585261B
Legal status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/175Static expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Method for generating a modified image by recognising human emotions in an input image S100, comprising: receiving an image of a face cropped from an input image S102, S104; identifying facial landmarks using a first set of convolutional neural network (CNN) layers S106; determining discrete emotions S108 and continuous emotions S110 from the facial landmarks using second set of CNN layers; and outputting a modified image comprising at least one of the facial landmarks, the determined discrete emotion and the determined continuous emotion S112. The first and second CNN layers may be trained by a human or by a teacher neural network. The identification of facial landmarks may comprise: generating, by the first set of CNN layers, a facial landmarks heatmap; applying the heatmap to a face image; and analysing areas of the face identified by the heatmap. The discrete emotions may be neutral, happy, sad, surprise, fear, disgust, anger and contempt and the emotion with the highest probability outputted. The continuous emotion may comprise an arousal value and a valence value. An emotion loss function may be applied to the system.

Description

Methods for Generating Modified Images
Field
[1] The present application generally relates to a method of processing images, for example to generate modified images, and in particular to recognising human emotions in images.
Background
[2] Facial affect analysis is a cornerstone of human-computer interaction, as the face is the only direct window onto a person's emotional state. Discrete emotional classes (such as anger, happiness, sadness etc.) are not representative of the spectrum of emotions displayed by humans on a daily basis.
Psychologists typically rely on dimensional measures, namely valence (how positive the state of mind is) and arousal (how calming or exciting the experience is). However, while this task is natural for humans, it is extremely hard for computer-based systems, and automatic estimation of valence and arousal in naturalistic (e.g. in the wild) conditions is an open problem. Additionally, the subjectivity of these measures, even for humans, means it is hard to obtain good quality data.
[3] Most existing work in computer vision focuses on a simplistic setting, namely that of predicting discrete classes of emotion. Existing work may follow an established pipeline. First, a face detector is run on an image to detect every face in the image. Each face is then cropped to remove the background, using the bounding box given by the face detector, before being pre-processed. The pre-processing typically consists of the detection of facial landmarks (fiducial points) on the face. These points are then used to project the face into a canonical frame (e.g. to remove similarity transformations, that is, translation, rotation and scaling). These aligned images are finally used as an input for a method that can estimate facial information such as emotions. Different methods are typically used to estimate different types of facial information. A paper entitled "Registration-Free Face-SSD: Single Shot Analysis of Smiles, Facial Attributes, and Affect in the Wild" by Jang et al, published in Computer Vision and Image Understanding, arXiv:1902.04042v1, 11 Feb 2019, describes how face detection and a single face-related analysis task, e.g. smile recognition, an attribute heatmap or a valence/arousal heatmap, may be estimated.
[004] The present applicant has recognised the need for an alternative method for analysing images containing human faces.
Summary
[005] Broadly speaking, the present techniques relate to a method, apparatus and system for recognising or measuring human emotion in an image and outputting a modified image which reflects the human emotion.
[6] In a first approach of the present techniques, there is provided a computer-implemented method for generating a modified image from an input image by recognising human emotions in an input image, the method comprising: receiving at least one cropped image from an input image, the at least one cropped image comprising a human face; identifying, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the at least one cropped image; determining, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image; and outputting a modified image comprising at least one of the facial landmarks, the determined at least one discrete emotion and the determined at least one continuous emotion for the human face in the at least one cropped image.
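By way of illustration only, the following is a minimal sketch in Python (PyTorch) of how a two-stage model of this kind might be wired together, with a first set of CNN layers producing landmark heatmaps and a second set consuming those heatmaps to predict discrete and continuous emotions in a single forward pass. The class names, layer sizes and the summed-heatmap masking are assumptions made for illustration and are not the architecture actually described herein.

```python
# Minimal sketch (not the patented implementation): a two-stage model in which a
# first CNN predicts facial-landmark heatmaps and a second CNN uses those
# heatmaps together with intermediate features to predict discrete and
# continuous emotions. All class names and layer sizes are illustrative.
import torch
import torch.nn as nn

class LandmarkCNN(nn.Module):
    """First set of CNN layers: predicts one heatmap per facial landmark."""
    def __init__(self, n_landmarks=68):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.to_heatmaps = nn.Conv2d(128, n_landmarks, 1)

    def forward(self, x):
        feats = self.features(x)
        heatmaps = torch.sigmoid(self.to_heatmaps(feats))
        return feats, heatmaps

class EmotionCNN(nn.Module):
    """Second set of CNN layers: predicts discrete classes and valence/arousal."""
    def __init__(self, in_channels, n_landmarks=68, n_classes=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels + n_landmarks, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, n_classes + 2)  # 8 discrete logits + valence + arousal

    def forward(self, feats, heatmaps):
        # Mask the landmark features with a collapsed heatmap and keep the heatmaps too.
        x = torch.cat([feats * heatmaps.sum(dim=1, keepdim=True), heatmaps], dim=1)
        out = self.fc(self.conv(x).flatten(1))
        return out[:, :-2], out[:, -2:]  # (discrete logits, [valence, arousal])

# A single forward pass yields all three outputs jointly.
landmark_net, emotion_net = LandmarkCNN(), EmotionCNN(in_channels=128)
image = torch.randn(1, 3, 256, 256)            # one cropped face image
feats, heatmaps = landmark_net(image)
logits, continuous = emotion_net(feats, heatmaps)
```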
[7] In a second approach of the present techniques, there is provided a computer-implemented method for recognising human emotions in images, the method comprising: receiving a cropped image comprising a human face; identifying, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the received image; determining, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the received image; and outputting the facial landmarks, the determined at least one discrete emotion and the determined at least one continuous emotion for the human face in the received image.
[8] Facial landmarks may be selected from the locations of the eyes, nose and mouth. A discrete emotion may be a single classification of a face and may include any suitable classification, including for example any or all of neutral, happy, sad, surprise, fear, disgust, anger and contempt. Each discrete emotion may be identified together with its probability. A continuous emotion is a value which is indicative of a detected emotion and may include arousal which is a measure of how calm or excited a person is based on their expression and valence which is a measure of how positive or negative the person's expression is. The measure may range between fixed values, e.g. -1 to 1.
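As an illustrative sketch only (not part of the described method), the raw outputs of such a model could be interpreted as follows; the use of a softmax for the discrete probabilities and a tanh to bound valence and arousal in [-1, 1] are assumptions.

```python
# Illustrative sketch only: converting raw network outputs into a discrete
# emotion with a probability and into valence/arousal values bounded in [-1, 1].
import torch

EMOTIONS = ["neutral", "happy", "sad", "surprise",
            "fear", "disgust", "anger", "contempt"]

def interpret_outputs(logits, continuous):
    probs = torch.softmax(logits, dim=-1)                 # one probability per class
    idx = int(torch.argmax(probs, dim=-1))
    discrete = (EMOTIONS[idx], float(probs[0, idx]))      # e.g. ("happy", 0.92)
    valence, arousal = torch.tanh(continuous)[0].tolist() # squash into [-1, 1] (assumption)
    return discrete, valence, arousal
```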
[9] Discrete emotional classes may not be representative of the spectrum of emotions displayed by humans on a daily basis. However, the automatic estimation of a continuous emotion in naturalistic conditions proves difficult for many known computer-based systems. The method described may be considered to be a facial affect analysis which detects facial landmarks and estimates both discrete and continuous emotions in a single pass. The analysis of the image is an end-to-end method with a single model comprising two sets of CNN layers. As described in more detail below, the method reaches a performance superior to both the agreement between expert human annotators and known computerised methods. Outputting may comprise displaying at least one facial landmark, the determined at least one discrete emotion and the determined at least one continuous emotion for the human face on the received image.
[10] Prior to the receiving step, the method may further comprise detecting, within a received image, at least one human face; cropping the received image around each detected human face; and outputting one or more cropped images, each cropped image comprising a detected human face from the received image.
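A minimal pre-processing sketch is given below, assuming some face detector is available; detect_faces() is a hypothetical helper returning bounding boxes and is not a specific library API.

```python
# Pre-processing sketch under the assumption that a face detector is available;
# detect_faces() is a hypothetical helper returning bounding boxes as
# (left, top, right, bottom) tuples.
import numpy as np

def crop_faces(image: np.ndarray, detect_faces) -> list:
    """Return one cropped image per detected face, with the background removed."""
    crops = []
    for (left, top, right, bottom) in detect_faces(image):
        crops.append(image[top:bottom, left:right].copy())
    # Per the description, crops are not rescaled or rotated here; the original
    # size and orientation of each face are retained.
    return crops
```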
[11] Identifying a plurality of facial landmarks may comprise applying a facial landmarks heatmap to the human face in the cropped image; and analysing areas of the human face identified by the facial landmarks heatmap to identify the plurality of facial landmarks. The facial landmarks heatmap may be generated by the first set of CNN layers. The method may comprise applying an attention mechanism which multiplies features extracted using different layers in the first set of CNN layers by the facial landmarks heatmap.
[12] Prior to the receiving step, the first and second sets of CNN layers may be trained. For example, the method may further comprise training the first set of CNN layers to identify a plurality of facial landmarks on a human face by analysing a plurality of images, wherein each image of the plurality of images is annotated using a set of pre-defined facial landmarks applied by human annotators. The method may further comprise training the second set of CNN layers to identify the at least one discrete emotion and the at least one continuous emotion on a human face by analysing a plurality of images, wherein each image of the plurality of images is annotated using a set of pre-defined discrete and continuous emotions applied by human annotators. It will be appreciated that the training of the first and second sets of CNN layers may represent a stand-alone approach of the present techniques.
[13] The training may be improved by "cleaning" the plurality of images which are analysed, for example to remove inconsistencies or inaccuracies. The method may further comprise identifying, prior to the training, one or more images of the plurality of images comprising an incorrect annotation; and removing the identified one or more images from the database. For example, an image having an incorrect discrete emotion annotation and/or an incorrect continuous emotion annotation may be identified.
[14] Alternatively, or additionally to the cleaning of the data, the method may comprise applying distillation during the training process. The method may further comprise training a teacher network by analysing a first plurality of images, wherein the teacher network comprises a first plurality of CNN layers which identify a plurality of facial landmarks on a human face and a second plurality of CNN layers which use the identified plurality of facial landmarks to determine at least one discrete emotion and at least one continuous emotion and each image of the plurality of images is annotated using a set of pre-defined facial landmarks, a set of pre-defined discrete emotions and a set of pre-defined continuous emotions. The first plurality of CNN layers in the teacher layer may be similar or identical to the first set of CNN layers which are used in the method to identify facial landmarks and similarly, the second plurality of CNN layers in the teacher layer may be similar or identical to the second set of CNN layers which are used to determine the discrete and continuous emotions. Once the teacher network has been trained, a second plurality of images may be input into the trained teacher network and the facial landmarks, the determined at least one discrete emotion, and the determined at least one continuous emotion for each human face in the received second plurality of images may be output from the teacher network.
[015] The output from the teacher network may be used to train the first and second sets of CNN layers which may be termed a student network. For example, the method may comprise training the first set of CNN layers to identify a plurality of facial landmarks on a human face by analysing the second plurality of images, wherein each image of the second plurality of images is annotated using the facial landmarks output from the teacher network. Similarly, the method may comprise training the second set of CNN layers to identify the at least one discrete emotion and the at least one continuous emotion on a human face by analysing the second plurality of images, wherein each image of the second plurality of images is annotated using a set of pre-defined discrete and continuous emotions output from the teacher network. The teacher network may smooth incorrect labels in the second plurality of images because it has learnt from the labels in the first plurality of images.
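The following is a hedged sketch of the distillation step described above, assuming the teacher and student models both return a (landmarks, discrete logits, continuous values) tuple and that a suitable combined loss function is supplied; it is not the exact training code of the described system.

```python
# Sketch of the teacher/student distillation idea: the student is trained on
# the labels produced by the trained teacher rather than on the database labels.
import torch

def distill(teacher, student, second_plurality_of_images, optimiser, loss_fn):
    teacher.eval()
    for images in second_plurality_of_images:          # batches of face crops
        with torch.no_grad():
            # The teacher provides (smoothed) landmark, discrete and continuous labels.
            t_landmarks, t_logits, t_continuous = teacher(images)
        s_landmarks, s_logits, s_continuous = student(images)
        loss = loss_fn(s_landmarks, t_landmarks,
                       s_logits, t_logits,
                       s_continuous, t_continuous)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```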
[016] The method may further comprise applying a loss function which combines at least one metric for the discrete emotion and at least one metric for the continuous emotion. Various metrics may be included in the loss function, including a categorical loss which is a measure of how well the predicted category of the discrete emotion corresponds to the target category (namely, the ground truth emotion as annotated by human experts). Other metrics include an error metric, e.g. the root mean square error, which determines the level of error of the predicted values of the continuous emotion when compared to the target values, a sign metric which compares the sign of the continuous emotion with the target sign, and one or more correlation coefficients (e.g. a Pearson correlation coefficient or a Concordance correlation coefficient) which measure the correlation between the predictions and the target values. An optimal method may minimise error but return higher values for the other metrics. For example, the loss function may be a combination of a categorical loss for the at least one discrete emotion and a loss to maximise a concordance correlation coefficient for the at least one continuous emotion. The loss function may be a combination of a categorical loss for the at least one discrete emotion, a loss to maximise a concordance correlation coefficient for the at least one continuous emotion, a loss related to minimizing the root mean square error for the at least one continuous emotion and a loss to maximise the Pearson correlation coefficient for the at least one continuous emotion.
[17] The plurality of images and/or the cropped image may comprise any one of: a still image, a frame of a recorded video, a frame from a videoconference stream, and a frame of a livestream.
[18] In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement any of the methods described herein.
[19] The method may be implemented on any suitable electronic user device, such as a smartphone, smart television, gaming equipment, robotic assistant, etc. The method may be used in various applications, including detection of special moments based on emotions and a filter based on the emotion state (for example, as in Snapchat™, Instagram™ or similar). The modified image which is output may thus include a filter which is representative of at least one of the discrete emotion and the continuous emotion.
[20] In a related approach of the present techniques, there is provided an apparatus for recognising human emotions in images, the apparatus comprising: at least one processor, coupled to a memory, arranged to: receive a cropped image comprising a human face; identify, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the received image; determine, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the received image; and output, based on the determining, a plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the received image.
[21] In another related approach of the present techniques, there is provided an apparatus for generating a modified image from an input image by recognising human emotions in an input image, the apparatus comprising: at least one processor, coupled to a memory, arranged to: receive at least one cropped image from an input image, the at least one cropped image comprising a human face; identify, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the at least one cropped image; determine, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image; and output, based on the determining, a modified image comprising at least one of the identified plurality of facial landmarks, the at least one discrete emotion and the at least one continuous emotion for the human face in the at least one cropped image.
[022] The apparatus may be any one of: a smartphone, tablet, laptop, computing device, smart television, gaming device, and robotic device. It will be understood that this is a non-limiting and non-exhaustive list of example apparatuses. The memory may store any or all of the first set of convolutional neural network (CNN) layers, the second set of convolutional neural network (CNN) layers and/or the input image. It will also be appreciated that the features of the method described above also apply to the apparatus.
[023] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[024] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[25] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[26] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[27] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP).
The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD-or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[28] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[29] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
[30] The above-mentioned features described with respect to the first approach apply equally to the second and third approaches.
Brief description of drawings
[031] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[032] Figure 1a is a flow chart of a method for processing images;
[033] Figure 1b is a schematic representation of an input into the method of Figure 1a;
[034] Figures 1c and 1d are schematic representations of intermediate outputs in the method of Figure 1a;
[035] Figure 2 is an illustration of an output from the method of Figure 1a;
[36] Figure 3 is a schematic representation of a neural network for use in the method of Figure 1a; and
[37] Figure 4 is a block diagram of an apparatus and system for implementing the method of Figure 1a.
Detailed description of drawings
[38] Broadly speaking, the present techniques relate to a method, apparatus and system for recognising or measuring human emotion in an image and outputting a modified image which reflects the human emotion.
[39] Figure 1a illustrates a method of processing an input image which is received at step S100. An example of an input image is shown in Figure 1b and, in this example, the image comprises three human faces which are each at different angles with respect to the image. The image may be a still image or may be extracted from a continuous stream of images and/or a video. In the next step, the faces are identified in the input image (S102). An intermediate image is then generated at step S104 using the identified faces and an example is shown in Figure 1c. The intermediate image is produced by a face detector and comprises a bounding box around each detected face. The background around the bounding box is cropped from the image so that the next steps are processed on the face images only. The intermediate images may thus be termed cropped images. The intermediate images may not be scaled, e.g. the original sizes may be retained, and may not be rotated to the same orientation, e.g. the original orientation may be retained as shown in Figure 1c.
[040] The previous steps may be considered to be pre-processing and, as explained in more detail below, may be performed using standard techniques. Once the pre-processing is complete, the intermediate image may be input into the next stage of the method. In this processing stage, three separate outputs are simultaneously identified. Step S106 identifies facial landmarks in the intermediate image and Figure 1d shows schematically a result of such identification. As shown in Figure 1d, the location of the eyes and mouth for each face is identified and may be illustrated by a dotted or dashed line which is overlaid on the original image.
[041] In step S108, at least one discrete emotion is obtained or determined. A discrete emotion may be a single classification of a face and may include any suitable classification, including for example any or all of neutral, happy, sad, surprise, fear, disgust, anger and contempt. Each discrete emotion may be identified together with its probability. The resulting discrete emotion which is output may be the discrete emotion having the highest probability or alternatively several discrete emotions may be output together with their probability and may be ranked. When evaluating continuous streams of images on which a single discrete emotion is output, the discrete emotion may remain unchanged until the probability of a new discrete emotion is above a threshold and the output discrete emotion may be then changed. It will be appreciated that any suitable arrangement may be applied.
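As a small illustration of the stream behaviour described above, the helper below only switches the reported discrete emotion once a new emotion's probability exceeds a threshold; the threshold value of 0.6 is an assumption.

```python
# Illustrative helper only: when processing a continuous stream, the displayed
# discrete emotion changes only when a new emotion's probability is above a
# threshold. The threshold value is an assumption, not taken from the description.
class StableEmotion:
    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.current = None

    def update(self, emotion: str, probability: float) -> str:
        if self.current is None or (emotion != self.current
                                    and probability > self.threshold):
            self.current = emotion
        return self.current
```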
[42] In step S110, a continuous emotion is obtained or determined. A continuous emotion is a value which is indicative of a detected emotion and may include arousal which is a measure of how calm or excited a person is based on their expression and valence which is a measure of how positive or negative the person's expression is. The measure may range between fixed values, e.g. -1 to 1. The measure may be constantly changing. The continuous emotion may be related to the discrete emotion.
For example, a determined discrete emotion of happy should have a positive valence and similarly, a determined discrete emotion of sad should have a negative valence. The measure of arousal for both the sad and happy emotions may be negative or positive. As another example, a determined discrete emotion of fear should have a negative valence and a positive arousal. Continuous emotion (e.g. valence and arousal) may also be termed a dimensional measure of affect which is standard in psychology and other fields. For example, these measures were introduced in "A circumplex model of affect" by Russell et al. published in the Journal of Personality and Social Psychology 39, 1980.
[43] Although steps S106, S108 and S110 are shown as discrete steps, each of the facial landmarks, discrete and continuous emotions are determined simultaneously and may also be output simultaneously (step S112). Figure 2 illustrates an output image which has been processed as described above. As shown in Figure 2, each face is enclosed within a bounding box 20. A discrete emotion 22 is indicated on the image (in this arrangement, there is a single discrete emotion which is presented above the bounding box). In this output image, all the faces have the same discrete emotion, namely happy. It will be appreciated that different faces may have the same or different discrete emotions.
Two values for the continuous emotions 24, 26 are shown (in this arrangement, below the bounding box). One value 24 indicates the valence and the other value 26 is for arousal. The continuous emotions are in line with the discrete emotion. Each of the faces has a valence value of 1 which is the maximum level of positivity and has an arousal of around 0.0. It will be appreciated that Figure 2 is just one possible way of representing the recognised emotions and a filter may be used as an alternative to outputting a word and/or numeric values.
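One possible way to render an output image such as that of Figure 2, given here purely as an assumption-laden illustration using OpenCV rather than as the described implementation, is:

```python
# One possible way (an assumption, not the described implementation) to render
# an output image like Figure 2: a bounding box with the discrete emotion above
# it and the valence/arousal values below it.
import cv2

def draw_result(image, box, emotion, valence, arousal):
    left, top, right, bottom = box
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
    cv2.putText(image, emotion, (left, top - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.putText(image, f"valence {valence:+.2f}  arousal {arousal:+.2f}",
                (left, bottom + 25), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return image
```

A filter or emoji overlay representative of the recognised emotion could be drawn in place of the text labels, as noted above.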
[44] Figure 3 is a schematic representation of the network for implementing the method of Figure 1a. The network comprises a first convolutional neural network (CNN) 30 having a plurality of convolutional blocks and a second convolutional neural network (CNN) 32 also having a plurality of convolutional blocks. The first CNN 30 may be termed a Face Alignment Network and is described in "How far are we from solving the 2d and 3d face alignment problem?" by Bulat et al published in International Conference on Computer Vision 2017.
[45] As shown schematically in Figure 3, there are links 34 between the first and second convolutional neural networks. There may, for example, be links 34 from different layers 38 in the first convolutional neural network, each of which provides a heatmap which extracts facial features within the image. An attention mechanism 36, such as the one described in "Attention is all you need" by Vaswani et al published in Advances in Neural Information Processing Systems, is used. The attention mechanism 36 multiplies the features extracted at the three layers 38 in the first CNN by the predicted facial landmarks (heatmap). The heatmap represents the probability of the location of each facial landmark. Using the heatmap allows the second CNN to better focus on the areas of the face which are likely to be important for estimating emotions and removes regions that are less useful.
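A sketch of this attention step is given below, assuming the per-landmark heatmaps are collapsed into a single spatial mask and resized to each feature map's resolution before the element-wise multiplication; these details are illustrative assumptions.

```python
# Sketch of the attention step described above: features taken from several
# layers of the first CNN are multiplied element-wise by the predicted landmark
# heatmap so that the emotion layers focus on landmark regions.
import torch
import torch.nn.functional as F

def apply_landmark_attention(feature_maps, heatmaps):
    """feature_maps: list of tensors (B, C_i, H_i, W_i); heatmaps: (B, L, H, W)."""
    # Collapse the per-landmark heatmaps into a single spatial attention mask.
    mask = heatmaps.sum(dim=1, keepdim=True).clamp(0, 1)
    attended = []
    for feats in feature_maps:
        # Resize the mask to each feature map's resolution before multiplying.
        resized = F.interpolate(mask, size=feats.shape[-2:], mode="bilinear",
                                align_corners=False)
        attended.append(feats * resized)
    return attended
```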
[46] As well as being used in the second CNN, the predicted facial landmarks 40 are, as shown, output from the first CNN. The continuous and discrete emotions are output from the second CNN, which has a fully connected layer as a final layer. Linking the predicted facial landmarks from the first CNN in the way described effectively means that the overall network creates a joint prediction of three key outputs. The joint prediction means that the proposed network is relatively fast and may be used in real time to analyse video and may output continuously changing discrete and continuous emotion values as well as continuously changing facial landmarks. The processing pipeline may also be considered to be simplified.
[47] Before the network may be used, the network needs to be trained. The training ideally needs to be done on a database which has been collected in naturalistic (in the wild) conditions and is accurately annotated, e.g. for valence and arousal. The database may contain static and/or video images. In one arrangement, a large scale dataset known as AffectNet, which is described in "A database for facial expression, valence and arousal computing in the wild" by Mollahosseini et al published in CoRR abs/1708.03985 (2017), was selected as a training dataset. This database is a large scale dataset for discrete and continuous emotion recognition (valence and arousal). It contains more than a million images downloaded from the Internet using specific queries on search engines, annotated in terms of 66 facial landmarks. Of these one million images, 450,000 were manually annotated by twelve human annotators. The dataset contains by far the largest variety of subjects.
[48] The test results for different variants of the method of Figure 1a are shown in the table below and are compared with some standard techniques.
| Network | Acc. | Val. RMSE | Val. SAGR | Val. PCC | Val. CCC | Ar. RMSE | Ar. SAGR | Ar. PCC | Ar. CCC |
|---|---|---|---|---|---|---|---|---|---|
| Reference | 0.58 | 0.37 | 0.74 | 0.66 | 0.60 | 0.41 | 0.65 | 0.54 | 0.34 |
| Face generation | 0.60 | 0.37 | 0.78 | 0.66 | 0.62 | 0.39 | 0.75 | 0.55 | 0.54 |
| Face-SSD | - | 0.44 | 0.73 | 0.58 | 0.57 | 0.39 | 0.71 | 0.50 | 0.47 |
| Human inter-agreement | - | 0.34 | 0.82 | 0.82 | 0.82 | 0.36 | 0.67 | 0.57 | 0.55 |
| ResNet-18 regression | - | 0.39 | 0.78 | 0.66 | 0.66 | 0.34 | 0.77 | 0.6 | 0.6 |
| ResNet-18 regression, FAN landmarks | - | 0.37 | 0.79 | 0.69 | 0.69 | 0.34 | 0.78 | 0.61 | 0.6 |
| ResNet-18 hybrid, FAN landmarks | 0.59 | 0.37 | 0.79 | 0.7 | 0.7 | 0.33 | 0.79 | 0.62 | 0.62 |
| EmoFAN hybrid | 0.6 | 0.36 | 0.8 | 0.71 | 0.71 | 0.33 | 0.8 | 0.64 | 0.64 |
| EmoFAN hybrid attention | 0.6 | 0.35 | 0.81 | 0.72 | 0.72 | 0.33 | 0.8 | 0.64 | 0.64 |
| EmoFAN hybrid attention, MSE+PCC+CCC shake-shake distillation | 0.62 | 0.33 | 0.81 | 0.73 | 0.73 | 0.30 | 0.81 | 0.65 | 0.65 |
| EmoFAN hybrid attention, MSE+PCC+CCC shake-shake distillation, cleaned sets | 0.74 | 0.29 | 0.84 | 0.82 | 0.82 | 0.27 | 0.80 | 0.75 | 0.75 |

[049] As shown in the table, the quality of the predictions for both valence and arousal is evaluated using four metrics labelled RMSE, SAGR, PCC and CCC, where $\hat{y}$ corresponds to the predicted labels, $y$ corresponds to the target labels, and $\mu_y$ and $\sigma_y$ correspond to the mean and the standard deviation of $y$ (a reference implementation of these metrics is sketched after paragraph [50] below):
* the Root Mean Square Error (RMSE), which tells how close the predicted values are to the target values (the lower the better): $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$
* the Sign Agreement (SAGR), which evaluates whether the sign of the predicted value matches the sign of the target value (the higher the better): $\mathrm{SAGR} = \tfrac{1}{n}\sum_{i=1}^{n}\delta\big(\mathrm{sign}(\hat{y}_i),\,\mathrm{sign}(y_i)\big)$
* the Pearson Correlation Coefficient (PCC), which measures how correlated the predictions and target values are (the higher the better): $\mathrm{PCC} = \rho_{\hat{y}y} = \dfrac{\mathrm{cov}(\hat{y}, y)}{\sigma_{\hat{y}}\,\sigma_{y}}$
* the Concordance Correlation Coefficient (CCC), which also measures a correlation but penalizes correlated signals with different means (the higher the better): $\mathrm{CCC} = \dfrac{2\,\rho_{\hat{y}y}\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2}$
[50] The standard techniques include a line labelled "Reference", which is the first CNN of Figure 3 used alone and as described in "How far are we from solving the 2d and 3d face alignment problem?" by Bulat et al published in International Conference on Computer Vision 2017. Other standard techniques include "Face generation", which is described in "Generating faces for affect analysis" by Kollias et al published in CoRR abs/1811.05027 (2018), "Face-SSD", which is described in "Registration-free face-ssd: Single shot analysis of smiles, facial attributes, and affect in the wild" by Jang et al published in Computer Vision and Image Understanding (2019), and analysis by humans labelled "Human inter-agreement", which is also described in the paper by Bulat et al. Additionally, a baseline to predict valence and arousal values using a traditional approach is labelled as "ResNet-18 regression".
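The four metrics defined in paragraph [049] above can be computed with the standard formulas as follows; this is a reference sketch rather than code taken from the described system.

```python
# Reference implementations (standard formulas) of the four evaluation metrics
# used in the table above. pred and target are 1-D numpy arrays.
import numpy as np

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

def sagr(pred, target):
    return np.mean(np.sign(pred) == np.sign(target))

def pcc(pred, target):
    return np.corrcoef(pred, target)[0, 1]

def ccc(pred, target):
    pcc_value = pcc(pred, target)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    # CCC penalizes correlated signals whose means or variances differ.
    return (2 * pcc_value * np.sqrt(var_p) * np.sqrt(var_t)
            / (var_p + var_t + (mean_p - mean_t) ** 2))
```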
[51] As explained above with reference to Figure 1a, estimating good facial landmarks is part of the pre-processing step of the algorithm and, as shown in the table above, has a significant impact on the performance of the algorithm. As the landmarks provided in databases such as the AffectNet database were originally computed using legacy methods, they were recomputed using the state-of-the-art facial landmarks detector described in "How far are we from solving the 2d & 3d face alignment problem?" by Bulat et al published in International Conference on Computer Vision (2017). The results are given in the lines "ResNet-18 regression, FAN landmarks" and "ResNet-18 hybrid, FAN landmarks". As can be seen, the networks trained using these improved landmarks show an improvement in performance compared to the baseline "ResNet-18 regression".
[52] As shown in Figure 3, the network (or model) contains two parts, namely two CNNs. The first CNN is also known as the FAN and this is unchanged from the paper by Bulat et al, except that it provides links to the second CNN which contains the emotional convolutional layers. The FAN allows a prediction of accurate facial landmarks and the emotional convolutional layers take as input the features learned by the FAN to predict the emotions directly.
[053] Even expert human annotators are prone to error when annotating a large amount of data. However, it is expected that there will be fewer instances of both emotions being mislabelled when both discrete and continuous emotions are annotated. The network shown in Figure 3 is trained to simultaneously determine (or estimate) the continuous and discrete emotion when annotations for both are available in images. As a result, the network is more robust against outliers in the dataset that have either of the two mislabelled. The loss function used by the network is changed to a sum of a cross entropy for the categorical loss $\mathcal{L}_{\text{categories}}$ and the CCC loss $\mathcal{L}_{\text{CCC}}$ for the regression. The loss functions are defined as:

$\mathcal{L}_{\text{categories}} = -\sum_{c} y_c \log \hat{y}_c, \qquad \mathcal{L}_{\text{CCC}} = 1 - \mathrm{CCC}, \qquad \mathcal{L} = \mathcal{L}_{\text{categories}} + \mathcal{L}_{\text{CCC}}$
[54] The results for just the two stages of this network operating in combination and using the loss function above are shown in line "EmoFAN hybrid" of the table. As explained with reference to Figure 3, an attention mechanism may be added to the network and the result is shown in line "EmoFAN hybrid attention" of the table. There is an improvement in the metrics when compared to the previous line.
[55] For continuous affect prediction, maximizing the correlation coefficients, namely PCC and CCC, is of main interest. However each metric encodes important information about the task and they can also be related (e.g. a lower RMSE usually leads to a higher SAGR as the prediction error is lower). Therefore an optimal predictor should be able to maximize all of them while minimizing the RMSE. This information may be encoded into the network by changing the loss function to a sum of four terms: a categorical loss for discrete emotions, a loss related to minimizing the RMSE, a loss to maximize the PCC and finally a loss to maximize the CCC.
[56] The regression loss may be further regularized with shake-shake regularization coefficients $\alpha$, $\beta$, $\gamma$ chosen randomly and uniformly in the [0;1] range at each iteration of the training process. The use of such coefficients is described in "Shake-shake regularization" by Gastaldi published in arXiv preprint arXiv:1705.07485 (2017). The regularization prevents the network from focusing only on the minimization of one of the three regression losses. The full loss minimized by the network of Figure 3 is given in the equation below, where $\mathcal{L}_{\text{CCC}}$ and $\mathcal{L}_{\text{categories}}$ are as defined above and $\mathcal{L}_{\text{RMSE}}$ and $\mathcal{L}_{\text{PCC}} = 1 - \mathrm{PCC}$ are the corresponding RMSE and PCC losses:

$\mathcal{L} = \mathcal{L}_{\text{categories}} + \alpha\,\mathcal{L}_{\text{RMSE}} + \beta\,\mathcal{L}_{\text{PCC}} + \gamma\,\mathcal{L}_{\text{CCC}}$

[057] The change of loss leads to an overall improvement of the network, not only on the regression task but also on the accuracy of the predicted discrete emotions.
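A sketch of this combined loss is given below, assuming the PCC and CCC losses take the usual (1 - coefficient) form and treating valence and arousal jointly; the exact weighting and per-dimension handling in the described system may differ.

```python
# Sketch of the combined, shake-shake-regularized loss: categorical cross
# entropy for discrete emotions plus randomly weighted RMSE, PCC and CCC
# losses for the continuous emotions. Weighting details are assumptions.
import torch
import torch.nn.functional as F

def pearson(pred, target):
    pred_c, target_c = pred - pred.mean(), target - target.mean()
    return (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + 1e-8)

def concordance(pred, target):
    rho = pearson(pred, target)
    return (2 * rho * pred.std() * target.std()
            / (pred.var() + target.var() + (pred.mean() - target.mean()) ** 2))

def total_loss(logits, labels, pred_va, target_va):
    """logits/labels: discrete emotion; pred_va/target_va: valence and arousal."""
    l_cat = F.cross_entropy(logits, labels)
    l_rmse = torch.sqrt(F.mse_loss(pred_va, target_va))
    l_pcc = 1 - pearson(pred_va.flatten(), target_va.flatten())
    l_ccc = 1 - concordance(pred_va.flatten(), target_va.flatten())
    # Shake-shake: random regression weights drawn uniformly in [0, 1] at each
    # iteration prevent the network from minimizing only one regression term.
    alpha, beta, gamma = torch.rand(3)
    return l_cat + alpha * l_rmse + beta * l_pcc + gamma * l_ccc
```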
[58] A known technique to improve network predictions is described in "Born-again neural networks" by Furlanello et al published in International Conference on Machine Learning, 1602-1611 (2018). The known technique may be termed distillation and it works in two steps. First a teacher network is trained on a specific database. Then a second network, called the student network, is trained on the same database but using the labels predicted from the teacher network instead of the labels given in the database. The idea behind this process is that the teacher network already learnt how to smooth incorrect labels in the database and therefore providing these to the student network gives much cleaner data to learn from. This technique may be optionally incorporated into the training steps which occur before the method shown in Figure 1 a.
[59] The results for the model of Figure 3 which incorporates the two CNNs, the attention mechanism and the loss function using MSE, PCC and CCC with distillation are shown in the line "EmoFAN hybrid attention, MSE+PCC+CCC shake-shake distillation". Again, there is an improvement.
[60] At the core of Machine Learning is the idea of minimizing the empirical risk as a good approximation to the actual risk, which circumvents the need for evaluating the joint probability distribution over the labels and the data. As a result, a crucial assumption is that the distribution of the data and their labels is the same for the training, testing and validation sets. It was noted that a large proportion of labels were incorrect in the database. This phenomenon, known as label shift, severely affects performance.
[61] Before testing the trained network, the testing and validation data may be cleaned by manually removing all incorrect labels. As demonstrated in the results in the table above, such cleaned data gives a better way to validate the hyper-parameters of the model of Figure 3 and evaluate its performance. More specifically the test set may be cleaned in two steps. First the images with wrong annotations of the discrete emotions were removed. Then images that had incorrect values of valence and arousal were removed. The conditions for cleaning the valence and arousal depending on the discrete emotions were as follows:
* Neutral should have an intensity of emotion smaller than 0.2
* Happy should have positive valence
* Sad should have negative valence
* Surprise should have positive arousal
* Fear should have negative valence and positive arousal
* Disgust should have negative valence
* Anger should have positive arousal
* Contempt should have negative valence
[062] The original test set has 4000 images (500 for each of the 8 discrete emotions). After cleaning the discrete emotions, the number of images left is 2828. A further clean of valence and arousal values results in a test set containing 2733 images.
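The cleaning conditions listed in paragraph [61] above can be expressed as a simple filter; the use of sqrt(valence^2 + arousal^2) as the "intensity of emotion" for the neutral rule is an assumption.

```python
# Sketch of the consistency rules listed above for cleaning annotations: a
# sample is kept only if its valence/arousal values agree with its discrete
# emotion label. The intensity measure for "neutral" is an assumption.
import math

def is_consistent(emotion, valence, arousal):
    rules = {
        "neutral":  math.sqrt(valence ** 2 + arousal ** 2) < 0.2,
        "happy":    valence > 0,
        "sad":      valence < 0,
        "surprise": arousal > 0,
        "fear":     valence < 0 and arousal > 0,
        "disgust":  valence < 0,
        "anger":    arousal > 0,
        "contempt": valence < 0,
    }
    return rules.get(emotion, True)

annotations = [{"emotion": "happy", "valence": 0.8, "arousal": 0.1},
               {"emotion": "happy", "valence": -0.4, "arousal": 0.2}]  # second is inconsistent
cleaned = [a for a in annotations
           if is_consistent(a["emotion"], a["valence"], a["arousal"])]
```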
[063] The results for the model incorporating the two CNNs, the attention mechanism, the improved loss function, distillation and cleaning of the test data are given in the line "EmoFAN hybrid attention, MSE+PCC+CCC shake-shake distillation, cleaned sets". Clearly, the combination together with the cleaned data translates into a large improvement in performance when compared to the known methods. The performance of our method on the test set shows that the network can predict arousal values much better than humans on the AffectNet database. This confirms the fact that arousal is known to be hard for humans to estimate. On the valence predictions, the network has a lower RMSE, a higher SAGR and equal values of PCC and CCC when compared to human inter-agreement. This result is also understandable as humans perform well when having to assess whether an emotion is positive or negative.
[064] Another database which may be used for training and testing is known as AFEW-VA and is described in "Afew-va database for valence and arousal estimation in-the-wild" by Kossaifi et al published in Image and Vision Computing 65, 23-36 (2017). This database is a dataset of 600 video clips, spanning 30,000 frames collected in the wild with high quality annotations of valence and arousal levels and 68 facial landmarks, all accurately annotated per-frame. As the AFEW-VA database is very small, the first CNN in the model of Figure 3 was trained on the AffectNet database and fine tuning of the emotional layers was done on AFEW-VA. To be comparable to previous works, a subject-independent 5-fold cross validation was used. The comparative results using this dataset are shown in the table below:

| Network | Val. RMSE | Val. SAGR | Val. PCC | Val. CCC | Ar. RMSE | Ar. SAGR | Ar. PCC | Ar. CCC |
|---|---|---|---|---|---|---|---|---|
| Reference method (RF Hybrid DCT) | 0.27 | - | 0.407 | - | 0.23 | - | 0.45 | - |
| FG2019 submission (ResNet50 CNN+TRL) | 0.40 | - | 0.33 | 0.33 | 0.41 | - | 0.42 | 0.40 |
| Aff-Wild fine tuning with cross validation | - | - | 0.51 | 0.52 | - | - | 0.58 | 0.56 |
| Face generation | 0.22 | - | - | 0.54 | 0.17 | - | - | 0.59 |
| EmoFAN (as in Figure 3) | 0.23 | 0.65 | 0.7 | 0.69 | 0.22 | 0.81 | 0.67 | 0.66 |

[065] In the table above, the first reference method is described in the paper by Kossaifi et al. The line labelled "Aff-Wild fine tuning with cross validation" is another reference method described in "Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond" by Kollias et al published in arXiv preprint arXiv:1804.10938 (2018). The line labelled "Face generation" is a known method described in "Generating faces for affect analysis" by Kollias et al published in CoRR abs/1811.05027 (2018).
[066] Another database that may be used is known as SEWA and is described in "Sewa db: A rich database for audio-visual emotion and sentiment research in the wild" by Kossaifi et al published in arXiv preprint arXiv:1901.02839 (2019). The database is a large-scale multimodal dataset containing over 2000 minutes of audio and video data, and richly annotated in terms of 68 facial landmarks, continuous valence and arousal levels, as well as facial action units. It contains 398 different subjects from six different cultures. The comparative results using this dataset are shown in the table below:

| Network | Val. RMSE | Val. SAGR | Val. PCC | Val. CCC | Ar. RMSE | Ar. SAGR | Ar. PCC | Ar. CCC |
|---|---|---|---|---|---|---|---|---|
| Reference SEWA (SVM, video-video, multicultural) | - | - | 0.32 | 0.31 | - | - | 0.18 | 0.20 |
| Reference SEWA (LSTM, video-video, multicultural) | - | - | 0.32 | 0.28 | - | - | 0.15 | 0.12 |
| FG2019 submission (VGG16 CNN+TRL) | 0.33 | - | 0.50 | 0.47 | 0.39 | - | 0.44 | 0.39 |
| Inter-agreement | 0.24 | 0.64 | 0.38 | 0.27 | 0.24 | 0.62 | 0.33 | 0.23 |
| EmoFAN (Figure 3) | 0.3 | 0.74 | 0.68 | 0.6 | 0.35 | 0.68 | 0.66 | 0.56 |

[067] The first two reference methods which are compared are described in the paper by Kossaifi et al which describes the dataset. The results from analysis by a human are also shown as "Inter-agreement".
[068] The results in the second and third tables also show that the proposed network performs much better than previous state of the art methods. It even outperforms by a large margin the human inter-agreement on the SEWA database on all metrics except the RMSE. One possible reason for the improved performance may result from the training to simultaneously estimate both the discrete and continuous emotion using the same network. Such a network is likely to be more robust against outliers in the dataset that have a mislabelled annotation.
[069] Figure 4 is a block diagram of an apparatus 102 and system 100 for recognising human emotions in images and for generating modified images based on the recognition of the emotion.
[070] The apparatus 102 may be any one of: a smartphone, tablet, laptop, computing device, smart television, gaming device, and robotic device. It will be understood that this is a non-limiting and non-exhaustive list of example apparatuses. Apparatus 102 comprises a processor 108 which comprises processing logic to carry out the method described above and to generate output images in response to the processing. The processor may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit.
[71] The apparatus 102 comprises one or more interfaces 114, such as a user interface, that enable the device to receive inputs and/or generate outputs (e.g. audio and/or visual inputs and outputs, or control commands, etc.) For example, the apparatus 102 may comprise a display screen to display the output image to a user.
[72] The apparatus 102 may comprise storage 110. Storage 110 may comprise a volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example. Storage 110 may store the first and second sets of convolutional neural network layers and one or more input images.
[73] In some cases, the apparatus 102 may comprise an app (i.e. a software application) 104 which a user may use to initiate the recognition of human emotion in an image and/or to generate the modified image.
[74] The system 100 may comprise one or more additional devices (not shown) which communicate with the apparatus 102. The devices may be communicatively coupled to apparatus 102 (directly or indirectly, e.g. via a home hub or gateway device). Thus, apparatus 102 may comprise a communication module 106 suitable for sending and receiving data. The communication module may communicate with other components of the system 100 using any one or more of: wireless communication (e.g. WiFi), hypertext transfer protocol (HTTP), message queuing telemetry transport (MQTT), a wireless mobile telecommunication protocol, short range communication such as radio frequency identification (RFID) or near field communication (NFC), the communication protocols specified by ZigBee, Thread, Bluetooth, Bluetooth LE, IPv6 over Low Power Wireless Standard (6LoWPAN), Constrained Application Protocol (CoAP), or wired communication. The communication module 106 may use a wireless mobile (cellular) telecommunication protocol to communicate with components of the system, e.g. 3G, 4G, 5G, 6G etc. The communication module 106 may communicate with other devices in the system 100 using wired communication techniques, such as via metal cables or fibre optic cables. The apparatus 102 may use more than one communication technique to communicate with other devices in the system 100. It will be understood that this is a non-exhaustive list of communication techniques that the communication module 106 may use. It will also be understood that intermediary devices (such as a gateway) may be located between the apparatus 102 and other components in the system 100, to facilitate communication between the machines/components.
[75] Thus, the present techniques provide an apparatus (102) for generating a modified image from an input image by recognising human emotions in an input image, the apparatus comprising: at least one processor (108), coupled to a memory (110). The processor (108) is arranged to: receive at least one cropped image from an input image, the at least one cropped image comprising a human face; identify, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the at least one cropped image; determine, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image; and output, based on the determining, a modified image comprising a plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image.
[76] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Claims (21)

  1. A computer-implemented method for generating a modified image from an input image by recognising human emotions in an input image, the method comprising: receiving at least one cropped image from an input image, the at least one cropped image comprising a human face; identifying, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the at least one cropped image; determining, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image; and outputting a modified image comprising at least one of the facial landmarks, the determined at least one discrete emotion and the determined at least one continuous emotion for the human face in the at least one cropped image.
  2. The method as claimed in claim 1 further comprising, prior to the receiving step: detecting, within the input image, at least one human face; cropping the input image around each detected human face; and outputting at least one cropped image, each cropped image comprising a detected human face from the received image.
  3. The method as claimed in claim 1 or 2 wherein the step of identifying a plurality of facial landmarks comprises: applying a facial landmarks heatmap to the human face in the at least one cropped image; and analysing areas of the human face identified by the facial landmarks heatmap to identify the plurality of facial landmarks.
  4. The method as claimed in claim 3 wherein the facial landmarks heatmap is generated by the first set of CNN layers.
  5. The method as claimed in any one of claims 1 to 4 further comprising: training the first set of CNN layers to identify a plurality of facial landmarks on a human face by analysing a plurality of images, wherein each image of the plurality of images is annotated using a set of pre-defined facial landmarks applied by human annotators.
  6. The method as claimed in any one of claims 1 to 5 further comprising: training the second set of CNN layers to identify the at least one discrete emotion and the at least one continuous emotion on a human face by analysing a plurality of images, wherein each image of the plurality of images is annotated using a set of pre-defined discrete and continuous emotions applied by human annotators.
  7. The method as claimed in claim 5 or 6 further comprising: identifying, prior to the training, one or more images of the plurality of images comprising an incorrect annotation; and removing the identified one or more images from the database.
  8. The method as claimed in claim 7 wherein the step of identifying one or more images comprising an incorrect annotation comprises: identifying an image having an incorrect discrete emotion annotation.
  9. The method as claimed in claim 8 wherein the step of identifying one or more images comprising an incorrect annotation further comprises: identifying an image having an incorrect continuous emotion annotation.
  10. The method as claimed in any one of claims 1 to 9 further comprising: training a teacher network by analysing a first plurality of images, wherein the teacher network comprises a first plurality of CNN layers which identify a plurality of facial landmarks on a human face and a second plurality of CNN layers which use the identified plurality of facial landmarks to determine at least one discrete emotion and at least one continuous emotion, and each image of the first plurality of images is annotated using a set of pre-defined facial landmarks, a set of pre-defined discrete emotions and a set of pre-defined continuous emotions; inputting a second plurality of images into the trained teacher network; and outputting from the teacher network the facial landmarks, the determined at least one discrete emotion, and the determined at least one continuous emotion for each human face in the received second plurality of images.
  11. The method as claimed in claim 10 further comprising: training the first set of CNN layers to identify a plurality of facial landmarks on a human face by analysing the second plurality of images, wherein each image of the second plurality of images is annotated using the facial landmarks output from the teacher network.
  12. The method as claimed in claim 10 or claim 11 further comprising: training the second set of CNN layers to identify the at least one discrete emotion and the at least one continuous emotion on a human face by analysing the second plurality of images, wherein each image of the second plurality of images is annotated using a set of pre-defined discrete and continuous emotions output from the teacher network.
  13. The method as claimed in any preceding claim, further comprising applying a loss function which is a combination of categorical loss for the at least one discrete emotion and a concordance correlation coefficient loss for the at least one continuous emotion.
  14. The method as claimed in claim 13, further comprising combining a loss related to minimising the root mean square error for the at least one continuous emotion and a loss to maximise the Pearson correlation coefficient for the at least one continuous emotion in the applied loss function.
  15. The method as claimed in any preceding claim wherein the input image is any one of: a still image, a frame of a recorded video, a frame from a videoconference stream, and a frame of a livestream.
  16. The method as claimed in any preceding claim wherein the discrete emotion is selected from neutral, happy, sad, surprise, fear, disgust, anger and contempt.
  17. The method as claimed in any preceding claim wherein determining the at least one discrete emotion comprises determining a probability for each identified discrete emotion and outputting the discrete emotion having the highest probability.
  18. The method as claimed in claim 17 comprising evaluating continuous streams of input images, outputting a first discrete emotion for a first input image in the continuous stream, determining whether the probability of a second discrete emotion for a subsequent input image is above a threshold and changing the output discrete emotion to the second discrete emotion when the probability is above the threshold, wherein the second discrete emotion is different to the first discrete emotion.
  19. The method as claimed in any preceding claim wherein the continuous emotion comprises at least one of an arousal value and a valence value.
  20. A non-transitory data carrier carrying processor control code which, when implemented on a processor, causes the processor to implement the method of any one of claims 1 to 19.
  21. An apparatus for generating a modified image from an input image by recognising human emotions in an input image, the apparatus comprising: at least one processor, coupled to a memory, arranged to: receive at least one cropped image from an input image, the at least one cropped image comprising a human face; identify, using a first set of convolutional neural network (CNN) layers, a plurality of facial landmarks on the human face in the at least one cropped image; determine, using a second set of CNN layers and the identified plurality of facial landmarks, at least one discrete emotion and at least one continuous emotion for the human face in the at least one cropped image; and output, based on the determining, a modified image comprising at least one of the identified plurality of facial landmarks, the at least one discrete emotion and the at least one continuous emotion for the human face in the at least one cropped image.
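
The teacher/student training of claims 10 to 12 amounts to labelling a second set of images with a previously trained teacher network and then training the deployed first and second sets of CNN layers on those outputs. The sketch below assumes a model with the three outputs sketched earlier (heatmaps, discrete logits, valence/arousal); label_with_teacher, distil and train_on are hypothetical helper names, not functions from the patent.

    # Hedged sketch of claims 10-12: a teacher network trained on human-annotated
    # images produces landmark, discrete-emotion and valence/arousal labels for a
    # second, unannotated image set, and the student network is trained on them.
    import torch

    @torch.no_grad()
    def label_with_teacher(teacher, unlabelled_faces):
        teacher.eval()
        pseudo_labelled = []
        for faces in unlabelled_faces:                      # batches of cropped faces
            heatmaps, discrete_logits, valence_arousal = teacher(faces)
            pseudo_labelled.append(
                (faces, heatmaps, discrete_logits.argmax(dim=1), valence_arousal)
            )
        return pseudo_labelled

    def distil(teacher, student, unlabelled_faces, train_on):
        # train_on is a hypothetical routine that fits the student's first set of
        # CNN layers to the heatmaps (claim 11) and its second set to the discrete
        # and continuous labels (claim 12), e.g. with the loss sketched next.
        pseudo_labelled = label_with_teacher(teacher, unlabelled_faces)
        train_on(student, pseudo_labelled)
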
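The combined objective of claims 13 and 14 pairs a categorical loss for the discrete emotion with a concordance correlation coefficient (CCC) term for valence and arousal, supplemented by root-mean-square-error and Pearson-correlation terms. A minimal sketch follows; the cross-entropy choice for the categorical term and the weighting factors alpha and beta are assumptions for the example.

    # Hedged sketch of the loss of claims 13-14: cross-entropy for the discrete
    # emotion plus, per continuous dimension, (1 - CCC), an RMSE term and
    # (1 - Pearson correlation). The weights alpha and beta are illustrative.
    import torch
    import torch.nn.functional as F

    def ccc(pred, target):
        # Concordance correlation coefficient for one continuous dimension.
        pred_mean, target_mean = pred.mean(), target.mean()
        pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
        covariance = ((pred - pred_mean) * (target - target_mean)).mean()
        return 2 * covariance / (
            pred_var + target_var + (pred_mean - target_mean) ** 2 + 1e-8
        )

    def pearson(pred, target):
        p, t = pred - pred.mean(), target - target.mean()
        return (p * t).sum() / (p.norm() * t.norm() + 1e-8)

    def combined_loss(logits, labels, va_pred, va_true, alpha=1.0, beta=1.0):
        loss = F.cross_entropy(logits, labels)                    # categorical term
        for d in range(va_pred.shape[1]):                         # valence, arousal
            loss = loss + alpha * (1 - ccc(va_pred[:, d], va_true[:, d]))
            loss = loss + beta * (
                F.mse_loss(va_pred[:, d], va_true[:, d]).sqrt()   # RMSE term
                + (1 - pearson(va_pred[:, d], va_true[:, d]))     # Pearson term
            )
        return loss
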
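Finally, the thresholded switching of claim 18, which stabilises the reported emotion over a stream of frames, can be pictured as below; the threshold value of 0.6 is purely illustrative. Feeding per-frame class probabilities into this generator yields a label sequence that does not flicker between frames.

    # Hedged sketch of claim 18: when evaluating a continuous stream of input
    # images, the reported discrete emotion changes only when a different
    # emotion's probability exceeds a threshold.
    def track_discrete_emotion(probability_stream, threshold=0.6):
        current = None
        for probabilities in probability_stream:        # dict: emotion -> probability
            best = max(probabilities, key=probabilities.get)
            if current is None:
                current = best                          # first frame: take the arg-max
            elif best != current and probabilities[best] > threshold:
                current = best                          # switch only above the threshold
            yield current
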
GB2000377.8A 2019-04-01 2020-01-10 Methods for generating modified images Active GB2585261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/004106 WO2020204460A1 (en) 2019-04-01 2020-03-26 A method for recognizing human emotions in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GR20190100147 2019-04-01

Publications (4)

Publication Number Publication Date
GB202000377D0 GB202000377D0 (en) 2020-02-26
GB2585261A true GB2585261A (en) 2021-01-06
GB2585261A8 GB2585261A8 (en) 2021-03-03
GB2585261B GB2585261B (en) 2022-05-25

Family

ID=69626460

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2000377.8A Active GB2585261B (en) 2019-04-01 2020-01-10 Methods for generating modified images

Country Status (2)

Country Link
GB (1) GB2585261B (en)
WO (1) WO2020204460A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4352690A4 (en) * 2021-08-11 2024-10-16 Samsung Electronics Co Ltd Method and system for automatically capturing and processing an image of a user

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428572B (en) * 2020-02-28 2023-07-25 中国工商银行股份有限公司 Information processing method, device, electronic equipment and medium
CN111583197B (en) * 2020-04-23 2022-05-13 浙江大学 Power box picture rust damage identification method combining SSD and Resnet50 network
CN113011253B (en) * 2021-02-05 2023-04-21 中国地质大学(武汉) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN113657168B (en) * 2021-07-19 2024-02-06 西安理工大学 Student learning emotion recognition method based on convolutional neural network
CN113722477B (en) * 2021-08-09 2023-09-19 北京智慧星光信息技术有限公司 Internet citizen emotion recognition method and system based on multitask learning and electronic equipment
CN113724261A (en) * 2021-08-11 2021-11-30 电子科技大学 Fast image composition method based on convolutional neural network
CN113688204B (en) * 2021-08-16 2023-04-25 南京信息工程大学 Multi-person session emotion prediction method utilizing similar scenes and mixed attention
CN114612712A (en) * 2022-03-03 2022-06-10 北京百度网讯科技有限公司 Object classification method, device, equipment and storage medium
CN114743243B (en) * 2022-04-06 2024-05-31 平安科技(深圳)有限公司 Human face recognition method, device, equipment and storage medium based on artificial intelligence
CN115050070A (en) * 2022-05-31 2022-09-13 中国科学院半导体研究所 Child emotion-based guiding strategy determination method and device and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098122A1 (en) * 2010-06-07 2017-04-06 Affectiva, Inc. Analysis of image content with associated manipulation of expression presentation
KR101265466B1 (en) * 2011-08-05 2013-05-16 충남대학교산학협력단 Emotion recognition apparatus using facial expression, emotion recognition method using the same, and recording medium thereof
KR101317047B1 (en) * 2012-07-23 2013-10-11 충남대학교산학협력단 Emotion recognition appatus using facial expression and method for controlling thereof
US9558425B2 (en) * 2012-08-16 2017-01-31 The Penn State Research Foundation Automatically computing emotions aroused from images through shape modeling
KR101887637B1 (en) * 2016-11-15 2018-08-10 주식회사 로보러스 Robot system

Non-Patent Citations (16)

* Cited by examiner, † Cited by third party
Title
ADRIAN BULAT ET AL: "Two-stage Convolutional Part Heatmap Regression for the 1st 3D Face Alignment in the Wild (3DFAW) Challenge", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 September 2016 (2016-09-30), XP081360772, DOI: 10.1007/978-3-319-48881-3_43 *
BULAT ET AL.: "How far are we from solving the 2d & 3d face alignment problem?", INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2017
BULAT ET AL.: "How far are we from solving the 2d and 3d face alignment problem?", INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2017
CHRYSOS GRIGORIOS G ET AL: "A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild"", INTERNATIONAL JOURNAL OF COMPUTER VISION, KLUWER ACADEMIC PUBLISHERS, NORWELL, US, vol. 126, no. 2, 25 February 2017 (2017-02-25), pages 198 - 232, XP036435470, ISSN: 0920-5691, [retrieved on 20170225], DOI: 10.1007/S11263-017-0999-5 *
DIMITRIOS KOLLIAS ET AL: "Deep Affect Prediction in-the-wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 April 2018 (2018-04-29), XP081021258, DOI: 10.1007/S11263-019-01158-4 *
FURLANELLO ET AL.: "Born-again neural networks", INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 2018, pages 1602 - 1611
GASTALDI: "Shake-shake regularization", ARXIV PREPRINT ARXIV:1705.07485, 2017
JANG ET AL.: "Registration-Free Face-SSD: Single Shot Analysis of Smiles, Facial Attributes, and Affect in the Wild", COMPUTER VISION AND IMAGE UNDERSTANDING ARXIV:1902.04042V1, 11 February 2019 (2019-02-11)
KOLLIAS ET AL.: "Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond", ARXIV PREPRINT ARXIV:1804.10938, 2018
KOLLIAS ET AL.: "Generating faces for affect analysis", CORR ABS/1811.05027, 2018
KOSSAIFI ET AL.: "Afew-va database for valence and arousal estimation in-the-wild", IMAGE AND VISION COMPUTING, vol. 65, 2017, pages 23 - 36, XP085174345, DOI: 10.1016/j.imavis.2017.02.001
KOSSAIFI ET AL.: "Sewa db: A rich database for audio-visual emotion and sentiment research in the wild", ARXIV PREPRINT ARXIV:1901.02839, 2019
MOLLAHOSSEINI ET AL.: "A database for facial expression, valence and arousal computing in the wild", CORR ABS/1708.03985, 2017
RUSSEL: "A circumplex model of affect", JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY, vol. 39, 1980
VASWANI ET AL.: "Attention is all you need", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
YANG ET AL.: "Registration-free face-ssd: Single shot analysis of smiles, facial attributes, and affect in the wild", COMPUTER VISION AND IMAGE UNDERSTANDING, 2019

Also Published As

Publication number Publication date
GB2585261B (en) 2022-05-25
GB2585261A8 (en) 2021-03-03
GB202000377D0 (en) 2020-02-26
WO2020204460A1 (en) 2020-10-08

Similar Documents

Publication Publication Date Title
GB2585261A (en) Methods for generating modified images
US11605226B2 (en) Video data processing method and apparatus, and readable storage medium
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
JP6980958B1 (en) Rural area classification garbage identification method based on deep learning
CN108898620B (en) Target tracking method based on multiple twin neural networks and regional neural network
CN107463888B (en) Face emotion analysis method and system based on multi-task learning and deep learning
CN112699774B (en) Emotion recognition method and device for characters in video, computer equipment and medium
CN110765854A (en) Video motion recognition method
CN113282840B (en) Comprehensive training acquisition management platform
CN112926595B (en) Training device of deep learning neural network model, target detection system and method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN115797637A (en) Semi-supervised segmentation model based on uncertainty between models and in models
CN113936175A (en) Method and system for identifying events in video
CN113392781A (en) Video emotion semantic analysis method based on graph neural network
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN116012930A (en) Dimension expression recognition method based on deep learning convolutional neural network
CN111310516B (en) Behavior recognition method and device
US12079703B2 (en) Convolution-augmented transformer models
CN110287981A (en) Conspicuousness detection method and system based on biological enlightening representative learning
CN117152094A (en) Method, device and system for analyzing surface defects of steel plate based on computer vision
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
CN114663953A (en) Facial expression recognition method based on facial key points and deep neural network
CN114445875A (en) Deep learning-based identity recognition and face comparison system and training method
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)