WO2020049276A1

WO2020049276A1 - System and method for facial landmark localisation using a neural network

Info

Publication number: WO2020049276A1
Application number: PCT/GB2019/052425
Authority: WO
Inventors: Jiankang DENG; Stefanos ZAFEIRIOU
Original assignee: Facesoft Ltd.
Priority date: 2018-09-03
Filing date: 2019-08-30
Publication date: 2020-03-12
Also published as: CN112639810A; GB2576784A; GB2576784B; GB201814275D0

Abstract

A neural network system for localising facial landmarks in an image comprises two or more convolutional neural networks in series. Each convolution neural network comprises a plurality of downsampling layers, a plurality of upsampling layers, and a plurality of lateral connections connecting layers of equal size. Each of the lateral connections comprises one or more aggregation nodes having a upsampling input; and at least one of the aggregation nodes further includes a downsampling input.

Description

SYSTEM AND METHOD FOR FACIAL LANDMARK

LOCALISATION USING A NEURAL NETWORK

Field of the Invention

The present invention relates to facial landmark localisation, and particular to systems and methods for localising facial landmarks using a neural network system.

Background

Facial landmark localisation, or face alignment, relates to finding the location of various facial features within an image. Examples of the facial features that may be located include the mouth, the nasal ridge, the eyebrows and the jaw. Facial landmark localisation has wide ranging applications including, but not limited to, face recognition, face tracking, expression understanding, and 3D face reconstruction.

Summary

According to a first aspect of the invention, there is provided a neural network system for localising facial landmarks in an image, the neural network system including two or more convolutional neural networks in series, each convolution neural network including a plurality of downsampling layers, a plurality of upsampling layers and a plurality of lateral connections connecting layers of equal size; wherein each of the lateral connections comprises one or more aggregation nodes having a upsampling input; and wherein at least one of the aggregation nodes further includes a

downsampling input.

Each convolutional neural network may include 3 or 4 downsampling layers and a corresponding number of upsampling layers.

Each convolutional neural network may include 3 downsampling layers.

Only one of the aggregation nodes may include a downsampling input.

Each of the aggregation nodes may include a downsampling output.

At least one of the lateral connections may include one or more lateral skip

connections.

At least one of the lateral connections may include one or more convolutions. At least one of the convolutions may be a depth-wise separable convolution

The neural network system may include an input layer capable of receiving a 128 x 128 pixel input image.

The neural network system may include a channel aggregation block, including a plurality of channel decrease steps, a plurality of channel increase steps and a plurality of branch connections connecting steps of equal channel size.

Each of the plurality of downsampling layers, each of the plurality of upsampling layers and each of the aggregation nodes may include the channel aggregation block

The output of each convolutional neural network may be connected with a spatial transformer.

The spatial transformer may include at least one deformable convolution.

According to a second aspect of the invention, there is provided a method of localising facial landmarks in an image using a neural network system, the method including the steps of receiving an input image, the input image including a plurality of input pixels; in each of a plurality of convolutional neural networks in series: downsampling the input image using a plurality of downsampling layers; and upsampling the

downsampled image using a plurality of upsampling layers; wherein the upsampling includes combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size; wherein each of the lateral connections includes one or more aggregation nodes having a upsampling input; and wherein at least one of the aggregation nodes further includes a downsampling input; and generating an output heatmap corresponding to the input image from the output of the series of convolutional neural networks, the heatmap representing a probability for a location of a facial landmark at each input pixel.

According to a third aspect of the invention, there is provided a method of training a neural network system, including receiving a plurality of training facial images and a plurality of expected facial landmark parameter sets, inputting one of the plurality of training facial images to the neural network system, in response to the inputting, generating a facial landmark heatmap for the training facial image using the neural network system, and training the neural network system, including updating one or more neural network weights of the plurality of neural network weights to minimise a loss measure based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.

The method may include generating a plurality of transformed facial images by applying one or more transformations to the plurality of training facial images, inputting a transformed facial image to the neural network system, the input transformed facial image corresponding to a transformation of the input training facial image, in response to the inputting, generating a facial landmark heatmap for the transformed facial image using the neural network system, wherein training the neural network system includes updating the neural network weights to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.

Brief Description of the Drawings

Certain embodiments of the present invention will now be described, by way of example, with reference to the following figures.

Figure 1 is a schematic block diagram illustrating an example of a neural network system for localising facial landmarks;

Figure 2 is a schematic block diagram illustrating an example of a convolutional neural network topology for use in a neural network system for localising facial landmarks; Figure 3 is a schematic diagram illustrating an example of a novel neural network building block architecture for use in a convolutional neural network;

Figure 4 is a flow diagram of an example method for generating a facial landmark heatmap for an input image using a neural network system;

Figure 5 is a flow diagram of a first example method for training a neural network system for localising facial landmarks in an image; and

Figure 6 is a flow diagram of an extended example method for training a neural network system for localising facial landmarks in an image.

Detailed Description Example implementations provide system(s) and method(s) for improved localisation of facial landmarks in images. Facial landmark localisation has applications in face recognition, face tracking, expression understanding and 3D face reconstruction. Therefore, improving facial landmark localisation, by using systems and methods described herein, may improve the performance of systems directed at these applications. For example, face recognition systems using the system(s) and method(s) described herein in face recognition systems may recognise faces with a greater accuracy, as measured using standard facial recognition test datasets. Improved localisation of facial landmarks in images may be achieved by various example implementations by way of a novel neural network system design, a novel convolutional neural network topology, and/or a novel neural network building block architecture. The novel neural network system design may include a plurality convolutional neural networks and corresponding inside transformers, where the inside transformers employ deformable convolution to flexibly model geometric face transformations in a computationally efficient manner. The use of inside transformers employing deformable convolutions may enable such example implementations to be spatially invariant to arbitrary input face images.

The novel convolutional neural network topology, henceforth referred to as a Scale Aggregation Topology (SAT), improves on U-net topologies by iteratively and hierarchically merging the feature hierarchy with additional aggregation nodes within the lateral connections. The additional aggregation nodes have lateral inputs, upsampling inputs and downsampling inputs. These aggregation nodes enable multi- scale features to be aggregated improving the localisation of facial landmarks. In some variants of the SAT, the downsampling inputs to some nodes are removed as, in some situations, e.g. when there is limited training data, removing these connections reduces computational complexity, model size and training time, without affecting facial landmark localisation performance.

The novel neural network building block architecture, henceforth referred to as a Channel Aggregation Block (CAB) architecture, is a building block architecture that is symmetric in channel. In a CAB architecture, input signals branch off before each channel decrease and converge back before each channel increase to maintain the channel information. Using such channel compression in the building block

architecture can help contextual modelling improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when local observation is blurred. In some variants of a CAB architecture, depth-wise separable convolutions and replication-based channel extensions are employed to control computational complexity and model size.

Neural Network System Referring to Figure l, a neural network system too for localising facial landmarks in an image is shown.

The neural network system too is configured to receive a neural network input 102 and produce a neural network output 104. The neural network system too may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the neural network system too is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The neural network system too may be implemented using a suitable machine learning framework The neural network input 102 may be an input image including a plurality of input pixels or a matrix representation of such an image. The input image may be a cropped and/or rescaled version of an original image, e.g. the input image maybe the region of the original image corresponding to a detected face and rescaled to 128x128 pixels. The neural network output 104 may be an output heatmap corresponding to the input image. The heatmap may represent a probability for a location of a facial landmark at each pixel. Such a heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information. The neural network system too includes a first convolutional neural network no, an inside transformer 120, a second convolutional neural network 112 and a second inside transformer 122. The first convolutional neural network 110 receives and processes the neural network input 102 using a number of neural network layers to produce a first neural network output. The first neural network output is received by the first inside transformer 120 which performs an appropriate transformation to produce a first transformed neural network output. The first transformed neural network output is received and processed by the second convolutional neural network 112 to produce a second neural network output. The second neural network output is received by the second inside transformer 122 which performs an appropriate transformation to produce a second transformed neural network output. The second transformed neural network output may itself be the neural network output 104 or the neural network output 104 may be derived from the second transformed neural network output.

The convolutional neural networks 110, 112 may have a Scale Aggregation Topology (SAT) neural network topology, as previously described. For example, convolutional neural networks may have the variant of the SAT topology illustrated in Figure 2 and described in the related section of the description. In other embodiments, the convolutional neural networks may have a U-Net, Hourglass or Deep Layer Aggregation topology. The first convolutional neural network 110 and the second convolutional neural network 112 may have the same topology or different topology. Where the first convolutional neural network 110 and the second convolutional neural network 112 have different topologies, the topologies maybe largely similar but have subtle variations, e.g. different connections removed, different convolutions replaced by depth-wise convolutions and/or lateral connections. There may also be minor differences in the block architectures of each.

The inside transformers 120, 122 maybe spatial transformers to improve and/or provide spatial transformation modelling capacity to the neural network system too. In some embodiments, the inside transformers 120, 122 perform parameter implicit transformation by deformable convolution. Deformable convolution extends conventional convolution by learning additional offsets in a local and dense manner to the locations sampled by the convolution operation. Deformable convolution is able to more flexibly model geometric face transformations and has higher computational efficiency than transformers employing explicit parameter transformation. The inside transformers 120, 122 may alternatively perform explicit parameter transformations using a spatial transformer network (STN).

Convolutional Neural Network Topology

Referring to Figure 2, an example of a convolutional neural network topology suitable for use in neural networks systems for facial landmark localisation is shown. The illustrated topology is a variant of the previously described Scale Aggregation Topology (SAT).

The convolutional neural network topology 200 includes a number of nodes 210-224, where each node may be a channel aggregation block. Some nodes 220-224 are referred to as aggregation nodes as they are nodes within the lateral connections which aggregate information across multiple scales.

Each of the nodes 210-224 is connected to other nodes in the network according to the connections 230-266 illustrated. These connections 230-266 receive one or more inputs from a given node in the network, optionally perform an operation, and transfer the result to another node.

The solid connections 230-236 represent downsampling convolutions. These connections receive an input from a node (e.g. node 210), perform a downsampling convolution and provide the result as an input to another node (e.g. node 211). The downsampling convolution may include applying a convolution layer followed by a pooling layer to the input, or may include applying a convolution layer having a stride of two or more. In some instances, the result of the downsampling convolution operation has half the vertical resolution and half the horizontal resolution of the connection input, e.g. if the connection input has a resolution of 128x128 then the result would have a resolution of 64x64.

The dashed connections 240-245 represent upsampling convolutions. These connections receive an input from a node (e.g. node 211), perform an upsampling convolution and provide the result as an input to another node (e.g. node 220). The upsampling convolution may include applying a convolution layer followed by an unpooling layer to the input, or may include applying a convolution layer having a fractional stride. In some instances, the result of the upsampling convolution operation has twice the vertical resolution and twice the vertical resolution of the input, e.g. if the connection input has a resolution of 64x64 then the connection result would have a resolution of 128x128.

The connections 250-252 represent depth-wise separable convolutions. These connections receive an input from a node (e.g. node 212), perform a depth-wise separable convolution on the input and provide the result as an input to another node (e.g. node 214). The use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required. Furthermore, in some instances and for particular connections, replacing conventional convolutions with depth-wise separable convolutions substantially maintains the performance, e.g. the facial recognition accuracy, of the relevant system. In some situations, e.g. when training data is limited, the use of depth-wise separable convolutions for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.

The dotted connections 260-266 represent lateral skip connections. These lateral skip connections receive an input from a node (e.g. node 210) and provide the received value to another node (e.g. node 220). These lateral skip connections allow high-resolution information from inputs to be maintained without requiring substantial processing. Furthermore, as lateral skip connections require few, if any, parameters, the use of lateral skip connections instead of convolutions reduces the number of parameters and computation required. Furthermore, in some instances and for particular connections, replacing convolutions with lateral skip connections substantially maintains the performance, e.g. the facial landmark localisation accuracy, of the relevant system. In some situations, e.g. when training data is limited, the use of lateral skip connections for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.

In the illustrated convolution neural network topology 200, there are not

downsampling connections between nodes 224 and 214, and nodes 222 and 215, as may have been expected. For these particular connections and in given instances, e.g. when training data is limited, the absence of these connections allowed performance, e.g. facial landmark localisation accuracy, of the relevant system to be maintained while reducing the number of parameters and computation required. Furthermore, in certain situations, e.g. when training data is limited, the removal of these connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.

Despite the above, it should be noted that, given sufficient training data, computational resources and time, the variation of the scale aggregation topology resulting in the greatest performance would be expected to be a topology where all nodes are connected with convolutions. This is because such a topology would have the greatest

computational power and preserve the greatest amount of information.

Neural Network Building Block Architecture

Referring to Figure 3, an example of a neural network building block architecture suitable for use in neural networks systems for facial landmark localisation is shown. The illustrated architecture is a variant of the previously described Channel

Aggregation Block (CAB) architecture.

The neural network building block architecture 200 includes a number of operations (310-345). These operations include channel decrease convolutions (310-314), self- concatenations (320-325), depth-wise separable convolutions (330-332) and matrix additions (340-345). In the illustrated neural network building block architecture 300, input signals branch off before each channel decrease and converge back before each channel increase to maintain the channel information. For example, the input signal branches off prior to the channel decrease at downsampling convolution 312 and converges back before the corresponding channel increase at self-concatenation 322. Using such channel compression in the building block architecture can help contextual modelling improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when local observation is blurred.

The operations 310-314 are channel decrease convolutions. These operations receive an input, perform a channel decrease convolution on the input and provide the result as an input to one or more following operations. The input to the channel decrease convolution maybe a three-dimensional input with dimensions, W x H x C_l5 where W is the width of the input, H is the height of the input and C₁ is the number of channels in the input. Each channel of the input, i.e. each I x ii x l maybe a feature heatmap. The channel decrease convolution may comprise applying a four-dimensional convolution kernel to the input. The four-dimensional kernel may have dimensions,

K x K x C₁ x C₂ , where K is an arbitrary, typically small integer, e.g. 3, and C₂ is the desired number of channels. The result of the channel decrease convolution has dimensions W x H x C₂. In some embodiments, C₂ is C / 2. For example, if the received input has dimensions 128 x 128 x 64, the four-dimensional convolution kernel may have dimensions 3 x 3 x 64 x 32, and the result of the four-dimensional convolution may have dimensions 128 x 128 x 32.

The operations 320-325 are self-concatenation operations. These operations, receive an input, perform a self-concatenation on the input and provide the result as an input to one or more following operations. The self-concatenation may include concatenating the received input with itself one or more of times along the dimension corresponding to the channels of the input, e.g. a depth-wise concatenation. Therefore, the result has a greater number of channels than the input. For example, where the input has C₁ channels and is concatenated with itself once, the result will have ^ ^c 2 channels. To provide a further example, if the received input has dimensions 128 x 128 x 32 and the input is concatenated with itself once then the result has dimensions 128 x 128 x 64.

The use of self-concatenation operations enables the number of channels to be increased, and the described channel compression architecture, which has the benefits described above, without substantial computation or additional parameters.

Alternatives that may be employed, e.g. channel increase convolutions, require substantial computation and a significant number of parameters. Therefore, the model size of a neural network system employing such self-concatenation operations is reduced compared to these alternatives. The operations 330-332 are depth-wise separable convolutions. These connections receive an input, perform a depth-wise separable convolution on the input and provide the result as an input to one or more following operations. The use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required. The operations 340-345 are matrix additions. These connections receive two or more inputs, perform a matrix addition of the two or more inputs, and provide the result as an input to one or more following operations. The matrix additions (340-345) combine information from representations having different numbers of channels.

The neural network building block architecture presented is illustrative and variations of the presented neural network building block architecture may be employed instead. These variations maybe advantageous in certain contexts. For example, where more training data and more computational power are available, it maybe beneficial to replace the self-concatenation operations with channel increase convolutions and/ or the depth-wise separable concatenations with conventional convolutions. Furthermore, it should be noted that the other innovative aspects of the present disclosure may be employed with various neural network building block architectures, e.g. ResNet blocks, Inception ResNet blocks and/or Hierachical, Parallel & Multi-Scale (HPM) blocks.

Facial Landmark Heatmap Generation Method

Figure 4 is a flow diagram illustrating an example method by which a facial landmark heatmap for an input image using a neural network system may be generated. The method may be perform by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system too. The neural network system may, for example, be neural network system too. In step 410, an input image comprising a plurality of pixels is received. The input image maybe a cropped and/or rescaled version of an original image, e.g. the input image maybe the region of the original image corresponding to a detected face and rescaled to 128x128 pixels. In loop 420, the steps 422 and 424 are both repeated in and for each of a plurality of neural networks in series. For example, steps 422 and 424 maybe performed for each of the first convolutional neural network 110 and the second convolutional neural network 112 of neural network system 112. In step 422, the input image is downsampled using a plurality of downsampling layers of the respective neural network. The plurality of downsampling layers may be downsampling convolutional layers. The first downsampling layer may apply a convolution followed by a pooling operation to the input image and, or may apply a convolution layer having a stride of two or more to the image. Other layers of the plurality of downsampling layers may apply such operations to the output of the preceding layer. In some embodiments, the plurality of downsampling layers include the downsampling connections 230-236 of convolutional neural network topology 200.

In step 424, the downsampled image is upsampled using a plurality of upsampling layers of the respective neural network. The plurality of upsampling layers maybe upsampling convolutional layers. Each upsampling convolutional layer may include applying a convolution followed by an unpooling operation to the input, or may include applying a convolution having a fractional stride. These upsampling convolutional layers may be or correspond with the upsampling convolution connections 240-245 of convolutional neural network topology 200.

The upsampling may include combining the upsampled image with an aggregated signal input to the upsampling layer of equal size using a lateral connection from a downsampling layer of equal size. The lateral connections may include one or more aggregation nodes. Each of these aggregation nodes may have an upsampling input and at least one of the aggregation nodes may have a further downsampling input.

In 430, an output heatmap corresponding to the input image is generated from the output of the series of convolutional neural networks. The heatmap represents a probability for a location of a facial landmark at each input pixel. The heatmap maybe output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.

Facial Landmark Localisation Neural Network Training Method

Figure 5 is a flow diagram illustrating an example method 500 by which a neural network system for localising facial landmarks in an image may be trained. The method may be perform by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system too. The neural network system trained by this method may, for example, be neural network system too.

In step 510, training facial images and corresponding expected facial landmark parameter sets are received. The training facial images may be cropped and/ or rescaledversion of corresponding original images, e.g. each of the training facial images may be the region of an original image corresponding to a detected face and rescaledto 128x128 pixels. The expected facial landmark parameter sets may be training heatmaps corresponding to the training facial images. Each of the training heatmaps may represent a probability for a location of a facial landmark at each pixel. As the locations of the facial landmarks are known for each of the training facial images, the probability for a location of a facial landmark at each pixel of each image may be zero or one. Zero may represent the absence of the facial landmark and one may represent the presence of the facial landmark. Such training heatmaps may be in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.

In step 520, one of the training facial images is input to the neural network system. The training facial image maybe input to the neural system by any suitable mechanism. For example, the training facial image maybe: retrieved from a file by the neural network system; received via an application programming interface (API) call; or received via a remote service call, e.g. a REST or SOAP service call.

In step 530, a facial landmark heatmap for the training facial image is generated using the neural network system. The generated facial landmark heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the

probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information. The generated facial landmark heatmap may be in the same format as that of the corresponding expected facial landmark parameters sets, e.g. the same format as the corresponding training heatmap. In step 540, the neural network system is trained. By definition, the neural network system includes a plurality of weights. The training includes updating one or more weights of the plurality of neural network weights to minimise a loss measure that is based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set. The loss measure may be the mean squared error (MSE) pixel-wise loss, the sigmoid cross-entropy (CE) pixel- wise loss. In the case that the generated facial landmark heatmap and the

corresponding expected facial landmark parameter set are in different formats, the corresponding expected facial landmark parameter set may be transformed in to the format of the generated facial landmark heatmap to enable the loss measure to be calculated. The updating of the one or more weights may be performed by some variant of the gradient descent optimization algorithm which employs backpropagation, e.g. Adam, Nadam or Adamax.

While, for the purposes of simplicity, the above training method has been described in relation to a single training iteration, it should be noted that in a typical embodiment a plurality of training iterations are performed until a convergence criterion is met, or an iteration or time limit is reached. Furthermore, each of the training iterations may update the one or more weights based on a batch of corresponding generated facial landmark heatmaps and expected facial landmark parameter using a batch gradient optimization algorithm.

Extended Facial Landmark Localisation Neural Network System Training Method

Figure 6 is a flow diagram illustrating an extended example method 600 by which a neural network system for localising facial landmarks in an image may be trained. The method may be perform by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system too. The neural network system trained by this method may, for example, be neural network system too.

In step 610, a plurality of transformed facial images are generated by applying one or more transformations to ta plurality of training facial images. The applied

transformations may, for example, be affine or flip transformations. Examples of affine transformations include translation, scaling and rotation. Steps 510 to 530 of the second neural network system training method 600 correspond to those of the first neural network system training method 500.

In step 620, a transformed facial image is input to the neural network system, where the transformed facial image is a transformation of the input training facial image. The transformed image is an image of the plurality of transformed facial images.

In step 630, a facial landmark heatmap for the transformed facial image is generated using the neural network system. The generated facial landmark heatmap for the transformed facial heatmap is in the same format as the facial landmark heatmap for the training facial image.

In step 540, the neural network system is trained and the plurality of weights updated as previously described. However, in this instance, the neural network weights are also updated to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image. Minimising a transformation loss measure encourages the regression network to output coherent landmarks when there are rotation, scale, translation and flip transformations applied to the images. The transformation loss measure and the loss measure may be combined in to a combined loss measure that can be used to update the neural network weights.

An example of a combined loss measure is:

where N is the landmark number, i.e. the index of a given facial landmark; / is the input image; G(I) is the expected facial landmark heatmap, i.e. the ground truth; H(I ) is the generated facial landmark heatmap, T is the transformation, and l is the weight to balance the loss L_p~p, which is based on the difference between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image, and L_p~g, which is based on the difference between the generated facial landmark heatmap and expected facial landmark heatmap. L_p-g includes two components L_p-gl and L_{p-g 2}. L_p-gl is based on the difference between the first training facial image and the expected facial landmark heatmap. L_p-gZ is based on the difference between the transformed facial image and an expected facial landmark heatmap for the transformed facial image. The expected facial landmark heatmap for the transformed facial image may be derived by applying the same transformation used to generate the transformed facial image to the expected facial landmark heatmap for the training facial image. In the equation above, the Mean Squared Error (MSE) pixel-wise loss is used to derive each of the component loss measures, however, any suitable loss functions maybe used, e.g. Sigmoid Cross- Entropy (CE) pixel-wise loss. Different loss functions may also be used for different components of the combined loss. For example, the Sigmoid Cross-Entropy (CE) pixel- wise loss function may be used to derive L_p-g and the Mean Squared Error (MSE) pixel-wise loss maybe used to derive L_p-p.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these

embodiments without departing from the invention, the scope of which is defined in the appended claims. Various components of different embodiments maybe combined where the principles underlying the embodiments are compatible.

Claims

Claims l. A neural network system for localising facial landmarks in an image, the neural network system comprising:

two or more convolutional neural networks in series, each convolution neural network comprising:

a plurality of downsampling layers;

a plurality of upsampling layers; and

a plurality of lateral connections connecting layers of equal size;

wherein each of the lateral connections comprises one or more aggregation nodes having a upsampling input; and

wherein at least one of the aggregation nodes further includes a downsampling input.

2. The neural network system of claim 1, wherein each convolutional neural network comprises 3 or 4 downsampling layers and a corresponding number of upsampling layers.

3. The neural network system of claim 2, wherein each convolutional neural network comprises 3 downsampling layers.

4. The neural network system of any preceding claim, wherein only one of the aggregation nodes includes a downsampling input.

5. The neural network system of any one of claims 1 to 3, wherein each of the aggregation nodes includes a downsampling output.

6. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more lateral skip connections.

7. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more convolutions.

8. The neural network system of claim 7, wherein at least one of the convolutions is a depth-wise separable convolution

9. The neural network system of any preceding claim, further comprising an input layer capable of receiving a 128 x 128 pixel input image.

10. The neural network system of any preceding claim, further comprising a channel aggregation block, comprising:

a plurality of channel decrease steps;

a plurality of channel increase steps; and

a plurality of branch connections connecting steps of equal channel size.

11. The neural network system of any preceding claim, wherein each of the plurality of downsampling layers, each of the plurality of upsampling layers and each of the aggregation nodes comprises the channel aggregation block

12. The neural network system of any preceding claim, wherein the output of each convolutional neural network is connected with a spatial transformer.

13. The neural network system of claim 12, wherein the spatial transformer comprises at least one deformable convolution.

14. A method of localising facial landmarks in an image using a neural network system, the method comprising the steps of:

receiving an input image, the input image comprising a plurality of input pixels; in each of a plurality of convolutional neural networks in series:

downsampling the input image using a plurality of downsampling layers; and

upsampling the downsampled image using a plurality of upsampling layers;

wherein the upsampling further comprises combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size;

wherein at least one of the aggregation nodes further includes a downsampling input; and generating an output heatmap corresponding to the input image from the output of the series of convolutional neural networks, the heatmap representing a probability for a location of a facial landmark at each input pixel.

15. A method of training the neural network system of any one of claims 1 to 13, comprising:

receiving a plurality of training facial images and a plurality of expected facial landmark parameter sets;

inputting one of the plurality of training facial images to the neural network system;

in response to the inputting, generating a facial landmark heatmap for the training facial image using the neural network system; and

training the neural network system, comprising updating one or more neural network weights of the plurality of neural network weights to minimise a loss measure based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.

16. The method of claim 15, further comprising:

generating a plurality of transformed facial images by applying one or more transformations to the plurality of training facial images;

inputting a transformed facial image to the neural network system, the input transformed facial image corresponding to a transformation of the input training facial image; and

in response to the inputting, generating a facial landmark heatmap for the transformed facial image using the neural network system;

wherein training the neural network system further comprises updating the neural network weights to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.