GB2582833A - Facial localisation in images - Google Patents

Facial localisation in images

Info

Publication number
GB2582833A
GB2582833A
Authority
GB
United Kingdom
Prior art keywords
neural network
face
convolutional layers
image
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1906027.6A
Other versions
GB2582833B (en)
GB201906027D0 (en)
Inventor
Deng Jiankang
Zafeiriou Stefanos
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to GB1906027.6A priority Critical patent/GB2582833B/en
Publication of GB201906027D0 publication Critical patent/GB201906027D0/en
Priority to CN202080031201.1A priority patent/CN114080632A/en
Priority to PCT/GB2020/051012 priority patent/WO2020221990A1/en
Publication of GB2582833A publication Critical patent/GB2582833A/en
Application granted granted Critical
Publication of GB2582833B publication Critical patent/GB2582833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Abstract

The invention relates to methods of facial localisation in images. According to a first aspect (Fig. 2), there is described a computer-implemented method of training a neural network 106 for face localisation, the method comprising: inputting, to the neural network, a training image 202 comprising one or more faces (104); processing, using the neural network, the training image and outputting for each of a plurality of training anchors in the training image, one or more sets of output data, wherein the output data comprises a predicted facial classification, a predicted location of a corresponding face box (110), and one or more corresponding feature vectors; then updating parameters of the neural network in dependence on an objective function. The objective function comprises, for each positive anchor in the training image, a classification loss 210, a box regression loss 212 and a feature loss 214, based on comparisons of predicted properties with known properties in the anchor. A second aspect (Fig. 1) covers using a neural network to perform face localisation, wherein the network comprises convolutional layers and filters, as well as skip connections.

Description

Facial Localisation in Images
Field
This specification relates to methods of facial localisation in images. In particular, this specification relates to the use and training of a neural network for facial localisation in images.
Background
Facial localisation in images is the task of determining the presence and location of faces within an image. It may further comprise identifying the locations of bounding boxes that surround each face determined to be present in the image. Automatic face localisation is a prerequisite step of facial image analysis in many applications, such as facial attribute (e.g. expression and/or age) and/or facial identity recognition. Typically, the training of a neural network for facial localisation in images (a facial classification/detection neural network) comprises the use of only classification and box regression losses.
Facial detection differs from generic object detection in several respects. Facial detection features smaller ratio variations (for example, from 1:1 to 1:1.5) but much larger scale variations (for example, from several pixels to thousands of pixels). This can provide significant challenges when adapting generic object detectors for facial detection and localisation.
Summary
According to a first aspect of this specification, there is described a computer implemented method of training a neural network for face localisation, the method comprising: inputting, to the neural network, a training image, the training image comprising one or more faces; processing, using the neural network, the training image; outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors; updating parameters of the neural network in dependence on an objective function. The objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
The one or more feature vectors may comprise predicted locations of a plurality of facial landmarks of a face, and wherein the feature loss comprises a facial landmark regression loss comparing the predicted locations of the plurality of facial landmarks to known locations of the plurality of facial landmarks.
The one or more feature vectors may comprise an encoded representation of a face within the predicted face box, and the method may further comprise: generating a three-dimensional representation of the face using a mesh decoder neural network, the mesh decoder neural network comprising one or more geometric convolutional layers; and generating a two-dimensional image of the face from the three-dimensional representation of the face using a differentiable renderer, wherein the feature loss comprises a dense regression loss comparing the generated two-dimensional image of the face to a ground truth image of the face within the predicted face box.
The feature vector may further comprise camera parameters and/or illumination parameters, and wherein the differentiable renderer uses the camera parameters and/or illumination parameters when generating the two-dimensional image of the face from the three-dimensional representation of the face.
The dense regression loss may be a pixel-wise difference between the generated two-dimensional image of the face and the ground truth image of the face within the predicted face box.
The mesh decoder may comprise a plurality of geometric convolutional layers and a plurality of upscaling layers, and wherein the plurality of upscaling layers are interlaced with the plurality of geometric convolutional layers.
The objective function may comprise, for each negative anchor in the training image, a classification loss comparing a predicted classification for the negative anchor to a known classification for the negative anchor.
The method may further comprise iterating the method using one or more further training images until a threshold condition is satisfied.
The neural network may comprise: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner. The second plurality of convolutional layers may comprise one or more deformable convolutional layers.
Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
The neural network may further comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
According to a further aspect of this specification, there is described a method of face localisation, the method comprising: identifying, using a neural network, one or more faces within the input image; and identifying, using the neural network, one or more corresponding face boxes within the input image, wherein the neural network has been trained using any of the training methods described herein.
According to a further aspect of this specification, there is described a computer-implemented method of face localisation, the method comprising: inputting, to the neural network, an image, the image comprising one or more faces; processing, using the neural network, the image; outputting, from the neural network, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors. The neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers. The second plurality of convolutional layers may comprise one or more deformable convolutions.
The neural network may further comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
The neural network may have been trained using any of the training methods described herein.
According to a further aspect of this specification, there is described a system comprising: one or more processors; and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform one or more of the methods described herein.
According to a further aspect of this specification, there is described a computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform one or more of the methods described herein.
As used herein, the terms facial localisation and/or face localisation are preferably used to connote one or more of: face detection; face alignment; pixel-wise face parsing; and/or 3D dense correspondence regression.
Brief Description of the Drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which:
Figure 1 shows a schematic overview of a method of facial localisation in images;
Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss;
Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images;
Figure 4 shows an example of a neural network structure for facial localisation in images;
Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images;
Figure 6 shows a further example of a neural network structure for facial localisation in images; and
Figure 7 shows an example of a computer system for performing facial localisation.
Detailed Description
In the embodiments described herein, a facial recognition neural network may be configured to output further feature vectors in addition to a face score (i.e. an indication of whether a face is present at a location or not) and a face box. For example, the facial recognition neural network may output one or more facial landmark locations. Alternatively or additionally, the facial recognition neural network may output an embedding of the region within the face box. These outputs may be used in a multi-task loss function to train the facial recognition neural network.
Use of such a multi-task loss function can result in a higher performance of the facial classification/detection neural network when compared to other facial localisation methods. For example, the facial classification/detection neural network may have a lower error rate.
In some embodiments, a lightweight backbone network is employed. This can allow the facial classification/detection neural network to be run in real-time on a single CPU core.
Figure 1 shows a schematic overview of a method 100 of facial localisation in images. An image 102 comprising one or more faces 104 is input into a neural network 106. The neural network 106 processes the image through a plurality of neural network layers, and outputs one or more sets of output facial data 108. Each set of output facial data 108 comprises a predicted facial classification for a face 104 in the image 102, a predicted location of a corresponding face box, and one or more corresponding feature vectors that encode one or more properties of a corresponding face in the image 102.
The input image 102, I, is an image comprising one or more faces. The input image 102 may comprise a set of pixel values in an array, for example a two-dimensional or three-dimensional array. For example, in a colour image $I \in \mathbb{R}^{H \times W \times 3}$, H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB). The image 102 may, in some embodiments, be in black-and-white/greyscale.
The neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the neural network 106 may be convolutional layers. One or more of the layers of the neural network 106 may be deformable convolutional layers.
In some embodiments, the neural network 106 may comprise a residual network and a feature pyramid network. The neural network 106 may comprise lateral connections between layers of the residual network and layers of the feature pyramid network. The feature pyramid may process the outputs of the residual network in a top-down manner. Examples of neural network architectures using a residual network and a feature pyramid are described below in relation to Figures 4 to 6.
The parameters of the neural network 106 may be trained using a multi-task objective function (also referred to herein as a loss function). The multi-task objective function may comprise two or more of a facial classification loss; a face box regression loss; a facial landmark regression loss; and/or a dense face regression loss. The multi-task loss function may comprise a supervised loss and/or a self-supervised loss. Examples of methods of training the neural network are described below in relation to Figures 2 and 3.
The set of output facial data 108 comprises, for each potential face identified in the image 102 by the neural network, a predicted facial classification 108a for a face 104 in the image 102. The predicted classification 108a comprises a probability of an anchor in the image being a face.
The set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, a predicted bounding box 108b for a face 104 in the image 102. The predicted bounding box 108b may, for example, be the coordinates of the corners of a bounding box 110 of the potential face.
The set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, one or more feature vectors 108c encoding one or more properties of the face. For example, a feature vector 108c may comprise the location of one or more facial landmarks, such as the eyes, nose, chin and/or mouth of a face. The feature vector 108c may alternatively or additionally comprise an embedding of the face contained within the bounding box 110.
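By way of illustration only, a set of output facial data 108 could be represented as follows; the container and field names are hypothetical and not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnchorOutput:
    """Hypothetical container for one set of output facial data 108."""
    face_score: float                             # 108a: probability that a face is present
    face_box: Tuple[float, float, float, float]   # 108b: e.g. corner coordinates of bounding box 110
    landmarks: List[Tuple[float, float]]          # part of 108c: e.g. five (x, y) facial landmarks
    embedding: List[float]                        # part of 108c: encoding of the face inside box 110
```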
Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss. The method comprises training the neural network 106 on a set of training images 202 using a multi-task loss function 204. For each of a plurality of images 202 in a training dataset, a training image 202 is input into the neural network 106 and processed through a plurality of neural network layers. The neural network 106 outputs a set of predicted facial data 206 based on the input training image 202. The predicted facial data 206 is compared to known facial data for the corresponding training image 202 using a multi-task loss function 204, and the comparison is used to update parameters of the neural network 106.
The training image 202 is taken from a training set. The training set comprises a plurality of images containing faces, and a set of known facial data for each image. The known facial data comprises the locations of each face in the corresponding image, as well as the locations of the bounding boxes for each face in the corresponding image. In some embodiments, the known facial data further comprises locations of one or more facial landmarks for each face in the corresponding image.
Each training image may be associated with a plurality of training anchors. Each anchor represents a "prior" box where a face could potentially be in the image. Each anchor may be defined as a reference box comprising a central location and a scale. However, it will be appreciated that the anchors may be defined in other ways. Anchors are used in facial detection to distinguish positive and negative training samples. A positive anchor is defined as an anchor for which a face is present. A negative anchor is defined as an anchor for which a face is not present. An example of the use of anchors is provided in "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (Shaoqing Ren et al., arXiv:1506.01497, the contents of which are incorporated herein by reference).
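As a rough illustration of how positive and negative anchors might be distinguished in practice, the sketch below labels anchors by their intersection-over-union (IoU) with ground-truth face boxes. The IoU criterion and the 0.5/0.3 thresholds are common practice and are assumptions here, not details taken from the patent.

```python
import numpy as np

def iou(anchor, gt):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(anchor[0], gt[0]), max(anchor[1], gt[1])
    x2, y2 = min(anchor[2], gt[2]), min(anchor[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter + 1e-9)

def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """Label each anchor: 1 = positive (face present), 0 = negative, -1 = ignored."""
    labels = np.full(len(anchors), -1, dtype=int)
    for i, anchor in enumerate(anchors):
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thr:
            labels[i] = 1
        elif best < neg_thr:
            labels[i] = 0
    return labels
```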
The predicted facial data 206 comprises a facial classification 210. The predicted facial classification 210 comprises a probability, $p_i$, of an associated anchor, i, being a face. The predicted facial classification 210 may, in some embodiments, have two components corresponding to "face present" and "face not present".
The predicted facial classification 210 for an anchor is associated with a classification loss, $L_{cls}(p_i, p_i^*)$, where $p_i^*$ is the known facial classification for anchor i (i.e. 1 for a positive anchor and 0 for a negative anchor). The classification loss compares a predicted classification for the anchor to a known classification for the anchor. The classification loss may, for example, be a softmax loss for binary cases, such as a cross entropy loss or binary hinge loss. However, the skilled person will appreciate that other loss functions may alternatively be used. The classification loss may, in some embodiments, only be used for positive anchors. In other embodiments, it may be used for both positive and negative anchors.
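A minimal sketch of one such classification loss, a binary cross-entropy between the predicted face probability and the known label; this is one of the options mentioned above rather than a prescribed choice.

```python
import numpy as np

def classification_loss(p, p_star, eps=1e-7):
    """Binary cross-entropy between predicted face probability p and known label p_star (0 or 1)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
```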
The predicted facial data 206 comprises a predicted facial box location 212, $t_i$. The predicted facial box location 212 comprises data identifying a predicted location of a bounding box for the corresponding face for an anchor, i. The predicted facial box location 212 may be, for example, coordinates of the corners of the predicted bounding box for the face, such as all four corners of the box, or coordinates of two diagonally opposite corners of the box. The predicted facial box location 212 may alternatively be the coordinates of the centre of the bounding box and the height and width of the box. Other examples are possible.
The predicted facial box location 212 for a positive anchor is associated with a box regression loss, $L_{box}(t_i, t_i^*)$, where $t_i^*$ is the known facial box location for positive anchor i. The box regression loss compares the predicted facial box location 212, $t_i$, to the known location of the corresponding face box, $t_i^*$. The box regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used. In some embodiments, the box regression targets may be normalised to a centre location and width and height of the box.
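A minimal sketch of a smooth-L1 box regression loss of the kind described; the same helper can also serve the facial landmark regression loss discussed below. The normalisation of the regression targets is omitted.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Element-wise smooth-L1 (Huber-like) penalty."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x * x / beta, x - 0.5 * beta)

def box_regression_loss(t_pred, t_true):
    """Sum of smooth-L1 differences over the predicted and known box parameters."""
    return smooth_l1(np.asarray(t_pred, dtype=float) - np.asarray(t_true, dtype=float)).sum()
```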
The predicted facial data also comprises one or more predicted feature vectors that each encode one or more pixel-based properties of the area within the facial box. The feature vectors are each associated with a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of the face associated with the positive anchor.
An example of such a feature vector is the facial landmark data 214. The facial landmark data 214 comprises the predicted coordinates/locations of one or more facial landmarks within the predicted facial box. For example, five facial landmarks corresponding to the eyes, the corners of the mouth and the nose may be used.
The facial landmark data 214 is associated with a facial landmark regression loss, $L_{pts}(l_i, l_i^*)$, where $l_i^*$ are the known facial landmark locations for positive anchor i. The facial landmark regression loss compares the predicted facial landmark locations 214, $l_i$, to the known locations of the corresponding facial landmarks, $l_i^*$. The facial landmark regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used. In some embodiments, the facial landmark location targets may be normalised to a centre location.
Alternatively or additionally, the feature vector may comprise an embedding 216, $P_{ST}$, of the potential face within the bounding box. The embedding 216 may comprise shape and texture parameters of the face within the corresponding bounding box. In some embodiments, camera parameters 218 and/or lighting parameters 220 may also be output by the neural network 106. Camera parameters 218 may comprise one or more of: a camera location; a camera pose; and/or a focal length of the camera. Lighting (or illumination) parameters may comprise one or more of: a light source location; colour values of the light source; and/or ambient light colours. The camera parameters 218 and/or lighting parameters 220 may be referred to as rendering parameters.
The embedding 216 may be associated with a dense regression loss function, $L_{pixel}$. The dense regression loss does not directly compare the embedding 216 to a known embedding. Instead, the embedding is post-processed to generate a predicted facial image 222, which is then compared to the potential facial image (i.e. the image in the bounding box) using the regression loss function. This branch of the process may therefore be considered to be self-supervised.
To generate the predicted facial image 222, the embedding may be processed using a mesh decoder neural network 224 to generate a three-dimensional representation of the potential face, such as a three-dimensional mesh. The mesh decoder neural network 224 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. The mesh decoder neural network 224 comprises one or more geometric convolutional layers (also referred to herein as mesh convolutional layers). A geometric convolutional layer is a type of convolutional filter layer of the neural network used in geometric deep learning that can be applied directly in the mesh domain.
The mesh decoder may further comprise one or more up-scaling layers configured to upscale the mesh (i.e. increase the number of vertices in the mesh). The up-scaling layers may be interlaced with the geometric convolutional layers. For example the geometric convolutional layers may alternate with the up-scaling layers.
The mesh decoder 224 is configured to take as input the embedding parameters 216 and to process the input using a series of neural network layers to output a shape map and a texture map of the two-dimensional potential face from the training image. The output of the mesh decoder neural network may be symbolically represented as $D_{P_{ST}}$. The shape and texture map comprises three-dimensional shape and texture information relating to the input visual data. In some embodiments, the output is a set of parameters describing (x, y, z) coordinates of points in a mesh and corresponding (r, g, b) values for each point in the mesh. The output may, in some embodiments, be a coloured mesh within a unit sphere.
The mesh decoder may be pre-trained, and have its parameters fixed during training of the neural network 106. An example of a mesh decoder neural network and how a mesh decoder neural network may be trained is described in relation to co-pending GB Patent Application No. 902524.6 (the contents of which are incorporated herein by reference).
A mesh may be defined in terms of an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} \in \mathbb{R}^{n \times 6}$ is a set of n vertices containing joint shape (e.g. (x, y, z)) and texture (e.g. (r, g, b)) information and $\mathcal{E} \in \{0,1\}^{n \times n}$ is an adjacency matrix defining the connection status between vertices. However, it will be appreciated that alternative representations of a mesh may be used.
An example of a convolutional operator on a graph/mesh can be defined by formulating mesh filtering with a kernel $g_\theta$ using a recursive polynomial, such as a Chebyshev polynomial. In defining such a convolutional operator on a mesh, it is useful to define some intermediate variables in the following way. A non-normalised graph Laplacian may be defined as $L = D - \mathcal{E} \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_j \mathcal{E}_{ij}$. A normalised graph Laplacian may then be defined as $L = I_n - D^{-1/2} \mathcal{E} D^{-1/2}$, where $I_n$ is the identity matrix.
The Laplacian can be diagonalised by the Fourier bases $U = [u_0, \ldots, u_{n-1}] \in \mathbb{R}^{n \times n}$ such that $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}([\lambda_0, \ldots, \lambda_{n-1}]) \in \mathbb{R}^{n \times n}$. A graph Fourier representation of a given mesh, $x \in \mathbb{R}^{n \times 6}$, can then be defined as $\hat{x} = U^T x$ with inverse $x = U \hat{x}$.
Given these definitions, an example of a convolutional operator, $g_\theta$, on a graph/mesh can be defined. The filter $g_\theta$ can be parameterised as a truncated Chebyshev polynomial expansion of order K as
$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda}),$$
where $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\tilde{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order k evaluated at a scaled Laplacian $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$. $T_k(x)$ can be recursively computed using $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$.
A spectral convolution can then be defined as
$$y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L)\, x_i \in \mathbb{R}^n,$$
where $x \in \mathbb{R}^{n \times F_{in}}$ is the input and $y \in \mathbb{R}^{n \times F_{out}}$ is the output. Such a filtering operation is very efficient, costing only $O(K|\mathcal{E}|)$ operations.
Other types of geometric/mesh convolution (or graph convolutions) may alternatively or additionally be used. For example, other sets of orthogonal polynomials may be used in place of the Chebyshev polynomials.
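A minimal NumPy sketch of the Chebyshev spectral convolution defined above. It uses dense matrices for clarity (a practical implementation would use sparse operations), and assumes a symmetric adjacency matrix E.

```python
import numpy as np

def scaled_laplacian(E):
    """Normalised, rescaled Laplacian: L~ = 2L/lambda_max - I, with L = I - D^(-1/2) E D^(-1/2)."""
    n = E.shape[0]
    d = E.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-9)))
    L = np.eye(n) - d_inv_sqrt @ E @ d_inv_sqrt
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(n)

def cheb_conv(x, E, theta):
    """Chebyshev graph convolution: x is (n, F_in), theta is (K, F_in, F_out); returns (n, F_out)."""
    L_tilde = scaled_laplacian(E)
    K = theta.shape[0]
    T_prev = np.eye(L_tilde.shape[0])                      # T_0 = I
    out = (T_prev @ x) @ theta[0]
    if K > 1:
        T_curr = L_tilde                                   # T_1 = L~
        out += (T_curr @ x) @ theta[1]
        for k in range(2, K):
            T_next = 2.0 * L_tilde @ T_curr - T_prev       # T_k = 2 L~ T_{k-1} - T_{k-2}
            out += (T_next @ x) @ theta[k]
            T_prev, T_curr = T_curr, T_next
    return out
```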
The joint shape and texture map undergoes further processing using a differentiable renderer 226 to generate a 2D rendered representation 222 of the potential face within the bounding box. The representation may comprise a two-dimensional projection of the joint shape and texture map onto an image plane.
The sets of rendering parameters 218, 220 may be used to render the representation of the two-dimensional visual data, for example to define the parameters of a camera model used in the rendering, i.e. the 2D rendered representation 222 is given by $\mathcal{R}(D_{P_{ST}}, P_{cam}, P_{ill})$.
The differentiable renderer 226 may, for example, be a 3D mesh renderer that projects the coloured mesh $D_{P_{ST}}$ onto a 2D image plane. An example of such a differentiable renderer 226 may be found in Genova et al. (CVPR, 2018), though other renderers may alternatively be used.
The dense regression loss, $L_{pixel}$, compares the rendered 2D face 222 to the original facial image in the training image 202 (i.e. the ground truth image within the predicted facial box). A pixel-wise difference may be used to perform the comparison. An example of the dense regression loss may be given by:
$$L_{pixel} = \frac{1}{H \times W} \sum_{i=1}^{W} \sum_{j=1}^{H} \left\| \mathcal{R}(D_{P_{ST}}, P_{cam}, P_{ill})_{i,j} - I^*_{i,j} \right\|_n,$$
where $I^*$ is the anchor crop of the ground-truth training image and H and W are the width and height of the anchor crop respectively. $\|\cdot\|_n$ is an L-n norm, such as an L-1 norm. Other pixel-wise loss functions may alternatively be used.
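A minimal sketch of the pixel-wise comparison above, assuming the rendered face and the ground-truth anchor crop are H x W x 3 arrays of the same size.

```python
import numpy as np

def dense_regression_loss(rendered, gt_crop):
    """Per-pixel L1 difference between the rendered face and the ground-truth crop, averaged over H x W."""
    H, W = gt_crop.shape[:2]
    return np.abs(rendered - gt_crop).sum() / (H * W)
```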
The multi-task objective function 204 comprises a combination of the classification loss, the box regression loss and the feature loss. The aim of the training is, for each training anchor i, to minimise the multi-task objective function. An example of a multi-task loss function for an anchor i is:
$$\mathcal{L} = L_{cls}(p_i, p_i^*) + \lambda_1 p_i^* L_{box}(t_i, t_i^*) + \lambda_2 p_i^* L_{pts}(l_i, l_i^*) + \lambda_3 p_i^* L_{pixel},$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are constants that control the relative significance of the different loss functions in the multi-task objective function. These parameters may, for example, be in the range (0, 1]. For example, $\lambda_1 = 0.25$, $\lambda_2 = 0.1$, and $\lambda_3 = 0.01$.
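A sketch of how the per-anchor terms combine, using the example weights above; the box, landmark and dense terms are gated by the known label so that they only contribute for positive anchors.

```python
def multi_task_loss(cls_loss, box_loss, pts_loss, pixel_loss, p_star,
                    lam1=0.25, lam2=0.1, lam3=0.01):
    """Per-anchor multi-task loss; p_star is 1 for a positive anchor and 0 for a negative anchor."""
    return cls_loss + p_star * (lam1 * box_loss + lam2 * pts_loss + lam3 * pixel_loss)
```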
The multi-task objective function 204 is used to update the parameters of the neural network 106 during training. The parameters of the neural network 106 may be updated using an optimisation procedure that aims to optimise the multi-task loss function 208. The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by back-propagating the multi-task loss function 204 to the parameters of the neural network 106. The optimisation procedure may be associated with a learning rate which varies during training. For example, the learning rate may start from an initial value, such as 10^-3, and then increase after a first threshold number of epochs, for example to 10^-2 after 5 epochs. The learning rate may then decay after further threshold epochs, for example reducing by a pre-defined factor at pre-defined epoch numbers. For example, the learning rate may decay by a factor of ten at 55 and 68 epochs.
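A sketch of the learning-rate schedule described above, using the example values (10^-3 warm-up, 10^-2 after 5 epochs, decay by a factor of ten at epochs 55 and 68).

```python
def learning_rate(epoch, warmup_lr=1e-3, base_lr=1e-2,
                  warmup_epochs=5, decay_epochs=(55, 68), decay_factor=10.0):
    """Piecewise learning-rate schedule: warm up, then step-decay at fixed epochs."""
    if epoch < warmup_epochs:
        return warmup_lr
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr /= decay_factor
    return lr
```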
Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images. The method may correspond to the method described above in relation to Figure 2.
At operation 3.1, a training image is input to the neural network, the training image comprising one or more faces. The training image may be selected from a training dataset comprising a plurality of images, each image comprising one or more faces at known locations and with known bounding boxes. One or more features of the faces, such as the locations of facial landmarks, may also be known.
At operation 3.2, the input training image is processed using the neural network.
At operation 3.3, the neural network outputs, for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors.
Operations 3.1 to 3.3 may be iterated over a subset of the training examples in the training data to form an ensemble of training examples. This ensemble may be used to determine the objective function in operation 3.4, for example by averaging the multi-task objective function over the ensemble. Alternatively, operation 3.4 may be performed for each training example separately.
At operation 3.4, parameters of the neural network are updated in dependence on an objective function. The objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor. The update may be performed after each iteration of operations 3.1 to 3.3. Alternatively, the update may be performed after a number of iterations of operations 3.1 to 3.3. In these embodiments, the multi-task loss function may comprise one or more expectation values taken over the ensemble of training examples.
Operations 3.1 to 3.4 may be iterated over the training dataset until a threshold condition is met. The threshold condition may be a threshold number of training epochs. The threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80.
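A schematic training loop corresponding to operations 3.1 to 3.4, assuming a threshold of 80 epochs as in the example above. `neural_network`, `multi_task_objective` and `update_parameters` are placeholders for the model, the objective function described in relation to Figure 2 and the optimiser step of whatever framework is used; `learning_rate` is the schedule sketched earlier.

```python
def train(neural_network, training_set, max_epochs=80):
    """Iterate operations 3.1 to 3.4 over the training data until the epoch threshold is met."""
    for epoch in range(max_epochs):                      # threshold condition: number of epochs
        for image, known_data in training_set:
            outputs = neural_network(image)              # 3.1-3.3: per-anchor predictions
            loss = multi_task_objective(outputs, known_data)   # classification + box + feature losses
            update_parameters(neural_network, loss, learning_rate(epoch))  # 3.4: e.g. a gradient step
    return neural_network
```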
Figure 4 shows an example of a neural network structure for facial localisation in images. The neural network 106 may, in some embodiments, use this structure.
The neural network 400 comprises a backbone network 402. The backbone network 402 may comprise a plurality of convolutional layers. The backbone network 402 may further comprise a plurality of down-sampling layers that reduce the dimensionality of an input to the layer. The backbone network may be initialised as a pre-trained classification network, such as a ResNet network (e.g. ResNet-152). The structure of the backbone network 402 may, for example, be given for an input image 404 size of 640x640 colour pixels by:

Layer     Output size   Building blocks
conv1     320x320       7x7, 64, stride 2
conv2_x   160x160       3x3 max pool, stride 2; [1x1, 64; 3x3, 64; 1x1, 256] x 3
conv3_x   80x80         [1x1, 128; 3x3, 128; 1x1, 512] x 8
conv4_x   40x40         [1x1, 256; 3x3, 256; 1x1, 1024] x 36
conv5_x   20x20         [1x1, 512; 3x3, 512; 1x1, 2048] x 3
where the convolutional building blocks are shown in square brackets and down-sampling is performed by conv3_1, conv4_1 and conv5_1 with a stride of 2. The backbone network 402 is configured to process the input image 404 through a series of convolutional layers (in this example labelled C2-C6) in a "bottom up" manner (i.e. from a low indexed layer to a high indexed layer) and generate an output for each convolutional layer. Although six layers are shown in this example, fewer or greater numbers of layers may be used.
During training one or more of the convolutional layers may be fixed to their initial, pre-trained value. For example, the first two layers may be fixed. This can achieve /5 higher accuracy of the neural network once trained.
The neural network 400 further comprises a feature pyramid network 406. The feature pyramid network 406 comprises a plurality of feature pyramid levels (in this example labelled P2-P6, with P6 acting directly on/being shared with the final layer of the backbone network 402). The feature pyramid levels may comprise convolutional layers.
The convolutional layers may be deformable convolutional layers. The convolutional layers may be interlaced with one or more upscaling layers. The feature pyramid 406 is configured to process the outputs of a corresponding layer in the backbone network
An example of a feature pyramid structure is given by: 1.21_ Iii -1 14116 V7 1C-4' N,111 4\11,14}, where The dimension of the output of each layer is indicated in the first column, the "anchor" column indicates the scales associated with each layer of the pyramid, and the "stride" column indicates the stride of the convolutions performed by each layer. For example, in this example the P2 layer is associated with the anchor scales 16x16, /0 20.16x20.16 and 25.4x25.4 and performs convolutions with stride 4. However, it will be appreciated that different scales and convolutions may alternatively be used. The lower levels of the feature pyramid may be associated with smaller scales than the higher levels and/or have a smaller stride.
The neural network 400 comprises a plurality of lateral connections 408 between the output of layers of the backbone network 402 and the layers of the feature pyramid 406. The lateral connections 408 may, in some embodiments, process the output of the associated backbone network 402 layer and then input it to the corresponding feature pyramid layer, along with the output of a higher feature pyramid layer. In other words, each feature pyramid layer after the highest layers receives as input a combination of the output of the corresponding backbone network layer and the output of the previous higher layer of the feature pyramid, e.g. layer P3 of the feature pyramid receives as input a combination of the output of layer C3 of the backbone network and the output of layer P4 of the feature pyramid. The higher layers of the feature pyramid (in this example layers P6 and P5) receive their input from the backbone network without combining it with the output of a previous layer.
Each layer of the feature pyramid outputs a set of features. The features may be further processed to generate the output data 108 and/or the multi-task loss 208. The number of features in each layer may be fixed to be the same. For example, each layer may be fixed to have 256 features.
The use of lateral connections allows the "bottom up" pathways and "top down" pathways to be combined. In general, top-down pathways may be semantically strong but spatially coarse, while the bottom-up pathways may be spatially precise but semantically weak. By combining the pathways using lateral connections, the performance of the neural network in facial localisation can be improved.
The backbone network 402 may, in some embodiments, be a MobileNet network, such as MobileNet-0.25. This provides a more lightweight model that can run on a single CPU in substantially real time. These lightweight models may use 7x7 convolution with stride 4 on the input image to reduce the data size. Dense anchors may be tiled on P4-P5 to improve performance. In some embodiments, the deformable convolutional layers are replaced with convolutional layers to speed up the processing. During training, the one or more layers of the pre-trained backbone network, such as the first two layers, may be fixed to improve accuracy.
Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images. In this example, the lateral connection is for the P3 level of the feature pyramid, though it may mutatis mutandis be applied to any of the lateral connections between layers.
The lateral connection 500 provides the output of a layer 502 of the backbone network 402 (in this example the C3 layer of the backbone network) to a corresponding layer of the feature pyramid network 406 (in this example the P3 layer of the feature pyramid).
The lateral connection 500 may be configured to process the output of the layer 502 of the backbone network 402 to generate processed output 504. The processing may, for example, be to alter the dimensionality of the output of the layer 502 of the backbone network 402 so that it can be used as input to the corresponding feature pyramid layer. In the example shown, a 1x1 convolution is used to reduce the dimensionality of the output of the layer 502 of the backbone network from 80x80x512 to 80x80x256. Each lateral connection may apply a different convolution to reduce or increase the dimensionality of the associated output layer to match the corresponding feature pyramid layer.
The output from the immediately higher layer 506 of the feature pyramid (in this example P4) undergoes upsampling to increase its dimensionality. In the example shown, the output of the P4 layer of the pyramid network is up-scaled by a factor of two from 40x40x256 to 80x80x256. The upscaling may, for example, be performed by nearest neighbour up-sampling. The up-scaled output is then added to the processed output 504 from the lateral connection 500 in order to generate a merged map 508.
Element-wise addition may be used to merge the up-scaled output with the processed output 504.
The merged map 508 is used as input to the layer of the feature pyramid network 406 corresponding to the lateral connection 500 (in this example the P3 layer). The P3 layer applies a 3x3 deformable convolution to the merged map 508 in order to generate an output feature map 510. The use of a deformable convolution can reduce the aliasing effects of upsampling, as well as perform non-rigid context modelling.
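A minimal sketch of the merge step of the lateral connection 500: channel reduction of the backbone output, nearest-neighbour up-sampling of the higher pyramid output, and element-wise addition. The 1x1 convolution is passed in as a placeholder function, and the subsequent 3x3 deformable convolution is not shown.

```python
import numpy as np

def upsample_x2(x):
    """Nearest-neighbour up-sampling of an (H, W, C) feature map by a factor of two."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def lateral_merge(backbone_feat, higher_pyramid_feat, reduce_1x1):
    """Produce the merged map 508 from e.g. C3 (80x80x512) and P4 (40x40x256) feature maps."""
    lateral = reduce_1x1(backbone_feat)          # e.g. 80x80x512 -> 80x80x256 (processed output 504)
    top_down = upsample_x2(higher_pyramid_feat)  # e.g. 40x40x256 -> 80x80x256
    return lateral + top_down                    # element-wise addition
```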
Figure 6 shows a further example of a neural network structure for facial localisation in images. The neural network 600 of this example is substantially the same as that of Figure 4, but with the addition of a context module 602 for each of the feature pyramid layers. The context module 602 may increase the receptive field and enhance the rigid context modelling power of the neural network 600.
Each context module 602 takes as input the feature map 604 from a corresponding feature pyramid layer. A sequence of deformable convolutions is performed using the feature map 604. Each deformable convolution after a first deformable convolution is applied to the output 606a-c of the previous deformable convolution. In the example shown, three 3x3 deformable convolutions are applied. However, it will be appreciated that fewer or greater numbers of deformable convolutions may be applied, and that different sized deformable convolutions may be used.
The outputs of each deformable convolution 606a-c are combined to form an enhanced output 608. The enhanced output 608 may be further processed to generate the output data 108 and/or the multi-task loss 208.
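A sketch of the context module 602, with the three 3x3 deformable convolutions passed in as placeholder functions. Concatenating their outputs along the channel axis is one plausible way of combining them; the combination operation is an assumption, as it is not specified here.

```python
import numpy as np

def context_module(feature_map, deform_conv_a, deform_conv_b, deform_conv_c):
    """Apply a chain of three (deformable) convolutions and combine their outputs (608)."""
    out_a = deform_conv_a(feature_map)   # 606a
    out_b = deform_conv_b(out_a)         # 606b: applied to the previous output
    out_c = deform_conv_c(out_b)         # 606c
    return np.concatenate([out_a, out_b, out_c], axis=-1)   # enhanced output 608 (assumed concat)
```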
Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
The apparatus (or system) 700 comprises one or more processors 702. The one or more processors control operation of other components of the system/apparatus 700. The one or more processors 702 may, for example, comprise a general purpose processor.
The one or more processors 702 may be a single core device or a multiple core device. The one or more processors 702 may comprise a central processing unit (CPU) or a graphical processing unit (GPU). Alternatively, the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be to included.
The system/apparatus comprises a working or volatile memory 704. The one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory. The volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM), Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
The system/apparatus comprises a non-volatile memory 706. The non-volatile memory 706 stores a set of operation instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions. The non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
The one or more processors 702 are configured to execute operating instructions 708 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700. Generally speaking, the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.
Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.
Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.

Claims (20)

  1. A computer implemented method of training a neural network for face localisation, the method comprising: inputting, to the neural network, a training image, the training image comprising one or more faces; processing, using the neural network, the training image; outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors; updating parameters of the neural network in dependence on an objective function, wherein the objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
  2. The method of claim 1, wherein the one or more feature vectors comprises predicted locations of a plurality of facial landmarks of a face, and wherein the feature loss comprises a facial landmark regression loss comparing the predicted locations of the plurality of facial landmarks to known locations of the plurality of facial landmarks.
  3. The method of claim 1 or 2, wherein the one or more feature vectors comprises an encoded representation of a face within the predicted face box, and wherein the method comprises: generating a three-dimensional representation of the face using a mesh decoder neural network, the mesh decoder neural network comprising one or more geometric convolutional layers; and generating a two-dimensional image of the face from the three-dimensional representation of the face using a differentiable renderer, wherein the feature loss comprises a dense regression loss comparing the generated two-dimensional image of the face to a ground truth image of the face within the predicted face box.
  4. The method of claim 3, wherein the feature vector further comprises camera parameters and/or illumination parameters, and wherein the differentiable renderer uses the camera parameters and/or illumination parameters when generating the two-dimensional image of the face from the three-dimensional representation of the face.
  5. The method of any of claims 3 or 4, wherein the dense regression loss is a pixel-wise difference between the generated two-dimensional image of the face and the ground truth image of the face within the predicted face box.
  6. The method of any of claims 3 to 5, wherein the mesh decoder comprises a plurality of geometric convolutional layers and a plurality of upscaling layers, and wherein the plurality of upscaling layers are interlaced with the plurality of geometric convolutional layers.
  7. The method of any preceding claim, wherein the objective function comprises, for each negative anchor in the training image, a classification loss comparing a predicted classification for the negative anchor to a known classification for the negative anchor.
  8. The method of any preceding claim, further comprising iterating the method using one or more further training images until a threshold condition is satisfied.
  9. The method of any preceding claim, wherein the neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
  10. The method of claim 9, wherein each lateral connection is configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
  11. The method of claims 9 or 10, wherein the second plurality of convolutional layers comprises one or more deformable convolutions.
  12. The method of any of claims 9 to 11, wherein the neural network further comprises one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
  13. A computer implemented method of face localisation, the method comprising: identifying, using a neural network, one or more faces within the input image; and identifying, using the neural network, one or more corresponding face boxes within the input image, wherein the neural network has been trained using the method of any preceding claim.
  14. 14. A computer implemented method of face localisation, the method comprising: inputting, to the neural network, an image, the image comprising one or more 20 faces; processing, using the neural network, the image; outputting, from the neural network, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors, wherein the neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configures to process the output of the first plurality of convolutional layers in a top-down manner.
15. The method of claim 14, wherein each lateral connection is configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
16. The method of claims 14 or 15, wherein the second plurality of convolutional layers comprises one or more deformable convolutions.
17. The method of any of claims 14 to 16, wherein the neural network further comprises one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
18. The method of any of claims 14-17, wherein the neural network has been trained using the method of any of claims 1-8.
19. A system comprising: one or more processors; and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform the method of any preceding claim.
20. A computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform the method of any of claims 1-18.
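The dense regression loss of claims 3 to 5 compares, pixel by pixel, the two-dimensional image produced by the differentiable renderer with the ground-truth pixels inside the predicted face box. A minimal sketch of such a loss is given below, assuming a PyTorch setting in which the rendered face is already available as a tensor; the function name, the (x1, y1, x2, y2) box format and the choice of a mean absolute difference are illustrative assumptions rather than details taken from the patent.

```python
import torch.nn.functional as F

def dense_regression_loss(rendered_face, ground_truth_image, face_box):
    """Pixel-wise difference between the rendered 2-D face and the ground-truth
    pixels inside the predicted face box (cf. claims 3 to 5)."""
    x1, y1, x2, y2 = face_box                              # integer box corners (assumed format)
    gt_crop = ground_truth_image[:, y1:y2, x1:x2]          # (C, H, W) crop of the ground truth
    # Resize the rendered face to the crop so the comparison is truly pixel-wise.
    rendered = F.interpolate(rendered_face.unsqueeze(0),
                             size=gt_crop.shape[-2:],
                             mode='bilinear',
                             align_corners=False).squeeze(0)
    return (rendered - gt_crop).abs().mean()               # mean absolute pixel difference
```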
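The classification term of claim 7 compares the predicted classification of each negative anchor with its known (background) label, which is commonly realised as a cross-entropy restricted to the anchors labelled as background. The sketch below assumes one binary logit per anchor and a label convention of 1 for face and 0 for background; both conventions are assumptions made for illustration.

```python
import torch.nn.functional as F

def negative_anchor_classification_loss(pred_logits, anchor_labels):
    """Binary cross-entropy over anchors whose known classification is background (0)."""
    negatives = anchor_labels == 0                          # boolean mask of negative anchors
    if negatives.sum() == 0:
        return pred_logits.new_zeros(())                    # no negative anchors in this image
    return F.binary_cross_entropy_with_logits(
        pred_logits[negatives], anchor_labels[negatives].float())
```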
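Claims 9, 10, 14 and 15 describe a backbone (the first plurality of convolutional layers, with skip connections) whose outputs are merged top-down by a second, laterally connected plurality of convolutional layers. A feature-pyramid-style sketch of that top-down pass is shown below; the class name, channel counts and nearest-neighbour upsampling are assumptions chosen for illustration, not recitations from the claims.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFeaturePyramid(nn.Module):
    """Second plurality of convolutional layers: laterally connected to the
    backbone outputs and applied top-down, each lateral connection merging a
    backbone feature map with the preceding (coarser) top-down output."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, backbone_features):
        # backbone_features: feature maps from the first plurality of layers,
        # ordered from highest to lowest spatial resolution.
        laterals = [conv(f) for conv, f in zip(self.lateral, backbone_features)]
        merged = [laterals[-1]]                              # start from the coarsest level
        for i in range(len(laterals) - 2, -1, -1):           # walk top-down
            upsampled = F.interpolate(merged[0], size=laterals[i].shape[-2:],
                                      mode='nearest')
            merged.insert(0, laterals[i] + upsampled)        # merge via the lateral connection
        return [conv(m) for conv, m in zip(self.smooth, merged)]
```

With a ResNet-style backbone, backbone_features would typically be the outputs of its residual stages (the skip connections of the first plurality of layers); each merged map could then feed the heads that predict the facial classification, face-box location and feature vectors recited in claim 14.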
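Claims 11, 12, 16 and 17 add deformable convolutions and context modules operating on the outputs of the second plurality of convolutional layers. One possible sketch using torchvision's DeformConv2d is given below; predicting the sampling offsets with an ordinary 3x3 convolution is a common pattern and an assumption here, not a detail drawn from the patent.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ContextModule(nn.Module):
    """Context module sketch: refines one level of the top-down pyramid with a
    3x3 deformable convolution whose sampling offsets are predicted from the
    same feature map."""

    def __init__(self, channels=256):
        super().__init__()
        # Two offsets (x and y) per position of the 3x3 kernel.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.deform(x, self.offset(x)))
```

A module of this kind would typically be applied to each pyramid level before the per-level prediction heads.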
GB1906027.6A 2019-04-30 2019-04-30 Facial localisation in images Active GB2582833B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1906027.6A GB2582833B (en) 2019-04-30 2019-04-30 Facial localisation in images
CN202080031201.1A CN114080632A (en) 2019-04-30 2020-04-24 Face localization in images
PCT/GB2020/051012 WO2020221990A1 (en) 2019-04-30 2020-04-24 Facial localisation in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1906027.6A GB2582833B (en) 2019-04-30 2019-04-30 Facial localisation in images

Publications (3)

Publication Number Publication Date
GB201906027D0 GB201906027D0 (en) 2019-06-12
GB2582833A true GB2582833A (en) 2020-10-07
GB2582833B GB2582833B (en) 2021-04-07

Family

ID=66809192

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1906027.6A Active GB2582833B (en) 2019-04-30 2019-04-30 Facial localisation in images

Country Status (3)

Country Link
CN (1) CN114080632A (en)
GB (1) GB2582833B (en)
WO (1) WO2020221990A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414417B (en) * 2019-07-25 2022-08-12 电子科技大学 Traffic sign board identification method based on multi-level fusion multi-scale prediction
CN113496139B (en) * 2020-03-18 2024-02-13 北京京东乾石科技有限公司 Method and apparatus for detecting objects from images and training object detection models
CN111462002B (en) * 2020-03-19 2022-07-12 重庆理工大学 Underwater image enhancement and restoration method based on convolutional neural network
CN111985642B (en) * 2020-08-17 2023-11-14 厦门真景科技有限公司 Beauty neural network training method, apparatus, equipment and storage medium
CN112364754A (en) * 2020-11-09 2021-02-12 云南电网有限责任公司迪庆供电局 Bolt defect detection method and system
CN112396053A (en) * 2020-11-25 2021-02-23 北京联合大学 Method for detecting object of all-round fisheye image based on cascade neural network
CN113435466A (en) * 2020-12-26 2021-09-24 上海有个机器人有限公司 Method, device, medium and terminal for detecting elevator door position and switch state
CN112733672A (en) * 2020-12-31 2021-04-30 深圳一清创新科技有限公司 Monocular camera-based three-dimensional target detection method and device and computer equipment
CN112964693A (en) * 2021-02-19 2021-06-15 山东捷讯通信技术有限公司 Raman spectrum band region segmentation method
CN113705320A (en) * 2021-05-24 2021-11-26 中国科学院深圳先进技术研究院 Training method, medium, and apparatus for surgical motion recognition model
CN113221842B (en) * 2021-06-04 2023-12-29 第六镜科技(北京)集团有限责任公司 Model training method, image recognition method, device, equipment and medium
CN115410265B (en) * 2022-11-01 2023-01-31 合肥的卢深视科技有限公司 Model training method, face recognition method, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB902524A (en) 1959-06-03 1962-08-01 Oswald Arthur Silk Improvements in or relating to seed sowing apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004904A1 (en) * 2010-06-07 2016-01-07 Affectiva, Inc. Facial tracking with classifiers
US20180032840A1 (en) * 2016-07-27 2018-02-01 Beijing Kuangshi Technology Co., Ltd. Method and apparatus for neural network training and construction and method and apparatus for object detection
CN107194376A (en) * 2017-06-21 2017-09-22 北京市威富安防科技有限公司 Mask fraud convolutional neural networks training method and human face in-vivo detection method

Also Published As

Publication number Publication date
CN114080632A (en) 2022-02-22
GB2582833B (en) 2021-04-07
WO2020221990A1 (en) 2020-11-05
GB201906027D0 (en) 2019-06-12

Similar Documents

Publication Publication Date Title
GB2582833A (en) Facial localisation in images
Xie et al. Neural fields in visual computing and beyond
Wang et al. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks
US9965901B2 (en) Generating simulated images from design information
Petersen et al. Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer
Rist et al. Semantic scene completion using local deep implicit functions on lidar data
CN109472858A (en) Differentiable rendering pipeline for reverse figure
Henderson et al. Unsupervised object-centric video generation and decomposition in 3D
WO2021027759A1 (en) Facial image processing
WO2021165628A1 (en) Generating three-dimensional object models from two-dimensional images
JP2022505657A (en) Posture fluctuation 3D facial attribute generation
GB2581536A (en) Joint shape and texture decoders for three-dimensional rendering
Uchida et al. Noise-robust transparent visualization of large-scale point clouds acquired by laser scanning
CN113449784B (en) Image multi-classification method, device, equipment and medium based on priori attribute map
Wang et al. OctreeNet: A novel sparse 3-D convolutional neural network for real-time 3-D outdoor scene analysis
JP4567660B2 (en) A method for determining a segment of an object in an electronic image.
CN111611925A (en) Building detection and identification method and device
Ahmad et al. 3D capsule networks for object classification from 3D model data
WO2022253148A1 (en) Systems and methods for sparse convolution of unstructured data
Palfinger Continuous remeshing for inverse rendering
Oblak et al. Recovery of superquadrics from range images using deep learning: A preliminary study
Šircelj et al. Segmentation and recovery of superquadric models using convolutional neural networks
CN116452599A (en) Contour-based image instance segmentation method and system
US20220172421A1 (en) Enhancement of Three-Dimensional Facial Scans
WO2023241276A1 (en) Image editing method and related device

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: HUAWEI TECHNOLOGIES CO., LTD

Free format text: FORMER OWNER: FACESOFT LTD.