WO2020049276A1 - System and method for facial landmark localisation using a neural network - Google Patents

System and method for facial landmark localisation using a neural network

Info

Publication number
WO2020049276A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
network system
input
facial
image
Application number
PCT/GB2019/052425
Other languages
French (fr)
Inventor
Jiankang DENG
Stefanos ZAFEIRIOU
Original Assignee
Facesoft Ltd.
Application filed by Facesoft Ltd. filed Critical Facesoft Ltd.
Priority to CN201980057333.9A (CN112639810A)
Publication of WO2020049276A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Definitions

  • the present invention relates to facial landmark localisation, and in particular to systems and methods for localising facial landmarks using a neural network system.
  • Facial landmark localisation relates to finding the location of various facial features within an image. Examples of the facial features that may be located include the mouth, the nasal ridge, the eyebrows and the jaw. Facial landmark localisation has wide ranging applications including, but not limited to, face recognition, face tracking, expression understanding, and 3D face reconstruction.
  • a neural network system for localising facial landmarks in an image, the neural network system including two or more convolutional neural networks in series, each convolutional neural network including a plurality of downsampling layers, a plurality of upsampling layers and a plurality of lateral connections connecting layers of equal size; wherein each of the lateral connections comprises one or more aggregation nodes having an upsampling input; and wherein at least one of the aggregation nodes further includes a downsampling input.
  • Each convolutional neural network may include 3 or 4 downsampling layers and a corresponding number of upsampling layers.
  • Each convolutional neural network may include 3 downsampling layers.
  • Only one of the aggregation nodes may include a downsampling input.
  • Each of the aggregation nodes may include a downsampling output.
  • At least one of the lateral connections may include one or more lateral skip connections.
  • At least one of the lateral connections may include one or more convolutions. At least one of the convolutions may be a depth-wise separable convolution.
  • the neural network system may include an input layer capable of receiving a 128 x 128 pixel input image.
  • the neural network system may include a channel aggregation block, including a plurality of channel decrease steps, a plurality of channel increase steps and a plurality of branch connections connecting steps of equal channel size.
  • Each of the plurality of downsampling layers, each of the plurality of upsampling layers and each of the aggregation nodes may include the channel aggregation block.
  • the output of each convolutional neural network may be connected with a spatial transformer.
  • the spatial transformer may include at least one deformable convolution.
  • a method of localising facial landmarks in an image using a neural network system, including the steps of: receiving an input image, the input image including a plurality of input pixels; in each of a plurality of convolutional neural networks in series: downsampling the input image using a plurality of downsampling layers; and upsampling the downsampled image using a plurality of upsampling layers; wherein the upsampling includes combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size; wherein each of the lateral connections includes one or more aggregation nodes having an upsampling input; and wherein at least one of the aggregation nodes further includes a downsampling input; and generating an output heatmap corresponding to the input image from the output of the series of convolutional neural networks, the heatmap representing a probability for a location of a facial landmark at each input pixel.
  • a method of training a neural network system including receiving a plurality of training facial images and a plurality of expected facial landmark parameter sets, inputting one of the plurality of training facial images to the neural network system, in response to the inputting, generating a facial landmark heatmap for the training facial image using the neural network system, and training the neural network system, including updating one or more neural network weights of the plurality of neural network weights to minimise a loss measure based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.
  • the method may include generating a plurality of transformed facial images by applying one or more transformations to the plurality of training facial images, inputting a transformed facial image to the neural network system, the input transformed facial image corresponding to a transformation of the input training facial image, in response to the inputting, generating a facial landmark heatmap for the transformed facial image using the neural network system, wherein training the neural network system includes updating the neural network weights to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.
  • Figure 1 is a schematic block diagram illustrating an example of a neural network system for localising facial landmarks
  • Figure 2 is a schematic block diagram illustrating an example of a convolutional neural network topology for use in a neural network system for localising facial landmarks
  • Figure 3 is a schematic diagram illustrating an example of a novel neural network building block architecture for use in a convolutional neural network
  • Figure 4 is a flow diagram of an example method for generating a facial landmark heatmap for an input image using a neural network system
  • Figure 5 is a flow diagram of a first example method for training a neural network system for localising facial landmarks in an image
  • Figure 6 is a flow diagram of an extended example method for training a neural network system for localising facial landmarks in an image.
  • Example implementations provide system(s) and method(s) for improved localisation of facial landmarks in images.
  • Facial landmark localisation has applications in face recognition, face tracking, expression understanding and 3D face reconstruction. Therefore, improving facial landmark localisation, by using systems and methods described herein, may improve the performance of systems directed at these applications.
  • face recognition systems using the system(s) and method(s) described herein may recognise faces with greater accuracy, as measured using standard facial recognition test datasets.
  • Improved localisation of facial landmarks in images may be achieved by various example implementations by way of a novel neural network system design, a novel convolutional neural network topology, and/or a novel neural network building block architecture.
  • the novel neural network system design may include a plurality of convolutional neural networks and corresponding inside transformers, where the inside transformers employ deformable convolution to flexibly model geometric face transformations in a computationally efficient manner.
  • the use of inside transformers employing deformable convolutions may enable such example implementations to be spatially invariant to arbitrary input face images.
  • the novel convolutional neural network topology improves on U-net topologies by iteratively and hierarchically merging the feature hierarchy with additional aggregation nodes within the lateral connections.
  • the additional aggregation nodes have lateral inputs, upsampling inputs and downsampling inputs.
  • These aggregation nodes enable multi-scale features to be aggregated, improving the localisation of facial landmarks.
  • the downsampling inputs to some nodes are removed as, in some situations, e.g. when there is limited training data, removing these connections reduces computational complexity, model size and training time, without affecting facial landmark localisation performance.
  • the CAB architecture can help contextual modelling, improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when local observation is blurred.
  • depth-wise separable convolutions and replication-based channel extensions are employed to control computational complexity and model size.
  • Neural Network System: Referring to Figure 1, a neural network system 100 for localising facial landmarks in an image is shown.
  • the neural network system 100 is configured to receive a neural network input 102 and produce a neural network output 104.
  • the neural network system 100 may be implemented on one or more suitable computing devices.
  • the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones.
  • the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks.
  • the neural network system 100 may be implemented using a suitable machine learning framework.
  • the neural network input 102 may be an input image including a plurality of input pixels or a matrix representation of such an image.
  • the input image may be a cropped and/or rescaled version of an original image, e.g. the input image may be the region of the original image corresponding to a detected face, rescaled to 128x128 pixels.
  • the neural network output 104 may be an output heatmap corresponding to the input image.
  • the heatmap may represent a probability for a location of a facial landmark at each pixel.
  • Such a heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
  • the neural network system 100 includes a first convolutional neural network 110, a first inside transformer 120, a second convolutional neural network 112 and a second inside transformer 122.
  • the first convolutional neural network 110 receives and processes the neural network input 102 using a number of neural network layers to produce a first neural network output.
  • the first neural network output is received by the first inside transformer 120 which performs an appropriate transformation to produce a first transformed neural network output.
  • the first transformed neural network output is received and processed by the second convolutional neural network 112 to produce a second neural network output.
  • the second neural network output is received by the second inside transformer 122 which performs an appropriate transformation to produce a second transformed neural network output.
  • the second transformed neural network output may itself be the neural network output 104 or the neural network output 104 may be derived from the second transformed neural network output.
  • the convolutional neural networks 110, 112 may have a Scale Aggregation Topology (SAT), as previously described.
  • the convolutional neural networks may have the variant of the SAT illustrated in Figure 2 and described in the related section of the description.
  • the convolutional neural networks may have a U-Net, Hourglass or Deep Layer Aggregation topology.
  • the first convolutional neural network 110 and the second convolutional neural network 112 may have the same topology or different topologies. Where the first convolutional neural network 110 and the second convolutional neural network 112 have different topologies, the topologies may be largely similar but have subtle variations, e.g. different connections removed, or different convolutions replaced by depth-wise separable convolutions and/or lateral connections. There may also be minor differences in the block architectures of each.
  • the inside transformers 120, 122 may be spatial transformers to improve and/or provide spatial transformation modelling capacity to the neural network system 100.
  • the inside transformers 120, 122 perform parameter implicit transformation by deformable convolution.
  • Deformable convolution extends conventional convolution by learning additional offsets in a local and dense manner to the locations sampled by the convolution operation.
  • Deformable convolution is able to more flexibly model geometric face transformations and has higher computational efficiency than transformers employing explicit parameter transformation.
  • the inside transformers 120, 122 may alternatively perform explicit parameter transformations using a spatial transformer network (STN).
  • Referring to Figure 2, an example of a convolutional neural network topology suitable for use in neural network systems for facial landmark localisation is shown.
  • the illustrated topology is a variant of the previously described Scale Aggregation Topology (SAT).
  • the convolutional neural network topology 200 includes a number of nodes 210-224, where each node may be a channel aggregation block. Some nodes 220-224 are referred to as aggregation nodes as they are nodes within the lateral connections which aggregate information across multiple scales.
  • Each of the nodes 210-224 is connected to other nodes in the network according to the connections 230-266 illustrated. These connections 230-266 receive one or more inputs from a given node in the network, optionally perform an operation, and transfer the result to another node.
  • the solid connections 230-236 represent downsampling convolutions. These connections receive an input from a node (e.g. node 210), perform a downsampling convolution and provide the result as an input to another node (e.g. node 211).
  • the downsampling convolution may include applying a convolution layer followed by a pooling layer to the input, or may include applying a convolution layer having a stride of two or more.
  • the result of the downsampling convolution operation has half the vertical resolution and half the horizontal resolution of the connection input, e.g. if the connection input has a resolution of 128x128 then the result would have a resolution of 64x64.
  • the dashed connections 240-245 represent upsampling convolutions. These connections receive an input from a node (e.g. node 211), perform an upsampling convolution and provide the result as an input to another node (e.g. node 220).
  • the upsampling convolution may include applying a convolution layer followed by an unpooling layer to the input, or may include applying a convolution layer having a fractional stride.
  • the result of the upsampling convolution operation has twice the vertical resolution and twice the horizontal resolution of the input, e.g. if the connection input has a resolution of 64x64 then the connection result would have a resolution of 128x128.
  • the connections 250-252 represent depth-wise separable convolutions. These connections receive an input from a node (e.g. node 212), perform a depth-wise separable convolution on the input and provide the result as an input to another node (e.g. node 214).
  • the use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required.
  • replacing conventional convolutions with depth-wise separable convolutions substantially maintains the performance, e.g. the facial recognition accuracy, of the relevant system.
  • the use of depth-wise separable convolutions for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.
  • the dotted connections 260-266 represent lateral skip connections. These lateral skip connections receive an input from a node (e.g. node 210) and provide the received value to another node (e.g. node 220). These lateral skip connections allow high-resolution information from inputs to be maintained without requiring substantial processing. Furthermore, as lateral skip connections require few, if any, parameters, the use of lateral skip connections instead of convolutions reduces the number of parameters and computation required. Furthermore, in some instances and for particular connections, replacing convolutions with lateral skip connections substantially maintains the performance, e.g. the facial landmark localisation accuracy, of the relevant system. In some situations, e.g. when training data is limited, the use of lateral skip connections for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.
  • Referring to Figure 3, an example of a neural network building block architecture suitable for use in neural network systems for facial landmark localisation is shown.
  • the illustrated architecture is a variant of the previously described Channel Aggregation Block (CAB) architecture.
  • the neural network building block architecture 300 includes a number of operations (310-345). These operations include channel decrease convolutions (310-314), self-concatenations (320-325), depth-wise separable convolutions (330-332) and matrix additions (340-345).
  • input signals branch off before each channel decrease and converge back before each channel increase to maintain the channel information.
  • the input signal branches off prior to the channel decrease at channel decrease convolution 312 and converges back before the corresponding channel increase at self-concatenation 322.
  • Using such channel compression in the building block architecture can help contextual modelling, improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when local observation is blurred.
  • the operations 310-314 are channel decrease convolutions. These operations receive an input, perform a channel decrease convolution on the input and provide the result as an input to one or more following operations.
  • the input to the channel decrease convolution may be a three-dimensional input with dimensions W x H x C1, where W is the width of the input, H is the height of the input and C1 is the number of channels in the input.
  • Each channel of the input, i.e. each W x H x 1 slice, may be a feature heatmap.
  • the channel decrease convolution may comprise applying a four-dimensional convolution kernel to the input.
  • the four-dimensional kernel may have dimensions K x K x C1 x C2, where K is an arbitrary, typically small, integer, e.g. 3, and C2 is the desired number of channels.
  • the result of the channel decrease convolution has dimensions W x H x C2.
  • In some embodiments, C2 is C1 / 2.
  • the four-dimensional convolution kernel may have dimensions 3 x 3 x 64 x 32, and the result of the four-dimensional convolution may have dimensions 128 x 128 x 32.
  • the operations 320-325 are self-concatenation operations. These operations receive an input, perform a self-concatenation on the input and provide the result as an input to one or more following operations.
  • the self-concatenation may include concatenating the received input with itself one or more times along the dimension corresponding to the channels of the input, e.g. a depth-wise concatenation. Therefore, the result has a greater number of channels than the input. For example, where the input has C1 channels and is concatenated with itself once, the result will have 2C1 channels. To provide a further example, if the received input has dimensions 128 x 128 x 32 and the input is concatenated with itself once, then the result has dimensions 128 x 128 x 64.
  • the operations 330-332 are depth-wise separable convolutions. These operations receive an input, perform a depth-wise separable convolution on the input and provide the result as an input to one or more following operations. The use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required.
  • the operations 340-345 are matrix additions. These operations receive two or more inputs, perform a matrix addition of the two or more inputs, and provide the result as an input to one or more following operations.
  • the matrix additions (340-345) combine information from representations having different numbers of channels.
  • the neural network building block architecture presented is illustrative, and variations of the presented neural network building block architecture may be employed instead. These variations may be advantageous in certain contexts. For example, where more training data and more computational power are available, it may be beneficial to replace the self-concatenation operations with channel increase convolutions and/or the depth-wise separable convolutions with conventional convolutions. Furthermore, it should be noted that the other innovative aspects of the present disclosure may be employed with various neural network building block architectures, e.g. ResNet blocks, Inception ResNet blocks and/or Hierarchical, Parallel & Multi-Scale (HPM) blocks.
  • Figure 4 is a flow diagram illustrating an example method by which a facial landmark heatmap may be generated for an input image using a neural network system.
  • the method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system 100.
  • the neural network system may, for example, be neural network system 100.
  • an input image comprising a plurality of pixels is received.
  • the input image may be a cropped and/or rescaled version of an original image, e.g. the input image may be the region of the original image corresponding to a detected face, rescaled to 128x128 pixels.
  • steps 422 and 424 are both repeated in, and for each of, a plurality of convolutional neural networks in series.
  • steps 422 and 424 may be performed for each of the first convolutional neural network 110 and the second convolutional neural network 112 of neural network system 100.
  • the input image is downsampled using a plurality of downsampling layers of the respective neural network.
  • the plurality of downsampling layers may be downsampling convolutional layers.
  • the first downsampling layer may apply a convolution followed by a pooling operation to the input image, or may apply a convolution layer having a stride of two or more to the image. Other layers of the plurality of downsampling layers may apply such operations to the output of the preceding layer.
  • the plurality of downsampling layers include the downsampling connections 230-236 of convolutional neural network topology 200.
  • the downsampled image is upsampled using a plurality of upsampling layers of the respective neural network.
  • the plurality of upsampling layers may be upsampling convolutional layers.
  • Each upsampling convolutional layer may include applying a convolution followed by an unpooling operation to the input, or may include applying a convolution having a fractional stride.
  • These upsampling convolutional layers may be or correspond with the upsampling convolution connections 240-245 of convolutional neural network topology 200.
  • the upsampling may include combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size.
  • the lateral connections may include one or more aggregation nodes. Each of these aggregation nodes may have an upsampling input and at least one of the aggregation nodes may have a further downsampling input.
  • an output heatmap corresponding to the input image is generated from the output of the series of convolutional neural networks.
  • the heatmap represents a probability for a location of a facial landmark at each input pixel.
  • the heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
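  • As an illustration of reading landmark coordinates back out of such a heatmap, the following is a minimal PyTorch sketch (not from the patent); it assumes the heatmap is a (num_landmarks, H, W) tensor with one channel per landmark and takes a per-channel argmax:

```python
import torch

def heatmap_to_landmarks(heatmap: torch.Tensor) -> torch.Tensor:
    """heatmap: (num_landmarks, H, W) per-pixel landmark probabilities.
    Returns (num_landmarks, 2) integer (x, y) peak coordinates."""
    n, h, w = heatmap.shape
    flat_idx = heatmap.reshape(n, -1).argmax(dim=1)  # row-major peak index
    ys = flat_idx // w                               # recover the row...
    xs = flat_idx % w                                # ...and the column
    return torch.stack([xs, ys], dim=1)
```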
  • Figure 5 is a flow diagram illustrating an example method 500 by which a neural network system for localising facial landmarks in an image may be trained.
  • the method may be perform by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system too.
  • the neural network system trained by this method may, for example, be neural network system too.
  • training facial images and corresponding expected facial landmark parameter sets are received.
  • the training facial images may be cropped and/ or rescaledversion of corresponding original images, e.g. each of the training facial images may be the region of an original image corresponding to a detected face and rescaledto 128x128 pixels.
  • the expected facial landmark parameter sets may be training heatmaps corresponding to the training facial images.
  • Each of the training heatmaps may represent a probability for a location of a facial landmark at each pixel. As the locations of the facial landmarks are known for each of the training facial images, the probability for a location of a facial landmark at each pixel of each image may be zero or one. Zero may represent the absence of the facial landmark and one may represent the presence of the facial landmark.
  • Such training heatmaps may be in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
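  • For example, binary training heatmaps of the kind just described could be constructed as follows. This is a hedged sketch; make_training_heatmaps is a hypothetical helper, not a function named in the patent, and in practice a small Gaussian around each landmark is also a common target:

```python
import torch

def make_training_heatmaps(landmarks, size: int = 128) -> torch.Tensor:
    """landmarks: iterable of (x, y) pixel coordinates, one per landmark.
    Returns a (num_landmarks, size, size) tensor that is 1 at each known
    landmark location and 0 everywhere else, as described above."""
    maps = torch.zeros(len(landmarks), size, size)
    for i, (x, y) in enumerate(landmarks):
        maps[i, int(y), int(x)] = 1.0  # probability one at the landmark pixel
    return maps
```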
  • one of the training facial images is input to the neural network system.
  • the training facial image may be input to the neural network system by any suitable mechanism.
  • the training facial image may be: retrieved from a file by the neural network system; received via an application programming interface (API) call; or received via a remote service call, e.g. a REST or SOAP service call.
  • a facial landmark heatmap for the training facial image is generated using the neural network system.
  • the generated facial landmark heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks.
  • the generated facial landmark heatmap may be in the same format as that of the corresponding expected facial landmark parameter sets, e.g. the same format as the corresponding training heatmap.
  • the neural network system is trained.
  • the neural network system includes a plurality of weights.
  • the training includes updating one or more weights of the plurality of neural network weights to minimise a loss measure that is based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.
  • the loss measure may be the mean squared error (MSE) pixel-wise loss or the sigmoid cross-entropy (CE) pixel-wise loss.
  • the corresponding expected facial landmark parameter set may be transformed into the format of the generated facial landmark heatmap to enable the loss measure to be calculated.
  • the updating of the one or more weights may be performed by some variant of the gradient descent optimization algorithm which employs backpropagation, e.g. Adam, Nadam or Adamax.
  • each of the training iterations may update the one or more weights based on a batch of corresponding generated facial landmark heatmaps and expected facial landmark parameter sets, using a batch gradient optimization algorithm.
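  • Putting these steps together, one training iteration of method 500 might look like the following sketch. The model, optimiser and data loading are placeholders; the MSE pixel-wise loss and Adam are the choices named above:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimiser, images, target_heatmaps) -> float:
    """One illustrative iteration: forward pass, pixel-wise MSE loss against
    the expected heatmaps, backpropagation, weight update."""
    optimiser.zero_grad()
    predicted = model(images)                      # (B, N, H, W) heatmaps
    loss = F.mse_loss(predicted, target_heatmaps)  # MSE pixel-wise loss
    loss.backward()                                # backpropagation
    optimiser.step()                               # e.g. an Adam update
    return loss.item()

# Usage sketch:
# optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
# for images, targets in loader:
#     train_step(model, optimiser, images, targets)
```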
  • Figure 6 is a flow diagram illustrating an extended example method 600 by which a neural network system for localising facial landmarks in an image may be trained.
  • the method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system 100.
  • the neural network system trained by this method may, for example, be neural network system 100.
  • a plurality of transformed facial images are generated by applying one or more transformations to the plurality of training facial images. The applied transformations may, for example, be affine or flip transformations.
  • Examples of affine transformations include translation, scaling and rotation.
  • Steps 510 to 530 of the second neural network system training method 600 correspond to those of the first neural network system training method 500.
  • a transformed facial image is input to the neural network system, where the transformed facial image is a transformation of the input training facial image.
  • the transformed image is an image of the plurality of transformed facial images.
  • a facial landmark heatmap for the transformed facial image is generated using the neural network system.
  • the generated facial landmark heatmap for the transformed facial image is in the same format as the facial landmark heatmap for the training facial image.
  • the neural network system is trained and the plurality of weights updated as previously described.
  • the neural network weights are also updated to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.
  • Minimising a transformation loss measure encourages the regression network to output coherent landmarks when there are rotation, scale, translation and flip transformations applied to the images.
  • the transformation loss measure and the loss measure may be combined into a combined loss measure that can be used to update the neural network weights.
  • N is the landmark number, i.e. the index of a given facial landmark; I is the input image; G(I) is the expected facial landmark heatmap, i.e. the ground truth; H(I) is the generated facial landmark heatmap; T is the transformation; and λ is the weight balancing the loss L_p-p, which is based on the difference between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image, and the loss L_p-g, which is based on the difference between the generated facial landmark heatmap and the expected facial landmark heatmap.
  • L_p-g includes two components, L_p-g1 and L_p-g2.
  • L_p-g1 is based on the difference between the generated facial landmark heatmap for the training facial image and the expected facial landmark heatmap.
  • L_p-g2 is based on the difference between the generated facial landmark heatmap for the transformed facial image and an expected facial landmark heatmap for the transformed facial image.
  • the expected facial landmark heatmap for the transformed facial image may be derived by applying the same transformation used to generate the transformed facial image to the expected facial landmark heatmap for the training facial image.
  • the Mean Squared Error (MSE) pixel-wise loss is used to derive each of the component loss measures; however, any suitable loss function may be used, e.g. the Sigmoid Cross-Entropy (CE) pixel-wise loss. Different loss functions may also be used for different components of the combined loss. For example, the Sigmoid Cross-Entropy (CE) pixel-wise loss function may be used to derive L_p-g and the Mean Squared Error (MSE) pixel-wise loss may be used to derive L_p-p.
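  • A minimal sketch of such a combined loss, with MSE used for every component as described above. It assumes all four tensors have already been brought into a common coordinate frame (i.e. the transformation T has been applied to the prediction or target as appropriate), which the description above leaves implicit:

```python
import torch.nn.functional as F

def combined_loss(pred, pred_t, target, target_t, lam: float = 1.0):
    """pred: H(I); pred_t: heatmap generated for the transformed image;
    target: G(I); target_t: expected heatmap for the transformed image.
    lam is the balancing weight, written as lambda in the description."""
    l_pg1 = F.mse_loss(pred, target)      # L_p-g1: prediction vs ground truth
    l_pg2 = F.mse_loss(pred_t, target_t)  # L_p-g2: transformed pair
    l_pp = F.mse_loss(pred, pred_t)       # L_p-p: consistency term
    return l_pg1 + l_pg2 + lam * l_pp
```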

Abstract

A neural network system for localising facial landmarks in an image comprises two or more convolutional neural networks in series. Each convolutional neural network comprises a plurality of downsampling layers, a plurality of upsampling layers, and a plurality of lateral connections connecting layers of equal size. Each of the lateral connections comprises one or more aggregation nodes having an upsampling input, and at least one of the aggregation nodes further includes a downsampling input.

Description

SYSTEM AND METHOD FOR FACIAL LANDMARK LOCALISATION USING A NEURAL NETWORK
Field of the Invention
The present invention relates to facial landmark localisation, and in particular to systems and methods for localising facial landmarks using a neural network system.
Background
Facial landmark localisation, or face alignment, relates to finding the location of various facial features within an image. Examples of the facial features that may be located include the mouth, the nasal ridge, the eyebrows and the jaw. Facial landmark localisation has wide ranging applications including, but not limited to, face recognition, face tracking, expression understanding, and 3D face reconstruction.
Summary
According to a first aspect of the invention, there is provided a neural network system for localising facial landmarks in an image, the neural network system including two or more convolutional neural networks in series, each convolutional neural network including a plurality of downsampling layers, a plurality of upsampling layers and a plurality of lateral connections connecting layers of equal size; wherein each of the lateral connections comprises one or more aggregation nodes having an upsampling input; and wherein at least one of the aggregation nodes further includes a downsampling input.
Each convolutional neural network may include 3 or 4 downsampling layers and a corresponding number of upsampling layers.
Each convolutional neural network may include 3 downsampling layers.
Only one of the aggregation nodes may include a downsampling input.
Each of the aggregation nodes may include a downsampling output.
At least one of the lateral connections may include one or more lateral skip connections.
At least one of the lateral connections may include one or more convolutions. At least one of the convolutions may be a depth-wise separable convolution.
The neural network system may include an input layer capable of receiving a 128 x 128 pixel input image.
The neural network system may include a channel aggregation block, including a plurality of channel decrease steps, a plurality of channel increase steps and a plurality of branch connections connecting steps of equal channel size.
Each of the plurality of downsampling layers, each of the plurality of upsampling layers and each of the aggregation nodes may include the channel aggregation block.
The output of each convolutional neural network may be connected with a spatial transformer.
The spatial transformer may include at least one deformable convolution.
According to a second aspect of the invention, there is provided a method of localising facial landmarks in an image using a neural network system, the method including the steps of: receiving an input image, the input image including a plurality of input pixels; in each of a plurality of convolutional neural networks in series: downsampling the input image using a plurality of downsampling layers; and upsampling the downsampled image using a plurality of upsampling layers; wherein the upsampling includes combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size; wherein each of the lateral connections includes one or more aggregation nodes having an upsampling input; and wherein at least one of the aggregation nodes further includes a downsampling input; and generating an output heatmap corresponding to the input image from the output of the series of convolutional neural networks, the heatmap representing a probability for a location of a facial landmark at each input pixel.
According to a third aspect of the invention, there is provided a method of training a neural network system, including receiving a plurality of training facial images and a plurality of expected facial landmark parameter sets, inputting one of the plurality of training facial images to the neural network system, in response to the inputting, generating a facial landmark heatmap for the training facial image using the neural network system, and training the neural network system, including updating one or more neural network weights of the plurality of neural network weights to minimise a loss measure based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.
The method may include generating a plurality of transformed facial images by applying one or more transformations to the plurality of training facial images, inputting a transformed facial image to the neural network system, the input transformed facial image corresponding to a transformation of the input training facial image, in response to the inputting, generating a facial landmark heatmap for the transformed facial image using the neural network system, wherein training the neural network system includes updating the neural network weights to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.
Brief Description of the Drawings
Certain embodiments of the present invention will now be described, by way of example, with reference to the following figures.
Figure 1 is a schematic block diagram illustrating an example of a neural network system for localising facial landmarks;
Figure 2 is a schematic block diagram illustrating an example of a convolutional neural network topology for use in a neural network system for localising facial landmarks;
Figure 3 is a schematic diagram illustrating an example of a novel neural network building block architecture for use in a convolutional neural network;
Figure 4 is a flow diagram of an example method for generating a facial landmark heatmap for an input image using a neural network system;
Figure 5 is a flow diagram of a first example method for training a neural network system for localising facial landmarks in an image; and
Figure 6 is a flow diagram of an extended example method for training a neural network system for localising facial landmarks in an image.
Detailed Description
Example implementations provide system(s) and method(s) for improved localisation of facial landmarks in images. Facial landmark localisation has applications in face recognition, face tracking, expression understanding and 3D face reconstruction. Therefore, improving facial landmark localisation, by using systems and methods described herein, may improve the performance of systems directed at these applications. For example, face recognition systems using the system(s) and method(s) described herein may recognise faces with greater accuracy, as measured using standard facial recognition test datasets.
Improved localisation of facial landmarks in images may be achieved by various example implementations by way of a novel neural network system design, a novel convolutional neural network topology, and/or a novel neural network building block architecture. The novel neural network system design may include a plurality of convolutional neural networks and corresponding inside transformers, where the inside transformers employ deformable convolution to flexibly model geometric face transformations in a computationally efficient manner. The use of inside transformers employing deformable convolutions may enable such example implementations to be spatially invariant to arbitrary input face images.
The novel convolutional neural network topology, henceforth referred to as a Scale Aggregation Topology (SAT), improves on U-net topologies by iteratively and hierarchically merging the feature hierarchy with additional aggregation nodes within the lateral connections. The additional aggregation nodes have lateral inputs, upsampling inputs and downsampling inputs. These aggregation nodes enable multi-scale features to be aggregated, improving the localisation of facial landmarks. In some variants of the SAT, the downsampling inputs to some nodes are removed as, in some situations, e.g. when there is limited training data, removing these connections reduces computational complexity, model size and training time, without affecting facial landmark localisation performance.
The novel neural network building block architecture, henceforth referred to as a Channel Aggregation Block (CAB) architecture, is a building block architecture that is symmetric in channel. In a CAB architecture, input signals branch off before each channel decrease and converge back before each channel increase to maintain the channel information. Using such channel compression in the building block architecture can help contextual modelling, improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when local observation is blurred. In some variants of a CAB architecture, depth-wise separable convolutions and replication-based channel extensions are employed to control computational complexity and model size.
Neural Network System
Referring to Figure 1, a neural network system 100 for localising facial landmarks in an image is shown.
The neural network system 100 is configured to receive a neural network input 102 and produce a neural network output 104. The neural network system 100 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number of or any combination of one or more mainframes, one or more blade servers, one or more desktop computers, one or more tablet computers, or one or more mobile phones. Where the neural network system 100 is implemented on a plurality of suitable computing devices, the computing devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks and wireless networks. The neural network system 100 may be implemented using a suitable machine learning framework.
The neural network input 102 may be an input image including a plurality of input pixels or a matrix representation of such an image. The input image may be a cropped and/or rescaled version of an original image, e.g. the input image may be the region of the original image corresponding to a detected face, rescaled to 128x128 pixels. The neural network output 104 may be an output heatmap corresponding to the input image. The heatmap may represent a probability for a location of a facial landmark at each pixel. Such a heatmap may be output in one or more of a range of formats, e.g. an image where colours resemble the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
The neural network system 100 includes a first convolutional neural network 110, a first inside transformer 120, a second convolutional neural network 112 and a second inside transformer 122. The first convolutional neural network 110 receives and processes the neural network input 102 using a number of neural network layers to produce a first neural network output. The first neural network output is received by the first inside transformer 120, which performs an appropriate transformation to produce a first transformed neural network output. The first transformed neural network output is received and processed by the second convolutional neural network 112 to produce a second neural network output. The second neural network output is received by the second inside transformer 122, which performs an appropriate transformation to produce a second transformed neural network output. The second transformed neural network output may itself be the neural network output 104, or the neural network output 104 may be derived from the second transformed neural network output.
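The pipeline just described can be summarised in code. The following is a minimal PyTorch sketch, not the patent's implementation; the four sub-modules are placeholders standing in for the convolutional neural networks and inside transformers discussed above:

```python
import torch
import torch.nn as nn

class LandmarkLocaliser(nn.Module):
    """Sketch of the Figure 1 system: two CNNs in series, each followed by
    an inside transformer. The sub-module classes are assumptions, not
    components named in the patent."""
    def __init__(self, cnn_a: nn.Module, transformer_a: nn.Module,
                 cnn_b: nn.Module, transformer_b: nn.Module):
        super().__init__()
        self.stages = nn.Sequential(cnn_a, transformer_a, cnn_b, transformer_b)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: e.g. (batch, 3, 128, 128); output: per-pixel landmark heatmaps
        return self.stages(image)
```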
The convolutional neural networks 110, 112 may have a Scale Aggregation Topology (SAT), as previously described. For example, the convolutional neural networks may have the variant of the SAT illustrated in Figure 2 and described in the related section of the description. In other embodiments, the convolutional neural networks may have a U-Net, Hourglass or Deep Layer Aggregation topology. The first convolutional neural network 110 and the second convolutional neural network 112 may have the same topology or different topologies. Where the first convolutional neural network 110 and the second convolutional neural network 112 have different topologies, the topologies may be largely similar but have subtle variations, e.g. different connections removed, or different convolutions replaced by depth-wise separable convolutions and/or lateral connections. There may also be minor differences in the block architectures of each.
The inside transformers 120, 122 may be spatial transformers to improve and/or provide spatial transformation modelling capacity to the neural network system 100. In some embodiments, the inside transformers 120, 122 perform parameter-implicit transformation by deformable convolution. Deformable convolution extends conventional convolution by learning additional offsets, in a local and dense manner, to the locations sampled by the convolution operation. Deformable convolution is able to more flexibly model geometric face transformations and has higher computational efficiency than transformers employing explicit parameter transformation. The inside transformers 120, 122 may alternatively perform explicit parameter transformations using a spatial transformer network (STN).
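An inside transformer of this kind could be sketched with torchvision's deformable convolution, where a plain convolution predicts the dense local offsets (two per kernel sampling point) that the deformable convolution then uses. This is an illustrative sketch under those assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableInsideTransformer(nn.Module):
    """Deformable-convolution spatial transformer sketch (hypothetical name)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Predict 2 offsets (dy, dx) for each of the kernel_size**2 taps.
        self.offsets = nn.Conv2d(channels, 2 * kernel_size ** 2,
                                 kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offsets(x))  # resample at learned locations
```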
Convolutional Neural Network Topology
Referring to Figure 2, an example of a convolutional neural network topology suitable for use in neural network systems for facial landmark localisation is shown. The illustrated topology is a variant of the previously described Scale Aggregation Topology (SAT).
The convolutional neural network topology 200 includes a number of nodes 210-224, where each node may be a channel aggregation block. Some nodes 220-224 are referred to as aggregation nodes as they are nodes within the lateral connections which aggregate information across multiple scales.
Each of the nodes 210-224 is connected to other nodes in the network according to the connections 230-266 illustrated. These connections 230-266 receive one or more inputs from a given node in the network, optionally perform an operation, and transfer the result to another node.
The solid connections 230-236 represent downsampling convolutions. These connections receive an input from a node (e.g. node 210), perform a downsampling convolution and provide the result as an input to another node (e.g. node 211). The downsampling convolution may include applying a convolution layer followed by a pooling layer to the input, or may include applying a convolution layer having a stride of two or more. In some instances, the result of the downsampling convolution operation has half the vertical resolution and half the horizontal resolution of the connection input, e.g. if the connection input has a resolution of 128x128 then the result would have a resolution of 64x64.
The dashed connections 240-245 represent upsampling convolutions. These connections receive an input from a node (e.g. node 211), perform an upsampling convolution and provide the result as an input to another node (e.g. node 220). The upsampling convolution may include applying a convolution layer followed by an unpooling layer to the input, or may include applying a convolution layer having a fractional stride. In some instances, the result of the upsampling convolution operation has twice the vertical resolution and twice the horizontal resolution of the input, e.g. if the connection input has a resolution of 64x64 then the connection result would have a resolution of 128x128.
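Both connection types can be written down directly; a sketch with an illustrative channel count of 64 (the patent does not fix these hyperparameters):

```python
import torch.nn as nn

# Downsampling convolution: a stride-2 convolution halves each spatial
# dimension, e.g. (B, 64, 128, 128) -> (B, 64, 64, 64).
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

# Alternative: convolution followed by pooling, as also described above.
down_pool = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2),
)

# Upsampling convolution: a fractionally strided (transposed) convolution
# doubles each spatial dimension, e.g. (B, 64, 64, 64) -> (B, 64, 128, 128).
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
```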
The connections 250-252 represent depth-wise separable convolutions. These connections receive an input from a node (e.g. node 212), perform a depth-wise separable convolution on the input and provide the result as an input to another node (e.g. node 214). The use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required. Furthermore, in some instances and for particular connections, replacing conventional convolutions with depth-wise separable convolutions substantially maintains the performance, e.g. the facial recognition accuracy, of the relevant system. In some situations, e.g. when training data is limited, the use of depth-wise separable convolutions for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.
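A depth-wise separable convolution factorises a conventional convolution into a per-channel spatial convolution followed by a 1x1 point-wise convolution that mixes channels; a minimal sketch, with the parameter saving noted in a comment:

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    """Roughly k*k*in_ch + in_ch*out_ch parameters, versus
    k*k*in_ch*out_ch for a conventional convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # depth-wise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                   # point-wise
    )
```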
The dotted connections 260-266 represent lateral skip connections. These lateral skip connections receive an input from a node (e.g. node 210) and provide the received value to another node (e.g. node 220). These lateral skip connections allow high-resolution information from inputs to be maintained without requiring substantial processing. Furthermore, as lateral skip connections require few, if any, parameters, the use of lateral skip connections instead of convolutions reduces the number of parameters and computation required. Furthermore, in some instances and for particular connections, replacing convolutions with lateral skip connections substantially maintains the performance, e.g. the facial landmark localisation accuracy, of the relevant system. In some situations, e.g. when training data is limited, the use of lateral skip connections for some connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.
In the illustrated convolutional neural network topology 200, there are no downsampling connections between nodes 224 and 214, and nodes 222 and 215, as might have been expected. For these particular connections and in given instances, e.g. when training data is limited, the absence of these connections allows the performance, e.g. facial landmark localisation accuracy, of the relevant system to be maintained while reducing the number of parameters and computation required. Furthermore, in certain situations, e.g. when training data is limited, the removal of these connections may improve the performance of the relevant system because the reduction in parameters reduces optimisation difficulty during model training.
Despite the above, it should be noted that, given sufficient training data, computational resources and time, the variation of the scale aggregation topology resulting in the greatest performance would be expected to be a topology where all nodes are connected with convolutions. This is because such a topology would have the greatest computational power and preserve the greatest amount of information.
Neural Network Building Block Architecture
Referring to Figure 3, an example of a neural network building block architecture suitable for use in neural network systems for facial landmark localisation is shown. The illustrated architecture is a variant of the previously described Channel Aggregation Block (CAB) architecture.
The neural network building block architecture 300 includes a number of operations (310-345). These operations include channel decrease convolutions (310-314), self-concatenations (320-325), depth-wise separable convolutions (330-332) and matrix additions (340-345). In the illustrated neural network building block architecture 300, input signals branch off before each channel decrease and converge back before each channel increase to maintain the channel information. For example, the input signal branches off prior to the channel decrease at channel decrease convolution 312 and converges back before the corresponding channel increase at self-concatenation 322. Using such channel compression in the building block architecture can aid contextual modelling, improving facial landmark localisation by incorporating channel-wise heatmap relationships and increasing robustness when the local observation is blurred.
The operations 310-314 are channel decrease convolutions. These operations receive an input, perform a channel decrease convolution on the input and provide the result as an input to one or more following operations. The input to the channel decrease convolution may be a three-dimensional input with dimensions W x H x C1, where W is the width of the input, H is the height of the input and C1 is the number of channels in the input. Each channel of the input, i.e. each W x H x 1 slice, may be a feature heatmap. The channel decrease convolution may comprise applying a four-dimensional convolution kernel to the input. The four-dimensional kernel may have dimensions K x K x C1 x C2, where K is an arbitrary, typically small, integer, e.g. 3, and C2 is the desired number of channels. The result of the channel decrease convolution has dimensions W x H x C2. In some embodiments, C2 is C1/2. For example, if the received input has dimensions 128 x 128 x 64, the four-dimensional convolution kernel may have dimensions 3 x 3 x 64 x 32, and the result of the four-dimensional convolution may have dimensions 128 x 128 x 32.
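As an illustrative sketch only (not part of the disclosure), the worked example above may be expressed in PyTorch, where the four-dimensional kernel is specified by the kernel size and the input and output channel counts:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a channel decrease convolution with K = 3,
# C1 = 64 and C2 = C1/2 = 32; the spatial resolution is preserved.
channel_decrease = nn.Conv2d(64, 32, kernel_size=3, padding=1)

x = torch.randn(1, 64, 128, 128)  # input of dimensions 128 x 128 x 64
y = channel_decrease(x)
print(y.shape)                    # torch.Size([1, 32, 128, 128])
```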
The operations 320-325 are self-concatenation operations. These operations receive an input, perform a self-concatenation on the input and provide the result as an input to one or more following operations. The self-concatenation may include concatenating the received input with itself one or more times along the dimension corresponding to the channels of the input, e.g. a depth-wise concatenation. Therefore, the result has a greater number of channels than the input. For example, where the input has C1 channels and is concatenated with itself once, the result will have 2 x C1 channels. To provide a further example, if the received input has dimensions 128 x 128 x 32 and the input is concatenated with itself once then the result has dimensions 128 x 128 x 64.
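A minimal sketch of a self-concatenation, assuming PyTorch tensors in channels-first layout:

```python
import torch

# Illustrative sketch only: concatenating an input with itself once along
# the channel dimension doubles the channel count with no extra parameters.
x = torch.randn(1, 32, 128, 128)  # 128 x 128 x 32 input
y = torch.cat([x, x], dim=1)      # depth-wise (channel) concatenation
print(y.shape)                    # torch.Size([1, 64, 128, 128])
```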
The use of self-concatenation operations enables the number of channels to be increased, and the described channel compression architecture to be realised, without substantial computation or additional parameters. Alternatives that may be employed, e.g. channel increase convolutions, require substantial computation and a significant number of parameters. Therefore, the model size of a neural network system employing such self-concatenation operations is reduced compared to these alternatives.

The operations 330-332 are depth-wise separable convolutions. These operations receive an input, perform a depth-wise separable convolution on the input and provide the result as an input to one or more following operations. The use of depth-wise separable convolutions, instead of conventional convolutions, in these instances reduces the number of parameters and computation required.

The operations 340-345 are matrix additions. These operations receive two or more inputs, perform a matrix addition of the two or more inputs, and provide the result as an input to one or more following operations. The matrix additions (340-345) combine information from representations having different numbers of channels.
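Purely as a hedged sketch of how the four operation types described above might compose, the following assumes a single compress-and-restore branch ending in a residual matrix addition; the actual wiring of the illustrated block differs, and the class name is an assumption:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one compress-and-restore branch combining the
# four operation types described above (channel decrease, depth-wise
# separable convolution, self-concatenation and matrix addition).
class CompressBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.decrease = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.depthwise = nn.Conv2d(half, half, kernel_size=3,
                                   padding=1, groups=half)
        self.pointwise = nn.Conv2d(half, half, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.decrease(x)                   # channel decrease convolution
        y = self.pointwise(self.depthwise(y))  # depth-wise separable convolution
        y = torch.cat([y, y], dim=1)           # self-concatenation (channel increase)
        return x + y                           # matrix addition
```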
The neural network building block architecture presented is illustrative and variations of the presented neural network building block architecture may be employed instead. These variations may be advantageous in certain contexts. For example, where more training data and more computational power are available, it may be beneficial to replace the self-concatenation operations with channel increase convolutions and/or the depth-wise separable convolutions with conventional convolutions. Furthermore, it should be noted that the other innovative aspects of the present disclosure may be employed with various neural network building block architectures, e.g. ResNet blocks, Inception ResNet blocks and/or Hierarchical, Parallel & Multi-Scale (HPM) blocks.
Facial Landmark Heatmap Generation Method
Figure 4 is a flow diagram illustrating an example method by which a facial landmark heatmap for an input image may be generated using a neural network system. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system 100. The neural network system may, for example, be neural network system 100.

In step 410, an input image comprising a plurality of pixels is received. The input image may be a cropped and/or rescaled version of an original image, e.g. the input image may be the region of the original image corresponding to a detected face, rescaled to 128x128 pixels.

In loop 420, steps 422 and 424 are repeated for each of a plurality of neural networks in series. For example, steps 422 and 424 may be performed for each of the first convolutional neural network 110 and the second convolutional neural network 112 of neural network system 100.

In step 422, the input image is downsampled using a plurality of downsampling layers of the respective neural network. The plurality of downsampling layers may be downsampling convolutional layers. The first downsampling layer may apply a convolution followed by a pooling operation to the input image, or may apply a convolution layer having a stride of two or more to the image. Other layers of the plurality of downsampling layers may apply such operations to the output of the preceding layer. In some embodiments, the plurality of downsampling layers include the downsampling connections 230-236 of convolutional neural network topology 200.
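A minimal sketch of the two downsampling variants just described, assuming PyTorch (module names are illustrative):

```python
import torch.nn as nn

# Illustrative sketch only: two equivalent ways of halving the spatial
# resolution, as described for the downsampling layers.
conv_then_pool = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # convolution ...
    nn.MaxPool2d(kernel_size=2),                  # ... followed by pooling
)
strided_conv = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # stride of two
```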
In step 424, the downsampled image is upsampled using a plurality of upsampling layers of the respective neural network. The plurality of upsampling layers may be upsampling convolutional layers. Each upsampling convolutional layer may apply a convolution followed by an unpooling operation to the input, or may apply a convolution having a fractional stride. These upsampling convolutional layers may be, or correspond with, the upsampling convolution connections 240-245 of convolutional neural network topology 200.
The upsampling may include combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size. The lateral connections may include one or more aggregation nodes. Each of these aggregation nodes may have an upsampling input and at least one of the aggregation nodes may have a further downsampling input.
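As a hedged sketch only, an aggregation node of the kind described might be realised as follows; the fusion by addition and convolution is one plausible choice, and the class name is an assumption:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: an aggregation node on a lateral connection.
# Every node receives an upsampling input; some additionally receive a
# downsampling input. All inputs are assumed to share shape and channels.
class AggregationNode(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, lateral, upsampled, downsampled=None):
        combined = lateral + upsampled
        if downsampled is not None:
            combined = combined + downsampled  # further downsampling input
        return self.fuse(combined)
```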
In step 430, an output heatmap corresponding to the input image is generated from the output of the series of convolutional neural networks. The heatmap represents a probability for a location of a facial landmark at each input pixel. The heatmap may be output in one or more of a range of formats, e.g. an image where colours represent the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
Facial Landmark Localisation Neural Network Training Method
Figure 5 is a flow diagram illustrating an example method 500 by which a neural network system for localising facial landmarks in an image may be trained. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system 100. The neural network system trained by this method may, for example, be neural network system 100.
In step 510, training facial images and corresponding expected facial landmark parameter sets are received. The training facial images may be cropped and/or rescaled versions of corresponding original images, e.g. each of the training facial images may be the region of an original image corresponding to a detected face, rescaled to 128x128 pixels. The expected facial landmark parameter sets may be training heatmaps corresponding to the training facial images. Each of the training heatmaps may represent a probability for a location of a facial landmark at each pixel. As the locations of the facial landmarks are known for each of the training facial images, the probability for a location of a facial landmark at each pixel of each image may be zero or one. Zero may represent the absence of the facial landmark and one may represent the presence of the facial landmark. Such training heatmaps may be in one or more of a range of formats, e.g. an image where colours represent the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information.
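A minimal sketch of such a 0/1 training heatmap, assuming PyTorch and a hypothetical landmark location (the function name and coordinates are assumptions):

```python
import torch

# Illustrative sketch only: a 0/1 training heatmap for a single facial
# landmark whose pixel location (x, y) is known, per the description above.
def make_training_heatmap(x: int, y: int, size: int = 128) -> torch.Tensor:
    heatmap = torch.zeros(size, size)  # zero: landmark absent at this pixel
    heatmap[y, x] = 1.0                # one: landmark present at this pixel
    return heatmap

nose_tip = make_training_heatmap(x=45, y=70)  # hypothetical landmark location
```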
In step 520, one of the training facial images is input to the neural network system. The training facial image may be input to the neural network system by any suitable mechanism. For example, the training facial image may be: retrieved from a file by the neural network system; received via an application programming interface (API) call; or received via a remote service call, e.g. a REST or SOAP service call.
In step 530, a facial landmark heatmap for the training facial image is generated using the neural network system. The generated facial landmark heatmap may be output in one or more of a range of formats, e.g. an image where colours represent the probabilities of certain facial landmarks, one or more matrix representations having numerical probabilities of facial features in a matrix location corresponding to a pixel location in the input image, and/or one or more compact vector representations (embeddings) of such information. The generated facial landmark heatmap may be in the same format as that of the corresponding expected facial landmark parameter set, e.g. the same format as the corresponding training heatmap.

In step 540, the neural network system is trained. By definition, the neural network system includes a plurality of weights. The training includes updating one or more weights of the plurality of neural network weights to minimise a loss measure that is based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set. The loss measure may be the mean squared error (MSE) pixel-wise loss or the sigmoid cross-entropy (CE) pixel-wise loss. In the case that the generated facial landmark heatmap and the corresponding expected facial landmark parameter set are in different formats, the corresponding expected facial landmark parameter set may be transformed into the format of the generated facial landmark heatmap to enable the loss measure to be calculated. The updating of the one or more weights may be performed by a variant of the gradient descent optimisation algorithm which employs backpropagation, e.g. Adam, Nadam or Adamax.
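A minimal sketch of one such training iteration, assuming PyTorch with the MSE pixel-wise loss and the Adam optimiser; `model`, `image` and `target_heatmaps` are assumed to be defined elsewhere:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: one training iteration with the MSE pixel-wise
# loss and a gradient-descent variant employing backpropagation.
def training_step(model, optimizer, image, target_heatmaps):
    optimizer.zero_grad()
    predicted = model(image)                       # generated heatmaps
    loss = F.mse_loss(predicted, target_heatmaps)  # loss measure
    loss.backward()                                # backpropagation
    optimizer.step()                               # weight update
    return loss.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```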
While, for the purposes of simplicity, the above training method has been described in relation to a single training iteration, it should be noted that in a typical embodiment a plurality of training iterations are performed until a convergence criterion is met, or an iteration or time limit is reached. Furthermore, each of the training iterations may update the one or more weights based on a batch of corresponding generated facial landmark heatmaps and expected facial landmark parameter sets using a batch gradient optimisation algorithm.
Extended Facial Landmark Localisation Neural Network System Training Method
Figure 6 is a flow diagram illustrating an extended example method 600 by which a neural network system for localising facial landmarks in an image may be trained. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, e.g. the one or more computing devices implementing the neural network system 100. The neural network system trained by this method may, for example, be neural network system 100.
In step 610, a plurality of transformed facial images are generated by applying one or more transformations to a plurality of training facial images. The applied transformations may, for example, be affine or flip transformations. Examples of affine transformations include translation, scaling and rotation. Steps 510 to 530 of the second neural network system training method 600 correspond to those of the first neural network system training method 500.
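A minimal sketch of two such transformations, assuming PyTorch image tensors in channels-first layout (function names are assumptions; affine transformations such as translation and scaling could be applied analogously):

```python
import torch

# Illustrative sketch only: flip and rotation transformations of a batch
# of training facial images.
def horizontal_flip(images: torch.Tensor) -> torch.Tensor:
    return torch.flip(images, dims=[-1])            # flip transformation

def rotate_90(images: torch.Tensor) -> torch.Tensor:
    return torch.rot90(images, k=1, dims=[-2, -1])  # rotation transformation
```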
In step 620, a transformed facial image is input to the neural network system, where the transformed facial image is a transformation of the input training facial image. The transformed facial image is one of the plurality of transformed facial images generated in step 610.
In step 630, a facial landmark heatmap for the transformed facial image is generated using the neural network system. The generated facial landmark heatmap for the transformed facial image is in the same format as the facial landmark heatmap for the training facial image.
In step 540, the neural network system is trained and the plurality of weights updated as previously described. However, in this instance, the neural network weights are also updated to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image. Minimising a transformation loss measure encourages the regression network to output coherent landmarks when rotation, scale, translation and flip transformations are applied to the images. The transformation loss measure and the loss measure may be combined into a combined loss measure that can be used to update the neural network weights.
An example of a combined loss measure is:
$$\mathcal{L} \;=\; \sum_{i=1}^{N} \Big( \underbrace{\big\| H_i(I) - G_i(I) \big\|_2^2}_{L_{p\text{-}g1}} \;+\; \underbrace{\big\| H_i(T(I)) - T(G_i(I)) \big\|_2^2}_{L_{p\text{-}g2}} \;+\; \lambda\, \underbrace{\big\| H_i(T(I)) - T(H_i(I)) \big\|_2^2}_{L_{p\text{-}p}} \Big)$$
where N is the number of facial landmarks and i is the index of a given facial landmark; I is the input image; G(I) is the expected facial landmark heatmap, i.e. the ground truth; H(I) is the generated facial landmark heatmap; T is the transformation; and λ is the weight balancing the loss L_{p-p}, which is based on the difference between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image, against L_{p-g}, which is based on the difference between the generated and expected facial landmark heatmaps. L_{p-g} includes two components, L_{p-g1} and L_{p-g2}. L_{p-g1} is based on the difference between the generated facial landmark heatmap for the training facial image and the expected facial landmark heatmap. L_{p-g2} is based on the difference between the generated facial landmark heatmap for the transformed facial image and an expected facial landmark heatmap for the transformed facial image. The expected facial landmark heatmap for the transformed facial image may be derived by applying the same transformation used to generate the transformed facial image to the expected facial landmark heatmap for the training facial image. In the equation above, the Mean Squared Error (MSE) pixel-wise loss is used to derive each of the component loss measures; however, any suitable loss function may be used, e.g. the Sigmoid Cross-Entropy (CE) pixel-wise loss. Different loss functions may also be used for different components of the combined loss. For example, the Sigmoid Cross-Entropy (CE) pixel-wise loss function may be used to derive L_{p-g} and the Mean Squared Error (MSE) pixel-wise loss may be used to derive L_{p-p}.
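As a hedged sketch only, the combined loss above may be computed as follows, assuming PyTorch, a model `H` producing one heatmap per landmark, and a transformation `T` that acts in the same way on images and on heatmaps (e.g. a horizontal flip); all names are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: the combined loss measure described above,
# with MSE pixel-wise losses for all three components.
def combined_loss(H, T, image, expected_heatmaps, lam=1.0):
    pred = H(image)        # H(I): heatmaps for the training image
    pred_t = H(T(image))   # H(T(I)): heatmaps for the transformed image
    loss_pg1 = F.mse_loss(pred, expected_heatmaps)       # L_{p-g1}
    loss_pg2 = F.mse_loss(pred_t, T(expected_heatmaps))  # L_{p-g2}
    loss_pp = F.mse_loss(pred_t, T(pred))                # L_{p-p}
    return loss_pg1 + loss_pg2 + lam * loss_pp           # lam plays the role of λ
```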
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the invention, the scope of which is defined in the appended claims. Various components of different embodiments may be combined where the principles underlying the embodiments are compatible.

Claims

1. A neural network system for localising facial landmarks in an image, the neural network system comprising:
two or more convolutional neural networks in series, each convolutional neural network comprising:
a plurality of downsampling layers;
a plurality of upsampling layers; and
a plurality of lateral connections connecting layers of equal size;
wherein each of the lateral connections comprises one or more aggregation nodes having an upsampling input; and
wherein at least one of the aggregation nodes further includes a downsampling input.
2. The neural network system of claim 1, wherein each convolutional neural network comprises 3 or 4 downsampling layers and a corresponding number of upsampling layers.
3. The neural network system of claim 2, wherein each convolutional neural network comprises 3 downsampling layers.
4. The neural network system of any preceding claim, wherein only one of the aggregation nodes includes a downsampling input.
5. The neural network system of any one of claims 1 to 3, wherein each of the aggregation nodes includes a downsampling input.
6. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more lateral skip connections.
7. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more convolutions.
8. The neural network system of claim 7, wherein at least one of the convolutions is a depth-wise separable convolution.
9. The neural network system of any preceding claim, further comprising an input layer capable of receiving a 128 x 128 pixel input image.
10. The neural network system of any preceding claim, further comprising a channel aggregation block, comprising:
a plurality of channel decrease steps;
a plurality of channel increase steps; and
a plurality of branch connections connecting steps of equal channel size.
11. The neural network system of any preceding claim, wherein each of the plurality of downsampling layers, each of the plurality of upsampling layers and each of the aggregation nodes comprises the channel aggregation block.
12. The neural network system of any preceding claim, wherein the output of each convolutional neural network is connected with a spatial transformer.
13. The neural network system of claim 12, wherein the spatial transformer comprises at least one deformable convolution.
14. A method of localising facial landmarks in an image using a neural network system, the method comprising the steps of:
receiving an input image, the input image comprising a plurality of input pixels;
in each of a plurality of convolutional neural networks in series:
downsampling the input image using a plurality of downsampling layers; and
upsampling the downsampled image using a plurality of upsampling layers;
wherein the upsampling further comprises combining the upsampled image with an aggregated signal input to the upsampling layer using a lateral connection from a downsampling layer of equal size;
wherein each of the lateral connections comprises one or more aggregation nodes having an upsampling input; and
wherein at least one of the aggregation nodes further includes a downsampling input; and
generating an output heatmap corresponding to the input image from the output of the series of convolutional neural networks, the heatmap representing a probability for a location of a facial landmark at each input pixel.
15. A method of training the neural network system of any one of claims 1 to 13, comprising:
receiving a plurality of training facial images and a plurality of expected facial landmark parameter sets;
inputting one of the plurality of training facial images to the neural network system;
in response to the inputting, generating a facial landmark heatmap for the training facial image using the neural network system; and
training the neural network system, comprising updating one or more neural network weights of the plurality of neural network weights to minimise a loss measure based at least in part on a comparison between the generated facial landmark heatmap and the corresponding expected facial landmark parameter set.
16. The method of claim 15, further comprising:
generating a plurality of transformed facial images by applying one or more transformations to the plurality of training facial images;
inputting a transformed facial image to the neural network system, the input transformed facial image corresponding to a transformation of the input training facial image; and
in response to the inputting, generating a facial landmark heatmap for the transformed facial image using the neural network system;
wherein training the neural network system further comprises updating the neural network weights to minimise a transformation loss measure based at least in part on a comparison between the generated facial landmark heatmap for the training facial image and the generated facial landmark heatmap for the corresponding transformed facial image.
