WO2020221990A1 - Facial localisation in images
Classifications
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V40/161—Human faces: Detection; Localisation; Normalisation
- G06V40/168—Human faces: Feature extraction; Face representation
- G06V40/172—Human faces: Classification, e.g. identification
- G06N5/022—Knowledge engineering; Knowledge acquisition
Definitions
- This specification relates to methods of facial localisation in images.
- In particular, this specification relates to the use and training of a neural network for facial localisation in images.
- Facial localisation in images is the task of determining the presence and location of faces within an image. It may further comprise identifying the locations of bounding boxes that surround each face determined to be present in the image. Automatic face localisation is a prerequisite step of facial image analysis in many applications, such as facial attribute (e.g. expression and/or age) and/or facial identity recognition.
- Typically, the training of a neural network for facial localisation in images (a facial classification/detection neural network) comprises the use of only classification and box regression losses.
- Facial detection differs from generic object detection in several respects. Facial detection features smaller ratio variations (for example, from 1:1 to 1:1.5) but much larger scale variations (for example, from several pixels to thousands of pixels). This can provide significant challenges when adapting generic object detectors for facial detection and localisation.
- a computer implemented method of training a neural network for face localisation comprising: inputting, to the neural network, a training image, the training image comprising one or more faces; processing, using the neural network, the training image; outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors; updating parameters of the neural network in dependence on an objective function.
- the objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
- the one or more feature vectors may comprise predicted locations of a plurality of facial landmarks of a face, and wherein the feature loss comprises a facial landmark regression loss comparing the predicted locations of the plurality of facial landmarks to known locations of the plurality of facial landmarks.
- the one or more feature vectors may comprise an encoded representation of a face within the predicted face box, and the method may further comprise: generating a three-dimensional representation of the face using a mesh decoder neural network, the mesh decoder neural network comprising one or more geometric convolutional layers; and generating a two-dimensional image of the face from the three-dimensional representation of the face using a differentiable renderer, wherein the feature loss comprises a dense regression loss comparing the generated two-dimensional image of the face to a ground truth image of the face within the predicted face box.
- the feature vector may further comprise camera parameters and/or illumination parameters, and wherein the differentiable renderer uses the camera parameters and/or illumination parameters when generating the two-dimensional image of the face from the three-dimensional representation of the face.
- the dense regression loss may be a pixel-wise difference between the generated two-dimensional image of the face and the ground truth image of the face within the predicted face box.
- the mesh decoder may comprise a plurality of geometric convolutional layers and a plurality of upscaling layers, and wherein the plurality of upscaling layers are interlaced with the plurality of geometric convolutional layers.
- the objective function may comprise, for each negative anchor in the training image, a classification loss comparing a predicted classification for the negative anchor to a known classification for the negative anchor.
- the method may further comprise iterating the method using one or more further training images until a threshold condition is satisfied.
- the neural network may comprise: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
- the second plurality of convolutional layers may comprise one or more deformable convolutional layers.
- Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
- the neural network may further comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
- a method of face localisation comprising: identifying, using a neural network, one or more faces within the input image; and identifying, using the neural network, one or more corresponding face boxes within the input image, wherein the neural network has been trained using any of the training methods described herein.
- a computer-implemented method of face localisation comprising: inputting, to the neural network, an image, the image comprising one or more faces; processing, using the neural network, the image; outputting, from the neural network, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors.
- the neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
- Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
- the second plurality of convolutional layers may comprise one or more deformable convolutions.
- the neural network further may comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
- the neural network may have been trained using any of the training methods described herein.
- a system comprising: one or more processors; and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform one or more of the methods described herein.
- a computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform one or more of the methods described herein.
- facial localisation and/or face localisation are preferably used to connote one or more of: face detection; face alignment; pixel-wise face parsing; and/or 3D dense correspondence regression.
- Figure 1 shows a schematic overview of a method of facial localisation in images
- Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss
- Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images
- Figure 4 shows an example of a neural network structure for facial localisation in images
- Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images
- Figure 6 shows a further example of a neural network structure for facial localisation in images.
- Figure 7 shows an example of a computer system for performing facial localisation.
- a facial recognition neural network may be configured to output further feature vectors in addition to a face score (i.e. an indication of whether a face is present at a location or not) and a face box.
- the facial recognition neural network may output one or more facial landmark locations.
- the facial recognition neural network may output an embedding of the region within the face box. These outputs may be used in a multi-task loss function to train the facial recognition neural network.
- the facial classification/detection neural network may have a lower error rate.
- a lightweight backbone network is employed. This can allow the facial classification/detection neural network to be run in real-time on a single CPU core.
- Figure 1 shows a schematic overview of a method 100 of facial localisation in images.
- An image 102 comprising one or more faces 104 is input into a neural network 106.
- the neural network 106 processes the image through a plurality of neural network layers, and outputs one or more sets of output facial data 108.
- Each set of output facial data 108 comprises a predicted facial classification for a face 104 in the image 102, a predicted location of a corresponding face box, and one or more corresponding feature vectors that encode one or more properties of a corresponding face in the image 102.
- the input image 102, I, is an image comprising one or more faces.
- the input image 102 may comprise a set of pixel values in an array, for example a two-dimensional or three-dimensional array.
- for example, a colour image I ∈ R^(H×W×3), where H is the height of the image in pixels, W is the width of the image in pixels, and the image has three colour channels (e.g. RGB or CIELAB).
- the image 102 may, in some embodiments, be in black-and-white/greyscale.
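The following is a minimal illustrative sketch (not taken from the patent) of how an input image of the kind described above might be represented as an H×W×3 array and batched for a convolutional network. The stand-in image data, normalisation and library choices (NumPy/PyTorch) are assumptions.

```python
# Illustrative sketch: representing the input image 102 as an H x W x 3 array and
# preparing it as a batch for a convolutional network. The random image is a stand-in
# for a loaded RGB photograph; the [0, 1] scaling is an assumed preprocessing step.
import numpy as np
import torch

image = np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)  # stand-in RGB image
x = torch.from_numpy(image).float() / 255.0   # scale pixel values to [0, 1]
x = x.permute(2, 0, 1).unsqueeze(0)           # H x W x 3 -> 1 x 3 x H x W (batch of one)
print(x.shape)  # torch.Size([1, 3, 640, 640])
```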
- the neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters.
- the parameters of each node of the neural network may comprise one or more weights and/or biases.
- the nodes take as input one or more outputs of nodes in the previous layer.
- the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
- One or more of the layers of the neural network 106 may be convolutional layers.
- One or more of the layers of the neural network 106 may be deformable convolutional layers.
- the neural network 106 may comprise a residual network and a feature pyramid network.
- the neural network 106 may comprise lateral connections between layers of the residual network and layers of the feature pyramid network.
- the feature pyramid may process the outputs of the residual network in a top-down manner. Examples of neural network architectures using a residual network and a feature pyramid are described below in relation to Figures 4 to 6.
- the parameters of the neural network 106 may be trained using a multi-task objective function, L (also referred to herein as a loss function).
- the multi-task objective function may comprise two or more of: a facial classification loss; a face box regression loss; a facial landmark regression loss; and/or a dense face regression loss.
- the multi-task loss function may comprise a supervised loss and/or a self-supervised loss.
- the set of output facial data 108 comprises, for each potential face identified in the image 102 by the neural network, a predicted facial classification 108a for a face 104 in the image 102.
- the predicted classification 108a comprises a probability of an anchor in the image being a face.
- the set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, a predicted bounding box 108b for a face 104 in the image 102.
- the predicted bounding box 108b may, for example, be the coordinates of the corners of a bounding box 110 of the potential face.
- the set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, one or more feature vectors 108c encoding one or more properties of the face.
- a feature vector 108c may comprise the location of one or more facial landmarks, such as the eyes, nose, chin and/or mouth of a face.
- the feature vector 108c may alternatively or additionally comprise an embedding of the face contained within the bounding box 110.
- Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss.
- the method comprises training the neural network 106 on a set of training images 202 using a multi-task loss function 204. For each of a plurality of images 202 in a training dataset, a training image 202 is input into the neural network 106 and processed through a plurality of neural network layers.
- the neural network 106 outputs a set of predicted facial data 206 based on the input training image 202.
- the predicted facial data 206 is compared to known facial data for the corresponding training image 202 using a multi-task loss function 204, and the comparison is used to update parameters of the neural network 106.
- the training image 202 is taken from a training set.
- the training set comprises a plurality of images containing faces, and a set of known facial data for each image.
- the known facial data comprises the locations of each face in the corresponding image as well as the locations of the bounding boxes for each face in the corresponding image.
- the known facial data further comprises locations of one or more facial landmarks for each face in the corresponding image.
- Each training image may be associated with a plurality of training anchors.
- Each anchor represents a “prior” box where a face could potentially be in the image.
- Each anchor may be defined as a reference box comprising a central location and a scale.
- Anchors are used in facial detection to distinguish positive and negative training samples
- a positive anchor is defined as an anchor for which a face is present.
- a negative anchor is defined as an anchor for which a face is not present.
- An example of the use of anchors is provided in “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” (Shaoqing Ren et al., arXiv:1506.01497, the contents of which are incorporated herein by reference).
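As an illustration of the positive/negative anchor definitions above, the following sketch labels anchors by their overlap with ground-truth face boxes. The intersection-over-union criterion and the 0.5/0.3 thresholds are common practice but are assumptions here, not values taken from the text.

```python
# Illustrative sketch (assumption-laden): labelling anchors as positive or negative by their
# IoU overlap with ground-truth face boxes. Thresholds 0.5/0.3 are typical values, not
# specified in the text above. Assumes at least one ground-truth box is present.
import numpy as np

def iou(anchor, gt):
    """IoU between one anchor and one ground-truth box, both as (x1, y1, x2, y2)."""
    x1, y1 = max(anchor[0], gt[0]), max(anchor[1], gt[1])
    x2, y2 = min(anchor[2], gt[2]), min(anchor[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter + 1e-9)

def label_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """Return 1 for positive anchors, 0 for negative anchors, -1 for ignored anchors."""
    labels = np.full(len(anchors), -1, dtype=np.int64)
    best = np.array([max(iou(a, g) for g in gt_boxes) for a in anchors])
    labels[best >= pos_thr] = 1
    labels[best < neg_thr] = 0
    return labels
```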
- the predicted facial data 206 comprises a facial classification 210.
- the predicted facial classification 210 comprises a probability, p_i, of an associated anchor, i, being a face.
- the predicted facial classification 210 may, in some embodiments, have two components, corresponding to the anchor containing a face and not containing a face.
- the predicted facial classification 210 for an anchor is associated with a classification loss, L_cls(p_i, p_i*), where p_i* is the known facial classification for anchor i (i.e. 1 for a positive anchor and 0 for a negative anchor).
- the classification loss compares a predicted classification for the anchor to a known classification for the anchor.
- the classification loss may, for example, be a softmax loss for binary cases, such as a cross entropy loss or binary hinge loss. However, the skilled person will appreciate that other loss functions may alternatively be used.
- the classification loss may, in some embodiments, be applied to both positive and negative anchors.
- the predicted facial data 206 comprises a predicted facial box location 212, t_i.
- the predicted facial box location 212 comprises data identifying a predicted location of a bounding box for the corresponding face for an anchor, i.
- the predicted facial box location 212 may be, for example, coordinates of the corners of the predicted bounding box for the face, such as all four corners of the box, or coordinates of two diagonally opposite corners of the box.
- the predicted facial box location 212 may alternatively be the coordinates of the centre of the bounding box and the height and width of the box. Other examples are possible.
- the predicted facial box location 212 for a positive anchor is associated with a box regression loss, L_box(t_i, t_i*), where t_i* is the known facial box location for positive anchor i.
- the box regression loss compares the predicted facial box location 212, t_i, to the known location of the corresponding face box, t_i*.
- the box regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used.
- the box regression targets may be normalised to a centre location and width and height of the box.
- the predicted facial data also comprises one or more predicted feature vectors that each encode one or more pixel-based properties of the area within the facial box.
- the feature vectors are each associated with a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of the face associated with the positive anchor.
- the facial landmark data 214 comprises the predicted coordinates/locations of one or more facial landmarks within the predicted facial box. For example, five facial landmarks corresponding to the eyes, the corners of the mouth and the nose may be used.
- the facial landmark data 214 is associated with a facial landmark regression loss, L_landmark(l_i, l_i*), where l_i* are the known facial landmark locations for positive anchor i.
- the facial landmark regression loss compares the predicted facial landmark locations 214, l_i, to the known locations of the corresponding facial landmarks, l_i*.
- the facial landmark regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used.
- the facial landmark location targets may be normalised to a centre location.
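The following is an illustrative sketch of a smooth-L1 ("robust") regression loss of the kind mentioned above, which could be applied to both the face-box targets t_i and the landmark targets l_i. It is a generic implementation, not the patent's exact formulation; the beta parameter and the example tensor shapes are assumptions.

```python
# Illustrative sketch of a smooth-L1 regression loss for box or landmark targets.
import torch

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss, averaged over all target coordinates."""
    diff = torch.abs(pred - target)
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

# Example: 5 facial landmarks, each with (x, y) coordinates, for 8 positive anchors.
pred_landmarks = torch.randn(8, 10)
true_landmarks = torch.randn(8, 10)
print(smooth_l1(pred_landmarks, true_landmarks))
```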
- the feature vector may comprise an embedding 216, P_ST, of the potential face within the bounding box.
- the embedding 216 may comprise shape and texture parameters of the face within the corresponding bounding box.
- camera parameters 218 and/or lighting parameters 220 may also be output by the neural network 106.
- Camera parameters 218 may comprise one or more of: a camera location; a camera pose; and/or a focal length of the camera.
- Lighting (or illumination) parameters may comprise one or more of: a light source location; colour values of the light source; and/or ambient light colours.
- the camera parameters 218 and/or lighting parameters 220 may be referred to as rendering parameters.
- the embedding 216 may be associated with a dense regression loss function, L_pixel.
- the dense regression loss does not directly compare the embedding 216 to a known embedding. Instead, the embedding is post-processed to generate a predicted facial image 222, which is then compared to the potential facial image (i.e. the image in the bounding box) using the regression loss function. This branch of the process may therefore be considered to be self-supervised.
- the embedding may be processed using a mesh decoder neural network 224 to generate a three-dimensional representation of the potential face, such as a three-dimensional mesh.
- the mesh decoder neural network 224 comprises a plurality of layers of nodes, each node associated with one or more parameters.
- the parameters of each node of the neural network may comprise one or more weights and/or biases.
- the nodes take as input one or more outputs of nodes in the previous layer.
- the one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network.
- the mesh decoder neural network 224 comprises one or more geometric convolutional layers (also referred to herein as mesh convolutions).
- a geometric convolutional layer is a type of convolutional filter layer of the neural network used in geometric deep learning that can be applied directly in the mesh domain.
- the mesh decoder may further comprise one or more up-scaling layers configured to upscale the mesh (i.e. increase the number of vertices in the mesh).
- the up-scaling layers may be interlaced with the geometric convolutional layers.
- the geometric convolutional layers may alternate with the up-scaling layers.
- the mesh decoder 224 is configured to take as input the embedding parameters 216 and to process the input using a series of neural network layers to output a shape map and a texture map of the two dimensional potential face from the training image.
- the output of the mesh decoder neural network may be symbolically represented as D_PST.
- the shape and texture map comprises three dimensional shape and texture information relating to the input visual data.
- the output is a set of parameters describing (x, y, z) coordinates of points in a mesh and corresponding (r, g, b) values for each point in the mesh.
- the output may, in some embodiments, be a coloured mesh within a unit sphere.
- the mesh decoder may be pre-trained, and have its parameters fixed during training of the neural network 106.
- An example of a mesh decoder neural network and how a mesh decoder neural network may be trained is described in relation to co-pending GB Patent Application No. 902524.6 (the contents of which are incorporated herein by reference).
- a mesh may be defined in terms of an undirected connected graph G = (V, E), where V ∈ R^(n×6) is a set of n vertices containing joint shape (e.g. (x, y, z)) and texture (e.g. (r, g, b)) information, and E ∈ {0,1}^(n×n) is an adjacency matrix defining the connection status between vertices.
- An example of a convolutional operator on a graph/mesh can be defined by formulating mesh filtering with a kernel g_θ using a recursive polynomial, such as a Chebyshev polynomial.
- a normalised graph Laplacian may then be defined as L = I_n − D^(−1/2) E D^(−1/2), where I_n is the identity matrix and D is the diagonal degree matrix with D_ii = Σ_j E_ij.
- the filter g_θ can be parameterised as a truncated Chebyshev polynomial expansion of order K as g_θ(L̃) = Σ_{k=0}^{K−1} θ_k T_k(L̃), where θ ∈ R^K is a vector of Chebyshev coefficients, L̃ = 2L/λ_max − I_n is the scaled Laplacian, and T_k(L̃) is the Chebyshev polynomial of order k evaluated at L̃.
- a spectral convolution can then be defined as y_j = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L̃) x_i, where x ∈ R^(n×F_in) is the input and y ∈ R^(n×F_out) is the output.
- Such a filtering operation is very efficient, costing only O(K|E|) operations, since it can be computed with K sparse matrix-vector multiplications.
- Such an operation may be referred to as a geometric/mesh convolution or a graph convolution.
- other sets of orthogonal polynomials may be used in place of the Chebyshev polynomials.
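The following is a minimal sketch of a Chebyshev spectral graph convolution of the kind described above, operating on per-vertex features of a mesh. It is an illustration rather than the patent's implementation: the learnable weights are held as a (K, F_in, F_out) tensor, dense matrices are used instead of sparse ones, and the scaled Laplacian is approximated by assuming λ_max ≈ 2.

```python
# Illustrative sketch of a Chebyshev graph convolution over mesh vertex features.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, f_in, f_out, K):
        super().__init__()
        self.K = K
        self.weight = nn.Parameter(torch.randn(K, f_in, f_out) * 0.01)

    def forward(self, x, adj):
        # x: (n, f_in) vertex features; adj: (n, n) float adjacency matrix E.
        n = adj.shape[0]
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-9).pow(-0.5))
        lap = torch.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalised Laplacian L
        lap_scaled = lap - torch.eye(n)                      # approx. 2L/lambda_max - I with lambda_max ~ 2
        t_prev, t_curr = torch.eye(n), lap_scaled
        out = x @ self.weight[0]                             # k = 0 term: T_0(L~) = I
        if self.K > 1:
            out = out + (lap_scaled @ x) @ self.weight[1]    # k = 1 term: T_1(L~) = L~
        for k in range(2, self.K):
            t_next = 2 * lap_scaled @ t_curr - t_prev        # Chebyshev recurrence
            out = out + (t_next @ x) @ self.weight[k]
            t_prev, t_curr = t_curr, t_next
        return out
```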
- the joint shape and texture map undergoes further processing using a differentiable renderer 226 to generate a 2D rendered representation 222 of the potential face within the bounding box.
- the representation may comprise a two-dimensional projection of the joint shape and texture map onto an image plane.
- the sets of rendering parameters 218, 220 may be used to render the representation of the two-dimensional visual data, for example to define the parameters of a camera model used in the rendering, i.e. the 2D rendered representation 222 is generated by rendering the coloured mesh D_PST using the camera parameters 218 and lighting parameters 220.
- the differentiable renderer 226 may, for example, be a 3D mesh renderer that projects the coloured mesh D_PST onto a 2D image plane.
- An example of such a differentiable renderer 226 may be found in Genova et al. (CVPR, 2018), though other renderers may alternatively be used.
- the dense regression loss, L_pixel, compares the rendered 2D face 222, R, to the original facial image in the training image 202 (i.e. the ground truth image within the predicted facial box). A pixel-wise difference may be used to perform the comparison.
- An example of the dense regression loss is L_pixel = (1/(W·H)) Σ_{i=1}^{W} Σ_{j=1}^{H} ||R_{i,j} − I*_{i,j}||, where I* is the anchor crop of the ground-truth training image and W and H are the width and height of the anchor crop respectively.
- An L-n norm, such as an L-1 norm, may be used for the pixel-wise difference.
- the multi-task objective function 204 comprises a combination of the classification loss, the box regression loss and the feature loss.
- the aim of the training is, for each training anchor i, to minimise the multi-task objective function.
- An example of a multi-task loss function for an anchor i is L_i = L_cls(p_i, p_i*) + λ1 p_i* L_box(t_i, t_i*) + λ2 p_i* L_landmark(l_i, l_i*) + λ3 p_i* L_pixel, where λ1, λ2 and λ3 are loss-balancing weights and the factors of p_i* ensure that the box, landmark and dense regression losses only contribute for positive anchors.
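The following sketch illustrates how the four loss terms above might be assembled over a batch of anchors. It is not the patent's reference implementation: the weighting values are placeholders, the individual loss functions are swapped in as common stand-ins (cross-entropy, smooth-L1, a precomputed per-anchor pixel loss), and the tensor layout is assumed.

```python
# Illustrative sketch of assembling the multi-task objective from its components.
import torch
import torch.nn.functional as F

def multi_task_loss(cls_logits, cls_target, box_pred, box_target,
                    lmk_pred, lmk_target, pixel_loss, positive,
                    lam1=1.0, lam2=1.0, lam3=1.0):  # weighting hyperparameters (values to be tuned)
    # Classification loss is applied to all anchors (positive and negative).
    l_cls = F.cross_entropy(cls_logits, cls_target)
    # Box, landmark and dense-regression losses contribute only for positive anchors (p_i* = 1).
    if positive.any():
        l_box = F.smooth_l1_loss(box_pred[positive], box_target[positive])
        l_lmk = F.smooth_l1_loss(lmk_pred[positive], lmk_target[positive])
        l_pix = pixel_loss[positive].mean()
    else:
        l_box = l_lmk = l_pix = torch.tensor(0.0)
    return l_cls + lam1 * l_box + lam2 * l_lmk + lam3 * l_pix
```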
- the multi-task objective function 204 is used to update the parameters of the neural network 106 during training.
- the parameters of the neural network 106 may be updated using an optimisation procedure that aims to optimise the multi-task loss function 208.
- the optimisation procedure may, for example, be a gradient descent algorithm.
- the optimisation procedure may use backpropagation, e.g. by back-propagating the multi-task loss function 204 to the parameters of the neural network 106.
- the optimisation procedure may be associated with a learning rate which varies during training.
- the learning rate may start from an initial value, such as 10^(-3), and then increase after a first threshold number of epochs, for example to 10^(-2) after 5 epochs.
- the learning rate may then decay after further threshold epochs, for example reducing by a pre-defined factor at pre-defined epoch numbers.
- the learning rate may decay by a factor of ten at 55 and 68 epochs.
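A minimal sketch of the learning-rate schedule described above is given below: warm up from 10^(-3) to 10^(-2) after 5 epochs, then decay by a factor of ten at epochs 55 and 68. Whether the warm-up is a single step or gradual is not stated, so a simple step schedule is assumed.

```python
# Illustrative sketch of the epoch-dependent learning-rate schedule described above.
def learning_rate(epoch):
    if epoch < 5:
        return 1e-3          # initial value during warm-up
    lr = 1e-2                # increased rate after 5 epochs
    if epoch >= 55:
        lr /= 10.0           # first decay
    if epoch >= 68:
        lr /= 10.0           # second decay
    return lr

for e in (0, 5, 55, 68, 79):
    print(e, learning_rate(e))
```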
- Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images. The method may correspond to the method described above in relation to Figure 2.
- a training image is input to the neural network, the training image comprising one or more faces.
- the training image may be selected from a training dataset comprising a plurality of images, each image comprising one or more faces at known locations and with known bounding boxes.
- One or more features of the faces, such as the locations of facial landmarks, may also be known.
- the input training image is processed using the neural network.
- the neural network outputs, for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors.
- Operations 3.1 to 3.3 may be iterated over a subset of the training examples in the training data to form an ensemble of training examples. This ensemble may be used to determine the objective function in operation 3.4, for example by averaging the multi-task objective function over the ensemble. Alternatively, operation 3.4 may be performed for each training example separately.
- parameters of the neural network are updated in dependence on an objective function.
- the objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
- the update may be performed after each iteration of operations 3.1 to 3.3. Alternatively, the update may be performed after a number of iterations of operations 3.1 to 3.3.
- the multi-task loss function may comprise one or more expectation values taken over the ensemble of training examples.
- Operations 3.1 to 3.4 may be iterated over the training dataset until a threshold condition is met.
- the threshold condition may be a threshold number of training epochs.
- the threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80.
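The following sketch ties operations 3.1 to 3.4 together as a training loop iterated until the epoch threshold is reached. The names `model`, `train_loader`, `multi_task_loss` and `learning_rate` are placeholders for objects defined elsewhere, and the batched loss signature and SGD settings are assumptions, not details taken from the text.

```python
# Illustrative sketch of the training flow of Figure 3 (operations 3.1-3.4).
import torch

def train(model, train_loader, multi_task_loss, learning_rate, num_epochs=80):
    optimiser = torch.optim.SGD(model.parameters(), lr=learning_rate(0), momentum=0.9)
    for epoch in range(num_epochs):                    # threshold condition: number of epochs
        for group in optimiser.param_groups:
            group["lr"] = learning_rate(epoch)         # apply the epoch-dependent schedule
        for images, targets in train_loader:           # 3.1: input training images
            outputs = model(images)                    # 3.2/3.3: process and output predictions
            loss = multi_task_loss(outputs, targets)   # 3.4: evaluate the objective function
            optimiser.zero_grad()
            loss.backward()                            # back-propagate the multi-task loss
            optimiser.step()                           # update the network parameters
```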
- Figure 4 shows an example of a neural network structure for facial localisation in images.
- the neural network 106 may, in some embodiments, use this structure.
- the neural network 400 comprises a backbone network 402
- the backbone network 402 may comprise a plurality of convolutional layers.
- the backbone network 402 may further comprise a plurality of down-sampling layers that reduce the dimensionality of an input to the layer.
- the backbone network may be initialised as a pre-trained classification network, such as a ResNet network (e.g. ResNet-152).
- the structure of the backbone network 402 may, for example, be configured for an input image 404 size of 640x640 colour pixels.
- the backbone network 402 is configured to process the input image 404 through a series of convolutional layers (in this example labelled C2-C6) in a “bottom up” manner (i.e. from a low indexed layer to a high indexed layer) and generate an output for each convolutional layer. Although six layers are shown in this example, fewer or greater numbers of layers may be used.
- one or more layers of the pre-trained backbone network, such as the first two layers, may be fixed to improve accuracy.
- one or more of the convolutional layers may be fixed to their initial, pre-trained values.
- the first two layers may be fixed. This can achieve higher accuracy of the neural network once trained.
- the neural network 400 further comprises a feature pyramid network 406.
- the feature pyramid network 406 comprises a plurality of feature pyramid levels (in this example labelled P2-P6, with P6 acting directly on/being shared with the final layer of the backbone network 402).
- the feature pyramid levels may comprise convolutional layers.
- the convolutional layers may be deformable convolutional layers.
- the convolutional layers may be interlaced with one or more upscaling layers.
- the feature pyramid 406 is configured to process the outputs of a corresponding layer in the backbone network 402 in a “top-down” manner (i.e. from a high indexed layer to a low indexed layer) in order to generate a plurality of feature maps.
- each layer of the pyramid is associated with an output dimension, a set of anchor scales and a convolution stride.
- for example, the P2 layer is associated with the anchor scales 16x16, 20.16x20.16 and 25.4x25.4 and performs convolutions with stride 4.
- the lower levels of the feature pyramid may be associated with smaller scales than the higher levels and/or have a smaller stride.
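The following sketch illustrates tiling square anchors over a feature-pyramid level, using the example figures given above for P2 (stride 4 with scales 16, 20.16 and 25.4). The 160x160 feature-map size assumes a 640x640 input, and the centred (x1, y1, x2, y2) anchor layout is an assumption.

```python
# Illustrative sketch of tiling anchors across a feature-pyramid level.
import numpy as np

def tile_anchors(feature_size, stride, scales):
    anchors = []
    for row in range(feature_size):
        for col in range(feature_size):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride   # anchor centre in image pixels
            for s in scales:
                anchors.append([cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2])
    return np.array(anchors)

p2_anchors = tile_anchors(feature_size=160, stride=4, scales=(16.0, 20.16, 25.4))
print(p2_anchors.shape)  # (76800, 4): 160*160 locations x 3 scales
```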
- the neural network 400 comprises a plurality of lateral connections 408 between the output of layers of the backbone network 402 and the layers of the feature pyramid 406.
- the lateral connections 408 may, in some embodiments, process the output of the associated backbone network 402 layer and then input it to the corresponding feature pyramid layer, along with the output of a higher feature pyramid layer.
- each feature pyramid layer after the highest layers receives as input a combination of the output of the corresponding backbone network layer and the output of the previous higher layer of the feature pyramid, e.g. layer P3 of the feature pyramid receives as input a combination of the output of layer C3 of the backbone network and the output of layer P4 of the feature pyramid.
- the higher layers of the feature pyramid (in this example layers P6 and P5) receive their input from the backbone network without combining it with the output of a previous layer.
- Each layer of the feature pyramid outputs a set of features.
- the features may be further processed to generate the output data 108 and/or the multi-task loss 208.
- the number of features in each layer may be fixed to be the same. For example, each layer may be fixed to have 256 features.
- the use of lateral connections allows the “bottom up” pathways and “top down” pathways to be combined. In general, top-down pathways may be semantically strong but spatially coarse, while the bottom-up pathways may be spatially precise but semantically weak. By combining the pathways using lateral connections, the resulting feature maps can be both semantically strong and spatially precise.
- the backbone network 402 may, in some embodiments, be a MobileNet network, such as MobileNet-0.25. This provides a more lightweight model that can run on a single CPU in substantially real time. These lightweight models may use a 7x7 convolution with stride 4 on the input image to reduce the data size. Dense anchors may be tiled on P4-P5 to improve performance. In some embodiments, the deformable convolutional layers are replaced with convolutional layers to speed up the processing. During training, one or more layers of the pre-trained backbone network, such as the first two layers, may be fixed to improve accuracy.
- Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images.
- the lateral connection is for the P3 level of the feature pyramid, though it may mutatis mutandis be applied to any of the lateral connections between layers.
- the lateral connection 500 provides the output of a layer 502 of the backbone network 402 (in this example the C3 layer of the backbone network) to a corresponding layer of the feature pyramid network 406 (in this example the P3 layer of the feature pyramid).
- the lateral connection 500 may be configured to process the output of the layer 502 of the backbone network 402 to generate processed output 504.
- the processing may, for example be to alter the dimensionality of the output of the layer 502 of the backbone network 402 so that it can be used as input to the corresponding feature pyramid layer.
- a 1x1 convolution is used to reduce the dimensionality of the output of the layer 502 of the backbone network from 80x80x512 to 80x80x256.
- Each lateral connection may apply a different convolution to reduce or increase the dimensionality of the associated output layer to match the corresponding feature pyramid layer.
- the output from the immediately higher layer 506 of the feature pyramid (in this example P4) undergoes upsampling to increase its dimensionality.
- the output of the P4 layer of the pyramid network is up-scaled by a factor of two from 40x40x256 to 80x80x256.
- the upscaling may, for example, be performed by nearest neighbour up-sampling.
- the up-scaled output is then added to the processed output 504 from the lateral connection 500 in order to generate a merged map 508. Element wise addition may be used to merge the up-scaled output with the processed output 504.
- the merged map 508 is used as input to the layer of the feature pyramid network 406 corresponding to the lateral connection 500 (in this example the P3 layer).
- the P3 layer applies a 3x3 deformable convolution to the merged map 508 in order to generate an output feature map 510.
- the use of a deformable convolution can reduce the aliasing effects of upsampling, as well as perform non-rigid context modelling.
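The following is an illustrative sketch of the lateral connection of Figure 5: a 1x1 convolution reduces the C3 output to 256 channels, the P4 output is upsampled by nearest-neighbour interpolation, the two are merged by element-wise addition, and a 3x3 convolution produces the P3 feature map. A plain convolution stands in for the 3x3 deformable convolution, which requires a specialised operator not shown here.

```python
# Illustrative sketch of the lateral connection of Figure 5 (C3 + upsampled P4 -> P3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateralConnection(nn.Module):
    def __init__(self, in_channels=512, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)   # 80x80x512 -> 80x80x256
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c3, p4):
        lateral = self.reduce(c3)                                      # processed output 504
        upsampled = F.interpolate(p4, scale_factor=2, mode="nearest")  # 40x40x256 -> 80x80x256
        merged = lateral + upsampled                                   # merged map 508 (element-wise addition)
        return self.smooth(merged)                                     # output feature map 510

p3 = LateralConnection()(torch.randn(1, 512, 80, 80), torch.randn(1, 256, 40, 40))
print(p3.shape)  # torch.Size([1, 256, 80, 80])
```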
- Figure 6 shows a further example of a neural network structure for facial localisation in images.
- the neural network 600 of this example is substantially the same as that of Figure 4, but with the addition of a context module 602 for each of the feature pyramid layers.
- the context module 602 may increase the receptive field and enhance the rigid context modelling power of the neural network 600.
- Each context module 602 takes as input the feature map 604 from a corresponding feature pyramid layer.
- a sequence of deformable convolutions is performed using the feature map 604.
- Each deformable convolution after a first deformable convolution is applied to the output 606a-c of the previous deformable convolution.
- three 3x3 deformable convolutions are applied.
- fewer or greater numbers of deformable convolutions may be applied, and different sized deformable convolutions may be used.
- the outputs of each deformable convolution 606a-c are combined to form an enhanced output 608.
- the enhanced output 608 may be further processed to generate the output data 108 and/or the multi-task loss 208.
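The following sketch illustrates a context module as described above: three successive 3x3 convolutions applied to a pyramid feature map, with their outputs combined into an enhanced output. The text does not state how the outputs are combined, so channel-wise concatenation is an assumption here, and plain convolutions again stand in for deformable convolutions.

```python
# Illustrative sketch of a context module (Figure 6): three chained 3x3 convolutions
# whose outputs are concatenated into an enhanced feature map.
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels // 2, channels // 4, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(channels // 4, channels // 4, kernel_size=3, padding=1)

    def forward(self, feature_map):
        out_a = self.conv1(feature_map)                    # output 606a
        out_b = self.conv2(out_a)                          # output 606b
        out_c = self.conv3(out_b)                          # output 606c
        return torch.cat([out_a, out_b, out_c], dim=1)     # enhanced output 608

enhanced = ContextModule()(torch.randn(1, 256, 80, 80))
print(enhanced.shape)  # torch.Size([1, 256, 80, 80])
```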
- FIG. 7 shows a schematic example of a system/apparatus for performing any of the methods described herein.
- the system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
- the apparatus (or system) 700 comprises one or more processors 702.
- the one or more processors control operation of other components of the system/apparatus 700.
- the one or more processors 702 may, for example, comprise a general purpose processor.
- the one or more processors 702 may be a single core device or a multiple core device.
- the one or more processors 702 may comprise a central processing unit (CPU) or a graphical processing unit (GPU).
- the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or other bespoke processing hardware.
- the system/apparatus comprises a working or volatile memory 704.
- the one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory.
- the volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM), Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
- the system/apparatus comprises a non-volatile memory 706.
- the non-volatile memory 706 stores a set of operation instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions.
- the non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
- the one or more processors 702 are configured to execute operating instructions 708 to cause the system/apparatus to perform any of the methods described herein.
- the operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700.
- the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
- Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
- These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
- Any system feature as described herein may also be provided as a method feature, and vice versa.
- means plus function features may be expressed alternatively in terms of their corresponding structure.
- method aspects may be applied to system aspects, and vice versa.
Abstract
This specification relates to methods of facial localisation in images. According to a first aspect of this specification, there is described a computer implemented method of training a neural network for face localisation, the method comprising: inputting, to the neural network, a training image, the training image comprising one or more faces; processing, using the neural network, the training image; outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors; and updating parameters of the neural network in dependence on an objective function. The objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
Description
Facial Localisation in Images
Field
This specification relates to methods of facial localisation in images. In particular, this specification relates to the use and training of a neural network for facial localisation in images.
Background
Facial localisation in images is the task of determining the presence and location of faces within an image. It may further comprise identifying the locations of bounding boxes that surround each face determined to be present in the image. Automatic face localisation is a prerequisite step of facial image analysis in many applications, such as facial attribute (e.g. expression and/or age) and/or facial identity recognition.
Typically, the training of a neural network for facial localisation in images (a facial classification/detection neural network) comprises the use of only classification and box regression losses.
Facial detection differs from generic object detection in several respects. Facial detection features smaller ratio variations (for example, from 1:1 to 1:1.5) but much larger scale variations (for example, from several pixels to thousands of pixels). This can provide significant challenges when adapting generic object detectors for facial detection and localisation.
Summary
According to a first aspect of this specification, there is described a computer implemented method of training a neural network for face localisation, the method comprising: inputting, to the neural network, a training image, the training image comprising one or more faces; processing, using the neural network, the training image; outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors; and updating parameters of the neural network in dependence on an objective function. The objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor
to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor. The one or more feature vectors may comprise predicted locations of a plurality of facial landmarks of a face, and wherein the feature loss comprises a facial landmark regression loss comparing the predicted locations of the plurality of facial landmarks to known locations of the plurality of facial landmarks. The one or more feature vectors may comprise an encoded representation of a face within the predicted face box, and the method may further comprise: generating a three-dimensional representation of the face using a mesh decoder neural network, the mesh decoder neural network comprising one or more geometric convolutional layers; and generating a two-dimensional image of the face from the three-dimensional representation of the face using a differentiable renderer, wherein the feature loss comprises a dense regression loss comparing the generated two-dimensional image of the face to a ground truth image of the face within the predicted face box.
The feature vector may further comprise camera parameters and/or illumination parameters, and wherein the differentiable renderer uses the camera parameters and/or illumination parameters when generating the two-dimensional image of the face from the three-dimensional representation of the face.
The dense regression loss may be a pixel-wise difference between the generated two-dimensional image of the face and the ground truth image of the face within the predicted face box.
The mesh decoder may comprise a plurality of geometric convolutional layers and a plurality of upscaling layers, and wherein the plurality of upscaling layers are interlaced with the plurality of geometric convolutional layers.
The objective function may comprise, for each negative anchor in the training image, a classification loss comparing a predicted classification for the negative anchor to a known classification for the negative anchor.
The method may further comprise iterating the method using one or more further training images until a threshold condition is satisfied.
The neural network may comprise: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner. The second plurality of convolutional layers may comprise one or more deformable convolutional layers.
Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers. The neural network may further comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of
convolutional layers using one or more deformable convolutional networks.
According to a further aspect of this specification, there is described a method of face localisation, the method comprising: identifying, using a neural network, one or more faces within the input image; and identifying, using the neural network, one or more corresponding face boxes within the input image, wherein the neural network has been trained using any of the training methods described herein. According to a further aspect of this specification, there is described a computer-implemented method of face localisation, the method comprising: inputting, to the neural network, an image, the image comprising one or more faces; processing, using the neural network, the image; outputting, from the neural network, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors. The neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
Each lateral connection may be configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers. The second plurality of convolutional layers may comprise one or more deformable convolutions.
The neural network further may comprise one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks. The neural network may have been trained using any of the training methods described herein.
According to a further aspect of this specification, there is described a system comprising: one or more processors; and a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform one or more of the methods described herein.
According to a further aspect of this specification, there is described a computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform one or more of the methods described herein.
As used herein, the terms facial localisation and/or face localisation are preferably used to connote one or more of: face detection; face alignment; pixel-wise face parsing; and/or 3D dense correspondence regression.
Brief Description of the Drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, in which:
Figure 1 shows a schematic overview of a method of facial localisation in images;
Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss;
Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images;
Figure 4 shows an example of a neural network structure for facial localisation in images;
Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images;
Figure 6 shows a further example of a neural network structure for facial localisation in images; and
Figure 7 shows an example of a computer system for performing facial localisation.
Detailed Description
In the embodiments described herein, a facial recognition neural network may be configured to output further feature vectors in addition to a face score (i.e. an indication of whether a face is present at a location or not) and a face box. For example, the facial recognition neural network may output one or more facial landmark locations. Alternatively or additionally, the facial recognition neural network may output an embedding of the region within the face box. These outputs may be used in a multi-task loss function to train the facial recognition neural network.
Use of such a multi-task loss function can result in a higher performance of the facial classification/detection neural network when compared to other facial localisation methods. For example, the facial classification/detection neural network may have a lower error rate.
In some embodiments, a lightweight backbone network is employed. This can allow the facial classification/detection neural network to be run in real-time on a single CPU core.
Figure 1 shows a schematic overview of a method 100 of facial localisation in images.
An image 102 comprising one or more faces 104 is input into a neural network 106. The neural network 106 processes the image through a plurality of neural network layers, and outputs one or more sets of output facial data 108. Each set of output facial data 108 comprises a predicted facial classification for a face 106 in the image 102, a predicted location of a corresponding face box, and one or more corresponding feature vectors that encode one or more properties of a corresponding face in the image 102. The input image 102, 1, is an image comprising one or more faces. The input image 102 may comprise a set of pixel values in an array, for example a two-dimensional or three-
dimensional array. For example, in a colour image 1 RHxWx 3 H is the height of the image in pixels, W is the width of the image in pixels and the image has three colour channels (e.g. RGB or CIELAB). The image 102 may, in some embodiments, be in black-and-white/greyscale.
The neural network 106 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. One or more of the layers of the neural network 106 may be convolutional layers. One or more of the layers of the neural network 106 may be deformable convolutional layers.
In some embodiments, the neural network 106 may comprise a residual network and a feature pyramid network. The neural network 106 may comprise lateral connections between layers of the residual network and layers of the feature pyramid network. The feature pyramid may process the outputs of the residual network in a top-down manner. Examples of neural network architectures using a residual network and a feature pyramid are described below in relation to Figures 4 to 6.
The parameters of the neural network 106 may be trained using a multi-task objective function, $\mathcal{L}$ (also referred to herein as a loss function). The multi-task objective function may comprise two or more of: a facial classification loss; a face box regression loss; a facial landmark regression loss; and/or a dense face regression loss. The multi-task loss function may comprise a supervised loss and/or a self-supervised loss.
Examples of methods of training the neural network are described below in relation to Figures 2 and 3.
The set of output facial data 108 comprises, for each potential face identified in the image 102 by the neural network, a predicted facial classification 108a for a face 104 in the image 102. The predicted classification 108a comprises a probability of an anchor in the image being a face.

The set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, a predicted bounding box 108b for a face 104 in the image 102. The predicted bounding box 108b may, for example, be the coordinates of the corners of a bounding box 110 of the potential face.

The set of output facial data 108 further comprises, for each potential face identified in the image 102 by the neural network, one or more feature vectors 108c encoding one or more properties of the face. For example, a feature vector 108c may comprise the location of one or more facial landmarks, such as the eyes, nose, chin and/or mouth of a face. The feature vector 108c may alternatively or additionally comprise an embedding of the face contained within the bounding box 110.
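By way of illustration only, a set of output facial data 108 could be represented in code as a simple record. The sketch below assumes a Python dataclass representation; the class and field names are illustrative placeholders and do not form part of the described method.

```python
# Illustrative sketch (not part of the specification): one possible container
# for a set of output facial data 108.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FacialData:
    face_score: float                             # probability that the anchor contains a face
    face_box: Tuple[float, float, float, float]   # e.g. (x_min, y_min, x_max, y_max)
    landmarks: List[Tuple[float, float]]          # e.g. five (x, y) facial landmark locations
    embedding: List[float]                        # optional shape/texture embedding of the face
```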
Figure 2 shows a schematic overview of a method of training a neural network for facial localisation in images using a multi-task loss. The method comprises training the neural network 106 on a set of training images 202 using a multi-task loss function 204. For each of a plurality of images 202 in a training dataset, a training image 202 is input into the neural network 106 and processed through a plurality of neural network layers. The neural network 106 outputs a set of predicted facial data 206 based on the input training image 202. The predicted facial data 206 is compared to known facial data for the corresponding training image 202 using a multi-task loss function 204, and the comparison is used to update parameters of the neural network 106.
The training image 202 is taken from a training set. The training set comprises a plurality of images containing faces, and a set of known facial data for each image. The known facial data comprises the locations of each face in the corresponding image as well as the locations of the bounding boxes for each face in the corresponding image. In some embodiments, the known facial data further comprises locations of one or more facial landmarks for each face in the corresponding image. Each training image may be associated with a plurality of training anchors. Each anchor represents a "prior" box where a face could potentially be in the image. Each anchor may be defined as a reference box comprising a central location and a scale.
However, it will be appreciated that the anchors may be defined in other ways. Anchors are used in facial detection to distinguish positive and negative training samples. A positive anchor is defined as an anchor for which a face is present. A negative anchor is defined as an anchor for which a face is not present. An example of the use of anchors is provided in "Faster R-CNN: Towards Real-Time Object Detection with Regional Proposal Networks" (Shaoqing Ren et al., arXiv:1506.01497, the contents of which are incorporated herein by reference).

The predicted facial data 206 comprises a facial classification 210. The predicted facial classification 210 comprises a probability, $p_i$, of an associated anchor, $i$, being a face. The predicted facial classification 210 may, in some embodiments, have two components corresponding to "face present" and "face not present". The predicted facial classification 210 for an anchor is associated with a classification loss, $\mathcal{L}_{cls}(p_i, p_i^*)$, where $p_i^*$ is the known facial classification for anchor $i$ (i.e. 1 for a positive anchor and 0 for a negative anchor). The classification loss compares a predicted classification for the anchor to a known classification for the anchor. The classification loss may, for example, be a softmax loss for binary cases, such as a cross entropy loss or binary hinge loss. However, the skilled person will appreciate that other loss functions may alternatively be used. The classification loss may, in some embodiments, only be used for positive anchors. In other embodiments, it may be used for both positive and negative anchors.

The predicted facial data 206 comprises a predicted facial box location 212, $t_i$. The predicted facial box location 212 comprises data identifying a predicted location of a bounding box for the corresponding face for an anchor, $i$. The predicted facial box location 212 may be, for example, coordinates of the corners of the predicted bounding box for the face, such as all four corners of the box, or coordinates of two diagonally opposite corners of the box. The predicted facial box location 212 may alternatively be the coordinates of the centre of the bounding box and the height and width of the box. Other examples are possible.

The predicted facial box location 212 for a positive anchor is associated with a box regression loss, $\mathcal{L}_{box}(t_i, t_i^*)$, where $t_i^*$ is the known facial box location for positive anchor $i$. The box regression loss compares the predicted facial box location 212, $t_i$, to the known location of the corresponding face box, $t_i^*$. The box regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used. In some embodiments, the box regression targets may be normalised to a centre location and width and height of the box.
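By way of illustration only, the classification loss and box regression loss described above might be computed as in the sketch below, which assumes a PyTorch-style implementation with a cross-entropy classification loss and a smooth-L1 box loss restricted to positive anchors. The function names and tensor layouts are illustrative assumptions rather than part of the described method.

```python
# Illustrative sketch (not part of the specification): per-anchor classification
# and box regression losses. Box targets are assumed to already be encoded
# relative to the anchors.
import torch
import torch.nn.functional as F


def classification_loss(cls_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # cls_logits: (num_anchors, 2) "face" / "not face" scores; labels: (num_anchors,) long.
    return F.cross_entropy(cls_logits, labels)


def box_regression_loss(box_pred: torch.Tensor, box_target: torch.Tensor,
                        positive_mask: torch.Tensor) -> torch.Tensor:
    # Smooth-L1 (robust) loss, computed only over positive anchors.
    pos = positive_mask.bool()
    if pos.sum() == 0:
        return box_pred.sum() * 0.0  # no positive anchors: zero loss, keeps the graph intact
    return F.smooth_l1_loss(box_pred[pos], box_target[pos])
```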
The predicted facial data also comprises one or more predicted feature vectors that each encode one or more pixel-based properties of the area within the facial box. The feature vectors are each associated with a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of the face associated with the positive anchor.
An example of such a feature vector is facial landmark data 214. The facial landmark data 214 comprises the predicted coordinates/locations of one or more facial landmarks within the predicted facial box. For example, five facial landmarks corresponding to the eyes, the corners of the mouth and the nose may be used.
The facial landmark data 214 is associated with a facial landmark regression loss, $\mathcal{L}_{pts}(l_i, l_i^*)$, where $l_i^*$ are the known facial landmark locations for positive anchor $i$. The facial landmark regression loss compares the predicted facial landmark locations 214, $l_i$, to the known locations of the corresponding facial landmarks, $l_i^*$. The facial landmark regression loss may, for example, be a robust loss function, such as a smooth-L1 loss function. However, the skilled person will appreciate that other loss functions may alternatively be used. In some embodiments, the facial landmark location targets may be normalised to a centre location.
Alternatively or additionally, the feature vector may comprise an embedding 216, $P_{ST}$, of the potential face within the bounding box. The embedding 216 may comprise shape and texture parameters of the face within the corresponding bounding box. In some embodiments, camera parameters 218 and/or lighting parameters 220 may also be output by the neural network 106. Camera parameters 218 may comprise one or more of: a camera location; a camera pose; and/or a focal length of the camera. Lighting (or illumination) parameters may comprise one or more of: a light source location; colour values of the light source; and/or ambient light colours. The camera parameters 218 and/or lighting parameters 220 may be referred to as rendering parameters.
The embedding 216 may be associated with a dense regression loss function, $\mathcal{L}_{pixel}$. The dense regression loss does not directly compare the embedding 216 to a known embedding. Instead, the embedding is post-processed to generate a predicted facial image 222, which is then compared to the potential facial image (i.e. the image in the bounding box) using the regression loss function. This branch of the process may therefore be considered to be self-supervised.
To generate the predicted facial image 222, the embedding may be processed using a mesh decoder neural network 224 to generate a three-dimensional representation of the potential face, such as a three-dimensional mesh. The mesh decoder neural network 224 comprises a plurality of layers of nodes, each node associated with one or more parameters. The parameters of each node of the neural network may comprise one or more weights and/or biases. The nodes take as input one or more outputs of nodes in the previous layer. The one or more outputs of nodes in the previous layer are used by the node to generate an activation value using an activation function and the parameters of the neural network. The mesh decoder neural network 224 comprises one or more geometric convolutional layers (also referred to herein as mesh convolutional layers). A geometric convolutional layer is a type of convolutional filter layer of the neural network used in geometric deep learning that can be applied directly in the mesh domain.
The mesh decoder may further comprise one or more up-scaling layers configured to upscale the mesh (i.e. increase the number of vertices in the mesh). The up-scaling layers may be interlaced with the geometric convolutional layers. For example, the geometric convolutional layers may alternate with the up-scaling layers.
The mesh decoder 224 is configured to take as input the embedding parameters 216 and to process the input using a series of neural network layers to output a shape map and a texture map of the two-dimensional potential face from the training image. The output of the mesh decoder neural network may be symbolically represented as $\mathcal{D}_{P_{ST}}$. The shape and texture map comprises three-dimensional shape and texture information relating to the input visual data. In some embodiments, the output is a set of parameters describing $(x, y, z)$ coordinates of points in a mesh and corresponding $(r, g, b)$ values for each point in the mesh. The output may, in some embodiments, be a coloured mesh within a unit sphere.
The mesh decoder may be pre-trained, and have its parameters fixed during training of the neural network 106. An example of a mesh decoder neural network and how a mesh decoder neural network may be trained is described in relation to co-pending GB
Patent Application No. 902524.6 (the contents of which are incorporated herein by reference).
A mesh may be defined in terms of an undirected connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} \in \mathbb{R}^{n \times 6}$ is a set of $n$ vertices containing joint shape (e.g. $(x, y, z)$) and texture (e.g. $(r, g, b)$) information and $\mathcal{E} \in \{0,1\}^{n \times n}$ is an adjacency matrix defining the connection status between vertices. However, it will be appreciated that alternative representations of a mesh may be used.

An example of a convolutional operator on a graph/mesh can be defined by formulating mesh filtering with a kernel $g_\theta$ using a recursive polynomial, such as a Chebyshev polynomial. In defining such a convolutional operator on a mesh, it is useful to define some intermediate variables in the following way. A non-normalised graph Laplacian may be defined as $L = D - \mathcal{E} \in \mathbb{R}^{n \times n}$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{ii} = \sum_j \mathcal{E}_{ij}$. A normalised graph Laplacian may then be defined as $L = I_n - D^{-1/2} \mathcal{E} D^{-1/2}$, where $I_n$ is the identity matrix.

The Laplacian can be diagonalised by the Fourier bases $U = [u_0, \ldots, u_{n-1}] \in \mathbb{R}^{n \times n}$ such that $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}([\lambda_0, \ldots, \lambda_{n-1}])$. A graph Fourier representation of a given mesh signal $x$ can then be defined as $\hat{x} = U^T x$, with inverse $x = U \hat{x}$.

Given these definitions, an example of a convolutional operator, $g_\theta$, on a graph/mesh can be defined. The filter $g_\theta$ can be parameterised as a truncated Chebyshev polynomial expansion of order $K$ as

$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\hat{\Lambda}),$$

where $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\hat{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at a scaled Laplacian $\hat{\Lambda} = 2\Lambda/\lambda_{max} - I_n$. $T_k(x)$ can be recursively computed using $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$.

A spectral convolution can then be defined as

$$y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L)\, x_i \in \mathbb{R}^n,$$

where $x \in \mathbb{R}^{n \times F_{in}}$ is the input and $y \in \mathbb{R}^{n \times F_{out}}$ is the output. Such a filtering operation is very efficient, costing only $\mathcal{O}(K|\mathcal{E}|)$ operations.
Other types of geometric/mesh convolution (or graph convolutions) may alternatively or additionally be used. For example, other sets of orthogonal polynomials may be used in place of the Chebyshev polynomials.
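By way of illustration only, a Chebyshev spectral (graph) convolution of the kind described above might be implemented as in the sketch below. It assumes a PyTorch-style implementation and a precomputed scaled Laplacian $\hat{L} = 2L/\lambda_{max} - I_n$ supplied as a dense matrix; the class and parameter names are illustrative placeholders.

```python
# Illustrative sketch (not part of the specification): a Chebyshev graph
# convolution of order K applied to per-vertex mesh features.
import torch
import torch.nn as nn


class ChebConv(nn.Module):
    def __init__(self, in_features: int, out_features: int, K: int):
        super().__init__()
        # One weight matrix per Chebyshev order k = 0 .. K-1.
        self.weight = nn.Parameter(torch.randn(K, in_features, out_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, L_hat: torch.Tensor) -> torch.Tensor:
        # x: (n, F_in) vertex features; L_hat: (n, n) scaled Laplacian.
        K = self.weight.shape[0]
        Tx_prev, Tx = x, None                       # T_0(L_hat) x = x
        out = Tx_prev @ self.weight[0]
        if K > 1:
            Tx = L_hat @ x                          # T_1(L_hat) x = L_hat x
            out = out + Tx @ self.weight[1]
        for k in range(2, K):
            Tx_next = 2 * (L_hat @ Tx) - Tx_prev    # Chebyshev recursion
            out = out + Tx_next @ self.weight[k]
            Tx_prev, Tx = Tx, Tx_next
        return out + self.bias
```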
The joint shape and texture map undergoes further processing using a differentiable renderer 226 to generate a 2D rendered representation 222, $\mathcal{R}$, of the potential face within the bounding box. The representation may comprise a two-dimensional projection of the joint shape and texture map onto an image plane. The sets of rendering parameters 218, 220 may be used to render the representation of the two-dimensional visual data, for example to define the parameters of a camera model used in the rendering, i.e. the 2D rendered representation 222 is given by $\mathcal{R}(\mathcal{D}_{P_{ST}}, P_{cam}, P_{ill})$.

The differentiable renderer 226 may, for example, be a 3D mesh renderer that projects the coloured mesh $\mathcal{D}_{P_{ST}}$ onto a 2D image plane. An example of such a differentiable renderer 226 may be found in Genova et al. (CVPR, 2018), though other renderers may alternatively be used.

The dense regression loss, $\mathcal{L}_{pixel}$, compares the rendered 2D face 222, $\mathcal{R}$, to the original facial image in the training image 202 (i.e. the ground truth image within the predicted facial box). A pixel-wise difference may be used to perform the comparison. An example of the dense regression loss may be given by:

$$\mathcal{L}_{pixel} = \frac{1}{W \cdot H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left\| \mathcal{R}(\mathcal{D}_{P_{ST}}, P_{cam}, P_{ill})_{i,j} - I^*_{i,j} \right\|,$$

where $I^*$ is the anchor crop of the ground-truth training image and $W$ and $H$ are the width and height of the anchor crop respectively. $\|\cdot\|$ is an L-n norm, such as an L-1 norm. Other pixel-wise loss functions may alternatively be used.
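By way of illustration only, such a pixel-wise dense regression loss might be computed as in the sketch below, which assumes a PyTorch-style implementation with an L1 per-pixel difference; the function name and tensor layout are illustrative assumptions.

```python
# Illustrative sketch (not part of the specification): pixel-wise dense
# regression loss between a rendered face crop and the ground-truth anchor crop.
import torch


def dense_regression_loss(rendered: torch.Tensor, anchor_crop: torch.Tensor) -> torch.Tensor:
    # rendered, anchor_crop: (H, W, 3) tensors with matching spatial size.
    h, w = anchor_crop.shape[:2]
    # Mean per-pixel L1 difference between the rendered face and the crop.
    return (rendered - anchor_crop).abs().sum(dim=-1).sum() / (h * w)
```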
The multi-task objective function 204 comprises a combination of the classification loss, the box regression loss and the feature loss. The aim of the training is, for each training anchor $i$, to minimise the multi-task objective function. An example of a multi-task loss function for an anchor $i$ is:

$$\mathcal{L} = \mathcal{L}_{cls}(p_i, p_i^*) + \lambda_1 p_i^* \mathcal{L}_{box}(t_i, t_i^*) + \lambda_2 p_i^* \mathcal{L}_{pts}(l_i, l_i^*) + \lambda_3 p_i^* \mathcal{L}_{pixel},$$

where $\{\lambda_i\}$ are constants that control the relative significance of the different loss functions in the multi-task objective function. These parameters may, for example, be in the range (0,1]. For example, $\lambda_1 = 0.25$, $\lambda_2 = 0.1$ and $\lambda_3 = 0.01$.

The multi-task objective function 204 is used to update the parameters of the neural network 106 during training. The parameters of the neural network 106 may be updated using an optimisation procedure that aims to optimise the multi-task loss function 208. The optimisation procedure may, for example, be a gradient descent algorithm. The optimisation procedure may use backpropagation, e.g. by back-propagating the multi-task loss function 204 to the parameters of the neural network 106. The optimisation procedure may be associated with a learning rate which varies during training. For example, the learning rate may start from an initial value, such as $10^{-3}$, and then increase after a first threshold number of epochs, for example to $10^{-2}$ after 5 epochs. The learning rate may then decay after further threshold epochs, for example reducing by a pre-defined factor at pre-defined epoch numbers. For example, the learning rate may decay by a factor of ten at 55 and 68 epochs.
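By way of illustration only, the weighted combination of loss terms and the varying learning rate described above might be expressed as in the sketch below; the example weights and schedule values mirror those given in the text, and the function names are illustrative placeholders.

```python
# Illustrative sketch (not part of the specification): combining the per-anchor
# loss terms and computing a stepped learning-rate schedule.
lambda1, lambda2, lambda3 = 0.25, 0.1, 0.01  # example weights from the text


def total_loss(loss_cls, loss_box, loss_pts, loss_pixel):
    # Weighted sum of the classification, box, landmark and dense losses.
    return loss_cls + lambda1 * loss_box + lambda2 * loss_pts + lambda3 * loss_pixel


def learning_rate(epoch: int) -> float:
    lr = 1e-3 if epoch < 5 else 1e-2  # warm-up from 1e-3 to 1e-2 after 5 epochs
    if epoch >= 55:
        lr /= 10                      # decay by a factor of ten at epoch 55
    if epoch >= 68:
        lr /= 10                      # and again at epoch 68
    return lr
```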
Figure 3 shows a flow diagram of a method of training a neural network for facial localisation in images. The method may correspond to the method described above in relation to Figure 2.
At operation 3.1, a training image is input to the neural network, the training image comprising one or more faces. The training image may be selected from a training dataset comprising a plurality of images, each image comprising one or more faces at known locations and with known bounding boxes. One or more features of the faces, such as the locations of facial landmarks, may also be known.
At operation 3.2, the input training image is processed using the neural network. At operation 3.3, the neural network outputs, for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors.
Operations 3.1 to 3.3 may be iterated over a subset of the training examples in the training data to form an ensemble of training examples. This ensemble may be used to determine the objective function in operation 3.4, for example by averaging the multi-task objective function over the ensemble. Alternatively, operation 3.4 may be performed for each training example separately.
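By way of illustration only, operations 3.1 to 3.4 (operation 3.4 is described below) might be realised as a conventional training loop. The sketch assumes a PyTorch-style implementation; `model`, `train_loader` and `compute_losses` are hypothetical placeholders rather than components defined in this specification.

```python
# Illustrative sketch (not part of the specification): training loop for the
# method of Figure 3.
import torch


def train(model, train_loader, compute_losses, num_epochs: int = 80):
    optimiser = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for epoch in range(num_epochs):                  # threshold condition: epoch count
        for images, targets in train_loader:         # operation 3.1: input training images
            outputs = model(images)                  # operations 3.2-3.3: process and output
            loss = compute_losses(outputs, targets)  # operation 3.4: multi-task objective
            optimiser.zero_grad()
            loss.backward()                          # backpropagate the multi-task loss
            optimiser.step()                         # update the network parameters
```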
At operation 3.4, parameters of the neural network are updated in dependence on an objective function. The objective function comprises, for each positive anchor in the training image: a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor; a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor. The update may be performed after each iteration of operations 3.1 to 3.3. Alternatively, the update may be performed after a number of iterations of operations 3.1 to 3.3. In these embodiments, the multi-task loss function may comprise one or more expectation values taken over the ensemble of training examples. Operations 3.1 to 3.4 may be iterated over the training dataset until a threshold condition is met. The threshold condition may be a threshold number of training epochs. The threshold number of training epochs may lie in the range 50-150, such as 70-90, for example 80.

Figure 4 shows an example of a neural network structure for facial localisation in images. The neural network 106 may, in some embodiments, use this structure.
The neural network 400 comprises a backbone network 402. The backbone network 402 may comprise a plurality of convolutional layers. The backbone network 402 may further comprise a plurality of down-sampling layers that reduce the dimensionality of an input to the layer. The backbone network may be initialised as a pre-trained classification network, such as a ResNet network (e.g. ResNet-152). The structure of the backbone network 402 may, for example, be given for an input image 404 size of 640x640 colour pixels by a table in which the convolutional building blocks are shown in square brackets, with down-sampling performed by conv3_1, conv4_1 and conv5_1 with a stride of 2. The backbone network 402 is configured to process the input image 404 through a series of convolutional layers (in this example labelled C2-C6) in a "bottom-up" manner (i.e. from a low indexed layer to a high indexed layer) and generate an output for each convolutional layer. Although six layers are shown in this example, fewer or greater numbers of layers may be used.

During training, one or more of the convolutional layers of the pre-trained backbone network may be fixed to their initial, pre-trained values. For example, the first two layers may be fixed. This can achieve higher accuracy of the neural network once trained.
The neural network 400 further comprises a feature pyramid network 406. The feature pyramid network 406 comprises a plurality of feature pyramid levels (in this example labelled P2-P6, with P6 acting directly on/being shared with the final layer of the backbone network 402). The feature pyramid levels may comprise convolutional layers. The convolutional layers may be deformable convolutional layers. The convolutional layers may be interlaced with one or more upscaling layers. The feature pyramid 406 is configured to process the outputs of a corresponding layer in the backbone network 402 in a "top-down" manner (i.e. from a high indexed layer to a low indexed layer) in order to generate a plurality of feature maps.
An example of a feature pyramid structure is given by a table in which the dimension of the output of each layer is indicated in the first column, the "anchor" column indicates the scales associated with each layer of the pyramid, and the "stride" column indicates the stride of the convolutions performed by each layer. For example, the P2 layer is associated with the anchor scales 16x16, 20.16x20.16 and 25.4x25.4 and performs convolutions with stride 4. However, it will be appreciated that different scales and convolutions may alternatively be used. The lower levels of the feature pyramid may be associated with smaller scales than the higher levels and/or have a smaller stride.

The neural network 400 comprises a plurality of lateral connections 408 between the output of layers of the backbone network 402 and the layers of the feature pyramid 406. The lateral connections 408 may, in some embodiments, process the output of the associated backbone network 402 layer and then input it to the corresponding feature pyramid layer, along with the output of a higher feature pyramid layer. In other words, each feature pyramid layer after the highest layers receives as input a combination of the output of the corresponding backbone network layer and the output of the previous higher layer of the feature pyramid, e.g. layer P3 of the feature pyramid receives as input a combination of the output of layer C3 of the backbone network and the output of layer P4 of the feature pyramid. The higher layers of the feature pyramid (in this example layers P6 and P5) receive their input from the backbone network without combining it with the output of a previous layer.
Each layer of the feature pyramid outputs a set of features. The features may be further processed to generate the output data 108 and/or the multi-task loss 208. The number of features in each layer may be fixed to be the same. For example, each layer may be fixed to have 256 features.
The use of lateral connections allows the "bottom-up" pathways and "top-down" pathways to be combined. In general, top-down pathways may be semantically strong but spatially coarse, while the bottom-up pathways may be spatially fine but semantically weak. By combining the pathways using lateral connections, the performance of the neural network in facial localisation can be improved.
The backbone network 402 may, in some embodiments, be a MobileNet network, such as MobileNet-0.25. This provides a more lightweight model that can run on a single CPU in substantially real time. These lightweight models may use a 7x7 convolution with stride 4 on the input image to reduce the data size. Dense anchors may be tiled on P4-P5 to improve performance. In some embodiments, the deformable convolutional layers are replaced with convolutional layers to speed up the processing. During training, one or more layers of the pre-trained backbone network, such as the first two layers, may be fixed to improve accuracy.
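By way of illustration only, tiling anchors of several scales over a feature map of a given stride might be performed as in the sketch below, which assumes a NumPy-style implementation; the function name is an illustrative placeholder and the example scales are taken from the P2 example above.

```python
# Illustrative sketch (not part of the specification): tiling square anchors of
# several scales at every spatial position of a feature map.
import numpy as np


def tile_anchors(feature_h: int, feature_w: int, stride: int, scales) -> np.ndarray:
    """Return (feature_h * feature_w * len(scales), 4) anchors as (cx, cy, w, h)."""
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor centre in image pixels
            for s in scales:
                anchors.append((cx, cy, s, s))
    return np.array(anchors, dtype=np.float32)


# Example: anchors for the P2 level of a 640x640 input (stride 4, scales 16, 20.16, 25.4)
p2_anchors = tile_anchors(160, 160, stride=4, scales=(16.0, 20.16, 25.4))
```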
Figure 5 shows an example of a lateral connection in a neural network for facial localisation in images. In this example, the lateral connection is for the P3 level of the feature pyramid, though it may mutatis mutandis be applied to any of the lateral connections between layers.
The lateral connection 500 provides the output of a layer 502 of the backbone network 402 (in this example the C3 layer of the backbone network) to a corresponding layer of the feature pyramid network 406 (in this example the P3 layer of the feature pyramid). The lateral connection 500 may be configured to process the output of the layer 502 of the backbone network 402 to generate a processed output 504. The processing may, for example, be to alter the dimensionality of the output of the layer 502 of the backbone network 402 so that it can be used as input to the corresponding feature pyramid layer. In the example shown, a 1x1 convolution is used to reduce the dimensionality of the output of the layer 502 of the backbone network from 80x80x512 to 80x80x256. Each lateral connection may apply a different convolution to reduce or increase the dimensionality of the associated output layer to match the corresponding feature pyramid layer.

The output from the immediately higher layer 506 of the feature pyramid (in this example P4) undergoes upsampling to increase its dimensionality. In the example shown, the output of the P4 layer of the pyramid network is up-scaled by a factor of two from 40x40x256 to 80x80x256. The upscaling may, for example, be performed by nearest neighbour up-sampling. The up-scaled output is then added to the processed output 504 from the lateral connection 500 in order to generate a merged map 508. Element-wise addition may be used to merge the up-scaled output with the processed output 504.
The merged map 508 is used as input to the layer of the feature pyramid network 406 corresponding to the lateral connection 500 (in this example the P3 layer). The P3 layer applies a 3x3 deformable convolution to the merged map 508 in order to generate an output feature map 510. The use of a deformable convolution can reduce the aliasing effects of upsampling, as well as perform non-rigid context modelling.
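By way of illustration only, the lateral connection of Figure 5 might be implemented as in the sketch below, which assumes a PyTorch-style implementation and uses a regular 3x3 convolution in place of the deformable convolution described above; the class and parameter names are illustrative placeholders.

```python
# Illustrative sketch (not part of the specification): one lateral connection
# of the feature pyramid (C3 -> P3 in the example of Figure 5).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateralConnection(nn.Module):
    def __init__(self, backbone_channels: int = 512, pyramid_channels: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(backbone_channels, pyramid_channels, kernel_size=1)  # 1x1 conv
        self.smooth = nn.Conv2d(pyramid_channels, pyramid_channels, kernel_size=3, padding=1)

    def forward(self, c3: torch.Tensor, p4: torch.Tensor) -> torch.Tensor:
        lateral = self.reduce(c3)                                     # e.g. 80x80x512 -> 80x80x256
        top_down = F.interpolate(p4, scale_factor=2, mode="nearest")  # e.g. 40x40x256 -> 80x80x256
        merged = lateral + top_down                                   # element-wise addition
        return self.smooth(merged)                                    # 3x3 conv on the merged map
```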
Figure 6 shows a further example of a neural network structure for facial localisation in images. The neural network 600 of this example is substantially the same as that of Figure 4, but with the addition of a context module 602 for each of the feature pyramid layers. The context module 602 may increase the receptive field and enhance the rigid context modelling power of the neural network 600. Each context module 602 takes as input the feature map 604 from a corresponding feature pyramid layer. A sequence of deformable convolutions is performed using the feature map 604. Each deformable convolution after a first deformable convolution is applied to the output 606a-c of the previous deformable convolution. In the example shown, three 3x3 deformable convolutions are applied. However, it will be appreciated that fewer or greater numbers of deformable convolutions may be applied, and that different sized deformable convolutions may be used.

The outputs of each deformable convolution 606a-c are combined to form an enhanced output 608. The enhanced output 608 may be further processed to generate the output data 108 and/or the multi-task loss 208.
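By way of illustration only, a context module of the kind shown in Figure 6 might be implemented as in the sketch below. It assumes a PyTorch-style implementation, uses regular convolutions in place of deformable convolutions, and combines the convolution outputs by channel concatenation as one possible choice; the class and parameter names are illustrative placeholders.

```python
# Illustrative sketch (not part of the specification): a context module applying
# a cascade of 3x3 convolutions to a feature map and combining their outputs.
import torch
import torch.nn as nn


class ContextModule(nn.Module):
    def __init__(self, channels: int = 256, num_convs: int = 3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_convs)
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        outputs, x = [], feature_map
        for conv in self.convs:           # each conv acts on the previous conv's output
            x = torch.relu(conv(x))
            outputs.append(x)
        return torch.cat(outputs, dim=1)  # combine the outputs into an enhanced output
```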
Figure 7 shows a schematic example of a system/apparatus for performing any of the methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
The apparatus (or system) 700 comprises one or more processors 702. The one or more processors control operation of other components of the system/apparatus 700. The one or more processors 702 may, for example, comprise a general purpose processor. The one or more processors 702 may be a single core device or a multiple core device. The one or more processors 702 may comprise a central processing unit (CPU) or a graphical processing unit (GPU). Alternatively, the one or more processors 702 may comprise specialised processing hardware, for instance a RISC processor or
programmable hardware with embedded firmware. Multiple processors may be included.
The system/apparatus comprises a working or volatile memory 704. The one or more processors may access the volatile memory 704 in order to process data and may control the storage of data in memory. The volatile memory 704 may comprise RAM of any type, for example Static RAM (SRAM), Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
The system/apparatus comprises a non-volatile memory 706. The non-volatile memory 706 stores a set of operation instructions 708 for controlling the operation of the processors 702 in the form of computer readable instructions. The non-volatile memory 706 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
The one or more processors 702 are configured to execute the operating instructions 708 to cause the system/apparatus to perform any of the methods described herein. The operating instructions 708 may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus 700, as well as code relating to the basic operation of the system/apparatus 700. Generally speaking, the one or more processors 702 execute one or more instructions of the operating instructions 708, which are stored permanently or semi-permanently in the non-volatile memory 706, using the volatile memory 704 to temporarily store data generated during execution of said operating instructions 708.
Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations
thereof. These may include computer program products (such as software stored on e.g. magnetic discs, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to Figure 7, cause the computer to perform one or more of the methods described herein.
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.
Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/ or supplied and/or used independently.
Although several embodiments have been shown and described, it will be
appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.
Claims
1. A computer implemented method of training a neural network for face localisation, the method comprising:
inputting, to the neural network, a training image, the training image comprising one or more faces;
processing, using the neural network, the training image;
outputting, from the neural network and for each of a plurality of training anchors in the training image, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors;
updating parameters of the neural network in dependence on an objective function, wherein the objective function comprises, for each positive anchor in the training image:
a classification loss comparing a predicted classification for the positive anchor to a known classification for the positive anchor;
a box regression loss comparing a predicted location of a face box for the positive anchor to a known location of the face box; and
a feature loss comparing pixel-based properties of the one or more feature vectors for the positive anchor to known pixel-based properties of a face associated with the positive anchor.
2. The method of claim 1, wherein the one or more feature vectors comprises predicted locations of a plurality of facial landmarks of a face, and wherein the feature loss comprises a facial landmark regression loss comparing the predicted locations of the plurality of facial landmarks to known locations of the plurality of facial landmarks.
3. The method of claim 1 or 2, wherein the one or more feature vectors comprises an encoded representation of a face within the predicted face box, and wherein the method comprises:
generating a three-dimensional representation of the face using a mesh decoder neural network, the mesh decoder neural network comprising one or more geometric convolutional layers; and
generating a two-dimensional image of the face from the three-dimensional representation of the face using a differentiable renderer,
wherein the feature loss comprises a dense regression loss comparing the generated two-dimensional image of the face to a ground truth image of the face within the predicted face box.

4. The method of claim 3, wherein the feature vector further comprises camera parameters and/or illumination parameters, and wherein the differentiable renderer uses the camera parameters and/or illumination parameters when generating the two-dimensional image of the face from the three-dimensional representation of the face.

5. The method of any of claims 3 or 4, wherein the dense regression loss is a pixel-wise difference between the generated two-dimensional image of the face and the ground truth image of the face within the predicted face box.
6. The method of any of claims 3 to 5, wherein the mesh decoder comprises a plurality of geometric convolutional layers and a plurality of upscaling layers, and wherein the plurality of upscaling layers are interlaced with the plurality of geometric convolutional layers.
7. The method of any preceding claim, wherein the objective function comprises, for each negative anchor in the training image, a classification loss comparing a predicted classification for the negative anchor to a known classification for the negative anchor.
8. The method of any preceding claim, further comprising iterating the method using one or more further training images until a threshold condition is satisfied.
9. The method of any preceding claim, wherein the neural network comprises: a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and
a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
10. The method of claim 9, wherein each lateral connection is configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
11. The method of claims 9 or 10, wherein the second plurality of convolutional layers comprises one or more deformable convolutions.

12. The method of any of claims 9 to 11, wherein the neural network further comprises one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.

13. A computer implemented method of face localisation, the method comprising: identifying, using a neural network, one or more faces within the input image; and
identifying, using the neural network, one or more corresponding face boxes within the input image,
wherein the neural network has been trained using the method of any preceding claim.
14. A computer implemented method of face localisation, the method comprising: inputting, to the neural network, an image, the image comprising one or more faces;
processing, using the neural network, the image;
outputting, from the neural network, one or more sets of output data, each set of output data comprising a predicted facial classification, a predicted location of a corresponding face box, and one or more corresponding feature vectors,
wherein the neural network comprises:
a first plurality of convolutional layers comprising an input layer, a plurality of convolutional filters and one or more skip connections; and
a second plurality of convolutional layers laterally connected to the first plurality of convolutional layers and configured to process the output of the first plurality of convolutional layers in a top-down manner.
15. The method of claim 14, wherein each lateral connection is configured to merge an output of a layer in the first plurality of convolutional layers with an output of a preceding layer of the second plurality of convolutional layers.
16. The method of claims 14 or 15, wherein the second plurality of convolutional layers comprises one or more deformable convolutions.
17. The method of any of claims 14 to 16, wherein the neural network further comprises one or more context modules, each context module configured to process the output of a layer of the second plurality of convolutional layers using one or more deformable convolutional networks.
18. The method of any of claims 14-17, wherein the neural network has been trained using the method of any of claims 1-8.
19. A system comprising:
one or more processors; and
a memory, the memory comprising computer readable instructions that, when executed by the one or more processors, cause the system to perform the method of any preceding claim.
20. A computer program product comprising computer readable instructions that, when executed by a computing device, cause the computing device to perform the method of any of claims 1-18.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202080031201.1A CN114080632A (en) | 2019-04-30 | 2020-04-24 | Face localization in images |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1906027.6A GB2582833B (en) | 2019-04-30 | 2019-04-30 | Facial localisation in images |
GB1906027.6 | 2019-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020221990A1 true WO2020221990A1 (en) | 2020-11-05 |
Family
ID=66809192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2020/051012 WO2020221990A1 (en) | 2019-04-30 | 2020-04-24 | Facial localisation in images |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN114080632A (en) |
GB (1) | GB2582833B (en) |
WO (1) | WO2020221990A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110414417B (en) * | 2019-07-25 | 2022-08-12 | 电子科技大学 | Traffic sign board identification method based on multi-level fusion multi-scale prediction |
CN113496139B (en) * | 2020-03-18 | 2024-02-13 | 北京京东乾石科技有限公司 | Method and apparatus for detecting objects from images and training object detection models |
CN111462002B (en) * | 2020-03-19 | 2022-07-12 | 重庆理工大学 | Underwater image enhancement and restoration method based on convolutional neural network |
CN111814884B (en) * | 2020-07-10 | 2024-09-17 | 江南大学 | Upgrading method of target detection network model based on deformable convolution |
CN111985642B (en) * | 2020-08-17 | 2023-11-14 | 厦门真景科技有限公司 | Beauty neural network training method, apparatus, equipment and storage medium |
CN115410265B (en) * | 2022-11-01 | 2023-01-31 | 合肥的卢深视科技有限公司 | Model training method, face recognition method, electronic device and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10614289B2 (en) * | 2010-06-07 | 2020-04-07 | Affectiva, Inc. | Facial tracking with classifiers |
CN106295678B (en) * | 2016-07-27 | 2020-03-06 | 北京旷视科技有限公司 | Neural network training and constructing method and device and target detection method and device |
CN107194376A (en) * | 2017-06-21 | 2017-09-22 | 北京市威富安防科技有限公司 | Mask fraud convolutional neural networks training method and human face in-vivo detection method |
-
2019
- 2019-04-30 GB GB1906027.6A patent/GB2582833B/en active Active
-
2020
- 2020-04-24 WO PCT/GB2020/051012 patent/WO2020221990A1/en active Application Filing
- 2020-04-24 CN CN202080031201.1A patent/CN114080632A/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB902524A (en) | 1959-06-03 | 1962-08-01 | Oswald Arthur Silk | Improvements in or relating to seed sowing apparatus |
Non-Patent Citations (8)
Title |
---|
FORRESTER COLE ET AL: "Synthesizing Normalized Faces from Facial Identity Features", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 17 October 2017 (2017-10-17), pages 3386 - 3395, XP055541837, ISBN: 978-1-5386-0457-1, DOI: 10.1109/CVPR.2017.361 * |
GENOVA ET AL., CVPR, 2018 |
HAO WANG ET AL: "Face R-CNN", ARXIV, 4 June 2017 (2017-06-04), XP055713126, Retrieved from the Internet <URL:https://arxiv.org/pdf/1706.01061.pdf%20http://arxiv.org/abs/1706.01061.pdf> [retrieved on 20200709] * |
JIANKANG DENG ET AL: "RetinaFace: Single-stage Dense Face Localisation in the Wild", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 May 2019 (2019-05-02), XP081271467 * |
LAI YING-HSIU ET AL: "Emotion-Preserving Representation Learning via Generative Adversarial Network for Multi-View Facial Expression Recognition", 2018 13TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION (FG 2018), IEEE, 15 May 2018 (2018-05-15), pages 263 - 270, XP033354708, DOI: 10.1109/FG.2018.00046 * |
SHAOQING REN ET AL.: "Faster R-CNN: Towards Real-Time Object Detection with Regional Proposal Networks", ARXIV:1506.01497 |
YUXIANG ZHOU ET AL: "Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 April 2019 (2019-04-06), XP081165962 * |
ZHANG KAIPENG ET AL: "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks", IEEE SIGNAL PROCESSING LETTERS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 23, no. 10, 1 October 2016 (2016-10-01), pages 1499 - 1503, XP011622636, ISSN: 1070-9908, [retrieved on 20160909], DOI: 10.1109/LSP.2016.2603342 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364754A (en) * | 2020-11-09 | 2021-02-12 | 云南电网有限责任公司迪庆供电局 | Bolt defect detection method and system |
CN112364754B (en) * | 2020-11-09 | 2024-05-14 | 云南电网有限责任公司迪庆供电局 | Bolt defect detection method and system |
CN112396053A (en) * | 2020-11-25 | 2021-02-23 | 北京联合大学 | Method for detecting object of all-round fisheye image based on cascade neural network |
CN113435466A (en) * | 2020-12-26 | 2021-09-24 | 上海有个机器人有限公司 | Method, device, medium and terminal for detecting elevator door position and switch state |
CN112733672A (en) * | 2020-12-31 | 2021-04-30 | 深圳一清创新科技有限公司 | Monocular camera-based three-dimensional target detection method and device and computer equipment |
CN112964693A (en) * | 2021-02-19 | 2021-06-15 | 山东捷讯通信技术有限公司 | Raman spectrum band region segmentation method |
WO2022246720A1 (en) * | 2021-05-24 | 2022-12-01 | 中国科学院深圳先进技术研究院 | Training method of surgical action identification model, medium and device |
JP2023530796A (en) * | 2021-05-28 | 2023-07-20 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Recognition model training method, recognition method, device, electronic device, storage medium and computer program |
CN113221842A (en) * | 2021-06-04 | 2021-08-06 | 第六镜科技(北京)有限公司 | Model training method, image recognition method, device, equipment and medium |
CN113221842B (en) * | 2021-06-04 | 2023-12-29 | 第六镜科技(北京)集团有限责任公司 | Model training method, image recognition method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
GB201906027D0 (en) | 2019-06-12 |
GB2582833B (en) | 2021-04-07 |
GB2582833A (en) | 2020-10-07 |
CN114080632A (en) | 2022-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020221990A1 (en) | Facial localisation in images | |
Chibane et al. | Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes | |
Wang et al. | Shape inpainting using 3d generative adversarial network and recurrent convolutional networks | |
Petersen et al. | Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer | |
KR102693803B1 (en) | Generation of 3D object models from 2D images | |
Henderson et al. | Unsupervised object-centric video generation and decomposition in 3D | |
WO2020174215A1 (en) | Joint shape and texture decoders for three-dimensional rendering | |
Mohajerani et al. | Shadow detection in single RGB images using a context preserver convolutional neural network trained by multiple adversarial examples | |
Wang et al. | OctreeNet: A novel sparse 3-D convolutional neural network for real-time 3-D outdoor scene analysis | |
EP3872761A2 (en) | Analysing objects in a set of frames | |
Bounsaythip et al. | Genetic algorithms in image processing-a review | |
Shao et al. | Efficient three-dimensional point cloud object detection based on improved Complex-YOLO | |
CN118154770A (en) | Single tree image three-dimensional reconstruction method and device based on nerve radiation field | |
CN118262034A (en) | System and method for reconstructing an animated three-dimensional human head model from an image | |
Šircelj et al. | Segmentation and recovery of superquadric models using convolutional neural networks | |
WO2021228172A1 (en) | Three-dimensional motion estimation | |
US20220172421A1 (en) | Enhancement of Three-Dimensional Facial Scans | |
CN116030181A (en) | 3D virtual image generation method and device | |
Ardö et al. | Height Normalizing Image Transform for Efficient Scene Specific Pedestrian Detection | |
Teed | Optimization Inspired Neural Networks for Multiview 3D Reconstruction | |
Bettouche et al. | Improving Image Tracing with Convolutional Autoencoders by High-Pass Filter Preprocessing | |
CN115984583B (en) | Data processing method, apparatus, computer device, storage medium, and program product | |
WO2023241276A1 (en) | Image editing method and related device | |
Garbade | Semantic Segmentation and Completion of 2D and 3D Scenes | |
Singer | View-Agnostic Point Cloud Generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20723492 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20723492 Country of ref document: EP Kind code of ref document: A1 |