CN112639810A - Face key point positioning system and method

Info

Publication number
CN112639810A
Authority
CN
China
Prior art keywords
neural network
input
face
network system
image
Prior art date
Legal status
Pending
Application number
CN201980057333.9A
Other languages
Chinese (zh)
Inventor
邓健康
斯特凡诺斯·扎菲里乌
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112639810A publication Critical patent/CN112639810A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Abstract

A neural network system for locating face keypoints in an image comprises two or more series-connected convolutional neural networks. Each convolutional neural network includes a plurality of downsampling layers, a plurality of upsampling layers, and a plurality of lateral connections connecting layers of the same size. Each lateral connection comprises one or more aggregation nodes having an upsampled input, and at least one aggregation node further has a downsampled input.

Description

Face key point positioning system and method
Technical Field
The present invention relates to face keypoint localization and, more particularly, to a system and method for locating face keypoints using a neural network system.
Background
Face keypoint localization, or face alignment, involves finding the location of various face features in an image. Examples of facial features that may be located include the mouth, nose bridge, eyebrows, and chin. Face key point localization has a wide range of applications including but not limited to face recognition, face tracking, expression understanding, and 3D face reconstruction.
Disclosure of Invention
According to a first aspect of the present invention, there is provided a neural network system for locating face keypoints in an image, the neural network system comprising two or more convolutional neural networks in series, each convolutional neural network comprising a plurality of downsampling layers, a plurality of upsampling layers, and a plurality of lateral connections connecting layers of the same size; wherein each said lateral connection comprises one or more aggregation nodes with an upsampled input; wherein at least one of the aggregation nodes further comprises a downsampled input.
Each of the convolutional neural networks may include 3 or 4 downsampling layers and a corresponding number of upsampling layers.
Each of the convolutional neural networks may include 3 downsampling layers.
Only one of the aggregation nodes may include a downsampling input.
Each of the aggregation nodes may comprise a downsampled output.
At least one of the lateral connections may comprise one or more lateral skip connections.
At least one of the lateral connections may comprise one or more convolutions.
At least one of the convolutions may be a depth-wise separable convolution.
The neural network system may include an input layer capable of receiving a 128 x 128 pixel input image.
The neural network system may include a channel aggregation block, the channel aggregation block including: a plurality of channel-reduction stages, a plurality of channel-increase stages, and a plurality of branch connections connecting stages of the same channel size.
Each of the plurality of downsampling layers, each of the plurality of upsampling layers, and each of the aggregation nodes may include the channel aggregation block.
The output of each of the convolutional neural networks may be connected to a spatial transformer.
The spatial transformer may comprise at least one deformable convolution.
According to a second aspect of the present invention, there is provided a method of locating face keypoints in an image using a neural network system, the method comprising the steps of: receiving an input image, the input image comprising a plurality of input pixels; in each convolutional neural network of a plurality of convolutional neural networks in series: downsampling the input image using a plurality of downsampling layers; and upsampling the downsampled image using a plurality of upsampling layers; wherein the upsampling comprises merging, using a lateral connection, the upsampled image and an aggregated signal input from a downsampling layer of the same size into the upsampling layer; wherein each said lateral connection comprises one or more aggregation nodes with an upsampled input; wherein at least one of the aggregation nodes further comprises a downsampled input; and generating, from the outputs of the series of convolutional neural networks, a corresponding output heatmap for the input image, the heatmap representing possible locations of face keypoints at each input pixel.
According to a third aspect of the present invention, there is provided a method of training a neural network system, comprising: receiving a plurality of training face images and a plurality of expected face keypoint parameter sets; inputting one of the plurality of training face images into the neural network system; in response to the input, generating a face keypoint heatmap for the training face image using the neural network system; and training the neural network system, including updating one or more of a plurality of neural network weights to minimize a loss, measured in part based on a comparison of the generated face keypoint heatmap and the corresponding set of expected face keypoint parameters.
The method may include generating a plurality of transformed face images by applying one or more transformations to the plurality of training face images; inputting a transformed face image into the neural network system, wherein the input transformed face image corresponds to a transformation of the input training face image; and, in response to the input, generating a face keypoint heatmap for the transformed face image using the neural network system; wherein training the neural network system further comprises updating the neural network weights to minimize a transformation loss, the transformation loss measured in part based on a comparison of the face keypoint heatmaps generated for the training face images and the face keypoint heatmaps generated for the corresponding transformed face images.
Drawings
Certain embodiments of the present invention are described by way of example in conjunction with the following figures.
FIG. 1 is a schematic block diagram of an exemplary neural network system for locating key points of a face;
FIG. 2 is a schematic block diagram of an exemplary convolutional neural network topology for locating face keypoints in a neural network system;
FIG. 3 is a schematic diagram of an exemplary novel neural network building block architecture for use in a convolutional neural network;
FIG. 4 is a flow diagram of an exemplary method for generating a face keypoint heatmap for an input image using a neural network system;
FIG. 5 is a flow diagram of a first exemplary method of training a neural network system to locate face keypoints in an image;
FIG. 6 is a flow diagram of an extended exemplary method of training a neural network system to locate face keypoints in an image.
Detailed Description
Exemplary implementations provide systems and methods for enhanced face keypoint localization in images. Face keypoint localization is used in face recognition, face tracking, expression understanding, 3D face reconstruction, and similar applications, so improving face keypoint localization using the systems and methods described herein may improve the performance of systems for these applications. For example, a face recognition system using the systems and methods described herein may recognize faces more accurately, as measured using a standard face recognition test data set.
Improved localization of face keypoints in an image can be achieved through various exemplary implementations: a novel neural network system design, a novel convolutional neural network topology, and/or a novel neural network building block architecture.
The novel neural network system design may comprise a plurality of convolutional neural networks and corresponding internal transformers, where the internal transformers employ deformable convolutions to model geometric face transformations flexibly and efficiently. Such an exemplary implementation can be made spatially invariant to any input face image by using internal transformers employing deformable convolution.
A novel convolutional neural network topology, referred to hereinafter as the Scale Aggregation Topology (SAT), improves on the U-Net topology by iteratively and hierarchically merging feature scales at aggregation nodes within the lateral connections. These aggregation nodes have a lateral input, an upsampling input, and a downsampling input, and enable multi-scale features to be aggregated to improve face keypoint localization; a minimal sketch of such a node is given below. In some variants of the SAT, the downsampling inputs to certain nodes are removed, because in some cases, such as when training data is limited, removing these connections reduces computational complexity, model size, and training time without affecting face keypoint localization performance.
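As a concrete illustration, an aggregation node of this kind can be sketched in a few lines. The following is a minimal PyTorch sketch under stated assumptions: the patent does not fix the merge operation or the layer hyperparameters, so the element-wise summation, the strided convolution on the downsampling input, and the transposed convolution on the upsampling input are illustrative choices rather than the patented design.

```python
import torch
import torch.nn as nn

class AggregationNode(nn.Module):
    """SAT-style aggregation node merging lateral, upsampled, and downsampled inputs."""

    def __init__(self, channels: int, has_down_input: bool = True):
        super().__init__()
        # Upsampling input: fractionally-strided convolution doubling the resolution.
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        # Optional downsampling input: strided convolution halving the resolution.
        self.down = (nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
                     if has_down_input else None)

    def forward(self, lateral, from_below, from_above=None):
        # lateral: same-scale input (e.g. carried by a lateral skip connection).
        out = lateral + self.up(from_below)
        if self.down is not None and from_above is not None:
            out = out + self.down(from_above)
        return out

node = AggregationNode(channels=64)
lateral = torch.randn(1, 64, 32, 32)
from_below = torch.randn(1, 64, 16, 16)  # coarser scale, upsampled on the way in
from_above = torch.randn(1, 64, 64, 64)  # finer scale, downsampled on the way in
print(node(lateral, from_below, from_above).shape)  # torch.Size([1, 64, 32, 32])
```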
A novel neural network building block architecture, hereinafter referred to as the Channel Aggregation Block (CAB) architecture, is a building block architecture that is symmetric across channels. In the CAB architecture, the input signal branches before each channel reduction and converges before each channel increase, in order to preserve channel information. Using this channel compression approach in the building block architecture aids context modeling, by capturing the relationships between channel-wise heatmaps and by improving the robustness of the algorithm to local blurring in the input image, thereby improving face keypoint localization. In some variants of the CAB architecture, depth-wise separable convolutions and copy-based channel expansion are employed to control computational complexity and model size.
Neural network system
Referring to fig. 1, there is shown a neural network system 100 for locating face keypoints in an image.
The neural network system 100 is operable to receive a neural network input 102 and generate a neural network output 104. The neural network system 100 may be implemented on one or more suitable computing devices. For example, the one or more computing devices may be any number or combination of mainframes, blade servers, desktop computers, tablets, and mobile phones. Where the neural network system 100 is implemented on a plurality of suitable computing devices, those devices may be configured to communicate with each other over a suitable network. Examples of suitable networks include the internet, local area networks, fixed microwave networks, cellular networks, and wireless networks. The neural network system 100 may be implemented using a suitable machine learning framework.
The neural network input 102 may be an input image comprising a plurality of input pixels, or a matrix representation of such an image. The input image may be a cropped and/or rescaled version of an original image; for example, the input image may be the region of the original image corresponding to a detected face, rescaled to 128 x 128 pixels.
The neural network output 104 may be an output heatmap corresponding to the input image. The heatmap may represent the possible locations of face keypoints at each pixel. Such a heatmap may be output in one or more of a variety of formats, such as an image in which colors represent the probabilities of certain face keypoints, one or more matrix representations of the numerical probability of a facial feature being present at the matrix position corresponding to each pixel position in the input image, and/or one or more compact vector representations (embeddings) of such information.
The neural network system 100 includes a first convolutional neural network 110, a first internal transformer 120, a second convolutional neural network 112, and a second internal transformer 122. The first convolutional neural network 110 receives the neural network input 102 and processes it using a plurality of neural network layers to generate a first neural network output. The first internal transformer 120 receives the first neural network output and performs an appropriate transformation to generate a first transformed neural network output. The second convolutional neural network 112 receives and processes the first transformed neural network output to generate a second neural network output. The second internal transformer 122 receives the second neural network output and performs an appropriate transformation to generate a second transformed neural network output. The second transformed neural network output may itself be the neural network output 104, or the neural network output 104 may be derived from it.
The convolutional neural networks 110, 112 may have a Scale Aggregation Topology (SAT), as described above. For example, the convolutional neural networks may have the variant SAT topology shown in fig. 2 and described in the relevant portions of the specification. In other embodiments, the convolutional neural networks may have a U-Net, hourglass, or deep aggregation topology. The first convolutional neural network 110 and the second convolutional neural network 112 may have the same topology or different topologies. When the first convolutional neural network 110 and the second convolutional neural network 112 have different topologies, the topologies may be substantially similar but with slight variations, e.g. with different connections removed, or with different convolutions replaced by depth-wise separable convolutions and/or lateral skip connections. Slight differences in the architecture of the constituent blocks are also possible.
The internal transformers 120, 122 may be spatial transformers that improve and/or provide spatial transformation modeling capabilities for the neural network system 100. In some embodiments, the internal transformers 120, 122 perform an implicit transformation by deformable convolution. A deformable convolution extends the conventional convolution by learning additional offsets, in a locally dense manner, for the locations sampled by the convolution operation. Compared with a transformer employing explicit parametric transformations, deformable convolution provides more flexible geometric face transformation modeling capability and higher computational efficiency. Alternatively, the internal transformers 120, 122 may perform explicit parametric transformations using a Spatial Transformer Network (STN).
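As an illustration of the deformable-convolution approach, an internal transformer block might be sketched as follows using torchvision.ops.DeformConv2d; the offset-prediction layer, channel counts, and kernel size are assumptions made for the sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableTransformer(nn.Module):
    """Implicit spatial transformation via a deformable convolution."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Offsets are predicted locally and densely: one (x, y) pair per kernel tap.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        offsets = self.offset_pred(x)
        return self.deform(x, offsets)

x = torch.randn(1, 64, 32, 32)
print(DeformableTransformer(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```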
Convolutional neural network topology
Referring to fig. 2, there is shown an example of a convolutional neural network topology suitable for face keypoint localization in a neural network system. The illustrated topology is a variant of the Scale Aggregation Topology (SAT) described above.
The convolutional neural network topology 200 includes a plurality of nodes 210 to 224, where each node may be a channel aggregation block. Some of the nodes, 220 to 224, are referred to as aggregation nodes because they are nodes in lateral connections that aggregate information across multiple scales.
Each of the nodes 210 to 224 is connected to other nodes in the network according to the illustrated connections 230 to 266. These connections receive one or more inputs from a given node in the network, optionally perform operations, and transmit the results to another node.
The solid line connections 230 to 236 represent downsampling convolutions. These connections receive an input from a node (e.g. node 210), perform a downsampling convolution, and provide the result as an input to another node (e.g. node 211). A downsampling convolution may include applying a pooling layer after a convolutional layer, or may include applying a convolutional layer with a stride of two or more. In some examples, the result of the downsampling convolution has half the vertical resolution and half the horizontal resolution of the input; e.g., if the input has a resolution of 128 x 128, the result of the downsampling convolution has a resolution of 64 x 64.
The dashed connections 240 to 245 represent upsampling convolutions. These connections receive an input from a node (e.g. node 211), perform an upsampling convolution, and provide the result as an input to another node (e.g. node 220). An upsampling convolution may include applying a convolutional layer followed by an unpooling layer, or may include applying a convolutional layer with a fractional stride. In some examples, the result of the upsampling convolution has twice the vertical resolution and twice the horizontal resolution of the input; e.g., if the input has a resolution of 64 x 64, the result of the upsampling convolution has a resolution of 128 x 128.
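The two downsampling options and the two upsampling options described in the preceding two paragraphs can be sketched as follows, with illustrative layer sizes; nearest-neighbor upsampling stands in for the unpooling layer.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 128, 128)

# Downsampling: a convolutional layer followed by a pooling layer ...
down_pool = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.MaxPool2d(2))
# ... or a single convolutional layer with a stride of two.
down_stride = nn.Conv2d(32, 32, 3, stride=2, padding=1)
print(down_pool(x).shape, down_stride(x).shape)  # both: torch.Size([1, 32, 64, 64])

y = torch.randn(1, 32, 64, 64)

# Upsampling: a convolutional layer followed by an unpooling/upsampling layer ...
up_unpool = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1),
                          nn.Upsample(scale_factor=2, mode="nearest"))
# ... or a single fractionally-strided (transposed) convolution.
up_stride = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)
print(up_unpool(y).shape, up_stride(y).shape)  # both: torch.Size([1, 32, 128, 128])
```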
Connections 250 through 252 represent depth-wise separable convolutions. These connections receive an input from a node (e.g. node 212), perform a depth-wise separable convolution on the input, and provide the result as an input to another node (e.g. node 214). In these cases, using a depth-wise separable convolution instead of a conventional convolution reduces the number of parameters and computations required. Furthermore, in some cases, replacing the conventional convolution with a depth-wise separable convolution for a particular connection substantially preserves the performance of the overall system, e.g. the accuracy of face recognition. In some cases, for example when the training data is limited, using a depth-wise separable convolution for certain connections may even improve the performance of the overall system, because the reduction in parameters reduces the difficulty of optimization during model training.
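A depth-wise separable convolution is conventionally factored into a depth-wise convolution (one spatial filter per input channel) followed by a point-wise 1 x 1 convolution, which is what gives the parameter savings noted above; the channel counts below are illustrative.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, kernel_size: int = 3) -> nn.Sequential:
    return nn.Sequential(
        # Depth-wise stage: groups == in_ch gives one spatial filter per channel.
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2, groups=in_ch),
        # Point-wise stage: 1 x 1 convolution mixing information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

x = torch.randn(1, 64, 32, 32)
print(depthwise_separable(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```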
The dashed connections 260 to 266 represent lateral skip connections. These lateral skip connections receive an input from a node (e.g. node 210) and provide the received values to another node (e.g. node 220). These lateral skip connections can preserve the incoming high-resolution information without requiring significant processing. Furthermore, since a lateral skip connection requires few, if any, parameters, the number of parameters and computations required can be reduced by using a lateral skip connection rather than a convolution. In some cases, replacing the convolution with a lateral skip connection for a particular connection substantially preserves the performance of the overall system, e.g. the accuracy of face keypoint localization. In some cases, for example when training data is limited, using a lateral skip connection for certain connections may even improve the performance of the overall system, since the reduction in parameters reduces the difficulty of optimization during model training.
In the illustrated convolutional neural network topology 200, there are no downsampling connections between nodes 224 and 214, or between nodes 222 and 215, as might otherwise be expected. For these specific connections, and in specific cases such as when the training data is limited, the absence of these connections can preserve the performance of the overall system, e.g. the accuracy of face keypoint localization, while reducing the number of parameters and computations required. Furthermore, in some cases, for example when training data is limited, removing these connections may improve the performance of the overall system, as the reduction in parameters reduces the difficulty of optimization during model training.
It should be noted, however, that the variant of the scale aggregation topology expected to yield optimal performance is one in which all nodes are connected by convolutions, provided there are sufficient training data, computational resources, and time. This is because such a topology has the greatest computational power and retains the greatest amount of information.
Neural network building block architecture
Referring to fig. 3, there is shown an example of a neural network building block architecture suitable for face keypoint localization in a neural network system. The illustrated architecture is a variant of the Channel Aggregation Block (CAB) architecture described previously.
The neural network building block architecture 300 includes a plurality of operations (310 through 345). These operations include channel-reduction convolutions (310 through 314), self-joins (320 through 325), depth-wise separable convolutions (330 through 332), and matrix additions (340 through 345). In the illustrated neural network building block architecture 300, the input signal branches before each channel reduction and converges before each channel increase, in order to preserve channel information. For example, the input signal branches before the channel reduction of convolution 312 and converges before the corresponding channel increase at self-join 322. Using this channel compression approach in the building block architecture aids context modeling, by capturing the relationships between channel-wise heatmaps and by improving the robustness of the algorithm to local blurring in the input image, thereby improving face keypoint localization.
Operations 310 through 314 are channel-reduction convolutions. These operations receive an input, perform a channel-reduction convolution on the input, and provide the result as an input to one or more subsequent operations. The input to a channel-reduction convolution may be three-dimensional, of size W x H x C1, where W is the width of the input, H is the height of the input, and C1 is the number of channels in the input. Each channel of the input, i.e. each W x H x 1 slice, may be a feature heatmap. The channel-reduction convolution may include applying a four-dimensional convolution kernel to the input. The four-dimensional kernel may be of size K x K x C1 x C2, where K is arbitrary, typically a small integer such as 3, and C2 is the desired number of channels. The result of the channel-reduction convolution is of size W x H x C2. In some embodiments, C2 is C1/2. For example, if the size of the received input is 128 x 128 x 64, the size of the four-dimensional convolution kernel may be 3 x 3 x 64 x 32, and the result of the four-dimensional convolution is of size 128 x 128 x 32.
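The numerical example above can be reproduced directly; the sketch below assumes PyTorch's channels-first layout, where the 3 x 3 x 64 x 32 kernel corresponds to a convolution from 64 channels down to 32.

```python
import torch
import torch.nn as nn

# Channel-reduction convolution: K = 3, C1 = 64, C2 = C1/2 = 32.
reduce = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 64, 128, 128)  # a 128 x 128 x 64 input, channels first
print(reduce(x).shape)  # torch.Size([1, 32, 128, 128]), i.e. 128 x 128 x 32
```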
Operations 320 through 325 are self-join operations. These operations receive an input, perform a self-join on the input, and provide the result as an input to one or more subsequent operations. A self-join may include concatenating the received input with itself one or more times along the channel dimension, i.e. a depth-wise concatenation. The result therefore has more channels than the input: for example, when an input with C1 channels is concatenated with itself once, the result has C1 x 2 channels. To give a concrete example, if the size of the received input is 128 x 128 x 32 and the input is concatenated once with itself, the result is of size 128 x 128 x 64.
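The self-join is simply a channel-wise concatenation of a tensor with itself, requiring no learned parameters, for example:

```python
import torch

x = torch.randn(1, 32, 128, 128)   # a 128 x 128 x 32 input, channels first
doubled = torch.cat([x, x], dim=1)  # concatenate once along the channel axis
print(doubled.shape)                # torch.Size([1, 64, 128, 128])
```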
Using the self-join operation allows the number of channels to be increased, giving the channel compression architecture the advantages described above, without requiring extensive computation or a large number of additional parameters.
Alternatives, such as channel-increasing convolutions, require a large number of computations and a large number of parameters. Thus, the model size of a neural network system employing the self-join operation is reduced compared to these alternatives.
Operations 330 through 332 are depth-wise separable convolutions. These operations receive an input, perform a depth-wise separable convolution on the input, and provide the result as an input to one or more subsequent operations. In these cases, using a depth-wise separable convolution instead of a conventional convolution reduces the number of parameters and computations required.
Operations 340 through 345 are matrix additions. These operations receive two or more inputs, perform a matrix addition of the two or more inputs, and provide the result as an input to one or more subsequent operations. The matrix additions (340 through 345) combine information from representations with different channel numbers.
The illustrated neural network building block architecture is exemplary, and modified neural network building block architectures may be used instead. These variations may be advantageous in certain situations. For example, where more training data and more computational power are available, it may be beneficial to replace the self-join operations with channel-increasing convolutions, and/or to replace the depth-wise separable convolutions with conventional convolutions. Furthermore, it should be noted that the other innovative aspects of the present invention can be used with various neural network building block architectures, such as ResNet, Inception ResNet, and/or Hierarchical, Parallel & Multi-Scale (HPM) blocks.
Face keypoint heatmap generation method
FIG. 4 is a flow diagram of an exemplary method by which a face keypoint heatmap may be generated for an input image using a neural network system. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, for example, computing devices that implement the neural network system 100. The neural network system used may, for example, be the neural network system 100.
In step 410, an input image comprising a plurality of pixels is received. The input image may be a cropped and/or rescaled version of the original image, e.g. a region of the original image corresponding to a detected face and rescaled to 128 x 128 pixels.
In step 420, steps 422 and 424 are performed in each of the plurality of convolutional neural networks in series. For example, steps 422 and 424 may be performed for each of the first convolutional neural network 110 and the second convolutional neural network 112 of the neural network system 100.
In step 422, the input image is downsampled using a plurality of downsampling layers of the respective neural network. The plurality of downsampling layers may be downsampling convolutional layers. The first downsampling layer may apply a pooling operation after applying a convolution to the input image, or may apply a convolutional layer with a stride of two or more to the image. Subsequent downsampling layers may apply such operations to the output of the previous layer. In some embodiments, the plurality of downsampling layers includes the downsampling connections 230 to 236 of the convolutional neural network topology 200.
In step 424, the downsampled image is upsampled using a plurality of upsampling layers of the corresponding neural network. The plurality of upsampling layers may be upsampling convolutional layers. Each upsampling convolution may include applying a convolutional layer followed by an unpooling layer, or may include applying a convolutional layer with a fractional stride. These upsampling convolutional layers may be or correspond to the upsampling connections 240 to 245 of the convolutional neural network topology 200.
The upsampling may include merging, using a lateral connection, the upsampled image and an aggregated signal input from the downsampling layer of the same size into the upsampling layer. The lateral connection may include one or more aggregation nodes. Each of these aggregation nodes may have an upsampled input, and at least one aggregation node may additionally have a downsampled input.
At step 430, an output heatmap corresponding to the input image is generated from the output of the series of convolutional neural networks. The heatmap represents the possible locations of face keypoints at each input pixel. The heatmap may be output in one or more of a variety of formats, such as an image in which colors represent the probabilities of certain face keypoints, one or more matrix representations of the numerical probability of a facial feature being present at the matrix position corresponding to each pixel position of the input image, and/or one or more compact vector representations (embeddings) of such information.
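Downstream consumers of such a heatmap often reduce it to keypoint coordinates. The following hedged sketch assumes a tensor layout with one heatmap channel per keypoint, which is an illustrative convention rather than one fixed by the patent, and decodes each channel by taking its argmax.

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Decode (num_keypoints, H, W) per-pixel probabilities to (x, y) coordinates."""
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(n, -1).argmax(dim=1)
    ys, xs = flat_idx // w, flat_idx % w
    return torch.stack([xs, ys], dim=1)  # shape (num_keypoints, 2)

heatmaps = torch.rand(68, 128, 128)  # e.g. 68 facial landmarks
print(heatmaps_to_keypoints(heatmaps).shape)  # torch.Size([68, 2])
```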
Face keypoint localization neural network training method
FIG. 5 is a flow diagram of an exemplary method 500 by which a neural network system may be trained to locate face keypoints in an image. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, for example, that implement the neural network system 100. For example, the neural network system trained by the method may be the neural network system 100.
In step 510, a training face image and a corresponding set of expected face keypoint parameters are received. The training face image may be a cropped and/or rescaled version of the corresponding original image, e.g., the training face image may be a region of the original image corresponding to the detected face and rescaled to 128 x 128 pixels.
The set of expected face keypoint parameters may be training heatmaps corresponding to the training face images. Each training heatmap may represent the possible locations of face keypoints at each pixel. Since the locations of the face keypoints in each training face image are known, the value at each pixel may be 0 or 1: 0 may indicate that no face keypoint is present, and 1 may indicate that a face keypoint is present. Such training heatmaps may be represented in one or more of a variety of formats, such as an image in which colors represent the probabilities of certain face keypoints, one or more matrix representations of the numerical probability of a facial feature being present at the matrix position corresponding to each pixel position in the input image, and/or one or more compact vector representations (embeddings) of such information.
In step 520, one of the training face images is input into the neural network system. The training face image may be input to the neural network system by any suitable mechanism. For example, the training face image may be: retrieved by the neural network system from a file; received through an Application Programming Interface (API) call; or received through a remote service call (e.g., a REST or SOAP service call).
In step 530, a face keypoint heatmap for the training face image is generated using the neural network system. The generated face keypoint heatmap may be output in one or more of a variety of formats, such as an image in which colors represent the probabilities of certain face keypoints, one or more matrix representations of the numerical probability of a facial feature being present at the matrix position corresponding to each pixel position in the input image, and/or one or more compact vector representations (embeddings) of such information. The generated face keypoint heatmap may be in the same format as the corresponding set of expected face keypoint parameters, e.g. in the same format as the corresponding training heatmap.
In step 540, the neural network system is trained. By definition, the neural network system includes a plurality of weights. The training includes updating one or more of the plurality of neural network weights to minimize a loss, measured in part based on a comparison of the generated face keypoint heatmap and the corresponding set of expected face keypoint parameters. The loss metric may be a Mean Squared Error (MSE) pixel loss or a sigmoid cross-entropy (CE) pixel loss. Where the generated face keypoint heatmap and the corresponding expected face keypoint parameter set have different formats, the expected face keypoint parameter set may be converted into the format of the generated face keypoint heatmap so that the loss can be calculated. The updating of the one or more weights may be performed by a variant of the gradient descent optimization algorithm employing back-propagation, such as Adam, Nadam, or Adamax.
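A minimal sketch of one such training iteration is given below, assuming a PyTorch model, MSE pixel loss, and the Adam optimizer; the single-layer stand-in model and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 68, kernel_size=3, padding=1)  # stand-in for the full system
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # MSE pixel loss between generated and expected heatmaps

images = torch.randn(8, 3, 128, 128)           # a batch of training face images
target_heatmaps = torch.rand(8, 68, 128, 128)  # expected face keypoint heatmaps

optimizer.zero_grad()
predicted = model(images)                    # generate face keypoint heatmaps
loss = criterion(predicted, target_heatmaps)
loss.backward()                              # back-propagation
optimizer.step()                             # gradient-descent weight update
```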
Although for simplicity the training method above has been described for a single training iteration, it should be noted that in typical embodiments multiple training iterations are performed until a convergence criterion is met, or until an iteration or time limit is reached. In addition, each training iteration may update the one or more weights based on a batch of generated face keypoint heatmaps and the corresponding expected face keypoint parameter sets, using a batch gradient optimization algorithm.
Extended face keypoint localization neural network system training method
FIG. 6 is a flow diagram of an extended exemplary method 600 by which a neural network system may be trained to locate face keypoints in an image. The method may be performed by executing computer-readable instructions using one or more processors of one or more computing devices, for example, computing devices that implement the neural network system 100. The neural network system trained by the method may, for example, be the neural network system 100.
In step 610, a plurality of transformed face images are generated by applying one or more transformations to a plurality of training face images. For example, the applied transformations may be affine transformations or flip transformations. Examples of affine transformations include translation, scaling, and rotation.
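For illustration, such transformed images can be produced with standard utilities; the sketch below uses torchvision's functional transforms, and the rotation angle, translation, and scale values are arbitrary example parameters.

```python
import torch
import torchvision.transforms.functional as TF

image = torch.rand(3, 128, 128)  # a training face image, channels first

# An affine transformation combining rotation, translation, and scaling.
transformed = TF.affine(image, angle=15.0, translate=[4, 0], scale=1.1, shear=[0.0])
# A horizontal flip transformation.
flipped = TF.hflip(image)
print(transformed.shape, flipped.shape)  # both: torch.Size([3, 128, 128])
```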
Steps 510 to 530 of the second neural network system training method 600 correspond to steps 510 to 530 of the first neural network system training method 500.
In step 620, a transformed face image is input into the neural network system, where the transformed face image is a transformation of the input training face image. The transformed image is one of the plurality of transformed face images.
In step 630, a face keypoint heatmap for the transformed face image is generated using the neural network system. The face keypoint heatmap generated for the transformed face image may be in the same format as the face keypoint heatmap generated for the training face image.
In step 540, the neural network system is trained and the plurality of weights are updated as previously described. However, in this example, the neural network weights are also updated to minimize a transformation loss, which is measured in part based on a comparison of the face keypoint heatmaps generated for the training face images and the face keypoint heatmaps generated for the corresponding transformed face images. Minimizing the transformation loss forces the regression network to output coherent landmarks when rotation, scaling, translation, and flipping transformations are applied to the image. The transformation loss metric value and the loss metric value may be combined to form a composite loss metric value, which may be used to update the neural network weights.
An example of a composite loss metric is:
L = Σ_{i=1}^{N} ( ||H_i(I) − G_i(I)||² + ||H_i(T(I)) − T(G_i(I))||² + λ · ||H_i(T(I)) − T(H_i(I))||² )
where N is the number of keypoints and i is the index of a given facial landmark; I is an input image; G(I) is the expected face keypoint heatmap, i.e. the ground truth; H(I) is the generated face keypoint heatmap; T is the transformation; and λ is a weight balancing the loss Lp-p against the loss Lp-g. Lp-p (the third term above) is calculated based on the difference between the face keypoint heatmap generated for a training face image and the face keypoint heatmap generated for the corresponding transformed face image, and Lp-g is calculated based on the difference between a generated face keypoint heatmap and the corresponding expected face keypoint heatmap. Lp-g comprises two components, Lp-g1 and Lp-g2 (the first two terms above): Lp-g1 is based on the difference between the heatmap generated for the training face image and the expected face keypoint heatmap, and Lp-g2 is based on the difference between the heatmap generated for the transformed face image and the expected face keypoint heatmap of the transformed face image. The expected face keypoint heatmap of the transformed face image can be obtained by applying the same transformation used to generate the transformed face image to the expected face keypoint heatmap of the training face image. In the equation above, Mean Squared Error (MSE) pixel loss is used for each component loss metric value; however, any suitable loss function may be used, for example sigmoid cross-entropy (CE) pixel loss. Different loss functions may also be used for different components of the composite loss; for example, sigmoid CE pixel loss may be used for Lp-g and MSE pixel loss for Lp-p.
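A hedged sketch of computing such a composite loss follows, with MSE used for every component; the names H_I, H_TI, G_I, and transform are illustrative assumptions, and a production implementation of a flip would also need to swap left/right landmark channels.

```python
import torch
import torch.nn.functional as F

def composite_loss(H_I, H_TI, G_I, transform, lam: float = 1.0) -> torch.Tensor:
    """Lp-g1 + Lp-g2 + lambda * Lp-p, each as an MSE pixel loss."""
    G_TI = transform(G_I)                    # expected heatmap of transformed image
    l_pg1 = F.mse_loss(H_I, G_I)             # generated vs. expected, original image
    l_pg2 = F.mse_loss(H_TI, G_TI)           # generated vs. expected, transformed image
    l_pp = F.mse_loss(H_TI, transform(H_I))  # transformation-consistency term
    return l_pg1 + l_pg2 + lam * l_pp

# Example with a horizontal flip as the transformation T (channel swapping omitted):
flip = lambda h: torch.flip(h, dims=[-1])
H_I = torch.rand(68, 128, 128)
H_TI = torch.rand(68, 128, 128)
G_I = torch.rand(68, 128, 128)
print(composite_loss(H_I, H_TI, G_I, flip))
```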
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that changes may be made without departing from the scope of the invention, which is defined by the appended claims. Parts of different embodiments may be combined, provided the principles of the embodiments are compatible with each other.

Claims (16)

1. A neural network system for locating key points of a face in an image, the neural network system comprising:
two or more convolutional neural networks in series, each convolutional neural network comprising:
a plurality of downsampling layers;
a plurality of upsampling layers;
a plurality of lateral connections connecting layers of the same size;
wherein each said lateral connection comprises one or more aggregation nodes with an upsampled input;
wherein at least one of the aggregation nodes further comprises a downsampled input.
2. The neural network system of claim 1, wherein each convolutional neural network comprises 3 or 4 downsampling layers and a corresponding number of upsampling layers.
3. The neural network system of claim 2, wherein each convolutional neural network comprises 3 downsampling layers.
4. The neural network system of any preceding claim, wherein only one of the aggregation nodes comprises a downsampling input.
5. The neural network system of any one of claims 1-3, wherein each of the aggregation nodes comprises a downsampled output.
6. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more lateral skip connections.
7. The neural network system of any preceding claim, wherein at least one of the lateral connections comprises one or more convolutions.
8. The neural network system of claim 7, wherein at least one of the convolutions is a depth-wise separable convolution.
9. The neural network system of any preceding claim, further comprising an input layer capable of receiving a 128 x 128 pixel input image.
10. The neural network system of any preceding claim, further comprising a channel aggregation block, the channel aggregation block comprising:
the multiple channels are reduced by step size;
a plurality of channels increase step size;
a plurality of branch connections connecting steps of the same channel size.
11. The neural network system of any preceding claim, wherein each downsampling layer of the plurality of downsampling layers, each upsampling layer of the plurality of upsampling layers, and each aggregation node of the aggregation nodes comprises the channel aggregation block.
12. The neural network system of any preceding claim, wherein the output of each convolutional neural network is connected to a spatial transformer.
13. The neural network system of claim 12, wherein the spatial transformer comprises at least one deformable convolution.
14. A method for locating key points of a face in an image using a neural network system, the method comprising the steps of:
receiving an input image, the input image comprising a plurality of input pixels;
in each convolutional neural network of a plurality of convolutional neural networks in series:
downsampling the input image using a plurality of downsampling layers;
upsampling the downsampled image using a plurality of upsampling layers;
wherein the upsampling comprises merging, using a lateral connection, the upsampled image and an aggregated signal input from a downsampling layer of the same size into the upsampling layer;
wherein each said lateral connection comprises one or more aggregation nodes with an upsampled input;
wherein at least one of the aggregation nodes further comprises a downsampled input;
generating, from the outputs of the series of convolutional neural networks, a corresponding output heatmap for the input image, the heatmap representing possible locations of face keypoints at each input pixel.
15. A method for training the neural network system of any one of claims 1 to 13, comprising:
receiving a plurality of training face images and a plurality of expected face key point parameter sets;
inputting one of the plurality of training face images into the neural network system;
in response to the input, generating a face keypoint heatmap for the training face image using the neural network system;
training the neural network system, including updating one or more of a plurality of neural network weights to minimize a loss, measured in part based on a comparison of the generated face keypoint heatmap and the corresponding set of expected face keypoint parameters.
16. The method of claim 15, further comprising:
generating a plurality of transformed face images by applying one or more transformations to the plurality of training face images;
inputting the transformed face image into the neural network system, wherein the input transformed face image corresponds to the transformation of the input training face image;
in response to the input, generating a face keypoint heatmap for the transformed face image using the neural network system;
wherein training the neural network system further comprises updating the neural network weights to minimize a transformation loss, the transformation loss measured in part based on a comparison of the face keypoint heatmaps generated for the training face images and the face keypoint heatmaps generated for the corresponding transformed face images.
CN201980057333.9A 2018-09-03 2019-08-30 Face key point positioning system and method Pending CN112639810A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GB1814275.2 2018-09-03
GB1814275.2A GB2576784B (en) 2018-09-03 2018-09-03 Facial landmark localisation system and method
PCT/GB2019/052425 WO2020049276A1 (en) 2018-09-03 2019-08-30 System and method for facial landmark localisation using a neural network

Publications (1)

Publication Number Publication Date
CN112639810A true CN112639810A (en) 2021-04-09

Family

ID=63921024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980057333.9A Pending CN112639810A (en) 2018-09-03 2019-08-30 Face key point positioning system and method

Country Status (3)

Country Link
CN (1) CN112639810A (en)
GB (1) GB2576784B (en)
WO (1) WO2020049276A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020098360A1 (en) * 2018-11-15 2020-05-22 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for processing images using cross-stage skip connections
CN110717405B (en) * 2019-09-17 2023-11-24 平安科技(深圳)有限公司 Face feature point positioning method, device, medium and electronic equipment
EP3885996A1 (en) * 2020-03-27 2021-09-29 Aptiv Technologies Limited Method and system for determining an output of a convolutional block of an artificial neural network
CN111597941B (en) * 2020-05-08 2021-02-09 河海大学 Target detection method for dam defect image
US11574500B2 (en) 2020-09-08 2023-02-07 Samsung Electronics Co., Ltd. Real-time facial landmark detection
CN112801043B (en) * 2021-03-11 2022-07-15 河北工业大学 Real-time video face key point detection method based on deep learning
CN113435245B (en) * 2021-05-18 2023-06-30 西安电子科技大学 Method, system and application for identifying individual aerial radiation source

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9836853B1 (en) * 2016-09-06 2017-12-05 Gopro, Inc. Three-dimensional convolutional neural networks for video highlight detection
GB2555431A (en) * 2016-10-27 2018-05-02 Nokia Technologies Oy A method for analysing media content

Also Published As

Publication number Publication date
GB2576784A (en) 2020-03-04
GB2576784B (en) 2021-05-19
WO2020049276A1 (en) 2020-03-12
GB201814275D0 (en) 2018-10-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination