WO2016179808A1

WO2016179808A1 - An apparatus and a method for face parts and face detection

Info

Publication number: WO2016179808A1
Application number: PCT/CN2015/078851
Authority: WO
Inventors: Xiaoou Tang; Shuo YANG; Ping Luo; Chen Change Loy
Original assignee: Xiaoou Tang
Priority date: 2015-05-13
Filing date: 2015-05-13
Publication date: 2016-11-17
Also published as: CN107851192A; CN107851192B

Abstract

Disclosed is an apparatus (1000) for face parts and face detection, comprising: a face proposal unit (100), achieving precise localization of face parts of an input image, and exploiting a spatial structure for inferring face likeliness for each of the parts, and generating bounding box proposals for the input image based on these face likeliness, wherein the generated bounding box proposals include at least one of faces and backgrounds; and a face detection unit (200) being electronically communicated with the face proposal unit and verifying if the generated bounding box proposals include true faces or just backgrounds.

Description

AN APPARATUS AND A METHOD FOR FACE PARTS AND FACE DETECTION

Technical Field

The disclosures relate to an apparatus and a method for face parts and face detection.

Background

There is a long history of using neural networks for the task of face detection. For example, Rowley et al. exploited a set of neural network-based filters to detect the presence of faces at multiple scales, and merged the detections from individual filters. Osadchy et al. demonstrated that joint learning of face detection and pose estimation significantly improves the performance of face detection. The seminal work of Vaillant et al. adopted a two-stage coarse-to-fine detection. Specifically, the first stage approximately locates the face region, whilst the second stage provides a more precise localisation. While great efforts have been devoted to addressing face detection under occlusion, these methods are all confined to frontal faces and do not discover faces under variations of both pose and occlusion.

In the last decades, cascade-based detectors and deformable part model (DPM) detectors have dominated face detection approaches. Viola and Jones introduced fast computation of Haar-like features via the integral image, together with a boosted cascade classifier. Various studies thereafter followed a similar pipeline. Amongst the variants, the SURF cascade was one of the top performers. Later, Chen et al. demonstrated state-of-the-art face detection performance by learning face detection and face alignment jointly in the same cascade framework. Deformable part models define a face as a collection of parts. A latent Support Vector Machine is typically used to find the parts and their relationships. DPM has been shown to be more robust to occlusion than the cascade-based methods. A recent study has also demonstrated state-of-the-art performance with just a vanilla DPM, achieving better results than more sophisticated DPM variants.

A recent study shows that face detection can be further improved by using deep learning, leveraging the high capacity of deep convolutional networks. However, the network proposed in the art does not have an explicit mechanism to handle occlusion, and the face detector therefore fails to detect faces with heavy occlusions.

Summary

The invention aims to address the problem of face detection under severe occlusion and pose variations. The detected faces can then be used for various applications such as face alignment, face tracking, or face recognition.

The present application trains attribute-aware deep convolutional networks (i.e., the face proposal unit) to achieve precise localisation of face parts, and exploits their spatial structure for inferring face likeliness. Bounding box proposals are then generated based on these face likeliness. These proposals may contain both faces and backgrounds, and the bounding boxes are not precise enough. Thus, a face detection unit is then used to verify if these proposals are true faces or just background. The same face detection unit is also employed to obtain bounding boxes with more precise locations.

In an aspect, disclosed is an apparatus for face parts and face detection. The apparatus may comprise:

a face proposal unit for exploiting a spatial structure for inferring face likeliness for each of the face parts of an input image, and generating bounding box proposals for the input image based on these face likeliness; and

a face detection unit being electronically communicated with the face proposal unit and verifying if any of the generated bounding box proposals includes a true face or just a background.

In one embodiment of the present application, the face detection unit further determines a location of the face in the generated bounding box proposals, if at least one of the generated bounding box proposals includes the true face.

In one embodiment of the present application, the face proposal unit may further comprise:

a neural network unit, wherein the neural network unit receives an input image and predicts a target face or face parts for the input image to determine a probability of each pixel of the input image belonging to each predetermined face part;

a faceness measure unit that, based on the determined probability, generates a plurality of pre-proposed bounding boxes, and a probability that each face part is located in the pre-proposed bounding box; and

a bounding box proposal unit that determines the pre-proposed bounding boxes with the probability above the predetermined threshold as a face proposal for said face part.

In a further aspect, disclosed is a method for face parts and face detection, comprising:

achieving a localisation of face parts in an input image;

exploiting, based on the localization, a spatial structure for inferring face likeliness for each of the parts;

generating bounding box proposals for the input image based on these face likeliness, wherein the generated bounding box proposals include at least one of faces and backgrounds; and

verifying if any of the generated bounding box proposals includes a true face or just a background, if yes, the method may further comprise:

determining a location of the face in the generated bounding box proposals.

In a further aspect, disclosed is a method for face parts and face detection comprising:

predicting a target face or face parts for an input image to determine a probability of each pixel of the input image belonging to each predetermined face part of the input image;

generating, based on the determined probability, a plurality of pre-proposed bounding boxes, and a probability that each face part is located in the pre-proposed bounding box;

determining a pre-proposed bounding box with the highest probability (with the probability above the predetermined threshold) as a face proposal for said face part; and

verifying if the generated bounding box proposals include true faces or just backgrounds. The method may further comprise:

determining a location of the faces in the generated bounding box proposals, if it is verified that the generated bounding box proposals include true faces.

In a further aspect, disclosed is a system for face parts and face detection, comprising:

a memory that stores executable components; and

a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise:

a face proposal component configured to exploit a spatial structure for inferring face likeliness for each of the face parts of an input image, and generate bounding box proposals for the input image based on the face likeliness; and

a face detection component configured to verify if the generated bounding box proposals include true faces or just backgrounds.

means for achieving a localisation of face parts in an input image;

means for exploiting, based on the localization, a spatial structure for inferring face likeliness for each of the parts;

means for generating bounding box proposals for the input image based on these face likeliness, wherein the generated bounding box proposals include at least one of faces and backgrounds; and

means for verifying if any of the generated bounding box proposals includes a true face or just a background, if yes, the system may further comprise:

means for determining a location of the face in the generated bounding box proposals.

In a further aspect, disclosed is a system for face parts and face detection comprising:

means for predicting a target face or face parts for an input image to determine a probability of each pixel of the input image belonging to each predetermined face part of the input image;

means for generating, based on the determined probability, a plurality of pre-proposed bounding boxes, and a probability that each face part is located in the pre-proposed bounding box;

means for determining a pre-proposed bounding box with the highest probability (with the probability above the predetermined threshold) as a face proposal for said face part; and

means for verifying if the generated bounding box proposals include true faces or just backgrounds. The system may further comprise:

means for determining a location of the faces in the generated bounding box proposals, if it is verified that the generated bounding box proposals include true faces.

Brief Description of the Drawing

Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.

Fig. 1 illustrates an apparatus 1000 for face parts and face detection according to one embodiment of the present application.

Fig. 2 illustrates a schematic block diagram of the face proposal unit according to an embodiment of the present application.

Fig. 3 is a schematic diagram illustrating a flow process for the training unit to train the multiple or single neural network model according to one embodiment of the present application.

Fig. 4 illustrates a process for neural network unit 101 to predict the target face or face parts according to one embodiment of the present application.

Fig. 5 illustrates a prediction process in the neural network unit configured with multiple CNNs according to one embodiment of the present application.

Fig. 6 is a schematic diagram illustrating a process for the faceness measure unit 102 to generate pre-proposed bounding boxes and faceness score for each pre-proposed bounding box according to one embodiment of the present application.

Fig. 7 is a schematic diagram illustrating examples of the faceness measure for a bounding box according to one embodiment of the present application.

Fig. 8 is a schematic diagram illustrating an example of the faceness measure for the hair part according to one embodiment of the present application.

Fig. 9 is a schematic diagram illustrating a flow chart for the bounding box proposal unit according to one embodiment of the present application.

Fig. 10 illustrates a method for face parts and face detection according to one embodiment of the present application.

Fig. 11 illustrates a method for face parts and face detection according to a further embodiment of the present application.

Fig. 12 illustrates a system for face parts and face detection according to one embodiment of the present application, in which the functions of the present invention are carried out by software.

Detailed Description

Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Fig. 1 illustrates an apparatus 1000 for face parts and face detection according to one embodiment of the present application. As shown, the apparatus 1000 comprises a face proposal unit 100 and a face detection unit 200.

I face proposal unit 100

The face proposal unit 100 is configured to automatically generate face proposal bounding boxes, faceness scores and response maps of face parts, and its output is fed into the face detection unit 200, which is electronically communicated with the face proposal unit 100. To be specific, the face proposal unit 100 is configured to receive imagery data such as an RGB image or an RGBD image. The imagery data can be any form of RGB images or RGBD images. An RGBD image consists of a normal RGB image and a depth image. The depth image is an image in which every pixel represents the distance from the camera sensor to the object in the image. Based on the received imagery data, the face proposal unit 100 operates to output face proposal bounding boxes, the faceness score of each proposed bounding box and response maps of face parts. A bounding box is defined by the coordinates of its top-left and bottom-right points, (x_l, y_l, x_r, y_r).

Fig. 2 illustrates a schematic block diagram of the face proposal unit 100 according to an embodiment of the present application. As shown, the face proposal unit 100 comprises a neural network unit 101, a faceness measure unit 102, and a bounding box proposal unit 103.

1.1) Neural Network Unit 101

The neural network unit 101 may be configured with multiple neural network models or a single neural network model, trained with different supervision information.

Implementation 1:

Given n face parts, for example eye, nose, mouth, hair and beard (other part definitions are possible), a convolutional neural network (CNN) can be trained for each face part by using the face attributes corresponding to that face part as the designated output. Therefore, the neural network system consists of n convolutional neural networks (CNNs).

Implementation 2:

Given n face parts, namely left eye, right eye, nose, mouth, left ear and right ear (other part definitions are possible), one convolutional neural network (CNN) is trained to predict whether the center of an input image falls into a defined face part region at a certain scale of the face part. In this case, the neural network system has only one convolutional neural network (CNN).

The multiple or single neural network model in the neural network unit 101 may be trained by a training unit 300. By inputting a predetermined set of training data, each sample of which is labeled with the ground truth labels corresponding to the designated output, the network(s) can be trained by using different designated outputs (or a combination of them). These include but are not limited to the examples below:

a. Face attributes, such as young, old, big eyes, small eyes, pointed nose, big mouth, and black hair. The ground truths are vectors with each dimension representing the degree of one face attribute. The values of the vectors can be discrete or continuous.

b. Face landmarks, i.e., the coordinates of face key points. Typically, face key points include the center of the left eye, the center of the right eye, the center of the nose and the mouth corners.

c. Face parts. The ground truth labels are binary vectors that indicate whether a predetermined face part appears in the input image.

Fig. 3 is a schematic diagram illustrating a flow process 3000 for the training unit 300 to train the multiple or single neural network model according to one embodiment of the present application. As shown, the process 3000 begins with step s301, in which the training unit 300 draws a data sample and the ground truth labels corresponding to the designated target output from the predetermined training set, and then feeds the data sample and the corresponding ground truth labels to the neural network system. At step s302, based on the data sample and the corresponding ground truth labels, the neural network generates target predictions for the data sample. At step s303, the training unit 300 operates to compute the error between the target predictions and the ground truth labels. In one example, a cross-entropy loss may be used,

where |D| is the number of training samples, x_i is a training sample and y_i is its ground truth label. p(y_i|x_i) is given by the sigmoid function applied to f(x_i), indicating the probability of the presence of the j-th attribute or face part in each pre-defined scale. f(x_i) is the feature representation of training sample x_i generated by the neural network.
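For reference, a standard sigmoid cross-entropy loss that is consistent with the symbols defined above can be written as follows. This is a sketch under the assumption that the usual binary cross-entropy form is intended; the index j over attributes or face parts is dropped for brevity, and the sign convention and lack of normalisation are assumptions rather than details taken from the text.

```latex
L = -\sum_{i=1}^{|D|} \Big[ y_i \log p(y_i = 1 \mid x_i)
      + (1 - y_i) \log\big(1 - p(y_i = 1 \mid x_i)\big) \Big],
\qquad
p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\big(-f(x_i)\big)}
```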

It shall be appreciated that other loss functions may be used to train this neural network unit.

At step s304, the training unit 300 operates to back-propagate the error through the neural network system so as to adjust the weights on the connections between neurons of the neural network system. At step s305, it is determined whether the error is less than a predetermined value, i.e., whether the process has converged. If not, steps s301 to s305 are repeated until it has converged.
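The loop of steps s301 to s305 can be sketched in Python as follows. This is an illustration only: a single linear layer stands in for the neural network system, and the function names, learning rate `lr`, tolerance `tol` and iteration cap `max_iters` are hypothetical choices that do not come from the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_cross_entropy(w, b, samples, labels):
    """Mean sigmoid cross-entropy of a linear model over the whole set."""
    total = 0.0
    for x, y in zip(samples, labels):
        p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, labels, lr=0.5, tol=0.05, max_iters=10000, seed=0):
    """Toy stand-in for steps s301-s305 with f(x) = w.x + b (an assumption)."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    loss = mean_cross_entropy(w, b, samples, labels)
    for _ in range(max_iters):
        # s301: draw a data sample and its ground truth label
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        # s302: generate the target prediction for the sample
        p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
        # s303: error signal; for sigmoid cross-entropy, dL/df = p - y
        grad = p - y
        # s304: propagate the error back and adjust the weights
        w = [wk - lr * grad * xk for wk, xk in zip(w, x)]
        b -= lr * grad
        # s305: stop once the error is below the predetermined value
        loss = mean_cross_entropy(w, b, samples, labels)
        if loss < tol:
            break
    return w, b, loss
```

For example, `train([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])` fits a one-dimensional separable toy set and returns the learned weight, bias and final loss.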

As discussed above, the neural network unit 101 receives imagery data (i.e., the input image) and generates the response maps of the predetermined face parts. Fig. 4 illustrates a process for the neural network unit 101 to predict the target face or face parts according to one embodiment of the present application.

In step s401, for the received imagery data (i.e., given an unseen data sample), the neural network unit 101 with the trained neural network operates to generate a target prediction for the received imagery data. A given trained neural network may produce many target predictions. For example, the trained neural network may operate to predict a set of face part attributes, such as big eyes, small eyes, narrowing eyes, and self-confident eyes. For the input image, the network predicts the probability that each attribute included in the target prediction exists in the input image. Then, at step s402, the neural network unit 101 operates to compute, based on the generated target prediction, the probability of each pixel in the input image belonging to each predetermined face part. Alternatively, the probability may be obtained from feature maps extracted from the neural network. For example, the feature maps can be extracted from the last convolution layer of the convolutional neural network.

In step s403, the neural network unit 101 operates to generate the response maps of the predetermined face parts based on the results of steps s401 and s402. For each predetermined face part, the target prediction generated in step s401 and the probability that each pixel in the input image is located in the predetermined face part, as discussed in step s402, constitute the response map.

In Implementation 1:

In Implementation 1, in which n face parts, for example eye, nose, mouth, hair and beard, are defined (for purpose of discussion, set n = 5), a convolutional neural network (CNN) is trained for each face part by using face attributes as supervisory information. During prediction, each test image is fed into the five trained convolutional neural networks (CNNs), as shown in Fig. 5.

Generally speaking, each of the convolutional neural networks (CNNs) generates m response maps corresponding to a specific face part. The neural network unit 101 combines the m response maps by taking the average or maximal value for each pixel of the m response maps, and generates one response map for each face part at step s403.

For each convolutional neural network, its output can be formulated as

$$h^{v(l)} = \mathrm{relu}\Big(b^{v(l)} + \sum_{u} k^{vu(l)} * h^{u(l-1)}\Big) \qquad (2)$$

where relu(x) = max(0, x) is the rectified linear activation function (other activation functions, such as the sigmoid function, can be used), * denotes the convolution operator, k^{vu(l)} and b^{v(l)} denote the filters and bias, and h^{v(l)} represents the v-th output channel at the l-th layer.
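Equation (2) can be illustrated with a small pure-Python sketch for one output channel v. It assumes channels are nested lists, a "valid" convolution without padding or stride, and a scalar bias per output channel; these choices, and the function names, are illustrative assumptions rather than details given in the text.

```python
def relu(x):
    return max(0.0, x)


def conv2d_valid(image, kernel):
    """2-D 'valid' convolution; the kernel is flipped as in the
    mathematical definition of convolution."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    s += kernel[kh - 1 - a][kw - 1 - b] * image[i + a][j + b]
            out[i][j] = s
    return out


def layer_output(prev_channels, kernels_v, bias_v):
    """h^{v(l)} = relu(b^{v(l)} + sum_u k^{vu(l)} * h^{u(l-1)}).

    prev_channels: list of input channels h^{u(l-1)}
    kernels_v:     one filter k^{vu(l)} per input channel u
    bias_v:        scalar bias b^{v(l)} (assumed scalar here)
    """
    maps = [conv2d_valid(h_u, k_vu) for h_u, k_vu in zip(prev_channels, kernels_v)]
    oh, ow = len(maps[0]), len(maps[0][0])
    return [[relu(bias_v + sum(m[i][j] for m in maps)) for j in range(ow)]
            for i in range(oh)]
```

Many deep learning libraries implement this operation as cross-correlation, without flipping the kernel; since the filters are learned, the two conventions are interchangeable in practice.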

Therefore, the output of each convolutional neural network can be expressed as h^l, i.e., the probability of pixel (i, j) belonging to each predetermined face part for the input image. The response map value for pixel (i, j) can be generated from h^l by combining its m output channels at that pixel, by averaging or by taking the maximum as described above, where (i, j) is the coordinate of the pixel in the output and m is the number of output channels.

The resulting value denotes the probability of pixel (i, j) belonging to the predetermined face part for the input image.
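The per-pixel combination of the m output channels into one response map can be sketched as follows. The function name and the `mode` argument are hypothetical, and the channels are assumed to be equally sized nested lists.

```python
def combine_response_maps(channels, mode="average"):
    """Combine m response maps into one map, pixel by pixel.

    channels: list of m maps, each a list of rows of floats
    mode:     "average" or "max" (the two options named in the text)
    """
    if mode not in ("average", "max"):
        raise ValueError("mode must be 'average' or 'max'")
    m = len(channels)
    rows, cols = len(channels[0]), len(channels[0][0])
    combined = []
    for i in range(rows):
        row = []
        for j in range(cols):
            values = [ch[i][j] for ch in channels]
            row.append(max(values) if mode == "max" else sum(values) / m)
        combined.append(row)
    return combined
```

For instance, combining `[[0.25, 0.5]]` and `[[0.75, 0.0]]` gives `[[0.5, 0.25]]` with averaging and `[[0.75, 0.5]]` with the maximum.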

In Implementation 2:

In this implementation, n face parts, namely left eye, right eye, nose, mouth, left ear and right ear, are defined. Other part definitions are possible. One trained convolutional neural network is used to predict whether the center of an input image falls into a defined face part region at a pre-defined scale. During prediction, each test image is fed into the one trained convolutional neural network. This convolutional neural network outputs 6 response maps corresponding to the 6 face parts. Here, each response map value denotes the probability of pixel (i, j) belonging to the corresponding predetermined face part at a pre-defined scale for the input image. For Implementation 2, the computation is similar to Implementation 1 with n = 1.

Returning to Fig. 2, the face proposal unit 100 further comprises a faceness measure unit 102 for generating the faceness score of each pre-proposed bounding box, and a bounding box proposal unit 103 for proposing the bounding boxes of face candidates. The faceness measure unit 102, based on the determined probability, generates a plurality of pre-proposed bounding boxes, and a probability that each face part is located in the pre-proposed bounding box. The bounding box proposal unit 103 determines the pre-proposed bounding box with the highest probability (with the probability above the predetermined threshold) as a face proposal for said face part.

1.2) Faceness Measure Unit 102

The faceness measure unit 102 receives the response maps of the predetermined face parts generated by the neural network unit 101 for each data sample, and outputs pre-proposed bounding boxes and a faceness score for each pre-proposed bounding box in the input image. This unit takes advantage of part information to deal with occlusion.

Fig. 6 is a schematic diagram illustrating a flow process 6000 for the faceness measure unit 102 to generate pre-proposed bounding boxes and faceness score for each pre-proposed bounding box according to one embodiment of the present application. As shown, the process 6000 begins with step s601, in which a faceness measure is defined for each predetermined face part. For example, in this step, it is defined how to divide the face parts in the pre-proposed bounding box as discussed below.

At step s602, given the response maps of the predetermined face parts and the pre-proposed bounding boxes, the faceness measure unit 102 crops the response maps of the predetermined face parts based on each of the pre-proposed bounding boxes.

The pre-proposed bounding boxes can be generated by various methods. These include but are not limited to the examples below.

a. General object proposal methods, e.g., Selective Search, MCG, Edgebox and sliding window.

b. The output of the neural network. Non-maximal suppression (NMS) and thresholding are first conducted on the faceness maps to get some key points for each face part. For each key point, a bounding box centered on the key point with the pre-defined scale is proposed.
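Option b can be illustrated with a minimal Python sketch. It assumes the faceness map is a nested list of per-pixel scores, uses a 3x3 neighbourhood for non-maximal suppression, and treats `threshold` and `box_size` as hypothetical parameters; none of these specifics are stated in the text.

```python
def keypoint_proposals(faceness_map, threshold, box_size):
    """Propose boxes (x_l, y_l, x_r, y_r) centred on key points.

    A pixel is a key point if its score reaches `threshold` and is
    strictly greater than all of its 3x3 neighbours (a simple NMS).
    """
    rows, cols = len(faceness_map), len(faceness_map[0])
    half = box_size // 2
    boxes = []
    for i in range(rows):          # i is the row (y coordinate)
        for j in range(cols):      # j is the column (x coordinate)
            v = faceness_map[i][j]
            if v < threshold:
                continue
            neighbours = [
                faceness_map[a][b]
                for a in range(max(0, i - 1), min(rows, i + 2))
                for b in range(max(0, j - 1), min(cols, j + 2))
                if (a, b) != (i, j)
            ]
            if all(v > n for n in neighbours):
                boxes.append((j - half, i - half, j + half, i + half))
    return boxes
```

Boxes near the image border may extend outside the image in this sketch; clipping them to the image bounds is left out for brevity.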

For each pre-proposed bounding box, it will have n faceness scores corresponding to n defined face parts as shown in Fig. 7.

At step s603, the faceness measure unit 102 operates to compute the faceness score of the cropped response map of each face part generated in step s602, using the faceness measure defined in step s601 for that specific face part.

To be specific, consider a response map of hair, h^a, generated by the neural network unit 101 in Implementation 1. The faceness score of the hair part is computed as follows.

Consider the faceness score of a window w for the face part. Given a pre-proposed bounding box ABCD, the face part response map is first cropped based on the pre-proposed bounding box ABCD. The bounding box ABCD is then divided into two parts, ABEF and EFCD. How to divide these parts in the pre-proposed bounding box is defined by the faceness measure for each face part. In this case, BE/CE = 1/3 is defined. Alternatively, this ratio can be learned from training data.

The faceness score is computed from the response map values, as defined for the neural network unit 101 above, that fall within these two parts.

Generally, the faceness score is attained by dividing the sum of values in ABEF (red) by the sum of values in FECD (white) from the response map. This value can be efficiently computed using an integral image.
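The ratio of sums can be sketched with an integral image as follows. The sketch assumes A and B are the top corners and C and D the bottom corners of the box, so that ABEF is the upper strip and BE/CE sets its height; this corner labelling, the rounding of the split row and the small `eps` guarding against division by zero are assumptions, since the figure is not reproduced here.

```python
def integral_image(resp):
    """Summed-area table with one extra row and column of zeros."""
    rows, cols = len(resp), len(resp[0])
    ii = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        row_sum = 0.0
        for j in range(cols):
            row_sum += resp[i][j]
            ii[i + 1][j + 1] = ii[i][j + 1] + row_sum
    return ii


def box_sum(ii, x_l, y_l, x_r, y_r):
    """Sum of the response map over columns x_l..x_r-1 and rows y_l..y_r-1."""
    return ii[y_r][x_r] - ii[y_l][x_r] - ii[y_r][x_l] + ii[y_l][x_l]


def hair_faceness(ii, box, ratio=1.0 / 3.0, eps=1e-9):
    """Sum over ABEF (upper strip) divided by sum over FECD (lower strip)."""
    x_l, y_l, x_r, y_r = box
    height = y_r - y_l
    # BE / CE = ratio, so BE = height * ratio / (1 + ratio)
    split = y_l + int(round(height * ratio / (1.0 + ratio)))
    upper = box_sum(ii, x_l, y_l, x_r, split)
    lower = box_sum(ii, x_l, split, x_r, y_r)
    return upper / (lower + eps)
```

Once the integral image is built, each box sum takes four look-ups, so the score of any window costs constant time regardless of its size.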

1.3) Bounding Box Proposal Unit 103

The bounding box proposal unit 103 takes the pre-proposed bounding boxes and the faceness scores of each pre-proposed bounding box as inputs, and outputs bounding boxes and the faceness score of each bounding box.

It is given that there are a plurality of pre-proposed bounding boxes, each of which has a faceness score indicating the probability that the pre-proposed bounding box includes the predetermined face parts. At step s901, the bounding box proposal unit 103 operates to conduct, for each face part, bounding box non-maximum suppression based on the faceness score for this face part. Bounding box non-maximum suppression proceeds by finding the window with the maximum faceness score and then removing all other bounding boxes whose IOU (intersection over union) with it is larger than a pre-defined overlap threshold. After bounding box non-maximum suppression, only the bounding boxes whose faceness score is above a pre-defined threshold are kept.

Then, in step s902, the bounding box proposal unit 103 operates to take the union of all the bounding boxes proposed in step s901 and to add up the faceness scores of each face part for each bounding box to obtain the final faceness score, i.e., the probability that each face part is located in the pre-proposed bounding box. For example, for each defined face part, the bounding box proposal unit 103 conducts non-maximum suppression and thresholding, and then obtains the proposed bounding boxes of that face part. The same process is applied to all face parts. The final proposed bounding boxes are the union of the bounding boxes proposed by all face parts.
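Steps s901 and s902 can be sketched in Python as below. The data layout (a list of boxes plus a dictionary mapping each face part to one score per box), the function names and the default thresholds are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_l, y_l, x_r, y_r)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, overlap_thr):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [k for k in order if iou(boxes[best], boxes[k]) <= overlap_thr]
    return keep


def propose(boxes, part_scores, overlap_thr=0.5, score_thr=0.5):
    """Return {box index: final faceness score} for the proposed boxes."""
    selected = set()
    for scores in part_scores.values():
        # s901: per-part NMS, then keep boxes above the score threshold
        for k in nms(boxes, scores, overlap_thr):
            if scores[k] > score_thr:
                selected.add(k)
    # s902: union over parts; final score is the sum of per-part scores
    return {k: sum(scores[k] for scores in part_scores.values())
            for k in sorted(selected)}
```

With three boxes where the first two overlap heavily, `propose` keeps the stronger of the overlapping pair for each part and reports the summed score of every box that survives for at least one part.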

II face detection unit 200

As discussed in the above, the face proposal unit 100 is designed to achieve precise localisation of face parts of an input image, and exploit their spatial structure for inferring face likeliness. Bounding box proposals are then generated based on these face likeliness. These proposals may contain both faces and backgrounds and the bounding box is not precise enough. Thus, a face detection unit 200 is then used to verify if these proposals are true faces or just background. The face detection unit 200 is also employed to obtain bounding boxes with more precise locations, i.e., the precise locations of the face or face parts in the generated bounding box proposals.

In other words, the face detection unit 200 is electronically communicated with or coupled to the face proposal unit 100, and is designed to give the predictions of the class label and other designated target information based on the bounding boxes and the faceness score of each bounding box generated by the bounding box proposal unit 103. In particular, the face detection unit 200 takes as its input the RGB or RGBD images cropped according to the bounding boxes proposed by the face proposal unit 100, and outputs the class label and other designated target information.

It shall be noted that the face detection unit 200 must predict the class label, i.e., face or non-face. The other target information could be face attributes, face bounding box coordinates, face landmarks and other target information. The face detection unit 200 can be configured with, for example, a neural network, support vector machines, a random forest, boosting or other mechanisms.

The neural network configured in the face detection unit 200 shall also be trained. To this end, a predetermined set of training data is inputted, each sample of which is labeled with the ground truth labels corresponding to the designated output. If the network is used to predict the class label (i.e., face or non-face), the ground truth labels are binary vectors indicating whether a face appears in the input images; if the network is used to predict the class label and face bounding box coordinates, the ground truth labels are the collection of the class label and the face bounding box coordinates. The process for training the neural network configured in the face detection unit 200 can be the same as illustrated in Fig. 3.

Once trained, the face detection unit 200 is capable of predicting the class label of a given data sample and other designated outputs. For example, the bounding boxes proposed by the face proposal unit 100 are fed into the face detection unit 200. For each proposed bounding box, the face detection unit 200 predicts a confidence of whether the proposed bounding box contains a face or not, and the face location in the proposed bounding box. The face detection unit 200 first removes the proposed bounding boxes with a confidence below the threshold. Then, it generates the face detection prediction based on the predicted face location in each proposed bounding box, and conducts bounding box non-maximum suppression based on the confidence of each proposed bounding box by finding the window of maximum confidence and then removing all other bounding boxes with an IOU (intersection over union) larger than a pre-defined overlap threshold. In other words, the proposed bounding boxes are arranged in descending order of confidence, and any proposed bounding box whose overlap with a higher-confidence bounding box exceeds the predetermined threshold is removed.
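The verification stage described above can be sketched as follows. Here `classify` is a hypothetical callable standing in for the trained face detection unit; it is assumed to return a confidence and a refined box for each proposal, and the default thresholds are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_l, y_l, x_r, y_r)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(proposals, classify, conf_thr=0.5, overlap_thr=0.5):
    """Filter proposals by confidence, then suppress overlapping boxes.

    classify: hypothetical callable, box -> (confidence, refined_box)
    Returns a list of (confidence, refined_box), highest confidence first.
    """
    scored = []
    for box in proposals:
        conf, refined = classify(box)
        if conf >= conf_thr:              # drop low-confidence proposals
            scored.append((conf, refined))
    scored.sort(key=lambda t: t[0], reverse=True)   # descending confidence
    detections = []
    for conf, box in scored:
        # keep a box only if it does not overlap a higher-confidence one
        if all(iou(box, kept) <= overlap_thr for _, kept in detections):
            detections.append((conf, box))
    return detections
```

Visiting boxes in descending confidence and keeping only those that do not overlap an already kept box is equivalent to repeatedly picking the maximum-confidence window and removing its overlapping neighbours.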

According to one aspect, there is also provided a method for face parts and face detection. As shown in Fig. 10, at step s1001, it may achieve localisations of face parts in an input image, and at step S1002 exploit a spatial structure for inferring face likeliness for each of the parts. At step s1003, the bounding box proposals for the input image may be generated based on these face likeliness, wherein the generated bounding box proposals include at least one of faces and backgrounds. The steps s1001～1003 may be carried out, for example, by the face proposal unit 100 as discussed in the above, and thus the detailed discussion for the face proposal unit 100 is also applicable to these steps.

At step s1004, it is verified if the generated bounding box proposals include true faces or just backgrounds, if yes, the steps1005 may determine a location of the faces in the generated bounding box proposals. It shall be noted that the steps s1004 and 1005 may be the same as the procedures for the face detection unit 200 as discussed in the above, and thus the detailed description thereof are omitted herein.

Fig. 11 is a schematic diagram illustrating a flow process for a method for face parts and face detection according to a further embodiment of the present application. As shown, at step s1101, a target face or face parts for an input image are predicted to determine a probability of each pixel of the input image belonging to each predetermined face part of the input image. At step s1102, based on the determined probability, a plurality of pre-proposed bounding boxes are generated, together with a probability that each face part is located in each pre-proposed bounding box. At step s1103, the pre-proposed bounding box with the highest probability is determined as a face proposal for said face part; and then at step s1104, it is verified whether the generated bounding box proposals include true faces or just backgrounds. If they include true faces, at step s1105, a location of the faces in the generated bounding box proposals is determined. Since the procedures for the face proposal unit 100 are applicable to the steps s1101 to s1103, and the procedures for the face detection unit 200 are applicable to the steps s1104 to s1105, the detailed description of these steps is omitted herein.
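Steps s1102 and s1103 may be illustrated, for a single face part, by the sketch below. It is an illustration under stated assumptions only: the per-pixel probabilities are held in a nested list, the candidate boxes are supplied by the caller, and the mean probability inside a box is used as a stand-in score. The stand-in is not the faceness measure of the face proposal unit 100, and the names `crop`, `stand_in_score` and `propose_for_part` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical formats: boxes are (x1, y1, x2, y2) with x2 and y2 exclusive;
# a response map is a row-major grid of per-pixel probabilities.
Box = Tuple[int, int, int, int]
ResponseMap = List[List[float]]


def crop(response: ResponseMap, box: Box) -> ResponseMap:
    """Cut the region covered by a pre-proposed box out of a response map."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in response[y1:y2]]


def stand_in_score(cropped: ResponseMap) -> float:
    """Assumed stand-in score: mean per-pixel probability inside the crop."""
    values = [v for row in cropped for v in row]
    return sum(values) / len(values) if values else 0.0


def propose_for_part(response: ResponseMap, candidates: List[Box]) -> Tuple[Box, float]:
    """Score every pre-proposed box (as in s1102) and keep the best one (as in s1103).

    `candidates` must be non-empty.
    """
    scored = [(stand_in_score(crop(response, box)), box) for box in candidates]
    best_score, best_box = max(scored, key=lambda sb: sb[0])
    return best_box, best_score


# Toy 4x4 response map with strong responses in the upper-left corner.
response = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]
candidates = [(0, 0, 2, 2), (2, 2, 4, 4), (0, 0, 4, 4)]
best_box, best_score = propose_for_part(response, candidates)
```

Here the upper-left 2x2 box is selected, because its mean response is the highest of the three candidates.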

As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, the hardware aspects of which may all generally be referred to herein as a “unit”, “circuit”, “module” or “system”. Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.

In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 12 illustrates a system 3000 for face parts and face detection according to one embodiment of the present application, in which the functions of the present invention are carried out by software. Referring to Fig. 12, the system 3000 comprises a memory 3001 that stores executable components and a processor 3002 electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000. The executable components may comprise: a face proposal component 3003 configured to achieve precise localisation of face parts of an input image, exploit a spatial structure for inferring face likeliness for each of the parts, and generate bounding box proposals for the input image based on the face likeliness, wherein the generated bounding box proposals include at least one of faces and backgrounds; and a face detection component 3004 configured to verify whether the generated bounding box proposals include true faces or just backgrounds. If the generated bounding box proposals include true faces, the face detection component 3004 further determines a location of the faces in the generated bounding box proposals. The functions of the components 3003 and 3004 are similar to those of the units 100 and 200, respectively, and thus the detailed descriptions thereof are omitted herein.
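The two-component arrangement of the system 3000 may be pictured with the structural sketch below. It is a hedged illustration only: the class name `FaceSystem`, the callable signatures, and the nested-list image type are assumptions made for the example, and the stub callables at the end stand in for trained components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types used only for this sketch.
Box = Tuple[int, int, int, int]
Image = List[List[float]]


@dataclass
class FaceSystem:
    # Plays the role of the face proposal component 3003:
    # maps an input image to bounding box proposals.
    propose: Callable[[Image], List[Box]]
    # Plays the role of the face detection component 3004: returns the
    # face location for a proposal, or None if it is just background.
    verify: Callable[[Image, Box], Optional[Box]]

    def run(self, image: Image) -> List[Box]:
        """Generate proposals, then keep the locations of verified faces."""
        faces: List[Box] = []
        for proposal in self.propose(image):
            location = self.verify(image, proposal)
            if location is not None:
                faces.append(location)
        return faces


# Stub components for demonstration; real components would be trained models.
system = FaceSystem(
    propose=lambda img: [(0, 0, 2, 2), (2, 2, 4, 4)],
    verify=lambda img, box: box if box[0] == 0 else None,
)
faces = system.run([[0.0]])
```

Keeping the two components as separate callables mirrors the separation between proposal generation and verification, so either stage can be replaced without touching the other.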

Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications fall within the scope of the claims and their technical equivalents, they may also fall within the scope of the present invention.

Claims

An apparatus (1000) for face parts and face detection, comprising:

a face proposal unit (100), exploiting a spatial structure for inferring face likeliness for each of face parts of an input image, and generating bounding box proposals for the input image based on the face likeliness; and

a face detection unit (200) in electronic communication with the face proposal unit and verifying whether any of the generated bounding box proposals includes a true face or just a background.
The apparatus according to claim 1, wherein the face detection unit (200) further determines a location of the face in each of the generated bounding box proposals, if it is verified that at least one of the generated bounding box proposals includes the true face.
The apparatus according to claim 1, wherein the face proposal unit (100) further comprises:

a neural network unit (101) that receives the input image and predicts a target face or face parts for the input image to determine a probability of each pixel of the input image belonging to a respective predetermined face part therein;

a faceness measure unit (102) that, based on the determined probability, generates a plurality of pre-proposed bounding boxes, and probabilities that the face parts are located in corresponding pre-proposed bounding boxes; and

a bounding box proposal unit (103) that determines pre-proposed bounding boxes with a probability above a predetermined threshold as a face proposal for said face part.
The apparatus according to claim 3, wherein the neural network unit (101) is further configured to:

generate a target prediction including a set of face part attributes for the predetermined face part of the input image; and

compute a probability that at least one of the face part attributes exists in the predetermined face part.
The apparatus according to claim 3 or 4, wherein the neural network unit (101) is configured with a plurality of convolutional neural networks, and

wherein, for each predetermined face part, one of the networks was trained by using the set of face attributes as supervisory information.
The apparatus according to claim 5, wherein, the input image is fed into the convolutional neural networks, and each of the convolutional neural networks generates a response map corresponding to a specific face part, and

wherein, the neural network unit (101) takes the average or maximal value for each pixel of all response maps for the input image to generate one response map for each face part so as to indicate a probability of the pixel belonging to each predetermined face part for the input image.
The apparatus according to claim 3 or 4, wherein the neural network unit (101) is configured with one convolutional neural network that was pre-trained so as to predict whether the input image falls into a defined face part region in a pre-defined scale.
The apparatus according to claim 3, wherein the face detection unit (200) is configured to

receive the bounding boxes proposed by the face proposal unit (100); and

predict, for each of the proposed bounding boxes, a confidence as to whether the proposed bounding box contains a face or not and a face location in the proposed bounding box.
The apparatus according to claim 8, wherein the face detection unit (200) is further configured to:

remove at least one proposed bounding box with a confidence below a pre-determined threshold; and

generate a face detection prediction in the proposed bounding box and conduct bounding box non-maximum suppression based on the confidence for the remaining proposed bounding boxes.
The apparatus according to claim 3, wherein the faceness measure unit (102) is further configured to

crop, from given response maps of predetermined face parts, response maps of the predetermined face parts based on given pre-proposed bounding boxes; and

compute a faceness score of each of the cropped response maps for each predetermined face part.
The apparatus according to claim 10, wherein the bounding box proposal unit (103) is further configured to

find a window of maximum faceness score from the computed faceness scores;

remove all other bounding boxes with IOU (intersection over union) larger than a pre-defined overlap threshold; and

unionize all kept bounding boxes and add faceness scores of each face part for each of the kept bounding boxes to obtain the final faceness score indicating a probability that each face part is located in the corresponding pre-proposed bounding box.
A method for face parts and face detection, comprising:

achieving a localization of face parts in an input image;

exploiting, based on the localization, a spatial structure for inferring face likeliness for each of the parts;

generating bounding box proposals for the input image based on the face likeliness; and

verifying whether any of the generated bounding box proposals includes a true face or just a background.
The method according to claim 12, further comprising:

determining a location of the face in the generated bounding box proposals, if it is verified that at least one of the generated bounding box proposals includes the true face.
A method for face parts and face detection, comprising:

predicting a target face or face parts for an input image to determine a probability of each pixel of the input image belonging to each predetermined face part of the input image;

generating, based on the determined probability, a plurality of pre-proposed bounding boxes, and a probability that each predetermined face part is located in the corresponding pre-proposed bounding box;

determining a pre-proposed bounding box with the probability above a predetermined threshold as a face proposal for said face part; and

verifying whether any of the generated bounding box proposals includes a true face or just a background.
The method according to claim 14, further comprising:

determining a location of the faces in the generated bounding box proposals, if it is verified that at least one of the generated bounding box proposals includes the true face.
The method according to any one of claims 12-15, wherein the predicting is carried out in a plurality of convolutional neural networks,

wherein, the input image is fed into the convolutional neural networks, and each of the convolutional neural networks generates a response map corresponding to a specific face part, and

wherein, the average or maximal value for each pixel of all response maps for the input image is taken to generate one response map for each face part so as to indicate a probability of the pixel belonging to each predetermined face part for the input image.
The method according to any one of claims 12-15, wherein the predicting is carried out in one pre-trained convolutional neural network to predict whether the input image falls into a defined face part region in a pre-defined scale.
The method according to any one of claims 12-15, wherein the generating further comprises:

removing at least one proposed bounding box with a confidence below a pre-determined threshold; and

generating a face detection prediction based on a prediction of a face location in the proposed bounding box and conducting bounding box non-maximum suppression based on the confidence for the proposed bounding box.
The method according to any one of claims 12-15, wherein the generating further comprises:

cropping, given response maps of predetermined face parts and pre-proposed bounding boxes, response maps of the predetermined face parts based on each pre-proposed bounding box; and

computing a faceness score of each of the cropped response maps for each face part.
The method according to claim 19, further comprising:

finding a window of maximum faceness score from the computed faceness scores;

removing all other bounding boxes with IOU (intersection over union) larger than a pre-defined overlap threshold; and

unionizing all kept bounding boxes and adding faceness scores of each face part for each of the kept bounding boxes to obtain the final faceness score indicating a probability that each face part is located in the pre-proposed bounding box.
A system for face parts and face detection, comprising:

a memory that stores executable components; and

a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:

a face proposal component configured to exploit a spatial structure for inferring face likeliness for each of face parts of an input image, and generate bounding box proposals for the input image based on the face likeliness; and

a face detection component configured to verify whether any of the generated bounding box proposals includes a true face or just a background.
The system according to claim 21, wherein, if at least one of the generated bounding box proposals includes the true face, the face detection component further determines a location of the face in the generated bounding box proposals.