WO2016183834A1 - An apparatus and a method for locating facial landmarks of face image

An apparatus and a method for locating facial landmarks of face image

Info

Publication number
WO2016183834A1
Authority
WO
WIPO (PCT)
Prior art keywords
face image
region
shapes
unit
sub
Application number
PCT/CN2015/079429
Other languages
French (fr)
Inventor
Xiaoou Tang
Shizhan ZHU
Cheng Li
Chen Change Loy
Original Assignee
Xiaoou Tang
Application filed by Xiaoou Tang
Priority to PCT/CN2015/079429 (WO2016183834A1)
Priority to CN201580080396.8A (CN107615295B)
Publication of WO2016183834A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/755 Deformable models or variational models, e.g. snakes or active contours
    • G06V10/7557 Deformable models or variational models, e.g. snakes or active contours based on appearance, e.g. active appearance models [AAM]


Abstract

The present invention discloses a system, an apparatus and a method for locating facial landmarks of a face image. The method for locating facial landmarks of a face image comprises retrieving a set of candidate shapes respectively from a predetermined shape region, each of the candidate shapes having labeled facial landmarks; aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; determining, according to the aligned shapes obtained in a current stage of two or more stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage; and repeating the retrieving, the aligning and the determining for the stages to locate the facial landmarks of the face image. With the present method and system, the final solution can be prevented from being trapped in local optima due to the poor initialization encountered by cascaded regression approaches, and the robustness in coping with large pose variations can be improved.

Description

AN APPARATUS AND A METHOD FOR LOCATING FACIAL LANDMARKS OF FACE IMAGE
Technical Field
The disclosures relate to face alignment, in particular, to a method, an apparatus and a system for locating facial landmarks of a face image.
Background
Face alignment aims at locating facial key points automatically. Among the many different approaches for face alignment, the cascaded regression approach has emerged as one of the most popular. The algorithm typically starts from an initial shape, e.g., a mean shape of training samples, and refines the shape through sequentially trained regressors.
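For contrast with the coarse-to-fine search described later, the following is a minimal sketch of the conventional cascaded-regression loop described above; the names (mean_shape, regressors, extract_features) are illustrative assumptions rather than anything defined in this document.

```python
import numpy as np

def cascaded_regression(image, mean_shape, regressors, extract_features):
    """Conventional cascade: start from a mean shape and refine it with
    sequentially trained regressors, each mapping appearance features
    (indexed by the current shape) to an additive shape update."""
    shape = np.array(mean_shape, dtype=float)   # (2n,) initial shape
    for reg in regressors:                      # sequentially trained regressors
        features = extract_features(image, shape)
        shape = shape + reg(features)           # refine the shape estimate
    return shape
```

Because every update is relative to the current estimate, a poor starting shape propagates through the whole cascade, which is the shortcoming discussed next.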
However, the cascaded regression approach has a widely acknowledged shortcoming: its dependence on initialization. In particular, if the initialized shape is far from the target shape, it is unlikely that the discrepancy will be completely rectified by subsequent iterations in the cascade. As a consequence, the final solution may be trapped in local optima. Existing methods often circumvent this problem by adopting heuristic assumptions or strategies that mitigate the problem to a certain extent but do not fully resolve it.
All the aforementioned methods assume the initial shape is provided in some form, typically a mean shape. The mean shape is used under the assumption that the test samples are distributed close to the mean pose of the training samples. This assumption does not always hold, especially for faces with large pose variations. Cao et al. propose to run the algorithm several times using different initializations and take the median of all predictions as the final output. Burgos-Artizzu et al. improve the strategy with a smart restart method, but it requires cross-validation to determine a threshold and the number of runs. In general, these strategies mitigate the problem to some extent but still do not fully eliminate the dependence on shape initialization. Zhang et al. propose to obtain the initialization by predicting a rough estimation from the global image patch, still followed by sequentially trained auto-encoder regression networks.
Summary
The application aims to address at least one or more of the above problems of face alignment. The method according to the present application begins with a coarse search over a shape space that contains diverse shapes, and employs the coarse solution to constrain the subsequent finer search of shapes, that is, a “coarse-to-fine” approach. The unique stage-by-stage progressive and adaptive search can i) prevent the final solution from being trapped in local optima due to poor initialization, a common problem encountered by cascaded regression approaches; and ii) improve the robustness in coping with large pose variations.
In addition, the apparatus according to the present application proposes a hybrid feature setting to achieve practical speed. Owing to the unique error tolerance of the coarse-to-fine searching mechanism, the apparatus is capable of switching between different types of regression features in different optimization stages without sacrificing much accuracy.
In an aspect, disclosed is a method for locating facial landmarks of a face image. The method may comprise: retrieving a set of candidate shapes respectively from a predetermined shape region, each of the candidate shapes having labeled facial landmarks; aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; determining, according to the aligned shapes obtained in a current stage of two or more stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage; and repeating the retrieving, the aligning and the determining for the stages to locate the facial landmarks of the face image.
In another aspect, disclosed is an apparatus for locating facial landmarks of a face image. The apparatus may comprise: a retrieving unit for retrieving a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks; an aligning unit being electronically communicated with the retrieving unit and aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; and a determining unit being electronically communicated with the aligning unit and determining, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage.
In another aspect, disclosed is a system for locating facial landmarks of a face image. The system may comprise: a memory for storing executable components and a processor being electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise a retrieving component for retrieving a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks; an aligning component for aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; and a determining component for determining, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 illustrates an apparatus for locating facial landmarks of a face image according to one embodiment of the present application.
Fig. 2 illustrates a schematic block diagram of the determining unit according to an embodiment of the present application.
Fig. 3 illustrates a method for locating facial landmarks of a face image according to one embodiment of the present application.
Fig. 4 illustrates a schematic flow of the determining step of the method according to one embodiment of the present application.
Fig. 5 is a diagram illustrating a process for selecting sub-regions in three stages, which is visualized in 2D, according to one embodiment of the present application.
Fig. 6 is an example in which the method for locating facial landmarks is  performed during three stages according to one embodiment of the present application.
Fig. 7 illustrates a system for locating facial landmarks of a face image according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising, " when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Hereinafter, shape space refers to a 2n dimensional linear space, where n refers to the number of landmarks. Shapes in the shape space represent (x, y) coordinates of the n facial landmarks. Sub-region refers to a subset of the shape space, instead of the spatial notion of face region.
Fig. 1 illustrates an apparatus 1000 for locating facial landmarks of a face image according to one embodiment of the present application. As shown, the apparatus 1000 comprises a retrieving unit 100, an aligning unit 200 and a determining unit 300. With the apparatus according to the present application, the locations of facial landmarks of a face, such as eye pupils or mouth corners, can be automatically detected.
As shown in Fig. 1, the retrieving unit 100 may be configured to retrieve a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks. In an embodiment, the candidate shapes are obtained from a set pre-processed by Procrustes analysis. The shape space S is fixed throughout the whole process.
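A rough numpy sketch of the Procrustes pre-processing mentioned above, building a fixed candidate shape space S from training shapes; the choice of reference shape and the similarity-alignment details are assumptions for illustration, not specifics taken from this document.

```python
import numpy as np

def procrustes_align(shape, reference):
    """Align one shape (n, 2) to a reference (n, 2): remove translation
    and scale, then apply the optimal rotation (Procrustes / Kabsch)."""
    a = np.asarray(shape, dtype=float)
    b = np.asarray(reference, dtype=float)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    u, _, vt = np.linalg.svd(a.T @ b)
    return a @ (u @ vt)                      # rotation R = U V^T

def build_shape_space(training_shapes):
    """Return the candidate shape space S as an (N, 2n) array, with every
    training shape Procrustes-normalized against the mean shape."""
    ref = np.mean(np.asarray(training_shapes, dtype=float), axis=0)
    aligned = [procrustes_align(s, ref) for s in training_shapes]
    return np.stack(aligned).reshape(len(aligned), -1)
```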
The aligning unit 200 may be electronically communicated with the retrieving unit 100. The aligning unit 200 may be configured to align each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes. In an embodiment, the aligning unit 200 may further extract facial features from the face image and map the extracted facial features to a shape residual by using at least one regressor, so that the aligned shapes are obtained by using the shape residual. Different numbers and different types of facial features can be extracted in different stages. For example, the SIFT (Scale Invariant Feature Transform) feature is used in all stages to obtain the best accuracy. In an implementation, the BRIEF (Binary Robust Independent Elementary Features) feature is used in the first two stages and the SIFT feature is used in the last stage. It is understood that the present application is not limited thereto; the features can be any known features.
The determining unit 300 may be electronically communicated with the aligning unit 200 and configured to determine, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage. According to an embodiment, the determining unit 300 may further comprise a center inferring unit 301 and a suitability inferring unit 302, which are shown in Fig. 2 and will be described later in detail.
Fig. 3 illustrates a method 2000 for locating facial landmarks of a face image according to one embodiment of the present application. Fig. 4 illustrates a schematic flow of the determining performed by the determining unit 300. The configurations and functions of the elements of the apparatus 1000 and the processes of the method 2000 will be described in detail with reference to Figs. 1-4.
As shown in Fig. 3, at step S100, a set of candidate shapes may be retrieved respectively from a predetermined shape region, each of the candidate shapes having labeled facial landmarks. At step S200, each of the retrieved candidate shapes may be aligned with the face image to obtain corresponding aligned shapes. At S300, according to the aligned shapes obtained in a current stage of two or more stages at step S200, a sub-region of the shape region is determined to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage.
Then, at S400, it is determined whether steps S100 to S300 have finished for all the stages. In an embodiment, the process finishes when a predetermined number of stages have been completed. Note that the present application is not limited thereto; any known method in the art may be used. If yes at step S400, the process ends and the center of the sub-region inferred at a last stage of the stages is determined as the located facial landmarks of the face image, which will be described later. If no, the process proceeds to step S100. The method 2000 begins with a coarse search over the shape space that contains diverse shapes and employs the coarse results to constrain the subsequent finer search of shapes. With the method 2000, the facial landmarks of the face image can be located accurately.
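Putting steps S100 to S400 together, a high-level sketch of the stage loop; retrieve_candidates, align_shapes and determine_subregion are assumed stand-ins for the units 100 to 300 described above.

```python
def coarse_to_fine_search(image, shape_space, num_stages,
                          retrieve_candidates, align_shapes, determine_subregion):
    """Stage-by-stage search: retrieve (S100), align (S200), determine the
    next sub-region (S300), repeat until all stages finish (S400)."""
    sub_region = shape_space                    # the first stage searches the whole space
    center = None
    for stage in range(num_stages):
        candidates = retrieve_candidates(sub_region, stage)       # S100
        aligned = align_shapes(image, candidates, stage)          # S200
        center, sub_region = determine_subregion(aligned, image)  # S300
    return center     # center of the last sub-region = located landmarks
```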
Hereinafter, an example in which N candidate shapes are retrieved from the shape space and denoted as S = {s1, s2, ..., sN} (N>>2n) during l = 1, ..., L stages will be described. Fig. 6 illustrates an exemplary embodiment in which the method 2000 is performed during three stages according to one embodiment. From Fig. 6, it can be seen that the problem of the landmarks on the nose and mouth being trapped in local optima due to poor initialization in the prior art can be overcome by the coarse-to-fine shape searching method. In an implementation of the method according to the present application, 35 fps real-time performance is achieved on a single core of an i5-4590. Compared with the conventional cascaded regression, the estimation error is only 12.04. It is understood that the embodiment is only exemplary and the present application is not limited thereto.
The candidate shapes in S are obtained from the predetermined shape space. At the first stage, a set of Nl candidate shapes sj (l) , j=1, 2, ..., Nl, is retrieved from the shape space S randomly, for example based on a uniform distribution.
The aligning unit 200 may align the Nl candidate shapes with the face image over several iterations. For iteration k=1, 2, ..., K, local appearance patterns φ (I, x̂j (l, k-1) ) are computed as a feature f. Then, the feature f is mapped to a shape residual Δx=Mreg (k) (f) by using the Kl regressors reg (k) . With K iterations, the aligned shape x̂j (l) , j=1, 2, ..., is obtained by
x̂j (l, k) = x̂j (l, k-1) + Mreg (k) (φ (I, x̂j (l, k-1) ) ) , with x̂j (l, 0) = sj (l) and x̂j (l) = x̂j (l, K) .
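A sketch of the per-candidate alignment iterations just described; extract_features and the trained regressors stand in for φ and Mreg (k) and are assumed interfaces rather than anything specified here.

```python
import numpy as np

def align_candidates(image, candidates, regressors, extract_features):
    """Align each retrieved candidate shape with the face image by K
    regression steps: x_hat <- x_hat + M_reg(phi(image, x_hat))."""
    aligned = []
    for s in candidates:                         # s: (2n,) candidate shape
        x_hat = np.array(s, dtype=float)
        for reg in regressors:                   # the K regressors of this stage
            f = extract_features(image, x_hat)   # local appearance patterns
            x_hat = x_hat + reg(f)               # add the predicted shape residual
        aligned.append(x_hat)
    return np.stack(aligned)                     # (N_l, 2n) aligned shapes
```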
After the aligned shapes are obtained, the center inferring unit 301 may infer the center of the sub-region of the shape space. In the lth stage, the sub-region of the shape space is represented by the pair (x (l) , P (l) ) , where x (l) represents the center of the sub-region and P (l) = {pi} represents the suitability probability that defines the scope of the sub-region around the center x (l) .
According to one embodiment, the center of the sub-region is determined by combining linearly all the aligned shapes for collectively inferring the sub-region center as below:
x (l) = Σj wj x̂j (l)     (1)
In the equation (1) , a weight vector w is used. The weight vector may be determined by adopting a dominant set approach. More precisely, an undirected graph G = {V, E} is constructed, where the weight of each edge in E is represented by an affinity defined as below:
apq = exp (-‖x̂p (l) - x̂q (l) ‖² / σ²)     (2)
An affinity matrix A is formed by arranging all the elements apq in matrix form, and the diagonal elements of A are set to zero to avoid self-loops. The weight vector is then obtained by the following replicator dynamics iteration, for t=1, ..., T:
w (t+1) = (w (t) ο (A w (t) ) ) / ( (w (t) ) ᵀ A w (t) )     (3)
where ο denotes element-wise vector multiplication, and w (t) denotes the weight vector at the tth iteration.
From this, the weight vector can be determined. Unlike the conventional approach in which all the aligned shapes are averaged with fixed weights, the susceptibility to a small number of erroneous aligned shapes caused by local optima can be suppressed.
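A numpy sketch of the dominant-set weighting of equations (1) to (3); sigma and the number of replicator iterations are assumed hyper-parameters, not values taken from this document.

```python
import numpy as np

def infer_subregion_center(aligned_shapes, sigma=1.0, num_iters=20):
    """Weight the aligned shapes with replicator dynamics on their affinity
    graph and return the weighted combination as the sub-region center."""
    X = np.asarray(aligned_shapes, dtype=float)            # (N_l, 2n)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / sigma ** 2)                           # affinities a_pq, eq. (2)
    np.fill_diagonal(A, 0.0)                               # no self-loops
    w = np.full(len(X), 1.0 / len(X))                      # uniform start
    for _ in range(num_iters):                             # replicator dynamics, eq. (3)
        w = w * (A @ w)
        w /= w.sum() + 1e-12
    return w @ X                                           # center x (l), eq. (1)
```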
After the center inferring unit 301 infers the center of sub-region by the equation (1) as above, the determining unit 300 can determine the sub-region accordingly, so that a set of candidate shapes will be retrieved from the sub-region according to the suitability probability.
According to another embodiment, the suitability inferring unit 302 may infer, according to the inferred center of the sub-region and the local appearance patterns of the face image, a suitability probability of each candidate shape being suitable to the face image, to determine the sub-region of the shape region. In an embodiment, the suitability inferring unit 302 is further configured to calculate, according to the determined center of the sub-region, an adjustable probability that delineates the scope to be searched near the center; to calculate, according to the local appearance patterns of the face image, a facial part similarity probability for a plurality of facial parts of the face image; and to obtain the suitability by multiplying the adjustable probability and the facial part similarity probability.
In particular, for the center of sub-region x (l) and the shape space {si} , the adjustable probability pi is calculated by the following equation:
pi ∝ exp (- (si - x (l) ) ᵀ Σ⁻¹ (si - x (l) ) / 2)     (4)
The adjustable probability aims to approximately delineate the retrieving scope near x (l) and typically the suitability is more concentrated for the later stages.
In addition, the facial part similarity probability pi is calculated based on local appearance patterns φ extracted from the face image by the following equation:
pi ∝ Πr p (si, r | φr)     (5)
where si, r denotes the landmarks of the facial part r in the candidate shape si and φr denotes the local appearance patterns of that part. The latter term of equation (5) is represented by a discriminative mapping (Hough regression voting) , and the shape is divided into different facial parts r. The facial part similarity probability aims to guide shapes to move towards a more plausible shape region by separately considering the local appearance of each facial part.
Then, the suitability is calculated by multiplying the adjustable probability inferred by equation (4) and the facial part similarity probability inferred by equation (5) as above.
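A sketch of combining equations (4) and (5) into the suitability and using it to retrieve the next stage's candidates; part_similarity stands in for the discriminative (Hough voting) term and sigma_diag for the learned diagonal covariance, both assumed interfaces.

```python
import numpy as np

def suitability(shape_space, center, sigma_diag, part_similarity):
    """p_i = adjustable probability (eq. 4) * facial part similarity (eq. 5),
    normalized over all candidate shapes s_i in the shape space."""
    S = np.asarray(shape_space, dtype=float)                 # (N, 2n)
    diff = S - center
    p_adjust = np.exp(-0.5 * np.sum(diff ** 2 / sigma_diag, axis=1))
    p_part = np.array([part_similarity(s) for s in S])       # product over facial parts
    p = p_adjust * p_part
    return p / p.sum()

def retrieve_next_stage(shape_space, p, num_candidates, rng=None):
    """Select the next stage's candidate shapes from the sub-region
    according to the suitability probability p."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(shape_space), size=num_candidates, replace=False, p=p)
    return np.asarray(shape_space)[idx]
```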
After these processes have been repeated through all the L stages, the center of the sub-region at the last stage is determined as the final shape; that is, the coordinates of the facial landmarks of the face image are determined accurately.
In the above, the method for locating the facial landmarks of the face image has been described with reference to Figs. 1-4. The processes of inferring the center of the sub-region x (l) and inferring the suitability P (l) may be trained by a training algorithm, which is summarized in Table 1.
Table 1 - Training algorithm of the coarse-to-fine shape searching
In a training procedure, the center of the sub-region x (l) for the lth stage is trained with the suitability probability given. In particular, each candidate shape sj (l) , j=1, 2, ..., is regressed to a shape closer to a ground-truth shape x* . For iteration k=1, 2, ..., K, the local appearance information φ (I, x̂j (l, k-1) ) is first computed as the feature; then, the regressors Mreg (k) are trained by:
Mreg (k) = argminM Σj ‖ (x* - x̂j (l, k-1) ) - M (φ (I, x̂j (l, k-1) ) ) ‖²     (6)
finally, x̂j (l, k-1) is updated by
x̂j (l, k) = x̂j (l, k-1) + Mreg (k) (φ (I, x̂j (l, k-1) ) )
to obtain the aligned shapes x̂j (l) , j=1, 2, ....
Then, for the ith training sample, the sub-region center xi (l) is trained by the following equation:
xi (l) = Σj wi, j x̂i, j (l)     (7)
For the weight vector wi, an undirected graph is constructed whose vertices are the aligned shapes. Each edge in the edge set is weighted by an affinity defined as in equation (2) , and the affinities form the affinity matrix Ai. Then, the weight vector wi is optimized by the following equation:
wi = argmaxw wᵀ Ai w, subject to Σj wj = 1 and wj ≥ 0     (8)
which is the dominant set formulation and can be solved by the replicator dynamics iteration of equation (3) .
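A sketch of fitting one stage regressor of equation (6) with regularized linear least squares on shape residuals; the feature layout and the ridge parameter are assumptions, and in practice each of the K regressors is fit on the shapes produced by the previous one.

```python
import numpy as np

def train_stage_regressor(features, current_shapes, gt_shapes, ridge=1.0):
    """Fit a linear regressor mapping features to shape residuals x* - x_hat.
    features: (M, d); current_shapes, gt_shapes: (M, 2n)."""
    F = np.asarray(features, dtype=float)
    targets = np.asarray(gt_shapes, dtype=float) - np.asarray(current_shapes, dtype=float)
    # ridge-regularized least squares: W = (F^T F + ridge * I)^-1 F^T targets
    W = np.linalg.solve(F.T @ F + ridge * np.eye(F.shape[1]), F.T @ targets)
    return lambda f: np.asarray(f, dtype=float) @ W     # feature vector -> delta x
```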
In another training procedure, the suitability P (l) is trained given the center of the sub-region x (l) for the lth stage.
For the adjustable probability pi as represented by equation (4) , the covariance matrix is learned from the ground-truth shape x* and the center of the sub-region x (l) : Σ is the covariance matrix of x (l) - x* over all training samples and is restricted to be diagonal.
For the facial part similarity probability pi as represented by equation (5) , the shape is divided into different facial parts; for a facial part r, the discriminative mapping p (si, r | φr) is learned by Hough regression voting.
Then, the suitability probability P (l) = {pi} is trained by combining the two terms as in the testing stage:
pi ∝ exp (- (si - x (l) ) ᵀ Σ⁻¹ (si - x (l) ) / 2) · Πr p (si, r | φr)     (9)
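A small sketch of the covariance learning for the adjustable probability: the diagonal of Σ is estimated from x (l) - x* over the training samples, as stated above; the eps floor is an added assumption to keep the variances positive.

```python
import numpy as np

def learn_diagonal_covariance(stage_centers, gt_shapes, eps=1e-6):
    """Estimate the diagonal covariance of x(l) - x* over all training
    samples; only per-coordinate variances are kept (Sigma is diagonal)."""
    diff = np.asarray(stage_centers, dtype=float) - np.asarray(gt_shapes, dtype=float)
    return diff.var(axis=0) + eps                 # (2n,) diagonal of Sigma
```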
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a “unit” , “circuit” , “module” or “system” . Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs) , such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 7 illustrates a system 3000 for locating facial landmarks of a face image according to one embodiment of the present application, in which the functions of the present invention are carried out by the software. Referring to Fig. 7, the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000. The executable components may comprise: a retrieving component 3003 for retrieving a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks; an aligning component 3004 for aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; and a determining component 3005 for determining, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage. The functions of the components 3003 to 3005 are similar to those of the units 100 to 300, respectively, and thus the detailed descriptions thereof are omitted herein.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they are also intended to fall within the scope of the present invention.

Claims (20)

  1. A method for locating facial landmarks of a face image, comprising:
    retrieving a set of candidate shapes respectively from a predetermined shape region, each of the candidate shapes having labeled facial landmarks;
    aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes;
    determining, according to the aligned shapes obtained in a current stage of two or more stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage; and
    repeating the retrieving, the aligning and the determining for the stages to locate the facial landmarks of the face image.
  2. The method according to claim 1, wherein the determining further comprises:
    inferring a center of the sub-region, according to the aligned shapes obtained in the current stage and local appearance patterns of the face image.
  3. The method according to claim 2, wherein the located facial landmarks of the face image are determined from the center of the sub-region inferred at a last stage of the stages.
  4. The method according to claim 2, wherein the determining further comprises:
    inferring, according to the inferred center of the sub-region and the local appearance patterns of the face image, a suitability probability of each candidate shape suitable to the face image, to determine the sub-region of the shape region.
  5. The method according to claim 4, wherein the inferring of the suitability probability is further performed by
    calculating, according to the determined center of the sub-region, an adjustable probability of scopes to be adjusted around the center; and
    calculating, according to the local appearance patterns of the face image, facial part similarity probability of facial parts of the face image, to obtain the suitability probability by multiplying the adjustable probability and the facial part similarity probability.
  6. The method according to claim 1, wherein the aligning further comprises:
    extracting facial features from the face image; and
    mapping the extracted facial features to a shape residual by using at least one regressor, so that the aligned shapes are obtained by using the shape residual.
  7. The method according to claim 6, wherein different numbers and different types of facial features can be extracted in different stages.
  8. The method according to claim 7, wherein the facial features extracted in the first stages are SIFT features and those extracted in the other stages are SIFT and BRIEF features.
  9. An apparatus for locating facial landmarks of a face image, comprising:
    a retrieving unit for retrieving a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks;
    an aligning unit being electronically communicated with the retrieving unit and aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; and
    a determining unit being electronically communicated with the aligning unit and determining, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage.
  10. The apparatus according to claim 9, wherein the determining unit further comprises:
    a center inferring unit for inferring a center of the sub-region, according to the aligned  shapes obtained in the current stage and local appearance patterns of the face image.
  11. The apparatus according to claim 10, wherein the located facial landmarks of the face image are determined from the center of the sub-region inferred at a last stage of the stages.
  12. The apparatus according to claim 10, wherein the determining unit further comprises:
    a suitability inferring unit for inferring, according to the inferred center of the sub-region and the local appearance patterns of the face image, a suitability probability of each candidate shape suitable to the face image, to determine the sub-region of the shape region.
  13. The apparatus according to claim 12, wherein the suitability inferring unit is further configured to
    calculate, according to the determined center of the sub-region, an adjustable probability of scopes to be adjusted around the center; and
    calculate, according to the local appearance patterns of the face image, facial part similarity probability of facial parts of the face image, to obtain the suitability probability by multiplying the adjustable probability and the facial part similarity probability.
  14. The apparatus according to claim 9, wherein the aligning unit is further configured to
    extract facial features from the face image; and
    map the extracted facial features to a shape residual by using at least one regressor, so that the aligned shapes are obtained by using the shape residual.
  15. The apparatus according to claim 14, wherein different numbers and different types of facial features can be extracted in different stages.
  16. The apparatus according to claim 15, wherein the features extracted in the first two stages are BRIEF features and those extracted in the other stages are SIFT features.
  17. A system for locating facial landmarks of a face image, comprising:
    an image capturing unit for capturing the face image;
    a retrieving unit for retrieving a set of candidate shapes from a predetermined shape region in one or more sequential stages, each of the candidate shapes having pre-labeled facial landmarks;
    an aligning unit being electronically communicated with the retrieving unit and aligning each of the retrieved candidate shapes with the face image to obtain corresponding aligned shapes; and
    a determining unit being electronically communicated with the aligning unit and determining, according to the aligned shapes obtained in a current stage of the stages, a sub-region of the shape region to select a set of candidate shapes therefrom to be retrieved at a next stage following the current stage.
  18. The system according to claim 17, wherein the determining unit further comprises:
    a center inferring unit for inferring a center of the sub-region, according to the aligned shapes obtained in the current stage and local appearance patterns of the face image; and
    a suitability inferring unit for inferring, according to the inferred center of the sub-region and the local appearance patterns of the face image, a suitability probability of each candidate shape suitable to the face image, to determine the sub-region of the shape region.
  19. The system according to claim 18, further comprising:
    a training unit for training the center inferring unit with a given suitability and training the suitability inferring unit with a given center of the sub-region, so as to modify parameters used by the determining unit.
  20. The system according to claim 19, wherein the located facial landmarks of the face image are determined from the center of the sub-region inferred at a last stage of the stages.
PCT/CN2015/079429 2015-05-21 2015-05-21 An apparatus and a method for locating facial landmarks of face image WO2016183834A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/079429 WO2016183834A1 (en) 2015-05-21 2015-05-21 An apparatus and a method for locating facial landmarks of face image
CN201580080396.8A CN107615295B (en) 2015-05-21 2015-05-21 Apparatus and method for locating key features of face image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/079429 WO2016183834A1 (en) 2015-05-21 2015-05-21 An apparatus and a method for locating facial landmarks of face image

Publications (1)

Publication Number Publication Date
WO2016183834A1 true WO2016183834A1 (en) 2016-11-24

Family

ID=57319095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/079429 WO2016183834A1 (en) 2015-05-21 2015-05-21 An apparatus and a method for locating facial landmarks of face image

Country Status (2)

Country Link
CN (1) CN107615295B (en)
WO (1) WO2016183834A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092294A1 (en) * 2019-06-11 2022-03-24 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and system for facial landmark detection using facial component-specific local refinement

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027521B1 (en) * 2008-03-25 2011-09-27 Videomining Corporation Method and system for robust human gender recognition using facial feature localization
CN103824050A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascade regression-based face key point positioning method
US20140147022A1 (en) * 2012-11-27 2014-05-29 Adobe Systems Incorporated Facial Landmark Localization By Exemplar-Based Graph Matching
US20140185924A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Face Alignment by Explicit Shape Regression

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101877055A (en) * 2009-12-07 2010-11-03 北京中星微电子有限公司 Method and device for positioning key feature point
CN103377382A (en) * 2012-04-27 2013-10-30 通用电气公司 Optimum gradient pursuit for image alignment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027521B1 (en) * 2008-03-25 2011-09-27 Videomining Corporation Method and system for robust human gender recognition using facial feature localization
US20140147022A1 (en) * 2012-11-27 2014-05-29 Adobe Systems Incorporated Facial Landmark Localization By Exemplar-Based Graph Matching
US20140185924A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Face Alignment by Explicit Shape Regression
CN103824050A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascade regression-based face key point positioning method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092294A1 (en) * 2019-06-11 2022-03-24 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and system for facial landmark detection using facial component-specific local refinement

Also Published As

Publication number Publication date
CN107615295A (en) 2018-01-19
CN107615295B (en) 2020-09-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15892213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15892213

Country of ref document: EP

Kind code of ref document: A1