WO2017084098A1

WO2017084098A1 - System and method for face alignment

Info

Publication number: WO2017084098A1
Application number: PCT/CN2015/095197
Authority: WO
Inventors: Xiaoou Tang; Shizhan ZHU; Cheng Li; Chen Change Loy
Original assignee: Sensetime Group Limited
Priority date: 2015-11-20
Filing date: 2015-11-20
Publication date: 2017-05-26
Also published as: CN108701206B; CN108701206A

Abstract

The present invention discloses a system and method for face alignment. The method for face alignment may comprise: extracting a feature of a face image based on a predetermined face shape in the face image, estimating a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature, computing a regressed shape for each of the plurality of predetermined domains by adding the shape residuals to the face shape, obtaining a feature for each domain based on the regressed shape, predicting a composition vector by using the obtained features, weighting the regressed shapes by using the predicted composition vector, and compositing the weighted regressed shapes to output a compositional shape.

Description

SYSTEM AND METHOD FOR FACE ALIGNMENT

Technical Field

The present application relates to the technical field of pattern recognition, more particularly to a system and a method for face alignment.

Background of the Application

Face alignment aims to automatically localize facial parts, which are essential for many subsequent processing modules, e.g., face recognition, attributes prediction, and robust face frontalisation.

The study of face alignment has made rapid progress in recent years. Unconstrained face alignment beyond frontally biased faces is an emerging research topic. However, the existing methods cannot properly handle faces with unconstrained variations.

For example, the supervised descent method (SDM) is a representative method among the mainstream approaches. As shown in Fig. 1(a), even when the approach is retrained on the AFLW dataset, which provides a good example of images typically found in unconstrained scenarios, its effective scope is confined to frontally biased faces, and it has difficulty covering the enlarged shape parameter space that results from large head rotations and from face deformations caused by rich expressions. Xiong and De la Torre make the same observation: a cascaded regressor such as the SDM is only effective within a specific domain of homogeneous descent (DHD) (see X. Xiong and F. De la Torre. Global supervised descent method. In CVPR, 2015).

An intuitive multi-view approach has also been proposed, in which head poses are first estimated, followed by face alignment on a specific view. Although this improves performance, as shown in Fig. 1(b), heuristic partitioning with respect to head pose alone is still suboptimal because it neglects other shape deformations and appearance variations, e.g. a large mouth, a large face scale or sunglasses. Moreover, this approach assumes independence between different view models without considering their complementary and regularizing roles. Hence, the error caused by head pose estimation could easily be propagated and amplified to the final shape estimation, reducing the overall robustness.

The above approaches demonstrate the difficulties of covering a wider range of shape and appearance variations beyond frontal faces, both with a single model and multiple models.

There is therefore a need for a practical approach to address the problems of unconstrained face alignment.

Summary of the Application

The present application intends to provide an effective and efficient approach for unconstrained face alignment. It does not rely on 3D face modelling or 3D annotations, and does not make assumptions about the pose range. It can comfortably deal with arbitrary view poses and rich expressions in the full AFLW dataset. In addition, the alignment is achieved on a single image without the need for a temporal prior. The present application achieves this by using cascaded compositional learning.

One aspect of the present application discloses a method for face alignment which may comprise: extracting a feature of a face image based on a predetermined face shape in the face image, estimating a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature, computing a regressed shape for each of the plurality of predetermined domains by adding the shape residuals to the face shape, obtaining a feature for each domain based on the regressed shape, predicting a composition vector by using the obtained features, weighting the regressed shapes by using the predicted composition vector, and compositing the weighted regressed shapes to output a compositional shape.

According to an embodiment of the present application, extracting the feature may comprise: traversing a region surrounding each of at least one landmark of the predetermined face shape to each tree of a predetermined decision forest until a leaf node is reached for each tree, obtaining a vector for each of the landmarks, the vector indicating the reached leaf node of the tree, and combining the vector for each of the landmarks to output the extracted feature.

According to an embodiment of the present application, obtaining the feature for each domain may comprise: using the vector for each of the landmarks to obtain the feature for each domain.

According to an embodiment of the present application, predicting the composition vector may comprise: predicting the composition vector by inputting the obtained feature into a predetermined composition forest.

According to an embodiment of the present application, the method may further comprise training the predetermined decision forest by using a Hough forest approach to minimize a structured loss of the predetermined decision forest.

According to an embodiment of the present application, the structured loss of the predetermined decision forest is minimized by regressing the difference between the predetermined face shape and a preset shape for each of the at least one landmark of the predetermined face shape.

According to an embodiment of the present application, the method may further comprise training the regressor by linear regression learning.

According to an embodiment of the present application, the method may further comprise training the predetermined composition forest by minimizing a discrepancy between the compositional shape and a preset shape.

According to an embodiment of the present application, a domain is excluded if the composition vector is zero for the domain.

Another aspect of the present application discloses an apparatus for face alignment which may comprise an extracting means for extracting a feature of a face image based on a predetermined face shape in the face image, an estimating means for estimating a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature, a computing means for computing a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape, an obtaining means for obtaining a feature for each domain based on the regressed shape, a predicting means for predicting a composition vector by using the obtained features, a weighting means for weighting the regressed shapes by using the predicted composition vector, and a compositing means for compositing the weighted regressed shapes to output a compositional shape.

According to an embodiment of the present application, the extracting means may comprise: a traversing sub-means for traversing a region surrounding each of at least one landmark of the predetermined face shape to each tree of a predetermined decision forest until a leaf node is reached for each tree, an obtaining sub-means for obtaining a vector for each of the landmarks, the vector indicating the reached leaf node of the tree, and a combining sub-means for combining the vector for each of the landmarks to output the extracted feature.

According to an embodiment of the present application, the obtaining sub-means may use the vector for each of the landmarks to obtain the feature for each domain.

According to an embodiment of the present application, the predicting means may predict the composition vector by inputting the obtained feature into a predetermined composition forest.

According to an embodiment of the present application, the apparatus may further comprise a decision forest training means for training the predetermined decision forest by using a Hough forest approach to minimize a structured loss of the predetermined decision forest.

According to an embodiment of the present application, the structured loss of the predetermined decision forest may be minimized by regressing the difference between the predetermined face shape and a preset shape for each of the at least one landmark of the predetermined face shape.

According to an embodiment of the present application, the apparatus may further comprise a regressor training means for training the regressor by linear regression learning.

According to an embodiment of the present application, the apparatus may further comprise a composition forest training means for training the predetermined composition forest by minimizing a discrepancy between the compositional shape and a preset shape.

Yet another aspect of the present application discloses a system for face alignment which may comprise a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to: extract a feature of a face image based on a predetermined face shape in the face image, estimate a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature, compute a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape, obtain a feature for each domain based on the regressed shape, predict a composition vector by using the obtained features, weight the regressed shapes by using the predicted composition vector, and composite the weighted regressed shapes to output a compositional shape.

Still another aspect of the present application discloses a non-volatile computer storage medium, storing computer-readable instructions which when executed by a processor, cause the processor to: extract a feature of a face image based on a predetermined face shape in the face image, estimate a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature, compute a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape, obtain a feature for each domain based on the regressed shape, predict a composition vector by using the obtained features, weight the regressed shapes by using the predicted composition vector, and composite the weighted regressed shapes to output a compositional shape.

Brief Description of the Drawing

Other features, objects and advantages of the present application will become more apparent from a reading of the detailed description of the non-limiting embodiments, said description being given in relation to the accompanying drawings, among which:

Fig. 1 illustrates test error distributions of two existing approaches on the AFLW dataset, in which two factors, yaw and mouth size, are selected to visualize the distribution and provide the representative facial images in five regions (I-V);

Fig. 2 illustrates an exemplary flow chart of a method for face alignment according to an embodiment of the present application;

Fig. 3 illustrates an exemplary flowchart of extracting a feature for a face image according to an embodiment of the present application;

Fig. 4 illustrates an exemplary flowchart of obtaining a regressed domain specific shape according to an embodiment of the present application;

Fig. 5 illustrates an exemplary flowchart of predicting a compositional shape according to an embodiment of the present application;

Fig. 6 illustrates a schematic block diagram of an apparatus for face alignment according to an embodiment of the present application; and

Fig. 7 illustrates a schematic structural diagram of a computer system that is adapted for implementing the method and the apparatus for face alignment according to an embodiment of the present application.

Detailed Description

The present application will be further described in detail in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are provided to illustrate the present invention, instead of limiting the present invention. It also should be noted that only parts related to the present invention are shown in the figures for convenience of description.

It should be noted that the embodiments of the present application and the features in the embodiments may be combined with each other on a non-conflict basis. The present application will be further described in detail below in conjunction with the accompanying drawings and embodiments.

Fig. 2 illustrates an exemplary flow chart of a method for face alignment according to an embodiment of the present application.

At step 100, a feature of a face image is extracted. In a non-limiting example, for each landmark on the face image, a binary feature is obtained. The binary features for all the landmarks are subsequently combined to form the feature of the face image.

At step 200, regressed domain specific shapes of the face image are obtained. For each domain, an estimated shape residual is obtained by using the feature of the face image. The estimated shape residual is added to the predetermined shape s of the face image to compute the regressed domain specific shapes.

At step 300, a compositional shape for the face image is predicted. For each domain, a feature is obtained by using the extracted feature of step 100. The feature for each domain is inputted into a composition forest to predict a composition vector. The domain specific shape for each domain is then weighted by the composition vector. All the weighted domain specific shapes are aggregated to obtain a compositional shape of the face image.
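As a non-limiting illustration, steps 100 to 300 can be sketched as one function that only orchestrates the three stages. The function name `align_once` and the three callables passed to it are hypothetical placeholders introduced here; the trained decision forest, regressors and composition forest themselves are not modelled, and the weighted sum used for aggregation is an assumption.

```python
from typing import Callable, List, Sequence

Shape = List[float]  # flattened landmark coordinates [x1, y1, x2, y2, ...]


def align_once(
    image: object,
    shape: Shape,
    extract_feature: Callable[[object, Shape], Sequence[int]],
    regressors: Sequence[Callable[[Sequence[int]], Shape]],
    predict_composition: Callable[[List[Shape], list], Sequence[float]],
) -> Shape:
    """Run steps 100 to 300 once and return the compositional shape."""
    # Step 100: feature of the face image at the predetermined shape s.
    feature = extract_feature(image, shape)

    # Step 200: one regressed shape per domain, s_k = s + delta_s_k.
    regressed: List[Shape] = []
    for regress in regressors:
        residual = regress(feature)
        regressed.append([s + d for s, d in zip(shape, residual)])

    # Step 300: per-domain features at the regressed shapes, then the
    # composition vector p and the weighted aggregation of the shapes.
    domain_features = [extract_feature(image, s_k) for s_k in regressed]
    p = predict_composition(regressed, domain_features)
    composed = [0.0] * len(shape)
    for weight, s_k in zip(p, regressed):
        for i, coord in enumerate(s_k):
            composed[i] += weight * coord
    return composed
```

With two toy domains whose regressors return fixed residuals and a predictor that returns equal weights, the function returns the average of the two regressed shapes.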

Fig. 3 illustrates an exemplary flowchart of extracting a feature for a face image according to an embodiment of the present application.

At step 110, a sample, i.e. a region surrounding each landmark l, is traversed through each tree of a predetermined decision forest until a leaf node is reached in each tree, yielding a binary vector for the landmark that indicates whether each leaf node is reached (1 when reached and 0 otherwise). The dimensionality of this binary vector equals the total number of leaves in the decision forest, and the number of 1s in the vector equals the total number of trees in the forest.

For each landmark, the decision forest can be trained using a Hough forest approach to minimize the structured loss by simultaneously minimizing a landmark regression residual and classifying the facial part and the background. The landmark regression residual is defined as the difference between the predetermined face shape s and a ground-truth shape s* for each landmark. The ground-truth shape s* is preset.
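The structured loss is described only in words above. As a hedged sketch, the two quantities a Hough forest typically scores for a candidate split can be written as a class-label uncertainty (facial part versus background) and an offset uncertainty (spread of the landmark regression residuals). The function names, the `Patch` tuple layout and the choice of entropy and squared deviation are assumptions made for illustration, not a statement of the trained forest actually used.

```python
import math
from typing import Sequence, Tuple

# (label, residual): label 1 = facial part, 0 = background;
# residual = 2D offset between the current and ground-truth landmark.
Patch = Tuple[int, Tuple[float, float]]


def class_uncertainty(patches: Sequence[Patch]) -> float:
    """Set size times the entropy of the part/background labels."""
    n = len(patches)
    if n == 0:
        return 0.0
    positives = sum(label for label, _ in patches)
    entropy = 0.0
    for count in (positives, n - positives):
        if count:
            prob = count / n
            entropy -= prob * math.log(prob)
    return n * entropy


def offset_uncertainty(patches: Sequence[Patch]) -> float:
    """Sum of squared deviations of facial-part residuals from their mean."""
    offsets = [off for label, off in patches if label == 1]
    if not offsets:
        return 0.0
    mean_x = sum(o[0] for o in offsets) / len(offsets)
    mean_y = sum(o[1] for o in offsets) / len(offsets)
    return sum((o[0] - mean_x) ** 2 + (o[1] - mean_y) ** 2 for o in offsets)


def split_cost(left: Sequence[Patch], right: Sequence[Patch],
               use_offsets: bool) -> float:
    """Cost of a candidate split under one of the two measures."""
    measure = offset_uncertainty if use_offsets else class_uncertainty
    return measure(left) + measure(right)
```

A split that separates facial-part patches from background patches and groups similar residuals together has a low cost under both measures, which is the sense in which the two objectives are pursued together.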

At step 120, the binary vectors of all the landmarks are combined to form the extracted feature for the face image.
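Steps 110 and 120 can be illustrated with a small sketch. The `Leaf` and `Split` classes, the pixel test stored in each split node, and the use of concatenation to combine the per-landmark vectors are assumptions made for illustration; only the leaf-indicator encoding itself is taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union

Patch = Sequence[float]  # pixel values of the region around one landmark


@dataclass
class Leaf:
    index: int  # 0-based position of this leaf within its tree


@dataclass
class Split:
    test: Callable[[Patch], bool]  # hypothetical pixel test
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def count_leaves(node: Node) -> int:
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def reached_leaf(node: Node, patch: Patch) -> int:
    """Walk from the root to a leaf and return the index of that leaf."""
    while isinstance(node, Split):
        node = node.left if node.test(patch) else node.right
    return node.index


def landmark_vector(forest: Sequence[Node], patch: Patch) -> List[int]:
    """Step 110: binary vector with exactly one 1 per tree."""
    vector: List[int] = []
    for tree in forest:
        block = [0] * count_leaves(tree)
        block[reached_leaf(tree, patch)] = 1
        vector.extend(block)
    return vector


def image_feature(forests: Sequence[Sequence[Node]],
                  patches: Sequence[Patch]) -> List[int]:
    """Step 120: combine the vectors of all landmarks (here: concatenate)."""
    feature: List[int] = []
    for forest, patch in zip(forests, patches):
        feature.extend(landmark_vector(forest, patch))
    return feature
```

The length of each landmark vector equals the number of leaves in that landmark's forest, and the number of 1s equals the number of trees, matching the properties stated for step 110.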

Fig. 4 illustrates an exemplary flowchart of obtaining a regressed domain specific shape according to an embodiment of the present application.

At step 210, for each domain k, a shape residual Δs_k is estimated by applying a domain-specific regressor ω_k to the extracted feature of the face image.

K domains may be defined by partitioning all training samples into K subsets. For example, all samples may be partitioned according to the principal components of shape and local appearance. Each component halves the samples, and hence K is always a power of 2. It is worth pointing out that head pose is not the only underlying factor for the partition. From the mean face of each domain, it has been observed that some domains are dominated by shape deformation or appearance properties, e.g. a wide-open mouth, large facial scaling, a large face contour or faces with sunglasses. All domains share the same feature mapping.
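The reason K is a power of 2 can be shown with a short sketch. It assumes that each sample already has a projection score on each selected principal component and that a component halves the samples by thresholding at the median score; both the median threshold and the function name `assign_domains` are assumptions made for illustration.

```python
from statistics import median
from typing import List, Sequence


def assign_domains(scores: Sequence[Sequence[float]]) -> List[int]:
    """Map each sample to a domain index in [0, 2**n) from n component scores.

    scores[i][j] is the projection of sample i onto component j.
    """
    if not scores:
        return []
    n_components = len(scores[0])
    thresholds = [median(row[j] for row in scores) for j in range(n_components)]
    domains = []
    for row in scores:
        index = 0
        for j, value in enumerate(row):
            # Each component contributes one bit: above its median or not.
            index = (index << 1) | int(value > thresholds[j])
        domains.append(index)
    return domains
```

With n components, every sample receives an n-bit code, so the number of possible domains is 2 to the power n.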

For each domain k, the domain-specific regressor ω_k may be learned by linear regression learning.
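As a hedged illustration of linear regression learning for ω_k, a common least-squares form is given below. The notation is introduced here and is an assumption: φ_i denotes the extracted feature of training sample i, s_i its predetermined shape, s_i* its ground-truth shape, D_k the set of training samples assigned to domain k, and λ an optional ridge weight.

```latex
\omega_k = \arg\min_{\omega} \sum_{i \in D_k}
  \left\| \left( s_i^{*} - s_i \right) - \omega\, \varphi_i \right\|_2^2
  + \lambda \left\| \omega \right\|_F^2
```

Under this form the regressor is fitted so that the residual it predicts moves the predetermined shape toward the ground-truth shape, which is consistent with computing the regressed shape as the sum of the predetermined shape and the residual.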

At step 220, the regressed domain specific shape s_k is computed by adding the shape residual Δs_k to the predetermined face shape s, i.e., s_k = s + Δs_k (k = 1, …, K).
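Steps 210 and 220 can be sketched as follows, assuming that each regressor ω_k is a plain matrix stored as a list of rows and that the extracted feature is the binary vector of step 120. Because the feature is binary, the matrix-vector product reduces to summing the columns of ω_k at the positions holding a 1. The function names are hypothetical.

```python
from typing import List, Sequence


def shape_residual(omega: Sequence[Sequence[float]],
                   feature: Sequence[int]) -> List[float]:
    """Compute omega times a binary feature by summing the active columns."""
    active = [j for j, bit in enumerate(feature) if bit]
    return [sum(row[j] for j in active) for row in omega]


def regressed_shapes(shape: Sequence[float],
                     omegas: Sequence[Sequence[Sequence[float]]],
                     feature: Sequence[int]) -> List[List[float]]:
    """Return s_k = s + delta_s_k for every domain k."""
    shapes = []
    for omega in omegas:
        delta = shape_residual(omega, feature)
        shapes.append([s + d for s, d in zip(shape, delta)])
    return shapes
```

All domains receive the same feature and differ only in the regressor applied to it, so the K regressed shapes can be computed independently of one another.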

Fig. 5 illustrates an exemplary flowchart of predicting a compositional shape according to an embodiment of the present application.

At step 310, a feature for each domain k is obtained. The previously learned feature mapping is applied at the regressed domain specific shape s_k to obtain the feature for each domain k.

At step 320, the regressed domain specific shape s_k and the feature for the domain are inputted into a predetermined composition forest f' to predict a composition vector p.

The predetermined composition forest f' may be trained by minimizing the discrepancy between the compositional shape s’ and the ground-truth shape s*.
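As a hedged illustration, one way to write this training objective is shown below. The squared Euclidean norm, the sum over training samples i, and the indexed notation p_{i,k} and s_{i,k} are assumptions introduced here; the non-negativity of the composition weights follows the description of the composition vector.

```latex
\min_{f'} \sum_{i} \left\| s'_i - s_i^{*} \right\|_2^2,
\qquad s'_i = \sum_{k=1}^{K} p_{i,k}\, s_{i,k},
\qquad p_{i,k} \ge 0
```

Here the composition weights p_{i,k} are the outputs of the composition forest f' for sample i, so the forest is scored on the final compositional shape rather than on any intermediate domain label.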

The composition vector p is a meaningful quantitative description of domains. For example, the compositions of two incompatible domains (e.g. left and right profile-view domains) should not co-occur. Each composition is also non-negative, so that it provides a valid shape contribution. The composition vector p is estimated after Δs_k so that it can directly exploit the local appearance. This provides the opportunity to handle faces in the unconstrained scenario while still extracting only the fast pixel feature throughout an embodiment of the present application.

At step 330, the domain specific shape s_k is weighted by the composition vector p.

At step 340, the weighted domain specific shapes are aggregated to output the compositional shape s’, i.e., s’ = Σ_k p_k s_k (k = 1, …, K), where p_k is the element of the composition vector p for domain k.
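Steps 330 and 340 can be sketched as a weighted sum over the domain specific shapes, with a domain skipped when its composition weight is zero. Aggregation by summation and the function name `compose_shape` are assumptions made for illustration.

```python
from typing import List, Sequence


def compose_shape(shapes: Sequence[Sequence[float]],
                  p: Sequence[float]) -> List[float]:
    """Return the weighted sum of the shapes, skipping zero-weight domains."""
    if len(shapes) != len(p):
        raise ValueError("need one composition weight per domain")
    if any(weight < 0 for weight in p):
        raise ValueError("composition weights must be non-negative")
    composed = [0.0] * len(shapes[0])
    for weight, shape in zip(p, shapes):
        if weight == 0:
            continue  # excluded domain contributes nothing
        for i, coord in enumerate(shape):
            composed[i] += weight * coord
    return composed
```

A domain whose regressed shape is far off, for example an incompatible profile view, has no effect on the output as long as its weight is zero.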

Fig. 6 illustrates a schematic block diagram of an apparatus for face alignment according to an embodiment of the present application.

As shown in Fig. 6, the apparatus for face alignment 2000 comprises a feature extraction unit 2100, a domain specific regression unit 2200 and a composition prediction unit 2300.

The feature extraction unit 2100 is used for extracting a feature of a face image. The face image and a predetermined shape of the face image are inputted into the feature extraction unit 2100, and the feature of the face image is outputted. In the feature extraction unit 2100, a sample, i.e. a region surrounding each landmark l, is traversed through each tree of a predetermined decision forest until a leaf node is reached in each tree, yielding a binary vector that indicates whether each leaf node is reached (1 when reached and 0 otherwise). The dimensionality of this binary vector equals the total number of leaves in the decision forest, and the number of 1s in the vector equals the total number of trees in the forest. The decision forest can be trained as described above. The feature extraction unit 2100 combines the binary vectors of all the landmarks to form the extracted feature for the face image.

The domain specific regression unit 2200 is used for obtaining regressed domain specific shapes of the face image. The extracted feature of the face image is inputted into the domain specific regression unit 2200, and the regressed domain specific shapes are outputted. In the domain specific regression unit 2200, a shape residual Δs_k is estimated for each domain k by applying a domain-specific regressor ω_k to the extracted feature.

K domains may be defined by partitioning all training samples into K subsets. The domain specific regression unit 2200 then computes the regressed domain specific shape s_k by adding the shape residual Δs_k to the predetermined face shape s.

The composition prediction unit 2300 is used for predicting a compositional shape for the face image. The regressed domain specific shapes are inputted into the composition prediction unit 2300, and the compositional shape for the face image is outputted. In the composition prediction unit 2300, a feature for each domain k is obtained. The feature mapping used for this purpose may be determined in the feature extraction unit 2100. The composition prediction unit 2300 then inputs the regressed domain specific shape s_k and the feature for the domain into a predetermined composition forest f' to predict a composition vector p. The predetermined composition forest f' may be trained by minimizing the discrepancy between the compositional shape s’ and the ground-truth shape s*.

The composition prediction unit 2300 weights the domain specific shapes s_k by using the composition vector p and aggregates the weighted domain specific shapes to output the compositional shape s’.

It should be understood that the units or sub-units described in the apparatus for face alignment 2000 correspond to the steps of the method described above with reference to the flow chart. Therefore, the operations and characteristics described above with reference to the method also apply to the apparatus for face alignment 2000 and the units thereof, and thus will not be repeated herein.

Referring now to Fig. 7, a schematic structural diagram of a computer system 3000 that is adapted for implementing the method and the apparatus for face alignment according to an embodiment of the present application is shown.

As shown in Fig. 7, the computer system 3000 comprises a central processing unit (CPU) 3001, which may perform a variety of appropriate actions and processes according to a program stored in a read only memory (ROM) 3002 or a program loaded to a random access memory (RAM) 3003 from a storage part 3008. RAM 3003 also stores various programs and data required by operations of the system 3000. CPU 3001, ROM 3002 and RAM 3003 are connected to each other via a bus 3004. An input/output (I/O) interface 3005 is also connected to the bus 3004.

The following components are connected to the I/O interface 3005: an input part 3006 comprising a keyboard, a mouse and the like; an output part 3007 comprising a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker and the like; the storage part 3008 comprising a hard disk and the like; and a communication part 3009 comprising a network interface card, such as a LAN card, a modem and the like. The communication part 3009 performs communication processes via a network, such as the Internet. A driver 3010 is also connected to the I/O interface 3005 as required. A removable medium 3011, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, may be installed onto the driver 3010 as required, so that a computer program read therefrom is installed to the storage part 3008 as needed.

In particular, according to the embodiment of the present disclosure, the method described above with reference to Figs. 2 to 5 may be implemented as a computer software program. For example, the embodiment of the present disclosure comprises a computer program product, which comprises a computer program that is tangibly included in a machine-readable medium. The computer program comprises program codes for executing the method in Figs. 2 to 5. In such embodiments, the computer program may be downloaded from the network via the communication part 3009 and installed, and/or be installed from the removable medium 3011.

The flow charts and the block diagrams in the figures illustrate the system architectures, functions, and operations which may be achieved by the systems, devices, methods, and computer program products according to various embodiments of the present application. For this, each block of the flow charts or the block diagrams may represent a module, a program segment, or a portion of the codes which comprise one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions denoted in the blocks may occur in a different sequence from that marked in the figures. For example, two blocks denoted in succession may be performed substantially in parallel, or in an opposite sequence, which depends on the related functions. It should also be noted that each block of the block diagrams and/or the flow charts and the combination thereof may be achieved by a specific system which is based on the hardware and performs the specified functions or operations, or by the combination of the specific hardware and the computer instructions.

The units or modules involved in the embodiments of the present application may be implemented in hardware or software. The described units or modules may also be provided in a processor. The names of these units or modules do not limit the units or modules themselves.

As another aspect, the present application further provides a computer readable storage medium, which may be a computer readable storage medium contained in the device described in the above embodiments, or a computer readable storage medium that exists separately rather than being fitted into any terminal apparatus. One or more computer programs may be stored on the computer readable storage medium, and the programs are executed by one or more processors to perform the face alignment method described in the present application.

The above description covers only the preferred embodiments of the present application and the principles of the applied techniques. It will be appreciated by those skilled in the art that the scope of the claimed solutions as disclosed in the present application is not limited to those consisting of the particular combinations of features described above, but should cover other solutions formed by any combination of the foregoing features or their equivalents without departing from the inventive concepts, for example, a solution formed by replacing one or more features discussed above with one or more features with similar functions disclosed (but not limited to) in the present application.

Claims

A method for face alignment, comprising:

extracting a feature of a face image based on a predetermined face shape in the face image;

estimating a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature;

computing a regressed shape for each of the plurality of predetermined domains by adding the shape residuals to the face shape;

obtaining a feature for each domain based on the regressed shape;

predicting a composition vector by using the obtained features;

weighting the regressed shapes by using the predicted composition vector; and

compositing the weighted regressed shapes to output a compositional shape.
The method of claim 1, wherein extracting the feature comprises:

traversing a region surrounding each of at least one landmark of the predetermined face shape to each tree of a predetermined decision forest until a leaf node is reached for each tree;

obtaining a vector for each of the landmarks, the vector indicating the reached leaf node of the tree; and

combining the vector for each of the landmarks to output the extracted feature.
The method of claim 2, wherein obtaining the feature for each domain comprises:

using the vector for each of the landmarks to obtain the feature for each domain.
The method of claim 1, wherein predicting the composition vector comprises:

predicting the composition vector by inputting the obtained feature into a predetermined composition forest.
The method of claim 1, further comprising training the predetermined decision forest by using a Hough forest approach to minimize a structured loss of the predetermined decision forest.
The method of claim 5, wherein the structured loss of the predetermined decision forest is minimized by regressing the difference between the predetermined face shape and a preset shape for each of the at least one landmark of the predetermined face shape.
The method of claim 1, further comprising training the regressor by linear regression learning.
The method of claim 4, further comprising training the predetermined composition forest by minimizing a discrepancy between the compositional shape and a preset shape.
The method of claim 1, wherein a domain is excluded if the composition vector is zero for the domain.
An apparatus for face alignment, comprising:

an extracting means for extracting a feature of a face image based on a predetermined face shape in the face image;

an estimating means for estimating a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature;

a computing means for computing a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape;

an obtaining means for obtaining a feature for each domain based on the regressed shape;

a predicting means for predicting a composition vector by using the obtained features;

a weighting means for weighting the regressed shapes by using the predicted composition vector; and

a compositing means for compositing the weighted regressed shapes to output a compositional shape.
The apparatus of claim 10, wherein the extracting means comprises:

a traversing sub-means for traversing a region surrounding each of at least one landmark of the predetermined face shape to each tree of a predetermined decision forest until a leaf node is reached for each tree;

an obtaining sub-means for obtaining a vector for each of the landmarks, the vector indicating the reached leaf node of the tree; and

a combining sub-means for combining the vector for each of the landmarks to output the extracted feature.
The apparatus of claim 11, wherein the obtaining sub-means uses the vector for each of the landmarks to obtain the feature for each domain.
The apparatus of claim 10, wherein the predicting means predicts the composition vector by inputting the obtained feature into a predetermined composition forest.
The apparatus of claim 10, further comprising a decision forest training means for training the predetermined decision forest by using a Hough forest approach to minimize a structured loss of the predetermined decision forest.
The apparatus of claim 14, wherein the structured loss of the predetermined decision forest is minimized by regressing the difference between the predetermined face shape and a preset shape for each of the at least one landmark of the predetermined face shape.
The apparatus of claim 10, further comprising a regressor training means for training the regressor by linear regression learning.
The apparatus of claim 13, further comprising a composition forest training means for training the predetermined composition forest by minimizing a discrepancy between the compositional shape and a preset shape.
The apparatus of claim 10, wherein a domain is excluded if the composition vector is zero for the domain.
A system for face alignment, comprising:

a processor; and

a memory;

the memory storing computer-readable instructions which, when executed by the processor, cause the processor to:

extract a feature of a face image based on a predetermined face shape in the face image;

estimate a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature;

compute a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape;

obtain a feature for each domain based on the regressed shape;

predict a composition vector by using the obtained features;

weight the regressed shapes by using the predicted composition vector; and

composite the weighted regressed shapes to output a compositional shape.
The system of claim 19, wherein extracting the feature comprises:

traversing a region surrounding each of at least one landmark of the predetermined face shape to each tree of a predetermined decision forest until a leaf node is reached for each tree;

obtaining a vector for each of the landmarks, the vector indicating the reached leaf node of the tree; and

combining the vector for each of the landmarks to output the extracted feature.
The system of claim 20, wherein obtaining the feature for each domain comprises:

using the vector for each of the landmarks to obtain the feature for each domain.
The system of claim 19, wherein predicting the composition vector comprises:

predicting the composition vector by inputting the obtained feature into a predetermined composition forest.
The system of claim 19, wherein the processor is further configured to train the predetermined decision forest by using a Hough forest approach to minimize a structured loss of the predetermined decision forest.
The system of claim 23, wherein the structured loss of the predetermined decision forest is minimized by regressing the difference between the predetermined face shape and a preset shape for each of the at least one landmark of the predetermined face shape.
The system of claim 19, wherein the processor is further configured to train the regressor by linear regression learning.
The system of claim 22, wherein the processor is further configured to train the predetermined composition forest by minimizing a discrepancy between the compositional shape and a preset shape.
The system of claim 19, wherein the processor is further configured to exclude a domain if the composition vector is zero for the domain.
A non-volatile computer storage medium, storing computer-readable instructions which when executed by a processor, cause the processor to:

extract a feature of a face image based on a predetermined face shape in the face image;

estimate a shape residual for each of a plurality of predetermined domains by applying a regressor to the extracted feature;

compute a regressed shape for each of the plurality of predetermined domains by adding the shape residual to the face shape;

obtain a feature for each domain based on the regressed shape;

predict a composition vector by using the obtained features;

weight the regressed shapes by using the predicted composition vector; and

composite the weighted regressed shapes to output a compositional shape.