WO2016070300A1

WO2016070300A1 - System and method for detecting genuine user

Info

Publication number: WO2016070300A1
Application number: PCT/CN2014/000982
Authority: WO
Inventors: Xiaoou Tang; Tak Wai HUI; Chen Change Loy
Original assignee: Xiaoou Tang
Priority date: 2014-11-07
Filing date: 2014-11-07
Publication date: 2016-05-12
Also published as: CN106937532B; CN106937532A

Abstract

Disclosed is a system for detecting a genuine user. The system may comprise a retriever, a coplanarity determiner, a constructor and a detector. The retriever may receive an image sequence of a subject including at least a first and a second image, and retrieve facial landmarks of the first and second images to form matched pairs of facial landmarks, in which each facial landmark of the first image is matched with a facial landmark in the second image. The coplanarity determiner may determine if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks. The constructor may construct a 3D point cloud for the first and second images from the locations of the facial landmarks if the 3D points associated with the facial landmarks are not coplanar. The detector may detect whether the subject is a real face of the genuine user based on the constructed 3D point cloud.

Description

SYSTEM AND METHOD FOR DETECTING GENUINE USER

Technical Field

The present application generally relates to the field of face recognition and, more particularly, to a system for detecting a genuine user and a mobile apparatus provided with the system. The present application further relates to a method for detecting a genuine user.

Background

Recently, face recognition systems have been applied in various useful applications, such as surveillance, access control, criminal investigation and others. However, face recognition systems may be highly vulnerable to spoofing attacks, in which an imposter tries to bypass the face recognition system using a photograph or video of the owner.

Efforts have been devoted to handling this problem, but most existing methods rely on involuntary movements (e.g., smiling or eye blinking) to detect potential intrusions. For example, analyzing the optical flow pattern of a live face can reveal partial information that discriminates it from a spoofed one. Another example addresses live face detection by using a binary classifier with a Lambertian model. However, these countermeasures can easily be fooled by showing a video clip of the genuine user to the face recognition system.

Another method identifies spoofing attacks by using a dense 3D face structure. However, this method requires a structured-light 3D scanning system to obtain accurate 3D face scans. This solution is neither cost-effective nor feasible for mobile apparatus, e.g., mobile handsets or tablets, since such specialized hardware is absent in these general-purpose apparatus.

Summary

According to an embodiment of the present application, disclosed is a system for detecting a genuine user. The system may comprise a retriever, a coplanarity determiner, a constructor and a detector. The retriever may receive an image sequence of a subject including at least a first and a second image, and retrieve facial landmarks of the first and second images to form matched pairs of facial landmarks, in which each facial landmark of the first image is matched with a facial landmark in the second image. The coplanarity determiner may determine if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks. The constructor may construct a 3D point cloud for the first and second images from the locations of the facial landmarks if the 3D points associated with the facial landmarks are not coplanar. The detector may detect whether the subject is a real face of the genuine user based on the constructed 3D point cloud.

According to another embodiment of the present application, disclosed is a mobile apparatus provided with the system for detecting a genuine user according to any one of the embodiments mentioned above.

According to an embodiment of the present application, disclosed is a method for detecting a genuine user. The method may comprise a step of receiving an image sequence of a subject including at least a first image and a second image; a step of retrieving facial landmarks of the first and second images to form matched pairs of facial landmarks in which each facial landmark of the first image is matched with a facial landmark in the second image; a step of determining if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks; a step of constructing a 3D point cloud for the first and second images from said locations if the 3D points associated with the facial landmarks are not coplanar; and a step of detecting whether the subject is a real face of the genuine user based on the constructed 3D point cloud.

Brief Description of the Drawing

Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.

Fig. 1 is a schematic diagram illustrating a system for detecting a genuine user consistent with an embodiment of the present application.

Fig. 2 is a schematic diagram illustrating a retriever of the system for detecting a genuine user consistent with some disclosed embodiments.

Fig. 3 is a schematic diagram illustrating a coplanarity determiner of the system for detecting a genuine user, consistent with one embodiment of the present application.

Fig. 4 is a schematic diagram illustrating a constructor of the system for detecting a genuine user, consistent with one embodiment of the present application.

Fig. 5 is a schematic diagram illustrating a detector of the system for detecting a genuine user, consistent with one embodiment of the present application.

Fig. 6 is a schematic diagram illustrating the face liveness detection system when it is implemented in software, consistent with some disclosed embodiments.

Fig. 7 is a schematic flowchart illustrating a method for detecting a genuine user consistent with some disclosed embodiments.

Fig. 8 is a schematic flowchart illustrating a method for detecting a genuine user consistent with some other disclosed embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary face liveness detection system 1000 consistent with some disclosed embodiments.

Referring to Fig. 1, when the system 1000 is implemented in hardware, it may comprise a retriever 100, a coplanarity determiner 200, a constructor 300 and a detector 400.

In the embodiment shown in Fig. 1, the retriever 100 may retrieve an image sequence of a subject including at least a first image and a second image. The retriever 100 may further retrieve facial landmarks of the first and second images to form matched pairs of facial landmarks, wherein each facial landmark of the first image is matched with a facial landmark in the second image. The coplanarity determiner 200 may determine if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks retrieved by the retriever 100. If the 3D points associated with the facial landmarks are not coplanar, the constructor 300 may construct a 3D point cloud for the first and second images from the locations of the facial landmarks. The detector 400 may detect whether the subject is a real face of the genuine user based on the constructed 3D point cloud. In one embodiment of the present application, the detector 400 may detect whether the subject is a real face of the genuine user based on the constructed 3D point cloud and a pre-stored 3D facial template.

The retriever 100 may be a general-purpose (color or gray-scale) imaging system provided on a mobile apparatus. According to an embodiment, the retriever 100 may be a monocular camera or a binocular camera (also known as a stereo camera) in which two cameras are rigidly mounted on a platform. Fig. 2 illustrates the image sequences captured by these two kinds of camera.

According to an embodiment, the retriever 100 has an image plane placed at a distance equal to the focal length f from its optical center. The physical image plane of the retriever 100 need not be configured in the same way as this model image plane. A camera coordinate system C: X-Y-Z is defined on the retriever 100, wherein the X-axis points to the right, the Y-axis points downwards and the Z-axis points along the optical axis. In the case that the retriever is a binocular camera, the global camera coordinate system C is placed at the left-camera coordinate system C_l for simplicity; however, the present application is not limited thereto.

According to an embodiment, the coplanarity determiner 200 shown in Fig. 3 may comprise a correspondence relationship measuring device 201 and a distance calculating device 202. The correspondence relationship measuring device 201 may measure a correspondence relationship between the locations of the facial landmarks in the matched pairs of facial landmarks. The distance calculating device 202 may calculate, using the measured correspondence relationship, a distance between each facial landmark in the first image and the corresponding facial landmark in the second image. If an overall distance, computed as a combination of all the measured distances, is smaller than a predetermined threshold, the 3D points associated with the matched pairs of facial landmarks are determined to be coplanar, that is, the subject is determined to be a planar face.

According to an embodiment, the constructor 300 shown in Fig. 4 may comprise a transformation determining device 301 and a depth calculating device 302. The transformation determining device 301 may determine a geometric transformation between the first and second images in accordance with the locations of the extracted facial landmarks. The depth calculating device 302 may calculate a depth value for each of the matched pairs of the facial landmarks.

According to an embodiment, the detector 400 shown in Fig. 5 may comprise a registering device 401 and a similarity calculating device 402. The registering device 401 may align the constructed 3D point cloud with a pre-stored 3D point cloud for the genuine user. The similarity calculating device 402 may calculate a similarity between the constructed 3D point cloud and the pre-stored 3D point cloud.

Hereinafter, each component of the system 1000 will be described in detail in two exemplary embodiments in which a monocular camera and a binocular camera, respectively, are used as the retriever 100. Note that the image sequence may be captured by any other kind of image capturing apparatus; the present application is not limited thereto.

Embodiment 1: monocular camera

As shown in Fig. 2, the monocular camera may capture an image sequence of the subject consisting of image frames I₁, I₂, …, I_S, where S ≥ 2. Hereinafter, for the monocular camera, the first and second images refer to any two image frames of the sequence unless otherwise stated. Consider any two frames of the monocular image sequence, the p-th frame I_p and the q-th frame I_q (p ≠ q), which are associated with the camera coordinate systems C_p and C_q, respectively. For ease of description, all quantities related to I_p and I_q are denoted with sub-indices 1 and 2, respectively. A 3D point X = (X, Y, Z)^T measured in the coordinate system C₁ is denoted as X₁, and the same point measured in the coordinate system C₂ is denoted as X₂.

For any two images I₁ and I₂, the retriever 100 may retrieve N₁ and N₂ facial landmarks, respectively, denoted as p_1i and p_2j in pixel coordinates, where i = 1, 2, …, N₁ and j = 1, 2, …, N₂. According to an embodiment, N₁ is not necessarily equal to N₂. The facial landmarks may be retrieved by any known facial landmark detection method in the art.

According to another embodiment, the number of facial landmarks retrieved in each image may be counted. If too few facial landmarks are retrieved in an image, e.g., fewer than a preset number N, the system allows the user to capture the face again until the time allowed for detection expires.

Each facial landmark in the image I₁ may be matched with a facial landmark in the image I₂ to form matched pairs of facial landmarks, denoted as m₁ in I₁ and m₂ in I₂. In the case that the facial landmarks are not ordered, the retriever 100 may determine the matched facial landmarks m₁ in I₁ and m₂ in I₂ by minimizing a matching cost, for example one based on the normalized cross-correlation (NCC) between two image patches, where a low cost corresponds to a high NCC, as follows:

where M is the number of pixels in the matching window Ω_n, and the patch mean and σ_n are, respectively, the mean and the standard deviation of the intensity values of the image patch centered at p_n, for n = 1, 2. Besides the normalized cross-correlation between two image patches, any other feature descriptor in the art (in the form of a vector) may also be used to define the matching cost.
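Formula (1) itself is missing from this text. The sketch below assumes the common zero-mean NCC over a square window and uses 1 − NCC as the matching cost, so minimizing the cost selects the candidate with the highest correlation. Landmarks are assumed to be integer (x, y) pixel positions at least `half` pixels from the image border; the function names are hypothetical, and each landmark in I₁ is matched greedily (no one-to-one constraint is enforced).

```python
import numpy as np

def ncc(patch1: np.ndarray, patch2: np.ndarray) -> float:
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = patch1.astype(float).ravel()
    b = patch2.astype(float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())  # sigma_1 * sigma_2
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treat as no match
    return float((a * b).mean() / denom)

def extract_patch(img, pt, half):
    """Square window of side 2*half+1 centered at pixel pt = (x, y)."""
    x, y = pt
    return img[y - half:y + half + 1, x - half:x + half + 1]

def match_landmarks(img1, img2, pts1, pts2, half=3):
    """For each landmark in img1, return (i, j) where j minimizes 1 - NCC."""
    matches = []
    for i, p1 in enumerate(pts1):
        w1 = extract_patch(img1, p1, half)
        costs = [1.0 - ncc(w1, extract_patch(img2, p2, half)) for p2 in pts2]
        matches.append((i, int(np.argmin(costs))))
    return matches
```

Since NCC is invariant to affine intensity changes within a patch, this cost tolerates moderate exposure differences between the two frames.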

On the other hand, in the case that the facial landmarks are ordered, each landmark in image I₁ may be matched to the corresponding one in image I₂ by comparing their indices in the ordered lists. In this case, it is not necessary to find the correspondence by measuring the cross-correlation of formula (1) above.

In particular, for an image point p in pixel coordinates, the corresponding image coordinates x are defined as follows:

The projective coordinates of the image point p are defined as follows:

where f_x and f_y represent the focal lengths of the retriever in the x-direction and the y-direction, respectively, (u, v) represents the camera center (principal point) in pixel coordinates, K represents the intrinsic parameter matrix of the camera, and γ represents the skew parameter of the camera.
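Formulas (2) and (3) are not reproduced in this text. Assuming the usual pinhole model, in which projective pixel coordinates and projective normalized image coordinates are related by K = [[f_x, γ, u], [0, f_y, v], [0, 0, 1]], the conversion can be sketched as follows. This form of K is an assumption consistent with the parameters listed above, and the function names are illustrative.

```python
import numpy as np

def intrinsic_matrix(fx, fy, u, v, gamma=0.0):
    """Assumed pinhole intrinsics: focal lengths, principal point (u, v), skew."""
    return np.array([[fx, gamma, u],
                     [0.0, fy, v],
                     [0.0, 0.0, 1.0]])

def pixel_to_normalized(K, pts_px):
    """N x 2 pixel points -> N x 3 projective normalized coordinates K^-1 [p; 1]."""
    pts_h = np.column_stack([pts_px, np.ones(len(pts_px))])
    return np.linalg.solve(K, pts_h.T).T  # last entry stays 1 for this K

def normalized_to_pixel(K, x_h):
    """N x 3 projective normalized coordinates -> N x 2 pixel points."""
    p = x_h @ K.T
    return p[:, :2] / p[:, 2:3]
```

The principal point maps to (0, 0, 1), i.e., a ray along the optical axis Z.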

Then, for the N retrieved matched facial landmarks (m_1k, m_2k), where k = 1, 2, …, N, the matched facial landmarks in the images I₁ and I₂ are related to each other by a fundamental matrix F as follows:

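where F = K^-T E K^-1 and E = [t]×R, in which E is the essential matrix and [t]× denotes the skew-symmetric matrix of the translation vector t, i.e., [t]×v = t × v for any vector v.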

Referring to Fig. 3, the correspondence relationship measuring device 201 may measure the correspondence (i.e., a homography matrix H) between the locations of the facial landmarks in the matched pairs of facial landmarks (m_1k, m_2k) as follows:

The distance calculating device 202 may calculate an overall distance as a combination of all the measured distances between the facial landmarks, for example, the root-mean-square error (RMSE) of the facial landmarks (m_1k, m_2k) under the above-mentioned homography matrix H, as follows:

where m′_2k = Hm_1k and m′_1k = H^-1 m_2k.
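Formula (6) is not shown in this text, so its exact normalization is unknown. A minimal sketch, assuming a direct linear transform (DLT) fit of H with Hartley normalization and a symmetric transfer RMSE built from m′_2k = Hm_1k and m′_1k = H^-1 m_2k. All function names are hypothetical. For landmarks on a flat photograph the error is close to zero, while a real face generally leaves a larger residual.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: centroid to origin, mean distance sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fit_homography(m1, m2):
    """DLT estimate of H with m2 ~ H m1 from N >= 4 pixel pairs (N x 2)."""
    p1, T1 = _normalize(m1)
    p2, T2 = _normalize(m2)
    rows = []
    for (x, y, w), (u, v, z) in zip(p1, p2):
        # Two rows of the cross-product constraint m2 x (H m1) = 0.
        rows.append([0, 0, 0, -z * x, -z * y, -z * w, v * x, v * y, v * w])
        rows.append([z * x, z * y, z * w, 0, 0, 0, -u * x, -u * y, -u * w])
    Hn = np.linalg.svd(np.asarray(rows))[2][-1].reshape(3, 3)
    H = np.linalg.inv(T2) @ Hn @ T1  # undo the normalization
    return H / H[2, 2]

def transfer(H, pts):
    """Apply H to N x 2 pixel points and dehomogenize."""
    ph = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def homography_rmse(m1, m2, H):
    """Symmetric transfer RMSE over both directions (2N transfers)."""
    e2 = np.sum((m2 - transfer(H, m1)) ** 2, axis=1)
    e1 = np.sum((m1 - transfer(np.linalg.inv(H), m2)) ** 2, axis=1)
    return float(np.sqrt(np.mean(e1 + e2) / 2.0))
```

The result is in pixels, so the planarity threshold depends on image resolution and landmark detector noise.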

Thus, the coplanarity determiner 200 may determine whether the 3D points associated with the matched pairs of facial landmarks are coplanar, that is, whether they lie on the same 3D plane. Because two views of a planar object, such as a photograph or a screen, are related exactly by a homography, a small err_H indicates a planar subject: if the calculated err_H is smaller than a pre-defined threshold, the subject is considered to be a planar face, i.e., not a real face. The system then allows the user to capture the face again until the time allowed for detection expires.

On the other hand, if the calculated err_H is higher than the pre-defined threshold, the subject is more likely to be the genuine user's face. The transformation determining device 301 of the constructor 300 shown in Fig. 4 may then determine the geometric transformation (hereinafter, a transformation matrix) between the two images in accordance with the locations of the matched facial landmarks in each image.

For the image I₁, the 3D point X = (X, Y, Z)^T, denoted as X₁, may be mapped to C₂ as follows, where the determined geometric transformation is denoted as T:

where R and t represent the rotation matrix and the translation vector, respectively, and T represents the transformation matrix.

According to another embodiment, if the retrieved facial landmarks are not ordered, the retriever 100 may acquire feature vectors of the facial landmarks. The retriever 100 may then establish the correspondence between each landmark in one image and the corresponding facial landmark in the other image by minimizing the matching cost of the acquired feature vectors. The transformation determining device 301 may then determine the geometric transformation in accordance with the locations associated with the matched pairs of facial landmarks.

According to another embodiment, the system 1000 may further comprise an inertial measurement device (IMU) (not shown) configured to acquire inertial data associated with the images. However, the present application is not limited thereto, and the system 1000 may acquire the inertial data from an IMU provided outside the system 1000. The IMU may comprise a 3-axis accelerometer, a 3-axis gyroscope and a 3-axis magnetometer. The constructor 300 is described in detail below for the cases in which the IMU is not available and is available. Note that a single 3D point cloud is constructed from two monocular images.

Case 1: IMU is not available

Referring to Fig. 4, the transformation determining device 301 of the constructor 300 may first calculate the fundamental matrix F between the two sets of matched facial landmarks {m₁} and {m₂} for I₁ and I₂ by using a normalized 8-point algorithm or any similar method in the art. The fundamental matrix F is then converted to the essential matrix E (E = K^T F K, which follows from F = K^-T E K^-1). Alternatively, the essential matrix E may be determined directly by using a 5-point algorithm or any similar method in the art. The transformation matrix T is then recovered from the essential matrix E by singular value decomposition (SVD).
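The chain F → E → (R, t) described above can be sketched with NumPy. This is a generic textbook implementation, not necessarily the patent's own: it assumes N ≥ 8 matched pixel landmarks as N × 2 arrays and a known K, and the function names are hypothetical. Only the direction of t is recoverable from two monocular views, so the reconstructed point cloud has an arbitrary scale. Choosing the single physically valid candidate (the one placing triangulated points in front of both cameras) is left out for brevity.

```python
import numpy as np

def _hartley(pts):
    """Similarity T moving the centroid to 0 and the mean distance to sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
    return (T @ np.column_stack([pts, np.ones(len(pts))]).T).T, T

def eight_point(m1, m2):
    """Normalized 8-point estimate of F with m2~^T F m1~ = 0."""
    p1, T1 = _hartley(m1)
    p2, T2 = _hartley(m2)
    # Each pair gives one linear equation in the 9 entries of F (row-major).
    A = np.array([np.outer(b, a).ravel() for a, b in zip(p1, p2)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt  # enforce rank 2
    F = T2.T @ F @ T1                        # undo the normalization
    return F / np.linalg.norm(F)

def essential_from_fundamental(F, K):
    return K.T @ F @ K  # inverse of F = K^-T E K^-1

def decompose_essential(E):
    """Four (R, t) candidates with E ~ [t]x R; t has unit norm."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = U[:, 2]  # left null vector of E
    Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
    return [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]
```

In practice, robust estimation (e.g., RANSAC around the 8-point or 5-point solver) would be needed to cope with mismatched landmarks.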

The depth calculating device 302 of the constructor 300 may then calculate the projective coordinates of the matched facial landmarks in (I₁, I₂) by formula (3) above. From these, the depth calculating device 302 may establish a 3D point cloud for the two images I₁ and I₂. In particular, the Z-component of the 3D point (relative to C₂) associated with each pair of facial landmarks can be determined as follows:

After all the Z-components are found, the corresponding 3D point X₂ with respect to C₂ can be determined as follows:

The constructor 300 can then construct the 3D point cloud {X₂} of the subject by collecting all the 3D points resulting from formula (9) above.
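Formulas (8) and (9) are also missing here. One standard way to obtain the Z-component relative to C₂, assuming X₂ = RX₁ + t and projective normalized coordinates x̃₁, x̃₂ whose last entry is 1, is to write Z₂x̃₂ = Z₁Rx̃₁ + t and eliminate Z₁ by taking the cross product with Rx̃₁. This gives Z₂(Rx̃₁ × x̃₂) = Rx̃₁ × t, and the point is then X₂ = Z₂x̃₂. This is a generic two-view triangulation under those assumptions, not necessarily the patent's exact formula, and the function names are hypothetical.

```python
import numpy as np

def depth_in_second_view(x1, x2, R, t):
    """Z-component relative to C2 from Z2 (a x x2) = a x t with a = R x1,
    solved in the least-squares sense (ill-conditioned near the baseline)."""
    a = R @ x1
    u = np.cross(a, x2)
    w = np.cross(a, t)
    return float(u @ w / (u @ u))

def triangulate(x1s, x2s, R, t):
    """3D point cloud {X2} relative to C2: X2 = Z2 * x2 for each matched pair."""
    return np.array([depth_in_second_view(a, b, R, t) * b for a, b in zip(x1s, x2s)])
```

With t from the essential matrix normalized to unit length, the resulting cloud is correct only up to a global scale.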

Case 2: IMU is available

If the IMU is available, the transformation determining device 301 may determine the geometric transformation in accordance with locations of the facial landmarks and the inertial data acquired from the IMU.
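The text does not specify how the inertial data enter the computation. One possible approach, assuming the IMU (for example, by integrating the gyroscope) supplies the relative rotation R already expressed in the camera frame, is to solve only for the translation direction: the epipolar constraint x̃₂^T[t]×Rx̃₁ = 0 equals t · ((Rx̃₁) × x̃₂) = 0, which is linear in t. The function name is hypothetical, and x̃₁, x̃₂ are assumed to be projective normalized coordinates.

```python
import numpy as np

def translation_given_rotation(x1s, x2s, R):
    """Unit translation direction (sign ambiguous) when R is known.

    Each matched pair contributes one row (R x1) x x2 orthogonal to t,
    so t is the right singular vector of the smallest singular value.
    """
    A = np.array([np.cross(R @ a, b) for a, b in zip(x1s, x2s)])
    return np.linalg.svd(A)[2][-1]
```

Fixing R from the IMU reduces the unknowns from five to two, so fewer landmark pairs are needed than for the 8-point algorithm.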

The depth calculating device 302 may then calculate the projective coordinates of the matched facial landmarks in (I₁, I₂). Unlike case 1, the Z-component of the 3D point (relative to C₂) associated with each pair of facial landmarks can be determined as follows:

Then, after all the Z-components are determined by formula (10), the corresponding 3D point cloud {X₂} with respect to C₂ can be determined according to formula (9) above.

Thus, in both cases 1 and 2, the 3D point cloud {X_m} is constructed using the two images I₁ and I₂ of the subject.

Then, referring to Fig. 5, the registering device 401 of the detector 400 may align the constructed 3D point cloud {X_m} with a pre-stored 3D point cloud {X_ref} for the genuine user. The pre-stored 3D point cloud for the genuine user serves as a reference 3D template. According to an embodiment, the 3D template may be extracted by the aforementioned procedures and stored in a non-volatile memory of the mobile device before face liveness detection. The registered 3D point cloud is denoted as {X′_m}. Then, the 3D distance between the registered 3D point cloud and the pre-stored 3D point cloud is calculated as follows:

If the score err_3D is smaller than a pre-defined threshold, the subject is considered to be the real face of the genuine user. Otherwise, the system allows the user to perform another face scan until the face liveness detection time is over.
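Neither the registration method nor formula (11) is given in this text. A minimal sketch, assuming the landmarks are ordered (so point k of {X_m} corresponds to point k of {X_ref}) and using the Umeyama least-squares similarity alignment, which also absorbs the unknown scale of a monocular reconstruction; err_3D is taken here as the RMSE after alignment. Function names are hypothetical. If correspondences were unknown, an iterative scheme such as ICP would be needed instead.

```python
import numpy as np

def align_similarity(src, dst):
    """Umeyama similarity (s, R, t) minimizing sum ||s R src_k + t - dst_k||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    a, b = src - mu_s, dst - mu_d
    cov = b.T @ a / len(src)                 # cross-covariance dst vs src
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                       # forbid reflections
    R = U @ D @ Vt
    var_s = (a ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def err_3d(cloud, template):
    """RMSE between the registered cloud {X'_m} and the template {X_ref}."""
    s, R, t = align_similarity(cloud, template)
    registered = s * cloud @ R.T + t
    return float(np.sqrt(np.mean(np.sum((registered - template) ** 2, axis=1))))
```

Because the alignment removes scale, the threshold on err_3D is expressed in the template's units.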

Alternatively, the coplanarity determiner 200 may also be configured to determine whether the facial landmarks in the matched pairs of facial landmarks are coplanar by the following process.

First, a planar fit of the 3D point cloud {X} constructed from the monocular image frames (I₁, I₂) is performed using the 3D plane equation Z_k = aX_k + bY_k + c, k = 1, 2, …, N. The plane parameters n = (a, b, c)^T, which define the plane normal (a, b, −1)^T, are then determined by solving the following least-squares problem:

$$n = \arg\min_{a,b,c} \sum_{k=1}^{N} \left(aX_k + bY_k + c - Z_k\right)^2, \qquad (12)$$

or

Then, the distance calculating device 202 may calculate the root-mean-square error (RMSE) of the planar fit of the 3D points as follows:

Similar to the value err_H, if err_plane is smaller than the pre-defined threshold, the subject is considered to be a planar face. The system allows the user to perform another face scan until the time duration is over.
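Formula (12) is an ordinary linear least-squares problem. The sketch below solves it with NumPy and assumes that err_plane in formula (14) is the RMSE of the same residuals aX_k + bY_k + c − Z_k; that definition and the function names are assumptions. A planar photograph yields near-zero error, whereas the curved surface of a face does not.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane Z = aX + bY + c (formula (12)); returns (a, b, c)."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    abc, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return abc

def plane_rmse(points, abc):
    """RMSE of the vertical residuals aX + bY + c - Z."""
    a, b, c = abc
    r = a * points[:, 0] + b * points[:, 1] + c - points[:, 2]
    return float(np.sqrt(np.mean(r ** 2)))
```

Note that these residuals are measured along Z rather than perpendicular to the plane, which matches the parameterization of formula (12) but degrades for planes nearly parallel to the optical axis.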

Embodiment 2: binocular camera (also referred to as stereo camera)

The above embodiment in which the monocular camera is used as the retriever 100 has been described. Hereinafter, another embodiment in which a binocular camera is used as the retriever 100 will be described.

For the binocular camera, the image sequence (i.e., stereo sequence) consists of S stereo image frames, where S ≥ 1, and each stereo image frame has two images, namely a left image I^l captured by a left camera and a right image I^r captured by a right camera. The left and right images are captured simultaneously by the stereo camera. Hereinafter, for the stereo camera, the first and second images (I₁, I₂) refer to the left and right images (I^l, I^r) of any one stereo image frame, unless otherwise stated.

As in the monocular case, the system 1000 according to the second embodiment can be implemented with the same configuration as described in Embodiment 1, simply by replacing the sub-indices 1 and 2 in the image pair (I₁, I₂) with l and r in the stereo image pair (I^l, I^r).

In particular, the correspondence relationship measuring device 201 of the coplanarity determiner 200 measures the homography matrix H between the locations of the facial landmarks in the matched pairs of facial landmarks as follows:

Then the distance calculating device 202 of the coplanarity determiner 200 calculates err_H by replacing the pair (I₁, I₂) with (I^l, I^r) in accordance with the formula (6) in the embodiment 1. If the calculated err_H is smaller than a pre-defined threshold, the subject is considered to be a planar face.

Then, the constructor 300 may construct a single 3D point cloud for a stereo image frame (I^l, I^r). In particular, the intra-transformation matrix T_lr = (R_lr, t_lr) between the left image I^l and the right image I^r of the same stereo image frame may be determined by the transformation determining device 301. According to another embodiment, the intra-transformation matrix T_lr = (R_lr, t_lr) need not be determined by the transformation determining device 301: since it is fixed once the stereo camera is calibrated, it can be determined beforehand and used permanently. After the intra-transformation matrix T_lr = (R_lr, t_lr) is determined and the projective coordinates of the matched facial landmarks are determined as in Embodiment 1, the constructor 300 constructs the 3D point cloud similarly using formulas (10) and (9), replacing the indices 1 and 2 of all variables with l and r, respectively.

According to another embodiment, if the two images I^l and I^r in the stereo frame are rectified, the intra-transformation matrix T_lr = (R_lr, t_lr) is given by R_lr = I₃ (the 3 × 3 identity matrix) and t_lr = (b, 0, 0)^T, where b is the length of the stereo baseline after rectification. The 3D point (e.g., relative to C₂) associated with the point pair can then be determined as follows:

where d = |x₁ − x₂| represents the disparity between the two image points of a pair of matched facial landmarks.

From this, the 3D point cloud {X₂} of the subject can be constructed. All the variables in formula (16) refer to the stereo camera after rectification of the pair of stereo images.
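Formula (16) is missing from this text. For a rectified pair, the standard relation is Z = f·b/d with the disparity d in pixels. The sketch assumes identical intrinsics after rectification (f_x = f_y = f, zero skew, principal point (c_x, c_y)) and computes the point in the left-camera frame rather than C₂; the function name and parameters are hypothetical.

```python
import numpy as np

def stereo_point(p_left, p_right, f, b, cx, cy):
    """Left-camera 3D point from a rectified stereo landmark pair (pixels).

    After rectification matched points share the same row, so only the
    horizontal offset (the disparity) carries depth information.
    """
    d = abs(p_left[0] - p_right[0])  # disparity in pixels, assumed > 0
    Z = f * b / d
    X = (p_left[0] - cx) * Z / f
    Y = (p_left[1] - cy) * Z / f
    return np.array([X, Y, Z])
```

For example, with f = 800 pixels and b = 0.06 m, a disparity of 96 pixels corresponds to a depth of 0.5 m, roughly arm's length for a handheld device.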

Then, as in Embodiment 1, the registering device 401 of the detector 400 may align the constructed 3D point cloud {X_m} with a pre-stored 3D point cloud {X_ref} for the genuine user. The registered 3D point cloud is denoted as {X′_m}. Then, the 3D distance between the constructed 3D point cloud and the pre-stored 3D point cloud is calculated as follows:

According to another embodiment, the transformation determining device 301 may also determine an inter-transformation matrix T across two stereo image frames by a method similar to that of cases 1 and 2 of Embodiment 1.

Alternatively, two point clouds {X₁} and {X₂} may be constructed for two stereo image frames, respectively. The inter-transformation matrix T may then be determined by aligning the point cloud {X₁} with respect to {X₂}.

Alternatively, the coplanarity determiner 200 may determine whether the facial landmarks in the matched pairs of facial landmarks are coplanar by the following method. First, a planar fit of the 3D point cloud {X} constructed from the stereo image frame (I^l, I^r) is performed using the 3D plane equation Z_k = aX_k + bY_k + c, k = 1, 2, …, N. The plane parameters n = (a, b, c)^T are then determined by solving formula (12) or formula (13) above.

Then, the distance calculating device 202 may calculate the root-mean-square error (RMSE) of the planar fit of the 3D points, i.e., err_plane, by formula (14) above.

According to the system 1000 for detecting a genuine user, the 3D structure of the face can be exploited to determine automatically whether the subject is a real person or merely a face in a photograph or video sequence portraying the genuine user, thereby preventing spoofing attacks in which an imposter tries to bypass a face recognition system using a photograph or video of the user.

The embodiments above capture one pair of image frames (i.e., two image frames I₁ and I₂) with the monocular camera, or one stereo image frame (i.e., (I^l, I^r)) with the stereo camera, at different positions and/or angles. However, the present application is not limited thereto. More than one pair of monocular image frames (or more than one stereo image frame) may be captured to improve the accuracy and robustness of the system 1000.

For more than one pair of monocular image frames (or more than one stereo image frame), err_H (or err_plane) is determined for each pair, and the mean of all the err_H (or err_plane) values is compared with the pre-defined threshold to determine whether the subject is a planar face. If not, a 3D point cloud is constructed by the constructor 300 for each of the image pairs. According to an embodiment, M image frames are selected from the captured sequence. An optimal 3D point cloud with respect to the world coordinate frame W (i.e., relative to the first coordinate system C₁) may then be constructed by refining the transformation matrices of all M image frames jointly with the 3D points so as to minimize the re-projection errors in all M image frames, as follows:

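$$\{X\}, \{T\} = \arg\min_{\{X\},\{T\}} \sum_{i=1}^{M} \sum_{j=1}^{N} \left\| P_i(X_j) - x_{ij} \right\|^2, \qquad (18)$$

where i = 1, 2, …, M, j = 1, 2, …, N, P_i(X_j) represents the image projection of the j-th 3D point X_j onto the i-th image I_i, and x_ij represents the observed image location of the j-th facial landmark in I_i.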
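The objective in formula (18) is a bundle-adjustment cost. A minimal sketch of evaluating it, assuming normalized image coordinates (intrinsics already removed), poses given as (R_i, t_i) with X_cam = R_i X + t_i relative to the world frame W = C₁, and the observed landmark locations stored as one N × 2 array per image. The function names are hypothetical. Minimizing this cost over all poses and points could be delegated to a generic nonlinear least-squares solver such as scipy.optimize.least_squares.

```python
import numpy as np

def project(R, t, X):
    """Normalized image projection of world points X (N x 3) for pose (R, t)."""
    Xc = X @ R.T + t
    return Xc[:, :2] / Xc[:, 2:3]

def reprojection_cost(poses, X, observations):
    """Sum over images i and landmarks j of ||P_i(X_j) - observed_ij||^2."""
    return float(sum(np.sum((project(R, t, X) - obs) ** 2)
                     for (R, t), obs in zip(poses, observations)))
```

Holding the first pose at (I, 0) fixes the world frame to C₁; with monocular input the global scale must also be fixed (e.g., by the baseline length) for the minimization to be well posed.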

Then, the detector 400 can detect whether the subject is the genuine user based on the constructed optimal 3D point cloud.

It shall be appreciated that the system 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (including but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program code. Fig. 6 is a schematic diagram illustrating the system 1000 for detecting a genuine user when it is implemented in software, consistent with some disclosed embodiments.

In the case that the system 1000 is implemented in software, the system 1000 may run on a general-purpose computer, a computer cluster, a mainframe computer, a computing apparatus dedicated to providing online content, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 6, at least one of the above-mentioned computers and apparatus may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage apparatus 116 storing the program instructions with which the processors implement the method 2000 discussed later, and a bus to facilitate information exchange among the various components of the system 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphics processing unit ("GPU"), or other suitable information processing apparatus. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform the various methods explained in greater detail below. It is noted that although only one block is shown in Fig. 6, memory 112 may include multiple physical apparatus installed on a central computing apparatus or on different computing apparatus.

According to an embodiment, a mobile apparatus provided with the system 1000 for detecting a genuine user described above is disclosed. Under fixed camera settings (e.g., focal length, exposure, white balance) and the lighting conditions of the surrounding environment, the user is required to hold the mobile apparatus and move it around his or her face, so that the retriever in the apparatus can capture face images at different positions and/or angles slightly deviated from the frontal face.

Fig. 7 is a schematic flowchart illustrating a face liveness detection method 2000 consistent with some disclosed embodiments. The method 2000 is described in detail below with reference to Fig. 7.

At step S701, an image sequence of a subject including at least a first image and a second image is received. The image sequence of the subject may be captured by a monocular camera or a stereo camera. Then, at step S702, facial landmarks of the first and second images may be retrieved to form matched pairs of facial landmarks in which each facial landmark of the first image is matched with a facial landmark in the second image. At step S703, whether the 3D points associated with the facial landmarks in the matched pairs of facial landmarks are coplanar is determined based on the locations of the facial landmarks. Then, at step S704, a 3D point cloud is constructed for the first and second images from said locations if the 3D points associated with the facial landmarks are not coplanar. Then, at step S705, whether the subject is a real face of the genuine user is determined based on the constructed 3D point cloud.

According to an embodiment, step S703 may further comprise a step of measuring the correspondence relationship between the locations of the facial landmarks in the matched pairs of facial landmarks, and a step of calculating a distance between the facial landmarks in each matched pair based on the measured correspondence relationship and said locations, so as to determine whether the facial landmarks in the matched pairs are coplanar. If the distance err_H (or err_plane) calculated according to formula (6) (or (14)) is smaller than a predetermined threshold, the subject is determined to be a planar face.

According to an embodiment, step S704 may further comprise a step of determining a geometric transformation between the first and second images in accordance with the locations of the matched pairs of facial landmarks, and a step of calculating a depth value for each of the matched pairs of facial landmarks.
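
Given a geometric transformation between the two views, the depth of each matched pair can be recovered by triangulation. The sketch below assumes the transformation is already available as a rotation R and translation t of the second camera relative to the first, together with a known intrinsic matrix K; how R and t are obtained (for example by decomposing an essential matrix, or with help from inertial data) is not shown. It uses standard linear (DLT) triangulation, which is one possible technique and not necessarily the one used here. With a monocular camera, the translation recovered from images alone is known only up to scale, so the depths are then relative. The function names are hypothetical.

```python
import numpy as np


def triangulate_pair(P1: np.ndarray, P2: np.ndarray, x1, x2) -> np.ndarray:
    """Linear (DLT) triangulation of one matched landmark pair -> 3D point."""
    # Each view contributes two rows from u * (P[2] . X) - P[0] . X = 0, etc.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenise


def build_point_cloud(K, R, t, pts1, pts2) -> np.ndarray:
    """3D point cloud in the first camera frame; column 2 holds the depth values."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, np.reshape(t, (3, 1))])
    return np.array([triangulate_pair(P1, P2, a, b) for a, b in zip(pts1, pts2)])
```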

According to an embodiment, step S705 may further comprise a step of aligning the constructed 3D point cloud with a pre-stored 3D point cloud for the genuine user, and a step of calculating a similarity between the constructed 3D point cloud and the pre-stored 3D point cloud, such that whether the subject is the genuine user is determined based on the calculated similarity.
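
The alignment and similarity steps can be sketched with the Umeyama least-squares similarity transform (rotation, translation and a scale factor, the scale being useful because a monocular reconstruction is defined only up to scale), followed by the RMS residual between corresponding landmarks. This is an assumed choice of registration method and score; the function names are hypothetical, and the sketch assumes both clouds list the same landmarks in the same order. The residual behaves like a distance, so smaller values mean the clouds agree more closely.

```python
import numpy as np


def align_similarity(src: np.ndarray, dst: np.ndarray):
    """Umeyama least-squares similarity (s, R, t) with dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # forbid a reflection
    R = U @ S @ Vt
    s = float(np.trace(np.diag(D) @ S) / (xs ** 2).sum(axis=1).mean())
    t = mu_d - s * R @ mu_s
    return s, R, t


def cloud_distance(probe: np.ndarray, enrolled: np.ndarray) -> float:
    """RMS residual after alignment (assumed score; smaller means more similar)."""
    s, R, t = align_similarity(probe, enrolled)
    aligned = s * probe @ R.T + t
    return float(np.sqrt(((aligned - enrolled) ** 2).sum(axis=1).mean()))
```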

According to an embodiment, before step S703, the number of retrieved facial landmarks in each image may be counted. If the number is smaller than a preset threshold, step S701 is performed again until an allowed time duration expires.
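
The count-and-retry behaviour can be sketched as a loop bounded by a time budget. The capture callable, assumed to return one landmark list per image, and the parameter names are hypothetical; an injectable clock is used so that the time limit can be exercised deterministically.

```python
import time
from typing import Callable, List, Optional, Sequence


def capture_until_enough(
    capture: Callable[[], List[Sequence[object]]],
    min_landmarks: int,
    time_budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[List[Sequence[object]]]:
    """Repeat capture (S701) until every image has enough landmarks or time runs out."""
    deadline = clock() + time_budget_s
    while clock() < deadline:
        per_image = capture()  # hypothetical: one landmark list per image
        if per_image and all(len(lm) >= min_landmarks for lm in per_image):
            return per_image
    return None  # allowed time duration expired
```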

According to an embodiment, before step S703, inertial data of the first and second images may be acquired from an Inertial Measurement Device (IMU). Then, in step S703, the geometric transformation is determined in accordance with the locations of the facial landmarks and the acquired inertial data. According to another embodiment, before step S703, feature vectors of the facial landmarks may be acquired. The correspondence between each landmark in the first image and the corresponding landmark in the second image can then be found by minimizing the matching cost of the acquired feature vectors. After that, in step S703, the geometric transformation is determined in accordance with the locations of the matched pairs of retrieved facial landmarks. According to yet another embodiment, in step S703, the geometric transformation may be determined in accordance with the locations of the retrieved facial landmarks and the inertial data from the IMU.
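
Establishing correspondences by minimizing the matching cost of feature vectors can be sketched as follows. The cost is assumed to be the squared Euclidean distance between descriptors, and the pairing is greedy: cheapest pairs first, each landmark used at most once. Greedy matching is an approximation; an optimal one-to-one assignment would use the Hungarian method instead (for example SciPy's linear_sum_assignment). The function name and the max_cost cut-off are hypothetical.

```python
from typing import List, Sequence, Tuple


def match_landmarks(
    feats1: Sequence[Sequence[float]],
    feats2: Sequence[Sequence[float]],
    max_cost: float = float("inf"),
) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching of landmark descriptors by squared L2 cost."""
    costs = []
    for i, f1 in enumerate(feats1):
        for j, f2 in enumerate(feats2):
            costs.append((sum((a - b) ** 2 for a, b in zip(f1, f2)), i, j))
    costs.sort()  # cheapest candidate pairs first
    used1, used2, matches = set(), set(), []
    for cost, i, j in costs:
        if cost > max_cost:
            break  # all remaining pairs are even more expensive
        if i in used1 or j in used2:
            continue
        used1.add(i)
        used2.add(j)
        matches.append((i, j))
    return sorted(matches)  # (index in first image, index in second image)
```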

Fig. 8 is a schematic flowchart illustrating a face liveness detection method 2000 consistent with one embodiment of the present application. As shown in Fig. 8, at step S801, image sequences are captured and matched facial landmarks are retrieved from the images of the image sequence. Then, at step S802, it is determined whether the number of retrieved facial landmarks is sufficient, that is, higher than a preset value. If not, the subject is considered not to be a face. Otherwise, the correspondence and coplanarity of the landmarks in the matched facial landmarks are determined at step S803. Then, at step S804, the coplanarity is compared with a first threshold to determine whether the locations of the facial landmarks lie on the same plane. If the coplanarity is smaller than the first threshold, the subject is considered to be a planar face. Otherwise, it is determined whether a pre-stored 3D point cloud for the user is available. If not, the subject is determined to be a 3D face. If so, a geometric transformation between the images is determined at step S806, and the transformation is used to construct a 3D point cloud at step S807. Then, at step S808, the constructed 3D point cloud is aligned with the pre-stored 3D point cloud and the similarity between them is determined. If the similarity is smaller than a second threshold at step S809, the subject is determined to be a real face of the user; otherwise, the subject is considered to be a 3D face. At step S810, the time duration for detection is counted. When the subject is not a face, or is determined to be a planar face or a 3D face, the method 2000 returns to step S801 until the time duration expires.
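
The decisions of Fig. 8 can be sketched as one classification pass plus a time-bounded retry loop. The quantities feeding each decision (landmark count, coplanarity error, point-cloud score) are assumed to be computed elsewhere and passed in; cloud_score is None when no pre-stored point cloud exists. Because the text accepts the user when the "similarity" is smaller than the second threshold, the sketch treats that score as distance-like. The function names, labels and parameters are hypothetical.

```python
import time
from typing import Callable, Optional


def classify_attempt(
    num_landmarks: int,
    coplanarity_err: float,
    cloud_score: Optional[float],
    *,
    min_landmarks: int,
    first_threshold: float,
    second_threshold: float,
) -> str:
    """One pass of the Fig. 8 decisions; cloud_score is None without an enrolled cloud."""
    if num_landmarks <= min_landmarks:     # S802: must be higher than the preset value
        return "not a face"
    if coplanarity_err < first_threshold:  # S804: landmarks lie on one plane
        return "planar face"
    if cloud_score is None:                # no pre-stored 3D point cloud for the user
        return "3D face"
    if cloud_score < second_threshold:     # S809: distance-like score
        return "real face"
    return "3D face"


def run_until_real_or_timeout(
    attempt: Callable[[], str],
    time_budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """S810: repeat from S801 until a real face is found or the time budget expires."""
    deadline = clock() + time_budget_s
    label = "timeout"
    while clock() < deadline:
        label = attempt()
        if label == "real face":
            break
    return label
```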

With the system and method for detecting a genuine user of the present application, spoofing attacks in which an imposter tries to bypass a face recognition system using a photograph or video of the user can be prevented. The system for detecting a genuine user can also be coupled with a 2D and/or 3D face recognition system to perform early detection of spoofing attacks on a mobile apparatus.

Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples once they learn of the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all variations or modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications fall within the scope of the claims and their equivalent techniques, they may also fall within the scope of the present invention.

Claims

A system for detecting a genuine user, comprising:

a retriever configured to receive an image sequence of a subject including at least a first and a second image, and retrieve facial landmarks of the first and second images to form matched pairs of facial landmarks, in which each facial landmark of the first image is matched with a facial landmark in the second image;

a coplanarity determiner configured to determine if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks;

a constructor configured to construct a 3D point cloud for the first and second images from the locations of the facial landmarks if the 3D points associated with the facial landmarks are not coplanar; and

a detector configured to detect whether the subject is a real face of the genuine user based on the constructed 3D point cloud.
The system of claim 1, wherein the coplanarity determiner further comprises:

a correspondence relationship measuring device configured to measure a correspondence relationship between the locations of the facial landmarks in the matched pairs of facial landmarks; and

a distance calculating device configured to calculate a distance between each of the facial landmarks in the first image and the corresponding facial landmark in the second image obtained by the measured correspondence relationship, so as to determine if the 3D points associated with the facial landmarks in the matched pairs of facial landmarks are coplanar.
The system of claim 2, wherein the coplanarity determiner determines that the subject is a planar face if the calculated distance is smaller than a predetermined threshold.
The system of claim 1, wherein the constructor further comprises:

a transformation determining device configured to determine a geometric transformation between the first and second images in accordance with the locations of the extracted facial landmarks; and

a depth calculating device configured to calculate a depth value for each of the matched pairs of the facial landmarks, such that the constructor establishes the 3D point cloud for the first and second images from the determined geometric transformation and the calculated depth.
The system of claim 1, wherein the detector further comprises:

a registering device configured to align the constructed 3D point cloud with a pre-stored 3D point cloud for the genuine user; and

a similarity calculating device configured to calculate a similarity between the constructed 3D point cloud and the pre-stored 3D point cloud, such that the detector detects whether the subject is a genuine user based on the calculated similarity.
The system of claim 4, wherein the retriever is further configured to acquire feature vectors of the retrieved facial landmarks and establish the correspondence between each facial landmark in the first image and the corresponding facial landmark in the second image, and the transformation determining device is further configured to determine the geometric transformation in accordance with the locations of the matched pairs of the facial landmarks.
The system of claim 4, further comprising:

an Inertial Measurement Device (IMU) configured to acquire inertial data of the first and second images,

wherein the transformation determining device is further configured to determine the geometric transformation in accordance with the locations of the facial landmarks and the acquired inertial data.
The system of claim 4, further comprising:

an Inertial Measurement Device (IMU) configured to acquire inertial data of the first and second images, and

wherein the retriever is further configured to acquire feature vectors of the retrieved facial landmarks and establish the correspondence between each facial landmark in the first image and the corresponding facial landmark in the second image, and

wherein the transformation determining device is further configured to determine the geometric transformation in accordance with the locations of the matched pairs of facial landmarks and the acquired inertial data.
The system of claim 1, wherein the image sequence is captured by a monocular camera or a binocular camera.
A mobile apparatus provided with the system for detecting a genuine user according to claim 1.
A method for detecting a genuine user, comprising:

receiving an image sequence of a subject including at least a first image and a second image;

retrieving facial landmarks of the first and second images to form matched pairs of facial landmarks in which each facial landmark of the first image is matched with a facial landmark in the second image;

determining if 3D points associated with the matched pairs of facial landmarks are coplanar based on locations of the facial landmarks;

constructing a 3D point cloud for the first and second images from said locations if the 3D points associated with the facial landmarks are determined to be not coplanar; and

detecting whether the subject is the genuine user based on the constructed 3D point cloud.
The method of claim 11, wherein the step of determining further comprises:

measuring a correspondence relationship between the locations of the facial landmarks in the matched pairs of facial landmarks; and

calculating a distance between each of the facial landmarks in the first image and the corresponding facial landmark in the second image obtained by the measured correspondence relationship.
The method of claim 12, wherein the subject is determined to be a planar face if the calculated distance is smaller than a predetermined threshold.
The method of claim 11, wherein the step of constructing further comprises:

determining a geometric transformation between the first and second images in accordance with the locations of the extracted facial landmarks; and

calculating a depth value for each of the matched pairs of the facial landmarks.
The method of claim 11, wherein the step of detecting further comprises:

aligning the constructed 3D point cloud with a pre-stored 3D point cloud for the genuine user; and

calculating a similarity between the constructed 3D point cloud and the pre-stored 3D point cloud.
The method of claim 11, before the step of determining, further comprising:

counting a number of the retrieved facial landmarks of each image; wherein if the number is smaller than a preset threshold, the step of receiving is performed until an allowed time duration expires.
The method of claim 11, before the step of determining, further comprising:

acquiring feature vectors of the retrieved facial landmarks, and

wherein the step of determining further comprises:

determining the geometric transformation in accordance with locations of the matched pairs of the retrieved facial landmarks.
The method of claim 14, further comprising:

acquiring inertial data of the first and second images from an Inertial Measurement Device (IMU); and

wherein the geometric transformation is determined in accordance with the locations of the facial landmarks and the acquired inertial data.
The method of claim 14, further comprising:

acquiring feature vectors of the retrieved facial landmarks; and

acquiring inertial data of the first and second images from an IMU; and

wherein the geometric transformation is determined in accordance with the locations of the matched pairs of the facial landmarks and the acquired inertial data.
The method of claim 11, wherein the step of retrieving is performed by a monocular camera or a binocular camera.