WO2024168396A1 - Automated facial detection with anti-spoofing - Google Patents

Automated facial detection with anti-spoofing

Info

Publication number
WO2024168396A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
image
user
facial
analysing
Prior art date
Application number
PCT/AU2024/050111
Other languages
French (fr)
Inventor
Eldho ABRAHAM
Thomas LANDGREBE
Joshua MERRITT
Original Assignee
Icm Airport Technics Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2023900391A external-priority patent/AU2023900391A0/en
Application filed by Icm Airport Technics Pty Ltd filed Critical Icm Airport Technics Pty Ltd
Publication of WO2024168396A1 publication Critical patent/WO2024168396A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/167 Detection; Localisation; Normalisation using comparisons between temporally consecutive images
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/172 Classification, e.g. identification
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B 3/10 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions
    • A61B 3/11 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions for measuring interpupillary distance or diameter of pupils
    • A61B 3/111 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions for measuring interpupillary distance
    • A61B 3/113 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions for determining or recording eye movement
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/0059 Measuring for diagnostic purposes; Identification of persons using light, e.g. diagnosis by transillumination, diascopy, fluorescence
    • A61B 5/0077 Devices for viewing the surface of the body, e.g. camera, magnifying lens
    • A61B 5/103 Measuring devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/107 Measuring physical dimensions, e.g. size of the entire body or parts thereof
    • A61B 5/1079 Measuring physical dimensions, e.g. size of the entire body or parts thereof using optical or photographic means
    • A61B 5/117 Identification of persons
    • A61B 5/1171 Identification of persons based on the shapes or appearances of their bodies or parts thereof
    • A61B 5/1176 Recognition of faces
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31 User authentication
    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints

Definitions

  • This disclosure relates to automated facial authentication or identification systems, in particular to address potential vulnerabilities of these systems to spoofing attacks, whereby someone attempts to be authenticated by presenting an image of a face to the system.
  • the use of biometrics in the authentication or identification of individuals has been gaining traction in recent years, in particular given advances in facial recognition and image processing techniques.
  • An application for which such use can be readily adopted is the identification or registration of passengers, particularly in airports, where there are already self-serve kiosks where passengers can complete other functions such as checking into flights, printing boarding passes, or printing baggage tags.
  • facial biometrics verification also increasingly may be used in other scenarios, such as building access control.
  • in a facial biometric identification system, an image of a person’s face is captured, analysed and compared with a database of registered faces to determine whether there is a match. Based on the result of this determination the system ascertains the identity of the person.
  • This process is potentially vulnerable to “spoofing” attempts by an imposter to disguise their true identity, by presenting a “spoof”, i.e., the facial image of someone else, to the biometric identification system.
  • the system needs to be able to determine whether it has captured an image of a live face, or an image of a “spoof”.
  • an imposter can present a display of a mobile device showing a facial picture of a different person to the biometric identification system, while using such software to animate the picture to make it appear like the person is interacting with the biometric identification system. This makes it more difficult for facial biometric identification systems to detect an imposter attempt by requesting a live interaction with the person it is trying to identify.
  • the invention provides a method of estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising one or more of: (a) determining and analysing movements effected by the user to fit a facial image of the presented face to a randomised goal; (b) determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression; and (c) determining and analysing an effect of an illumination change on contrast levels between a facial region in the acquired images and one or more regions adjacent the facial region.
  • (a), (b), and (c) are performed sequentially.
  • At least one of (a), (b), and (c) is performed at the same time as another one of (a), (b), and (c).
  • determining and analysing movements effected by the user to fit a facial image of the presented face to a randomised goal comprises: displaying a randomised goal on a display screen, and directing the user to effect movement to move his or her facial image from a current position in relation to the display screen such that his or her facial image will fit the randomised goal; analysing image frames acquired while the user is directed to track the randomised goal; and estimating whether the presented face is a real face based on the analysis.
  • the goal may be randomised in that it has a randomised location, size, or both.
  • analysing image frames acquired while the user is directed to track the randomised goal comprises estimating movement effected by the user.
  • estimating whether the presented face is a real face comprises comparing the determined movement or movements with a movement which a person is expected to make.
  • the estimation is made by a machine learning model.
  • the estimation is made by a reinforcement learning based model trained using data in relation to natural movements.
  • determining movement comprises determining a path between the current position and the randomised goal.
  • determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression comprises: directing the user to make an expression to cause a visually detectable facial feature warping; analysing image frames acquired while the user is directed to make the expression; and estimating whether the presented face is a real face based on the analysis.
  • analysing the image frames acquired while the user is directed to make the expression comprises obtaining a time series of metric values from the series of analysed frames, by calculating, from each analysed frame, a metric based on the position of one or more facial features.
  • estimating whether the presented face is a real face based on the analysis comprises: determining a momentum in the time series of metric values and comparing the momentum with an expected momentum profile for a real smiling face.
  • the user is directed to smile.
  • the metric calculated is or comprises a ratio of a distance between eyes of a detected face in the analysed frame to a width of a mouth of the detected face.
  • the method comprises comparing the image frames acquired while the user is directed to make the expression with a reference image in which the user has a neutral expression, and selecting an image from the analysed image frames which is most similar to the reference image as an anchor image in which the user is deemed to have a neutral expression.
  • analysing an effect of a change in illumination on the presented face on the image data comprises: capturing one or more first image frames of the presented face at a first illumination level, and capturing one or more second image frames of the presented face at a second illumination level which is different than the first illumination level; analysing the first image frame to determine a first contrast level between a detected face region in the first image frame and an adjacent region which is adjacent the detected face region; analysing the second image frame to determine a second contrast level between a detected face region in the second image frame and an adjacent region which is adjacent the detected face region of the second image frame, wherein a relationship between the adjacent region and the detected face region of the second image frame is the same as a relationship between the adjacent region and the detected face region of the first image frame; comparing the first and second contrast levels to estimate whether the presented face is likely to be a real face.
  • comparing the first and second contrast levels comprises determining whether a change between the first and the second contrast levels is greater than a threshold.
  • the method comprises: directing the user to fit a facial image of his or her presented face to a first, larger target area; and upon detecting a facial image within the first target area, setting an area generally taken by the facial image as a reference target area.
  • the method comprises applying a tolerance range around the first, larger target area, whereby detection of a facial image within the tolerance range will trigger setting of the area of the facial image as the reference target area.
  • the method comprises applying a tolerance range around the reference target area, so that the facial image of the user’s presented face is considered to stay within the reference target area if it is within the tolerance range around the reference area.
  • the invention provides an apparatus for estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising, a processor configured to execute machine instructions which implement the method mentioned above.
  • the apparatus is a local device used or accessed by the user.
  • the apparatus is a kiosk at an airport.
  • the kiosk may be a check-in kiosk, a bag drop kiosk, a security kiosk, or another kiosk.
  • the local device is a mobile phone or tablet.
  • the acquired image frames are sent over a communication network to a backend system, and are processed by a processor of the backend system configured to execute machine instructions at least partially implementing the method mentioned above.
  • the presented face is estimated to be a real face, if processing results by the processor of the apparatus and processing results by the processor of the backend system both estimate that the presented face is a real face.
  • the apparatus is configured to enable the user to interface with an automated biometric matching system to enrol or verify his or her identity.
  • the user is an air-travel passenger
  • the backend system is a server system hosting a biometric matching service or is a server system connected to another server system hosting a biometric matching service.
  • the invention provides a method of biometrically determining a subject’s identity, including: estimating whether a presented face of the subject is a real face, in accordance with the method mentioned above; providing a two-dimensional image acquired of the presented face for biometric identification of the subject, if it is estimated that the presented face is a real face; and outputting a result of the biometric identification.
  • the method is performed during a check-in process by an air travel passenger.
  • a biometric identification system including an image capture arrangement, a depth data capture arrangement, and a processor configured to execute machine readable instructions, which when executed are adapted to perform the method of biometrically determining a subject’s identity mentioned above.
  • Figure 1 is a schematic for a liveness estimation method, according to one embodiment of the present invention.
  • Figure 2 is a flow chart for a facial expression detection process according to one embodiment
  • Figure 3 (1) is an image in which a person has a smiling expression
  • Figure 3 (2) is an image in which a person has a neutral expression
  • Figure 4 (1) shows the lip distance to eye separation ratio (LD/ED) through the frames, where the facial expression changes from a neutral to a smiling expression;
  • Figure 4 (2) shows LD/ED through the frames, where the facial expression changes from a smiling to a neutral expression
  • Figure 5 (1) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression remains neutral;
  • Figure 5 (2) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a slow-paced smile made by a real face;
  • Figure 5 (3) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a medium-paced smile made by a real face
  • Figure 5 (4) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a fast-paced smile made by a real face;
  • Figure 6 (1) depicts a first image of a face changing from a neutral expression to a smiling expression, in which the face is neutral, and a second image of the face in which the face is smiling;
  • Figure 6 (2) is a schematic depicting the times at which image frames are expected to be acquired when the camera frame rate is 10 frames per second (fps);
  • Figure 6 (3) depicts a time series of LD/ED data measured from 12 frames taken at 2 fps;
  • Figure 6 (4) is a schematic depicting how the time series of Figure 6 (3) is used to populate, or interpolate values for, a data series for analysis at a target sampling rate which is higher than 2 fps;
  • Figure 7 schematically depicts a user interface on a mobile device where the user is asked to fit his or her facial image to a randomised target;
  • Figure 8 is a depiction of a reinforcement training model
  • Figure 9 depicts an example face motion test, in accordance with an embodiment of the invention.
  • Figure 10 depicts an example face dimensionality analysis, in accordance with an embodiment of the invention.
  • Figure 11 depicts an example of a facial region and four adjacent, non-facial regions
  • Figure 12 depicts an example facial target fitting process in accordance with an embodiment of the invention
  • Figure 13 (1) schematically depicts a rough fitting step mentioned in Figure 12;
  • Figure 13 (2) schematically depicts a concise fitting step mentioned in Figure 12;
  • Figure 13 (3) schematically depicts a reference reset step mentioned in Figure 12;
  • Figure 14 schematically depicts an example of an automated system for the purpose of authenticating a traveller or registering a traveller, in accordance with an embodiment of the invention.
  • in a spoofing attempt, a facial image or model, rather than a live face, is presented to an automated system which uses facial biometrics for purposes such as the enrolment, registration or the verification of identities, to try to fool the automated system.
  • the “spoof” which is presented to the automated system in a spoofing attempt may be a static two-dimensional (2D) spoof such as a print-out or cut-out of a picture, or a dynamic 2D spoof such as a video of a face presented on a screen.
  • the spoof may be a static three-dimensional (3D) spoof such as a static 3D model or a 3D rendering.
  • Another type of spoof is a dynamic 3D spoof, for example a 3D model with facial expression dynamics presented to a real or a virtual camera.
  • the capture and biometric analysis of a passenger’s face can occur at the various points, such as at flight check-in, baggage drop, security, boarding, etc.
  • the identification system includes image analysis algorithms, which rely on colour images taken with cameras. Systems using these algorithms are therefore limited in their ability to detect when a pre-recorded or a synthesized image sequence (i.e., video sequence) is being shown to the camera of the biometric identification system, rather than the real face of a person. The challenges are even greater when a 3D spoof is presented.
  • anti-spoofing for such systems may be done by configuring them to, or combining them with a system configured to, estimate whether the image being analysed is likely to have been taken of a spoof or a real face, i.e., estimating the liveness of the presented face.
  • Embodiments of the present invention provide a method for estimating the liveness of a face presented to a facial biometric system, i.e., determining whether it is a real face or a spoof.
  • the disclosed method is implementable as an anti-spoofing algorithm, step, or module, in the facial biometric system.
  • the system may be configured to enrol or register passengers, or verify passenger identities, or both.
  • the disclosure also covers facial biometric systems which are configured to implement the method.
  • FIG. 1 is a high-level schematic for a liveness estimation method 100, according to one embodiment of the present invention.
  • Image data is received or acquired by a system implementing the method at step 102.
  • the image data is processed at step 104, and then a liveness estimation is made at step 106.
  • the processing of the image data includes a facial expression analysis 108, a motion tracking analysis 110, and a facial lighting response analysis 112.
  • the processing at step 104 may be an interactive process whereby the system will output instructions to direct the passenger attempting to register or authenticate their identity (e.g., at check-in) to take particular actions.
  • Image data captured whilst the passenger is performing the actions can then be analysed.
  • Arrow 105 represents the interactive process whereby during the processing (step 104), further image data are acquired or received to be analysed.
  • processing which occurs at step 104 may be different than that depicted in Figure 1, by including only one or two of the three types of analyses shown in the figure.

Facial expression analysis
  • the facial expression detection 108 is configured to analyse the facial images as detected in the input image data.
  • the input image data is acquired whilst the user is directed to make a particular facial expression or a series of expressions, in order to determine if a facial expression likely to be made by a real face can be detected.
  • the facial expression is of a type such that at least a partial set of the user’s facial features or muscles are expected to move whilst the user is making the expression.
  • the movement or movements cause a “warping” in the facial features as compared with an expressionless or neutral face.
  • the analysis for performing the facial expression detection 108 is configured to characterise this warping from the image data, to determine whether a real face is being captured making the expression, or whether the facial image is likely to have been captured from a “spoof”.
  • FIG. 2 is a high-level depiction of the facial expression detection 108 according to one embodiment.
  • the facial expression detection 108 is conceptually depicted (as represented by the dashed rectangular box) as occurring after the step of detecting facial images in the image data (step 113).
  • the facial image detection 113 may be included as part of the facial expression detection process 108.
  • the facial expression detection 108 at step 114 detects one or more facial features in each processed image frame.
  • one or more metrics may be calculated from the detected facial features, such as the width or height of a particular facial feature or the distance between facial features.
  • the positions of the detected features, or the metrics calculated from step 116 (if step 116 is performed) are tracked over the period of time during which the user is asked to perform the facial expression.
  • the time series of data are analysed, to determine if the time series of data are indicative of a real “live” face or a spoof being captured in the image data.
  • Figures 3 (1) and 3 (2) for illustrative purposes show an example where the expression the user is directed to make is a smile.
  • the facial features being identified are the eyes and the mouth of the person.
  • When a person changes from a neutral or non-smiling expression (Figure 3 (2)) to a smiling expression (Figure 3 (1)), the separation between the eyes is expected to remain the same, whilst the mouth is expected to widen. Therefore, the ratio of the mouth width to the distance between the eyes is expected to increase, as shown in Figure 4 (1). Conversely, when a person changes from a smiling expression to a neutral or non-smiling expression, this ratio is expected to decrease, as shown in Figure 4 (2).
  • the metrics which are calculated from the facial features may include the distance between the eyes (ED) and the distance across the width of the lips (LD), and the ratio between the two distances (i.e., LD/ED).
  • values of the ratio LD/ED are tracked over time and the time series of the values of the metric are analysed. It should be noted that in other implementations where the user is asked to smile or make other expressions, different metrics can be tracked within the same premise of tracking a movement or a warping in facial metrics over a series of images.
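As an illustration of the per-frame metric described above, the following sketch computes the LD/ED ratio from detected facial landmarks. It is a minimal sketch only: the landmark detector and the key names used for the landmark points are assumptions, not part of the patent disclosure.

```python
import math

def euclidean(p, q):
    """Distance between two (x, y) landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def lip_to_eye_ratio(landmarks):
    """Compute the LD/ED metric for one frame.

    `landmarks` is assumed to be a dict of (x, y) points produced by any
    face landmark detector; the key names are illustrative only.
    """
    eye_distance = euclidean(landmarks["left_eye_centre"], landmarks["right_eye_centre"])
    lip_width = euclidean(landmarks["mouth_left_corner"], landmarks["mouth_right_corner"])
    if eye_distance == 0:
        raise ValueError("degenerate landmarks: zero eye separation")
    return lip_width / eye_distance

# Example with hypothetical pixel coordinates: the ratio increases as the mouth widens.
frame_landmarks = {
    "left_eye_centre": (210, 180), "right_eye_centre": (290, 182),
    "mouth_left_corner": (225, 300), "mouth_right_corner": (280, 302),
}
print(round(lip_to_eye_ratio(frame_landmarks), 3))
```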
  • the aforementioned process may be generalised to include other embodiments.
  • the generalisation may be in one or more different ways. For example, other expressions than a smile may be used, as long as the expressions can be expected to cause measurable changes or a “warping” of the facial muscles.
  • the computation used for quantifying the changes may, for example, involve a different number of facial points for the purpose of analysing the different types of expressions.
  • the computation may also be non-linear, for instance involving a model based on non-linear computation such as but not limited to a polynomial or a spline model, rather than a linear model.
  • the analysis which is performed (step 120) on the time series of values may include determining a momentum in the time series of values, and ascertaining characteristics of the momentum to see if they indicate a “lively” expression has been made, i.e., by a real face.
  • Different methods may be used to measure momentum.
  • An example is the Moving Average Convergence/Divergence (MACD) analysis, but other tools for providing measures of momentum may be used instead.
  • the momentum in the time series differs, depending on whether there remains no smile (Figure 5 (1)), or whether the smile is a slow-paced smile (Figure 5 (2)), a medium-paced smile (Figure 5 (3)), or a fast-paced smile (Figure 5 (4)).
  • In Figures 5 (1) to 5 (4), the LD/ED metric (in % units) is shown on the top graphs.
  • the MACD analyses are shown in the bottom graphs of Figures 5 (1) to 5 (4).
  • the horizontal axis represents sample points.
  • the momentum which is being analysed indicates the momentum in the facial metrics as the facial expression is expected to change to or from a “neutral expression”.
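The momentum analysis can be implemented, for example, with a Moving Average Convergence/Divergence (MACD) computation over the LD/ED time series, as in the hedged sketch below. The EMA span values are assumptions chosen for illustration; the disclosure does not prescribe particular window lengths.

```python
def ema(series, span):
    """Exponential moving average with smoothing factor alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out, prev = [], series[0]
    for x in series:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

def macd_momentum(ld_ed_series, fast=5, slow=12, signal=4):
    """Return (macd_line, signal_line, histogram) for an LD/ED time series.

    A histogram that stays near zero suggests no expression change (cf. Figure 5 (1)),
    while a pronounced swing suggests a smile onset made by a real face.
    """
    fast_ema = ema(ld_ed_series, fast)
    slow_ema = ema(ld_ed_series, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram
```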
  • the facial expression analysis algorithm needs to have access to an image which can be considered to provide a neutral or expressionless face. This image may be referred to as an “anchor” image. This may be done by asking the person interacting with the automated system to assume an expressionless face.
  • the algorithm may set a threshold or threshold range for the metric or metrics being analysed, and assign an image in which the threshold or threshold range is met as the anchor image.
  • the anchor image may be chosen based on a reference image for the person who is interacting with the automated system, in which the face is expected to be neutral.
  • the reference image may be a prior existing image such as an identification photograph, an example being a driver’s license photograph or the photograph in a passport.
  • each image in the series of input images is compared with the reference image.
  • the input image which is considered “closest” to the reference image will be chosen as the “anchor” image.
  • the image frames in the series input image after the anchor image can then be used for the analysis to determine whether the “face” in the images is a real face or a spoof.
  • the comparison between the reference image and the input images may be made using biometric methods.
  • the comparison may be made on the basis of the specific facial feature metric(s) being used in the facial expression analysis, by comparing the metric(s) calculated from the reference image with the same metric(s) calculated from the input images, and identifying the input image from which the calculated metric(s) is or are closest to the metrics calculated from the prior existing image. The identified input image is then chosen as the “anchor” image.
  • a photograph such as a passport photograph has further benefits in that facial metrics measured on the basis of horizontal distances in the passport photograph are not expected to be subject to significant influence by vertical distortions in the camera used to acquire the passport photo. This is because, in relation to the face, a person’s eyes and lips are expected to remain in the same vertical axis irrespective of the facial expression. Therefore, any vertical distortions can be expected to affect the eyes and the lips equally, and are not expected to have any real influence on ratios such as LD/ED which rely on measurements across a horizontal distance. On the other hand, horizontal distortion may impact the lip-eye-ratio metric LD/ED.
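The anchor-image selection described above can be sketched as follows: given LD/ED values computed from the input frames and from a reference image (e.g., a passport photograph), the frame whose metric is closest to the reference is taken as the anchor. This is a simplified illustration; a real implementation could equally use a biometric comparison, as the disclosure notes.

```python
def choose_anchor_frame(frame_metrics, reference_metric):
    """Return the index of the input frame whose LD/ED value is closest to the
    metric computed from the reference (e.g., passport) image.

    `frame_metrics` is a list of LD/ED values, one per analysed input frame.
    """
    best_index, best_delta = None, float("inf")
    for i, value in enumerate(frame_metrics):
        delta = abs(value - reference_metric)
        if delta < best_delta:
            best_index, best_delta = i, delta
    return best_index

# Hypothetical values: frame 2 is deemed the most neutral and becomes the anchor.
print(choose_anchor_frame([0.71, 0.69, 0.66, 0.74, 0.80], reference_metric=0.65))
```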
  • the facial expression detection algorithms may run on hardware having different technical specifications. For example, some older smartphones have lower frame rates than the newest smartphones. Therefore, in some embodiments, the algorithm is configured so that it will be able to perform the facial analysis when run on different hardware having different frame rates. In this way, such embodiments of the liveness estimation system which are intended to work on different types of devices as may be owned by users of the biometric system, will be “device” agnostic by being agnostic to frame rates, provided a minimum level of frame rate is available.
  • the self check-in may be done on a mobile application installed on the mobile device, or via a web-based application which the user can access using a browser on the mobile device.
  • the web application may be supported by a server providing the 1 to N biometric matching to verify passenger identities.
  • the facial expression analysis algorithm is configured to perform the analysis whereby the samples used are at a “target sampling rate”.
  • the algorithm may be configured to require a minimum or predefined number of samples (M samples) at the “target sampling rate” to be available.
  • the number “M” may be determined as the number of samples which are expected over a predetermined period of time at the target sampling rate.
  • the M samples are used as the data series for expression analysis.
  • the initial time point for the M samples is set to temporally coincide with one of the input image frames, and the facial feature metric calculated from that input image frame will be used as the first of the M samples.
  • the image frame providing the first sample in the M sample series may be the very first input image frame acquired.
  • it may be an image frame which is taken at a predefined period of time after the initial image frame, or it may be the first image captured once the algorithm determines that the facial image fits to a “goal” area in the display view, or it may be the input image frame used as the anchor image.
  • the sample value will be the facial feature metric calculated from the input image frame which temporally coincides with the sample, if available. If no input image frame which temporally coincides with a required sample is available, then the value for that sample will be determined from the facial metric values calculated from the input image frames which are the closest in time to the sample. For instance, the sample value may be determined by interpolating between the facial feature metrics calculated from the input image frames which temporally, immediately precede and immediately succeed the time for the sample.
  • the target sampling rate may be one which is expected to be met or exceeded by the frame rates of most camera hardware included in the users’ devices (e.g., most of the available smart phone or tablet cameras).
  • the facial expression analysis algorithm may be configured so that it requires the input images to be acquired at or above a minimum actual sampling rate which is required in order to generate useful input data for the facial expression analysis algorithm at the target sampling rate.
  • Figure 6 depicts an example of how the facial expression analysis algorithm obtains the data series comprising data samples at M discrete temporal points over three seconds, as defined by the analysis sampling rate.
  • the sampling rates provided below are examples only, serving to illustrate how the analysis algorithm works, and are not essential features of the invention.
  • Figure 6 (1) depicts the real time continuous motion which occurs when a person changes the facial expression from a neutral expression to a smiling expression.
  • the images in Figure 6 (1) are artificial-intelligence generated images, provided for illustrative purposes only, and are not actual images acquired of a user.
  • the neutral expression is shown in the input image which is chosen to be the “anchor image” (illustrated by the left hand side image of Figure 6 (1)).
  • the smiling expression may be assumed to be shown in an input image (illustrated by the right hand side image of Figure 6 (1)) which is taken by the camera a pre-set period of time, e.g., 3 seconds (s), after the anchor image.
  • the target sampling rate for the data series to be analysed is 10 samples per second as depicted in Figure 6 (2).
  • the actual sampling rate is only 2 samples per second, where there are input images at times T0, T0 + 0.5s, T0 + 1s, etc. Designating T0 as 0.0 seconds, there are facial metrics calculated from the actual images at times 0.0 seconds, 0.5 seconds, 1.0 seconds, etc., as shown in Figure 6 (3).
  • the system thus needs to generate a data series where the sample points are at the target sampling rate, which in this case means samples are needed at times T0, T0 + 0.1s, T0 + 0.2s, T0 + 0.3s, T0 + 0.4s, ..., etc.
  • the sampling rates mentioned in this paragraph are illustrative only and should not be taken as limiting how embodiments of the algorithm should be implemented.
  • the facial feature metrics calculated from those input frames are used to provide corresponding sample points in the data series, as represented by the dashed arrows between Figures 6 (3) and 6 (4).
  • the sample values at each of these time points are estimated from the sample values of the nearest neighbours.
  • the estimation may be an interpolation.
  • the interpolation may be a linear interpolation.
  • the data series can then be used to analyse the characteristics of any observed facial feature motion or warping, to make an estimation of whether the face captured in the input images is likely to be a real face or a spoof.
  • This method of building the data series for analysis has the benefit of being generally agnostic to the variation in the frame rates in the cameras, at least for cameras capable of operating at or above a target frame rate. Also, at least given current image sensor frame rates, the processing speed of the CPU or processor running the expression analysis algorithms is likely to be much higher. Using interpolation means that the processing algorithm does not necessarily need to wait for the camera to produce enough frames so that metrics can be calculated to fill the required number of data samples in the data series.
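As a concrete illustration of the interpolation scheme described above, the sketch below linearly resamples a 2 fps LD/ED series onto a 10 Hz target grid over 3 seconds. The rates, duration and metric values are illustrative only, mirroring the example figures rather than any mandated parameters.

```python
import numpy as np

def resample_metric_series(sample_times, sample_values, target_rate=10.0, duration=3.0):
    """Linearly interpolate LD/ED samples onto a uniform grid at `target_rate` Hz.

    `sample_times` are the actual capture times in seconds relative to T0, and
    `sample_values` are the metrics calculated from those frames. Where a grid
    point coincides with an actual frame, its value is used directly; otherwise
    the value is interpolated from the nearest neighbours in time.
    """
    target_times = np.arange(0.0, duration + 1e-9, 1.0 / target_rate)
    return target_times, np.interp(target_times, sample_times, sample_values)

# 2 fps input over 3 seconds (7 actual frames), hypothetical LD/ED values.
times = np.arange(0.0, 3.01, 0.5)
values = np.array([0.65, 0.66, 0.70, 0.78, 0.86, 0.90, 0.91])
grid_t, grid_v = resample_metric_series(times, values)
print(len(grid_v))  # 31 samples at the 10 Hz target rate
```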
  • the liveness estimation 100 may further include a face motion analysis algorithm 110 (see Figure 1), which analyses how the user performs, as can be determined from the input images being captured, when asked to move his or her face when directed by the liveness estimation system.
  • the user is asked to make movements such that his or her facial image matches to a “goal” area on the screen.
  • the goal does not remain unchanged in the same position on the screen 700, and will instead move to or appear in at least one other position, or change in size, or both.
  • the goal may thus be considered a dynamic goal.
  • the positioning, sizing, or both, of the goal 704 may be randomised, to make the algorithm more robust against someone trying to use a dynamic spoof with software which tries to learn and anticipate the movement pattern.
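A randomised goal of the kind described above could be generated as in the sketch below, which picks a random circle (centre and radius) within the display area while keeping it away from the current face position so that a genuine movement is required. All pixel dimensions and the retry policy are assumptions for illustration.

```python
import random

def random_goal(display_w, display_h, face_centre, min_r=60, max_r=110,
                min_offset=150, max_attempts=100):
    """Pick a randomised goal (centre, radius) for the face-fitting task.

    The goal is rejected if it lies too close to the current face centre,
    so the user has to effect a real movement to fit the facial image to it.
    """
    for _ in range(max_attempts):
        r = random.randint(min_r, max_r)
        cx = random.randint(r, display_w - r)
        cy = random.randint(r, display_h - r)
        dx, dy = cx - face_centre[0], cy - face_centre[1]
        if (dx * dx + dy * dy) ** 0.5 >= min_offset:
            return (cx, cy), r
    raise RuntimeError("could not place a goal away from the face")

# Example: a 720x1280 display with the face currently near the centre of the view.
print(random_goal(720, 1280, face_centre=(360, 500)))
```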
  • the person interacting with the automated system whilst the face motion analysis is performed is the “agent”.
  • a user can be considered a trained “agent” interacting with the system, such that his or her facial image 702 is shown on the display area 700 and will move within the display area 700 to match the position, size, or both, of the goal 704. Therefore, the display area 700 may be considered an ‘environment’ in which the agent is doing an action at time t (“At”), namely, to position the facial image 702 to the goal 704.
  • the position of the user’s face at time t may be considered the “State” at time t (“St”).
  • the “reward” at time t (“Rt”) is therefore defined as the State St matching or substantially matching the position of the goal 704.
  • the face motion analysis determines one or more factors, such as whether the reward condition is met, or a characterisation of the relationship between the change in the State and the change in Reward, over a period of time in which the analysis is undertaken, in order to estimate whether the face is likely a real face or a spoof.
  • Figure 9 outlines an example implementation of a “face motion” test 900 provided by the face motion analysis.
  • the algorithm detects the face in the incoming images.
  • a “goal” is displayed in a location of the screen which is away from the detected face.
  • the goal will define an area on the screen.
  • the position of the goal and thus the defined area on the screen may be randomly selected to result in a “randomised goal”.
  • the system provides direction to the user to make the required movement or movements so that his or her facial image will move to the area defined by the goal.
  • the system will track the detected face over the image frames to determine the path it takes to move from its starting position (i.e., State “S”) to the position of the area defined by the goal (step 906).
  • the movement or movements, i.e., motion, determined from the images will be analysed (arrow 912).
  • the movement analysis may include an analysis of the “path” the detected face takes through the image frames (step 914).
  • the determined path is then analysed and liveness estimation made based upon the analysis. This path is represented by arrows 706 in Figure 7, for the facial image to be fitted to the goal 704.
  • when a real face is being moved by the user, the path is expected to be smoother and shorter than the path which would be taken to move a “spoof”.
  • some dynamic spoofs use “brute force” attacks where the spoof presented to the automated system will be caused by software to move to random positions until it matches the position of the “goal”. Brute force attacks thus are likely to result in a path which is indirect and which may be erratic, even if they successfully fit the detected facial image to the “goal”.
  • the analysis may be a comparison between the path which is determined or estimated to have been taken, with the path which is expected when someone is not attempting a spoof attack.
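The path analysis can be illustrated with simple statistics over the tracked face-centre positions: the ratio of the travelled path length to the direct distance, and the average turning angle between successive steps. A long, erratic path (high ratio, large turning angles) is consistent with a brute-force spoof, whereas a deliberate user movement tends to be short and smooth. The thresholds below are assumptions for illustration, not values from the disclosure.

```python
import math

def path_statistics(points):
    """Return (path_length, straightness_ratio, mean_turn_angle) for a list of
    tracked face-centre positions (x, y), ordered in time."""
    length = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    direct = math.dist(points[0], points[-1])
    straightness = length / direct if direct > 0 else float("inf")
    turns = []
    for a, b, c in zip(points, points[1:], points[2:]):
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        if n1 and n2:
            cos_t = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)))
            turns.append(math.acos(cos_t))
    mean_turn = sum(turns) / len(turns) if turns else 0.0
    return length, straightness, mean_turn

def looks_like_direct_movement(points, max_straightness=1.4, max_mean_turn=0.5):
    """Heuristic check: True if the tracked path looks like a deliberate, direct
    movement towards the goal rather than an erratic, brute-force search."""
    _, straightness, mean_turn = path_statistics(points)
    return straightness <= max_straightness and mean_turn <= max_mean_turn
```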
  • the analysis of the movement or movements may additionally or instead comprise a determination of the “naturality” of the movement or movements (step 916).
  • the movement may comprise only the movement of the user’s face if the user is interacting with the automated system where the image sensor is in a fixed position.
  • the movement may comprise the movement of the face, motion attributed to the movement of the camera as made by the user, or a combination of both.
  • the order of the path analysis (step 914) and the naturality analysis (step 916) may be reversed from that shown in Figure 9.
  • a compliance failure may also be made if a failure occurs at one or more of the other steps. For instance, a compliance failure may be determined, if the system does not detect that there is a successful tracking of the facial image to the randomised goal (failure at step 906), or if the system determines from the path analysis that the facial image is an image of a spoof (failure at step 914), or both.
  • the relative movement between the user’s face and the camera impacts one or more of the position, angle, or depth of the facial features as captured by the camera.
  • the relative movement can affect the lighting or shadows as can be observed in the image data. Moving a real face in a three-dimensional (3D) environment, i.e., a “natural movement”, will cause different behaviours in the observed shadow and lighting, as opposed to moving a 2D spoof. Therefore, it is possible to analyse the above-mentioned parameters in the image data, in order to estimate the position or the movement of the user’s face in relation to the camera, in the physical 3D environment, and then determine whether the movement is a “natural movement”.
  • the determination of whether a movement is a natural movement may be made by a trained model using machine learning.
  • a reinforcement training model (Figure 8) can be used.
  • the user or user’s face would be considered the “agent”, and its position in the 3D environment would be considered as “State”.
  • the “reward” may be a determination that the movement is a natural movement.
  • the algorithm makes an assessment, i.e., estimation, of whether the face is likely to be a spoof, meaning there is a compliance failure (910), or likely to be real (918).
  • the face motion test 900 may be performed a number of times. That is, the face motion analysis may include multiple iterations of the face motion test 900.
  • the algorithm may require that a “successful” outcome, in which the face is estimated to be likely real, to be achieved for all of the tests or for a threshold number or percentage of the tests, for the overall analysis to estimate that the detected face is likely to be the image of a real face.
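Where the face motion test is repeated, the individual outcomes can be combined with a simple pass-fraction rule such as the one sketched below. The 80% threshold is an assumed policy for illustration; the method may equally require every iteration to succeed.

```python
def aggregate_motion_tests(outcomes, min_pass_fraction=0.8):
    """Combine repeated face-motion test results (True = face estimated likely real).

    Returns True only if at least `min_pass_fraction` of the iterations succeeded.
    """
    if not outcomes:
        return False
    return sum(outcomes) / len(outcomes) >= min_pass_fraction

print(aggregate_motion_tests([True, True, False, True, True]))  # True (4/5 = 0.8)
```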
  • the liveness determination 100 may further include a lighting response analysis 112 (see Figure 1) which analyses the input image frames to assess the response visible in the input image frames to changes in lighting.
  • some key face spoofing scenarios may involve either using a mobile device screen or printed photo to match against a documented face image, e.g., a passport image or an enrolment image. It is expected that the responses of a real face which is in 3D will be different than a 2D spoof or a mask worn over someone’s face. Therefore, the lighting response analysis may also be referred to as the face dimensionality analysis.
  • Figure 10 depicts an example process 1000 implemented to provide the face dimensionality analysis 112.
  • the input images are checked to ensure that there is a detectable face correctly positioned in the field of view (1002).
  • for the algorithm to work, it is important that there are no significant movements in the detected face, and that the presented face is sufficiently close to the screen or camera, so that the screen brightness, front flash, or both, is close enough to change the face illumination significantly.
  • the light from the mobile screen provides the illumination. Once a correctly positioned face is detected in an input image, a plurality of images will be captured, at different illumination levels.
  • One or more first images may be captured at a first illumination level (1004), and then one or more second images may be captured at a second illumination level which is different than the first illumination level (1006).
  • the illumination statistics will then be calculated (1008) for the first and second images, to find the occurrence of a transition between “dark” and “bright” regions.
  • the calculation of the intensity statistics is done in respect of a face region in the image, and one or more non-face regions adjacent to the face region.
  • five regions are defined, including a face region R5 and four other regions R1, R2, R3, R4, located respectively to the left of, to the right of, above, and below the face region R5.
  • Regions R1, R2, R3 are chosen so that they capture the background behind the person, if the person is a real person presenting a real face.
  • Region R4 is a body region which is expected to be of similar “depth” from the screen or camera, but subjected to slightly less illumination than the face due to the expected positioning of the face in relation to the source of illumination. It should be noted that Figure 11 is an example only. Other embodiments may have different regions. For instance another embodiment may not include any “body region”, or may include a different number of “background regions”.
  • the statistics are calculated for the series of input frames captured at steps 1004 and 1006, to determine the variation in the intensity contrast between the face region and the adjacent region or regions (1010).
  • When the presented face is a real face, the face will be closer to the screen or the camera than the background which is captured in the adjacent regions which are not of the user’s person. These adjacent regions are expected to be at least a head’s width behind the face. Accordingly it is expected that the face region will be more illuminated than adjacent regions captured of the background. Therefore, when an illumination level is changed, the effect of this change is expected to cause the largest variation in intensity levels for the face region (e.g., R5 in Figure 11), compared with the regions in which the background behind the person is captured (R1, R2, R3 in Figure 11).
  • the algorithm will calculate a statistic indicating the brightness contrast(s) between region R5 and the adjacent regions (one or more of R1, R2, R3) in the first images, also referred to as the “inter-region” contrast(s) for the first images.
  • the algorithm also calculates a statistic indicating the brightness contrast between those same regions in the second images to obtain the “inter-region” contrasts for the second images.
  • the two statistics are compared to determine the amount of variation in the inter-region contrast(s) (e.g., differences in the intensities of the regions), as caused by the change in lighting intensity.
  • if the variation in the inter-region contrast(s) is sufficiently large, e.g., if the change between the first and second contrast levels is greater than a threshold, the face is more likely to be a real face (1014) than a spoof (1012).
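The inter-region contrast comparison can be sketched as below: mean intensities are measured in the face region and in the adjacent background regions for a low-illumination frame and a high-illumination frame, and the change in face-to-background contrast is tested against a threshold. The region boxes, the grayscale input and the threshold value are assumptions for illustration.

```python
import numpy as np

def mean_intensity(gray_image, box):
    """Mean pixel intensity inside box = (x0, y0, x1, y1) of a grayscale image."""
    x0, y0, x1, y1 = box
    return float(gray_image[y0:y1, x0:x1].mean())

def inter_region_contrast(gray_image, face_box, background_boxes):
    """Face-to-background contrast: face mean intensity minus the average of the
    mean intensities of the adjacent (background) regions."""
    face = mean_intensity(gray_image, face_box)
    background = float(np.mean([mean_intensity(gray_image, b) for b in background_boxes]))
    return face - background

def likely_real_face(frame_low, frame_high, face_box, background_boxes, threshold=15.0):
    """Compare the contrast change between the low- and high-illumination frames.

    A real 3D face, being much closer to the light source than the background,
    is expected to brighten far more than the background, so the contrast change
    should exceed the (illustrative) threshold; a flat 2D spoof tends to brighten
    more uniformly across the regions.
    """
    c_low = inter_region_contrast(frame_low, face_box, background_boxes)
    c_high = inter_region_contrast(frame_high, face_box, background_boxes)
    return (c_high - c_low) > threshold
```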
  • Liveness estimation in accordance with different embodiments of the invention may include one, two, or three of the facial expression analysis, facial dimensionality analysis, and facial motion analysis. Furthermore, while these analyses, where two or more are provided, can be performed sequentially, they may be performed simultaneously if allowed by the processing capabilities of the hardware used. For example, while a person is directed to perform an expression during the facial expression analysis, the illumination level can be changed so that the data acquired during that time can also be used to perform the face dimensionality analysis.
Face fit

  • In one or more of the analyses mentioned above, the user is asked to position his or her face so that the facial image is within a certain target area on the screen.
  • there is a tolerance range for the positioning so that the algorithm considers the face image to be “in position” even if there is a slight difference between the area taken by the face image and the target area. Irrespective of whether the tolerance is large or small, if the user places his or her face at the boundary of the tolerance range, there is a lesser degree of freedom for the user to do further required actions, such as smiling (e.g., in a facial expression analysis example).
  • the user’s facial image may more easily go out of the tolerance area, and the user may as a result need to repeat the process of positioning his or her face again.
  • the liveness estimation system applies a novel process in fitting the user’s facial image to the target area.
  • An example of the face fitting process is depicted in Figure 12.
  • the process starts with a “rough fitting” step 1202 in which the user is asked to move so that an outline of his or her facial image is in a first area (1304 in Figure 13(1)).
  • the first area 1304 will be set as a relatively large tolerance range bound by the dashed lines 1306, 1308 which respectively represent the lower and upper limit of the first area 1304.
  • the first area 1304 may also be considered the rough fitting target.
  • Circle 1302 represents the centre of the rough fitting target 1304.
  • Dashed line 1306 may also be considered to define the “negative” tolerance from the target centre 1302 and dashed line 1308 may be considered to define the “positive” tolerance from the target’s centre 1302.
  • the position of the boundary of the facial image 1310 is measured. This boundary 1310 is considered to define a reference area.
  • a revised, smaller tolerance range is determined with the measured boundary 1310 as the centre of the smaller range. Referring to Figure 13 (2), the tolerance range 1312 around the measured facial boundary 1310, as defined by dashed lines 1314, 1316, is smaller than the rough tolerance range 1304.
  • Steps 1204 and 1206 may together be considered the “concise fitting” steps.
  • the resulting tolerance range 1312 may be considered to provide the “concise fitting target”.
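The two-stage fitting described above can be sketched as follows, treating the facial boundary and the targets as circles characterised by a radius, as in the figures. The numeric tolerances are assumptions chosen for illustration.

```python
def within_tolerance(value, centre, tol):
    """True if value lies inside [centre - tol, centre + tol]."""
    return (centre - tol) <= value <= (centre + tol)

def rough_then_concise_fit(face_radius, rough_target_radius, rough_tol=40.0, concise_tol=12.0):
    """Two-stage face fit (radii in pixels).

    Stage 1 ("rough fitting"): accept the face if its boundary radius falls within
    a relatively large tolerance around the rough fitting target.
    Stage 2 ("concise fitting"): reset the reference to the measured boundary and
    apply a smaller tolerance around it for subsequent frames.
    Returns None while the user still needs to adjust position.
    """
    if not within_tolerance(face_radius, rough_target_radius, rough_tol):
        return None
    return {"reference_radius": face_radius, "tolerance": concise_tol}

def still_in_position(current_radius, reference):
    """Check that the facial image stays within the concise fitting target."""
    return within_tolerance(current_radius, reference["reference_radius"], reference["tolerance"])

# Example: the user roughly fits a 180 px face to a 190 px target, and the measured
# boundary then becomes the reference for the tighter, concise tolerance.
ref = rough_then_concise_fit(face_radius=180, rough_target_radius=190)
print(ref, still_in_position(188, ref))
```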
  • the targets and their bounding lines are defined by circles. However these may take other shapes such as an oval or a shape that resembles a facial boundary shape.
  • FIG 14 schematically depicts an example of an automated system 1400 for the purpose of authenticating a traveller or registering a traveller.
  • the system 1400 includes a device 1402 which includes a camera 1406 which is configured to acquire input image data, or the device 1402 has access to a camera feed.
  • the device 1402 includes a processor 1403, which may be a central processing unit or another processor arrangement, configured to execute machine instructions to provide the liveness estimation method mentioned above, either in full or in part.
  • the processor 1403 can be configured to only execute the method in part, if a backend system with more powerful processing is required to process any of the steps of the liveness estimation method.
  • the machine instructions may be stored in a memory device 1407 collocated with the processor 1403 as shown, or they may partially or wholly reside in one or more remote memory locations accessible by the processor 1403.
  • the processor 1403 may also have access to data storage 1405 adapted to contain the data to be processed, and possibly to at least temporarily store results from the processing.
  • the device 1402 further includes an interface arrangement 1404 configured to provide audio and/or video interfacing capabilities to interact with the traveller.
  • the interface arrangement 1404 includes the display screen and may further include other components such as a speaker, microphone, etc.
  • the input image data are processed by a liveness estimator 1408 configured to implement the liveness estimation method.
  • the liveness estimator 1408 is provided as a computer program or module, which may be part of an application executed by the processor 1403 of the device 1402.
  • the liveness estimator 1408 is supported by a remote server or is a cloud-based application, and is accessible via a web-based application in a browser.
  • the box denoting the device 1402 is represented in dashed lines to conceptually signify that the components therein may be provided in the same physical device or housing, or that one or more of the components may instead be located separately.
  • the device 1402 is a programmable personal device such as a mobile phone or tablet
  • the mobile phone or tablet can provide a single hardware equipment containing the input/output (I/O) interface arrangement 1404, processor 1403, data storage 1405, communications module 1409, camera hardware 1406, and local memory 1407.
  • the machine instructions for the liveness estimation can be stored locally or accessed from the cloud or a remote location, as mentioned previously.
  • the automated system 1400 is used in the travel context.
  • a passport image 1410 is provided as a reference image for the purpose of the expression analysis performed by the liveness estimator, in embodiments where the analysis is performed.
  • the provision of the passport photo may be by the traveller taking a photograph or a scanned image of the passport page using the device 1402.
  • the kiosk may include a scanning device configured to scan the relevant passport page.
  • the device 1402 is a “local device” as it is in a wireless connection with a backend system 1412. Such local devices may be provided by mobile phones or tablets.
  • the backend system 1412 is a remote server or server system where the 1:N biometric matching engine 1414 resides. Communication between the device 1402 and the backend system 1412 is represented by dashed double arrow 1411, and may be over a wireless network such as but not limited to a 3G, 4G, or 5G data network, or over a WiFi network.
  • the backend system 1412 may instead be provided by another server or server system, such as an airport server separate to but in communication with the server performing the 1:N matching.
  • the backend system 1412 may include a backend liveness estimator 1416 configured to implement the same method as that implemented by the liveness estimator 1408, either partially or in full.
  • the camera feed data and the passport image 1410 are also sent to the backend liveness estimator 1416. That is, while the liveness estimator 1408 in the device is processing the live camera feed, the camera data is also being fed to the backend server 1412 for the same processing. This serves the purpose of performing a verification run of the processing to ensure there is no corruption in the result(s) returned by the liveness estimation, or for the purpose of performing step(s) in the liveness estimation method which might be too computationally intensive for the local device 1402 to handle, or both.
  • the automated process of authenticating or enrolling the traveller will only proceed, for 1:N matching to occur, if the results from the local liveness estimator 1408 and the backend liveness estimator 1416 both indicate “liveness” of the facial image in the camera feed.
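The dual-check described above amounts to a simple conjunction of the two estimator results before 1:N matching is allowed to proceed, as in this minimal sketch (function and parameter names are illustrative).

```python
def proceed_to_matching(local_liveness: bool, backend_liveness: bool) -> bool:
    """1:N biometric matching proceeds only if both the local device estimator
    and the backend estimator indicate a live face in the camera feed."""
    return local_liveness and backend_liveness

print(proceed_to_matching(local_liveness=True, backend_liveness=True))   # True
print(proceed_to_matching(local_liveness=True, backend_liveness=False))  # False
```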
  • liveness estimation is described as part of a check before biometric identification is performed.
  • biometric identification does not affect the working of the liveness estimation and thus is not considered a part of the invention in any of the aspects disclosed.
  • liveness estimation may be implemented in systems which do not perform biometric identification. For example, it may be implemented in systems to check whether anyone passing through or at a check point is using a spoofing device to conceal his or her identity or to pose as someone else, e.g., to join a video conference or to register themselves onto a particular user database, using a “spoof”.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Veterinary Medicine (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Pathology (AREA)
  • Dentistry (AREA)
  • Computer Security & Cryptography (AREA)
  • Ophthalmology & Optometry (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a method of estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face. The method comprises one or more of: determining and analysing movements effected by the user to fit the facial image of the presented face to a randomised goal; determining and analysing changes in facial feature metrics in the facial images in response to a change in the user's expression; determining and analysing an effect of an illumination change on contrast levels between a facial region in the acquired images and one or more regions adjacent the facial region.

Description

AUTOMATED FACIAL DETECTION WITH ANTI-SPOOFING
TECHNICAL FIELD
This disclosure relates to automated facial authentication or identification systems, in particular to address potential vulnerabilities of these systems to spoofing attacks, whereby someone attempts to be authenticated by presenting an image of a face to the system.
BACKGROUND ART
The use of biometrics in the authentication or identification of individuals has been gaining traction in recent years, in particular given advances in facial recognition and image processing techniques. An application for which such use can be readily adopted is the identification or registration of passengers, particularly in airports, where there are already self-serve kiosks at which passengers can complete other functions such as checking into flights, printing boarding passes, or printing baggage tags. With the advance of computing and camera technologies, facial biometric verification may also increasingly be used in other scenarios, such as building access control.
In a facial biometric identification system, an image of a person’s face is captured, analysed and compared with a database of registered faces to determine whether there is a match. Based on the result of this determination the system ascertains the identity of the person. This process is potentially vulnerable to “spoofing” attempts by an imposter to disguise their true identity, by presenting a “spoof”, i.e., the facial image of someone else, to the biometric identification system. The system needs to be able to determine whether it has captured an image of a live face, or an image of a “spoof”.
Current solutions to detect such “spoofing”, i.e., to estimate that the image is of a spoof, rely on analysing colour images taken with cameras. However, this approach is limited in its ability to defeat spoofing attempts using videos. Further adding to the issue is the availability of image manipulation software which can be used to animate pictures. For instance, there are mobile applications which can be downloaded to synthesize blinking.
An imposter can present a display of a mobile device showing a facial picture of a different person to the biometric identification system, while using such software to animate the picture to make it appear as though the person is interacting with the biometric identification system. This makes it more difficult for facial biometric identification systems to detect an imposter attempt simply by requesting a live interaction with the person they are trying to identify.
It is to be understood that, if any prior art is referred to herein, such reference does not constitute an admission that the prior art forms a part of the common general knowledge in the art in any country.
SUMMARY
In a first aspect, the invention provides a method of estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising one or more of:
(a) determining and analysing movements effected by the user to fit a facial image of the presented face to a randomised goal;
(b) determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression; and
(c) determining and analysing an effect of an illumination change on contrast levels between a facial region in the acquired images and one or more regions adjacent the facial region.
In some examples, (a), (b), and (c) are performed sequentially.
In some examples, at least one of (a), (b), and (c) is performed at the same time as another one of (a), (b), and (c).
In some examples, determining and analysing movements effected by the user to fit a facial image of the presented face to a randomised goal comprises: displaying a randomised goal on a display screen, and directing the user to effect movement to move his or her facial image from a current position in relation to the display screen such that his or her facial image will fit the randomised goal; analysing image frames acquired while the user is directed to track the randomised goal; and estimating whether the presented face is a real face based on the analysis.
The goal may be randomised in that it has a randomised location, size, or both.
In some examples, analysing image frames acquired while the user is directed to track the randomised goal comprises estimating movement effected by the user.
In some examples, estimating whether the presented face is a real face comprises comparing the determined movement or movements with a movement which a person is expected to make.
In some examples, the estimation is made by a machine learning model.
In some examples, the estimation is made by a reinforcement learning based model trained using data in relation to natural movements.
In some examples, determining movement comprises determining a path between the current position and the randomised goal.
In some examples, determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression comprises: directing the user to make an expression to cause a visually detectable facial feature warping; analysing image frames acquired while the user is directed to make the expression; and estimating whether the presented face is a real face based on the analysis.
In some examples, analysing the image frames acquired while the user is directed to make the expression comprises obtaining a time series of metric values from the series of analysed frames, by calculating, from each analysed frame, a metric based on the position of one or more facial features.
In some examples, estimating whether the presented face is a real face based on the analysis comprises: determining a momentum in the time series of metric values and comparing the momentum with an expected momentum profile for a real smiling face.
In some examples, the user is directed to smile.
In some examples, for each analysed frame, the metric calculated is or comprises a ratio of a distance between eyes of a detected face in the analysed frame to a width of a mouth of the detected face.
In some examples, the method comprises comparing the image frames acquired while the user is directed to make the expression with a reference image in which the user has a neutral expression, and selecting an image from the analysed image frames which is most similar to the reference image as an anchor image in which the user is deemed to have a neutral expression.
In some examples, analysing an effect of a change in illumination on the presented face on the image data comprises: capturing one or more first image frames of the presented face at a first illumination level, and capturing one or more second image frames of the presented face at a second illumination level which is different than the first illumination level; analysing the first image frame to determine a first contrast level between a detected face region in the first image frame and an adjacent region which is adjacent the detected face region; analysing the second image frame to determine a second contrast level between a detected face region in the second image frame and an adjacent region which is adjacent the detected face region of the second image frame, wherein a relationship between the adjacent region and the detected face region of the second image frame is the same as a relationship between the adjacent region and the detected face region of the first image frame; comparing the first and second contrast levels to estimate whether the presented face is likely to be a real face.
In some examples, comparing the first and second contrast levels comprises determining whether a change between the first and the second contrast levels is greater than a threshold.
In some examples, during any step when the user is required to fit his or her facial image to a target area on a screen, the method comprises: directing the user to fit a facial image of his or her presented face to a first, larger target area; and upon detecting a facial image within the first target area, setting an area generally taken by the facial image as a reference target area.
In some examples, the method comprises applying a tolerance range around the first, larger target area, whereby detection of a facial image within the tolerance range will trigger setting of the area of the facial image as the reference target area.
In some examples, the method comprises applying a tolerance range around the reference target area, so that the facial image of the user’s presented face is considered to stay within the reference target area if it is within the tolerance range around the reference target area.
In another aspect, the invention provides an apparatus for estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising a processor configured to execute machine instructions which implement the method mentioned above.
In some examples, the apparatus is a local device used or accessed by the user. In other examples, the apparatus is a kiosk at an airport. The kiosk may be a check-in kiosk, a bag drop kiosk, a security kiosk, or another kiosk. In some examples, the local device is a mobile phone or tablet.
In some examples, the acquired image frames are sent over a communication network to a backend system, and are processed by a processor of the backend system configured to execute machine instructions at least partially implementing the method mentioned above.
In some examples, the presented face is estimated to be a real face, if processing results by the processor of the apparatus and processing results by the processor of the backend system both estimate that the presented face is a real face.
In some examples, the apparatus is configured to enable the user to interface with an automated biometric matching system to enrol or verify his or her identity.
In some examples, the user is an air-travel passenger, and the backend system is a server system hosting a biometric matching service or is a server system connected to another server system hosting a biometric matching service.
In another aspect, the invention provides a method of biometrically determining a subject’s identity, including: estimating whether a presented face of the subject is a real face, in accordance with the method mentioned above; providing a two-dimensional image acquired of the presented face for biometric identification of the subject, if it is estimated that the presented face is a real face; and outputting a result of the biometric identification.
In some examples, the method is performed during a check-in process by an air travel passenger.
In a further aspect, there is provided a computer readable medium having stored thereon machine readable instructions, which when executed are adapted to perform any of the methods mentioned above.
In yet a further aspect, there is provided a biometric identification system, including an image capture arrangement, a depth data capture arrangement, and a processor configured to execute machine readable instructions, which when executed are adapted to perform the method of biometrically determining a subject’s identity mentioned above.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described by way of example only, with reference to the accompanying drawings in which
Figure 1 is a schematic for a liveness estimation method, according to one embodiment of the present invention;
Figure 2 is a flow chart for a facial expression detection process according to one embodiment;
Figure 3 (1) is an image in which a person has a smiling expression;
Figure 3 (2) is an image in which a person has a neutral expression;
Figure 4 (1) shows the lip distance to eye separation ratio (LD/ED) through the frames, where the facial expression changes from a neutral to a smiling expression;
Figure 4 (2) shows LD/ED through the frames, where the facial expression changes from a smiling to a neutral expression;
Figure 5 (1) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression remains neutral;
Figure 5 (2) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a slow-paced smile made by a real face;
Figure 5 (3) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a medium-paced smile made by a real face;
Figure 5 (4) depicts the LD/ED and momentum in the LD/ED as calculated from image frames acquired over a period of time, when the expression is a fast-paced smile made by a real face;
Figure 6 (1) depicts a first image of a face changing from a neutral expression to a smiling expression, in which the face is neutral, and a second image of the face in which the face is smiling;
Figure 6 (2) is a schematic depicting the times at which image frames are expected to be acquired when the camera frame rate is 10 frames per second (fps);
Figure 6 (3) depicts a time series of LD/ED data measured from 12 frames taken at 2 fps;
Figure 6 (4) is a schematic depicting how the time series of Figure 6 (3) is used to populate, or interpolate values to populate, a data series for analysis at a target sampling rate which is higher than 2 fps;
Figure 7 schematically depicts a user interface on a mobile device where the user is asked to fit his or her facial image to a randomised target;
Figure 8 is a depiction of a reinforcement training model;
Figure 9 depicts an example face motion test, in accordance with an embodiment of the invention;
Figure 10 depicts an example face dimensionality analysis, in accordance with an embodiment of the invention;
Figure 11 depicts an example of a facial region and four adjacent, non-facial regions;
Figure 12 depicts an example facial target fitting process in accordance with an embodiment of the invention;
Figure 13 (1) schematically depicts a rough fitting step mentioned in Figure 12;
Figure 13 (2) schematically depicts a concise fitting step mentioned in Figure 12;
Figure 13 (3) schematically depicts a reference reset step mentioned in Figure 12; and
Figure 14 schematically depicts an example of an automated system for the purpose of authenticating a traveller or registering a traveller, in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
In the following detailed description, reference is made to accompanying drawings which form a part of the detailed description. The illustrative embodiments described in the detailed description, depicted in the drawings, are not intended to be limiting. Other embodiments may be utilised and other changes may be made without departing from the spirit or scope of the subject matter presented. It will be readily understood that the aspects of the present disclosure, as generally described herein and illustrated in the drawings can be arranged, substituted, combined, separated and designed in a wide variety of different configurations, all of which are contemplated in this disclosure.
Disclosed is a method and system for detecting spoofing attempts or attacks. In a spoofing attempt, a facial image or model, rather than a live face, is presented to an automated system which uses facial biometrics for purposes such as the enrolment, registration or the verification of identities, to try to fool the automated system.
The “spoof” which is presented to the automated system in a spoofing attempt may be a static two-dimensional (2D) spoof such as a print-out or cut-out of a picture, or a dynamic 2D spoof such as a video of a face presented on a screen. The spoof may be a static three-dimensional (3D) spoof such as a static 3D model or a 3D rendering. Another type of spoof is a dynamic 3D spoof, for example a 3D model with facial expression dynamics, presented to a real camera or through a virtual camera.
Aspects of the disclosure will be described herein, with examples in the context of anti-spoofing for the biometric identification of persons such as air-transit passengers or other travellers. However, the technology disclosed is applicable to any other automated systems utilising facial biometrics.
In the context of air travel, the capture and biometric analysis of a passenger’s face can occur at various points, such as at flight check-in, baggage drop, security, boarding, etc. For example, in a typical identification system utilising facial biometric matching, the identification system includes image analysis algorithms which rely on colour images taken with cameras. Systems using these algorithms are therefore limited in their ability to detect when a pre-recorded or a synthesized image sequence (i.e., video sequence) is being shown to the camera of the biometric identification system, rather than the real face of a person. The challenges are even greater when a 3D spoof is presented.
Thus, anti-spoofing for such systems may be done by configuring them to, or combining them with a system configured to, estimate whether the image being analysed is likely to have been taken of a spoof or a real face, i.e., estimating the liveness of the presented face.
Embodiments of the present invention provide a method for estimating the liveness of a face presented to a facial biometric system, i.e., determining whether it is a real face or a spoof. The disclosed method is implementable as an anti-spoofing algorithm, step, or module, in the facial biometric system. The system may be configured to enrol or register passengers, or verify passenger identities, or both. The disclosure also covers facial biometric systems which are configured to implement the method.
Figure 1 is a high-level schematic for a liveness estimation method 100, according to one embodiment of the present invention. Image data is received or acquired by a system implementing the method at step 102. The image data is processed at step 104, and then a liveness estimation is made at step 106. The processing of the image data, in the depicted embodiment, includes a facial expression analysis 108, a motion tracking analysis 110, and a facial lighting response analysis 112. The processing at step 104 may be an interactive process whereby the system will output instructions to direct the passenger attempting to register or authenticate their identity (e.g., at check-in) to take particular actions. Image data captured whilst the passenger is performing the actions can then be analysed. Arrow 105 represents the interactive process whereby, during the processing (step 104), further image data are acquired or received to be analysed.
It should be noted that in other embodiments, the processing which occurs at step 104 may be different than that depicted in Figure 1, by including only one or two of the three types of analyses shown in the figure.
Facial expression analysis
The facial expression detection 108 is configured to analyse the facial images as detected in the input image data. The input image data is acquired whilst the user is directed to make a particular facial expression or a series of expressions, in order to determine if a facial expression likely to be made by a real face can be detected. The facial expression is of a type such that at least a partial set of the user’s facial features or muscles are expected to move whilst the user is making the expression. The movement or movements cause a “warping” in the facial features as compared with an expressionless or neutral face. Thus, the analysis for performing the facial expression detection 108 is configured to characterise this warping from the image data, to determine whether a real face is being captured making the expression, or whether the facial image is likely to have been captured from a “spoof”.
Figure 2 is a high-level depiction of the facial expression detection 108 according to one embodiment. In this embodiment, the facial expression detection 108 is conceptually depicted (as represented by the dashed rectangular box) as occurring after the step of detecting facial images in the image data (step 113). However, in other embodiments, the facial image detection 113 may be included as part of the facial expression detection process 108.
Referring to Figure 2, the facial expression detection 108 at step 114 detects one or more facial features in each processed image frame. At step 116, one or more metrics may be calculated from the detected facial features, such as the width or height of a particular facial feature or the distance between facial features. At step 118, the positions of the detected features, or the metrics calculated from step 116 (if step 116 is performed), are tracked over the period of time during which the user is asked to perform the facial expression. At step 120, the time series of data is analysed, to determine if it is indicative of a real, “live” face or of a spoof being captured in the image data.
One example of the process is now described with reference to Figures 3 (1) and 3 (2), which for illustrative purposes show examples where the expression the user is directed to make is a smile. The facial features being identified are the eyes and the mouth of the person. When a person changes from a neutral or non-smiling expression (Figure 3 (2)) to a smiling expression (Figure 3 (1)), the separation between the eyes is expected to remain the same, whilst the mouth is expected to widen. Therefore, the ratio of the mouth width to the distance between the eyes is expected to increase as shown in Figure 4 (1). Conversely, when a person changes from a smiling expression to a neutral or non-smiling expression, this ratio is expected to decrease as shown in Figure 4 (2). The metrics which are calculated from the facial features may include the distance between the eyes (ED) and the distance across the width of the lips (LD), and the ratio between the two distances (i.e., LD/ED). In this example, values of the ratio LD/ED are tracked over time and the time series of the values of the metric is analysed. It should be noted that in other implementations where the user is asked to smile or make other expressions, different metrics can be tracked within the same premise of tracking a movement or a warping in facial metrics over a series of images.
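To make the metric concrete, the following is a minimal, illustrative Python sketch of the LD/ED computation described above. It assumes that eye and mouth corner coordinates have already been obtained from some face landmark detector; the landmark names and the example coordinates are hypothetical and are not taken from this disclosure.

```python
# Illustrative sketch only: compute the lip-width to eye-separation ratio (LD/ED),
# assuming (x, y) landmark coordinates are already available from a face landmark
# detector. Landmark keys and values below are hypothetical.
import math

def euclidean(p, q):
    """Straight-line distance between two (x, y) landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def lip_eye_ratio(landmarks):
    """Return LD/ED for one frame, given eye-centre and mouth-corner landmarks."""
    ed = euclidean(landmarks["left_eye"], landmarks["right_eye"])      # eye separation (ED)
    ld = euclidean(landmarks["mouth_left"], landmarks["mouth_right"])  # lip width (LD)
    return ld / ed

# Example: the ratio is expected to rise as a neutral face starts to smile.
neutral = {"left_eye": (120, 140), "right_eye": (200, 140),
           "mouth_left": (135, 220), "mouth_right": (185, 220)}
smiling = {"left_eye": (120, 140), "right_eye": (200, 140),
           "mouth_left": (125, 220), "mouth_right": (195, 220)}
print(lip_eye_ratio(neutral), lip_eye_ratio(smiling))  # e.g. 0.625 -> 0.875
```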
The aforementioned process may be generalised to include other embodiments. The generalisation may be in one or more different ways. For example, expressions other than a smile may be used, as long as the expressions can be expected to cause measurable changes or a “warping” of the facial muscles. The computation used for quantifying the changes may, for example, involve a different number of facial points for the purpose of analysing the different types of expressions. The computation may also be non-linear, for instance involving a model based on non-linear computation such as but not limited to a polynomial or a spline model, rather than a linear model.
In the facial expression analysis 108, the analysis which is performed (step 120) on the time series of values may include determining a momentum in the time series of values, and ascertaining characteristics of the momentum to see if they indicate a “lively” expression has been made, i.e., by a real face. Different methods may be used to measure momentum. An example is the Moving Average Convergence/Divergence (MACD) analysis, but other tools for providing measures of momentum may be used instead.
Referring again to the example where the user is asked to smile and the metric being measured over time is the ratio LD/ED, the momentum in the time series differs, depending on whether there remains no smile (Figure 5 (1)), or whether the smile is a slow-paced smile (Figure 5 (2)), a medium-paced smile (Figure 5 (3)), or a fast-paced smile (Figure 5 (4)). In Figures 5 (1) to 5 (4), the LD/ED metric (in % units) is shown in the top graphs. The MACD analyses are shown in the bottom graphs of Figures 5 (1) to 5 (4). The horizontal axis represents sample points.
As can be seen from Figure 5 (1), when there is no change in facial expression, as might be expected from a static 2D or 3D spoof, changes in the LD/ED ratio do not indicate any consistent momentum or trend in momentum. For instance, in Figure 5 (1), the amplitudes of the MACD line 501 are within a low threshold, in this case plus or minus 0.1. On the other hand, as can be seen from Figures 5 (2) to 5 (4), when the smile is made by a real face (from a neutral expression), irrespective of the pace of the smile, the momentum generally increases toward the end of the time period, and the MACD line 502 has much higher amplitudes. The determination of whether one or more of the aforementioned characteristics can be observed from the momenta data helps to indicate whether the smile is made by a “lively”, i.e., real, face.
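As a rough illustration of such a momentum check, the sketch below computes a standard MACD line (difference of 12- and 26-period exponential moving averages) over an LD/ED series and flags whether its amplitude ever leaves a low band. The band value mirrors the plus or minus 0.1 example above, but the exact thresholding rule is an assumption for illustration, not the method of this disclosure.

```python
# Minimal sketch, assuming the LD/ED series is expressed in the same percentage
# units as in Figures 5 (1) to 5 (4) and sampled at the target sampling rate.
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd_line(values, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA at each sample point."""
    return [f - s for f, s in zip(ema(values, fast), ema(values, slow))]

def shows_smile_momentum(ld_ed_series, amplitude_threshold=0.1):
    """Crude liveness cue: does the MACD amplitude ever leave the low band
    expected of an unchanging (e.g. static spoof) expression?"""
    return max(abs(m) for m in macd_line(ld_ed_series)) > amplitude_threshold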
In the above, the momentum which is being analysed indicates the momentum in the facial metrics as the facial expression is expected to change to or from a “neutral expression”. Thus, the facial expression analysis algorithm needs to have access to an image which can be considered to provide a neutral or expressionless face. This image may be referred to as an “anchor” image. This may be done by asking the person interacting with the automated system to assume an expressionless face. The algorithm may set a threshold or threshold range for the metric or metrics being analysed, and assign an image in which the threshold or threshold range is met as the anchor image.
Alternatively, the anchor image may be chosen based on a reference image for the person who is interacting with the automated system, in which the face is expected to be neutral. The reference image may be a prior existing image such as an identification photograph, an example being a driver’s license photograph or the photograph in a passport. For example, each image in the series of input images is compared with the reference image. The input image which is considered “closest” to the reference image will be chosen as the “anchor” image. The image frames in the series of input images after the anchor image can then be used for the analysis to determine whether the “face” in the images is a real face or a spoof. The comparison between the reference image and the input images may be made using biometric methods. The comparison may be made on the basis of the specific facial feature metric(s) being used in the facial expression analysis, by comparing the metric(s) calculated from the reference image with the same metric(s) calculated from the input images, and identifying the input image from which the calculated metric(s) is or are closest to the metric(s) calculated from the prior existing image. The identified input image is then chosen as the “anchor” image.
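A minimal sketch of this anchor-frame selection is shown below, assuming the per-frame LD/ED ratios and the reference ratio have already been computed; the function name and example values are illustrative only.

```python
# Pick the frame whose metric is closest to the reference (e.g. passport) metric.
def choose_anchor_frame(frame_ratios, reference_ratio):
    """Return the index of the frame deemed most 'neutral'."""
    diffs = [abs(r - reference_ratio) for r in frame_ratios]
    return min(range(len(diffs)), key=diffs.__getitem__)

# Example with made-up ratios (percent): frame 1 is closest to the reference,
# so it would serve as the anchor image for the subsequent analysis.
print(choose_anchor_frame([68.0, 62.5, 71.4, 80.1], reference_ratio=62.0))  # -> 1
```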
In the air travel scenario where a passenger is interacting with an automated system to, e.g., check-in, his or her passport photograph can be used to provide the reference image. This has the benefit, as explained above, of the passport photograph generally being expected to show a face with a neutral expression, per the International Civil Aviation Organization (ICAO) requirement for portrait quality.
Using a photograph such as a passport photograph has further benefits, in that facial metrics measured on the basis of horizontal distances in the passport photograph are not expected to be subject to significant influence by distortions in the camera used to acquire the passport photo. This is because, in relation to the face, a person’s eyes and lips are expected to remain on the same vertical axis irrespective of the facial expression. Therefore, any vertical distortions can be expected to affect the eyes and the lips equally. Thus vertical distortions are not expected to have any real influence on ratios such as LD/ED which rely on measurements across a horizontal distance. On the other hand, horizontal distortion may impact the lip-eye-ratio metric LD/ED. However, since passport images are generally taken in accordance with the International Civil Aviation Organization (ICAO) portrait quality standards, in most cases the effect of camera distortions on the metric LD/ED can be expected to be minimal.
Therefore, even if there are variances between the optical distortions in the camera used to acquire the passport photo and those in the camera used to acquire the image data while the user is interacting with the automated system, the variances are not expected to have significant effects on the LD/ED measurements.
In practical implementation, particularly where users are interacting with the automated system via their own devices, the facial expression detection algorithms may run on hardware having different technical specifications. For example, some older smartphones have lower frame rates than the newest smartphones. Therefore, in some embodiments, the algorithm is configured so that it will be able to perform the facial analysis when run on different hardware having different frame rates. In this way, such embodiments of the liveness estimation system which are intended to work on different types of devices as may be owned by users of the biometric system, will be “device” agnostic by being agnostic to frame rates, provided a minimum level of frame rate is available.
In the context of air travel, this is useful in cases where passengers perform self check-in on their own mobile devices. The self check-in may be done on a mobile application installed on the mobile device, or via a web-based application which the user can access using a browser on the mobile device. For example, the web application may be supported by a server providing the 1 to N biometric matching to verify passenger identities.
Consider the case where a slower camera acquires images at a frame rate of 10 frames per second (fps), and a faster camera acquires images at a frame rate of 30 fps. Consecutive frames taken by the faster camera will be taken at temporal sample points which are at a temporal separation which is only a third of the temporal separation between consecutive frames taken by the slower camera. Thus, given the exact same change in expression in real time over a period of time, the difference between consecutive image samples taken by the slower camera is expected to be greater than the difference between consecutive image samples taken by the faster camera. This can lead to differences between the analysis results. For instance, if MACD analysis is performed, then the analysis performed on the series of metrics calculated from consecutive images taken by the slower camera will indicate a higher “momentum” than the MACD analysis result from the series of metrics calculated from consecutive images taken by the faster camera.
To mitigate the problem, optionally for all embodiments, the facial expression analysis algorithm is configured to perform the analysis whereby the samples used are at a “target sampling rate”. The algorithm may be configured to require a minimum or predefined number of samples (M samples) at the “target sampling rate” to be available. The number “M” may be determined as the number of samples which are expected over a predetermined period of time at the target sampling rate. The M samples are used as the data series for expression analysis. The initial time point for the M samples is set to temporally coincide with one of the input image frames, and the facial feature metric calculated from that input image frame will be used as the first of the M samples. The image frame providing the first sample in the M sample series may be the very first input image frame acquired. Alternatively it may be an image frame which is taken at a predefined period of time after the initial image frame, or it may be the first image captured once the algorithm determines that the facial image fits to a “goal” area in the display view, or it may be the input image frame used as the anchor image.
For each of the subsequent M-1 samples, the sample value will be the facial feature metric calculated from the input image frame which temporally coincides with the sample, if available. If no input image frame which temporally coincides with a required sample is available, then the value for that sample will be determined from the facial metric values calculated from the input image frames which are the closest in time to the sample. For instance, the sample value may be determined by interpolating between the facial feature metrics calculated from the input image frames which temporally immediately precede and immediately succeed the time for the sample.
The target sampling rate may be one which is expected to be met or exceeded by the frame rates of most camera hardware included in the users’ devices (e.g., most of the available smart phone or tablet cameras). The facial expression analysis algorithm may be configured so that it requires the input images to be acquired at or above a minimum actual sampling rate which is required in order to generate useful input data for the facial expression analysis algorithm at the target sampling rate.
As an example, Figure 6 depicts how the facial expression analysis algorithm obtains the data series comprising data samples at M discrete temporal points over three seconds, as defined by the analysis sampling rate. The sampling rates provided below are examples only, serving to illustrate how the analysis algorithm works, and are not essential features of the invention.
Figure 6 (1) depicts the real time continuous motion which occurs when a person changes the facial expression from a neutral expression to a smiling expression. The images in Figure 6 (1) are artificial-intelligence generated images, provided for illustrative purposes only, and are not actual images acquired of a user. The neutral expression is shown in the input image which is chosen to be the “anchor image” (illustrated by the left hand side image of Figure 6 (1)). The smiling expression may be assumed to be shown in an input image (illustrated by the right hand side image of Figure 6 (1)) which is taken by the camera a pre-set period of time, e.g., 3 seconds (s), after the anchor image.
In this example, the target sampling rate for the data series to be analysed is 10 samples per second as depicted in Figure 6 (2). However the actual sampling rate is only 2 samples per second, where there are input images at times T0, T0 + 0.5s, T0 + 1s, etc. Designating T0 as 0.0 seconds, there are facial metrics calculated from the actual images at times 0.0 second, 0.5 second, 1.0 second, etc., as shown in Figure 6 (3). The system thus needs to generate a data series where the sample points are at the target sampling rate, which in this case means samples are needed at times T0, T0 + 0.1s, T0 + 0.2s, T0 + 0.3s, T0 + 0.4s, ..., etc. It will be understood that the sampling rates mentioned in this paragraph are illustrative only and should not be taken as limiting how embodiments of the algorithm should be implemented.
For each required sample for the input data series, if there is a temporally coinciding input frame from the camera, then the facial feature metrics calculated from those input frames are used to provide the corresponding sample points in the data series, as represented by the dashed arrows between Figures 6 (3) and 6 (4). For some required sample time points in the data series, there may not be any corresponding input frame from the camera, and hence no metrics or measurements which can be directly calculated from the acquired image frames to provide the sample values. Therefore, the sample values at each of these time points are estimated from the sample values of the nearest neighbours. The estimation may be an interpolation. The interpolation may be a linear interpolation.
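The resampling step could, for instance, be sketched as below, using linear interpolation to place the per-frame metric values onto a uniform grid at the target sampling rate. The rates and values simply mirror the illustrative 2 fps / 10 samples-per-second example above and are not prescriptive.

```python
# Sketch: build the analysis series at a fixed "target sampling rate" from frames
# captured at a slower, device-dependent rate, using linear interpolation.
import numpy as np

def resample_metric_series(frame_times_s, frame_values, target_rate_hz, duration_s):
    """Linearly interpolate per-frame metric values onto a uniform time grid."""
    n_samples = int(duration_s * target_rate_hz) + 1       # the "M" samples
    target_times = np.arange(n_samples) / target_rate_hz   # 0.0, 0.1, 0.2, ...
    return np.interp(target_times, frame_times_s, frame_values)

# Frames at 2 fps over 3 s (7 measurements) resampled to 10 samples per second.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ld_ed = [62.5, 63.0, 65.2, 70.1, 78.4, 84.0, 87.5]
series = resample_metric_series(times, ld_ed, target_rate_hz=10, duration_s=3.0)
print(len(series))  # 31 samples at the target rate
```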
The data series can then be used to analyse the characteristics of any observed facial feature motion or warping, to make an estimation of whether the face captured in the input images is likely to be a real face or a spoof.
This method of building the data series for analysis has the benefit of being generally agnostic to the variation in the frame rates in the cameras, at least for cameras capable of operating at or above a target frame rate. Also, at least given current image sensor frame rates, the processing speed of the CPU or processor running the expression analysis algorithms is likely to be much higher. Using interpolation means that the processing algorithm does not necessarily need to wait for the camera to produce enough frames so that metrics can be calculated to fill the required number of data samples in the data series.
Face motion analysis
The liveness estimation 100 may further include a face motion analysis algorithm 110 (see Figure 1), which analyses how the user performs, as can be determined from the input images being captured, when directed by the liveness estimation system to move his or her face.
Referring to Figure 7, in some embodiments, the user is asked to make movements such that his or her facial image matches a “goal” area on the screen. Optionally, once the goal appears, it does not remain unchanged in the same position on the screen 700, and will instead move to or appear in at least one other position, or change in size, or both. The goal may thus be considered a dynamic goal. The positioning, sizing, or both, of the goal 704 may be randomised, to make the algorithm more robust against someone trying to use a dynamic spoof with software which tries to learn and anticipate the movement pattern.
Referring to Figures 7 and 8, from a “trained reinforcement” perspective, the person interacting with the automated system whilst the face motion analysis is performed is the “agent”. Given that most users can be expected to be able to follow directions, a user can be considered a trained “agent” interacting with the system, such that his or her facial image 702 is shown on the display area 700 and will move within the display area 700 to match the position, size, or both, of the goal 704. Therefore, the display area 700 may be considered an ‘environment’ in which the agent is doing an action at time t (“At”), namely, to position the facial image 702 to the goal 704. The position of the user’s face at time t may be considered the “State” at time t (“St”). The “reward” at time t (“Rt”) is therefore defined as the State St matching or substantially matching the position of the goal 704. The face motion analysis determines one or more of various factors, such as whether the reward condition is met, or a characterisation of the relationship between the change in the State and the change in Reward, over a period of time in which the analysis is undertaken, in order to estimate whether the face is likely a real face or a spoof.
Figure 9 outlines an example implementation of a “face motion” test 900 provided by the face motion analysis. At step 902, the algorithm detects the face in the incoming images. At step 904 a “goal” is displayed in a location of the screen which is away from the detected face. The goal will define an area on the screen. The position of the goal, and thus the defined area on the screen, may be randomly selected to result in a “randomised goal”. The system provides direction to the user to make the required movement or movements so that his or her facial image will move to the area defined by the goal. If the algorithm successfully detects the face which fits to the goal area, the system will track the detected face over the image frames to determine the path it takes to move from its starting position (i.e., State “S”) to the position of the area defined by the goal (step 906). The movement or movements, i.e., motion, determined from the images will be analysed (arrow 912).
The movement analysis may include an analysis of the “path” the detected face takes through the image frames (step 914). The determined path is then analysed and a liveness estimation is made based upon the analysis. This path is represented by arrows 706 in Figure 7, for the facial image to be fitted to the goal 704.
When the face motion test 900 is performed by a user with his or her real face, the path is expected to be smoother and shorter than the path which would be taken to move the “spoof”. For instance, some dynamic spoofs use “brute force” attacks where the spoof presented to the automated system will be caused by software to move to random positions until it matches the position of the “goal”. Brute force attacks thus are likely to result in a path which is indirect and which may be erratic, even if they successfully fit the detected facial image to the “goal”. Thus, the analysis may be a comparison of the path which is determined or estimated to have been taken with the path which is expected when someone is not attempting a spoof attack.
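One simple way such a path comparison could be operationalised is sketched below: the travelled path length is compared against the straight-line distance from the start position to the goal, with an erratic, brute-force-style path producing a much larger ratio. The directness threshold and the example coordinates are assumptions for illustration, not values given in this disclosure.

```python
# Sketch: crude path-directness check over per-frame face-centre positions.
import math

def path_directness(face_centres):
    """Ratio of travelled path length to straight-line start-to-end distance."""
    step = lambda p, q: math.hypot(q[0] - p[0], q[1] - p[1])
    travelled = sum(step(p, q) for p, q in zip(face_centres, face_centres[1:]))
    direct = step(face_centres[0], face_centres[-1])
    return travelled / direct if direct > 0 else float("inf")

def looks_like_natural_tracking(face_centres, max_ratio=1.5):
    """Flag paths that wander far more than a direct movement would."""
    return path_directness(face_centres) <= max_ratio

smooth_path = [(100, 100), (140, 130), (180, 160), (220, 190)]
erratic_path = [(100, 100), (300, 120), (90, 250), (220, 190)]
print(looks_like_natural_tracking(smooth_path), looks_like_natural_tracking(erratic_path))
```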
The analysis of the movement or movements may additionally or instead comprise a determination of the “naturality” of the movement or movements (step 916). The movement may comprise only the movement of the user’s face if the user is interacting with the automated system where the image sensor is in a fixed position. In embodiments where the algorithm is intended for applications that are run on mobile devices, the movement may comprise the movement of the face, motion attributed to the movement of the camera as made by the user, or a combination of both. The order of the path analysis (step 914) and the naturality analysis (step 916) may be reversed from that shown in Figure 9.
During the process, if the algorithm does not detect a “real” face which successfully tracks the goal, then it will determine there has been a compliance failure (910), and may ask the user to try again (arrow 908). Although in Figure 9 the “compliance failure” determination (910) occurs after the face naturality analysis (step 916), a determination of “compliance failure” may also be made if a failure occurs at one or more of the other steps. For instance, a compliance failure may be determined if the system does not detect that there is a successful tracking of the facial image to the randomised goal (failure at step 906), or if the system determines from the path analysis that the facial image is an image of a spoof (failure at step 914), or both.
In video data, the final capture (i.e., the last frame or frames) in the video stream is dependent on the motion of the face, or the camera motion if the camera is provided by or as a mobile device, or both. This is because when there is a relative movement between the camera and the user’s face, this relative movement impacts one or more of the position, angle, or depth of the facial features as captured by the camera. Also the relative movement can affect the lighting or shadows as can be observed in the image data. Moving a real face in a three-dimensional (3D) environment, i.e., a “natural movement”, will cause different behaviours in the observed shadow and lighting, as opposed to moving a 2D spoof. Therefore, it is possible to analyse the above-mentioned parameters in the image data, in order to estimate the position or the movement of the user’s face in relation to the camera, in the physical 3D environment, and then determine whether the movement is a “natural movement”.
The determination of whether a movement is a natural movement, as opposed to a spoofed video play or a brute force presentation attack, may be made by a trained model using machine learning. For example, a reinforcement training model (Figure 8) can be used. The user or user’s face would be considered the “agent”, and its position in the 3D environment would be considered as “State”. The “reward” may be a determination that the movement is a natural movement.
On the basis of the path analysis (step 914), the naturality analysis (step 916), or both, the algorithm makes an assessment, i.e., estimation, of whether the face is likely to be a spoof meaning there is a compliance failure (910), or likely to be real (918).
The face motion test 900 may be performed a number of times. That is, the face motion analysis may include multiple iterations of the face motion test 900. The algorithm may require a “successful” outcome, in which the face is estimated to be likely real, to be achieved for all of the tests, or for a threshold number or percentage of the tests, for the overall analysis to estimate that the detected face is likely to be the image of a real face.
Face Dimensionality Analysis
The liveness estimation 100 may further include a lighting response analysis 112 (see Figure 1) which analyses the input image frames to assess the response visible in the input image frames to changes in lighting.
In an environment in which a user is interacting with the automated system on a mobile device, some key face spoofing scenarios may involve using either a mobile device screen or a printed photo to match against a documented face image, e.g., a passport image or an enrolment image. It is expected that the responses of a real face, which is in 3D, will be different from those of a 2D spoof or a mask worn over someone’s face. Therefore, the lighting response analysis may also be referred to as the face dimensionality analysis.
Figure 10 depicts an example process 1000 implemented to provide the face dimensionality analysis 112. In the analysis, the input images are checked to ensure that there is a detectable face correctly positioned in the field of view (1002). For the algorithm to work, it is important that there are no significant movements in the detected face, and that the presented face is sufficiently close to the screen or camera, so that the screen brightness, front flash, or both, is close enough to change the face illumination significantly. In some embodiments, the light from the mobile screen provides the illumination. Once a correctly positioned face is detected in an input image, a plurality of images will be captured at different illumination levels. One or more first images may be captured at a first illumination level (1004), and then one or more second images may be captured at a second illumination level which is different from the first illumination level (1006). The illumination statistics will then be calculated (1008) for the first and second images, to find occurrences of a transition between “dark” and “bright” regions.
The calculation of the intensity statistics is done in respect of a face region in the image, and one or more non-face regions adjacent to the face region. In the example shown in Figure 11, five regions are defined in the input image, including face region R5 and four other regions R1, R2, R3, R4, respectively to the left of, to the right of, above, and below, the face region R5. Regions R1, R2, R3 are chosen so that they capture the background behind a real person, if the person is a real person presenting a real face. Region R4 is a body region which is expected to be at a similar “depth” from the screen or camera, but subjected to slightly less illumination than the face due to the expected positioning of the face in relation to the source of illumination. It should be noted that Figure 11 is an example only. Other embodiments may have different regions. For instance another embodiment may not include any “body region”, or may include a different number of “background regions”.
The statistics are calculated for the series of input frames captured at steps 1004 and 1006, to determine the variation in the intensity contrast between the face region and the adjacent region or regions (1010).
When the presented face is a “real face”, the face will be closer to the screen or the camera than the background which is captured in the adjacent regions which are not of the user’s person. These adjacent regions are expected to be at least a head’s width behind the face. Accordingly it is expected that the face region will be more illuminated than the adjacent regions captured of the background. Therefore, when an illumination level is changed, the effect of this change is expected to cause the largest variation in intensity levels for the face region (e.g., R5 in Figure 11), compared with the regions in which the background behind the person is captured (R1, R2, R3 in Figure 11). For example, referencing Figure 11, the algorithm will calculate a statistic indicating the brightness contrast(s) between region R5 and the adjacent regions (one or more of R1, R2, R3) in the first images, also referred to as the “inter-region” contrast(s) for the first images. The algorithm also calculates a statistic indicating the brightness contrast between those same regions in the second images to obtain the “inter-region” contrast(s) for the second images. The two statistics are compared to determine the amount of variation in the inter-region contrast(s) (e.g., differences in the intensities of the regions), as caused by the change in lighting intensity. If there is a variation between the inter-region contrast(s) in the first images and the inter-region contrast(s) in the second images, and the amount of the variation is above a threshold, then the face is more likely to be a real face (1014) than a spoof (1012).
Even though exposure compensation corrupts absolute changes in brightness levels, relative changes (i.e., contrasts) are significantly impacted by a change in illumination level. Therefore, the dark to bright transition from an adjacent region to the face region is expected to become more pronounced.
It should be noted that in practice, there may be different ways of implementing the process 1000. For example, how to designate a region as “dark” or “bright”, and how to calculate the contrast between regions, can be determined by the skilled person. One example is to apply a threshold to an average pixel intensity, but other methods may be used. Also, the order of the processing shown in Figure 10 may be different. For instance the capture of the second images may be done after, or in parallel with, the calculation of the illumination statistics on the first images.
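As one possible implementation sketch of the contrast comparison, the code below summarises each region by its mean pixel intensity and checks whether the face-to-background contrast changes by more than a threshold between the low- and high-illumination frames. The region coordinates, the contrast measure, and the threshold value are all assumptions, not values prescribed by this disclosure.

```python
# Sketch: inter-region contrast comparison between two illumination levels.
# Frames are assumed to be 2D greyscale numpy arrays; regions are (top, bottom,
# left, right) pixel bounds for the face region (R5) and adjacent regions (R1-R3).
import numpy as np

def mean_intensity(image, region):
    """Average intensity inside a rectangular region."""
    top, bottom, left, right = region
    return float(image[top:bottom, left:right].mean())

def inter_region_contrast(image, face_region, adjacent_regions):
    """Mean difference between the face region and each adjacent region."""
    face = mean_intensity(image, face_region)
    return float(np.mean([face - mean_intensity(image, r) for r in adjacent_regions]))

def likely_real_face(frame_low, frame_high, face_region, adjacent_regions,
                     variation_threshold=15.0):
    """Did boosting the illumination change the face/background contrast enough?"""
    c_low = inter_region_contrast(frame_low, face_region, adjacent_regions)
    c_high = inter_region_contrast(frame_high, face_region, adjacent_regions)
    return abs(c_high - c_low) > variation_threshold
```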
Liveness estimation in accordance with different embodiments of the invention may include one, two, or three of the facial expression analysis, facial dimensionality analysis, and facial motion analysis. Furthermore, while these analyses, where two or more are provided, can be performed sequentially, they may be performed simultaneously if allowed by the processing capabilities of the hardware used. For example, while a person is directed to perform an expression during the facial expression analysis, the illumination level can be changed so that the data acquired during that time can also be used to perform the face dimensionality analysis.
Face fit
In one or more of the analyses mentioned above, the user is asked to position his or her face so that the facial image is within a certain target area on the screen. Generally, there is a tolerance range for the positioning, so that the algorithm considers the face image to be “in position” even if there is a slight difference between the area taken by the face image and the target area. Irrespective of whether the tolerance is large or small, if the user places his or her face at the boundary of the tolerance range, there is a lesser degree of freedom for the user to do further required actions, such as smiling (e.g., in a facial expression analysis example). The user’s facial image may more easily go out of the tolerance area, and the user may as a result need to repeat the process of positioning his or her face again.
To mitigate this problem, as an option, the liveness estimation system applies a novel process in fitting the user’s facial image to the target area. An example of the face fitting process is depicted in Figure 12. The process starts with a “rough fitting” step 1202 in which the user is asked to move so that an outline of his or her facial image is in a first area (1304 in Figure 13(1)). The first area 1304 will be set as a relatively large tolerance range bounded by the dashed lines 1306, 1308 which respectively represent the lower and upper limit of the first area 1304. The first area 1304 may also be considered the rough fitting target. Circle 1302 represents the centre of the rough fitting target 1304. Dashed line 1306 may also be considered to define the “negative” tolerance from the target centre 1302 and dashed line 1308 may be considered to define the “positive” tolerance from the target centre 1302. At step 1204, after the outline or boundary of the user’s facial image is detected to be within the rough fitting target 1304, the position of the boundary of the facial image 1310 (see Figure 13(2)) is measured. This boundary 1310 is considered to define a reference area. At step 1206, a revised, smaller tolerance range is determined with the measured boundary 1310 as the centre of the smaller range. Referring to Figure 13 (2), the tolerance range 1312 around the measured facial boundary 1310, as defined by dashed lines 1314, 1316, is smaller than the rough tolerance range 1304. Steps 1204 and 1206 may together be considered the “concise fitting” steps. The resulting tolerance range 1312 may be considered to provide the “concise fitting target”. In Figures 13 (1) to 13 (3), the targets and their bounding lines are defined by circles. However these may take other shapes such as an oval or a shape that resembles a facial boundary shape.
The aforementioned rough and concise fitting processes, discussed with reference to Figures 13 (1) and 13 (2), will repeat in order to reset the rough fitting target and then reset the concise fitting target, if the user’s facial image goes out of bounds of the concise fitting target (step 1208). The reset rough fitting area 1320, as defined by dashed lines 1322, 1324 bounding the target centre 1318, is shown in Figure 13 (3).
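The two-stage fitting logic might be sketched as follows, modelling each target as a circle with a tolerance band as in Figures 13 (1) to 13 (3). The tolerance values and the shrink factor for the concise target are assumptions chosen only to illustrate the idea of a large rough band followed by a tighter band centred on the measured face.

```python
# Sketch: rough fit, then a tighter "concise" target centred on the detected face.
from dataclasses import dataclass

@dataclass
class CircleTarget:
    cx: float
    cy: float
    radius: float
    tolerance: float  # allowed deviation of the face boundary from the target

    def fits(self, face_cx, face_cy, face_radius):
        """True if the detected face boundary lies within the tolerance band."""
        centre_offset = ((face_cx - self.cx) ** 2 + (face_cy - self.cy) ** 2) ** 0.5
        return (centre_offset <= self.tolerance
                and abs(face_radius - self.radius) <= self.tolerance)

def concise_target_from_face(face_cx, face_cy, face_radius, rough: CircleTarget):
    """After a rough fit, re-centre a tighter target on the measured face boundary."""
    return CircleTarget(face_cx, face_cy, face_radius,
                        tolerance=rough.tolerance * 0.4)  # assumed shrink factor

rough = CircleTarget(cx=160, cy=240, radius=110, tolerance=30)
if rough.fits(150, 235, 100):
    concise = concise_target_from_face(150, 235, 100, rough)
    print(concise)  # tighter band the user must now stay inside
```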
Figure 14 schematically depicts an example of an automated system 1400 for the purpose of authenticating a traveller or registering a traveller. The system 1400 includes a device 1402 which includes a camera 1406 which is configured to acquire input image data, or the device 1402 has access to a camera feed. The device 1402 includes a processor 1403, which may be a central processing unit or another processor arrangement, configured to execute machine instructions to provide the liveness estimation method mentioned above, either in full or in part. For example the processor 1403 can be configured to only execute the method in part, if a backend system with more powerful processing is required to process any of the steps of the liveness estimation method. The machine instructions may be stored in a memory device 1407 collocated with the processor 1403 as shown, or they may partially or wholly reside in one or more remote memory locations accessible by the processor 1403. The processor 1403 may also have access to data storage 1405 adapted to contain the data to be processed, and possibly to at least temporarily store results from the processing.
The device 1402 further includes an interface arrangement 1404 configured to provide audio and/or video interfacing capabilities to interact with the traveller. The interface arrangement 1404 includes the display screen and may further include other components such as a speaker, microphone, etc. There may also be a communication module 1409 so that the device 1402 may receive or access data wirelessly, or communicate data or results to a remote location, e.g., a computer at a separate server, a computer at a monitoring station or cloud storage, over a communication network enabling wireless communication 1411. In use, the input image data are processed by a liveness estimator 1408 configured to implement the liveness estimation method. As mentioned above, the liveness estimator 1408 is provided as a computer program or module, which may be part of an application executed by the processor 1403 of the device 1402. Alternatively the liveness estimator 1408 is supported by a remote server or is a cloud-based application, and accessible via a web-based application in a browser.
In Figure 14, the box denoting the device 1402 is drawn in dashed lines to conceptually signify that the components therein may be provided in the same physical device or housing, or that one or more of the components may instead be located separately. For example, in embodiments where the device 1402 is a programmable personal device such as a mobile phone or tablet, the mobile phone or tablet can provide a single piece of hardware containing the input/output (I/O) interface arrangement 1404, processor 1403, data storage 1405, communications module 1409, camera hardware 1406 and local memory 1407. The machine instructions for the liveness estimation can be stored locally or accessed from the cloud or a remote location, as mentioned previously.
In the depicted example, the automated system 1400 is used in the travel context. A passport image 1410 is provided as a reference image for the expression analysis performed by the liveness estimator, in embodiments where that analysis is performed. The passport photo may be provided by the traveller taking a photograph or a scanned image of the relevant passport page using the device 1402. In examples where the device 1402 is a kiosk, such as an airport check-in kiosk, the kiosk may include a scanning device configured to scan the relevant passport page.
In some embodiments, the device 1402 is a “local device” in that it is in wireless connection with a backend system 1412. Such local devices may be provided by mobile phones or tablets. In the depicted embodiment the backend system 1412 is a remote server or server system where the 1:N biometric matching engine 1414 resides. Communication between the device 1402 and the backend system 1412 is represented by the dashed double arrow 1411, and may be over a wireless network such as, but not limited to, a 3G, 4G or 5G data network, or over a WiFi network. However, the backend system 1412 may instead be provided by another server or server system, such as an airport server separate to, but in communication with, the server performing the 1:N matching.
In these embodiments, the backend system 1412 may include a backend liveness estimator 1416 configured to implement the same method as that implemented by the liveness estimator 1408, either partially or in full. In this case the camera feed data and the passport image 1410 are also sent to the backend liveness estimator 1416. That is, while the liveness estimator 1408 in the device is processing the live camera feed, the camera data is also fed to the backend system 1412 for the same processing. This serves the purpose of performing a verification run of the processing, to ensure there is no corruption in the result(s) returned by the liveness estimation, or of performing step(s) in the liveness estimation method which might be too computationally intensive for the local device 1402 to handle, or both. The automated process of authenticating or enrolling the traveller will only proceed, for 1:N matching to occur, if the results from the local liveness estimator 1408 and the backend liveness estimator 1416 both indicate “liveness” of the facial image in the camera feed.
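As an illustration only, the dual-check gating described in the preceding paragraph could take a form along the lines of the following sketch. The callables local_estimate, backend_estimate and run_matching are assumed placeholders, not an API from the disclosure.

```python
# Minimal sketch of the dual liveness gating described above: 1:N matching is
# only attempted when both the local and the backend estimators report a live
# face. The callable interfaces are assumed placeholders, not a real API.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

def authenticate(frames: Sequence[bytes],
                 local_estimate: Callable[[Sequence[bytes]], bool],
                 backend_estimate: Callable[[Sequence[bytes]], bool],
                 run_matching: Callable[[Sequence[bytes]], str]) -> Optional[str]:
    """Run both liveness checks on the same camera feed and gate 1:N matching on their agreement."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        local_result = pool.submit(local_estimate, frames)      # on-device estimation
        backend_result = pool.submit(backend_estimate, frames)  # verification run on the backend
        if local_result.result() and backend_result.result():
            return run_matching(frames)     # proceed to biometric matching
    return None                             # liveness not confirmed by both estimators
```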
In the above, the liveness estimation is described as part of a check performed before biometric identification. However, the implementation of biometric identification does not affect the working of the liveness estimation and is thus not considered a part of the invention in any of the aspects disclosed. For instance, the liveness estimation may be implemented in systems which do not perform biometric identification. For example, it may be implemented in systems to check whether anyone passing through or present at a checkpoint is using a spoofing device to conceal his or her identity or to pose as someone else, e.g., to join a video conference or to register themselves onto a particular user database using a “spoof”.
Variations and modifications may be made to the parts previously described without departing from the spirit or ambit of the disclosure. In the claims which follow and in the preceding description, except where the context requires otherwise due to express language or necessary implication, the word “comprise” or variations such as “comprises” or “comprising” is used in an inclusive sense, i.e. to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments of the disclosure.

Claims

1. A method of estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising one or more of:
(a) determining and analysing movement or movements effected by the user to fit a facial image of the presented face to a randomised goal;
(b) determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression; and
(c) determining and analysing an effect of an illumination change on a contrast level between a facial region in the acquired images and one or more regions adjacent the facial region.
2. The method of claim 1, wherein (a), (b), and (c) are performed sequentially.
3. The method of claim 1, wherein at least one of (a), (b), and (c) is performed at the same time as another one of (a), (b), and (c).
4. The method of any preceding claim, wherein determining and analysing movements effected by the user to fit a facial image of the presented face to a randomised goal comprises: displaying a randomised goal on a display screen, and directing the user to effect movement to move his or her facial image from a current position in relation to the display screen such that his or her facial image will track the randomised goal; analysing image frames acquired while the user is directed to track the randomised goal; and estimating whether the presented face is a real face based on the analysis.
5. The method of claim 4, wherein analysing image frames acquired while the user is directed to track the randomised goal comprises determining movement or movements effected by the user.
6. The method of claim 5, wherein estimating whether the presented face is a real face comprises comparing the determined movement or movements with a movement which a person is expected to make.
7. The method of claim 6, wherein the estimation is made by a machine learning model.
8. The method of claim 7, wherein the estimation is made by a reinforcement learning based model trained using data in relation to natural movements.
9. The method of any one of claims 5 to 8, wherein determining movement comprises determining a path between the current position and the randomised goal.
10. The method of any preceding claim, wherein determining and analysing changes in facial feature metrics in the facial images in response to a change in the user’s expression comprises: directing the user to make an expression to cause a visually detectable facial feature warping; analysing image frames acquired while the user is directed to make the expression; and estimating whether the presented face is a real face based on the analysis.
11. The method of claim 10, wherein analysing the image frames acquired while the user is directed to make the expression comprises: obtaining a time series of metric values from the series of analysed frames, by calculating, from each analysed frame, a metric based on the position of one or more facial features.
12. The method of claim 11, wherein estimating whether the presented face is a real face based on the analysis comprises: determining a momentum in the time series of metric values and comparing the momentum with an expected momentum profile for a real smiling face.
13. The method of any one of claims 10 to 12, wherein the user is directed to smile.
14. The method of claim 13, wherein for each analysed frame, the metric calculated is or comprises a ratio of a distance between eyes of a detected face in the analysed frame to a width of a mouth of the detected face.
15. The method of any one of claims 10 to 14, comprising comparing the image frames acquired while the user is directed to make the expression with a reference image in which the user has a neutral expression, and selecting an image from the analysed image frames which is most similar to the reference image as an anchor image in which the user is deemed to have a neutral expression.
16. The method of any preceding claim, wherein analysing an effect of a change in illumination on the presented face on the image data comprises: capturing one or more first image frames of the presented face at a first illumination level, and capturing one or more second image frames of the presented face at a second illumination level which is different from the first illumination level; analysing the first image frame to determine a first contrast level between a detected face region in the first image frame and an adjacent region which is adjacent the detected face region; analysing the second image frame to determine a second contrast level between a detected face region in the second image frame and an adjacent region which is adjacent the detected face region of the second image frame, wherein a relationship between the adjacent region and the detected face region of the second image frame is the same as a relationship between the adjacent region and the detected face region of the first image frame; and comparing the first and second contrast levels to estimate whether the presented face is likely to be a real face.
17. The method of claim 16, wherein comparing the first and second contrast levels comprises determining whether a change between the first and the second contrast levels is greater than a threshold.
18. The method of any preceding claim, wherein, during any step when the user is required to fit his or her facial image to a target area on a screen, the method comprises: directing the user to fit a facial image of his or her presented face to a first, larger target area; and upon detecting a facial image within the first target area, setting an area generally taken by the facial image as a reference target area.
19. The method of claim 18, comprising applying a tolerance range around the first, larger target area, whereby detection of a facial image within the tolerance range will trigger setting of the area of the facial image as the reference target area.
20. The method of claim 18 or 19, comprising applying a tolerance range around the reference target area, so that the facial image of the user’s presented face is considered to stay within the reference target area if it is within the tolerance range around the reference area.
21. An apparatus for estimating whether a presented face for a user is a real face by analysing image frames acquired of the presented face, comprising a processor configured to execute machine instructions which implement the method of any preceding claim.
22. The apparatus of claim 21, wherein the apparatus is a local device used or accessed by the user.
23. The apparatus of claim 22, wherein the local device is a mobile phone or tablet.
24. The apparatus of any one of claims 21 to 23, wherein the acquired image frames are sent over a communication network to a backend system, and are processed by a processor of the backend system configured to execute machine instructions at least partially implementing the method of any one of claims 1 to 20.
25. The apparatus of claim 24, wherein the presented face is estimated to be a real face, if processing results by the processor of the apparatus and processing results by the processor of the backend system both estimate that the presented face is a real face.
26. The apparatus of claim 25, wherein the apparatus is configured to enable the user to interface with an automated biometric matching system to enrol or verify his or her identity.
27. The apparatus of claim 25 or 26, wherein the user is an air-travel passenger, and the backend system is a server system hosting a biometric matching service or is a server system connected to another server system hosting a biometric matching service.
28. A method of biometrically determining a subject’s identity, including: estimating whether a presented face of the subject is a real face in accordance with the method of any one of claims 1 to 20; providing a two-dimensional image acquired of the presented face for biometric identification of the subject, if it is estimated that the presented face is a real face; and outputting a result of the biometric identification.
29. The method of claim 28, being performed during a check-in process by an air travel passenger.
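By way of illustration only, the expression metric and momentum comparison recited in claims 11 to 14 may be sketched as follows. The landmark inputs, the threshold value and the acceptance rule below are assumptions chosen for the sketch and are not values taken from the disclosure.

```python
# Illustrative sketch of the expression metric and momentum comparison recited
# in claims 11 to 14. Landmark inputs, threshold and acceptance rule are
# assumptions chosen for illustration, not values taken from the disclosure.
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def expression_metric(left_eye: Point, right_eye: Point,
                      mouth_left: Point, mouth_right: Point) -> float:
    """Per-frame metric: ratio of inter-eye distance to mouth width (claim 14)."""
    def dist(a: Point, b: Point) -> float:
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return dist(left_eye, right_eye) / dist(mouth_left, mouth_right)

def momentum(series: Sequence[float]) -> List[float]:
    """Frame-to-frame rate of change of the metric time series (claim 12)."""
    return [b - a for a, b in zip(series, series[1:])]

def resembles_real_smile(series: Sequence[float], min_peak_rate: float = 0.02) -> bool:
    """Crude stand-in for comparison with an expected momentum profile: a real
    smile widens the mouth progressively, so the metric should decrease over
    several frames rather than jump in a single step."""
    m = momentum(series)
    if not m:
        return False
    mostly_decreasing = sum(1 for v in m if v < 0) >= len(m) // 2
    return mostly_decreasing and min(m) <= -min_peak_rate
```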
PCT/AU2024/050111 2023-02-16 2024-02-16 Automated facial detection with anti-spoofing WO2024168396A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2023900391A AU2023900391A0 (en) 2023-02-16 Automated facial detection with anti-spoofing
AU2023900391 2023-02-16

Publications (1)

Publication Number Publication Date
WO2024168396A1 true WO2024168396A1 (en) 2024-08-22

Family

ID=92421273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2024/050111 WO2024168396A1 (en) 2023-02-16 2024-02-16 Automated facial detection with anti-spoofing

Country Status (1)

Country Link
WO (1) WO2024168396A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124385A1 (en) * 2007-12-31 2017-05-04 Applied Recognition Inc. Face authentication to mitigate spoofing
US20160063235A1 (en) * 2014-08-28 2016-03-03 Kevin Alan Tussy Facial Recognition Authentication System Including Path Parameters
US20160366129A1 (en) * 2015-06-10 2016-12-15 Alibaba Group Holding Limited Liveness detection method and device, and identity authentication method and device
US20180239979A1 (en) * 2015-09-03 2018-08-23 Nec Corporation Living body recognition device, living body recognition method, and living body recognition program
US20180096212A1 (en) * 2016-09-30 2018-04-05 Alibaba Group Holding Limited Facial recognition-based authentication
US20210144009A1 (en) * 2019-11-11 2021-05-13 Icm Airport Technics Australia Pty Ltd Device with biometric system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI, R. ET AL.: "DRL-FAS: A novel framework based on deep reinforcement learning for face anti-spoofing", IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, vol. 16, 2021, pages 937 - 951, XP011812569, DOI: 10.1109/TIFS.2020.3026553 *
WU, Y. ET AL.: "Facial landmark detection: A literature survey", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 127, no. 2, 15 May 2018 (2018-05-15), pages 115 - 42, XP037697793, DOI: 10.1007/s11263-018-1097-z *

Similar Documents

Publication Publication Date Title
US10853677B2 (en) Verification method and system
US10546183B2 (en) Liveness detection
EP3332403B1 (en) Liveness detection
EP3241151B1 (en) An image face processing method and apparatus
AU2022203880B2 (en) Methods and systems for determining user liveness and verifying user identities
US10592728B2 (en) Methods and systems for enhancing user liveness detection
EP2680191B1 (en) Facial recognition
US8620066B2 (en) Three-dimensional object determining apparatus, method, and computer program product
Li et al. Seeing your face is not enough: An inertial sensor-based liveness detection for face authentication
US11115408B2 (en) Methods and systems for determining user liveness and verifying user identities
US12266215B2 (en) Face liveness detection using background/foreground motion analysis
JP7197485B2 (en) Detection system, detection device and method
WO2013159686A1 (en) Three-dimensional face recognition for mobile devices
US12236717B2 (en) Spoof detection based on challenge response analysis
JP7264308B2 (en) Systems and methods for adaptively constructing a three-dimensional face model based on two or more inputs of two-dimensional face images
US10360441B2 (en) Image processing method and apparatus
US20230419737A1 (en) Methods and systems for detecting fraud during biometric identity verification
WO2024168396A1 (en) Automated facial detection with anti-spoofing
US20240205239A1 (en) Methods and systems for fraud detection using relative movement of facial features
CN119094651B (en) Virtual camera recognition method and device based on mobile phone

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24755770

Country of ref document: EP

Kind code of ref document: A1