WO2022120532A1 - Presentation attack detection - Google Patents

Presentation attack detection

Info

Publication number
WO2022120532A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
texture
subject image
subject
stage
Application number
PCT/CN2020/134321
Other languages
French (fr)
Inventor
Wei Huang
Ketan Kotwal
Wenkang XU
Xiaolin Huang
Sebastien Marcel
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN202080107787.5A priority Critical patent/CN116982093A/en
Priority to PCT/CN2020/134321 priority patent/WO2022120532A1/en
Publication of WO2022120532A1 publication Critical patent/WO2022120532A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/172 Classification, e.g. identification

Definitions

  • the present disclosure relates to detecting image presentation attacks.
  • Biometric authentication is used in computer science for user verification for access control. Forms of biometric authentication rely on imaging a user’s biometric traits, for example, imaging the user’s face, hand, finger or iris to detect physical features, whereby the detected physical features may be compared to known physical features of an authorised user to authenticate the access attempt. Presentation attacks may be perpetrated on such a biometric authentication system by an unauthorised user seeking to impermissibly gain access to a computer system, whereby the unauthorised user attempts to impersonate the biometric presentation of an authorised user. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. It is desirable to be able to reliably detect presentation attacks in order to reduce the risk of unauthorised access to a computer system.
  • An objective of the present disclosure is to provide a method for detecting image presentation attacks.
  • aspects of the present disclosure relate to a texture-based approach to presentation attack detection, whereby presentation attacks may be identified by detecting micro-textural differences between a bonafide presentation, e.g. a bonafide human face, and artifacts, e.g. a face-mask, printed photo, or video displayed on an electronic display.
  • a texture-based approach may allow effective discrimination between bonafide and attack presentations based on artifact characteristics, such as the presence of pigments (printing defects) and specular reflection and shade (display attacks) .
  • aspects of the disclosure relate to classifying presentation attacks based on the characteristic micro-textural differences between bonafide presentations and presentation attack artifacts.
  • a first aspect of the present disclosure provides a method for detecting image presentation attacks, the method comprising, obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
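  • As an illustration of the above pipeline, a minimal Python sketch is given below; `denoiser`, `encode_texture`, `primitive_db` and `classifier` are hypothetical stand-ins for the components described in the implementations that follow, not names used by the disclosure.

```python
import numpy as np

def detect_presentation_attack(subject_image, denoiser, encode_texture,
                               primitive_db, classifier):
    """Outline of the first aspect: denoise, compute residual, encode
    texture, classify. All callables are hypothetical stand-ins."""
    # Denoised (smoothened) representation of the subject image.
    denoised = denoiser(subject_image)
    # Residual image: pixelwise difference carrying micro-textural detail.
    residual = subject_image.astype(np.float32) - denoised.astype(np.float32)
    # Texture descriptor: regions of the residual expressed as a function
    # of texture primitives in the database.
    descriptor = encode_texture(residual, primitive_db)
    # Classification operation predicting bonafide vs. attack.
    return classifier(descriptor)
```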
  • the method is for use in detecting image presentation attacks on a user verification system.
  • the user identification system may, for example, be deployed as a part of a user verification system of a computer access control system for controlling access to a computer system.
  • the user verification system may, for example, comprise an image capture device for capturing an image of a user presenting to the user verification system, for example, an image of the presenting user’s face or fingerprint, to thereby predict whether the presenting user is an authorised user or an unauthorised user.
  • the method initially comprises obtaining a subject image, i.e. an image presented for analysis.
  • the subject image could be obtained using an image capture device, and the method could involve capturing the image using an image capture device.
  • the subject image could be captured by an external system, and obtaining the subject image could involve obtaining the image file, optionally following initial processing of the image file.
  • the subject image is initially denoised.
  • the subject image could be denoised using a neural network trained for denoising images of a bonafide presentation, e.g. of a real human.
  • the residual image is computed as a difference between the subject image and the denoised version of the subject image, for example, as a pixelwise difference between the subject image and the denoised representation.
  • the denoised image is a smoothened representation of the subject image; the residual image thus primarily represents micro-textural features of the subject image.
  • a texture descriptor is then generated using a database of texture primitives, the texture descriptor(s) representing one or more regions/patches of the residual image.
  • the texture descriptor may comprise a code vector where texture details of the region/patch of the residual image are encoded as a combination of texture primitives.
  • the texture descriptor represents the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database.
  • the database of texture primitives is specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel.
  • An advantage of this approach is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification of the subject image as representing either a bonafide or attack presentation.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
  • the texture descriptor may represent plural different regions of the residual image as a function of one or more of the texture primitives. Consequently, the texture details of the subject image may be more fully represented by the texture descriptor. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the texture descriptor may represent every region of the residual image as a function of the texture primitives, such that the texture details of the subject image are most fully represented.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
  • the texture descriptor may represent each of the one or more different regions of the residual image as a function of a plurality of the texture primitives. Consequently, fuller texture information defining the texture details of the subject image may be represented. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the texture descriptor may represent each of the one or more regions of the residual image as a function of each of the texture primitives in the database, such that the texture details of the subject image are most fully represented.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives.
  • the texture descriptor may represent the texture information of one or more of the regions as a linear combination of a plurality of the texture primitives. This may advantageously represent a computationally efficient form for comprehensively representing texture information.
  • the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an input image.
  • a convolutional neural network may advantageously allow accurate and reliable denoising of images, with relatively low computational complexity.
  • the convolutional neural network could be a denoising auto-encoder.
  • the convolutional neural network could be trained using only bonafide images of a user, e.g. of a user’s face, rather than of attack presentations. Consequently, the denoiser may be expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
  • a convolutional neural network may advantageously allow accurate and reliable prediction of presentation attacks, with relatively low computational complexity.
  • the convolutional neural network could be a multi-layer perceptron.
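  • As a hypothetical illustration of such a classifier, a small multi-layer perceptron over a texture descriptor might be defined as below (PyTorch); the layer widths and two-logit output are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class AttackClassifier(nn.Module):
    """Illustrative MLP mapping a texture descriptor to bonafide/attack
    logits; layer sizes are assumptions."""
    def __init__(self, descriptor_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 2),  # logits: [bonafide, attack]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```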
  • the method further comprises performing on the subject image, an image intensity normalization operation for changing an intensity range of the received image to a predetermined intensity range, and/or an image resizing operation for changing a size of the received image to a predetermined size.
  • the subject image may be subjected to intensity normalising/resizing operations prior to the denoising stage. Consequently, the subject image, and so the texture details of the subject image, may be adapted to match as closely as possible a desired intensity/size of the texture details, e.g. to match the intensity/size on which the denoiser and/or the classifier operations have been trained, and/or on which the texture primitive database was learnt. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the input image.
  • the method involves detecting, and checking the visibility of, certain features of an image.
  • the operations could be aimed at detecting the position of facial landmarks, such as eye and/or mouth regions in facial images, and subsequently checking the visibility of those features in the subject image(s).
  • the existence of particular features of the image may be determined, which may be indicative of a liveness of the imaged subject.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the visibility detection operation could involve computing an entropy of the regions and comparing the computed entropy to an expected entropy of a region depicting the predetermined feature, e.g. a mouth or eye.
  • the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature, convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution, obtaining a reference histogram, and computing a difference between the histogram and the reference histogram.
  • this implementation could be used for detecting edges, such as edges bounding cut-out regions, of a mask, to thereby detect attack presentations.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the reference histogram may be generated for a particular feature, e.g. eyes/mouth, of a bonafide image, and computing the difference between the histograms may involve computing a difference/similarity between the edge features.
  • the magnitude of the difference may be used as a reliable proxy for the presence of edges of a mask, e.g. cut-outs.
  • the method further comprises receiving a further subject image; denoising the further subject image to obtain a denoised representation of the further subject image; computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image; generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
  • This method may advantageously allow for detection of local motion of a subject of the subject image, occurring between an acquisition time of the subject image and the acquisition time of the further subject image.
  • the method may be used for detecting movement of eye or mouth regions of an imaged user. Such a comparison may detect movement of the imaged subject, which may thereby be used to infer a liveness of the subject.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the obtaining a convolutional neural network for predicting a denoised representation of a subject image comprises, obtaining a convolutional neural network, receiving a training image, adding image noise to the training image to generate a noisy representation of the training image, performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image, quantifying a difference between the prediction for a denoised representation of the noisy representation of the training image and the training image; and modifying operations of the convolutional neural network based on the difference.
  • Adding image noise may involve adding white Gaussian noise.
  • Modifying operations of the convolutional neural network based on the difference may involve updating parameters, such as weights, of the CNN.
  • the receiving a subject image comprises receiving an image of near-infrared radiation
  • the obtaining a database of one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
  • the subject images may be acquired in the NIR channel.
  • NIR images are advantageously relatively insensitive to corruption by varying levels of background visible light during imaging, good for imaging in low visible light levels, e.g. at night, and good for imaging certain key texture details which discriminate between bonafide and attack presentations.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the receiving an input image comprises imaging using an optical imaging device.
  • the imaging device could be a near-infrared imaging device, sensitive to near-infrared radiation.
  • a second aspect of the present disclosure provides a computer program comprising instructions which, when executed by a computer system, cause the computer system to carry out a method of any implementation of the first aspect of the present disclosure.
  • a third aspect of the present disclosure comprises a computer-readable data carrier having the computer program of the second aspect of the present disclosure stored thereon.
  • a fourth aspect of the present disclosure provides a computer system for detecting image presentation attacks, wherein the computer system is configured to, obtain a subject image, denoise the subject image to obtain a denoised representation of the subject image, compute a residual image representing a difference between the subject image and the denoised representation of the subject image, obtain a database of one or more texture primitives, generate a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
  • the computer system of the fourth aspect of the present disclosure may be further configured to perform a method of any implementation of the first aspect of the present disclosure.
  • Figure 1 shows schematically an example of a computing system embodying an aspect of the disclosure, comprising a computer and a biometric verification system;
  • FIG. 2 shows schematically an example implementation of the biometric verification system identified previously with reference to Figure 1, comprising a presentation attack detection module;
  • FIG. 3 shows schematically an example implementation of the presentation attack detection module, identified previously with reference to Figure 2;
  • Figure 4 shows schematically an example method performed by the biometric verification system 103, comprising a method for detecting presentation attacks;
  • Figure 5 shows schematically processes involved in the method for detecting presentation attacks, identified previously with reference to Figure 4;
  • Figure 6 shows schematically an example of computational functionality provided by the presentation attack detection module identified previously with reference to Figure 3;
  • Figure 7 shows schematically processes involved in obtaining subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 8 shows schematically processes involved in denoising subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 9 shows schematically stages of a denoising auto-encoder involved in the process for denoising subject images identified previously with reference to Figure 8;
  • Figure 10 shows schematically processes involved in computing residual images corresponding to subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 11 shows schematically processes involved in obtaining a database of texture primitives, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 12 shows schematically processes involved in generating texture descriptors, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 13 shows schematically processes involved in assessing features of subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 14 shows schematically processes involved in predicting presentation attacks in subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 15 shows schematically a multi-layer perceptron classifier model, utilised in the process for predicting presentation attacks in subject images identified previously with reference to Figure 14.
  • a computing system 101 embodying an aspect of the present disclosure comprises a computer 102 and a biometric verification system 103 in communication with the computer 102 via connection 104.
  • Computer 102 is operable to perform a computing function.
  • computer 102 may run computer programs for performing functions.
  • computer 102 is a personal computer, a smartphone, or a computer installed onboard a vehicle for controlling functions of the vehicle.
  • computer 102 is a payment device for accepting a payment from a user, such as a point of sale device, or for dispensing a payment to a user, for example, an automated teller machine.
  • it may be desirable to control access to the computer, or to functionality of the computer, so as to prevent unauthorised use of the computer.
  • the computer 102 is a personal computer, it may be desirable to restrict access to functionality of the computer to one or more authorised users of the personal computer, to thereby prevent unauthorised users from using the computer.
  • Biometric verification system 103 is functional to verify whether a user requesting access to the computer 102 is an authorised user. An output of biometric verification system 103 may thus be used by computer 102 to determine whether to grant access to the computer, e.g. access to an operating system or programs running on the computer, to the user. In examples, biometric verification system 103 is functional to image a user requesting access to computer 102, and verify that the imaged user is an authorised user of the computer, by comparing biometric information extracted from the image with predefined biometric characteristics of authorised users.
  • a difficulty encountered by such a system is that an unauthorised user may perpetrate a presentation attack on the biometric verification system 103, whereby the unauthorised user may attempt to impersonate the biometric presentation of an authorised user, to thereby impermissibly gain access to the computer.
  • an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation.
  • biometric verification system 103 is functional to detect certain forms of presentation attacks in images of the user, to thereby reduce the risk of unauthorised access to the computer 102.
  • biometric verification system 103 may be connected to computer 102 via a network 104.
  • Network 104 may be implemented, for example, by a wide area network (WAN) such as the Internet, a local area network (LAN) , a metropolitan area network (MAN) , and/or a personal area network (PAN) , etc.
  • the network may be implemented using wired technology such as Ethernet, Data Over Cable Service Interface Specification (DOCSIS), synchronous optical networking (SONET), and/or synchronous digital hierarchy (SDH), etc.
  • the network 104 may include at least one device for communicating data in the network.
  • the network 104 may include computing devices, routers, switches, gateways, access points, and/or modems.
  • biometric verification system 103 may be connected to computer 102 via a simpler data transfer connection 104, e.g. via a connection in accordance with the Universal Serial Bus standard.
  • biometric verification system 103 is depicted as being structurally distinct from, and co-located with, computer 102.
  • biometric verification system 103 could be located remotely of the computer 102, or could instead be incorporated into computer 102, such that biometric verification system 103 utilises computing resource of the computer 102.
  • computer 102 could comprise a handheld computing device, e.g. a smart phone, and biometric verification system 103 may be integrated in the handheld computing device.
  • biometric verification system 103 comprises user verification module 201, user identification module 202, presentation attack detection module 203, and image acquisition module 204.
  • Image acquisition module 204 comprises imaging device 205.
  • the components 201 to 205 of the biometric verification system 103 are in communication via system bus 206, which is in turn connected to computer 102 via connection 104.
  • User verification module 201 is for determining whether a user requesting access to the computer 102 is an authorised user of the system, based on inputs from user identification module 202 and from presentation attack detection module 203, and for communicating such determination to the computer 102 via connection 104.
  • User verification module 201 may, for example, comprise a computer processor for performing the user verification task.
  • User identification module 202 is for identifying an attempted user of the computer from an image of the user acquired by image acquisition module 204.
  • the user identification module 202 is configured to extract and analyse biometric information from images acquired by image acquisition module 204, access predefined biometric information stored in storage, e.g. in storage internal to user identification module, in which biometric information of authorised users is stored, and determine whether the biometric information extracted from the acquired images matches biometric information of an authorised user.
  • User identification module 202 may then communicate to user verification module 201 a determination as to whether the user requesting access appears to be an authorised user.
  • user identification module 202 is configured for facial recognition, and is configured to analyse facial features of an imaged user, to determine whether the facial features match predefined facial features of an authorised user.
  • user identification module 202 may be configured for alternative forms of biometric identification, for example, fingerprint or iris recognition.
  • User identification module 202 may, for example, comprise computer storage for storing predefined biometric information of authorised users, and a computer processor for performing the user identification task.
  • Presentation attack detection module 203 is for detecting presentation attack attempts in images acquired by image acquisition module 204. In particular, in examples, presentation attack detection module 203 is for predicting whether an image acquired by image acquisition module 204 depicts a presentation attack. Presentation attack detection module 203 will be described in further detail with particular reference to Figures 3 and 6.
  • Image acquisition module 204 is for acquiring one or more images of a user requesting access to the computer 102, for communication to user identification module 202 and presentation attack detection module 203.
  • image acquisition module 204 may be for imaging a user’s face, and imaging device 205 may thus be fixed in a suitable position such that a face region of a user requesting access to computer 102 may be imaged.
  • imaging device 205 is an optical camera.
  • imaging device 205 is for imaging a user’s face.
  • the imaging device 205 is configured for imaging near-infrared (NIR) radiation, e.g. for imaging a user’s face in the NIR channel.
  • imaging device 205 may comprise an optical camera having an NIR filter fitted to the lens.
  • the filter may thus selectively pass NIR spectra radiation to an image sensor.
  • the image acquisition device could be configured for imaging other regions of a user’s body, for example, as a fingerprint or iris imager.
  • Image acquisition module 204 may comprise a computer processor for performing the imaging task, and may optionally further comprise storage for storing acquired images.
  • biometric verification system is described as comprising four distinct modules 201 to 204, each having independent computing resource, processor and/or memory resource.
  • the functionality of one or more of the modules may be combined and implemented by shared computing resource.
  • the functionality of all of the modules 201 to 204 could be implemented by a common processor.
  • presentation attack detection module 203 comprises processor 301, storage 302, memory 303, input/output interface 304, and system bus 305.
  • the presentation attack detection module 203 is configured to run a computer program for detecting presentation attacks in images acquired by image acquisition module 204.
  • Processor 301 is configured for execution of instructions of a computer program.
  • Storage 302 is configured for non-volatile storage of computer programs for execution by the processor 301.
  • the computer program for predicting presentation attacks in images acquired by image acquisition module 204 is stored in storage 302.
  • Memory 303 is configured as read/write memory for storage of operational data associated with computer programs executed by the processor 301.
  • Input/output interface 304 is provided for communicating presentation attack detection module 203 with system bus 206.
  • the components 301 to 304 of the presentation attack detection module 203 are in communication via system bus 305.
  • biometric verification system 103 is configured to perform a user verification procedure for verifying that a user requesting access to computer 102 is an authorised user of the computer.
  • the biometric verification system 103 may perform the verification procedure in response to receiving a prompt from computer 102 via connection 104.
  • At a stage 401, the image acquisition module 204 images a user requesting access to the computer 102, using imaging device 205.
  • stage 401 involves imaging the user’s face, optionally in the NIR channel, wherein the actual range of wavelengths may be pre-configured.
  • the image acquisition module 204 acquires a plurality of frames, where the duration of frame acquisition and time interval between successive frames (frame rate) may be defined at stage 401.
  • the image acquisition module 204 may be capable of imaging at plural different resolutions, and stage 401 may involve defining an imaging resolution.
  • the imaging device 205 may have one or more illumination devices for a specified NIR range.
  • Stage 401 may thus further involve adjusting the illumination devices such that the region nominally containing a subject’s head/face is properly and uniformly illuminated. Although minor variations in the ambient light are acceptable, the capturing session should have reasonable illumination conditions. During the capture, it is preferred that the subject’s face occupies a major area of the image being captured.
  • the image acquisition module 204 may then communicate the acquired image(s) to user identification module 202 and presentation attack detection module 203.
  • At a stage 402, the user identification module 202 analyses the image data acquired at stage 401, extracts biometric information relating to the imaged user(s), e.g. facial feature information, and determines whether the user is an authorised user of the computer by comparing the extracted biometric information to predefined biometric information of authorised users, i.e. predefined biometric information stored in computer storage accessible by the user identification module 202.
  • the user identification module 202 may output the determination to user verification module 201.
  • At a stage 403, the presentation attack detection module 203 analyses the image data acquired at stage 401, and generates a prediction of whether the acquired image(s) depict a presentation attack, i.e. whether the image is of an unauthorised user attempting to perpetrate a presentation attack. For example, this stage could involve the presentation attack detection module 203 predicting whether the acquired image(s) show a face mask or printed photo.
  • the presentation attack detection module 203 may output the determination to user verification module 201.
  • At a stage 404, the user verification module 201 may evaluate the determinations from the user identification module 202 and the presentation attack detection module 203, received at stages 402 and 403 respectively, determine whether the imaged user is an authorised user, and communicate that determination to computer 102. For example, where the determination of the user identification module 202 at stage 402 is that the imaged user appears to be an authorised user, and the prediction of the presentation attack detection module 203 at stage 403 is that the image does not depict a presentation attack, the user verification module 201 may determine that the user requesting access is an authorised user.
  • Conversely, where either determination is negative, the user verification module 201 may determine at stage 404 that the user requesting access is not an authorised user, and may notify the computer 102 accordingly.
  • the method of stage 403, for detecting presentation attacks, comprises seven stages.
  • the method of stage 403 is implemented by the processor 301 of presentation attack detection module 203, in response to instructions of the computer program stored in storage 302 of presentation attack detection module 203.
  • At a stage 501, the computer program stored in storage 302 causes the processor 301 to obtain one or more subject images for analysis, i.e. to obtain images of an attempted user for analysis.
  • At a stage 502, the computer program stored in storage 302 causes the processor 301 to denoise the one or more subject images obtained at stage 501, i.e. to remove image noise from the images, to obtain denoised representations of the subject images.
  • At a stage 503, the computer program stored in storage 302 causes the processor 301 to compute one or more residual images, each residual image representing a difference between a subject image obtained at stage 501 and the respective denoised image computed at stage 502.
  • At a stage 504, the computer program stored in storage 302 causes the processor 301 to obtain a database of texture primitives, each texture primitive encoding information representing a texture feature.
  • At a stage 505, the computer program stored in storage 302 causes the processor 301 to generate one or more texture descriptors, e.g. code vectors, each texture descriptor representing one or more regions of a residual image computed at stage 503 as a function of texture primitives in the database obtained at stage 504.
  • At a stage 506, the computer program stored in storage 302 causes the processor 301 to perform on the subject images obtained at stage 501 one or more feature assessment operations.
  • At a stage 507, the computer program stored in storage 302 causes the processor 301 to perform classification operations, based on outputs of stages 505 and 506, for predicting whether the subject images obtained at stage 501 depict a presentation attack.
  • stage 403 may comprise fewer or further operational stages, depending on the instructions contained in the computer program.
  • the operations of stage 506 may be omitted from the method.
  • the presentation attack detection module 203 is configured to support the functionality of a plurality of functional modules.
  • each of the functional modules utilise the processor 301, storage 302, and memory 303 of the presentation attack detection module 203.
  • Pre-processor module 601 is provided for supporting the method of stage 501, for retrieving images, e.g. facial images, from image acquisition module 204, performing image processing operations on the acquired images, and for outputting subject images for analysis by later modules.
  • Denoiser module 602 is provided for supporting the method of stage 502, for denoising images output by pre-processor module 601 to remove image noise from the images, to obtain denoised representations of the images.
  • Residual image computing module 603 is provided for supporting the method of stage 503, for computing residual images representing a difference between a subject image output by pre-processor module 601 and the respective denoised image output by denoiser 602.
  • Texture descriptor generator module 604 is provided for supporting the methods of stages 504 and 505, for generating a database of texture primitives, and generating texture descriptors representing regions of residual images output by residual image computing module 603 as a function of one or more of the texture primitives.
  • Feature assessment module 605 is provided for supporting the method of stage 506, to perform feature assessment operations on the subject images output by pre-processor module 601.
  • feature assessment module 605 may comprise eye-region and/or mouth-region assessment sub-modules 606, 607 respectively.
  • Classifier module 608 is provided for supporting the method of stage 507, for predicting, based on the outputs of texture descriptor generator module 604 and feature assessment module 605, at stages 505 and 506 respectively, whether the images obtained at stage 501 depict a presentation attack.
  • the method of stage 501 for obtaining subject images comprises six stages.
  • At stage 701, the image acquisition module 204 acquires one or more images of a user attempting to access the computer 102.
  • stage 701 may involve the presentation attack detection module 203 retrieving pre-acquired images from image acquisition module 204, or may instead involve presentation attack detection module 203 instructing image acquisition module 204 to acquire images of a presenting user by imaging device 205, e.g. images of the user’s face, optionally in the NIR channel.
  • the pre-processor module 601 performs certain image processing operations on the image(s) acquired at stage 701. In examples, the pre-processor module 601 processes each image (or frame) independently and identically.
  • At stage 702, the pre-processor module 601 performs an image normalization operation, whereby the image(s) acquired at stage 701 are normalized to a specific, predefined intensity range.
  • stage 702 involves calculation of minimum ($I_{min}$) and maximum ($I_{max}$) intensity threshold values from the image statistics.
  • the normalization operation on the image data of each frame may be as shown in equation 1: $I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}}$
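  • A minimal sketch of such a normalization, assuming equation 1 is standard min-max scaling of the per-frame statistics to the predefined range:

```python
import numpy as np

def normalize_intensity(frame: np.ndarray, out_range=(0.0, 1.0)) -> np.ndarray:
    """Scale frame intensities to a predefined range using the per-frame
    minimum and maximum (assumed min-max reconstruction of equation 1)."""
    i_min, i_max = float(frame.min()), float(frame.max())
    lo, hi = out_range
    scaled = (frame.astype(np.float32) - i_min) / max(i_max - i_min, 1e-8)
    return lo + scaled * (hi - lo)
```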
  • At stage 703, the pre-processor module 601 performs an image resizing operation, whereby the normalized image(s) output at stage 702 are converted to a predefined fixed dimension through appropriate resizing.
  • At stage 704, the pre-processor module 601 performs a feature detection operation on the resized image(s) output at stage 703, for example, to detect a user’s face in the image and/or facial landmarks of a user’s face, such as the user’s mouth and/or eye regions.
  • This operation may utilise state-of-the-art libraries such as dlib, or techniques such as a deformable parts model, convolutional neural networks (CNN) , cascaded pose regression, and/or multi-task CNNs.
  • where feature detection fails, e.g. where no face is detected, the input image may be rejected, and a signal may be generated, to be communicated to the computer 102, to inform the user of the same.
  • if this behaviour is observed across several frames, the user may be provided with a suitable signal, e.g. by computer 102.
  • At stage 705, the pre-processor module 601 performs an alignment operation on the image(s), whereby the images are aligned for the subsequent processing.
  • This stage may involve selecting one or more image features, e.g. facial features (such as left and right eyes, or left/right/middle points of the mouth), and computing a two-dimensional transformation of the image such that the coordinates of these specific features are consistent across a succession of images.
  • At stage 706, the pre-processor module 601 performs a cropping operation on the image(s).
  • the images may be cropped using available features, such as facial landmarks, to depict only a desired image feature, e.g. a face region.
  • the images may also be resized again at stage 706 to predetermined dimensions.
  • the output of the pre-processor module 601 is thus a cropped and aligned subject image, e.g. a facial image, and corresponding image features, e.g. facial landmarks such as eye/mouth regions.
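  • A sketch of the alignment and cropping of stages 705 and 706, assuming OpenCV, (N, 2) landmark arrays, and a fixed landmark template; the similarity-transform approach and output size are assumptions of this sketch.

```python
import cv2
import numpy as np

def align_and_crop(image, landmarks, template, out_size=(128, 128)):
    """Estimate a 2D similarity transform mapping detected landmarks
    (e.g. eye corners, mouth points) onto fixed template coordinates,
    then warp so features are consistent across successive images."""
    matrix, _ = cv2.estimateAffinePartial2D(
        landmarks.astype(np.float32), template.astype(np.float32))
    # Warp to the predetermined output dimensions (implicit crop).
    return cv2.warpAffine(image, matrix, out_size)
```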
  • the method of stage 502 for denoising images comprises two stages.
  • At stage 801, the denoiser module 602 is trained to denoise images of the type to be analysed, e.g. images in the NIR channel.
  • a denoising auto-encoder (DAE) CNN is utilised for denoising images.
  • the DAE comprises one or more units of convolutional, pooling, and normalization layers; and one or more units of fully connected layers.
  • bonafide training images are obtained, e.g. non-presentation attack images, such as images of real human faces, and the images are pre-processed by pre-processor module 601 by the method of stages 702 to 706, as described previously.
  • the training images are intentionally corrupted by adding a suitable noise, such as additive white Gaussian noise (AWGN), of varying levels.
  • these levels are determined by discrete values of the variance (or standard deviation) of the Gaussian function used to generate the noise probability density function (pdf). Therefore, for N pre-processed images and m levels of noise, a total training set of mN images may be obtained through augmentation.
  • the noisy or corrupted image may be obtained through a stochastic mapping, given by equation 2: $\tilde{I} = I + n$, where $n \sim \mathcal{N}(0, \sigma^2)$ is sampled independently per pixel.
  • the architecture of a DAE consists of an encoder and a decoder.
  • the encoder maps the noisy input to the hidden representation h, as the function given by equation 3: $h = f_{\theta_E}(\tilde{I})$; the decoder maps h back to a reconstruction, $\hat{I} = g_{\theta_D}(h)$.
  • the parameters $\theta_E$ and $\theta_D$ of the DAE are learnt through minimization of a loss function over an average reconstruction error, $E\left(\lVert I - g_{\theta_D}(f_{\theta_E}(\tilde{I})) \rVert^2\right)$.
  • the training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters.
  • the batch size of training images can be decided by the amount of computing resources available for the training.
  • the model may be saved, e.g. in storage 302, for the deployment phase.
  • the model consists of the learnt $\theta_E$ and $\theta_D$ parameters.
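  • An illustrative PyTorch sketch of such a DAE and its training on bonafide images corrupted with AWGN at several levels follows; the layer configuration, noise levels, and optimizer settings are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Illustrative encoder/decoder; the disclosure requires only units of
    convolutional, pooling, normalization and fully connected layers."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, bonafide_batches, noise_sigmas=(0.05, 0.1, 0.2), epochs=10):
    """Corrupt each clean bonafide batch with AWGN at m levels (equation 2)
    and minimize the average reconstruction error against the clean image."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for clean in bonafide_batches:        # tensors (B, 1, H, W) in [0, 1]
            for sigma in noise_sigmas:        # m noise levels -> m*N samples
                noisy = clean + sigma * torch.randn_like(clean)
                opt.zero_grad()
                loss_fn(model(noisy), clean).backward()
                opt.step()
```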
  • At stage 802, the DAE trained at stage 801 is deployed to denoise pre-processed subject images output by pre-processor module 601 at stage 501.
  • the pre-processed subject image (s) are passed through the DAE without any noise corruption. Because the DAE model was trained at stage 801 using only bonafide images, i.e. images of real humans instead of presentation attacks, in the particular channel of interest, e.g. NIR, it has learnt the finer textural details of real humans. Therefore it is expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations.
  • the output of the DAE at stage 802 is thus a smoothened/filtered version of the input image.
  • training stage 801 for training the DAE, is described as being performed immediately prior to denoising stage 802, in alternative examples, training stage 801 could be performed well in advance of denoising stage 802. Indeed, in examples, training stage 801 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the method of stage 503 for computing residual images comprises pixelwise subtraction of the filtered image, output by the DAE denoiser 602 at stage 502, from the pre-processed image output by the pre-processor module 601 at stage 501, as input to the DAE denoiser.
  • This procedure yields an image that primarily consists of textural information in the input image.
  • for display attacks, the residual image is expected to contain patterns of scanline noise.
  • for print attacks, the residual image is expected to represent mainly the fine granular texture of the paper material.
  • For an input image, e.g. a face image, $I_F$, the DAE generates a somewhat smoothened output, $I_{F\text{-}DN}$.
  • the residual image, $I_{residue}$, is obtained from the pixelwise difference between the two, as given by equation 5: $I_{residue} = I_F - I_{F\text{-}DN}$
  • the resultant residual image thus primarily encodes the information related to texture patterns of the input images.
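  • As a one-function sketch of equation 5 (pixelwise subtraction):

```python
import numpy as np

def residual_image(i_f: np.ndarray, i_f_dn: np.ndarray) -> np.ndarray:
    """Equation 5: pixelwise difference between the input image and its
    DAE-denoised version, retaining mainly micro-textural content."""
    return i_f.astype(np.float32) - i_f_dn.astype(np.float32)
```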
  • an objective of later stage 505 is to represent the contents of the residual image(s) in a more discriminative manner.
  • stage 504 for obtaining a database of texture primitives comprises five stages.
  • stage 504 involves generating the database, by the training procedure described below.
  • stage 504 could involve obtaining a pre-generated database of texture primitives, i.e. a database generated at a prior time step, optionally by a third party, e.g. using the following procedure.
  • the database of texture primitives is a “dictionary” of textural primitives or codewords that are specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel.
  • An advantage of this approach of generating texture primitives that are specific to the intended imaging application, e.g. texture primitives of NIR imagery where the intended application will image in the NIR channel, is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification at later stage 507.
  • An objective of later stage 505 is to represent a local patch of the residual image $I_{residue}$ as a linear combination of texture primitives, i.e., codewords of the dictionary.
  • entries in the database should encode texture primitives, such that at stage 505, an input image, through its residual image, may be represented as a vector of texture primitives.
  • At a stage 1101, training images including both bonafide images, e.g. images of real human faces, and attack presentations, e.g. images of subjects wearing masks, are obtained.
  • At a stage 1102, a residual image is obtained, for example by the method of stages 501 to 503, for each input training image (both classes, bonafide and attack presentations).
  • At a stage 1103, the residual images obtained at stage 1102 may be tessellated to obtain small, non-overlapping regions, otherwise known as patches, of n*n dimensions.
  • At a stage 1104, for each patch $I_{residue}[i, j]$, $0 \le i < n$, $0 \le j < n$, texture primitives, otherwise known as codes, for inclusion in the database may be computed by minimization of equation 6: $\min_{C, \alpha} \lVert I_{residue} - C\alpha \rVert_2^2$
  • $\alpha$ and $C$ represent the coefficients and texture primitives, respectively. Their values are computed through the alternate minimization technique, where an acceptable error norm, $e_{min}$, can be predetermined.
  • This training procedure thus generates a set of texture primitives or codewords, representing different textural features of the bonafide and attack training images.
  • At a stage 1105, the texture primitives generated at stage 1104 are compiled to form a database of texture primitives.
  • Columns of the database may thus represent individual texture primitives, i.e. individual codewords.
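  • A sketch of stages 1103 to 1105 is given below, using scikit-learn's MiniBatchDictionaryLearning as a stand-in for the alternate-minimization procedure of equation 6; the patch size n and codebook size P are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def tessellate(residual: np.ndarray, n: int) -> np.ndarray:
    """Split a residual image into non-overlapping n*n patches, flattened
    into rows of a (num_patches, n*n) matrix."""
    h, w = residual.shape
    return np.asarray([
        residual[r:r + n, c:c + n].ravel()
        for r in range(0, h - n + 1, n)
        for c in range(0, w - n + 1, n)
    ], dtype=np.float32)

def learn_primitive_database(residuals, n=8, num_codewords=128):
    """Learn P texture primitives (codewords) from bonafide and attack
    residual patches; columns of the returned database are primitives."""
    data = np.vstack([tessellate(r, n) for r in residuals])
    learner = MiniBatchDictionaryLearning(n_components=num_codewords)
    learner.fit(data)
    return learner.components_.T  # shape (n*n, P)
```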
  • the method of stage 505 comprises computing coefficients mapping regions of the residual images generated at stage 503 to texture primitives in the database generated at stage 504.
  • an objective of stage 505 is to represent the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database.
  • the accuracy and reliability of the presentation attack detection prediction, at later stage 507, may thus be improved.
  • the texture primitives in the database are generated specifically for the imaging application, e.g. in the NIR channel, and may thus accurately represent the textures of the subject image.
  • the texture primitive database learnt at stage 504 is used to obtain a texture descriptor, also known as a feature descriptor, for the residual image(s) computed at stage 503.
  • the incoming residual image $I_{residue}$ is divided into smaller, non-overlapping patches of n*n.
  • an optimal vector of coefficients may be computed using equation 7: $\alpha^{*} = \arg\min_{\alpha} \lVert x - C\alpha \rVert_2^2$, where $x$ denotes a vectorized patch and $C$ is the matrix of texture primitives.
  • each patch of each residual image may be represented as a function of one or more of the (micro) texture primitives contained in the texture primitive database and their respective coefficients, e.g. as a linear combination of plural texture primitives, as given by the function in Figure 12.
  • learning stage 504 for generating the database of texture primitives is described as being performed immediately prior to deployment stage 505, in alternative examples, learning stage 504 could be performed well in advance of deployment stage 505. Indeed, in examples, training stage 504 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the order of tessellated patches/regions of the residual images should be predefined, and should be consistent, to obtain the descriptor specific to the spatial coordinates. If the residual image is tessellated into i × j patches of uniform sizes, and the database of texture primitives consists of P codewords, then the feature descriptor $F_{texture}$ has dimensionality given by equation 8: $\dim(F_{texture}) = i \cdot j \cdot P$
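  • A sketch of the descriptor computation, solving equation 7 per patch as an unconstrained least-squares problem for illustration (the disclosure's alternate minimization with error norm $e_{min}$ would be substituted in practice):

```python
import numpy as np

def texture_descriptor(residual: np.ndarray, database: np.ndarray,
                       n: int = 8) -> np.ndarray:
    """Represent each non-overlapping n*n patch as a linear combination of
    the P codewords (columns of `database`), concatenating the coefficient
    vectors in a fixed patch order: dimensionality i * j * P (equation 8)."""
    h, w = residual.shape
    coeffs = []
    for r in range(0, h - n + 1, n):          # i patch rows
        for c in range(0, w - n + 1, n):      # j patch columns
            patch = residual[r:r + n, c:c + n].ravel()
            alpha, *_ = np.linalg.lstsq(database, patch, rcond=None)
            coeffs.append(alpha)              # P coefficients per patch
    return np.concatenate(coeffs)
```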
  • the texture/feature descriptor(s) generated at stage 505 may then be passed to classifier module 608, for inclusion in the classification operation at later stage 507, as will be described further with reference to Figure 15.
  • stage 506 for performing feature assessments comprises two operations, which may suitably be performed in parallel, each operation comprising three stages. Stages 1301 to 1303 are deployed for assessing eye regions of subject face images, whilst stages 1304 to 1306 are deployed for assessing mouth regions of subject face images.
  • Eye and mouth regions provide several important cues for detection of presentation attacks relating to facial images.
  • Stage 506 thus involves conducting an assessment of eye and mouth regions over a sequence of images frames to test for occlusion, local motion, and masking possibilities.
  • Stage 506 involves examining a variety of such cues from individual image frames as well as sequences of image frames. The features are extracted from a single frame as well as over a sequence of frames.
  • Stages 1301 to 1303 are deployed for assessing eye regions of subject face images.
  • an eye region of a face image can be considered a useful indicator of presentation attacks only if the eyes are visible in the image, i.e. if the eyes were visible to the imaging device at stage 501.
  • where the eyes are not visible, assessments of eye region features are not useful, and should be excluded from consideration in the later classification stage 507. Additionally, partial occlusion of one or both eyes may result in lower accuracy of the presentation attack prediction at later stage 507.
  • At stage 1301, a check is performed for the visibility of a user’s eyes in subject facial images.
  • the visibility check performed at stage 1301 may involve analysing the facial landmarks detected by pre-processor module 601 at stage 704 to identify relevant regions of the image. Based on the landmarks related to the eyes, and predetermined feature dimensions, e.g. facial dimensions, a region of the image containing the eyes is identified. The feature detection at stage 704 may approximate or estimate these locations in case of partial occlusions; therefore, an explicit check for visibility is desirable. In case of occlusion by glasses or other materials, the eye region of the image may be expected to appear nearly homogeneous. In a visible view of the eye regions, i.e. including pupils, eyebrows, eyelids etc., the region will include visible features. Stage 1301 involves looking for such visible features in the region of interest. Firstly, the entropy of a rectangular eye region $I_{eye}$ is computed by equation 9: $H_{eye} = -\sum_{k} p_k \log_2 p_k$
  • $p_k$ refers to the (normalized) histogram of the eye region $I_{eye}$.
  • the entropy of a visible eye region is expected to be much higher than that of an occluded eye region.
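  • A minimal sketch of the entropy computation of equation 9, assuming an 8-bit grayscale region:

```python
import numpy as np

def region_entropy(region: np.ndarray, bins: int = 256) -> float:
    """Equation 9: Shannon entropy of the region's intensity histogram;
    an occluded (near-homogeneous) region yields a low value."""
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    p = hist.astype(np.float64) / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```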
  • stage 1301 may further involve a pattern checking operation.
  • an edge map of an eye region of an image is computed using edge detection operators such as Sobel, Prewitt, or Canny.
  • the edge map is slightly blurred by convolving with a two-dimensional Gaussian kernel of small variance.
  • a template of the eye region is pre-computed from a small set of visible and clear bonafide presentations. The process of blurring slightly dilates the edge map, and hence, compensates for minor differences in the shapes of eyes of individuals.
  • the normalized cross-correlation (NCC) between the blurred edge map and the pre-computed template is then computed, where $T_{eye}$ is the normalized template of the eye region.
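  • A sketch of this pattern check, assuming OpenCV, a horizontal Sobel gradient as the edge map, and an equal-sized pre-computed bonafide template:

```python
import cv2
import numpy as np

def eye_pattern_score(eye_region: np.ndarray, template: np.ndarray) -> float:
    """Edge map, slight Gaussian blur (dilating edges to absorb minor shape
    differences), then NCC against the pre-computed eye template."""
    edges = np.abs(cv2.Sobel(eye_region, cv2.CV_32F, 1, 0, ksize=3))
    blurred = cv2.GaussianBlur(edges, (5, 5), 1.0)
    # With equal-sized inputs, matchTemplate returns a single NCC value.
    score = cv2.matchTemplate(blurred, template.astype(np.float32),
                              cv2.TM_CCORR_NORMED)
    return float(score[0, 0])
```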
  • Stage 506 may thus further involve, at stage 1302, assessment of motion between frames in an image sequence. In the example, such an assessment does not explicitly check for a blink or gaze; rather, it calculates the degree of generic local motion. This feature is computed over a sequence of frames, and only if the visibility check at stage 1301 provides a positive output.
  • the eye region, $I_{eye}$, is divided into small patches of m*n dimensions.
  • the mean absolute difference (MAD) between the patch $I_{eye}[k_1 m, k_2 n]$ from the p-th frame and the patch at the same spatial location from the (p-1)-th frame is calculated.
  • the MAD is a scalar value, which would remain close to zero if the patches do not change over frame sequence.
  • where local motion occurs, the MAD sequence is expected to be inconsistent.
  • for a fast movement such as a blink, the MAD sequence consists of high frequency (impulse) signals; whereas for a slow movement such as gaze, the MAD sequence contains relatively lower frequency contents.
  • the MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, etc.), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system.
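  • A sketch of the MAD computation over a frame sequence; the region tuple and patch dimensions are assumptions of this sketch.

```python
import numpy as np

def mad_sequence(frames, region, m=8, n=8):
    """Mean absolute difference (MAD) between co-located m*n patches of the
    eye region in successive frames; near-zero where nothing changes.
    `frames` is a list of grayscale images, `region` = (top, left, h, w)."""
    top, left, h, w = region
    mads = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        a = prev[top:top + h, left:left + w].astype(np.float32)
        b = curr[top:top + h, left:left + w].astype(np.float32)
        mads.append([
            np.abs(a[r:r + m, c:c + n] - b[r:r + m, c:c + n]).mean()
            for r in range(0, h - m + 1, m)
            for c in range(0, w - n + 1, n)
        ])
    return np.asarray(mads)  # one row of patch MADs per frame transition
```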
  • the differential value of MAD is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • At stage 1302, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch/region. Therefore, if the contents of an image patch change (due to eye movements), the corresponding textures also exhibit a large change. Stage 1302 may thus further compute a difference between texture descriptors of a given spatial patch over a frame sequence, which may thereby be used as an estimate of local changes/movement. Note that this function does not check for explicit blink or gaze movements, but finds indications for an overall motion of any kind. The objective of this operation is not to quantify the amount of motion; rather, it is aimed at identifying any motion that may be helpful in evaluating the liveness of the presentation.
  • stage 506 further provides, at stage 1303, a check for the possibility of cut-out features around the eye regions (in the case of detection of mask or print attacks). This functionality is tested only if the visibility check for the eye region at stage 1301 is positive.
  • the eye region I_eye is convolved with an edge detection kernel such as Sobel, and the output of the convolution operation is normalized for mean and standard deviation.
  • a histogram of this normalized edge image is computed.
  • a reference/template histogram is computed from a set of visible bonafide presentations from training data, by selecting the eye regions, and then obtaining edge maps through convolution.
  • the histogram for attack presentations with cut-outs may be expected to contain higher values than the corresponding histogram of bonafide samples.
  • the magnitude of a difference between the reference and test histograms is considered a useful indicator of the presence of cut-outs in the given region, and is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
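  • a minimal sketch of this cut-out check is given below; the bin count, histogram range, and the offline-prepared reference histogram ref_hist are assumptions:

    import cv2
    import numpy as np

    def cutout_score(i_eye: np.ndarray, ref_hist: np.ndarray, bins: int = 64) -> float:
        gx = cv2.Sobel(i_eye, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(i_eye, cv2.CV_32F, 0, 1)
        edges = cv2.magnitude(gx, gy)
        # Normalize the edge image for mean and standard deviation.
        edges = (edges - edges.mean()) / (edges.std() + 1e-8)
        hist, _ = np.histogram(edges, bins=bins, range=(-4.0, 4.0), density=True)
        # Total magnitude of difference from the bonafide reference histogram;
        # unnaturally strong edges from cut-outs inflate this value.
        return float(np.abs(hist - ref_hist).sum())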
  • stages 1304 to 1306 involve extracting a variety of such cues in a similar manner to the eye region assessment of stages 1301 to 1303.
  • a mouth region can be considered a useful indicator of presentation attacks only if the mouth is visible in the image, i.e. if it is visible to the imaging device at stage 501.
  • a subject or attacker might occlude the mouth completely or partially by covering, e.g. with hands or clothing. Since high occlusions degrade the accuracy of a presentation attack prediction, it is desirable to check the amount of occlusion or visibility of the mouth region in subject images.
  • stage 1304 thus involves a check for the visibility of the mouth region in subject facial images.
  • the visibility check for the mouth region is conducted by analysing the features, e.g. facial landmarks, detected by the pre-processor module 601 at stage 704, to identify relevant regions of the subject image. A region containing the mouth is identified based on the corresponding landmarks, and predetermined dimensions of an average face.
  • An explicit check for lips/mouth is desirable in particular where the feature detection at stage 704 is focused on identifying a silhouette of a face rather than specifically a mouth feature. In case of occlusion of the mouth region, e.g. by clothing, it may be expected that the mouth region will appear nearly homogeneous.
  • the entropy H_mouth of the mouth region I_mouth is computed analogously to equation 9, where p_k refers to the histogram of the mouth region I_mouth.
  • the entropy of a visible mouth region may be expected to be higher than that of an occluded mouth region. However, if the mouth is covered, e.g. by clothing or a facial mask, then the entropy calculation may not be a useful indicator of the visibility of the mouth.
  • Stage 1304 thus further involves a pattern-checking operation for assessing the visibility of a mouth region.
  • An edge map of a mouth region is computed using edge detection operators such as Sobel, Prewitt, or Canny. This edge map is slightly blurred using a 2D Gaussian kernel of small variance.
  • a template of the mouth region is pre-computed from a small set of facial images in which the mouth region is visible. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the mouth shapes of different individuals.
  • the blurred edge map is compared, using NCC, with T_mouth, the normalized template of the mouth region. This value, along with the region entropy H_mouth, is then passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • a natural movement of the mouth region can be a useful indicator of the liveness of a presentation.
  • Stage 506 may thus further involve detecting local motion between successive image frames. Since stage 1305 requires checking for motion, this feature is computed over a sequence of frames. Additionally, this feature is computed only if the visibility check for the mouth region at stage 1304 returns a positive output.
  • the mouth region I_mouth is divided into small patches of m×n pixels.
  • the mean absolute difference (MAD) between the patches at the same spatial location (I_mouth[k_1m, k_2n]) in consecutive frames, i.e. the p-th and (p−1)-th frames, is calculated.
  • the MAD is a scalar value that remains close to zero if the patches do not change over a frame sequence.
  • in the case of mouth movement, the MAD sequence may be expected to be inconsistent.
  • for a sudden movement, the MAD sequence may be expected to consist of high-frequency (impulse) signals, whereas for a slow natural movement it may be expected to contain relatively lower-frequency content.
  • the MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be, e.g., 2, 3 or 5) rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system.
  • the differential value of the MAD analysis is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • a texture descriptor is also generated at stage 1305 by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch; therefore, if the contents of an image patch change (e.g. due to lip movements), the corresponding textures are also expected to exhibit a large change. Thus, at this stage, the difference between texture descriptors of a given spatial patch over a frame sequence is computed and used to estimate a local change/movement. The objective of this operation is not to quantify the amount of motion or utterances; it is aimed simply at detecting any motion that may be a useful indicator of the liveness of the presentation.
  • stage 1306 involves a check for the possibility of a cut-out around the mouth region of the face (in the case of mask or print attacks). This functionality is tested only if the visibility check for the mouth region at stage 1304 is positive.
  • the presence of a cut-out around the mouth region can be inferred from the presence of unnaturally strong edges.
  • the mouth region I_mouth is convolved with an edge detection kernel such as Sobel, and the output of the convolution is normalized for mean and standard deviation.
  • a histogram of this normalized edge image is computed, which is then used as a feature.
  • a reference histogram is computed from a set of images in which a subject’s mouth region is visible, by selecting the mouth region, and then obtaining its edge map through convolution.
  • the histogram for attack presentations with cut-outs around the mouth region may be expected to have higher values as compared to the corresponding histogram of bonafide presentations.
  • the total magnitude of difference between the reference and test histograms is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • stage 507 for performing a classification operation for predicting presentation attacks comprises two stages.
  • the classifier module 608 is trained to predict presentation attacks based on the outputs of texture descriptor generator 604 at stage 505, and feature assessment module 605 at stage 506.
  • the classifier module 608 utilises a neural network, such as a multi-layer perceptron (MLP) network with one or more hidden layers, one input layer, and one output layer.
  • the number of neurons in the input layer is equal to the sum of dimensions of input features provided by the texture descriptor generator 604 and feature assessment module 605. This number is itself governed by the size of the texture primitive database, and the dimensions of the subject images.
  • for training the MLP at stage 1401, feature regions of training images, e.g. eye and mouth regions, of both bonafide and attack classes, along with labels identifying the nature of the texture primitives in the database, i.e. bonafide or attack, are utilised.
  • the training can be conducted using a suitable optimization method (such as Stochastic gradient descent (SGD) or Adam) , and learning parameters.
  • the batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved for the deployment phase.
  • the model consists of learned weight parameters.
  • at classification stage 1402, the classifier module takes as inputs the outputs of texture descriptor generator module 604 and feature assessment module 605 (output at stages 505 and 506 respectively) and, by operation of the learned neural network, outputs a prediction as to whether the subject image obtained at stage 501 shows a bonafide presentation of a human or an attack presentation, e.g. a mask or printed image.
  • the MLP model utilised by the classifier module consists of two neurons at the output. One output provides the classification of presentation attack detection, i.e., whether the input image depicts a bonafide human or a presentation attack.
  • the second output is used to provide a signal to the user, e.g. via the computer 102, if major occlusions are observed in the input images. While the operation of the classification module is robust to minor image occlusions, its performance may degrade as the amount of occlusion increases. The visibility of regions such as eyes and mouth can provide helpful cues in this regard.
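  • a minimal sketch of such a two-output MLP is given below, using PyTorch as one possible framework (the disclosure does not prescribe one); the hidden-layer widths are assumptions, and in_dim equals the concatenated length of the stage-505 and stage-506 features:

    import torch
    import torch.nn as nn

    class PADClassifier(nn.Module):
        def __init__(self, in_dim: int, hidden: int = 128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Two output neurons: a bonafide-vs-attack score, and an
            # occlusion signal for flagging heavily occluded inputs.
            self.head = nn.Linear(hidden, 2)

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.head(self.backbone(features))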
  • stage 506 for feature assessment may be omitted from the method, and the classifier module 608 may thus receive as inputs only the texture descriptors output by texture descriptor generator module 604 at stage 505.
  • a simpler neural network may be utilised.
  • a classification method not involving a neural network may be utilised.
  • whilst training stage 1401 for training the classifier module 608 is described as being performed immediately prior to classification stage 1402, in alternative examples, training stage 1401 could be performed well in advance of classification stage 1402. Indeed, in examples, training stage 1401 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the presentation attack detection module 203 and the method for presentation attack detection using the module 203, have wider utility generally for detecting presentation attacks in images.
  • the presentation attack detection module 203 and/or the method for presentation attack detection using the module 203 could be deployed independently of one or more other features of the computing system 101.
  • presentation attack detection module 203 could be deployed as a standalone module for detecting presentation attacks in images input to the module.

Abstract

A method for detecting image presentation attacks is disclosed. The method comprises obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.

Description

PRESENTATION ATTACK DETECTION
Field of the Disclosure
The present disclosure relates to detecting image presentation attacks.
Background of the Disclosure
Biometric authentication is used in computer science for user verification for access control. Forms of biometric authentication rely on imaging a user’s biometric traits, for example, imaging the user’s face, hand, finger or iris to detect physical features, whereby the detected physical features may be compared to known physical features of an authorised user to authenticate the access attempt. Presentation attacks may be perpetrated on such a biometric authentication system by an unauthorised user seeking to impermissibly gain access to a computer system, whereby the unauthorised user attempts to impersonate the biometric presentation of an authorised user. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. It is desirable to be able to reliably detect presentation attacks in order to reduce the risk of unauthorised access to a computer system.
Summary of the Disclosure
An objective of the present disclosure is to provide a method for detecting image presentation attacks.
The foregoing and other objectives are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the Figures.
Aspects of the present disclosure relate to a texture-based approach to presentation attack detection, whereby presentation attacks may be identified by detecting micro-textural differences between a bonafide presentation, e.g. a bonafide human face, and artifacts, e.g. a face-mask, printed photo, or video displayed on an electronic display. Such a texture-based approach may allow effective discrimination between bonafide and attack presentations based on artifact characteristics, such as the presence of pigments (printing defects) and specular reflection and shade (display attacks) . In other words, aspects of the disclosure relate to classifying presentation attacks based on the characteristic micro-textural differences between bonafide presentations and presentation attack artifacts.
A first aspect of the present disclosure provides a method for detecting image presentation attacks, the method comprising, obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
In examples, the method is for use in detecting image presentation attacks on a user verification system. The user verification system may, for example, be deployed as a part of a computer access control system for controlling access to a computer system. The user verification system may, for example, comprise an image capture device for capturing an image of a user presenting to the user verification system, for example, an image of the presenting user’s face or fingerprint, to thereby predict whether the presenting user is an authorised user or an unauthorised user.
A subject image, i.e. an image presented for analysis, is thus obtained. For example, the subject image could be obtained using an image capture device, and the method could involve capturing the image using an image capture device. In other examples, the subject image could be captured by an external system, and obtaining the subject image could involve obtaining the image file, optionally following initial processing of the image file.
The subject image is initially denoised. For example, the subject image could be denoised using a neural network trained for denoising images of a bonafide presentation, e.g. of a real human. The residual image is computed as a difference between the subject image and the denoised version of the subject image, for example, as a pixelwise difference between the subject image and the denoised representation. The residual image is thus a smoothened representation of the subject image, primarily representing micro-textural features of the subject image.
However, the texture details contained in residual images are superimposed with other corrupting noisy or high-frequency contents of the input presentation, which may impair the process of predicting whether the image depicts a presentation attack. Thus, a texture descriptor is then generated using a database of texture primitives, the texture descriptor(s) representing one or more regions/patches of the residual image. For example, the texture descriptor may comprise a code vector in which texture details of the region/patch of the residual image are encoded as a combination of texture primitives. The texture descriptor represents the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database. The accuracy and reliability of the classification operation for predicting image presentation attacks may thus advantageously be improved.
In examples, the database of texture primitives is specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel. An advantage of this approach is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification of the subject image as representing either a bonafide or attack presentation.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
In other words, in examples the texture descriptor may represent plural different regions of the residual image as a function of one or more of the texture primitives. Consequently, the texture details of the subject image may be more fully represented by the texture descriptor. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the texture descriptor may represent every region of the residual image as a function of the texture primitives, such that the texture details of the subject image are most fully represented.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
In other words, in examples the texture descriptor may represent each of the one or more different regions of the residual image as a function of a plurality of the texture primitives. Consequently, fuller texture information defining the texture details of the subject image may be represented. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the texture descriptor may represent each of the one or more regions of the residual image as a function of each of the texture primitives in the database, such that the texture details of the subject image are most fully represented.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives. In other words, the texture descriptor may represent the texture information of one or more of the regions as a linear combination of a plurality of the texture primitives, as sketched below. This may advantageously represent a computationally efficient form for comprehensively representing texture information.
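As a sketch of this implementation, given a dictionary matrix holding one texture primitive per column, the coefficient vector c with patch ≈ Dc may serve as the texture descriptor. A plain least-squares fit is shown for illustration; a sparse-coding solver is an equally valid choice, and all names are assumptions:

    import numpy as np

    def encode_patch(patch: np.ndarray, primitives: np.ndarray) -> np.ndarray:
        """primitives: (pixels, n_primitives); patch is flattened to (pixels,)."""
        coeffs, *_ = np.linalg.lstsq(primitives, patch.ravel(), rcond=None)
        return coeffs  # descriptor: one coefficient per texture primitive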
In an implementation, the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an input image.
A convolutional neural network may advantageously allow accurate and reliable denoising of images, with relatively low computational complexity. For example, the convolutional neural network could be a denoising auto-encoder. The convolutional neural network could be trained using only bonafide images of a user, e.g. of a user’s face, rather than of attack presentations. Consequently, the denoiser may be expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
A convolutional neural network may advantageously allow accurate and reliable prediction of presentation attacks, with relatively low computational complexity. For example, the convolutional neural network could be a multi-layer perceptron.
In an implementation, the method further comprises performing on the subject image, an image intensity normalization operation for changing an intensity range of the received image to a  predetermined intensity range, and/or an image resizing operation for changing a size of the received image to a predetermined size.
In other words, the subject image may be subjected to intensity normalising/resizing operations prior to the denoising stage. Consequently, the subject image, and so the texture details of the subject image, may be adapted to match as closely as possible a desired intensity/size of the texture details, e.g. to match the intensity/size on which the denoiser and/or the classifier operations have been trained, and/or on which the texture primitive database was learnt. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the input image.
In other words, the method may involve detecting, and checking the visibility of, certain features of an image. For example, the operations could be aimed at detecting the position of facial landmarks, such as eye and/or mouth regions in facial images, and subsequently checking the visibility of those features in the subject image(s). Thus, the existence of particular features of the image may be determined, which may be indicative of a liveness of the imaged subject. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the visibility detection operation could involve computing an entropy of the regions and comparing the computed entropy to an expected entropy of a region depicting the predetermined feature, e.g. a mouth or eye.
In an implementation, the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature, convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution, obtaining a reference histogram, and computing a difference between the histogram and the reference histogram.
For example, this implementation could be used for detecting edges, such as edges bounding cut-out regions, of a mask, to thereby detect attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
The reference histogram may be generated for a particular feature, e.g. eyes/mouth, of a bonafide image, and computing the difference between the histograms may involve computing a difference/similarity between the edge features. The magnitude of the difference may be used as a reliable proxy for the presence of edges of a mask, e.g. cut-outs.
In an implementation, the method further comprises receiving a further subject image; denoising the further subject image to obtain a denoised representation of the further subject image; computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image; generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
This method may advantageously allow for detection of local motion of a subject of the subject image, occurring between an acquisition time of the subject image and the acquisition time of the further subject image. For example, the method may be used for detecting movement of eye or mouth regions of an imaged user. Such a comparison may detect movement of the imaged subject, which may thereby be used to infer a liveness of the subject. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the obtaining a convolutional neural network for predicting a denoised representation of a subject image comprises, obtaining a convolutional neural network, receiving a training image, adding image noise to the training image to generate a noisy representation of the training image, performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image, quantifying a difference between the prediction for a denoised representation of the noisy representation of the training  image and the training image; and modifying operations of the convolutional neural network based on the difference.
Adding image noise may involve adding white Gaussian noise. Modifying operations of the convolutional neural network based on the difference may involve updating parameters, such as weights, of the CNN.
In an implementation, the receiving a subject image comprises receiving an image of near-infrared radiation, and wherein the obtaining a database of one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
In other words, the subject images may be acquired in the NIR channel. NIR images are advantageously relatively unsusceptible to corruption by varying levels of background visible light during imaging, good for imaging in low visible light levels, e.g. at night, and good for imaging certain key texture details which discriminate between bonafide and attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the receiving an input image comprises imaging using an optical imaging device. For example, the imaging device could be a near-infrared imaging device, sensitive to near-infrared radiation.
A second aspect of the present disclosure provides a computer program comprising instructions which, when executed by a computer system, cause the computer system to carry out a method of any implementation of the first aspect of the present disclosure.
A third aspect of the present disclosure comprises a computer-readable data carrier having the computer program of the second aspect of the present disclosure stored thereon.
A fourth aspect of the present disclosure provides a computer system for detecting image presentation attacks, wherein the computer system is configured to, obtain a subject image, denoise the subject image to obtain a denoised representation of the subject image, compute a residual image representing a difference between the subject image and the denoised representation of the subject image, obtain a database of one or more texture primitives, generate a texture descriptor representing image texture details of one or more regions of the  residual image as a function of texture primitives in the database, and perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
In implementations, the computer system of the fourth aspect of the present disclosure may be further configured to perform a method of any implementation of the first aspect of the present disclosure.
These and other aspects of the disclosure will be apparent from the embodiment (s) described below.
Brief Description of the Drawings
In order that the present invention may be more readily understood, embodiments of the disclosure will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows schematically an example of a computing system embodying an aspect of the disclosure, comprising a computer and a biometric verification system;
Figure 2 shows schematically an example implementation of the biometric verification system identified previously with reference to Figure 1, comprising a presentation attack detection module;
Figure 3 shows schematically an example implementation of the presentation attack detection module, identified previously with reference to Figure 2;
Figure 4 shows schematically an example method performed by the biometric verification system 103, comprising a method for detecting presentation attacks;
Figure 5 shows schematically processes involved in the method for detecting presentation attacks, identified previously with reference to Figure 4;
Figure 6 shows schematically an example of computational functionality provided by the presentation attack detection module identified previously with reference to Figure 3;
Figure 7 shows schematically processes involved in obtaining subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 8 shows schematically processes involved in denoising subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 9 shows schematically stages of a denoising auto-encoder involved in the process for denoising subject images identified previously with reference to Figure 8;
Figure 10 shows schematically processes involved in computing residual images corresponding to subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 11 shows schematically processes involved in obtaining a database of texture primitives, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 12 shows schematically processes involved in generating texture descriptors, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 13 shows schematically processes involved in assessing features of subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 14 shows schematically processes involved in predicting presentation attacks in subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5; and
Figure 15 shows schematically a multi-layer perceptron classifier model, utilised in the process for predicting presentation attacks in subject images identified previously with reference to Figure 14.
Detailed Description of the Disclosure
Referring firstly to Figure 1, a computing system 101 embodying an aspect of the present disclosure comprises a computer 102 and a biometric verification system 103 in communication with the computer 102 via connection 104.
Computer 102 is operable to perform a computing function. For example, computer 102 may run computer programs for performing functions. In examples, computer 102 is a personal computer, a smartphone, or a computer installed onboard a vehicle for controlling functions of the vehicle. In other examples, computer 102 is a payment device for accepting a payment from a user, such as a point of sale device, or for dispensing a payment to a user, for example, an automated teller machine. In such applications, it may be desirable to control access to the computer, or to functionality of the computer, so as to prevent unauthorised use of the computer. For example, where the computer 102 is a personal computer, it may be desirable to restrict access to functionality of the computer to one or more authorised users of the personal computer, to thereby prevent unauthorised users from using the computer.
Biometric verification system 103 is functional to verify whether a user requesting access to the computer 102 is an authorised user. An output of biometric verification system 103 may thus be used by computer 102 to determine whether to grant access to the computer, e.g. access to an operating system or programs running on the computer, to the user. In examples, biometric verification system 103 is functional to image a user requesting access to computer 102, and verify that the imaged user is an authorised user of the computer, by comparing biometric information extracted from the image with predefined biometric characteristics of authorised users. A difficulty encountered by such a system is that an unauthorised user may perpetrate a presentation attack on the biometric verification system 103, whereby the unauthorised user may attempt to impersonate the biometric presentation of an authorised user, to thereby impermissibly gain access to the computer. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. As will be described in further detail with reference to the later Figures, in examples therefore biometric verification system 103 is functional to detect certain forms of presentation attacks in images of the user, to thereby reduce the risk of unauthorised access to the computer 102.
In examples, biometric verification system 103 may be connected to computer 102 via a network 104. Network 104 may be implemented, for example, by a wide area network (WAN) such as the Internet, a local area network (LAN), a metropolitan area network (MAN), and/or a personal area network (PAN), etc. The network may be implemented using wired technology such as Ethernet, Data Over Cable Service Interface Specification (DOCSIS), synchronous optical networking (SONET), and/or synchronous digital hierarchy (SDH), etc., and/or wireless technology, e.g. Institute of Electrical and Electronics Engineers (IEEE) 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), Bluetooth, ZigBee, near-field communication (NFC), and/or Long-Term Evolution (LTE), etc. The network 104 may include at least one device for communicating data in the network. For example, the network 104 may include computing devices, routers, switches, gateways, access points, and/or modems. In other examples, biometric verification system 103 may be connected to computer 102 via a simpler data transfer connection 104, e.g. via a connection in accordance with the Universal Serial Bus standard.
In the illustrated example, biometric verification system 103 is depicted as being structurally distinct from, and co-located with computer 102. In other examples, biometric verification system 103 could be located remotely of the computer 102, or could instead be incorporated into computer 102, such that biometric verification system 103 utilises computing resource of the computer 102. For example, computer 102 could comprise a handheld computing device, e.g. a smart phone, and biometric verification system 103 may be integrated in the handheld computing device.
Referring in particular to Figure 2, in examples, biometric verification system 103 comprises user verification module 201, user identification module 202, presentation attack detection module 203, and image acquisition module 204. Image acquisition module 204 comprises imaging device 205. The components 201 to 205 of the biometric verification system 103 are in communication via system bus 206, which is in turn connected to computer 102 via connection 104.
User verification module 201 is for determining whether a user requesting access to the computer 102 is an authorised user of the system, based on inputs from user identification module 202 and from presentation attack detection module 203, and for communicating such determination to the computer 102 via connection 104. User verification system 201 may, for example, comprise a computer processor for performing the user verification task.
User identification module 202 is for identifying an attempted user of the computer from an image of the user acquired by image acquisition module 204. In examples, the user identification module 202 is configured to extract and analyse biometric information from images acquired by image acquisition module 204, access predefined biometric information stored in storage, e.g. in storage internal to user identification module, in which biometric information of authorised users is stored, and determine whether the biometric information extracted from the acquired images matches biometric information of an authorised user. User  identification system 202 may then communicate a determination as to whether the user requesting access appears to be an authorised user to user verification system. In an example to be described in detail herein, user identification module 202 is configured for facial recognition, and is configured to analyse facial features of an imaged user, to determine whether the facial features match predefined facial features of an authorised user. In other examples, user identification module 202 may be configured for alternative forms of biometric identification, for example, fingerprint or iris recognition. Various suitable methods for analysis of biometric images for identification of a user, such as facial, fingerprint or iris recognition, are known to persons skilled in the art. User identification module 202 may, for example, comprise computer storage for storing predefined biometric information of authorised users, and a computer processor for performing the user identification task.
Presentation attack detection module 203 is for detecting presentation attack attempts in images acquired by image acquisition module 204. In particular, in examples, presentation attack detection module 203 is for predicting whether an image acquired by image acquisition module 204 depicts a presentation attack. Presentation attack detection module 203 will be described in further detail with particular reference to Figures 3 and 6.
Image acquisition module 204 is for acquiring one or more images of a user requesting access to the computer 102, for communication to user identification system 202 and presentation attack detection system 203. For example, image acquisition module 204 may be for imaging a user’s face, and imaging device 205 may thus be fixed in a suitable position such that a face region of a user requesting access to computer 102 may be imaged. In examples, imaging device 205 is an optical camera. In an example to be described in detail herein, imaging device 205 is for imaging a user’s face. In examples, the imaging device 205 is configured for imaging near-infrared (NIR) radiation, e.g. for imaging a user’s face in the NIR channel. In such examples, imaging device 205 may comprise an optical camera having an NIR filter fitted to the lens. The filter may thus selectively pass NIR spectra radiation to an image sensor. In other examples, image acquisition device could be configured for imaging other regions of a user’s body, for example, as a fingerprint or iris imager. Image acquisition module 204 may comprise a computer processor for performing the imaging task, and may optionally further comprise storage for storing acquired images.
In the example, biometric verification system is described as comprising four distinct modules 201 to 204, each having independent computing resource, processor and/or memory resource. In  other examples however, the functionality of one or more of the modules may be combined and implemented by shared computing resource. For example, the functionality of all of the modules 201 to 204 could be implemented by a common processor.
Referring to Figure 3, in an example, presentation attack detection module 203 comprises processor 301, storage 302, memory 303, input/output interface 304, and system bus 305. The presentation attack detection module 203 is configured to run a computer program for detecting presentation attacks in images acquired by image acquisition module 204.
Processor 301 is configured for execution of instructions of a computer program. Storage 302 is configured for non-volatile storage of computer programs for execution by the processor 301. In the embodiment, the computer program for predicting presentation attacks in images acquired by image acquisition module 204 is stored in storage 302. Memory 303 is configured as read/write memory for storage of operational data associated with computer programs executed by the processor 301. Input/output interface 304 is provided for communicating presentation attack detection module 203 with system bus 206. The components 301 to 304 of the presentation attack detection module 203 are in communication via system bus 305.
Referring to Figure 4, in an example, biometric verification system 103 is configured to perform a user verification procedure for verifying that a user requesting access to computer 102 is an authorised user of the computer. For example, the biometric verification system 103 may perform the verification procedure in response to receiving a prompt from computer 102 via connection 104.
At stage 401, the image acquisition module 204 images a user requesting access to the computer 102, using imaging device 205. In examples, stage 401 involves imaging the user’s face, optionally in the NIR channel, wherein the actual range of wavelengths may be pre-configured. In examples, the image acquisition module 204 acquires a plurality of frames, where the duration of frame acquisition and time interval between successive frames (frame rate) may be defined at stage 401. In examples, the image acquisition module 204 may be capable of imaging at plural different resolutions, and stage 401 may involve defining an imaging resolution. In examples, the imaging device 205 may have one or more illumination devices for a specified NIR range. Stage 401 may thus further involve adjusting the illumination devices such that the region nominally containing a subject’s head/face is properly and uniformly illuminated. Although minor variations in the ambient light are acceptable, the capturing session should have  reasonable illumination conditions. During the capture, it is preferred that the subject’s face occupies a major area of the image being captured. The image acquisition module 204 may then communicate the acquired image (s) to user identification module 202 and presentation attack detection module 203.
At stage 402, the user identification module 202 analyses the image data acquired at stage 401, extracts biometric information relating to the imaged user (s) , e.g. facial feature information, and determines whether the user is an authorised user of the computer by comparing the extracted biometric information to predefined biometric information of authorised users, i.e. predefined biometric information stored in computer storage accessible by the user identification module 202. The user identification module 202 may output the determination to user verification module 201.
At stage 403, the presentation attack detection module 203 analyses the image data acquired at stage 401, and generates a prediction of whether the acquired image (s) depict a presentation attack, i.e. whether the image is of an unauthorised user attempting to perpetrate a presentation attack. For example, this stage could involve the presentation attack detection module 203 predicting whether the acquired image (s) show a face mask or printed photo. The operation of presentation attack detection module 203 will be described in further detail with reference to later Figures 5 to 15. The presentation attack detection module 203 may output the determination to user verification module 201.
At stage 404, the user verification module 201 may evaluate the determinations from the user identification module 202 and the presentation attack detection module 203, received at stages 402 and 403 respectively, determine whether the imaged user is an authorised user, and communicate that determination to computer 102. For example, where the determination of the user identification module 202 at stage 402 is that the imaged user appears to be an authorised user, and the prediction of the presentation attack detection module 203 at stage 403 is that the image does not depict a presentation attack, the user verification module 201 may determine that the user requesting access is an authorised user. In contrast, if the determination of the user identification module 202 at stage 402 is that the imaged user does not appear to be an authorised user, or if the prediction of the presentation attack detection module 203 at stage 403 is that the image does depict a presentation attack, the user verification module 201 may determine at stage 404 that the user requesting access is not an authorised user, and may notify the computer 102 accordingly.
Referring in particular to Figure 5, in an example, the method of stage 403, for detecting presentation attacks, comprises seven stages. In examples, the method of stage 403 is implemented by the processor 301 of presentation attack detection module 203, in response to instructions of the computer program stored in storage 302 of presentation attack detection module 203.
At stage 501, the computer program stored in storage 302 causes the processor 301 to obtain one or more subject images for analysis, i.e. to obtain images of an attempted user for analysis.
At stage 502, the computer program stored in storage 302 causes the processor 301 to denoise the one or more subject images obtained at stage 501, i.e. to remove image noise from the images, to obtain denoised representations of the subject images.
At stage 503, the computer program stored in storage 302 causes the processor 301 to compute one or more residual images, each residual image representing a difference between a subject image obtained at stage 501 and the respective denoised image computed at stage 502.
At stage 504, the computer program stored in storage 302 causes the processor 301 to obtain a database of texture primitives, each texture primitive encoding information representing a texture feature.
At stage 505, the computer program stored in storage 302 causes the processor 301 to generate one or more texture descriptors, e.g. code vectors, each texture descriptor representing one or more regions of a residual image computed at stage 503 as a function of texture primitives in the database obtained at stage 504.
At stage 506, the computer program stored in storage 302 causes the processor 301 to perform on the subject images obtained at stage 501 one or more feature assessment operations.
At stage 507, the computer program stored in storage 302 causes the processor 301 to perform classification operations, based on outputs of  stages  505 and 506, for predicting whether the subject images obtained at stage 501 depict a presentation attack.
In other examples, stage 403 may comprise fewer or further operational stages, depending on the instructions contained in the computer program. For example, in other implementations the operations of stage 506 may be omitted from the method.
Referring next to Figure 6, in examples, the presentation attack detection module 203, depicted previously with reference to Figure 3, is configured to support the functionality of a plurality of functional modules. In the example, each of the functional modules utilises the processor 301, storage 302, and memory 303 of the presentation attack detection module 203.
Pre-processor module 601 is provided for supporting the method of stage 501, for retrieving images, e.g. facial images, from image acquisition module 204, performing image processing operations on the acquired images, and for outputting subject images for analysis by later modules.
Denoiser module 602 is provided for supporting the method of stage 502, for denoising images output by pre-processor module 601 to remove image noise from the images, to obtain denoised representations of the images.
Residual image computing module 603 is provided for supporting the method of stage 503, for computing residual images representing a difference between a subject image output by pre-processor module 601 and the respective denoised image output by denoiser 602.
Texture descriptor generator module 604 is provided for supporting the methods of  stages  504 and 505, for generating a database of texture primitives, and generating texture descriptors representing regions of residual images output by residual image computing module 603 as a function of one or more of the texture primitives.
Feature assessment module 605 is provided for supporting the method of stage 506, to perform feature assessment operations on the subject images output by pre-processor module 601. In examples in which the image acquisition module 204 is utilised to image a user’s face, feature assessment module 605 may comprise eye-region and/or mouth-region assessment sub-modules 606, 607 respectively.
Classifier module 608 is provided for supporting the method of stage 507, for predicting, based on the outputs of texture descriptor generator module 604 and feature assessment module 605,  at  stages  505 and 506 respectively, whether the images obtained at stage 501 depict a presentation attack.
Referring to Figure 7, in examples, the method of stage 501 for obtaining subject images comprises six stages.
At stage 701, the image acquisition module 204 acquires one or more images of a user attempting access to the computer 102. In examples, stage 701 may involve the presentation attack detection module 203 retrieving pre-acquired images from image acquisition module 204, or may instead involve presentation attack detection module 203 instructing image acquisition module 204 to acquire images of a presenting user by imaging device 205, e.g. images of the user’s face, optionally in the NIR channel.
At stages 702 to 706, the pre-processor module 601 performs certain image processing operations on the image (s) acquired at stage 701. In examples, the pre-processor module 601 processes each image (or frame) independently and identically.
At stage 702, the pre-processor module 601 performs an image normalization operation, whereby the image (s) acquired at stage 701 are normalized for a specific, predefined, intensity range.
In examples, stage 702 involves calculation of minimum (I_min) and maximum (I_max) values for intensity thresholds from the image statistics. The normalization operation on the image data of each frame may be as shown in equation 1:
I_norm = (I − I_min) / (I_max − I_min)     (1)
These values may be dynamically computed for each acquired image to capture most of the valid intensity values while ignoring spurious noisy pixels. Once the range thresholds are computed, the valid range of pixels (||I_max − I_min||) may be mapped to a different finite range for further processing, as sketched below.
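As an illustrative sketch of equation 1, the thresholds may be taken robustly from the image statistics; the use of percentiles to ignore spurious noisy pixels, and the output range, are assumptions:

    import numpy as np

    def normalize_intensity(img: np.ndarray, out_max: float = 255.0) -> np.ndarray:
        i_min, i_max = np.percentile(img, [1, 99])   # dynamic range thresholds
        img = np.clip(img.astype(np.float32), i_min, i_max)
        return (img - i_min) / max(i_max - i_min, 1e-8) * out_max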
At stage 703, the pre-processor module 601 performs an image resizing operation, whereby the normalized image (s) output at stage 702 are converted to a predefined fixed dimension through appropriate resizing.
At stage 704, the pre-processor module 601 performs a feature detection operation on the resized image(s) output at stage 703, for example, to detect a user’s face in the image and/or facial landmarks of a user’s face, such as the user’s mouth and/or eye regions. This operation may utilise state-of-the-art libraries such as dlib, or techniques such as a deformable parts model, convolutional neural networks (CNN), cascaded pose regression, and/or multi-task CNNs. In the case of facial imaging, if at stage 704 the pre-processor module 601 is unable to detect a valid face, then a signal may be generated, to be communicated to the computer 102, to inform the user accordingly. Also, if the dimensions of the detected face are smaller than a predetermined threshold, the input image may be rejected. If this behaviour is observed across several frames, the user may be provided with a suitable signal, e.g. by computer 102.
At stage 705, in response to detection of valid features at stage 704, e.g. a valid face/facial landmarks (i.e. coordinates of various facial features), the pre-processor module 601 performs an alignment operation on the image(s), whereby the images are aligned for the subsequent processing. This stage may involve selecting one or more image features, e.g. facial features (such as the left and right eyes, or the left/right/middle points of the mouth), and computing a two-dimensional transformation of the image such that the coordinates of these specific features are consistent across a succession of images.
At stage 706, the pre-processor module 601 performs a cropping operation on the image (s) . For example, the images may be cropped using available features, such as facial landmarks, to depict only a desired image feature, e.g. a face region. The images can also be again resized at stage 706 to predetermined dimensions.
The output of the pre-processor module 601 is thus a cropped and aligned subject image, e.g. a facial image, and corresponding image features, e.g. facial landmarks such as eye/mouth regions.
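A minimal sketch of the alignment and cropping of stages 705 and 706 is given below, assuming eye coordinates have already been returned by the feature detection of stage 704 (e.g. via dlib); the canonical eye positions and output size are assumptions:

    import cv2
    import numpy as np

    def align_and_crop(img: np.ndarray, left_eye, right_eye, size: int = 128) -> np.ndarray:
        # Map the detected eye coordinates to fixed canonical positions so that
        # feature coordinates are consistent across a succession of images.
        src = np.float32([left_eye, right_eye])
        dst = np.float32([[0.35 * size, 0.4 * size], [0.65 * size, 0.4 * size]])
        m, _ = cv2.estimateAffinePartial2D(src, dst)
        return cv2.warpAffine(img, m, (size, size))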
Referring to Figures 8 and 9 collectively, in examples, the method of stage 502 for denoising images comprises two stages.
At stage 801, the denoiser module 602 is trained to denoise images of the type to be analysed, e.g. images in the NIR channel. In examples, a denoising auto-encoder (DAE) CNN is utilised for denoising images. In examples the DAE comprises one or more units of convolutional, pooling, and normalization layers; and one or more units of fully connected layers.
For training the DAE at stage 801, bonafide training images are obtained, e.g. non-presentation attack images, such as images of real human faces, and the images are pre-processed by pre-processor module 601 by the method of stages 702 to 706, as described previously.
During training of the DAE, the training images are intentionally corrupted by adding a suitable noise, such as additive white Gaussian noise (AWGN), of varying levels. In the case of AWGN, these levels are determined by discrete values of the variance (or standard deviation) of the Gaussian function used to generate the noise probability mass function (pmf). Therefore, for N pre-processed images and m levels of noise, total training data of mN images may be obtained through augmentation. For an input image I_F, the noisy or corrupted image Ĩ_F may be obtained through a stochastic mapping, given by equation 2:
Ĩ_F ~ q(Ĩ_F | I_F; σ)     (2)
where σ is the noise level.
At a higher level, the architecture of a DAE consists of an encoder and a decoder. During training, the encoder maps the noisy input Ĩ_F to the hidden representation h, as the function given by equation 3:
h = f(Ĩ_F; θ_E)     (3)
where f represents the overall encoder model with parameters θ_E. On the decoder side, the hidden representation h is reconstructed into I_F-DN using a decoder function g consisting of parameters θ_D, such that equation 4 holds:
I_F-DN = g(h; θ_D)     (4)
During training, the parameters θ_E and θ_D of the DAE are learnt through minimization of a loss function over an average reconstruction error, E(||I_F-DN − I_F||), of training images (authorised user images only). The training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters. The batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved, e.g. in storage 302, for the deployment phase. The model consists of the θ_E and θ_D parameters.
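A minimal sketch of such a DAE and its training step, written here in PyTorch; the layer sizes, learning rate, and loss choice are illustrative assumptions rather than the disclosure’s exact architecture:

import torch
import torch.nn as nn

class DAE(nn.Module):
    # Sketch of a convolutional denoising auto-encoder for 1-channel
    # (e.g. NIR) 128x128 inputs; layer sizes are illustrative.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # f(.; theta_E), equation (3)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU())
        self.decoder = nn.Sequential(  # g(.; theta_D), equation (4)
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DAE()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # average reconstruction error E(||I_F-DN - I_F||)

def train_step(noisy, clean):
    optimiser.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimiser.step()
    return loss.item()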
Referring in particular to Figure 9, at stage 802, the DAE trained at stage 801 is deployed to denoise pre-processed subject images output by pre-processor module 601 at stage 501. At stage 802, the pre-processed subject image(s) are passed through the DAE without any noise corruption. Because the DAE model was trained at stage 801 using only bonafide images, i.e. images of real humans rather than presentation attacks, in the particular channel of interest, e.g. NIR, it has learnt the finer textural details of real humans. Therefore, it is expected to be a more efficient denoiser for bonafide presentations than for attack presentations. The output of the DAE at stage 802 is thus a smoothed/filtered version of the input image.
Whilst the example training stage 801, for training the DAE, is described as being performed immediately prior to denoising stage 802, in alternative examples, training stage 801 could be performed well in advance of denoising stage 802. Indeed, in examples, training stage 801 could be performed by an engineer prior to deployment of the biometric verification system 103.
Referring to Figure 10, in examples, the method of stage 503 for computing residual images comprises pixelwise subtraction of the pre-processed image output by the pre-processor module 601 at stage 501 (the input to the DAE denoiser) from the filtered image output by the DAE denoiser 602 at stage 502. This procedure yields an image that primarily consists of the textural information in the input image. For example, in the case of presentation attacks constructed using a digital display, the residual image is expected to contain patterns of scanline noise. Similarly, for a three-dimensional face mask formed of paper, the residual image is expected to represent mainly the fine granular texture of the paper material.
For an input image, e.g. a face image, I_F, the DAE generates a somewhat smoothed output, I_F-DN. The residual image, I_residue, is obtained from the pixelwise difference between the two, as given by equation (5):

I_residue = I_F-DN − I_F     (5)
The resultant residual image thus primarily encodes the information related to texture patterns of the input images.
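The residual computation itself is a single pixelwise difference; a minimal sketch:

import numpy as np

def residual(i_f, i_f_dn):
    # Equation (5): pixelwise difference, primarily encoding texture patterns.
    return i_f_dn.astype(np.float32) - i_f.astype(np.float32)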
However, these texture details are superimposed with other corrupting noisy or high-frequency contents of the input presentation. Such corruptions may impair the process of predicting whether the image depicts a presentation attack. Thus, an objective of later stage 505 is to represent the contents of the residual image(s) in a more discriminative manner.
Referring to Figure 11, in examples, the method of stage 504 for obtaining a database of texture primitives comprises five stages. In the example, stage 504 involves generating the database by the training procedure described below. In other examples, stage 504 could involve obtaining a pre-generated database of texture primitives, i.e. a database generated at a prior time, optionally by a third party, e.g. using the following procedure.
In examples, the database of texture primitives is a “dictionary” of textural primitives or codewords that are specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel. An advantage of this approach of generating texture primitives that are specific to the intended imaging application, e.g. texture primitives of NIR imagery where the intended application will image in the NIR channel, is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification at later stage 507.
An objective of later stage 505 is to represent a local patch of the residue image I_residue as a linear combination of texture primitives, i.e., codewords of the dictionary. Thus, entries in the database should encode texture primitives such that, at stage 505, an input image, through its residual image, may be represented as a vector of texture primitives.
At stage 1101, training images, including both bonafide images, e.g. images of real human faces, and attack presentations, e.g. images of subjects wearing masks, are obtained.
At stage 1102, a residual image is obtained, for example by the method of stages 501 to 503, for each input training image (both classes, bonafide and attack presentations) .
At stage 1103, the residual images obtained at stage 1102 may be tessellated to obtain small, non-overlapping regions, otherwise known as patches, of n*n dimensions.
At stage 1104, for each region/patch I_residue[i, j], 0 ≤ i < n, 0 ≤ j < n, texture primitives, otherwise known as codes, for inclusion in the database may be computed by the minimization of equation (6):

min over α, C of ||I_residue[i, j] − Cα||₂     (6)

where α and C represent the coefficients and texture primitives, respectively. Their values are computed through an alternate minimization technique, in which an acceptable error norm, e_min, can be predetermined. This training procedure thus generates a set of texture primitives or codewords, representing different textural features of the bonafide and attack training images.
At stage 1105, the texture primitives generated at stage 1104 are compiled to form a database of texture primitives. Columns of the database may thus represent individual texture primitives, i.e. individual codewords.
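As an illustrative sketch of stages 1103 to 1105, dictionary learning by alternate minimisation is available off the shelf, e.g. in recent scikit-learn; the patch size and number of codewords below are assumptions:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

n = 8                                       # assumed patch size (n*n)
P = 64                                      # assumed number of codewords
# Rows are flattened n*n residual patches from stage 1103 (stand-in data here).
residual_patches = np.random.randn(1000, n * n)

# Alternate minimisation over coefficients (alpha) and atoms (C), cf. equation (6).
dico = MiniBatchDictionaryLearning(n_components=P, alpha=1.0, max_iter=200)
C = dico.fit(residual_patches).components_  # P x n^2 database of texture primitives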
Referring to Figure 12, in examples, the method of stage 505 comprises computing coefficients mapping regions of the residual images generated at stage 503 to texture primitives in the database generated at stage 504.
As previously described, the texture details contained in residual images are superimposed with other corrupting noisy or high-frequency contents of the input presentation, which may impair the later classification process at stage 507 for predicting whether the image depicts a presentation attack. Thus, an objective of stage 505 is to represent the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database. The accuracy and reliability of the presentation attack detection prediction, at later stage 507, may thus be improved. As previously described, in examples, the texture primitives in the database are generated specifically for the imaging application, e.g. in the NIR channel, and may thus accurately represent the textures of the subject image.
At stage 505, the texture primitive database learnt at stage 504 is used to obtain a texture descriptor, also known as a feature descriptor, for the residual image(s) output by the DAE denoiser 602 at stage 502. The incoming residual image I_residue is divided into smaller, non-overlapping patches of n*n. For each patch, an optimal vector of coefficients may be computed using equation (7):

α = argmin over α of ||I_residue[i, j] − Cα||₂     (7)

where C is the database of texture primitives learnt at stage 504, held fixed.
This operation thus involves computing the coefficients α, such that each patch of each residual image may be represented as a function of one or more of the (micro) texture primitives contained in the texture primitive database and their respective coefficients, e.g. as a linear combination of plural texture primitives, as given by the function in Figure 12.
Whilst in the example learning stage 504 for generating the database of texture primitives is described as being performed immediately prior to deployment stage 505, in alternative examples, learning stage 504 could be performed well in advance of deployment stage 505. Indeed, in examples, training stage 504 could be performed by an engineer prior to deployment of the biometric verification system 103.
The order of the tessellated patches/regions of the residual images should be predefined, and should be consistent, so that the descriptor is specific to the spatial coordinates. If the residual image is tessellated into i×j patches of uniform size, and the database of texture primitives consists of P codewords, then the feature descriptor F_texture has dimensionality given by equation (8):
F_texture ∈ R^(i×j×P)     (8)
The texture/feature descriptor(s) generated at stage 505 may then be passed to classifier module 608, for inclusion in the classification operation at later stage 507, as will be described further with reference to Figure 15.
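A minimal sketch of this encoding step, reusing the dictionary C from the sketch above together with scikit-learn’s sparse coding; the OMP algorithm choice is an assumption:

import numpy as np
from sklearn.decomposition import sparse_encode

def texture_descriptor(residual_image, C, n=8):
    # Encode each n*n patch against dictionary C (equation (7)) and
    # concatenate the coefficients in a fixed patch order (equation (8)).
    h, w = residual_image.shape
    coeffs = []
    for i in range(0, h - h % n, n):
        for j in range(0, w - w % n, n):
            patch = residual_image[i:i + n, j:j + n].reshape(1, -1)
            alpha = sparse_encode(patch, C, algorithm="omp")
            coeffs.append(alpha.ravel())
    return np.concatenate(coeffs)  # dimensionality i x j x P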
Referring to Figure 13, in examples relating to subject facial images, the method of stage 506 for performing feature assessments comprises two operations, which may suitably be performed in parallel, each operation comprising three stages. Stages 1301 to 1303 are deployed for assessing eye regions of subject face images, whilst stages 1304 to 1306 are deployed for assessing mouth regions of subject face images.
Eye and mouth regions provide several important cues for the detection of presentation attacks relating to facial images. Stage 506 thus involves conducting an assessment of eye and mouth regions over a sequence of image frames to test for occlusion, local motion, and masking possibilities. Stage 506 involves examining a variety of such cues, extracted from individual image frames as well as from sequences of image frames.
Features of an eye region of a face image can be considered useful indicators of presentation attacks only if the eyes are visible in the image, i.e. if the eyes were visible to the imaging device at stage 501. For a subject wearing dark glasses or using another means to cover or hide their eyes, assessments of eye region features are not useful, and should be excluded from consideration in the later classification stage 507. Additionally, partial occlusion of one or both eyes may result in lower accuracy of the presentation attack prediction at later stage 507.
Thus, at stage 1301, a check is performed for the visibility of a user’s eyes in subject facial images.
The visibility check performed at stage 1301 may involve analysing the facial landmarks detected by pre-processor module 601 at stage 704 to identify relevant regions of the image. Based on the landmarks related to the eyes, and predetermined feature dimensions, e.g. facial dimensions, a region of the image containing the eyes is identified. The feature detection at stage 704 may approximate or estimate these locations in the case of partial occlusions; therefore, an explicit check for visibility is desirable. In the case of occlusion using glasses or other materials, the eye region of the image may be expected to appear nearly homogeneous. In a visible view of the eye region, i.e. including pupils, eyebrows, eyelids, etc., the region will include visible features. Stage 1301 involves looking for such visible features in the region of interest. Firstly, the entropy of a rectangular eye region I_eye is computed by equation (9):

H_eye = −∑_k p_k log₂(p_k)     (9)

where p_k refers to the histogram of the eye region I_eye. The entropy of a visible eye region is expected to be much higher than that of an occluded eye region.
However, it is also possible for a user’s eyes to be covered by alternative means that include visibly distinct features, which may render the above entropy calculation, H_eye, not useful in detecting the visibility of a user’s eye region. Thus, stage 1301 may further involve a pattern-checking operation. In this operation, an edge map of an eye region of an image is computed using edge detection operators such as Sobel, Prewitt, or Canny. The edge map is slightly blurred by convolving with a two-dimensional Gaussian kernel of small variance. A template of the eye region is pre-computed from a small set of visible and clear bonafide presentations. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the shapes of the eyes of individuals.
During the visibility check, the blurred edge map of the eye region of the subject image is matched against the pre-computed template using a normalized cross-correlation (NCC) technique, given by equation (10):

NCC_eye = ∑_(k1, k2) Î_eye[k1, k2] · T_eye[k1, k2]     (10)

where Î_eye is the normalized image of the blurred eye region, and T_eye is the normalized template of the eye region.
The values of the entropy H_eye and the normalized cross-correlation NCC_eye, computed by equations (9) and (10) respectively, are passed to the classifier module 608 for inclusion in the later classification stage 507.
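A minimal sketch of the two visibility measures, assuming 8-bit grayscale regions and a pre-computed, pre-normalized template of the same shape; the same functions serve the mouth-region checks of stage 1304:

import cv2
import numpy as np

def region_entropy(region):
    # Equations (9)/(11): Shannon entropy of the region's intensity histogram.
    hist, _ = np.histogram(region, bins=256, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ncc_score(region, template):
    # Equations (10)/(12) sketch: blurred, normalized edge map matched
    # against the pre-normalized template.
    edges = np.abs(cv2.Sobel(region, cv2.CV_32F, 1, 0)) \
          + np.abs(cv2.Sobel(region, cv2.CV_32F, 0, 1))
    blurred = cv2.GaussianBlur(edges, (5, 5), 1.0)
    z = (blurred - blurred.mean()) / (blurred.std() + 1e-8)
    return float((z * template).mean())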
Human eyes exhibit natural motion over small intervals of time (e.g., eyelid blinking and gaze changes). The presence of such movement in input subject images may thus usefully serve as an indication that the image depicts a live human, rather than, e.g., a presentation attack in the form of a printed image. Stage 506 may thus further involve an assessment of motion between frames in an image sequence. In the example, such an assessment does not explicitly check for a blink or gaze change; rather, it calculates the degree of generic local motion. This feature is computed over a sequence of frames, and only if the visibility check at stage 1301 provides a positive output.
Thus, at stage 1302, the eye region I_eye is divided into small patches of m*n dimensions. For a given p-th frame, the mean absolute difference (MAD) between the patch I_eye[k1m, k2n] from the p-th frame and the patch at the same spatial location from the (p-1)-th frame is calculated. The MAD is a scalar value, which remains close to zero if the patches do not change over the frame sequence. For a natural eye movement, in the form of a blink, gaze change, eyelid opening, eyelid closing, etc., the MAD sequence is expected to be inconsistent. For sudden changes such as eye blinks or quick head movements, the MAD sequence consists of high frequency (impulse) signals, whereas for a slow movement such as a gaze change, the MAD sequence contains relatively lower frequency contents.
The MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, ...), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system. The differential value of the MAD is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
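A minimal sketch of the per-patch MAD between co-located patches of successive frames; the patch dimensions are illustrative, and the same computation serves the mouth region at stage 1305:

import numpy as np

def patch_mad(prev_frame, cur_frame, k1, k2, m=8, n=8):
    # MAD between the (k1, k2)-th m*n patch of two frames; near zero
    # for static content, impulsive for blinks or quick movements.
    ys, xs = slice(k1 * m, (k1 + 1) * m), slice(k2 * n, (k2 + 1) * n)
    a = prev_frame[ys, xs].astype(np.float32)
    b = cur_frame[ys, xs].astype(np.float32)
    return float(np.abs(a - b).mean())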
Additionally, for every patch of the eye region, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch/region. Therefore, if the contents of an image patch change (due to eye movements), the corresponding textures also exhibit a large change. Stage 1302 may thus further compute a difference between texture descriptors of a given spatial patch over a frame sequence, which may thereby be used to estimate local changes/movement. Note that this function does not check for explicit blink or gaze movements, but finds indications of an overall motion of any kind. The objective of this operation is not to quantify the amount of motion; rather, it is aimed at identifying any motion that may be helpful in evaluating the liveliness of the presentation.
While local motion of an eye region is an important liveliness characteristic, it may be spoofed by an attacker, by replaying a video of a subject, or by using a mask with cut-outs at the eye region (where an attacker’s eyes will be seen through the cut-outs of the mask). Therefore, stage 506 further provides a check for the possibility of cut-out features around the eye regions (for the detection of mask or print attacks). This functionality is tested only if the visibility check for the eye region at stage 1301 is positive.
It may be expected that, in the case of the presence of cut-outs in a mask, strong cut edges will be visible in the region around the eyes. Thus, at stage 1303, the eye region I_eye is convolved with an edge detection kernel such as Sobel, and the output of the convolution operation is normalized for mean and standard deviation. A histogram of this normalized edge image is computed. A reference/template histogram is computed from a set of visible bonafide presentations from the training data, by selecting the eye regions and then obtaining edge maps through convolution.
The histogram for attack presentations with cut-outs may be expected to consist of higher values as compared to the corresponding histogram of bonafide samples. The magnitude of a difference between the reference and test histograms is considered a useful indicator of the presence of cut-outs in the given region, and is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
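A minimal sketch of this cut-out check, reusing the edge-map idea above; the histogram bin count and range are assumptions, and the function serves equally for the mouth-region check of stage 1306:

import cv2
import numpy as np

def cutout_score(region, ref_hist):
    # Stages 1303/1306 sketch: distance between the edge histogram of the
    # region and a bonafide reference histogram; large values suggest the
    # strong cut edges expected around mask cut-outs.
    edges = np.abs(cv2.Sobel(region, cv2.CV_32F, 1, 0)) \
          + np.abs(cv2.Sobel(region, cv2.CV_32F, 0, 1))
    z = (edges - edges.mean()) / (edges.std() + 1e-8)
    hist, _ = np.histogram(z, bins=64, range=(-4.0, 4.0))
    hist = hist / max(hist.sum(), 1)
    return float(np.abs(hist - ref_hist).sum())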
The mouth region of a facial image also provides several important cues for detection of presentation attacks. Thus, stages 1304 to 1306 involve extracting a variety of such cues in a similar manner to the eye region assessment of stages 1301 to 1303.
Features of a mouth region can be considered useful indicators of presentation attacks only if the mouth is visible in the image, i.e. if it was visible to the imaging device at stage 501. A subject (or attacker) might occlude the mouth completely or partially, e.g. by covering it with hands or clothing. Since heavy occlusion degrades the accuracy of a presentation attack prediction, it is desirable to check the amount of occlusion or visibility of the mouth region in subject images.
Thus, at stage 1304, a visibility check for the mouth region in subject facial images is conducted. The visibility check for the mouth region is conducted by analysing the features, e.g. facial landmarks, detected by the pre-processor module 601 at stage 704, to identify relevant regions of the subject image. A region containing the mouth is identified based on the corresponding landmarks and predetermined dimensions of an average face. An explicit check for the lips/mouth is desirable in particular where the feature detection at stage 704 is focused on identifying a silhouette of a face rather than specifically a mouth feature. In the case of occlusion of the mouth region, e.g. by clothing, it may be expected that the mouth region will appear nearly homogeneous. Conversely, in a visible view, features of the mouth region, such as the lips, should be visible. Similar to the eye region assessment, this stage generates entropy and edge map-based functions for a visibility check. The entropy of a rectangular mouth region I_mouth is computed by equation (11):

H_mouth = −∑_k p_k log₂(p_k)     (11)

where p_k refers to the histogram of the mouth region I_mouth. The entropy of a visible mouth region may be expected to be higher than that of an occluded mouth region. However, if the mouth is covered, e.g. by clothing or a facial mask, then the entropy calculation may not be a useful indicator of the visibility of the mouth.
Stage 1304 thus further involves a pattern-checking operation for assessing the visibility of a mouth region. An edge map of a mouth region is computed using edge detection operators such as Sobel, Prewitt, or Canny. This edge map is slightly blurred using a 2D Gaussian kernel of small variance. A template of the mouth region is pre-computed from a small set of facial images in which the mouth region is visible. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the shapes of the mouth for different individuals.
During the visibility check, the blurred edge map of the mouth region of subject images is matched against the pre-computed template using an NCC technique, given by equation (12):

NCC_mouth = ∑_(k1, k2) Î_mouth[k1, k2] · T_mouth[k1, k2]     (12)

where Î_mouth is the normalized image of the blurred mouth region, and T_mouth is the normalized template of the mouth region. This value, along with the region entropy H_mouth, is then passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
A natural movement of the mouth region, e.g. of the lips, can be a useful indicator of the liveness of a presentation. Stage 506 may thus further involve detecting local motion between successive image frames. Since stage 1305 requires checking for motion, this feature is computed over a sequence of frames. Additionally, this feature is computed only if the visibility check for the mouth region at stage 1304 returns a positive output.
Thus, at stage 1305, the mouth region I_mouth is divided into small patches of m*n dimensions. For a given p-th frame, the mean absolute difference (MAD) between the patches at the same spatial location (I_mouth[k1m, k2n]) over consecutive frames, the p-th and (p-1)-th frames, is calculated. The MAD is a scalar value that remains close to zero if the patches do not change over a frame sequence. For a moderate movement of the lips, whether natural or while speaking, the MAD sequence may be expected to be inconsistent. For a specific speech utterance or a quick head movement, the MAD sequence may be expected to consist of high frequency (impulse) signals, whereas for a slow natural movement, the MAD sequence may be expected to contain relatively lower frequency contents.
The MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, ...), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system. The differential value of the MAD analysis is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
Additionally, for every patch of the mouth region, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch. Therefore, if the contents of an image patch change (e.g. due to lip movements), the corresponding textures are also expected to exhibit a large change. Thus, at this stage, the difference between texture descriptors of a given spatial patch over a frame sequence is computed, and is used to estimate local change/movement. The objective of this operation is not to quantify the amount of motion or utterances; rather, it is aimed at detecting any motion that may be a useful indicator of the liveliness of the presentation.
Although the detection of local motion of a mouth region in an image is a helpful liveliness characteristic, it can be spoofed by an attacker replaying a video of a subject, or by using a mask with a cut-out at the mouth region (wherein the attacker’s mouth may be visible through the cut-out). Thus, stage 1306 involves a check for the possibility of a cut-out around the mouth region of the face (in the case of mask or print attacks). This functionality is tested only if the visibility check for the mouth region at stage 1304 is positive.
The presence of a cut-out around the mouth region can be inferred from the presence of unnaturally strong edges. The mouth region I_mouth is convolved with an edge detection kernel such as Sobel, and the output of the convolution is normalized for mean and standard deviation. A histogram of this normalized edge image is computed, which is then used as a feature. A reference histogram is computed from a set of images in which a subject’s mouth region is visible, by selecting the mouth region and then obtaining its edge map through convolution.
The histogram for attack presentations with cut-outs around the mouth region may be expected to have higher values as compared to the corresponding histogram of bonafide presentations.
The total magnitude of difference between the reference and test histograms is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
Referring next to Figures 14 and 15 collectively, in examples, stage 507 for performing a classification operation for predicting presentation attacks comprises two stages.
At stage 1401, the classifier module 608 is trained to predict presentation attacks based on the outputs of texture descriptor generator 604 at stage 505, and feature assessment module 605 at stage 506. In examples, the classifier module 608 utilises a neural network, such as a multi-layer perceptron (MLP) network with one or more hidden layers, one input layer, and one output layer. The number of neurons in the input layer is equal to the sum of dimensions of input features provided by the texture descriptor generator 604 and feature assessment module 605. This number is itself governed by the size of the texture primitive database, and the dimensions of the subject images.
For training the MLP at stage 1401, feature regions of training images, e.g. eye and mouth regions, of both bonafide and attack classes, along with labels identifying the nature of the texture primitives in the database, i.e. bonafide or attack, are utilised. The training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters. The batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved for the deployment phase. The model consists of the learned weight parameters.
During the classification stage 1402, the classifier module takes as inputs the outputs of the texture descriptor generator module 604 and the feature assessment module 605, produced at stages 505 and 506 respectively, and, by operation of the learned neural network, outputs predictions as to whether the subject image obtained at stage 501 shows a bonafide presentation of a human or an attack presentation, e.g. a mask or printed image. The MLP model utilised by the classifier module has two neurons at the output. One output provides the classification of presentation attack detection, i.e., whether the input image depicts a bonafide human or a presentation attack. The second output is used to provide a signal to the user, e.g. via the computer 102, if major occlusions are observed in the input images. While the operation of the classification module is robust to minor image occlusions, its performance may degrade as the amount of occlusion increases. The visibility of regions such as the eyes and mouth can provide helpful cues in this regard.
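A minimal sketch of such a two-output MLP, again in PyTorch; the hidden-layer width is an illustrative assumption, and in_dim would equal the concatenated dimensions of the stage-505 and stage-506 features:

import torch.nn as nn

class PADClassifier(nn.Module):
    # Sketch of the stage-507 MLP: one hidden layer (width is illustrative)
    # and two sigmoid outputs: attack/bonafide score and occlusion signal.
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Sigmoid())

    def forward(self, features):
        out = self.net(features)
        return out[..., 0], out[..., 1]  # PAD prediction, occlusion signal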
In other examples, other classification procedures could be used. For example, a simpler, or more complex, neural network could be used. As previously described, in examples, stage 506 for feature assessment may be omitted from the method, and the classifier module 608 may thus receive as inputs only the texture descriptors output by the texture descriptor generator module 604 at stage 505. In such examples, a simpler neural network may be utilised. In other examples, a classification method not involving a neural network may be utilised.
Whilst in the example, training stage 1401 for training the classifier module 608 is described as being performed immediately prior to classification stage 1402, in alternative examples, training stage 1401 could be performed well in advance of classification stage 1402. Indeed, in examples, training stage 1401 could be performed by an engineer prior to deployment of the biometric verification system 103.
Aspects of the present disclosure have been described in detail herein in the context of a biometric verification system for verifying the authority of a user requesting access to a computer, as a part of a computer access control system. However, the utility of the disclosure is not limited to such an application. In particular, it should be understood that the presentation attack detection module 203, and the method for presentation attack detection using the module 203, have wider utility generally for detecting presentation attacks in images. Thus, in other examples of aspects of the disclosure, the presentation attack detection module 203 and/or the method for presentation attack detection using the module 203 could be deployed in isolation of one or more other features of the computing system 101. For example, in an alternative embodiment, presentation attack detection module 203 could be deployed as a standalone module for detecting presentation attacks in images input to the module.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality.

Claims (16)

  1. A method for detecting image presentation attacks, the method comprising:
    obtaining a subject image;
    denoising the subject image to obtain a denoised representation of the subject image;
    computing a residual image representing a difference between the subject image and the denoised representation of the subject image;
    obtaining a database of one or more texture primitives;
    generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database; and
    performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
  2. The method of any one of the preceding claims, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
  3. The method of any one of the preceding claims, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
  4. The method of claim 3, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives.
  5. The method of any one of the preceding claims, wherein the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an image.
  6. The method of any one of the preceding claims, wherein the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
  7. The method of any one of the preceding claims, comprising performing on the subject image:
    an image intensity normalization operation for changing an intensity range of the subject image to a predetermined intensity range; and/or
    an image resizing operation for changing a size of the subject image to a predetermined size.
  8. The method of any one of the preceding claims, further comprising:
    performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and
    performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the subject image.
  9. The method of any one of the preceding claims, further comprising:
    performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature;
    convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution;
    obtaining a reference histogram; and
    computing a difference between the histogram and the reference histogram.
  10. The method of any one of the preceding claims, further comprising:
    obtaining a further subject image;
    denoising the further subject image to obtain a denoised representation of the further subject image;
    computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image;
    generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and
    computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
  11. The method of any one of claims 5 to 10, wherein the denoising the subject image comprises:
    obtaining a convolutional neural network;
    receiving a training image;
    adding image noise to the training image to generate a noisy representation of the training image;
    performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image;
    quantifying a difference between the prediction and the training image; and
    modifying operations of the convolutional neural network based on the difference.
  12. The method of any one of the preceding claims, wherein the obtaining the subject image comprises receiving an image of near-infrared radiation, and wherein the obtaining a database of  one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
  13. The method of any one of the preceding claims, wherein the obtaining a subject image comprises imaging using an optical imaging device.
  14. A computer program comprising instructions, which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 13.
  15. A computer-readable data carrier having the computer program of claim 14 stored thereon.
  16. A computer for detecting image presentation attacks, wherein the computer is configured to:
    obtain a subject image;
    denoise the subject image to obtain a denoised representation of the subject image;
    compute a residual image representing a difference between the subject image and the denoised representation of the subject image;
    obtain a database of one or more texture primitives;
    generate a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database; and
    perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
PCT/CN2020/134321 2020-12-07 2020-12-07 Presentation attack detection WO2022120532A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080107787.5A CN116982093A (en) 2020-12-07 2020-12-07 Presence attack detection
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Publications (1)

Publication Number Publication Date
WO2022120532A1 true WO2022120532A1 (en) 2022-06-16

Family

ID=81972803

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Country Status (2)

Country Link
CN (1) CN116982093A (en)
WO (1) WO2022120532A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008045139A2 (en) * 2006-05-19 2008-04-17 The Research Foundation Of State University Of New York Determining whether or not a digital image has been tampered with
CN106097379A (en) * 2016-07-22 2016-11-09 宁波大学 A kind of distorted image detection using adaptive threshold and localization method
CN109086718A (en) * 2018-08-02 2018-12-25 深圳市华付信息技术有限公司 Biopsy method, device, computer equipment and storage medium
CN109948776A (en) * 2019-02-26 2019-06-28 华南农业大学 A kind of confrontation network model picture tag generation method based on LBP
CN111126190A (en) * 2019-12-10 2020-05-08 武汉大学 Disguised face recognition method based on free energy theory and dynamic texture analysis

Also Published As

Publication number Publication date
CN116982093A (en) 2023-10-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964479

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080107787.5

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964479

Country of ref document: EP

Kind code of ref document: A1