WO2022120532A1 - Presentation attack detection - Google Patents

Presentation attack detection

Info

Publication number
WO2022120532A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
texture
subject image
subject
stage
Application number
PCT/CN2020/134321
Other languages
French (fr)
Inventor
Wei Huang
Ketan Kotwal
Wenkang XU
Xiaolin Huang
Sebastien Marcel
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN202080107787.5A priority Critical patent/CN116982093A/en
Priority to PCT/CN2020/134321 priority patent/WO2022120532A1/en
Publication of WO2022120532A1 publication Critical patent/WO2022120532A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/172 Classification, e.g. identification

Definitions

  • the present disclosure relates to detecting image presentation attacks.
  • Biometric authentication is used in computer science for user verification for access control. Forms of biometric authentication rely on imaging a user’s biometric traits, for example, imaging the user’s face, hand, finger or iris to detect physical features, whereby the detected physical features may be compared to known physical features of an authorised user to authenticate the access attempt. Presentation attacks may be perpetrated on such a biometric authentication system by an unauthorised user seeking to impermissibly gain access to a computer system, whereby the unauthorised user attempts to impersonate the biometric presentation of an authorised user. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. It is desirable to be able to reliably detect presentation attacks in order to reduce the risk of unauthorised access to a computer system.
  • An objective of the present disclosure is to provide a method for detecting image presentation attacks.
  • aspects of the present disclosure relate to a texture-based approach to presentation attack detection, whereby presentation attacks may be identified by detecting micro-textural differences between a bonafide presentation, e.g. a bonafide human face, and artifacts, e.g. a face-mask, printed photo, or video displayed on an electronic display.
  • a texture-based approach may allow effective discrimination between bonafide and attack presentations based on artifact characteristics, such as the presence of pigments (printing defects) and specular reflection and shade (display attacks) .
  • aspects of the disclosure relate to classifying presentation attacks based on the characteristic micro-textural differences between bonafide presentations and presentation attack artifacts.
  • a first aspect of the present disclosure provides a method for detecting image presentation attacks, the method comprising, obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
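  • As an illustration of the above pipeline, a minimal Python sketch is given below; `denoiser`, `encode_texture`, `primitive_db` and `classifier` are hypothetical stand-ins for the components described in the implementations that follow, not names used by the disclosure.

```python
import numpy as np

def detect_presentation_attack(subject_image, denoiser, encode_texture,
                               primitive_db, classifier):
    """Outline of the first aspect: denoise, compute residual, encode
    texture, classify. All callables are hypothetical stand-ins."""
    # Denoised (smoothened) representation of the subject image.
    denoised = denoiser(subject_image)
    # Residual image: pixelwise difference carrying micro-textural detail.
    residual = subject_image.astype(np.float32) - denoised.astype(np.float32)
    # Texture descriptor: regions of the residual expressed as a function
    # of texture primitives in the database.
    descriptor = encode_texture(residual, primitive_db)
    # Classification operation predicting bonafide vs. attack.
    return classifier(descriptor)
```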
  • the method is for use in detecting image presentation attacks on a user verification system.
  • the user identification system may, for example, be deployed as a part of a user verification system of a computer access control system for controlling access to a computer system.
  • the user verification system may, for example, comprise an image capture device for capturing an image of a user presenting to the user verification system, for example, an image of the presenting user’s face or fingerprint, to thereby predict whether the presenting user is an authorised user or an unauthorised user.
  • the method initially comprises obtaining a subject image, i.e. an image presented for analysis.
  • the subject image could be obtained using an image capture device, and the method could involve capturing the image using an image capture device.
  • the subject image could be captured by an external system, and obtaining the subject image could involve obtaining the image file, optionally following initial processing of the image file.
  • the subject image is initially denoised.
  • the subject image could be denoised using a neural network trained for denoising images of a bonafide presentation, e.g. of a real human.
  • the residual image is computed as a difference between the subject image and the denoised version of the subject image, for example, as a pixelwise difference between the subject image and the denoised representation.
  • the denoised image is a smoothened representation of the subject image; the residual image thus primarily represents micro-textural features of the subject image.
  • a texture descriptor is then generated using a database of texture primitives, the texture descriptor(s) representing one or more regions/patches of the residual image.
  • the texture descriptor may comprise a code vector where texture details of the region/patch of the residual image are encoded as a combination of texture primitives.
  • the texture descriptor represents the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database.
  • the database of texture primitives is specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel.
  • An advantage of this approach is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification of the subject image as representing either a bonafide or attack presentation.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
  • the texture descriptor may represent plural different regions of the residual image as a function of one or more of the texture primitives. Consequently, the texture details of the subject image may be more fully represented by the texture descriptor. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the texture descriptor may represent every region of the residual image as a function of the texture primitives, such that the texture details of the subject image are most fully represented.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
  • the texture descriptor may represent each of the one or more different regions of the residual image as a function of a plurality of the texture primitives. Consequently, fuller texture information defining the texture details of the subject image may be represented. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the texture descriptor may represent each of the one or more regions of the residual image as a function of each of the texture primitives in the database, such that the texture details of the subject image are most fully represented.
  • the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives.
  • the texture descriptor may represent the texture information of one or more of the regions as a linear combination of a plurality of the texture primitives. This may advantageously represent a computationally efficient form for comprehensively representing texture information.
  • the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an input image.
  • a convolutional neural network may advantageously allow accurate and reliable denoising of images, with relatively low computational complexity.
  • the convolutional neural network could be a denoising auto-encoder.
  • the convolutional neural network could be trained using only bonafide images of a user, e.g. of a user’s face, rather than of attack presentations. Consequently, the denoiser may be expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
  • a convolutional neural network may advantageously allow accurate and reliable prediction of presentation attacks, with relatively low computational complexity.
  • the convolutional neural network could be a multi-layer perceptron.
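  • As a hypothetical illustration of such a classifier, a small multi-layer perceptron over a texture descriptor might be defined as below (PyTorch); the layer widths and two-logit output are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class AttackClassifier(nn.Module):
    """Illustrative MLP mapping a texture descriptor to bonafide/attack
    logits; layer sizes are assumptions."""
    def __init__(self, descriptor_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 2),  # logits: [bonafide, attack]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```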
  • the method further comprises performing on the subject image, an image intensity normalization operation for changing an intensity range of the received image to a predetermined intensity range, and/or an image resizing operation for changing a size of the received image to a predetermined size.
  • the subject image may be subjected to intensity normalising/resizing operations prior to the denoising stage. Consequently, the subject image, and so the texture details of the subject image, may be adapted to match as closely as possible a desired intensity/size of the texture details, e.g. to match the intensity/size on which the denoiser and/or the classifier operations have been trained, and/or on which the texture primitive database was learnt. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the input image.
  • the method involves detecting, and checking the visibility of, certain features of an image.
  • the operations could be aimed at detecting the position of facial landmarks, such as eye and/or mouth regions in facial images, and subsequently checking the visibility of those features in the subject image(s).
  • the existence of particular features of the image may be determined, which may be indicative of a liveness of the imaged subject.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the visibility detection operation could involve computing an entropy of the regions and comparing the computed entropy to an expected entropy of a region depicting the predetermined feature, e.g. a mouth or eye.
  • the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature, convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution, obtaining a reference histogram, and computing a difference between the histogram and the reference histogram.
  • this implementation could be used for detecting edges, such as edges bounding cut-out regions, of a mask, to thereby detect attack presentations.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the reference histogram may be generated for a particular feature, e.g. eyes/mouth, of a bonafide image, and computing the difference between the histograms may involve computing a difference/similarity between the edge features.
  • the magnitude of the difference may be used as a reliable proxy for the presence of edges of a mask, e.g. cut-outs.
  • the method further comprises receiving a further subject image; denoising the further subject image to obtain a denoised representation of the further subject image; computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image; generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
  • This method may advantageously allow for detection of local motion of a subject of the subject image, occurring between an acquisition time of the subject image and the acquisition time of the further subject image.
  • the method may be used for detecting movement of eye or mouth regions of an imaged user. Such a comparison may detect movement of the imaged subject, which may thereby be used to infer a liveness of the subject.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the obtaining a convolutional neural network for predicting a denoised representation of a subject image comprises, obtaining a convolutional neural network, receiving a training image, adding image noise to the training image to generate a noisy representation of the training image, performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image, quantifying a difference between the prediction for a denoised representation of the noisy representation of the training image and the training image; and modifying operations of the convolutional neural network based on the difference.
  • Adding image noise may involve adding white Gaussian noise.
  • Modifying operations of the convolutional neural network based on the difference may involve updating parameters, such as weights, of the CNN.
  • the receiving a subject image comprises receiving an image of near-infrared radiation
  • the obtaining a database of one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
  • the subject images may be acquired in the NIR channel.
  • NIR images are advantageously relatively insensitive to corruption by varying levels of background visible light during imaging, good for imaging in low visible light levels, e.g. at night, and good for imaging certain key texture details which discriminate between bonafide and attack presentations.
  • the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
  • the receiving an input image comprises imaging using an optical imaging device.
  • the imaging device could be a near-infrared imaging device, sensitive to near-infrared radiation.
  • a second aspect of the present disclosure provides a computer program comprising instructions which, when executed by a computer system, cause the computer system to carry out a method of any implementation of the first aspect of the present disclosure.
  • a third aspect of the present disclosure comprises a computer-readable data carrier having the computer program of the second aspect of the present disclosure stored thereon.
  • a fourth aspect of the present disclosure provides a computer system for detecting image presentation attacks, wherein the computer system is configured to, obtain a subject image, denoise the subject image to obtain a denoised representation of the subject image, compute a residual image representing a difference between the subject image and the denoised representation of the subject image, obtain a database of one or more texture primitives, generate a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
  • the computer system of the fourth aspect of the present disclosure may be further configured to perform a method of any implementation of the first aspect of the present disclosure.
  • Figure 1 shows schematically an example of a computing system embodying an aspect of the disclosure, comprising a computer and a biometric verification system;
  • FIG. 2 shows schematically an example implementation of the biometric verification system identified previously with reference to Figure 1, comprising a presentation attack detection module;
  • FIG. 3 shows schematically an example implementation of the presentation attack detection module, identified previously with reference to Figure 2;
  • Figure 4 shows schematically an example method performed by the biometric verification system 103, comprising a method for detecting presentation attacks;
  • Figure 5 shows schematically processes involved in the method for detecting presentation attacks, identified previously with reference to Figure 4;
  • Figure 6 shows schematically an example of computational functionality provided by the presentation attack detection module identified previously with reference to Figure 3;
  • Figure 7 shows schematically processes involved in obtaining subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 8 shows schematically processes involved in denoising subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 9 shows schematically stages of a denoising auto-encoder involved in the process for denoising subject images identified previously with reference to Figure 8;
  • Figure 10 shows schematically processes involved in computing residual images corresponding to subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 11 shows schematically processes involved in obtaining a database of texture primitives, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 12 shows schematically processes involved in generating texture descriptors, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 13 shows schematically processes involved in assessing features of subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 14 shows schematically processes involved in predicting presentation attacks in subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
  • Figure 15 shows schematically a multi-layer perceptron classifier model, utilised in the process for predicting presentation attacks in subject images identified previously with reference to Figure 14.
  • a computing system 101 embodying an aspect of the present disclosure comprises a computer 102 and a biometric verification system 103 in communication with the computer 102 via connection 104.
  • Computer 102 is operable to perform a computing function.
  • computer 102 may run computer programs for performing functions.
  • computer 102 is a personal computer, a smartphone, or a computer installed onboard a vehicle for controlling functions of the vehicle.
  • computer 102 is a payment device for accepting a payment from a user, such as a point of sale device, or for dispensing a payment to a user, for example, an automated teller machine.
  • it may be desirable to control access to the computer, or to functionality of the computer, so as to prevent unauthorised use of the computer.
  • the computer 102 is a personal computer, it may be desirable to restrict access to functionality of the computer to one or more authorised users of the personal computer, to thereby prevent unauthorised users from using the computer.
  • Biometric verification system 103 is functional to verify whether a user requesting access to the computer 102 is an authorised user. An output of biometric verification system 103 may thus be used by computer 102 to determine whether to grant access to the computer, e.g. access to an operating system or programs running on the computer, to the user. In examples, biometric verification system 103 is functional to image a user requesting access to computer 102, and verify that the imaged user is an authorised user of the computer, by comparing biometric information extracted from the image with predefined biometric characteristics of authorised users.
  • a difficulty encountered by such a system is that an unauthorised user may perpetrate a presentation attack on the biometric verification system 103, whereby the unauthorised user may attempt to impersonate the biometric presentation of an authorised user, to thereby impermissibly gain access to the computer.
  • an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation.
  • biometric verification system 103 is functional to detect certain forms of presentation attacks in images of the user, to thereby reduce the risk of unauthorised access to the computer 102.
  • biometric verification system 103 may be connected to computer 102 via a network 104.
  • Network 104 may be implemented, for example, by a wide area network (WAN) such as the Internet, a local area network (LAN) , a metropolitan area network (MAN) , and/or a personal area network (PAN) , etc.
  • the network may be implemented using wired technology such as Ethernet, Data Over Cable Service Interface Specification (DOCSIS), synchronous optical networking (SONET), and/or synchronous digital hierarchy (SDH), etc.
  • the network 104 may include at least one device for communicating data in the network.
  • the network 104 may include computing devices, routers, switches, gateways, access points, and/or modems.
  • biometric verification system 103 may be connected to computer 102 via a simpler data transfer connection 104, e.g. via a connection in accordance with the Universal Serial Bus standard.
  • biometric verification system 103 is depicted as being structurally distinct from, and co-located with, computer 102.
  • biometric verification system 103 could be located remotely of the computer 102, or could instead be incorporated into computer 102, such that biometric verification system 103 utilises computing resource of the computer 102.
  • computer 102 could comprise a handheld computing device, e.g. a smart phone, and biometric verification system 103 may be integrated in the handheld computing device.
  • biometric verification system 103 comprises user verification module 201, user identification module 202, presentation attack detection module 203, and image acquisition module 204.
  • Image acquisition module 204 comprises imaging device 205.
  • the components 201 to 205 of the biometric verification system 103 are in communication via system bus 206, which is in turn connected to computer 102 via connection 104.
  • User verification module 201 is for determining whether a user requesting access to the computer 102 is an authorised user of the system, based on inputs from user identification module 202 and from presentation attack detection module 203, and for communicating such determination to the computer 102 via connection 104.
  • User verification module 201 may, for example, comprise a computer processor for performing the user verification task.
  • User identification module 202 is for identifying an attempted user of the computer from an image of the user acquired by image acquisition module 204.
  • the user identification module 202 is configured to extract and analyse biometric information from images acquired by image acquisition module 204, access predefined biometric information stored in storage, e.g. in storage internal to user identification module, in which biometric information of authorised users is stored, and determine whether the biometric information extracted from the acquired images matches biometric information of an authorised user.
  • User identification module 202 may then communicate to user verification module 201 a determination as to whether the user requesting access appears to be an authorised user.
  • user identification module 202 is configured for facial recognition, and is configured to analyse facial features of an imaged user, to determine whether the facial features match predefined facial features of an authorised user.
  • user identification module 202 may be configured for alternative forms of biometric identification, for example, fingerprint or iris recognition.
  • User identification module 202 may, for example, comprise computer storage for storing predefined biometric information of authorised users, and a computer processor for performing the user identification task.
  • Presentation attack detection module 203 is for detecting presentation attack attempts in images acquired by image acquisition module 204. In particular, in examples, presentation attack detection module 203 is for predicting whether an image acquired by image acquisition module 204 depicts a presentation attack. Presentation attack detection module 203 will be described in further detail with particular reference to Figures 3 and 6.
  • Image acquisition module 204 is for acquiring one or more images of a user requesting access to the computer 102, for communication to user identification module 202 and presentation attack detection module 203.
  • image acquisition module 204 may be for imaging a user’s face, and imaging device 205 may thus be fixed in a suitable position such that a face region of a user requesting access to computer 102 may be imaged.
  • imaging device 205 is an optical camera.
  • imaging device 205 is for imaging a user’s face.
  • the imaging device 205 is configured for imaging near-infrared (NIR) radiation, e.g. for imaging a user’s face in the NIR channel.
  • imaging device 205 may comprise an optical camera having an NIR filter fitted to the lens.
  • the filter may thus selectively pass NIR spectra radiation to an image sensor.
  • the image acquisition device could be configured for imaging other regions of a user’s body, for example, as a fingerprint or iris imager.
  • Image acquisition module 204 may comprise a computer processor for performing the imaging task, and may optionally further comprise storage for storing acquired images.
  • biometric verification system is described as comprising four distinct modules 201 to 204, each having independent computing resource, processor and/or memory resource.
  • the functionality of one or more of the modules may be combined and implemented by shared computing resource.
  • the functionality of all of the modules 201 to 204 could be implemented by a common processor.
  • presentation attack detection module 203 comprises processor 301, storage 302, memory 303, input/output interface 304, and system bus 305.
  • the presentation attack detection module 203 is configured to run a computer program for detecting presentation attacks in images acquired by image acquisition module 204.
  • Processor 301 is configured for execution of instructions of a computer program.
  • Storage 302 is configured for non-volatile storage of computer programs for execution by the processor 301.
  • the computer program for predicting presentation attacks in images acquired by image acquisition module 204 is stored in storage 302.
  • Memory 303 is configured as read/write memory for storage of operational data associated with computer programs executed by the processor 301.
  • Input/output interface 304 is provided for communicating presentation attack detection module 203 with system bus 206.
  • the components 301 to 304 of the presentation attack detection module 203 are in communication via system bus 305.
  • biometric verification system 103 is configured to perform a user verification procedure for verifying that a user requesting access to computer 102 is an authorised user of the computer.
  • the biometric verification system 103 may perform the verification procedure in response to receiving a prompt from computer 102 via connection 104.
  • At a stage 401, the image acquisition module 204 images a user requesting access to the computer 102, using imaging device 205.
  • stage 401 involves imaging the user’s face, optionally in the NIR channel, wherein the actual range of wavelengths may be pre-configured.
  • the image acquisition module 204 acquires a plurality of frames, where the duration of frame acquisition and time interval between successive frames (frame rate) may be defined at stage 401.
  • the image acquisition module 204 may be capable of imaging at plural different resolutions, and stage 401 may involve defining an imaging resolution.
  • the imaging device 205 may have one or more illumination devices for a specified NIR range.
  • Stage 401 may thus further involve adjusting the illumination devices such that the region nominally containing a subject’s head/face is properly and uniformly illuminated. Although minor variations in the ambient light are acceptable, the capturing session should have reasonable illumination conditions. During the capture, it is preferred that the subject’s face occupies a major area of the image being captured.
  • the image acquisition module 204 may then communicate the acquired image(s) to user identification module 202 and presentation attack detection module 203.
  • At a stage 402, the user identification module 202 analyses the image data acquired at stage 401, extracts biometric information relating to the imaged user(s), e.g. facial feature information, and determines whether the user is an authorised user of the computer by comparing the extracted biometric information to predefined biometric information of authorised users, i.e. predefined biometric information stored in computer storage accessible by the user identification module 202.
  • the user identification module 202 may output the determination to user verification module 201.
  • At a stage 403, the presentation attack detection module 203 analyses the image data acquired at stage 401, and generates a prediction of whether the acquired image(s) depict a presentation attack, i.e. whether the image is of an unauthorised user attempting to perpetrate a presentation attack. For example, this stage could involve the presentation attack detection module 203 predicting whether the acquired image(s) show a face mask or printed photo.
  • the presentation attack detection module 203 may output the determination to user verification module 201.
  • At a stage 404, the user verification module 201 may evaluate the determinations from the user identification module 202 and the presentation attack detection module 203, received at stages 402 and 403 respectively, determine whether the imaged user is an authorised user, and communicate that determination to computer 102. For example, where the determination of the user identification module 202 at stage 402 is that the imaged user appears to be an authorised user, and the prediction of the presentation attack detection module 203 at stage 403 is that the image does not depict a presentation attack, the user verification module 201 may determine that the user requesting access is an authorised user.
  • Conversely, where either determination is negative, the user verification module 201 may determine at stage 404 that the user requesting access is not an authorised user, and may notify the computer 102 accordingly.
  • the method of stage 403, for detecting presentation attacks, comprises seven stages.
  • the method of stage 403 is implemented by the processor 301 of presentation attack detection module 203, in response to instructions of the computer program stored in storage 302 of presentation attack detection module 203.
  • At a stage 501, the computer program stored in storage 302 causes the processor 301 to obtain one or more subject images for analysis, i.e. to obtain images of an attempted user for analysis.
  • At a stage 502, the computer program stored in storage 302 causes the processor 301 to denoise the one or more subject images obtained at stage 501, i.e. to remove image noise from the images, to obtain denoised representations of the subject images.
  • At a stage 503, the computer program stored in storage 302 causes the processor 301 to compute one or more residual images, each residual image representing a difference between a subject image obtained at stage 501 and the respective denoised image computed at stage 502.
  • At a stage 504, the computer program stored in storage 302 causes the processor 301 to obtain a database of texture primitives, each texture primitive encoding information representing a texture feature.
  • At a stage 505, the computer program stored in storage 302 causes the processor 301 to generate one or more texture descriptors, e.g. code vectors, each texture descriptor representing one or more regions of a residual image computed at stage 503 as a function of texture primitives in the database obtained at stage 504.
  • At a stage 506, the computer program stored in storage 302 causes the processor 301 to perform on the subject images obtained at stage 501 one or more feature assessment operations.
  • At a stage 507, the computer program stored in storage 302 causes the processor 301 to perform classification operations, based on outputs of stages 505 and 506, for predicting whether the subject images obtained at stage 501 depict a presentation attack.
  • stage 403 may comprise fewer or further operational stages, depending on the instructions contained in the computer program.
  • the operations of stage 506 may be omitted from the method.
  • the presentation attack detection module 203 is configured to support the functionality of a plurality of functional modules.
  • each of the functional modules utilise the processor 301, storage 302, and memory 303 of the presentation attack detection module 203.
  • Pre-processor module 601 is provided for supporting the method of stage 501, for retrieving images, e.g. facial images, from image acquisition module 204, performing image processing operations on the acquired images, and for outputting subject images for analysis by later modules.
  • Denoiser module 602 is provided for supporting the method of stage 502, for denoising images output by pre-processor module 601 to remove image noise from the images, to obtain denoised representations of the images.
  • Residual image computing module 603 is provided for supporting the method of stage 503, for computing residual images representing a difference between a subject image output by pre-processor module 601 and the respective denoised image output by denoiser 602.
  • Texture descriptor generator module 604 is provided for supporting the methods of stages 504 and 505, for generating a database of texture primitives, and generating texture descriptors representing regions of residual images output by residual image computing module 603 as a function of one or more of the texture primitives.
  • Feature assessment module 605 is provided for supporting the method of stage 506, to perform feature assessment operations on the subject images output by pre-processor module 601.
  • feature assessment module 605 may comprise eye-region and/or mouth-region assessment sub-modules 606, 607 respectively.
  • Classifier module 608 is provided for supporting the method of stage 507, for predicting, based on the outputs of texture descriptor generator module 604 and feature assessment module 605, at stages 505 and 506 respectively, whether the images obtained at stage 501 depict a presentation attack.
  • the method of stage 501 for obtaining subject images comprises six stages.
  • At stage 701, the image acquisition module 204 acquires one or more images of a user attempting to access the computer 102.
  • stage 701 may involve the presentation attack detection module 203 retrieving pre-acquired images from image acquisition module 204, or may instead involve presentation attack detection module 203 instructing image acquisition module 204 to acquire images of a presenting user by imaging device 205, e.g. images of the user’s face, optionally in the NIR channel.
  • the pre-processor module 601 performs certain image processing operations on the image(s) acquired at stage 701. In examples, the pre-processor module 601 processes each image (or frame) independently and identically.
  • At stage 702, the pre-processor module 601 performs an image normalization operation, whereby the image(s) acquired at stage 701 are normalized to a specific, predefined intensity range.
  • stage 702 involves calculation of minimum ($I_{min}$) and maximum ($I_{max}$) intensity threshold values from the image statistics.
  • the normalization operation on the image data of each frame may be as shown in equation 1: $I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}}$
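  • A minimal sketch of such a normalization, assuming equation 1 is standard min-max scaling of the per-frame statistics to the predefined range:

```python
import numpy as np

def normalize_intensity(frame: np.ndarray, out_range=(0.0, 1.0)) -> np.ndarray:
    """Scale frame intensities to a predefined range using the per-frame
    minimum and maximum (assumed min-max reconstruction of equation 1)."""
    i_min, i_max = float(frame.min()), float(frame.max())
    lo, hi = out_range
    scaled = (frame.astype(np.float32) - i_min) / max(i_max - i_min, 1e-8)
    return lo + scaled * (hi - lo)
```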
  • At stage 703, the pre-processor module 601 performs an image resizing operation, whereby the normalized image(s) output at stage 702 are converted to a predefined fixed dimension through appropriate resizing.
  • At stage 704, the pre-processor module 601 performs a feature detection operation on the resized image(s) output at stage 703, for example, to detect a user’s face in the image and/or facial landmarks of a user’s face, such as the user’s mouth and/or eye regions.
  • This operation may utilise state-of-the-art libraries such as dlib, or techniques such as a deformable parts model, convolutional neural networks (CNN) , cascaded pose regression, and/or multi-task CNNs.
  • where feature detection fails, e.g. where no face is detected, the input image may be rejected, and a signal may be generated, to be communicated to the computer 102, to inform the user of the same.
  • if this behaviour is observed across several frames, the user may be provided with a suitable signal, e.g. by computer 102.
  • At stage 705, the pre-processor module 601 performs an alignment operation on the image(s), whereby the images are aligned for the subsequent processing.
  • This stage may involve selecting one or more image features, e.g. facial features (such as left and right eyes, or left/right/middle points of the mouth), and computing a two-dimensional transformation of the image such that the coordinates of these specific features are consistent across a succession of images.
  • At stage 706, the pre-processor module 601 performs a cropping operation on the image(s).
  • the images may be cropped using available features, such as facial landmarks, to depict only a desired image feature, e.g. a face region.
  • the images may also be resized again at stage 706 to predetermined dimensions.
  • the output of the pre-processor module 601 is thus a cropped and aligned subject image, e.g. a facial image, and corresponding image features, e.g. facial landmarks such as eye/mouth regions.
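  • A sketch of the alignment and cropping of stages 705 and 706, assuming OpenCV, (N, 2) landmark arrays, and a fixed landmark template; the similarity-transform approach and output size are assumptions of this sketch.

```python
import cv2
import numpy as np

def align_and_crop(image, landmarks, template, out_size=(128, 128)):
    """Estimate a 2D similarity transform mapping detected landmarks
    (e.g. eye corners, mouth points) onto fixed template coordinates,
    then warp so features are consistent across successive images."""
    matrix, _ = cv2.estimateAffinePartial2D(
        landmarks.astype(np.float32), template.astype(np.float32))
    # Warp to the predetermined output dimensions (implicit crop).
    return cv2.warpAffine(image, matrix, out_size)
```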
  • the method of stage 502 for denoising images comprises two stages.
  • At stage 801, the denoiser module 602 is trained to denoise images of the type to be analysed, e.g. images in the NIR channel.
  • a denoising auto-encoder (DAE) CNN is utilised for denoising images.
  • the DAE comprises one or more units of convolutional, pooling, and normalization layers; and one or more units of fully connected layers.
  • bonafide training images are obtained, e.g. non-presentation attack images, such as images of real human faces, and the images are pre-processed by pre-processor module 601 by the method of stages 702 to 706, as described previously.
  • the training images are intentionally corrupted by adding a suitable noise, such as additive white Gaussian noise (AWGN), of varying levels.
  • these levels are determined by discrete values of the variance (or standard deviation) of the Gaussian function used to generate the noise probability density function (pdf). Therefore, for N pre-processed images and m levels of noise, a total training set of mN images may be obtained through augmentation.
  • the noisy or corrupted image may be obtained through a stochastic mapping, given by equation 2: $\tilde{I} = I + n$, where $n \sim \mathcal{N}(0, \sigma^2)$ is sampled independently per pixel.
  • the architecture of a DAE consists of an encoder and a decoder.
  • the encoder maps the noisy input to the hidden representation h, as the function given by equation 3: $h = f_{\theta_E}(\tilde{I})$; the decoder maps h back to a reconstruction, $\hat{I} = g_{\theta_D}(h)$.
  • the parameters $\theta_E$ and $\theta_D$ of the DAE are learnt through minimization of a loss function over an average reconstruction error, $E\left(\lVert I - g_{\theta_D}(f_{\theta_E}(\tilde{I})) \rVert^2\right)$.
  • the training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters.
  • the batch size of training images can be decided by the amount of computing resources available for the training.
  • the model may be saved, e.g. in storage 302, for the deployment phase.
  • the model consists of the learnt $\theta_E$ and $\theta_D$ parameters.
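  • An illustrative PyTorch sketch of such a DAE and its training on bonafide images corrupted with AWGN at several levels follows; the layer configuration, noise levels, and optimizer settings are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Illustrative encoder/decoder; the disclosure requires only units of
    convolutional, pooling, normalization and fully connected layers."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, bonafide_batches, noise_sigmas=(0.05, 0.1, 0.2), epochs=10):
    """Corrupt each clean bonafide batch with AWGN at m levels (equation 2)
    and minimize the average reconstruction error against the clean image."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for clean in bonafide_batches:        # tensors (B, 1, H, W) in [0, 1]
            for sigma in noise_sigmas:        # m noise levels -> m*N samples
                noisy = clean + sigma * torch.randn_like(clean)
                opt.zero_grad()
                loss_fn(model(noisy), clean).backward()
                opt.step()
```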
  • At stage 802, the DAE trained at stage 801 is deployed to denoise pre-processed subject images output by pre-processor module 601 at stage 501.
  • the pre-processed subject image (s) are passed through the DAE without any noise corruption. Because the DAE model was trained at stage 801 using only bonafide images, i.e. images of real humans instead of presentation attacks, in the particular channel of interest, e.g. NIR, it has learnt the finer textural details of real humans. Therefore it is expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations.
  • the output of the DAE at stage 802 is thus a smoothened/filtered version of the input image.
  • training stage 801 for training the DAE, is described as being performed immediately prior to denoising stage 802, in alternative examples, training stage 801 could be performed well in advance of denoising stage 802. Indeed, in examples, training stage 801 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the method of stage 503 for computing residual images comprises pixelwise subtraction of the filtered image, output by the DAE denoiser 602 at stage 502, from the pre-processed image output by the pre-processor module 601 at stage 501, as input to the DAE denoiser.
  • This procedure yields an image that primarily consists of textural information in the input image.
  • for display attacks, the residual image is expected to contain patterns of scanline noise.
  • for print attacks, the residual image is expected to represent mainly the fine granular texture of the paper material.
  • For an input image, e.g. a face image, $I_F$, the DAE generates a somewhat smoothened output, $I_{F\text{-}DN}$.
  • the residual image, $I_{residue}$, is obtained from the pixelwise difference between the two, as given by equation 5: $I_{residue} = I_F - I_{F\text{-}DN}$
  • the resultant residual image thus primarily encodes the information related to texture patterns of the input images.
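  • As a one-function sketch of equation 5 (pixelwise subtraction):

```python
import numpy as np

def residual_image(i_f: np.ndarray, i_f_dn: np.ndarray) -> np.ndarray:
    """Equation 5: pixelwise difference between the input image and its
    DAE-denoised version, retaining mainly micro-textural content."""
    return i_f.astype(np.float32) - i_f_dn.astype(np.float32)
```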
  • an objective of later stage 505 is to represent the contents of the residual image(s) in a more discriminative manner.
  • stage 504 for obtaining a database of texture primitives comprises five stages.
  • stage 504 involves generating the database, by the training procedure described below.
  • stage 504 could involve obtaining a pre-generated database of texture primitives, i.e. a database generated at a prior time step, optionally by a third party, e.g. using the following procedure.
  • the database of texture primitives is a “dictionary” of textural primitives or codewords that are specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel.
  • An advantage of this approach of generating texture primitives that are specific to the intended imaging application, e.g. texture primitives of NIR imagery where the intended application will image in the NIR channel, is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification at later stage 507.
  • An objective of later stage 505 is to represent a local patch of the residual image $I_{residue}$ as a linear combination of texture primitives, i.e., codewords of the dictionary.
  • entries in the database should encode texture primitives, such that at stage 505, an input image, through its residual image, may be represented as a vector of texture primitives.
  • At a stage 1101, training images including both bonafide images, e.g. images of real human faces, and attack presentations, e.g. images of subjects wearing masks, are obtained.
  • At a stage 1102, a residual image is obtained, for example by the method of stages 501 to 503, for each input training image (both classes, bonafide and attack presentations).
  • At a stage 1103, the residual images obtained at stage 1102 may be tessellated to obtain small, non-overlapping regions, otherwise known as patches, of n*n dimensions.
  • At a stage 1104, for each patch $I_{residue}[i, j]$, $0 \le i < n$, $0 \le j < n$, texture primitives, otherwise known as codes, for inclusion in the database may be computed by minimization of equation 6: $\min_{C, \alpha} \lVert I_{residue} - C\alpha \rVert_2^2$
  • $\alpha$ and $C$ represent the coefficients and texture primitives, respectively. Their values are computed through the alternate minimization technique, where an acceptable error norm, $e_{min}$, can be predetermined.
  • This training procedure thus generates a set of texture primitives or codewords, representing different textural features of the bonafide and attack training images.
  • At a stage 1105, the texture primitives generated at stage 1104 are compiled to form a database of texture primitives.
  • Columns of the database may thus represent individual texture primitives, i.e. individual codewords.
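  • A sketch of stages 1103 to 1105 is given below, using scikit-learn's MiniBatchDictionaryLearning as a stand-in for the alternate-minimization procedure of equation 6; the patch size n and codebook size P are assumptions of this sketch.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def tessellate(residual: np.ndarray, n: int) -> np.ndarray:
    """Split a residual image into non-overlapping n*n patches, flattened
    into rows of a (num_patches, n*n) matrix."""
    h, w = residual.shape
    return np.asarray([
        residual[r:r + n, c:c + n].ravel()
        for r in range(0, h - n + 1, n)
        for c in range(0, w - n + 1, n)
    ], dtype=np.float32)

def learn_primitive_database(residuals, n=8, num_codewords=128):
    """Learn P texture primitives (codewords) from bonafide and attack
    residual patches; columns of the returned database are primitives."""
    data = np.vstack([tessellate(r, n) for r in residuals])
    learner = MiniBatchDictionaryLearning(n_components=num_codewords)
    learner.fit(data)
    return learner.components_.T  # shape (n*n, P)
```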
  • the method of stage 505 comprises computing coefficients mapping regions of the residual images generated at stage 503 to texture primitives in the database generated at stage 504.
  • an objective of stage 505 is to represent the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database.
  • the accuracy and reliability of the presentation attack detection prediction, at later stage 507, may thus be improved.
  • the texture primitives in the database are generated specifically for the imaging application, e.g. in the NIR channel, and may thus accurately represent the textures of the subject image.
  • the texture primitive database learnt at stage 504 is used to obtain a texture descriptor, also known as a feature descriptor, for the residual image(s) computed at stage 503.
  • the incoming residual image $I_{residue}$ is divided into smaller, non-overlapping patches of n*n.
  • an optimal vector of coefficients may be computed using equation 7: $\alpha^{*} = \arg\min_{\alpha} \lVert x - C\alpha \rVert_2^2$, where $x$ denotes a vectorized patch and $C$ is the matrix of texture primitives.
  • each patch of each residual image may be represented as a function of one or more of the (micro) texture primitives contained in the texture primitive database and their respective coefficients, e.g. as a linear combination of plural texture primitives, as given by the function in Figure 12.
  • learning stage 504 for generating the database of texture primitives is described as being performed immediately prior to deployment stage 505, in alternative examples, learning stage 504 could be performed well in advance of deployment stage 505. Indeed, in examples, training stage 504 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the order of tessellated patches/regions of the residual images should be predefined, and should be consistent, to obtain the descriptor specific to the spatial coordinates. If the residual image is tessellated into i × j patches of uniform sizes, and the database of texture primitives consists of P codewords, then the feature descriptor $F_{texture}$ has dimensionality given by equation 8: $\dim(F_{texture}) = i \cdot j \cdot P$
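  • A sketch of the descriptor computation, solving equation 7 per patch as an unconstrained least-squares problem for illustration (the disclosure's alternate minimization with error norm $e_{min}$ would be substituted in practice):

```python
import numpy as np

def texture_descriptor(residual: np.ndarray, database: np.ndarray,
                       n: int = 8) -> np.ndarray:
    """Represent each non-overlapping n*n patch as a linear combination of
    the P codewords (columns of `database`), concatenating the coefficient
    vectors in a fixed patch order: dimensionality i * j * P (equation 8)."""
    h, w = residual.shape
    coeffs = []
    for r in range(0, h - n + 1, n):          # i patch rows
        for c in range(0, w - n + 1, n):      # j patch columns
            patch = residual[r:r + n, c:c + n].ravel()
            alpha, *_ = np.linalg.lstsq(database, patch, rcond=None)
            coeffs.append(alpha)              # P coefficients per patch
    return np.concatenate(coeffs)
```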
  • the texture/feature descriptor(s) generated at stage 505 may then be passed to classifier module 608, for inclusion in the classification operation at later stage 507, as will be described further with reference to Figure 15.
  • stage 506 for performing feature assessments comprises two operations, which may suitably be performed in parallel, each operation comprising three stages. Stages 1301 to 1303 are deployed for assessing eye regions of subject face images, whilst stages 1304 to 1306 are deployed for assessing mouth regions of subject face images.
  • Eye and mouth regions provide several important cues for detection of presentation attacks relating to facial images.
  • Stage 506 thus involves conducting an assessment of eye and mouth regions over a sequence of images frames to test for occlusion, local motion, and masking possibilities.
  • Stage 506 involves examining a variety of such cues from individual image frames as well as sequences of image frames. The features are extracted from a single frame as well as over a sequence of frames.
  • Stages 1301 to 1303 are deployed for assessing eye regions of subject face images.
  • an eye region of a face image can be considered a useful indicator of presentation attacks only if the eyes are visible in the image, i.e. if the eyes were visible to the imaging device at stage 501.
  • where the eyes are not visible, assessments of eye region features are not useful, and should be excluded from consideration in the later classification stage 507. Additionally, partial occlusion of one or both eyes may result in lower accuracy of the presentation attack prediction at later stage 507.
  • At stage 1301, a check is performed for the visibility of a user’s eyes in subject facial images.
  • the visibility check performed at stage 1301 may involve analysing the facial landmarks detected by pre-processor module 601 at stage 704 to identify relevant regions of the image. Based on the landmarks related to the eyes, and predetermined feature dimensions, e.g. facial dimensions, a region of the image containing the eyes is identified. The feature detection at stage 704 may approximate or estimate these locations in case of partial occlusions; therefore, an explicit check for visibility is desirable. In case of occlusion by glasses or other materials, the eye region of the image may be expected to appear nearly homogeneous. In a visible view of the eye regions, i.e. including pupils, eyebrows, eyelids etc., the region will include visible features. Stage 1301 involves looking for such visible features in the region of interest. Firstly, the entropy of a rectangular eye region $I_{eye}$ is computed by equation 9: $H_{eye} = -\sum_{k} p_k \log_2 p_k$
  • $p_k$ refers to the (normalized) histogram of the eye region $I_{eye}$.
  • the entropy of a visible eye region is expected to be much higher than that of an occluded eye region.
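  • A minimal sketch of the entropy computation of equation 9, assuming an 8-bit grayscale region:

```python
import numpy as np

def region_entropy(region: np.ndarray, bins: int = 256) -> float:
    """Equation 9: Shannon entropy of the region's intensity histogram;
    an occluded (near-homogeneous) region yields a low value."""
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    p = hist.astype(np.float64) / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```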
  • stage 1301 may further involve a pattern checking operation.
  • an edge map of an eye region of an image is computed using edge detection operators such as Sobel, Prewitt, or Canny.
  • the edge map is slightly blurred by convolving with a two-dimensional Gaussian kernel of small variance.
  • a template of the eye region is pre-computed from a small set of visible and clear bonafide presentations. The process of blurring slightly dilates the edge map, and hence, compensates for minor differences in the shapes of eyes of individuals.
  • the normalized cross-correlation (NCC) between the blurred edge map and the pre-computed template is then computed, where $T_{eye}$ is the normalized template of the eye region.
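  • A sketch of this pattern check, assuming OpenCV, a horizontal Sobel gradient as the edge map, and an equal-sized pre-computed bonafide template:

```python
import cv2
import numpy as np

def eye_pattern_score(eye_region: np.ndarray, template: np.ndarray) -> float:
    """Edge map, slight Gaussian blur (dilating edges to absorb minor shape
    differences), then NCC against the pre-computed eye template."""
    edges = np.abs(cv2.Sobel(eye_region, cv2.CV_32F, 1, 0, ksize=3))
    blurred = cv2.GaussianBlur(edges, (5, 5), 1.0)
    # With equal-sized inputs, matchTemplate returns a single NCC value.
    score = cv2.matchTemplate(blurred, template.astype(np.float32),
                              cv2.TM_CCORR_NORMED)
    return float(score[0, 0])
```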
  • Stage 506 may thus further involve, at stage 1302, assessment of motion between frames in an image sequence. In the example, such an assessment does not explicitly check for a blink or gaze; rather, it calculates the degree of generic local motion. This feature is computed over a sequence of frames, and only if the visibility check at stage 1301 provides a positive output.
  • the eye region, $I_{eye}$, is divided into small patches of m*n dimensions.
  • the mean absolute difference (MAD) between the patch $I_{eye}[k_1 m, k_2 n]$ from the p-th frame and the patch at the same spatial location from the (p-1)-th frame is calculated.
  • the MAD is a scalar value, which would remain close to zero if the patches do not change over frame sequence.
  • where local motion occurs, the MAD sequence is expected to be inconsistent.
  • for a fast movement such as a blink, the MAD sequence consists of high frequency (impulse) signals; whereas for a slow movement such as gaze, the MAD sequence contains relatively lower frequency contents.
  • the MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, etc.), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system.
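  • A sketch of the MAD computation over a frame sequence; the region tuple and patch dimensions are assumptions of this sketch.

```python
import numpy as np

def mad_sequence(frames, region, m=8, n=8):
    """Mean absolute difference (MAD) between co-located m*n patches of the
    eye region in successive frames; near-zero where nothing changes.
    `frames` is a list of grayscale images, `region` = (top, left, h, w)."""
    top, left, h, w = region
    mads = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        a = prev[top:top + h, left:left + w].astype(np.float32)
        b = curr[top:top + h, left:left + w].astype(np.float32)
        mads.append([
            np.abs(a[r:r + m, c:c + n] - b[r:r + m, c:c + n]).mean()
            for r in range(0, h - m + 1, m)
            for c in range(0, w - n + 1, n)
        ])
    return np.asarray(mads)  # one row of patch MADs per frame transition
```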
  • the differential value of MAD is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • At stage 1302, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch/region. Therefore, if the contents of an image patch change (due to eye movements), the corresponding textures also exhibit a large change. Stage 1302 may thus further compute a difference between texture descriptors of a given spatial patch over a frame sequence, which may thereby be used as an estimate of local changes/movement. Note that this function does not check for explicit blink or gaze movements, but finds indications for an overall motion of any kind. The objective of this operation is not to quantify the amount of motion; rather, it is aimed at identifying any motion that may be helpful in evaluating the liveness of the presentation.
  • stage 506 further provides, at stage 1303, a check for the possibility of cut-out features around the eye regions (in the case of detection of mask or print attacks). This functionality is tested only if the visibility check for the eye region at stage 1301 is positive.
  • the eye region I_eye is convolved with an edge detection kernel such as Sobel, and the output of the convolution operation is normalized for mean and standard deviation.
  • a histogram of this normalized edge image is computed.
  • a reference/template histogram is computed from a set of visible bonafide presentations from training data, by selecting the eye regions, and then obtaining edge maps through convolution.
  • the histogram for attack presentations with cut-outs may be expected to contain higher values than the corresponding histogram of bonafide samples.
  • the magnitude of a difference between the reference and test histograms is considered a useful indicator of the presence of cut-outs in the given region, and is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
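  • a minimal sketch of this cut-out check is given below; the bin count, histogram range, and the offline-prepared reference histogram ref_hist are assumptions:

    import cv2
    import numpy as np

    def cutout_score(i_eye: np.ndarray, ref_hist: np.ndarray, bins: int = 64) -> float:
        gx = cv2.Sobel(i_eye, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(i_eye, cv2.CV_32F, 0, 1)
        edges = cv2.magnitude(gx, gy)
        # Normalize the edge image for mean and standard deviation.
        edges = (edges - edges.mean()) / (edges.std() + 1e-8)
        hist, _ = np.histogram(edges, bins=bins, range=(-4.0, 4.0), density=True)
        # Total magnitude of difference from the bonafide reference histogram;
        # unnaturally strong edges from cut-outs inflate this value.
        return float(np.abs(hist - ref_hist).sum())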
  • stages 1304 to 1306 involve extracting a variety of such cues in a similar manner to the eye region assessment of stages 1301 to 1303.
  • a mouth region can be considered a useful indicator of presentation attacks only if the mouth is visible in the image, i.e. if it is visible to the imaging device at stage 501.
  • a subject or attacker might occlude the mouth completely or partially by covering, e.g. with hands or clothing. Since high occlusions degrade the accuracy of a presentation attack prediction, it is desirable to check the amount of occlusion or visibility of the mouth region in subject images.
  • stage 1304 thus involves a check for the visibility of the mouth region in subject facial images.
  • the visibility check for the mouth region is conducted by analysing the features, e.g. facial landmarks, detected by the pre-processor module 601 at stage 704, to identify relevant regions of the subject image. A region containing the mouth is identified based on the corresponding landmarks, and predetermined dimensions of an average face.
  • An explicit check for lips/mouth is desirable in particular where the feature detection at stage 704 is focused on identifying a silhouette of a face rather than specifically a mouth feature. In case of occlusion of the mouth region, e.g. by clothing, it may be expected that the mouth region will appear nearly homogeneous.
  • the entropy H_mouth of the mouth region I_mouth is computed analogously to equation 9, where p_k refers to the histogram of the mouth region I_mouth.
  • the entropy of a visible mouth region may be expected to be higher than that of an occluded mouth region. However, if the mouth is covered, e.g. by clothing or a facial mask, then the entropy calculation may not be a useful indicator of the visibility of the mouth.
  • Stage 1304 thus further involves a pattern-checking operation for assessing the visibility of a mouth region.
  • An edge map of a mouth region is computed using edge detection operators such as Sobel, Prewitt, or Canny. This edge map is slightly blurred using a 2D Gaussian kernel of small variance.
  • a template of the mouth region is pre-computed from a small set of facial images in which the mouth region is visible. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the mouth shapes of different individuals.
  • the blurred edge map is compared, using NCC, with T_mouth, the normalized template of the mouth region. This value, along with the region entropy H_mouth, is then passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • a natural movement of the mouth region can be a useful indicator of the liveness of a presentation.
  • Stage 506 may thus further involve detecting local motion between successive image frames. Since stage 1305 requires checking for motion, this feature is computed over a sequence of frames. Additionally, this feature is computed only if the visibility check for the mouth region at stage 1304 returns a positive output.
  • the mouth region I_mouth is divided into small patches of m×n pixels.
  • the mean absolute difference (MAD) between the patches at the same spatial location (I_mouth[k_1m, k_2n]) in consecutive frames, i.e. the p-th and (p−1)-th frames, is calculated.
  • the MAD is a scalar value that remains close to zero if the patches do not change over a frame sequence.
  • in the case of mouth movement, the MAD sequence may be expected to be inconsistent.
  • for a sudden movement, the MAD sequence may be expected to consist of high-frequency (impulse) signals, whereas for a slow natural movement it may be expected to contain relatively lower-frequency content.
  • the MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be, e.g., 2, 3 or 5) rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system.
  • the differential value of the MAD analysis is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • a texture descriptor is also generated at stage 1305 by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch; therefore, if the contents of an image patch change (e.g. due to lip movements), the corresponding textures are also expected to exhibit a large change. Thus, at this stage, the difference between texture descriptors of a given spatial patch over a frame sequence is computed and used to estimate a local change/movement. The objective of this operation is not to quantify the amount of motion or utterances; it is aimed simply at detecting any motion that may be a useful indicator of the liveness of the presentation.
  • stage 1306 involves a check for the possibility of a cut-out around the mouth region of the face (in the case of mask or print attacks). This functionality is tested only if the visibility check for the mouth region at stage 1304 is positive.
  • the presence of a cut-out around the mouth region can be inferred from the presence of unnaturally strong edges.
  • the mouth region I_mouth is convolved with an edge detection kernel such as Sobel, and the output of the convolution is normalized for mean and standard deviation.
  • a histogram of this normalized edge image is computed, which is then used as a feature.
  • a reference histogram is computed from a set of images in which a subject’s mouth region is visible, by selecting the mouth region, and then obtaining its edge map through convolution.
  • the histogram for attack presentations with cut-outs around the mouth region may be expected to have higher values as compared to the corresponding histogram of bonafide presentations.
  • the total magnitude of difference between the reference and test histograms is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
  • stage 507 for performing a classification operation for predicting presentation attacks comprises two stages.
  • the classifier module 608 is trained to predict presentation attacks based on the outputs of texture descriptor generator 604 at stage 505, and feature assessment module 605 at stage 506.
  • the classifier module 608 utilises a neural network, such as a multi-layer perceptron (MLP) network with one or more hidden layers, one input layer, and one output layer.
  • the number of neurons in the input layer is equal to the sum of dimensions of input features provided by the texture descriptor generator 604 and feature assessment module 605. This number is itself governed by the size of the texture primitive database, and the dimensions of the subject images.
  • for training the MLP at stage 1401, feature regions of training images, e.g. eye and mouth regions, of both bonafide and attack classes, along with labels identifying the nature of the texture primitives in the database, i.e. bonafide or attack, are utilised.
  • the training can be conducted using a suitable optimization method (such as Stochastic gradient descent (SGD) or Adam) , and learning parameters.
  • the batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved for the deployment phase.
  • the model consists of learned weight parameters.
  • at classification stage 1402, the classifier module takes as inputs the outputs of texture descriptor generator module 604 and feature assessment module 605 (output at stages 505 and 506 respectively) and, by operation of the learned neural network, outputs a prediction as to whether the subject image obtained at stage 501 shows a bonafide presentation of a human or an attack presentation, e.g. a mask or printed image.
  • the MLP model utilised by the classifier module consists of two neurons at the output. One output provides the classification of presentation attack detection, i.e., whether the input image depicts a bonafide human or a presentation attack.
  • the second output is used to provide a signal to the user, e.g. via the computer 102, if major occlusions are observed in the input images. While the operation of the classification module is robust to minor image occlusions, its performance may degrade as the amount of occlusion increases. The visibility of regions such as eyes and mouth can provide helpful cues in this regard.
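  • a minimal sketch of such a two-output MLP is given below, using PyTorch as one possible framework (the disclosure does not prescribe one); the hidden-layer widths are assumptions, and in_dim equals the concatenated length of the stage-505 and stage-506 features:

    import torch
    import torch.nn as nn

    class PADClassifier(nn.Module):
        def __init__(self, in_dim: int, hidden: int = 128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            # Two output neurons: a bonafide-vs-attack score, and an
            # occlusion signal for flagging heavily occluded inputs.
            self.head = nn.Linear(hidden, 2)

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.head(self.backbone(features))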
  • stage 506 for feature assessment may be omitted from the method, and the classifier module 608 may thus receive as inputs only the texture descriptors output by texture descriptor generator module 604 at stage 505.
  • a simpler neural network may be utilised.
  • a classification method not involving a neural network may be utilised.
  • whilst training stage 1401 for training the classifier module 608 is described as being performed immediately prior to classification stage 1402, in alternative examples, training stage 1401 could be performed well in advance of classification stage 1402. Indeed, in examples, training stage 1401 could be performed by an engineer prior to deployment of the biometric verification system 103.
  • the presentation attack detection module 203 and the method for presentation attack detection using the module 203, have wider utility generally for detecting presentation attacks in images.
  • the presentation attack detection module 203 and/or the method for presentation attack detection using the module 203 could be deployed independently of one or more other features of the computing system 101.
  • presentation attack detection module 203 could be deployed as a standalone module for detecting presentation attacks in images input to the module.

Abstract

A method for detecting image presentation attacks is disclosed. The method comprises obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.

Description

PRESENTATION ATTACK DETECTION
Field of the Disclosure
The present disclosure relates to detecting image presentation attacks.
Background of the Disclosure
Biometric authentication is used in computer science for user verification for access control. Forms of biometric authentication rely on imaging a user’s biometric traits, for example, imaging the user’s face, hand, finger or iris to detect physical features, whereby the detected physical features may be compared to known physical features of an authorised user to authenticate the access attempt. Presentation attacks may be perpetrated on such a biometric authentication system by an unauthorised user seeking to impermissibly gain access to a computer system, whereby the unauthorised user attempts to impersonate the biometric presentation of an authorised user. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. It is desirable to be able to reliably detect presentation attacks in order to reduce the risk of unauthorised access to a computer system.
Summary of the Disclosure
An objective of the present disclosure is to provide a method for detecting image presentation attacks.
The foregoing and other objectives are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the Figures.
Aspects of the present disclosure relate to a texture-based approach to presentation attack detection, whereby presentation attacks may be identified by detecting micro-textural differences between a bonafide presentation, e.g. a bonafide human face, and artifacts, e.g. a face-mask, printed photo, or video displayed on an electronic display. Such a texture-based approach may allow effective discrimination between bonafide and attack presentations based on artifact characteristics, such as the presence of pigments (printing defects) and specular reflection and shade (display attacks) . In other words, aspects of the disclosure relate to classifying presentation attacks based on the characteristic micro-textural differences between bonafide presentations and presentation attack artifacts.
A first aspect of the present disclosure provides a method for detecting image presentation attacks, the method comprising, obtaining a subject image, denoising the subject image to obtain a denoised representation of the subject image, computing a residual image representing a difference between the subject image and the denoised representation of the subject image, obtaining a database of one or more texture primitives, generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database, and performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
In examples, the method is for use in detecting image presentation attacks on a user verification system. The user verification system may, for example, be deployed as a part of a computer access control system for controlling access to a computer system. The user verification system may, for example, comprise an image capture device for capturing an image of a user presenting to the user verification system, for example, an image of the presenting user’s face or fingerprint, to thereby predict whether the presenting user is an authorised user or an unauthorised user.
A subject image, i.e. an image presented for analysis, is thus obtained. For example, the subject image could be obtained using an image capture device, and the method could involve capturing the image using an image capture device. In other examples, the subject image could be captured by an external system, and obtaining the subject image could involve obtaining the image file, optionally following initial processing of the image file.
The subject image is initially denoised. For example, the subject image could be denoised using a neural network trained for denoising images of a bonafide presentation, e.g. of a real human. The residual image is computed as a difference between the subject image and the denoised version of the subject image, for example, as a pixelwise difference between the subject image and the denoised representation. The residual image is thus a smoothened representation of the subject image, primarily representing micro-textural features of the subject image.
However, the texture details contained in residual images are superimposed with other corrupting noisy or high-frequency contents of the input presentation, which may impair the process of predicting whether the image depicts a presentation attack. Thus, a texture descriptor is then generated using a database of texture primitives, the texture descriptor(s) representing one or more regions/patches of the residual image. For example, the texture descriptor may comprise a code vector in which texture details of the region/patch of the residual image are encoded as a combination of texture primitives. The texture descriptor represents the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database. The accuracy and reliability of the classification operation for predicting image presentation attacks may thus advantageously be improved.
In examples, the database of texture primitives is specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel. An advantage of this approach is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification of the subject image as representing either a bonafide or attack presentation.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
In other words, in examples the texture descriptor may represent plural different regions of the residual image as a function of one or more of the texture primitives. Consequently, the texture details of the subject image may be more fully represented by the texture descriptor. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the texture descriptor may represent every region of the residual image as a function of the texture primitives, such that the texture details of the subject image are most fully represented.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
In other words, in examples the texture descriptor may represent each of the one or more different regions of the residual image as a function of a plurality of the texture primitives. Consequently, fuller texture information defining the texture details of the subject image may be represented. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the texture descriptor may represent each of the one or more regions of the residual image as a function of each of the texture primitives in the database, such that the texture details of the subject image are most fully represented.
In an implementation, the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives. In other words, the texture descriptor may represent the texture information of one or more of the regions as a linear combination of a plurality of the texture primitives, as sketched below. This may advantageously represent a computationally efficient form for comprehensively representing texture information.
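As a sketch of this implementation, given a dictionary matrix holding one texture primitive per column, the coefficient vector c with patch ≈ Dc may serve as the texture descriptor. A plain least-squares fit is shown for illustration; a sparse-coding solver is an equally valid choice, and all names are assumptions:

    import numpy as np

    def encode_patch(patch: np.ndarray, primitives: np.ndarray) -> np.ndarray:
        """primitives: (pixels, n_primitives); patch is flattened to (pixels,)."""
        coeffs, *_ = np.linalg.lstsq(primitives, patch.ravel(), rcond=None)
        return coeffs  # descriptor: one coefficient per texture primitive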
In an implementation, the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an input image.
A convolutional neural network may advantageously allow accurate and reliable denoising of images, with relatively low computational complexity. For example, the convolutional neural network could be a denoising auto-encoder. The convolutional neural network could be trained using only bonafide images of a user, e.g. of a user’s face, rather than of attack presentations. Consequently, the denoiser may be expected to be a more efficient denoiser for bonafide presentations as compared to attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
A convolutional neural network may advantageously allow accurate and reliable prediction of presentation attacks, with relatively low computational complexity. For example, the convolutional neural network could be a multi-layer perceptron.
In an implementation, the method further comprises performing on the subject image, an image intensity normalization operation for changing an intensity range of the received image to a  predetermined intensity range, and/or an image resizing operation for changing a size of the received image to a predetermined size.
In other words, the subject image may be subjected to intensity normalising/resizing operations prior to the denoising stage. Consequently, the subject image, and so the texture details of the subject image, may be adapted to match as closely as possible a desired intensity/size of the texture details, e.g. to match the intensity/size on which the denoiser and/or the classifier operations have been trained, and/or on which the texture primitive database was learnt. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the input image.
In other words, the method may involve detecting, and checking the visibility of, certain features of an image. For example, the operations could be aimed at detecting the position of facial landmarks, such as eye and/or mouth regions in facial images, and subsequently checking the visibility of those features in the subject image(s). Thus, the existence of particular features of the image may be determined, which may be indicative of a liveness of the imaged subject. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved. In examples, the visibility detection operation could involve computing an entropy of the regions and comparing the computed entropy to an expected entropy of a region depicting the predetermined feature, e.g. a mouth or eye.
In an implementation, the method further comprises, performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature, convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution, obtaining a reference histogram, and computing a difference between the histogram and the reference histogram.
For example, this implementation could be used for detecting edges, such as edges bounding cut-out regions, of a mask, to thereby detect attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
The reference histogram may be generated for a particular feature, e.g. eyes/mouth, of a bonafide image, and computing the difference between the histograms may involve computing a difference/similarity between the edge features. The magnitude of the difference may be used as a reliable proxy for the presence of edges of a mask, e.g. cut-outs.
In an implementation, the method further comprises receiving a further subject image; denoising the further subject image to obtain a denoised representation of the further subject image; computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image; generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
This method may advantageously allow for detection of local motion of a subject of the subject image, occurring between an acquisition time of the subject image and the acquisition time of the further subject image. For example, the method may be used for detecting movement of eye or mouth regions of an imaged user. Such a comparison may detect movement of the imaged subject, which may thereby be used to infer a liveness of the subject. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the obtaining a convolutional neural network for predicting a denoised representation of a subject image comprises, obtaining a convolutional neural network, receiving a training image, adding image noise to the training image to generate a noisy representation of the training image, performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image, quantifying a difference between the prediction for a denoised representation of the noisy representation of the training  image and the training image; and modifying operations of the convolutional neural network based on the difference.
Adding image noise may involve adding white Gaussian noise. Modifying operations of the convolutional neural network based on the difference may involve updating parameters, such as weights, of the CNN.
In an implementation, the receiving a subject image comprises receiving an image of near-infrared radiation, and wherein the obtaining a database of one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
In other words, the subject images may be acquired in the NIR channel. NIR images are advantageously relatively unsusceptible to corruption by varying levels of background visible light during imaging, good for imaging in low visible light levels, e.g. at night, and good for imaging certain key texture details which discriminate between bonafide and attack presentations. As a result, the accuracy and reliability of the classification operation for predicting image presentation attacks may advantageously be improved.
In an implementation, the receiving an input image comprises imaging using an optical imaging device. For example, the imaging device could be a near-infrared imaging device, sensitive to near-infrared radiation.
A second aspect of the present disclosure provides a computer program comprising instructions which, when executed by a computer system, cause the computer system to carry out a method of any implementation of the first aspect of the present disclosure.
A third aspect of the present disclosure comprises a computer-readable data carrier having the computer program of the second aspect of the present disclosure stored thereon.
A fourth aspect of the present disclosure provides a computer system for detecting image presentation attacks, wherein the computer system is configured to, obtain a subject image, denoise the subject image to obtain a denoised representation of the subject image, compute a residual image representing a difference between the subject image and the denoised representation of the subject image, obtain a database of one or more texture primitives, generate a texture descriptor representing image texture details of one or more regions of the  residual image as a function of texture primitives in the database, and perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
In implementations, the computer system of the fourth aspect of the present disclosure may be further configured to perform a method of any implementation of the first aspect of the present disclosure.
These and other aspects of the disclosure will be apparent from the embodiment (s) described below.
Brief Description of the Drawings
In order that the present invention may be more readily understood, embodiments of the disclosure will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows schematically an example of a computing system embodying an aspect of the disclosure, comprising a computer and a biometric verification system;
Figure 2 shows schematically an example implementation of the biometric verification system identified previously with reference to Figure 1, comprising a presentation attack detection module;
Figure 3 shows schematically an example implementation of the presentation attack detection module, identified previously with reference to Figure 2;
Figure 4 shows schematically an example method performed by the biometric verification system 103, comprising a method for detecting presentation attacks;
Figure 5 shows schematically processes involved in the method for detecting presentation attacks, identified previously with reference to Figure 4;
Figure 6 shows schematically an example of computational functionality provided by the presentation attack detection module identified previously with reference to Figure 3;
Figure 7 shows schematically processes involved in obtaining subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 8 shows schematically processes involved in denoising subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 9 shows schematically stages of a denoising auto-encoder involved in the process for denoising subject images identified previously with reference to Figure 8;
Figure 10 shows schematically processes involved in computing residual images corresponding to subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 11 shows schematically processes involved in obtaining a database of texture primitives, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 12 shows schematically processes involved in generating texture descriptors, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 13 shows schematically processes involved in assessing features of subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5;
Figure 14 shows schematically processes involved in predicting presentation attacks in subject images, for the method for detecting presentation attacks identified previously with reference to Figure 5; and
Figure 15 shows schematically a multi-layer perceptron classifier model, utilised in the process for predicting presentation attacks in subject images identified previously with reference to Figure 14.
Detailed Description of the Disclosure
Referring firstly to Figure 1, a computing system 101 embodying an aspect of the present disclosure comprises a computer 102 and a biometric verification system 103 in communication with the computer 102 via connection 104.
Computer 102 is operable to perform a computing function. For example, computer 102 may run computer programs for performing functions. In examples, computer 102 is a personal computer, a smartphone, or a computer installed onboard a vehicle for controlling functions of the vehicle. In other examples, computer 102 is a payment device for accepting a payment from a user, such as a point of sale device, or for dispensing a payment to a user, for example, an automated teller machine. In such applications, it may be desirable to control access to the computer, or to functionality of the computer, so as to prevent unauthorised use of the computer. For example, where the computer 102 is a personal computer, it may be desirable to restrict access to functionality of the computer to one or more authorised users of the personal computer, to thereby prevent unauthorised users from using the computer.
Biometric verification system 103 is functional to verify whether a user requesting access to the computer 102 is an authorised user. An output of biometric verification system 103 may thus be used by computer 102 to determine whether to grant access to the computer, e.g. access to an operating system or programs running on the computer, to the user. In examples, biometric verification system 103 is functional to image a user requesting access to computer 102, and verify that the imaged user is an authorised user of the computer, by comparing biometric information extracted from the image with predefined biometric characteristics of authorised users. A difficulty encountered by such a system is that an unauthorised user may perpetrate a presentation attack on the biometric verification system 103, whereby the unauthorised user may attempt to impersonate the biometric presentation of an authorised user, to thereby impermissibly gain access to the computer. For example, such an unauthorised user may wear a face mask, or a hand/finger prosthetic, or show a printed image or an electronic display, depicting the authorised user’s presentation. As will be described in further detail with reference to the later Figures, in examples therefore biometric verification system 103 is functional to detect certain forms of presentation attacks in images of the user, to thereby reduce the risk of unauthorised access to the computer 102.
In examples, biometric verification system 103 may be connected to computer 102 via a network 104. Network 104 may be implemented, for example, by a wide area network (WAN) such as the Internet, a local area network (LAN), a metropolitan area network (MAN), and/or a personal area network (PAN), etc. The network may be implemented using wired technology such as Ethernet, Data Over Cable Service Interface Specification (DOCSIS), synchronous optical networking (SONET), and/or synchronous digital hierarchy (SDH), etc., and/or wireless technology, e.g. Institute of Electrical and Electronics Engineers (IEEE) 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), Bluetooth, ZigBee, near-field communication (NFC), and/or Long-Term Evolution (LTE), etc. The network 104 may include at least one device for communicating data in the network. For example, the network 104 may include computing devices, routers, switches, gateways, access points, and/or modems. In other examples, biometric verification system 103 may be connected to computer 102 via a simpler data transfer connection 104, e.g. via a connection in accordance with the Universal Serial Bus standard.
In the illustrated example, biometric verification system 103 is depicted as being structurally distinct from, and co-located with computer 102. In other examples, biometric verification system 103 could be located remotely of the computer 102, or could instead be incorporated into computer 102, such that biometric verification system 103 utilises computing resource of the computer 102. For example, computer 102 could comprise a handheld computing device, e.g. a smart phone, and biometric verification system 103 may be integrated in the handheld computing device.
Referring in particular to Figure 2, in examples, biometric verification system 103 comprises user verification module 201, user identification module 202, presentation attack detection module 203, and image acquisition module 204. Image acquisition module 204 comprises imaging device 205. The components 201 to 205 of the biometric verification system 103 are in communication via system bus 206, which is in turn connected to computer 102 via connection 104.
User verification module 201 is for determining whether a user requesting access to the computer 102 is an authorised user of the system, based on inputs from user identification module 202 and from presentation attack detection module 203, and for communicating such determination to the computer 102 via connection 104. User verification system 201 may, for example, comprise a computer processor for performing the user verification task.
User identification module 202 is for identifying an attempted user of the computer from an image of the user acquired by image acquisition module 204. In examples, the user identification module 202 is configured to extract and analyse biometric information from images acquired by image acquisition module 204, access predefined biometric information stored in storage, e.g. in storage internal to user identification module, in which biometric information of authorised users is stored, and determine whether the biometric information extracted from the acquired images matches biometric information of an authorised user. User  identification system 202 may then communicate a determination as to whether the user requesting access appears to be an authorised user to user verification system. In an example to be described in detail herein, user identification module 202 is configured for facial recognition, and is configured to analyse facial features of an imaged user, to determine whether the facial features match predefined facial features of an authorised user. In other examples, user identification module 202 may be configured for alternative forms of biometric identification, for example, fingerprint or iris recognition. Various suitable methods for analysis of biometric images for identification of a user, such as facial, fingerprint or iris recognition, are known to persons skilled in the art. User identification module 202 may, for example, comprise computer storage for storing predefined biometric information of authorised users, and a computer processor for performing the user identification task.
Presentation attack detection module 203 is for detecting presentation attack attempts in images acquired by image acquisition module 204. In particular, in examples, presentation attack detection module 203 is for predicting whether an image acquired by image acquisition module 204 depicts a presentation attack. Presentation attack detection module 203 will be described in further detail with particular reference to Figures 3 and 6.
Image acquisition module 204 is for acquiring one or more images of a user requesting access to the computer 102, for communication to user identification system 202 and presentation attack detection system 203. For example, image acquisition module 204 may be for imaging a user’s face, and imaging device 205 may thus be fixed in a suitable position such that a face region of a user requesting access to computer 102 may be imaged. In examples, imaging device 205 is an optical camera. In an example to be described in detail herein, imaging device 205 is for imaging a user’s face. In examples, the imaging device 205 is configured for imaging near-infrared (NIR) radiation, e.g. for imaging a user’s face in the NIR channel. In such examples, imaging device 205 may comprise an optical camera having an NIR filter fitted to the lens. The filter may thus selectively pass NIR spectra radiation to an image sensor. In other examples, image acquisition device could be configured for imaging other regions of a user’s body, for example, as a fingerprint or iris imager. Image acquisition module 204 may comprise a computer processor for performing the imaging task, and may optionally further comprise storage for storing acquired images.
In the example, biometric verification system is described as comprising four distinct modules 201 to 204, each having independent computing resource, processor and/or memory resource. In  other examples however, the functionality of one or more of the modules may be combined and implemented by shared computing resource. For example, the functionality of all of the modules 201 to 204 could be implemented by a common processor.
Referring to Figure 3, in an example, presentation attack detection module 203 comprises processor 301, storage 302, memory 303, input/output interface 304, and system bus 305. The presentation attack detection module 203 is configured to run a computer program for detecting presentation attacks in images acquired by image acquisition module 204.
Processor 301 is configured for execution of instructions of a computer program. Storage 302 is configured for non-volatile storage of computer programs for execution by the processor 301. In the embodiment, the computer program for predicting presentation attacks in images acquired by image acquisition module 204 is stored in storage 302. Memory 303 is configured as read/write memory for storage of operational data associated with computer programs executed by the processor 301. Input/output interface 304 is provided for communicating presentation attack detection module 203 with system bus 206. The components 301 to 304 of the presentation attack detection module 203 are in communication via system bus 305.
Referring to Figure 4, in an example, biometric verification system 103 is configured to perform a user verification procedure for verifying that a user requesting access to computer 102 is an authorised user of the computer. For example, the biometric verification system 103 may perform the verification procedure in response to receiving a prompt from computer 102 via connection 104.
At stage 401, the image acquisition module 204 images a user requesting access to the computer 102, using imaging device 205. In examples, stage 401 involves imaging the user’s face, optionally in the NIR channel, wherein the actual range of wavelengths may be pre-configured. In examples, the image acquisition module 204 acquires a plurality of frames, where the duration of frame acquisition and time interval between successive frames (frame rate) may be defined at stage 401. In examples, the image acquisition module 204 may be capable of imaging at plural different resolutions, and stage 401 may involve defining an imaging resolution. In examples, the imaging device 205 may have one or more illumination devices for a specified NIR range. Stage 401 may thus further involve adjusting the illumination devices such that the region nominally containing a subject’s head/face is properly and uniformly illuminated. Although minor variations in the ambient light are acceptable, the capturing session should have  reasonable illumination conditions. During the capture, it is preferred that the subject’s face occupies a major area of the image being captured. The image acquisition module 204 may then communicate the acquired image (s) to user identification module 202 and presentation attack detection module 203.
At stage 402, the user identification module 202 analyses the image data acquired at stage 401, extracts biometric information relating to the imaged user (s) , e.g. facial feature information, and determines whether the user is an authorised user of the computer by comparing the extracted biometric information to predefined biometric information of authorised users, i.e. predefined biometric information stored in computer storage accessible by the user identification module 202. The user identification module 202 may output the determination to user verification module 201.
At stage 403, the presentation attack detection module 203 analyses the image data acquired at stage 401, and generates a prediction of whether the acquired image (s) depict a presentation attack, i.e. whether the image is of an unauthorised user attempting to perpetrate a presentation attack. For example, this stage could involve the presentation attack detection module 203 predicting whether the acquired image (s) show a face mask or printed photo. The operation of presentation attack detection module 203 will be described in further detail with reference to later Figures 5 to 15. The presentation attack detection module 203 may output the determination to user verification module 201.
At stage 404, the user verification module 201 may evaluate the determinations from the user identification module 202 and the presentation attack detection module 203, received at stages 402 and 403 respectively, determine whether the imaged user is an authorised user, and communicate that determination to computer 102. For example, where the determination of the user identification module 202 at stage 402 is that the imaged user appears to be an authorised user, and the prediction of the presentation attack detection module 203 at stage 403 is that the image does not depict a presentation attack, the user verification module 201 may determine that the user requesting access is an authorised user. In contrast, if the determination of the user identification module 202 at stage 402 is that the imaged user does not appear to be an authorised user, or if the prediction of the presentation attack detection module 203 at stage 403 is that the image does depict a presentation attack, the user verification module 201 may determine at stage 404 that the user requesting access is not an authorised user, and may notify the computer 102 accordingly.
Referring in particular to Figure 5, in an example, the method of stage 403, for detecting presentation attacks, comprises seven stages. In examples, the method of stage 403 is implemented by the processor 301 of presentation attack detection module 203, in response to instructions of the computer program stored in storage 302 of presentation attack detection module 203.
At stage 501, the computer program stored in storage 302 causes the processor 301 to obtain one or more subject images for analysis, i.e. to obtain images of an attempted user for analysis.
At stage 502, the computer program stored in storage 302 causes the processor 301 to denoise the one or more subject images obtained at stage 501, i.e. to remove image noise from the images, to obtain denoised representations of the subject images.
At stage 503, the computer program stored in storage 302 causes the processor 301 to compute one or more residual images, each residual image representing a difference between a subject image obtained at stage 501 and the respective denoised image computed at stage 502.
At stage 504, the computer program stored in storage 302 causes the processor 301 to obtain a database of texture primitives, each texture primitive encoding information representing a texture feature.
At stage 505, the computer program stored in storage 302 causes the processor 301 to generate one or more texture descriptors, e.g. code vectors, each texture descriptor representing one or more regions of a residual image computed at stage 503 as a function of texture primitives in the database obtained at stage 504.
At stage 506, the computer program stored in storage 302 causes the processor 301 to perform on the subject images obtained at stage 501 one or more feature assessment operations.
At stage 507, the computer program stored in storage 302 causes the processor 301 to perform classification operations, based on outputs of  stages  505 and 506, for predicting whether the subject images obtained at stage 501 depict a presentation attack.
In other examples, stage 403 may comprise fewer or further operational stages, depending on the instructions contained in the computer program. For example, in other implementations the operations of stage 506 may be omitted from the method.
Referring next to Figure 6, in examples, the presentation attack detection module 203, depicted previously with reference to Figure 3, is configured to support the functionality of a plurality of functional modules. In the example, each of the functional modules utilises the processor 301, storage 302, and memory 303 of the presentation attack detection module 203.
Pre-processor module 601 is provided for supporting the method of stage 501, for retrieving images, e.g. facial images, from image acquisition module 204, performing image processing operations on the acquired images, and for outputting subject images for analysis by later modules.
Denoiser module 602 is provided for supporting the method of stage 502, for denoising images output by pre-processor module 601 to remove image noise from the images, to obtain denoised representations of the images.
Residual image computing module 603 is provided for supporting the method of stage 503, for computing residual images representing a difference between a subject image output by pre-processor module 601 and the respective denoised image output by denoiser 602.
Texture descriptor generator module 604 is provided for supporting the methods of  stages  504 and 505, for generating a database of texture primitives, and generating texture descriptors representing regions of residual images output by residual image computing module 603 as a function of one or more of the texture primitives.
Feature assessment module 605 is provided for supporting the method of stage 506, to perform feature assessment operations on the subject images output by pre-processor module 601. In examples in which the image acquisition module 204 is utilised to image a user’s face, feature assessment module 605 may comprise eye-region and/or mouth-region assessment sub-modules 606, 607 respectively.
Classifier module 608 is provided for supporting the method of stage 507, for predicting, based on the outputs of texture descriptor generator module 604 and feature assessment module 605,  at  stages  505 and 506 respectively, whether the images obtained at stage 501 depict a presentation attack.
Referring to Figure 7, in examples, the method of stage 501 for obtaining subject images comprises six stages.
At stage 701, the image acquisition module 204 acquires one or more images of a user attempting access to the computer 102. In examples, stage 701 may involve the presentation attack detection module 203 retrieving pre-acquired images from image acquisition module 204, or may instead involve presentation attack detection module 203 instructing image acquisition module 204 to acquire images of a presenting user by imaging device 205, e.g. images of the user’s face, optionally in the NIR channel.
At stages 702 to 706, the pre-processor module 601 performs certain image processing operations on the image (s) acquired at stage 701. In examples, the pre-processor module 601 processes each image (or frame) independently and identically.
At stage 702, the pre-processor module 601 performs an image normalization operation, whereby the image (s) acquired at stage 701 are normalized for a specific, predefined, intensity range.
In examples, stage 702 involves calculation of minimum (I_min) and maximum (I_max) values for intensity thresholds from the image statistics. The normalization operation on the image data of each frame may be as shown in equation 1:
I_norm = (I − I_min) / (I_max − I_min)     (1)
These values may be dynamically computed for each acquired image to capture most of the valid intensity values while ignoring spurious noisy pixels. Once the range thresholds are computed, the valid range of pixels (||I_max − I_min||) may be mapped to a different finite range for further processing, as sketched below.
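As an illustrative sketch of equation 1, the thresholds may be taken robustly from the image statistics; the use of percentiles to ignore spurious noisy pixels, and the output range, are assumptions:

    import numpy as np

    def normalize_intensity(img: np.ndarray, out_max: float = 255.0) -> np.ndarray:
        i_min, i_max = np.percentile(img, [1, 99])   # dynamic range thresholds
        img = np.clip(img.astype(np.float32), i_min, i_max)
        return (img - i_min) / max(i_max - i_min, 1e-8) * out_max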
At stage 703, the pre-processor module 601 performs an image resizing operation, whereby the normalized image (s) output at stage 702 are converted to a predefined fixed dimension through appropriate resizing.
At stage 704, the pre-processor module 601 performs a feature detection operation on the resized image(s) output at stage 703, for example, to detect a user’s face in the image and/or facial landmarks of a user’s face, such as the user’s mouth and/or eye regions. This operation may utilise state-of-the-art libraries such as dlib, or techniques such as a deformable parts model, convolutional neural networks (CNN), cascaded pose regression, and/or multi-task CNNs. In the case of facial imaging, if at stage 704 the pre-processor module 601 is unable to detect a valid face, then a signal may be generated, to be communicated to the computer 102, to inform the user accordingly. Also, if the dimensions of the detected face are smaller than a predetermined threshold, the input image may be rejected. If this behaviour is observed across several frames, the user may be provided with a suitable signal, e.g. by computer 102.
At stage 705, in response to detection of valid features at stage 704, e.g. a valid face/facial landmarks (i.e. coordinates of various facial features), the pre-processor module 601 performs an alignment operation on the image(s), whereby the images are aligned for the subsequent processing. This stage may involve selecting one or more image features, e.g. facial features (such as the left and right eyes, or the left/right/middle points of the mouth), and computing a two-dimensional transformation of the image such that the coordinates of these specific features are consistent across a succession of images.
At stage 706, the pre-processor module 601 performs a cropping operation on the image (s) . For example, the images may be cropped using available features, such as facial landmarks, to depict only a desired image feature, e.g. a face region. The images can also be again resized at stage 706 to predetermined dimensions.
The output of the pre-processor module 601 is thus a cropped and aligned subject image, e.g. a facial image, and corresponding image features, e.g. facial landmarks such as eye/mouth regions.
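A minimal sketch of the alignment and cropping of stages 705 and 706 is given below, assuming eye coordinates have already been returned by the feature detection of stage 704 (e.g. via dlib); the canonical eye positions and output size are assumptions:

    import cv2
    import numpy as np

    def align_and_crop(img: np.ndarray, left_eye, right_eye, size: int = 128) -> np.ndarray:
        # Map the detected eye coordinates to fixed canonical positions so that
        # feature coordinates are consistent across a succession of images.
        src = np.float32([left_eye, right_eye])
        dst = np.float32([[0.35 * size, 0.4 * size], [0.65 * size, 0.4 * size]])
        m, _ = cv2.estimateAffinePartial2D(src, dst)
        return cv2.warpAffine(img, m, (size, size))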
Referring to Figures 8 and 9 collectively, in examples, the method of stage 502 for denoising images comprises two stages.
At stage 801, the denoiser module 602 is trained to denoise images of the type to be analysed, e.g. images in the NIR channel. In examples, a denoising auto-encoder (DAE) CNN is utilised for denoising images. In examples the DAE comprises one or more units of convolutional, pooling, and normalization layers; and one or more units of fully connected layers.
For training the DAE at stage 801, bonafide training images are obtained, e.g. non-presentation attack images, such as images of real human faces, and the images are pre-processed by pre-processor module 601 by the method of stages 702 to 706, as described previously.
During training of the DAE, the training images are intentionally corrupted by adding a suitable noise, such as additive white Gaussian noise (AWGN), of varying levels. In the case of AWGN, these levels are determined by discrete values of the variance (or standard deviation) of the Gaussian function used to generate the noise probability mass function (pmf). Therefore, for N pre-processed images and m levels of noise, total training data of mN images may be obtained through augmentation. For an input image I_F, the noisy or corrupted image Ĩ_F may be obtained through a stochastic mapping, given by equation 2:
Ĩ_F ~ q(Ĩ_F | I_F; σ)     (2)
where σ is the noise level.
At a higher level, the architecture of a DAE consists of an encoder and a decoder. During training, the encoder maps the noisy input Ĩ_F to the hidden representation h, as the function given by equation 3:
h = f(Ĩ_F; θ_E)     (3)
where f represents the overall encoder model with parameters θ_E. On the decoder side, the hidden representation h is reconstructed into I_F-DN using a decoder function g consisting of parameters θ_D, such that equation 4 holds:
I_F-DN = g(h; θ_D)     (4)
During training, the parameters θ_E and θ_D of the DAE are learnt through minimization of a loss function over an average reconstruction error, E(||I_F-DN − I_F||), of training images (authorised user images only). The training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters. The batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved, e.g. in storage 302, for the deployment phase. The model consists of the θ_E and θ_D parameters.
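A minimal sketch of such a DAE and its training step, written here in PyTorch; the layer sizes, learning rate, and loss choice are illustrative assumptions rather than the disclosure’s exact architecture:

import torch
import torch.nn as nn

class DAE(nn.Module):
    # Sketch of a convolutional denoising auto-encoder for 1-channel
    # (e.g. NIR) 128x128 inputs; layer sizes are illustrative.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # f(.; theta_E), equation (3)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU())
        self.decoder = nn.Sequential(  # g(.; theta_D), equation (4)
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DAE()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # average reconstruction error E(||I_F-DN - I_F||)

def train_step(noisy, clean):
    optimiser.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimiser.step()
    return loss.item()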
Referring in particular to Figure 9, at stage 802, the DAE trained at stage 801 is deployed to denoise pre-processed subject images output by pre-processor module 601 at stage 501. At stage 802, the pre-processed subject image(s) are passed through the DAE without any noise corruption. Because the DAE model was trained at stage 801 using only bonafide images, i.e. images of real humans rather than presentation attacks, in the particular channel of interest, e.g. NIR, it has learnt the finer textural details of real humans. Therefore, it is expected to be a more efficient denoiser for bonafide presentations than for attack presentations. The output of the DAE at stage 802 is thus a smoothed/filtered version of the input image.
Whilst the example training stage 801, for training the DAE, is described as being performed immediately prior to denoising stage 802, in alternative examples, training stage 801 could be performed well in advance of denoising stage 802. Indeed, in examples, training stage 801 could be performed by an engineer prior to deployment of the biometric verification system 103.
Referring to Figure 10, in examples, the method of stage 503 for computing residual images comprises pixelwise subtraction of the pre-processed image output by the pre-processor module 601 at stage 501 (the input to the DAE denoiser) from the filtered image output by the DAE denoiser 602 at stage 502. This procedure yields an image that primarily consists of the textural information in the input image. For example, in the case of presentation attacks constructed using a digital display, the residual image is expected to contain patterns of scanline noise. Similarly, for a three-dimensional face mask formed of paper, the residual image is expected to represent mainly the fine granular texture of the paper material.
For an input image, e.g. a face image, I_F, the DAE generates a somewhat smoothed output, I_F-DN. The residual image, I_residue, is obtained from the pixelwise difference between the two, as given by equation (5):

I_residue = I_F-DN − I_F     (5)
The resultant residual image thus primarily encodes the information related to texture patterns of the input images.
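The residual computation itself is a single pixelwise difference; a minimal sketch:

import numpy as np

def residual(i_f, i_f_dn):
    # Equation (5): pixelwise difference, primarily encoding texture patterns.
    return i_f_dn.astype(np.float32) - i_f.astype(np.float32)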
However, these texture details are superimposed with other corrupting noisy or high-frequency contents of the input presentation. Such corruptions may impair the process of predicting whether the image depicts a presentation attack. Thus, an objective of later stage 505 is to represent the contents of the residual image(s) in a more discriminative manner.
Referring to Figure 11, in examples, the method of stage 504 for obtaining a database of texture primitives comprises five stages. In the example, stage 504 involves generating the database by the training procedure described below. In other examples, stage 504 could involve obtaining a pre-generated database of texture primitives, i.e. a database generated at a prior time, optionally by a third party, e.g. using the following procedure.
In examples, the database of texture primitives is a “dictionary” of textural primitives or codewords that are specifically learnt for the intended imaging application, e.g. for imaging faces in the NIR channel. An advantage of this approach of generating texture primitives that are specific to the intended imaging application, e.g. texture primitives of NIR imagery where the intended application will image in the NIR channel, is that the texture primitives are then specific to the application, may most accurately define the textures of the image, and may thus allow more accurate/reliable classification at later stage 507.
An objective of later stage 505 is to represent a local patch of the residue image I_residue as a linear combination of texture primitives, i.e., codewords of the dictionary. Thus, entries in the database should encode texture primitives such that, at stage 505, an input image, through its residual image, may be represented as a vector of texture primitives.
At stage 1101, training images, including both bonafide images, e.g. images of real human faces, and attack presentations, e.g. images of subjects wearing masks, are obtained.
At stage 1102, a residual image is obtained, for example by the method of stages 501 to 503, for each input training image (both classes, bonafide and attack presentations) .
At stage 1103, the residual images obtained at stage 1102 may be tessellated to obtain small, non-overlapping regions, otherwise known as patches, of n*n dimensions.
At stage 1104, for each region/patch I_residue[i, j], 0 ≤ i < n, 0 ≤ j < n, texture primitives, otherwise known as codes, for inclusion in the database may be computed by the minimization of equation (6):

min over α, C of ||I_residue[i, j] − Cα||₂     (6)

where α and C represent the coefficients and texture primitives, respectively. Their values are computed through an alternate minimization technique, in which an acceptable error norm, e_min, can be predetermined. This training procedure thus generates a set of texture primitives or codewords, representing different textural features of the bonafide and attack training images.
At stage 1105, the texture primitives generated at stage 1104 are compiled to form a database of texture primitives. Columns of the database may thus represent individual texture primitives, i.e. individual codewords.
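As an illustrative sketch of stages 1103 to 1105, dictionary learning by alternate minimisation is available off the shelf, e.g. in recent scikit-learn; the patch size and number of codewords below are assumptions:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

n = 8                                       # assumed patch size (n*n)
P = 64                                      # assumed number of codewords
# Rows are flattened n*n residual patches from stage 1103 (stand-in data here).
residual_patches = np.random.randn(1000, n * n)

# Alternate minimisation over coefficients (alpha) and atoms (C), cf. equation (6).
dico = MiniBatchDictionaryLearning(n_components=P, alpha=1.0, max_iter=200)
C = dico.fit(residual_patches).components_  # P x n^2 database of texture primitives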
Referring to Figure 12, in examples, the method of stage 505 comprises computing coefficients mapping regions of the residual images generated at stage 503 to texture primitives in the database generated at stage 504.
As previously described, the texture details contained in residual images are superimposed with other corrupting noisy or high-frequency contents of the input presentation, which may impair the later classification process at stage 507 for predicting whether the image depicts a presentation attack. Thus, an objective of stage 505 is to represent the contents of the residual image(s) in a more discriminative manner, as a function of the texture primitives in the database. The accuracy and reliability of the presentation attack detection prediction, at later stage 507, may thus be improved. As previously described, in examples, the texture primitives in the database are generated specifically for the imaging application, e.g. in the NIR channel, and may thus accurately represent the textures of the subject image.
At stage 505, the texture primitive database learnt at stage 504 is used to obtain a texture descriptor, also known as a feature descriptor, for the residual image(s) output by the DAE denoiser 602 at stage 502. The incoming residual image I_residue is divided into smaller, non-overlapping patches of n*n. For each patch, an optimal vector of coefficients may be computed using equation (7):

α = argmin over α of ||I_residue[i, j] − Cα||₂     (7)

where C is the database of texture primitives learnt at stage 504, held fixed.
This operation thus involves computing the coefficients α, such that each patch of each residual image may be represented as a function of one or more of the (micro) texture primitives contained in the texture primitive database and their respective coefficients, e.g. as a linear combination of plural texture primitives, as given by the function in Figure 12.
Whilst in the example learning stage 504 for generating the database of texture primitives is described as being performed immediately prior to deployment stage 505, in alternative examples, learning stage 504 could be performed well in advance of deployment stage 505. Indeed, in examples, training stage 504 could be performed by an engineer prior to deployment of the biometric verification system 103.
The order of the tessellated patches/regions of the residual images should be predefined, and should be consistent, so that the descriptor is specific to the spatial coordinates. If the residual image is tessellated into i×j patches of uniform size, and the database of texture primitives consists of P codewords, then the feature descriptor F_texture has dimensionality given by equation (8):
F_texture ∈ R^(i×j×P)     (8)
The texture/feature descriptor(s) generated at stage 505 may then be passed to classifier module 608, for inclusion in the classification operation at later stage 507, as will be described further with reference to Figure 15.
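A minimal sketch of this encoding step, reusing the dictionary C from the sketch above together with scikit-learn’s sparse coding; the OMP algorithm choice is an assumption:

import numpy as np
from sklearn.decomposition import sparse_encode

def texture_descriptor(residual_image, C, n=8):
    # Encode each n*n patch against dictionary C (equation (7)) and
    # concatenate the coefficients in a fixed patch order (equation (8)).
    h, w = residual_image.shape
    coeffs = []
    for i in range(0, h - h % n, n):
        for j in range(0, w - w % n, n):
            patch = residual_image[i:i + n, j:j + n].reshape(1, -1)
            alpha = sparse_encode(patch, C, algorithm="omp")
            coeffs.append(alpha.ravel())
    return np.concatenate(coeffs)  # dimensionality i x j x P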
Referring to Figure 13, in examples relating to subject facial images, the method of stage 506 for performing feature assessments comprises two operations, which may suitably be performed in parallel, each operation comprising three stages. Stages 1301 to 1303 are deployed for assessing eye regions of subject face images, whilst stages 1304 to 1306 are deployed for assessing mouth regions of subject face images.
Eye and mouth regions provide several important cues for the detection of presentation attacks relating to facial images. Stage 506 thus involves conducting an assessment of eye and mouth regions over a sequence of image frames to test for occlusion, local motion, and masking possibilities. Stage 506 involves examining a variety of such cues, extracted from individual image frames as well as from sequences of image frames.
Features of an eye region of a face image can be considered useful indicators of presentation attacks only if the eyes are visible in the image, i.e. if the eyes were visible to the imaging device at stage 501. For a subject wearing dark glasses or using another means to cover or hide their eyes, assessments of eye region features are not useful, and should be excluded from consideration in the later classification stage 507. Additionally, partial occlusion of one or both eyes may result in lower accuracy of the presentation attack prediction at later stage 507.
Thus, at stage 1301, a check is performed for the visibility of a user’s eyes in subject facial images.
The visibility check performed at stage 1301 may involve analysing the facial landmarks detected by pre-processor module 601 at stage 704 to identify relevant regions of the image. Based on the landmarks related to the eyes, and predetermined feature dimensions, e.g. facial dimensions, a region of the image containing the eyes is identified. The feature detection at stage 704 may approximate or estimate these locations in the case of partial occlusions; therefore, an explicit check for visibility is desirable. In the case of occlusion using glasses or other materials, the eye region of the image may be expected to appear nearly homogeneous. In a visible view of the eye region, i.e. including pupils, eyebrows, eyelids, etc., the region will include visible features. Stage 1301 involves looking for such visible features in the region of interest. Firstly, the entropy of a rectangular eye region I_eye is computed by equation (9):

H_eye = −∑_k p_k log₂(p_k)     (9)

where p_k refers to the histogram of the eye region I_eye. The entropy of a visible eye region is expected to be much higher than that of an occluded eye region.
However, it is also possible for a user’s eyes to be covered by alternative means that include visibly distinct features, which may render the above entropy calculation, H_eye, not useful in detecting the visibility of a user’s eye region. Thus, stage 1301 may further involve a pattern-checking operation. In this operation, an edge map of an eye region of an image is computed using edge detection operators such as Sobel, Prewitt, or Canny. The edge map is slightly blurred by convolving with a two-dimensional Gaussian kernel of small variance. A template of the eye region is pre-computed from a small set of visible and clear bonafide presentations. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the shapes of the eyes of individuals.
During the visibility check, the blurred edge map of the eye region of the subject image is matched against the pre-computed template using a normalized cross-correlation (NCC) technique, given by equation (10):

NCC_eye = ∑_(k1, k2) Î_eye[k1, k2] · T_eye[k1, k2]     (10)

where Î_eye is the normalized image of the blurred eye region, and T_eye is the normalized template of the eye region.
The values of the entropy H_eye and the normalized cross-correlation NCC_eye, computed by equations (9) and (10) respectively, are passed to the classifier module 608 for inclusion in the later classification stage 507.
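A minimal sketch of the two visibility measures, assuming 8-bit grayscale regions and a pre-computed, pre-normalized template of the same shape; the same functions serve the mouth-region checks of stage 1304:

import cv2
import numpy as np

def region_entropy(region):
    # Equations (9)/(11): Shannon entropy of the region's intensity histogram.
    hist, _ = np.histogram(region, bins=256, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ncc_score(region, template):
    # Equations (10)/(12) sketch: blurred, normalized edge map matched
    # against the pre-normalized template.
    edges = np.abs(cv2.Sobel(region, cv2.CV_32F, 1, 0)) \
          + np.abs(cv2.Sobel(region, cv2.CV_32F, 0, 1))
    blurred = cv2.GaussianBlur(edges, (5, 5), 1.0)
    z = (blurred - blurred.mean()) / (blurred.std() + 1e-8)
    return float((z * template).mean())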
Human eyes exhibit natural motion over small intervals of time (e.g., eyelid blinking and gaze changes). The presence of such movement in input subject images may thus usefully serve as an indication that the image depicts a live human, rather than, e.g., a presentation attack in the form of a printed image. Stage 506 may thus further involve an assessment of motion between frames in an image sequence. In the example, such an assessment does not explicitly check for a blink or gaze change; rather, it calculates the degree of generic local motion. This feature is computed over a sequence of frames, and only if the visibility check at stage 1301 provides a positive output.
Thus, at stage 1302, the eye region I_eye is divided into small patches of m*n dimensions. For a given p-th frame, the mean absolute difference (MAD) between the patch I_eye[k1m, k2n] from the p-th frame and the patch at the same spatial location from the (p-1)-th frame is calculated. The MAD is a scalar value, which remains close to zero if the patches do not change over the frame sequence. For a natural eye movement, in the form of a blink, gaze change, eyelid opening, eyelid closing, etc., the MAD sequence is expected to be inconsistent. For sudden changes such as eye blinks or quick head movements, the MAD sequence consists of high frequency (impulse) signals, whereas for a slow movement such as a gaze change, the MAD sequence contains relatively lower frequency contents.
The MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, ...), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system. The differential value of the MAD is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
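A minimal sketch of the per-patch MAD between co-located patches of successive frames; the patch dimensions are illustrative, and the same computation serves the mouth region at stage 1305:

import numpy as np

def patch_mad(prev_frame, cur_frame, k1, k2, m=8, n=8):
    # MAD between the (k1, k2)-th m*n patch of two frames; near zero
    # for static content, impulsive for blinks or quick movements.
    ys, xs = slice(k1 * m, (k1 + 1) * m), slice(k2 * n, (k2 + 1) * n)
    a = prev_frame[ys, xs].astype(np.float32)
    b = cur_frame[ys, xs].astype(np.float32)
    return float(np.abs(a - b).mean())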
Additionally, for every patch of the eye region, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch/region. Therefore, if the contents of an image patch change (due to eye movements), the corresponding textures also exhibit a large change. Stage 1302 may thus further compute a difference between texture descriptors of a given spatial patch over a frame sequence, which may thereby be used to estimate local changes/movement. Note that this function does not check for explicit blink or gaze movements, but finds indications of an overall motion of any kind. The objective of this operation is not to quantify the amount of motion; rather, it is aimed at identifying any motion that may be helpful in evaluating the liveliness of the presentation.
While local motion of an eye region is an important liveliness characteristic, it may be spoofed by an attacker, by replaying a video of a subject, or by using a mask with cut-outs at the eye region (where an attacker’s eyes will be seen through the cut-outs of the mask). Therefore, stage 506 further provides a check for the possibility of cut-out features around the eye regions (for the detection of mask or print attacks). This functionality is tested only if the visibility check for the eye region at stage 1301 is positive.
It may be expected that, in the case of the presence of cut-outs in a mask, strong cut edges will be visible in the region around the eyes. Thus, at stage 1303, the eye region I_eye is convolved with an edge detection kernel such as Sobel, and the output of the convolution operation is normalized for mean and standard deviation. A histogram of this normalized edge image is computed. A reference/template histogram is computed from a set of visible bonafide presentations from the training data, by selecting the eye regions and then obtaining edge maps through convolution.
The histogram for attack presentations with cut-outs may be expected to consist of higher values as compared to the corresponding histogram of bonafide samples. The magnitude of a difference between the reference and test histograms is considered a useful indicator of the presence of cut-outs in the given region, and is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
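A minimal sketch of this cut-out check, reusing the edge-map idea above; the histogram bin count and range are assumptions, and the function serves equally for the mouth-region check of stage 1306:

import cv2
import numpy as np

def cutout_score(region, ref_hist):
    # Stages 1303/1306 sketch: distance between the edge histogram of the
    # region and a bonafide reference histogram; large values suggest the
    # strong cut edges expected around mask cut-outs.
    edges = np.abs(cv2.Sobel(region, cv2.CV_32F, 1, 0)) \
          + np.abs(cv2.Sobel(region, cv2.CV_32F, 0, 1))
    z = (edges - edges.mean()) / (edges.std() + 1e-8)
    hist, _ = np.histogram(z, bins=64, range=(-4.0, 4.0))
    hist = hist / max(hist.sum(), 1)
    return float(np.abs(hist - ref_hist).sum())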
The mouth region of a facial image also provides several important cues for detection of presentation attacks. Thus, stages 1304 to 1306 involve extracting a variety of such cues in a similar manner to the eye region assessment of stages 1301 to 1303.
Features of a mouth region can be considered useful indicators of presentation attacks only if the mouth is visible in the image, i.e. if it was visible to the imaging device at stage 501. A subject (or attacker) might occlude the mouth completely or partially, e.g. by covering it with hands or clothing. Since heavy occlusion degrades the accuracy of a presentation attack prediction, it is desirable to check the amount of occlusion or visibility of the mouth region in subject images.
Thus, at stage 1304, a visibility check for the mouth region in subject facial images is conducted. The visibility check for the mouth region is conducted by analysing the features, e.g. facial landmarks, detected by the pre-processor module 601 at stage 704, to identify relevant regions of the subject image. A region containing the mouth is identified based on the corresponding landmarks and predetermined dimensions of an average face. An explicit check for the lips/mouth is desirable in particular where the feature detection at stage 704 is focused on identifying a silhouette of a face rather than specifically a mouth feature. In the case of occlusion of the mouth region, e.g. by clothing, it may be expected that the mouth region will appear nearly homogeneous. Conversely, in a visible view, features of the mouth region, such as the lips, should be visible. Similar to the eye region assessment, this stage generates entropy and edge map-based functions for a visibility check. The entropy of a rectangular mouth region I_mouth is computed by equation (11):

H_mouth = −∑_k p_k log₂(p_k)     (11)

where p_k refers to the histogram of the mouth region I_mouth. The entropy of a visible mouth region may be expected to be higher than that of an occluded mouth region. However, if the mouth is covered, e.g. by clothing or a facial mask, then the entropy calculation may not be a useful indicator of the visibility of the mouth.
Stage 1304 thus further involves a pattern-checking operation for assessing the visibility of a mouth region. An edge map of a mouth region is computed using edge detection operators such as Sobel, Prewitt, or Canny. This edge map is slightly blurred using a 2D Gaussian kernel of small variance. A template of the mouth region is pre-computed from a small set of facial images in which the mouth region is visible. The process of blurring slightly dilates the edge map, and hence compensates for minor differences in the shapes of the mouth for different individuals.
During the visibility check, the blurred edge map of the mouth region of subject images is matched against the pre-computed template using an NCC technique, given by equation (12):

NCC_mouth = ∑_(k1, k2) Î_mouth[k1, k2] · T_mouth[k1, k2]     (12)

where Î_mouth is the normalized image of the blurred mouth region, and T_mouth is the normalized template of the mouth region. This value, along with the region entropy H_mouth, is then passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
A natural movement of the mouth region, e.g. of the lips, can be a useful indicator of the liveness of a presentation. Stage 506 may thus further involve detecting local motion between successive image frames. Since stage 1305 requires checking for motion, this feature is computed over a sequence of frames. Additionally, this feature is computed only if the visibility check for the mouth region at stage 1304 returns a positive output.
Thus, at stage 1305, the mouth region I_mouth is divided into small patches of m*n dimensions. For a given p-th frame, the mean absolute difference (MAD) between the patches at the same spatial location (I_mouth[k1m, k2n]) over consecutive frames, the p-th and (p-1)-th frames, is calculated. The MAD is a scalar value that remains close to zero if the patches do not change over a frame sequence. For a moderate movement of the lips, whether natural or while speaking, the MAD sequence may be expected to be inconsistent. For a specific speech utterance or a quick head movement, the MAD sequence may be expected to consist of high frequency (impulse) signals, whereas for a slow natural movement, the MAD sequence may be expected to contain relatively lower frequency contents.
The MAD sequence is analysed over a moving window of frames, and it may be computed from every n-th frame (n could be 2, 3, 5, ...), rather than over consecutive frames. These parameters can be defined in accordance with the frame rate of the overall system. The differential value of the MAD analysis is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
Additionally, for every patch of the mouth region, a texture descriptor is also generated by texture descriptor generator 604, in the way described previously with reference to stage 505. As described previously, these texture descriptors capture fine texture features of the patch. Therefore, if the contents of an image patch change (e.g. due to lip movements), the corresponding textures are also expected to exhibit a large change. Thus, at this stage, the difference between texture descriptors of a given spatial patch over a frame sequence is computed, and is used to estimate local change/movement. The objective of this operation is not to quantify the amount of motion or utterances; rather, it is aimed at detecting any motion that may be a useful indicator of the liveliness of the presentation.
Although the detection of local motion of a mouth region in an image is a helpful liveliness characteristic, it can be spoofed by an attacker replaying a video of a subject, or by using a mask with a cut-out at the mouth region (wherein the attacker’s mouth may be visible through the cut-out). Thus, stage 1306 involves a check for the possibility of a cut-out around the mouth region of the face (in the case of mask or print attacks). This functionality is tested only if the visibility check for the mouth region at stage 1304 is positive.
The presence of a cut-out around the mouth region can be inferred from the presence of unnaturally strong edges. The mouth region I_mouth is convolved with an edge detection kernel such as Sobel, and the output of the convolution is normalized for mean and standard deviation. A histogram of this normalized edge image is computed, which is then used as a feature. A reference histogram is computed from a set of images in which a subject’s mouth region is visible, by selecting the mouth region and then obtaining its edge map through convolution.
The histogram for attack presentations with cut-outs around the mouth region may be expected to have higher values as compared to the corresponding histogram of bonafide presentations.
The total magnitude of difference between the reference and test histograms is passed to the classifier module 608 for inclusion in the classification operation at later stage 507.
Referring next to Figures 14 and 15 collectively, in examples, stage 507 for performing a classification operation for predicting presentation attacks comprises two stages.
At stage 1401, the classifier module 608 is trained to predict presentation attacks based on the outputs of texture descriptor generator 604 at stage 505, and feature assessment module 605 at stage 506. In examples, the classifier module 608 utilises a neural network, such as a multi-layer perceptron (MLP) network with one or more hidden layers, one input layer, and one output layer. The number of neurons in the input layer is equal to the sum of dimensions of input features provided by the texture descriptor generator 604 and feature assessment module 605. This number is itself governed by the size of the texture primitive database, and the dimensions of the subject images.
For training the MLP at stage 1401, feature regions of training images, e.g. eye and mouth regions, of both bonafide and attack classes, along with labels identifying the nature of the texture primitives in the database, i.e. bonafide or attack, are utilised. The training can be conducted using a suitable optimization method (such as stochastic gradient descent (SGD) or Adam) and suitable learning parameters. The batch size of training images can be decided by the amount of computing resources available for the training. After a reasonable convergence and good accuracy, the model may be saved for the deployment phase. The model consists of the learned weight parameters.
During the classification stage 1402, the classifier module takes as inputs the outputs of the texture descriptor generator module 604 and the feature assessment module 605, produced at stages 505 and 506 respectively, and, by operation of the learned neural network, outputs predictions as to whether the subject image obtained at stage 501 shows a bonafide presentation of a human or an attack presentation, e.g. a mask or printed image. The MLP model utilised by the classifier module has two neurons at the output. One output provides the classification of presentation attack detection, i.e., whether the input image depicts a bonafide human or a presentation attack. The second output is used to provide a signal to the user, e.g. via the computer 102, if major occlusions are observed in the input images. While the operation of the classification module is robust to minor image occlusions, its performance may degrade as the amount of occlusion increases. The visibility of regions such as the eyes and mouth can provide helpful cues in this regard.
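A minimal sketch of such a two-output MLP, again in PyTorch; the hidden-layer width is an illustrative assumption, and in_dim would equal the concatenated dimensions of the stage-505 and stage-506 features:

import torch.nn as nn

class PADClassifier(nn.Module):
    # Sketch of the stage-507 MLP: one hidden layer (width is illustrative)
    # and two sigmoid outputs: attack/bonafide score and occlusion signal.
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Sigmoid())

    def forward(self, features):
        out = self.net(features)
        return out[..., 0], out[..., 1]  # PAD prediction, occlusion signal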
In other examples, other classification procedures could be used. For example, a simpler, or more complex, neural network could be used. As previously described, in examples, stage 506 for feature assessment may be omitted from the method, and the classifier module 608 may thus receive as inputs only the texture descriptors output by the texture descriptor generator module 604 at stage 505. In such examples, a simpler neural network may be utilised. In other examples, a classification method not involving a neural network may be utilised.
Whilst in the example, training stage 1401 for training the classifier module 608 is described as being performed immediately prior to classification stage 1402, in alternative examples, training stage 1401 could be performed well in advance of classification stage 1402. Indeed, in examples, training stage 1401 could be performed by an engineer prior to deployment of the biometric verification system 103.
Aspects of the present disclosure have been described in detail herein in the context of a biometric verification system for verifying the authority of a user requesting access to a computer, as a part of a computer access control system. However, the utility of the disclosure is not limited to such an application. In particular, it should be understood that the presentation attack detection module 203, and the method for presentation attack detection using the module 203, have wider utility generally for detecting presentation attacks in images. Thus, in other examples of aspects of the disclosure, the presentation attack detection module 203 and/or the method for presentation attack detection using the module 203 could be deployed in isolation of one or more other features of the computing system 101. For example, in an alternative embodiment, presentation attack detection module 203 could be deployed as a standalone module for detecting presentation attacks in images input to the module.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality.

Claims (16)

  1. A method for detecting image presentation attacks, the method comprising:
    obtaining a subject image;
    denoising the subject image to obtain a denoised representation of the subject image;
    computing a residual image representing a difference between the subject image and the denoised representation of the subject image;
    obtaining a database of one or more texture primitives;
    generating a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database; and
    performing on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
  2. The method of any one of the preceding claims, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of a plurality of regions of the residual image as a function of texture primitives in the database.
  3. The method of any one of the preceding claims, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a function of a plurality of texture primitives in the database.
  4. The method of claim 3, wherein the generating a texture descriptor comprises generating a texture descriptor representing image texture details of each of the one or more regions of the residual image as a linear combination of the plurality of texture primitives in the database and respective coefficients relating texture details of the region of the residual image to each of the texture primitives.
  5. The method of any one of the preceding claims, wherein the denoising the subject image comprises denoising the subject image using a convolutional neural network for predicting a denoised representation of an image.
  6. The method of any one of the preceding claims, wherein the performing on the texture descriptor a classification operation comprises performing on the texture descriptor operations of a convolutional neural network for predicting image presentation attacks based on image texture details.
  7. The method of any one of the preceding claims, comprising performing on the subject image:
    an image intensity normalization operation for changing an intensity range of the subject image to a predetermined intensity range; and/or
    an image resizing operation for changing a size of the subject image to a predetermined size.
  8. The method of any one of the preceding claims, further comprising:
    performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature; and
    performing on the detected region of the subject image a visibility detection operation for detecting a visibility of the predetermined image feature in the detected region of the subject image.
  9. The method of any one of the preceding claims, further comprising:
    performing on the subject image a feature position detection operation for detecting a region of the subject image containing a predetermined image feature;
    convolving the detected region of the subject image with an edge detection kernel and computing a histogram representing an output of the convolution;
    obtaining a reference histogram; and
    computing a difference between the histogram and the reference histogram.
  10. The method of any one of the preceding claims, further comprising:
    obtaining a further subject image;
    denoising the further subject image to obtain a denoised representation of the further subject image;
    computing a further residual image representing a difference between the further subject image and the denoised representation of the further subject image;
    generating a further texture descriptor representing image texture details of a region of the further residual image, corresponding spatially to one of the one or more regions of the residual image, as a function of texture primitives in the database; and
    computing a difference between the further texture descriptor and the texture descriptor for the corresponding region of the residual image.
  11. The method of any one of claims 5 to 10, wherein the denoising the subject image comprises:
    obtaining a convolutional neural network;
    receiving a training image;
    adding image noise to the training image to generate a noisy representation of the training image;
    performing operations of the convolutional neural network on the noisy representation of the training image and generating a prediction for a denoised representation of the noisy representation of the training image;
    quantifying a difference between the prediction and the training image; and
    modifying operations of the convolutional neural network based on the difference.
  12. The method of any one of the preceding claims, wherein the obtaining the subject image comprises receiving an image of near-infrared radiation, and wherein the obtaining a database of  one or more texture primitives comprises obtaining a database of one or more texture primitives representing textures of near-infrared radiation imagery.
  13. The method of any one of the preceding claims, wherein the obtaining a subject image comprises imaging using an optical imaging device.
  14. A computer program comprising instructions, which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 13.
  15. A computer-readable data carrier having the computer program of claim 14 stored thereon.
  16. A computer for detecting image presentation attacks, wherein the computer is configured to:
    obtain a subject image;
    denoise the subject image to obtain a denoised representation of the subject image;
    compute a residual image representing a difference between the subject image and the denoised representation of the subject image;
    obtain a database of one or more texture primitives;
    generate a texture descriptor representing image texture details of one or more regions of the residual image as a function of texture primitives in the database; and
    perform on the texture descriptor a classification operation for predicting image presentation attacks based on image texture details.
PCT/CN2020/134321 2020-12-07 2020-12-07 Presentation attack detection WO2022120532A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080107787.5A CN116982093A (en) 2020-12-07 2020-12-07 Presence attack detection
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Publications (1)

Publication Number Publication Date
WO2022120532A1 true WO2022120532A1 (en) 2022-06-16

Family

ID=81972803

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134321 WO2022120532A1 (en) 2020-12-07 2020-12-07 Presentation attack detection

Country Status (2)

Country Link
CN (1) CN116982093A (en)
WO (1) WO2022120532A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008045139A2 (en) * 2006-05-19 2008-04-17 The Research Foundation Of State University Of New York Determining whether or not a digital image has been tampered with
CN106097379A (en) * 2016-07-22 2016-11-09 宁波大学 A kind of distorted image detection using adaptive threshold and localization method
CN109086718A (en) * 2018-08-02 2018-12-25 深圳市华付信息技术有限公司 Biopsy method, device, computer equipment and storage medium
CN109948776A (en) * 2019-02-26 2019-06-28 华南农业大学 A kind of confrontation network model picture tag generation method based on LBP
CN111126190A (en) * 2019-12-10 2020-05-08 武汉大学 Disguised face recognition method based on free energy theory and dynamic texture analysis

Also Published As

Publication number Publication date
CN116982093A (en) 2023-10-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964479

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080107787.5

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964479

Country of ref document: EP

Kind code of ref document: A1