WO2020159437A1 - Method and system for face liveness detection - Google Patents

Method and system for face liveness detection

Info

Publication number
WO2020159437A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
live
face portion
representation
neural network
Prior art date
Application number
PCT/SG2020/050029
Other languages
French (fr)
Inventor
Lilei ZHENG
Ying Zhang
Chien Eao LEE
Vrizlynn Ling Ling Thing
Original Assignee
Agency For Science, Technology And Research
Priority date
Filing date
Publication date
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Publication of WO2020159437A1 publication Critical patent/WO2020159437A1/en


Classifications

    • G06N 3/08 Learning methods (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06F 18/256 Fusion techniques of classification results relating to different input data, e.g. multimodal recognition (G06F 18/00 Pattern recognition; G06F 18/25 Fusion techniques)
    • G06N 3/045 Combinations of networks (G06N 3/04 Architecture, e.g. interconnection topology)
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] (G06V 10/40 Extraction of image or video features; G06V 10/44 Local feature extraction)
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions (G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data)
    • G06V 40/45 Detection of the body part being alive (G06V 40/40 Spoof detection, e.g. liveness detection)

Definitions

  • the present invention generally relates to a method and a system for detecting a face liveness of a face portion in a captured video.
  • a typical face authentication system may include two modules, i.e., a face detection module and a face recognition module. The former is to locate the face regions if face(s) appear in an image and the latter is to determine the identities of the detected face(s).
  • a face authentication system only allows access to registered faces. However, neither of the two modules has the ability to prevent spoofing attacks. If a spoofing attacker has the photo or video of a registered face, it is easy to spoof such a face authentication system by presenting the photo of the face or replaying the video containing the face.
  • existing face authentication systems are susceptible to spoofing attacks using non-live faces (i.e., not actual face present, which may also be referred to as fake faces herein) presented by way of photos or videos of a registered face (i.e., face spoofing attacks by a third party).
  • the reflection of light is demonstrated to be a better feature for accurate liveness detection.
  • the depth information from the depth channel may be used for spoofing detection.
  • the face anti-spoofing task using only the visible channels may be referred to as two-dimensional (2D) face anti-spoofing, and the task using the additional depth information may be referred to as three-dimensional (3D) face anti-spoofing.
  • Popular face spoofing attacks include print-photo attack, video replay attack and 3D mask attack.
  • One of the biggest challenges in 2D face anti-spoofing is the lack of training data, especially the lack of spoofing faces. This is because in face recognition tasks, millions of labelled human faces can be found on social websites and used by authorized parties to train deep models.
  • one conventional approach uses information from both visible and depth channels in an integrated way. For example, attempts at utilizing image information from both the visible and depth channels to enhance face anti-spoofing may train two convolutional neural networks (CNNs) to detect face spoofing in the visible channel and depth channel, respectively. The prediction results of the two CNNs were fused to make the final decision of live or non-live face for a detected face portion.
  • depth information is estimated from the captured raw color image by using 3D face reconstruction algorithms instead of being directly captured by real 3D cameras. This method may facilitate a pure software-based application, but the disadvantage is that additional estimation error will be introduced to the final decision.
  • a method of detecting a face liveness of a face portion in a captured video using at least one processor comprising:
  • a system for detecting a face liveness of a face portion in a captured video comprising:
  • At least one processor communicatively coupled to the memory and configured to: obtain a video frame from a plurality of video frames of the captured video;
  • the detected face portion comprises a three-dimensional (3D) non-live face representation.
  • a computer program product embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of detecting a face liveness of a face portion in a captured video, the method comprising:
  • FIG. 1 depicts a schematic flow diagram of a method for detecting a face liveness of a face portion in a captured video using at least one processor according to various embodiments of the present invention
  • FIG. 2 depicts a schematic block diagram of a system for detecting a face liveness of a face portion in a captured video according to various embodiments of the present invention, such as corresponding to the method shown in FIG. 1;
  • FIG. 3 depicts an example computer system which the system according to various embodiments of the present invention may be embodied in;
  • FIG. 4 illustrates an exemplary deep neural framework for detecting a face liveness of a face portion in a captured video according to various example embodiments of the present invention.
  • FIG. 5 shows an exemplary chart illustrating exemplary non-live face representations which may be used in spoofing attacks and may be determined from a detected face portion.
  • Various embodiments of the present invention provide a method (computer-implemented method) and a system (including a memory and at least one processor communicatively coupled to the memory) for detecting a face liveness of a face portion in a captured video.
  • a fully automated technique for detecting face liveness of the face portion using a deep learning-based classification model or deep neural network framework is provided.
  • the face portion may correspond to a face of a user or person.
  • the face portion in the captured video may correspond to a live face of a person which is captured in real-time, or a non-live face representation of the person which is not captured in real-time (i.e., not actual face present).
  • the non-live face representation may be a two-dimensional (2D) non-live face representation or a three-dimensional (3D) non-live face representation.
  • the face liveness detection as described herein may be used to distinguish between a live face and any form of non-live face representations, including a 2D non-live face representation and/or a 3D non-live face representation.
  • the face liveness detection may be used, for example, to determine presence of a spoofing attack by a third party (e.g., unauthorized user).
  • the face liveness detection may be used to facilitate face authentication systems.
  • the fully automated technique for detecting a face liveness of a face portion based on the deep neural network framework may use depth information in the captured video to determine or detect 2D non-live face representations, and visible or color information to determine or detect 3D non-live face representations.
  • FIG. 1 depicts a schematic flow diagram of a method 100 (computer-implemented method) for detecting a face liveness of a face portion in a captured video using at least one processor according to various embodiments of the present invention.
  • the method 100 comprises obtaining (at 102) a video frame from a plurality of video frames of the captured video; detecting (at 104) a face portion corresponding to a face in the video frame; generating (at 106) a first type facial image and a second type facial image of the detected face portion; determining (at 108), using a first neural network based on the first type facial image, whether the detected face portion comprises a 2-dimensional (2D) non-live face representation; and determining (at 110), using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
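  • As a non-limiting illustration only, the per-frame flow of method 100 may be sketched in code as follows. The helper names (detect_face, crop_depth_and_gray) and the two classifier objects are hypothetical placeholders, not the patented implementation.

```python
# Minimal per-frame sketch of method 100 (steps 102-110). The helpers
# detect_face, crop_depth_and_gray, net_2d and net_3d are hypothetical
# stand-ins for the face detector and the two trained neural networks.

def check_frame_liveness(color_frame, depth_frame,
                         detect_face, crop_depth_and_gray, net_2d, net_3d):
    """Return 'no_face', '2d_spoof', '3d_spoof' or 'live' for one video frame."""
    box = detect_face(color_frame)                 # step 104: locate a face portion
    if box is None:
        return "no_face"

    # step 106: first type facial image (depth) and second type facial image (grayscale)
    face_depth, face_gray = crop_depth_and_gray(color_frame, depth_frame, box)

    # step 108: first network inspects the depth image for flat (2D) presentations
    if net_2d.predict(face_depth) == "2d_non_live":
        return "2d_spoof"

    # step 110: second network inspects the grayscale image for 3D masks
    if net_3d.predict(face_gray) == "3d_non_live":
        return "3d_spoof"

    return "live"
```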
  • the video may be captured to include a face portion corresponding to a face of a user.
  • the face of the user captured by the video may be a live face of the user which is captured in real-time (that is, actual person present), or a non-live face representation of the user which is not captured in real-time (that is, actual person not present).
  • the video frame may be obtained and processed to detect a face liveness of the user in the video.
  • a video relating to the face portion of a user as described herein may be obtained by a video capturing or recording device.
  • the video capturing device may produce a video comprising the face portion and at least one processor may then obtain (or receive) the plurality of video frames (or video images) of the video for processing to detect a face liveness of the face portion in the captured video.
  • the video capturing device may be, or include, a mobile device, a camera, or combinations thereof.
  • the video frames may be in a color format such as RGB format (i.e., a video frame may be an RGB image).
  • the non-live face representation may be a 2D non-live face representation, a 3D non-live face representation, or combinations thereof.
  • the 2D non-live face representation may be, or include, for example an image-based face representation or rendering presented in a print-paper, a screen, or combinations thereof.
  • the 3D non-live face representation may be, or include, a 3D face representation or rendering, for example, presented by a 3D mask. Other types or forms of face representation, which are not live faces, may also be applicable.
  • a face portion corresponding to a face in the video frame may be detected using a face detector or detection module.
  • the face detector may be, or include, various existing face detecting techniques, such as a face detector implemented in the OpenCV library in a non-limiting example, to detect a face portion(s) in the video frame from the color (or visible) channel (e.g., detect a face portion(s) from the RGB image).
  • the color channel may be a primary color channel such as an RGB channel, in a non-limiting example.
  • the face detector may propose one or more small or medium-sized bounding boxes within the video frame (e.g., referred to as a region proposal), and identify whether any of the bounding boxes includes a face portion corresponding to a human face or a type of background. By using such a “region proposal” method, the detection problem may be converted into an image classification problem. It will be appreciated by a person skilled in the art that the present invention is not limited to such a face detection technique and that other face detection techniques known in the art for detecting a face portion corresponding to a face in the video frame may also be used.
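  • The OpenCV-based detection mentioned above could, as one concrete and non-limiting example, use the Haar-cascade frontal-face model bundled with OpenCV; the particular detector model below is an assumption, since the text only refers to “a face detector implemented in the OpenCV library”.

```python
import cv2

# One possible OpenCV-based face detector: the bundled Haar-cascade
# frontal-face model. The specific model choice is an illustrative assumption.
_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(color_frame_bgr):
    """Return the largest face bounding box (x, y, w, h) in a BGR frame, or None."""
    gray = cv2.cvtColor(color_frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    # keep the largest proposal, assuming a single subject faces the camera
    return max(boxes, key=lambda b: b[2] * b[3])
```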
  • the above-mentioned generating a first type facial image and a second type facial image of the detected face portion comprises generating or producing a depth image corresponding to the detected face portion, and a grayscale or color image corresponding to the detected face portion.
  • the first type facial image may be a depth image corresponding to the detected face portion.
  • the second type facial image may be a color image corresponding to the detected face portion.
  • the second type facial image may be a grayscale image corresponding to the detected face portion. The grayscale image may be transformed from the color image corresponding to the detected face portion.
  • one or more of the bounding boxes within the video frame containing a face portion corresponding to a human face may be regarded as a detected face portion.
  • the detected face portion or region may be used to crop the depth image and color image corresponding to the detected face portion.
  • the grayscale image corresponding to the detected face portion may be transformed from the color image corresponding to the detected face portion.
  • the grayscale image may be a weighted sum of the values of the R, G, B channels.
  • grayscale may be a scale for representing the intensity of light, where the minimum value may represent black and the maximum value may represent pure white.
  • the grayscale image may be a single-channel image compared to a color image which is a 3-channel (i.e., R, G, B) image. According to various embodiments, processing a grayscale image may be faster than processing the color version.
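  • As an illustrative sketch of the grayscale conversion described above, the snippet below uses OpenCV's standard conversion; the specific weights (the BT.601 luma coefficients) are an assumption, since the text only states that the grayscale value is a weighted sum of the R, G, B values.

```python
import cv2
import numpy as np

def to_grayscale(face_color_bgr):
    """Convert a cropped 3-channel face image to a single-channel grayscale image."""
    # OpenCV's conversion is itself a weighted sum of the R, G, B values
    # (Y = 0.299 R + 0.587 G + 0.114 B, the BT.601 weights).
    return cv2.cvtColor(face_color_bgr, cv2.COLOR_BGR2GRAY)

def to_grayscale_manual(face_color_bgr):
    """Same idea written out explicitly as a weighted sum (weights are an example)."""
    b, g, r = face_color_bgr[..., 0], face_color_bgr[..., 1], face_color_bgr[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return gray.astype(np.uint8)
```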
  • the first neural network may be a first convolutional neural network (CNN).
  • the second neural network may be a second CNN.
  • the first CNN and the second CNN may each be a deep CNN, in a non-limiting example.
  • the first neural network and the second neural network may be pre-trained.
  • the first neural network and the second neural network may be trained separately for respective dimensions of the detected face portion in the video frame to determine the presence of a 2D or 3D non-live face representation.
  • one of the first neural network and the second neural network may be trained in the depth dimension and used to determine the presence of a 2D non-live face representation in the video, while the other neural network may be trained in the visible or color dimension and used to determine the presence of a 3D non-live face representation in the video.
  • the first neural network may be trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion (e.g., a live face or a 3D non-live face representation such as a non-live mask).
  • the first neural network may be configured to output an indication of the presence of a 2D non-live face representation in the video in the case of the determination that the detected face portion comprises a 2D non-live face representation.
  • an indication of presence of a 2D spoofing attack may be output in response to determining that the detected face portion comprises a 2D non-live face representation.
  • the above-mentioned determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a live face or a 3D non-live face representation may be performed.
  • the determination using the first neural network and the second neural network may be performed in sequence.
  • the second neural network may be trained using a dataset comprising grayscale images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face.
  • the second neural network may be trained using a dataset comprising color images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face.
  • the second neural network may be configured to output an indication of the presence of a 3D non-live face representation in the video in the case of the determination that the detected face portion comprises a 3D non-live face representation.
  • an indication of presence of a 3D spoofing attack may be output in response to determining that the detected face portion comprises a 3D non-live face representation.
  • the second neural network may be configured to output an indication of the presence of a live face in the video.
  • the face liveness detection as described may determine whether the detected face portion corresponding to a face in the captured video is a live face by an elimination or filtering process using the first neural network and the second neural network, where the first neural network may be used to filter 2D non-live face representations and the second neural network may be used to filter 3D non-live face representations.
  • a further determination that the detected face portion in the captured video is a live face may be made.
  • FIG. 2 depicts a schematic block diagram of a system 200 for detecting a face liveness of a face portion in a captured video according to various embodiments of the present invention, such as corresponding to the method 100 for detecting a face liveness of a face portion in a captured video as described hereinbefore according to various embodiments of the present invention.
  • the system 200 comprises a memory 204, and at least one processor 206 communicatively coupled to the memory 204 and configured to: obtain a video frame from a plurality of video frames of the captured video; detect a face portion corresponding to a face in the video frame; generate a first type facial image and a second type facial image of the detected face portion; determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a 2D non-live face representation; and determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
  • the at least one processor 206 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 206. Accordingly, as shown in FIG. 2,
  • the system 200 may further comprise a video obtaining module (or circuit) 208 configured to obtain a video frame from a plurality of video frames of the captured video; a face detection module (or circuit) 210 configured to detect a face portion corresponding to a face; an image processing module (or circuit) 212 configured to generate a first type facial image and a second type facial image of the detected face portion; and a face liveness detection module (or circuit) 214 configured to determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a 2D non-live face representation, and to determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
  • modules are not necessarily separate modules, and two or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention.
  • the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an “app”), which for example may be stored in the memory 204 and executable by the at least one processor 206 to perform the functions/operations as described herein according to various embodiments.
  • the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1; therefore, various functions/operations configured to be performed by the at least one processor 206 may correspond to various steps or operations of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness.
  • various embodiments described herein in context of the methods are analogously valid for the respective systems (e.g., which may also be embodied as devices).
  • the memory 204 may have stored therein the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214, which respectively correspond to various steps or operations of the method 100 as described hereinbefore, which are executable by the at least one processor 206 to perform the corresponding functions/operations as described herein.
  • a computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure.
  • Such a system may be taken to include one or more processors and one or more computer-readable storage mediums.
  • the system 200 described hereinbefore may include a processor (or controller) 206 and a computer-readable storage medium (or memory) 204 which are for example used in various processing carried out therein as described herein.
  • a memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
  • a “circuit” may be understood as any kind of a logic-implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof.
  • a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor).
  • a “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java.
  • a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
  • the present specification also discloses a system (which may also be embodied as a device or an apparatus) for performing the operations/functions of the methods described herein.
  • a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
  • the algorithms presented herein are not inherently related to any particular computer or other apparatus.
  • Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
  • the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps or operations of the methods described herein may be put into effect by computer code.
  • the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
  • the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the scope of the invention.
  • modules described herein may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
  • a computer program/module or method described herein may be performed in parallel rather than sequentially.
  • Such a computer program may be stored on any computer readable medium.
  • the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general- purpose computer.
  • the computer program when loaded and executed on such a general- purpose computer effectively results in an apparatus that implements the steps or operations of the methods described herein.
  • a computer program product embodied in one or more computer-readable storage mediums (non-transitory computer- readable storage medium), comprising instructions (e.g., the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214) executable by one or more computer processors to perform a method 100 for detecting a face liveness of a face portion in a captured video as described hereinbefore with reference to FIG. 1.
  • various computer programs or modules described herein may be stored in a computer program product receivable by a system (e.g., a computer system or an electronic device) therein, such as the system 200 as shown in FIG. 2, for execution by at least one processor 206 of the system 200 to perform the required or desired functions.
  • a module is a functional hardware unit designed for use with other components or modules.
  • a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist.
  • the above-mentioned computer system may be realized by any computer system (e.g., portable or desktop computer system), such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation.
  • Various methods/operations or functional modules (e.g., the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214) may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein.
  • the computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310.
  • the computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
  • the computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322.
  • the computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304.
  • the components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
  • Various example embodiments of the present invention describe detecting a face liveness of a face portion of a user (e.g., registered or authorized user) in a captured video.
  • the face portion may correspond to the face of the user.
  • the video may be captured in real-time and may comprise a plurality of video frames.
  • One or more video frames of the plurality of video frames may be processed to detect the face portion corresponding to the face and to detect a face liveness of the detected face portion.
  • processing of the captured video to detect the face liveness of the face portion may be discussed with respect to a video frame, however, it will be appreciated by a person skilled in the art that any subset of the plurality of video frames of the captured video may be processed to detect the face liveness of the face portion in the captured video.
  • a subset of the plurality of video frames of the captured video may be processed and a voting decision of determinations based on a number of consecutive frames may be made so as to obtain a more confident decision.
  • a video comprising a face portion corresponding to a human face may be received.
  • the video may comprise a plurality of video frames (or video images).
  • the video may be captured using a video capturing device.
  • the video capturing device may be, or include, a camera in an example implementation.
  • a captured video corresponding to the face of the user may comprise both color images and depth images.
  • two videos may be captured simultaneously, one including color images and the other including depth images.
  • two images may be produced, one from the color channel (e.g., RGB channel) and the other from the depth channel, simultaneously. Accordingly, a color image and a depth image may be generated for each video frame.
  • the video capturing device may be a Kinect sensor having a depth camera (the time-of-flight camera) to detect 3D motions such as full-body 3D motions (including head motions).
  • Other video capturing devices such as those comprising an infrared camera such as the Intel RealSense camera, may be used for capturing a video of a face portion corresponding to the face of the user. Infrared cameras, for example, allow access to raw depth data of the captured images.
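  • As a non-limiting sketch of such a capture setup, the snippet below grabs one aligned color/depth frame pair using the pyrealsense2 SDK for Intel RealSense cameras; the resolution, frame rate and stream formats are assumptions made for illustration.

```python
import numpy as np
import pyrealsense2 as rs   # Intel RealSense SDK Python wrapper

# Illustrative capture of one aligned color/depth frame pair from a RealSense
# camera. Resolution, frame rate and formats are assumptions for the example.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# align the depth frame to the color frame so both share the same pixel grid
align = rs.align(rs.stream.color)
try:
    frames = pipeline.wait_for_frames()
    frames = align.process(frames)
    color_image = np.asanyarray(frames.get_color_frame().get_data())  # HxWx3 uint8
    depth_image = np.asanyarray(frames.get_depth_frame().get_data())  # HxW uint16, raw depth units
finally:
    pipeline.stop()
```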
  • a face portion corresponding to the face of the user may be detected in one or more of the plurality of video frames.
  • two cameras such as, a common camera and a depth camera, may be used for capturing the face portion of the user.
  • the common camera may produce raw color images
  • the depth camera may produce raw depth images.
  • the raw color image and the depth image may be aligned to have the same size. For example, when detecting the face portion in the raw color image and when the bounding box of the face portion is identified, the same region in the depth image may be cropped so as to obtain a face depth image. After face detection, the face region(s) may be cropped from both the color image and the depth image.
  • detected face region(s) may be used to crop the face depth image(s) and face color image(s) accordingly.
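  • A minimal sketch of this aligned cropping step is shown below; the helper function, its (x, y, w, h) bounding-box convention and the 64x64 output size are assumptions for illustration, since the color and depth images are only stated to be aligned to the same size.

```python
import cv2

def crop_face_pair(color_image, depth_image, box, out_size=(64, 64)):
    """Crop the detected face region from the aligned color and depth images.

    `box` is the (x, y, w, h) bounding box found by the face detector in the
    color image; because both images are aligned, the same region is cut from
    the depth image. The 64x64 output size is an assumption for illustration.
    """
    x, y, w, h = box
    face_color = color_image[y:y + h, x:x + w]
    face_depth = depth_image[y:y + h, x:x + w]
    face_gray = cv2.cvtColor(face_color, cv2.COLOR_BGR2GRAY)
    # resize both crops to a fixed network input size
    face_gray = cv2.resize(face_gray, out_size)
    face_depth = cv2.resize(face_depth, out_size)
    return face_depth, face_gray
```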
  • the face portion may be detected by a face detector implemented in the OpenCV library.
  • the face detector may detect face portions (e.g., faces or face regions) in the video frame (i.e., images from the color channel such as the RGB channel).
  • a first type facial image and a second type facial image may be generated based on the detected face portion.
  • the first type facial image may be a face depth image generated based on the detected face portion (i.e., depth image corresponding to the detected face portion).
  • the second type facial image may be a face color image generated based on the detected face portion.
  • the face color image and the face depth image of the detected face portion may be produced from each video frame, and used as data for identifying non-live face representations (i.e., spoofings).
  • the second type facial image may be a face grayscale image generated based on the detected face portion (i.e., grayscale image corresponding to the detected face portion).
  • the face grayscale image may be transformed or converted from the face color image.
  • the grayscale image may be a weighted sum of the R, G, B channels.
  • a face depth image and a face grayscale image may be generated for each video frame.
  • the face depth image and the face grayscale image of the detected face portion may be produced from each video frame, and used as data for identifying non-live face representations (i.e., spoofings).
  • the face depth image and face grayscale image generated from each video frame may be processed using a deep learning-based classification model or deep neural framework 400 to detect a face liveness of the detected face portion in the captured video.
  • FIG. 4 illustrates a diagram of an exemplary deep neural framework 400 for detecting a face liveness of a face portion in a captured video according to various example embodiments of the present invention.
  • the deep neural framework 400 comprises a first neural network 410 (e.g., first CNN) and a second neural network 420 (e.g., second CNN).
  • the first neural network 410 and the second neural network 420 may each be a six-layer CNN structure.
  • the first neural network 410 and the second neural network 420 may each include six hidden layers.
  • the architecture of each of the first neural network 410 and the second neural network 420 may include an input layer and an output layer, as well as multiple hidden layers made of convolutional layers (applying a convolution operation to the input and passing the result to the next layer), activation functions (defining the output of a node given a set of inputs), pooling layers (combining the outputs of neuron clusters at one layer into a single neuron in the next layer), and fully connected layers (connecting every neuron in one layer to every neuron in another layer).
  • the depth image and grayscale image of the detected face portion may be fed respectively to the first neural network 410 and the second neural network 420, and analyzed with the aforementioned layers, resulting in a determination of the presence of a 2D non-live face representation and/or a 3D non-live face representation.
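  • The exact layer configuration of the six-layer CNNs is not specified in this excerpt, so the definition below is only a hedged, LeNet-style sketch with assumed channel counts, kernel sizes and a 64x64 single-channel input; PyTorch is used purely for illustration (the text itself mentions the Caffe platform).

```python
import torch.nn as nn

class SmallLivenessCNN(nn.Module):
    """LeNet-style CNN sketch: conv/pool blocks followed by fully connected layers.

    Channel counts, kernel sizes and the 64x64 single-channel input are
    assumptions; the patent only describes a small six-layer CNN in general terms.
    """

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 30
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2), # 30 -> 13
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2), # 13 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 128), nn.ReLU(),
            nn.Linear(128, num_classes),  # two classes per stage
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# two separate instances: one for depth images, one for grayscale images
net_2d = SmallLivenessCNN()   # depth channel: 2D non-live vs 3D face
net_3d = SmallLivenessCNN()   # grayscale channel: 3D mask vs live face
```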
  • the first neural network 410 and the second neural network 420 may be trained separately, one on a dataset comprising depth images relating to faces and the other on a dataset comprising grayscale images relating to faces. For example, people’s live faces or spoofing attacks (e.g., presented by way of 2D or 3D non-live face representations) may be recorded as videos and the video data may then be manually labelled as different classes and the face images may then be extracted for training the first neural network 410 and the second neural network 420. In other example embodiments, one of the first neural network 410 and the second neural network 420 may be trained on a dataset comprising depth images relating to faces and the other on a dataset comprising color images relating to faces.
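  • Training each network separately on its own labelled dataset could look like the sketch below; the dataset objects are hypothetical, and PyTorch is again used only for illustration rather than the Caffe platform mentioned in the text.

```python
from torch import nn, optim
from torch.utils.data import DataLoader

def train_liveness_cnn(model, dataset, epochs=10, lr=1e-3, batch_size=64):
    """Generic training loop for one of the two CNNs.

    `dataset` is a hypothetical torch Dataset yielding (image, label) pairs:
    depth face crops labelled {2D non-live, 3D face} for the first network, or
    grayscale face crops labelled {3D non-live, live} for the second network.
    """
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# the two stages are trained separately, on different image types and label sets
# net_2d = train_liveness_cnn(SmallLivenessCNN(), depth_face_dataset)      # hypothetical dataset
# net_3d = train_liveness_cnn(SmallLivenessCNN(), grayscale_face_dataset)  # hypothetical dataset
```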
  • a model training on a conventional deep learning platform such as Caffe deep learning platform may be performed as the saved models can be directly utilized by OpenCV.
  • processing speed of the system may be about 0.06 second per video frame to support real-time user experience.
  • the first neural network 410 may be trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion (e.g., 3D non-live face representation or live face (which is 3D)).
  • the first neural network 410 may be able to determine or identify 2D non-live face representations (e.g., 412a, 412b) which are detected as face portions in the captured video.
  • the 2D non-live face representations may be, or include, face representations presented in a print-paper (e.g., print-paper attack), a screen (e.g., video-replay attack), or combinations thereof.
  • the first neural network 410 may output an indication of the presence of a 2D non-live face representation in the case of a determination that the detected face portion comprises any type of 2D non-live face representation.
  • the first neural network 410 does not treat a 3D mask attack as an anomaly because it is still a 3D modality. Accordingly, the first neural network 410 may filter out 2D non-live face representations. In other words, the first neural network 410 may filter out presence of 2D spoofing attacks.
  • An indication of presence of the 2D spoofing attack may be output, for example via a user interface, in response to determining that the video frame comprises a 2D non-live face representation.
  • the second neural network 420 may be used to distinguish 3D masks 422a from live faces 422b.
  • 3D non-live face representations may be identified in the visible channels, i.e., from the color images or the grayscale transformed from the color images.
  • the deep neural framework 400 may be referred to as a two-stage neural network or CNN framework.
  • the CNN-based two-stage structure may be used to cover most of the popular spoofing attacks.
  • FIG. 5 shows an exemplary chart 500 illustrating exemplary non-live face representations which may be used in spoofing attacks and may be determined from a detected face portion.
  • the deep neural framework of the present invention has a clearer division of the anti-spoofing work for the two CNNs.
  • two deep CNNs are trained, one on the depth images and the other on the RGB images, to simultaneously exploit spoofing features from the two image channels.
  • the first CNN has to distinguish 3D masks from 3D real faces in the depth channel;
  • the second CNN has to distinguish (high-resolution) faces shown on paper or screens from live faces in the visible channel (RGB or other color variants). This raises challenges for the CNNs, such that they have to be deeper and need to be trained on more labelled data.
  • the two-stage CNN framework of the present invention reduces the requirement of high-capacity CNNs so that a simple deep learning model, such as the CNN by LeNet, may be used.
  • various embodiments of the present invention enable training of each of the two CNNs using a specific dataset comprising specific type of image (e.g., depth dimension or images for the first CNN and color dimension or images (from which grayscale images may be generated in some embodiments) for the second CNN), and with relatively limited numbers of training data set compared to very deep CNNs.
  • the depth information may be used to significantly improve the performance of anti-spoofing.
  • two of the three popular attacks are 2D spoofing attacks that can be easily filtered using the depth information. It is also difficult and expensive to create a high-quality face mask to conduct a 3D spoofing attack, as the difference between a human face and a mask is otherwise visibly distinguishable.
  • a decision of live face or non-live face representation of the detected face portion made on a single video frame may be a voting result of classification outcomes on a subset of consecutive video frames (e.g., the latest five video frames), which helps to account for decisions made in error (e.g., due to isolated false accepts, i.e., a non-live frame which is wrongly accepted as live, or false rejects, i.e., a live frame which is wrongly rejected as non-live).
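  • One simple way to implement such frame-level voting is sketched below; the window of five frames comes from the example above, while everything else (labels, class names) is an assumption.

```python
from collections import Counter, deque

class FrameVote:
    """Majority vote over the classification outcomes of the latest N frames."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)   # e.g. the latest five frame decisions

    def update(self, frame_decision):
        """Add one per-frame decision ('live', '2d_spoof', '3d_spoof') and return
        the current majority decision, which dampens isolated false accepts or
        false rejects on single frames."""
        self.recent.append(frame_decision)
        return Counter(self.recent).most_common(1)[0][0]

# usage sketch
voter = FrameVote(window=5)
for decision in ["live", "live", "2d_spoof", "live", "live"]:
    final = voter.update(decision)
# final == "live": the isolated '2d_spoof' frame is outvoted
```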
  • in response to determining that the detected face portion processed using the two-stage CNN framework comprises a 2D non-live face representation or a 3D non-live face representation, the two-stage CNN framework may further classify the type of spoofing attack. For example, the two-stage CNN framework may output an indication of presence of a 2D spoofing attack in response to determining that the video frame comprises a 2D non-live face representation, or output an indication of presence of a 3D spoofing attack in response to determining that the video frame comprises a 3D non-live face representation. In some embodiments, the two-stage CNN framework may output an indication that the detected face portion in the captured video is a live face.
  • the aforementioned embodiments of the present invention provide a two-stage CNN framework that enables the use of a relatively simple or shallow CNN for faster or higher processing speed while keeping high detection accuracy.
  • the CNN for processing depth images does not have to distinguish 3D masks from (3D) live faces in the depth channel
  • the CNN for processing grayscale images does not have to distinguish (high-resolution) faces shown on paper or screens from live faces in the visible channel (e.g., RGB or its variants). Accordingly, embodiments of the present invention for detecting face liveness may be more time-efficient and save computing resources.
  • the two-stage CNN framework for detecting a face liveness of a face portion in a captured video provides an effective feature representation that is discriminative for distinguishing live faces and non-live faces.
  • Various embodiments of the present invention may be implemented in a face authentication system to facilitate information/network security in an example implementation.
  • a human-computer interaction (HCI) module configured to help the user present a frontal face at a proper position to the camera may be provided.
  • the HCI module may improve the user experience.

Abstract

There is provided a method of detecting a face liveness of a face portion in a captured video. The method includes: obtaining a video frame from a plurality of video frames of the captured video; detecting a face portion corresponding to a face in the video frame; generating a first type facial image and a second type facial image of the detected face portion; determining, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.

Description

METHOD AND SYSTEM FOR FACE LIVENESS DETECTION
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of priority of Singapore Patent Application No. 10201900838R, filed 29 January 2019, the content of which is hereby incorporated by reference in its entirety for all purposes.
TECHNICAL FIELD
[0002] The present invention generally relates to a method and a system for detecting a face liveness of a face portion in a captured video.
BACKGROUND
[0003] With the rapid development of deep learning and computer vision, face authentication systems have been widely used on security doors, personal computers and mobile devices as the major or assistant means of access control through a web camera. A typical face authentication system may include two modules, i.e., a face detection module and a face recognition module. The former is to locate the face regions if face(s) appear in an image and the latter is to determine the identities of the detected face(s). A face authentication system only allows access to registered faces. However, neither of the two modules has the ability to prevent spoofing attacks. If a spoofing attacker has the photo or video of a registered face, it is easy to spoof such a face authentication system by presenting the photo of the face or replaying the video containing the face. Hence, existing face authentication systems are susceptible to spoofing attacks using non-live faces (i.e., not actual face present, which may also be referred to as fake faces herein) presented by way of photos or videos of a registered face (i.e., face spoofing attacks by a third party).
[0004] Therefore, there is an urgent need for a face liveness detection module to protect the authentication system from the spoofing attacks. Existing studies on face liveness detection chose to process the visible image patterns from the RGB channel, i.e., the face image visible to human eyes. Some methods tried to demonstrate their effectiveness against still spoofing attacks such as the print-photo attack. For example, detecting eye blinking helps to filter the fake faces presented by photos. However, it becomes ineffective when a video containing the face is replayed. Hence, these methods are not feasible or effective for a real world application. Apart from the visible image patterns, the reflection of light has been used conventionally to distinguish non-live faces from live faces. Compared with the visible image patterns, the reflection of light is demonstrated to be a better feature for accurate liveness detection. Other than the colour or lighting information from the visible channels (e.g., RGB, HSV, YCbCr), the depth information from the depth channel may be used for spoofing detection.
[0005] The face anti-spoofing task using only the visible channels may be referred to as two-dimensional (2D) face anti-spoofing, and the task using the additional depth information may be referred to as three-dimensional (3D) face anti-spoofing. Popular face spoofing attacks include print-photo attack, video replay attack and 3D mask attack. One of the biggest challenges in 2D face anti-spoofing is the lack of training data, especially the lack of spoofing faces. This is because in face recognition tasks, millions of labelled human faces can be found on social websites and used by authorized parties to train deep models. In contrast, there are only a few thousand spoofing face samples created by the research community in recent years, and this resulted in face anti-spoofing models that are unable to consider all environmental conditions. Further, an important finding is that when these models achieve very low error rate on some datasets, they perform much worse in cross-dataset and cross-camera tests.
[0006] In order to build an effective face anti-spoofing system, one conventional approach uses information from both visible and depth channels in an integrated way. For example, attempts at utilizing image information from both the visible and depth channels to enhance face anti-spoofing may train two convolutional neural networks (CNNs) to detect face spoofing in the visible channel and depth channel, respectively. The prediction results of the two CNNs were fused to make the final decision of live or non-live face for a detected face portion. However, depth information is estimated from the captured raw color image by using 3D face reconstruction algorithms instead of being directly captured by real 3D cameras. This method may facilitate a pure software-based application, but the disadvantage is that additional estimation error will be introduced to the final decision. Other conventional work extracted the depth information from the Kinect depth camera, but it is only utilized to detect a 3D mask attack.
[0007] A need therefore exists to provide a method of detecting a face liveness of a face portion in a captured video that seeks to overcome, or at least ameliorate, one or more of the deficiencies in conventional face authentication methods/systems, such as to improve accuracy and/or reliability. It is against this background that the present invention has been developed.
SUMMARY
[0008] According to a first aspect of the present invention, there is provided a method of detecting a face liveness of a face portion in a captured video using at least one processor, the method comprising:
obtaining a video frame from a plurality of video frames of the captured video;
detecting a face portion corresponding to a face in the video frame;
generating a first type facial image and a second type facial image of the detected face portion;
determining, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.
[0009] According to a second aspect of the present invention, there is provided a system for detecting a face liveness of a face portion in a captured video, the system comprising:
a memory; and
at least one processor communicatively coupled to the memory and configured to:
obtain a video frame from a plurality of video frames of the captured video;
detect a face portion corresponding to a face in the video frame;
generate a first type facial image and a second type facial image of the detected face portion;
determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.
[0010] According to a third aspect of the present invention, there is provided a computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of detecting a face liveness of a face portion in a captured video, the method comprising:
obtaining a video frame from a plurality of video frames of the captured video;
detecting a face portion corresponding to a face in the video frame;
generating a first type facial image and a second type facial image of the detected face portion;
determining, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Embodiments of the present invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
FIG. 1 depicts a schematic flow diagram of a method for detecting a face liveness of a face portion in a captured video using at least one processor according to various embodiments of the present invention;
FIG. 2 depicts a schematic block diagram of a system for detecting a face liveness of a face portion in a captured video according to various embodiments of the present invention, such as corresponding to the method shown in FIG. 1;
FIG. 3 depicts an example computer system which the system according to various embodiments of the present invention may be embodied in;
FIG. 4 illustrates an exemplary deep neural framework for detecting a face liveness of a face portion in a captured video according to various example embodiments of the present invention; and
FIG. 5 shows an exemplary chart illustrating exemplary non-live face representations which may be used in spoofing attacks and may be determined from a detected face portion.
DETAILED DESCRIPTION
[0012] Various embodiments of the present invention provide a method (computer-implemented method) and a system (including a memory and at least one processor communicatively coupled to the memory) for detecting a face liveness of a face portion in a captured video. In various embodiments, a fully automated technique for detecting face liveness of the face portion using a deep learning-based classification model or deep neural network framework is provided. The face portion may correspond to a face of a user or person. For example, the face portion in the captured video may correspond to a live face of a person which is captured in real-time, or a non-live face representation of the person which is not captured in real-time (i.e., not actual face present). The non-live face representation may be a two-dimensional (2D) non-live face representation or a three-dimensional (3D) non-live face representation. It will be appreciated by a person skilled in the art that the face liveness detection as described herein may be used to distinguish between a live face and any form of non-live face representations, including a 2D non-live face representation and/or a 3D non-live face representation. Accordingly, the face liveness detection may be used, for example, to determine presence of a spoofing attack by a third party (e.g., unauthorized user). For example, the face liveness detection may be used to facilitate face authentication systems. In various embodiments, the fully automated technique for detecting a face liveness of a face portion based on the deep neural network framework may use depth information in the captured video to determine or detect 2D non-live face representations, and visible or color information to determine or detect 3D non-live face representations.
[0013] FIG. 1 depicts a schematic flow diagram of a method 100 (computer-implemented method) for detecting a face liveness of a face portion in a captured video using at least one processor according to various embodiments of the present invention. The method 100 comprises obtaining (at 102) a video frame from a plurality of video frames of the captured video; detecting (at 104) a face portion corresponding to a face in the video frame; generating (at 106) a first type facial image and a second type facial image of the detected face portion; determining (at 108), using a first neural network based on the first type facial image, whether the detected face portion comprises a 2-dimensional (2D) non-live face representation; and determining (at 110), using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
[0014] In relation to 102, for example, the video may be captured to include a face portion corresponding to a face of a user. The face of the user captured by the video may be a live face of the user which is captured in real-time (that is, actual person present), or a non-live face representation of the user which is not captured in real-time (that is, actual person not present). The video frame may be obtained and processed to detect a face liveness of the user in the video. A video relating to the face portion of a user as described herein may be obtained by a video capturing or recording device. In various embodiments, the video capturing device may produce a video comprising the face portion and at least one processor may then obtain (or receive) the plurality of video frames (or video images) of the video for processing to detect a face liveness of the face portion in the captured video. As a non-limiting example, the video capturing device may be, or include, a mobile device, a camera, or combinations thereof. The video frames, for example, may be in a color format such as RGB format (i.e., a video frame may be an RGB image).
[0015] In various embodiments, the non-live face representation may be a 2D non-live face representation, a 3D non-live face representation, or combinations thereof. The 2D non-live face representation may be, or include, for example, an image-based face representation or rendering presented in a print-paper, a screen, or combinations thereof. The 3D non-live face representation may be, or include, a 3D face representation or rendering, for example, presented by a 3D mask. Other types or forms of face representation, which are not live faces, may also be applicable.
[0016] In relation to 104, for example, a face portion corresponding to a face in the video frame may be detected using a face detector or detection module. The face detector may be, or include, various existing face detecting techniques, such as a face detector implemented in the OpenCV library in a non-limiting example, to detect a face portion(s) in the video frame from the color (or visible) channel (e.g., detect a face portion(s) from the RGB image). The color channel may be a primary color channel such as an RGB channel, in a non-limiting example. For example, the face detector may propose one or more small or middle bounding boxes within the video frame (e.g., referred to as a region proposal), and identify if any of the bounding boxes includes a face portion corresponding to a human face or a type of background. By using such a "region proposal" method, the detection problem may be converted into an image classification problem. It will be appreciated by a person skilled in the art that the present invention is not limited to such a face detection technique and that other face detection techniques known in the art for detecting a face portion corresponding to a face in the video frame may also be used.
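As a non-limiting sketch of one such face detector, the snippet below uses the Haar-cascade frontal-face detector distributed with OpenCV to obtain face bounding boxes from the color channel; the particular cascade file and detection parameters are illustrative assumptions, and any other face detector may be substituted.

```python
import cv2

# Haar-cascade face detector shipped with OpenCV (illustrative choice only).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_boxes(color_frame_bgr):
    """Return a list of (x, y, w, h) bounding boxes of face portions detected
    in one video frame taken from the color channel."""
    gray = cv2.cvtColor(color_frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(boxes)
```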
[0017] In various embodiments, the above-mentioned generating a first type facial image and a second type facial image of the detected face portion comprises generating or producing a depth image corresponding to the detected face portion, and a grayscale or color image corresponding to the detected face portion. In various embodiments, the first type facial image may be a depth image corresponding to the detected face portion. In various embodiments, the second type facial image may be a color image corresponding to the detected face portion. In another embodiment, the second type facial image may be a grayscale image corresponding to the detected face portion. The grayscale image may be transformed from the color image corresponding to the detected face portion. For example, one or more of the bounding boxes within the video frame containing a face portion corresponding to a human face may be regarded as a detected face portion. The detected face portion or region may be used to crop the depth image and color image corresponding to the detected face portion. The grayscale image corresponding to the detected face portion may be transformed from the color image corresponding to the detected face portion. The grayscale image may be a weighted sum of the values of the R, G, B channels. For example, grayscale may be a scale for representing the intensity of light, where the minimum value may represent black and the maximum value may represent pure white. The grayscale image may be a single-channel image compared to a color image which is a 3-channel (i.e., R, G, B) image. According to various embodiments, processing a grayscale image may be faster than processing the color version.
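A minimal sketch of this image-generation step is given below; it assumes the depth frame has already been aligned to the same size as the color frame, and uses the common BT.601 weights for the weighted sum (the same conversion performed by cv2.cvtColor with COLOR_BGR2GRAY).

```python
import numpy as np

def make_facial_images(color_frame_bgr, depth_frame, box):
    """Crop the detected face region from the aligned color and depth frames,
    returning the first type (depth) and second type (grayscale) facial images."""
    x, y, w, h = box
    face_depth = depth_frame[y:y + h, x:x + w]        # first type facial image
    face_color = color_frame_bgr[y:y + h, x:x + w]    # color crop of the face
    # Grayscale as a weighted sum of the R, G, B channel values (BT.601 weights).
    b, g, r = face_color[..., 0], face_color[..., 1], face_color[..., 2]
    face_gray = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)
    return face_depth, face_gray
```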
[0018] In relation to 108, in various embodiments, the first neural network may be a first convolutional neural network (CNN). In relation to 110, in various embodiments, the second neural network may be a second CNN. The first CNN and the second CNN may each be a deep CNN, in a non-limiting example.
[0019] In various embodiments, the first neural network and the second neural network may be pre-trained. The first neural network and the second neural network may be trained separately for respective dimensions of the detected face portion in the video frame to determine the presence of a 2D or 3D non-live face representation. In various embodiments, one of the first neural network and the second neural network may be trained in the depth dimension and used to determine the presence of a 2D non-live face representation in the video, while the other neural network may be trained in the visible or color dimension and used to determine the presence of a 3D non-live face representation in the video.
[0020] In various embodiments, the first neural network may be trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion (e.g., a live face or a 3D non-live face representation such as a non-live mask). The first neural network may be configured to output an indication of the presence of a 2D non-live face representation in the video in the case of the determination that the detected face portion comprises a 2D non-live face representation. In various embodiments, an indication of presence of a 2D spoofing attack may be output in response to determining that the detected face portion comprises a 2D non-live face representation. In the case of a determination that the detected face portion does not have a 2D non-live face representation, the above-mentioned determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a live face or a 3D non-live face representation may be performed. In other words, the determination using the first neural network and the second neural network may be performed in sequence.
[0021] In various embodiments, the second neural network may be trained using a dataset comprising grayscale images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face. In another embodiment, the second neural network may be trained using a dataset comprising color images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face. The second neural network may be configured to output an indication of the presence of a 3D non-live face representation in the video in the case of the determination that the detected face portion comprises a 3D non-live face representation. In various embodiments, an indication of presence of a 3D spoofing attack may be output in response to determining that the detected face portion comprises a 3D non-live face representation. In the case of a determination that the detected face portion does not have a 3D non-live face representation, the second neural network may be configured to output an indication of the presence of a live face in the video.
[0022] Accordingly, the face liveness detection as described may determine whether the detected face portion corresponding to a face in the captured video is a live face by an elimination or filtering process using the first neural network and the second neural network, where the first neural network may be used to filter 2D non-live face representations and the second neural network may be used to filter 3D non-live face representations. In the case of a determination that the detected face portion corresponding to a face in the captured video is (i) not a 2D non-live face representation, and (ii) not a 3D non-live face representation, a further determination that the detected face portion in the captured video is a live face may be made.
[0023] FIG. 2 depicts a schematic block diagram of a system 200 for detecting a face liveness of a face portion in a captured video according to various embodiments of the present invention, such as corresponding to the method 100 for detecting a face liveness of a face portion in a captured video as described hereinbefore according to various embodiments of the present invention.
[0024] The system 200 comprises a memory 204, and at least one processor 206 communicatively coupled to the memory 204 and configured to: obtain a video frame from a plurality of video frames of the captured video; detect a face portion corresponding to a face in the video frame; generate a first type facial image and a second type facial image of the detected face portion; determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a 2D non-live face representation; and determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
[0025] It will be appreciated by a person skilled in the art that the at least one processor 206 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 206 to perform the required functions or operations. Accordingly, as shown in FIG. 2, the system 200 may further comprise a video obtaining module (or circuit) 208 configured to obtain a video frame from a plurality of video frames of the captured video; a face detection module (or circuit) 210 configured to detect a face portion corresponding to a face; an image processing module (or circuit) 212 configured to generate a first type facial image and a second type facial image of the detected face portion; and a face liveness detection module (or circuit) 214 configured to determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a 2D non-live face representation, and to determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a 3D non-live face representation.
[0026] It will be appreciated by a person skilled in the art that the above-mentioned modules (or circuits) are not necessarily separate modules, and two or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention. For example, the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an“app”), which for example may be stored in the memory 204 and executable by the at least one processor 206 to perform the functions/operations as described herein according to various embodiments.
[0027] In various embodiments, the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1; therefore, various functions/operations configured to be performed by the at least one processor 206 may correspond to various steps or operations of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness. In other words, various embodiments described herein in context of the methods are analogously valid for the respective systems (e.g., which may also be embodied as devices).
[0028] For example, in various embodiments, the memory 204 may have stored therein the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214, which respectively correspond to various steps or operations of the method 100 as described hereinbefore, which are executable by the at least one processor 206 to perform the corresponding functions/operations as described herein.
[0029] A computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure. Such a system may be taken to include one or more processors and one or more computer-readable storage mediums. For example, the system 200 described hereinbefore may include a processor (or controller) 206 and a computer-readable storage medium (or memory) 204 which are for example used in various processing carried out therein as described herein. A memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
[0030] In various embodiments, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with various alternative embodiments. Similarly, a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
[0031] Some portions of the present disclosure are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
[0032] Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “determining”, “obtaining”, “generating”, “detecting”, or the like, refer to the actions and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
[0033] The present specification also discloses a system (which may also be embodied as a device or an apparatus) for performing the operations/functions of the methods described herein. Such a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
[0034] In addition, the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps or operations of the methods described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the scope of the invention. It will be appreciated by a person skilled in the art that various modules described herein (e.g., the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214) may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
[0035] Furthermore, one or more of the steps or operations of a computer program/module or method described herein may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general-purpose computer. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps or operations of the methods described herein.
[0036] In various embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214) executable by one or more computer processors to perform a method 100 for detecting a face liveness of a face portion in a captured video as described hereinbefore with reference to FIG. 1. Accordingly, various computer programs or modules described herein may be stored in a computer program product receivable by a system (e.g., a computer system or an electronic device) therein, such as the system 200 as shown in FIG. 2, for execution by at least one processor 206 of the system 200 to perform the required or desired functions.
[0037] The software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein can also be implemented as a combination of hardware and software modules.
[0038] In various embodiments, the above-mentioned computer system may be realized by any computer system (e.g., portable or desktop computer system), such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation. Various methods/operations or functional modules (e.g., the video obtaining module 208, the face detection module 210, the image processing module 212, and/or the face liveness detection module 214) may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein. The computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310. The computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN). The computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322. The computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304. The components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
[0039] It will be appreciated by a person skilled in the art that the terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting of the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", or the like, such as “includes” and/or “including”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0040] In order that the present invention may be readily understood and put into practical effect, various example embodiments of the present invention will be described hereinafter by way of examples only and not limitations. It will be appreciated by a person skilled in the art that the present invention may, however, be embodied in various different forms or configurations and should not be construed as limited to the example embodiments set forth hereinafter. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present invention to those skilled in the art.
[0041] Various example embodiments of the present invention describe detecting a face liveness of a face portion of a user (e.g., registered or authorized user) in a captured video. For example, the face portion may correspond to the face of the user. The video may be captured in real-time and may comprise a plurality of video frames. One or more video frames of the plurality of video frames may be processed to detect the face portion corresponding to the face and to detect a face liveness of the detected face portion. For ease of discussion, processing of the captured video to detect the face liveness of the face portion may be discussed with respect to a video frame; however, it will be appreciated by a person skilled in the art that any subset of the plurality of video frames of the captured video may be processed to detect the face liveness of the face portion in the captured video. For example, a subset of the plurality of video frames of the captured video may be processed and a voting decision of determinations based on a number of consecutive frames may be made so as to obtain a more confident decision.
[0042] In various example embodiments, a video comprising a face portion corresponding to a human face (i.e., face of a user) may be received. The video may comprise a plurality of video frames (or video images). The video may be captured using a video capturing device. The video capturing device may be, or include, a camera in an example implementation. For example, a captured video corresponding to the face of the user may comprise both color images and depth images. In other implementations, two videos may be captured simultaneously, one including color images and the other including depth images. In other words, two images may be produced, one from the color channel (e.g., RGB channel) and the other from the depth channel, simultaneously. Accordingly, a color image and a depth image may be generated for each video frame. In various example embodiments, the video capturing device may be a Kinect sensor having a depth camera (the time-of-flight camera) to detect 3D motions such as full-body 3D motions (including head motions). Other video capturing devices, such as those comprising an infrared camera such as the Intel RealSense camera, may be used for capturing a video of a face portion corresponding to the face of the user. Infrared cameras, for example, allow access to raw depth data of the captured images.
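For a depth camera of the Intel RealSense type, one way of obtaining a color image and an aligned depth image for each video frame is sketched below using the pyrealsense2 Python bindings; the stream resolutions, frame rate, and the choice of this particular SDK are illustrative assumptions, and a Kinect or other depth sensor would be driven through its own interface.

```python
import numpy as np
import pyrealsense2 as rs

# Illustrative capture of one aligned color/depth frame pair (assumed settings).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align_to_color = rs.align(rs.stream.color)   # align depth to the color view

try:
    frames = align_to_color.process(pipeline.wait_for_frames())
    color_image = np.asanyarray(frames.get_color_frame().get_data())
    depth_image = np.asanyarray(frames.get_depth_frame().get_data())
finally:
    pipeline.stop()
```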
[0043] A face portion corresponding to the face of the user may be detected in one or more of the plurality of video frames. In various example embodiments, two cameras, such as a common camera and a depth camera, may be used for capturing the face portion of the user. The common camera may produce raw color images, while the depth camera may produce raw depth images. The raw color image and the depth image may be aligned to have the same size. For example, when the face portion is detected in the raw color image and the bounding box of the face portion is identified, the same region in the depth image may be cropped so as to obtain a face depth image. After face detection, the face region(s) may be cropped from both the color image and the depth image. In other words, detected face region(s) may be used to crop the face depth image(s) and face color image(s) accordingly. In various example embodiments, the face portion may be detected by a face detector implemented in the OpenCV library. The face detector may detect face portions (e.g., faces or face regions) in the video frame (i.e., images from the color channel such as the RGB channel). In various example embodiments, a first type facial image and a second type facial image may be generated based on the detected face portion. In various example embodiments, the first type facial image may be a face depth image generated based on the detected face portion (i.e., depth image corresponding to the detected face portion). In various example embodiments, the second type facial image may be a face color image generated based on the detected face portion. In other words, the face color image and the face depth image of the detected face portion may be produced from each video frame, and used as data for identifying non-live face representations (i.e., spoofings). In another example embodiment, the second type facial image may be a face grayscale image generated based on the detected face portion (i.e., grayscale image corresponding to the detected face portion). The face grayscale image may be transformed or converted from the face color image. For example, the grayscale image may be a weighted sum of the R, G, B channels. Accordingly, in various example embodiments, a face depth image and a face grayscale image may be generated for each video frame. The face depth image and the face grayscale image of the detected face portion may be produced from each video frame, and used as data for identifying non-live face representations (i.e., spoofings).
[0044] In various example embodiments, the face depth image and face grayscale image generated from each video frame may be processed using a deep learning-based classification model or deep neural framework 400 to detect a face liveness of the detected face portion in the captured video. FIG. 4 illustrates a diagram of an exemplary deep neural framework 400 for detecting a face liveness of a face portion in a captured video according to various example embodiments of the present invention. In various example embodiments, the deep neural framework 400 comprises a first neural network 410 (e.g., first CNN) and a second neural network 420 (e.g., second CNN).
[0045] In various example embodiments, the first neural network 410 and the second neural network 420 may each be a six-layer CNN structure. For example, the first neural network 410 and the second neural network 420 may each include six hidden layers. For example, the architecture of each of the first neural network 410 and the second neural network 420 may include an input layer and an output layer, as well as multiple hidden layers made of convolutional layers (applying a convolution operation to the input and passing the result to the next layer), activation functions (defining the output of a node given a set of inputs), pooling layers (combining the outputs of neuron clusters at one layer into a single neuron in the next layer), and fully connected layers (connecting every neuron in one layer to every neuron in another layer). The depth image and grayscale image of the detected face portion may be fed respectively to the first neural network 410 and the second neural network 420, and analyzed with the aforementioned layers, resulting in a determination of the presence of a 2D non-live face representation and/or a 3D non-live face representation.
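One plausible realisation of such a small six-layer structure is sketched below in PyTorch (the example embodiments themselves refer to the Caffe platform); the input resolution, channel counts, kernel sizes and layer widths are assumptions chosen only to illustrate the composition of convolutional, activation, pooling and fully connected layers, not the exact architecture of the networks 410 and 420.

```python
import torch
import torch.nn as nn

class SmallLivenessCNN(nn.Module):
    """Illustrative six-hidden-layer CNN (three convolutional blocks followed by
    three fully connected layers) classifying one single-channel facial image
    (a depth image for the first network, a grayscale image for the second)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(   # assumes 64x64 single-channel input
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_classes),  # e.g., non-live vs. pass-through class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```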
[0046] The first neural network 410 and the second neural network 420 may be trained separately, one on a dataset comprising depth images relating to faces and the other on a dataset comprising grayscale images relating to faces. For example, people’s live faces or spoofing attacks (e.g., presented by way of 2D or 3D non-live face representations) may be recorded as videos, the video data may then be manually labelled into different classes, and the face images may then be extracted for training the first neural network 410 and the second neural network 420. In other example embodiments, one of the first neural network 410 and the second neural network 420 may be trained on a dataset comprising depth images relating to faces and the other on a dataset comprising color images relating to faces. In various example embodiments, model training may be performed on a conventional deep learning platform, such as the Caffe deep learning platform, as the saved models can be directly utilized by OpenCV. In various example embodiments, the processing speed of the system may be about 0.06 seconds per video frame to support a real-time user experience.
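A compact sketch of training one stage on its own dataset is given below, written in PyTorch rather than the Caffe platform mentioned above; the dataset objects, the binary label convention, the optimiser and its settings are all assumptions for illustration.

```python
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

def train_stage(model: nn.Module, dataset: Dataset, epochs: int = 10) -> nn.Module:
    """Train one network of the framework on its own labelled dataset:
    depth-image faces for the first network, grayscale-image faces for the
    second. The two networks are trained separately."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimiser = optim.Adam(model.parameters(), lr=1e-3)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimiser.step()
    return model
```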
[0047] In various example embodiments, the first neural network 410 may be trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion (e.g., 3D non-live face representation or live face (which is 3D)). In other words, the first neural network 410 may be able to determine or identify 2D non-live face representations (e.g., 412a, 412b) which are detected as face portions in the captured video. The 2D non-live face representations may be, or include, face representations presented in a print-paper (e.g., print-paper attack), a screen (e.g., video-replay attack), or combinations thereof. Other types of 2D non-live face representations may also be applicable. For example, the first neural network 410 may output an indication of the presence of a 2D non-live face representation in the case of a determination that the detected face portion comprises any type of 2D non-live face representation. The first neural network 410 does not treat a 3D mask attack as an anomaly because it is still a 3D modality. Accordingly, the first neural network 410 may filter out 2D non-live face representations. In other words, the first neural network 410 may filter out presence of 2D spoofing attacks. An indication of presence of the 2D spoofing attack may be output, for example via a user interface, in response to determining that the video frame comprises a 2D non-live face representation.
[0048] In various example embodiments, the second neural network 420 may be used to distinguish 3D masks 422a from live faces 422b. For example, since it is difficult to create an imperceptible mask (i.e., a fake face), 3D non-live face representations may be identified in the visible channel, i.e., from the color images or the grayscale images transformed from the color images.
[0049] In various example embodiments, in the case of a determination that the video frame does not have a 2D non-live face representation, said determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a live face or a 3D non-live face representation may be performed. In other words, the determination using the first neural network and the second neural network may be performed in sequence. According to various embodiments, the deep neural framework 400 may be referred to as a two-stage neural network or CNN framework. The CNN-based two-stage structure may be used to cover most of the popular spoofing attacks. FIG. 5 shows an exemplary chart 500 illustrating exemplary non-live face representations which may be used in spoofing attacks and may be determined from a detected face portion.
[0050] Compared to previous works, the deep neural framework of the present invention has a clearer division of the anti-spoofing work for the two CNNs. In conventional systems, two deep CNNs are trained, one on the depth images and the other on the RGB images, to simultaneously exploit spoofing features from the two image channels. The first CNN has to distinguish 3D masks from 3D real faces in the depth channel; the second CNN has to distinguish (high-resolution) faces shown on paper or screens from live faces in the visible channel (RGB or other color variants). This raises challenges for the CNNs, which consequently have to be deeper and need to be trained on more labelled data. In contrast, the two-stage CNN framework of the present invention reduces the requirement of high-capacity CNNs so that a simple deep learning model, such as the LeNet CNN, may be used. For example, various embodiments of the present invention enable training of each of the two CNNs using a specific dataset comprising a specific type of image (e.g., depth images for the first CNN and color images, from which grayscale images may be generated in some embodiments, for the second CNN), and with a relatively limited amount of training data compared to very deep CNNs. Instead of spending much effort in collecting or synthesizing a large amount of spoofing data, the depth information may be used to significantly improve the performance of anti-spoofing. For example, two of the three popular attacks, the print-photo attack and the video-replay attack, are 2D spoofing attacks that can be easily filtered using the depth information. It is also difficult and expensive to create a high-quality face mask to conduct a 3D spoofing attack, i.e., the difference between a human face and a mask is visibly distinguishable.
[0051] In various example embodiments, rather than relying on a single video frame, the decision of live face or non-live face representation for the detected face portion may be a voting result of classification outcomes on a subset of consecutive video frames (e.g., the latest five video frames), which helps to account for decisions made in error (e.g., due to isolated false accepts, i.e., a non-live frame which is wrongly accepted as live, or false rejects, i.e., a live frame which is wrongly rejected as non-live).
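A simple majority vote over the most recent frame-level outcomes can be kept as follows; the five-frame window matches the example given above, while the use of a deque and a Counter is merely one convenient implementation choice.

```python
from collections import Counter, deque

class FrameVoter:
    """Majority vote over the latest N per-frame decisions (e.g., 'live face',
    '2D spoofing attack', '3D spoofing attack') to smooth out isolated
    false accepts and false rejects."""

    def __init__(self, window: int = 5):
        self.decisions = deque(maxlen=window)

    def update(self, frame_decision: str) -> str:
        """Record the latest per-frame decision and return the majority label."""
        self.decisions.append(frame_decision)
        label, _ = Counter(self.decisions).most_common(1)[0]
        return label
```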
[0052] In various example embodiments, in response to determining that the detected face portion processed using the two-stage CNN framework comprises a 2D non-live face representation or a 3D non-live face representation, the two-stage CNN framework may further classify the type of spoofing attack. For example, the two-stage CNN framework may output an indication of presence of a 2D spoofing attack in response to determining that the video frame comprises a 2D non-live face representation, or output an indication of presence of a 3D spoofing attack in response to determining that the video frame comprises a 3D non-live face representation. In some embodiments, the two-stage CNN framework may output an indication that the detected face portion in the captured video is a live face.
[0053] The aforementioned embodiments of the present invention provide a two-stage CNN framework that enables the use of a relatively simple or shallow CNN for faster or higher processing speed while keeping high detection accuracy. In addition, the CNN for processing depth images does not have to distinguish 3D masks from (3D) live faces in the depth channel, while the CNN for processing grayscale images does not have to distinguish (high-resolution) faces shown on paper or screens from live faces in the visible channel (e.g., RGB or its variants). Accordingly, embodiments of the present invention for detecting face liveness may be more time-efficient and save computing resources.
[0054] The two-stage CNN framework for detecting a face liveness of a face portion in a captured video provides an effective feature representation that is discriminative for distinguishing live faces and non-live faces. Various embodiments of the present invention may be implemented in a face authentication system to facilitate information/network security in an example implementation.
[0055] In various example embodiments, a human-computer interaction (HCI) module configured to guide the user to present a frontal face at a proper position to the camera may be provided. The HCI module may improve the user experience.
[0056] While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.


CLAIMS
What is claimed is:
1. A computer-implemented method of detecting a face liveness of a face portion in a captured video using at least one processor, the method comprising:
obtaining a video frame from a plurality of video frames of the captured video; detecting a face portion corresponding to a face in the video frame;
generating a first type facial image and a second type facial image of the detected face portion;
determining, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.
2. The method of claim 1, wherein the first type facial image comprises a depth image corresponding to the detected face portion.
3. The method of claim 1, wherein the second type facial image comprises a grayscale image corresponding to the detected face portion.
4. The method of claim 1, wherein the 2D non-live face representation comprises a face representation presented in a print-paper, a screen, or combinations thereof.
5. The method of claim 1, wherein the 3D non-live face representation comprises a face representation presented by a 3D mask.
6. The method of claim 1, wherein the first neural network is trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion.
7. The method of claim 1, wherein the second neural network is trained using a dataset comprising grayscale images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face.
8. The method of claim 1, wherein said determining, using a second neural network based on the second type facial image, whether the video frame comprises a 3D non-live face representation is in response to determining that the video frame does not have a 2D non-live face representation.
9. The method of claim 1, further comprising outputting an indication of presence of a 2D spoofing attack in response to determining that the video frame comprises a 2D non-live face representation.
10. The method of claim 1, further comprising outputting an indication of presence of a 3D spoofing attack in response to determining that the video frame comprises a 3D non-live face representation.
11. A system for detecting a face liveness of a face portion in a captured video, the system comprising:
a memory; and
at least one processor communicatively coupled to the memory and configured to: obtain a video frame from a plurality of video frames of the captured video;
detect a face portion corresponding to a face in the video frame;
generate a first type facial image and a second type facial image of the detected face portion;
determine, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determine, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non- live face representation.
12. The system according to claim 11, wherein the first type facial image comprises a depth image corresponding to the detected face portion.
13. The system according to claim 11, wherein the second type facial image comprises a grayscale image corresponding to the detected face portion.
14. The system according to claim 11, wherein the 2D non-live face representation comprises a face representation presented in a print-paper, a screen, or combinations thereof.
15. The system according to claim 11, wherein the 3D non-live face representation comprises a face representation presented by a 3D mask.
16. The system of claim 11, wherein the first neural network is trained using a dataset comprising depth images relating to faces and configured for distinguishing between a 2D non-live face representation and a 3D face portion.
17. The system of claim 11, wherein the second neural network is trained using a dataset comprising grayscale images relating to faces and configured for distinguishing between a 3D non-live face representation and a live face.
18. The system of claim 11, wherein said determining, using a second neural network based on the second type facial image, whether the video frame comprises a 3D non-live face representation is in response to determining that the video frame does not have a 2D non-live face representation.
19. The system of claim 11, wherein the at least one processor is further configured to output an indication of presence of a 2D spoofing attack in response to determining that the video frame comprises a 2D non-live face representation.
20. A computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of detecting a face liveness of a face portion in a captured video, the method comprising:
obtaining a video frame from a plurality of video frames of the captured video; detecting a face portion corresponding to a face in the video frame;
generating a first type facial image and a second type facial image of the detected face portion;
determining, using a first neural network based on the first type facial image, whether the detected face portion comprises a two-dimensional (2D) non-live face representation; and
determining, using a second neural network based on the second type facial image, whether the detected face portion comprises a three-dimensional (3D) non-live face representation.
PCT/SG2020/050029 2019-01-29 2020-01-21 Method and system for face liveness detection WO2020159437A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201900838R 2019-01-29
SG10201900838R 2019-01-29

Publications (1)

Publication Number Publication Date
WO2020159437A1 true WO2020159437A1 (en) 2020-08-06

Family

ID=71842469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050029 WO2020159437A1 (en) 2019-01-29 2020-01-21 Method and system for face liveness detection

Country Status (1)

Country Link
WO (1) WO2020159437A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766063A (en) * 2015-04-08 2015-07-08 宁波大学 Living body human face identifying method
US20170345146A1 (en) * 2016-05-30 2017-11-30 Beijing Kuangshi Technology Co., Ltd. Liveness detection method and liveness detection system
WO2018166515A1 (en) * 2017-03-16 2018-09-20 北京市商汤科技开发有限公司 Anti-counterfeiting human face detection method and system, electronic device, program and medium
CN108764069A (en) * 2018-05-10 2018-11-06 北京市商汤科技开发有限公司 Biopsy method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11508188B2 (en) * 2020-04-16 2022-11-22 Samsung Electronics Co., Ltd. Method and apparatus for testing liveness
US11836235B2 (en) 2020-04-16 2023-12-05 Samsung Electronics Co., Ltd. Method and apparatus for testing liveness
CN111680675A (en) * 2020-08-14 2020-09-18 腾讯科技(深圳)有限公司 Face living body detection method, system, device, computer equipment and storage medium
WO2022033219A1 (en) * 2020-08-14 2022-02-17 腾讯科技(深圳)有限公司 Face liveness detection method, system and apparatus, computer device, and storage medium
CN112419305A (en) * 2020-12-09 2021-02-26 深圳云天励飞技术股份有限公司 Face illumination quality detection method and device, electronic equipment and storage medium
CN112507986A (en) * 2021-02-03 2021-03-16 长沙小钴科技有限公司 Multi-channel human face in-vivo detection method and device based on neural network
CN113723243A (en) * 2021-08-20 2021-11-30 南京华图信息技术有限公司 Thermal infrared image face recognition method for wearing mask and application
CN113792701A (en) * 2021-09-24 2021-12-14 北京市商汤科技开发有限公司 Living body detection method and device, computer equipment and storage medium
CN114241587A (en) * 2022-02-23 2022-03-25 中国科学院自动化研究所 Evaluation method and device for human face living body detection confrontation robustness
CN114241587B (en) * 2022-02-23 2022-05-24 中国科学院自动化研究所 Evaluation method and device for human face living body detection confrontation robustness

Similar Documents

Publication Publication Date Title
WO2020159437A1 (en) Method and system for face liveness detection
Zhou et al. Two-stream neural networks for tampered face detection
Matern et al. Exploiting visual artifacts to expose deepfakes and face manipulations
JP6629513B2 (en) Liveness inspection method and apparatus, and video processing method and apparatus
CN106557726B (en) Face identity authentication system with silent type living body detection and method thereof
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
JP7165742B2 (en) LIFE DETECTION METHOD AND DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
JP2020525947A (en) Manipulated image detection
US9575566B2 (en) Technologies for robust two-dimensional gesture recognition
KR102257897B1 (en) Apparatus and method for liveness test,and apparatus and method for image processing
WO2021137946A1 (en) Forgery detection of face image
Zhu et al. Detection of spoofing medium contours for face anti-spoofing
Silva et al. Deepfake forensics analysis: An explainable hierarchical ensemble of weakly supervised models
KR20220063127A (en) Method, apparatus for face anti-spoofing, electronic device, storage medium, and computer program
Han et al. Two-stream neural networks for tampered face detection
CN112052832A (en) Face detection method, device and computer storage medium
Sharma et al. A survey on face presentation attack detection mechanisms: hitherto and future perspectives
Yu et al. SegNet: a network for detecting deepfake facial videos
US11341612B1 (en) Method and system for automatic correction and generation of facial images
Muhammad et al. Domain generalization via ensemble stacking for face presentation attack detection
US11335094B2 (en) Detecting fake videos
CN111191549A (en) Two-stage face anti-counterfeiting detection method
Kingra et al. SiamNet: exploiting source camera noise discrepancies using Siamese network for Deepfake detection
CN115546906A (en) System and method for detecting human face activity in image and electronic equipment
Geradts et al. Interpol review of forensic video analysis, 2019–2022

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20747932

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20747932

Country of ref document: EP

Kind code of ref document: A1