WO2022104340A1 - Artificial intelligence for passive liveness detection - Google Patents

Artificial intelligence for passive liveness detection

Info

Publication number
WO2022104340A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
embedding
input
generating
Prior art date
Application number
PCT/US2021/072328
Other languages
French (fr)
Inventor
Ashim Banerjee
Sandeep Gandhi
Goran Rauker
Original Assignee
IDMission LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by IDMission LLC filed Critical IDMission LLC
Publication of WO2022104340A1
Priority to US17/875,274 (published as US20220375259A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the present disclosure generally relates to artificial intelligence, machine learning, computer vision, computer networking, and hardware and software related thereto. More specifically, one or more aspects described herein relates to artificial intelligence for passive liveness detection.
  • An image showing the face of a user may be used by a system to verify the user’s identity.
  • an imposter may attempt a presentation attack to trick the system into granting access, despite the imposter not being authorized access. It may be difficult for a system to determine whether a presentation attack has occurred or not.
  • the first convolutional layer can typically have a large kernel (e.g., 5x5), usually with a stride of 2 or more: this will reduce the spatial dimension of the image without losing too much information, and since the input image generally has only three channels, it will not be too costly.
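  • As a concrete illustration of the layer just described, the following is a minimal PyTorch sketch of a first convolutional layer with a 5x5 kernel and stride 2; the output channel count (32) and the 224x224 input size are illustrative assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn

# First-layer sketch: a 5x5 kernel with stride 2 halves the spatial
# dimensions, and the three-channel (RGB) input keeps the cost of the
# large kernel low.
first_conv = nn.Conv2d(in_channels=3, out_channels=32,
                       kernel_size=5, stride=2, padding=2)

x = torch.randn(1, 3, 224, 224)  # one RGB image, 224x224 (assumed size)
y = first_conv(x)
print(y.shape)                   # torch.Size([1, 32, 112, 112])
```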
  • An image and one or more cropped images of a user may be used as input into one or more neural networks.
  • the one or more cropped images may show only the regions of interest (ROI) of the image (e.g., the user’s face, eyes, etc.).
  • the one or more neural networks may generate a first embedding for the image and a different embedding for the one or more cropped images.
  • the embeddings may be combined and used in one or more neural networks to generate a prediction of liveness or whether a presentation attack has occurred.
  • a computer implemented method for passive liveness detection may include receiving, by a computing device, an input image, wherein the image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises the facial portion and a second background portion that is a subset of the first background portion; generating, based on the input image and via a first convolutional neural network, a first image embedding, wherein the first convolutional neural network comprises an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers; generating, based on the cropped image and via a second convolutional neural network, a second image embedding; generating, via a concatenation of the first image embedding and the second image embedding, a combined embedding; generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person; and denying, based on the output indicating whether the facial portion corresponds to a live person, access to a computing system.
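  • The following is a minimal PyTorch sketch of the two-branch arrangement in the method above: one network embeds the full input image, a second embeds the cropped image, the embeddings are concatenated, and a sigmoid head produces the liveness output. Plain convolutions stand in for the disclosed depthwise convolutional layers, and all channel counts and sizes are illustrative assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class LivenessNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        def branch():
            # Simplified CNN branch: convolutions, an average pooling
            # layer, and a fully connected layer producing an embedding.
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(32, embed_dim),
            )
        self.full_branch = branch()   # first convolutional neural network
        self.crop_branch = branch()   # second convolutional neural network
        self.head = nn.Linear(2 * embed_dim, 1)

    def forward(self, full_img, cropped_img):
        e1 = self.full_branch(full_img)            # first image embedding
        e2 = self.crop_branch(cropped_img)         # second image embedding
        combined = torch.cat([e1, e2], dim=1)      # combined embedding
        return torch.sigmoid(self.head(combined))  # liveness output

model = LivenessNet()
score = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 112, 112))
```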
  • the first convolutional neural network may comprise a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers.
  • the second convolutional neural network may comprise a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels.
  • a first width parameter corresponding to input channels of the first convolutional neural network may be greater than a second width parameter corresponding to input channels of the second convolutional neural network.
  • the generating the cropped image may include removing, from the input image, pixels corresponding to the background portion.
  • the method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the first convolutional neural network and the second convolutional neural network, to output information that indicates liveness of each person in the first plurality of images.
  • the method may further comprise receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
  • the generating, based on the first image embedding and the second image embedding, output may comprise generating the output via a sigmoid function.
  • a computer implemented method may comprise generating, by a computing device and via a camera, a plurality of images, wherein each image of the plurality of images indicates a same person, and wherein each image of the plurality of images is generated within a threshold time of each other; generating, via a neural network and based on the plurality of images, an image embedding; generating, based on the image embedding and via a fully connected layer of the neural network, an output value; and granting, to a user device and based on the output value, access to a computing system.
  • the neural network may comprise a recurrent convolutional neural network.
  • Generating the image embedding may comprise generating, via the neural network and based on a first image, a first image embedding; and generating, via the neural network and based on both the first image embedding and a second image of the plurality of images, the image embedding.
  • the method may further comprise generating the cropped image by removing, from a first image of the plurality of images, one or more pixels corresponding to a background portion of the first image.
  • the method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the neural network, to output information that indicates liveness of each person in the first plurality of images.
  • the method may further comprise receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
  • Generating an output value may comprise generating the output value via a sigmoid function.
  • a computer implemented method may comprise receiving, by a computing device, an input image, wherein the input image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises a subset of pixels of the input image; generating, based on the input image and via a first neural network, a first image embedding; generating, based on the cropped image and via a second neural network, a second image embedding; generating, based on the first image embedding and the second image embedding, output indicating whether the facial portion corresponds to a live person; and granting, based on the output, access to a computing system.
  • the method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the first neural network and the second neural network, to output information that indicates liveness of a person in each image of the first plurality of images.
  • the method may further comprise receiving an additional image for liveness detection, wherein the additional image indicates a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
  • Generating output may comprise generating the output via a sigmoid function.
  • the first convolutional neural network may comprise a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers.
  • the second convolutional neural network may comprise a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels.
  • a system may be configured to perform one or more aspects and/or methods described herein.
  • an apparatus may be configured to perform one or more aspects and/or methods described herein.
  • one or more computer readable media may store computer executed instructions that, when executed, configure a system to perform one or more aspects and/or methods described herein.
  • FIG. 1 shows an example system configured for passive liveness detection.
  • FIG. 2 shows an example neural network that may be used for passive liveness detection.
  • FIG. 3 shows an example method for passive liveness detection.
  • FIG. 4 shows an additional example method for detecting liveness in an image.
  • FIG. 5 shows example arrays or image embeddings that may be used for passive liveness detection.
  • FIG. 6 shows an example neural network architecture that may be used for passive liveness detection.
  • FIG. 7 shows an example image that may be used for passive liveness detection.
  • FIG. 8 shows an example cropped image that may be used for passive liveness detection.
  • a framework for a machine learning algorithm may involve a combination of one or more components, sometimes three components: (1) representation, (2) evaluation, and (3) optimization components.
  • Representation components refer to computing units that perform steps to represent knowledge in different ways, including but not limited to as one or more decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles, and/or others.
  • Evaluation components refer to computing units that perform steps to represent the way hypotheses (e.g., candidate programs) are evaluated, including but not limited to accuracy, precision and recall, squared error, likelihood, posterior probability, cost, margin, entropy, K-L divergence, and/or others.
  • Optimization components refer to computing units that perform steps that generate candidate programs in different ways, including but not limited to combinatorial optimization, convex optimization, constrained optimization, and/or others.
  • other components and/or subcomponents of the aforementioned components may be present in the system described herein to further enhance and supplement the aforementioned machine learning functionality.
  • Machine learning algorithms sometimes rely on unique computing system structures.
  • Machine learning algorithms may leverage neural networks.
  • Such structures, while significantly more complex than conventional computer systems, are beneficial in implementing machine learning.
  • an artificial neural network may be comprised of a large set of nodes, which may be dynamically configured to effectuate learning and decision-making.
  • Machine learning tasks are sometimes broadly categorized as either unsupervised learning or supervised learning.
  • In unsupervised learning, a machine learning algorithm is left to generate any output (e.g., to label as desired) without feedback.
  • the machine learning algorithm may teach itself (e.g., observe past output), but otherwise operates without (or mostly without) feedback from, for example, a human administrator.
  • An embodiment involving unsupervised machine learning is described herein.
  • In supervised learning, a machine learning algorithm is provided feedback on its output. Feedback may be provided in a variety of ways, including via active learning, semi-supervised learning, and/or reinforcement learning.
  • In active learning, a machine learning algorithm is allowed to query answers from an administrator. For example, the machine learning algorithm may make a guess in a face detection algorithm, ask an administrator to identify the face in the photo, and compare the guess and the administrator’s response.
  • In semi-supervised learning, a machine learning algorithm is provided a set of example labels along with unlabeled data. For example, the machine learning algorithm may be provided a data set of 100 photos with labeled human faces and 10,000 random, unlabeled photos.
  • In reinforcement learning, a machine learning algorithm is rewarded for correct labels, allowing it to iteratively observe conditions until rewards are consistently earned. For example, for every face correctly identified, the machine learning algorithm may be given a point and/or a score (e.g., “75% correct”).
  • In inductive learning, a data representation is provided as input samples of data (x) and output samples of the function (f(x)).
  • the goal of inductive learning is to learn a good approximation for the function for new data (x), i.e., to estimate the output for new input samples in the future.
  • Inductive learning may be used on functions of various types: (1) classification functions where the function being learned is discrete; (2) regression functions where the function being learned is continuous; and (3) probability estimations where the output of the function is a probability.
  • machine learning systems and their underlying components are tuned by data scientists to perform numerous steps to perfect machine learning systems.
  • the process is sometimes iterative and may entail looping through a series of steps: (1) understanding the domain, prior knowledge, and goals; (2) data integration, selection, cleaning, and pre-processing; (3) learning models; (4) interpreting results; and/or (5) consolidating and deploying discovered knowledge.
  • This may further include conferring with domain experts to refine the goals and make the goals more clear, given the many variables that can be optimized in the machine learning system.
  • one or more of data integration, selection, cleaning, and/or pre-processing steps can sometimes be the most time consuming because the old adage, “garbage in, garbage out,” also rings true in machine learning systems.
  • an optimization process may be used to transform the machine learning model.
  • the optimization process may include (1) training the data to predict an outcome, (2) defining a loss function that serves as an accurate measure to evaluate the machine learning model’s performance, (3) minimizing the loss function, such as through a gradient descent algorithm or other algorithms, and/or (4) optimizing a sampling method, such as using a stochastic gradient descent (SGD) method where instead of feeding an entire dataset to the machine learning algorithm for the computation of each step, a subset of data is sampled sequentially.
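  • A minimal sketch of such an optimization loop, under illustrative assumptions (binary live/spoof labels, stand-in embedding features, arbitrary hyperparameters): a binary cross-entropy loss is minimized with stochastic gradient descent, sampling a mini-batch at each step rather than feeding the entire dataset.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                           # the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(1000, 128)                # stand-in embeddings
labels = torch.randint(0, 2, (1000, 1)).float()  # 1 = live, 0 = spoof

for step in range(100):
    idx = torch.randint(0, 1000, (32,))          # sample a mini-batch (SGD)
    loss = loss_fn(model(features[idx]), labels[idx])
    optimizer.zero_grad()
    loss.backward()                              # compute gradients
    optimizer.step()                             # gradient descent update
```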
  • optimization comprises minimizing the number of false positives to maximize a user’s experience.
  • an optimization function may minimize the number of missed positives to optimize minimization of losses from exploits.
  • FIG. 1 illustrates a system 100 configured for liveness detection, in accordance with one or more implementations.
  • Any module or device within system 100 may use any of the machine learning techniques described above, or described in connection with FIG. 2 below, for liveness detection.
  • Biometrics such as faces, fingerprints, retina, voice, and others may be used to identify or verify a user and can be useful for many applications where authentication is necessary.
  • biometrics are also susceptible to presentation attacks, where an impersonator may obtain a biometric sample from an authorized user, present it to a system, and gain access to the system despite not being authorized.
  • a presentation attack may include an impersonator using a biometric sample without the support of the user to whom the biometric sample belongs.
  • an impersonator may obtain an image of an authorized user’s face and present it to a facial recognition system to gain unauthorized access to a system.
  • Liveness detection may help prevent presentation attacks by detecting whether a biometric sample is being used in an authentic manner (e.g., the authorized user corresponding to the biometric sample is actually present and/or intends to gain access to a system). For example, when given an image of a user’s face, liveness detection may be used to determine whether the user was actually present when the image was taken or whether the image is simply a picture of an image of the user or some other type of presentation attack (e.g., a mask that looks like the authorized user, etc.).
  • Liveness detection may include determining whether a biometric corresponds to a live person, or a representation of the person. Liveness detection may include determining whether an image authentically shows a live person or a spoof of the person.
  • a spoof may comprise presenting (e.g., to a camera) one or more of the following (e.g., which may be designed to look like an authorized user): high resolution glossy printed photographs, high resolution matte printed photographs, paper cutout masks, paper cutout masks with eyes cut out worn on an imposter’s face, 3D layered printed paper masks, 3D layered printed masks with eyes cut out and worn on an imposter’s face, use of hair (e.g., a wig), electronic still photo on a laptop or mobile device, video on a laptop or mobile device, latex masks, silicone masks, mannequins, 3d printed face busts, etc.
  • One or more neural networks described herein may be configured to distinguish between the above-mentioned spoofs and a live person.
  • Liveness detection may include active liveness detection and passive liveness detection.
  • In active liveness detection, the user may be presented with a challenge. For example, the user may be prompted to blink, move a device, nod the user’s head, smile, or perform other actions to pass a liveness detection test.
  • Active liveness detection can lead to a poor user experience because it requires extra time and effort to perform a challenge to verify liveness.
  • Another drawback to active liveness detection is that a system may need to instruct a user to perform an action. This may signal to an imposter that a liveness detection is being performed and may assist the imposter in learning how to thwart or spoof the liveness detection.
  • passive liveness detection may be done without requiring any special challenge or action by the user.
  • a system may take a picture of a user as the user enters their log in information and may use the picture to determine liveness.
  • the system may simply take a picture of the user’s face without requiring the user to perform a special action (e.g., such as smiling or blinking).
  • the system may use the picture to determine whether the user is authentic or if a presentation attack is being used by an imposter.
  • the system, methods, and techniques described herein may be used for passive liveness detection and may provide a number of benefits over other liveness detection techniques.
  • One benefit is that the features described herein may require minimal effort from a user and thus improve user experience. For example, a user may need to simply be located in front of a camera without having to perform any special actions. This may reduce the time and effort required by the user to access a system.
  • passive liveness detection techniques described herein may allow a system to better protect against identity theft and presentation attacks. Techniques for passive liveness detection described herein may provide an improvement over active liveness detection or other liveness detection techniques by improving abilities to detect spoofs and/or by improving the user experience.
  • Active liveness detection requires the user to respond to instructions from a computing device (e.g., instructions to smile, blink, turn the user’s head, etc.). Requiring the user to perform an action adds friction to the process and results in significantly increased user frustration and abandonment of the process. Additionally, active liveness detection is easy to spoof using animated deep fake images created by a computing device. Passive liveness detection techniques described herein may make it more difficult to spoof using animated and/or deep fake images. Passive liveness detection techniques described herein may use an input image and one or more cropped images to train one or more neural networks to determine whether the input image corresponds to a presentation attack. Using one or more cropped images may allow the one or more neural networks to more accurately predict whether the input image corresponds to a presentation attack (e.g., by allowing the one or more neural networks to focus on one or more portions of a face).
  • system 100 may include one or more computing platforms 102.
  • Computing platform(s) 102 may be configured to communicate with one or more remote platforms 104 according to a client/server architecture, a peer-to-peer architecture, and/or other architectures.
  • computing platform(s) 102 may communicate with a client device 130 and receive an image taken by a camera 132 of the client device 130. The image may show a user and one or more devices in the system 100 may determine whether the user shown in the image is live or not.
  • the client device 130 may include a display 131 that may be configured to display an image taken using the camera 132.
  • a communications module 133 of the client device 130 may be configured to send one or more images captured by the camera 132 to computing platform 102.
  • a user may use the client device 130 to log in to a system.
  • the camera 132 may take a picture of the user (the user may or may not be aware of the time at which the picture is taken).
  • the communications module 133 may send the picture to computing platform 102 (e.g., the image receiving module 108) where it may undergo a liveness detection test as described herein.
  • the image receiving module 108 may format the image and/or otherwise process the image so that it may be used in a liveness detection test.
  • the communications module 133 may receive the result and/or an indication of the result of the liveness detection test from the computing platform 102 (e.g., the client device 130 may be granted access to the system or may be denied access to the system).
  • Remote platform(s) 104 may be configured to communicate with other remote platforms via computing platform(s) 102 and/or according to a client/server architecture, a peer-to-peer architecture, and/or other architectures. Users may access system 100 via remote platform(s) 104.
  • Computing platform(s) 102 may be configured by machine-readable instructions 106.
  • Machine-readable instructions 106 may include one or more instruction modules.
  • the instruction modules may include computer program modules.
  • the instruction modules may include one or more of image receiving module 108, image generating module 110, embedding generating module 112, output generating module 114, neural network training module 116, image determination module 118, and/or other instruction modules.
  • Image receiving module 108 may be configured to receive, by a computing device, an image.
  • the image may depict a user, and the computing platform(s) 102 may be tasked with determining whether the user shown is live or not (e.g., whether a presentation attack is being used by an imposter, whether the image is a fake, whether the depicted user was in front of the camera when the image was taken, etc.).
  • the image may include a facial portion and a background portion.
  • the image may include a facial portion 705 and a background portion 710.
  • the facial portion may include any portion of the user’s face (e.g., eyes, ears, nose, mouth, etc.).
  • the background portion may include any portion of the image that does not include the user.
  • Image generating module 110 may be configured to generate, based on the image, one or more cropped images.
  • the image generating module 110 may crop an image received from the user.
  • a cropped image may include the facial portion and a subset of the background portion of the image.
  • a portion of the background may be removed from the image received by the image receiving module 108 to generate the cropped image.
  • the cropped image may comprise a facial portion 805.
  • the image generating module 110 may crop the image so that the cropped image includes only the user’s face (e.g., the cropped image may omit any part of the user that is not the user’s face and/or the cropped image may omit the background portion of the image).
  • Image generating module 110 may be configured to generate the cropped image by removing one or more pixels corresponding to a background portion of the image received from the user. For example, the image generating module 110 may use machine learning (e.g., object detection techniques, face detection techniques, etc.) to determine pixels corresponding to the user and/or the user’s face and may crop the image such that only the portion of the image that corresponds to the user and/or user’s face remain in the cropped image.
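  • As a hedged sketch of this cropping step, the following uses a stock OpenCV face detector as a stand-in for whatever detection model an implementation might use; the file paths are hypothetical.

```python
import cv2

# Stock Haar-cascade face detector (a stand-in detection model).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("input.jpg")  # hypothetical input image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]
    cropped = image[y:y + h, x:x + w]  # keep only the facial pixels
    cv2.imwrite("cropped.jpg", cropped)
```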
  • the computing platform 102 may use the image received from the user and the one or more cropped images to determine liveness of a user shown in the image.
  • the image and the cropped image may be input into one or more neural networks (e.g., as described in more detail below) to generate a prediction of liveness of a user shown in the image.
  • Using one or more cropped images in addition to the image may improve the ability (e.g., accuracy may be improved) of computing platform 102 to determine liveness of the user shown in the image or whether a presentation attack is occurring.
  • Image generating module 110 may be configured to generate, by a computing device and via a camera, a plurality of images.
  • the computing platform 102 may cause a camera of a client device to take a plurality of images of a user (e.g., each image in the plurality of images may show the same user).
  • Each image in the plurality of images may be taken within a threshold period of time (e.g., there may be a period of time between the time at which each image is taken).
  • the period of time between when each image is taken may be 200ms (other amounts of time may be used such as 500ms, 1 second, or any other amount of time).
  • the threshold period of time may indicate a time period that each image of the plurality of images must be taken within. For example, if the threshold time period is 800 milliseconds, the camera may be required to take each image of the plurality of images within 800 milliseconds.
  • One or more of the plurality of images may include a cropped image.
  • Image generating module 110 may cause a camera to capture one or more images and/or video of a user while the user logs into an account or device. For example, image generating module 110 may cause a camera to capture a plurality of images of a user (e.g., 5 images, 10 images, 20 images, etc.). The image generating module 110 may determine a liveness score from each image of the plurality of images. The liveness score may comprise an indication of how open the user’s eyes are in the image. The image with the highest liveness score (e.g., the image showing the user’s eyes being the most open) may be used as the input image (e.g., the input image discussed below in connection with any of FIGS. 3-8).
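  • One plausible proxy for how open the eyes are is the eye-aspect ratio over six eye landmarks; the sketch below assumes a hypothetical landmarks_for callable and is not necessarily the scoring the disclosure contemplates.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye-aspect ratio over six landmark points (a (6, 2) array):
    more open eyes yield larger values."""
    a = np.linalg.norm(eye[1] - eye[5])  # vertical distance 1
    b = np.linalg.norm(eye[2] - eye[4])  # vertical distance 2
    c = np.linalg.norm(eye[0] - eye[3])  # horizontal distance
    return (a + b) / (2.0 * c)

def pick_input_image(images, landmarks_for):
    """Return the image with the most open eyes, mirroring the
    'highest liveness score' selection described above."""
    return max(images, key=lambda img: eye_aspect_ratio(landmarks_for(img)))
```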
  • Embedding generating module 112 may be configured to generate, based on the image and via a neural network (e.g., a convolutional neural network or any other type of neural network), an embedding (e.g., an image embedding).
  • the neural network may include any component of the neural network described in connection with FIG. 2 or the convolutional neural network described in connection with FIG. 3.
  • the neural network may be used to generate an image embedding of an image received from a user.
  • the embedding may be a representation of the image with reduced dimensionality. For example, a 12-megapixel image may be reduced to an embedding of 1024 dimensions.
  • a neural network may be used to generate an image embedding using a cropped version of the input image.
  • An image embedding may be a vector representation of the image.
  • the image embedding may be a vector of numbers, where the vector has size N (e.g., 200, 500, 1024, 2028, 4056, or any other size).
  • the image embedding may be the output of a layer within a neural network.
  • the image embedding may be the output of a fully connected layer, a pooling layer (e.g., average pooling, max pooling, etc.), or other layer of a neural network (e.g., a convolutional neural network).
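  • For example, an embedding may be read out as the output of an internal layer of an off-the-shelf CNN; the sketch below uses torchvision's MobileNetV2 (an assumed backbone, not necessarily the one contemplated here) and takes the pooled features by replacing the classification head.

```python
import torch
import torchvision.models as models

backbone = models.mobilenet_v2(weights=None)
backbone.classifier = torch.nn.Identity()  # drop the classification head

x = torch.randn(1, 3, 224, 224)
embedding = backbone(x)   # output of the average pooling stage
print(embedding.shape)    # torch.Size([1, 1280])
```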
  • One or more neural networks used to generate an embedding may include a plurality of layers and/or a plurality of input channels.
  • the plurality of layers may include any type of layer described in more detail in connection with FIG. 2 below (e.g., convolutional layer, pooling layer, fully connected layer, etc.).
  • one or more neural networks may comprise an average pooling layer, a fully connected layer, and/or a plurality of depthwise convolutional layers (e.g., as explained in more detail in connection with FIGS. 2-3 below).
  • the plurality of input channels may correspond to the type of images the neural network receives as input.
  • an image may comprise three dimensions.
  • the dimensions or shape of an image may be c × h × w, where c is the number of input channels, h is the height (e.g., the number of rows of pixels), and w is the width (e.g., the number of columns of pixels).
  • the number of input channels, c, may be three (e.g., for an RGB image), and the number of input channels at the input layer of the neural network may be three. Subsequent layers of the neural network may have different numbers of input channels.
  • a second layer in the neural network may generate an output that has more or less than the number of input channels of a previous first layer of the neural network.
  • a third layer that follows the second layer may have a number of input channels that equals the number of input channels generated in the output of the second layer.
  • a width parameter may be used to modify the number of input channels that are used in each layer in a neural network. Modifying or reducing the number of input channels using the width parameter may cause the neural network to reduce in size (e.g., the number of trainable parameters of the neural network may be reduced). This may increase the speed at which the neural network may be used (e.g., reduce time needed to make a prediction, reduce time needed to train, etc.) and/or may reduce the amount of computer processing power needed to use the neural network.
  • the width parameter may be a value d between 0 and 1. The width parameter may be multiplied by the number of input channels at each layer so that the number of input channels at each layer becomes d · c.
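  • A small sketch of the width parameter's effect, using illustrative (loosely MobileNet-like) channel counts:

```python
# Per-layer channel counts of a hypothetical network.
base_channels = [32, 64, 128, 128, 256, 256, 512]

def apply_width(channels, d):
    # Multiply every layer's channel count by d, keeping at least one.
    return [max(1, int(d * c)) for c in channels]

print(apply_width(base_channels, 1.0))  # full-width network
print(apply_width(base_channels, 0.5))  # half-width: fewer trainable
                                        # parameters, faster inference
```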
  • the width parameter may be different for different neural networks.
  • the width parameter for a first neural network that takes a full (e.g., uncropped) image as input may be a first value (e.g., 1, 0.75, etc.)
  • the width parameter for a second neural network that takes a cropped image as input may be a second value (e.g., 0.25, 0.5, 0.6, etc.).
  • Embedding generating module 112 may generate an embedding using a plurality of images (e.g., a plurality of images taken within a threshold time as discussed above). Embedding generating module 112 may be configured to generate, via a neural network and based on a plurality of images of a user, an embedding.
  • a recurrent neural network (RNN) (e.g., a recurrent convolutional neural network) may be used to generate the embedding based on the plurality of images.
  • the plurality of images may be in order of when each image in the plurality of images was taken.
  • the RNN may take as input the first image of the plurality of images and generate an embedding.
  • At each subsequent time step, the RNN may use as input the embedding generated at the previous time step as well as the next image in the plurality of images.
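  • A sketch of this recurrent scheme, in which a small CNN embeds each frame and a GRU (an assumed stand-in for whichever recurrent cell is used) carries the embedding forward so each step sees the previous step's state plus the next image; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SequenceLiveness(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(              # per-frame embedder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim))
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)    # fully connected layer

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        per_frame = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, hidden = self.rnn(per_frame)        # final sequence embedding
        return torch.sigmoid(self.head(hidden[-1]))  # output value

score = SequenceLiveness()(torch.randn(1, 5, 3, 112, 112))
```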
  • One or more embeddings may be used to determine a liveness prediction.
  • a first embedding of an image and a second embedding of a cropped version of the image may be concatenated (e.g., by the embedding generating module 112).
  • Embedding generating module 112 may be configured to generate, via a concatenation of the first embedding and the second embedding, a combined embedding (e.g., as discussed in more detail in connection with FIG. 4 below).
  • Output generating module 114 may be configured to generate, based on the combined embedding, a prediction of liveness corresponding to the input image (e.g., a prediction whether a presentation attack is being used).
  • output generating module 114 may output information indicating whether a user shown in an image corresponds to a live person (e.g., as discussed in more detail in connection with FIG. 4 below).
  • the step of generating, based on the combined embedding, a prediction of liveness is without requiring any special challenge or action by the user; for example, the system may simply take a photo/video of the user’s face without requiring the user to perform a special action (e.g., such as smiling or blinking). The system may then perform the aforementioned steps using the photo/video to determine whether the user is authentic or if a presentation attack is being used by an imposter.
  • Neural network training module 116 may be configured to train one or more neural networks to detect liveness (e.g., whether a presentation attack is being attempted). Neural network training module 116 may use training data comprising a plurality of images. The images in the training data may be labeled with an indication of whether a user shown in each image is live and/or whether the image corresponds to a presentation attack. The images in the training data may comprise cropped versions of each image (e.g., the cropped version may show just the face of the user). The neural network may be trained using any technique (e.g., backpropagation, regularization, etc.) described in connection with FIG. 2 below.
  • Image determination module 118 may be configured to determine whether an image is suitable for liveness detection. There may be one or more criteria that are checked as part of an initial determination of liveness. The initial determination of liveness may be performed before a neural network is used to generate information indicating liveness as described above. For example, the initial determination may be performed prior to generating one or more embeddings and making a prediction of whether the image includes a live person (e.g., that the image is not a spoof or that the image does not correspond to a presentation attack). Image determination module 118 may analyze an input image and may determine whether the image is suitable for liveness detection. Image determination module 118 may be configured to determine, based on facial features of the person, that the additional image is not suitable for liveness detection.
  • image determination module 118 may use image processing techniques to determine whether the user’s eyes are open or closed. If the user’s eyes are closed, the image determination module 118 may determine that the image is not suitable for liveness detection and/or may determine that a user shown in the image is not a live user. Image determination module 118 may use image processing techniques to determine focus of a camera or blurriness of a captured image. If the blurriness of a captured image satisfies a blurriness threshold, the image determination module 118 may cause an additional image to be captured. Image determination module 118 may use image processing techniques to determine a glare level within a captured image. If the glare level fails to satisfy one or more thresholds, the image determination module 118 may cause an additional image to be captured.
  • Image determination module 118 may use image processing techniques to determine a head angle of a captured image. If the head angle in a captured image fails to satisfy one or more thresholds or criteria (e.g., the image is taken from too high of an angle above the user’s head, too low of an angle below the user’s head, etc.), the image determination module 118 may cause an additional image to be captured.
  • Image determination module 118 may use image processing techniques to determine a distance between the user’s face and the camera. If the distance fails to satisfy one or more thresholds (e.g., the user is too far away from the camera, or the user is too close to the camera), the image determination module 118 may cause an additional image to be captured.
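  • As one example of such a pre-check, blurriness is commonly measured as the variance of the Laplacian; the sketch below implements that single check (the threshold value is an assumption), and glare, head-angle, and distance checks would follow the same pattern.

```python
import cv2

def is_suitable(image_bgr, blur_threshold=100.0):
    # Variance of the Laplacian: low values indicate a blurry image,
    # in which case an additional image should be captured.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= blur_threshold
```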
  • computing platform(s) 102, remote platform(s) 104, and/or external resources 122 may be operatively linked via one or more electronic communication links.
  • electronic communication links may be established, at least in part, via a network such as the Internet and/or other networks. It will be appreciated that this is not intended to be limiting, and that the scope of this disclosure includes implementations in which computing platform(s) 102, remote platform(s) 104, and/or external resources 122 may be operatively linked via some other communication media.
  • a given remote platform 104 may include one or more processors configured to execute computer program modules.
  • the computer program modules may be configured to enable an expert or user associated with the given remote platform 104 to interface with system 100 and/or external resources 122, and/or provide other functionality attributed herein to remote platform(s) 104.
  • a given remote platform 104 and/or a given computing platform 102 may include one or more of a server, a desktop computer, a laptop computer, a handheld computer, a tablet computing platform, a NetBook, a Smartphone, a gaming console, and/or other computing platforms.
  • External resources 122 may include sources of information outside of system 100, external entities participating with system 100, and/or other resources. In some implementations, some or all of the functionality attributed herein to external resources 122 may be provided by resources included in system 100.
  • Computing platform(s) 102 may include electronic storage 124, one or more processors 126, and/or other components. Computing platform(s) 102 may include communication lines, or ports to enable the exchange of information with a network and/or other computing platforms. Illustration of computing platform(s) 102 in FIG. 1 is not intended to be limiting. Computing platform(s) 102 may include a plurality of hardware, software, and/or firmware components operating together to provide the functionality attributed herein to computing platform(s) 102. For example, computing platform(s) 102 may be implemented by a cloud of computing platforms operating together as computing platform(s) 102.
  • Electronic storage 124 may comprise non-transitory storage media that electronically stores information.
  • the electronic storage media of electronic storage 124 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with computing platform(s) 102 and/or removable storage that is removably connectable to computing platform(s) 102 via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.).
  • Electronic storage 124 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media.
  • Electronic storage 124 may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources).
  • Electronic storage 124 may store software algorithms, information determined by processor(s) 126, information received from computing platform(s) 102, information received from remote platform(s) 104, and/or other information that enables computing platform(s) 102 to function as described herein.
  • Processor(s) 126 may be configured to provide information processing capabilities in computing platform(s) 102.
  • processor(s) 126 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information.
  • processor(s) 126 is shown in FIG. 1 as a single entity, this is for illustrative purposes only.
  • processor(s) 126 may include a plurality of processing units. These processing units may be physically located within the same device, or processor(s) 126 may represent processing functionality of a plurality of devices operating in coordination.
  • Processor(s) 126 may be configured to execute modules 108, 110, 112, 114, 116, and/or 118, and/or other modules.
  • Processor(s) 126 may be configured to execute modules 108, 110, 112, 114, 116, and/or 118, and/or other modules by software; hardware; firmware; some combination of software, hardware, and/or firmware; and/or other mechanisms for configuring processing capabilities on processor(s) 126.
  • the term “module” may refer to any component or set of components that perform the functionality attributed to the module. This may include one or more physical processors during execution of processor readable instructions, the processor readable instructions, circuitry, hardware, storage media, or any other components.
  • modules 108, 110, 112, 114, 116, and/or 118 are illustrated in FIG. 1 as being implemented within a single processing unit, in implementations in which processor(s) 126 includes multiple processing units, one or more of modules 108, 110, 112, 114, 116, and/or 118, may be implemented remotely from the other modules.
  • the description of the functionality provided by the different modules 108, 110, 112, 114, 116, and/or 118, described below is for illustrative purposes, and is not intended to be limiting, as any of modules 108, 110, 112, 114, 116, and/or 118, may provide more or less functionality than is described.
  • modules 108, 110, 112, 114, 116, and/or 118 may be eliminated, and some or all of its functionality may be provided by other ones of modules 108, 110, 112, 114, 116, and/or 118.
  • processor(s) 126 may be configured to execute one or more additional modules that may perform some or all of the functionality attributed below to one of modules 108, 110, 112, 114, 116, and/or 118.
  • FIG. 2 illustrates a simplified artificial neural network (e.g., neural network) 200 on which a machine learning algorithm may be executed.
  • the artificial neural network shown in FIG. 2 may be used for liveness detection as described in connection with FIGS. 1, and 3-5.
  • FIG. 2 is merely an example of an artificial neural network; other forms of nonlinear processing may be used to implement a machine learning algorithm in accordance with features described herein.
  • each of input nodes 210a-n is connected to a first set of processing nodes 220a-n.
  • Each of the first set of processing nodes 220a-n is connected to each of a second set of processing nodes 230a-n.
  • Each of the second set of processing nodes 230a-n is connected to each of output nodes 240a-n.
  • any number of processing nodes may be implemented.
  • Although only four input nodes, five processing nodes per set, and two output nodes are shown in FIG. 2, any number of nodes may be implemented per set. Data flows through FIG. 2 as follows: data may be input into an input node, may flow through one or more processing nodes, and may be output by an output node.
  • Input into the input nodes 210a-n may originate from an external source 260.
  • Output may be sent to a feedback system 250 and/or to storage 270.
  • the feedback system 250 may send output to the input nodes 210a-n for successive processing iterations with the same or different input data.
  • the system may use machine learning to determine an output.
  • the output may include anomaly scores, heat scores/values, confidence values, and/or classification output.
  • the system may use any machine learning model including xgboosted decision trees, auto-encoders, perceptron, decision trees, support vector machines, regression, and/or a neural network.
  • the neural network may be any type of neural network including a feed forward network, radial basis network, recurrent neural network, long/short term memory, gated recurrent unit, auto encoder, variational autoencoder, convolutional network, residual network, Kohonen network, MobileNet, GoogleNet, VGG 16, Squeezenet, AlexNet, and/or other type of network.
  • the neural network may comprise depthwise separable convolutions, such as in a MobileNet architecture.
  • the output data in the machine learning system may be represented as multi-dimensional arrays, an extension of two-dimensional tables (such as matrices) to data with higher dimensionality.
  • the neural network may include an input layer, a number of intermediate layers, and an output layer. Each layer may have its own weights.
  • the input layer may be configured to receive as input one or more feature vectors described herein.
  • the intermediate layers may be convolutional layers, pooling layers, dense (fully connected) layers, and/or other types.
  • the input layer may pass inputs to the intermediate layers.
  • each intermediate layer may process the output from the previous layer and then pass output to the next intermediate layer.
  • the output layer may be configured to output a classification or a real value.
  • the layers may include convolutional layers, pooling layers, depthwise convolutional layers, and/or any other type of layer.
  • the layers in the neural network may use an activation function such as a sigmoid function, a Tanh function, a ReLu function, and/or other functions.
  • the neural network may include a loss function.
  • a loss function may, in some examples, measure a number of missed positives; alternatively, it may measure a number of false positives.
  • the loss function may be used to determine error when comparing an output value and a target value. For example, when training the neural network the output of the output layer may be used as a prediction and may be compared with a target value of a training instance to determine an error. The error may be used to update weights in each layer of the neural network.
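  • As a worked example of determining error between an output value and a target value, binary cross-entropy is one common loss for a live/spoof label (an illustrative choice, not necessarily the disclosed loss):

```python
import math

def binary_cross_entropy(prediction, target):
    eps = 1e-7  # avoid log(0)
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# The network predicts 0.9 "live" for an image labeled live (1):
print(binary_cross_entropy(0.9, 1))  # small error, ~0.105
```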
  • the neural network may include a technique for updating the weights in one or more of the layers based on the error.
  • the neural network may use gradient descent to update weights.
  • the neural network may use an optimizer to update weights in each layer.
  • the optimizer may use various techniques, or combination of techniques, to update weights in each layer.
  • the neural network may include a mechanism to prevent overfitting, such as regularization (e.g., L1 or L2), dropout, and/or other techniques.
  • the neural network may also increase the amount of training data used to prevent overfitting.
  • FIG. 2 depicts nodes that may perform various types of processing, such as discrete computations, computer programs, and/or mathematical functions implemented by a computing device.
  • the input nodes 210a-n may comprise logical inputs of different data sources, such as one or more data servers.
  • the processing nodes 220a-n may comprise parallel processes executing on multiple servers in a data center.
  • the output nodes 240a-n may be the logical outputs that ultimately are stored in results data stores, such as the same or different data servers as for the input nodes 210a-n.
  • the nodes need not be distinct. For example, two nodes in any two sets may perform the exact same processing. The same node may be repeated for the same or different sets.
  • Each of the nodes may be connected to one or more other nodes.
  • the connections may connect the output of a node to the input of another node.
  • a connection may be correlated with a weighting value. For example, one connection may be weighted as more important or significant than another, thereby influencing the degree of further processing as input traverses across the artificial neural network.
  • Such connections may be modified such that the artificial neural network 200 may learn and/or be dynamically reconfigured.
  • nodes are depicted as having connections only to successive nodes in FIG. 2, connections may be formed between any nodes.
  • one processing node may be configured to send output to a previous processing node.
  • Input received in the input nodes 210a-n may be processed through processing nodes, such as the first set of processing nodes 220a-n and the second set of processing nodes 230a-n.
  • the processing may result in output in output nodes 240a-n.
  • processing may comprise multiple steps or sequences.
  • the first set of processing nodes 220a-n may be a rough data filter
  • the second set of processing nodes 230a-n may be a more detailed data filter.
  • the artificial neural network 200 may be configured to effectuate decision-making.
  • the artificial neural network 200 may be configured to detect faces in photographs.
  • the input nodes 210a-n may be provided with a digital copy of a photograph.
  • the first set of processing nodes 220a-n may be each configured to perform specific steps to remove non-facial content, such as large contiguous sections of the color red.
  • the second set of processing nodes 230a-n may be each configured to look for rough approximations of faces, such as facial shapes and skin tones. Multiple subsequent sets may further refine this processing, each looking for further more specific tasks, with each node performing some form of processing which need not necessarily operate in the furtherance of that task.
  • the artificial neural network 200 may then predict the location of the face. The prediction may be correct or incorrect.
  • the feedback system 250 may be configured to determine whether or not the artificial neural network 200 made a correct decision.
  • Feedback may comprise an indication of a correct answer and/or an indication of an incorrect answer and/or a degree of correctness (e.g., a percentage).
  • the feedback system 250 may be configured to determine if the face was correctly identified and, if so, what percentage of the face was correctly identified.
  • the feedback system 250 may already know a correct answer, such that the feedback system may train the artificial neural network 200 by indicating whether it made a correct decision.
  • the feedback system 250 may comprise human input, such as an administrator telling the artificial neural network 200 whether it made a correct decision.
  • the feedback system may provide feedback (e.g., an indication of whether the previous output was correct or incorrect) to the artificial neural network 200 via input nodes 210a-n or may transmit such information to one or more nodes.
  • the feedback system 250 may additionally or alternatively be coupled to the storage 270 such that output is stored.
  • the feedback system may not have correct answers at all, but instead base feedback on further processing: for example, the feedback system may comprise a system programmed to identify faces, such that the feedback allows the artificial neural network 200 to compare its results to that of a manually programmed system.
  • the artificial neural network 200 may be dynamically modified to learn and provide better input. Based on, for example, previous input and output and feedback from the feedback system 250, the artificial neural network 200 may modify itself. For example, processing in nodes may change and/or connections may be weighted differently. Following on the example provided previously, the facial prediction may have been incorrect because the photos provided to the algorithm were tinted in a manner which made all faces look red. As such, the node which excluded sections of photos containing large contiguous sections of the color red could be considered unreliable, and the connections to that node may be weighted significantly less. Additionally or alternatively, the node may be reconfigured to process photos differently. The modifications may be predictions and/or guesses by the artificial neural network 200, such that the artificial neural network 200 may vary its nodes and connections to test hypotheses.
  • the artificial neural network 200 need not have a set number of processing nodes or number of sets of processing nodes, but may increase or decrease its complexity. For example, the artificial neural network 200 may determine that one or more processing nodes are unnecessary or should be repurposed, and either discard or reconfigure the processing nodes on that basis. As another example, the artificial neural network 200 may determine that further processing of all or part of the input is required and add additional processing nodes and/or sets of processing nodes on that basis.
  • the feedback provided by the feedback system 250 may be mere reinforcement (e.g., providing an indication that output is correct or incorrect, awarding the machine learning algorithm a number of points, or the like) or may be specific (e.g., providing the correct output).
  • the machine learning algorithm 200 may be asked to detect faces in photographs. Based on an output, the feedback system 250 may indicate a score (e.g., 75% accuracy, an indication that the guess was accurate, or the like) or a specific response (e.g., specifically identifying where the face was located).
  • the artificial neural network 200 may be supported or replaced by other forms of machine learning.
  • one or more of the nodes of artificial neural network 200 may implement a decision tree, associational rule set, logic programming, regression model, cluster analysis mechanisms, Bayesian network, propositional formulae, generative models, and/or other algorithms or forms of decision-making.
  • the artificial neural network 200 may effectuate deep learning.
  • FIG. 3 illustrates a method 300 for liveness detection, in accordance with one or more implementations.
  • the operations of method 300 presented below are intended to be illustrative. In some implementations, method 300 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 300 are illustrated in FIG. 3 and described below is not intended to be limiting.
  • method 300 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information).
  • the one or more processing devices may include one or more devices executing some or all of the operations of method 300 in response to instructions stored electronically on an electronic storage medium.
  • the one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 300.
  • An operation 302 may include receiving, by a computing device, an image.
  • the image may include a facial portion and a first background portion.
  • Operation 302 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to image receiving module 108, in accordance with one or more implementations.
  • An operation 304 may include generating, based on the image, a cropped image.
  • the cropped image may include the facial portion and a second background portion that is a subset of the first background portion.
  • Operation 304 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to image generating module 110, in accordance with one or more implementations.
  • An operation 306 may include generating, based on the image and via a first convolutional neural network, a first embedding.
  • the first convolutional neural network may include an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers.
  • Operation 306 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
  • An operation 308 may include generating, based on the cropped image and via a second convolutional neural network, a second embedding. Operation 308 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
  • An operation 310 may include generating, via a concatenation of the first embedding and the second embedding, a combined embedding. Operation 310 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
  • An operation 312 may include generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person. Operation 312 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to output generating module 114, in accordance with one or more implementations.
  • An operation 314 may include determining whether the image received in step 302 passes a liveness test. Operation 314 may be performed as described in connection with step 450 in FIG. 4 below. For example, the output generated in step 312 may be used to determine whether the image passes a liveness test and/or corresponds to a presentation attack. Operation 314 may be performed by the computing platform 102 or one or more hardware processors configured by machine-readable instructions. If it is determined that the image passes the liveness test, an operation 316 may be performed. If it is determined that the image does not pass the liveness test, an operation 318 may be performed. Operation 316 may include authorizing an action. Authorizing an action is described in more detail below in connection with step 455 of FIG. 4. Operation 318 may include declining authorization. Declining authorization is described in more detail below in connection with step 460 of FIG. 4.
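Purely as an illustration of the flow of operations 302-314, a Python (PyTorch) sketch is shown below. The tensor layout, the face_box parameter, and the three module names are assumptions made for this sketch, not elements of the disclosed modules.

```python
import torch

def detect_liveness(image: torch.Tensor,
                    face_box: tuple,
                    branch_full: torch.nn.Module,
                    branch_crop: torch.nn.Module,
                    classifier_head: torch.nn.Module) -> bool:
    """High-level mirror of operations 304-314.

    `image` is a (1, 3, H, W) tensor; `face_box` is an assumed
    (y0, y1, x0, x1) bounding box for the facial portion; the three
    modules are hypothetical stand-ins for the disclosed networks.
    """
    y0, y1, x0, x1 = face_box
    cropped = image[:, :, y0:y1, x0:x1]            # operation 304: cropped image

    emb_full = branch_full(image)                  # operation 306: first embedding
    emb_crop = branch_crop(cropped)                # operation 308: second embedding
    combined = torch.cat([emb_full, emb_crop], 1)  # operation 310: combined embedding

    score = torch.sigmoid(classifier_head(combined))  # operation 312: liveness output
    return score.item() > 0.5                      # operation 314: pass/fail decision
```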
  • FIG. 4 shows an example method 400 for detecting liveness of a user in an image.
  • the example method 400 may be performed using any device, module, and/or component described in connection with FIGS. 1-3 and/or other device(s). Although one or more steps of the example method of FIG. 4 are described for convenience as being performed by the computing platform 102, one, some, or all of such steps may be performed by one or more other devices, and steps may be distributed among one or more devices, including any devices such as those described in connection with FIGS. 1-3. One or more steps of the example method of FIG. 4 may be rearranged, modified, repeated, and/or omitted.
  • the computing platform 102 may receive an input image.
  • the input image may show a user for whom liveness should be detected.
  • the input image may show a user and the computing platform 102 may be tasked with determining whether the user was physically present when the image was taken or whether the image contains a spoof (e.g., the image is a picture of an image of the user, the image contains a computer generated image of the user, etc.).
  • the computing platform 102 may perform an initial test on the image.
  • the initial test may comprise determining whether the image is suitable for use in a machine learning model to detect liveness.
  • the computing platform 102 may determine a location of the user’s face within the image.
  • the computing platform 102 may determine regions of interest corresponding to the image and/or the user’s face shown in the image.
  • the computing platform 102 may determine a location of facial features such as the user’s eyes, nose, mouth, ears, and/or any other facial feature.
  • the computing platform 102 may analyze the regions of interest (e.g., the facial features).
  • the computing platform 102 may analyze the regions of interest to determine whether the eyes of a user shown in the image are open or closed.
  • the initial test may include checking for any suitability criteria described above in connection with FIG. 1. Additionally or alternatively, the computing platform 102 may use facial recognition to verify the face of the user in the image. For example, the computing platform 102 may verify that the user in the image matches a name and face stored in a database.
  • the computing platform 102 may determine whether the image passed the initial test.
  • the computing platform 102 may use any information or determinations made in step 410 as input to determine whether the initial test is passed.
  • the computing platform 102 may determine that one or more criteria are satisfied (e.g., the user’s eyes are open in the image, etc.).
  • the criteria may include whether the user has one or more expected facial features. For example, the computing platform 102 may determine whether the image shows a nose, one or more eyes, or any other facial feature. If one or more criteria are satisfied, step 420 may be performed. If one or more criteria are not satisfied, step 416 may be performed. At step 416, the computing platform 102 may determine whether to request a new image.
  • the threshold number of attempts may be 3, 5, 10, etc.
  • the threshold number of attempts may be used to limit an imposter’s ability to try to circumvent the system or make an unlimited number of presentation attacks.
  • the computing platform 102 may determine how many times the user that sent the input image received in step 405 has submitted an image. If the threshold is satisfied, the method 400 may end. If the threshold is not satisfied, step 417 may be performed. At step 417, the computing platform 102 may send a request to the user for one or more new images. For example, the computing platform 102 may send a message to a user device corresponding to the user to request one or more new images of the user be taken.
  • Step 420 may be performed if it is determined that the image passes the initial test in step 415.
  • the computing platform 102 may generate a cropped image.
  • the computing platform 102 may crop the image received in step 405 so that only the user’s face remains.
  • the computing platform 102 may crop the image to remove a threshold portion of the background (e.g., the portion of the image that does not show the user). Additionally or alternatively, the cropped image may be generated as described above in connection with FIG. 1.
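As a non-authoritative illustration, the crop in step 420 might be performed as below. The use of PIL, the file path, and the bounding box values are assumptions; in practice the box would come from a face detector as described above.

```python
from PIL import Image

def crop_to_face(path: str, box: tuple) -> Image.Image:
    """Crop an input image down to a face bounding box.

    `box` is (left, top, right, bottom) in pixels, PIL's crop convention;
    everything outside the box (i.e., most of the background) is removed.
    """
    image = Image.open(path)
    return image.crop(box)

# Hypothetical usage; the box would normally come from a face detector.
face_only = crop_to_face("input.jpg", (120, 80, 360, 400))
```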
  • the computing platform 102 may use the input image and the cropped image as input into one or more machine learning models.
  • the input image may be input into a first neural network and the cropped image may be input into a second neural network.
  • Each neural network may have been trained separately for liveness detection.
  • the first neural network may have been trained for liveness detection using input images showing the user and the background of the user, while the second neural network may have been trained for liveness detection using cropped images showing only the user’s face.
  • the input image and the cropped image may be input into the same neural network.
  • one convolutional neural network may take both the input image and the cropped image as input (e.g., the convolutional neural network may generate a first embedding for the input image and a second embedding for the cropped image).
  • the computing platform 102 may generate an embedding for the input image and the cropped image as described above in connection with FIG. 1.
  • the embedding for the input image may be the output of a layer of a first neural network and the embedding for the cropped image may be the output of a layer of a second neural network.
  • the same neural network may be used to generate both embeddings.
  • the embedding for the input image may be the output of a layer of a neural network and the embedding for the cropped image may be the output of a layer of the same neural network.
  • the computing platform 102 may concatenate the embedding of the cropped image and the embedding of the input image.
  • the computing platform may append the embedding of the cropped image onto the end of the embedding of the input image. For example, if each embedding was of size 1024 (e.g., the embedding comprises 1024 values), the concatenated embedding would be of size 2048.
  • the computing platform 102 may be able to use both embeddings to more accurately detect liveness of a user in an image. Any number of cropped images may be used by the computing platform 102.
  • the computing platform may generate a first cropped image that contains the area just around a user’s eyes (e.g., an area that contains the user’s eyes and eyebrows but excludes the user’s mouth, ears, hair, etc.), a second cropped image that is limited to the user’s face, and/or any other number of cropped images that contain any portion of the user’s face (e.g., a cropped image corresponding to the nose, a cropped image corresponding to the ears, etc.).
  • Embeddings for any of these cropped images may be generated by a neural network and concatenated. Referring to FIG. 5, a 1024 dimension array 505 may be an embedding of the input image.
  • the array 505 may comprise 1024 values generated by a neural network.
  • a 1024 dimension array 510 may be an embedding of the cropped image.
  • the array 505 and the array 510 may be concatenated to form a 2048 dimension array 515.
  • the values in each array 505-515 may be used by a neural network to make a liveness prediction for the input image.
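A minimal numpy sketch of the concatenation in FIG. 5, with random values standing in for real embeddings; the sizes follow the 1024 + 1024 = 2048 example above:

```python
import numpy as np

embedding_full = np.random.rand(1024)  # array 505: embedding of the input image
embedding_crop = np.random.rand(1024)  # array 510: embedding of the cropped image

combined = np.concatenate([embedding_full, embedding_crop])  # array 515
assert combined.shape == (2048,)
```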
  • the computing platform 102 may generate a liveness prediction.
  • the computing platform 102 may use the concatenated embedding as input into a fully connected layer of a neural network.
  • the fully connected layer may use a function (e.g., a sigmoid function, a ReLU function, or other function) that indicates whether the image contains a live user or not.
  • the function may be used by the computing platform 102 to generate an output value that is between 0 and 1.
  • the computing platform 102 may determine whether the liveness test is passed (e.g., whether the image received in step 405 corresponds to a presentation attack or not). The computing platform may use the output value generated in step 445 to determine whether the liveness test has been passed or not. For example, if the value is above 0.5 the computing platform 102 may determine that the image contains a live user. If the value is at or below 0.5, the computing platform 102 may determine that the image does not contain a live user (e.g., computing platform 102 may determine that a presentation attack has occurred). Step 455 may be performed if it is determined that the liveness test is passed.
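As a sketch only, the sigmoid scoring and cutoff described in steps 445-450 reduce to the following; the 0.5 threshold is the example value from the text, not a fixed requirement.

```python
import math

def sigmoid(x: float) -> float:
    # Squash the fully connected layer's raw logit into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def passes_liveness_test(logit: float, threshold: float = 0.5) -> bool:
    # Above the threshold: live user; at or below: treat as a presentation attack.
    return sigmoid(logit) > threshold
```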
  • the computing platform 102 may authorize an action.
  • a user and/or user device corresponding to the input image received in step 405 may be authorized access to one or more computer systems (e.g., the user may be allowed to login to a system).
  • a user and/or user device may be authorized to perform one or more actions such as depositing a check, transferring money, opening an account, etc.
  • Passive liveness detection techniques described herein may allow a bank to operate without a branch office, because it may allow the bank to verify the identity of a user electronically (e.g., over the Internet).
  • the one or more actions may comprise validating a user identification (e.g., driver’s license, passport, etc.) of the user corresponding to the input image.
  • the one or more actions may comprise verifying that the user does not have duplicate accounts or policies (e.g., insurance). For example, by verifying each user through passive liveness detection, the computing platform 102 may prevent a user from opening multiple accounts or policies.
  • the one or more actions may comprise authorizing a payment (e.g., using a credit card or any other means). For example, using passive liveness detection, the computing platform 102 may determine that the user making a purchase is the user that is associated with the credit card being used for the purchase.
  • the computing platform 102 may use passive liveness detection to prevent fraud by impersonation.
  • the computing platform 102 may use passive liveness detection in a digital environment, where users are enrolling for services remotely (and subsequently accessing the services remotely).
  • the computing platform 102 may use passive liveness detection (e.g., as described in connection with FIGS. 1-8) to prevent an imposter from gaining access (e.g., to information, to a computer system, etc.) using previously captured (e.g., paper and/or electronic) images (e.g., images from social media, publicly available photographs, etc.).
  • Step 460 may be performed, for example, if it is determined that the liveness test is not passed in step 450.
  • the computing platform 102 may decline authorization to the user device that sent the input image in step 405.
  • the computing platform 102 may prevent a user device from accessing a system (e.g., via a login), and/or may prevent any of the actions described in connection with step 455 above.
  • FIG. 6 shows an example neural network architecture 600 that may be used for passive liveness detection.
  • the neural network architecture 600 may be used for passive liveness detection as described in any of FIGS. 1-5.
  • the neural network architecture 600 may include a convolutional neural network 601 and a convolutional neural network 602.
  • the convolutional neural network 601 may be configured to receive as input an input image (e.g., an image of a user as described above).
  • the convolutional neural network 602 may be configured to receive a cropped image as input (e.g., an image that has been cropped to show only a user’s face as described above).
  • the neural network architecture 600 may comprise any number of convolutional neural networks.
  • the neural network architecture 600 may comprise a neural network corresponding to the input image and one or more neural networks corresponding to one or more cropped images. For example, if three cropped images are used by the computing platform 102, the neural network architecture 600 may comprise four neural networks (e.g., one for the input image 605, and one for each cropped image).
  • the number of convolutional neural networks contemplated by this disclosure is not limited by the illustrative diagram in FIG. 6.
  • the convolutional neural network 601 may include a convolutional layer 606, a depthwise convolutional layer 607, and/or an average pooling layer 608.
  • the convolutional neural network 602 may include a convolutional layer 611, a depthwise convolutional layer 612, and/or an average pooling layer 613. Although only three layers are shown in each of convolutional neural networks 601-602, each convolutional neural network may include any number of convolutional, depthwise convolutional, or any other type of layer such as those described above in connection with FIG. 2.
  • the neural network architecture 600 may comprise a concatenation layer 620, which may be configured to concatenate or otherwise combine two or more image embeddings (e.g., an image embedding generated for the input image 605, an image embedding generated for the cropped image 610, and an image embedding for each additional cropped image if additional cropped images are used).
  • the neural network architecture may comprise a fully connected layer 621 and/or a binary cross entropy layer which may be configured to generate a liveness prediction for one or more images (e.g., the input image 605).
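For concreteness, a compact PyTorch sketch of a two-branch architecture in the spirit of FIG. 6 follows; the channel counts, embedding sizes, and layer depths are illustrative assumptions, not values specified by this disclosure.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One convolutional branch (cf. 601 or 602): conv -> depthwise conv -> avg pool."""
    def __init__(self, channels: int = 32, embed_dim: int = 1024):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)  # one filter per channel
        self.pool = nn.AdaptiveAvgPool2d(1)                     # average pooling layer
        self.project = nn.Linear(channels, embed_dim)           # embedding output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))
        x = torch.relu(self.depthwise(x))
        x = self.pool(x).flatten(1)
        return self.project(x)

class LivenessNet(nn.Module):
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.full_branch = Branch(embed_dim=embed_dim)  # cf. 601: input image
        self.crop_branch = Branch(embed_dim=embed_dim)  # cf. 602: cropped image
        self.head = nn.Linear(2 * embed_dim, 1)         # cf. 621: fully connected layer

    def forward(self, image, cropped):
        combined = torch.cat([self.full_branch(image),
                              self.crop_branch(cropped)], dim=1)  # cf. 620: concatenation
        return torch.sigmoid(self.head(combined))  # liveness score in (0, 1)
```

In training, the sigmoid output would pair naturally with a binary cross entropy loss, matching the binary cross entropy layer mentioned above.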
  • an apparatus may be configured accordingly; in other aspects, one or more computer readable media may store computer executed instructions that, when executed, configure a system to perform accordingly.
  • the disclosure contemplates a system with one or more convolutional neural networks; a first module for receiving an input image comprising a facial portion and a first background portion; a second module for generating, based on the input image, a cropped image; a third module for generating, based on the input image and via a first convolutional neural network, a first image embedding, wherein the first convolutional neural network comprises an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers; a fourth module for generating, based on the cropped image and via a second convolutional neural network, a second image embedding; a fifth module for generating, via a concatenation of the first image embedding and the second image embedding, a combined embedding; and a sixth module for generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person.
  • One or more of the first to sixth modules may be combined, consolidated, or divided into more or fewer modules for executing the operations described herein.
  • the modules may comprise computer-executable instructions (e.g., compiled software code) that execute in a computer processor.
  • the modules may be implemented as hardware in an integrated circuit or as firmware or other hardware-software combination.
  • the other methods and steps described herein may be performed by one or more of the first to sixth module, or other modules.
  • one or more modules may be integrated into the convolutional neural network.
  • Embodiment #1 A point of transaction device configured for performing multi-factor authentication before approving a transaction originating from a sender party without subjecting the sender party to memorizing a reusable, non-one-time-use PIN, the point of transaction device comprising: a user input module configured to receive the sender party input regarding the transaction and to receive a selection of a transaction type, a triggering transaction amount, and a selection of at least one transaction parameter; a module configured to pre-register the sender party with a transaction information server by storing a stored image of an identification document of the sender party, wherein the identification document comprises at least a facial image of the sender party; a communications module; an identification capture module, in operative communication with an image capture device and a biometric capture device; a processor; a memory in electronic communication with the processor and configured to store a security rule that associates the selected at least one transaction parameter with the selected transaction type and the selected triggering transaction amount; and instructions stored in the memory, which when executed by the processor, cause the point of transaction
  • Embodiment #2 The point of transaction device of Embodiment #1, wherein the identification capture module is in operative communication with the biometric capture device configured to capture biometric data selected from iris, fingerprint, face and voice.
  • Embodiment #3 The point of transaction device of Embodiment #1, further comprising instructions to: receive, from the transaction information server, a second identification parameter from the record associated with the customer identifier code stored by the transaction information server; and provide an indication of whether the sender party to the transaction matches the second identification parameter.
  • Embodiment #4 The point of transaction device of Embodiment #1, wherein the transmission to and receipt from the transaction information server is via a wireless communications channel and transmission to and receipt from the transaction authority is via a wired communications channel.
  • Embodiment #5 The point of transaction device of Embodiment #1, wherein the identification parameter comprises a biometric data and the point of transaction device further comprises: a biometric capture device configured to capture the biometric data from the sender party to the transaction contemporaneous to the transaction.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Methods, systems, and storage media are disclosed for passive liveness detection using artificial intelligence. Example implementations may receive an image of a person's face; generate a cropped version of that image; generate two different embeddings using two convolutional neural networks that are fed the image and the cropped image, respectively; generate a combined embedding that is a concatenation of the two embeddings; and generate, based on the combined embedding, an output indicating whether the facial portion corresponds to a live person.

Description

ARTIFICIAL INTELLIGENCE FOR PASSIVE LIVENESS DETECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to U.S. Patent Application Serial No. 17/395,987, entitled “Point of Transaction Device with Multi-factor Authentication,” filed January 5, 2016 (and published as US 2016/0132890 on May 12, 2016), which is a continuation-in-part of both U.S. Patent Application Serial No. 13/907,306, filed on May 31, 2013, and also U.S. Patent Application Serial No. 13/907,314, filed on May 31, 2013. All of the aforementioned applications are hereby incorporated by reference as if set forth in full in this application in their entireties for all purposes.
FIELD OF THE DISCLOSURE
[0002] The present disclosure generally relates to artificial intelligence, machine learning, computer vision, computer networking, and hardware and software related thereto. More specifically, one or more aspects described herein relates to artificial intelligence for passive liveness detection.
BACKGROUND
[0003] An image showing the face of a user may be used by a system to verify the user’s identity. However, an imposter may attempt a presentation attack to trick the system into granting access, despite the imposter not being authorized access. It may be difficult for a system to determine whether a presentation attack has occurred or not.
[0004] As Aurelien Geron, author of “Hands-On Machine Learning with Scikit-Learn...”, teaches: “the flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more.” Moreover, “[a] common mistake is to use convolution kernels that are too large. For example, instead of using a convolutional layer with a 5x5 kernel, stack two layers with 3x3 kernels: it will use fewer parameters and require fewer computations, and it will usually perform better. One exception is for the first convolutional layer: it can typically have a large kernel (e.g., 5x5), usually with a stride of 2 or more: this will reduce the spatial dimension of the image without losing too much information, and since the input image only has three channels in general, it will not be too costly.” The features disclosed herein overcome one or more problems that exist in the art.
SUMMARY
[0005] To overcome limitations in the prior art described above, and to overcome other limitations that will be apparent upon reading and understanding the present specification, aspects described herein are directed towards passive liveness detection. An image and one or more cropped images of a user may be used as input into one or more neural networks. The one or more cropped images may show only the regions of interest (ROI) of the image (e.g., the user’s face, eyes, etc.). The one or more neural networks may generate a first embedding for the image and a different embedding for the one or more cropped images. The embeddings may be combined and used in one or more neural networks to generate a prediction of liveness or whether a presentation attack has occurred.
[0006] In one aspect, a computer implemented method for passive liveness detection may include receiving, by a computing device, an input image, wherein the image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises the facial portion and a second background portion that is a subset of the first background portion; generating, based on the input image and via a first convolutional neural network, a first image embedding, wherein the first convolutional neural network comprises an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers; generating, based on the cropped image and via a second convolutional neural network, a second image embedding; generating, via a concatenation of the first image embedding and the second image embedding, a combined embedding; generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person; and denying, based on the output indicating whether the facial portion corresponds to a live person, access to a computer system. The first convolutional neural network may comprise a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers. The second convolutional neural network may comprise a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels. A first width parameter corresponding to input channels of the first convolutional neural network may be greater than a second width parameter corresponding to input channels of the second convolutional neural network. The generating the cropped image may include removing, from the input image, pixels corresponding to the background portion.
[0007] The method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the first convolutional neural network and the second convolutional neural network, to output information that indicates liveness of each person in the first plurality of images.
[0008] The method may further comprise receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection. The generating, based on the first image embedding and the second image embedding, output may comprise generating the output via a sigmoid function.
[0009] In one aspect a computer implemented method may comprise generating, by a computing device and via a camera, a plurality of images, wherein each image of the plurality of images indicates a same person, and wherein each image of the plurality of images is generated within a threshold time of each other; generating, via a neural network and based on the plurality of images, an image embedding; generating, based on the image embedding and via a fully connected layer of the neural network, an output value; and granting, to a user device and based on the output value, access to a computing system. The neural network may comprise a recurrent convolutional neural network. Generating the image embedding may comprise generating, via the neural network and based on a first image, a first image embedding; and generating, via the neural network and based on both the first image embedding and a second image of the plurality of images, the image embedding. At least one of the plurality of images may comprise a cropped image.
[0010] The method may further comprise generating the cropped image by removing, from a first image of the plurality of images, one or more pixels corresponding to a background portion of the first image.
[0011] The method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the neural network, to output information that indicates liveness of each person in the first plurality of images.
[0012] The method may further comprise receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection. Generating an output value may comprise generating the output value via a sigmoid function.
[0013] In one aspect a computer implemented method may comprise receiving, by a computing device, an input image, wherein the input image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises a subset of pixels of the input image; generating, based on the input image and via a first neural network, a first image embedding; generating, based on the cropped image and via a second neural network, a second image embedding; generating, based on the first image embedding and the second image embedding, output indicating whether the facial portion corresponds to a live person; and granting, based on the output, access to a computing system.
[0014] The method may further comprise training, based on a first plurality of images and a second plurality of cropped images, the first neural network and the second neural network, to output information that indicates liveness of a person in each image of the first plurality of images.
[0015] The method may further comprise receiving an additional image for liveness detection, wherein the additional image indicates a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection. Generating output may comprise generating the output via a sigmoid function. The first convolutional neural network may comprise a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers. The second convolutional neural network may comprise a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels.
[0016] In other aspects, a system may be configured to perform one or more aspects and/or methods described herein. In some aspects, an apparatus may be configured to perform one or more aspects and/or methods described herein. In some aspects, one or more computer readable media may store computer executed instructions that, when executed, configure a system to perform one or more aspects and/or methods described herein. These and additional aspects will be appreciated with the benefit of the disclosures discussed in further detail below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 shows an example system configured for passive liveness detection.
[0018] FIG. 2 shows an example neural network that may be used for passive liveness detection.
[0019] FIG. 3 shows an example method for passive liveness detection.
[0020] FIG. 4 shows an additional example method for detecting liveness in an image.
[0021] FIG. 5 shows example arrays or image embeddings that may be used for passive liveness detection.
[0022] FIG. 6 shows an example neural network architecture that may be used for passive liveness detection.
[0023] FIG. 7 shows an example image that may be used for passive liveness detection.
[0024] FIG. 8 shows an example cropped image that may be used for passive liveness detection.
DETAILED DESCRIPTION
[0025] In the following description of the various embodiments, reference is made to the accompanying drawings identified above and which form a part hereof, and in which is shown by way of illustration various embodiments in which aspects described herein may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope described herein. Various aspects are capable of other embodiments and of being practiced or being carried out in various different ways. It is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. Rather, the phrases and terms used herein are to be given their broadest interpretation and meaning. The use of “including” and “comprising” and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items and equivalents thereof.
[0026] A framework for a machine learning algorithm may involve a combination of one or more components, sometimes three components: (1) representation, (2) evaluation, and (3) optimization components. Representation components refer to computing units that perform steps to represent knowledge in different ways, including but not limited to one or more decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles, and/or others. Evaluation components refer to computing units that perform steps to represent the way hypotheses (e.g., candidate programs) are evaluated, including but not limited to accuracy, precision and recall, squared error, likelihood, posterior probability, cost, margin, entropy, K-L divergence, and/or others. Optimization components refer to computing units that perform steps that generate candidate programs in different ways, including but not limited to combinatorial optimization, convex optimization, constrained optimization, and/or others. In some embodiments, other components and/or subcomponents of the aforementioned components may be present in the system described herein to further enhance and supplement the aforementioned machine learning functionality.
[0027] Machine learning algorithms sometimes rely on unique computing system structures. Machine learning algorithms may leverage neural networks. Such structures, while significantly more complex than conventional computer systems, are beneficial in implementing machine learning. For example, an artificial neural network may be comprised of a large set of nodes, which may be dynamically configured to effectuate learning and decision-making.
[0028] Machine learning tasks are sometimes broadly categorized as either unsupervised learning or supervised learning. In unsupervised learning, a machine learning algorithm is left to generate any output (e.g., to label as desired) without feedback. The machine learning algorithm may teach itself (e.g., observe past output), but otherwise operates without (or mostly without) feedback from, for example, a human administrator. An embodiment involving unsupervised machine learning is described herein.
[0029] Meanwhile, in supervised learning, a machine learning algorithm is provided feedback on its output. Feedback may be provided in a variety of ways, including via active learning, semi-supervised learning, and/or reinforcement learning. In active learning, a machine learning algorithm is allowed to query answers from an administrator. For example, the machine learning algorithm may make a guess in a face detection algorithm, ask an administrator to identify the face in the photo, and compare the guess and the administrator’s response. In semi-supervised learning, a machine learning algorithm is provided a set of example labels along with unlabeled data. For example, the machine learning algorithm may be provided a data set of 100 photos with labeled human faces and 10,000 random, unlabeled photos. In reinforcement learning, a machine learning algorithm is rewarded for correct labels, allowing it to iteratively observe conditions until rewards are consistently earned. For example, for every face correctly identified, the machine learning algorithm may be given a point and/or a score (e.g., “75% correct”).
[0030] One theory underlying supervised learning is inductive learning. In inductive learning, a data representation is provided as input samples data (x) and output samples of the function (f(x)). The goal of inductive learning is to learn a good approximation for the function for new data (x), i.e., to estimate the output for new input samples in the future. Inductive learning may be used on functions of various types: (1) classification functions where the function being learned is discrete; (2) regression functions where the function being learned is continuous; and (3) probability estimations where the output of the function is a probability.
[0031] As elaborated herein, in practice, machine learning systems and their underlying components are tuned by data scientists to perform numerous steps to perfect machine learning systems. The process is sometimes iterative and may entail looping through a series of steps: (1) understanding the domain, prior knowledge, and goals; (2) data integration, selection, cleaning, and pre-processing; (3) learning models; (4) interpreting results; and/or (5) consolidating and deploying discovered knowledge. This may further include conferring with domain experts to refine the goals and make the goals more clear, given the many variables that can be optimized in the machine learning system. Meanwhile, one or more of data integration, selection, cleaning, and/or pre-processing steps can sometimes be the most time consuming because the old adage, “garbage in, garbage out,” also rings true in machine learning systems.
[0032] Once data for machine learning has been created, an optimization process may be used to transform the machine learning model. The optimization process may include (1) training the data to predict an outcome, (2) defining a loss function that serves as an accurate measure to evaluate the machine learning model’s performance, (3) minimizing the loss function, such as through a gradient descent algorithm or other algorithms, and/or (4) optimizing a sampling method, such as using a stochastic gradient descent (SGD) method where instead of feeding an entire dataset to the machine learning algorithm for the computation of each step, a subset of data is sampled sequentially. In one example, optimization comprises minimizing the number of false positives to maximize a user’s experience. Alternatively, an optimization function may minimize the number of missed positives to optimize minimization of losses from exploits.
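A toy illustration of steps (2)-(4) above, under assumed shapes: a binary cross entropy loss minimized by stochastic gradient descent over sampled mini-batches. The model and data here are placeholders, not components of the disclosed system.

```python
import torch

model = torch.nn.Linear(2048, 1)  # placeholder model (e.g., a classifier head)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross entropy on raw logits

# Placeholder mini-batches: 10 batches of 8 (features, live/spoof label) pairs.
batches = [(torch.randn(8, 2048), torch.randint(0, 2, (8,)).float())
           for _ in range(10)]

for features, labels in batches:  # SGD: a subset of data sampled per step
    optimizer.zero_grad()
    loss = loss_fn(model(features).squeeze(1), labels)
    loss.backward()    # compute gradients of the loss
    optimizer.step()   # one descent step to reduce the loss
```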
[0033] FIG. 1 illustrates a system 100 configured for liveness detection, in accordance with one or more implementations. Any module or device within system 100 may use any of the machine learning techniques described above, or described in connection with FIG. 2 below, for liveness detection. Biometrics such as faces, fingerprints, retina, voice, and others may be used to identify or verify a user and can be useful for many applications where authentication is necessary. However, biometrics are also susceptible to presentation attacks, where an impersonator may obtain a biometric sample from an authorized user, present it to a system, and gain access to the system despite not being authorized. A presentation attack may include an impersonator using a biometric sample without the support of the user to whom the biometric sample belongs. For example, an impersonator may obtain an image of an authorized user’s face and present it to a facial recognition system to gain unauthorized access to a system. Liveness detection may help prevent presentation attacks by detecting whether a biometric sample is being used in an authentic manner (e.g., the authorized user corresponding to the biometric sample is actually present and/or intends to gain access to a system). For example, when given an image of a user’s face, liveness detection may be used to determine whether the user was actually present when the image was taken or whether the image is simply a picture of an image of the user or some other type of presentation attack (e.g., a mask that looks like the authorized user, etc.). Liveness detection may include determining whether a biometric corresponds to a live person or a representation of the person. Liveness detection may include determining whether an image authentically shows a live person or a spoof of the person. For example, a spoof may comprise presenting (e.g., to a camera) one or more of the following (e.g., which may be designed to look like an authorized user): high resolution glossy printed photographs, high resolution matte printed photographs, paper cutout masks, paper cutout masks with eyes cut out worn on an imposter’s face, 3D layered printed paper masks, 3D layered printed masks with eyes cut out and worn on an imposter’s face, use of hair (e.g., a wig), electronic still photo on a laptop or mobile device, video on a laptop or mobile device, latex masks, silicone masks, mannequins, 3D printed face busts, etc. One or more neural networks described herein may be configured to distinguish between the above mentioned spoofs and a live user (e.g., an authorized user).
[0034] Liveness detection may include active liveness detection and passive liveness detection. In active liveness detection the user may be presented with a challenge. For example, the user may be prompted to blink, move a device, nod the user’s head, smile, or perform other actions to pass a liveness detection test. Active liveness can lead to a poor user experience because it requires extra time and effort to perform a challenge to verify liveness. Another drawback to active liveness detection is that a system may need to instruct a user to perform an action. This may signal to an imposter that a liveness detection is being performed and may assist the imposter in learning how to thwart or spoof the liveness detection. On the other hand, passive liveness detection may be done without requiring any special challenge or action by the user. For example, a system may take a picture of a user as the user enters their log in information and may use the picture to determine liveness. Alternatively, for passive liveness detection the system may simply take a picture of the user’s face without requiring the user to perform a special action (e.g., such as smiling or blinking). The system may use the picture to determine whether the user is authentic or if a presentation attack is being used by an imposter.
[0035] The system, methods, and techniques described herein may be used for passive liveness detection and may provide a number of benefits over other liveness detection techniques. One benefit is that the features described herein may require minimal effort from a user and thus improve user experience. For example, a user may need to simply be located in front of a camera without having to perform any special actions. This may reduce the time and effort required by the user to access a system. In addition, passive liveness detection techniques described herein may allow a system to better protect against identity theft and presentation attacks. Techniques for passive liveness detection described herein may provide an improvement over active liveness detection or other liveness detection techniques by improving abilities to detect spoofs and/or by improving the user experience. Active liveness detection requires the user to respond to instructions from a computing device (e.g., instructions to smile, blink, turn the user’s head, etc.). Requiring the user to perform an action adds friction to the process and results in significantly increased user frustration and abandonment of the process. Additionally, active liveness detection is easy to spoof using animated deep fake images created by a computing device. Passive liveness detection techniques described herein may make it more difficult to spoof using animated and/or deep fake images. Passive liveness detection techniques described herein may use an input image and one or more cropped images to train one or more neural networks to determine whether the input image corresponds to a presentation attack. Using one or more cropped images may allow the one or more neural networks to more accurately predict whether the input image corresponds to a presentation attack (e.g., by allowing the one or more neural networks to focus on one or more portions of a face).
[0036] In some implementations, system 100 may include one or more computing platforms 102. Computing platform(s) 102 may be configured to communicate with one or more remote platforms 104 according to a client/server architecture, a peer-to-peer architecture, and/or other architectures. For example, computing platform(s) 102 may communicate with a client device 130 and receive an image taken by a camera 132 of the client device 130. The image may show a user and one or more devices in the system 100 may determine whether the user shown in the image is live or not. The client device 130 may include a display 131 that may be configured to display an image taken using the camera 132. A communications module 133 of the client device 130 may be configured to send one or more images captured by the camera 132 to computing platform 102. For example, a user may use the client device 130 to log in to a system. As part of the login process, the camera 132 may take a picture of the user (the user may or may not be aware of the time at which the picture is taken). The communications module 133 may send the picture to computing platform 102 (e.g., the image receiving module 108) where it may undergo a liveness detection test as described herein. The image receiving module 108 may format the image and/or otherwise process the image so that it may be used in a liveness detection test. The communications module 133 may receive the result and/or an indication of the result of the liveness detection test from the computing platform 102 (e.g., the client device 130 may be granted access to the system or may be denied access to the system). Remote platform(s) 104 may be configured to communicate with other remote platforms via computing platform(s) 102 and/or according to a client/server architecture, a peer-to-peer architecture, and/or other architectures. Users may access system 100 via remote platform(s) 104.
[0037] Computing platform(s) 102 may be configured by machine-readable instructions 106. Machine-readable instructions 106 may include one or more instruction modules. The instruction modules may include computer program modules. The instruction modules may include one or more of image receiving module 108, image generating module 110, embedding generating module 112, output generating module 114, neural network training module 116, image determination module 118, and/or other instruction modules.
[0038] Image receiving module 108 may be configured to receive, by a computing device, an image. The image may depict a user, and the computing platform(s) 102 may be tasked with determining whether the user shown is live or not (e.g., whether a presentation attack is being used by an imposter, whether the image is a fake, whether the depicted user was in front of the camera when the image was taken, etc.). The image may include a facial portion and a background portion. For example, as shown in FIG. 7, the image may include a facial portion 705 and a background portion 710. The facial portion may include any portion of the user’s face (e.g., eyes, ears, nose, mouth, etc.). The background portion may include any portion of the image that does not include the user.
[0039] Image generating module 110 may be configured to generate, based on the image, one or more cropped images. For example, the generating module 110 may crop an image received from the user. A cropped image may include the facial portion and a subset of the background portion of the image. For example, as shown in FIG. 8, a portion of the background may be removed from the image received by the image receiving module 108 to generate the cropped image. As shown in FIG. 8, the cropped image may comprise a facial portion 805. The image generating module 110 may crop the image so that the cropped image includes only the user’s face (e.g., the cropped image may omit any part of the user that is not the user’s face and/or the cropped image may omit the background portion of the image). Image generating module 110 may be configured to generate the cropped image by removing one or more pixels corresponding to a background portion of the image received from the user. For example, the image generating module 110 may use machine learning (e.g., object detection techniques, face detection techniques, etc.) to determine pixels corresponding to the user and/or the user’s face and may crop the image such that only the portion of the image that corresponds to the user and/or user’s face remain in the cropped image.
[0040] The computing platform 102 may use the image received from the user and the one or more cropped images to determine liveness of a user shown in the image. For example, the image and the cropped image may be input into one or more neural networks (e.g., as described in more detail below) to generate a prediction of liveness of a user shown in the image. Using one or more cropped images in addition to the image may improve the ability (e.g., accuracy may be improved) of computing platform 102 to determine liveness of the user shown in the image or whether a presentation attack is occurring.
[0041] Image generating module 110 may be configured to generate, by a computing device and via a camera, a plurality of images. For example, the computing platform 102 may cause a camera of a client device to take a plurality of images of a user (e.g., each image in the plurality of images may show the same user). Each image in the plurality of images may be taken within a threshold period of time (e.g., there may be a period of time between the time at which each image is taken). For example, the period of time between when each image is taken may be 200ms (other amounts of time may be used such as 500ms, 1 second, or any other amount of time). Alternatively, the threshold period of time may indicate a time period that each image of the plurality of images must be taken within. For example, if the threshold time period is 800 milliseconds, the camera may be required to take each image of the plurality of images within 800 milliseconds. One or more of the plurality of images may include a cropped image.
[0042] Image generating module 110 may cause a camera to capture one or more images and/or video of a user while the user logs into an account or device. For example, image generating module 110 may cause a camera to capture a plurality of images of a user (e.g., 5 images, 10 images, 20 images, etc.). The image generating module 110 may determine a liveness score from each image of the plurality of images. The liveness score may comprise an indication of how open the user’s eyes are in the image. The image with the highest liveness score (e.g., the image showing the user’s eyes being the most open) may be used as the input image (e.g., the input image discussed below in connection with any of FIGS. 3-8).
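One way this frame selection could look, with eye_openness() as a hypothetical scoring function (e.g., an eye-landmark aspect ratio) rather than a function named in this disclosure:

```python
def pick_input_image(images, eye_openness):
    """Return the burst frame with the highest liveness score.

    `eye_openness` is a hypothetical callable mapping an image to a score;
    a higher score means the user's eyes are more open in that frame.
    """
    return max(images, key=eye_openness)
```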
[0043] Embedding generating module 112 may be configured to generate, based on the image and via a neural network (e.g., a convolutional neural network or any other type of neural network), an embedding (e.g., an image embedding). The neural network may include any component of the neural network described in connection with FIG. 2 or the convolutional neural network described in connection with FIG. 3. The neural network may be used to generate an image embedding of an image received from a user. The embedding may be a representation of the image with reduced dimensionality. For example, a 12 megapixel image may be reduced to an embedding of 1024 dimensions. Additionally or alternatively, a neural network may be used to generate an image embedding using a cropped version of the input image. An image embedding may be a vector representation of the image. For example, the image embedding may be a vector of numbers, where the vector has size N (e.g., 200, 500, 1024, 2048, 4096, or any other size). The image embedding may be the output of a layer within a neural network. For example, the image embedding may be the output of a fully connected layer, a pooling layer (e.g., average pooling, max pooling, etc.), or other layer of a neural network (e.g., a convolutional neural network).
[0044] One or more neural networks used to generate an embedding may include a plurality of layers and/or a plurality of input channels. The plurality of layers may include any type of layer described in more detail in connection with FIG. 2 below (e.g., convolutional layer, pooling layer, fully connected layer, etc.). By way of non-limiting example, one or more neural networks may comprise an average pooling layer, a fully connected layer, and/or a plurality of depthwise convolutional layers (e.g., as explained in more detail in connection with FIGS. 2-3 below).
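For illustration, one common form of depthwise convolutional layer is the depthwise separable block used in MobileNet-style networks, sketched below; the exact block composition (batch normalization, ReLU) is an assumption for this example, not a limitation of the disclosure.

```python
# Sketch of a depthwise separable convolution block: a per-channel
# (depthwise) convolution followed by a 1x1 (pointwise) convolution.
import torch.nn as nn


def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        # groups=in_ch applies one filter per input channel (depthwise)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # 1x1 convolution mixes information across channels (pointwise)
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```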
[0045] The plurality of input channels may correspond to the type of images the neural network receives as input. For example, an image may comprise three dimensions. The dimensions or shape of an image may be c × h × w, where c is the number of input channels, h is the height (e.g., the number of rows of pixels), and w is the width (e.g., the number of columns). For example, if the image is in RGB color model format, the number of input channels c may be three and the number of input channels at the input layer of the neural network may be three. Subsequent layers of the neural network may have different numbers of input channels. For example, a second layer in the neural network may generate an output that has more or fewer channels than the number of input channels of a previous first layer of the neural network. A third layer that follows the second layer may have a number of input channels that equals the number of channels in the output of the second layer.
[0046] A width parameter may be used to modify the number of input channels that are used in each layer in a neural network. Modifying or reducing the number of input channels using the width parameter may cause the neural network to reduce in size (e.g., the number of trainable parameters of the neural network may be reduced). This may increase the speed at which the neural network may be used (e.g., reduce time needed to make a prediction, reduce time needed to train, etc.) and/or may reduce the amount of computer processing power needed to use the neural network. The width parameter may be a value d between 0 and 1. The width parameter may be multiplied with the number of input channels at each layer so that the number of input channels at each layer becomes d · c. The product d · c may be rounded to the nearest integer. For example, if the width parameter is d = 0.4 and the number of input channels is c = 2, then the product may be rounded up to 1. In another example, if the width parameter is d = 0.25 and the number of input channels is c = 3, then the product may be rounded up to 1. In yet another example, if the width parameter is d = 0.55 and the number of input channels is c = 3, then the product may be rounded up to 2. The width parameter may be different for different neural networks. For example, the width parameter for a first neural network that takes a full (e.g., uncropped) image as input may be a first value (e.g., 1, 0.75, etc.), and the width parameter for a second neural network that takes a cropped image as input may be a second value (e.g., 0.25, 0.5, 0.6, etc.).
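A minimal sketch of this channel scaling follows. The round-half-up convention and the floor of one channel are assumptions, since implementations differ in how they round.

```python
# Width-parameter scaling: each layer's channel count c is scaled by d
# and rounded to the nearest integer, with a floor of one channel.
# Round-half-up is assumed here; other conventions exist.
def scaled_channels(c: int, d: float) -> int:
    return max(1, int(c * d + 0.5))


assert scaled_channels(2, 0.4) == 1    # 0.8 rounds to 1
assert scaled_channels(3, 0.25) == 1   # 0.75 rounds to 1
assert scaled_channels(3, 0.55) == 2   # 1.65 rounds to 2
```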
[0047] Embedding generating module 112 may generate an embedding using a plurality of images (e.g., a plurality of images taken within a threshold time as discussed above). Embedding generating module 112 may be configured to generate, via a neural network and based on a plurality of images of a user, an embedding. For example, a recurrent neural network (RNN) (e.g., a recurrent convolutional neural network) may be used to generate, based on the plurality of images, the embedding. The plurality of images may be ordered by the time at which each image was taken. At a first time step, the RNN may take as input the first image of the plurality of images and generate an embedding. At each subsequent time step, the RNN may use as input the embedding generated at the previous time step, as well as the next image in the plurality of images.
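The following sketch illustrates one possible realization of this recurrence, using a stand-in frame encoder and a GRU cell whose hidden state plays the role of the embedding carried between time steps. The cell type and layer sizes are assumptions for illustration only.

```python
# Sketch of the recurrent scheme described above: a per-frame encoder
# feeds a GRU, and the hidden state carries the embedding forward.
import torch
import torch.nn as nn


class SequenceEmbedder(nn.Module):
    def __init__(self, frame_dim: int = 1024, embed_dim: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(          # stand-in frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, frame_dim),
        )
        self.rnn = nn.GRUCell(frame_dim, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W), ordered by capture time
        h = torch.zeros(1, self.rnn.hidden_size)
        for t in range(frames.shape[0]):
            # each step consumes the next frame plus the previous embedding
            h = self.rnn(self.encoder(frames[t:t + 1]), h)
        return h                               # final (1, embed_dim) embedding
```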
[0048] One or more embeddings may be used to determine a liveness prediction. A first embedding of an image and a second embedding of a cropped version of the image may be concatenated (e.g., by the embedding generating module 112). Embedding generating module 112 may be configured to generate, via a concatenation of the first embedding and the second embedding, a combined embedding (e.g., as discussed in more detail in connection with FIG. 4 below). Output generating module 114 may be configured to generate, based on the combined embedding, a prediction of liveness corresponding to the input image (e.g., a prediction of whether a presentation attack is being used). For example, output generating module 114 may output information indicating whether a user shown in an image corresponds to a live person (e.g., as discussed in more detail in connection with FIG. 4 below). In the case of passive liveness detection, the step of generating, based on the combined embedding, a prediction of liveness is performed without requiring any special challenge or action by the user; for example, the system may simply take a photo/video of the user's face without requiring the user to perform a special action (e.g., smiling or blinking). The system may then perform the aforementioned steps using the photo/video to determine whether the user is authentic or whether a presentation attack is being used by an imposter.
[0049] Neural network training module 116 may be configured to train one or more neural networks to detect liveness (e.g., whether a presentation attack is being attempted). Neural network training module 116 may use training data comprising a plurality of images. The images in the training data may be labeled with an indication of whether a user shown in each image is live and/or whether the image corresponds to a presentation attack. The images in the training data may comprise cropped versions of each image (e.g., the cropped version may show just the face of the user). The neural network may be trained using any technique (e.g., backpropagation, regularization, etc.) described in connection with FIG. 2 below.
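By way of illustration, a training loop over such labeled data might look like the following sketch. Here `model` (mapping an image/crop pair to a liveness probability) and `loader` (yielding labeled batches) are hypothetical stand-ins not defined by the disclosure, and binary cross-entropy is one plausible loss choice for live/spoof labels.

```python
# Hypothetical training sketch: `model` and `loader` are assumed
# stand-ins; BCE is one plausible loss for live-vs-spoof labels.
import torch
import torch.nn as nn

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for images, crops, labels in loader:          # labels: 1.0 live, 0.0 spoof
        optimizer.zero_grad()
        scores = model(images, crops).squeeze(1)  # probabilities in [0, 1]
        loss = criterion(scores, labels.float())
        loss.backward()                           # backpropagation of the error
        optimizer.step()                          # weight update
```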
[0050] Image determination module 118 may be configured to determine whether an image is suitable for liveness detection. There may be one or more criteria that are checked as part of an initial determination of liveness. The initial determination of liveness may be performed before a neural network is used to generate information indicating liveness as described above. For example, the initial determination may be performed prior to generating one or more embeddings and making a prediction of whether the image includes a live person (e.g., that the image is not a spoof or that the image does not correspond to a presentation attack). Image determination module 118 may analyze an input image and may determine whether the image is suitable for liveness detection. Image determination module 118 may be configured to determine, based on facial features of a person shown in an image, that the image is not suitable for liveness detection. For example, image determination module 118 may use image processing techniques to determine whether the user's eyes are open or closed. If the user's eyes are closed, the image determination module 118 may determine that the image is not suitable for liveness detection and/or may determine that a user shown in the image is not a live user. Image determination module 118 may use image processing techniques to determine focus of a camera or blurriness of a captured image. If the blurriness of a captured image satisfies a blurriness threshold, the image determination module 118 may cause an additional image to be captured. Image determination module 118 may use image processing techniques to determine a glare level within a captured image. If the glare level of a captured image satisfies a threshold, the image determination module 118 may cause an additional image to be captured. Image determination module 118 may use image processing techniques to determine a head angle of a user in a captured image. If the head angle in a captured image fails to satisfy one or more thresholds or criteria (e.g., the image is taken from too high of an angle above the user's head, too low of an angle below the user's head, etc.), the image determination module 118 may cause an additional image to be captured. Image determination module 118 may use image processing techniques to determine a distance between the user's face and the camera. If the distance fails to satisfy one or more thresholds (e.g., the user is too far away from the camera, or the user is too close to the camera), the image determination module 118 may cause an additional image to be captured.
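The following sketch illustrates two such pre-checks with common heuristics: variance of the Laplacian as a blur measure and the fraction of near-saturated pixels as a glare proxy. The thresholds are invented placeholders, and the remaining checks (eyes, head angle, distance) would follow the same pattern.

```python
# Illustrative pre-checks only. The thresholds are placeholders; the
# Laplacian-variance blur measure and bright-pixel glare proxy are
# common heuristics, not the claimed tests.
import cv2
import numpy as np


def is_suitable(image_bgr: np.ndarray) -> bool:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Blur check: low variance of the Laplacian indicates a blurry frame.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < 100.0:
        return False

    # Glare check: too many near-saturated pixels suggests strong glare.
    if np.mean(gray > 250) > 0.05:
        return False

    return True  # head-angle and distance checks would follow similarly
```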
[0051] In some implementations, computing platform(s) 102, remote platform(s) 104, and/or external resources 122 may be operatively linked via one or more electronic communication links. For example, such electronic communication links may be established, at least in part, via a network such as the Internet and/or other networks. It will be appreciated that this is not intended to be limiting, and that the scope of this disclosure includes implementations in which computing platform(s) 102, remote platform(s) 104, and/or external resources 122 may be operatively linked via some other communication media.
[0052] A given remote platform 104 may include one or more processors configured to execute computer program modules. The computer program modules may be configured to enable an expert or user associated with the given remote platform 104 to interface with system 100 and/or external resources 122, and/or provide other functionality attributed herein to remote platform(s) 104. By way of non-limiting example, a given remote platform 104 and/or a given computing platform 102 may include one or more of a server, a desktop computer, a laptop computer, a handheld computer, a tablet computing platform, a NetBook, a Smartphone, a gaming console, and/or other computing platforms.
[0053] External resources 122 may include sources of information outside of system 100, external entities participating with system 100, and/or other resources. In some implementations, some or all of the functionality attributed herein to external resources 122 may be provided by resources included in system 100.
[0054] Computing platform(s) 102 may include electronic storage 124, one or more processors 126, and/or other components. Computing platform(s) 102 may include communication lines, or ports to enable the exchange of information with a network and/or other computing platforms. Illustration of computing platform(s) 102 in FIG. 1 is not intended to be limiting. Computing platform(s) 102 may include a plurality of hardware, software, and/or firmware components operating together to provide the functionality attributed herein to computing platform(s) 102. For example, computing platform(s) 102 may be implemented by a cloud of computing platforms operating together as computing platform(s) 102.
[0055] Electronic storage 124 may comprise non-transitory storage media that electronically stores information. The electronic storage media of electronic storage 124 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with computing platform(s) 102 and/or removable storage that is removably connectable to computing platform(s) 102 via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.). Electronic storage 124 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. Electronic storage 124 may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources). Electronic storage 124 may store software algorithms, information determined by processor(s) 126, information received from computing platform(s) 102, information received from remote platform(s) 104, and/or other information that enables computing platform(s) 102 to function as described herein.
[0056] Processor(s) 126 may be configured to provide information processing capabilities in computing platform(s) 102. As such, processor(s) 126 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor(s) 126 is shown in FIG. 1 as a single entity, this is for illustrative purposes only. In some implementations, processor(s) 126 may include a plurality of processing units. These processing units may be physically located within the same device, or processor(s) 126 may represent processing functionality of a plurality of devices operating in coordination. Processor(s) 126 may be configured to execute modules 108, 110, 112, 114, 116, and/or 118, and/or other modules. Processor(s) 126 may be configured to execute modules 108, 110, 112, 114, 116, and/or 118, and/or other modules by software; hardware; firmware; some combination of software, hardware, and/or firmware; and/or other mechanisms for configuring processing capabilities on processor(s) 126. As used herein, the term “module” may refer to any component or set of components that perform the functionality attributed to the module. This may include one or more physical processors during execution of processor readable instructions, the processor readable instructions, circuitry, hardware, storage media, or any other components.
[0057] It should be appreciated that although modules 108, 110, 112, 114, 116, and/or 118, are illustrated in FIG. 1 as being implemented within a single processing unit, in implementations in which processor(s) 126 includes multiple processing units, one or more of modules 108, 110, 112, 114, 116, and/or 118, may be implemented remotely from the other modules. The description of the functionality provided by the different modules 108, 110, 112, 114, 116, and/or 118, described below is for illustrative purposes, and is not intended to be limiting, as any of modules 108, 110, 112, 114, 116, and/or 118, may provide more or less functionality than is described. For example, one or more of modules 108, 110, 112, 114, 116, and/or 118, may be eliminated, and some or all of its functionality may be provided by other ones of modules 108, 110, 112, 114, 116, and/or 118. As another example, processor(s) 126 may be configured to execute one or more additional modules that may perform some or all of the functionality attributed below to one of modules 108, 110, 112, 114, 116, and/or 118.
[0058] By way of example, FIG. 2 illustrates a simplified artificial neural network (e.g., neural network) 200 on which a machine learning algorithm may be executed. The artificial neural network shown in FIG. 2 may be used for liveness detection as described in connection with FIGS. 1, and 3-5. FIG. 2 is merely an example of an artificial neural network; other forms of nonlinear processing may be used to implement a machine learning algorithm in accordance with features described herein.
[0059] In FIG. 2, each of input nodes 210a-n is connected to a first set of processing nodes 220a-n. Each of the first set of processing nodes 220a-n is connected to each of a second set of processing nodes 230a-n. Each of the second set of processing nodes 230a-n is connected to each of output nodes 240a-n. Though only two sets of processing nodes are shown, any number of processing nodes may be implemented. Similarly, though only four input nodes, five processing nodes, and two output nodes per set are shown in FIG. 2, any number of nodes may be implemented per set. Data flows in FIG. 2 are depicted from left to right: data may be input into an input node, may flow through one or more processing nodes, and may be output by an output node. Input into the input nodes 210a-n may originate from an external source 260. Output may be sent to a feedback system 250 and/or to storage 270. The feedback system 250 may send output to the input nodes 210a-n for successive processing iterations with the same or different input data.
[0060] In one illustrative method using feedback system 250, the system may use machine learning to determine an output. The output may include anomaly scores, heat scores/values, confidence values, and/or classification output. The system may use any machine learning model, including gradient-boosted decision trees (e.g., XGBoost), autoencoders, perceptrons, decision trees, support vector machines, regression, and/or a neural network. The neural network may be any type of neural network, including a feed-forward network, radial basis network, recurrent neural network, long short-term memory (LSTM) network, gated recurrent unit, autoencoder, variational autoencoder, convolutional network, residual network, Kohonen network, MobileNet, GoogLeNet, VGG-16, SqueezeNet, AlexNet, and/or other type of network. For example, the neural network may comprise depthwise separable convolutions, such as in a MobileNet architecture. In one example, the output data in the machine learning system may be represented as multi-dimensional arrays, an extension of two-dimensional tables (such as matrices) to data with higher dimensionality.
[0061] The neural network may include an input layer, a number of intermediate layers, and an output layer. Each layer may have its own weights. The input layer may be configured to receive as input one or more feature vectors described herein. The intermediate layers may be convolutional layers, pooling layers, dense (fully connected) layers, and/or other types. The input layer may pass inputs to the intermediate layers. In one example, each intermediate layer may process the output from the previous layer and then pass output to the next intermediate layer. The output layer may be configured to output a classification or a real value. The layers may include convolutional layers, pooling layers, depthwise convolutional layers, and/or any other type of layer.
[0062] In one example, the layers in the neural network may use an activation function such as a sigmoid function, a tanh function, a ReLU function, and/or other functions. Moreover, the neural network may include a loss function. A loss function may, in some examples, penalize missed positives (false negatives); alternatively or additionally, it may penalize false positives. The loss function may be used to determine error when comparing an output value and a target value. For example, when training the neural network, the output of the output layer may be used as a prediction and may be compared with a target value of a training instance to determine an error. The error may be used to update weights in each layer of the neural network.
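As a concrete, non-limiting instance, binary cross-entropy (a common loss for the live-versus-spoof case, and the loss suggested by FIG. 6 below) compares a predicted probability to a 0/1 target:

```python
# Binary cross-entropy comparing an output value to a target value.
import torch
import torch.nn as nn

prediction = torch.tensor([0.9])  # predicted liveness probability
target = torch.tensor([1.0])      # training label: live person
error = nn.BCELoss()(prediction, target)  # -ln(0.9), about 0.105
```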
[0063] In one example, the neural network may include a technique for updating the weights in one or more of the layers based on the error. The neural network may use gradient descent to update weights. Alternatively, the neural network may use an optimizer to update weights in each layer. For example, the optimizer may use various techniques, or combinations of techniques, to update weights in each layer. When appropriate, the neural network may include a mechanism to prevent overfitting, such as regularization (e.g., L1 or L2), dropout, and/or other techniques. The amount of training data may also be increased to prevent overfitting.
[0064] In one example, FIG. 2 depicts nodes that may perform various types of processing, such as discrete computations, computer programs, and/or mathematical functions implemented by a computing device. For example, the input nodes 210a-n may comprise logical inputs of different data sources, such as one or more data servers. The processing nodes 220a-n may comprise parallel processes executing on multiple servers in a data center. And, the output nodes 240a-n may be the logical outputs that ultimately are stored in results data stores, such as the same or different data servers as for the input nodes 210a-n. Notably, the nodes need not be distinct. For example, two nodes in any two sets may perform the exact same processing. The same node may be repeated for the same or different sets.
[0065] Each of the nodes may be connected to one or more other nodes. The connections may connect the output of a node to the input of another node. A connection may be correlated with a weighting value. For example, one connection may be weighted as more important or significant than another, thereby influencing the degree of further processing as input traverses across the artificial neural network. Such connections may be modified such that the artificial neural network 200 may learn and/or be dynamically reconfigured. Though nodes are depicted as having connections only to successive nodes in FIG. 2, connections may be formed between any nodes. For example, one processing node may be configured to send output to a previous processing node.
[0066] Input received in the input nodes 210a-n may be processed through processing nodes, such as the first set of processing nodes 220a-n and the second set of processing nodes 230a-n. The processing may result in output in output nodes 240a-n. As depicted by the connections from the first set of processing nodes 220a-n and the second set of processing nodes 230a-n, processing may comprise multiple steps or sequences. For example, the first set of processing nodes 220a-n may be a rough data filter, whereas the second set of processing nodes 230a-n may be a more detailed data filter.
[0067] The artificial neural network 200 may be configured to effectuate decision-making. As a simplified example for the purposes of explanation, the artificial neural network 200 may be configured to detect faces in photographs. The input nodes 210a-n may be provided with a digital copy of a photograph. The first set of processing nodes 220a-n may each be configured to perform specific steps to remove non-facial content, such as large contiguous sections of the color red. The second set of processing nodes 230a-n may each be configured to look for rough approximations of faces, such as facial shapes and skin tones. Multiple subsequent sets may further refine this processing, each looking for further, more specific tasks, with each node performing some form of processing which need not necessarily operate in the furtherance of that task. The artificial neural network 200 may then predict the location of the face. The prediction may be correct or incorrect.
[0068] The feedback system 250 may be configured to determine whether or not the artificial neural network 200 made a correct decision. Feedback may comprise an indication of a correct answer and/or an indication of an incorrect answer and/or a degree of correctness (e.g., a percentage). For example, in the facial recognition example provided above, the feedback system 250 may be configured to determine if the face was correctly identified and, if so, what percentage of the face was correctly identified. The feedback system 250 may already know a correct answer, such that the feedback system may train the artificial neural network 200 by indicating whether it made a correct decision. The feedback system 250 may comprise human input, such as an administrator telling the artificial neural network 200 whether it made a correct decision. The feedback system may provide feedback (e.g., an indication of whether the previous output was correct or incorrect) to the artificial neural network 200 via input nodes 210a-n or may transmit such information to one or more nodes. The feedback system 250 may additionally or alternatively be coupled to the storage 270 such that output is stored. The feedback system may not have correct answers at all, but instead base feedback on further processing: for example, the feedback system may comprise a system programmed to identify faces, such that the feedback allows the artificial neural network 200 to compare its results to that of a manually programmed system.
[0069] The artificial neural network 200 may be dynamically modified to learn and provide better input. Based on, for example, previous input and output and feedback from the feedback system 250, the artificial neural network 200 may modify itself. For example, processing in nodes may change and/or connections may be weighted differently. Following on the example provided previously, the facial prediction may have been incorrect because the photos provided to the algorithm were tinted in a manner which made all faces look red. As such, the node which excluded sections of photos containing large contiguous sections of the color red could be considered unreliable, and the connections to that node may be weighted significantly less. Additionally or alternatively, the node may be reconfigured to process photos differently. The modifications may be predictions and/or guesses by the artificial neural network 200, such that the artificial neural network 200 may vary its nodes and connections to test hypotheses.
[0070] The artificial neural network 200 need not have a set number of processing nodes or number of sets of processing nodes, but may increase or decrease its complexity. For example, the artificial neural network 200 may determine that one or more processing nodes are unnecessary or should be repurposed, and either discard or reconfigure the processing nodes on that basis. As another example, the artificial neural network 200 may determine that further processing of all or part of the input is required and add additional processing nodes and/or sets of processing nodes on that basis.
[0071] The feedback provided by the feedback system 250 may be mere reinforcement (e.g., providing an indication that output is correct or incorrect, awarding the machine learning algorithm a number of points, or the like) or may be specific (e.g., providing the correct output). For example, the artificial neural network 200 may be asked to detect faces in photographs. Based on an output, the feedback system 250 may indicate a score (e.g., 75% accuracy, an indication that the guess was accurate, or the like) or a specific response (e.g., specifically identifying where the face was located).
[0072] The artificial neural network 200 may be supported or replaced by other forms of machine learning. For example, one or more of the nodes of artificial neural network 200 may implement a decision tree, associational rule set, logic programming, regression model, cluster analysis mechanisms, Bayesian network, propositional formulae, generative models, and/or other algorithms or forms of decision-making. The artificial neural network 200 may effectuate deep learning.
[0073] FIG. 3 illustrates a method 300 for liveness detection, in accordance with one or more implementations. The operations of method 300 presented below are intended to be illustrative. In some implementations, method 300 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 300 are illustrated in FIG. 3 and described below is not intended to be limiting.
[0074] In some implementations, method 300 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of method 300 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 300.
[0075] An operation 302 may include receiving, by a computing device, an image. The image may include a facial portion and a first background portion. Operation 302 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to image receiving module 108, in accordance with one or more implementations.
[0076] An operation 304 may include generating, based on the image, a cropped image. The cropped image may include the facial portion and a second background portion that is a subset of the first background portion. Operation 304 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to image generating module 110, in accordance with one or more implementations.
[0077] An operation 306 may include generating, based on the image and via a first convolutional neural network, a first embedding. The first convolutional neural network may include an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers. Operation 306 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
[0078] An operation 308 may include generating, based on the cropped image and via a second convolutional neural network, a second embedding. Operation 308 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
[0079] An operation 310 may include generating, via a concatenation of the first embedding and the second embedding, a combined embedding. Operation 310 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to embedding generating module 112, in accordance with one or more implementations.
[0080] An operation 312 may include generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person. Operation 312 may be performed by one or more hardware processors configured by machine-readable instructions including a module that is the same as or similar to output generating module 114, in accordance with one or more implementations.
[0081] An operation 314 may include determining whether the image received in step 302 passes a liveness test. Operation 314 may be performed as described in connection with step 450 in FIG. 4 below. For example, the output generated in step 312 may be used to determine whether the image passes a liveness test and/or corresponds to a presentation attack. Operation 314 may be performed by the computing platform 102 or one or more hardware processors configured by machine-readable instructions. If it is determined that the image passes the liveness test, an operation 316 may be performed. If it is determined that the image does not pass the liveness test, an operation 318 may be performed. Operation 316 may include authorizing an action. Authorizing an action is described in more detail below in connection with step 455 of FIG. 4. Operation 318 may include declining authorization. Declining authorization is described in more detail below in connection with step 460 of FIG. 4.
[0082] FIG. 4 shows an example method 400 for detecting liveness of a user in an image. The example method 400 may be performed using any device, module, and/or component described in connection with FIGS. 1-3 and/or other device(s). Although one or more steps of the example method of FIG. 4 are described for convenience as being performed by the computing platform 102, one, some, or all of such steps may be performed by one or more other devices, and steps may be distributed among one or more devices, including any devices such as those described in connection with FIGS. 1-3. One or more steps of the example method of FIG. 4 may be rearranged, modified, repeated, and/or omitted.
[0083] At step 405, the computing platform 102 may receive an input image. The input image may show a user for whom liveness should be detected. For example, the input image may show a user and the computing platform 102 may be tasked with determining whether the user was physically present when the image was taken or whether the image contains a spoof (e.g., the image is a picture of an image of the user, the image contains a computer generated image of the user, etc.).
[0084] At step 410, the computing platform 102 may perform an initial test on the image. The initial test may comprise determining whether the image is suitable for use in a machine learning model to detect liveness. The computing platform 102 may determine a location of the user's face within the image. The computing platform 102 may determine regions of interest corresponding to the image and/or the user's face shown in the image. For example, the computing platform 102 may determine a location of facial features such as the user's eyes, nose, mouth, ears, and/or any other facial feature. The computing platform 102 may analyze the regions of interest (e.g., the facial features). For example, the computing platform 102 may analyze the regions of interest to determine whether the eyes of a user shown in the image are open or closed. The initial test may include checking for any suitability criteria described above in connection with FIG. 1. Additionally or alternatively, the computing platform 102 may use facial recognition to verify the face of the user in the image. For example, the computing platform 102 may verify that the user shown in the image matches a name and face stored in a database.
[0085] At step 415, the computing platform 102 may determine whether the image passed the initial test. The computing platform 102 may use any information or determinations made in step 410 as input to determine whether the initial test is passed. The computing platform 102 may determine that one or more criteria are satisfied (e.g., the user's eyes are open in the image, etc.). The criteria may include whether the user has one or more expected facial features. For example, the computing platform 102 may determine whether the image shows a nose, one or more eyes, or any other facial feature. If one or more criteria are satisfied, step 420 may be performed. If one or more criteria are not satisfied, step 416 may be performed. At step 416, the computing platform 102 may determine whether to request a new image. For example, there may be a threshold number of attempts that a user is allowed to submit an image for liveness detection (e.g., the threshold may be 3, 5, 10, etc.). The threshold number of attempts may be used to limit an imposter's ability to circumvent the system or make an unlimited number of presentation attacks. The computing platform 102 may determine how many times the user that sent the input image received in step 405 has submitted an image. If the threshold is satisfied, the method 400 may end. If the threshold is not satisfied, step 417 may be performed. At step 417, the computing platform 102 may send a request to the user for one or more new images. For example, the computing platform 102 may send a message to a user device corresponding to the user to request that one or more new images of the user be taken.
[0086] Step 420 may be performed if it is determined that the image passes the initial test in step 415. At step 420, the computing platform 102 may generate a cropped image. The computing platform 102 may crop the image received in step 405 so that only the user's face remains. Alternatively, the computing platform 102 may crop the image to remove a threshold portion of the background (e.g., the portion of the image that does not show the user). Additionally or alternatively, the cropped image may be generated as described above in connection with FIG. 1.
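A sketch of such a crop follows, assuming a face box (x, y, w, h) produced by any face detector; the margin that controls how much background survives is an arbitrary placeholder.

```python
# Illustrative crop around a detected face box, padded by a margin so
# that only a subset of the original background remains. The detector
# and the margin value are assumptions for this sketch.
import numpy as np


def crop_face(image: np.ndarray, box: tuple, margin: float = 0.1) -> np.ndarray:
    """box = (x, y, w, h) from any face detector; margin pads the box."""
    x, y, w, h = box
    pad_w, pad_h = int(w * margin), int(h * margin)
    top = max(0, y - pad_h)
    left = max(0, x - pad_w)
    return image[top:y + h + pad_h, left:x + w + pad_w]
```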
[0087] At step 425, the computing platform 102 may use the input image and the cropped image as input into one or more machine learning models. For example, the input image may be input into a first neural network and the cropped image may be input into a second neural network. Each neural network may have been trained separately for liveness detection. For example, the first neural network may have been trained for liveness detection using input images showing the user and the background of the user, while the second neural network may have been trained for liveness detection using cropped images showing only the user’s face. Alternatively, the input image and the cropped image may be input into the same neural network. For example, one convolutional neural network may take both the input image and the cropped image as input (e.g., the convolutional neural network may generate a first embedding for the input image and a second embedding for the cropped image).
[0088] At step 430, the computing platform 102 may generate an embedding for the input image and the cropped image as described above in connection with FIG. 1. The embedding for the input image may be the output of a layer of a first neural network and the embedding for the cropped image may be the output of a layer of a second neural network. Alternatively, the same neural network may be used to generate both embeddings. For example, the embedding for the input image may be the output of a layer of a neural network and the embedding for the cropped image may be the output of a layer of the same neural network.
[0089] At step 440, the computing platform 102 may concatenate the embedding of the cropped image and the embedding of the input image. The computing platform may append the embedding of the cropped image onto the end of the embedding of the input image. For example, if each embedding were of size 1024 (e.g., each embedding comprises 1024 values), the concatenated embedding would be of size 2048. By combining both embeddings, the computing platform 102 may be able to use both embeddings to more accurately detect liveness of a user in an image. Any number of cropped images may be used by the computing platform 102. For example, the computing platform may generate a first cropped image that contains the area just around a user's eyes (e.g., an area that contains the user's eyes and eyebrows but excludes the user's mouth, ears, hair, etc.), a second cropped image that is limited to the user's face, and/or any other number of cropped images that contain any portion of the user's face (e.g., a cropped image corresponding to the nose, a cropped image corresponding to the ears, etc.). Embeddings for any of these cropped images may be generated by a neural network and concatenated. Referring to FIG. 5, a 1024-dimension array 505 may be an embedding of the input image. The array 505 may comprise 1024 values generated by a neural network. A 1024-dimension array 510 may be an embedding of the cropped image. The array 505 and the array 510 may be concatenated to form a 2048-dimension array 515. The values in each of arrays 505-515 may indicate to a neural network, or may be used by a neural network, to make a liveness prediction for the input image.
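In code, the concatenation itself is a single operation; the sketch below uses random stand-in embeddings purely to show the shapes involved.

```python
# Concatenating two 1024-value embeddings into one 2048-value vector.
import torch

full_embedding = torch.randn(1, 1024)  # embedding of the input image (array 505)
crop_embedding = torch.randn(1, 1024)  # embedding of the cropped image (array 510)
combined = torch.cat([full_embedding, crop_embedding], dim=1)
print(combined.shape)  # torch.Size([1, 2048]) -- the combined array 515
```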
[0090] At step 445, the computing platform 102 may generate a liveness prediction. The computing platform 102 may use the concatenated embedding as input into a fully connected layer of a neural network. The fully connected layer may use a function (e.g., a sigmoid function, a ReLU function, or other function) that indicates whether the image contains a live user or not. For example, the function may be used by the computing platform 102 to generate an output value that is between 0 and 1.
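One plausible form of this prediction head is sketched below: a single fully connected layer over the 2048-value combined embedding followed by a sigmoid, so the output lands in (0, 1). The layer sizes are illustrative, not the disclosed configuration.

```python
# Illustrative prediction head: fully connected layer plus sigmoid.
import torch
import torch.nn as nn

combined = torch.randn(1, 2048)               # stand-in combined embedding
head = nn.Sequential(nn.Linear(2048, 1), nn.Sigmoid())
score = head(combined).item()                 # liveness score in (0, 1)
passes_liveness_test = score > 0.5            # decision threshold from step 450
```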
[0091] At step 450, the computing platform 102 may determine whether the liveness test is passed (e.g., whether the image received in step 405 corresponds to a presentation attack or not). The computing platform may use the output value generated in step 445 to determine whether the liveness test has been passed or not. For example, if the value is above 0.5 the computing platform 102 may determine that the image contains a live user. If the value is at or below 0.5, the computing platform 102 may determine that the image does not contain a live user (e.g., computing platform 102 may determine that a presentation attack has occurred). Step 455 may be performed if it is determined that the liveness test is passed.
[0092] At step 455, the computing platform 102 may authorize an action. A user and/or user device corresponding to the input image received in step 405 may be authorized to access one or more computer systems (e.g., the user may be allowed to log in to a system). A user and/or user device may be authorized to perform one or more actions such as depositing a check, transferring money, opening an account, etc. Passive liveness detection techniques described herein may allow a bank to operate without a branch office, because they may allow the bank to verify the identity of a user electronically (e.g., over the Internet). The one or more actions may comprise validating a user identification (e.g., driver's license, passport, etc.) of the user corresponding to the input image. The one or more actions may comprise verifying that the user does not have duplicate accounts or policies (e.g., insurance). For example, by verifying each user through passive liveness detection, the computing platform 102 may prevent a user from opening multiple accounts or policies. The one or more actions may comprise authorizing a payment (e.g., using a credit card or any other means). For example, using passive liveness detection, the computing platform 102 may determine that the user making a purchase is the user that is associated with the credit card being used for the purchase. The computing platform 102 may use passive liveness detection to prevent fraud by impersonation. The computing platform 102 may use passive liveness detection in a digital environment where users are enrolling for services remotely (and subsequently accessing the services remotely). The computing platform 102 may use passive liveness detection (e.g., as described in connection with FIGS. 1-8) to prevent an imposter from gaining access (e.g., to information, to a computer system, etc.) using previously captured (e.g., paper and/or electronic) images (e.g., images from social media, publicly available photographs, etc.).
[0093] Step 460 may be performed, for example, if it is determined that the liveness test is not passed in step 450. At step 460, the computing platform 102 may decline authorization to the user device that sent the input image in step 405. For example, the computing platform 102 may prevent a user device from accessing a system (e.g., via a login), and/or may prevent any of the actions described in connection with step 455 above.
[0094] FIG. 6 shows an example neural network architecture 600 that may be used for passive liveness detection. The neural network architecture 600 may be used for passive liveness detection as described in any of FIGS. 1-5. The neural network architecture 600 may include a convolutional neural network 601 and a convolutional neural network 602. The convolutional neural network 601 may be configured to receive as input an input image (e.g., an image of a user as described above). The convolutional neural network 602 may be configured to receive a cropped image as input (e.g., an image that has been cropped to show only a user's face as described above). Although only convolutional neural networks 601 and 602 are shown in FIG. 6, the neural network architecture 600 may comprise any number of convolutional neural networks. The neural network architecture 600 may comprise a neural network corresponding to the input image and one or more neural networks corresponding to one or more cropped images. For example, if three cropped images are used by the computing platform 102, the neural network architecture 600 may comprise four neural networks (e.g., one for the input image 605, and one for each cropped image). The number of convolutional neural networks contemplated by this disclosure is not limited by the illustrative diagram in FIG. 6. The convolutional neural network 601 may include a convolutional layer 606, a depthwise convolutional layer 607, and/or an average pooling layer 608. The convolutional neural network 602 may include a convolutional layer 611, a depthwise convolutional layer 612, and/or an average pooling layer 613. Although only three layers are shown in each of convolutional neural networks 601-602, each convolutional neural network may include any number of convolutional, depthwise convolutional, or any other type of layers, such as those described above in connection with FIG. 2.
[0095] The neural network architecture 600 may comprise a concatenation layer 620, which may be configured to concatenate or otherwise combine two or more image embeddings (e.g., an image embedding generated for the input image 605, an image embedding generated for the cropped image 610, and an image embedding for each additional cropped image if additional cropped images are used). The neural network architecture may comprise a fully connected layer 621 and/or a binary cross-entropy layer, which may be configured to generate a liveness prediction for one or more images (e.g., the input image 605).
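Putting the pieces of FIG. 6 together, the sketch below shows an assumed end-to-end form of the two-branch architecture: one encoder per image, a concatenation, and a fully connected prediction layer. The individual layer choices merely stand in for the convolutional, depthwise convolutional, and average-pooling stacks shown in the figure.

```python
# Assumed end-to-end sketch of the FIG. 6 architecture; layer choices
# are illustrative stand-ins, not the disclosed configuration.
import torch
import torch.nn as nn


def branch(embed_dim: int = 1024) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),            # conv layer
        nn.Conv2d(32, 32, 3, stride=2, padding=1, groups=32), nn.ReLU(),  # depthwise
        nn.Conv2d(32, embed_dim, 1), nn.ReLU(),                         # pointwise
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),                          # avg pool -> embedding
    )


class LivenessNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.full_branch = branch()   # CNN 601: full input image
        self.crop_branch = branch()   # CNN 602: cropped face image
        self.head = nn.Sequential(nn.Linear(2048, 1), nn.Sigmoid())

    def forward(self, image: torch.Tensor, crop: torch.Tensor) -> torch.Tensor:
        combined = torch.cat(
            [self.full_branch(image), self.crop_branch(crop)], dim=1
        )                             # concatenation layer 620
        return self.head(combined)   # fully connected layer 621


model = LivenessNet()
score = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 112, 112))
```

Training such a network against live/spoof labels with a binary cross-entropy loss would correspond to the loss layer shown in the figure.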
[0096] The disclosure contemplates not only the one or more methods claimed below, but also the one or more corresponding systems and/or devices that are configured to perform the steps of the methods described herein. In some aspects, an apparatus may be configured accordingly; in other aspects, one or more computer-readable media may store computer-executable instructions that, when executed, configure a system to perform accordingly. For example, the disclosure contemplates a system with one or more convolutional neural networks; a first module for receiving an input image comprising a facial portion and a first background portion; a second module for generating, based on the input image, a cropped image; a third module for generating, based on the input image and via a first convolutional neural network, a first image embedding, wherein the first convolutional neural network comprises an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers; a fourth module for generating, based on the cropped image and via a second convolutional neural network, a second image embedding; a fifth module for generating, via a concatenation of the first image embedding and the second image embedding, a combined embedding; and a sixth module for generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person. One or more of the first to sixth modules may be combined, consolidated, or divided into more or fewer modules for executing the operations described herein. The modules may comprise computer-executable instructions (e.g., compiled software code) that execute in a computer processor. In other examples, the modules may be implemented as hardware in an integrated circuit or as firmware or another hardware-software combination. Moreover, the other methods and steps described herein may be performed by one or more of the first to sixth modules, or other modules. In some examples, one or more modules may be integrated into the convolutional neural network.
[0097] Although the present technology has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the technology is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present technology contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.
[0098] Many illustrative embodiments are listed below in accordance with one or more aspects disclosed herein. Although many of the embodiments listed below are described as depending from other embodiments, the dependencies are not so limited. For example, embodiment #5 (below) is expressly described as incorporating the features of embodiment #1 (below); however, the disclosure is not so limited. For example, embodiment #5 may depend from any one or more of the preceding embodiments (i.e., embodiment #1, embodiment #2, embodiment #3, and/or embodiment #4). Moreover, this disclosure contemplates that any one or more of embodiments #2-#5 may be incorporated into embodiment #1.
Embodiment #1. A point of transaction device configured for performing multi-factor authentication before approving a transaction originating from a sender party without subjecting the sender party to memorizing a reusable, non-one-time-use PIN, the point of transaction device comprising: a user input module configured to receive the sender party input regarding the transaction and to receive a selection of a transaction type, a triggering transaction amount, and a selection of at least one transaction parameter; a module configured to pre-register the sender party with a transaction information server by storing a stored image of an identification document of the sender party, wherein the identification document comprises at least a facial image of the sender party; a communications module; an identification capture module, in operative communication with an image capture device and a biometric capture device; a processor; a memory in electronic communication with the processor and configured to store a security rule that associates the selected at least one transaction parameter with the selected transaction type and the selected triggering transaction amount; and instructions stored in the memory, which when executed by the processor, cause the point of transaction device to: capture, using the image capture device, an image of the identification document received from the sender party to the transaction contemporaneous to the transaction, wherein the captured image of the identification document matches the stored image of the identification document; transmit, from the point of transaction device, a request for the transaction to the transaction information server, the request comprising a transaction amount, a customer identifier code, and an identification parameter, wherein the identification parameter is collected using the identification capture module, and wherein the identification parameter comprises the captured image of the identification document; cause the transaction information server to transmit a one-time-use identification code to a smartphone of the sender party; transmit, through the communications module to the transaction information server, a temporary one-time-use identification code provided through the user input module by the sender party to the transaction contemporaneously with the transaction, for authentication before receiving the transaction identifier code; receive, from the transaction information server, a transaction identifier code based on an authentication of the identification parameter, wherein the authentication of the identification parameter is performed by an approving authority, separate from the transaction information server, and wherein the authentication of the identification parameter is based on data maintained by one or more government agencies; transmit, from the point of transaction device, the transaction identifier code and at least a portion of the request for the transaction to a transaction authority separate from the transaction information server; and receive an approval for the transaction from the transaction authority, the approval based on the transaction identifier code and the request; wherein the authentication of the identification parameter by an approving authority further comprises: forwarding at least a portion of the request for the transaction to a plurality of approving authorities in an order associated with a hierarchy of the approving authorities, wherein a first approving authority confirms a first aspect before a second approving authority confirms a second aspect; and receiving confirmation from a first approving authority that the identification parameter is associated with the customer identifier code, wherein the security rule further configures the point of transaction device to capture, using the biometric capture device, a photo image of the sender party contemporaneous to the transaction at a location of the point of transaction device for all transactions where the selection of the transaction type is cash.
Embodiment #2. The point of transaction device of Embodiment #1, wherein the identification capture module is in operative communication with the biometric capture device configured to capture biometric data selected from iris, fingerprint, face and voice.
Embodiment #3. The point of transaction device of Embodiment #1, further comprising instructions to: receive, from the transaction information server, a second identification parameter from the record associated with the customer identifier code stored by the transaction information server; and provide an indication of whether the sender party to the transaction matches the second identification parameter.

Embodiment #4. The point of transaction device of Embodiment #1, wherein the transmission to and receipt from the transaction information server is via a wireless communications channel and transmission to and receipt from the transaction authority is via a wired communications channel.
Embodiment #5. The point of transaction device of Embodiment #1, wherein the identification parameter comprises a biometric data and the point of transaction device further comprises: a biometric capture device configured to capture the biometric data from the sender party to the transaction contemporaneous to the transaction.

Claims

What is claimed is:
1. A method that more accurately detects passive liveness using a plurality of convolutional neural networks and concatenating, the method comprising: receiving, by a computing device, an input image, wherein the input image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises the facial portion and a second background portion that is a subset of the first background portion; generating, based on the input image and via a first convolutional neural network, a first image embedding, wherein the first convolutional neural network comprises an average pooling layer, a fully connected layer, and a plurality of depthwise convolutional layers; generating, based on the cropped image and via a second convolutional neural network, a second image embedding; concatenating the first image embedding and the second image embedding; generating, via the concatenation of the first image embedding and the second image embedding, a combined embedding; after the concatenating, generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person, wherein the output more accurately detects liveness than when the concatenating step is omitted; and denying, based on the output indicating whether the facial portion corresponds to a live person, access to a computer system.
2. The method of claim 1, wherein the first convolutional neural network comprises: a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers; and wherein the second convolutional neural network comprises: a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels.
3. The method of claim 1, wherein a first width parameter corresponding to input channels of the first convolutional neural network is greater than a second width parameter corresponding to input channels of the second convolutional neural network.
4. The method of claim 1, wherein the generating the cropped image comprises: removing, from the input image, pixels corresponding to the first background portion that are outside the second background portion.
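As one hypothetical realization of claim 4, the crop can be computed from a face bounding box supplied by any face detector; the detector choice and the margin value below are assumptions, not claim limitations.

```python
import numpy as np

def crop_to_face(image: np.ndarray, box, margin: float = 0.1) -> np.ndarray:
    """image: HxWxC pixel array; box: (x, y, w, h) face bounding box."""
    x, y, w, h = box
    mx, my = int(w * margin), int(h * margin)
    top, left = max(0, y - my), max(0, x - mx)
    bottom = min(image.shape[0], y + h + my)
    right = min(image.shape[1], x + w + mx)
    # Pixels outside this window (most of the first background portion) are
    # discarded; a small margin of background around the face remains.
    return image[top:bottom, left:right]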
5. The method of claim 1, further comprising: training, based on a first plurality of images and a second plurality of cropped images, the first convolutional neural network and the second convolutional neural network, to output information that indicates liveness of each person in the first plurality of images.
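A minimal training sketch consistent with claim 5, assuming paired full and cropped images with binary live/spoof labels and the LivenessModel sketch above; the loss and optimizer choices are assumptions.

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One pass over (full image, cropped image, live/spoof label) batches."""
    criterion = nn.BCELoss()  # model output is already sigmoid-activated
    model.train()
    for full_img, cropped_img, label in loader:
        full_img, cropped_img = full_img.to(device), cropped_img.to(device)
        label = label.float().unsqueeze(1).to(device)  # shape (batch, 1)
        optimizer.zero_grad()
        loss = criterion(model(full_img, cropped_img), label)
        loss.backward()
        optimizer.step()
```

Training both branches jointly on full and cropped views is what lets the combined embedding capture both scene-level cues (e.g., a phone bezel in the background) and face-level cues.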
6. The method of claim 1, further comprising: receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
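Claim 6 leaves the suitability test open; one hypothetical gate based on detected face geometry might look like the following, where all thresholds are assumptions.

```python
def suitable_for_liveness(face_box, image_shape, min_face_frac=0.05):
    """Gate an image before liveness scoring based on face geometry."""
    x, y, w, h = face_box
    img_h, img_w = image_shape[:2]
    if w * h < min_face_frac * img_w * img_h:
        return False  # face too small for reliable liveness analysis
    if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        return False  # face partially outside the frame
    return True
```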
7. The method of claim 1, wherein the generating, based on the combined embedding, the output comprises generating the output via a sigmoid function.
8. A method comprising: generating, by a computing device and via a camera, a plurality of images, wherein each image of the plurality of images indicates a same person with a background, and wherein each image of the plurality of images is generated within a threshold time of each other; generating, via a first neural network and based on the plurality of images, a first image embedding, wherein the first neural network comprises an average pooling layer and a fully connected layer; cropping each image of the plurality of images by removing a portion of the background; generating, via a second neural network and based on the plurality of cropped images, a second image embedding; concatenating the first image embedding and the second image embedding to generate a combined embedding; generating, based on the combined embedding, an output value that more accurately detects liveness of the same person in the plurality of images than when the concatenating step is omitted; and granting, to a user device and based on the output value, access to a computing system.
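For the multi-frame method of claim 8, one way to aggregate per-frame embeddings is a CNN encoder followed by an LSTM, which is also one possible reading of the "recurrent convolutional neural network" of claim 9. This is a sketch under that assumption, not the disclosed implementation; two such branches, one over the full frames and one over the cropped frames, would then be concatenated as recited in claim 8.

```python
import torch.nn as nn

class TemporalLivenessBranch(nn.Module):
    """Encodes each frame with a CNN, then aggregates the per-frame
    embeddings over time with an LSTM into one sequence embedding."""
    def __init__(self, frame_encoder: nn.Module, embed_dim=128, hidden=128):
        super().__init__()
        self.encoder = frame_encoder  # e.g., the EmbeddingCNN sketched above
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True)

    def forward(self, frames):  # frames: (batch, time, channels, height, width)
        n, t = frames.shape[:2]
        per_frame = self.encoder(frames.flatten(0, 1)).view(n, t, -1)
        _, (h_last, _) = self.rnn(per_frame)
        return h_last[-1]  # (batch, hidden): one embedding per image burst
```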
9. The method of claim 8, wherein the first neural network comprises a recurrent convolutional neural network.
10. The method of claim 8, wherein the removing the portion of the background comprises removing one or more pixels corresponding to a background portion of a first image of the plurality of images.
11. The method of claim 8, further comprising: training, based on the plurality of images and the plurality of cropped images, the first and second neural networks, to output information that indicates the liveness of the same person.
12. The method of claim 8, further comprising: receiving an additional image for liveness detection, wherein the additional image comprises a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
13. The method of claim 8, wherein the generating an output value comprises generating the output value via a sigmoid function.
14. A method comprising: receiving, by a computing device, an input image, wherein the input image comprises a facial portion and a first background portion; generating, based on the input image, a cropped image, wherein the cropped image comprises a subset of pixels of the input image; generating, based on the input image and via a first neural network, a first image embedding; generating, based on the cropped image and via a second neural network, a second image embedding; concatenating the first image embedding and the second image embedding; generating, via the concatenation, a combined embedding; generating, based on the combined embedding, output indicating whether the facial portion corresponds to a live person, wherein the output more accurately detects liveness than when the concatenating step is omitted; and denying, based on the output indicating whether the facial portion corresponds to the live person, access to a computer system.
15. The method of claim 14, further comprising: training, based on the input image and the cropped image, the first neural network and the second neural network, to output information that indicates liveness of the live person in the input image.
16. The method of claim 14, further comprising: receiving an additional image for liveness detection, wherein the additional image indicates a person; and determining, based on facial features of the person, that the additional image is not suitable for liveness detection.
17. The method of claim 14, wherein the generating the output comprises generating the output via a sigmoid function, the method further comprising: granting, based on the output, access to the computer system.
18. The method of claim 14, wherein the first neural network comprises a convolutional neural network comprising: a first plurality of layers and a first plurality of input channels, wherein each input channel of the first plurality of input channels corresponds to a layer of the first plurality of layers; and wherein the second neural network comprises a convolutional neural network comprising: a second plurality of input channels, wherein each input channel of the second plurality of input channels is determined by reducing a corresponding input channel of the first plurality of input channels.
PCT/US2021/072328 2013-05-31 2021-11-10 Artificial intelligence for passive liveness detection WO2022104340A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/875,274 US20220375259A1 (en) 2013-05-31 2022-07-27 Artificial intelligence for passive liveness detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063112129P 2020-11-10 2020-11-10
US63/112,129 2020-11-10

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/988,730 Continuation-In-Part US20160132890A1 (en) 2013-05-31 2016-01-05 Point of transaction device with multi-factor authentication

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/875,274 Continuation-In-Part US20220375259A1 (en) 2013-05-31 2022-07-27 Artificial intelligence for passive liveness detection

Publications (1)

Publication Number Publication Date
WO2022104340A1 true WO2022104340A1 (en) 2022-05-19

Family

ID=81601775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/072328 WO2022104340A1 (en) 2013-05-31 2021-11-10 Artificial intelligence for passive liveness detection

Country Status (1)

Country Link
WO (1) WO2022104340A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019002360A1 (en) * 2017-06-28 2019-01-03 Henkel Ag & Co. Kgaa Uv-pre-curable epoxide composition for a two-stage assembly process

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG ET AL.: "Face Anti-Spoofing: Model Matters, So Does Data", COMPUTER VISION FOUNDATION, 29 October 2019 (2019-10-29), pages 3507 - 3511, XP033686297, Retrieved from the Internet <URL:https://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_Face_Anti-Spoofing_Model_Matters_so_Does_Data_CVPR_2019_paper.pdf> [retrieved on 20220115] *
YUAN ET AL.: "Fingerprint Liveness Detection Using an Improved CNN With Image Scale Equalization", IEEE ACCESS, vol. 7, 12 March 2019 (2019-03-12), pages 26953 - 26964, XP011714623, [retrieved on 20220115], DOI: 10.1109/ACCESS.2019.2901235 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230186307A1 (en) * 2021-12-14 2023-06-15 International Business Machines Corporation Method for enhancing transaction security

Similar Documents

Publication Publication Date Title
US11887404B2 (en) User adaptation for biometric authentication
US11790066B2 (en) Systems and methods for private authentication with helper networks
US20170262472A1 (en) Systems and methods for recognition of faces e.g. from mobile-device-generated images of faces
US11489866B2 (en) Systems and methods for private authentication with helper networks
Ahuja et al. Convolutional neural networks for ocular smartphone-based biometrics
US20220375259A1 (en) Artificial intelligence for passive liveness detection
US10885171B2 (en) Authentication verification using soft biometric traits
Wang et al. A face-recognition approach using deep reinforcement learning approach for user authentication
US20220327189A1 (en) Personalized biometric anti-spoofing protection using machine learning and enrollment data
Yu et al. Salience-aware face presentation attack detection via deep reinforcement learning
WO2022104340A1 (en) Artificial intelligence for passive liveness detection
Mishra et al. A Face Recognition Method Using Deep Learning to Identify Mask and Unmask Objects
Agbinya et al. Design and implementation of multimodal digital identity management system using fingerprint matching and face recognition
Sharifi Score-level-based face anti-spoofing system using handcrafted and deep learned characteristics
Zolotarev et al. Liveness detection methods implementation to face identification reinforcement in gaming services
Ghofrani et al. Attention-based face antispoofing of rgb camera using a minimal end-2-end neural network
Shaker et al. Identification Based on Iris Detection Technique.
Ghofrani et al. Attention-Based Face AntiSpoofing of RGB Images, using a Minimal End-2-End Neural Network
Joans et al. Enhanced Elman Spike Neural Network optimized with Glowworm Swarm Optimization for Authentication of Multiple Transaction using Finger Vein
Abate et al. Optimization of score-level biometric data fusion by constraint construction training
Kothawade et al. Application of deep convolutional neural network to prevent ATM fraud by facial disguise identification
KUMAR A COMPARATIVE ANALYSIS ON FACE ANTI SPOOFING DETECTION APPROACHES
Shreyas et al. A Review on Neural Networks and its Applications
Srivika et al. Biometric Verification using Periocular Features based on Convolutional Neural Network
Mallick et al. AI-based Mechanism to Authorise Beneficiaries at Covid Vaccination Camps using Deep Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893069

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893069

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.10.2023)