Disclosure of Invention
The present invention is directed to a living body detection method based on depth map prediction, which aims to overcome the above-mentioned problems.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a living body detection method based on depth map prediction, the method comprising:
acquiring a first infrared image and a second infrared image of the same photographic subject;
inputting the first infrared image into a trained depth model, and performing feature extraction on the first infrared image based on the depth model to obtain a first feature map; and,
inputting the second infrared image into the trained depth model, and performing feature extraction on the second infrared image based on the depth model to obtain a second feature map;
generating a target depth map corresponding to the first infrared image and the second infrared image according to the first feature map and the second feature map;
inputting the target depth map into a pre-trained first classification model, and performing living body classification on the target depth map through the first classification model to obtain a first classification result; and,
inputting the first infrared image or the second infrared image into a trained second classification model, and performing living body classification on the first infrared image or the second infrared image through the second classification model to obtain a second classification result;
and determining whether the photographic subject is a living body according to the first classification result and the second classification result.
In the living body detection method based on depth map prediction, the training process of the depth model comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample subsets, and each training sample subset comprises a first sample and a second sample;
for each training sample subset, inputting a first sample in the training sample subset into a preset initial model, and performing feature extraction on the first sample based on the initial model to obtain a first feature map; and,
inputting a second sample in the training sample subset into the initial model, and performing feature extraction on the second sample based on the initial model to obtain a second feature map;
calculating the distance between the first feature map and the second feature map to obtain a training distance corresponding to the training sample subset;
and adjusting the parameters of the initial model based on the training distance until the initial model converges to obtain a depth model.
In the living body detection method based on depth map prediction, the target depth map consists of a plurality of target depth values; generating the target depth map corresponding to the first infrared image and the second infrared image according to the first feature map and the second feature map specifically comprises:
for each pixel point in the first feature map, performing a subtraction on the pixel values corresponding to that pixel point in the first feature map and the second feature map to obtain an initial depth value corresponding to the pixel point;
and up-sampling the initial depth map to obtain a target depth map of a preset target size, wherein the initial depth map is composed of the initial depth values.
In the living body detection method based on depth map prediction, for each pixel point in the first feature map, performing a subtraction on the pixel values corresponding to that pixel point in the first feature map and the second feature map to obtain an initial depth value corresponding to the pixel point specifically comprises:
for each pixel point, when the pixel value corresponding to the pixel point is greater than zero, taking the first feature value as the minuend and the second feature value as the subtrahend, and subtracting the two to obtain the initial depth value corresponding to the pixel point;
and when the pixel value corresponding to the pixel point is less than or equal to zero, taking the second feature value as the minuend and the first feature value as the subtrahend, and subtracting the two to obtain the initial depth value corresponding to the pixel point.
In the living body detection method based on depth map prediction, the dimension of the initial depth map is higher than that of the first feature map or the second feature map; the subtraction operation comprises a staggered subtraction, which comprises:
for each pixel point, zero-padding the first feature map according to a preset offset interval to obtain a first processed map, and zero-padding the second feature map to obtain a second processed map;
and subtracting the first processed map and the second processed map to obtain the initial depth value corresponding to the pixel point.
In the living body detection method based on depth map prediction, up-sampling the initial depth map to obtain a target depth map of the preset target size specifically comprises:
inputting the initial depth map into a trained cascade model, and up-sampling the initial depth map based on a plurality of up-sampling modules in the cascade model to obtain the target depth map corresponding to the initial depth map, wherein each up-sampling module comprises a dilated (atrous) convolution.
In the living body detection method based on depth map prediction, determining whether the photographic subject is a living body according to the first classification result and the second classification result specifically comprises:
when both the first classification result and the second classification result indicate a living body, determining that the photographic subject is a living body.
A depth map prediction-based living body detecting apparatus, wherein the depth map prediction-based living body detecting apparatus comprises:
the acquisition module is used for acquiring a first infrared image and a second infrared image of the same photographic subject;
the feature extraction module is used for inputting the first infrared image into a trained depth model and performing feature extraction on the first infrared image based on the depth model to obtain a first feature map; and,
inputting the second infrared image into the trained depth model, and performing feature extraction on the second infrared image based on the depth model to obtain a second feature map;
the prediction module is used for generating a target depth map corresponding to the first infrared image and the second infrared image according to the first feature map and the second feature map;
the classification module is used for inputting the target depth map into a pre-trained first classification model, and performing living body classification on the target depth map through the first classification model to obtain a first classification result; and,
inputting the first infrared image or the second infrared image into a trained second classification model, and performing living body classification on the first infrared image or the second infrared image through the second classification model to obtain a second classification result;
and the determining module is used for determining whether the photographic subject is a living body according to the first classification result and the second classification result.
A computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the living body detection method based on depth map prediction as described in any one of the above.
A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the living body detection method based on depth map prediction as described in any one of the above.
Advantageous effects: compared with the prior art, the invention provides a living body detection method based on depth map prediction and related equipment. The infrared image is input into a second classification model for infrared-image recognition, and the depth map is input into a trained first classification model for depth-map recognition. Whether the photographic subject is a living body is then determined according to the classification results of the two classification models. Further, in the process of obtaining the depth map, instead of using a conventional stereo matching algorithm, feature extraction is performed on the first infrared image and the second infrared image respectively to obtain a first feature map and a second feature map, and the first feature map and the second feature map are then subtracted to obtain the depth map. Because high-dimensional features are extracted during feature extraction, individual pixel points need not be matched on the original images, which reduces the interference of background noise, improves the accuracy of the depth map obtained from the first infrared image and the second infrared image, and further improves the accuracy of living body recognition.
Detailed Description
The invention provides a living body detection method based on depth map prediction and related equipment. In order to make the purposes, technical solutions and effects of the invention clearer and more explicit, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The inventor finds that, for non-living-body attacks, there currently exists a technology that combines an infrared image and an RGB image to realize living body detection. An infrared camera and an RGB camera both need to be provided, so the cost is high. In addition, the depth map is currently obtained mainly by rough calculation with a stereo matching algorithm, in which the quality of the resulting depth map is directly related to the pixel matching effect. When there is much background noise in the shooting environment, matching is easily interfered with, so a high-quality depth map cannot be obtained and the detection efficiency is low.
The embodiments of the invention can be applied to devices or equipment that require identity authentication, such as intelligent door locks and mobile phone authentication. The present embodiment is described by taking the intelligent door lock as an example.
It should be noted that the above application scenarios are only presented to facilitate understanding of the present invention, and the embodiments of the present invention are not limited in any way in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
The invention will be further explained by the description of the embodiments with reference to the drawings.
As shown in fig. 1, the present embodiment provides a depth map prediction-based in-vivo detection method, which may include the following steps:
And S10, acquiring a first infrared image and a second infrared image of the same photographic subject.
Specifically, when a user stands in front of the intelligent door lock, the intelligent door lock photographs the photographic subject, i.e., the user, through an installed binocular camera. In this embodiment the cameras are infrared cameras: the intelligent door lock photographs the user through an infrared binocular camera, thereby obtaining two infrared images of the current environment, named the first infrared image and the second infrared image respectively. For the subsequent detailed description of the depth map generation process, the first infrared image in this embodiment is the infrared image captured by the left-eye infrared camera, and the second infrared image is the infrared image captured by the right-eye infrared camera.
S20, inputting the first infrared image into a trained depth model, and performing feature extraction on the first infrared image based on the depth model to obtain a first feature map; and,
inputting the second infrared image into the trained depth model, and performing feature extraction on the second infrared image based on the depth model to obtain a second feature map.
Specifically, the depth model in this embodiment is the model used to extract features from the input first infrared image and second infrared image. The network architecture of the depth model may be any existing convolutional model, on the basis of which parameters or modules may be appropriately adjusted to obtain a better feature extraction effect. In essence, the feature extraction performed by the depth model in this embodiment may be regarded as down-sampling the first infrared image and the second infrared image to obtain high-dimensional features.
Further, feature extraction could be achieved with a conventional trained convolutional model. However, since the first infrared image and the second infrared image are in essence images of the same photographic subject, a good feature extraction effect requires that the features extracted from the first infrared image and the second infrared image have high similarity; otherwise, the depth map generated from them cannot accurately reflect the depth characteristics of the photographic subject.
Therefore, in this embodiment, the depth model is trained in a meta-learning manner, and the training process includes:
and A10, acquiring a training sample set.
Specifically, a large number of training samples are obtained first, and a training sample set is generated based on these training samples. The training sample set comprises a plurality of training sample subsets, and each training sample subset comprises a first sample and a second sample. The type of samples in the training sample subset may be adjusted according to the type of the pre-established initial model. The networks currently used for meta-learning include the triplet network and the twin (Siamese) network. The triplet network trains the model with groups of three training samples: in the triplet network, a training sample subset comprises, besides the first sample and the second sample, a positive sample, a negative sample and a test sample, where the positive sample and the test sample are infrared images obtained by photographing the same object and the negative sample is an infrared image of another object. In the twin network, the first sample in the training sample subset is the positive sample and the second sample is the test sample. The goal of training is that, among the features extracted by the final model, the features of the positive sample and the test sample are as close as possible, while the features of the test sample and the negative sample are as far apart as possible.
In this embodiment, a description of a training process is given by taking a twin network as an example.
A20, for each training sample subset, inputting a first sample in the training sample subset into a preset initial model, and performing feature extraction on the first sample based on the initial model to obtain a first feature map; and,
and inputting a second sample in the training sample subset into the initial model, and performing feature extraction on the second sample based on the initial model to obtain a second feature map.
Specifically, in this embodiment the training samples are input to the initial model in units of training sample subsets. The positive sample in a training sample subset is input into the preset initial model, and feature extraction is performed on the positive sample based on the initial model to obtain the first feature map. Meanwhile, the second sample in the training sample subset, namely the test sample, is input into the initial model, and feature extraction is performed on the test sample to obtain the second feature map. Since the two branches share one initial model, parameters such as weights are shared.
Further, in order to increase the rate of feature extraction, quickly scale a higher-resolution image down to a smaller feature map, and reduce subsequent calculation, in this embodiment the size of the convolution kernel in the initial model is greater than or equal to the default convolution kernel size and the stride is less than or equal to the default stride, so that a larger receptive field and a higher sampling rate are maintained during feature extraction; for example, the selected convolution kernel size is 5 and the stride is 2. Meanwhile, in order to retain more detailed features, the initial model further includes a plurality of residual blocks to reduce detail loss during feature extraction; the number of residual blocks used in this embodiment is 6. In addition, in order to improve the efficiency of feature extraction, this embodiment adopts 3D convolution as the preferred convolution method.
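For illustration only, a minimal PyTorch-style sketch of such a backbone is given below; the module names, channel counts and the use of 2D residual blocks are assumptions made for readability and are not the exact architecture of this embodiment (which, as noted, may use 3D convolution).

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Residual block used to limit detail loss during feature extraction."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            return torch.relu(x + self.body(x))

    class FeatureBackbone(nn.Module):
        """Down-sampling backbone: kernel size 5 and stride 2 for a wide
        receptive field, followed by six residual blocks, per this embodiment."""
        def __init__(self, in_channels=1, channels=32, num_blocks=6):
            super().__init__()
            self.stem = nn.Conv2d(in_channels, channels, kernel_size=5, stride=2, padding=2)
            self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

        def forward(self, x):
            return self.blocks(torch.relu(self.stem(x)))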
And A30, calculating the distance between the first feature map and the second feature map to obtain the training distance corresponding to the training sample subset.
Specifically, after the first feature map and the second feature map are obtained through the initial model, the distance between the first feature map and the second feature map is calculated, where the distance may be the Euclidean distance, the cosine distance, or similar. Its essence is to measure the similarity between the first feature map and the second feature map, so other similarity calculation methods, for example a softmax-based classifier, may also be used to compute this similarity and take it as the distance between the two feature maps, thereby obtaining the training distance corresponding to the training sample subset.
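A minimal sketch of such a distance calculation, assuming PyTorch tensors of shape batch x channels x height x width, might be:

    import torch
    import torch.nn.functional as F

    def training_distance(feat_a, feat_b, metric="euclidean"):
        """Distance between two feature maps of identical shape (B, C, H, W)."""
        a = feat_a.flatten(start_dim=1)
        b = feat_b.flatten(start_dim=1)
        if metric == "euclidean":
            return torch.norm(a - b, dim=1).mean()
        # cosine distance = 1 - cosine similarity
        return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()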
And A40, adjusting the parameters of the initial model based on the training distance until the initial model converges to obtain a depth model.
Specifically, after the training distance corresponding to each training sample subset in the training sample set is obtained, the training distance is back-propagated to the initial model, and the parameters of the initial model are adjusted based on the training distance. The training sample set is then input into the adjusted initial model and the training distance is obtained again; the above steps are repeated until the initial model converges, and the converged initial model is the depth model.
In this embodiment, the convergence condition may be that the training distance falls below a threshold or that the training distance can no longer be reduced.
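A compact sketch of this training loop, building on the training_distance helper above (the pair loader, learning rate and tolerance are illustrative assumptions), might be:

    import torch

    def train_depth_model(model, pairs, epochs=100, lr=1e-3, tol=1e-3):
        """pairs yields (first_sample, second_sample) tensors of the same subject."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        prev = float("inf")
        for _ in range(epochs):
            total = 0.0
            for first, second in pairs:
                # Shared weights: the same model processes both branches.
                d = training_distance(model(first), model(second))
                optimizer.zero_grad()
                d.backward()          # propagate the training distance back to the model
                optimizer.step()
                total += d.item()
            # Converge when the distance is below a threshold or stops decreasing.
            if total < tol or abs(prev - total) < tol:
                break
            prev = total
        return model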
And S30, generating the target depth map corresponding to the first infrared image and the second infrared image according to the first feature map and the second feature map.
Specifically, in the first implementation manner of this embodiment, the difference between the first feature map and the second feature map is calculated; since the feature maps are matrices in nature, the difference is still an image, and this difference image is taken as the target depth map.
In a second implementation manner of this embodiment, because the first feature map and the second feature map are small, the image obtained by direct subtraction is smaller than the first infrared image and the second infrared image and, in the subsequent detection process, cannot be put into correspondence with the infrared-image-based living body detection. Meanwhile, if the target depth map is small, its resolution is low and details are easily lost during the subsequent living body detection based on the target depth map, resulting in low detection precision. Therefore, in the second implementation manner, the image obtained by direct subtraction is used as an initial depth map, and the initial depth map is then up-sampled to obtain the target depth map. The specific process is as follows:
And B10, for each pixel point in the first feature map, performing a subtraction on the pixel values corresponding to that pixel point in the first feature map and the second feature map to obtain an initial depth value corresponding to the pixel point.
Specifically, for each pixel point in the first feature map, the value of that pixel point in the first feature map and the corresponding value in the second feature map are subtracted, and the result is used as the initial depth value corresponding to the pixel point. After the initial depth value corresponding to each pixel point is obtained, the initial depth values are filled into a preset blank matrix according to the position coordinates in the first feature map or the second feature map; the filled blank matrix is the initial depth map, so the initial depth map consists of the initial depth values.
Further, in the process of subtracting the first feature map and the second feature map, this embodiment performs the subtraction pixel by pixel (cyclic subtraction) to reasonably combine the first feature map and the second feature map.
In this process, it is judged for each pixel point whether the corresponding pixel value is greater than zero. When the pixel value is greater than zero, the first feature value is taken as the minuend and the second feature value as the subtrahend, and the two are subtracted to obtain the initial depth value corresponding to the pixel point. When the pixel value is less than or equal to zero, the second feature value is taken as the minuend and the first feature value as the subtrahend, and the two are subtracted to obtain the initial depth value corresponding to the pixel point.
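A minimal sketch of this sign-dependent subtraction (assuming the sign test is applied to the first feature map, which the text leaves implicit) could be:

    import torch

    def initial_depth(first_feat, second_feat):
        """Per-pixel subtraction whose direction depends on the pixel's sign."""
        positive = first_feat > 0
        return torch.where(positive,
                           first_feat - second_feat,   # first feature as minuend
                           second_feat - first_feat)   # second feature as minuend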
Further, in the subtraction process, either direct subtraction or staggered (offset) subtraction may be used; staggered subtraction preserves more edge detail. In this embodiment, a blank depth matrix is preset, and the blank depth matrix is an asymmetric matrix whose dimension is higher than the dimension of the first feature map or the second feature map. In this embodiment, the first feature map is four-dimensional and the blank depth matrix is five-dimensional.
For each pixel point, the first feature map is zero-padded according to a preset offset interval to obtain a first processed map, and the second feature map is zero-padded to obtain a second processed map. Because the first feature map and the blank depth matrix differ in dimension, they cannot be subtracted directly; the edges of the first feature map and the second feature map are therefore zero-padded first so that their dimensions match the dimension of the blank depth matrix, and the number and location of the padded zeros are determined by the offset interval. The first processed map and the second processed map are then subtracted to obtain the initial depth value corresponding to the pixel point, and finally the initial depth values are written into the preset blank depth matrix to obtain the initial depth map. The above process can be expressed in shorthand code as follows:
where disp is a set of pixels, i is a pixel value, leftimg_feature is the first feature map, and rightimg_feature is the second feature map.
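The original listing is not reproduced in this text. Purely as an illustration, a sketch consistent with the description and the variable names above (treating disp as a set of candidate pixel offsets, and assuming a particular shape for the five-dimensional blank depth matrix) might look like:

    import torch

    def staggered_subtraction(leftimg_feature, rightimg_feature, disp):
        """leftimg_feature / rightimg_feature: feature maps of shape (B, C, H, W);
        disp: candidate offsets. Returns a 5-D blank depth matrix filled with
        staggered differences (zero-padded edges keep the matrix's zeros)."""
        B, C, H, W = leftimg_feature.shape
        depth_matrix = leftimg_feature.new_zeros(B, len(disp), C, H, W)
        for k, i in enumerate(disp):
            if i == 0:
                depth_matrix[:, k] = leftimg_feature - rightimg_feature
            else:
                # Staggered (offset) subtraction: shift by i; positions without a
                # counterpart retain the zero from the blank depth matrix.
                depth_matrix[:, k, :, :, i:] = (
                    leftimg_feature[:, :, :, i:] - rightimg_feature[:, :, :, :-i]
                )
        return depth_matrix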
And B20, performing up-sampling on the initial depth map to obtain a target depth map with the same size as the preset target.
Specifically, the initial depth map is up-sampled and its size is enlarged to a preset target size. The target size preferably adopted in this embodiment is the size of the first infrared image or the second infrared image. Existing up-sampling methods, such as linear interpolation and unpooling, may be used directly.
In this embodiment, a cascade model is trained in advance and is used to up-sample the input initial depth map, so as to enlarge the initial depth map and obtain a target depth map of the same size as the input first infrared image or second infrared image. The cascade model not only takes the result of the previous up-sampling as the input of the next up-sampling, but also feeds the original initial depth map into every up-sampling module in the cascade, so that more detail is retained than with conventional models and feature loss during up-sampling is reduced.
Further, since the edge features in the initial depth map are the most easily lost, in this embodiment the up-sampling module extracts the edges of the initial depth map so as to recover the high-frequency features in the depth map. To extract the edges of the initial depth map effectively, the up-sampling module includes a dilated (atrous) convolution, which enlarges the receptive field of the convolution kernel and captures subtle edge variations. The up-sampling module also includes linear up-sampling for conventional feature up-sampling. In this embodiment, 3 up-sampling modules are provided, and through three up-sampling operations the initial depth map is restored to the same size as the first infrared image, that is, the target depth map.
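A sketch of such a cascade, assuming PyTorch, bilinear interpolation for the linear up-sampling and a single-channel depth map (all assumptions not fixed by the text), might be:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UpsampleModule(nn.Module):
        """One cascade stage: a dilated (atrous) convolution for edge detail plus
        linear up-sampling; the original initial depth map is re-injected here."""
        def __init__(self, channels=1):
            super().__init__()
            self.edge = nn.Conv2d(channels * 2, channels, kernel_size=3,
                                  padding=2, dilation=2)

        def forward(self, x, initial_depth):
            skip = F.interpolate(initial_depth, size=x.shape[-2:], mode="bilinear",
                                 align_corners=False)
            x = torch.relu(self.edge(torch.cat([x, skip], dim=1)))
            return F.interpolate(x, scale_factor=2, mode="bilinear",
                                 align_corners=False)

    class CascadeUpsampler(nn.Module):
        """Three cascaded stages restore the initial depth map to the size of the
        first infrared image (the preset target size in this embodiment)."""
        def __init__(self, channels=1, stages=3):
            super().__init__()
            self.stages = nn.ModuleList(UpsampleModule(channels) for _ in range(stages))

        def forward(self, initial_depth):
            x = initial_depth
            for stage in self.stages:
                x = stage(x, initial_depth)
            return x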
Further, after the first infrared image and the second infrared image are obtained and before the target depth map is generated, the target user may be identified according to the first infrared image or the second infrared image. If the photographic subject is a non-target user, subsequent operations are stopped directly, for example the door is not opened or certain applications are closed; if the photographic subject is the target user, the subsequent living body detection is performed. The process is as follows: face recognition is performed on the infrared image based on preset face information to obtain a recognition result corresponding to the infrared image.
Specifically, the face information of the target user is collected and stored in advance. The user's photograph may be stored directly as the face information; alternatively, to improve subsequent calculation efficiency, the user's photographs are collected first and features are extracted from them to obtain the face information.
After the first infrared image or the second infrared image is obtained, face recognition is performed on either infrared image. In this embodiment, taking the first infrared image as an example, the face features in the first infrared image are extracted first and then compared with the pre-stored face information to obtain the recognition result, which indicates whether the current environment contains the target user. The face recognition process may be realized by a machine learning algorithm or by a trained deep learning model.
In this process, the user to be verified may be far from or close to the camera; if the distance is too large, excessive interference information is present and the recognition accuracy is affected, so the first infrared image is first cropped to the face region before recognition. The specific process comprises the following steps:
and C10, carrying out face extraction on the first infrared image to obtain a recognized face image.
In this embodiment, the first infrared image is preprocessed and cropped to a certain extent. First, face detection is performed on the first infrared image, and an anchor box containing the face in the first infrared image is determined. Then, based on the position coordinates of the anchor box, the first infrared image is cropped to obtain the recognized face image corresponding to the first infrared image. The anchor box in this embodiment is determined by face localization based on the SSD (Single Shot MultiBox Detector) method; this one-stage detection method is well suited to embedded devices, can detect faces of widely varying sizes, is insensitive to illumination, and can detect faces in darker environments.
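A minimal sketch of this crop step, assuming OpenCV for resizing and a hypothetical detect_face() wrapper around an SSD-style detector returning an (x, y, w, h) anchor box, might be:

    import cv2

    def extract_face(first_infrared, detect_face):
        """Crop the recognized face image from the first infrared image using the
        detected anchor box; detect_face is a hypothetical SSD-style detector."""
        x, y, w, h = detect_face(first_infrared)
        face = first_infrared[y:y + h, x:x + w]
        # Scale to the preset compression size used in this embodiment (112x112).
        return cv2.resize(face, (112, 112))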
And C20, inputting the recognized face image into a trained face recognition model, carrying out face recognition on the recognized face image based on the face recognition model, and determining whether the first infrared image contains a target user.
Specifically, the recognized face image is input into a preset face recognition algorithm or a trained face recognition model, and is compared with face information stored in advance, so that whether the face image is a target user or not is judged.
The comparison method adopted in this embodiment is to calculate the similarity between the face features and the stored face information, and then to judge from the similarity whether the source of the face features, i.e., the current environment, contains the target user. If so, the next step is carried out, namely judging whether the target user in the environment is a living body.
In this embodiment, the face recognition model includes a feature extraction network and a classifier. The network structure of the feature extraction network is MobileFaceNet. Because this embodiment is applied to an intelligent door lock, the number of users is small and little face information is stored in advance, so the MobileFaceNet network is pruned to reduce the amount of calculation and speed up the forward pass. The channel expansion multiple of the feature extraction network is less than or equal to the default expansion multiple of MobileFaceNet, and the feature dimension of its fully connected layer is smaller than the default fully connected layer dimension of MobileFaceNet. The network structure used in this embodiment is as follows:
wherein input is the input size, operator is the layer type, Conv (Convolution) is a convolutional layer, Dw Conv (Depthwise Convolution) is a depthwise convolution, bottleneck is a bottleneck layer, Linear GDConv (Global Depthwise Convolution) is a linear global depthwise convolution, and Fully Connected denotes a fully connected layer whose feature dimension in this embodiment is 1 x 128; t denotes the channel expansion multiple, c denotes the number of output channels, n denotes the number of repetitions, and s denotes the stride.
Through the feature extraction layer, the face recognition model extracts and recognizes face features in the face image. Then, through the classifier, the similarity value between the face information and the face features is calculated. When the similarity value is greater than a preset similarity threshold value, for example, 90%, it is determined that the first infrared image includes the target user, that is, the target user exists in the current environment.
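As a simple illustration of this comparison step (using cosine similarity against the pre-stored face feature vectors and the 90% threshold of this embodiment; the classifier of this embodiment may implement the comparison differently), a sketch might be:

    import numpy as np

    def is_target_user(face_feature, stored_features, threshold=0.9):
        """face_feature: extracted face feature vector (e.g., of length 128);
        stored_features: iterable of pre-stored face feature vectors."""
        for stored in stored_features:
            cos = np.dot(face_feature, stored) / (
                np.linalg.norm(face_feature) * np.linalg.norm(stored) + 1e-8)
            if cos > threshold:
                return True
        return False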
Further, in this embodiment, since the network structure of the face recognition model is cut based on the conventional model, the recognition accuracy is affected to a certain extent, and therefore, in order to avoid false recognition, the similarity threshold in this embodiment is larger than the conventional similarity threshold.
In addition, in order to reduce the amount of calculation of the model and increase the calculation speed, in this embodiment, before the recognized face image is input to the preset face recognition algorithm or the trained face recognition model, the recognized face image is further scaled according to the preset compression size to obtain a compressed recognized face image, and then the compressed recognized face image is recognized. The preferred compression size employed in this embodiment is 112x112.
And C30, when the first infrared image is determined to contain the target user, determining that the identification result is the target user.
Specifically, the recognition result in this embodiment includes a target user and a non-target user. And when the first infrared image is determined to contain the target user, determining the identification result as the target user. And when the first infrared image is determined not to contain the target user, the recognition result is the non-target user.
S40, inputting the target depth map into a pre-trained first classification model, and performing living body classification on the target depth map through the first classification model to obtain a first classification result; and,
inputting the first infrared image or the second infrared image into a trained second classification model, and performing living body classification on the first infrared image or the second infrared image through the second classification model to obtain a second classification result.
Specifically, a first classification model and a second classification model are trained in advance for classifying the depth map and the infrared image respectively.
The first classification model is used for classifying the target depth map. Its network architecture adopts a two-class (binary) network model architecture. In the training process, the training set comprises a positive sample training subset and a negative sample training subset: the positive sample training subset comprises a plurality of target depth maps obtained by photographing and processing real faces, which are named training positive sample maps to distinguish them from the target depth maps described above; the negative sample training subset comprises target depth maps obtained by photographing planar faces, named training negative sample maps. A planar face refers to a face presented by means of a photo, a video, a print, or the like. Since the training process may adopt an ordinary neural network training process or the training process of a clustering algorithm, the details are not repeated herein.
As shown in fig. 2, in the training positive sample map of a real 3D face the contour resembles a normal face, whereas the training negative sample map is essentially planar. The two differ markedly, so a simple neural network can achieve accurate recognition. The specific process comprises the following steps:
and D10, cutting the target depth map based on the position coordinates of the recognized face image relative to the infrared image to obtain a first face image.
Specifically, in order to reduce interference from the depth values of other parts of the environment in the target depth map, the target depth map is first cropped to obtain a first face image containing the face region of the target depth map. The first face image could be obtained by training a dedicated algorithm or model for extracting the face image from the target depth map. In this embodiment, however, the recognized face image has already been extracted during face recognition by the preset face recognition algorithm or the trained face recognition model, so the region of the first face image in the target depth map can be determined directly from the position coordinates of the recognized face image on the infrared image, and the target depth map is cropped accordingly to obtain the first face image.
And D20, inputting the first face image into the trained first classification model, and performing living body classification on the target depth map through the first classification model to obtain a first classification result, wherein the first classification model comprises a plurality of convolutional layers, a grouped convolutional layer and a fully connected layer.
Specifically, the first face image is input into a first classification model obtained by training a positive sample training subset and a negative sample training subset, and the first classification model performs living body classification on the input first face image, which is substantially a two-classification process, that is, it is determined whether an object of the first face image is a living body (an image obtained based on a real face) or a non-living body (an image obtained based on a flat face).
Further, in order to reduce the amount of calculation and improve recognition efficiency, in another implementation manner of this embodiment the first face image is scaled and compressed to a preset compression size to obtain a first compressed image, and the first compressed image is then input into the first classification model for classification to obtain the first classification result. The compressed face image used in this implementation is 112x112 in size. Because the intelligent door lock executes most tasks on the terminal, in order to reduce the calculation pressure on the terminal and improve the response rate, the first classification model comprises a plurality of convolutional layers, a grouped convolutional layer and a fully connected layer.
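A lightweight two-class classifier of this kind (ordinary convolutions, one grouped convolution, a fully connected layer, 112x112 input) could be sketched as follows; the channel counts and layer depths are illustrative assumptions, not the exact model of this embodiment.

    import torch
    import torch.nn as nn

    class LivenessClassifier(nn.Module):
        """Binary (living / non-living) classifier for 112x112 single-channel input."""
        def __init__(self, in_channels=1):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                # Grouped convolution to cut computation on the terminal device.
                nn.Conv2d(32, 32, 3, stride=2, padding=1, groups=4), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(32, 2)  # logits for living / non-living

        def forward(self, x):
            return self.fc(self.features(x).flatten(1))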
The second classification model is used for classifying the infrared image. It is similar to the first classification model and likewise adopts a two-class network model architecture. In its training set, the negative samples are infrared images obtained by photographing model faces, named infrared negative sample images; a model face refers to a face presented by a model such as a doll. As shown in fig. 3, under infrared illumination the eyes in a real human face appear bright, while the eyes in an infrared image obtained from a model face are dim and dull, which is a major difference between the two.
Further, in order to ensure that the classification of the infrared image and the classification of the target depth map are accurate and consistent in their results: since the target depth map of this embodiment is obtained by processing the first infrared image and the second infrared image, the face region is the same in the target depth map and in the infrared image, so the infrared image is first cropped based on the position coordinates of the recognized face image relative to the infrared image to obtain a second face image. Meanwhile, the infrared image used for living body classification may be the first infrared image, the second infrared image, or a fusion of the two, which is not limited here.
As shown in fig. 4, the cropped second face image is input into the second classification model, and living body classification is performed on the second face image based on the second classification model to obtain a second classification result. Like the first classification result, the second classification result is either living or non-living. In addition, when the second face image is classified, it may be compressed according to the preset compression size to obtain a second compressed image, and the second compressed image is then classified by the second classification model to obtain the second classification result; the preferred compression size in this embodiment is 112x112. Similar to the first classification model, the second classification model comprises a plurality of convolutional layers, a grouped convolutional layer and a fully connected layer. Further, in order to strengthen the effective classification of the eyes, an eye attention mechanism may be added to the second classification model, or the second face image may be further cropped after the initial crop to obtain an eye image containing the eyes, on which the second classification model then performs living body classification.
And S50, determining whether the photographic subject is a living body according to the first classification result and the second classification result.
Specifically, the first classification result and the second classification result are of only two types, one is a living body, and one is a non-living body. Therefore, whether the target living body exists in the environment is directly determined according to whether the first classification result and the second classification result are both living bodies.
When the first classification result is a living body and the second classification result is also a living body, it is determined that the photographic subject is a living body.
And when the first classification result is a non-living body and/or the second classification result is a non-living body, determining that the shooting object is a non-living body, and rejecting subsequent operation.
If the photographic subject is a living body, the subsequent operation is executed.
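This decision rule amounts to a logical AND over the two classification results; a trivial sketch (the result labels are illustrative) is:

    def is_living_body(first_result: str, second_result: str) -> bool:
        """The subject is a living body only when both results indicate 'living'."""
        return first_result == "living" and second_result == "living"

    # Both models must agree; any 'non-living' result rejects the subsequent operation.
    assert is_living_body("living", "living") is True
    assert is_living_body("living", "non-living") is False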
In this embodiment, face recognition is performed on the acquired infrared image to determine whether the target user is present. When the target user is present, the depth map corresponding to the infrared image is input into the first classification model to obtain the first classification result, and the infrared image is input into the second classification model to obtain the second classification result. Because the depth map of a real face differs considerably from that of a planar face, and because in the infrared image the eyes of a real face and of a model face reflect light differently, the classification models can effectively and accurately identify whether the depth map and the infrared image contain a living body, thereby improving the efficiency of living body detection.
Based on the above-described living body detection method based on depth map prediction, the present embodiment provides a living body detection device 100 based on depth map prediction as shown in fig. 5, wherein the living body detection device 100 based on depth map prediction includes:
an obtaining module 110, configured to obtain a first infrared image and a second infrared image for a same photographic subject;
a feature extraction module 120, configured to input the first infrared image into a trained depth model, and perform feature extraction on the first infrared image based on the depth model to obtain a first feature map; and,
input the second infrared image into the trained depth model, and perform feature extraction on the second infrared image based on the depth model to obtain a second feature map;
a prediction module 130, configured to generate a target depth map corresponding to the first infrared image and the second infrared image according to the first feature map and the second feature map;
the classification module 140, configured to input the target depth map into a pre-trained first classification model, and perform living body classification on the target depth map through the first classification model to obtain a first classification result; and,
inputting the first infrared image or the second infrared image into a trained second classification model, and performing living body classification on the first infrared image or the second infrared image through the second classification model to obtain a second classification result;
a determining module 150, configured to determine whether the photographic subject is a living body according to the first classification result and the second classification result.
Wherein the training process of the depth model is as follows:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample subsets, and the training sample subsets comprise a first sample and a second sample;
for each training sample subset, inputting a first sample in the training sample subset into a preset initial model, and performing feature extraction on the first sample based on the initial model to obtain a first feature map; and,
inputting a second sample in the training sample subset into the initial model, and performing feature extraction on the second sample based on the initial model to obtain a second feature map;
calculating the distance between the first feature map and the second feature map to obtain a training distance corresponding to the training sample subset;
and adjusting the parameters of the initial model based on the training distance until the initial model converges to obtain a depth model.
Wherein the target depth map is composed of a number of target depth values; the prediction module 130 includes:
the calculation unit is used for performing, for each pixel point in the first feature map, a subtraction on the pixel values corresponding to that pixel point in the first feature map and the second feature map to obtain an initial depth value corresponding to the pixel point;
and the up-sampling unit is used for up-sampling the initial depth map to obtain a target depth map of the preset target size, wherein the initial depth map consists of the initial depth values.
Wherein the calculation unit includes:
the first subtraction subunit is configured to, for each pixel point, when the pixel value corresponding to the pixel point is greater than zero, take the first feature value as the minuend and the second feature value as the subtrahend, and subtract the two to obtain the initial depth value corresponding to the pixel point;
and the second subtraction subunit is configured to, when the pixel value corresponding to the pixel point is less than or equal to zero, take the second feature value as the minuend and the first feature value as the subtrahend, and subtract the two to obtain the initial depth value corresponding to the pixel point.
Wherein the dimension of the initial depth map is higher than that of the first feature map or the second feature map; the subtraction operation comprises a staggered subtraction, which comprises:
for each pixel point, zero-padding the first feature map according to a preset offset interval to obtain a first processed map, and zero-padding the second feature map to obtain a second processed map;
and subtracting the first processed map and the second processed map to obtain the initial depth value corresponding to the pixel point.
Wherein the upsampling unit is specifically configured to:
inputting the initial depth map into a trained cascade model, and up-sampling the initial depth map based on a plurality of up-sampling modules in the cascade model to obtain a target depth map corresponding to the initial depth map, wherein each up-sampling module comprises a dilated (atrous) convolution.
Wherein the determining module 150 is specifically configured to:
when both the first classification result and the second classification result indicate a living body, determining that the photographic subject is a living body.
Based on the above-described living body detection method based on depth map prediction, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors, to implement the steps in the living body detection method based on depth map prediction as described in the above-described embodiment.
Based on the above-mentioned living body detection method based on depth map prediction, the present invention also provides a terminal device, as shown in fig. 4, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, and may further include a communication Interface (Communications Interface) 23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional applications and data processing, i.e. implements the methods in the above embodiments, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required for at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory. For example, various media that can store program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also serve as the computer readable storage medium.
In addition, the specific processes loaded and executed by the instruction processors in the computer-readable storage medium and the terminal device are described in detail in the method, and are not stated herein.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.