WO2020125623A1

WO2020125623A1 - Method and device for live body detection, storage medium, and electronic device

Info

Publication number: WO2020125623A1
Application number: PCT/CN2019/125957
Authority: WO
Inventors: 侯允; 刘耀勇; 陈岩
Original assignee: 上海瑾盛通信科技有限公司
Priority date: 2018-12-20
Filing date: 2019-12-17
Publication date: 2020-06-25
Also published as: CN109635770A

Abstract

A method and device for live body detection, a storage medium, and an electronic device. The method comprises: first, photographing via a monocular camera a two-dimensional color image of a face to be detected (101), then, inputting the two-dimensional color image into a pretrained depth estimation model for depth estimation to produce a corresponding depth image (102), and finally, inputting the two-dimensional color image and the depth image corresponding thereto into a pretrained live body detection model for live body detection to produce a detection result (103).

Description

Living body detection method, device, storage medium and electronic equipment

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 20, 2018, with application number 201811565579.0 and entitled "Living body detection method, device, storage medium and electronic equipment", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of face recognition, in particular to a living body detection method, device, storage medium and electronic equipment.

Background

At present, electronic devices use face recognition technology not only to distinguish between individual users, but also to perform living body detection on users. For example, from an RGB-D image of the user's face captured by a depth camera, such as a structured light camera or a time-of-flight camera, an electronic device can determine whether the user's face is a living face.

Summary of the invention

The embodiments of the present application provide a living body detection method, device, storage medium, and electronic equipment, which can reduce the hardware cost of the electronic equipment for living body detection.

In a first aspect, an embodiment of the present application provides a living body detection method, which is applied to an electronic device, the electronic device includes a monocular camera, and the living body detection method includes:

Photographing the face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected;

Inputting the two-dimensional color image into a pre-trained depth estimation model to obtain a depth image corresponding to the two-dimensional color image;

Inputting the two-dimensional color image and the depth image into a pre-trained living body detection model to obtain a detection result.

In a second aspect, an embodiment of the present application provides a living body detection device, which is applied to an electronic device, the electronic device includes a monocular camera, and the living body detection device includes:

A color image acquisition module, configured to shoot the face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected;

A depth image acquisition module, configured to input the two-dimensional color image into a pre-trained depth estimation model to obtain a depth image corresponding to the two-dimensional color image;

A living body detection module, configured to input the two-dimensional color image and the depth image into a pre-trained living body detection model to obtain a detection result.

In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, and when the computer program runs on a computer, it causes the computer to execute the steps of the living body detection method provided in the first aspect.

In a fourth aspect, an embodiment of the present application provides an electronic device including a processor, a memory, and a monocular camera. The memory stores a computer program, and the processor, by calling the computer program, executes the steps of the living body detection method provided in the first aspect.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a living body detection method provided by an embodiment of the present application.

FIG. 2 is a schematic diagram of the living body detection performed by the electronic device through the living body detection model in the embodiment of the present application.

FIG. 3 is another schematic flowchart of the living body detection method provided by the embodiment of the present application.

FIG. 4 is a schematic diagram of constructing a training sample set in an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a living body detection device provided by an embodiment of the present application.

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

FIG. 7 is another schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

Please refer to the drawings in which the same component symbols represent the same components. The principle of the present application is illustrated by implementation in an appropriate computing environment. The following description is based on the illustrated specific embodiments of the present application, which should not be considered as limiting other specific embodiments not detailed herein.

Currently, face recognition technology is widely used for unlocking electronic devices and for secure payment, but an attacker can easily impersonate another person using non-living face images, non-living face videos, face masks, or head models, causing losses to users. To address this defect, the related art proposes living body detection based on a depth camera such as a structured light camera or a time-of-flight camera. However, this requires the electronic device to be equipped with an additional depth camera, which increases the cost of living body detection. For this reason, the embodiments of the present application first provide a living body detection method that realizes living body detection based on the monocular camera commonly configured in electronic devices, without increasing the hardware cost of the electronic device. The execution subject of the living body detection method may be the living body detection device provided in the embodiments of the present application, or an electronic device integrated with the living body detection device. The living body detection device may be implemented in hardware or software, and the electronic device may be any device equipped with a processor and having processing capability, such as a smartphone, tablet computer, PDA, notebook computer, or desktop computer.

An embodiment of the present application provides a living body detection method, including:

Input the two-dimensional color image into a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image;

The two-dimensional color image and the depth image are input into a pre-trained living body detection model for living body detection, and a detection result is obtained.

In an embodiment, the living body detection model is a convolutional neural network model including a convolutional layer, a pooling layer, and a fully connected layer connected in sequence, and inputting the two-dimensional color image and the depth image into the pre-trained living body detection model to obtain the detection result includes:

Inputting the two-dimensional color image and the depth image into the convolutional layer for feature extraction to obtain a joint global feature of the two-dimensional color image and the depth image;

Inputting the joint global feature into the pooling layer for feature dimensionality reduction to obtain a dimensionality-reduced joint global feature;

Inputting the dimensionality-reduced joint global feature into the fully connected layer for classification to obtain a detection result that the face to be detected is a living face, or a detection result that the face to be detected is a non-living face.

In an embodiment, the inputting the two-dimensional color image and the depth image into the convolution layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image includes:

Preprocessing the two-dimensional color image to obtain a face area image in the two-dimensional color image;

Preprocessing the depth image to obtain a face area image in the depth image;

Inputting the face area image in the two-dimensional color image and the face area image in the depth image into the convolutional layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image.

In an embodiment, the preprocessing the two-dimensional color image to obtain the face area image in the two-dimensional color image includes:

An ellipse template, a circular template or a rectangular template is used to extract the face area image from the two-dimensional color image.

In an embodiment, before capturing the face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected, the method further includes:

Photographing a plurality of different living human faces through the monocular camera to obtain a plurality of two-dimensional color live human face sample images, and acquiring a depth image corresponding to each of the two-dimensional color live human face sample images to obtain a plurality of first depth images;

Photographing a plurality of different non-living human faces through the monocular camera to obtain a plurality of two-dimensional color non-live human face sample images, and acquiring a depth image corresponding to each of the two-dimensional color non-live human face sample images to obtain a plurality of second depth images;

Constructing a training sample set using each of the two-dimensional color live human face sample images and its corresponding first depth image as a positive sample, and each of the two-dimensional color non-live human face sample images and its corresponding second depth image as a negative sample;

A convolutional neural network is used to perform model training on the training sample set to obtain the convolutional neural network model.

In an embodiment, before using the convolutional neural network to perform model training on the training sample set to obtain the convolutional neural network model, the method further includes:

Perform sample expansion processing on the training sample set according to a preset sample expansion strategy.
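The text does not say what the preset sample expansion strategy is. As a minimal sketch, assuming horizontal mirroring is one acceptable strategy and that each sample is a hypothetical `(color, depth, label)` tuple of row-major pixel grids, expansion could look like the following. The color image and its depth image are mirrored together so that they remain pixel-aligned.

```python
def mirror_rows(image):
    """Return a horizontally mirrored copy of a row-major pixel grid."""
    return [list(reversed(row)) for row in image]


def expand_by_mirroring(samples):
    """Append a mirrored copy of every (color, depth, label) sample.

    Mirroring is an assumed expansion strategy, not one named in the text.
    The label is unchanged: a mirrored live face is still a live face.
    """
    expanded = list(samples)
    for color, depth, label in samples:
        expanded.append((mirror_rows(color), mirror_rows(depth), label))
    return expanded
```

Under this assumption the training sample set doubles in size, and each added sample keeps the label of the sample it was derived from.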

In an embodiment, the acquiring depth images corresponding to each of the two-dimensional color live human face sample images to obtain multiple first depth images includes:

Receiving the calibrated distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera;

According to the distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera, a depth image corresponding to each two-dimensional color live human face sample image is generated to obtain a plurality of first depth images.
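The two steps above can be sketched as a conversion from a grid of calibrated distances to a depth image. The encoding is not specified in the text; this sketch assumes distances are given in metres and stored as integer millimetres clipped to a 16-bit range, and the function name is hypothetical.

```python
def distances_to_depth_image(distances_m, max_value=65535):
    """Encode per-pixel camera distances (metres) as integer millimetres.

    The millimetre unit and the 16-bit clipping range are assumptions
    made for illustration only.
    """
    return [
        [min(max_value, max(0, round(d * 1000))) for d in row]
        for row in distances_m
    ]
```

The output grid has the same number of rows and columns as the input, so every depth value stays aligned with the pixel it was calibrated for.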

In an embodiment, the living body detection method further includes:

Using each of the two-dimensional color live human face sample images and each of the two-dimensional color non-live human face sample images as training inputs, and using the first depth image corresponding to each of the two-dimensional color live human face sample images and the second depth image corresponding to each of the two-dimensional color non-live human face sample images as target outputs, performing supervised model training to obtain the depth estimation model.
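A minimal sketch of how the supervised pairs for the depth estimation model might be assembled and scored. The text names neither a model architecture nor a loss function, so the mean absolute error below is an assumption, and both function names are hypothetical.

```python
def build_depth_training_pairs(live_pairs, non_live_pairs):
    """Combine (color_image, depth_image) pairs from both sample groups.

    Live and non-live samples are both used, with the color image as the
    training input and its depth image as the target output.
    """
    return list(live_pairs) + list(non_live_pairs)


def mean_absolute_depth_error(predicted, target):
    """Average per-pixel absolute difference between two depth images.

    This loss is an assumed choice for illustration.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count
```

A training loop would reduce this error over all pairs. Note that the liveness label plays no role here, because the depth estimation model only learns to predict depth.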

In an embodiment, before inputting the two-dimensional color image into a pre-trained depth estimation model for depth estimation, the method further includes:

Call the depth estimation model locally or call the depth estimation model from the server.

Please refer to FIG. 1, which is a schematic flowchart of a living body detection method provided by an embodiment of the present application. As shown in FIG. 1, the flow of the living body detection method provided by the embodiment of the present application may be as follows:

In 101, a face to be detected is photographed through a monocular camera to obtain a two-dimensional color image of the face to be detected.

In the embodiment of the present application, the electronic device can photograph the face to be detected through the configured monocular camera when it receives an operation that requires face recognition for user identity detection, such as an unlock operation based on face recognition or a payment operation based on face recognition. Since the monocular camera is only sensitive to two-dimensional color information, a two-dimensional color image of the face to be detected is captured.

It should be noted that electronic devices are currently usually equipped with two monocular cameras, namely a front monocular camera (commonly known as a front camera) and a rear monocular camera (commonly known as a rear camera), and the imaging capability of the rear monocular camera is higher than that of the front monocular camera. When shooting the face to be detected through a monocular camera, the electronic device can therefore perform the shooting operation through the front monocular camera by default, or through the rear monocular camera by default. Alternatively, it can predict, based on real-time pose information, which of the front and rear monocular cameras faces the face to be detected, and automatically perform the shooting operation through that camera.

For example, the unlocking method currently adopted by the electronic device is "face unlocking". When the electronic device receives a trigger operation for face unlocking, it shoots the face to be detected through the front monocular camera by default, thereby obtaining a two-dimensional color image of the face to be detected.

For another example, the payment method currently adopted by the electronic device is "face-scan payment". When the electronic device receives a trigger operation for face-scan payment, it shoots the face to be detected through the front monocular camera by default, thereby obtaining a two-dimensional color image of the face to be detected.

In 102, the captured two-dimensional color image is input to a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image.

It should be noted that in the embodiment of the present application, a depth estimation model for depth estimation is pre-trained, and the depth estimation model may be stored locally in the electronic device or in a remote server. In this way, after acquiring the two-dimensional color image of the face to be detected through the monocular camera, the electronic device calls the pre-trained depth estimation model locally or from the remote server, inputs the two-dimensional color image of the face to be detected into the pre-trained depth estimation model, and uses the depth estimation model to perform depth estimation on the two-dimensional color image to obtain a depth image corresponding to the two-dimensional color image.
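The text allows the model to live either on the device or on a remote server, but defines no lookup order. A minimal sketch, assuming a local-first policy, a hypothetical in-memory registry `local_models`, and a hypothetical callable `fetch_from_server`:

```python
def resolve_model(name, local_models, fetch_from_server):
    """Return the locally stored model if present, else fetch it remotely.

    Trying the local store first is an assumption; the text only says
    that either location may be used.
    """
    if name in local_models:
        return local_models[name]
    return fetch_from_server(name)
```

For example, `resolve_model("depth_estimation", local_models, fetch_from_server)` would only contact the server when the device holds no local copy.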

It should be noted that the resolution of the estimated depth image is the same as the resolution of the two-dimensional color image. The pixel value of each pixel in the depth image describes the distance from the corresponding pixel in the two-dimensional color image to the aforementioned monocular camera (that is, the monocular camera that captured the two-dimensional color image).

For example, after obtaining the two-dimensional color image of the face to be detected through the front monocular camera, the electronic device calls a locally stored, pre-trained depth estimation model and uses the depth estimation model to perform depth estimation on the two-dimensional color image to obtain a depth image corresponding to the two-dimensional color image.

In 103, a two-dimensional color image and its corresponding depth image are input into a pre-trained living body detection model for living body detection, and a detection result is obtained.

It should be noted that, in the embodiment of the present application, in addition to the depth estimation model for depth estimation, the living body detection model for living body detection is also pre-trained, and the living body detection model may be stored locally in the electronic device or in a remote server. In this way, after the electronic device inputs the two-dimensional color image captured by the monocular camera into the pre-trained depth estimation model and obtains the depth image corresponding to the two-dimensional color image, it calls the pre-trained living body detection model locally or from the remote server, and inputs the previously acquired two-dimensional color image and its corresponding depth image into the pre-trained living body detection model. Based on the input two-dimensional color image and its corresponding depth image, the living body detection model performs living body detection on the face to be detected to obtain a detection result that the face to be detected is a living face, or a detection result that the face to be detected is a non-living face.

For example, referring to FIG. 2, after the two-dimensional color image of the face to be detected is captured by the front monocular camera, the electronic device calls a locally stored, pre-trained depth estimation model and uses it to perform depth estimation on the two-dimensional color image to obtain a depth image corresponding to the two-dimensional color image. It then calls the locally stored, pre-trained living body detection model and inputs the previously obtained two-dimensional color image and its corresponding depth image into the living body detection model for living body detection to obtain a detection result. A detection result that the face to be detected is a living face means that the face to be detected is the real face of a person with vital signs; a detection result that the face to be detected is a non-living face means that the face to be detected is not the real face of a person with vital signs, and may be, for example, a face image or a face video captured in advance.

It can be seen from the above that the electronic device in the embodiment of the present application first obtains the two-dimensional color image of the face to be detected through the configured monocular camera, then inputs the obtained two-dimensional color image into the pre-trained depth estimation model for depth estimation to obtain a depth image corresponding to the two-dimensional color image, and finally inputs the previously obtained two-dimensional color image and its corresponding depth image into the pre-trained living body detection model for living body detection to obtain a detection result. As a result, the electronic device can realize living body detection using a commonly configured monocular camera, without an additional depth camera, which reduces the hardware cost of living body detection for the electronic device.
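The flow of 101 to 103 can be sketched as a small function that chains the two models. This is an illustration under stated assumptions, not the applicant's implementation: `estimate_depth` and `classify_live` are hypothetical callables standing in for the pre-trained depth estimation model and living body detection model, and images are assumed to be row-major pixel grids.

```python
def detect_live_face(color_image, estimate_depth, classify_live):
    """Run steps 102 and 103 on an image already captured in step 101.

    estimate_depth and classify_live are hypothetical stand-ins for the
    two pre-trained models; no real model API is implied.
    """
    depth_image = estimate_depth(color_image)  # step 102

    # The description states that the depth image has the same
    # resolution as the two-dimensional color image.
    same_rows = len(depth_image) == len(color_image)
    same_cols = all(len(d) == len(c) for d, c in zip(depth_image, color_image))
    if not (same_rows and same_cols):
        raise ValueError("depth image must match the color image resolution")

    return bool(classify_live(color_image, depth_image))  # step 103
```

Because only the captured color image enters the function, no depth camera is involved at any point; the depth image is produced entirely by the estimation model.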

Please refer to FIG. 3, which is another schematic flowchart of the living body detection method provided by the embodiment of the present application. The living body detection method may be applied to an electronic device, and the flow of the living body detection method may include:

In 201, the electronic device uses a machine learning algorithm to train a depth estimation model and a living body detection model in advance, where the living body detection model is a convolutional neural network model.

In the embodiment of the present application, the electronic device uses a machine learning algorithm to train in advance to obtain a depth estimation model and a living body detection model. It should be noted that after training the depth estimation model and the living body detection model, the electronic device may store both models locally, may store both models on a remote server, or may store one of them locally and the other on a remote server.

Among them, machine learning algorithms can include: decision tree models, logistic regression models, Bayesian models, neural network models, clustering models, and so on.

The types of machine learning algorithms can be divided according to various situations. For example, based on the learning method, machine learning algorithms can be divided into: supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, reinforcement learning algorithms, and so on.

Under supervised learning, the input data is called "training data", and each set of training data has a clear label or result, such as "spam" and "non-spam" in anti-spam systems, or "1, 2, 3, 4" and so on in handwritten digit recognition. Common application scenarios of supervised learning are classification problems and regression problems. Common algorithms include Logistic Regression and the Back Propagation Neural Network.

In unsupervised learning, the data is not labeled, and the model learns in order to infer some internal structure of the data. Common application scenarios include association rule learning and clustering. Common algorithms include the Apriori algorithm and the k-Means algorithm.

In semi-supervised learning, the input data is partially labeled. This learning mode can be used for type recognition, but the model first needs to learn the internal structure of the data in order to organize the data reasonably for prediction. Application scenarios include classification and regression. The algorithms include extensions to commonly used supervised learning algorithms; these algorithms first attempt to model the unlabeled data and then predict the labeled data on this basis, for example graph inference algorithms (Graph Inference) or the Laplacian support vector machine (Laplacian SVM).

In reinforcement learning, the input data serves as feedback to the model. Unlike in supervised learning, where the input data is only used as a way to check whether the model is right or wrong, under reinforcement learning the input data is fed back directly to the model, and the model must adjust immediately. Common application scenarios include dynamic systems and robot control. Common algorithms include Q-Learning and temporal difference learning (Temporal Difference Learning).

In addition, it is also possible to divide the machine learning algorithm into:

Regression algorithms. Common regression algorithms include: ordinary least squares (Ordinary Least Squares), logistic regression (Logistic Regression), stepwise regression (Stepwise Regression), multivariate adaptive regression splines (Multivariate Adaptive Regression Splines), and locally estimated scatterplot smoothing (Locally Estimated Scatterplot Smoothing).

Instance-based algorithms, including k-Nearest Neighbor (KNN), Learning Vector Quantization (LVQ), and Self-Organizing Map (SOM).

Regularization methods. Common algorithms include: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic Net.

Decision tree algorithms. Common algorithms include: Classification And Regression Tree (CART), ID3 (Iterative Dichotomiser 3), C4.5, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Random Forest, Multivariate Adaptive Regression Splines (MARS), and Gradient Boosting Machine (GBM).

Bayesian algorithms, including: the Naive Bayes algorithm, Averaged One-Dependence Estimators (AODE), and Bayesian Belief Network (BBN).

For example, in this embodiment of the present application, a convolutional neural network is used to train the living body detection model, that is, the living body detection model is a convolutional neural network model that includes a convolutional layer, a pooling layer, and a fully connected layer.

In 202, the electronic device shoots the face to be detected through a monocular camera to obtain a two-dimensional color image of the face to be detected.

In the embodiment of the present application, the electronic device may photograph the face to be detected through the configured monocular camera when it receives an operation that requires face recognition for user identity detection, such as an unlock operation based on face recognition or a payment operation based on face recognition. Since the monocular camera is only sensitive to two-dimensional color information, a two-dimensional color image of the face to be detected is captured.

In 203, the electronic device inputs the captured two-dimensional color image into a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image.

After acquiring the two-dimensional color image of the face to be detected through the monocular camera, the electronic device calls the pre-trained depth estimation model locally or from a remote server, inputs the two-dimensional color image of the face to be detected into the pre-trained depth estimation model, and uses the depth estimation model to perform depth estimation on the two-dimensional color image to obtain a depth image corresponding to the two-dimensional color image.

In 204, the electronic device inputs the aforementioned two-dimensional color image and its corresponding depth image into the convolutional layer of the convolutional neural network model for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image.

In the embodiment of the present application, after inputting the two-dimensional color image captured by the monocular camera into the pre-trained depth estimation model and obtaining the depth image corresponding to the two-dimensional color image, the electronic device calls the pre-trained living body detection model locally or from a remote server, and uses the living body detection model, that is, the previously trained convolutional neural network model, to perform living body detection.

First, the electronic device inputs the aforementioned two-dimensional color image and its corresponding depth image into the convolutional layer of the convolutional neural network model for feature extraction (feature extraction maps the original image data to the hidden-layer feature space to obtain the corresponding global features), obtaining the global features of the two-dimensional color image and the global features of the depth image. The global features of the two-dimensional color image and the global features of the depth image are then combined in the convolutional layer to obtain the joint global features of the two-dimensional color image and the depth image.
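The text does not specify how the two sets of global features are combined. As a minimal sketch, assuming a single-channel (grayscale) version of the color image, one shared kernel, and combination by pairing the two feature maps location by location, the step could look like this. The sliding-window operation is the cross-correlation commonly used in convolutional layers; all names are hypothetical.

```python
def cross_correlate(image, kernel):
    """Slide a kernel over a 2-D grid with stride 1 and no padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def joint_feature_map(gray_image, depth_image, kernel):
    """Extract one feature map per input and pair them per location.

    Pairing (channel stacking) is an assumed way of combining features.
    """
    color_features = cross_correlate(gray_image, kernel)
    depth_features = cross_correlate(depth_image, kernel)
    return [
        list(zip(color_row, depth_row))
        for color_row, depth_row in zip(color_features, depth_features)
    ]
```

Each location of the result carries one value derived from the color image and one derived from the depth image, which is what lets later layers reason over both cues at once.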

In 205, the electronic device inputs the obtained joint global features into the pooling layer of the convolutional neural network model for feature dimensionality reduction to obtain the dimensionality-reduced joint global features.

In the embodiment of the present application, to reduce the amount of calculation and improve the efficiency of living body detection, the joint global features of the two-dimensional color image and the depth image output by the convolutional layer are input into the pooling layer of the convolutional neural network model for downsampling, which retains the salient features of the joint global features and reduces their dimensionality. Downsampling can be implemented by max pooling or mean pooling.

For example, assuming that the convolutional layer outputs a 20*20 joint global feature, feature dimensionality reduction through the pooling layer yields a 10*10 dimensionality-reduced joint global feature.
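The 20*20 to 10*10 reduction matches a 2x2 pooling window applied with stride 2, although the text does not state the window size. A minimal sketch under that assumption, supporting both max pooling and mean pooling (the function names are hypothetical):

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pool_2x2(feature_map, reducer=max):
    """Downsample a 2-D grid with a 2x2 window and stride 2.

    reducer=max gives max pooling; reducer=mean gives mean pooling.
    The 2x2 window and stride of 2 are assumptions.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reducer([
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]
```

Applied to a 20*20 grid, `pool_2x2` returns a 10*10 grid, so the fully connected layer receives a quarter as many values.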

In 206, the electronic device inputs the dimensionality-reduced joint global features into the fully connected layer of the convolutional neural network model for classification to obtain a detection result that the face to be detected is a living face, or a detection result that the face to be detected is a non-living face.

The fully connected layer implements the function of a classifier. Each node of the fully connected layer is connected to all output nodes of the pooling layer. A node of the fully connected layer is called a neuron of the fully connected layer, and the number of neurons in the fully connected layer can be determined according to actual application requirements; for example, it can be set to 4096.

In the embodiment of the present application, the dimensionality-reduced joint global features output by the pooling layer are input into the fully connected layer for classification to obtain a detection result that the face to be detected is a living face, or a detection result that the face to be detected is a non-living face.
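A minimal sketch of the classification step, assuming the pooled features have been flattened into a vector and that the final fully connected layer has two output neurons, one per detection result. The two-neuron head, the softmax normalization, and the weights are illustrative assumptions; the text only gives 4096 as an example neuron count.

```python
import math


def classify_features(features, weights, biases):
    """Map a flat feature vector to ("live" | "non-live", probabilities).

    Every output neuron is connected to every input feature, which is
    what makes the layer fully connected.
    """
    logits = [
        sum(w * x for w, x in zip(neuron_weights, features)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]
    # Softmax, shifted by the largest logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    label = "live" if probabilities[0] >= probabilities[1] else "non-live"
    return label, probabilities
```

In a trained model the weights and biases would come from training on the positive and negative samples rather than being supplied by hand.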

In one embodiment, when the aforementioned two-dimensional color image and its corresponding depth image are input into the convolutional layer of the convolutional neural network model for feature extraction, and the combined global feature of the aforementioned two-dimensional color image and the aforementioned depth image is obtained, carried out:

(1) The electronic device preprocesses the two-dimensional color image to obtain the face area image in the two-dimensional color image;

(2) The electronic device preprocesses the aforementioned depth image to obtain the face area image in the aforementioned depth image;

(3) The electronic device inputs the face area image in the two-dimensional color image and the face area image in the depth image to the convolutional layer for feature extraction to obtain the combined global features of the two-dimensional color image and the depth image .

In order to further improve the efficiency of living body detection, when the electronic device inputs the aforementioned two-dimensional color image and its corresponding depth image into the convolutional layer of the convolutional neural network model for feature extraction, it is not the original two-dimensional color image and the original The aforementioned depth image is input to the convolutional layer of the convolutional neural network for feature extraction, but the two-dimensional color image and the depth image are preprocessed separately to obtain the face area image in the two-dimensional color image and the aforementioned Face area image in the depth image.

When preprocessing the two-dimensional color image and the depth image, the electronic device can use an elliptical template, a circular template, a rectangular template, or the like to extract the face area from the two-dimensional color image and from the depth image respectively, thereby obtaining the face area image in the two-dimensional color image and the face area image in the depth image.
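As an illustration of template-based extraction, the following sketch applies one elliptical template to both images. It assumes the face is roughly centered so that the ellipse inscribed in the frame covers it, and it blanks pixels outside the template; both choices, and the helper names `ellipse_mask` and `apply_template`, are assumptions made for this example only.

```python
def ellipse_mask(height, width):
    """Boolean grid that is True inside the ellipse inscribed in the frame."""
    cy, cx = (height - 1) / 2, (width - 1) / 2   # ellipse center
    ry, rx = height / 2, width / 2               # semi-axes
    return [[((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1.0
             for x in range(width)]
            for y in range(height)]

def apply_template(image, mask, fill):
    """Keep pixels where the mask is True; replace all others with `fill`."""
    return [[pixel if keep else fill
             for pixel, keep in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]

# Hypothetical 8 x 6 images: a uniform color image and a uniform depth image.
color = [[(200, 150, 120)] * 6 for _ in range(8)]
depth = [[500] * 6 for _ in range(8)]

mask = ellipse_mask(8, 6)                        # one template for both images
face_color = apply_template(color, mask, (0, 0, 0))
face_depth = apply_template(depth, mask, 0)
```

Using the same mask for the color image and the depth image keeps the two face area images pixel-aligned, which matters when they are later fed to the convolutional layer together.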

In one embodiment, when a machine learning algorithm is used to train the living body detection model, the following may be performed:

(1) The electronic device photographs a plurality of different living human faces through a monocular camera to obtain a plurality of two-dimensional color live human face sample images, and obtains a depth image corresponding to each two-dimensional color live human face sample image to obtain a plurality of first depth images;

(2) The electronic device photographs a plurality of different non-living human faces through the monocular camera to obtain a plurality of two-dimensional color non-living human face sample images, and obtains a depth image corresponding to each two-dimensional color non-living human face sample image to obtain a plurality of second depth images;

(3) The electronic device uses each two-dimensional color live human face sample image and its corresponding first depth image as a positive sample, and each two-dimensional color non-living human face sample image and its corresponding second depth image as a negative sample, to construct a training sample set;

(4) The electronic device adopts a convolutional neural network to perform model training on the training sample set to obtain a convolutional neural network model as a living body detection model.

On the one hand, the electronic device can use its monocular camera to photograph the faces of multiple users of different skin colors, genders, and ages (that is, living human faces) to obtain multiple two-dimensional color live human face sample images. In addition, the electronic device obtains a depth image corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images.

For example, the electronic device can be connected to an external depth camera, which shoots synchronously whenever the monocular camera photographs a living human face. In this way, the electronic device obtains a two-dimensional color live human face sample image of the living human face through the monocular camera and a depth image of the same living human face through the external depth camera. The captured depth image is then aligned with the two-dimensional color live human face sample image, and the aligned depth image is recorded as the first depth image of that two-dimensional color live human face sample image.

On the other hand, the electronic device can also use its monocular camera to photograph different non-living human faces, such as face images, face videos, face masks, and head models, to obtain multiple two-dimensional color non-living human face sample images. In addition, the electronic device obtains a depth image corresponding to each two-dimensional color non-living human face sample image to obtain multiple second depth images.

For example, the electronic device can be connected to an external depth camera, which shoots synchronously whenever the monocular camera photographs a non-living human face. In this way, the electronic device obtains a two-dimensional color non-living human face sample image of the non-living human face through the monocular camera and a depth image of the same non-living human face through the external depth camera. The captured depth image is then aligned with the two-dimensional color non-living human face sample image, and the aligned depth image is recorded as the second depth image of that two-dimensional color non-living human face sample image.

After acquiring the multiple two-dimensional color live human face sample images and their corresponding first depth images, and the multiple two-dimensional color non-living human face sample images and their corresponding second depth images, the electronic device uses each two-dimensional color live human face sample image and its corresponding first depth image as a positive sample, and each two-dimensional color non-living human face sample image and its corresponding second depth image as a negative sample, to construct a training sample set, as shown in FIG. 4.
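The construction of the training sample set can be sketched as follows. The `Sample` record, the label encoding (1 for positive, 0 for negative), and the placeholder strings standing in for image data are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Sample:
    color: Any   # two-dimensional color face sample image
    depth: Any   # its corresponding first or second depth image
    label: int   # 1 = positive (living face), 0 = negative (non-living face)

def build_training_set(live_pairs: List[Tuple[Any, Any]],
                       non_live_pairs: List[Tuple[Any, Any]]) -> List[Sample]:
    """Label live (color, depth) pairs as positive and non-live pairs as negative."""
    positives = [Sample(color, depth, 1) for color, depth in live_pairs]
    negatives = [Sample(color, depth, 0) for color, depth in non_live_pairs]
    return positives + negatives

# Hypothetical placeholders; real entries would be image arrays.
live_pairs = [("live_rgb_0", "first_depth_0"), ("live_rgb_1", "first_depth_1")]
non_live_pairs = [("photo_rgb_0", "second_depth_0")]

training_set = build_training_set(live_pairs, non_live_pairs)
```

Keeping the color image and its depth image in one record ensures that later steps, such as shuffling or sample expansion, never separate a color image from its own depth image.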

After completing the construction of the training sample set, the electronic device uses a convolutional neural network to perform model training on the constructed training sample set to obtain a convolutional neural network model as a living body detection model for living body detection.

It should be noted that, when a convolutional neural network is used to perform model training on the constructed training sample set, a supervised learning method or an unsupervised learning method may be used, which can be specifically selected by a person of ordinary skill in the art according to actual needs.

In one embodiment, before the convolutional neural network is used to perform model training on the training sample set to obtain the convolutional neural network model serving as the living body detection model, the method further includes:

The electronic device performs sample expansion processing on the training sample set according to a preset sample expansion strategy.

In the embodiment of the present application, the sample expansion of the training sample set can increase the diversity of the samples, so that the trained convolutional neural network model has stronger robustness. The sample expansion strategy may be set to perform one or more of small rotation, scaling, and inversion on the positive samples/negative samples in the training sample set.

For example, for a positive sample in the training sample set composed of a two-dimensional color live human face sample image and its corresponding first depth image, the two-dimensional color live human face sample image and the first depth image can be rotated by the same amount to obtain a rotated two-dimensional color live human face sample image and a rotated first depth image. The rotated two-dimensional color live human face sample image and the rotated first depth image then form a new positive sample.
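A minimal sketch of this rotation-based expansion is given below. It assumes nearest-neighbor sampling about the image center and a fill value for pixels rotated in from outside the frame; these details, the 10-degree angle, and the helper names are illustrative assumptions rather than requirements of the application.

```python
import math

def rotate(image, degrees, fill):
    """Rotate a 2-D grid about its center using nearest-neighbor sampling."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    rotated = []
    for y in range(h):
        row = []
        for x in range(w):
            # Inverse mapping: locate the source pixel that lands on (y, x).
            dy, dx = y - cy, x - cx
            sy = round(cy + cos_a * dy - sin_a * dx)
            sx = round(cx + sin_a * dy + cos_a * dx)
            inside = 0 <= sy < h and 0 <= sx < w
            row.append(image[sy][sx] if inside else fill)
        rotated.append(row)
    return rotated

def expand_sample(color, depth, degrees):
    """Apply the SAME rotation to a color image and its depth image."""
    return rotate(color, degrees, (0, 0, 0)), rotate(depth, degrees, 0)

# Hypothetical 5 x 5 sample whose pixel values encode their own coordinates.
color = [[(y, x, 255) for x in range(5)] for y in range(5)]
depth = [[10 * y + x + 1 for x in range(5)] for y in range(5)]
new_color, new_depth = expand_sample(color, depth, degrees=10)
```

Because both images pass through the same inverse mapping, every pixel of the rotated depth image still describes the pixel at the same position in the rotated color image, so the pair remains a valid sample.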

In one embodiment, when acquiring depth images corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images, the following may be performed:

(1) The electronic device receives the calibrated distance from each pixel in each two-dimensional color live human face sample image to the monocular camera;

(2) The electronic device generates a depth image corresponding to each two-dimensional color live human face sample image according to the distance from each pixel in each two-dimensional color live human face sample image to the monocular camera, and obtains a plurality of first depth images.

For any two-dimensional color live human face sample image captured by the electronic device through the monocular camera, the distance from each pixel in that image to the monocular camera can be calibrated manually. The electronic device then generates a depth image corresponding to the two-dimensional color live human face sample image according to these per-pixel distances, and records it as a first depth image.

Thus, the electronic device can receive the calibrated distance from each pixel in each two-dimensional color live human face sample image to the monocular camera and, according to those distances, generate a depth image corresponding to each two-dimensional color live human face sample image, thereby obtaining multiple first depth images.
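One simple way to turn per-pixel calibrated distances into a depth image is sketched below. It assumes distances are given in millimetres and stored as rounded values clipped to a 16-bit range; the unit, the encoding, and the sample values are assumptions, since the application does not specify how the depth image is encoded.

```python
def distances_to_depth_image(distances_mm, max_value=65535):
    """Build a depth image whose pixel values are rounded, clipped distances in mm."""
    return [[min(max(round(d), 0), max_value) for d in row]
            for row in distances_mm]

# Hypothetical calibrated distances for a 3 x 3 image (nose nearest the camera).
calibrated = [
    [412.4, 405.0, 411.6],
    [398.2, 380.4, 399.0],
    [410.0, 402.7, 409.9],
]
first_depth_image = distances_to_depth_image(calibrated)
```

The same function would serve for non-living face samples, where the result would be recorded as a second depth image instead.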

In an embodiment, when acquiring depth images corresponding to each two-dimensional color non-living human face sample image to obtain multiple second depth images, the following may be performed:

The electronic device receives the calibrated distance from each pixel in the two-dimensional color non-living face sample image to the monocular camera;

The electronic device generates a depth image corresponding to each two-dimensional color non-living human face sample image according to the distance from each pixel in each two-dimensional color non-living human face sample image to the monocular camera, and obtains a plurality of second depth images.

In an embodiment, when a machine learning algorithm is used to obtain a depth estimation model, the following may be performed:

The electronic device uses each two-dimensional color live human face sample image and each two-dimensional color non-living human face sample image as training inputs, uses the first depth image corresponding to each two-dimensional color live human face sample image and the second depth image corresponding to each two-dimensional color non-living human face sample image as target outputs, and performs supervised model training to obtain the depth estimation model.

It should be noted that, in the embodiment of the present application, the acquired two-dimensional color live human face sample images and their corresponding first depth images, together with the two-dimensional color non-living human face sample images and their corresponding second depth images, are used not only to train the living body detection model but can also be used to train the depth estimation model. Specifically, the electronic device can directly use each two-dimensional color live human face sample image and each two-dimensional color non-living human face sample image as training inputs, use the first depth image corresponding to each two-dimensional color live human face sample image and the second depth image corresponding to each two-dimensional color non-living human face sample image as target outputs, and perform supervised model training to obtain the depth estimation model.

For example, for any two-dimensional color live human face sample image, the electronic device uses the two-dimensional color live human face sample image as a training input and uses the first depth image of the two-dimensional color live human face sample image as the corresponding target output. Similarly, for any two-dimensional color non-living human face sample image, the electronic device uses the two-dimensional color non-living human face sample image as a training input and uses the second depth image of the two-dimensional color non-living human face sample image as the corresponding target output.
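The pairing of training inputs with target outputs can be sketched as below. The per-pixel mean squared error shown as the training objective is one common choice and is an assumption here, as the application does not name a loss function; the helper names and placeholder data are likewise hypothetical.

```python
def build_depth_training_pairs(live_pairs, non_live_pairs):
    """Return (training input, target output) pairs for depth estimation.

    Both arguments are lists of (color_image, depth_image) tuples. Live and
    non-live samples are treated alike: the color image is the input and its
    own first or second depth image is the target.
    """
    return [(color, depth) for color, depth in live_pairs + non_live_pairs]

def mse_loss(predicted, target):
    """Mean squared per-pixel error between two depth images of equal size."""
    squared = [(p - t) ** 2
               for p_row, t_row in zip(predicted, target)
               for p, t in zip(p_row, t_row)]
    return sum(squared) / len(squared)

# Hypothetical data: strings stand in for color images, grids for depth images.
live_pairs = [("live_rgb_0", [[400, 380], [402, 385]])]
non_live_pairs = [("photo_rgb_0", [[500, 500], [500, 500]])]
pairs = build_depth_training_pairs(live_pairs, non_live_pairs)

# Loss of a (hypothetical) model output against the first target depth image.
loss = mse_loss([[400, 382], [402, 385]], pairs[0][1])
```

A supervised training loop would repeatedly feed each input to the model and adjust the model to reduce such a loss against the paired target.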

An embodiment of the present application also provides a living body detection device. Please refer to FIG. 5, which is a schematic structural diagram of a living body detection device according to an embodiment of the present application. The living body detection device is applied to an electronic device that includes a monocular camera, and the living body detection device includes a color image acquisition module 501, a depth image acquisition module 502, and a living face detection module 503, as follows:

The color image acquisition module 501 is used to shoot a face to be detected through a monocular camera to obtain a two-dimensional color image of the face to be detected;

The depth image acquisition module 502 is used to input the captured two-dimensional color image into a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image;

The living body face detection module 503 is used to input a two-dimensional color image and its corresponding depth image into a pre-trained living body detection model for living body detection to obtain a detection result.
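The cooperation of the three modules can be sketched as a small class that chains three callables. The class name, the callables, and the stand-in functions below are hypothetical; in particular, the placeholder depth and liveness functions are toy rules that let the sketch run without a camera or trained models, not the pre-trained models of the application.

```python
from typing import Any, Callable

class LiveBodyDetectionDevice:
    """Chains color capture, depth estimation, and living body detection."""

    def __init__(self,
                 capture_color: Callable[[], Any],
                 estimate_depth: Callable[[Any], Any],
                 detect_liveness: Callable[[Any, Any], str]):
        self.capture_color = capture_color        # role of module 501
        self.estimate_depth = estimate_depth      # role of module 502
        self.detect_liveness = detect_liveness    # role of module 503

    def run(self) -> str:
        color = self.capture_color()              # two-dimensional color image
        depth = self.estimate_depth(color)        # estimated depth image
        return self.detect_liveness(color, depth)

def fake_capture():
    # Stand-in for the monocular camera: a fixed 2 x 2 color image.
    return [[(180, 140, 120), (182, 141, 119)],
            [(179, 139, 121), (181, 142, 118)]]

def fake_depth_model(color):
    # Placeholder only: derives "depth" from the red channel, NOT a real estimator.
    return [[1000 - r for (r, g, b) in row] for row in color]

def fake_liveness_model(color, depth):
    # Placeholder only: any depth relief counts as living, NOT the trained model.
    values = [d for row in depth for d in row]
    return "living face" if max(values) - min(values) > 0 else "non-living face"

device = LiveBodyDetectionDevice(fake_capture, fake_depth_model, fake_liveness_model)
result = device.run()
```

Passing the modules in as callables keeps the flow independent of how each model is implemented or where it is loaded from.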

In one embodiment, the living body detection model is a convolutional neural network model that includes a convolution layer, a pooling layer, and a fully connected layer connected in sequence. When inputting the two-dimensional color image and its corresponding depth image into the pre-trained living body detection model for living body detection to obtain the detection result, the living face detection module 503 can be used to:

Input the aforementioned two-dimensional color image and its corresponding depth image into the convolution layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image;

Input the joint global features into the pooling layer for feature dimensionality reduction to obtain the dimensionality-reduced joint global features;

Input the dimensionality-reduced joint global features into the fully connected layer for classification processing to obtain the detection result that the face to be detected is a living face, or the detection result that the face to be detected is a non-living face.
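The feature extraction and dimensionality reduction steps can be illustrated with a small sketch. It assumes the color image and depth image are stacked into one four-channel input so that a single kernel sees both (one possible way of obtaining joint features, not stated by the application), and it uses a single all-ones kernel and 2 x 2 max pooling purely for illustration.

```python
def stack_rgbd(color, depth):
    """Join a color image and its depth image into one H x W x 4 input."""
    return [[list(rgb) + [d] for rgb, d in zip(color_row, depth_row)]
            for color_row, depth_row in zip(color, depth)]

def conv2d(image, kernel):
    """Valid cross-correlation of an H x W x C image with a kH x kW x C kernel."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j][c] * kernel[i][j][c]
                 for i in range(kh)
                 for j in range(kw)
                 for c in range(len(kernel[i][j])))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + i][x + j]
                 for i in range(size) for j in range(size))
             for x in range(0, w - size + 1, size)]
            for y in range(0, h - size + 1, size)]

# Hypothetical 5 x 5 inputs and a 2 x 2 x 4 kernel of ones (untrained).
color = [[(1, 1, 1)] * 5 for _ in range(5)]
depth = [[2] * 5 for _ in range(5)]
kernel = [[[1, 1, 1, 1] for _ in range(2)] for _ in range(2)]

rgbd = stack_rgbd(color, depth)
feature_map = conv2d(rgbd, kernel)      # 4 x 4 joint feature map
pooled = max_pool(feature_map)          # reduced to 2 x 2
flat = [value for row in pooled for value in row]
```

The flattened vector `flat` is the kind of dimensionality-reduced feature that a fully connected layer would then classify as a living or non-living face.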

In one embodiment, when inputting the two-dimensional color image and its corresponding depth image into the convolution layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image, the living face detection module 503 may be used to:

Preprocessing the aforementioned two-dimensional color image to obtain the face area image in the aforementioned two-dimensional color image;

Preprocessing the aforementioned depth image to obtain the face area image in the aforementioned depth image;

The face area image in the two-dimensional color image and the face area image in the depth image are input to the convolutional layer for feature extraction to obtain a joint global feature of the two-dimensional color image and the depth image.

In an embodiment, the living body detection device further includes a model training module, which is used to:

Before the face to be detected is photographed through the monocular camera to obtain the two-dimensional color image of the face to be detected, photograph a plurality of different living human faces through the monocular camera to obtain multiple two-dimensional color live human face sample images, and obtain a depth image corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images;

Photograph a plurality of different non-living human faces through the monocular camera to obtain multiple two-dimensional color non-living human face sample images, and obtain a depth image corresponding to each two-dimensional color non-living human face sample image to obtain multiple second depth images;

Use each two-dimensional color live human face sample image and its corresponding first depth image as a positive sample, and each two-dimensional color non-living human face sample image and its corresponding second depth image as a negative sample, to construct a training sample set;

Use a convolutional neural network to perform model training on the training sample set to obtain a convolutional neural network model as the living body detection model.

In one embodiment, before the convolutional neural network is used to perform model training on the training sample set, the model training module may also be used to:

Perform sample expansion processing on the training sample set according to the preset sample expansion strategy.

In one embodiment, when acquiring depth images corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images, the model training module may be used to:

Receive the distance from each pixel in the two-dimensional color live human face sample image to the monocular camera;

According to the distance between each pixel in each two-dimensional color live human face sample image and the monocular camera, a depth image corresponding to each two-dimensional color live human face sample image is generated to obtain a plurality of first depth images.

In one embodiment, when acquiring depth images corresponding to each two-dimensional color non-living human face sample image to obtain multiple second depth images, the model training module may be used to:

Receive the calibrated distance from each pixel in each two-dimensional color non-living human face sample image to the monocular camera;

According to the distance between each pixel in each two-dimensional color non-living human face sample image and the monocular camera, a depth image corresponding to each two-dimensional color non-living human face sample image is generated to obtain a plurality of second depth images.

In one embodiment, the model training module can also be used for:

Use each two-dimensional color live human face sample image and each two-dimensional color non-living human face sample image as training inputs, use the first depth image corresponding to each two-dimensional color live human face sample image and the second depth image corresponding to each two-dimensional color non-living human face sample image as target outputs, and perform supervised model training to obtain the depth estimation model.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the stored computer program is executed on a computer, it causes the computer to perform the steps in the living body detection method provided in this embodiment, or causes the computer to perform the steps in the model training method provided in this embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

An embodiment of the present application also provides an electronic device including a memory and a processor. By calling a computer program stored in the memory, the processor executes the steps in the living body detection method provided in this embodiment, or executes the steps in the model training method provided in this embodiment.

In an embodiment, an electronic device is also provided. Referring to FIG. 6, the electronic device includes a processor 701, a memory 702, and a monocular camera 703. The processor 701 is electrically connected to the memory 702 and the monocular camera 703.

The processor 701 is the control center of the electronic device. It uses various interfaces and lines to connect the parts of the entire electronic device, and performs the various functions of the electronic device and processes data by running or loading the computer program stored in the memory 702 and calling the data stored in the memory 702.

The memory 702 may be used to store software programs and modules. The processor 701 runs the computer programs and modules stored in the memory 702 to execute various functional applications and data processing. The memory 702 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, computer programs required by at least one function (such as a sound playback function, an image playback function, etc.), and so on; the storage data area may store data created through use of the electronic device, and so on. In addition, the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Accordingly, the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.

The monocular camera 703 may include a camera having one or more lenses and an image sensor, capable of capturing external image data.

In the embodiment of the present application, the processor 701 in the electronic device loads the instructions corresponding to the processes of one or more computer programs into the memory 702 according to the following steps, and runs the computer program stored in the memory 702 to implement various functions, as follows:

The monocular camera 703 shoots the face to be detected to obtain a two-dimensional color image of the face to be detected;

Input the captured two-dimensional color image into a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image;

The two-dimensional color image and the corresponding depth image are input into a pre-trained living body detection model for living body detection, and the detection result is obtained.

Please refer to FIG. 7, which is another schematic structural diagram of an electronic device provided by an embodiment of the present application. The difference from the electronic device shown in FIG. 6 is that the electronic device further includes components such as an input unit 704 and an output unit 705.

The input unit 704 can be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The output unit 705 may be used to display information input by the user or information provided to the user, such as a screen.

In one embodiment, the living body detection model is a convolutional neural network model that includes a convolution layer, a pooling layer, and a fully connected layer connected in sequence. When inputting the two-dimensional color image and its corresponding depth image into the pre-trained living body detection model for living body detection to obtain the detection result, the processor 701 can execute:

In an embodiment, when inputting the two-dimensional color image and its corresponding depth image into the convolutional layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image, the processor 701 may execute:

In an embodiment, before the face to be detected is photographed through the monocular camera 703 to obtain a two-dimensional color image of the face to be detected, the processor 701 may execute:

Before the monocular camera 703 is used to photograph the face to be detected to obtain a two-dimensional color image of the face to be detected, use the monocular camera 703 to photograph multiple different living human faces to obtain multiple two-dimensional color live human face sample images, and obtain a depth image corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images;

Photograph a plurality of different non-living human faces through the monocular camera 703 to obtain a plurality of two-dimensional color non-living human face sample images, and obtain a depth image corresponding to each two-dimensional color non-living human face sample image to obtain multiple second depth images;

In one embodiment, before the convolutional neural network is used for model training on the training sample set, the processor 701 may execute:

In an embodiment, when acquiring the depth images corresponding to each two-dimensional color live human face sample image to obtain multiple first depth images, the processor 701 may execute:

Receive the distance from each pixel in the two-dimensional color live human face sample image to the monocular camera 703;

According to the distance between each pixel in each two-dimensional color live human face sample image and the monocular camera 703, a depth image corresponding to each two-dimensional color live human face sample image is generated to obtain a plurality of first depth images.

In one embodiment, when acquiring the depth images corresponding to the two-dimensional color non-living human face sample images to obtain multiple second depth images, the processor 701 may execute:

Receive the distance from each pixel in the two-dimensional color non-living human face sample image to the monocular camera 703;

According to the distance between each pixel in each two-dimensional color non-living human face sample image and the monocular camera 703, a depth image corresponding to each two-dimensional color non-living human face sample image is generated to obtain a plurality of second depth images.

In an embodiment, the processor 701 may also execute:

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the related descriptions of other embodiments.

It should be noted that, for the living body detection method of the embodiment of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the living body detection method can be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, such as the memory of an electronic device, and executed by at least one processor in the electronic device, and its execution may include the processes of the embodiments of the living body detection method. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.

For the living body detection device of the embodiment of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

The above provides a detailed description of the living body detection method, device, storage medium, and electronic equipment provided by the embodiments of the present application. Specific examples are used herein to explain the principles and implementation of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

A living body detection method, applied to an electronic device that includes a monocular camera, the method comprising:

Shooting the face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected;

Input the two-dimensional color image into a pre-trained depth estimation model to perform depth estimation to obtain a depth image corresponding to the two-dimensional color image;

The two-dimensional color image and the depth image are input into a pre-trained living body detection model for living body detection, and a detection result is obtained.
The living body detection method according to claim 1, wherein the living body detection model is a convolutional neural network model including a convolution layer, a pooling layer, and a fully connected layer connected in sequence, and the inputting the two-dimensional color image and the depth image into a pre-trained living body detection model for living body detection to obtain a detection result includes:

Input the two-dimensional color image and the depth image into the convolutional layer for feature extraction to obtain a joint global feature of the two-dimensional color image and the depth image;

Input the joint global feature into the pooling layer to perform feature dimensionality reduction to obtain the joint global feature after dimensionality reduction;

Input the dimensionality-reduced joint global feature into the fully connected layer for classification processing to obtain a detection result that the face to be detected is a living face, or a detection result that the face to be detected is a non-living face.
The living body detection method according to claim 2, wherein the inputting the two-dimensional color image and the depth image into the convolutional layer for feature extraction to obtain a joint global feature of the two-dimensional color image and the depth image includes:

Preprocessing the two-dimensional color image to obtain a face area image in the two-dimensional color image;

Preprocessing the depth image to obtain a face area image in the depth image;

Input the face area image in the two-dimensional color image and the face area image in the depth image into the convolution layer for feature extraction to obtain the joint global features of the two-dimensional color image and the depth image.
The living body detection method according to claim 3, wherein the preprocessing the two-dimensional color image to obtain a face area image in the two-dimensional color image includes:

An ellipse template, a circular template or a rectangular template is used to extract the face area image from the two-dimensional color image.
The living body detection method according to claim 2, wherein before the photographing the face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected, the method further comprises:

Photographing a plurality of different living human faces through the monocular camera to obtain multiple two-dimensional color live human face sample images, and acquiring a depth image corresponding to each of the two-dimensional color live human face sample images to obtain multiple first depth images;

Photographing a plurality of different non-living human faces through the monocular camera to obtain a plurality of two-dimensional color non-living human face sample images, and acquiring a depth image corresponding to each of the two-dimensional color non-living human face sample images to obtain multiple second depth images;

Constructing a training sample set by using each of the two-dimensional color live human face sample images and its corresponding first depth image as a positive sample, and each of the two-dimensional color non-living human face sample images and its corresponding second depth image as a negative sample;

A convolutional neural network is used to perform model training on the training sample set to obtain the convolutional neural network model.
The living body detection method according to claim 5, wherein before the performing model training on the training sample set using a convolutional neural network to obtain the convolutional neural network model, the method further comprises:

Perform sample expansion processing on the training sample set according to a preset sample expansion strategy.
The living body detection method according to claim 5, wherein the acquiring depth images corresponding to each of the two-dimensional color live human face sample images to obtain a plurality of first depth images includes:

Receiving the calibrated distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera;

According to the distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera, a depth image corresponding to each two-dimensional color live human face sample image is generated to obtain a plurality of first depth images.
The living body detection method according to claim 5, wherein the living body detection method further comprises:

Using each of the two-dimensional color live human face sample images and each of the two-dimensional color non-live human face sample images as training inputs, using the first depth image corresponding to each of the two-dimensional color live human face sample images and the second depth image corresponding to each of the two-dimensional color non-live human face sample images as target outputs, and performing supervised model training to obtain the depth estimation model.
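The supervised setup above (image as input, depth image as target output) can be illustrated with a deliberately tiny stand-in model. The sketch below is not the depth estimation model of the application: it assumes a toy per-pixel model `depth = w * intensity + b` on single-channel images and fits it by gradient descent on the mean squared error, only to show the shape of the training loop. All names and hyperparameters are hypothetical.

```python
def predict(image, w, b):
    """Toy depth predictor: one linear map applied to every pixel."""
    return [[w * x + b for x in row] for row in image]


def mean_squared_error(pred, target):
    """Average squared per-pixel difference between two equally sized images."""
    errors = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(errors) / len(errors)


def train_depth_estimator(pairs, lr=0.5, epochs=200):
    """Fit (w, b) on (input image, target depth image) pairs.

    A real depth estimation model would be a deep network; the linear
    model here is an assumption used only to keep the example short.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        count = 0
        for image, target in pairs:
            for row, target_row in zip(image, target):
                for x, t in zip(row, target_row):
                    err = (w * x + b) - t
                    grad_w += 2 * err * x   # d(err^2)/dw
                    grad_b += 2 * err       # d(err^2)/db
                    count += 1
        w -= lr * grad_w / count
        b -= lr * grad_b / count
    return w, b
```

The same loop structure carries over to a real network: the prediction function and the gradient computation are replaced by the forward and backward passes of the chosen framework, while the input and target pairing stays as described in the claim.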
The living body detection method according to claim 1, wherein before the inputting of the two-dimensional color image into a pre-trained depth estimation model for depth estimation, the method further comprises:

Calling the depth estimation model locally, or calling the depth estimation model from a server.
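The claim allows either source for the model. The sketch below assumes one possible policy, namely to prefer a local copy and fall back to the server when no local file exists. The function name, the file-based storage, and the `fetch_from_server` callable are hypothetical; the application does not describe the storage format or the network protocol.

```python
import os


def load_depth_model(local_path, fetch_from_server):
    """Return (model_bytes, source), where source is "local" or "server".

    Assumed policy: use the local file when present, otherwise call
    `fetch_from_server`, a caller-supplied function returning the bytes.
    """
    if os.path.isfile(local_path):
        with open(local_path, "rb") as f:
            return f.read(), "local"
    return fetch_from_server(), "server"
```

Passing the server access in as a callable keeps the sketch independent of any particular transport; a device could equally cache the downloaded bytes at `local_path` so that later calls are served locally.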
A living body detection device applied to electronic equipment, including:

A color image acquisition module, configured to photograph a face to be detected through a monocular camera to obtain a two-dimensional color image of the face to be detected;

A depth image acquisition module, configured to input the two-dimensional color image into a pre-trained depth estimation model to obtain a depth image corresponding to the two-dimensional color image;

A living body detection module, configured to input the two-dimensional color image and the depth image into a pre-trained living body detection model to obtain a detection result.
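The three modules form a simple pipeline: capture, depth estimation, then liveness detection on both images. A minimal sketch of that wiring follows, with each module supplied as a callable. The class name `LivenessPipeline` and its method names are hypothetical, and the callables stand in for the camera and the two pre-trained models.

```python
class LivenessPipeline:
    """Hypothetical wiring of the three modules; not the application's own code."""

    def __init__(self, capture_color, estimate_depth, detect_liveness):
        self.capture_color = capture_color      # color image acquisition module
        self.estimate_depth = estimate_depth    # depth image acquisition module
        self.detect_liveness = detect_liveness  # living body detection module

    def run(self):
        color = self.capture_color()               # two-dimensional color image
        depth = self.estimate_depth(color)         # depth image estimated from it
        return self.detect_liveness(color, depth)  # detection result
```

Note that the color image is used twice: once as the sole input to depth estimation and once, together with the estimated depth image, as input to the detection model.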
A storage medium on which a computer program is stored, wherein, when the computer program runs on a computer, the computer is caused to execute:

Photographing a face to be detected through a monocular camera to obtain a two-dimensional color image of the face to be detected;

Inputting the two-dimensional color image into a pre-trained depth estimation model for depth estimation to obtain a depth image corresponding to the two-dimensional color image;

Inputting the two-dimensional color image and the depth image into a pre-trained living body detection model for living body detection to obtain a detection result.
An electronic device, comprising a processor, a memory, and a monocular camera, the memory storing a computer program, wherein the processor, by calling the computer program, is configured to execute:

Photographing a face to be detected through the monocular camera to obtain a two-dimensional color image of the face to be detected;

Inputting the two-dimensional color image into a pre-trained depth estimation model for depth estimation to obtain a depth image corresponding to the two-dimensional color image;

Inputting the two-dimensional color image and the depth image into a pre-trained living body detection model for living body detection to obtain a detection result.
The electronic device according to claim 12, wherein the living body detection model is a convolutional neural network model comprising a convolutional layer, a pooling layer, and a fully connected layer connected in sequence, and wherein, when inputting the two-dimensional color image and the depth image into the pre-trained living body detection model to obtain a detection result, the processor is configured to execute:

Input the two-dimensional color image and the depth image into the convolutional layer for feature extraction to obtain a joint global feature of the two-dimensional color image and the depth image;

Input the joint global feature into the pooling layer to perform feature dimensionality reduction to obtain the joint global feature after dimensionality reduction;

Input the dimensionality-reduced joint global feature into the fully connected layer for classification processing to obtain a detection result that the face to be detected is a live face, or a detection result that the face to be detected is a non-live face.
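The three steps above (convolution over both images, pooling, fully connected classification) can be sketched as a forward pass on plain Python lists. This is a simplified illustration under stated assumptions: a single convolution filter with ReLU, 2x2 max pooling, a softmax over two classes, and hand-supplied rather than trained weights. The function names and the `params` layout are hypothetical, and a practical implementation would use an array or deep learning library.

```python
import math


def conv2d(channels, kernels, bias=0.0):
    """Valid convolution summed over all input channels, followed by ReLU."""
    k = len(kernels[0])
    height = len(channels[0]) - k + 1
    width = len(channels[0][0]) - k + 1
    out = []
    for i in range(height):
        row = []
        for j in range(width):
            s = bias
            for ch, ker in zip(channels, kernels):
                for di in range(k):
                    for dj in range(k):
                        s += ch[i + di][j + dj] * ker[di][dj]
            row.append(max(s, 0.0))
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Non-overlapping max pooling, reducing each dimension by `size`."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def fully_connected(features, weights, biases):
    """Flatten the features, apply one linear layer, and return softmax scores."""
    flat = [v for row in features for v in row]
    logits = [sum(w * x for w, x in zip(ws, flat)) + b
              for ws, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(color_channels, depth, params):
    """Assumed class order: index 0 = non-live, index 1 = live."""
    channels = list(color_channels) + [depth]   # color and depth as joint input
    fmap = conv2d(channels, params["kernels"], params["conv_bias"])
    pooled = max_pool(fmap)
    probs = fully_connected(pooled, params["fc_weights"], params["fc_biases"])
    return ("live" if probs[1] > probs[0] else "non-live"), probs
```

Stacking the depth image as an extra input channel is one way to let the convolutional layer produce a feature that depends on both images at once; the application does not state how the two inputs are combined, so this choice is an assumption.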
The electronic device according to claim 13, wherein, when inputting the two-dimensional color image and the depth image into the convolutional layer for feature extraction to obtain the joint global feature of the two-dimensional color image and the depth image, the processor is configured to execute:

Preprocessing the two-dimensional color image to obtain a face area image in the two-dimensional color image;

Preprocessing the depth image to obtain a face area image in the depth image;

Input the face area image in the two-dimensional color image and the face area image in the depth image into the convolutional layer for feature extraction to obtain the joint global feature of the two-dimensional color image and the depth image.
The electronic device according to claim 14, wherein, when preprocessing the two-dimensional color image to obtain a face area image in the two-dimensional color image, the processor is configured to execute:

An ellipse template, a circular template or a rectangular template is used to extract the face area image from the two-dimensional color image.
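A minimal sketch of extraction with an ellipse template, assuming the ellipse is axis-aligned, its center and semi-axes are already known (for example from a face detector, which is not shown), and pixels outside the template are replaced by a fill value. The function name and parameters are hypothetical.

```python
def extract_face_ellipse(image, center, axes, fill=0):
    """Keep the pixels inside an axis-aligned ellipse and blank out the rest.

    `center` is (row, column) and `axes` is (vertical semi-axis,
    horizontal semi-axis), both in pixels; these are assumed inputs.
    """
    cy, cx = center
    ry, rx = axes
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, pixel in enumerate(row):
            inside = ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1.0
            out_row.append(pixel if inside else fill)
        out.append(out_row)
    return out
```

A circular template is the special case of equal semi-axes, and a rectangular template replaces the ellipse test with simple bounds checks on `x` and `y`. The same function could be applied to the depth image with identical parameters so that the two face area images stay aligned, although the application does not state this explicitly.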
The electronic device according to claim 13, wherein before photographing the face to be detected through the monocular camera to obtain the two-dimensional color image of the face to be detected, the processor is further configured to execute:

A plurality of different live human faces are photographed through the monocular camera to obtain a plurality of two-dimensional color live human face sample images, and a depth image corresponding to each of the two-dimensional color live human face sample images is acquired to obtain a plurality of first depth images;

A plurality of different non-live human faces are photographed through the monocular camera to obtain a plurality of two-dimensional color non-live human face sample images, and a depth image corresponding to each of the two-dimensional color non-live human face sample images is acquired to obtain a plurality of second depth images;

Constructing a training sample set by using each of the two-dimensional color live human face sample images and its corresponding first depth image as a positive sample, and using each of the two-dimensional color non-live human face sample images and its corresponding second depth image as a negative sample;

A convolutional neural network is used to perform model training on the training sample set to obtain the convolutional neural network model.
The electronic device according to claim 16, wherein before the convolutional neural network is used to perform model training on the training sample set to obtain the convolutional neural network model, the processor is further configured to execute:

Perform sample expansion processing on the training sample set according to a preset sample expansion strategy.
The electronic device according to claim 16, wherein, when acquiring depth images corresponding to each of the two-dimensional color live human face sample images to obtain multiple first depth images, the processor is configured to execute:

Receiving the calibrated distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera;

According to the distance between each pixel in each of the two-dimensional color live human face sample images and the monocular camera, a depth image corresponding to each two-dimensional color live human face sample image is generated to obtain a plurality of first depth images.
The electronic device according to claim 16, wherein the processor is further configured to execute:

Using each of the two-dimensional color live human face sample images and each of the two-dimensional color non-live human face sample images as training inputs, using the first depth image corresponding to each of the two-dimensional color live human face sample images and the second depth image corresponding to each of the two-dimensional color non-live human face sample images as target outputs, and performing supervised model training to obtain the depth estimation model.
The electronic device according to claim 12, wherein before inputting the two-dimensional color image into a pre-trained depth estimation model for depth estimation, the processor is further configured to execute:

Calling the depth estimation model locally, or calling the depth estimation model from a server.