CN112200057A - Face living body detection method and device, electronic equipment and storage medium - Google Patents

Face living body detection method and device, electronic equipment and storage medium

Info

Publication number
CN112200057A
Authority
CN
China
Prior art keywords
face
network
image
sample
living body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011063444.1A
Other languages
Chinese (zh)
Other versions
CN112200057B (en)
Inventor
冯思博
陈莹
黄磊
彭菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN202011063444.1A priority Critical patent/CN112200057B/en
Publication of CN112200057A publication Critical patent/CN112200057A/en
Application granted granted Critical
Publication of CN112200057B publication Critical patent/CN112200057B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation
    • G06V 40/166 - Detection; Localisation; Normalisation using acquisition arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/60 - Type of objects
    • G06V 20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 - Spoof detection, e.g. liveness detection
    • G06V 40/45 - Detection of the body part being alive

Abstract

The application discloses a face living body detection method, which belongs to the technical field of face detection and is beneficial to improving the speed and accuracy of face living body detection. The method comprises the following steps: acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device for a target face; respectively carrying out face positioning on the first face image and the second face image, and cutting a first face image to be detected from the first face image and a second face image to be detected from the second face image according to the face positioning results; inputting the cut first face image to be detected and the cut second face image to be detected in parallel to a pre-trained living body detection model, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the two input face images; and determining whether the target face is a living body face according to the classification mapping result.

Description

Face living body detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of human face detection technologies, and in particular, to a human face in-vivo detection method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
In order to improve the security of face recognition technology in practical applications, performing living body detection on the face image to be recognized, so as to resist attacks on face recognition applications using photos or videos, has become increasingly important. In the prior art, to improve the accuracy of face recognition, face recognition technology based on binocular cameras has been more and more widely applied, and face living body detection technology based on binocular cameras has also been continuously improved. A common face living body detection technology at present is based on binocular visible light, and specifically comprises the following steps: face key point detection is performed respectively on two images of the target face acquired by a binocular visible light camera, a three-dimensional sparse point cloud is then constructed from the face key point data, the three-dimensional sparse point cloud is interpolated to generate a dense point cloud, and classification is performed based on the dense point cloud. This scheme suffers from long calculation time, high complexity, large calculation error and limited applicable scenarios.
Therefore, the method for detecting the living human face in the prior art needs to be improved.
Disclosure of Invention
The application provides a face in-vivo detection method which is beneficial to improving the speed and accuracy of face in-vivo detection.
In order to solve the above problem, in a first aspect, an embodiment of the present application provides a face live detection method, including:
acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results;
respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image;
inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and determining whether the target face is a living face according to the classification mapping result.
In a second aspect, an embodiment of the present application provides a human face living body detection apparatus, including:
the face image acquisition module is used for acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
the face positioning module is used for respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results;
the face image cutting module is used for cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively;
the image classification module is used for inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and the face living body detection result determining module is used for determining whether the target face is a living body face according to the classification mapping result.
In a third aspect, an embodiment of the present application further discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the living human face detection method according to the embodiment of the present application is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the living human face detection method disclosed in the present application.
The method for detecting the living human face comprises the steps of acquiring a first human face image and a second human face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target human face; respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results; then, respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image; inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample; and determining whether the target face is a living face according to the classification mapping result, which is beneficial to improving the speed of face living body detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart of a human face living body detection method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-task model according to a first embodiment of the present application;
fig. 3 is a schematic structural diagram of a living human face detection model according to a first embodiment of the present application;
fig. 4 is a schematic structural diagram of a living human face detection device according to a second embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
As shown in fig. 1, the method for detecting a living human face includes steps 110 to 150.
And 110, acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at the target face.
In the embodiment of the present application, the first image acquisition device and the second image acquisition device are two synchronized image acquisition devices disposed on the same electronic device, for example, a binocular synchronous face recognition device. The first image acquisition device and the second image acquisition device synchronously acquire images of a target object (such as a human face) under the control of the electronic device. In some embodiments of the present application, the relative positions of the first image acquisition device and the second image acquisition device in the vertical and horizontal directions are kept constant, and a certain distance is kept between them in the horizontal direction (for example, a distance larger than 60 mm is common). The imaging light sources of the first image acquisition device and the second image acquisition device can be the same or different. For example, the first image acquisition device and the second image acquisition device may both be visible light image acquisition devices, or both be infrared image acquisition devices, or one may be an infrared image acquisition device and the other a visible light image acquisition device, which is not limited in this application.
In some embodiments of the present application, the first image capturing device and the second image capturing device need to be calibrated and calibrated in advance to obtain calibration matrices of the first image capturing device and the second image capturing device.
For specific implementations of calibrating the two image acquisition devices, reference may be made to the prior art; for example, the intrinsic and extrinsic matrices of each camera may be obtained by the Zhang Zhengyou checkerboard calibration method, and details are not repeated in the embodiments of the present application.
Taking the first image acquisition device and the second image acquisition device as binocular cameras of the electronic equipment as an example, the calibration matrix of the binocular cameras is determined by calibration when the cameras leave a factory. In some embodiments of the present application, after two face images of a target face are simultaneously and respectively acquired by a binocular synchronous camera of an electronic device, for example, the two face images are respectively represented as a first face image a and a second face image B, the first face image a and the second face image B are further rectified by using a calibration matrix of the binocular synchronous camera, and a first face image a 'and a second face image B' are respectively obtained.
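For illustration, a minimal sketch of this rectification step using OpenCV follows; the calibration quantities (intrinsic matrices K1/K2, distortion coefficients D1/D2, rotation R and translation T between the two cameras) are assumed to come from the factory calibration described above, and the function and variable names are illustrative rather than part of the embodiment:

```python
import cv2

def rectify_pair(img_a, img_b, K1, D1, K2, D2, R, T):
    """Rectify a synchronously captured image pair (A, B) into (A', B')
    using the binocular calibration matrices."""
    size = (img_a.shape[1], img_a.shape[0])  # (width, height)
    # Compute rectification transforms for both cameras.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    # Remap the raw images A and B to the rectified images A' and B'.
    img_a_rect = cv2.remap(img_a, map1x, map1y, cv2.INTER_LINEAR)
    img_b_rect = cv2.remap(img_b, map2x, map2y, cv2.INTER_LINEAR)
    return img_a_rect, img_b_rect
```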
And 120, respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results.
In some embodiments of the present application, the face positioning result includes: and a human face positioning frame. In specific implementation, a face positioning method in the prior art may be adopted to perform face positioning on the first face image a 'and the second face image B' respectively, so as to obtain a face positioning frame in the first face image a 'and a face positioning frame in the second face image B'. The method and the device do not limit the specific implementation mode of respectively carrying out face positioning on the first face image and the second face image and respectively determining the face positioning frames in the first face image and the second face image.
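The embodiment does not fix a particular face positioning method; as one hedged example, a prior-art detector such as OpenCV's Haar cascade could be used to produce the face positioning frame (the helper name and parameters below are assumptions, not the embodiment's prescribed detector):

```python
import cv2

_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_face(image):
    """Return one face positioning frame (x, y, w, h), or None if no face is found."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    boxes = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    # Keep the largest detection as the target face.
    return max(boxes, key=lambda b: b[2] * b[3])
```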
Step 130, respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image.
In some embodiments of the present application, the face positioning result includes: and a human face positioning frame. After the face positioning frame in the first face image and the face positioning frame in the second face image are respectively determined, respectively cutting out a first face image to be detected from the first face image and cutting out a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image, further comprising: cutting a first face image to be detected from the first face image according to the face positioning frame in the first face image; and cutting a second face image to be detected from the second face image according to the face positioning frame in the second face image.
In order to acquire richer information, it is necessary to determine, according to the face positioning frame, an image of a larger area containing the face positioning frame for face living body detection. In some embodiments of the present application, cutting a first face image to be detected from the first face image according to the face positioning frame in the first face image, and cutting a second face image to be detected from the second face image according to the face positioning frame in the second face image, includes: expanding the face positioning frame in the first face image to a preset size, and cutting the first face image to be detected from the first face image according to the expanded face positioning frame; and expanding the face positioning frame in the second face image to a preset size, and cutting the second face image to be detected from the second face image according to the expanded face positioning frame. For example, the face positioning frame S_A of the first face image A' is expanded outward by 2 times to obtain a face positioning frame S_A'; then, the image of the area covered by the face positioning frame S_A' is cut out from the first face image A' as the first face image to be detected. In the same way, the face positioning frame S_B of the second face image B' is expanded outward by 2 times to obtain a face positioning frame S_B'; then, the image of the area covered by S_B' is cut out from the second face image B' as the second face image to be detected.
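The following is a minimal sketch of the 2-times expansion of the face positioning frame and the cropping of the covered area; clamping the expanded frame to the image borders is an added assumption, since the embodiment does not specify how boundary overflow is handled:

```python
def crop_expanded_face(image, box, scale=2.0):
    """Expand a face box (x, y, w, h) about its center by `scale` and crop the covered area."""
    img_h, img_w = image.shape[:2]
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    # Clamp the expanded frame to the image boundaries (assumed behaviour).
    x0 = max(int(cx - new_w / 2), 0)
    y0 = max(int(cy - new_h / 2), 0)
    x1 = min(int(cx + new_w / 2), img_w)
    y1 = min(int(cy + new_h / 2), img_h)
    return image[y0:y1, x0:x1]

# face_to_detect_a = crop_expanded_face(img_a_rect, box_a)  # first face image to be detected
# face_to_detect_b = crop_expanded_face(img_b_rect, box_b)  # second face image to be detected
```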
Step 140, inputting the cut first to-be-detected face image and the cut second to-be-detected face image in parallel to a pre-trained living body detection model, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first to-be-detected face image and the second to-be-detected face image.
The living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and then, inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and carrying out living body detection on the target face through the living body detection model based on the two input images. In specific implementation, a living body detection model needs to be trained firstly.
In some embodiments of the present application, before inputting the cut first to-be-detected face image and the cut second to-be-detected face image in parallel to a pre-trained living body detection model, and performing classification mapping on the target face according to a plane feature and a depth feature in the first to-be-detected face image and the second to-be-detected face image through the living body detection model, the method further includes: and training a living body detection model.
In some embodiments of the present application, the living body detection model is obtained by cutting a preset multitask model, as shown in fig. 2, the multitask model includes: a first task network consisting of a first convolutional network 210 and a first fully connected network 220, the first task network being used for learning face key point features in images input to the first convolutional network; a second task network consisting of a second convolutional network 230 and a second fully-connected network 240, the second task network being configured to learn the face keypoint features in the image input to the second convolutional network; a third task network consisting of the first convolutional network 210, the second convolutional network 230, a residual network 250, and a depth regression network 260, the third task network for learning depth features in the images input to the first convolutional network and the second convolutional network; and a fourth task network composed of the first convolution network 210, the second convolution network 230, the residual network 250, and a classification network 270, the fourth task network being configured to learn living body and non-living body information in the image input to the first convolution network and the second convolution network, wherein the first convolution network and the second convolution network are arranged in parallel, and the residual network is connected to outputs of the first convolution network and the second convolution network, respectively.
In the embodiment of the present application, as shown in fig. 3, the living body detection model includes: the first convolution network 210, the second convolution network 230, the residual network 250, and the classification network 270. Accordingly, training the living body detection model includes: training the multi-task model; the network parameters of the living body detection model composed of the first convolution network 210, the second convolution network 230, the residual network 250 and the classification network 270 are obtained by training the multi-task model.
Wherein the first convolutional network 210 and the second convolutional network 230 are arranged in parallel, the residual network 250 is connected with the outputs of the first convolutional network 210 and the second convolutional network 230, and the classification network 270 is connected with the output of the residual network 250.
In the specific implementation of the application, the multitask network is trained first. The multitask network comprises four learning tasks, namely two network tasks for respectively learning key point features of human faces in input images of two channels, one network task for learning depth features in the input images and one network task for learning living and non-living features in the input images. Each network task is realized through different task networks, the four task networks share the first convolution network and the second convolution network, and the learning of the depth features, the living body features and the non-living body features is based on the learning of the key point features of the face.
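Since the embodiment does not specify layer configurations, the following PyTorch skeleton is only a sketch of the four-task topology described above (two shared convolution branches with fully connected key point heads, a residual trunk over the concatenated features, a depth regression head and a live/non-live classification head); all channel sizes, the 81-key-point count, and module names are assumptions, and the splicing of depth values into the classification branch mentioned later in this description is omitted for brevity:

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """First/second convolution network: a small stack of convolution layers."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.features(x)

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiTaskModel(nn.Module):
    def __init__(self, num_keypoints=81, feat_ch=64, img_size=64):
        super().__init__()
        fmap = img_size // 4                       # spatial size after the two stride-2 convs
        flat = feat_ch * fmap * fmap
        self.conv_left = ConvBranch()              # first convolution network (210)
        self.conv_right = ConvBranch()             # second convolution network (230)
        self.fc_left = nn.Linear(flat, num_keypoints * 2)    # first fully connected network (220)
        self.fc_right = nn.Linear(flat, num_keypoints * 2)   # second fully connected network (240)
        self.residual = nn.Sequential(             # residual network (250) over concatenated features
            nn.Conv2d(feat_ch * 2, feat_ch, 1), ResidualBlock(feat_ch), ResidualBlock(feat_ch),
        )
        self.depth_head = nn.Linear(flat, num_keypoints)      # depth regression network (260)
        self.cls_head = nn.Sequential(             # classification network (270): live / non-live
            nn.Flatten(), nn.Linear(flat, 2),
        )

    def forward(self, img_left, img_right):
        e_l = self.conv_left(img_left)             # first vector
        e_r = self.conv_right(img_right)           # second vector
        kpt_l = self.fc_left(e_l.flatten(1))       # key point prediction, first sample image
        kpt_r = self.fc_right(e_r.flatten(1))      # key point prediction, second sample image
        fused = self.residual(torch.cat([e_l, e_r], dim=1))   # third vector
        depth = self.depth_head(fused.flatten(1))  # depth values of the key points
        logits = self.cls_head(fused)              # live / non-live logits
        return kpt_l, kpt_r, depth, logits
```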
Before training the multitask model, a training sample set including several training samples needs to be obtained first. Wherein the sample data of each training sample comprises: a first sample image and a second sample image. The sample label of each training sample comprises: the first sample image and the second sample image each correspond to: the real value of the face key point, the real value of the depth value and the real value of the living body category.
The first sample image and the second sample image are a pair of images determined in the same manner as the first face image to be detected and the second face image to be detected; the real values of the face key points corresponding to the first sample image and the second sample image are obtained by respectively performing face detection on the first sample image and the second sample image using a prior-art face detection technology. In specific implementation, the number of face key points obtained by different face detection technologies may be different. The real value of the depth value is the depth value of each face key point, calculated according to the face key point coordinates in the first sample image and the face key point coordinates in the second sample image together with the calibration matrices of the first image acquisition device and the second image acquisition device.
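One possible way to compute the depth-value ground truth from the matched key point coordinates and the calibration (projection) matrices of the two image acquisition devices is classical triangulation; the sketch below uses OpenCV's triangulatePoints and assumes 3x4 projection matrices P1 and P2 for the rectified cameras are available, which is an illustrative assumption rather than the embodiment's prescribed formula:

```python
import cv2
import numpy as np

def keypoint_depths(kpts_a, kpts_b, P1, P2):
    """Triangulate matched face key points from the two views and return
    their depth values (z coordinates in the first camera's frame).

    kpts_a, kpts_b: arrays of shape (K, 2) with matched (x, y) coordinates.
    P1, P2: 3x4 projection matrices of the first/second image acquisition devices.
    """
    pts_a = np.asarray(kpts_a, dtype=np.float64).T         # shape (2, K)
    pts_b = np.asarray(kpts_b, dtype=np.float64).T
    pts_4d = cv2.triangulatePoints(P1, P2, pts_a, pts_b)   # homogeneous, shape (4, K)
    pts_3d = pts_4d[:3] / pts_4d[3]                        # normalise homogeneous coordinates
    return pts_3d[2]                                       # depth (z) per key point
```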
In some embodiments of the present application, the sample data for training each training sample of the multitask model comprises: a first sample image and a second sample image, the sample label of each of the training samples comprising: the first sample image and the second sample image each correspond to: the real values of the face key points, the real values of the depth values and the real values of the face living body categories.
The multitask model is trained by the following method: for each training sample in the sample set, performing the following encoding mapping operations: inputting a first sample image comprised by the training sample to the first convolutional network of the multitask model, while inputting a second sample image comprised by the training sample to the second convolutional network of the multitask model; performing operation processing on the first sample image through the first task network to obtain a human face key point prediction value of the first sample image in the training sample; performing operation processing on the second sample image through the second task network to obtain a human face key point prediction value of the second sample image in the training sample; performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample; performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a human face living body category predicted value of the training sample; determining prediction loss values of the first task network, the second task network, the third task network and the fourth task network according to prediction values (namely the face key point prediction value of the first sample image, the face key point prediction value of the second sample image, the depth value prediction value and the face living body category prediction value) obtained by executing the coding mapping operation; carrying out weighted summation on the predicted loss values of the first task network, the second task network, the third task network and the fourth task network to determine a model predicted total loss value of the multitask model; optimizing the network parameters of the multitask model, and skipping to executing the coding mapping operation until the model prediction total loss value converges to meet the preset condition.
Firstly, for each training sample, inputting a first sample image included in sample data of the training sample to the first convolution network of the multitask model, simultaneously inputting a second sample image included in the sample data of the training sample to the second convolution network of the multitask model, and then starting to execute codes corresponding to each network in the multitask model by computing and processing equipment to learn face key point features, depth features, living body features and non-living body face features of all the training samples in the training sample set. Specifically, in the task model structure shown in fig. 3, the first convolutional network 210 and the second convolutional network 230 are respectively used for learning the face key point features in the input image; the residual error network 250 is used for simultaneously learning the depth features and the living body and non-living body features of the input image based on the face key point feature learning.
In some embodiments of the present application, the first sample image is subjected to operation processing by the first task network, so as to obtain a face keypoint prediction value of the first sample image in the training sample; and performing operation processing on the second sample image through the second task network to obtain a face key point prediction value of the second sample image in the training sample, including: performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector; then, coding and mapping the first vector through the first full-connection network to obtain a human face key point predicted value corresponding to the first sample image; performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector; and then, coding and mapping the second vector through the second fully-connected network to obtain a human face key point prediction value corresponding to the second sample image.
In the implementation of the present application, the processing of the first sample image and the processing of the second sample image are performed synchronously through two network tasks. The following describes the encoding and mapping process for the first sample image and the second sample image in the training sample, respectively.
As shown in FIG. 2, the first convolution network 210 includes a plurality of convolution layers, which convolve the input image (denoted as P_L_i) and extract a feature vector, e.g. denoted as a first vector e_L_i; thereafter, the first fully connected network 220 flattens and maps the first vector e_L_i to obtain the face key point prediction value corresponding to the input image P_L_i, for example, the prediction values of 81 face key points, expressed as (x_L_i_0, x_L_i_1, …, x_L_i_80), (y_L_i_0, y_L_i_1, …, y_L_i_80).
Similarly, the second convolution network 230 includes a plurality of convolution layers, which convolve the input image (denoted as P_R_i) and extract a feature vector, e.g. denoted as a second vector e_R_i; thereafter, the second fully connected network 240 flattens and maps the second vector e_R_i to obtain the face key point prediction value corresponding to the input image P_R_i, for example, the prediction values of 81 face key points, expressed as (x_R_i_0, x_R_i_1, …, x_R_i_80), (y_R_i_0, y_R_i_1, …, y_R_i_80).
In some embodiments of the present application, the real value of the depth value is determined according to the face key points in the first sample image and the second sample image in the sample data of the training sample, and performing operation processing on the first sample image and the second sample image through the third task network to obtain the depth value predicted value of the training sample includes: performing convolution processing on the first vector and the second vector through the residual network to obtain a third vector; and encoding and mapping the third vector through the depth regression network to obtain the depth value prediction values of the face key points corresponding to the training sample.
For specific implementation of determining the face key points in the first sample image and the second sample image, reference is made to the prior art, and details are not described in the embodiment of the present application. Furthermore, by adopting the method in the prior art, the real value of the depth value of the training sample can be determined according to the face key points in the first sample image and the second sample image and the calibration matrixes of the first image acquisition device and the second image acquisition device for acquiring the first sample image and the second sample image.
As shown in fig. 2, the first vector e_L_i and the second vector e_R_i are subjected to convolution processing through the residual network 250 to obtain a third vector; the third vector is then encoded and mapped through the depth regression network 260 to obtain the depth value prediction values of the face key points corresponding to the training sample, for example, the predicted depth values of 81 face key points, expressed as (z_i_0, z_i_1, …, z_i_80).
In some embodiments of the present application, the obtaining a human face living body category prediction value of the training sample by performing operation processing on the first sample image and the second sample image through the fourth task network includes: performing convolution processing on the first vector and the second vector through the residual error network to obtain a third vector; and coding and mapping the third vector through the classification network to obtain a human face living body category predicted value corresponding to the training sample.
As shown in fig. 2, the first vector e_L_i and the second vector e_R_i are subjected to convolution processing through the residual network 250 to obtain a third vector; the third vector is then encoded and mapped through the classification network 270 to obtain the face living body category prediction value corresponding to the training sample. In some embodiments of the present application, after encoding and mapping the third vector, the classification network 270 may output a two-dimensional vector, where each dimension of the two-dimensional vector is used to represent the probability that the training sample belongs to a different face living body category.
In some embodiments of the present application, determining the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to the predicted values obtained by performing the encoding mapping operation includes: determining a prediction loss value of the first task network according to a difference value between the human face key point prediction value and a human face key point true value of a first sample image in all the training samples in the sample set; determining a prediction loss value of the second task network according to a difference value between the predicted value of the face key point and the true value of the face key point of a second sample image in all the training samples in the sample set; determining a prediction loss value of the third task network according to the difference value between the depth value prediction value and the depth value true value of all the training samples in the sample set; and determining a prediction loss value of the fourth task network according to the difference value between the human face living body type prediction value and the human face living body type true value of all the training samples in the sample set.
For example, according to the encoding mapping result of the first sample image in all training samples in the sample set, the prediction errors of the first convolution network 210 and the first fully-connected network 220 are calculated, that is, the prediction loss value of the first task network is also the first face keypoint prediction loss value. In some embodiments of the present application, the first face keypoint predicted loss value of the first task network may be calculated by the following formula:
L_landmark_left = (1/N) * Σ_{i=1}^{N} ||f(x_L_i) - y_L_i||^2 + λ * Σ_{j=1}^{n} ||w_j||^2
wherein L_landmark_left represents the prediction loss value of the first task network, x_L_i is the first sample image in the i-th training sample, f(x_L_i) represents the face key point prediction value of the first sample image in the i-th training sample, y_L_i represents the face key point real value of the first sample image in the i-th training sample, N represents the number of samples in the training sample set, λ represents the regularization weight applied to each weighted network layer, w_j is the network parameter of the j-th layer, and n represents the number of network layers with weights.
For another example, according to the encoding mapping result of the second sample image in all the training samples in the sample set, the prediction error of the second convolutional network 230 and the second fully-connected network 240 is calculated, that is, the prediction loss value of the second task network is also the second face key point prediction loss value. In some embodiments of the present application, the predicted loss value for the second task network may be calculated by the following formula:
L_landmark_right = (1/N) * Σ_{i=1}^{N} ||f(x_R_i) - y_R_i||^2 + λ * Σ_{j=1}^{n} ||w_j||^2
wherein L_landmark_right represents the prediction loss value of the second task network, x_R_i is the second sample image in the i-th training sample, f(x_R_i) represents the face key point prediction value of the second sample image in the i-th training sample, y_R_i represents the face key point real value of the second sample image in the i-th training sample, N represents the number of samples in the training sample set, λ represents the regularization weight applied to each weighted network layer, w_j is the network parameter of the j-th layer, and n represents the number of network layers with weights.
For another example, the prediction errors of the first convolution network 210, the second convolution network 230, the residual network 250, and the depth regression network 260, i.e., the prediction loss values of the third task network, are also depth value prediction loss values, are calculated according to the encoding mapping results of all the training samples in the sample set. In some embodiments of the present application, the predicted loss value of the third task network may be calculated by the following formula:
L_depth = (1/N) * Σ_{i=1}^{N} ||f(x_i) - y_i||^2 + λ * Σ_{j=1}^{n} ||w_j||^2
wherein L_depth represents the prediction loss value of the third task network, f(x_i) represents the depth value prediction values of the face key points in the i-th training sample, y_i represents the depth value real values of the face key points in the i-th training sample, N represents the number of samples in the training sample set, λ represents the regularization weight applied to each weighted network layer, w_j is the network parameter of the j-th layer, and n represents the number of network layers with weights.
For another example, the prediction errors of the first convolutional network 210, the second convolutional network 230, the residual network 250, and the classification network 270, that is, the prediction loss value of the fourth task network, are also the face living body class prediction loss values, are calculated according to the coding mapping results of all the training samples in the sample set. In some embodiments of the present application, the predicted loss value of the fourth task network may be calculated by the following formula:
L_face_liveness = -(1/N) * Σ_{i=1}^{N} [ y_i * log f(x_i) + (1 - y_i) * log(1 - f(x_i)) ] + λ * Σ_{j=1}^{n} ||w_j||^2
wherein L_face_liveness represents the prediction loss value of the fourth task network, f(x_i) represents the predicted face living body class of the i-th training sample, y_i represents the real face living body class of the i-th training sample, N represents the number of samples in the training sample set, λ represents the regularization weight applied to each weighted network layer, w_j is the network parameter of the j-th layer, and n represents the number of network layers with weights.
After the prediction loss value of each branch network is determined, the model prediction total loss value of the multi-task model is further calculated according to the prediction loss values of the branch networks. In some embodiments of the present application, the model prediction total loss value L_total of the multi-task model may be determined as follows:
L_total = λ1 * L_landmark_left + λ2 * L_landmark_right + λ3 * L_depth + λ4 * L_face_liveness; wherein the values of λ1, λ2, λ3 and λ4 can be set according to practical experience.
In the training process, the model prediction total loss value L_total of the multi-task model is adjusted by continuously optimizing the network parameters of each network included in the multi-task model, until the model prediction total loss value L_total satisfies a preset condition (e.g., the loss value L_total converges to be less than a preset value), at which point the training of the multi-task model is completed.
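Putting the four branch losses and the weighted total loss together, one optimization iteration over a mini-batch could look like the sketch below (it reuses the MultiTaskModel sketch given earlier; the mean-squared-error and cross-entropy choices, the optimizer, the λ1-λ4 values, and delegating the weight regularization term to the optimizer's weight_decay are all assumptions):

```python
import torch
import torch.nn.functional as F

model = MultiTaskModel()                        # sketch defined earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
lambdas = (1.0, 1.0, 1.0, 1.0)                  # λ1..λ4, set from practical experience

def train_step(img_l, img_r, kpt_l_gt, kpt_r_gt, depth_gt, label_gt):
    """One encoding-mapping + optimization step over a mini-batch.
    label_gt: class indices (0 = non-live, 1 = live, assumed ordering)."""
    kpt_l, kpt_r, depth, logits = model(img_l, img_r)
    loss_landmark_left = F.mse_loss(kpt_l, kpt_l_gt)     # first task network
    loss_landmark_right = F.mse_loss(kpt_r, kpt_r_gt)    # second task network
    loss_depth = F.mse_loss(depth, depth_gt)             # third task network
    loss_liveness = F.cross_entropy(logits, label_gt)    # fourth task network
    total = (lambdas[0] * loss_landmark_left + lambdas[1] * loss_landmark_right
             + lambdas[2] * loss_depth + lambdas[3] * loss_liveness)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```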
In the prediction stage, the cut first face image to be detected and the cut second face image to be detected are input to a pre-trained living body detection model in parallel, and the target face is classified and mapped through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected, including: carrying out convolution processing on the first face image to be detected through the first convolution network to obtain a fourth vector; carrying out convolution processing on the second face image to be detected through the second convolution network to obtain a fifth vector; performing convolution processing on the fourth vector and the fifth vector through the residual error network to obtain a sixth vector; and coding and mapping the sixth vector through the classification network to obtain the living human face category corresponding to the target human face.
Specifically, for the implementation of performing convolution processing on the first face image to be detected through the first convolution network to obtain the fourth vector, reference may be made to the training-stage processing in which the first sample image of the training sample is convolved by the first convolution network to obtain the first vector, which is not repeated here. For the implementation of performing convolution processing on the second face image to be detected through the second convolution network to obtain the fifth vector, reference may be made to the training-stage processing in which the second sample image of the training sample is convolved by the second convolution network to obtain the second vector, which is not repeated here. For the implementation of performing convolution processing on the fourth vector and the fifth vector through the residual network to obtain the sixth vector, reference may be made to the training-stage processing in which the first vector and the second vector are convolved by the residual network to obtain the third vector, which is not repeated here. For the implementation of encoding and mapping the sixth vector through the classification network to obtain the face living body category corresponding to the target face, reference may be made to the training-stage processing in which the third vector is encoded and mapped by the classification network to obtain the face living body category prediction value corresponding to the training sample, which is not repeated here.
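A sketch of this pruned prediction-stage forward pass, reusing the MultiTaskModel sketch given earlier (the module and function names are illustrative), is as follows:

```python
import torch

@torch.no_grad()
def liveness_forward(model, face_a, face_b):
    """Forward pass of the pruned liveness detection model:
    convolution branches -> residual network -> classification network only."""
    model.eval()
    e4 = model.conv_left(face_a)                           # fourth vector
    e5 = model.conv_right(face_b)                          # fifth vector
    e6 = model.residual(torch.cat([e4, e5], dim=1))        # sixth vector
    probs = torch.softmax(model.cls_head(e6), dim=1)       # per-class probabilities
    return probs  # probs[:, 1] taken as the live-face probability (assumed index)

# Step 150 decision: live if the live-face probability exceeds a preset threshold.
# is_live = liveness_forward(model, face_a, face_b)[0, 1] > 0.5
```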
And 150, determining whether the target face is a living face according to the classification mapping result.
The classification mapping result in the embodiment of the application comprises the probability that the input image is recognized as different human face living body classes. Further, when the probability that the input image is recognized as the living body face type is larger than a preset probability threshold value, the target face can be determined to be the living body face, and otherwise, the target face can be determined to be the non-living body face.
The method for detecting the living human face comprises the steps of acquiring a first human face image and a second human face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target human face; respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results; then, respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image; inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample; and determining whether the target face is a living face according to the classification mapping result, which is beneficial to improving the speed of face living body detection.
According to the face living body detection method disclosed in the embodiment of the present application, in the training process of the living body detection model, the two face images acquired by the binocular image acquisition device are learned under the constraints of the face key points and of the depth information learning results, so that living body detection of the target face can be performed directly from the plane information and the depth information of the image pair acquired by the binocular image acquisition device, without generating a three-dimensional space point cloud; the calculation complexity is low, the operation speed is high, and the face living body detection efficiency is high.
In the model training process, the depth information of the image is considered, so that planar non-living faces such as photos and videos can be accurately classified in the prediction stage, and attacks with planar images can be rapidly detected. Because the face key point information is fully considered by the respective networks (see the training processes of the first task network and the second task network in fig. 2, i.e., of the first convolution network and the second convolution network), in the prediction stage the living body detection model can also detect attack faces such as a photo whose nose region has been cut out and bent to form a three-dimensional nose, a three-dimensional head model, a simulation mask, or a mask worn by a real person, further improving the accuracy of face living body detection.
Specifically, a first sample image and a second sample image (i.e., the images acquired by the two cameras of the binocular image acquisition device) are respectively input into the multi-task model shown in fig. 2. First, the first convolution network of the first task network and the second convolution network of the second task network respectively learn the face features of the first sample image and the second sample image; a fully connected layer is added behind each convolution network to regress the features, and the face key point constraint is imposed, so that each convolution network branch learns the face features in the input image. Then, the features extracted by the two convolution network branches are combined, and the depth features and the living body features are learned through several residual modules of the residual network. After that, two network branches are led out: one branch is connected to a fully connected layer and regresses the depth values of the feature points from the fused features; in the other branch, a convolution network is added, the convolution features are then stretched into a one-dimensional vector, the depth values are spliced onto this one-dimensional vector, and the living body classification result is obtained through two fully connected layers. The living body detection model trained with this structure and method carries the face key point constraint and the depth constraint corresponding to the key points, can learn face plane information (such as two-dimensional texture information) and depth information, and is favorable for improving the accuracy and reliability of living body detection.
On the other hand, the living body detection model in the embodiment of the present application is obtained by cutting a multi-task model: in the training stage the four branch networks are trained synchronously by combining plane information and depth information, while in the prediction stage only one branch network is used, so the network structure used in the prediction stage is simple and the operation efficiency is higher.
Example two
Corresponding to the method embodiment, another embodiment of the present application discloses a human face live detection device, as shown in fig. 4, the device includes:
a face image obtaining module 410, configured to obtain a first face image and a second face image that are synchronously collected by a first image collecting device and a second image collecting device for a target face;
a face positioning module 420, configured to perform face positioning on the first face image and the second face image respectively to obtain corresponding face positioning results;
a face image clipping module 430, configured to clip a first to-be-detected face image from the first face image and clip a second to-be-detected face image from the second face image according to the face positioning results in the first face image and the second face image, respectively;
the image classification module 440 is configured to input the cut first to-be-detected face image and the cut second to-be-detected face image to a pre-trained living body detection model in parallel, and perform classification mapping on the target face through the living body detection model according to a plane feature and a depth feature in the first to-be-detected face image and the second to-be-detected face image; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and a face living body detection result determining module 450, configured to determine whether the target face is a living body face according to the classification mapping result.
In some embodiments of the present application, the living body detection model is obtained by cutting a preset multitask model, and the multitask model includes:
the system comprises a first task network, a second task network and a third task network, wherein the first task network is composed of a first convolution network and a first fully-connected network and is used for learning face key point features in images input to the first convolution network;
the second task network is composed of a second convolutional network and a second fully-connected network and is used for learning the human face key point characteristics in the image input to the second convolutional network;
a third task network composed of the first convolutional network, the second convolutional network, a residual network, and a depth regression network, the third task network being configured to learn depth features in images input to the first convolutional network and the second convolutional network; and the number of the first and second groups,
a fourth task network composed of the first convolutional network, the second convolutional network, the residual network, and a classification network, the fourth task network for learning living body and non-living body information in the image input to the first convolutional network and the second convolutional network;
obtaining network parameters of the living body detection model consisting of the first convolution network, the second convolution network, the residual error network and the classification network by training the multitask model;
the first convolution network and the second convolution network are arranged in parallel, and the residual error network is connected with the outputs of the first convolution network and the second convolution network respectively.
In some embodiments of the present application, the sample data for training each training sample of the multitask model comprises: a first sample image and a second sample image, the sample label of each of the training samples comprising: the first sample image and the second sample image each correspond to: the real values of the key points of the human face, the real values of the depth values and the real values of the living body classes of the human face;
the multitask model is trained by the following method:
for each training sample in the sample set, performing the following encoding mapping operations:
inputting a first sample image comprised by the training sample to the first convolutional network of the multitask model, while inputting a second sample image comprised by the training sample to the second convolutional network of the multitask model;
performing operation processing on the first sample image through the first task network to obtain a human face key point prediction value of the first sample image in the training sample; performing operation processing on the second sample image through the second task network to obtain a human face key point prediction value of the second sample image in the training sample;
performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample;
performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a human face living body category predicted value of the training sample;
determining prediction loss values of the first task network, the second task network, the third task network and the fourth task network according to prediction values obtained by executing the coding mapping operation;
carrying out weighted summation on the predicted loss values of the first task network, the second task network, the third task network and the fourth task network to determine a model predicted total loss value of the multitask model;
optimizing the network parameters of the multitask model, and skipping to executing the coding mapping operation until the model prediction total loss value converges to meet the preset condition.
In some embodiments of the present application, determining the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to the predicted values obtained by performing the encoding mapping operation includes:
determining a prediction loss value of the first task network according to a difference value between the human face key point prediction value and a human face key point true value of a first sample image in all the training samples in the sample set;
determining a prediction loss value of the second task network according to a difference value between the predicted value of the face key point and the true value of the face key point of a second sample image in all the training samples in the sample set;
determining a prediction loss value of the third task network according to the difference value between the depth value prediction value and the depth value true value of all the training samples in the sample set;
and determining a prediction loss value of the fourth task network according to the difference value between the human face living body type prediction value and the human face living body type true value of all the training samples in the sample set.
In some embodiments of the present application, the step of performing operation processing on the first sample image through the first task network to obtain a face key point prediction value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain a face key point prediction value of the second sample image in the training sample includes:
performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector; then, coding and mapping the first vector through the first fully-connected network to obtain a face key point prediction value corresponding to the first sample image; and
performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector; and then, coding and mapping the second vector through the second fully-connected network to obtain a human face key point prediction value corresponding to the second sample image.
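A minimal sketch of one key point branch (the first convolution network plus the first fully-connected network; the second task network has the same structure) is given below. The channel widths, the pooled feature size and the number of key points are illustrative assumptions.

```python
# Sketch of one key point branch; layer sizes and key point count are assumptions.
import torch
from torch import nn

class KeypointBranch(nn.Module):
    def __init__(self, num_keypoints=68):
        super().__init__()
        # "Convolution network": convolves the sample image into a feature map (the "vector").
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # "Fully-connected network": encodes and maps the vector to the key points.
        self.fc = nn.Linear(128 * 4 * 4, num_keypoints * 2)

    def forward(self, image):
        feature = self.conv(image)                      # first (or second) vector
        keypoints = self.fc(torch.flatten(feature, 1))  # (x, y) prediction per key point
        return feature, keypoints
```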
In some embodiments of the present application, the depth value true value is determined according to the face key points in the first sample image and the second sample image in the sample data of the training sample, and the step of performing operation processing on the first sample image and the second sample image through the third task network to obtain the depth value predicted value of the training sample includes:
performing convolution processing on the first vector and the second vector through the residual error network to obtain a third vector;
and coding and mapping the third vector through the depth regression network to obtain a depth value predicted value of the face key point corresponding to the training sample.
In some embodiments of the application, the step of obtaining the human face living body category prediction value of the training sample by performing operation processing on the first sample image and the second sample image through the fourth task network includes:
performing convolution processing on the first vector and the second vector through the residual error network to obtain a third vector;
and coding and mapping the third vector through the classification network to obtain a human face living body category predicted value corresponding to the training sample.
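The following sketch illustrates the shared part of the third and fourth task networks: the residual network fuses the first vector and the second vector into a third vector, the depth regression network maps it to depth values at the face key points, and the classification network maps it to a living body class. The depth splicing used by the classification branch is sketched further below, after the architecture summary; all layer sizes here are assumptions.

```python
# Sketch of the fused residual trunk with depth and liveness heads; sizes are assumptions.
import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class FusionTrunk(nn.Module):
    def __init__(self, branch_channels=128, spatial=4, num_keypoints=68, num_classes=2):
        super().__init__()
        fused = 2 * branch_channels
        self.residual = nn.Sequential(ResidualBlock(fused), ResidualBlock(fused))  # residual network
        flat_dim = fused * spatial * spatial
        self.depth_head = nn.Linear(flat_dim, num_keypoints)  # depth regression network
        self.cls_head = nn.Sequential(                         # classification network
            nn.Linear(flat_dim, 128), nn.ReLU(inplace=True), nn.Linear(128, num_classes))

    def forward(self, feature1, feature2):
        third = self.residual(torch.cat([feature1, feature2], dim=1))  # third vector
        flat = torch.flatten(third, 1)
        depth_pred = self.depth_head(flat)  # depth value predicted at the key points
        cls_logits = self.cls_head(flat)    # living body class prediction
        return depth_pred, cls_logits
```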
The face living body detection device disclosed in the embodiment of the present application is used for implementing the face living body detection method described in the first embodiment of the present application, and specific implementation manners of each module of the device are not described again, and reference may be made to specific implementation manners of corresponding steps in the method embodiments.
The face living body detection device disclosed by the embodiment of the application acquires a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face; respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results; then, respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image; inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample; and determining whether the target face is a living face according to the classification mapping result, which is beneficial to improving the speed of face living body detection.
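A minimal sketch of this prediction pipeline is given below. The face detector, the input size, the decision threshold and the trained liveness_model are placeholders assumed only to illustrate the flow, not components specified by this disclosure.

```python
# Sketch of the binocular liveness prediction flow; detect_face and liveness_model are placeholders.
import cv2
import numpy as np
import torch

def is_live_face(frame1, frame2, liveness_model, detect_face, size=112, threshold=0.5):
    # Face positioning on each synchronously captured image.
    box1 = detect_face(frame1)
    box2 = detect_face(frame2)
    if box1 is None or box2 is None:
        return False

    # Crop the face regions to be detected and resize them to the model input.
    def crop(frame, box):
        x, y, w, h = box
        face = frame[y:y + h, x:x + w]
        face = cv2.resize(face, (size, size)).astype(np.float32) / 255.0
        return torch.from_numpy(face).permute(2, 0, 1).unsqueeze(0)  # 1 x C x H x W

    face1, face2 = crop(frame1, box1), crop(frame2, box2)

    # Feed the two cropped face images into the liveness detection model in parallel.
    with torch.no_grad():
        logits = liveness_model(face1, face2)
        live_prob = torch.softmax(logits, dim=1)[0, 1].item()

    # Classification mapping result -> living / non-living decision.
    return live_prob >= threshold
```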
In the training process of the living body detection model, the face living body detection device disclosed in the embodiment of the application further learns living body and non-living body face features for the two face images acquired by the binocular image acquisition device, under the constraint of the face key points and the depth information learning results; therefore, living body detection of the target face can be performed directly from the plane information and the depth information of the image pair acquired by the binocular image acquisition device, without generating a three-dimensional space point cloud, so that the calculation complexity is low, the operation speed is high, and the face living body detection efficiency is high.
In the model training process, the depth information of the images is taken into account, so that planar non-living faces such as photos and videos can be accurately classified in the prediction stage, and attacks using planar images can be rapidly detected. Because the face key point information is fully considered by the respective networks (for example, in the training process of the first task network and the second task network in fig. 2, and the training process of the first convolution network and the second convolution network), in the prediction stage the living body detection model can also detect attack faces such as a photo bent so that the nose region protrudes, a three-dimensional head model, a simulated mask, or a photo mask with the nose region cut out worn by a real person, which further improves the accuracy of face living body detection.
Specifically, a first sample image and a second sample image (i.e. the images captured by the two cameras of the binocular image acquisition device) are respectively input into the multitask model shown in fig. 2. First, the first convolution network of the first task network and the second convolution network of the second task network respectively learn the face features of the first sample image and the second sample image; a fully-connected layer is added behind each convolution network to regress the features, and face key point constraints are imposed, so that each convolutional branch learns the face features in its input image. Then, the features extracted by the two convolutional branches are combined, and the depth features and the living body features are learned through several residual modules of the residual network. After that, two network branches are led out: one branch is connected to a fully-connected layer and regresses the depth values of the feature points from the fused features; the other branch adds a convolution network, stretches the convolution features into a one-dimensional vector, splices the regressed depth values onto this vector, and obtains the living body classification result through two fully-connected layers. The living body detection model trained with this structure and method has face key point constraints and depth constraints corresponding to the key points, can learn face plane information (such as two-dimensional texture information) as well as depth information, and is therefore favorable for improving the accuracy and reliability of living body detection.
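The classification branch detail described above (flattening the convolution features, splicing the regressed depth values onto the one-dimensional vector, and classifying with two fully-connected layers) can be sketched as follows; the layer sizes and the number of key points are assumptions.

```python
# Sketch of the classification branch with depth splicing; sizes are assumptions.
import torch
from torch import nn

class LivenessHead(nn.Module):
    def __init__(self, fused_channels=256, spatial=4, num_keypoints=68, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(fused_channels, 64, 3, padding=1)  # extra convolution on the fused features
        self.fc = nn.Sequential(
            nn.Linear(64 * spatial * spatial + num_keypoints, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),                          # two fully-connected layers
        )

    def forward(self, fused_feature, keypoint_depths):
        flat = torch.flatten(self.conv(fused_feature), 1)  # stretch into a one-dimensional vector
        joint = torch.cat([flat, keypoint_depths], dim=1)  # splice the regressed depth values onto it
        return self.fc(joint)                              # living body classification logits
```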
On the other hand, the living body detection model in the embodiment of the application is obtained by pruning a multitask model: the four branch networks are trained jointly in the training stage by combining plane information and depth information, while only one branch network is used in the prediction stage, so the network structure used in the prediction stage is simple and the operation efficiency is higher.
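A sketch of the prediction-time path after such pruning is given below: only the two branch convolution networks, the residual network and the classification path are executed, while the key point fully-connected heads used for training supervision are dropped. The module arguments are hypothetical placeholders, not names from this disclosure.

```python
# Sketch of the pruned prediction path; all module arguments are hypothetical.
import torch

@torch.no_grad()
def liveness_forward(conv1, conv2, residual, cls_head, image1, image2):
    # conv1 / conv2: the retained first and second convolution networks.
    feat1 = conv1(image1)
    feat2 = conv2(image2)
    # The residual network fuses the two feature maps.
    fused = residual(torch.cat([feat1, feat2], dim=1))
    # cls_head stands for the entire classification path built on the fused
    # features (including any depth splicing); it returns class logits.
    return cls_head(fused)
```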
Correspondingly, the application also discloses an electronic device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the face living body detection method according to the first embodiment of the application when executing the computer program. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer or the like.
The present application also discloses a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the face living body detection method according to the first embodiment of the present application are implemented.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The face living body detection method and device provided by the application are described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help understand the method and the core idea of the application; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Claims (10)

1. A face living body detection method is characterized by comprising the following steps:
acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results;
respectively cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image;
inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and determining whether the target face is a living face according to the classification mapping result.
2. The method of claim 1, wherein the in-vivo detection model is tailored from a pre-defined multitask model, the multitask model comprising:
the system comprises a first task network, a second task network and a third task network, wherein the first task network is composed of a first convolution network and a first fully-connected network and is used for learning face key point features in images input to the first convolution network;
the second task network is composed of a second convolutional network and a second fully-connected network and is used for learning the human face key point characteristics in the image input to the second convolutional network;
a third task network composed of the first convolutional network, the second convolutional network, a residual network, and a depth regression network, the third task network being configured to learn depth features in images input to the first convolutional network and the second convolutional network; and the number of the first and second groups,
a fourth task network composed of the first convolutional network, the second convolutional network, the residual network, and a classification network, the fourth task network for learning living body and non-living body information in the image input to the first convolutional network and the second convolutional network;
obtaining network parameters of the living body detection model consisting of the first convolution network, the second convolution network, the residual error network and the classification network by training the multitask model;
the first convolution network and the second convolution network are arranged in parallel, and the residual error network is connected with the outputs of the first convolution network and the second convolution network respectively.
3. The method of claim 2, wherein the sample data of each training sample used for training the multitask model comprises a first sample image and a second sample image, and the sample label of each training sample comprises the face key point true values, the depth value true values and the face living body class true values corresponding to each of the first sample image and the second sample image;
the multitask model is trained by the following method:
for each training sample in the sample set, performing the following encoding mapping operations:
inputting a first sample image comprised by the training sample to the first convolutional network of the multitask model, while inputting a second sample image comprised by the training sample to the second convolutional network of the multitask model;
performing operation processing on the first sample image through the first task network to obtain a human face key point prediction value of the first sample image in the training sample; performing operation processing on the second sample image through the second task network to obtain a human face key point prediction value of the second sample image in the training sample;
performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample;
performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a human face living body category predicted value of the training sample;
determining prediction loss values of the first task network, the second task network, the third task network and the fourth task network according to prediction values obtained by executing the coding mapping operation;
carrying out weighted summation on the predicted loss values of the first task network, the second task network, the third task network and the fourth task network to determine a model predicted total loss value of the multitask model;
optimizing the network parameters of the multitask model, and returning to perform the coding mapping operation until the model prediction total loss value converges to meet a preset condition.
4. The method of claim 3, wherein determining predicted loss values for the first task network, the second task network, the third task network, and the fourth task network based on predicted values from performing the code mapping operation comprises:
determining a prediction loss value of the first task network according to a difference value between the human face key point prediction value and a human face key point true value of a first sample image in all the training samples in the sample set;
determining a prediction loss value of the second task network according to a difference value between the predicted value of the face key point and the true value of the face key point of a second sample image in all the training samples in the sample set;
determining a prediction loss value of the third task network according to the difference value between the depth value prediction value and the depth value true value of all the training samples in the sample set;
and determining a prediction loss value of the fourth task network according to the difference value between the human face living body type prediction value and the human face living body type true value of all the training samples in the sample set.
5. The method according to claim 3, wherein the step of performing operation processing on the first sample image through the first task network to obtain a face key point prediction value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain a face key point prediction value of the second sample image in the training sample comprises:
performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector; then, coding and mapping the first vector through the first fully-connected network to obtain a face key point predicted value corresponding to the first sample image; and
performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector; and then, coding and mapping the second vector through the second fully-connected network to obtain a human face key point prediction value corresponding to the second sample image.
6. The method according to claim 5, wherein the depth value true value is determined according to the face key points in the first sample image and the second sample image in the sample data of the training sample, and the step of performing operation processing on the first sample image and the second sample image through the third task network to obtain the depth value predicted value of the training sample comprises:
performing convolution processing on the first vector and the second vector through the residual error network to obtain a third vector;
and coding and mapping the third vector through the depth regression network to obtain a depth value predicted value of the face key point corresponding to the training sample.
7. The method according to claim 5, wherein the step of performing operation processing on the first sample image and the second sample image through the fourth task network to obtain the face living body class prediction value of the training sample comprises:
performing convolution processing on the first vector and the second vector through the residual error network to obtain a third vector;
and coding and mapping the third vector through the classification network to obtain a human face living body category predicted value corresponding to the training sample.
8. A face living body detection device, characterized in that the device comprises:
the face image acquisition module is used for acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
the face positioning module is used for respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results;
the face image cutting module is used for cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively;
the image classification module is used for inputting the cut first face image to be detected and the cut second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face through the living body detection model according to the plane features and the depth features in the first face image to be detected and the second face image to be detected; the living body detection model is a classification model trained on face key point constraint and depth feature constraint of a training sample;
and the face living body detection result determining module is used for determining whether the target face is a living body face according to the classification mapping result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the face living body detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the face living body detection method according to any one of claims 1 to 7.
CN202011063444.1A 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium Active CN112200057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011063444.1A CN112200057B (en) 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112200057A true CN112200057A (en) 2021-01-08
CN112200057B CN112200057B (en) 2023-10-31

Family

ID=74012933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011063444.1A Active CN112200057B (en) 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112200057B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764091A (en) * 2018-05-18 2018-11-06 北京市商汤科技开发有限公司 Biopsy method and device, electronic equipment and storage medium
WO2020125623A1 (en) * 2018-12-20 2020-06-25 上海瑾盛通信科技有限公司 Method and device for live body detection, storage medium, and electronic device
CN111444744A (en) * 2018-12-29 2020-07-24 北京市商汤科技开发有限公司 Living body detection method, living body detection device, and storage medium
CN110942032A (en) * 2019-11-27 2020-03-31 深圳市商汤科技有限公司 Living body detection method and device, and storage medium
CN111046845A (en) * 2019-12-25 2020-04-21 上海骏聿数码科技有限公司 Living body detection method, device and system
CN111680588A (en) * 2020-05-26 2020-09-18 广州多益网络股份有限公司 Human face gate living body detection method based on visible light and infrared light

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MANPREET BAGGA et al.: "Spoofing detection in face recognition: A review", 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom) *
LI Rui et al.: "Face recognition based on three-dimensional face depth images reconstructed from two-dimensional texture", Modern Computer (Professional Edition), no. 10 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052034A (en) * 2021-03-15 2021-06-29 上海商汤智能科技有限公司 Living body detection method based on binocular camera and related device
CN113052035A (en) * 2021-03-15 2021-06-29 上海商汤智能科技有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium
CN112926489A (en) * 2021-03-17 2021-06-08 北京市商汤科技开发有限公司 Living body detection method, living body detection device, living body detection equipment, living body detection medium, living body detection system and transportation means
CN113128428A (en) * 2021-04-24 2021-07-16 新疆爱华盈通信息技术有限公司 Depth map prediction-based in vivo detection method and related equipment
CN113128429A (en) * 2021-04-24 2021-07-16 新疆爱华盈通信息技术有限公司 Stereo vision based living body detection method and related equipment
WO2023098128A1 (en) * 2021-12-01 2023-06-08 马上消费金融股份有限公司 Living body detection method and apparatus, and training method and apparatus for living body detection system
CN116844198A (en) * 2023-05-24 2023-10-03 北京优创新港科技股份有限公司 Method and system for detecting face attack
CN116844198B (en) * 2023-05-24 2024-03-19 北京优创新港科技股份有限公司 Method and system for detecting face attack

Also Published As

Publication number Publication date
CN112200057B (en) 2023-10-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant