CN112464864A - Face living body detection method based on tree-shaped neural network structure - Google Patents

Face living body detection method based on tree-shaped neural network structure

Info

Publication number
CN112464864A
Authority
CN
China
Prior art keywords
face
living body
picture
neural network
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011439243.7A
Other languages
Chinese (zh)
Inventor
沈耀
薛迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202011439243.7A
Publication of CN112464864A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive

Abstract

A face liveness (living-body) detection method based on a tree-structured neural network: training samples are collected and labeled and then used to train a face liveness detection model built on the tree-structured network, and a preprocessed picture to be detected is input into the trained model to perform liveness detection. Under varying lighting conditions, varying camera conditions, and attacks of many non-live types against the picture to be detected, the method maintains detection accuracy, markedly improves the usability, reliability, and generality of face liveness detection, and additionally identifies the type of non-live attack.

Description

Face living body detection method based on tree-shaped neural network structure
Technical Field
The invention relates to a technology in the field of image detection, in particular to a face liveness (living-body) detection method based on a tree-structured neural network.
Background
Existing face recognition technology is easily spoofed with photos, videos, and even lifelike molds: a malicious actor disguises himself during the recognition process and passes verification with a picture, achieving illegal entry. Face liveness detection arose to counter this, and the main techniques are as follows. (1) Single-picture liveness detection judges whether the target is live from artifacts of the portrait in the picture (moire patterns, imaging distortion, and the like); it effectively blocks cheating attacks such as re-photographing a screen, and may use single or multiple decision rules. (2) Video-stream liveness detection determines the motion at each pixel position from the temporal variation and correlation of pixel intensities in the image sequence, extracts per-pixel motion information from the sequence, and analyzes it statistically with difference-of-Gaussian filters, LBP features, and support vector machines. Because the optical-flow field is sensitive to object motion, eye movement and blinking can likewise be detected with optical flow; this mode enables blind detection without user cooperation. (3) Near-infrared binocular-camera liveness detection uses the near-infrared imaging principle to judge liveness at night or without natural light; its imaging characteristics (screens do not image in near infrared, different materials have different reflectance, and so on) enable highly robust liveness judgments. (4) 3D-camera liveness detection, based on the 3D structured-light imaging principle, builds a depth map from the light reflected by the face surface and judges whether the target is live; it effectively defends against photo, video, screen, and mold attacks, but requires extra equipment and is costly. (5) Action-coordinated liveness detection issues a prescribed action requirement that the user must perform, and judges liveness by tracking the user's eye, mouth, and head-pose states in real time; this is currently the most widely used technique.
However, existing face liveness detection techniques have the following drawbacks: silent detection that uses no extra equipment and requires no action coordination has low detection accuracy; attacks aimed at a specific detection mode cannot be countered; detection accuracy is unstable under varying lighting conditions; and unknown attacks cannot be handled, i.e. the methods generalize poorly.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a face liveness detection method based on a tree-structured neural network. When the picture to be detected is subject to varying lighting conditions, varying camera conditions, and attacks of many non-live types, the method maintains the detection accuracy of liveness detection, markedly improves the usability, reliability, and generality of face liveness detection, and also identifies the type of non-live attack.
The invention is realized by the following technical scheme:
The invention relates to a face liveness detection method based on a tree-structured neural network: training samples are collected and labeled and then used to train a face liveness detection model built on the tree-structured network, and a preprocessed picture to be detected is input into the trained model to perform liveness detection.
Training sample collection and labeling means: a public data set is used that contains, besides live sample videos and non-live sample videos of paper-print and screen attacks shot under indoor lighting for training, non-live sample videos of further attack types such as makeup and masks collected from video websites for training. Face detection is performed on every frame of the non-live sample videos with the MTCNN face detection algorithm, and masks are produced to serve as training supervision data.
The live sample videos used for training are shot indoors under natural light.
The non-live sample videos consist of photographed-and-printed face photos as the print attack and face photos displayed on a screen as the screen attack.
The non-live sample videos further include videos of several additional attack types collected from video websites: partial masks, silicone masks, paper masks, dummy models, imitation makeup, and paper-printed eyeglass cutouts. Adding them increases the number of non-live sample types and enriches the training data, improving the model. From the training sample videos, 10 frames are randomly extracted per video, i.e. 1626 × 10 images in total, as training data; live samples are labeled 0 and non-live samples 1, which also provides data augmentation.
The face detection (mask making) means: the real-face region is labeled 0 and the prosthetic region 1; specifically, the mask of a real face is 0 over the whole face region, the mask of a printed photo is 1 over the whole face region, and the mask of an eye-region cutout is 1 over the eye region and 0 over the other face regions.
The training means: the live and non-live sample videos are shuffled, video frames and their masks are drawn at random and fed into the tree-structured network for training, yielding a trained face liveness detection model based on the tree-structured neural network.
The face liveness detection model based on the tree-structured neural network is a neural network with a 4-level tree structure, specifically comprising 7 feature-extraction modules, 8 supervised training modules, and 7 routing modules that perform unsupervised clustering of attack samples; each non-leaf node of the tree is a feature-extraction module followed by a routing module, and each leaf node is a supervised training module.
The network input batch size is preferably set to 32.
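For concreteness, a minimal PyTorch sketch of this 4-level tree (7 internal nodes, 8 leaves) follows. The module bodies here are stand-ins for the feature-extraction, routing, and leaf modules detailed below, the class names (TreeNode, Router) are illustrative rather than from the patent, and hard per-sample routing is shown only for single-sample inference:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    """Routing module sketch: phi(x) = v^T x + tau as one linear layer
    over a channel-reduced, pooled, flattened feature map."""
    def __init__(self, ch):
        super().__init__()
        self.reduce = nn.Conv2d(ch, 20, kernel_size=1)  # 1x1 channel reduction
        self.lin = nn.Linear(20 * 16 * 16, 1)           # weight ~ v, bias ~ tau

    def forward(self, f):
        z = F.adaptive_avg_pool2d(self.reduce(f), 16)   # stand-in for bilinear scaling
        return self.lin(z.flatten(1))                   # (N, 1) routing value

class TreeNode(nn.Module):
    """Non-leaf node: feature extractor + router; leaf node: supervised head."""
    def __init__(self, depth, max_depth, in_ch=40):
        super().__init__()
        self.is_leaf = depth == max_depth
        if self.is_leaf:
            # stand-in for the two-branch leaf head (liveness + mask) sketched later
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, 2))
        else:
            self.feat = nn.Sequential(                  # stand-in feature extractor
                nn.Conv2d(in_ch, 40, 3, padding=1), nn.BatchNorm2d(40),
                nn.ReLU(inplace=True), nn.MaxPool2d(2))
            self.router = Router(40)
            self.left = TreeNode(depth + 1, max_depth)
            self.right = TreeNode(depth + 1, max_depth)

    def forward(self, x):
        if self.is_leaf:
            return self.head(x)
        f = self.feat(x)
        # hard routing of a single sample; training instead uses the
        # unsupervised routing loss described later
        child = self.right if self.router(f).item() >= 0 else self.left
        return child(f)

tree = TreeNode(depth=0, max_depth=3, in_ch=3)   # 7 internal nodes, 8 leaves
out = tree(torch.randn(1, 3, 256, 256))          # logits for one sample
```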
The feature-extraction module comprises three convolutional layers with a residual structure and one max-pooling layer; all convolution kernels are 3 × 3, ReLU is used as the activation function, and a BN layer follows each of the first, second, and third layers, yielding feature maps of sizes 128 × 128 × 40, 64 × 64 × 40, and 32 × 32 × 40 at the successive tree levels.
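A PyTorch sketch of this feature-extraction module, assuming the residual shortcut spans the last two convolutions (the text does not fix its exact placement):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three 3x3 conv layers with BN + ReLU, a residual connection, then 2x2
    max pooling, so the spatial size halves at each tree level
    (256 -> 128 -> 64 -> 32, with 40 channels)."""
    def __init__(self, in_ch, out_ch=40):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        h = self.conv1(x)
        h = self.conv3(self.conv2(h)) + h   # residual shortcut (assumed placement)
        return self.pool(h)

feat = FeatureExtractor(in_ch=3)                  # first tree level
print(feat(torch.randn(32, 3, 256, 256)).shape)   # torch.Size([32, 40, 128, 128])
```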
The unsupervised clustering means: the input vector x is passed through the routing function

$$\varphi(x) = v^{T}x + \tau$$

and the sample set proceeding to the left child node

$$S_{\mathrm{left}} = \{ I_k \in S \mid \varphi(x_k) < 0 \}$$

and the sample set proceeding to the right child node

$$S_{\mathrm{right}} = \{ I_k \in S \mid \varphi(x_k) \ge 0 \}$$

are computed, where S is the sample set and I_k, k = 1, 2, 3, ..., K are the data samples.

The loss function used to train the unsupervised-clustering routing module is

$$L_{\mathrm{route}} = -\left( \frac{1}{N_l} \sum_{I_k \in S_{\mathrm{left}}} \varphi(x_k) - \frac{1}{N_r} \sum_{I_k \in S_{\mathrm{right}}} \varphi(x_k) \right)^{2}$$

where N, N_l, N_r are the numbers of samples in the sample sets S, S_left, S_right respectively.
The routing module reduces the dimensionality of the input feature map with a 1 × 1 convolution, scales the feature map to 16 × 16 × 20, reshapes it into a vector of length 16 × 16 × 20, and feeds the vector into the routing function to compute the routing target.
The supervised training module takes a 32 × 32 × 40 feature map as input; one branch passes through a 1 × 1 convolution layer to generate a 32 × 32 × 1 non-live mask map, while the other branch passes through two convolution layers (3 × 3 kernels, 40 channels each) and then two fully connected layers of dimensions 500 and 2 to produce the final live/non-live result, with live = 0 and non-live = 1.
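A PyTorch sketch of this two-branch supervised module; the sigmoid on the mask branch and the class name LeafHead are assumptions:

```python
import torch
import torch.nn as nn

class LeafHead(nn.Module):
    """Leaf supervised module: from a 32x32x40 feature map, one branch
    produces the 32x32x1 spoof mask via a 1x1 conv; the other stacks two
    3x3 convs (40 channels) and two FC layers (500, 2) for the decision."""
    def __init__(self, ch=40):
        super().__init__()
        self.mask_branch = nn.Conv2d(ch, 1, kernel_size=1)
        self.cls_conv = nn.Sequential(
            nn.Conv2d(ch, 40, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(40, 40, 3, padding=1), nn.ReLU(inplace=True))
        self.cls_fc = nn.Sequential(
            nn.Flatten(), nn.Linear(40 * 32 * 32, 500), nn.ReLU(inplace=True),
            nn.Linear(500, 2))

    def forward(self, x):
        mask = torch.sigmoid(self.mask_branch(x))        # (N, 1, 32, 32)
        logits = self.cls_fc(self.cls_conv(x))           # (N, 2): [live, spoof]
        prob_spoof = torch.softmax(logits, dim=1)[:, 1]  # non-live probability
        return prob_spoof, mask

head = LeafHead()
p, m = head(torch.randn(4, 40, 32, 32))
print(p.shape, m.shape)   # torch.Size([4]) torch.Size([4, 1, 32, 32])
```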
The preprocessing means: face detection is performed on the input picture to be detected to obtain the face position, and the picture is then cropped and scaled accordingly. Specifically, the face position coordinates in the input picture are detected with the MTCNN method; when the picture contains several faces, the largest face is selected for processing, and when it contains no face, the detection system exits and returns a no-face exception. The picture is then cropped to the face region according to the detected face position coordinates and scaled to a fixed size of 256 × 256.
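A sketch of this preprocessing step, using the facenet-pytorch MTCNN implementation as one possible stand-in for the MTCNN detector (the helper name preprocess is illustrative):

```python
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True)   # detect every face, not just the best one

def preprocess(path):
    img = Image.open(path).convert("RGB")
    boxes, _ = mtcnn.detect(img)           # boxes: (K, 4) as [x1, y1, x2, y2]
    if boxes is None:
        raise ValueError("no face in picture")   # the "no-face" exception
    # keep the largest face when several are present
    x1, y1, x2, y2 = max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    face = img.crop(tuple(map(int, (x1, y1, x2, y2))))
    return face.resize((256, 256), Image.BILINEAR)   # bilinear interpolation
```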
The liveness detection means: the preprocessed picture to be detected is input into the trained face liveness detection model, which outputs, as the detection result, a 0/1 live/non-live value and a 32 × 32 × 1 mask of the non-live region in the picture; the liveness judgment is made on this result and the non-live attack type is judged from the mask, giving the final liveness detection result.
The invention also relates to a system implementing the method, comprising a data preprocessing unit and a face liveness detection unit. The data preprocessing unit receives and preprocesses the picture to be detected, reduces it to a fixed-size picture containing only the face region, and inputs it into the face liveness detection unit; the face liveness detection unit, connected to the data preprocessing unit, receives the preprocessed picture data, performs liveness detection on the face in the picture, and generates the probability that the face is non-live together with a mask of the non-live region in the picture.
Technical effects
Compared with the prior art, the method needs no user action coordination; adding the non-live region mask as a supervision signal and clustering the non-live attacks sharpens the semantics of the extracted features, frees the model's performance from dependence on lighting and similar conditions, and further improves the reliability of its results. The network structure can also be changed by adding tree levels or nodes, enabling finer-grained clustering of non-live attack types and improving resistance to unknown attack types.
Drawings
FIG. 1 is a schematic diagram of a training process of the present invention;
FIG. 2 is a schematic view of the detection process of the present invention;
FIG. 3 is a schematic diagram of a tree neural network structure according to the present invention;
FIG. 4 is a block diagram of a tree neural network according to the present invention;
in the figure: a) is a structure diagram of a feature extraction module, b) is a structure diagram of a routing module, and c) is a structure diagram of a leaf supervised training module;
FIG. 5 is a schematic diagram of a partial image of test sample data and its non-live mask according to an embodiment.
Detailed Description
As shown in fig. 1, this embodiment relates to a face liveness detection method based on a tree-structured neural network, comprising the following steps:
Step S101, acquiring training sample videos: using public data sets, live sample videos for training and non-live sample videos of paper-print and screen attacks are captured in an indoor lighting environment, and further non-live sample videos for training are collected from video websites, as shown in fig. 5, covering partial masks, silicone masks, paper masks, dummy models, imitation makeup, and paper-printed eyeglass cutouts, for 1626 videos in total in the data set.
Step S102, performing face detection on the video files frame by frame, cropping the image according to the face position, and scaling, specifically:
(1) for each video frame, detect the face with MTCNN or a comparable method, obtaining the coordinates (x1, y1) and (x2, y2) of the upper-left and lower-right corners of the face bounding box;
(2) crop the picture to the face region according to the box coordinates (x1, y1) and (x2, y2), and scale it with a bilinear interpolation algorithm to a fixed size of 256 × 256.
Step S103, labeling whether the face in each video frame is live according to the non-live attack type, and producing the non-live region mask: live samples are labeled 0 and non-live samples 1; face detection is run on every frame of the sample videos and a mask is produced in which the real-face region is 0 and the prosthetic region is 1 (e.g. the mask of a real face is 0, the mask of a printed photo is 1 over the face, and the mask of an eye cutout is 1 over the eyes and 0 over the other face regions).
Step S104, training the tree neural network model shown in fig. 4. The tree network has a 4-level structure comprising 7 feature-extraction modules, 8 leaf supervised training modules that output the liveness result and the non-live mask, and 7 routing modules.
As shown in fig. 4a), the feature-extraction module of the tree network consists of three convolutional layers with a residual structure and one max-pooling layer; all convolution kernels are 3 × 3, ReLU is the activation function, and BN layers follow the first, second, and third layers, yielding feature maps of sizes 128 × 128 × 40, 64 × 64 × 40, and 32 × 32 × 40 that capture hierarchical image features.
The routing module computes the forward direction of the features through the network. As shown in fig. 4b), the feature map entering the routing module is first reduced from 40 to 20 channels by a convolution layer with 1 × 1 kernels, then scaled by bilinear interpolation to a fixed spatial size of 16 × 16; the resulting 16 × 16 × 20 feature map is reshaped into a vector of length 16 × 16 × 20 and fed into the routing function, which computes the forward direction.

The routing function of the routing module is

$$\varphi(x) = v^{T}x + \tau$$

where x is the vector produced by the feature-map processing above, and v and τ are respectively a parameter vector and a bias that are learned without supervision. Applying the function to the input vector x yields

$$S_{\mathrm{left}} = \{ I_k \in S \mid \varphi(x_k) < 0 \} \quad \text{(proceed to the left child node)}$$

and

$$S_{\mathrm{right}} = \{ I_k \in S \mid \varphi(x_k) \ge 0 \} \quad \text{(proceed to the right child node)}$$

where S is the sample set and I_k, k = 1, 2, 3, ..., K are the data samples; the samples are thereby clustered. The parameters v, τ are obtained by unsupervised training with the loss function

$$L_{\mathrm{route}} = -\left( \frac{1}{N_l} \sum_{I_k \in S_{\mathrm{left}}} \varphi(x_k) - \frac{1}{N_r} \sum_{I_k \in S_{\mathrm{right}}} \varphi(x_k) \right)^{2}$$

where N, N_l, N_r are the numbers of samples in the sample sets S, S_left, S_right. The loss drives the mean routing-function values of the samples sent to the left and right nodes as far apart as possible, achieving clustering; x_k is the vector obtained after processing sample I_k for the routing function, S is the set of non-live samples passing through this node, and S⁻ is the set of other samples, including live samples and non-live samples that do not pass through this node.
As shown in fig. 4c), the input of the leaf supervised training module is a 32 × 32 × 40 feature map, and the module has two branches: one branch reduces the feature map directly through a convolution layer with 1 × 1 kernels to obtain a 32 × 32 × 1 mask; the other branch passes successively through two convolution layers with 3 × 3 kernels and 40 channels, then two fully connected layers of dimensions 500 and 2, followed by a softmax layer that outputs the probability of the sample being non-live, with live = 0 and non-live = 1.
The non-live mask of the leaf supervised training module is trained with the loss function

$$L_{\mathrm{mask}} = \frac{1}{N} \sum_{I_k \in S} \left\| M_k - D_k \right\|$$

where M_k is the mask generated by the network, D_k is the input mask sample, N is the number of samples reaching this node, S is the set of samples reaching this node, and I_k is a sample in that set.
The liveness result of the leaf supervised training module, i.e. the 0/1 binary-classification branch, is trained with an improved focal loss:

$$L_{\mathrm{cls}} = -\frac{1}{N} \left[ \sum_{I_k \in S_1} \alpha \left( 1 - y'_k \right)^{\gamma} \log y'_k + \sum_{I_k \in S_0} (1 - \alpha) \, (y'_k)^{\gamma} \log\left( 1 - y'_k \right) \right]$$

where S_1 is the set of all non-live data samples reaching this node, S_0 the set of all live data samples reaching it, I_k a sample in the node's sample set, N the total number of samples reaching the node, y' the predicted non-live probability of a sample, and α, γ balance factors preferably set to α = 0.25 and γ = 2.
The total loss used to train the tree neural network model is

$$L = \alpha_1 L_{\mathrm{cls}} + \alpha_2 L_{\mathrm{mask}} + \alpha_3 L_{\mathrm{route}}$$

where α_1, α_2, α_3 are weighting parameters preferably set to 0.01, 1, and 2 respectively, and L_cls, L_mask, L_route are the binary-classification loss, the mask loss, and the routing unsupervised-clustering loss.
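Combining the three terms with the preferred weights, as a trivial helper:

```python
def total_loss(l_cls, l_mask, l_route, a1=0.01, a2=1.0, a3=2.0):
    """Weighted sum with the preferred weights alpha1=0.01, alpha2=1, alpha3=2."""
    return a1 * l_cls + a2 * l_mask + a3 * l_route
```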
Step S105, as shown in fig. 2, performing face detection on the picture to be detected: the MTCNN method is used to detect the face and obtain the coordinates (x1, y1) and (x2, y2) of the upper-left and lower-right corners of the face bounding box; if no face is detected in the picture, face liveness detection fails and a no-face error is returned; if the picture contains more than one face, only the largest face is kept for subsequent judgment.
Step S106, preprocessing the picture to be detected: the picture is cropped to the face region according to the face-box coordinates (x1, y1) and (x2, y2) obtained in step S105, and scaled with a bilinear interpolation algorithm to a fixed size of 256 × 256.
Step S107, inputting the picture preprocessed in step S106 into the trained tree neural network model, which infers the non-live probability and the non-live region mask.
Step S108, judging from the 0/1 liveness result whether the face in the picture is live: if live, face liveness detection succeeds; otherwise a non-live result is returned and the non-live attack type is inferred from the mask result.
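Steps S105 through S108 chained together in a short sketch, assuming the preprocess() helper and a tree model whose leaves return (non-live probability, mask) as in the earlier sketches; the 0.5 decision threshold is an assumption equivalent to the 0/1 argmax:

```python
import torchvision.transforms.functional as TF

def detect_liveness(path, tree, threshold=0.5):
    face = preprocess(path)                  # S105/S106: detect, crop, resize
    x = TF.to_tensor(face).unsqueeze(0)      # (1, 3, 256, 256)
    prob_spoof, mask = tree(x)               # S107: model inference
    if prob_spoof.item() < threshold:        # S108: liveness judgment
        return "live", None
    return "non-live", mask                  # mask hints at the attack type
```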
In the actual experimental environment, with the batch size set to 32, the learning rate to 0.001, the learning-rate momentum to 0.999, α = 0.25 and γ = 2 in the binary-classification loss, and α1 = 0.01, α2 = 1, α3 = 2 in the total loss, the above method achieves APCER = 3.62% and BPCER = 12.56% on the test set, where APCER is the proportion of all non-live samples misjudged as live and BPCER is the proportion of all live samples misjudged as non-live.
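APCER and BPCER as defined here can be computed in a few lines (a sketch; 0 = live, 1 = non-live):

```python
import numpy as np

def apcer_bpcer(pred, label):
    """APCER: fraction of non-live (attack) samples misclassified as live;
    BPCER: fraction of live (bona fide) samples misclassified as non-live."""
    pred, label = np.asarray(pred), np.asarray(label)
    apcer = np.mean(pred[label == 1] == 0)
    bpcer = np.mean(pred[label == 0] == 1)
    return apcer, bpcer

print(apcer_bpcer([0, 1, 0, 1], [1, 1, 0, 0]))  # (0.5, 0.5)
```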
Compared with SVM+LBP, a typical face liveness detection algorithm based on traditional methods (APCER 32.8 ± 29.8, BPCER 21.0 ± 2.9), and Auxiliary, a typical deep-learning face liveness detection algorithm (APCER 38.3 ± 37.4, BPCER 8.9 ± 2.0), this method attains an APCER of 3.62% and a BPCER of 12.56%; its non-live false-acceptance rate is lower than the prior art, and the overall accuracy of its detection results is high.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A face liveness detection method based on a tree-structured neural network, characterized in that training samples are collected and labeled for training a face liveness detection model based on the tree-structured neural network, and a preprocessed picture to be detected is then input into the trained model to perform liveness detection;
the face liveness detection model based on the tree-structured neural network is a neural network with a 4-level tree structure, specifically comprising 7 feature-extraction modules, 8 supervised training modules, and 7 routing modules that perform unsupervised clustering of attack samples, wherein each non-leaf node of the tree is a feature-extraction module followed by a routing module and each leaf node is a supervised training module;
the training sample collection and labeling means: using a public data set together with non-live sample videos collected from video websites for training, performing face detection on every frame of the non-live sample videos with the MTCNN face detection algorithm, and producing masks as training supervision data.
2. The method as claimed in claim 1, characterized in that the public data set comprises live sample videos for training and non-live sample videos of paper-print and screen attacks captured in an indoor lighting environment;
the live sample videos used for training are shot indoors under natural light;
the non-live sample videos consist of photographed-and-printed face photos as the print attack and face photos displayed on a screen as the screen attack.
3. The face liveness detection method based on the tree-structured neural network as claimed in claim 1 or 2, characterized in that the non-live sample videos further comprise videos of additional attack types collected from video websites: partial masks, silicone masks, paper masks, dummy models, imitation makeup, and paper-printed eyeglass cutouts; 10 frames are randomly extracted from each training sample video, i.e. 1626 × 10 images in total, as training data, with live samples labeled 0 and non-live samples labeled 1, which also provides data augmentation.
4. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the face detection (mask making) is: the real-face region is 0 and the prosthetic region is 1; specifically, the mask of a real face is 0 over the face region, the mask of a printed photo is 1 over the face region, and the mask of an eye cutout is 1 over the eye region and 0 over the other face regions.
5. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the training is: shuffling the live and non-live sample videos, randomly extracting video frames and their masks, and inputting them into the tree-structured network for training to obtain a trained face liveness detection model based on the tree-structured neural network.
6. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the feature-extraction module comprises three convolutional layers with a residual structure and one max-pooling layer, wherein all convolution kernels are 3 × 3, ReLU is used as the activation function, and BN layers follow the first, second, and third layers, producing feature maps of sizes 128 × 128 × 40, 64 × 64 × 40, and 32 × 32 × 40;
the routing module reduces the dimensionality of the input feature map with a 1 × 1 convolution, scales it to 16 × 16 × 20, reshapes it into a vector of length 16 × 16 × 20, and feeds the vector into the routing function to compute the routing target;
the supervised training module takes a 32 × 32 × 40 feature map as input; one branch passes through a 1 × 1 convolution layer to generate a 32 × 32 × 1 non-live mask map, and the other branch passes through two convolution layers (3 × 3 kernels, 40 channels each) followed by two fully connected layers of dimensions 500 and 2 to produce the final live/non-live result, with live = 0 and non-live = 1.
7. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the unsupervised clustering is: passing the input vector x through the routing function

$$\varphi(x) = v^{T}x + \tau$$

and computing the sample set proceeding to the left child node

$$S_{\mathrm{left}} = \{ I_k \in S \mid \varphi(x_k) < 0 \}$$

and the sample set proceeding to the right child node

$$S_{\mathrm{right}} = \{ I_k \in S \mid \varphi(x_k) \ge 0 \}$$

wherein S is the sample set and I_k, k = 1, 2, 3, ..., K are the data samples;

the loss function used to train the unsupervised-clustering routing module is

$$L_{\mathrm{route}} = -\left( \frac{1}{N_l} \sum_{I_k \in S_{\mathrm{left}}} \varphi(x_k) - \frac{1}{N_r} \sum_{I_k \in S_{\mathrm{right}}} \varphi(x_k) \right)^{2}$$

wherein N, N_l, N_r are the numbers of samples in the sample sets S, S_left, S_right respectively.
8. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the preprocessing is: performing face detection on the input picture to be detected to obtain the face position, then cropping and scaling the picture accordingly; specifically, the face position coordinates in the input picture are detected with the MTCNN method, the largest face is selected for processing when the picture contains several faces, and the detection system exits and returns a no-face exception when it contains none; the picture is cropped to the face region according to the detected face position coordinates and scaled to a fixed size of 256 × 256.
9. The face liveness detection method based on the tree-structured neural network as claimed in claim 1, characterized in that the liveness detection is: inputting the preprocessed picture to be detected into the trained face liveness detection model to obtain, as the detection result, a 0/1 live/non-live value and a 32 × 32 × 1 mask of the non-live region in the picture; the liveness judgment is made on this result and the non-live attack type is judged from the mask, giving the final liveness detection result.
10. A system for implementing the method of any preceding claim, comprising a data preprocessing unit and a face liveness detection unit, wherein the data preprocessing unit receives and preprocesses the picture to be detected, reduces it to a fixed-size picture containing only the face region, and inputs it into the face liveness detection unit; the face liveness detection unit, connected to the data preprocessing unit, receives the preprocessed picture data, performs liveness detection on the face in the picture, and generates the probability that the face is non-live together with a mask of the non-live region in the picture.
CN202011439243.7A (filed 2020-12-08, priority 2020-12-08) Face living body detection method based on tree-shaped neural network structure, Pending, published as CN112464864A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011439243.7A CN112464864A (en) 2020-12-08 2020-12-08 Face living body detection method based on tree-shaped neural network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011439243.7A CN112464864A (en) 2020-12-08 2020-12-08 Face living body detection method based on tree-shaped neural network structure

Publications (1)

Publication Number Publication Date
CN112464864A (en) 2021-03-09

Family

ID=74801916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011439243.7A Pending CN112464864A (en) 2020-12-08 2020-12-08 Face living body detection method based on tree-shaped neural network structure

Country Status (1)

Country Link
CN (1) CN112464864A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190026605A1 (en) * 2017-07-19 2019-01-24 Baidu Online Network Technology (Beijing) Co., Ltd . Neural network model training method and apparatus, living body detecting method and apparatus, device and storage medium
CN109255322A (en) * 2018-09-03 2019-01-22 北京诚志重科海图科技有限公司 A kind of human face in-vivo detection method and device
CN110263691A (en) * 2019-06-12 2019-09-20 合肥中科奔巴科技有限公司 Head movement detection method based on android system
CN110472519A (en) * 2019-07-24 2019-11-19 杭州晟元数据安全技术股份有限公司 A kind of human face in-vivo detection method based on multi-model
CN110516576A (en) * 2019-08-20 2019-11-29 西安电子科技大学 Near-infrared living body faces recognition methods based on deep neural network
CN110674730A (en) * 2019-09-20 2020-01-10 华南理工大学 Monocular-based face silence living body detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yaojie Liu et al., "Deep Tree Learning for Zero-shot Face Anti-Spoofing," arXiv:1904.02860v2 [cs.CV] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111750A (en) * 2021-03-31 2021-07-13 智慧眼科技股份有限公司 Face living body detection method and device, computer equipment and storage medium
CN115131880A (en) * 2022-05-30 2022-09-30 上海大学 Multi-scale attention fusion double-supervision human face in-vivo detection method

Similar Documents

Publication Publication Date Title
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN106096538B (en) Face identification method and device based on sequencing neural network model
CN109101865A (en) A kind of recognition methods again of the pedestrian based on deep learning
JP4743823B2 (en) Image processing apparatus, imaging apparatus, and image processing method
CN107122744A (en) A kind of In vivo detection system and method based on recognition of face
CN111783576A (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN109410184B (en) Live broadcast pornographic image detection method based on dense confrontation network semi-supervised learning
Kimura et al. CNN hyperparameter tuning applied to iris liveness detection
CN110674730A (en) Monocular-based face silence living body detection method
CN112926522B (en) Behavior recognition method based on skeleton gesture and space-time diagram convolution network
CN114863263B (en) Snakehead fish detection method for blocking in class based on cross-scale hierarchical feature fusion
CN112464864A (en) Face living body detection method based on tree-shaped neural network structure
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN112434647A (en) Human face living body detection method
CN111914758A (en) Face in-vivo detection method and device based on convolutional neural network
CN113221655B (en) Face spoofing detection method based on feature space constraint
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN112183504B (en) Video registration method and device based on non-contact palm vein image
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
CN109711232A (en) Deep learning pedestrian recognition methods again based on multiple objective function
CN112560618A (en) Behavior classification method based on skeleton and video feature fusion
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
Zaidan et al. A novel hybrid module of skin detector using grouping histogram technique for Bayesian method and segment adjacent-nested technique for neural network
Karthigayan et al. Genetic algorithm and neural network for face emotion recognition
CN110717544B (en) Pedestrian attribute analysis method and system under vertical fisheye lens

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210309)