CN114092994A - Human face living body detection method based on multi-view feature learning - Google Patents

Human face living body detection method based on multi-view feature learning

Info

Publication number
CN114092994A
CN114092994A
Authority
CN
China
Prior art keywords
face
training
attack
model
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111192064.2A
Other languages
Chinese (zh)
Inventor
毋立芳
王竹铭
徐姚文
简萌
石戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111192064.2A priority Critical patent/CN114092994A/en
Publication of CN114092994A publication Critical patent/CN114092994A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a face liveness detection method based on multi-view feature learning. Group-level classification is performed from multiple views such as "real + mask" and "real + video"; feature extraction models for several different views are trained to extract discriminative detection features, which are then fused by a binary classification model for real/fake classification. The invention alleviates the drop in discrimination performance that existing face liveness detection methods suffer when the set of attack types is extended, and strengthens the defence against attacks, thereby improving the accuracy of face liveness detection.

Description

Human face living body detection method based on multi-view feature learning
Technical Field
The invention belongs to the technical field of face liveness detection and relates to a face liveness detection method based on multi-view feature learning.
Background
In recent years, with the wide application of face recognition technology in identity authentication systems such as financial payment and door-access unlocking, face forgery attacks against it have become increasingly common. An attacker can easily deceive a face recognition system with presentation means such as printed photos, video replay and 3D masks, posing a serious threat to users' personal property and even to public security. Face liveness detection technology emerged in response.
However, existing methods generally treat face liveness detection as a binary classification problem between real and forged faces, classifying only from the "real vs. all attack types" view. In fact, different attack types present different forgery characteristics, and a forgery cue that detects one attack type well (a characteristic that type has and a real face lacks) may simply not exist in another attack type; trying to find common forgery cues that detect multiple attack types may yield only a compromise among the characteristics of the different attack types, rather than the optimal choice for each. On the other hand, viewed from different perspectives, each attack type also shares features with the real face. For example, real faces and masks both carry depth information, while photos and videos do not; real faces and videos both carry dynamic information, while photos and masks do not.
As a result, existing methods suffer a marked performance drop once the set of attack types is extended. For example, a model may perform well when only the photo and video attack types, which share similar forgery cues, are considered, yet its performance drops significantly once a mask attack, which differs greatly from photo and video attacks, is introduced.
Disclosure of Invention
To solve these problems, the invention provides a face liveness detection method based on multi-view feature learning, which trains models by group-level classification and extracts features using multiple views of the form "real + given attack type vs. remaining attack types", and then fuses them to discriminate real from fake faces. The invention accounts for both the differences between attack types and the commonality between each attack type and the real face; it alleviates the drop in discrimination performance that existing face liveness detection methods suffer when the attack types are extended, and strengthens the defence against attacks, thereby improving the accuracy of face liveness detection.
The method comprises the following specific steps:
(1) Select and label training samples: collect non-living face samples for training, such as photos, videos and masks, and label them by attack type; collect living face samples and label them as the real-face class;
(2) Group-level classification training from the "real + given attack type vs. remaining attack types" view: take the real-face samples together with the samples of a given attack type as the positive sample group and the samples of the remaining attack types as the negative sample group, and perform group-level classification training to obtain a feature extraction model for the "real + given attack type" view;
(3) Multi-view feature extraction: using the group-level classification training of step (2), select several different views, train the corresponding feature extraction models, and extract the detection features of those views;
(4) Train a binary classification model: take the view-specific detection features obtained in step (3) as the input of a binary classification model, and train it with the real-face class as positive samples and all non-living faces as negative samples;
(5) Face liveness detection based on multi-view feature learning: take a face image to be detected, obtain its view-specific detection features with the models from step (3), and input them to the binary classification model from step (4) to classify the face as real or fake.
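The group-level labelling of steps (1)–(2) can be sketched in a few lines. The following Python sketch is illustrative only: the class names follow the description above, but the function name and data layout are assumptions.

```python
# Group-level label assignment for multi-view training (illustrative sketch).
# Classes: "real", "photo", "video", "mask" (from step (1)).
# A view "real + X" treats real faces plus attack type X as the positive
# group and all remaining attack types as the negative group (step (2)).

ATTACK_TYPES = {"photo", "video", "mask"}

def group_label(sample_class: str, given_attack: str) -> int:
    """Return 1 (positive group) or 0 (negative group) for the
    'real + given attack type vs. remaining attack types' view."""
    if sample_class == "real" or sample_class == given_attack:
        return 1
    if sample_class in ATTACK_TYPES:
        return 0
    raise ValueError(f"unknown class: {sample_class}")

# Example: labels under the "real + mask" view.
labels_rm = {c: group_label(c, "mask") for c in ["real", "photo", "video", "mask"]}
# real and mask fall in the positive group; photo and video in the negative group.
```

Training one feature extractor per such relabelling, rather than one model against all attacks, is the core of the multi-view strategy.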
Further, in step (1), the collected non-living face samples for training comprise the three attack types photo, video and mask.
Further, the specific model design takes one of two forms:
1) in step (3), only the feature extraction models of the two views "real + mask" and "real + video" are selected;
2) the following step is added between steps (2) and (3): perform classification training with the real-face samples as positive samples and the attack samples of all types as negative samples to obtain a feature extraction model for the "real vs. all attack types" view; in step (3), the feature extraction models of the three views "real + mask", "real + video" and "real vs. all attack types" are selected.
Further, in the method, the feature extraction models for the "real + mask" and "real vs. all attack types" views adopt as backbone the CDCN model designed by Zitong Yu et al. in 2020; the feature extraction model for the "real + video" view adopts as backbone the 3D-CDCN model designed by Yaowen Xu et al. in 2021; the binary classification model is a purpose-built three-layer convolutional network in which every kernel is 3 × 3 with stride 1 and padding 1, each of the first two layers is followed by a BatchNorm layer and a ReLU layer, and the third layer is followed by a ReLU layer.
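A minimal PyTorch sketch of a fusion head matching this description (three 3 × 3 convolutions, stride 1, padding 1; BatchNorm + ReLU after each of the first two layers, ReLU after the third). The channel counts and the concatenation of the two view features are assumptions, since the description does not state them:

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """Three-layer convolutional fusion head: all kernels 3x3, stride 1,
    padding 1; BatchNorm+ReLU after layers 1-2, ReLU after layer 3.
    Channel counts are illustrative assumptions, not from the patent."""
    def __init__(self, in_ch: int = 128, mid_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(),
            nn.Conv2d(mid_ch, 1, 3, stride=1, padding=1),
            nn.ReLU(),
        )

    def forward(self, p_rm: torch.Tensor, p_rv: torch.Tensor) -> torch.Tensor:
        # Fuse the two view features by channel concatenation (assumption),
        # producing a single-channel score map at the same spatial size.
        x = torch.cat([p_rm, p_rv], dim=1)
        return self.net(x).squeeze(1)

# Example: two 64-channel 32x32 view features give a 32x32 score map.
f = BinaryClassifier(in_ch=128)
score_map = f(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
# score_map.shape == (1, 32, 32)
```

With padding 1 and stride 1 the spatial size is preserved, which is consistent with the 32 × 32 score map P_f described later.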
Further, in step (5), liveness is judged as follows: take the output of the binary classification model as a score map and the mean of all its elements as the classification score; if the score exceeds the classification threshold, the face image under test is judged to be a live face.
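This decision rule takes only a few lines; a NumPy sketch, with the threshold left as a parameter:

```python
import numpy as np

def is_live(score_map: np.ndarray, threshold: float) -> bool:
    """Judge liveness per step (5): average all elements of the score
    map into one classification score and compare with the threshold."""
    score = float(score_map.mean())
    return score > threshold

# Example: a 32x32 score map with mean 0.8 is judged live at threshold 0.5.
assert is_live(np.full((32, 32), 0.8), threshold=0.5)
```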
The invention has the following advantages:
1) By adopting a multi-view feature learning strategy, the method attends to the differences between attack types and to the commonality between each attack type and the real face, avoiding the loss of feature discriminability that comes from detecting multiple attacks with a single model.
2) By selecting feature extraction models for the two views "real + mask" and "real + video", the method captures both the depth and the dynamic information of the real face, describing it more comprehensively and finely and enhancing the discriminability of face liveness detection.
Description of the drawings:
FIG. 1 is a schematic diagram of a model framework of the present invention.
FIG. 2 is a diagram of group level classification from multiple views.
Detailed Description
The invention provides a human face living body detection method based on multi-view feature learning. The following describes specific implementation steps of the present invention with reference to specific examples.
Referring to fig. 1, in this example, photo, video and mask attacks are selected as the non-living face samples, and the two views "real + mask" and "real + video" are selected for feature extraction. The specific steps are as follows:
(1) Select and label training samples: collect non-living face samples for training, comprising the three attack types photo, video and mask, and label each sample by its attack type; collect living face samples and label them as the real-face class I_R. Each sample is an image sequence eight frames long.
(2) Group-level classification training from the "real + given attack type vs. remaining attack types" view: take the real-face samples together with the samples of a given attack type as the positive sample group and the samples of the remaining attack types as the negative sample group, and perform group-level classification training to obtain a feature extraction model for the "real + given attack type" view. With "real + mask" as the view, the real-face and mask samples are positive and the photo and video samples are negative; with "real + video" as the view, the real-face and video samples are positive and the photo and mask samples are negative.
(3) Train the multi-view feature extraction models: using the group-level classification training of step (2), train the feature extraction models E_rm and E_rv for the two views "real + mask" and "real + video", which extract the corresponding detection features P_rm and P_rv.
As shown in fig. 2, conventional methods classify only from the "real vs. all attack types" view, while the invention adds views such as "real + mask" and "real + video". E_rm adopts as backbone the CDCN model designed by Zitong Yu et al. in 2020; its output form and supervision are the same as CDCN's, and the output of the model's penultimate layer is additionally taken as P_rm. E_rv adopts as backbone the 3D-CDCN model designed by Yaowen Xu et al. in 2021; its output form and supervision are the same as 3D-CDCN's, and the output of its penultimate layer is additionally averaged over the temporal dimension and resampled to the same shape as P_rm, giving P_rv. That is, P_rm and P_rv both have shape 64 × 32. The input of E_rm is the first frame of the eight-frame image sequence; the input of E_rv is the full eight-frame sequence.
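The temporal reduction in the E_rv branch (averaging the eight-frame output over time, then resampling to the shape of P_rm) can be sketched as follows. This is a hedged illustration: the feature shapes are examples, and block average pooling stands in for the resampling step, whose exact method the description does not specify:

```python
import numpy as np

def reduce_temporal(feat_seq: np.ndarray, target_hw: tuple) -> np.ndarray:
    """Average a (T, C, H, W) feature sequence over time, then downsample
    spatially by block average pooling to target (H, W). Pooling is an
    assumed stand-in for the unspecified resampling step."""
    feat = feat_seq.mean(axis=0)          # (C, H, W): temporal average
    c, h, w = feat.shape
    th, tw = target_hw
    assert h % th == 0 and w % tw == 0, "integer pooling factors assumed"
    fh, fw = h // th, w // tw
    # Split each spatial axis into (blocks, block_size) and average blocks.
    return feat.reshape(c, th, fh, tw, fw).mean(axis=(2, 4))

# Example: an eight-frame 64-channel 64x64 sequence reduced to 32x32.
p_rv = reduce_temporal(np.random.rand(8, 64, 64, 64), (32, 32))
# p_rv.shape == (64, 32, 32)
```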
(4) Train the binary classification model: take the detection features P_rm and P_rv from step (3), based on the "real + mask" and "real + video" views, as the input of a binary classification model F; train F with the real-face class I_R as positive samples and the non-living faces as negative samples.
(5) Face liveness detection based on multi-view feature learning: take an eight-frame face image sequence to be detected and feed it into the two models E_rm and E_rv obtained in step (3), giving the two view-specific detection features P_rm and P_rv; then input these to the binary classification model F obtained in step (4) to obtain the binary classification output P_f, which is used for real/fake face discrimination.
The binary classification model F is a three-layer convolutional network: every kernel is 3 × 3 with stride 1 and padding 1, each of the first two layers is followed by a BatchNorm layer and a ReLU layer, and the third layer is followed by a ReLU layer. P_f has shape 32 × 32; the mean of all its elements is taken as the classification score, and if the score exceeds the classification threshold, the face image under test is judged to be a live face.
The classification threshold is selected as follows: using the attack presentation classification error rate (APCER) and the bona fide presentation classification error rate (BPCER) defined in ISO/IEC 30107-3, the standard for biometric presentation attack detection, as performance metrics, search the threshold from 0 to 1 and take the threshold at which APCER and BPCER are equal on a validation set (or the training set) as the classification threshold for the test stage. The threshold depends on the experimental setting; empirically, 0.5 can be used as a general-purpose classification threshold, with a classification error rate within 2% of that obtained with the optimal threshold.
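The equal-error threshold search described above can be sketched as follows, assuming scores in [0, 1] where higher means more likely live; APCER is then the fraction of attack samples accepted as live at a given threshold, and BPCER the fraction of live samples rejected:

```python
import numpy as np

def eer_threshold(live_scores: np.ndarray, attack_scores: np.ndarray,
                  steps: int = 1000) -> float:
    """Search thresholds in [0, 1] and return the one where APCER
    (attacks accepted as live) and BPCER (live rejected) are closest."""
    best_t, best_gap = 0.0, float("inf")
    for t in np.linspace(0.0, 1.0, steps + 1):
        apcer = float((attack_scores > t).mean())   # attack classified live
        bpcer = float((live_scores <= t).mean())    # live classified attack
        if abs(apcer - bpcer) < best_gap:
            best_gap, best_t = abs(apcer - bpcer), float(t)
    return best_t

# Example: well-separated score groups yield a threshold between them,
# where APCER = BPCER = 0.
t = eer_threshold(np.array([0.8, 0.9, 0.95]), np.array([0.1, 0.2, 0.3]))
```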
In order to prove the effectiveness of the invention, the invention is tested on a common test data set, and the result shows that the method can obtain good detection performance.
Beyond routine performance evaluation, we ran an additional comparative test: when samples of a new attack type are introduced into an existing method for retraining, its classification performance drops markedly; after applying the multi-view learning idea, without changing the model structure and modifying only the group-level classification labels, the performance drop is clearly alleviated. This comparison shows that the method effectively alleviates the drop in discrimination performance that existing face liveness detection methods suffer when the attack types are extended, strengthens the defence against attacks, and improves face liveness detection accuracy.

Claims (8)

1. A human face living body detection method based on multi-view feature learning is characterized by comprising the following steps:
(1) Select and label training samples: collect non-living face samples for training and label them by attack type; collect living face samples and label them as the real-face class;
(2) Group-level classification training from the "real + given attack type vs. remaining attack types" view: take the real-face samples together with the samples of a given attack type as the positive sample group and the samples of the remaining attack types as the negative sample group, and perform group-level classification training to obtain a feature extraction model for the "real + given attack type" view;
(3) Multi-view feature extraction: using the group-level classification training of step (2), select several different views, train the corresponding feature extraction models, and extract the detection features of those views;
(4) Train a binary classification model: take the view-specific detection features obtained in step (3) as the input of a binary classification model, and train it with the real-face class as positive samples and all non-living faces as negative samples;
(5) Face liveness detection based on multi-view feature learning: take a face image to be detected, obtain its view-specific detection features with the models from step (3), and input them to the binary classification model from step (4) to classify the face as real or fake.
2. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (1), the collected non-living human face samples for training comprise three attack types of photos, videos and masks.
3. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (3), only the feature extraction models of the two visual angles of 'true + mask' and 'true + video' are selected.
4. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: the following step is added between the step (2) and the step (3): classification training is carried out with the real face samples as positive samples and all types of attack samples as negative samples to obtain a feature extraction model of the "real vs. all attack types" view; in the step (3), the feature extraction models of the three views "real + mask", "real + video" and "real vs. all attack types" are selected.
5. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (2), a CDCN model is adopted as a backbone network by a feature extraction model based on 'real + mask' visual angle training; and the feature extraction model based on the 'real + video' view angle adopts a 3D-CDCN model as a backbone network.
6. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (4), the binary classification model is a purpose-built three-layer convolutional network; the sizes of the convolution kernels are all 3 × 3 with stride 1 and padding 1, each of the first two layers is connected with a BatchNorm layer and a ReLU layer, and the third layer is connected with a ReLU layer.
7. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (5), the judgment method of the face liveness detection is as follows: the output of the binary classification model is taken as a score map, the mean of all elements of the score map is taken as the classification score, 0.5 is selected as the classification threshold, and if the score is greater than the threshold, the face image to be detected is judged to be a live face.
8. The face live detection method based on multi-view feature learning according to claim 1, characterized in that: in the step (2), the CDCN model is adopted as a backbone network by the feature extraction model based on the visual angle training of the real vs. all attack types.
CN202111192064.2A 2021-10-13 2021-10-13 Human face living body detection method based on multi-view feature learning Pending CN114092994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111192064.2A CN114092994A (en) 2021-10-13 2021-10-13 Human face living body detection method based on multi-view feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111192064.2A CN114092994A (en) 2021-10-13 2021-10-13 Human face living body detection method based on multi-view feature learning

Publications (1)

Publication Number Publication Date
CN114092994A true CN114092994A (en) 2022-02-25

Family

ID=80296811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111192064.2A Pending CN114092994A (en) 2021-10-13 2021-10-13 Human face living body detection method based on multi-view feature learning

Country Status (1)

Country Link
CN (1) CN114092994A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122709A (en) * 2017-03-17 2017-09-01 上海云从企业发展有限公司 Biopsy method and device
CN108960086A (en) * 2018-06-20 2018-12-07 电子科技大学 Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing
CN109117755A (en) * 2018-07-25 2019-01-01 北京飞搜科技有限公司 A kind of human face in-vivo detection method, system and equipment
US20190014999A1 (en) * 2017-07-14 2019-01-17 Hong Kong Baptist University 3d mask face anti-spoofing with remote photoplethysmography
WO2019152983A2 (en) * 2018-02-05 2019-08-08 Board Of Trustees Of Michigan State University System and apparatus for face anti-spoofing via auxiliary supervision
CN110472519A (en) * 2019-07-24 2019-11-19 杭州晟元数据安全技术股份有限公司 A kind of human face in-vivo detection method based on multi-model
CN111160313A (en) * 2020-01-02 2020-05-15 华南理工大学 Face representation attack detection method based on LBP-VAE anomaly detection model
US20200175260A1 (en) * 2018-11-30 2020-06-04 Qualcomm Incorporated Depth image based face anti-spoofing
CN112990347A (en) * 2021-04-08 2021-06-18 清华大学 Sample classification method and device based on unbiased sample learning algorithm PU _ AUL
CN113312965A (en) * 2021-04-14 2021-08-27 重庆邮电大学 Method and system for detecting unknown face spoofing attack living body

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122709A (en) * 2017-03-17 2017-09-01 上海云从企业发展有限公司 Biopsy method and device
US20190014999A1 (en) * 2017-07-14 2019-01-17 Hong Kong Baptist University 3d mask face anti-spoofing with remote photoplethysmography
WO2019152983A2 (en) * 2018-02-05 2019-08-08 Board Of Trustees Of Michigan State University System and apparatus for face anti-spoofing via auxiliary supervision
CN108960086A (en) * 2018-06-20 2018-12-07 电子科技大学 Based on the multi-pose human body target tracking method for generating confrontation network positive sample enhancing
CN109117755A (en) * 2018-07-25 2019-01-01 北京飞搜科技有限公司 A kind of human face in-vivo detection method, system and equipment
US20200175260A1 (en) * 2018-11-30 2020-06-04 Qualcomm Incorporated Depth image based face anti-spoofing
CN110472519A (en) * 2019-07-24 2019-11-19 杭州晟元数据安全技术股份有限公司 A kind of human face in-vivo detection method based on multi-model
CN111160313A (en) * 2020-01-02 2020-05-15 华南理工大学 Face representation attack detection method based on LBP-VAE anomaly detection model
CN112990347A (en) * 2021-04-08 2021-06-18 清华大学 Sample classification method and device based on unbiased sample learning algorithm PU _ AUL
CN113312965A (en) * 2021-04-14 2021-08-27 重庆邮电大学 Method and system for detecting unknown face spoofing attack living body

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王竹铭: "Research on Face Liveness Detection Algorithms Based on Inter-class Relation Learning", China Masters' Theses Full-text Database, Information Science and Technology, 15 March 2024 (2024-03-15), pages 138-1102 *

Similar Documents

Publication Publication Date Title
Nguyen et al. Modular convolutional neural network for discriminating between computer-generated images and photographic images
CN110516616A (en) A kind of double authentication face method for anti-counterfeit based on extensive RGB and near-infrared data set
CN111160286B (en) Video authenticity identification method
Abidin et al. Copy-move image forgery detection using deep learning methods: a review
Kharrazi et al. Improving steganalysis by fusion techniques: A case study with image steganography
CN113312965B (en) Face unknown spoofing attack living body detection method and system
CN114067444A (en) Face spoofing detection method and system based on meta-pseudo label and illumination invariant feature
CN1760887A (en) The robust features of iris image extracts and recognition methods
CN113128481A (en) Face living body detection method, device, equipment and storage medium
CN114663986B (en) Living body detection method and system based on double decoupling generation and semi-supervised learning
CN113361474B (en) Double-current network image counterfeiting detection method and system based on image block feature extraction
CN114842524A (en) Face false distinguishing method based on irregular significant pixel cluster
Chen et al. A study on the photo response non-uniformity noise pattern based image forensics in real-world applications
Peng et al. Face morphing attack detection and attacker identification based on a watchlist
CN112200075A (en) Face anti-counterfeiting method based on anomaly detection
CN114092994A (en) Human face living body detection method based on multi-view feature learning
US20230084980A1 (en) System for detecting face liveliness in an image
CN113723215B (en) Training method of living body detection network, living body detection method and device
CN115187789A (en) Confrontation image detection method and device based on convolutional layer activation difference
Patel et al. An optimized convolution neural network based inter-frame forgery detection model—a multi-feature extraction framework
CN117496601B (en) Face living body detection system and method based on fine classification and antibody domain generalization
Abrahim et al. Image Splicing Forgery Detection Scheme Using New Local Binary Pattern Varient
He et al. Dynamic Residual Distillation Network for Face Anti-Spoofing With Feature Attention Learning
Jia et al. Enhanced face morphing attack detection using error-level analysis and efficient selective kernel network
CN113158838B (en) Full-size depth map supervision-based face representation attack detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination