CN111881815A - Human face living body detection method based on multi-model feature migration - Google Patents

Human face living body detection method based on multi-model feature migration

Info

Publication number
CN111881815A
Authority
CN
China
Prior art keywords
model
probability
living body
yuv
face
Prior art date
Legal status
Pending
Application number
CN202010728371.7A
Other languages
Chinese (zh)
Inventor
凌康杰
王祥雪
林焕凯
刘双广
Current Assignee
Gosuncn Technology Group Co Ltd
Original Assignee
Gosuncn Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Gosuncn Technology Group Co Ltd filed Critical Gosuncn Technology Group Co Ltd
Priority to CN202010728371.7A priority Critical patent/CN111881815A/en
Publication of CN111881815A publication Critical patent/CN111881815A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06V 40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/162 Detection; Localisation; Normalisation using pixel segmentation or colour matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a human face living body detection method based on multi-model feature migration. In the training stage, visible light images from open-source or private data sets are fused and, after face detection, alignment and cropping, an RGB model and a YUV model are trained simultaneously until the models converge. In the prediction stage, the collected visible light image is input into the trained RGB model and the trained YUV model respectively to obtain the results of the two models; the final score is obtained through a model score fusion strategy, and the living body detection result is then determined from this score. The method has good generalization performance and high precision, and is suitable for industrial deployment.

Description

Human face living body detection method based on multi-model feature migration
Technical Field
The invention belongs to the technical fields of computer vision, pattern recognition, machine learning, convolutional neural networks and face recognition, and particularly relates to a human face living body detection method based on multi-model feature migration.
Background
Face recognition technology has been widely applied in fields such as security monitoring, human-computer interaction, electronic commerce and mobile payment, and face living body detection is the first gate of face recognition and a prerequisite for applying face recognition technology. The main technical solutions for current living body detection are interactive living body detection, multi-source information fusion living body detection and static image living body detection. Interactive living body detection requires user cooperation; it is inconvenient, involves cumbersome steps, is easily resisted by users and has low efficiency. Multi-source information fusion living body detection usually requires adding a depth camera, an infrared camera, a 3D camera, a microphone and the like, which not only increases hardware cost but also brings a large amount of complex 3D modeling calculation. Static image living body detection is a low-cost and fast detection method, but because current data sets are insufficient and model construction methods are inefficient, model generalization is often inadequate.
In current static image living body detection methods, a living body detection model is often built for a limited number of attack types in a single scene represented by a single data set. Such a method can meet ideal precision requirements in the laboratory, but actual scenes in industrial and practical use are far more complex: the diversity of scenes brings diverse illumination and backgrounds, and diverse attack types and attack means also exist, which poses serious challenges to the practical deployment of living body detection. In common living body attacks, the printer type and printing quality, the type, resolution and size of different display screens, and even the angle, distance and brightness of the attack and whether the display device carries a screen protector all affect living body detection. Because attack types are diverse and the image distribution within the same kind of attack varies widely, model generalization is low in real scenes. Aiming at the insufficient generalization of traditional living body detection methods, a living body detection method based on multi-model migration is proposed: a heterogeneous data set is constructed, and living body training and prediction are carried out by a multi-model feature migration and fusion method under multiple color spaces.
Patent CN109840467A uses a generative adversarial network (GAN) to generate a new training set of negative samples (the negative samples are attack images); its purpose is to address the shortage of negative samples when training a deep learning network. However, on a given data set, the GAN method can only learn the limited sample probability distribution of that data set. Therefore, for entirely new attack scenes and means, the image data generated by the network has limited representativeness and insufficient generalization capability.
Patent CN110472519A adopts a multi-model fusion method for living body detection, but its model requires natural light images and infrared images to be input to the network simultaneously as training data. In that method, infrared image acquisition is complicated and requires an infrared camera, which increases hardware cost and is not conducive to the rapid upgrade of existing face detection and recognition equipment.
In the prior art, because training data distributions are narrow and model construction methods are inefficient, static living body detection models often generalize poorly and generally cannot be used in industrial, real-world production scenes.
Disclosure of Invention
Aiming at the insufficient generalization of current static living body detection models, the invention provides a human face living body detection method based on multi-model feature migration.
The invention is realized by the following technical scheme:
A human face living body detection method based on multi-model feature migration comprises the following steps:
S1, acquiring a visible light image, wherein the visible light image contains a human face;
S2, identifying the visible light image by using a first RGB model to obtain a first living body probability;
S3, identifying the visible light image by using a second YUV model to obtain a second living body probability;
S4, determining a third living body probability according to the first living body probability and the second living body probability; and S5, judging whether the image is a living body according to the third living body probability.
Further, the first living body probability includes a negative sample probability p1 and a positive sample probability p2; the second living body probability includes a negative sample probability p3 and a positive sample probability p4.
Further, step S4 further includes: the third living body probability is the mean of the probabilities of the two models, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
Further, step S4 further includes step-by-step judgment: set the first RGB model threshold as T1 and the second YUV model threshold as T2, where 0.5 ≤ T1 < 1 and 0.5 ≤ T2 < 1. Specifically, the output of the first RGB model is judged first: if p2 < 1 - T1 or p2 > T1, the finally output third living body probability is the first living body probability output by the first RGB model; otherwise the output of the second YUV model is judged. If p4 < 1 - T2 or p4 > T2 in the second YUV model, the finally output third living body probability is the second living body probability output by the second YUV model; otherwise the third living body probability is the mean of the model probabilities, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
Further, step S5 includes: set the living body judgment threshold as T3, where 0.5 ≤ T3 < 1; if the final positive sample probability is greater than or equal to T3, the image is judged to be a positive sample; if it is less than T3, the image is judged to be a negative sample.
Further, a step S0 is included before step S1: in the training phase, visible light images from the heterogeneous data set are fused, face detection, alignment and cropping are performed, and the first RGB model and the second YUV model are trained respectively until the models converge.
Further, in step S0, the training phase includes the steps of:
S101: constructing a heterogeneous data set by collecting data sets and selecting only images or videos under visible light; the positive samples are the real samples in the heterogeneous data set, and the negative samples are the attack samples in the heterogeneous data set;
S102: data preprocessing, comprising 3 steps:
A: face detection: for video data, face detection is performed once every n frames; if a face is detected, the next step is performed, otherwise face detection is performed on the next n frames. For image data, face detection is performed directly; if a face is detected, the next step is performed, otherwise face detection is performed on the next image;
B: face alignment: the face detected in step A is aligned using a similarity transformation;
C: cropping: the face aligned in step B is cropped to an input size suitable for both the first RGB model and the second YUV model;
S103: training the first RGB model: the face images preprocessed in S102 are input into the first RGB model for training;
S104: training the second YUV model: the face images preprocessed in S102 are converted from the RGB color space to the YUV color space through color space conversion, and then input into the second YUV model for training;
S105: the two models of S103 and S104 are trained respectively; when the models converge and reach the expected accuracy on the validation set or test set, model training is complete and step S1 is entered.
Further, in step S101, the heterogeneous data set includes public or private data sets, where the public data sets include: Replay-Attack, Print-Attack, Yale-Recaptured, CASIA-MFSD, MSU-MFSD, Replay-Mobile, Mspoof, SiW, Oulu-NPU, VAD, NUAA, or CASIA-SURF.
A computer-readable storage medium has a computer program stored thereon; when executed by a processor, the program carries out the steps of the human face living body detection method based on multi-model feature migration.
A computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the steps of the human face living body detection method based on multi-model feature migration.
The key points of the invention are as follows:
1. the construction strategy and method of the heterogeneous data set: the heterogeneous data set and the multiple models in multiple color spaces complement each other; without the heterogeneous data set, the features learned by the multiple models in multiple color spaces would be limited and monotonous, and the generalization capability would be low;
2. the model construction scheme of multi-model feature migration under the RGB and YUV color spaces: without multiple models in multiple color spaces, a single model can hardly learn the complex features of the heterogeneous data set sufficiently;
3. the model score fusion strategy: the model score fusion strategy is a key factor affecting the final effect of the multiple models; if it does not match the multiple models actually constructed, the final effect of the multiple models is often worse than that of a single model.
Compared with the prior art, the invention has at least the following beneficial effects or advantages. First, the scheme has low cost and a small amount of calculation, and existing face detection and recognition equipment with a single camera can be rapidly deployed and upgraded. The scheme adopts a static image living body detection technique that only needs a single camera; there is no cost from additional hardware such as a depth camera, infrared camera, 3D camera or microphone, and no large amount of complex 3D modeling calculation, so the scheme is low-cost and computationally light. The backbone network adopted by the scheme can be replaced by lightweight networks such as MobileNet V1, MobileNet V2 and EfficientNet according to actual requirements, further accelerating inference. Second, the living body detection model established by the scheme has strong generalization and high precision: because the training set uses a heterogeneous data set and multiple models are constructed under multiple color spaces, the generalization and precision of the model are significantly improved.
Drawings
The present invention will be described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the training phase of the present invention;
FIG. 2 is a flow chart of the prediction phase of the present invention;
fig. 3 is a flow chart of the step-by-step determination of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the insufficient generalization of existing static living body detection models based on deep learning, a living body detection method based on multi-model feature migration is provided. By constructing and fusing heterogeneous data sets and training living body models with a multi-model feature migration and fusion method under multiple color spaces, the precision and generalization capability of the living body detection model are improved. In the training stage, visible light images from open-source or private data sets are fused and, after face detection, alignment and cropping, an RGB (red, green, blue) model and a YUV (luma, chroma) model are trained simultaneously. In the prediction stage, the collected visible light image is input into the RGB model and the YUV model respectively to obtain the results of the two models; the final score is obtained through a score fusion scheme, and the living body detection result is then determined from this score. The method has good generalization performance and high precision, and is suitable for industrial deployment.
In an embodiment of the present invention, a human face living body detection method based on multi-model feature migration is provided, which includes the steps of:
S1, acquiring a visible light image, wherein the visible light image contains a human face;
S2, identifying the visible light image by using a first RGB model to obtain a first living body probability;
S3, identifying the visible light image by using a second YUV model to obtain a second living body probability;
S4, determining a third living body probability according to the first living body probability and the second living body probability; and S5, judging whether the image is a living body according to the third living body probability.
Further, the first living body probability includes a negative sample probability p1 and a positive sample probability p2; the second living body probability includes a negative sample probability p3 and a positive sample probability p4.
Further, step S4 further includes: the third living body probability is the mean of the probabilities of the two models, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
Further, step S4 further includes step-by-step judgment: set the first RGB model threshold as T1 and the second YUV model threshold as T2, where 0.5 ≤ T1 < 1 and 0.5 ≤ T2 < 1. Specifically, the output of the first RGB model is judged first: if p2 < 1 - T1 or p2 > T1, the finally output third living body probability is the first living body probability output by the first RGB model; otherwise the output of the second YUV model is judged. If p4 < 1 - T2 or p4 > T2 in the second YUV model, the finally output third living body probability is the second living body probability output by the second YUV model; otherwise the third living body probability is the mean of the model probabilities, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
Further, step S5 includes: set the living body judgment threshold as T3, where 0.5 ≤ T3 < 1; if the final positive sample probability is greater than or equal to T3, the image is judged to be a positive sample; if it is less than T3, the image is judged to be a negative sample.
Further, a step S0 is included before step S1: in the training phase, visible light images from the heterogeneous data set are fused, face detection, alignment and cropping are performed, and the first RGB model and the second YUV model are trained respectively until the models converge.
In another embodiment, the method is divided into two steps: the first step is the construction of the living body detection model, corresponding to the training stage; the second step is the deployment of the living body detection model, corresponding to the prediction stage. The scheme of the training phase is shown in fig. 1, and the scheme of the prediction phase is shown in fig. 2.
The training phase is characterized in that a heterogeneous data set is constructed, two or more deep models with different architectures are trained simultaneously, and each deep learning model uses a different color space.
S101: constructing the heterogeneous data set. Public or private data sets are collected first, and only images or videos under visible light are selected to form the heterogeneous data set. Public data sets include, but are not limited to, the following: Replay-Attack, Print-Attack, Yale-Recaptured, CASIA-MFSD, MSU-MFSD, Replay-Mobile, Mspoof, SiW, Oulu-NPU, VAD, NUAA, CASIA-SURF, and the like. For private, specific scenarios, private data sets are collected and added. The positive samples are the real samples in the heterogeneous data set and the negative samples are the attack samples; the negative sample types are the 2D attack types among the negative samples, such as printed picture attack, tablet attack, mobile phone attack and display attack. The construction strategy of the heterogeneous data set comprises the following 2 strategies (an illustrative sampling sketch follows the two strategies).
Let the number of collected data sets be S, recorded as {D_1, D_2, D_3, ..., D_S}. Let the m-th data set D_m contain M_m positive samples, N_m negative samples and O_m negative sample types, where 1 ≤ m ≤ S and m ∈ N*. Let γ be the balance factor, where 0 ≤ γ ≤ 1.
Strategy one: in the m-th data set, γN_m/O_m samples are drawn randomly (stratified) from each negative sample type, so that γN_m negative samples are drawn in total. If γN_m > M_m, all M_m positive samples are drawn randomly (stratified); if γN_m ≤ M_m, γN_m positive samples are drawn randomly (stratified).
Strategy two: the positive samples of all data sets are grouped into one class, and the negative samples are grouped by negative sample type; suppose there are O_o negative sample classes in total, each class containing P_o negative samples. For example, if the printed picture attack appears simultaneously in the data sets {D_1, D_2, D_5}, then the printed picture attacks in the negative samples of these three data sets are grouped into one class. The construction method of strategy two is to draw γP_o samples randomly (stratified) from each negative sample class, so that γP_o·O_o negative samples are drawn in total over all the data. If γP_o·O_o is greater than the total number of positive samples, all positive samples are drawn randomly (stratified); if γP_o·O_o is less than or equal to the total number of positive samples, γP_o·O_o positive samples are drawn randomly (stratified).
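Purely as an illustration (not part of the claimed method), the following Python sketch implements strategy one under simple assumptions: each data set is represented as a dict with a 'pos' list and a 'neg' mapping from attack type to samples, and stratified sampling is approximated by drawing the same number of samples from every attack type.

    import random

    def build_strategy_one(datasets, gamma=0.5, seed=0):
        """Draw about gamma*N_m negatives (stratified by attack type) and up to the
        same number of positives from every data set D_m."""
        rng = random.Random(seed)
        positives, negatives = [], []
        for d in datasets:
            neg_types = d["neg"]                           # {attack_type: [samples]}
            n_m = sum(len(v) for v in neg_types.values())  # N_m
            per_type = int(gamma * n_m / max(1, len(neg_types)))  # gamma * N_m / O_m
            for samples in neg_types.values():
                negatives.extend(rng.sample(samples, min(per_type, len(samples))))
            want_pos = int(gamma * n_m)                    # gamma * N_m
            positives.extend(rng.sample(d["pos"], min(want_pos, len(d["pos"]))))
        return positives, negatives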
S102: data preprocessing. The data preprocessing comprises 3 steps. The first step is face detection: for video data, face detection is carried out once every n (n > 1) frames; if a face is detected, the second step is carried out, otherwise face detection is carried out on the next n frames; for image data, face detection is carried out directly; if a face is detected, the second step is carried out, otherwise face detection is carried out on the next image. The second step is face alignment: the face detected in the first step is aligned using a similarity transformation. The third step is cropping: the face aligned in the second step is cropped to an input size suitable for both the RGB model and the YUV model.
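A minimal preprocessing sketch is given below for illustration; it assumes that a separate face/landmark detector has already produced five facial landmarks, and the 112×112 crop size and the landmark template are common conventions rather than values taken from the patent.

    import cv2
    import numpy as np

    # Assumed canonical 5-point template for a 112x112 face crop (illustrative values).
    TEMPLATE_112 = np.float32([
        [38.29, 51.69], [73.53, 51.50], [56.02, 71.74],
        [41.55, 92.37], [70.73, 92.20],
    ])

    def align_and_crop(img_rgb, landmarks, size=112):
        """Second step (similarity-transform alignment) and third step (cropping)."""
        src = np.float32(landmarks)                             # 5 detected (x, y) points
        m, _ = cv2.estimateAffinePartial2D(src, TEMPLATE_112)   # rotation + uniform scale + translation
        return cv2.warpAffine(img_rgb, m, (size, size))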
S103: training the RGB model. The images preprocessed in S102 are input into the RGB model for training. The RGB model refers to a deep learning model whose input image is in the RGB color space; its network backbone may be, but is not limited to, convolutional neural networks such as VGG, GoogLeNet, ResNet, DenseNet, MobileNet V1, MobileNet V2, MobileFaceNet, EfficientNet, ShuffleNet, and variants thereof. In particular, the RGB model here is pre-trained on the ImageNet data set.
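As a hedged illustration of one training branch, the PyTorch sketch below uses a MobileNet V2 backbone pre-trained on ImageNet (both mentioned above) with a binary live/attack head; the optimizer, learning rate and epoch count are assumptions, not values specified in the patent.

    import torch
    import torch.nn as nn
    from torchvision import models

    def build_branch(num_classes=2):
        # ImageNet-pretrained MobileNet V2 backbone with a 2-class (attack/live) head
        net = models.mobilenet_v2(weights="IMAGENET1K_V1")
        net.classifier[1] = nn.Linear(net.last_channel, num_classes)
        return net

    def train_branch(net, loader, epochs=20, lr=1e-3, device="cuda"):
        net = net.to(device).train()
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:          # x: Bx3xHxW float tensor, y: 0 = attack, 1 = live
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss = loss_fn(net(x), y)
                loss.backward()
                opt.step()
        return net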
S104: training the YUV model. The images preprocessed in S102 are converted from the RGB color space to the YUV color space through color space conversion and input into the YUV model for training. The YUV model is a deep learning model whose input image is in a YUV color space; the YUV color space includes, but is not limited to, YCrCb, YCbCr, YPbPr, YDbDr, and the like. The network backbone of the YUV model may be, but is not limited to, convolutional neural networks such as VGG, GoogLeNet, ResNet, DenseNet, MobileNet V1, MobileNet V2, MobileFaceNet, EfficientNet, ShuffleNet, and variants thereof. In particular, the YUV color space here refers to YCrCb, and the YUV model is pre-trained on the ImageNet data set.
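The YUV branch differs only in its input color space. A minimal conversion sketch (assuming uint8 RGB face crops and the YCrCb variant named above) is shown here; the same build_branch/train_branch sketch from the previous step can then be reused on the converted images.

    import cv2
    import numpy as np

    def rgb_to_ycrcb(img_rgb: np.ndarray) -> np.ndarray:
        """Convert an HxWx3 uint8 RGB face crop to YCrCb before feeding the YUV model."""
        return cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb)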
S105: finishing the model training. The two models of S103 and S104 are trained respectively; when the models converge and reach the expected precision on the validation set or test set, model training is complete and the next step is deployment in the prediction stage.
In the prediction stage, the input visible light image is preprocessed, then input into the RGB model and, after color space conversion, into the YUV model; scores are obtained from each model, the final score is obtained through the score fusion scheme, and the final living body detection result is judged from it. The inference stage comprises the following specific steps:
s201: a visible light image is input. An RGB image containing a human face is acquired by a visible light camera, and the image is an input of S202.
S202: image preprocessing. The RGB image acquired in S201 is preprocessed with the same method as in step S102.
S203: forward calculation of the RGB model. The image preprocessed in S202 is sent to the trained RGB model for forward calculation.
S204: obtaining the RGB model score. The network output after the forward calculation in S203 is obtained; let the output negative sample probability be p1 and the positive sample probability be p2, recorded as (p1, p2), where 0 ≤ p1 ≤ 1, 0 ≤ p2 ≤ 1 and p1 + p2 = 1.
S205: color space conversion. The image preprocessed in S202 is converted to the YUV color space; in particular, the YUV color space here refers to YCrCb.
S206: forward calculation of the YUV model. The image after the color space conversion in S205 is sent to the trained YUV model for forward calculation.
S207: obtaining the YUV model score. The network output after the forward calculation in S206 is obtained; let the output negative sample probability be p3 and the positive sample probability be p4, recorded as (p3, p4), where 0 ≤ p3 ≤ 1, 0 ≤ p4 ≤ 1 and p3 + p4 = 1.
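Purely as an illustration of steps S201 to S207, the sketch below chains the preprocessing, the two forward passes and the softmax scores; the helper names preprocess, rgb_net and yuv_net are assumptions, and PyTorch is assumed as in the training sketches.

    import cv2
    import torch
    import torch.nn.functional as F

    def predict_scores(img_rgb, rgb_net, yuv_net, preprocess, device="cuda"):
        face = preprocess(img_rgb)                        # S202: detection, alignment, cropping
        ycrcb = cv2.cvtColor(face, cv2.COLOR_RGB2YCrCb)   # S205: color space conversion

        def to_tensor(a):
            t = torch.from_numpy(a).permute(2, 0, 1).float() / 255.0
            return t.unsqueeze(0).to(device)

        with torch.no_grad():
            p_rgb = F.softmax(rgb_net(to_tensor(face)), dim=1)[0]   # S203-S204 -> (p1, p2)
            p_yuv = F.softmax(yuv_net(to_tensor(ycrcb)), dim=1)[0]  # S206-S207 -> (p3, p4)
        return tuple(p_rgb.tolist()), tuple(p_yuv.tolist())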
S208: model score fusion. The model score fusion strategy comprises the following 2 types:
Strategy one: the final probability output is the mean of the probabilities of the individual models. Specifically, the final negative sample probability is α × (p1 + p3) and the positive sample probability is β × (p2 + p4), recorded as (α × (p1 + p3), β × (p2 + p4)), where 0 ≤ α, β ≤ 1 and α + β = 1. Typically, α = 0.5 and β = 0.5.
Strategy two: step-by-step judgment, whose flow is shown in figure 3. Let the RGB model threshold be T1 (0.5 ≤ T1 < 1) and the YUV model threshold be T2 (0.5 ≤ T2 < 1). The output of the RGB model is judged first: if p2 < 1 - T1 or p2 > T1, the final output is the RGB model output; otherwise the output of the YUV model is judged. If p4 < 1 - T2 or p4 > T2 in the YUV model, the final output is the YUV model output; otherwise the final model output is the probability output of strategy one, i.e. the final negative sample probability is α × (p1 + p3) and the positive sample probability is β × (p2 + p4), recorded as (α × (p1 + p3), β × (p2 + p4)), where 0 ≤ α, β ≤ 1 and α + β = 1. Typically, α = 0.5 and β = 0.5.
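The two fusion strategies, together with the final threshold judgment of step S209 described next, can be sketched as follows; α = β = 0.5 matches the typical values stated above, while the 0.9 thresholds are assumed values chosen only to satisfy 0.5 ≤ T < 1.

    def fuse_mean(p_rgb, p_yuv, alpha=0.5, beta=0.5):
        # Strategy one: element-wise weighted mean -> (negative prob, positive prob)
        (p1, p2), (p3, p4) = p_rgb, p_yuv
        return (alpha * (p1 + p3), beta * (p2 + p4))

    def fuse_stepwise(p_rgb, p_yuv, t1=0.9, t2=0.9, alpha=0.5, beta=0.5):
        # Strategy two: use a confident RGB branch first, then a confident YUV branch,
        # and fall back to mean fusion when neither branch is confident.
        p2, p4 = p_rgb[1], p_yuv[1]
        if p2 < 1 - t1 or p2 > t1:
            return p_rgb
        if p4 < 1 - t2 or p4 > t2:
            return p_yuv
        return fuse_mean(p_rgb, p_yuv, alpha, beta)

    def is_live(final_probs, t3=0.9):
        # Step S209: judged live (positive sample) when the positive probability >= T3
        return final_probs[1] >= t3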
S209: judging the result of the living body detection model. Let the living body judgment threshold be T3 (0.5 ≤ T3 < 1). If the positive sample probability output by S208 is greater than or equal to T3, the image is judged to be a living body image (positive sample); if it is less than T3, the image is judged to be an attack image (negative sample). For the model thresholds above, typically T1 = T2 = T3.
The present invention also provides a computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, performs the steps of the method for live human face detection with multi-model feature migration.
The invention also provides computer equipment which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the human face living body detection method of multi-model feature migration.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the invention are also within the protection scope of the invention.

Claims (10)

1. A human face living body detection method based on multi-model feature migration, characterized by comprising the following steps:
S1, acquiring a visible light image, wherein the visible light image contains a human face;
S2, identifying the visible light image by using a first RGB model to obtain a first living body probability;
S3, identifying the visible light image by using a second YUV model to obtain a second living body probability;
S4, determining a third living body probability according to the first living body probability and the second living body probability;
and S5, judging whether the image is a living body according to the third living body probability.
2. The method of claim 1, wherein the first living body probability comprises a negative sample probability p1 and a positive sample probability p2, and the second living body probability comprises a negative sample probability p3 and a positive sample probability p4.
3. The method according to claim 2, wherein step S4 further comprises: the third living body probability is the mean of the probabilities of the two models, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
4. The method according to claim 2, wherein step S4 further comprises step-by-step judgment: setting a first RGB model threshold T1 and a second YUV model threshold T2, where 0.5 ≤ T1 < 1 and 0.5 ≤ T2 < 1; specifically, the output of the first RGB model is judged first: if p2 < 1 - T1 or p2 > T1, the finally output third living body probability is the first living body probability output by the first RGB model, otherwise the output of the second YUV model is judged; if p4 < 1 - T2 or p4 > T2 in the second YUV model, the finally output third living body probability is the second living body probability output by the second YUV model, otherwise the third living body probability is the mean of the model probabilities, i.e. the final negative sample probability is α × (p1 + p3) and the final positive sample probability is β × (p2 + p4), where 0 ≤ α, β ≤ 1 and α + β = 1.
5. The method according to claim 1, wherein step S5 comprises: setting a living body judgment threshold T3, where 0.5 ≤ T3 < 1; if the final positive sample probability is greater than or equal to T3, the image is judged to be a positive sample; if it is less than T3, the image is judged to be a negative sample.
6. The method of claim 1, further comprising a step S0 before step S1: in the training phase, visible light images from the heterogeneous data set are fused, face detection, alignment and cropping are performed, and the first RGB model and the second YUV model are trained respectively until the models converge.
7. The method of claim 6, wherein in step S0 the training phase comprises the steps of:
S101: constructing a heterogeneous data set by collecting data sets and selecting only images or videos under visible light; the positive samples are the real samples in the heterogeneous data set, and the negative samples are the attack samples in the heterogeneous data set;
S102: data preprocessing, comprising 3 steps:
A: face detection: for video data, face detection is performed once every n frames; if a face is detected, the next step is performed, otherwise face detection is performed on the next n frames; for image data, face detection is performed directly; if a face is detected, the next step is performed, otherwise face detection is performed on the next image;
B: face alignment: the face detected in step A is aligned using a similarity transformation;
C: cropping: the face aligned in step B is cropped to an input size suitable for both the first RGB model and the second YUV model;
S103: training the first RGB model: the face images preprocessed in S102 are input into the first RGB model for training;
S104: training the second YUV model: the face images preprocessed in S102 are converted from the RGB color space to the YUV color space through color space conversion and input into the second YUV model for training;
S105: the two models of S103 and S104 are trained respectively; when the models converge and reach the expected accuracy on the validation set or test set, model training is complete and step S1 is entered.
8. The method according to claim 7, wherein in step S101 the heterogeneous data set comprises public or private data sets, and the public data sets comprise: Replay-Attack, Print-Attack, Yale-Recaptured, CASIA-MFSD, MSU-MFSD, Replay-Mobile, Mspoof, SiW, Oulu-NPU, VAD, NUAA, or CASIA-SURF.
9. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, carries out the steps of the human face living body detection method based on multi-model feature migration according to any one of claims 1-8.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the human face living body detection method based on multi-model feature migration according to any one of claims 1-8.
CN202010728371.7A 2020-07-23 2020-07-23 Human face in-vivo detection method based on multi-model feature migration Pending CN111881815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728371.7A CN111881815A (en) 2020-07-23 2020-07-23 Human face in-vivo detection method based on multi-model feature migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728371.7A CN111881815A (en) 2020-07-23 2020-07-23 Human face in-vivo detection method based on multi-model feature migration

Publications (1)

Publication Number Publication Date
CN111881815A true CN111881815A (en) 2020-11-03

Family

ID=73201464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728371.7A Pending CN111881815A (en) 2020-07-23 2020-07-23 Human face in-vivo detection method based on multi-model feature migration

Country Status (1)

Country Link
CN (1) CN111881815A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6295369B1 (en) * 1998-11-06 2001-09-25 Shapiro Consulting, Inc. Multi-dimensional color image mapping apparatus and method
US20140270409A1 (en) * 2013-03-15 2014-09-18 Eyelock, Inc. Efficient prevention of fraud
US20180025217A1 (en) * 2016-07-22 2018-01-25 Nec Laboratories America, Inc. Liveness detection for antispoof face recognition
CN110008783A (en) * 2018-01-04 2019-07-12 杭州海康威视数字技术股份有限公司 Human face in-vivo detection method, device and electronic equipment based on neural network model
US20200005061A1 (en) * 2018-06-28 2020-01-02 Beijing Kuangshi Technology Co., Ltd. Living body detection method and system, computer-readable storage medium
CN109034102A (en) * 2018-08-14 2018-12-18 腾讯科技(深圳)有限公司 Human face in-vivo detection method, device, equipment and storage medium
CN109598242A (en) * 2018-12-06 2019-04-09 中科视拓(北京)科技有限公司 A kind of novel biopsy method
CN109840467A (en) * 2018-12-13 2019-06-04 北京飞搜科技有限公司 A kind of in-vivo detection method and system
CN111368731A (en) * 2020-03-04 2020-07-03 上海东普信息科技有限公司 Silent in-vivo detection method, silent in-vivo detection device, silent in-vivo detection equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
汪亚航; 宋晓宁; 吴小俊: "Two-stream face liveness detection network combined with mixed pooling" (结合混合池化的双流人脸活体检测网络), Journal of Image and Graphics (中国图象图形学报), no. 07 *
牛德姣, 詹永照, 宋顺林: "Face detection and tracking in real-time video images" (实时视频图像中的人脸检测与跟踪), Journal of Computer Applications (计算机应用), no. 06 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination