CN112990090A - Face living body detection method and device - Google Patents


Info

Publication number
CN112990090A
CN112990090A
Authority
CN
China
Prior art keywords
body detection
living body
model
face
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110380564.2A
Other languages
Chinese (zh)
Inventor
聂凤梅
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing HJIMI Technology Co Ltd
Original Assignee
Beijing HJIMI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing HJIMI Technology Co Ltd filed Critical Beijing HJIMI Technology Co Ltd
Priority to CN202110380564.2A priority Critical patent/CN112990090A/en
Publication of CN112990090A publication Critical patent/CN112990090A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 - Spoof detection, e.g. liveness detection
    • G06V40/45 - Detection of the body part being alive
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification

Abstract

The embodiment of the invention discloses a face living body detection method and device. The method comprises: constructing a living body detection model whose convolution layer combines a first convolution process with a linear mapping process; inputting a face image to be detected into the living body detection model to obtain an output value; and determining a living body detection result based on the comparison of the output value with a preset living body detection threshold. Because the convolution processing combines convolution with linear mapping, the scheme can extract features from the face image that are effective for living body detection and improve detection accuracy, while keeping the computation small and the model lightweight; the scheme therefore achieves a higher face detection speed while guaranteeing the accuracy of face living body detection.

Description

Face living body detection method and device
Technical Field
The invention relates to the field of artificial intelligence, in particular to a human face in-vivo detection method and a human face in-vivo detection device.
Background
At present, face recognition technology is widely applied across industries and fields, and face living body detection has long been a research focus within face recognition.
At present, face living body detection methods can be divided into two main categories: methods based on texture features and methods based on temporal information. Texture-based methods distinguish living bodies from non-living bodies using texture information in the face image, and can in turn be divided into traditional methods and deep learning methods: traditional methods require hand-designed feature extraction, whereas deep learning methods do not need manually designed features and extract them automatically, relying on the strong learning capability of a deep learning model to complete the face living body detection task. Temporal methods obtain temporal cues from consecutive frames to perform face living body detection. Earlier temporal methods required capturing mouth movements or eye movements (blinks) in video frames to distinguish living from non-living subjects; in recent years more general methods obtain richer temporal cues from video, such as CNN-LSTM-based methods.
Generally speaking, texture-based face living body detection uses a single-frame face image and is relatively fast, but the extracted features are limited and errors occur easily in complex scenes; temporal methods use video (consecutive frames) for face living body detection, achieving higher detection accuracy but relatively lower speed.
Disclosure of Invention
In view of this, the present invention provides the following technical solutions:
a face in-vivo detection method comprises the following steps:
constructing a living body detection model, wherein the convolution layer of the living body detection model comprises the combination of first convolution processing and linear mapping processing;
inputting a face image to be detected into the living body detection model to obtain an output value;
determining a live body detection result based on a comparison result of the output value with a preset live body detection threshold value.
Optionally, the convolution layer of the living body detection model includes series processing of a central difference convolution and a linear mapping.
Optionally, the constructing the in-vivo detection model includes:
acquiring a data set, and dividing the data set into a training data set and a verification data set;
building a data model, wherein a convolution layer of the data model comprises the combination of first convolution processing and linear mapping processing;
initializing the data model;
and training the data model based on the training data set to obtain a living body detection model.
Optionally, the training termination condition of the data model is that the output of the loss function of the data model is smaller than a first set value or the number of training rounds reaches a second set value.
Optionally, after the obtaining the in-vivo detection model, the method further includes:
and verifying the living body detection model by adopting the verification data set to determine a living body detection threshold value.
Optionally, the verifying the in-vivo detection model by using the verification data set to determine an in-vivo detection threshold includes:
normalizing all output results obtained by inputting the verification data set into the living body detection model into a numerical value between 0 and 1;
setting N thresholds with values in the interval from 0 to 1, where N is a positive integer, and counting the living body detection rate under each threshold;
determining the threshold value at which the live body detection rate is maximum as a live body detection threshold value.
Optionally, the determining a living body detection result based on the comparison result of the output value and a preset living body detection threshold includes:
determining that the object corresponding to the face image to be detected is a non-living body under the condition that the output value is smaller than the living body detection threshold value;
and determining that the object corresponding to the face image to be detected is a living body under the condition that the output value is greater than or equal to the living body detection threshold value.
A face liveness detection device, comprising:
the model building module is used for building a living body detection model, and the convolution layer of the living body detection model comprises the combination of first convolution processing and linear mapping processing;
the output determining module is used for inputting the face image to be detected into the living body detection model to obtain an output value;
and the result determining module is used for determining a living body detection result based on the comparison result of the output value and a preset living body detection threshold value.
Optionally, the convolution layer of the living body detection model includes series processing of a central difference convolution and a linear mapping.
Optionally, the model building module includes:
the data set acquisition module is used for acquiring a data set and dividing the data set into a training data set and a verification data set;
the model building module is used for building a data model, and the convolution layer of the data model comprises the combination of first convolution processing and linear mapping processing;
an initialization module for initializing the data model;
and the model training module is used for training the data model based on the training data set to obtain a living body detection model.
Compared with the prior art, the embodiment of the invention discloses a face living body detection method and device. The method comprises: constructing a living body detection model whose convolution layer combines a first convolution process with a linear mapping process; inputting a face image to be detected into the living body detection model to obtain an output value; and determining a living body detection result based on the comparison of the output value with a preset living body detection threshold. Because the convolution processing combines convolution with linear mapping, the scheme can extract features from the face image that are effective for living body detection and improve detection accuracy, while keeping the computation small and the model lightweight; the scheme therefore achieves a higher face detection speed while guaranteeing the accuracy of face living body detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a face live detection method disclosed in an embodiment of the present invention;
FIG. 2 is a flowchart of a method for constructing a living body detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a human face in-vivo detection model disclosed in the embodiment of the present invention;
FIG. 4 is a schematic diagram of a training process of an in-vivo detection model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a human face living body detection device disclosed in the embodiment of the invention.
Detailed Description
For ease of reference and clarity, the descriptions and abbreviations of the technical terms used hereinafter are summarized as follows:
Face recognition: judging whose face the current face is, i.e., determining the identity of the face.
Face living body detection: judging whether the currently obtained face comes from a living body (a real person) or a non-living body (such as a photo, a video, a 3D head model, headgear camouflage, or a real person wearing a mask).
Face Anti-Spoofing: FAS for short; face fraud prevention, i.e., face living body detection. In face living body detection, living objects are generally labeled "real" and non-living objects "fake".
CNN: convolutional neural network, one of the deep learning methods.
CDC: central difference convolution.
CDCN: Central Difference Convolutional Network, a face living body detection model.
ghostnet: a general classification model.
FAR: an index for evaluating the face living body detection effect, equal to the number of non-living samples detected as living divided by the total number of non-living samples.
LSTM: short for Long Short-Term Memory, a recurrent neural network.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the application can be applied to an electronic device; the product form of the electronic device is not limited by the application and may include, but is not limited to, a smart phone, a tablet computer, a wearable device, a personal computer (PC), a netbook, and the like, selected according to application requirements.
Fig. 1 is a flowchart of a face live detection method disclosed in an embodiment of the present invention, and referring to fig. 1, the face live detection method may include:
step 101: and constructing a living body detection model, wherein the convolution layer of the living body detection model comprises the combination of the first convolution processing and the linear mapping processing.
Specifically, the convolution layer of the living body detection model may include series processing of a central difference convolution and a linear mapping. "In series" means that the processing result of the central difference convolution is used as the input of the linear mapping.
In the embodiment of the application, the constructed living body detection model converts the original convolution layer, and combines the first convolution processing and the linear mapping processing in the converted convolution layer, so that the human face living body detection can have the advantages of the first convolution processing mode and the linear mapping processing at the same time. The first convolution processing mode is not limited fixedly, and may be any existing convolution processing mode with a better implementation effect, such as CDC.
In the following embodiments, a detailed description will be given of a specific implementation of constructing the in-vivo detection model, which will not be described herein too much.
Step 102: and inputting the face image to be detected into the living body detection model to obtain an output value.
To judge whether a face image shows a living body, the face image to be detected must be processed. After the construction of the living body detection model is completed, the model can be put into online use. In this embodiment, the output of the living body detection model may be a single value.
Step 103: determining a live body detection result based on a comparison result of the output value with a preset live body detection threshold value.
The living body detection threshold may be a threshold determined based on a verification result in a living body detection model verification stage, and specific implementation will be described in detail in the following embodiments.
In the face living body detection method of this embodiment, the convolution processing combines convolution with linear mapping, so not only can features effective for living body detection be extracted from the face image and the detection accuracy improved, but the model also has low computational cost and is lightweight; the scheme therefore achieves a higher face detection speed while guaranteeing face living body detection accuracy.
In the above embodiment, the process of constructing the living body detection model may refer to fig. 2, and as shown in fig. 2, constructing the living body detection model may include:
step 201: a data set is acquired and divided into a training data set and a validation data set.
Step 202: and building a data model, wherein the convolution layer of the data model comprises the combination of the first convolution processing and the linear mapping processing.
Step 203: the data model is initialized.
Step 204: and training the data model based on the training data set to obtain a living body detection model.
The training termination condition of the data model may be, but is not limited to, that the output of the loss function of the data model is less than a first set value or the number of training rounds reaches a second set value.
For a better understanding of the above implementation, the following description will be made in detail on the background and specific implementation of the construction of the in-vivo examination model. In the following description, the first convolution process is exemplified by CDC, and the linear mapping process is exemplified by ghost convolution (a convolution processing method).
Compared with extracting features by the original convolution operation, extracting features with ghost convolution (replacing the original convolution operation) has the advantages of higher speed with comparable accuracy on general classification tasks. The principle of ghost convolution is as follows:
assume that the original convolutional layer operates as follows:
Y=X*f+b (1)
In equation (1), $*$ denotes the convolution operator, and $X \in \mathbb{R}^{c \times h \times w}$ denotes the input data, where $c$ is the number of input channels and $h, w$ are the height and width of the input. $b$ denotes a bias term, and $Y \in \mathbb{R}^{h' \times w' \times n}$ denotes the output feature map, where $n$ is the number of output channels and $h', w'$ are the height and width of the output. $f \in \mathbb{R}^{c \times k \times k \times n}$ denotes the convolution kernel of this layer, with $k \times k$ the kernel size. The required FLOPs (floating-point operations) are $n \cdot h' \cdot w' \cdot k \cdot k \cdot c$; since $c$ and $n$ are usually large, the FLOPs are also large.
The ghost convolution operation is as follows:
Y′=X*f′ (2)
yij=Φi,j(y′i),i=1,…,m,j=1,…,s, (3)
In equation (2), $Y' \in \mathbb{R}^{h' \times w' \times m}$ and $f' \in \mathbb{R}^{c \times k \times k \times m}$ with $m \le n$; the bias term is omitted for simplicity of expression. In equation (3), $y'_i$ denotes the feature of the $i$-th channel of $Y'$, and $\Phi_{i,j}$ generates the $j$-th feature $y_{ij}$ from the $i$-th feature of $Y'$, with $i \in [1, m]$, $j \in [1, s]$, and $n = m \cdot s$. Here $s \ll c$, and $\Phi_{i,s}$ is the identity mapping, i.e., $y_{is} = y'_i$. The FLOPs of the ghost convolution are then:
$$\mathrm{FLOPs}_{ghost} = \frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d \tag{4}$$
In equation (4), $d$ denotes the size of the linear operation kernel, analogous to the convolution kernel size $k$, with $d \approx k$.
The ratio of the FLOPs of the original convolution to those of the ghost convolution is:

$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{5}$$
The ratio of the number of parameters of the original convolution to that of the ghost convolution is:

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{6}$$
It can be seen from equations (5) and (6) that replacing the original convolution operation with the ghost convolution operation reduces both the FLOPs and the parameters required for the operation, i.e., it increases the operation speed of the model and makes the model lighter.
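As a concrete illustration of equations (2)-(3), the following is a minimal numpy sketch of a ghost convolution, not the patent's implementation: the primary convolution, the cheap per-channel (depthwise) linear maps, and the identity branch. All shapes, kernels, and the loop-based convolution are illustrative placeholders.

```python
import numpy as np

def conv2d(x, f):
    """Valid 2-D multi-channel convolution: x (c,h,w), f (n,c,k,k) -> (n,h',w')."""
    c, h, w = x.shape
    n, _, k, _ = f.shape
    ho, wo = h - k + 1, w - k + 1
    out = np.zeros((n, ho, wo))
    for o in range(n):
        for i in range(ho):
            for j in range(wo):
                out[o, i, j] = np.sum(f[o] * x[:, i:i + k, j:j + k])
    return out

def ghost_conv(x, f_prime, cheap_kernels):
    """Ghost convolution, eqs (2)-(3): a primary convolution produces m maps,
    then each of the s-1 cheap depthwise linear maps, plus the identity map,
    expands them to n = m*s output channels."""
    y_prime = conv2d(x, f_prime)            # eq (2): (m, h', w')
    m, ho, wo = y_prime.shape
    outputs = [y_prime]                     # identity mapping Phi_{i,s}
    for phi in cheap_kernels:               # phi: (m, d, d), one d*d kernel per channel
        d = phi.shape[-1]
        pad = d // 2                        # 'same' padding so shapes match y_prime
        yp = np.pad(y_prime, ((0, 0), (pad, pad), (pad, pad)))
        extra = np.stack([conv2d(yp[i:i + 1], phi[i][None, None])[0]
                          for i in range(m)])
        outputs.append(extra)
    return np.concatenate(outputs, axis=0)  # (m*s, h', w')
```

With one set of cheap kernels ($s = 2$), the expensive $k \times k$ convolution only has to produce half the output channels, which is where the roughly $s$-fold FLOPs saving of equation (5) comes from.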
The principle of the CDC convolution operation is as follows:
The original convolution is expressed as:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n) \tag{7}$$

In equation (7), $x$ denotes the input feature map, $w$ denotes the convolution kernel, $p_0$ denotes the current computation position in the input and output feature maps, and $p_n$ traverses each position in the local receptive field $\mathcal{R}$; for example, for a $3 \times 3$ kernel, $\mathcal{R} = \{(-1,-1), (-1,0), \dots, (1,1)\}$.
The central difference convolution (CDC) can be expressed as:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot \big(x(p_0 + p_n) - x(p_0)\big) \tag{8}$$
Equation (8) shows that the central difference convolution is equivalent to subtracting the input at the center position from the input at every position of the receptive field before performing the convolution with the kernel.
Combining the traditional convolution and the central difference convolution, with a hyper-parameter $\theta \in [0, 1]$ weighting the two terms, gives the CDC convolution module:

$$y(p_0) = \theta \cdot \sum_{p_n \in \mathcal{R}} w(p_n) \cdot \big(x(p_0 + p_n) - x(p_0)\big) + (1 - \theta) \cdot \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n) \tag{9}$$
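The CDC module of equation (9) can be sketched directly in numpy as follows. This is a per-position reference implementation for clarity, not an efficient one; the value theta=0.7 is only an illustrative default, since the patent says theta must be found by experiment.

```python
import numpy as np

def cdc_conv2d(x, w, theta=0.7):
    """CDC convolution module, eq (9): theta weights the central-difference
    term against the vanilla convolution term.
    x: (c, h, w) input; w: (n, c, k, k) kernel; theta in [0, 1]."""
    c, h, wd = x.shape
    n, _, k, _ = w.shape
    ho, wo = h - k + 1, wd - k + 1
    out = np.zeros((n, ho, wo))
    for o in range(n):
        for i in range(ho):
            for j in range(wo):
                patch = x[:, i:i + k, j:j + k]
                vanilla = np.sum(w[o] * patch)                # eq (7)
                # central difference: subtract the centre input x(p0) everywhere
                center = x[:, i + k // 2, j + k // 2][:, None, None]
                cdc = np.sum(w[o] * (patch - center))         # eq (8)
                out[o, i, j] = theta * cdc + (1 - theta) * vanilla
    return out
```

Setting theta=0 recovers the plain convolution of equation (7); setting theta=1 on a spatially constant input yields zero everywhere, since every central difference vanishes.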
with reference to the above, the specific implementation of the face live detection method according to the embodiment of the present application may include:
1. data set preparation
A data collection system is used to collect images containing human faces under various scenarios (distance from the camera, illumination conditions, various attack modes, various pose angles, and so on); a face detection system extracts the face images from these images, and each face image is labeled according to whether it shows a living body: 1 for a living body, 0 for a non-living body. For example, 350,000 images of each class are taken out at random, 700,000 images in total, as the training data set for training the face living body detection model; a further 50,000 images of each class, 100,000 images in total, are taken out as the verification set for determining the model classification threshold.
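The labeling-and-split step above can be sketched as follows. This is a minimal stdlib sketch under the assumption that samples arrive as (image id, label) pairs; the per-class counts (350,000 train / 50,000 validation in the text's example) are parameters, and small numbers work the same way.

```python
import random

def label_and_split(samples, n_train, n_val, seed=0):
    """Balanced train/validation split of labelled face crops.
    `samples` is a list of (image_id, label) pairs, label 1 = living, 0 = non-living.
    n_train / n_val are per-class counts."""
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for item in samples:
        by_label[item[1]].append(item)
    train, val = [], []
    for label in (0, 1):
        items = by_label[label]
        rng.shuffle(items)                          # random draw, as in the text
        train += items[:n_train]                    # disjoint slices: no leakage
        val += items[n_train:n_train + n_val]
    return train, val
```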
2. Model building
The invention proposes a new convolution operation, ghost-CDC, to replace the original convolution operation. The principle of the ghost-CDC convolution operation is that, since the features extracted by the original convolution are highly correlated, the original convolution layer can be replaced by a convolution layer connected in series with a linear mapping; this preserves the efficiency of feature extraction while reducing the computation of the model. Combining the ghost convolution operation with the CDC convolution module yields the new convolution mode ghost-CDC, as follows:
assuming that the original convolution is shown in equation (1), the improved convolution ghost-CDC is as follows:
Figure BDA0003012782190000082
yij=Φi,j(y′i),i=1,…,m,j=1,…,s, (11)
(10) in the formulaY′∈Rh′×w′×mWherein m represents the number of channels of the output characteristic, m is less than or equal to n, h ', w' respectively represent the height and width of the output. X is formed by Rc×h×wRepresenting the input data, c representing the number of input channels, h, w representing the height and width of the input, respectively. f' is epsilon of Rc ×k×k×mThe convolution kernel of this layer is denoted, and k × k denotes the size of the convolution kernel.
Figure BDA0003012782190000083
The 1 st and 2 nd dimensions of the convolution kernel are summed to obtain a sum having a shape of c × m, and reshape means that the output after summing the convolution kernels is transformed into a shape of c × 1 × 1 × m for convolution to be able to be performed. Theta is [0,1 ]]And (4) the hyper-parameter needs to be obtained by experiments. (11) Y 'in the formula'iDenotes the characteristic of the ith channel of Y'. phii,jGenerating the j-th feature Y from the i-th feature of YijThe value range of i is [1, m ]]J has a value in the range of [1, s ]]。n=m·s。s<<c,Φi,sRepresenting identity mapping, i.e. yij=y′i
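The primary branch of ghost-CDC (equation (10)) can be sketched in numpy as below. This is an illustrative reconstruction, not the patent's code: the kernel layout here is (m, c, k, k) rather than the patent's c x k x k x m, the centre-position crop plays the role of evaluating the 1x1 "reshape(sum)" convolution at each receptive-field centre $p_0$, and the cheap linear maps of equation (11) would then follow exactly as in the plain ghost convolution.

```python
import numpy as np

def conv2d(x, f):
    """Valid 2-D convolution: x (c,h,w), f (m,c,k,k) -> (m,h',w')."""
    c, h, w = x.shape
    m, _, k, _ = f.shape
    ho, wo = h - k + 1, w - k + 1
    out = np.zeros((m, ho, wo))
    for o in range(m):
        for i in range(ho):
            for j in range(wo):
                out[o, i, j] = np.sum(f[o] * x[:, i:i + k, j:j + k])
    return out

def ghost_cdc_primary(x, f_prime, theta):
    """Primary branch of ghost-CDC, eq (10): the vanilla convolution with f'
    minus theta times a 1x1 convolution whose weights are the spatial sums
    of f', applied at the centre x(p0) of each receptive field."""
    k = f_prime.shape[-1]
    y_vanilla = conv2d(x, f_prime)                      # X * f'
    f_sum = f_prime.sum(axis=(2, 3), keepdims=True)     # reshape(sum): (m, c, 1, 1)
    r = k // 2                                          # crop to centre positions p0
    x_center = x[:, r:x.shape[1] - r, r:x.shape[2] - r]
    return y_vanilla - theta * conv2d(x_center, f_sum)  # eq (10)
```

As with the CDC module, theta=0 recovers the plain convolution and theta=1 on a constant input gives zero, which is a quick sanity check that the subtracted 1x1 term really implements the central-difference correction.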
A deep learning face living body detection model is then established with ghost-CDC as the basic convolution module. A schematic structural diagram of the face living body detection model is shown in fig. 3.
3. Training human face living body detection model
Model hyper-parameters such as the initial learning rate, the maximum number of training rounds (epochs) and theta are set, and a loss function is defined to guide model updating: a cross-entropy loss function if face living body detection is regarded as a classification problem, or MSE if it is regarded as a regression problem. Model training ends when the output of the loss function is small enough or the training reaches the maximum number of training rounds. Fig. 4 is a schematic diagram of the training process of the living body detection model according to an embodiment of the present invention, which can be understood with reference to fig. 4.
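The two termination conditions above (loss below a set value, or the round count reaching its maximum) can be sketched with a toy training loop. Logistic regression with a cross-entropy loss stands in for the liveness model here, purely for illustration; all hyper-parameter values are placeholders, not the patent's settings.

```python
import numpy as np

def train_model(X, y, lr=0.5, max_epochs=500, loss_eps=0.05, seed=0):
    """Toy training loop: stop when the loss falls below loss_eps (first
    termination condition) OR the epoch count reaches max_epochs (second)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    loss, epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))            # model output in (0, 1)
        loss = -np.mean(y * np.log(p + 1e-9)          # cross-entropy loss
                        + (1 - y) * np.log(1 - p + 1e-9))
        if loss < loss_eps:                           # condition 1: loss small enough
            break
        w -= lr * (X.T @ (p - y)) / len(y)            # gradient step
    return w, loss, epoch                             # epoch == max_epochs if condition 2 fired
```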
4. Determination of classification threshold of human face living body detection model
After training of the face living body detection model is finished, all data in the verification set are input into the trained model to obtain all outputs, and all output results are normalized to between 0 and 1. Starting from 0 and increasing the candidate threshold by 1/10000 each step, the living body detection rate under each candidate threshold is counted, and the threshold corresponding to the maximum living body detection rate is taken as the final living body detection threshold of the model. Of course, the threshold interval and the start/end values used in determining the threshold may also be chosen according to actual requirements.
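The threshold sweep just described can be sketched as follows. One assumption is made explicit in the code: the "living body detection rate" is read here as overall accuracy on the verification set, since the text does not define the rate precisely.

```python
import numpy as np

def pick_threshold(scores, labels, n_steps=10000):
    """Sweep candidate thresholds 1/n_steps, 2/n_steps, ... over (0, 1) and keep
    the one with the highest detection rate (interpreted as accuracy here).
    scores: model outputs normalised to [0, 1]; label 1 = living, 0 = non-living."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    best_t, best_rate = 0.0, -1.0
    for i in range(1, n_steps):
        t = i / n_steps
        pred = (scores >= t).astype(int)   # >= threshold -> living, < -> non-living
        rate = float(np.mean(pred == labels))
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate
```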
Based on the foregoing, after the obtaining the living body detection model, the method may further include: and verifying the in-vivo detection model by adopting the verification data set, and determining an in-vivo detection threshold value so as to be convenient for subsequent determination of detection results.
Based on the foregoing, verifying the living body detection model with the verification data set to determine a living body detection threshold may include: normalizing all output results obtained by inputting the verification data set into the living body detection model to a numerical value between 0 and 1; setting N thresholds with values in the interval from 0 to 1, where N is a positive integer, and counting the living body detection rate under each threshold; and determining the threshold at which the living body detection rate is maximum as the living body detection threshold.
Of course, the implementation of determining the live body detection threshold of the live body detection model in the embodiment of the present application is not limited, and for example, in addition to the implementation described above, a threshold when the equal error rate is reached on the verification set may be used as a final threshold, or a threshold when the FAR is equal to a specific value may be used.
The determining a living body detection result based on the comparison result of the output value with a preset living body detection threshold value may include: determining that the object corresponding to the face image to be detected is a non-living body under the condition that the output value is smaller than the living body detection threshold value; and determining that the object corresponding to the face image to be detected is a living body under the condition that the output value is greater than or equal to the living body detection threshold value.
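The comparison rule above amounts to a one-line decision function, sketched here for completeness:

```python
def liveness_decision(output_value, threshold):
    """Decision rule from the text: an output below the threshold means the
    object in the face image is a non-living body; at or above it, a living body."""
    return "living" if output_value >= threshold else "non-living"
```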
In the face living body detection method provided by the invention, a new convolution mode, ghost-CDC, replaces the original convolution mode in the face living body detection model. The model is trained with a pre-collected face liveness training data set; statistical analysis on the verification data set then yields a classification threshold (for example, the threshold at the equal error rate). The face image to be detected is input into the trained model to obtain the model output, which is compared with the threshold: when the output is smaller than the threshold, the face image to be detected is a non-living body; otherwise, it is a living body. In application, different network model structures can be built based on ghost-CDC, with different parameters, to realize face living body detection.
Considering that the features extracted by a deep learning model are highly correlated, and that CDC can extract effective face living body detection features under changing scenes, the method combines the linear operation in ghostnet with the CDC operation in CDCN to obtain the ghost-CDC operation. Replacing the convolution operation in the original face living body detection CNN model with the ghost-CDC operation improves both the accuracy and the speed of face living body detection.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
The method is described in detail in the embodiments disclosed above, and the method of the present invention can be implemented by various types of apparatuses, so that the present invention also discloses an apparatus, and the following detailed description will be given of specific embodiments.
Fig. 5 is a schematic structural diagram of a living human face detection apparatus according to an embodiment of the present invention, and referring to fig. 5, a living human face detection apparatus 50 may include:
a model building module 501, configured to build a living body detection model, where a convolution layer of the living body detection model includes a combination of a first convolution process and a linear mapping process.
And the output determining module 502 is configured to input the face image to be detected into the living body detection model to obtain an output value.
A result determining module 503, configured to determine a living body detection result based on a comparison result of the output value and a preset living body detection threshold.
In the face living body detection device of this embodiment, the convolution layer combines convolution processing with linear mapping processing, so that effective features beneficial to living body detection can be extracted from the face image and the living body detection accuracy is improved. The model is also light-weight and requires little computation, so this scheme ensures face living body detection accuracy while achieving a higher detection speed.
In one implementation, the convolution layer of the liveness detection model includes a series process of a central differential convolution and a linear mapping.
In one implementation, the model building module includes: a data set acquisition module, configured to acquire a data set and divide the data set into a training data set and a verification data set; a model building module, configured to build a data model whose convolution layer includes a combination of first convolution processing and linear mapping processing; an initialization module, configured to initialize the data model; and a model training module, configured to train the data model based on the training data set to obtain a living body detection model.
In one implementation, the training termination condition of the data model is that the output of the loss function of the data model is smaller than a first set value or the number of training rounds reaches a second set value.
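The dual termination condition described here (the loss dropping below a first set value, or the round count reaching a second set value) can be sketched as a simple loop; `step_fn`, which runs one training round and returns its loss, is a hypothetical callback standing in for the actual training procedure:

```python
def train_until(step_fn, loss_limit, max_rounds):
    """Run training rounds until the loss drops below loss_limit
    (the first set value) or max_rounds (the second set value) is reached."""
    loss = float("inf")
    for round_no in range(1, max_rounds + 1):
        loss = step_fn(round_no)   # one training round, returns its loss
        if loss < loss_limit:      # first termination condition
            return round_no, loss
    return max_rounds, loss        # second termination condition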
In one implementation, the model building module further includes: and the model verification module is used for verifying the living body detection model by adopting the verification data set and determining a living body detection threshold value.
In one implementation, the model validation module is specifically configured to: normalize all output results obtained by inputting the verification data set into the living body detection model to values between 0 and 1; set N thresholds, each taking a value in the interval from 0 to 1, and count the living body detection rate under each threshold, where N is a positive integer; and determine the threshold at which the living body detection rate is maximum as the living body detection threshold.
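The threshold sweep can be sketched as follows. This assumes min-max normalization and interprets the "living body detection rate" as the overall rate of correct live/non-live decisions on the verification set — both our own reading, since the embodiment does not pin these details down:

```python
import numpy as np

def choose_threshold(scores, is_live, n_thresholds=100):
    """Pick the threshold in [0, 1] that maximizes the detection rate."""
    # Normalize all model outputs on the verification set to [0, 1].
    s = (scores - scores.min()) / (scores.max() - scores.min())
    best_t, best_rate = 0.0, -1.0
    for t in np.linspace(0.0, 1.0, n_thresholds):   # N candidate thresholds
        pred_live = s >= t                          # decision rule of the embodiment
        rate = np.mean(pred_live == is_live)        # assumed "detection rate"
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate
```

A finer sweep (larger N) trades compute for a threshold closer to the true optimum; the equal-error-rate threshold mentioned earlier in the description could be found with the same sweep by minimizing the gap between the false-accept and false-reject rates instead.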
In one implementation, the result determining module is specifically configured to: determine that the object corresponding to the face image to be detected is a non-living body when the output value is smaller than the living body detection threshold; and determine that the object corresponding to the face image to be detected is a living body when the output value is greater than or equal to the living body detection threshold.
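The comparison itself reduces to a one-line rule; the function name is our own:

```python
def liveness_decision(output_value, threshold):
    # Below the threshold -> non-living body; at or above -> living body.
    return "living" if output_value >= threshold else "non-living"
```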
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A face living body detection method is characterized by comprising the following steps:
constructing a living body detection model, wherein the convolution layer of the living body detection model comprises the combination of first convolution processing and linear mapping processing;
inputting a face image to be detected into the living body detection model to obtain an output value;
determining a live body detection result based on a comparison result of the output value with a preset live body detection threshold value.
2. The face live-body detection method according to claim 1, wherein the convolution layer of the live-body detection model includes a series process of a central difference convolution and a linear mapping.
3. The human face in-vivo detection method according to claim 1, wherein the constructing of the in-vivo detection model comprises:
acquiring a data set, and dividing the data set into a training data set and a verification data set;
building a data model, wherein a convolution layer of the data model comprises the combination of first convolution processing and linear mapping processing;
initializing the data model;
and training the data model based on the training data set to obtain a living body detection model.
4. The face in-vivo detection method according to claim 3, wherein the training termination condition of the data model is that the output of the loss function of the data model is smaller than a first set value or the number of training rounds reaches a second set value.
5. The face live-body detection method according to claim 3, further comprising, after the obtaining the live-body detection model:
and verifying the living body detection model by adopting the verification data set to determine a living body detection threshold value.
6. The face living body detection method as claimed in claim 5, wherein the verifying the living body detection model with the verification data set to determine the living body detection threshold value comprises:
normalizing all output results obtained by inputting the verification data set into the living body detection model into a numerical value between 0 and 1;
setting N thresholds, each taking a value in the interval from 0 to 1, and counting the living body detection rate under each threshold, wherein N is a positive integer;
determining the threshold value at which the live body detection rate is maximum as a live body detection threshold value.
7. The face live body detection method according to claim 1, wherein the determining a live body detection result based on the comparison result of the output value and a preset live body detection threshold value comprises:
determining that the object corresponding to the face image to be detected is a non-living body under the condition that the output value is smaller than the living body detection threshold value;
and determining that the object corresponding to the face image to be detected is a living body under the condition that the output value is greater than or equal to the living body detection threshold value.
8. A face liveness detection device, comprising:
the model building module is used for building a living body detection model, and the convolution layer of the living body detection model comprises the combination of first convolution processing and linear mapping processing;
the output determining module is used for inputting the face image to be detected into the living body detection model to obtain an output value;
and the result determining module is used for determining a living body detection result based on the comparison result of the output value and a preset living body detection threshold value.
9. The face liveness detection device of claim 8, wherein the convolution layer of the liveness detection model comprises a series process of a central difference convolution and a linear mapping.
10. The living human face detection device of claim 8, wherein the model construction module comprises:
the data set acquisition module is used for acquiring a data set and dividing the data set into a training data set and a verification data set;
the model building module is used for building a data model, and the convolution layer of the data model comprises the combination of first convolution processing and linear mapping processing;
an initialization module for initializing the data model;
and the model training module is used for training the data model based on the training data set to obtain a living body detection model.
CN202110380564.2A 2021-04-09 2021-04-09 Face living body detection method and device Pending CN112990090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110380564.2A CN112990090A (en) 2021-04-09 2021-04-09 Face living body detection method and device

Publications (1)

Publication Number Publication Date
CN112990090A true CN112990090A (en) 2021-06-18

Family

ID=76339512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110380564.2A Pending CN112990090A (en) 2021-04-09 2021-04-09 Face living body detection method and device

Country Status (1)

Country Link
CN (1) CN112990090A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657245A (en) * 2021-08-13 2021-11-16 亮风台(上海)信息科技有限公司 Method, device, medium and program product for human face living body detection
CN115131572A (en) * 2022-08-25 2022-09-30 深圳比特微电子科技有限公司 Image feature extraction method and device and readable storage medium
CN113657245B (en) * 2021-08-13 2024-04-26 亮风台(上海)信息科技有限公司 Method, device, medium and program product for human face living body detection

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100294916A1 (en) * 2007-12-06 2010-11-25 U.S. Government As Represented By The Secretary Of The Army Method and system for creating an image using the quantum properties of sound or quantum particles
US20130021459A1 (en) * 2011-07-18 2013-01-24 At&T Intellectual Property I, L.P. System and method for enhancing speech activity detection using facial feature detection
US20140043329A1 (en) * 2011-03-21 2014-02-13 Peng Wang Method of augmented makeover with 3d face modeling and landmark alignment
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN109255322A (en) * 2018-09-03 2019-01-22 北京诚志重科海图科技有限公司 A kind of human face in-vivo detection method and device
WO2019119505A1 (en) * 2017-12-18 2019-06-27 深圳云天励飞技术有限公司 Face recognition method and device, computer device and storage medium
CN111027400A (en) * 2019-11-15 2020-04-17 烟台市广智微芯智能科技有限责任公司 Living body detection method and device
CN111832525A (en) * 2020-07-23 2020-10-27 徐秋林 Living body detection method for face alignment
CN111914758A (en) * 2020-08-04 2020-11-10 成都奥快科技有限公司 Face in-vivo detection method and device based on convolutional neural network
WO2020252740A1 (en) * 2019-06-20 2020-12-24 深圳市汇顶科技股份有限公司 Convolutional neural network, face anti-spoofing method, processor chip, and electronic device
CN112464912A (en) * 2020-12-22 2021-03-09 杭州电子科技大学 Robot-end face detection method based on YOLO-RGGNet
CN112597885A (en) * 2020-12-22 2021-04-02 北京华捷艾米科技有限公司 Face living body detection method and device, electronic equipment and computer storage medium



Similar Documents

Publication Publication Date Title
Afifi 11K Hands: Gender recognition and biometric identification using a large dataset of hand images
Faraji et al. Face recognition under varying illuminations using logarithmic fractal dimension-based complete eight local directional patterns
Thapar et al. VGR-net: A view invariant gait recognition network
KR20130048076A (en) Face recognition apparatus and control method for the same
CN111639580B (en) Gait recognition method combining feature separation model and visual angle conversion model
CN111783748A (en) Face recognition method and device, electronic equipment and storage medium
CN112215043A (en) Human face living body detection method
CN110458235B (en) Motion posture similarity comparison method in video
CN107918773B (en) Face living body detection method and device and electronic equipment
Reddy et al. Ocularnet: deep patch-based ocular biometric recognition
Sayed Biometric Gait Recognition Based on Machine Learning Algorithms.
CN113947814A (en) Cross-visual angle gait recognition method based on space-time information enhancement and multi-scale saliency feature extraction
Li et al. Dynamic long short-term memory network for skeleton-based gait recognition
CN110969101A (en) Face detection and tracking method based on HOG and feature descriptor
CN112990090A (en) Face living body detection method and device
Taha et al. Iris features extraction and recognition based on the local binary pattern technique
CN113011399A (en) Video abnormal event detection method and system based on generation cooperative judgment network
Lobachev et al. Machine learning models and methods for human gait recognition
Pathak et al. Entropy based CNN for segmentation of noisy color eye images using color, texture and brightness contour features
Kimura et al. Single sensor-based multi-quality multi-modal biometric score database and its performance evaluation
Li et al. Predict and improve iris recognition performance based on pairwise image quality assessment
Zou et al. Unsupervised palmprint image quality assessment via pseudo-label generation and ranking guidance
Kumar et al. An improved biometric fusion system based on fingerprint and face using optimized artificial neural network
Guo et al. Finger multimodal feature fusion and recognition based on channel spatial attention
CN116758589B (en) Cattle face recognition method for processing gesture and visual angle correction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination