CN112232204A - Living body detection method based on infrared image - Google Patents

Living body detection method based on infrared image

Info

Publication number
CN112232204A
Authority
CN
China
Prior art keywords
face
detector
living body
key point
preset
Prior art date
Legal status
Granted
Application number
CN202011106811.1A
Other languages
Chinese (zh)
Other versions
CN112232204B (en)
Inventor
Yan An
Zhou Zhiyin
Current Assignee
Shanghai Dianze Intelligent Technology Co ltd
Zhongke Zhiyun Technology Co ltd
Original Assignee
Shanghai Dianze Intelligent Technology Co ltd
Zhongke Zhiyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Dianze Intelligent Technology Co ltd and Zhongke Zhiyun Technology Co ltd
Priority to CN202011106811.1A
Publication of CN112232204A
Application granted
Publication of CN112232204B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention belongs to the technical field of face recognition, and particularly relates to a real-time multifunctional face detection method. The living body detection method based on the infrared image comprises the following steps: collecting an infrared picture and preprocessing it; feeding the picture into a detector to obtain a face-frame prediction, face key points and a mask recognition result; decoding the face-frame prediction and the face key points; eliminating overlapping detection frames with a non-maximum suppression algorithm at a threshold of 0.4 to obtain the final face detection frame, face key points and mask recognition result; extracting the coordinates x and y of the two eyes from the face key points and extending them by a preset number of pixels in four directions to obtain eye images; and judging with a living body recognition neural network whether each eye image shows a living body. The invention achieves real-time detection on a mobile terminal that has only a CPU and accurately detects the eye positions.

Description

Living body detection method based on infrared image
Technical Field
The invention belongs to the technical field of face recognition, and particularly relates to a real-time multifunctional face detection method.
Background
A face recognition system takes face recognition technology as its core. It is an emerging biometric technology and a high-precision focus of the current international scientific and technological field. Widely applied to regional feature analysis, it integrates computer image processing with biostatistics: image processing extracts portrait feature points from video, and biostatistical analysis builds the mathematical model, giving the technology broad development prospects. Face detection is a key link in an automatic face recognition system. However, the human face exhibits quite complicated variations in detail: different appearances such as face shape and skin color, different expressions such as the opening and closing of the eyes and mouth, mask occlusion, and so on. These intrinsic and extrinsic variations make face detection a complex and challenging pattern detection problem within face recognition systems.
Although face detection algorithms based on convolutional neural networks have been studied extensively, existing algorithms on mobile devices cannot run in real time, particularly when only a CPU is available.
In addition, existing face detection serves a single function: it cannot accurately locate the eye position, dynamic living-body detection requires multiple steps, is easily affected by external conditions such as natural illumination, and lacks robustness.
Disclosure of Invention
The invention aims to solve the technical problems that existing face detection cannot accurately detect the eye position and that dynamic living-body detection requires numerous steps, and provides a living body detection method based on an infrared image.
The living body detection method based on the infrared image comprises the following steps:
acquiring an infrared picture and preprocessing the picture;
the picture is put into a preset detector for prediction, features obtained through four different convolution layers in a backbone network of the detector are combined with anchor points with multiple sizes, face detection, face key point detection and mask recognition are carried out, and a face frame prediction value, a face key point and a mask recognition result are obtained;
decoding the face-frame prediction and converting it into the real position of the bounding box, and decoding the face key points and converting them into the real key point positions;
eliminating overlapping detection frames with a non-maximum suppression algorithm at a threshold of 0.4 to obtain the final face detection frame, face key points and mask recognition result, comprising the upper-left corner coordinate, the lower-right corner coordinate, the two eye coordinates, the nose coordinate, a pair of mouth-corner coordinates, and the confidence of mask wearing;
extracting coordinates x and y of two eyes according to the key points of the face, and extending the x and y to preset pixels in four directions respectively to obtain an eye image;
and judging whether the eye image is a living body by adopting a preset living body recognition neural network to obtain a judgment result.
Optionally, before the picture is placed in a preset detector for prediction, the method further includes:
loading preset pre-training network parameters to the detector, and generating a default anchor point according to the size and length-width ratio of the preset anchor point;
training the detector through a preset data set to obtain a trained detector;
the detector includes a backbone network, a prediction layer, and a multi-tasking loss layer.
Optionally, the training the detector through a preset data set to obtain a trained detector includes:
acquiring unoccluded data and occluded data serving as a data set, converting a BGR picture in the data set into a YUV format, only storing data of a Y channel, and then performing data enhancement to obtain an enhanced data set;
performing network training with a stochastic optimization algorithm with momentum 0.9 and weight decay factor 0.0005, wherein hard example mining is used to reduce the imbalance between positive and negative samples, the initial learning rate is set to $10^{-3}$ and is reduced by a factor of 10 after the 50th and the 100th training epoch, and during training each prediction is first matched to the anchor with the best Jaccard overlap, after which anchors are matched to any face whose Jaccard overlap exceeds a threshold of 0.35.
Optionally, the unoccluded data are face pictures taken without a mask, the occluded data are face pictures taken with a mask worn, and the occluded data outnumber the unoccluded data.
Optionally, the performing data enhancement includes:
adding data to prevent model overfitting by applying a combination of at least one or more of color distortion, increased brightness contrast, random cropping, horizontal flipping, and transformation channels to pictures in the data set.
Optionally, putting the picture into the preset detector for prediction, combining the features obtained from four different convolutional layers in the detector's backbone network with anchor points of multiple sizes, and performing face detection, face key point detection and mask recognition to obtain the face-frame prediction, the face key points and the mask recognition result, includes:
the pictures are put into the trained detector for prediction, and during prediction the features of the 8th, 11th, 13th and 15th convolutional layers in the backbone network are fed into the respective prediction layers for face-frame localization, face key point localization and mask recognition;
for each anchor point, the prediction comprises 4 coordinate offsets and N classification scores, where N is 2; for each anchor point during detector training, the following multi-task loss function is minimized:

$$L = L_{obj}(p_i, p_i^*) + \lambda_1 p_i^* L_{box}(t_i, t_i^*) + \lambda_2 p_i^* L_{landmark}(l_i, l_i^*)$$

wherein $L_{obj}$ is a cross-entropy loss function detecting whether an anchor contains a target, $p_i$ is the probability that the anchor contains a target, and $p_i^* = 1$ if the anchor contains a target, 0 otherwise; $L_{box}$ adopts the smooth-L1 loss function for face-box localization, $t_i = \{t_x, t_y, t_w, t_h\}_i$ is the coordinate offset of the prediction box, and $t_i^*$ is the coordinate offset of the positive-sample anchor; $L_{landmark}$ adopts the smooth-L1 loss function for face key point localization, $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}, \ldots, l_{x5}, l_{y5}\}_i$ is the predicted key point offset, and $l_i^*$ is the key point coordinate offset of the positive sample; if the sample wears a mask, $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}\}_i$ and $l_i^* = \{l_{x1}^*, l_{y1}^*, l_{x2}^*, l_{y2}^*\}_i$, wherein $l_{x1}, l_{y1}$ and $l_{x1}^*, l_{y1}^*$ respectively denote the predicted left-eye key point coordinate offset and the positive-sample left-eye key point offset, and $l_{x2}, l_{y2}$ and $l_{x2}^*, l_{y2}^*$ respectively denote the predicted right-eye key point coordinate offset and the positive-sample right-eye key point offset; $\lambda_1$ and $\lambda_2$ are respectively the weight coefficients of the face-frame and key point loss functions.
Optionally, anchor points of 10 to 256 pixels are used to match the minimum size of the corresponding effective receptive field, with the anchor sizes at the four detection layers set to (10, 16, 24), (32, 48), (64, 96) and (128, 192, 256), respectively.
Optionally, decoding the face-frame prediction into the real position of the bounding box and decoding the face key points into the real key point positions includes:

decoding the face-frame prediction $l = (l_{cx}, l_{cy}, l_w, l_h)$ obtained by the detector and converting it into the real bounding-box position $b = (b_{cx}, b_{cy}, b_w, b_h)$:

$$b_{cx} = l_{cx} d_w + d_{cx}, \qquad b_{cy} = l_{cy} d_h + d_{cy},$$
$$b_w = d_w \exp(l_w), \qquad b_h = d_h \exp(l_h);$$

and decoding the face key point predictions $(l_{x1}, l_{y1}, \ldots, l_{x5}, l_{y5})$ obtained by the detector into the real key point positions:

$$b_{x_k} = l_{x_k} d_w + d_{cx}, \qquad b_{y_k} = l_{y_k} d_h + d_{cy}, \qquad k = 1, \ldots, 5;$$

where $d = (d_{cx}, d_{cy}, d_w, d_h)$ represents a generated default anchor.
Optionally, extracting the coordinates x and y of the two eyes according to the face key points and extending x and y by preset pixels in four directions to obtain the eye image includes:
extracting the two eye coordinates x and y according to the face key points, and extending each by 32 pixels in the four directions to obtain a 64 × 64 eye image.
Optionally, judging with the preset living body recognition neural network whether the eye image shows a living body to obtain the judgment result includes:
the living body recognition neural network extracting living-body features with a MobileNet lightweight neural network and using a cross-entropy loss function as its loss function.
The beneficial effects of the invention are as follows. The living body detection method based on the infrared image has these notable advantages:
1. real-time detection is achieved on a mobile terminal that has only a CPU;
2. living-body accuracy is improved by finely detecting the bright-pupil effect;
3. the eye positions are detected accurately;
4. robustness is strong and external influence is small.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a network architecture of the detector of the present invention;
FIG. 3 is a diagram of the attack image results of the present invention;
FIG. 4 is a diagram of a human image result of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to the drawings.
Referring to fig. 1, a living body detecting method based on an infrared image includes:
and S1, inputting the picture, collecting the infrared picture through the infrared camera, and carrying out preprocessing operation on the picture.
In the step, the infrared picture can be directly acquired from the infrared camera end, or the infrared picture can be input through the input interface. The preprocessing operation of the picture comprises image size adjustment and standardization.
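As a minimal sketch of this preprocessing (the 320 × 240 input size and the per-image standardization scheme are assumptions; the patent states only that the picture is resized and standardized, and that the pipeline keeps the Y channel of a YUV conversion):

    import cv2
    import numpy as np

    def preprocess(frame_bgr, size=(320, 240)):
        """Resize an infrared frame, keep only the Y (luma) channel, standardize.

        The target size and the standardization constants are illustrative
        assumptions; keeping only the Y channel of a YUV conversion follows
        the data-processing step described in S202."""
        resized = cv2.resize(frame_bgr, size)
        y = cv2.cvtColor(resized, cv2.COLOR_BGR2YUV)[:, :, 0].astype(np.float32)
        y = (y - y.mean()) / (y.std() + 1e-6)   # per-image standardization (assumed)
        return y[np.newaxis, np.newaxis, :, :]  # (1, 1, H, W) for a single-channel CNN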
S2, predicting by the detector: the picture is put into a preset detector for prediction, features obtained through four different convolution layers in a backbone network of the detector are combined with anchor points of multiple sizes, face detection, face key point detection and mask recognition are carried out, and a face frame prediction value, a face key point and a mask recognition result are obtained.
Before the step of placing the picture into a preset detector for prediction, the method further comprises the following steps:
loading preset pre-trained network parameters into the detector, and generating the default anchors $d = (d_{cx}, d_{cy}, d_w, d_h)$ from the preset anchor sizes and aspect ratios.
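A minimal sketch of this default-anchor generation follows. It combines the per-layer anchor sizes given later in this description, (10, 16, 24), (32, 48), (64, 96) and (128, 192, 256), with assumed feature-map strides of 8, 16, 32 and 64; the patent does not state the strides.

    import itertools
    import torch

    def generate_anchors(img_w, img_h,
                         strides=(8, 16, 32, 64),  # assumed strides of the four prediction layers
                         sizes=((10, 16, 24), (32, 48), (64, 96), (128, 192, 256))):
        """Build square default anchors d = (d_cx, d_cy, d_w, d_h), normalized to [0, 1]."""
        anchors = []
        for stride, layer_sizes in zip(strides, sizes):
            fm_h, fm_w = img_h // stride, img_w // stride
            for i, j in itertools.product(range(fm_h), range(fm_w)):
                cx = (j + 0.5) * stride / img_w   # cell center, normalized
                cy = (i + 0.5) * stride / img_h
                for s in layer_sizes:             # one square anchor per listed size
                    anchors.append([cx, cy, s / img_w, s / img_h])
        return torch.tensor(anchors)              # (num_anchors, 4), center form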
Referring to fig. 2, the detector comprises a backbone network, prediction layers and a multi-task loss layer: a backbone of 15 convolutional layers, 4 prediction layers and 1 multi-task loss layer. The 15 convolutional layers consist of one convolution module 1, thirteen convolution modules 2 and one convolution module 3. Convolution module 1 consists of a convolution, a normalization and an activation layer. Convolution module 2 consists of two groups, each being a set of convolution, normalization and activation layers. Convolution module 3 consists of two groups: a first group of convolution, normalization and activation layers, and a second group containing only a convolution. In this step, the features of the 8th, 11th, 13th and 15th convolutional layers of the backbone are fed into the respective prediction layers for face-frame localization, face key point localization and mask recognition, and every prediction layer feeds the multi-task loss layer to fit the multiple detection results. A sketch of the three module types follows.
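The module descriptions read like a MobileNet-style design; the following is a hedged PyTorch sketch in which the channel counts, kernel sizes and strides are placeholders, and treating module 2 as a depthwise-separable block is an assumption:

    import torch.nn as nn

    def conv_module_1(c_in, c_out, stride=2):
        # Module 1: convolution + normalization + activation.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def conv_module_2(c_in, c_out, stride=1):
        # Module 2: two groups of (convolution, normalization, activation);
        # a depthwise convolution followed by a pointwise one is assumed.
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def conv_module_3(c_in, c_out):
        # Module 3: a (convolution, normalization, activation) group
        # followed by a group containing only a convolution.
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1))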
The detector is trained on a preset data set to obtain the trained detector. The detector algorithm is preferably implemented with the PyTorch open-source deep learning library. Training comprises the following steps:
s201, data acquisition: the acquisition includes unoccluded data and occlusion data as datasets.
The unoccluded data are face pictures taken without a mask; the occluded data are face pictures taken with a mask worn, the occluded data outnumber the unoccluded data, and most of the occluded data are preferably mask data. For data acquisition, manually processed WiderFace unoccluded data and MAFA occluded data can be adopted.
S202, data processing and enhancing: and converting the BGR pictures in the data set into YUV format, and only storing the data of the Y channel, and then performing data enhancement to obtain an enhanced data set.
The data enhancement adds data to prevent model overfitting by applying one or more of color distortion, brightness-contrast adjustment, random cropping, horizontal flipping and channel transformation, in combination, to the pictures in the data set.
Training on single-channel data reduces the model's parameter count and improves its detection speed. Training directly on single-channel Y images also spares the mobile end any picture-format conversion, saving time so that the model reaches super-real-time detection on a mobile terminal with only a CPU.
The strategy adopted for brightness-contrast enhancement is to reduce the brightness inside the target frame and increase it outside the target frame, as sketched below. The augmentations can be combined in various ways so that the model becomes more robust to illumination conditions.
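A small sketch of that brightness strategy; the gain factors and the uint8 single-channel input are illustrative assumptions, not values from the patent:

    import numpy as np

    def brightness_contrast_augment(y_img, box, inside_gain=0.8, outside_gain=1.2):
        """Darken the face box and brighten the rest, per the strategy above.

        y_img is a single-channel uint8 image and box = (x1, y1, x2, y2);
        the two gain values are assumed."""
        out = y_img.astype(np.float32) * outside_gain
        x1, y1, x2, y2 = box
        out[y1:y2, x1:x2] = y_img[y1:y2, x1:x2].astype(np.float32) * inside_gain
        return np.clip(out, 0, 255).astype(np.uint8)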
S203, training: network training uses a stochastic optimization algorithm with momentum 0.9 and weight decay factor 0.0005, with hard example mining used to reduce the imbalance between positive and negative samples. The initial learning rate is set to $10^{-3}$ and is reduced by a factor of 10 after the 50th and the 100th training epoch. During training, each prediction is first matched to the anchor with the best Jaccard overlap, and anchors are then matched to any face whose Jaccard overlap exceeds a threshold of 0.35.
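A sketch of the two-stage Jaccard matching described here; the corner-form box layout and the helper box_iou are assumptions of this sketch rather than patent text:

    import torch

    def box_iou(a, b):
        # Pairwise Jaccard overlap between corner-form box sets a (N, 4) and b (M, 4).
        tl = torch.max(a[:, None, :2], b[None, :, :2])
        br = torch.min(a[:, None, 2:], b[None, :, 2:])
        inter = (br - tl).clamp(min=0).prod(dim=2)
        area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
        area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
        return inter / (area_a[:, None] + area_b[None, :] - inter)

    def match_anchors(gt_boxes, anchors, iou_threshold=0.35):
        # Stage 1: each ground-truth face claims its best-Jaccard anchor.
        # Stage 2: remaining anchors match any face with overlap above 0.35.
        iou = box_iou(gt_boxes, anchors)                 # (num_gt, num_anchors)
        best_anchor_per_gt = iou.argmax(dim=1)
        best_gt_per_anchor = iou.argmax(dim=0)
        best_iou_per_anchor = iou.max(dim=0).values
        best_iou_per_anchor[best_anchor_per_gt] = 1.0    # keep stage-1 matches
        best_gt_per_anchor[best_anchor_per_gt] = torch.arange(len(gt_boxes))
        positive = best_iou_per_anchor > iou_threshold   # p* = 1 for these anchors
        return best_gt_per_anchor, positive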
By the design, the trained detector can predict pictures.
During prediction, the features of the 8th, 11th, 13th and 15th convolutional layers in the backbone network are fed into the respective prediction layers for face-frame localization, face key point localization and mask recognition.
For each anchor point, the prediction comprises 4 coordinate offsets and N classification scores, where N is 2. For each anchor point during detector training, the following multi-task loss function is minimized:

$$L = L_{obj}(p_i, p_i^*) + \lambda_1 p_i^* L_{box}(t_i, t_i^*) + \lambda_2 p_i^* L_{landmark}(l_i, l_i^*)$$

where $L_{obj}$ is a cross-entropy loss function detecting whether an anchor contains a target, $p_i$ is the probability that the anchor contains a target, and $p_i^* = 1$ if the anchor contains a target, 0 otherwise; $L_{box}$ adopts the smooth-L1 loss function for face-box localization, $t_i = \{t_x, t_y, t_w, t_h\}_i$ is the coordinate offset of the prediction box, and $t_i^*$ is the coordinate offset of the positive-sample anchor; $L_{landmark}$ adopts the smooth-L1 loss function for face key point localization, $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}, \ldots, l_{x5}, l_{y5}\}_i$ is the predicted key point offset, and $l_i^*$ is the key point coordinate offset of the positive sample; if the sample wears a mask, only the eyes are supervised: $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}\}_i$ and $l_i^* = \{l_{x1}^*, l_{y1}^*, l_{x2}^*, l_{y2}^*\}_i$, where $l_{x1}, l_{y1}$ and $l_{x1}^*, l_{y1}^*$ respectively denote the predicted left-eye key point coordinate offset and the positive-sample left-eye key point offset, and $l_{x2}, l_{y2}$ and $l_{x2}^*, l_{y2}^*$ respectively denote the predicted right-eye key point coordinate offset and the positive-sample right-eye key point offset; $\lambda_1$ and $\lambda_2$ are respectively the weight coefficients of the face-frame and key point loss functions.
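A hedged PyTorch sketch of this multi-task objective; the tensor layouts, the loss reduction, and the default weights λ1 = λ2 = 1 are assumptions:

    import torch
    import torch.nn.functional as F

    def multitask_loss(cls_logits, box_pred, lmk_pred,
                       pos_mask, box_target, lmk_target, lmk_mask,
                       lambda1=1.0, lambda2=1.0):
        """L = L_obj + lambda1 * p* * L_box + lambda2 * p* * L_landmark.

        cls_logits: (A, 2) face/background scores; pos_mask: (A,) bool with
        p* = 1 for positive anchors; box_pred/box_target: (A, 4) offsets;
        lmk_pred/lmk_target: (A, 10) five-point offsets; lmk_mask: (A, 10)
        bool that keeps only the two eye coordinates for mask-wearing samples."""
        l_obj = F.cross_entropy(cls_logits, pos_mask.long())
        l_box = F.smooth_l1_loss(box_pred[pos_mask], box_target[pos_mask])
        sel = pos_mask[:, None] & lmk_mask               # positives, visible points only
        l_lmk = F.smooth_l1_loss(lmk_pred[sel], lmk_target[sel])
        return l_obj + lambda1 * l_box + lambda2 * l_lmk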
Anchor points of 10 to 256 pixels are employed to match the minimum size of the corresponding effective receptive field, with the anchor sizes at the four detection layers set to (10, 16, 24), (32, 48), (64, 96) and (128, 192, 256), respectively.
This design achieves end-to-end mask recognition: no extra classifier needs to be added to separately identify whether a mask is worn, and operations such as picture rotation and cropping are avoided on a mobile terminal with only a CPU, saving time. In addition, the invention optimizes key point detection for mask-wearing faces: when a mask is worn, only the loss of the visible eye features is optimized during training.
S3, decoding with the generated anchors: the face-frame prediction is decoded and converted into the real position of the bounding box, and the face key points are decoded and converted into the real key point positions.
The specific decoding process is as follows:
the predicted value l of the face frame obtained by the detector is equal to (l)cx,lcy,lw,lh) Decoding operation is carried out, and the real position b ═ b of the boundary box is converted intocx,bcy,bw,bh):
bcx=lcxdw+dcx,bcy=lcydh+dcy
bw=dwexp(lw),bh=dhexp(lh);
Predicting the face key points obtained by the detector
Figure BDA0002727188230000071
Translating to true positions of keypoints
Figure BDA0002727188230000072
Figure BDA0002727188230000073
Wherein d ═ d (d)cx,dcy,dw,dh) Indicating the default anchor generated at step S2.
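A sketch of this decoding in PyTorch, directly implementing the formulas above; the (A, 4) box and (A, 10) landmark tensor layouts are assumptions:

    import torch

    def decode(box_pred, lmk_pred, anchors):
        """Decode box offsets l = (l_cx, l_cy, l_w, l_h) and five landmark
        offsets into real positions using default anchors d = (d_cx, d_cy, d_w, d_h)."""
        d_cx, d_cy, d_w, d_h = anchors.unbind(dim=1)
        b_cx = box_pred[:, 0] * d_w + d_cx
        b_cy = box_pred[:, 1] * d_h + d_cy
        b_w = d_w * torch.exp(box_pred[:, 2])
        b_h = d_h * torch.exp(box_pred[:, 3])
        boxes = torch.stack([b_cx, b_cy, b_w, b_h], dim=1)
        # Each landmark (x, y) decodes like the box center.
        lmk_x = lmk_pred[:, 0::2] * d_w[:, None] + d_cx[:, None]
        lmk_y = lmk_pred[:, 1::2] * d_h[:, None] + d_cy[:, None]
        landmarks = torch.stack([lmk_x, lmk_y], dim=2).reshape(len(anchors), -1)
        return boxes, landmarks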
S4, non-maximum suppression: overlapping detection frames are eliminated with a non-maximum suppression algorithm at a threshold of 0.4, yielding the final face detection frame, face key points and mask recognition result, comprising the upper-left corner coordinate, the lower-right corner coordinate, the two eye coordinates, the nose coordinate, a pair of mouth-corner coordinates, and the confidence of mask wearing.
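A minimal sketch of greedy NMS at the stated 0.4 threshold; corner-form boxes are assumed, and torchvision.ops.nms provides an equivalent built-in operation:

    import torch

    def nms(boxes, scores, iou_threshold=0.4):
        # Greedy suppression over corner-form boxes (x1, y1, x2, y2): keep the
        # highest-scoring box, drop boxes overlapping it above 0.4, repeat.
        order = scores.argsort(descending=True)
        keep = []
        while order.numel() > 0:
            i = order[0].item()
            keep.append(i)
            if order.numel() == 1:
                break
            rest = order[1:]
            tl = torch.max(boxes[i, :2], boxes[rest, :2])
            br = torch.min(boxes[i, 2:], boxes[rest, 2:])
            inter = (br - tl).clamp(min=0).prod(dim=1)
            area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
            area_r = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
            iou = inter / (area_i + area_r - inter)
            order = rest[iou <= iou_threshold]
        return torch.tensor(keep, dtype=torch.long)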
The picture shown in fig. 3 is preprocessed to adjust and standardize the image size. The standardized picture is converted into YUV format, only the Y-channel data are kept and enhanced, and the result is input into the trained detector for prediction. The network model used for prediction is shown in fig. 2; in the multi-task loss function the anchor contains a target, so $p_i^* = 1$. Finally a face detection frame is detected and marked with a red frame, and each face detection frame contains and marks the two eye coordinates, the nose coordinate and a pair of mouth-corner coordinates. The obtained detection results, namely the face detection frames, face key points and mask recognition results, serve a face recognition scene and can be used as accurate data in subsequent recognition processes. In particular, the invention extracts the two eye coordinates from the face key points of the detection result as accurate data, which after processing provide an important basis for judging whether the subject is a living body.
The picture shown in fig. 4 is likewise preprocessed, resized and standardized. The standardized picture is converted into YUV format, only the Y-channel data are kept and enhanced, and the result is input into the trained detector for prediction. The network model used for prediction is shown in fig. 2; in the multi-task loss function the anchor contains a target, so $p_i^* = 1$. Finally a face detection frame is detected and marked with a red frame, and each face detection frame contains and marks the two eye coordinates, the nose coordinate and a pair of mouth-corner coordinates.
S5, intercepting the eye images: the coordinates x and y of the two eyes are extracted from the face key points and extended by preset pixels in four directions to obtain the eye images.
Specifically, the two eye coordinates x and y are extracted from the face key points and each is extended by 32 pixels in the four directions, yielding 64 × 64 eye images.
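A sketch of this crop; edge padding for eyes near the image border is an assumed detail:

    import numpy as np

    def crop_eyes(y_img, left_eye, right_eye, radius=32):
        # Cut a (2*radius) x (2*radius) patch centered on each eye keypoint;
        # radius = 32 gives the 64 x 64 eye images described here.
        padded = np.pad(y_img, radius, mode='edge')  # border guard (assumed)
        patches = []
        for (x, y) in (left_eye, right_eye):
            cx, cy = int(round(x)) + radius, int(round(y)) + radius
            patches.append(padded[cy - radius:cy + radius, cx - radius:cx + radius])
        return patches  # two 64 x 64 single-channel eye images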
S6, living body recognition neural network: a preset living body recognition neural network judges whether the eye image shows a living body, giving the judgment result.
Specifically, the living body recognition neural network extracts living-body features with a MobileNet lightweight neural network, and uses a cross-entropy loss function as its loss function to judge whether an eye image shows a living body.
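The patent names only a MobileNet lightweight network with a cross-entropy loss; the following is a hedged sketch whose depth and layer widths are assumptions, using depthwise-separable blocks in the MobileNet style:

    import torch.nn as nn

    class LivenessNet(nn.Module):
        """Small MobileNet-style binary classifier over a 1 x 64 x 64 eye patch."""
        def __init__(self):
            super().__init__()
            def dw_block(c_in, c_out, stride):
                # Depthwise 3x3 followed by pointwise 1x1, MobileNet style.
                return nn.Sequential(
                    nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
                    nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
                    nn.Conv2d(c_in, c_out, 1, bias=False),
                    nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, 2, 1, bias=False),  # 64 -> 32
                nn.BatchNorm2d(16), nn.ReLU(inplace=True),
                dw_block(16, 32, 2),    # 32 -> 16
                dw_block(32, 64, 2),    # 16 -> 8
                dw_block(64, 128, 2),   # 8 -> 4
                nn.AdaptiveAvgPool2d(1))
            self.fc = nn.Linear(128, 2)  # live vs. attack

        def forward(self, x):
            return self.fc(self.features(x).flatten(1))

    # Training such a head would minimize nn.CrossEntropyLoss() over
    # live/attack labels, matching the loss named in this step.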
The network used in this step is a trained living body recognition neural network. Its training set uses collected samples: positive samples are real-person pictures shot under an infrared camera, and attack samples are one or more of a phone-screen face, an iPad face, a printed color face, or a gray face shot under infrared imaging.
The picture shown in fig. 3 is an attack sample. After the processing of S4, the two eye coordinates x and y are extracted according to the face key points and each is extended by 32 pixels in four directions, yielding 64 × 64 eye images; after judgment by the living body recognition neural network of this step, the result is "fake", that is, not a living body.
The picture shown in fig. 4 is a real-person picture. After the processing of S4, the two eye coordinates x and y are extracted according to the face key points and each is extended by 32 pixels in four directions, yielding 64 × 64 eye images; after judgment by the living body recognition neural network of this step, the result is "real", that is, a living body.
The foregoing shows and describes the general principles, essential features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A living body detection method based on infrared images is characterized by comprising the following steps:
acquiring an infrared picture and preprocessing the picture;
the picture is put into a preset detector for prediction, features obtained through four different convolution layers in a backbone network of the detector are combined with anchor points with multiple sizes, face detection, face key point detection and mask recognition are carried out, and a face frame prediction value, a face key point and a mask recognition result are obtained;
decoding the face-frame prediction and converting it into the real position of the bounding box, and decoding the face key points and converting them into the real key point positions;
eliminating overlapping detection frames with a non-maximum suppression algorithm at a threshold of 0.4 to obtain the final face detection frame, face key points and mask recognition result, comprising the upper-left corner coordinate, the lower-right corner coordinate, the two eye coordinates, the nose coordinate, a pair of mouth-corner coordinates, and the confidence of mask wearing;
extracting coordinates x and y of two eyes according to the key points of the face, and extending the x and y to preset pixels in four directions respectively to obtain an eye image;
and judging whether the eye image is a living body by adopting a preset living body recognition neural network to obtain a judgment result.
2. The living body detection method based on the infrared image as set forth in claim 1, wherein before the picture is placed in a preset detector for prediction, the method further comprises:
loading preset pre-training network parameters to the detector, and generating a default anchor point according to the size and length-width ratio of the preset anchor point;
training the detector through a preset data set to obtain a trained detector;
the detector includes a backbone network, a prediction layer, and a multi-tasking loss layer.
3. The living body detection method based on the infrared image as set forth in claim 2, wherein the training of the detector through a preset data set to obtain a trained detector comprises:
acquiring unoccluded data and occluded data serving as a data set, converting a BGR picture in the data set into a YUV format, only storing data of a Y channel, and then performing data enhancement to obtain an enhanced data set;
performing network training with a stochastic optimization algorithm with momentum 0.9 and weight decay factor 0.0005, wherein hard example mining is used to reduce the imbalance between positive and negative samples, the initial learning rate is set to $10^{-3}$ and is reduced by a factor of 10 after the 50th and the 100th training epoch, and during training each prediction is first matched to the anchor with the best Jaccard overlap, after which anchors are matched to any face whose Jaccard overlap exceeds a threshold of 0.35.
4. The infrared image-based living body detection method according to claim 3, wherein the non-occlusion data is a face picture when a mask is not worn, the occlusion data is a face picture when a mask is worn, and the occlusion data is larger than the non-occlusion data.
5. The infrared image-based liveness detection method of claim 3 wherein said performing data enhancement comprises:
adding data to prevent model overfitting by applying a combination of at least one or more of color distortion, increased brightness contrast, random cropping, horizontal flipping, and transformation channels to pictures in the data set.
6. The method for detecting living bodies based on infrared images according to claim 2, wherein the steps of putting the pictures into a preset detector for prediction, combining features obtained by four different convolution layers in a backbone network of the detector with anchor points with a plurality of sizes, and performing face detection, face key point detection and mask recognition to obtain a face frame prediction value, a face key point and a mask recognition result comprise:
the pictures are put into the trained detector for prediction, and during prediction the features of the 8th, 11th, 13th and 15th convolutional layers in the backbone network are fed into the respective prediction layers for face-frame localization, face key point localization and mask recognition;

for each anchor point, the prediction comprises 4 coordinate offsets and N classification scores, where N is 2; for each anchor point during detector training, the following multi-task loss function is minimized:

$$L = L_{obj}(p_i, p_i^*) + \lambda_1 p_i^* L_{box}(t_i, t_i^*) + \lambda_2 p_i^* L_{landmark}(l_i, l_i^*)$$

wherein $L_{obj}$ is a cross-entropy loss function detecting whether an anchor contains a target, $p_i$ is the probability that the anchor contains a target, and $p_i^* = 1$ if the anchor contains a target, 0 otherwise; $L_{box}$ adopts the smooth-L1 loss function for face-box localization, $t_i = \{t_x, t_y, t_w, t_h\}_i$ is the coordinate offset of the prediction box, and $t_i^*$ is the coordinate offset of the positive-sample anchor; $L_{landmark}$ adopts the smooth-L1 loss function for face key point localization, $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}, \ldots, l_{x5}, l_{y5}\}_i$ is the predicted key point offset, and $l_i^*$ is the key point coordinate offset of the positive sample; if the sample wears a mask, $l_i = \{l_{x1}, l_{y1}, l_{x2}, l_{y2}\}_i$ and $l_i^* = \{l_{x1}^*, l_{y1}^*, l_{x2}^*, l_{y2}^*\}_i$, wherein $l_{x1}, l_{y1}$ and $l_{x1}^*, l_{y1}^*$ respectively denote the predicted left-eye key point coordinate offset and the positive-sample left-eye key point offset, and $l_{x2}, l_{y2}$ and $l_{x2}^*, l_{y2}^*$ respectively denote the predicted right-eye key point coordinate offset and the positive-sample right-eye key point offset; $\lambda_1$ and $\lambda_2$ are respectively the weight coefficients of the face-frame and key point loss functions.
7. The infrared image-based living body detecting method as set forth in claim 6, wherein anchor points of 10 to 256 pixels are employed to match the minimum size of the corresponding effective receptive field, and the size of each anchor point for detecting the feature is set to (10, 16, 24), (32, 48), (64, 96) and (128, 192, 256), respectively.
8. The living body detection method based on the infrared image as claimed in claim 1, wherein decoding the face-frame prediction into the real position of the bounding box and decoding the face key points into the real key point positions comprises:

decoding the face-frame prediction $l = (l_{cx}, l_{cy}, l_w, l_h)$ obtained by the detector and converting it into the real bounding-box position $b = (b_{cx}, b_{cy}, b_w, b_h)$:

$$b_{cx} = l_{cx} d_w + d_{cx}, \qquad b_{cy} = l_{cy} d_h + d_{cy},$$
$$b_w = d_w \exp(l_w), \qquad b_h = d_h \exp(l_h);$$

and decoding the face key point predictions $(l_{x1}, l_{y1}, \ldots, l_{x5}, l_{y5})$ obtained by the detector into the real key point positions:

$$b_{x_k} = l_{x_k} d_w + d_{cx}, \qquad b_{y_k} = l_{y_k} d_h + d_{cy}, \qquad k = 1, \ldots, 5;$$

wherein $d = (d_{cx}, d_{cy}, d_w, d_h)$ represents a generated default anchor.
9. The living body detection method based on the infrared image as claimed in claim 1, wherein extracting the coordinates x and y of the two eyes according to the face key points and extending x and y by preset pixels in four directions to obtain the eye image comprises:
extracting the two eye coordinates x and y according to the face key points, and extending each by 32 pixels in the four directions to obtain a 64 × 64 eye image.
10. The method for detecting living bodies based on infrared images as claimed in claim 1, wherein the determining whether the eye images are living bodies by using a preset living body recognition neural network to obtain a determination result comprises:
the living body recognition neural network adopts a MobileNet lightweight neural network to extract living-body features, and uses a cross-entropy loss function as its loss function.
CN202011106811.1A 2020-10-16 2020-10-16 Living body detection method based on infrared image Active CN112232204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011106811.1A CN112232204B (en) 2020-10-16 2020-10-16 Living body detection method based on infrared image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011106811.1A CN112232204B (en) 2020-10-16 2020-10-16 Living body detection method based on infrared image

Publications (2)

Publication Number Publication Date
CN112232204A true CN112232204A (en) 2021-01-15
CN112232204B CN112232204B (en) 2022-07-19

Family

ID=74118035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011106811.1A Active CN112232204B (en) 2020-10-16 2020-10-16 Living body detection method based on infrared image

Country Status (1)

Country Link
CN (1) CN112232204B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801038A (en) * 2021-03-02 2021-05-14 重庆邮电大学 Multi-view face living body detection method and system
CN113033374A (en) * 2021-03-22 2021-06-25 开放智能机器(上海)有限公司 Artificial intelligence dangerous behavior identification method and device, electronic equipment and storage medium
CN113298008A (en) * 2021-06-04 2021-08-24 杭州鸿泉物联网技术股份有限公司 Living body detection-based driver face identification qualification authentication method and device
WO2021238125A1 (en) * 2020-05-27 2021-12-02 嘉楠明芯(北京)科技有限公司 Face occlusion detection method and face occlusion detection apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748858A (en) * 2017-06-15 2018-03-02 华南理工大学 A kind of multi-pose eye locating method based on concatenated convolutional neutral net
WO2020151489A1 (en) * 2019-01-25 2020-07-30 杭州海康威视数字技术股份有限公司 Living body detection method based on facial recognition, and electronic device and storage medium
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN110119676A (en) * 2019-03-28 2019-08-13 广东工业大学 A kind of Driver Fatigue Detection neural network based
CN110647817A (en) * 2019-08-27 2020-01-03 江南大学 Real-time face detection method based on MobileNet V3
CN110866490A (en) * 2019-11-13 2020-03-06 复旦大学 Face detection method and device based on multitask learning
CN111680588A (en) * 2020-05-26 2020-09-18 广州多益网络股份有限公司 Human face gate living body detection method based on visible light and infrared light

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANVI B. PATEL et al.: "Occlusion detection and recognizing human face using neural network", 2017 International Conference on Intelligent Computing and Control (I2C2) *
LIU Qiyuan et al.: "Research progress on occluded face detection methods" (in Chinese), Computer Engineering and Applications *

Also Published As

Publication number Publication date
CN112232204B (en) 2022-07-19

Similar Documents

Publication Publication Date Title
CN112232204B (en) Living body detection method based on infrared image
CN108717524B (en) Gesture recognition system based on double-camera mobile phone and artificial intelligence system
CN111310718A (en) High-accuracy detection and comparison method for face-shielding image
CN109886153B (en) Real-time face detection method based on deep convolutional neural network
CN111368666B (en) Living body detection method based on novel pooling and attention mechanism double-flow network
CN108446690B (en) Human face in-vivo detection method based on multi-view dynamic features
CN112052831A (en) Face detection method, device and computer storage medium
CN112215043A (en) Human face living body detection method
CN112818722A (en) Modular dynamically configurable living body face recognition system
CN109461186A (en) Image processing method, device, computer readable storage medium and electronic equipment
CN111079688A (en) Living body detection method based on infrared image in face recognition
CN114783024A (en) Face recognition system of gauze mask is worn in public place based on YOLOv5
CN112232205B (en) Mobile terminal CPU real-time multifunctional face detection method
CN111832405A (en) Face recognition method based on HOG and depth residual error network
CN112614136A (en) Infrared small target real-time instance segmentation method and device
CN115546683A (en) Improved pornographic video detection method and system based on key frame
CN109325472B (en) Face living body detection method based on depth information
CN114550268A (en) Depth-forged video detection method utilizing space-time characteristics
CN111881841B (en) Face detection and recognition method based on binocular vision
CN111274851A (en) Living body detection method and device
CN111881803B (en) Face recognition method based on improved YOLOv3
CN112200008A (en) Face attribute recognition method in community monitoring scene
CN112818938A (en) Face recognition algorithm and face recognition device adaptive to illumination interference environment
CN111797694A (en) License plate detection method and device
CN115294163A (en) Face image quality evaluation method based on adaptive threshold segmentation algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant