CN112766205B - Robustness silence living body detection method based on color mode image - Google Patents
- Publication number: CN112766205B (application CN202110116023.9A)
- Authority
- CN
- China
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06V40/161 (human faces: detection; localisation; normalisation)
- G06V40/168 (human faces: feature extraction; face representation)
- G06F18/214 (pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting)
- G06F18/24 (pattern recognition: classification techniques)
- G06N3/08 (neural networks: learning methods)
- G06T7/136 (segmentation; edge detection involving thresholding)
- G06T7/187 (segmentation involving region growing, region merging or connected component labelling)
- G06T7/90 (determination of colour characteristics)
- G06T2207/20081 (training; learning)
- G06T2207/20084 (artificial neural networks [ANN])
- G06T2207/20112 (image segmentation details); G06T2207/20132 (image cropping)
- G06T2207/30196 (human being; person); G06T2207/30201 (face)
Abstract
The invention discloses a robust silent liveness detection method based on color mode images. The method comprises the steps of: shooting a face photo under high-energy visible fill light; acquiring a color mode image, a saturation image and a brightness image from the photo; carrying out face detection and secondary cropping on the color mode image; binarizing the cropped color mode image and taking the maximum connected region; combining the processed color mode image with the unprocessed saturation and brightness images to obtain a three-channel image; establishing a CMNet network model and training it with the three-channel images; and performing silent liveness detection with the trained CMNet network model. The invention aims to solve the problems that the prior art cannot effectively handle unknown liveness attacks and is costly; it uses the specular reflection component of the image to distinguish real faces from spoofed ones, and is simple and efficient.
Description
Technical Field
The invention relates to the field of face recognition, and in particular to a robust silent liveness detection method based on color mode images.
Background
In recent years, face recognition technology has developed rapidly. However, in many applications, such as face-recognition mobile payment and remote video account opening, it is necessary when verifying a face image to determine whether it comes from a living person rather than from a photograph or a recorded video.
Existing face liveness detection techniques fall roughly into two types: silent liveness detection and action-based liveness detection. Action-based liveness detection requires the user to complete specified facial actions, such as opening the mouth or blinking, in front of the camera. On the one hand, these facial actions can easily be reproduced by face-synthesis software, so the security level is not high enough; on the other hand, the required user cooperation makes the user experience extremely poor. Action-based detection is therefore gradually being replaced by silent detection.
Silent liveness detection can be divided into three categories according to the data used: detection based on a single-frame RGB image, detection based on multi-frame images, and detection based on multiple modalities. The single-frame method is simple and efficient, but a static RGB face image is very easy to obtain, and the textures of real and spoofed faces are strongly affected by the environment and the spoofing medium, so the method is very easy to crack and its robustness is low. Researchers subsequently proposed silent liveness detection on multiple frames, which introduces additional information, such as subtle facial movements, for detecting spoofing attacks. This method has a serious disadvantage: when an attacker replays a video of a real person, slight facial movement is also present, and multi-frame liveness detection may fail. Other researchers then proposed improving detection accuracy by introducing data of other modalities, such as depth maps and infrared images. This approach has two defects: it is ineffective against 3D liveness attacks, and it usually requires a special camera to acquire the depth and infrared images, which is expensive to manufacture and hard to popularize in practical application scenarios.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a robust silent liveness detection method based on color mode images.
In order to achieve this purpose, the invention adopts the following technical scheme:
a robust silent liveness detection method based on color mode images, comprising the following steps:
s1, shooting a human face photo under the light supplement of high-energy visible light;
s2, detecting the specular component of the photo shot in step S1, and acquiring a color mode image, a saturation image and a brightness image;
s3, carrying out face detection and secondary cutting on the color mode image acquired in the step S2;
s4, binarizing the color mode image cut in the step S3, and taking a maximum connected region;
s5, combining the color mode image processed in the step S4 with the unprocessed saturation image and the luminance image to obtain a three-channel image;
s6, building a CMNet network model, and training the CMNet network model by using the three-channel image obtained in the step S5;
and S7, performing silent live body detection by adopting the trained CMNet network model.
The invention has the following beneficial effects. The specular reflection component of the image is used to distinguish real faces from spoofed ones, and only a single RGB image of the face under monochromatic illumination is required; liveness detection is solved without resorting to multi-frame images or data of other modalities. This avoids the poor user experience and the vulnerability to replay-attack spoofing of multi-frame methods, while providing far better robustness than ordinary single-RGB-image silent liveness detection, so performance is preserved along with user experience, and the detection of unknown spoofing attacks is better than with existing liveness detection algorithms. The method also removes the need for the expensive peripherals required by multi-modal silent liveness detection, and is easy to deploy in existing face recognition systems.
Preferably, step S2 specifically includes:
the photograph taken in step S1 is transferred from the RGB color space to the HSV space, the color and illumination of the image are separated, and a color mode image, a saturation image, and a brightness image are acquired.
The preferred scheme has the following beneficial effects: compared with the RGB color space, which represents an image by three primary colors, the HSV space decomposes the image into an H component representing color, an S component representing saturation and a V component representing brightness, making the color component of the image easier to acquire and process.
Preferably, step S3 includes the following substeps:
s31, carrying out face detection on the color mode image;
and S32, extracting key points of the detected face, and secondarily cutting the face by using the coordinates of the key points, so as to reduce the width of the face and keep the height unchanged.
The preferred scheme has the following beneficial effects: the influence caused by interference information, namely the environment around the human face, is reduced.
Preferably, step S4 includes the following substeps:
s41, binarizing the color mode image cropped in step S3 with Otsu's adaptive-threshold method, specifically: first compute the gray-level histogram and probability histogram of the image, then traverse all candidate thresholds t and find the threshold that maximizes the between-class variance, and binarize the color mode image with that threshold;
and S42, acquiring connected regions contained in the color mode image subjected to the binarization processing in the step S41, calculating the areas of all the connected regions, sequencing the connected regions in sequence, reserving the connected region with the largest area, and deleting other connected regions to obtain a binarized color mode image only containing one connected region.
The preferred scheme has the following beneficial effects: binarization removes the specular reflection of the ambient light, leaving only the specular component of the fill light, which further improves the robustness of the color-mode data; computing the connected regions and keeping only the largest one removes the uncontrollable ambient specular components that exist outside the illuminated region. After this step, every specular component except that of the fill light has been eliminated.
Preferably, step S6 includes the following substeps:
s61, extracting features with DenseNet as the backbone network, then classifying the extracted features with a fully connected layer to judge whether the image is real or spoofed;
s62, combining the pixel supervision technique, taking BCELoss and cross-entropy loss as the loss functions, and supervising the training of the model to establish the CMNet network model;
and S63, selecting a stochastic gradient descent optimizer for the training process, inputting the three-channel images obtained in step S5 into the CMNet network model, and training the CMNet network model by transfer learning.
The preferred scheme has the following beneficial effects: DenseNet feeds shallow features forward as input to the deeper layers, so shallow features are well utilized, which suits situations with insufficient data. Compared with the conventional approach of treating liveness detection as plain binary classification, pixel supervision introduces supervision information at the pixel level before the binary classifier: a live photo corresponds to a 14x14 supervision matrix of all ones, and a spoofed photo to a 14x14 supervision matrix of all zeros, which aggregates the features better and helps the binary classifier produce more stable output. BCELoss is a loss function commonly used for classification in neural networks and performs well, and training the network on images processed by the preprocessing of steps S1 to S5 improves the generalization of the model as much as possible when data is scarce.
Drawings
FIG. 1 is a flow chart of a robust silence liveness detection method based on color modality images in accordance with the present invention;
FIG. 2 is a schematic diagram of the imaging results of a real face and a spoofed face under violet fill light in the embodiment of the invention;
FIG. 3 is a schematic diagram of the color mode images of a real face and a spoofed face under violet fill light in the embodiment of the invention;
FIG. 4 is a schematic diagram of key points of a face according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a color modality image cropping result according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a result of binarization and maximum connected region of a color mode image according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a CMNet network model structure in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention provides a robust silence live detection method based on color mode images, comprising the following steps:
s1, shooting a human face photo under the light supplement of high-energy visible light;
Violet light, the visible light with the highest energy, is adopted as additional fill light to irradiate the face, after which the face is detected and a picture is taken. According to the global illumination rendering equation, the radiance of a point in the reflected image is as follows:
L_o(p, ω_o) = L_e(p, ω_o) + ∫_Ω f_r(p, ω_i, ω_o) · L_i(p, ω_i) · cos θ dω_i
wherein L_o(p, ω_o) is the outgoing radiance finally observed at the point, p is the position of the point, ω_o is the outgoing direction, L_e(p, ω_o) is the radiance emitted by the reflecting surface itself, Ω is the hemisphere of incident directions, f_r is the scattering function, L_i is the incident radiance, ω_i is the incident direction, and θ is the angle between the incident direction and the surface normal.
The image I(x, y) can be simply split into the sum of the emission of the reflecting surface itself, the specular reflection component and the diffuse reflection component:
I(x, y) = I_i(x, y) + I_s(x, y) + I_d(x, y)
I_s(x, y) = k_s · E(x, y)
I_d(x, y) = k_d · E(x, y)
E(x, y) = E_a(x, y) + E_i(x, y)
Wherein I_i(x, y) is the emission of the reflecting surface itself: if the surface emits light, as a screen does, I_i(x, y) is the light intensity of the screen; otherwise it is 0. I_s(x, y) is the specular reflection component, determined by the incident light energy E(x, y) and the specular reflection coefficient k_s of the material. I_d(x, y) is the diffuse reflection component, determined by the incident light energy E(x, y) and the diffuse reflection coefficient k_d of the material. The incident light energy E(x, y) is the sum of the ambient light energy E_a(x, y) and the fill-light energy E_i(x, y).
For a spoofed face, such as a printed photo or a screen, the surface is smooth, so the specular component dominates its reflection. The surface of a real human face is comparatively rough, so the diffuse component dominates the reflection of a real face. By detecting the specular component, genuine and spoofed faces can therefore be distinguished. The reflection components are, however, difficult to separate directly, so a compromise is devised: additional fill light is used to highlight the specular reflection, and the specular component of the image is approximated by detecting the specular reflection of the fill light. Moreover, the larger the incident energy E(x, y), the larger the specular component I_s(x, y) and the larger the difference between real and spoofed faces, which is why violet light is selected as the additional fill light. In addition, to guarantee sufficient illumination, the user is required to stay within a limited distance of the camera; the face is then detected automatically and a photo is taken, yielding an RGB image I(x, y). Referring to Fig. 2, which shows the results of photographing real and spoofed faces under violet fill light: the left image is a real face under violet fill light, the right a spoofed face under violet fill light.
S2, detecting the mirror component of the photo shot in the step S1, and acquiring a color mode image, a saturation image and a brightness image;
in the embodiment of the present invention, step S2 specifically comprises converting the photo taken in step S1 from the RGB color space to the HSV color space, separating the color and the illumination of the image, and acquiring the hue-space image, i.e. the color mode image, together with the saturation image and the brightness image. Referring to Fig. 3, which shows the color mode images of a real face (left) and a spoofed face (right): the difference between them is obvious and can be exploited for liveness detection.
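In practice this conversion is a single call in OpenCV (cv2.cvtColor with COLOR_BGR2HSV); the decomposition can also be sketched self-containedly in NumPy as below. The function name and the value ranges chosen here (hue in degrees, S and V in [0, 1]) are illustrative choices, not taken from the patent:

```python
import numpy as np

def split_hsv(rgb: np.ndarray):
    """Split an RGB image (uint8, H x W x 3) into hue, saturation and
    brightness planes -- the patent's color mode (H), S and V images."""
    rgb = rgb.astype(np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)                      # brightness (value)
    c = v - rgb.min(axis=-1)                  # chroma
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)  # saturation
    safe_c = np.where(c > 0, c, 1.0)          # guard against division by zero
    h = np.zeros_like(v)
    h = np.where((c > 0) & (r == v), ((g - b) / safe_c) % 6.0, h)
    h = np.where((c > 0) & (g == v) & (r != v), (b - r) / safe_c + 2.0, h)
    h = np.where((c > 0) & (b == v) & (r != v) & (g != v), (r - g) / safe_c + 4.0, h)
    return h * 60.0, s, v                     # hue in degrees [0, 360)
```

Note that OpenCV's 8-bit HSV scales hue to [0, 179] instead; whichever convention is used, the hue plane is the color mode image that the subsequent steps process.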
S3, carrying out face detection and secondary cutting on the color mode image acquired in the step S2;
in the embodiment of the present invention, step S3 includes the following sub-steps:
s31, carrying out face detection on the color mode image;
because the fill-light region of the photo covers only the face, the environment around the face is interference information; face detection is therefore performed on the captured photo to extract the face.
And S32, extracting key points of the detected face, and secondarily cutting the face by using the coordinates of the key points, so as to reduce the width of the face and keep the height unchanged.
Due to the changeable environment, the size of the face detection box fluctuates greatly, so the detected face is cropped a second time. Considering that the positions of the facial key points are fixed and do not change with the size of the detection box, after the face is detected its key points are extracted with the dlib library; the detection result is shown in Fig. 4. The coordinates of the 1st and 17th key points are then used to narrow the face box, reducing the width of the face while leaving the height unadjusted, which further reduces the influence of the surrounding environment. The original image and the result after secondary cropping are shown in Fig. 5, where the left side is the original color mode image and the right side is the color mode image after face detection and secondary cropping; after the second crop, the noise is clearly reduced.
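A minimal sketch of the secondary crop, assuming landmarks in the dlib 68-point convention (the patent's 1-based key points 1 and 17 are the outermost jaw points, i.e. indices 0 and 16 when zero-based); the box representation is an illustrative choice:

```python
def secondary_crop_box(box, landmarks):
    """Narrow a face detection box horizontally using the outermost jaw
    landmarks; the height (top/bottom) is kept unchanged, as in step S32.

    box       -- (left, top, right, bottom) from the face detector
    landmarks -- list of (x, y) tuples in dlib's 68-point order (0-based)
    """
    left, top, right, bottom = box
    new_left = landmarks[0][0]    # patent's key point 1: leftmost jaw point
    new_right = landmarks[16][0]  # patent's key point 17: rightmost jaw point
    return (new_left, top, new_right, bottom)
```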
S4, binarizing the color mode image cut in the step S3, and taking a maximum connected region;
in the embodiment of the present invention, step S4 includes the following sub-steps:
s41, binarizing the color mode image cropped in step S3 with Otsu's adaptive-threshold method, specifically: first compute the gray-level histogram and probability histogram of the image, then traverse all candidate thresholds t and find the threshold that maximizes the between-class variance, and binarize the color mode image with that threshold. Compared with traditional fixed-threshold binarization, dynamically recomputing the threshold for each image adapts better to the varying conditions of different images.
The incident light energy E(x, y) is the sum of the ambient light energy E_a(x, y) and the fill-light energy E_i(x, y). The fill light is additional and controllable, whereas the ambient light is an uncontrollable quantity affected by the changing environment. Although the ambient light cannot be controlled, its energy is much smaller than that of the stable fill light, so by selecting a threshold and binarizing, the influence of the ambient specular reflection is removed and only the specular component of the fill light remains, further improving the robustness of the color-mode data. Otsu's adaptive-threshold method is adopted so that the threshold is updated dynamically for each image.
And S42, acquiring connected regions contained in the color mode image subjected to the binarization processing in the step S41, calculating the areas of all the connected regions, sequencing the connected regions in sequence, reserving the connected region with the largest area, and deleting other connected regions to obtain a binarized color mode image only containing one connected region.
In addition to the fill-lit region, other regions of the face may also contain uncontrollable ambient specular components, but the area of these disturbances is much smaller than that of the controllable specular component of the fill light. To remove this interference, the connected regions of the binarized color mode image are computed, the largest connected region is retained, and the other environmental interference is removed. Referring to Fig. 6, which shows the results before and after this processing: the left side is the color mode image after secondary cropping, the right side is the image after binarization and retention of the maximum region. After processing, all ambient specular components have been removed except the specular component caused by the fill light.
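The two substeps can be sketched as follows, with a pure-NumPy Otsu threshold and a breadth-first largest-connected-component pass standing in for the usual library calls (cv2.threshold with THRESH_OTSU, and cv2.connectedComponentsWithStats):

```python
import numpy as np
from collections import deque

def otsu_threshold(gray: np.ndarray) -> int:
    """Otsu's method: pick the threshold t maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0, w1 = prob[:t + 1].sum(), prob[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t + 1) * prob[:t + 1]).sum() / w0
        mu1 = (np.arange(t + 1, 256) * prob[t + 1:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def largest_component(binary: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected foreground region of a 0/1 mask."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for i in range(h):
        for j in range(w):
            if binary[i, j] and not seen[i, j]:
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = np.zeros_like(binary)
    for y, x in best:
        out[y, x] = 1
    return out
```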
S5, combining the color mode image processed in the step S4 with the unprocessed saturation image and the luminance image to obtain a three-channel image;
when cropping, binarization and retention of the maximum connected region are performed, information is lost. For better liveness detection, the processed color mode image is recombined with the unprocessed S-space and V-space images into a three-channel image, so that information lost in preprocessing is compensated by the information of the other channels.
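The recombination of step S5 is then a simple channel stack; a sketch, with the plane order H, S, V assumed:

```python
import numpy as np

def merge_three_channel(color_mode_bin, saturation, value):
    """Stack the processed (binarized) color-mode plane with the untouched
    saturation and brightness planes into one 3-channel training image."""
    return np.stack([color_mode_bin, saturation, value], axis=-1)
```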
S6, building a CMNet network model, and training the CMNet network model by using the three-channel image obtained in the step S5;
in the embodiment of the present invention, step S6 includes the following sub-steps:
s61, extracting features by taking DenseNet as a backbone network, then classifying by utilizing the extracted features by a full connection layer, and judging whether the image is a real image or a deceptive image;
s62, combining a pixel supervision technology, taking BCELoss and Cross Engine Loss as Loss functions, and supervising the training of the model to establish a CMNet network model;
s63, selecting stochastic gradient descent as the optimizer for the training process, with a learning rate of 0.001 and a momentum of 0.99. A DenseNet121 model pre-trained on ImageNet is first downloaded; the collected color mode images are then processed with the preprocessing described above, the three-channel images obtained in step S5 are input into the CMNet network model, and the CMNet network model is trained by transfer learning.
In this embodiment, a new network model, CMNet, shown in Fig. 7, is proposed for processing color mode images, extracting features and detecting attacks; it uses the classical DenseNet as its backbone network, combined with the recent pixel supervision technique, with BCELoss as the loss. DenseNet is selected as the backbone rather than alternatives such as ResNet because it feeds shallow features forward as input to the deeper layers, so the shallow features are well utilized, which is very suitable when the amount of data is insufficient. Secondly, compared with the traditional approach of treating liveness detection as binary classification, pixel supervision introduces supervision information at the pixel level before the binary classifier: a live photo corresponds to a 14x14 supervision matrix of all ones, and a spoofed photo to a 14x14 supervision matrix of all zeros. BCELoss is a loss function commonly used for classification in neural networks and performs well.
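The supervision scheme can be sketched in NumPy as follows. The 14x14 all-ones/all-zeros targets come from the text above, while the equal weighting of the pixel-wise BCE and the class cross-entropy is an assumption; the patent does not state how the two losses are combined:

```python
import numpy as np

def pixel_target(is_live: bool, size: int = 14) -> np.ndarray:
    """14x14 supervision map: all ones for a live face, all zeros for a spoof."""
    return np.ones((size, size)) if is_live else np.zeros((size, size))

def bce(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross entropy averaged over the map (the BCELoss of step S62)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def cmnet_loss(pixel_map, class_probs, is_live, w_pixel=0.5):
    """Combined loss: pixel-level BCE plus cross entropy on the binary class
    probabilities; the 0.5 weighting is illustrative, not from the patent."""
    label = 1 if is_live else 0
    ce = -np.log(max(class_probs[label], 1e-7))
    return w_pixel * bce(pixel_map, pixel_target(is_live)) + (1 - w_pixel) * ce
```

Confident predictions on a live sample drive both terms toward zero; a confidently wrong pixel map or class probability dominates the loss.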
And S7, deploying the trained CMNet network model on a server to perform silent liveness detection.
Referring to the following table, the performance of the invention is compared with several of the latest algorithms:
Training is carried out on data containing paper-print and screen-replay attacks, and testing on an unseen photo-print attack, in order to compare the robustness of the different algorithms to unknown attacks; the algorithm provided by the invention is clearly superior to the others. Its processing speed is also the second fastest of all the algorithms compared, needing only 80 ms per picture on an Intel Core i5-7500 CPU.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.
Claims (4)
1. A robust silent liveness detection method based on color mode images, characterized by comprising the following steps:
S1, shooting a face photo under supplementary high-energy visible light illumination;
S2, detecting the specular component of the photo shot in step S1, and acquiring a color mode image, a saturation image and a brightness image;
s3, carrying out face detection and secondary cutting on the color mode image acquired in the step S2, comprising the following steps:
s31, carrying out face detection on the color mode image;
S32, extracting key points of the detected face, and cutting the face a second time using the key-point coordinates, reducing the width of the face region while keeping the height unchanged;
S4, binarizing the color mode image cut in step S3, and retaining the largest connected region;
S5, combining the color mode image processed in step S4 with the unprocessed saturation image and brightness image to obtain a three-channel image;
s6, building a CMNet network model, and training the CMNet network model by using the three-channel image obtained in the step S5;
S7, performing silent liveness detection using the trained CMNet network model.
2. The robust silent liveness detection method based on color mode images as claimed in claim 1, wherein step S2 specifically comprises:
transferring the photo taken in step S1 from the RGB color space to the HSV color space, separating the color and illumination of the image, and acquiring a color mode image, a saturation image and a brightness image.
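As an illustration only (not part of the claimed method), the color/illumination separation of step S2 can be sketched with the standard-library `colorsys` module: hue serves as the "color mode" channel, with saturation and value (brightness) split off separately:

```python
import colorsys

def split_hsv(rgb_image):
    """Convert an RGB image (nested lists of 0-255 triples) into separate
    hue ("color mode"), saturation, and value (brightness) channels,
    separating color from illumination as in step S2."""
    h_ch, s_ch, v_ch = [], [], []
    for row in rgb_image:
        h_row, s_row, v_row = [], [], []
        for (r, g, b) in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            h_row.append(h)  # hue in [0, 1): the "color mode" channel
            s_row.append(s)  # saturation
            v_row.append(v)  # value / brightness
        h_ch.append(h_row)
        s_ch.append(s_row)
        v_ch.append(v_row)
    return h_ch, s_ch, v_ch

# A pure red pixel: hue 0, full saturation, full brightness.
h, s, v = split_hsv([[(255, 0, 0)]])
```

A production system would use a vectorized conversion (e.g. OpenCV's `cv2.cvtColor` with `COLOR_BGR2HSV`) rather than a per-pixel Python loop; the sketch only shows which HSV component feeds which of the three channels.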
3. The method for robust silence liveness detection based on color modality images as claimed in claim 1, wherein said step S4 comprises the sub-steps of:
S41, performing binarization on the color mode image cut in step S3 using the adaptive-threshold Otsu binarization method: first computing the gray-level histogram and probability histogram of the original image, then traversing all possible thresholds t, finding the threshold at which the between-class variance is maximal, and binarizing the color mode image with that threshold;
S42, acquiring the connected regions contained in the color mode image binarized in step S41, calculating the areas of all connected regions, sorting them, retaining the connected region with the largest area and deleting the others, to obtain a binarized color mode image containing only one connected region.
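As an illustration only (not part of the claimed method), steps S41 and S42 — Otsu thresholding followed by keeping the largest connected region — can be sketched in pure Python:

```python
def otsu_threshold(gray):
    """Adaptive Otsu threshold (step S41): pick the t that maximizes
    the between-class variance of the gray-level histogram."""
    hist = [0] * 256
    n = 0
    for row in gray:
        for px in row:
            hist[px] += 1
            n += 1
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 = sum(hist[:t + 1])          # pixels at or below t
        w1 = n - w0                     # pixels above t
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum(i * hist[i] for i in range(t + 1)) / w0
        mu1 = sum(i * hist[i] for i in range(t + 1, 256)) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance (scaled)
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def largest_component(binary):
    """Keep only the largest 4-connected region of 1s (step S42)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                comp, stack = [], [(y, x)]   # flood fill from this seed
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for cy, cx in best:
        out[cy][cx] = 1
    return out
```

In practice these operations map directly onto OpenCV's `cv2.threshold` with the `THRESH_OTSU` flag and `cv2.connectedComponentsWithStats`; the sketch spells out what those calls compute.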
4. The method according to claim 3, wherein the step S6 includes the following sub-steps:
S61, extracting features using DenseNet121 as the backbone network, then classifying with the extracted features through a fully connected layer to judge whether the image is a real image or a spoofed image;
S62, combining the pixel supervision technique, taking BCELoss and Cross Entropy Loss as the loss functions, and supervising the training of the model to establish the CMNet network model;
S63, selecting a stochastic gradient descent optimizer as the optimizer in the training process, inputting the three-channel image obtained in step S5 into the CMNet network model, and training the CMNet network model by means of transfer learning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110116023.9A CN112766205B (en) | 2021-01-28 | 2021-01-28 | Robustness silence living body detection method based on color mode image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112766205A CN112766205A (en) | 2021-05-07 |
CN112766205B true CN112766205B (en) | 2022-02-11 |
Family
ID=75706367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110116023.9A Expired - Fee Related CN112766205B (en) | 2021-01-28 | 2021-01-28 | Robustness silence living body detection method based on color mode image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112766205B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101059836A (en) * | 2007-06-01 | 2007-10-24 | 华南理工大学 | Human eye positioning and human eye state recognition method |
CN101976352A (en) * | 2010-10-29 | 2011-02-16 | 上海交通大学 | Various illumination face identification method based on small sample emulating and sparse expression |
CN108197534A (en) * | 2017-12-19 | 2018-06-22 | 迈巨(深圳)科技有限公司 | A kind of head part's attitude detecting method, electronic equipment and storage medium |
CN111160257A (en) * | 2019-12-30 | 2020-05-15 | 河南中原大数据研究院有限公司 | Monocular human face in-vivo detection method stable to illumination transformation |
CN111932540A (en) * | 2020-10-14 | 2020-11-13 | 北京信诺卫康科技有限公司 | CT image contrast characteristic learning method for clinical typing of new coronary pneumonia |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590430A (en) * | 2017-07-26 | 2018-01-16 | 百度在线网络技术(北京)有限公司 | Biopsy method, device, equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Automatic Face Detection Using Color Based Segmentation; Yogesh Tayal et al.; International Journal of Scientific and Research Publications; 2012-06-30; Vol. 2, No. 6; pp. 1-7 *
Lightweight face liveness detection method based on multi-modal feature fusion; Pi Jiatian et al.; Journal of Computer Applications; 2020-12-10; Vol. 40, No. 12; pp. 3658-3665 *
Also Published As
Publication number | Publication date |
---|---|
CN112766205A (en) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Matern et al. | Gradient-based illumination description for image forgery detection | |
WO2019134536A1 (en) | Neural network model-based human face living body detection | |
JP4755202B2 (en) | Face feature detection method | |
US8582875B2 (en) | Method for skin tone detection | |
CN111460931A (en) | Face spoofing detection method and system based on color channel difference image characteristics | |
CN109948566B (en) | Double-flow face anti-fraud detection method based on weight fusion and feature selection | |
CN112861671B (en) | Method for identifying deeply forged face image and video | |
CN112052830B (en) | Method, device and computer storage medium for face detection | |
CN115131880A (en) | Multi-scale attention fusion double-supervision human face in-vivo detection method | |
Tao et al. | Smoke vehicle detection based on robust codebook model and robust volume local binary count patterns | |
CN111882525A (en) | Image reproduction detection method based on LBP watermark characteristics and fine-grained identification | |
Zaidan et al. | A new hybrid module for skin detector using fuzzy inference system structure and explicit rules | |
CN112766205B (en) | Robustness silence living body detection method based on color mode image | |
CN112016437A (en) | Living body detection method based on face video key frame | |
Hadiprakoso | Face anti-spoofing method with blinking eye and hsv texture analysis | |
JP3962517B2 (en) | Face detection method and apparatus, and computer-readable medium | |
Alharbi et al. | Spoofing Face Detection Using Novel Edge-Net Autoencoder for Security. | |
Chang et al. | Image Forgery Using An Enhanced Bayesian Matting Algorithm | |
CN109961025B (en) | True and false face identification and detection method and detection system based on image skewness | |
CN116012248B (en) | Image processing method, device, computer equipment and computer storage medium | |
WO2024025134A1 (en) | A system and method for real time optical illusion photography | |
CN112818782B (en) | Generalized silence living body detection method based on medium sensing | |
Neves et al. | GAN Fingerprints in Face Image Synthesis | |
CN117541969B (en) | Pornography video detection method based on semantics and image enhancement | |
Fang et al. | Studies Advanced in Robust Face Recognition under Complex Light Intensity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20220211 |