CN113743192A - Silent human face living body detection model and method - Google Patents

Silent human face living body detection model and method

Info

Publication number
CN113743192A
CN113743192A
Authority
CN
China
Prior art keywords
face
branch
model
convolution
central differential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110810804.8A
Other languages
Chinese (zh)
Inventor
李开
邹复好
甘早斌
肖伟
向文
卢萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202110810804.8A
Publication of CN113743192A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a silent face living body detection model and method. The model comprises a face detection module, a skeleton network module and a central differential convolution classification branch. The face detection module obtains a face detection frame from an input picture and feeds it to the skeleton network module; the skeleton network module extracts convolution features, to which the central differential convolution classification branch is connected; the central differential convolution classification branch judges whether the face in the input picture is a living body. The embodiment of the invention performs real-time silent face living body detection based only on RGB single-frame image data, requiring neither video data nor user cooperation, and thus greatly saves cost compared with face living body detection schemes based on infrared cameras, 3D structured-light cameras and multi-view cameras; compared with cooperative face living body detection schemes, it offers higher passing efficiency and better user experience.

Description

Silent human face living body detection model and method
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a silent face living body detection model and a silent face living body detection method.
Background
With the steady accumulation of video surveillance data, the continued development of hardware platforms and rapid breakthroughs in computer vision technology, face recognition based on deep learning has achieved remarkable success in fields such as city security and smart communities, and is being deployed ever more widely. However, with the popularization of multimedia devices and the Internet, high-quality face images and videos are increasingly easy to acquire, so traditional face recognition algorithms face serious face spoofing attacks such as printed photos, masks, occlusions and screen replays. Verifying that the user in front of the camera is a living body has therefore become extremely important in face recognition.
Because interactive video-based face living body detection schemes are inefficient and place a burden on the user, they are generally applied only in a few demanding scenarios such as payment verification. In the much broader scenarios of video surveillance and intelligent access control, silent face living body detection schemes offer higher passing efficiency and better user experience. Moreover, since infrared cameras, multi-view cameras and 3D structured-light cameras are expensive and rarely deployed in practice, face living body detection based on depth (RGBD) images sees little application. Silent face living body detection based on ordinary RGB face images is therefore widely favored by industry and academia and has broad application prospects.
In general, existing silent face living body detection methods using RGB images rely on a single kind of local detail feature, such as texture, optical flow, 3D information or traditional handcrafted features, and thus easily overfit a particular scene, causing false detections and missed detections; their practicality is poor. It is therefore necessary to design an accurate and robust silent face living body detection model based on RGB images.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a silent face living body detection model and method that overcome, or at least partially solve, the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a silent face living body detection model, comprising a face detection module, a skeleton network module and a central differential convolution classification branch. The face detection module is used for obtaining a face detection frame from an input picture and feeding it to the skeleton network module; the skeleton network module is used for extracting convolution features, to which the central differential convolution classification branch is connected; the central differential convolution classification branch is used for judging whether the face in the input picture is a living body.
According to a second aspect of the embodiments of the present invention, there is provided a silent face living body detection method, comprising: inputting the face image to be detected into the silent face living body detection model provided in the first aspect, and obtaining the detection result output by the model.
The silent face living body detection model and method provided by the embodiments of the invention have at least the following effects:
(1) The algorithm is based only on RGB single-frame image data and performs real-time silent face living body detection, requiring neither video data nor user cooperation; compared with face living body detection schemes based on infrared cameras, 3D structured-light cameras and multi-view cameras it greatly saves cost, and compared with cooperative face living body detection schemes it offers higher passing efficiency and better user experience.
(2) A central differential convolution classification branch is designed, which aggregates the semantic information and the gradient information of the feature map to capture fine-grained features and local correlations in various environments, thereby extracting more robust face living body features, powerfully countering various attacks and greatly improving living body classification accuracy.
(3) A reflection map prediction branch is introduced, which constructs a supervision signal by pixel-level reflection separation on a single image; the reflection artifacts caused by light reflected from a smooth surface serve as strong evidence for living body judgment, so that 2D re-imaging attacks can be fully countered.
(4) A depth map prediction branch is designed, which uses the 3D spatial distribution of the face as an auxiliary supervision signal; the depth map label of a face image is obtained from a pre-trained PRNet (Position map Regression Network), fully countering 2D re-imaging attacks.
(5) A Fourier spectrum prediction branch is designed, which uses the Fourier spectrogram of the face image as an additional supervision signal, fully exploiting the marked difference in Fourier spectrum distribution between real faces and various fake faces, effectively countering various attacks and markedly improving living body detection accuracy.
(6) A plug-and-play lightweight attention module comprising spatial and channel attention mechanisms is designed, which strengthens the spatial regions or channels that need enhancement and suppresses unnecessary ones, improving the learning and expressive capacity of the skeleton network.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of a silent human face live detection model according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a multi-scale human face image generation process according to an embodiment of the invention.
Fig. 3 is a schematic diagram of a skeleton network structure of a silent human face in-vivo detection model based on a deep convolutional neural network and an RGB single-frame image according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of Block1 and Block2 in the skeleton network according to the embodiment of the present invention.
FIG. 5 is a schematic view of an attention module according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of a channel attention module according to an embodiment of the invention.
FIG. 7 is a schematic diagram of a spatial attention module according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of a central differential convolution according to an embodiment of the present invention.
FIG. 9 is a diagram of a central differential convolution classification branch according to an embodiment of the present invention.
FIG. 10 is a diagram of a branch for reflection map prediction according to an embodiment of the present invention.
FIG. 11 is a diagram of Fourier spectrogram prediction branches according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating a branch prediction for a depth map according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a silent face living body detection model based on a deep convolutional neural network and RGB single-frame images. The model does not depend on a depth camera or a binocular camera and uses only RGB images. A lightweight convolutional neural network structure and dedicated functional operators are carefully designed; spatial attention and channel attention mechanisms are introduced for feature enhancement; training uses multi-scale face images together with several auxiliary supervision signals, including a depth map, a Fourier spectrogram and a reflection map; and during forward testing, multi-scale face information and scene context information are fully taken into account. The model can therefore make real-time, accurate living body judgments on face images under video surveillance.
Specifically, fig. 1 is a schematic structural diagram of the silent face living body detection model according to an embodiment of the present invention. Referring to fig. 1, the model comprises a face detection module, a skeleton network module and a central differential convolution classification branch. The face detection module obtains a face detection frame from an input picture and feeds it to the skeleton network module; the skeleton network module extracts convolution features, to which the central differential convolution classification branch is connected; the central differential convolution classification branch judges whether the face in the input picture is a living body.
Based on the content of the foregoing embodiment, as an optional embodiment, the silent face living body detection model further includes a reflection map prediction branch, a Fourier spectrogram prediction branch and a depth map prediction branch. These three branches and the central differential convolution classification branch are parallel branches attached to the convolution features. The reflection map prediction branch predicts a reflection map of the input picture; the Fourier spectrogram prediction branch predicts a Fourier spectrogram of the input picture; the depth map prediction branch predicts a depth map of the input picture.
Based on the content of the foregoing embodiment, as an optional embodiment, the reflection map prediction branch, the Fourier spectrogram prediction branch and the depth map prediction branch are used during training to provide auxiliary supervision signals for optimizing the central differential convolution classification branch; in actual use, only the central differential convolution classification branch is called.
Specifically, the face detection module is the pre-processing stage of the whole algorithm. The obtained face detection frame is sent to the skeleton network module to extract convolution features, and four parallel branches are attached to these features: a reflection map prediction branch, a Fourier spectrogram prediction branch, a depth map prediction branch and a central differential convolution classification branch. The first three branches predict the reflection map (Reflection Map), the Fourier spectrogram and the depth map (Depth Map) of the input picture respectively, each supervised by the corresponding ground-truth label. The central differential convolution classification branch uses central differential convolution to judge whether the input face is a living body. The first three branches only provide auxiliary supervision signals for optimizing the classification branch during training; during forward testing only the classification branch is called. A sketch of this joint supervision follows.
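As an illustration only, a minimal PyTorch sketch of this joint supervision is given below; the module name, argument names and the loss weights (w_refl, w_fft, w_depth) are assumptions of this sketch, since the publication does not state how the four losses are balanced:

```python
import torch.nn as nn

class LiveDetectionLoss(nn.Module):
    """Joint training loss: cross-entropy on the live/spoof classification
    branch plus MSE on the three auxiliary prediction branches. Only the
    classification branch is evaluated at inference time."""
    def __init__(self, w_refl=1.0, w_fft=1.0, w_depth=1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()
        self.w_refl, self.w_fft, self.w_depth = w_refl, w_fft, w_depth

    def forward(self, cls_logits, refl_pred, fft_pred, depth_pred,
                label, refl_gt, fft_gt, depth_gt):
        # The three MSE terms exist purely to supervise the shared
        # skeleton network; they are dropped during forward testing.
        return (self.ce(cls_logits, label)
                + self.w_refl * self.mse(refl_pred, refl_gt)
                + self.w_fft * self.mse(fft_pred, fft_gt)
                + self.w_depth * self.mse(depth_pred, depth_gt))
```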
Based on the content of the above embodiment, as an optional embodiment, the face detection module is specifically configured to generate multi-scale face pictures as training data for model training in the training stage, and to generate multi-scale face detection frames in real time as input in the actual use stage.
Specifically, the face detection module generates multi-scale face pictures as living body detection training data in the training stage, and generates multi-scale face detection frames in real time as input in the testing stage. RetinaFace is used as the face detector. In the training stage, training data is prepared first: the pre-trained RetinaFace is run forward on the original video surveillance images to obtain face detection frame pictures, which are scaled to a resolution of 112 × 112. To enhance the robustness of the model to multi-scale faces, the original detection frame is expanded by factors of 2, 3 and 4 in the original image (stopping at the image boundary), and each expanded frame is then scaled to the fixed resolution of 112 × 112, so that one face yields four training pictures at different scales. In the testing stage, to capture the multi-scale information of the face and the context information of the scene, the detection frame is likewise expanded by factors of 2, 3 and 4 and scaled to 112 × 112, so that one face produces four input pictures. These are fed to the model simultaneously to obtain four confidence probability vectors, whose average is taken as the final result.
To improve the robustness of the model to multi-scale faces when preparing training data, the face image found by the face detector is expanded at multiple scales: the original detection frame (at a resolution of 112 × 112) is expanded by factors of 2, 3 and 4, and all expanded frames are then scaled back to 112 × 112, so that one face yields four training images at different scales, as shown in fig. 2; this enhances the robustness of the model to multi-scale faces. Similarly, during forward testing, to enlarge the receptive field and make full use of scene context and multi-scale face information, the same multi-scale transformation is applied, so that one forward pass produces four confidence probability values whose average is taken as the final output. A minimal sketch of this process follows.
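A minimal sketch of the crop generation and test-time averaging, assuming an OpenCV/NumPy environment and a detector box in (x1, y1, x2, y2) pixel coordinates (the function and variable names are illustrative, not the patent's code):

```python
import cv2
import numpy as np

def multiscale_crops(image, box, scales=(1, 2, 3, 4), size=112):
    """Expand a face detection box by each scale factor, stop at the image
    boundary, and resize every crop to size x size, as described above."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    bw, bh = x2 - x1, y2 - y1
    crops = []
    for s in scales:
        nx1 = int(max(0, cx - s * bw / 2))
        ny1 = int(max(0, cy - s * bh / 2))
        nx2 = int(min(w, cx + s * bw / 2))
        ny2 = int(min(h, cy + s * bh / 2))
        crops.append(cv2.resize(image[ny1:ny2, nx1:nx2], (size, size)))
    return crops

# Test time: score the four crops together and average the confidence
# vectors (model is assumed to map a batch of crops to per-crop scores):
#   probs = model(np.stack(multiscale_crops(frame, box)))  # shape (4, 2)
#   final = probs.mean(axis=0)
```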
Based on the content of the above embodiment, as an alternative embodiment, the skeleton network module is formed by stacking several depthwise separable convolution residual blocks, and spatial and channel attention mechanisms are introduced into the model to learn, in an end-to-end manner, the information distribution weights of the feature map over space and over channels respectively.
Specifically, the skeleton network (Backbone) module extracts convolution features from the face picture. A lightweight convolutional neural network is designed, and spatial attention and channel attention mechanisms are introduced to enhance the feature expression capacity. A model pre-trained on the MS1M-ArcFace face recognition dataset is used as the feature extraction network, its fully connected layer is removed, and the four branches are attached to the high-level feature map.
The skeleton network module rapidly extracts convolution features from the input picture. In its concrete design, to guarantee real-time performance and reduce the computation and parameter count of the model, the network is built entirely by stacking depthwise separable convolution residual blocks, with carefully chosen channel numbers at each layer, achieving the best balance between accuracy and speed. Meanwhile, to strengthen feature robustness and expressiveness, spatial and channel attention mechanisms are introduced into the model: the information distribution weights of the feature map over space and channels are learned end to end, important local spatial regions and channels are enhanced, and spatial noise and redundant channel information are weakened. The number of down-sampling steps is also reduced, so that a higher-resolution high-level feature map is obtained and as much image information as possible is preserved.
The overall structure of the skeleton network is shown in fig. 3. The network is formed by stacking depthwise separable convolution residual blocks, with channels scaled by 1 × 1 convolutions. Block1 and Block2 are both bottleneck structures; their detailed structure is shown in fig. 4. Through short-circuit connections, the residual block structure greatly alleviates vanishing and exploding gradients in deep networks and also effectively prevents overfitting. In addition, the channel numbers of the network are carefully pruned to guarantee real-time performance and obtain the best speed/accuracy balance: the computational cost of the model designed in this embodiment is only 80M operations, and the parameter count is only 0.4M. A sketch of one such block follows.
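The exact layer widths appear only in the image-based Table 1, so the following PyTorch sketch shows just the flavor of one depthwise separable convolution residual block with 1 × 1 channel scaling and a short-circuit connection; the expansion factor and the placement of BatchNorm/ReLU are assumptions:

```python
import torch
import torch.nn as nn

class DWSeparableResBlock(nn.Module):
    """One plausible reading of the Block1/Block2 bottleneck in fig. 4:
    1x1 expansion, depthwise 3x3, 1x1 projection, plus a residual path."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        mid = channels * expansion
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),       # 1x1 channel scaling
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid,  # depthwise 3x3
                      bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),       # 1x1 projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Short-circuit connection eases gradient flow in deep stacks.
        return torch.relu(x + self.body(x))
```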
TABLE 1 Overall framework network architecture
(Table 1 is provided as an image in the original publication.)
In addition, to strengthen the learning capacity of the skeleton network and further enhance feature robustness, the invention inserts a lightweight attention module into the skeleton network; its schematic diagram is shown in fig. 5. It comprises a channel attention module and a spatial attention module, shown in fig. 6 and fig. 7 respectively, which emphasize important channels and spatial regions and suppress unnecessary ones, producing more robust and more expressive features.

Specifically, for an input feature map of dimension H × W × C, the channel attention module first produces two feature vectors of dimension 1 × 1 × C by average pooling and maximum pooling respectively. Each is passed through a shared multilayer perceptron (i.e., fully connected layers), keeping the dimension 1 × 1 × C; the two results are added and passed through a Sigmoid activation, yielding a channel attention vector of dimension 1 × 1 × C. The value at each position of this vector, between 0 and 1, is the weight of the corresponding channel. Multiplying the input feature map by the channel attention vector gives a channel attention feature map of dimension H × W × C, in which each channel is scaled by its attention weight: channels with larger weights are strengthened and channels with smaller weights are suppressed.

A spatial attention module follows the channel attention feature map. For the H × W × C channel attention feature map, average pooling and maximum pooling over the channel dimension first give pooled maps of dimension H × W × 1; that is, for each spatial position the average or maximum over all channels is taken. A Sigmoid activation then yields a spatial attention map of dimension H × W × 1, in which the value at each position is that position's spatial distribution weight. Multiplying the channel attention feature map by the spatial attention map gives the final output feature map: the values of all channels at each spatial position are scaled by the weight at the corresponding position of the spatial attention map, enhancing important local regions and suppressing redundant spatial regions. A sketch of both modules follows.
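A compact PyTorch sketch of the two modules follows. The reduction ratio of the shared perceptron, and the small convolution used to merge the two pooled spatial maps (a step the translated text does not spell out and which is borrowed here from the standard CBAM formulation), are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared multilayer perceptron
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # 1x1xC
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # 1x1xC
        return x * torch.sigmoid(avg + mx)         # scale each channel

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # HxWx1 mean over channels
        mx = torch.amax(x, dim=1, keepdim=True)    # HxWx1 max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                            # scale each spatial position
```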
Based on the content of the foregoing embodiment, as an optional embodiment, the reflection map prediction branch is specifically configured to predict a reflection map of the input image and then calculate an MSE loss with the actually labeled reflection map, for assisting the learning of the central differential convolution classification branch; this auxiliary supervision signal targets 2D re-imaging attacks.
Specifically, the reflection map prediction (Reflection Map) branch predicts the reflection map of the input face image and then computes an MSE loss against the actually labeled reflection map, assisting the learning of the classification branch; the reflection map label of the original image is obtained by a perceptual reflection removal algorithm. This auxiliary supervision signal mainly targets 2D re-imaging attacks (video and photo replay), because such attacks generally exhibit obvious reflection artifacts caused by light reflected from a smooth surface, which serve as important evidence for living body judgment.
The structure of the reflection map prediction branch is shown in fig. 10. First, a 1 × 1 convolution reduces the channels of the skeleton network's top-level feature map from 512 to 3, i.e., the feature map dimension changes from 14 × 14 × 512 to 14 × 14 × 3, since the reflection map is an RGB three-channel image. Two layers of 3 × 3 convolution then yield the predicted reflection map, and the MSE loss against the labeled reflection map supervises the optimization of the skeleton network and assists the learning of the classification branch. The reflection map label of the original image comes from a pre-trained perceptual reflection removal model, which performs pixel-level reflection separation on a single picture. A sketch of this head follows.
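A sketch of this head under the stated dimensions; whether normalization or activations sit between the convolutions is not specified, so none are included:

```python
import torch.nn as nn

# 14x14x512 top-level features -> 14x14x3 predicted reflection map.
reflection_head = nn.Sequential(
    nn.Conv2d(512, 3, kernel_size=1),            # channel reduction 512 -> 3
    nn.Conv2d(3, 3, kernel_size=3, padding=1),   # two 3x3 convolutions,
    nn.Conv2d(3, 3, kernel_size=3, padding=1),   # resolution unchanged
)
# Training: loss = nn.MSELoss()(reflection_head(features), reflection_label)
```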
Based on the content of the foregoing embodiment, as an optional embodiment, the Fourier spectrogram prediction branch is specifically configured to predict a Fourier spectrogram of the input image and then calculate an MSE loss with the actually labeled Fourier spectrogram, for assisting the learning of the central differential convolution classification branch.
Specifically, the Fourier spectrogram prediction branch predicts the Fourier spectrogram of the input picture and then computes an MSE loss against the actually labeled Fourier spectrogram to assist the learning of the classification branch; the Fourier spectrogram label is obtained by applying a Fourier transform to the input picture.
The Fourier spectrogram prediction branch is designed to predict the Fourier spectrogram of the input picture; its structure is shown in fig. 11. Converting fake and real face pictures into the frequency domain and comparing them shows that the high-frequency content of fake faces has a narrow distribution, extending essentially along the horizontal and vertical directions, whereas the high-frequency content of real faces is divergent. Because the Fourier spectrograms of real and fake faces differ, this branch is introduced for auxiliary supervision: the MSE loss against the actually labeled Fourier spectrogram assists the learning of the classification branch. The Fourier spectrum label of the original picture is obtained online by applying a Fourier transform to the input picture, followed by normalization and resizing, as sketched below.
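A minimal sketch of the label construction; the grayscale conversion, log-magnitude scaling and 14 × 14 output size are assumptions of this sketch, since the text only says the transform is followed by normalization and resizing:

```python
import cv2
import numpy as np

def fourier_spectrum_label(face_bgr, size=14):
    """FFT of a face crop -> centered log-magnitude -> [0, 1] -> resize."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    spec = np.fft.fftshift(np.fft.fft2(gray))  # move zero frequency to center
    mag = np.log1p(np.abs(spec))               # compress the dynamic range
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
    return cv2.resize(mag.astype(np.float32), (size, size))
```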
Based on the content of the foregoing embodiment, as an optional embodiment, the depth map prediction branch is specifically configured to predict a depth map of the input image, reflecting the distribution of the input face in 3D space, and to calculate an MSE loss between the predicted depth map and the labeled depth map, for assisting the learning of the central differential convolution classification branch.
Specifically, the depth map (Depth Map) prediction branch predicts the depth map of the input picture, reflecting the distribution of the input face in 3D space, and computes an MSE loss between the predicted depth map and the labeled depth map to assist the learning of the classification branch; the depth map label comes from the forward computation of a pre-trained PRNet model on the input face image.
The depth map prediction branch predicts the depth map of the input picture, reflecting the distribution of the input face in 3D space; its structure is shown in fig. 12. First, a 1 × 1 convolution reduces the channels of the skeleton network's top-level feature map from 512 to 64; two layers of 3 × 3 convolution follow without changing the resolution; finally, one layer of 1 × 1 convolution yields a predicted depth map of dimension 14 × 14 × 1. The MSE loss against the labeled depth map supervises the optimization of the skeleton network and assists the learning of the classification branch. The labeled depth map of the original image is derived from the forward computation of the pre-trained PRNet model on the input face image. PRNet is used for 3D face reconstruction: it regresses 3D parameters directly from the 2D face and outputs a UV position map. A sketch of this head follows.
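A sketch of this head under the stated dimensions (activations/normalization again omitted because the text does not specify them):

```python
import torch.nn as nn

# 14x14x512 top-level features -> 14x14x1 predicted depth map.
depth_head = nn.Sequential(
    nn.Conv2d(512, 64, kernel_size=1),            # channel reduction 512 -> 64
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # two 3x3 convolutions,
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # resolution unchanged
    nn.Conv2d(64, 1, kernel_size=1),              # 14x14x1 depth prediction
)
# Training: loss = nn.MSELoss()(depth_head(features), prnet_depth_label)
```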
Based on the content of the foregoing embodiment, as an optional embodiment, the central differential convolution classification branch is specifically configured to predict whether the face in the input image is a living body, outputting a two-dimensional confidence probability vector; the central difference convolution emulates the local binary pattern to capture local correlations.
Specifically, the central differential convolution classification branch predicts whether the input face is a living body, outputting a two-dimensional confidence probability vector; the central difference convolution emulates the Local Binary Pattern (LBP) to capture local correlations and learn more robust face living body features.
A classification branch based on central difference convolution is designed. The computation flow of central difference convolution is shown in fig. 8: the value at the central position is subtracted from the values at the positions of the feature map covered by the convolution kernel, and the result is then combined with the convolution kernel parameters. An ordinary 2D convolution first samples the region of the input feature map corresponding to the kernel (the receptive field region) and then computes the weighted sum with the kernel's parameter weights:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$
where $R$ denotes the receptive field on the feature map; for a 3 × 3 convolution, $R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$, containing 9 positions; $p_0$ denotes the current position on the input and output feature maps, and $p_n$ enumerates the positions within the receptive field. Modeling the Local Binary Pattern (LBP), the computational expression of the initial central difference convolution is:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot \left( x(p_0 + p_n) - x(p_0) \right)$$
When $p_n = (0,0)$, the difference relative to the central position is always 0. The semantic information of the feature map itself and the locally correlated gradient information are equally important for the face living body detection task, so the improved central difference convolution designed here combines the ordinary convolution with the initial central difference convolution:
$$y(p_0) = \theta \sum_{p_n \in R} w(p_n) \cdot \left( x(p_0 + p_n) - x(p_0) \right) + (1 - \theta) \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$
simplifying to obtain:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) - \theta \cdot x(p_0) \sum_{p_n \in R} w(p_n)$$
the present invention sets θ to 0.7. The central differential convolution captures fine-grained features and local correlation in various environments by simultaneously converging semantic information and gradient information of the feature map, and learns more robust human face living body features.
The structure of the classification branch based on central difference convolution is shown in the following table, with its topology in fig. 9: three central difference convolution layers with kernel size 3 and stride 1 form a residual block structure, followed by global average pooling and a fully connected layer, yielding the probability vector of whether the face is a living body (a code sketch follows the table).
Table 2. Classification branch structure based on central difference convolution
(Table 2 is provided as an image in the original publication.)
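Reusing the CDConv2d sketch above, the branch could be assembled as follows; the channel width is an assumption of this sketch, since Table 2 is only available as an image:

```python
import torch
import torch.nn as nn

class CDCClassifier(nn.Module):
    """Three 3x3, stride-1 central difference convolutions arranged as a
    residual block, then global average pooling and a fully connected
    layer giving the two-dimensional live/spoof confidence vector."""
    def __init__(self, channels=512, num_classes=2, theta=0.7):
        super().__init__()
        self.cdc1 = CDConv2d(channels, channels, theta=theta)
        self.cdc2 = CDConv2d(channels, channels, theta=theta)
        self.cdc3 = CDConv2d(channels, channels, theta=theta)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        y = torch.relu(self.cdc1(x))
        y = torch.relu(self.cdc2(y))
        y = torch.relu(self.cdc3(y))
        y = torch.relu(x + y)          # residual (short-circuit) connection
        y = y.mean(dim=(2, 3))         # global average pooling
        return self.fc(y)              # softmax of these logits gives the
                                       # confidence probability vector
```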
Based on the content of the foregoing embodiments, as an optional embodiment, an embodiment of the present invention further provides a face living body detection method, comprising: inputting the face image to be detected into the silent face living body detection model provided in any of the above embodiments, and obtaining the detection result output by the model.
Compared with existing face living body detection algorithms, the silent face living body detection model based on a convolutional neural network and RGB single-frame images provided by the embodiment of the invention has at least the following effects:
(1) The algorithm is based only on RGB single-frame image data and performs real-time silent face living body detection, requiring neither video data nor user cooperation; compared with face living body detection schemes based on infrared cameras, 3D structured-light cameras and multi-view cameras it greatly saves cost, and compared with cooperative face living body detection schemes it offers higher passing efficiency and better user experience.
(2) A central differential convolution classification branch is designed, which aggregates the semantic information and the gradient information of the feature map to capture fine-grained features and local correlations in various environments, thereby extracting more robust face living body features, powerfully countering various attacks and greatly improving living body classification accuracy.
(3) A reflection map prediction branch is introduced, which constructs a supervision signal by pixel-level reflection separation on a single image; the reflection artifacts caused by light reflected from a smooth surface serve as strong evidence for living body judgment, so that 2D re-imaging attacks can be fully countered.
(4) A depth map prediction branch is designed, which uses the 3D spatial distribution of the face as an auxiliary supervision signal; the depth map label of a face image is obtained from a pre-trained PRNet (Position map Regression Network), fully countering 2D re-imaging attacks.
(5) A Fourier spectrum prediction branch is designed, which uses the Fourier spectrogram of the face image as an additional supervision signal, fully exploiting the marked difference in Fourier spectrum distribution between real faces and various fake faces, effectively countering various attacks and markedly improving living body detection accuracy.
(6) A plug-and-play lightweight attention module comprising spatial and channel attention mechanisms is designed, which strengthens the spatial regions or channels that need enhancement and suppresses unnecessary ones, improving the learning and expressive capacity of the skeleton network.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A silent face liveness detection model, comprising: the system comprises a face detection module, a skeleton network module and a central differential convolution classification branch;
the face detection module is used for acquiring a face detection frame based on an input picture and inputting the face detection frame to the skeleton network module;
the skeleton network module is used for extracting convolution features, and the central differential convolution classification branch is connected to the convolution features;
and the central differential convolution classification branch is used for judging whether the face in the input picture is a living body.
2. The model of claim 1, further comprising: a reflection map prediction branch, a Fourier spectrogram prediction branch and a depth map prediction branch;
the reflection map prediction branch, the Fourier spectrogram prediction branch, the depth map prediction branch and the central differential convolution classification branch are parallel branches attached to the convolution features;
the reflection map prediction branch is used for predicting a reflection map of the input picture;
the Fourier spectrogram prediction branch is used for predicting a Fourier spectrogram of the input picture;
the depth map prediction branch is used to predict a depth map of the input picture.
3. The model of claim 2, wherein the reflection map prediction branch, the Fourier spectrogram prediction branch and the depth map prediction branch are used during training to provide auxiliary supervision signals for optimizing the central differential convolution classification branch; in actual use, only the central differential convolution classification branch is called.
4. The model of claim 1, wherein the face detection module is specifically configured to generate a multi-scale face image as training data in a training stage for model training, and generate a multi-scale face detection box as an input in real time in an actual use stage.
5. The model of claim 1, wherein the model of the skeletal network module is stacked from a plurality of depth separable convolutional residual blocks and introduces a spatial and channel attention mechanism in the model to model the information distribution weights of the feature map spatially and on the channels, respectively, in an end-to-end learning manner.
6. The model of claim 2, wherein the reflection map prediction branch is specifically configured to predict a reflection map of the input image and then calculate an MSE loss with the actually labeled reflection map, for assisting the learning of the central differential convolution classification branch; this auxiliary supervision signal targets 2D re-imaging attacks.
7. The model of claim 2, wherein the Fourier spectrogram prediction branch is specifically configured to predict a Fourier spectrogram of the input image and then compute an MSE loss with the actually labeled Fourier spectrogram, for assisting the learning of the central differential convolution classification branch.
8. The model of claim 2, wherein the depth map prediction branch is specifically configured to predict a depth map of the input image, reflect a distribution state of an input face in a 3D space, and calculate an MSE loss according to the predicted depth map and an annotated depth map, so as to assist in learning the central differential convolution classification branch.
9. The model of claim 1, wherein the central differential convolution classification branch is specifically configured to predict whether the face in the input image is a living body, outputting a two-dimensional confidence probability vector; the central difference convolution emulates the local binary pattern to capture local correlations.
10. A face living body detection method is characterized by comprising the following steps:
inputting a face image to be detected into the silent face in-vivo detection model according to any one of claims 1 to 9, and obtaining a detection result output by the silent face in-vivo detection model.
CN202110810804.8A 2021-07-16 2021-07-16 Silent human face living body detection model and method Pending CN113743192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110810804.8A CN113743192A (en) 2021-07-16 2021-07-16 Silent human face living body detection model and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110810804.8A CN113743192A (en) 2021-07-16 2021-07-16 Silent human face living body detection model and method

Publications (1)

Publication Number Publication Date
CN113743192A 2021-12-03

Family

ID=78728702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110810804.8A Pending CN113743192A (en) 2021-07-16 2021-07-16 Silent human face living body detection model and method

Country Status (1)

Country Link
CN (1) CN113743192A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724220A (en) * 2022-04-12 2022-07-08 广州广电卓识智能科技有限公司 Living body detection method, living body detection device, and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN110222647A (en) * 2019-06-10 2019-09-10 大连民族大学 A kind of human face in-vivo detection method based on convolutional neural networks
CN112329696A (en) * 2020-11-18 2021-02-05 携程计算机技术(上海)有限公司 Face living body detection method, system, equipment and storage medium
CN112883940A (en) * 2021-04-13 2021-06-01 深圳市赛为智能股份有限公司 Silent in-vivo detection method, silent in-vivo detection device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system
CN110222647A (en) * 2019-06-10 2019-09-10 大连民族大学 A kind of human face in-vivo detection method based on convolutional neural networks
CN112329696A (en) * 2020-11-18 2021-02-05 携程计算机技术(上海)有限公司 Face living body detection method, system, equipment and storage medium
CN112883940A (en) * 2021-04-13 2021-06-01 深圳市赛为智能股份有限公司 Silent in-vivo detection method, silent in-vivo detection device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZITONG YU et al., "Searching Central Difference Convolutional Networks for Face Anti-Spoofing", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pages 5295-5305 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724220A (en) * 2022-04-12 2022-07-08 广州广电卓识智能科技有限公司 Living body detection method, living body detection device, and readable medium

Similar Documents

Publication Publication Date Title
Xu et al. Inter/intra-category discriminative features for aerial image classification: A quality-aware selection model
CN111444881A (en) Fake face video detection method and device
CN110826389B (en) Gait recognition method based on attention 3D frequency convolution neural network
CN113642634A (en) Shadow detection method based on mixed attention
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
Zhu et al. Towards automatic wild animal detection in low quality camera-trap images using two-channeled perceiving residual pyramid networks
CN115661611A (en) Infrared small target detection method based on improved Yolov5 network
Lu et al. Category-consistent deep network learning for accurate vehicle logo recognition
Tang et al. SDRNet: An end-to-end shadow detection and removal network
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN113743192A (en) Silent human face living body detection model and method
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN116682178A (en) Multi-person gesture detection method in dense scene
CN111598841A (en) Example significance detection method based on regularized dense connection feature pyramid
CN113780305B (en) Significance target detection method based on interaction of two clues
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN115147758A (en) Depth forged video detection method and system based on intra-frame inter-frame feature differentiation
CN112446292B (en) 2D image salient object detection method and system
Saif et al. Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition
CN114596609A (en) Audio-visual counterfeit detection method and device
CN111160255B (en) Fishing behavior identification method and system based on three-dimensional convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination