WO2021248733A1 - Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium - Google Patents

Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium

Info

Publication number
WO2021248733A1
WO2021248733A1 · PCT/CN2020/116644 · CN2020116644W
Authority
WO
WIPO (PCT)
Prior art keywords
module
convolution
sub
layer
branch
Prior art date
Application number
PCT/CN2020/116644
Other languages
French (fr)
Chinese (zh)
Inventor
沈海斌
欧阳文汉
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 filed Critical 浙江大学
Publication of WO2021248733A1 publication Critical patent/WO2021248733A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Definitions

  • The invention relates to the field of face liveness detection, and in particular to a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
  • Face recognition verification stands out among many authentication systems and has been deployed on a large scale in everyday life. To ensure security and prevent various potential hacker attacks, face liveness detection is a crucial part of a face verification system.
  • At present, the biggest problem facing face liveness detection algorithms is insufficient generalization: many trained models perform well on their training sets and the corresponding test sets, but their performance on brand-new, unknown data sets is unsatisfactory, which greatly reduces the practical deployment value of such algorithms. In response to this, the invention focuses mainly on improving the generalization of the face liveness detection model.
  • The local binary pattern method has notable advantages such as grayscale invariance and rotation invariance and is simple to compute, but it is also comparatively simplistic.
  • The speeded-up robust features method uses the determinant of the Hessian matrix for feature point response detection and uses an integral image to accelerate the computation.
  • Regardless of the specific method, most traditional feature methods rely on hand-crafted features combined with traditional shallow classifiers such as SVM and LDA for liveness detection.
  • Traditional hand-crafted feature extraction methods, limited by their own design and their training samples, can only target specific attack types or apply to specific environments or lighting conditions. Even comprehensive combinations of multiple traditional feature extraction methods share this limitation: because their thresholds and parameters are usually set manually, they cannot achieve strong adaptability and generalization, cannot handle unknown scenarios and attack methods, and are mostly fragile and unstable in real scenarios.
  • Although interactive methods are relatively simple and effective, the whole verification process takes longer and brings negative experiences in terms of convenience and usability. Moreover, video attacks can often defeat interactive checks based on blink detection or lip movement, so the limitations of interactive face liveness detection algorithms are also obvious.
  • A two-dimensional convolutional neural network is a feed-forward network whose artificial neurons respond to surrounding units within a limited receptive field, and it performs excellently on image processing. Compared with the local binary pattern method, it extracts two-dimensional image features with a certain degree of generalization, thereby improving model accuracy. However, deep learning methods also have bottlenecks: although such models perform very well on many data sets, they still perform poorly in cross-data-set testing. This is because most two-dimensional CNN models focus only on learning the texture features of the training samples, and those texture features vary strongly and randomly with environment, lighting, attack method and display device material, so the models cannot fit well the texture features of brand-new samples outside the training set.
  • The present invention provides a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
  • The present invention uses a three-dimensional convolutional neural network as the model backbone, which can extract high-dimensional abstract features and also summarize concrete shallow features from the shallower layers of the network, thereby obtaining more comprehensive temporal motion features. By taking both high-dimensional and low-dimensional features into account, the model achieves better results.
  • The three-dimensional convolutional neural network has a stronger ability to extract information in the time domain and is better suited as a framework for face liveness detection. Compared with an ordinary two-dimensional convolutional network it extracts temporal information better, and compared with a recurrent neural network it attends more evenly to low-order and high-order feature information, improving the generalization ability of the whole system.
  • The purpose of the present invention is to provide a face liveness detection system using a dual-branch three-dimensional convolution model, including:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model. When the detection system is in training mode, the output of the liveness motion amplification module is used as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module is used as the input of the static texture information sub-module; the outputs of the two sub-modules are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • Another object of the present invention is to disclose a terminal comprising a memory and a processor.
  • The memory is used to store a computer program.
  • The processor is configured, when the computer program is executed, to realize the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model.
  • Another object of the present invention is to disclose a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model are realized.
  • The three-dimensional convolution model of the present invention adopts a dual-branch structure. The dynamic motion cue branch runs at a high frame rate (25 frames per second) and focuses on collecting dynamic cues of real faces or forgery attacks; the static texture information branch runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme and focuses on extracting the static texture features that distinguish real faces from forged attacks, laying the foundation for the efficient operation of the whole system. The model can therefore extract static spatial texture features and temporal motion features at the same time, which enhances the generalization of the system.
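  • To make the two input rates concrete, the following minimal sketch (an illustrative assumption, not code from the patent) shows how one 8-frame clip sampled at 25 frames per second could be turned into the two branch inputs: the dynamic motion cue branch receives every frame, while the static texture branch receives every 4th frame, i.e. an effective 6.25 frames per second.

```python
import torch

def split_branch_inputs(clip: torch.Tensor, temporal_stride: int = 4):
    """Hypothetical preprocessing: derive the two branch inputs from one clip.

    clip: tensor of shape (C, T, H, W), e.g. (3, 8, 224, 224) sampled at 25 fps.
    In the full system the dynamic-branch input would be the motion-amplified
    clip rather than the raw one; this sketch only illustrates the frame rates.
    """
    fast_input = clip                         # all 8 frames (25 fps) for the dynamic branch
    slow_input = clip[:, ::temporal_stride]   # every 4th frame (6.25 fps) for the static branch
    return fast_input, slow_input
```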
  • For the high-frame-rate dynamic motion cue sub-module, the number of channels is set small (initially 8 channels, finally 128 channels). One purpose is to save model overhead; the other follows from the fact that the more channels a model has, the stronger its ability to distinguish and extract static features and the more texture and pattern detail it can capture. Because the dynamic motion cue branch has fewer channels, its ability to extract static spatial texture features is deliberately reduced; combined with its high-frame-rate temporal input, this branch specializes in extracting time-domain information and yields purer temporal motion features.
  • For the low-frame-rate static texture information sub-module, which is not very sensitive to temporal changes, the number of channels is set larger (initially 64 channels, finally 1024 channels). Because the input and computation of this branch are relatively small, a higher channel count effectively improves its ability to extract spatial texture details.
  • An ordinary three-dimensional convolutional neural network has a large model size and expensive 3D convolution kernels, while memory in practice is very limited. It can therefore only use relatively simple structures: complex network structures and training techniques cannot be used to optimize feature extraction, neither the network depth nor the number of feature channels can be set very large, and it is difficult to use more complex convolution kernels to extract features, so the effect of such a model is limited.
  • For the dynamic motion cue branch, the present invention approximately splits the 3x3x3 temporal convolution into four convolutions of 1x1x1, 3x1x1, 1x3x3 and 1x1x1. Compared with the original 3x3x3 kernel, this effectively removes redundant computation from the 3D model while keeping the module focused on time-domain information, leaving more room for the rest of the computation in the model.
  • As an approximate three-dimensional convolution scheme, its ability to capture temporal and spatial information is no worse than the original 3D convolution, and accuracy in the face liveness detection task does not drop, while more than 60% of the memory and computation is saved, which is a strong advantage.
  • For the static texture information sub-module, the present invention uses multi-scale convolution kernels for feature extraction in each layer: the input first passes through a 1x1x1 convolution, is then fed in parallel into 1x1x1, 1x3x3 and 1x5x5 convolutions, and finally the outputs are concatenated and passed through another 1x1x1 convolution.
  • This multi-scale convolution kernel gives the static texture information sub-module good extraction ability for texture and static features of different scales, greatly enhancing its ability to capture static planar spatial information.
  • To better extract time-domain information, this method does not perform any form of down-sampling in the time dimension before the final global pooling layer, preserving and extracting effective time-domain information to the greatest extent and achieving a better balance between time-domain and spatial-domain features.
  • By contrast, an ordinary three-dimensional convolutional network, owing to its expensive and complex structure, must down-sample in time to avoid the computation overflowing.
  • Figure 1 is a schematic flow diagram of an embodiment of the present invention;
  • Figure 2 is a schematic diagram of the network structure of the dual-branch three-dimensional convolution model of the present invention;
  • Figure 3 shows the network parameters of the dual-branch three-dimensional convolution model used in the embodiment of the present invention;
  • Figure 4 is a schematic diagram of a time-series convolution block of the dynamic motion cue sub-module in the present invention;
  • Figure 5 is a schematic diagram of a texture convolution block of the static texture information sub-module in the present invention.
  • A specific embodiment of the present invention provides a face liveness detection system using a dual-branch three-dimensional convolution model, including:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model; their outputs are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • The three-dimensional convolution model of the present invention adopts a dual-branch structure: the static texture information sub-module serves as the first branch, and the liveness motion amplification module together with the dynamic motion cue sub-module serve as the second branch.
  • The static texture information sub-module includes an input layer, a preprocessing layer with a time-domain stride of k and a spatial stride of 1*1, an initial block layer, and p convolutional block layers.
  • The initial block layer in the static texture information sub-module consists of an initialization convolution layer with 8m channels and an initialization pooling layer.
  • Each convolutional block layer contains the same or a different number of texture convolution blocks. Each texture convolution block consists of a first convolution layer with a 1*1*1 kernel; a second convolution layer with parallel 1*1*1, 1*3*3 and 1*5*5 kernels; and a third convolution layer with a 1*1*1 kernel. In the texture convolution blocks of the first convolutional block layer, the three convolution layers have 8m, 8m and 32m channels respectively, and in each subsequent convolutional block layer the channel counts of the three convolution layers are twice those of the previous convolutional block layer.
  • The kernel of the initialization convolution layer is 1*5*5 and the kernel of the initialization pooling layer is 1*3*3; 2 ≤ k ≤ 5, k is preferably 4, and m is preferably 8.
  • The 1*5*5 kernel in the second convolution layer of the texture convolution block can be split into two concatenated 1*3*3 kernels.
  • The dynamic motion cue sub-module includes an input layer, an initial block layer, and p convolutional block layers.
  • The initial block layer in the dynamic motion cue sub-module consists of an initialization convolution layer with m channels and an initialization pooling layer.
  • Each convolutional block layer contains the same or a different number of time-series convolution blocks. Each time-series convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel, and a fourth convolution layer with a 1*1*1 kernel. In the time-series convolution blocks of the first convolutional block layer, the four convolution layers have m, m, m and 4m channels respectively, and in each subsequent convolutional block layer the channel counts of the four convolution layers are twice those of the previous convolutional block layer.
  • The kernel of the initialization convolution layer is 3*5*5, the kernel of the initialization pooling layer is 1*3*3, and m is preferably 8.
  • The output of the i-th convolutional block layer in the static texture information sub-module and the output of the i-th convolutional block layer in the dynamic motion cue sub-module are combined to form the input of the (i+1)-th convolutional block layer in the static texture information sub-module, where p is an integer greater than 0, 1 ≤ i ≤ p-1, and p is preferably 3.
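  • The patent does not spell out how the temporally finer dynamic-branch feature maps are aligned with the static-branch feature maps before they are combined, so the sketch below makes an assumption (subsample the dynamic features in time to match, then concatenate along the channel dimension); the function name and tensor shapes are illustrative only.

```python
import torch

def lateral_fuse(static_feat: torch.Tensor, dynamic_feat: torch.Tensor) -> torch.Tensor:
    """Combine the i-th block outputs of the two branches (assumed alignment scheme).

    static_feat:  (B, C_s, T_s, H, W) from the static texture branch (low frame rate)
    dynamic_feat: (B, C_d, T_d, H, W) from the dynamic motion cue branch (high frame rate)
    The dynamic features are subsampled in time to T_s and concatenated on the
    channel axis; the result feeds the (i+1)-th block of the static branch.
    """
    t_stride = dynamic_feat.shape[2] // static_feat.shape[2]
    aligned = dynamic_feat[:, :, ::t_stride]            # (B, C_d, T_s, H, W)
    return torch.cat([static_feat, aligned], dim=1)     # (B, C_s + C_d, T_s, H, W)
```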
  • When the liveness motion amplification module performs motion amplification on a sample, the process is as follows:
  • The image sequence is written as an intensity profile f(x + δ(t)), where δ(t) is the small motion to be magnified, and is decomposed into its Fourier components, f(x + δ(t)) = Σ_ω A_ω e^{iω(x + δ(t))}, where A_ω is the amplitude of the signal transformed into frequency-domain space. Each individual frequency ω corresponds to one sub-band, and the sub-band for a specific frequency ω is the complex sinusoid S_ω(x, t) = A_ω e^{iω(x + δ(t))}.
  • The temporal frequency range of the small facial movements is set to 0.3–3 Hz, so that only the small movements of the face are extracted.
  • S_ω is a sinusoid whose phase ω(x + δ(t)) contains the motion information of the original image.
  • The phase ω(x + δ(t)) is band-pass filtered to obtain the filtered band-pass phase B_ω(x, t) = ωδ(t).
  • Multiplying the band-pass phase by the magnification factor α and adding it back to the sub-band gives the final result S'_ω(x, t) = S_ω(x, t)·e^{iαB_ω} = A_ω e^{iω(x + (1+α)δ(t))}, a complex sinusoid representing the motion-magnified image in frequency-domain space (step 2.2).
  • According to the magnified sub-bands obtained in step 2.2), the motion-magnified video sequence f(x + (1+α)δ(t)) is obtained, and converting it back to the time domain gives the final amplified result.
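  • Below is a minimal NumPy sketch of this Fourier-phase motion magnification, under simplifying assumptions not taken from the patent (a global 2D FFT per frame instead of a localized pyramid decomposition, and an FFT-mask temporal band-pass of the phase); the function name and parameters are illustrative only.

```python
import numpy as np

def magnify_motion(frames: np.ndarray, alpha: float = 10.0,
                   f_lo: float = 0.3, f_hi: float = 3.0, fps: float = 25.0) -> np.ndarray:
    """Amplify small motions in a grayscale clip `frames` of shape (T, H, W).

    Each spatial-frequency sub-band A_w * exp(i*w*(x + delta(t))) has its phase
    band-pass filtered in time over [f_lo, f_hi] Hz (~= w * delta(t)); the filtered
    phase, scaled by alpha, is added back, giving A_w * exp(i*w*(x + (1+alpha)*delta(t))).
    """
    T = frames.shape[0]
    spec = np.fft.fft2(frames, axes=(1, 2))        # per-frame spatial FFT
    amp, phase = np.abs(spec), np.angle(spec)
    phase = np.unwrap(phase, axis=0)               # make the phase smooth over time

    freqs = np.fft.fftfreq(T, d=1.0 / fps)         # temporal frequencies in Hz
    keep = (np.abs(freqs) >= f_lo) & (np.abs(freqs) <= f_hi)
    band = np.fft.ifft(np.fft.fft(phase, axis=0) * keep[:, None, None], axis=0).real

    magnified = amp * np.exp(1j * (phase + alpha * band))
    return np.fft.ifft2(magnified, axes=(1, 2)).real
```

  • Note that for a 0.3–3 Hz band to contain any temporal-frequency bins, the clip passed to such a filter has to span several seconds; an 8-frame unit at 25 frames per second is too short on its own, so presumably the amplification is applied over a longer video segment.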
  • A specific embodiment of the present invention illustrates the working process of the face liveness detection system.
  • The user's face video is acquired through the face video collection module, and the face video preprocessing module segments it in units of 8 frames to obtain liveness recognition samples.
  • The size of the original image stream is 224x224x8.
  • The dynamic motion cue sub-module runs at a high frame rate (25 frames per second), uses three-dimensional convolution modules, and pays more attention to collecting dynamic cues of real faces or forgery attacks.
  • Its number of channels is set small (initially 8 channels, finally 128 channels): on the one hand this saves model overhead, and on the other hand it makes the branch more specialized in extracting time-domain information. Notably, no down-sampling is performed in the time domain throughout the whole branch, which preserves temporal motion information to the greatest extent.
  • The calculation process in the dynamic motion cue sub-module is as follows: the input first passes through a convolution of size 3x5x5 with strides 1, 2, 2, giving a feature map with 8 channels; it then passes through an initial pooling layer of size 1x3x3 with strides 1, 2, 2, keeping 8 channels; it then passes through the three convolutional block layers of the second branch, which contain 2, 3 and 2 time-series convolution blocks respectively.
  • The structure of each time-series convolution block is shown in Figure 4: the present invention splits the original 3x3x3 three-dimensional convolution kernel so that the input passes in sequence through a 1x1x1 kernel, a 3x1x1 kernel, a 1x3x3 kernel and a 1x1x1 kernel.
  • The purpose of the 1x1x1 convolution kernels is to enhance the fitting ability of the model.
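  • A minimal PyTorch sketch of such a time-series convolution block is given below. The module name, padding, strides and the ReLU activations are assumptions; the 1x1x1 / 3x1x1 / 1x3x3 / 1x1x1 ordering and the m : m : m : 4m channel ratio follow the description above and Figure 4.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Factorized stand-in for a 3x3x3 kernel: 1x1x1 -> 3x1x1 -> 1x3x3 -> 1x1x1."""

    def __init__(self, in_ch: int, m: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, m, kernel_size=1),                          # 1x1x1, m channels
            nn.ReLU(inplace=True),
            nn.Conv3d(m, m, kernel_size=(3, 1, 1), padding=(1, 0, 0)),   # 3x1x1: temporal
            nn.ReLU(inplace=True),
            nn.Conv3d(m, m, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # 1x3x3: spatial
            nn.ReLU(inplace=True),
            nn.Conv3d(m, 4 * m, kernel_size=1),                          # 1x1x1, 4m channels
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No temporal stride anywhere: the time dimension is never down-sampled.
        return self.block(x)

# Example: first block layer of the dynamic branch (m = 8), input (B, 8, 8, 56, 56).
x = torch.randn(2, 8, 8, 56, 56)
y = TemporalConvBlock(in_ch=8, m=8)(x)   # -> (2, 32, 8, 56, 56)
```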
  • The original image stream of size 224x224x8 is also input into the static texture information sub-module.
  • The static texture information sub-module runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme, and focuses on extracting the static texture features that distinguish real faces from forged attacks. It is not very sensitive to changes in the time domain, and its number of channels is set larger (initially 64 channels, finally 1024 channels); because the input and computation of this branch are small, a higher channel count effectively improves its ability to extract spatial texture details.
  • The calculation process in the static texture information sub-module is as follows: after frames are extracted by the preprocessing layer, the low-frequency input first passes through a 1x5x5 convolution kernel with strides 1, 2, 2, giving 64 feature channels, and then through an initialization pooling layer of size 1x3x3 with strides 1, 2, 2, with the channel count remaining 64.
  • The output of the initial block layer is merged and concatenated with the output of the corresponding layer of the dynamic motion cue sub-module and then input into the convolutional block layers.
  • The three convolutional block layers contain 2, 3 and 2 texture convolution blocks respectively.
  • The structure of each texture convolution block is shown in Figure 5. To further reduce the computation of the model, the 1x5x5 convolution shown in Figure 5 is split into two 1x3x3 convolutions in series.
  • This multi-scale convolution kernel scheme gives the module strong information extraction capability for static spatial features of different scales.
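  • A corresponding PyTorch sketch of the multi-scale texture convolution block is shown below. The module name, padding and activations are assumptions, and since the patent does not state how the 8m channels of the multi-scale stage are divided among the parallel branches, the sketch gives each parallel branch the full 8m channels and projects the concatenation down to 32m; the kernel layout and the replacement of the 1x5x5 branch by two stacked 1x3x3 convolutions follow the description and Figure 5.

```python
import torch
import torch.nn as nn

class TextureConvBlock(nn.Module):
    """Multi-scale block: 1x1x1 -> parallel {1x1x1, 1x3x3, 1x3x3 + 1x3x3} -> concat -> 1x1x1."""

    def __init__(self, in_ch: int, m: int):
        super().__init__()
        mid = 8 * m                                       # assumed width of the parallel stage
        self.reduce = nn.Conv3d(in_ch, mid, kernel_size=1)
        self.branch1 = nn.Conv3d(mid, mid, kernel_size=1)
        self.branch3 = nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # The 1x5x5 branch is realized as two stacked 1x3x3 convolutions to save computation.
        self.branch5 = nn.Sequential(
            nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        )
        self.project = nn.Conv3d(3 * mid, 32 * m, kernel_size=1)   # 32m output channels
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.reduce(x))
        multi = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.act(self.project(self.act(multi)))

# Example: first block layer of the static branch (m = 8); the input is assumed to be the
# 64 static channels concatenated with 8 laterally-fused dynamic channels.
x = torch.randn(2, 72, 2, 56, 56)
y = TextureConvBlock(in_ch=72, m=8)(x)   # -> (2, 256, 2, 56, 56)
```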
  • In the same way, after each subsequent convolutional block layer, the output of the corresponding convolutional block layer of the second branch is combined with that of the first branch and used as the input of the next convolutional block layer of the first branch.
  • Finally, the output of the dynamic motion cue sub-module and the output of the static texture information sub-module are combined and fed into the global pooling layer and a 1024-unit fully connected layer, and the classification is completed by the softmax function.
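  • The final fusion and classification step could look like the sketch below; the layer names, the ReLU and the interpretation of "a 1024 fully connected layer" as a hidden layer of width 1024 are assumptions, while the global pooling, the fully connected layer, the softmax and the two classes follow the description.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Global-pool both branch outputs, concatenate, FC-1024, then 2-way softmax."""

    def __init__(self, static_ch: int = 1024, dynamic_ch: int = 128, num_classes: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                 # global pooling over (T, H, W)
        self.fc = nn.Linear(static_ch + dynamic_ch, 1024)   # 1024-unit fully connected layer
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, static_feat: torch.Tensor, dynamic_feat: torch.Tensor) -> torch.Tensor:
        s = self.pool(static_feat).flatten(1)               # (B, 1024)
        d = self.pool(dynamic_feat).flatten(1)               # (B, 128)
        logits = self.classifier(torch.relu(self.fc(torch.cat([s, d], dim=1))))
        return torch.softmax(logits, dim=1)                 # live / attack probabilities
```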
  • This embodiment trains the model by the following method and saves the model file to the storage medium.
  • Batch gradient descent training is performed: in each step a batch of 10 samples is fed into the network model for training. The samples used in a batch are denoted x and their corresponding labels ŷ.
  • The model produces the recognition result y.
  • The purpose of training is to reduce the difference between the label ŷ and the recognition result y, so the cross-entropy loss function is selected to describe the difference between ŷ and y. The cross-entropy loss function is: L = -Σ_i Σ_{j=1}^{N} ŷ_ij · log(y_ij),
  • where N is the number of classes of the recognition task during training, here 2,
  • and y_ij is the probability, output by the dual-branch three-dimensional convolutional neural network, that the i-th sample in a batch belongs to the j-th class.
  • The batch gradient descent method is used on the PyTorch platform. The first branch and the second branch are first trained independently for two epochs each, then the two branches are merged; after the merged network is trained for 50 epochs, the model file is saved to the storage medium for the liveness judgment module to perform the face liveness detection and recognition task. One epoch means that all training data are passed once through the batch gradient descent method.
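  • A minimal training-loop sketch under these settings (batch size 10, two classes, cross-entropy, 50 epochs) is shown below; the optimizer choice, learning rate and the dataset layout are placeholders, since the patent does not specify them, and the separate two-epoch pre-training of each branch before merging is omitted for brevity.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, epochs: int = 50, lr: float = 1e-3,
          device: str = "cuda") -> None:
    """Batch gradient descent with 10-sample batches and cross-entropy loss.

    `train_set` is assumed to yield ((slow_clip, fast_clip), label) pairs, where
    `fast_clip` is the motion-amplified input of the dynamic branch and `label`
    is 0/1 (attack / live). One epoch passes all training data once.
    """
    loader = DataLoader(train_set, batch_size=10, shuffle=True)
    criterion = nn.CrossEntropyLoss()              # softmax + cross-entropy over N = 2 classes
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.to(device).train()

    for epoch in range(epochs):
        for (slow_clip, fast_clip), label in loader:
            slow_clip, fast_clip, label = slow_clip.to(device), fast_clip.to(device), label.to(device)
            logits = model(slow_clip, fast_clip)   # dual-branch forward pass
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Save the model file for the liveness judgment module to load later.
    torch.save(model.state_dict(), "two_branch_3dconv.pt")
```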
  • A terminal and a storage medium are provided.
  • Terminal: includes a memory and a processor;
  • the memory is used to store computer programs;
  • the processor is used to implement, when the computer program is executed, the functions of the aforementioned face liveness detection method and system using the dual-branch three-dimensional convolution model.
  • The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one magnetic disk memory.
  • The above-mentioned processor is the control center of the terminal; it connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal by executing the computer program in the memory and calling the data in the memory.
  • The processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the terminal should also have necessary components for program operation, such as a power supply, a communication bus, and so on.
  • The computer program can be divided into multiple modules, each stored in the memory.
  • Each divided module is a segment of computer program instructions that performs a specific function, and the instruction segments are used to describe the execution process of the computer program in the terminal.
  • The computer program can be divided into the following modules:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model; their outputs are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample to be detected output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • The above-mentioned logical instructions in the memory can be implemented in the form of software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
  • the memory can be configured to store software programs and computer-executable programs, such as program instructions or modules corresponding to the system in the embodiments of the present disclosure.
  • the processor executes functional applications and data processing by running software programs, instructions or modules stored in the memory, that is, realizes the functions in the foregoing embodiments.
  • The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or another medium that can store program code; it may also be a temporary storage medium.
  • the specific process in which the multiple instructions in the foregoing storage medium and the terminal are loaded and executed by the processor has been described in detail in the foregoing.
  • This embodiment is used to demonstrate a specific implementation effect.
  • The face video collection module, face video preprocessing module, liveness labeling module, liveness motion amplification module, dual-branch three-dimensional convolution model training module, and liveness judgment module in this embodiment all adopt the structures and functions described above, which are not repeated here.
  • The implementation process is as follows:
  • It includes a configuration (training) process and a recognition process. First the system is set to configuration mode: the face video is obtained through the face video collection module, the face video preprocessing module segments it to obtain liveness recognition samples, the samples are labeled by the liveness labeling module and then processed by the liveness motion amplification module, and finally the dual-branch three-dimensional convolution model training module trains the three-dimensional convolution model on the training sample set and saves it as a model file.
  • Then the system is set to recognition mode: the face video is first obtained through the face video collection module, the face video preprocessing module segments it to obtain the liveness recognition samples to be detected, and finally the liveness judgment module directly loads the trained model file and uses the sample to be detected and its motion-amplified counterpart as the two model inputs to obtain the recognition result.
  • A total of 6 test tasks were performed: the Protocol 1, Protocol 2, Protocol 3 and Protocol 4 tests of the OULU-NPU database, and the two cross-tests between the CASIA-FASD database and the Replay-Attack database.
  • The most difficult of these is the cross-test between the CASIA-FASD database and the Replay-Attack database, because it poses great challenges to the generalization ability of the model and its robustness under unknown lighting, background and device conditions.
  • For the OULU-NPU protocols, the present invention follows the evaluation criteria of the original protocol. It uses the attack presentation classification error rate (APCER), which evaluates the highest misclassification error rate over all attack methods; the bona fide presentation classification error rate (BPCER), which evaluates the classification error rate on genuine live samples; and the average classification error rate (ACER), which is the average of the forgery-attack classification error rate and the real-face classification error rate: ACER = (APCER + BPCER) / 2.
  • For the cross-database tests, the present invention follows the test standards of the original databases and uses the half total error rate (HTER) as the indicator, which is half of the sum of the false rejection rate (FRR) and the false acceptance rate (FAR): HTER = (FRR + FAR) / 2.
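  • For reference, these error rates can be computed as in the short sketch below; the function and variable names are illustrative, not from the patent.

```python
def acer(apcer_per_attack: dict[str, float], bpcer: float) -> float:
    """ACER on the OULU-NPU protocols: mean of the worst-case APCER and the BPCER."""
    apcer = max(apcer_per_attack.values())   # highest error rate over all attack types
    return (apcer + bpcer) / 2.0

def hter(frr: float, far: float) -> float:
    """Half Total Error Rate used for the CASIA-FASD / Replay-Attack cross-tests."""
    return (frr + far) / 2.0

# Example: APCER of 2.1% for print and 3.4% for replay attacks, BPCER of 1.8%.
print(acer({"print": 0.021, "replay": 0.034}, bpcer=0.018))   # 0.026
print(hter(frr=0.05, far=0.07))                               # 0.06
```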
  • The models compared here include the local binary pattern method among traditional methods, the long short-term memory (LSTM) network among recurrent neural networks, and a two-dimensional convolutional neural network among convolutional neural networks. The results are shown in Table 1 and Table 2.
  • Table 1 The performance of each model on different protocols in the OULU-NPU database
  • Under all four test protocols of the OULU-NPU database, the dual-branch three-dimensional convolution model of the present invention was compared with an ordinary two-dimensional convolutional neural network model and with traditional texture feature models.
  • The three-dimensional mimicry model proposed by the invention holds an absolute performance advantage in every case. Since OULU-NPU is a database close to the mobile phone scenarios actually in use, this also shows that in realistic mobile payment scenarios the model can effectively prevent various non-live forgery attacks, which has strong practical value. It has clear advantages over traditional methods, two-dimensional convolutional neural networks and recurrent neural networks. In the more challenging cross-data-set tests, the robustness and superiority of the model are also well reflected, which shows that the structure of the model of the present invention is effective and advanced.
  • Although OULU-NPU is a test database that accounts for cross-scene and cross-device generalization, it is still constrained by objective conditions: the shooting scenes and lighting are limited, the photography habits of the same group of photographers are fixed, and the attack methods and habits of the same group of attackers are relatively uniform, so many similarities remain within the database during testing and it cannot fully approximate complex real-world application scenarios.
  • The model is therefore cross-tested across data sets on the CASIA-FASD database and the Idiap Replay-Attack database, which provides a more challenging generalization test that is much closer to actual scenarios.
  • The comparison covers many different types of models, including several traditional texture extraction algorithms as well as CNN and RNN (temporal) models from deep learning.
  • The cross-test on the CASIA-FASD and Replay-Attack datasets is the most demanding test of model generalization, because the two datasets differ greatly in collection devices, live subjects, collection environments and the shooting habits of the collectors, which closely matches real detection scenarios.
  • The half total error rate (HTER) is used as the performance evaluation index.
  • In the cross-database tests, the dual-branch 3D convolution model proposed by the invention shows generalization performance far exceeding the other models. It can be seen that the proposed dual-branch 3D convolution model is more robust, which also confirms that its performance in real-scene tests will be better.

Abstract

A live face detection system applying a two-branch three-dimensional convolutional model, a terminal and a storage medium relate to the technical field of live face detection. The live face detection system comprises a face video capturing module, a face video pre-processing module, a live body annotation module, a live body motion amplification module, a two-branch three-dimensional convolutional model training module and a live body judgement module. The two-branch three-dimensional convolutional model training module is configured with a two-branch three-dimensional convolutional model and comprises a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. Outputs of the static texture information sub-module and the dynamic motion cue sub-module are pooled, aggregated and fused through the fusion sub-module, and a detection result is produced through the classification sub-module. The two-branch three-dimensional convolutional model is mimicry-based and biologically meaningful, and has strong robustness and generalization. Liveness detection for a face recognition system is well guaranteed, system security is improved, and information and property are protected from infringement.

Description

Face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model
Technical Field
The invention relates to the field of face liveness detection, and in particular to a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
Background Art
As people increasingly use electronic devices such as laptops and smartphones to make payments, shop, pay bills and interact socially, the demand for electronic identity authentication keeps growing. Face recognition verification stands out among many authentication systems and has been deployed on a large scale in everyday life. To ensure security and prevent various potential hacker attacks, face liveness detection is a crucial part of a face verification system.
At present, the biggest problem facing face liveness detection algorithms is insufficient generalization: many trained models perform well on their training sets and the corresponding test sets, but their performance on brand-new, unknown data sets is unsatisfactory, which greatly reduces the practical deployment value of such algorithms. In response to this, the invention focuses mainly on improving the generalization of the face liveness detection model.
There are many traditional methods, and they differ considerably. The local binary pattern method has notable advantages such as grayscale invariance and rotation invariance and is simple to compute, but it is also comparatively simplistic. The speeded-up robust features method uses the determinant of the Hessian matrix for feature point response detection and uses an integral image to accelerate the computation. Regardless of the specific method, however, most traditional feature methods rely on hand-crafted features combined with traditional shallow classifiers such as SVM and LDA for liveness detection. Limited by their own design and their training samples, traditional hand-crafted feature extraction methods can only target specific attack types or apply to specific environments or lighting conditions. Even comprehensive combinations of multiple traditional feature extraction methods share this limitation: because their thresholds and parameters are usually set manually, they cannot achieve strong adaptability and generalization, cannot handle unknown scenarios and attack methods, and are mostly fragile and unstable in real scenarios.
Although interactive methods are relatively simple and effective, the whole verification process takes longer and brings negative experiences in terms of convenience and usability. Moreover, video attacks can often defeat interactive checks based on blink detection or lip movement, so the limitations of interactive face liveness detection algorithms are also obvious.
At present, deep learning methods are more commonly used to solve the face liveness detection problem. A two-dimensional convolutional neural network is a feed-forward neural network whose artificial neurons respond to surrounding units within a limited receptive field, and it performs excellently on image processing. Compared with the local binary pattern method and similar approaches, it extracts two-dimensional image features with a certain degree of generalization, thereby improving model accuracy. However, deep learning methods also have bottlenecks: although such models perform very well on many data sets, they still perform poorly in cross-data-set testing. This is because most two-dimensional CNN models focus only on learning the texture features of the training samples, and those texture features vary strongly and randomly with environment, lighting, attack method and display device material, so the models cannot fit well the texture features of brand-new samples outside the training set.
In addition, some methods try to introduce additional constraints by extracting face depth maps or using other auxiliary supervision to enhance the generalization ability of the model. However, such auxiliary supervision is only an indirect means of supervision, and its relevance to face liveness detection remains inconclusive. Moreover, its extraction not only requires a large amount of computation but also occupies a large amount of disk space, which brings considerable inconvenience to both training and subsequent testing.
Therefore, the generalization problem of the model has always been an urgent issue in the application of deep learning to liveness detection.
Summary of the Invention
In order to solve the problems that existing face liveness detection algorithms generalize poorly, cannot handle unknown scenarios and attack methods, and are fragile and unstable in real scenarios, the present invention provides a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model. The present invention uses a three-dimensional convolutional neural network as the model backbone, which can extract high-dimensional abstract features and also summarize concrete shallow features from the shallower layers of the network, thereby obtaining more comprehensive temporal motion features. By taking both high-dimensional and low-dimensional features into account, the model achieves better results. At the same time, the three-dimensional convolutional neural network has a stronger ability to extract information in the time domain and is better suited as a framework for face liveness detection: compared with an ordinary two-dimensional convolutional network it extracts temporal information better, and compared with a recurrent neural network it attends more evenly to low-order and high-order feature information, improving the generalization ability of the whole system.
The purpose of the present invention is to provide a face liveness detection system using a dual-branch three-dimensional convolution model, including:
Face video collection module: collects the user's face video;
Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model. When the detection system is in training mode, the output of the liveness motion amplification module is used as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module is used as the input of the static texture information sub-module; the outputs of the two sub-modules are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
Another object of the present invention is to disclose a terminal comprising a memory and a processor.
The memory is used to store a computer program.
The processor is configured, when the computer program is executed, to realize the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model.
Another object of the present invention is to disclose a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model are realized.
The beneficial effects of the present invention are:
1) The three-dimensional convolution model of the present invention adopts a dual-branch structure. The dynamic motion cue branch runs at a high frame rate (25 frames per second) and focuses on collecting dynamic cues of real faces or forgery attacks; the static texture information branch runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme and focuses on extracting the static texture features that distinguish real faces from forged attacks, laying the foundation for the efficient operation of the whole system. The model can extract static spatial texture features and temporal motion features at the same time, which enhances the generalization of the system.
In the dual-branch model, the number of channels of the high-frame-rate dynamic motion cue sub-module is set small (initially 8 channels, finally 128 channels). One purpose is to save model overhead; the other follows from the fact that the more channels a model has, the stronger its ability to distinguish and extract static features and the more texture and pattern detail it can capture. Because the dynamic motion cue branch has fewer channels, its ability to extract static spatial texture features is deliberately reduced; combined with its high-frame-rate temporal input, this branch specializes in extracting time-domain information and yields purer temporal motion features. The low-frame-rate static texture information sub-module, in contrast, is not very sensitive to temporal changes, so its number of channels is set larger (initially 64 channels, finally 1024 channels); because the input and computation of this branch are relatively small, a higher channel count effectively improves its ability to extract spatial texture details. An ordinary three-dimensional convolutional neural network, with its large model size, expensive 3D convolution kernels and very limited memory in practice, can only use relatively simple structures: complex network structures and training techniques cannot be used to optimize feature extraction, neither the network depth nor the number of feature channels can be set very large, and it is difficult to use more complex convolution kernels to extract features, so the effect of such a model is limited.
2)对于静态纹理信息分支以及动态运动线索分支,为了引导其分别提取静态空间特征,以及动态时域特征,本发明设置了不同的卷积层结构。2) For the static texture information branch and the dynamic motion clue branch, in order to guide them to extract static spatial features and dynamic time domain features, different convolutional layer structures are provided in the present invention.
For the dynamic motion cue branch, the present invention approximately decomposes the 3x3x3 spatio-temporal convolution into four convolutions of 1x1x1, 3x1x1, 1x3x3 and 1x1x1. Compared with the original 3x3x3 kernel, this effectively removes redundant computation from the 3D network and leaves more room for the rest of the model, while keeping the module focused on temporal information. At the same time, as an approximate 3D convolution scheme, it is no worse than the original 3D convolution at capturing temporal and spatial information, so accuracy on the face liveness detection task does not drop, yet more than 60% of the memory and computation can be saved, which is a strong advantage.
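For illustration, the following is a minimal PyTorch sketch of such a decomposed temporal convolution block. The module name, the intermediate channel width and the BatchNorm/ReLU layers between the convolutions are assumptions added for readability and are not specified in the disclosure.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Approximates a 3x3x3 convolution with a cheaper chain of
    1x1x1 -> 3x1x1 (temporal) -> 1x3x3 (spatial) -> 1x1x1 convolutions."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=1),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            # temporal-only convolution (kernel order is T, H, W)
            nn.Conv3d(mid_channels, mid_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            # spatial-only convolution
            nn.Conv3d(mid_channels, mid_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, out_channels, kernel_size=1),
        )

    def forward(self, x):              # x: (batch, channels, T, H, W)
        return self.block(x)

# Example: a block of the first layer of the motion branch (m = 8, output 4m = 32)
block = TemporalConvBlock(8, 8, 32)
out = block(torch.randn(1, 8, 8, 56, 56))   # -> torch.Size([1, 32, 8, 56, 56])
```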
For the static texture information sub-module, the present invention uses multi-scale convolution kernels at every layer: the input first passes through a 1x1x1 convolution, is then fed in parallel into 1x1x1, 1x3x3 and 1x5x5 convolutions, and the outputs are finally concatenated and passed through another 1x1x1 convolution. This multi-scale design gives the static texture information sub-module good extraction ability for textures and static features at different scales, greatly enhancing its ability to capture static planar spatial information. The reason is that the sizes of texture and pattern features are not fixed; a fixed-size kernel (for example 1x3x3) makes the network sensitive only to features of a particular size and prone to missing features at other scales. Conversely, with kernels of different sizes, both large global features (such as global moiré fringes or water-ripple patterns on the surface of non-live samples) and small, subtle local features (such as local specular reflection textures and light spots on a non-live surface) have a suitable kernel to extract them: larger kernels capture coarse overall structural contours, while smaller kernels capture fine details.
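A corresponding PyTorch sketch of the multi-scale texture convolution block is given below; the activation functions and the shared intermediate width of the three parallel branches are assumptions added for illustration.

```python
import torch
import torch.nn as nn

class TextureConvBlock(nn.Module):
    """Multi-scale texture block: 1x1x1 reduction, parallel 1x1x1 / 1x3x3 / 1x5x5
    spatial convolutions, channel concatenation, then a 1x1x1 fusion convolution.
    All kernels have temporal extent 1, so only spatial texture is modeled."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv3d(in_channels, mid_channels, kernel_size=1)
        self.branch1 = nn.Conv3d(mid_channels, mid_channels, kernel_size=1)
        self.branch3 = nn.Conv3d(mid_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.branch5 = nn.Conv3d(mid_channels, mid_channels,
                                 kernel_size=(1, 5, 5), padding=(0, 2, 2))
        self.fuse = nn.Conv3d(3 * mid_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, channels, T, H, W)
        x = self.relu(self.reduce(x))
        multi = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.relu(self.fuse(multi))

# Example: a block of the first layer of the texture branch (8m = 64 in, 32m = 256 out)
block = TextureConvBlock(64, 64, 256)
out = block(torch.randn(1, 64, 2, 56, 56))      # -> torch.Size([1, 256, 2, 56, 56])
```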
3) To improve the model's discrimination in the temporal dimension and better extract temporal information, the method performs no downsampling of any kind along the time dimension before the final global pooling layer. This preserves and extracts the useful information in the time domain to the greatest extent and achieves a better balance between temporal and spatial features. An ordinary, more expensive 3D convolutional network, with its complex structure, must downsample in order to avoid an explosion in computation.
Description of the drawings
Figure 1 is a schematic flow diagram of an embodiment of the present invention;
Figure 2 is a schematic diagram of the network structure of the dual-branch three-dimensional convolution model of the present invention;
Figure 3 shows the network parameters of the dual-branch three-dimensional convolution model used in the embodiment of the present invention;
Figure 4 is a schematic diagram of the temporal convolution block of the dynamic motion cue sub-module of the present invention;
Figure 5 is a schematic diagram of the texture convolution block of the static texture information sub-module of the present invention.
Detailed description
The present invention is further described below in conjunction with the drawings and embodiments.
A specific embodiment of the present invention presents a face liveness detection system using a dual-branch three-dimensional convolution model, comprising:
Face video acquisition module: used to acquire the user's face video;
Face video preprocessing module: reads the acquired face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: used to label training samples known to be live or non-live; the liveness labeling module is enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
Liveness motion magnification module: according to the operating mode of the detection system, performs motion magnification on labeled training samples or unlabeled samples to be detected, obtaining motion-magnified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module; the static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model, their outputs are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled liveness recognition samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
The three-dimensional convolution model of the present invention adopts a dual-branch structure: the static texture information sub-module is the first branch, and the liveness motion magnification module together with the dynamic motion cue sub-module forms the second branch. The static texture information sub-module comprises an input layer, a preprocessing layer with a temporal stride of k and a spatial stride of 1*1, an initial block layer, and p convolutional block layers. Preferably, the initial block layer of the static texture information sub-module consists of an initialization convolution layer with 8m channels and an initialization pooling layer. Each convolutional block layer contains the same or different numbers of texture convolution blocks; each texture convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with 1*1*1, 1*3*3 and 1*5*5 kernels, and a third convolution layer with a 1*1*1 kernel. In the texture convolution blocks of the first convolutional block layer, the channel numbers of the three convolution layers are 8m, 8m and 32m, respectively, and the channel numbers of the three convolution layers in each subsequent convolutional block layer are twice those of the previous one. The kernel of the initialization convolution layer is 1*5*5, the kernel of the initialization pooling layer is 1*3*3, 2≤k≤5, k is preferably 4, and m is preferably 8. In a specific implementation of the present invention, the 1*5*5 kernel in the second convolution layer of the texture convolution block can be split into two cascaded 1*3*3 kernels.
The dynamic motion cue sub-module comprises an input layer, an initial block layer, and p convolutional block layers. Preferably, the initial block layer of the dynamic motion cue sub-module consists of an initialization convolution layer with m channels and an initialization pooling layer. Each convolutional block layer contains the same or different numbers of temporal convolution blocks; each temporal convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel and a fourth convolution layer with a 1*1*1 kernel. In the temporal convolution blocks of the first convolutional block layer, the channel numbers of the four convolution layers are m, m, m and 4m, respectively, and the channel numbers of the four convolution layers in each subsequent convolutional block layer are twice those of the previous one. The kernel of the initialization convolution layer is 3*5*5, the kernel of the initialization pooling layer is 1*3*3, and m is preferably 8.
The output of the i-th convolutional block layer of the static texture information sub-module is merged with the output of the i-th convolutional block layer of the dynamic motion cue sub-module and used as the input of the (i+1)-th convolutional block layer of the static texture information sub-module; p is an integer greater than 0, 1≤i≤p-1, and p is preferably 3.
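The lateral merging between the two branches can be sketched as a channel-wise concatenation, as in the schematic skeleton below. The individual block layers are supplied by the caller, and the temporal subsampling of the motion-branch features before concatenation is an assumption for alignment (the disclosure only states that the outputs are merged).

```python
import torch
import torch.nn as nn

class DualBranchBackbone(nn.Module):
    """Schematic lateral fusion: after each block layer, the motion-branch output
    is concatenated onto the texture-branch features and fed into the next
    texture-branch block layer. Later texture layers must therefore be built
    with in_channels = texture_out + motion_out."""

    def __init__(self, texture_layers, motion_layers):
        super().__init__()
        assert len(texture_layers) == len(motion_layers)
        self.texture_layers = nn.ModuleList(texture_layers)
        self.motion_layers = nn.ModuleList(motion_layers)

    def forward(self, texture_x, motion_x):
        for i, (t_layer, m_layer) in enumerate(zip(self.texture_layers, self.motion_layers)):
            texture_x = t_layer(texture_x)
            motion_x = m_layer(motion_x)
            if i < len(self.texture_layers) - 1:
                # Align the temporal length of the fast (motion) features with the
                # slow (texture) features before concatenation; assumes the fast
                # clip length is an integer multiple of the slow clip length.
                stride = motion_x.size(2) // texture_x.size(2)
                texture_x = torch.cat([texture_x, motion_x[:, :, ::stride]], dim=1)
        return texture_x, motion_x
```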
When performing the motion magnification of a sample, the liveness motion magnification module operates as follows:
1) By Fourier series decomposition, the face image f(x+δ(t)) in each frame is decomposed into a sum of sinusoidal functions:
f(x+δ(t)) = Σ_ω A_ω e^{iω(x+δ(t))}
where f(x+δ(t)) denotes the live face sample image in the time domain, i.e. the initial image is I(x,0) = f(x); δ(t) is the motion information function of the face; and A_ω is the amplitude of the signal after transformation to the frequency domain. Each individual frequency ω corresponds to one band, and the band at a particular frequency ω is a complex sinusoidal signal:
S_ω(x,t) = A_ω e^{iω(x+δ(t))}
where the frequency range of subtle facial motion, ω, is set to 0.3-3 Hz so as to extract the subtle motion of the face. S_ω is a sinusoid whose phase ω(x+δ(t)) contains the motion information of the original image.
2) In order to isolate the subtle motions within the corresponding temporal frequency band, the phase ω(x+δ(t)) is filtered to obtain the band-pass phase, expressed as:
B_ω(x,t) = ω δ(t)
The band-pass phase B_ω(x,t) is multiplied by α, the motion magnification factor, which takes the value 30 and, in practical applications, can be varied between 10 and 50 as needed; the result is added to the phase of the sub-band S_ω(x,t) to obtain the motion-magnified sub-band Ŝ_ω(x,t), expressed as:

Ŝ_ω(x,t) = A_ω e^{iω(x+(1+α)δ(t))}
The final result Ŝ_ω(x,t) is a complex sinusoid representing the motion-magnified image in the frequency domain.
3) From the motion-magnified sub-band Ŝ_ω(x,t) of step 2), the motion-magnified video sequence f(x+(1+α)δ(t)) is obtained; converting it back to the time domain gives the magnified result.
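As a rough illustration of the procedure above, the sketch below performs a heavily simplified, global-Fourier version of phase-based motion magnification on a grayscale clip: each frame is represented by its 2D Fourier coefficients, the temporal variation of each coefficient's phase is band-passed to 0.3-3 Hz, scaled by α and added back. Practical systems use multi-scale, localized filter banks rather than a single global transform, so this is not the exact filter of the disclosed module; the function name and default parameters are illustrative.

```python
import numpy as np

def magnify_motion(frames, fps=25.0, alpha=30.0, f_lo=0.3, f_hi=3.0):
    """Simplified phase-based motion magnification of a (T, H, W) grayscale clip."""
    frames = np.asarray(frames, dtype=np.float64)
    coeffs = np.fft.fft2(frames, axes=(1, 2))        # spatial Fourier coefficients per frame
    amplitude, phase = np.abs(coeffs), np.angle(coeffs)
    phase = np.unwrap(phase, axis=0)                 # smooth phase evolution over time

    # Temporal band-pass (f_lo to f_hi Hz) applied to the phase of every coefficient.
    T = frames.shape[0]
    freqs = np.fft.fftfreq(T, d=1.0 / fps)
    band = (np.abs(freqs) >= f_lo) & (np.abs(freqs) <= f_hi)
    band_phase = np.real(np.fft.ifft(np.fft.fft(phase, axis=0) * band[:, None, None], axis=0))

    # Amplify the band-passed phase (the delta(t) term) and rebuild the frames.
    magnified = amplitude * np.exp(1j * (phase + alpha * band_phase))
    return np.real(np.fft.ifft2(magnified, axes=(1, 2)))

# Example: a 2-second clip at 25 fps; with very short clips (e.g. 8 frames) the
# 0.3-3 Hz band may contain no frequency bin and the output then equals the input.
clip = np.random.rand(50, 224, 224)
out = magnify_motion(clip)
```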
A specific embodiment of the present invention shows the concrete workflow of the face liveness detection system.
The user's face video is acquired by the face video acquisition module and segmented by the face video preprocessing module in units of 8 frames to obtain liveness recognition samples. Assuming the original image stream has size 224x224x8, after the subtle facial motions are magnified by the liveness motion magnification module, it is input into the dynamic motion cue sub-module. The dynamic motion cue sub-module runs at a high frame rate (25 frames per second), uses three-dimensional convolution blocks, and concentrates on collecting dynamic cues from faces or spoofing attacks. Its channel number is set small (8 channels initially, 128 channels finally), which saves model overhead on the one hand and makes the branch specialize in temporal information extraction on the other. Notably, no temporal downsampling is performed anywhere in this branch, which preserves the motion information in the time domain to the greatest extent.
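A minimal sketch of the frame segmentation performed by the preprocessing module might look as follows; the function name and the decision to drop an incomplete trailing clip are assumptions.

```python
import numpy as np

def split_into_clips(frames, clip_len=8):
    """Split a decoded face video (array of frames) into consecutive clips of
    clip_len frames; an incomplete trailing clip is discarded."""
    frames = np.asarray(frames)
    n_clips = len(frames) // clip_len
    return frames[: n_clips * clip_len].reshape(n_clips, clip_len, *frames.shape[1:])

# Example: a 100-frame 224x224 RGB video -> 12 clips of 8 frames
clips = split_into_clips(np.zeros((100, 224, 224, 3), dtype=np.uint8))
print(clips.shape)   # (12, 8, 224, 224, 3)
```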
The computation flow in the dynamic motion cue sub-module is as follows: first, a convolution with kernel size 3x5x5 and stride (1,2,2) produces features with 8 channels; then an initialization pooling layer with kernel size 1x3x3 and stride (1,2,2) keeps the channel number at 8; the features then pass through the three convolutional block layers of the second branch, which contain 2, 3 and 2 temporal convolution blocks, respectively. The structure of each temporal convolution block is shown in Figure 4. To save computation and memory, the present invention splits the original 3x3x3 three-dimensional convolution kernel into a 1x1x1 kernel, a 3x1x1 kernel, a 1x3x3 kernel and a 1x1x1 kernel in sequence; the purpose of the 1x1x1 kernels is to enhance the fitting capacity of the model.
It is likewise assumed that the original image stream has size 224x224x8 when it is input into the static texture information sub-module. The static texture information sub-module runs at a low frame rate (6.25 frames per second) and applies a finer multi-scale convolution scheme, concentrating on extracting the static texture features that distinguish real faces from spoofing attacks. It is relatively insensitive to temporal changes, so its channel number is set larger (64 channels initially, 1024 channels finally); since the input and computation of this branch are comparatively small, the higher channel count effectively improves its ability to extract fine spatial texture details.
The computation flow in the static texture information sub-module is as follows: after frames are extracted by the preprocessing layer and input at a low rate, a 1x5x5 convolution with stride (1,2,2) first produces 64 feature channels, followed by an initialization pooling layer with kernel size 1x3x3 and stride (1,2,2), the channel number remaining 64.
The output of the initial block layer is merged and concatenated with the output of the corresponding layer of the dynamic motion cue sub-module and then input into the convolutional block layers.
The features then pass through the three convolutional block layers of the first branch, which contain 2, 3 and 2 texture convolution blocks, respectively; the structure of each texture convolution block is shown in Figure 5. To further save memory and computation, the 1x5x5 convolution in the texture convolution block is split into two cascaded 1x3x3 convolutions. This multi-scale kernel design gives the module strong information extraction ability for features at different static spatial scales.
During the computation of the two branches, the outputs of corresponding convolutional block layers are concatenated and used as the input of the next convolutional block layer of the first branch. Finally, the output of the dynamic motion cue sub-module and the output of the static texture information sub-module are merged and fed into a global pooling layer and a 1024-dimensional fully connected layer, and the classification is completed by a softmax function.
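The fusion and classification head described above can be sketched as follows; the placement of the ReLU and the exact pooling operator are assumptions, and during training the raw logits (before the softmax) would normally be fed to the cross-entropy loss.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Global pooling of both branch outputs, concatenation, a 1024-d fully
    connected layer and a softmax classifier over the two classes."""

    def __init__(self, texture_channels=1024, motion_channels=128, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(texture_channels + motion_channels, 1024)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, texture_feat, motion_feat):
        t = self.pool(texture_feat).flatten(1)          # (batch, texture_channels)
        m = self.pool(motion_feat).flatten(1)           # (batch, motion_channels)
        logits = self.classifier(torch.relu(self.fc(torch.cat([t, m], dim=1))))
        return torch.softmax(logits, dim=1)

# Example with the channel counts used in this embodiment
head = FusionHead()
probs = head(torch.randn(2, 1024, 2, 7, 7), torch.randn(2, 128, 8, 7, 7))
print(probs.shape)   # torch.Size([2, 2])
```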
After the dual-branch three-dimensional convolutional neural network model required for training has been constructed, this embodiment trains the model on the training samples and their corresponding labels by the following method and saves the resulting model file to the storage medium. All training samples are trained by mini-batch gradient descent, i.e. only one batch of 10 samples is fed into the network model at a time; the samples used for training within a batch are denoted x and their corresponding labels ŷ. After the training sample x has been recognized by the dual-branch three-dimensional convolutional neural network model, the recognition result y of the model is obtained. In this embodiment, the purpose of training is to reduce the difference between the label ŷ and the recognition result y of the model, so the cross-entropy loss function is chosen to describe the difference between ŷ and y:

L(ŷ, y) = − Σ_i Σ_{j=1}^{N} ŷ_ij log(y_ij)

where L denotes the cross-entropy loss function and N denotes the number of classes of the recognition task during training, here 2; ŷ_ij denotes the probability that the i-th sample in a batch belongs to the j-th class, and y_ij denotes the probability, output by the dual-branch three-dimensional convolutional neural network model, that the i-th sample in a batch belongs to the j-th class. In this embodiment, mini-batch gradient descent is used on the PyTorch toolkit platform: the first branch and the second branch are first trained independently for two epochs each, the two branches are then merged, and the full network model is trained for 50 epochs, after which the model file is saved to the storage medium for the liveness judgment module to perform the face liveness detection and recognition task. One epoch means that all training data have been passed through the mini-batch gradient descent method once.
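A minimal training loop consistent with the description above is sketched below; the optimizer, learning rate, dataset format and file name are assumptions, since the disclosure only specifies the batch size of 10, the cross-entropy objective and the epoch schedule.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=50, batch_size=10, lr=1e-3, device="cuda"):
    """Mini-batch training with cross-entropy; `dataset` is assumed to yield
    ((slow_clip, fast_clip), label) and `model` to return raw class logits."""
    model = model.to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()                 # combines log-softmax and NLL
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for epoch in range(epochs):
        running = 0.0
        for (slow_clip, fast_clip), labels in loader:
            slow_clip, fast_clip, labels = (slow_clip.to(device),
                                            fast_clip.to(device), labels.to(device))
            optimizer.zero_grad()
            loss = criterion(model(slow_clip, fast_clip), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / len(loader):.4f}")

    # Save the model file for later loading by the liveness judgment module.
    torch.save(model.state_dict(), "dual_branch_3dcnn.pth")
```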
In an embodiment of the present application, a terminal and a storage medium are provided.
The terminal comprises a memory and a processor;
the memory is used to store a computer program;
the processor is used to realize, when executing the computer program, the functions of the aforementioned method and system based on the dual-branch three-dimensional convolutional neural network model.
It should be noted that the memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk storage. The processor is the control center of the terminal; it connects the various parts of the terminal through various interfaces and lines, and invokes the data in the memory by executing the computer program stored in the memory, so as to perform the functions of the terminal. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Of course, the terminal should also contain the components necessary for running the program, such as a power supply, a communication bus and so on.
Exemplarily, the computer program may be divided into multiple modules, each of which is stored in the memory; each divided module can complete a computer program instruction segment with a specific function, the instruction segment being used to describe the execution process of the computer program. For example, the computer program can be divided into the following modules:
Face video acquisition module: used to acquire the user's face video;
Face video preprocessing module: reads the acquired face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: used to label training samples known to be live or non-live; the liveness labeling module is enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
Liveness motion magnification module: according to the operating mode of the detection system, performs motion magnification on labeled training samples or unlabeled samples to be detected, obtaining motion-magnified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module; the static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model, their outputs are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
The programs in all the above modules are processed by the processor when executed.
In addition, the logic instructions in the above memory may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. As a computer-readable storage medium, the memory may be configured to store software programs and computer-executable programs, such as the program instructions or modules corresponding to the system in the embodiments of the present disclosure. The processor executes functional applications and data processing by running the software programs, instructions or modules stored in the memory, thereby realizing the functions of the foregoing embodiments. The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or any other medium that can store program code, and may also be a transitory storage medium. The specific process by which the instructions in the storage medium and the terminal are loaded and executed by the processor has been described in detail above.
Example
This example demonstrates a specific implementation effect. The face video acquisition module, face video preprocessing module, liveness labeling module, liveness motion magnification module, dual-branch three-dimensional convolution model training module and liveness judgment module of this example all adopt the structures and functions described above, which are not repeated here.
The implementation process is as follows.
It comprises a configuration process and a recognition process. First, the system is set to configuration mode: the face video is acquired by the face video acquisition module, the face video preprocessing module segments it to obtain liveness recognition samples, which are labeled by the liveness labeling module and then processed by the liveness motion magnification module; finally, the dual-branch three-dimensional convolution model training module trains the three-dimensional convolution model on the training sample set and saves it as a model file.
After configuration, the system is set to recognition mode: the face video is first acquired by the face video acquisition module, then the face video preprocessing module segments it to obtain the liveness recognition samples to be detected; finally, the liveness judgment module directly loads the trained model file, feeds the samples to be detected and the motion-magnified samples to be detected into the model as its two inputs, and obtains the recognition result.
A total of six test tasks were carried out in this example, covering Protocol 1, Protocol 2, Protocol 3 and Protocol 4 of the OULU-NPU database, as well as cross-tests between the CASIA-FASD database and the Replay-Attack database in both directions. The most difficult of these are the cross-tests between CASIA-FASD and Replay-Attack, because they pose a great challenge to the generalization ability of the model and its robustness under unknown lighting, background and device conditions.
For the four protocol tests on the OULU-NPU database, the present invention follows the evaluation metrics of the original protocols: the Attack Presentation Classification Error Rate (APCER), which evaluates the highest misclassification rate among all attack types; the Bona Fide Presentation Classification Error Rate (BPCER), which evaluates the misclassification rate of genuine live samples; and the Average Classification Error Rate (ACER), which is the mean of the attack classification error rate and the genuine face classification error rate:
ACER = (APCER + BPCER) / 2
For the tests on the CASIA database and the Replay-Attack database, the present invention follows the test standard of the original databases and uses the Half Total Error Rate (HTER) as the metric, defined as half of the sum of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR):
HTER = (FRR + FAR) / 2
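The evaluation metrics above can be computed as in the following sketch; the decision threshold of 0.5 and the function names are illustrative, and in the original protocols the operating threshold is typically chosen on a development set.

```python
import numpy as np

def apcer_bpcer_acer(scores, labels, attack_types, threshold=0.5):
    """scores: predicted liveness probabilities; labels: 1 = bona fide, 0 = attack;
    attack_types: identifier of the attack instrument for each sample."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    attack_types = np.asarray(attack_types)
    pred_live = scores >= threshold

    # APCER: worst rate, over attack types, of attacks accepted as live
    apcer = max(np.mean(pred_live[(labels == 0) & (attack_types == t)])
                for t in np.unique(attack_types[labels == 0]))
    # BPCER: rate of bona fide samples rejected as attacks
    bpcer = np.mean(~pred_live[labels == 1])
    return apcer, bpcer, (apcer + bpcer) / 2.0

def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of false rejection and false acceptance rates."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_live = scores >= threshold
    frr = np.mean(~pred_live[labels == 1])    # live samples rejected
    far = np.mean(pred_live[labels == 0])     # attack samples accepted
    return (frr + far) / 2.0
```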
To ensure fairness, all training and testing were performed on the PyTorch benchmark platform with a GeForce RTX 2080 Ti NVIDIA GPU, and all training and testing rules were identical. The compared models include the Local Binary Pattern method among traditional approaches, the Long Short-Term Memory network among recurrent neural networks, and a two-dimensional convolutional neural network among convolutional networks. The results are shown in Table 1 and Table 2.
Table 1. Performance of each model under the different protocols of the OULU-NPU database
[Table 1 is provided as an image in the original publication.]
Table 2. Cross-database tests of each model between the CASIA and Replay-Attack databases
[Table 2 is provided as an image in the original publication.]
It can be seen that, across all the data tests under the four test protocols of the OULU-NPU database, the dual-branch three-dimensional convolution model of the present invention holds an absolute performance advantage when compared with an ordinary two-dimensional convolutional neural network model and with traditional texture feature models. Since OULU-NPU is a database close to the real mobile phone scenarios currently in use, this also demonstrates that the proposed model can effectively defend against various non-live spoofing attacks in scenarios close to real mobile payment, giving it strong practical value. It holds a very large advantage over traditional methods as well as over two-dimensional convolutional and recurrent neural networks. In the more challenging cross-dataset tests, the robustness and superiority of the model's performance are also well demonstrated. This shows that the structure of the model of the present invention is effective and advanced.

Although OULU-NPU is a test dataset designed with cross-scene and cross-device generalization in mind, objective conditions such as the limitations of the shooting scenes and lighting, the fixed photography habits of the same group of camera operators and the relatively uniform attack habits of the same group of attackers mean that many similarities remain within the database during testing, so it cannot fully approximate real, complex application scenarios. The model is therefore cross-tested across datasets on the CASIA-FASD database and the Idiap Replay-Attack database, subjecting it to a more challenging generalization test that is much closer to real scenarios. The comparison models cover many different kinds of experimental models, including some traditional texture extraction algorithms as well as CNN and RNN temporal models from deep learning.

The cross-test between the CASIA and Replay-Attack datasets is the most demanding test of model generalization, because the two datasets differ greatly in acquisition devices, subject identities, acquisition environments and the shooting habits of the collectors, which closely matches real-world detection scenarios. As can be seen from the table, where the Half Total Error Rate (HTER) is used as the performance metric, the proposed dual-branch three-dimensional convolution model shows superior performance in a comprehensive comparison with various models, including traditional texture feature extraction models and the CNN and LSTM temporal models of deep learning.

Compared both with traditional hand-crafted feature methods and with the most advanced complex deep learning networks, the proposed dual-branch three-dimensional convolution model exhibits generalization performance far beyond that of the other models, and this holds in both directions of the cross-dataset tests. The proposed dual-branch three-dimensional convolution model is therefore more robust, which also indicates that its performance in real-scenario tests will be even better.

The above are only specific embodiments of the present invention. Obviously, the present invention is not limited to the above embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can directly derive or conceive from the disclosure of the present invention shall be considered to fall within the protection scope of the present invention.

Claims (10)

  1. A face liveness detection system using a dual-branch three-dimensional convolution model, characterized in that it comprises:
    a face video acquisition module: used to acquire the user's face video;
    a face video preprocessing module: reading the acquired face video and segmenting it in units of n frames to obtain liveness recognition samples;
    a liveness labeling module: used to label training samples known to be live or non-live, the liveness labeling module being enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
    a liveness motion magnification module: according to the operating mode of the detection system, performing liveness motion information magnification on labeled training samples or unlabeled samples to be detected to obtain motion-magnified liveness recognition samples;
    a dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module;
    when the detection system is in training mode, the output of the liveness motion magnification module serves as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module serves as the input of the static texture information sub-module; the outputs of the static texture information sub-module and the dynamic motion cue sub-module are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
    the static texture information sub-module comprises an input layer, a preprocessing frame-extraction layer with a temporal stride of k and a spatial stride of 1*1, an initial block layer with 8m channels, and p convolutional block layers; the dynamic motion cue sub-module comprises an input layer, an initial block layer with m channels, and p convolutional block layers;
    the output of the initial block layer of the static texture information sub-module is merged with the output of the initial block layer of the dynamic motion cue sub-module and serves as the input of the first convolutional block layer of the static texture information sub-module; the output of the i-th convolutional block layer of the static texture information sub-module is merged with the output of the i-th convolutional block layer of the dynamic motion cue sub-module and serves as the input of the (i+1)-th convolutional block layer of the static texture information sub-module; the convolutional block layers of the static texture information sub-module and of the dynamic motion cue sub-module each comprise several convolution sub-modules consisting of multiple convolution layers, the number of output convolution channels of each convolution sub-module being greater than its number of input convolution channels; m, p and k are integers greater than 0, 2≤k≤5, and 1≤i≤p-1;
    a liveness judgment module: when the detection system is in recognition mode, used to load the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled liveness recognition samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
  2. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the liveness motion magnification module specifically performs:
    2.1) by Fourier series decomposition, the face image f(x+δ(t)) in each frame is decomposed into a sum of sinusoidal functions:
    f(x+δ(t)) = Σ_ω A_ω e^{iω(x+δ(t))}
    where f(x+δ(t)) denotes the live face sample image in the time domain, i.e. the initial image is I(x,0) = f(x); δ(t) is the motion information function of the face; A_ω is the amplitude of the signal after transformation to the frequency domain; i denotes the imaginary unit of the complex frequency domain; each individual frequency ω corresponds to one band, and the band at a particular frequency ω is a complex sinusoidal signal:
    S_ω(x,t) = A_ω e^{iω(x+δ(t))}
    where the frequency range of subtle facial motion, ω, is set to 0.3-3 Hz so as to extract the subtle motion of the face; S_ω is a sinusoid whose phase ω(x+δ(t)) contains the motion information of the original image; the amplitude of the motion is adjusted by adjusting the phase;
    2.2) ω(x+δ(t)) in the above formula is filtered by a DC complementary filter to obtain the band-pass phase, expressed as:
    B_ω(x,t) = ω δ(t)
    the band-pass phase B_ω(x,t) is multiplied by α, where α is the motion magnification factor, and the result is added to the phase of the sub-band S_ω(x,t), thereby obtaining the motion-magnified sub-band Ŝ_ω(x,t), expressed as:
    Ŝ_ω(x,t) = A_ω e^{iω(x+(1+α)δ(t))}
    where Ŝ_ω(x,t) is a complex sinusoid whose motion is exactly (1+α) times that of the input sinusoid;
    2.3) from the motion-magnified sub-band Ŝ_ω(x,t) of step 2.2), the motion-magnified video sequence f(x+(1+α)δ(t)) is obtained; converting it back to the time domain gives the magnified result.
  3. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the initial block layer comprises an initialization convolution layer and an initialization pooling layer; the kernel of the initialization convolution layer of the static texture information sub-module is 1*5*5 and the kernel of its initialization pooling layer is 1*3*3; the kernel of the initialization convolution layer of the dynamic motion cue sub-module is 3*5*5 and the kernel of its initialization pooling layer is 1*3*3.
  4. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 3, characterized in that the static texture information sub-module comprises three convolutional block layers containing 2, 3 and 2 texture convolution sub-modules, respectively; each texture convolution sub-module consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with 1*1*1, 1*3*3 and 1*5*5 kernels, and a third convolution layer with a 1*1*1 kernel; in the texture convolution sub-modules of the first convolutional block layer, the channel numbers of the three convolution layers are 8m, 8m and 32m, respectively, and the channel numbers of the three convolution layers in the texture convolution sub-modules of each subsequent convolutional block layer are twice those of the previous convolutional block layer.
  5. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 4, characterized in that the 1*5*5 kernel of the second convolution layer in the convolutional block layers of the static texture information sub-module is split into two cascaded 1*3*3 kernels.
  6. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the dynamic motion cue sub-module comprises three convolutional block layers containing 2, 3 and 2 temporal convolution sub-modules, respectively; each temporal convolution sub-module consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel and a fourth convolution layer with a 1*1*1 kernel; in the temporal convolution sub-modules of the first convolutional block layer, the channel numbers of the four convolution layers are m, m, m and 4m, respectively, and the channel numbers of the four convolution layers in the temporal convolution sub-modules of each subsequent convolutional block layer are twice those of the previous convolutional block layer.
  7. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that m is 8.
  8. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that p is 3 and k is 4.
  9. A terminal, characterized in that it comprises a memory and a processor;
    the memory is used to store a computer program;
    the processor is used to implement, when executing the computer program, the face liveness detection system using a dual-branch three-dimensional convolution model according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the face liveness detection system using a dual-branch three-dimensional convolution model according to any one of claims 1 to 8 is realized.
PCT/CN2020/116644 2020-06-12 2020-09-22 Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium WO2021248733A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010534822.3 2020-06-12
CN202010534822.3A CN111814574B (en) 2020-06-12 2020-06-12 Face living body detection system, terminal and storage medium applying double-branch three-dimensional convolution model

Publications (1)

Publication Number Publication Date
WO2021248733A1 true WO2021248733A1 (en) 2021-12-16

Family

ID=72846094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116644 WO2021248733A1 (en) 2020-06-12 2020-09-22 Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN111814574B (en)
WO (1) WO2021248733A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836194B (en) * 2021-01-29 2023-03-21 西安交通大学 Identity authentication method and system based on internal biological characteristics of human hand
CN113158773B (en) * 2021-03-05 2024-03-22 普联技术有限公司 Training method and training device for living body detection model
CN113312965B (en) * 2021-04-14 2023-04-28 重庆邮电大学 Face unknown spoofing attack living body detection method and system
CN113422982B (en) * 2021-08-23 2021-12-14 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113792804B (en) * 2021-09-16 2023-11-21 北京百度网讯科技有限公司 Training method of image recognition model, image recognition method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202105B1 (en) * 2012-01-13 2015-12-01 Amazon Technologies, Inc. Image analysis for user authentication
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
CN109886244A (en) * 2019-03-01 2019-06-14 北京视甄智能科技有限公司 A kind of recognition of face biopsy method and device
CN110516576A (en) * 2019-08-20 2019-11-29 西安电子科技大学 Near-infrared living body faces recognition methods based on deep neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095124B (en) * 2017-06-07 2024-02-06 创新先进技术有限公司 Face living body detection method and device and electronic equipment
CN109711243B (en) * 2018-11-01 2021-02-09 长沙小钴科技有限公司 Static three-dimensional face in-vivo detection method based on deep learning
CN110414350A (en) * 2019-06-26 2019-11-05 浙江大学 The face false-proof detection method of two-way convolutional neural networks based on attention model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202105B1 (en) * 2012-01-13 2015-12-01 Amazon Technologies, Inc. Image analysis for user authentication
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
CN109886244A (en) * 2019-03-01 2019-06-14 北京视甄智能科技有限公司 A kind of recognition of face biopsy method and device
CN110516576A (en) * 2019-08-20 2019-11-29 西安电子科技大学 Near-infrared living body faces recognition methods based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bharadwaj, Samarth; Dhamecha, Tejas I.; Vatsa, Mayank; Singh, Richa: "Computationally Efficient Face Spoofing Detection with Motion Magnification", 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 23 June 2013 (2013-06-23), pages 105-110, XP032479959, DOI: 10.1109/CVPRW.2013.23 *
Chen, Haonan; Hu, Guosheng; Lei, Zhen; Chen, Yaowu; Robertson, Neil M.; Li, Stan Z.: "Attention-Based Two-Stream Convolutional Networks for Face Spoofing Detection", IEEE Transactions on Information Forensics and Security, IEEE, USA, vol. 15, 2020, pages 578-593, XP011747040, ISSN: 1556-6013, DOI: 10.1109/TIFS.2019.2922241 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626024A (en) * 2022-05-12 2022-06-14 北京吉道尔科技有限公司 Internet infringement video low-consumption detection method and system based on block chain
CN115082734A (en) * 2022-06-23 2022-09-20 中南大学 Aluminum electrolysis cell fire eye video inspection system and superheat degree deep learning identification method
CN115082734B (en) * 2022-06-23 2023-01-31 中南大学 Aluminum electrolysis cell fire eye video inspection system and superheat degree deep learning identification method
CN115410048A (en) * 2022-09-29 2022-11-29 昆仑芯(北京)科技有限公司 Training method, device, equipment and medium of image classification model and image classification method, device and equipment
CN115410048B (en) * 2022-09-29 2024-03-19 昆仑芯(北京)科技有限公司 Training of image classification model, image classification method, device, equipment and medium
CN115578771A (en) * 2022-10-24 2023-01-06 智慧眼科技股份有限公司 Living body detection method, living body detection device, computer equipment and storage medium
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system
CN117095447A (en) * 2023-10-18 2023-11-21 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium
CN117095447B (en) * 2023-10-18 2024-01-12 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111814574B (en) 2023-09-15
CN111814574A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2021248733A1 (en) Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium
Vu et al. Masked face recognition with convolutional neural networks and local binary patterns
CN102938065B (en) Face feature extraction method and face identification method based on large-scale image data
WO2015149534A1 (en) Gabor binary pattern-based face recognition method and device
CN107633207A (en) AU characteristic recognition methods, device and storage medium
Anand et al. An improved local binary patterns histograms techniques for face recognition for real time application
Qiang et al. SqueezeNet and fusion network-based accurate fast fully convolutional network for hand detection and gesture recognition
KR20130048076A (en) Face recognition apparatus and control method for the same
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
CN104766062A (en) Face recognition system and register and recognition method based on lightweight class intelligent terminal
CN103646255A (en) Face detection method based on Gabor characteristics and extreme learning machine
HN et al. Human Facial Expression Recognition from static images using shape and appearance feature
Yingxin et al. A robust hand gesture recognition method via convolutional neural network
Shao et al. PalmGAN for cross-domain palmprint recognition
JP7141518B2 (en) Finger vein matching method, device, computer equipment, and storage medium
Linda et al. Color-mapped contour gait image for cross-view gait recognition using deep convolutional neural network
CN111126250A (en) Pedestrian re-identification method and device based on PTGAN
Wang et al. A finger-vein image quality assessment algorithm combined with improved SMOTE and convolutional neural network
Kejun et al. Automatic nipple detection using cascaded adaboost classifier
Chuang et al. Hand posture recognition and tracking based on bag-of-words for human robot interaction
Yuan et al. Real-time ear detection based on embedded systems
CN112541576B (en) Biological living body identification neural network construction method of RGB monocular image
CN101739571A (en) Block principal component analysis-based device for confirming face
Xu et al. Facial analysis with a Lie group kernel
Chen et al. A classroom student counting system based on improved context-based face detector

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20939721; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20939721; Country of ref document: EP; Kind code of ref document: A1)