WO2021248733A1 - Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium - Google Patents

Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium

Info

Publication number
WO2021248733A1
WO2021248733A1 · PCT/CN2020/116644 · CN2020116644W
Authority
WO
WIPO (PCT)
Prior art keywords
module
convolution
sub
layer
branch
Prior art date
Application number
PCT/CN2020/116644
Other languages
French (fr)
Chinese (zh)
Inventor
沈海斌
欧阳文汉
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 filed Critical 浙江大学
Publication of WO2021248733A1 publication Critical patent/WO2021248733A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Definitions

  • The invention relates to the field of face liveness detection, and in particular to a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
  • Face recognition verification stands out among many authentication systems and has been deployed on a large scale in everyday life. To ensure security and prevent various potential hacker attacks, face liveness detection is a crucial part of a face verification system.
  • At present, the biggest problem facing face liveness detection algorithms is insufficient generalization: many trained models perform well on their training sets and the corresponding test sets, but their performance on brand-new, unknown data sets is unsatisfactory, which greatly reduces the practical deployment value of such algorithms. In response to this, the invention focuses mainly on improving the generalization of the face liveness detection model.
  • The local binary pattern method has notable advantages such as grayscale invariance and rotation invariance and is simple to compute, but it is also comparatively simplistic.
  • The speeded-up robust features method uses the determinant of the Hessian matrix for feature point response detection and uses an integral image to accelerate the computation.
  • Regardless of the specific method, most traditional feature methods rely on hand-crafted features combined with traditional shallow classifiers such as SVM and LDA for liveness detection.
  • Traditional hand-crafted feature extraction methods, limited by their own design and their training samples, can only target specific attack types or apply to specific environments or lighting conditions. Even comprehensive combinations of multiple traditional feature extraction methods share this limitation: because their thresholds and parameters are usually set manually, they cannot achieve strong adaptability and generalization, cannot handle unknown scenarios and attack methods, and are mostly fragile and unstable in real scenarios.
  • Although interactive methods are relatively simple and effective, the whole verification process takes longer and brings negative experiences in terms of convenience and usability. Moreover, video attacks can often defeat interactive checks based on blink detection or lip movement, so the limitations of interactive face liveness detection algorithms are also obvious.
  • A two-dimensional convolutional neural network is a feed-forward network whose artificial neurons respond to surrounding units within a limited receptive field, and it performs excellently on image processing. Compared with the local binary pattern method, it extracts two-dimensional image features with a certain degree of generalization, thereby improving model accuracy. However, deep learning methods also have bottlenecks: although such models perform very well on many data sets, they still perform poorly in cross-data-set testing. This is because most two-dimensional CNN models focus only on learning the texture features of the training samples, and those texture features vary strongly and randomly with environment, lighting, attack method and display device material, so the models cannot fit well the texture features of brand-new samples outside the training set.
  • The present invention provides a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
  • The present invention uses a three-dimensional convolutional neural network as the model backbone, which can extract high-dimensional abstract features and also summarize concrete shallow features from the shallower layers of the network, thereby obtaining more comprehensive temporal motion features. By taking both high-dimensional and low-dimensional features into account, the model achieves better results.
  • The three-dimensional convolutional neural network has a stronger ability to extract information in the time domain and is better suited as a framework for face liveness detection. Compared with an ordinary two-dimensional convolutional network it extracts temporal information better, and compared with a recurrent neural network it attends more evenly to low-order and high-order feature information, improving the generalization ability of the whole system.
  • The purpose of the present invention is to provide a face liveness detection system using a dual-branch three-dimensional convolution model, including:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model. When the detection system is in training mode, the output of the liveness motion amplification module is used as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module is used as the input of the static texture information sub-module; the outputs of the two sub-modules are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • Another object of the present invention is to disclose a terminal comprising a memory and a processor.
  • The memory is used to store a computer program.
  • The processor is configured, when the computer program is executed, to realize the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model.
  • Another object of the present invention is to disclose a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model are realized.
  • The three-dimensional convolution model of the present invention adopts a dual-branch structure. The dynamic motion cue branch runs at a high frame rate (25 frames per second) and focuses on collecting dynamic cues of real faces or forgery attacks; the static texture information branch runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme and focuses on extracting the static texture features that distinguish real faces from forged attacks, laying the foundation for the efficient operation of the whole system. The model can therefore extract static spatial texture features and temporal motion features at the same time, which enhances the generalization of the system.
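  • To make the two input rates concrete, the following minimal sketch (an illustrative assumption, not code from the patent) shows how one 8-frame clip sampled at 25 frames per second could be turned into the two branch inputs: the dynamic motion cue branch receives every frame, while the static texture branch receives every 4th frame, i.e. an effective 6.25 frames per second.

```python
import torch

def split_branch_inputs(clip: torch.Tensor, temporal_stride: int = 4):
    """Hypothetical preprocessing: derive the two branch inputs from one clip.

    clip: tensor of shape (C, T, H, W), e.g. (3, 8, 224, 224) sampled at 25 fps.
    In the full system the dynamic-branch input would be the motion-amplified
    clip rather than the raw one; this sketch only illustrates the frame rates.
    """
    fast_input = clip                         # all 8 frames (25 fps) for the dynamic branch
    slow_input = clip[:, ::temporal_stride]   # every 4th frame (6.25 fps) for the static branch
    return fast_input, slow_input
```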
  • For the high-frame-rate dynamic motion cue sub-module, the number of channels is set small (initially 8 channels, finally 128 channels). One purpose is to save model overhead; the other follows from the fact that the more channels a model has, the stronger its ability to distinguish and extract static features and the more texture and pattern detail it can capture. Because the dynamic motion cue branch has fewer channels, its ability to extract static spatial texture features is deliberately reduced; combined with its high-frame-rate temporal input, this branch specializes in extracting time-domain information and yields purer temporal motion features.
  • For the low-frame-rate static texture information sub-module, which is not very sensitive to temporal changes, the number of channels is set larger (initially 64 channels, finally 1024 channels). Because the input and computation of this branch are relatively small, a higher channel count effectively improves its ability to extract spatial texture details.
  • An ordinary three-dimensional convolutional neural network has a large model size and expensive 3D convolution kernels, while memory in practice is very limited. It can therefore only use relatively simple structures: complex network structures and training techniques cannot be used to optimize feature extraction, neither the network depth nor the number of feature channels can be set very large, and it is difficult to use more complex convolution kernels to extract features, so the effect of such a model is limited.
  • For the dynamic motion cue branch, the present invention approximately splits the 3x3x3 temporal convolution into four convolutions of 1x1x1, 3x1x1, 1x3x3 and 1x1x1. Compared with the original 3x3x3 kernel, this effectively removes redundant computation from the 3D model while keeping the module focused on time-domain information, leaving more room for the rest of the computation in the model.
  • As an approximate three-dimensional convolution scheme, its ability to capture temporal and spatial information is no worse than the original 3D convolution, and accuracy in the face liveness detection task does not drop, while more than 60% of the memory and computation is saved, which is a strong advantage.
  • For the static texture information sub-module, the present invention uses multi-scale convolution kernels for feature extraction in each layer: the input first passes through a 1x1x1 convolution, is then fed in parallel into 1x1x1, 1x3x3 and 1x5x5 convolutions, and finally the outputs are concatenated and passed through another 1x1x1 convolution.
  • This multi-scale convolution kernel gives the static texture information sub-module good extraction ability for texture and static features of different scales, greatly enhancing its ability to capture static planar spatial information.
  • To better extract time-domain information, this method does not perform any form of down-sampling in the time dimension before the final global pooling layer, preserving and extracting effective time-domain information to the greatest extent and achieving a better balance between time-domain and spatial-domain features.
  • By contrast, an ordinary three-dimensional convolutional network, owing to its expensive and complex structure, must down-sample in time to avoid the computation overflowing.
  • Figure 1 is a schematic flow diagram of an embodiment of the present invention;
  • Figure 2 is a schematic diagram of the network structure of the dual-branch three-dimensional convolution model of the present invention;
  • Figure 3 shows the network parameters of the dual-branch three-dimensional convolution model used in the embodiment of the present invention;
  • Figure 4 is a schematic diagram of a time-series convolution block of the dynamic motion cue sub-module in the present invention;
  • Figure 5 is a schematic diagram of a texture convolution block of the static texture information sub-module in the present invention.
  • A specific embodiment of the present invention provides a face liveness detection system using a dual-branch three-dimensional convolution model, including:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model; their outputs are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • The three-dimensional convolution model of the present invention adopts a dual-branch structure: the static texture information sub-module serves as the first branch, and the liveness motion amplification module together with the dynamic motion cue sub-module serve as the second branch.
  • The static texture information sub-module includes an input layer, a preprocessing layer with a time-domain stride of k and a spatial stride of 1*1, an initial block layer, and p convolutional block layers.
  • The initial block layer in the static texture information sub-module consists of an initialization convolution layer with 8m channels and an initialization pooling layer.
  • Each convolutional block layer contains the same or a different number of texture convolution blocks. Each texture convolution block consists of a first convolution layer with a 1*1*1 kernel; a second convolution layer with parallel 1*1*1, 1*3*3 and 1*5*5 kernels; and a third convolution layer with a 1*1*1 kernel. In the texture convolution blocks of the first convolutional block layer, the three convolution layers have 8m, 8m and 32m channels respectively, and in each subsequent convolutional block layer the channel counts of the three convolution layers are twice those of the previous convolutional block layer.
  • The kernel of the initialization convolution layer is 1*5*5 and the kernel of the initialization pooling layer is 1*3*3; 2 ≤ k ≤ 5, k is preferably 4, and m is preferably 8.
  • The 1*5*5 kernel in the second convolution layer of the texture convolution block can be split into two concatenated 1*3*3 kernels.
  • The dynamic motion cue sub-module includes an input layer, an initial block layer, and p convolutional block layers.
  • The initial block layer in the dynamic motion cue sub-module consists of an initialization convolution layer with m channels and an initialization pooling layer.
  • Each convolutional block layer contains the same or a different number of time-series convolution blocks. Each time-series convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel, and a fourth convolution layer with a 1*1*1 kernel. In the time-series convolution blocks of the first convolutional block layer, the four convolution layers have m, m, m and 4m channels respectively, and in each subsequent convolutional block layer the channel counts of the four convolution layers are twice those of the previous convolutional block layer.
  • The kernel of the initialization convolution layer is 3*5*5, the kernel of the initialization pooling layer is 1*3*3, and m is preferably 8.
  • The output of the i-th convolutional block layer in the static texture information sub-module and the output of the i-th convolutional block layer in the dynamic motion cue sub-module are combined to form the input of the (i+1)-th convolutional block layer in the static texture information sub-module, where p is an integer greater than 0, 1 ≤ i ≤ p-1, and p is preferably 3.
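  • The patent does not spell out how the temporally finer dynamic-branch feature maps are aligned with the static-branch feature maps before they are combined, so the sketch below makes an assumption (subsample the dynamic features in time to match, then concatenate along the channel dimension); the function name and tensor shapes are illustrative only.

```python
import torch

def lateral_fuse(static_feat: torch.Tensor, dynamic_feat: torch.Tensor) -> torch.Tensor:
    """Combine the i-th block outputs of the two branches (assumed alignment scheme).

    static_feat:  (B, C_s, T_s, H, W) from the static texture branch (low frame rate)
    dynamic_feat: (B, C_d, T_d, H, W) from the dynamic motion cue branch (high frame rate)
    The dynamic features are subsampled in time to T_s and concatenated on the
    channel axis; the result feeds the (i+1)-th block of the static branch.
    """
    t_stride = dynamic_feat.shape[2] // static_feat.shape[2]
    aligned = dynamic_feat[:, :, ::t_stride]            # (B, C_d, T_s, H, W)
    return torch.cat([static_feat, aligned], dim=1)     # (B, C_s + C_d, T_s, H, W)
```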
  • When the liveness motion amplification module performs motion amplification on a sample, the process is as follows:
  • The image sequence is written as an intensity profile f(x + δ(t)), where δ(t) is the small motion to be magnified, and is decomposed into its Fourier components, f(x + δ(t)) = Σ_ω A_ω e^{iω(x + δ(t))}, where A_ω is the amplitude of the signal transformed into frequency-domain space. Each individual frequency ω corresponds to one sub-band, and the sub-band for a specific frequency ω is the complex sinusoid S_ω(x, t) = A_ω e^{iω(x + δ(t))}.
  • The temporal frequency range of the small facial movements is set to 0.3–3 Hz, so that only the small movements of the face are extracted.
  • S_ω is a sinusoid whose phase ω(x + δ(t)) contains the motion information of the original image.
  • The phase ω(x + δ(t)) is band-pass filtered to obtain the filtered band-pass phase B_ω(x, t) = ωδ(t).
  • Multiplying the band-pass phase by the magnification factor α and adding it back to the sub-band gives the final result S'_ω(x, t) = S_ω(x, t)·e^{iαB_ω} = A_ω e^{iω(x + (1+α)δ(t))}, a complex sinusoid representing the motion-magnified image in frequency-domain space (step 2.2).
  • According to the magnified sub-bands obtained in step 2.2), the motion-magnified video sequence f(x + (1+α)δ(t)) is obtained, and converting it back to the time domain gives the final amplified result.
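  • Below is a minimal NumPy sketch of this Fourier-phase motion magnification, under simplifying assumptions not taken from the patent (a global 2D FFT per frame instead of a localized pyramid decomposition, and an FFT-mask temporal band-pass of the phase); the function name and parameters are illustrative only.

```python
import numpy as np

def magnify_motion(frames: np.ndarray, alpha: float = 10.0,
                   f_lo: float = 0.3, f_hi: float = 3.0, fps: float = 25.0) -> np.ndarray:
    """Amplify small motions in a grayscale clip `frames` of shape (T, H, W).

    Each spatial-frequency sub-band A_w * exp(i*w*(x + delta(t))) has its phase
    band-pass filtered in time over [f_lo, f_hi] Hz (~= w * delta(t)); the filtered
    phase, scaled by alpha, is added back, giving A_w * exp(i*w*(x + (1+alpha)*delta(t))).
    """
    T = frames.shape[0]
    spec = np.fft.fft2(frames, axes=(1, 2))        # per-frame spatial FFT
    amp, phase = np.abs(spec), np.angle(spec)
    phase = np.unwrap(phase, axis=0)               # make the phase smooth over time

    freqs = np.fft.fftfreq(T, d=1.0 / fps)         # temporal frequencies in Hz
    keep = (np.abs(freqs) >= f_lo) & (np.abs(freqs) <= f_hi)
    band = np.fft.ifft(np.fft.fft(phase, axis=0) * keep[:, None, None], axis=0).real

    magnified = amp * np.exp(1j * (phase + alpha * band))
    return np.fft.ifft2(magnified, axes=(1, 2)).real
```

  • Note that for a 0.3–3 Hz band to contain any temporal-frequency bins, the clip passed to such a filter has to span several seconds; an 8-frame unit at 25 frames per second is too short on its own, so presumably the amplification is applied over a longer video segment.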
  • A specific embodiment of the present invention illustrates the working process of the face liveness detection system.
  • The user's face video is acquired through the face video collection module, and the face video preprocessing module segments it in units of 8 frames to obtain liveness recognition samples.
  • The size of the original image stream is 224x224x8.
  • The dynamic motion cue sub-module runs at a high frame rate (25 frames per second), uses three-dimensional convolution modules, and pays more attention to collecting dynamic cues of real faces or forgery attacks.
  • Its number of channels is set small (initially 8 channels, finally 128 channels): on the one hand this saves model overhead, and on the other hand it makes the branch more specialized in extracting time-domain information. Notably, no down-sampling is performed in the time domain throughout the whole branch, which preserves temporal motion information to the greatest extent.
  • The calculation process in the dynamic motion cue sub-module is as follows: the input first passes through a convolution of size 3x5x5 with strides 1, 2, 2, giving a feature map with 8 channels; it then passes through an initial pooling layer of size 1x3x3 with strides 1, 2, 2, keeping 8 channels; it then passes through the three convolutional block layers of the second branch, which contain 2, 3 and 2 time-series convolution blocks respectively.
  • The structure of each time-series convolution block is shown in Figure 4: the present invention splits the original 3x3x3 three-dimensional convolution kernel so that the input passes in sequence through a 1x1x1 kernel, a 3x1x1 kernel, a 1x3x3 kernel and a 1x1x1 kernel.
  • The purpose of the 1x1x1 convolution kernels is to enhance the fitting ability of the model.
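  • A minimal PyTorch sketch of such a time-series convolution block is given below. The module name, padding, strides and the ReLU activations are assumptions; the 1x1x1 / 3x1x1 / 1x3x3 / 1x1x1 ordering and the m : m : m : 4m channel ratio follow the description above and Figure 4.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Factorized stand-in for a 3x3x3 kernel: 1x1x1 -> 3x1x1 -> 1x3x3 -> 1x1x1."""

    def __init__(self, in_ch: int, m: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, m, kernel_size=1),                          # 1x1x1, m channels
            nn.ReLU(inplace=True),
            nn.Conv3d(m, m, kernel_size=(3, 1, 1), padding=(1, 0, 0)),   # 3x1x1: temporal
            nn.ReLU(inplace=True),
            nn.Conv3d(m, m, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # 1x3x3: spatial
            nn.ReLU(inplace=True),
            nn.Conv3d(m, 4 * m, kernel_size=1),                          # 1x1x1, 4m channels
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No temporal stride anywhere: the time dimension is never down-sampled.
        return self.block(x)

# Example: first block layer of the dynamic branch (m = 8), input (B, 8, 8, 56, 56).
x = torch.randn(2, 8, 8, 56, 56)
y = TemporalConvBlock(in_ch=8, m=8)(x)   # -> (2, 32, 8, 56, 56)
```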
  • The original image stream of size 224x224x8 is also input into the static texture information sub-module.
  • The static texture information sub-module runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme, and focuses on extracting the static texture features that distinguish real faces from forged attacks. It is not very sensitive to changes in the time domain, and its number of channels is set larger (initially 64 channels, finally 1024 channels); because the input and computation of this branch are small, a higher channel count effectively improves its ability to extract spatial texture details.
  • The calculation process in the static texture information sub-module is as follows: after frames are extracted by the preprocessing layer, the low-frequency input first passes through a 1x5x5 convolution kernel with strides 1, 2, 2, giving 64 feature channels, and then through an initialization pooling layer of size 1x3x3 with strides 1, 2, 2, with the channel count remaining 64.
  • The output of the initial block layer is merged and concatenated with the output of the corresponding layer of the dynamic motion cue sub-module and then input into the convolutional block layers.
  • The three convolutional block layers contain 2, 3 and 2 texture convolution blocks respectively.
  • The structure of each texture convolution block is shown in Figure 5. To further reduce the computation of the model, the 1x5x5 convolution shown in Figure 5 is split into two 1x3x3 convolutions in series.
  • This multi-scale convolution kernel scheme gives the module strong information extraction capability for static spatial features of different scales.
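  • A corresponding PyTorch sketch of the multi-scale texture convolution block is shown below. The module name, padding and activations are assumptions, and since the patent does not state how the 8m channels of the multi-scale stage are divided among the parallel branches, the sketch gives each parallel branch the full 8m channels and projects the concatenation down to 32m; the kernel layout and the replacement of the 1x5x5 branch by two stacked 1x3x3 convolutions follow the description and Figure 5.

```python
import torch
import torch.nn as nn

class TextureConvBlock(nn.Module):
    """Multi-scale block: 1x1x1 -> parallel {1x1x1, 1x3x3, 1x3x3 + 1x3x3} -> concat -> 1x1x1."""

    def __init__(self, in_ch: int, m: int):
        super().__init__()
        mid = 8 * m                                       # assumed width of the parallel stage
        self.reduce = nn.Conv3d(in_ch, mid, kernel_size=1)
        self.branch1 = nn.Conv3d(mid, mid, kernel_size=1)
        self.branch3 = nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # The 1x5x5 branch is realized as two stacked 1x3x3 convolutions to save computation.
        self.branch5 = nn.Sequential(
            nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        )
        self.project = nn.Conv3d(3 * mid, 32 * m, kernel_size=1)   # 32m output channels
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.reduce(x))
        multi = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.act(self.project(self.act(multi)))

# Example: first block layer of the static branch (m = 8); the input is assumed to be the
# 64 static channels concatenated with 8 laterally-fused dynamic channels.
x = torch.randn(2, 72, 2, 56, 56)
y = TextureConvBlock(in_ch=72, m=8)(x)   # -> (2, 256, 2, 56, 56)
```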
  • In the same way, after each subsequent convolutional block layer, the output of the corresponding convolutional block layer of the second branch is combined with that of the first branch and used as the input of the next convolutional block layer of the first branch.
  • Finally, the output of the dynamic motion cue sub-module and the output of the static texture information sub-module are combined and fed into the global pooling layer and a 1024-unit fully connected layer, and the classification is completed by the softmax function.
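  • The final fusion and classification step could look like the sketch below; the layer names, the ReLU and the interpretation of "a 1024 fully connected layer" as a hidden layer of width 1024 are assumptions, while the global pooling, the fully connected layer, the softmax and the two classes follow the description.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Global-pool both branch outputs, concatenate, FC-1024, then 2-way softmax."""

    def __init__(self, static_ch: int = 1024, dynamic_ch: int = 128, num_classes: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                 # global pooling over (T, H, W)
        self.fc = nn.Linear(static_ch + dynamic_ch, 1024)   # 1024-unit fully connected layer
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, static_feat: torch.Tensor, dynamic_feat: torch.Tensor) -> torch.Tensor:
        s = self.pool(static_feat).flatten(1)               # (B, 1024)
        d = self.pool(dynamic_feat).flatten(1)               # (B, 128)
        logits = self.classifier(torch.relu(self.fc(torch.cat([s, d], dim=1))))
        return torch.softmax(logits, dim=1)                 # live / attack probabilities
```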
  • This embodiment trains the model by the following method and saves the model file to the storage medium.
  • Batch gradient descent training is performed: in each step a batch of 10 samples is fed into the network model for training. The samples used in a batch are denoted x and their corresponding labels ŷ.
  • The model produces the recognition result y.
  • The purpose of training is to reduce the difference between the label ŷ and the recognition result y, so the cross-entropy loss function is selected to describe the difference between ŷ and y. The cross-entropy loss function is: L = -Σ_i Σ_{j=1}^{N} ŷ_ij · log(y_ij),
  • where N is the number of classes of the recognition task during training, here 2,
  • and y_ij is the probability, output by the dual-branch three-dimensional convolutional neural network, that the i-th sample in a batch belongs to the j-th class.
  • The batch gradient descent method is used on the PyTorch platform. The first branch and the second branch are first trained independently for two epochs each, then the two branches are merged; after the merged network is trained for 50 epochs, the model file is saved to the storage medium for the liveness judgment module to perform the face liveness detection and recognition task. One epoch means that all training data are passed once through the batch gradient descent method.
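  • A minimal training-loop sketch under these settings (batch size 10, two classes, cross-entropy, 50 epochs) is shown below; the optimizer choice, learning rate and the dataset layout are placeholders, since the patent does not specify them, and the separate two-epoch pre-training of each branch before merging is omitted for brevity.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, epochs: int = 50, lr: float = 1e-3,
          device: str = "cuda") -> None:
    """Batch gradient descent with 10-sample batches and cross-entropy loss.

    `train_set` is assumed to yield ((slow_clip, fast_clip), label) pairs, where
    `fast_clip` is the motion-amplified input of the dynamic branch and `label`
    is 0/1 (attack / live). One epoch passes all training data once.
    """
    loader = DataLoader(train_set, batch_size=10, shuffle=True)
    criterion = nn.CrossEntropyLoss()              # softmax + cross-entropy over N = 2 classes
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.to(device).train()

    for epoch in range(epochs):
        for (slow_clip, fast_clip), label in loader:
            slow_clip, fast_clip, label = slow_clip.to(device), fast_clip.to(device), label.to(device)
            logits = model(slow_clip, fast_clip)   # dual-branch forward pass
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Save the model file for the liveness judgment module to load later.
    torch.save(model.state_dict(), "two_branch_3dconv.pt")
```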
  • A terminal and a storage medium are provided.
  • Terminal: includes a memory and a processor;
  • the memory is used to store computer programs;
  • the processor is used to implement, when the computer program is executed, the functions of the aforementioned face liveness detection method and system using the dual-branch three-dimensional convolution model.
  • The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one magnetic disk memory.
  • The above-mentioned processor is the control center of the terminal; it connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal by executing the computer program in the memory and calling the data in the memory.
  • The processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the terminal should also have necessary components for program operation, such as a power supply, a communication bus, and so on.
  • The computer program can be divided into multiple modules, each stored in the memory.
  • Each divided module is a segment of computer program instructions that performs a specific function, and the instruction segments are used to describe the execution process of the computer program in the terminal.
  • The computer program can be divided into the following modules:
  • Face video collection module: collects the user's face video;
  • Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
  • Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
  • Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
  • Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model; their outputs are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
  • Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample to be detected output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
  • The above-mentioned logical instructions in the memory can be implemented in the form of software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
  • the memory can be configured to store software programs and computer-executable programs, such as program instructions or modules corresponding to the system in the embodiments of the present disclosure.
  • the processor executes functional applications and data processing by running software programs, instructions or modules stored in the memory, that is, realizes the functions in the foregoing embodiments.
  • The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or another medium that can store program code; it may also be a temporary storage medium.
  • the specific process in which the multiple instructions in the foregoing storage medium and the terminal are loaded and executed by the processor has been described in detail in the foregoing.
  • This embodiment is used to demonstrate a specific implementation effect.
  • The face video collection module, face video preprocessing module, liveness labeling module, liveness motion amplification module, dual-branch three-dimensional convolution model training module, and liveness judgment module in this embodiment all adopt the structures and functions described above, which are not repeated here.
  • The implementation process is as follows:
  • It includes a configuration (training) process and a recognition process. First the system is set to configuration mode: the face video is obtained through the face video collection module, the face video preprocessing module segments it to obtain liveness recognition samples, the samples are labeled by the liveness labeling module and then processed by the liveness motion amplification module, and finally the dual-branch three-dimensional convolution model training module trains the three-dimensional convolution model on the training sample set and saves it as a model file.
  • Then the system is set to recognition mode: the face video is first obtained through the face video collection module, the face video preprocessing module segments it to obtain the liveness recognition samples to be detected, and finally the liveness judgment module directly loads the trained model file and uses the sample to be detected and its motion-amplified counterpart as the two model inputs to obtain the recognition result.
  • A total of 6 test tasks were performed: the Protocol 1, Protocol 2, Protocol 3 and Protocol 4 tests of the OULU-NPU database, and the two cross-tests between the CASIA-FASD database and the Replay-Attack database.
  • The most difficult of these is the cross-test between the CASIA-FASD database and the Replay-Attack database, because it poses great challenges to the generalization ability of the model and its robustness under unknown lighting, background and device conditions.
  • For the OULU-NPU protocols, the present invention follows the evaluation criteria of the original protocol. It uses the attack presentation classification error rate (APCER), which evaluates the highest misclassification error rate over all attack methods; the bona fide presentation classification error rate (BPCER), which evaluates the classification error rate on genuine live samples; and the average classification error rate (ACER), which is the average of the forgery-attack classification error rate and the real-face classification error rate: ACER = (APCER + BPCER) / 2.
  • For the cross-database tests, the present invention follows the test standards of the original databases and uses the half total error rate (HTER) as the indicator, which is half of the sum of the false rejection rate (FRR) and the false acceptance rate (FAR): HTER = (FRR + FAR) / 2.
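  • For reference, these error rates can be computed as in the short sketch below; the function and variable names are illustrative, not from the patent.

```python
def acer(apcer_per_attack: dict[str, float], bpcer: float) -> float:
    """ACER on the OULU-NPU protocols: mean of the worst-case APCER and the BPCER."""
    apcer = max(apcer_per_attack.values())   # highest error rate over all attack types
    return (apcer + bpcer) / 2.0

def hter(frr: float, far: float) -> float:
    """Half Total Error Rate used for the CASIA-FASD / Replay-Attack cross-tests."""
    return (frr + far) / 2.0

# Example: APCER of 2.1% for print and 3.4% for replay attacks, BPCER of 1.8%.
print(acer({"print": 0.021, "replay": 0.034}, bpcer=0.018))   # 0.026
print(hter(frr=0.05, far=0.07))                               # 0.06
```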
  • The models compared here include the local binary pattern method among traditional methods, the long short-term memory (LSTM) network among recurrent neural networks, and a two-dimensional convolutional neural network among convolutional neural networks. The results are shown in Table 1 and Table 2.
  • Table 1 The performance of each model on different protocols in the OULU-NPU database
  • Under all four test protocols of the OULU-NPU database, the dual-branch three-dimensional convolution model of the present invention was compared with an ordinary two-dimensional convolutional neural network model and with traditional texture feature models.
  • The three-dimensional mimicry model proposed by the invention holds an absolute performance advantage in every case. Since OULU-NPU is a database close to the mobile phone scenarios actually in use, this also shows that in realistic mobile payment scenarios the model can effectively prevent various non-live forgery attacks, which has strong practical value. It has clear advantages over traditional methods, two-dimensional convolutional neural networks and recurrent neural networks. In the more challenging cross-data-set tests, the robustness and superiority of the model are also well reflected, which shows that the structure of the model of the present invention is effective and advanced.
  • Although OULU-NPU is a test database that accounts for cross-scene and cross-device generalization, it is still constrained by objective conditions: the shooting scenes and lighting are limited, the photography habits of the same group of photographers are fixed, and the attack methods and habits of the same group of attackers are relatively uniform, so many similarities remain within the database during testing and it cannot fully approximate complex real-world application scenarios.
  • The model is therefore cross-tested across data sets on the CASIA-FASD database and the Idiap Replay-Attack database, which provides a more challenging generalization test that is much closer to actual scenarios.
  • The comparison covers many different types of models, including several traditional texture extraction algorithms as well as CNN and RNN (temporal) models from deep learning.
  • The cross-test on the CASIA-FASD and Replay-Attack datasets is the most demanding test of model generalization, because the two datasets differ greatly in collection devices, live subjects, collection environments and the shooting habits of the collectors, which closely matches real detection scenarios.
  • The half total error rate (HTER) is used as the performance evaluation index.
  • In the cross-database tests, the dual-branch 3D convolution model proposed by the invention shows generalization performance far exceeding the other models. It can be seen that the proposed dual-branch 3D convolution model is more robust, which also confirms that its performance in real-scene tests will be better.

Abstract

A live face detection system applying a two-branch three-dimensional convolutional model, a terminal and a storage medium relate to the technical field of live face detection. The live face detection system comprises a face video capturing module, a face video pre-processing module, a live body annotation module, a live body motion amplification module, a two-branch three-dimensional convolutional model training module and a live body judgement module. The two-branch three-dimensional convolutional model training module is configured with a two-branch three-dimensional convolutional model and comprises a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. Outputs of the static texture information sub-module and the dynamic motion cue sub-module are pooled, aggregated and fused through the fusion sub-module, and a detection result is produced through the classification sub-module. The two-branch three-dimensional convolutional model is mimicry-based and biologically meaningful, and has strong robustness and generalization. Liveness detection for a face recognition system is well guaranteed, system security is improved, and information and property are protected from infringement.

Description

Face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model
Technical Field
The invention relates to the field of face liveness detection, and in particular to a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model.
Background Art
As people increasingly use electronic devices such as laptops and smartphones to make payments, shop, pay bills and interact socially, the demand for electronic identity authentication keeps growing. Face recognition verification stands out among many authentication systems and has been deployed on a large scale in everyday life. To ensure security and prevent various potential hacker attacks, face liveness detection is a crucial part of a face verification system.
At present, the biggest problem facing face liveness detection algorithms is insufficient generalization: many trained models perform well on their training sets and the corresponding test sets, but their performance on brand-new, unknown data sets is unsatisfactory, which greatly reduces the practical deployment value of such algorithms. In response to this, the invention focuses mainly on improving the generalization of the face liveness detection model.
There are many traditional methods, and they differ considerably. The local binary pattern method has notable advantages such as grayscale invariance and rotation invariance and is simple to compute, but it is also comparatively simplistic. The speeded-up robust features method uses the determinant of the Hessian matrix for feature point response detection and uses an integral image to accelerate the computation. Regardless of the specific method, however, most traditional feature methods rely on hand-crafted features combined with traditional shallow classifiers such as SVM and LDA for liveness detection. Limited by their own design and their training samples, traditional hand-crafted feature extraction methods can only target specific attack types or apply to specific environments or lighting conditions. Even comprehensive combinations of multiple traditional feature extraction methods share this limitation: because their thresholds and parameters are usually set manually, they cannot achieve strong adaptability and generalization, cannot handle unknown scenarios and attack methods, and are mostly fragile and unstable in real scenarios.
Although interactive methods are relatively simple and effective, the whole verification process takes longer and brings negative experiences in terms of convenience and usability. Moreover, video attacks can often defeat interactive checks based on blink detection or lip movement, so the limitations of interactive face liveness detection algorithms are also obvious.
At present, deep learning methods are more commonly used to solve the face liveness detection problem. A two-dimensional convolutional neural network is a feed-forward neural network whose artificial neurons respond to surrounding units within a limited receptive field, and it performs excellently on image processing. Compared with the local binary pattern method and similar approaches, it extracts two-dimensional image features with a certain degree of generalization, thereby improving model accuracy. However, deep learning methods also have bottlenecks: although such models perform very well on many data sets, they still perform poorly in cross-data-set testing. This is because most two-dimensional CNN models focus only on learning the texture features of the training samples, and those texture features vary strongly and randomly with environment, lighting, attack method and display device material, so the models cannot fit well the texture features of brand-new samples outside the training set.
In addition, some methods try to introduce additional constraints by extracting face depth maps or using other auxiliary supervision to enhance the generalization ability of the model. However, such auxiliary supervision is only an indirect means of supervision, and its relevance to face liveness detection remains inconclusive. Moreover, its extraction not only requires a large amount of computation but also occupies a large amount of disk space, which brings considerable inconvenience to both training and subsequent testing.
Therefore, the generalization problem of the model has always been an urgent issue in the application of deep learning to liveness detection.
Summary of the Invention
In order to solve the problems that existing face liveness detection algorithms generalize poorly, cannot handle unknown scenarios and attack methods, and are fragile and unstable in real scenarios, the present invention provides a face liveness detection system, terminal and storage medium using a dual-branch three-dimensional convolution model. The present invention uses a three-dimensional convolutional neural network as the model backbone, which can extract high-dimensional abstract features and also summarize concrete shallow features from the shallower layers of the network, thereby obtaining more comprehensive temporal motion features. By taking both high-dimensional and low-dimensional features into account, the model achieves better results. At the same time, the three-dimensional convolutional neural network has a stronger ability to extract information in the time domain and is better suited as a framework for face liveness detection: compared with an ordinary two-dimensional convolutional network it extracts temporal information better, and compared with a recurrent neural network it attends more evenly to low-order and high-order feature information, improving the generalization ability of the whole system.
The purpose of the present invention is to provide a face liveness detection system using a dual-branch three-dimensional convolution model, including:
Face video collection module: collects the user's face video;
Face video preprocessing module: reads the collected face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: labels training samples known to be live or non-live; this module is turned on when the detection system is in training mode and turned off when the system is in recognition mode;
Liveness motion amplification module: according to the operating mode of the detection system, performs motion amplification on the labeled training samples or the unlabeled samples to be tested, obtaining motion-amplified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model that includes a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module. The static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model. When the detection system is in training mode, the output of the liveness motion amplification module is used as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module is used as the input of the static texture information sub-module; the outputs of the two sub-modules are pooled, aggregated and fused by the fusion sub-module, and the detection result is then output by the classification sub-module;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained model, uses the unlabeled sample to be detected output by the face video preprocessing module as the input of the static texture information sub-module and the motion-amplified unlabeled sample output by the liveness motion amplification module as the input of the dynamic motion cue sub-module, and outputs the recognition result.
Another object of the present invention is to disclose a terminal comprising a memory and a processor.
The memory is used to store a computer program.
The processor is configured, when the computer program is executed, to realize the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model.
Another object of the present invention is to disclose a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the functions of the above face liveness detection method and system using the dual-branch three-dimensional convolution model are realized.
The beneficial effects of the present invention are:
1) The three-dimensional convolution model of the present invention adopts a dual-branch structure. The dynamic motion cue branch runs at a high frame rate (25 frames per second) and focuses on collecting dynamic cues of real faces or forgery attacks; the static texture information branch runs at a low frame rate (6.25 frames per second), applies a finer multi-scale convolution scheme and focuses on extracting the static texture features that distinguish real faces from forged attacks, laying the foundation for the efficient operation of the whole system. The model can extract static spatial texture features and temporal motion features at the same time, which enhances the generalization of the system.
In the dual-branch model, the number of channels of the high-frame-rate dynamic motion cue sub-module is set small (initially 8 channels, finally 128 channels). One purpose is to save model overhead; the other follows from the fact that the more channels a model has, the stronger its ability to distinguish and extract static features and the more texture and pattern detail it can capture. Because the dynamic motion cue branch has fewer channels, its ability to extract static spatial texture features is deliberately reduced; combined with its high-frame-rate temporal input, this branch specializes in extracting time-domain information and yields purer temporal motion features. The low-frame-rate static texture information sub-module, in contrast, is not very sensitive to temporal changes, so its number of channels is set larger (initially 64 channels, finally 1024 channels); because the input and computation of this branch are relatively small, a higher channel count effectively improves its ability to extract spatial texture details. An ordinary three-dimensional convolutional neural network, with its large model size, expensive 3D convolution kernels and very limited memory in practice, can only use relatively simple structures: complex network structures and training techniques cannot be used to optimize feature extraction, neither the network depth nor the number of feature channels can be set very large, and it is difficult to use more complex convolution kernels to extract features, so the effect of such a model is limited.
2)对于静态纹理信息分支以及动态运动线索分支,为了引导其分别提取静态空间特征,以及动态时域特征,本发明设置了不同的卷积层结构。2) For the static texture information branch and the dynamic motion clue branch, in order to guide them to extract static spatial features and dynamic time domain features, different convolutional layer structures are provided in the present invention.
For the dynamic motion cue branch, the present invention approximately decomposes the 3x3x3 spatio-temporal convolution into four convolutions of 1x1x1, 3x1x1, 1x3x3 and 1x1x1. Compared with the original 3x3x3 kernel, this effectively removes redundant computation from the 3D network and leaves more room for the rest of the model, while keeping the module focused on temporal information. At the same time, as an approximate 3D convolution scheme, it is no worse than the original 3D convolution at capturing temporal and spatial information, so accuracy on the face liveness detection task does not drop, yet more than 60% of the memory and computation can be saved, which is a strong advantage.
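For illustration, the following is a minimal PyTorch sketch of such a decomposed temporal convolution block. The module name, the intermediate channel width and the BatchNorm/ReLU layers between the convolutions are assumptions added for readability and are not specified in the disclosure.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Approximates a 3x3x3 convolution with a cheaper chain of
    1x1x1 -> 3x1x1 (temporal) -> 1x3x3 (spatial) -> 1x1x1 convolutions."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=1),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            # temporal-only convolution (kernel order is T, H, W)
            nn.Conv3d(mid_channels, mid_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            # spatial-only convolution
            nn.Conv3d(mid_channels, mid_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, out_channels, kernel_size=1),
        )

    def forward(self, x):              # x: (batch, channels, T, H, W)
        return self.block(x)

# Example: a block of the first layer of the motion branch (m = 8, output 4m = 32)
block = TemporalConvBlock(8, 8, 32)
out = block(torch.randn(1, 8, 8, 56, 56))   # -> torch.Size([1, 32, 8, 56, 56])
```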
For the static texture information sub-module, the present invention uses multi-scale convolution kernels at every layer: the input first passes through a 1x1x1 convolution, is then fed in parallel into 1x1x1, 1x3x3 and 1x5x5 convolutions, and the outputs are finally concatenated and passed through another 1x1x1 convolution. This multi-scale design gives the static texture information sub-module good extraction ability for textures and static features at different scales, greatly enhancing its ability to capture static planar spatial information. The reason is that the sizes of texture and pattern features are not fixed; a fixed-size kernel (for example 1x3x3) makes the network sensitive only to features of a particular size and prone to missing features at other scales. Conversely, with kernels of different sizes, both large global features (such as global moiré fringes or water-ripple patterns on the surface of non-live samples) and small, subtle local features (such as local specular reflection textures and light spots on a non-live surface) have a suitable kernel to extract them: larger kernels capture coarse overall structural contours, while smaller kernels capture fine details.
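A corresponding PyTorch sketch of the multi-scale texture convolution block is given below; the activation functions and the shared intermediate width of the three parallel branches are assumptions added for illustration.

```python
import torch
import torch.nn as nn

class TextureConvBlock(nn.Module):
    """Multi-scale texture block: 1x1x1 reduction, parallel 1x1x1 / 1x3x3 / 1x5x5
    spatial convolutions, channel concatenation, then a 1x1x1 fusion convolution.
    All kernels have temporal extent 1, so only spatial texture is modeled."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv3d(in_channels, mid_channels, kernel_size=1)
        self.branch1 = nn.Conv3d(mid_channels, mid_channels, kernel_size=1)
        self.branch3 = nn.Conv3d(mid_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.branch5 = nn.Conv3d(mid_channels, mid_channels,
                                 kernel_size=(1, 5, 5), padding=(0, 2, 2))
        self.fuse = nn.Conv3d(3 * mid_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, channels, T, H, W)
        x = self.relu(self.reduce(x))
        multi = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.relu(self.fuse(multi))

# Example: a block of the first layer of the texture branch (8m = 64 in, 32m = 256 out)
block = TextureConvBlock(64, 64, 256)
out = block(torch.randn(1, 64, 2, 56, 56))      # -> torch.Size([1, 256, 2, 56, 56])
```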
3) To improve the model's discrimination in the temporal dimension and better extract temporal information, the method performs no downsampling of any kind along the time dimension before the final global pooling layer. This preserves and extracts the useful information in the time domain to the greatest extent and achieves a better balance between temporal and spatial features. An ordinary, more expensive 3D convolutional network, with its complex structure, must downsample in order to avoid an explosion in computation.
Description of the drawings
Figure 1 is a schematic flow diagram of an embodiment of the present invention;
Figure 2 is a schematic diagram of the network structure of the dual-branch three-dimensional convolution model of the present invention;
Figure 3 shows the network parameters of the dual-branch three-dimensional convolution model used in the embodiment of the present invention;
Figure 4 is a schematic diagram of the temporal convolution block of the dynamic motion cue sub-module of the present invention;
Figure 5 is a schematic diagram of the texture convolution block of the static texture information sub-module of the present invention.
Detailed description
The present invention is further described below in conjunction with the drawings and embodiments.
A specific embodiment of the present invention presents a face liveness detection system using a dual-branch three-dimensional convolution model, comprising:
Face video acquisition module: used to acquire the user's face video;
Face video preprocessing module: reads the acquired face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: used to label training samples known to be live or non-live; the liveness labeling module is enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
Liveness motion magnification module: according to the operating mode of the detection system, performs motion magnification on labeled training samples or unlabeled samples to be detected, obtaining motion-magnified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module; the static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model, their outputs are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled liveness recognition samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
The three-dimensional convolution model of the present invention adopts a dual-branch structure: the static texture information sub-module is the first branch, and the liveness motion magnification module together with the dynamic motion cue sub-module forms the second branch. The static texture information sub-module comprises an input layer, a preprocessing layer with a temporal stride of k and a spatial stride of 1*1, an initial block layer, and p convolutional block layers. Preferably, the initial block layer of the static texture information sub-module consists of an initialization convolution layer with 8m channels and an initialization pooling layer. Each convolutional block layer contains the same or different numbers of texture convolution blocks; each texture convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with 1*1*1, 1*3*3 and 1*5*5 kernels, and a third convolution layer with a 1*1*1 kernel. In the texture convolution blocks of the first convolutional block layer, the channel numbers of the three convolution layers are 8m, 8m and 32m, respectively, and the channel numbers of the three convolution layers in each subsequent convolutional block layer are twice those of the previous one. The kernel of the initialization convolution layer is 1*5*5, the kernel of the initialization pooling layer is 1*3*3, 2≤k≤5, k is preferably 4, and m is preferably 8. In a specific implementation of the present invention, the 1*5*5 kernel in the second convolution layer of the texture convolution block can be split into two cascaded 1*3*3 kernels.
The dynamic motion cue sub-module comprises an input layer, an initial block layer, and p convolutional block layers. Preferably, the initial block layer of the dynamic motion cue sub-module consists of an initialization convolution layer with m channels and an initialization pooling layer. Each convolutional block layer contains the same or different numbers of temporal convolution blocks; each temporal convolution block consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel and a fourth convolution layer with a 1*1*1 kernel. In the temporal convolution blocks of the first convolutional block layer, the channel numbers of the four convolution layers are m, m, m and 4m, respectively, and the channel numbers of the four convolution layers in each subsequent convolutional block layer are twice those of the previous one. The kernel of the initialization convolution layer is 3*5*5, the kernel of the initialization pooling layer is 1*3*3, and m is preferably 8.
The output of the i-th convolutional block layer of the static texture information sub-module is merged with the output of the i-th convolutional block layer of the dynamic motion cue sub-module and used as the input of the (i+1)-th convolutional block layer of the static texture information sub-module; p is an integer greater than 0, 1≤i≤p-1, and p is preferably 3.
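The lateral merging between the two branches can be sketched as a channel-wise concatenation, as in the schematic skeleton below. The individual block layers are supplied by the caller, and the temporal subsampling of the motion-branch features before concatenation is an assumption for alignment (the disclosure only states that the outputs are merged).

```python
import torch
import torch.nn as nn

class DualBranchBackbone(nn.Module):
    """Schematic lateral fusion: after each block layer, the motion-branch output
    is concatenated onto the texture-branch features and fed into the next
    texture-branch block layer. Later texture layers must therefore be built
    with in_channels = texture_out + motion_out."""

    def __init__(self, texture_layers, motion_layers):
        super().__init__()
        assert len(texture_layers) == len(motion_layers)
        self.texture_layers = nn.ModuleList(texture_layers)
        self.motion_layers = nn.ModuleList(motion_layers)

    def forward(self, texture_x, motion_x):
        for i, (t_layer, m_layer) in enumerate(zip(self.texture_layers, self.motion_layers)):
            texture_x = t_layer(texture_x)
            motion_x = m_layer(motion_x)
            if i < len(self.texture_layers) - 1:
                # Align the temporal length of the fast (motion) features with the
                # slow (texture) features before concatenation; assumes the fast
                # clip length is an integer multiple of the slow clip length.
                stride = motion_x.size(2) // texture_x.size(2)
                texture_x = torch.cat([texture_x, motion_x[:, :, ::stride]], dim=1)
        return texture_x, motion_x
```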
When performing the motion magnification of a sample, the liveness motion magnification module operates as follows:
1) By Fourier series decomposition, the face image f(x+δ(t)) in each frame is decomposed into a sum of sinusoidal functions:
f(x+δ(t)) = Σ_ω A_ω e^{iω(x+δ(t))}
where f(x+δ(t)) denotes the live face sample image in the time domain, i.e. the initial image is I(x,0) = f(x); δ(t) is the motion information function of the face; and A_ω is the amplitude of the signal after transformation to the frequency domain. Each individual frequency ω corresponds to one band, and the band at a particular frequency ω is a complex sinusoidal signal:
S_ω(x,t) = A_ω e^{iω(x+δ(t))}
where the frequency range of subtle facial motion, ω, is set to 0.3-3 Hz so as to extract the subtle motion of the face. S_ω is a sinusoid whose phase ω(x+δ(t)) contains the motion information of the original image.
2) In order to isolate the subtle motions within the corresponding temporal frequency band, the phase ω(x+δ(t)) is filtered to obtain the band-pass phase, expressed as:
B_ω(x,t) = ω δ(t)
The band-pass phase B_ω(x,t) is multiplied by α, the motion magnification factor, which takes the value 30 and, in practical applications, can be varied between 10 and 50 as needed; the result is added to the phase of the sub-band S_ω(x,t) to obtain the motion-magnified sub-band Ŝ_ω(x,t), expressed as:

Ŝ_ω(x,t) = A_ω e^{iω(x+(1+α)δ(t))}
The final result Ŝ_ω(x,t) is a complex sinusoid representing the motion-magnified image in the frequency domain.
3) From the motion-magnified sub-band Ŝ_ω(x,t) of step 2), the motion-magnified video sequence f(x+(1+α)δ(t)) is obtained; converting it back to the time domain gives the magnified result.
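As a rough illustration of the procedure above, the sketch below performs a heavily simplified, global-Fourier version of phase-based motion magnification on a grayscale clip: each frame is represented by its 2D Fourier coefficients, the temporal variation of each coefficient's phase is band-passed to 0.3-3 Hz, scaled by α and added back. Practical systems use multi-scale, localized filter banks rather than a single global transform, so this is not the exact filter of the disclosed module; the function name and default parameters are illustrative.

```python
import numpy as np

def magnify_motion(frames, fps=25.0, alpha=30.0, f_lo=0.3, f_hi=3.0):
    """Simplified phase-based motion magnification of a (T, H, W) grayscale clip."""
    frames = np.asarray(frames, dtype=np.float64)
    coeffs = np.fft.fft2(frames, axes=(1, 2))        # spatial Fourier coefficients per frame
    amplitude, phase = np.abs(coeffs), np.angle(coeffs)
    phase = np.unwrap(phase, axis=0)                 # smooth phase evolution over time

    # Temporal band-pass (f_lo to f_hi Hz) applied to the phase of every coefficient.
    T = frames.shape[0]
    freqs = np.fft.fftfreq(T, d=1.0 / fps)
    band = (np.abs(freqs) >= f_lo) & (np.abs(freqs) <= f_hi)
    band_phase = np.real(np.fft.ifft(np.fft.fft(phase, axis=0) * band[:, None, None], axis=0))

    # Amplify the band-passed phase (the delta(t) term) and rebuild the frames.
    magnified = amplitude * np.exp(1j * (phase + alpha * band_phase))
    return np.real(np.fft.ifft2(magnified, axes=(1, 2)))

# Example: a 2-second clip at 25 fps; with very short clips (e.g. 8 frames) the
# 0.3-3 Hz band may contain no frequency bin and the output then equals the input.
clip = np.random.rand(50, 224, 224)
out = magnify_motion(clip)
```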
A specific embodiment of the present invention shows the concrete workflow of the face liveness detection system.
The user's face video is acquired by the face video acquisition module and segmented by the face video preprocessing module in units of 8 frames to obtain liveness recognition samples. Assuming the original image stream has size 224x224x8, after the subtle facial motions are magnified by the liveness motion magnification module, it is input into the dynamic motion cue sub-module. The dynamic motion cue sub-module runs at a high frame rate (25 frames per second), uses three-dimensional convolution blocks, and concentrates on collecting dynamic cues from faces or spoofing attacks. Its channel number is set small (8 channels initially, 128 channels finally), which saves model overhead on the one hand and makes the branch specialize in temporal information extraction on the other. Notably, no temporal downsampling is performed anywhere in this branch, which preserves the motion information in the time domain to the greatest extent.
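A minimal sketch of the frame segmentation performed by the preprocessing module might look as follows; the function name and the decision to drop an incomplete trailing clip are assumptions.

```python
import numpy as np

def split_into_clips(frames, clip_len=8):
    """Split a decoded face video (array of frames) into consecutive clips of
    clip_len frames; an incomplete trailing clip is discarded."""
    frames = np.asarray(frames)
    n_clips = len(frames) // clip_len
    return frames[: n_clips * clip_len].reshape(n_clips, clip_len, *frames.shape[1:])

# Example: a 100-frame 224x224 RGB video -> 12 clips of 8 frames
clips = split_into_clips(np.zeros((100, 224, 224, 3), dtype=np.uint8))
print(clips.shape)   # (12, 8, 224, 224, 3)
```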
The computation flow in the dynamic motion cue sub-module is as follows: first, a convolution with kernel size 3x5x5 and stride (1,2,2) produces features with 8 channels; then an initialization pooling layer with kernel size 1x3x3 and stride (1,2,2) keeps the channel number at 8; the features then pass through the three convolutional block layers of the second branch, which contain 2, 3 and 2 temporal convolution blocks, respectively. The structure of each temporal convolution block is shown in Figure 4. To save computation and memory, the present invention splits the original 3x3x3 three-dimensional convolution kernel into a 1x1x1 kernel, a 3x1x1 kernel, a 1x3x3 kernel and a 1x1x1 kernel in sequence; the purpose of the 1x1x1 kernels is to enhance the fitting capacity of the model.
It is likewise assumed that the original image stream has size 224x224x8 when it is input into the static texture information sub-module. The static texture information sub-module runs at a low frame rate (6.25 frames per second) and applies a finer multi-scale convolution scheme, concentrating on extracting the static texture features that distinguish real faces from spoofing attacks. It is relatively insensitive to temporal changes, so its channel number is set larger (64 channels initially, 1024 channels finally); since the input and computation of this branch are comparatively small, the higher channel count effectively improves its ability to extract fine spatial texture details.
The computation flow in the static texture information sub-module is as follows: after frames are extracted by the preprocessing layer and input at a low rate, a 1x5x5 convolution with stride (1,2,2) first produces 64 feature channels, followed by an initialization pooling layer with kernel size 1x3x3 and stride (1,2,2), the channel number remaining 64.
The output of the initial block layer is merged and concatenated with the output of the corresponding layer of the dynamic motion cue sub-module and then input into the convolutional block layers.
The features then pass through the three convolutional block layers of the first branch, which contain 2, 3 and 2 texture convolution blocks, respectively; the structure of each texture convolution block is shown in Figure 5. To further save memory and computation, the 1x5x5 convolution in the texture convolution block is split into two cascaded 1x3x3 convolutions. This multi-scale kernel design gives the module strong information extraction ability for features at different static spatial scales.
During the computation of the two branches, the outputs of corresponding convolutional block layers are concatenated and used as the input of the next convolutional block layer of the first branch. Finally, the output of the dynamic motion cue sub-module and the output of the static texture information sub-module are merged and fed into a global pooling layer and a 1024-dimensional fully connected layer, and the classification is completed by a softmax function.
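The fusion and classification head described above can be sketched as follows; the placement of the ReLU and the exact pooling operator are assumptions, and during training the raw logits (before the softmax) would normally be fed to the cross-entropy loss.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Global pooling of both branch outputs, concatenation, a 1024-d fully
    connected layer and a softmax classifier over the two classes."""

    def __init__(self, texture_channels=1024, motion_channels=128, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(texture_channels + motion_channels, 1024)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, texture_feat, motion_feat):
        t = self.pool(texture_feat).flatten(1)          # (batch, texture_channels)
        m = self.pool(motion_feat).flatten(1)           # (batch, motion_channels)
        logits = self.classifier(torch.relu(self.fc(torch.cat([t, m], dim=1))))
        return torch.softmax(logits, dim=1)

# Example with the channel counts used in this embodiment
head = FusionHead()
probs = head(torch.randn(2, 1024, 2, 7, 7), torch.randn(2, 128, 8, 7, 7))
print(probs.shape)   # torch.Size([2, 2])
```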
After the dual-branch three-dimensional convolutional neural network model required for training has been constructed, this embodiment trains the model on the training samples and their corresponding labels by the following method and saves the resulting model file to the storage medium. All training samples are trained by mini-batch gradient descent, i.e. only one batch of 10 samples is fed into the network model at a time; the samples used for training within a batch are denoted x and their corresponding labels ŷ. After the training sample x has been recognized by the dual-branch three-dimensional convolutional neural network model, the recognition result y of the model is obtained. In this embodiment, the purpose of training is to reduce the difference between the label ŷ and the recognition result y of the model, so the cross-entropy loss function is chosen to describe the difference between ŷ and y:

L(ŷ, y) = − Σ_i Σ_{j=1}^{N} ŷ_ij log(y_ij)

where L denotes the cross-entropy loss function and N denotes the number of classes of the recognition task during training, here 2; ŷ_ij denotes the probability that the i-th sample in a batch belongs to the j-th class, and y_ij denotes the probability, output by the dual-branch three-dimensional convolutional neural network model, that the i-th sample in a batch belongs to the j-th class. In this embodiment, mini-batch gradient descent is used on the PyTorch toolkit platform: the first branch and the second branch are first trained independently for two epochs each, the two branches are then merged, and the full network model is trained for 50 epochs, after which the model file is saved to the storage medium for the liveness judgment module to perform the face liveness detection and recognition task. One epoch means that all training data have been passed through the mini-batch gradient descent method once.
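A minimal training loop consistent with the description above is sketched below; the optimizer, learning rate, dataset format and file name are assumptions, since the disclosure only specifies the batch size of 10, the cross-entropy objective and the epoch schedule.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=50, batch_size=10, lr=1e-3, device="cuda"):
    """Mini-batch training with cross-entropy; `dataset` is assumed to yield
    ((slow_clip, fast_clip), label) and `model` to return raw class logits."""
    model = model.to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()                 # combines log-softmax and NLL
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for epoch in range(epochs):
        running = 0.0
        for (slow_clip, fast_clip), labels in loader:
            slow_clip, fast_clip, labels = (slow_clip.to(device),
                                            fast_clip.to(device), labels.to(device))
            optimizer.zero_grad()
            loss = criterion(model(slow_clip, fast_clip), labels)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / len(loader):.4f}")

    # Save the model file for later loading by the liveness judgment module.
    torch.save(model.state_dict(), "dual_branch_3dcnn.pth")
```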
In an embodiment of the present application, a terminal and a storage medium are provided.
The terminal comprises a memory and a processor;
the memory is used to store a computer program;
the processor is used to realize, when executing the computer program, the functions of the aforementioned method and system based on the dual-branch three-dimensional convolutional neural network model.
It should be noted that the memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk storage. The processor is the control center of the terminal; it connects the various parts of the terminal through various interfaces and lines, and invokes the data in the memory by executing the computer program stored in the memory, so as to perform the functions of the terminal. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Of course, the terminal should also contain the components necessary for running the program, such as a power supply, a communication bus and so on.
Exemplarily, the computer program may be divided into multiple modules, each of which is stored in the memory; each divided module can complete a computer program instruction segment with a specific function, the instruction segment being used to describe the execution process of the computer program. For example, the computer program can be divided into the following modules:
Face video acquisition module: used to acquire the user's face video;
Face video preprocessing module: reads the acquired face video and segments it in units of n frames to obtain liveness recognition samples;
Liveness labeling module: used to label training samples known to be live or non-live; the liveness labeling module is enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
Liveness motion magnification module: according to the operating mode of the detection system, performs motion magnification on labeled training samples or unlabeled samples to be detected, obtaining motion-magnified liveness recognition samples;
Dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module; the static texture information sub-module and the dynamic motion cue sub-module are the two branches of the three-dimensional convolution model, their outputs are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
Liveness judgment module: when the detection system is in recognition mode, loads the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
The programs in all the above modules are processed by the processor when executed.
In addition, the logic instructions in the above memory may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. As a computer-readable storage medium, the memory may be configured to store software programs and computer-executable programs, such as the program instructions or modules corresponding to the system in the embodiments of the present disclosure. The processor executes functional applications and data processing by running the software programs, instructions or modules stored in the memory, thereby realizing the functions of the foregoing embodiments. The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or any other medium that can store program code, and may also be a transitory storage medium. The specific process by which the instructions in the storage medium and the terminal are loaded and executed by the processor has been described in detail above.
Example
This example demonstrates a specific implementation effect. The face video acquisition module, face video preprocessing module, liveness labeling module, liveness motion magnification module, dual-branch three-dimensional convolution model training module and liveness judgment module of this example all adopt the structures and functions described above, which are not repeated here.
The implementation process is as follows.
It comprises a configuration process and a recognition process. First, the system is set to configuration mode: the face video is acquired by the face video acquisition module, the face video preprocessing module segments it to obtain liveness recognition samples, which are labeled by the liveness labeling module and then processed by the liveness motion magnification module; finally, the dual-branch three-dimensional convolution model training module trains the three-dimensional convolution model on the training sample set and saves it as a model file.
After configuration, the system is set to recognition mode: the face video is first acquired by the face video acquisition module, then the face video preprocessing module segments it to obtain the liveness recognition samples to be detected; finally, the liveness judgment module directly loads the trained model file, feeds the samples to be detected and the motion-magnified samples to be detected into the model as its two inputs, and obtains the recognition result.
A total of six test tasks were carried out in this example, covering Protocol 1, Protocol 2, Protocol 3 and Protocol 4 of the OULU-NPU database, as well as cross-tests between the CASIA-FASD database and the Replay-Attack database in both directions. The most difficult of these are the cross-tests between CASIA-FASD and Replay-Attack, because they pose a great challenge to the generalization ability of the model and its robustness under unknown lighting, background and device conditions.
For the four protocol tests on the OULU-NPU database, the present invention follows the evaluation metrics of the original protocols: the Attack Presentation Classification Error Rate (APCER), which evaluates the highest misclassification rate among all attack types; the Bona Fide Presentation Classification Error Rate (BPCER), which evaluates the misclassification rate of genuine live samples; and the Average Classification Error Rate (ACER), which is the mean of the attack classification error rate and the genuine face classification error rate:
ACER = (APCER + BPCER) / 2
For the tests on the CASIA database and the Replay-Attack database, the present invention follows the test standard of the original databases and uses the Half Total Error Rate (HTER) as the metric, defined as half of the sum of the False Rejection Rate (FRR) and the False Acceptance Rate (FAR):
HTER = (FRR + FAR) / 2
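The evaluation metrics above can be computed as in the following sketch; the decision threshold of 0.5 and the function names are illustrative, and in the original protocols the operating threshold is typically chosen on a development set.

```python
import numpy as np

def apcer_bpcer_acer(scores, labels, attack_types, threshold=0.5):
    """scores: predicted liveness probabilities; labels: 1 = bona fide, 0 = attack;
    attack_types: identifier of the attack instrument for each sample."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    attack_types = np.asarray(attack_types)
    pred_live = scores >= threshold

    # APCER: worst rate, over attack types, of attacks accepted as live
    apcer = max(np.mean(pred_live[(labels == 0) & (attack_types == t)])
                for t in np.unique(attack_types[labels == 0]))
    # BPCER: rate of bona fide samples rejected as attacks
    bpcer = np.mean(~pred_live[labels == 1])
    return apcer, bpcer, (apcer + bpcer) / 2.0

def hter(scores, labels, threshold=0.5):
    """Half Total Error Rate: mean of false rejection and false acceptance rates."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_live = scores >= threshold
    frr = np.mean(~pred_live[labels == 1])    # live samples rejected
    far = np.mean(pred_live[labels == 0])     # attack samples accepted
    return (frr + far) / 2.0
```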
To ensure fairness, all training and testing were performed on the PyTorch benchmark platform with a GeForce RTX 2080 Ti NVIDIA GPU, and all training and testing rules were identical. The compared models include the Local Binary Pattern method among traditional approaches, the Long Short-Term Memory network among recurrent neural networks, and a two-dimensional convolutional neural network among convolutional networks. The results are shown in Table 1 and Table 2.
Table 1. Performance of each model under the different protocols of the OULU-NPU database
[Table 1 is provided as an image in the original publication.]
Table 2. Cross-database tests of each model between the CASIA and Replay-Attack databases
[Table 2 is provided as an image in the original publication.]
It can be seen that, across all the data tests under the four test protocols of the OULU-NPU database, the dual-branch three-dimensional convolution model of the present invention holds an absolute performance advantage when compared with an ordinary two-dimensional convolutional neural network model and with traditional texture feature models. Since OULU-NPU is a database close to the real mobile phone scenarios currently in use, this also demonstrates that the proposed model can effectively defend against various non-live spoofing attacks in scenarios close to real mobile payment, giving it strong practical value. It holds a very large advantage over traditional methods as well as over two-dimensional convolutional and recurrent neural networks. In the more challenging cross-dataset tests, the robustness and superiority of the model's performance are also well demonstrated. This shows that the structure of the model of the present invention is effective and advanced.

Although OULU-NPU is a test dataset designed with cross-scene and cross-device generalization in mind, objective conditions such as the limitations of the shooting scenes and lighting, the fixed photography habits of the same group of camera operators and the relatively uniform attack habits of the same group of attackers mean that many similarities remain within the database during testing, so it cannot fully approximate real, complex application scenarios. The model is therefore cross-tested across datasets on the CASIA-FASD database and the Idiap Replay-Attack database, subjecting it to a more challenging generalization test that is much closer to real scenarios. The comparison models cover many different kinds of experimental models, including some traditional texture extraction algorithms as well as CNN and RNN temporal models from deep learning.

The cross-test between the CASIA and Replay-Attack datasets is the most demanding test of model generalization, because the two datasets differ greatly in acquisition devices, subject identities, acquisition environments and the shooting habits of the collectors, which closely matches real-world detection scenarios. As can be seen from the table, where the Half Total Error Rate (HTER) is used as the performance metric, the proposed dual-branch three-dimensional convolution model shows superior performance in a comprehensive comparison with various models, including traditional texture feature extraction models and the CNN and LSTM temporal models of deep learning.

Compared both with traditional hand-crafted feature methods and with the most advanced complex deep learning networks, the proposed dual-branch three-dimensional convolution model exhibits generalization performance far beyond that of the other models, and this holds in both directions of the cross-dataset tests. The proposed dual-branch three-dimensional convolution model is therefore more robust, which also indicates that its performance in real-scenario tests will be even better.

The above are only specific embodiments of the present invention. Obviously, the present invention is not limited to the above embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can directly derive or conceive from the disclosure of the present invention shall be considered to fall within the protection scope of the present invention.

Claims (10)

  1. A face liveness detection system using a dual-branch three-dimensional convolution model, characterized in that it comprises:
    a face video acquisition module: used to acquire the user's face video;
    a face video preprocessing module: reading the acquired face video and segmenting it in units of n frames to obtain liveness recognition samples;
    a liveness labeling module: used to label training samples known to be live or non-live, the liveness labeling module being enabled when the detection system is in training mode and disabled when the detection system is in recognition mode;
    a liveness motion magnification module: according to the operating mode of the detection system, performing liveness motion information magnification on labeled training samples or unlabeled samples to be detected to obtain motion-magnified liveness recognition samples;
    a dual-branch three-dimensional convolution model training module: configured with a dual-branch three-dimensional convolution model comprising a static texture information sub-module, a dynamic motion cue sub-module, a fusion sub-module and a classification sub-module;
    when the detection system is in training mode, the output of the liveness motion magnification module serves as the input of the dynamic motion cue sub-module and the output of the face video preprocessing module serves as the input of the static texture information sub-module; the outputs of the static texture information sub-module and the dynamic motion cue sub-module are pooled, aggregated and fused by the fusion sub-module, and the classification sub-module then outputs the detection result;
    the static texture information sub-module comprises an input layer, a preprocessing frame-extraction layer with a temporal stride of k and a spatial stride of 1*1, an initial block layer with 8m channels, and p convolutional block layers; the dynamic motion cue sub-module comprises an input layer, an initial block layer with m channels, and p convolutional block layers;
    the output of the initial block layer of the static texture information sub-module is merged with the output of the initial block layer of the dynamic motion cue sub-module and serves as the input of the first convolutional block layer of the static texture information sub-module; the output of the i-th convolutional block layer of the static texture information sub-module is merged with the output of the i-th convolutional block layer of the dynamic motion cue sub-module and serves as the input of the (i+1)-th convolutional block layer of the static texture information sub-module; the convolutional block layers of the static texture information sub-module and of the dynamic motion cue sub-module each comprise several convolution sub-modules consisting of multiple convolution layers, the number of output convolution channels of each convolution sub-module being greater than its number of input convolution channels; m, p and k are integers greater than 0, 2≤k≤5, and 1≤i≤p-1;
    a liveness judgment module: when the detection system is in recognition mode, used to load the model file output by the dual-branch three-dimensional convolution model training module to obtain the trained dual-branch three-dimensional convolution model; the unlabeled liveness recognition samples to be detected output by the face video preprocessing module serve as the input of the static texture information sub-module, the motion-magnified unlabeled liveness recognition samples to be detected output by the liveness motion magnification module serve as the input of the dynamic motion cue sub-module, and the recognition result is output.
  2. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the liveness motion magnification module specifically performs:
    2.1) by Fourier series decomposition, the face image f(x+δ(t)) in each frame is decomposed into a sum of sinusoidal functions:
    f(x+δ(t)) = Σ_ω A_ω e^{iω(x+δ(t))}
    where f(x+δ(t)) denotes the live face sample image in the time domain, i.e. the initial image is I(x,0) = f(x); δ(t) is the motion information function of the face; A_ω is the amplitude of the signal after transformation to the frequency domain; i denotes the imaginary unit of the complex frequency domain; each individual frequency ω corresponds to one band, and the band at a particular frequency ω is a complex sinusoidal signal:
    S_ω(x,t) = A_ω e^{iω(x+δ(t))}
    where the frequency range of subtle facial motion, ω, is set to 0.3-3 Hz so as to extract the subtle motion of the face; S_ω is a sinusoid whose phase ω(x+δ(t)) contains the motion information of the original image; the amplitude of the motion is adjusted by adjusting the phase;
    2.2) ω(x+δ(t)) in the above formula is filtered by a DC complementary filter to obtain the band-pass phase, expressed as:
    B_ω(x,t) = ω δ(t)
    the band-pass phase B_ω(x,t) is multiplied by α, where α is the motion magnification factor, and the result is added to the phase of the sub-band S_ω(x,t), thereby obtaining the motion-magnified sub-band Ŝ_ω(x,t), expressed as:
    Ŝ_ω(x,t) = A_ω e^{iω(x+(1+α)δ(t))}
    where Ŝ_ω(x,t) is a complex sinusoid whose motion is exactly (1+α) times that of the input sinusoid;
    2.3) from the motion-magnified sub-band Ŝ_ω(x,t) of step 2.2), the motion-magnified video sequence f(x+(1+α)δ(t)) is obtained; converting it back to the time domain gives the magnified result.
  3. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the initial block layer comprises an initialization convolution layer and an initialization pooling layer; the kernel of the initialization convolution layer of the static texture information sub-module is 1*5*5 and the kernel of its initialization pooling layer is 1*3*3; the kernel of the initialization convolution layer of the dynamic motion cue sub-module is 3*5*5 and the kernel of its initialization pooling layer is 1*3*3.
  4. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 3, characterized in that the static texture information sub-module comprises three convolutional block layers containing 2, 3 and 2 texture convolution sub-modules, respectively; each texture convolution sub-module consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with 1*1*1, 1*3*3 and 1*5*5 kernels, and a third convolution layer with a 1*1*1 kernel; in the texture convolution sub-modules of the first convolutional block layer, the channel numbers of the three convolution layers are 8m, 8m and 32m, respectively, and the channel numbers of the three convolution layers in the texture convolution sub-modules of each subsequent convolutional block layer are twice those of the previous convolutional block layer.
  5. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 4, characterized in that the 1*5*5 kernel of the second convolution layer in the convolutional block layers of the static texture information sub-module is split into two cascaded 1*3*3 kernels.
  6. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that the dynamic motion cue sub-module comprises three convolutional block layers containing 2, 3 and 2 temporal convolution sub-modules, respectively; each temporal convolution sub-module consists of a first convolution layer with a 1*1*1 kernel, a second convolution layer with a 3*1*1 kernel, a third convolution layer with a 1*3*3 kernel and a fourth convolution layer with a 1*1*1 kernel; in the temporal convolution sub-modules of the first convolutional block layer, the channel numbers of the four convolution layers are m, m, m and 4m, respectively, and the channel numbers of the four convolution layers in the temporal convolution sub-modules of each subsequent convolutional block layer are twice those of the previous convolutional block layer.
  7. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that m is 8.
  8. The face liveness detection system using a dual-branch three-dimensional convolution model according to claim 1, characterized in that p is 3 and k is 4.
  9. A terminal, characterized in that it comprises a memory and a processor;
    the memory is used to store a computer program;
    the processor is used to implement, when executing the computer program, the face liveness detection system using a dual-branch three-dimensional convolution model according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the face liveness detection system using a dual-branch three-dimensional convolution model according to any one of claims 1 to 8 is realized.
PCT/CN2020/116644 2020-06-12 2020-09-22 Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium WO2021248733A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010534822.3 2020-06-12
CN202010534822.3A CN111814574B (en) 2020-06-12 2020-06-12 Face living body detection system, terminal and storage medium applying double-branch three-dimensional convolution model

Publications (1)

Publication Number Publication Date
WO2021248733A1 true WO2021248733A1 (en) 2021-12-16

Family

ID=72846094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116644 WO2021248733A1 (en) 2020-06-12 2020-09-22 Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN111814574B (en)
WO (1) WO2021248733A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836194B (en) * 2021-01-29 2023-03-21 西安交通大学 Identity authentication method and system based on internal biological characteristics of human hand
CN113158773B (en) * 2021-03-05 2024-03-22 普联技术有限公司 Training method and training device for living body detection model
CN113312965B (en) * 2021-04-14 2023-04-28 重庆邮电大学 Face unknown spoofing attack living body detection method and system
CN113422982B (en) * 2021-08-23 2021-12-14 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113792804B (en) * 2021-09-16 2023-11-21 北京百度网讯科技有限公司 Training method of image recognition model, image recognition method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202105B1 (en) * 2012-01-13 2015-12-01 Amazon Technologies, Inc. Image analysis for user authentication
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
CN109886244A (en) * 2019-03-01 2019-06-14 北京视甄智能科技有限公司 A kind of recognition of face biopsy method and device
CN110516576A (en) * 2019-08-20 2019-11-29 西安电子科技大学 Near-infrared living body faces recognition methods based on deep neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095124B (en) * 2017-06-07 2024-02-06 创新先进技术有限公司 Face living body detection method and device and electronic equipment
CN109711243B (en) * 2018-11-01 2021-02-09 长沙小钴科技有限公司 Static three-dimensional face in-vivo detection method based on deep learning
CN110414350A (en) * 2019-06-26 2019-11-05 浙江大学 The face false-proof detection method of two-way convolutional neural networks based on attention model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202105B1 (en) * 2012-01-13 2015-12-01 Amazon Technologies, Inc. Image analysis for user authentication
CN109344716A (en) * 2018-08-31 2019-02-15 深圳前海达闼云端智能科技有限公司 Training method, detection method, device, medium and equipment of living body detection model
CN109886244A (en) * 2019-03-01 2019-06-14 北京视甄智能科技有限公司 A kind of recognition of face biopsy method and device
CN110516576A (en) * 2019-08-20 2019-11-29 西安电子科技大学 Near-infrared living body faces recognition methods based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bharadwaj, Samarth; Dhamecha, Tejas I.; Vatsa, Mayank; Singh, Richa: "Computationally Efficient Face Spoofing Detection with Motion Magnification", 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 23 June 2013 (2013-06-23), pages 105-110, XP032479959, DOI: 10.1109/CVPRW.2013.23 *
Chen, Haonan; Hu, Guosheng; Lei, Zhen; Chen, Yaowu; Robertson, Neil M.; Li, Stan Z.: "Attention-Based Two-Stream Convolutional Networks for Face Spoofing Detection", IEEE Transactions on Information Forensics and Security, IEEE, USA, vol. 15, 2020, pages 578-593, XP011747040, ISSN: 1556-6013, DOI: 10.1109/TIFS.2019.2922241 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626024A (en) * 2022-05-12 2022-06-14 北京吉道尔科技有限公司 Internet infringement video low-consumption detection method and system based on block chain
CN115082734A (en) * 2022-06-23 2022-09-20 中南大学 Aluminum electrolysis cell fire eye video inspection system and superheat degree deep learning identification method
CN115082734B (en) * 2022-06-23 2023-01-31 中南大学 Aluminum electrolysis cell fire eye video inspection system and superheat degree deep learning identification method
CN115410048A (en) * 2022-09-29 2022-11-29 昆仑芯(北京)科技有限公司 Training method, device, equipment and medium of image classification model and image classification method, device and equipment
CN115410048B (en) * 2022-09-29 2024-03-19 昆仑芯(北京)科技有限公司 Training of image classification model, image classification method, device, equipment and medium
CN115578771A (en) * 2022-10-24 2023-01-06 智慧眼科技股份有限公司 Living body detection method, living body detection device, computer equipment and storage medium
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system
CN117095447A (en) * 2023-10-18 2023-11-21 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium
CN117095447B (en) * 2023-10-18 2024-01-12 杭州宇泛智能科技有限公司 Cross-domain face recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111814574B (en) 2023-09-15
CN111814574A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2021248733A1 (en) Live face detection system applying two-branch three-dimensional convolutional model, terminal and storage medium
Vu et al. Masked face recognition with convolutional neural networks and local binary patterns
CN102938065B (en) Face feature extraction method and face identification method based on large-scale image data
WO2015149534A1 (en) Gabor binary pattern-based face recognition method and device
CN107633207A (en) AU characteristic recognition methods, device and storage medium
Anand et al. An improved local binary patterns histograms techniques for face recognition for real time application
Qiang et al. SqueezeNet and fusion network-based accurate fast fully convolutional network for hand detection and gesture recognition
KR20130048076A (en) Face recognition apparatus and control method for the same
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
CN104766062A (en) Face recognition system and register and recognition method based on lightweight class intelligent terminal
CN103646255A (en) Face detection method based on Gabor characteristics and extreme learning machine
HN et al. Human Facial Expression Recognition from static images using shape and appearance feature
Yingxin et al. A robust hand gesture recognition method via convolutional neural network
Shao et al. PalmGAN for cross-domain palmprint recognition
JP7141518B2 (en) Finger vein matching method, device, computer equipment, and storage medium
Linda et al. Color-mapped contour gait image for cross-view gait recognition using deep convolutional neural network
CN111126250A (en) Pedestrian re-identification method and device based on PTGAN
Wang et al. A finger-vein image quality assessment algorithm combined with improved SMOTE and convolutional neural network
Kejun et al. Automatic nipple detection using cascaded adaboost classifier
Chuang et al. Hand posture recognition and tracking based on bag-of-words for human robot interaction
Yuan et al. Real-time ear detection based on embedded systems
CN112541576B (en) Biological living body identification neural network construction method of RGB monocular image
CN101739571A (en) Block principal component analysis-based device for confirming face
Xu et al. Facial analysis with a Lie group kernel
Chen et al. A classroom student counting system based on improved context-based face detector

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20939721; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20939721; Country of ref document: EP; Kind code of ref document: A1)