CN111259792B - DWT-LBP-DCT feature-based human face living body detection method - Google Patents

Publication number: CN111259792B (other version: CN111259792A)
Application number: CN202010042880.4A
Authority: CN (China)
Prior art keywords: dwt, lbp, dct, video frame, face
Original language: Chinese (zh)
Inventors: 项世军, 章琬苓
Original and current assignee: Jinan University
Application filed by Jinan University; application granted; legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data › G06V 40/10 Human or animal bodies; body parts › G06V 40/16 Human faces, e.g. facial parts, sketches or expressions › G06V 40/172 Classification, e.g. identification
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/20 Analysing › G06F 18/21 Design or setup of recognition systems or techniques › G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V 20/00 Scenes; scene-specific elements › G06V 20/40 Scenes in video content › G06V 20/48 Matching video sequences
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data › G06V 40/40 Spoof detection, e.g. liveness detection › G06V 40/45 Detection of the body part being alive

Abstract

The invention discloses a face living body detection method based on DWT-LBP-DCT features, which comprises the following steps: acquire the original video frame sequence to be processed; perform face positioning on the original video frame sequence, enlarge the recognized face area, determine the region of interest, and normalize the region of interest; perform a multi-level two-dimensional discrete wavelet transform and block partitioning to obtain the DWT features, capturing the frequency information of the video frames; perform the equivalent (uniform) local binary pattern transform on the DWT features to obtain the DWT-LBP features, capturing the texture information of the video frames; perform the discrete cosine transform on the DWT-LBP features along the temporal (column) direction to obtain the DWT-LBP-DCT features, capturing the time-domain information of the video frames; train and classify the DWT-LBP-DCT features with a machine learning classifier to obtain the detection result. The invention effectively resists attacks with strong generalization ability, and improves the security, reliability, accuracy and effectiveness of face living body detection.

Description

DWT-LBP-DCT feature-based human face living body detection method
Technical Field
The invention relates to the field of face recognition technology, and in particular to a face living body detection method based on DWT-LBP-DCT features.
Background
At present, face recognition technology matures day by day, and biometric authentication systems are applied on many occasions: from large confidential access-control systems down to the login systems of mobile terminals, and even mobile payment systems, traces of face recognition technology can be seen everywhere.
The face living body algorithm is an important component of the face recognition algorithm. Merely recognizing the face not only leaves the authentication system vulnerable to attacks by illegal users, but also puts the network system under great threat, offering an exploitable opening to attackers who carry out illegal actions against it. Face living body detection adds a new checkpoint to the face recognition system: while recognizing the face, it judges whether the target is a living body. The recognition result is accepted as true and valid only when the face is judged to be a living body; otherwise, the attempt is judged an illegal attack on the authentication system. Under this double safeguard, the security and reliability of the system are improved; therefore, improving the anti-spoofing capability of the face recognition system, and in particular the effectiveness of the face living body detection algorithm, is an urgent problem to be solved in face authentication.
Generally, face spoofing can be divided into three categories: photo attacks, video attacks and mask attacks. A photo attack means that an illegal user holds a photo of a legitimate user, printed on paper or displayed on an electronic device, and presents it to the camera of the verification system. A video attack means that an illegal user replays a video of a legitimate user, attacking the face recognition system with dynamic information. A mask attack means that an illegal user wears a 3D mask of the original user to simulate the stereoscopic structure of a real face.
Against these attacks, the prior art mainly extracts features with the following five types of methods to realize detection: (1) detection methods based on static texture features; (2) detection methods based on motion features; (3) detection methods based on frequency; (4) detection methods based on image color; (5) detection methods based on deep learning.
The prior art has the following problems. When face living body detection is performed using a single type of feature, the recognition effect on a single data set is good, but the method does not generalize to multiple data sets, limiting the application scenarios. When a neural network is used for feature extraction, multiple data sets must be combined as training samples, the network parameters can only be determined through repeated tuning, the amount of computation is large, the time cost is high, and the interpretability is poor. Therefore, an effective and generalizable method is needed to counter attacks and improve the security and reliability of the face living body detection system.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a face living body detection method based on DWT-LBP-DCT characteristics.
The aim of the invention is achieved by the following technical scheme:
the human face living body detection method based on the DWT-LBP-DCT characteristics is characterized by comprising the following steps:
s1, acquiring an original video frame sequence to be processed;
s2, carrying out face positioning on an original video frame sequence, expanding a face recognition area, determining an interest area, and carrying out normalization processing on the interest area;
s3, performing multistage two-dimensional discrete wavelet transform and block division processing on the normalized region of interest to obtain DWT characteristics and obtain frequency information of a video frame;
s4, performing equivalent local binary pattern transformation processing on the DWT characteristics to obtain DWT-LBP characteristics, and acquiring frequency information and texture information of the video frame;
s5, performing discrete cosine transform on the DWT-LBP characteristic longitudinally to obtain the DWT-LBP-DCT characteristic, and obtaining frequency information, texture information and time domain information of the video frame;
and S6, training and classifying the DWT-LBP-DCT characteristics by using a machine learning classifier to obtain a detection result so as to judge whether the video to be detected is a non-living attack.
Further, step S1 specifically comprises: extracting a few frames to represent the whole video; when an original video in the database has M frames of images, F frames are extracted at a time interval I, calculated as follows:
F = ⌈M / (I + 1)⌉
and further obtaining an original video frame sequence to be processed.
Further, the step S2 specifically includes the following steps:
s201, locating the face position of each frame of image in an original video frame sequence through a face classifier, and taking the recognized face position coordinates as a base number;
s202, amplifying coordinates according to a set of scale factors by taking face position coordinates as a base, and if the amplified coordinates are beyond the coordinate range of an original frame, replacing the amplified coordinates with boundary points corresponding to the original frame to obtain scale coordinates;
s203, re-determining a target area, namely an interesting area, in the original video frame according to the proportional coordinate point, unifying the image resolutions of all re-determined target areas to be target values, and if the resolution is insufficient or larger than the target value, performing bilinear interpolation processing to obtain a target image, wherein the target image is a color RGB image.
Further, the step S3 specifically includes the following steps:
s301, performing D-level two-dimensional DWT processing on a target area, and separating out frequency components, wherein the frequency components comprise smooth approximation components LL D Horizontal component HL X Vertical component LH X Diagonal component HH X Wherein X ranges from 1 to D;
s302, recording the resolution of each frequency component, wherein the resolution of each frequency component is smoothly approximated to the component LL D The resolution of (2) is the smallest, record LL corresponding to D D Resolution of (2);
s303, horizontal component HL to be diced X Vertical component LH X Diagonal component HH X Performing dicing processing to obtain co-smooth approximation component LL D And carrying out the operations on all the target images extracted by different scale factors to obtain the frequency information of the video frame.
Further, the frequency components to be partitioned into blocks have X ranging from 1 to D-1.
Further, step S4 specifically comprises: calculating the equivalent LBP histograms of all frequency blocks, and connecting the LBP histograms of the target images of each frame horizontally in time order to obtain the DWT-LBP features.
Further, the step S5 specifically includes:
s501, connecting DWT-LBP features of the same video longitudinally to form a DWT-LBP video feature matrix;
s502, longitudinally applying DCT operation on a DWT-LBP video feature matrix, extracting energy information reflected by each block in the time transformation process, and obtaining DWT-LBP-DCT features, wherein the calculation of one-dimensional DCT is as follows:
X_k = c(k) · Σ_{n=0}^{N−1} x_n · cos[(2n + 1)kπ / (2N)], 0 ≤ k ≤ N−1

c(k) = √(1/N) for k = 0; c(k) = √(2/N) for 1 ≤ k ≤ N−1

where x_n (0 ≤ n ≤ N−1) is the input data of length N, and X_k is the output data;
further, the step S6 specifically includes the following steps: according to DWT-LBP-DCT characteristics, a training set in a database is used for establishing a corresponding SVM classifier, parameters required in the test are acquired according to a verification set, the test set is sent into the trained SVM classifier, a detection result is obtained, and true and false video classification is achieved.
Further, the parameters required in testing include the false acceptance rate FAR and the false rejection rate FRR; the verification set is used to acquire the classifier threshold T at which the false acceptance rate FAR and the false rejection rate FRR are equal, and the half total error rate HTER and the accuracy are computed according to the classifier threshold T, where the half total error rate HTER is half of the sum of the false acceptance rate FAR and the false rejection rate FRR.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention utilizes the characteristics of DWT for presenting different frequencies in a multi-resolution mode to extract the frequency information of video frames; the characteristic that LBP can identify image textures is utilized, the difference between DWT blocks is further amplified, and the texture information of video frames is extracted; the method utilizes the characteristic of DCT energy concentration, cascades multiple frames in the same video, extracts the energy information of the multiple frames of the video, improves the accuracy of video classification, can also use the same algorithm parameters to obtain better results on multiple data sets, has stronger generalization capability and has good practical application prospect.
Drawings
FIG. 1 is a schematic flow chart of a face living body detection algorithm based on DWT-LBP-DCT characteristics of the invention;
FIG. 2 is a schematic diagram of a 2-level two-dimensional Discrete Wavelet Transform (DWT) and corresponding dicing operation in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart illustrating the application of Discrete Cosine Transform (DCT) on DWT-LBP features in accordance with an embodiment of the present invention;
FIG. 4 is a graph of the REPLAY-ATTACK database ROC in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples:
the human face living body detection method based on the DWT-LBP-DCT characteristic is shown in figure 1, and comprises the following steps:
s1, acquiring an original video frame sequence to be processed;
s2, carrying out face positioning on an original video frame sequence, expanding a face recognition area, determining an interest area, and carrying out normalization processing on the interest area;
s3, performing multistage two-dimensional discrete wavelet transform and block division processing on the normalized region of interest to obtain DWT characteristics and obtain frequency information of a video frame;
s4, performing equivalent local binary pattern transformation processing on the DWT characteristics to obtain DWT-LBP characteristics, and acquiring frequency information and texture information of the video frame;
s5, performing discrete cosine transform on the DWT-LBP characteristic longitudinally to obtain the DWT-LBP-DCT characteristic, and obtaining frequency information, texture information and time domain information of the video frame;
and S6, training and classifying the DWT-LBP-DCT characteristics by using a machine learning classifier to obtain a detection result so as to judge whether the video to be detected is a non-living attack.
In the embodiment of the invention, the REPLAY-ATTACK public database (RE) is used as experimental data. The database consists of 1200 video clips of 50 subjects, divided into a training set of 360 clips, a verification set of 360 clips and a test set of 480 clips; each subject has 24 clips, comprising 4 genuine accesses and 20 spoofing attacks, and each clip lasts more than 9 s. The videos were recorded in 3 different scenes under 2 different lighting conditions. During algorithm evaluation, the parameters of the SVM classifier are first tuned using the training set; then the classifier threshold τ at which the false acceptance rate (FAR) equals the false rejection rate (FRR) is obtained on the verification set, and the half total error rate (HTER) and the accuracy of the algorithm on the test set are computed at this threshold τ, as shown in the following formula; the HTER is half of the sum of the FAR and the FRR on the test set:
HTER = (FAR + FRR) / 2
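As a sketch of this evaluation protocol, the snippet below computes FAR, FRR and HTER from classifier scores at a given threshold. The function names and the convention that higher scores mean "genuine" are illustrative assumptions, not part of the patent.

```python
def far_frr(attack_scores, real_scores, threshold):
    """FAR: fraction of attack samples accepted as genuine.
    FRR: fraction of genuine samples rejected.
    Assumed convention: a score >= threshold is classified as genuine."""
    far = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    frr = sum(s < threshold for s in real_scores) / len(real_scores)
    return far, frr

def hter(attack_scores, real_scores, threshold):
    """Half total error rate: HTER = (FAR + FRR) / 2."""
    far, frr = far_frr(attack_scores, real_scores, threshold)
    return (far + frr) / 2.0
```

On the verification set one would sweep the threshold until FAR and FRR are equal to obtain τ, then report the HTER of the test set at that fixed τ.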
further, the performance of the model and the quality of the human face living body detection and identification effect are judged according to the method, and the specific implementation steps are as follows:
video frame extraction:
in order to improve the detection efficiency, the invention uses only a few frames in the input video for detection. Specifically, 1 frame of image is extracted at regular time intervals I for each input video. For example, if i=1, then an odd frame of video is extracted. Assuming that the target video to be detected consists of M frames of images, F frames of images are obtained through the video frame extraction step, and the method is calculated as follows:
F = ⌈M / (I + 1)⌉
and further obtaining an original video frame sequence to be processed.
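A minimal sketch of this sub-sampling step, assuming 0-based frame indices and a stride of I + 1 frames (so that I = 1 keeps every other frame), consistent with F = ⌈M/(I+1)⌉; the function names are illustrative:

```python
import math

def extract_frame_indices(M, I):
    """Indices of the frames kept from an M-frame video at time interval I:
    every (I + 1)-th frame, starting from frame 0."""
    return list(range(0, M, I + 1))

def num_extracted(M, I):
    """Number of frames kept: F = ceil(M / (I + 1))."""
    return math.ceil(M / (I + 1))
```

With I = 6, as used later in the experiments, a 100-frame video would contribute 15 frames.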
Target identification and image normalization:
the face was identified here using the deep learning model face_recognment in the c++ open source library dlib, which was tested using the Labeled Faces in the Wild face dataset, with an accuracy of up to 99.38%. Compared with the OpenCV which is widely used as well, the face_recognition can more accurately distinguish the face area from the background area, and more accurately position the actual range of the face. After the position coordinates of the face are obtained by using the method, the coordinate point at the moment is recorded, the coordinate point is taken as a base number, the identification range is enlarged to {1.0,1.4,1.8,2.2,2.6} times of the base number, the length and width of the rectangular region of interest are respectively enlarged by corresponding times (S), and more pixel information from the original image frame is merged. Previous work has shown that properly enlarging the face range is beneficial to improving the accuracy of human face living detection, and setting the maximum magnification of the coordinates to 2.6 (s=2.6) prevents the addition of excessive background areas in the image, thereby reducing the recognition rate. At the same time, the problem with scaling up to a certain multiple is that the current region will exceed the native region size of the original video frame. For such problems, the boundary of the video frame is selected as the boundary threshold after enlargement, without filling the area without pixel values with a fixed value. Thus, the steps of target identification and region of interest determination are completed.
In addition, selecting different magnification factors yields images with different numbers of pixels, so to allow batch processing in the subsequent steps, all images are normalized to 128×128; all target images are color images.
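The enlargement-and-clamping of the region of interest can be sketched as follows. The centre-anchored box convention and the function name are assumptions for illustration; the patent only specifies that out-of-range coordinates are replaced by the frame boundary.

```python
def enlarge_and_clamp(box, S, frame_w, frame_h):
    """Enlarge a face box (left, top, right, bottom) by factor S about its
    centre, then clamp to the frame so the ROI never leaves the image."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    half_w = (right - left) * S / 2.0
    half_h = (bottom - top) * S / 2.0
    return (max(0, int(round(cx - half_w))),
            max(0, int(round(cy - half_h))),
            min(frame_w, int(round(cx + half_w))),
            min(frame_h, int(round(cy + half_h))))
```

After cropping with these coordinates, the ROI would then be resized (e.g. bilinearly) to the normalized 128×128 target image.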
Multistage two-dimensional DWT and block partitioning operations:
based on the characteristic that the two-dimensional DWT can decompose the image into different frequency components and the different components can all present partial textures of the original image, the invention separates the image into a smooth approximation component LL through D-level DWT operation of the preprocessed target image, as shown in figure 2 D Horizontal component HL X Vertical component LH X Diagonal component HH X Where X ranges from 1 to D and preserves image texture in multi-resolution at different components. Among these components, the smooth approximation component retains most of the low frequency components, the horizontal and vertical components retain some of the low and high frequency components, and the diagonal component retains all of the high frequency components. Since the high frequency components mostly present details, it is not suitable to continue the subdivision by wavelet packet decomposition transformation, but at the same time, the number of resolution points for all component blocks is required to be the same, so after the D-level DWT operation, a larger block HL X 、LH X And HH X Wherein X ranges from 1 to D-1, i.e., the range of the dicing process required is X from 1 to D-1, and the average dicing process is performed. Taking d=2 as an example, the results are shown in fig. 2. When d=1, 4 DWT component blocks can be obtained; when d=2, 7 DWT component blocks can be obtained; when d=3, 10 DWT component blocks can be obtained; when (when)D=4, 13 DWT component blocks can be obtained. After DWT operation, the block processing is performed, and since the normalized image is 128×128, when d=2, 16 blocks of cut blocks with a resolution of 32×32 can be obtained; when d=3, 64 cut blocks with a resolution of 16×16 can be obtained; when d=4, 256 blocks of 8×8 resolution cut blocks, etc. can be obtained, and so on, 4 can be obtained D And cutting blocks. 
Through a large number of experiments, the minimum block size is set to 8×8: an excessively small block would degrade the corresponding block statistics, while an 8×8 block also matches the conventional practice in image processing.
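As an illustrative sketch of this step, the snippet below implements a D-level 2-D Haar-style DWT (one possible wavelet; the patent does not fix the wavelet family, and the simple 2×2 average/difference scaling used here differs from the orthonormal convention) and partitions every subband into blocks matching the resolution of LL_D:

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar-style DWT: returns (LL, HL, LH, HH),
    each of half the input resolution."""
    a = img[0::2, :] + img[1::2, :]          # row-pair sums  (vertical low-pass)
    d = img[0::2, :] - img[1::2, :]          # row-pair diffs (vertical high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 4.0     # 2x2 average
    HL = (a[:, 0::2] - a[:, 1::2]) / 4.0     # horizontal detail
    LH = (d[:, 0::2] + d[:, 1::2]) / 4.0     # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 4.0     # diagonal detail
    return LL, HL, LH, HH

def dwt_blocks(img, D):
    """D-level DWT, then partition every subband into blocks the size of
    LL_D; a square power-of-two input yields 4^D equally sized blocks."""
    subbands = []
    ll = img.astype(float)
    for _ in range(D):
        ll, hl, lh, hh = haar_dwt2(ll)
        subbands.extend([hl, lh, hh])
    subbands.append(ll)
    edge = ll.shape[0]                       # LL_D block edge length
    blocks = []
    for s in subbands:
        for i in range(0, s.shape[0], edge):
            for j in range(0, s.shape[1], edge):
                blocks.append(s[i:i + edge, j:j + edge])
    return blocks
```

For a 128×128 image this reproduces the block counts stated above: 16 blocks of 32×32 at D = 2, 64 blocks of 16×16 at D = 3, 256 blocks of 8×8 at D = 4.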
Equivalent LBP transform:
since the DWT and the segmented image are color images, in order to reduce the dimension of the LBP feature, color-to-gray image transformation processing is required to be performed on the DWT and the segmented output image, so that the requirements of the LBP operation are satisfied. First, the R, G, B channels are given different weights, and the color image is converted into a single-channel Gray scale according to the formula gray=r×0.299+g×0.587+b×0.114.
The equivalent (uniform) LBP transform builds a histogram, after the conventional LBP transform, according to the number of 0/1 transitions in the binary numbers over the image transform region. Specifically, in any 3×3 neighborhood, the central pixel value is taken as the threshold: a neighbor is set to 1 if larger than the threshold and 0 if smaller, and the eight transformed values in the neighborhood are then read in a fixed order to form a binary number. When the number of 0/1 transitions in the binary number exceeds 2, it is called a non-uniform pattern; there are 58 patterns whose number of transitions does not exceed 2, so the equivalent LBP transform is a histogram over these 58 uniform patterns plus 1 non-uniform pattern. After the DWT and block partitioning, each block undergoes the equivalent LBP transform, yielding 4^D histograms of 59 bins each; connected laterally, they constitute the DWT-LBP feature of the video frame.
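A minimal pure-Python sketch of the 59-bin uniform (equivalent) LBP histogram described above; the neighbour ordering and the >= tie-break at the threshold are illustrative choices that the text leaves open:

```python
def transitions(pattern):
    """Number of 0/1 transitions in a circular 8-bit pattern."""
    bits = [(pattern >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# The 58 uniform patterns (at most 2 transitions); all others share bin 58.
UNIFORM = sorted(p for p in range(256) if transitions(p) <= 2)
BIN = {p: i for i, p in enumerate(UNIFORM)}

def uniform_lbp_hist(gray):
    """59-bin uniform LBP histogram of a grayscale image (list of rows of ints)."""
    h, w = len(gray), len(gray[0])
    hist = [0] * 59
    # Eight neighbours read clockwise starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre = gray[y][x]
            code = 0
            for k, (dy, dx) in enumerate(offsets):
                if gray[y + dy][x + dx] >= centre:
                    code |= 1 << k
            hist[BIN.get(code, 58)] += 1       # non-uniform codes -> bin 58
    return hist
```

Per the text, each of the 4^D DWT blocks gets one such histogram, and the histograms are concatenated laterally into the DWT-LBP feature.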
DCT operation:
given a one-dimensional input sequence x n N is more than or equal to 0 and less than or equal to N-1, and the output X of DCT k The formula is: :
Figure BDA0002368372170000071
Figure BDA0002368372170000072
wherein x is n N is more than or equal to 0 and less than or equal to N-1 as input data; x is X k For outputting data;
through the operation of the DCT, the main energy of the signal is concentrated in the low frequency part, the first few components of the output sequence concentrate most of the energy of the signal, and the subsequent part retains the high frequency component of the signal, which can be omitted under certain conditions.
After the DWT-LBP features of the video frames are obtained, the feature sequences from the same video are connected vertically to form the feature matrix of the video. Then the discrete cosine transform is applied along the columns to extract the commonality between the LBP features of the same region across different frames of the video; part of the components (the first C) can optionally be retained to reduce the number of features, with the specific process shown in Figure 3. After the DCT transformation, the first C DCT components of each column are retained, forming a new matrix of size 4^D·59 × C that represents each video.
SVM classifier
The support vector machine (SVM) finds, between two classes of samples (x ∈ R^d), a separating hyperplane H, i.e. w·φ(x_i) + b = 0, that maximizes the margin between the samples, hence the name large-margin classifier. The loss function J(w) of the SVM is:

J(w) = (1/2)·||w||^2 + α·Σ_i ξ_i

where the term (1/2)·||w||^2 corresponds to the distance between the sample points and the classifier hyperplane, Σ_i ξ_i is the sum of the classifier slack factors, which allows some samples to be misclassified, and α is a parameter balancing the margin and the slack factors.
Here a Gaussian function (radial basis function, RBF) is chosen to construct this hyperplane; its kernel is:

K(x_i, x_j) = exp(−γ·||x_i − x_j||^2)

Both γ in the kernel and α in the loss function need to be chosen according to the characteristics of the samples before training.
Network training and test results:
according to the above steps, since the time length of each video is different, but the representative matrix of each video is independent of the number of frames, the size of the matrix is 4 D *59×c, the size of the matrix is uniform as long as the number of steps of the two-dimensional DWT is the same and the components retained by the discrete cosine transform are the same. And (3) sending the matrix of the training set into an SVM classifier constructed by RBF Gaussian kernel, and training a network to obtain relevant parameters of a human face living body detection model.
The parameters of face magnification S, time interval I, DWT level D and DCT component count C involved in this embodiment are all determined from a large amount of experimental data. Experimental results obtained by the controlled-variable method show that when I = 6, D = 1, C = 1, and the SVM penalty factor α = 2048 with γ = 0.000008, the classification accuracy increases to some extent as the magnification S grows; the results are shown in Table 1.
table 1: recognition results on REPALY-ATTACK at different magnification S
S = 2.6, which gives the best classification effect, is selected for the next experiment, in which the values of D and C are varied. The experimental results show that C does not help the accuracy much, while increasing the DWT level greatly helps correct classification. Therefore C = 1, which yields the smallest number of features, is selected, using the dominant energy of the DWT-LBP matrix to compress the texture variation in the extracted time stream. With the fixed parameters S = 2.6, I = 6, C = 1, α = 2048, γ = 0.000008 and increasing D, the results are shown in Table 2.
table 2: recognition results on REPALY-ATTACK at different DWT series
From the table it can be concluded that on the REPLAY-ATTACK database the algorithm achieves its best result, HTER = 0, when D = 3 and D = 4, i.e., a perfect zero-error classification. Fig. 4 shows the ROC curve on the REPLAY-ATTACK database when S = 2.6, D = 4, I = 6, C = 1, α = 2048, γ = 0.000008. The AUC is defined as the area under the ROC curve, with values between 0.5 and 1; used as an evaluation criterion, a larger value indicates a better classifier, providing a comprehensive measure of classification performance. In Fig. 4 the curve passes through the point (0, 1), with an AUC of 1, again demonstrating the perfect classification capability in this example.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (8)

1. The human face living body detection method based on the DWT-LBP-DCT characteristics is characterized by comprising the following steps:
s1, acquiring an original video frame sequence to be processed;
s2, carrying out face positioning on an original video frame sequence, expanding a face recognition area, determining an interest area, and carrying out normalization processing on the interest area;
s3, performing multistage two-dimensional discrete wavelet transform and block division processing on the normalized region of interest to obtain DWT characteristics and obtain frequency information of a video frame;
s4, performing equivalent local binary pattern transformation processing on the DWT characteristics to obtain DWT-LBP characteristics, and acquiring frequency information and texture information of the video frame;
s5, performing discrete cosine transform on the DWT-LBP characteristic longitudinally to obtain the DWT-LBP-DCT characteristic, and obtaining frequency information, texture information and time domain information of the video frame;
the step S5 specifically comprises the following steps:
s501, connecting DWT-LBP features of the same video longitudinally to form a DWT-LBP video feature matrix;
s502, longitudinally applying DCT operation on a DWT-LBP video feature matrix, extracting energy information reflected by each block in the time transformation process, obtaining DWT-LBP-DCT features, and calculating as follows:
X_k = c(k) · Σ_{n=0}^{N−1} x_n · cos[(2n + 1)kπ / (2N)], 0 ≤ k ≤ N−1

c(k) = √(1/N) for k = 0; c(k) = √(2/N) for 1 ≤ k ≤ N−1

where x_n (0 ≤ n ≤ N−1) is the input data of length N, and X_k is the output data;
and S6, training and classifying the DWT-LBP-DCT characteristics by using a machine learning classifier to obtain a detection result.
2. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 1, wherein the step S1 is specifically: when the original video has M frames of images, F frames of images are extracted according to the time interval I, and the method is calculated as follows:
F = ⌈M / (I + 1)⌉
and further obtaining an original video frame sequence to be processed.
3. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 1, characterized in that step S2 is specifically as follows:
S201, locating the face position in each frame of the original video frame sequence with a face classifier, and taking the recognized face position coordinates as the base;
S202, enlarging the coordinates according to a set of scale factors, with the face position coordinates as the base; if an enlarged coordinate falls outside the coordinate range of the original frame, replacing it with the corresponding boundary point of the original frame, obtaining the scaled coordinates;
S203, re-determining the target area, namely the region of interest, in the original video frame according to the scaled coordinates, and unifying the resolution of all region-of-interest images to a target value; if the resolution is below or above the target value, applying bilinear interpolation to obtain the target image, the target image being a color RGB image.
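The box-enlargement-with-clamping of steps S201–S202 can be sketched as follows. The centre-anchored enlargement is an assumption (the claim does not state the anchor point); the resize to the target resolution would follow, e.g. via bilinear interpolation.

```python
def enlarge_roi(x, y, w, h, scale, frame_w, frame_h):
    """Enlarge a detected face box (x, y, w, h) about its centre by
    `scale`, clamping to the frame boundary as claim 3 describes.
    Returns the clamped (x0, y0, x1, y1) corner coordinates."""
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * scale, h * scale
    x0 = max(0, int(round(cx - nw / 2)))
    y0 = max(0, int(round(cy - nh / 2)))
    x1 = min(frame_w, int(round(cx + nw / 2)))
    y1 = min(frame_h, int(round(cy + nh / 2)))
    return x0, y0, x1, y1
```

Running this for each scale factor in the set yields one region of interest per factor, so background context around the face enters the feature at several granularities.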
4. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 3, characterized in that step S3 is specifically as follows:
S301, performing D-level two-dimensional DWT processing on the target area and separating out the frequency components, wherein the frequency components comprise the smooth approximation component LL_D, the horizontal components HL_X, the vertical components LH_X, and the diagonal components HH_X, where X ranges from 1 to D;
s302, recording the resolution of each frequency component;
S303, dicing the horizontal components HL_X, the vertical components LH_X, and the diagonal components HH_X into blocks with the same resolution as the smooth approximation component LL_D, and carrying out the above operations on all the target images extracted with the different scale factors to obtain the frequency information of the video frame.
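Steps S301–S302 can be sketched as below. The patent does not name the wavelet basis, so the Haar wavelet is used here purely as an assumption; it halves the resolution per level, which is what makes the per-component resolutions of S302 easy to record.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar DWT (basis choice is an assumption):
    returns the LL, HL, LH, HH sub-bands at half resolution."""
    a = np.asarray(img, float)
    # pairwise averages/differences along rows, then along columns
    lo_r = (a[:, 0::2] + a[:, 1::2]) / 2.0
    hi_r = (a[:, 0::2] - a[:, 1::2]) / 2.0
    LL = (lo_r[0::2] + lo_r[1::2]) / 2.0
    LH = (lo_r[0::2] - lo_r[1::2]) / 2.0
    HL = (hi_r[0::2] + hi_r[1::2]) / 2.0
    HH = (hi_r[0::2] - hi_r[1::2]) / 2.0
    return LL, HL, LH, HH

def multilevel_dwt(img, D):
    """D-level decomposition as in S301: keep the detail bands
    (HL_X, LH_X, HH_X) of every level and the final approximation LL_D."""
    bands, cur = [], img
    for _ in range(D):
        LL, HL, LH, HH = haar_dwt2(cur)
        bands.append((HL, LH, HH))
        cur = LL
    return cur, bands
```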
5. The face living body detection method based on DWT-LBP-DCT features according to claim 4, characterized in that the frequency components to be diced are those with X in the range 1 to D-1.
6. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 4, wherein step S4 is specifically: calculating the equivalent (uniform) LBP histogram of every frequency sub-block, and horizontally concatenating the LBP histograms of the target images of each frame in time order to obtain the DWT-LBP features.
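The "equivalent LBP" of step S4 is the uniform-pattern LBP: codes whose circular bit string has at most two 0/1 transitions each get their own histogram bin, and all remaining codes share one bin (59 bins for 8 neighbours). A minimal sketch, using the 8-neighbour, radius-1 variant as an assumption:

```python
import numpy as np

def is_uniform(pattern, P=8):
    """True if the circular P-bit string has at most two 0/1 transitions."""
    bits = [(pattern >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P)) <= 2

def uniform_lbp_hist(block):
    """Normalized 59-bin uniform LBP histogram of a grey-level block:
    one bin per uniform code plus a shared bin for non-uniform codes."""
    a = np.asarray(block, float)
    uniform_codes = [c for c in range(256) if is_uniform(c)]  # 58 codes
    index = {c: i for i, c in enumerate(uniform_codes)}
    hist = np.zeros(len(uniform_codes) + 1)
    # 8 neighbours, clockwise from the top-left
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, a.shape[0] - 1):
        for x in range(1, a.shape[1] - 1):
            code = 0
            for b, (dy, dx) in enumerate(offs):
                if a[y + dy, x + dx] >= a[y, x]:
                    code |= 1 << b
            hist[index.get(code, len(uniform_codes))] += 1
    return hist / max(hist.sum(), 1)
```

Concatenating these 59-bin histograms over all frequency sub-blocks of a frame, then over the frames of the video, yields the DWT-LBP feature row the later DCT step operates on.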
7. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 1, characterized in that step S6 is specifically as follows: according to the DWT-LBP-DCT features, the training set of a database is used to build the corresponding SVM classifier; the parameters required for testing are obtained from the verification set; the test set is then fed into the trained SVM classifier to obtain the detection result, realizing real/fake video classification.
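The classification step can be sketched with a linear SVM trained by the Pegasos sub-gradient method. This is only a self-contained stand-in for the off-the-shelf SVM the claim refers to; the patent does not specify the kernel or solver.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM via Pegasos sub-gradient descent
    (a stand-in for the claim's SVM). Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, float), np.asarray(y, float)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1:        # hinge-loss margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

def predict(w, X):
    """Sign of the decision value: +1 = live face, -1 = attack."""
    return np.sign(np.asarray(X, float) @ w)
```

In practice the decision value `X @ w` (rather than its sign) is what gets thresholded by the τ of claim 8.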
8. The face living body detection method based on DWT-LBP-DCT characteristics according to claim 7, characterized in that the parameters required in the test include the false acceptance rate FAR and the false rejection rate FRR; the verification set is used to obtain the classifier threshold τ at which the false acceptance rate FAR equals the false rejection rate FRR, and the half total error rate HTER and the accuracy are obtained according to the classifier threshold τ; the half total error rate HTER is half of the sum of the false acceptance rate FAR and the false rejection rate FRR.
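The threshold selection and HTER of claim 8 can be sketched as follows: sweep candidate thresholds on the verification (development) set, keep the one where FAR and FRR are closest (the equal-error point), then score the test set at that fixed τ. The score/label conventions below are assumptions.

```python
import numpy as np

def eer_threshold(dev_scores, dev_labels):
    """Return the threshold where FAR and FRR are closest on the
    development set. Labels: 1 = live, 0 = attack; higher score = more live."""
    s = np.asarray(dev_scores, float)
    l = np.asarray(dev_labels)
    best_t, best_gap = None, np.inf
    for t in np.unique(s):
        far = np.mean(s[l == 0] >= t)   # attacks accepted as live
        frr = np.mean(s[l == 1] < t)    # live faces rejected
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def hter(test_scores, test_labels, tau):
    """Half total error rate: (FAR + FRR) / 2 at the fixed threshold tau."""
    s = np.asarray(test_scores, float)
    l = np.asarray(test_labels)
    far = np.mean(s[l == 0] >= tau)
    frr = np.mean(s[l == 1] < tau)
    return (far + frr) / 2.0
```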
CN202010042880.4A 2020-01-15 2020-01-15 DWT-LBP-DCT feature-based human face living body detection method Active CN111259792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010042880.4A CN111259792B (en) 2020-01-15 2020-01-15 DWT-LBP-DCT feature-based human face living body detection method


Publications (2)

Publication Number Publication Date
CN111259792A CN111259792A (en) 2020-06-09
CN111259792B true CN111259792B (en) 2023-05-12

Family

ID=70953994


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814697B (en) * 2020-07-13 2024-02-13 伊沃人工智能技术(江苏)有限公司 Real-time face recognition method and system and electronic equipment
CN112001429B (en) * 2020-08-06 2023-07-11 中山大学 Depth fake video detection method based on texture features
CN112733791B (en) * 2021-01-21 2022-09-30 展讯通信(上海)有限公司 Living body detection method and device, storage medium and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933414A (en) * 2015-06-23 2015-09-23 中山大学 Living body face detection method based on WLD-TOP (Weber Local Descriptor-Three Orthogonal Planes)
CN106228129A (en) * 2016-07-18 2016-12-14 中山大学 A kind of human face in-vivo detection method based on MATV feature
CN108960088A (en) * 2018-06-20 2018-12-07 天津大学 The detection of facial living body characteristics, the recognition methods of specific environment




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant