CN111967331A - Face representation attack detection method and system based on fusion feature and dictionary learning - Google Patents

Face representation attack detection method and system based on fusion feature and dictionary learning

Info

Publication number
CN111967331A
CN111967331A (application CN202010696193.4A)
Authority
CN
China
Prior art keywords
dictionary
fusion
face
face image
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010696193.4A
Other languages
Chinese (zh)
Other versions
CN111967331B (en)
Inventor
傅予力
黄汉业
向友君
许晓燕
吕玲玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010696193.4A priority Critical patent/CN111967331B/en
Publication of CN111967331A publication Critical patent/CN111967331A/en
Application granted granted Critical
Publication of CN111967331B publication Critical patent/CN111967331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face representation attack detection method and system based on fusion features and dictionary learning. The method comprises the following steps: extracting image quality features of the complete face image according to the distortion sources of secondary imaging; constructing a deep convolutional network model and extracting deep network features of face image blocks through the network; concatenating the two kinds of features and reducing their dimensionality through PCA to generate the final fusion features; initializing dictionary atoms with the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary; and judging the category of a test sample by the size of the fusion-feature reconstruction residual. By combining image quality features with deep network features for the first time, the method makes better use of the information provided by a single-frame image and effectively strengthens the discriminative power of the extracted features; by stripping out the patterns shared by genuine and attack samples with the low-rank shared dictionary for the first time, it improves the accuracy of attack detection and generalizes well.

Description

Face representation attack detection method and system based on fusion feature and dictionary learning
Technical Field
The invention relates to the technical field of image processing, in particular to a face representation attack detection method and system based on fusion features and dictionary learning.
Background
Nowadays, face recognition technology is widely applied in security, payment, entertainment and other scenarios. However, face recognition systems carry certain safety hazards. With the development of social networks and the popularity of smartphones, more and more people share personal photos and videos online. Criminals can attack a face recognition system by impersonating others with these media, or deliberately confuse their own identity, in order to infringe on the property of others, escape legal sanctions, and so on. Passing a face recognition system with pictures, videos and the like of a legitimate user in order to borrow that user's identity is called a face representation attack, and the technology for detecting such attacks is called face liveness detection.
In face liveness detection, face images fall into two categories. One category consists of images obtained by directly photographing a legitimate user. In the other category, the photographed subject may be a photograph, video, wax figure, etc. that is highly similar to the legitimate user's face. Images of the latter category are called face representation attack images (attack faces for short) and are the objects that liveness detection technology must detect.
The core of a face liveness detection algorithm is to extract the features of a face image that are most discriminative of a live subject. Traditional detection techniques rely on hand-crafted features such as Local Binary Patterns (LBP) and Local Phase Quantization (LPQ); as device imaging quality keeps improving, hand-crafting features capable of detecting attack faces has become very difficult. In recent years, automatic feature extraction with convolutional neural networks has become mainstream. Deep convolutional neural networks excel at image classification, but are limited by the scale of liveness detection datasets: a deep network supervised only by class labels tends to memorize whatever characteristics exist in the training set, which easily causes overfitting and poor generalization.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides a face representation attack detection method and system based on fusion features and dictionary learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a face representation attack detection method based on fusion features and dictionary learning, which comprises the following steps:
carrying out face detection and cropping on an input video to construct a face image database;
extracting fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;
extracting the image quality features of the complete face image according to the distortion sources of the secondary imaging of the face image;
constructing a deep convolutional network model and extracting the deep network features of face image blocks through the network;
standardizing the image quality features and the deep network features respectively, concatenating them, and reducing the dimensionality of the concatenated features through PCA to generate the final fusion features;
initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary;
and judging the category of a test sample based on the size of the fusion-feature reconstruction residual.
As a preferred technical solution, the image quality features of the complete face image are extracted according to the distortion sources of the secondary imaging of the face image; the specific steps comprise: extracting specular reflection features, blur features, color moment features and color diversity features, and concatenating the extracted features to obtain the image quality features.
As a preferred technical scheme, the extracting of the depth network features of the face image block through the depth convolution network specifically comprises the following steps:
the method comprises the steps of generating a face image block by randomly zooming and randomly cutting a complete face image, constructing a lightweight depth convolution network model, taking the face image block as the input of the convolution network model, training the convolution network model by adopting a Focal local Loss function to extract the depth network characteristics of the face image block, converting a one-hot coded label into a soft label by adopting a label smoothing method, and optimizing the training process of a depth convolution neural network.
As a preferred technical solution, initializing the dictionary atoms based on the fusion features and training the dictionary learning classifier based on the low-rank shared dictionary specifically comprise: alternately optimizing the dictionary and the sparse coefficients to minimize the cost function of the dictionary model, and saving the dictionary after a set number of optimization iterations.
As a preferred technical solution, the cost function of the dictionary model is expressed as:

$$J_{(D,X)} = r(Y,D,X) + \lambda_2 f(X) + \lambda_1 \lVert X \rVert_1 + \eta \lVert D_0 \rVert_*$$

wherein the first term is the discriminative fidelity term, the second term is the discriminant coefficient term based on the Fisher criterion, the third term is the L1 regularization term and the fourth term is the nuclear norm. The discriminative fidelity term realizes the discriminative power of the dictionary; the discriminant coefficient term increases intra-class similarity and reduces inter-class similarity; the L1 regularization term realizes the sparsity of the coefficients X; the nuclear norm constrains the size of the subspace spanned by the shared dictionary, guaranteeing its low rank; and λ1, λ2 and η trade off the weights of the terms of the cost function.

The discriminative fidelity term is defined as:

$$r(Y,D,X) = \sum_{c=1}^{2} \Big( \lVert Y_c - D X_c \rVert_F^2 + \lVert Y_c - D_c X_c^c - D_0 X_c^0 \rVert_F^2 + \sum_{i \neq c} \lVert D_i X_c^i \rVert_F^2 \Big)$$

wherein $Y_c \in \mathbb{R}^{m \times n_c}$ represents the samples of class c (the samples being fusion features), m represents the dimension of the fusion features, $n_c$ represents the number of class-c samples, D represents the total dictionary, $D_c$ represents the sub-dictionary of class c, and $X_c^i$ represents the coefficients of the class-c samples on the class-i dictionary.

The discriminant coefficient term is defined as:

$$f(X) = \sum_{c=1}^{2} \big( \lVert X_c - M_c \rVert_F^2 - \lVert M_c - M \rVert_F^2 \big) + \lVert X^0 - M^0 \rVert_F^2$$

wherein $M_c$ represents the mean of the sparse coefficients of the class-c samples, M represents the mean of the sparse coefficients of the whole training set, and $M^0$ represents the mean of the coefficients on the shared dictionary; the term $\lVert X^0 - M^0 \rVert_F^2$ forces the coefficients of all training samples on the shared dictionary to be close to the average.
As a preferred technical solution, the method further comprises a step of solving the sparse coefficients of the test sample, specifically: constructing two class dictionaries augmented with the shared dictionary from the saved dictionary, and solving the sparse coefficients of the test sample with the class dictionaries fixed.
As a preferred technical solution, judging the category of the test sample based on the size of the fusion-feature reconstruction residual specifically includes:
solving the sparse coefficients of the test sample based on elastic net regularization, reconstructing the fusion features of the test sample from the sparse coefficients, and taking the class with the smallest reconstruction residual as the predicted class of the test sample.
The invention provides a face representation attack detection system based on fusion features and dictionary learning, which comprises: a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;
the preliminary fusion feature extraction module comprises an image quality feature extraction module and a deep network feature extraction module;
the face image database construction module is used for carrying out face detection and cropping on an input video to construct a face image database;
the preliminary fusion feature extraction module is used for extracting the fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;
the image quality feature extraction module is used for extracting the image quality features of the complete face image according to the distortion sources of the secondary imaging of the face image;
the deep network feature extraction module is used for constructing a deep convolutional network model and extracting the deep network features of face image blocks through the network;
the final fusion feature generation module is used for standardizing the image quality features and the deep network features respectively, concatenating them, and reducing the dimensionality of the concatenated features through PCA to generate the final fusion features;
the dictionary learning classifier training module is used for initializing dictionary atoms based on the fusion features and training a dictionary learning classifier based on a low-rank shared dictionary;
the test sample category judgment module is used for judging the category of a test sample based on the size of the fusion-feature reconstruction residual.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) By fusing hand-crafted image quality features with deep network features, the invention makes full use of the information provided by a single-frame image and enhances the discriminative power of the features.
(2) The method uses the low-rank shared dictionary to strip out the commonalities of genuine and attack samples, ensuring that the class dictionaries better represent the differences between real samples and attack samples; it avoids the overfitting that easily occurs with a fully connected classification layer, generalizes well, and further improves the accuracy of attack detection.
(3) The method replaces the traditional L1 regularization with elastic net regularization, alleviating the tendency of L1 regularization to ignore certain features; this helps preserve detailed features and strengthens the discriminability of the sparse coefficients.
(4) The invention uses randomly cropped image blocks as the input of the convolutional neural network, so that the network focuses on learning the effective information related to the spoofing pattern; this enlarges the dataset in an efficient way and effectively mitigates the performance degradation caused by small data scale.
Drawings
Fig. 1 is a schematic flow chart of a face representation attack detection method based on fusion features and dictionary learning according to the present embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1, the present embodiment provides a face representation attack detection method based on fusion feature and dictionary learning, including the following steps:
s1: carrying out face detection and cutting on an input video to construct a face image database;
In this embodiment, the public face representation attack video datasets REPLAY-ATTACK, CASIA-FASD and MSU-MFSD are selected. All three datasets contain real-face videos and attack-face videos and are divided into training and test sets. The first 30 frames of each video are extracted, a cascade classifier based on Haar features detects the face position in each frame, and the face image is cropped out;
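As an illustration, this step could look like the following minimal sketch using OpenCV's bundled Haar cascade; the cascade file name and the one-face-per-frame policy are assumptions, not taken from the patent:

```python
# A minimal sketch of the face detection and cropping step, assuming OpenCV.
import cv2

def crop_faces(video_path, n_frames=30):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    for _ in range(n_frames):  # first 30 frames of each video
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            x, y, w, h = faces[0]          # keep one face per frame
            crops.append(frame[y:y + h, x:x + w])
    cap.release()
    return crops
```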
s2: extracting fusion characteristics of face images in a face image database, wherein the fusion characteristics comprise image quality characteristics and depth network characteristics, and the method comprises the following specific steps:
s21) extracting the image quality characteristics of the complete face image according to the distortion source of the face image secondary imaging,
Sub-features are extracted from the face images of the face image database in terms of blur, specular reflection, color distortion and the like, and the final image quality feature vector of a sample is formed by concatenating all the sub-feature vectors. The details are as follows:
Under the same imaging environment, a genuinely presented face is a once-imaged picture, while a face representation attack is a twice-imaged picture. Analyzing the sources of distortion introduced in the secondary imaging process helps strengthen the discriminability of the extracted features, so image quality features are extracted from four aspects: specular reflection, blur, color moment distortion and color diversity distortion;
Specular reflection feature extraction: the chromaticity at highlight positions of the input face image is iteratively replaced with the maximum diffuse chromaticity of neighboring pixels, the specular reflection component of the image is then extracted, and the percentage, mean and variance of this component form the specular reflection features;
Blur feature extraction: the blur features of the image are extracted with a re-blur based method. The input image is converted to a grayscale image, which is low-pass filtered with a Gaussian filter with a 3 × 3 convolution kernel; the filtered image is called the blurred image. Sharpness is measured by comparing the variation between adjacent pixels: absolute difference images are computed in the horizontal and vertical directions, and the gray values of all pixels of each difference image are summed, giving the horizontal difference sum and the vertical difference sum. The ratios of these sums before and after filtering form the blur feature.
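A minimal sketch of this re-blur measure, assuming OpenCV and NumPy; the Gaussian sigma (derived from the kernel size) and the epsilon guard are assumptions, since the patent only fixes the kernel size:

```python
import cv2
import numpy as np

def blur_features(face_bgr):
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    blurred = cv2.GaussianBlur(gray, (3, 3), 0)  # sigma derived from kernel size

    def diff_sums(img):
        dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal difference sum
        dv = np.abs(np.diff(img, axis=0)).sum()  # vertical difference sum
        return dh, dv

    h0, v0 = diff_sums(gray)      # before filtering
    h1, v1 = diff_sums(blurred)   # after filtering
    # Sharp originals lose more adjacent-pixel variation under re-blurring
    # than already blurred (recaptured) images do.
    return np.array([h1 / (h0 + 1e-8), v1 / (v0 + 1e-8)])
```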
Color moment feature extraction: the input face image is first converted from RGB space to HSV space, whose channels are relatively independent; then the mean, variance and skewness of each channel are computed, together with the percentage of pixels falling in the minimum and maximum histogram bins of each channel, and these 5 values per channel form the color moment features.
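A minimal sketch of these color moment features; the number of histogram bins is an assumption, since the patent does not specify it:

```python
import cv2
import numpy as np
from scipy.stats import skew

def color_moment_features(face_bgr, bins=16):
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    feats = []
    for channel in cv2.split(hsv):
        v = channel.astype(np.float32).ravel()
        hist, _ = np.histogram(v, bins=bins)
        feats += [v.mean(), v.var(), skew(v),
                  hist[0] / v.size,    # share of pixels in the minimum bin
                  hist[-1] / v.size]   # share of pixels in the maximum bin
    return np.array(feats)             # 3 channels x 5 values
```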
Color diversity feature extraction: the R, G and B channels of the input picture are color-quantized, and the color diversity feature is built from the histogram bin counts of the 100 most frequently occurring colors and the number of distinct colors appearing in the face image.
The four groups of extracted features are concatenated and called the image quality feature; its dimension is 121.
S22) constructing a deep convolutional network model and extracting the deep network features of the face image blocks through the network, specifically:
The face images obtained in step S1 are scaled to 112 × 112 and then divided into blocks by random scaling and random cropping, with the block size set to 48 × 48. The face image blocks serve as the input of the convolutional neural network, which is trained to extract features. Using local image blocks enlarges the training set and focuses the convolutional network on learning the effective information related to the spoofing attack pattern, while keeping the original input resolution and preventing the loss of discriminative information;
the scale of the existing public data set is small, and a convolution network model can adopt a model with small complexity. This embodiment employs the ResNet18 model pre-trained on ImageNet datasets. Meanwhile, the convolution kernel size of the first convolutional layer of the ResNet18 is reduced to 3 × 3, the step size is reduced to 1, the next layer of the last convolutional layer is a global pooling layer, the global pooling layer averages each feature image output by the convolutional layer, and then connects the averages into a one-dimensional vector, the next layer of the global pooling layer is a full-link layer, the full-link layer takes the one-dimensional vector output by the global pooling layer as input, the output dimension is the corresponding category number, and the category number in this embodiment is set to 2, and respectively corresponds to a real face and an attack face.
The loss function for training the convolutional network is the Focal Loss, defined as:

$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

wherein

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases}$$

p is the probability of the real-face class output by the network, y denotes the true label of the input image (1 for a real face, 0 for an attack face), and γ > 0 is called the focusing parameter. The modulating factor $(1 - p_t)^{\gamma}$ folds the model's prediction score into the loss, so the loss adapts to the difficulty of each sample; γ is set to 2 here. Since there are several times more attack-face videos than real-face videos, adopting the Focal Loss instead of the conventional cross-entropy loss also addresses the data imbalance common in these datasets.
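As an illustrative sketch (not the authors' code), the Focal Loss above could be implemented in PyTorch as follows; averaging over the batch is an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (N, 2) class scores; targets: (N,) long, 1 = real, 0 = attack.
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()  # modulating factor applied
```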
During convolutional neural network training, label smoothing converts the traditional one-hot coded label into a soft label:

$$y_{ls} = (1 - \alpha)\, y_{oh} + \frac{\alpha}{K}$$

wherein $y_{oh}$ denotes the conventional one-hot coded label and $y_{ls}$ denotes the smoothed soft label. Label smoothing scales the value at the correct class by (1 - α), and entries that were 0 become α/K, where K is the number of classes and α ∈ [0, 1]; α is 0.1 in this embodiment. By moderately lowering the value of the correct label, label smoothing encourages the model to choose the correct class without excessive confidence. For face representation attack detection, positive and negative samples are very similar in the image domain, so with hard labels the network tends to fit quickly in the early stage of training; introducing label smoothing further improves the generalization of the convolutional network model.
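A minimal sketch of this soft-label conversion, assuming K = 2 and α = 0.1:

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, num_classes=2, alpha=0.1):
    one_hot = F.one_hot(targets, num_classes).float()
    # y_ls = (1 - alpha) * y_oh + alpha / K: the correct-class entry becomes
    # 1 - alpha + alpha/K, and entries that were 0 become alpha/K.
    return (1.0 - alpha) * one_hot + alpha / num_classes
```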
The optimizer used for training the neural network is stochastic gradient descent, with the initial learning rate and weight decay set to 0.001 and 0.00001 respectively. The learning-rate scheduler is a cosine annealing scheduler with warm restarts: the minimum learning rate is set to 0.00004, the cosine period is 5 epochs, and training runs for 30 epochs in total. After training, the final global average pooling layer and fully connected layer are removed, and the preceding convolutional block groups are used to extract the deep network features of the face image blocks; the deep network feature has 512 dimensions;
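The training configuration might be sketched as follows, with `build_model` the hypothetical helper from the sketch above. Since averaging the 512 output feature maps yields the stated 512-dimensional feature, this sketch drops only the fully connected head and keeps the pooling, which is an interpretation of the text:

```python
import torch

model = build_model()  # hypothetical helper from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.00001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=5, eta_min=0.00004)  # restart every 5 epochs, 30 epochs total

# After training: 512-dimensional deep network feature extractor.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
```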
s23) respectively standardizing and cascading the two features according to the image quality feature and the depth network feature, and performing dimensionality reduction on the cascaded features through PCA to generate final fusion features, wherein the details are as follows:
extracting two groups of features of image quality features and depth network features from each face image of a face image database, calculating to obtain the average value and variance of the two groups of features, standardizing the two groups of features, and directly cascading the image quality features and the depth network features which correspond to each standardized face image, wherein the direct cascading length of the two features is 633;
adopting PCA (principal component analysis) to reduce the dimension of the cascade feature, wherein the feature after dimension reduction is called as a fusion feature, and in order to determine a relatively good PCA principal component number, the embodiment firstly determines a cutting point by setting an experiment with a larger principal component number; in this embodiment, the dimensionality of the PCA after dimension reduction is set to 400, then the principal components are sorted from large to small according to the variance, the cumulative value of the variance is calculated, and the dimension of the PCA after dimension reduction is determined again according to the proportion of the cumulative sum of the variances to the sum of the total variances. The dimension of PCA dimensionality reduction is selected to be 256 dimensions in the embodiment;
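A minimal sketch of the standardization, concatenation and PCA step, assuming scikit-learn; fitting the scalers and PCA on the training set only is assumed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def fit_fusion(quality_feats, deep_feats, n_components=256):
    # quality_feats: (N, 121), deep_feats: (N, 512) -> concatenated: (N, 633)
    sq = StandardScaler().fit(quality_feats)
    sd = StandardScaler().fit(deep_feats)
    fused = np.hstack([sq.transform(quality_feats), sd.transform(deep_feats)])
    pca = PCA(n_components=n_components).fit(fused)  # 256 kept via variance ratio
    return pca.transform(fused), (sq, sd, pca)
```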
s3: initializing dictionary atoms by utilizing the fusion characteristics of training samples, and training a dictionary learning classifier based on a low-rank shared dictionary;
In this embodiment, a dictionary learning method based on a low-rank shared dictionary is adopted. The total dictionary is set as $D = [D_1, D_2, D_0] \in \mathbb{R}^{m \times n}$, where m represents the dimension of the fusion feature and n the size of the dictionary. The class dictionaries $D_1$ and $D_2$ correspond to real faces and attack faces respectively, and the size of each class dictionary is set to 125; the size of the shared dictionary $D_0$ is set to 20. Fusion features are extracted from the training set images and used to initialize the dictionary atoms: the two class dictionaries randomly draw samples from their corresponding classes, the shared dictionary randomly draws samples from the whole training set, and the atoms of the dictionary are L2-normalized;
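Initialization might look like the following minimal sketch; the random sampling details are assumptions:

```python
import numpy as np

def init_dictionary(X_real, X_attack, n_class_atoms=125, n_shared=20, seed=0):
    rng = np.random.default_rng(seed)

    def sample_atoms(X, k):
        idx = rng.choice(X.shape[0], size=k, replace=False)
        A = X[idx].T                         # (m, k): one atom per column
        return A / np.linalg.norm(A, axis=0, keepdims=True)  # L2-normalize

    D1 = sample_atoms(X_real, n_class_atoms)    # real-face class dictionary
    D2 = sample_atoms(X_attack, n_class_atoms)  # attack-face class dictionary
    D0 = sample_atoms(np.vstack([X_real, X_attack]), n_shared)  # shared dictionary
    return np.hstack([D1, D2, D0])              # total dictionary D = [D1, D2, D0]
```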
The cost function J of the dictionary model is minimized by iteratively optimizing the dictionary D and the coefficients X; in this embodiment the number of iterations is set to 25. The cost function J is defined as:

$$J_{(D,X)} = r(Y,D,X) + \lambda_2 f(X) + \lambda_1 \lVert X \rVert_1 + \eta \lVert D_0 \rVert_*$$

wherein the first term is the discriminative fidelity term, the second term is the discriminant coefficient term based on the Fisher criterion, the third term is the L1 regularization term and the fourth term is the nuclear norm. The discriminative fidelity term realizes the discriminative power of the dictionary; the discriminant coefficient term increases intra-class similarity and reduces inter-class similarity; the L1 regularization term realizes the sparsity of the coefficients X; the nuclear norm constrains the size of the subspace spanned by the shared dictionary and guarantees its low rank. λ1, λ2 and η weigh the terms of the cost function; in this embodiment λ1 is set to 0.1, λ2 to 0.01 and η to 0.0001;
Specifically, the discriminative fidelity term is defined as:

$$r(Y,D,X) = \sum_{c=1}^{2} \Big( \lVert Y_c - D X_c \rVert_F^2 + \lVert Y_c - D_c X_c^c - D_0 X_c^0 \rVert_F^2 + \sum_{i \neq c} \lVert D_i X_c^i \rVert_F^2 \Big)$$

wherein $Y_c \in \mathbb{R}^{m \times n_c}$ represents the samples of class c (the samples being fusion features), m represents the dimension of the fusion features, $n_c$ represents the number of class-c samples (c takes the value 1 or 2), D represents the total dictionary, $D_c$ represents the sub-dictionary of class c, and $X_c^i$ represents the coefficients of the class-c samples on the class-i dictionary (i takes the value 1 or 2);
Specifically, the discriminant coefficient term is defined as:

$$f(X) = \sum_{c=1}^{2} \big( \lVert X_c - M_c \rVert_F^2 - \lVert M_c - M \rVert_F^2 \big) + \lVert X^0 - M^0 \rVert_F^2$$

wherein $M_c$ represents the mean of the sparse coefficients of the class-c samples, M represents the mean of the sparse coefficients of the whole training set, and $M^0$ represents the mean of the coefficients on the shared dictionary. The term $\lVert X^0 - M^0 \rVert_F^2$ forces the coefficients of all training samples on the shared dictionary to be close to their average, preventing the shared dictionary from contributing very differently to samples of different classes and thereby harming classification performance;
The dictionary and the sparse coefficients are optimized alternately to minimize the cost function of the dictionary model; after the set number of iterations the dictionary is saved. From the saved dictionary, two class dictionaries augmented with the shared dictionary are constructed, and with the class dictionaries fixed, the sparse coefficients of the test samples are solved.
S4: judging the category of the test sample based on the size of the fusion-feature reconstruction residual.
The sparse coefficients of the test sample are solved with elastic net regularization, the fusion feature of the test sample is reconstructed from the sparse coefficients, and the class with the smallest reconstruction residual is taken as the predicted class of the test sample.
Using the dictionary D obtained by this embodiment, two sub-dictionaries are constructed:

$$\bar{D}_1 = [D_1, D_0], \qquad \bar{D}_2 = [D_2, D_0]$$

that is, the dictionary saved in step S3 yields two class dictionaries augmented with the shared dictionary. When solving the sparse coefficients of a test sample y, this embodiment adopts elastic net regularization; the model optimization problem is:

$$\hat{x}_c = \arg\min_{x}\, \lVert y - \bar{D}_c x \rVert_2^2 + \lambda_a \lVert x \rVert_1 + \lambda_b \lVert x \rVert_2^2$$

wherein $\bar{D}_c$ denotes a class dictionary augmented with the shared dictionary and x denotes the sparse coefficients of the test sample y. The second term is the L1 regularization term and the third term is the L2 regularization term; λa and λb weigh the two regularization terms, and both are set to 0.01 in this embodiment. Compared with L1 regularization alone, L2 regularization tends to make the solution for x smoother, so linearly combining L1 and L2 regularization produces improved sparse coding.
After the sparse coefficients of the test sample y are obtained, y is reconstructed from the coefficients corresponding to each class's sub-dictionary, and the class with the smallest reconstruction residual is taken as the predicted class:

$$\hat{c} = \arg\min_{c \in \{1,2\}} \lVert y - \bar{D}_c \hat{x}_c \rVert_2$$
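A minimal sketch of this test-time sparse coding and residual decision, using scikit-learn's ElasticNet as an off-the-shelf solver; the mapping of (λa, λb) onto sklearn's (alpha, l1_ratio) parameterization, which also rescales by the sample count, is an assumption:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def predict(y, D1, D2, D0, lam_a=0.01, lam_b=0.01):
    residuals = []
    for Dc in (np.hstack([D1, D0]), np.hstack([D2, D0])):  # class + shared dicts
        alpha = (lam_a + lam_b) / y.shape[0]
        enet = ElasticNet(alpha=alpha, l1_ratio=lam_a / (lam_a + lam_b),
                          fit_intercept=False, max_iter=5000)
        enet.fit(Dc, y)                                    # solve for x_hat_c
        residuals.append(np.linalg.norm(y - Dc @ enet.coef_))
    return int(np.argmin(residuals))  # 0 = real face, 1 = attack face
```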
as shown in Table 1 below, the performance of this example was compared with that of a single feature on three data sets, REPLAY-ATTACK, CASIA-FASD and MSU-MFSD, and the evaluation index was HTER (half total error rate).
TABLE 1 Performance (HTER) using different features on three public data sets

Feature                    REPLAY-ATTACK    CASIA-FASD    MSU-MFSD
Image quality features     12.85%           13.99%        13.71%
Deep network features      2.37%            4.81%         11.13%
Fusion features            1.92%            4.41%         9.39%
Table 1 shows that deep network features do not automatically capture all the discriminative factors present in hand-crafted features; by fusing image quality features with deep network features, the method makes further use of the image information and effectively strengthens the discriminative power of the features.
Table 2 below compares this embodiment with other methods in the cross-dataset scenario between CASIA-FASD and REPLAY-ATTACK; the evaluation index is HTER.

TABLE 2 Performance comparison with different methods in cross-dataset scenarios

[Table 2 appears only as an image in the original publication; its numeric contents are not reproduced here.]
Table 2 shows that, compared with hand-crafted methods such as LBP and with a single CNN, the proposed method generalizes better in cross-dataset scenarios.
The embodiment further provides a face representation attack detection system based on fusion feature and dictionary learning, which includes: the system comprises a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;
In this embodiment, the preliminary fusion feature extraction module comprises an image quality feature extraction module and a deep network feature extraction module;
in this embodiment, the face image database construction module is configured to perform face detection and cropping on an input video to construct a face image database;
in this embodiment, the preliminary fusion feature extraction module is configured to extract the fusion features of the face images in the face image database, the fusion features comprising image quality features and deep network features;
in this embodiment, the image quality feature extraction module is configured to extract the image quality features of the complete face image according to the distortion sources of the secondary imaging of the face image;
in this embodiment, the deep network feature extraction module is configured to construct a deep convolutional network model and extract the deep network features of face image blocks through the network;
in this embodiment, the final fusion feature generation module is configured to standardize the image quality features and the deep network features respectively, concatenate them, and reduce the dimensionality of the concatenated features through PCA to generate the final fusion features;
in this embodiment, the dictionary learning classifier training module is configured to initialize dictionary atoms based on the fusion features and train a dictionary learning classifier based on the low-rank shared dictionary;
in this embodiment, the test sample category judgment module is configured to judge the category of a test sample based on the size of the fusion-feature reconstruction residual.
The above description of the technical scheme shows that, by combining hand-crafted image quality features with deep network features, the invention makes full use of the information provided by a single-frame image and strengthens the discriminative power of the features. The structure and training scheme of the convolutional neural network are tailored to the characteristics of face representation attack datasets: the Focal Loss addresses data imbalance, and the label smoothing technique further improves the generalization of the deep network. In addition, a low-rank shared dictionary is introduced to strip out the commonality of genuine and attack samples, and elastic net regularization improves the sparse coding of the test samples, further raising the accuracy of the dictionary learning classifier. The method generalizes well and is suitable for detecting two-dimensional face representation attacks in real scenarios.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims (8)

1. A face representation attack detection method based on fusion features and dictionary learning is characterized by comprising the following steps:
carrying out face detection and cutting on an input video to construct a face image database;
extracting fusion characteristics of face images in a face image database, wherein the fusion characteristics comprise image quality characteristics and depth network characteristics;
extracting image quality characteristics of the complete face image according to a distortion source of secondary imaging of the face image;
constructing a depth convolution network model, and extracting the depth network characteristics of the human face image block through a depth convolution network;
standardizing the image quality characteristics and the depth network characteristics respectively, cascading them, and reducing the dimension of the cascaded characteristics through PCA to generate final fusion characteristics;
initializing dictionary atoms based on fusion characteristics, and training a dictionary learning classifier based on a low-rank shared dictionary;
and judging the category of the test sample based on the size of the fusion feature reconstruction residual error.
2. The method for detecting the face representation attack based on the fusion feature and the dictionary learning according to claim 1, characterized in that the image quality features of the complete face image are extracted according to the distortion sources of the secondary imaging of the face image, and the specific steps comprise: extracting specular reflection features, blur features, color moment features and color diversity features, and cascading the extracted features to obtain the image quality features.
3. The method for detecting the face representation attack based on the fusion feature and the dictionary learning as claimed in claim 1, wherein the depth network feature of the face image block is extracted through a depth convolution network, and the specific steps include:
generating face image blocks from the complete face image by random scaling and random cropping; constructing a lightweight deep convolutional network model that takes the face image blocks as input; training the convolutional network model with a Focal Loss function to extract the deep network features of the face image blocks; and converting one-hot coded labels into soft labels by label smoothing to optimize the training process of the deep convolutional neural network.
4. The method for detecting the face representation attack based on the fusion feature and the dictionary learning according to claim 1, wherein the dictionary atoms are initialized based on the fusion features and the dictionary learning classifier based on the low-rank shared dictionary is trained, with the following specific steps: alternately optimizing the dictionary and the sparse coefficients to minimize the cost function of the dictionary model, and saving the dictionary after a set number of optimization iterations.
5. The method for detecting human face representation attack based on fusion feature and dictionary learning according to claim 4, wherein the cost function of the dictionary model is expressed as:

$$J_{(D,X)} = r(Y,D,X) + \lambda_2 f(X) + \lambda_1 \lVert X \rVert_1 + \eta \lVert D_0 \rVert_*$$

wherein the first term is the discriminative fidelity term, the second term is the discriminant coefficient term based on the Fisher criterion, the third term is the L1 regularization term and the fourth term is the nuclear norm; the discriminative fidelity term realizes the discriminative power of the dictionary; the discriminant coefficient term increases intra-class similarity and reduces inter-class similarity; the L1 regularization term realizes the sparsity of the coefficients X; the nuclear norm constrains the size of the subspace spanned by the shared dictionary and guarantees its low rank; λ1, λ2 and η trade off the weights of the terms of the cost function;

the discriminative fidelity term is defined as:

$$r(Y,D,X) = \sum_{c=1}^{2} \Big( \lVert Y_c - D X_c \rVert_F^2 + \lVert Y_c - D_c X_c^c - D_0 X_c^0 \rVert_F^2 + \sum_{i \neq c} \lVert D_i X_c^i \rVert_F^2 \Big)$$

wherein $Y_c \in \mathbb{R}^{m \times n_c}$ represents the samples of class c, the samples being fusion features; m represents the dimension of the fusion features; $n_c$ represents the number of class-c samples; D represents the total dictionary; $D_c$ represents the sub-dictionary of class c; and $X_c^i$ represents the coefficients of the class-c samples on the class-i dictionary;

the discriminant coefficient term is defined as:

$$f(X) = \sum_{c=1}^{2} \big( \lVert X_c - M_c \rVert_F^2 - \lVert M_c - M \rVert_F^2 \big) + \lVert X^0 - M^0 \rVert_F^2$$

wherein $M_c$ represents the mean of the sparse coefficients of the class-c samples, M represents the mean of the sparse coefficients of the whole training set, $M^0$ represents the mean of the coefficients on the shared dictionary, and the term $\lVert X^0 - M^0 \rVert_F^2$ forces the coefficients of all training samples on the shared dictionary to be close to the average.
6. The face representation attack detection method based on fusion feature and dictionary learning according to claim 4, further comprising a step of solving the sparse coefficients of the test sample, specifically: constructing two class dictionaries augmented with the shared dictionary from the saved dictionary, and solving the sparse coefficients of the test sample with the class dictionaries fixed.
7. The method for detecting the face representation attack based on the fusion feature and the dictionary learning according to claim 1, wherein the method for judging the category of the test sample based on the size of the residual error of the fusion feature reconstruction comprises the following specific steps:
solving the sparse coefficients of the test sample based on elastic net regularization, reconstructing the fusion features of the test sample from the sparse coefficients, and taking the class with the smallest reconstruction residual as the predicted class of the test sample.
8. A face representation attack detection system based on fusion feature and dictionary learning, comprising: the system comprises a face image database construction module, a preliminary fusion feature extraction module, a final fusion feature generation module, a dictionary learning classifier training module and a test sample category judgment module;
the preliminary fusion feature extraction module comprises an image quality feature extraction module and a depth network feature extraction module;
the face image database construction module is used for carrying out face detection and cutting on an input video to construct a face image database;
the preliminary fusion feature extraction module is used for extracting fusion features of the face images in the face image database, and the fusion features comprise image quality features and depth network features;
the image quality characteristic extraction module is used for extracting the image quality characteristics of the complete face image according to the distortion source of the secondary imaging of the face image;
the depth network feature extraction module is used for constructing a depth convolution network model and extracting the depth network features of the human face image blocks through a depth convolution network;
the final fusion feature generation module is used for standardizing the image quality features and the depth network features respectively, cascading them, and reducing the dimension of the cascaded features through PCA to generate final fusion features;
the dictionary learning classifier training module is used for initializing dictionary atoms based on fusion characteristics and training a dictionary learning classifier based on a low-rank shared dictionary;
the type judgment module of the test sample is used for judging the type of the test sample based on the size of the fusion characteristic reconstruction residual error.
CN202010696193.4A 2020-07-20 2020-07-20 Face representation attack detection method and system based on fusion feature and dictionary learning Active CN111967331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010696193.4A CN111967331B (en) 2020-07-20 2020-07-20 Face representation attack detection method and system based on fusion feature and dictionary learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010696193.4A CN111967331B (en) 2020-07-20 2020-07-20 Face representation attack detection method and system based on fusion feature and dictionary learning

Publications (2)

Publication Number Publication Date
CN111967331A true CN111967331A (en) 2020-11-20
CN111967331B CN111967331B (en) 2023-07-21

Family

ID=73362137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010696193.4A Active CN111967331B (en) 2020-07-20 2020-07-20 Face representation attack detection method and system based on fusion feature and dictionary learning

Country Status (1)

Country Link
CN (1) CN111967331B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449707A (en) * 2021-08-31 2021-09-28 杭州魔点科技有限公司 Living body detection method, electronic apparatus, and storage medium
CN113505722A (en) * 2021-07-23 2021-10-15 中山大学 In-vivo detection method, system and device based on multi-scale feature fusion

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281845A (en) * 2014-10-29 2015-01-14 中国科学院自动化研究所 Face recognition method based on rotation invariant dictionary learning model
CN105844223A (en) * 2016-03-18 2016-08-10 常州大学 Face expression algorithm combining class characteristic dictionary learning and shared dictionary learning
CN107194873A (en) * 2017-05-11 2017-09-22 南京邮电大学 Low-rank nuclear norm canonical facial image ultra-resolution method based on coupling dictionary learning
CN107832747A (en) * 2017-12-05 2018-03-23 广东技术师范学院 A kind of face identification method based on low-rank dictionary learning algorithm
US20180225807A1 (en) * 2016-12-28 2018-08-09 Shenzhen China Star Optoelectronics Technology Co., Ltd. Single-frame super-resolution reconstruction method and device based on sparse domain reconstruction
CN108985177A (en) * 2018-06-21 2018-12-11 南京师范大学 A kind of facial image classification method of the quick low-rank dictionary learning of combination sparse constraint
CN109766813A (en) * 2018-12-31 2019-05-17 陕西师范大学 Dictionary learning face identification method based on symmetrical face exptended sample
CN110428392A (en) * 2019-09-10 2019-11-08 哈尔滨理工大学 A kind of Method of Medical Image Fusion based on dictionary learning and low-rank representation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281845A (en) * 2014-10-29 2015-01-14 中国科学院自动化研究所 Face recognition method based on rotation invariant dictionary learning model
CN105844223A (en) * 2016-03-18 2016-08-10 常州大学 Face expression algorithm combining class characteristic dictionary learning and shared dictionary learning
US20180225807A1 (en) * 2016-12-28 2018-08-09 Shenzhen China Star Optoelectronics Technology Co., Ltd. Single-frame super-resolution reconstruction method and device based on sparse domain reconstruction
CN107194873A (en) * 2017-05-11 2017-09-22 南京邮电大学 Low-rank nuclear norm canonical facial image ultra-resolution method based on coupling dictionary learning
CN107832747A (en) * 2017-12-05 2018-03-23 广东技术师范学院 A kind of face identification method based on low-rank dictionary learning algorithm
CN108985177A (en) * 2018-06-21 2018-12-11 南京师范大学 A kind of facial image classification method of the quick low-rank dictionary learning of combination sparse constraint
CN109766813A (en) * 2018-12-31 2019-05-17 陕西师范大学 Dictionary learning face identification method based on symmetrical face exptended sample
CN110428392A (en) * 2019-09-10 2019-11-08 哈尔滨理工大学 A kind of Method of Medical Image Fusion based on dictionary learning and low-rank representation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505722A (en) * 2021-07-23 2021-10-15 中山大学 In-vivo detection method, system and device based on multi-scale feature fusion
CN113505722B (en) * 2021-07-23 2024-01-02 中山大学 Living body detection method, system and device based on multi-scale feature fusion
CN113449707A (en) * 2021-08-31 2021-09-28 杭州魔点科技有限公司 Living body detection method, electronic apparatus, and storage medium
CN113449707B (en) * 2021-08-31 2021-11-30 杭州魔点科技有限公司 Living body detection method, electronic apparatus, and storage medium

Also Published As

Publication number Publication date
CN111967331B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111460931B (en) Face spoofing detection method and system based on color channel difference image characteristics
Ye et al. Real-time no-reference image quality assessment based on filter learning
CN111160313B (en) Face representation attack detection method based on LBP-VAE anomaly detection model
Tereikovskyi et al. The method of semantic image segmentation using neural networks
CN108389189B (en) Three-dimensional image quality evaluation method based on dictionary learning
Zhong et al. DCT histogram optimization for image database retrieval
CN111967331B (en) Face representation attack detection method and system based on fusion feature and dictionary learning
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN114764939A (en) Heterogeneous face recognition method and system based on identity-attribute decoupling
Ma Improving SAR target recognition performance using multiple preprocessing techniques
Szankin et al. Influence of thermal imagery resolution on accuracy of deep learning based face recognition
CN117095471B (en) Face counterfeiting tracing method based on multi-scale characteristics
CN107133579A (en) Based on CSGF (2D)2The face identification method of PCANet convolutional networks
Nguyen et al. Convolution autoencoder-based sparse representation wavelet for image classification
CN112818774A (en) Living body detection method and device
JP3962517B2 (en) Face detection method and apparatus, and computer-readable medium
Bruckert et al. Deep learning for inter-observer congruency prediction
CN111242114A (en) Character recognition method and device
Khanna et al. Memorability‐based image compression
Raihan et al. CNN modeling for recognizing local fish
Du et al. Robust image hashing based on multi-view dimension reduction
CN110147824B (en) Automatic image classification method and device
CN111754459B (en) Dyeing fake image detection method based on statistical depth characteristics and electronic device
Alsandi Image splicing detection scheme using surf and mean-LBP based morphological operations
Mokalla Deep learning based face detection and recognition in MWIR and visible bands

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant