CN116189255A - Face living body detection method based on generation type domain adaptation - Google Patents

Face living body detection method based on generation type domain adaptation

Info

Publication number
CN116189255A
CN116189255A (application CN202211571186.7A)
Authority
CN
China
Prior art keywords
image
living body
body detection
loss
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211571186.7A
Other languages
Chinese (zh)
Inventor
杨海东
付传辉
李泽辉
杨标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202211571186.7A
Publication of CN116189255A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a face living body detection method based on generative domain adaptation, which comprises the following steps: building a living body detection model and a generator; performing intra-domain spectrum mixing on images of an unlabeled target domain to generate diversified target images alongside the original target images; stylizing both the diversified target images and the original target images into source style images through image conversion, with inter-domain neural statistic consistency adopted to guide the generator in generating the source style images; and inputting the source style images into the living body detection model to perform face living body detection and outputting the face living body detection result. The invention addresses the problems that existing face living body detection methods provide insufficient supervision on unlabeled target domains, and that most prior work focuses on aligning high-level semantic features while ignoring the low-level features of the face living body detection task.

Description

Face living body detection method based on generation type domain adaptation
Technical Field
The invention relates to the technical field of face living body detection, and in particular to a face living body detection method based on generative domain adaptation.
Background
Face anti-spoofing (FAS), i.e. face living body detection, aims to determine whether a face image comes from a real person or from one of various face presentation attacks. Early works addressed this problem with hand-crafted features such as SIFT, LBP and HOG. Several methods exploit information from different domains, such as the HSV and YCrCb color spaces, the temporal domain and the Fourier spectrum. Recent approaches use CNNs to model FAS as a binary classification problem or with additional supervision such as depth maps, reflection maps and rPPG signals. Other approaches employ decoupling and customized operators to improve performance. Although these methods achieve good results in intra-dataset training, their performance on the target domain still degrades significantly due to large domain shifts.
To improve performance under cross-domain settings, Domain Generalization (DG) has been introduced into FAS tasks. However, DG-FAS methods aim to map samples into a common feature space and lack specific information about the unseen domain, which inevitably leads to unsatisfactory results. Recent studies on UDA-FAS rely primarily on pseudo-labeling, adversarial learning, or minimizing domain discrepancy to reduce domain shift. However, they still draw insufficient supervision from the unlabeled target domain, which may lead to negative transfer of the source model. Furthermore, most of this work focuses on the alignment of high-level semantic features, while the low-level features critical to FAS tasks are ignored.
Disclosure of Invention
Aiming at the above defects, the invention provides a face living body detection method based on generative domain adaptation, which aims to solve the problems that existing face living body detection methods provide insufficient supervision on unlabeled target domains, and that most prior work focuses on aligning high-level semantic features while ignoring the low-level features of the face living body detection task.
To achieve this purpose, the invention adopts the following technical scheme:
A face living body detection method based on generative domain adaptation comprises the following steps:
Step S1: building a living body detection model and a generator;
Step S2: performing intra-domain spectrum mixing on images of the unlabeled target domain to generate diversified target images in addition to the original target images;
Step S3: stylizing both the diversified target images and the original target images into source style images through image conversion, with inter-domain neural statistic consistency adopted to guide the generator in generating the source style images;
Step S4: inputting the source style images into the living body detection model to perform face living body detection, and outputting the face living body detection result.
Preferably, in step S2, the following steps are specifically included:
step S21: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formula is as follows:
Figure BDA0003988114320000021
wherein ,F(xt ) (u, v) is the target image x t U and v are frequency variations in the frequency domain, H is the maximum height of the image, W is the maximum width of the image, H is the image height, and W is the image width;
step S22: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formulas are as follows:
A(x t )(u,v)=[R 2 (x t )(u,v)+I 2 (x t )(u,v)] 1/2
Figure BDA0003988114320000022
wherein ,A(xt ) (u, v) is amplitude, P (x) t ) (u, v) is the phase, R (x) t ) Is F (x) t ) Is the real part of I (x) t ) Is F (x) t ) Is the imaginary part of (2);
step S23: computing the target domain D from the same unlabeled target domain t Is an arbitrary image of two of (a)
Figure BDA0003988114320000031
The specific formula is as follows:
Figure BDA0003988114320000032
wherein ,
Figure BDA0003988114320000033
interpolation for mixed amplitude +.>
Figure BDA0003988114320000034
For image->
Figure BDA0003988114320000035
Amplitude interpolation of>
Figure BDA0003988114320000036
For image->
Figure BDA0003988114320000037
lambda-U (0, eta), the super-parameter eta controls the enhanced intensity;
step S24: combining the mixed amplitude spectrum with the original phase spectrum to reconstruct a new fourier representation:
Figure BDA0003988114320000038
wherein ,
Figure BDA0003988114320000039
for interpolating the image +.>
Figure BDA00039881143200000310
Is a two-dimensional discrete Fourier transform of>
Figure BDA00039881143200000311
For interpolating the image +.>
Figure BDA00039881143200000312
Mixed amplitude of>
Figure BDA00039881143200000313
For interpolating the image +.>
Figure BDA00039881143200000314
Is the original phase of (a);
step S25: will be
Figure BDA00039881143200000315
An interpolated image is generated by inverse fourier transform, and the specific formula is as follows:
Figure BDA00039881143200000316
wherein ,
Figure BDA00039881143200000317
for interpolating the image.
Preferably, in step S3, the inter-domain neural statistic consistency used to guide the generator to generate the source style images specifically comprises the following step: calculating the inter-domain gap L_stat, namely the inter-domain neural statistic consistency loss, expressed as follows:

L_{stat} = \frac{1}{L} \sum_{l=1}^{L} \left( \big\| \hat{\mu}_l - \bar{\mu}_l \big\|_2 + \big\| \hat{\sigma}_l^2 - \bar{\sigma}_l^2 \big\|_2 \right)

wherein l ∈ {1, 2, …, L} indexes the l-th layer of the source-trained model, which comprises the feature extractor, classifier and depth estimator, L is the total number of layers, \hat{\mu}_l denotes the running mean of the source style data, \hat{\sigma}_l^2 denotes the running variance of the source style data, \bar{\mu}_l denotes the stored mean of the source model, and \bar{\sigma}_l^2 denotes the stored variance of the source model.
Preferably, in step S3, the content is constrained by dual semantic consistency at the feature level and the image level, to ensure that semantic content is preserved during image conversion, specifically comprising the following steps:
Step S31: taking the generated source style image \hat{x}_{t \to s} and the original target image x_t as input, and imposing a perceptual loss L_per on the latent features of a VGG16 module pre-trained on ImageNet; the specific formula of the perceptual loss L_per is as follows:

L_{per}(\hat{x}_{t \to s}, x_t) = \sum_{j} \frac{1}{C_j H_j W_j} \big\| \phi_j(\hat{x}_{t \to s}) - \phi_j(x_t) \big\|_2^2

wherein L_{per}(\hat{x}_{t \to s}, x_t) is the perceptual loss between the source style image \hat{x}_{t \to s} and the original target image x_t, C_j is the number of channels of the j-th layer feature map, H_j is the height of the j-th layer feature map, W_j is the width of the j-th layer feature map, \phi_j(\hat{x}_{t \to s}) is the j-th layer convolutional feature of the source style image, and \phi_j(x_t) is the j-th layer convolutional feature of the original target image x_t;
Step S32: enhancing phase consistency between the original target image and the source style image by minimizing the semantic consistency loss L_ph; the specific formula of L_ph is as follows:

L_{ph} = - \sum_{j} \frac{\big\langle P(F(x_t)_j), \, P(F(\hat{x}_{t \to s})_j) \big\rangle}{\big\| P(F(x_t)_j) \big\|_2 \, \big\| P(F(\hat{x}_{t \to s})_j) \big\|_2}

wherein ⟨·,·⟩ is the dot product in two-dimensional space, ‖·‖_2 is the L2 norm, L_ph is the negative cosine distance between the original phase of the original target image and the generated phase of the source style image, x_t is the original target image, \hat{x}_{t \to s} is the source style image, F(x_t)_j is the Fourier transform of the j-th original target image x_t, and F(\hat{x}_{t \to s})_j is the Fourier transform of the j-th source style image \hat{x}_{t \to s}.
Preferably, the method further comprises the step of training the living body detection model, which specifically comprises the following steps:
Step S51: calculating entropy losses through the classifier and the depth estimator to obtain the classifier entropy loss and the depth estimator entropy loss, with the specific formulas as follows:

L_{ent1} = - \sum_{c=1}^{C} \hat{p}_c \log \hat{p}_c

L_{ent2} = - \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \hat{D}(h,w) \log \hat{D}(h,w)

wherein L_{ent1} is the classifier entropy loss, L_{ent2} is the depth estimator entropy loss, \hat{p} is the predicted label probability distribution of the source style image, c is the channel index and C is the total number of channels, H is the upper height limit of the image and W is the upper width limit of the image, (h, w) is any pixel of the source style image, and \hat{D}(h,w) is the estimated depth at that pixel of the source style image;
Step S52: adding the classifier entropy loss and the depth estimator entropy loss to obtain the total entropy loss, with the specific formula as follows:

L_{ent} = L_{ent1} + L_{ent2}

wherein L_{ent} is the total entropy loss.
Preferably, the total loss for generator parameter optimization is calculated from the total entropy loss of the living body detection model training, the perceptual loss, the semantic consistency loss, and the inter-domain neural statistic consistency loss, with the specific formula as follows:

L_{total} = L_{stat} + L_{per} + \lambda_{ent} L_{ent} + \lambda_{ph} L_{ph}

wherein L_{total} is the total loss, L_{ent} is the total entropy loss, \lambda_{ent} is the weighting coefficient of the total entropy loss, L_{ph} is the semantic consistency loss, \lambda_{ph} is the weighting coefficient of the semantic consistency loss, L_{stat} is the inter-domain neural statistic consistency loss, and L_{per} is the perceptual loss.
The technical scheme provided by the embodiments of the present application can have the following beneficial effects:
On the one hand, intra-domain spectrum mixing is adopted to expand the target data distribution, so that the model generalizes from the unlabeled target data to the unseen test subset of the target domain, avoiding the problem of insufficient supervision on the unlabeled target domain. On the other hand, inter-domain neural statistic consistency is adopted to guide the generator to generate source style images, the feature statistics of the target data and the feature statistics of the source style data are aligned at both high and low levels, and the inter-domain gap is effectively reduced.
Drawings
Fig. 1 is a step diagram of the face living body detection method based on generative domain adaptation.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
A face living body detection method based on generative domain adaptation comprises the following steps:
Step S1: building a living body detection model and a generator;
Step S2: performing intra-domain spectrum mixing on images of the unlabeled target domain to generate diversified target images in addition to the original target images;
Step S3: stylizing both the diversified target images and the original target images into source style images through image conversion, with inter-domain neural statistic consistency adopted to guide the generator in generating the source style images;
Step S4: inputting the source style images into the living body detection model to perform face living body detection, and outputting the face living body detection result.
In the face living body detection method based on generative domain adaptation, as shown in fig. 1, the first step is to build a living body detection model and a generator, which prepares for face living body detection and for generating source style images. The second step performs intra-domain spectrum mixing on images of the unlabeled target domain to generate diversified target images in addition to the original target images; specifically, for unlabeled target data, if the generator is trained only on the visible training subset of the target domain and never sees the unseen test subset, the image quality in the source style domain may be degraded. The third step stylizes both the diversified target images and the original target images into source style images through image conversion, and adopts inter-domain neural statistic consistency to guide the generator to generate the source style images; specifically, the scheme proposes inter-domain neural statistic consistency to guide the generator so that the feature statistics of the target data and the feature statistics of the source style data are aligned at both high and low levels, effectively reducing the inter-domain gap. The fourth step inputs the source style images into the living body detection model to perform face living body detection and outputs the face living body detection result; specifically, the face living body detection result obtained through the living body detection model has higher accuracy.
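As a purely illustrative sketch (not part of the patent text), the test-time flow of steps S3 and S4 could look roughly as follows in Python, assuming a trained generator and a source-trained living body detection model composed of a feature extractor, a classifier and a depth estimator; all module and function names are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def detect_face_liveness(x_t, generator, feature_extractor, classifier, depth_estimator):
    """Sketch of steps S3-S4 at inference time (assumed module names).

    x_t: target-domain face image tensor of shape (N, 3, H, W).
    Returns live/spoof probabilities and the auxiliary depth map.
    """
    # Step S3: stylize the target image into a source style image.
    x_t2s = generator(x_t)
    # Step S4: run the frozen source living body detection model on the source style image.
    feats = feature_extractor(x_t2s)
    probs = torch.softmax(classifier(feats), dim=1)  # binary live/spoof prediction
    depth = depth_estimator(feats)                   # auxiliary depth estimate
    return probs, depth
```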
According to this scheme, on the one hand, intra-domain spectrum mixing is adopted to expand the target data distribution, so that the model generalizes from the unlabeled target data to the unseen test subset of the target domain, avoiding the problem of insufficient supervision on the unlabeled target domain. On the other hand, inter-domain neural statistic consistency is adopted to guide the generator to generate source style images, the feature statistics of the target data and the feature statistics of the source style data are aligned at both high and low levels, and the inter-domain gap is effectively reduced.
Preferably, in step S2, the method specifically includes the following steps:
step S21: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formula is as follows:
Figure BDA0003988114320000071
wherein ,F(xt ) (u, v) is the target image x t U and v are frequency variations in the frequency domain, H is the maximum height of the image, W is the maximum width of the image, H is the image height, and W is the image width;
step S22: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formulas are as follows:
A(x t )(u,v)=[R 2 (x t )(u,v)+I 2 (x t )(u,v)] 1/2
Figure BDA0003988114320000072
wherein ,A(xt ) (u, v) is amplitude, P (x) t ) (u, v) is the phase, R (x) t ) Is F (x) t ) Is the real part of I (x) t ) Is F (x) t ) Is the imaginary part of (2);
step S23: computing the target domain D from the same unlabeled target domain t Is an arbitrary image of two of (a)
Figure BDA0003988114320000073
The specific formula is as follows:
Figure BDA0003988114320000081
wherein ,
Figure BDA0003988114320000082
interpolation for mixed amplitude +.>
Figure BDA0003988114320000083
For image->
Figure BDA0003988114320000084
Amplitude interpolation of>
Figure BDA0003988114320000085
For image->
Figure BDA0003988114320000086
lambda-U (0, eta), the super-parameter eta controls the enhanced intensity;
step S24: combining the mixed amplitude spectrum with the original phase spectrum to reconstruct a new fourier representation:
Figure BDA0003988114320000087
wherein ,
Figure BDA00039881143200000824
for interpolating the image +.>
Figure BDA0003988114320000089
Is a two-dimensional discrete Fourier transform of>
Figure BDA00039881143200000825
For interpolating the image +.>
Figure BDA00039881143200000811
Mixed amplitude of>
Figure BDA00039881143200000823
For interpolating the image +.>
Figure BDA00039881143200000813
Is the original phase of (a);
step S25: will be
Figure BDA00039881143200000822
An interpolated image is generated by inverse fourier transform, and the specific formula is as follows:
Figure BDA00039881143200000815
wherein ,
Figure BDA00039881143200000816
for interpolating the image.
In this embodiment, since the phase tends to preserve most of the content of a signal in its Fourier spectrum, while the amplitude mainly contains domain-specific styles, diversified images can be generated that preserve the content in a continuous frequency space while carrying new styles. Specifically, by performing the above steps, unseen target samples with new styles and the original content can be efficiently generated in a continuous frequency space, and by feeding these diversified images forward to the generator, the generalization ability across different subsets within the target domain can be further enhanced.
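As a minimal sketch of how steps S21 to S25 could be realized (an illustrative assumption, not the patent's reference implementation), the intra-domain spectrum mixing can be written per channel with NumPy as follows; the function name and the channel-wise loop are choices made here for clarity:

```python
import numpy as np

def intra_domain_spectrum_mix(x_i, x_j, eta=0.5):
    """Mix the Fourier amplitudes of two target-domain images (steps S21-S25).

    x_i, x_j: float arrays of shape (H, W, C) from the same unlabeled target domain.
    eta: upper bound of the mixing ratio lambda ~ U(0, eta).
    Returns an interpolated image with mixed amplitude and the original phase of x_i.
    """
    lam = np.random.uniform(0.0, eta)
    mixed = np.zeros_like(x_i)
    for c in range(x_i.shape[2]):
        # S21: 2-D discrete Fourier transform of each channel.
        f_i = np.fft.fft2(x_i[..., c])
        f_j = np.fft.fft2(x_j[..., c])
        # S22: amplitude and phase spectra.
        amp_i, pha_i = np.abs(f_i), np.angle(f_i)
        amp_j = np.abs(f_j)
        # S23: interpolate the amplitudes of the two images.
        amp_mix = (1.0 - lam) * amp_i + lam * amp_j
        # S24: recombine the mixed amplitude with the original phase.
        f_mix = amp_mix * np.exp(1j * pha_i)
        # S25: inverse Fourier transform back to the image domain.
        mixed[..., c] = np.real(np.fft.ifft2(f_mix))
    return mixed
```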
Preferably, in step S3, the inter-domain neural statistic consistency used to guide the generator to generate the source style images specifically comprises the following step: calculating the inter-domain gap L_stat, namely the inter-domain neural statistic consistency loss, expressed as follows:

L_{stat} = \frac{1}{L} \sum_{l=1}^{L} \left( \big\| \hat{\mu}_l - \bar{\mu}_l \big\|_2 + \big\| \hat{\sigma}_l^2 - \bar{\sigma}_l^2 \big\|_2 \right)

wherein l ∈ {1, 2, …, L} indexes the l-th layer of the source-trained model, which comprises the feature extractor, classifier and depth estimator, L is the total number of layers, \hat{\mu}_l denotes the running mean of the source style data, \hat{\sigma}_l^2 denotes the running variance of the source style data, \bar{\mu}_l denotes the stored mean of the source model, and \bar{\sigma}_l^2 denotes the stored variance of the source model.
Specifically, Batch Normalization (BN) normalizes each input feature within a mini-batch in a channel-wise manner, so that the mean μ of the features within the mini-batch becomes 0 and the variance σ² becomes 1; the specific formulas are as follows:

\mu = \frac{1}{B} \sum_{i=1}^{B} x_i, \qquad \sigma^2 = \frac{1}{B} \sum_{i=1}^{B} (x_i - \mu)^2, \qquad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

wherein B is the mini-batch size, x_i is an input feature, \hat{x}_i is the normalized feature, and \epsilon is a small constant for numerical stability.
Further, in training the living body detection model, the source statistics at step n+1 are updated by an exponential moving average, with the specific formulas as follows:

\bar{\mu}_s^{\,n+1} = (1 - \alpha)\, \bar{\mu}_s^{\,n} + \alpha\, \mu_s^{\,n}

(\bar{\sigma}_s^2)^{\,n+1} = (1 - \alpha)\, (\bar{\sigma}_s^2)^{\,n} + \alpha\, (\sigma_s^2)^{\,n}

wherein \bar{\mu}_s^{\,n+1} is the exponential moving mean of the source statistics at step n+1, (\bar{\sigma}_s^2)^{\,n+1} is the exponential moving variance of the source statistics at step n+1, \alpha is the update ratio, \bar{\mu}_s^{\,n} and (\bar{\sigma}_s^2)^{\,n} are the exponential moving mean and variance of the source statistics at step n, and \mu_s^{\,n} and (\sigma_s^2)^{\,n} are the mean and variance of the source statistics computed at step n.
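For illustration, the channel-wise mini-batch statistics and their exponential moving average update described above could be computed as in the following sketch (a generic rendering of the formulas under assumed tensor shapes, not code from the patent); `alpha` corresponds to the update ratio α:

```python
import torch

def update_source_statistics(features, running_mean, running_var, alpha=0.1, eps=1e-5):
    """Channel-wise BN statistics of a mini-batch and their EMA update.

    features: tensor of shape (B, C, H, W) produced by one layer of the source model.
    running_mean, running_var: stored source statistics of shape (C,).
    """
    # Mini-batch mean and variance per channel (the BN statistics mu and sigma^2).
    mu = features.mean(dim=(0, 2, 3))
    centered = features - mu[None, :, None, None]
    var = (centered ** 2).mean(dim=(0, 2, 3))
    # Normalized features (zero mean, unit variance per channel).
    normalized = centered / torch.sqrt(var[None, :, None, None] + eps)
    # Exponential moving average update of the stored source statistics.
    new_mean = (1.0 - alpha) * running_mean + alpha * mu
    new_var = (1.0 - alpha) * running_var + alpha * var
    return normalized, new_mean, new_var
```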
The neural statistics of source features stored in a well-trained living body detection model provide sufficient supervision for both low-level and high-level features, and they represent a domain-specific style that can be fully exploited to aid distribution alignment in UDA. However, conventional methods only use the output features of the higher layers for distribution alignment, and cannot fully use the abundant and discriminative liveness cues in the lower-layer features. Therefore, taking these stored BN statistics {(\bar{\mu}_l, \bar{\sigma}_l^2)}_{l=1}^{L} into account, the source style data can be easily estimated. This embodiment proposes the inter-domain neural statistic consistency loss L_stat to match the running mean \hat{\mu}_l and running variance \hat{\sigma}_l^2 of the source style data with the stored statistics \bar{\mu}_l and \bar{\sigma}_l^2 of the source model, thereby bridging the inter-domain gap. Under the guidance of the loss L_stat, this embodiment can approximate a source style domain having a style similar to that of the source domain. Unlike existing methods that generate image content from input random noise, the neural statistic consistency of this embodiment uses BN statistic alignment as a constraint to stylize the input image without changing its content.
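A rough PyTorch sketch of the inter-domain neural statistic consistency loss L_stat is given below; it assumes the source style feature statistics have already been collected for each layer (for example with forward hooks) and compares them with the BatchNorm statistics stored in the frozen source model. The helper and variable names are illustrative assumptions, not identifiers from the patent:

```python
import torch
import torch.nn as nn

def neural_statistic_consistency_loss(model, style_means, style_vars):
    """L_stat: match source style feature statistics to the stored BN statistics.

    model: frozen source-trained network containing nn.BatchNorm2d layers.
    style_means, style_vars: lists of per-layer channel statistics (mu_hat_l, sigma_hat_l^2)
    computed on the generated source style images, ordered like the BN layers of `model`.
    """
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    loss = 0.0
    for bn, mu_hat, var_hat in zip(bn_layers, style_means, style_vars):
        # Stored statistics of the source model play the role of (mu_bar_l, sigma_bar_l^2).
        loss = loss + torch.norm(mu_hat - bn.running_mean, p=2) \
                    + torch.norm(var_hat - bn.running_var, p=2)
    return loss / max(len(bn_layers), 1)
```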
Preferably, in step S3, the content is constrained by dual semantic consistency at the feature level and the image level, to ensure that semantic content is preserved during image conversion, specifically comprising the following steps:
Step S31: taking the generated source style image \hat{x}_{t \to s} and the original target image x_t as input, and imposing a perceptual loss L_per on the latent features of a VGG16 module pre-trained on ImageNet; the specific formula of the perceptual loss L_per is as follows:

L_{per}(\hat{x}_{t \to s}, x_t) = \sum_{j} \frac{1}{C_j H_j W_j} \big\| \phi_j(\hat{x}_{t \to s}) - \phi_j(x_t) \big\|_2^2

wherein L_{per}(\hat{x}_{t \to s}, x_t) is the perceptual loss between the source style image \hat{x}_{t \to s} and the original target image x_t, C_j is the number of channels of the j-th layer feature map, H_j is the height of the j-th layer feature map, W_j is the width of the j-th layer feature map, \phi_j(\hat{x}_{t \to s}) is the j-th layer convolutional feature of the source style image, and \phi_j(x_t) is the j-th layer convolutional feature of the original target image x_t;
Step S32: enhancing phase consistency between the original target image and the source style image by minimizing the semantic consistency loss L_ph; the specific formula of L_ph is as follows:

L_{ph} = - \sum_{j} \frac{\big\langle P(F(x_t)_j), \, P(F(\hat{x}_{t \to s})_j) \big\rangle}{\big\| P(F(x_t)_j) \big\|_2 \, \big\| P(F(\hat{x}_{t \to s})_j) \big\|_2}

wherein ⟨·,·⟩ is the dot product in two-dimensional space, ‖·‖_2 is the L2 norm, L_ph is the negative cosine distance between the original phase of the original target image and the generated phase of the source style image, x_t is the original target image, \hat{x}_{t \to s} is the source style image, F(x_t)_j is the Fourier transform of the j-th original target image x_t, and F(\hat{x}_{t \to s})_j is the Fourier transform of the j-th source style image \hat{x}_{t \to s}.
In this embodiment, at the feature level, the generated source style image \hat{x}_{t \to s} and the original target image x_t are taken as input, and the perceptual loss L_per is imposed on the latent features of a VGG16 module pre-trained on ImageNet, thereby reducing the perceptual difference between them.
However, using this perceptual loss in the spatial domain alone is not sufficient to ensure semantic consistency. A Fourier-domain transfer from one domain to another affects only the amplitude of the spectrum and not its phase; since the phase component retains most of the content of the original signal while the amplitude component mainly contains style, semantic inconsistency can be explicitly penalized by ensuring that the phase is preserved before and after image conversion. Therefore, by minimizing the semantic consistency loss L_ph, i.e. minimizing the image-level difference between the source style image \hat{x}_{t \to s} and the original target image x_t over the Fourier spectrum, phase consistency is maintained.
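The dual semantic consistency constraints could be sketched as follows; the VGG16 feature extraction shown here is a simplified stand-in (a few intermediate activations of a pre-trained VGG16), and the chosen layer indices are assumptions rather than a configuration fixed by the patent:

```python
import torch
import torch.nn.functional as F

def perceptual_loss(x_styled, x_t, vgg_features, layer_ids=(3, 8, 15)):
    """L_per: feature-level difference on a VGG16 pre-trained on ImageNet (layer_ids assumed)."""
    loss, h_t, h_s = 0.0, x_t, x_styled
    for idx, layer in enumerate(vgg_features):
        h_t, h_s = layer(h_t), layer(h_s)
        if idx in layer_ids:
            # Mean squared feature difference, i.e. the per-layer term averaged over the batch.
            loss = loss + F.mse_loss(h_s, h_t)
    return loss

def phase_consistency_loss(x_styled, x_t):
    """L_ph: negative cosine similarity between the phases of the two Fourier spectra."""
    pha_t = torch.angle(torch.fft.fft2(x_t))
    pha_s = torch.angle(torch.fft.fft2(x_styled))
    return -F.cosine_similarity(pha_t.flatten(1), pha_s.flatten(1), dim=1).mean()

# Example setup (assumed):
# from torchvision.models import vgg16
# vgg_features = vgg16(weights="IMAGENET1K_V1").features.eval()  # frozen feature extractor
```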
Preferably, the method further comprises the step of training the living body detection model, which specifically comprises the following steps:
Step S51: calculating entropy losses through the classifier and the depth estimator to obtain the classifier entropy loss and the depth estimator entropy loss, with the specific formulas as follows:

L_{ent1} = - \sum_{c=1}^{C} \hat{p}_c \log \hat{p}_c

L_{ent2} = - \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \hat{D}(h,w) \log \hat{D}(h,w)

wherein L_{ent1} is the classifier entropy loss, L_{ent2} is the depth estimator entropy loss, \hat{p} is the predicted label probability distribution of the source style image, c is the channel index and C is the total number of channels, H is the upper height limit of the image and W is the upper width limit of the image, (h, w) is any pixel of the source style image, and \hat{D}(h,w) is the estimated depth at that pixel of the source style image;
Step S52: adding the classifier entropy loss and the depth estimator entropy loss to obtain the total entropy loss, with the specific formula as follows:

L_{ent} = L_{ent1} + L_{ent2}

wherein L_{ent} is the total entropy loss.
In this embodiment, the entropy loss of the classifier and the depth estimator is calculated, so that the prediction distribution obtained by the living body detection model is as close as possible to the actual distribution of the data.
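The entropy losses of steps S51 and S52 could be implemented, for instance, as in the following sketch (assumed tensor shapes and value ranges; not the patent's reference code):

```python
import torch

def total_entropy_loss(class_probs, depth_map, eps=1e-8):
    """L_ent = L_ent1 + L_ent2 on predictions for source style images.

    class_probs: softmax probabilities of shape (N, C) from the classifier.
    depth_map: per-pixel depth estimates of shape (N, 1, H, W), assumed in (0, 1).
    """
    # L_ent1: Shannon entropy of the classifier's label probability distribution.
    l_ent1 = -(class_probs * torch.log(class_probs + eps)).sum(dim=1).mean()
    # L_ent2: pixel-wise entropy of the depth estimator's output, averaged over H*W.
    d = depth_map.clamp(eps, 1.0 - eps)
    l_ent2 = -(d * torch.log(d)).flatten(1).mean(dim=1).mean()
    return l_ent1 + l_ent2
```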
Preferably, the total loss for generator parameter optimization is calculated from the total entropy loss of the living body detection model training, the perceptual loss, the semantic consistency loss, and the inter-domain neural statistic consistency loss, with the specific formula as follows:

L_{total} = L_{stat} + L_{per} + \lambda_{ent} L_{ent} + \lambda_{ph} L_{ph}

wherein L_{total} is the total loss, L_{ent} is the total entropy loss, \lambda_{ent} is the weighting coefficient of the total entropy loss, L_{ph} is the semantic consistency loss, \lambda_{ph} is the weighting coefficient of the semantic consistency loss, L_{stat} is the inter-domain neural statistic consistency loss, and L_{per} is the perceptual loss.
In this embodiment, the parameters of the feature extractor, the classifier, the depth estimator, and the VGG16 module of the pre-trained living body detection model are fixed, so the generated source style image can be further improved only by optimizing the parameters of the generator. The losses computed during this optimization are progressively reduced, which effectively improves the quality of the source style image.
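Putting the pieces together, one generator update under the total loss could look roughly like the sketch below; the weighting coefficients, optimizer settings and the `losses` dictionary of callables are illustrative assumptions standing in for the loss functions sketched earlier:

```python
import torch

def generator_step(x_t, x_div, generator, optimizer, losses, lambda_ent=1.0, lambda_ph=1.0):
    """One optimization step of the generator; the living body detection model and VGG16 stay frozen.

    x_t: original target images; x_div: diversified target images from spectrum mixing.
    losses: dict of callables {"stat", "per", "ent", "ph"} computing the individual loss terms.
    """
    x_in = torch.cat([x_t, x_div], dim=0)
    x_styled = generator(x_in)                     # target -> source style images
    l_total = (losses["stat"](x_styled)
               + losses["per"](x_styled, x_in)
               + lambda_ent * losses["ent"](x_styled)
               + lambda_ph * losses["ph"](x_styled, x_in))
    optimizer.zero_grad()
    l_total.backward()                             # gradients flow only into the generator
    optimizer.step()
    return l_total.item()
```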
Furthermore, functional units in various embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations of the above embodiments may be made by those skilled in the art within the scope of the invention.

Claims (6)

1. A face living body detection method based on generation domain adaptation is characterized in that: the method comprises the following steps:
Step S1: building a living body detection model and a generator;
Step S2: performing intra-domain spectrum mixing on images of the unlabeled target domain to generate diversified target images in addition to the original target images;
Step S3: stylizing both the diversified target images and the original target images into source style images through image conversion, with inter-domain neural statistic consistency adopted to guide the generator in generating the source style images;
Step S4: inputting the source style images into the living body detection model to perform face living body detection, and outputting the face living body detection result.
2. The face living body detection method based on the generated domain adaptation according to claim 1, wherein: in step S2, the method specifically includes the following steps:
step S21: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formula is as follows:
Figure FDA0003988114310000011
wherein ,F(xt ) (u, v) is the target image x t U and v are frequency variations in the frequency domain, H is the maximum height of the image, W is the maximum width of the image, H is the image height, and W is the image width;
step S22: calculating a target image x t ∈D t Fourier transform F (x) t ) The specific formulas are as follows:
A(x t )(u,v)=[R 2 (x t )(u,v)+I 2 (x t )(u,v)] 1/2
Figure FDA0003988114310000012
wherein ,A(xt ) (u, v) is amplitude, P (x) t ) (u, v) is the phase, R (x) t ) Is F (x) t ) Is the real part of I (x) t ) Is F (x) t ) Is the imaginary part of (2);
step S23: computing the target domain D from the same unlabeled target domain t Is an arbitrary image of two of (a)
Figure FDA0003988114310000021
The specific formula is as follows:
Figure FDA0003988114310000022
wherein ,
Figure FDA0003988114310000023
interpolation for mixed amplitude +.>
Figure FDA0003988114310000024
For image->
Figure FDA0003988114310000025
Amplitude interpolation of>
Figure FDA0003988114310000026
For image->
Figure FDA0003988114310000027
lambda-U (0, eta), the super-parameter eta controls the enhanced intensity;
step S24: combining the mixed amplitude spectrum with the original phase spectrum to reconstruct a new fourier representation:
Figure FDA0003988114310000028
wherein ,
Figure FDA0003988114310000029
for interpolating the image +.>
Figure FDA00039881143100000210
Is a two-dimensional discrete Fourier transform of>
Figure FDA00039881143100000211
For interpolating the image +.>
Figure FDA00039881143100000212
Mixed amplitude of>
Figure FDA00039881143100000213
For interpolating the image +.>
Figure FDA00039881143100000214
Is the original phase of (a);
step S25: will be
Figure FDA00039881143100000215
An interpolated image is generated by inverse fourier transform, and the specific formula is as follows:
Figure FDA00039881143100000216
wherein ,
Figure FDA00039881143100000217
for interpolating the image. />
3. The face living body detection method based on the generated domain adaptation according to claim 1, wherein: in step S3, the inter-domain neural statistic consistency used to guide the generator to generate the source style images specifically comprises the following step: calculating the inter-domain gap L_stat, namely the inter-domain neural statistic consistency loss, expressed as follows:

L_{stat} = \frac{1}{L} \sum_{l=1}^{L} \left( \big\| \hat{\mu}_l - \bar{\mu}_l \big\|_2 + \big\| \hat{\sigma}_l^2 - \bar{\sigma}_l^2 \big\|_2 \right)

wherein l ∈ {1, 2, …, L} indexes the l-th layer of the source-trained model, which comprises the feature extractor, classifier and depth estimator, L is the total number of layers, \hat{\mu}_l denotes the running mean of the source style data, \hat{\sigma}_l^2 denotes the running variance of the source style data, \bar{\mu}_l denotes the stored mean of the source model, and \bar{\sigma}_l^2 denotes the stored variance of the source model.
4. A face living body detection method based on generated domain adaptation according to claim 3, wherein: in step S3, the content is constrained by adopting dual semantic consistency of the feature level and the image level to ensure that the semantic content is preserved in the image conversion process, which specifically comprises the following steps:
step S31: source style image to be generated
Figure FDA0003988114310000031
And original target image x t As input, a perceptual penalty L is imposed on the potential features of the pre-trained VGG16 module on ImageNet per Perception loss L per The specific formula of (2) is as follows:
Figure FDA0003988114310000032
wherein ,
Figure FDA0003988114310000033
image +.>
Figure FDA0003988114310000034
And original target image x t Perceived loss between C j The number of channels of the j-th layer of the characteristic diagram is H j For the height of the j-th layer of the feature map, W j For the width of the j-th layer of the feature map, +.>
Figure FDA0003988114310000035
Image +.>
Figure FDA0003988114310000036
Convolution of layer j,>
Figure FDA0003988114310000037
for the original target image x t Convolution of layer j;
step S32: by minimizing semantic consistency loss L ph To enhance phase consistency between original target image and source pattern image, L ph The specific formula is as follows:
Figure FDA0003988114310000038
wherein,<,>is the dot product of the two-dimensional space, I.I 2 Is the L2 norm and is used to determine,
Figure FDA0003988114310000039
for the negative cosine distance, x, between the original phase of the original target image and the generated phase of the source pattern image t For the original target image->
Figure FDA00039881143100000310
For source pattern image, F (x t ) j For the jth original target image x t Fourier transform of->
Figure FDA00039881143100000311
For the j-th source pattern image +.>
Figure FDA00039881143100000312
Is a fourier transform of (a).
5. The face living body detection method based on the generated domain adaptation according to claim 4, wherein the face living body detection method is characterized in that: the method also comprises the step of training a living body detection model, and specifically comprises the following steps of:
step S51: calculating entropy loss through a classifier and a depth estimator to obtain the entropy loss of the classifier and the entropy loss of the depth estimator, wherein the specific formula is as follows:
Figure FDA0003988114310000041
/>
Figure FDA0003988114310000042
wherein ,Lent1 For classifier entropy loss, L ent2 For the depth estimator entropy loss,
Figure FDA0003988114310000043
the label probability distribution for a source pattern image, C is the channel, H is the height of the image, W is the width of the image, C is the total number of channels, H is the upper height limit of the image, W is the upper width limit of the image,/->
Figure FDA0003988114310000044
For any pixel of the source pattern image, +.>
Figure FDA0003988114310000045
Estimating the depth of any pixel on the source pattern image;
step S52: adding the entropy loss of the classifier and the entropy loss of the depth estimator to obtain the total entropy loss, wherein the specific formula is as follows:
L ent =L ent1 +L ent2
wherein ,Lent Is the total entropy loss.
6. The face living body detection method based on the generated domain adaptation according to claim 5, wherein: the total loss for generator parameter optimization is calculated from the total entropy loss of the living body detection model training, the perceptual loss, the semantic consistency loss, and the inter-domain neural statistic consistency loss, with the specific formula as follows:

L_{total} = L_{stat} + L_{per} + \lambda_{ent} L_{ent} + \lambda_{ph} L_{ph}

wherein L_{total} is the total loss, L_{ent} is the total entropy loss, \lambda_{ent} is the weighting coefficient of the total entropy loss, L_{ph} is the semantic consistency loss, \lambda_{ph} is the weighting coefficient of the semantic consistency loss, L_{stat} is the inter-domain neural statistic consistency loss, and L_{per} is the perceptual loss.
CN202211571186.7A 2022-12-08 2022-12-08 Face living body detection method based on generation type domain adaptation Pending CN116189255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211571186.7A CN116189255A (en) 2022-12-08 2022-12-08 Face living body detection method based on generation type domain adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211571186.7A CN116189255A (en) 2022-12-08 2022-12-08 Face living body detection method based on generation type domain adaptation

Publications (1)

Publication Number Publication Date
CN116189255A true CN116189255A (en) 2023-05-30

Family

ID=86437343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211571186.7A Pending CN116189255A (en) 2022-12-08 2022-12-08 Face living body detection method based on generation type domain adaptation

Country Status (1)

Country Link
CN (1) CN116189255A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456309A (en) * 2023-12-20 2024-01-26 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cross-domain target identification method based on intermediate domain guidance and metric learning constraint
CN117456309B (en) * 2023-12-20 2024-03-15 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cross-domain target identification method based on intermediate domain guidance and metric learning constraint


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination