CN114663936A - Face counterfeiting detection method based on uncertainty perception level supervision - Google Patents

Face counterfeiting detection method based on uncertainty perception level supervision

Info

Publication number
CN114663936A
CN114663936A (application CN202210167833.1A)
Authority
CN
China
Prior art keywords
network
face image
image sample
level
supervision
Prior art date
Legal status
Pending
Application number
CN202210167833.1A
Other languages
Chinese (zh)
Inventor
鲁继文
周杰
于炳耀
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210167833.1A priority Critical patent/CN114663936A/en
Publication of CN114663936A publication Critical patent/CN114663936A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The face forgery detection method, device and storage medium based on uncertainty perception level supervision acquire a plurality of different face image samples and extract a feature map of each face image sample through a convolutional neural network and an upsampling layer. The feature map of each face image sample is input into a multi-level supervision network to obtain a prediction result of each level of supervision network, a calculation result corresponding to each level of supervision network is obtained with a preset loss function, and a target multi-level supervision network is obtained by training according to a network gradient descent algorithm and the calculation result corresponding to each level of supervision network. The target multi-level supervision network then predicts a face image to be detected, and whether the face image to be detected is a forged face is judged according to the prediction result. The method provided by the application improves the robustness and generalization of the network and improves the accuracy of the detection result.

Description

Face counterfeiting detection method based on uncertainty perception level supervision
Technical Field
The present application relates to the field of computer vision and machine learning technologies, and in particular, to a method and an apparatus for detecting face forgery based on uncertainty perception level supervision, and a storage medium.
Background
With the continuous development of deep generative models, more and more face editing methods allow a user to arbitrarily edit face attributes and even directly change the identity in a face image. These methods can produce images so realistic that even the human eye cannot reliably distinguish them. Meanwhile, photos and even videos of other people can be easily obtained through highly developed internet technologies and social networks, providing abundant material for face editing methods, so that visual misinformation and abuse of face editing technology may cause a serious trust crisis, for example when a face recognition system is attacked with forged images. Therefore, an effective face forgery detection method is required to detect whether a given face image has been edited.
In the related art, face forgery detection is performed by combining different kinds of additional information, prior knowledge, and convolutional neural network models, for example in texture-based detection methods. However, the related art mainly focuses on the whole image or on local regions within the image, and the mask labels of face data carry data uncertainty, so that robustness and generalization are insufficient and the accuracy of the detection results is reduced.
Disclosure of Invention
The application provides a face forgery detection method and device based on uncertainty perception level supervision and a storage medium, which at least solve the technical problems in the related art of insufficient robustness and generalization and low accuracy of detection results.
An embodiment of a first aspect of the present application provides a face forgery detection method based on uncertainty perception level supervision, including:
acquiring a plurality of different face image samples;
extracting a feature map of each face image sample in the plurality of different face image samples through a convolutional neural network and an upsampling layer;
inputting the feature map of each face image sample into each level of supervision network in a multi-level supervision network for prediction to obtain a prediction result of each level of supervision network, wherein the multi-level supervision network comprises a plurality of supervision networks of different levels;
calculating the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network;
training each level of supervision network in the multi-level supervision network according to a network gradient descent algorithm and a calculation result corresponding to each level of supervision network to obtain a target multi-level supervision network;
and predicting the face image to be detected by using the target multi-level supervision network, and judging whether the face image to be detected is a forged face according to a prediction result.
The embodiment of the second aspect of the present application provides a face forgery detection apparatus based on uncertainty perception level supervision, including:
the acquisition module is used for acquiring a plurality of different face image samples;
the extraction module is used for extracting a feature map of each face image sample in the plurality of different face image samples through a convolutional neural network and an upsampling layer;
the prediction module is used for inputting the feature map of each face image sample into each hierarchy supervision network in a multilevel supervision network to predict to obtain the prediction result of each hierarchy supervision network, wherein the multilevel supervision network comprises a plurality of different levels of supervision networks;
the calculation module is used for calculating the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network;
the training module is used for training each hierarchy supervision network in the multilevel supervision networks according to a network gradient descent algorithm and a calculation result corresponding to each hierarchy supervision network to obtain a target multilevel supervision network;
and the judging module is used for predicting the face image to be detected by utilizing the target multi-level supervision network and judging whether the face image to be detected is a forged face or not according to a prediction result.
An embodiment of the third aspect of the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to the first aspect above.
A computer device according to an embodiment of a fourth aspect of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to the first aspect is implemented.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the application provides a face forgery detection method, a device and a storage medium based on uncertainty perception level supervision, which obtains a plurality of different face image samples, extracts a characteristic diagram of each face image sample in the plurality of different face image samples through a convolution neural network and an upper sampling layer, inputs the characteristic diagram of each face image sample into each level supervision network in a multi-level supervision network for forecasting to obtain a forecasting result of each level supervision network, wherein the multi-level supervision network comprises a plurality of different levels of supervision networks, calculates the forecasting result of each level supervision network by using a preset loss function to obtain a calculating result corresponding to each level supervision network, trains each level supervision network in the multi-level supervision network according to a network gradient descent algorithm and the calculating result corresponding to each level supervision network to obtain a target multi-level supervision network, and predicting the face image to be detected by using a target multi-level surveillance network, and judging whether the face image to be detected is a forged face or not according to a prediction result. The method and the device have the advantages that the whole network structure is assisted through the binary mask label of the face image, the robustness and the generalization of the network are improved through a hierarchical supervision method, meanwhile, the uncertainty of data naturally carried by the mask label is processed through an uncertainty estimation method, the characteristics of the image are effectively extracted through a self-attention transformation network, and therefore the accuracy of a detection result is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a face forgery detection method based on uncertainty perception hierarchical supervision according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a face forgery detection apparatus based on uncertainty perception hierarchical supervision according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a face forgery detection method and apparatus based on uncertainty perception hierarchical supervision according to an embodiment of the present application with reference to the drawings.
Example one
Fig. 1 is a schematic flowchart of a face forgery detection method based on uncertainty perception hierarchical supervision according to an embodiment of the present application, and as shown in fig. 1, the method may include:
step 101, obtaining a plurality of different face image samples.
In this embodiment, the number of face image samples obtained each time may be the same; for example, 120 different face image samples may be obtained each time.
And, in the embodiment of the present invention, the face image samples may be obtained in the form of a matrix; for example, one obtained face image sample is I ∈ R^(H×W×3).
And 102, extracting a feature map of each face image sample in a plurality of different face image samples through a convolutional neural network and an upsampling layer.
In an embodiment of the present invention, a method for extracting a feature map of each face image sample in a plurality of different face image samples through a convolutional neural network and an upsampling layer may include the following steps:
step a, extracting an initial feature map of each face image sample in a plurality of different face image samples through a convolutional network.
In the embodiment of the present invention, the initial feature map extracted from one face image sample is denoted F0.
And b, increasing the resolution of the initial feature map of each face image sample by adopting a plurality of continuous up-sampling convolution blocks to obtain the feature map of each face image sample.
Wherein, in the embodiment of the invention, the size of the initial feature map F0 of the face image sample is smaller than the size of the mask label. Based on this, a plurality of consecutive upsampling convolution blocks are adopted to increase the resolution of the initial feature map of each face image sample, so as to obtain the feature map of each face image sample. For example, in the embodiment of the present invention, the resolution of the initial feature map of each face image sample may be increased by using three consecutive upsampling convolution blocks, so as to obtain the feature map F ∈ R^(h×w×c) of each face image sample.
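The feature extraction of step 102 can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumptions: the backbone layers, channel widths and strides below are chosen for the example only, since the embodiment merely requires a convolutional network followed by several (e.g. three) consecutive upsampling convolution blocks.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Convolutional backbone + three consecutive upsampling convolution blocks (step 102)."""
    def __init__(self, out_ch=128):
        super().__init__()
        # Backbone: produces the initial feature map F0 at 1/16 of the input resolution.
        self.backbone = nn.Sequential(
            conv_block(3, 32, stride=2),
            conv_block(32, 64, stride=2),
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
        )
        # Three upsampling convolution blocks increase the resolution of F0 by 8x.
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(256, 256),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(256, 128),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(128, out_ch),
        )

    def forward(self, img):        # img: B x 3 x H x W, i.e. I in R^(H×W×3) per sample
        f0 = self.backbone(img)    # initial feature map F0
        return self.up(f0)         # feature map F in R^(h×w×c)

feat = FeatureExtractor()(torch.randn(2, 3, 256, 256))   # -> torch.Size([2, 128, 128, 128])
```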
And 103, inputting the feature map of each face image sample into each level of supervision network in the multi-level supervision network for prediction to obtain the prediction result of each level of supervision network.
In the embodiment of the invention, the multi-level supervision network comprises a plurality of supervision networks of different levels. Specifically, in the embodiment of the present invention, the multi-level supervision network may include a pixel-level supervision network, a region-level supervision network, and an image-level supervision network.
And the method for inputting the feature map of each face image sample into each level of supervision network in the multi-level supervision network to predict and obtain the prediction result of each level of supervision network can comprise the following steps:
step 1031, inputting the feature map of each face image sample into a pixel-level supervision network for prediction to obtain a pixel-level prediction result of each face image sample.
In the embodiment of the invention, inputting the feature map of each face image sample into the pixel-level supervision network to obtain the pixel-level prediction result of each face image sample comprises predicting the pixel-level prediction result of each face image sample through a plurality of convolutional neural networks.
And, in an embodiment of the present invention, the pixel-level prediction result of each face image sample obtained through step 1031 includes a prediction result of each pixel in each face image sample.
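As an illustration of the pixel-level branch, a small stack of convolution layers mapping the feature map to a per-pixel forgery probability could look as follows; the number of layers and channel widths are assumptions, since the embodiment only states that a plurality of convolutional neural networks are used.

```python
import torch.nn as nn

# Pixel-level supervision head: maps the feature map F (B x c x h x w) to a
# per-pixel prediction (B x 1 x h x w); the sigmoid output is the forgery
# probability of each pixel. Layer count and widths are illustrative only.
pixel_head = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, kernel_size=1),
    nn.Sigmoid(),
)
```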
And step 1032, inputting the feature map of each face image sample into an area-level supervision network for prediction to obtain an area-level prediction result of each face image sample.
In the embodiment of the present invention, the method for inputting the feature map of each face image sample into the area-level supervision network to predict and obtain the area-level prediction result of each face image sample may include the following steps:
step one, obtaining a normalized variance map corresponding to the feature map of each face image sample by using a Sigmoid function.
In the embodiment of the invention, a normalized variance map S(σ) corresponding to the feature map of each face image sample is obtained by using a Sigmoid function.
And step two, obtaining an uncertainty perception feature map of each face image sample based on the normalized variance map of each face image sample.
In the embodiment of the invention, an uncertainty perception characteristic map of each face image sample is obtained through a first formula, wherein the first formula is as follows:
Fu = [F | F ⊙ (1 - S(σ))]
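The first formula concatenates the original feature map with a copy that is attenuated wherever the predicted uncertainty is high. A minimal sketch, assuming σ is a per-pixel variance map predicted by another branch of the network:

```python
import torch

def uncertainty_aware_features(F, sigma):
    """Fu = [F | F * (1 - S(sigma))]: concatenate F with F weighted by (1 - normalized variance)."""
    S = torch.sigmoid(sigma)                      # normalized variance map S(sigma), B x 1 x h x w
    return torch.cat([F, F * (1.0 - S)], dim=1)   # uncertainty perception feature map, B x 2c x h x w

F = torch.randn(2, 128, 64, 64)
sigma = torch.randn(2, 1, 64, 64)
Fu = uncertainty_aware_features(F, sigma)         # -> torch.Size([2, 256, 64, 64])
```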
and step three, mapping the uncertainty perception feature map of each face image sample through a linear mapping layer to obtain the implicit vector characterization of each face image sample.
In the embodiment of the present invention, before mapping the uncertainty perception feature map of each face image sample through the linear mapping layer, the uncertainty perception feature map Fu of each face image sample needs to be serialized.
Specifically, in the embodiment of the invention, the method for serializing the uncertainty perception feature map Fu of each face image sample comprises cutting Fu into M small blocks, wherein each small block is a square of the same size, and stretching each small block in 2D to obtain the vector Ft of each face image sample; then, a learnable linear mapping layer is used to map the vector Ft of each face image sample to the implicit vector characterization E0.
And step four, obtaining the region-level characterization of each face image sample by using the implicit vector characterization of each face image sample.
In the embodiment of the invention, the implicit vector characterization of each face image sample and the learnable position code of each face image sample are directly added to obtain the complete region-level characterization z0 of each face image sample, so as to maintain the position information between the regions of each face image sample.
Specifically, in the embodiment of the invention, the complete region-level characterization z0 of each face image sample is:
z0 = FtE0 + Epos,
where Epos is the learnable position code of each face image sample.
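Steps three and four can be sketched together: cut Fu into M equal square patches, flatten them, apply a learnable linear mapping, prepend a class token and add a learnable position code. Patch size, embedding dimension and feature-map size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Serialize Fu into M patches, map them to E0 and form z0 = FtE0 + Epos."""
    def __init__(self, in_ch=256, patch=8, feat_hw=64, dim=256):
        super().__init__()
        self.patch = patch
        num_patches = (feat_hw // patch) ** 2                  # M
        self.proj = nn.Linear(in_ch * patch * patch, dim)      # learnable linear mapping layer
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # class token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # learnable position code Epos

    def forward(self, Fu):                                     # Fu: B x C x h x w
        B, C, h, w = Fu.shape
        p = self.patch
        x = Fu.unfold(2, p, p).unfold(3, p, p)                 # B x C x h/p x w/p x p x p
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)   # B x M x (C*p*p): the vectors Ft
        tokens = self.proj(x)                                  # implicit vector characterization E0
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        return tokens + self.pos                               # complete region-level characterization z0

z0 = RegionEmbedding()(torch.randn(2, 256, 64, 64))            # -> torch.Size([2, 65, 256])
```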
And fifthly, inputting the area level characteristics and the category marks of each face image sample into the self-attention transformation network layer to obtain an area level prediction result of each face image sample.
In an embodiment of the present invention, the self-attention transformation network layer may include an L-layer MSA (Multi-head Self-Attention) module and an MLP (Multi-Layer Perceptron).
And, in an embodiment of the invention, the output of the l-th self-attention transformation network layer is:
z'l = MSA(LN(zl-1)) + zl-1
zl = MLP(LN(z'l)) + z'l
wherein LN(·) represents the layer regularization operation, z'l is the output variable of the corresponding multi-head self-attention module, and zl-1 and zl respectively represent the input and output encoded representations of the different self-attention transformation network layers.
Further, in embodiments of the present invention, after L self-attention transformation network layers, an encoded sequence output [Tcls, T1, ..., TM] may be obtained, where Tcls denotes the class token and T1, ..., TM are the region-level prediction results of each face image sample.
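A minimal sketch of one such self-attention transformation network layer, implementing the two equations above with layer normalization and residual connections; the embedding dimension, number of heads, MLP ratio and depth L are assumptions for illustration.

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """z'l = MSA(LN(zl-1)) + zl-1 ;  zl = MLP(LN(z'l)) + z'l."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]   # multi-head self-attention + residual
        return z + self.mlp(self.ln2(z))                   # MLP + residual

# Stacking L layers; splitting the output gives [Tcls, T1, ..., TM].
encoder = nn.Sequential(*[TransformerLayer() for _ in range(6)])   # L = 6 is an assumption
```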
And 1033, inputting the feature map of each face image sample into an image-level supervision network for prediction to obtain an image-level prediction result of each face image sample.
In the embodiment of the present invention, the image-level prediction result of each facial image sample includes a category to which each facial image sample belongs. For example, in the embodiment of the present invention, the image-level prediction result may use 0 to indicate that the category to which the input face image belongs is a fake face, and may use 1 to indicate that the category to which the input face image belongs is a real face.
And 104, calculating the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network.
It should be noted that, in the embodiment of the present invention, a mask label with data uncertainty is used as an auxiliary supervision signal, and an uncertainty estimation method is used to process the data uncertainty naturally carried by the mask label.
In an embodiment of the present invention, the uncertainty estimation method may include:
the method comprises the following steps: to model the data uncertainty characterizing learning under hierarchical aiding signal supervision, the characterization can be a probability distribution z-p (z | x).
Specifically, in the embodiment of the present invention, the hierarchical characterization z of each sample x follows a multivariate gaussian distribution:
p(z | x) = N(z; μ, Σ)
wherein μ represents the mean of the Gaussian distribution and Σ represents the diagonal covariance of the multivariate Gaussian distribution. Given a sample x, two convolutional neural networks are used to predict the above parameters: the mean μ is the most likely prediction mask, and the diagonal covariance Σ is the data uncertainty.
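As an illustration of this step, two small convolutional heads can predict the Gaussian parameters from the shared feature map; their architectures are assumptions, since the embodiment only requires two convolutional neural networks, and predicting the log-variance is a common trick to keep the covariance positive.

```python
import torch.nn as nn

# Head predicting the mean mu (the most likely prediction mask).
mu_head = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1),
)
# Head predicting the diagonal covariance Sigma (data uncertainty), as a log-variance map.
log_var_head = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 1),
)
```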
The second method comprises the following steps: the data uncertainty is estimated using a cross entropy loss function.
Specifically, in the embodiment of the present invention, the cross entropy loss function is:
Lce = -(1/N) ∑i [ yi log S(zi) + (1 - yi) log(1 - S(zi)) ],   zi ~ N(μi, Σi), i = 1, ..., N
wherein y is the binary mask label corresponding to the face image sample, zi denotes the hierarchical characterization corresponding to each pixel (the hierarchical characterization of each pixel obeys a multivariate Gaussian distribution), and N is the number of pixels of the face image sample. The mean μ can be used as the prediction mask of the face image, while Σ can be regarded as the data uncertainty, which is introduced so as to improve the accuracy of the detection result.
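A minimal sketch of this pixel-wise loss, assuming the representation is drawn with the reparameterization trick z = μ + σ·ε with ε ~ N(0, I) so that gradients can flow through the sampling step (the issue discussed under the third method below):

```python
import torch
import torch.nn.functional as F

def uncertainty_cross_entropy(mu, log_var, mask):
    """Cross entropy between the binary mask label y and sampled representations zi.

    mu, log_var: B x 1 x h x w Gaussian parameters; mask: binary mask label, same shape, float.
    """
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)                     # zi ~ N(mu_i, sigma_i^2), reparameterized
    return F.binary_cross_entropy_with_logits(z, mask)       # averaged over the N pixels
```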
The third method comprises the following steps: the data uncertainty is estimated using a constraint loss function.
Specifically, in the embodiment of the present invention, the constraint loss function is:
Lkl = KL( N(z | μ, σ²) || N(ε | 0, I) )
where σ represents the variance of the gaussian distribution.
Further, in an embodiment of the present invention, the sampling operation required to estimate the probability distribution in the first method above is not differentiable, so the deep learning model cannot back-propagate the gradient to minimize the objective loss function. Moreover, the cross entropy loss function in the second method may cause the multi-level supervision network to always predict a very small Σ in order to minimize the objective function. Based on this, a KL divergence term is adopted to constrain the distribution N(z | μ, Σ) to be close to N(ε | 0, I). In the embodiment of the present invention, the cross entropy loss function and the constraint loss function can therefore be combined into a composite loss function that serves as the preset loss function of the multi-level supervision network, so that the multi-level supervision network can better estimate the uncertainty. Illustratively, in the embodiment of the present invention, the composite loss function is obtained by adding the cross entropy loss function and the constraint loss function.
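A sketch of the composite loss under these assumptions, reusing uncertainty_cross_entropy from the sketch above and taking the usual closed-form KL divergence between N(μ, σ²) and N(0, I); adding the two terms with equal weight simply follows the statement here, and any other weighting would be a further design choice.

```python
import torch

def kl_constraint(mu, log_var):
    """KL( N(z | mu, sigma^2) || N(0, I) ), averaged over the pixels."""
    return 0.5 * torch.mean(mu.pow(2) + log_var.exp() - log_var - 1.0)

def composite_loss(mu, log_var, mask):
    # Preset loss of the multi-level supervision network: cross entropy + KL constraint.
    return uncertainty_cross_entropy(mu, log_var, mask) + kl_constraint(mu, log_var)
```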
And 105, training each hierarchy supervision network in the multilevel supervision network according to the network gradient descent algorithm and the calculation result corresponding to each hierarchy supervision network to obtain the target multilevel supervision network.
And 106, predicting the face image to be detected by using the target multi-level supervision network, and judging whether the face image to be detected is a forged face according to the prediction result.
The application provides a face forgery detection method based on uncertainty perception level supervision, which comprises: obtaining a plurality of different face image samples; extracting a feature map of each face image sample in the plurality of different face image samples through a convolutional neural network and an upsampling layer; inputting the feature map of each face image sample into each level of supervision network in the multi-level supervision network for prediction to obtain the prediction result of each level of supervision network, wherein the multi-level supervision network comprises a plurality of supervision networks of different levels; calculating the prediction result of each level of supervision network with a preset loss function to obtain a calculation result corresponding to each level of supervision network; training each level of supervision network in the multi-level supervision network according to a network gradient descent algorithm and the calculation result corresponding to each level of supervision network to obtain a target multi-level supervision network; predicting the face image to be detected with the target multi-level supervision network; and judging whether the face image to be detected is a forged face according to the prediction result. The whole network structure is assisted by the binary mask label of the face image, the robustness and generalization of the network are improved through hierarchical supervision, the data uncertainty naturally carried by the mask label is handled by an uncertainty estimation method, and the features of the image are effectively extracted through a self-attention transformation network, thereby improving the accuracy of the detection result.
Example two
Further, fig. 2 is a schematic structural diagram of a face forgery detection apparatus based on uncertainty perception hierarchical supervision according to an embodiment of the present application, and as shown in fig. 2, the face forgery detection apparatus may include:
an obtaining module 201, configured to obtain multiple different face image samples;
an extraction module 202, configured to extract a feature map of each face image sample in multiple different face image samples through a convolutional neural network and an upsampling layer;
the prediction module 203 is configured to input the feature map of each face image sample into each level of supervision network in the multi-level supervision network to perform prediction to obtain the prediction result of each level of supervision network, where the multi-level supervision network includes a plurality of supervision networks of different levels;
the calculation module 204 is configured to calculate the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network;
the training module 205 is configured to train each level of supervision network in the multi-level supervision network according to a network gradient descent algorithm and the calculation result corresponding to each level of supervision network to obtain a target multi-level supervision network;
and the judging module 206 is configured to predict the face image to be detected by using the target multi-level supervision network, and judge whether the face image to be detected is a forged face according to the prediction result.
To implement the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium provided by an embodiment of the present disclosure stores a computer program; when executed by a processor, the computer program can implement the face forgery detection method based on the uncertainty perception hierarchical supervision as shown in fig. 1.
In order to implement the above embodiments, the present disclosure also provides a computer device.
The computer device provided by the embodiment of the disclosure comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; the processor, when executing the program, is capable of implementing the method as shown in fig. 1.
The application provides a face forgery detection method, device and storage medium based on uncertainty perception level supervision. A plurality of different face image samples are obtained, and a feature map of each face image sample is extracted through a convolutional neural network and an upsampling layer. The feature map of each face image sample is input into each level of supervision network in a multi-level supervision network for prediction to obtain the prediction result of each level of supervision network, wherein the multi-level supervision network comprises a plurality of supervision networks of different levels. The prediction result of each level of supervision network is calculated with a preset loss function to obtain a calculation result corresponding to each level of supervision network, and each level of supervision network in the multi-level supervision network is trained according to a network gradient descent algorithm and the calculation result corresponding to each level of supervision network to obtain a target multi-level supervision network. The target multi-level supervision network is then used to predict the face image to be detected, and whether the face image to be detected is a forged face is judged according to the prediction result. The whole network structure is assisted by the binary mask label of the face image, the robustness and generalization of the network are improved through hierarchical supervision, the data uncertainty naturally carried by the mask label is handled by an uncertainty estimation method, and the features of the image are effectively extracted through a self-attention transformation network, thereby improving the accuracy of the detection result.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A face forgery detection method based on uncertainty perception level supervision is characterized by comprising the following steps:
acquiring a plurality of different face image samples;
extracting a feature map of each face image sample in the plurality of different face image samples through a convolutional neural network and an upsampling layer;
inputting the feature map of each face image sample into each hierarchy supervision network in a multilevel supervision network for prediction to obtain a prediction result of each hierarchy supervision network, wherein the multilevel supervision network comprises a plurality of different hierarchy supervision networks;
calculating the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network;
training each hierarchy supervision network in the multilevel supervision network according to a network gradient descent algorithm and a calculation result corresponding to each hierarchy supervision network to obtain a target multilevel supervision network;
and predicting the face image to be detected by using the target multi-level supervision network, and judging whether the face image to be detected is a forged face according to a prediction result.
2. The method of claim 1, wherein said extracting a feature map for each of said plurality of different face image samples via a convolutional neural network and an upsampling layer comprises: extracting an initial feature map of each face image sample in the plurality of different face image samples through a convolutional network;
and increasing the resolution of the initial feature map of each face image sample by adopting a plurality of continuous up-sampling convolution blocks to obtain the feature map of each face image sample.
3. The method of claim 1, wherein the multi-level supervisory network comprises a pixel level supervisory network, a region level supervisory network, an image level supervisory network,
inputting the feature map of each face image sample into each hierarchy supervision network in the multilevel supervision network to predict to obtain the prediction result of each hierarchy supervision network, wherein the prediction result comprises the following steps:
inputting the feature map of each face image sample into a pixel level supervision network for prediction to obtain a pixel level prediction result of each face image sample;
inputting the feature map of each face image sample into an area level supervision network for prediction to obtain an area level prediction result of each face image sample;
and inputting the feature map of each face image sample into an image-level supervision network for prediction to obtain an image-level prediction result of each face image sample.
4. The method of claim 3, wherein inputting the feature map of each face image sample into a pixel-level supervision network to predict the pixel-level prediction result of each face image sample comprises predicting the pixel-level prediction result of each face image sample by a plurality of convolutional neural networks.
5. The method as claimed in claim 3, wherein the inputting the feature map of each facial image sample into a region-level supervision network for prediction to obtain the region-level prediction result of each facial image sample comprises:
obtaining a normalized variance map corresponding to the feature map of each face image sample by using a Sigmoid function;
obtaining an uncertainty perception feature map of each face image sample based on the normalized variance map of each face image sample;
mapping the uncertainty perception feature map of each face image sample through a linear mapping layer to obtain an implicit vector characterization of each face image sample;
obtaining the region-level characterization of each facial image sample by using the implicit vector characterization of each facial image sample;
and inputting the region level characteristics and the class marks of each facial image sample into a self-attention transformation network layer to obtain a region level prediction result of each facial image sample.
6. A method as claimed in claim 3 wherein the image-level prediction result for each face image sample comprises the category to which said each face image sample belongs.
7. The method of claim 1, wherein the pre-set loss function comprises a composite loss function comprising a cross-entropy loss function and a constrained loss function,
the cross entropy loss function is:
Lce = -(1/N) ∑i [ yi log S(zi) + (1 - yi) log(1 - S(zi)) ],   zi ~ N(μi, σi), i = 1, ..., N
wherein y is the binary mask label corresponding to the face image sample, zi denotes the hierarchical characterization corresponding to each pixel, the hierarchical characterization of each pixel obeys a multivariate Gaussian distribution, σ represents the diagonal covariance of the multivariate Gaussian distribution, and N is the number of pixels of the face image sample.
The constraint loss function is:
Lkl = KL( N(z | μ, σ²) || N(ε | 0, I) )
where σ represents the variance of the gaussian distribution.
8. A human face forgery detection device based on uncertainty perception level supervision is characterized by comprising the following modules:
the acquisition module is used for acquiring a plurality of different face image samples;
the extraction module is used for extracting a feature map of each face image sample in the plurality of different face image samples through a convolutional neural network and an upsampling layer;
the prediction module is used for inputting the feature map of each face image sample into each hierarchy supervision network in a multilevel supervision network to predict to obtain the prediction result of each hierarchy supervision network, wherein the multilevel supervision network comprises a plurality of different levels of supervision networks;
the calculation module is used for calculating the prediction result of each level of supervision network by using a preset loss function to obtain a calculation result corresponding to each level of supervision network;
the training module is used for training each hierarchy supervision network in the multilevel supervision networks according to a network gradient descent algorithm and a calculation result corresponding to each hierarchy supervision network to obtain a target multilevel supervision network;
and the judging module is used for predicting the face image to be detected by utilizing the target multi-level supervision network and judging whether the face image to be detected is a forged face or not according to a prediction result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1-7 when executing the program.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.
CN202210167833.1A 2022-02-23 2022-02-23 Face counterfeiting detection method based on uncertainty perception level supervision Pending CN114663936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210167833.1A CN114663936A (en) 2022-02-23 2022-02-23 Face counterfeiting detection method based on uncertainty perception level supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210167833.1A CN114663936A (en) 2022-02-23 2022-02-23 Face counterfeiting detection method based on uncertainty perception level supervision

Publications (1)

Publication Number Publication Date
CN114663936A true CN114663936A (en) 2022-06-24

Family

ID=82026730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210167833.1A Pending CN114663936A (en) 2022-02-23 2022-02-23 Face counterfeiting detection method based on uncertainty perception level supervision

Country Status (1)

Country Link
CN (1) CN114663936A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019128646A1 (en) * 2017-12-28 2019-07-04 深圳励飞科技有限公司 Face detection method, method and device for training parameters of convolutional neural network, and medium
CN112329696A (en) * 2020-11-18 2021-02-05 携程计算机技术(上海)有限公司 Face living body detection method, system, equipment and storage medium
CN112784781A (en) * 2021-01-28 2021-05-11 清华大学 Method and device for detecting forged faces based on difference perception meta-learning
CN112906676A (en) * 2021-05-06 2021-06-04 北京远鉴信息技术有限公司 Face image source identification method and device, storage medium and electronic equipment
CN113723295A (en) * 2021-08-31 2021-11-30 浙江大学 Face counterfeiting detection method based on image domain frequency domain double-flow network


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253262A (en) * 2023-11-15 2023-12-19 南京信息工程大学 Fake fingerprint detection method and device based on commonality feature learning
CN117253262B (en) * 2023-11-15 2024-01-30 南京信息工程大学 Fake fingerprint detection method and device based on commonality feature learning

Similar Documents

Publication Publication Date Title
CN107766894B (en) Remote sensing image natural language generation method based on attention mechanism and deep learning
Hoang An Artificial Intelligence Method for Asphalt Pavement Pothole Detection Using Least Squares Support Vector Machine and Neural Network with Steerable Filter‐Based Feature Extraction
CN110084156B (en) Gait feature extraction method and pedestrian identity recognition method based on gait features
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN112560831B (en) Pedestrian attribute identification method based on multi-scale space correction
CN116824307B (en) Image labeling method and device based on SAM model and related medium
CN109840483B (en) Landslide crack detection and identification method and device
CN114841319A (en) Multispectral image change detection method based on multi-scale self-adaptive convolution kernel
CN107103308A (en) A kind of pedestrian's recognition methods again learnt based on depth dimension from coarse to fine
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN115661480A (en) Image anomaly detection method based on multi-level feature fusion network
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN114549470A (en) Method for acquiring critical region of hand bone based on convolutional neural network and multi-granularity attention
CN112446292A (en) 2D image salient target detection method and system
CN117036715A (en) Deformation region boundary automatic extraction method based on convolutional neural network
CN106203373A (en) A kind of human face in-vivo detection method based on deep vision word bag model
CN113780129B (en) Action recognition method based on unsupervised graph sequence predictive coding and storage medium
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
CN114663936A (en) Face counterfeiting detection method based on uncertainty perception level supervision
CN118465876A (en) Two-stage approach precipitation prediction method based on EOF-Kmeans clustering and LDM
CN114581789A (en) Hyperspectral image classification method and system
CN117314938B (en) Image segmentation method and device based on multi-scale feature fusion decoding
CN117992919A (en) River flood early warning method based on machine learning and multi-meteorological-mode fusion
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN111882545B (en) Fabric defect detection method based on bidirectional information transmission and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination