CN113723295A - Face counterfeiting detection method based on image domain frequency domain double-flow network - Google Patents


Publication number
CN113723295A
CN113723295A
Authority
CN
China
Prior art keywords: image, network, model, frequency domain, domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111009733.8A
Other languages
Chinese (zh)
Other versions
CN113723295B (en)
Inventor
刘勇
梁雨菲
王蒙蒙
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202111009733.8A
Publication of CN113723295A
Application granted
Publication of CN113723295B
Legal status: Active


Classifications

    • G06F18/214 — Physics; computing; electric digital data processing; pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 — Physics; computing; electric digital data processing; pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/045 — Physics; computing arrangements based on specific computational models; computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N3/084 — Physics; computing arrangements based on specific computational models; computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
    • Y02T10/40 — Climate change mitigation technologies related to transportation; road transport of goods or passengers; internal combustion engine based vehicles; engine management systems


Abstract

The invention relates to the technical field of computer vision, and in particular to a face forgery detection method based on an image domain frequency domain double-flow network. The detection method comprises two stages: model training and model inference. In the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by reducing the network loss function until the network converges, yielding a double-flow network model based on the image domain and the frequency domain. In the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image. Compared with methods that cast face forgery detection as a binary classification problem, the method realizes hierarchical classification and exploits the supervision information provided by the diversity of forged images, achieving better detection accuracy.

Description

Face counterfeiting detection method based on image domain frequency domain double-flow network
Technical Field
The invention relates to the technical field of computer vision, in particular to a face forgery detection method based on an image domain frequency domain double-flow network.
Background
With the development of advanced face synthesis algorithms, a variety of highly realistic forged faces have been generated, and their spread on social media has drawn wide public attention. Malicious use of forged faces can cause great harm to individuals and society, so detecting forged faces is of great importance.
The face forgery detection task faces many challenges, in particular the diversity of forgery algorithms; moreover, forged face images circulated on the Internet are often of low quality and hard to detect.
Early studies attempted to use hand-crafted features or simple modifications of existing neural networks. MesoNet designs a shallow neural network consisting of two Inception-style modules and two classical convolutional layers, but such a simple shallow network cannot achieve good accuracy on realistic face forgery data.
Some methods exploit biological characteristics of the face: since training images collected from the Internet rarely include photographs with closed eyes, generated fake faces often lack natural blinking. A neural network model can detect the blinking phenomenon and thereby detect forged faces. However, such detection can be evaded by deliberately including closed-eye images during training.
Face X-ray uses self-generated data to locate the blended region of a manipulated face, but this method does not achieve good accuracy when detecting low-quality forged images.
At present, most face forgery detection algorithms use only image-domain information and cast the task as a binary classification problem of judging whether an image is real or fake, and thus cannot exploit the supervision information of the various forgery algorithms. Detectable cues such as artifacts in the forged image are suppressed by compression in low-quality images and become hard to detect, whereas artifacts in the frequency domain persist even in low-quality images.
Disclosure of Invention
The invention aims to provide a face forgery detection method based on an image domain frequency domain double-flow network, addressing the current shortcomings of face forgery detection.
The method builds a face forgery detection framework on an image domain frequency domain double-flow network and, at the same time, uses the supervision information of different forgery algorithms for hierarchical supervision, so that forgery detection accuracy can be maintained on images of different quality levels.
The method solves the problems that traditional face forgery detection algorithms perform poorly on low-quality images and that detection relying on image-domain artifacts is unstable.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a face forgery detection method based on an image domain frequency domain double-flow network, the detection method comprising two stages: model training and model inference;
in the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by reducing the network loss function until the network converges, yielding a double-flow network model based on the image domain and the frequency domain;
in the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image.
Preferably, in the model training phase, the method specifically comprises the following steps:
firstly, preparing the image data: specifically, the images in the training set are subjected to a Discrete Cosine Transform (DCT) to obtain their frequency spectrum images; image data enhancement is applied to both the original and transformed images; and, at the same time, the corresponding image labels are re-integrated so that each image has two levels of labels;
then, training the network model: the data-enhanced original images and spectrum images are fed into the image domain frequency domain double-flow network in pairs, and training is performed with the hierarchical supervision method; the loss function is computed to obtain the gradient, which is then back-propagated, yielding a trained network model.
Preferably, regarding the Discrete Cosine Transform (DCT), specifically the type-II 2D-DCT, the two-dimensional DCT of an N-dimensional input I is defined as:
D = C_N · I · (C_N)^T
wherein:
D is the output spectrum,
C_N is the transform coefficient matrix, with entries
(C_N)_{j,k} = sqrt(α_j / N) · cos( (2k+1) jπ / (2N) ),  j, k = 0, 1, …, N−1,
where α_j = 1 when j = 0, and α_j = 2 when j > 0.
The upper-left corner of the DCT spectrum image contains the low-frequency information of the image, and the lower-right corner contains the high-frequency information.
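As an illustrative sketch (not part of the patent text), the transform above can be implemented directly from the definition; the 8×8 input size and the constant test image are arbitrary choices:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Type-II DCT coefficient matrix: (C_N)[j,k] = sqrt(a_j/n) * cos((2k+1)*j*pi/(2n)),
    with a_0 = 1 and a_j = 2 for j > 0, so that C_N is orthonormal."""
    j = np.arange(n).reshape(-1, 1)      # row index (frequency)
    k = np.arange(n).reshape(1, -1)      # column index (spatial position)
    alpha = np.where(j == 0, 1.0, 2.0)
    return np.sqrt(alpha / n) * np.cos((2 * k + 1) * j * np.pi / (2 * n))

def dct2(image: np.ndarray) -> np.ndarray:
    """2-D DCT of a square image: D = C_N . I . C_N^T."""
    c = dct_matrix(image.shape[0])
    return c @ image @ c.T

# A constant image concentrates all energy in the DC coefficient,
# i.e. the upper-left (low-frequency) corner of the spectrum.
flat = np.ones((8, 8))
spectrum = dct2(flat)
print(round(spectrum[0, 0], 3))  # 8.0; all other coefficients are ~0
```

With an orthonormal C_N the transform is invertible as I = (C_N)^T · D · C_N, which is why the frequency branch loses no information relative to the image branch.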
Preferably, the image data enhancement comprises horizontal flipping, vertical flipping, and applying Mixup and Random Erasing to the original image and the spectrum image;
Mixup constructs virtual training data by linearly interpolating two random samples from the training set together with their labels; Random Erasing randomly erases a region in the image.
Preferably, the re-integration of the image labels comprises:
firstly, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and several classes for forged images, distinguished by the specific forgery method, such as FaceSwap, Face2Face, and DeepFakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset is divided into real and fake at the top level;
meanwhile, all the forged-image classes belong to the overall 'fake' category;
the dataset labels are re-integrated into this hierarchical structure, so that each image carries a label at a coarse classification level and at a fine classification level, where the coarse label is the first-level label and the fine label is the second-level label.
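The label re-integration described above amounts to a simple mapping; the sketch below is illustrative, with the class ordering and integer ids being assumptions (the forgery-method names follow the text):

```python
# Fine-grained (second-level) classes; id 0 is the real class by convention here.
FINE_CLASSES = ["real", "FaceSwap", "Face2Face", "DeepFakes"]

def to_hierarchical(fine_label: int) -> tuple[int, int]:
    """Map a fine-grained class id to (coarse, fine):
    coarse level: 0 = real, 1 = fake (any forgery method);
    fine level:   the original multi-class id."""
    coarse = 0 if FINE_CLASSES[fine_label] == "real" else 1
    return coarse, fine_label

labels = [to_hierarchical(i) for i in range(len(FINE_CLASSES))]
print(labels)  # [(0, 0), (1, 1), (1, 2), (1, 3)]
```

Every image thus carries a first-level label for the coarse (real/fake) head and a second-level label for the fine (per-method) head.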
Preferably, the model inference phase specifically includes:
firstly, the test image is subjected to the Discrete Cosine Transform (DCT) to obtain its spectrum image;
then, the test image and its spectrum image are fed together into the trained model to obtain the inference results: a binary classification result and a multi-class classification result of the test image are produced, and the two results are fused by weighted averaging to obtain the final inference result of the model.
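One plausible form of the weighted-average fusion is sketched below; the patent does not specify the fusion weight or how the multi-class output is collapsed to a fake score, so the 0.5 weight and the "1 minus real-class probability" reduction are assumptions:

```python
import numpy as np

def fuse_predictions(p_binary, p_multi, w=0.5):
    """Fuse the two heads into one fake-probability.
    p_binary: [p_real, p_fake] from the coarse (binary) head.
    p_multi:  per-class probabilities from the fine head; class 0 is 'real',
              so the remaining mass is the multi-class fake probability.
    w is the fusion weight (0.5 = plain average; an assumed value)."""
    fake_binary = p_binary[1]
    fake_multi = 1.0 - p_multi[0]
    return w * fake_binary + (1 - w) * fake_multi

p_bin = np.array([0.2, 0.8])
p_mul = np.array([0.1, 0.5, 0.3, 0.1])
score = fuse_predictions(p_bin, p_mul)
print(score)  # weighted average of the two fake scores (~0.85 here)
```

A threshold (e.g. 0.5) on the fused score would then yield the final real/fake decision.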
Preferably, the network model training comprises forward propagation and backward propagation of the image data; specifically, the method comprises the following steps:
forward propagation of the image data: all the data-enhanced original images and data-enhanced spectrum images in the training set are fed into the network in pairs for training;
the network comprises a double-flow feature fusion module and a hierarchical supervision module;
the overall structure of the double-flow feature fusion module consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolutional layer, a pooling layer, a fully connected layer + ReLU, and a fully connected layer; the frequency-domain feature map and the image-domain feature map are fused by concatenation (concat) at a specific layer of the network, realizing a reasonable fusion of image-domain and frequency-domain information;
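The concat fusion at a specific layer is just channel-wise concatenation of the two branches' feature maps; the sketch below is illustrative, with the channel counts and spatial size being assumed values:

```python
import numpy as np

def concat_fuse(img_feat: np.ndarray, freq_feat: np.ndarray) -> np.ndarray:
    """'concat' fusion: stack the frequency-domain feature map onto the
    image-domain feature map along the channel axis of a (C, H, W) tensor."""
    assert img_feat.shape[1:] == freq_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([img_feat, freq_feat], axis=0)

img_feat = np.zeros((64, 28, 28))   # image-domain (main) branch, 64 channels
freq_feat = np.zeros((16, 28, 28))  # frequency-domain (auxiliary) branch, 16 channels
fused = concat_fuse(img_feat, freq_feat)
print(fused.shape)  # (80, 28, 28)
```

Subsequent convolutional layers operating on the fused tensor then see both domains at once, which is the point of the double-flow design.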
the hierarchical supervision module modifies the network structure according to the hierarchical labels: image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to a shallow part of the network to obtain the coarse classification output, and a fine classification head is added to a deep part of the network to obtain the fine classification output;
gradient back-propagation: the binary cross-entropy loss is computed from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss is computed from the fine classification output and the second-level labels; the gradient is computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, until the network error falls below a set threshold or the number of iterations reaches the requirement, at which point training stops and the trained network model is obtained.
Preferably, the cross-entropy loss is implemented with the loss function:
Loss = (1/N) Σ_{i=1}^{N} Σ_{h=1}^{H} α_h · CE( p_i^(h), y_i^(h) )
wherein:
N is the number of all images,
H is the number of levels in the hierarchy,
α_h is the weight of the h-th level,
CE is the cross-entropy loss, computed between the prediction p_i^(h) and the label y_i^(h) of image i at level h.
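The hierarchical loss can be sketched in a few lines; this is illustrative only, and the equal level weights and the toy predictions are assumed values:

```python
import numpy as np

def cross_entropy(probs, label):
    """CE for one sample: negative log-probability of the true class."""
    return -np.log(probs[label])

def hierarchical_loss(preds, labels, weights):
    """Loss = (1/N) * sum_i sum_h alpha_h * CE(pred_i^h, label_i^h).
    preds[i][h] are the predicted class probabilities of image i at level h;
    labels[i][h] is the true class index at that level."""
    n = len(preds)
    total = 0.0
    for sample_preds, sample_labels in zip(preds, labels):
        for alpha, p, y in zip(weights, sample_preds, sample_labels):
            total += alpha * cross_entropy(np.asarray(p), y)
    return total / n

# Two images, two levels (coarse: real/fake; fine: 4 classes), equal weights.
preds = [
    ([0.9, 0.1], [0.7, 0.1, 0.1, 0.1]),
    ([0.2, 0.8], [0.1, 0.6, 0.2, 0.1]),
]
labels = [(0, 0), (1, 1)]
print(round(hierarchical_loss(preds, labels, (0.5, 0.5)), 3))  # ~0.299
```

Changing the α_h weights shifts how strongly the coarse real/fake signal dominates the per-method fine signal during training.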
The face forgery detection method based on the image domain frequency domain double-flow network provided by the invention obtains the frequency-domain information of the image through the DCT (Discrete Cosine Transform), and fuses the image-domain and frequency-domain information inside the network through the double-flow network, so that the network can exploit the information of both domains simultaneously. Meanwhile, hierarchical labels are constructed for the images, hierarchical supervision is used during training, and the coarse and fine classification losses are combined by weighted averaging; the supervision information of the different forgery algorithms can thus be fully exploited, further improving detection performance.
Compared with the prior art, the invention has the beneficial effects that:
compared with methods that cast face forgery detection as a binary classification problem, the method realizes hierarchical classification and, by using the supervision information provided by the diversity of forged images, achieves better detection accuracy;
compared with methods that detect face forgery using only the image-domain artifacts of forged images, the method fuses image frequency-domain information with image-domain information inside the network through the double-flow network, which effectively addresses the problem that image-domain artifacts of low-quality forged images are compressed and hard to detect.
Drawings
Fig. 1 is a schematic diagram of a human face forgery detection process based on an image domain frequency domain double-flow network.
FIG. 2 is a schematic view of tag re-integration according to the present invention.
FIG. 3 is a schematic diagram of the hierarchical supervision of the present invention.
Detailed Description
Further refinements will now be made on the basis of the representative embodiment shown in the figures. It should be understood that the following description is not intended to limit the embodiments to one preferred embodiment. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the embodiments as defined by the appended claims.
A face forgery detection method based on an image domain frequency domain double-flow network comprises two stages: model training and model inference.
In the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by reducing the network loss function until the network converges, yielding a double-flow network model based on the image domain and the frequency domain.
In the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image.
In the model training stage, the method specifically comprises the following steps:
firstly, preparing the image data: specifically, the images in the training set are subjected to a Discrete Cosine Transform (DCT) to obtain their frequency spectrum images; image data enhancement is applied to both the original and transformed images; and, at the same time, the corresponding image labels are re-integrated so that each image has two levels of labels;
Regarding the Discrete Cosine Transform (DCT), specifically the type-II 2D-DCT, the two-dimensional DCT of an N-dimensional input I is defined as:
D = C_N · I · (C_N)^T
wherein:
D is the output spectrum,
C_N is the transform coefficient matrix, with entries
(C_N)_{j,k} = sqrt(α_j / N) · cos( (2k+1) jπ / (2N) ),  j, k = 0, 1, …, N−1,
where α_j = 1 when j = 0, and α_j = 2 when j > 0.
The upper-left corner of the DCT spectrum image contains the low-frequency information of the image, and the lower-right corner contains the high-frequency information.
The image data enhancement comprises horizontal flipping, vertical flipping, and applying Mixup and Random Erasing to the original image and the spectrum image;
Mixup constructs virtual training data by linearly interpolating two random samples from the training set together with their labels; Random Erasing randomly erases a region in the image.
The re-integration of the image labels comprises:
firstly, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and several classes for forged images, distinguished by the specific forgery method, such as FaceSwap, Face2Face, and DeepFakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset is divided into real and fake at the top level;
meanwhile, all the forged-image classes belong to the overall 'fake' category;
the dataset labels are re-integrated into this hierarchical structure, so that each image carries a label at a coarse classification level and at a fine classification level, where the coarse label is the first-level label and the fine label is the second-level label.
Then, after the image data is prepared, the network model is trained: the data-enhanced original images and spectrum images are fed into the image domain frequency domain double-flow network in pairs, and training is performed with the hierarchical supervision method; the loss function is computed to obtain the gradient, which is then back-propagated, yielding a trained network model.
The network model training comprises forward propagation and gradient backward propagation of image data; specifically, the method comprises the following steps:
forward propagation of the image data: all the data-enhanced original images and data-enhanced spectrum images in the training set are fed into the network in pairs for training;
the network comprises a double-flow feature fusion module and a hierarchical supervision module;
the overall structure of the double-flow feature fusion module consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolutional layer, a pooling layer, a fully connected layer + ReLU, and a fully connected layer; the frequency-domain feature map and the image-domain feature map are fused at a specific layer of the network by concat, a simple and effective feature fusion method that combines features along the channel dimension; this realizes a reasonable fusion of image-domain and frequency-domain information.
The hierarchical supervision module modifies the network structure according to the hierarchical labels: image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to a shallow part of the network to obtain the coarse classification output, and a fine classification head is added to a deep part of the network to obtain the fine classification output. Taking HRNet as an example: the original HRNet comprises four stages, and corresponding to the two levels of labels, the outputs of the third and fourth stages are selected to compute the classification losses. For coarse classification, the output of the third stage is used to obtain the binary result; for fine classification, the output of the fourth stage is used to obtain the multi-class result. Both outputs pass through a classification head, which applies global average pooling to the feature map to obtain a feature vector.
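A classification head of the kind described, global average pooling followed by a linear layer and softmax, can be sketched as follows; the channel count, spatial size, and zero-initialized weights are illustrative assumptions, not values from the patent:

```python
import numpy as np

def classification_head(feature_map: np.ndarray, weight: np.ndarray, bias: np.ndarray):
    """Global average pooling turns a (C, H, W) feature map into a C-vector,
    followed by a linear layer and softmax to produce class probabilities."""
    vec = feature_map.mean(axis=(1, 2))   # global average pooling -> (C,)
    logits = weight @ vec + bias          # (num_classes, C) @ (C,) + (num_classes,)
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

feat = np.ones((8, 7, 7))                 # assumed stage-output shape
w = np.zeros((4, 8))                      # untrained 4-class head -> uniform output
b = np.zeros(4)
print(classification_head(feat, w, b))    # uniform: [0.25 0.25 0.25 0.25]
```

The coarse head would use a 2-class weight matrix on the third-stage features, and the fine head a per-method weight matrix on the fourth-stage features.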
Gradient back-propagation: the binary cross-entropy loss is computed from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss is computed from the fine classification output and the second-level labels; the gradient is computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, until the network error falls below a set threshold or the number of iterations reaches the requirement, at which point training stops and the trained network model is obtained.
Here, the network is optimized with SGD using a multi-step learning-rate scheduler, with the base learning rate set to 0.002 and the momentum to 0.9. Specifically, taking HRNet as an example, the binary cross-entropy loss is computed from the output of the third stage and the first-level labels (real or fake), and for the fine classification, the multi-class cross-entropy loss is computed from the output of the fourth stage and the second-level labels.
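A multi-step schedule of the kind mentioned can be sketched as below; the base learning rate 0.002 follows the text, while the milestone epochs and the decay factor gamma are assumed values, since the patent does not state them:

```python
def multistep_lr(base_lr: float, epoch: int, milestones=(20, 40), gamma=0.1) -> float:
    """Multi-step schedule: the learning rate is multiplied by gamma
    at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate before the first milestone, after the first, after the second.
print([multistep_lr(0.002, e) for e in (0, 20, 40)])
```

Momentum 0.9 would be configured on the SGD optimizer itself; the schedule only controls the step size over epochs.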
The loss function used for cross-entropy loss is:
Figure BDA0003238460100000061
wherein:
n is the number of all pictures,
h is the number of all levels of the hierarchy,
αhis the weight of the h-th level,
CE is the cross entropy loss.
The gradient is the vector of partial derivatives of the loss function with respect to all its variables; at each point, the negative gradient is the direction in which the function value decreases fastest. The parameters of HRNet are updated by back-propagating the gradient, and training stops once the network error falls below the set threshold or the number of iterations reaches the requirement, yielding the trained HRNet model.
For the model inference phase, the method specifically comprises the following steps:
firstly, the test image is subjected to the Discrete Cosine Transform (DCT) to obtain its spectrum image;
then, the test image and its spectrum image are fed together into the trained model to obtain the inference results: a binary classification result and a multi-class classification result of the test image are produced, and the two results are fused by weighted averaging to obtain the final inference result of the model.
Specific examples based on the above principle are as follows:
as shown in fig. 1, a flow chart of a face forgery detection method based on an image domain frequency domain double-flow network is provided.
The FaceForensics++ dataset is used, which contains 1000 videos at each quality level; 720 videos are selected for the training set, 140 for the validation set, and 140 for the test set, and video frames are extracted and the face regions cropped to serve as the image data.
The double-flow network for face forgery detection is built according to Fig. 1: the spectrum image obtained by the DCT and the corresponding original image are fed into the double-flow network, the image-domain and frequency-domain feature maps are fused at a specific layer of the network, and the information of both domains of the image is fully exploited.
Label re-integration is performed according to Fig. 2, and hierarchical supervision is realized in the network in the form shown in Fig. 3: the cross-entropy losses are computed from the two levels of labels and the two corresponding outputs, the parameters are iteratively updated by gradient back-propagation with GPU acceleration, and training stops once the network error falls below the set threshold or the number of iterations reaches the requirement.
Concretely, taking HRNet as an example, a coarse classification head is added after the third stage and a fine classification head after the fourth stage, and the image-domain and frequency-domain feature maps are fused by concat at the third and fourth stages, respectively. The network is optimized with SGD using a multi-step learning-rate scheduler, with the base learning rate set to 0.002 and the momentum to 0.9.
Detecting faces with this method, compared with methods that cast face forgery detection as a binary classification problem, realizes hierarchical classification, exploits the supervision information provided by the diversity of forged images, and achieves better detection accuracy; compared with methods that detect face forgery using only the image-domain artifacts of forged images, it fuses image frequency-domain information with image-domain information inside the network through the double-flow network, effectively addressing the problem that image-domain artifacts of low-quality forged images are compressed and hard to detect.
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the claims.

Claims (8)

1. A face forgery detection method based on an image domain frequency domain double-flow network, characterized in that: the detection method comprises two stages: model training and model inference;
in the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by reducing the network loss function until the network converges, yielding a double-flow network model based on the image domain and the frequency domain;
in the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image.
2. The face forgery detection method based on an image domain frequency domain double-flow network according to claim 1, characterized in that the model training stage specifically comprises the following steps:
firstly, preparing the image data: specifically, the images in the training set are subjected to a Discrete Cosine Transform (DCT) to obtain their frequency spectrum images; image data enhancement is applied to both the original and transformed images; and, at the same time, the corresponding image labels are re-integrated so that each image has two levels of labels;
then, training the network model: the data-enhanced original images and spectrum images are fed into the image domain frequency domain double-flow network in pairs, and training is performed with the hierarchical supervision method; the loss function is computed to obtain the gradient, which is then back-propagated, yielding a trained network model.
3. The face forgery detection method based on an image domain frequency domain double-flow network according to claim 2, characterized in that: regarding the Discrete Cosine Transform (DCT), specifically the type-II 2D-DCT, the two-dimensional DCT of an N-dimensional input I is defined as:
D = C_N · I · (C_N)^T
wherein:
D is the output spectrum,
C_N is the transform coefficient matrix, with entries
(C_N)_{j,k} = sqrt(α_j / N) · cos( (2k+1) jπ / (2N) ),  j, k = 0, 1, …, N−1,
where α_j = 1 when j = 0, and α_j = 2 when j > 0;
the upper-left corner of the DCT spectrum image contains the low-frequency information of the image, and the lower-right corner contains the high-frequency information.
4. The face forgery detection method based on an image domain frequency domain double-flow network according to claim 2, characterized in that: the image data enhancement comprises horizontal flipping, vertical flipping, and applying Mixup and Random Erasing to the original image and the spectrum image;
Mixup constructs virtual training data by linearly interpolating two random samples from the training set together with their labels; Random Erasing randomly erases a region in the image.
5. The face forgery detection method based on an image domain frequency domain double-flow network according to claim 2, characterized in that the re-integration of the image labels comprises:
firstly, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and several classes for forged images, distinguished by the specific forgery method, such as FaceSwap, Face2Face, and DeepFakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset is divided into real and fake at the top level;
meanwhile, all the forged-image classes belong to the overall 'fake' category;
the dataset labels are re-integrated into this hierarchical structure, so that each image carries a label at a coarse classification level and at a fine classification level, where the coarse label is the first-level label and the fine label is the second-level label.
6. The method for detecting face forgery based on image domain frequency domain double-flow network as claimed in claim 1 or 2, wherein: the model inference phase specifically comprises the following steps:
firstly, the test image is processed by the Discrete Cosine Transform (DCT) to obtain its spectrogram;
then, the test image and the corresponding spectrogram are fed together into the trained model, the binary classification result and the multi-classification result of the test image are obtained, and the two results are fused by weighted averaging to obtain the final inference result of the model.
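A sketch of the weighted-average fusion in claim 6, assuming the binary head outputs [p_real, p_fake], the multi-class head places the real class at index 0, and an illustrative fusion weight w (the claim does not fix the weights):

```python
import numpy as np

def fuse_predictions(p_binary, p_multi, w=0.5):
    """Weighted-average fusion of the two heads into one 'fake' score.
    p_binary: [p_real, p_fake] from the coarse (binary) head.
    p_multi: per-class probabilities from the fine head, index 0 = real.
    w is an illustrative fusion weight."""
    fake_from_multi = 1.0 - p_multi[0]  # probability mass on all forgery classes
    return w * p_binary[1] + (1.0 - w) * fake_from_multi
```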
7. The method for detecting face forgery based on image domain frequency domain double-flow network as claimed in claim 2, wherein: the network model training comprises forward propagation of the image data and gradient back propagation; specifically:
forward propagation: all data-enhanced original images in the training set and their data-enhanced spectrum images are fed into the network in pairs for training;
the network comprises a double-flow feature fusion module and a hierarchical supervision module;
the overall network structure of the double-flow feature fusion module consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolution layer, a pooling layer, a fully connected layer + ReLU and a fully connected layer; the frequency-domain feature map is fused with the image-domain feature map by concatenation (concat) at a specific layer of the network, so that the information of the image domain and the frequency domain is reasonably fused and exploited;
hierarchical supervision module: the network structure is modified based on the hierarchical labels; image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to the shallow part of the network to produce the coarse classification output, and a fine classification head is added to the deep part of the network to produce the fine classification output;
gradient back propagation: the binary cross-entropy loss is calculated from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss from the fine classification output and the second-level labels; the gradients are computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, until the network error falls within a set threshold or the number of iterations meets the requirement, at which point training stops and the trained network model is obtained.
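The concat fusion step of claim 7 can be sketched at the feature-map level (channel counts and spatial size are illustrative assumptions, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def concat_fuse(img_feat, freq_feat):
    """Concatenate the auxiliary frequency-domain feature map onto the
    main image-domain feature map along the channel axis."""
    assert img_feat.shape[1:] == freq_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([img_feat, freq_feat], axis=0)

# Illustrative intermediate feature maps, shaped (channels, H, W).
img_feat = rng.standard_normal((64, 28, 28))   # main stream
freq_feat = rng.standard_normal((16, 28, 28))  # auxiliary stream
fused = concat_fuse(img_feat, freq_feat)
```

In the claimed design, layers after the fusion point see both streams, so the coarse and fine classification heads operate on jointly fused features.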
8. The method for detecting face forgery based on image domain frequency domain double-flow network as claimed in claim 7, wherein: the total cross-entropy loss function used is:
L = (1/n) · Σ_{i=1}^{n} Σ_{h=1}^{H} α_h · CE( ŷ_i^(h), y_i^(h) )
wherein:
n is the number of all pictures,
H is the number of levels of the hierarchy,
α_h is the weight of the h-th level,
CE is the cross-entropy loss between the prediction ŷ_i^(h) and the label y_i^(h) of the i-th picture at level h.
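A sketch of the hierarchically weighted cross-entropy of claim 8, assuming per-level softmax probabilities are already available (variable names are illustrative):

```python
import numpy as np

def hierarchical_loss(probs_per_level, labels_per_level, weights):
    """Average over n samples of the weighted sum over hierarchy levels
    of the cross entropy CE = -log p(true class); weights[h] is alpha_h."""
    n = len(labels_per_level[0])
    total = 0.0
    for probs, labels, a in zip(probs_per_level, labels_per_level, weights):
        # probs: (n, classes) softmax output at this level; labels: (n,)
        total += a * (-np.log(probs[np.arange(n), labels])).sum()
    return total / n
```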
CN202111009733.8A 2021-08-31 2021-08-31 Face counterfeiting detection method based on image domain frequency domain double-flow network Active CN113723295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111009733.8A CN113723295B (en) 2021-08-31 2021-08-31 Face counterfeiting detection method based on image domain frequency domain double-flow network

Publications (2)

Publication Number Publication Date
CN113723295A true CN113723295A (en) 2021-11-30
CN113723295B CN113723295B (en) 2023-11-07

Family

ID=78679589


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444566A (en) * 2021-12-16 2022-05-06 厦门市美亚柏科信息股份有限公司 Image counterfeiting detection method and device and computer storage medium
CN114596609A (en) * 2022-01-19 2022-06-07 中国科学院自动化研究所 Audio-visual counterfeit detection method and device
CN114596608A (en) * 2022-01-19 2022-06-07 中国科学院自动化研究所 Double-stream video face counterfeiting detection method and system based on multiple clues
CN114841340A (en) * 2022-04-22 2022-08-02 马上消费金融股份有限公司 Deep forgery algorithm identification method and device, electronic equipment and storage medium
CN115311525A (en) * 2022-10-08 2022-11-08 阿里巴巴(中国)有限公司 Depth forgery detection method and corresponding device
CN117557562A (en) * 2024-01-11 2024-02-13 齐鲁工业大学(山东省科学院) Image tampering detection method and system based on double-flow network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320947A (en) * 2015-11-04 2016-02-10 博宏信息技术有限公司 Face in-vivo detection method based on illumination component
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN110210570A (en) * 2019-06-10 2019-09-06 上海延华大数据科技有限公司 The more classification methods of diabetic retinopathy image based on deep learning
CN110263910A (en) * 2018-03-12 2019-09-20 罗伯特·博世有限公司 For storing the method and apparatus for efficiently running neural network
CN111783505A (en) * 2019-05-10 2020-10-16 北京京东尚科信息技术有限公司 Method and device for identifying forged faces and computer-readable storage medium
CN112686331A (en) * 2021-01-11 2021-04-20 中国科学技术大学 Forged image recognition model training method and forged image recognition method
CN112784790A (en) * 2021-01-29 2021-05-11 厦门大学 Generalization false face detection method based on meta-learning
CN113011357A (en) * 2021-03-26 2021-06-22 西安电子科技大学 Depth fake face video positioning method based on space-time fusion
CN113177965A (en) * 2021-04-09 2021-07-27 上海工程技术大学 Coal rock full-component extraction method based on improved U-net network and application thereof


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JOEL FRANK et al.: "Leveraging Frequency Analysis for Deep Fake Image Recognition", Proceedings of the 37th International Conference on Machine Learning, pages 1-12 *
LI Ce et al.: "Face liveness detection algorithm using hypercomplex wavelet generative adversarial networks", Journal of Xi'an Jiaotong University, vol. 55, no. 5, pages 113-122 *
HAN Yuchen et al.: "Collaborative video face-swap forgery detection on eye and mouth regions based on Inception3D network", Journal of Signal Processing, vol. 37, no. 4, pages 567-577 *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant