CN113723295B - Face forgery detection method based on image-domain and frequency-domain dual-stream network - Google Patents

Face forgery detection method based on an image-domain and frequency-domain dual-stream network

Info

Publication number
CN113723295B
CN113723295B (application CN202111009733.8A)
Authority
CN
China
Prior art keywords
image
network
model
labels
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111009733.8A
Other languages
Chinese (zh)
Other versions
CN113723295A (en)
Inventor
Liu Yong (刘勇)
Liang Yufei (梁雨菲)
Wang Mengmeng (王蒙蒙)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111009733.8A priority Critical patent/CN113723295B/en
Publication of CN113723295A publication Critical patent/CN113723295A/en
Application granted granted Critical
Publication of CN113723295B publication Critical patent/CN113723295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a face forgery detection method based on an image-domain and frequency-domain dual-stream network. The detection method comprises two stages, model training and model inference. In the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by minimizing the network loss function until the network converges, yielding the image-domain and frequency-domain dual-stream network model. In the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image. Compared with methods that cast face forgery detection as a binary classification problem, the method performs hierarchical classification, exploits the supervision information in the diversity of forged images, and achieves better detection accuracy.

Description

Face forgery detection method based on an image-domain and frequency-domain dual-stream network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a face forgery detection method based on an image-domain and frequency-domain dual-stream network.
Background
With the development of advanced face synthesis algorithms, a variety of realistic fake faces can now be generated, and their spread on social media has attracted wide attention. Malicious use of fake faces can cause great harm to individuals and society, so detecting forged faces is of great importance.
Face forgery detection faces many challenges, notably the diversity of forgery algorithms; moreover, forged face images circulating on the internet tend to be of low quality and hard to detect.
Early studies attempted hand-crafted features or simple modifications of existing neural networks. MesoNet designs a shallow network consisting of two Inception modules and two classical convolutional layers, but such a simple shallow network cannot achieve good accuracy on realistic face forgery data.
Some methods exploit biological cues of the face: training images collected from the internet rarely contain closed-eye photographs, so forged faces may lack realistic blinking. A neural network model can detect the blinking phenomenon and thereby expose forged faces. However, such detection can be evaded by deliberately including closed-eye images during training.
Face X-ray uses self-generated data to locate the manipulated face region, but it cannot achieve good accuracy on low-quality forged images.
Most current face forgery detection algorithms use only image-domain information, casting the task as a binary classification of image authenticity, and cannot exploit the supervision available from the various forgery algorithms. Detectable cues such as image-domain artifacts may be compressed away in low-quality images and become hard to detect, whereas frequency-domain artifacts persist in low-quality images.
Disclosure of Invention
The invention aims to provide a face forgery detection method based on an image-domain and frequency-domain dual-stream network, addressing the current shortcomings of face forgery detection.
The method builds a face forgery detection framework on an image-domain and frequency-domain dual-stream network and applies hierarchical supervision using the supervision information of the different forgery algorithms, so that detection accuracy is maintained across images of different quality.
The method addresses the poor performance of traditional face forgery detection algorithms on low-quality images and the instability of detection that relies on image-domain artifacts.
The technical scheme adopted for solving the technical problems is as follows:
a face forgery detection method based on an image-domain and frequency-domain dual-stream network comprises two stages, model training and model inference;
in the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by minimizing the network loss function until the network converges, yielding the image-domain and frequency-domain dual-stream network model;
in the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image.
Preferably, the model training stage specifically comprises the following steps:
first, image data preparation: apply the Discrete Cosine Transform (DCT) to the images in the training set to obtain their spectrum images, apply image data enhancement to both the original and transformed images, and at the same time re-integrate the corresponding image labels so that each image carries two levels of labels;
then, network model training: feed the data-enhanced original images and spectrum images in pairs into the image-domain and frequency-domain dual-stream network and train it with the hierarchical supervision method; compute the loss function to obtain the gradients, then back-propagate them to obtain the trained network model.
Preferably, the Discrete Cosine Transform (DCT) is the type-II 2D DCT. For an N×N input I, the two-dimensional DCT is defined as:
D = C_N · I · (C_N)^T
wherein:
D is the output spectrum,
C_N is the DCT coefficient matrix, whose entry in row j and column k is C_N[j][k] = sqrt(α_j / N) · cos((2k + 1)jπ / (2N)),
with α_j = 1 when j = 0 and α_j = 2 when j > 0;
the upper-left corner of the spectrum image after the DCT contains the low-frequency information of the image, and the lower-right corner contains the high-frequency information.
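As an illustration only (not part of the patent text), the coefficient matrix and the transform above can be sketched in NumPy; the entry formula follows the standard type-II DCT with the α_j normalization given above, which makes C_N orthonormal:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Build the N x N type-II DCT coefficient matrix C_N.

    Entry (j, k) is sqrt(alpha_j / N) * cos((2k + 1) * j * pi / (2N)),
    with alpha_j = 1 for j = 0 and alpha_j = 2 for j > 0.
    """
    j = np.arange(n).reshape(-1, 1)   # row (frequency) index
    k = np.arange(n).reshape(1, -1)   # column (sample) index
    alpha = np.where(j == 0, 1.0, 2.0)
    return np.sqrt(alpha / n) * np.cos((2 * k + 1) * j * np.pi / (2 * n))

def dct2(image: np.ndarray) -> np.ndarray:
    """2D DCT of a square image: D = C_N . I . C_N^T."""
    c = dct_matrix(image.shape[0])
    return c @ image @ c.T

# A constant image concentrates all its energy in the (0, 0) bin,
# i.e. the low-frequency upper-left corner of the spectrum.
spec = dct2(np.ones((8, 8)))
```

With this normalization C_N is orthogonal, so the transform is invertible via I = (C_N)^T · D · C_N.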
Preferably, the image data enhancement comprises horizontal flipping, vertical flipping, and the application of Mixup and Random Erasing to the original images and the spectrum images;
wherein Mixup takes the linear interpolation of two random samples in the training set, together with the interpolation of their labels, as virtual training data, and Random Erasing erases a randomly chosen block of area in the image.
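The two augmentations named above can be sketched as follows (a minimal illustration; the Beta distribution parameter for Mixup and the erased block size are assumptions, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha: float = 0.4):
    """Mixup: linearly interpolate two samples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)              # mixing coefficient in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def random_erase(img, h: int, w: int):
    """Random Erasing: zero out a randomly placed h x w block of the image."""
    out = img.copy()
    top = rng.integers(0, img.shape[0] - h + 1)
    left = rng.integers(0, img.shape[1] - w + 1)
    out[top:top + h, left:left + w] = 0
    return out

img_a, img_b = rng.random((32, 32)), rng.random((32, 32))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b)
erased = random_erase(img_a, 8, 8)
```

Note that the mixed label remains a valid probability distribution, which is what makes Mixup compatible with cross-entropy training.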
Preferably, the re-integration of the image labels comprises:
first, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and multiple classes for forged images, distinguishing the specific forgery methods such as FaceSwap, Face2Face and Deepfakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset divides into real and fake classes;
meanwhile, all classes of forged images belong to the overall fake category;
the dataset labels are thus re-integrated into a hierarchical structure, so that each image carries labels at both the coarse and the fine classification level, the coarse label being the first-level label and the fine label being the second-level label.
Preferably, the model inference stage specifically comprises:
first, applying the Discrete Cosine Transform (DCT) to the test image to obtain its spectrogram;
then, feeding the test image and its spectrogram into the trained model to obtain the binary classification result and the multi-class classification result of the test image, and fusing the two by a weighted average to obtain the final inference result of the model.
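The weighted-average fusion of the two heads might look like the following sketch (the fusion weight w and the convention that index 0 is the real class are assumptions; the patent does not state them):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_predictions(coarse_logits, fine_logits, w: float = 0.5):
    """Weighted-average fusion of the binary and multi-class heads.

    The multi-class head's fake probability is taken as the total mass on
    the forgery classes (indices 1..K-1, index 0 being 'real').
    """
    p_fake_coarse = softmax(coarse_logits)[1]
    p_fake_fine = softmax(fine_logits)[1:].sum()
    return w * p_fake_coarse + (1 - w) * p_fake_fine

score = fuse_predictions(np.array([0.2, 2.0]), np.array([0.1, 1.5, 0.3, 0.2]))
```

The fused score stays in [0, 1] and can be thresholded to give the final real/fake decision.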
Preferably, the network model training comprises forward propagation of the image data and backward propagation of the gradients. Specifically:
forward propagation of the image data: all data-enhanced original images and spectrum images in the training set are fed into the network in pairs for training;
the network comprises a dual-stream feature fusion module and a hierarchical supervision module;
the overall network structure consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolutional layer, a pooling layer, a fully connected layer with ReLU, and a fully connected layer; the frequency-domain feature map and the image-domain feature map are fused by concatenation (concat) at specific layers of the network, realizing a reasonable fusion of the image-domain and frequency-domain information;
hierarchical supervision module: the network structure is modified according to the hierarchical labels; image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to a shallow part of the network to obtain the coarse classification output, and a fine classification head is added to a deep part of the network to obtain the fine classification output;
gradient backward propagation: the binary cross-entropy loss is computed from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss from the fine classification output and the second-level labels; the gradients are computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, and training stops when the network error falls below a set threshold or the number of iterations reaches the requirement, yielding the trained network model.
Preferably, the hierarchical cross-entropy loss function used is:
L = (1/n) · Σ_{i=1}^{n} Σ_{h=1}^{H} α_h · CE(p_i^(h), y_i^(h))
wherein:
n is the number of all images,
H is the number of all levels,
α_h is the weight of the h-th level,
CE is the cross-entropy loss, computed between the prediction p_i^(h) and the label y_i^(h) of image i at level h.
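A minimal NumPy sketch of this weighted hierarchical loss (the probabilities, labels, and level weights below are made up for illustration):

```python
import numpy as np

def hierarchical_loss(probs_per_level, labels_per_level, weights):
    """Weighted hierarchical cross-entropy:
    L = (1/n) * sum_i sum_h alpha_h * CE(p_i^h, y_i^h).

    probs_per_level[h] is an (n, C_h) array of predicted probabilities,
    labels_per_level[h] an (n,) array of integer labels for level h.
    """
    n = labels_per_level[0].shape[0]
    total = 0.0
    for p, y, a in zip(probs_per_level, labels_per_level, weights):
        ce = -np.log(p[np.arange(n), y])   # per-sample cross entropy
        total += a * ce.sum()
    return total / n

# Two levels (coarse: 2 classes, fine: 4 classes), two samples.
coarse = np.array([[0.9, 0.1], [0.2, 0.8]])
fine = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])
loss = hierarchical_loss([coarse, fine],
                         [np.array([0, 1]), np.array([0, 1])],
                         weights=[1.0, 1.0])
```

With equal weights this reduces to the plain sum of the coarse and fine cross-entropies averaged over images.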
In the face forgery detection method based on the image-domain and frequency-domain dual-stream network, the frequency-domain information of the image is obtained by the DCT, and the dual-stream network fuses image-domain and frequency-domain information inside the network, so that the network can exploit both domains simultaneously. Meanwhile, hierarchical labels are built on the image labels; during training, hierarchical supervision is applied and the coarse and fine classification losses are combined by a weighted average, making full use of the supervision information of the different forgery algorithms and further improving detection performance.
Compared with the prior art, the invention has the following beneficial effects:
compared with methods that cast face forgery detection as a binary classification problem, the method performs hierarchical classification, exploits the supervision information in the diversity of forged images, and achieves better detection accuracy;
compared with methods that detect face forgery using only image-domain artifacts, the dual-stream network fuses frequency-domain and image-domain information inside the network, effectively addressing the problem that image-domain artifacts in low-quality forged images are compressed and hard to detect.
Drawings
Fig. 1 is a schematic diagram of the face forgery detection flow based on the image-domain and frequency-domain dual-stream network of the present invention.
Fig. 2 is a schematic diagram of the label re-integration of the present invention.
Fig. 3 is a schematic diagram of the hierarchical supervision of the present invention.
Detailed Description
Representative embodiments are now described in further detail with reference to the drawings. It should be understood that the following description is not intended to limit the embodiments to one preferred embodiment. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the embodiments as defined by the appended claims.
A face forgery detection method based on an image-domain and frequency-domain dual-stream network comprises two stages, model training and model inference;
in the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by minimizing the network loss function until the network converges, yielding the image-domain and frequency-domain dual-stream network model;
in the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image.
The model training stage specifically comprises the following steps:
first, image data preparation: apply the Discrete Cosine Transform (DCT) to the images in the training set to obtain their spectrum images, apply image data enhancement to both the original and transformed images, and at the same time re-integrate the corresponding image labels so that each image carries two levels of labels.
regarding Discrete Cosine Transform (DCT), in particular using type ii 2D-DCT, the two-dimensional DCT transform of input I (N dimensions) is defined as:
D=C N ·I·(C N ) T
wherein:
d is an output, and D is an output,
C N is the coefficient of the transformation matrix and,
where j=0, α j =1; when j > 0, alpha j =2;
The upper left corner of the spectrum image after DCT transformation contains image low-frequency information, and the lower right corner contains image high-frequency information.
The image data enhancement comprises horizontal flipping, vertical flipping, and the application of Mixup and Random Erasing to the original images and the spectrum images;
wherein Mixup takes the linear interpolation of two random samples in the training set, together with the interpolation of their labels, as virtual training data, and Random Erasing erases a randomly chosen block of area in the image.
The re-integration of the image labels comprises:
first, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and multiple classes for forged images, distinguishing the specific forgery methods such as FaceSwap, Face2Face and Deepfakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset divides into real and fake classes;
meanwhile, all classes of forged images belong to the overall fake category;
the dataset labels are thus re-integrated into a hierarchical structure, so that each image carries labels at both the coarse and the fine classification level, the coarse label being the first-level label and the fine label being the second-level label.
Then, after the image data preparation, the network model is trained: the data-enhanced original images and spectrum images are fed in pairs into the image-domain and frequency-domain dual-stream network and trained with the hierarchical supervision method; the loss function is computed to obtain the gradients, which are then back-propagated, yielding the trained network model.
Network model training comprises forward propagation of the image data and backward propagation of the gradients. Specifically:
forward propagation of the image data: all data-enhanced original images and spectrum images in the training set are fed into the network in pairs for training;
the network comprises a dual-stream feature fusion module and a hierarchical supervision module;
the overall network structure consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolutional layer, a pooling layer, a fully connected layer with ReLU, and a fully connected layer; the frequency-domain feature map and the image-domain feature map are fused by concatenation (concat) at specific layers of the network. Concat is a simple and effective feature fusion method that combines features along the channel dimension, finally realizing a reasonable fusion of the image-domain and frequency-domain information.
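The channel-dimension concat described here can be sketched as follows (shapes are illustrative; channel counts are assumptions, not values from the patent):

```python
import numpy as np

def concat_fuse(img_feat: np.ndarray, freq_feat: np.ndarray) -> np.ndarray:
    """Channel-wise concat of image-domain and frequency-domain feature maps.

    Both maps are (C, H, W); fusion simply stacks them along the channel
    dimension, as described for the dual-stream network.
    """
    assert img_feat.shape[1:] == freq_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([img_feat, freq_feat], axis=0)

# e.g. a 64-channel image-domain map fused with a 16-channel frequency map
fused = concat_fuse(np.zeros((64, 14, 14)), np.ones((16, 14, 14)))
```

Layers after the fusion point see both domains' channels, so subsequent convolutions can learn cross-domain interactions without any extra parameters in the fusion itself.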
Hierarchical supervision module: the network structure is modified according to the hierarchical labels; image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to a shallow part of the network to obtain the coarse classification output, and a fine classification head is added to a deep part of the network to obtain the fine classification output. Taking HRNet as a concrete example: the original HRNet comprises four stages; corresponding to the two levels of labels, the outputs of the third and fourth stages are selected to compute the classification losses. The coarse classification result is obtained from the output of the third stage, and the fine classification result from the output of the fourth stage. Both outputs are processed by a classification head consisting of global average pooling, which derives a feature vector from the feature map.
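The global-average-pooling classification head can be sketched as follows (the linear layer and its shapes are illustrative assumptions):

```python
import numpy as np

def gap_head(feature_map: np.ndarray, weight: np.ndarray, bias: np.ndarray):
    """Classification head: global average pooling over H x W, then a
    linear layer.  Shapes: feature_map (C, H, W), weight (num_classes, C),
    bias (num_classes,); returns (num_classes,) logits.
    """
    vec = feature_map.mean(axis=(1, 2))   # (C,) pooled feature vector
    return weight @ vec + bias            # linear classifier on the vector

feat = np.ones((4, 7, 7))                 # toy 4-channel feature map
logits = gap_head(feat, np.eye(2, 4), np.zeros(2))
```

The same head structure serves both the coarse (2-class) and fine (multi-class) outputs; only the number of output classes differs.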
Gradient backward propagation: the binary cross-entropy loss is computed from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss from the fine classification output and the second-level labels; the gradients are computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, and training stops when the network error falls below a set threshold or the number of iterations reaches the requirement, yielding the trained network model.
Here, the network is optimized by SGD with a multi-step learning rate scheduler, a base learning rate of 0.002, and a momentum of 0.9. Again taking HRNet as an example, the binary cross-entropy loss is computed from the output of the third stage and the first-level labels (real or fake), and the multi-class cross-entropy loss from the output of the fourth stage and the second-level labels.
The hierarchical cross-entropy loss function used is:
L = (1/n) · Σ_{i=1}^{n} Σ_{h=1}^{H} α_h · CE(p_i^(h), y_i^(h))
wherein:
n is the number of all images,
H is the number of all levels,
α_h is the weight of the h-th level,
CE is the cross-entropy loss, computed between the prediction p_i^(h) and the label y_i^(h) of image i at level h.
The gradient is the vector of partial derivatives of the loss function with respect to all parameters; its negative direction is the direction in which the function value decreases fastest at each point. The parameters of HRNet are updated by gradient back-propagation until the network error falls below a set threshold or the number of iterations reaches the requirement, at which point training stops, yielding the trained HRNet model.
The model inference stage specifically comprises:
first, applying the Discrete Cosine Transform (DCT) to the test image to obtain its spectrogram;
then, feeding the test image and its spectrogram into the trained model to obtain the binary classification result and the multi-class classification result of the test image, and fusing the two by a weighted average to obtain the final inference result of the model.
A specific example based on the above principle is as follows:
as shown in Fig. 1, a face forgery detection method based on an image-domain and frequency-domain dual-stream network is provided.
The FaceForensics++ dataset is used: 1000 videos of each quality level are taken, with 720 videos as the training set, 140 as the validation set, and 140 as the test set; video frames are extracted and the face regions are cropped as image data.
Following Fig. 1, the dual-stream network for face forgery detection is built; the spectrum image obtained by the DCT and the corresponding original image are fed into the dual-stream network, and the image-domain and frequency-domain feature maps are fused at specific layers of the network, making full use of the information of both domains.
The labels are re-integrated as in Fig. 2, hierarchical supervision is realized in the network as in Fig. 3, the cross-entropy losses are computed from the two levels of labels and the two corresponding outputs, the parameters are iteratively updated by gradient back-propagation with GPU acceleration, and training stops when the network error falls below a set threshold or the number of iterations reaches the requirement.
Concretely, taking HRNet as an example, a coarse classification head is added after stage three and a fine classification head after stage four; for feature fusion, the image-domain and frequency-domain feature maps are concatenated at stage three and stage four respectively. The network is optimized by SGD with a multi-step learning rate scheduler, a base learning rate of 0.002, and a momentum of 0.9.
Using this method for face forgery detection, compared with methods that cast the task as a binary classification problem, hierarchical classification is realized, the supervision information in the diversity of forged images is exploited, and better detection accuracy is achieved; compared with methods that use only image-domain artifacts, the dual-stream network fuses frequency-domain and image-domain information inside the network, effectively addressing the problem that image-domain artifacts in low-quality forged images are compressed and hard to detect.
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations of the present invention fall within its spirit and the scope of the appended claims.

Claims (5)

1. A face forgery detection method based on an image-domain and frequency-domain dual-stream network, characterized in that: the detection method comprises two stages, model training and model inference;
in the model training stage, a server with high computing performance is used to train the network model; the network parameters are optimized by minimizing the network loss function until the network converges, yielding the image-domain and frequency-domain dual-stream network model;
in the model inference stage, the network model obtained in the training stage is used to judge whether a new image is a forged face image;
the model training stage specifically comprises the following steps:
first, image data preparation: apply the discrete cosine transform to the images in the training set to obtain their spectrum images, apply image data enhancement to the images before and after the discrete cosine transform, and at the same time re-integrate the corresponding image labels so that each image carries two levels of labels;
then, network model training: feed the data-enhanced original images and spectrum images in pairs into the image-domain and frequency-domain dual-stream network and train it with the hierarchical supervision method; compute the loss function to obtain the gradients, and back-propagate them to obtain the trained network model;
the re-integration of the image labels comprises:
first, converting the binary labels of the images into multi-class labels according to fine-grained label information: one class for real images and multiple classes for forged images, distinguishing the specific forgery methods including FaceSwap, Face2Face and Deepfakes; furthermore, these multi-class labels form a hierarchy, because the entire dataset divides into real and fake classes;
meanwhile, all classes of forged images belong to the overall fake category;
the dataset labels are thus re-integrated into a hierarchical structure, so that each image carries labels at both the coarse and the fine classification level, the coarse label being the first-level label and the fine label being the second-level label;
the network model training comprises forward propagation of the image data and backward propagation of the gradients; specifically:
forward propagation of the image data: all data-enhanced original images and spectrum images in the training set are fed into the network in pairs for training;
the network comprises a dual-stream feature fusion module and a hierarchical supervision module;
the overall network structure consists of two branches, an image-domain branch and a frequency-domain branch; the image-domain branch serves as the main stream, and the frequency-domain branch serves as an auxiliary information stream comprising a convolutional layer, a pooling layer, a fully connected layer with ReLU, and a fully connected layer; the frequency-domain feature map and the image-domain feature map are fused by concatenation (concat) at specific layers of the network, realizing a reasonable fusion of the image-domain and frequency-domain information;
hierarchical supervision module: the network structure is modified according to the hierarchical labels; image features are extracted hierarchically by the convolutional neural network, a coarse classification head is added to a shallow part of the network to obtain the coarse classification output, and a fine classification head is added to a deep part of the network to obtain the fine classification output;
gradient backward propagation: the binary cross-entropy loss is computed from the coarse classification output and the first-level labels, and the multi-class cross-entropy loss from the fine classification output and the second-level labels; the gradients are computed from the loss function and back-propagated to update the network parameters, with GPU acceleration, and training stops when the network error falls below a set threshold or the number of iterations reaches the requirement, yielding the trained network model.
2. The face forgery detection method based on the image domain frequency domain double-flow network according to claim 1, characterized in that: the discrete cosine transform specifically uses the type-II 2D DCT, and the two-dimensional DCT of an input I is defined as:
D=C N ·I·(C N ) T
wherein:
D is the output,
C_N is the transform matrix, whose entries are given by (C_N)_{j,k} = √(α_j / N) · cos( (2k + 1) · j · π / (2N) ),
where α_j = 1 when j = 0, and α_j = 2 when j > 0;
N denotes the dimension of the input data;
the upper left corner of the spectrum image after DCT transformation contains image low-frequency information, and the lower right corner contains image high-frequency information.
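The transform D = C_N · I · (C_N)^T can be sketched directly in NumPy; the matrix entries below follow the standard orthonormal DCT-II form, consistent with the α_j values given above:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II transform matrix C_N with entries
    C[j, k] = sqrt(alpha_j / N) * cos((2k + 1) * j * pi / (2N)),
    where alpha_0 = 1 and alpha_j = 2 for j > 0."""
    C = np.zeros((N, N))
    for j in range(N):
        alpha = 1.0 if j == 0 else 2.0
        for k in range(N):
            C[j, k] = np.sqrt(alpha / N) * np.cos((2 * k + 1) * j * np.pi / (2 * N))
    return C

def dct2(I):
    """Two-dimensional DCT: D = C_N . I . C_N^T for a square input I."""
    C = dct_matrix(I.shape[0])
    return C @ I @ C.T

# A constant (purely low-frequency) image concentrates all of its energy
# in the upper-left (DC) coefficient of the spectrum image
N = 8
D = dct2(np.ones((N, N)))
```

For a constant 8×8 input, only D[0, 0] is nonzero, illustrating the claim that low-frequency information sits in the upper-left corner of the spectrum image.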
3. The face forgery detection method based on the image domain frequency domain double-flow network according to claim 1, characterized in that: the image data enhancement comprises horizontal flipping, vertical flipping, and applying Mix Up and Random Erasing to the original images and the spectrum images;
wherein Mix Up takes the linear interpolation of two random training-set samples and of their labels as virtual training data; Random Erasing randomly erases a rectangular region of the image.
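The two augmentations can be sketched as follows; one-hot labels for Mix Up, the uniform-noise fill for Random Erasing, and all hyperparameters are illustrative assumptions, not values fixed by the claim:

```python
import numpy as np

rng = np.random.default_rng(42)

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Mix Up: form a virtual sample as the linear interpolation of two
    random training samples and of their (one-hot) labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def random_erase(img, max_frac=0.3):
    """Random Erasing: overwrite a random rectangular region of the image
    (here filled with uniform noise; the fill strategy is an assumption)."""
    out = img.copy()
    h, w = img.shape[:2]
    eh = int(rng.integers(1, max(2, int(h * max_frac)) + 1))
    ew = int(rng.integers(1, max(2, int(w * max_frac)) + 1))
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out[top:top + eh, left:left + ew] = rng.random((eh, ew))
    return out
```

Because the mixed label is lam · y1 + (1 − lam) · y2, its entries still sum to one, so it remains a valid soft label for the cross-entropy losses above.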
4. The face forgery detection method based on the image domain frequency domain double-flow network according to claim 1, characterized in that the model inference phase specifically comprises:
first, applying the discrete cosine transform to the test image to obtain its spectrogram;
then, feeding the test image and the corresponding spectrogram into the trained model to obtain the inference result: the binary classification result and the multi-class classification result of the test image are obtained, and the two results are fused by weighted averaging to obtain the final model inference result.
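One plausible sketch of the weighted-average fusion step: it assumes the multi-class head assigns index 0 to the "real" class (so its fake probability is 1 − p(real)) and uses an equal 0.5/0.5 weighting; both are assumptions this claim does not fix.

```python
import numpy as np

def fuse_predictions(binary_probs, multi_probs, w_binary=0.5):
    """Weighted-average fusion of the two heads' outputs.

    binary_probs: [p_real, p_fake] from the coarse (binary) head.
    multi_probs:  probabilities over classes from the fine head; index 0
                  is assumed to be the 'real' class in this sketch.
    Returns the fused probability that the test image is fake.
    """
    p_fake_binary = binary_probs[1]
    p_fake_multi = 1.0 - multi_probs[0]
    return w_binary * p_fake_binary + (1.0 - w_binary) * p_fake_multi

# Example: binary head says 0.8 fake; fine head says 0.1 real
p_fake = fuse_predictions(np.array([0.2, 0.8]),
                          np.array([0.1, 0.5, 0.3, 0.1]))
```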
5. The face forgery detection method based on the image domain frequency domain double-flow network according to claim 1, characterized in that: the loss function used for the cross-entropy loss is:

L = (1/n) · Σ_{i=1}^{n} Σ_{h=1}^{H} α_h · CE( p_h^(i), y_h^(i) )

wherein:
n is the total number of pictures,
H is the total number of levels,
α_h is the weight of the h-th level,
CE is the cross-entropy loss, computed between the level-h prediction p_h^(i) and label y_h^(i) of the i-th picture.
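The level-weighted cross-entropy loss can be sketched directly from the definitions above (the nested-list layout of predictions and labels is an implementation assumption):

```python
import numpy as np

def cross_entropy(probs, label):
    """CE for one sample: negative log-probability of the true class."""
    return -np.log(probs[label] + 1e-12)

def hierarchical_loss(preds, labels, weights):
    """L = (1/n) * sum_i sum_h alpha_h * CE(pred, label).

    preds:   preds[h][i] is the predicted distribution for picture i at level h
    labels:  labels[h][i] is the true class index for picture i at level h
    weights: weights[h] is alpha_h, the weight of the h-th level
    """
    n = len(labels[0])
    total = 0.0
    for h, alpha in enumerate(weights):
        for i in range(n):
            total += alpha * cross_entropy(preds[h][i], labels[h][i])
    return total / n

# One picture, two levels: a binary level and a 4-class level
loss = hierarchical_loss(
    preds=[[np.array([0.5, 0.5])], [np.array([0.25, 0.25, 0.25, 0.25])]],
    labels=[[1], [2]],
    weights=[1.0, 1.0],
)
```

With uniform predictions the loss reduces to ln 2 + ln 4, which gives a quick sanity check of the implementation.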
CN202111009733.8A 2021-08-31 2021-08-31 Face counterfeiting detection method based on image domain frequency domain double-flow network Active CN113723295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111009733.8A CN113723295B (en) 2021-08-31 2021-08-31 Face counterfeiting detection method based on image domain frequency domain double-flow network


Publications (2)

Publication Number Publication Date
CN113723295A CN113723295A (en) 2021-11-30
CN113723295B true CN113723295B (en) 2023-11-07

Family

ID=78679589


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444566B (en) * 2021-12-16 2024-07-23 厦门市美亚柏科信息股份有限公司 Image forgery detection method and device and computer storage medium
CN114596608B (en) * 2022-01-19 2023-03-28 中国科学院自动化研究所 Double-stream video face counterfeiting detection method and system based on multiple clues
CN114596609B (en) * 2022-01-19 2023-05-09 中国科学院自动化研究所 Audio-visual falsification detection method and device
CN114841340B (en) * 2022-04-22 2023-07-28 马上消费金融股份有限公司 Identification method and device for depth counterfeiting algorithm, electronic equipment and storage medium
CN115311525B (en) * 2022-10-08 2023-03-14 阿里巴巴(中国)有限公司 Depth forgery detection method and corresponding device
CN117557562B (en) * 2024-01-11 2024-03-22 齐鲁工业大学(山东省科学院) Image tampering detection method and system based on double-flow network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320947A (en) * 2015-11-04 2016-02-10 博宏信息技术有限公司 Face in-vivo detection method based on illumination component
CN109583342A (en) * 2018-11-21 2019-04-05 重庆邮电大学 Human face in-vivo detection method based on transfer learning
CN110210570A (en) * 2019-06-10 2019-09-06 上海延华大数据科技有限公司 The more classification methods of diabetic retinopathy image based on deep learning
CN110263910A (en) * 2018-03-12 2019-09-20 罗伯特·博世有限公司 For storing the method and apparatus for efficiently running neural network
CN111783505A (en) * 2019-05-10 2020-10-16 北京京东尚科信息技术有限公司 Method and device for identifying forged faces and computer-readable storage medium
CN112686331A (en) * 2021-01-11 2021-04-20 中国科学技术大学 Forged image recognition model training method and forged image recognition method
CN112784790A (en) * 2021-01-29 2021-05-11 厦门大学 Generalization false face detection method based on meta-learning
CN113011357A (en) * 2021-03-26 2021-06-22 西安电子科技大学 Depth fake face video positioning method based on space-time fusion
CN113177965A (en) * 2021-04-09 2021-07-27 上海工程技术大学 Coal rock full-component extraction method based on improved U-net network and application thereof


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Leveraging Frequency Analysis for Deep Fake Image Recognition;Joel Frank 等;《Proceedings of the 37 th International Conference on Machine Learning》;第1-12页 *
Eye and Mouth Region Collaborative Video Face-Swap Forgery Detection Based on Inception3D Network; Han Yuchen et al.; Journal of Signal Processing; Vol. 37, No. 4; pp. 567-577 *
Live Face Detection Algorithm Using Hypercomplex Wavelet Generative Adversarial Networks; Li Ce et al.; Journal of Xi'an Jiaotong University; Vol. 55, No. 5; pp. 113-122 *


Similar Documents

Publication Publication Date Title
CN113723295B (en) Face counterfeiting detection method based on image domain frequency domain double-flow network
Lee et al. Detecting handcrafted facial image manipulations and GAN-generated facial images using Shallow-FakeFaceNet
Guo et al. Fake face detection via adaptive manipulation traces extraction network
Chen et al. A robust GAN-generated face detection method based on dual-color spaces and an improved Xception
Chang et al. Deepfake face image detection based on improved VGG convolutional neural network
CN113313657B (en) Unsupervised learning method and system for low-illumination image enhancement
He et al. Beyond the spectrum: Detecting deepfakes via re-synthesis
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
Wang et al. HidingGAN: High capacity information hiding with generative adversarial network
Yu et al. Detecting deepfake-forged contents with separable convolutional neural network and image segmentation
Peng et al. BDC-GAN: Bidirectional conversion between computer-generated and natural facial images for anti-forensics
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN115035052B (en) Fake face-changing image detection method and system based on identity difference quantification
Guo et al. Blind detection of glow-based facial forgery
Wu et al. Interactive two-stream network across modalities for deepfake detection
CN118037641A (en) Multi-scale image tampering detection and positioning method based on double-flow feature extraction
El-Gayar et al. A novel approach for detecting deep fake videos using graph neural network
Hu et al. Deep Learning‐Enabled Variational Optimization Method for Image Dehazing in Maritime Intelligent Transportation Systems
Ren et al. Student behavior detection based on YOLOv4-Bi
Raza et al. Holisticdfd: Infusing spatiotemporal transformer embeddings for deepfake detection
CN111191549A (en) Two-stage face anti-counterfeiting detection method
Zhang et al. Thinking in patch: Towards generalizable forgery detection with patch transformation
Tariang et al. Synthetic Image Verification in the Era of Generative Artificial Intelligence: What Works and What Isn’t There yet
CN114743148A (en) Multi-scale feature fusion tampering video detection method, system, medium, and device
Singh et al. Deepfake images, videos generation, and detection techniques using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant