Disclosure of Invention
The purpose of the invention is as follows: the invention provides a GAN-generated face detection method based on a dual-stream CNN structure fused with PRNU, which offers strong generalization and high robustness.
The technical scheme is as follows: a GAN-generated face detection method based on a dual-stream CNN structure fused with PRNU (Photo Response Non-Uniformity) comprises the following steps:
(1) constructing an RGB stream network, and augmenting the data by random erasing;
(2) inputting the preprocessed data set into a CNN for training to obtain GAN fingerprint features;
(3) constructing a PRNU stream, and extracting the PRNU image of a human face by an image denoising method;
(4) inputting the extracted PRNU image into a CNN for training to obtain PRNU features;
(5) fully fusing the GAN fingerprint features and the PRNU features, and inputting the fused information into a subsequent network;
(6) performing binary classification with a Softmax loss function to judge whether the face image is real or fake.
Preferably, in step (1), the RGB stream network is constructed as follows: CelebA-HQ is selected as the real face data set and StyleGAN as the fake face data set to train the network. The invention adopts an augmentation method, random erasing, to randomly occlude the face images: a rectangular area is randomly selected on the original image, and the pixels in that area are replaced with random values. In this process, the face images participating in training are occluded to different degrees, which enhances sample diversity and helps the network attend better to differences in image content. The random occlusion probability is set to 0.5, and the area ratio S of the occluding rectangle satisfies 0.02 < S < 0.4.
Preferably, in step (2), the preprocessed data set is input into a three-layer convolutional neural network for training, so that the network can fully explore the differences in the content of real and fake images and extract the GAN fingerprint features. In the three layer groups of the network, each layer group contains a convolutional layer, an LReLU activation function and a max pooling layer. The max pooling layer reduces the picture size, enlarges the receptive field of the convolution kernels, extracts high-level features, reduces the number of network parameters and prevents overfitting, while retaining as much spatial information of the picture as possible. The LReLU not only alleviates the gradient-vanishing problem but also lets the model reach convergence quickly, saving training time. The LReLU is defined by the following formula:
y_i = x_i, if x_i ≥ 0; y_i = x_i / a_i, if x_i < 0
where x_i and y_i are the input and output of the activation on the i-th feature map, and a_i is a fixed coefficient greater than 1.
Preferably, in step (3), the PRNU stream is constructed and the PRNU image is extracted by an image denoising method, specifically: first, the face image is passed through a low-pass filter to remove additive noise; then the low-pass-filtered image is subtracted from the original image to obtain the residual pattern noise. The formula is expressed as follows:
n=I-F(I)
where n is the pattern noise, I is the original image, and F(·) is the low-pass filtering operation.
Preferably, in step (4), the extracted face PRNU image is input into a three-layer convolutional network for training, so that the network focuses on the changes in the color-image pixel values themselves to extract the PRNU features; in the three layer groups of the network, each layer group includes a convolutional layer, an LReLU activation function, and a max pooling layer.
Preferably, in step (5), the GAN fingerprint features and the PRNU features are fully fused and input into a subsequent network to facilitate the final classification. The specific process is as follows: the extracted PRNU features are fused with the GAN fingerprint features using a concatenate function and used for the final classification. The formula is as follows:
z=concatenate(axis=2)([x.output,y.output])
where x is the PRNU feature, y is the GAN fingerprint feature, z is the fused feature, and axis is the concatenation dimension.
The final output feature maps are aggregated and then fed into two fully connected layers, likewise equipped with the unsaturated activation function LReLU and consisting of 1024 and 512 units, respectively. In addition, the invention applies L2 regularization in the fully connected layers, with the parameter λ = 0.0005.
Preferably, in step (6), binary classification is performed with a Softmax loss function to judge whether the face image is real or fake, improving detection precision.
Advantageous effects: compared with the prior art, the invention has the following notable effects: (1) compared with existing deep learning methods, the dual-stream CNN model of the invention demonstrates its effectiveness at a lower computing-resource cost; (2) the RGB stream is constructed to explore the image content itself, so that the model focuses more on the difference in GAN fingerprint features between real and forged faces; (3) a preprocessing operation is adopted in which samples are randomly erased, expanding the data, preventing overfitting and improving the robustness of the model; (4) the PRNU stream is constructed to study the differences in image pixel-value changes; the extracted face PRNU is taken directly as the input of this stream, so that the network can focus more on the significant differences between PRNU features, improving the generality and robustness of the proposed method.
Detailed Description
The present invention will be described in detail with reference to examples.
The invention constructs a dual-stream CNN network to realize the face forgery detection task, comprising an RGB stream and a PRNU stream. The RGB stream ensures high detection precision on images generated by the same GAN network, while the PRNU stream directs the network to focus more on the changes in the color-image pixel values themselves. At the beginning of the scheme the two streams each play their own role; at a later stage they are fused, so that the proposed scheme shows markedly better generalization ability. More importantly, the fused network has markedly improved resistance to various common attacks such as JPEG compression, Gaussian noise and Gaussian blur.
As shown in fig. 1, the GAN-generated face detection method based on the dual-stream CNN structure fused with PRNU includes the following steps: (1) constructing an RGB stream network, and augmenting the data by random erasing;
CelebA-HQ is selected as the real face data set and StyleGAN as the fake face data set to train the network. As shown in fig. 2, the invention randomly occludes the face images using an augmentation method, random erasing: a rectangular area is randomly selected on the original image, and the pixels in that area are replaced with random values. In this process, the face images participating in training are occluded to different degrees, which enhances sample diversity and helps the network attend better to differences in image content. The random occlusion probability is set to 0.5, and the area ratio S of the occluding rectangle satisfies 0.02 < S < 0.4, as in the sketch below.
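A minimal random-erasing sketch follows (Python/NumPy assumed; the erase probability 0.5 and the area bounds 0.02–0.4 follow the text, while the aspect-ratio range and all names are illustrative assumptions):

```python
import numpy as np

def random_erase(img, p=0.5, s_min=0.02, s_max=0.4, rng=None):
    """Randomly occlude one rectangle of an HWC uint8 image with random values."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:                       # erase with probability p
        return img
    h, w = img.shape[:2]
    area = rng.uniform(s_min, s_max) * h * w   # occluded area S as a fraction
    aspect = rng.uniform(0.3, 3.3)             # assumed aspect-ratio range
    rh = min(h, int(round(np.sqrt(area * aspect))))
    rw = min(w, int(round(np.sqrt(area / aspect))))
    y = rng.integers(0, h - rh + 1)            # top-left corner of the rectangle
    x = rng.integers(0, w - rw + 1)
    out = img.copy()
    out[y:y + rh, x:x + rw] = rng.integers(    # replace pixels with random values
        0, 256, (rh, rw) + img.shape[2:], dtype=img.dtype)
    return out
```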
(2) Inputting the preprocessed data set into a CNN for training to obtain GAN fingerprint features;
Because fingerprint features belong to the low-level texture features of an image, whereas deeper and more complex networks mainly extract semantic information, which runs contrary to the object of the invention, constructing a shallow network is more beneficial for learning the target features. The network model of the invention is based on the discriminator network of a simple GAN; by adjusting the hierarchical structure and changing the number of feature maps and the kernel size of each layer, a final three-layer CNN model is formed, in which each of the three layer groups comprises a convolutional layer, an LReLU activation function and a max pooling layer, as shown in fig. 3. The model input is a color image of size 224 × 224 × 3. The image is then sent through the three layer groups. Each layer group contains one convolutional layer (kernel size 3 × 3, stride 1 × 1) and one max pooling layer (kernel size 2 × 2, stride 2 × 2). The first convolutional layer outputs 32 feature maps: after the first convolution the output becomes 222 × 222 × 32, and after max pooling the feature map is halved to 111 × 111. Each subsequent convolutional layer outputs twice as many feature maps as it receives, i.e., 64 and 128. Placing a max pooling layer after each convolution reduces the convolutional output, which speeds up training and makes the model less prone to overfitting, greatly improving the training effect. Moreover, applying the unsaturated activation function LReLU after each convolutional layer not only alleviates the gradient-vanishing problem but also lets the model reach convergence faster, saving training time. The LReLU is expressed as:
y_i = x_i, if x_i ≥ 0; y_i = x_i / a_i, if x_i < 0
where x_i and y_i are the input and output of the activation on the i-th feature map, and a_i is a fixed coefficient greater than 1.
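As a concrete illustration, a minimal tf.keras sketch of this three-layer stream follows (the framework, the helper name build_stream and the LReLU slope are assumptions; the filter counts 32/64/128, the 3 × 3 convolutions with stride 1 and valid padding, and the 2 × 2 max pooling follow the text):

```python
import tensorflow as tf

def build_stream(name):
    """Three conv/LReLU/max-pool layer groups, as described in the text."""
    inp = tf.keras.Input(shape=(224, 224, 3))
    x = inp
    for filters in (32, 64, 128):              # each group doubles the maps
        x = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="valid")(x)
        x = tf.keras.layers.LeakyReLU(0.1)(x)  # slope 0.1 is an assumed value
        x = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)
    return tf.keras.Model(inp, x, name=name)

rgb_stream = build_stream("rgb_stream")
rgb_stream.summary()  # first group: 222 x 222 x 32 conv output, pooled to 111 x 111
```

With valid padding the first convolution maps 224 to 222 and the pooling halves it to 111, matching the sizes stated above.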
(3) Constructing a PRNU stream, and extracting the PRNU image of a human face by an image denoising method;
Many descriptions of image noise models exist in the research literature, but their basic ideas are roughly the same; M. Chen et al. analyzed the noise model most accurately and comprehensively. The pixel value of an image is composed of the ideal pixel value, multiplicative noise and various additive noises, and can be approximately expressed by the following formula:
I=f((1+K)·O)+n
where I is the actual pixel value, O is the pixel value obtained by capturing the natural scene through the lens, n is the sum of the additive noise generated during image processing, f(·) denotes the various camera operations, K is the PRNU multiplicative factor, and K·O is the multiplicative noise, i.e., the theoretical expression of the PRNU.
According to this analysis, the PRNU is multiplicative noise: it is a high-frequency signal, is highly dependent on the pixel value, and is difficult to acquire directly. The PRNU is therefore extracted by an image noise-reduction method so as to preserve its integrity as much as possible, as shown in fig. 4. First, the face image is passed through a low-pass filter to remove additive noise; then the low-pass-filtered image is subtracted from the original image to obtain the residual pattern noise, calculated as follows:
n=I-F(I)
where n is the pattern noise, I is the original image, and F(·) is the low-pass filtering operation.
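To connect this with the noise model above, an informal derivation can be given (a sketch under the assumptions that the low-pass output F(I) approximates the noise-free content f(O) and that f is roughly linear; n_add denotes the additive-noise sum of the earlier formula, which the text also writes as n):

n = I − F(I) ≈ f((1 + K)·O) + n_add − f(O) ≈ K·O + n_add

The residual thus retains the multiplicative PRNU term K·O together with leftover additive noise, which is why the subsequent CNN can learn PRNU features from it.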
Since a face PRNU that is as complete and clear as possible is needed for subsequent feature extraction, selecting a suitable filter F(·) is important. When processing complex images, traditional denoising methods such as Gaussian filtering and median filtering easily ignore the correlation among pixels and destroy the texture structure of the image, whereas the wavelet transform has good time-frequency characteristics and better captures image details such as edges and breakpoints. The invention therefore adopts wavelet filtering to remove the additive noise. First, the 'sym4' wavelet basis is selected for wavelet decomposition, yielding a low-frequency component (LL) and 3 high-frequency components (HL, HH and LH); then the high-frequency coefficients are set to 0 by threshold quantization; finally, the resulting wavelet coefficients are used to reconstruct the image. To obtain the best denoising effect, the wavelet decomposition is performed twice.
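A minimal sketch of this wavelet low-pass filter is given below (PyWavelets is an assumed library choice; the 'sym4' basis, the zeroing of the high-frequency coefficients and the two decomposition levels follow the text, and zeroing all detail coefficients is one reading of the threshold-quantization step):

```python
import numpy as np
import pywt  # PyWavelets

def extract_prnu(img):
    """img: 2-D float array (one colour channel); returns the residual n = I - F(I)."""
    coeffs = pywt.wavedec2(img, "sym4", level=2)       # two-level decomposition
    coeffs = [coeffs[0]] + [                           # keep LL, zero HL/LH/HH
        tuple(np.zeros_like(d) for d in detail) for detail in coeffs[1:]
    ]
    low_pass = pywt.waverec2(coeffs, "sym4")           # reconstruct F(I)
    low_pass = low_pass[:img.shape[0], :img.shape[1]]  # trim padding, if any
    return img - low_pass                              # residual pattern noise
```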
(4) Inputting the extracted PRNU image into a CNN for training to obtain PRNU features;
As shown in fig. 2, the extracted PRNU map is fed into the network for subsequent feature extraction. The input image size is still 224 × 224. The structure and parameters of this stream are consistent with the RGB stream and still consist of 3 layer groups. Each group consists of one convolutional layer (3 × 3 kernel, 1 × 1 stride) equipped with LReLU, and one max pooling layer (2 × 2 kernel, 2 × 2 stride). The first convolutional layer outputs 32 feature maps, and each subsequent convolutional layer outputs twice as many feature maps as it receives.
(5) Fully fusing the GAN fingerprint features and the PRNU features, and inputting the fused information into a subsequent network.
After the PRNU feature map and the RGB map have passed through their convolutional and pooling layers respectively, the two streams are merged: the extracted PRNU features and GAN fingerprint features are fully fused using a concatenate function and used for the final classification. The specific algorithm is expressed as follows:
z=concatenate(axis=2)([x.output,y.output])
where x is the PRNU feature, y is the GAN fingerprint feature, z is the fused feature, and axis is the concatenation dimension.
The final output feature maps are aggregated and then fed into two fully connected layers, likewise equipped with the unsaturated activation function LReLU and consisting of 1024 and 512 units, respectively. In addition, the invention applies L2 regularization in the fully connected layers, with the parameter λ = 0.0005.
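A sketch of this fusion and classification head follows (tf.keras assumed, reusing the hypothetical build_stream helper from the earlier sketch; the unit counts 1024 and 512, the L2 parameter λ = 0.0005 and the 2-way softmax follow the text, while the Flatten step, the channel-axis choice and the optimizer are assumptions):

```python
import tensorflow as tf

rgb_stream = build_stream("rgb_stream")    # GAN-fingerprint branch
prnu_stream = build_stream("prnu_stream")  # PRNU branch

# The text concatenates along a "splicing dimension" (axis=2 in its formula);
# with NHWC tensors the channel axis is -1, which is assumed here.
z = tf.keras.layers.Concatenate(axis=-1)(
    [prnu_stream.output, rgb_stream.output])
z = tf.keras.layers.Flatten()(z)

reg = tf.keras.regularizers.l2(0.0005)     # L2 regularization, lambda = 0.0005
z = tf.keras.layers.Dense(1024, kernel_regularizer=reg)(z)
z = tf.keras.layers.LeakyReLU(0.1)(z)
z = tf.keras.layers.Dense(512, kernel_regularizer=reg)(z)
z = tf.keras.layers.LeakyReLU(0.1)(z)
out = tf.keras.layers.Dense(2, activation="softmax")(z)  # real vs. fake

model = tf.keras.Model([rgb_stream.input, prnu_stream.input], out)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```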
(6) Performing binary classification with a Softmax loss function to judge whether the face image is real or fake.
The invention judges the authenticity of the face image, so the task is a binary classification problem. The output of the Softmax function corresponds to the probability distribution over labels for the input image, where the label of a real image is set to 1 and the label of a tampered image is set to 0. Softmax is monotonically increasing in each logit, so if the input picture is real, the output value is closer to 1; if the input picture is tampered, the output value is closer to 0. Softmax can therefore complete the binary classification.
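For reference, the standard two-class softmax that produces these probabilities (a textbook definition, not specific to the invention) is:

softmax(z)_k = exp(z_k) / (exp(z_0) + exp(z_1)), k ∈ {0, 1}

and the predicted label is the class with the larger probability.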
In summary, the GAN-generated face detection method of the invention fully exploits the differences between genuine and counterfeit faces at the image-content and pixel levels, using GAN fingerprint features and PRNU features as the principal bases for detection. The constructed dual-stream CNN network not only ensures high detection precision on images generated by the same GAN, but also retains good generalization to images generated by other GANs, as shown in fig. 5. Comparison experiments were performed using images generated by 6 GANs, three of which served as training data and the other three as test data. In these experiments the proposed method achieved the best results in most settings compared with the other four methods. More importantly, the method is more robust against various common attacks such as sampling, JPEG compression, Gaussian noise and Gaussian blur, as shown in fig. 6. Compared with other methods, it maintains good detection performance under the various attacks and exhibits the best stability as the attack strength is gradually increased.