CN117496583B - Deep fake face detection positioning method capable of learning local difference


Info

Publication number
CN117496583B
CN117496583B (application CN202311841206.2A)
Authority
CN
China
Prior art keywords
local
feature
feature map
map
fake
Prior art date
Legal status
Active
Application number
CN202311841206.2A
Other languages
Chinese (zh)
Other versions
CN117496583A (en)
Inventor
夏志华 (Xia Zhihua)
冷凌云 (Leng Lingyun)
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date
Filing date
Publication date
Application filed by Jinan University
Priority to CN202311841206.2A
Publication of CN117496583A
Application granted
Publication of CN117496583B
Status: Active


Classifications

    • G06V40/161 Detection; Localisation; Normalisation (human faces)
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification
    • G06V40/40 Spoof detection, e.g. liveness detection
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a deep fake face detection and positioning method that learns local differences, which can detect the authenticity of a deep fake face image and locate the tampered region. It comprises: a local difference feature extraction network, a cross-hierarchy attention fusion fake positioning network, and a high generalization true and false classifier. The first two networks are the core of the method: the local similarity calculation module in the local difference feature extraction network improves the feature extraction network's ability to capture fine-grained local difference traces, while the cross-hierarchy attention fusion module aggregates detail information from multiple hierarchies to improve the accuracy of fake positioning. Even with limited training data, the invention achieves high generalization performance when tested on data from unknown forgery methods, thereby extending the generalization ability of existing detection models.

Description

Deep fake face detection positioning method capable of learning local difference
Technical Field
The invention belongs to the technical field of digital forensics, and particularly relates to a deep fake face detection and positioning method capable of learning local differences.
Background
Deep face forgery is a realistic face forgery technology based on deep learning. Such techniques can replace the original face with a target face, or tamper with attributes of the target face such as expression, facial features, and hairstyle, to generate the forgery intended by its producer. Unlike traditional face forgery methods, deep face forgery benefits from the rapid development of deep learning: a convincing face-swap video can be produced with only a small number of samples.
In recent years, research interest in deep forgery detection has grown steadily, and researchers have proposed a large number of detection methods. According to the features they detect, these methods can be divided into two main categories: intra-frame anomaly and inter-frame inconsistency.
Detection methods based on intra-frame anomalies focus on anomalies that may appear in a forged face image, such as blurring, jitter, overlapping, features from different sources, and inconsistent identity information. Such methods therefore attempt to detect forged face images by capturing differences in brightness, color, texture, features, and identity. However, they do not perform well on lower-resolution data, and their performance drops sharply when generalized to datasets produced by unknown forgery methods.
Detection methods based on inter-frame inconsistency focus on the continuity of face motion over the duration of a forged face video, capturing possible motion inconsistencies and abnormal distortions of color and texture. These methods typically extract features from consecutive frames with a convolutional neural network and then feed the feature sequence into a network that captures its temporal inconsistencies, thereby detecting forged face videos. However, network models over feature-sequence streams are difficult to train, computationally expensive, heavily dependent on training data, and insufficiently generalizable.
In general, existing deep fake face detection methods face the following drawbacks and challenges:
(1) Insufficient generalization to data from unknown forgery methods:
training data covering only a limited set of classes brings a risk of overfitting; different forgery algorithms leave different traces, and if a true and false classification model is trained only on the specific flaws of one forgery algorithm, the resulting classifier has very low detection accuracy on unknown forgery types and severely insufficient generalization ability.
(2) Insufficient robustness to low-quality data:
existing deep fake detection methods lack robustness to video compression; forgery traces can be destroyed by quality-degradation processing such as compression and blurring of the fake face video, so methods that rely on capturing such traces suffer low detection accuracy and lack generality.
Disclosure of Invention
To solve the above technical problems, the invention provides a deep fake face detection and positioning method capable of learning local differences, which improves the generalization ability of the detection model on data from unknown forgery methods and accurately locates forged regions.
To achieve the above object, the present invention provides a deep fake face detection and positioning method capable of learning local differences, comprising:
acquiring a fake face video, extracting a video key frame from the fake face video, and acquiring a face image to be detected in the video key frame;
the method comprises the steps of constructing a detection model, inputting the face image to be detected into the detection model, detecting the authenticity of the face image to be detected and positioning a fake area, wherein the detection model comprises a local difference feature extraction network, a high generalization authenticity classifier and a cross-hierarchy attention fusion fake positioning network, the detection model is obtained through training of a training set, and the training set is a face image training set.
Optionally, the process of obtaining the training set includes:
acquiring an original forged face video data set, and extracting video key frames from the original forged face video data set;
positioning and cutting a face area in the video key frame to obtain an original face image dataset;
judging whether to carry out enhancement processing on the original face image in the original face image dataset according to a preset enhancement probability;
if enhancement is to be performed, the original face image is enhanced; otherwise the original face image is kept unchanged; the original face images and the enhanced face images together form the face image training set.
Optionally, inputting the face image to be detected into the detection model, and detecting the authenticity of the face image to be detected and positioning the counterfeit area includes:
performing feature extraction on the face image to be detected based on the local difference feature extraction network, and outputting an intermediate layer feature map and a final feature map, wherein the local difference feature extraction network is a feature extraction network optimized by the local similarity calculation module, and the feature extraction network is an Xception network;
inputting the middle layer feature map and the final feature map into the cross-hierarchy attention fusion fake positioning network, outputting a prediction mask image, and positioning fake areas in the face image to be detected;
and inputting the final feature map into the high generalization authenticity classifier to obtain the authenticity predicted value of the face image to be detected.
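The dual-branch pipeline in this claim (a shared feature extractor feeding a fake positioning branch and an authenticity branch) can be sketched as a minimal PyTorch module. This is a hedged illustration: the tiny convolutional backbone below stands in for the Xception network, and all layer sizes are our assumptions, not the patent's configuration.

```python
import torch
import torch.nn as nn

class DualTaskDetector(nn.Module):
    def __init__(self, mid_ch=16, final_ch=32):
        super().__init__()
        # Stand-in feature extractor: an intermediate layer and a final layer.
        self.stem = nn.Sequential(nn.Conv2d(3, mid_ch, 3, stride=2, padding=1), nn.ReLU())
        self.body = nn.Sequential(nn.Conv2d(mid_ch, final_ch, 3, stride=2, padding=1), nn.ReLU())
        # Fake positioning branch: fuse upsampled final features with intermediate features.
        self.fuse = nn.Conv2d(mid_ch + final_ch, 1, 3, padding=1)
        # Authenticity branch: convolution, adaptive pooling, fully connected layer.
        self.cls = nn.Sequential(nn.Conv2d(final_ch, final_ch, 1), nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(final_ch, 1))

    def forward(self, x):
        mid = self.stem(x)                      # intermediate layer feature map
        final = self.body(mid)                  # final feature map
        up = nn.functional.interpolate(final, size=mid.shape[-2:], mode="bilinear",
                                       align_corners=False)
        mask = torch.sigmoid(self.fuse(torch.cat([up, mid], dim=1)))  # predicted mask image
        logit = self.cls(final)                 # true/false predicted value
        return mask, logit

model = DualTaskDetector()
mask, logit = model(torch.randn(2, 3, 64, 64))
```

Both outputs come from one forward pass, mirroring the Y-shaped dual-task design described later in the embodiment.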
Optionally, optimizing the feature extraction network using the local similarity calculation module includes:
extracting features of the face image training set by using the feature extraction network to obtain an original middle layer feature map;
inputting the original middle layer feature map into the local similarity calculation module, learning local difference information of the original middle layer feature map, optimizing the feature extraction network, and obtaining the local difference feature extraction network.
Optionally, inputting the original intermediate layer feature map into the local similarity calculation module, learning local difference information of the original intermediate layer feature map, and optimizing the feature extraction network includes:
carrying out local feature similarity calculation prediction and aggregation on the original intermediate layer feature map to obtain a predicted local similarity map;
acquiring an initial fake region mask by using the real face image and the corresponding fake face image;
downsampling the initial fake region mask to the size of the original intermediate layer feature map to obtain a real fake region mask;
based on the real fake area mask, acquiring a real mask local similarity graph by using Cartesian products;
constraining the predicted local similarity graph through the real mask local similarity graph, and training a local similarity calculation module;
and training and optimizing the feature extraction network by utilizing the local similarity calculation module.
Optionally, based on the true counterfeit area mask, obtaining the true mask local similarity map using a cartesian product includes:
expanding the true fake region mask into a one-dimensional tensor to obtain a first feature map;
calculating the Cartesian product of all position features of the first feature map to obtain a second feature map, which is a tensor of shape ((HW)², 2);
reshaping the second feature map into a third feature map of shape (HW, HW, 2);
separating the third feature map into two feature maps of the same size, taking the absolute value of their difference, and binarizing it to obtain the true mask local similarity map;
the process of obtaining the true mask local similarity map is formulated as follows:
(F1, F2) = split(reshape(cartesia_prod(f1, f2))), m = binary(|F1 - F2|)
wherein cartesia_prod represents the feature Cartesian product calculation, reshape represents the shape adjustment operation, split represents the separation of a feature dimension, binary represents the binarization function, F1 and F2 are the two mask features, f1 and f2 are the one-dimensional features to which the two mask feature shapes are stretched, and m is the true mask local similarity map.
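The Cartesian-product construction above can be computed in a few lines of PyTorch. This is an illustrative sketch: `true_mask_similarity` is our name, and the nearest-neighbor downsampling to the feature-map size follows the preceding claim step.

```python
import torch
import torch.nn.functional as F

def true_mask_similarity(mask):
    """Build the true mask local similarity map m from a binary forgery mask,
    following the Cartesian-product construction: mask is a (H, W) tensor with
    1 in forged regions and 0 elsewhere."""
    f = mask.reshape(-1)                   # flatten to a one-dimensional tensor of length HW
    pairs = torch.cartesian_prod(f, f)     # second feature map: ((HW)^2, 2), every pair of positions
    n = f.numel()
    pairs = pairs.reshape(n, n, 2)         # third feature map: (HW, HW, 2)
    a, b = pairs.unbind(dim=-1)            # split into two (HW, HW) feature maps
    return (a - b).abs().round()           # binarize: 1 where the two labels differ

# Downsample a full-resolution mask to the intermediate feature-map size first:
full = torch.zeros(1, 1, 8, 8)
full[..., 4:, 4:] = 1.0                    # bottom-right quadrant is forged
small = F.interpolate(full, size=(4, 4), mode="nearest")[0, 0]
m = true_mask_similarity(small)            # (16, 16) true mask local similarity map
```

Positions with differing real/fake labels get 1, matching pairs get 0, so the map is symmetric with a zero diagonal.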
Optionally, performing local feature similarity calculation, prediction, and aggregation on the intermediate layer feature map to obtain the predicted local similarity map includes:
expanding the intermediate layer feature map and extracting the local feature tensors in the expanded feature map, obtaining a plurality of local feature tensors;
concatenating each local feature tensor with each of the remaining local feature tensors, obtaining new tensors;
performing a convolution learning operation on each new tensor with a convolution kernel module to obtain a similarity threshold;
combining the similarity thresholds in position order to obtain the predicted local similarity map;
the process of obtaining the predicted local similarity map is formulated as follows:
s((i,j),(m,n)) = binary(Conv(concat(F(i,j), F(m,n)))), M = assemble({s((i,j),(m,n))})
wherein Conv represents the convolution operation, assemble represents the aggregation operation, s((i,j),(m,n)) is the similarity threshold obtained by the similarity calculation of the two tensors at positions (i,j) and (m,n), binary represents the binarized prediction of feature similarity, F(i,j) and F(m,n) are the tensors at positions (i,j) and (m,n) on the feature map F, and M is the final predicted local similarity map.
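A hedged sketch of the pairwise scoring described above: every pair of C×1×1 local feature tensors is stacked into a C×2×1 tensor and scored by a shared 2×1 convolution, yielding an HW×HW predicted similarity map. The class name and channel count are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class LocalSimilarity(nn.Module):
    """Minimal sketch of the local similarity calculation module."""
    def __init__(self, channels):
        super().__init__()
        # Shared 2x1 convolution kernel that scores one pair of local tensors.
        self.score = nn.Conv2d(channels, 1, kernel_size=(2, 1))

    def forward(self, feat):                       # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        f = feat.flatten(2).transpose(1, 2)        # (B, HW, C): one tensor per position
        n = h * w
        left = f.unsqueeze(2).expand(b, n, n, c)   # tensor at position (i, j)
        right = f.unsqueeze(1).expand(b, n, n, c)  # tensor at position (m, n)
        pairs = torch.stack([left, right], dim=-1) # (B, N, N, C, 2): concatenated pairs
        pairs = pairs.reshape(b * n * n, c, 2, 1)  # C x 2 x 1 input for the kernel
        s = self.score(pairs).reshape(b, n, n)     # one similarity threshold per pair
        return torch.sigmoid(s)                    # predicted local similarity map

lsm = LocalSimilarity(channels=8)
sim = lsm(torch.randn(2, 8, 4, 4))                 # (2, 16, 16) map for a 4x4 feature grid
```

The sigmoid keeps each threshold in (0, 1), ready for the BCE constraint against the true mask local similarity map.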
Optionally, inputting the intermediate layer feature map and the final feature map into the cross-hierarchy attention fusion fake positioning network and outputting the prediction mask image includes:
upsampling the final feature map and inputting it, together with the intermediate layer feature map, into a convolution module, fusing features of different hierarchies, and outputting a fused feature map;
inputting the fused feature map into a cross-hierarchy attention fusion module for adaptive feature fusion to obtain a new feature map;
applying a convolution operation to the new feature map output by the last cross-hierarchy attention fusion module and outputting the prediction mask image.
Optionally, acquiring the new feature map includes:
performing pooling and convolution on the fused feature map in the cross-hierarchy attention fusion module to obtain a channel attention map with the same number of channels as the fused feature map;
performing a Hadamard product between the fused feature map and the channel attention map to obtain the new feature map.
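The pooling-convolution channel attention and Hadamard product described in these steps can be sketched as follows; the reduction ratio and the particular layer choices are our assumptions, not the patent's.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of the cross-hierarchy attention fusion step: a pooled, convolved
    channel attention map reweights the fused feature map via a Hadamard
    (element-wise) product."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # pool to (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, fused):
        # Hadamard product between the fused feature map and the attention map
        # (broadcast across spatial positions).
        return fused * self.attn(fused)

cam = ChannelAttentionFusion(channels=16)
x = torch.randn(2, 16, 8, 8)
out = cam(x)
```

Because the weights lie in (0, 1), each channel of the fused feature map is attenuated rather than amplified, emphasizing informative hierarchies.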
Optionally, obtaining the true and false predicted value of the face image to be detected includes: inputting the final feature map into the high generalization authenticity classifier, and obtaining the predicted value after convolution, adaptive pooling, and full connection.
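A minimal sketch of the claimed convolution, adaptive pooling, and full connection pipeline; the channel counts and the two-logit output are illustrative assumptions.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # convolution over the final feature map
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # adaptive pooling: any H x W down to 1 x 1
    nn.Flatten(),
    nn.Linear(64, 2),                             # full connection: real vs fake logits
)

final_feat = torch.randn(4, 32, 7, 7)             # final feature map from the backbone
logits = classifier(final_feat)
pred = logits.argmax(dim=1)                       # 0/1 true-and-false prediction
```

Adaptive pooling makes the head independent of the spatial size of the final feature map, so the same classifier works at multiple input resolutions.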
Technical effects of the invention: the local difference calculation module obtains intermediate features from the intermediate layer and performs local similarity convolution on them to obtain a local difference similarity map used as a loss constraint; only intermediate layer features are taken from the detection network, and the specially designed fake positioning module additionally receives the output of the feature extraction network as input.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for detecting and locating a face in deep forgery capable of learning local differences according to an embodiment of the invention;
FIG. 2 is a flow chart of a method for computing local similarity according to an embodiment of the present invention;
FIG. 3 is a comparison of the true mask local similarity map and the predicted local similarity map of a forged face image according to an embodiment of the invention;
FIG. 4 is a diagram of counterfeit localization prediction in accordance with an embodiment of the present invention;
FIG. 5 is a cross-hierarchy feature attention fusion module in accordance with an embodiment of the present invention;
fig. 6 is a diagram of a cross-hierarchy attention fusion forgery positioning network according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
It should be noted that, all actions for acquiring signals, information or data in the present application are performed under the condition of conforming to the corresponding data protection rule policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
Based on deep learning, the invention designs a dual-task Y-shaped network model built on local feature differences for simultaneously detecting the authenticity of a face image and locating the forged region. The framework mainly consists of three parts: a local difference feature extraction network, a cross-hierarchy attention fusion fake positioning network, and a high generalization true and false classifier. Starting from the observation that forged face images readily mix content from different sources, the method focuses on fine-grained local feature differences and enhances the detection model's ability to mine local differences in forged face images by learning to compute a local feature difference map. In addition, combining the cross-hierarchy attention fusion fake positioning network further improves the model's ability to gather comprehensive inconsistency information. After the model has been trained to capture fine local difference information, the extracted features are fed into the high generalization true and false classifier, yielding more robust and generalizable detection.
As shown in fig. 1, the present embodiment provides a depth face detection positioning method capable of learning local feature differences, which can be used for true and false detection of depth counterfeit face images and positioning of tampered areas, and includes:
(1) Obtaining fake face videos: the face video data adopted by the invention come from the FaceForensics++, Celeb-DF, and DFD datasets. FaceForensics++ is a large public face forgery dataset widely used in deep fake detection tasks. It contains 1,000 authentic videos and 4,000 forged videos generated by four tampering methods: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. FaceForensics++ also comes in three compression levels: the original version, a high-quality version (C23), and a low-quality version (C40). Celeb-DF is a challenging fake face dataset consisting of 569 real videos extracted from YouTube and 5,639 fake videos. These fake videos are generated by an improved version of a public deep fake generation algorithm, which mitigates the low resolution and inconsistent colors of earlier fake faces. DFD is a large fake face dataset containing 363 real videos and 3,068 fake videos in different scenes. Together, these datasets satisfy the generalization and robustness conditions required by the experiments of the invention. The experiments are implemented in Python on the PyTorch framework. The face forgery data are divided into a training set, a validation set, and a test set, used respectively for training, validating, and testing the detection model.
(2) Extracting video key frames: the OpenCV tool is used to capture frames from the video at regular intervals, yielding a series of consecutive, non-repeating video frames.
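Step (2) can be sketched with OpenCV. `sample_indices` and `extract_frames` are our illustrative names, and the sampling interval is an assumption; the pure sampling logic is separated out so it can be checked without a video file.

```python
def sample_indices(total_frames, step):
    """Indices of the frames kept when grabbing one frame every `step` frames."""
    return list(range(0, total_frames, step))

def extract_frames(video_path, step=10):
    """Read a video with OpenCV and keep every `step`-th frame."""
    import cv2  # deferred import so the sampling logic above stays dependency-free
    cap = cv2.VideoCapture(video_path)
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:          # end of video (or unreadable file)
            break
        if idx % step == 0:  # keep every `step`-th frame: consecutive, non-repeating
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

kept = sample_indices(25, 10)  # which of 25 frames survive with step 10
```

A face-recognition crop (step (3)) would then run on each kept frame.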
(3) Cropping face region images: a face recognition algorithm locates the face region in each video frame, aligns it, and crops it at a suitable ratio. After removing redundant background information, and to ensure uniform input to the detection network, each face image is uniformly resized to 224×224 pixels, yielding the face image dataset.
(4) Processing images with random enhancement: to improve the robustness of the detection model, the invention preprocesses images with a random enhancement scheme. Five enhancement methods are used: random masking, random flipping, random noise, image blending, and grayscale processing. In the training stage, each image is enhanced or not according to a probability, and each enhancement method is selected with equal probability. The enhancement steps are as follows:
(4.1) given an enhancement probability threshold p, a random number r is drawn before each image is fed, to judge whether enhancement processing is needed;
(4.2) if enhancement is needed (r < p), one of the five enhancement methods is selected at random, the image is preprocessed by that method and input into the feature extraction network; otherwise the original image is input.
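Steps (4.1) and (4.2) amount to a probability-gated uniform choice among the five enhancement methods, sketched below; the method names mirror the text, while the actual image transforms are application-specific and omitted here.

```python
import random

METHODS = ["random_mask", "random_flip", "random_noise", "image_blend", "grayscale"]

def choose_augmentation(threshold, rng=random):
    """Return the name of the chosen enhancement, or None to use the raw image."""
    if rng.random() >= threshold:   # random number at or above p: no enhancement
        return None
    return rng.choice(METHODS)      # each of the five methods is equally likely

rng = random.Random(0)              # seeded for reproducibility of the demo
picks = [choose_augmentation(0.5, rng) for _ in range(1000)]
```

With p = 0.5, roughly half the images pass through unchanged and the rest are split evenly across the five methods.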
(5) The local difference feature extraction network extracts image features: an Xception network pre-trained on the ImageNet dataset and augmented with the local similarity calculation module is adopted as the feature extraction network of the invention. The Xception network is a linear stack of depth-separable convolutional layers with residual connections, replacing the convolution operations of the original Inception V3 with depth-separable convolutions; this improves model performance without substantially increasing network complexity. Owing to its excellent performance on image datasets, Xception has gradually become a mainstream backbone for many image feature extraction tasks, so it is selected as the feature extraction network here and optimized by adding the local similarity calculation module. The feature extraction flow in the training phase is as follows:
(5.1) loading model parameters pre-trained on the ImageNet dataset into the network;
(5.2) the feature extraction network takes the face image from step (4.2) and extracts image features through a series of intermediate network layers; the input to the last fully connected layer serves as the feature input of the subsequent high generalization true and false classifier and the cross-hierarchy attention fusion fake positioning network, and in addition to these final features, the fake positioning network also receives the intermediate layer features of the feature extraction network as input;
(5.3) a moderately sized feature map from an intermediate layer of the Xception network is extracted and input into the local similarity calculation module (LLSM) to learn the local similarity map.
(6) Local similarity map learning: the pre-trained deep neural network extracts features from the enhanced face image, and the local feature difference pattern learned by the network's intermediate layer features is computed as shown in fig. 2. To strengthen the detection model's sensitivity to local anomalies, the local similarity calculation module forces the model to learn the local difference information of the feature map. The calculation flow is as follows:
(6.1) take an intermediate layer feature map of the feature extraction network, of shape C×H×W; first flatten it to shape C×(HW); for arbitrary position coordinates (i, j), extract the local feature tensor F(i,j) of shape C×1×1, where C is the number of channels; then concatenate each local feature tensor with every other feature tensor, forming new tensors of shape C×2×1;
(6.2) exploiting the learning nature of convolution, the invention employs a dedicated convolution kernel module of size 2×1 to perform a convolution learning operation on all new tensors from (6.1); that is, the similarity of the two concatenated tensors is computed by convolution, finally yielding a similarity threshold of shape 1×1×1, and the HW thresholds are combined in position order to obtain one row of the local similarity map. For any i, m ∈ [1, H] and j, n ∈ [1, W], the following formula holds:
s((i,j),(m,n)) = binary(Conv(concat(F(i,j), F(m,n)))), M = assemble({s((i,j),(m,n))})
wherein Conv represents the convolution operation, assemble represents the aggregation operation, s((i,j),(m,n)) is the similarity threshold obtained by the similarity calculation of the two tensors at positions (i,j) and (m,n), binary represents the binarized prediction of feature similarity, F(i,j) and F(m,n) are the tensors at positions (i,j) and (m,n) on the feature map F, and M is the final predicted local similarity map.
(6.3) after computing the local similarity of all the features in steps 6.1 and 6.2, combining to obtain a predicted local similarity map M of HW×HW size. In order to make the feature local similarity map reflect the difference between the local features of the face image, a real mask local similarity map m is constructed for constraint.
(6.4) the real mask local similarity graph construction process in the step 6.3 is as follows: firstly, a real face image and a corresponding fake face image are cut to obtain a real fake area mask, the real mask is downsampled to enable the size of the real mask to be consistent with that of an intermediate layer feature image, then the mask image is unfolded to be a one-dimensional tensor, the one-dimensional tensor is unfolded to obtain a first feature image, cartesian products are calculated on all position features in the first feature image to obtain a second feature image, the shape of the second feature image is H x W x 2 tensor, and the second feature image is returned to be the shape after the shape is adjustedIs a third feature map of (2). And then separating the third feature map into two feature maps with the same size according to a third dimension, and binarizing the two feature maps after absolute value difference to obtain a final real mask local similarity map. As shown in fig. 3, the first line and the second line are a true mask local similarity map and a predicted local similarity map, respectively.
The process is formulated as follows:

(m_1, m_2) = split(reshape(cartesia_prod(f)))
m = binary(|f_1 − f_2|)

wherein cartesia_prod represents the feature Cartesian product operation, reshape represents a shape adjustment operation, split represents the separation of feature dimensions, binary represents the binarization function, m_1 and m_2 are the two mask features separated along the third dimension, f_1 and f_2 are the one-dimensional features to which the two mask feature shapes are stretched, f is the one-dimensional tensor unfolded from the real mask, and m is the real mask local similarity map.
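Assuming the downsampled mask is binary, the construction of step (6.4) reduces to comparing every pair of mask positions; a minimal PyTorch sketch (the function name is illustrative):

```python
import torch

def true_mask_similarity(mask):
    """Builds the HW x HW ground-truth local similarity map from a binary
    fake-area mask downsampled to the feature-map size (step 6.4)."""
    f = mask.flatten()                      # first feature map: 1-D, length HW
    n = f.numel()
    pairs = torch.cartesian_prod(f, f)      # second feature map: (HW*HW, 2)
    pairs = pairs.reshape(n, n, 2)          # third feature map: (HW, HW, 2)
    m1, m2 = pairs.unbind(dim=2)            # split along the third dimension
    # positions whose real/fake labels differ get 1, identical labels get 0
    return ((m1 - m2).abs() > 0.5).float()  # binarized similarity map
```

For example, a 2×2 mask with a single fake pixel yields a 4×4 map whose entries are 1 exactly where a real position is paired with a fake one.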
Finally, a BCE loss function constrains the local feature similarity map to fit the real mask local similarity map, training the local similarity calculation module. The local similarity loss function is expressed as:

L_sim = −(1/N) Σ_{k=1}^{N} [ m_k log(M_k) + (1 − m_k) log(1 − M_k) ]

wherein m_k represents the real mask local similarity map of the k-th sample, M_k represents the predicted local similarity map, k indicates the sample index, and N represents the total amount of training data.
In the training stage the model adopts BCE loss functions, namely the local similarity loss, the true-false classification loss, and the fake localization loss, which are trained simultaneously.
The fake localization loss function is expressed as:

L_loc = −(1/N) Σ_{k=1}^{N} [ g_k log(p_k) + (1 − g_k) log(1 − p_k) ]

wherein g_k represents the true mask map and p_k represents the prediction mask map.
The true-false classification loss function is expressed as:

L_cls = −(1/N) Σ_{k=1}^{N} [ y_k log(ŷ_k) + (1 − y_k) log(1 − ŷ_k) ]

wherein ŷ_k represents the true-or-false predicted value and y_k represents the label.
The total training loss function is expressed as:

L = L_cls + λ_1 · L_sim + λ_2 · L_loc

wherein λ_1 and λ_2 represent the weight values.
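The three BCE terms of the training stage can be sketched jointly as follows; standard BCE forms are assumed for every term, and w_sim / w_loc stand in for the weight values:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_label, label, pred_mask, true_mask,
               pred_sim, true_sim, w_sim=1.0, w_loc=1.0):
    """Joint objective: true-false classification loss plus weighted local
    similarity loss and fake localization loss, all BCE as described."""
    l_cls = F.binary_cross_entropy_with_logits(pred_label, label)
    l_loc = F.binary_cross_entropy(pred_mask, true_mask)   # mask fitting
    l_sim = F.binary_cross_entropy(pred_sim, true_sim)     # similarity fitting
    return l_cls + w_sim * l_sim + w_loc * l_loc
```

All three terms are computed per batch and back-propagated together, so the feature extraction network receives gradients from the classifier, the localizer, and the local similarity calculation module simultaneously.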
The local similarity calculation module is an additional plug-in module that strengthens the feature extraction network's ability to extract deep-level difference features, making the classification features received by the classifier more robust and generalizable. By fitting the local similarity map learned through convolution over the intermediate-layer features to the real mask local similarity map and back-propagating the error, the module drives the feature extraction network to capture fine-grained local feature differences; after joint training with this module, the final features passed to the classifier become more general, improving the classifier's generalization across different datasets. On the other hand, the local similarity map produced by the module is strongly correlated with the mask map, so adding the module also helps the cross-hierarchy attention fusion fake positioning network recover fine fake areas, improving fake positioning capability.
(7) Cross-hierarchy attention fusion fake positioning network prediction: the network receives the final features and intermediate features output by the feature extraction network in step (5.2) as input, fuses features from different hierarchies during its upsampling process to enhance its ability to identify abnormal information, and finally outputs a predicted mask image; the network structure is shown in fig. 6 and a localization prediction example in fig. 4. The prediction steps are as follows:
(7.1) The cross-hierarchy attention fusion fake positioning network receives the final features and middle-layer features of the local difference feature extraction network. After the first upsampling module, the final features are channel-concatenated with the first set of middle-layer features and input into a convolution module; after the second upsampling module, the features are combined with the second set of middle-layer features, input into a convolution module, and then fed into a further upsampling module, realizing interactive fusion of feature information across layers of different depths.
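One stage of the upsample-concatenate-convolve pattern in step (7.1) might look like the sketch below; the channel counts and the layout of the convolution block are assumptions:

```python
import torch
import torch.nn as nn

class FusionDecoderStage(nn.Module):
    """One decoder stage of the fake positioning network: upsample the deeper
    feature, concatenate the skipped middle-layer feature along the channel
    dimension, then fuse with a convolution block (step 7.1)."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(deep_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        # channel connection of upsampled deep feature and middle-layer feature
        x = torch.cat([self.up(deep), skip], dim=1)
        return self.fuse(x)
```

Chaining such stages over successively shallower middle-layer features realizes the cross-depth interaction described above.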
(7.2) In contrast to the plain feature skip connections in the local difference feature extraction network, an attention-based skip-connection fusion method is introduced in the middle layers of the fake positioning network: the localizer's middle-layer features are skipped to a cross-level attention fusion module (CLAFM) for adaptive feature fusion. As shown in fig. 5, the CLAFM receives feature inputs from different levels; before the N convolution blocks, it applies a series of pooling and convolution operations to the feature map to obtain a channel attention map whose channel size matches the current input feature map, and computes the Hadamard product of the attention map and the input feature map to obtain a new feature map.
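The channel attention inside CLAFM can be sketched as a squeeze-and-excitation-style block; the patent only specifies "a series of pooling convolution processes" followed by a Hadamard product, so the exact stack below is an assumption:

```python
import torch
import torch.nn as nn

class CrossLevelAttentionFusion(nn.Module):
    """Adaptive skip fusion (step 7.2): pooling + convolutions produce a
    channel attention map with the same channel size as the input, which is
    applied to the input via a Hadamard (element-wise) product."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global pooling
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # channel attention
        )

    def forward(self, skip_feat):
        # Hadamard product of the attention map and the input feature map
        return skip_feat * self.attn(skip_feat)
```

The attended feature map keeps the input's shape, so it can be concatenated into the decoder stage exactly like a plain skip connection.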
(7.3) The final features undergo a series of upsampling and feature fusion operations, and the prediction mask image is output after a final 1×1 convolution layer.
(8) High-generalization true-false classifier identification: the high-generalization true-false classifier likewise receives the final features as input; the true-false predicted value is obtained after convolution, adaptive pooling, and a fully connected layer.
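The classifier head of step (8) can be sketched as follows; the channel sizes are illustrative assumptions, while the convolution → adaptive pooling → fully connected order follows the description:

```python
import torch
import torch.nn as nn

class AuthenticityClassifier(nn.Module):
    """Head matching step (8): convolution, adaptive pooling, and a fully
    connected layer producing one real/fake logit per image."""
    def __init__(self, in_channels=2048, hidden=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapses any spatial size
        self.fc = nn.Linear(hidden, 1)

    def forward(self, feat):                  # feat: (B, C, H, W)
        x = self.pool(self.conv(feat)).flatten(1)
        return self.fc(x)                     # (B, 1) true-false logit
```

Adaptive pooling makes the head independent of the input resolution, which suits a backbone whose final feature map size varies with the face crop.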
After testing on the FaceForensics++ dataset, the experimental results are shown in Table 1:
TABLE 1
After training on the FaceForensics++ (C23) dataset and testing on the Celeb-DF and DFD datasets, the results are shown in Table 2:
TABLE 2
The local similarity calculation module enhances the feature extraction network's ability to mine fine-grained local differences and capture robust fake marks; the cross-hierarchy attention fusion fake positioning network aggregates feature information with different emphases, improving fake positioning accuracy; the two modules promote each other, greatly improving the detection model's generalization capability and fake positioning capability.
The above embodiments merely illustrate preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; all modifications and improvements made by those skilled in the art without departing from the spirit of the present invention fall within the scope of the present invention as defined in the appended claims.

Claims (6)

1. A deep fake face detection and positioning method capable of learning local differences, characterized by comprising the following steps:
acquiring a fake face video, extracting a video key frame from the fake face video, and acquiring a face image to be detected in the video key frame;
a detection model is built, the face image to be detected is input into the detection model, authenticity of the face image to be detected is detected, a fake area is located, the detection model comprises a local difference feature extraction network, a high generalization authenticity classifier and a cross-hierarchy attention fusion fake locating network, the detection model is obtained through training of a training set, and the training set is a face image training set;
inputting the face image to be detected into the detection model, and detecting the authenticity of the face image to be detected and positioning the fake area comprises the following steps:
performing feature extraction on the face image to be detected based on the local difference feature extraction network, and outputting an intermediate layer feature image and a final feature image, wherein the local difference feature extraction network is a feature extraction network optimized by utilizing a local similarity calculation module, and the feature extraction network is an Xattention network;
inputting the middle layer feature map and the final feature map into the cross-hierarchy attention fusion fake positioning network, outputting a prediction mask image, and positioning fake areas in the face image to be detected;
inputting the final feature map into the high generalization authenticity classifier to obtain an authenticity predicted value of the face image to be detected;
optimizing the feature extraction network using a local similarity calculation module includes:
extracting features of the face image training set by using the feature extraction network to obtain an original middle layer feature map;
inputting the original middle layer feature map into the local similarity calculation module, learning local difference information of the original middle layer feature map, optimizing the feature extraction network, and obtaining the local difference feature extraction network;
inputting the original middle layer feature map into the local similarity calculation module, learning local difference information of the original middle layer feature map, and optimizing the feature extraction network comprises:
carrying out local feature similarity calculation prediction and aggregation on the original intermediate layer feature map to obtain a predicted local similarity map;
acquiring an initial fake region mask by using the real face image and the corresponding fake face image;
downsampling the initial fake region mask to the size of the original intermediate layer feature map to obtain a real fake region mask;
based on the real fake area mask, acquiring a real mask local similarity graph by using Cartesian products;
constraining the predicted local similarity graph through the real mask local similarity graph, and training a local similarity calculation module;
training and optimizing the feature extraction network by utilizing the local similarity calculation module;
based on the true counterfeit area mask, using a cartesian product, obtaining a true mask local similarity map includes:
expanding the true fake area mask into a one-dimensional tensor to obtain a first feature map;
calculating Cartesian products of all position features of the first feature map to obtain a second feature map;
adjusting the shape of the second feature map to obtain a third feature map;
separating the third feature map into two feature maps of the same size, taking the absolute value of the difference between the two feature maps, and binarizing it to obtain the real mask local similarity map;
the process of obtaining the true mask local similarity map is formulated as follows:

(m_1, m_2) = split(reshape(cartesia_prod(f)))
m = binary(|f_1 − f_2|)

wherein cartesia_prod represents the feature Cartesian product calculation, reshape represents a shape adjustment operation, split represents the separation of feature dimensions, binary represents the binarization function, m_1 and m_2 are the two mask features separated along the third dimension, f_1 and f_2 are the one-dimensional features to which the two mask feature shapes are stretched, and m is the true mask local similarity map.
2. The deep fake face detection and positioning method capable of learning local differences according to claim 1, wherein the process of obtaining the training set comprises:
acquiring an original forged face video data set, and extracting video key frames from the original forged face video data set;
positioning and cutting a face area in the video key frame to obtain an original face image dataset;
judging whether to carry out enhancement processing on the original face image in the original face image dataset according to a preset enhancement probability;
if the enhancement processing is to be performed, performing it on the original face image; otherwise, keeping the original face image; and using both the original face images and the enhanced face images as the face image training set.
3. The deep fake face detection and positioning method capable of learning local differences according to claim 1, wherein performing local feature similarity calculation prediction and aggregation on the intermediate layer feature map to obtain a predicted local similarity map comprises:
expanding the middle layer feature map, extracting local feature tensors in the expanded middle layer feature map, and obtaining a plurality of local feature tensors;
splicing each local feature tensor of the plurality of local feature tensors with each of the remaining local feature tensors, respectively, to obtain new tensors;
performing convolution learning operation on the new tensor by using a convolution kernel module to obtain a similarity threshold;
combining the similarity thresholds according to the position sequence to obtain a predicted local similarity graph;
the process of obtaining the predicted local similarity map is formulated as follows:

M_(i,j),(m,n) = binary(Conv(assamble(F_(i,j), F_(m,n))))

wherein Conv represents the convolution operation, assamble represents the aggregation operation, M_(i,j),(m,n) represents the similarity threshold obtained by similarity calculation of the two tensors at positions (i, j) and (m, n), binary represents the binarized prediction of feature similarity, F_(i,j) represents the tensor at position (i, j) on feature map F, F_(m,n) represents the tensor at position (m, n) on feature map F, and M represents the final predicted local similarity map.
4. The deep fake face detection and positioning method capable of learning local differences according to claim 1, wherein inputting the intermediate layer feature map and the final feature map into the cross-hierarchy attention fusion fake positioning network and outputting a prediction mask image comprises:
the final feature map is up-sampled and then input into a convolution module together with the middle layer feature map, features of different layers are fused, and a fused feature map is output;
inputting the fusion feature map into a cross-hierarchy attention fusion module to perform self-adaptive feature fusion, and obtaining a new feature map;
and outputting the prediction mask image after the new feature image output by the final cross-hierarchy attention fusion module is subjected to convolution operation.
5. The deep fake face detection and positioning method capable of learning local differences according to claim 4, wherein obtaining a new feature map comprises:
based on the cross-hierarchy attention fusion module, carrying out pooling convolution processing on the fusion feature map, and obtaining a channel attention map with the same channel size as the fusion feature map;
and carrying out Hadamard product operation on the fusion feature map and the channel attention map to obtain the new feature map.
6. The deep fake face detection and positioning method capable of learning local differences according to claim 1, wherein obtaining the true-false predicted value of the face image to be detected comprises: inputting the final feature map into the high-generalization true-false classifier, and obtaining the true-false predicted value after convolution, adaptive pooling, and full connection.
CN202311841206.2A 2023-12-29 2023-12-29 Deep fake face detection positioning method capable of learning local difference Active CN117496583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311841206.2A CN117496583B (en) 2023-12-29 2023-12-29 Deep fake face detection positioning method capable of learning local difference


Publications (2)

Publication Number Publication Date
CN117496583A CN117496583A (en) 2024-02-02
CN117496583B true CN117496583B (en) 2024-04-02

Family

ID=89683260


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022151655A1 (en) * 2021-01-18 2022-07-21 深圳市商汤科技有限公司 Data set generation method and apparatus, forgery detection method and apparatus, device, medium and program
CN115019370A (en) * 2022-06-21 2022-09-06 深圳大学 Depth counterfeit video detection method based on double fine-grained artifacts
CN115984917A (en) * 2022-09-22 2023-04-18 云南大学 Face depth counterfeiting detection method and system based on multi-mode artifacts
CN117238011A (en) * 2023-07-26 2023-12-15 华南农业大学 Depth counterfeiting detection method based on space-time attention guide fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709408B (en) * 2020-08-18 2020-11-20 腾讯科技(深圳)有限公司 Image authenticity detection method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fei Gu et al., "Face spoof detection using feature map superposition and CNN," Computational Science and Engineering, 2021-10-08, pp. 355-363. *
Jianwei Fei, "Learning second order local anomaly for general face forgery detection," IEEE, 2022-06-24, pp. 1-11. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant