CN116563957B - Face fake video detection method based on Fourier domain adaptation - Google Patents
- Publication number: CN116563957B
- Application number: CN202310834717.5A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/45—Detection of the body part being alive
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/168—Feature extraction; Face representation
- G06V40/172—Classification, e.g. identification
Abstract
The invention discloses a face fake video detection method based on Fourier domain adaptation, which relates to the technical field of face forgery detection. The method mainly comprises the following steps: S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences; S2: inputting each frame image of the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame image; S3: inputting the domain-aligned video sequence into a TimeSformer spatio-temporal Transformer network to obtain a feature vector of the video sequence; S4: fusing the feature vectors output by the Xception network and the TimeSformer network with each other to obtain a fused feature vector; S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the video sequence contains a forged face.
Description
Technical Field
The invention relates to the technical field of face counterfeiting detection, in particular to a face counterfeiting video detection method based on Fourier domain adaptation.
Background
Face forgery refers to the process of falsifying or replacing a real face by means of digital image processing or artificial intelligence techniques so as to generate a false face image or video. Face forgery techniques may be used in entertainment, education, medicine and other fields, but they may also be used for malicious purposes such as fraud, defamation and disruption of social order. Face forgery detection is therefore an important security protection means that can protect personal privacy and social fairness by analyzing whether a face in an image or video is genuine or forged.
At present, face forgery detection techniques fall mainly into two categories: methods based on conventional image processing and methods based on deep learning. Methods based on conventional image processing mainly use statistical features or visual artifacts in the image, such as color distribution, edge sharpness, illumination inconsistency and blink frequency, to determine whether a face is forged. Such methods are simple and easy to implement, but a different feature extractor must be designed for each forgery type, their generalization ability is poor, and they are easily disturbed by factors such as noise, compression and occlusion. Methods based on deep learning mainly use models such as convolutional neural networks or recurrent neural networks to automatically learn features from images or videos and to perform classification or regression. Such methods can extract high-level semantic features, adapt well to different forgery types, and handle high-resolution, high-frame-rate data. Their disadvantage, however, is that a large amount of annotated data is required for training, and their generalization to unknown forgery types or cross-domain datasets is poor.
In order to improve the generalization ability and cross-domain adaptability of face forgery detection, some researchers have proposed methods based on domain adaptation or domain alignment. Domain adaptation or domain alignment refers to transforming or mapping datasets with different distributions or styles so that they become more similar, or even identical, under some measure. For example, a paper at CVPR 2022, a top-level international conference on artificial intelligence, describes a face forgery detection method based on a spatial domain adaptation network (Spatial Domain Adaptation Network, SDAN) and a frequency domain adaptation network (Frequency Domain Adaptation Network, FDAN); it first performs spatial domain adaptation and frequency domain adaptation on the images in the source domain dataset and the target domain dataset, and then inputs the adapted images into a shared convolutional neural network for feature extraction and classification. This method can effectively reduce the differences between the source domain dataset and the target domain dataset in the spatial and frequency domains, and improves the accuracy of cross-domain detection.
However, the above method considers only spatial domain adaptation and frequency domain adaptation of individual images and ignores the temporal information present in video sequences. A video sequence contains frame-to-frame dynamic changes and correlations that are useful for distinguishing real faces from fake faces. For example, in a video sequence a real face usually shows self-consistent motion such as expression changes, eye blinking and head rotation, whereas a fake face may show anomalies such as incoherence, stiffness or repetition. Therefore, face forgery detection should take into account not only image information but also video information.
Disclosure of Invention
In order to remedy the defects of the prior art, the invention provides a face forgery detection method based on Fourier domain adaptation and deep learning networks, which can effectively use both image information and video information to judge whether a forged face exists in a video sequence, and which has good generalization ability and cross-domain adaptability.
The invention is realized by the following technical scheme:
A face fake video detection method based on Fourier domain adaptation comprises the following steps:
S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences;
S2: inputting each frame image of the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame image;
S3: inputting the domain-aligned video sequence into a TimeSformer spatio-temporal Transformer network to obtain a feature vector of the video sequence;
S4: fusing the feature vectors output by the Xception network and the TimeSformer spatio-temporal Transformer network with each other to obtain a fused feature vector;
S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the video sequence contains a forged face.
In S1, Fourier domain adaptation is performed on the video sequences in the source domain data set and the target domain data set to obtain domain-aligned video sequences; the implementation steps include:
S11: Let the video data of the source domain dataset be D^s = {(X^s, x^s, y^s)} and the video data of the target domain dataset be D^t = {(X^t, x^t, y^t)}, wherein X^s denotes a video of the source domain dataset, x^s denotes a color picture frame of the corresponding video with x^s ∈ R^{H×W×3}, R denotes the real number field, H and W denote the height and width of the image, 3 denotes an RGB image whose color channels are red, green and blue, and y^s denotes the label corresponding to the video or picture, i.e. whether the face video is real or fake; X^t denotes a video of the target domain dataset, x^t denotes a picture of the target domain dataset, and y^t denotes the corresponding label of the target domain dataset;

S12: Let F^A denote the amplitude component of the Fourier transform of a color image and F^P denote the phase component of the Fourier transform of a color image; a single-channel image is converted from the spatial domain to the frequency domain by equation (1):

F(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x,y)\, e^{-j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (1)

wherein f(x,y) is the pixel value of the image at coordinates (x,y), F(u,v) is the value of the transformed image at coordinates (u,v), j is the imaginary unit, e is Euler's number, x denotes the abscissa of the image, y denotes the ordinate of the image, u and v denote the coordinates in the frequency domain, u denotes the frequency variable in the horizontal direction, v denotes the frequency variable in the vertical direction, H denotes the height of the image and W denotes the width of the image;

S13: Let M_β denote the mask matrix used to replace the low-frequency region of the image, expressed by equation (2):

M_\beta(u,v) = \begin{cases} 1, & (u,v) \in [-\beta H : \beta H] \times [-\beta W : \beta W] \\ 0, & \text{otherwise} \end{cases}    (2)

wherein, taking the center of the image as the origin, the region in which the mask value is 1 forms a square; β ∈ (0,1) indicates the size of this square region, H and W denote the height and width of the image, and βH and βW denote the height and width of the region to be masked;

S14: The frequency-domain image is converted back to the spatial domain by the inverse Fourier transform to obtain the domain-aligned image; the transformation formula is given by equation (3):

f(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F(u,v)\, e^{j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (3)

S15: Writing equation (3) as F^{-1} and given frame pictures x^s, x^t from the two video sequences, the Fourier domain adaptation is expressed by equation (4):

x^{s \to t} = F^{-1}\big( M_\beta \circ F^A(x^t) + (1 - M_\beta) \circ F^A(x^s),\; F^P(x^s) \big)    (4)

wherein F^{-1} denotes the inverse Fourier transform, x^s denotes an image of the source domain video, x^t denotes an image of the target domain video, x^{s→t} denotes the image generated after style migration, F^P(x^s) denotes the phase part of the source domain image after Fourier transform, F^A(x^t) denotes the amplitude part of the target domain image after Fourier transform, F^A(x^s) denotes the amplitude part of the source domain image after Fourier transform, M_β denotes the mask matrix, and ∘ denotes the composition of the two operations. In S15, β is set to 0.001. The Fourier domain adaptation refers to: performing Fourier transform on each video sequence in the source domain data set and the target domain data set on the time-frequency plane and calculating its amplitude spectrum and phase spectrum; randomly pairing each video sequence in the source domain data set with a video sequence in the target domain data set and exchanging the amplitude spectra of the paired video sequences; and finally performing inverse Fourier transform on the video sequences after the amplitude spectrum exchange on the time-frequency plane while retaining their original phase spectra.
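A minimal illustrative sketch of the per-frame amplitude swap of equation (4) is given below, using NumPy; the function name fda_amplitude_swap and the handling of very small mask sizes are assumptions made for illustration and are not part of the claimed method.

```python
import numpy as np

def fda_amplitude_swap(src_img, tgt_img, beta=0.001):
    """Replace the centred low-frequency amplitude of the source frame with that of the
    target frame (cf. equation (4)), keeping the source phase. Inputs are H x W x 3 arrays."""
    out = np.zeros_like(src_img, dtype=np.float64)
    h, w = src_img.shape[:2]
    bh, bw = max(1, int(beta * h)), max(1, int(beta * w))   # half-size of the M_beta square
    cy, cx = h // 2, w // 2
    for c in range(src_img.shape[2]):                        # each colour channel separately
        fft_src = np.fft.fftshift(np.fft.fft2(src_img[:, :, c]))   # equation (1), centred spectrum
        fft_tgt = np.fft.fftshift(np.fft.fft2(tgt_img[:, :, c]))
        amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)      # F^A(x^s), F^P(x^s)
        amp_tgt = np.abs(fft_tgt)                                   # F^A(x^t)
        # M_beta: overwrite only the centred low-frequency square of the amplitude spectrum
        amp_src[cy - bh:cy + bh, cx - bw:cx + bw] = amp_tgt[cy - bh:cy + bh, cx - bw:cx + bw]
        mixed = amp_src * np.exp(1j * pha_src)                      # recombine amplitude and phase
        out[:, :, c] = np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))  # equation (3)
    return out
```

Applying such a swap frame by frame to a randomly paired source/target video reproduces the amplitude-spectrum exchange described above; in practice the result would be clipped back to the valid pixel range before being fed to the networks.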
As a further limitation of the present technical solution, the feature fusion refers to combining or integrating the feature vectors of different networks to generate a new feature vector with stronger expressive power or better suited to the classification task. The Xception network is denoted v_i = X(x_i^{s→t}), where the function X(·) extracts the features of an image and v_i denotes the extracted feature vector, as shown in equation (5); the average feature vector \bar{v} corresponding to the frame sequence is then obtained from the corresponding number of frames, see equation (6), wherein N denotes the number of frames contained in the frame sequence. Similarly, the TimeSformer network is denoted u = T(X^{s→t}), where the function T(·) extracts the features of an image sequence, u denotes the extracted feature vector and X^{s→t} denotes the frame sequence generated after style migration, as shown in equation (7). Let z be the fused feature vector, expressed by equation (8), which denotes that the two feature vectors \bar{v} and u are added element-wise to obtain the fused feature vector z; the final prediction probability p is then obtained by equation (9), wherein Softmax denotes a softmax layer and Linear denotes a linear layer. For the Xception network, \bar{v} can be converted into the prediction-class probability p_X through the linear layer and the softmax layer according to equation (10):

v_i = X(x_i^{s \to t})    (5)

\bar{v} = \frac{1}{N} \sum_{i=1}^{N} v_i    (6)

u = T(X^{s \to t})    (7)

z = \bar{v} + u    (8)

p = \mathrm{Softmax}(\mathrm{Linear}(z))    (9)

p_X = \mathrm{Softmax}(\mathrm{Linear}(\bar{v}))    (10)
As a further limitation of the present technical solution, for the Xception network the loss function L_{CE} used is the cross-entropy loss, whose calculation formula is expressed by equation (11):

L_{CE} = -\big[\, y \log(p_X) + (1 - y) \log(1 - p_X) \,\big]    (11)

wherein log denotes the natural logarithm with base e, and y is the true sample label.
As a further limitation of the present technical solution, for the TimeSformer spatio-temporal Transformer network the loss function L_{FL} used is the focal loss, whose calculation formula is expressed by equation (12):

L_{FL} = -\alpha (1 - p)^{\gamma}\, y \log(p) - (1 - \alpha)\, p^{\gamma}\, (1 - y) \log(1 - p)    (12)

wherein log denotes the natural logarithm with base e, γ is the regulating parameter and is set to 2, (1 - p)^{\gamma} denotes the scaling factor, and α adjusts the weight of the positive and negative samples and is set to 0.25; L_{FL} denotes the loss function of the TimeSformer network.
As a further limitation of the present technical solution, a parameter λ is set as the weight parameter for the different losses:

L = L_{CE} + \lambda L_{FL}    (13)

wherein L is the total loss of the whole process, L_{CE} is the loss of the Xception network, L_{FL} is the loss of the TimeSformer network, and λ is the proportionality coefficient, set to 0.5.
The beneficial effects of the invention are as follows:
(1) Not only is Fourier domain adaptation carried out on the image, but also Fourier domain adaptation is carried out on continuous frames in the video sequence, domain alignment is realized by utilizing information on a frequency domain, and domain alignment operation is carried out between a source domain and a target domain, so that the accuracy of face counterfeiting detection is greatly improved;
(2) Instead of a spatial domain adaptation network, an Xception network and a TimeSformer network are adopted to extract the features of the images and of the video sequence respectively, and these features are fused with each other, so that both spatial and temporal information is used to improve the detection effect;
(3) Domain alignment between different data sets is combined with full training of the neural networks, so that very good performance can be achieved in face forgery detection. The Xception network uses multi-layer convolution operations and can extract multi-scale features of images; the TimeSformer is a spatio-temporal modeling method based on an attention mechanism and can effectively capture the temporal features in videos. Combining Xception and TimeSformer makes comprehensive use of the feature information of both static images and video sequences, thereby improving the ability to detect face forgery. Both Xception and TimeSformer are models trained on large-scale data sets, have strong robustness and generalization ability, and can cope with the variation and interference of different samples.
Drawings
Fig. 1 is a diagram of the transfer of the low-frequency amplitude components of Fourier-transformed images according to the present invention.
Fig. 2 is a flowchart of feature fusion of an image sequence using an Xception network and a TimeSformer network according to the present invention.
Fig. 3 is a schematic diagram of the operation of the TimeSformer encoder module.
Fig. 4 shows the process by which the TimeSformer converts a video sequence into temporal features.
Fig. 5 is a flow chart of the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying fig. 1 to 5. In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "left", "right", "front", "rear", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
The specific embodiments of the present invention are as follows:
A face fake video detection method based on Fourier domain adaptation comprises the following steps:
S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences;
S2: inputting each frame image of the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame image;
S3: inputting the domain-aligned video sequence into a TimeSformer spatio-temporal Transformer network to obtain a feature vector of the video sequence;
S4: fusing the feature vectors output by the Xception network and the TimeSformer spatio-temporal Transformer network with each other to obtain a fused feature vector;
S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the video sequence contains a forged face.
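Purely as an illustrative outline of how steps S1 to S5 fit together, the sketch below (Python, reusing the fda_amplitude_swap sketch given earlier and assuming backbone objects xception and timesformer that each return a 2048-dimensional feature, plus a linear classifier) shows one forward pass; the names and shapes are assumptions, not the patented implementation.

```python
import torch

def detect_forged_face(src_frames, tgt_frames, xception, timesformer, classifier, beta=0.001):
    """Illustrative forward pass for one source-domain video paired with a target-domain video.
    src_frames / tgt_frames: lists of H x W x 3 numpy arrays of equal length."""
    # S1: Fourier domain adaptation, frame by frame (see fda_amplitude_swap above)
    aligned = [fda_amplitude_swap(s, t, beta) for s, t in zip(src_frames, tgt_frames)]
    clip = torch.stack([torch.from_numpy(a).permute(2, 0, 1).float() for a in aligned])
    # S2: per-frame Xception features, averaged over the N frames (equations (5)-(6))
    v_bar = torch.stack([xception(f.unsqueeze(0)).squeeze(0) for f in clip]).mean(dim=0)
    # S3: sequence-level TimeSformer feature (equation (7))
    u = timesformer(clip.unsqueeze(0)).squeeze(0)
    # S4: element-wise fusion of the two feature vectors (equation (8))
    z = v_bar + u
    # S5: the classifier outputs the real/fake probabilities (equation (9))
    return torch.softmax(classifier(z), dim=-1)
```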
In S1, fourier domain adaptation is performed on video sequences in a source domain data set and a target domain data set to obtain a video sequence after domain alignment, and the implementation steps include:
S11: Let the video data of the source domain dataset be D^s = {(X^s, x^s, y^s)} and the video data of the target domain dataset be D^t = {(X^t, x^t, y^t)}, wherein X^s denotes a video of the source domain dataset, x^s denotes a color picture frame of the corresponding video with x^s ∈ R^{H×W×3}, R denotes the real number field, H and W denote the height and width of the image, 3 denotes an RGB image whose color channels are red, green and blue, and y^s denotes the label corresponding to the video or picture, i.e. whether the face video is real or fake; X^t denotes a video of the target domain dataset, x^t denotes a picture of the target domain dataset, and y^t denotes the corresponding label of the target domain dataset;

S12: Let F^A denote the amplitude component of the Fourier transform of a color image and F^P denote the phase component of the Fourier transform of a color image; a single-channel image is converted from the spatial domain to the frequency domain by equation (1):

F(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x,y)\, e^{-j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (1)

wherein f(x,y) is the pixel value of the image at coordinates (x,y), F(u,v) is the value of the transformed image at coordinates (u,v), j is the imaginary unit, e is Euler's number, x denotes the abscissa of the image, y denotes the ordinate of the image, u and v denote the coordinates in the frequency domain, u denotes the frequency variable in the horizontal direction, v denotes the frequency variable in the vertical direction, H denotes the height of the image and W denotes the width of the image;

S13: Let M_β denote the mask matrix used to replace the low-frequency region of the image, expressed by equation (2):

M_\beta(u,v) = \begin{cases} 1, & (u,v) \in [-\beta H : \beta H] \times [-\beta W : \beta W] \\ 0, & \text{otherwise} \end{cases}    (2)

wherein, taking the center of the image as the origin, the region in which the mask value is 1 forms a square; β ∈ (0,1) indicates the size of this square region, H and W denote the height and width of the image, and βH and βW denote the height and width of the region to be masked;

S14: The frequency-domain image is converted back to the spatial domain by the inverse Fourier transform to obtain the domain-aligned image; the transformation formula is given by equation (3):

f(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F(u,v)\, e^{j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (3)

S15: Writing equation (3) as F^{-1} and given frame pictures x^s, x^t from the two video sequences, the Fourier domain adaptation is expressed by equation (4):

x^{s \to t} = F^{-1}\big( M_\beta \circ F^A(x^t) + (1 - M_\beta) \circ F^A(x^s),\; F^P(x^s) \big)    (4)

wherein F^{-1} denotes the inverse Fourier transform, x^s denotes an image of the source domain video, x^t denotes an image of the target domain video, x^{s→t} denotes the image generated after style migration, F^P(x^s) denotes the phase part of the source domain image after Fourier transform, F^A(x^t) denotes the amplitude part of the target domain image after Fourier transform, F^A(x^s) denotes the amplitude part of the source domain image after Fourier transform, M_β denotes the mask matrix, and ∘ denotes the composition of the two operations.
The low-frequency part of the amplitude of the source domain video image, F^A(x^s), is replaced by the low-frequency part of the target domain video image, F^A(x^t); that is, the low-frequency region of the source image is replaced by the low-frequency region of the target image, and the generated image x^{s→t} has the same content as x^s and the same style as x^t (i.e. the same appearance as the target domain).
In S15, as β gradually increases from 0 to 1, the generated image x^{s→t} becomes closer and closer to x^t, but visible artifacts also appear; β is therefore set to 0.001. The low-frequency part of the source domain video image (i.e. the region where the gray values change slowly) is replaced by the low-frequency part of the target domain video image (i.e. the region matching the target style), so that an image with the same content as the source domain and the same style as the target domain is generated. In this way the domain gap between the source domain and the target domain can be significantly reduced, and a better effect is achieved in forgery detection.
The Fourier domain adaptation refers to: performing Fourier transform on each video sequence in the source domain data set and the target domain data set on the time-frequency plane, and calculating its amplitude spectrum and phase spectrum; randomly pairing each video sequence in the source domain data set with a video sequence in the target domain data set, and exchanging the amplitude spectra of the paired video sequences; and finally performing inverse Fourier transform on the video sequences after the amplitude spectrum exchange on the time-frequency plane while retaining their original phase spectra.
The Xception network is a convolutional neural network model designed on the basis of depthwise separable convolution (Depthwise Separable Convolution); it extracts features from each picture of the video sequence, thereby improving the accuracy of image recognition. Xception replaces conventional convolution with depthwise separable convolution, which reduces the number of parameters and the amount of computation. A depthwise separable convolution comprises a depthwise convolution layer and a pointwise convolution layer: the depthwise convolution layer performs a convolution operation on each input channel separately, the depth of each convolution kernel being set to 1 so that the input depth remains unchanged; the pointwise convolution layer then combines the feature maps of the different channels to form the output feature map.
Each frame image is input into the Xception network, and a feature vector of length 2048 is obtained as the feature representation of that frame.
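As a minimal sketch only (assuming PyTorch and illustrative channel sizes, not the actual Xception building block), the code below shows the depthwise-plus-pointwise structure described above.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel, groups=in_channels)
    followed by a 1x1 pointwise convolution that mixes the channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# e.g. a 224x224 RGB frame mapped to 64 feature channels
frame = torch.randn(1, 3, 224, 224)
feat = DepthwiseSeparableConv(3, 64)(frame)   # shape (1, 64, 224, 224)
```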
The TimeSformer network is a video classification model designed on the basis of the Transformer structure; it models each frame of the video sequence and learns the temporal relationships between frames.
Each video sequence is input into the TimeSformer network, and a feature vector of length 2048 is obtained as the feature representation of the video sequence.
The TimeSformer network comprises a block partition layer, a position embedding layer, a linear embedding layer, 12 encoder modules and a global average pooling layer.
Each image frame is divided into image blocks by the block partition; the image blocks are linearly embedded into vector form and added to the position information produced by the position embedding, forming the embedding vectors that serve as input to the encoder modules.
The encoder module processes the input video sequence using a divided (separate) space-time attention mechanism. The divided space-time attention mechanism includes: a temporal attention mechanism, which lets each image block of a frame interact with the image blocks at the same spatial position in the other frames; a spatial attention mechanism, which lets each image block interact with the other image blocks in the same frame; and a multi-layer perceptron module, which transforms and maps the features produced by the temporal and spatial attention mechanisms, the output of the multi-layer perceptron module serving as the input of the next encoder module. After 12 such iterations, the output of the last encoder module enters the global average pooling layer, yielding the required sequence features.
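For illustration only, the sketch below (PyTorch, with assumed dimensions and a simplified multi-head attention in place of the real TimeSformer encoder) shows the order of operations in one such block: temporal attention over blocks at the same spatial position, then spatial attention within each frame, then a multi-layer perceptron.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One simplified encoder block: temporal attention, spatial attention, MLP."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, T frames, P blocks per frame, dim)
        b, t, p, d = x.shape
        # temporal attention: the block at one spatial position attends across the T frames
        xt = self.n1(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # spatial attention: each block attends to the other blocks of the same frame
        xs = self.n2(x).reshape(b * t, p, d)
        xs, _ = self.space_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, p, d)
        # MLP applied block-wise; the output feeds the next encoder block
        return x + self.mlp(self.n3(x))
```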
The feature fusion means that the feature vectors of different networks are combined or integrated to generate a new feature vector with stronger expressive power or better suited to the classification task. The Xception network is denoted v_i = X(x_i^{s→t}), where the function X(·) extracts the features of an image and v_i denotes the extracted feature vector, as shown in equation (5); the average feature vector \bar{v} corresponding to the frame sequence is then obtained from the corresponding number of frames, see equation (6), wherein N denotes the number of frames contained in the frame sequence. Similarly, the TimeSformer network is denoted u = T(X^{s→t}), where the function T(·) extracts the features of an image sequence, u denotes the extracted feature vector and X^{s→t} denotes the frame sequence generated after style migration, as shown in equation (7). Let z be the fused feature vector, expressed by equation (8), which denotes that the two feature vectors \bar{v} and u are added element-wise to obtain the fused feature vector z; the final prediction probability p is then obtained by equation (9), wherein Softmax denotes a softmax layer and Linear denotes a linear layer. For the Xception network, \bar{v} can be converted into the prediction-class probability p_X through the linear layer and the softmax layer according to equation (10):

v_i = X(x_i^{s \to t})    (5)

\bar{v} = \frac{1}{N} \sum_{i=1}^{N} v_i    (6)

u = T(X^{s \to t})    (7)

z = \bar{v} + u    (8)

p = \mathrm{Softmax}(\mathrm{Linear}(z))    (9)

p_X = \mathrm{Softmax}(\mathrm{Linear}(\bar{v}))    (10)
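The following minimal sketch (PyTorch; the 2048-dimensional feature size, the class count and the use of two separate linear heads are assumptions made for illustration, since the description does not state whether the two branches share one linear layer) illustrates equations (5) to (10).

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 2048, 2
fuse_head = nn.Linear(feat_dim, num_classes)       # produces p   (equation (9))
frame_head = nn.Linear(feat_dim, num_classes)      # produces p_X (equation (10))

def fuse_and_classify(frame_feats, seq_feat):
    """frame_feats: (N, 2048) Xception features of the N frames (equation (5));
    seq_feat: (2048,) TimeSformer feature of the whole sequence (equation (7))."""
    v_bar = frame_feats.mean(dim=0)                 # equation (6): average over the frames
    z = v_bar + seq_feat                            # equation (8): element-wise fusion
    p = torch.softmax(fuse_head(z), dim=-1)         # equation (9): fused prediction
    p_x = torch.softmax(frame_head(v_bar), dim=-1)  # equation (10): Xception-branch prediction
    return p, p_x
```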
For the Xception network, the loss function L_{CE} used is the cross-entropy loss, whose calculation formula is expressed by equation (11):

L_{CE} = -\big[\, y \log(p_X) + (1 - y) \log(1 - p_X) \,\big]    (11)

wherein log denotes the natural logarithm with base e, and y is the true sample label.
For the TimeSformer spatio-temporal Transformer network, the loss function L_{FL} used is the focal loss. The focal loss is suited to handling class imbalance; specifically, it modifies the cross-entropy loss so that the loss is reduced on samples that are already classified correctly and increased on samples that are difficult to classify, as shown in equation (12):

L_{FL} = -\alpha (1 - p)^{\gamma}\, y \log(p) - (1 - \alpha)\, p^{\gamma}\, (1 - y) \log(1 - p)    (12)

wherein log denotes the natural logarithm with base e, γ is the regulating parameter and is set to 2, and (1 - p)^{\gamma} denotes the scaling factor used to adjust the weight of easily classified samples: when a sample is predicted correctly, the predicted probability of its true class is close to 1, the scaling factor is close to 0, and the weight of the easily classified sample is reduced; when a sample is predicted incorrectly, the predicted probability of its true class is close to 0, the scaling factor is close to 1, and the weight of the hard-to-classify sample is increased. α adjusts the weight of the positive and negative samples and is set to 0.25. In this section L_{FL} denotes the loss function of the TimeSformer network, i.e. the focal loss is used for the optimization of the whole network.
A parameter λ is set to balance the proportions of the cross-entropy loss and the focal loss in the whole network:

L = L_{CE} + \lambda L_{FL}    (13)

wherein L is the total loss of the whole process, L_{CE} is the loss of the Xception network, L_{FL} is the loss of the TimeSformer network, and λ is the proportionality coefficient that adjusts the weight of the two loss functions; if the detection effect is poor, λ can be increased appropriately to increase the proportion occupied by the focal loss.
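As an illustrative sketch only (binary labels, PyTorch, batch-mean reduction assumed), the combined objective of equations (11) to (13) might be written as follows, with γ = 2, α = 0.25 and λ = 0.5.

```python
import torch

def total_loss(p_x, p, y, gamma=2.0, alpha=0.25, lam=0.5):
    """p_x: Xception-branch fake probability; p: fused/TimeSformer-branch fake probability;
    y: ground-truth label (1 = fake, 0 = real); all tensors of shape (batch,)."""
    eps = 1e-7
    p_x, p = p_x.clamp(eps, 1 - eps), p.clamp(eps, 1 - eps)
    # equation (11): cross-entropy loss for the Xception branch
    l_ce = -(y * torch.log(p_x) + (1 - y) * torch.log(1 - p_x))
    # equation (12): focal loss, down-weighting easy samples via the (1 - p)^gamma factor
    l_fl = -(alpha * (1 - p) ** gamma * y * torch.log(p)
             + (1 - alpha) * p ** gamma * (1 - y) * torch.log(1 - p))
    # equation (13): weighted total loss with proportionality coefficient lambda
    return (l_ce + lam * l_fl).mean()
```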
The specific implementation mode of the invention is as follows:
the source domain data set selected by the invention is Celeb-DF (v 2), and the Celeb-DF (v 2) data set comprises real and deep synthesized video, and the video quality is similar to that of online transmission. Celeb-DF includes 590 raw videos collected from video websites, which have topics of different ages and sexes, and 5639 corresponding DeepFake videos. The target domain dataset was chosen from faceforensis++ datasets, which are a Face counterfeited dataset consisting of 1000 original video sequences, created using four methods of operation, face2Face, faceSwap, deepFakes and neurosortutes. The method for detecting whether the video sequence in the target domain data set is the face fake or not comprises the following steps:
first, each video sequence in the source domain dataset (face a, style a) and the target domain dataset (face B, style B) is fourier transformed on the time-frequency plane and its magnitude and phase spectra are calculated. Wherein face a and face B represent faces in different data sets and style a and style B represent different styles for each image. Then, randomly pairing each video sequence in the source domain data set with each video sequence in the target domain data set, and exchanging the amplitude spectrum between the paired video sequences; and finally, carrying out inverse Fourier transform on the video sequence after the amplitude spectrum exchange on a time-frequency plane, and reserving the original phase spectrum of the video sequence to form an image containing a face A and a style B.
Then, each frame picture of the domain-aligned video sequence is input into the Xception network to obtain the feature vector of each frame image, and the domain-aligned video sequence is input into the TimeSformer network to obtain the feature vector of the video sequence.
Then, the feature vectors output by the Xception network and the TimeSformer network are fused with each other to obtain the fused feature vector: specifically, the feature vectors of the frame images output by the Xception network are averaged, and the average is added to the feature vector of the video sequence output by the TimeSformer network, so as to obtain a feature vector that contains both image information and video information and fully reflects whether the face in the video sequence is forged.
Finally, the fused feature vector is input into a classifier to obtain a judgment result of whether the video sequence contains a forged face.
Matters not described in detail in the present application are well known to those skilled in the art. Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.
Claims (4)
1. A face fake video detection method based on Fourier domain adaptation, characterized by comprising the following steps:
S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences;
S2: inputting each frame image of the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame image;
S3: inputting the domain-aligned video sequence into a TimeSformer spatio-temporal Transformer network to obtain a feature vector of the video sequence;
S4: fusing the feature vectors output by the Xception network and the TimeSformer spatio-temporal Transformer network with each other to obtain a fused feature vector;
S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the video sequence contains a forged face;
in S1, Fourier domain adaptation is performed on the video sequences in the source domain data set and the target domain data set to obtain domain-aligned video sequences, and the implementation steps include:
S11: let the video data of the source domain dataset be D^s = {(X^s, x^s, y^s)} and the video data of the target domain dataset be D^t = {(X^t, x^t, y^t)}, wherein X^s denotes a video of the source domain dataset, x^s denotes a color picture frame of the corresponding video with x^s ∈ R^{H×W×3}, R denotes the real number field, H and W denote the height and width of the image, 3 denotes an RGB image whose color channels are red, green and blue, and y^s denotes the label corresponding to the video or picture, i.e. whether the face video is real or fake; X^t denotes a video of the target domain dataset, x^t denotes a picture of the target domain dataset, and y^t denotes the corresponding label of the target domain dataset;

S12: let F^A denote the amplitude component of the Fourier transform of a color image and F^P denote the phase component of the Fourier transform of a color image; a single-channel image is converted from the spatial domain to the frequency domain by equation (1):

F(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x,y)\, e^{-j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (1)

wherein f(x,y) is the pixel value of the image at coordinates (x,y), F(u,v) is the value of the transformed image at coordinates (u,v), j is the imaginary unit, e is Euler's number, x denotes the abscissa of the image, y denotes the ordinate of the image, u and v denote the coordinates in the frequency domain, u denotes the frequency variable in the horizontal direction, v denotes the frequency variable in the vertical direction, H denotes the height of the image and W denotes the width of the image;

S13: let M_β denote the mask matrix used to replace the low-frequency region of the image, expressed by equation (2):

M_\beta(u,v) = \begin{cases} 1, & (u,v) \in [-\beta H : \beta H] \times [-\beta W : \beta W] \\ 0, & \text{otherwise} \end{cases}    (2)

wherein, taking the center of the image as the origin, the region in which the mask value is 1 forms a square; β ∈ (0,1) indicates the size of this square region, H and W denote the height and width of the image, and βH and βW denote the height and width of the region to be masked;

S14: the frequency-domain image is converted back to the spatial domain by the inverse Fourier transform to obtain the domain-aligned image; the transformation formula is given by equation (3):

f(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F(u,v)\, e^{j 2\pi \left( \frac{u x}{H} + \frac{v y}{W} \right)}    (3)

S15: writing equation (3) as F^{-1} and given frame pictures x^s, x^t from the two video sequences, the Fourier domain adaptation is expressed by equation (4):

x^{s \to t} = F^{-1}\big( M_\beta \circ F^A(x^t) + (1 - M_\beta) \circ F^A(x^s),\; F^P(x^s) \big)    (4)

wherein F^{-1} denotes the inverse Fourier transform, x^s denotes an image of the source domain video, x^t denotes an image of the target domain video, x^{s→t} denotes the image generated after style migration, F^P(x^s) denotes the phase part of the source domain image after Fourier transform, F^A(x^t) denotes the amplitude part of the target domain image after Fourier transform, F^A(x^s) denotes the amplitude part of the source domain image after Fourier transform, M_β denotes the mask matrix, and ∘ denotes the composition of the two operations;

in S15, β is set to 0.001, and the Fourier domain adaptation refers to: performing Fourier transform on each video sequence in the source domain data set and the target domain data set on the time-frequency plane, calculating its amplitude spectrum and phase spectrum, randomly pairing each video sequence in the source domain data set with a video sequence in the target domain data set, and exchanging the amplitude spectra of the paired video sequences; finally, performing inverse Fourier transform on the video sequences after the amplitude spectrum exchange on the time-frequency plane while retaining their original phase spectra;

the feature fusion means that the feature vectors of different networks are combined or integrated to generate a new feature vector with stronger expressive power or better suited to the classification task; the Xception network is denoted v_i = X(x_i^{s→t}), where the function X(·) extracts the features of an image and v_i denotes the extracted feature vector, as shown in equation (5); the average feature vector \bar{v} corresponding to the frame sequence is then obtained from the corresponding number of frames, see equation (6), wherein N denotes the number of frames contained in the frame sequence; similarly, the TimeSformer network is denoted u = T(X^{s→t}), where the function T(·) extracts the features of an image sequence, u denotes the extracted feature vector and X^{s→t} denotes the frame sequence generated after style migration, as shown in equation (7); let z be the fused feature vector, expressed by equation (8), which denotes that the two feature vectors \bar{v} and u are added element-wise to obtain the fused feature vector z; the final prediction probability p is then obtained by equation (9), wherein Softmax denotes a softmax layer and Linear denotes a linear layer; for the Xception network, \bar{v} can be converted into the prediction-class probability p_X through the linear layer and the softmax layer according to equation (10):

v_i = X(x_i^{s \to t})    (5)

\bar{v} = \frac{1}{N} \sum_{i=1}^{N} v_i    (6)

u = T(X^{s \to t})    (7)

z = \bar{v} + u    (8)

p = \mathrm{Softmax}(\mathrm{Linear}(z))    (9)

p_X = \mathrm{Softmax}(\mathrm{Linear}(\bar{v}))    (10)
2. The face fake video detection method based on Fourier domain adaptation according to claim 1, characterized in that: for the Xception network, the loss function L_{CE} used is the cross-entropy loss, whose calculation formula is expressed by equation (11):

L_{CE} = -\big[\, y \log(p_X) + (1 - y) \log(1 - p_X) \,\big]    (11)

wherein log denotes the natural logarithm with base e, and y is the true sample label.
3. The face fake video detection method based on Fourier domain adaptation according to claim 2, characterized in that: for the TimeSformer spatio-temporal Transformer network, the loss function L_{FL} used is the focal loss, whose calculation formula is expressed by equation (12):

L_{FL} = -\alpha (1 - p)^{\gamma}\, y \log(p) - (1 - \alpha)\, p^{\gamma}\, (1 - y) \log(1 - p)    (12)

wherein log denotes the natural logarithm with base e, γ is the regulating parameter and is set to 2, (1 - p)^{\gamma} denotes the scaling factor, α adjusts the weight of the positive and negative samples and is set to 0.25, and L_{FL} denotes the loss function of the TimeSformer network.
4. The face fake video detection method based on Fourier domain adaptation according to claim 3, characterized in that: a parameter λ is set as the weight parameter for the different losses:

L = L_{CE} + \lambda L_{FL}    (13)

wherein L is the total loss of the whole process, L_{CE} is the loss of the Xception network, L_{FL} is the loss of the TimeSformer network, and λ is the proportionality coefficient, set to 0.5.
Priority Applications (1)
- CN202310834717.5A (granted as CN116563957B), priority date 2023-07-10, filing date 2023-07-10: Face fake video detection method based on Fourier domain adaptation

Applications Claiming Priority (1)
- CN202310834717.5A (granted as CN116563957B), priority date 2023-07-10, filing date 2023-07-10: Face fake video detection method based on Fourier domain adaptation

Publications (2)
- CN116563957A, published 2023-08-08
- CN116563957B, published 2023-09-29

Family
- ID=87488318

Family Applications (1)
- CN202310834717.5A (CN116563957B, Active), priority date 2023-07-10, filing date 2023-07-10

Country Status (1)
- CN: CN116563957B
Families Citing this family (2)
- CN117115927A, Guangzhou Bairui Network Technology Co., Ltd., published 2023-11-24: Audio and video security verification method and system applied to living body detection in financial business
- CN118334473B, Nanchang University, published 2024-08-23: Deep fake image detection method based on semantic entanglement
Family Cites Families (2)
- CN111814871B, Zhejiang University, published 2024-02-09: Image classification method based on reliable weight optimal transmission
- CN114913565B, Tencent Technology (Shenzhen) Co., Ltd., published 2023-11-17: Face image detection method, model training method, device and storage medium
Patent Citations (13) (cited by examiner)
- EP3818526A1, published 2021-05-12: Hybrid audio synthesis using neural networks
- CN112734696A, published 2021-04-30: Face changing video tampering detection method and system based on multi-domain feature fusion
- CN113313054A, published 2021-08-27: Face counterfeit video detection method, system, equipment and storage medium
- CN113435292A, published 2021-09-24: AI counterfeit face detection method based on inherent feature mining
- WO2023280423A1, published 2023-01-12: Methods, systems and computer programs for processing and adapting image data from different domains
- CN114519897A, published 2022-05-20: Human face in-vivo detection method based on color space fusion and recurrent neural network
- CN114492599A, published 2022-05-13: Medical image preprocessing method and device based on Fourier domain self-adaptation
- CN114758272A, published 2022-07-15: Forged video detection method based on frequency domain self-attention
- CN115273169A, published 2022-11-01: Face counterfeiting detection system and method based on time-space-frequency domain clue enhancement
- CN115188039A, published 2022-10-14: Depth forgery video technology tracing method based on image frequency domain information
- CN115909129A, published 2023-04-04: Face forgery detection method based on frequency domain feature double-flow network
- CN115761459A, published 2023-03-07: Multi-scene self-adaption method for bridge and tunnel apparent disease identification
- CN116386590A, published 2023-07-04: Multi-mode expressive voice synthesis method and device
Non-Patent Citations (6) (cited by examiner)
- Hui Qi. A Real-Time Face Detection Method Based on Blink Detection. IEEE Access, 2023.
- Hui Qi. A Real-Time Face Detection Method Based on Blink Detection. IEEE Access, 2023.
- Chunpeng Wang. RD-IWAN: Residual Dense Based Imperceptible Watermark Attack Network. IEEE Transactions on Circuits and Systems for Video Technology.
- Chen Ran, Wu Shiqian, Xu Wangming. A face liveness detection algorithm based on multi-feature fusion in the spatial and frequency domains. Video Engineering, No. 3.
- Han Han, Xu Zhi. Research on face recognition based on domain adaptation and multiple subspaces. Journal of Guilin University of Electronic Technology, No. 3.
- Chen Peng, Liang Tao, Liu Jin, Dai Jiao, Han Jizhong. Forged face video detection method fusing global temporal and local spatial features. Journal of Cyber Security, 2020, No. 2.
Also Published As
- CN116563957A, published 2023-08-08
Similar Documents
- Guo et al.: Fake face detection via adaptive manipulation traces extraction network
- CN116563957B: Face fake video detection method based on Fourier domain adaptation
- CN109949317B: Semi-supervised image example segmentation method based on gradual confrontation learning
- Jin et al.: Generative adversarial network technologies and applications in computer vision
- CN111667400B: Human face contour feature stylization generation method based on unsupervised learning
- CN110880172A: Video face tampering detection method and system based on cyclic convolution neural network
- CN113536972B: Self-supervision cross-domain crowd counting method based on target domain pseudo label
- CN113283444B: Heterogeneous image migration method based on generation countermeasure network
- Ma et al.: Unsupervised domain adaptation augmented by mutually boosted attention for semantic segmentation of VHR remote sensing images
- Xia et al.: Towards deepfake video forensics based on facial textural disparities in multi-color channels
- CN115482595B: Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
- CN113689382A: Tumor postoperative life prediction method and system based on medical images and pathological images
- Li et al.: Zooming into face forensics: A pixel-level analysis
- CN113553954A: Method and apparatus for training behavior recognition model, device, medium, and program product
- CN114119356A: Method for converting thermal infrared image into visible light color image based on cycleGAN
- Xiao et al.: Securing the socio-cyber world: Multiorder attribute node association classification for manipulated media
- Wen et al.: A hybrid model for natural face de-identification with adjustable privacy
- CN117095471B: Face counterfeiting tracing method based on multi-scale characteristics
- Peng et al.: Presentation attack detection based on two-stream vision transformers with self-attention fusion
- CN114937298A: Micro-expression recognition method based on feature decoupling
- CN112990340B: Self-learning migration method based on feature sharing
- Sabitha et al.: Enhanced model for fake image detection (EMFID) using convolutional neural networks with histogram and wavelet based feature extractions
- CN114519897B: Human face living body detection method based on color space fusion and cyclic neural network
- He et al.: Dynamic residual distillation network for face anti-spoofing with feature attention learning
- Chen: Evaluation technology of classroom students' learning state based on deep learning
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant