CN116563957A - Face fake video detection method based on Fourier domain adaptation - Google Patents

Face fake video detection method based on Fourier domain adaptation Download PDF

Info

Publication number
CN116563957A
Authority
CN
China
Prior art keywords
representing
image
video
domain
video sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310834717.5A
Other languages
Chinese (zh)
Other versions
CN116563957B (en)
Inventor
王春鹏
时超轶
马宾
王玉立
魏子麒
夏之秋
李琦
李健
咸永锦
韩冰
王晓雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202310834717.5A priority Critical patent/CN116563957B/en
Publication of CN116563957A publication Critical patent/CN116563957A/en
Application granted granted Critical
Publication of CN116563957B publication Critical patent/CN116563957B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40Spoof detection, e.g. liveness detection
    • G06V40/45Detection of the body part being alive
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The invention discloses a face fake video detection method based on Fourier domain adaptation, which relates to the technical field of face forgery detection. The method mainly comprises the following steps: S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences; S2: inputting each frame of image in the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame of image; S3: inputting the domain-aligned video sequence into a TimeSformer space-time Transformer network to obtain a feature vector of the video sequence; S4: mutually fusing the feature vectors output by the Xception network and the TimeSformer space-time Transformer network to obtain a fused feature vector; S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the face in the video sequence is forged.

Description

Face fake video detection method based on Fourier domain adaptation
Technical Field
The invention relates to the technical field of face counterfeiting detection, in particular to a face counterfeiting video detection method based on Fourier domain adaptation.
Background
Face forgery refers to the process of falsifying or replacing a real face by using digital image processing or artificial intelligence techniques so as to generate a false face image or video. Face forgery techniques may be used in entertainment, education, medicine and the like, but may also be used for malicious purposes such as fraud, defamation and disruption of social order. Therefore, face forgery detection is an important security protection means that can protect personal privacy and social fairness by analyzing whether a face in an image or video is genuine or forged.
Currently, face forgery detection techniques fall mainly into two categories: methods based on conventional image processing and methods based on deep learning. Methods based on traditional image processing mainly use statistical features or visual artifacts in images to determine whether a face is forged, such as color distribution, edge sharpness, illumination inconsistency and blink frequency. These methods are simple and easy to implement, but different feature extractors must be designed for different forgery modes, their generalization ability is poor, and they are easily disturbed by factors such as noise, compression and occlusion. Methods based on deep learning mainly use models such as convolutional neural networks or recurrent neural networks to automatically learn features from images or videos and perform classification or regression. These methods can extract high-level semantic features, adapt well to different forgery modes, and handle high-resolution, high-frame-rate data. Their disadvantages are that a large amount of annotated data is required for training and that generalization to unknown forgery methods or cross-domain datasets is poor.
To improve the generalization ability and cross-domain adaptability of face forgery detection, some researchers have proposed methods based on domain adaptation or domain alignment. Domain adaptation or domain alignment refers to transforming or mapping datasets with different distributions or styles so that they become more similar, or identical, under some measure. For example, a paper at CVPR 2022, a top international artificial intelligence conference, proposed a face forgery detection method based on a Spatial Domain Adaptation Network (SDAN) and a Frequency Domain Adaptation Network (FDAN), which first performs spatial domain adaptation and frequency domain adaptation on the images in the source domain dataset and the target domain dataset, and then feeds the adapted images into a shared convolutional neural network for feature extraction and classification. This method can effectively reduce the differences between the source domain dataset and the target domain dataset in both the spatial and frequency domains, and improves cross-domain detection accuracy.
However, the above method only considers spatial and frequency domain adaptation of individual images and ignores the temporal information present in video sequences. A video sequence contains frame-to-frame dynamic changes and correlations that are useful for distinguishing real faces from fake faces. For example, in a video sequence a real face usually exhibits self-consistent movements such as expression changes, eye blinking and head rotation, while a fake face may exhibit abnormal phenomena such as inconsistency, stiffness or repetition. Therefore, when performing face forgery detection, not only image information but also video information should be considered.
Disclosure of Invention
In order to remedy the deficiencies of the prior art, the invention provides a face forgery detection method based on Fourier domain adaptation and a deep learning network, which can effectively use both image information and video information to judge whether the face in a video sequence is forged, and which has good generalization ability and cross-domain adaptability.
The invention is realized by the following technical scheme:
a face fake video detection method based on Fourier domain adaptation is characterized by comprising the following steps of: the method comprises the following steps:
s1: carrying out Fourier domain adaptation on video sequences in a source domain data set and a target domain data set to obtain a video sequence with domain alignment;
s2, inputting each frame of image in the video sequence with the aligned domains into an Xreception network to obtain a feature vector of each frame of image;
s3: inputting the video sequence with the aligned domains into a TimeSformer space-time converter network to obtain feature vectors of the video sequence;
s4: mutually fusing the feature vectors output by the Xreception network and the TimeSformer space-time converter network to obtain fused feature vectors;
s5: inputting the fused feature vectors into a classifier to obtain a judgment result of whether the video sequence is forged by a human face;
in S1, fourier domain adaptation is performed on video sequences in a source domain data set and a target domain data set to obtain a video sequence after domain alignment, and the implementation steps include:
s11: video given a source domain dataset isThe video of the target domain dataset is +.>, wherein ,/>A certain video representing a source domain dataset, +.>Representation->Color picture frame of corresponding video, wherein +.>Representing the real number field, ++> and />Representing the height and width of the image, 3 representing an RGB image with color channels red, green and blue,/for the color channels red, green and blue>Representing the video or picture correspondenceIs true or false, wherein +_a.>Video representing a target domain dataset, +.>Picture representing a target domain dataset, +.>A corresponding tag representing a target domain dataset;
s12: is provided withAmplitude component representing a fourier transformation of a color image, < >>The phase component representing the fourier transform of a color image is converted from the spatial domain to the frequency domain for a single-channel image by equation (1), equation (1) being:
(1)
wherein ,is the image at coordinates +.>Pixel value at +.>Is the transformed image at coordinates +.>Value of (I) at (I)>Is imaginary unit, ++>Euler number, & lt + & gt>Representing the abscissa of the image, +.>Representing the ordinate of the image, +.> and />Represents the abscissa in the frequency domain, +.>Is indicated at->Frequency variation in direction, +.>Is indicated at->Frequency variation in direction, +.>Representing the height of the image +.>Representing the width of the image;
s13: by usingRepresenting a mask matrix for replacing the low frequency region of the image, expressed by equation (2):
(2)
wherein the center position of the designated image is,/>An area with image pixel values of 1 is formed as a square, wherein +.>The size of this square area is indicated, < >>、/>Representing the height and width of the image, +.>、/>Indicating the height and width of the masking region required to be performed;
s14: converting the image of the frequency domain into the space domain again according to the inverse Fourier transform to obtain an image after domain alignment, wherein the transformation formula is as shown in formula (3):
(3)
s15: let equation (3) be written asGiven frame pictures in two video sequences +.>,/>Fourier domain adaptation is expressed by equation (4):
(4)
wherein ,representing inverse fourier transform ++>An image representing a source domain video, +.>An image representing a target field video, +.>Representing an image generated after style migration, +.>Representing the phase part of the source domain image after fourier transformation, and>representing the magnitude part of the target domain image after fourier transformation, a ∈>Representing the magnitude part of the source domain image after fourier transformation, a ∈>Representing a mask matrix->Representing a composite of two functions; in said S15->Set to 0.001; the fourier domain adaptation refers to: performing Fourier transform on each video sequence in the source domain data set and the target domain data set on a time-frequency plane, calculating the amplitude spectrum and the phase spectrum of each video sequence, randomly pairing each video sequence in the source domain data set with each video sequence in the target domain data set, and exchanging the amplitude spectrum between the paired video sequences; and finally, carrying out inverse Fourier transform on the video sequence after the amplitude spectrum exchange on a time-frequency plane, and retaining the original phase spectrum of the video sequence.
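By way of illustration, a minimal NumPy sketch of the per-frame operation in equation (4) is given below, assuming 8-bit RGB frames and the square low-frequency mask M_β; the function name fda_frame and the use of fftshift to center the low frequencies are illustrative choices rather than the patent's exact implementation.

```python
import numpy as np

def fda_frame(src: np.ndarray, tgt: np.ndarray, beta: float = 0.001) -> np.ndarray:
    """Swap the low-frequency amplitude of a source frame with that of a target
    frame while keeping the source phase, channel by channel (equation (4))."""
    out = np.zeros_like(src, dtype=np.float32)
    h, w = src.shape[:2]
    bh, bw = max(1, int(beta * h)), max(1, int(beta * w))
    cy, cx = h // 2, w // 2
    for c in range(src.shape[2]):                        # R, G, B processed separately
        fs = np.fft.fftshift(np.fft.fft2(src[..., c]))   # source spectrum, low freq centered
        ft = np.fft.fftshift(np.fft.fft2(tgt[..., c]))   # target spectrum
        amp_s, pha_s = np.abs(fs), np.angle(fs)          # amplitude / phase of the source
        amp_t = np.abs(ft)                               # amplitude of the target
        # replace only the central (low-frequency) square of the amplitude spectrum
        amp_s[cy - bh:cy + bh + 1, cx - bw:cx + bw + 1] = \
            amp_t[cy - bh:cy + bh + 1, cx - bw:cx + bw + 1]
        rec = amp_s * np.exp(1j * pha_s)                 # recombine amplitude and source phase
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(rec)))
    return np.clip(out, 0, 255)
```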
As a further limitation of the technical solution, feature fusion means combining or integrating the feature vectors of different networks to generate a new feature vector that is more expressive or better suited to the classification task. The Xception network is written as equation (5), where X(·) denotes extracting features from an image and f_i denotes the extracted feature vector; the average feature vector f_avg corresponding to the frame sequence is then computed from the number of frames, see equation (6), where T denotes the number of frames contained in the frame sequence. Similarly, the TimeSformer network is written as equation (7), where S(·) denotes extracting features from an image sequence, g denotes the extracted feature vector, and X^{s→t} denotes the frame sequence generated after style migration. Let F be the fused feature vector, expressed by equation (8), which denotes adding the two feature vectors f_avg and g element-wise to obtain the fused feature vector F; the final prediction probability p is then obtained by equation (9), where softmax(·) denotes the softmax layer and W denotes a linear layer. For the Xception network, f_avg can be converted into the predicted class probability p_x through a linear layer and a softmax layer according to equation (10):

f_i = X(x_i^{s→t})    (5)

f_avg = \frac{1}{T} \sum_{i=1}^{T} f_i    (6)

g = S(X^{s→t})    (7)

F = f_avg + g    (8)

p = softmax(W F)    (9)

p_x = softmax(W_x f_avg)    (10)
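As an illustrative sketch only, the fusion and classification of equations (5)-(10) could look as follows in PyTorch, assuming 2048-dimensional per-frame Xception features and a 2048-dimensional TimeSformer clip feature; the class name FusionHead and the two-class output are assumptions, not the patent's reference code.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Average the per-frame features, add the clip feature, then classify."""
    def __init__(self, feat_dim: int = 2048, num_classes: int = 2):
        super().__init__()
        self.fc_fused = nn.Linear(feat_dim, num_classes)    # linear layer of equation (9)
        self.fc_frames = nn.Linear(feat_dim, num_classes)   # linear layer of equation (10)

    def forward(self, frame_feats: torch.Tensor, clip_feat: torch.Tensor):
        # frame_feats: (T, feat_dim) from Xception; clip_feat: (feat_dim,) from TimeSformer
        f_avg = frame_feats.mean(dim=0)                     # equation (6)
        fused = f_avg + clip_feat                           # equation (8)
        p = torch.softmax(self.fc_fused(fused), dim=-1)     # equation (9)
        p_x = torch.softmax(self.fc_frames(f_avg), dim=-1)  # equation (10)
        return p, p_x
```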
As a further limitation of the technical solution, for the Xception network the loss function used, L_CE, is the cross-entropy loss, calculated by equation (11):

L_CE = -[ y \log(p_x) + (1 - y) \log(1 - p_x) ]    (11)

where log denotes the natural logarithm with base e, and y is the true sample label.
As a further limitation of the technical solution, for the TimeSformer space-time Transformer network the loss function used, L_FL, is the focal loss, calculated by equation (12):

L_FL = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (12)

where log denotes the natural logarithm with base e, p_t denotes the predicted probability of the true class, γ is a modulating parameter set to 2, (1 - p_t)^γ is a scaling factor, and α adjusts the weight of positive and negative samples and is set to 0.25; L_FL represents the loss function of the TimeSformer network.
As a further limitation of the technical solution, a parameter λ is set as the weighting parameter between the different losses:

L = L_CE + λ L_FL    (13)

where L is the total loss of the whole process, L_CE is the loss of the Xception network, L_FL is the loss of the TimeSformer network, and λ is a proportionality coefficient set to 0.5.
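A hedged sketch of the loss combination in equations (11)-(13) follows, written for the probability of the "forged" class per sample; the clamping constant and the element-wise formulation are assumptions made for numerical stability and clarity.

```python
import torch

def total_loss(p_x: torch.Tensor, p: torch.Tensor, y: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25, lam: float = 0.5) -> torch.Tensor:
    """Cross-entropy on the Xception branch plus lambda-weighted focal loss on
    the TimeSformer branch, as in equation (13). p_x, p, y are 1-D tensors."""
    eps = 1e-7
    p_x = p_x.clamp(eps, 1 - eps)
    p = p.clamp(eps, 1 - eps)
    # equation (11): binary cross-entropy with true labels y in {0, 1}
    l_ce = -(y * torch.log(p_x) + (1 - y) * torch.log(1 - p_x))
    # equation (12): focal loss down-weights easy samples via (1 - p_t) ** gamma
    p_t = torch.where(y == 1, p, 1 - p)
    l_fl = -alpha * (1 - p_t) ** gamma * torch.log(p_t)
    # equation (13): weighted sum of the two branch losses
    return (l_ce + lam * l_fl).mean()
```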
The beneficial effects of the invention are as follows:
(1) Not only is Fourier domain adaptation carried out on the image, but also Fourier domain adaptation is carried out on continuous frames in the video sequence, domain alignment is realized by utilizing information on a frequency domain, and domain alignment operation is carried out between a source domain and a target domain, so that the accuracy of face counterfeiting detection is greatly improved;
(2) Instead of a spatial domain adaptation network, an Xception network and a TimeSformer network are used to extract the features of the images and of the video sequence respectively, and these features are fused with each other, so that both spatial and temporal information is exploited to improve the detection effect;
(3) Full training of the neural networks is added while domain alignment between different datasets is performed, so that excellent performance can be achieved in face forgery detection. The Xception network has multi-layer convolution operations and can extract multi-scale features from images; the TimeSformer is a spatio-temporal modeling method based on an attention mechanism and can effectively capture temporal features in videos; combining Xception and TimeSformer makes comprehensive use of the feature information of both static images and video sequences, thereby improving the face forgery detection capability. Both Xception and TimeSformer are models trained on large-scale datasets, have strong robustness and generalization ability, and can cope with variations and interference across different samples.
Drawings
Fig. 1 is a diagram of the transfer of the low-frequency amplitude components of Fourier-transformed images in the present invention.
Fig. 2 is a flowchart of feature fusion of an image sequence using an Xception network and a TimeSformer network according to the present invention.
Fig. 3 is a schematic diagram of the operation of the TimeSformer encoder module.
Fig. 4 shows the process by which the TimeSformer converts a video sequence into temporal features.
Fig. 5 is a flow chart of the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying fig. 1 to 5. In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "left", "right", "front", "rear", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
The specific embodiments of the present invention are as follows:
A face fake video detection method based on Fourier domain adaptation comprises the following steps:
S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences;
S2: inputting each frame of image in the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame of image;
S3: inputting the domain-aligned video sequence into a TimeSformer space-time Transformer network to obtain a feature vector of the video sequence;
S4: mutually fusing the feature vectors output by the Xception network and the TimeSformer space-time Transformer network to obtain a fused feature vector;
S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the face in the video sequence is forged.
In S1, Fourier domain adaptation is performed on the video sequences in the source domain data set and the target domain data set to obtain domain-aligned video sequences; the implementation steps include:
S11: let the videos of the source domain dataset be denoted X^s with labels y^s, where X^s denotes a video of the source domain dataset and x^s ∈ R^{H×W×3} denotes a color picture frame of the corresponding video; R denotes the real number field, H and W denote the height and width of the image, 3 denotes an RGB image whose color channels are red, green and blue, and y^s denotes the label corresponding to the video or picture, i.e. whether the face video is real or fake; likewise, X^t denotes a video of the target domain dataset, x^t denotes a picture of the target domain dataset, and y^t denotes the corresponding label of the target domain dataset;
S12: let F^A denote the amplitude component of the Fourier transform of a color image and F^P denote its phase component; a single-channel image is converted from the spatial domain to the frequency domain by equation (1):

F(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x,y) \, e^{-j 2\pi (ux/H + vy/W)}    (1)

where f(x,y) is the pixel value of the image at coordinates (x,y), F(u,v) is the value of the transformed image at coordinates (u,v), j is the imaginary unit, e is Euler's number, x denotes the abscissa of the image, y denotes the ordinate of the image, u and v denote the coordinates in the frequency domain, u indicates the frequency variation in the x direction, v indicates the frequency variation in the y direction, H denotes the height of the image and W denotes the width of the image;
S13: let M_β denote the mask matrix used to replace the low-frequency region of the image, expressed by equation (2):

M_β(u,v) = 1 if (u,v) ∈ [-βH : βH, -βW : βW], and 0 otherwise    (2)

where the center position of the image is designated as the origin, the region in which the mask value is 1 forms a square, β indicates the size of this square region, H and W denote the height and width of the image, and βH and βW indicate the height and width of the region to be masked;
S14: the frequency-domain image is converted back into the spatial domain by the inverse Fourier transform to obtain the domain-aligned image, the transformation formula being equation (3):

f(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F(u,v) \, e^{j 2\pi (ux/H + vy/W)}    (3)

S15: write equation (3) as F^{-1}; given frame pictures x^s and x^t in the two video sequences, the Fourier domain adaptation is expressed by equation (4):

x^{s→t} = F^{-1}( [ M_β ∘ F^A(x^t) + (1 - M_β) ∘ F^A(x^s) ], F^P(x^s) )    (4)

where F^{-1} denotes the inverse Fourier transform, x^s denotes an image of the source domain video, x^t denotes an image of the target domain video, x^{s→t} denotes the image generated after style migration, F^P(x^s) denotes the phase part of the source domain image after the Fourier transform, F^A(x^t) denotes the amplitude part of the target domain image after the Fourier transform, F^A(x^s) denotes the amplitude part of the source domain image after the Fourier transform, M_β denotes the mask matrix, and ∘ denotes the composition of the two components.
Here, the low-frequency part of the amplitude spectrum of the source domain video image F^A(x^s) is replaced by the low-frequency part of the amplitude spectrum of the target domain video image F^A(x^t); that is, the low-frequency region of the source image's amplitude is replaced by the low-frequency region of the target image's amplitude, and the generated image x^{s→t} has the same content as x^s and the same style as x^t (i.e. the same appearance as x^t).
In S15, as β gradually increases from 0 to 1, the generated image x^{s→t} also becomes closer and closer to x^t, but at the same time visible artifacts appear; therefore β is set to 0.001. Replacing the low-frequency part of the source domain video image (i.e. the region where gray values change slowly) with the low-frequency part of the target domain video image (i.e. the region matching the target style) generates an image with the same content as the source domain and the same style as the target domain. In this way, the domain gap between the source domain and the target domain can be markedly reduced, and better results can be achieved in forgery detection.
The Fourier domain adaptation refers to: performing a Fourier transform on each video sequence in the source domain dataset and the target domain dataset on the time-frequency plane and calculating its amplitude spectrum and phase spectrum; randomly pairing each video sequence in the source domain dataset with a video sequence in the target domain dataset and exchanging the amplitude spectra between the paired video sequences; and finally performing an inverse Fourier transform on the amplitude-exchanged video sequences on the time-frequency plane while retaining their original phase spectra (the sequence-level procedure is sketched below).
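An illustrative sketch of the sequence-level procedure just described is given below, reusing the hypothetical per-frame helper fda_frame from the earlier sketch; the random pairing strategy and the frame-index reuse are assumptions.

```python
import random

def fda_video(src_frames, tgt_videos, beta=0.001):
    """Randomly pair a source video with a target video and apply the per-frame
    amplitude swap, keeping each source frame's phase spectrum."""
    tgt_frames = random.choice(tgt_videos)        # random pairing across datasets
    aligned = []
    for i, s in enumerate(src_frames):
        t = tgt_frames[i % len(tgt_frames)]       # reuse target frames if the clip is shorter
        aligned.append(fda_frame(s, t, beta=beta))
    return aligned
```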
The Xception network is a convolutional neural network model designed around depthwise separable convolution (Depthwise Separable Convolution). It can perform feature extraction on each picture in the video sequence, improving the accuracy of image recognition. The Xception network replaces conventional convolution with depthwise separable convolution, which reduces the number of parameters and the amount of computation. A depthwise separable convolution consists of a depthwise convolution layer and a pointwise convolution layer: the depthwise convolution layer performs a convolution operation on each input channel separately, with the depth of each convolution kernel set to 1, so the input depth remains unchanged; the pointwise convolution layer combines the feature maps of the different channels to form the output feature map.
Each frame of image is input into the Xception network, and a feature vector of length 2048 is obtained as the feature representation of that frame.
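For illustration, per-frame features of this kind could be obtained with a pretrained Xception backbone such as the one provided by the timm library, which outputs a 2048-dimensional pooled vector when the classification head is removed; the model name string and the assumption that frames are already resized and normalized are placeholders.

```python
import timm
import torch

# Xception backbone with the classification head removed (num_classes=0);
# its pooled output is a 2048-dimensional feature vector per frame.
backbone = timm.create_model('xception', pretrained=True, num_classes=0)
backbone.eval()

@torch.no_grad()
def frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W), already resized and normalized; returns (T, 2048)."""
    return backbone(frames)
```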
The TimeSformer network is a video classification model designed based on the Transformer structure; it can model each frame in a video sequence and learn the temporal relationships between frames.
Each video sequence is input into the TimeSformer network, and a feature vector of length 2048 is obtained as the feature representation of the video sequence.
The TimeSformer network comprises a block segmentation layer, a position embedding layer, a linear embedding layer, 12 encoder modules and a global average pooling layer;
the image frame is first block-divided into image blocks,the image blocks are linearly embedded into a vector form, and the vector form is added with the position information contained in the block segmentation to be combined into an embedded vectorAs input to the encoder module.
The encoder module processes the input video sequence using a separate spatiotemporal attention mechanism; the separate spatiotemporal attention mechanism includes: a temporal attention mechanism for interactive processing of image blocks within each frame of the input video sequence and at the same spatial locations of the image blocks from frame to frame; a spatial attention mechanism for processing the interaction between the image blocks in each frame and other image blocks in the same frame; a multi-layer perceptron module for transforming and mapping the characteristics of the temporal and spatial attention mechanisms, and the output of the multi-layer perceptron module can be used as the input of the next encoder module. After 12 iterations, the output of the encoder module enters the global averaging pooling layer, and the required sequence features can be obtained.
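A compact sketch of one encoder block with the separate (temporal-then-spatial) attention described above is shown below, built from standard PyTorch multi-head attention; the embedding dimension, head count and omission of the classification token are simplifying assumptions, not the TimeSformer reference implementation.

```python
import torch
import torch.nn as nn

class DividedSTBlock(nn.Module):
    """Temporal attention across frames at the same patch position, then spatial
    attention among patches within each frame, then an MLP (one encoder block)."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, patches per frame, embedding dimension
        B, T, N, D = x.shape
        # temporal attention: patches at the same spatial position attend across frames
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # spatial attention: patches within the same frame attend to each other
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, T, N, D)
        # MLP with residual connection
        return x + self.mlp(self.norm_m(x))
```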
The feature fusion means combining or integrating the feature vectors of different networks to generate a new feature vector that is more expressive or better suited to the classification task. The Xception network is written as equation (5), where X(·) denotes extracting features from an image and f_i denotes the extracted feature vector; the average feature vector f_avg corresponding to the frame sequence is then computed from the number of frames, see equation (6), where T denotes the number of frames contained in the frame sequence. Similarly, the TimeSformer network is written as equation (7), where S(·) denotes extracting features from an image sequence, g denotes the extracted feature vector, and X^{s→t} denotes the frame sequence generated after style migration. Let F be the fused feature vector, expressed by equation (8), which denotes adding the two feature vectors f_avg and g element-wise to obtain the fused feature vector F; the final prediction probability p is then obtained by equation (9), where softmax(·) denotes the softmax layer and W denotes a linear layer. For the Xception network, f_avg can be converted into the predicted class probability p_x through a linear layer and a softmax layer according to equation (10):

f_i = X(x_i^{s→t})    (5)

f_avg = \frac{1}{T} \sum_{i=1}^{T} f_i    (6)

g = S(X^{s→t})    (7)

F = f_avg + g    (8)

p = softmax(W F)    (9)

p_x = softmax(W_x f_avg)    (10)
For the Xception network, the loss function used, L_CE, is the cross-entropy loss, calculated by equation (11):

L_CE = -[ y \log(p_x) + (1 - y) \log(1 - p_x) ]    (11)

where log denotes the natural logarithm with base e, and y is the true sample label.
For the TimeSformer space-time Transformer network, the loss function used, L_FL, is the focal loss. The focal loss is suited to handling class imbalance; specifically, it modifies the cross-entropy loss so that the model's loss is reduced on samples that are already classified correctly and increased on samples that are difficult to classify, as shown in equation (12):

L_FL = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (12)

where log denotes the natural logarithm with base e, p_t denotes the predicted probability of the true class, γ is a modulating parameter set to 2, and (1 - p_t)^γ is a scaling factor that adjusts the weight of easily classified samples: when a sample is predicted correctly, p_t is close to 1, the scaling factor is close to 0, and the weight of that easy sample is reduced; when a sample is predicted incorrectly, p_t is close to 0, the scaling factor is close to 1, and the weight of that hard sample is increased. α adjusts the weight of positive and negative samples and is set to 0.25. In this section, L_FL is used to represent the loss function of the TimeSformer network, i.e. the focal loss is used for optimizing this branch of the whole network.
A parameter λ is set to balance the proportions of the cross-entropy loss and the focal loss in the whole network:

L = L_CE + λ L_FL    (13)

where L is the total loss of the whole process, L_CE is the loss of the Xception network, L_FL is the loss of the TimeSformer network, and λ is a proportionality coefficient that adjusts the weights of the two loss functions; if the detection effect is poor, λ can be increased appropriately to raise the proportion of the focal loss.
The specific implementation mode of the invention is as follows:
the source domain data set selected by the invention is Celeb-DF (v 2), and the Celeb-DF (v 2) data set comprises real and deep synthesized video, and the video quality is similar to that of online transmission. Celeb-DF includes 590 raw videos collected from YouTube, which have subjects of different ages, ethnicities, and sexes, and 5639 corresponding DeepFake videos. The target domain dataset was chosen from faceforensis++ datasets, which are a Face counterfeited dataset consisting of 1000 original video sequences, created using four methods of operation, face2Face, faceSwap, deepFakes and neurosortutes. The method for detecting whether the video sequence in the target domain data set is the face fake or not comprises the following steps:
first, each video sequence in the source domain dataset (face a, style a) and the target domain dataset (face B, style B) is fourier transformed on the time-frequency plane and its magnitude and phase spectra are calculated. Wherein face a and face B represent faces in different data sets and style a and style B represent different styles for each image. Then, randomly pairing each video sequence in the source domain data set with each video sequence in the target domain data set, and exchanging the amplitude spectrum between the paired video sequences; and finally, carrying out inverse Fourier transform on the video sequence after the amplitude spectrum exchange on a time-frequency plane, and reserving the original phase spectrum of the video sequence to form an image containing a face A and a style B.
Then, each frame picture in the domain-aligned video sequence is input into the Xception network to obtain the feature vector of each frame image, and the domain-aligned video sequence is input into the TimeSformer network to obtain the feature vector of the video sequence.
then, mutually fusing the feature vectors output by the Xception network and the TimeSformer network to obtain fused feature vectors, specifically, averaging the feature vectors of each frame of image output by the Xception network, and adding the average value with the feature vector of the video sequence output by the TimeSformer network, so as to obtain a feature vector containing both image information and video information, and fully reflecting whether the face in the video sequence is fake or not;
and finally, inputting the fused feature vector into a classifier to obtain a judgment result of whether the video sequence is forged by the human face.
Matters not described in detail in the present application are well known to those skilled in the art. Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention, and such modifications are intended to be covered by the scope of the claims of the present invention.

Claims (5)

1. A face fake video detection method based on Fourier domain adaptation, characterized by comprising the following steps:
S1: carrying out Fourier domain adaptation on the video sequences in a source domain data set and a target domain data set to obtain domain-aligned video sequences;
S2: inputting each frame of image in the domain-aligned video sequence into an Xception network to obtain a feature vector of each frame of image;
S3: inputting the domain-aligned video sequence into a TimeSformer space-time Transformer network to obtain a feature vector of the video sequence;
S4: mutually fusing the feature vectors output by the Xception network and the TimeSformer space-time Transformer network to obtain a fused feature vector;
S5: inputting the fused feature vector into a classifier to obtain a judgment result of whether the face in the video sequence is forged;
in S1, Fourier domain adaptation is performed on the video sequences in the source domain data set and the target domain data set to obtain domain-aligned video sequences, and the implementation steps include:
S11: let the videos of the source domain dataset be denoted X^s with labels y^s, where X^s denotes a video of the source domain dataset and x^s ∈ R^{H×W×3} denotes a color picture frame of the corresponding video; R denotes the real number field, H and W denote the height and width of the image, 3 denotes an RGB image whose color channels are red, green and blue, and y^s denotes the label corresponding to the video or picture, i.e. whether the face video is real or fake; likewise, X^t denotes a video of the target domain dataset, x^t denotes a picture of the target domain dataset, and y^t denotes the corresponding label of the target domain dataset;
S12: let F^A denote the amplitude component of the Fourier transform of a color image and F^P denote its phase component; a single-channel image is converted from the spatial domain to the frequency domain by equation (1):

F(u,v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} f(x,y) \, e^{-j 2\pi (ux/H + vy/W)}    (1)

where f(x,y) is the pixel value of the image at coordinates (x,y), F(u,v) is the value of the transformed image at coordinates (u,v), j is the imaginary unit, e is Euler's number, x denotes the abscissa of the image, y denotes the ordinate of the image, u and v denote the coordinates in the frequency domain, u indicates the frequency variation in the x direction, v indicates the frequency variation in the y direction, H denotes the height of the image and W denotes the width of the image;
S13: let M_β denote the mask matrix used to replace the low-frequency region of the image, expressed by equation (2):

M_β(u,v) = 1 if (u,v) ∈ [-βH : βH, -βW : βW], and 0 otherwise    (2)

where the center position of the image is designated as the origin, the region in which the mask value is 1 forms a square, β indicates the size of this square region, H and W denote the height and width of the image, and βH and βW indicate the height and width of the region to be masked;
S14: the frequency-domain image is converted back into the spatial domain by the inverse Fourier transform to obtain the domain-aligned image, the transformation formula being equation (3):

f(x,y) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} F(u,v) \, e^{j 2\pi (ux/H + vy/W)}    (3)

S15: write equation (3) as F^{-1}; given frame pictures x^s and x^t in the two video sequences, the Fourier domain adaptation is expressed by equation (4):

x^{s→t} = F^{-1}( [ M_β ∘ F^A(x^t) + (1 - M_β) ∘ F^A(x^s) ], F^P(x^s) )    (4)

where F^{-1} denotes the inverse Fourier transform, x^s denotes an image of the source domain video, x^t denotes an image of the target domain video, x^{s→t} denotes the image generated after style migration, F^P(x^s) denotes the phase part of the source domain image after the Fourier transform, F^A(x^t) denotes the amplitude part of the target domain image after the Fourier transform, F^A(x^s) denotes the amplitude part of the source domain image after the Fourier transform, M_β denotes the mask matrix, and ∘ denotes the composition of the two components;
in S15, β is set to 0.001; the Fourier domain adaptation refers to: performing a Fourier transform on each video sequence in the source domain dataset and the target domain dataset on the time-frequency plane and calculating its amplitude spectrum and phase spectrum; randomly pairing each video sequence in the source domain dataset with a video sequence in the target domain dataset and exchanging the amplitude spectra between the paired video sequences; and finally performing an inverse Fourier transform on the amplitude-exchanged video sequences on the time-frequency plane while retaining their original phase spectra.
2. The face fake video detection method based on Fourier domain adaptation according to claim 1, characterized in that: the feature fusion means combining or integrating the feature vectors of different networks to generate a new feature vector that is more expressive or better suited to the classification task; the Xception network is written as equation (5), where X(·) denotes extracting features from an image and f_i denotes the extracted feature vector; the average feature vector f_avg corresponding to the frame sequence is then computed from the number of frames, see equation (6), where T denotes the number of frames contained in the frame sequence; similarly, the TimeSformer network is written as equation (7), where S(·) denotes extracting features from an image sequence, g denotes the extracted feature vector, and X^{s→t} denotes the frame sequence generated after style migration; let F be the fused feature vector, expressed by equation (8), which denotes adding the two feature vectors f_avg and g element-wise to obtain the fused feature vector F; the final prediction probability p is then obtained by equation (9), where softmax(·) denotes the softmax layer and W denotes a linear layer; for the Xception network, f_avg can be converted into the predicted class probability p_x through a linear layer and a softmax layer according to equation (10):

f_i = X(x_i^{s→t})    (5)

f_avg = \frac{1}{T} \sum_{i=1}^{T} f_i    (6)

g = S(X^{s→t})    (7)

F = f_avg + g    (8)

p = softmax(W F)    (9)

p_x = softmax(W_x f_avg)    (10).
3. The face fake video detection method based on Fourier domain adaptation according to claim 2, characterized in that: for the Xception network, the loss function used, L_CE, is the cross-entropy loss, calculated by equation (11):

L_CE = -[ y \log(p_x) + (1 - y) \log(1 - p_x) ]    (11)

where log denotes the natural logarithm with base e, and y is the true sample label.
4. The face fake video detection method based on Fourier domain adaptation according to claim 3, characterized in that: for the TimeSformer space-time Transformer network, the loss function used, L_FL, is the focal loss, calculated by equation (12):

L_FL = -\alpha (1 - p_t)^{\gamma} \log(p_t)    (12)

where log denotes the natural logarithm with base e, p_t denotes the predicted probability of the true class, γ is a modulating parameter set to 2, (1 - p_t)^γ is a scaling factor, and α adjusts the weight of positive and negative samples and is set to 0.25; L_FL represents the loss function of the TimeSformer network.
5. The face fake video detection method based on Fourier domain adaptation according to claim 4, characterized in that: a parameter λ is set as the weighting parameter between the different losses:

L = L_CE + λ L_FL    (13)

where L is the total loss of the whole process, L_CE is the loss of the Xception network, L_FL is the loss of the TimeSformer network, and λ is a proportionality coefficient set to 0.5.
CN202310834717.5A 2023-07-10 2023-07-10 Face fake video detection method based on Fourier domain adaptation Active CN116563957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310834717.5A CN116563957B (en) 2023-07-10 2023-07-10 Face fake video detection method based on Fourier domain adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310834717.5A CN116563957B (en) 2023-07-10 2023-07-10 Face fake video detection method based on Fourier domain adaptation

Publications (2)

Publication Number Publication Date
CN116563957A true CN116563957A (en) 2023-08-08
CN116563957B CN116563957B (en) 2023-09-29

Family

ID=87488318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310834717.5A Active CN116563957B (en) 2023-07-10 2023-07-10 Face fake video detection method based on Fourier domain adaptation

Country Status (1)

Country Link
CN (1) CN116563957B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115927A (en) * 2023-10-23 2023-11-24 广州佰锐网络科技有限公司 Audio and video security verification method and system applied to living body detection in financial business

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734696A (en) * 2020-12-24 2021-04-30 华南理工大学 Face changing video tampering detection method and system based on multi-domain feature fusion
EP3818526A1 (en) * 2018-07-05 2021-05-12 DTS, Inc. Hybrid audio synthesis using neural networks
CN113313054A (en) * 2021-06-15 2021-08-27 中国科学技术大学 Face counterfeit video detection method, system, equipment and storage medium
CN113435292A (en) * 2021-06-22 2021-09-24 北京交通大学 AI counterfeit face detection method based on inherent feature mining
US20210390355A1 (en) * 2020-06-13 2021-12-16 Zhejiang University Image classification method based on reliable weighted optimal transport (rwot)
CN114492599A (en) * 2022-01-07 2022-05-13 北京邮电大学 Medical image preprocessing method and device based on Fourier domain self-adaptation
CN114519897A (en) * 2021-12-31 2022-05-20 重庆邮电大学 Human face in-vivo detection method based on color space fusion and recurrent neural network
CN114758272A (en) * 2022-03-31 2022-07-15 中国人民解放军战略支援部队信息工程大学 Forged video detection method based on frequency domain self-attention
CN115188039A (en) * 2022-05-27 2022-10-14 国家计算机网络与信息安全管理中心 Depth forgery video technology tracing method based on image frequency domain information
CN115273169A (en) * 2022-05-23 2022-11-01 西安电子科技大学 Face counterfeiting detection system and method based on time-space-frequency domain clue enhancement
WO2023280423A1 (en) * 2021-07-09 2023-01-12 Cariad Estonia As Methods, systems and computer programs for processing and adapting image data from different domains
CN115761459A (en) * 2022-12-09 2023-03-07 云南楚姚高速公路有限公司 Multi-scene self-adaption method for bridge and tunnel apparent disease identification
US20230081645A1 (en) * 2021-01-28 2023-03-16 Tencent Technology (Shenzhen) Company Limited Detecting forged facial images using frequency domain information and local correlation
CN115909129A (en) * 2022-10-17 2023-04-04 同济大学 Face forgery detection method based on frequency domain feature double-flow network
CN116386590A (en) * 2023-05-29 2023-07-04 北京科技大学 Multi-mode expressive voice synthesis method and device

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3818526A1 (en) * 2018-07-05 2021-05-12 DTS, Inc. Hybrid audio synthesis using neural networks
US20210390355A1 (en) * 2020-06-13 2021-12-16 Zhejiang University Image classification method based on reliable weighted optimal transport (rwot)
CN112734696A (en) * 2020-12-24 2021-04-30 华南理工大学 Face changing video tampering detection method and system based on multi-domain feature fusion
US20230081645A1 (en) * 2021-01-28 2023-03-16 Tencent Technology (Shenzhen) Company Limited Detecting forged facial images using frequency domain information and local correlation
CN113313054A (en) * 2021-06-15 2021-08-27 中国科学技术大学 Face counterfeit video detection method, system, equipment and storage medium
CN113435292A (en) * 2021-06-22 2021-09-24 北京交通大学 AI counterfeit face detection method based on inherent feature mining
WO2023280423A1 (en) * 2021-07-09 2023-01-12 Cariad Estonia As Methods, systems and computer programs for processing and adapting image data from different domains
CN114519897A (en) * 2021-12-31 2022-05-20 重庆邮电大学 Human face in-vivo detection method based on color space fusion and recurrent neural network
CN114492599A (en) * 2022-01-07 2022-05-13 北京邮电大学 Medical image preprocessing method and device based on Fourier domain self-adaptation
CN114758272A (en) * 2022-03-31 2022-07-15 中国人民解放军战略支援部队信息工程大学 Forged video detection method based on frequency domain self-attention
CN115273169A (en) * 2022-05-23 2022-11-01 西安电子科技大学 Face counterfeiting detection system and method based on time-space-frequency domain clue enhancement
CN115188039A (en) * 2022-05-27 2022-10-14 国家计算机网络与信息安全管理中心 Depth forgery video technology tracing method based on image frequency domain information
CN115909129A (en) * 2022-10-17 2023-04-04 同济大学 Face forgery detection method based on frequency domain feature double-flow network
CN115761459A (en) * 2022-12-09 2023-03-07 云南楚姚高速公路有限公司 Multi-scene self-adaption method for bridge and tunnel apparent disease identification
CN116386590A (en) * 2023-05-29 2023-07-04 北京科技大学 Multi-mode expressive voice synthesis method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHUNPENG WANG: "RD-IWAN: Residual Dense Based Imperceptible Watermark Attack Network", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY *
HUI QI: "A Real-Time Face Detection Method Based on Blink Detection", IEEE ACCESS *
CHEN Ran; WU Shiqian; XU Wangming: "A face liveness detection algorithm based on multi-feature fusion in the spatial and frequency domains", Video Engineering, no. 03
CHEN Peng; LIANG Tao; LIU Jin; DAI Jiao; HAN Jizhong: "Forged face video detection method fusing global temporal and local spatial features", Journal of Cyber Security, no. 02
HAN Han; XU Zhi: "Research on face recognition based on domain adaptation and multiple subspaces", Journal of Guilin University of Electronic Technology, no. 03

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115927A (en) * 2023-10-23 2023-11-24 广州佰锐网络科技有限公司 Audio and video security verification method and system applied to living body detection in financial business

Also Published As

Publication number Publication date
CN116563957B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
Guo et al. Fake face detection via adaptive manipulation traces extraction network
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN108537743B (en) Face image enhancement method based on generation countermeasure network
Wang et al. Improving cross-database face presentation attack detection via adversarial domain adaptation
CN111667400B (en) Human face contour feature stylization generation method based on unsupervised learning
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
CN116563957B (en) Face fake video detection method based on Fourier domain adaptation
Xia et al. Towards deepfake video forensics based on facial textural disparities in multi-color channels
Zhang et al. A survey on face anti-spoofing algorithms
CN114694220A (en) Double-flow face counterfeiting detection method based on Swin transform
Ma et al. Unsupervised domain adaptation augmented by mutually boosted attention for semantic segmentation of vhr remote sensing images
Li et al. Zooming into face forensics: A pixel-level analysis
CN113553954A (en) Method and apparatus for training behavior recognition model, device, medium, and program product
Hongmeng et al. A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN117095471B (en) Face counterfeiting tracing method based on multi-scale characteristics
Peng et al. Presentation attack detection based on two-stream vision transformers with self-attention fusion
CN114937298A (en) Micro-expression recognition method based on feature decoupling
CN112990340B (en) Self-learning migration method based on feature sharing
CN114119356A (en) Method for converting thermal infrared image into visible light color image based on cycleGAN
CN111914617B (en) Face attribute editing method based on balanced stack type generation type countermeasure network
CN111489405B (en) Face sketch synthesis system for generating confrontation network based on condition enhancement
Wu et al. Ggvit: Multistream vision transformer network in face2face facial reenactment detection
Duan et al. Image information hiding method based on image compression and deep neural network
Sabitha et al. Enhanced model for fake image detection (EMFID) using convolutional neural networks with histogram and wavelet based feature extractions
CN114463379A (en) Dynamic capturing method and device for video key points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant