CN113313054A - Face counterfeit video detection method, system, equipment and storage medium


Info

Publication number
CN113313054A
CN113313054A (application CN202110662165.5A)
Authority
CN
China
Prior art keywords
face
neural network
video
network model
image
Prior art date
Legal status
Pending
Application number
CN202110662165.5A
Other languages
Chinese (zh)
Inventor
周文柏 (Zhou Wenbo)
张卫明 (Zhang Weiming)
俞能海 (Yu Nenghai)
刘泓谷 (Liu Honggu)
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110662165.5A
Publication of CN113313054A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification


Abstract

The invention discloses a face forged video detection method, system, device and storage medium. In the scheme, phase information is extracted in the frequency domain and inversely transformed back to the spatial domain; spatial-phase features are then extracted in combination with the spatial-domain image, and a shallow learning strategy is designed to learn local texture features, which greatly improves the transferability and interpretability of the model. The scheme performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, it offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.

Description

Face counterfeit video detection method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of face forged video detection, and in particular to a face forged video detection method, system, device and storage medium.
Background
DeepFake videos have become one of the most widely spread media on the Internet today. Since deep learning has been highly successful in computer vision tasks, image generation using autoencoders and Generative Adversarial Networks (GANs) has developed rapidly in recent years. With the increasing advancement of image generation technology and the easy accessibility of the relevant data and algorithms, high-quality DeepFake face forged videos are easier to make and can easily deceive humans. However, these forgery techniques are likely to be abused for malicious purposes, causing serious security and ethical problems, and methods for DeepFake face forged video detection have therefore emerged. Previous DeepFake detection work mainly focuses on how to better distinguish the authenticity of homologous (same-source) data under high-quality, dataset-specific conditions.
Existing image-level DeepFake detection work falls mainly into two categories: spatial-domain detection and frequency-domain detection. Although spatial-domain image-based methods have achieved very good results under certain conditions, they either rely heavily on a uniformly distributed dataset or place very high demands on the quality of the forged video; forged videos in real scenes are usually of low quality and high noise, which largely masks the artifacts produced by the forging process, so such methods have certain limitations. Meanwhile, most previous frequency-domain detection work extracts features only from the amplitude spectrum, but the amplitude spectrum alone cannot fully represent the information contained in the frequency domain, so frequency-domain information is underused; frequency-domain detection is therefore still at an early stage and worth further research and exploration.
A DeepFake face forged video usually retains the identity attributes of the original face, so the semantic information expressed by the forged video is almost unchanged. A traditional model that classifies images with a deep neural network generally has a deep structure and a large receptive field, so global semantic information receives most of the attention, and in a sense features that do not help distinguish authenticity are extracted. Therefore, considering the special properties of DeepFake face forged videos, a reasonable neural network structure needs to be designed and paired with more universal frequency-domain information, so as to form a robust DeepFake face forged video detection method that can prevent the wide spread of illegal DeepFake face forged videos on the Internet; this has important practical application value.
Disclosure of Invention
The invention aims to provide a face forged video detection method, system, device and storage medium that combine the spatial domain and the frequency domain into a robust solution and ensure a high recall rate and a low miss rate in real scenes.
The purpose of the invention is realized by the following technical scheme:
a face-forged video detection method comprises the following steps:
removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection;
and extracting phase spectrum information frame by frame from the input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
A face forged video detection system for implementing the above method, the system comprising:
a model construction unit for removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection;
and an information extraction and detection unit for extracting phase spectrum information frame by frame from input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, phase information is extracted in the frequency domain and inversely transformed back to the spatial domain, and spatial-phase features are extracted in combination with the spatial-domain image; a shallow learning strategy is designed to learn local texture features, which greatly improves the transferability and interpretability of the model. The scheme performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, it offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a face-forged video detection method based on spatial phase shallow learning according to an embodiment of the present invention;
FIG. 2 is a comparison of the differences between the amplitude spectrum and the phase spectrum of real and fake images under repeated up-sampling according to an embodiment of the present invention;
FIG. 3 is a comparison graph of frequency domain analysis of an original image and an up-sampled image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the main differences between the method of the present invention and the prior art provided by the embodiment of the present invention;
fig. 5 is a visual comparison of phase spectra provided by an embodiment of the present invention;
FIG. 6 is a comparison of gradient heat-map visualizations of the method of the present invention and Xception on different datasets provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of a face forged video detection system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In recent years, deep forgery technology has made great progress. With the increasing advancement of image generation technology and the ready availability of the relevant data and algorithms, high-quality face forged videos are more easily produced and can easily fool humans. However, these forgery techniques are likely to be abused for malicious purposes, causing serious security and ethical issues. Academia has proposed many detection methods in an attempt to mitigate this risk, but most of them focus mainly on how to better distinguish the authenticity of homologous data under high-quality, dataset-specific conditions.
Aiming at the problems of current DeepFake face forged video detection methods, the invention provides a robust solution combining the spatial domain and the frequency domain that ensures a high recall rate and a low miss rate in real scenes. To this end, an embodiment of the present invention provides a face forged video detection method based on Spatial-Phase Shallow Learning (SPSL).
To make full use of the commonalities in the DeepFake face video generation process, a strong mapping needs to be established between the image generation process and the detection method. Related studies on DeepFake face forged video generation have shown that numerous up-sampling (Upsampling) operations are an essential step of the generation process, which inspires an analysis of the subtle image changes caused by this operation. Since the constraints of existing image generation methods are almost all imposed in the spatial domain, such artifacts are largely eliminated there; in the frequency domain, however, up-sampling directly introduces extra frequency components, and accumulated up-sampling further amplifies this phenomenon. Theoretical derivation and analysis show that the phase spectrum contains more frequency components and is therefore more sensitive to changes caused by up-sampling than the amplitude spectrum typically used in existing frequency-domain methods. Accordingly, the present invention extracts phase information in the frequency domain to assist in detecting accumulated up-sampling. For natural images, most high-frequency components have amplitudes close to 0, yet these components can still be computed in the phase spectrum, so the phase spectrum contains more effective frequency components; and since each up-sampling generates new frequency components, real and fake images can be detected using this frequency-domain information.
To make full use of the characteristics of DeepFake face forged videos, the invention provides a spatial-domain detection scheme based on shallow learning: the depth of the neural network is reduced by removing some of its convolutional and pooling layers, forcing the network to attend to local areas under a smaller receptive field and to capture texture features. Shallow learning also reduces computation and time cost; it is a detection scheme well suited to the characteristics of DeepFake face forged videos and, while improving detection performance, is more suitable for practical application.
Based on the above principle, the main process of the detection method provided by the present invention is shown in fig. 1, and mainly includes:
removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection, thereby reducing the depth and receptive field of the model so that it focuses on local areas and captures their texture features;
and extracting phase spectrum information frame by frame from the input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
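The two main steps above can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes and hypothetical helper names (`phase_channel`, `make_rgbp`); the patent does not fix an API, and here the phase spectrum is inverse-transformed directly, matching the formulas given later in the description:

```python
import numpy as np

def phase_channel(gray):
    """Hypothetical helper: per-frame phase spectrum mapped back to the
    spatial domain via a 2-D DFT, angle extraction, and inverse DFT."""
    spec = np.fft.fft2(gray)              # frequency spectrum X(u)
    phase = np.angle(spec)                # phase spectrum P(u)
    return np.real(np.fft.ifft2(phase))   # spatial representation p(n)

def make_rgbp(rgb):
    """Stack the phase channel onto an RGB frame, giving the 4-channel
    RGB-P input fed to the first convolutional layer."""
    gray = rgb.mean(axis=2)               # simple luminance approximation
    p = phase_channel(gray)
    return np.concatenate([rgb, p[..., None]], axis=2)

frame = np.random.rand(16, 16, 3)         # one aligned face frame
rgbp = make_rgbp(frame)                   # 4-channel RGB-P tensor
```

The RGB channels pass through unchanged; only the extra P channel is derived from the frequency domain.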
Compared with the prior art, the scheme of the embodiment of the invention mainly has the following advantages:
1) The method provides a complete face forged video detection framework based on spatial-phase shallow learning, makes full use of the commonalities and characteristics of forged videos, and ensures the robustness of the detection method.
2) The method extracts phase information in the frequency domain, inversely transforms it back to the spatial domain, and extracts spatial-phase features in combination with the spatial-domain image; a shallow learning strategy learns local texture features, greatly improving the transferability and interpretability of the model.
3) The disclosed method performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, the invention offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.
For ease of understanding, the present invention is further described below.
Before describing the main technical solution of the present invention, an important observation about the DeepFake video generation process, briefly mentioned above, is first described.
Almost all DeepFake videos are synthesized from images generated by autoencoders and Generative Adversarial Networks (GANs), and such methods usually involve a large number of up-sampling steps, so the traces that accumulated up-sampling leaves in the frequency domain of the images are closely tied to DeepFake videos. However, since most existing methods do not study the frequency domain in depth, such frequency-domain forgery traces are not fully utilized. As shown in fig. 2, after many up-samplings the amplitude spectrum of the generated image shows some slight differences, while in the phase spectrum such differences are greatly amplified; in fig. 2, the upper curve corresponds to the phase spectrum and the lower curve to the amplitude spectrum, taking the right-hand end point as reference. Although this difference is not readily detectable by the human eye, the exposed weakness can be captured by a well-designed detector. To verify this phenomenon, the amplitude and phase spectra were analyzed in the frequency domain by up-sampling real face images; the residual of the phase spectrum verifies this important observation, as shown in fig. 3.
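The extra frequency components introduced by up-sampling can be reproduced on a toy 1-D signal. This is an illustrative sketch assuming nearest-neighbour up-sampling, one of the operators common in generator networks; the exact operator used by any given generator may differ:

```python
import numpy as np

N = 32
x = np.cos(2 * np.pi * 3 * np.arange(N) / N)   # band-limited "natural" signal
X = np.fft.fft(x)                              # energy only at bins 3 and N-3

x_up = np.repeat(x, 2)                         # 2x nearest-neighbour up-sampling
X_up = np.fft.fft(x_up)

# Repeating samples replicates the original spectrum into the new
# high-frequency half: spectral images appear where the band-limited
# signal previously had no energy.
high_band = np.abs(X_up[N // 2 : 3 * N // 2])
```

For an ideally band-limited interpolation the high band would stay empty; the nonzero image components are exactly the kind of trace a frequency-domain detector can pick up.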
Based on this observation, it is considered necessary to establish a connection between up-sampling and face forgery detection, which brings significant advantages to DeepFake video detection. Fig. 4 shows the differences between most existing methods and the method of the present invention: the present method attends not only to the spatial-domain representation of a forged image but also to the weak points of the forged video generation process, further improving the robustness and universality of the model. The spectrum of each image frame is obtained through a Fourier transform, the phase spectrum is separated out by calculation, and this extra information is fed into the network as an additional channel for feature extraction and classification.
Secondly, owing to the characteristics of DeepFake video generation technology, most generation methods focus only on the semantic information of the fake face, such as identity and attributes, while the local textures of real and fake images are observed to differ substantially; the model is therefore guided to attend more to the local information of the image rather than global information. Considering computation and efficiency, shallow learning is realized by removing some of the network's convolutional layers, yielding a smaller receptive field, prompting the model to attend to finer regions and extract local texture features, and resulting in a more universal forgery detection method.
Combining the two lines of thought above, the invention fuses phase information with texture-feature information: the phase information extracted in the frequency domain is converted to the spatial domain and used for shallow learning, yielding the final DeepFake face forged video detection framework. The specific implementation steps of the detection framework are as follows:
1. phase spectrum information is extracted.
First, frame-by-frame face detection and alignment are performed on all videos (as the training set), and then a Discrete Fourier Transform (DFT) is applied, as follows:
X(u) = Σ_{n=0}^{N−1} x(n) · e^{−j2πun/N}
where x(n) is the value of an image pixel, X(u) is the value of the image spectrum at frequency u, N is the total number of pixels, j is the imaginary unit, and e is the base of the natural logarithm.
The phase spectrum is then calculated from the spectrum X(u) and is expressed as:
P(u) = arctan( I(u) / R(u) )
where I(u) and R(u) are the imaginary and real parts of the spectrum X(u), respectively.
Then, the calculated phase spectrum is inversely Fourier transformed back to the spatial domain, with the formula:
p(n) = (1/N) · Σ_{u=0}^{N−1} P(u) · e^{j2πun/N}
where p(n) is the value of a spatial pixel of the phase spectrum, P(u) is the value of the phase spectrum at frequency u, N is the total number of pixels, and j is the imaginary unit.
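A minimal 1-D transcription of the three formulas above, for illustration only (the patent applies the transforms to whole image frames, and normalization conventions may differ between implementations):

```python
import numpy as np

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N)    # x(n): pixel values along one line

# DFT: X(u) = sum_n x(n) e^{-j 2 pi u n / N}
X = np.fft.fft(x)

# Phase spectrum P(u) = arctan(I(u) / R(u)), computed with atan2 so
# the quadrant of the angle is resolved correctly
P = np.arctan2(X.imag, X.real)

# Inverse transform of the phase spectrum back to the spatial domain
p = np.real(np.fft.ifft(P))
```

Note that `np.arctan2(X.imag, X.real)` coincides with `np.angle(X)`; the two-argument form is used to mirror the arctan(I/R) notation while avoiding division by R(u) ≈ 0.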
After this series of Fourier transforms and inverse transforms, the phase-spectrum spatial-domain representation shown in fig. 5 is obtained; ORG, DF, F2F, FS and NT marked on the left of fig. 5 denote the original image and 4 conventional forgery methods, while Image denotes the image and Phase the phase spectrum of the corresponding image. The obvious differences between the phase spectra of the different forgery methods and that of a real image are easy to find, further proving that the phase spectrum contains more distinguishable information. In addition, it can be observed in fig. 5 that the phase-spectrum spatial-domain images of different forgery methods exhibit specific patterns, making it easier to distinguish between forgery methods. It can also be proved theoretically that the phase spectrum contains more frequency components usable for distinguishing real images from fake ones, as follows:
For a natural image, the information carried by the low-frequency components is much greater than that of the high-frequency components, so the amplitude spectrum of the high-frequency components u_k satisfies
A(u_k) = √( R(u_k)² + I(u_k)² ) ≈ 0
Thus R(u_k) ≈ 0 and I(u_k) ≈ 0 can be obtained; for this portion of the high-frequency components, the phase spectrum is obtained:
P(u_k) = arctan( I(u_k) / R(u_k) ), which remains computable even when A(u_k) ≈ 0
Thus it can be demonstrated that the phase spectrum contains richer frequency components than the amplitude spectrum.
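The argument can be checked numerically on a smooth 1-D signal standing in for a natural image. This is a sketch: the high-band amplitudes here are dominated by floating-point residue, which is precisely the regime in which the amplitude spectrum carries no usable information while arctan(I/R) still yields well-defined values:

```python
import numpy as np

n = np.arange(256)
smooth = np.exp(-((n - 128) / 20.0) ** 2)    # low-frequency content only
S = np.fft.fft(smooth)

hi = S[100:156]                              # high-frequency bins near N/2
amp = np.abs(hi)                             # amplitude: essentially zero
phase = np.arctan2(hi.imag, hi.real)         # phase: still well defined
```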
2. And constructing a shallow learning framework.
Analysis of the texture characteristics of real and fake samples shows that the local textures of forged videos contain a large number of artifacts and abnormal traces; that is, texture information deserves more attention for DeepFake video detection. The convolution operations of a neural network directly affect the size of the receptive field and hence the size of the region the network attends to. The receptive field is defined as follows:
RF_{l−1} = s_l · RF_l + (k_l − s_l)
RF_0 = Σ_{l=1}^{L} ( (k_l − 1) · Π_{i=1}^{l−1} s_i ) + 1
where l indexes the convolutional layers, and k_l and s_l are the kernel size and stride of layer l; reducing the number L of convolutional layers therefore reduces the receptive field of the neural network model.
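The backward recursion can be implemented directly. This is a sketch with a hypothetical layer list; it does not reproduce the exact XceptionNet configuration (whose 1083 and 187 figures also involve strides and pooling):

```python
def receptive_field(layers):
    """Receptive field of the network input, computed by running
    RF_{l-1} = s_l * RF_l + (k_l - s_l) backwards from RF_L = 1.
    `layers` is a front-to-back list of (kernel_size, stride) pairs."""
    rf = 1
    for k, s in reversed(layers):
        rf = s * rf + (k - s)
    return rf

deep = [(3, 1)] * 10       # ten 3x3 stride-1 convolutions
shallow = [(3, 1)] * 3     # middle layers removed
rf_deep = receptive_field(deep)        # 21 pixels
rf_shallow = receptive_field(shallow)  # 7 pixels
```

Dropping the middle layers shrinks the receptive field, which is exactly the mechanism used to steer the network toward local texture.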
To this end, the method directly redesigns an existing common neural network model (the neural network model for face forged video detection): specifically, several convolutional layers at the front end and the convolutional layers at the tail end of the model are retained, and the remaining middle convolutional layers are removed. Illustratively, the neural network model may be a pre-trained XceptionNet model with 12 convolutional layers; the first 3 convolutional layers and the last convolutional layers are kept and the middle 8 convolutional layers are removed, which reduces the receptive field from 1083 to 187, a great reduction that directs the network to focus on local areas.
In addition, the input of the neural network model needs a slight modification: since the phase information extracted in the previous steps is used directly as input, the number of input channels of the first convolutional layer of the network is increased by 1. For example, if the original input of the first convolutional layer has 3 channels, i.e. an RGB image, the channel count becomes 4 and the input is the RGB image plus the phase information (i.e. an RGB-P image).
The 4-channel RGB-P image is processed in the neural network in the same way as an RGB image: because the P channel is still an image in spatial-domain form, its features can be extracted directly by the convolutional layers, and apart from the first convolutional layer's input channel count being increased by 1, the channel counts of the other convolutional layers are unchanged from the original network. After the convolution and related operations, the last convolutional layer outputs a 2048-dimensional feature vector, and the features extracted by the network are fed directly into a fully connected layer to complete the binary real-versus-fake classification.
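At the architecture level, the only input-side change is that the first kernel tensor gains one input channel. The naive valid-mode convolution below is a sketch with hypothetical sizes, not the patent's exact network:

```python
import numpy as np

def conv2d(x, w):
    """Naive stride-1, no-padding convolution.
    x: (C, H, W) input, w: (O, C, k, k) kernels -> (O, H-k+1, W-k+1)."""
    C, H, W = x.shape
    O, _, k, _ = w.shape
    out = np.zeros((O, H - k + 1, W - k + 1))
    for o in range(O):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return out

rng = np.random.default_rng(0)
rgbp = rng.random((4, 8, 8))      # RGB-P input: 3 colour channels + P
w1 = rng.random((16, 4, 3, 3))    # first layer: input channels 3 -> 4
feat = conv2d(rgbp, w1)           # later layers need no change at all
```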
Based on the above scheme, a shallow learning framework suited to DeepFake forgery detection is constructed; it can effectively detect DeepFake face forged videos and ensures a high recall rate and a low miss rate in real scenes.
Of course, the above model framework still needs training and optimization; this can be implemented with reference to conventional techniques and is described briefly below.
As in the previous example, a network redesigned from XceptionNet can be used as the backbone of the model; it is an effective network structure for DeepFake detection and is widely used in existing detection methods. To migrate it to the present task, the input image is the channel-wise combination of the original RGB image and the spatial representation of the phase, and the RGB-P image is cropped to 299×299 as input.
To make the model converge more easily during training, XceptionNet pre-trained on ImageNet is still adopted, and the whole training process adaptively adjusts the learning rate to reach a global optimum.
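One common reading of "adaptively adjusting the learning rate" is a reduce-on-plateau rule; the patent does not fix the schedule, so the policy and default values below are assumptions for illustration:

```python
def adapt_lr(lr, val_losses, patience=3, factor=0.5):
    """Halve the learning rate when the validation loss has not improved
    for `patience` consecutive epochs (hypothetical policy and defaults)."""
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return lr * factor
    return lr

# Plateauing losses trigger a decay; improving losses leave lr unchanged.
lr_plateau = adapt_lr(0.1, [1.0, 0.9, 0.9, 0.9, 0.9])
lr_improve = adapt_lr(0.1, [1.0, 0.9, 0.8, 0.7, 0.6])
```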
To illustrate the effects of the above embodiments of the present invention, experiments are described below.
1. Experimental evaluation on homologous, identically distributed data.
To examine the effectiveness of the present invention, the performance of the method was first compared with some of the latest detection methods on homologous, identically distributed data. The FF++ dataset is used as the training and test set, and High Quality (HQ) and Low Quality (LQ) videos are selected for verification respectively. Xception is used as the baseline model. All comparative experiments use the same training data partition. Table 1 shows the evaluation results for high-quality and low-quality videos on the FF++ dataset; SPSL (Xception) in Table 1 denotes the method of the present invention, with the network in parentheses used as its backbone in the experiment, and likewise in the following tables. The results in Table 1 show that the method of the present invention improves performance under both the high-quality and low-quality experimental settings. In addition, the detection performance for different forgery algorithms was verified separately on low-quality videos, with results shown in Table 2. Although the main contribution of the invention lies in detection performance across different data distributions, the method also achieves good results on same-source, identically distributed data.
TABLE 1. Evaluation results for high-quality and low-quality videos on the FF++ dataset
TABLE 2. Evaluation results for four different forgery methods on the FF++ dataset
2. Experimental evaluation in real scenes with unknown data distribution.
In real scenes, the data source and data distribution of the DeepFake video to be detected are generally hard to obtain, so the transferability of a detection method is very important for its practical application. Using the FF++ dataset as the training set and the Celeb-DF dataset, whose source distribution differs from that of the training set, as the test set, the cross-dataset evaluation results are shown in Table 3.
TABLE 3 Cross-dataset evaluation results on the Celeb-DF dataset
The results in Table 3 show that the method provided by the present invention achieves the best currently published evaluation result on the Celeb-DF dataset, with a larger improvement in transferability than previous methods. Here, Xception-c40 refers to the result of training the basic Xception network on the FF++ Low Quality (LQ) dataset.
3. Multi-classification evaluation.
An extended experiment was carried out to verify the effectiveness of the method in multi-classification. The goal of this experiment is to distinguish, through the framework of the invention, not only real from fake images but also the different tampering methods. All real face images in the FF++ dataset are labeled 0 and the images of the different forgery methods are labeled 1-5 respectively; the experimental results are shown in Table 4, where c0, c23, and c40 denote datasets at different degrees of compression.
TABLE 4 Recall of each forgery method under three different video qualities
As can be seen from the results in Table 4, the invention achieves a substantial improvement under forged videos of all three quality levels. These results indicate that the method can uncover latent differences in high-dimensional space in the multi-classification setting. Across all forgery methods, XceptionNet (the baseline method) more easily confuses real samples with samples forged by NeuralTextures, whereas the framework of the invention better separates the corresponding real and fake classes in the feature space.
4. Ablation experiments.
The experiments above verify the effectiveness of the complete framework, but it remains to verify that the phase spectrum and shallow learning each improve detection performance. As shown in Table 5, experiments compare four settings: the baseline, the phase spectrum only, shallow learning only, and the phase spectrum combined with shallow learning.
TABLE 5 Ablation results of the phase spectrum and shallow learning
The experimental results shown in Table 5 indicate that the phase spectrum or shallow learning alone can still improve transferability, but the complete framework performs best.
Furthermore, experiments verify that the method described by the invention is a general one; in other words, the whole framework can be plugged into any deep convolutional neural network classifier. ResNet being among the most widely used network architectures, ResNet-34 and ResNet-50 were selected to verify the generality of the method, with experimental results shown in Table 6.
TABLE 6 Evaluation results on FF++ and Celeb-DF under different network structures
The experimental results shown in Table 6 indicate that both ResNet-34 and ResNet-50 achieve greatly improved performance, verifying the generality of the method.
5. Interpretability analysis.
The experiments above verify the effectiveness of the invention and of its individual modules, showing that it reaches a leading level in deepfake detection performance. We further analyzed the advantages of the invention in terms of interpretability. As shown in Fig. 6, gradient heat-map visualization is used to analyze the image regions the deepfake detection network attends to when making decisions: dark colors mark regions receiving more attention, light colors regions receiving less. Compared with the baseline method, the present method focuses more on the facial-feature regions of the face, which better matches where face forgeries occur in the real world, and therefore has stronger interpretability.
Another embodiment of the present invention further provides a face forgery video detection system, which is mainly used to implement the method provided by the foregoing embodiment; as shown in fig. 7, the system includes:
the model construction unit is used for removing the middle convolutional layers of a pre-trained neural network model for face forgery video detection;
and the information extraction and detection unit is used for extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting, from the neural network model, the detection result for the face video image.
Another embodiment of the present invention further provides an electronic device, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the electronic device further comprises at least one input device and at least one output device; in the electronic equipment, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing electronic device, for example, as the memory in the electronic device. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A face forgery video detection method, characterized by comprising the following steps:
removing the middle convolutional layers of a pre-trained neural network model for detecting face forgery video;
and extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame and its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video image from the neural network model.
2. The method as claimed in claim 1, wherein said extracting phase spectrum information frame by frame from the input face video image comprises:
performing face detection and alignment frame by frame on the input face video image, and then performing a discrete Fourier transform to obtain the frequency spectrum X(u):

$$X(u)=\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi un/N},\qquad u=0,1,\ldots,N-1$$

wherein x(n) represents the value of an image pixel, X(u) represents the value of the image spectrum at frequency u, N represents the total number of pixels, j is the imaginary unit, and e is the natural constant;
the phase spectrum P(u) is calculated from the frequency spectrum X(u) as:

$$P(u)=\arctan\frac{I(u)}{R(u)}$$

where I(u) and R(u) are the imaginary and real parts of the frequency spectrum X(u), respectively.
3. The method as claimed in claim 1, wherein removing the middle convolutional layers comprises: retaining several convolutional layers at the front end and the convolutional layers at the tail end of the neural network model, and removing the remaining convolutional layers.
4. The method as claimed in claim 1 or 3, wherein the neural network model is a pre-trained XceptionNet model having 12 convolutional layers, of which the first 3 convolutional layers and the last convolutional layer are retained while the middle 8 convolutional layers are removed.
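A minimal sketch of this layer selection (the layer names and the helper function are hypothetical; a real implementation would slice the module list of a pretrained XceptionNet):

```python
def prune_middle(layers, keep_front=3, keep_tail=1):
    """Keep the first keep_front and last keep_tail layers, drop the middle."""
    return layers[:keep_front] + layers[-keep_tail:]

# 12 convolutional layers, as in claim 4: keep conv1-conv3 and conv12,
# removing the middle 8.
blocks = [f"conv{i}" for i in range(1, 13)]
print(prune_middle(blocks))  # ['conv1', 'conv2', 'conv3', 'conv12']
```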
5. The face forgery video detection method based on shallow spatial-phase learning according to claim 1 or 3, wherein the receptive field is defined as follows:

$$RF_{l-1}=s_l\cdot RF_l+(k_l-s_l)$$

with RF_L = 1 at the final feature map, which unrolls to

$$RF_0=\sum_{l=1}^{L}(k_l-1)\prod_{i=1}^{l-1}s_i+1$$

wherein L represents the number of convolutional layers, and k_l and s_l represent the kernel size and stride of convolutional layer l;
the number of layers L of the convolutional layers is reduced, thereby reducing the receptive field of the neural network model.
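The recursion above can be evaluated numerically to confirm that removing layers shrinks the receptive field; the helper below is an illustrative sketch (the receptive field at the final feature map is taken as 1):

```python
def receptive_field(kernels, strides):
    """Receptive field at the input, from RF_{l-1} = s_l*RF_l + (k_l - s_l),
    unrolled backwards from RF_L = 1 at the final feature map."""
    rf = 1
    for k, s in zip(reversed(kernels), reversed(strides)):
        rf = s * rf + (k - s)
    return rf

# Four 3x3 stride-1 conv layers vs. two: the shallower stack sees less context.
print(receptive_field([3, 3, 3, 3], [1, 1, 1, 1]))  # 9
print(receptive_field([3, 3], [1, 1]))              # 5
```

A smaller receptive field forces the network to decide from local texture statistics rather than global semantics, which is the motivation for the shallow design.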
6. The method as claimed in claim 1, wherein inputting each face image frame and its corresponding phase spectrum information into the first convolutional layer of the neural network model and outputting the detection result of the face video image from the neural network model comprises:
the face image frame is a three-channel image and the corresponding phase spectrum information is an image in spatial-domain form, so that together they form a four-channel image;
increasing the input channels of the first convolutional layer of the neural network model from three to four, with the channel counts of the remaining convolutional layers unchanged; through the convolution operations of the neural network model, the last convolutional layer outputs a feature vector, and the feature vector passes through the fully connected layer to output the authenticity detection result of the face video image.
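A NumPy sketch of the four-channel input and of widening the first convolutional layer from three input channels to four. The mean-of-RGB initialisation for the new channel's weights, like both helper names, is an assumption of this sketch and is not specified by the claim:

```python
import numpy as np

def make_four_channel(frame_rgb, phase_img):
    """Stack an (H, W, 3) face frame with its (H, W) phase image -> (H, W, 4)."""
    return np.concatenate([frame_rgb, phase_img[..., None]], axis=-1)

def expand_first_conv(weight):
    """Widen pretrained first-layer weights (out, 3, k, k) -> (out, 4, k, k).

    The fourth input channel is initialised with the mean of the three RGB
    kernels -- an illustrative choice, not specified by the claim.
    """
    extra = weight.mean(axis=1, keepdims=True)
    return np.concatenate([weight, extra], axis=1)

x = make_four_channel(np.zeros((32, 32, 3)), np.zeros((32, 32)))
w = expand_first_conv(np.random.default_rng(1).normal(size=(16, 3, 3, 3)))
print(x.shape, w.shape)  # (32, 32, 4) (16, 4, 3, 3)
```

Only the first layer's weight tensor changes shape, so the rest of a pretrained backbone can be reused unmodified.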
7. A face forgery video detection system for implementing the method of any one of claims 1 to 6, the system comprising:
the model construction unit is used for removing the middle convolutional layers of a pre-trained neural network model for face forgery video detection;
and the information extraction and detection unit is used for extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting, from the neural network model, the detection result for the face video image.
8. An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-6.
CN202110662165.5A 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium Pending CN113313054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110662165.5A CN113313054A (en) 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113313054A true CN113313054A (en) 2021-08-27

Family

ID=77378881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110662165.5A Pending CN113313054A (en) 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113313054A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428865A (en) * 2020-04-20 2020-07-17 杭州电子科技大学 Visual analysis method for understanding DQN model
CN112215171A (en) * 2020-10-15 2021-01-12 腾讯科技(深圳)有限公司 Target detection method, device, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONGGU LIU et al.: "Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain", arXiv *
YANG Jiezhi: "A fine-grained pneumonia recognition method", Journal of Chongqing Normal University *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium
CN113762138B (en) * 2021-09-02 2024-04-23 恒安嘉新(北京)科技股份公司 Identification method, device, computer equipment and storage medium for fake face pictures
CN114841340A (en) * 2022-04-22 2022-08-02 马上消费金融股份有限公司 Deep forgery algorithm identification method and device, electronic equipment and storage medium
CN114841340B (en) * 2022-04-22 2023-07-28 马上消费金融股份有限公司 Identification method and device for depth counterfeiting algorithm, electronic equipment and storage medium
CN114881838A (en) * 2022-07-07 2022-08-09 中国科学技术大学 Bidirectional face data protection method, system and equipment for deep forgery
CN114881838B (en) * 2022-07-07 2022-10-28 中国科学技术大学 Bidirectional face data protection method, system and equipment for deep forgery
CN116132084A (en) * 2022-09-20 2023-05-16 马上消费金融股份有限公司 Video stream processing method and device and electronic equipment
CN116563957A (en) * 2023-07-10 2023-08-08 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation
CN116563957B (en) * 2023-07-10 2023-09-29 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation
CN117238015A (en) * 2023-08-28 2023-12-15 浙江大学 General depth forging detection method based on generation model

Similar Documents

Publication Publication Date Title
Liu et al. Global texture enhancement for fake face detection in the wild
Wu et al. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features
CN113313054A (en) Face counterfeit video detection method, system, equipment and storage medium
Scherhag et al. Detection of face morphing attacks based on PRNU analysis
Debiasi et al. PRNU-based detection of morphed face images
Wang et al. Effective image splicing detection based on image chroma
Nishiyama et al. Facial deblur inference using subspace analysis for recognition of blurred faces
Hussain et al. Evaluation of image forgery detection using multi-scale weber local descriptors
Alshayeji et al. Detection method for counterfeit currency based on bit-plane slicing technique
Xu et al. Forensic detection of Gaussian low-pass filtering in digital images
Gul et al. SVD based image manipulation detection
Yu et al. Manipulation classification for jpeg images using multi-domain features
Isaac et al. Image forgery detection based on Gabor wavelets and local phase quantization
Barni et al. Detection of adaptive histogram equalization robust against JPEG compression
Agarwal et al. MagNet: Detecting digital presentation attacks on face recognition
Isaac et al. Multiscale local gabor phase quantization for image forgery detection
CN111259792A (en) Face living body detection method based on DWT-LBP-DCT characteristics
Maser et al. PRNU-based detection of finger vein presentation attacks
CN117373136A (en) Face counterfeiting detection method based on frequency mask and attention consistency
Bera et al. Spoofing detection on hand images using quality assessment
Le-Tien et al. Combined Zernike moment and multiscale Analysis for tamper detection in digital images
Lu et al. Face morphing detection with convolutional neural network based on multi-features
binti Ashari et al. Multi-scale texture analysis for finger vein anti-spoofing
Goel et al. An approach for anti-forensic contrast enhancement detection using grey level co-occurrence matrix and Zernike moments
Moin et al. Benford's law for detecting contrast enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210827