CN113313054A - Face counterfeit video detection method, system, equipment and storage medium


Info

Publication number
CN113313054A
CN113313054A (application CN202110662165.5A)
Authority
CN
China
Prior art keywords
face
neural network
video
network model
image
Prior art date
Legal status
Pending
Application number
CN202110662165.5A
Other languages
Chinese (zh)
Inventor
周文柏 (Zhou Wenbo)
张卫明 (Zhang Weiming)
俞能海 (Yu Nenghai)
刘泓谷 (Liu Honggu)
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110662165.5A
Publication of CN113313054A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification


Abstract

The invention discloses a face forged video detection method, system, device and storage medium. In the scheme, phase information is extracted in the frequency domain and inversely transformed back to the spatial domain; spatial-phase features are then extracted in combination with the spatial-domain image, and a shallow learning strategy is designed to learn local texture features, which greatly improves the transferability and interpretability of the model. The scheme performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, it offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.

Description

Face counterfeit video detection method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of face forged video detection, and in particular to a face forged video detection method, system, device and storage medium.
Background
DeepFake videos have become one of the most widely spread media on the Internet today. Since deep learning has been highly successful in computer vision tasks, image generation using autoencoders and Generative Adversarial Networks (GANs) has developed rapidly in recent years. With the increasing advancement of image generation technology and the easy accessibility of the relevant data and algorithms, high-quality DeepFake face forged videos are easier to make and can easily deceive humans. However, these forgery techniques are likely to be abused for malicious purposes, causing serious security and ethical problems, and methods for DeepFake face forged video detection have therefore emerged. Previous DeepFake detection work mainly focuses on how to better distinguish the authenticity of homologous (same-source) data under high-quality, dataset-specific conditions.
Existing image-level DeepFake detection work falls mainly into two categories: spatial-domain detection and frequency-domain detection. Although spatial-domain image-based methods have achieved very good results under certain conditions, they either rely heavily on a uniformly distributed dataset or place very high demands on the quality of the forged video; forged videos in real scenes are usually of low quality and high noise, which largely masks the artifacts produced by the forging process, so such methods have certain limitations. Meanwhile, most previous frequency-domain detection work extracts features only from the amplitude spectrum, but the amplitude spectrum alone cannot fully represent the information contained in the frequency domain, so frequency-domain information is underused; frequency-domain detection is therefore still at an early stage and worth further research and exploration.
A DeepFake face forged video usually retains the identity attributes of the original face, so the semantic information expressed by the forged video is almost unchanged. A traditional model that classifies images with a deep neural network generally has a deep structure and a large receptive field, so global semantic information receives most of the attention, and in a sense features that do not help distinguish authenticity are extracted. Therefore, considering the special properties of DeepFake face forged videos, a reasonable neural network structure needs to be designed and paired with more universal frequency-domain information, so as to form a robust DeepFake face forged video detection method that can prevent the wide spread of illegal DeepFake face forged videos on the Internet; this has important practical application value.
Disclosure of Invention
The invention aims to provide a face forged video detection method, system, device and storage medium that combine the spatial domain and the frequency domain into a robust solution and ensure a high recall rate and a low miss rate in real scenes.
The purpose of the invention is realized by the following technical scheme:
a face-forged video detection method comprises the following steps:
removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection;
and extracting phase spectrum information frame by frame from the input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
A face forged video detection system for implementing the above method, the system comprising:
a model construction unit for removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection;
and an information extraction and detection unit for extracting phase spectrum information frame by frame from input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, phase information is extracted in the frequency domain and inversely transformed back to the spatial domain, and spatial-phase features are extracted in combination with the spatial-domain image; a shallow learning strategy is designed to learn local texture features, which greatly improves the transferability and interpretability of the model. The scheme performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, it offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a face-forged video detection method based on spatial phase shallow learning according to an embodiment of the present invention;
FIG. 2 is a comparison of the differences between the amplitude spectrum and the phase spectrum of real and fake images under repeated up-sampling according to an embodiment of the present invention;
FIG. 3 is a comparison graph of frequency domain analysis of an original image and an up-sampled image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the main differences between the method of the present invention and the prior art provided by the embodiment of the present invention;
fig. 5 is a visual comparison of phase spectra provided by an embodiment of the present invention;
FIG. 6 is a comparison of gradient heat-map visualizations of the method of the present invention and Xception on different datasets provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of a face forged video detection system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In recent years, deep forgery technology has made great progress. With the increasing advancement of image generation technology and the ready availability of the relevant data and algorithms, high-quality face forged videos are more easily produced and can easily fool humans. However, these forgery techniques are likely to be abused for malicious purposes, causing serious security and ethical issues. Academia has proposed many detection methods in an attempt to mitigate this risk, but most of them focus mainly on how to better distinguish the authenticity of homologous data under high-quality, dataset-specific conditions.
Aiming at the problems of current DeepFake face forged video detection methods, the invention provides a robust solution combining the spatial domain and the frequency domain that ensures a high recall rate and a low miss rate in real scenes. To this end, an embodiment of the present invention provides a face forged video detection method based on Spatial-Phase Shallow Learning (SPSL).
To make full use of the commonalities in the DeepFake face video generation process, a strong mapping needs to be established between the image generation process and the detection method. Related studies on DeepFake face forged video generation have shown that numerous up-sampling (Upsampling) operations are an essential step of the generation process, which inspires an analysis of the subtle image changes caused by this operation. Since the constraints of existing image generation methods are almost all imposed in the spatial domain, such artifacts are largely eliminated there; in the frequency domain, however, up-sampling directly introduces extra frequency components, and accumulated up-sampling further amplifies this phenomenon. Theoretical derivation and analysis show that the phase spectrum contains more frequency components and is therefore more sensitive to changes caused by up-sampling than the amplitude spectrum typically used in existing frequency-domain methods. Accordingly, the present invention extracts phase information in the frequency domain to assist in detecting accumulated up-sampling. For natural images, most high-frequency components have amplitudes close to 0, yet these components can still be computed in the phase spectrum, so the phase spectrum contains more effective frequency components; and since each up-sampling generates new frequency components, real and fake images can be detected using this frequency-domain information.
To make full use of the characteristics of DeepFake face forged videos, the invention provides a spatial-domain detection scheme based on shallow learning: the depth of the neural network is reduced by removing some of its convolutional and pooling layers, forcing the network to attend to local areas under a smaller receptive field and to capture texture features. Shallow learning also reduces computation and time cost; it is a detection scheme well suited to the characteristics of DeepFake face forged videos and, while improving detection performance, is more suitable for practical application.
Based on the above principle, the main process of the detection method provided by the present invention is shown in fig. 1, and mainly includes:
removing the convolutional layers in the middle part of a pre-trained neural network model for face forged video detection, thereby reducing the depth and receptive field of the model so that it focuses on local areas and captures their texture features;
and extracting phase spectrum information frame by frame from the input face video images, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video images from the neural network model.
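The two main steps above can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes and hypothetical helper names (`phase_channel`, `make_rgbp`); the patent does not fix an API, and here the phase spectrum is inverse-transformed directly, matching the formulas given later in the description:

```python
import numpy as np

def phase_channel(gray):
    """Hypothetical helper: per-frame phase spectrum mapped back to the
    spatial domain via a 2-D DFT, angle extraction, and inverse DFT."""
    spec = np.fft.fft2(gray)              # frequency spectrum X(u)
    phase = np.angle(spec)                # phase spectrum P(u)
    return np.real(np.fft.ifft2(phase))   # spatial representation p(n)

def make_rgbp(rgb):
    """Stack the phase channel onto an RGB frame, giving the 4-channel
    RGB-P input fed to the first convolutional layer."""
    gray = rgb.mean(axis=2)               # simple luminance approximation
    p = phase_channel(gray)
    return np.concatenate([rgb, p[..., None]], axis=2)

frame = np.random.rand(16, 16, 3)         # one aligned face frame
rgbp = make_rgbp(frame)                   # 4-channel RGB-P tensor
```

The RGB channels pass through unchanged; only the extra P channel is derived from the frequency domain.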
Compared with the prior art, the scheme of the embodiment of the invention mainly has the following advantages:
1) The method provides a complete face forged video detection framework based on spatial-phase shallow learning, makes full use of the commonalities and characteristics of forged videos, and ensures the robustness of the detection method.
2) The method extracts phase information in the frequency domain, inversely transforms it back to the spatial domain, and extracts spatial-phase features in combination with the spatial-domain image; a shallow learning strategy learns local texture features, greatly improving the transferability and interpretability of the model.
3) The disclosed method performs excellently on the DeepFake face forged video detection task and achieves state-of-the-art results on cross-dataset detection. In addition, the invention offers a new line of thinking for forged video detection and other computer vision tasks, facilitating follow-up work.
For ease of understanding, the present invention is further described below.
Before describing the main technical solution of the present invention, an important observation about the DeepFake video generation process, briefly mentioned above, is first described.
Almost all DeepFake videos are synthesized from images generated by autoencoders and Generative Adversarial Networks (GANs), and such methods usually involve a large number of up-sampling steps, so the traces that accumulated up-sampling leaves in the frequency domain of the images are closely tied to DeepFake videos. However, since most existing methods do not study the frequency domain in depth, such frequency-domain forgery traces are not fully utilized. As shown in fig. 2, after many up-samplings the amplitude spectrum of the generated image shows some slight differences, while in the phase spectrum such differences are greatly amplified; in fig. 2, the upper curve corresponds to the phase spectrum and the lower curve to the amplitude spectrum, taking the right-hand end point as reference. Although this difference is not readily detectable by the human eye, the exposed weakness can be captured by a well-designed detector. To verify this phenomenon, the amplitude and phase spectra were analyzed in the frequency domain by up-sampling real face images; the residual of the phase spectrum verifies this important observation, as shown in fig. 3.
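The extra frequency components introduced by up-sampling can be reproduced on a toy 1-D signal. This is an illustrative sketch assuming nearest-neighbour up-sampling, one of the operators common in generator networks; the exact operator used by any given generator may differ:

```python
import numpy as np

N = 32
x = np.cos(2 * np.pi * 3 * np.arange(N) / N)   # band-limited "natural" signal
X = np.fft.fft(x)                              # energy only at bins 3 and N-3

x_up = np.repeat(x, 2)                         # 2x nearest-neighbour up-sampling
X_up = np.fft.fft(x_up)

# Repeating samples replicates the original spectrum into the new
# high-frequency half: spectral images appear where the band-limited
# signal previously had no energy.
high_band = np.abs(X_up[N // 2 : 3 * N // 2])
```

For an ideally band-limited interpolation the high band would stay empty; the nonzero image components are exactly the kind of trace a frequency-domain detector can pick up.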
Based on this observation, it is considered necessary to establish a connection between up-sampling and face forgery detection, which brings significant advantages to DeepFake video detection. Fig. 4 shows the differences between most existing methods and the method of the present invention: the present method attends not only to the spatial-domain representation of a forged image but also to the weak points of the forged video generation process, further improving the robustness and universality of the model. The spectrum of each image frame is obtained through a Fourier transform, the phase spectrum is separated out by calculation, and this extra information is fed into the network as an additional channel for feature extraction and classification.
Secondly, owing to the characteristics of DeepFake video generation technology, most generation methods focus only on the semantic information of the fake face, such as identity and attributes, while the local textures of real and fake images are observed to differ substantially; the model is therefore guided to attend more to the local information of the image rather than global information. Considering computation and efficiency, shallow learning is realized by removing some of the network's convolutional layers, yielding a smaller receptive field, prompting the model to attend to finer regions and extract local texture features, and resulting in a more universal forgery detection method.
Combining the two lines of thought above, the invention fuses phase information with texture-feature information: the phase information extracted in the frequency domain is converted to the spatial domain and used for shallow learning, yielding the final DeepFake face forged video detection framework. The specific implementation steps of the detection framework are as follows:
1. phase spectrum information is extracted.
First, frame-by-frame face detection and alignment are performed on all videos (as the training set), and then a Discrete Fourier Transform (DFT) is applied, as follows:
X(u) = Σ_{n=0}^{N−1} x(n) · e^{−j2πun/N}
where x(n) is the value of an image pixel, X(u) is the value of the image spectrum at frequency u, N is the total number of pixels, j is the imaginary unit, and e is the base of the natural logarithm.
The phase spectrum is then calculated from the spectrum X(u) and is expressed as:
P(u) = arctan( I(u) / R(u) )
where I(u) and R(u) are the imaginary and real parts of the spectrum X(u), respectively.
Then, the calculated phase spectrum is inversely Fourier transformed back to the spatial domain, with the formula:
p(n) = (1/N) · Σ_{u=0}^{N−1} P(u) · e^{j2πun/N}
where p(n) is the value of a spatial pixel of the phase spectrum, P(u) is the value of the phase spectrum at frequency u, N is the total number of pixels, and j is the imaginary unit.
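A minimal 1-D transcription of the three formulas above, for illustration only (the patent applies the transforms to whole image frames, and normalization conventions may differ between implementations):

```python
import numpy as np

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 5 * n / N)    # x(n): pixel values along one line

# DFT: X(u) = sum_n x(n) e^{-j 2 pi u n / N}
X = np.fft.fft(x)

# Phase spectrum P(u) = arctan(I(u) / R(u)), computed with atan2 so
# the quadrant of the angle is resolved correctly
P = np.arctan2(X.imag, X.real)

# Inverse transform of the phase spectrum back to the spatial domain
p = np.real(np.fft.ifft(P))
```

Note that `np.arctan2(X.imag, X.real)` coincides with `np.angle(X)`; the two-argument form is used to mirror the arctan(I/R) notation while avoiding division by R(u) ≈ 0.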
After this series of Fourier transforms and inverse transforms, the phase-spectrum spatial-domain representation shown in fig. 5 is obtained; ORG, DF, F2F, FS and NT marked on the left of fig. 5 denote the original image and 4 conventional forgery methods, while Image denotes the image and Phase the phase spectrum of the corresponding image. The obvious differences between the phase spectra of the different forgery methods and that of a real image are easy to find, further proving that the phase spectrum contains more distinguishable information. In addition, it can be observed in fig. 5 that the phase-spectrum spatial-domain images of different forgery methods exhibit specific patterns, making it easier to distinguish between forgery methods. It can also be proved theoretically that the phase spectrum contains more frequency components usable for distinguishing real images from fake ones, as follows:
For a natural image, the information carried by the low-frequency components is much greater than that of the high-frequency components, so the amplitude spectrum of the high-frequency components u_k satisfies
A(u_k) = √( R(u_k)² + I(u_k)² ) ≈ 0
Thus R(u_k) ≈ 0 and I(u_k) ≈ 0 can be obtained; for this portion of the high-frequency components, the phase spectrum is obtained:
P(u_k) = arctan( I(u_k) / R(u_k) ), which remains computable even when A(u_k) ≈ 0
Thus it can be demonstrated that the phase spectrum contains richer frequency components than the amplitude spectrum.
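The argument can be checked numerically on a smooth 1-D signal standing in for a natural image. This is a sketch: the high-band amplitudes here are dominated by floating-point residue, which is precisely the regime in which the amplitude spectrum carries no usable information while arctan(I/R) still yields well-defined values:

```python
import numpy as np

n = np.arange(256)
smooth = np.exp(-((n - 128) / 20.0) ** 2)    # low-frequency content only
S = np.fft.fft(smooth)

hi = S[100:156]                              # high-frequency bins near N/2
amp = np.abs(hi)                             # amplitude: essentially zero
phase = np.arctan2(hi.imag, hi.real)         # phase: still well defined
```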
2. And constructing a shallow learning framework.
Analysis of the texture characteristics of real and fake samples shows that the local textures of forged videos contain a large number of artifacts and abnormal traces; that is, texture information deserves more attention for DeepFake video detection. The convolution operations of a neural network directly affect the size of the receptive field and hence the size of the region the network attends to. The receptive field is defined as follows:
RF_{l−1} = s_l · RF_l + (k_l − s_l)
RF_0 = Σ_{l=1}^{L} ( (k_l − 1) · Π_{i=1}^{l−1} s_i ) + 1
where l indexes the convolutional layers, and k_l and s_l are the kernel size and stride of layer l; reducing the number L of convolutional layers therefore reduces the receptive field of the neural network model.
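The backward recursion can be implemented directly. This is a sketch with a hypothetical layer list; it does not reproduce the exact XceptionNet configuration (whose 1083 and 187 figures also involve strides and pooling):

```python
def receptive_field(layers):
    """Receptive field of the network input, computed by running
    RF_{l-1} = s_l * RF_l + (k_l - s_l) backwards from RF_L = 1.
    `layers` is a front-to-back list of (kernel_size, stride) pairs."""
    rf = 1
    for k, s in reversed(layers):
        rf = s * rf + (k - s)
    return rf

deep = [(3, 1)] * 10       # ten 3x3 stride-1 convolutions
shallow = [(3, 1)] * 3     # middle layers removed
rf_deep = receptive_field(deep)        # 21 pixels
rf_shallow = receptive_field(shallow)  # 7 pixels
```

Dropping the middle layers shrinks the receptive field, which is exactly the mechanism used to steer the network toward local texture.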
To this end, the method directly redesigns an existing common neural network model (the neural network model for face forged video detection): specifically, several convolutional layers at the front end and the convolutional layers at the tail end of the model are retained, and the remaining middle convolutional layers are removed. Illustratively, the neural network model may be a pre-trained XceptionNet model with 12 convolutional layers; the first 3 convolutional layers and the last convolutional layers are kept and the middle 8 convolutional layers are removed, which reduces the receptive field from 1083 to 187, a great reduction that directs the network to focus on local areas.
In addition, the input of the neural network model needs a slight modification: since the phase information extracted in the previous steps is used directly as input, the number of input channels of the first convolutional layer of the network is increased by 1. For example, if the original input of the first convolutional layer has 3 channels, i.e. an RGB image, the channel count becomes 4 and the input is the RGB image plus the phase information (i.e. an RGB-P image).
The 4-channel RGB-P image is processed in the neural network in the same way as an RGB image: because the P channel is still an image in spatial-domain form, its features can be extracted directly by the convolutional layers, and apart from the first convolutional layer's input channel count being increased by 1, the channel counts of the other convolutional layers are unchanged from the original network. After the convolution and related operations, the last convolutional layer outputs a 2048-dimensional feature vector, and the features extracted by the network are fed directly into a fully connected layer to complete the binary real-versus-fake classification.
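At the architecture level, the only input-side change is that the first kernel tensor gains one input channel. The naive valid-mode convolution below is a sketch with hypothetical sizes, not the patent's exact network:

```python
import numpy as np

def conv2d(x, w):
    """Naive stride-1, no-padding convolution.
    x: (C, H, W) input, w: (O, C, k, k) kernels -> (O, H-k+1, W-k+1)."""
    C, H, W = x.shape
    O, _, k, _ = w.shape
    out = np.zeros((O, H - k + 1, W - k + 1))
    for o in range(O):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return out

rng = np.random.default_rng(0)
rgbp = rng.random((4, 8, 8))      # RGB-P input: 3 colour channels + P
w1 = rng.random((16, 4, 3, 3))    # first layer: input channels 3 -> 4
feat = conv2d(rgbp, w1)           # later layers need no change at all
```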
Based on the above scheme, a shallow learning framework suited to DeepFake forgery detection is constructed; it can effectively detect DeepFake face forged videos and ensures a high recall rate and a low miss rate in real scenes.
Of course, the above model framework still needs training and optimization; this can be implemented with reference to conventional techniques and is described briefly below.
As in the previous example, a network redesigned from XceptionNet can be used as the backbone of the model; it is an effective network structure for DeepFake detection and is widely used in existing detection methods. To migrate it to the present task, the input image is the channel-wise combination of the original RGB image and the spatial representation of the phase, and the RGB-P image is cropped to 299×299 as input.
To make the model converge more easily during training, XceptionNet pre-trained on ImageNet is still adopted, and the whole training process adaptively adjusts the learning rate to reach a global optimum.
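One common reading of "adaptively adjusting the learning rate" is a reduce-on-plateau rule; the patent does not fix the schedule, so the policy and default values below are assumptions for illustration:

```python
def adapt_lr(lr, val_losses, patience=3, factor=0.5):
    """Halve the learning rate when the validation loss has not improved
    for `patience` consecutive epochs (hypothetical policy and defaults)."""
    if len(val_losses) > patience and \
            min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return lr * factor
    return lr

# Plateauing losses trigger a decay; improving losses leave lr unchanged.
lr_plateau = adapt_lr(0.1, [1.0, 0.9, 0.9, 0.9, 0.9])
lr_improve = adapt_lr(0.1, [1.0, 0.9, 0.8, 0.7, 0.6])
```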
To illustrate the effects of the above embodiments of the present invention, experiments are described below.
1. Experimental evaluation on homologous, identically distributed data.
To examine the effectiveness of the present invention, the performance of the method was first compared with some of the latest detection methods on homologous, identically distributed data. The FF++ dataset is used as the training and test set, and High Quality (HQ) and Low Quality (LQ) videos are selected for verification respectively. Xception is used as the baseline model. All comparative experiments use the same training data partition. Table 1 shows the evaluation results for high-quality and low-quality videos on the FF++ dataset; SPSL (Xception) in Table 1 denotes the method of the present invention, with the network in parentheses used as its backbone in the experiment, and likewise in the following tables. The results in Table 1 show that the method of the present invention improves performance under both the high-quality and low-quality experimental settings. In addition, the detection performance for different forgery algorithms was verified separately on low-quality videos, with results shown in Table 2. Although the main contribution of the invention lies in detection performance across different data distributions, the method also achieves good results on same-source, identically distributed data.
TABLE 1. Evaluation results for high-quality and low-quality videos on the FF++ dataset
TABLE 2. Evaluation results for four different forgery methods on the FF++ dataset
2. Experimental evaluation in real scenes with unknown data distribution.
In real scenes, the data source and data distribution of the DeepFake video to be detected are generally hard to obtain, so the transferability of a detection method is very important for its practical application. Using the FF++ dataset as the training set and the Celeb-DF dataset, whose source distribution differs from that of the training set, as the test set, the cross-dataset evaluation results are shown in Table 3.
TABLE 3 Cross-dataset evaluation results on the Celeb-DF dataset
The results in Table 3 show that the method provided by the present invention achieves the best currently published evaluation result on the Celeb-DF dataset, with a larger improvement in transferability than previous methods. Here, Xception-c40 refers to the result of training the basic Xception network on the FF++ Low Quality (LQ) dataset.
3. Multi-classification evaluation.
An extended experiment was carried out to verify the effectiveness of the method in multi-classification. The goal of this experiment is to distinguish, through the framework of the invention, not only real from fake images but also the different tampering methods. All real face images in the FF++ dataset are labeled 0 and the images of the different forgery methods are labeled 1-5 respectively; the experimental results are shown in Table 4, where c0, c23, and c40 denote datasets at different degrees of compression.
TABLE 4 Recall of each forgery method under three different video qualities
As can be seen from the results in Table 4, the invention achieves a substantial improvement under forged videos of all three quality levels. These results indicate that the method can uncover latent differences in high-dimensional space in the multi-classification setting. Across all forgery methods, XceptionNet (the baseline method) more easily confuses real samples with samples forged by NeuralTextures, whereas the framework of the invention better separates the corresponding real and fake classes in the feature space.
4. Ablation experiments.
The experiments above verify the effectiveness of the complete framework, but it remains to verify that the phase spectrum and shallow learning each improve detection performance. As shown in Table 5, experiments compare four settings: the baseline, the phase spectrum only, shallow learning only, and the phase spectrum combined with shallow learning.
TABLE 5 Ablation results of the phase spectrum and shallow learning
The experimental results shown in Table 5 indicate that the phase spectrum or shallow learning alone can still improve transferability, but the complete framework performs best.
Furthermore, experiments verify that the method described by the invention is a general one; in other words, the whole framework can be plugged into any deep convolutional neural network classifier. ResNet being among the most widely used network architectures, ResNet-34 and ResNet-50 were selected to verify the generality of the method, with experimental results shown in Table 6.
TABLE 6 Evaluation results on FF++ and Celeb-DF under different network structures
The experimental results shown in Table 6 indicate that both ResNet-34 and ResNet-50 achieve greatly improved performance, verifying the generality of the method.
5. Interpretability analysis.
The experiments above verify the effectiveness of the invention and of its individual modules, showing that it reaches a leading level in deepfake detection performance. We further analyzed the advantages of the invention in terms of interpretability. As shown in Fig. 6, gradient heat-map visualization is used to analyze the image regions the deepfake detection network attends to when making decisions: dark colors mark regions receiving more attention, light colors regions receiving less. Compared with the baseline method, the present method focuses more on the facial-feature regions of the face, which better matches where face forgeries occur in the real world, and therefore has stronger interpretability.
Another embodiment of the present invention further provides a face forgery video detection system, which is mainly used to implement the method provided by the foregoing embodiment; as shown in fig. 7, the system includes:
the model construction unit is used for removing the middle convolutional layers of a pre-trained neural network model for face forgery video detection;
and the information extraction and detection unit is used for extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting, from the neural network model, the detection result for the face video image.
Another embodiment of the present invention further provides an electronic device, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the electronic device further comprises at least one input device and at least one output device; in the electronic equipment, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing electronic device, for example, as the memory in the electronic device. The readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A face forgery video detection method, characterized by comprising the following steps:
removing the middle convolutional layers of a pre-trained neural network model for detecting face forgery video;
and extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame and its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting the detection result of the face video image from the neural network model.
2. The method as claimed in claim 1, wherein said extracting phase spectrum information frame by frame from the input face video image comprises:
performing face detection and alignment frame by frame on the input face video image, and then performing a discrete Fourier transform to obtain the frequency spectrum X(u):

$$X(u)=\sum_{n=0}^{N-1} x(n)\,e^{-j2\pi un/N},\qquad u=0,1,\ldots,N-1$$

wherein x(n) represents the value of an image pixel, X(u) represents the value of the image spectrum at frequency u, N represents the total number of pixels, j is the imaginary unit, and e is the natural constant;
the phase spectrum P(u) is calculated from the frequency spectrum X(u) as:

$$P(u)=\arctan\frac{I(u)}{R(u)}$$

where I(u) and R(u) are the imaginary and real parts of the frequency spectrum X(u), respectively.
3. The method as claimed in claim 1, wherein removing the middle convolutional layers comprises: retaining several convolutional layers at the front end and the convolutional layers at the tail end of the neural network model, and removing the remaining convolutional layers.
4. The method as claimed in claim 1 or 3, wherein the neural network model is a pre-trained XceptionNet model having 12 convolutional layers, of which the first 3 convolutional layers and the last convolutional layer are retained while the middle 8 convolutional layers are removed.
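A minimal sketch of this layer selection (the layer names and the helper function are hypothetical; a real implementation would slice the module list of a pretrained XceptionNet):

```python
def prune_middle(layers, keep_front=3, keep_tail=1):
    """Keep the first keep_front and last keep_tail layers, drop the middle."""
    return layers[:keep_front] + layers[-keep_tail:]

# 12 convolutional layers, as in claim 4: keep conv1-conv3 and conv12,
# removing the middle 8.
blocks = [f"conv{i}" for i in range(1, 13)]
print(prune_middle(blocks))  # ['conv1', 'conv2', 'conv3', 'conv12']
```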
5. The face forgery video detection method based on shallow spatial-phase learning according to claim 1 or 3, wherein the receptive field is defined as follows:

$$RF_{l-1}=s_l\cdot RF_l+(k_l-s_l)$$

with RF_L = 1 at the final feature map, which unrolls to

$$RF_0=\sum_{l=1}^{L}(k_l-1)\prod_{i=1}^{l-1}s_i+1$$

wherein L represents the number of convolutional layers, and k_l and s_l represent the kernel size and stride of convolutional layer l;
the number of layers L of the convolutional layers is reduced, thereby reducing the receptive field of the neural network model.
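The recursion above can be evaluated numerically to confirm that removing layers shrinks the receptive field; the helper below is an illustrative sketch (the receptive field at the final feature map is taken as 1):

```python
def receptive_field(kernels, strides):
    """Receptive field at the input, from RF_{l-1} = s_l*RF_l + (k_l - s_l),
    unrolled backwards from RF_L = 1 at the final feature map."""
    rf = 1
    for k, s in zip(reversed(kernels), reversed(strides)):
        rf = s * rf + (k - s)
    return rf

# Four 3x3 stride-1 conv layers vs. two: the shallower stack sees less context.
print(receptive_field([3, 3, 3, 3], [1, 1, 1, 1]))  # 9
print(receptive_field([3, 3], [1, 1]))              # 5
```

A smaller receptive field forces the network to decide from local texture statistics rather than global semantics, which is the motivation for the shallow design.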
6. The method as claimed in claim 1, wherein inputting each face image frame and its corresponding phase spectrum information into the first convolutional layer of the neural network model and outputting the detection result of the face video image from the neural network model comprises:
the face image frame is a three-channel image and the corresponding phase spectrum information is an image in spatial-domain form, so that together they form a four-channel image;
increasing the input channels of the first convolutional layer of the neural network model from three to four, with the channel counts of the remaining convolutional layers unchanged; through the convolution operations of the neural network model, the last convolutional layer outputs a feature vector, and the feature vector passes through the fully connected layer to output the authenticity detection result of the face video image.
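A NumPy sketch of the four-channel input and of widening the first convolutional layer from three input channels to four. The mean-of-RGB initialisation for the new channel's weights, like both helper names, is an assumption of this sketch and is not specified by the claim:

```python
import numpy as np

def make_four_channel(frame_rgb, phase_img):
    """Stack an (H, W, 3) face frame with its (H, W) phase image -> (H, W, 4)."""
    return np.concatenate([frame_rgb, phase_img[..., None]], axis=-1)

def expand_first_conv(weight):
    """Widen pretrained first-layer weights (out, 3, k, k) -> (out, 4, k, k).

    The fourth input channel is initialised with the mean of the three RGB
    kernels -- an illustrative choice, not specified by the claim.
    """
    extra = weight.mean(axis=1, keepdims=True)
    return np.concatenate([weight, extra], axis=1)

x = make_four_channel(np.zeros((32, 32, 3)), np.zeros((32, 32)))
w = expand_first_conv(np.random.default_rng(1).normal(size=(16, 3, 3, 3)))
print(x.shape, w.shape)  # (32, 32, 4) (16, 4, 3, 3)
```

Only the first layer's weight tensor changes shape, so the rest of a pretrained backbone can be reused unmodified.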
7. A face forgery video detection system for implementing the method of any one of claims 1 to 6, the system comprising:
the model construction unit is used for removing the middle convolutional layers of a pre-trained neural network model for face forgery video detection;
and the information extraction and detection unit is used for extracting phase spectrum information frame by frame from an input face video image, inputting each face image frame together with its corresponding phase spectrum information into the first convolutional layer of the neural network model, and outputting, from the neural network model, the detection result for the face video image.
8. An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-6.
CN202110662165.5A 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium Pending CN113313054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110662165.5A CN113313054A (en) 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113313054A true CN113313054A (en) 2021-08-27

Family

ID=77378881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110662165.5A Pending CN113313054A (en) 2021-06-15 2021-06-15 Face counterfeit video detection method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113313054A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428865A (en) * 2020-04-20 2020-07-17 杭州电子科技大学 Visual analysis method for understanding DQN model
CN112215171A (en) * 2020-10-15 2021-01-12 腾讯科技(深圳)有限公司 Target detection method, device, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONGGU LIU et al.: "Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain", arXiv *
YANG Jiezhi: "A fine-grained pneumonia recognition method", Journal of Chongqing Normal University *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium
CN113762138B (en) * 2021-09-02 2024-04-23 恒安嘉新(北京)科技股份公司 Identification method, device, computer equipment and storage medium for fake face pictures
CN114841340A (en) * 2022-04-22 2022-08-02 马上消费金融股份有限公司 Deep forgery algorithm identification method and device, electronic equipment and storage medium
CN114841340B (en) * 2022-04-22 2023-07-28 马上消费金融股份有限公司 Identification method and device for depth counterfeiting algorithm, electronic equipment and storage medium
CN114881838A (en) * 2022-07-07 2022-08-09 中国科学技术大学 Bidirectional face data protection method, system and equipment for deep forgery
CN114881838B (en) * 2022-07-07 2022-10-28 中国科学技术大学 Bidirectional face data protection method, system and equipment for deep forgery
CN116132084A (en) * 2022-09-20 2023-05-16 马上消费金融股份有限公司 Video stream processing method and device and electronic equipment
CN116563957A (en) * 2023-07-10 2023-08-08 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation
CN116563957B (en) * 2023-07-10 2023-09-29 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation
CN117238015A (en) * 2023-08-28 2023-12-15 浙江大学 General depth forging detection method based on generation model

Similar Documents

Publication Publication Date Title
Liu et al. Global texture enhancement for fake face detection in the wild
Wu et al. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features
CN113313054A (en) Face counterfeit video detection method, system, equipment and storage medium
Scherhag et al. Detection of face morphing attacks based on PRNU analysis
Debiasi et al. PRNU-based detection of morphed face images
Wang et al. Effective image splicing detection based on image chroma
Nishiyama et al. Facial deblur inference using subspace analysis for recognition of blurred faces
Hussain et al. Evaluation of image forgery detection using multi-scale weber local descriptors
Alshayeji et al. Detection method for counterfeit currency based on bit-plane slicing technique
Xu et al. Forensic detection of Gaussian low-pass filtering in digital images
Gul et al. SVD based image manipulation detection
Yu et al. Manipulation classification for jpeg images using multi-domain features
Isaac et al. Image forgery detection based on Gabor wavelets and local phase quantization
Barni et al. Detection of adaptive histogram equalization robust against JPEG compression
Agarwal et al. MagNet: Detecting digital presentation attacks on face recognition
Isaac et al. Multiscale local gabor phase quantization for image forgery detection
CN111259792A (en) Face living body detection method based on DWT-LBP-DCT characteristics
Maser et al. PRNU-based detection of finger vein presentation attacks
CN117373136A (en) Face counterfeiting detection method based on frequency mask and attention consistency
Bera et al. Spoofing detection on hand images using quality assessment
Le-Tien et al. Combined Zernike moment and multiscale Analysis for tamper detection in digital images
Lu et al. Face morphing detection with convolutional neural network based on multi-features
binti Ashari et al. Multi-scale texture analysis for finger vein anti-spoofing
Goel et al. An approach for anti-forensic contrast enhancement detection using grey level co-occurrence matrix and Zernike moments
Moin et al. Benford's law for detecting contrast enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210827