CN113989713B - Depth forgery detection method based on video frame sequence prediction - Google Patents

Depth forgery detection method based on video frame sequence prediction

Info

Publication number
CN113989713B
CN113989713B (application CN202111265016.1A)
Authority
CN
China
Prior art keywords
video
true
false
frame sequence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111265016.1A
Other languages
Chinese (zh)
Other versions
CN113989713A (en)
Inventor
Cao Juan (曹娟)
Xie Tian (谢添)
Guo Chenyang (郭晨阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zhongke Ruijian Technology Co ltd
Original Assignee
Hangzhou Zhongke Ruijian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Zhongke Ruijian Technology Co ltd filed Critical Hangzhou Zhongke Ruijian Technology Co ltd
Priority to CN202111265016.1A priority Critical patent/CN113989713B/en
Publication of CN113989713A publication Critical patent/CN113989713A/en
Application granted granted Critical
Publication of CN113989713B publication Critical patent/CN113989713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Television Systems (AREA)

Abstract

The invention relates to a depth forgery detection method based on video frame sequence prediction, which aims to strengthen a temporal model's attention to temporal features. The technical scheme adopted by the invention is as follows: a depth forgery detection method based on video frame sequence prediction, characterized in that: a suspicious video is input into a trained temporal model, features of the suspicious video are extracted by the temporal model and fed into a real/fake classifier, and the real/fake classifier outputs the probability that the suspicious video is fake. Training the temporal model comprises: randomly scrambling the originally consecutive video frames of a video clip and recording the scrambling mode; inputting the scrambled video frames into the temporal model to extract features, which are fed simultaneously into a frame sequence classifier and the real/fake classifier; and computing a frame sequence prediction loss between the output of the frame sequence classifier and the scrambling mode, and a real/fake classification loss between the output of the real/fake classifier and the real/fake label of the video clip. The invention is applicable to the fields of machine learning and computer vision.

Description

Depth forgery detection method based on video frame sequence prediction
Technical Field
The invention relates to a depth forgery detection method based on video frame sequence prediction, and is applicable to the fields of machine learning and computer vision.
Background
In recent years, deep learning has developed rapidly and is widely used in the field of computer vision. On the one hand, deep learning has brought a new wave of artificial intelligence; on the other hand, the security problems it raises are attracting more and more attention. Image and video recognition based on deep learning is now applied to many aspects of daily life, such as intelligent supervision of network content, automatic analysis of video surveillance, face-recognition access control, and facial-recognition payment. In these critical application areas, the reliability and security of information and data must be emphasized and guaranteed.
Since 2017, false images and videos generated with deep forgery (deepfake) technology have attracted wide attention on the Internet, and when deepfakes target highly influential people, the fabricated content spreads with that person's influence. Non-consensual videos posted on Reddit's Deepfakes forum, in which the faces of performers in pornographic videos were replaced with the faces of celebrities, caused serious harm; in addition, the abundance of "one-click" face-swapping software makes fake videos ever easier to produce. False images and videos have become one of the most significant information and data security risks, and their detection and supervision face major challenges.
AI-synthesized fake faces pose a serious threat: they can fabricate videos of a target person doing or saying things with near-real facial expressions and body movements, subverting the notion that seeing is believing. Effective techniques for detecting fake face images and videos in the network environment are urgently needed, but the task is difficult, mainly because the forged region of a face-forged image is weak and local, so detection is highly susceptible to image noise. Moreover, the forged region is often unpredictable, since each forgery method targets different regions, which makes correctly detecting and classifying forged regions very difficult.
Most current deepfake video detection methods can be divided into image-level and video-level methods. Image-level methods can be roughly divided into methods based on image forgery defects, on improved network structures, on multi-feature fusion, and on other (auxiliary) tasks; video-level methods can be roughly divided into methods based on temporal physiological signals and methods based on temporal low-level forgery traces.
Methods based on image forgery defects. This mainstream approach attempts to detect forgeries by mining inconsistencies introduced into the face by operations such as scaling, rotation, and distortion during forgery. Li and Lyu proposed a CNN-based method that detects the lowered resolution of a tampered face by comparing the face region with the surrounding region. Li et al. proposed Face X-ray, which achieves better generalization by designing a face-contour mask that guides the model to focus on the contour region where forged areas are likely to occur.
Methods based on improved network structures. These methods aim to improve true/fake classification by modifying or improving the classification network. Afchar et al. proposed methods based on mesoscopic and steganalysis features, with two networks designed to focus on the mesoscopic properties of images: the Meso-4 network and a variant of it, MesoInception-4, which incorporates Inception modules. Nguyen et al. proposed a forgery-detection system based on capsule networks (Capsule Networks) with fewer parameters than conventional CNNs. Rossler et al., while constructing the FaceForensics++ dataset, evaluated five different detection methods: 1) a convolutional neural network using steganalysis features; 2) a convolutional neural network whose convolutional layers are specially designed to suppress high-level semantic information of images; 3) a convolutional neural network with a special global pooling layer that computes four statistics of the feature map (mean, variance, maximum, and minimum); 4) the MesoInception-4 network; and 5) a pre-trained XceptionNet. Training the last network, XceptionNet, in a data-driven manner gave the best results on the FaceForensics++ dataset.
Methods based on multi-feature fusion attempt to obtain more features from the image that can be used for forgery detection. Durall et al. tried to detect forgery with frequency-domain features: having found that forged images leave traces in the frequency domain, they applied classical frequency-domain analysis and classified with an SVM, obtaining good detection results with only a small number of labeled training samples. Qian et al. found that the difficulty of detecting compressed fake face images can be addressed by mining forgery patterns in the frequency-domain signal; their face forgery network (F3-Net) mines these patterns deeply through a two-stage collaborative learning framework and significantly outperforms other methods on the compressed FaceForensics++ dataset. Nirkin et al. found that a swapped face can be distinguished from its context, and built a two-branch network: one branch is a classification network taking the segmented face as input, the other takes the face context (e.g., hair, ears, neck) as input. The method detects forgery using the discrepancy between the features of the two branches.
Methods with other tasks attempt to improve detection with auxiliary tasks, or use such tasks directly for forgery detection. Nguyen et al. used multi-task learning to locate forged regions at the pixel level while classifying videos as real or fake; the authors used a Y-shaped decoder and three loss functions to constrain the network, hoping that valuable features would be shared among the tasks. Li et al. defined forgery detection as a pixel-level image segmentation task, using a fully convolutional network for feature extraction and binarizing the segmentation result to mark the forged region in the image.
The advantage of image-level forgery detection is that training and detection are very fast, and it is particularly effective when single-frame forgery traces are obvious. The disadvantage is that single-frame methods pay little attention to temporal cues and have difficulty locating locally forged, suspicious regions.
Methods based on temporal physiological signals emphasize the use of physiological signals inherent to humans for forgery detection. Agarwal and Farid proposed a fake-video detection method based on the observation that the facial expressions and head movements of a speaking person follow a unique pattern, a soft biometric model (Soft Biometric Models); they extracted these features with the OpenFace2 toolkit and classified them with an SVM. Heart-rate-related features are also used for forgery detection: Ortega et al. acquired them using remote photoplethysmography (rPPG). Inconsistency between mouth shape and speech is likewise used for forgery detection, and such methods obtain good results when audio information is available.
Methods based on temporal low-level forgery traces mainly use temporal networks to mine low-level forgery traces over time. A common basic form combines a CNN with an RNN: a convolutional neural network (CNN) extracts features from each single-frame image, and the RNN then aggregates them into video-level features for classification. 3D convolutional networks are also commonly used to mine temporal low-level forgery traces; their input is generally a sequence of consecutive face frames, and because 3D convolution also convolves over the temporal dimension, the extracted features can be used directly for forgery-detection classification with good results.
Video-level detection of tampering traces has the advantages that the model combines spatial and temporal features and can therefore reach better detection accuracy, and that methods based on the temporal characteristics of physiological phenomena have strong interpretability and generalization. Its disadvantages are that the number of model parameters grows and training becomes harder, and that when single-frame forgery traces are obvious, the model easily learns no temporally specific forgery traces at all.
At present, many methods in the field of deepfake detection use temporal models, which are trained mainly on a few public datasets. These datasets differ greatly from real usage scenarios: a single frame of a fake video in these datasets often contains abundant forgery traces, whereas the influential fake videos in the real Internet environment show few obvious single-frame traces. A temporal model trained under such conditions often fails to model temporal anomalies well and does not generalize to fake videos in real scenarios or to new forgery types, so a model trained on public datasets suffers a significant performance loss when deployed in a real scenario.
Disclosure of Invention
The technical problem to be solved by the invention is: in view of the above problems, to provide a depth forgery detection method based on video frame sequence prediction that strengthens the temporal model's attention to temporal features.
The technical scheme adopted by the invention is as follows: a depth forgery detection method based on video frame sequence prediction, characterized in that:
a suspicious video is input into a trained temporal model, features of the suspicious video are extracted by the temporal model and fed into a real/fake classifier, and the real/fake classifier outputs the probability that the suspicious video is fake;
training the temporal model comprises:
randomly scrambling the originally consecutive video frames of a video clip and recording the scrambling mode;
inputting the scrambled video frames into the temporal model to extract features, which are fed simultaneously into a frame sequence classifier and the real/fake classifier;
and computing a frame sequence prediction loss between the output of the frame sequence classifier and the scrambling mode, and a real/fake classification loss between the output of the real/fake classifier and the real/fake label of the video clip.
The real/fake classification loss uses a cross-entropy loss function.
The frame sequence prediction loss uses a cross-entropy loss function.
The overall model loss L weighs the real/fake classification loss against the frame sequence prediction loss through a hyper-parameter α:
α = max(0.5 - 0.1 × epoch, 0)   (Formula 2)
L = (1 - α) × loss1 + α × loss2   (Formula 3)
where epoch denotes the training epoch, loss1 the real/fake classification loss, and loss2 the frame sequence prediction loss.
A depth forgery detection device based on video frame sequence prediction, comprising:
a detection module, used to input a suspicious video into a trained temporal model, extract features of the suspicious video with the temporal model, feed the features into a real/fake classifier, and have the real/fake classifier output the probability that the suspicious video is fake;
a model training module, comprising:
a video frame sequence scrambling module, used to randomly scramble the originally consecutive video frames of a video clip and record the scrambling mode;
a multi-task training module, used to input the scrambled video frames into the temporal model to extract features, which are fed simultaneously into the frame sequence classifier and the real/fake classifier, and to compute a frame sequence prediction loss between the output of the frame sequence classifier and the scrambling mode, and a real/fake classification loss between the output of the real/fake classifier and the real/fake label of the video clip.
A storage medium having stored thereon a computer program executable by a processor, characterized in that: the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction.
A depth forgery detection apparatus based on video frame sequence prediction, having a memory and a processor, the memory storing a computer program executable by the processor, characterized in that: the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction.
The beneficial effects of the invention are as follows: the invention provides a method that trains a temporal model with two tasks, a video frame sequence prediction task and a forgery detection task. After such training, the model extracts more robust face-forgery-detection features and thus achieves better generalization.
The invention trains the model with the video frame sequence prediction task and the real/fake classification task together; the frame sequence prediction task forces the model to pay more attention to forgery traces at the temporal level and strengthens the model's attention to temporal features. Experiments show that training with this task on datasets with obvious single-frame forgery traces improves the model's generalization to datasets where single-frame forgery traces are not obvious.
Drawings
Fig. 1 is a block diagram of an embodiment.
Detailed Description
This embodiment is a depth forgery detection method based on video frame sequence prediction, which specifically comprises: inputting a suspicious video into a trained temporal model, extracting features of the suspicious video with the temporal model, feeding the features into a real/fake classifier, and having the real/fake classifier output the probability that the suspicious video is fake.
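As a minimal sketch of this detection step in PyTorch, under the assumption that the temporal model and the real/fake classifier are available as modules (the names temporal_model and real_fake_classifier, the tensor layout, and the convention that class index 1 means "fake" are illustrative, not taken from the patent):

import torch

@torch.no_grad()
def detect(frames, temporal_model, real_fake_classifier):
    # frames: (1, T, C, H, W) tensor of consecutive face frames from the suspicious video
    temporal_model.eval()
    real_fake_classifier.eval()
    features = temporal_model(frames)            # temporal features of the clip
    logits = real_fake_classifier(features)      # 2-way real/fake logits
    return torch.softmax(logits, dim=-1)[0, 1]   # probability that the clip is fake (class 1 = fake)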
Training of the temporal model in this embodiment includes:
s1, randomly scrambling video frames of an input video clip by a video frame sequence scrambling module, randomly extracting continuous 4-frame images in the video for the input video, and then scrambling the video frame sequence by using one of candidate 12 scrambling modes (including the unscrambling situation).
For a normal frame-ordered video, the embodiment randomly selects one of the predefined 12 scrambling modes, and tags 0 to 11 corresponding to the 12 scrambling modes are used as additional tags of the data, and the auxiliary task is a 12-classification task, namely, a requirement model can predict in which mode the data is scrambled according to the input disordered frames.
For the case of 4 frames, the case of not distinguishing the positive order from the reverse order is 24 kinds of scrambling patterns in total, because the total number of arrangement of 4 frames is 24 kinds, but because the case of removing the reverse order is only 12 kinds of scrambling patterns. The main reason for removing the reverse order is that the difference between the reverse order and the positive order of the face video is not large, and when the model acquires a disordered video, the model can be considered to be disordered in a reverse order or a positive order in a certain way, so that the model is prevented from considering that the video is obtained in a disordered way, and the disordered way of the reverse order is removed.
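A sketch of how the 12 scrambling modes can be enumerated (the patent does not specify the enumeration order, so the canonical choice below, keeping whichever of a permutation and its reversal compares smaller, is an assumption; any fixed convention works):

from itertools import permutations

def scramble_modes(n_frames=4):
    # Keep one representative of each {p, reversed(p)} pair of permutations.
    return [p for p in permutations(range(n_frames))
            if p <= tuple(reversed(p))]

MODES = scramble_modes()                            # 24 / 2 = 12 modes, labels 0..11
assert len(MODES) == 12 and (0, 1, 2, 3) in MODES   # includes the unscrambled case

# Usage on a clip tensor of shape (T, C, H, W): scrambled = clip[list(MODES[k])]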
The video frame sequence scrambling module provides the multi-task training module with the frame-scrambled video clips, the labels of the corresponding scrambling modes, and the real/fake labels of the original video clips.
S2, the multi-task training module inputs the frame-scrambled video clips into the temporal model, which extracts features; the features are fed simultaneously into the frame sequence classifier and the real/fake classifier, which perform the video frame sequence prediction task and the real/fake classification task respectively. The frame sequence prediction loss is computed between the output of the frame sequence classifier and the scrambling mode, and the real/fake classification loss is computed between the output of the real/fake classifier and the real/fake label of the video clip.
The model in this embodiment is constrained by multiple loss functions; both the real/fake classification loss and the frame sequence prediction loss use cross entropy, given by Formula 1, where y_i denotes the true label of the i-th sample and a_i the prediction for the i-th sample:

L_CE = -Σ_i y_i × log(a_i)   (Formula 1)
The overall loss L of the model weighs the real/fake classification loss against the frame sequence prediction loss through a hyper-parameter α, generated and applied as shown in Formula 2 and Formula 3, where epoch denotes the training epoch, loss1 the real/fake classification loss, and loss2 the frame sequence prediction loss. The hyper-parameter α reaches 0 after 5 epochs of training, which means the model uses the auxiliary task during the first 5 epochs; the later epochs are supervised by the real/fake classification loss alone.
α = max(0.5 - 0.1 × epoch, 0)   (Formula 2)
L = (1 - α) × loss1 + α × loss2   (Formula 3)
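A minimal PyTorch sketch of Formulas 1 to 3 (the function and argument names are illustrative; F.cross_entropy combines the softmax and the log term of Formula 1):

import torch.nn.functional as F

def alpha_schedule(epoch):
    # Formula 2: alpha decays from 0.5 and reaches 0 at epoch 5.
    return max(0.5 - 0.1 * epoch, 0.0)

def total_loss(rf_logits, rf_labels, order_logits, order_labels, epoch):
    loss1 = F.cross_entropy(rf_logits, rf_labels)        # real/fake loss (Formula 1)
    loss2 = F.cross_entropy(order_logits, order_labels)  # frame sequence prediction loss
    a = alpha_schedule(epoch)
    return (1 - a) * loss1 + a * loss2                   # Formula 3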
In this embodiment, 3 temporal models were used for comparison: CNN+LSTM, CNN+Transformer, and R3D. The CNN+LSTM, CNN+Transformer, and R3D models used ResNet18 backbones pre-trained on the ImageNet dataset, and the single-frame images of each video clip were scaled to 299 × 299. For easier comparison, no data augmentation was used. The whole network was trained with the SGD optimizer for 10 epochs in total, with the initial learning rate set to 0.01 and decayed to 0.1 of its value every 2 epochs. The loss function uses cross entropy. Single-frame images of the test-phase video clips were also scaled to 299 × 299. The model was implemented in the PyTorch framework using one Tesla V100 GPU.
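A sketch of the CNN+LSTM variant with the two classifier heads (the hidden size, the use of the last LSTM step as the clip feature, and the head widths are illustrative assumptions; the patent only states that an ImageNet-pretrained ResNet18 backbone is used):

import torch.nn as nn
from torchvision.models import resnet18

class CnnLstmTemporalModel(nn.Module):
    def __init__(self, hidden=512, n_modes=12):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")                # ImageNet-pretrained
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.real_fake_head = nn.Linear(hidden, 2)                  # real/fake classifier
        self.frame_order_head = nn.Linear(hidden, n_modes)          # frame sequence classifier

    def forward(self, frames):                                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1)               # (B*T, 512) per-frame features
        seq, _ = self.lstm(f.view(b, t, -1))                        # (B, T, hidden)
        clip = seq[:, -1]                                           # clip-level feature
        return self.real_fake_head(clip), self.frame_order_head(clip)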
To verify the effectiveness of the method of the invention, this embodiment conducts experiments on the DFD, Celeb-DF, and FaceForensics++ datasets.
The DeepFake Detection (DFD) dataset was constructed by Google in 2019: 363 source videos were recorded in cooperation with paid, consenting actors, and 3068 deepfake videos were created from them. The dataset is currently hosted with FaceForensics++ and can be downloaded from its homepage.
The Celeb-DF dataset provides fake videos of visual quality similar to videos popular on the Internet. They are generated by an improved version of a public deepfake generation algorithm, which alleviates the low facial resolution and color inconsistency of earlier fakes. The dataset contains 408 real videos and 795 synthesized fake videos.
In the dataset comparisons, model effects are compared mainly with the AUC metric, while sensitivity, specificity, and ACC at a 0.5 threshold are used for auxiliary analysis. The experiments first test model accuracy, then generalization, and finally give a visual presentation.
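A sketch of these metrics (assuming scikit-learn for the AUC and treating "fake" as the positive class, so that specificity is the recall of real videos, consistent with the analysis below):

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, p_fake, thr=0.5):
    y_true = np.asarray(y_true)                      # 1 = fake, 0 = real
    y_pred = (np.asarray(p_fake) >= thr).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {"AUC": roc_auc_score(y_true, p_fake),
            "sensitivity": tp / (tp + fn),           # recall of fake videos
            "specificity": tn / (tn + fp),           # recall of real videos
            "ACC": (tp + tn) / len(y_true)}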
The FaceForensics++ dataset contains 1000 real videos crawled from YouTube and 1000 fake videos for each tampering category. The dataset essentially guarantees that every frame contains a face and only one face; videos with heavily occluded faces were filtered out manually, and essentially all frames were tampered successfully to guarantee video quality, which reduces label-noise interference in supervised tasks. The dataset contains data synthesized by several different forgery methods, which helps reveal differences between algorithms, and it is continually updated with videos generated by new forgery algorithms as forgery techniques develop.
Table 1. Detection results of the temporal models on different datasets
As shown in Table 1, the results of training the temporal models, and the models with the added frame sequence prediction task, on the DFD and FaceForensics++ datasets are listed. On the DFD dataset, the R3D model performed best on AUC, at 96.4%, 12.0% higher than the baseline model; it can also be observed that the overall AUC differences among the temporal models are small, indicating that temporal models as a whole outperform the single-frame model on this dataset. On sensitivity, the differences between models are small. On specificity, the R3D_shuffle model achieved the best result, 25.32% higher than the lowest (the Resnet18 network), and the models trained with the auxiliary task show higher specificity than the corresponding models trained without it. On ACC, the R3D_shuffle model classified best, 6.06% higher than the baseline model, showing the better prediction at the 0.5 threshold.
The experimental results on the FaceForensics++ dataset are broadly similar to those on DFD: R3D performs best on AUC, at 98.14%, 26.74% higher than the baseline model, and the AUC differences among the temporal models are small, again indicating that the temporal models differ little from each other and as a whole outperform the single-frame model. On sensitivity, the differences between the temporal models are small but all are better than the single-frame model; on specificity, the R3D_shuffle model achieved the best result of 88.93%, and the models using the auxiliary task again show higher specificity than the corresponding models without it. The R3D model obtained the best ACC, showing the best prediction at the 0.5 threshold.
Overall, the accuracy experiments show that 1) the temporal models outperform the single-frame model, and 2) the auxiliary task improves model specificity to some extent, i.e. it improves the recall of real videos.
The generalization test results are shown in Table 2. In this embodiment, 3 datasets were used for the generalization tests: models were trained on the DFD and FaceForensics++ datasets respectively and tested on the Celeb-DF dataset.
TABLE 2 comparison of generalized results for different face divisions
For models trained on DFD and tested on Celeb-DF: the temporal models trained with the frame sequence prediction auxiliary task all outperform their counterparts trained without it; trained with this method, the Resnet18_lstm, Resnet18_transformer, and R3D networks gained 4.85%, 2.39%, and 0.63% in generalization performance respectively. On sensitivity, the temporal models differ little; the Resnet18_lstm model achieved the best result of 99.25%, while the single-frame model performed poorly. Resnet18_transformer_shuffle works best on the specificity and accuracy metrics, indicating that this model performs best at the 0.5 threshold.
For models trained on FaceForensics++ and tested on Celeb-DF: except for the R3D convolutional model, the models gained from training with this auxiliary task, Resnet18_lstm and Resnet18_transformer improving by 3.23% and 3.32% respectively, and all temporal models achieved better AUC results than the single-frame model. On the sensitivity metric, the single-frame model works best, probably because single-frame models tend to predict samples as fake. Among the temporal models, Resnet18_transformer_shuffle performs well and R3D_shuffle performs poorly on sensitivity. R3D_shuffle achieves the best specificity and accuracy, indicating that this model classifies best at the 0.5 threshold.
From these experimental results, the following can be observed: 1) the temporal models generalize better than the single-frame model; 2) the video frame sequence prediction auxiliary task proposed in this embodiment strengthens the temporal model's attention to temporal features.
This embodiment also provides a depth forgery detection device based on video frame sequence prediction, comprising a detection module and a model training module, where the model training module comprises a video frame sequence scrambling module and a multi-task training module.
The detection module in this example is used to input a suspicious video into the trained temporal model, extract features of the suspicious video with the temporal model, feed the features into the real/fake classifier, and have the real/fake classifier output the probability that the suspicious video is fake; the video frame sequence scrambling module is used to randomly scramble the originally consecutive video frames of a video clip and record the scrambling mode; the multi-task training module is used to input the scrambled video frames into the temporal model to extract features, which are fed simultaneously into the frame sequence classifier and the real/fake classifier.
This embodiment also provides a storage medium having stored thereon a computer program executable by a processor; the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction in this example.
This embodiment also provides a depth forgery detection apparatus based on video frame sequence prediction, which has a memory and a processor, the memory storing a computer program executable by the processor; the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction in this example.

Claims (6)

1. A depth forgery detection method based on video frame sequence prediction, characterized in that:
a suspicious video is input into a trained temporal model, features of the suspicious video are extracted by the temporal model and fed into a real/fake classifier, and the real/fake classifier outputs the probability that the suspicious video is fake;
training the temporal model comprises:
randomly scrambling the originally consecutive video frames of a video clip and recording the scrambling mode;
inputting the scrambled video frames into the temporal model to extract features, which are fed simultaneously into a frame sequence classifier and the real/fake classifier;
computing a frame sequence prediction loss between the output of the frame sequence classifier and the scrambling mode, and a real/fake classification loss between the output of the real/fake classifier and the real/fake label of the video clip;
wherein the overall model loss L weighs the real/fake classification loss against the frame sequence prediction loss through a hyper-parameter α:
α = max(0.5 - 0.1 × epoch, 0)   (Formula 2)
L = (1 - α) × loss1 + α × loss2   (Formula 3)
where epoch denotes the training epoch, loss1 the real/fake classification loss, and loss2 the frame sequence prediction loss.
2. The depth forgery detection method based on video frame sequence prediction according to claim 1, characterized in that: the real/fake classification loss uses a cross-entropy loss function.
3. The depth forgery detection method based on video frame sequence prediction according to claim 1, characterized in that: the frame sequence prediction loss uses a cross-entropy loss function.
4. A depth forgery detection device based on video frame sequence prediction, characterized by comprising:
a detection module, used to input a suspicious video into a trained temporal model, extract features of the suspicious video with the temporal model, feed the features into a real/fake classifier, and have the real/fake classifier output the probability that the suspicious video is fake;
a model training module, comprising:
a video frame sequence scrambling module, used to randomly scramble the originally consecutive video frames of a video clip and record the scrambling mode;
a multi-task training module, used to input the scrambled video frames into the temporal model to extract features, which are fed simultaneously into the frame sequence classifier and the real/fake classifier, and to compute a frame sequence prediction loss between the output of the frame sequence classifier and the scrambling mode, and a real/fake classification loss between the output of the real/fake classifier and the real/fake label of the video clip;
wherein the overall model loss L weighs the real/fake classification loss against the frame sequence prediction loss through a hyper-parameter α:
α = max(0.5 - 0.1 × epoch, 0)   (Formula 2)
L = (1 - α) × loss1 + α × loss2   (Formula 3)
where epoch denotes the training epoch, loss1 the real/fake classification loss, and loss2 the frame sequence prediction loss.
5. A storage medium having stored thereon a computer program executable by a processor, characterized in that: the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction of any one of claims 1 to 3.
6. A depth forgery detection apparatus based on video frame sequence prediction, having a memory and a processor, the memory storing a computer program executable by the processor, characterized in that: the computer program, when executed, implements the steps of the depth forgery detection method based on video frame sequence prediction of any one of claims 1 to 3.
CN202111265016.1A 2021-10-28 2021-10-28 Depth forgery detection method based on video frame sequence prediction Active CN113989713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265016.1A CN113989713B (en) 2021-10-28 2021-10-28 Depth forgery detection method based on video frame sequence prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111265016.1A CN113989713B (en) 2021-10-28 2021-10-28 Depth forgery detection method based on video frame sequence prediction

Publications (2)

Publication Number Publication Date
CN113989713A CN113989713A (en) 2022-01-28
CN113989713B true CN113989713B (en) 2023-05-12

Family

ID=79743674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265016.1A Active CN113989713B (en) 2021-10-28 2021-10-28 Depth forgery detection method based on video frame sequence prediction

Country Status (1)

Country Link
CN (1) CN113989713B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482595B (en) * 2022-09-27 2023-04-07 北京邮电大学 Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
CN117557947B (en) * 2024-01-11 2024-04-12 湖北微模式科技发展有限公司 Static scene video authenticity identification method and device based on mean square error

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810725B1 (en) * 2018-12-07 2020-10-20 Facebook, Inc. Automated detection of tampered images
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN112183501A (en) * 2020-11-27 2021-01-05 北京智源人工智能研究院 Depth counterfeit image detection method and device
CN113011357A (en) * 2021-03-26 2021-06-22 西安电子科技大学 Depth fake face video positioning method based on space-time fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929677B1 (en) * 2019-08-07 2021-02-23 Zerofox, Inc. Methods and systems for detecting deepfakes
CN111797702A (en) * 2020-06-11 2020-10-20 南京信息工程大学 Face counterfeit video detection method based on spatial local binary pattern and optical flow gradient
CN112163493A (en) * 2020-09-21 2021-01-01 中国科学院信息工程研究所 Video false face detection method and electronic device
CN112488013B (en) * 2020-12-04 2022-09-02 重庆邮电大学 Depth-forged video detection method and system based on time sequence inconsistency

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810725B1 (en) * 2018-12-07 2020-10-20 Facebook, Inc. Automated detection of tampered images
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN112183501A (en) * 2020-11-27 2021-01-05 北京智源人工智能研究院 Depth counterfeit image detection method and device
CN113011357A (en) * 2021-03-26 2021-06-22 西安电子科技大学 Depth fake face video positioning method based on space-time fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Peng Qi et al. Exploiting Multi-domain Visual Information for Fake News Detection. 2019 IEEE International Conference on Data Mining (ICDM), 2019, 518-527. *

Also Published As

Publication number Publication date
CN113989713A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
Lin et al. Face liveness detection by rppg features and contextual patch-based cnn
Nirkin et al. Deepfake detection based on the discrepancy between the face and its context
CN113989713B (en) Depth forgery detection method based on video frame sequence prediction
CN113537027B (en) Face depth counterfeiting detection method and system based on face division
Liang et al. Exploring disentangled content information for face forgery detection
Li et al. Artifacts-disentangled adversarial learning for deepfake detection
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN112861671B (en) Method for identifying deeply forged face image and video
Zhang et al. A survey on face anti-spoofing algorithms
Roy et al. 3D CNN architectures and attention mechanisms for deepfake detection
Rahman et al. A qualitative survey on deep learning based deep fake video creation and detection method
Peng et al. BDC-GAN: Bidirectional conversion between computer-generated and natural facial images for anti-forensics
Guo et al. A data augmentation framework by mining structured features for fake face image detection
Gragnaniello et al. Biometric spoofing detection by a domain-aware convolutional neural network
Khan et al. Adversarially robust deepfake media detection using fused convolutional neural network predictions
Kim et al. Suppressing spoof-irrelevant factors for domain-agnostic face anti-spoofing
Alnaim et al. DFFMD: a deepfake face mask dataset for infectious disease era with deepfake detection algorithms
Li et al. Exploiting facial symmetry to expose deepfakes
Ilyas et al. Deepfakes examiner: An end-to-end deep learning model for deepfakes videos detection
Luo et al. Dual attention network approaches to face forgery video detection
Boutadjine et al. A comprehensive study on multimedia DeepFakes
CN117095471A (en) Face counterfeiting tracing method based on multi-scale characteristics
Dosi et al. Seg-dgdnet: Segmentation based disguise guided dropout network for low resolution face recognition
Chaudhary et al. A comparative analysis of deep fake techniques
Bikku et al. Deep Residual Learning for Unmasking DeepFake

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Cao Juan

Inventor after: Xie Tian

Inventor after: Guo Chenyang

Inventor before: Cao Juan

Inventor before: Fang Lingfei

Inventor before: Xie Tian

Inventor before: Li Jintao

GR01 Patent grant