CN112801037A - Face tampering detection method based on continuous inter-frame difference - Google Patents

Face tampering detection method based on continuous inter-frame difference

Info

Publication number
CN112801037A
Authority
CN
China
Prior art keywords
layer
image
face
axis
stride
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110222677.XA
Other languages
Chinese (zh)
Inventor
房志峰
吴剑
冯凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG UNIVERSITY OF POLITICAL SCIENCE AND LAW
Original Assignee
SHANDONG UNIVERSITY OF POLITICAL SCIENCE AND LAW
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG UNIVERSITY OF POLITICAL SCIENCE AND LAW filed Critical SHANDONG UNIVERSITY OF POLITICAL SCIENCE AND LAW
Priority to CN202110222677.XA priority Critical patent/CN112801037A/en
Publication of CN112801037A publication Critical patent/CN112801037A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting face tampering in video based on continuous inter-frame difference, built with deep learning technology and a twin (Siamese) neural network. Compared with traditional detection methods, the method exploits the trace features left at the image-fusion edges of a tampered video together with the temporal relationship between adjacent frames, and feeds two adjacent frames into the twin neural network for comparison; it can quickly and effectively detect video tampering and has strong robustness. The method can rapidly and accurately perform tamper detection on recorded video and has important practical application value.

Description

Face tampering detection method based on continuous inter-frame difference
Technical Field
The invention relates to the technical field of computer vision and artificial intelligence, in particular to a video tampering detection method based on continuous interframe difference.
Background
In recent years, with the rapid development and wide application of technologies such as the Internet and 5G, and especially the popularization of mobile video devices such as mobile phones, people can shoot high-definition images and videos anytime and anywhere and publish them to the Internet. This greatly facilitates the rapid spread of information in many application scenarios. On the other hand, however, it also brings serious negative effects: for example, someone may falsify a captured video by face swapping and then release it to the Internet. If tampered videos cannot be screened quickly and effectively, serious negative social consequences can result. Therefore, methods for detecting tampered video have very important practical significance and research value.
A related application, "Method and system for passively detecting AI face-changing videos based on joint features" (application number CN2020102796459), was filed by Shanghai Jiao Tong University on April 10, 2020. Its specific content includes: performing statistical analysis on the decoded data of tampered videos to find globally identifiable features; judging videos without tampering traces in the coding as untampered, thereby obtaining the tampered videos; de-framing the tampered video to obtain a sequence of consecutive video frames; detecting and identifying the face position in each video frame and segmenting the face region together with a small surrounding background region to form a sequence of face-region pictures; extracting texture information from each picture in the sequence to obtain intra-frame consistency information; performing sequence analysis on the intra-frame consistency feature sequence to examine the inter-frame consistency features of the video; and judging whether the face has been tampered with by AI face changing by combining the intra-frame and inter-frame consistency features. That invention contributes greatly to advancing the prior art.
Video tamper detection, as a popular area of recent research, has received increasing attention from researchers. H. Cao and A. Kot, in the 2009 article "Accurate detection of demosaicing regularity for digital image forensics" in IEEE Transactions on Information Forensics and Security, and Ferrara P. et al., in 2012 in IEEE Transactions on Information Forensics and Security, used detection methods based on the Color Filter Array (CFA) for tamper detection. C. Hsu et al., in "Video forgery detection using correlation of noise residue" published at the 2008 IEEE International Workshop on Multimedia Signal Processing, and P. Mullan et al., in a 2017 paper at the IEEE International Conference on Image Processing, performed tamper detection by analyzing the noise residual of successive frames in the video to extract features of the tampering traces. J. Lukas et al., in the 2006 Proc. SPIE paper "Detecting digital image forgeries using sensor pattern noise", and S. Chakraborty et al., in 2017 at the International Symposium on Electronic Imaging, Media Watermarking, Security, and Forensics, used detection methods based on photo-response non-uniformity (PRNU) noise, and so on.
These methods have a good detection effect on lightly compressed video, but most videos on the Internet are highly compressed, and when these methods are applied the tampering traces are difficult to detect. Therefore, these detection methods are not suitable for detecting face-tampered video such as Face2Face and DeepFake.
Matern et al., in "Exploiting visual artifacts to expose deepfakes and face manipulations" at the 2019 IEEE WACV Workshop on Image and Video Forensics, performed forgery detection using some deficiencies of the DeepFake algorithm itself. For example, in a GAN-generated face the colors of the left and right eyes may not match, and other forms of asymmetry may appear, such as an earring on only one side or ears with distinctly different characteristics; forged video also commonly exhibits implausible specular reflections in the eyes, appearing as missing reflections or as white blobs, and teeth are often rendered roughly as a single white blob, and so on.
Li et al., in "In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking" at the 2018 IEEE Workshop on Information Forensics and Security, performed detection by detecting eye-blinking behavior. Human blinks have a specific frequency and duration, and this phenomenon is not easily replicated in DeepFake videos. That paper proposes a solution based on a long-term recurrent network that processes only the eye sequences to capture the temporal inconsistency of blinking behavior in DeepFake video. However, as the face generation networks used by DeepFake continue to evolve, the inconsistency of the output facial feature points gradually weakens until it disappears, at which point the detection performance of the method in that paper drops sharply.
Wang et al., in the preprint "FakeSpotter: A Simple yet Robust Baseline for Spotting AI-Synthesized Fake Faces" published on arXiv in 2019, used the disparity in the distribution of color components in the HSV and YCbCr channels to distinguish tampered images: although a deep network generates images in the RGB color space without any constraint on color correlation, converting them to other spaces (HSV and YCbCr) exposes a disparity.
Li et al., in "Detection of deep network generated images using disparities in color components" published on arXiv in 2018, used the inconsistency between GAN-generated forged face images and the distribution of feature points of real faces to distinguish tampered images. The above works perform tamper detection only on image frames extracted from the video and do not take the temporal relationship between video frames into account. By contrast, in the paper "Face tampering video detection method based on inter-frame difference" published in the Journal of Cyber Security in March 2020, Zhang Yixue et al. first adopted a traditional statistical detection method based on Local Binary Pattern (LBP)/Histogram of Oriented Gradients (HOG) features, then proposed a method based on a twin neural network to compare the differences between video frames, and adopted several discriminators to achieve a better effect. However, each detection method mentioned in that paper is applied on its own during detection, serving only for mutual comparison rather than being applied comprehensively. In addition, when the training set and the test set used in that method are generated by different GAN generation models, the accuracy of image tampering detection is significantly lower, indicating that the generalization ability of the method is not strong.
Therefore, there is an urgent need for those skilled in the art to solve the above problems.
Disclosure of Invention
The invention aims to solve the defects in the prior art, and provides a method for detecting face tampering in a video based on continuous interframe difference.
Compared with the traditional detection method, the method utilizes the trace characteristics left by the image fusion edge of the tampered video and the time sequence relation of the adjacent frames in the video, inputs the two adjacent frames into the twin neural network for comparison, can quickly and effectively carry out tampering detection on the video, and has strong robustness.
To achieve the above purpose, the invention adopts the following technical scheme:
A video tampering detection method based on continuous inter-frame difference comprises the following steps:
Step one: constructing a deep-learning discriminator;
a deep-learning discriminator suitable for facial feature extraction is constructed on the basis of a convolutional neural network; the discriminator consists of 6 batch normalization layers BN, 6 convolutional layers, 4 pooling layers and 1 Flatten layer; the input of the discriminator is a cropped 128 × 100 three-channel color face image (128 × 100 × 3), to which simple normalization is applied;
Step two: constructing the loss function for network training; in the face tampering detection method based on continuous inter-frame difference, a contrastive loss function is adopted as the loss function of the twin neural network;
Step three: constructing a twin neural network suitable for detection based on the discriminator; the discriminator is arranged as a two-branch input, two adjacent frame images each entering one branch of the discriminator to form two (384 × 1) outputs, i.e. a (768 × 1) vector; on this basis the data is fed into a fully connected layer Dense, and the final output is the judgment result;
Step four: decomposing the video to be detected into a frame sequence, constructing a face locator, and identifying and extracting a face image from each frame;
Step five: extracting features from the cropped faces for representation; the face data extracted in the previous step is input into a FaceNet network, and the embedding feature vector is calculated to obtain the feature representation of the face;
Step six: training the twin neural network;
Step seven: detecting video tampering with the trained network.
Preferably, in step one a deep-learning discriminator is constructed; its input is a 128 × 100 three-channel color face image (128 × 100 × 3) cropped frame by frame from the video, and the input is given a simple normalization. The specific structure of the discriminator is described as follows:
a batch normalization layer BN with default attribute values is used as the first layer of the discriminator to standardize the input data, speed up training and convergence of the network, and prevent overfitting;
the second layer is a convolutional layer with 32 channels, a 5 × 5 kernel and stride 1; the ReLU activation function changes the image to (124 × 96 × 32);
the third layer adds a batch normalization layer BN: because the output of the convolutional layer follows a symmetric, non-sparse, approximately Gaussian distribution, normalizing it produces a more stable distribution;
the fourth layer is a convolutional layer with 32 channels, a 3 × 3 kernel and stride 1; the ReLU activation function changes the image to (122 × 94 × 32);
the fifth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The sixth layer is a convolutional layer with 32 channels, a 3 × 3 kernel and stride 1; the ReLU activation function changes the image to (120 × 92 × 32);
the seventh layer uses a max pooling layer with stride 1 and pool_size (2 × 2) to change the input image to (60 × 46 × 32), further reducing the dimensionality of the information extracted by the sixth convolutional layer, reducing the amount of computation, and reducing the sensitivity of the convolutional layers to position;
the eighth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The ninth layer is a convolutional layer with 16 channels, a 3 × 3 kernel and stride 1; the ReLU activation function changes the image to (58 × 44 × 16);
the tenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) to change the input image to (29 × 22 × 16), further reducing the dimensionality of the information extracted by the ninth convolutional layer, reducing the amount of computation, and reducing the sensitivity of the convolutional layers to position;
the eleventh layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The input data is thereby standardized, which speeds up training and convergence of the network and prevents overfitting;
the twelfth layer is a convolutional layer with 16 channels, a 3 × 3 kernel and stride 1; the ReLU activation function changes the image to (27 × 20 × 16);
the thirteenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) to change the input image to (14 × 10 × 16), further reducing the dimensionality of the information extracted by the twelfth convolutional layer;
the fourteenth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The fifteenth layer is a convolutional layer with 16 channels, a 3 × 3 kernel and stride 1; the ReLU activation function changes the image to (12 × 8 × 16);
the sixteenth layer uses a max pooling layer with stride 1 and pool_size (2 × 2) to change the input image to (6 × 4 × 16), further reducing the dimensionality of the information extracted by the fifteenth convolutional layer;
the seventeenth layer flattens the input data to (384 × 1) using a Flatten layer.
Preferably, in the loss function for network training constructed in step two, the contrastive loss function (Contrastive Loss) serves as a dimensionality-reduction learning method that learns a mapping under which similar samples in a high-dimensional space remain similar after being mapped to a low-dimensional space; the contrastive loss function expresses the matching degree of paired samples well and can effectively handle the relationship of paired data in the twin neural network;
the contrastive loss function L is given by:

L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y D_w^2 + (1 - Y) \max(m - D_w, 0)^2 \right]    (1)

wherein:

D_w(X_1, X_2) = \left\| X_1 - X_2 \right\|_2 = \left( \sum_{p=1}^{P} (X_1^p - X_2^p)^2 \right)^{1/2}    (2)

here D_w(X_1, X_2) denotes the Euclidean distance between the two sample features X_1 and X_2;
Y is a label indicating whether the two samples match, where Y = 1 indicates that the two samples are similar or matched and Y = 0 indicates that they do not match;
m is a manually set margin threshold;
N is the number of samples;
P is the dimension of the sample features;
it can be seen from formula (1) that the contrastive loss function expresses the matching degree of paired samples well and is also well suited to training the feature-extraction model.
Preferably, the fourth step is to decompose the video to be detected into a frame sequence, construct a face locator, and identify and extract a face image from each frame, which is specifically as follows:
compared with the traditional method, the MTCNN method based on the deep convolutional neural network has better performance and can more accurately position the face, and in addition, the MTCNN can also realize real-time detection;
firstly, the picture is scaled to different sizes according to different scaling ratios to form an image pyramid; PNet mainly obtains candidate windows and bounding-box regression vectors for the face region; the bounding boxes are used for regression to calibrate the candidate windows, and highly overlapping candidate boxes are then merged by non-maximum suppression (NMS); RNet takes the candidate boxes that pass PNet, trains them in the RNet network, fine-tunes the candidate windows using the bounding-box regression values, and removes overlapping windows with NMS; ONet functions similarly to RNet, except that it additionally outputs five facial landmark positions while removing overlapping candidate windows;
the network structure of PNet is a fully convolutional neural network, and the input of the training network is a 12 × 12 picture, so training data for the PNet network must be generated before training; the training data are generated as a series of bounding boxes by computing the IoU with the ground-truth box; they can be obtained by sliding windows or random sampling and are divided into positive samples, negative samples and intermediate samples, where a positive sample is a generated sliding window whose IoU with the ground-truth box is greater than 0.65, a negative sample has IoU less than 0.3, and an intermediate sample has IoU greater than 0.4 and less than 0.65; the bounding boxes are then resized to 12 × 12 pictures and converted into 12 × 12 × 3 structures, producing the training data of the PNet network; the training data pass through 10 convolution kernels of 3 × 3 × 3 and a 2 × 2 max pooling (stride = 2) operation, generating 10 feature maps of 5 × 5; then 16 convolution kernels of 3 × 3 × 10 generate 16 feature maps of 3 × 3; then 32 convolution kernels of 3 × 3 × 16 generate 32 feature maps of 1 × 1; finally, from the 32 feature maps of 1 × 1, 2 convolution kernels of 1 × 1 × 32 generate 2 feature maps of 1 × 1 for classification, 4 convolution kernels of 1 × 1 × 32 generate 4 feature maps of 1 × 1 for bounding-box regression, and 10 convolution kernels of 1 × 1 × 32 generate 10 feature maps of 1 × 1 for the face contour points;
the RNet model takes 24 × 24 pictures as input and generates 28 feature maps of 11 × 11 using 28 convolution kernels of 3 × 3 × 3 and 3 × 3 max pooling (stride = 2); after 48 convolution kernels of 3 × 3 × 28 and 3 × 3 max pooling (stride = 2) it generates 48 feature maps of 4 × 4; after 64 convolution kernels of 2 × 2 × 48 it generates 64 feature maps of 3 × 3; the 3 × 3 × 64 feature maps are converted into a fully connected layer of size 128; the regression-box classification problem is converted into a fully connected layer of size 2; the bounding-box position regression problem is converted into a fully connected layer of size 4; and the face contour key points are converted into a fully connected layer of size 10;
ONet is the last network in MTCNN and produces the final output of the network; the training data for ONet are generated similarly to RNet, the detection data being the bounding boxes detected after the picture passes through the PNet and RNet networks, including positive samples, negative samples and intermediate samples; the ONet model input is a 48 × 48 × 3 picture, which is converted into 32 feature maps of 23 × 23 after 32 convolution kernels of 3 × 3 × 3 and 3 × 3 max pooling (stride = 2); after 64 convolution kernels of 3 × 3 × 32 and 3 × 3 max pooling (stride = 2) it is converted into 64 feature maps of 10 × 10; after 64 convolution kernels of 3 × 3 × 64 and 3 × 3 max pooling (stride = 2) it is converted into 64 feature maps of 4 × 4; 128 convolution kernels of 2 × 2 × 64 convert it into 128 feature maps of 3 × 3; a fully connected operation converts this into a fully connected layer of size 256; finally, regression-box classification features of size 2, regression-box position regression features of size 4, and face contour position regression features of size 10 are generated;
the basic structure of the MTCNN network is: predicted bounding boxes are generated from the original picture by PNet; the original picture and the bounding boxes generated by PNet are input to RNet, which generates corrected bounding boxes; the original picture and the bounding boxes generated by RNet are input to ONet, which generates corrected bounding boxes and the face contour key points.
The implementation process is as follows:
1. First, read in the picture to be detected: image = cv2.imread(image_path)
2. Load the trained model parameters and construct the detection object: detector = MtcnnDetector(...)
3. Perform the inference operation: all_boxes, landmarks = detector.detect_face(image)
4. Draw the target box: cv2.rectangle(image, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), (0, 0, 255)).
In summary, owing to the adoption of the above technical scheme, the invention has the following beneficial effects: compared with traditional detection methods, the method utilizes the trace features left at the image-fusion edges of a tampered video and the temporal relationship between adjacent frames, and inputs two adjacent frames into the twin neural network for comparison; it can quickly and effectively perform tamper detection on video, is distinctive, has strong robustness, and is suitable for popularization and application.
Drawings
FIG. 1 is a diagram showing the construction of a deep learning discriminator according to the present invention;
FIG. 2 is a basic structural diagram of an MTCNN network according to the present invention;
FIG. 3 is a partial face presentation in the Celeb-DF dataset according to the invention;
FIG. 4 is a comparison graph of normal distribution curves of distances between real faces and between a real face and a forged face of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in the drawings, for better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings, and the specific embodiments described herein are only for explaining the present invention, but the embodiments of the present invention are not limited thereto.
Example 1
A video tampering detection method based on continuous interframe difference comprises the following steps:
s1, constructing a deep learning discriminator;
A deep-learning discriminator suitable for facial feature extraction is constructed on the basis of a convolutional neural network; the discriminator consists of 6 batch normalization (BN) layers, 6 convolutional layers, 4 pooling layers and 1 Flatten layer. The input of the discriminator is a cropped 128 × 100 three-channel color face image (128 × 100 × 3; the three parameters denote the width, height and number of channels of the image, with width and height in pixels), to which a simple normalization is applied. Next, a batch normalization layer BN with default attribute values is used as the first layer of the discriminator to standardize the input data, speed up training and convergence of the network, and prevent overfitting. The second layer is a convolutional layer with 32 channels, a 5 × 5 kernel (width × height of the kernel, in pixels) and stride 1; the ReLU activation function changes the image to (124 × 96 × 32) (width × height × channels, with width and height in pixels). The third layer adds a batch normalization layer BN, because the output of the convolutional layer follows a symmetric, non-sparse, approximately Gaussian distribution, and normalizing it produces a more stable distribution. The fourth layer is a convolutional layer with 32 channels, a 3 × 3 kernel (width × height, in pixels) and stride 1; the ReLU activation function changes the image to (122 × 94 × 32) (width × height × channels). The fifth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The sixth layer is a convolutional layer with 32 channels, a 3 × 3 kernel (width × height, in pixels) and stride 1; the ReLU activation function changes the image to (120 × 92 × 32) (width × height × channels). The seventh layer uses a max pooling layer with stride 1 and pool_size (2 × 2) (width × height of the pool, in pixels) to change the input image to (60 × 46 × 32) (width × height × channels); it further reduces the dimensionality of the information extracted by the sixth convolutional layer, reduces the amount of computation, and reduces the sensitivity of the convolutional layers to position. The eighth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The ninth layer is a convolutional layer with 16 channels, a 3 × 3 kernel (width × height, in pixels) and stride 1; the ReLU activation function changes the image to (58 × 44 × 16) (width × height × channels). The tenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) (width × height of the pool, in pixels) to change the input image to (29 × 22 × 16) (width × height × channels); it further reduces the dimensionality of the information extracted by the ninth convolutional layer, reduces the amount of computation, and reduces the sensitivity of the convolutional layers to position. The eleventh layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The input data is thereby standardized, which speeds up training and convergence of the network and prevents overfitting. The twelfth layer is a convolutional layer with 16 channels, a 3 × 3 kernel (width × height, in pixels) and stride 1; the ReLU activation function changes the image to (27 × 20 × 16) (width × height × channels). The thirteenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) (width × height of the pool, in pixels) to change the input image to (14 × 10 × 16) (width × height × channels), further reducing the dimensionality of the information extracted by the twelfth convolutional layer. The fourteenth layer uses a batch normalization layer BN with the following properties:
axis=-1,                              # the axis to be normalized, usually the feature axis; batch samples are normalized along the channel axis (the last axis), so axis = -1
momentum=0.99,                        # momentum for the moving mean
epsilon=0.001,                        # small float greater than 0, to avoid division by zero
center=True,                          # when True, add the offset beta
scale=True,                           # when True, multiply by gamma
beta_initializer='zeros',             # initialization method for the beta weight
gamma_initializer='ones',             # initialization method for the gamma parameter
moving_mean_initializer='zeros',      # initialization method for the moving mean
moving_variance_initializer='ones'    # initialization method for the moving variance
The fifteenth layer is a convolutional layer with 16 channels, a 3 × 3 kernel (width × height, in pixels) and stride 1; the ReLU activation function changes the image to (12 × 8 × 16) (width × height × channels). The sixteenth layer uses a max pooling layer with stride 1 and pool_size (2 × 2) (width × height of the pool, in pixels) to change the input image to (6 × 4 × 16) (width × height × channels), further reducing the dimensionality of the information extracted by the fifteenth convolutional layer. The seventeenth layer flattens the input data to (384 × 1) using a Flatten layer.
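For reference, the seventeen layers above can be assembled, for example, as the following Keras sketch. The layer types, channel counts and kernel sizes follow the text; the pooling strides and padding are an assumption chosen so that the stated output shapes (down to the 384-dimensional Flatten output) are reproduced, since the stride values quoted in the prose do not by themselves yield those shapes.

from tensorflow.keras import layers, models

def build_discriminator(input_shape=(128, 100, 3)):
    # Minimal sketch of the 17-layer discriminator; comments give the output shapes stated in the text.
    return models.Sequential([
        layers.BatchNormalization(input_shape=input_shape),                    # 1: BN, default attributes
        layers.Conv2D(32, (5, 5), strides=1, activation='relu'),               # 2: -> 124 x 96 x 32
        layers.BatchNormalization(),                                           # 3: BN
        layers.Conv2D(32, (3, 3), strides=1, activation='relu'),               # 4: -> 122 x 94 x 32
        layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001),      # 5: BN with the listed properties
        layers.Conv2D(32, (3, 3), strides=1, activation='relu'),               # 6: -> 120 x 92 x 32
        layers.MaxPooling2D(pool_size=(2, 2)),                                 # 7: -> 60 x 46 x 32
        layers.BatchNormalization(),                                           # 8: BN
        layers.Conv2D(16, (3, 3), strides=1, activation='relu'),               # 9: -> 58 x 44 x 16
        layers.AveragePooling2D(pool_size=(3, 3), strides=2, padding='same'),  # 10: -> 29 x 22 x 16
        layers.BatchNormalization(),                                           # 11: BN
        layers.Conv2D(16, (3, 3), strides=1, activation='relu'),               # 12: -> 27 x 20 x 16
        layers.AveragePooling2D(pool_size=(3, 3), strides=2, padding='same'),  # 13: -> 14 x 10 x 16
        layers.BatchNormalization(),                                           # 14: BN
        layers.Conv2D(16, (3, 3), strides=1, activation='relu'),               # 15: -> 12 x 8 x 16
        layers.MaxPooling2D(pool_size=(2, 2)),                                 # 16: -> 6 x 4 x 16
        layers.Flatten(),                                                      # 17: -> 384
    ])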
S2, constructing the loss function for network training; in the face tampering detection method based on continuous inter-frame difference, a contrastive loss function is adopted as the loss function of the twin neural network.
The contrastive loss function (Contrastive Loss) serves as a dimensionality-reduction learning method that learns a mapping under which similar samples in a high-dimensional space remain similar after being mapped to a low-dimensional space; it expresses the matching degree of paired samples well and can effectively handle the relationship of paired data in the twin neural network. The contrastive loss L is given by:

L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y D_w^2 + (1 - Y) \max(m - D_w, 0)^2 \right]    (1)

wherein:

D_w(X_1, X_2) = \left\| X_1 - X_2 \right\|_2 = \left( \sum_{p=1}^{P} (X_1^p - X_2^p)^2 \right)^{1/2}    (2)

Here D_w(X_1, X_2) denotes the Euclidean distance between the two sample features X_1 and X_2. Y is a label indicating whether the two samples match: Y = 1 indicates that the two samples are similar or matched, and Y = 0 indicates no match. m is a manually set margin threshold. N is the number of samples. P is the dimension of the sample features. It can be seen from formula (1) that the contrastive loss function expresses the matching degree of paired samples well and is also well suited to training the feature-extraction model.
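For illustration, formula (1) can be transcribed directly as a loss function; the following TensorFlow sketch is an example under stated assumptions (the margin value and the batch reduction are illustrative), not code taken from the patent.

import tensorflow as tf

def contrastive_loss(y_true, d_w, margin=1.0):
    # y_true: Y = 1 for similar/matched pairs, Y = 0 for non-matching pairs
    # d_w:    Euclidean distance D_w(X1, X2) between the two branch outputs
    # margin: the manually set threshold m (the value 1.0 is an illustrative choice)
    y_true = tf.cast(y_true, d_w.dtype)
    matched = y_true * tf.square(d_w)
    unmatched = (1.0 - y_true) * tf.square(tf.maximum(margin - d_w, 0.0))
    return tf.reduce_mean(matched + unmatched) / 2.0   # averaging over the batch gives the 1/(2N) factor of Eq. (1)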
S3, constructing a twin neural network suitable for detection based on the discriminator; the discriminator is arranged as a two-branch input, two adjacent frame images each entering one branch of the discriminator to form two (384 × 1) outputs, i.e. a (768 × 1) vector; on this basis the data is fed into a fully connected layer Dense, and the final output is the judgment result.
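A minimal sketch of this two-branch arrangement is given below, reusing the build_discriminator() helper sketched after the layer-by-layer description above; the single sigmoid output unit is an assumption, since the text only states that the concatenated (768 × 1) vector passes through a Dense layer whose output is the judgment result.

from tensorflow.keras import layers, Model, Input

def build_siamese(discriminator, input_shape=(128, 100, 3)):
    frame_a = Input(shape=input_shape)                       # face crop from frame t
    frame_b = Input(shape=input_shape)                       # face crop from frame t+1
    feat_a = discriminator(frame_a)                          # (384,) features, shared weights
    feat_b = discriminator(frame_b)                          # (384,) features, shared weights
    merged = layers.Concatenate()([feat_a, feat_b])          # (768,) concatenated vector
    verdict = layers.Dense(1, activation='sigmoid')(merged)  # judgment result (assumed output head)
    return Model([frame_a, frame_b], verdict)

# siamese = build_siamese(build_discriminator())

How the Dense head is combined with the contrastive loss of formula (1) during training is not spelled out in the text, so the sketch stops at the architecture.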
S4, decomposing the video to be detected into a frame sequence, constructing a face locator, and identifying and extracting a face image from each frame;
compared with the traditional method, the MTCNN method based on the deep convolutional neural network has better performance, can more accurately position the human face, and can also realize real-time detection.
Firstly, the picture is scaled to different sizes according to different scaling ratios to form an image pyramid. PNet mainly obtains candidate windows and bounding-box regression vectors for the face region; the bounding boxes are used for regression to calibrate the candidate windows, and highly overlapping candidate boxes are then merged by non-maximum suppression (NMS). RNet takes the candidate boxes that pass PNet, trains them in the RNet network, fine-tunes the candidate windows using the bounding-box regression values, and removes overlapping windows with NMS. ONet functions similarly to RNet, except that it additionally outputs five facial landmark positions while removing overlapping candidate windows.
The network structure of PNet is a fully convolutional neural network, and the input of the training network is a 12 × 12 picture (width × height, in pixels), so training data for the PNet network must be generated before training. The training data are generated as a series of bounding boxes by computing the IoU with the ground-truth box; they can be obtained by sliding windows or random sampling and are divided into positive samples, negative samples and intermediate samples, where a positive sample is a generated sliding window whose IoU with the ground-truth box is greater than 0.65, a negative sample has IoU less than 0.3, and an intermediate sample has IoU greater than 0.4 and less than 0.65. The bounding boxes are then resized to 12 × 12 pictures and converted into 12 × 12 × 3 structures (width × height × channels), producing the training data of the PNet network. The training data pass through 10 convolution kernels of 3 × 3 × 3 (width × height × channels of the kernel) and a 2 × 2 max pooling (stride = 2) operation, generating 10 feature maps of 5 × 5 (width × height, in pixels); then 16 convolution kernels of 3 × 3 × 10 generate 16 feature maps of 3 × 3; then 32 convolution kernels of 3 × 3 × 16 generate 32 feature maps of 1 × 1.
Finally, from the 32 feature maps of 1 × 1, 2 convolution kernels of 1 × 1 × 32 generate 2 feature maps of 1 × 1 for classification; 4 convolution kernels of 1 × 1 × 32 generate 4 feature maps of 1 × 1 for bounding-box regression; and 10 convolution kernels of 1 × 1 × 32 generate 10 feature maps of 1 × 1 for the face contour points.
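As an illustration of the PNet structure just described, a functional-style Keras sketch follows; the kernel counts and feature-map sizes follow the text, while the ReLU/softmax activation choices are assumptions (reference MTCNN implementations typically use PReLU).

from tensorflow.keras import layers, Model

def build_pnet():
    x_in = layers.Input(shape=(12, 12, 3))                    # 12 x 12 x 3 training input
    x = layers.Conv2D(10, (3, 3), activation='relu')(x_in)    # 10 kernels of 3 x 3 x 3
    x = layers.MaxPooling2D((2, 2), strides=2)(x)             # -> 10 feature maps of 5 x 5
    x = layers.Conv2D(16, (3, 3), activation='relu')(x)       # -> 16 feature maps of 3 x 3
    x = layers.Conv2D(32, (3, 3), activation='relu')(x)       # -> 32 feature maps of 1 x 1
    cls = layers.Conv2D(2, (1, 1), activation='softmax')(x)   # 2 maps: face / non-face classification
    box = layers.Conv2D(4, (1, 1))(x)                         # 4 maps: bounding-box regression
    lmk = layers.Conv2D(10, (1, 1))(x)                        # 10 maps: face contour (landmark) points
    return Model(x_in, [cls, box, lmk])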
The RNet model takes a 24 × 24 picture (width × height, in pixels) as input and generates 28 feature maps of 11 × 11 using 28 convolution kernels of 3 × 3 × 3 (width × height × channels of the kernel) and 3 × 3 max pooling (stride = 2); after 48 convolution kernels of 3 × 3 × 28 and 3 × 3 max pooling (stride = 2) it generates 48 feature maps of 4 × 4; after 64 convolution kernels of 2 × 2 × 48 it generates 64 feature maps of 3 × 3; the 3 × 3 × 64 feature maps are converted into a fully connected layer of size 128; the regression-box classification problem is converted into a fully connected layer of size 2; the bounding-box position regression problem is converted into a fully connected layer of size 4; and the face contour key points are converted into a fully connected layer of size 10.
ONet is the last network in MTCNN and produces the final output of the network. The training data for ONet are generated similarly to RNet, the detection data being the bounding boxes detected after the picture passes through the PNet and RNet networks, including positive samples, negative samples and intermediate samples. The ONet model input is a 48 × 48 × 3 picture (width × height × channels, with width and height in pixels); after 32 convolution kernels of 3 × 3 × 3 (width × height × channels of the kernel) and 3 × 3 max pooling (stride = 2) it is converted into 32 feature maps of 23 × 23; after 64 convolution kernels of 3 × 3 × 32 and 3 × 3 max pooling (stride = 2) it is converted into 64 feature maps of 10 × 10; after 64 convolution kernels of 3 × 3 × 64 and 3 × 3 max pooling (stride = 2) it is converted into 64 feature maps of 4 × 4; 128 convolution kernels of 2 × 2 × 64 convert it into 128 feature maps of 3 × 3; a fully connected operation converts this into a fully connected layer of size 256; finally, regression-box classification features of size 2, regression-box position regression features of size 4, and face contour position regression features of size 10 are generated.
Predicted bounding boxes are generated from the original picture by PNet. The original picture and the bounding boxes generated by PNet are input to RNet, which generates corrected bounding boxes. The original picture and the bounding boxes generated by RNet are input to ONet, which generates corrected bounding boxes and the face contour key points; the basic structure is shown in FIG. 2. The implementation process is as follows:
1. First, read in the picture to be detected: image = cv2.imread(image_path)
2. Load the trained model parameters and construct the detection object: detector = MtcnnDetector(...)
3. Perform the inference operation: all_boxes, landmarks = detector.detect_face(image)
4. Draw the target box: cv2.rectangle(image, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), (0, 0, 255))
S5, extracting features from the cropped faces for representation; the face data extracted in the previous steps is input into a FaceNet network, and the embedding feature vector is calculated to obtain the feature representation of the face.
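A hedged sketch of S5 is shown below; facenet_model stands for a pretrained FaceNet-style embedding network (its 160 × 160 input size and the per-image standardization are common conventions, not values prescribed by the patent), and all names are illustrative.

import cv2
import numpy as np

def face_embedding(facenet_model, face_bgr):
    face = cv2.resize(face_bgr, (160, 160)).astype('float32')   # resize to the embedder's assumed input size
    face = (face - face.mean()) / max(float(face.std()), 1e-6)  # per-image standardization (assumed preprocessing)
    return facenet_model.predict(face[None, ...])[0]            # embedding feature vector

def embedding_distance(facenet_model, face_a, face_b):
    e_a = face_embedding(facenet_model, face_a)
    e_b = face_embedding(facenet_model, face_b)
    return float(np.linalg.norm(e_a - e_b))                     # Euclidean distance used in the experiments below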
S6, training the twin neural network;
and S7, detecting video tampering by using the trained network.
In order to verify the feasibility of the method and evaluate its performance, the method provided by the invention was implemented and tested on the TensorFlow deep learning framework.
First, we selected the forensics dataset Celeb-DF introduced in "Celeb-DF: A New Dataset for DeepFake Forensics". Celeb-DF contains 408 original videos collected from YouTube, from which 795 DeepFake videos were synthesized; some of the faces in this dataset are shown in FIG. 3:
In the test we focus mainly on the difference between real face images and forged face images. A number of real and forged faces were selected for comparison; several groups of data were chosen, one group being randomly cropped forged faces and the other randomly cropped real faces. After data acquisition, the test images are first mapped into a certain dimensional space through the deep learning network, and the real-real and real-fake distances are then calculated. With this method the two types of distances are clearly separable; the test results of the algorithm are shown in the following table:
(Table: column a gives the facenet distances between real and forged faces; column b gives the distances between a real face and other real faces of the same person.)
In the above table, column a is the distance between the real face and the forged face, and column b is the distance between the real face and other real faces of the same person. Viewed directly, the two groups of data are clearly separated; a p-test on the column-a and column-b data gives a p value of 0.000002, far less than 0.05, so the two groups of data are highly distinguishable. A normal distribution graph of the data is drawn, as shown in FIG. 4.
As can be seen from FIG. 4, the normal distribution curves of the two groups of data are completely different and highly distinguishable: the right curve a (solid line) is the facenet distance curve between real and forged faces, and the left curve b (dotted line) is the distance curve between a real face and other real faces. The difference between the two curves is very obvious, which is a characteristic that can be used to identify forged video. Finally, the classification effect on real and fake images is tested using the normal distributions: 100 real face images and 50 fake face images are used as training data to obtain the normal distribution model, and 1000 real and fake images are then used for testing; the obtained accuracy is 95.6%, and the specific confusion matrix is shown in the following table:
      t      f
p   979     67
n   933     21
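The normal-distribution classification used above can be sketched as follows, assuming the real-real and real-fake distance samples are available as arrays; scipy's norm.fit is used for the Gaussian fits, and all names are illustrative.

from scipy.stats import norm

def fit_distance_models(real_real_distances, real_fake_distances):
    real_params = norm.fit(real_real_distances)   # curve b: real face vs. other real faces
    fake_params = norm.fit(real_fake_distances)   # curve a: real face vs. forged face
    return real_params, fake_params

def looks_forged(distance, real_params, fake_params):
    # classify a facenet distance by whichever fitted normal curve makes it more likely
    return norm.pdf(distance, *fake_params) > norm.pdf(distance, *real_params)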
The experiment combines face recognition techniques such as facenet to test the high-quality face dataset Celeb-DF. The experimental results show that the method is statistically well-discriminating, can quickly and effectively perform tamper detection on video, and has strong robustness.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (5)

1. A video tampering detection method based on continuous interframe difference is characterized in that: the video tampering detection method comprises the following steps:
Step one: constructing a discriminator for deep learning;
a deep-learning discriminator suitable for facial feature extraction is constructed on the basis of a convolutional neural network; the discriminator consists of 6 batch normalization layers BN, 6 convolutional layers, 4 pooling layers and 1 Flatten layer; the input of the discriminator is a cropped 128 × 100 three-channel color face image (128 × 100 × 3), to which simple normalization is applied;
step two: constructing the loss function for network training; in the face tampering detection method based on continuous inter-frame difference, a contrastive loss function is adopted as the loss function of the twin neural network;
step three: constructing a twin neural network suitable for detection based on the discriminator; the discriminator is arranged as a two-branch input, two adjacent frame images each entering one branch of the discriminator to form two (384 × 1) outputs, i.e. a (768 × 1) vector; on this basis the data is fed into a fully connected layer Dense, and the final output is the judgment result;
step four: decomposing the video to be detected into a frame sequence, constructing a face locator, and identifying and extracting a face image from each frame;
step five: extracting features from the cropped faces for representation; the face data extracted in the previous step is input into a FaceNet network, and the embedding feature vector is calculated to obtain the feature representation of the face;
step six: training the twin neural network;
step seven: and detecting video tampering by using the trained network.
2. The video tampering detection method based on continuous inter-frame difference according to claim 1, characterized in that: in step one, the constructed deep learning discriminator takes as input a 128 × 100 three-channel color face image (128 × 100 × 3) cropped from the video, to which simple normalization is applied;
the specific structure of the discriminator is described as follows (a Keras-style sketch of the full stack is given after this claim):
a batch normalization layer BN with default attribute values is used as the first layer of the discriminator to standardize the input data, accelerating training and convergence of the network and preventing overfitting;
the second layer is a 32-channel 5 × 5 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (124 × 96 × 32);
the third layer adds a batch normalization layer BN with default attribute values; because the output of a convolutional layer has a symmetric, non-sparse distribution similar to a Gaussian distribution, normalizing it produces a more stable distribution;
the fourth layer is a 32-channel 3 × 3 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (122 × 94 × 32);
the fifth layer uses a batch normalization layer BN normalized along the channel axis, i.e. the last (feature) axis, so axis = -1;
the sixth layer is a 32-channel 3 × 3 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (120 × 92 × 32);
the seventh layer uses a max pooling layer with stride 1 and pool_size (2 × 2) to change the input image to (60 × 46 × 32), further reducing the dimensionality of the information extracted by the sixth convolutional layer, lowering the amount of computation, and reducing the sensitivity of the convolutional layers to position;
the eighth layer uses a batch normalization layer BN normalized along the channel axis, i.e. the last (feature) axis, so axis = -1;
the ninth layer is a 16-channel 3 × 3 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (58 × 44 × 16);
the tenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) to change the input image to (29 × 22 × 16), further reducing the dimensionality of the information extracted by the ninth convolutional layer, lowering the amount of computation and the sensitivity of the convolutional layers to position;
the eleventh layer uses a batch normalization layer BN normalized along the channel axis, i.e. the last (feature) axis, so axis = -1; this standardizes the input data, accelerates training and convergence of the network, and prevents overfitting;
the twelfth layer is a 16-channel 3 × 3 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (27 × 20 × 16);
the thirteenth layer uses an average pooling layer with stride 2 and pool_size (3 × 3) to change the input image to (14 × 10 × 16), further reducing the dimensionality of the information extracted by the twelfth convolutional layer;
the fourteenth layer uses a batch normalization layer BN normalized along the channel axis, i.e. the last (feature) axis, so axis = -1;
the fifteenth layer is a 16-channel 3 × 3 convolutional layer with stride 1 and a ReLU activation function, which changes the image to (12 × 8 × 16);
the sixteenth layer uses a max pooling layer with stride 1 and pool_size (2 × 2) to change the input image to (6 × 4 × 16), further reducing the dimensionality of the information extracted by the fifteenth convolutional layer;
the seventeenth layer uses a Flatten layer to expand the input data to (384 × 1).
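The layer stack above maps naturally onto a Keras Sequential model; a sketch follows. It is an interpretation, not the patent's code: where a stated stride and a stated output shape disagree (the pooling layers), the parameters below follow the stated output shapes, and BatchNormalization's default channel axis (-1) is used throughout.

# Keras-style reading of the 17-layer discriminator of claim 2.
from tensorflow.keras import layers, models

discriminator = models.Sequential([
    layers.Input(shape=(128, 100, 3)),
    layers.BatchNormalization(),                                 # layer 1
    layers.Conv2D(32, (5, 5), strides=1, activation="relu"),     # layer 2 -> 124 x 96 x 32
    layers.BatchNormalization(),                                 # layer 3
    layers.Conv2D(32, (3, 3), strides=1, activation="relu"),     # layer 4 -> 122 x 94 x 32
    layers.BatchNormalization(axis=-1),                          # layer 5 (channel axis)
    layers.Conv2D(32, (3, 3), strides=1, activation="relu"),     # layer 6 -> 120 x 92 x 32
    layers.MaxPooling2D(pool_size=(2, 2)),                       # layer 7 -> 60 x 46 x 32
    layers.BatchNormalization(axis=-1),                          # layer 8
    layers.Conv2D(16, (3, 3), strides=1, activation="relu"),     # layer 9 -> 58 x 44 x 16
    layers.AveragePooling2D((3, 3), strides=2, padding="same"),  # layer 10 -> 29 x 22 x 16
    layers.BatchNormalization(axis=-1),                          # layer 11
    layers.Conv2D(16, (3, 3), strides=1, activation="relu"),     # layer 12 -> 27 x 20 x 16
    layers.AveragePooling2D((3, 3), strides=2, padding="same"),  # layer 13 -> 14 x 10 x 16
    layers.BatchNormalization(axis=-1),                          # layer 14
    layers.Conv2D(16, (3, 3), strides=1, activation="relu"),     # layer 15 -> 12 x 8 x 16
    layers.MaxPooling2D(pool_size=(2, 2)),                       # layer 16 -> 6 x 4 x 16
    layers.Flatten(),                                            # layer 17 -> 6 * 4 * 16 = 384
])
discriminator.summary()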
3. The video tampering detection method based on continuous inter-frame difference according to claim 1, characterized in that: in step two, the contrastive loss function used as the loss function for network training acts as a dimensionality-reduction learning method for learning a mapping relation, so that samples that are similar in the high-dimensional space remain similar after the function maps them to the low-dimensional space; the contrastive loss function expresses the degree of matching of paired samples well and can effectively handle the relation of paired data in the twin neural network;
the formula of the contrastive loss function L is as follows:
L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y D_w^2 + (1 - Y)\, \max(m - D_w,\, 0)^2 \right]    (1)
wherein:
D_w(X_1, X_2) = \left\| X_1 - X_2 \right\|_2 = \left( \sum_{i=1}^{P} \left( X_1^i - X_2^i \right)^2 \right)^{1/2}
here D_w(X_1, X_2) represents the Euclidean distance between the two sample features X_1 and X_2;
Y is a label indicating whether the two samples match: Y = 1 indicates that the two samples are similar or matched, and Y = 0 indicates that they do not match;
m is a manually set margin threshold;
N is the number of samples;
P is the dimension of the sample feature;
as can be seen from formula (1), the contrastive loss function expresses the degree of matching of paired samples well and is also well suited to training a model that extracts features (a code sketch of this loss is given after this claim).
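A sketch of the contrastive loss and the Euclidean distance D_w in TensorFlow follows. The function names and the default margin value m = 1.0 are illustrative assumptions; the formula implemented is the one given in equation (1).

# Contrastive loss for the twin network: pull matched pairs together, push mismatched pairs apart.
import tensorflow as tf

def contrastive_loss(y_true, d_w, margin=1.0):
    """y_true = 1 for matching pairs, 0 for non-matching pairs;
    d_w is the Euclidean distance between the two branch outputs."""
    y_true = tf.cast(y_true, d_w.dtype)
    positive = y_true * tf.square(d_w)                                    # Y * D_w^2
    negative = (1.0 - y_true) * tf.square(tf.maximum(margin - d_w, 0.0))  # (1 - Y) * max(m - D_w, 0)^2
    return 0.5 * tf.reduce_mean(positive + negative)

def euclidean_distance(x1, x2):
    # D_w(X1, X2) = sqrt(sum_i (X1_i - X2_i)^2), summed over the feature dimension P
    return tf.sqrt(tf.reduce_sum(tf.square(x1 - x2), axis=-1) + 1e-12)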
4. The video tampering detection method based on continuous inter-frame difference according to claim 1, characterized in that: step four, decomposing the video to be detected into a frame sequence, constructing a face locator, and identifying and extracting a face image from each frame, proceeds as follows:
compared with traditional methods, the MTCNN method based on a deep convolutional neural network performs better and locates faces more accurately; in addition, MTCNN can achieve real-time detection;
first, the picture is scaled to different sizes according to different scaling ratios to form an image feature pyramid; PNet mainly obtains candidate windows of the face region and bounding-box regression vectors; bounding-box regression is used to calibrate the candidate windows, and highly overlapping candidates are merged through non-maximum suppression (NMS); RNet takes the candidate boxes passing PNet, trains them in the RNet network, fine-tunes the candidate windows using the bounding-box regression values, and then removes overlapping windows with NMS; ONet functions similarly to RNet, except that while removing overlapping candidate windows it also outputs the positions of the five facial key points;
the network structure of PNet is a fully convolutional neural network, and the input of the training network is a 12 × 12 picture, so training data for the PNet network must be generated before training; the training data are a series of bounding boxes generated by computing the IoU with the ground-truth box, obtained by sliding-window or random sampling, and divided into three categories: positive samples, negative samples and intermediate samples; a positive sample is a generated sliding window whose IoU with the ground-truth box is greater than 0.65, a negative sample has IoU less than 0.3, and an intermediate sample has IoU greater than 0.4 and less than 0.65; the bounding boxes are then resized to 12 × 12 pictures and converted into 12 × 12 × 3 structures, producing the training data of the PNet network; the training data pass through 10 convolution kernels of 3 × 3 × 3 and 2 × 2 max pooling (stride = 2) to generate 10 feature maps of 5 × 5; 16 feature maps of 3 × 3 are then generated through 16 convolution kernels of 3 × 3 × 10; 32 feature maps of 1 × 1 are then generated through 32 convolution kernels of 3 × 3 × 16; finally, from the 32 feature maps of 1 × 1, 2 convolution kernels of 1 × 1 × 32 generate 2 feature maps of 1 × 1 for classification, 4 convolution kernels of 1 × 1 × 32 generate 4 feature maps of 1 × 1 for the regression box, and 10 convolution kernels of 1 × 1 × 32 generate 10 feature maps of 1 × 1 for the facial contour points (a sketch of this PNet structure is given after this claim);
the RNet model takes a 24 × 24 picture as input; 28 feature maps of 11 × 11 are generated after 28 convolution kernels of 3 × 3 × 3 and 3 × 3 (stride = 2) max pooling; 48 feature maps of 4 × 4 are generated after 48 convolution kernels of 3 × 3 × 28 and 3 × 3 (stride = 2) max pooling; 64 feature maps of 3 × 3 are generated after 64 convolution kernels of 2 × 2 × 48; the 3 × 3 × 64 feature maps are converted into a fully connected layer of size 128; the regression-box classification problem is converted into a fully connected layer of size 2; the bounding-box position regression problem is converted into a fully connected layer of size 4; the facial contour key points are converted into a fully connected layer of size 10;
ONet is the last network in MTCNN and produces the final output of the network; the training data of ONet are generated similarly to RNet, the detection data being the bounding boxes detected after the picture passes through the PNet and RNet networks, including positive, negative and intermediate samples; the ONet model takes a 48 × 48 × 3 picture as input, which is transformed into 32 feature maps of 23 × 23 by 32 convolution kernels of 3 × 3 × 3 and 3 × 3 (stride = 2) max pooling; converted to 64 feature maps of 10 × 10 after 64 convolution kernels of 3 × 3 × 32 and 3 × 3 (stride = 2) max pooling; converted to 64 feature maps of 4 × 4 after 64 convolution kernels of 3 × 3 × 64 and 3 × 3 (stride = 2) max pooling; converted to 128 feature maps of 3 × 3 by 128 convolution kernels of 2 × 2 × 64; converted to a fully connected layer of size 256 through a fully connected operation; and finally, respectively, a regression-box classification feature of size 2, a regression-box position feature of size 4, and a facial contour position regression feature of size 10 are generated.
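As referenced in the PNet description above, a Keras-style sketch of that structure follows. It is illustrative only: the real MTCNN PNet uses PReLU activations and is trained with the multi-task losses described above, whereas this sketch uses plain ReLU and only reproduces the layer shapes.

# PNet sketch: 12x12x3 input -> face/non-face scores, bounding-box regression, five landmarks.
from tensorflow.keras import layers, models

inp = layers.Input(shape=(12, 12, 3))
x = layers.Conv2D(10, (3, 3), activation="relu")(inp)    # 10 feature maps, 10 x 10
x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)  # 10 feature maps, 5 x 5
x = layers.Conv2D(16, (3, 3), activation="relu")(x)      # 16 feature maps, 3 x 3
x = layers.Conv2D(32, (3, 3), activation="relu")(x)      # 32 feature maps, 1 x 1

face_cls = layers.Conv2D(2, (1, 1), activation="softmax")(x)  # classification head (2 maps)
bbox_reg = layers.Conv2D(4, (1, 1))(x)                        # regression-box head (4 maps)
landmarks = layers.Conv2D(10, (1, 1))(x)                      # five key points, (x, y) each (10 maps)

pnet = models.Model(inp, [face_cls, bbox_reg, landmarks], name="PNet")
pnet.summary()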
5. The video tampering detection method based on continuous inter-frame difference according to claim 4, characterized in that: the basic flow of the MTCNN network is as follows: the original picture is passed through PNet to generate predicted bounding boxes; the original picture and the bounding boxes generated by PNet are input to RNet to generate corrected bounding boxes; the original picture and the bounding boxes generated by RNet are input to ONet to generate corrected bounding boxes and facial contour key points;
the implementation process is as follows:
1. First, read in the picture to be detected: image = cv2.imread(image_path)
2. Load the trained model parameters and construct the detection object: detector = MtcnnDetector(...)
3. Execute the inference operation: all_boxes, landmarks = detector.detect_face(image)
4. Draw the target frame: cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2).
(The constructor arguments and the exact detection method name depend on the MTCNN implementation used; a self-contained sketch against an open-source implementation is given below.)
CN202110222677.XA 2021-03-01 2021-03-01 Face tampering detection method based on continuous inter-frame difference Pending CN112801037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110222677.XA CN112801037A (en) 2021-03-01 2021-03-01 Face tampering detection method based on continuous inter-frame difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110222677.XA CN112801037A (en) 2021-03-01 2021-03-01 Face tampering detection method based on continuous inter-frame difference

Publications (1)

Publication Number Publication Date
CN112801037A true CN112801037A (en) 2021-05-14

Family

ID=75816208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110222677.XA Pending CN112801037A (en) 2021-03-01 2021-03-01 Face tampering detection method based on continuous inter-frame difference

Country Status (1)

Country Link
CN (1) CN112801037A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416780A (en) * 2018-03-27 2018-08-17 福州大学 A kind of object detection and matching process based on twin-area-of-interest pond model
CN109508655A (en) * 2018-10-28 2019-03-22 北京化工大学 The SAR target identification method of incomplete training set based on twin network
CN110414349A (en) * 2019-06-26 2019-11-05 长安大学 Introduce the twin convolutional neural networks face recognition algorithms of sensor model
CN110309798A (en) * 2019-07-05 2019-10-08 中新国际联合研究院 A kind of face cheat detecting method extensive based on domain adaptive learning and domain
CN110443203A (en) * 2019-08-07 2019-11-12 中新国际联合研究院 The face fraud detection system counter sample generating method of network is generated based on confrontation
CN111259742A (en) * 2020-01-09 2020-06-09 南京理工大学 Abnormal crowd detection method based on deep learning
CN111401192A (en) * 2020-03-10 2020-07-10 深圳市腾讯计算机系统有限公司 Model training method based on artificial intelligence and related device
CN111506773A (en) * 2020-03-24 2020-08-07 中国科学院大学 Video duplicate removal method based on unsupervised depth twin network
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN113298056A (en) * 2021-07-27 2021-08-24 自然资源部国土卫星遥感应用中心 Multi-mode remote sensing image change detection method, model generation method and terminal equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG YIXUAN ET AL.: "Face Tampering Video Detection Method Based on Inter-frame Difference" (《基于帧间差异的人脸篡改视频检测方法》), 《信息安全学报》 (Journal of Information Security) *
WU WEI: "How to Apply the MTCNN and FaceNet Models to Implement Face Detection and Recognition" (《如何应用MTCNN和FaceNet模型实现人脸检测及识别》), https://zhuanlan.zhihu.com/p/37705980 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283393A (en) * 2021-06-28 2021-08-20 南京信息工程大学 Method for detecting Deepfake video based on image group and two-stream network
CN113283393B (en) * 2021-06-28 2023-07-25 南京信息工程大学 Deepfake video detection method based on image group and two-stream network
CN113449657A (en) * 2021-07-05 2021-09-28 中山大学 Method, system and medium for detecting depth-forged face video based on face key points
CN113449657B (en) * 2021-07-05 2022-08-30 中山大学 Method, system and medium for detecting depth-forged face video based on face key points
CN114842034A (en) * 2022-04-19 2022-08-02 山东省人工智能研究院 Picture true and false detection method based on amplified fuzzy operation trace

Similar Documents

Publication Publication Date Title
CN110084156B (en) Gait feature extraction method and pedestrian identity recognition method based on gait features
CN112801037A (en) Face tampering detection method based on continuous inter-frame difference
Younus et al. Effective and fast deepfake detection method based on haar wavelet transform
CN109740572B (en) Human face living body detection method based on local color texture features
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN112069891B (en) Deep fake face identification method based on illumination characteristics
CN113139489B (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN111639577A (en) Method for detecting human faces of multiple persons and recognizing expressions of multiple persons through monitoring video
CN110096945B (en) Indoor monitoring video key frame real-time extraction method based on machine learning
CN111639580A (en) Gait recognition method combining feature separation model and visual angle conversion model
CN111652875A (en) Video counterfeiting detection method, system, storage medium and video monitoring terminal
CN111369548A (en) No-reference video quality evaluation method and device based on generation countermeasure network
Zhu et al. Towards automatic wild animal detection in low quality camera-trap images using two-channeled perceiving residual pyramid networks
CN114550268A (en) Depth-forged video detection method utilizing space-time characteristics
Sun et al. [Retracted] Research on Face Recognition Algorithm Based on Image Processing
CN114842524B (en) Face false distinguishing method based on irregular significant pixel cluster
CN111985314A (en) ViBe and improved LBP-based smoke detection method
CN113486712B (en) Multi-face recognition method, system and medium based on deep learning
Sharma et al. A review of passive forensic techniques for detection of copy-move attacks on digital videos
CN115797970B (en) Dense pedestrian target detection method and system based on YOLOv5 model
Kroneman et al. Accurate pedestrian localization in overhead depth images via Height-Augmented HOG
CN110502995B (en) Driver yawning detection method based on fine facial action recognition
CN113850284B (en) Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
CN111767784B (en) False face intrusion detection method based on face potential blood vessel distribution
Low et al. Frame Based Object Detection--An Application for Traffic Monitoring

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20210514)