CN113283393A - Method for detecting Deepfake video based on image group and two-stream network - Google Patents

Method for detecting Deepfake video based on image group and two-stream network

Info

Publication number
CN113283393A
CN113283393A (application CN202110717852.2A); granted as CN113283393B
Authority
CN
China
Prior art keywords
network
video
frame
stream
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110717852.2A
Other languages
Chinese (zh)
Other versions
CN113283393B (en)
Inventor
王金伟 (Wang Jinwei)
张玫瑰 (Zhang Meigui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110717852.2A priority Critical patent/CN113283393B/en
Publication of CN113283393A publication Critical patent/CN113283393A/en
Application granted granted Critical
Publication of CN113283393B publication Critical patent/CN113283393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for detecting a Deepfake video based on an image group and a two-stream network, comprising the following steps: (1) extracting key frames of the video to be detected to form an image group; (2) inputting the first frame of the image group into the spatial stream of the two-stream network to extract spatial features; (3) subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract temporal features; (4) fusing the extracted spatial and temporal features and evaluating the authenticity of the video with a dynamic routing algorithm. Compared with the prior art, the method reduces computational redundancy by using the image group so that the network concentrates on the key frames, makes full use of the spatio-temporal information of the key frames by fusing the spatial and temporal features, and classifies with a dynamic routing algorithm to obtain a more accurate evaluation result.

Description

Method for detecting Deepfake video based on image group and two-stream network
Technical Field
The invention belongs to the field of video detection, and particularly relates to a method for detecting a Deepfake video based on an image group and a two-stream network.
Background
With the rise and development of artificial intelligence, face-swapping technology has gradually attracted wide attention. The advent of Deepfake was a breakthrough in face swapping: it can replace the face of a source person in a video with the face of a target person. With the emergence and refinement of generative adversarial networks, face swapping has become easier and harder to notice with the naked eye. As public figures, celebrities and politicians have large numbers of videos on the Internet, so malicious actors can forge videos at will, spreading false information and creating confusion that threatens society. Detection of Deepfake videos is therefore urgent and of great practical significance.
Existing Deepfake video detection methods can be divided into detection based on intra-frame artifacts and detection based on inter-frame temporal characteristics. The intra-frame artifact approach first decomposes the video into frames, analyzes every frame, and judges the authenticity of the video by averaging the per-frame results to obtain a video-level prediction. This approach is similar to image detection, except that compression reduces the sharpness of the video frames and makes detection harder. Even if a CNN can correctly classify individual frames, predicting the authenticity of a video by simple averaging is not accurate. The second approach, based on inter-frame temporal characteristics, treats the video as a whole and considers the temporal correlation between frames, which evaluates Deepfake videos more reasonably. However, both approaches share a common problem: a result can only be obtained by analyzing the entire video, and the similarity between video frames inevitably creates highly redundant information, so these detection methods are computationally expensive and slow.
Disclosure of Invention
To solve the problems of heavy computation and low efficiency in existing Deepfake video detection techniques, the invention provides a Deepfake video detection method based on an image group and a two-stream network. The technical scheme adopted by the invention is as follows:
a method for detecting a Deepfake video based on an image group and a two-stream network comprises the following steps:
step 1: extracting key frames of a video to be detected to form an image group;
step 2: inputting the first frame of the image group into a spatial stream in a two-stream network to extract spatial features;
step 3: subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract temporal features;
step 4: fusing the extracted spatial and temporal features, and evaluating the authenticity of the video with a dynamic routing algorithm.
Further, in step 1, face-region images are cropped from the video frames at a fixed size, the face-region images of adjacent frames are differenced, the 10 frames whose face regions change most are selected as key frames according to the average intensity of the inter-frame difference, and an image group is formed in temporal order to represent the video.
Further, the inter-frame difference is calculated as
absDiff_i = F_i - F_{i-1},
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, and absDiff_i denotes the difference between them. The average intensity of the inter-frame difference is calculated as
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),
where absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
Further, the two-stream network in steps 2 and 3 comprises a spatial stream and a temporal stream. The spatial stream consists of the first to fifth sequences of a pre-trained ResNet50 network followed by a primary capsule network and is used to extract spatial features; the temporal stream consists of a spatial pyramid pooling network and a GRU network and is used to extract temporal features. The spatial features are assigned, as auxiliary information, to the hidden state of the GRU network, which is used to analyse temporal coherence. The two-stream network is trained with the Adam optimization algorithm, and the loss function is the cross-entropy loss
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively.
Furthermore, the capsules of the primary capsule network share the same structure, each comprising two-dimensional convolution layers, a statistics pooling layer and a one-dimensional convolution layer, where the statistics pooling layer computes the mean and variance of each convolution-kernel output. The mean is calculated as
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,
and the variance as
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
Furthermore, the output of the spatial pyramid pooling network is a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map.
Further, the difference maps in step 3 can be expressed as
Diff_{m-1} = F_m - F_1, m = 2, …, 10,
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
Further, in step 4 the spatial and temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm. The output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that it is a real video. If p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
Compared with the prior art, the invention has the following beneficial effects: key frames selected by inter-frame differencing form an image group that replaces the full video as network input, so the network focuses on learning the features of the key frames, which reduces computational redundancy and improves efficiency; the proposed spatio-temporal two-stream detection network with dynamic-routing classification makes full use of the spatial and temporal features of the image group and effectively improves detection accuracy.
Drawings
FIG. 1 is a method block diagram of the present invention.
Fig. 2 is a schematic structural diagram of a spatial flow network according to the present invention.
Fig. 3 is a schematic structural diagram of the spatial pyramid pooling network of the present invention.
Fig. 4 is pseudo code of the dynamic routing algorithm of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
FIG. 1 shows a flow chart of the present invention, which comprises the following steps:
(1) Extracting key frames from the video to be detected to form an image group
The video to be detected is cropped at a fixed size to obtain face-region images, and key frames are extracted by inter-frame differencing to form the image group fed to the network: adjacent face-region images are differenced, and the frames with the larger changes, measured by the average intensity of the inter-frame difference, are taken as key frames. Since there is strong temporal correlation between video frames, the 10 extracted key frames are combined in temporal order into an image group representing the video so that temporal features are not lost. The inter-frame difference is calculated by formula (1) and the average intensity of the inter-frame difference by formula (2),
absDiff_i = F_i - F_{i-1},    (1)
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),    (2)
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, absDiff_i is the difference between them, absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
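As an illustration of this key-frame selection step (formulas (1) and (2)), a minimal Python sketch using OpenCV and NumPy is given below; the fixed crop box, the helper name extract_image_group, and the default of 10 key frames are assumptions for the example, not the patent's reference implementation.

```python
import cv2
import numpy as np

def extract_image_group(video_path, crop_box, num_keyframes=10):
    """Select the num_keyframes face crops whose inter-frame change is largest,
    measured by the mean intensity of the frame difference (formulas (1)-(2)),
    and return them in temporal order as the image group."""
    x, y, w, h = crop_box                                   # fixed-size face region (assumed known)
    cap = cv2.VideoCapture(video_path)
    faces, scores = [], []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        face = frame[y:y + h, x:x + w].astype(np.float32)   # F_i
        if prev is None:
            scores.append(0.0)                              # first frame has no predecessor
        else:
            abs_diff = np.abs(face - prev)                  # absDiff_i
            scores.append(float(abs_diff.mean()))           # diffMean_i
        faces.append(face)
        prev = face
    cap.release()
    top = sorted(np.argsort(scores)[-num_keyframes:])       # largest changes, kept in temporal order
    return [faces[i] for i in top]
```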
(2) The first frame of the group of images is input into the spatial stream in the two-stream network to extract spatial information.
Because the number of available Deepfake videos is small, it is not appropriate to train the network from scratch. To avoid overfitting, part of a ResNet50 network pre-trained on the ILSVRC database is used to extract latent features. Compared with the full ResNet50, using only the first to fifth sequences of the pre-trained network (the blocks of the first and second convolutional stages) is more advantageous for detection, because the full ResNet50 extracts high-level semantic information and thereby ignores the artifact features within the frame.
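A minimal sketch of such a truncated, ImageNet-pretrained feature extractor, assuming torchvision's ResNet50 and assuming the cut point falls after the stem and the first two residual stages (the exact truncation point is not spelled out in the translated text):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_latent_feature_extractor():
    """Truncated ImageNet-pretrained ResNet50 used as the low-level feature
    extractor of the spatial stream (stem + first two residual stages; the
    exact cut point is an assumption based on the description)."""
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # children(): conv1, bn1, relu, maxpool, layer1, layer2, layer3, layer4, avgpool, fc
    extractor = nn.Sequential(*list(resnet.children())[:6])
    for p in extractor.parameters():            # keep the pretrained weights frozen
        p.requires_grad = False
    return extractor

if __name__ == "__main__":
    net = build_latent_feature_extractor()
    feat = net(torch.randn(1, 3, 224, 224))     # -> (1, 512, 28, 28)
    print(feat.shape)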
As shown in fig. 2, the complete capsule network comprises several primary capsule networks for extracting key features and a digit capsule network for classification. The primary capsule network is composed of groups of neurons called capsules. Each capsule could in principle have a different structure; to simplify the computation, the invention uses capsules with identical structure, each comprising a two-dimensional convolution layer, a statistics pooling layer and a one-dimensional convolution layer. The statistics pooling layer computes the mean and variance of each convolution-kernel output, given by formulas (3) and (4), respectively,
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,    (3)
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,    (4)
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
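A sketch of one primary capsule with the statistics pooling of formulas (3) and (4); the channel widths and the output dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistics pooling: reduce each feature map (convolution-kernel output)
    to its mean and variance over the spatial dimensions, as in formulas (3)-(4)."""
    def forward(self, x):                        # x: (batch, channels, H, W)
        mu = x.mean(dim=(2, 3))                  # mu_k
        var = x.var(dim=(2, 3), unbiased=False)  # sigma_k^2
        return torch.stack([mu, var], dim=-1)    # (batch, channels, 2)

class PrimaryCapsule(nn.Module):
    """One primary capsule: 2-D conv -> statistics pooling -> 1-D conv.
    Channel widths are illustrative, not taken from the patent."""
    def __init__(self, in_ch=512, mid_ch=64, out_dim=8):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.stats = StatsPooling()
        self.conv1d = nn.Conv1d(mid_ch, out_dim, kernel_size=2)

    def forward(self, x):
        s = self.stats(self.conv2d(x))           # (batch, mid_ch, 2)
        return self.conv1d(s).squeeze(-1)        # capsule vector, (batch, out_dim)
```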
In image processing, a CNN focuses on detecting the important features in an image but ignores the spatial relationships between those features. The capsule network instead learns features with each complete capsule: each capsule represents the features of a different facial region, such as the eyes, nose or mouth, and outputs a direction vector that can reflect spatial hierarchy information, which makes it more robust for fake-face detection.
(3) The temporal stream of the two-stream network extracts inter-frame inconsistency from the remaining frames of the image group
Because the frames in the image group are highly similar, and spatial features are already analysed on the main (first) frame, the remaining frames are each subtracted from the main frame to obtain the difference-image sequence, as in formula (5), which reduces feature redundancy and saves computing resources.
Diff_{m-1} = F_m - F_1, m = 2, …, 10,    (5)
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
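Formula (5) translates directly into a short helper; the function name is an assumption:

```python
import numpy as np

def build_difference_sequence(image_group):
    """Given the image group [F_1, ..., F_10], return the difference maps
    Diff_{m-1} = F_m - F_1 for m = 2..len(image_group), as in formula (5)."""
    first = image_group[0].astype(np.float32)
    return [frame.astype(np.float32) - first for frame in image_group[1:]]
```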
After the difference-map sequence is generated, the temporal coherence between frames is analysed with a GRU network. GRU networks are commonly used for text analysis, where each cell predicts a word represented by a one-dimensional vector; the face difference maps in the invention are three-dimensional and would have to be flattened into one dimension to fit the GRU. Because the difference maps are sparse, direct flattening wastes space and increases the amount of computation, so the invention uses a spatial pyramid pooling network to extract the key information of each three-dimensional difference map. The spatial pyramid pooling network produces a fixed-size output regardless of the input size: as shown in fig. 3, the difference map is pooled at several scales and the pooled features of each scale are concatenated into a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map, and N is generally 3-5.
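A sketch of the spatial pyramid pooling step, assuming an n x n max-pooling grid at each of the N levels (the exact bin layout of the original formula is not recoverable from the text, so the vector length shown is an assumption):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(diff_map, levels=(1, 2, 3)):
    """Pool a (C, H, W) difference map over an n x n grid for each pyramid
    level n and concatenate the results into one fixed-length 1-D vector of
    size C * sum(n^2); the n x n grid per level is an assumption."""
    c = diff_map.shape[0]
    pooled = [F.adaptive_max_pool2d(diff_map.unsqueeze(0), n).reshape(c * n * n)
              for n in levels]
    return torch.cat(pooled)

if __name__ == "__main__":
    vec = spatial_pyramid_pool(torch.randn(3, 64, 64))
    print(vec.shape)          # torch.Size([42]) = 3 * (1 + 4 + 9)
```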
The one-dimensional feature vector learned from each three-dimensional difference map is input into the GRU network to extract temporal inconsistency information. Compared with an LSTM network, a GRU uses a single update gate to control both forgetting and memorizing, which greatly reduces the number of parameters and speeds up training. The hidden state of a GRU is normally initialized to zero; in the invention, the spatial features extracted by the spatial stream are instead assigned to the hidden state as auxiliary information. Because the temporal-stream inputs are differences against the first frame, many important features are lost; those features have already been extracted by the spatial stream, so they are introduced into the temporal stream directly, avoiding repeated extraction of spatial features, reducing redundancy and accelerating training and detection.
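A sketch of the temporal stream described here: a GRU over the SPP vectors of the nine difference maps whose initial hidden state is a projection of the spatial-stream feature; all dimensions and the projection layer are assumptions.

```python
import torch
import torch.nn as nn

class TemporalStream(nn.Module):
    """GRU over the SPP vectors of the difference maps. Instead of a zero
    initial hidden state, the spatial-stream feature is projected and used
    as h_0, so spatial information does not have to be re-extracted here.
    All dimensions are illustrative assumptions."""
    def __init__(self, spp_dim=42, spatial_dim=256, hidden_dim=128):
        super().__init__()
        self.init_h = nn.Linear(spatial_dim, hidden_dim)          # spatial feature -> h_0
        self.gru = nn.GRU(spp_dim, hidden_dim, batch_first=True)

    def forward(self, spp_seq, spatial_feat):
        # spp_seq: (batch, 9, spp_dim); spatial_feat: (batch, spatial_dim)
        h0 = torch.tanh(self.init_h(spatial_feat)).unsqueeze(0)   # (1, batch, hidden)
        _, h_n = self.gru(spp_seq, h0)
        return h_n.squeeze(0)                                     # temporal feature

if __name__ == "__main__":
    net = TemporalStream()
    out = net(torch.randn(4, 9, 42), torch.randn(4, 256))
    print(out.shape)                                              # torch.Size([4, 128])
```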
(4) Evaluating the authenticity of the video to be detected with the dynamic routing algorithm
After the two-stream network has learned the temporal and spatial features, the two are concatenated to fuse the spatio-temporal features, and the probability that the video is genuine is computed with a dynamic routing algorithm to obtain the video-level evaluation result. The dynamic routing algorithm was proposed for capsule networks and can be regarded as a vector version of a fully connected layer: the length of a vector expresses the probability that an entity exists, so features can be routed more accurately to the category they belong to. The algorithm is given as pseudo-code in fig. 4. The spatial and temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm; the output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that it is a real video. If p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
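A compact sketch of this classification stage: routing-by-agreement (the standard capsule-network dynamic routing, which the pseudo-code of fig. 4 follows) from the fused capsules to two digit capsules, followed by the softmax-average-and-compare decision. Tensor shapes, the squash non-linearity and the exact placement of the softmax are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing non-linearity: keeps the direction, maps the length into [0, 1)."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement from primary-capsule predictions to digit capsules.
    u_hat: (batch, num_in, num_out, out_dim) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)       # routing logits
    for _ in range(iterations):
        c = F.softmax(b, dim=2)                                  # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))         # (batch, num_out, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)             # agreement update
    return v

def predict(u_hat):
    """Average the two digit-capsule outputs after softmax and compare them:
    label 0 = Deepfake, label 1 = real (the patent's label convention)."""
    v = dynamic_routing(u_hat)                                   # (batch, 2, out_dim)
    probs = F.softmax(v, dim=1).mean(dim=-1)                     # (batch, 2): [p_fake, p_real]
    labels = (probs[:, 1] >= probs[:, 0]).long()                 # 1 if p_real >= p_fake, else 0
    return labels, probs
```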
Since the capsule network is used here for forensics and does not perform reconstruction, the network is trained only with the cross-entropy loss,
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively. The training data come from the FaceForensics++ dataset: key frames are extracted from each video in the dataset to form image groups, and an image group from a Deepfake video is labeled 0 while an image group from a real video is labeled 1.
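A minimal training-loop sketch under the stated setup (Adam, cross-entropy, label 0 for Deepfake and 1 for real); the model, data loader and hyper-parameters are placeholders, not the patent's code.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Train the two-stream network with Adam and cross-entropy.
    `model` maps an image group to a single probability of being real;
    image groups from Deepfake videos are labeled 0, real videos 1."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()                        # L = -[y log y_hat + (1-y) log(1-y_hat)]
    for epoch in range(epochs):
        total = 0.0
        for image_group, label in loader:           # label: 0 = Deepfake, 1 = real
            image_group = image_group.to(device)
            label = label.float().to(device)
            y_hat = model(image_group)               # predicted probability of "real"
            loss = criterion(y_hat, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / max(len(loader), 1):.4f}")
```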
In conclusion, the proposed Deepfake video detection method uses the image group to greatly reduce computational redundancy and lets the network concentrate on the key frames; the two-stream network extracts the spatial and temporal features of the image group, fully mining the key cues of video authenticity as the basis for judgment; finally, classification with the dynamic routing algorithm yields a more accurate evaluation result.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiment; all technical solutions falling within the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention also fall within the protection scope of the invention.

Claims (8)

1. A method for detecting a Deepfake video based on an image group and a two-stream network is characterized by comprising the following steps:
step 1: extracting key frames of the video to be detected to form an image group;
step 2: inputting the first frame of the image group into the spatial stream of the two-stream network to extract spatial information as spatial features;
step 3: subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract the inter-frame inconsistency as temporal features;
step 4: fusing the extracted spatial features and temporal features, and evaluating the authenticity of the video with a dynamic routing algorithm.
2. The method as claimed in claim 1, wherein in step 1, face-region images are cropped from the video frames at a fixed size, the face-region images of adjacent frames are differenced, the 10 frames whose face regions change most are extracted as key frames according to the average intensity of the inter-frame difference, and an image group is formed in temporal order to represent the video.
3. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 2, wherein the inter-frame difference is calculated as
absDiff_i = F_i - F_{i-1},
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, and absDiff_i denotes the difference between them; the average intensity of the inter-frame difference is calculated as
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),
where absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
4. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 1, wherein the two-stream network in steps 2 and 3 comprises a spatial stream and a temporal stream; the spatial stream consists of the first to fifth sequences of a pre-trained ResNet50 network and a primary capsule network and is used to extract the spatial features; the temporal stream consists of a spatial pyramid pooling network and a GRU network and is used to extract the temporal features; the spatial features are assigned, as auxiliary information, to the hidden state of the GRU network; the GRU network is used to analyse temporal coherence; the two-stream network is trained with the Adam optimization algorithm, and the loss function is the cross-entropy loss
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively.
5. The method as claimed in claim 4, wherein the capsules of the primary capsule network share the same structure, each comprising two-dimensional convolution layers, a statistics pooling layer and a one-dimensional convolution layer, wherein the statistics pooling layer computes the mean and variance of each convolution-kernel output; the mean is calculated as
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,
and the variance as
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
6. The method as claimed in claim 4, wherein the output of the spatial pyramid pooling network is a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map.
7. The method as claimed in claim 1, wherein the difference maps in step 3 are expressed as
Diff_{m-1} = F_m - F_1, m = 2, …, 10,
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
8. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 1, wherein in step 4 the spatial features and the temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm; the output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that the video is a real video; if p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
CN202110717852.2A 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network Active CN113283393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717852.2A CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110717852.2A CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Publications (2)

Publication Number Publication Date
CN113283393A true CN113283393A (en) 2021-08-20
CN113283393B CN113283393B (en) 2023-07-25

Family

ID=77285677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717852.2A Active CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Country Status (1)

Country Link
CN (1) CN113283393B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494804A (en) * 2022-04-18 2022-05-13 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030090505A1 (en) * 1999-11-04 2003-05-15 Koninklijke Philips Electronics N.V. Significant scene detection and frame filtering for a visual indexing system using dynamic thresholds
US20120008836A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Sequential event detection from video
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Longitudinal federated learning system optimization method, device, equipment and readable storage medium
CN111182292A (en) * 2020-01-05 2020-05-19 西安电子科技大学 No-reference video quality evaluation method and system, video receiver and intelligent terminal
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN112163488A (en) * 2020-09-21 2021-01-01 中国科学院信息工程研究所 Video false face detection method and electronic device
US20210042529A1 (en) * 2019-08-07 2021-02-11 Zerofox, Inc. Methods and systems for detecting deepfakes
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN112801037A (en) * 2021-03-01 2021-05-14 山东政法学院 Face tampering detection method based on continuous inter-frame difference
CN112927202A (en) * 2021-02-25 2021-06-08 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN112991278A (en) * 2021-03-01 2021-06-18 华南理工大学 Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030090505A1 (en) * 1999-11-04 2003-05-15 Koninklijke Philips Electronics N.V. Significant scene detection and frame filtering for a visual indexing system using dynamic thresholds
US20120008836A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Sequential event detection from video
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
US20210042529A1 (en) * 2019-08-07 2021-02-11 Zerofox, Inc. Methods and systems for detecting deepfakes
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Longitudinal federated learning system optimization method, device, equipment and readable storage medium
CN111182292A (en) * 2020-01-05 2020-05-19 西安电子科技大学 No-reference video quality evaluation method and system, video receiver and intelligent terminal
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
CN112163488A (en) * 2020-09-21 2021-01-01 中国科学院信息工程研究所 Video false face detection method and electronic device
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN112927202A (en) * 2021-02-25 2021-06-08 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN112801037A (en) * 2021-03-01 2021-05-14 山东政法学院 Face tampering detection method based on continuous inter-frame difference
CN112991278A (en) * 2021-03-01 2021-06-18 华南理工大学 Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
AKUL MEHRA et al.: "Deepfake Detection using Capsule Networks and Long Short-Term Memory Networks", HTTPS://PURL.UTWENTE.NL//ESSAYS/83028, pages 407-414 *
OSCAR DE LIMA et al.: "Deepfake Detection using Spatiotemporal Convolutional Networks", ARXIV, pages 1-6 *
ZHANG YIXUAN et al.: "Face tampering video detection method based on inter-frame difference" (基于帧间差异的人脸篡改视频检测方法), Journal of Cyber Security (信息安全学报), vol. 05, no. 02, pages 49-72 *
ZHANG MEIGUI: "Keyframe-based Deepfake video detection algorithm" (基于关键帧的Deepfake视频检测算法), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库信息科技辑), no. 2023, pages 138-2734 *
GENG PENGZHI et al.: "Deepfake detection method based on tampering artifacts" (基于篡改伪影的深度伪造检测方法), Computer Engineering (计算机工程), vol. 47, no. 12, pages 156-162 *
ZHAO LEI et al.: "Deepfake video detection model based on spatio-temporal feature consistency" (基于时空特征一致性的Deepfake视频检测模型), Advanced Engineering Sciences (工程科学与技术), vol. 52, no. 04, pages 243-250 *
XIANG JUN et al.: "Research on the influence of temporal models on video person re-identification performance" (时域模型对视频行人重识别性能影响的研究), Computer Engineering and Applications (计算机工程与应用), vol. 56, no. 20, pages 152-157 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494804A (en) * 2022-04-18 2022-05-13 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition
CN114494804B (en) * 2022-04-18 2022-10-25 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition

Also Published As

Publication number Publication date
CN113283393B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN112347859B (en) Method for detecting significance target of optical remote sensing image
Wang et al. Deep metric learning for crowdedness regression
Fan et al. A survey of crowd counting and density estimation based on convolutional neural network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN109740419A (en) A kind of video behavior recognition methods based on Attention-LSTM network
CN111723693B (en) Crowd counting method based on small sample learning
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN112560831B (en) Pedestrian attribute identification method based on multi-scale space correction
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
CN109829495A (en) Timing image prediction method based on LSTM and DCGAN
CN111931602A (en) Multi-stream segmented network human body action identification method and system based on attention mechanism
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
CN115113165A (en) Radar echo extrapolation method, device and system
CN113283393B (en) Deepfake video detection method based on image group and two-stream network
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN113033283B (en) Improved video classification system
CN113221683A (en) Expression recognition method based on CNN model in teaching scene
CN114758285B (en) Video interaction action detection method based on anchor freedom and long-term attention perception
CN115170985B (en) Remote sensing image semantic segmentation network and segmentation method based on threshold attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant