CN114758272A - Forged video detection method based on frequency domain self-attention

Info

Publication number
CN114758272A
Authority
CN
China
Prior art keywords: video, face image, video frame, image, forged
Prior art date
Legal status: Pending
Application number
CN202210334683.9A
Other languages
Chinese (zh)
Inventor
李邵梅
吉立新
黄瑞阳
马欣
杨帆
高超
张建朋
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202210334683.9A
Publication of CN114758272A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention provides a forged video detection method based on frequency domain self-attention. The method comprises the following steps: dividing a video to be detected into a plurality of video frames; determining whether each video frame is a forged video frame, which specifically comprises: extracting the face image in the current video frame and recording it as the original face image; extracting the phase spectrum of the original face image, reconstructing the original face image based on the phase spectrum, and recording the result as the reconstructed face image; splitting the reconstructed face image into a plurality of image blocks of the same size and converting the image blocks into sequence data; inputting the sequence data into a trained Transformer model to extract a feature vector, inputting the feature vector into a multi-layer perceptron, and determining whether the video frame corresponding to the feature vector is a forged video frame; and counting the number of forged video frames and the number of real video frames: if the former is greater than the latter, the video to be detected is considered a forged video; otherwise it is considered a real video.

Description

Forged video detection method based on frequency domain self-attention
Technical Field
The invention relates to the technical field of video processing and cyberspace security, and in particular to a forged video detection method based on frequency domain self-attention.
Background
Traditional forged-video face detection methods are mainly based on CNNs (convolutional neural networks). Research in recent years has found that Transformer-based attention models can achieve better performance in forged video detection. However, existing Transformer-based forged video detection models only learn forgery features from the original image pixels or CNN-based feature maps and do not consider phase spectrum features obtained by frequency-domain transformation, so there is still room to improve detection accuracy.
Disclosure of Invention
Aiming at the problem of low detection accuracy of existing forged video detection methods, the invention provides a forged video detection method based on frequency domain self-attention.
The invention provides a forged video detection method based on frequency domain self-attention, which comprises the following steps:
step 1: dividing a video to be detected into a plurality of video frames;
step 2: determining whether each video frame is a forged video frame, which specifically comprises:
step 2.1: extracting the face image in the current video frame and recording it as the original face image; extracting the phase spectrum of the original face image, reconstructing the original face image based on the phase spectrum, and recording the result as the reconstructed face image;
step 2.2: splitting the reconstructed face image into a plurality of image blocks of the same size and converting the image blocks into sequence data;
step 2.3: inputting the sequence data into a trained Transformer model to extract a feature vector, inputting the feature vector into a multi-layer perceptron, and determining whether the video frame corresponding to the feature vector is a forged video frame;
step 3: counting the number of forged video frames and the number of real video frames; if the former is greater than the latter, the video to be detected is considered a forged video, otherwise it is considered a real video.
Further, step 2.1 specifically includes:
converting the original face image I(x, y) into a grayscale image I_g(x, y); performing a fast Fourier transform on I_g(x, y) according to formula (1) to obtain F(x, y); computing the phase spectrum S(x, y) according to formula (2); and finally obtaining the reconstructed face image P(x, y) according to formula (3);
F(x, y) = FFT(I_g(x, y))    (1)
S(x, y) = p(F(x, y))    (2)
P(x, y) = IFFT(e^{i·S(x, y)})    (3)
wherein FFT(·) and IFFT(·) denote the fast Fourier transform and inverse fast Fourier transform, respectively, and p(·) extracts the phase angle.
Further, step 2.2 specifically includes:
setting the size of the reconstructed face image to H × W and the size of each image block to P × P, obtaining N image blocks, where N = (H × W) / P²;
converting the N image blocks into the sequence data z_0 according to formula (4);
z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos    (4)
wherein x_class is a D-dimensional learnable class-related variable, x_p^1, …, x_p^N are the N flattened P × P pixel blocks, E is the linear mapping matrix that projects each image block to a D-dimensional embedding, and E_pos is the position embedding matrix.
Further, step 2.3 specifically includes:
the feature extraction process of the Transformer model is represented by formulas (5)-(6), and the decision process of the multi-layer perceptron is represented by formula (7):
z'_l = MHA(LN(z_{l-1})) + z_{l-1},  l = 1…L    (5)
z_l = MLP(LN(z'_l)) + z'_l,  l = 1…L    (6)
y = MLP(LN(z_L^0))    (7)
wherein MHA(·) denotes the multi-head attention mechanism, LN(·) denotes layer normalization, and MLP(·) denotes a multi-layer perceptron; the multi-layer perceptron in formula (6) is recorded as the first multi-layer perceptron and the one in formula (7) as the second multi-layer perceptron; L is the total number of Transformer layers, l denotes the l-th layer, z_l is the output of the l-th layer MLP, z'_l is the output of the l-th layer MHA, and z_L^0 denotes the data of the 1st dimension of z_L.
Further, the first multi-layer perceptron consists of two hidden layers; the first hidden layer has H_1 nodes and the second hidden layer has H_2 nodes, where H_1 = D and H_2 equals the output dimension of the multi-layer perceptron;
the calculation formula of the first hidden layer is shown as formula (10), and that of the second hidden layer as formula (11):
a_i^(1) = Σ_j W_ij^(1) x_j,  h_i^(1) = g(a_i^(1))    (10)
a_i^(2) = Σ_j W_ij^(2) h_j^(1),  h_i^(2) = g(a_i^(2))    (11)
wherein W^(1) and W^(2) are the learnable weights of the first hidden layer and the second hidden layer, respectively, g(·) is the activation function, a_i^(1) and a_i^(2) are the intermediate temporary values of the i-th hidden node of the first and second hidden layers of the MLP, h_i^(1) and h_i^(2) are the outputs of the i-th hidden node of the first and second hidden layers of the MLP, and x_j is the input of the j-th dimension.
Further, the second multi-layer perceptron has one hidden layer with two nodes; the value of the first node is taken as the probability that the video frame is a real video frame, and the value of the second node as the probability that the video frame is a forged video frame.
The invention has the beneficial effects that:
By combining the Transformer model with frequency-domain phase spectrum features, the method achieves better forged-video feature extraction than traditional CNN-based feature extraction networks; and compared with existing Transformer-based forged video detection methods, it takes the influence of the phase spectrum features into account and can further improve the detection accuracy for forged videos.
Drawings
Fig. 1 is a schematic flowchart of a method for detecting a forged video based on frequency domain self-attention according to an embodiment of the present invention;
fig. 2 is an effect diagram of reconstructing a face image based on a phase spectrum according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a process of determining whether a current video frame is a forged video frame according to an embodiment of the present invention;
FIG. 4 is a structural diagram of the prior-art Transformer model;
fig. 5 is a schematic diagram of adding the learnable class embedding and the position embedding to the input embedding according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be described clearly below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a forged video detection method based on frequency domain self-attention, including the following steps:
S101: dividing a video to be detected into a plurality of video frames;
S102: determining whether each video frame is a forged video frame, which specifically comprises:
S1021: as shown in fig. 2, extracting the face image in the current video frame and recording it as the original face image; extracting the phase spectrum of the original face image, reconstructing the original face image based on the phase spectrum, and recording the result as the reconstructed face image;
Specifically, the classical RetinaFace model can be used to extract the face image from the video frame. Since the phase information does not depend on the color of the image, the original face image I(x, y) is converted into a grayscale image I_g(x, y); a fast Fourier transform is applied to I_g(x, y) according to formula (1) to obtain F(x, y); the phase spectrum S(x, y) is then computed according to formula (2); finally, the reconstructed face image P(x, y) is obtained according to formula (3);
F(x, y) = FFT(I_g(x, y))    (1)
S(x, y) = p(F(x, y))    (2)
P(x, y) = IFFT(e^{i·S(x, y)})    (3)
wherein FFT(·) and IFFT(·) denote the fast Fourier transform and inverse fast Fourier transform, respectively, and p(·) extracts the phase angle;
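The phase-only reconstruction of formulas (1)-(3) can be illustrated with a short script. The following is a minimal sketch assuming NumPy and OpenCV are available; the function name reconstruct_from_phase and the commented file name are illustrative and not part of the patent.

```python
import cv2
import numpy as np

def reconstruct_from_phase(face_bgr: np.ndarray) -> np.ndarray:
    """Rebuild a face image from its Fourier phase spectrum only (formulas (1)-(3))."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)  # I_g(x, y)
    spectrum = np.fft.fft2(gray)              # F(x, y) = FFT(I_g(x, y)), formula (1)
    phase = np.angle(spectrum)                # S(x, y) = p(F(x, y)), formula (2)
    recon = np.fft.ifft2(np.exp(1j * phase))  # P(x, y) = IFFT(e^{i*S(x, y)}), formula (3)
    return np.real(recon)

# usage (illustrative): face = cv2.imread("face_crop.png")   # e.g. a RetinaFace crop
#                       recon = reconstruct_from_phase(face)
```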
S1022: as shown in fig. 3, splitting the reconstructed face image into a plurality of image blocks of the same size and converting the image blocks into sequence data;
Specifically, if the size of the reconstructed face image is H × W and the size of each image block is P × P, N image blocks are obtained, where N = (H × W) / P².
The N image blocks are converted into the sequence data z_0 according to formula (4):
z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos    (4)
wherein x_class is a D-dimensional learnable class-related variable, x_p^1, …, x_p^N are the N flattened P × P pixel blocks, E is the linear mapping matrix that projects each image block to a D-dimensional embedding, and E_pos is the position embedding matrix.
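A minimal sketch of formula (4) in NumPy follows; the random matrices stand in for the learnable parameters x_class, E and E_pos (which would be trained in practice), and the helper name to_sequence is illustrative.

```python
import numpy as np

def to_sequence(img: np.ndarray, P: int, D: int, rng=np.random.default_rng(0)):
    """Split an H x W image into N = (H*W)/P^2 blocks and build z_0 as in formula (4)."""
    H, W = img.shape
    N = (H * W) // (P * P)
    # x_p^1 ... x_p^N: the N flattened P x P blocks
    patches = (img.reshape(H // P, P, W // P, P)
                  .transpose(0, 2, 1, 3)
                  .reshape(N, P * P))
    E = rng.standard_normal((P * P, D))       # linear mapping to D-dimensional embeddings
    x_class = rng.standard_normal((1, D))     # learnable class token (random placeholder)
    E_pos = rng.standard_normal((N + 1, D))   # position embedding matrix
    return np.concatenate([x_class, patches @ E], axis=0) + E_pos   # z_0, shape (N+1, D)

# e.g. a 256 x 256 reconstruction with P = 32 and D = 1024 yields z_0 of shape (65, 1024).
```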
S1023: as shown in fig. 3, inputting the sequence data into the trained Transformer model to extract a feature vector, inputting the feature vector into the multi-layer perceptron, and determining whether the video frame corresponding to the feature vector is a forged video frame;
Specifically, the feature extraction process is represented by formulas (5)-(6), and the decision process of the multi-layer perceptron by formula (7):
z'_l = MHA(LN(z_{l-1})) + z_{l-1},  l = 1…L    (5)
z_l = MLP(LN(z'_l)) + z'_l,  l = 1…L    (6)
y = MLP(LN(z_L^0))    (7)
wherein MHA(·) denotes the multi-head attention mechanism, LN(·) denotes layer normalization, and MLP(·) denotes a multi-layer perceptron; for the sake of distinction and description, the multi-layer perceptron in formula (6) is referred to as the first multi-layer perceptron and the one in formula (7) as the second multi-layer perceptron; L is the total number of Transformer layers, l denotes the l-th layer, z_l is the output of the l-th layer MLP, z'_l is the output of the l-th layer MHA, and z_L^0 denotes the data of the 1st dimension of z_L.
It should be noted that LN(·) normalizes one or several dimensions of the input. For example, to normalize an input x = {x_1, x_2, …, x_n} along a given dimension, LN(·) is calculated according to formula (8):
LN(x) = (x − E[x]) / √(Var[x] + ε)    (8)
wherein E[x] = (1/n) Σ_i x_i is the mean of x, Var[x] = (1/n) Σ_i (x_i − E[x])² is the variance of x, and ε is a small value added to prevent the denominator from being 0, typically 1e-05.
MHA(·) is calculated from the scaled dot-product attention in formula (9):
Attention(Q, K, V) = softmax(Q K^T / √d) V, with Q = z W_Q, K = z W_K, V = z W_V    (9)
wherein W_Q, W_K and W_V are learnable parameter matrices and d is the dimension of K^T Q. For softmax(·), let K denote the number of output categories of the neural network, v the output vector, v_j the value of the j-th output category in v, and i the category currently being computed; the result lies between 0 and 1, and the softmax values of all categories sum to 1:
softmax(v_i) = e^{v_i} / Σ_{j=1}^{K} e^{v_j}
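The layer normalization, softmax and scaled dot-product attention described above can be sketched in NumPy as follows; this is a single-head illustration with random placeholder weights (the embodiment described later uses 16 heads), and the function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # (x - E[x]) / sqrt(Var[x] + eps), applied over the last dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))   # stabilised e^{v_i} / sum_j e^{v_j}
    return e / e.sum(axis=axis, keepdims=True)

def attention(z, W_Q, W_K, W_V):
    # softmax(Q K^T / sqrt(d)) V with Q = z W_Q, K = z W_K, V = z W_V
    Q, K, V = z @ W_Q, z @ W_K, z @ W_V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# usage (illustrative): z = np.random.randn(65, 1024)
# W_Q, W_K, W_V = (np.random.randn(1024, 64) for _ in range(3))
# out = attention(layer_norm(z), W_Q, W_K, W_V)   # one attention head over the 65 tokens
```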
In this embodiment, the first multi-layer perceptron is composed of two hidden layers; the first hidden layer of the first multi-layer perceptron has H_1 nodes and the second hidden layer has H_2 nodes, where H_1 = D and H_2 equals the output dimension of the multi-layer perceptron;
the calculation formula of the first hidden layer is shown as formula (10), and the calculation formula of the second hidden layer is shown as formula (11):
Figure BDA0003576292610000072
Figure BDA0003576292610000073
Wherein the content of the first and second substances,
Figure BDA0003576292610000074
representing the learnable weights of the first hidden layer and the second hidden layer, respectively, g (-) represents the activation function,
Figure BDA0003576292610000075
an intermediate temporary value representing the i-th hidden node of the first hidden layer of the MLP,
Figure BDA0003576292610000076
an intermediate temporary value representing the i-th hidden node of the second hidden layer of the MLP,
Figure BDA0003576292610000077
the output of the i-th hidden node representing the first hidden layer of the MLP, as can be seen from equation (10), is via
Figure BDA0003576292610000078
Sending the activation function to obtain;
Figure BDA0003576292610000079
the output of the i-th hidden node representing the second hidden layer of the MLP, as seen by equation (11), is via
Figure BDA00035762926100000710
Sending the activation function to obtain; x is the number ofjIs the input for the j-th dimension.
In this embodiment, the ReLU function is used as the activation function; its calculation formula is:
g(x) = max(0, x)
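The first multi-layer perceptron of formulas (10)-(11), with the ReLU activation above, can be sketched as follows; the weight matrices are random placeholders for the learned parameters, and the sizes in the usage comment follow the embodiment described later.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)          # g(x) = max(0, x)

def first_mlp(x, W1, W2):
    a1 = x @ W1                        # a_i^(1) = sum_j W_ij^(1) x_j, formula (10)
    h1 = relu(a1)                      # h_i^(1) = g(a_i^(1))
    a2 = h1 @ W2                       # a_i^(2) = sum_j W_ij^(2) h_j^(1), formula (11)
    return relu(a2)                    # h_i^(2) = g(a_i^(2))

# usage (illustrative): x = np.random.randn(65, 1024)
# W1, W2 = np.random.randn(1024, 2048), np.random.randn(2048, 1024)
# out = first_mlp(x, W1, W2)           # fed back through the residual connection of formula (6)
```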
In this embodiment, the second multi-layer perceptron has one hidden layer with two nodes; the value of the first node is taken as the probability that the video frame is a real video frame, and the value of the second node as the probability that the video frame is a forged video frame. If the value of the first node is greater than that of the second node, the current video frame is considered a real video frame; otherwise it is considered a forged video frame.
As an implementation, the training process of the Transformer model is as follows:
First, M real face images and M forged face images generated by deep forgery are collected. Then the RetinaFace model is used to locate the face regions in the M real and M forged face images, and the face regions are cropped out. The extracted real and forged face images are then reconstructed based on the phase spectrum; the M reconstructed real face images form the positive sample set p = {p_1, p_2, …, p_M}, and the M reconstructed forged face images form the negative sample set n = {n_1, n_2, …, n_M}. The label of each sample in the positive sample set is set to 1 and the label of each sample in the negative sample set to 0. Finally, p and n are fed into the network shown in FIG. 4 for training. Preferably, M = 100.
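The training-set construction described above can be sketched as follows, reusing the reconstruct_from_phase and to_sequence helpers from the earlier sketches; the folder names, file pattern and the train(...) call are assumptions for illustration only.

```python
import glob
import cv2

def build_dataset(real_dir: str, fake_dir: str, M: int = 100):
    samples, labels = [], []
    for label, folder in ((1, real_dir), (0, fake_dir)):       # 1 = real, 0 = forged
        for path in sorted(glob.glob(folder + "/*.png"))[:M]:
            face = cv2.imread(path)                             # cropped face, e.g. from RetinaFace
            recon = reconstruct_from_phase(face)                # phase-only reconstruction
            recon = cv2.resize(recon.astype("float32"), (256, 256))
            samples.append(to_sequence(recon, P=32, D=1024))
            labels.append(label)
    return samples, labels

# samples, labels = build_dataset("real_faces", "fake_faces")  # hypothetical folders
# train(transformer_model, samples, labels)                    # training routine not shown here
```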
S103: counting the number of forged video frames and the number of real video frames; if the former is greater than the latter, the video to be detected is considered a forged video, otherwise it is considered a real video.
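The video-level decision of S103 reduces to a majority count over the per-frame decisions, as in the sketch below; is_forged_frame stands for the frame-level pipeline of S1021-S1023 and is a placeholder name.

```python
def detect_video(frames, is_forged_frame) -> str:
    forged = sum(1 for frame in frames if is_forged_frame(frame))
    real = len(frames) - forged
    return "forged video" if forged > real else "real video"
```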
Example 2
The following describes the processing flow by taking the detection of one video frame as an example. First, the RetinaFace model is used to extract the face region from the video frame, and then the phase-spectrum-reconstructed face image is obtained using formulas (1)-(3).
For each phase-spectrum-reconstructed grayscale face image, the image is first resized to 256 × 256 and then cut into image blocks of size 32 × 32, giving 64 image blocks; each image block is mapped to 32 × 32 = 1024 dimensions by linear mapping. The 64 image block embeddings together with one 1 × 1024-dimensional learnable classification embedding (the vector indicated at the far left of fig. 3) form a 65 × 1024-dimensional embedding. Considering that the positional relationship between image blocks is meaningful for understanding the contents of an image, a 65 × 1024-dimensional learnable position embedding is added to this embedding to obtain the final 65 × 1024-dimensional embedding used as the input of the Transformer model.
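As a quick check of the sizes quoted above (a sketch, not part of the patent):

```python
H = W = 256; P = 32; D = 1024
N = (H * W) // (P * P)     # 64 image blocks of 32 x 32
print(N, P * P, N + 1, D)  # 64 1024 65 1024 -> a 65 x 1024 input embedding
```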
The above-mentioned 65 × 1024-dimensional embedding I_1 is fed into the Transformer model to extract features. The Transformer model consists of 6 (i.e., L = 6) copies of the network structure shown in FIG. 4; the multi-head attention mechanism has 16 attention heads, the number of nodes H_1 of the 1st hidden layer of the multi-layer perceptron is 2048, and the number of nodes H_2 of the 2nd hidden layer is 1024. After Transformer encoding, a new 65 × 1024-dimensional image representation I_2 is output.
The 1024-dimensional vector in the 1st row of I_2 is extracted as the learned category vector and input into the multi-layer perceptron at the top of fig. 3. This perceptron has only one hidden layer with 2 nodes and converts the 1024-dimensional input vector into a 2-dimensional category vector, which here is {0.03, 0.4}; since 0.03 < 0.4, the video frame is judged to be a forged video frame.
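The final decision of this example can be sketched as follows; the weights and the resulting scores are placeholders, with the 2-dimensional output standing in for the {0.03, 0.4} category vector above.

```python
import numpy as np

I2 = np.random.randn(65, 1024)           # stand-in for the Transformer output
W_head = np.random.randn(1024, 2)        # second multi-layer perceptron (2 nodes)
real_score, fake_score = I2[0] @ W_head  # row 0 of I2 is the learned category vector
verdict = "real frame" if real_score > fake_score else "forged frame"
```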
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A forged video detection method based on frequency domain self-attention, characterized by comprising the following steps:
step 1: dividing a video to be detected into a plurality of video frames;
step 2: determining whether each video frame is a forged video frame, which specifically comprises:
step 2.1: extracting the face image in the current video frame and recording it as the original face image; extracting the phase spectrum of the original face image, reconstructing the original face image based on the phase spectrum, and recording the result as the reconstructed face image;
step 2.2: splitting the reconstructed face image into a plurality of image blocks of the same size and converting the image blocks into sequence data;
step 2.3: inputting the sequence data into a trained Transformer model to extract a feature vector, inputting the feature vector into a multi-layer perceptron, and determining whether the video frame corresponding to the feature vector is a forged video frame;
step 3: counting the number of forged video frames and the number of real video frames; if the former is greater than the latter, the video to be detected is considered a forged video, otherwise it is considered a real video.
2. The forged video detection method based on frequency domain self-attention as claimed in claim 1, wherein step 2.1 specifically comprises:
converting the original face image I(x, y) into a grayscale image I_g(x, y); performing a fast Fourier transform on I_g(x, y) according to formula (1) to obtain F(x, y); computing the phase spectrum S(x, y) according to formula (2); and finally obtaining the reconstructed face image P(x, y) according to formula (3);
F(x, y) = FFT(I_g(x, y))    (1)
S(x, y) = p(F(x, y))    (2)
P(x, y) = IFFT(e^{i·S(x, y)})    (3)
wherein FFT(·) and IFFT(·) denote the fast Fourier transform and inverse fast Fourier transform, respectively, and p(·) extracts the phase angle.
3. The forged video detection method based on frequency domain self-attention as claimed in claim 1, wherein step 2.2 specifically comprises:
setting the size of the reconstructed face image to H × W and the size of each image block to P × P, obtaining N image blocks, where N = (H × W) / P²;
converting the N image blocks into the sequence data z_0 according to formula (4);
z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos    (4)
wherein x_class is a D-dimensional learnable class-related variable, x_p^1, …, x_p^N are the N flattened P × P pixel blocks, E is the linear mapping matrix that projects each image block to a D-dimensional embedding, and E_pos is the position embedding matrix.
4. The forged video detection method based on frequency domain self-attention as claimed in claim 3, wherein step 2.3 specifically comprises:
the feature extraction process of the Transformer model is represented by formulas (5)-(6), and the decision process of the multi-layer perceptron by formula (7):
z'_l = MHA(LN(z_{l-1})) + z_{l-1},  l = 1…L    (5)
z_l = MLP(LN(z'_l)) + z'_l,  l = 1…L    (6)
y = MLP(LN(z_L^0))    (7)
wherein MHA(·) denotes the multi-head attention mechanism, LN(·) denotes layer normalization, and MLP(·) denotes a multi-layer perceptron; the multi-layer perceptron in formula (6) is recorded as the first multi-layer perceptron and the one in formula (7) as the second multi-layer perceptron; L is the total number of Transformer layers, l denotes the l-th layer, z_l is the output of the l-th layer MLP, z'_l is the output of the l-th layer MHA, and z_L^0 denotes the data of the 1st dimension of z_L.
5. The forged video detection method based on frequency domain self-attention according to claim 4, wherein the first multi-layer perceptron consists of two hidden layers; the first hidden layer has H_1 nodes and the second hidden layer has H_2 nodes, where H_1 = D and H_2 equals the output dimension of the multi-layer perceptron;
the calculation formula of the first hidden layer is shown as formula (10), and that of the second hidden layer as formula (11):
a_i^(1) = Σ_j W_ij^(1) x_j,  h_i^(1) = g(a_i^(1))    (10)
a_i^(2) = Σ_j W_ij^(2) h_j^(1),  h_i^(2) = g(a_i^(2))    (11)
wherein W^(1) and W^(2) are the learnable weights of the first hidden layer and the second hidden layer, respectively, g(·) is the activation function, a_i^(1) and a_i^(2) are the intermediate temporary values of the i-th hidden node of the first and second hidden layers of the MLP, h_i^(1) and h_i^(2) are the outputs of the i-th hidden node of the first and second hidden layers of the MLP, and x_j is the input of the j-th dimension.
6. The method according to claim 4, wherein the second multi-layer perceptron has one hidden layer with two nodes; the value of the first node is taken as the probability that the video frame is a real video frame, and the value of the second node as the probability that the video frame is a forged video frame.
CN202210334683.9A (priority date 2022-03-31, filing date 2022-03-31): Forged video detection method based on frequency domain self-attention; published as CN114758272A (en), status: Pending

Priority Applications (1)

Application Number: CN202210334683.9A (published as CN114758272A); Priority Date: 2022-03-31; Filing Date: 2022-03-31; Title: Forged video detection method based on frequency domain self-attention

Publications (1)

Publication Number: CN114758272A; Publication Date: 2022-07-15

Family

ID=82329306

Country Status (1)

Country: CN; Publication: CN114758272A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720A * 2022-08-11 2022-11-08 山东省人工智能研究院 Deepfake generation method based on Transformer
CN116563957A (en) * 2023-07-10 2023-08-08 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation
CN116563957B (en) * 2023-07-10 2023-09-29 齐鲁工业大学(山东省科学院) Face fake video detection method based on Fourier domain adaptation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination