CN115147758A - Depth forged video detection method and system based on intra-frame inter-frame feature differentiation - Google Patents

Depth forged video detection method and system based on intra-frame inter-frame feature differentiation

Info

Publication number
CN115147758A
Authority
CN
China
Prior art keywords
frame
features
inter
intra
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210718973.3A
Other languages
Chinese (zh)
Inventor
王风宇
肖扬
孔健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210718973.3A priority Critical patent/CN115147758A/en
Publication of CN115147758A publication Critical patent/CN115147758A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure belongs to the technical field of forged video detection, and in particular relates to a deep-forged video detection method and system based on intra-frame and inter-frame feature differentiation, comprising the following steps: acquiring raw data of a deep-forged video; extracting intra-frame features and inter-frame features based on the acquired raw data; calculating the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features; and determining the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.

Description

Depth forged video detection method and system based on intra-frame and inter-frame feature differentiation
Technical Field
The disclosure belongs to the technical field of forged video detection, and in particular relates to a deep-forged video detection method and system based on intra-frame and inter-frame feature differentiation.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of deep learning technology, more and more synthetic videos appear in people's daily lives. Powerful synthesis techniques can produce forged images and videos that are difficult for humans to perceive, making it hard to identify such forgeries with the naked eye alone.
In the prior art, face synthesis technology can provide attractive services such as face swapping and facial expression manipulation with just a few clicks on a mobile device. However, such artificial intelligence techniques raise serious security and privacy problems, for example by threatening facial recognition systems, and can easily cause great social impact and harm. In recent years, with the emergence of a large number of videos synthesized by deep face forgery, a series of detection techniques targeting such videos have appeared. In addition, sophisticated face-forgery detection systems have been developed that use biometric information (such as blinking or head poses) together with advanced training sets to train detectors capable of recognizing deep-forged videos.
The inventors have appreciated that, although techniques based on capturing a single type of feature, such as convolutional network traces, biological liveness cues, and intra-frame picture features, have made good progress in detection tasks, there is still considerable room for improving the accuracy of detection algorithms on low-quality and mixed AI-synthesized forged face videos. At present, face identity swapping takes the single-frame image as the starting point of the forgery, and the dynamic changes between single frames are rarely considered. Manipulating one modality of an object in isolation may result in inconsistencies with other modalities.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a deep-forged video detection method and system based on intra-frame and inter-frame feature differentiation, which captures the intra-frame features and inter-frame features of forged videos and the differentiation between them, thereby combating forged videos more effectively and further improving the detection of deep-forged videos.
According to some embodiments, a first aspect of the present disclosure provides a deep-forged video detection method based on intra-frame and inter-frame feature differentiation, which adopts the following technical solution:
A deep-forged video detection method based on intra-frame and inter-frame feature differentiation comprises the following steps:
acquiring raw data of a deep-forged video;
extracting intra-frame features and inter-frame features based on the acquired raw data;
calculating the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
and determining the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
As a further technical limitation, after the raw data of the deep-forged video is acquired, the acquired raw data is subjected to format unification: the video is extracted into picture frames frame by frame, and the interference of irrelevant video background outside the face region is suppressed.
Furthermore, bottleneck attention optimization is adopted to optimize the features of the picture frames: facial region features are extracted with an attention mechanism while the irrelevant background is suppressed.
Furthermore, in the process of extracting picture frames, a face recognition library is used to locate and crop the face region in each picture frame, and the resulting pictures are uniformly resized to 128 × 128 for storage.
As a further technical limitation, the intra-frame features are extracted from the RGB images, while the inter-frame stream is represented by dense optical flow; the optical flow features focus on the facial areas with large changes, on which the feature extraction is based.
As a further technical limitation, in the process of extracting inter-frame features, the offset vectors of all pixel points between the preceding and following frame images are calculated, and optical flow tracking and estimation of the movement offsets of all pixel points completes the extraction of the inter-frame features.
As a further technical limitation, cross-entropy loss functions are used to calculate losses for the intra-frame features and the inter-frame features respectively, and the weighted sum of the loss functions is calculated to obtain the differentiation between the intra-frame and inter-frame features.
According to some embodiments, a second aspect of the present disclosure provides a deep-forged video detection system based on intra-frame and inter-frame feature differentiation, which adopts the following technical solution:
A deep-forged video detection system based on intra-frame and inter-frame feature differentiation comprises:
an acquisition module configured to acquire raw data of a deep-forged video;
an extraction module configured to extract intra-frame features and inter-frame features based on the acquired raw data;
a calculation module configured to calculate the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
and a detection module configured to determine the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
According to some embodiments, a third aspect of the present disclosure provides a computer-readable storage medium, which adopts the following technical solution:
A computer-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to the first aspect of the present disclosure.
According to some embodiments, a fourth aspect of the present disclosure provides an electronic device, which adopts the following technical solution:
An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, the processor implementing, when executing the program, the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to the first aspect of the present disclosure.
Compared with the prior art, the present disclosure has the following beneficial effects:
The present disclosure constructs a dual-network structure in which two improved CNN sub-networks are trained on the input data to extract intra-frame and inter-frame features respectively, and a contrast loss function is employed to correlate the two sub-networks and capture the dissonance between the intra-frame and inter-frame features. Cross-entropy loss functions are applied after the fully connected layers of the intra-frame and inter-frame sub-networks, and these cross-entropy losses are combined with the contrast loss into an overall loss function to improve the learning effect of the network.
The present disclosure adds a bottleneck attention module to the network to optimize the input feature map based on its global feature statistics. Placing the bottleneck attention module at the bottleneck of the model enables lower-level features to benefit from context information, and the lightweight module design allows the whole program to run efficiently.
The detection precision and accuracy in related work such as deep forgery detection are thereby significantly improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a flowchart of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation in the first embodiment of the present disclosure;
Fig. 2 is an overall architecture diagram of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation in the first embodiment of the disclosure;
Fig. 3 is a block diagram of an intra-frame sub-network in the first embodiment of the present disclosure;
Fig. 4 is a block diagram of a bottleneck attention module in the first embodiment of the disclosure;
Fig. 5 is a block diagram of the deep-forged video detection system based on intra-frame and inter-frame feature differentiation in the second embodiment of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one
The first embodiment of the disclosure introduces a deep-forged video detection method based on intra-frame and inter-frame feature differentiation.
As shown in fig. 1 and fig. 2, the deep-forged video detection method based on intra-frame and inter-frame feature differentiation includes:
Step S01: acquiring raw data of a deep-forged video;
Step S02: extracting intra-frame features and inter-frame features based on the acquired raw data;
Step S03: calculating the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
Step S04: determining the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
In one or more embodiments, in step S01, datasets related to deep forgery are collected, and the collected datasets are preprocessed through format unification, label generation, face localization, picture cropping, and the like. The datasets are then partitioned.
The comparison datasets used are all public datasets, including FaceForensics++, DeepfakeTIMIT, UADFV, and Celeb-DF. Most of these datasets consist of face videos collected from YouTube or other open video websites, on which forged videos are then created with different forgery methods. Because the data in these datasets circulate on the Internet in the form of videos, with no unified format or authenticity labels shared among them, unified data preprocessing must be performed on each dataset before the formal experiments begin, that is, the data format must be unified; for videos of different formats and sizes in the datasets, all videos are extracted into pictures frame by frame.
In order to improve the effect of network training and suppress interference from the irrelevant video background outside the face region, for all picture frames extracted from the dataset videos, this embodiment uses a face recognition library to locate and crop the face region in each frame and uniformly resizes the crops to 128 × 128 pictures for storage. After the data format is unified, each picture is labeled according to the authenticity of the video in the dataset and the partition of the training data.
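A minimal preprocessing sketch is given below for illustration (the open-source face_recognition package, OpenCV, and the output file layout are assumptions; the text above only specifies the use of a face recognition library and a uniform 128 × 128 crop):

# Extract frames from a video, crop the detected face region, and store
# 128x128 face images. Library choice and file naming are illustrative.
import os
import cv2
import face_recognition

def extract_face_frames(video_path: str, out_dir: str, size: int = 128) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        boxes = face_recognition.face_locations(rgb)   # (top, right, bottom, left)
        if boxes:
            top, right, bottom, left = boxes[0]        # keep the first detected face
            face = cv2.resize(frame_bgr[top:bottom, left:right], (size, size))
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:05d}.png"), face)
            saved += 1
        idx += 1
    cap.release()
    return saved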
In this embodiment, datasets such as FaceForensics++ are divided into a training set, a validation set and a test set according to a ratio of 5. Notably, unlike some previous studies, each split contains videos from different identities (real and fake). This is important for a fair assessment and prediction of the generalization ability of forgery detection systems to unknown identities. In order to test generalization across datasets with different forgery modes, rather than training and evaluating on the data of a single dataset only, this embodiment adopts a dual-stream network HOR (detecting deep-forged videos using the discrepancy between intra- and inter-frame maps) to evaluate the detection capability when the training data mixes several different forgery modes, as sketched below for the identity-disjoint split.
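The sketch below illustrates an identity-disjoint split of this kind; since the exact split ratio is truncated in the text above, a hypothetical 5:1:1 ratio is used purely for illustration:

# Split samples by identity so the same person never appears in two splits.
import random
from collections import defaultdict

def split_by_identity(samples, ratios=(5, 1, 1), seed=0):
    """samples: list of (identity_id, frame_path, label) tuples."""
    by_id = defaultdict(list)
    for ident, path, label in samples:
        by_id[ident].append((path, label))
    idents = list(by_id)
    random.Random(seed).shuffle(idents)
    total = sum(ratios)
    n_train = len(idents) * ratios[0] // total
    n_val = len(idents) * ratios[1] // total
    parts = (idents[:n_train],
             idents[n_train:n_train + n_val],
             idents[n_train + n_val:])
    return [[s for i in part for s in by_id[i]] for part in parts]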
As one or more embodiments, in step S02, the corresponding intra-frame features and inter-frame features are extracted from the acquired data and used as raw input to train the subsequent dual-stream detection network model.
The overall network architecture in this embodiment is a dual-stream network composed of two convolutional neural network (CNN) sub-networks: an intra-frame sub-network for extracting intra-frame spatial features, and an inter-frame sub-network for extracting inter-frame temporal features.
The intra-frame sub-network extracts intra-frame features from the RGB images of the video, and the inter-frame sub-network extracts inter-frame features from a dense optical flow map of the video; for the inter-frame stream, dense optical flow is used to represent the flow between frames. The basic principle of deep forgery in video processing is to manipulate the forged image of each frame and then concatenate the generated forged images; as a result, facial areas with larger changes (e.g., eyes, lips) tend to be more distorted because of discontinuities between consecutive frames. Compared with other traditional techniques based on intra-frame image features, optical flow features can better focus on areas with large facial changes, which improves detection accuracy.
Whether a video is real or fake is judged by the differentiation between the intra-frame features and the inter-frame features. This differentiation is captured by a contrast loss function, which makes the Euclidean distance between intra-frame and inter-frame features smaller for real videos and larger for fake videos.
Specifically, as shown in fig. 2, the intra-frame sub-network extracts intra-frame features from the RGB images of the video, and the inter-frame sub-network extracts inter-frame features from the video's dense optical flow images, which are computed in advance during preprocessing. The network architecture of the intra-frame and inter-frame sub-networks is based on a ResNet-style network using separable convolutions, and the residual learning mechanism alleviates problems such as slow network convergence and degraded training effect.
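As an illustration of such a building block, the following sketch shows a residual block built from depthwise-separable convolutions; it is not the patent's exact layer configuration, only an assumption of how a ResNet-style block with separable convolutions can be written:

# Residual block using depthwise-separable convolutions (illustrative only).
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SeparableResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv2d(in_ch, out_ch, stride=stride),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            SeparableConv2d(out_ch, out_ch),
            nn.BatchNorm2d(out_ch))
        # 1x1 projection on the shortcut when the shape changes
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))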
In the process of extracting the inter-frame features, a dense optical flow algorithm is adopted. Specifically:
Optical flow can be broadly classified into sparse optical flow and dense optical flow. The two share the same principle, but sparse optical flow estimates motion only for deliberately selected pixel points in the two consecutive frames, whereas dense optical flow tracks and estimates the movement offsets of all pixel points by calculating the offset vectors of every pixel between the preceding and following frame images.
The main idea of the dense optical flow (Farneback) algorithm is to determine a weight for each pixel according to the pixel values and coordinates of the other pixels in its neighborhood, and then to expand the neighborhood of that point with a polynomial. Regarding the (grayscale) image as a function of a two-dimensional signal, the independent variable is given in equation (1). The image is then approximately modeled by a quadratic polynomial as in equation (2), where A is a 2×2 symmetric matrix, b is a 2×1 vector, and c is a scalar:
x = (x, y)^T (1)
f(x) ≈ x^T A x + b^T x + c (2)
By expanding and factoring, the right side of equation (2) can be written as equation (3), i.e., as a linear combination of six basis functions:
f(x) ≈ r1 + r2·x + r3·y + r4·x² + r5·y² + r6·xy (3)
The two-dimensional signal space (Cartesian coordinate system) of the original image therefore requires a six-dimensional coefficient vector when transformed into the space spanned by the basis functions (1, x, y, x², y², xy); substituting the positions x and y of different pixel points yields their gray values. In order to obtain the six coefficients of each pixel point in each frame, the Farneback algorithm takes a (2n+1) × (2n+1) neighborhood around each pixel and uses the (2n+1)² pixels in that neighborhood as sample points for a least-squares fit.
The (2n+1) × (2n+1) gray-value matrix in the neighborhood of a pixel is split and stacked, in column-major order, into a (2n+1)² × 1 vector f. With (1, x, y, x², y², xy) as the basis functions, the transformation matrix B has dimension (2n+1)² × 6 (i.e., it is composed of the six column vectors b_i), and the coefficient vector r shared within the neighborhood has dimension 6 × 1, giving equation (4):
f = B × r = (b1 b2 b3 b4 b5 b6) × r (4)
When solving by least squares, the Farneback algorithm assigns a weight to the sample error of each pixel in the neighborhood using a two-dimensional Gaussian distribution. The (2n+1) × (2n+1) matrix of the two-dimensional Gaussian over the neighborhood of each pixel is likewise split, in column-major order, into a (2n+1)² × 1 vector a. As shown in equation (5), the original basis-function transformation matrix B is then transformed into:
B = (a·b1 a·b2 a·b3 a·b4 a·b5 a·b6) (5)
By transforming the basis-function matrix B once more in a dual manner, the coefficient vector of each pixel point in a single image can be obtained. The optical flow field is then computed from the parameter vectors together with local smoothing. After the optical flow field is obtained, the two-channel optical flow data matrix is padded into a three-channel matrix so that the input data structure of the inter-frame stream corresponds to the three-channel RGB structure of the intra-frame stream.
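The padding step can be sketched as follows; the text does not state what the third channel holds, so the flow magnitude is used here as an assumption (a constant zero channel would serve the same purpose):

# Pad a 2-channel dense optical flow field (dx, dy) to 3 channels so it matches
# the 3-channel RGB layout of the intra-frame stream.
import numpy as np

def flow_to_three_channels(flow: np.ndarray) -> np.ndarray:
    """flow: H x W x 2 array of per-pixel displacements."""
    magnitude = np.linalg.norm(flow, axis=2, keepdims=True)  # H x W x 1
    return np.concatenate([flow, magnitude], axis=2)         # H x W x 3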
The network architecture of the intra-frame sub-network consists of 36 convolutional layers; the network is based on the Xception model, which shows strong learning capability in visual tasks.
The inter-frame sub-network explores the long-range spatio-temporal context correlation of key facial areas across video frames to enhance the representation learning capability. This embodiment uses dense optical flow to infer the displacement and direction of pixel points from the preceding and following frame images, and then captures features from the dense optical flow through the inter-frame sub-network.
Specifically, the calcOpticalFlowFarneback function provided in OpenCV is used, which implements the Farneback dense optical flow algorithm based on image pyramid modeling. The algorithm builds a three-level image pyramid from all pixel points in the preceding and following frames, with each level half the size of the previous one. The window size for constructing the flow is set to 15, with three iterations. The Farneback dense optical flow is computed for each level of the two image pyramids from top to bottom. Building the image pyramid makes it easier for the optical flow to capture objects with large motion amplitudes.
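The following sketch shows this optical flow computation with the parameters stated above (three pyramid levels at half scale, window size 15, three iterations); poly_n and poly_sigma are not given in the text, so the common OpenCV defaults are assumed:

# Farneback dense optical flow between two consecutive BGR frames.
import cv2

def dense_flow(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5,   # each pyramid level is half the size of the previous one
        levels=3,        # three-level image pyramid
        winsize=15,      # window size for constructing the flow
        iterations=3,    # three iterations
        poly_n=5, poly_sigma=1.2, flags=0)
    return flow          # H x W x 2 array of (dx, dy) per pixel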
In terms of its overall structure, the inter-frame sub-network is similar to the intra-frame sub-network, but to better extract inter-frame features an LSTM (long short-term memory) layer is added at the tail of the network, which mitigates gradient explosion and gradient vanishing during long-sequence training. As with the visual stream, this embodiment adds a softmax layer at the end of the inter-frame sub-network, and its output enters the cross-entropy loss of the inter-frame modality.
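A structural sketch of such an inter-frame sub-network is shown below (a per-frame CNN backbone followed by an LSTM and a classification head); the layer sizes are illustrative assumptions rather than the patent's exact configuration:

# Inter-frame sub-network sketch: per-frame CNN features -> LSTM -> classifier.
import torch
import torch.nn as nn

class InterFrameSubnet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim=2048, hidden=256, num_classes=2):
        super().__init__()
        self.backbone = backbone              # assumed to output a flat feature vector per frame
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, flow_seq):              # flow_seq: (B, T, 3, H, W) optical-flow frames
        b, t = flow_seq.shape[:2]
        feats = self.backbone(flow_seq.flatten(0, 1))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])            # logits; softmax is applied inside the cross-entropy loss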
In order to improve the detection capability of the network, this embodiment introduces the bottleneck attention module shown in fig. 4. The bottleneck attention module uses an attention mechanism to extract key features from the facial region of the video frame and suppresses irrelevant background information; its input is the set of frame-level feature maps at the network bottleneck.
The feature map is denoted F ∈ R^(C×H×W), where C, H and W are the number of channels, the height and the width of the feature map, respectively. For a given input feature map F ∈ R^(C×H×W), the bottleneck attention module infers a 3D attention map M(F) ∈ R^(C×H×W). The final output F' of the bottleneck attention module is calculated as:
F' = F + F ⊗ M(F) (6)
where ⊗ denotes element-wise multiplication. The 3D attention map M(F) is implemented by computing channel attention M_C(F) ∈ R^C and spatial attention M_S(F) ∈ R^(H×W) on two separate branches. In the channel branch, the input tensor F first soft-encodes the global information in each channel with a global average pooling layer. A multi-layer perceptron (MLP) with one hidden layer is then used to derive the channel vector M_C(F), which estimates attention across the channels. To match the size of the spatial-branch output, the bottleneck attention module adds a batch normalization (BN) layer after the MLP. In short, the channel attention M_C(F) is computed as:
M_C(F) = BN(MLP(AvgPool(F))) (7)
The spatial branch produces a spatial attention map M_S(F) ∈ R^(H×W) to emphasize or suppress features at different spatial locations. What matters most here is a large receptive field in the spatial dimension, so that context information can be used effectively, and dilated convolutions enlarge the receptive field efficiently. The spatial branch adopts the bottleneck structure proposed in ResNet, which saves parameters and computation. Specifically, to integrate and compress the feature map along the channel dimension, the features F ∈ R^(C×H×W) are first projected with a 1 × 1 convolution to reduce the dimension to R^(C/r×H×W). In short, the spatial attention is computed as:
M_S(F) = BN(f^{1×1}(f^{3×3}(f^{3×3}(f^{1×1}(F))))) (8)
where f denotes a convolution operation, BN denotes batch normalization, and the superscript indicates the size of the convolution filter.
Finally, after the channel attention M_C(F) and the spatial attention M_S(F) have been obtained, the two attention branches are combined to produce the final M(F). Because M_C(F) and M_S(F) have different shapes, both attention maps are expanded to R^(C×H×W) before M_S(F) is combined with M_C(F). After element-wise summation, the sigmoid function is applied to obtain the final 3D attention map M(F) with values in the range 0 to 1. This 3D attention map is multiplied element-wise by the original input feature map F and then added to F to obtain the output feature map of the bottleneck attention module, as shown in equation (6).
The main advantages of using the self-attention mechanism in CNNs are efficient global context modeling and efficient back-propagation (i.e., model training). The global context allows the model to better recognize locally ambiguous patterns and to focus on important parts; capturing and utilizing the global context is therefore critical for various visual tasks. To this end, CNN models typically stack many convolutional layers, or use pooling operations, to ensure that features have a larger receptive field.
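For illustration, the following sketch implements a bottleneck attention module of the kind described above; the reduction ratio and dilation value are assumptions taken from the original BAM design rather than from the patent text:

# Bottleneck attention module sketch: channel branch + dilated spatial branch,
# combined with a sigmoid (equations (6)-(8) above).
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        mid = channels // reduction
        # channel branch: global average pool -> MLP -> BN  (eq. 7)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, mid), nn.ReLU(inplace=True),
            nn.Linear(mid, channels), nn.BatchNorm1d(channels))
        # spatial branch: 1x1 reduce -> two dilated 3x3 -> 1x1 -> BN  (eq. 8)
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1), nn.BatchNorm2d(1))

    def forward(self, f):
        b, c, h, w = f.shape
        m_c = self.channel(f).view(b, c, 1, 1)   # broadcast to C x H x W
        m_s = self.spatial(f)                    # B x 1 x H x W
        m = torch.sigmoid(m_c + m_s)             # element-wise sum, then sigmoid
        return f + f * m                         # eq. (6): F' = F + F (x) M(F)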
As one or more embodiments, in step S03, this embodiment uses a contrast loss function to compute the differentiation between the intra-frame features and inter-frame features extracted by the intra-frame and inter-frame sub-networks, and uses cross-entropy losses on the two sub-networks so that each learns a discriminative unimodal feature representation.
After the sub-networks extract features from the input data, the extracted features are passed through the fully connected layers and output to a contrast loss function and cross-entropy loss functions respectively. The contrast loss function captures the differentiation between the extracted intra-frame features and inter-frame features, measured by the Euclidean distance between them.
Cross-entropy loss is simple and efficient and is the loss function most commonly used in deep forgery detection tasks. In the HOR detection network proposed in this embodiment, however, the contrast loss serves as a key component of the objective function. The contrast loss function was originally used mainly in work related to dimensionality reduction, based on the idea that the similarity of data samples in feature space should not be shifted by dimensionality reduction (feature extraction). This embodiment therefore exploits the fact that the contrast loss effectively reflects the similarity between samples to build a differentiation-based detection method on intra-frame and inter-frame features. The contrast loss maximizes the disparity score of manipulated videos while minimizing the disparity score of real videos, and achieves a better detection effect than a traditional detection network that uses only the cross-entropy loss function.
The contrast loss function is given in equation (9), where y_i is the label of video v_i (y_i = 1 for a real video and y_i = 0 for a fake video) and margin is a hyper-parameter. The disparity score d_i in equation (10) is the Euclidean distance between the intra-frame feature f_a of the intra-frame sub-network and the inter-frame feature f_e of the inter-frame sub-network. In addition, this embodiment separately learns the feature representations using the cross-entropy losses of the intra-frame and inter-frame sub-networks, defined in equation (11) (intra-frame) and equation (12) (inter-frame). The total loss is a weighted sum of these three losses, as shown in equation (13):
L_c = (1/N) Σ_i [ y_i · (d_i)² + (1 − y_i) · max(margin − d_i, 0)² ] (9)
d_i = || f_a − f_e ||_2 (10)
L_a = −(1/N) Σ_i [ y_i · log(p_a^i) + (1 − y_i) · log(1 − p_a^i) ] (11)
L_e = −(1/N) Σ_i [ y_i · log(p_e^i) + (1 − y_i) · log(1 − p_e^i) ] (12)
L = L_c + L_a + L_e (13)
where N is the number of training videos, and p_a^i and p_e^i are the predicted probabilities of the intra-frame and inter-frame sub-networks for video v_i.
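A sketch of this combined objective is given below; the label convention (1 for real, 0 for fake) and the equal weighting follow the formulas above, while the margin value is an assumed placeholder:

# Combined objective of equations (9)-(13): contrastive loss on the distance
# between intra- and inter-frame features plus the two cross-entropy losses.
import torch
import torch.nn.functional as F

def total_loss(f_intra, f_inter, logits_intra, logits_inter, labels, margin=1.0):
    """f_intra, f_inter: (B, D) features; logits_*: (B, 2); labels: (B,) long, 1=real, 0=fake."""
    d = torch.norm(f_intra - f_inter, p=2, dim=1)                  # eq. (10)
    y = labels.float()
    l_c = (y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)).mean()  # eq. (9)
    l_a = F.cross_entropy(logits_intra, labels)                    # eq. (11)
    l_e = F.cross_entropy(logits_inter, labels)                    # eq. (12)
    return l_c + l_a + l_e                                         # eq. (13)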
In one or more embodiments, in step S04, the authenticity of the video is determined from the Euclidean distance between the output features of the intra-frame sub-network and those of the inter-frame sub-network. To label a test video, this embodiment uses the logical indicator function 1{d_i < τ}, where 1{·} denotes the comparison of the Euclidean distance d_i with the threshold τ. The threshold τ is determined from the training set: this embodiment calculates the Euclidean distances of the real and fake videos in the training set, and the midpoint between their two mean values is used as the value of τ.
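The threshold selection and decision rule described above can be sketched as follows:

# tau is the midpoint between the mean distances of real and fake training
# videos; a test video is labeled real when its distance is below tau.
import numpy as np

def choose_threshold(real_distances, fake_distances):
    return (np.mean(real_distances) + np.mean(fake_distances)) / 2.0

def is_real(distance, tau):
    return distance < tau        # 1{d_i < tau}: small distance => real video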
If a video is represented as different modalities, manipulating any single modality in isolation will cause some differentiation between modalities. In general, the differentiation between the intra-frame and inter-frame features of a real video is clearly smaller than that of a fake video. The differentiation is measured by the Euclidean distance: the larger the Euclidean distance, the larger the differentiation, and vice versa. The Euclidean distance is compared with the specified threshold to determine the authenticity of the video.
This embodiment provides a novel deep forgery detection method with a dual-network architecture based on the differentiation between intra-frame and inter-frame features of a deep-forged video. A dual-network structure is constructed in which two improved CNN sub-networks extract intra-frame and inter-frame features respectively, and a contrast loss function captures the differentiation between the two sub-networks, i.e., represents the relationship between the intra-frame and inter-frame features. In other studies on forged audio-video, a contrast loss function has been used to detect whether the audio and the video of a forged clip are consistent. A thorough experimental evaluation was performed on recent public datasets; the evaluation results verify the superiority of this embodiment in face video detection and also verify the hypothesis of differentiation between the RGB images and the optical flow. The network architecture of this embodiment also includes the intra-frame and inter-frame sub-networks, which are intended to learn discriminative unimodal features through cross-entropy loss; subsequent experiments show that the detection precision of the network that additionally includes the cross-entropy losses is higher than that of the network using only the contrast loss.
Example two
The second embodiment of the disclosure introduces a deep-forged video detection system based on intra-frame and inter-frame feature differentiation.
As shown in fig. 5, the deep-forged video detection system based on intra-frame and inter-frame feature differentiation includes:
an acquisition module configured to acquire raw data of a deep-forged video;
an extraction module configured to extract intra-frame features and inter-frame features based on the acquired raw data;
a calculation module configured to calculate the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
and a detection module configured to determine the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
The detailed steps are the same as those of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation provided in the first embodiment and are not repeated here.
Example three
The third embodiment of the disclosure provides a computer-readable storage medium.
A computer-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to the first embodiment of the present disclosure.
The detailed steps are the same as those of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation provided in the first embodiment and are not repeated here.
Example four
The fourth embodiment of the disclosure provides an electronic device.
An electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor, the processor implementing, when executing the program, the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to the first embodiment of the present disclosure.
The detailed steps are the same as those of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation provided in the first embodiment and are not repeated here.
Although the embodiments of the present disclosure have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present disclosure, and it should be understood by those skilled in the art that various modifications and variations can be made to the technical solutions of the present disclosure without inventive effort.

Claims (10)

1. A deep-forged video detection method based on intra-frame and inter-frame feature differentiation, characterized by comprising the following steps:
acquiring raw data of a deep-forged video;
extracting intra-frame features and inter-frame features based on the acquired raw data;
calculating the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
and determining the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
2. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 1, characterized in that after the raw data of the deep-forged video is acquired, the acquired raw data is subjected to format unification: the video is extracted into picture frames frame by frame, and the interference of irrelevant video background outside the face region is suppressed.
3. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 2, characterized in that bottleneck attention optimization is used to optimize the features of the picture frames, facial region features being extracted with an attention mechanism while the irrelevant background is suppressed.
4. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 2, characterized in that in the process of extracting picture frames, a face recognition library is used to locate and crop the face region in each picture frame, and the resulting pictures are uniformly resized to 128 x 128 for storage.
5. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 1, characterized in that the intra-frame features are extracted from the RGB images, the inter-frame stream is represented by dense optical flow, and the optical flow features focus on the facial areas with large changes, on which the feature extraction is based.
6. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 1, characterized in that in the process of extracting the inter-frame features, the offset vectors of all pixel points between the preceding and following frame images are calculated, and optical flow tracking and estimation of the movement offsets of all pixel points completes the extraction of the inter-frame features.
7. The deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to claim 1, characterized in that cross-entropy loss functions are used to calculate losses for the intra-frame features and the inter-frame features respectively, and the weighted sum of the loss functions is calculated to obtain the differentiation between the intra-frame and inter-frame features.
8. A deep-forged video detection system based on intra-frame and inter-frame feature differentiation, characterized by comprising:
an acquisition module configured to acquire raw data of a deep-forged video;
an extraction module configured to extract intra-frame features and inter-frame features based on the acquired raw data;
a calculation module configured to calculate the differentiation and the Euclidean distance between the extracted intra-frame features and inter-frame features;
and a detection module configured to determine the authenticity of the deep-forged video according to the obtained differentiation and Euclidean distance.
9. A computer-readable storage medium on which a program is stored, characterized in that the program, when executed by a processor, implements the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the deep-forged video detection method based on intra-frame and inter-frame feature differentiation according to any one of claims 1 to 7.
CN202210718973.3A 2022-06-23 2022-06-23 Depth forged video detection method and system based on intra-frame inter-frame feature differentiation Pending CN115147758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210718973.3A CN115147758A (en) 2022-06-23 2022-06-23 Depth forged video detection method and system based on intra-frame inter-frame feature differentiation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210718973.3A CN115147758A (en) 2022-06-23 2022-06-23 Depth forged video detection method and system based on intra-frame inter-frame feature differentiation

Publications (1)

Publication Number Publication Date
CN115147758A true CN115147758A (en) 2022-10-04

Family

ID=83407530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210718973.3A Pending CN115147758A (en) 2022-06-23 2022-06-23 Depth forged video detection method and system based on intra-frame inter-frame feature differentiation

Country Status (1)

Country Link
CN (1) CN115147758A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011766A (en) * 2023-07-26 2023-11-07 中国信息通信研究院 Artificial intelligence detection method and system based on intra-frame differentiation
CN117011766B (en) * 2023-07-26 2024-02-13 中国信息通信研究院 Artificial intelligence detection method and system based on intra-frame differentiation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination