CN117523435A - Fake video detection method and device based on graph network time sequence consistency - Google Patents

Fake video detection method and device based on graph network time sequence consistency Download PDF

Info

Publication number
CN117523435A
CN117523435A CN202311289334.0A CN202311289334A CN117523435A CN 117523435 A CN117523435 A CN 117523435A CN 202311289334 A CN202311289334 A CN 202311289334A CN 117523435 A CN117523435 A CN 117523435A
Authority
CN
China
Prior art keywords
graph
video
frame
key points
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311289334.0A
Other languages
Chinese (zh)
Inventor
俞山青
王健博
周嘉俊
吴添银
童啸瑞
陈作辉
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202311289334.0A priority Critical patent/CN117523435A/en
Publication of CN117523435A publication Critical patent/CN117523435A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/247Aligning, centring, orientation detection or correction of the image by affine transforms, e.g. correction due to perspective effects; Quadrilaterals, e.g. trapezoids
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A fake video detection method and device based on graph network time sequence consistency. The method includes: S1, extracting frames from an input deepfake video; S2, detecting the face in each extracted single-frame picture with a RetinaFace detection model and aligning and scaling the face; S3, extracting features such as mouth shape and facial expression from each single-frame face picture with a face key point detection network; S4, converting the video frame sequence whose features were extracted by the detection network into a corresponding feature matrix; S5, calculating the similarity between video frames from the feature vectors of the single-frame pictures; S6, constructing a KNN graph from the inter-frame similarity using a K-nearest-neighbor graph; S7, converting the KNN graph into a line graph; S8, performing global feature extraction on the graph with a graph convolutional neural network to obtain a graph representation; S9, obtaining a global representation of the line graph with a pooling layer; S10, classifying the forgery algorithm of the fake video with a fully connected layer and a Softmax layer, thereby realizing detection of the forgery technique.

Description

Fake video detection method and device based on graph network time sequence consistency
Technical Field
The invention relates to a deepfake video identification method and device based on graph network time sequence consistency, and belongs to the fields of deep learning, computer vision and graph neural networks.
Background
In recent years, with the rapid development of deep learning and computer vision, and in particular of generative adversarial networks (GAN) and diffusion models (Diffusion Model), AI-generated content (AIGC) has grown explosively.
Specifically, deepfake technology mainly edits or generates faces and is mainly divided into: face reenactment, face replacement, face editing, and face generation. Face reenactment uses a source face to drive a target face so that the behavior of the target face is consistent with that of the source face. Face replacement replaces a target face with a source face so that the target face becomes the source face. Face editing modifies the facial attributes of a target face, such as its expression, the size of facial organs, or skin color. Face generation creates a face that does not exist in reality at all, without any target face as a basis.
With the rapid development of deepfake technology, producers can easily create large numbers of deepfake videos with computer or mobile-phone software and release them to the Internet for dissemination.
Existing research on deepfake video identification is limited. Current approaches mainly downsample the deepfake video into pictures, extract features with a deep learning model from the computer vision field, and finally detect forgeries with a classifier. Such methods ignore the timing information between video frames, cannot discover the timing inconsistencies that exist between frames, and therefore reduce the identification accuracy for deepfake video.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deepfake video detection method based on graph network time sequence consistency. The invention combines face key point features with timing information, calculates the similarity between frames, builds a graph over the video frames, and judges whether the video is forged, greatly improving the accuracy of the detection algorithm.
The invention adopts the technical scheme that: a first aspect of the present invention provides a fake video detection method based on graph network timing consistency, including the steps of:
S1: Decompose the input deepfake video frame by frame and extract frames to obtain the extracted video frames.
S2: Perform face detection on the video frame sequence obtained in step S1 with a RetinaFace model, apply an affine transformation to the frames containing a face to align and scale them to standard face coordinate points, and finally crop the aligned and scaled face frames to obtain RGB face video frame images.
S3: Extract key point features of the eyebrow, eye, nose, mouth and face contour regions from the frame images obtained in step S2 with a face key point detection network, obtaining key points for each frame.
S4: Calculate the similarity between video frames according to the key point features obtained in step S3.
S5: Construct a KNN graph from the inter-frame similarity using a K-nearest-neighbor graph (K-Nearest Neighborhood Graph).
S6: Convert the KNN graph obtained in step S5 into a line graph (Line Graph).
S7: Perform global feature extraction on the line graph with a graph convolutional neural network (Graph Convolutional Network) to obtain a graph representation, and obtain a global representation of the line graph with a global pooling layer.
S8: Input the global representation obtained in step S7 into a fully connected layer and a Softmax layer and map it to the forgery-algorithm classification of the fake video, realizing detection of the forgery algorithm.
In the step S1, the input deepfake video is decomposed frame by frame and frames are extracted to obtain the extracted video frames, specifically as follows: the input deepfake video is decomposed into single-frame images, and 100 frames are uniformly sampled according to the total number of frames of the video. For videos with fewer than 100 frames, all frames of the video are extracted.
In the step S3, key point features of the eyebrow, eye, nose, mouth and face contour regions are extracted from the frame images obtained in the step S2 with a face key point detection network, and key points are obtained for each frame. Specifically:
Features such as gaze drift, facial expression and lip synchronization are extracted from the picture. The face key points are divided into internal key points and contour key points: the internal key points comprise the 51 key points of the eyebrows, eyes, nose and mouth, and the contour key points comprise 17 key points. The 51 internal key points are distributed as follows:
Each eyebrow has 5 key points uniformly sampled from the left boundary to the right boundary, 10 key points in total. Each eye has 6 key points, namely the left and right boundaries plus uniform samples on the upper and lower eyelids, 12 key points in total. The lips have 20 key points: 2 at each of the left and right mouth corners, 5 points uniformly sampled on each of the outer boundaries of the upper and lower lips, and 3 points uniformly sampled on each of the inner boundaries of the upper and lower lips. The nose has 9 key points, with 4 sampled on the nose bridge and 5 uniformly collected around the nose tip. In addition, 17 key points are uniformly sampled on the face contour. In total, 68 face key points are sampled.
In the step S4, the similarity between video frames is calculated according to the key point features obtained in the step S3, with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame. The Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
In the step S5, a K-nearest-neighbor (KNN) graph is constructed according to the similarities, specifically as follows: first, each frame image is converted into a corresponding node, and nodes are connected according to the similarity between them. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
In the step S6, the KNN graph obtained in the step S5 is converted into a line graph (Line Graph). The specific process is as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected. After the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
In the step S7, global feature extraction is performed on the line graph using a graph convolutional neural network (Graph Convolutional Network) to obtain a representation of the graph, and a global pooling layer is used to obtain a global representation of the line graph. The specific process is as follows:
A graph convolutional neural network (Graph Convolutional Network, GCN) with pooling is constructed to perform global feature extraction on the graph of the video and obtain a graph representation. Specifically, the node feature matrix X and the adjacency matrix A of the line graph are first defined. Then several GCN layers are applied for feature learning, and the output of each layer can be expressed as:

H^(l+1) = σ(D^(−1/2) A D^(−1/2) H^(l) W^(l))   (2)

where H^(l) denotes the node features of the l-th layer, W^(l) denotes the weight matrix of the l-th layer, σ(·) denotes the activation function, and D is the degree matrix of A. After the representations of all nodes on the line graph are obtained, a pooling layer is used to obtain the global representation of the line graph:

Z = Pooling(H^(l+1))   (3)
a second aspect of the present invention relates to a fake video detection device based on graph network timing consistency, comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors execute the executable codes to implement the fake video detection method based on graph network timing consistency of the present invention
A third aspect of the present invention relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the fake video detection method of the present invention based on graph network timing consistency.
Compared with the prior art, the invention has the advantages and effects that:
(1) When features are extracted from the original video images, 68 key points covering the face contour, eyebrows, eyes, nose and mouth are located with a face key point extraction technique, and inter-frame inconsistency is computed from the offsets of the key point coordinates between frames. This exposes the temporal inconsistency of inter-frame motion and improves the model's ability to capture timing inconsistencies.
(2) Compared with methods that rely on hand-crafted features, the use of a graph convolutional neural network improves the flexibility of feature extraction. The timing information between video frames is exploited: a temporal graph is constructed from the inter-frame similarities, and the graph convolutional neural network greatly improves detection accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a frame diagram of the present invention;
fig. 2 is a schematic diagram of the algorithm structure of the present invention.
Detailed Description
Various exemplary embodiments of the invention will now be described in detail, which should not be considered as limiting the invention, but rather as more detailed descriptions of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. In addition, for numerical ranges in this disclosure, it is understood that each intermediate value between the upper and lower limits of the ranges is also specifically disclosed. Every smaller range between any stated value or stated range, and any other stated value or intermediate value within the stated range, is also encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the invention described herein without departing from the scope or spirit of the invention. Other embodiments will be apparent to those skilled in the art from consideration of the specification of the present invention. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
The "parts" in the present invention are all parts by mass unless otherwise specified.
Example 1
The invention provides a fake video detection method based on graph network time sequence consistency, which is shown in fig. 1 and comprises the following implementation steps:
image preprocessing:
step S1: original video frame extraction
Videos on the Internet are generally 30 frames per second, so one minute of video can contain up to 1800 frames; extracting features from every frame would incur a very large resource overhead. Therefore, the invention first decomposes the video into frames with the OpenCV Python library and then uniformly samples 100 frames according to the total number of frames of the video. For videos with fewer than 100 frames, all frames are extracted.
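As an illustration of step S1, the following is a minimal sketch using the OpenCV Python library (cv2) and NumPy; the function name sample_frames and the exact sampling logic are assumptions, not part of the original disclosure.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=100):
    """Uniformly sample up to num_frames frames from a video (step S1)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # If the video has fewer frames than the budget, take every frame.
    indices = np.linspace(0, total - 1, min(num_frames, total)).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```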
Step S2: face detection and clipping
Deepfake videos forge the face, so the forged region is concentrated in the face area, and the face in each single-frame picture must be detected and cropped. The specific flow is as follows: first, face detection is performed on the video frames with a RetinaFace network, and frames without a face are discarded. The face region coordinates are then obtained, the frame is cropped according to these coordinates, and the cropped face picture is finally scaled to 256×256 pixels.
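As an illustration of the cropping part of step S2, the following sketch assumes a face detector (for example a RetinaFace implementation) has already returned a bounding box; the helper name crop_and_resize_face and the (x1, y1, x2, y2) box format are assumptions.

```python
import cv2

def crop_and_resize_face(frame, bbox, size=256):
    """Crop the detected face region and scale it (step S2).

    bbox is (x1, y1, x2, y2) from any face detector such as RetinaFace;
    frames for which the detector returns no box are discarded upstream.
    """
    x1, y1, x2, y2 = [max(0, int(v)) for v in bbox]
    face = frame[y1:y2, x1:x2]
    return cv2.resize(face, (size, size))
```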
Extracting a characteristic composition:
step S3: calculating key points of human face
Face key point extraction is an important step in identifying deepfake video. By identifying and extracting the key points in a face image, the face can be modeled more accurately, which helps judge whether it has been forged. The invention divides the face key points into 51 internal key points and 17 face contour key points, where the internal key points comprise the 51 key points of the eyebrows, eyes, nose and mouth. Using Python, the 68 key point coordinates are obtained with the key point detection and localization in the dlib library.
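A minimal sketch of the 68-point extraction described above, assuming the dlib Python bindings and the standard pre-trained model file shape_predictor_68_face_landmarks.dat are available; the helper name extract_landmarks is an assumption.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(face_img):
    """Return a (68, 2) array of key point coordinates, or None if no face is found."""
    rects = detector(face_img, 1)
    if not rects:
        return None
    shape = predictor(face_img, rects[0])
    return np.array([[shape.part(k).x, shape.part(k).y] for k in range(68)])
```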
Step S4: calculating inter-frame similarity
In general, the timing-sensitive features of consecutive frames of a real video are highly consistent, so their similarity is relatively high, whereas the timing-sensitive features of consecutive frames of a deepfake video are less consistent (or at least some of them are), so their similarity is relatively low. Based on the assumption that deepfake videos and real videos differ significantly in the consistency of timing-sensitive features, a graph network based on these features is constructed.
Each frame image is converted into a corresponding node, and nodes are connected according to the inter-frame similarity. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
The similarity between each pair of video frames is calculated with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame. The Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
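A sketch of the inter-frame similarity of formula (1), assuming the (68, 2) landmark arrays produced in step S3; note that the text calls the accumulated key point distance a "similarity", so smaller values correspond to more similar frames.

```python
import numpy as np

def frame_similarity(landmarks_i, landmarks_j):
    """Accumulated Euclidean distance between corresponding key points, formula (1).

    landmarks_i, landmarks_j: (68, 2) arrays from the key point detector.
    """
    return float(np.linalg.norm(landmarks_i - landmarks_j, axis=1).sum())
```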
Step S5: construct the KNN graph.
Based on the inter-frame similarities, a K-nearest-neighbor graph (K-Nearest Neighborhood Graph, KNN graph) is constructed. Specifically, each node v_i in the KNN graph represents video frame f_i, and each node is connected to the K frames most similar to it. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
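A sketch of the KNN-graph construction with NetworkX, reusing the frame_similarity helper sketched above; connecting each frame to the K = 6 frames with the smallest accumulated distance follows the "most similar" reading of the text and is otherwise an assumption.

```python
import networkx as nx
import numpy as np

def build_knn_graph(landmarks_list, k=6):
    """Build the KNN graph over frames; edge weights carry the key point similarity s_ij."""
    n = len(landmarks_list)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i, j] = frame_similarity(landmarks_list[i], landmarks_list[j])
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        # Indices of the k most similar frames (smallest accumulated distance);
        # position 0 of the argsort is frame i itself (distance 0), so skip it.
        neighbors = np.argsort(sim[i])[1:k + 1]
        for j in neighbors:
            graph.add_edge(i, int(j), weight=float(sim[i, j]))
    return graph
```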
Step S6: convert the KNN graph into a line graph.
The procedure is as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected. After the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
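A sketch of the line-graph conversion; NetworkX's nx.line_graph performs exactly this node-per-edge construction, and the KNN edge weights are copied onto the line-graph nodes as their features (the attribute name "x" is an assumption).

```python
import networkx as nx

def to_line_graph(knn_graph):
    """Convert the KNN graph into its line graph (step S6)."""
    line = nx.line_graph(knn_graph)
    # Each line-graph node corresponds to an edge (i, j) of the KNN graph;
    # its feature is the similarity s_ij carried by that edge.
    for (i, j) in line.nodes:
        line.nodes[(i, j)]["x"] = knn_graph.edges[i, j]["weight"]
    return line
```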
Feature classification:
step S7: neural network feature analysis is rolled using a graph.
Constructing a pooled graph convolution neural network (Graph Convolutional Network, GCN), and carrying out global feature extraction on a graph of the video to obtain a graph representation. Specifically, first, a node feature matrix X and an adjacency matrix a on the Line graph are defined. Then, applying multiple layers of GCN for feature learning, the output of each layer can be expressed as:
H (l+1) =σ(D -1/2 AD -1/2 H (l) W (l) ) (2)
wherein H is (l) Representing node characteristics of the first layer, W (l) Representing the weight matrix of the first layer, σ (·) represents the activation function, and D is the degree matrix of a. After obtaining the characterization of all the nodes on the line graph, further obtaining the global characterization of the line graph by using a pooling layer:
Z=Pooling(H (l+1) ) (3)
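A minimal NumPy sketch of one GCN propagation step, formula (2), followed by mean pooling as one possible realization of the Pooling operator in formula (3); the ReLU activation and the mean-pooling choice are assumptions, since the text does not fix them.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step H' = sigma(D^(-1/2) A D^(-1/2) H W), formula (2)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_norm = d_inv_sqrt @ A @ d_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation (assumed)

def global_representation(H):
    """Z = Pooling(H), formula (3); mean pooling over line-graph nodes (assumed)."""
    return H.mean(axis=0)
```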
step S8: full connectivity layer and Softmax layer classifications.
And (3) inputting the global representation Z obtained in the step S7 into a full connection layer and a Softmax layer, and mapping the global representation Z into true and false classification of the video. The specific formula is as follows:
wherein,and judging true and false video category for the model.
During the training phase, optimization is performed by back propagation and gradient descent. Using a cross entropy loss function, the loss function is as follows:
wherein y is the true and false category of the video,to predict the authenticity category.
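A sketch of the classification head and loss of formulas (4) and (5); the parameter names W_fc and b_fc, the two-class softmax output and the label convention (1 = fake) are assumptions consistent with the real/fake classification described above.

```python
import numpy as np

def classify(z, W_fc, b_fc):
    """Map the global representation Z to class probabilities, formula (4)."""
    logits = z @ W_fc + b_fc                  # fully connected layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                    # softmax over {real, fake}

def cross_entropy(y_true, p_fake, eps=1e-12):
    """Binary cross entropy, formula (5); y_true is 1 for fake, 0 for real (assumed)."""
    return -(y_true * np.log(p_fake + eps) + (1 - y_true) * np.log(1 - p_fake + eps))
```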
The invention can be applied to the identification of deepfake Internet video in real-world scenes, provides accurate identification, and can help relevant personnel accurately locate forged videos.
In summary, the invention uses a fake video detection method based on graph network time sequence consistency, solves the problems of poor detection performance and loss of temporal information caused by relying on single frames only, and improves the accuracy of fake video identification.
Example 2
The present embodiment provides a fake video detection device based on graph network time sequence consistency, which includes a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for implementing the fake video detection method based on graph network time sequence consistency of embodiment 1 when executing the executable codes.
Example 3
The present embodiment relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the fake video detection method based on graph network timing consistency of the present embodiment 1.
The above embodiments are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solutions of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.

Claims (10)

1. A fake video detection method based on graph network time sequence consistency is characterized by comprising the following steps:
s1: decomposing the input deepfake video frame by frame and extracting frames to obtain extracted video frames;
s2: carrying out face detection on the video frame sequence obtained in the step S1 with a RetinaFace model, applying an affine transformation to the video frames containing a face to align and scale them to standard face coordinate points, and finally cropping the aligned and scaled face video frames to obtain RGB face video frame images;
s3: extracting key point characteristics of eyebrow, eyes, nose, mouth and face outline areas in the picture from the frame image obtained in the step S2 by using a face key point detection network, and extracting key points from each frame;
s4: calculating the similarity between video frames according to the key point features obtained in the step S3;
s5: constructing a KNN graph by using a K neighbor graph (K-Nearest Neighborhood Graph) according to the similarity between video frames;
s6: converting the KNN Graph obtained in the step S5 into a Line Graph;
s7: global feature extraction is performed on the line graph using a graph convolution neural network (Graph Convolutional Network) to obtain a representation of the graph, and a global pooling layer is used to obtain a global representation of the line graph;
s8: and (3) inputting the global representation obtained in the step (S7) into a full-connection layer and a Softmax layer, and mapping the global representation into a fake algorithm classification of fake video to realize identification of the fake algorithm.
2. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S1, the input deepfake video is decomposed frame by frame and frames are extracted, specifically as follows: the input deepfake video is decomposed into single-frame images, and 100 frames are uniformly sampled according to the total number of frames of the video; for videos with fewer than 100 frames, all frames of the video are extracted.
3. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S3, facial key point features such as gaze drift, facial expression and lip synchronization are extracted from the picture; the facial key points are divided into internal key points and contour key points, the internal key points comprising the 51 key points of the eyebrows, eyes, nose and mouth, and the contour key points comprising 17 key points; the 51 internal key points are distributed as follows:
each eyebrow has 5 key points uniformly sampled from the left boundary to the right boundary, 10 key points in total; each eye has 6 key points, namely the left and right boundaries plus uniform samples on the upper and lower eyelids, 12 key points in total; the lips have 20 key points: 2 at each of the left and right mouth corners, 5 points uniformly sampled on each of the outer boundaries of the upper and lower lips, and 3 points uniformly sampled on each of the inner boundaries of the upper and lower lips; the nose has 9 key points, with 4 sampled on the nose bridge and 5 uniformly collected around the nose tip; in addition, 17 key points are uniformly sampled on the face contour; in total, 68 face key points are sampled.
4. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S4, the similarity between each pair of video frames is calculated with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame; the Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
5. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S5, a K-nearest-neighbor (KNN) graph is constructed according to the similarities, specifically as follows: first, each frame image is converted into a corresponding node, and nodes are connected according to the similarity between them; for each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
6. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S6, the KNN graph is converted into a line graph as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected; after the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
7. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S7, a graph convolutional neural network (Graph Convolutional Network, GCN) with pooling is constructed, and global feature extraction is performed on the graph of the video to obtain a graph representation; specifically, the node feature matrix X and the adjacency matrix A of the line graph are first defined; then several GCN layers are applied for feature learning, and the output of each layer can be expressed as:

H^(l+1) = σ(D^(−1/2) A D^(−1/2) H^(l) W^(l))   (2)

where H^(l) denotes the node features of the l-th layer, W^(l) denotes the weight matrix of the l-th layer, σ(·) denotes the activation function, and D is the degree matrix of A; after the representations of all nodes on the line graph are obtained, a pooling layer is used to obtain the global representation of the line graph:

Z = Pooling(H^(l+1))   (3).
8. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S8, the global representation Z obtained in the step S7 is input into a fully connected layer and a Softmax layer and mapped to the real/fake classification of the video; the specific formula is as follows:

ŷ = Softmax(FC(Z))   (4)

where ŷ is the real/fake video category predicted by the model.
9. A fake video detection device based on graph network time sequence consistency, characterized by comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the fake video detection method based on graph network time sequence consistency according to any one of claims 1-8 when executing the executable codes.
10. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the fake video detection method based on graph network timing consistency of any one of claims 1-8.
CN202311289334.0A 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency Pending CN117523435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311289334.0A CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311289334.0A CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Publications (1)

Publication Number Publication Date
CN117523435A true CN117523435A (en) 2024-02-06

Family

ID=89740716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311289334.0A Pending CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Country Status (1)

Country Link
CN (1) CN117523435A (en)

Similar Documents

Publication Publication Date Title
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
CN105740780B (en) Method and device for detecting living human face
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN109359541A (en) A kind of sketch face identification method based on depth migration study
JP2021507394A (en) How to generate a human hairstyle based on multi-feature search and deformation
WO2019071976A1 (en) Panoramic image saliency detection method based on regional growth and eye movement model
CN110008846B (en) Image processing method
CN111126240B (en) Three-channel feature fusion face recognition method
JP2006524394A (en) Delineation of human contours in images
CN110263768A (en) A kind of face identification method based on depth residual error network
CN112419295B (en) Medical image processing method, medical image processing device, computer equipment and storage medium
CN113095263B (en) Training method and device for pedestrian re-recognition model under shielding and pedestrian re-recognition method and device under shielding
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111126307A (en) Small sample face recognition method of joint sparse representation neural network
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN115482595B (en) Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
CN111108508A (en) Facial emotion recognition method, intelligent device and computer-readable storage medium
CN113076905A (en) Emotion recognition method based on context interaction relationship
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
CN114445691A (en) Model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination