CN117523435A - Fake video detection method and device based on graph network time sequence consistency - Google Patents

Fake video detection method and device based on graph network time sequence consistency Download PDF

Info

Publication number
CN117523435A
CN117523435A CN202311289334.0A CN202311289334A CN117523435A CN 117523435 A CN117523435 A CN 117523435A CN 202311289334 A CN202311289334 A CN 202311289334A CN 117523435 A CN117523435 A CN 117523435A
Authority
CN
China
Prior art keywords
graph
video
frame
key points
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311289334.0A
Other languages
Chinese (zh)
Inventor
俞山青
王健博
周嘉俊
吴添银
童啸瑞
陈作辉
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202311289334.0A priority Critical patent/CN117523435A/en
Publication of CN117523435A publication Critical patent/CN117523435A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/247Aligning, centring, orientation detection or correction of the image by affine transforms, e.g. correction due to perspective effects; Quadrilaterals, e.g. trapezoids
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A fake video detection method and device based on graph network time sequence consistency. The method includes: S1, extracting frames from an input deepfake video; S2, detecting the face in each extracted single-frame picture with a RetinaFace detection model and aligning and scaling the face; S3, extracting features such as mouth shape and facial expression from each single-frame face picture with a face key point detection network; S4, converting the video frame sequence whose features were extracted by the detection network into a corresponding feature matrix; S5, calculating the similarity between video frames from the feature vectors of the single-frame pictures; S6, constructing a KNN graph from the inter-frame similarity using a K-nearest-neighbor graph; S7, converting the KNN graph into a line graph; S8, performing global feature extraction on the graph with a graph convolutional neural network to obtain a graph representation; S9, obtaining a global representation of the line graph with a pooling layer; S10, classifying the forgery algorithm of the fake video with a fully connected layer and a Softmax layer, thereby realizing detection of the forgery technique.

Description

Fake video detection method and device based on graph network time sequence consistency
Technical Field
The invention relates to a deepfake video identification method and device based on graph network time sequence consistency, and belongs to the fields of deep learning, computer vision and graph neural networks.
Background
In recent years, with the rapid development of deep learning and computer vision, and in particular of generative adversarial networks (GAN) and diffusion models (Diffusion Model), AI-generated content (AIGC) has grown explosively.
Specifically, deepfake technology mainly edits or generates faces and is mainly divided into: face reenactment, face replacement, face editing, and face generation. Face reenactment uses a source face to drive a target face so that the behavior of the target face is consistent with that of the source face. Face replacement replaces a target face with a source face so that the target face becomes the source face. Face editing modifies the facial attributes of a target face, such as its expression, the size of facial organs, or skin color. Face generation creates a face that does not exist in reality at all, without any target face as a basis.
With the rapid development of deepfake technology, producers can easily create large numbers of deepfake videos with computer or mobile-phone software and release them to the Internet for dissemination.
Existing research on deepfake video identification is limited. Current approaches mainly downsample the deepfake video into pictures, extract features with a deep learning model from the computer vision field, and finally detect forgeries with a classifier. Such methods ignore the timing information between video frames, cannot discover the timing inconsistencies that exist between frames, and therefore reduce the identification accuracy for deepfake video.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deepfake video detection method based on graph network time sequence consistency. The invention combines face key point features with timing information, calculates the similarity between frames, builds a graph over the video frames, and judges whether the video is forged, greatly improving the accuracy of the detection algorithm.
The invention adopts the technical scheme that: a first aspect of the present invention provides a fake video detection method based on graph network timing consistency, including the steps of:
S1: Decompose the input deepfake video frame by frame and extract frames to obtain the extracted video frames.
S2: Perform face detection on the video frame sequence obtained in step S1 with a RetinaFace model, apply an affine transformation to the frames containing a face to align and scale them to standard face coordinate points, and finally crop the aligned and scaled face frames to obtain RGB face video frame images.
S3: Extract key point features of the eyebrow, eye, nose, mouth and face contour regions from the frame images obtained in step S2 with a face key point detection network, obtaining key points for each frame.
S4: Calculate the similarity between video frames according to the key point features obtained in step S3.
S5: Construct a KNN graph from the inter-frame similarity using a K-nearest-neighbor graph (K-Nearest Neighborhood Graph).
S6: Convert the KNN graph obtained in step S5 into a line graph (Line Graph).
S7: Perform global feature extraction on the line graph with a graph convolutional neural network (Graph Convolutional Network) to obtain a graph representation, and obtain a global representation of the line graph with a global pooling layer.
S8: Input the global representation obtained in step S7 into a fully connected layer and a Softmax layer and map it to the forgery-algorithm classification of the fake video, realizing detection of the forgery algorithm.
In the step S1, the input deepfake video is decomposed frame by frame and frames are extracted to obtain the extracted video frames, specifically as follows: the input deepfake video is decomposed into single-frame images, and 100 frames are uniformly sampled according to the total number of frames of the video. For videos with fewer than 100 frames, all frames of the video are extracted.
In the step S3, key point features of the eyebrow, eye, nose, mouth and face contour regions are extracted from the frame images obtained in the step S2 with a face key point detection network, and key points are obtained for each frame. Specifically:
Features such as gaze drift, facial expression and lip synchronization are extracted from the picture. The face key points are divided into internal key points and contour key points: the internal key points comprise the 51 key points of the eyebrows, eyes, nose and mouth, and the contour key points comprise 17 key points. The 51 internal key points are distributed as follows:
Each eyebrow has 5 key points uniformly sampled from the left boundary to the right boundary, 10 key points in total. Each eye has 6 key points, namely the left and right boundaries plus uniform samples on the upper and lower eyelids, 12 key points in total. The lips have 20 key points: 2 at each of the left and right mouth corners, 5 points uniformly sampled on each of the outer boundaries of the upper and lower lips, and 3 points uniformly sampled on each of the inner boundaries of the upper and lower lips. The nose has 9 key points, with 4 sampled on the nose bridge and 5 uniformly collected around the nose tip. In addition, 17 key points are uniformly sampled on the face contour. In total, 68 face key points are sampled.
In the step S4, the similarity between video frames is calculated according to the key point features obtained in the step S3, with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame. The Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
In the step S5, a K-nearest-neighbor (KNN) graph is constructed according to the similarities, specifically as follows: first, each frame image is converted into a corresponding node, and nodes are connected according to the similarity between them. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
In the step S6, the KNN graph obtained in the step S5 is converted into a line graph (Line Graph). The specific process is as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected. After the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
In the step S7, global feature extraction is performed on the line graph using a graph convolutional neural network (Graph Convolutional Network) to obtain a representation of the graph, and a global pooling layer is used to obtain a global representation of the line graph. The specific process is as follows:
A graph convolutional neural network (Graph Convolutional Network, GCN) with pooling is constructed to perform global feature extraction on the graph of the video and obtain a graph representation. Specifically, the node feature matrix X and the adjacency matrix A of the line graph are first defined. Then several GCN layers are applied for feature learning, and the output of each layer can be expressed as:

H^(l+1) = σ(D^(−1/2) A D^(−1/2) H^(l) W^(l))   (2)

where H^(l) denotes the node features of the l-th layer, W^(l) denotes the weight matrix of the l-th layer, σ(·) denotes the activation function, and D is the degree matrix of A. After the representations of all nodes on the line graph are obtained, a pooling layer is used to obtain the global representation of the line graph:

Z = Pooling(H^(l+1))   (3)
a second aspect of the present invention relates to a fake video detection device based on graph network timing consistency, comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors execute the executable codes to implement the fake video detection method based on graph network timing consistency of the present invention
A third aspect of the present invention relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the fake video detection method of the present invention based on graph network timing consistency.
Compared with the prior art, the invention has the advantages and effects that:
(1) When features are extracted from the original video images, 68 key points covering the face contour, eyebrows, eyes, nose and mouth are located with a face key point extraction technique, and inter-frame inconsistency is computed from the offsets of the key point coordinates between frames. This exposes the temporal inconsistency of inter-frame motion and improves the model's ability to capture timing inconsistencies.
(2) Compared with methods that rely on hand-crafted features, the use of a graph convolutional neural network improves the flexibility of feature extraction. The timing information between video frames is exploited: a temporal graph is constructed from the inter-frame similarities, and the graph convolutional neural network greatly improves detection accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a frame diagram of the present invention;
fig. 2 is a schematic diagram of the algorithm structure of the present invention.
Detailed Description
Various exemplary embodiments of the invention will now be described in detail, which should not be considered as limiting the invention, but rather as more detailed descriptions of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. In addition, for numerical ranges in this disclosure, it is understood that each intermediate value between the upper and lower limits of the ranges is also specifically disclosed. Every smaller range between any stated value or stated range, and any other stated value or intermediate value within the stated range, is also encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the invention described herein without departing from the scope or spirit of the invention. Other embodiments will be apparent to those skilled in the art from consideration of the specification of the present invention. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
The "parts" in the present invention are all parts by mass unless otherwise specified.
Example 1
The invention provides a fake video detection method based on graph network time sequence consistency, which is shown in fig. 1 and comprises the following implementation steps:
image preprocessing:
step S1: original video frame extraction
Videos on the Internet are generally 30 frames per second, so one minute of video can contain up to 1800 frames; extracting features from every frame would incur a very large resource overhead. Therefore, the invention first decomposes the video into frames with the OpenCV Python library and then uniformly samples 100 frames according to the total number of frames of the video. For videos with fewer than 100 frames, all frames are extracted.
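As an illustration of step S1, the following is a minimal sketch using the OpenCV Python library (cv2) and NumPy; the function name sample_frames and the exact sampling logic are assumptions, not part of the original disclosure.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=100):
    """Uniformly sample up to num_frames frames from a video (step S1)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # If the video has fewer frames than the budget, take every frame.
    indices = np.linspace(0, total - 1, min(num_frames, total)).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```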
Step S2: face detection and clipping
Deepfake videos forge the face, so the forged region is concentrated in the face area, and the face in each single-frame picture must be detected and cropped. The specific flow is as follows: first, face detection is performed on the video frames with a RetinaFace network, and frames without a face are discarded. The face region coordinates are then obtained, the frame is cropped according to these coordinates, and the cropped face picture is finally scaled to 256×256 pixels.
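As an illustration of the cropping part of step S2, the following sketch assumes a face detector (for example a RetinaFace implementation) has already returned a bounding box; the helper name crop_and_resize_face and the (x1, y1, x2, y2) box format are assumptions.

```python
import cv2

def crop_and_resize_face(frame, bbox, size=256):
    """Crop the detected face region and scale it (step S2).

    bbox is (x1, y1, x2, y2) from any face detector such as RetinaFace;
    frames for which the detector returns no box are discarded upstream.
    """
    x1, y1, x2, y2 = [max(0, int(v)) for v in bbox]
    face = frame[y1:y2, x1:x2]
    return cv2.resize(face, (size, size))
```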
Extracting a characteristic composition:
step S3: calculating key points of human face
Face key point extraction is an important step in identifying deepfake video. By identifying and extracting the key points in a face image, the face can be modeled more accurately, which helps judge whether it has been forged. The invention divides the face key points into 51 internal key points and 17 face contour key points, where the internal key points comprise the 51 key points of the eyebrows, eyes, nose and mouth. Using Python, the 68 key point coordinates are obtained with the key point detection and localization in the dlib library.
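A minimal sketch of the 68-point extraction described above, assuming the dlib Python bindings and the standard pre-trained model file shape_predictor_68_face_landmarks.dat are available; the helper name extract_landmarks is an assumption.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(face_img):
    """Return a (68, 2) array of key point coordinates, or None if no face is found."""
    rects = detector(face_img, 1)
    if not rects:
        return None
    shape = predictor(face_img, rects[0])
    return np.array([[shape.part(k).x, shape.part(k).y] for k in range(68)])
```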
Step S4: calculating inter-frame similarity
In general, the timing-sensitive features of consecutive frames of a real video are highly consistent, so their similarity is relatively high, whereas the timing-sensitive features of consecutive frames of a deepfake video are less consistent (or at least some of them are), so their similarity is relatively low. Based on the assumption that deepfake videos and real videos differ significantly in the consistency of timing-sensitive features, a graph network based on these features is constructed.
Each frame image is converted into a corresponding node, and nodes are connected according to the inter-frame similarity. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
The similarity between each pair of video frames is calculated with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame. The Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
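A sketch of the inter-frame similarity of formula (1), assuming the (68, 2) landmark arrays produced in step S3; note that the text calls the accumulated key point distance a "similarity", so smaller values correspond to more similar frames.

```python
import numpy as np

def frame_similarity(landmarks_i, landmarks_j):
    """Accumulated Euclidean distance between corresponding key points, formula (1).

    landmarks_i, landmarks_j: (68, 2) arrays from the key point detector.
    """
    return float(np.linalg.norm(landmarks_i - landmarks_j, axis=1).sum())
```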
Step S5: construct the KNN graph.
Based on the inter-frame similarities, a K-nearest-neighbor graph (K-Nearest Neighborhood Graph, KNN graph) is constructed. Specifically, each node v_i in the KNN graph represents video frame f_i, and each node is connected to the K frames most similar to it. For each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
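A sketch of the KNN-graph construction with NetworkX, reusing the frame_similarity helper sketched above; connecting each frame to the K = 6 frames with the smallest accumulated distance follows the "most similar" reading of the text and is otherwise an assumption.

```python
import networkx as nx
import numpy as np

def build_knn_graph(landmarks_list, k=6):
    """Build the KNN graph over frames; edge weights carry the key point similarity s_ij."""
    n = len(landmarks_list)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i, j] = frame_similarity(landmarks_list[i], landmarks_list[j])
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        # Indices of the k most similar frames (smallest accumulated distance);
        # position 0 of the argsort is frame i itself (distance 0), so skip it.
        neighbors = np.argsort(sim[i])[1:k + 1]
        for j in neighbors:
            graph.add_edge(i, int(j), weight=float(sim[i, j]))
    return graph
```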
Step S6: convert the KNN graph into a line graph.
The procedure is as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected. After the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
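A sketch of the line-graph conversion; NetworkX's nx.line_graph performs exactly this node-per-edge construction, and the KNN edge weights are copied onto the line-graph nodes as their features (the attribute name "x" is an assumption).

```python
import networkx as nx

def to_line_graph(knn_graph):
    """Convert the KNN graph into its line graph (step S6)."""
    line = nx.line_graph(knn_graph)
    # Each line-graph node corresponds to an edge (i, j) of the KNN graph;
    # its feature is the similarity s_ij carried by that edge.
    for (i, j) in line.nodes:
        line.nodes[(i, j)]["x"] = knn_graph.edges[i, j]["weight"]
    return line
```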
Feature classification:
step S7: neural network feature analysis is rolled using a graph.
Constructing a pooled graph convolution neural network (Graph Convolutional Network, GCN), and carrying out global feature extraction on a graph of the video to obtain a graph representation. Specifically, first, a node feature matrix X and an adjacency matrix a on the Line graph are defined. Then, applying multiple layers of GCN for feature learning, the output of each layer can be expressed as:
H (l+1) =σ(D -1/2 AD -1/2 H (l) W (l) ) (2)
wherein H is (l) Representing node characteristics of the first layer, W (l) Representing the weight matrix of the first layer, σ (·) represents the activation function, and D is the degree matrix of a. After obtaining the characterization of all the nodes on the line graph, further obtaining the global characterization of the line graph by using a pooling layer:
Z=Pooling(H (l+1) ) (3)
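A minimal NumPy sketch of one GCN propagation step, formula (2), followed by mean pooling as one possible realization of the Pooling operator in formula (3); the ReLU activation and the mean-pooling choice are assumptions, since the text does not fix them.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step H' = sigma(D^(-1/2) A D^(-1/2) H W), formula (2)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_norm = d_inv_sqrt @ A @ d_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation (assumed)

def global_representation(H):
    """Z = Pooling(H), formula (3); mean pooling over line-graph nodes (assumed)."""
    return H.mean(axis=0)
```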
step S8: full connectivity layer and Softmax layer classifications.
And (3) inputting the global representation Z obtained in the step S7 into a full connection layer and a Softmax layer, and mapping the global representation Z into true and false classification of the video. The specific formula is as follows:
wherein,and judging true and false video category for the model.
During the training phase, optimization is performed by back propagation and gradient descent. Using a cross entropy loss function, the loss function is as follows:
wherein y is the true and false category of the video,to predict the authenticity category.
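A sketch of the classification head and loss of formulas (4) and (5); the parameter names W_fc and b_fc, the two-class softmax output and the label convention (1 = fake) are assumptions consistent with the real/fake classification described above.

```python
import numpy as np

def classify(z, W_fc, b_fc):
    """Map the global representation Z to class probabilities, formula (4)."""
    logits = z @ W_fc + b_fc                  # fully connected layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                    # softmax over {real, fake}

def cross_entropy(y_true, p_fake, eps=1e-12):
    """Binary cross entropy, formula (5); y_true is 1 for fake, 0 for real (assumed)."""
    return -(y_true * np.log(p_fake + eps) + (1 - y_true) * np.log(1 - p_fake + eps))
```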
The invention can be applied to the identification of deepfake Internet video in real-world scenes, provides accurate identification, and can help relevant personnel accurately locate forged videos.
In summary, the invention uses a fake video detection method based on graph network time sequence consistency, solves the problems of poor detection performance and loss of temporal information caused by relying on single frames only, and improves the accuracy of fake video identification.
Example 2
The present embodiment provides a fake video detection device based on graph network time sequence consistency, which includes a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for implementing the fake video detection method based on graph network time sequence consistency of embodiment 1 when executing the executable codes.
Example 3
The present embodiment relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the fake video detection method based on graph network timing consistency of the present embodiment 1.
The above embodiments are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solutions of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.

Claims (10)

1. A fake video detection method based on graph network time sequence consistency is characterized by comprising the following steps:
s1: decomposing the input deepfake video frame by frame and extracting frames to obtain extracted video frames;
s2: carrying out face detection on the video frame sequence obtained in the step S1 with a RetinaFace model, applying an affine transformation to the video frames containing a face to align and scale them to standard face coordinate points, and finally cropping the aligned and scaled face video frames to obtain RGB face video frame images;
s3: extracting key point characteristics of eyebrow, eyes, nose, mouth and face outline areas in the picture from the frame image obtained in the step S2 by using a face key point detection network, and extracting key points from each frame;
s4: calculating the similarity between video frames according to the key point features obtained in the step S3;
s5: constructing a KNN graph by using a K neighbor graph (K-Nearest Neighborhood Graph) according to the similarity between video frames;
s6: converting the KNN Graph obtained in the step S5 into a Line Graph;
s7: global feature extraction is performed on the line graph using a graph convolution neural network (Graph Convolutional Network) to obtain a representation of the graph, and a global pooling layer is used to obtain a global representation of the line graph;
s8: and (3) inputting the global representation obtained in the step (S7) into a full-connection layer and a Softmax layer, and mapping the global representation into a fake algorithm classification of fake video to realize identification of the fake algorithm.
2. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S1, the input deepfake video is decomposed frame by frame and frames are extracted, specifically as follows: the input deepfake video is decomposed into single-frame images, and 100 frames are uniformly sampled according to the total number of frames of the video; for videos with fewer than 100 frames, all frames of the video are extracted.
3. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S3, facial key point features such as gaze drift, facial expression and lip synchronization are extracted from the picture; the facial key points are divided into internal key points and contour key points, the internal key points comprising the 51 key points of the eyebrows, eyes, nose and mouth, and the contour key points comprising 17 key points; the 51 internal key points are distributed as follows:
each eyebrow has 5 key points uniformly sampled from the left boundary to the right boundary, 10 key points in total; each eye has 6 key points, namely the left and right boundaries plus uniform samples on the upper and lower eyelids, 12 key points in total; the lips have 20 key points: 2 at each of the left and right mouth corners, 5 points uniformly sampled on each of the outer boundaries of the upper and lower lips, and 3 points uniformly sampled on each of the inner boundaries of the upper and lower lips; the nose has 9 key points, with 4 sampled on the nose bridge and 5 uniformly collected around the nose tip; in addition, 17 key points are uniformly sampled on the face contour; in total, 68 face key points are sampled.
4. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S4, the similarity between each pair of video frames is calculated with the following formula:

s_ij = Σ_{k=1}^{68} ||p_i^k − p_j^k||   (1)

where s_ij denotes the similarity of the key point vectors between the i-th frame and the j-th frame pictures, k indexes the 68 key points, and p_i^k denotes the coordinates of the k-th key point in the key point vector of the i-th frame; the Euclidean distances between corresponding key points are computed and accumulated to obtain the similarity of the two frames.
5. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S5, a K-nearest-neighbor (KNN) graph is constructed according to the similarities, specifically as follows: first, each frame image is converted into a corresponding node, and nodes are connected according to the similarity between them; for each node, the K = 6 nearest neighbor nodes are selected and connected, and the weight of the edge between two nodes is the key point similarity between them.
6. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S6, the KNN graph is converted into a line graph as follows: first, each edge (v_i, v_j) in the KNN graph is converted into a node e_ij in the line graph; if two edges in the KNN graph share a common node, the corresponding two nodes in the line graph are connected; after the KNN graph of the video is converted into the line graph, the feature of a line-graph node e_ij corresponds to the multi-feature similarity s_ij carried by the corresponding edge in the KNN graph.
7. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S7, a graph convolutional neural network (Graph Convolutional Network, GCN) with pooling is constructed, and global feature extraction is performed on the graph of the video to obtain a graph representation; specifically, the node feature matrix X and the adjacency matrix A of the line graph are first defined; then several GCN layers are applied for feature learning, and the output of each layer can be expressed as:

H^(l+1) = σ(D^(−1/2) A D^(−1/2) H^(l) W^(l))   (2)

where H^(l) denotes the node features of the l-th layer, W^(l) denotes the weight matrix of the l-th layer, σ(·) denotes the activation function, and D is the degree matrix of A; after the representations of all nodes on the line graph are obtained, a pooling layer is used to obtain the global representation of the line graph:

Z = Pooling(H^(l+1))   (3).
8. The method for detecting fake video based on graph network time sequence consistency of claim 1, wherein in the step S8, the global representation Z obtained in the step S7 is input into a fully connected layer and a Softmax layer and mapped to the real/fake classification of the video; the specific formula is as follows:

ŷ = Softmax(FC(Z))   (4)

where ŷ is the real/fake video category predicted by the model.
9. A fake video detection device based on graph network time sequence consistency, characterized by comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the fake video detection method based on graph network time sequence consistency according to any one of claims 1-8 when executing the executable codes.
10. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the fake video detection method based on graph network timing consistency of any one of claims 1-8.
CN202311289334.0A 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency Pending CN117523435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311289334.0A CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311289334.0A CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Publications (1)

Publication Number Publication Date
CN117523435A true CN117523435A (en) 2024-02-06

Family

ID=89740716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311289334.0A Pending CN117523435A (en) 2023-10-08 2023-10-08 Fake video detection method and device based on graph network time sequence consistency

Country Status (1)

Country Link
CN (1) CN117523435A (en)

Similar Documents

Publication Publication Date Title
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
CN105740780B (en) Method and device for detecting living human face
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN109359541A (en) A kind of sketch face identification method based on depth migration study
JP2021507394A (en) How to generate a human hairstyle based on multi-feature search and deformation
WO2019071976A1 (en) Panoramic image saliency detection method based on regional growth and eye movement model
CN110008846B (en) Image processing method
CN111126240B (en) Three-channel feature fusion face recognition method
JP2006524394A (en) Delineation of human contours in images
CN110263768A (en) A kind of face identification method based on depth residual error network
CN112419295B (en) Medical image processing method, medical image processing device, computer equipment and storage medium
CN113095263B (en) Training method and device for pedestrian re-recognition model under shielding and pedestrian re-recognition method and device under shielding
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111126307A (en) Small sample face recognition method of joint sparse representation neural network
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN115482595B (en) Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
CN111108508A (en) Facial emotion recognition method, intelligent device and computer-readable storage medium
CN113076905A (en) Emotion recognition method based on context interaction relationship
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
CN114445691A (en) Model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination