CN113283393A - Method for detecting Deepfake video based on image group and two-stream network - Google Patents

Method for detecting Deepfake video based on image group and two-stream network

Info

Publication number
CN113283393A
CN113283393A (application CN202110717852.2A); granted as CN113283393B
Authority
CN
China
Prior art keywords
network
video
frame
stream
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110717852.2A
Other languages
Chinese (zh)
Other versions
CN113283393B (en)
Inventor
王金伟 (Wang Jinwei)
张玫瑰 (Zhang Meigui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110717852.2A priority Critical patent/CN113283393B/en
Publication of CN113283393A publication Critical patent/CN113283393A/en
Application granted granted Critical
Publication of CN113283393B publication Critical patent/CN113283393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for detecting a Deepfake video based on an image group and a two-stream network, comprising the following steps: (1) extracting key frames of the video to be detected to form an image group; (2) inputting the first frame of the image group into the spatial stream of the two-stream network to extract spatial features; (3) subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract temporal features; (4) fusing the extracted spatial and temporal features and evaluating the authenticity of the video with a dynamic routing algorithm. Compared with the prior art, the method reduces computational redundancy by using the image group so that the network concentrates on the key frames, makes full use of the spatio-temporal information of the key frames by fusing the spatial and temporal features, and classifies with a dynamic routing algorithm to obtain a more accurate evaluation result.

Description

Method for detecting Deepfake video based on image group and two-stream network
Technical Field
The invention belongs to the field of video detection, and particularly relates to a method for detecting a Deepfake video based on an image group and a two-stream network.
Background
With the rise and development of artificial intelligence, face-swapping technology has gradually attracted wide attention. The advent of Deepfake was a breakthrough in face swapping: it can replace the face of a source person in a video with the face of a target person. With the emergence and refinement of generative adversarial networks, face swapping has become easier and harder to notice with the naked eye. As public figures, celebrities and politicians have large numbers of videos on the Internet, so malicious actors can forge videos at will, spreading false information and creating confusion that threatens society. Detection of Deepfake videos is therefore urgent and of great practical significance.
Existing Deepfake video detection methods can be divided into detection based on intra-frame artifacts and detection based on inter-frame temporal characteristics. The intra-frame artifact approach first decomposes the video into frames, analyzes every frame, and judges the authenticity of the video by averaging the per-frame results to obtain a video-level prediction. This approach is similar to image detection, except that compression reduces the sharpness of the video frames and makes detection harder. Even if a CNN can correctly classify individual frames, predicting the authenticity of a video by simple averaging is not accurate. The second approach, based on inter-frame temporal characteristics, treats the video as a whole and considers the temporal correlation between frames, which evaluates Deepfake videos more reasonably. However, both approaches share a common problem: a result can only be obtained by analyzing the entire video, and the similarity between video frames inevitably creates highly redundant information, so these detection methods are computationally expensive and slow.
Disclosure of Invention
To solve the problems of heavy computation and low efficiency in existing Deepfake video detection techniques, the invention provides a Deepfake video detection method based on an image group and a two-stream network. The technical scheme adopted by the invention is as follows:
a method for detecting a Deepfake video based on an image group and a two-stream network comprises the following steps:
step 1: extracting key frames of a video to be detected to form an image group;
step 2: inputting the first frame of the image group into a spatial stream in a two-stream network to extract spatial features;
step 3: subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract temporal features;
step 4: fusing the extracted spatial and temporal features, and evaluating the authenticity of the video with a dynamic routing algorithm.
Further, in step 1, face-region images are cropped from the video frames at a fixed size, the face-region images of adjacent frames are differenced, the 10 frames whose face regions change most are selected as key frames according to the average intensity of the inter-frame difference, and an image group is formed in temporal order to represent the video.
Further, the inter-frame difference is calculated as
absDiff_i = F_i - F_{i-1},
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, and absDiff_i denotes the difference between them. The average intensity of the inter-frame difference is calculated as
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),
where absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
Further, the two-stream network in steps 2 and 3 comprises a spatial stream and a temporal stream. The spatial stream consists of the first to fifth sequences of a pre-trained ResNet50 network followed by a primary capsule network and is used to extract spatial features; the temporal stream consists of a spatial pyramid pooling network and a GRU network and is used to extract temporal features. The spatial features are assigned, as auxiliary information, to the hidden state of the GRU network, which is used to analyse temporal coherence. The two-stream network is trained with the Adam optimization algorithm, and the loss function is the cross-entropy loss
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively.
Furthermore, the capsules of the primary capsule network share the same structure, each comprising two-dimensional convolution layers, a statistics pooling layer and a one-dimensional convolution layer, where the statistics pooling layer computes the mean and variance of each convolution-kernel output. The mean is calculated as
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,
and the variance as
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
Furthermore, the output of the spatial pyramid pooling network is a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map.
Further, the difference maps in step 3 can be expressed as
Diff_{m-1} = F_m - F_1, m = 2, …, 10,
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
Further, in step 4 the spatial and temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm. The output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that it is a real video. If p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
Compared with the prior art, the invention has the following beneficial effects: key frames selected by inter-frame differencing form an image group that replaces the full video as network input, so the network focuses on learning the features of the key frames, which reduces computational redundancy and improves efficiency; the proposed spatio-temporal two-stream detection network with dynamic-routing classification makes full use of the spatial and temporal features of the image group and effectively improves detection accuracy.
Drawings
FIG. 1 is a method block diagram of the present invention.
Fig. 2 is a schematic structural diagram of a spatial flow network according to the present invention.
Fig. 3 is a schematic structural diagram of the spatial pyramid pooling network of the present invention.
Fig. 4 is pseudo code of the dynamic routing algorithm of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
FIG. 1 shows a flow chart of the present invention, which comprises the following steps:
(1) Extracting key frames from the video to be detected to form an image group
The video to be detected is cropped at a fixed size to obtain face-region images, and key frames are extracted by inter-frame differencing to form the image group fed to the network: adjacent face-region images are differenced, and the frames with the larger changes, measured by the average intensity of the inter-frame difference, are taken as key frames. Since there is strong temporal correlation between video frames, the 10 extracted key frames are combined in temporal order into an image group representing the video so that temporal features are not lost. The inter-frame difference is calculated by formula (1) and the average intensity of the inter-frame difference by formula (2),
absDiff_i = F_i - F_{i-1},    (1)
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),    (2)
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, absDiff_i is the difference between them, absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
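As an illustration of this key-frame selection step (formulas (1) and (2)), a minimal Python sketch using OpenCV and NumPy is given below; the fixed crop box, the helper name extract_image_group, and the default of 10 key frames are assumptions for the example, not the patent's reference implementation.

```python
import cv2
import numpy as np

def extract_image_group(video_path, crop_box, num_keyframes=10):
    """Select the num_keyframes face crops whose inter-frame change is largest,
    measured by the mean intensity of the frame difference (formulas (1)-(2)),
    and return them in temporal order as the image group."""
    x, y, w, h = crop_box                                   # fixed-size face region (assumed known)
    cap = cv2.VideoCapture(video_path)
    faces, scores = [], []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        face = frame[y:y + h, x:x + w].astype(np.float32)   # F_i
        if prev is None:
            scores.append(0.0)                              # first frame has no predecessor
        else:
            abs_diff = np.abs(face - prev)                  # absDiff_i
            scores.append(float(abs_diff.mean()))           # diffMean_i
        faces.append(face)
        prev = face
    cap.release()
    top = sorted(np.argsort(scores)[-num_keyframes:])       # largest changes, kept in temporal order
    return [faces[i] for i in top]
```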
(2) The first frame of the group of images is input into the spatial stream in the two-stream network to extract spatial information.
Because the number of available Deepfake videos is small, it is not appropriate to train the network from scratch. To avoid overfitting, part of a ResNet50 network pre-trained on the ILSVRC database is used to extract latent features. Compared with the full ResNet50, using only the first to fifth sequences of the pre-trained network (the blocks of the first and second convolutional stages) is more advantageous for detection, because the full ResNet50 extracts high-level semantic information and thereby ignores the artifact features within the frame.
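A minimal sketch of such a truncated, ImageNet-pretrained feature extractor, assuming torchvision's ResNet50 and assuming the cut point falls after the stem and the first two residual stages (the exact truncation point is not spelled out in the translated text):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_latent_feature_extractor():
    """Truncated ImageNet-pretrained ResNet50 used as the low-level feature
    extractor of the spatial stream (stem + first two residual stages; the
    exact cut point is an assumption based on the description)."""
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # children(): conv1, bn1, relu, maxpool, layer1, layer2, layer3, layer4, avgpool, fc
    extractor = nn.Sequential(*list(resnet.children())[:6])
    for p in extractor.parameters():            # keep the pretrained weights frozen
        p.requires_grad = False
    return extractor

if __name__ == "__main__":
    net = build_latent_feature_extractor()
    feat = net(torch.randn(1, 3, 224, 224))     # -> (1, 512, 28, 28)
    print(feat.shape)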
As shown in fig. 2, the complete capsule network comprises several primary capsule networks for extracting key features and a digit capsule network for classification. The primary capsule network is composed of groups of neurons called capsules. Each capsule could in principle have a different structure; to simplify the computation, the invention uses capsules with identical structure, each comprising a two-dimensional convolution layer, a statistics pooling layer and a one-dimensional convolution layer. The statistics pooling layer computes the mean and variance of each convolution-kernel output, given by formulas (3) and (4), respectively,
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,    (3)
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,    (4)
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
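A sketch of one primary capsule with the statistics pooling of formulas (3) and (4); the channel widths and the output dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistics pooling: reduce each feature map (convolution-kernel output)
    to its mean and variance over the spatial dimensions, as in formulas (3)-(4)."""
    def forward(self, x):                        # x: (batch, channels, H, W)
        mu = x.mean(dim=(2, 3))                  # mu_k
        var = x.var(dim=(2, 3), unbiased=False)  # sigma_k^2
        return torch.stack([mu, var], dim=-1)    # (batch, channels, 2)

class PrimaryCapsule(nn.Module):
    """One primary capsule: 2-D conv -> statistics pooling -> 1-D conv.
    Channel widths are illustrative, not taken from the patent."""
    def __init__(self, in_ch=512, mid_ch=64, out_dim=8):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.stats = StatsPooling()
        self.conv1d = nn.Conv1d(mid_ch, out_dim, kernel_size=2)

    def forward(self, x):
        s = self.stats(self.conv2d(x))           # (batch, mid_ch, 2)
        return self.conv1d(s).squeeze(-1)        # capsule vector, (batch, out_dim)
```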
In image processing, a CNN focuses on detecting the important features in an image but ignores the spatial relationships between those features. The capsule network instead learns features with each complete capsule: each capsule represents the features of a different facial region, such as the eyes, nose or mouth, and outputs a direction vector that can reflect spatial hierarchy information, which makes it more robust for fake-face detection.
(3) The temporal stream of the two-stream network extracts inter-frame inconsistency from the remaining frames of the image group
Because the frames in the image group are highly similar, and spatial features are already analysed on the main (first) frame, the remaining frames are each subtracted from the main frame to obtain the difference-image sequence, as in formula (5), which reduces feature redundancy and saves computing resources.
Diff_{m-1} = F_m - F_1, m = 2, …, 10,    (5)
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
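Formula (5) translates directly into a short helper; the function name is an assumption:

```python
import numpy as np

def build_difference_sequence(image_group):
    """Given the image group [F_1, ..., F_10], return the difference maps
    Diff_{m-1} = F_m - F_1 for m = 2..len(image_group), as in formula (5)."""
    first = image_group[0].astype(np.float32)
    return [frame.astype(np.float32) - first for frame in image_group[1:]]
```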
After the difference-map sequence is generated, the temporal coherence between frames is analysed with a GRU network. GRU networks are commonly used for text analysis, where each cell predicts a word represented by a one-dimensional vector; the face difference maps in the invention are three-dimensional and would have to be flattened into one dimension to fit the GRU. Because the difference maps are sparse, direct flattening wastes space and increases the amount of computation, so the invention uses a spatial pyramid pooling network to extract the key information of each three-dimensional difference map. The spatial pyramid pooling network produces a fixed-size output regardless of the input size: as shown in fig. 3, the difference map is pooled at several scales and the pooled features of each scale are concatenated into a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map, and N is generally 3-5.
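A sketch of the spatial pyramid pooling step, assuming an n x n max-pooling grid at each of the N levels (the exact bin layout of the original formula is not recoverable from the text, so the vector length shown is an assumption):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(diff_map, levels=(1, 2, 3)):
    """Pool a (C, H, W) difference map over an n x n grid for each pyramid
    level n and concatenate the results into one fixed-length 1-D vector of
    size C * sum(n^2); the n x n grid per level is an assumption."""
    c = diff_map.shape[0]
    pooled = [F.adaptive_max_pool2d(diff_map.unsqueeze(0), n).reshape(c * n * n)
              for n in levels]
    return torch.cat(pooled)

if __name__ == "__main__":
    vec = spatial_pyramid_pool(torch.randn(3, 64, 64))
    print(vec.shape)          # torch.Size([42]) = 3 * (1 + 4 + 9)
```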
The one-dimensional feature vector learned from each three-dimensional difference map is input into the GRU network to extract temporal inconsistency information. Compared with an LSTM network, a GRU uses a single update gate to control both forgetting and memorizing, which greatly reduces the number of parameters and speeds up training. The hidden state of a GRU is normally initialized to zero; in the invention, the spatial features extracted by the spatial stream are instead assigned to the hidden state as auxiliary information. Because the temporal-stream inputs are differences against the first frame, many important features are lost; those features have already been extracted by the spatial stream, so they are introduced into the temporal stream directly, avoiding repeated extraction of spatial features, reducing redundancy and accelerating training and detection.
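A sketch of the temporal stream described here: a GRU over the SPP vectors of the nine difference maps whose initial hidden state is a projection of the spatial-stream feature; all dimensions and the projection layer are assumptions.

```python
import torch
import torch.nn as nn

class TemporalStream(nn.Module):
    """GRU over the SPP vectors of the difference maps. Instead of a zero
    initial hidden state, the spatial-stream feature is projected and used
    as h_0, so spatial information does not have to be re-extracted here.
    All dimensions are illustrative assumptions."""
    def __init__(self, spp_dim=42, spatial_dim=256, hidden_dim=128):
        super().__init__()
        self.init_h = nn.Linear(spatial_dim, hidden_dim)          # spatial feature -> h_0
        self.gru = nn.GRU(spp_dim, hidden_dim, batch_first=True)

    def forward(self, spp_seq, spatial_feat):
        # spp_seq: (batch, 9, spp_dim); spatial_feat: (batch, spatial_dim)
        h0 = torch.tanh(self.init_h(spatial_feat)).unsqueeze(0)   # (1, batch, hidden)
        _, h_n = self.gru(spp_seq, h0)
        return h_n.squeeze(0)                                     # temporal feature

if __name__ == "__main__":
    net = TemporalStream()
    out = net(torch.randn(4, 9, 42), torch.randn(4, 256))
    print(out.shape)                                              # torch.Size([4, 128])
```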
(4) Evaluating the authenticity of the video to be detected with the dynamic routing algorithm
After the two-stream network has learned the temporal and spatial features, the two are concatenated to fuse the spatio-temporal features, and the probability that the video is genuine is computed with a dynamic routing algorithm to obtain the video-level evaluation result. The dynamic routing algorithm was proposed for capsule networks and can be regarded as a vector version of a fully connected layer: the length of a vector expresses the probability that an entity exists, so features can be routed more accurately to the category they belong to. The algorithm is given as pseudo-code in fig. 4. The spatial and temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm; the output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that it is a real video. If p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
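A compact sketch of this classification stage: routing-by-agreement (the standard capsule-network dynamic routing, which the pseudo-code of fig. 4 follows) from the fused capsules to two digit capsules, followed by the softmax-average-and-compare decision. Tensor shapes, the squash non-linearity and the exact placement of the softmax are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing non-linearity: keeps the direction, maps the length into [0, 1)."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement from primary-capsule predictions to digit capsules.
    u_hat: (batch, num_in, num_out, out_dim) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)       # routing logits
    for _ in range(iterations):
        c = F.softmax(b, dim=2)                                  # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))         # (batch, num_out, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)             # agreement update
    return v

def predict(u_hat):
    """Average the two digit-capsule outputs after softmax and compare them:
    label 0 = Deepfake, label 1 = real (the patent's label convention)."""
    v = dynamic_routing(u_hat)                                   # (batch, 2, out_dim)
    probs = F.softmax(v, dim=1).mean(dim=-1)                     # (batch, 2): [p_fake, p_real]
    labels = (probs[:, 1] >= probs[:, 0]).long()                 # 1 if p_real >= p_fake, else 0
    return labels, probs
```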
Since the capsule network is used here for forensics and does not perform reconstruction, the network is trained only with the cross-entropy loss,
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively. The training data come from the FaceForensics++ dataset: key frames are extracted from each video in the dataset to form image groups, and an image group from a Deepfake video is labeled 0 while an image group from a real video is labeled 1.
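A minimal training-loop sketch under the stated setup (Adam, cross-entropy, label 0 for Deepfake and 1 for real); the model, data loader and hyper-parameters are placeholders, not the patent's code.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Train the two-stream network with Adam and cross-entropy.
    `model` maps an image group to a single probability of being real;
    image groups from Deepfake videos are labeled 0, real videos 1."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()                        # L = -[y log y_hat + (1-y) log(1-y_hat)]
    for epoch in range(epochs):
        total = 0.0
        for image_group, label in loader:           # label: 0 = Deepfake, 1 = real
            image_group = image_group.to(device)
            label = label.float().to(device)
            y_hat = model(image_group)               # predicted probability of "real"
            loss = criterion(y_hat, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / max(len(loader), 1):.4f}")
```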
In conclusion, the proposed Deepfake video detection method uses the image group to greatly reduce computational redundancy and lets the network concentrate on the key frames; the two-stream network extracts the spatial and temporal features of the image group, fully mining the key cues of video authenticity as the basis for judgment; finally, classification with the dynamic routing algorithm yields a more accurate evaluation result.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiment; all technical solutions falling within the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention also fall within the protection scope of the invention.

Claims (8)

1. A method for detecting a Deepfake video based on an image group and a two-stream network is characterized by comprising the following steps:
step 1: extracting key frames of the video to be detected to form an image group;
step 2: inputting the first frame of the image group into the spatial stream of the two-stream network to extract spatial information as spatial features;
step 3: subtracting the first frame from each of the remaining frames of the image group to obtain difference images, forming a difference-image sequence, and inputting the sequence into the temporal stream of the two-stream network to extract the inter-frame inconsistency as temporal features;
step 4: fusing the extracted spatial features and temporal features, and evaluating the authenticity of the video with a dynamic routing algorithm.
2. The method as claimed in claim 1, wherein in step 1, face-region images are cropped from the video frames at a fixed size, the face-region images of adjacent frames are differenced, the 10 frames whose face regions change most are extracted as key frames according to the average intensity of the inter-frame difference, and an image group is formed in temporal order to represent the video.
3. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 2, wherein the inter-frame difference is calculated as
absDiff_i = F_i - F_{i-1},
where F_i and F_{i-1} denote the face-region images of the i-th and (i-1)-th frames, respectively, and absDiff_i denotes the difference between them; the average intensity of the inter-frame difference is calculated as
diffMean_i = (1 / (width × height)) · Σ_{x=1}^{width} Σ_{y=1}^{height} absDiff_i(x, y),
where absDiff_i(x, y) is the value of absDiff_i at coordinate (x, y), width and height are the width and height of the face-region image, and diffMean_i denotes the average intensity of the difference between the face-region images of the i-th and (i-1)-th frames.
4. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 1, wherein the two-stream network in steps 2 and 3 comprises a spatial stream and a temporal stream; the spatial stream consists of the first to fifth sequences of a pre-trained ResNet50 network and a primary capsule network and is used to extract the spatial features; the temporal stream consists of a spatial pyramid pooling network and a GRU network and is used to extract the temporal features; the spatial features are assigned, as auxiliary information, to the hidden state of the GRU network; the GRU network is used to analyse temporal coherence; the two-stream network is trained with the Adam optimization algorithm, and the loss function is the cross-entropy loss
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)],
where L is the loss value and y and ŷ denote the sample label and the predicted label, respectively.
5. The method as claimed in claim 4, wherein the capsules of the primary capsule network share the same structure, each comprising two-dimensional convolution layers, a statistics pooling layer and a one-dimensional convolution layer, wherein the statistics pooling layer computes the mean and variance of each convolution-kernel output; the mean is calculated as
μ_k = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} I_kij,
and the variance as
σ_k² = (1 / (W × H)) · Σ_{i=1}^{W} Σ_{j=1}^{H} (I_kij - μ_k)²,
where μ_k denotes the mean of the k-th convolution-kernel output, I_kij denotes its value at position (i, j), W and H denote its width and height, and σ_k² denotes its variance.
6. The method as claimed in claim 4, wherein the output of the spatial pyramid pooling network is a one-dimensional feature vector whose length is determined by the number N of pyramid levels,
length = 3 × Σ_{n=1}^{N} n²,
where the coefficient 3 is the dimension of the difference map.
7. The method as claimed in claim 1, wherein the difference maps in step 3 are expressed as
Diff_{m-1} = F_m - F_1, m = 2, …, 10,
where Diff_{m-1} denotes the (m-1)-th difference map, and F_m and F_1 denote the m-th frame and the first frame of the image group, respectively.
8. The method for detecting a Deepfake video based on an image group and a two-stream network as claimed in claim 1, wherein in step 4 the spatial features and the temporal features are concatenated, fused, and passed to the digit capsule network through the dynamic routing algorithm; the output vectors of the digit capsule network are averaged after softmax to obtain the final network output, in which p̂_fake denotes the probability that the video is a Deepfake video and p̂_real denotes the probability that the video is a real video; if p̂_fake > p̂_real, the network predicts label 0 and the video to be detected is a Deepfake video; otherwise the network predicts label 1 and the video to be detected is a real video.
CN202110717852.2A 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network Active CN113283393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717852.2A CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110717852.2A CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Publications (2)

Publication Number Publication Date
CN113283393A true CN113283393A (en) 2021-08-20
CN113283393B CN113283393B (en) 2023-07-25

Family

ID=77285677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717852.2A Active CN113283393B (en) 2021-06-28 2021-06-28 Deepfake video detection method based on image group and two-stream network

Country Status (1)

Country Link
CN (1) CN113283393B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494804A (en) * 2022-04-18 2022-05-13 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030090505A1 (en) * 1999-11-04 2003-05-15 Koninklijke Philips Electronics N.V. Significant scene detection and frame filtering for a visual indexing system using dynamic thresholds
US20120008836A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Sequential event detection from video
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Longitudinal federated learning system optimization method, device, equipment and readable storage medium
CN111182292A (en) * 2020-01-05 2020-05-19 西安电子科技大学 No-reference video quality evaluation method and system, video receiver and intelligent terminal
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
CN112163488A (en) * 2020-09-21 2021-01-01 中国科学院信息工程研究所 Video false face detection method and electronic device
US20210042529A1 (en) * 2019-08-07 2021-02-11 Zerofox, Inc. Methods and systems for detecting deepfakes
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN112801037A (en) * 2021-03-01 2021-05-14 山东政法学院 Face tampering detection method based on continuous inter-frame difference
CN112927202A (en) * 2021-02-25 2021-06-08 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN112991278A (en) * 2021-03-01 2021-06-18 华南理工大学 Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030090505A1 (en) * 1999-11-04 2003-05-15 Koninklijke Philips Electronics N.V. Significant scene detection and frame filtering for a visual indexing system using dynamic thresholds
US20120008836A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Sequential event detection from video
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
KR20200132665A (en) * 2019-05-17 2020-11-25 삼성전자주식회사 Attention layer included generator based prediction image generating apparatus and controlling method thereof
US20210042529A1 (en) * 2019-08-07 2021-02-11 Zerofox, Inc. Methods and systems for detecting deepfakes
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Longitudinal federated learning system optimization method, device, equipment and readable storage medium
CN111182292A (en) * 2020-01-05 2020-05-19 西安电子科技大学 No-reference video quality evaluation method and system, video receiver and intelligent terminal
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111860414A (en) * 2020-07-29 2020-10-30 中国科学院深圳先进技术研究院 Method for detecting Deepfake video based on multi-feature fusion
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
CN112163488A (en) * 2020-09-21 2021-01-01 中国科学院信息工程研究所 Video false face detection method and electronic device
CN112487989A (en) * 2020-12-01 2021-03-12 重庆邮电大学 Video expression recognition method based on capsule-long-and-short-term memory neural network
CN112927202A (en) * 2021-02-25 2021-06-08 华南理工大学 Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN112801037A (en) * 2021-03-01 2021-05-14 山东政法学院 Face tampering detection method based on continuous inter-frame difference
CN112991278A (en) * 2021-03-01 2021-06-18 华南理工大学 Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
AKUL MEHRA et al.: "Deepfake Detection using Capsule Networks and Long Short-Term Memory Networks", HTTPS://PURL.UTWENTE.NL//ESSAYS/83028, pages 407-414 *
OSCAR DE LIMA et al.: "Deepfake Detection using Spatiotemporal Convolutional Networks", ARXIV, pages 1-6 *
ZHANG YIXUAN et al.: "Face tampering video detection method based on inter-frame difference" (基于帧间差异的人脸篡改视频检测方法), Journal of Cyber Security (信息安全学报), vol. 05, no. 02, pages 49-72 *
ZHANG MEIGUI: "Keyframe-based Deepfake video detection algorithm" (基于关键帧的Deepfake视频检测算法), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库信息科技辑), no. 2023, pages 138-2734 *
GENG PENGZHI et al.: "Deepfake detection method based on tampering artifacts" (基于篡改伪影的深度伪造检测方法), Computer Engineering (计算机工程), vol. 47, no. 12, pages 156-162 *
ZHAO LEI et al.: "Deepfake video detection model based on spatio-temporal feature consistency" (基于时空特征一致性的Deepfake视频检测模型), Advanced Engineering Sciences (工程科学与技术), vol. 52, no. 04, pages 243-250 *
XIANG JUN et al.: "Research on the influence of temporal models on video person re-identification performance" (时域模型对视频行人重识别性能影响的研究), Computer Engineering and Applications (计算机工程与应用), vol. 56, no. 20, pages 152-157 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494804A (en) * 2022-04-18 2022-05-13 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition
CN114494804B (en) * 2022-04-18 2022-10-25 武汉明捷科技有限责任公司 Unsupervised field adaptive image classification method based on domain specific information acquisition

Also Published As

Publication number Publication date
CN113283393B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN112347859B (en) Method for detecting significance target of optical remote sensing image
Wang et al. Deep metric learning for crowdedness regression
Fan et al. A survey of crowd counting and density estimation based on convolutional neural network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN109740419A (en) A kind of video behavior recognition methods based on Attention-LSTM network
CN111723693B (en) Crowd counting method based on small sample learning
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN112560831B (en) Pedestrian attribute identification method based on multi-scale space correction
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
CN109829495A (en) Timing image prediction method based on LSTM and DCGAN
CN111931602A (en) Multi-stream segmented network human body action identification method and system based on attention mechanism
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN116580278A (en) Lip language identification method, equipment and storage medium based on multi-attention mechanism
CN115113165A (en) Radar echo extrapolation method, device and system
CN113283393B (en) Deepfake video detection method based on image group and two-stream network
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN113033283B (en) Improved video classification system
CN113221683A (en) Expression recognition method based on CNN model in teaching scene
CN114758285B (en) Video interaction action detection method based on anchor freedom and long-term attention perception
CN115170985B (en) Remote sensing image semantic segmentation network and segmentation method based on threshold attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant