CN117649621A - Fake video detection method, device and equipment - Google Patents

Fake video detection method, device and equipment

Info

Publication number
CN117649621A
Authority
CN
China
Prior art keywords
module
feature
video data
video
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311333691.2A
Other languages
Chinese (zh)
Inventor
李硕豪
于淼淼
张军
王翔汉
黄魁华
黄金才
何华
李虹颖
刘忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311333691.2A priority Critical patent/CN117649621A/en
Publication of CN117649621A publication Critical patent/CN117649621A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a fake video detection method, device, and equipment, wherein the method comprises the following steps: processing the obtained video data set containing face features to form an image sequence set; constructing a model framework, wherein the model framework comprises a space defect extraction module, a dynamic defect extraction module and a classification module; training the model framework based on the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining the forgery algorithm used, wherein the first loss function is used for training the result output by the classification module, and the second loss function is used for training the data output by the dynamic defect extraction module; obtaining video data to be detected, and processing the video data to be detected to form an image sequence set to be detected; and inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result. The fake video detection method can identify and detect fake videos accurately and efficiently.

Description

Fake video detection method, device and equipment
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a fake video detection method, device and equipment.
Background
In recent years, deep forgery (deepfake) technology has thoroughly broken people's traditional trust that "seeing is believing", and misuse of this technology brings irreparable, serious consequences for user security; it must therefore be taken very seriously.
In general, a video contains a larger and richer amount of information, so a fake video is more credible than a fake image. At the same time, video and image forgery can be accomplished very easily with some open-source software; even ordinary users without any professional knowledge can quickly get started, and the production cost is low.
The dynamic information contained in a video greatly increases its credibility to the public, so from a practical point of view the potential harm of fake videos is greater, and effective deepfake video detection technology is urgently needed. Besides judging the authenticity of a video, it is also critical to trace which forgery algorithm generated a fake video; such traceability can reduce, to a certain extent, the propagation and credibility of deepfake videos, further reducing their harm.
Disclosure of Invention
The invention aims to provide a fake video detection method, device and equipment capable of accurately and efficiently identifying fake videos.
The invention provides a fake video detection method, which comprises the following steps:
processing the obtained video data set containing the face features to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
constructing a model framework, wherein the model framework comprises a space defect extraction module for extracting features of space artifacts and falsified traces of each frame of image, a dynamic defect extraction module for extracting dynamic defect features of the features extracted by the space defect extraction module, and a classification module for classifying and detecting the features extracted by the dynamic defect extraction module to at least determine whether an input video is a falsified video;
training the model architecture based on the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, wherein the first loss function is used for training results output by the classification module, and the second loss function is used for training data output by the dynamic defect extraction module;
obtaining video data to be detected, and processing the video data to be detected to form an image sequence set to be detected;
and inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result.
In some embodiments, the processing the obtained video data set including the face features to form an image sequence set includes:
extracting image frames from the obtained video data containing the face features to obtain a plurality of continuous frame images corresponding to the video data;
and carrying out face recognition on each frame image, cutting to obtain face images with target sizes, forming a face image sequence by a plurality of face images generated by the continuous frame images, and forming the image sequence set by all face image sequences of the video data.
In some embodiments, the method further comprises:
constructing the spatial defect extraction module based on a depth separable convolution network;
inputting the image sequence set to the space defect extraction module, and performing the following processing by the space defect extraction module:
extracting the characteristics of each frame of image in the image sequence set to form a characteristic image;
and calculating each feature map based on a fast Fourier transform algorithm to obtain frequency domain sensing features corresponding to each feature map, and processing based on the obtained frequency domain sensing features to obtain a deep feature map corresponding to each frame of image.
In some embodiments, the method further comprises:
and carrying out space average pooling operation on each feature map and each deep feature map respectively to obtain a first feature map set and a first deep feature map set.
In some embodiments, the method further comprises:
inputting a first feature atlas into the dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
processing the first feature atlas to obtain a first feature sequence corresponding to each video data, wherein the dimension of the first feature sequence is a first dimension;
adding a learnable first representative feature at the first position of each of said first feature sequences;
adding a first embedded feature for retaining the position information of the corresponding first feature sequence at the second position of each first feature sequence;
and processing the first feature sequence with the added features through an encoder to obtain a first target feature vector.
In some embodiments, the method further comprises:
inputting a first deep feature atlas into the dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
processing the first deep feature atlas to obtain a second feature sequence corresponding to each video data, wherein the dimension of the second feature sequence is a second dimension;
adding a learnable second representative feature at the first position of each of said second feature sequences;
adding a second embedded feature for retaining the position information of the corresponding second feature sequence at a second position of each second feature sequence;
and processing the second feature sequence with the added features through an encoder to obtain a second target feature vector.
In some embodiments, the method further comprises:
constructing the classification module based on a multi-layer perceptron (MLP) fully-connected neural network;
inputting the first target feature vector and the second target feature vector into the classification module to obtain a first detection result and a second detection result;
and taking the average value of the first detection result and the second detection result as a final detection result.
In some embodiments, the first loss function is a cross entropy loss function, and the classification module is further capable of performing a detection task that determines a fake algorithm used by the fake video;
the training the model architecture based on the image sequence set, the first loss function and the second loss function comprises the following steps:
assigning a first label representing a real video, a second label representing a fake video or a fake video generation algorithm based on the detection task;
constructing corresponding difficult sample mining loss functions for the first target feature vector and the second target feature vector respectively;
and training the model framework based on the assignment, the image sequence set, the cross entropy loss function and the difficult sample mining loss function which respectively correspond to the first target feature vector and the second target feature vector.
Another embodiment of the present invention also provides a counterfeit video detection device, including:
the processing module is used for processing the obtained video data set containing the face characteristics to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
the system comprises a building module, a model framework and a classifying module, wherein the building module is used for building the model framework, the model framework comprises a space defect extracting module, a dynamic defect extracting module and a classifying module, the space defect extracting module is used for extracting features of each frame of image, the dynamic defect extracting module is used for extracting dynamic defect features of the features extracted by the space defect extracting module, and the classifying module is used for classifying and detecting the features extracted by the dynamic defect extracting module to at least determine whether an input video is a fake video or not;
the training module is used for training the model framework according to the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, the first loss function is used for training the result output by the classification module, and the second loss function is used for training the data output by the dynamic defect extraction module;
the acquisition module is used for acquiring video data to be detected and processing the video data to be detected based on the processing module to form an image sequence set to be detected;
and the input module is used for inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result.
Another embodiment of the present invention also provides an electronic device, including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to implement the counterfeit video detection method described above.
The invention has the beneficial effects that a model framework is constructed and trained so that it can exploit inconsistencies in the time domain, combined with the related defects in the space domain, to detect fake videos more accurately; this not only realizes accurate detection of fake videos but also remarkably improves detection efficiency. Meanwhile, the detection model obtained through training can also support determining which forgery algorithm a fake video involves, providing a reference for algorithm tracing.
Drawings
Fig. 1 is a flowchart of a counterfeit video detection method of the present invention.
Fig. 2 is a flowchart of the application of the fake video detection method of the present invention.
Fig. 3 is a block diagram showing the structure of the counterfeit video detection device of the present invention.
Detailed Description
As shown in fig. 1, the present invention includes a counterfeit video detection method, comprising:
s1: processing the obtained video data set containing the face features to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
s2: constructing a model framework, wherein the model framework comprises a space defect extraction module for extracting features of space artifacts and falsified traces of each frame of image, a dynamic defect extraction module for extracting dynamic defect features of the features extracted by the space defect extraction module, and a classification module for classifying and detecting the features extracted by the dynamic defect extraction module to at least determine whether an input video is a falsified video;
s3: training a model framework based on the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, wherein the first loss function is used for training a result output by the classification module, and the second loss function is used for training data output by the dynamic defect extraction module;
s4: obtaining video data to be detected, and processing the video data to be detected to form an image sequence set to be detected;
s5: and inputting the image sequence set to be detected into a detection model to obtain a corresponding detection result.
The method has the advantage that a model framework is constructed and trained so that it can exploit inconsistencies in the time domain, combined with the related defects in the space domain, to detect fake videos more accurately; this realizes accurate detection of fake videos and remarkably improves detection efficiency. Meanwhile, the detection model obtained through training can also support determining which forgery algorithm a fake video involves, providing a reference for algorithm tracing.
Specifically, when the obtained video data set containing the face features is processed to form an image sequence set, the method includes:
extracting image frames from the obtained video data containing the face features to obtain a plurality of continuous frame images corresponding to the video data;
and carrying out face recognition on each frame image, cutting to obtain face images with target sizes, forming a face image sequence by a plurality of face images generated by continuous frame images, and forming an image sequence set by the face image sequences of all video data.
For example, N consecutive frames are first extracted from the video data, or the video data is divided into N consecutive frames. Then, the face region of each frame image is detected with a face recognition tool such as Dlib, the detected face region is cropped out, and the cropped face images are uniformly resized to 224×224. The N consecutive face images obtained through the above processing are recorded as the sequence $X = [X_1, X_2, \dots, X_N]$. The face image sequences of all video data form the image sequence set.
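As a minimal sketch of this preprocessing step (assuming OpenCV and Dlib are available; the function name and the frame-handling details are illustrative, not prescribed by this embodiment):

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def video_to_face_sequence(video_path, n_frames=16, size=224):
    """Extract up to n_frames consecutive frames and crop each detected face to size x size."""
    cap = cv2.VideoCapture(video_path)
    faces = []
    while len(faces) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break  # fewer frames available than requested
        dets = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if not dets:
            continue  # no face found in this frame; skip it
        d = dets[0]
        top, bottom = max(d.top(), 0), min(d.bottom(), frame.shape[0])
        left, right = max(d.left(), 0), min(d.right(), frame.shape[1])
        crop = cv2.resize(frame[top:bottom, left:right], (size, size))
        faces.append(crop)
    cap.release()
    return np.stack(faces)  # X = [X_1, ..., X_N], shape (N, 224, 224, 3)
```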
Further, when the model architecture is constructed, a plurality of modules involved in the model architecture need to be built in sequence, and the construction of each module and the data processing process of each module are described below:
s6: constructing a space defect extraction module based on a depth separable convolution network;
inputting the image sequence set to a space defect extraction module, and performing the following processing by the space defect extraction module:
s7: extracting the characteristics of each frame of image in the image sequence set to form a characteristic image;
s8: and calculating each feature map based on a fast Fourier transform algorithm to obtain frequency domain sensing features corresponding to each feature map, and processing based on the obtained frequency domain sensing features to obtain a deep feature map corresponding to each frame of image.
Specifically, as shown in fig. 2, the spatial defect extraction module in this embodiment uses XceptionNet (a depth separable convolution network) as its backbone to extract the spatial artifacts and tamper traces of each frame of image in X (the face image sequence described above). The XceptionNet network consists of four parts: a Stem (convolution) layer, an Entry flow (input layer), a Middle flow, and a deep Exit flow (output layer). The feature of X output by the Entry flow is expressed as $M_1 \in \mathbb{R}^{N \times c_1 \times h_1 \times w_1}$, where $N$, $c_1$, $h_1$, $w_1$ respectively represent the number of frames of the video sequence and the channel number, height, and width of the Entry flow output feature map. For the $i$-th ($i \in [1, N]$) feature map, the spatial defect extraction module applies the fast Fourier transform (FFT) to obtain its spectral representation, denoted $F_i$. To make the frequency-domain response more flexible, this embodiment performs point-wise multiplication between $F_i$ and an additional learnable filter $G_i$ to adaptively adjust the frequency-domain response, namely: $G'_i = G_i \odot F_i$.

Finally, the inverse fast Fourier transform (iFFT) converts $G'_i$ back to the spatial domain to obtain the frequency-domain perception feature of the $i$-th frame. Through this process, the feature maps of the N frames of images finally generate the frequency-domain perception features, which can be used as a complement to the RGB-domain information in the image to mine finer forgery cues. These features are then input into the Middle flow layer of the backbone network for further processing, and afterwards passed to the Exit flow to generate the deep feature map corresponding to the video data. The deep feature map output by the Exit flow is denoted $M_2 \in \mathbb{R}^{N \times c_2 \times h_2 \times w_2}$, where $N$, $c_2$, $h_2$, $w_2$ respectively represent the number of frames of the video sequence and the channel number, height, and width of the feature map $M_2$.
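The frequency-domain branch described above can be sketched as a small PyTorch module; the module name, the real-valued filter initialization, and the tensor shapes are assumptions for illustration, not fixed by this embodiment:

```python
import torch
import torch.nn as nn

class FrequencyPerception(nn.Module):
    """Re-weight the spectrum of each feature map: G'_i = G_i ⊙ F_i, then iFFT back."""
    def __init__(self, channels, height, width):
        super().__init__()
        # learnable filter G_i, one weight per spectral position,
        # initialized to the identity (all-ones) filter
        self.filter = nn.Parameter(torch.ones(channels, height, width))

    def forward(self, m):                 # m: (N, c1, h1, w1) Entry-flow features
        f = torch.fft.fft2(m)             # F_i: spectral representation of each map
        g = self.filter * f               # G'_i = G_i ⊙ F_i (point-wise multiplication)
        return torch.fft.ifft2(g).real    # frequency-domain perception feature
```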
In order to reduce the data dimension and avoid overfitting in the model training process, the method of the embodiment further comprises the following steps:
s9: and respectively carrying out space average pooling operation on each feature map and the deep feature map to obtain a first feature map set and a first deep feature map set.
For example, the intermediate feature map $M_1$ output by the Entry flow in the above step is processed with a spatial average pooling operation to obtain the feature $M_1'$, and the deep feature map $M_2$ output by the Exit flow is processed with a spatial average pooling operation to obtain the feature $M_2'$.
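In PyTorch terms this pooling step might look like the following (tensor names and channel counts are illustrative stand-ins):

```python
import torch
import torch.nn.functional as F

m1 = torch.randn(16, 728, 19, 19)         # stand-in for Entry-flow features (N, c1, h1, w1)
m2 = torch.randn(16, 2048, 7, 7)          # stand-in for Exit-flow deep features (N, c2, h2, w2)
m1_pooled = F.adaptive_avg_pool2d(m1, 1)  # -> (N, c1, 1, 1), i.e. M1'
m2_pooled = F.adaptive_avg_pool2d(m2, 1)  # -> (N, c2, 1, 1), i.e. M2'
```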
Further, a dynamic defect extraction module is constructed, and then, the method of this embodiment further includes:
inputting the first feature atlas into a dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
s10: processing the first feature atlas to obtain a first feature sequence corresponding to each video data, wherein the dimension of the first feature sequence is a first dimension;
s11: adding a learnable first representative feature at the first position of each first feature sequence;
s12: adding a first embedded feature for retaining position information of the corresponding first feature sequence at a second position of each first feature sequence;
s13: and processing the first feature sequence with the added features through an encoder to obtain a first target feature vector.
Specifically, in $M_1'$ of this embodiment, the feature vector corresponding to each frame of image has size $c_1 \times 1$; after flattening, the N consecutive frames corresponding to one video form a 1D sequence of length N, denoted $[x_1, x_2, \dots, x_N]$ with $x_t \in \mathbb{R}^{c_1}$, $t \in [1, N]$. Then, through a trainable linear mapping layer E (linear projection), the dimension of each embedding in the sequence changes from $c_1$ to D. To serve the classification purpose, the dynamic defect extraction module in this embodiment prepends an extra learnable class embedding, denoted $f_{class}$, to the embedding sequence; it becomes a representative feature learned over the input sequence. In addition, a learnable 1D position embedding $E_{pos}$ is added to the feature sequence to retain position information. Finally, the resulting embedding sequence is expressed as $h_0 = [f_{class}; x_1 E; x_2 E; \dots; x_N E] + E_{pos}$.
Then $h_0$ is input into the Transformer Encoder module, and the feature vector output by the module at the $f_{class}$ position is recorded as $z_1$, i.e., the first target feature vector.
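A minimal sketch of this dynamic defect extraction step follows (a ViT-style encoder in PyTorch; the class name, depth, head count, and model dimension D are illustrative assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

class DynamicDefectExtractor(nn.Module):
    def __init__(self, in_dim, d_model=256, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)                 # trainable linear mapping E: c1 -> D
        self.f_class = nn.Parameter(torch.zeros(1, 1, d_model))          # learnable class embedding
        self.e_pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))  # 1D position embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, seq):                     # seq: (B, N, in_dim) flattened frame features
        x = self.proj(seq)
        cls = self.f_class.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend f_class to the embedding sequence
        x = x + self.e_pos[:, : x.size(1)]      # add E_pos to retain position information
        h = self.encoder(x)                     # Transformer Encoder
        return h[:, 0]                          # z_1: the output at the f_class position
```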
The first deep feature atlas is processed based on the same method, and the method specifically comprises the following steps:
s14: inputting the first deep feature atlas into a dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
s15: processing the first deep feature atlas to obtain a second feature sequence corresponding to each video data, wherein the dimension of the second feature sequence is a second dimension;
s16: adding a learnable second representative feature at the first position of each second feature sequence;
s17: adding a second embedded feature for retaining position information of the corresponding second feature sequence at a second position of each second feature sequence;
s18: and processing the second feature sequence with the added features through an encoder to obtain a second target feature vector.
Further, the construction classification module specifically comprises:
s19: constructing a classification module based on a multi-layer perceptron (MLP) fully-connected neural network;
s20: inputting the first target feature vector and the second target feature vector into a classification module to obtain a first detection result and a second detection result;
s21: and taking the average value of the first detection result and the second detection result as a final detection result.
The average value of the two detection results is taken as the detection result output by the final model, so that the detection accuracy of the model is improved.
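A sketch of this classification step follows (for illustration it assumes the two target feature vectors share one MLP head and have the same dimension; the patent does not fix these details):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """MLP classifier applied to z_1 and z_2; the two predictions are averaged."""
    def __init__(self, d_model=256, hidden=128, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, z1, z2):
        p1 = self.mlp(z1).softmax(dim=-1)  # first detection result
        p2 = self.mlp(z2).softmax(dim=-1)  # second detection result
        return (p1 + p2) / 2               # final detection result
```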
After the above modules are built, the model needs to be trained, and the data processing process executed by each module after obtaining the data to be processed is the data processing flow of each module during training and practical application. When training the model architecture, in order to make the training of the model architecture sufficient and ensure that the obtained detection model has higher precision, two different loss functions are adopted for training different feature data in the embodiment. In addition, the classification module in this embodiment can also execute a detection task of determining a fake algorithm used for fake video, and when the classification module or the model is required to execute different detection tasks, different parameters need to be set for the labels in the classification module so that the output label content matches with the detection task. Specifically:
the first Loss function in this embodiment is a cross entropy Loss function (CE Loss), and when training the model architecture based on the image sequence set and the first and second Loss functions, the method includes:
s22: assigning a first label representing a real video, a second label representing a fake video or a fake video generation algorithm based on the detection task;
s23: constructing corresponding difficult sample mining loss functions for the first target feature vector and the second target feature vector respectively;
s24: training the model architecture based on assignment, an image sequence set, a cross entropy loss function and a difficult sample mining loss function respectively corresponding to the first target feature vector and the second target feature vector.
For example, the cross entropy loss function is expressed in its standard form as:

$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N} \log p_i(\hat{y}_i)$

where N is the total number of training video samples, $p_i(\cdot)$ is the class probability predicted by the classification module for the $i$-th sample, and $y_i$ and $\hat{y}_i$ are respectively the predicted/detected label and the true label of the $i$-th sample. For the fake video detection task, $\hat{y}_i$ of a real video can be set to 0 and $\hat{y}_i$ of a fake video set to 1. For the forgery algorithm traceability task, $\hat{y}_i$ of a real video is set to 0, and $\hat{y}_i$ of fake videos generated by different forgery algorithms is set to 1, 2, 3, ... respectively, as determined by the number of forgery algorithms.
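For instance, the label assignment and the cross entropy term could be written as follows (a sketch; the batch contents and class count are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Detection task:    0 = real video, 1 = fake video.
# Traceability task: 0 = real video, 1, 2, 3, ... = index of the forgery algorithm.
labels = torch.tensor([0, 1, 1, 0])       # ground-truth labels for a batch of 4 videos
logits = torch.randn(4, 2)                # stand-in for the classification module's scores
l_ce = F.cross_entropy(logits, labels)    # averaged over the samples in the batch
```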
Next, continuing with fig. 2, difficult (hard) sample mining loss functions are constructed for the first target feature vector $z_1$ and the second target feature vector $z_2$, respectively. Taking $z_1$ as an example: all of its triplet combinations, i.e., anchor, positive, and negative, are first found within the same training batch; then, for each triplet $\langle a, p, n \rangle$, $r = (a - p) - (a - n)$ is calculated, i.e., the difference between anchor and positive minus the difference between anchor and negative. Evidently, the larger $r$ is, the more difficult it is for the model to make the correct classification, i.e., the larger the error of the detection result. After $r$ has been calculated for all triplet combinations, all obtained values of $r$ are sorted from largest to smallest, and the triplets corresponding to the first K values are selected as the hard sample mining objects. SoftMargin Loss is then adopted as the hard sample mining loss function, and the loss corresponding to the first target feature vector is defined as:

$L_{hard\_1} = \frac{1}{K}\sum_{i=1}^{K}\log\left(1 + e^{-y_i \cdot x[i]}\right)$

In the above, $x[i]$ represents the calculated $r$ value of the $i$-th selected triplet, and $y_i \in \{1\}$. Similarly, the hard sample mining loss calculated for the feature vector $z_2$ in the same manner is denoted $L_{hard\_2}$. Finally, the three loss functions are integrated to form the loss function of the detection model, on which training is based in this embodiment; specifically:

$L_{total} = L_{ce} + L_{hard\_1} + L_{hard\_2}$
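The hard sample mining loss above can be sketched as follows (pairwise L2 distances stand in for the anchor-positive and anchor-negative differences, and the target $y_i \in \{1\}$ follows the text; both are assumptions where the patent leaves details open):

```python
import itertools
import torch
import torch.nn.functional as F

def hard_mining_loss(z, labels, k=16):
    """Enumerate batch triplets, keep the K largest r values, apply SoftMargin loss."""
    rs = []
    idx = range(len(z))
    for a, p in itertools.permutations(idx, 2):
        if labels[a] != labels[p]:
            continue                          # positive must share the anchor's label
        for n in idx:
            if labels[n] == labels[a]:
                continue                      # negative must carry a different label
            d_ap = torch.norm(z[a] - z[p])    # difference between anchor and positive
            d_an = torch.norm(z[a] - z[n])    # difference between anchor and negative
            rs.append(d_ap - d_an)            # r: larger means a harder triplet
    r = torch.stack(rs)
    r = r.topk(min(k, len(rs))).values        # select the top-K hardest triplets
    return F.soft_margin_loss(r, torch.ones_like(r))  # y_i ∈ {1} as stated in the text

# Total training objective, combining the three losses:
# l_total = l_ce + hard_mining_loss(z1, labels) + hard_mining_loss(z2, labels)
```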
after the loss function is obtained, the model architecture is constructed, and the model architecture can be trained by using training data. The trained model can perform true and false detection or fake video generation algorithm detection, namely fake algorithm detection, on a new input video, namely a video to be detected.
As described above, after data is input into the detection model, the corresponding processing is performed by each module according to the processes above, until the classification module outputs the final detection result. The model architecture in this embodiment is not an existing architecture; rather, it is constructed by matching existing models with different functions so as to solve the technical problem of this embodiment. When processing data, each module builds on its basic function and is implemented according to the method designed in this embodiment, and the preparation of training data likewise follows the method designed in this embodiment. The detection model constructed in this way can accurately and efficiently detect fake videos or the algorithms that generate them, providing a guarantee for protecting user information security.
As shown in fig. 3, another embodiment of the present invention also provides a counterfeit video detection device 100, including:
the processing module is used for processing the obtained video data set containing the face characteristics to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
the system comprises a building module, a model framework and a classifying module, wherein the building module is used for building the model framework, the model framework comprises a space defect extracting module, a dynamic defect extracting module and a classifying module, the space defect extracting module is used for extracting features of each frame of image, the dynamic defect extracting module is used for extracting dynamic defect features of the features extracted by the space defect extracting module, and the classifying module is used for classifying and detecting the features extracted by the dynamic defect extracting module to at least determine whether an input video is a fake video or not;
the training module is used for training the model framework according to the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, the first loss function is used for training the result output by the classification module, and the second loss function is used for training the data output by the dynamic defect extraction module;
the acquisition module is used for acquiring video data to be detected and processing the video data to be detected based on the processing module to form an image sequence set to be detected;
the first input module is used for inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result.
In some embodiments, the processing the obtained video data set including the face features to form an image sequence set includes:
extracting image frames from the obtained video data containing the face features to obtain a plurality of continuous frame images corresponding to the video data;
and carrying out face recognition on each frame image, cutting to obtain face images with target sizes, forming a face image sequence by a plurality of face images generated by the continuous frame images, and forming the image sequence set by all face image sequences of the video data.
In some embodiments, the build module is further to:
constructing the spatial defect extraction module based on a depth separable convolution network;
inputting the image sequence set to the space defect extraction module, and performing the following processing by the space defect extraction module:
extracting the characteristics of each frame of image in the image sequence set to form a characteristic image;
and calculating each feature map based on a fast Fourier transform algorithm to obtain frequency domain sensing features corresponding to each feature map, and processing based on the obtained frequency domain sensing features to obtain a deep feature map corresponding to each frame of image.
In some embodiments, the apparatus further comprises:
and the pooling module is used for carrying out space average pooling operation on each feature map and each deep feature map so as to obtain a first feature map set and a first deep feature map set.
In some embodiments, the apparatus further comprises:
the second input module is used for inputting the first feature atlas into the dynamic defect extraction module, and the dynamic defect extraction module performs the following processing:
processing the first feature atlas to obtain a first feature sequence corresponding to each video data, wherein the dimension of the first feature sequence is a first dimension;
adding a learnable first representative feature at the first position of each of said first feature sequences;
adding a first embedded feature for retaining the position information of the corresponding first feature sequence at the second position of each first feature sequence;
processing the first feature sequence with the added features through an encoder to obtain a first target feature vector;
in some embodiments, the apparatus further comprises:
the third input module is used for inputting the first deep feature atlas into the dynamic defect extraction module, and the dynamic defect extraction module performs the following processing:
processing the first deep feature atlas to obtain a second feature sequence corresponding to each video data, wherein the dimension of the second feature sequence is a second dimension;
adding a learnable second representative feature at the first position of each of said second feature sequences;
adding a second embedded feature for retaining the position information of the corresponding second feature sequence at a second position of each second feature sequence;
processing the second feature sequence with the added features through an encoder to obtain a second target feature vector;
in some embodiments, the build module is further to:
constructing the classification module based on a multi-layer perceptron (MLP) fully-connected neural network;
inputting the first target feature vector and the second target feature vector into the classification module to obtain a first detection result and a second detection result;
and taking the average value of the first detection result and the second detection result as a final detection result.
In some embodiments, the first loss function is a cross entropy loss function, and the classification module is further capable of performing a detection task that determines a fake algorithm used by the fake video;
the training the model architecture based on the image sequence set, the first loss function and the second loss function comprises the following steps:
assigning a first label representing a real video, a second label representing a fake video or a fake video generation algorithm based on the detection task;
constructing corresponding difficult sample mining loss functions for the first target feature vector and the second target feature vector respectively;
and training the model framework based on the assignment, the image sequence set, the cross entropy loss function and the difficult sample mining loss function which respectively correspond to the first target feature vector and the second target feature vector.
Another embodiment of the present invention also provides an electronic device, including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to implement a counterfeit video detection method as described in any of the embodiments above.
Further, an embodiment of the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the fake video detection method as described above. It should be understood that each solution in this embodiment has a corresponding technical effect in the foregoing method embodiment, which is not described herein.
Further, embodiments of the present invention also provide a computer program product tangibly stored on a computer-readable medium and comprising computer-readable instructions that, when executed, cause at least one processor to perform a counterfeit video detection method such as in the embodiments described above.
The computer storage medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage media element, a magnetic storage media element, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, antenna, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Additionally, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to imply that the scope of the present application is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the spirit of the application, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments described above which are not provided in detail for the sake of brevity.
One or more embodiments herein are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the present application. Any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the one or more embodiments in the present application, are therefore intended to be included within the scope of the present application.

Claims (10)

1. A counterfeit video detection method, comprising:
processing the obtained video data set containing the face features to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
constructing a model framework, wherein the model framework comprises a space defect extraction module for extracting features of space artifacts and falsified traces of each frame of image, a dynamic defect extraction module for extracting dynamic defect features of the features extracted by the space defect extraction module, and a classification module for classifying and detecting the features extracted by the dynamic defect extraction module to at least determine whether an input video is a falsified video;
training the model architecture based on the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, wherein the first loss function is used for training results output by the classification module, and the second loss function is used for training data output by the dynamic defect extraction module;
obtaining video data to be detected, and processing the video data to be detected to form an image sequence set to be detected;
and inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result.
2. The method of claim 1, wherein processing the obtained video data set containing facial features to form a set of image sequences comprises:
extracting image frames from the obtained video data containing the face features to obtain a plurality of continuous frame images corresponding to the video data;
and carrying out face recognition on each frame image, cutting to obtain face images with target sizes, forming a face image sequence by a plurality of face images generated by the continuous frame images, and forming the image sequence set by all face image sequences of the video data.
3. The method of claim 1, further comprising:
constructing the spatial defect extraction module based on a depth separable convolution network;
inputting the image sequence set to the space defect extraction module, and performing the following processing by the space defect extraction module:
extracting the characteristics of each frame of image in the image sequence set to form a characteristic image;
and calculating each feature map based on a fast Fourier transform algorithm to obtain frequency domain sensing features corresponding to each feature map, and processing based on the obtained frequency domain sensing features to obtain a deep feature map corresponding to each frame of image.
4. The method of claim 3, further comprising:
and carrying out space average pooling operation on each feature map and each deep feature map respectively to obtain a first feature map set and a first deep feature map set.
5. The method of claim 4, further comprising:
inputting a first feature atlas into the dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
processing the first feature atlas to obtain a first feature sequence corresponding to each video data, wherein the dimension of the first feature sequence is a first dimension;
adding a learnable first representative feature at the first position of each of said first feature sequences;
adding a first embedded feature for retaining the position information of the corresponding first feature sequence at the second position of each first feature sequence;
and processing the first feature sequence with the added features through an encoder to obtain a first target feature vector.
6. The method of claim 5, further comprising:
inputting a first deep feature atlas into the dynamic defect extraction module, and performing the following processing by the dynamic defect extraction module:
processing the first deep feature atlas to obtain a second feature sequence corresponding to each video data, wherein the dimension of the second feature sequence is a second dimension;
adding a learnable second representative feature at the first position of each of said second feature sequences;
adding a second embedded feature for retaining the position information of the corresponding second feature sequence at a second position of each second feature sequence;
and processing the second feature sequence with the added features through an encoder to obtain a second target feature vector.
7. The method of claim 6, further comprising:
constructing the classification module based on a multi-layer perceptron (MLP) fully-connected neural network;
inputting the first target feature vector and the second target feature vector into the classification module to obtain a first detection result and a second detection result;
and taking the average value of the first detection result and the second detection result as a final detection result.
8. The method of claim 6, wherein the first loss function is a cross entropy loss function, and the classification module is further capable of performing a detection task that determines a forgery algorithm used by the forgery video;
the training the model architecture based on the image sequence set, the first loss function and the second loss function comprises the following steps:
assigning a first label representing a real video, a second label representing a fake video or a fake video generation algorithm based on the detection task;
constructing corresponding difficult sample mining loss functions for the first target feature vector and the second target feature vector respectively;
and training the model framework based on the assignment, the image sequence set, the cross entropy loss function and the difficult sample mining loss function which respectively correspond to the first target feature vector and the second target feature vector.
9. A counterfeit video detection device, comprising:
the processing module is used for processing the obtained video data set containing the face characteristics to form an image sequence set, wherein the image sequence set contains a continuous face image sequence corresponding to each video data, and the video data set comprises fake video data and real video data;
the system comprises a building module, a model framework and a classifying module, wherein the building module is used for building the model framework, the model framework comprises a space defect extracting module, a dynamic defect extracting module and a classifying module, the space defect extracting module is used for extracting features of each frame of image, the dynamic defect extracting module is used for extracting dynamic defect features of the features extracted by the space defect extracting module, and the classifying module is used for classifying and detecting the features extracted by the dynamic defect extracting module to at least determine whether an input video is a fake video or not;
the training module is used for training the model framework according to the image sequence set, a first loss function and a second loss function to obtain a detection model capable of detecting fake videos and/or determining a fake algorithm to be used, the first loss function is used for training the result output by the classification module, and the second loss function is used for training the data output by the dynamic defect extraction module;
the acquisition module is used for acquiring video data to be detected and processing the video data to be detected based on the processing module to form an image sequence set to be detected;
and the input module is used for inputting the image sequence set to be detected into the detection model to obtain a corresponding detection result.
10. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to implement the counterfeit video detection method of any of claims 1-8.
CN202311333691.2A 2023-10-13 2023-10-13 Fake video detection method, device and equipment Pending CN117649621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311333691.2A CN117649621A (en) 2023-10-13 2023-10-13 Fake video detection method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311333691.2A CN117649621A (en) 2023-10-13 2023-10-13 Fake video detection method, device and equipment

Publications (1)

Publication Number Publication Date
CN117649621A true CN117649621A (en) 2024-03-05

Family

ID=90042169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311333691.2A Pending CN117649621A (en) 2023-10-13 2023-10-13 Fake video detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN117649621A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893952A (en) * 2024-03-15 2024-04-16 视睿(杭州)信息科技有限公司 Video mosaic defect detection method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination