CN116310937A - Method, device, equipment and medium for detecting deepfake video


Info

Publication number: CN116310937A
Authority: CN (China)
Prior art keywords: video, audio, detected, depth, audio characteristics
Legal status: Pending
Application number: CN202211678998.1A
Other languages: Chinese (zh)
Inventors: 喻民, 姜建国, 梁亚超, 刘超, 李敏, 黄伟庆
Current Assignee: Institute of Information Engineering of CAS
Original Assignee: Institute of Information Engineering of CAS
Priority date: 2022-12-26
Filing date: 2022-12-26
Publication date: 2023-06-23
Application filed by Institute of Information Engineering of CAS

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40: Spoof detection, e.g. liveness detection

Abstract

The invention provides a method, a device, equipment and a medium for detecting deepfake video, wherein the method comprises the following steps: performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected; and inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result. The audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each. The invention aims to solve the problem in the prior art that detection accuracy for deepfake video is insufficient because the forged audio and the forged video within a deepfake video cannot be detected separately.

Description

Method, device, equipment and medium for detecting deepfake video
Technical Field
The present invention relates to the field of video detection technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting deepfake video.
Background
Video media are carriers of important information and play an important role in information acquisition. In recent years, face-swap videos generated by deep learning techniques have emerged on the Internet, and as the quality of forged video continues to improve, malicious applications of deepfake technology pose serious harm.
Currently, techniques for detecting deepfake video fall mainly into two types: detection methods based on specific artifacts and detection methods based on deep learning. Detection methods based on specific artifacts focus on particular visual artifacts created during video forgery; these artifacts may be obvious or faint to humans, but they can be detected by machine learning and forensic analysis techniques. Detection methods based on deep learning treat deepfake video detection as an ordinary image or video classification task and train a carefully designed deep neural network to automatically extract useful features that distinguish real video from forged video.
However, the existing detection methods above all operate on a single modality, i.e. pictures or video, whereas existing deepfake video is often accompanied by forged audio. How to effectively use both the visual and the auditory information in a video therefore remains an open problem. Although there are audio-visual deepfake detection algorithms that detect forgeries from the consistency between the visual and auditory modalities (for example, between the facial emotion of a person in the video and the emotion of the speech content), they ignore forgery traces within a single modality. In addition, current detection algorithms ignore the relation between the visual and auditory modalities and do not effectively fuse the audio-visual features. As a result, the detection accuracy of existing audio-visual deepfake video detection methods is insufficient.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for detecting deepfake video, which are used for solving the problem in the prior art that detection accuracy for deepfake video is insufficient because the forged audio and the forged video within a deepfake video cannot be detected separately.
The invention provides a deepfake video detection method, which comprises the following steps:
performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
According to the deepfake video detection method provided by the invention, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
using a composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features;
wherein the preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients;
and the composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer.
According to the deepfake video detection method provided by the invention, performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting the video frames of the face region and the mel-frequency cepstral coefficients from the video to be detected;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
According to the deepfake video detection method provided by the invention, extracting the video frames of the face region in the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
According to the deepfake video detection method provided by the invention, extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool to obtain the mel-frequency cepstral coefficients.
According to the deepfake video detection method provided by the invention, using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
According to the deepfake video detection method provided by the invention, the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
The invention also provides a deepfake video detection device, which comprises:
an analysis module, configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
and a detection module, configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the deepfake video detection method described in any one of the above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deepfake video detection method described in any one of the above.
According to the deepfake video detection method, device, equipment and medium provided by the invention, audio-visual features comprising visual features and audio features are obtained by analyzing the video to be detected; the visual features and the audio features are each assessed by a preset multilayer perceptron classification model, and by distinguishing real from forged visual features and audio features, more effective and accurate detection of deepfake video is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a first schematic flow chart of the deepfake video detection method provided by the present invention;
FIG. 2 is a second schematic flow chart of the deepfake video detection method provided by the present invention;
FIG. 3 is a schematic flow chart of the composite attention module in the deepfake video detection method provided by the present invention;
FIG. 4 is a schematic structural diagram of the deepfake video detection device provided by the present invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
A deepfake video detection method of the present invention is described below with reference to FIGS. 1-3, comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected.
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result.
The audio-visual features in step S1 comprise visual features and audio features. The preset multilayer perceptron classification model in step S2 is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
Specifically, the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
According to the invention, audio-visual features comprising visual features and audio features are obtained after audio-visual feature analysis of the video to be detected; the visual features and the audio features are each assessed by the preset multilayer perceptron classification model, and by distinguishing real from forged visual features and audio features, more effective and accurate detection of deepfake video is achieved.
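For illustration only, a minimal PyTorch sketch of such a multilayer perceptron classifier is given below; the feature dimension, hidden width and training fragment are assumptions, not values disclosed by the patent.

```python
import torch
import torch.nn as nn

class AVClassifier(nn.Module):
    """Multilayer perceptron over fused audio-visual features.
    feat_dim and hidden_dim are illustrative assumptions."""

    def __init__(self, feat_dim: int = 4096, hidden_dim: int = 512) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # two classes: real vs. deepfake
        )

    def forward(self, av_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(av_features)  # logits of shape (batch, 2)

# Training pairs audio-visual feature samples with their labels
# (here 0 = real video, 1 = deepfake video):
model = AVClassifier()
features = torch.randn(4, 4096)        # a batch of fused features
labels = torch.tensor([0, 1, 1, 0])    # ground-truth labels
loss = nn.CrossEntropyLoss()(model(features), labels)
loss.backward()
```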
Referring to FIG. 2 and FIG. 3 together, performing audio-visual feature analysis on the video to be detected to obtain its audio-visual features specifically comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
and using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features.
The preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients.
The composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer; for example, replacing the middle convolution layer in a residual module of ResNet-50 with the composite attention layer yields a composite attention module. The composite attention layer consists of a self-attention mechanism and a cross-modal attention mechanism; the self-attention and cross-modal attention weights are obtained by passing the visual and auditory features through a convolution layer and a Sigmoid activation function. Composite attention modules can also be stacked in multiple layers, each layer again obtained by replacing the middle convolution layer of a residual module with the composite attention layer. A composite attention model formed by stacking multiple composite attention modules strengthens the feature representation of each modality and performs information interaction between the visual and audio features, which benefits detection in each modality.
To better perform the preliminary feature analysis, in this embodiment, performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting from the video to be detected the video frames of the face region and the mel-frequency cepstral coefficients at the corresponding moments, i.e. extracting the audio features at the moments corresponding to the extracted video frames;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
The present invention uses a residual neural network (ResNet) as the backbone network, which extracts the preliminary visual and audio features from the video frames and the mel-frequency cepstral coefficients, respectively. This feature extraction network extracts the relevant features of the video to be detected and reduces the size of the feature maps, which facilitates effective information interaction between the visual and audio features in the composite attention model.
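A minimal sketch of such a truncated ResNet backbone follows, assuming torchvision and treating the MFCC matrix as a one-channel image for the audio branch; the patent does not fix these layout details.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Visual branch: drop the average-pooling and classification layers so the
# network returns spatial feature maps instead of class scores.
visual_backbone = resnet50(weights=None)
visual_extractor = nn.Sequential(*list(visual_backbone.children())[:-2])

frames = torch.randn(8, 3, 224, 224)          # e.g. T = 8 face crops
visual_prelim = visual_extractor(frames)      # (8, 2048, 7, 7) feature maps

# Audio branch: same idea, with the first convolution adapted to accept a
# one-channel MFCC "image" (an assumed layout).
audio_backbone = resnet50(weights=None)
audio_backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
audio_extractor = nn.Sequential(*list(audio_backbone.children())[:-2])

mfcc = torch.randn(8, 1, 13, 301)             # (T, 1, n_mfcc, audio frames)
audio_prelim = audio_extractor(mfcc)
```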
In this embodiment, extracting the video frames of the face region from the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
The present invention uses a multi-task convolutional neural network (MTCNN) model to extract the face region picture in each frame as the input of the residual neural network, which improves the accuracy of the face-region video frames of the video to be detected. The extraction window can be chosen as the first 3 s of the video to be detected.
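As an illustrative sketch, the face-region extraction could be implemented roughly as follows; the facenet-pytorch MTCNN implementation, OpenCV frame decoding, and the crop size and margin are assumptions, not choices disclosed by the patent.

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN  # one publicly available MTCNN (assumed)

mtcnn = MTCNN(image_size=224, margin=20)  # crop size/margin are assumptions

def extract_face_frames(video_path: str, seconds: float = 3.0):
    """Crop the face region from each frame in the first `seconds` of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    faces = []
    for _ in range(int(fps * seconds)):
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)  # cropped face tensor, or None if no face found
        if face is not None:
            faces.append(face)
    cap.release()
    return faces
```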
Extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool, such as the Librosa tool, to obtain the mel-frequency cepstral coefficients.
In this embodiment, the Librosa tool is first used to frame and window the audio in the video to be detected and to extract the mel-frequency cepstral coefficients (MFCCs); the MFCCs of each audio frame then serve as the input of the residual neural network. The extraction window can likewise be chosen as the first 3 s of the video to be detected. Selecting the same time span for the video and the audio enables detection of deepfake videos in which the video and audio are aligned.
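A minimal sketch of the MFCC extraction with Librosa follows; the sampling rate, FFT window and hop length are illustrative assumptions (Librosa performs the framing and windowing internally), and decoding the audio track of a video file requires an ffmpeg-capable backend.

```python
import librosa

def extract_mfcc(media_path: str, seconds: float = 3.0, n_mfcc: int = 13):
    """MFCCs of the first `seconds` of audio, one column per audio frame."""
    # n_fft/hop_length below correspond to 25 ms windows with 10 ms hops
    # at 16 kHz (assumed values); librosa frames and windows internally.
    y, sr = librosa.load(media_path, sr=16000, duration=seconds)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
```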
To better realize information interaction between the visual and audio features, in this embodiment, using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio (auditory) features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
In this embodiment, taking the visual features as an example, the visual feature is denoted x ∈ R^{T×C×H×W}, where T is the number of extracted video frames, C the number of feature channels, and H and W the height and width of the feature map. First, the preliminary visual feature is divided along the channel dimension into two parts, x_1 ∈ R^{T×(C/2)×H×W} and x_2 ∈ R^{T×(C/2)×H×W}, where x_1 is the self-attention feature and x_2 the cross-modal attention feature. Inputting x_1 into a convolution layer yields the self-attention weight of the visual feature, while the corresponding part of the audio feature is likewise passed through a convolution layer to yield the cross-modal attention weight for the visual feature. The self-attention weight and the cross-modal attention weight derived from the audio feature are multiplied by the two parts of the visual feature (x_1 and x_2, respectively), and the results are concatenated along the channel dimension to obtain the final visual feature. The self-attention mechanism lets the neural network focus on regions of interest, while the cross-modal attention mechanism enables information interaction between the visual and audio features, so that the model pays more attention to video frames and audio frames that are easy to detect, achieving effective and accurate detection of deepfake video.
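The channel-split attention described above could be sketched as follows, assuming the visual and audio feature maps have already been brought to the same shape; the convolution parameters, and folding T into the batch dimension, are assumptions rather than disclosed details.

```python
import torch
import torch.nn as nn

def conv_sigmoid(ch: int) -> nn.Sequential:
    """Weight generator per the described mechanism: convolution + Sigmoid."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

class CompositeAttention(nn.Module):
    """Sketch of the composite attention layer: each modality is split along
    the channel dimension into a self-attention half and a cross-modal half;
    the former is reweighted from its own modality, the latter from the other
    modality, and the two halves are concatenated back together."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        half = channels // 2
        self.self_v, self.self_a = conv_sigmoid(half), conv_sigmoid(half)
        self.cross_v, self.cross_a = conv_sigmoid(half), conv_sigmoid(half)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v, a: (T, C, H, W) feature maps with T folded into the batch axis.
        v1, v2 = v.chunk(2, dim=1)
        a1, a2 = a.chunk(2, dim=1)
        v_out = torch.cat([v1 * self.self_v(v1),     # visual self-attention
                           v2 * self.cross_a(a2)],   # weights from audio
                          dim=1)
        a_out = torch.cat([a1 * self.self_a(a1),     # audio self-attention
                           a2 * self.cross_v(v2)],   # weights from visual
                          dim=1)
        return v_out, a_out
```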
The deepfake video detection device provided by the invention is described below; the device described below and the deepfake video detection method described above may be referred to in correspondence with each other.
Referring to FIG. 4, the present invention further provides a deepfake video detection device, which comprises an analysis module 410 and a detection module 420.
The analysis module 410 is configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected.
The detection module 420 is configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result.
The audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
In the invention, the analysis module 410 obtains the audio-visual features of the video to be detected through audio-visual feature analysis, and the detection module 420 inputs those features into the preset multilayer perceptron classification model to obtain the detection result, thereby achieving more effective and accurate detection of deepfake video.
FIG. 5 illustrates a physical schematic diagram of an electronic device. As shown in FIG. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, where the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the deepfake video detection method provided above, the method comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the deepfake video detection method provided above, the method comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the technical solution above may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A deepfake video detection method, comprising:
performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
2. The deepfake video detection method according to claim 1, wherein performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
using a composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features;
wherein the preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients;
and the composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer.
3. The deepfake video detection method according to claim 2, wherein performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting the video frames of the face region and the mel-frequency cepstral coefficients from the video to be detected;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
4. The deepfake video detection method according to claim 3, wherein extracting the video frames of the face region in the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
5. The deepfake video detection method according to claim 3, wherein extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool to obtain the mel-frequency cepstral coefficients.
6. The deepfake video detection method according to any one of claims 2 to 5, wherein using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
7. The deepfake video detection method according to claim 1, wherein the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
8. A deepfake video detection device, comprising:
an analysis module, configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
and a detection module, configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deepfake video detection method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the deepfake video detection method of any one of claims 1 to 7.
CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26), published as CN116310937A: Method, device, equipment and medium for detecting deepfake video. Status: Pending.

Priority Applications (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video

Applications Claiming Priority (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video

Publications (1)

CN116310937A (published 2023-06-23)

Family

Family ID: 86782317

Family Applications (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video (Pending)

Country Status (1)

CN: CN116310937A (en)


Cited By (2)

* Cited by examiner, † Cited by third party

CN117059131A * (priority 2023-10-13, published 2023-11-14, 南京龙垣信息科技有限公司): False audio detection method based on emotion recognition
CN117059131B * (priority 2023-10-13, published 2024-03-29, 南京龙垣信息科技有限公司): False audio detection method based on emotion recognition


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination