CN116310937A - Method, device, equipment and medium for detecting deepfake video


Info

Publication number: CN116310937A
Authority: CN (China)
Prior art keywords: video, audio, detected, depth, audio characteristics
Legal status: Pending
Application number: CN202211678998.1A
Other languages: Chinese (zh)
Inventors: 喻民, 姜建国, 梁亚超, 刘超, 李敏, 黄伟庆
Current Assignee: Institute of Information Engineering of CAS
Original Assignee: Institute of Information Engineering of CAS
Priority date: 2022-12-26
Filing date: 2022-12-26
Publication date: 2023-06-23
Application filed by Institute of Information Engineering of CAS

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/40: Spoof detection, e.g. liveness detection

Abstract

The invention provides a method, a device, equipment and a medium for detecting deepfake video, wherein the method comprises the following steps: performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected; and inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result. The audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each. The invention aims to solve the problem in the prior art that detection accuracy for deepfake video is insufficient because the forged audio and the forged video within a deepfake video cannot be detected separately.

Description

Method, device, equipment and medium for detecting deepfake video
Technical Field
The present invention relates to the field of video detection technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting deepfake video.
Background
Video media are carriers of important information and play an important role in information acquisition. In recent years, face-swap videos generated by deep learning techniques have emerged on the Internet, and as the quality of forged video continues to improve, malicious applications of deepfake technology pose serious harm.
Currently, techniques for detecting deepfake video fall mainly into two types: detection methods based on specific artifacts and detection methods based on deep learning. Detection methods based on specific artifacts focus on particular visual artifacts created during video forgery; these artifacts may be obvious or faint to humans, but they can be detected by machine learning and forensic analysis techniques. Detection methods based on deep learning treat deepfake video detection as an ordinary image or video classification task and train a carefully designed deep neural network to automatically extract useful features that distinguish real video from forged video.
However, the existing detection methods above all operate on a single modality, i.e. pictures or video, whereas existing deepfake video is often accompanied by forged audio. How to effectively use both the visual and the auditory information in a video therefore remains an open problem. Although there are audio-visual deepfake detection algorithms that detect forgeries from the consistency between the visual and auditory modalities (for example, between the facial emotion of a person in the video and the emotion of the speech content), they ignore forgery traces within a single modality. In addition, current detection algorithms ignore the relation between the visual and auditory modalities and do not effectively fuse the audio-visual features. As a result, the detection accuracy of existing audio-visual deepfake video detection methods is insufficient.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for detecting deepfake video, which are used for solving the problem in the prior art that detection accuracy for deepfake video is insufficient because the forged audio and the forged video within a deepfake video cannot be detected separately.
The invention provides a deepfake video detection method, which comprises the following steps:
performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
According to the deepfake video detection method provided by the invention, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
using a composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features;
wherein the preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients;
and the composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer.
According to the deepfake video detection method provided by the invention, performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting the video frames of the face region and the mel-frequency cepstral coefficients from the video to be detected;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
According to the deepfake video detection method provided by the invention, extracting the video frames of the face region in the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
According to the deepfake video detection method provided by the invention, extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool to obtain the mel-frequency cepstral coefficients.
According to the deepfake video detection method provided by the invention, using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
According to the deepfake video detection method provided by the invention, the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
The invention also provides a deepfake video detection device, which comprises:
an analysis module, configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
and a detection module, configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the deepfake video detection method described in any one of the above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the deepfake video detection method described in any one of the above.
According to the deepfake video detection method, device, equipment and medium provided by the invention, audio-visual features comprising visual features and audio features are obtained by analyzing the video to be detected; the visual features and the audio features are each assessed by a preset multilayer perceptron classification model, and by distinguishing real from forged visual features and audio features, more effective and accurate detection of deepfake video is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a first schematic flow chart of the deepfake video detection method provided by the present invention;
FIG. 2 is a second schematic flow chart of the deepfake video detection method provided by the present invention;
FIG. 3 is a schematic flow chart of the composite attention module in the deepfake video detection method provided by the present invention;
FIG. 4 is a schematic structural diagram of the deepfake video detection device provided by the present invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
A deepfake video detection method of the present invention is described below with reference to FIGS. 1-3, comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected.
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result.
The audio-visual features in step S1 comprise visual features and audio features. The preset multilayer perceptron classification model in step S2 is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
Specifically, the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
According to the invention, audio-visual features comprising visual features and audio features are obtained after audio-visual feature analysis of the video to be detected; the visual features and the audio features are each assessed by the preset multilayer perceptron classification model, and by distinguishing real from forged visual features and audio features, more effective and accurate detection of deepfake video is achieved.
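For illustration only, a minimal PyTorch sketch of such a multilayer perceptron classifier is given below; the feature dimension, hidden width and training fragment are assumptions, not values disclosed by the patent.

```python
import torch
import torch.nn as nn

class AVClassifier(nn.Module):
    """Multilayer perceptron over fused audio-visual features.
    feat_dim and hidden_dim are illustrative assumptions."""

    def __init__(self, feat_dim: int = 4096, hidden_dim: int = 512) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # two classes: real vs. deepfake
        )

    def forward(self, av_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(av_features)  # logits of shape (batch, 2)

# Training pairs audio-visual feature samples with their labels
# (here 0 = real video, 1 = deepfake video):
model = AVClassifier()
features = torch.randn(4, 4096)        # a batch of fused features
labels = torch.tensor([0, 1, 1, 0])    # ground-truth labels
loss = nn.CrossEntropyLoss()(model(features), labels)
loss.backward()
```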
Referring to FIG. 2 and FIG. 3 together, performing audio-visual feature analysis on the video to be detected to obtain its audio-visual features specifically comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
and using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features.
The preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients.
The composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer; for example, replacing the middle convolution layer in a residual module of ResNet-50 with the composite attention layer yields a composite attention module. The composite attention layer consists of a self-attention mechanism and a cross-modal attention mechanism; the self-attention and cross-modal attention weights are obtained by passing the visual and auditory features through a convolution layer and a Sigmoid activation function. Composite attention modules can also be stacked in multiple layers, each layer again obtained by replacing the middle convolution layer of a residual module with the composite attention layer. A composite attention model formed by stacking multiple composite attention modules strengthens the feature representation of each modality and performs information interaction between the visual and audio features, which benefits detection in each modality.
To better perform the preliminary feature analysis, in this embodiment, performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting from the video to be detected the video frames of the face region and the mel-frequency cepstral coefficients at the corresponding moments, i.e. extracting the audio features at the moments corresponding to the extracted video frames;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
The present invention uses a residual neural network (ResNet) as the backbone network, which extracts the preliminary visual and audio features from the video frames and the mel-frequency cepstral coefficients, respectively. This feature extraction network extracts the relevant features of the video to be detected and reduces the size of the feature maps, which facilitates effective information interaction between the visual and audio features in the composite attention model.
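A minimal sketch of such a truncated ResNet backbone follows, assuming torchvision and treating the MFCC matrix as a one-channel image for the audio branch; the patent does not fix these layout details.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Visual branch: drop the average-pooling and classification layers so the
# network returns spatial feature maps instead of class scores.
visual_backbone = resnet50(weights=None)
visual_extractor = nn.Sequential(*list(visual_backbone.children())[:-2])

frames = torch.randn(8, 3, 224, 224)          # e.g. T = 8 face crops
visual_prelim = visual_extractor(frames)      # (8, 2048, 7, 7) feature maps

# Audio branch: same idea, with the first convolution adapted to accept a
# one-channel MFCC "image" (an assumed layout).
audio_backbone = resnet50(weights=None)
audio_backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
audio_extractor = nn.Sequential(*list(audio_backbone.children())[:-2])

mfcc = torch.randn(8, 1, 13, 301)             # (T, 1, n_mfcc, audio frames)
audio_prelim = audio_extractor(mfcc)
```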
In this embodiment, extracting the video frames of the face region from the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
The present invention uses a multi-task convolutional neural network (MTCNN) model to extract the face region picture in each frame as the input of the residual neural network, which improves the accuracy of the face-region video frames of the video to be detected. The extraction window can be chosen as the first 3 s of the video to be detected.
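As an illustrative sketch, the face-region extraction could be implemented roughly as follows; the facenet-pytorch MTCNN implementation, OpenCV frame decoding, and the crop size and margin are assumptions, not choices disclosed by the patent.

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN  # one publicly available MTCNN (assumed)

mtcnn = MTCNN(image_size=224, margin=20)  # crop size/margin are assumptions

def extract_face_frames(video_path: str, seconds: float = 3.0):
    """Crop the face region from each frame in the first `seconds` of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    faces = []
    for _ in range(int(fps * seconds)):
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)  # cropped face tensor, or None if no face found
        if face is not None:
            faces.append(face)
    cap.release()
    return faces
```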
Extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool, such as the Librosa tool, to obtain the mel-frequency cepstral coefficients.
In this embodiment, the Librosa tool is first used to frame and window the audio in the video to be detected and to extract the mel-frequency cepstral coefficients (MFCCs); the MFCCs of each audio frame then serve as the input of the residual neural network. The extraction window can likewise be chosen as the first 3 s of the video to be detected. Selecting the same time span for the video and the audio enables detection of deepfake videos in which the video and audio are aligned.
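A minimal sketch of the MFCC extraction with Librosa follows; the sampling rate, FFT window and hop length are illustrative assumptions (Librosa performs the framing and windowing internally), and decoding the audio track of a video file requires an ffmpeg-capable backend.

```python
import librosa

def extract_mfcc(media_path: str, seconds: float = 3.0, n_mfcc: int = 13):
    """MFCCs of the first `seconds` of audio, one column per audio frame."""
    # n_fft/hop_length below correspond to 25 ms windows with 10 ms hops
    # at 16 kHz (assumed values); librosa frames and windows internally.
    y, sr = librosa.load(media_path, sr=16000, duration=seconds)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
```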
To better realize information interaction between the visual and audio features, in this embodiment, using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio (auditory) features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
In this embodiment, taking the visual features as an example, the visual feature is denoted x ∈ R^{T×C×H×W}, where T is the number of extracted video frames, C the number of feature channels, and H and W the height and width of the feature map. First, the preliminary visual feature is divided along the channel dimension into two parts, x_1 ∈ R^{T×(C/2)×H×W} and x_2 ∈ R^{T×(C/2)×H×W}, where x_1 is the self-attention feature and x_2 the cross-modal attention feature. Inputting x_1 into a convolution layer yields the self-attention weight of the visual feature, while the corresponding part of the audio feature is likewise passed through a convolution layer to yield the cross-modal attention weight for the visual feature. The self-attention weight and the cross-modal attention weight derived from the audio feature are multiplied by the two parts of the visual feature (x_1 and x_2, respectively), and the results are concatenated along the channel dimension to obtain the final visual feature. The self-attention mechanism lets the neural network focus on regions of interest, while the cross-modal attention mechanism enables information interaction between the visual and audio features, so that the model pays more attention to video frames and audio frames that are easy to detect, achieving effective and accurate detection of deepfake video.
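The channel-split attention described above could be sketched as follows, assuming the visual and audio feature maps have already been brought to the same shape; the convolution parameters, and folding T into the batch dimension, are assumptions rather than disclosed details.

```python
import torch
import torch.nn as nn

def conv_sigmoid(ch: int) -> nn.Sequential:
    """Weight generator per the described mechanism: convolution + Sigmoid."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

class CompositeAttention(nn.Module):
    """Sketch of the composite attention layer: each modality is split along
    the channel dimension into a self-attention half and a cross-modal half;
    the former is reweighted from its own modality, the latter from the other
    modality, and the two halves are concatenated back together."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        half = channels // 2
        self.self_v, self.self_a = conv_sigmoid(half), conv_sigmoid(half)
        self.cross_v, self.cross_a = conv_sigmoid(half), conv_sigmoid(half)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v, a: (T, C, H, W) feature maps with T folded into the batch axis.
        v1, v2 = v.chunk(2, dim=1)
        a1, a2 = a.chunk(2, dim=1)
        v_out = torch.cat([v1 * self.self_v(v1),     # visual self-attention
                           v2 * self.cross_a(a2)],   # weights from audio
                          dim=1)
        a_out = torch.cat([a1 * self.self_a(a1),     # audio self-attention
                           a2 * self.cross_v(v2)],   # weights from visual
                          dim=1)
        return v_out, a_out
```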
The deepfake video detection device provided by the invention is described below; the device described below and the deepfake video detection method described above may be referred to in correspondence with each other.
Referring to FIG. 4, the present invention further provides a deepfake video detection device, which comprises an analysis module 410 and a detection module 420.
The analysis module 410 is configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected.
The detection module 420 is configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result.
The audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
In the invention, the analysis module 410 obtains the audio-visual features of the video to be detected through audio-visual feature analysis, and the detection module 420 inputs those features into the preset multilayer perceptron classification model to obtain the detection result, thereby achieving more effective and accurate detection of deepfake video.
FIG. 5 illustrates a physical schematic diagram of an electronic device. As shown in FIG. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, where the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the deepfake video detection method provided above, the method comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the deepfake video detection method provided above, the method comprising:
Step S1, performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
Step S2, inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the technical solution above may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A deepfake video detection method, comprising:
performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
inputting the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
2. The deepfake video detection method according to claim 1, wherein performing audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected comprises:
performing preliminary feature analysis on the video to be detected to obtain preliminary audio-visual features;
using a composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features;
wherein the preliminary audio-visual features comprise preliminary visual features derived from the video frames and preliminary audio features derived from the mel-frequency cepstral coefficients;
and the composite attention module is obtained by taking a residual module of a residual neural network as a framework and replacing the middle convolution layer in the residual module with a composite attention layer.
3. The deepfake video detection method according to claim 2, wherein performing preliminary feature analysis on the video to be detected to obtain the preliminary audio-visual features comprises:
extracting the video frames of the face region and the mel-frequency cepstral coefficients from the video to be detected;
and obtaining the preliminary visual features and the preliminary audio features respectively from the video frames and the mel-frequency cepstral coefficients using a residual neural network.
4. The deepfake video detection method according to claim 3, wherein extracting the video frames of the face region in the video to be detected comprises:
extracting the face region picture from each frame of the video to be detected using a multi-task convolutional neural network model to obtain the video frames.
5. The deepfake video detection method according to claim 3, wherein extracting the mel-frequency cepstral coefficients from the video to be detected comprises:
framing and windowing the audio in the video to be detected using an audio analysis tool to obtain the mel-frequency cepstral coefficients.
6. The deepfake video detection method according to any one of claims 2 to 5, wherein using the composite attention module to perform visual-auditory information interaction on the preliminary audio-visual features to obtain the audio-visual features comprises:
inputting the preliminary visual features and the preliminary audio features respectively into a convolution layer in the composite attention module for convolution;
dividing the convolved preliminary visual features and preliminary audio features each into a self-attention feature and a cross-modal attention feature along the channel dimension;
inputting the self-attention features and the cross-modal attention features into the composite attention layer in the composite attention module to obtain self-attention weights for the self-attention features and cross-modal attention weights for the cross-modal attention features;
and multiplying the self-attention features by the self-attention weights, multiplying the cross-modal attention features by the cross-modal attention weights, and concatenating the results to obtain the audio-visual features.
7. The deepfake video detection method according to claim 1, wherein the correspondence between the labels and the audio-visual features of deepfake videos and of real videos is as follows:
the audio-visual features corresponding to the deepfake-video label comprise: real visual features with forged audio features, forged visual features with real audio features, and forged visual features with forged audio features;
the audio-visual features corresponding to the real-video label comprise real visual features with real audio features.
8. A deepfake video detection device, comprising:
an analysis module, configured to perform audio-visual feature analysis on the video to be detected to obtain the audio-visual features of the video to be detected;
and a detection module, configured to input the audio-visual features of the video to be detected into a preset multilayer perceptron classification model to obtain a detection result;
wherein the audio-visual features comprise visual features and audio features, and the preset multilayer perceptron classification model is trained by taking the audio-visual features of deepfake videos and the audio-visual features of real videos as samples, together with the labels corresponding to each.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the deepfake video detection method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the deepfake video detection method of any one of claims 1 to 7.
CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26), published as CN116310937A: Method, device, equipment and medium for detecting deepfake video. Status: Pending.

Priority Applications (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video

Applications Claiming Priority (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video

Publications (1)

CN116310937A (published 2023-06-23)

Family

Family ID: 86782317

Family Applications (1)

CN202211678998.1A (priority date 2022-12-26, filing date 2022-12-26): Method, device, equipment and medium for detecting deepfake video (Pending)

Country Status (1)

CN: CN116310937A (en)


Cited By (2)

* Cited by examiner, † Cited by third party

CN117059131A * (priority 2023-10-13, published 2023-11-14, 南京龙垣信息科技有限公司): False audio detection method based on emotion recognition
CN117059131B * (priority 2023-10-13, published 2024-03-29, 南京龙垣信息科技有限公司): False audio detection method based on emotion recognition


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination