CN117542124A - Face video depth forging detection method and device - Google Patents


Info

Publication number
CN117542124A
CN117542124A (application CN202311268531.4A)
Authority
CN
China
Prior art keywords
video
face region
face
frame
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311268531.4A
Other languages
Chinese (zh)
Inventor
梁坚
徐雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by the Institute of Automation, Chinese Academy of Sciences
Priority: CN202311268531.4A
Publication of CN117542124A
Legal status: Pending

Links

Classifications

    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/40 Scenes; scene-specific elements in video content
    • G06V 40/161 Human faces: detection; localisation; normalisation

Abstract

The invention provides a face video depth forgery detection method and device, relating to the technical field of image processing, comprising the following steps: performing face region recognition and cropping on a target video to obtain a face region video containing only the face region; generating M face region thumbnails corresponding to the face region video from M different groups of N consecutive face region video frames in the face region video, where each face region thumbnail comprises N consecutive face region video frames; inputting each face region thumbnail into a face video depth forgery detection model and outputting the video forgery probability corresponding to each thumbnail; and determining that the target video is a forged video when the average of the video forgery probabilities exceeds a preset first threshold.

Description

Face video depth forging detection method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a face video depth forgery detection method and device.
Background
Deep forgery ("deepfake") technology deceives users by generating and manipulating facial appearance through generative techniques, and with the great success of generative networks, deepfake products have become lifelike to the point where humans cannot distinguish them from genuine footage. These deepfake products may be abused for malicious purposes, leading to serious trust and security problems such as financial fraud, identity theft, and scams. The rapid development of social media exacerbates the abuse of deepfake technology. It is therefore important to develop advanced detection methods to protect the data privacy of individual users.
In the related art, research has focused on video-based methods that detect deepfake videos by modeling spatio-temporal dependencies. Because deep forgery algorithms operate frame by frame, subtle spatio-temporal inconsistencies arise between frames; the core of video-level deepfake detection is therefore to capture these inconsistencies through spatio-temporal modeling. Existing deepfake video detection methods generally follow two directions.
Some methods use a dual-branch network or module to learn spatial and temporal information separately and then fuse them. However, such dual-branch methods may disrupt the cooperation between temporal and spatial features and cause subtle artifacts to be overlooked. Other methods directly use backbone networks capable of temporal modeling, such as long short-term memory (LSTM) networks and 3D-CNNs. However, both approaches involve heavy computation, large model parameter counts, low inference speed, and demanding deployment requirements. Furthermore, the rise of Transformers as backbones for visual tasks has prompted the emergence of corresponding deepfake detection methods. While these approaches have made breakthrough advances in performance, their computational complexity also makes them a significant challenge to deploy and use.
Disclosure of Invention
The invention provides a face video depth forgery detection method and device to overcome the defects of the prior art, in which deepfake detection is computationally complex and difficult to deploy and use.
The invention provides a face video depth forgery detection method, which comprises the following steps:
carrying out face region identification and cutting on the target video to obtain a face region video only containing a face region;
generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises N continuous frames of face region video frames;
respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model, and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and determining that the target video is a forged video when the average of the video forgery probabilities exceeds a preset first threshold.
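The four steps above can be sketched end to end as follows. This is a minimal illustration only: `crop_face_regions` and `build_thumbnails` are hypothetical stand-ins for the MTCNN preprocessing and thumbnail layout described later, and the detection model is assumed to be a callable returning a forgery probability per thumbnail.

```python
def crop_face_regions(video):
    # Step 1 stand-in: MTCNN-based face detection and cropping.
    return video

def build_thumbnails(face_video, m_groups, n_frames):
    # Step 2 stand-in: one thumbnail per group of n_frames consecutive frames.
    return [face_video[i:i + n_frames]
            for i in range(0, m_groups * n_frames, n_frames)]

def detect_deepfake(video, model, m_groups=4, n_frames=4, threshold=0.5):
    face_video = crop_face_regions(video)
    thumbnails = build_thumbnails(face_video, m_groups, n_frames)
    probs = [model(t) for t in thumbnails]          # step 3: per-thumbnail probability
    return sum(probs) / len(probs) > threshold      # step 4: average and threshold
```

The threshold value 0.5 is an assumed example; the patent only requires a preset first threshold.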
According to the face video depth forgery detection method provided by the invention, face region identification and clipping are carried out on a target video to obtain a face region video only containing a face region, and the face region video depth forgery detection method comprises the following steps:
inputting the target video into a P-Net in a face detection model to quickly generate a face candidate window;
filtering non-face windows in the face candidate windows through R-Net in the face detection model;
and selecting a face region in the filtering result through the O-Net in the face detection model to obtain a face region video only comprising the face region.
According to the face video depth forging detection method provided by the invention, before the step of respectively inputting the face region thumbnails of each frame into the face video depth forging detection model and outputting the video forging probability corresponding to the face region thumbnails of each frame, the face video depth forging detection method further comprises the following steps:
acquiring at least one video sample and a video counterfeiting probability label corresponding to the video sample;
generating a multi-frame human face region thumbnail sample image according to continuous P-frame human face region video sample frames in the video samples, wherein each frame of human face region thumbnail sample image comprises continuous P-frame human face region video sample frames;
taking each frame of the face region thumbnail sample image and the corresponding video counterfeiting probability label as a training sample to obtain a plurality of training samples;
and training the preset model through a plurality of training samples.
According to the face video depth falsification detection method provided by the invention, a multi-frame face region thumbnail sample image is generated according to continuous P-frame face region video sample frames in the video sample, and the method comprises the following steps:
adding masks at the same positions in each group of continuous P-frame face region video sample frames to obtain a multi-frame face region thumbnail sample image;
the mask positions in the face region thumbnail sample map of each frame are randomly generated.
According to the face video depth forgery detection method provided by the invention, the preset model is trained through a plurality of training samples, and the method comprises the following steps:
for any training sample, inputting the training sample into a preset model, and outputting the video forging probability corresponding to the training sample;
calculating a loss value based on the video forgery probability corresponding to the training sample and the video forgery probability label;
and under the condition that the loss value is smaller than a preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
According to the face video depth forgery detection method provided by the invention, the preset model is a TALL-Swin model.
The invention also provides a device for detecting the depth forgery of the face video, which comprises the following steps:
the recognition module is used for recognizing and cutting the face area of the target video to obtain a face area video only containing the face area;
the generating module is used for generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises the continuous N frames of face region video frames;
the output module is used for respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and the judging module is used for judging that the target video is a forged video when the average of the video forgery probabilities exceeds a preset first threshold.
The device is also for:
the face region identification and clipping are carried out on the target video to obtain a face region video only containing the face region, which comprises the following steps:
inputting the target video into a P-Net in a face detection model to quickly generate a face candidate window;
filtering non-face windows in the face candidate windows through R-Net in the face detection model;
and selecting a face region in the filtering result through the O-Net in the face detection model to obtain a face region video only comprising the face region.
The device is also for:
acquiring at least one video sample and a video counterfeiting probability label corresponding to the video sample;
generating a multi-frame human face region thumbnail sample image according to continuous P-frame human face region video sample frames in the video samples, wherein each frame of human face region thumbnail sample image comprises continuous P-frame human face region video sample frames;
taking each frame of the face region thumbnail sample image and the corresponding video counterfeiting probability label as a training sample to obtain a plurality of training samples;
and training the preset model through a plurality of training samples.
The device is also for:
adding masks at the same positions in each group of continuous P-frame face region video sample frames to obtain a multi-frame face region thumbnail sample image;
the mask positions in the face region thumbnail sample map of each frame are randomly generated.
The device is also for:
for any training sample, inputting the training sample into a preset model, and outputting the video forging probability corresponding to the training sample;
calculating a loss value based on the video forgery probability corresponding to the training sample and the video forgery probability label;
and under the condition that the loss value is smaller than a preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
The device is also for:
the preset model is a TALL-Swin model.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the face video depth forgery detection method according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a face video depth falsification detection method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a face video depth falsification detection method as described in any one of the above.
According to the face video depth forgery detection method and device, face region recognition and cropping are performed on the target video to obtain a face region video containing only the face region, which effectively reduces the processing load from non-face data. Converting the video frames into face region thumbnails further reduces the data volume of the analysis object, and the thumbnails are analyzed directly by the face video depth forgery detection model to finally determine whether the target video is forged. The method has the advantages of low computational cost, fast inference, the ability to capture spatio-temporal inconsistencies, and strong generalization; it can efficiently and accurately detect deepfake face videos and curb the impact of deep forgery on network security.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a face video depth forgery detection method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a face video depth forgery detection device provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a face video depth forgery detection method provided in an embodiment of the present application, as shown in fig. 1, including:
step 110, recognizing and cutting a face region of a target video to obtain a face region video only containing the face region;
in the embodiment of the present application, the target video may specifically be a video carrying a face.
In the embodiment of the application, the face region identification and clipping are performed on the target video, specifically, the detection analysis is performed on the target video through a face detection model, the preprocessing is performed on the video, the face region in the video is detected, tracked and aligned, and the clipping is performed to obtain the face region video with only the face region.
More specifically, the face region recognition and clipping are performed on the target video to obtain a face region video only containing the face region, including:
inputting the target video into a P-Net in a face detection model to quickly generate a face candidate window;
filtering non-face windows in the face candidate windows through R-Net in the face detection model;
and selecting a face region in the filtering result through the O-Net in the face detection model to obtain a face region video only comprising the face region.
In the embodiment of the application, the face detection model is based on Multi-task Cascaded Convolutional Networks (MTCNN); it can find the position and size of the face in each frame of the video and completes the tasks of face detection and face alignment simultaneously;
the face detection model is an open-source model that can be obtained directly from the open-source community; it is a preliminary step of the invention, used to preprocess the face video so that the subsequent detection steps can proceed conveniently;
the network structure of the face detection model consists of three cascaded CNNs, and the video processing flow is as follows:
in the first stage, candidate windows are quickly generated by a shallow CNN (Proposal Network, P-Net);
then, a large number of non-face windows are filtered out by a more complex CNN (Refinement Network, R-Net);
finally, the result is refined again by a more powerful CNN (Output Network, O-Net), and the face region with the largest area in the output is selected as the final result of the preprocessing stage, yielding a face region video containing only the face region.
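Purely as an illustration of the cascade above, the stage functions below are simplified stand-ins: the real P-Net, R-Net, and O-Net are CNNs run over image pyramids, while here the candidate windows are hard-coded tuples `(x, y, w, h, score)` invented for the sketch.

```python
def p_net(frame):
    # Stage 1: quickly propose candidate face windows (x, y, w, h, score).
    # A real P-Net scans a shallow CNN over an image pyramid.
    return [(10, 10, 40, 40, 0.90), (60, 5, 20, 20, 0.40), (30, 30, 80, 80, 0.95)]

def r_net(candidates, threshold=0.5):
    # Stage 2: filter out low-confidence (non-face) windows.
    return [c for c in candidates if c[4] >= threshold]

def o_net(candidates):
    # Stage 3: refine the remaining boxes and keep the largest-area
    # face region as the final result of the preprocessing stage.
    return max(candidates, key=lambda c: c[2] * c[3])

def face_region(frame):
    return o_net(r_net(p_net(frame)))
```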
Step 120, generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises continuous N frames of face region video frames;
in the embodiment of the application, for a given face region video, the face region video may be further divided into M groups of different continuous N-frame face region video frames, and M-frame face region thumbnails corresponding to the face region video are generated.
For example, given a video with T frames, C channels, and per-frame resolution H×W: the video is divided into M equal-length segments; within each segment, N consecutive frames (N is set to 4 by default) are sampled from a random starting position to form one clip; the frames of each clip are then rearranged as sub-images to compose one thumbnail I, yielding the multi-frame face region thumbnails.
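The segment-wise sampling just described can be sketched over frame indices as follows (a minimal illustration; the helper name and index representation are assumptions for the sketch):

```python
import random

def sample_clip_indices(num_frames, m_segments, n_consecutive=4):
    # Split the frame indices into M equal-length segments and, inside each,
    # sample N consecutive indices starting from a random position.
    seg_len = num_frames // m_segments
    clips = []
    for s in range(m_segments):
        lo = s * seg_len
        hi = max(lo, lo + seg_len - n_consecutive)
        start = random.randint(lo, hi)
        clips.append(list(range(start, start + n_consecutive)))
    return clips
```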
For example, a face region thumbnail may be formed by arranging 4 consecutive face region video frames sequentially in a 2×2 grid layout.
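A 2×2 thumbnail of this kind can be assembled with a few lines of NumPy (illustrative only; the left-to-right, top-to-bottom frame order is an assumption):

```python
import numpy as np

def make_thumbnail(frames):
    # frames: four consecutive H×W×C face-region frames.
    # Tile them into one 2H×2W×C image, left-to-right, top-to-bottom.
    top = np.concatenate(frames[:2], axis=1)
    bottom = np.concatenate(frames[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)
```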
Step 130, respectively inputting each face region thumbnail into a face video depth forgery detection model and outputting the video forgery probability corresponding to each face region thumbnail;
in the embodiment of the application, after each frame of face region thumbnail is respectively input into the face video depth falsification detection model, the video falsification probability corresponding to each frame of face region thumbnail can be obtained.
In the embodiment of the application, the probability of forging by video is used to identify the probability that the face region video frame in the face region thumbnail is likely to be forged.
And step 140, determining that the target video is a forged video when the average of the video forgery probabilities exceeds a preset first threshold.
In the embodiment of the present application, the first threshold may be a preset value.
In the embodiment of the application, after the video forgery probabilities corresponding to the face region thumbnails are obtained, their average can be computed and taken as the video forgery probability of the target video.
In this embodiment of the present application, when the average of the video forgery probabilities exceeds the preset first threshold, the target video is likely to be a forged face video, and the target video is therefore determined to be a forged video.
In an alternative embodiment, when the average of the video forgery probabilities does not exceed the preset first threshold, the target video is determined not to be a forged video.
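The decision rule of steps 130 and 140 reduces to a few lines (the threshold value 0.5 is an assumed example, not specified by the text):

```python
def is_forged(thumbnail_probs, threshold=0.5):
    # Average the per-thumbnail forgery probabilities; the video is
    # judged forged when the mean exceeds the preset first threshold.
    return sum(thumbnail_probs) / len(thumbnail_probs) > threshold
```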
In the embodiment of the application, performing face region recognition and cropping on the target video to obtain a face region video containing only the face region effectively reduces the processing load from non-face data; converting the video frames into face region thumbnails reduces the data volume of the analysis object; and directly analyzing the smaller thumbnails with the face video depth forgery detection model finally determines whether the target video is forged. The method has the advantages of low computational cost, fast inference, and strong generalization, can efficiently and accurately detect deepfake face videos, and curbs the impact of deep forgery on network security.
Optionally, before the step of inputting the face region thumbnail of each frame into the face video depth forgery detection model and outputting the video forgery probability corresponding to the face region thumbnail of each frame, the method further includes:
acquiring at least one video sample and a video counterfeiting probability label corresponding to the video sample;
generating a multi-frame human face region thumbnail sample image according to continuous P-frame human face region video sample frames in the video samples, wherein each frame of human face region thumbnail sample image comprises continuous P-frame human face region video sample frames;
taking each frame of the face region thumbnail sample image and the corresponding video counterfeiting probability label as a training sample to obtain a plurality of training samples;
and training the preset model through a plurality of training samples.
In an embodiment of the present application, generating a multi-frame face region thumbnail sample image according to consecutive P-frame face region video sample frames in the video samples includes:
adding masks at the same positions in each group of continuous P-frame face region video sample frames to obtain a multi-frame face region thumbnail sample image;
the mask positions in the face region thumbnail sample map of each frame are randomly generated.
In the embodiment of the application, after the video sample is obtained, the video sample can be further input into a face detection model to obtain a face region video only comprising a face region;
and then generating a multi-frame human face region thumbnail sample image according to the continuous P-frame human face region video sample frames, wherein each frame of human face region thumbnail sample image comprises the continuous P-frame human face region video sample frames.
In this embodiment of the present application, the face region thumbnail sample map may be specifically obtained by stitching thumbnails of face region video sample frames.
In the embodiment of the application, the face region thumbnail sample image and the corresponding video forgery probability label are taken as one training sample, and a plurality of training samples are obtained.
More specifically, in the embodiment of the present application, a mask may further be added at the same position in each group of consecutive P face region video sample frames, where the size of the mask is S×S.
In the present embodiment, the mask is based on two core designs: 1) mask positions differ randomly between thumbnails, encouraging the network to pay more attention to complementary and less prominent features; 2) within a thumbnail, the mask location on every sub-image is fixed, exploiting the fact that most deepfake videos are tampered with frame by frame and forcing the model to detect inconsistencies between adjacent frames of the forged video. The mask positions in different face region thumbnail sample maps are generated independently at random and may be the same or different.
In the embodiment of the application, this mask design forces the model to detect inconsistencies between adjacent frames of the forged video, effectively ensuring the effectiveness of model training.
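The two-part mask design can be sketched as below (illustrative only; frames are pure-Python nested lists, the mask position is drawn once per clip and applied at the same location in every sub-image, and the fill value is an assumption):

```python
import random

def mask_clip(frames, mask_size, fill=0):
    # frames: the N sub-images of one thumbnail, each an H×W grid.
    # One mask position is drawn at random per clip (design 1), then
    # applied at the SAME location in every sub-image (design 2).
    h, w = len(frames[0]), len(frames[0][0])
    y = random.randint(0, h - mask_size)
    x = random.randint(0, w - mask_size)
    masked = []
    for f in frames:
        g = [row[:] for row in f]
        for dy in range(mask_size):
            for dx in range(mask_size):
                g[y + dy][x + dx] = fill
        masked.append(g)
    return masked
```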
Optionally, training the preset model through a plurality of training samples includes:
for any training sample, inputting the training sample into a preset model, and outputting the video forging probability corresponding to the training sample;
calculating a loss value based on the video forgery probability corresponding to the training sample and the video forgery probability label;
and under the condition that the loss value is smaller than a preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
In this embodiment, the preset model may specifically be an adjusted Swin Transformer network architecture; this is a preferred embodiment, and the preset model may also be configured as another common 2D-CNN network or a common visual Transformer architecture.
More specifically, the preset model enlarges the window sizes of the first three stages of the Swin Transformer, so that interaction between the frames in a thumbnail becomes more frequent;
the window size of the last stage is set equal to the feature map size, enabling global attention computation so that the model captures global spatio-temporal dependencies; the window sizes across the four stages are therefore set to [14, 14, 14, 7].
In the embodiment of the application, the model can compute self-attention while further taking the spatial dependencies across sub-images into account.
In the embodiments of the present application, using the local and global context of deep forgery patterns ensures strong modeling capability for both short-range and long-range spatial dependencies, and the patch merging process enables the model to capture more comprehensive dependencies through hierarchical representations.
In the embodiment of the application, the cross entropy loss function is adopted as the objective function for training the network; the loss function is defined as:
L = -(1/n) Σ_{i=1}^{n} [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ]
where x_i denotes the i-th input segment, n is the number of video segments, y_i denotes the label of the segment, and f is the model of the embodiment.
In the embodiment of the application, the loss value is calculated through the loss function, and the preset model is continuously optimized with the aim of minimizing the loss value, and under the condition that the loss value is smaller than the preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
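Matching the objective above, a minimal binary cross-entropy computation looks like this (illustrative; the label convention 1 = forged, 0 = genuine is an assumption, and f(x_i) is represented directly by the model's output probabilities):

```python
import math

def cross_entropy_loss(probs, labels):
    # probs:  model outputs f(x_i) for the n segments
    # labels: ground-truth labels y_i (assumed: 1 = forged, 0 = genuine)
    n = len(probs)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / n
```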
The invention balances speed and accuracy, sacrificing a small amount of spatial information while preserving performance. Attention-based models are better at handling contextual features, and the Swin Transformer uses shifted windows to reduce computation and memory.
The face video depth forgery detection device provided by the invention is described below, and the face video depth forgery detection device described below and the face video depth forgery detection method described above can be referred to correspondingly.
Fig. 2 is a schematic structural diagram of a face video depth forgery detection device provided in an embodiment of the present application, as shown in fig. 2, including:
the recognition module 210 is configured to recognize and cut a face region of the target video, so as to obtain a face region video only including the face region;
the generating module 220 is configured to generate M frames of face region thumbnails corresponding to the face region video according to M groups of different consecutive N frames of face region video frames in the face region video, where each frame of face region thumbnail includes consecutive N frames of face region video frames;
the output module 230 is configured to input the face region thumbnails of each frame into a face video depth forgery detection model, and output a video forgery probability corresponding to the face region thumbnails of each frame;
the determining module 240 is configured to determine that the target video is a forged video if the average value of the video forgery probabilities exceeds a preset first threshold value.
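The module pipeline above can be sketched as follows. The 2x2 tiling of N=4 consecutive frames into one thumbnail and the disjoint grouping into M segments are illustrative assumptions (the embodiment only specifies that each thumbnail contains N consecutive face-region frames), and `model` is a stand-in callable returning a per-thumbnail forgery probability:

```python
import numpy as np

def make_thumbnail(frames):
    """Tile N=4 consecutive face-region frames into one 2x2 thumbnail image."""
    assert len(frames) == 4
    top = np.concatenate(frames[:2], axis=1)
    bottom = np.concatenate(frames[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)

def detect_forgery(face_frames, model, threshold=0.5, n=4):
    """Score each of the M thumbnails, average the probabilities, and compare
    the average against the preset first threshold."""
    probs = []
    for start in range(0, len(face_frames) - n + 1, n):  # M groups of N frames
        thumb = make_thumbnail(face_frames[start:start + n])
        probs.append(model(thumb))                       # per-thumbnail probability
    return float(np.mean(probs)) > threshold             # True -> judged forged
```

This mirrors the identification, generation, output and determination modules: crop, tile, score, then threshold the mean probability.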
The device is also for:
performing face region identification and clipping on the target video to obtain a face region video containing only the face region includes:
inputting the target video into a P-Net in a face detection model to quickly generate a face candidate window;
filtering non-face windows in the face candidate windows through R-Net in the face detection model;
and selecting a face region in the filtering result through the O-Net in the face detection model to obtain a face region video only comprising the face region.
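A minimal sketch of this three-stage cascade, with `p_net`, `r_net` and `o_net` as stand-in scoring callables rather than the real MTCNN networks; the sliding-window stride, window size and stage thresholds are illustrative assumptions:

```python
import numpy as np

def mtcnn_cascade(frame, p_net, r_net, o_net, thresholds=(0.6, 0.7, 0.8)):
    """Three-stage cascade: P-Net proposes candidate windows, R-Net filters
    non-face windows, O-Net selects the final face regions."""
    # Stage 1: P-Net quickly proposes candidate windows over a sliding grid.
    candidates = [(x, y, 12)
                  for y in range(0, frame.shape[0] - 12, 4)
                  for x in range(0, frame.shape[1] - 12, 4)]
    candidates = [w for w in candidates if p_net(frame, w) > thresholds[0]]
    # Stage 2: R-Net rejects the non-face windows among the survivors.
    candidates = [w for w in candidates if r_net(frame, w) > thresholds[1]]
    # Stage 3: O-Net keeps only the final face regions.
    return [w for w in candidates if o_net(frame, w) > thresholds[2]]
```

Running this per frame and cropping the surviving windows would yield the face-region-only video described above.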
The device is also for:
acquiring at least one video sample and a video counterfeiting probability label corresponding to the video sample;
generating a multi-frame human face region thumbnail sample image according to continuous P-frame human face region video sample frames in the video samples, wherein each frame of human face region thumbnail sample image comprises continuous P-frame human face region video sample frames;
taking each frame of the face region thumbnail sample image and the corresponding video counterfeiting probability label as a training sample to obtain a plurality of training samples;
and training the preset model through a plurality of training samples.
The device is also for:
adding masks at the same positions in each group of continuous P-frame face region video sample frames to obtain a multi-frame face region thumbnail sample image;
the mask positions in the face region thumbnail sample map of each frame are randomly generated.
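A sketch of this masking step, assuming simple zero-filled square masks: the mask position is drawn once per group, so it is identical across all P frames of that sample map, and is re-drawn for each new sample map. The mask size and fill value are assumptions:

```python
import numpy as np

def mask_group(frames, mask_size=8, rng=None):
    """Apply a mask at one random position, identical across every frame
    in a group of P consecutive face-region sample frames."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = frames[0].shape[:2]
    y = int(rng.integers(0, h - mask_size + 1))   # position drawn once per group,
    x = int(rng.integers(0, w - mask_size + 1))   # so it differs between sample maps
    masked = []
    for f in frames:
        f = f.copy()
        f[y:y + mask_size, x:x + mask_size] = 0   # same patch zeroed in every frame
        masked.append(f)
    return masked
```

Calling `mask_group` once per thumbnail sample map gives masks that are consistent within a group but random across groups, as the embodiment requires.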
The device is also for:
for any training sample, inputting the training sample into a preset model, and outputting the video forging probability corresponding to the training sample;
calculating a loss value based on the video forgery probability corresponding to the training sample and the video forgery probability label;
and under the condition that the loss value is smaller than a preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
The device is also for:
the preset model is a TALL-Swin model.
According to the method, face region recognition and clipping are performed on the target video to obtain a face region video containing only the face region, which effectively reduces the data pressure of processing non-face data. The video frames are then converted into face region thumbnails, which further reduces the data volume of the analysis object, and the face video depth forgery detection model directly analyzes these smaller thumbnails to finally judge whether the target video is a forged video. The method has a small computational cost and a fast inference speed, and as a video-level method it can capture spatio-temporal inconsistencies and generalizes well, so that deep-forged face videos can be detected efficiently and accurately, curbing the influence of deep forgery on network security.
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 3, the electronic device may include: a processor 310, a communication interface (Communications Interface) 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other through the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a face video depth forgery detection method comprising: carrying out face region identification and cutting on the target video to obtain a face region video only containing a face region;
generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises N continuous frames of face region video frames;
respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model, and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and under the condition that the average value of the video forgery probabilities exceeds a preset first threshold value, judging that the target video is a forged video.
Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, essentially or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, where the computer program can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can execute the face video depth forgery detection method provided above, the method comprising: carrying out face region identification and cutting on the target video to obtain a face region video only containing a face region;
generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises N continuous frames of face region video frames;
respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model, and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and under the condition that the average value of the video forgery probabilities exceeds a preset first threshold value, judging that the target video is a forged video.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the face video depth forgery detection method provided above, the method comprising: carrying out face region identification and cutting on the target video to obtain a face region video only containing a face region;
generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises N continuous frames of face region video frames;
respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model, and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and under the condition that the average value of the video forgery probabilities exceeds a preset first threshold value, judging that the target video is a forged video.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The face video depth forgery detection method is characterized by comprising the following steps of:
carrying out face region identification and cutting on the target video to obtain a face region video only containing a face region;
generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises N continuous frames of face region video frames;
respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model, and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and under the condition that the average value of the video forgery probabilities exceeds a preset first threshold value, judging that the target video is a forged video.
2. The face video depth forgery detection method according to claim 1, wherein the face region recognition and clipping are performed on the target video to obtain a face region video containing only the face region, comprising:
inputting the target video into a P-Net in a face detection model to quickly generate a face candidate window;
filtering non-face windows in the face candidate windows through R-Net in the face detection model;
and selecting a face region in the filtering result through the O-Net in the face detection model to obtain a face region video only comprising the face region.
3. The face video depth forgery detection method according to claim 1, characterized by further comprising, before the step of inputting the face region thumbnails of each frame into the face video depth forgery detection model and outputting the video forgery probability corresponding to the face region thumbnail of each frame, respectively:
acquiring at least one video sample and a video counterfeiting probability label corresponding to the video sample;
generating a multi-frame human face region thumbnail sample image according to continuous P-frame human face region video sample frames in the video samples, wherein each frame of human face region thumbnail sample image comprises continuous P-frame human face region video sample frames;
taking each frame of the face region thumbnail sample image and the corresponding video counterfeiting probability label as a training sample to obtain a plurality of training samples;
and training the preset model through a plurality of training samples.
4. A face video depth forgery detection method according to claim 3, wherein generating a multi-frame face region thumbnail sample map from consecutive P-frame face region video sample frames in the video samples comprises:
adding masks at the same positions in each group of continuous P-frame face region video sample frames to obtain a multi-frame face region thumbnail sample image;
the mask positions in the face region thumbnail sample map of each frame are randomly generated.
5. A face video depth forgery detection method according to claim 3, wherein training a preset model by a plurality of the training samples comprises:
for any training sample, inputting the training sample into a preset model, and outputting the video forging probability corresponding to the training sample;
calculating a loss value based on the video forgery probability corresponding to the training sample and the video forgery probability label;
and under the condition that the loss value is smaller than a preset threshold value, training of the preset model is completed, and the face video depth forgery detection model is obtained.
6. The method for detecting the depth forgery of a face video according to claim 5, wherein the preset model is a TALL-Swin model.
7. A face video depth forgery detection device, characterized by comprising:
the recognition module is used for recognizing and cutting the face area of the target video to obtain a face area video only containing the face area;
the generating module is used for generating M frames of face region thumbnails corresponding to the face region video according to M groups of different continuous N frames of face region video frames in the face region video, wherein each frame of face region thumbnail comprises the continuous N frames of face region video frames;
the output module is used for respectively inputting the face region thumbnails of each frame into a face video depth forgery detection model and outputting video forgery probability corresponding to the face region thumbnails of each frame;
and the judging module is used for judging that the target video is a forged video under the condition that the average value of the video forgery probabilities exceeds a preset first threshold value.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the face video depth falsification detection method of any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the face video depth forgery detection method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which when executed by a processor implements a face video depth falsification detection method as claimed in any one of claims 1 to 6.
CN202311268531.4A 2023-09-27 2023-09-27 Face video depth forging detection method and device Pending CN117542124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311268531.4A CN117542124A (en) 2023-09-27 2023-09-27 Face video depth forging detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311268531.4A CN117542124A (en) 2023-09-27 2023-09-27 Face video depth forging detection method and device

Publications (1)

Publication Number Publication Date
CN117542124A true CN117542124A (en) 2024-02-09

Family

ID=89794554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311268531.4A Pending CN117542124A (en) 2023-09-27 2023-09-27 Face video depth forging detection method and device

Country Status (1)

Country Link
CN (1) CN117542124A (en)

Similar Documents

Publication Publication Date Title
Zhuang et al. Image tampering localization using a dense fully convolutional network
CN111738230B (en) Face recognition method, face recognition device and electronic equipment
CN112597941B (en) Face recognition method and device and electronic equipment
CN111611873A (en) Face replacement detection method and device, electronic equipment and computer storage medium
Zhang et al. Face anti-spoofing detection based on DWT-LBP-DCT features
CN111414888A (en) Low-resolution face recognition method, system, device and storage medium
CN113221890A (en) OCR-based cloud mobile phone text content supervision method, system and system
CN116311214B (en) License plate recognition method and device
CN110929635A (en) False face video detection method and system based on face cross-over ratio under trust mechanism
CN111444817B (en) Character image recognition method and device, electronic equipment and storage medium
Zhu et al. Deepfake detection with clustering-based embedding regularization
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
CN113361567B (en) Image processing method, device, electronic equipment and storage medium
CN116189063B (en) Key frame optimization method and device for intelligent video monitoring
CN113807237B (en) Training of in vivo detection model, in vivo detection method, computer device, and medium
CN115546906A (en) System and method for detecting human face activity in image and electronic equipment
CN117542124A (en) Face video depth forging detection method and device
CN114694196A (en) Living body classifier establishing method, human face living body detection method and device
Choubey et al. Bilateral Partitioning based character recognition for Vehicle License plate
Irshad et al. IMGCAT: An approach to dismantle the anonymity of a source camera using correlative features and an integrated 1D convolutional neural network
CN114065867B (en) Data classification method and system and electronic equipment
CN117877129A (en) Deep fake image detection method, system and device based on information bottleneck
CN108664852B (en) Face detection method and device
CN115797980A (en) Palm print identification method, palm print identification system, electronic equipment and storage medium
CN117114690A (en) Payment verification method based on iris recognition and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination