CN111967427A - Fake face video identification method, system and readable storage medium - Google Patents

Fake face video identification method, system and readable storage medium

Info

Publication number: CN111967427A
Application number: CN202010882723.4A (filed by Guangdong University of Technology)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: face, neural network, image, attention, video
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Inventors: 方俊涛, 孙宇平, 凌捷, 罗玉
Current and original assignee: Guangdong University of Technology (the listed assignees may be inaccurate)

Classifications

    • G06V 40/168 (Human faces: Feature extraction; Face representation)
    • G06F 18/241 (Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches)
    • G06N 3/045 (Neural networks: Combinations of networks)
    • G06N 3/048 (Neural networks: Activation functions)
    • G06N 3/08 (Neural networks: Learning methods)
    • G06V 40/172 (Human faces: Classification, e.g. identification)


Abstract

The invention discloses a method, a system and a readable storage medium for identifying forged face videos based on a convolutional neural network and an attention mechanism. The method comprises the following steps: sampling an input video sequence to obtain N video frames; performing face detection, cropping and alignment on the video frames to obtain high-quality face images, yielding a sample set to be tested; and feeding the sample set to be tested into a trained attention-mechanism deep convolutional neural network model to obtain an output that indicates the authenticity of the video. By deriving negative samples from specially processed positive samples, the invention reduces the time cost of obtaining negative samples while closely simulating the face images of forged face videos, so that the trained network has good discrimination ability; in addition, the invention can highlight the manipulated image regions so as to guide the neural network to examine those regions, which benefits face-forgery detection and improves the accuracy of the original CNN model.

Description

Fake face video identification method, system and readable storage medium
Technical Field
The invention relates to the field of computer vision and artificial intelligence, in particular to a method and a system for identifying a fake face video based on a convolutional neural network and an attention mechanism and a readable storage medium.
Background
As deep learning techniques mature, deep-fake videos, typified by DeepFake, have become increasingly widespread. Deep learning can create face-swapped videos convincing enough to pass the fake off as real: these forged videos look extremely realistic. To date, such videos have not only been used for pornography and smear attacks, creating the impression that certain celebrities are doing things damaging to their reputation, but, more alarmingly, deep-fake videos have also been used to impersonate political figures. Adam Schiff, chairman of the U.S. House Intelligence Committee, has warned that Deepfake-generated video could have a catastrophic effect on the 2020 U.S. presidential election. Because such fake videos are difficult to detect and identify, they can cause significant harm to society as a whole and to both the private and public sectors.
Artificial-intelligence-based video synthesis algorithms: the new generation of AI-based video synthesis algorithms builds on recent advances in deep learning models, in particular generative adversarial networks (GANs). A GAN model consists of two jointly, adversarially trained deep neural networks, a discriminator and a generator. After training, the generator is used to synthesize realistic-looking images. The GAN model inspired much subsequent work on image synthesis; for example, Emily L. Denton, Soumith Chintala, Rob Fergus et al. proposed a Laplacian-pyramid generative adversarial network for image generation, and Alec Radford, Luke Metz and Soumith Chintala proposed unsupervised representation learning with deep convolutional generative adversarial networks. In 2017, Ming-Yu Liu et al. proposed an unsupervised image-to-image translation framework based on coupled GANs, which aims to learn joint representations of images in different domains; this algorithm is the basis of the DeepFake algorithm. The creation of a DeepFake video begins with an input video of a particular person (the "target") and generates another video, based on a GAN model trained to switch between the faces of the target and a "source", in which the target's face is replaced with the source's face.
Detection of GAN-generated images/videos: traditional forgeries can be detected with methods such as the two-stream CNN for face tamper detection proposed by Peng Zhou et al., or Noiseprint, which uses a CNN model to track device fingerprints for forgery detection. Recently, progress has also been made in detecting GAN-generated images and videos. Li et al. observed that deep-fake faces lack realistic eye blinking, because training images collected from the internet rarely include photographs of subjects with closed eyes; they therefore used a CNN/RNN model to expose deep-fake videos by detecting blinks. However, an attacker can evade such detection simply by deliberately including closed-eye images in the training data. Afchar et al. trained a convolutional neural network named MesoNet to directly classify real videos and fake videos synthesized by DeepFake and Face2Face; this work extends to the time domain by stacking an RNN on the CNN, and shows good performance. However, these methods have disadvantages: generating fake images with AI-based synthesis algorithms is inefficient and requires high time and equipment costs, and the accuracy of identifying deep face-forgery videos still needs further improvement.
Disclosure of Invention
In view of the above problems, it is an object of the present invention to provide a method, a system and a readable storage medium for identifying forged face videos based on a convolutional neural network and an attention mechanism, which can identify forged face videos with high accuracy.
The invention provides a fake face video identification method based on a convolutional neural network and an attention mechanism, which comprises the following steps:
sampling an input video sequence to obtain N video frames;
detecting, cutting and aligning the video frame with the face to obtain a high-quality face image, and obtaining a sample set to be detected;
and feeding the sample set to be tested into a trained attention-mechanism deep convolutional neural network model to obtain an output result that indicates the authenticity of the video.
In this scheme, the method further comprises performing face detection, cropping and alignment on the video frames using the dlib package in Python, and screening to obtain high-quality face images, yielding the sample set to be tested.
In this scheme, the training method of the attention mechanism deep convolutional neural network model includes:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images to multiple scales so as to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5×5;
warping the smoothed face region image back to the size of the original image through an affine transformation, thereby simulating the artefacts of a forged face;
obtaining a negative sample image by changing the shape of the affine-warped face region;
and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the model.
In this scheme, cropping a preset region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model is specifically:
after the positive sample image and the negative sample image are obtained, cutting the positive sample image and the negative sample image;
the RoI region is determined using the face landmarks technique, with coordinates

RoI = [y0 - ŷ0, x0 - x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the smallest rectangle surrounding the entire face, and the variables ŷ0, ŷ1 and x̂0, x̂1 take random values in [0, h/5] and [0, w/8] respectively, h and w being the height and width of the face rectangle;
finally, the RoI is resized to 224×224 for CNN model training; each training RoI region is cropped 10 times, and all RoI predictions are averaged as the final fake-video probability.
In this scheme, establishing the attention-mechanism deep convolutional neural network model specifically comprises:
implementing an attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer; the layer consists of a convolution, a channel-wise multiplication and an activation operation;
inserting an attention-based layer in a CNN, selecting Resnet50 as a backbone network, inserting the attention-based layer between Stage2 and Stage3 of Resnet50, using a feature map output by Stage2 as an input of the attention-based layer, and using an output of the attention-based layer as an input of Stage 3.
In this scheme, implementing the attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer and the layer consists of a convolution, a channel-wise multiplication and an activation operation, is specifically as follows:
the input to the attention-based layer is a convolutional feature map F ∈ R^(B×H×W×C), where B is the batch size and H, W, C are the height, width and number of channels respectively; the layer is implemented with a Direct Regression method;
spatial attention is computed with an auxiliary convolutional layer: one channel is added to the preceding convolutional layer to generate a feature map F1 ∈ R^(B×H×W×(C+1)); F1 is split into F2 ∈ R^(B×H×W×1) and a feature map F3 ∈ R^(B×H×W×C); the attention map Matt ∈ R^(B×H×W×1) is obtained from the last channel of F1 through a Sigmoid operation; the feature map F3 is channel-wise multiplied with Matt, and the result is fed to the subsequent convolutional layer.
In a second aspect, the invention provides a forged-face-video identification system based on a convolutional neural network and an attention mechanism, comprising a memory and a processor, wherein the memory stores a program of the forged-face-video identification method based on a convolutional neural network and an attention mechanism; when the program is executed by the processor, the following steps are implemented:
sampling an input video sequence to obtain N video frames;
detecting, cutting and aligning the video frame with the face to obtain a high-quality face image, and obtaining a sample set to be detected;
and feeding the sample set to be tested into a trained attention-mechanism deep convolutional neural network model to obtain an output result that indicates the authenticity of the video.
In this scheme, the method further comprises performing face detection, cropping and alignment on the video frames using the dlib package in Python, and screening to obtain high-quality face images, yielding the sample set to be tested.
In this scheme, the training method of the attention mechanism deep convolutional neural network model includes:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images to multiple scales so as to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5×5;
warping the smoothed face region image back to the size of the original image through an affine transformation, thereby simulating the artefacts of a forged face;
obtaining a negative sample image by changing the shape of the affine-warped face region;
and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the model.
In this scheme, cropping a preset region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model is specifically:
after the positive sample image and the negative sample image are obtained, cutting the positive sample image and the negative sample image;
the RoI region is determined using the face landmarks technique, with coordinates

RoI = [y0 - ŷ0, x0 - x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the smallest rectangle surrounding the entire face, and the variables ŷ0, ŷ1 and x̂0, x̂1 take random values in [0, h/5] and [0, w/8] respectively, h and w being the height and width of the face rectangle;
finally, the RoI is resized to 224×224 for CNN model training; each training RoI region is cropped 10 times, and all RoI predictions are averaged as the final fake-video probability.
In this scheme, establishing the attention-mechanism deep convolutional neural network model specifically comprises:
implementing an attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer; the layer consists of a convolution, a channel-wise multiplication and an activation operation;
inserting an attention-based layer in a CNN, selecting Resnet50 as a backbone network, inserting the attention-based layer between Stage2 and Stage3 of Resnet50, using a feature map output by Stage2 as an input of the attention-based layer, and using an output of the attention-based layer as an input of Stage 3.
In this scheme, implementing the attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer and the layer consists of a convolution, a channel-wise multiplication and an activation operation, is specifically as follows:
the input to the attention-based layer is a convolutional feature map F ∈ R^(B×H×W×C), where B is the batch size and H, W, C are the height, width and number of channels respectively; the layer is implemented with a Direct Regression method;
spatial attention is computed with an auxiliary convolutional layer: one channel is added to the preceding convolutional layer to generate a feature map F1 ∈ R^(B×H×W×(C+1)); F1 is split into F2 ∈ R^(B×H×W×1) and a feature map F3 ∈ R^(B×H×W×C); the attention map Matt ∈ R^(B×H×W×1) is obtained from the last channel of F1 through a Sigmoid operation; the feature map F3 is channel-wise multiplied with Matt, and the result is fed to the subsequent convolutional layer.
A third aspect of the present invention further provides a computer-readable storage medium, which stores a program of the method for identifying forged face videos based on a convolutional neural network and an attention mechanism; when the program is executed by a processor, the steps of the method described above are implemented.
By deriving negative samples from specially processed positive samples, the invention reduces the time cost of obtaining negative samples while closely simulating the face images of forged face videos, so that the trained network has good discrimination ability; in addition, the invention inserts an attention-based layer into the CNN model, which can highlight the manipulated image regions so as to guide the neural network to examine those regions, benefiting face-forgery detection and improving the accuracy of the original CNN model.
Drawings
FIG. 1 is a flow chart of a method for identifying a fake face video based on a convolutional neural network and an attention mechanism;
FIG. 2 is a schematic structural diagram of the neural network adopted by the forged-face-video identification method based on a convolutional neural network and an attention mechanism;
FIG. 3 is a schematic diagram of an attention-based layer;
FIG. 4 is a schematic diagram of a neural network training phase;
fig. 5 is a schematic diagram of the acquisition of a negative sample image.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a flow chart of a fake face video identification method based on a convolutional neural network and an attention mechanism.
As shown in fig. 1, the present invention provides a method for identifying a video of a counterfeit human face based on a convolutional neural network and an attention mechanism, comprising the following steps:
s102, sampling an input video sequence to obtain N video frames;
s104, detecting, cutting and aligning the video frame with the face to obtain a high-quality face image, and obtaining a sample set to be detected;
and S106, feeding the sample set to be tested into a trained attention-mechanism deep convolutional neural network model to obtain an output result that indicates the authenticity of the video.
It should be noted that the input video may be a video clip, and different video frames may be obtained by sampling the video clip, where the video frames include image information. It is worth mentioning that the deep convolution neural network model of the attention mechanism is a trained neural network model, and the model can be updated and iterated according to the judgment result every time, so that the accuracy is improved.
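As a concrete illustration of step S102, sampling N frames can be sketched as below. The uniform-spacing strategy and the function name `sample_frames` are illustrative assumptions; the patent only requires that N frames be obtained from the input sequence.

```python
def sample_frames(total_frames: int, n: int) -> list[int]:
    """Pick n frame indices spread evenly over a clip of total_frames frames.

    Uniform spacing is an assumption; the patent only states that N video
    frames are sampled from the input sequence.
    """
    if total_frames <= 0 or n <= 0:
        return []
    if n >= total_frames:
        return list(range(total_frames))
    step = total_frames / n
    # take the middle frame of each of the n equal segments
    return [int(step * (i + 0.5)) for i in range(n)]
```

The selected indices can then be read from the video container and passed on to face detection.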
According to the embodiment of the invention, the method further comprises performing face detection, cropping and alignment on the video frames using the dlib package in Python, and screening to obtain high-quality face images, yielding the sample set to be tested.
It should be noted that Python is a cross-platform, high-level programming language that is interpreted, interactive and object-oriented. dlib is a C++ toolkit containing machine learning algorithms, image processing, networking and various utility libraries; here the dlib toolkit is used specifically to detect, crop and align the faces in the video frames.
It should be noted that the screening may be manual screening or machine screening.
According to the embodiment of the invention, the training method of the attention mechanism deep convolutional neural network model comprises the following steps:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images to multiple scales so as to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5×5;
warping the smoothed face region image back to the size of the original image through an affine transformation, thereby simulating the artefacts of a forged face;
obtaining a negative sample image by changing the shape of the affine-warped face region;
and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the model.
It should be noted that, specifically, removing poor-quality images means: extracting the feature information of each image, computing the difference between that feature information and reference feature information, and judging the image to be of poor quality if the difference exceeds a preset threshold.
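The screening step above compares an image's feature information with a reference and discards the image when the difference exceeds a threshold. The patent does not specify which feature is used; the sketch below assumes a simple sharpness feature (variance of a Laplacian response), so both the metric and the threshold value are illustrative stand-ins.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of a 4-neighbour Laplacian response.

    Using sharpness as the screened "feature information" is an assumption;
    the patent only requires a feature/threshold comparison.
    """
    g = gray.astype(np.float64)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def passes_quality_screen(gray: np.ndarray, threshold: float = 10.0) -> bool:
    # keep the image only if it is sharp enough (threshold is illustrative)
    return laplacian_variance(gray) >= threshold
```

A blurred or featureless crop scores near zero and is dropped; a well-focused face crop scores far above any reasonable threshold.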
According to the embodiment of the present invention, cropping a preset region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model is specifically:
after the positive sample image and the negative sample image are obtained, cutting the positive sample image and the negative sample image;
the RoI region is determined using the face landmarks technique, with coordinates

RoI = [y0 - ŷ0, x0 - x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the smallest rectangle surrounding the entire face, and the variables ŷ0, ŷ1 and x̂0, x̂1 take random values in [0, h/5] and [0, w/8] respectively, h and w being the height and width of the face rectangle;
finally, the RoI is resized to 224×224 for CNN model training; each training RoI region is cropped 10 times, and all RoI predictions are averaged as the final fake-video probability.
According to the embodiment of the invention, establishing the attention-mechanism deep convolutional neural network model specifically comprises the following steps:
implementing an attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer; the layer consists of a convolution, a channel-wise multiplication and an activation operation;
inserting an attention-based layer in a CNN, selecting Resnet50 as a backbone network, inserting the attention-based layer between Stage2 and Stage3 of Resnet50, using a feature map output by Stage2 as an input of the attention-based layer, and using an output of the attention-based layer as an input of Stage 3.
According to the embodiment of the invention, implementing the attention-based layer with a direct regression method, in which a spatial attention map is computed using an auxiliary convolutional layer and the layer consists of a convolution, a channel-wise multiplication and an activation operation, is specifically as follows:
the input to the attention-based layer is a convolutional feature map F ∈ R^(B×H×W×C), where B is the batch size and H, W, C are the height, width and number of channels respectively; the layer is implemented with a Direct Regression method;
spatial attention is computed with an auxiliary convolutional layer: one channel is added to the preceding convolutional layer to generate a feature map F1 ∈ R^(B×H×W×(C+1)); F1 is split into F2 ∈ R^(B×H×W×1) and a feature map F3 ∈ R^(B×H×W×C); the attention map Matt ∈ R^(B×H×W×1) is obtained from the last channel of F1 through a Sigmoid operation; the feature map F3 is channel-wise multiplied with Matt, and the result is fed to the subsequent convolutional layer.
It should be noted that, when determining whether a video is a false video, the method further includes:
ranking the analysed video frames by their probability of being fake;
taking the fake-probability values of the frames within a preset threshold range;
computing the average of those fake-probability values;
and judging the video to be fake if the average exceeds a preset probability threshold.
When the answer is output, the video is analysed with the frame as the unit of analysis: the average fake probability of the top 1/3 of frames most likely to be fake is taken as the probability that the whole video is fake, and if this average is greater than 1/2, the video is reported as fake.
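The frame-aggregation rule above can be sketched as a small helper. Treating `len // 3` (with a minimum of one frame) as the "top 1/3" is an assumption about rounding, which the patent does not specify.

```python
def video_is_fake(frame_fake_probs: list[float]) -> bool:
    """Decision rule described above: average the top third of per-frame
    fake probabilities and compare against 1/2."""
    if not frame_fake_probs:
        raise ValueError("need at least one frame probability")
    ranked = sorted(frame_fake_probs, reverse=True)
    k = max(1, len(ranked) // 3)      # top 1/3 most-likely-fake frames
    top_mean = sum(ranked[:k]) / k    # video-level fake probability
    return top_mean > 0.5
```

Using only the most suspicious third makes the decision robust to frames where the manipulated face is occluded or absent.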
In order to better explain the using method and technical content of the invention, the following will explain the key steps of the invention.
Fig. 2 is a schematic structural diagram of a neural network adopted by a fake face video identification method based on a convolutional neural network and an attention mechanism.
As shown in FIG. 2, an attention-based layer is inserted into the backbone network. Resnet50 was chosen as the backbone network, and an attention-based layer was inserted between stage2 and stage 3. The feature map output by Stage2 is used as the input to an attention-based layer, and the output of the attention-based layer is used as the input to Stage 3.
FIG. 3 is a schematic diagram of an attention-based layer. The input to the attention-based layer is a convolutional feature map F ∈ R^(B×H×W×C), where B is the batch size and H, W, C are the height, width and number of channels respectively. We propose a Direct Regression method to implement the attention-based layer.
Direct Regression: one implementation is to compute spatial attention using an auxiliary convolution. A channel is added to the previous convolutional layer to generate a feature map F1 ∈ R^(B×H×W×(C+1)). F1 can be split into F2 ∈ R^(B×H×W×1) and a feature map F3 ∈ R^(B×H×W×C), where the attention map Matt ∈ R^(B×H×W×1) is obtained from the last channel of F1 through a Sigmoid operation. The feature map F3 is channel-wise multiplied with Matt, and the result is fed to the subsequent convolutional layer.
The attention-based layer can process the feature maps of the classifier model, and the learned attention map highlights the image regions that influence the CNN's decision, which can further be used to guide the CNN to find more discriminative features.
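A minimal NumPy sketch of the direct-regression step described above, assuming the auxiliary convolution has already produced F1 with one extra channel; only the split, the Sigmoid and the channel-wise multiplication are shown, and all shapes are illustrative.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def attention_layer(f1: np.ndarray):
    """Direct-regression attention on an F1 of shape (B, H, W, C+1).

    The convolutions themselves are omitted for brevity; this shows only
    the split of F1, the Sigmoid, and the channel-wise multiplication.
    Returns the attended features (B, H, W, C) and Matt (B, H, W, 1).
    """
    f3 = f1[..., :-1]     # F3: the first C channels
    f2 = f1[..., -1:]     # F2: the extra (last) channel
    matt = sigmoid(f2)    # Matt = Sigmoid(last channel of F1)
    attended = f3 * matt  # channel-wise multiplication, broadcast over C
    return attended, matt

# illustrative shapes: B=2, H=W=4, C=3 (so F1 has C+1=4 channels)
rng = np.random.default_rng(1)
f1 = rng.standard_normal((2, 4, 4, 4))
out, matt = attention_layer(f1)
```

In the full network this output would be passed as the input of Stage3 of the ResNet50 backbone.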
Fig. 4 is a schematic diagram of a neural network training phase. The training phase includes obtaining a data set, tailoring samples, training a network.
Obtaining the data set mainly involves the acquisition of positive and negative samples. The positive samples are collected directly from the Internet, and the negative samples are obtained by applying special processing to the positive samples.
After the positive and negative samples are obtained, the samples are cropped to obtain the RoI region: face landmarks are used to determine the RoI. The coordinates used are:
[y0 − ŷ0, x0 − x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 define the minimum rectangle enclosing the entire face, and the random offset variables ŷ0, ŷ1 and x̂0, x̂1 take values in [0, h/5] and [0, w/8], respectively. The final RoI is resized to 224 × 224 for CNN model training. The RoI of each training example is cropped 10 times, and the predicted values of all RoIs are averaged as the final fake-video probability.
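The random RoI enlargement described above can be illustrated in plain Python (a sketch; the landmark rectangle y0, x0, y1, x1, the frame size h × w, and the clamping to image bounds are assumptions following the text):

```python
import random

def make_roi(y0, x0, y1, x1, h, w, rng=random):
    """Enlarge the face landmark rectangle by random offsets drawn from
    [0, h/5] (vertical) and [0, w/8] (horizontal), clamped to the frame."""
    dy0 = rng.uniform(0, h / 5)
    dy1 = rng.uniform(0, h / 5)
    dx0 = rng.uniform(0, w / 8)
    dx1 = rng.uniform(0, w / 8)
    return (max(0, y0 - dy0), max(0, x0 - dx0),
            min(h, y1 + dy1), min(w, x1 + dx1))

# crop one training example 10 times: a face box (200,150)-(300,270) in a 500x400 frame
rois = [make_roi(200, 150, 300, 270, 500, 400) for _ in range(10)]
print(len(rois))  # 10
```

Each of the 10 crops would then be resized to 224 × 224 and scored, with the predictions averaged.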
Fig. 5 is a schematic diagram of negative sample image acquisition. Face detection and face region extraction are performed on the collected original image (positive sample image) (a) using the dlib tool; the extracted face is aligned into multiple scales to obtain (b); one scale is selected at random and smoothed with a Gaussian blur of kernel size 5 × 5 to obtain (c); the smoothed face is affine-warped back to the size of the original image, giving (d); by changing the shape of the affine-warped face region, different post-processing steps of fake face (deepfake) videos are simulated, namely a rectangle (e) and a convex polygon covering the eyebrows and the bottom of the mouth (f).
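The 5 × 5 Gaussian smoothing step in (c) can be sketched in NumPy (an illustration only; the detection, alignment and affine-warp steps, which the patent performs with dlib, are omitted, and the sigma value is an assumption):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def blur(gray, kernel):
    """Naive 2-D convolution of a single-channel image with edge padding."""
    size = kernel.shape[0]
    pad = size // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=float)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            out[i, j] = (padded[i:i + size, j:j + size] * kernel).sum()
    return out

img = np.zeros((8, 8))
img[4, 4] = 1.0                       # a single bright pixel
smooth = blur(img, gaussian_kernel())  # the impulse spreads into a 5x5 blob
```

In practice this would be done with an optimized library call; the loop form just makes the operation explicit.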
The RoIs obtained from the positive and negative samples are taken as input, the neural network is trained with supervision, and the weights are optimized continuously until optimal, completing the training phase.
An input video is converted into a frame sequence, faces are detected and cropped, the cropped faces are input into the trained neural network, and an answer is output.
Detecting and cropping the face: the face is detected and cropped using the dlib package in Python. Outputting an answer: for a video, with the frame as the analysis unit, the average of the top 1/3 of frames with the highest fake probability is taken as the probability that the whole video is fake, and if this average exceeds 1/2, the video is output as fake.
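The output rule can be written as a short Python function (a sketch of the top-1/3 averaging described above; `frame_probs` is a hypothetical list of per-frame fake probabilities from the network):

```python
def video_is_fake(frame_probs, top_frac=1/3, threshold=0.5):
    """Average the top `top_frac` of per-frame fake probabilities and
    compare against the threshold, as in the patent's output step."""
    k = max(1, int(len(frame_probs) * top_frac))
    top = sorted(frame_probs, reverse=True)[:k]
    score = sum(top) / k
    return score > threshold, score

fake, score = video_is_fake([0.9, 0.8, 0.7, 0.2, 0.1, 0.05])
print(fake, round(score, 2))  # True 0.85
```

Here 6 frames give k = 2, so only the two highest probabilities (0.9 and 0.8) contribute to the video-level score.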
The invention provides a fake face video identification system based on a convolutional neural network and an attention mechanism, comprising a memory and a processor, wherein the memory stores a program of the fake face video identification method based on the convolutional neural network and the attention mechanism, and when the program is executed by the processor, the following steps are implemented:
sampling an input video sequence to obtain N video frames;
performing face detection, cropping and alignment on the video frames to obtain high-quality face images, yielding a sample set to be detected;

and inputting the sample set to be detected into the trained attention-mechanism deep convolutional neural network model to obtain an output result, the result being a judgment of the authenticity of the video.
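The frame-sampling step can be sketched as follows (a minimal illustration; the uniform-spacing strategy and the value of N are assumptions, since the patent does not fix a sampling scheme):

```python
def sample_frame_indices(total_frames, n):
    """Pick n frame indices spread evenly across the video."""
    if total_frames <= n:
        return list(range(total_frames))
    step = total_frames / n
    return [int(i * step) for i in range(n)]

print(sample_frame_indices(100, 5))  # [0, 20, 40, 60, 80]
```

The selected frames would then go through face detection, cropping and alignment before being fed to the model.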
It should be noted that the input video may be a video clip, and different video frames may be obtained by sampling it, where the video frames contain image information. It is worth mentioning that the attention-mechanism deep convolutional neural network model is a trained model that can be updated and iterated after each judgment, improving its accuracy.
According to the embodiment of the invention, the method further comprises performing face detection, cropping and alignment on the video frames using the dlib package in Python, and screening to obtain high-quality face images, yielding the sample set to be detected.
It should be noted that Python is a cross-platform computer programming language: a high-level scripting language combining interpreted, compiled, interactive and object-oriented capabilities. dlib is a C++ toolkit containing machine learning algorithms, image processing, networking and various utility libraries; here the dlib toolkit is used to detect, crop and align faces in the video frames.
It should be noted that the screening may be manual screening or machine screening.
According to the embodiment of the invention, the training method of the attention mechanism deep convolutional neural network model comprises the following steps:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images into multiple scales to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5 × 5;

warping the smoothed face region image back to the size of the original image by an affine transformation, thereby simulating the fake image of a forged face;

obtaining a negative sample image by changing the shape of the affine-warped face region;

and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the attention-mechanism deep convolutional neural network model.
It should be noted that, specifically, removing poor-quality images means acquiring the feature information of each image, computing the difference between this feature information and standard feature information, and judging an image to be of poor quality if the difference exceeds a preset threshold.
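The screening rule can be expressed as a small helper (a sketch only; the feature extraction itself is application-specific, so a Euclidean distance between hypothetical feature vectors stands in for the "difference" in the text):

```python
import math

def is_poor_quality(features, standard_features, threshold):
    """Judge an image to be of poor quality if its feature vector differs
    from the standard feature vector by more than a preset threshold."""
    diff = math.dist(features, standard_features)  # Euclidean distance
    return diff > threshold

print(is_poor_quality([1.0, 2.0], [1.0, 2.5], threshold=0.4))  # True
```

Any scalar image-quality measure (e.g. sharpness or detection confidence) could play the same role as the feature distance here.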
According to the embodiment of the present invention, cropping the preset region multiple times from the positive sample image and the negative sample image to obtain the input of the attention-mechanism deep convolutional neural network model is specifically:

after the positive sample image and the negative sample image are obtained, cropping the positive sample image and the negative sample image;

the RoI region is determined using the face landmarks technique, with the coordinates:
[y0 − ŷ0, x0 − x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the minimum rectangle enclosing the entire face, and the random offset variables ŷ0, ŷ1 and x̂0, x̂1 take values in [0, h/5] and [0, w/8], respectively;
finally, the RoI is resized to 224 × 224 for CNN model training; the RoI region of each training example is cropped 10 times, and the predicted values of all RoIs are averaged as the final fake-video probability.
According to the embodiment of the invention, the establishing of the depth convolution neural network model of the attention mechanism specifically comprises the following steps:
an attention-based layer is implemented using a direct regression method, with spatial attention computed by an auxiliary convolutional layer; the layer consists of a convolutional layer, channel multiplication and an activation operation;
inserting an attention-based layer in a CNN, selecting Resnet50 as a backbone network, inserting the attention-based layer between Stage2 and Stage3 of Resnet50, using a feature map output by Stage2 as an input of the attention-based layer, and using an output of the attention-based layer as an input of Stage 3.
According to the embodiment of the invention, the attention-based layer is implemented using a direct regression method, spatial attention is computed with an auxiliary convolutional layer, and the layer consists of a convolutional layer, channel multiplication and an activation operation, specifically:
the input to the attention-based layer is a convolutional feature map (feature map) F ∈ R(B×H×W×C)Where B is the batch size and H, W, C is the height, width, and number of passes, respectively; the method for realizing the layer based on attention adopts a Direct Regression (Direct Regression) method;
computing spatial attention by using auxiliary convolutional layers, adding a channel to the previous convolutional layer to generate a feature map F1 ∈ R(B×H×W×(C+1))F1 is divided into F2 ∈ R(B×H×W×1)And the characteristic diagram F3 ∈ R(B×H×W×C)(ii) a Wherein Matt ∈ R(B ×H×W×1)Is the most abundant by F1The latter channel is obtained through Sigmoid operation; the signature F3 is channel multiplied with Matt and the result is fed to the subsequent convolutional layer.
It should be noted that, when determining whether a video is a false video, the method further includes:
sorting the analyzed video frames by their probability of being fake;

obtaining the fake probability values of the video frames within a preset threshold range;

calculating the average of these probability values;

and if the average exceeds a preset probability threshold, judging the video to be fake.
When the answer is output, the video is analyzed frame by frame: the average of the top 1/3 of frames with the highest probability of being fake is taken as the probability that the whole video is fake, and if this average exceeds 1/2, the video is output as fake.
The third aspect of the present invention further provides a computer-readable storage medium, which includes a program of the fake face video identification method based on a convolutional neural network and an attention mechanism; when the program is executed by a processor, the steps of the fake face video identification method based on a convolutional neural network and an attention mechanism described above are implemented.
According to the invention, the negative samples are obtained by specially processing the positive samples, which reduces the time cost of acquiring negative samples while closely simulating the face images of fake face videos, so that the trained network has good identification capability. In addition, the invention adds an attention-based layer to the CNN model; this layer can highlight the manipulated image regions and guide the neural network to examine those regions, which benefits face forgery detection and improves the accuracy of the original CNN model.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Claims (10)

1. A fake face video identification method based on a convolutional neural network and an attention mechanism is characterized by comprising the following steps:
sampling an input video sequence to obtain N video frames;
performing face detection, cropping and alignment on the video frames to obtain high-quality face images, yielding a sample set to be detected;

and inputting the sample set to be detected into the trained attention-mechanism deep convolutional neural network model to obtain an output result, the result being a judgment of the authenticity of the video.
2. A method for identifying a counterfeit face video based on a convolutional neural network and an attention mechanism as claimed in claim 1, wherein:
performing face detection, cropping and alignment on the video frames using the dlib package in Python, and screening to obtain high-quality face images, yielding the sample set to be detected.
3. The method for identifying the video of the forged face based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the training method of the deep convolutional neural network model of the attention mechanism is as follows:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images into multiple scales to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5 × 5;

warping the smoothed face region image back to the size of the original image by an affine transformation, thereby simulating the fake image of a forged face;

obtaining a negative sample image by changing the shape of the affine-warped face region;

and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the attention-mechanism deep convolutional neural network model.
4. The method for identifying a forged face video based on a convolutional neural network and an attention mechanism as claimed in claim 3, wherein cropping a preset region multiple times from the positive sample image and the negative sample image to obtain the input of the attention-mechanism deep convolutional neural network model is specifically:

after the positive sample image and the negative sample image are obtained, cropping the positive sample image and the negative sample image;

the RoI region is determined using the face landmarks technique, with the coordinates:
[y0 − ŷ0, x0 − x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the minimum rectangle enclosing the entire face, and the random offset variables ŷ0, ŷ1 and x̂0, x̂1 take values in [0, h/5] and [0, w/8], respectively;
finally, the RoI is resized to 224 × 224 for CNN model training; the RoI region of each training example is cropped 10 times, and the predicted values of all RoIs are averaged as the final fake-video probability.
5. The method for identifying the video of the forged face based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the establishing of the deep convolutional neural network model of the attention mechanism specifically comprises:
implementing an attention-based layer using a direct regression method, with spatial attention computed by an auxiliary convolutional layer; the layer consists of a convolutional layer, channel multiplication and an activation operation;
inserting an attention-based layer in a CNN, selecting Resnet50 as a backbone network, inserting the attention-based layer between Stage2 and Stage3 of Resnet50, using a feature map output by Stage2 as an input of the attention-based layer, and using an output of the attention-based layer as an input of Stage 3.
6. The method for identifying a forged face video based on a convolutional neural network and an attention mechanism as claimed in claim 5, wherein the attention-based layer is implemented using a direct regression method and the spatial attention is computed with an auxiliary convolutional layer, specifically comprising the following steps:
the input to the attention-based layer is a convolutional feature map F ∈ R^(B×H×W×C), where B is the batch size and H, W, C are the height, width and number of channels, respectively; the attention-based layer is implemented with the Direct Regression method;

spatial attention is computed with an auxiliary convolutional layer: a channel is added to the preceding convolutional layer to generate a feature map F1 ∈ R^(B×H×W×(C+1)); F1 is split into F2 ∈ R^(B×H×W×1) and the feature map F3 ∈ R^(B×H×W×C); the attention map M_att ∈ R^(B×H×W×1) is obtained by applying a Sigmoid operation to the last channel of F1; the feature map F3 is multiplied channel-wise with M_att, and the result is fed to the subsequent convolutional layer.
7. A fake face video identification system based on a convolutional neural network and an attention mechanism is characterized by comprising a memory and a processor, wherein the memory comprises a fake face video identification method program based on the convolutional neural network and the attention mechanism, and the fake face video identification method program based on the convolutional neural network and the attention mechanism realizes the following steps when being executed by the processor:
sampling an input video sequence to obtain N video frames;
performing face detection, cropping and alignment on the video frames to obtain high-quality face images, yielding a sample set to be detected;

and inputting the sample set to be detected into the trained attention-mechanism deep convolutional neural network model to obtain an output result, the result being a judgment of the authenticity of the video.
8. The system according to claim 7, wherein the training method of the deep convolutional neural network model of the attention mechanism comprises:
collecting a real face image, and labeling the real face image as a positive sample image by using a label;
carrying out face detection and facial area extraction on the collected positive sample image by using a dlib tool to obtain a facial area image;
screening the obtained face region image to remove poor-quality images;
aligning the screened face region images into multiple scales to obtain more resolution instances; then randomly selecting one scale and smoothing the face region image at that scale with a Gaussian blur of kernel size 5 × 5;

warping the smoothed face region image back to the size of the original image by an affine transformation, thereby simulating the fake image of a forged face;

obtaining a negative sample image by changing the shape of the affine-warped face region;

and cropping a preset RoI region multiple times from the positive and negative sample images to obtain the input of the attention-mechanism deep convolutional neural network model, and training the attention-mechanism deep convolutional neural network model.
9. The system according to claim 8, wherein the predetermined region is cut out multiple times in the positive sample image and the negative sample image, and the input of the attention-based deep convolutional neural network model is specifically:
after the positive sample image and the negative sample image are obtained, cutting the positive sample image and the negative sample image;
the RoI region is determined using the face landmarks technique, with the coordinates:
[y0 − ŷ0, x0 − x̂0, y1 + ŷ1, x1 + x̂1]

wherein y0, x0, y1, x1 determine the minimum rectangle enclosing the entire face, and the random offset variables ŷ0, ŷ1 and x̂0, x̂1 take values in [0, h/5] and [0, w/8], respectively;
finally, the RoI is resized to 224 × 224 for CNN model training; the RoI region of each training example is cropped 10 times, and the predicted values of all RoIs are averaged as the final fake-video probability.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a program of the fake face video identification method based on a convolutional neural network and an attention mechanism, and when the program of the fake face video identification method based on a convolutional neural network and an attention mechanism is executed by a processor, the steps of the fake face video identification method based on a convolutional neural network and an attention mechanism according to any one of claims 1 to 7 are implemented.
CN202010882723.4A 2020-08-28 2020-08-28 Fake face video identification method, system and readable storage medium Pending CN111967427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010882723.4A CN111967427A (en) 2020-08-28 2020-08-28 Fake face video identification method, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010882723.4A CN111967427A (en) 2020-08-28 2020-08-28 Fake face video identification method, system and readable storage medium

Publications (1)

Publication Number Publication Date
CN111967427A true CN111967427A (en) 2020-11-20

Family

ID=73400474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010882723.4A Pending CN111967427A (en) 2020-08-28 2020-08-28 Fake face video identification method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN111967427A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329730A (en) * 2020-11-27 2021-02-05 上海商汤智能科技有限公司 Video detection method, device, equipment and computer readable storage medium
CN112653886A (en) * 2020-12-14 2021-04-13 华中科技大学 Monitoring video stream forgery detection method and positioning method based on wireless signals
CN112651319A (en) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Video detection method and device, electronic equipment and storage medium
CN112861671A (en) * 2021-01-27 2021-05-28 电子科技大学 Method for identifying deeply forged face image and video
CN113014914A (en) * 2021-01-27 2021-06-22 北京市博汇科技股份有限公司 Neural network-based single face-changing short video identification method and system
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113077265A (en) * 2020-12-08 2021-07-06 泰州市朗嘉馨网络科技有限公司 Live client credit management system
CN113111966A (en) * 2021-04-29 2021-07-13 北京九章云极科技有限公司 Image processing method and image processing system
CN113158818A (en) * 2021-03-29 2021-07-23 青岛海尔科技有限公司 Method, device and equipment for identifying fake video
CN113283393A (en) * 2021-06-28 2021-08-20 南京信息工程大学 Method for detecting Deepfake video based on image group and two-stream network
CN113283403A (en) * 2021-07-21 2021-08-20 武汉大学 Counterfeited face video detection method based on counterstudy
CN113536990A (en) * 2021-06-29 2021-10-22 复旦大学 Deep fake face data identification method
CN113609952A (en) * 2021-07-30 2021-11-05 中国人民解放军战略支援部队信息工程大学 Deep-forgery video frequency domain detection method based on dense convolutional neural network
CN113723220A (en) * 2021-08-11 2021-11-30 电子科技大学 Deep counterfeiting traceability system based on big data federated learning architecture
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium
CN114596609A (en) * 2022-01-19 2022-06-07 中国科学院自动化研究所 Audio-visual counterfeit detection method and device
CN114694220A (en) * 2022-03-25 2022-07-01 上海大学 Double-flow face counterfeiting detection method based on Swin transform
CN115953822A (en) * 2023-03-06 2023-04-11 之江实验室 Face video false distinguishing method and device based on rPPG physiological signal
CN116486464A (en) * 2023-06-20 2023-07-25 齐鲁工业大学(山东省科学院) Attention mechanism-based face counterfeiting detection method for convolution countermeasure network

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107480640A (en) * 2017-08-16 2017-12-15 上海荷福人工智能科技(集团)有限公司 A kind of face alignment method based on two-value convolutional neural networks
CN109543606A (en) * 2018-11-22 2019-03-29 中山大学 A kind of face identification method that attention mechanism is added
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111444881A (en) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 Fake face video detection method and device
CN111460931A (en) * 2020-03-17 2020-07-28 华南理工大学 Face spoofing detection method and system based on color channel difference image characteristics

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN107480640A (en) * 2017-08-16 2017-12-15 上海荷福人工智能科技(集团)有限公司 A kind of face alignment method based on two-value convolutional neural networks
CN109543606A (en) * 2018-11-22 2019-03-29 中山大学 A kind of face identification method that attention mechanism is added
CN111241958A (en) * 2020-01-06 2020-06-05 电子科技大学 Video image identification method based on residual error-capsule network
CN111460931A (en) * 2020-03-17 2020-07-28 华南理工大学 Face spoofing detection method and system based on color channel difference image characteristics
CN111444881A (en) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 Fake face video detection method and device

Non-Patent Citations (4)

Title
JOEL STEHOUWER, et al.: "On the Detection of Digital Face Manipulation", arXiv, 3 October 2019 (2019-10-03), pages 4-5 *
YUEZUN LI, et al.: "Exposing DeepFake Videos By Detecting Face Warping Artifacts", arXiv, 22 May 2019 (2019-05-22), pages 3-4 *
YUEZUN LI, et al.: "Exposing DeepFake Videos By Detecting Face Warping Artifacts", arXiv, 22 May 2019 (2019-05-22), pages 3-4 *
CHEN PENG, et al.: "Fake face video detection method fusing global temporal and local spatial features", Journal of Cyber Security (信息安全学报), vol. 05, no. 02, 13 March 2020 (2020-03-13), pages 76-79 *

Cited By (32)

Publication number Priority date Publication date Assignee Title
WO2022110806A1 (en) * 2020-11-27 2022-06-02 上海商汤智能科技有限公司 Video detection method and apparatus, device, and computer-readable storage medium
CN112329730A (en) * 2020-11-27 2021-02-05 上海商汤智能科技有限公司 Video detection method, device, equipment and computer readable storage medium
CN113077265B (en) * 2020-12-08 2021-11-30 鑫绪(上海)信息技术服务有限公司 Live client credit management system
CN113077265A (en) * 2020-12-08 2021-07-06 泰州市朗嘉馨网络科技有限公司 Live client credit management system
CN112653886A (en) * 2020-12-14 2021-04-13 华中科技大学 Monitoring video stream forgery detection method and positioning method based on wireless signals
CN112651319A (en) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Video detection method and device, electronic equipment and storage medium
CN112651319B (en) * 2020-12-21 2023-12-05 科大讯飞股份有限公司 Video detection method and device, electronic equipment and storage medium
CN112861671A (en) * 2021-01-27 2021-05-28 电子科技大学 Method for identifying deeply forged face image and video
CN113014914A (en) * 2021-01-27 2021-06-22 北京市博汇科技股份有限公司 Neural network-based single face-changing short video identification method and system
CN113014914B (en) * 2021-01-27 2022-11-01 北京市博汇科技股份有限公司 Neural network-based single face-changing short video identification method and system
CN112861671B (en) * 2021-01-27 2022-10-21 电子科技大学 Method for identifying deeply forged face image and video
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113158818A (en) * 2021-03-29 2021-07-23 青岛海尔科技有限公司 Method, device and equipment for identifying fake video
CN113111966A (en) * 2021-04-29 2021-07-13 北京九章云极科技有限公司 Image processing method and image processing system
CN113283393A (en) * 2021-06-28 2021-08-20 南京信息工程大学 Method for detecting Deepfake video based on image group and two-stream network
CN113283393B (en) * 2021-06-28 2023-07-25 南京信息工程大学 Deepfake video detection method based on image group and two-stream network
CN113536990A (en) * 2021-06-29 2021-10-22 复旦大学 Deep fake face data identification method
CN113283403B (en) * 2021-07-21 2021-11-02 武汉大学 Counterfeited face video detection method based on counterstudy
CN113283403A (en) * 2021-07-21 2021-08-20 武汉大学 Counterfeited face video detection method based on counterstudy
CN113609952A (en) * 2021-07-30 2021-11-05 中国人民解放军战略支援部队信息工程大学 Deep-forgery video frequency domain detection method based on dense convolutional neural network
CN113609952B (en) * 2021-07-30 2023-08-15 中国人民解放军战略支援部队信息工程大学 Depth fake video frequency domain detection method based on dense convolutional neural network
CN113723220A (en) * 2021-08-11 2021-11-30 电子科技大学 Deep counterfeiting traceability system based on big data federated learning architecture
CN113723220B (en) * 2021-08-11 2023-08-25 电子科技大学 Deep counterfeiting traceability system based on big data federation learning architecture
CN113762138A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method and device for identifying forged face picture, computer equipment and storage medium
CN113762138B (en) * 2021-09-02 2024-04-23 恒安嘉新(北京)科技股份公司 Identification method, device, computer equipment and storage medium for fake face pictures
CN114596609B (en) * 2022-01-19 2023-05-09 中国科学院自动化研究所 Audio-visual falsification detection method and device
CN114596609A (en) * 2022-01-19 2022-06-07 中国科学院自动化研究所 Audio-visual counterfeit detection method and device
CN114694220A (en) * 2022-03-25 2022-07-01 上海大学 Two-stream face forgery detection method based on Swin Transformer
CN115953822B (en) * 2023-03-06 2023-07-11 之江实验室 Face video forgery identification method and device based on rPPG physiological signals
CN115953822A (en) * 2023-03-06 2023-04-11 之江实验室 Face video forgery identification method and device based on rPPG physiological signals
CN116486464A (en) * 2023-06-20 2023-07-25 齐鲁工业大学(山东省科学院) Attention mechanism-based face forgery detection method using a convolutional adversarial network
CN116486464B (en) * 2023-06-20 2023-09-01 齐鲁工业大学(山东省科学院) Attention mechanism-based face forgery detection method using a convolutional adversarial network

Similar Documents

Publication Publication Date Title
CN111967427A (en) Fake face video identification method, system and readable storage medium
EP1910977B1 (en) Automatic biometric identification based on face recognition and support vector machines
CN109165593B (en) Feature extraction and matching and template update for biometric authentication
He et al. Multi-patch convolution neural network for iris liveness detection
JP5010905B2 (en) Face recognition device
CN110543846A (en) Multi-pose face image frontalization method based on generative adversarial network
CN110569731A (en) face recognition method and device and electronic equipment
Ryu et al. Coarse-to-fine classification for image-based face detection
Anila et al. Simple and fast face detection system based on edges
US11670069B2 (en) System and method for face spoofing attack detection
CN111860400A (en) Face enhancement recognition method, device, equipment and storage medium
Nguyen et al. Deep visual saliency on stereoscopic images
Mosayyebi et al. Gender recognition in masked facial images using EfficientNet and transfer learning approach
Das et al. Face liveness detection based on frequency and micro-texture analysis
Zheng et al. Capturing micro deformations from pooling layers for offline signature verification
Hannan et al. Analysis of Detection and Recognition of Human Face Using Support Vector Machine
Sharifi Score-level-based face anti-spoofing system using handcrafted and deep learned characteristics
Rusia et al. A Color-Texture-Based Deep Neural Network Technique to Detect Face Spoofing Attacks
CN113553895A (en) Multi-pose face recognition method based on face frontalization
Chen Design and simulation of AI remote terminal user identity recognition system based on reinforcement learning
Goranin et al. Evolutionary Algorithms Application Analysis in Biometric Systems.
Blessy et al. Deep Learning Approach to Offline Signature Forgery Prevention
CN111353353A (en) Cross-pose face recognition method and device
CN115482595B (en) Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
Una et al. Classification technique for face-spoof detection in artificial neural networks using concepts of machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination