CN113011332A - Face counterfeiting detection method based on multi-region attention mechanism - Google Patents
- Publication number
- CN113011332A (publication number); application CN202110295565.7A
- Authority
- CN
- China
- Prior art keywords
- attention
- texture
- map
- region
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/40—Analysis of texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/464—Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/40—Spoof detection, e.g. liveness detection
Abstract
The invention discloses a face forgery detection method based on a multi-region attention mechanism, which comprises the following steps: inputting a face image to be detected into a convolutional neural network to obtain shallow, middle and deep feature maps; performing a texture enhancement operation on the shallow feature map to obtain a texture feature map; generating multi-region attention maps from the middle feature map through a multi-attention mechanism; performing attention pooling on the texture feature map with the multi-region attention maps to obtain local texture features, and performing attention pooling on the deep feature map with the sum of the attention maps to obtain global features; and fusing the global features with the local texture features before classifying to obtain the face forgery detection result. The method has multiple attention regions, each of which extracts mutually independent features, so that the network attends more closely to local texture information and the accuracy of the detection result is improved.
Description
Technical Field
The invention relates to the technical field of face forgery detection, in particular to a face forgery detection method based on a multi-region attention mechanism.
Background
Face forgery refers to tampering with face regions in media such as images or videos using computer technology, including identity replacement and expression editing; face forgery technology can be applied to post-production in film and television. With the rapid development of deep learning in the field of image generation, generative adversarial networks and auto-encoders have been applied to face forgery, producing forged face pictures or videos that are difficult for the human eye to distinguish, such as DeepFakes, FSGAN and FaceShifter. Many face forgery programs are available on the Internet, so that anyone can synthesize a fake video on a personal computer after simple learning, and such forged videos are now widespread online. If this technology is used for illegal activities such as spreading rumors or fabricating evidence, it can cause serious social harm. Therefore, research on forged face video detection has gained wide attention in academia. The mainstream face forgery detection pipeline currently uses a face detector based on a deep neural network followed by a forged-face classifier.
A neural network model can capture features that effectively distinguish real faces from forged faces by training on a large-scale forged-face dataset. However, a generic image classification network has limitations in the forged-face detection task, especially when detecting compressed video or forgery methods that do not appear in the training set. In computer vision, the attention mechanism is a broad concept; position-based soft attention multiplies each position in a feature map by a weight. However, such a scheme: 1) uses only deep features and ignores texture information; 2) has only one attention region and ignores local features. Therefore, the accuracy of the detection result still needs to be improved.
Disclosure of Invention
The invention aims to provide a face forgery detection method based on a multi-region attention mechanism, which has a plurality of attention regions, wherein each region can extract independent features to enable a network to pay more attention to local texture information, so that the accuracy of a detection result is improved.
The purpose of the invention is realized by the following technical scheme:
a face counterfeiting detection method based on a multi-region attention mechanism comprises the following steps:
inputting a human face image to be detected into a convolutional neural network to obtain a shallow layer characteristic image, a middle layer characteristic image and a deep layer characteristic image;
carrying out texture enhancement operation on the shallow feature map to obtain a texture feature map; generating a multi-region attention map for the intermediate layer characteristic map through a multi-attention mechanism; performing attention pooling on the texture feature map by using the multi-region attention maps to obtain local texture features, and performing attention pooling on the deep feature map after adding the multi-region attention maps to obtain global features; and after the global features and the local texture features are fused, classifying to obtain a face forgery detection result.
According to the technical scheme provided by the invention, on one hand, the method can be used together with the traditional convolutional neural network backbone to improve the accuracy rate in the human face forgery detection task. On the other hand, the input face images are classified by using the local texture features of different regions and the global deep features, so that the accuracy of the classification result can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a network overall structure diagram of a face forgery detection method based on a multi-region attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention generating module and an attention pooling module according to an embodiment of the present invention;
fig. 3 is a visualization example of each discriminant region obtained through weak supervised learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Compared with the traditional method, the method provided by the embodiment of the invention has a plurality of attention areas, and each area can extract independent features so that the network pays more attention to local texture information. As shown in fig. 1, the method mainly includes: inputting a human face image to be detected into a convolutional neural network to obtain a shallow layer characteristic image, a middle layer characteristic image and a deep layer characteristic image; carrying out texture enhancement operation on the shallow feature map to obtain a texture feature map; generating a multi-region attention map for the intermediate layer characteristic map through a multi-attention mechanism; performing attention pooling on the texture feature map by using the multi-region attention maps to obtain local texture features, and performing attention pooling on the deep feature map after adding the multi-region attention maps to obtain global features; and after the global features and the local texture features are fused, classifying to obtain a face forgery detection result.
The scheme provided by the embodiment of the invention can be used together with a conventional convolutional neural network backbone to improve the accuracy of the face forgery detection task, and classifies the input face images using the local texture features of different regions together with the global deep features. Compared with the prior art, the method achieves higher accuracy and transferability on various public Deepfake detection datasets.
For ease of understanding, the following detailed description is provided for the above-described aspects of embodiments of the present invention.
I. The overall network and its components.
Fig. 1 shows the entire network structure, mainly including: a backbone of the convolutional neural network, and a texture enhancement module, an attention generation module, an attention pooling module, and a fully-connected classifier. The main introduction is as follows:
1. a convolutional neural network.
The convolutional neural network can be a convolutional neural network used for a face forgery detection task in a traditional scheme, and a main part mainly refers to a feature extraction part. Shallow layer, middle layer and deep layer characteristic maps extracted from the main part of the convolutional neural network are correspondingly input into a texture enhancing module, an attention generating module and an attention pooling module.
It will be appreciated by those skilled in the art that convolutional neural networks are generally constructed as stacks of similar blocks, and "shallow", "middle" and "deep" are relative concepts, since the method of the present invention is not limited to a particular convolutional neural network backbone. As the feature extraction portion is multi-layered, the shallow, middle and deep layers increase gradually in depth: for example, the shallow layer may be the first or second block, the deep layer may be the last or second-to-last block, and the middle layer may be any block between them. In the specific example below, with EfficientNet as the backbone, the shallow layer refers to the second block, the middle layer to the fifth block, and the deep layer to the seventh (i.e., last) block.
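As a minimal illustration of tapping a staged backbone at several depths, the multi-level feature extraction can be sketched as follows. All names here are hypothetical, and simple pooling blocks stand in for real convolutional blocks such as those of EfficientNet:

```python
import numpy as np

def make_block(stride):
    """A stand-in 'block': 2x2 average pooling when stride == 2, identity
    otherwise. A real backbone uses learned convolutional blocks here."""
    def block(x):
        if stride == 2:
            c, h, w = x.shape
            return x[:, : h - h % 2, : w - w % 2].reshape(
                c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
        return x
    return block

def backbone_with_taps(x, blocks, tap_indices):
    """Run x through the blocks sequentially, recording the output feature
    maps at the requested block indices (shallow / middle / deep taps)."""
    taps = {}
    for i, block in enumerate(blocks):
        x = block(x)
        if i in tap_indices:
            taps[i] = x
    return taps

# Seven blocks, mirroring the second / fifth / seventh EfficientNet blocks
# used in the experiments as the shallow / middle / deep taps.
blocks = [make_block(2), make_block(1), make_block(2), make_block(1),
          make_block(2), make_block(1), make_block(2)]
x = np.random.rand(3, 64, 64)
taps = backbone_with_taps(x, blocks, {1, 4, 6})
```

The shallow tap keeps the highest spatial resolution, which is why the texture enhancement module consumes it rather than a deeper map.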
2. Texture enhancement module.
The texture enhancement module is mainly used for enhancing the sensitivity of the feature map to the texture features of face forgery by extracting the residual error of the convolutional neural network feature map, and can extract and enhance texture information by using densely connected convolutional layers.
It was observed that slight artifacts caused by the counterfeiting method tend to remain in the texture information of the shallow features. Here, the texture information represents high frequency components of the shallow features, similar to residual information of RGB images. Therefore, shallow features should be of interest and enhanced. As shown in fig. 1, the texture enhancing module performs a texture enhancing operation on the shallow feature map, and the main steps include: carrying out local average pooling on the shallow feature map to obtain a non-texture feature map D; and inputting the residual error of the shallow characteristic diagram and the non-texture characteristic diagram D into the densely connected convolution layer to obtain a texture characteristic diagram.
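The two steps above (local average pooling to form the non-texture map D, then processing the residual) can be sketched as follows. This is a NumPy sketch in which an identity matrix stands in for the learned densely connected convolution layers, so `texture_enhance` and its internals are illustrative names, not the patented implementation:

```python
import numpy as np

def local_avg_pool(f, k=2):
    """Average-pool then upsample back to the input size: keeps only the
    low-frequency (non-texture) content of the feature map."""
    c, h, w = f.shape
    pooled = f.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))
    return pooled.repeat(k, axis=1).repeat(k, axis=2)

def texture_enhance(f_shallow):
    """Residual = shallow features minus their local average (the
    non-texture map D); a densely connected conv stack would then refine
    this residual. Here an identity 1x1 'conv' stands in for it."""
    d = local_avg_pool(f_shallow)          # non-texture feature map D
    residual = f_shallow - d               # high-frequency texture residual
    c = f_shallow.shape[0]
    w = np.eye(c)                          # stand-in for learned weights
    return np.einsum('oc,chw->ohw', w, residual)
```

On a locally constant input the residual vanishes, which matches the intuition that the module isolates high-frequency texture.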
Those skilled in the art will appreciate that "dense connection" is a term of art referring to the structure proposed by the DenseNet network, as distinguished from conventional convolutional networks: a conventional convolutional neural network processes its layers serially, whereas in a densely connected structure the input of each layer contains the outputs of all previous layers.
3. An attention generation module.
The difference between real and fake faces usually appears as different features in different facial regions and is not easily captured by a single attention structure. Therefore, the discrimination of the face forgery detection network can be improved by using multi-region attention instead of global average pooling.
As shown in fig. 2, multi-region attention maps are generated from the middle-layer feature map by the attention generation module through a multi-attention mechanism; the attention generation module comprises a convolution layer (specifically, a 1 × 1 convolution layer), a batch normalization layer and a nonlinear activation layer (ReLU) arranged in sequence. The middle-layer feature map passes through the attention generation module to obtain M region attention maps A of size H_t × W_t.
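A minimal sketch of this attention generation module follows, assuming random weights; the per-map normalization here is a simplified stand-in for a trained batch normalization layer, and the function name is illustrative:

```python
import numpy as np

def generate_attention_maps(f_mid, w, gamma, beta):
    """1x1 conv (a per-pixel linear map from C channels to M maps),
    followed by a normalization with learnable affine (gamma, beta)
    and a ReLU, producing M non-negative region attention maps."""
    a = np.einsum('mc,chw->mhw', w, f_mid)            # 1x1 convolution
    mean = a.mean(axis=(1, 2), keepdims=True)
    std = a.std(axis=(1, 2), keepdims=True) + 1e-5
    a = gamma[:, None, None] * (a - mean) / std + beta[:, None, None]
    return np.maximum(a, 0.0)                         # ReLU
```

The ReLU guarantees non-negative maps, which is what lets each map act as a soft spatial mask in the pooling step that follows.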
4. Attention pooling module.
In the embodiment of the invention, the attention pooling module uses a multi-region attention map to pool the texture feature map to obtain local texture features, and uses the multi-region attention map to add and then pool the deep feature map to obtain the global features.
In embodiments of the invention, Bilinear Attention Pooling (BAP) is used instead of global average pooling; BAP is applied to the shallow and deep feature maps to collect texture features from the shallow layer and to retain deeper semantic features. Specifically: if the resolution of the multi-region attention maps does not match that of the texture feature map, the attention maps are first mapped to the same resolution as the texture feature map; then the texture feature map is multiplied by each region attention map respectively to obtain a plurality of partial texture feature maps; finally, global average pooling is performed on all partial texture feature maps, followed by L2 normalization, to obtain the local texture features.
In the embodiment of the invention, a plurality of attention maps are provided, and generally, each attention map has high intensity in a specific area and low intensity in other areas; therefore, after multiplying each region attention map by the texture feature map, the obtained partial texture feature map is the feature map with only the texture information of the corresponding attention region reserved.
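The bilinear attention pooling described above (mask, global-average-pool, L2-normalize) and the global branch can be sketched as follows; both function names are illustrative:

```python
import numpy as np

def bilinear_attention_pool(texture, attn):
    """Bilinear attention pooling: mask the texture feature map with each
    region attention map, global-average-pool each masked map, then
    L2-normalize each resulting feature vector.
    texture: (C, H, W); attn: (M, H, W) -> returns (M, C)."""
    parts = np.einsum('mhw,chw->mchw', attn, texture)  # M partial maps
    local = parts.mean(axis=(2, 3))                    # GAP -> (M, C)
    norms = np.linalg.norm(local, axis=1, keepdims=True) + 1e-12
    return local / norms                               # L2-normalized

def global_feature(deep, attn_sum):
    """Global branch: weight the deep feature map by the summed attention
    maps (already resized to the deep map's resolution) and pool."""
    return (deep * attn_sum[None]).mean(axis=(1, 2))
```

Each row of the returned local-feature matrix corresponds to one attention region, so region features remain separable downstream.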
5. Fully connected classifier.
In the embodiment of the invention, a multilayer full-connection network is used for fusing the global characteristics and the local texture characteristics, classification is carried out based on the fused characteristics, and the output result shows that the human face image to be detected is a real image or a forged image.
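A minimal sketch of the fusion-and-classify step, assuming a single linear layer where the patent uses a multilayer fully connected network; weights and names are hypothetical:

```python
import numpy as np

def classify(global_feat, local_feats, w, b):
    """Fuse the global feature with the flattened per-region local texture
    features by concatenation, then apply a linear layer + softmax to get
    [p_real, p_fake]."""
    fused = np.concatenate([global_feat, local_feats.ravel()])
    logits = w @ fused + b
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()
```

The argmax of the two probabilities then gives the real/forged decision for the input face image.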
II. Loss function.
In the embodiment of the present invention, the loss function of the network model shown in fig. 1 during training includes two parts: cross entropy loss and region independence loss.
The cross entropy loss is a conventional loss, and can be realized by referring to a conventional technology, so that the detailed description is omitted. The following description will be made mainly in detail with respect to the region independence loss function.
The region independence loss is an auxiliary loss for training the attention generation module. Without label guidance, the attention generation module is prone to a degradation of network performance in which different attention maps tend to focus on the same area, which is detrimental to capturing rich local texture information. In addition, for different input pictures, it is desirable that each attention map locate a fixed semantic area, so as to reduce the randomness of the information captured by each attention map. To achieve the above goals, a region independence loss is proposed, which helps reduce the overlap between attention maps and maintain their consistency across different inputs. It is expressed as:

L_RIL = Σ_{i=1}^{B} Σ_{j=1}^{M} max(||V_i^j − c_t^j|| − m_in(y_i), 0) + Σ_{j≠k} max(m_out − ||c_t^j − c_t^k||, 0)

The first part of the region independence loss L_RIL is an intra-class loss, which pulls each feature V toward its feature center c; the second part is an inter-class loss, which repels the scattered feature centers from one another. Here V_i^j is the local feature of the j-th region of sample i; c_t^j and c_t^k respectively represent the feature centers of the j-th and k-th regions at the t-th update; m_in(y_i) represents the margin between a feature and its corresponding feature center, where y_i is the label (real or forged) of sample i and m_in(y_i) is taken as a whole, different labels having different margins; m_out is the margin between the feature centers; B is the batch size, and M is the number of multi-region attention maps, each region attention map focusing on one region.
the characteristic center c is the sliding average value of the characteristic V, and the updating formula is as follows:
wherein, ct-1、ctRespectively representing the feature centers of the t-1 st and t-th updates, ViFor the feature of sample i, α is the update rate of the feature center, decaying α after each training period.
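Under the definitions above (local features V, feature centers c, margins m_in and m_out), a direct NumPy sketch of the region independence loss and of a batch-wise center update might look like this; the function names are illustrative, and the center update is applied here over the batch mean as one reasonable reading of the moving-average formula:

```python
import numpy as np

def region_independence_loss(v, centers, y, m_in, m_out):
    """v: (B, M, C) local features; centers: (M, C) feature centers;
    y: (B,) labels (0 = real, 1 = forged); m_in: per-label margin dict;
    m_out: margin between feature centers."""
    b, m, _ = v.shape
    intra = 0.0                       # pull features toward their centers
    for i in range(b):
        for j in range(m):
            d = np.linalg.norm(v[i, j] - centers[j])
            intra += max(d - m_in[y[i]], 0.0)
    inter = 0.0                       # push distinct centers apart
    for j in range(m):
        for k in range(m):
            if j != k:
                d = np.linalg.norm(centers[j] - centers[k])
                inter += max(m_out - d, 0.0)
    return intra + inter

def update_centers(centers, v, alpha):
    """Moving-average update: c_t = c_{t-1} + alpha * (mean(V) - c_{t-1})."""
    return centers + alpha * (v.mean(axis=0) - centers)
```

In training, alpha would be decayed after each epoch so the centers stabilize as the attention regions settle.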
For classification, the local texture features and the deep features are concatenated and passed through the fully connected layer to obtain the classification result. In the embodiment of the invention, the overall loss function is optimized using a gradient descent algorithm during training:

L = L_CE + λ · L_RIL

where L_CE is the cross entropy loss of the classifier and λ is a weight balancing the two loss terms.
III. Attention-guided data enhancement scheme.
In order to further separate the different attention maps, attention-guided data enhancement is introduced: during training, Gaussian blur is applied to a selected attention region of the input face image I to generate a data-enhanced face image I′, expressed as:

I′ = I ⊙ (1 − A_k) + G(I) ⊙ A_k

where A_k is the selected region attention map, resized to the input resolution and normalized to [0, 1], G(·) denotes Gaussian blur, and ⊙ is element-wise multiplication.
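A sketch of this attention-guided augmentation follows, with a simple box blur standing in for the Gaussian blur; both function names are illustrative:

```python
import numpy as np

def box_blur(img, k=5):
    """Simple per-channel box blur as a stand-in for Gaussian blur.
    img: (C, H, W)."""
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode='edge')
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + img.shape[1], dx:dx + img.shape[2]]
    return out / (k * k)

def attention_guided_augment(img, attn_k):
    """Blend the blurred image into the original under the selected
    attention map, so only the attended region is degraded:
        I' = I * (1 - A_k) + blur(I) * A_k."""
    a = attn_k / (attn_k.max() + 1e-12)       # normalize mask to [0, 1]
    return img * (1 - a[None]) + box_blur(img) * a[None]
```

Blurring the region a map currently attends to forces the other attention maps to find their own discriminative regions, which is the separation effect the scheme targets.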
In order to demonstrate the effects of the above-described embodiments of the present invention, the following description is made with reference to the experimental results.
In the experiments, EfficientNet-B4 is selected as the backbone network; the second-block feature map (shallow feature map) is the input of the texture enhancement module, the fifth-block feature map (middle feature map) is used to generate the attention maps, and the last feature map before global pooling serves as the deep feature map. The number of attention regions is set to 4.
In the experiments, several public face forgery detection datasets are used, and the present example of the method is trained and tested following the standard protocol. Table 1 shows the accuracy of the method on the high-quality (HQ) and low-quality (LQ) video tests of the FF++ dataset; as the results in Table 1 show, the method achieves higher detection accuracy on high-quality video than other existing methods.
TABLE 1 Accuracy of the method on the FF++ dataset compared with other methods
Table 2 shows the accuracy of the cross-dataset (transfer) test on Celeb-DF v2 after training the method on the FF++ dataset.
Method | FF++ | Celeb-DF
---|---|---
Two-stream | 70.1 | 53.8
Meso4 | 84.7 | 54.8
MesoInception4 | 83 | 53.6
FWA | 80.1 | 56.9
Xception-raw | 99.7 | 48.2
Xception-c23 | 99.7 | 65.3
Xception-c40 | 95.5 | 65.5
Multi-task | 76.3 | 54.3
Capsule | 96.6 | 57.5
DSP-FWA | 93 | 64.6
Two Branch | 93.18 | 73.41
F3-Net | 98.1 | 65.17
EfficientNet-B4 | 99.7 | 64.29
This method | 99.8 | 67.44
TABLE 2 Transferability of this method on the Celeb-DF v2 dataset compared with other methods
Table 3 shows the test performance (Logloss; smaller is better) of the method on the private test set of the DFDC competition, which is better than that of the winning methods of the competition.
Method | Logloss
---|---
Selim Seferbekov | 0.1983
WM | 0.1787
NTechLab | 0.1703
Eighteen Years Old | 0.1882
The Medics | 0.2157
This method | 0.1679
TABLE 3 Logloss of this method on the DFDC test set compared with the DFDC competition-winning methods
The comparison results show that, compared with the prior art, the method achieves higher accuracy and transferability on various public Deepfake detection datasets.
Fig. 3 shows a visualization example of the discriminative regions obtained through weakly supervised learning. Unlike the prior method (Dang et al.), which requires additional information to train the attention, the method provided by the present invention requires no additional information, so the attention can be regarded as obtained through weakly supervised learning; each discriminative region in fig. 3 corresponds to a different attention region.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A face forgery detection method based on a multi-region attention mechanism is characterized by comprising the following steps:
inputting a human face image to be detected into a convolutional neural network to obtain a shallow layer characteristic image, a middle layer characteristic image and a deep layer characteristic image;
carrying out texture enhancement operation on the shallow feature map to obtain a texture feature map; generating a multi-region attention map for the intermediate layer characteristic map through a multi-attention mechanism; performing attention pooling on the texture feature map by using the multi-region attention maps to obtain local texture features, and performing attention pooling on the deep feature map after adding the multi-region attention maps to obtain global features; and after the global features and the local texture features are fused, classifying to obtain a face forgery detection result.
2. The method for detecting face forgery based on multi-region attention mechanism as claimed in claim 1, wherein the texture enhancement module performs texture enhancement operation on the shallow feature map to obtain the texture feature map, and the implementation steps include:
performing local average pooling on the shallow feature map to obtain a non-texture feature map;
and inputting the residual error of the shallow characteristic diagram and the non-texture characteristic diagram into the densely connected convolution layer to obtain the texture characteristic diagram.
3. The method for detecting face forgery based on multi-region attention mechanism as claimed in claim 1, wherein the multi-region attention diagram is generated by the attention generation module to the middle layer feature diagram through the multi-attention mechanism; the attention generation module comprises a convolution layer, a batch normalization layer and a nonlinear activation layer which are arranged in sequence.
4. The method for detecting face forgery based on multi-region attention mechanism as claimed in claim 1, wherein the attention pooling for texture feature map is performed using multi-region attention maps to obtain local texture features, and the attention pooling for deep layer feature maps after adding the attention maps to obtain global features is realized by an attention pooling module; wherein:
bilinear attention pooling is adopted when performing attention pooling on the texture feature map using the multi-region attention maps, comprising: if the resolution of the multi-region attention maps does not match the resolution of the texture feature map, mapping the multi-region attention maps to the same resolution as the texture feature map; then multiplying the texture feature map by each region attention map respectively to obtain a plurality of partial texture feature maps; and performing global average pooling on all partial texture feature maps followed by L2 normalization to obtain the local texture features.
5. The method for detecting face forgery based on a multi-region attention mechanism as claimed in claim 2, wherein the loss function of the network model corresponding to the method during training includes two parts: a cross entropy loss and a region independence loss; wherein the region independence loss L_RIL, an auxiliary loss for training the attention generation module, is expressed as:

L_RIL = Σ_{i=1}^{B} Σ_{j=1}^{M} max(||V_i^j − c_t^j|| − m_in(y_i), 0) + Σ_{j≠k} max(m_out − ||c_t^j − c_t^k||, 0)

wherein V_i^j is the local feature of the j-th region of sample i; c_t^j and c_t^k respectively represent the feature centers of the j-th and k-th regions at the t-th update; m_in(y_i) represents the margin between a feature and its corresponding feature center, y_i is the label of sample i, m_in(y_i) is taken as a whole, and different labels have different margins; m_out is the margin between the feature centers; B is the batch size, and M is the number of multi-region attention maps, each region attention map focusing on one region;
the feature center c is the sliding average of the features V, and its update formula is:

c_t = (1 − α)·c_{t−1} + (α/B)·Σ_{i=1}^{B} V_i

wherein c_{t−1} and c_t respectively denote the feature centers of the (t−1)-th and t-th updates, V_i is the feature of sample i, and α is the update rate of the feature center, which is decayed after each training epoch.
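A minimal NumPy sketch of the region independence loss and the sliding-average center update described in claim 5. Function names, the `(B, M, D)` feature layout, and passing `m_in` as a per-label mapping are assumptions for illustration; the loss itself follows the margin terms defined above (pull each regional feature within `m_in(y_i)` of its center, push distinct centers at least `m_out` apart).

```python
import numpy as np

def region_independence_loss(V, centers, labels, m_in, m_out):
    """Sketch of the region independence loss (names assumed).

    V:       (B, M, D) local features, one D-dim vector per region per sample
    centers: (M, D) current feature centers c_t
    labels:  (B,) sample labels y_i; m_in maps a label to its margin,
             so different labels get different margins
    m_out:   margin required between any two feature centers
    """
    B, M, D = V.shape
    # Intra term: pull each regional feature towards its center, up to the margin
    intra = 0.0
    for i in range(B):
        for j in range(M):
            d = np.linalg.norm(V[i, j] - centers[j])
            intra += max(d - m_in[labels[i]], 0.0)
    # Inter term: push centers of different regions apart by at least m_out
    inter = 0.0
    for j in range(M):
        for k in range(M):
            if k != j:
                inter += max(m_out - np.linalg.norm(centers[j] - centers[k]), 0.0)
    return intra + inter

def update_centers(centers, V, alpha):
    """Sliding-average update of the feature centers c (assumed EMA form)."""
    batch_mean = V.mean(axis=0)                       # (M, D) mean over the batch
    return (1 - alpha) * centers + alpha * batch_mean
```

The inter-center term is what forces different attention maps to attend to different regions, since overlapping regions would produce nearby centers and incur a penalty.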
6. The method of claim 1, wherein the method further comprises: introducing attention-guided data augmentation, namely, during training, applying Gaussian blur to a selected attention region of the input face image I to generate an augmented face image I′, expressed as:

I′ = I ⊙ (1 − A_k) + G_σ(I) ⊙ A_k

wherein A_k is the selected attention region, G_σ(·) denotes Gaussian blurring, and ⊙ denotes element-wise multiplication.
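The attention-guided blur of claim 6 can be sketched as a mask-weighted blend between the image and a Gaussian-blurred copy. This is an illustrative NumPy version under assumptions: grayscale input, a soft region mask in [0, 1], a small separable-style kernel built inline, and edge padding; the patent does not fix these details.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def attention_guided_blur(image, region_mask, size=5, sigma=1.0):
    """Blend I' = I*(1 - A) + blur(I)*A for a selected attention region A.

    image:       (H, W) grayscale image
    region_mask: (H, W) attention region mask with values in [0, 1]
    """
    k = gaussian_kernel(size, sigma)
    H, W = image.shape
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    blurred = np.zeros_like(image, dtype=float)
    for y in range(H):
        for x in range(W):
            # Naive convolution: weighted sum of the local window
            blurred[y, x] = (padded[y:y + size, x:x + size] * k).sum()
    # Blur only where the selected attention region is active
    return image * (1 - region_mask) + blurred * region_mask
```

Blurring the region the network currently attends to forces it to find forgery evidence in other regions, which is the stated point of the augmentation.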
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295565.7A CN113011332A (en) | 2021-03-19 | 2021-03-19 | Face counterfeiting detection method based on multi-region attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113011332A true CN113011332A (en) | 2021-06-22 |
Family
ID=76403104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110295565.7A Pending CN113011332A (en) | 2021-03-19 | 2021-03-19 | Face counterfeiting detection method based on multi-region attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113011332A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN106599883A (en) * | 2017-03-08 | 2017-04-26 | Wang Huafeng | Face recognition method capable of extracting multi-level image semantics based on CNN (convolutional neural network)
CN110414414A (en) * | 2019-07-25 | 2019-11-05 | Hefei University of Technology | SAR image ship target discrimination method based on multi-layer deep feature fusion
CN111768415A (en) * | 2020-06-15 | 2020-10-13 | Harbin Engineering University | Image instance segmentation method without quantization pooling
CN111967427A (en) * | 2020-08-28 | 2020-11-20 | Guangdong University of Technology | Fake face video identification method, system and readable storage medium
Non-Patent Citations (1)
Title |
---|
HANQING ZHAO ET AL.: "Multi-attentional Deepfake Detection", arXiv:2103.02406v1 [cs.CV] *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113485261A (en) * | 2021-06-29 | 2021-10-08 | Northwest Normal University | CAEs-ACNN-based soft measurement modeling method |
CN113485261B (en) * | 2021-06-29 | 2022-06-28 | Northwest Normal University | CAEs-ACNN-based soft measurement modeling method |
CN113537027A (en) * | 2021-07-09 | 2021-10-22 | Institute of Computing Technology, Chinese Academy of Sciences | Face deep forgery detection method and system based on facial segmentation |
CN113537027B (en) * | 2021-07-09 | 2023-09-01 | Institute of Computing Technology, Chinese Academy of Sciences | Face deep forgery detection method and system based on facial segmentation |
CN114842524A (en) * | 2022-03-16 | 2022-08-02 | University of Electronic Science and Technology of China | Face forgery discrimination method based on irregular salient pixel clusters |
CN114842524B (en) * | 2022-03-16 | 2023-03-10 | University of Electronic Science and Technology of China | Face forgery discrimination method based on irregular salient pixel clusters |
CN115471736A (en) * | 2022-11-02 | 2022-12-13 | Zhejiang Juntong Intelligent Technology Co., Ltd. | Forged image detection method and device based on attention mechanism and knowledge distillation |
CN116453199A (en) * | 2023-05-19 | 2023-07-18 | Shandong Artificial Intelligence Institute | GAN-generated face detection method based on forgery traces in complex texture regions |
CN116453199B (en) * | 2023-05-19 | 2024-01-26 | Shandong Artificial Intelligence Institute | GAN-generated face detection method based on forgery traces in complex texture regions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qian et al. | Thinking in frequency: Face forgery detection by mining frequency-aware clues | |
Wu et al. | Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features | |
Chen et al. | A robust GAN-generated face detection method based on dual-color spaces and an improved Xception | |
Ross et al. | Security in smart cities: A brief review of digital forensic schemes for biometric data | |
CN113011332A (en) | Face counterfeiting detection method based on multi-region attention mechanism | |
Zhang et al. | A dense u-net with cross-layer intersection for detection and localization of image forgery | |
Fei et al. | Exposing AI-generated videos with motion magnification | |
Zhang et al. | Face anti-spoofing detection based on DWT-LBP-DCT features | |
Cao et al. | Metric learning for anti-compression facial forgery detection | |
Miao et al. | Learning forgery region-aware and ID-independent features for face manipulation detection | |
Yu et al. | Manipulation classification for jpeg images using multi-domain features | |
Agarwal et al. | Privacy preservation through facial de-identification with simultaneous emotion preservation | |
Liu et al. | Image deblocking detection based on a convolutional neural network | |
Zobaed et al. | Deepfakes: Detecting forged and synthetic media content using machine learning | |
Arora et al. | A review of techniques to detect the GAN-generated fake images | |
Zeng et al. | Occlusion‐invariant face recognition using simultaneous segmentation | |
Ke et al. | DF-UDetector: An effective method towards robust deepfake detection via feature restoration | |
Xu et al. | Facial depth forgery detection based on image gradient | |
Lal et al. | A study on deep fake identification techniques using deep learning | |
Ibsen et al. | Impact of facial tattoos and paintings on face recognition systems | |
Zhao et al. | TAN-GFD: generalizing face forgery detection based on texture information and adaptive noise mining | |
Meena et al. | Image splicing forgery detection techniques: A review | |
Peng et al. | Face morphing attack detection and attacker identification based on a watchlist | |
Zhang et al. | A pyramid attention network with edge information injection for remote sensing object detection | |
Annadani et al. | Augment and adapt: A simple approach to image tampering detection |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210622 |