CN114842524B - Face false distinguishing method based on irregular significant pixel cluster - Google Patents

Face false distinguishing method based on irregular significant pixel cluster

Info

Publication number
CN114842524B
CN114842524B CN202210260013.7A CN202210260013A
Authority
CN
China
Prior art keywords
face
image
real
forged
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210260013.7A
Other languages
Chinese (zh)
Other versions
CN114842524A (en)
Inventor
殷光强
李超
王治国
米尔卡米力江·亚森
李梦媛
刘学婷
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210260013.7A
Publication of CN114842524A
Application granted
Publication of CN114842524B
Legal status: Active
Anticipated expiration

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a face counterfeit identification method based on irregular salient pixel clusters, which comprises the following steps. First, real face images and forged face images are acquired and sequentially preprocessed and coarsely processed; the texture information of shallow features in the coarsely processed face candidate regions is enhanced, and saliency detection is performed to obtain a real-face saliency map and a forged-face saliency map based on the texture information. Then, an attention module generates a plurality of attention maps, each focusing on a different area of the input feature map; the texture-based real-face and forged-face saliency maps are fused with these attention maps, and the fused results are respectively merged, subjected to normalized average pooling, and stacked to obtain a texture-salient matrix. Finally, a global depth feature is obtained from the attention maps and sent, together with the texture-salient matrix, into a classifier to discriminate real from forged faces. The invention effectively improves the speed and accuracy of face forgery detection.

Description

Face false distinguishing method based on irregular significant pixel cluster
Technical Field
The invention belongs to the technical field of face detection, and particularly relates to a face counterfeit identification method based on irregular significant pixel clusters.
Background
With the rapid development of generative models, face forgery technology has advanced remarkably in recent years and can produce high-quality fake faces that are difficult to distinguish by eye; once used maliciously, such fakes can cause serious social problems or political threats. To reduce this risk, a large number of face authentication methods have emerged, most of which model face forgery detection as an ordinary binary classification problem (real/fake) or only tend to detect forgery in the central area of the face. Methods modeled as ordinary binary classification focus the network on extracting complex global features and feeding them into a binary classifier, so detection is slow. Methods that detect forgery from the central area of the face suffer in accuracy because the shapes and sizes of the forgeries produced by different face forgery algorithms differ greatly: for example, representative forgeries generated by DeepFakes and Face2Face usually modify the face boundary, while those generated by StyleGAN and PGGAN flexibly modify the whole face.
In addition, the document with publication number CN113536990A discloses a method for identifying deeply forged face data. The method introduces a multi-modal multi-scale Transformer (M2TR), which operates on image blocks of different sizes to detect local context inconsistencies at different scales. To improve detection results and robustness to image compression, M2TR also introduces frequency information and further combines it with RGB features through a cross-modal fusion module. Although this method can detect forged faces, in practice it must divide the feature map into spatial image blocks of different sizes and compute self-attention between the blocks with different heads. Because blocks of different sizes overlap in area, the network performs a large amount of redundant computation, which increases computation time and reduces the detection speed. Moreover, the number of blocks to divide into is large, and the choice of block scales carries great uncertainty, so the extracted multi-scale information is neither sufficient nor effective, which harms the network's accuracy.
For the reasons above, existing face forgery detection methods cannot identify forged and real faces efficiently and accurately. A face forgery detection method that quickly and accurately distinguishes real from fake faces is therefore urgently needed to improve the speed and accuracy of face forgery detection.
Disclosure of Invention
The invention aims to overcome the above technical problems in the prior art and provides a face counterfeit identification method based on irregular salient pixel clusters, built on the combination of a visual attention mechanism and salient-region detection. On the one hand, because the differences between real and forged face images are usually subtle and local, face forgery detection is redefined as a special fine-grained binary classification problem so that the network attends to different local features; at the same time, because artifacts introduced by forgery methods are conspicuous in texture features, the use of these features is specially attended to and enhanced. On the other hand, the description of forgeries of different shapes and sizes, which is effective for the face forgery detection problem, is improved, thereby increasing the speed and accuracy of face forgery detection.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a face false distinguishing method based on irregular significant pixel clusters is characterized by comprising the following steps:
the method comprises the following steps: acquiring a plurality of real face images and fake face images, and respectively preprocessing the acquired images to obtain real face standard images and fake face standard images;
step two: respectively carrying out rough processing on a real face standard image and a forged face standard image by using skin color prior knowledge and geometric prior knowledge of a face to obtain a real face standard image and a forged face standard image which are marked with face candidate areas;
step three: enhancing texture information of shallow features in the face candidate area through a backbone network, and performing significance detection on a real face standard image and a forged face standard image of which the texture information is enhanced to obtain significance image features of the face candidate area; processing the salient image characteristics of the face candidate region to reserve fine artifacts in shallow layer characteristics to the maximum extent, and obtaining a real face salient image and a forged face salient image based on texture information after the processing is finished;
step four: extracting a feature map of a specific layer in the backbone network, inputting the feature map into a multi-attention module, and generating a plurality of attention maps focusing different areas of the input feature map through the multi-attention module;
step five: respectively fusing the results obtained in the third step with a plurality of attention maps, and respectively obtaining a local significant characterization based on texture enhancement of a real face significant map and a local significant characterization based on texture enhancement of a fake face significant map after fusion; merging the local significant characteristics of the texture enhancement of each real face significant image and merging the local significant characteristics of the texture enhancement of each fake face significant image, performing standardized average pooling on the merged local significant characteristics, and stacking the pooled standardized significant characteristics together to obtain a texture significant matrix;
step six: and splicing the plurality of attention maps obtained in the fourth step and acting on the feature map of the last layer of the backbone network feature extraction layer to obtain global depth features, and sending the global depth features and the texture significant matrix obtained in the fifth step into a classifier together to realize the identification of real and forged faces.
In the first step, a real face data set commonly used in the field is used to obtain real face images of different visual qualities, and the FaceForensics++ face forgery data set is used to obtain forged face images; the number of acquired images is in the hundreds of thousands.
The preprocessing in the first step is to use normalization operation to process and transform the real face image and the forged face image respectively, and the real face standard image and the forged face standard image are obtained after the preprocessing.
The step two of using the skin color prior knowledge and the geometric prior knowledge of the human face to perform rough processing refers to: establishing a skin color model, calculating the skin color similarity of each pixel point in the image to obtain a skin color outline, and detecting and marking a face candidate area by using the geometric information of the face.
The concrete treatment process of the third step is as follows: inputting a real face standard image marked with a face candidate region and a forged face standard image into a backbone network, and enhancing texture information of shallow features in the face candidate region based on a residual network thought; then, carrying out significance detection on the real face standard image and the forged face standard image with the enhanced texture information by using an attention mechanism and ConvLSTM, and obtaining significance image characteristics of the face candidate region after the detection is finished; and then, calculating and coding the salient image characteristics of the obtained face candidate region by using the spatial attention, and realizing the maximum retention of fine artifacts in shallow features, thereby obtaining a real face salient image and a forged face salient image based on texture information.
The multi-attention module in the fourth step is a lightweight model consisting of a 1 × 1 convolution layer, a batch normalization layer, and a nonlinear ReLU activation layer.
And fifthly, after the texture enhanced local significant representation based on the real face significant image and the texture enhanced local significant representation based on the fake face significant image are obtained, converting the local significant representations into sequence data by taking pixel points as units, processing the sequence data by using a structure based on a Transformer, and then merging the sequence data.
The method for obtaining the global depth features in the sixth step comprises the following steps: and splicing the plurality of attention maps to obtain a single-channel attention map, and applying the single-channel attention map to the last layer of the feature extraction layer of the backbone network to obtain the global depth feature.
By adopting the technical scheme, the invention has the beneficial technical effects that:
1. On the one hand, because the differences between real and forged face images are subtle and local, the invention redefines face forgery detection as a special fine-grained binary classification problem so that the network attends to different local features; meanwhile, artifacts introduced by forgery methods are conspicuous in texture features, and these features are specially attended to and enhanced. On the other hand, the description of forgeries of different shapes and sizes, effective for the face forgery detection problem, is improved, thereby increasing the speed and accuracy of face forgery detection.
Further, the advantages of the steps of the invention are as follows:
the method has the advantages that a plurality of real face images and fake face images are obtained in the first step, and disturbance of the data to the model counterfeit identification performance can be reduced by using abundant and sufficient data sets. And the acquired images are respectively preprocessed, so that the generation of overfitting can be effectively prevented.
And in the second step, the skin color priori knowledge and the geometric priori knowledge of the human face are used for respectively carrying out rough processing on the real human face standard image and the forged human face standard image, the rough processing refers to the detection of the human face in the image, the human face detection is the premise and the basis of the human face forging and identification, and in order to realize efficient human face identification, the priori knowledge is required to be used for accurately detecting the human face in the image so as to form the image marked with the human face candidate area.
In the third step, artifacts caused by the face forgery method are obvious in texture information, so that the texture information of the shallow feature in the face candidate area is enhanced through the backbone network in the step, which is equivalent to enhancing the representation of the texture information in the feature of the shallow feature of the image, and the texture information can be regarded as the high-frequency component of the shallow feature to obtain the shallow feature with enhanced texture, so that the false distinguishing effect is more favorably improved.
In addition, the significance detection in the step can simulate a visual attention mechanism in a human visual system, so that the backbone network focuses on an important region of the human face to learn the significance image characteristics of the candidate region of the human face, and the human face false distinguishing is carried out on the subsequent network on the basis of texture enhancement to reduce the calculated amount, thereby improving the processing speed of the backbone network. And the spatial attention is used for calculating and coding the significant image features of the obtained face candidate region, so that the fine artifacts in the shallow features can be retained to the maximum extent, and the false distinguishing effect is further effectively improved.
In the fourth step, the characteristic diagram of a specific layer in the backbone network is extracted, the characteristic diagram is input into the multi-attention module, and then the multi-attention module generates a plurality of attention diagrams focusing on different areas of the input characteristic diagram. The different areas respectively correspond to areas such as eyes and mouths, and the step can further guide a subsequent backbone network to focus on different local areas through semantic features, and the local areas have strong discriminability on face false identification.
In step five, the texture-based saliency map is fused with the multi-attention maps that attend to different local areas, so that the backbone network effectively mines the texture information in different local regions and can capture the artifacts produced by forgery methods in these highly discriminative regions. Meanwhile, these local regions carry different irregular structural information, and this step detects artifacts with a Transformer-based structure, which effectively improves the description of irregular forgeries and thus the accuracy of face authentication.
And step six, the texture significant matrix and the global depth feature are sent to a classifier together, the complementation of the local feature and the global feature is realized, the features with strong identification ability and high robustness can be constructed on the basis, and the overall performance of the network human face false identification is effectively improved.
2. According to the invention, by enhancing the expression of texture information, more artifacts generated by a counterfeiting method are reserved, and the shallow texture features and the deep semantic features are aggregated to be used as each local distinguishing representation, so that the fine difference is prevented from disappearing in the deep layer.
3. The invention improves the fine classification effect by promoting the description of counterfeiting in different shapes and sizes, and finally realizes the rapid and accurate counterfeiting detection for identifying true and false faces.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a model framework of the present invention.
FIG. 3 is a schematic diagram of a key structure for significance detection according to the present invention.
FIG. 4 is a schematic diagram of a multi-attention module according to the present invention.
Detailed Description
Example 1
The invention discloses a face counterfeit identification method based on irregular salient pixel clusters; the flow chart is shown in FIG. 1 and the model framework in FIG. 2. The method specifically comprises the following steps:
the method comprises the following steps: and acquiring a plurality of real face images and forged face images, and respectively preprocessing the acquired images to obtain real face standard images and forged face standard images.
It should be noted that in this step, considering environmental influences in practical applications, it is preferable to use a real face data set commonly used in the field to acquire real face images of different visual qualities. For forged face images, the FaceForensics++ data set, which covers different forgery methods and provides multiple quality grades, is preferably used; hundreds of thousands of face images are obtained in total. Using a rich and sufficient data set reduces the disturbance that the data causes to the model's authentication performance.
Further, preprocessing the acquired images means applying a normalize operation to transform the real and forged face images respectively; the real face standard images and forged face standard images are obtained after processing. Using the normalize operation effectively prevents overfitting.
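As an illustrative sketch of this preprocessing, the normalize step can be read as scaling pixel values and standardizing each channel. The mean/std constants and the 0-1 scaling below are assumptions for illustration; the patent does not state the exact transform.

```python
import numpy as np

# Assumed per-channel statistics; the patent does not specify the values.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize_face(img_uint8: np.ndarray) -> np.ndarray:
    """Scale an H x W x 3 uint8 face crop to [0, 1], then standardize per channel."""
    x = img_uint8.astype(np.float64) / 255.0
    return (x - MEAN) / STD

img = np.full((4, 4, 3), 128, dtype=np.uint8)  # toy uniform-gray face crop
std_img = normalize_face(img)                  # the "standard image" of step one
```

Each channel of the output is zero-mean/unit-variance with respect to the assumed statistics, which is what makes the downstream backbone less prone to overfitting on raw intensity ranges.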
Step two: and respectively carrying out rough processing on the real face standard image and the forged face standard image by using the skin color prior knowledge and the geometric prior knowledge of the face to obtain the real face standard image and the forged face standard image which are marked with the face candidate regions.
In this step, the rough processing using the skin color prior knowledge and the geometric prior knowledge of the human face means: establishing a skin color model, calculating the skin color similarity of each pixel point in the image to obtain a skin color outline, and detecting and marking a face candidate region by using the geometric information of the face. It should be noted that the rough processing here is mainly to detect the face in the image, and the face detection is a precondition and a basis for the counterfeit identification of the face, and in order to realize efficient face identification, the prior knowledge needs to be used to accurately detect the face in the image to form an image marked with a face candidate area.
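The skin-colour rough processing above can be sketched as a per-pixel similarity score. The Gaussian model over Cb/Cr chrominance, and its mean and covariance values, are common-practice assumptions rather than the patent's specified model.

```python
import numpy as np

# Assumed skin-colour cluster in CbCr space (illustrative values only).
CBCR_MEAN = np.array([117.4, 156.6])
CBCR_COV_INV = np.linalg.inv(np.array([[160.1, 12.1],
                                       [12.1, 299.5]]))

def skin_similarity(cbcr: np.ndarray) -> np.ndarray:
    """Return a per-pixel skin similarity map in (0, 1] for an H x W x 2 CbCr image."""
    d = cbcr - CBCR_MEAN
    m = np.einsum('hwi,ij,hwj->hw', d, CBCR_COV_INV, d)  # Mahalanobis distance
    return np.exp(-0.5 * m)

cbcr = np.tile(CBCR_MEAN, (8, 8, 1)).astype(np.float64)  # toy all-skin patch
sim = skin_similarity(cbcr)   # pixels exactly at the mean score 1.0
mask = sim > 0.5              # thresholding yields the skin contour
```

Thresholding the similarity map gives the skin contour; the face candidate region would then be marked by checking geometric constraints (aspect ratio, hole positions for eyes/mouth) inside that contour.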
Step three: enhancing texture information of shallow features in the face candidate area through a backbone network, and performing significance detection on a real face standard image and a forged face standard image of which the texture information is enhanced to obtain significance image features of the face candidate area; and processing the salient image characteristics of the face candidate region to furthest reserve the fine artifacts in the shallow features, and obtaining a real face salient image and a forged face salient image based on texture information after the processing is finished.
The backbone network in this step is not limited; any network may be used as long as it can enhance image texture information. For example, EfficientNet-64 may be used as the backbone network. The specific processing is as follows: input the real face standard image and forged face standard image marked with face candidate regions into the backbone network, and enhance the texture information of shallow features in the face candidate regions based on the residual-network idea; then perform saliency detection on the texture-enhanced real and forged face standard images using an attention mechanism and ConvLSTM to obtain the salient image features of the face candidate regions; finally, compute and encode these salient image features with spatial attention to retain the fine artifacts in shallow features to the greatest extent, thereby obtaining the texture-based real-face and forged-face saliency maps.
It should be noted that, in this step, since artifacts caused by the face forgery method are significant in texture information, the texture information of the shallow features needs to be enhanced. The detailed process comprises the following steps: inputting the roughly processed real face standard image and the forged face standard image into a backbone network, applying local average pooling to shallow features extracted by the backbone network for down sampling to obtain pooled features D, and acquiring feature representation of prominent texture information based on the idea of a residual error network, wherein the formula is as follows:
T_L = f_L(I) - D
where the subscript L denotes the shallow level, T_L is the feature that most expresses texture information, f_L(I) is the original shallow feature, and D is the pooled feature. A DenseBlock is then used to enhance the expression of T_L, yielding the texture-enhanced feature.
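The residual formula above can be sketched numerically. The 2 x 2 pooling window and nearest-neighbour upsampling are assumptions, since the patent only specifies local average pooling followed by subtraction.

```python
import numpy as np

def local_avg_pool(feat: np.ndarray, k: int = 2) -> np.ndarray:
    """Average-pool an H x W feature map over k x k windows, then upsample back."""
    h, w = feat.shape
    pooled = feat.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)

def texture_residual(shallow_feat: np.ndarray) -> np.ndarray:
    """T_L = f_L(I) - D: subtracting the smoothed map leaves the texture component."""
    return shallow_feat - local_avg_pool(shallow_feat)

f = np.array([[1., 3.], [5., 7.]])  # toy 2 x 2 shallow feature map
t = texture_residual(f)             # window mean is 4, so t = f - 4
```

Because D removes the low-frequency content, T_L keeps exactly the high-frequency component where forgery artifacts are most visible; the DenseBlock then amplifies this residual.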
After texture enhancement, saliency detection is performed using an attention mechanism and ConvLSTM. To accurately capture salient regions with highly retained artifacts, inspired by the mechanism of the human brain, the detail information of the detection result is derived and optimized in a memory-oriented manner. The structure is shown in FIG. 3, where the subscript t denotes a time step in ConvLSTM, H_t denotes the memory of the understanding of previous images in temporal order, and F~ denotes the more representative weighted feature. To emphasize how the relations between different pixels affect the judgment of the final salient region, spatial attention processing then yields the texture-based real-face and forged-face saliency maps.
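The final spatial-attention weighting can be sketched as a spatial softmax over a score map. The softmax form is an assumption, as the patent does not specify how the pixel-relation scores are normalized.

```python
import numpy as np

def spatial_attention(feat: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """feat: C x H x W features; scores: H x W raw per-pixel attention scores."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    w = e / e.sum()                    # spatial softmax: weights sum to 1
    return feat * w                    # broadcast the weight map over channels

feat = np.ones((3, 2, 2))             # toy 3-channel feature map
scores = np.zeros((2, 2))             # uniform scores give uniform weights
out = spatial_attention(feat, scores) # each spatial weight is 1/4
```

Pixels whose relations mark them as important receive higher scores and so dominate the weighted feature, which is how fine shallow-layer artifacts are kept in the saliency map.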
Step four: and constructing a multi-attention module which is a lightweight model and consists of a 1 x 1 convolution layer, a batch normalization layer and a nonlinear activation layer ReLU. And then extracting a characteristic diagram of a specific layer in the backbone network, inputting the characteristic diagram into a multi-attention module, and generating a plurality of attention diagrams focusing on different areas of the input characteristic diagram through the multi-attention module.
It should be noted that the preceding steps attend to and enhance the use of shallow features while ignoring deep semantic features; semantic features can further guide the subsequent network to focus on different local regions, which are highly discriminative for face authentication. The specific layer here therefore refers to the fourth or fifth layer of the backbone network. The feature map is input into the multi-attention module, which generates a plurality of attention maps focusing on different areas of the input feature map, i.e., the multi-attention map A used to find discriminative local regions. The module is shown in FIG. 4, where the subscript L denotes the deep level and A_k denotes the k-th attention map, corresponding to a specific locally discriminative region such as the eyes, the mouth, or even the blending boundary of two face images.
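A minimal sketch of the multi-attention module follows: a 1 x 1 convolution is a per-pixel linear map over channels, followed by normalization and ReLU, with each output channel serving as one attention map A_k. The random weights and per-map normalization below stand in for the trained parameters and batch normalization.

```python
import numpy as np

rng = np.random.default_rng(0)  # placeholder for trained 1 x 1 conv weights

def multi_attention(feat: np.ndarray, k: int = 4) -> np.ndarray:
    """feat: C x H x W deep feature map -> K x H x W attention maps."""
    c, h, w = feat.shape
    w1x1 = rng.standard_normal((k, c))        # K kernels of a 1 x 1 convolution
    a = np.einsum('kc,chw->khw', w1x1, feat)  # the 1 x 1 convolution itself
    mu = a.mean(axis=(1, 2), keepdims=True)
    sd = a.std(axis=(1, 2), keepdims=True)
    a = (a - mu) / (sd + 1e-5)                # normalization layer
    return np.maximum(a, 0.0)                 # ReLU: non-negative attention maps

feat = rng.standard_normal((8, 6, 6))
maps = multi_attention(feat)  # 4 attention maps A_1..A_4 over the 6 x 6 grid
```

The module is cheap precisely because a 1 x 1 convolution mixes channels only; spatial selectivity emerges from which channel combinations fire at which positions, so each A_k can settle on a region such as the eyes or mouth during training.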
Step five: respectively fusing the results obtained in the step three with a plurality of attention maps, and respectively obtaining the texture enhanced local significant characteristics based on the real face significant map and the texture enhanced local significant characteristics based on the forged face significant map after fusion; after the local significant representation of texture enhancement based on the real face significant figure and the local significant representation of texture enhancement based on the fake face significant figure are obtained, converting the local significant representations into sequence data by taking pixel points as units, and processing the sequence data by utilizing a structure based on a Transformer; and merging the local significant characteristics of the texture enhancement of each real face significant image, merging the local significant characteristics of the texture enhancement of each fake face significant image, performing standardized average pooling on the merged local significant characteristics, and stacking the pooled standardized significant characteristics together to obtain a texture significant matrix.
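The fusion, normalized average pooling, and stacking in step five can be sketched as follows. The element-wise product used for fusion is an assumption, since the patent does not name the fusion operator.

```python
import numpy as np

def texture_salient_matrix(saliency: np.ndarray, attn: np.ndarray,
                           feat: np.ndarray) -> np.ndarray:
    """saliency: H x W; attn: K x H x W; feat: C x H x W -> K x C matrix."""
    fused = attn * saliency  # fuse each attention map with the saliency map
    # normalized average pooling of feat under each fused attention map
    pooled = np.einsum('khw,chw->kc', fused, feat) / (
        fused.sum(axis=(1, 2))[:, None] + 1e-8)
    return pooled            # stacking the K pooled rows gives the matrix

sal = np.ones((4, 4))                   # toy saliency map
attn = np.ones((3, 4, 4))               # three toy attention maps
feat = np.full((5, 4, 4), 2.0)          # toy 5-channel texture feature
m = texture_salient_matrix(sal, attn, feat)
```

Each row of the matrix is one local salient characterization (one attended region), so the classifier later sees K region descriptors rather than a single pooled vector, which is the fine-grained aspect of the method.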
Step six: and splicing the plurality of attention maps obtained in the fourth step and acting on the feature map of the last layer of the backbone network feature extraction layer to obtain global depth features, and sending the global depth features and the texture significant matrix obtained in the fifth step into a classifier together to realize the identification of real and forged faces.
The method for obtaining the global depth feature in the step comprises the following steps: and splicing the plurality of attention maps to obtain a single-channel attention map, and applying the single-channel attention map to the last layer of the feature extraction layer of the backbone network to obtain the global depth feature.
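Step six's construction of the global depth feature can be sketched as follows. Collapsing the attention maps by summation and reading out with global weighted pooling are assumptions about the unspecified splicing and application steps.

```python
import numpy as np

def global_depth_feature(attn: np.ndarray, last_feat: np.ndarray) -> np.ndarray:
    """attn: K x H x W; last_feat: C x H x W -> length-C global depth feature."""
    single = attn.sum(axis=0)                # collapse to a single-channel map
    single = single / (single.sum() + 1e-8)  # normalize into a weight map
    return np.einsum('hw,chw->c', single, last_feat)  # weighted global pooling

attn = np.ones((2, 4, 4))        # toy attention maps
last = np.full((6, 4, 4), 3.0)   # toy last-layer feature map
g = global_depth_feature(attn, last)
```

Concatenating g with the flattened texture-salient matrix would give the joint global-plus-local vector that the classifier receives, realizing the complementarity described above.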
In summary, because the differences between real and forged face images are often subtle and local, the invention performs face authentication by combining a visual attention mechanism with salient-region detection: face forgery detection is redefined as a special fine-grained binary classification problem so that the network attends to different local features, and because artifacts caused by forgery methods are conspicuous in texture features, the use of these features is specially attended to and enhanced. In addition, by promoting effective description of forgeries of different shapes and sizes for the face forgery detection problem, the invention effectively improves the speed and accuracy of face forgery detection.
Example 2
This example verifies the method described in example 1 as follows:
the method comprises the steps of obtaining real face image samples with different visual qualities in a real face data set which is universal in the field, obtaining forged face image samples in a faceforces + + face forging data set, and respectively obtaining 50000 real face images and 50000 forged face images in order to achieve balance of true and false labels and good model generalization performance. The method comprises the steps of carrying out false identification on a face image by the method of the embodiment 1, preprocessing the face image in the first step to obtain a standard image, carrying out face detection on the standard image by using priori knowledge in the second step, inputting the image into a backbone network in the third step, obtaining a face saliency map based on texture information by using EfficientNet-64 as the backbone network, obtaining a multi-attention map through the fourth step, combining the face saliency map in the third step with the multi-attention map in the fourth step through the fifth step, processing and combining the face saliency map and the multi-attention map to obtain a texture saliency matrix, and finally identifying a real face and a forged face through the sixth step.
The result shows that the counterfeit identification accuracy rate, namely the evaluation index ACC, reaches 97.6%, and the counterfeit identification speed, namely the evaluation index FPS, reaches 213.8.
In addition, to evaluate the importance of texture information, step three was ablated, i.e., the texture information of shallow features was not enhanced. Since artifacts appear prominently in texture information, the deep layers of the network lose their perception of the artifacts, and the network's accuracy drops by 4.8%.
To evaluate the effectiveness of multiple attentions, steps four and five were ablated, i.e., only a single attention was used. Since the differences between real and fake faces are usually local and subtle, a single attention struggles to capture the forgery, and the network's accuracy drops by 1.6%.
In summary, the present invention effectively improves the speed and accuracy of face forgery detection by promoting the description of forgery of different shapes and sizes effective to the face forgery detection problem.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (8)

1. A face counterfeit identification method based on irregular salient pixel clusters, characterized by comprising the following steps:
Step one: acquiring a plurality of real face images and forged face images, and preprocessing the acquired images respectively to obtain real face standard images and forged face standard images;
Step two: performing rough processing on the real face standard images and the forged face standard images respectively using skin-color prior knowledge and geometric prior knowledge of the face, to obtain real face standard images and forged face standard images marked with face candidate regions;
Step three: enhancing texture information of the shallow features in the face candidate regions through a backbone network, and performing saliency detection on the texture-enhanced real and forged face standard images to obtain saliency image features of the face candidate regions; processing the saliency image features of the face candidate regions so that fine artifacts in the shallow features are preserved to the maximum extent, thereby obtaining a real face saliency map and a forged face saliency map based on texture information;
Step four: extracting the feature map of the fourth or fifth layer of the backbone network, inputting it into a multi-attention module, and generating through the multi-attention module a plurality of attention maps that focus on different regions of the input feature map;
Step five: fusing the results of step three with the plurality of attention maps respectively, obtaining texture-enhanced local salient features based on the real face saliency map and texture-enhanced local salient features based on the forged face saliency map; merging the texture-enhanced local salient features of each real face saliency map, and likewise those of each forged face saliency map, performing normalized average pooling on the merged features, and stacking the pooled normalized salient features together to obtain a texture saliency matrix;
Step six: concatenating the plurality of attention maps obtained in step four and applying them to the feature map of the last layer of the backbone feature-extraction layers to obtain global depth features, and feeding the global depth features together with the texture saliency matrix obtained in step five into a classifier to discriminate real faces from forged faces.
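Step five above can be sketched as follows. This is a hedged illustration only: the fusion is taken to be element-wise multiplication and the normalized average pooling to be attention-weighted averaging, since the claim does not fix the exact operators:

```python
import numpy as np

def texture_saliency_matrix(texture_feats, attention_maps, eps=1e-6):
    """texture_feats: (C, H, W) texture-enhanced local salient features.
    attention_maps: (M, H, W) maps focusing on different regions.
    Each attention map weights the texture features; the weighted maps are
    average-pooled with normalization and stacked into an (M, C) matrix."""
    fused = attention_maps[:, None] * texture_feats[None]    # (M, C, H, W) fusion
    weight = attention_maps.sum(axis=(1, 2))[:, None] + eps  # (M, 1) pooling normalizer
    return fused.sum(axis=(2, 3)) / weight                   # stacked (M, C) matrix

tex = np.random.rand(16, 8, 8)     # C = 16 texture-enhanced channels
attn = np.random.rand(4, 8, 8)     # M = 4 attention maps
matrix = texture_saliency_matrix(tex, attn)
print(matrix.shape)                # (4, 16)
```

Each row of the matrix summarizes the texture evidence inside one attended region; the classifier in step six receives this matrix alongside the global depth features.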
2. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: in step one, real face images of different visual qualities are obtained from a real-face data set in common use in the field, and forged face images are obtained from the FaceForensics++ face-forgery data set, the number of images obtained being in the hundreds of thousands.
3. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: the preprocessing in step one is to transform the real face images and the forged face images respectively with a Normalize operation, obtaining the real face standard images and the forged face standard images once processing is complete.
4. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: the rough processing in step two using skin-color prior knowledge and geometric prior knowledge of the face refers to: establishing a skin color model, calculating the skin-color similarity of each pixel in the image to obtain a skin-color contour, and then detecting and marking the face candidate regions using the geometric information of the face.
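The skin color model of claim 4 can be illustrated with a classical YCbCr rule; the Cb/Cr thresholds below are the commonly used skin cluster, an assumption, as the patent does not give its model's parameters:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an H x W x 3 float RGB image (0..255) to Y, Cb, Cr planes."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def skin_mask(rgb):
    """Per-pixel skin-colour decision: True where (Cb, Cr) falls in the
    commonly used skin cluster 77 <= Cb <= 127, 133 <= Cr <= 173."""
    _, cb, cr = rgb_to_ycbcr(rgb.astype(np.float64))
    return (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)

# A skin-toned patch is accepted; a green patch is rejected.
skin = np.full((4, 4, 3), (224, 172, 150), dtype=np.uint8)
grass = np.full((4, 4, 3), (30, 200, 40), dtype=np.uint8)
print(skin_mask(skin).all(), skin_mask(grass).any())  # True False
```

The resulting mask gives the skin-color contour; the face candidate regions would then be filtered by geometric constraints such as aspect ratio and area.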
5. The face counterfeit identification method based on irregular salient pixel clusters according to any one of claims 1 to 4, characterized in that: step three specifically proceeds as follows: the real face standard images and forged face standard images marked with face candidate regions are input into the backbone network, and the texture information of the shallow features in the face candidate regions is enhanced based on the residual-network idea; saliency detection is then performed on the texture-enhanced real and forged face standard images using an attention mechanism and ConvLSTM, yielding the saliency image features of the face candidate regions upon completion; finally, the saliency image features of the face candidate regions are computed and encoded with spatial attention, so that the fine artifacts in the shallow features are preserved to the maximum extent, thereby obtaining the real face saliency map and the forged face saliency map based on texture information.
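The "residual-network idea" for texture enhancement named in claim 5 can be read as separating the shallow features into a locally averaged smooth component and a high-frequency residual, then amplifying the residual, where forgery artifacts concentrate. A minimal NumPy sketch under that assumption (the patent does not disclose the exact layer design):

```python
import numpy as np

def local_average(feat, k=3):
    """Box-filter each channel of a (C, H, W) feature map with a k x k window."""
    C, H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(feat)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + H, dx:dx + W]
    return out / (k * k)

def enhance_texture(feat, alpha=1.0):
    """Residual texture enhancement: keep the smooth component and add back
    an amplified high-frequency residual feat - local_average(feat)."""
    return feat + alpha * (feat - local_average(feat))

feat = np.random.rand(8, 16, 16)
enhanced = enhance_texture(feat)
print(enhanced.shape)  # (8, 16, 16)
```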
6. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: the multi-attention module in step four is a lightweight model consisting of a 1×1 convolution layer, a batch normalization layer and a ReLU nonlinear activation layer.
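The lightweight module of claim 6 amounts to a per-pixel linear map over channels (the 1×1 convolution) followed by normalization and ReLU. A NumPy sketch with random placeholder weights (in the patent the module is trained end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_attention(feat, n_maps=4, eps=1e-5):
    """feat: (C, H, W) feature map from layer four or five of the backbone.
    Returns (n_maps, H, W) attention maps, one per attended region."""
    C = feat.shape[0]
    w = rng.standard_normal((n_maps, C))        # 1x1 conv kernel, placeholder weights
    x = np.tensordot(w, feat, axes=([1], [0]))  # 1x1 convolution over channels
    mean = x.mean(axis=(1, 2), keepdims=True)   # stand-in for the batch-norm layer
    var = x.var(axis=(1, 2), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return np.maximum(x, 0.0)                   # ReLU nonlinearity

feat = rng.standard_normal((32, 14, 14))
maps = multi_attention(feat)
print(maps.shape)  # (4, 14, 14)
```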
7. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: in step five, after the texture-enhanced local salient representations based on the real face saliency map and on the forged face saliency map are obtained, the local salient representations are converted into sequence data pixel by pixel, processed with a Transformer-based structure, and then merged.
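The pixel-wise sequence conversion of claim 7 can be sketched as flattening the spatial grid into per-pixel tokens and applying one single-head scaled dot-product attention step, a minimal untrained stand-in for the Transformer-based structure named in the claim:

```python
import numpy as np

rng = np.random.default_rng(1)

def to_sequence(feat):
    """(C, H, W) feature map -> (H * W, C) sequence of per-pixel tokens."""
    C, H, W = feat.shape
    return feat.reshape(C, H * W).T

def self_attention(tokens):
    """One single-head scaled dot-product attention step; the projections
    are random placeholders for learned Transformer weights."""
    n, d = tokens.shape
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

seq = to_sequence(rng.standard_normal((16, 8, 8)))  # 64 tokens of dimension 16
out = self_attention(seq)
print(out.shape)  # (64, 16)
```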
8. The face counterfeit identification method based on irregular salient pixel clusters according to claim 1, characterized in that: the global depth features in step six are obtained as follows: the plurality of attention maps are first concatenated to obtain a single-channel attention map, which is then applied to the feature map of the last layer of the backbone feature-extraction layers to obtain the global depth features.
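Claim 8 can be sketched as follows; collapsing the attention maps to a single channel is taken here to be an element-wise sum and the final reduction to be global average pooling, both assumptions since the claim does not fix these operators:

```python
import numpy as np

def global_depth_features(attention_maps, last_feat):
    """attention_maps: (M, H, W); last_feat: (C, H, W) final backbone feature map.
    Collapse the M maps to one channel, weight the deep features with it,
    then global-average-pool to a C-dimensional descriptor."""
    single = attention_maps.sum(axis=0)   # splice M maps into 1 channel
    weighted = last_feat * single[None]   # apply to the last feature map
    return weighted.mean(axis=(1, 2))     # (C,) global depth feature vector

attn = np.random.rand(4, 7, 7)
feat = np.random.rand(256, 7, 7)
g = global_depth_features(attn, feat)
print(g.shape)  # (256,)
```

The resulting vector is what step six concatenates with the texture saliency matrix before the classifier.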
CN202210260013.7A 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster Active CN114842524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210260013.7A CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210260013.7A CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Publications (2)

Publication Number Publication Date
CN114842524A CN114842524A (en) 2022-08-02
CN114842524B true CN114842524B (en) 2023-03-10

Family

ID=82562832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210260013.7A Active CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Country Status (1)

Country Link
CN (1) CN114842524B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720B (en) * 2022-08-11 2023-06-06 山东省人工智能研究院 Method for generating deepfake based on transducer
CN117557889A (en) * 2023-10-13 2024-02-13 中国信息通信研究院 Image forgery detection method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461089A (en) * 2020-06-17 2020-07-28 腾讯科技(深圳)有限公司 Face detection method, and training method and device of face detection model
CN112163511A (en) * 2020-09-25 2021-01-01 天津大学 Method for identifying authenticity of image
CN112215346A (en) * 2020-10-20 2021-01-12 陈永聪 Implementation method of humanoid general artificial intelligence
CN112686331A (en) * 2021-01-11 2021-04-20 中国科学技术大学 Forged image recognition model training method and forged image recognition method
CN112883874A (en) * 2021-02-22 2021-06-01 中国科学技术大学 Active defense method aiming at deep face tampering
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113627233A (en) * 2021-06-17 2021-11-09 中国科学院自动化研究所 Visual semantic information-based face counterfeiting detection method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8194938B2 (en) * 2009-06-02 2012-06-05 George Mason Intellectual Properties, Inc. Face authentication using recognition-by-parts, boosting, and transduction
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform
CN107423690B (en) * 2017-06-26 2020-11-13 广东工业大学 Face recognition method and device
CN111368796B (en) * 2020-03-20 2024-03-08 北京达佳互联信息技术有限公司 Face image processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deepfake detection based on incompatibility between multiple modes; Yuxin Zhang et al.; 2021 International Conference on Intelligent Technology and Embedded Systems; 1-7 *
Forged face video detection method fusing global temporal and local spatial features (in Chinese); Chen Peng et al.; Journal of Cyber Security, No. 02; 73-83 *
A survey of audio-visual deepfake detection techniques (in Chinese); Liang Ruigang et al.; Journal of Cyber Security, No. 02; 1-17 *

Also Published As

Publication number Publication date
CN114842524A (en) 2022-08-02

Similar Documents

Publication Publication Date Title
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN114842524B (en) Face false distinguishing method based on irregular significant pixel cluster
Zhang et al. A dense u-net with cross-layer intersection for detection and localization of image forgery
CN111126240B (en) Three-channel feature fusion face recognition method
CN112991345B (en) Image authenticity detection method and device, computer equipment and storage medium
CN109740572A (en) A kind of human face in-vivo detection method based on partial color textural characteristics
CN112215043A (en) Human face living body detection method
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN111696021B (en) Image self-adaptive steganalysis system and method based on significance detection
Niloy et al. CFL-Net: image forgery localization using contrastive learning
CN113537027A (en) Face depth forgery detection method and system based on facial segmentation
CN111832405A (en) Face recognition method based on HOG and depth residual error network
CN111222447A (en) Living body detection method based on neural network and multichannel fusion LBP (local binary pattern) characteristics
CN112560989A (en) Artificial intelligence anti-counterfeiting image identification method and system based on big data
CN112801037A (en) Face tampering detection method based on continuous inter-frame difference
CN116958637A (en) Training method, device, equipment and storage medium of image detection model
CN112651319B (en) Video detection method and device, electronic equipment and storage medium
CN112330562B (en) Heterogeneous remote sensing image transformation method and system
CN111881803A (en) Livestock face recognition method based on improved YOLOv3
CN116229528A (en) Living body palm vein detection method, device, equipment and storage medium
CN111931689B (en) Method for extracting video satellite data identification features on line
CN114332536A (en) Forged image detection method, system and storage medium based on posterior probability
Xia et al. Learning a saliency evaluation metric using crowdsourced perceptual judgments
CN113158838B (en) Full-size depth map supervision-based face representation attack detection method
CN116863545A (en) Face video recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant