CN114842524A - Face counterfeit identification method based on irregular salient pixel clusters - Google Patents


Info

Publication number
CN114842524A
CN114842524A (application CN202210260013.7A / CN202210260013A); granted publication CN114842524B
Authority
CN
China
Prior art keywords
face
image
real
forged
significant
Prior art date
Legal status
Granted
Application number
CN202210260013.7A
Other languages
Chinese (zh)
Other versions
CN114842524B (en
Inventor
殷光强
李超
王治国
米尔卡米力江·亚森
李梦媛
刘学婷
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority claimed from CN202210260013.7A
Publication of CN114842524A
Application granted
Publication of CN114842524B
Legal status: Active
Anticipated expiration

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a face counterfeit identification method based on irregular salient pixel clusters, which comprises the following steps. First, real and forged face images are acquired and sequentially preprocessed and coarsely processed; the texture information of shallow features in the resulting face candidate regions is enhanced, and saliency detection is performed to obtain texture-based saliency maps for real and forged faces. Next, multiple attention modules generate multiple attention maps, each focusing on a different region of the input feature map; the texture-based real and forged face saliency maps are fused with these attention maps respectively, and the fused features are merged, passed through normalized average pooling, and stacked to obtain a texture saliency matrix. Finally, a global depth feature is obtained from the multiple attention maps, and the global depth feature and the texture saliency matrix are fed into a classifier to discriminate real from forged faces. The invention effectively improves the speed and accuracy of face forgery detection.

Description

Face counterfeit identification method based on irregular salient pixel clusters
Technical Field
The invention belongs to the technical field of face detection, and particularly relates to a face counterfeit identification method based on irregular saliency pixel clusters.
Background
With the rapid development of generative models, face forgery technology has made remarkable progress in recent years and can produce high-quality fake faces that are difficult for the human eye to distinguish; once exploited maliciously, such fakes can cause serious social problems or political threats. To reduce these risks, face counterfeit identification methods have been widely introduced. Most of them either model face forgery detection as a generic binary classification problem (real/fake) or tend to detect forgery only from the central area of the face. Methods modeled as ordinary binary classification focus the network on extracting complex global features and feeding them into a binary classifier, and therefore suffer from low detection speed. For methods that detect forgery from the central area of the face, the shapes and sizes of the forgeries generated by different face forgery algorithms differ greatly: for example, the representative forgeries generated by Deepfakes and Face2Face usually modify the face boundary, while those generated by StyleGAN and PGGAN flexibly modify the whole face, so the detection accuracy is also low.
In addition, the document with publication number CN113536990A discloses a method for identifying deeply forged face data. By introducing a multi-modal multi-scale Transformer (M2TR), the method operates on image blocks of different sizes and uses a multi-scale Transformer to detect local context inconsistencies at different scales. To improve detection results and robustness to image compression, M2TR also introduces frequency information and combines it with RGB features through a cross-modal fusion module. Although this method can detect forged faces, in practice it must divide the feature map into spatial image blocks of different sizes and compute self-attention between the blocks with different heads. Because blocks of different sizes overlap in area, the network performs a large amount of redundant computation, which increases computation time and reduces the detection speed on images. Meanwhile, many image blocks must be divided, and the choice of block scales carries great uncertainty, so the extracted multi-scale information is insufficient and ineffective, which harms the accuracy of the network.
For the above reasons, the existing face-forgery-detection method cannot efficiently and accurately identify a forged face and a real face. Therefore, a forgery detection method capable of quickly and accurately identifying a true and false face is urgently needed to improve the speed and accuracy of face forgery detection.
Disclosure of Invention
The invention aims to overcome the above technical problems in the prior art and provides a face counterfeit identification method based on irregular salient pixel clusters, built on the combination of a visual attention mechanism and salient-region detection. On the one hand, because the differences between real and forged face images are usually subtle and local, face forgery detection is redefined as a special fine-grained binary classification problem so that the network attends to different local features; at the same time, because artifacts caused by forgery methods are obvious in texture features, the utilization of these features is specifically attended to and enhanced. On the other hand, the description of forgeries of different shapes and sizes, which is effective for the face forgery detection problem, is improved, thereby improving the speed and accuracy of face forgery detection.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A face counterfeit identification method based on irregular salient pixel clusters, characterized by comprising the following steps:
the method comprises the following steps: acquiring a plurality of real face images and forged face images, and respectively preprocessing the acquired images to obtain real face standard images and forged face standard images;
step two: rough processing is respectively carried out on the real human face standard image and the forged human face standard image by using the skin color prior knowledge and the geometric prior knowledge of the human face to obtain the real human face standard image and the forged human face standard image which are marked with human face candidate regions;
step three: enhancing texture information of shallow features in the face candidate area through a backbone network, and performing significance detection on the real face standard image and the forged face standard image with the enhanced texture information to obtain significance image features of the face candidate area; processing the salient image characteristics of the face candidate region to reserve fine artifacts in shallow layer characteristics to the maximum extent, and obtaining a real face salient image and a forged face salient image based on texture information after the processing is finished;
step four: extracting a characteristic diagram of a specific layer in the backbone network, inputting the characteristic diagram into a multi-attention module, and generating a plurality of attention diagrams focusing on different areas of the input characteristic diagram through the multi-attention module;
step five: respectively fusing the results obtained in the step three with a plurality of attention maps, and respectively obtaining the texture enhanced local significant characteristics based on the real face significant map and the texture enhanced local significant characteristics based on the forged face significant map after fusion; merging the local significant characteristics of the texture enhancement of each real face significant image and merging the local significant characteristics of the texture enhancement of each fake face significant image, performing standardized average pooling on the merged local significant characteristics, and stacking the pooled standardized significant characteristics together to obtain a texture significant matrix;
step six: and splicing the plurality of attention maps obtained in the fourth step and acting on the feature map of the last layer of the backbone network feature extraction layer to obtain global depth features, and sending the global depth features and the texture significant matrix obtained in the fifth step into a classifier together to realize the identification of real and forged faces.
In the first step, a real-face data set common in the field is used to obtain real face images of different visual qualities, the FaceForensics++ face forgery data set is used to obtain forged face images, and hundreds of thousands of images are obtained in total.
The preprocessing in step one means using a normalize operation to process and transform the real face images and the forged face images respectively; after processing, the real face standard images and the forged face standard images are obtained.
The step two of rough processing by using the skin color priori knowledge and the geometric priori knowledge of the human face refers to the following steps: establishing a skin color model, calculating the skin color similarity of each pixel point in the image to obtain a skin color outline, and detecting and marking a face candidate region by using the geometric information of the face.
The concrete treatment process of the third step is as follows: inputting a real face standard image and a forged face standard image marked with a face candidate region into a backbone network, and enhancing texture information of shallow features in the face candidate region based on a residual error network thought; then, carrying out significance detection on the real face standard image and the forged face standard image with the enhanced texture information by using an attention mechanism and ConvLSTM, and obtaining significance image characteristics of the face candidate region after the detection is finished; and then, calculating and coding the salient image characteristics of the obtained face candidate region by using the spatial attention, and realizing the maximum retention of fine artifacts in shallow features, thereby obtaining a real face salient image and a forged face salient image based on texture information.
The multi-attention module in the fourth step is a lightweight model consisting of a 1 × 1 convolution layer, a batch normalization layer and a nonlinear ReLU activation layer.
And fifthly, after the texture enhanced local significant representation based on the real face significant image and the texture enhanced local significant representation based on the fake face significant image are obtained, converting the local significant representations into sequence data by taking pixel points as units, processing the sequence data by using a structure based on a Transformer, and then merging the sequence data.
The method for obtaining the global depth features in the sixth step comprises the following steps: and firstly splicing the plurality of attention diagrams to obtain a single-channel attention diagram, and then acting the single-channel attention diagram on the feature diagram of the last layer of the backbone network feature extraction layer to obtain the global depth feature.
By adopting the technical scheme, the invention has the beneficial technical effects that:
1. On the one hand, because the differences between real and forged face images are subtle and local, the invention redefines face forgery detection as a special fine-grained binary classification problem so that the network attends to different local features; meanwhile, artifacts caused by forgery methods are obvious in texture features, and these features are specifically attended to and enhanced. On the other hand, the description of forgeries of different shapes and sizes, which is effective for the face forgery detection problem, is improved, thereby improving the speed and accuracy of face forgery detection.
Further, the advantages of the steps of the invention are as follows:
the method has the advantages that a plurality of real face images and fake face images are obtained in the first step, and disturbance of the data to the model counterfeit identification performance can be reduced by using abundant and sufficient data sets. And the acquired images are respectively preprocessed, so that the generation of overfitting can be effectively prevented.
In step two, the skin color prior knowledge and the geometric prior knowledge of the human face are used to coarsely process the real and forged face standard images respectively. The coarse processing refers to detecting the face in the image; face detection is the precondition and basis of face counterfeit identification, and to realize efficient identification, the prior knowledge must be used to accurately detect the face in the image and form images marked with face candidate regions.
In step three, artifacts caused by face forgery methods are obvious in texture information, so the texture information of the shallow features in the face candidate region is enhanced through the backbone network. This is equivalent to enhancing the representation of texture information within the shallow image features; the texture information can be regarded as the high-frequency component of the shallow features, and the resulting texture-enhanced shallow features are more favorable for improving the counterfeit identification effect.
In addition, the saliency detection in this step simulates the visual attention mechanism of the human visual system, so that the backbone network focuses on important regions of the face to learn the salient image features of the face candidate region; subsequent counterfeit identification is then performed on the basis of texture enhancement with a reduced amount of computation, which improves the processing speed of the backbone network. Using spatial attention to compute and encode the salient image features of the face candidate region retains the fine artifacts in the shallow features to the maximum extent, further improving the identification effect.
In step four, the feature map of a specific layer in the backbone network is first extracted and input into the multi-attention module, which then generates multiple attention maps focusing on different areas of the input feature map. The different areas correspond to regions such as the eyes and mouth; through these semantic features, this step further guides the subsequent backbone network to focus on different local regions, which are strongly discriminative for face forgery detection.
In step five, the texture-based saliency maps are fused with the multiple attention maps that attend to different local areas, so that the backbone network effectively mines the texture information within different local regions and can capture the artifacts produced by forgery methods in the different, strongly discriminative local regions. Meanwhile, these local regions carry different irregular structural information; detecting artifacts with a Transformer-based structure in this step effectively improves the description of irregular forgeries and thus the accuracy of face counterfeit identification.
In step six, the texture saliency matrix and the global depth feature are fed into the classifier together, realizing the complementarity of local and global features; on this basis, features with strong discriminative ability and high robustness can be constructed, effectively improving the overall counterfeit identification performance of the network.
2. By enhancing the expression of texture information, the method retains more of the artifacts produced by forgery, and aggregates shallow texture features with deep semantic features as the discriminative representation of each part, preventing the subtle differences from vanishing in the deep layers.
3. The invention improves the fine-grained classification effect by promoting the description of forgeries of different shapes and sizes, and finally realizes fast and accurate forgery detection for identifying real and fake faces.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a model framework of the present invention.
FIG. 3 is a schematic diagram of a key structure for significance detection according to the present invention.
FIG. 4 is a schematic structural diagram of a multi-attention module according to the present invention.
Detailed Description
Example 1
The invention discloses a face counterfeit identification method based on irregular salient pixel clusters; the flow chart of the method is shown in Fig. 1 and the model framework in Fig. 2. The method specifically comprises the following steps:
the method comprises the following steps: and acquiring a plurality of real face images and forged face images, and respectively preprocessing the acquired images to obtain real face standard images and forged face standard images.
It should be noted that in this step, considering the influence of the environment in practical applications, it is preferable to use a real-face data set common in the field to acquire real face images of different visual qualities. For forged face images, the FaceForensics++ face forgery data set, which covers different forgery methods and provides multiple quality grades, is preferably used; hundreds of thousands of face images are obtained in total, and using such a rich and sufficient data set reduces the disturbance of the data to the model's counterfeit identification performance.
Further, preprocessing the acquired images means processing and transforming the real face images and the forged face images with a normalize operation; after processing, the real face standard images and the forged face standard images are obtained. Using the normalize operation effectively prevents overfitting.
Step two: and performing rough processing on the real human face standard image and the forged human face standard image respectively by using the skin color prior knowledge and the geometric prior knowledge of the human face to obtain the real human face standard image and the forged human face standard image which are marked with the human face candidate regions.
In this step, the rough processing using the skin color prior knowledge and the geometric prior knowledge of the human face means: establishing a skin color model, calculating the skin color similarity of each pixel point in the image to obtain a skin color outline, and detecting and marking a face candidate region by using the geometric information of the face. It should be noted that the rough processing here is mainly to detect the face in the image, the face detection is a precondition and a basis for face forgery identification, and in order to realize efficient face counterfeit identification, the prior knowledge needs to be used to accurately detect the face in the image to form an image marked with a face candidate region.
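The patent only states that a skin color model is established and the skin-color similarity of each pixel is computed; the sketch below is one common instantiation of such a model, using a 2-D Gaussian in CbCr space. The color space, mean, and covariance values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def skin_likelihood(image_rgb,
                    mean=(117.4, 148.6),
                    cov=((97.0, 24.5), (24.5, 141.8))):
    """Per-pixel skin-color similarity under a 2-D Gaussian model in CbCr space.

    image_rgb : (H, W, 3) uint8 RGB image.
    Returns an (H, W) map in (0, 1]; higher means more skin-like.
    The mean/covariance are illustrative, not the patent's values.
    """
    img = image_rgb.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # RGB -> CbCr (BT.601 conversion)
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    x = np.stack([cb, cr], axis=-1) - np.asarray(mean)
    inv = np.linalg.inv(np.asarray(cov))
    # squared Mahalanobis distance -> Gaussian similarity
    d2 = np.einsum('...i,ij,...j->...', x, inv, x)
    return np.exp(-0.5 * d2)
```

Thresholding this map yields the skin-color contour, after which geometric face constraints would mark the candidate region, as the step describes.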
Step three: enhancing texture information of shallow features in the face candidate area through a backbone network, and performing significance detection on the real face standard image and the forged face standard image with the enhanced texture information to obtain significance image features of the face candidate area; and processing the salient image characteristics of the face candidate region to furthest reserve the fine artifacts in the shallow features, and obtaining a real face salient image and a forged face salient image based on texture information after the processing is finished.
The backbone network in this step is not limited, and any network may be used as long as it can enhance the image texture information, and based on this, for example, EfficientNet-64 may be used as the backbone network. The specific treatment process comprises the following steps: inputting a real face standard image and a forged face standard image marked with a face candidate region into a backbone network, and enhancing texture information of shallow features in the face candidate region based on a residual error network thought; then, carrying out significance detection on the real face standard image and the forged face standard image with the enhanced texture information by using an attention mechanism and ConvLSTM, and obtaining significance image characteristics of the face candidate region after the detection is finished; and then, calculating and coding the salient image characteristics of the obtained face candidate region by using the spatial attention, and realizing the maximum retention of fine artifacts in shallow features, thereby obtaining a real face salient image and a forged face salient image based on texture information.
It should be noted that, in this step, since artifacts caused by the face forgery method are significant in texture information, the texture information of the shallow features needs to be enhanced. The detailed process comprises the following steps: inputting the roughly processed real face standard image and the forged face standard image into a backbone network, applying local average pooling to shallow features extracted by the backbone network for down sampling to obtain pooled features D, and acquiring feature representation of prominent texture information based on the idea of a residual error network, wherein the formula is as follows:
T_L = f_L(I) - D
where the subscript L denotes the shallow level, T_L denotes the feature that best expresses texture information, f_L(I) denotes the original shallow feature, and D denotes the pooled feature. A DenseBlock is then used to enhance the expression of T_L, yielding the texture-enhanced feature.
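The residual operation above (subtracting a locally average-pooled, i.e. low-frequency, version of the shallow feature to keep its high-frequency texture component) can be sketched as follows. The pooling window size is an assumption; the subsequent DenseBlock enhancement is omitted.

```python
import numpy as np

def texture_residual(feat, pool=2):
    """High-frequency texture residual T_L = f_L(I) - D.

    feat : (C, H, W) shallow feature map f_L(I); H and W divisible by `pool`.
    D is the locally average-pooled (then nearest-neighbour upsampled)
    low-frequency part, so the residual keeps texture-like high frequencies.
    """
    c, h, w = feat.shape
    # local average pooling over non-overlapping pool x pool windows -> D
    d = feat.reshape(c, h // pool, pool, w // pool, pool).mean(axis=(2, 4))
    # upsample D back to (C, H, W) so it can be subtracted element-wise
    d_up = np.repeat(np.repeat(d, pool, axis=1), pool, axis=2)
    return feat - d_up
```

A constant (texture-free) feature map yields an all-zero residual, which matches the intent of isolating texture information.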
After texture enhancement, saliency detection is performed using an attention mechanism and ConvLSTM; to accurately capture salient regions that retain artifacts well, inspired by mechanisms of the human brain, the detail information of the detection result is derived and optimized in a memory-oriented understanding manner. The structure is shown schematically in Fig. 3, in which the subscript t denotes the time step in the ConvLSTM, H_t denotes the memory of the previous image understanding in temporal order, and F̃ denotes the more representative features after weighting. To emphasize how the relations between different pixels affect the final judgment of the salient region, spatial attention is applied, yielding the texture-based real and forged face saliency maps.
Step four: and constructing a multi-attention module which is a lightweight model and consists of a 1 x 1 convolution layer, a batch normalization layer and a nonlinear activation layer ReLU. And then extracting a characteristic diagram of a specific layer in the backbone network, inputting the characteristic diagram into a multi-attention module, and generating a plurality of attention diagrams focusing on different areas of the input characteristic diagram through the multi-attention module.
It should be noted that the preceding steps focus on and enhance the utilization of shallow features while neglecting deep semantic features; semantic features can further guide the subsequent network to focus on different local regions, which are strongly discriminative for face counterfeit identification. The specific layer here therefore refers to the fourth or fifth layer of the backbone network. The feature map is input into the multi-attention module, which generates multiple attention maps focusing on different areas of the input feature map, i.e. the multi-attention map A used to locate discriminative local regions. The module is shown schematically in Fig. 4, in which the subscript L denotes the deep level and A_k denotes the k-th attention map, corresponding to a specific locally discriminative region such as the eyes, the mouth, or even the blending boundary of two face images.
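The multi-attention module described above (a 1 × 1 convolution, batch normalization, and ReLU producing K attention maps) can be sketched as below. Here the 1 × 1 convolution is written as a per-pixel channel mixing, and the normalization is computed over spatial positions of a single sample; the weights are illustrative stand-ins for the learned parameters.

```python
import numpy as np

def multi_attention(feat, weight, bias, eps=1e-5):
    """Multi-attention module: 1x1 conv -> normalization -> ReLU.

    feat   : (C, H, W) feature map from a middle layer of the backbone
    weight : (K, C) 1x1 convolution kernels, one per attention map
    bias   : (K,) biases
    Returns K non-negative attention maps of shape (K, H, W).
    """
    # a 1x1 convolution is a linear map over channels at every pixel
    a = np.einsum('kc,chw->khw', weight, feat) + bias[:, None, None]
    # per-map normalization over spatial positions (single-sample batch norm)
    mu = a.mean(axis=(1, 2), keepdims=True)
    var = a.var(axis=(1, 2), keepdims=True)
    a = (a - mu) / np.sqrt(var + eps)
    return np.maximum(a, 0.0)  # ReLU keeps only positively activated regions
```

Each output channel A_k then plays the role of one attention map attending to a distinct local region.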
Step five: fuse the results obtained in step three with the multiple attention maps respectively; after fusion, obtain the texture-enhanced local salient representations based on the real face saliency map and those based on the forged face saliency map. After obtaining them, convert the local salient representations into sequence data pixel by pixel and process the sequences with a Transformer-based structure; then merge the texture-enhanced local salient representations of each real face saliency map and, separately, those of each forged face saliency map, apply normalized average pooling to the merged representations, and stack the pooled normalized salient features together to obtain the texture saliency matrix.
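One plausible reading of the fuse-merge-pool-stack pipeline in step five is sketched below: each attention map is fused element-wise with the saliency map, the texture-enhanced features are average-pooled under that fused weighting, and the per-region descriptors are normalized and stacked into a matrix. The element-wise fusion and L2 normalization are assumptions; the Transformer processing of the pixel sequences is omitted.

```python
import numpy as np

def texture_saliency_matrix(saliency, attn_maps, feat):
    """Fuse a texture-based saliency map with K attention maps, then pool.

    saliency  : (H, W) saliency map from step three
    attn_maps : (K, H, W) attention maps from step four
    feat      : (C, H, W) texture-enhanced shallow features
    Returns a (K, C) texture saliency matrix: one normalized, average-pooled
    local salient descriptor per attended region.
    """
    fused = saliency[None] * attn_maps                 # (K, H, W) region weights
    # attention-weighted average pooling of the features for each region
    num = np.einsum('khw,chw->kc', fused, feat)
    den = fused.sum(axis=(1, 2))[:, None] + 1e-8
    desc = num / den                                   # (K, C) descriptors
    # normalize each descriptor, then stack rows into the final matrix
    return desc / (np.linalg.norm(desc, axis=1, keepdims=True) + 1e-8)
```

Each row of the returned matrix corresponds to one local region's pooled, normalized texture-salient feature.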
Step six: and splicing the plurality of attention maps obtained in the fourth step and acting on the feature map of the last layer of the backbone network feature extraction layer to obtain global depth features, and sending the global depth features and the texture significant matrix obtained in the fifth step into a classifier together to realize the identification of real and forged faces.
The method for obtaining the global depth feature in the step comprises the following steps: and firstly splicing the plurality of attention diagrams to obtain a single-channel attention diagram, and then acting the single-channel attention diagram on the feature diagram of the last layer of the backbone network feature extraction layer to obtain the global depth feature.
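The global-depth-feature step above can be sketched as follows. The patent says the attention maps are spliced into a single-channel map that "acts on" the last-layer feature map; averaging over the K maps and global average pooling after the weighting are assumptions used to make the sketch concrete.

```python
import numpy as np

def global_depth_feature(attn_maps, last_feat):
    """Collapse K attention maps to one channel and weight the deep features.

    attn_maps : (K, H, W) attention maps from step four
    last_feat : (C, H, W) feature map of the backbone's last layer
    Returns a C-dimensional global depth descriptor.
    """
    single = attn_maps.mean(axis=0, keepdims=True)   # (1, H, W) single channel
    weighted = last_feat * single                    # broadcast over C channels
    return weighted.mean(axis=(1, 2))                # global average pool -> (C,)
```

This descriptor, concatenated with the texture saliency matrix, would then be fed to the classifier to complement the local features with global context.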
In summary, because the differences between real and forged face images are often subtle and local, the invention realizes face counterfeit identification by combining a visual attention mechanism with salient-region detection: face forgery detection is redefined as a special fine-grained binary classification problem so that the network attends to different local features, and because artifacts caused by forgery methods are obvious in texture features, the utilization of these features is specifically attended to and enhanced. In addition, by improving the description of forgeries of different shapes and sizes, which is effective for the face forgery detection problem, the invention effectively improves the speed and accuracy of face forgery detection.
Example 2
This example verifies the method described in example 1 as follows:
the method comprises the steps of obtaining real face image samples with different visual qualities in a real face data set which is universal in the field, obtaining forged face image samples in a faceforces + + face forging data set, and obtaining 50000 real face images and 50000 forged face images respectively in order to achieve balance of true and false labels and good model generalization performance. The method comprises the steps of carrying out false identification on a face image by the method in the embodiment 1, preprocessing the face image in the step one to obtain a standard image, carrying out face detection on the standard image by using priori knowledge in the step two, inputting the image into a backbone network in the step three, obtaining a face saliency map based on texture information by using EfficientNet-64 as the backbone network, obtaining a multi-attention map in the step four, combining the face saliency map in the step three and the multi-attention map in the step four by the step five, processing and combining the face saliency map and the multi-attention map to obtain a texture saliency matrix, and finally identifying a real face and a forged face in the step six.
The result shows that the counterfeit identification accuracy rate, namely the evaluation index ACC, reaches 97.6%, and the counterfeit identification speed, namely the evaluation index FPS, reaches 213.8.
In addition, to evaluate the importance of the texture information, step three was ablated so that the texture information of the shallow features was not enhanced; since the artifacts are obvious in the texture information, the perception of artifacts in the deep layers of the network is lost, and the accuracy of the network drops by 4.8%.
To evaluate the effectiveness of multiple attentions, steps four and five were ablated, i.e. only a single attention was used; since the differences between real and fake faces are usually subtle and local, a single attention can hardly capture the forgery, and the accuracy of the network drops by 1.6%.
In summary, the present invention effectively improves the speed and accuracy of face forgery detection by promoting the description of forgery of different shapes and sizes effective to the face forgery detection problem.
The above are merely embodiments of the present invention. Unless stated otherwise, any feature disclosed in this specification may be replaced by alternative features serving equivalent or similar purposes; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except for mutually exclusive features and/or steps.

Claims (8)

1. A face authentication method based on irregular salient pixel clusters, characterized by comprising the following steps:
step one: acquiring a plurality of real face images and forged face images, and preprocessing the acquired images respectively to obtain real face standard images and forged face standard images;
step two: roughly processing the real face standard images and the forged face standard images respectively using skin color prior knowledge and geometric prior knowledge of the human face, to obtain real face standard images and forged face standard images marked with face candidate regions;
step three: enhancing the texture information of shallow features in the face candidate regions through a backbone network, and performing saliency detection on the texture-enhanced real face standard image and forged face standard image to obtain salient image features of the face candidate regions; processing the salient image features of the face candidate regions so as to retain the fine artifacts in the shallow features to the maximum extent, thereby obtaining a real face saliency map and a forged face saliency map based on texture information;
step four: extracting the feature map of a specific layer of the backbone network, inputting it into a multi-attention module, and generating, through the multi-attention module, a plurality of attention maps focusing on different regions of the input feature map;
step five: fusing the results obtained in step three with the plurality of attention maps respectively, thereby obtaining texture-enhanced local salient representations based on the real face saliency map and texture-enhanced local salient representations based on the forged face saliency map; merging the texture-enhanced local salient representations of each real face saliency map, and merging those of each forged face saliency map; performing normalized average pooling on the merged local salient representations, and stacking the pooled normalized salient representations together to obtain a texture salient matrix;
step six: splicing the plurality of attention maps obtained in step four and applying the result to the feature map of the last layer of the backbone network's feature extraction layers to obtain global depth features, and feeding the global depth features together with the texture salient matrix obtained in step five into a classifier to distinguish real and forged faces.
2. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein: in step one, a real-face data set commonly used in the field is used to obtain real face images of different visual qualities, the FaceForensics++ face-forgery data set is used to obtain forged face images, and the number of obtained images is on the order of hundreds of thousands.
3. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein: the preprocessing in step one uses the Normalize operation to process and transform the real face images and the forged face images respectively, yielding the real face standard images and the forged face standard images.
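A plausible reading of the Normalize operation in claim 3 is the standard per-channel standardization used in vision pipelines (e.g. torchvision's `transforms.Normalize`); the ImageNet mean/std values below are an assumption, since the patent does not specify them:

```python
import numpy as np

# Per-channel Normalize, as commonly used in vision pipelines. The
# mean/std below are the usual ImageNet statistics — an assumption,
# since the patent does not state which statistics are used.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """img: H x W x 3 float array in [0, 1] -> standardized image."""
    return (img - MEAN) / STD

img = np.full((4, 4, 3), 0.5)       # uniform grey test image
out = normalize(img)
```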
4. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein the rough processing using skin color prior knowledge and geometric prior knowledge of the human face in step two refers to: establishing a skin color model, calculating the skin color similarity of each pixel in the image to obtain a skin color contour, and detecting and marking the face candidate regions using the geometric information of the human face.
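Claim 4 does not fix a particular skin color model; one common choice, shown here purely as an illustrative assumption, is a Gaussian over the Cb/Cr chrominance plane. The cluster center and covariance below are typical literature values, not parameters from the patent:

```python
import numpy as np

# Assumed skin-colour model: Gaussian in the Cb/Cr plane. Pixels whose
# chrominance is close to the skin cluster get similarity near 1.
CBCR_MEAN = np.array([117.4, 156.6])                 # assumed cluster centre
CBCR_COV_INV = np.linalg.inv(np.array([[160.1, 12.1],
                                       [12.1, 299.5]]))

def rgb_to_cbcr(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128 - 0.169 * r - 0.331 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.419 * g - 0.081 * b
    return np.stack([cb, cr], axis=-1)

def skin_similarity(rgb):
    d = rgb_to_cbcr(rgb) - CBCR_MEAN
    m = np.einsum('...i,ij,...j->...', d, CBCR_COV_INV, d)  # Mahalanobis dist.
    return np.exp(-0.5 * m)                                 # similarity in (0, 1]

img = np.zeros((2, 2, 3))
img[0, 0] = [200.0, 120.0, 100.0]    # skin-like pixel; the rest are black
sim = skin_similarity(img)
```

Thresholding such a similarity map yields the skin color contour, after which geometric face constraints select the candidate regions.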
5. The face authentication method based on irregular salient pixel clusters according to any one of claims 1 to 4, wherein the concrete process of step three is as follows: inputting the real face standard image and the forged face standard image marked with face candidate regions into the backbone network, and enhancing the texture information of the shallow features in the face candidate regions based on the residual network idea; then performing saliency detection on the texture-enhanced real face standard image and forged face standard image using an attention mechanism and ConvLSTM, obtaining the salient image features of the face candidate regions after detection; and then computing and encoding the obtained salient image features of the face candidate regions using spatial attention, so that the fine artifacts in the shallow features are retained to the maximum extent, thereby obtaining the real face saliency map and the forged face saliency map based on texture information.
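The residual-style texture enhancement of claim 5 can be illustrated with a minimal sketch: the high-frequency residual (input minus a local average) is added back onto the shallow feature, and a softmax spatial attention re-weights the result. The 3×3 mean filter and the softmax are simplifying assumptions, and the ConvLSTM-based saliency stage is omitted entirely:

```python
import numpy as np

# Sketch of residual-style texture enhancement: amplify the
# high-frequency component, where forgery artifacts tend to live.
# The 3x3 box blur and softmax spatial attention are assumptions.

def box_blur(x, k=3):
    pad = k // 2
    p = np.pad(x, pad, mode='edge')
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += p[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (k * k)

def enhance_texture(feat):
    residual = feat - box_blur(feat)      # high-frequency texture component
    return feat + residual                # residual connection boosts texture

def spatial_attention(feat):
    w = np.exp(feat - feat.max())
    return w / w.sum()                    # softmax over all spatial positions

feat = np.random.default_rng(0).random((8, 8))
enhanced = enhance_texture(feat)
att = spatial_attention(enhanced)
saliency = att * enhanced                 # attention-weighted saliency sketch
```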
6. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein: the multi-attention module in step four is a lightweight model composed of a 1×1 convolution layer, a batch normalization layer and a ReLU nonlinear activation layer.
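The multi-attention module of claim 6 (1×1 convolution, batch normalization, ReLU) is small enough to sketch directly in numpy: a 1×1 convolution is just a per-pixel linear map over channels. The random weights below are placeholders for the trained ones:

```python
import numpy as np

# Minimal numpy sketch of the claim-6 multi-attention module:
# 1x1 conv -> batch norm -> ReLU, producing several attention maps.

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """x: (C, H, W), w: (K, C) -> (K, H, W); a 1x1 conv is a channel matmul."""
    return np.einsum('kc,chw->khw', w, x)

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def multi_attention(feat, num_maps=4):
    w = rng.standard_normal((num_maps, feat.shape[0]))  # placeholder weights
    return relu(batch_norm(conv1x1(feat, w)))           # num_maps attention maps

feat = rng.random((16, 8, 8))        # C=16 input feature map
atts = multi_attention(feat)
```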
7. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein: in step five, after obtaining the texture-enhanced local salient representations based on the real face saliency map and those based on the forged face saliency map, the local salient representations are converted into sequence data pixel by pixel, processed with a Transformer-based structure, and then merged.
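Claim 7's pixel-wise sequence processing can be illustrated with a toy single-head self-attention: the feature map is flattened to one token per pixel and attended over. The projection weights are random placeholders, not the patented model:

```python
import numpy as np

# Toy illustration of claim 7: flatten a feature map into a pixel-wise
# token sequence and apply one head of scaled dot-product self-attention.

rng = np.random.default_rng(0)

def to_sequence(feat):
    """(C, H, W) -> (H*W, C): one token per pixel."""
    c, h, w = feat.shape
    return feat.reshape(c, h * w).T

def self_attention(seq, d=8):
    n, c = seq.shape
    wq, wk, wv = (rng.standard_normal((c, d)) for _ in range(3))
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    scores = q @ k.T / np.sqrt(d)
    att = np.exp(scores - scores.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)     # softmax over tokens
    return att @ v

feat = rng.random((16, 4, 4))
seq = to_sequence(feat)              # 16 tokens of dimension 16
out = self_attention(seq)
```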
8. The face authentication method based on irregular salient pixel clusters according to claim 1, wherein the global depth features in step six are obtained as follows: the plurality of attention maps are first spliced to obtain a single-channel attention map, and the single-channel attention map is then applied to the feature map of the last layer of the backbone network's feature extraction layers to obtain the global depth features.
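Claim 8 leaves the splicing operator abstract; in the sketch below it is read as an element-wise maximum over the attention maps, followed by attention-weighted global pooling of the last feature map. Both choices are assumptions for illustration only:

```python
import numpy as np

# Sketch of claim 8: merge attention maps into one single-channel map
# ("splicing" taken as element-wise max — an assumption), then pool the
# last-layer feature map with it to get a global depth feature vector.

def global_depth_feature(att_maps, feat):
    """att_maps: (K, H, W); feat: (C, H, W) -> (C,) global feature."""
    single = att_maps.max(axis=0)                   # merged single-channel map
    single = single / (single.sum() + 1e-8)         # normalize attention weights
    return np.einsum('hw,chw->c', single, feat)     # weighted global pooling

rng = np.random.default_rng(0)
atts = rng.random((4, 8, 8))         # K=4 attention maps
feat = rng.random((32, 8, 8))        # C=32 last-layer feature map
g = global_depth_feature(atts, feat)
```

The resulting vector would be concatenated with the texture salient matrix before classification.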
CN202210260013.7A 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster Active CN114842524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210260013.7A CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210260013.7A CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Publications (2)

Publication Number Publication Date
CN114842524A true CN114842524A (en) 2022-08-02
CN114842524B CN114842524B (en) 2023-03-10

Family

ID=82562832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210260013.7A Active CN114842524B (en) 2022-03-16 2022-03-16 Face false distinguishing method based on irregular significant pixel cluster

Country Status (1)

Country Link
CN (1) CN114842524B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720A (en) * 2022-08-11 2022-11-08 山东省人工智能研究院 Deepfake generation method based on Transformer
CN117557889A (en) * 2023-10-13 2024-02-13 中国信息通信研究院 Image forgery detection method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110135166A1 (en) * 2009-06-02 2011-06-09 Harry Wechsler Face Authentication Using Recognition-by-Parts, Boosting, and Transduction
CN107423690A (en) * 2017-06-26 2017-12-01 广东工业大学 A kind of face identification method and device
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN111368796A (en) * 2020-03-20 2020-07-03 北京达佳互联信息技术有限公司 Face image processing method and device, electronic equipment and storage medium
CN111461089A (en) * 2020-06-17 2020-07-28 腾讯科技(深圳)有限公司 Face detection method, and training method and device of face detection model
CN112163511A (en) * 2020-09-25 2021-01-01 天津大学 Method for identifying authenticity of image
CN112215346A (en) * 2020-10-20 2021-01-12 陈永聪 Implementation method of humanoid general artificial intelligence
CN112686331A (en) * 2021-01-11 2021-04-20 中国科学技术大学 Forged image recognition model training method and forged image recognition method
CN112883874A (en) * 2021-02-22 2021-06-01 中国科学技术大学 Active defense method aiming at deep face tampering
CN113011332A (en) * 2021-03-19 2021-06-22 中国科学技术大学 Face counterfeiting detection method based on multi-region attention mechanism
CN113627233A (en) * 2021-06-17 2021-11-09 中国科学院自动化研究所 Visual semantic information-based face counterfeiting detection method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUXIN ZHANG et al.: "Deepfake Detection Based on Incompatibility Between Multiple Modes", 2021 International Conference on Intelligent Technology and Embedded Systems *
LIANG Ruigang et al.: "A Survey of Audio-Visual Deepfake Detection Techniques", Journal of Cyber Security *
CHEN Peng et al.: "Forged Face Video Detection Method Fusing Global Temporal and Local Spatial Features", Journal of Cyber Security *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311720A (en) * 2022-08-11 2022-11-08 山东省人工智能研究院 Deepfake generation method based on Transformer
CN115311720B (en) * 2022-08-11 2023-06-06 山东省人工智能研究院 Deepfake generation method based on Transformer
CN117557889A (en) * 2023-10-13 2024-02-13 中国信息通信研究院 Image forgery detection method and system

Also Published As

Publication number Publication date
CN114842524B (en) 2023-03-10

Similar Documents

Publication Publication Date Title
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN114842524B (en) Face false distinguishing method based on irregular significant pixel cluster
CN112818862A (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN111126240B (en) Three-channel feature fusion face recognition method
CN114694220B (en) Double-flow face counterfeiting detection method based on Swin Transformer
CN108520215B (en) Single-sample face recognition method based on multi-scale joint feature encoder
CN112215043A (en) Human face living body detection method
CN113537027B (en) Face depth counterfeiting detection method and system based on face division
Niloy et al. CFL-Net: image forgery localization using contrastive learning
CN111832405A (en) Face recognition method based on HOG and depth residual error network
CN111696021A (en) Image self-adaptive steganalysis system and method based on significance detection
CN115240280A (en) Construction method of human face living body detection classification model, detection classification method and device
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
CN116958637A (en) Training method, device, equipment and storage medium of image detection model
CN116229528A (en) Living body palm vein detection method, device, equipment and storage medium
CN114937298A (en) Micro-expression recognition method based on feature decoupling
CN112651319B (en) Video detection method and device, electronic equipment and storage medium
CN111881803B (en) Face recognition method based on improved YOLOv3
CN112330562A (en) Heterogeneous remote sensing image transformation method and system
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
CN114332536A (en) Forged image detection method, system and storage medium based on posterior probability
CN114596609A (en) Audio-visual counterfeit detection method and device
CN111931689A (en) Method for extracting video satellite data identification features on line
CN116311480B (en) Fake face detection method based on multichannel double-flow attention
CN115482595B (en) Specific character visual sense counterfeiting detection and identification method based on semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant