CN115272240A - Passive detection method and device for face image tampering, terminal equipment and storage medium


Info

Publication number
CN115272240A
Authority
CN
China
Prior art keywords
image
face image
dct
map
original
Prior art date
Legal status (assumption; not a legal conclusion)
Pending
Application number
CN202210909266.2A
Other languages
Chinese (zh)
Inventor
高士超 (Gao Shichao)
杨高波 (Yang Gaobo)
胡胜 (Hu Sheng)
汤应恒 (Tang Yingheng)
Current Assignee
Jiangsu Fantuo Information Technology Co ltd
Original Assignee
Jiangsu Fantuo Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Fantuo Information Technology Co., Ltd.
Priority to CN202210909266.2A
Publication of CN115272240A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/10Image enhancement or restoration using non-spatial domain filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20048Transform domain processing
    • G06T2207/20052Discrete cosine transform [DCT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application belongs to the technical field of image processing and provides a passive detection method and device for face image tampering, a terminal device, and a storage medium. The method comprises: acquiring a face image to be detected; performing frequency domain enhancement processing and compression noise suppression processing on the face image to be detected, to obtain a frequency-domain-enhanced first face image and a second face image with suppressed image compression noise; and inputting the first face image and the second face image into a face image tampering detection model for tampering detection, to obtain a detection result. The detection result indicates whether the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module. The method and device can improve the detection accuracy of face image tampering.

Description

Passive detection method and device for face image tampering, terminal equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for passively detecting face image tampering, a terminal device, and a storage medium.
Background
With the rapid development of face synthesis technology, forged face images have become increasingly lifelike, and it is difficult for the human eye to tell whether an image has been tampered with. Meanwhile, the commercialization of tampering technology has lowered the threshold for face synthesis, raising serious problems of public trust and security. In addition, during picture transmission, processing by image compression algorithms causes information loss and reduces image quality; this degradation lowers the detection accuracy of image tampering detection models, making artificial artifacts in the picture hard to distinguish.
Disclosure of Invention
The application provides a passive detection method and device for face image tampering, terminal equipment and a storage medium, and aims to improve the detection accuracy of face image tampering.
In a first aspect, an embodiment of the present application provides a method for passively detecting face image tampering, including:
acquiring a human face image to be detected;
carrying out frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a first face image with enhanced frequency domain and a second face image with suppressed image compression noise;
inputting the first face image and the second face image into a face image tampering detection model for tampering detection to obtain a detection result;
the detection result is used for prompting that the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
Optionally, performing the frequency domain enhancement processing and compression noise suppression processing on the face image to be detected, to obtain the frequency-domain-enhanced first face image and the second face image with suppressed image compression noise, includes:
converting the face image to be detected from an RGB color space to a YCbCr color space to obtain a Y channel subimage, a Cb channel subimage and a Cr channel subimage;
performing DCT transformation on the Y-channel subimage, the Cb-channel subimage and the Cr-channel subimage respectively to obtain a first original DCT image corresponding to the Y-channel subimage, a second original DCT image corresponding to the Cb-channel subimage and a third original DCT image corresponding to the Cr-channel subimage;
and obtaining a first face image with enhanced frequency domain and a second face image with suppressed image compression noise according to the first original DCT image, the second original DCT image and the third original DCT image.
Optionally, obtaining the frequency-domain enhanced first face image according to the first original DCT image, the second original DCT image, and the third original DCT image, includes:
respectively carrying out normalization processing on the first original DCT image, the second original DCT image and the third original DCT image to obtain a first DCT weight image corresponding to the first original DCT image, a second DCT weight image corresponding to the second original DCT image and a third DCT weight image corresponding to the third original DCT image;
dividing the first original DCT map, the second original DCT map and the third original DCT map into n x n square blocks respectively; wherein n is a power of 2 and n is greater than or equal to 8;
respectively, for each original DCT map among the first original DCT map, the second original DCT map and the third original DCT map, normalizing each square block in the original DCT map to obtain a frequency domain matrix weight containing n × n values, and obtaining a frequency domain weight map corresponding to the original DCT map through matrix replication and matrix expansion; the frequency domain weight map corresponding to the first original DCT map is a first frequency domain weight map, that corresponding to the second original DCT map is a second frequency domain weight map, and that corresponding to the third original DCT map is a third frequency domain weight map;
adding the first DCT weight graph, the second DCT weight graph and the third DCT weight graph with the first frequency domain weight graph, the second frequency domain weight graph and the third frequency domain weight graph according to the color channel to obtain a first enhanced weight graph corresponding to the first original DCT graph, a second enhanced weight graph corresponding to the second original DCT graph and a third enhanced weight graph corresponding to the third original DCT graph;
and obtaining a frequency domain enhanced first face image according to the first enhanced weight map, the second enhanced weight map and the third enhanced weight map.
Optionally, obtaining the frequency-domain enhanced first face image according to the first enhanced weight map, the second enhanced weight map, and the third enhanced weight map includes:
calculating a first product of the first enhanced weight map and the Y-channel sub-image, a second product of the second enhanced weight map and the Cb-channel sub-image, and a third product of the third enhanced weight map and the Cr-channel sub-image;
and performing inverse DCT transformation and RGB conversion in turn on each of the first product, the second product and the third product, to obtain the frequency-domain-enhanced first face image.
Optionally, obtaining a second face image with suppressed image compression noise according to the first original DCT image, the second original DCT image, and the third original DCT image, includes:
respectively aiming at each original DCT image in a first original DCT image, a second original DCT image and a third original DCT image, dividing the original DCT image into a first image area comprising high-frequency components and a second image area except the first area, performing integral normalization processing on the first image area, and performing sliding window normalization processing on the second image area to obtain a fourth DCT weight image corresponding to the first original DCT image, a fifth DCT weight image corresponding to the second original DCT image and a sixth DCT weight image corresponding to the third original DCT image;
and respectively carrying out inverse DCT conversion and RGB conversion on the DCT weight map according to each DCT weight map in the fourth DCT weight map, the fifth DCT weight map and the sixth DCT weight map in sequence to obtain a second face image for inhibiting image compression noise.
Optionally, the face image tampering detection model is an Xception classification model, an activation function of a feature extraction part of the Xception classification model is a parameterized ReLU activation function, other pooling layers except the last pooling layer in all pooling layers of the feature extraction part are detail-preserving pooling layers, and a multi-frequency channel attention mechanism module is connected after each separable convolution layer in the feature extraction part.
Optionally, each feature channel map of the multi-frequency channel attention mechanism module is processed by N different frequency components, where N is an integer that evenly divides the channel number C of the multi-frequency channel attention mechanism module.
In a second aspect, an embodiment of the present application provides a passive detection device for face image tampering, including:
the first acquisition module is used for acquiring a face image to be detected;
the second acquisition module is used for carrying out frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a frequency domain enhanced first face image and a second face image for suppressing image compression noise;
the detection module is used for inputting the first face image and the second face image into a face image tampering detection model for tampering detection to obtain a detection result; the detection result is used for prompting that the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above-mentioned passive detection method for face image tampering when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the passive detection method for face image tampering is implemented.
The above scheme of this application has following beneficial effect:
in the embodiment of the application, a to-be-detected face image is obtained, frequency domain enhancement processing and compression noise suppression processing are carried out on the to-be-detected face image to obtain a frequency domain enhanced first face image and a second face image for suppressing image compression noise, then the first face image and the second face image are input into a face image tampering detection model for tampering detection, and a detection result for prompting that the to-be-detected face image is a tampered image or a real image is obtained. The first face image can enhance the spatial domain characteristics and improve the important characteristics of the face image to be detected, and the second face image can reduce the influence of high-frequency noise caused by an image compression algorithm, so that the detection accuracy of face image tampering can be improved when the face image to be detected is tampered by the face image tampering detection model based on the first face image and the second face image. In addition, the feature extraction part of the human face image tampering detection model comprises a multi-frequency channel attention mechanism module, the multi-frequency channel attention mechanism module adopts global average pooling, the average value of each channel in the feature channels is used as a weight, the interested content is enhanced in a weighting mode to inhibit background features, and therefore the human face image tampering detection accuracy is further improved.
Other advantages of the present application will be described in detail in the detailed description that follows.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a passive detection method for face image tampering according to an embodiment of the present application;
FIG. 2 is a flowchart of acquiring a first face image according to an embodiment of the present application;
fig. 3 is a flowchart of acquiring a second face image according to an embodiment of the present application;
fig. 4 is a schematic diagram of region division of a segmentation image when a second face image is acquired according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a feature extraction part of a face image tampering model according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating the detailed steps of a qualitative study provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a passive detection device for face image tampering according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
At present, during picture transmission, processing by image compression algorithms causes information loss and reduces image quality; this reduction in image quality lowers the detection precision of image tampering detection models, making artificial artifacts in the picture difficult to distinguish.
In view of the above problems, in the embodiment of the present application, a to-be-detected face image is obtained, and frequency domain enhancement processing and compression noise suppression processing are performed on the to-be-detected face image to obtain a frequency domain enhanced first face image and a second face image for suppressing image compression noise, and then the first face image and the second face image are input into a face image tampering detection model to perform tampering detection, so as to obtain a detection result for prompting that the to-be-detected face image is a tampered image or a real image. The first face image can enhance the spatial domain characteristics of the face image to be detected and improve the important characteristics of the face image to be detected, and the second face image can reduce the influence of high-frequency noise caused by an image compression algorithm, so that the face image tampering detection model can improve the detection accuracy of face image tampering when the face image to be detected is tampered based on the first face image and the second face image. In addition, the feature extraction part of the human face image tampering detection model comprises a multi-frequency channel attention mechanism module, the multi-frequency channel attention mechanism module adopts global average pooling, the average value of each channel in the feature channels is used as a weight, the interested content is enhanced in a weighting mode to inhibit background features, and therefore the human face image tampering detection accuracy is further improved.
The following describes an exemplary face image tampering passive detection method provided by the present application with reference to a specific embodiment.
The embodiment of the application provides a method for passively detecting face image tampering, which can be executed by a terminal device, and can also be executed by a device (such as a chip) applied to the terminal device. As an example, the terminal device may be a tablet, a server, a notebook, or the like, which is not limited in this application.
As shown in fig. 1, the method for passively detecting face image tampering provided in the embodiment of the present application includes the following steps:
step 101, obtaining a face image to be detected.
The face image to be detected is a face image which needs to be subjected to tampering detection. Specifically, the face image to be detected may be acquired by the terminal device, or may be acquired by other image acquisition devices and then sent to the terminal device.
And 102, performing frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a first face image with enhanced frequency domain and a second face image with suppressed image compression noise.
In some examples of the present application, since the frequency domain enhancement can enhance the spatial domain features, the first facial image can enhance the spatial domain features of the facial image to be detected, and improve the important features of the facial image to be detected. The second face image is obtained based on the basic principle of two-dimensional discrete cosine transform and an image compression algorithm (such as a JPEG image compression algorithm), and the second face image can reduce the influence of high-frequency noise caused by image compression algorithms such as JPEG.
And 103, inputting the first face image and the second face image into a face image tampering detection model for tampering detection to obtain a detection result.
The detection result is used for prompting that the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
It is worth mentioning that, the first face image can enhance the spatial domain characteristics of the face image to be detected and improve the important characteristics of the face image to be detected, and the second face image can reduce the influence of high-frequency noise caused by an image compression algorithm, so that the face image tampering detection model can improve the face image tampering detection accuracy when tampering detection is performed on the face image to be detected based on the first face image and the second face image. In addition, the feature extraction part of the human face image tampering detection model comprises a multi-frequency channel attention mechanism module, the multi-frequency channel attention mechanism module adopts global average pooling, the average value of each channel in the feature channels is used as a weight, the interested content is enhanced in a weighting mode to inhibit background features, and therefore the human face image tampering detection accuracy is further improved.
The following describes, with reference to a specific embodiment, an exemplary description of performing frequency domain enhancement processing on the facial image to be detected in step 102 to obtain a frequency domain enhanced first facial image.
As shown in fig. 2, in some embodiments of the present application, a specific implementation manner of acquiring the first face image includes the following steps:
step 201, converting the face image to be detected from the RGB color space to the YCbCr color space to obtain a Y channel sub-image, a Cb channel sub-image, and a Cr channel sub-image.
Step 202, performing DCT transformation on the Y channel sub-image, the Cb channel sub-image, and the Cr channel sub-image, respectively, to obtain a first original DCT graph corresponding to the Y channel sub-image, a second original DCT graph corresponding to the Cb channel sub-image, and a third original DCT graph corresponding to the Cr channel sub-image.
Here, DCT refers to the Discrete Cosine Transform.
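For illustration, a minimal Python sketch of steps 201-202 (color space conversion followed by a per-channel 2D DCT) is given below; the function names are illustrative, and the full-range BT.601/JPEG conversion coefficients are an assumption, since the patent does not specify the exact conversion matrix.

```python
import numpy as np
from scipy.fft import dctn

def rgb_to_ycbcr(img):
    # Step 201: H x W x 3 uint8 RGB image -> float Y, Cb, Cr channel sub-images
    # (full-range BT.601/JPEG convention assumed).
    img = img.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def channel_dct(y, cb, cr):
    # Step 202: 2D DCT of each channel sub-image -> the three "original DCT maps".
    return dctn(y, norm='ortho'), dctn(cb, norm='ortho'), dctn(cr, norm='ortho')
```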
Step 203, respectively performing normalization processing on the first original DCT graph, the second original DCT graph and the third original DCT graph to obtain a first DCT weight graph corresponding to the first original DCT graph, a second DCT weight graph corresponding to the second original DCT graph and a third DCT weight graph corresponding to the third original DCT graph.
Specifically, the first DCT weight map is Y_NDCT, the second DCT weight map is Cb_NDCT, and the third DCT weight map is Cr_NDCT.
The normalization may be Sliding Window Normalization (SWN), which divides the original matrix into smaller matrices and performs local-area normalization step by step, from left to right and top to bottom. For example, if the size of the original matrix is W × H (W being the horizontal size and H the vertical size) and the matrix is gridded 8 × 8, each block sub-region has size (W/8) × (H/8) and contains (W × H)/64 sample values. These sample values are normalized by max-min normalization; for a data sample x, the expression is:
x_scale = (x − x_min) / (x_max − x_min)

where x_min and x_max are the minimum and maximum values of each set of sample values, x is the raw data value being normalized, and x_scale is the normalized data value.
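The sliding window normalization described above can be sketched as follows; this is a minimal illustration assuming the matrix dimensions are divisible by the grid size, with illustrative function names.

```python
import numpy as np

def minmax_normalize(block):
    # x_scale = (x - x_min) / (x_max - x_min); flat blocks map to zero.
    lo, hi = block.min(), block.max()
    return np.zeros_like(block, dtype=np.float64) if hi == lo else (block - lo) / (hi - lo)

def sliding_window_normalize(mat, grid=8):
    # Sliding Window Normalization (SWN): grid the matrix into grid x grid
    # sub-regions of size (H/grid) x (W/grid) and max-min normalize each region
    # locally, proceeding left to right and top to bottom.
    h, w = mat.shape
    bh, bw = h // grid, w // grid
    out = np.empty((h, w), dtype=np.float64)
    for i in range(grid):
        for j in range(grid):
            sl = np.s_[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            out[sl] = minmax_normalize(mat[sl])
    return out
```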
Step 204, the first original DCT map, the second original DCT map and the third original DCT map are divided into n × n blocks, respectively.
Wherein n is a power of 2 and n is greater than or equal to 8.
As a preferred example, the first original DCT picture, the second original DCT picture and the third original DCT picture are divided into 8 × 8 squares, respectively.
Step 205, respectively aiming at each original DCT graph in the first original DCT graph, the second original DCT graph, and the third original DCT graph, normalizing each square in the original DCT graph to obtain a frequency domain matrix weight containing n × n values, and obtaining a frequency domain weight graph corresponding to the original DCT graph through matrix replication and matrix expansion.
The frequency domain weight map corresponding to the first original DCT map is a first frequency domain weight map, the frequency domain weight map corresponding to the second original DCT map is a second frequency domain weight map, and the frequency domain weight map corresponding to the third original DCT map is a third frequency domain weight map.
Specifically, the first frequency domain weight map is Y_F_weight, the second frequency domain weight map is Cb_F_weight, and the third frequency domain weight map is Cr_F_weight.
Step 206, adding the first DCT weight map, the second DCT weight map and the third DCT weight map to the first frequency domain weight map, the second frequency domain weight map and the third frequency domain weight map by color channel, to obtain a first enhanced weight map corresponding to the first original DCT map, a second enhanced weight map corresponding to the second original DCT map, and a third enhanced weight map corresponding to the third original DCT map.
Specifically, the first enhanced weight map is f_Y, the second enhanced weight map is f_Cb, and the third enhanced weight map is f_Cr.
Step 207, calculate the first product of the first enhanced weight map and the Y channel sub-image, the second product of the second enhanced weight map and the Cb channel sub-image, and the third product of the third enhanced weight map and the Cr channel sub-image.
And 208, sequentially performing inverse DCT (discrete cosine transform) and RGB (red, green and blue) conversion on the product respectively aiming at each product of the first product, the second product and the third product to obtain the frequency domain enhanced first face image.
That is, the frequency-domain-enhanced first face image X_FE is obtained by sequentially performing inverse DCT transformation and RGB conversion (i.e., conversion to the RGB color space) on the first product, on the second product, and on the third product.
Note that the RGB conversion in the embodiment of the present application refers to conversion into an RGB color space.
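Putting steps 203-208 together, a possible end-to-end sketch of the frequency domain enhancement is shown below (reusing rgb_to_ycbcr, minmax_normalize and sliding_window_normalize from the sketches above). Two points are assumptions rather than statements of the patent: the per-square reduction in step 205 is taken here as the normalized mean of the block magnitudes, and "matrix replication and matrix expansion" is interpreted as tiling the n × n weights back to full resolution.

```python
import numpy as np
from scipy.fft import dctn, idctn

def ycbcr_to_rgb(y, cb, cr):
    # Inverse of the BT.601/JPEG conversion assumed earlier.
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

def frequency_weight_map(dct_map, n=8):
    # Steps 204-205: split the DCT map into an n x n grid of squares, reduce each
    # square to one normalized scalar (block-magnitude mean assumed), then expand
    # back to full size by matrix replication.
    h, w = dct_map.shape
    bh, bw = h // n, w // n
    block_means = np.abs(dct_map).reshape(n, bh, n, bw).mean(axis=(1, 3))
    return np.kron(minmax_normalize(block_means), np.ones((bh, bw)))

def frequency_enhance(rgb_img, n=8):
    # Steps 203 and 206-208, followed as literally described: enhanced weight map =
    # (SWN of the original DCT map) + (frequency domain weight map); multiply by the
    # channel sub-image, inverse-DCT, and convert back to RGB.
    y, cb, cr = rgb_to_ycbcr(rgb_img)
    channels = []
    for ch in (y, cb, cr):
        d = dctn(ch, norm='ortho')
        enhanced = sliding_window_normalize(d, n) + frequency_weight_map(d, n)
        channels.append(idctn(enhanced * ch, norm='ortho'))
    return ycbcr_to_rgb(*channels)  # the frequency-domain-enhanced image X_FE
```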
Next, an exemplary description is given of the step 102 of performing compression noise suppression processing on the face image to be detected, so as to obtain a second face image with suppressed image compression noise.
As shown in fig. 3, in some embodiments of the present application, a specific implementation manner of obtaining the second face image includes the following steps:
step 301, converting the face image to be detected from the RGB color space to the YCbCr color space to obtain a Y channel sub-image, a Cb channel sub-image, and a Cr channel sub-image.
Step 302, performing DCT transformation on the Y-channel sub-image, the Cb-channel sub-image, and the Cr-channel sub-image, respectively, to obtain a first original DCT graph corresponding to the Y-channel sub-image, a second original DCT graph corresponding to the Cb-channel sub-image, and a third original DCT graph corresponding to the Cr-channel sub-image.
Step 303, dividing the original DCT graph into a first image region including high frequency components and a second image region excluding the first region, performing an integral normalization process on the first image region, and performing a sliding window normalization process on the second image region, respectively for each of the first original DCT graph, the second original DCT graph, and the third original DCT graph, to obtain a fourth DCT weight graph corresponding to the first original DCT graph, a fifth DCT weight graph corresponding to the second original DCT graph, and a sixth DCT weight graph corresponding to the third original DCT graph.
Specifically, as shown in fig. 4, an original DCT graph can be divided into an area a (i.e., the second image area) and an area B (i.e., the first image area), where the area B is a small window at the lower right corner of the matrix, the window size is 1/4 of the height and width of the original image, which represents the high frequency component in the graph, and the area a is the remaining part; then, the sliding window normalization processing is performed on the area a, and the integral normalization is performed on the area B (i.e. the normalization range of the area B is larger), so that the normalized DCT weight map can be obtained.
Specifically, performing the above operation on the first original DCT map yields the fourth DCT weight map Y_NDCT; performing it on the second original DCT map yields the fifth DCT weight map Cb_NDCT; and performing it on the third original DCT map yields the sixth DCT weight map Cr_NDCT.
And step 304, respectively carrying out inverse DCT transformation and RGB (red, green and blue) conversion on the DCT weight graph sequentially aiming at each DCT weight graph in the fourth DCT weight graph, the fifth DCT weight graph and the sixth DCT weight graph to obtain a second face image for restraining image compression noise.
That is, inverse DCT transformation and RGB conversion are performed in turn on the fourth DCT weight map Y_NDCT, on the fifth DCT weight map Cb_NDCT, and on the sixth DCT weight map Cr_NDCT, to obtain the second face image X_NS with suppressed image compression noise.
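Analogously, a minimal sketch of steps 301-304 is given below (again reusing the helpers from the sketches above). The exact placement of region B follows the fig. 4 description and is taken here as the last H/4 rows and W/4 columns of the DCT map, an interpretation rather than a specification.

```python
import numpy as np
from scipy.fft import dctn, idctn

def suppress_compression_noise(rgb_img, grid=8):
    # Step 303: region A gets sliding window normalization; region B (the
    # bottom-right high-frequency window, 1/4 of the height and width) gets one
    # whole-window normalization. Step 304: inverse DCT + conversion back to RGB.
    y, cb, cr = rgb_to_ycbcr(rgb_img)
    channels = []
    for ch in (y, cb, cr):
        d = dctn(ch, norm='ortho')
        h, w = d.shape
        weight = sliding_window_normalize(d, grid)   # region A treatment
        b = np.s_[h - h // 4:, w - w // 4:]          # region B (high frequencies)
        weight[b] = minmax_normalize(d[b])           # integral normalization
        channels.append(idctn(weight, norm='ortho'))
    return ycbcr_to_rgb(*channels)  # the noise-suppressed image X_NS
```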
The face image tampering detection model is exemplarily described below with reference to specific embodiments.
In some embodiments of the present application, the face image tampering detection model is an Xception classification model; the activation function of the feature extraction part of the Xception classification model is a parameterized ReLU (PReLU) activation function, all pooling layers of the feature extraction part except the last one are detail-preserving pooling layers, and a multi-frequency channel attention mechanism module is connected after each separable convolution layer in the feature extraction part.
In some embodiments of the present application, the face image tampering detection model may be constructed on the overall framework of the classification network Xception, and may be referred to as a binary-classification face forensics network (FENet). In the related art, the classification network Xception is divided into three units (flows): feature extraction (Entry), optimization (Middle) and summarization (Exit). The feature extraction unit down-samples the input image to reduce the spatial dimension and extract rich spatial and channel features; the optimization unit further learns and optimizes the features; the summarization unit aggregates and summarizes the extracted features. It should be noted that, in some embodiments of the present application, the face image tampering detection model is obtained mainly by improving the feature extraction part.
As shown in fig. 5, the FENet takes two images as input in the feature extraction section: the first face image X_FE and the second face image X_NS. X_FE is fed into the Main Flow, in which a multi-frequency channel attention mechanism module (which may be an adaptive multi-frequency channel attention (AFCA) module) is added after each Separable Convolution (Separable Conv) layer and activation function; the feature map output by the separable convolution layer and activation function is then input to the AFCA module. X_NS is fed into the Residual Flow, which uses ordinary convolution (Conv) layers; instead of directly adding the reprocessed features of the previous convolution module to the feature output of the current layer to learn residual features, the preprocessed, noise-reduced X_NS is used as the learning object of the residual structure, which reduces the backward propagation of noise information within the residual structure. The two streams share the other trainable parameter layers of the feature extraction part, such as the convolution layers.
It should be noted that, since the FENet is constructed based on an overall frame of Xception, except that the above improvement is performed on the feature extraction part, the rest of the FENet is the same as the Xception, and therefore, in the embodiment of the present application, redundant description is not performed on other units of the feature extraction part in fig. 5.
Specifically, the FENet retains the last max pooling layer of the feature extraction part and replaces all other Max Pooling layers in the feature extraction part with Detail-Preserving Pooling (DPP) layers.
Detail-preserving pooling is a pooling method that preserves detail features; its expression is:
O[p] = ( Σ_{q∈Ω_p} ω_{α,λ}[p,q] · I[q] ) / ( Σ_{q∈Ω_p} ω_{α,λ}[p,q] )

ω_{α,λ}[p,q] = α + ρ_λ( I[q] − Ĩ[p] ),  Ĩ = F ∗ I

where I denotes the original image, O the output, and I[q] the pixel value at position q; α is an offset that ensures the input does not vanish and can still affect the output, and λ is a reward exponent. These two parameters define the Inverse Bilateral Weights ω_{α,λ}[p,q] and can be learned from the training data; F is a learnable, non-normalized 2D filter over the neighborhood Ω_p, where Ω_p is 3 × 3. In summary, DPP can amplify detail features in the image and measure the importance of the various detail components through the learnable parameters α, λ, F, etc., so that the pooling operation still retains the rich detail features of the original face image; these detail features often contain artificial artifacts that can become important clues for classification.
It should be noted that, since detail-preserving pooling is a generic technique, it is not described in further detail in the embodiments of the present application.
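As a rough illustration only, a simplified PyTorch sketch of a stride-2 detail-preserving pooling layer in the spirit of the formulas above is given here; it compares each input pixel with its own filtered estimate rather than with a downsampled-grid estimate, takes ρ_λ(x) = |x|^λ, and keeps α and λ positive via softplus — all simplifying assumptions, not the original DPP formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn

class DetailPreservingPool(nn.Module):
    # Weighted 2x2 average with inverse bilateral weights
    # w = alpha + |I - (F * I)|^lambda; alpha, lambda and F are learnable.
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(channels))
        self.lam = nn.Parameter(torch.zeros(channels))
        self.filt = nn.Conv2d(channels, channels, 3, padding=1,
                              groups=channels, bias=False)  # learnable 3x3 filter F

    def forward(self, x):                                   # x: (B, C, H, W)
        i_tilde = self.filt(x)                              # filtered estimate of I
        alpha = fn.softplus(self.alpha).view(1, -1, 1, 1)
        lam = fn.softplus(self.lam).view(1, -1, 1, 1)
        w = alpha + (x - i_tilde).abs().clamp(min=1e-6) ** lam
        return fn.avg_pool2d(w * x, 2) / (fn.avg_pool2d(w, 2) + 1e-8)
```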
In the FENet, a parameterized ReLU activation function (i.e., a PReLU activation function) is used, and an AFCA module is added after each PReLU activation function and separable convolution to extract features of different frequency components of the image.
Illustratively, the PReLU activation function expression is:
f(y_i) = y_i,        if y_i > 0
f(y_i) = a_i · y_i,  if y_i ≤ 0
where y_i denotes the input of the activation function on the i-th channel, and a_i is the slope of the PReLU function over the negative interval. With the PReLU function, each feature channel thus has one learnable parameter controlling the slope. Because the PReLU function uses a linear function with a small slope in the negative domain, it avoids the complete failure of the ReLU function for negative inputs and improves classification accuracy.
It should be noted that, since the PReLU activation function is a generic PReLU activation function, it is not described in further detail in the embodiments of the present application.
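In PyTorch, for example, the per-channel PReLU described above is available directly:

```python
import torch
import torch.nn as nn

# One learnable slope a_i per feature channel (num_parameters = C), matching the
# per-channel formula above; init is the starting slope on the negative interval.
prelu = nn.PReLU(num_parameters=64, init=0.25)
x = torch.randn(1, 64, 32, 32)
y = prelu(x)  # y = x where x > 0, a_i * x where x <= 0, channel by channel
```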
The adaptive multi-frequency channel attention mechanism module is illustratively described below in conjunction with specific embodiments.
Each feature channel map of the multi-frequency channel attention mechanism module is processed by N different frequency components, where N is an integer that evenly divides the channel number C of the multi-frequency channel attention mechanism module.
Specifically, after the outputs of the PReLU activation function and the separable convolution enter the AFCA module, the AFCA module intercepts the frequency block data of N frequency components as N DCT sub-blocks, denoted D_0, D_1, …, D_{N−1}.
The C single-channel feature maps of size H × W (H being the height and W the width) are multiplied with the N different DCT frequency components to obtain N one-dimensional weight vectors of size 1 × 1 × C, W_0, W_1, …, W_{N−1}; these are transposed in the H direction (i.e., the horizontal direction) and then passed through a 1 × 1 convolution to obtain the final channel weight.
The channel weight is then multiplied, channel by channel, with the initial H × W × C feature maps to complete one attention operation.
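A minimal PyTorch sketch of this multi-frequency channel attention step is given below. The DCT basis construction and the per-frequency pooling follow the description above; the sigmoid on the fused weight is an assumption, and `freqs` would hold the N (u, v) frequency indices selected by the qualitative study described later.

```python
import math
import torch
import torch.nn as nn

def dct_basis(h, w, u, v):
    # 2D DCT-II basis component B_{u,v} for an h x w feature map.
    i = torch.arange(h, dtype=torch.float32)
    j = torch.arange(w, dtype=torch.float32)
    bi = torch.cos((2 * i + 1) * u * math.pi / (2 * h))
    bj = torch.cos((2 * j + 1) * v * math.pi / (2 * w))
    return torch.outer(bi, bj)

class AFCA(nn.Module):
    def __init__(self, channels, freqs, h, w):
        super().__init__()
        # N fixed DCT frequency components D_0 ... D_{N-1}.
        self.register_buffer('basis',
                             torch.stack([dct_basis(h, w, u, v) for u, v in freqs]))
        self.fuse = nn.Conv1d(len(freqs), 1, kernel_size=1)  # 1x1 conv over the N vectors

    def forward(self, x):                                    # x: (B, C, H, W)
        resp = torch.einsum('bchw,nhw->bcn', x, self.basis)  # N weight vectors W_0..W_{N-1}
        w = self.fuse(resp.transpose(1, 2))                  # (B, N, C) -> (B, 1, C)
        w = torch.sigmoid(w).squeeze(1)[:, :, None, None]    # final channel weight
        return x * w                                         # one attention operation
```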
It is worth mentioning that, since the AFCA module processes all the feature channels with different DCT frequency components, respectively, and learns the features of different frequency components, the accuracy of tamper detection can be greatly improved.
The above N frequency components may be obtained as follows: using an image data set, a qualitative study is conducted on the different roles that different image frequency components play in the face forensics task and in the traditional image classification task; the first N frequency components with the highest accuracy for these tasks are extracted and used as the N frequency components required by the multi-frequency channel attention mechanism module.
From each of the four forgery methods in the FaceForensics++ data set (Deepfakes, Face2Face, FaceSwap and NeuralTextures), 10000 tampered face images (fake face images) are obtained; together with 40000 real face images taken from YouTube videos, these form a face experiment data set (80000 images in total).
The image data set is divided into four groups according to forgery method (Deepfakes, Face2Face, FaceSwap and NeuralTextures), each group containing 10000 fake face images and 10000 real face images. The following operations are performed on each of the four sub-data sets:
as shown in fig. 6, the process of obtaining N frequency components based on the image data set is: dividing a frequency spectrum graph obtained by performing DCT transformation on an RGB picture (i.e., an original image at the left end in fig. 6) of each sub data set face image into a plurality of parts in the horizontal and vertical directions, as a preferred example, dividing the obtained frequency spectrum graph into 8 parts in the horizontal and vertical directions, and equally dividing the frequency spectrum graph into 64 parts; a mask image is generated, a single frequency block in a spectrogram is reserved, and then Inverse DCT (iDCT) transformation is carried out to obtain a local frequency component image (namely, an original image at the right end in FIG. 6); the above processing is performed for each picture of the four sub data sets, so that 64 sub frequency component data sets can be obtained on each forgery method sub data set; identifying whether the faces of the 64 sub-frequency component data sets are tampered on a defective neural network (ResNet-18) to obtain the accuracy of each frequency component; and finally, selecting the first N frequency components with the highest accuracy as different DCT frequency components distributed to the AFCA module.
The training and testing of the face image tamper detection model are exemplarily described below with reference to specific embodiments.
Illustratively, the face image tampering detection model may be trained and tested using the FaceForensics++ data set and the DFDC Preview data set. The FaceForensics++ data set includes three sub-data sets of different image quality: original quality (Raw, uncompressed), high quality (HQ, compression parameter 23) and low quality (LQ, compression parameter 40); each sub-data set contains 63000 real and fake faces.
Since the DFDC Preview data set does not contain images of different qualities, in order to better study the robustness of the proposed network to image compression, this embodiment compresses the DFDC Preview data set with the JPEG algorithm and divides it, by JPEG compression ratio, into four sub-data sets: Raw (uncompressed), HQ (80%), medium quality (MQ, 60%) and LQ (40%), 240000 images in total.
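The JPEG re-compression used to build these splits might look as follows; treating the percentages above as Pillow JPEG quality factors, and the input file name, are assumptions for illustration.

```python
from io import BytesIO
from PIL import Image

def jpeg_compress(img, quality):
    # Re-encode an image in memory at the given JPEG quality factor.
    buf = BytesIO()
    img.save(buf, format='JPEG', quality=quality)
    buf.seek(0)
    return Image.open(buf).convert('RGB')

src = Image.open('face.png').convert('RGB')   # hypothetical input file
hq, mq, lq = (jpeg_compress(src, q) for q in (80, 60, 40))
```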
The accuracy of the passive detection method for face image tampering of the present application is described below with specific experimental data. Table 1 gives accuracy (ACC) comparison data for FENet on the FaceForensics++ data set; Table 2 gives AUC comparison data for FENet on FaceForensics++ (AUC is a common metric for evaluating binary classification networks; a higher AUC value generally indicates a better-trained network); Table 3 gives ACC comparison data for FENet on the DFDC Preview data set; and Table 4 gives AUC comparison data for FENet on DFDC Preview.
(Table 1, the ACC comparison of FENet with other methods on the FaceForensics++ data set, appears only as images in the original document and cannot be reproduced here.)

TABLE 1
Method              AUC(Raw)   AUC(HQ)   AUC(LQ)
Meso-4              70.36      66.03     54.21
Meso-Incep          79.53      73.45     69.42
HP-CNN              78.82      71.28     68.37
Constrained-Conv    95.21      92.00     88.35
AMTEN               96.68      91.78     86.82
XceptionNet         99.26      98.52     97.43
ResNet34            88.25      80.36     75.32
FENet               99.99      99.93     99.56

TABLE 2
Method              ACC(Raw)   ACC(HQ)   ACC(MQ)   ACC(LQ)
Meso-4              53.71      60.63     58.25     54.38
Meso-Incep          58.16      64.49     59.47     58.30
HP-CNN              61.49      64.09     63.38     62.59
Constrained-Conv    81.01      83.40     81.27     80.05
AMTEN               88.83      85.69     83.96     83.76
XceptionNet         89.37      92.29     90.28     88.04
ResNet34            94.52      96.68     94.92     93.93
FENet               97.88      96.83     96.15     95.78

TABLE 3
Method              AUC(Raw)   AUC(HQ)   AUC(MQ)   AUC(LQ)
Meso-4              55.37      53.26     52.14     51.54
Meso-Incep          65.46      64.68     59.72     55.90
HP-CNN              67.58      63.73     60.85     56.36
Constrained-Conv    87.72      85.34     83.14     80.41
AMTEN               89.25      86.13     85.57     84.25
XceptionNet         96.91      93.86     88.79     83.37
ResNet34            73.64      70.24     69.23     68.10
FENet               99.58      98.53     97.78     94.89

TABLE 4
As shown in Table 1 and Table 3, in the ACC comparison on the FaceForensics++ data set the accuracy of FENet reaches 99.96%, clearly higher than that of the other face image tampering detection methods; in the ACC comparison on the DFDC Preview data set, the accuracy of FENet reaches 97.88%, again clearly higher than that of the other face image tampering detection methods.
As shown in Table 2 and Table 4, in the AUC comparison on the FaceForensics++ data set the AUC value of FENet is 99.99%, higher than the AUC values of the other face image tampering detection methods; in the AUC comparison on the DFDC Preview data set, the AUC value of FENet is 99.58%, likewise higher than the AUC values of the other face image tampering detection methods.
The following describes an exemplary face image tampering passive detection apparatus provided in the present application with reference to specific embodiments.
As shown in fig. 7, an embodiment of the present application provides a passive detection apparatus for face image tampering, where the passive detection apparatus for face image tampering 700 includes:
the first obtaining module 701 is configured to obtain a face image to be detected.
A second obtaining module 702, configured to perform frequency domain enhancement processing and compression noise suppression processing on the facial image to be detected, so as to obtain a frequency domain enhanced first facial image and a second facial image with suppressed image compression noise.
The detection module 703 is configured to input the first face image and the second face image into a face image tampering detection model for tampering detection, so as to obtain a detection result; the detection result is used for prompting that the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
As shown in fig. 8, an embodiment of the present application provides a terminal device, and as shown in fig. 8, a terminal device D10 of the embodiment includes: at least one processor D100 (only one processor is shown in fig. 8), a memory D101, and a computer program D102 stored in the memory D101 and operable on the at least one processor D100, wherein the processor D100 implements the steps of any of the various method embodiments described above when executing the computer program D102.
Specifically, when the processor D100 executes the computer program D102, the frequency domain enhancement processing and the compression noise suppression processing are performed on the face image to be detected to obtain a first face image with enhanced frequency domain and a second face image with suppressed image compression noise, and then the first face image and the second face image are input to the face image tampering detection model for tampering detection to obtain a detection result for prompting that the face image to be detected is a tampered image or a real image. The first face image can enhance the spatial domain characteristics of the face image to be detected and improve the important characteristics of the face image to be detected, and the second face image can reduce the influence of high-frequency noise caused by an image compression algorithm, so that the face image tampering detection model can improve the detection accuracy of face image tampering when the face image to be detected is tampered based on the first face image and the second face image. In addition, the feature extraction part of the human face image tampering detection model comprises a multi-frequency channel attention mechanism module, the multi-frequency channel attention mechanism module adopts global average pooling, the average value of each channel in the feature channels is used as a weight, the interested content is enhanced in a weighting mode to inhibit background features, and therefore the human face image tampering detection accuracy is further improved.
The processor D100 may be a Central Processing Unit (CPU); it may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory D101 may, in some embodiments, be an internal storage unit of the terminal device D10, for example a hard disk or memory of the terminal device D10. In other embodiments, the memory D101 may also be an external storage device of the terminal device D10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the terminal device D10. Further, the memory D101 may include both an internal storage unit and an external storage device of the terminal device D10. The memory D101 is used to store an operating system, application programs, a boot loader (BootLoader), data and other programs, such as the program code of the computer program; it may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the foregoing method embodiments.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to implement the steps of the foregoing method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the passive detection apparatus/terminal device for face image tampering, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not include electrical carrier signals or telecommunications signals.
The descriptions of the above embodiments each have their own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether such functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be regarded as going beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative: the division into modules or units is only one kind of logical functional division, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and they should all be included within the protection scope of the present application.

Claims (10)

1. A passive detection method for face image tampering, characterized by comprising the following steps:
acquiring a face image to be detected;
performing frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a frequency-domain-enhanced first face image and a second face image with suppressed image compression noise;
inputting the first face image and the second face image into a face image tampering detection model for tampering detection to obtain a detection result;
wherein the detection result is used for indicating whether the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
2. The method according to claim 1, wherein performing frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a frequency-domain-enhanced first face image and a second face image with suppressed image compression noise comprises:
converting the face image to be detected from an RGB color space to a YCbCr color space to obtain a Y channel subimage, a Cb channel subimage and a Cr channel subimage;
performing DCT transformation on the Y channel subimage, the Cb channel subimage and the Cr channel subimage respectively to obtain a first original DCT map corresponding to the Y channel subimage, a second original DCT map corresponding to the Cb channel subimage and a third original DCT map corresponding to the Cr channel subimage;
and obtaining a frequency-domain-enhanced first face image and a second face image with suppressed image compression noise according to the first original DCT map, the second original DCT map and the third original DCT map.
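For illustration only, the decomposition recited in claim 2 may be sketched in Python as follows, assuming OpenCV, NumPy, and SciPy are available. The use of a full-frame 2-D DCT and the function name channel_dct_maps are the editor's assumptions, not limitations of the claims.

    import cv2
    import numpy as np
    from scipy.fft import dctn

    def channel_dct_maps(image_bgr):
        # Convert from the RGB color space to YCbCr; OpenCV orders the
        # channels as Y, Cr, Cb, so they are reordered on return.
        ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
        y, cr, cb = cv2.split(ycrcb)
        # A 2-D DCT of each channel subimage gives the three original DCT maps.
        return (dctn(y, norm="ortho"),    # first original DCT map  (Y)
                dctn(cb, norm="ortho"),   # second original DCT map (Cb)
                dctn(cr, norm="ortho"))   # third original DCT map  (Cr)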
3. The method according to claim 2, wherein obtaining the frequency-domain-enhanced first face image according to the first original DCT map, the second original DCT map and the third original DCT map comprises:
respectively performing normalization processing on the first original DCT map, the second original DCT map and the third original DCT map to obtain a first DCT weight map corresponding to the first original DCT map, a second DCT weight map corresponding to the second original DCT map and a third DCT weight map corresponding to the third original DCT map;
dividing the first original DCT map, the second original DCT map and the third original DCT map into n × n squares, respectively, wherein n is a power of 2 and n is greater than or equal to 8;
for each original DCT map among the first original DCT map, the second original DCT map and the third original DCT map, performing normalization processing on each square in that original DCT map to obtain a frequency domain matrix weight containing n × n values, and obtaining a frequency domain weight map corresponding to that original DCT map through matrix replication and matrix expansion; the frequency domain weight map corresponding to the first original DCT map is a first frequency domain weight map, the frequency domain weight map corresponding to the second original DCT map is a second frequency domain weight map, and the frequency domain weight map corresponding to the third original DCT map is a third frequency domain weight map;
adding the first DCT weight map, the second DCT weight map and the third DCT weight map to the first frequency domain weight map, the second frequency domain weight map and the third frequency domain weight map, respectively, in correspondence with their color channels, to obtain a first enhanced weight map corresponding to the first original DCT map, a second enhanced weight map corresponding to the second original DCT map and a third enhanced weight map corresponding to the third original DCT map;
and obtaining the frequency-domain-enhanced first face image according to the first enhanced weight map, the second enhanced weight map and the third enhanced weight map.
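For illustration only, one reading of the weight construction in claim 3 is sketched below: the whole-map normalization yields the DCT weight map; the map is then read as an n-by-n grid of squares, each square is reduced to a single normalized value (here its mean magnitude, an editor's assumption), and the resulting n × n matrix weight is replicated and expanded back to full size before the two maps are summed. The map height and width are assumed to be at least n.

    import numpy as np

    def enhancement_weight_map(dct_map, n=8):
        h, w = dct_map.shape
        mag = np.abs(dct_map)
        # Whole-map normalization -> the per-channel DCT weight map.
        dct_weight = mag / (mag.max() + 1e-12)
        # Divide the map into an n-by-n grid and reduce each square to one
        # normalized value -> a frequency domain matrix weight of n x n values.
        bh, bw = h // n, w // n
        grid = mag[:bh * n, :bw * n].reshape(n, bh, n, bw).mean(axis=(1, 3))
        grid = grid / (grid.max() + 1e-12)
        # "Matrix replication and matrix expansion": tile each value of the
        # n x n matrix back up to the full map size.
        freq_weight = np.kron(grid, np.ones((bh, bw)))
        freq_weight = np.pad(freq_weight,
                             ((0, h - bh * n), (0, w - bw * n)), mode="edge")
        # Enhanced weight map = element-wise sum of the two weight maps.
        return dct_weight + freq_weight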
4. The method according to claim 3, wherein obtaining the frequency-domain-enhanced first face image according to the first enhanced weight map, the second enhanced weight map and the third enhanced weight map comprises:
calculating a first product of the first enhanced weight map and the Y channel subimage, a second product of the second enhanced weight map and the Cb channel subimage, and a third product of the third enhanced weight map and the Cr channel subimage;
and performing inverse DCT transformation and RGB conversion in sequence on each of the first product, the second product and the third product to obtain the frequency-domain-enhanced first face image.
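Continuing the sketches above, the reconstruction in claim 4 might look as follows. This follows one plausible reading in which the product weights each channel's DCT coefficients before the inverse transform; the interpretation and the function name are the editor's assumptions.

    import cv2
    import numpy as np
    from scipy.fft import idctn

    def frequency_enhanced_face(image_bgr, n=8):
        # Reuses channel_dct_maps (claim 2 sketch) and
        # enhancement_weight_map (claim 3 sketch).
        y_dct, cb_dct, cr_dct = channel_dct_maps(image_bgr)
        channels = []
        for d in (y_dct, cb_dct, cr_dct):
            w = enhancement_weight_map(d, n)             # enhanced weight map
            channels.append(idctn(d * w, norm="ortho"))  # product, inverse DCT
        y, cb, cr = channels
        ycrcb = cv2.merge([y, cr, cb]).clip(0, 255).astype(np.uint8)
        # RGB conversion back from YCbCr yields the first face image.
        return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)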
5. The method according to claim 2, wherein obtaining the second face image with suppressed image compression noise according to the first original DCT map, the second original DCT map and the third original DCT map comprises:
for each original DCT map among the first original DCT map, the second original DCT map and the third original DCT map, dividing the original DCT map into a first image area containing high-frequency components and a second image area other than the first image area, performing overall normalization processing on the first image area, and performing sliding-window normalization processing on the second image area, to obtain a fourth DCT weight map corresponding to the first original DCT map, a fifth DCT weight map corresponding to the second original DCT map and a sixth DCT weight map corresponding to the third original DCT map;
and performing inverse DCT transformation and RGB conversion in sequence on each of the fourth DCT weight map, the fifth DCT weight map and the sixth DCT weight map to obtain the second face image with suppressed image compression noise.
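For illustration only, one reading of the two-area normalization in claim 5 is sketched below. The fraction of the map treated as the high-frequency area and the window size are illustrative parameters chosen by the editor, and the sliding window is approximated with non-overlapping windows.

    import numpy as np

    def noise_suppression_weight(dct_map, hf_frac=0.5, win=8):
        h, w = dct_map.shape
        mag = np.abs(dct_map)
        weight = np.empty_like(mag)
        # Second image area: window-by-window normalization over the
        # low- and mid-frequency part of the map.
        for i in range(0, h, win):
            for j in range(0, w, win):
                block = mag[i:i + win, j:j + win]
                weight[i:i + win, j:j + win] = block / (block.max() + 1e-12)
        # First image area: the high-frequency corner of the DCT layout,
        # re-normalized as a whole to damp compression noise.
        r, c = int(h * (1 - hf_frac)), int(w * (1 - hf_frac))
        hf = mag[r:, c:]
        weight[r:, c:] = hf / (hf.max() + 1e-12)
        return weight

The second face image then follows the same inverse path as in the claim 4 sketch: under this reading, each channel's DCT coefficients are weighted by the corresponding weight map, inverse DCT transformed, and converted back from YCbCr to RGB.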
6. The method of claim 1, wherein the face image tampering detection model is an Xception classification model, the activation function of the feature extraction part of the Xception classification model is a parameterized ReLU activation function, all pooling layers of the feature extraction part other than the last pooling layer are detail-preserving pooling layers, and each separable convolution layer in the feature extraction part is followed by one multi-frequency channel attention mechanism module.
7. The method of claim 6, wherein each feature channel map of the multi-frequency channel attention mechanism module is processed through N different frequency components, where N is an integer that evenly divides the channel number C of the multi-frequency channel attention mechanism module.
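For illustration only, a multi-frequency channel attention block in the spirit of claims 6 and 7 may be sketched in Python (PyTorch) in the style of FcaNet: the C feature channels are split into N groups, each group is pooled with a different 2-D DCT frequency component, and the pooled vector drives a channel re-weighting. The frequency picks, reduction ratio, and all names below are the editor's assumptions, not the actual model of this application.

    import math
    import torch
    import torch.nn as nn

    def dct_basis(u, v, h, w):
        # 2-D DCT-II basis at frequency (u, v), used as a pooling kernel.
        xs = torch.cos(math.pi * (torch.arange(h) + 0.5) * u / h)
        ys = torch.cos(math.pi * (torch.arange(w) + 0.5) * v / w)
        return torch.outer(xs, ys)

    class MultiFreqChannelAttention(nn.Module):
        def __init__(self, channels, height, width,
                     freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
            super().__init__()
            n = len(freqs)
            # Claim 7: the channel number C must split evenly over the
            # N frequency components.
            assert channels % n == 0
            basis = torch.stack([dct_basis(u, v, height, width)
                                 for u, v in freqs])
            # One pooling kernel per channel: each frequency component
            # handles a contiguous group of C / N channels.
            self.register_buffer(
                "kernels", basis.repeat_interleave(channels // n, dim=0))
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):                            # x: (B, C, H, W)
            pooled = (x * self.kernels).sum(dim=(2, 3))  # frequency pooling
            scale = self.fc(pooled)                      # per-channel weights
            return x * scale[:, :, None, None]           # re-weighted features

Note that the (0, 0) frequency component reduces to ordinary global average pooling (up to a constant factor), so this family of modules strictly generalizes the GAP-based weighting described in the embodiments above.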
8. A passive detection device for face image tampering, comprising:
the first acquisition module is used for acquiring a face image to be detected;
the second acquisition module is used for performing frequency domain enhancement processing and compression noise suppression processing on the face image to be detected to obtain a frequency-domain-enhanced first face image and a second face image with suppressed image compression noise;
the detection module is used for inputting the first face image and the second face image into a face image tampering detection model for tampering detection to obtain a detection result;
wherein the detection result is used for indicating whether the face image to be detected is a tampered image or a real image, and the feature extraction part of the face image tampering detection model comprises a multi-frequency channel attention mechanism module.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the passive detection method for face image tampering according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the passive detection method for face image tampering according to any one of claims 1 to 7.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination