CN116012958A - Method, system, device, processor and computer readable storage medium for implementing deep fake face identification - Google Patents
- Publication number: CN116012958A
- Application number: CN202310093773.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention relates to a method for realizing deep fake face identification based on an rPPG multi-scale space-time diagram and a two-stage model, wherein the method comprises the following steps: (1) collecting a deep fake face video data set and preprocessing the videos; (2) generating an rPPG multi-scale space-time diagram; (3) constructing a mask-guided local attention module, performing first-stage training, and extracting the features of a single rPPG space-time diagram; (4) constructing a Transformer-based temporal aggregation module, performing second-stage training, and fusing the comprehensive features of a plurality of adjacent space-time diagrams; (5) constructing a classification head to perform classification and recognition processing and constructing a loss function. The invention also relates to a corresponding system, device, processor and storage medium. By adopting the method, the system, the device, the processor and the storage medium, the comprehensive features of the plurality of space-time diagrams representing one video are extracted through the two-stage model, which, compared with a baseline model, offers better interpretability and a better fake face identification effect.
Description
Technical Field
The invention relates to the technical field of digital images, in particular to the technical field of computer vision, and specifically relates to a method, a system, a device, a processor and a computer readable storage medium for realizing deep fake face identification based on rPPG multi-scale space-time diagrams and a two-stage model.
Background
With the development of generative deep models, the technical threshold of deep face forgery has become lower, and people can easily create vivid forged face content through publicly available models or tools. Deep forgery may also be misused by malicious users to create false political information or propagate pornography. As a defense mechanism, face forgery identification techniques have been developed and used to mitigate the risks associated with deep forgery. Remote photoplethysmography (rPPG) extracts the heartbeat signal from recorded video by examining subtle changes in skin color caused by cardiac activity. Because the face forgery process inevitably destroys the periodic variation of facial color, rPPG has proven to be a biological indicator that can effectively identify forged faces.
However, most existing rPPG-signal-based deep face forgery identification methods still have some drawbacks. For example, the invention patent application with application number CN202210572034.2 takes 32 small square frames on each face frame to extract heart rate signals, but these ROI areas overlap with each other and have only a single scale; it uses only a one-stage encoder to extract the features of a single rPPG space-time diagram, without considering the feature fusion of a plurality of adjacent rPPG space-time diagrams; and it uses only a two-class cross entropy loss, without considering pixel-level attention weights at local positions, so that the detection performance is limited.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method, a system, a device, a processor and a computer readable storage medium for realizing deep fake face identification based on an rPPG multi-scale space-time diagram and a two-stage model, which can effectively exploit the comprehensive features of a plurality of adjacent video clips.
To achieve the above object, the method, system, device, processor and computer readable storage medium for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model of the present invention are as follows:
The method for realizing the identification of deeply forged human faces based on the rPPG multi-scale space-time diagram and the two-stage model is mainly characterized by comprising the following steps:
(1) Collecting a deep fake face video data set, and preprocessing the video data to obtain a set of cropped face video frames;
(2) Generating an rPPG multi-scale space-time diagram according to the face video frame obtained after cutting;
(3) Constructing a mask-guided local attention module, performing first-stage training, and extracting the features of a single rPPG space-time diagram;
(4) Constructing a Transformer module, performing second-stage training, and fusing the comprehensive features of a plurality of adjacent rPPG space-time diagrams;
(5) Constructing a classification head, pooling the fused high-dimensional features, and performing classification and recognition on them to obtain the identification result of the target image and construct an overall loss function.
Preferably, the step (2) specifically includes the following steps:
(2.1) dividing a complete video into a plurality of T-frame video segments with a step of ω frames;
(2.2) for each frame, carrying out face alignment and extracting face key points;
(2.3) selecting n heartbeat-signal information areas according to the face key points to form an ROI set R_t = {R_1t, R_2t, …, R_nt};
(2.4) for the ROI set R_t, computing, for each of its non-empty subsets, the mean of all pixels contained in the subset, yielding 2^n − 1 pixel means over the three RGB channels;
(2.5) for each video clip, applying the operations of steps (2.2)-(2.4) to the T frames contained therein, resulting in a multi-scale space-time diagram of dimension T × (2^n − 1) × 3, wherein T is the time length, 2^n − 1 is the number of combinations of different information areas, and 3 is the number of RGB channels.
Preferably, the n = 6 information areas in step (2.3) are the forehead, the chin, the upper left and right cheeks, and the lower left and right cheeks, respectively; the specific areas are shown in Fig. 2.
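The multi-scale construction of steps (2.1)-(2.5) can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the function name `multiscale_st_map` and the representation of each ROI as a boolean pixel mask are assumptions made for the example (the patent derives the areas from face key points).

```python
import itertools
import numpy as np

def multiscale_st_map(frames, roi_masks):
    """Build a T x (2^n - 1) x 3 multi-scale space-time map.

    frames:    sequence of T HxWx3 float arrays (RGB video frames)
    roi_masks: sequence of n HxW boolean masks, one per information area
    (Illustrative sketch; names and the mask representation are assumptions.)
    """
    n = len(roi_masks)
    rows = []
    for frame in frames:
        means = []
        # enumerate every non-empty subset of the n ROIs -> 2^n - 1 scales
        for k in range(1, n + 1):
            for subset in itertools.combinations(range(n), k):
                union = np.zeros(roi_masks[0].shape, dtype=bool)
                for i in subset:
                    union |= roi_masks[i]
                means.append(frame[union].mean(axis=0))  # RGB mean over the area union
        rows.append(means)
    return np.asarray(rows)  # shape (T, 2**n - 1, 3)
```

With n = 6 areas and 64-frame clips this yields the 64 × 63 × 3 diagram of the preferred embodiment.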
Preferably, the step (3) specifically includes the following steps:
(3.1) constructing EfficientNet as the backbone network f(·); for an input rPPG space-time diagram X, extracting features through the backbone network and obtaining the middle-layer feature map F_m = f_mid(X) ∈ R^(C×H×W), wherein C, H and W respectively denote the number of channels, rows and columns of the feature map;
(3.2) building a mask-guided local attention module that takes the middle-layer feature map F_m as input and obtains an attention mask A_mask:
A_mask = Sigmoid(Conv(F_m))
Wherein Conv (·) represents a convolution operation;
(3.3) performing point-wise multiplication between the attention mask and the middle-layer feature map F_m to obtain the position-weighted feature map F′ = A_mask · F_m, and taking F′ as input for feature extraction by the subsequent network layers;
(3.4) calculating the pixel-level mask label A_gt of the rPPG space-time diagram: for an rPPG space-time diagram generated from a fake video, finding its corresponding real rPPG space-time diagram, taking the pixel-by-pixel difference to obtain a residual space-time diagram, graying the residual space-time diagram, normalizing it to the range 0 to 1, resizing it to the same size as the attention mask A_mask, and binarizing it with 0.1 as the threshold to obtain the corresponding pixel-level mask label A_gt;
(3.5) calculating the L1 distance between the attention mask A_mask and the corresponding pixel-level mask label A_gt as the mask loss function L_mask, according to the following formula:
L_mask = ||A_mask − A_gt||_1
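A minimal sketch of the mask-guided local attention of steps (3.2)-(3.5), assuming a 1×1 convolution as the unspecified Conv(·) operation and single-sample NumPy arrays in place of batched tensors; all function and parameter names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_guided_attention(f_m, conv_w, conv_b):
    """A_mask = Sigmoid(Conv(F_m)); F' = A_mask * F_m (point-wise).

    f_m:    C x H x W middle-layer feature map
    conv_w: 1 x C weights of an assumed 1x1 convolution
    conv_b: scalar bias
    """
    # 1x1 convolution collapsing the C channels into one attention map
    a_mask = sigmoid(np.tensordot(conv_w, f_m, axes=([1], [0]))[0] + conv_b)
    f_prime = a_mask[None, :, :] * f_m  # position-weighted feature map F'
    return a_mask, f_prime

def mask_loss(a_mask, a_gt):
    """L_mask = ||A_mask - A_gt||_1 (mean-absolute-error variant here)."""
    return np.abs(a_mask - a_gt).mean()
```

The mean rather than the sum of absolute differences is one reasonable reading of the L1 distance; the patent does not specify the reduction.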
More preferably, the step (4) specifically includes the following steps:
(4.1) respectively inputting K adjacent rPPG space-time diagrams into the backbone network trained in the first stage to obtain K global high-dimensional features F_h, performing global average pooling on them, and superimposing a classification encoding and a one-dimensional learnable position encoding to form the input sequence Z_in of the Transformer;
(4.2) constructing a Transformer-based feature fusion module for the plurality of rPPG space-time diagrams: applying the multi-head self-attention operation MSA to the input sequence Z_in, then passing it through a feed-forward network FFN, and after each operation further adjusting the output with layer normalization LN and a residual connection to obtain the Transformer output Z_out.
More preferably, the step (4.2) specifically includes the following steps:
(4.2.1) the input sequence Z_in is passed through a linear mapping layer to generate a Query matrix Q, a Key matrix K and a Value matrix V; the three matrices are then fed into the multi-head self-attention mechanism MSA, as shown in the following formula:
MSA(Z_in) = Softmax(QK^T / √d)V
wherein d is a normalization constant and T denotes the matrix transposition operation;
(4.2.2) the fused feature output Z_out after the Transformer processing is then obtained through the feed-forward network layer FFN composed of a multi-layer perceptron.
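The fusion of steps (4.1)-(4.2) can be illustrated with a single-head, single-block sketch. The patent uses multi-head self-attention; splitting the feature dimension across several heads is omitted here for brevity, and all weight names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def transformer_block(z_in, wq, wk, wv, w1, w2):
    """Single-head block: self-attention then FFN, each with residual + LN.

    z_in: (K+1) x D sequence -- K pooled space-time-map features plus a
    classification token.
    """
    q, k, v = z_in @ wq, z_in @ wk, z_in @ wv          # linear mappings
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d)) @ v           # Softmax(QK^T/sqrt(d))V
    z = layer_norm(z_in + attn)                        # residual + LN
    ffn = np.maximum(z @ w1, 0.0) @ w2                 # two-layer MLP (ReLU)
    return layer_norm(z + ffn)                         # output Z_out
```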
More preferably, the step (5) specifically includes the following steps:
(5.1) performing global average pooling g(·) on the fused comprehensive feature Z_out output by the second-stage training, and then using a fully connected network FC to map the dimension to the category number 2 to obtain the vector Z, as shown in the following formula:
Z = FC(g(Z_out))
(5.2) calculating Softmax over Z to obtain the final prediction score y′, and calculating the two-class cross entropy loss L_ce against the label y, as shown in the following formula:
L_ce = −[y log y′ + (1 − y) log(1 − y′)]
(5.3) constructing the overall loss function L_all, as shown in the following formula:
L_all = L_ce + λL_mask
where λ is the hyper-parameter used to balance cross entropy loss and mask loss.
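Step (5) can be sketched with NumPy stand-ins for the pooling, fully connected layer, Softmax and losses; `lam` plays the role of the hyper-parameter λ, and its default value is illustrative:

```python
import numpy as np

def classification_head(z_out, w_fc, b_fc):
    """Global-average-pool Z_out, map to 2 logits via FC, Softmax to scores."""
    z = z_out.mean(axis=0) @ w_fc + b_fc  # Z = FC(g(Z_out))
    e = np.exp(z - z.max())
    return e / e.sum()                    # prediction scores over the 2 classes

def total_loss(y, y_pred, l_mask, lam=1.0):
    """L_all = L_ce + lambda * L_mask, with binary cross-entropy L_ce."""
    eps = 1e-12  # numerical guard against log(0)
    l_ce = -(y * np.log(y_pred + eps) + (1 - y) * np.log(1 - y_pred + eps))
    return l_ce + lam * l_mask
```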
The system for realizing the identification of deeply forged human faces based on the rPPG multi-scale space-time diagram and the two-stage model by using the method described above is mainly characterized by comprising:
the rPPG multi-scale space-time diagram generation module is used for calculating an rPPG space-time diagram from the face video frame;
the mask-guided local attention module is connected with the rPPG multi-scale space-time diagram generation module and is used for enhancing the learning of local information and extracting the characteristics of a single rPPG space-time diagram;
the transducer module is connected with the local attention module guided by the mask and used for fusing the comprehensive characteristics of a plurality of adjacent rPPG time-space diagrams; and
the classification head module is connected with the transducer module and is used for pooling the integrated features after fusion and carrying out classification recognition processing so as to obtain the identification result of the target image and construct an overall loss function.
The device for realizing the deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model is mainly characterized by comprising the following components:
a processor configured to execute computer-executable instructions;
and a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model described above.
The processor for realizing the deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model is mainly characterized in that the processor is configured to execute computer executable instructions, and when the computer executable instructions are executed by the processor, the steps of the method for realizing the deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model are realized.
The computer readable storage medium is mainly characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to realize the steps of the method for realizing the identification of deeply forged human faces based on the rPPG multi-scale space-time diagram and the two-stage model described above.
By adopting the method, the system, the device, the processor and the computer readable storage medium for realizing the identification of deeply forged human faces based on the rPPG multi-scale space-time diagram and the two-stage model described above, the multi-scale space-time diagram of the heart-rate signal rPPG is innovatively taken as the model input, and a classical CNN model (such as EfficientNet) and a Transformer are used as the two-stage model. In order to enhance the model's perception of local position information, the invention also innovatively introduces a mask-guided local attention module, guiding the model to further distinguish the different patterns of real and fake space-time diagrams through the indication of the pixel-level space-time-diagram mask label. The Transformer module fuses the features of a plurality of neighboring rPPG space-time diagrams through a self-attention mechanism. The technical scheme has been experimentally verified on the FaceForensics++ data set and, compared with a baseline model, achieves a more prominent classification and identification effect.
Drawings
Fig. 1 is a schematic flow chart of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model according to the present invention.
Fig. 2 is a schematic flow chart of generating the multi-scale rPPG space-time diagram in the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model according to the present invention.
Fig. 3 is a schematic diagram of the frame structure of the system for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model according to the present invention.
Fig. 4 is a schematic diagram of the Transformer module according to the present invention.
Detailed Description
In order to more clearly describe the technical contents of the present invention, a further description will be made below in connection with specific embodiments.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to Fig. 1, the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model includes the following steps:
the method for realizing the identification of the deeply forged human face based on the rPPG multi-scale space-time diagram and the two-stage model is mainly characterized by comprising the following steps of:
(1) Collecting a deep fake face video data set, and preprocessing the video data to obtain a set of cropped face video frames;
(2) Generating an rPPG multi-scale space-time diagram according to the face video frame obtained after cutting;
(3) Constructing a mask-guided local attention module, performing first-stage training, and extracting the features of a single rPPG space-time diagram;
(4) Constructing a Transformer module, performing second-stage training, and fusing the comprehensive features of a plurality of adjacent rPPG space-time diagrams;
(5) Constructing a classification head, pooling the fused high-dimensional features, and performing classification and recognition on them to obtain the identification result of the target image and construct an overall loss function.
In practical application, the step (1) specifically includes:
downloading the FaceForensics++ data set from the official data set website to obtain the original videos, extracting images from the original videos, and obtaining cropped face images by using a face extractor;
in practical applications, as a preferred embodiment of the present invention, the step (2) specifically includes the following steps:
(2.1) dividing a complete video into a plurality of 64-frame video segments with a step of 16 frames;
(2.2) for each frame, carrying out face alignment and extracting face key points;
(2.3) selecting 6 heartbeat-signal information areas according to the face key points to form the ROI set R_t = {R_1t, R_2t, …, R_6t};
(2.4) for the ROI set R_t, computing, for each of its non-empty subsets, the mean of all pixels contained in the subset, yielding 2^6 − 1 = 63 pixel means over the three RGB channels;
(2.5) for each video clip, applying the operations of steps (2.2)-(2.4) to the 64 frames contained therein, obtaining a multi-scale space-time diagram of dimension 64 × 63 × 3, wherein 64 is the time length, 63 is the number of combinations of different information areas, and 3 is the number of RGB channels.
As a preferred embodiment of the present invention, the step (3) includes the steps of:
(3.1) constructing EfficientNet as the backbone network f(·); for an input rPPG space-time diagram X ∈ R^(3×64×63), extracting features through the backbone network and obtaining the middle-layer feature map F_m = f_mid(X) ∈ R^(C×H×W), wherein C, H and W respectively denote the number of channels, rows and columns of the feature map;
(3.2) building a mask-guided local attention module that takes the middle-layer feature map F_m as input and obtains an attention mask A_mask:
A_mask = Sigmoid(Conv(F_m))
Wherein Conv (·) represents a convolution operation;
(3.3) performing point-wise multiplication between the attention mask and the middle-layer feature map F_m to obtain the position-weighted feature map F′ = A_mask · F_m, and taking F′ as input for feature extraction by the subsequent network layers;
(3.4) calculating the pixel-level mask label A_gt of the rPPG space-time diagram: for an rPPG space-time diagram generated from a fake video, finding its corresponding real rPPG space-time diagram, taking the pixel-by-pixel difference to obtain a residual space-time diagram, graying the residual space-time diagram, normalizing it to the range 0 to 1, resizing it to the same size as the attention mask A_mask, and binarizing it with 0.1 as the threshold to obtain the corresponding pixel-level mask label A_gt;
(3.5) calculating the L1 distance between the attention mask A_mask and the corresponding pixel-level mask label A_gt as the mask loss function L_mask, according to the following formula:
L_mask = ||A_mask − A_gt||_1
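The label construction of step (3.4) might be sketched as below; nearest-neighbour indexing stands in for the unspecified resize operation, and the function and parameter names are assumptions:

```python
import numpy as np

def mask_label(fake_map, real_map, out_hw, thresh=0.1):
    """Pixel-level mask label A_gt from a fake/real space-time-map pair.

    fake_map, real_map: H x W x 3 rPPG space-time maps
    out_hw:             (h, w) target size matching the attention mask
    """
    residual = np.abs(fake_map.astype(float) - real_map.astype(float))
    gray = residual.mean(axis=-1)                      # graying
    rng = gray.max() - gray.min()
    gray = (gray - gray.min()) / rng if rng > 0 else gray * 0.0  # normalize to [0, 1]
    h, w = out_hw
    ri = np.arange(h) * gray.shape[0] // h             # nearest-neighbour resize
    ci = np.arange(w) * gray.shape[1] // w
    return (gray[np.ix_(ri, ci)] > thresh).astype(np.float32)  # binarize at 0.1
```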
As a preferred embodiment of the present invention, the step (4) specifically includes the following steps:
(4.1) constructing the input sequence Z_in of the Transformer:
K temporally adjacent rPPG space-time diagrams are respectively input into the backbone network trained in the first stage to obtain K global high-dimensional features F_h; global average pooling is then performed, and a classification encoding and a one-dimensional learnable position encoding are superimposed to form the input sequence Z_in of the Transformer;
(4.2): constructing a two-stage model transducer to obtain the comprehensive characteristics of fusion of a plurality of adjacent rPPG time-space diagrams:
input sequence Z in Generating a query matrix through a linear mapping layerKey matrix And a Value matrix +.>Then, three matrices are transferred into the multi-head self-attention mechanism MSA, as shown in the following formula:
wherein T is matrix transposition operation, and d is normalization constant. Then the characteristic fusion output Z after the conversion process is obtained through the FFN processing of a feedforward network layer formed by a multi-layer perceptron out 。
As a preferred embodiment of the present invention, the step (5) specifically includes:
global average pooling is carried out on the fused features, and then a fully connected network FC is used to map the dimension number to the category number 2 so as to obtainCalculating a final prediction score y' according to Z, and calculating a two-category cross entropy loss L according to the label y ce Finally, constructing an overall loss function L all As shown in the following formula:
Z = FC(g(Z_out))
L_ce = −[y log y′ + (1 − y) log(1 − y′)]
L_all = L_ce + λL_mask
where λ is the hyper-parameter used to balance cross entropy loss and mask loss.
Referring to Fig. 3, the system for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model by using the method described above includes:
the rPPG multi-scale space-time diagram generation module is used for calculating an rPPG space-time diagram from the face video frame;
the mask-guided local attention module is connected with the rPPG multi-scale space-time diagram generation module and is used for enhancing the learning of local information and extracting the characteristics of a single rPPG space-time diagram;
the transducer module is connected with the local attention module guided by the mask and used for fusing the comprehensive characteristics of a plurality of adjacent rPPG time-space diagrams; and
the classification head module is connected with the transducer module and is used for pooling the integrated features after fusion and carrying out classification recognition processing so as to obtain the identification result of the target image and construct an overall loss function.
In a specific embodiment of the present invention, the classification and identification method using the present technical solution is tested as follows:
(1) Experimental data set
The invention uses the deep face forgery data set FaceForensics++ (FF++) for experimental verification. The FF++ data set includes 1000 original videos, of which 720 are used for training and 280 for testing and validation. Each video was forged by four different facial manipulation methods, namely Deepfakes (DF), Face2Face (F2F), FaceSwap (FS) and NeuralTextures (NT). Two of these methods replace the full face (DF and FS), while the other two only manipulate localized areas around the mouth or eyes (F2F and NT).
(2) Training process
The initial learning rate is set to 1e-2, an SGD optimizer is used, the batch size is set to 32, and training runs for 30 epochs.
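The stated schedule (SGD, initial learning rate 1e-2, batch size 32, 30 epochs) can be illustrated on a stand-in logistic-regression classifier; the actual model is the two-stage EfficientNet + Transformer network, so everything below except the hyper-parameter values is a placeholder:

```python
import numpy as np

def sgd_train(x, y, lr=1e-2, batch=32, epochs=30, seed=0):
    """Mini-batch SGD with the schedule stated above, on a logistic model."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(x))                  # reshuffle each epoch
        for start in range(0, len(x), batch):
            b = idx[start:start + batch]
            p = 1.0 / (1.0 + np.exp(-(x[b] @ w)))      # predicted scores
            w -= lr * x[b].T @ (p - y[b]) / len(b)     # BCE gradient step
    return w
```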
(3) Test results
In this embodiment, training and testing are performed on the four sub-data sets of FF++ respectively to evaluate the real/fake binary classification capability of the method; multi-class training and testing over the five classes are then performed on the data set, and Accuracy (Acc.) is selected as the algorithm evaluation index. The experimental results are shown in Table 1.
Table 1. Performance of the model on the FF++ data set (%)
As can be seen from Table 1, the present embodiment performs excellently on the FF++ data set, in both the real/fake binary classification and the five-class multi-classification settings, demonstrating the effectiveness of the algorithm.
The device for realizing the identification of the deeply forged human face based on the rPPG multi-scale space-time diagram and the two-stage model comprises:
a processor configured to execute computer-executable instructions;
and a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model described above.
The processor is configured to execute computer executable instructions, which when executed by the processor, implement the steps of the method for implementing the deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model.
The computer readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model described above.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution device.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
In this specification, the invention has been described with reference to specific embodiments thereof. It will be apparent, however, that various modifications and changes may be made without departing from the spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (10)
1. The method for realizing the identification of deeply forged human faces based on the rPPG multi-scale space-time diagram and the two-stage model is characterized by comprising the following steps:
(1) Collecting a deep fake face video data set, and preprocessing the video data to obtain a set of cropped face video frames;
(2) Generating an rPPG multi-scale space-time diagram according to the face video frame obtained after cutting;
(3) Constructing a mask-guided local attention module, performing first-stage training, and extracting the features of a single rPPG space-time diagram;
(4) Constructing a Transformer module, performing second-stage training, and fusing the comprehensive features of a plurality of adjacent rPPG space-time diagrams;
(5) Constructing a classification head, pooling the fused high-dimensional features, and performing classification and recognition on them to obtain the identification result of the target image and construct an overall loss function.
2. The method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model according to claim 1, wherein the step (2) includes the following steps:
(2.1) dividing a complete video into a plurality of T-frame video segments with a step of ω frames;
(2.2) for each frame, carrying out face alignment and extracting face key points;
(2.3) selecting n heartbeat-signal information areas according to the face key points to form an ROI set R_t = {R_1t, R_2t, …, R_nt};
(2.4) for the ROI set R_t, computing, for each of its non-empty subsets, the mean of all pixels contained in the subset, yielding 2^n − 1 pixel means over the three RGB channels;
(2.5) for each video clip, applying the operations of steps (2.2)-(2.4) to the T frames contained therein, resulting in a multi-scale space-time diagram of dimension T × (2^n − 1) × 3, wherein T is the time length, 2^n − 1 is the number of combinations of different information areas, and 3 is the number of RGB channels.
3. The method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and the two-stage model according to claim 1, wherein the step (3) specifically comprises the following steps:
(3.1) constructing EfficientNet as the backbone network f(·); for an input rPPG space-time diagram X, extracting features through the backbone network and obtaining the middle-layer feature map F_m = f_mid(X) ∈ R^(C×H×W), wherein C, H and W respectively denote the number of channels, rows and columns of the feature map;
(3.2) building a mask-guided local attention module that takes the middle-layer feature map F_m as input and outputs an attention mask A_mask:
A_mask = Sigmoid(Conv(F_m))
Wherein Conv (·) represents a convolution operation;
(3.3) performing element-wise multiplication of the attention mask with the middle-layer feature map F_m to obtain the position-weighted feature map F′ = A_mask · F_m, and taking F′ as input for feature extraction in the subsequent network layers;
(3.4) calculating the pixel-level mask label A_gt of the rPPG space-time diagram: for an rPPG space-time diagram generated from a fake video, find its corresponding real rPPG space-time diagram and take the pixel-wise difference to obtain a residual space-time diagram; convert the residual diagram to grayscale, normalize it to [0, 1], resize it to the same size as the attention mask A_mask, and binarize it with 0.1 as the threshold to obtain the corresponding pixel-level mask label A_gt;
(3.5) calculating the L1 distance between the attention mask A_mask and the corresponding pixel-level mask label A_gt as the mask loss function L_mask, according to the following formula:
L_mask = |A_mask − A_gt|_1.
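A minimal sketch (not part of the claims) of the mask-label construction in steps (3.4)-(3.5), assuming the maps are already resized to the same H × W grid so the resize step can be omitted; function names and the nested-list representation are illustrative.

```python
def mask_label(fake_map, real_map, threshold=0.1):
    """Pixel-level mask label A_gt from a fake/real rPPG space-time map pair.

    fake_map, real_map: H x W grids of (r, g, b) values.
    Residual -> grayscale -> normalize to [0, 1] -> binarize at `threshold`.
    """
    H, W = len(fake_map), len(fake_map[0])
    # Pixel-wise residual, converted to grayscale by channel averaging
    gray = [[sum(abs(fake_map[i][j][c] - real_map[i][j][c]) for c in range(3)) / 3
             for j in range(W)] for i in range(H)]
    lo = min(min(r) for r in gray)
    hi = max(max(r) for r in gray)
    scale = (hi - lo) or 1.0  # guard against identical maps
    return [[1.0 if (g - lo) / scale > threshold else 0.0 for g in row]
            for row in gray]

def mask_loss(a_mask, a_gt):
    """L1 distance between the attention mask and its pixel-level label."""
    return sum(abs(m - g)
               for rm, rg in zip(a_mask, a_gt)
               for m, g in zip(rm, rg))
```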
4. The method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model according to claim 1, wherein the step (4) specifically comprises the following steps:
(4.1) inputting the K adjacent rPPG space-time diagrams respectively into the backbone network trained in the first stage to obtain K global high-dimensional features F_h; then performing global average pooling, and appending a classification token and one-dimensional learnable position encodings to form the input sequence Z_in of the Transformer;
(4.2) constructing a Transformer-based feature fusion module for the multiple rPPG space-time diagrams: the input sequence Z_in undergoes the multi-head self-attention operation MSA followed by the feed-forward network FFN; after each operation, layer normalization LN and residual connection are applied to adjust the output, yielding the Transformer output result Z_out.
5. The method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model according to claim 4, wherein the step (4.2) specifically comprises the following steps:
(4.2.1) the input sequence Z_in is passed through linear mapping layers to generate the Query matrix Q, the Key matrix K and the Value matrix V; the three matrices are then fed into the multi-head self-attention mechanism MSA, as shown in the following equation:
Attention(Q, K, V) = Softmax(QK^T / √d)V
where d is a normalization constant and T denotes the matrix transpose operation;
(4.2.2) obtaining the fused feature output Z_out after the Transformer processing, through the feed-forward network layer FFN consisting of a multi-layer perceptron.
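For illustration (not part of the claims), step (4.2.1) reduces per head to scaled dot-product attention. The sketch below implements a single head in plain Python; the multi-head case concatenates several such heads, and the helper names are assumptions.

```python
import math

def matmul(A, B):
    """Naive matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    Kt = [list(col) for col in zip(*K)]               # K^T
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, Kt)]
    weights = [softmax(row) for row in scores]        # attention weights
    return matmul(weights, V)
```

Each output row is a convex combination of the Value rows, weighted by how strongly the corresponding query matches each key.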
6. The method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model according to claim 4, wherein the step (5) specifically comprises the following steps:
(5.1) performing global average pooling g(·) on the fused comprehensive feature Z_out obtained from the second-stage training output, and then using a fully connected network FC to map the dimension to the number of classes, 2, obtaining the vector Z ∈ R^2, as shown in the following formula:
Z = FC(g(Z_out))
(5.2) applying Softmax to the vector Z to obtain the final prediction score y′, and calculating the binary cross-entropy loss L_ce against the label y, as shown in the following formula:
L_ce = −[y log y′ + (1 − y) log(1 − y′)]
(5.3) constructing the overall loss function L_all as shown in the following formula:
L_all = L_ce + λL_mask
where λ is the hyper-parameter used to balance cross entropy loss and mask loss.
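A compact sketch of steps (5.1)-(5.3), for illustration only: pooling, a linear head, softmax, binary cross-entropy, and the combined loss. The function name, the explicit weight/bias arguments, and the default λ = 0.1 are assumptions not stated in the claims.

```python
import math

def predict_and_loss(z_out, w, b, y, l_mask, lam=0.1):
    """Classification head and overall loss L_all = L_ce + lambda * L_mask.

    z_out: list of K fused feature vectors (length C each);
    w: 2 x C FC weights; b: 2 FC biases; y: ground-truth label (0 or 1);
    l_mask: mask loss from the first stage.
    """
    C = len(z_out[0])
    # g(Z_out): global average pooling over the K feature vectors
    pooled = [sum(f[c] for f in z_out) / len(z_out) for c in range(C)]
    # FC layer mapping to the 2 class logits
    logits = [sum(wi * x for wi, x in zip(w[k], pooled)) + b[k] for k in range(2)]
    # Softmax -> prediction score y' (probability of the "fake" class)
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    p = [x / sum(e) for x in e]
    y_hat = p[1]
    # Binary cross-entropy against the label y
    l_ce = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return y_hat, l_ce + lam * l_mask
```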
7. A system for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model using the method of any one of claims 1 to 6, characterized in that the system comprises:
the rPPG multi-scale space-time diagram generation module, which is used for calculating rPPG space-time diagrams from the face video frames;
the mask-guided local attention module, which is connected with the rPPG multi-scale space-time diagram generation module and is used for enhancing the learning of local information and extracting the features of a single rPPG space-time diagram;
the Transformer module, which is connected with the mask-guided local attention module and is used for fusing the comprehensive features of a plurality of adjacent rPPG space-time diagrams; and
the classification head module, which is connected with the Transformer module and is used for pooling the fused comprehensive features and performing classification and recognition, so as to obtain the identification result of the target image and construct an overall loss function.
8. An apparatus for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model, the apparatus comprising:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model of any one of claims 1 to 6.
9. A processor for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model, characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the respective steps of the method for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model according to any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method of any one of claims 1 to 6 for implementing deep fake face identification based on the rPPG multi-scale space-time diagram and two-stage model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310093773.8A CN116012958A (en) | 2023-02-10 | 2023-02-10 | Method, system, device, processor and computer readable storage medium for implementing deep fake face identification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116012958A (en) | 2023-04-25 |
Family
ID=86037336
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310093773.8A Pending CN116012958A (en) | 2023-02-10 | 2023-02-10 | Method, system, device, processor and computer readable storage medium for implementing deep fake face identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116012958A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258914A (en) * | 2023-05-15 | 2023-06-13 | 齐鲁工业大学(山东省科学院) | Remote sensing image classification method based on machine learning and local and global feature fusion |
CN116258914B (en) * | 2023-05-15 | 2023-08-25 | 齐鲁工业大学(山东省科学院) | Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion |
CN116311482A (en) * | 2023-05-23 | 2023-06-23 | 中国科学技术大学 | Face fake detection method, system, equipment and storage medium |
CN116311482B (en) * | 2023-05-23 | 2023-08-29 | 中国科学技术大学 | Face fake detection method, system, equipment and storage medium |
CN116385468A (en) * | 2023-06-06 | 2023-07-04 | 浙江大学 | System based on zebra fish heart parameter image analysis software generation |
CN116385468B (en) * | 2023-06-06 | 2023-09-01 | 浙江大学 | System based on zebra fish heart parameter image analysis software generation |
CN116486464A (en) * | 2023-06-20 | 2023-07-25 | 齐鲁工业大学(山东省科学院) | Attention mechanism-based face counterfeiting detection method for convolution countermeasure network |
CN116486464B (en) * | 2023-06-20 | 2023-09-01 | 齐鲁工业大学(山东省科学院) | Attention mechanism-based face counterfeiting detection method for convolution countermeasure network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Van Quang et al. | CapsuleNet for micro-expression recognition | |
CN116012958A (en) | Method, system, device, processor and computer readable storage medium for implementing deep fake face identification | |
Liu et al. | Deep learning face attributes in the wild | |
CN110837570B (en) | Method for unbiased classification of image data | |
Narayanan et al. | Hybrid machine learning architecture for automated detection and grading of retinal images for diabetic retinopathy | |
CN111968124B (en) | Shoulder musculoskeletal ultrasonic structure segmentation method based on semi-supervised semantic segmentation | |
Chen et al. | A pornographic images recognition model based on deep one-class classification with visual attention mechanism | |
Lajevardi et al. | Facial expression recognition from image sequences using optimized feature selection | |
Littlewort et al. | Fully automatic coding of basic expressions from video | |
Bhattacharyya et al. | Recognizing gender from human facial regions using genetic algorithm | |
CN114937298A (en) | Micro-expression recognition method based on feature decoupling | |
Dadwhal et al. | Data-driven skin detection in cluttered search and rescue environments | |
Bachay et al. | Hybrid Deep Learning Model Based on Autoencoder and CNN for Palmprint Authentication. | |
Aktürk et al. | Classification of eye images by personal details with transfer learning algorithms | |
Takalkar et al. | Improving micro-expression recognition accuracy using twofold feature extraction | |
Yatbaz et al. | Deep learning based stress prediction from offline signatures | |
George et al. | Multi-channel face presentation attack detection using deep learning | |
Iniyan et al. | Wavelet transformation and vertical stacking based image classification applying machine learning | |
Neagoe et al. | Subject independent drunkenness detection using pulse-coupled neural network segmentation of thermal infrared facial imagery | |
Kumar et al. | Siamese based Neural Network for Offline Writer Identification on word level data | |
Iffath et al. | A Novel Three Stage Framework for Person Identification From Audio Aesthetic | |
CN113343770A (en) | Face anti-counterfeiting method based on feature screening | |
Wang et al. | Audiovisual emotion recognition via cross-modal association in kernel space | |
Lu et al. | Joint Subspace and Low‐Rank Coding Method for Makeup Face Recognition | |
Filisbino et al. | Multi-class nonlinear discriminant feature analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||