CN113762138B - Identification method, device, computer equipment and storage medium for fake face pictures - Google Patents


Info

Publication number
CN113762138B
CN113762138B (application CN202111027883.1A)
Authority
CN
China
Prior art keywords
face
network
video
fake
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111027883.1A
Other languages
Chinese (zh)
Other versions
CN113762138A (en)
Inventor
王佳琪
李玉惠
傅强
蔡琳
阿曼太
梁彧
马寒军
田野
王杰
杨满智
金红
陈晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eversec Beijing Technology Co Ltd
Original Assignee
Eversec Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eversec Beijing Technology Co Ltd filed Critical Eversec Beijing Technology Co Ltd
Priority to CN202111027883.1A priority Critical patent/CN113762138B/en
Publication of CN113762138A publication Critical patent/CN113762138A/en
Application granted granted Critical
Publication of CN113762138B publication Critical patent/CN113762138B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The embodiment of the invention discloses a method, a device, computer equipment and a storage medium for identifying fake face pictures. The method comprises the following steps: acquiring a deepfake dataset, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video; forming a plurality of negative sample images according to the real face and the replacement face, and forming a plurality of positive sample images according to the real video; constructing a training sample set from the negative and positive sample images, and training a machine learning model on it to form a fake face picture recognition model; and inputting a target face picture to be identified into the recognition model to obtain a recognition result indicating whether the target face picture is a fake face picture. The embodiment of the invention solves the problems that public datasets contain only a single type of forged face and that the network may overfit, achieves data enhancement, and meets the requirements of deepfake detection and recognition.

Description

Identification method, device, computer equipment and storage medium for fake face pictures
Technical Field
The embodiment of the invention relates to the field of computer technology, in particular to deep learning, computer vision, deepfake generation and deepfake detection, and specifically to a method, a device, computer equipment and a storage medium for recognizing fake face pictures.
Background
The goal of deepfake detection is to counter deepfake generation techniques. Current methods can be divided into two major classes according to the type of object they operate on: picture-level and video-level.
Most picture-level detection and recognition models depend on the training and test data sharing the same distribution, and generalize poorly to unknown tampering types. Video-level detection and recognition methods perform well and can detect even a small amount of tampering in a video, but they are sensitive to video preprocessing, such as compression and lighting changes, and cannot judge the authenticity of a single-frame image. In short, picture-level methods suffer from poor generalization and low accuracy, while video-level methods cannot recognize a single image; neither meets the practical requirements of deepfake detection and recognition.
Disclosure of Invention
The embodiment of the invention provides a method, a device, computer equipment and a storage medium for recognizing fake face pictures, which solve the problems that the forged faces in public datasets are of a single type and that the network may overfit, achieve data enhancement, and meet the requirements of deepfake detection and recognition.
In a first aspect, an embodiment of the present invention provides a method for identifying a forged face picture, where the method includes:
acquiring a deepfake dataset, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video;
Performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model;
And inputting the target face picture to be identified into the fake face picture identification model, and acquiring an identification result of whether the target face picture is the fake face picture.
In a second aspect, an embodiment of the present invention further provides a device for identifying a forged face picture, where the device for identifying a forged face picture includes:
the deepfake dataset acquisition module is used for acquiring a deepfake dataset, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video;
The image enhancement processing module is used for carrying out image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
The fake face picture recognition model forming module is used for constructing a training sample set according to each negative sample image and each positive sample image, training a preset machine learning model by using the training sample set, and forming a fake face picture recognition model;
The result recognition module is used for inputting the target face picture to be recognized into the fake face picture recognition model and obtaining the recognition result of whether the target face picture is the fake face picture.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the method for identifying a forged face picture according to any embodiment of the present invention when the processor executes the computer program.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions, on which a computer program is stored, where the program, when executed by a processor, implements the method for identifying a fake face picture according to any embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, a deepfake dataset is acquired, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video; image enhancement processing is performed according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and a plurality of positive sample images are formed according to the real video included in each video pair; a training sample set is constructed from the negative and positive sample images, and a preset machine learning model is trained with it to form a fake face picture recognition model; and the target face picture to be identified is input into the recognition model to obtain a recognition result indicating whether it is a fake face picture. This solves the problems that the forged faces in public datasets are of a single type and that the network may overfit, achieves data enhancement, and meets the requirements of deepfake detection and recognition.
Drawings
Fig. 1a is a flowchart of a method for recognizing a fake face picture according to an embodiment of the present invention;
fig. 1b is a schematic diagram of an effect of synthesizing a genuine-fake face in a method for recognizing a fake face picture according to an embodiment of the present invention;
Fig. 1c is a schematic structural diagram of a fusion EFFICIENTNET-b0 network to which the identification method for fake face pictures provided in the first embodiment of the present invention is applicable;
Fig. 1d is a schematic diagram of a specific structure of a fused EFFICIENTNET-b0 network to which the identification method for fake face pictures provided in the first embodiment of the present invention is applicable;
fig. 1e is a schematic flow chart of feature map size change of a fused EFFICIENTNET-b0 network to which the identification method for fake face pictures is applicable according to the first embodiment of the present invention;
FIG. 1f is a schematic diagram of an initial super-parameter setting of a network training integrated EFFICIENTNET-b0 in a method for recognizing a fake face picture according to an embodiment of the present invention;
Fig. 1g is a schematic diagram of a fusion EFFICIENTNET-b0 network and a standard EFFICIENTNET-b0 network for training loss and accuracy comparison, to which the identification method for fake face pictures provided in the first embodiment of the present invention is applicable;
fig. 1h is an overall logic flow diagram in a method for recognizing a fake face picture according to an embodiment of the present invention;
Fig. 2 is a block diagram of a recognition device for forging a face picture according to a second embodiment of the present invention;
Fig. 3 is a block diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1a is a flowchart of a method for recognizing a fake face picture according to an embodiment of the present invention. The embodiment is applicable to training a fake face picture recognition model with better recognition performance by synthesizing a variety of fake faces from real and fake faces as a form of data enhancement. The method of this embodiment may be performed by a device for recognizing fake face pictures, which may be implemented in software and/or hardware and may generally be integrated into a terminal or a server with a data processing function.
Correspondingly, the method specifically comprises the following steps:
S110, acquiring a deepfake dataset, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video.
A deepfake dataset is a publicly available dataset of audiovisual content (such as images, audio, video and text) created or synthesized with intelligent methods such as deep learning. Here, the deepfake dataset includes a plurality of video pairs for training the fake face picture recognition model. Each video pair includes a real video as a positive sample and a fake video as a negative sample. The face included in the real video is a real face; the fake video contains fabricated, unreal content, and the face it includes is a fake face obtained by replacing the real face in the real video.
In an alternative implementation of this embodiment, the deepfake dataset may be constructed from the public Celeb-DF deepfake detection dataset.
The positive samples in this dataset contain 587 real videos (featuring 61 actors), and the negative samples contain 5637 fake videos (generated by swapping the faces of the actors in the real videos). The number of face identities in the fake videos is therefore only 61, and training a model on data with such low diversity easily causes the model to overfit.
In this embodiment, the positive and negative samples included in the deepfake dataset are combined into video pairs to form new fake faces that do not exist in the dataset, enriching the number and variety of negative samples.
S120, performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair.
In this embodiment, the negative sample image and the positive sample image are training samples for training to obtain a fake face image recognition model. The positive sample image is an unprocessed real image, the real image comprises a real face, and the negative sample image is a forged image obtained by replacing the real face in the positive sample image with a forged face.
Image enhancement is an important part of image processing. During image generation, transmission or transformation, various factors can degrade image quality, blur the image and submerge its features, making analysis and recognition difficult. Image enhancement therefore selectively highlights the features of interest in an image, attenuates unwanted features, and improves the interpretability of the image as required.
The purpose of image enhancement is to improve the visual quality of the image; common image enhancement techniques include contrast adjustment, histogram correction, noise processing, edge enhancement, transform processing and pseudo-color.
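As a concrete illustration of two of the techniques listed above, the following is a minimal numpy sketch of linear contrast stretching and histogram equalization for an 8-bit single-channel image. These are standard textbook operations, not the patent's specific pipeline, and the function names are illustrative:

```python
import numpy as np

def stretch_contrast(img):
    """Linear contrast stretching: rescale pixel values to the full [0, 255] range."""
    lo, hi = int(img.min()), int(img.max())
    if hi == lo:  # flat image: nothing to stretch
        return np.zeros_like(img, dtype=np.uint8)
    return ((img - lo) * 255.0 / (hi - lo)).astype(np.uint8)

def equalize_histogram(img):
    """Histogram equalization for a single-channel 8-bit image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # map each grey level through the normalized cumulative distribution
    lut = (cdf - cdf_min) * 255.0 / (cdf[-1] - cdf_min)
    return lut.clip(0, 255).astype(np.uint8)[img]
```

Either transform can be applied to a face crop before it is pasted back into a frame, producing additional variation in the training samples.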
In this embodiment, the real face and the substitute face included in each video pair, together with image enhancement techniques, are used to construct fake faces that differ from both the real face and the substitute face and that do not exist in the deepfake dataset; a negative sample image is formed from each constructed fake face.
In an optional implementation manner of this embodiment, performing image enhancement processing according to the real face and the substitute face included in each video pair to form a plurality of negative sample images includes:
intercepting a real video image frame and a fake video image frame from the target real video and the target fake video of the currently processed target video pair, respectively, according to a preset interception time point; extracting a real face feature tensor and a substitute face feature tensor from the real video image frame and the fake video image frame, respectively; forming at least one fake face feature tensor according to the real face feature tensor and the substitute face feature tensor; and replacing the real face feature tensor in the real video image frame, or the substitute face feature tensor in the fake video image frame, with each fake face feature tensor to form at least one negative sample image.
The interception time point is the time point at which one frame of image is captured from both the real video and the fake video; it may be selected randomly, periodically, or in another manner.
It can be understood that two video frames taken at the same playing time point from the real video and the fake video share the same background and environmental information; the only difference is that the frame from the real video contains the real face while the frame from the fake video contains the substitute face. Therefore, by intercepting the real video and the fake video at the same interception time point, two video frames with the same background but different faces are obtained: a real video image frame and a fake video image frame.
It will be appreciated that when selecting the point in time of the truncation, it is necessary to ensure that a face must be included in the video image frame at that point in time of the truncation. The embodiment is not limited to the specific selection form of the interception time point.
An image frame is the smallest unit of a video; here frames are divided into real video image frames and fake video image frames. A tensor is a multilinear map defined on Cartesian products of vector spaces and their dual spaces; in an n-dimensional coordinate space it has components that are functions of the coordinates, these components transform linearly under coordinate changes according to fixed rules, and the tensor can be represented as a multi-dimensional array of components.
In this embodiment, the face regions may be obtained from the real video image frame and the fake video image frame by existing face region extraction methods, and the pixel values of the pixels in each face region are combined to form the real face feature tensor and the substitute face feature tensor.
Specifically, the fake face feature tensor is synthesized from the real face feature tensor and the substitute face feature tensor.
In this embodiment, the real and fake videos of each pair in the public deepfake detection dataset are read, and both videos are cut into the specified number of frames according to the preset interception time points. The real face feature tensor and the substitute face feature tensor are then extracted from the real video image frame and the fake video image frame, and at least one fake face feature tensor is formed from them. Finally, each fake face feature tensor replaces the real face feature tensor in the real video image frame, or the substitute face feature tensor in the fake video image frame; the real and fake images are fused and image enhancement is applied to form a plurality of negative sample images.
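The frame-pair and tensor operations described above can be sketched as follows. This is a minimal numpy sketch that assumes the video frames are already decoded into arrays and the face bounding boxes are already known (e.g. from an off-the-shelf face detector); all function names and the box convention are illustrative:

```python
import numpy as np

def grab_frame_pair(real_frames, fake_frames, t):
    """Take the frames at the same interception index t from a real/fake video pair."""
    return real_frames[t], fake_frames[t]

def extract_face_tensor(frame, box):
    """Crop the face region (top, left, height, width) as a float feature tensor."""
    top, left, h, w = box
    return frame[top:top + h, left:left + w].astype(np.float32)

def replace_face(frame, box, face_tensor):
    """Write a (possibly synthesized) face tensor back into a frame copy."""
    top, left, h, w = box
    out = frame.copy()
    out[top:top + h, left:left + w] = face_tensor.astype(frame.dtype)
    return out
```

A negative sample is then produced by blending the two extracted face tensors (see the formula below in the text) and pasting the result back with `replace_face`.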
The advantages of this arrangement are that: a plurality of negative sample images are formed by performing operations such as acquisition of image frames, extraction and replacement of tensors in the image frames, and the like on the target real video and the target counterfeit video. The image data can be enhanced, so that the problem of single face types in the forged video in the public data set is solved, the forged faces with more diversity are synthesized, and the richness of the data set sample is improved.
Optionally, forming at least one fake face feature tensor according to the real face feature tensor and the substitute face feature tensor may include:
randomly generating at least one face synthesis proportion weight p, where p ∈ (0, 1); and calculating, according to the formula output = (1 − p) · real_input_tensor + p · fake_input_tensor, the fake face feature tensor output corresponding to each face synthesis proportion weight, where real_input_tensor is the real face feature tensor and fake_input_tensor is the substitute face feature tensor.
A weight expresses the relative importance of a factor or index with respect to a target; here, p specifies the proportions of the real face feature tensor and the substitute face feature tensor within the synthesized face feature tensor.
In this embodiment, as the synthesis proportion p increases, the synthesized face gradually approaches the fake face, as shown in fig. 1b. When p = 1, the synthesized face is identical to the fake face; when p is between 0 and 1, the generated face contains features of both faces in the real-fake video pair, forming a new face that does not exist among the 61 actor faces in the public dataset, i.e., a forged face. By assigning p a random value between 0 and 1 during image enhancement, a wide variety of fake faces can be generated.
The advantages of this arrangement are as follows: by randomly generating at least one face synthesis proportion weight p, a wide variety of forged faces can be generated, which alleviates both overfitting and the problem of a single sample type.
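The blending formula above can be written directly in code. The sketch below follows the patent's formula output = (1 − p) · real_input_tensor + p · fake_input_tensor; the 64×64×3 face shape and the sampling range for p are illustrative assumptions:

```python
import numpy as np

def synthesize_fake_face(real_input_tensor, fake_input_tensor, p):
    """output = (1 - p) * real_input_tensor + p * fake_input_tensor, p in (0, 1)."""
    return (1.0 - p) * real_input_tensor + p * fake_input_tensor

rng = np.random.default_rng(0)
real = rng.random((64, 64, 3))   # stand-in for the real face feature tensor
fake = rng.random((64, 64, 3))   # stand-in for the substitute face feature tensor

# one synthesized face per randomly drawn proportion weight
weights = rng.uniform(0.05, 0.95, size=4)
synthetic = [synthesize_fake_face(real, fake, p) for p in weights]
```

The endpoints behave as the text states: p = 0 reproduces the real face and p = 1 reproduces the substitute face, with every intermediate p yielding a new face identity.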
Optionally, forming a plurality of positive sample images according to the real video included in each video pair may include:
intercepting, from the currently processed real video, candidate real video frames corresponding to at least one interception time point, and taking the candidate real video frames that include a real face as positive sample images.
In this embodiment, candidate real video frames are captured at the selected time points of the real video and used as positive sample images; training the model on these frames enriches the positive sample set and enhances the data.
S130, constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model.
A machine learning model is a model that uses data, samples or past experience to optimize the performance criteria of a computer program. It may be a classification model based on various classification algorithms, for example an EFFICIENTNET-b0 network model or a decision tree model; this embodiment does not limit the choice.
The fake face picture recognition model specifically refers to the model obtained after training the machine learning model with the training sample set, and is used for recognizing fake face pictures.
S140, inputting the target face picture to be identified into the fake face picture identification model, and acquiring an identification result of whether the target face picture is the fake face picture.
The recognition result is the conclusion drawn about the authenticity of the input target picture, that is, whether the target face picture is a fake face picture.
According to the technical scheme provided by the embodiment of the invention, a deepfake dataset is acquired, wherein the deepfake dataset comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on the real face in the real video; image enhancement processing is performed according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and a plurality of positive sample images are formed according to the real video included in each video pair; a training sample set is constructed from the negative and positive sample images, and a preset machine learning model is trained with it to form a fake face picture recognition model; and the target face picture to be identified is input into the recognition model to obtain a recognition result indicating whether it is a fake face picture. This solves the problems that the forged faces in public datasets are of a single type and that the network may overfit, achieves data enhancement, and meets the requirements of deepfake detection and recognition.
Based on the above embodiments, the machine learning model is a fusion EFFICIENTNET-b0 network (which may also be referred to simply as fusion EFFICIENTNET-b0), an improved version of the standard EFFICIENTNET-b0 network.
Wherein the converged EFFICIENTNET-b0 network comprises: the system comprises a convolutional network, a feature fusion network connected with at least two convolutional layers in the convolutional network, and a classification network connected with the feature fusion network; the convolution network comprises a plurality of end-to-end convolution layers, each convolution layer is used for carrying out feature extraction on the input feature tensor to obtain an output feature tensor with a set scale, and the input feature tensor of the first convolution layer is the feature tensor of the input picture input to the fusion EFFICIENTNET-b0 network; the feature fusion network is used for acquiring output feature tensors of at least two connected convolution layers, and performing layer-by-layer scale transformation and feature weighting on each output feature tensor to obtain a target fusion feature tensor; and the classification network is used for outputting a classification result of whether the input picture is a fake face picture or not according to the target fusion characteristic tensor.
That is, in an embodiment of the present invention, the efficientnet-pytorch package is used to load the standard EFFICIENTNET-b0 network, which is then modified and optimized to obtain the fusion EFFICIENTNET-b0 network.
Wherein the standard EFFICIENTNET-b0 network includes thirty convolutional layers in total, each of which has the structure shown in table 1.
TABLE 1
The input image size of the standard EFFICIENTNET-b0 network is 224×224×3, and the output feature dimension is 7×7×320.
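The stated reduction from a 224×224×3 input to a 7×7×320 output corresponds to halving the spatial resolution five times (a factor of 32). A small sketch of that size arithmetic, assuming "same" padding and five stride-2 convolutions, which is consistent with the 224 → 7 figures given above:

```python
def spatial_size(input_size, strides):
    """Track the feature-map side length through a sequence of conv strides,
    assuming 'same' padding (output side = ceil(input side / stride))."""
    size = input_size
    trace = [size]
    for s in strides:
        size = -(-size // s)  # ceiling division
        trace.append(size)
    return trace

# Five stride-2 stages: 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(spatial_size(224, [2, 2, 2, 2, 2]))  # -> [224, 112, 56, 28, 14, 7]
```

The channel count grows independently of this spatial schedule, reaching 320 at the final tapped layer.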
In this embodiment, the feature extraction result output by the thirtieth convolution layer of the standard EFFICIENTNET-b0 network is not used directly to classify real and fake pictures. Instead, the output feature tensors of at least two of the thirty convolution layers are obtained, each output feature tensor undergoes layer-by-layer scale transformation and feature weighting to obtain a target fusion feature tensor, and this tensor is then used to classify real and fake pictures.
This network design is motivated by the fact that the face region in a deepfake video contains many macroscopic cues, such as the junction between the face and the background, as well as many microscopic cues, such as how natural the eyes look; all of these can serve as features for detecting deepfake video. Meanwhile, the information contained in the feature maps output by the different layers of the standard EFFICIENTNET-b0 network differs greatly: shallow feature maps highlight features such as contour position, while deep feature maps capture the overall details of the image. In the prior art, MesoNet argues that mid-level feature maps in a classification network are more beneficial for deepfake detection and recognition, and therefore extracts mid-level feature maps for feature-mapping judgment; the winning solution of a deepfake detection competition chose not to modify the network used to extract and classify face features, in other words, it used the deepest feature map as the basis for judgment. Based on these findings, a feature depth fusion method is adopted that fuses shallow, middle and deep features as the feature tensor used to compute the classification result (i.e., for loss calculation and gradient updates by backpropagation).
That is, in this embodiment, the output feature tensors of convolution layers at different depths among the thirty convolution layers of the standard EFFICIENTNET-b0 network are obtained and weighted to form the target fusion feature tensor. This captures image features of different scales and emphases, so that classifying real and fake pictures with the target fusion feature tensor yields better classification accuracy.
Further, note that the standard EFFICIENTNET-b0 network is divided into seven blocks overall, with the second, fifth, tenth, thirteenth, eighteenth, twenty-third and twenty-ninth convolution layers marking the boundaries of the blocks.
Considering that the above-mentioned respective convolution layers can already obtain output feature tensors of different convolution depths, the corresponding,
And the fusion EFFICIENTNET-b0 network takes out the feature tensors of the seven layers one by one, performs feature depth fusion, then sends (or does not send) the features of the second nineteenth layer of the original network into a global average pooling layer, outputs the features to a classification network, and finally obtains a classification result through an activation layer, loss calculation and back propagation calculation update gradient in the classification network.
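The tail end of this pipeline — global average pooling over the fused feature map followed by a classification head with an activation — can be sketched in NumPy as below. This is only an illustrative sketch: the 112×112×320 input shape follows the embodiment's figures, while the helper names (`global_average_pool`, `classify`), the random weights and the single-logit sigmoid head are placeholder assumptions, not the patented classification network.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse the spatial dimensions: (H, W, C) -> (C,) by averaging."""
    return feature_map.mean(axis=(0, 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(feature_map, weights, bias=0.0):
    """Linear head on the pooled features; the activation yields a real/fake score."""
    logit = global_average_pool(feature_map) @ weights + bias
    return sigmoid(logit)

rng = np.random.default_rng(0)
fused = rng.standard_normal((112, 112, 320))   # final fused feature tensor
w = rng.standard_normal(320) * 0.01            # placeholder classifier weights
score = classify(fused, w)
assert 0.0 < score < 1.0                       # probability-like forgery score
```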
Correspondingly, fig. 1c shows a schematic structural diagram of a fusion EFFICIENTNET-b0 network to which the identification method for fake face pictures provided in the first embodiment of the present invention is applicable. As shown in fig. 1c, the convolutional network included in the fusion EFFICIENTNET-b0 network has the same structure as the standard convolutional network included in the standard EFFICIENTNET-b0 network;
The feature fusion network is specifically configured to be connected to the second, fifth, tenth, thirteenth, eighteenth, twenty-third and twenty-ninth convolution layers of the convolution network. The feature fusion network specifically comprises: a first channel number dimension reduction unit connected with the twenty-ninth convolution layer; a first weighting unit connected with the twenty-third convolution layer and the first channel number dimension reduction unit; a first dimension conversion unit connected with the first weighting unit; a second weighting unit connected with the eighteenth convolution layer and the first dimension conversion unit; a second channel number dimension reduction unit connected with the second weighting unit; a third weighting unit connected with the thirteenth convolution layer and the second channel number dimension reduction unit; a second dimension conversion unit connected with the third weighting unit; a fourth weighting unit connected with the tenth convolution layer and the second dimension conversion unit; a third dimension conversion unit connected with the fourth weighting unit; a fifth weighting unit connected with the fifth convolution layer and the third dimension conversion unit; a fourth dimension conversion unit connected with the fifth weighting unit; a sixth weighting unit connected with the second convolution layer and the fourth dimension conversion unit; and a channel number dimension increase unit connected with the sixth weighting unit, the channel number dimension increase unit being used for outputting the target fusion feature tensor.
Here, dimension reduction refers to an operation that maps data in a high-dimensional space to a lower-dimensional representation; specifically, a channel number dimension reduction unit is needed because the channel number dimension of the image feature tensor is relatively high.
Specifically, each channel number dimension reduction unit included in the feature fusion network is a first-type 1×1 convolution layer with a set dimension reduction scale;
the first-type 1×1 convolution layer is used for performing dimension reduction processing of a set scale on the number of channels in the input feature tensor. Each dimension conversion unit included in the feature fusion network includes, connected end to end: a first-type 1×1 convolution layer with a set dimension reduction scale, and a neighbor up-sampling unit with a set dimension increase scale; the neighbor up-sampling unit is used for performing dimension increase processing of a set scale on the feature map in the input feature tensor. The channel number dimension increase unit included in the feature fusion network is a second-type 1×1 convolution layer with a set dimension increase scale; the second-type 1×1 convolution layer is used for performing dimension increase processing of a set scale on the number of channels in the input feature tensor.
The first-type 1×1 convolution layer performs dimension reduction processing of a set scale on the number of channels in the input feature tensor. The neighbor up-sampling unit implements nearest-neighbor up-sampling, which can be regarded as an image interpolation method: each output pixel takes the value of the nearest input pixel, so that, in short, one pixel of the smaller image occupies several pixels of the larger image. Dimension reduction converts high-dimensional data into low-dimensional data, and dimension increase converts low-dimensional data into high-dimensional data.
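The neighbor up-sampling described above can be expressed compactly with `np.repeat`: each input pixel is copied into a factor×factor block of the output. A minimal sketch (the helper name `nearest_upsample` is ours, not from the patent):

```python
import numpy as np

def nearest_upsample(x, factor=2):
    """Nearest-neighbour up-sampling: each pixel fills a factor x factor block."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

x = np.arange(4, dtype=float).reshape(2, 2, 1)   # tiny 2x2 single-channel map
y = nearest_upsample(x)
assert y.shape == (4, 4, 1)
assert (y[0:2, 0:2, 0] == x[0, 0, 0]).all()      # top-left pixel fills a 2x2 block
```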
Correspondingly, fig. 1d shows a specific structural schematic diagram of a fused EFFICIENTNET-b0 network to which the identification method for fake face pictures provided in the first embodiment of the present invention is applicable.
As shown in fig. 1d, in this embodiment, specifically, the first channel number dimension reduction unit may include a 1×1 convolution layer with a dimension reduction scale of 7×7×192 (i.e., a first-type 1×1 convolution layer); the first dimension conversion unit may include a 1×1 convolution layer with a dimension reduction scale of 7×7×112 and a neighbor up-sampling unit with a dimension increase scale of 14×14×112; the second channel number dimension reduction unit may include a 1×1 convolution layer with a dimension reduction scale of 14×14×80; the second dimension conversion unit may include a 1×1 convolution layer with a dimension reduction scale of 14×14×40 and a neighbor up-sampling unit with a dimension increase scale of 28×28×40; the third dimension conversion unit may include a 1×1 convolution layer with a dimension reduction scale of 28×28×24 and a neighbor up-sampling unit with a dimension increase scale of 56×56×24; the fourth dimension conversion unit may include a 1×1 convolution layer with a dimension reduction scale of 56×56×16 and a neighbor up-sampling unit with a dimension increase scale of 112×112×16; and the channel number dimension increase unit may include a 1×1 convolution layer with a dimension increase scale of 112×112×320 (i.e., a second-type 1×1 convolution layer).
In this embodiment, for example, feature fusion is performed on the feature tensor 7×7×192 of the twenty-third convolution layer and the feature tensor 7×7×320 of the twenty-ninth convolution layer. Since the number of channels of the twenty-third layer's feature tensor is 192 while that of the twenty-ninth layer's is 320, channel number dimension reduction must be applied to the twenty-ninth layer's feature tensor before fusion; after the reduction, that tensor becomes 7×7×192.
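This fusion step can be sketched numerically: a 1×1 convolution is just a per-pixel linear map across channels, so reducing 7×7×320 to 7×7×192 is a matrix product along the channel axis. The equal fusion weights and random tensors below are illustrative assumptions; the patent does not state the weighting values:

```python
import numpy as np

def conv1x1(x, kernel):
    """A 1x1 convolution is a per-pixel linear map across channels:
    (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return x @ kernel

rng = np.random.default_rng(1)
feat23 = rng.standard_normal((7, 7, 192))   # twenty-third layer output
feat29 = rng.standard_normal((7, 7, 320))   # twenty-ninth layer output

k = rng.standard_normal((320, 192)) * 0.05  # hypothetical dimension-reduction kernel
feat29_reduced = conv1x1(feat29, k)
assert feat29_reduced.shape == (7, 7, 192)

# Weighted fusion of the two same-shaped tensors (equal weights assumed here)
fused = 0.5 * feat23 + 0.5 * feat29_reduced
assert fused.shape == (7, 7, 192)
```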
In this embodiment, for example, feature fusion is performed on the feature tensor 14×14×112 of the eighteenth convolution layer and the fused feature tensor 7×7×112 of the twenty-third convolution layer. Since the image width and height of the fused twenty-third-layer tensor are 7×7 while those of the eighteenth-layer tensor are 14×14, neighbor up-sampling must first be applied to the fused twenty-third-layer tensor, after which it becomes 14×14×112.
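Likewise, this step — nearest up-sampling of the fused 7×7×112 tensor followed by weighted fusion with the eighteenth layer's 14×14×112 tensor — can be sketched as below (fusion weights again assumed equal for illustration):

```python
import numpy as np

def nearest_upsample(x, factor=2):
    """Each pixel of the smaller map fills a factor x factor block of the larger map."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(2)
fused23 = rng.standard_normal((7, 7, 112))    # fused 23rd-layer tensor after 1x1 reduction
feat18 = rng.standard_normal((14, 14, 112))   # eighteenth layer output

up = nearest_upsample(fused23)                # 7x7x112 -> 14x14x112
assert up.shape == (14, 14, 112)
fused = 0.5 * feat18 + 0.5 * up               # equal weights are illustrative
assert fused.shape == feat18.shape
```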
In this embodiment, the overall flow may be summarized as follows: a 224×224×3 image is input; after convolution and five rounds of down-sampling the image is reduced by a factor of 32, yielding a 7×7×320 feature tensor; then, through convolutional dimension increase, neighbor up-sampling and weighted feature fusion, a 112×112×320 feature tensor is output. Here 224×224 refers to the width and height of the image, 3 to the number of image channels, 7×7 to the size of the output feature map and 320 to its number of channels. The size changes of the feature maps in the fusion EFFICIENTNET-b0 network are shown schematically in fig. 1e.
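The stated size changes are internally consistent, as the following arithmetic check shows (the breakdown into five stride-2 down-samplings and four ×2 up-samplings is inferred from the flow described above):

```python
# Downsampling: the 224x224 input is reduced 32x in five stride-2 stages
size = 224
for _ in range(5):
    size //= 2
assert size == 7          # deepest feature map: 7x7x320

# Fusion path: four nearest-neighbour up-samplings (7 -> 14 -> 28 -> 56 -> 112)
for _ in range(4):
    size *= 2
assert size == 112        # final fused tensor: 112x112x320 after the 1x1 up-projection
```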
Further, after the face synthetic image enhancement mode and the network structure were modified, the network was trained to verify the effect. First, the initial hyper-parameters of the EFFICIENTNET-b0 convolutional neural network need to be set; these parameters are shown in fig. 1f.
The training set contains 59668 images in total: the positive sample images (class 0) number 29962, and the negative sample images (class 1) number 29706, the negative samples being synthesized by the true-and-fake face synthetic image enhancement mode described above. Based on this data set, the fusion EFFICIENTNET-b0 network and the standard EFFICIENTNET-b0 network were each trained for 30 epochs, and the loss value and accuracy of each epoch were recorded and plotted as shown in fig. 1g. As can be seen from the graph, owing to the increase in parameter count and computation, the fusion EFFICIENTNET-b0 network converges more slowly in the early stage than the relatively lightweight standard EFFICIENTNET-b0 network, but reaches a lower loss value and higher accuracy in the later stage.
In this embodiment, the overall implementation process is as follows: first, the corresponding real-fake video pairs in the Celeb-DF public deep forgery detection dataset are read; then the two videos are cut into a specified number of frames according to preset interception time points; next, image fusion is performed on the real and fake images of each pair of corresponding frames; the fused images are then sent to the fusion EFFICIENTNET-b0 network for training; and finally the trained model file is read for prediction and identification. The overall logic flow diagram is shown in fig. 1h.
According to the technical scheme of this embodiment, the fusion EFFICIENTNET-b0 network is adopted as the machine learning model, in which feature fusion mainly uses the first-type 1×1 convolution layers for channel number dimension reduction and neighbor up-sampling for pixel adjustment, thereby reducing the possibility of network over-fitting.
Example two
Fig. 2 is a block diagram of a device for recognizing a fake face picture according to a second embodiment of the present invention, where the device for recognizing a fake face picture according to the present embodiment may be implemented by software and/or hardware, and may be configured in a terminal or a server to implement a method for recognizing a fake face picture according to the second embodiment of the present invention. As shown in fig. 2, the apparatus may specifically include: a depth falsification data set acquisition module 210, an image enhancement processing module 220, a falsified face picture recognition model formation module 230, and a result recognition module 240.
The depth forging data set obtaining module 210 is configured to obtain a depth forging data set, where the depth forging data set includes a plurality of video pairs, each video pair includes a real video, and a forging video obtained by performing face replacement on a real face in the real video;
The image enhancement processing module 220 is configured to perform image enhancement processing according to the real face and the substitute face included in each video pair to form a plurality of negative sample images, and form a plurality of positive sample images according to the real video included in each video pair;
The fake face picture recognition model forming module 230 is configured to construct a training sample set according to each negative sample image and each positive sample image, and train a preset machine learning model by using the training sample set to form a fake face picture recognition model;
The result recognition module 240 is configured to input a target face picture to be recognized into the fake face picture recognition model, and obtain a recognition result of whether the target face picture is a fake face picture.
According to the technical scheme provided by the embodiment of the invention, the depth fake data set is obtained, wherein the depth fake data set comprises a plurality of video pairs, each video pair comprises a real video, and a fake video obtained by carrying out face replacement on a real face in the real video; performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair; constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model; and inputting the target face picture to be identified into the fake face picture identification model, and acquiring an identification result of whether the target face picture is the fake face picture. The problems that the variety of the forged face of the public data set is single and the possibility of network overfitting exists are solved, and the requirements of data enhancement and deep forging detection and recognition are met.
On the basis of the above embodiments, the image enhancement processing module 220 may specifically include:
The video image frame intercepting unit is used for intercepting real video image frames and fake video image frames respectively in a target real video and a target fake video in a target video pair which is processed currently according to a preset intercepting time point;
The human face feature tensor extraction unit is used for extracting a real human face feature tensor and a substitute human face feature tensor from the real video image frame and the fake video image frame respectively;
a fake face feature tensor forming unit, configured to form at least one fake face feature tensor according to the real face feature tensor and the replacement face feature tensor;
and the negative sample image forming unit is used for replacing the real face characteristic tensor in the real video image frame or replacing the face characteristic tensor in the forged video image frame by using each forged face characteristic tensor to form at least one negative sample image.
On the basis of the above embodiments, the fake face feature tensor forming unit may be specifically configured to:
Randomly generating at least one face synthesis proportion weight p, where p ∈ (0, 1);
According to the formula output = (1 − p) × real_input_tensor + p × fake_input_tensor, calculating the fake face feature tensor output corresponding to each face synthesis proportion weight respectively;
wherein real_input_tensor is the real face feature tensor and fake_input_tensor is the substitute face feature tensor.
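The synthesis formula can be sketched directly in NumPy; the function name and the random face tensors below are illustrative, with the 224×224×3 shape taken from the network's input size:

```python
import numpy as np

def synthesize_fake_face(real_face, fake_face, p):
    """output = (1 - p) * real_input_tensor + p * fake_input_tensor, with p in (0, 1)."""
    assert 0.0 < p < 1.0
    return (1.0 - p) * real_face + p * fake_face

rng = np.random.default_rng(3)
real = rng.random((224, 224, 3))     # placeholder real-face image tensor
fake = rng.random((224, 224, 3))     # placeholder replacement-face image tensor
blended = synthesize_fake_face(real, fake, p=0.3)
assert blended.shape == (224, 224, 3)
# As p -> 0 the blend approaches the real face; as p -> 1, the replacement face
assert np.allclose(synthesize_fake_face(real, fake, 1e-9), real, atol=1e-6)
```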
On the basis of the above embodiments, the image enhancement processing module 220 may further include:
The positive sample image forming unit is used for intercepting, in the currently processed real video, alternative real video frames respectively corresponding to at least one interception time point, and acquiring the alternative real video frames that include a real face as positive sample images.
Based on the above embodiments, the machine learning model is a converged EFFICIENTNET-b0 network, and the converged EFFICIENTNET-b0 network is an improved version of the standard EFFICIENTNET-b0 network;
wherein the converged EFFICIENTNET-b0 network comprises: the system comprises a convolutional network, a feature fusion network connected with at least two convolutional layers in the convolutional network, and a classification network connected with the feature fusion network;
The convolution network comprises a plurality of end-to-end convolution layers, each convolution layer is used for carrying out feature extraction on the input feature tensor to obtain an output feature tensor with a set scale, and the input feature tensor of the first convolution layer is the feature tensor of the input picture input to the fusion EFFICIENTNET-b0 network;
The feature fusion network is used for acquiring output feature tensors of at least two connected convolution layers, and performing layer-by-layer scale transformation and feature weighting on each output feature tensor to obtain a target fusion feature tensor;
And the classification network is used for outputting a classification result of whether the input picture is a fake face picture or not according to the target fusion characteristic tensor.
Based on the above embodiments, the convolutional network included in the converged EFFICIENTNET-b0 network has the same structure as the standard convolutional network included in the standard EFFICIENTNET-b0 network;
the feature fusion network is specifically configured to be connected to a second convolution layer, a fifth convolution layer, a tenth convolution layer, a thirteenth convolution layer, an eighteenth convolution layer, a twenty-third convolution layer, and a twenty-ninth convolution layer of the convolution network;
The feature fusion network specifically comprises: a first channel number dimension reduction unit connected with the twenty-ninth convolution layer; a first weighting unit connected with the twenty-third convolution layer and the first channel number dimension reduction unit; a first dimension conversion unit connected with the first weighting unit; a second weighting unit connected with the eighteenth convolution layer and the first dimension conversion unit; a second channel number dimension reduction unit connected with the second weighting unit; a third weighting unit connected with the thirteenth convolution layer and the second channel number dimension reduction unit; a second dimension conversion unit connected with the third weighting unit; a fourth weighting unit connected with the tenth convolution layer and the second dimension conversion unit; a third dimension conversion unit connected with the fourth weighting unit; a fifth weighting unit connected with the fifth convolution layer and the third dimension conversion unit; a fourth dimension conversion unit connected with the fifth weighting unit; a sixth weighting unit connected with the second convolution layer and the fourth dimension conversion unit; and a channel number dimension increase unit connected with the sixth weighting unit, the channel number dimension increase unit being used for outputting the target fusion feature tensor.
Based on the above embodiments, each channel number dimension reduction unit included in the feature fusion network is a first-type 1×1 convolution layer with a set dimension reduction scale;
the first-type 1×1 convolution layer is used for performing dimension reduction processing of a set scale on the number of channels in the input feature tensor;
each dimension conversion unit included in the feature fusion network includes: a first-type 1×1 convolution layer with a set dimension reduction scale and a neighbor up-sampling unit with a set dimension increase scale, connected end to end;
the neighbor up-sampling unit is used for performing dimension increase processing of a set scale on the feature map in the input feature tensor;
the channel number dimension increase unit included in the feature fusion network is a second-type 1×1 convolution layer with a set dimension increase scale;
the second-type 1×1 convolution layer is used for performing dimension increase processing of a set scale on the number of channels in the input feature tensor.
The identification device for the forged face picture can execute the identification method for the forged face picture provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example III
Fig. 3 is a block diagram of a computer device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes a processor 310, a memory 320, an input device 330, and an output device 340; the number of processors 310 in the device may be one or more, one processor 310 being taken as an example in fig. 3; the processor 310, memory 320, input 330 and output 340 in the device may be connected by a bus or other means, for example in fig. 3.
The memory 320 is a computer readable storage medium, and may be used to store software programs, computer executable programs, and modules, such as program instructions/modules (e.g., the deep forgery data set obtaining module 210, the image enhancement processing module 220, the forgery face picture recognition model forming module 230, and the result recognition module 240) corresponding to the recognition method of forgery face pictures in the embodiment of the present invention. The processor 310 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 320, i.e. implements the above-mentioned identification method for falsified face pictures, the method comprising:
Acquiring a depth forging data set, wherein the depth forging data set comprises a plurality of video pairs, each video pair comprises a real video and a forging video obtained by carrying out face replacement on a real face in the real video;
Performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model;
And inputting the target face picture to be identified into the fake face picture identification model, and acquiring an identification result of whether the target face picture is the fake face picture.
Memory 320 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 320 may further include memory located remotely from processor 310, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 340 may include a display device such as a display screen.
Example IV
A fourth embodiment of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a method of recognizing a falsified face picture, the method comprising:
Acquiring a depth forging data set, wherein the depth forging data set comprises a plurality of video pairs, each video pair comprises a real video and a forging video obtained by carrying out face replacement on a real face in the real video;
Performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model;
And inputting the target face picture to be identified into the fake face picture identification model, and acquiring an identification result of whether the target face picture is the fake face picture.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the method for recognizing a fake face picture provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the above-mentioned embodiments of the identification apparatus, the units and modules included are divided only according to functional logic, but the division is not limited thereto, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for distinguishing them from each other and are not used to limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method for identifying counterfeit face pictures, comprising:
Acquiring a depth forging data set, wherein the depth forging data set comprises a plurality of video pairs, each video pair comprises a real video and a forging video obtained by carrying out face replacement on a real face in the real video;
Performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a fake face picture recognition model;
Inputting a target face picture to be identified into a fake face picture identification model, and acquiring an identification result of whether the target face picture is a fake face picture or not;
The machine learning model is a converged EFFICIENTNET-b0 network, and the converged EFFICIENTNET-b0 network is an improved standard EFFICIENTNET-b0 network;
wherein the converged EFFICIENTNET-b0 network comprises: the system comprises a convolutional network, a feature fusion network connected with at least two convolutional layers in the convolutional network, and a classification network connected with the feature fusion network;
The convolution network comprises a plurality of end-to-end convolution layers, each convolution layer is used for carrying out feature extraction on the input feature tensor to obtain an output feature tensor with a set scale, and the input feature tensor of the first convolution layer is the feature tensor of the input picture input to the fusion EFFICIENTNET-b0 network;
The feature fusion network is used for acquiring output feature tensors of at least two connected convolution layers, and performing layer-by-layer scale transformation and feature weighting on each output feature tensor to obtain a target fusion feature tensor;
And the classification network is used for outputting a classification result of whether the input picture is a fake face picture or not according to the target fusion characteristic tensor.
2. The method of claim 1, wherein performing image enhancement processing to form a plurality of negative sample images based on the real faces and the replacement faces included in each video pair, comprises:
According to a preset interception time point, intercepting real video image frames and fake video image frames respectively in a target real video and a target fake video in a currently processed target video pair;
Extracting a real face feature tensor and a substitute face feature tensor from the real video image frame and the fake video image frame respectively;
Forming at least one fake face feature tensor according to the real face feature tensor and the replacement face feature tensor;
And replacing the real face feature tensor in the real video image frame with each fake face feature tensor or replacing the face feature tensor in the fake video image frame to form at least one negative sample image.
3. The method of claim 2, wherein forming at least one counterfeit face feature tensor from the real face feature tensor and the replacement face feature tensor comprises:
Randomly generating at least one face synthesis proportion weight p, where p ∈ (0, 1);
According to the formula output = (1 − p) × real_input_tensor + p × fake_input_tensor, calculating the fake face feature tensor output corresponding to each face synthesis proportion weight respectively;
wherein real_input_tensor is the real face feature tensor and fake_input_tensor is the substitute face feature tensor.
4. The method of claim 1, wherein forming a plurality of positive sample images from the real video included in each video pair comprises:
In the currently processed real video, intercepting alternative real video frames respectively corresponding to at least one interception time point, and acquiring alternative real video frames comprising real faces as positive sample images.
5. The method of claim 1, wherein the convolutional network included in the converged EFFICIENTNET-b0 network is the same structure as the standard convolutional network included in the standard EFFICIENTNET-b0 network;
the feature fusion network is specifically configured to be connected to a second convolution layer, a fifth convolution layer, a tenth convolution layer, a thirteenth convolution layer, an eighteenth convolution layer, a twenty-third convolution layer, and a twenty-ninth convolution layer of the convolution network;
the feature fusion network specifically comprises: a first channel number dimension-reduction unit connected with the twenty-ninth convolution layer; a first weighting unit connected with the twenty-third convolution layer and the first channel number dimension-reduction unit; a first dimension conversion unit connected with the first weighting unit; a second weighting unit connected with the eighteenth convolution layer and the first dimension conversion unit; a second channel number dimension-reduction unit connected with the second weighting unit; a third weighting unit connected with the thirteenth convolution layer and the second channel number dimension-reduction unit; a second dimension conversion unit connected with the third weighting unit; a fourth weighting unit connected with the tenth convolution layer and the second dimension conversion unit; a third dimension conversion unit connected with the fourth weighting unit; a fifth weighting unit connected with the fifth convolution layer and the third dimension conversion unit; a fourth dimension conversion unit connected with the fifth weighting unit; a sixth weighting unit connected with the second convolution layer and the fourth dimension conversion unit; and a channel number dimension-increase unit connected with the sixth weighting unit, wherein the channel number dimension-increase unit is used for outputting the target fusion feature tensor.
6. The method of claim 5, wherein each channel number dimension-reduction unit included in the feature fusion network is a first-type 1×1 convolution layer with a set dimension-reduction scale;
the first-type 1×1 convolution layer is used for performing dimension-reduction processing of a set scale on the number of channels in the input feature tensor;
each dimension conversion unit included in the feature fusion network comprises: a first-type 1×1 convolution layer with a set dimension-reduction scale and a nearest-neighbor up-sampling unit with a set dimension-increase scale, connected end to end;
the nearest-neighbor up-sampling unit is used for performing dimension-increase processing of a set scale on the feature map in the input feature tensor;
the channel number dimension-increase unit included in the feature fusion network is a second-type 1×1 convolution layer with a set dimension-increase scale;
and the second-type 1×1 convolution layer is used for performing dimension-increase processing of a set scale on the number of channels in the input feature tensor.
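As a rough sketch (not from the patent text), the nearest-neighbor up-sampling described in claim 6 repeats each feature-map element along height and width; a minimal pure-Python version for a single-channel map, with the function name chosen here for illustration:

```python
def nearest_neighbor_upsample(feature_map, scale=2):
    """Up-sample a 2-D feature map by repeating each element `scale` times
    along both height and width (nearest-neighbor interpolation)."""
    out = []
    for row in feature_map:
        expanded_row = [v for v in row for _ in range(scale)]   # widen the row
        out.extend([expanded_row[:] for _ in range(scale)])     # duplicate it vertically
    return out

# A 2x2 map becomes 4x4; each value fills a scale x scale block.
up = nearest_neighbor_upsample([[1, 2], [3, 4]])
# -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Paired with a 1×1 convolution that shrinks the channel count, this lets a deep layer's tensor match a shallower layer's spatial and channel dimensions before the two are weighted and summed.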
7. A device for identifying fake face pictures, comprising:
a deepfake data set acquisition module, used for acquiring a deepfake data set, wherein the deepfake data set comprises a plurality of video pairs, and each video pair comprises a real video and a fake video obtained by performing face replacement on a real face in the real video;
an image enhancement processing module, used for performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and for forming a plurality of positive sample images from the real video included in each video pair;
a fake face picture recognition model forming module, used for constructing a training sample set from each negative sample image and each positive sample image, and for training a preset machine learning model with the training sample set to form a fake face picture recognition model;
a result recognition module, used for inputting a target face picture to be recognized into the fake face picture recognition model and obtaining a recognition result indicating whether the target face picture is a fake face picture;
wherein the machine learning model is a fused EfficientNet-b0 network, and the fused EfficientNet-b0 network is an improved standard EfficientNet-b0 network;
the fused EfficientNet-b0 network comprises: a convolutional network, a feature fusion network connected with at least two convolution layers in the convolutional network, and a classification network connected with the feature fusion network;
the convolutional network comprises a plurality of end-to-end convolution layers, each convolution layer is used for performing feature extraction on its input feature tensor to obtain an output feature tensor of a set scale, and the input feature tensor of the first convolution layer is the feature tensor of the input picture fed to the fused EfficientNet-b0 network;
the feature fusion network is used for acquiring the output feature tensors of the at least two connected convolution layers and performing layer-by-layer scale transformation and feature weighting on each output feature tensor to obtain a target fusion feature tensor;
and the classification network is used for outputting, according to the target fusion feature tensor, a classification result indicating whether the input picture is a fake face picture.
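Purely as an illustrative skeleton (every function name here is hypothetical and not from the patent), the fused network's data flow chains the three sub-networks of claim 7: run the convolutional backbone, tap the connected layers, fuse, then classify:

```python
def fused_efficientnet_b0_forward(input_picture, conv_layers, fuse, classify):
    """Sketch of the fused EfficientNet-b0 data flow: run the convolutional
    backbone, collect the tapped layer outputs, fuse them, then classify."""
    tapped = {2, 5, 10, 13, 18, 23, 29}  # 1-based indices of the connected layers
    features, x = [], input_picture
    for i, layer in enumerate(conv_layers, start=1):
        x = layer(x)
        if i in tapped:
            features.append(x)
    target_fusion_tensor = fuse(features)  # layer-by-layer scale transform + weighting
    return classify(target_fusion_tensor)  # fake-face classification result

# Toy run with stand-in layers: 29 increment "layers", sum-fusion, threshold classifier.
result = fused_efficientnet_b0_forward(
    1.0,
    [lambda v: v + 1] * 29,
    fuse=sum,
    classify=lambda t: t > 0,
)
```

The toy layers stand in for real convolutions; the point of the sketch is only the wiring, with the tapped indices taken from claim 5.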
8. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of identifying fake face pictures according to any one of claims 1-6.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of identifying fake face pictures according to any one of claims 1-6.
CN202111027883.1A 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures Active CN113762138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027883.1A CN113762138B (en) 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures


Publications (2)

Publication Number Publication Date
CN113762138A CN113762138A (en) 2021-12-07
CN113762138B true CN113762138B (en) 2024-04-23

Family

ID=78792723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027883.1A Active CN113762138B (en) 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures

Country Status (1)

Country Link
CN (1) CN113762138B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114298997B (en) * 2021-12-23 2023-06-02 北京瑞莱智慧科技有限公司 Fake picture detection method, fake picture detection device and storage medium
CN114841340B (en) * 2022-04-22 2023-07-28 马上消费金融股份有限公司 Identification method and device for depth counterfeiting algorithm, electronic equipment and storage medium
CN114663962B (en) * 2022-05-19 2022-09-16 浙江大学 Lip-shaped synchronous face counterfeiting generation method and system based on image completion
CN114998277A (en) * 2022-06-16 2022-09-02 吉林大学 Grab point identification method and device, electronic equipment and computer storage medium
CN114821825B (en) * 2022-06-30 2022-12-06 广州中平智能科技有限公司 Multi-granularity face forgery detection method, system, equipment and medium
CN117315798B (en) * 2023-11-20 2024-03-12 齐鲁工业大学(山东省科学院) Deep counterfeiting detection method based on identity facial features

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110620891A (en) * 2019-09-27 2019-12-27 上海依图网络科技有限公司 Imaging system and video processing method
CN111768336A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Face image processing method and device, computer equipment and storage medium
CN111767919A (en) * 2020-04-10 2020-10-13 福建电子口岸股份有限公司 Target detection method for multi-layer bidirectional feature extraction and fusion
CN111967344A (en) * 2020-07-28 2020-11-20 南京信息工程大学 Refined feature fusion method for face forgery video detection
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
CN112149651A (en) * 2020-11-25 2020-12-29 深圳点猫科技有限公司 Facial expression recognition method, device and equipment based on deep learning
CN112287784A (en) * 2020-10-20 2021-01-29 哈尔滨工程大学 Radar signal classification method based on deep convolutional neural network and feature fusion
CN112488137A (en) * 2019-09-11 2021-03-12 广州虎牙科技有限公司 Sample acquisition method and device, electronic equipment and machine-readable storage medium
CN112733760A (en) * 2021-01-15 2021-04-30 上海明略人工智能(集团)有限公司 Face anti-counterfeiting detection method and system
CN112818767A (en) * 2021-01-18 2021-05-18 深圳市商汤科技有限公司 Data set generation method, data set forgery detection device, electronic device, and storage medium
CN112926508A (en) * 2021-03-25 2021-06-08 支付宝(杭州)信息技术有限公司 Training method and device of living body detection model
CN113313054A (en) * 2021-06-15 2021-08-27 中国科学技术大学 Face counterfeit video detection method, system, equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics";Yuezun Li;《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;全文 *
深度伪造视频检测技术综述;暴雨轩;芦天亮;杜彦辉;;计算机科学(第09期);全文 *


Similar Documents

Publication Publication Date Title
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN109376603A A video recognition method, device, computer equipment, and storage medium
CN111860171B (en) Method and system for detecting irregular-shaped target in large-scale remote sensing image
CN112446476A (en) Neural network model compression method, device, storage medium and chip
Liu et al. Learning human pose models from synthesized data for robust RGB-D action recognition
CN108416266A A fast video behavior recognition method that extracts moving targets using optical flow
CN110097090A A fine-grained image recognition method based on multi-scale feature fusion
CN113537027B (en) Face depth counterfeiting detection method and system based on face division
CN112818849B (en) Crowd density detection algorithm based on context attention convolutional neural network for countermeasure learning
CN112381030B (en) Satellite optical remote sensing image target detection method based on feature fusion
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
CN114331946A (en) Image data processing method, device and medium
CN116342931A (en) Fur image classification method, system and storage medium for multi-scale attention
CN116958637A (en) Training method, device, equipment and storage medium of image detection model
CN114724218A (en) Video detection method, device, equipment and medium
CN113724354B (en) Gray image coloring method based on reference picture color style
Yu et al. SegNet: a network for detecting deepfake facial videos
Yu et al. Deep forgery discriminator via image degradation analysis
CN114782979A (en) Training method and device for pedestrian re-recognition model, storage medium and terminal
Sakthimohan et al. Detection and Recognition of Face Using Deep Learning
CN112016592B (en) Domain adaptive semantic segmentation method and device based on cross domain category perception
CN116975828A (en) Face fusion attack detection method, device, equipment and storage medium
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN115512207A (en) Single-stage target detection method based on multipath feature fusion and high-order loss sensing sampling
CN111598144A (en) Training method and device of image recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant