CN113762138A - Method and device for identifying forged face picture, computer equipment and storage medium - Google Patents

Method and device for identifying forged face picture, computer equipment and storage medium

Info

Publication number: CN113762138A
Application number: CN202111027883.1A
Authority: CN (China)
Prior art keywords: face, video, real, network, forged
Other languages: Chinese (zh)
Other versions: CN113762138B (en)
Inventors: 王佳琪, 李玉惠, 傅强, 蔡琳, 阿曼太, 梁彧, 马寒军, 田野, 王杰, 杨满智, 金红, 陈晓光
Current and original assignee: Eversec Beijing Technology Co Ltd
Application filed by Eversec Beijing Technology Co Ltd; priority to CN202111027883.1A; application granted and published as CN113762138B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The embodiments of the invention disclose a method and a device for identifying a forged face picture, computer equipment, and a storage medium. The method comprises the following steps: acquiring a deep forgery data set, where the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video; forming a plurality of negative sample images according to the real face and the replacement face, and forming a plurality of positive sample images according to the real video; constructing a training sample set from the negative sample images and the positive sample images, and training a machine learning model to form a forged face picture recognition model; and inputting a target face picture to be recognized into the recognition model to obtain a recognition result of whether the target face picture is a forged face picture. The embodiments of the invention solve the problems that public data sets contain only a single type of forged face and that the network may overfit, achieve data enhancement, and meet the requirements of deep forgery detection and recognition.

Description

Method and device for identifying forged face picture, computer equipment and storage medium
Technical Field
The embodiments of the invention relate to the field of computer technology, in particular to deep learning, computer vision, deep forgery generation, and deep forgery detection, and specifically to a method and a device for recognizing a forged face picture, computer equipment, and a storage medium.
Background
The goal of the deep forgery detection field is to counter deep forgery generation techniques. Current mainstream methods can be divided into two major categories according to the type of object they process: picture level and video level.
Most picture-level detection and identification models rely on the training and test data following the same distribution, and they perform poorly when facing unknown tampering types. Video-level detection and identification methods work well and can detect even small amounts of tampering in a video, but they are sensitive to video preprocessing such as compression and lighting changes, and they cannot judge the authenticity of a single-frame image. In short, picture-level methods commonly suffer from poor model generalization and low accuracy, while video-level methods cannot identify a single image; neither meets the requirements of deep forgery detection and identification in practical projects.
Disclosure of Invention
The embodiments of the invention provide a method and a device for identifying a forged face picture, computer equipment, and a storage medium, which solve the problems that public data sets contain only a single type of forged face and that the network may overfit, achieve data enhancement, and meet the requirements of deep forgery detection and recognition.
In a first aspect, an embodiment of the present invention provides a method for identifying a forged face picture, where the method includes:
acquiring a deep forgery data set, wherein the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video;
performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to the negative sample images and the positive sample images, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
and inputting the target face picture to be recognized into the forged face picture recognition model, and obtaining a recognition result of whether the target face picture is a forged face picture.
In a second aspect, an embodiment of the present invention further provides an apparatus for recognizing a forged face picture, where the apparatus includes:
the deep forgery data set acquisition module, configured to acquire a deep forgery data set, where the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video;
the image enhancement processing module is used for carrying out image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images and forming a plurality of positive sample images according to the real video included in each video pair;
the forged face picture recognition model forming module is used for constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
and the result identification module, configured to input the target face picture to be recognized into the forged face picture recognition model and obtain a recognition result of whether the target face picture is a forged face picture.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for recognizing a forged face picture according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the storage medium stores a computer program, where the program, when executed by a processor, implements the method for recognizing a forged face picture according to any embodiment of the present invention.
According to the technical solution provided by the embodiments of the invention, a deep forgery data set is acquired, where the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video; image enhancement processing is performed according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and a plurality of positive sample images are formed according to the real video included in each video pair; a training sample set is constructed from the negative sample images and the positive sample images, and a preset machine learning model is trained with the training sample set to form a forged face picture recognition model; and a target face picture to be recognized is input into the forged face picture recognition model to obtain a recognition result of whether the target face picture is a forged face picture. This solves the problems that public data sets contain only a single type of forged face and that the network may overfit, achieves data enhancement, and meets the requirements of deep forgery detection and recognition.
Drawings
Fig. 1a is a flowchart of a method for recognizing a forged face picture according to an embodiment of the present invention;
fig. 1b is a schematic diagram of a synthetic effect of an authentic face in an identification method for a forged face picture according to an embodiment of the present invention;
fig. 1c is a schematic structural diagram of a fused EfficientNet-b0 network applicable to the method for identifying a forged face picture according to the embodiment of the present invention;
fig. 1d is a specific structural schematic diagram of a fused EfficientNet-b0 network applicable to the method for identifying a forged face picture according to the embodiment of the present invention;
fig. 1e is a schematic flow chart of size change of a feature map of a fused EfficientNet-b0 network, which is applicable to the method for identifying a forged face picture according to the embodiment of the present invention;
fig. 1f is a schematic diagram of initial hyper-parameter setting of network training fused with EfficientNet-b0 in the identification method for a forged face picture according to the embodiment of the present invention;
fig. 1g is a schematic diagram of a comparison between training losses and accuracy rates of a fused EfficientNet-b0 network and a standard EfficientNet-b0 network, which is applicable to the identification method for the forged face picture according to the embodiment of the present invention;
fig. 1h is an overall logic flow diagram of a method for recognizing a forged face picture according to an embodiment of the present invention;
fig. 2 is a structural diagram of an apparatus for recognizing a forged face picture according to a second embodiment of the present invention;
fig. 3 is a structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1a is a flowchart of a method for recognizing a forged face picture according to an embodiment of the present invention. This embodiment is applicable to synthesizing diversified forged faces through true-and-fake face synthesis, as a form of data enhancement, so as to train a forged face picture recognition model with a better recognition effect. The method of this embodiment can be executed by an apparatus for recognizing a forged face picture; the apparatus can be implemented in software and/or hardware and can generally be integrated in a terminal or server with a data processing function.
Correspondingly, the method specifically comprises the following steps:
s110, a depth forgery data set is obtained, the depth forgery data set comprises a plurality of video pairs, each video pair comprises a real video and a forgery video obtained by replacing a real face in the real video with a face.
A deep forgery data set is a publicly available data set of visual and audio content (such as images, audio, video, and text) created or synthesized by intelligent methods such as deep learning. In this embodiment, the deep forgery data set includes a plurality of video pairs used for training the forged face picture recognition model. Each video pair includes a real video serving as a positive example and a forged video serving as a negative example. The face included in the real video is a real face; the forged video is unreal, edited content, and the face it includes is a forged face obtained by performing face replacement on the real face in the real video.
In an alternative implementation of this embodiment, the deep forgery data set may be constructed from the Celeb-DF v2 public deep forgery detection data set.
The positive samples in this data set comprise 587 real videos (featuring 61 actors), and the negative samples comprise 5637 fake videos (generated by swapping the actors' faces in the real videos). The forged videos therefore contain only 61 face identities; with so little data diversity, training a model on them easily leads to overfitting.
In this embodiment, each positive and negative sample pair included in the deep forgery data set is used to form new forged faces that are not included in the deep forgery data set, so as to enrich the number and variety of the negative samples.
And S120, performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair.
In this embodiment, the negative sample images and the positive sample images are the training samples used to train the forged face picture recognition model. A positive sample image is an unprocessed real image that includes a real face; a negative sample image is a forged image obtained by replacing the real face in a positive sample image with a forged face.
Image enhancement is an important part of image processing. During image generation, transmission, or transformation, various factors can degrade image quality: the image becomes blurred and its features are submerged, making analysis and identification difficult. The main task of image enhancement is therefore to selectively highlight the features of interest in an image according to specific needs, attenuate the unwanted features, and improve the intelligibility of the image.
The image enhancement processing is intended to improve the visual effect of an image, and commonly used image enhancement techniques include contrast processing, histogram modification, noise processing, edge enhancement, transform processing, pseudo color, and the like.
In this embodiment, using the real face and the replacement face included in each video pair together with image enhancement techniques, forged faces that do not exist in the deep forgery data set and differ from both the real face and the replacement face are constructed, and a negative sample image is formed from each constructed forged face.
In an optional implementation of this embodiment, performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images includes:

intercepting a real video image frame and a forged video image frame from the target real video and the target forged video of the currently processed target video pair, respectively, according to a preset interception time point; extracting a real face feature tensor and a replacement face feature tensor from the real video image frame and the forged video image frame, respectively; forming at least one forged face feature tensor according to the real face feature tensor and the replacement face feature tensor; and replacing the real face feature tensor in the real video image frame, or the replacement face feature tensor in the forged video image frame, with each forged face feature tensor to form at least one negative sample image.
The interception time point is the time point at which one frame of image is captured from the real video and from the forged video; it may be selected randomly or periodically.
It can be understood that the background and environment information in the two video frames corresponding to the same playing time point in the real video and the forged video is identical; the only difference is that the face in the real video frame is the real face, while the face in the forged video frame is the replacement face. Therefore, by cutting the real video and the forged video at the same interception time point, two video frames with the same background but different faces, namely the real video image frame and the forged video image frame, can be obtained.
It will be appreciated that when selecting an interception time point, it is necessary to ensure that the video image frame at that time point actually contains a face. This embodiment does not limit the specific way in which the interception time point is selected.
An image frame is the smallest unit composing a video, and here can be either a real video image frame or a forged video image frame. A tensor is a multilinear mapping defined on the Cartesian product of vector spaces and dual spaces; in an n-dimensional space it is a quantity with multiple components, each a function of the coordinates, and when the coordinates are transformed these components transform linearly according to certain rules. A tensor can be represented by a multidimensional array of its components.
In this embodiment, the face regions can be obtained from the real video image frame and the forged video image frame respectively by any existing face region extraction method, and the pixel values of the pixels included in each face region are then combined to form the real face feature tensor and the replacement face feature tensor.
Specifically, the forged face feature tensor is composed of a real face feature tensor and a replacement face feature tensor.
In this embodiment, the corresponding real and fake videos in the Celeb-DF v2 public deep forgery detection data set are read, and both videos are cut into a specified number of frames according to the preset interception time points. A real face feature tensor and a replacement face feature tensor are then extracted from the real video image frame and the forged video image frame. Next, at least one forged face feature tensor is formed according to the real face feature tensor and the replacement face feature tensor. Finally, each forged face feature tensor replaces the real face feature tensor in the real video image frame, or the replacement face feature tensor in the forged video image frame; the real and fake images are fused, and the enhanced images form a plurality of negative sample images.
The advantage of such an arrangement is that a plurality of negative sample images are formed by acquiring image frames from the target real video and the target forged video and by extracting and replacing the tensors within those frames. This enhances the image data, overcomes the problem that the forged videos in the public data set contain only a single type of face, synthesizes more diversified forged faces, and improves the richness of the data set samples.
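As an illustration of the frame-pairing step, the sketch below captures one frame from each video of a pair at the same interception time point and crops the face region. It is a minimal sketch, assuming OpenCV for video decoding; the face detector (detect_face_box) and the file names are hypothetical, since the patent does not name specific tools.

```python
import cv2
import numpy as np

def grab_frame(video_path: str, t_sec: float) -> np.ndarray:
    """Capture one frame from a video at the given interception time point."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, t_sec * 1000.0)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"no frame at {t_sec}s in {video_path}")
    return frame

def face_tensor(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the face region and return its pixel tensor (H x W x 3)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w].astype(np.float32)

# The same interception time point is used in both videos of a pair, so the
# background matches and only the face region differs.
t = 2.5  # example time point; selected randomly or periodically in practice
real_frame = grab_frame("real.mp4", t)   # hypothetical file names
fake_frame = grab_frame("fake.mp4", t)
# detect_face_box is a hypothetical face detector returning (x, y, w, h):
# real_input_tensor = face_tensor(real_frame, detect_face_box(real_frame))
# fake_input_tensor = face_tensor(fake_frame, detect_face_box(fake_frame))
```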
Optionally, forming at least one forged face feature tensor according to the real face feature tensor and the replacement face feature tensor may include:
randomly generating at least one face synthesis proportion weight p, where p ∈ (0, 1); and according to the formula

output = p × fake_input_tensor + (1 − p) × real_input_tensor

calculating the forged face feature tensor output corresponding to each face synthesis proportion weight, where real_input_tensor is the real face feature tensor and fake_input_tensor is the replacement face feature tensor.
The weight p expresses the relative importance of one factor or index; here it specifically refers to the proportions with which the real face feature tensor and the replacement face feature tensor contribute to the target forged face feature tensor.
In this embodiment, as the synthesis proportion increases, the synthesized face gradually approaches the characteristics of the forged face, as shown in fig. 1b. When p = 1, the synthesized face is completely consistent with the forged face; when p takes a value between 0 and 1, the generated face contains facial features from both videos of the pair and forms a new face that does not exist among the 61 actor faces in the public data set, i.e., a forged face. By assigning p a random value between 0 and 1 during image enhancement, a wide variety of forged faces can be generated.
The advantage of such an arrangement is that randomly generating at least one face synthesis proportion weight p produces a wide variety of forged faces, which alleviates both the overfitting problem and the problem of a single sample type.
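A minimal sketch of this synthesis step follows. The blend direction (p weighting the replacement face tensor) matches the statement above that p = 1 reproduces the forged face; apart from real_input_tensor, fake_input_tensor, output, and p, the names are this sketch's own.

```python
import numpy as np

def synthesize_forged_face(real_input_tensor: np.ndarray,
                           fake_input_tensor: np.ndarray,
                           rng: np.random.Generator) -> np.ndarray:
    """Blend a real face tensor with its replacement face tensor.

    Both tensors are assumed to have been resized to a common shape; the
    patent does not describe the alignment step."""
    p = rng.uniform(0.0, 1.0)  # face synthesis proportion weight, p in (0, 1)
    # p = 1 reproduces the forged face; intermediate p yields a new face.
    output = p * fake_input_tensor + (1.0 - p) * real_input_tensor
    return output

blended = synthesize_forged_face(np.zeros((112, 112, 3)),
                                 np.ones((112, 112, 3)),
                                 np.random.default_rng(0))
```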
Optionally, forming a plurality of positive sample images according to the real videos included in each video pair may include:
intercepting, from the currently processed real video, candidate real video frames respectively corresponding to at least one interception time point, and taking the candidate real video frames that include a real face as positive sample images.
In this embodiment, the candidate real video frames corresponding to the interception time points are cut from the real video and used to form positive sample images; training the model with these positive sample images enriches the sample set and enhances the data.
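For illustration, positive-sample extraction might look like the sketch below, reusing the grab_frame helper from the earlier frame-pairing sketch; drawing the time points at random is one of the selection strategies mentioned above, not a requirement of the patent.

```python
import random

def positive_samples(video_path: str, n_frames: int, duration_sec: float):
    """Cut candidate real video frames at random interception time points."""
    times = sorted(random.uniform(0.0, duration_sec) for _ in range(n_frames))
    frames = [grab_frame(video_path, t) for t in times]
    # Only frames that actually contain a face should be kept as positive
    # sample images (the face check is assumed here and not shown).
    return frames
```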
S130, a training sample set is constructed according to the negative sample images and the positive sample images, and a preset machine learning model is trained with the training sample set to form a forged face picture recognition model.
A machine learning model is a model that uses data, samples, or past experience to optimize the performance criteria of a computer program. The machine learning model may be a classification model implemented on the basis of various classification algorithms, for example an EfficientNet-b0 network model or a decision tree network model, which this embodiment does not limit.
The forged face picture recognition model is formed after model training is carried out on the machine learning model by using a training sample set and is used for recognizing forged face pictures.
S140, the target face picture to be recognized is input into the forged face picture recognition model, and the recognition result of whether the target face picture is a forged face picture is obtained.
The recognition result is the conclusion drawn about the authenticity of the input target picture, that is, whether or not the target face picture is a forged face picture.
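A hedged sketch of this recognition step is given below, assuming a PyTorch classifier with a two-class head (category 1 = forged, matching the training-set labels described later in this embodiment) and a standard 224 × 224 preprocessing; the patent does not prescribe these details.

```python
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # matches the network input size below
    transforms.ToTensor(),
])

def is_forged(model: torch.nn.Module, picture_path: str) -> bool:
    """Return True if the model classifies the picture as a forged face."""
    model.eval()
    x = preprocess(Image.open(picture_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)              # shape (1, 2) for a two-class head
    return bool(logits.argmax(dim=1).item() == 1)  # category 1 = forged
```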
According to the technical solution provided by this embodiment of the invention, a deep forgery data set is acquired, where the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video; image enhancement processing is performed according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and a plurality of positive sample images are formed according to the real video included in each video pair; a training sample set is constructed from the negative sample images and the positive sample images, and a preset machine learning model is trained with the training sample set to form a forged face picture recognition model; and a target face picture to be recognized is input into the forged face picture recognition model to obtain a recognition result of whether the target face picture is a forged face picture. This solves the problems that public data sets contain only a single type of forged face and that the network may overfit, achieves data enhancement, and meets the requirements of deep forgery detection and recognition.
On the basis of the above embodiments, the machine learning model is a fused EfficientNet-b0 network, and the fused EfficientNet-b0 network is an improved version of the standard EfficientNet-b0 network.

The fused EfficientNet-b0 network comprises: a convolutional network, a feature fusion network connected with at least two convolutional layers in the convolutional network, and a classification network connected with the feature fusion network. The convolutional network comprises a plurality of convolutional layers connected end to end; each convolutional layer extracts features from an input feature tensor to obtain an output feature tensor of a set scale, and the input feature tensor of the first convolutional layer is the feature tensor of the input picture fed to the fused EfficientNet-b0 network. The feature fusion network acquires the output feature tensors of the at least two connected convolutional layers and performs scale transformation and feature weighting on each output feature tensor layer by layer to obtain a target fusion feature tensor. The classification network outputs, according to the target fusion feature tensor, a classification result of whether the input picture is a forged face picture.
That is, in this embodiment of the invention, the standard EfficientNet-b0 framework network is loaded with the efficientnet-pytorch module from the official Python package repository and is then modified and optimized to obtain the fused EfficientNet-b0 network.
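For illustration, loading the backbone with that module might look like the following; the two-class head is this sketch's assumption for the real/forged decision, not a value stated in the patent.

```python
from efficientnet_pytorch import EfficientNet

# num_classes=2 (real / forged) is an assumption of this sketch.
model = EfficientNet.from_pretrained("efficientnet-b0", num_classes=2)
```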
The standard EfficientNet-b0 network comprises thirty convolutional layers; the structure of each convolutional layer is shown in Table 1 (provided as an image in the original publication).
The standard EfficientNet-b0 network has an input image size of 224 × 224 × 3 and an output feature dimension of 7 × 7 × 320.
In this embodiment, the feature extraction result output by the thirtieth convolutional layer of the standard EfficientNet-b0 network is not used directly to classify pictures as real or fake. Instead, the output feature tensors of at least two of the thirty convolutional layers are acquired, scale transformation and feature weighting are performed on them layer by layer to obtain a target fusion feature tensor, and this target fusion feature tensor is then used for the real/fake classification.
This network design is motivated by the fact that the face region in a deep forgery video contains both macroscopic cues, such as the junction between the face and the background, and microscopic cues, such as how natural the eyes look; all of these can serve as features for detecting deep forgery. Meanwhile, the information contained in the feature maps output by the different layers of the standard EfficientNet-b0 network differs considerably: shallow feature maps highlight features such as contours and positions, while deep feature maps capture the overall details of the image. In the prior art, MesoNet argued that the middle-layer feature maps of a classification network are more favorable for deep forgery detection and identification, and therefore extracted middle-layer feature maps for feature mapping judgment; by contrast, the winning scheme of a deep forgery detection competition did not change the network used to extract and classify face features, in other words, it used the deepest feature map as the basis for judgment. Based on these research results, this embodiment adopts deep feature fusion, fusing shallow, middle, and deep features into the feature tensor used to compute the classification result (i.e., for loss calculation and for the back-propagation computations that update the gradients).
That is, in this embodiment, the output feature tensors of convolutional layers at different depths among the thirty convolutional layers of the standard EfficientNet-b0 network are acquired and feature-weighted to obtain the target fusion feature tensor. Because image features of different scales and with different emphases are combined, better classification accuracy can be obtained when the target fusion feature tensor is used to classify pictures as real or forged.
Further, in the standard EfficientNet-b0 network the whole network is divided into seven blocks, with the second, fifth, tenth, thirteenth, eighteenth, twenty-third, and twenty-ninth convolutional layers respectively marking the boundary of each block.
Since these convolutional layers already provide output feature tensors at different convolutional depths, the fused EfficientNet-b0 network correspondingly takes out these seven feature tensors one by one and performs deep feature fusion. The fused features then replace the twenty-ninth-layer features of the original network, are fed (or not fed) into the global average pooling layer, and are passed to the classification network, where the activation layer, loss calculation, and back propagation update the gradients and finally yield the classification result.
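One way to take out intermediate feature tensors, sketched under assumptions: the Conv2d modules of the efficientnet-pytorch backbone are enumerated in module-tree order and the seven positions named above are tapped with forward hooks. Whether this enumeration matches the patent's layer numbering is an assumption of the sketch.

```python
import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_name("efficientnet-b0")

# Enumerate the backbone's Conv2d modules in module-tree order; whether this
# order matches the patent's counting of "thirty convolutional layers" is an
# assumption, not something the patent specifies.
convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
TAP_POSITIONS = [2, 5, 10, 13, 18, 23, 29]  # 1-based, as in the patent text

features = {}

def make_hook(pos):
    def hook(module, inputs, output):
        features[pos] = output.detach()
    return hook

for pos in TAP_POSITIONS:
    convs[pos - 1].register_forward_hook(make_hook(pos))

_ = model.extract_features(torch.randn(1, 3, 224, 224))
print({pos: tuple(t.shape) for pos, t in features.items()})
```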
Correspondingly, fig. 1c shows a schematic structural diagram of a fused EfficientNet-b0 network applicable to the method for recognizing a forged face picture provided by this embodiment of the invention. As shown in fig. 1c, the convolutional network included in the fused EfficientNet-b0 network has the same structure as the standard convolutional network included in the standard EfficientNet-b0 network.
the feature fusion network is specifically used for being connected with a second convolutional layer, a fifth convolutional layer, a tenth convolutional layer, a thirteenth convolutional layer, an eighteenth convolutional layer, a twenty-third convolutional layer and a twenty-ninth convolutional layer of the convolutional network; the feature fusion network specifically includes: a first channel number dimension reduction unit connected with the twenty-ninth convolution layer, a first weighting unit connected with the twenty-third convolution layer and the first channel number dimension reduction unit respectively, a first dimension conversion unit connected with the first weighting unit, a second weighting unit connected with the eighteenth convolution layer and the first dimension conversion unit, a second channel number dimension reduction unit connected with the second weighting unit, a third weighting unit connected with the thirteenth convolution layer and the second channel number dimension reduction unit respectively, a second dimension conversion unit connected with the third weighting unit, a fourth weighting unit connected with the tenth convolution layer and the second dimension conversion unit respectively, a third dimension conversion unit connected with the fourth weighting unit, a fifth weighting unit connected with the fifth convolution layer and the third dimension conversion unit respectively, a fourth dimension conversion unit connected with the fourth weighting unit, and a sixth weighting unit connected with the second convolution layer and the fourth dimension conversion unit, and the channel number dimension-increasing unit is connected with the sixth weighting unit and is used for outputting the target fusion feature tensor.
Here, dimension reduction refers to mapping data represented in a high-dimensional space to a lower-dimensional representation; specifically, the channel number dimension reduction unit performs dimension reduction because the channel dimension of the image feature tensor is relatively high.
Specifically, each channel number dimension reduction unit included in the feature fusion network is a first type 1 × 1 convolution layer with a set dimension reduction scale;

the first type 1 × 1 convolution layer is used for performing dimension reduction of a set scale on the number of channels in the input feature tensor; each dimension conversion unit included in the feature fusion network includes a first type 1 × 1 convolution layer with a set dimension reduction scale connected end to end with a neighbor up-sampling unit with a set dimension-increasing scale; the neighbor up-sampling unit is used for performing dimension-increasing of a set scale on the feature map in the input feature tensor; the channel number dimension-increasing unit in the feature fusion network is a second type 1 × 1 convolution layer with a set dimension-increasing scale; and the second type 1 × 1 convolution layer is used for performing dimension-increasing of a set scale on the number of channels in the input feature tensor.
The first type 1 × 1 convolution layer performs dimension reduction of a set scale on the number of channels in the input feature tensor. The neighbor up-sampling unit implements nearest-neighbor up-sampling, an image interpolation method in which each pixel of the enlarged feature map takes the value of its nearest pixel in the original map. Dimension reduction converts high-dimensional data into low-dimensional data, and dimension increasing converts low-dimensional data into high-dimensional data.
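A minimal PyTorch sketch of one dimension conversion unit follows, assuming the up-sampling doubles the feature-map width and height as in the scales listed below; the class name is this sketch's own.

```python
import torch
import torch.nn as nn

class DimensionConversionUnit(nn.Module):
    """1 x 1 convolution for channel dimension reduction, followed by
    nearest-neighbor up-sampling that doubles the feature-map size."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upsample(self.reduce(x))

# e.g. the first dimension conversion unit: 7x7x192 -> 7x7x112 -> 14x14x112
unit = DimensionConversionUnit(192, 112)
print(unit(torch.randn(1, 192, 7, 7)).shape)  # torch.Size([1, 112, 14, 14])
```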
Correspondingly, a specific structural schematic diagram of a fused EfficientNet-b0 network applicable to the method for recognizing a forged face picture provided by this embodiment of the invention is shown in fig. 1d.
As shown in fig. 1d, in this embodiment the first channel number dimension reduction unit may include a 1 × 1 convolution layer (i.e., a first type 1 × 1 convolution layer) with a dimension reduction scale of 7 × 7 × 192; the first dimension conversion unit may include a 1 × 1 convolution layer with a dimension reduction scale of 7 × 7 × 112 and a neighbor up-sampling unit with a dimension-increasing scale of 14 × 14 × 112; the second channel number dimension reduction unit may include a 1 × 1 convolution layer with a dimension reduction scale of 14 × 14 × 80; the second dimension conversion unit may include a 1 × 1 convolution layer with a dimension reduction scale of 14 × 14 × 40 and a neighbor up-sampling unit with a dimension-increasing scale of 28 × 28 × 40; the third dimension conversion unit may include a 1 × 1 convolution layer with a dimension reduction scale of 28 × 28 × 24 and a neighbor up-sampling unit with a dimension-increasing scale of 56 × 56 × 24; the fourth dimension conversion unit may include a 1 × 1 convolution layer with a dimension reduction scale of 56 × 56 × 16 and a neighbor up-sampling unit with a dimension-increasing scale of 112 × 112 × 16; and the channel number dimension-increasing unit may include a 1 × 1 convolution layer (i.e., a second type 1 × 1 convolution layer) with a dimension-increasing scale of 112 × 112 × 320.
In this embodiment, for example, the twenty-third convolutional layer feature tensor (7 × 7 × 192) and the twenty-ninth convolutional layer feature tensor (7 × 7 × 320) are feature-fused. Since the twenty-third layer tensor has 192 channels and the twenty-ninth layer tensor has 320 channels, the channel number of the twenty-ninth layer tensor must be reduced before the fusion; after channel number dimension reduction it becomes 7 × 7 × 192.
In this embodiment, for example, the eighteenth convolutional layer feature tensor (14 × 14 × 112) is feature-fused with the fused twenty-third layer feature tensor (7 × 7 × 112). Since the feature-map width and height of the fused twenty-third layer tensor are 7 × 7 while those of the eighteenth layer tensor are 14 × 14, the fused twenty-third layer tensor must first undergo neighbor up-sampling, after which it becomes 14 × 14 × 112.
In this embodiment, the overall process can be summarized as follows: an image of 224 × 224 × 3 is input; after five rounds of convolution and down-sampling, the image is reduced by a factor of 32 and a feature tensor of size 7 × 7 × 320 is output; convolutional dimension raising, neighbor up-sampling, and weighted feature fusion then produce a feature tensor of size 112 × 112 × 320. Here 224 × 224 is the width and height of the image, 3 is the number of image channels, 7 × 7 is the size of the output feature map, and 320 is the number of output feature map channels. A schematic diagram of the size changes of the fused EfficientNet-b0 network feature maps is shown in fig. 1e.
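To make the size bookkeeping concrete, the sketch below walks the first two fusion steps through the shapes quoted above (PyTorch NCHW layout); the 0.5/0.5 weighting coefficients are placeholders, since the patent does not state how the feature weighting is parameterized.

```python
import torch
import torch.nn as nn

f29 = torch.randn(1, 320, 7, 7)    # twenty-ninth layer output, 7 x 7 x 320
f23 = torch.randn(1, 192, 7, 7)    # twenty-third layer output, 7 x 7 x 192
f18 = torch.randn(1, 112, 14, 14)  # eighteenth layer output, 14 x 14 x 112

reduce_29 = nn.Conv2d(320, 192, kernel_size=1)       # 7x7x320 -> 7x7x192
fused_23 = 0.5 * f23 + 0.5 * reduce_29(f29)          # first weighting unit

to_18 = nn.Sequential(nn.Conv2d(192, 112, kernel_size=1),
                      nn.Upsample(scale_factor=2, mode="nearest"))
fused_18 = 0.5 * f18 + 0.5 * to_18(fused_23)         # second weighting unit
print(fused_18.shape)  # torch.Size([1, 112, 14, 14])
```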
Further, after applying the face synthesis image enhancement and the modification of the network structure, the network is trained to verify the effect. First, the initial hyper-parameters of the fused EfficientNet-b0 convolutional neural network need to be set; the parameters are shown in fig. 1f.
The training set contains 59668 images in total: the positive sample images (category 0) number 29962 and the negative sample images (category 1) number 29706, the latter synthesized with the real-and-fake face synthesis enhancement described above. The fused EfficientNet-b0 network and the standard EfficientNet-b0 network were each trained for 30 epochs on this data set, with the loss value and accuracy recorded and plotted as shown in fig. 1g. As the graph shows, because the fused EfficientNet-b0 network has more parameters and a larger amount of computation, it converges more slowly in the early stage than the comparatively lightweight standard EfficientNet-b0 network, but in the later stage it reaches a lower loss value and a higher accuracy.
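A hedged training-loop sketch follows. The actual initial hyper-parameters appear only as an image (fig. 1f) in the original publication, so the optimizer, learning rate, and batch size below are placeholders rather than the patent's values; only the 30 epochs and the two class labels come from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=30, lr=1e-3, batch_size=32):
    """Train the classifier and log per-epoch loss and accuracy."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # two classes: 0 = real, 1 = forged
    model.train()
    for epoch in range(epochs):
        total, correct, running_loss = 0, 0, 0.0
        for images, labels in loader:
            optimizer.zero_grad()
            logits = model(images)
            loss = criterion(logits, labels)
            loss.backward()          # back propagation updates the gradients
            optimizer.step()
            running_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(1) == labels).sum().item()
            total += labels.size(0)
        print(f"epoch {epoch}: loss={running_loss / total:.4f} "
              f"acc={correct / total:.4f}")
```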
In this embodiment, the overall implementation process consists of reading the corresponding real and fake videos in the Celeb-DF v2 public deep forgery detection data set, cutting both videos into the specified frames according to the preset interception time points, performing image fusion on the real and fake images of each corresponding pair of frames, feeding the images into the fused EfficientNet-b0 network for training, and finally loading the trained model file for prediction and identification. The overall logic flow diagram is shown in fig. 1h.
In the technical solution of this embodiment, a fused EfficientNet-b0 network is used as the machine learning model; in its feature fusion, first type 1 × 1 convolution layers perform channel number dimension reduction and neighbor up-sampling adjusts the pixels, which reduces the probability of network overfitting.
Example two
Fig. 2 is a structural diagram of an apparatus for recognizing a forged face picture according to the second embodiment of the present invention. The apparatus may be implemented by software and/or hardware and may be configured in a terminal or a server to implement the method for recognizing a forged face picture described in the embodiments of the present invention. As shown in fig. 2, the apparatus may specifically include: a deep forgery data set acquisition module 210, an image enhancement processing module 220, a forged face picture recognition model forming module 230, and a result identification module 240.
The deep forgery data set acquisition module 210 is configured to acquire a deep forgery data set, where the deep forgery data set includes a plurality of video pairs, and each video pair includes a real video and a forged video obtained by performing face replacement on the real face in the real video;
an image enhancement processing module 220, configured to perform image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and form a plurality of positive sample images according to the real video included in each video pair;
a forged face picture recognition model forming module 230, configured to construct a training sample set according to each negative sample image and each positive sample image, and train a preset machine learning model using the training sample set to form a forged face picture recognition model;
and the result identification module 240 is configured to input the target face picture to be recognized into the forged face picture recognition model and obtain a recognition result of whether the target face picture is a forged face picture.
According to the technical solution provided by this embodiment of the invention, a deep forgery data set is acquired, where the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video; image enhancement processing is performed according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and a plurality of positive sample images are formed according to the real video included in each video pair; a training sample set is constructed from the negative sample images and the positive sample images, and a preset machine learning model is trained with the training sample set to form a forged face picture recognition model; and a target face picture to be recognized is input into the forged face picture recognition model to obtain a recognition result of whether the target face picture is a forged face picture. This solves the problems that public data sets contain only a single type of forged face and that the network may overfit, achieves data enhancement, and meets the requirements of deep forgery detection and recognition.
On the basis of the foregoing embodiments, the image enhancement processing module 220 may specifically include:
the video image frame intercepting unit is used for respectively intercepting a real video image frame and a forged video image frame from a target real video and a target forged video in a currently processed target video pair according to a preset intercepting time point;
the face feature tensor extraction unit is used for extracting a real face feature tensor and a replacement face feature tensor from the real video image frame and the forged video image frame respectively;
a forged face feature tensor forming unit, configured to form at least one forged face feature tensor according to the real face feature tensor and the replacement face feature tensor;
and the negative sample image forming unit is used for replacing the real human face feature tensor in the real video image frame by using each forged human face feature tensor, or replacing the replaced human face feature tensor in the forged video image frame to form at least one negative sample image.
On the basis of the foregoing embodiments, the forged face feature tensor forming unit may be specifically configured to:
randomly generating at least one face synthesis proportion weight p, where p ∈ (0, 1);

according to the formula

output = p × fake_input_tensor + (1 − p) × real_input_tensor

calculating the forged face feature tensor output corresponding to each face synthesis proportion weight;

where real_input_tensor is the real face feature tensor and fake_input_tensor is the replacement face feature tensor.
On the basis of the foregoing embodiments, the image enhancement processing module 220 may further include:
and the positive sample image forming unit is configured to intercept, from the currently processed real video, candidate real video frames respectively corresponding to at least one interception time point, and to take the candidate real video frames that include a real face as positive sample images.
On the basis of the above embodiments, the machine learning model is a fused EfficientNet-b0 network, and the fused EfficientNet-b0 network is an improved version of the standard EfficientNet-b0 network;
the fused EfficientNet-b0 network comprises the following components: the system comprises a convolution network, a feature fusion network connected with at least two convolution layers in the convolution network, and a classification network connected with the feature fusion network;
the convolutional network comprises a plurality of convolutional layers which are connected end to end, each convolutional layer is used for extracting the characteristics of an input characteristic tensor to obtain an output characteristic tensor with a set scale, and the input characteristic tensor of the first convolutional layer is the characteristic tensor of an input picture input to the fused EfficientNet-b0 network;
the characteristic fusion network is used for acquiring output characteristic tensors of at least two connected convolution layers and carrying out scale transformation and characteristic weighting on each output characteristic tensor layer by layer to obtain a target fusion characteristic tensor;
and the classification network is used for outputting a classification result of whether the input picture is a forged face picture or not according to the target fusion feature tensor.
On the basis of the above embodiments, the convolution network included in the fused EfficientNet-b0 network has the same structure as the standard convolution network included in the standard EfficientNet-b0 network;
the feature fusion network is specifically used for being connected with a second convolutional layer, a fifth convolutional layer, a tenth convolutional layer, a thirteenth convolutional layer, an eighteenth convolutional layer, a twenty-third convolutional layer and a twenty-ninth convolutional layer of the convolutional network;
the feature fusion network specifically includes: a first channel number dimension reduction unit connected with the twenty-ninth convolution layer, a first weighting unit connected with the twenty-third convolution layer and the first channel number dimension reduction unit respectively, a first dimension conversion unit connected with the first weighting unit, a second weighting unit connected with the eighteenth convolution layer and the first dimension conversion unit, a second channel number dimension reduction unit connected with the second weighting unit, a third weighting unit connected with the thirteenth convolution layer and the second channel number dimension reduction unit respectively, a second dimension conversion unit connected with the third weighting unit, a fourth weighting unit connected with the tenth convolution layer and the second dimension conversion unit respectively, a third dimension conversion unit connected with the fourth weighting unit, a fifth weighting unit connected with the fifth convolution layer and the third dimension conversion unit respectively, a fourth dimension conversion unit connected with the fourth weighting unit, and a sixth weighting unit connected with the second convolution layer and the fourth dimension conversion unit, and the channel number dimension-increasing unit is connected with the sixth weighting unit and is used for outputting the target fusion feature tensor.
On the basis of the above embodiments, each channel number dimension reduction unit included in the feature fusion network is a first type 1 × 1 convolution layer with a set dimension reduction scale;

the first type 1 × 1 convolution layer is used for performing dimension reduction of a set scale on the number of channels in the input feature tensor;

each dimension conversion unit included in the feature fusion network includes a first type 1 × 1 convolution layer with a set dimension reduction scale connected end to end with a neighbor up-sampling unit with a set dimension-increasing scale;

the neighbor up-sampling unit is used for performing dimension-increasing of a set scale on the feature map in the input feature tensor;

the channel number dimension-increasing unit in the feature fusion network is a second type 1 × 1 convolution layer with a set dimension-increasing scale;

and the second type 1 × 1 convolution layer is used for performing dimension-increasing of a set scale on the number of channels in the input feature tensor.
The identification device for the forged face picture can execute the identification method for the forged face picture provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE III
Fig. 3 is a structural diagram of a computer device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes a processor 310, a memory 320, an input device 330, and an output device 340; the number of the processors 310 in the device may be one or more, and one processor 310 is taken as an example in fig. 3; the processor 310, the memory 320, the input device 330 and the output device 340 in the apparatus may be connected by a bus or other means, and fig. 3 illustrates the connection by a bus as an example.
The memory 320, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for recognizing a forged face picture in the embodiments of the present invention (for example, the deep forgery data set acquisition module 210, the image enhancement processing module 220, the forged face picture recognition model forming module 230, and the result identification module 240). The processor 310 runs the software programs, instructions, and modules stored in the memory 320 to execute the various functional applications and data processing of the device, thereby implementing the above-mentioned method for recognizing a forged face picture, which includes:
acquiring a deep forgery data set, wherein the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video;
performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to the negative sample images and the positive sample images, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
and inputting the target face picture to be recognized into the forged face picture recognition model, and obtaining a recognition result of whether the target face picture is a forged face picture.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus. The output device 340 may include a display device such as a display screen.
Example four
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a method for recognizing a forged face picture, and the method includes:
acquiring a deep forgery data set, wherein the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by performing face replacement on the real face in the real video;
performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to the negative sample images and the positive sample images, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
inputting the target face picture to be recognized into the forged face picture recognition model, and acquiring a recognition result of whether the target face picture is a forged face picture.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the method for recognizing a forged face picture provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus the necessary general-purpose hardware, or by hardware alone, though the former is the preferred implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the above recognition apparatus, the included units and modules are divided merely according to functional logic, but the division is not limited thereto as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
It is to be noted that the foregoing merely illustrates the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; its scope is determined by the scope of the appended claims.

Claims (10)

1. A method for recognizing a forged face picture, characterized by comprising:
acquiring a deep forgery data set, wherein the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by replacing the real face in the real video with a replacement face;
performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and forming a plurality of positive sample images according to the real video included in each video pair;
constructing a training sample set according to the negative sample images and the positive sample images, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
and inputting the target face picture to be recognized into the forged face picture recognition model, and acquiring a recognition result of whether the target face picture is a forged face picture.
2. The method of claim 1, wherein performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images comprises:
capturing, according to a preset capture time point, a real video image frame from the target real video and a forged video image frame from the target forged video in the currently processed target video pair;
extracting a real face feature tensor from the real video image frame and a replacement face feature tensor from the forged video image frame;
forming at least one forged face feature tensor according to the real face feature tensor and the replacement face feature tensor;
and replacing the real face feature tensor in the real video image frame, or the replacement face feature tensor in the forged video image frame, with each forged face feature tensor to form at least one negative sample image.
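A minimal sketch of the claim-2 negative-sample construction, assuming OpenCV for frame capture, an externally supplied face box shared by the two aligned frames, and a `blend` callable implementing the claim-3 synthesis (shown after claim 3 below); these names are illustrative, not taken from the patent.

```python
# Hedged sketch of claim 2: capture paired frames, blend the face regions,
# and paste the forged face back to form one negative sample.
import cv2
import numpy as np

def grab_frame(video_path: str, t_msec: float) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, t_msec)  # preset capture time point
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"no frame at {t_msec} ms in {video_path}")
    return frame

def make_negative(real_path, fake_path, t_msec, box, blend):
    """box = (x, y, w, h) of the face region in both aligned frames."""
    real_f = grab_frame(real_path, t_msec)
    fake_f = grab_frame(fake_path, t_msec)
    x, y, w, h = box
    real_face = real_f[y:y+h, x:x+w].astype(np.float32)  # real face tensor
    fake_face = fake_f[y:y+h, x:x+w].astype(np.float32)  # replacement face tensor
    forged = blend(real_face, fake_face)                 # claim-3 synthesis
    out = real_f.copy()
    out[y:y+h, x:x+w] = forged.clip(0, 255).astype(np.uint8)
    return out                                           # one negative sample
```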
3. The method of claim 2, wherein forming at least one forged face feature tensor according to the real face feature tensor and the replacement face feature tensor comprises:
randomly generating at least one face synthesis proportion weight p, wherein p ∈ (0, 1);
calculating, according to the formula output = p × real_input_tensor + (1 − p) × fake_input_tensor, the forged face feature tensor output corresponding to each face synthesis proportion weight;
wherein real_input_tensor is the real face feature tensor and fake_input_tensor is the replacement face feature tensor.
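A minimal NumPy sketch of the claim-3 synthesis, assuming the blend is the convex combination implied by the proportion weight p and the two named tensors; the function name and array representation are illustrative.

```python
import numpy as np

def blend(real_input_tensor: np.ndarray, fake_input_tensor: np.ndarray,
          rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """output = p * real_input_tensor + (1 - p) * fake_input_tensor, p in (0, 1)."""
    p = rng.uniform(0.0, 1.0)  # face synthesis proportion weight p
    return p * real_input_tensor + (1.0 - p) * fake_input_tensor
```

Because p is drawn anew on every call, repeatedly blending the same frame pair yields distinct forged faces, which is the data enhancement effect the embodiments aim at.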
4. The method of claim 1, wherein forming a plurality of positive sample images according to the real video included in each video pair comprises:
capturing, in the currently processed real video, candidate real video frames respectively corresponding to at least one capture time point, and taking the candidate real video frames that include a real face as positive sample images.
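A hedged sketch of the claim-4 positive-sample capture, reusing `grab_frame` from the claim-2 sketch above; the Haar-cascade face check is an illustrative stand-in for whatever face detector an implementation would actually use.

```python
# Hedged sketch of claim 4: keep only captured frames that contain a face.
import cv2

def positive_samples(real_video_path, t_msecs):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    samples = []
    for t in t_msecs:
        frame = grab_frame(real_video_path, t)  # from the claim-2 sketch
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if len(detector.detectMultiScale(gray)) > 0:  # frame contains a face
            samples.append(frame)                     # positive sample image
    return samples
```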
5. The method of any one of claims 1-4, wherein the machine learning model is a fused EfficientNet-b0 network, the fused EfficientNet-b0 network being a modified version of the standard EfficientNet-b0 network;
the fused EfficientNet-b0 network comprises: a convolutional network, a feature fusion network connected to at least two convolutional layers in the convolutional network, and a classification network connected to the feature fusion network;
the convolutional network comprises a plurality of convolutional layers connected end to end, each convolutional layer extracting features from its input feature tensor to obtain an output feature tensor of a set scale, the input feature tensor of the first convolutional layer being the feature tensor of the picture input to the fused EfficientNet-b0 network;
the feature fusion network is used for acquiring the output feature tensors of the at least two connected convolutional layers and performing scale transformation and feature weighting on each output feature tensor layer by layer to obtain a target fusion feature tensor;
and the classification network is used for outputting, according to the target fusion feature tensor, a classification result of whether the input picture is a forged face picture.
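To make the three-part structure of claim 5 concrete, here is a minimal PyTorch skeleton, assuming a backbone that returns the tapped feature tensors as a list (for example, timm's `create_model('efficientnet_b0', features_only=True)`); the fusion module is left abstract because claims 6-7 spell out its internals, and the head size is an assumption.

```python
# Skeleton of the fused EfficientNet-b0 named in claim 5; illustrative only.
import torch
import torch.nn as nn

class FusedEfficientNetB0(nn.Module):
    def __init__(self, backbone: nn.Module, fusion: nn.Module, fused_dim: int):
        super().__init__()
        self.backbone = backbone      # convolutional network (feature extractor)
        self.fusion = fusion          # feature fusion network over tapped layers
        self.classifier = nn.Sequential(  # classification network
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(fused_dim, 2),  # real vs. forged
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)   # list of output feature tensors, one per tap
        fused = self.fusion(feats) # target fusion feature tensor
        return self.classifier(fused)
```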
6. The method of claim 5, wherein the convolutional network included in the fused EfficientNet-b0 network has the same structure as the standard convolutional network included in the standard EfficientNet-b0 network;
the feature fusion network is connected to the second, fifth, tenth, thirteenth, eighteenth, twenty-third, and twenty-ninth convolutional layers of the convolutional network;
the feature fusion network specifically comprises: a first channel number dimension reduction unit connected to the twenty-ninth convolutional layer; a first weighting unit connected to the twenty-third convolutional layer and the first channel number dimension reduction unit respectively; a first dimension conversion unit connected to the first weighting unit; a second weighting unit connected to the eighteenth convolutional layer and the first dimension conversion unit; a second channel number dimension reduction unit connected to the second weighting unit; a third weighting unit connected to the thirteenth convolutional layer and the second channel number dimension reduction unit respectively; a second dimension conversion unit connected to the third weighting unit; a fourth weighting unit connected to the tenth convolutional layer and the second dimension conversion unit respectively; a third dimension conversion unit connected to the fourth weighting unit; a fifth weighting unit connected to the fifth convolutional layer and the third dimension conversion unit respectively; a fourth dimension conversion unit connected to the fifth weighting unit; a sixth weighting unit connected to the second convolutional layer and the fourth dimension conversion unit; and a channel number dimension raising unit connected to the sixth weighting unit and used for outputting the target fusion feature tensor.
7. The method according to claim 6, wherein each channel number dimension reduction unit included in the feature fusion network is a first-type 1×1 convolutional layer with a set dimension reduction scale;
the first-type 1×1 convolutional layer is used for performing dimension reduction, at the set scale, on the number of channels of the input feature tensor;
each dimension conversion unit included in the feature fusion network comprises, connected end to end, a first-type 1×1 convolutional layer with a set dimension reduction scale and a nearest-neighbor up-sampling unit with a set dimension increase scale;
the nearest-neighbor up-sampling unit is used for performing dimension increase, at the set scale, on the feature maps of the input feature tensor;
the channel number dimension raising unit in the feature fusion network is a second-type 1×1 convolutional layer with a set dimension increase scale;
and the second-type 1×1 convolutional layer is used for performing dimension increase, at the set scale, on the number of channels of the input feature tensor.
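The units claim 7 names map onto standard layers; the following hedged sketch assumes "weighting" means a learnable convex combination of two equal-shape tensors (the claim does not fix the rule), and all channel counts and scales are illustrative.

```python
# Hedged sketch of the claim-7 building blocks; illustrative, not canonical.
import torch
import torch.nn as nn

class ChannelProject(nn.Module):
    """First/second-type 1x1 convolution: changes only the channel count."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class DimensionConversion(nn.Module):
    """Channel reduction (1x1 conv) followed by nearest-neighbor up-sampling."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.reduce = ChannelProject(in_ch, out_ch)
        self.up = nn.Upsample(scale_factor=scale, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.reduce(x))

class Weighting(nn.Module):
    """Weighting unit: learnable convex combination of two equal-shape tensors."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(0.5))

    def forward(self, tap: torch.Tensor, carried: torch.Tensor) -> torch.Tensor:
        return self.w * tap + (1.0 - self.w) * carried
```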
8. An apparatus for recognizing a forged face picture, comprising:
the deep forgery data set acquisition module is used for acquiring a deep forgery data set, wherein the deep forgery data set comprises a plurality of video pairs, and each video pair comprises a real video and a forged video obtained by replacing the real face in the real video with a replacement face;
the image enhancement processing module is used for performing image enhancement processing according to the real face and the replacement face included in each video pair to form a plurality of negative sample images, and for forming a plurality of positive sample images according to the real video included in each video pair;
the forged face picture recognition model forming module is used for constructing a training sample set according to each negative sample image and each positive sample image, and training a preset machine learning model by using the training sample set to form a forged face picture recognition model;
and the result recognition module is used for inputting the target face picture to be recognized into the forged face picture recognition model and acquiring a recognition result of whether the target face picture is a forged face picture.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for recognizing a forged face picture according to any one of claims 1 to 7.
10. A storage medium having computer-executable instructions stored thereon, wherein the instructions, when executed by a processor, implement the method for recognizing a forged face picture according to any one of claims 1 to 7.
CN202111027883.1A 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures Active CN113762138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027883.1A CN113762138B (en) 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111027883.1A CN113762138B (en) 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures

Publications (2)

Publication Number Publication Date
CN113762138A true CN113762138A (en) 2021-12-07
CN113762138B CN113762138B (en) 2024-04-23

Family

ID=78792723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027883.1A Active CN113762138B (en) 2021-09-02 2021-09-02 Identification method, device, computer equipment and storage medium for fake face pictures

Country Status (1)

Country Link
CN (1) CN113762138B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114298997A (en) * 2021-12-23 2022-04-08 北京瑞莱智慧科技有限公司 Method and device for detecting forged picture and storage medium
CN114663962A (en) * 2022-05-19 2022-06-24 浙江大学 Lip-shaped synchronous face forgery generation method and system based on image completion
CN114821825A (en) * 2022-06-30 2022-07-29 广州中平智能科技有限公司 Multi-granularity face forgery detection method, system, equipment and medium
CN114841340A (en) * 2022-04-22 2022-08-02 马上消费金融股份有限公司 Deep forgery algorithm identification method and device, electronic equipment and storage medium
CN114998277A (en) * 2022-06-16 2022-09-02 吉林大学 Grab point identification method and device, electronic equipment and computer storage medium
CN117315798A (en) * 2023-11-20 2023-12-29 齐鲁工业大学(山东省科学院) Deep counterfeiting detection method based on identity facial features
CN114998277B (en) * 2022-06-16 2024-05-17 吉林大学 Grabbing point identification method and device, electronic equipment and computer storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110620891A (en) * 2019-09-27 2019-12-27 上海依图网络科技有限公司 Imaging system and video processing method
CN111768336A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Face image processing method and device, computer equipment and storage medium
CN111767919A (en) * 2020-04-10 2020-10-13 福建电子口岸股份有限公司 Target detection method for multi-layer bidirectional feature extraction and fusion
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
CN111967344A (en) * 2020-07-28 2020-11-20 南京信息工程大学 Refined feature fusion method for face forgery video detection
CN112149651A (en) * 2020-11-25 2020-12-29 深圳点猫科技有限公司 Facial expression recognition method, device and equipment based on deep learning
CN112287784A (en) * 2020-10-20 2021-01-29 哈尔滨工程大学 Radar signal classification method based on deep convolutional neural network and feature fusion
CN112488137A (en) * 2019-09-11 2021-03-12 广州虎牙科技有限公司 Sample acquisition method and device, electronic equipment and machine-readable storage medium
CN112733760A (en) * 2021-01-15 2021-04-30 上海明略人工智能(集团)有限公司 Face anti-counterfeiting detection method and system
CN112818767A (en) * 2021-01-18 2021-05-18 深圳市商汤科技有限公司 Data set generation method, data set forgery detection device, electronic device, and storage medium
CN112926508A (en) * 2021-03-25 2021-06-08 支付宝(杭州)信息技术有限公司 Training method and device of living body detection model
CN113313054A (en) * 2021-06-15 2021-08-27 中国科学技术大学 Face counterfeit video detection method, system, equipment and storage medium

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488137A (en) * 2019-09-11 2021-03-12 广州虎牙科技有限公司 Sample acquisition method and device, electronic equipment and machine-readable storage medium
CN110620891A (en) * 2019-09-27 2019-12-27 上海依图网络科技有限公司 Imaging system and video processing method
CN111767919A (en) * 2020-04-10 2020-10-13 福建电子口岸股份有限公司 Target detection method for multi-layer bidirectional feature extraction and fusion
CN111768336A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Face image processing method and device, computer equipment and storage medium
CN111967344A (en) * 2020-07-28 2020-11-20 南京信息工程大学 Refined feature fusion method for face forgery video detection
CN111967427A (en) * 2020-08-28 2020-11-20 广东工业大学 Fake face video identification method, system and readable storage medium
CN112287784A (en) * 2020-10-20 2021-01-29 哈尔滨工程大学 Radar signal classification method based on deep convolutional neural network and feature fusion
CN112149651A (en) * 2020-11-25 2020-12-29 深圳点猫科技有限公司 Facial expression recognition method, device and equipment based on deep learning
CN112733760A (en) * 2021-01-15 2021-04-30 上海明略人工智能(集团)有限公司 Face anti-counterfeiting detection method and system
CN112818767A (en) * 2021-01-18 2021-05-18 深圳市商汤科技有限公司 Data set generation method, data set forgery detection device, electronic device, and storage medium
CN112926508A (en) * 2021-03-25 2021-06-08 支付宝(杭州)信息技术有限公司 Training method and device of living body detection model
CN113313054A (en) * 2021-06-15 2021-08-27 中国科学技术大学 Face counterfeit video detection method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUEZUN LI: "Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
BAO Yuxuan; LU Tianliang; DU Yanhui: "A Survey of Deepfake Video Detection Technology", Computer Science, no. 09 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114298997A (en) * 2021-12-23 2022-04-08 北京瑞莱智慧科技有限公司 Method and device for detecting forged picture and storage medium
CN114841340A (en) * 2022-04-22 2022-08-02 马上消费金融股份有限公司 Deep forgery algorithm identification method and device, electronic equipment and storage medium
CN114841340B (en) * 2022-04-22 2023-07-28 马上消费金融股份有限公司 Identification method and device for depth counterfeiting algorithm, electronic equipment and storage medium
CN114663962A (en) * 2022-05-19 2022-06-24 浙江大学 Lip-shaped synchronous face forgery generation method and system based on image completion
CN114663962B (en) * 2022-05-19 2022-09-16 浙江大学 Lip-shaped synchronous face counterfeiting generation method and system based on image completion
CN114998277A (en) * 2022-06-16 2022-09-02 吉林大学 Grab point identification method and device, electronic equipment and computer storage medium
CN114998277B (en) * 2022-06-16 2024-05-17 吉林大学 Grabbing point identification method and device, electronic equipment and computer storage medium
CN114821825A (en) * 2022-06-30 2022-07-29 广州中平智能科技有限公司 Multi-granularity face forgery detection method, system, equipment and medium
CN114821825B (en) * 2022-06-30 2022-12-06 广州中平智能科技有限公司 Multi-granularity face forgery detection method, system, equipment and medium
CN117315798A (en) * 2023-11-20 2023-12-29 齐鲁工业大学(山东省科学院) Deep counterfeiting detection method based on identity facial features
CN117315798B (en) * 2023-11-20 2024-03-12 齐鲁工业大学(山东省科学院) Deep counterfeiting detection method based on identity facial features

Also Published As

Publication number Publication date
CN113762138B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN108830855B (en) Full convolution network semantic segmentation method based on multi-scale low-level feature fusion
CN111080628B (en) Image tampering detection method, apparatus, computer device and storage medium
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN111462183A (en) Behavior identification method and system based on attention mechanism double-current network
CN112991278B (en) Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics
CN111062854A (en) Method, device, terminal and storage medium for detecting watermark
CN114998602A (en) Domain adaptive learning method and system based on low confidence sample contrast loss
Guo et al. Blind detection of glow-based facial forgery
Mancas Relative influence of bottom-up and top-down attention
Yu et al. Deep forgery discriminator via image degradation analysis
Huang et al. DS-UNet: A dual streams UNet for refined image forgery localization
CN111368865A (en) Method and device for detecting remote sensing image oil storage tank, readable storage medium and equipment
Zhang et al. Shallow-and Deep-fake Image Manipulation Localization Using Deep Learning
Gan et al. Highly accurate end-to-end image steganalysis based on auxiliary information and attention mechanism
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
KR102444172B1 (en) Method and System for Intelligent Mining of Digital Image Big-Data
CN114155165A (en) Image defogging method based on semi-supervision
CN115861605A (en) Image data processing method, computer equipment and readable storage medium
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium
CN113591789B (en) Expression recognition method based on progressive grading
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training
CN116433989B (en) Feature enhancement method, device, computer equipment and storage medium
WO2024099026A1 (en) Image processing method and apparatus, device, storage medium and program product
CN117392074A (en) Method, apparatus, computer device and storage medium for detecting object in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant