CN111144314B - Method for detecting tampered face video - Google Patents

Method for detecting tampered face video

Info

Publication number
CN111144314B
Authority
CN
China
Prior art keywords
frames
feature
face
tampered
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911376257.6A
Other languages
Chinese (zh)
Other versions
CN111144314A (en)
Inventor
张勇东
尚志华
谢洪涛
邓旭冉
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Original Assignee
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute and University of Science and Technology of China USTC
Priority to CN201911376257.6A
Publication of CN111144314A
Application granted
Publication of CN111144314B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting tampered face video, which comprises the following steps: decoding the face video data into a group of consecutive frame images, cropping the face region of each frame image, and saving each face region as a face picture indexed by its frame number; extracting features from each face picture with a feature extractor to obtain the corresponding feature map; and feeding the feature maps of two consecutive frames into an inter-frame correlation classifier, which fuses the two feature maps together with an attention mechanism and classifies them, the classification result being the probability that the two input frames are tampered. The method exploits both the content of each frame picture and the inter-frame relation between adjacent frames, which benefits detection. Moreover, detection is completed automatically, so the method is suitable for large-scale video platforms and social platforms.

Description

Method for detecting tampered face video
Technical Field
The invention relates to the technical field of cyberspace security, and in particular to a method for detecting tampered face video.
Background
Deep-neural-network-based "face swapping" technology has become very popular: it can quickly replace the face in a video with another person's face, and a growing number of bad actors use it to tamper with videos of politicians, stars and celebrities in order to spread false information. In response to this phenomenon, methods for detecting whether a video has been tampered with have emerged, such as detecting blink frequency or checking noise consistency.
However, the existing methods have poor detection performance and cannot guarantee accurate results; in particular, with the rapid development of forgery techniques, they can no longer meet the requirements of practical applications.
Disclosure of Invention
The invention aims to provide a method for detecting tampered face video with higher detection accuracy.
The object of the invention is achieved by the following technical solution:
a method for detecting a tampered face video comprises the following steps:
decoding the face video data into a group of continuous frame images, intercepting the face area of each frame image, and correspondingly storing the face area as a face picture according to the frame number;
extracting the characteristics of each face picture through a characteristic extractor to obtain a corresponding characteristic graph;
and simultaneously inputting the feature maps of two continuous frames into an interframe correlation classifier, fusing the feature maps of the two frames by adopting an attention mechanism, and classifying, wherein the classification result is the probability that the two input frames are tampered.
It can be seen from the technical solution provided by the invention that the method, based on a deep neural network, exploits both the content of each frame picture and the inter-frame relation between adjacent frames, and therefore achieves a good effect. Moreover, detection is completed automatically, so the method is suitable for large-scale video platforms and social platforms.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting a tampered face video according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention module according to an embodiment of the invention;
fig. 3 is a schematic diagram of a classifier according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The embodiment of the invention provides a method for detecting tampered face video, which mainly comprises the following steps:
1. and decoding the face video data into a group of continuous frame images, intercepting the face area of each frame image, and correspondingly storing the face area as a face image according to the frame number.
In the embodiment of the invention, the face video data can be decoded into a group of consecutive frame images with the widely used opencv or ffmpeg toolkits, and the face region of each frame image can be cropped with the open-source Dlib library in Python; the face regions in different frame images may have the same or different sizes. A minimal sketch of this step appears below.
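For illustration only, a minimal Python sketch of this step is given below. It assumes opencv's cv2 module and dlib's default frontal face detector; the output directory layout and file names are the sketch's own choices, not prescribed by the invention.

import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def extract_face_pictures(video_path, out_dir):
    cap = cv2.VideoCapture(video_path)       # decode the video into frames
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # dlib expects an RGB image; OpenCV decodes frames as BGR
        faces = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if len(faces) > 0:
            r = faces[0]                     # take the first detected face
            crop = frame[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
            # save the face picture indexed by its frame number
            cv2.imwrite("%s/%06d.png" % (out_dir, frame_no), crop)
        frame_no += 1
    cap.release()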
2. Extract features from each face picture with a feature extractor to obtain the corresponding feature map.
In the embodiment of the invention, the feature extractor is implemented with an Xception network, which extracts a feature map from each face picture.
The feature extractor can take pictures of any size as input, but the inter-frame correlation classifier requires input of a fixed size, so an adaptive pooling layer is appended to the end of the feature extractor; it divides a feature map of arbitrary size into regions on a uniform grid and averages within each region, thereby producing a feature map of uniform scale.
The scale of the feature map is set to N × N × M, where N × N is the spatial size of the feature and M is the feature-vector dimension at each point of the feature space.
For example, N may be 10, and M may be 2048.
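A sketch of such an extractor follows, with assumptions the text does not fix: the Xception backbone is taken from the third-party timm package, and PyTorch's nn.AdaptiveAvgPool2d plays the role of the adaptive pooling layer that averages within uniform regions.

import torch
import torch.nn as nn
import timm

class FeatureExtractor(nn.Module):
    def __init__(self, n=10):
        super().__init__()
        # num_classes=0 and global_pool="" keep the raw convolutional features
        self.backbone = timm.create_model("xception", pretrained=True,
                                          num_classes=0, global_pool="")
        self.pool = nn.AdaptiveAvgPool2d((n, n))   # uniform-grid averaging

    def forward(self, x):              # x: (batch, 3, H, W), any H and W
        f = self.backbone(x)           # (batch, 2048, h, w)
        return self.pool(f)            # (batch, 2048, n, n): the N x N x M map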
3. Feed the feature maps of two consecutive frames into the inter-frame correlation classifier, which fuses the two feature maps with an attention mechanism and classifies them; the classification result is the probability that the two input frames are tampered.
The preferred embodiment of this step is as follows:
Firstly, the correlation matrix Cor between the two feature maps (denoted feature map A and feature map B) is computed from the similarity between every pair of feature vectors in the two maps: Cor = A × B^T.
The correlation matrices corresponding to feature maps A and B are then obtained by deforming Cor as follows: R_A = reshape(Cor, N × N × N^2), R_B = reshape(Cor^T, N × N × N^2), where reshape(X, SHAPE) denotes converting X to size SHAPE; here X is Cor or Cor^T, X has size N^2 × N^2, and SHAPE = N × N × N^2, with N × N the spatial size of the feature.
The principle of the above step is as follows. Suppose N = 10 and M = 2048, so that each feature map is a three-dimensional 10 × 10 × 2048 matrix, where 10 × 10 is the spatial size and 2048 is the feature-vector dimension. Treating the spatial size (10 × 10) as a single dimension, a feature map can be viewed as a (10 × 10) × 2048 = 100 × 2048 matrix, and the correlation matrix between the two feature maps then has shape (10 × 10) × (10 × 10) = 100 × 100, i.e. it is a two-dimensional matrix. For the subsequent computation, Cor must be deformed into a three-dimensional matrix: the 100 entries of its first dimension are regarded as two dimensions of 10 × 10, i.e. every 10 consecutive indices among the 100 correspond to one row of the spatial grid. After this deformation, the contents of the correlation matrices R_A and R_B are unchanged and keep their positions; only the dimensions are transposed and regrouped.
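A sketch of this computation for one pair of feature maps is given below; A and B are assumed to be flattened to (N^2, M) matrices whose rows are per-location feature vectors, and the similarity is taken to be the dot product implied by Cor = A × B^T.

import torch

def correlation_features(A, B, n):
    # A, B: (n*n, M) matrices of per-location feature vectors
    cor = A @ B.t()                    # Cor = A x B^T, shape (n*n, n*n)
    # deform the two-dimensional Cor into three dimensions: the first
    # dimension n*n is split into the two spatial dimensions n x n
    r_a = cor.reshape(n, n, n * n)     # R_A = reshape(Cor, N x N x N^2)
    r_b = cor.t().reshape(n, n, n * n) # R_B = reshape(Cor^T, N x N x N^2)
    return r_a, r_b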
Secondly, to obtain more discriminative features, R_A and R_B are fed into the attention module to generate the corresponding attention masks M_A and M_B, and then A_T = (M_A + 1) × A and B_T = (M_B + 1) × B are computed. A_T and B_T are then spliced together along the feature dimension into a weighted feature F, which is input to the final classifier; for example, F is a 10 × 10 × 4096 feature map.
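The weighting and splicing can be sketched as below, under two assumptions the text leaves open: the masks M_A and M_B share the shape of A and B, and the × in A_T = (M_A + 1) × A denotes element-wise multiplication.

def fuse(A, B, mask_a, mask_b):
    # the +1 keeps the original features and adds the attended part on top
    a_t = (mask_a + 1) * A             # A_T = (M_A + 1) x A
    b_t = (mask_b + 1) * B             # B_T = (M_B + 1) x B
    # splice along the feature dimension: two 10x10x2048 maps -> 10x10x4096
    return torch.cat([a_t, b_t], dim=-1)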
As shown in fig. 2, the attention module mainly comprises three convolutional layers connected in sequence, each using padding of 1 with a fill value of 0; each convolutional layer is followed by a batch normalization layer, and each batch normalization layer except the one after the last convolutional layer is followed by a ReLU activation layer. The output of the last convolutional layer, after passing through its batch normalization layer, is added to the input correlation matrix R, and the corresponding mask M is then obtained through a ReLU activation layer.
Illustratively, the convolution kernel sizes of the three convolutional layers are set to 1 × 1, 3 × 3 and 1 × 1 in this order. The first 1 × 1 convolutional layer has an input dimension of 2048 and an output dimension of 512; the middle 3 × 3 convolutional layer has input and output dimensions of 512; the final 1 × 1 convolutional layer has an input dimension of 512 and an output dimension of 2048. Its output, after the batch normalization layer, is added to the module input along the feature dimension (i.e. over the 2048 channels), and the attention mask is then obtained through a ReLU activation layer.
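A sketch of this module follows, with two assumptions: padding is chosen to preserve the spatial size (0 for the 1 × 1 convolutions, 1 for the 3 × 3 one) so that the residual addition is well defined, and "batch regularization" is read as batch normalization. The channel sizes are the example values above.

import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, channels=2048, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),           # 1x1 conv
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),  # 3x3 conv
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),           # 1x1 conv
            nn.BatchNorm2d(channels),  # no ReLU here: it follows the addition
        )

    def forward(self, r):              # r: (batch, channels, N, N)
        # add the module input back (residual), then ReLU yields the mask M
        return torch.relu(self.body(r) + r)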
As shown in fig. 3, the classifier comprises three convolutional layers connected in sequence and a fully connected layer at the end. After the feature map fusion result of the two frames is input, it passes through the three convolutional layers and then into the fully connected layer, whose output dimension is 1; the probability that the two input frames are tampered is then obtained through a sigmoid function.
Illustratively, the convolution kernel sizes of the three convolutional layers are set to 1 × 1, 3 × 3 and 3 × 3 in this order. The first 1 × 1 convolutional layer has an input dimension of 4096 and an output dimension of 512; the subsequent 3 × 3 convolutional layers have input and output dimensions of 512. Finally, the fully connected layer has an input dimension of 512 and an output dimension of 1.
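A sketch with these example dimensions is given below; because the text gives the fully connected layer an input dimension of 512 while the last convolution outputs a 512-channel map, a global average pool is assumed in between, which the text does not state.

import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, in_channels=4096, hidden=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(hidden, 1)  # input dimension 512, output 1

    def forward(self, f):               # f: (batch, 4096, N, N) fused feature
        h = self.convs(f).mean(dim=(2, 3))   # assumed global average pooling
        return torch.sigmoid(self.fc(h)).squeeze(1)  # tampered probability s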
In the embodiment of the invention, the feature extractor and the inter-frame correlation classifier together form a deep neural network; once the network is trained, it can automatically detect whether the face in a video has been tampered with. During training, the loss function is set as:
(The loss function formula is reproduced only as an image in the original document.)
where s is the probability that two frames of the input are tampered with.
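The formula itself survives only as an image; a standard binary cross-entropy over s, with label y = 1 for a tampered pair, is one plausible reading and is sketched here purely as an assumption.

import torch
import torch.nn.functional as F

def pair_loss(s, y):
    # s: predicted tampered probability in (0, 1); y: 1.0 if tampered, else 0.0
    return F.binary_cross_entropy(s, y)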
The invention provides two training modes (distinguished by whether the mean or the maximum is used); either one may be adopted:
the first method comprises the following steps: in the training process, two continuous frames are respectively used as input to calculate loss and are reversely transmitted; after training is finished, for a test video, after every two continuous frames are input, calculating the probability of tampering, finally obtaining K-1 probabilities of tampering, judging whether the test video comes from the tampered video according to the average value of the K-1 probabilities of tampering, and considering that the test video comes from the tampered video when the average value is larger than 50%, wherein K represents the number of frames of the test video.
The second mode: during training, pairs of consecutive frames are used as input and their tampered probabilities are computed; within a batch of training samples (the batch size can be set freely), the loss is computed on the maximum tampered probability and then back-propagated. After training, for a test video, the tampered probability is computed each time a pair of consecutive frames is input, finally yielding K-1 tampered probabilities; whether the test video comes from a tampered video is judged from the maximum of the tampered probabilities, the video being considered tampered when the maximum exceeds 50%.
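Both decision rules reduce to aggregating the K-1 pairwise probabilities and comparing with 50%; a sketch, where mode selects the mean-based or maximum-based variant described above:

def video_is_tampered(pair_probs, mode="mean"):
    # pair_probs: the K-1 tampered probabilities of consecutive frame pairs
    if mode == "mean":
        score = sum(pair_probs) / len(pair_probs)   # first training mode
    else:
        score = max(pair_probs)                     # second training mode
    return score > 0.5                # tampered when the score exceeds 50%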
The technical solution of the embodiment of the invention is based on a deep neural network and exploits both the content of each frame picture and the inter-frame relation between adjacent frames, thereby achieving a better effect. Moreover, detection is completed automatically, so the method is suitable for large-scale video platforms and social platforms.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB disk, or a removable hard disk) and includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the invention.
The above description is only a preferred embodiment of the invention, but the protection scope of the invention is not limited thereto. Any change or substitution that can easily be conceived by those skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for detecting tampered face video, characterized by comprising:
decoding the face video data into a group of consecutive frame images, cropping the face region of each frame image, and saving each face region as a face picture indexed by its frame number;
extracting features from each face picture with a feature extractor to obtain the corresponding feature map;
feeding the feature maps of two consecutive frames into an inter-frame correlation classifier, which fuses the two feature maps together with an attention mechanism and classifies them, the classification result being the probability that the two input frames are tampered;
wherein fusing the feature maps of the two frames together by the inter-frame correlation classifier with an attention mechanism comprises:
denoting the feature maps of the two consecutive frames as A and B, and computing the correlation matrix Cor between the two feature maps from the similarity between every pair of feature vectors in the two maps: Cor = A × B^T;
obtaining the correlation matrices corresponding to feature maps A and B by deforming Cor as follows: R_A = reshape(Cor, N × N × N^2), R_B = reshape(Cor^T, N × N × N^2), where reshape(X, SHAPE) denotes converting X to size SHAPE; here X is Cor or Cor^T, X has size N^2 × N^2, and SHAPE = N × N × N^2, with N × N the spatial size of the feature;
feeding R_A and R_B into the attention module to generate the corresponding attention masks M_A and M_B, computing A_T = (M_A + 1) × A and B_T = (M_B + 1) × B, and then splicing A_T and B_T together along the feature dimension.
2. The method for detecting tampered face video according to claim 1, wherein the face video data is decoded into a group of consecutive frame images with the widely used opencv or ffmpeg toolkits, and the face region of each frame image is cropped with the open-source Dlib library in Python; the face regions in different frame images may have the same or different sizes.
3. The method for detecting tampered face video according to claim 1, wherein extracting features from each face picture with the feature extractor to obtain the corresponding feature map comprises:
implementing the feature extractor with an Xception network;
appending an adaptive pooling layer to the end of the feature extractor, which divides a feature map of arbitrary size into regions on a uniform grid and averages within each region, thereby producing a feature map of uniform scale;
setting the scale of the feature map to N × N × M, where N × N is the spatial size of the feature and M is the feature-vector dimension at each point of the feature space.
4. The method for detecting tampered face video according to claim 1, wherein the attention module comprises three convolutional layers connected in sequence, each using padding of 1 with a fill value of 0; each convolutional layer is followed by a batch normalization layer, and each batch normalization layer except the one after the last convolutional layer is followed by a ReLU activation layer;
the output of the last convolutional layer, after passing through its batch normalization layer, is added to the input correlation matrix, and the corresponding mask is then obtained through a ReLU activation layer.
5. The method for detecting tampered face video according to claim 1, wherein the feature map fusion result of the two frames is classified by a classifier within the inter-frame correlation classifier; the classifier comprises three convolutional layers connected in sequence and a fully connected layer at the end; after the feature map fusion result of the two frames is input, it passes through the three convolutional layers and then into the fully connected layer, whose output dimension is 1, and the probability that the two input frames are tampered is then obtained through a sigmoid function.
6. The method for detecting tampered face video according to claim 1, wherein the feature extractor and the inter-frame correlation classifier form a deep neural network, and during training the loss function is:
(The loss function formula is reproduced only as an image in the original document.)
wherein s is the probability that the two input frames are tampered;
using any one of the following training modes:
the first mode: during training, each pair of consecutive frames is used as input, and the loss is computed and back-propagated; after training, the tampered probability is computed each time a pair of consecutive frames of the test video is input, finally yielding K-1 tampered probabilities, and whether the test video comes from a tampered video is judged from the mean of the K-1 tampered probabilities, where K denotes the number of frames of the test video;
the second mode: during training, pairs of consecutive frames are used as input and their tampered probabilities are computed; the loss is computed on the maximum tampered probability within a batch of training samples and then back-propagated; after training, for the test video, the tampered probability is computed each time a pair of consecutive frames is input, finally yielding K-1 tampered probabilities, and whether the test video comes from a tampered video is judged from the maximum of the tampered probabilities.
CN201911376257.6A 2019-12-27 2019-12-27 Method for detecting tampered face video Active CN111144314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911376257.6A CN111144314B (en) 2019-12-27 2019-12-27 Method for detecting tampered face video


Publications (2)

Publication Number Publication Date
CN111144314A (en) 2020-05-12
CN111144314B (en) 2020-09-18

Family

ID=70520954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911376257.6A Active CN111144314B (en) 2019-12-27 2019-12-27 Method for detecting tampered face video

Country Status (1)

Country Link
CN (1) CN111144314B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674195A (en) * 2020-05-13 2021-11-19 中国移动通信集团有限公司 Image detection method, device, equipment and storage medium
CN111783608B (en) * 2020-06-24 2024-03-19 南京烽火星空通信发展有限公司 Face-changing video detection method
CN111860414B (en) * 2020-07-29 2023-10-24 中国科学院深圳先进技术研究院 Method for detecting deep video based on multi-feature fusion
CN111986180B (en) * 2020-08-21 2021-07-06 中国科学技术大学 Face forged video detection method based on multi-correlation frame attention mechanism
CN112036356B (en) * 2020-09-09 2024-06-25 北京达佳互联信息技术有限公司 Video detection method, device, equipment and storage medium
CN112749686B (en) * 2021-01-29 2021-10-29 腾讯科技(深圳)有限公司 Image detection method, image detection device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567731A (en) * 2011-12-06 2012-07-11 北京航空航天大学 Extraction method for region of interest
CN103034993A (en) * 2012-10-30 2013-04-10 天津大学 Digital video transcode detection method
CN108765405A (en) * 2018-05-31 2018-11-06 北京瑞源智通科技有限公司 A kind of image authenticating method and system
CN109726733A (en) * 2018-11-19 2019-05-07 西安理工大学 A kind of video tamper detection method based on frame-to-frame correlation
CN109934116A (en) * 2019-02-19 2019-06-25 华南理工大学 A kind of standard faces generation method based on generation confrontation mechanism and attention mechanism
CN110457996A (en) * 2019-06-26 2019-11-15 广东外语外贸大学南国商学院 Moving Objects in Video Sequences based on VGG-11 convolutional neural networks distorts evidence collecting method
CN110503076A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 Video classification methods, device, equipment and medium based on artificial intelligence

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104954807B (en) * 2015-06-25 2018-02-23 西安理工大学 The video dubbing altering detecting method of resist geometric attackses
CN107527337B (en) * 2017-08-07 2019-07-09 杭州电子科技大学 A kind of the video object removal altering detecting method based on deep learning
US20190304102A1 (en) * 2018-03-30 2019-10-03 Qualcomm Incorporated Memory efficient blob based object classification in video analytics
US11580203B2 (en) * 2018-04-30 2023-02-14 Arizona Board Of Regents On Behalf Of Arizona State University Method and apparatus for authenticating a user of a computing device
CN110414350A (en) * 2019-06-26 2019-11-05 浙江大学 The face false-proof detection method of two-way convolutional neural networks based on attention model
CN110418129B (en) * 2019-07-19 2021-03-02 长沙理工大学 Digital video interframe tampering detection method and system
CN110414437A (en) * 2019-07-30 2019-11-05 上海交通大学 Face datection analysis method and system are distorted based on convolutional neural networks Model Fusion

Also Published As

Publication number Publication date
CN111144314A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144314B (en) Method for detecting tampered face video
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US20200364478A1 (en) Method and apparatus for liveness detection, device, and storage medium
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN107273458B (en) Depth model training method and device, and image retrieval method and device
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
Abdulreda et al. A landscape view of deepfake techniques and detection methods
CN111986180B (en) Face forged video detection method based on multi-correlation frame attention mechanism
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN115631112B (en) Building contour correction method and device based on deep learning
CN111325237A (en) Image identification method based on attention interaction mechanism
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
US9081800B2 (en) Object detection via visual search
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
CN114724218A (en) Video detection method, device, equipment and medium
Parde et al. Deep convolutional neural network features and the original image
Mahpod et al. Facial landmarks localization using cascaded neural networks
Chen et al. Salbinet360: Saliency prediction on 360 images with local-global bifurcated deep network
Jiang et al. Application of a fast RCNN based on upper and lower layers in face recognition
CN111814846A (en) Training method and recognition method of attribute recognition model and related equipment
CN115294326A (en) Method for extracting features based on target detection grouping residual error structure
Tang et al. An automatic fine-grained violence detection system for animation based on modified faster R-CNN
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN111191549A (en) Two-stage face anti-counterfeiting detection method
CN115937596A (en) Target detection method, training method and device of model thereof, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant