CN116994175A - Spatiotemporal joint detection method, apparatus and device for deepfake video - Google Patents

Spatiotemporal joint detection method, apparatus and device for deepfake video

Info

Publication number
CN116994175A
CN116994175A
Authority
CN
China
Prior art keywords
frame
face
image
video
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310865480.7A
Other languages
Chinese (zh)
Inventor
孟令中
李元昕
董乾
薛云志
李�瑞
马钰锡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202310865480.7A priority Critical patent/CN116994175A/en
Publication of CN116994175A publication Critical patent/CN116994175A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a spatiotemporal joint detection method, apparatus and device for deepfake video, and belongs to the technical field of deepfake detection. The method comprises the following steps: for each image frame of the video to be detected, performing face recognition processing on the image frame to obtain face feature information of the image frame; detecting intra-frame local image features based on the face feature information of each image frame to obtain an intra-frame detection result of the video to be detected; detecting the inter-frame facial motion pattern based on the face feature information of all image frames to obtain an inter-frame detection result of the video to be detected; and judging whether the video to be detected is a deepfake video according to the combination of the intra-frame detection result and the inter-frame detection result. The method adds temporal feature (facial motion pattern) detection on top of image feature detection; by combining these two different detection indices, it achieves multi-index detection of deepfake video and improves detection performance.

Description

Spatiotemporal joint detection method, apparatus and device for deepfake video
Technical Field
The disclosure relates to the technical field of deepfake detection, and in particular to a spatiotemporal joint detection method, apparatus and device for deepfake video.
Background
With the development of the internet and artificial intelligence, image generation technology based on deep learning is continuously being researched and applied; in particular, the deep-learning-based face-swapping technique known as deepfake has had a major impact on society.
Deepfake videos are mainly generated by replacing the face in an original video with a target face using an autoencoder or a network model such as a generative adversarial network (GAN). Users may make deepfake videos for entertainment (for example, face beautification or face swapping), but they may also do so maliciously to spread false news, forge evidence, and so on; the latter uses confuse audiences, mislead public judgment, and have a harmful effect on society.
Early deepfake videos left obvious traces on faces, such as unnatural lighting and blending edges. However, with the continuous development of deep learning, deep-synthesis content driven by large amounts of training data has become increasingly realistic: the quality of deepfake videos keeps improving, and real and fake are difficult to distinguish with the naked eye alone. Therefore, a new method for effectively detecting deepfake videos is needed to ensure the accuracy and efficiency of deepfake video discrimination.
Disclosure of Invention
In view of the above problems, the invention provides a spatiotemporal joint detection method, apparatus and device for deepfake video, which achieve multi-index detection of deepfake videos and improve detection performance.
According to a first aspect of the embodiments of the present disclosure, there is provided a spatiotemporal joint detection method for deepfake video, comprising:
performing face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame;
detecting intra-frame local image features based on the face feature information of each image frame to obtain an intra-frame detection result of the video to be detected;
detecting the inter-frame facial motion pattern based on the face feature information of all image frames to obtain an inter-frame detection result of the video to be detected;
and judging whether the video to be detected is a deepfake video according to the combination of the intra-frame detection result and the inter-frame detection result.
Optionally, performing face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame comprises:
for each image frame of the video to be detected, detecting whether the image frame contains a human face;
in the case that the image frame is detected to contain a face, cropping a face image from the image frame and extracting facial feature points from the face image; the face feature information comprises the face image and the facial feature points, the face image is the smallest image in the image frame that contains the facial features and the facial contour, and the facial feature points are used for locating the edges of the facial features and the facial contour in the face image.
Optionally, detecting intra-frame local image features based on the face feature information of each image frame and obtaining an intra-frame detection result of the video to be detected comprises:
performing image size correction and head direction correction on the face feature information of each image frame to obtain face correction feature information of the image frame; the image size correction is used for adjusting the face image in the face feature information to a target size, and the head direction correction is used for adjusting the face image in the face feature information to a target direction;
cropping a plurality of pieces of face local feature information of each image frame from the face correction feature information of the image frame, wherein the pieces of face local feature information are derived from different regions of the face image in the face correction feature information, and the face images in the pieces of face local feature information are of the same size;
performing feature extraction on each piece of face local feature information of each image frame to obtain a local detection result for each piece of face local feature information of each image frame;
and obtaining the intra-frame detection result based on the local detection results of all the face local feature information of all image frames.
Optionally, performing feature extraction on each piece of face local feature information of each image frame and obtaining a local detection result for each piece of face local feature information comprises:
for each piece of face local feature information, extracting a feature set from the face local feature information;
applying a channel-wise Saab transform (subspace approximation with adjusted bias) to the feature set based on the statistical correlation between pixel neighborhoods to remove redundant information and obtain dimension-reduced features;
and predicting the probability that each channel of the dimension-reduced features comes from a deepfake video to obtain the local detection result of the face local feature information.
Optionally, detecting the inter-frame facial motion pattern based on the face feature information of all image frames and obtaining the inter-frame detection result of the video to be detected comprises:
performing feature point calibration on the face feature information of each image frame to obtain face calibration feature information of the image frame;
constructing a first feature vector sequence from the face calibration feature information of each image frame, wherein each feature vector in the first feature vector sequence indicates the face calibration feature information of one image frame;
constructing a second feature vector sequence from the differences between the face calibration feature information of adjacent image frames, wherein each feature vector in the second feature vector sequence indicates the difference between the face calibration feature information of a pair of adjacent image frames;
and learning from and predicting on the first feature vector sequence and the second feature vector sequence with a dual-stream recurrent neural network, and fusing the prediction probabilities output by the two branches of the dual-stream recurrent neural network to obtain the inter-frame detection result.
Optionally, performing feature point calibration on the face feature information of each image frame to obtain the face calibration feature information of the image frame comprises:
downsampling the face image in the face feature information of each image frame several times to obtain a plurality of face images of different sizes, and constructing a pyramid representation based on the face images of different sizes;
performing the Lucas-Kanade operation on the pyramid representation with image blocks of the same size to predict, using frame-to-frame continuity, the coordinates of each facial feature point of the image frame in the next image frame;
and integrating, with a Kalman filter, the facial feature point coordinates predicted by the Lucas-Kanade operation and the facial feature point coordinates extracted from the face feature information, to obtain the face calibration feature information of the image frame.
Optionally, judging whether the video to be detected is a deepfake video according to the combination of the intra-frame detection result and the inter-frame detection result comprises:
in the case that the intra-frame detection result lies in a target result interval, performing weighted summation of the intra-frame detection result and the inter-frame detection result according to a target weight coefficient combination to obtain a spatiotemporal joint detection score;
if the spatiotemporal joint detection score is greater than a preset threshold, determining that the video to be detected is a deepfake video; and if the spatiotemporal joint detection score is less than the preset threshold, determining that the video to be detected is not a deepfake video.
Optionally, the target weight coefficient combination is obtained as follows:
testing the intra-frame local image feature detection branch and the inter-frame facial motion pattern detection branch separately on a test data set to obtain an intra-frame test result and an inter-frame test result;
calculating a first AUC value from the intra-frame test result and a second AUC value from the inter-frame test result;
and determining the values of the intra-frame weight coefficient and the inter-frame weight coefficient in the target weight coefficient combination based on the first AUC value and the second AUC value, where the sum of the intra-frame weight coefficient and the inter-frame weight coefficient equals 1 and their ratio equals the ratio of the first AUC value to the second AUC value.
According to a second aspect of the embodiments of the present disclosure, there is provided a spatiotemporal joint detection apparatus for deepfake video, the apparatus comprising:
a recognition module for performing face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame;
an intra-frame detection module for detecting intra-frame local image features based on the face feature information of each image frame and obtaining an intra-frame detection result of the video to be detected;
an inter-frame detection module for detecting the inter-frame facial motion pattern based on the face feature information of all image frames and obtaining an inter-frame detection result of the video to be detected;
and a judging module for judging whether the video to be detected is a deepfake video according to the combination of the intra-frame detection result and the inter-frame detection result.
According to a third aspect of the embodiments of the present disclosure, there is provided a computer device comprising a processor and a memory storing computer program instructions; when executing the computer program instructions, the processor implements the spatiotemporal joint detection method for deepfake video provided by the first aspect of the disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the spatiotemporal joint detection method for deepfake video provided by the first aspect of the disclosure.
The technical solutions provided by the embodiments of the present disclosure have at least the following beneficial effects:
The invention adds temporal feature (facial motion pattern) detection on top of image feature detection; by combining these two different detection indices, it achieves multi-index detection of deepfake video and improves detection performance.
For image feature detection, the invention provides a pipeline comprising size and direction correction, local image cropping, local probability prediction, and integrated classification, where the local probability prediction uses PixelHop++ units, the channel-wise Saab transform, XGBoost classifiers and the like, without any deep learning method. It can therefore effectively resist adversarial attacks against deepfake detection; moreover, because the learning mechanism of this non-deep-learning approach requires no back-propagation, the model parameters and size are lightweight and training complexity is effectively reduced.
In addition, even when the face barely moves across image frames, the extracted facial feature points exhibit obvious jitter, and this jitter noise severely interferes with temporal modeling, ultimately making the detection result inaccurate. Therefore, before inter-frame facial motion pattern detection is performed, the pyramidal Lucas-Kanade operation is used to calibrate the feature points, and a Kalman filter is then used to integrate the facial feature point coordinates predicted by the Lucas-Kanade operation with the actually extracted facial feature point coordinates, removing the jitter noise introduced during feature point calibration and thereby improving the detection accuracy for deepfake videos.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow diagram of deepfake video discrimination according to an exemplary embodiment.
Fig. 2 is a flow chart of a spatiotemporal joint detection method for deepfake video according to an exemplary embodiment.
Fig. 3 is a schematic diagram of face feature information according to an exemplary embodiment.
Fig. 4 is a feature extraction diagram for face local feature information according to an exemplary embodiment.
Fig. 5 is a schematic comparison before and after feature point calibration according to an exemplary embodiment.
Fig. 6 is a diagram illustrating learning and prediction of inter-frame facial motion patterns according to an exemplary embodiment.
Fig. 7 is a comparison of the ROC curves of the two discrimination methods on the test data set Celeb-DF-v2.
Fig. 8 is a comparison of the PRC curves of the two discrimination methods on the test data set Celeb-DF-v2.
Fig. 9 is a block diagram of a spatiotemporal joint detection apparatus for deepfake video according to an exemplary embodiment.
Fig. 10 is a block diagram of a computer device 1000 according to an exemplary embodiment.
Detailed Description
Exemplary embodiments will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the deepfake video discrimination scheme provided by the invention combines intra-frame local image feature detection and inter-frame facial motion pattern detection. Optionally, the intra-frame local image feature detection is implemented by an intra-frame detection model, and the inter-frame facial motion pattern detection is implemented by an inter-frame detection model.
As shown in fig. 2, the detection method provided by the invention comprises the following steps (steps 1 to 4).
Step 1: and carrying out face recognition processing on each image frame of the video to be detected to obtain face characteristic information of the image frame.
In the invention, at least two image frames of the video to be detected contain human faces. When each image frame of the video to be detected is subjected to face recognition processing, whether the image frame contains a face or not is detected, and if the image frame contains the face, face feature information is further extracted. Based on this, the above step 1 includes: for each image frame, cutting out a face image from the image frame if the image frame is detected to contain a face; extracting facial feature points from the face image; the face feature information of the image frame includes the face image and the face feature points.
Wherein the cropped face image is the smallest image of the facial features and facial contours of the face contained in the image frame, and the extracted facial feature points are being used to locate the edges of the facial features and facial contours in the face image. The present invention is not limited to the number and shape of facial feature points, and alternatively 68 dots are extracted from a face image as facial feature points. For example, a face image cut out from an image frame, and facial feature points extracted from the face image, as shown in fig. 3.
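By way of illustration only, the following Python sketch of step 1 uses dlib's frontal face detector and its 68-point landmark predictor; the invention does not prescribe a particular detector or landmark model, so this library choice and the model file name are assumptions.

```python
# Illustrative sketch of step 1 (assumed libraries: OpenCV + dlib 68-point model).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The model file name below is an assumption; any 68-point predictor would do.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_feature_info(frame_bgr):
    """Return (face_image, landmarks) for the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None
    rect = faces[0]
    shape = predictor(gray, rect)
    landmarks = [(p.x, p.y) for p in shape.parts()]   # 68 (x, y) feature points
    top, left = max(rect.top(), 0), max(rect.left(), 0)
    face_image = frame_bgr[top:rect.bottom(), left:rect.right()]  # smallest face crop
    return face_image, landmarks
```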
In addition, to ensure the generality of deepfake video discrimination, the method does not limit the duration, frame rate, resolution, actions, or other properties of the video to be detected.
Step 2: and detecting the local image characteristics in the frames based on the face characteristic information of each image frame, and acquiring an intra-frame detection result of the video to be detected.
In one example, as shown in fig. 1, the above step 2 includes the following sub-steps (steps 2.1 to 2.4). Optionally, the step 2 may be implemented by an intra-frame detection model, where the intra-frame detection model includes a correction module, an interception module, a local prediction module, and an integration module, the step 2.1 may be implemented by the correction module, the step 2.2 may be implemented by the interception module, the step 2.3 may be implemented by the local prediction module, and the step 2.4 may be implemented by the integration module.
Step 2.1: and carrying out image size correction and head direction correction on the face characteristic information of each image frame to obtain the face correction characteristic information of the image frame.
The image size correction is used for adjusting the face images in the face feature information of all the image frames to a target size, such as 128×128, so that the face sizes of the face images in the face feature information of all the image frames are kept consistent; the head direction correction is used for adjusting the face images in the face feature information of all the image frames to the target direction so that the head postures of the face images in the face feature information of all the image frames are kept consistent. Based on this, the face correction feature information includes the corrected face image and the face feature points.
Step 2.2: and intercepting a plurality of face local characteristic information of each image frame from the face correction characteristic information of the image frame.
The plurality of face local characteristic information of each image frame is derived from different areas of the face image in the face correction characteristic information of the image frame, such as left eye, right eye, mouth and the like; the sizes of face images in the face partial feature information of each image frame are uniform, for example, 32×32.
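As a hedged sketch of steps 2.1-2.2, the code below resizes the face crop to the 128×128 target size and cuts 32×32 patches around three landmark-defined regions; the region names and landmark index ranges follow the common 68-point convention and are assumptions rather than values fixed by the invention, and head direction correction is omitted for brevity.

```python
# Sketch of size correction and local-region cropping (assumed regions/indices).
import numpy as np
import cv2

TARGET_SIZE = 128
PATCH_SIZE = 32
REGIONS = {"left_eye": range(36, 42), "right_eye": range(42, 48), "mouth": range(48, 68)}

def crop_local_patches(face_image, landmarks):
    h, w = face_image.shape[:2]
    scale_x, scale_y = TARGET_SIZE / w, TARGET_SIZE / h
    face = cv2.resize(face_image, (TARGET_SIZE, TARGET_SIZE))
    pts = np.array([(x * scale_x, y * scale_y) for x, y in landmarks])
    patches = {}
    for name, idx in REGIONS.items():
        cx, cy = pts[list(idx)].mean(axis=0).astype(int)    # region centre
        x0 = int(np.clip(cx - PATCH_SIZE // 2, 0, TARGET_SIZE - PATCH_SIZE))
        y0 = int(np.clip(cy - PATCH_SIZE // 2, 0, TARGET_SIZE - PATCH_SIZE))
        patches[name] = face[y0:y0 + PATCH_SIZE, x0:x0 + PATCH_SIZE]
    return patches
```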
Step 2.3: and carrying out feature extraction on each piece of face local feature information of each image frame to obtain a local detection result of each piece of face local feature information of each image frame.
Optionally, step 2.3 is implemented by a local prediction module, as shown in fig. 4, where the local prediction module includes a feature extraction unit, a spatial dimension reduction unit, and a classification unit. The feature extraction unit is used for extracting a feature set from the local feature information of the face; the space dimension reduction unit is used for applying channel level subspace approximate transformation to the feature set based on the statistical correlation between the pixel neighborhoods so as to remove redundant information in the feature set and obtain dimension reduction features; the classification unit is used for predicting the probability that each channel of the dimension reduction feature comes from the deep fake video so as to obtain the local detection result of the local feature information of the face.
The feature extraction unit takes each face local feature information of the image frame as input and takes a feature set corresponding to each face local feature information as output. Alternatively, as shown in fig. 4, the feature extraction unit is composed of three cascaded pixelhop++ units, and the present invention is not limited to the block size and step size of each unit, and is illustrated in fig. 4 with a block size of 3×3 and a step size of 1. The feature extraction dimensions of each Hop (Hop) in the cascaded PixelHop++ units are different, so that features can be extracted from multiple dimensions respectively, and the accuracy and the perfection of feature extraction are improved.
The output dimension of the feature set obtained by the feature extraction unit is still not concise enough, so that the invention further uses the space dimension reduction unit to remove redundant information in the extracted features. In the processing process of the space dimension reduction unit, the statistical correlation among pixel neighborhoods is utilized, channel level subspace approximation transformation (Saab) is applied to a feature set, the feature set is firstly decomposed into local mean values and frequency components, and then principal component analysis is applied to the frequency components to deduce a kernel. Wherein each core represents a particular frequency selective filter; the core with the larger eigenvalue extracts the lower frequency component, while the core with the smaller eigenvalue extracts the higher frequency component; the dimension reduction is achieved by discarding high frequency components with very small eigenvalues.
After redundancy removal, ki1 channels can be obtained from the i-th hop (i=1, 2, 3), and a binary classifier learning semantic is pre-trained for each channel for subsequent output based on the spatial dimension reduction unit, predicting the probability of each channel from deep fake video. Alternatively, XGBoost (eXtreme Gradient Boosting, extreme gradient lifting) is employed as a classifier for each channel and its maximum depth is set to 1 to prevent classification over-fitting.
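The sketch below illustrates only the per-channel classification just described; the PixelHop++/Saab feature extraction is abstracted away as a list of per-channel feature matrices. The XGBoost stumps (max_depth=1) follow the text, while the number of estimators is an assumed value.

```python
# Per-channel XGBoost stumps for the local prediction module (illustrative).
import numpy as np
from xgboost import XGBClassifier

def train_channel_classifiers(channel_features, labels):
    """channel_features: list of (n_samples, dim) arrays, one per retained channel.
    labels: 1 = deepfake, 0 = real."""
    models = []
    for feats in channel_features:
        clf = XGBClassifier(max_depth=1, n_estimators=100, eval_metric="logloss")
        clf.fit(feats, labels)
        models.append(clf)
    return models

def predict_patch_probability(models, channel_features):
    """Average the per-channel fake probabilities for one local patch."""
    probs = [m.predict_proba(f)[:, 1] for m, f in zip(models, channel_features)]
    return np.mean(probs, axis=0)
```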
Step 2.4: and acquiring an intra-frame detection result of the video to be detected based on the local detection results of all the face local feature information of all the image frames.
The local detection result of each face local feature information comprises the probability that a plurality of channels of the face local feature information come from the depth fake video, and the probability that the face local feature information comes from the depth fake video can be obtained by combining the prediction probabilities (such as average or weighted summation) of all the channels; for each image frame, the probability of the image frame from the depth counterfeit video can be obtained by combining the prediction probabilities (such as average or weighted summation) of all the face local feature information of the image frame.
For the video to be detected, the prediction probabilities of all image frames need to be combined to obtain the probability that the video to be detected is a depth fake video, namely an intra-frame detection result. In order to control the input dimension of each module and reduce the calculated amount, the video to be detected can be segmented (for example, every six image frames are one segment), and the probability of the segment from the depth falsified video can be obtained by combining the prediction probabilities (for example, average or weighted summation and the like) of all the image frames in each segment; and then, combining the prediction probabilities of all the fragments to obtain an intra-frame detection result of the video to be detected.
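A minimal sketch of this aggregation path, assuming simple mean aggregation at each level and six-frame segments as in the example above:

```python
# Patch probabilities -> frame probabilities -> segment means -> intra-frame score.
import numpy as np

def intra_frame_score(patch_probs_per_frame, segment_len=6):
    """patch_probs_per_frame: one entry per frame, each an array of patch-level probabilities."""
    frame_probs = np.array([np.mean(p) for p in patch_probs_per_frame])
    n_seg = max(1, len(frame_probs) // segment_len)
    segments = np.array_split(frame_probs, n_seg)
    return float(np.mean([s.mean() for s in segments]))
```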
It should be understood that when an image frame contains multiple faces, the probability that each face comes from a deepfake video may be calculated from the prediction probabilities of all face local feature information belonging to that face, and the probability that the image frame comes from a deepfake video may then be calculated from the prediction probabilities of all faces in the frame.
Step 3: and detecting an inter-frame face movement mode based on the face characteristic information of all the image frames, and obtaining an inter-frame detection result of the video to be detected.
In one example, as shown in fig. 1, the above step 3 includes the following sub-steps (steps 3.1 to 3.4). Alternatively, the above step 3 may be implemented by an inter-frame detection model, where the inter-frame detection model includes a calibration module, a sequence building module, and a prediction module, the below step 3.1 may be implemented by the calibration module, the below steps 3.2 and 3.3 may be implemented by the sequence building module, and the below step 3.4 may be implemented by the prediction module.
Step 3.1: and carrying out feature point calibration on the face feature information of each image frame to obtain the face calibration feature information of the image frame.
Since the detection of the inter-frame facial motion mode needs to use continuous facial feature point coordinates, the requirement on precision is high, however, the invention discovers that even under the condition that the human face hardly moves in each image frame, the extracted facial feature points have obvious jitter, and the jitter noise can seriously interfere with time modeling, so that the detection result is not accurate enough. Therefore, the present invention performs feature point calibration before performing detection of an inter-frame face movement pattern to improve the accuracy of the detection result.
Alternatively, the present invention uses the Lucas-Kanade optical flow algorithm to calculate the movement of facial feature points from frame to frame, and uses the frame-to-frame continuity to predict the corresponding position of each facial feature point in the next image frame. However, since the Lucas-Kanade operation is sensitive to the size of the face image block, a pyramid Lucas-Kanade operation may be introduced, where the face image is first downsampled several times to half the original image to obtain a plurality of face images of different sizes, and a pyramid representation is constructed based on the plurality of face images of different sizes, and then the pyramid representation is performed with the same size image block to predict the coordinates of each face feature point in the face feature information of the image frame in the next image frame of the image frame using the frame-to-frame continuity. In addition, since the Lucas-Kanade operation also introduces noise, the facial feature point coordinates predicted by the Lucas-Kanade operation and the facial feature point coordinates extracted from the facial feature information are integrated further based on a kalman filter to obtain the facial calibration feature information of the image frame. Based on this, the face calibration feature information includes the face image and the calibrated facial feature points.
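The following sketch illustrates this calibration with OpenCV's pyramidal Lucas-Kanade optical flow and a simple per-point Kalman filter; the window size, pyramid depth, noise covariances and the way the two coordinate sources are fused are simplified assumptions rather than the invention's exact parameters.

```python
# Pyramidal LK prediction of next-frame landmarks plus per-point Kalman smoothing.
import cv2
import numpy as np

def lk_predict(prev_gray, next_gray, prev_pts):
    pts = np.asarray(prev_pts, dtype=np.float32).reshape(-1, 1, 2)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None, winSize=(15, 15), maxLevel=3)
    return next_pts.reshape(-1, 2), status.ravel()

def make_point_filter():
    kf = cv2.KalmanFilter(4, 2)                 # state: [x, y, vx, vy]; measurement: [x, y]
    kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

def calibrate_point(kf, lk_xy, detected_xy):
    # Simplified fusion: the filter predicts smooth motion, and the LK-predicted and
    # freshly detected coordinates are averaged into a single measurement.
    kf.predict()
    meas = (np.array(lk_xy, np.float32) + np.array(detected_xy, np.float32)) / 2.0
    est = kf.correct(meas.reshape(2, 1))
    return float(est[0, 0]), float(est[1, 0])
```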
For example, as shown in fig. 5, the dots in the left face images of fig. 5(1) and fig. 5(2) are the facial feature points in the face feature information before feature point calibration, and the dots in the right face images are the facial feature points in the face calibration feature information after feature point calibration. It can be seen that the positions of the facial feature points change before and after calibration.
Step 3.2: and constructing a first feature vector sequence according to the face calibration feature information of each image frame.
Each feature vector in the first sequence of feature vectors indicates face alignment feature information for one image frame. Illustratively, assuming that n image frames exist in the video to be detected, m facial feature points (m is a positive integer, such as 68) exist in the face calibration feature information of each image frame, and the kth facial feature point is denoted as Z k =[x k ,y k ] T Face calibration feature information alpha for the ith image frame i Can be expressed asFirst eigenvector sequence a= [ alpha ] 1 ,…,α n ] T
Step 3.3: and constructing a second feature vector sequence by using the difference value of the face calibration feature information of the adjacent image frames.
Each feature vector in the second sequence of feature vectors indicates a difference in face alignment feature information for a pair of adjacent image frames. Illustratively, assuming that n image frames exist in the video to be detected, m facial feature points (n is a positive integer, such as 68) exist in the face calibration feature information of each image frame, and the kth facial feature point is denoted as Z k =[x k ,y k ] T Face calibration feature information alpha for the ith image frame i Can be expressed asDifference beta of face calibration feature information of i-th image frame and i+1-th image frame i Can be expressed as Second eigenvector sequence b= [ beta ] 1 ,…,β n-1 ] T
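A small sketch of steps 3.2-3.3, building A and B directly from the calibrated landmark array as defined above:

```python
# Build the per-frame landmark sequence A and the frame-difference sequence B.
import numpy as np

def build_sequences(calibrated_landmarks):
    """calibrated_landmarks: array of shape (n, m, 2) with (x, y) per feature point."""
    pts = np.asarray(calibrated_landmarks, dtype=np.float32)
    n = pts.shape[0]
    A = pts.reshape(n, -1)          # alpha_i: flattened landmarks of frame i
    B = A[1:] - A[:-1]              # beta_i: change between adjacent frames
    return A, B
```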
Step 3.4: and learning and predicting the first characteristic vector sequence and the second characteristic vector sequence through the double-flow cyclic neural network, and fusing the prediction probabilities output by two branches in the double-flow cyclic neural network to obtain an inter-frame detection result.
The first sequence of feature vectors may indicate a facial motion pattern of a face, and the second sequence of feature vectors may indicate a facial motion difference pattern of the face. As shown in fig. 6, the prediction module may adopt a dual-flow cyclic neural network (Recurrent Neural Networks, RNN) to learn and predict the input first feature vector sequence and the second feature vector sequence respectively, so as to obtain the probability that the video to be detected is a deep fake video, and then fuse the prediction probabilities (such as average or weighted summation) of the two branches, so as to obtain the inter-frame detection result of the video to be detected. Optionally, in order to control the input dimensions of each module and reduce the calculation amount, the video to be detected may be segmented (for example, each six image frames are a segment), and the inter-frame detection result of each segment (for example, average or weighted summation) is combined to obtain the inter-frame detection result of the whole video to be detected.
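A hedged sketch of the dual-stream recurrent network of step 3.4; the invention only specifies an RNN, so the GRU cells, hidden size and averaging fusion below are assumptions.

```python
# Dual-stream recurrent branches over A and B, fused by averaging (PyTorch sketch).
import torch
import torch.nn as nn

class DualStreamRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.branch_a = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.branch_b = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.head_a = nn.Linear(hidden_dim, 1)
        self.head_b = nn.Linear(hidden_dim, 1)

    def forward(self, seq_a, seq_b):
        # seq_a: (batch, n, input_dim); seq_b: (batch, n-1, input_dim)
        _, h_a = self.branch_a(seq_a)
        _, h_b = self.branch_b(seq_b)
        p_a = torch.sigmoid(self.head_a(h_a[-1]))
        p_b = torch.sigmoid(self.head_b(h_b[-1]))
        return (p_a + p_b) / 2          # fused inter-frame fake probability
```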
Step 4: and judging whether the video to be detected is a depth fake video or not according to the combination of the intra-frame detection result and the inter-frame detection result.
In one example, the above step 4 includes the following sub-steps (steps 4.1 to 4.2).
Step 4.1: and under the condition that the intra-frame detection result is positioned in the target result interval, carrying out weighted summation processing on the intra-frame detection result and the inter-frame detection result according to the target weight coefficient combination to obtain a space-time combined detection score.
In order to more effectively combine the intra-frame detection result and the inter-frame detection result, weight distribution can be performed based on the detection performance of the intra-frame local image feature detection branch (such as an intra-frame detection model) and the inter-frame face motion mode detection branch (such as an inter-frame detection model) to obtain a target weight coefficient combination; carrying out weighted summation processing on the intra-frame detection result and the inter-frame detection result based on the target weight coefficient combination to obtain a space-time combination detection score, wherein the space-time combination detection score indicates the probability that the video to be detected is a depth fake video from the angle of fusion of the image characteristics and the time sequence characteristics; and finally, judging whether the video to be detected is a depth falsified video or not according to the space-time combined detection score.
Optionally, the target weight coefficient combination is obtained as follows: the intra-frame local image feature detection branch and the inter-frame facial motion pattern detection branch are each tested on a test data set to obtain an intra-frame test result and an inter-frame test result; a first AUC value is calculated from the intra-frame test result and a second AUC value from the inter-frame test result; and the values of the intra-frame weight coefficient and the inter-frame weight coefficient in the target weight coefficient combination are determined based on the first and second AUC values, such that the two coefficients sum to 1 and their ratio equals the ratio of the first AUC value to the second AUC value.
Experiments showed that the AUC of the intra-frame local image feature detection branch exceeds 90% on every test data set, but its output prediction probability is more likely to be in error when it falls in the interval 0.4 to 0.6. Therefore, to obtain the spatiotemporal joint detection score efficiently and accurately, the intra-frame and inter-frame detection results are fused according to the target weight coefficient combination only when the intra-frame detection result lies in the target result interval; when it does not, the intra-frame detection result is taken directly as the spatiotemporal joint detection score. The target result interval may be 0.4 to 0.6 or another interval (e.g. 0.3 to 0.7), which the invention does not limit.
Step 4.2: if the time-space combination detection score is larger than a preset threshold value, determining that the video to be detected is a depth falsified video; and if the time-space combination detection score is smaller than a preset threshold value, determining that the video to be detected is not the depth falsified video.
The preset threshold value may be set to 0.5 generally, where the video to be detected is determined to be a depth falsified video when the temporal-spatial combination detection score is greater than 0.5, and the video to be detected is determined to be a true video when the temporal-spatial combination detection score is less than 0.5. Of course, other settings of the preset threshold are possible, for example, in the case of strict authentication requirements for the deep fake video, the preset threshold may be set to 0.7, etc. In addition, in practical application, under the condition that the time-space combination detection score is equal to a preset threshold value, whether the video to be detected is a depth falsified video or not can be flexibly set.
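The sketch below ties step 4 together with the AUC-based weight derivation: the branch weights are proportional to the branch AUC values and sum to 1, fusion is applied only when the intra-frame score falls in the ambiguous interval, and the 0.5 threshold follows the example above. The helper names are illustrative.

```python
# AUC-proportional weights and the final spatiotemporal joint decision.
from sklearn.metrics import roc_auc_score

def derive_weights(y_true, intra_scores, inter_scores):
    auc_intra = roc_auc_score(y_true, intra_scores)
    auc_inter = roc_auc_score(y_true, inter_scores)
    w_intra = auc_intra / (auc_intra + auc_inter)    # weights sum to 1, ratio = AUC ratio
    return w_intra, 1.0 - w_intra

def is_deepfake(intra, inter, w_intra, w_inter,
                ambiguous=(0.4, 0.6), threshold=0.5):
    if ambiguous[0] <= intra <= ambiguous[1]:
        score = w_intra * intra + w_inter * inter    # weighted fusion in the ambiguous interval
    else:
        score = intra                                # otherwise keep the intra-frame result
    return score > threshold
```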
In summary, to ensure the accuracy and efficiency of deepfake video discrimination, the invention provides a spatiotemporal joint discrimination method for deepfake video that adds temporal feature (facial motion pattern) detection on top of image feature detection; by combining these two different detection indices, it achieves multi-index detection of deepfake videos and improves detection performance.
For image feature detection, the invention provides a pipeline comprising size and direction correction, local image cropping, local probability prediction, and integrated classification, where the local probability prediction uses PixelHop++ units, the channel-wise Saab transform, XGBoost classifiers and the like, without any deep learning method, and can therefore effectively resist adversarial attacks against deepfake detection; in addition, because the learning mechanism of this non-deep-learning approach requires no back-propagation, the model parameters and size are lightweight and training complexity is effectively reduced.
The test data set Celeb-DF-v2 was used to test and compare the deepfake video discrimination method of the related art and that of the invention. The related art detects deepfake videos based only on image features, while the invention detects them based on the fusion of image features and temporal features.
First, the ROC curves of the two detection methods on the test data set Celeb-DF-v2 are shown in fig. 7. The ROC curves show that the discrimination method provided by the invention improves the AUC value by 0.0086.
Second, the PRC curves of the two detection methods on the test data set Celeb-DF-v2 are shown in fig. 8. The PRC curves show that the discrimination method provided by the invention increases the area under the PRC curve by 0.0079 compared with the related art.
In addition, as shown in Table 1, metrics such as accuracy show that the discrimination method provided by the invention improves substantially on the related art, with accuracy and precision improved by 13.39% and 15.13% respectively.
Table 1. Comparison of generalization metrics of the two discrimination methods
The test data set Celeb-DF-v2 contains 59 real videos and 59 deepfake videos. As shown in Table 2, the number of real videos falsely identified as deepfake videos is 7 with the discrimination method of the related art, while with the discrimination method of the invention this number drops to 2 and the false alarm rate drops from 11.86% to 3.39%; the number of correctly identified deepfake videos is 47 with the method of the related art, while with the method of the invention it rises to 57 and the fake detection rate rises from 79.66% to 96.61%.
Table 2. Comparison of accuracy metrics of the two discrimination methods
In addition, for temporal feature detection, the invention found that even when the face barely moves across image frames, the extracted facial feature points exhibit obvious jitter; this jitter noise severely interferes with temporal modeling and ultimately makes the detection result inaccurate. Therefore, before inter-frame facial motion pattern detection is performed, the pyramidal Lucas-Kanade operation is used to calibrate the feature points, and a Kalman filter is then used to integrate the facial feature point coordinates predicted by the Lucas-Kanade operation with the actually extracted facial feature point coordinates, removing the jitter noise introduced during feature point calibration and thereby improving the detection accuracy for deepfake videos.
Exemplary apparatus
Fig. 9 shows a spatiotemporal joint detection apparatus for deepfake video according to an exemplary embodiment. Referring to fig. 9, the apparatus comprises: a recognition module 910, an intra-frame detection module 920, an inter-frame detection module 930, and a judging module 940.
The recognition module 910 is configured to perform face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame;
the intra-frame detection module 920 is configured to detect intra-frame local image features based on the face feature information of each image frame and obtain an intra-frame detection result of the video to be detected;
the inter-frame detection module 930 is configured to detect the inter-frame facial motion pattern based on the face feature information of all image frames and obtain an inter-frame detection result of the video to be detected;
and the judging module 940 is configured to judge whether the video to be detected is a deepfake video according to the combination of the intra-frame detection result and the inter-frame detection result.
In one embodiment of the present disclosure, the recognition module 910 is further configured to: for each image frame of the video to be detected, detect whether the image frame contains a human face; in the case that the image frame is detected to contain a face, crop a face image from the image frame and extract facial feature points from the face image; the face feature information comprises the face image and the facial feature points, the face image is the smallest image in the image frame that contains the facial features and the facial contour, and the facial feature points are used for locating the edges of the facial features and the facial contour in the face image.
In one embodiment of the present disclosure, the intra-frame detection module 920 is further configured to: perform image size correction and head direction correction on the face feature information of each image frame to obtain face correction feature information of the image frame, the image size correction being used for adjusting the face image in the face feature information to a target size and the head direction correction being used for adjusting the face image in the face feature information to a target direction; crop a plurality of pieces of face local feature information of each image frame from the face correction feature information of the image frame, wherein the pieces of face local feature information are derived from different regions of the face image in the face correction feature information and the face images in the pieces of face local feature information are of the same size; perform feature extraction on each piece of face local feature information of each image frame to obtain a local detection result for each piece of face local feature information; and obtain the intra-frame detection result based on the local detection results of all the face local feature information of all image frames.
In one embodiment of the present disclosure, the intra-frame detection module 920 is further configured to: for each piece of face local feature information, extract a feature set from the face local feature information; apply a channel-wise Saab transform to the feature set based on the statistical correlation between pixel neighborhoods to remove redundant information and obtain dimension-reduced features; and predict the probability that each channel of the dimension-reduced features comes from a deepfake video to obtain the local detection result of the face local feature information.
In one embodiment of the present disclosure, the inter-frame detection module 930 is further configured to: perform feature point calibration on the face feature information of each image frame to obtain face calibration feature information of the image frame; construct a first feature vector sequence from the face calibration feature information of each image frame, wherein each feature vector in the first feature vector sequence indicates the face calibration feature information of one image frame; construct a second feature vector sequence from the differences between the face calibration feature information of adjacent image frames, wherein each feature vector in the second feature vector sequence indicates the difference between the face calibration feature information of a pair of adjacent image frames; and learn from and predict on the first and second feature vector sequences with a dual-stream recurrent neural network, fusing the prediction probabilities output by the two branches to obtain the inter-frame detection result.
In one embodiment of the present disclosure, the inter-frame detection module 930 is further configured to: downsample the face image in the face feature information of each image frame several times to obtain a plurality of face images of different sizes, and construct a pyramid representation based on the face images of different sizes; perform the Lucas-Kanade operation on the pyramid representation with image blocks of the same size to predict, using frame-to-frame continuity, the coordinates of each facial feature point of the image frame in the next image frame; and integrate, with a Kalman filter, the facial feature point coordinates predicted by the Lucas-Kanade operation and the facial feature point coordinates extracted from the face feature information, to obtain the face calibration feature information of the image frame.
In one embodiment of the present disclosure, the judging module 940 is further configured to: in the case that the intra-frame detection result lies in a target result interval, perform weighted summation of the intra-frame detection result and the inter-frame detection result according to a target weight coefficient combination to obtain a spatiotemporal joint detection score; if the spatiotemporal joint detection score is greater than a preset threshold, determine that the video to be detected is a deepfake video; and if the spatiotemporal joint detection score is less than the preset threshold, determine that the video to be detected is not a deepfake video.
In one embodiment of the present disclosure, the judging module 940 is further configured to: test the intra-frame local image feature detection branch and the inter-frame facial motion pattern detection branch separately on a test data set to obtain an intra-frame test result and an inter-frame test result; calculate a first AUC value from the intra-frame test result and a second AUC value from the inter-frame test result; and determine the values of the intra-frame weight coefficient and the inter-frame weight coefficient in the target weight coefficient combination based on the first AUC value and the second AUC value, where the sum of the intra-frame weight coefficient and the inter-frame weight coefficient equals 1 and their ratio equals the ratio of the first AUC value to the second AUC value.
The exemplary apparatus is an apparatus embodiment corresponding to the above exemplary method, and specific operations of the respective modules may be understood with reference to the description of the method embodiment, which is not repeated herein.
Exemplary electronic device
Fig. 10 is a block diagram of a computer device 1000, shown in accordance with an exemplary embodiment. The computer device 1000 may be a terminal, notebook, desktop, server, computer cluster, or other type of electronic device.
Referring to fig. 10, a computer device 1000 may include at least one processor 1010 and memory 1020. The processor 1010 may execute instructions stored in the memory 1020. The processor 1010 is communicatively coupled to the memory 1020 via a data bus. In addition to memory 1020, processor 1010 may be communicatively coupled with input devices 1030, output devices 1040, and communication devices 1050 via a data bus.
The processor 1010 may be any conventional processor, including, for example, a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), a system on chip (SoC), an application-specific integrated circuit (ASIC), or a combination thereof.
The memory 1020 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In the embodiments of the present disclosure, the memory 1020 stores executable instructions, and the processor 1010 may read the executable instructions from the memory 1020 and execute them to implement all or part of the steps of the spatiotemporal joint detection method for deepfake video in the exemplary embodiments described above.
Exemplary computer-readable storage Medium
In addition to the methods and apparatus described above, exemplary embodiments of the present disclosure include a computer program product or a computer-readable storage medium storing the computer program product. The computer program instructions are embodied in a computer program instruction that is executable by a processor to implement all or part of the steps described in the above exemplary embodiments.
The computer program product may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages, as well as scripting languages (e.g., python). The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
A computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the readable storage medium include: a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, an optical disk, or any suitable combination of the foregoing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and embodiments are to be regarded as exemplary only; the disclosure is not limited to the exact construction described above, and various modifications and changes may be made without departing from its scope.

Claims (10)

1. A spatiotemporal joint detection method for depth fake video, the method comprising:
performing face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame;
detecting intra-frame local image features based on the face feature information of each image frame, and acquiring an intra-frame detection result of the video to be detected;
detecting an inter-frame facial motion pattern based on the face feature information of all the image frames, and acquiring an inter-frame detection result of the video to be detected;
and judging whether the video to be detected is a depth fake video or not according to the combination of the intra-frame detection result and the inter-frame detection result.
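By way of illustration only, the following Python sketch shows how the four steps of claim 1 could be chained; the three callables are hypothetical placeholders for the per-step procedures elaborated in the dependent claims, and the weights and threshold are assumed inputs.

```python
def detect_depth_fake_video(frames, recognize_face, detect_intra, detect_inter,
                            w_intra, w_inter, threshold=0.5):
    """Hypothetical driver for the claimed method (placeholder callables)."""
    # Step 1: face recognition processing on each image frame of the video.
    face_infos = [recognize_face(frame) for frame in frames]
    # Step 2: intra-frame local image feature detection.
    intra_result = detect_intra(face_infos)
    # Step 3: inter-frame facial motion pattern detection.
    inter_result = detect_inter(face_infos)
    # Step 4: combine both results to judge whether the video is a depth fake.
    score = w_intra * intra_result + w_inter * inter_result
    return score > threshold
```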
2. The method according to claim 1, wherein the performing face recognition processing on each image frame of the video to be detected to obtain the face feature information of the image frame includes:
for each image frame of the video to be detected, detecting whether the image frame contains a human face or not;
cutting a face image from the image frame and extracting facial feature points from the face image in a case where the image frame is detected to contain a face; wherein the face feature information comprises the face image and the facial feature points, the face image is the smallest image in the image frame that contains the facial features and the face contour, and the facial feature points are used for locating the edges of the facial features and the face contour in the face image.
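By way of illustration only, the following Python sketch corresponds to claim 2; dlib and its 68-point landmark model are assumed stand-ins, since the claim does not prescribe any particular face detector or feature point extractor.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Illustrative 68-point landmark model; any feature point extractor would do.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_feature_info(frame_bgr):
    """Return (face_image, facial_feature_points) for one image frame,
    or None when no face is detected in the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    box = faces[0]
    # Smallest crop that contains the facial features and the face contour.
    face_image = frame_bgr[box.top():box.bottom(), box.left():box.right()]
    shape = predictor(gray, box)
    facial_feature_points = [(p.x, p.y) for p in shape.parts()]
    return face_image, facial_feature_points
```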
3. The method according to claim 1, wherein the detecting intra-frame local image features based on the face feature information of each image frame, and obtaining an intra-frame detection result of the video to be detected, includes:
performing image size correction and head direction correction on the face characteristic information of each image frame to obtain the face correction characteristic information of the image frame; the image size correction is used for adjusting the face image in the face feature information to a target size, and the head direction correction is used for adjusting the face image in the face feature information to a target direction;
cropping a plurality of pieces of face local feature information of each image frame from the face correction feature information of the image frame, wherein the pieces of face local feature information of the image frame are taken from different regions of the face image in the face correction feature information of the image frame, and the face images in the pieces of face local feature information of the image frame are of the same size;
performing feature extraction on each piece of face local feature information of each image frame to obtain a local detection result of each piece of face local feature information of each image frame;
and acquiring the intra-frame detection result based on the local detection result of all the face local feature information of all the image frames.
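By way of illustration only, the following Python sketch covers the region-cropping and aggregation steps of claim 3; the image size and head direction corrections that precede them (e.g. a landmark-driven affine warp to a fixed size and pose) are assumed to have been applied already, and the region layout and patch size are arbitrary choices.

```python
import cv2
import numpy as np

def crop_local_regions(aligned_face, patch=96):
    """Crop equally sized local regions from a size- and pose-corrected face
    image; the four region centres below are illustrative only."""
    h, w = aligned_face.shape[:2]
    centers = [(w // 4, h // 3), (3 * w // 4, h // 3),   # around the two eyes
               (w // 2, h // 2), (w // 2, 3 * h // 4)]   # nose and mouth areas
    half = patch // 2
    regions = []
    for cx, cy in centers:
        x0, y0 = max(cx - half, 0), max(cy - half, 0)
        crop = aligned_face[y0:y0 + patch, x0:x0 + patch]
        regions.append(cv2.resize(crop, (patch, patch)))  # keep sizes consistent
    return regions

def intra_frame_result(local_results):
    """Aggregate the local detection results of all regions of all frames;
    a plain mean stands in for whatever aggregation an implementation uses."""
    return float(np.mean(local_results))
```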
4. The method according to claim 3, wherein the performing feature extraction on each piece of face local feature information of each image frame to obtain the local detection result of each piece of face local feature information of each image frame includes:
for each piece of face local feature information, extracting a feature set from the face local feature information;
based on statistical correlation among pixel neighborhoods, channel-level subspace approximate transformation is applied to the feature set to remove redundant information in the feature set, so that dimension reduction features are obtained;
predicting, for each channel of the dimension-reduced feature, the probability of originating from a depth fake video, to obtain the local detection result of the face local feature information.
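By way of illustration only, the following Python sketch mirrors claim 4 with PCA standing in for the channel-level subspace approximate transformation and a generic pre-trained classifier standing in for the per-channel probability predictor; both substitutions are assumptions, not elements of the claim.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_reduce(feature_set, n_components=32):
    """PCA as a stand-in for the subspace approximation: redundancy in the
    extracted feature set is removed by keeping the leading components.
    `feature_set` is assumed to be an (n_channels, n_dims) array taken from
    one piece of face local feature information."""
    return PCA(n_components=n_components).fit_transform(feature_set)

def local_detection_result(reduced, channel_predictor):
    """`channel_predictor` is a hypothetical pre-trained classifier exposing a
    scikit-learn-style predict_proba(); each channel of the dimension-reduced
    feature receives a probability of originating from a depth fake video,
    and the average is returned as the local detection result."""
    probs = channel_predictor.predict_proba(reduced)[:, 1]
    return float(np.mean(probs))
```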
5. The method according to claim 1, wherein the detecting an inter-frame facial motion pattern based on the face feature information of all the image frames, and acquiring the inter-frame detection result of the video to be detected, includes:
performing feature point calibration on the face feature information of each image frame to obtain the face calibration feature information of the image frame;
constructing a first feature vector sequence according to the face calibration feature information of each image frame, wherein each feature vector in the first feature vector sequence indicates the face calibration feature information of one image frame;
constructing a second feature vector sequence by using the difference value of the face calibration feature information of adjacent image frames, wherein each feature vector in the second feature vector sequence indicates the difference value of the face calibration feature information of a pair of adjacent image frames;
and learning and predicting the first characteristic vector sequence and the second characteristic vector sequence through a double-flow cyclic neural network, and fusing the prediction probabilities output by two branches in the double-flow cyclic neural network to obtain the inter-frame detection result.
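By way of illustration only, the following PyTorch sketch shows one possible shape of the dual-stream recurrent network in claim 5; GRU cells, the layer sizes, and averaging as the fusion rule are assumptions rather than elements of the claim.

```python
import torch
import torch.nn as nn

class DualStreamRNN(nn.Module):
    """One GRU branch reads the first feature vector sequence (per-frame
    calibrated landmark vectors); the other reads the second sequence
    (differences of adjacent frames); the branch probabilities are averaged."""

    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.rnn_abs = nn.GRU(feat_dim, hidden, batch_first=True)
        self.rnn_diff = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head_abs = nn.Linear(hidden, 1)
        self.head_diff = nn.Linear(hidden, 1)

    def forward(self, seq_abs):
        # Second sequence: differences of the calibration features of
        # adjacent image frames.
        seq_diff = seq_abs[:, 1:] - seq_abs[:, :-1]
        _, h_abs = self.rnn_abs(seq_abs)
        _, h_diff = self.rnn_diff(seq_diff)
        p_abs = torch.sigmoid(self.head_abs(h_abs[-1]))
        p_diff = torch.sigmoid(self.head_diff(h_diff[-1]))
        return (p_abs + p_diff) / 2  # fused inter-frame detection result
```

With 68 facial feature points flattened to feat_dim = 136, an input of shape (batch, num_frames, 136) yields one fused inter-frame probability per video clip.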
6. The method according to claim 5, wherein the performing feature point calibration on the face feature information of each image frame to obtain the face calibration feature information of the image frame includes:
performing multiple downsampling on face images in the face feature information of each image frame to obtain a plurality of face images with different sizes, and constructing pyramid representations based on the face images with different sizes;
performing a Lucas-Kanade operation on the pyramid representation with image blocks of the same size, so as to predict, using inter-frame continuity, the coordinates of each facial feature point in the face feature information of the image frame in the next image frame;
and integrating, based on a Kalman filter, the facial feature point coordinates predicted by the Lucas-Kanade operation and the facial feature point coordinates extracted from the face feature information, to obtain the face calibration feature information of the image frame.
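By way of illustration only, the following Python sketch pairs OpenCV's pyramidal Lucas-Kanade tracker with per-point Kalman filters in the spirit of claim 6; the window size, pyramid depth, filter configuration, and the simple averaging of the two observations are all assumptions.

```python
import cv2
import numpy as np

def track_landmarks_lk(prev_gray, next_gray, landmarks, win=21, levels=3):
    """Predict each facial feature point's position in the next image frame
    with pyramidal Lucas-Kanade optical flow, using equally sized image
    blocks (winSize) at every pyramid level."""
    pts = np.float32(landmarks).reshape(-1, 1, 2)
    pred, _status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None, winSize=(win, win), maxLevel=levels)
    return pred.reshape(-1, 2)

def calibrate_landmarks(lk_predicted, detected, kalman_filters):
    """Blend the LK-predicted coordinates with the coordinates extracted from
    the face feature information; `kalman_filters` is a hypothetical list of
    pre-configured cv2.KalmanFilter objects (one 4-state filter per point
    with a 2-dimensional position measurement)."""
    calibrated = []
    for (px, py), (mx, my), kf in zip(lk_predicted, detected, kalman_filters):
        kf.predict()
        # A simple average of the two observations stands in for a proper
        # two-measurement update.
        measurement = np.float32([[(px + mx) / 2.0], [(py + my) / 2.0]])
        est = kf.correct(measurement)
        calibrated.append((float(est[0, 0]), float(est[1, 0])))
    return calibrated
```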
7. The method according to any one of claims 1 to 6, wherein the determining whether the video to be detected is a depth fake video by combining the intra-frame detection result and the inter-frame detection result includes:
under the condition that the intra-frame detection result is located in a target result interval, carrying out weighted summation processing on the intra-frame detection result and the inter-frame detection result according to a target weight coefficient combination to obtain a space-time combination detection score;
if the space-time combination detection score is larger than a preset threshold value, determining that the video to be detected is a depth fake video; and if the space-time combination detection score is smaller than the preset threshold value, determining that the video to be detected is not a depth fake video.
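By way of illustration only, the following Python sketch combines the two branch results as described in claim 7; the target result interval, the threshold, and the fallback used when the intra-frame result lies outside the interval are assumed values, since the claim does not specify them.

```python
def spatiotemporal_decision(intra_result, inter_result, w_intra, w_inter,
                            threshold=0.5, target_interval=(0.3, 0.7)):
    """Weighted fusion is applied only when the intra-frame detection result
    falls inside the target result interval."""
    low, high = target_interval
    if low <= intra_result <= high:
        score = w_intra * intra_result + w_inter * inter_result
    else:
        # Assumed fallback: outside the interval, the intra-frame result
        # alone is compared against the threshold.
        score = intra_result
    return score > threshold  # True => judged to be a depth fake video
```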
8. The method of claim 7, wherein the obtaining of the target weight coefficient combination comprises:
respectively testing an intra-frame local image feature detection branch and an inter-frame facial motion pattern detection branch by adopting a test data set to obtain an intra-frame test result and an inter-frame test result;
calculating a first AUC value according to the intra-frame test result, and calculating a second AUC value according to the inter-frame test result;
determining values of an intra-frame weight coefficient and an inter-frame weight coefficient in the target weight coefficient combination based on the first AUC value and the second AUC value; the sum of the values of the intra-frame weight coefficient and the inter-frame weight coefficient is equal to 1, and the ratio of the values of the intra-frame weight coefficient and the inter-frame weight coefficient is equal to the ratio of the first AUC value to the second AUC value.
9. A spatiotemporal joint detection device for depth fake video, the device comprising:
the identification module is used for performing face recognition processing on each image frame of the video to be detected to obtain face feature information of the image frame;
the intra-frame detection module is used for detecting intra-frame local image features based on the face feature information of each image frame and acquiring an intra-frame detection result of the video to be detected;
the inter-frame detection module is used for detecting an inter-frame facial motion pattern based on the face feature information of all the image frames and acquiring an inter-frame detection result of the video to be detected;
and the judging module is used for judging whether the video to be detected is a depth fake video or not by combining the intra-frame detection result and the inter-frame detection result.
10. A computer device, the computer device comprising: a processor and a memory storing computer program instructions; wherein the processor, when executing the computer program instructions, implements the spatiotemporal joint detection method for depth fake video according to any one of claims 1 to 8.
CN202310865480.7A 2023-07-14 2023-07-14 Space-time combination detection method, device and equipment for depth fake video Pending CN116994175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310865480.7A CN116994175A (en) 2023-07-14 2023-07-14 Space-time combination detection method, device and equipment for depth fake video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310865480.7A CN116994175A (en) 2023-07-14 2023-07-14 Space-time combination detection method, device and equipment for depth fake video

Publications (1)

Publication Number Publication Date
CN116994175A true CN116994175A (en) 2023-11-03

Family

ID=88522416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310865480.7A Pending CN116994175A (en) 2023-07-14 2023-07-14 Space-time combination detection method, device and equipment for depth fake video

Country Status (1)

Country Link
CN (1) CN116994175A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690061A (en) * 2023-12-27 2024-03-12 中邮消费金融有限公司 Depth fake video detection method, device, equipment and storage medium
CN117690061B (en) * 2023-12-27 2024-05-17 中邮消费金融有限公司 Depth fake video detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhu et al. Dual path multi-scale fusion networks with attention for crowd counting
CN111611873B (en) Face replacement detection method and device, electronic equipment and computer storage medium
CN111444881A (en) Fake face video detection method and device
CN110853033B (en) Video detection method and device based on inter-frame similarity
CN111160313B (en) Face representation attack detection method based on LBP-VAE anomaly detection model
CN113344475B (en) Transformer bushing defect identification method and system based on sequence modal decomposition
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN116994175A (en) Space-time combination detection method, device and equipment for depth fake video
CN114663957A (en) Face detection method, and training method and device of face detection model
CN112084887A (en) Attention mechanism-based self-adaptive video classification method and system
CN114677611B (en) Data identification method, storage medium and device
CN113688804A (en) Multi-angle video-based action identification method and related equipment
CN114399824A (en) Multi-angle side face correction method and device, computer equipment and medium
CN116152542A (en) Training method, device, equipment and storage medium for image classification model
CN114694209A (en) Video processing method and device, electronic equipment and computer storage medium
CN112183422A (en) Human face living body detection method and device based on space-time characteristics, electronic equipment and storage medium
Huo et al. Domain adaptive crowd counting via dynamic scale aggregation network
CN112598722B (en) Image stereo matching method and system based on deformable convolution network
CN118298194B (en) Stripe image processing method, device and equipment for camera optical communication
CN112347893B (en) Model training method and device for video behavior recognition and computer equipment
CN113076833B (en) Training method of age identification model, face age identification method and related device
Javed et al. Faceswap Deepfakes Detection using Novel Multi-directional Hexadecimal Feature Descriptor
Madi et al. CNN-LPQ: convolutional neural network combined to local phase quantization based approach for face anti-spoofing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination