CN114724218A - Video detection method, device, equipment and medium - Google Patents

Video detection method, device, equipment and medium

Info

Publication number
CN114724218A
CN114724218A
Authority
CN
China
Prior art keywords
image
video
attention
regions
face
Prior art date
Legal status
Pending
Application number
CN202210369060.5A
Other languages
Chinese (zh)
Inventor
郝艳妮
马先钦
王璋盛
王一刚
曹家
罗引
王磊
Current Assignee
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd filed Critical Beijing Zhongke Wenge Technology Co ltd
Priority to CN202210369060.5A
Publication of CN114724218A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The present disclosure relates to a video detection method, apparatus, device, and medium. The video detection method comprises the following steps: acquiring an image sequence to be detected, wherein the image sequence comprises at least two video frames in the same video; for each image in the image sequence, performing nonlinear transformation processing on the facial features of the image to obtain attention features of a plurality of regions of the face corresponding to the image; constructing a time sequence relation feature among the plurality of regions of the face corresponding to the image sequence based on the attention features of the plurality of regions of the face corresponding to each image; and calculating, based on the time sequence relation feature, the probability that the video is a video with a forged face. According to the embodiments of the present disclosure, the probability calculation result is more accurate and the generalization capability is stronger, which further improves the accuracy of forged face video detection.

Description

Video detection method, device, equipment and medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video detection method, apparatus, device, and medium.
Background
A forged face video is a video in which faces in the video content, such as human faces or animal faces, have been tampered with by a deepfake algorithm. Such videos can be highly deceptive, so how to accurately detect a forged face video is a technical problem that urgently needs to be solved.
Disclosure of Invention
In order to solve the technical problem, the present disclosure provides a video detection method, apparatus, device and medium.
In a first aspect, the present disclosure provides a video detection method, including:
acquiring an image sequence to be detected, wherein the image sequence comprises at least two video frames in the same video;
for each image in the image sequence, performing nonlinear transformation processing on the facial features of the image to obtain attention features of a plurality of regions of the face corresponding to the image;
constructing a time sequence relation feature among the plurality of regions of the face corresponding to the image sequence based on the attention features of the plurality of regions of the face corresponding to each image;
and calculating, based on the time sequence relation feature, the probability that the video is a video with a forged face.
In a second aspect, the present disclosure provides a video detection apparatus, comprising:
an image acquisition module, used for acquiring an image sequence to be detected, the image sequence comprising at least two video frames in the same video;
a nonlinear transformation module, used for performing, for each image in the image sequence, nonlinear transformation processing on the facial features of the image to obtain attention features of a plurality of regions of the face corresponding to the image;
a feature construction module, used for constructing a time sequence relation feature among the plurality of regions of the face corresponding to the image sequence based on the attention features of the plurality of regions of the face corresponding to each image;
and a probability calculation module, used for calculating, based on the time sequence relation feature, the probability that the video is a video with a forged face.
In a third aspect, the present disclosure provides a video detection apparatus, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the video detection method of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the video detection method of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the video detection method, the video detection device, the video detection equipment and the video detection medium can perform nonlinear transformation processing on the facial features of each image in the image sequence after the image sequence containing at least two video frames in the same video to be detected is obtained, obtain the attention features of a plurality of regions of the face corresponding to each image, construct the time sequence relation features among the plurality of regions of the face corresponding to the image sequence based on the attention features of the plurality of regions of the face corresponding to each image, further calculate the probability that the video is the video with the forged face based on the time sequence relation features, wherein the probability can be used for judging whether the video is the video with the forged face or not, and because the probability can be calculated based on the time sequence relation features among the plurality of regions of the face corresponding to the image sequence in the embodiment of the present disclosure, the time sequence relation among the plurality of regions of the face can be introduced when the probability is calculated, furthermore, the time sequence inconsistency of the face in the video is detected, so that the accuracy of the probability calculation result is higher, the generalization capability is stronger, and the accuracy of the fake face video detection is further improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flowchart of a video detection method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an image sequence acquisition method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of an attention feature obtaining method according to an embodiment of the disclosure;
fig. 4 is a schematic flowchart of another attention feature obtaining method according to an embodiment of the disclosure;
fig. 5 is a schematic flow chart of a method for acquiring a time series relationship characteristic according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of another time series relationship characteristic obtaining method according to the embodiment of the present disclosure;
fig. 7 is a schematic diagram of a video detection model provided in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a three-dimensional attention neural network model provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a timing diagram neural network model provided by an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a video detection device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
At present, detection methods for forged face videos mainly fall into two categories:
the image-based detection method. The image-based detection method may include a traditional image forensics method, a deep learning method, an image internal pattern analysis method, and a fake image fingerprint method.
Specifically, the image-based detection method is a method for detecting a forged face video by using a video frame as a granularity, and artifacts, such as image noise and uneven texture, generated by a depth forging algorithm are generally used as important clues for detecting a forged face to realize the detection of the forged face video. Although the image-based detection method achieves better detection performance by means of an advanced convolutional neural network architecture, the depth counterfeiting algorithm is continuously improved, and the image-based detection method has the defect of weaker generalization performance, so that the accuracy of the image-based detection method in detecting the counterfeit face video generated by the novel depth counterfeiting algorithm is poorer.
(2) Video-based detection methods. Video-based detection methods may include methods that use the timing consistency of the video as a clue.
Specifically, a video-based detection method detects a forged face video at the granularity of the whole video. Because a deepfake technique generally tampers with a face video by processing it frame by frame, and this processing introduces discontinuities between video frames, video-based detection methods generally use the timing inconsistency introduced by the deepfake algorithm as an important clue for detecting a forged face.
The consistency may include multi-modal consistency and video frame consistency, among others.
Multi-modal consistency can be used to judge whether the facial and lip motion modality of the video is consistent with the speech audio modality, thereby detecting a forged face video.
Although methods that detect forged face videos based on video frame consistency improve the generalization performance of forged face video detection and avoid the limitations of methods based on multi-modal consistency, the applicant has found that such methods still have the following problems: (1) the prior art uses a two-dimensional attention mechanism to obtain local region features from a face image; (2) in the prior art, after the local region features are obtained, they are either used directly for classification, or only the inter-region relationship within a single image is modeled and used for classification. Therefore, methods that detect forged face videos based on video consistency still suffer from low accuracy.
In summary, both categories of detection methods for forged face videos suffer from low accuracy, so how to accurately detect a forged face video is a technical problem that urgently needs to be solved.
In view of the foregoing problems, embodiments of the present disclosure provide a video detection method, apparatus, device, and medium. The video detection method will be described first with reference to specific embodiments.
In embodiments of the present disclosure, the video detection method may be performed by a computing device. The computing device may include, but is not limited to, an electronic device, a server, and the like. The electronic device may be a device with a computing function such as a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a wearable electronic device, an all-in-one machine, or a smart home device, and may also be a virtual machine or a device simulated by a simulator. The server may be an independent server or a cluster of multiple servers, and may include a server deployed locally or a server deployed in the cloud.
Fig. 1 is a schematic flowchart of a video detection method according to an embodiment of the present disclosure.
As shown in fig. 1, the video detection method may include the following steps.
S110, an image sequence to be detected is obtained, and the image sequence comprises at least two video frames in the same video.
In embodiments of the present disclosure, a computing device may acquire a sequence of images to be detected that includes at least two video frames of the same video.
The image sequence includes at least two video frames in the same video; the at least two video frames may be adjacent or non-adjacent video frames in the video. The sequence order of the image sequence, that is, the arrangement order of the at least two video frames in the image sequence, may be determined according to the playing time of each video frame in the video, from earliest to latest.
Further, the video to which the image sequence belongs may be a video that needs to detect whether a face in the video content is tampered.
In some embodiments, the at least two video frames in the sequence of images may be video frames obtained by globally framing a video.
In other embodiments, the at least two video frames in the image sequence may be video frames obtained by piecewise framing the video.
And S120, for each image in the image sequence, carrying out nonlinear transformation processing on the facial features of the image to obtain the attention features of a plurality of regions of the face corresponding to the image.
In the embodiment of the present disclosure, after the image sequence to be detected is acquired, the computing device may perform nonlinear transformation processing on the facial feature of each image in the image sequence to obtain the attention features of multiple regions of the face corresponding to each image.
Specifically, after the image sequence to be detected is acquired, the computing device may, for each image in the image sequence, extract the facial features of the image and then perform an attention-based fusion calculation on those facial features, thereby implementing the nonlinear transformation processing of the facial features and obtaining the attention features of the plurality of regions of the face corresponding to the image.
The face in the image may be a face of any object, such as a human face, an animal face, or an avatar face, without limitation.
In some embodiments, the facial features of the image may be understood as features for characterizing the face in the image, for example, the facial features may include features of specific parts of the face, features of the outline of the face, features of the expression of the face, and the like, without limitation.
Alternatively, the facial feature may be a two-dimensional facial feature, or may be a three-dimensional facial feature, which is not limited herein.
In some embodiments, the face of each object may be pre-partitioned into multiple regions as desired. The attention features of a plurality of regions of the face corresponding to the image can be understood as features for characterizing the degree of correlation between each region in the face and the face in the image.
Alternatively, the attention feature may be a two-dimensional attention feature, or may be a three-dimensional attention feature, which is not limited herein.
For example, after the image sequence to be detected is acquired, the computing device may, for each image in the image sequence, extract two-dimensional facial features of the image, for example with a two-dimensional facial feature extractor, and then perform a fusion calculation based on a two-dimensional attention mechanism, such as a two-dimensional convolution operation, on those features, thereby implementing the nonlinear transformation processing and obtaining the two-dimensional attention features of a plurality of regions of the face corresponding to the image.
As another example, after the image sequence to be detected is acquired, the computing device may, for each image in the image sequence, extract three-dimensional facial features of the image, for example with a three-dimensional facial feature extraction model, and then perform a fusion calculation based on a three-dimensional attention mechanism, such as a three-dimensional convolution operation implemented by a three-dimensional attention neural network model, on those features, thereby implementing the nonlinear transformation processing and obtaining the three-dimensional attention features of a plurality of regions of the face corresponding to the image.
S130, constructing a time sequence relation characteristic among a plurality of face regions corresponding to the image sequence based on the attention characteristics of the plurality of face regions corresponding to the images.
In the embodiment of the present disclosure, after obtaining the attention features of the multiple regions of the face in each image, the computing device may perform a time-series relationship construction process based on the whole image sequence on the attention features of the multiple regions of the face in each image, so as to obtain the time-series relationship features between the multiple regions of the face corresponding to the image sequence.
The time sequence relation characteristic among a plurality of regions of the face corresponding to the image sequence can be understood as a characteristic for representing the change situation of the attention characteristic of each region in the face in the time dimension.
For example, if the face in the image may be divided into four regions, i.e., an eyebrow region, an eye region, a mouth region, and a nose region, in advance, the time-sequence relationship between the multiple regions of the face corresponding to the image sequence may be a change of the image in the eyebrow region, the eye region, the mouth region, and the nose region at a time point corresponding to different video frames of the image sequence.
In some embodiments, the computing device may construct, according to the attention features of the multiple regions of the face corresponding to each image, graph structure data of the multiple regions of the face corresponding to each image, and then perform, in a graph convolution operation manner, time series relationship construction processing on the graph structure data of the multiple regions of the face corresponding to each image, to obtain a time series relationship feature between the multiple regions of the face corresponding to the image sequence.
In other embodiments, the computing device may construct vector data of the plurality of regions of the face corresponding to each image according to the attention features of those regions, for example by concatenating, in a specified region order, the feature vectors corresponding to the respective face regions of each image. After obtaining the vector data, the computing device may perform time sequence relation construction processing on the vector data of the plurality of regions of the face corresponding to each image by using a time-domain (temporal) convolutional neural network, thereby obtaining the time sequence relation features among the plurality of regions of the face corresponding to the image sequence, as sketched below.
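For illustration only, the following is a minimal PyTorch sketch of this alternative embodiment: per-region attention feature vectors are concatenated per frame and modeled with a one-dimensional temporal convolution. The tensor shapes, layer sizes, and pooling choice are assumptions, not values prescribed by the disclosure.

```python
import torch
import torch.nn as nn

class TemporalConvRelation(nn.Module):
    def __init__(self, num_regions=8, region_dim=256, hidden_dim=512):
        super().__init__()
        in_dim = num_regions * region_dim          # concatenated region vectors per frame
        self.tcn = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, region_feats):
        # region_feats: (batch, T, num_regions, region_dim)
        b, t, m, d = region_feats.shape
        x = region_feats.reshape(b, t, m * d).transpose(1, 2)   # (batch, m*d, T)
        return self.tcn(x).mean(dim=-1)                          # temporal relation feature

feats = torch.randn(2, 16, 8, 256)                # (batch, T, regions, dim), example shape
relation = TemporalConvRelation()(feats)          # (batch, hidden_dim)
```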
And S140, calculating the probability that the video is the video with the forged face based on the time sequence relation characteristics.
In the embodiment of the disclosure, after obtaining the time sequence relation features among the plurality of regions of the face corresponding to the image sequence, the computing device may perform classification detection on those features to obtain the probability that the video to be detected is a forged face video.
Alternatively, the computing device may input the time sequence relation features among the plurality of regions of the face corresponding to the image sequence into a pre-trained classifier for detecting whether the video to which the features belong is a forged face video, and obtain, from the classifier output, the probability that the video is a forged face video.
Further, the computing device may compare the probability with a preset probability threshold, and if the probability is greater than or equal to the preset probability threshold, determine that the video to be detected to which the image sequence belongs is a video of a forged face; and if the probability is smaller than a preset probability threshold value, determining that the video to be detected to which the image sequence belongs is not the video with the forged face.
It should be noted that the preset probability threshold may be a probability value preset according to needs and used for characterizing the video as a fake face, for example, the preset probability threshold may be 0.5 or 0.8, and is not limited herein.
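For illustration, a minimal sketch of this decision rule follows; the function name is hypothetical and the default threshold of 0.5 is only one of the example values mentioned above.

```python
def is_forged(probability: float, threshold: float = 0.5) -> bool:
    """Return True if the video is judged to contain a forged face."""
    return probability >= threshold
```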
In the embodiment of the disclosure, after an image sequence to be detected, which includes at least two video frames in the same video, is acquired, nonlinear transformation processing can be performed on the facial features of each image in the image sequence to obtain the attention features of a plurality of regions of the face corresponding to each image. A time sequence relation feature among the plurality of regions of the face corresponding to the image sequence is then constructed based on those attention features, and the probability that the video is a forged face video is calculated based on the time sequence relation feature. This probability can be used to determine whether the video is a forged face video, and because processing the time sequence relation feature can detect the timing inconsistency of the face in the video, the calculated probability is more accurate, the generalization capability is stronger, and the accuracy of forged face video detection is further improved.
The method for acquiring the image sequence to be detected by the computing device is explained in detail below.
In some embodiments of the present disclosure, the at least two video frames in the image sequence may be video frames obtained by piecewise framing the video.
In these embodiments, the computing device may piecewise frame the video into a sequence of images. Specifically, the detailed description will be made with reference to fig. 2.
Fig. 2 is a schematic flowchart of an image sequence acquisition method according to an embodiment of the present disclosure. As shown in fig. 2, the image sequence acquisition method may include the following steps.
S210, dividing the video into a plurality of video segments.
In the embodiment of the present disclosure, the computing device may perform segmentation processing on the video to be detected to divide the video into a plurality of video segments.
Specifically, the computing device may segment the video according to a preset segmentation mode to obtain a plurality of video segments.
The preset segmentation mode can be a mode which is preset according to needs and is used for segmenting the video according to user requirements. For example, the preset segmentation mode may be to divide the video into a certain number of video segments in equal length, the preset segmentation mode may also be to divide the video into a certain length of video segments, and the preset segmentation mode may also be to divide the video into a plurality of video segments randomly, which is not limited herein.
In some embodiments, the computing device may extract each video frame in the video using an Open Source Computer Vision Library (OpenCV) tool, then crop the face image in each video frame using a Multi-task Cascaded Convolutional Networks (MTCNN) face detection model and store each cropped face image as a picture, then segment the stored face images of consecutive video frames by equally dividing the video into a certain number of video segments, and take the resulting groups of segmented face images as the plurality of video segments.
In other embodiments, the computing device may extract each video frame in the video using an OpenCV tool, store each extracted video frame in a form of a picture, further perform segmentation processing on the stored continuous video frames in a segmentation manner that the video is equally divided into a certain number of video segments, and further take a plurality of groups of segmented video frames as a plurality of video segments.
And S220, extracting the key video frame of each video segment according to a preset frame extraction mode.
In the embodiment of the present disclosure, after obtaining a plurality of video segments, the computing device may perform frame extraction processing on pictures in each video segment, and use the extracted pictures as key video frames of the video segment.
Here, a key video frame may be understood as a video frame for characterizing a video segment to which the key video frame belongs.
Specifically, the preset frame extracting mode may be a mode preset according to needs and used for extracting a frame from a video according to user requirements. For example, the preset frame extraction manner may be to extract key video frames at certain intervals, or the preset frame extraction manner may be to randomly extract a certain number of key video frames, which is not limited herein.
And S230, sequencing the key video frames according to the playing time sequence to obtain an image sequence.
In the embodiment of the present disclosure, after obtaining the key video frames of each video segment, the computing device may sort the key video frames according to the playing time sequence to obtain the image sequence.
Therefore, in the embodiment of the disclosure, the computing device may obtain the image sequence by segmenting and frame-extracting, so that each video frame in the image sequence can cover each time period of the video, and further, the detection of the fake face video can cover the whole video, and the accuracy of the fake face video detection is further improved.
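For illustration only, the following is a minimal sketch of the segment-and-sample strategy of S210 to S230, assuming OpenCV for frame extraction; the face cropping with an MTCNN detector mentioned above is omitted for brevity, and the segment count and frames-per-segment values are illustrative assumptions rather than values prescribed by the disclosure.

```python
import cv2

def build_image_sequence(video_path, num_segments=8, frames_per_segment=2):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    seg_len = max(len(frames) // num_segments, 1)
    sequence = []
    for s in range(num_segments):
        segment = frames[s * seg_len:(s + 1) * seg_len]
        if not segment:
            break
        step = max(len(segment) // frames_per_segment, 1)
        sequence.extend(segment[::step][:frames_per_segment])  # key frames at fixed intervals
    return sequence  # frames are already ordered by playing time
```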
In other embodiments of the present disclosure, the at least two video frames in the image sequence may be video frames obtained by performing overall frame decimation on a video.
In some embodiments, the computing device may extract each video frame in the video using an Open Source Computer Vision Library (OpenCV) tool, then crop the face image in each video frame using a Multi-task Cascaded Convolutional Networks (MTCNN) face detection model, and store each cropped face image as a picture, thereby obtaining a plurality of pictures corresponding to the video.
In other embodiments, the computing device may extract each video frame in the video using the OpenCV tool, and store each extracted video frame in the form of a picture, so as to obtain a plurality of pictures corresponding to the video.
In these embodiments, the computing device may perform overall framing of the video to obtain the sequence of images. Specifically, the computing device may perform overall frame extraction processing on a plurality of pictures corresponding to the video according to a preset frame extraction manner, so as to obtain a plurality of key video frames. After obtaining the plurality of key video frames, the computing device may sort the key video frames in the order of playing time to obtain an image sequence.
The preset frame extracting mode can be a mode which is preset according to needs and used for extracting frames of the video according to user requirements. For example, the preset frame extraction manner may be to extract key video frames at certain intervals, or the preset frame extraction manner may be to randomly extract a certain number of key video frames, which is not limited herein.
Therefore, in other embodiments of the present disclosure, the computing device may obtain the image sequence by integrally extracting the frame, so that each video frame in the image sequence can be flexibly selected at each position in the whole video, and further, the detection of the fake face video can select the coverage range according to the actual situation, thereby further improving the accuracy of the fake face video detection.
In still other embodiments of the present disclosure, in order to make the facial features of each key video frame easier to extract, after the computing device extracts the key video frames, it first performs image enhancement processing on each key video frame to enlarge the differences between different facial features in the images, obtains the processed key video frames, and then sorts the processed key video frames according to the playing time sequence to obtain the image sequence.
For example, the image enhancement processing may include at least one of the following processing modes: horizontally flipping the image, translating it by a certain distance, scaling it by a certain ratio, rotating it by a certain angle, adjusting its hue by a certain value, adjusting its contrast, adjusting its saturation, adjusting its brightness, adding a certain amount of Gaussian noise, applying a certain motion blur filter, applying a certain Gaussian blur filter, applying a certain degree of JPEG compression, and converting it to grayscale.
Optionally, the computing device may select at least one of these enhancement processing modes with a certain probability to enhance the image.
A specific example of the image enhancement processing modes and their selection probabilities is given in Table 1.
TABLE 1
(Table 1 is provided as an image in the original publication; it lists the candidate enhancement operations together with their selection probabilities.)
Therefore, in the embodiment of the disclosure, after the computing device extracts the key video frames, it performs image enhancement processing on each key video frame to enlarge the differences between different facial features in the images and then uses the processed key video frames to form the image sequence. This makes the attention features and time sequence relation features obtained from each video frame in the image sequence more prominent, so that the calculated probability is more accurate, which further improves the accuracy of forged face video detection.
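For illustration only, the following is a minimal augmentation sketch using torchvision; the operations mirror the kinds of processing modes listed above, but the specific parameters and probabilities are assumptions and are not the values given in Table 1.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3)], p=0.3),
    transforms.RandomGrayscale(p=0.1),
])
```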
The following describes in detail a specific manner in which a computing device obtains three-dimensional attention features of multiple regions of a face corresponding to an image.
In some embodiments of the present disclosure, the computing device may obtain three-dimensional attention features of a plurality of regions of the face corresponding to the image through a three-dimensional convolution operation.
Fig. 3 is a schematic flow chart of an attention feature obtaining method according to an embodiment of the disclosure. As shown in fig. 3, the attention feature acquiring method may include the following steps.
And S310, carrying out three-dimensional facial feature extraction processing on the image to obtain the three-dimensional facial features corresponding to the image.
In the embodiment of the present disclosure, after acquiring any image in an image sequence, a computing device may perform three-dimensional facial feature extraction processing on the image to obtain a three-dimensional facial feature corresponding to the image.
Alternatively, after acquiring any image, the computing device may input image data corresponding to the image into the three-dimensional facial feature extractor, and obtain a feature map output by the three-dimensional facial feature extractor, where the feature map may be used as a three-dimensional facial feature corresponding to the image.
For example, the three-dimensional facial feature extractor may be a three-dimensional Residual Network, such as a mixed convolution network (MC3). MC3 may consist of five convolutional layers, of which the first two are three-dimensional convolutional layers and the third to fifth are two-dimensional convolutional layers. The feature map output by MC3 may have four dimensions, namely the number of channels, the length, the height, and the width; for example, the feature map may have a size of 256 × 20 × 14 × 14 (channels × length × height × width).
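For illustration, the following sketch uses torchvision's mc3_18 as a stand-in for the MC3 extractor described above; the stock mc3_18 backbone differs from the patent's variant in its exact layer sizes and output channel count (512 rather than 256), and the input clip shape is an assumption.

```python
import torch
from torchvision.models.video import mc3_18

backbone = mc3_18()                                         # stand-in MC3 backbone, no pretrained weights
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pooling and fc head

clip = torch.randn(1, 3, 20, 112, 112)                      # (batch, channels, frames, height, width)
feature_map = feature_extractor(clip)                       # per-clip feature map: (batch, C, T', H', W')
```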
And S320, performing three-dimensional convolution operation on the three-dimensional facial features corresponding to the image to obtain an attention weight matrix of a plurality of areas of the face corresponding to the image.
In this embodiment of the present disclosure, after obtaining the three-dimensional facial feature corresponding to the image, the computing device may perform a three-dimensional convolution operation on the three-dimensional facial feature of the image to obtain an attention weight matrix of multiple regions of the face corresponding to the image.
Optionally, S320 may specifically include: and performing three-dimensional convolution operation on the three-dimensional facial features corresponding to the image based on a three-dimensional attention neural network model to obtain an attention weight matrix of a plurality of regions of the face corresponding to the image, wherein the three-dimensional attention neural network model is obtained based on time sequence continuity loss function and sparse attention contrast loss function training.
Specifically, after the three-dimensional facial features corresponding to the image are acquired, the computing device may input the three-dimensional facial features into the three-dimensional attention neural network model to obtain an attention weight matrix output by the three-dimensional attention neural network model.
For example, the three-dimensional attention neural network model may include three attention mechanism modules to implement the three-dimensional convolution operation on the three-dimensional facial features. Each attention mechanism module comprises a three-dimensional convolution layer, a three-dimensional batch normalization layer, and a three-dimensional activation layer: the three-dimensional convolution layer performs a sliding-window operation of the convolution kernel over the three-dimensional space of the input; the three-dimensional batch normalization layer performs a batch normalization operation on the five-dimensional input formed by a mini-batch of four-dimensional data; and the three-dimensional activation layer performs a nonlinear transformation on the features in the three-dimensional space through an activation function. In this way, the three-dimensional attention mechanism is realized by a multilayer neural network, which solves the problem that a two-dimensional attention mechanism ignores timing information.
The following describes the three-dimensional convolution operation network in detail by using a specific example, as shown in table 2.
TABLE 2
(Table 2 is provided as an image in the original publication; it gives the layer-by-layer configuration of the three-dimensional convolution operation network.)
Here, the three numbers in parentheses are parameters indicating the length, height, and width dimensions, respectively; Conv3D (Convolution 3D) denotes a three-dimensional convolution layer, and BN3D (Batch Normalization 3D) denotes a three-dimensional batch normalization layer.
Leaky ReLU can be understood as an activation function applied to a three-dimensional activation layer, and its formula is:
LeakyReLU(x) = x, if x ≥ 0; LeakyReLU(x) = αx, if x < 0,
where α is a small positive slope coefficient.
softmax can be understood as another activation function applied to a three-dimensional activation layer, and its formula is:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j).
further, after the three-dimensional facial features corresponding to the image are processed by the three-dimensional convolution operation network shown in table 2, each channel of the obtained attention weights stores an attention matrix with a size of 14 × 14.
Therefore, in the embodiment of the disclosure, the computing device may process the three-dimensional facial features of the image based on the three-dimensional convolution operation to obtain the three-dimensional attention features for representing the degree of correlation between each region in the face and the face in the image, so that the attention features do not ignore the timing information, and further improve the accuracy of video detection of the forged face.
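As an illustration of the structure just described, the following is a minimal PyTorch sketch of a three-dimensional attention network built from Conv3D, BatchNorm3D, and activation blocks with a softmax over spatial positions; the channel counts, kernel sizes, and slope value are assumptions and do not reproduce the configuration of Table 2.

```python
import torch
import torch.nn as nn

class Attention3D(nn.Module):
    def __init__(self, in_channels=256, num_regions=8):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv3d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.LeakyReLU(0.1),
            nn.Conv3d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.LeakyReLU(0.1),
            nn.Conv3d(64, num_regions, kernel_size=3, padding=1),
            nn.BatchNorm3d(num_regions),
        )

    def forward(self, feat):                      # feat: (B, C, T, H, W)
        logits = self.blocks(feat)                # (B, M, T, H, W), one map per face region
        b, m, t, h, w = logits.shape
        weights = torch.softmax(logits.reshape(b, m, t, h * w), dim=-1)
        return weights.reshape(b, m, t, h, w)     # attention weight matrices
```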
In the embodiment of the present disclosure, the three-dimensional attention neural network model may be trained with a back-propagation method using a loss function that includes a time sequence continuity loss function and a sparse attention contrast loss function; the parameters of the three-dimensional attention neural network model are adjusted until the loss values corresponding to the time sequence continuity loss function and the sparse attention contrast loss function are each smaller than their corresponding loss value thresholds.
The time sequence continuity loss function constrains the attention of the same face region to remain stable over time; its expression is provided as an image in the original publication. In the expression, T denotes the number of images in the image sequence, M denotes the total number of face regions attended to by the three-dimensional attention neural network model, and the remaining symbol denotes the attention weight matrix of the jth face region of the ith image in the image sequence.
The sparse attention contrast loss function keeps the face regions attended to by the model diverse in the spatial dimension; its expression is likewise provided as an image in the original publication. In the expression, a merged vector is formed by the vector merge operation [·,·] applied to the products of the attention weight matrix with all-ones vectors, where 1_{W×1} denotes an all-ones vector of size W × 1 and 1_{1×H} denotes an all-ones vector of size 1 × H; the indicator function takes the value 1 when i ≠ j and 0 otherwise; H(·) denotes the information entropy function; and the final symbol denotes a threshold on the information entropy of the merged vector.
Therefore, in the embodiment of the disclosure, the time sequence continuity loss function keeps the face regions attended to by the three-dimensional attention neural network model stable in the time dimension, and the sparse attention contrast loss function keeps the attended face regions diverse in the space dimension. This improves the accuracy of the attention weight matrices output by the three-dimensional attention neural network model and thus further improves the accuracy of forged face video detection.
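Because the exact loss expressions are given as images in the original publication, the following is only one plausible sketch of a temporal continuity term: it penalizes frame-to-frame changes of each region's attention map, consistent with the stability property described above, but the actual formula may differ.

```python
import torch

def temporal_continuity_loss(attn):               # attn: (B, M, T, H, W) attention weight matrices
    diff = attn[:, :, 1:] - attn[:, :, :-1]        # change between consecutive frames
    return diff.pow(2).mean()                      # mean squared frame-to-frame difference
```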
And S330, generating attention characteristics of the plurality of regions of the face corresponding to the image based on the three-dimensional face characteristics corresponding to the image and the attention weight matrix of the plurality of regions of the face corresponding to the image.
In the embodiment of the present disclosure, after acquiring the attention weight matrices of the multiple regions of the face corresponding to the image, the computing device may fuse the three-dimensional facial features corresponding to the image with the attention weight matrices of the multiple regions of the face to obtain the attention features of the multiple regions of the face corresponding to the image, which is described below with reference to fig. 4.
Fig. 4 is a schematic flow chart of another attention feature obtaining method according to an embodiment of the disclosure. As shown in fig. 4, the attention feature acquiring method may include the following steps.
And S410, taking the product of the three-dimensional face features corresponding to the image and the attention weight matrixes of the plurality of regions of the face corresponding to the image as the attention feature matrix of the plurality of regions of the face corresponding to the image.
In the embodiment of the disclosure, after obtaining the three-dimensional facial feature corresponding to any image in the image sequence and the attention weight matrix of the plurality of regions of the face corresponding to the image, the computing device may calculate a product of the three-dimensional facial feature corresponding to the image and the attention weight matrix to obtain the attention feature matrix of the plurality of regions of the face corresponding to the image.
Wherein the product of the three-dimensional facial feature and the attention weight matrix specifically refers to an outer product of the three-dimensional facial feature and the attention weight matrix.
For example, the parametric dimensions of the attention feature matrix may include color values, vector dimensions, length, height, and width. For example, the size of the attention feature matrix may be 256 × 8 × 20 × 14 × 14 (color value × vector dimension × length × height × width).
And S420, summing the height dimension and the width dimension of the attention feature matrix of the plurality of regions of the face corresponding to the image respectively to obtain the summed attention feature matrix.
In this disclosure, after obtaining the attention feature matrices of the multiple regions of the face corresponding to the image, the computing device may perform summation processing on the height dimension and the width dimension of the attention feature matrices, respectively, to obtain a summed attention feature matrix.
And S430, performing global average pooling on the length dimension of the summed attention feature matrix to obtain the attention features of a plurality of regions of the face corresponding to the image.
In this embodiment of the disclosure, after obtaining the summed attention feature matrix corresponding to the image, the computing device may perform global average pooling on a length dimension of the summed attention feature matrix to obtain attention features of multiple regions of the face corresponding to the image.
The global average pooling can be understood as performing dimension reduction processing on the length dimension by using an averaging method, so as to increase the operation speed.
Specifically, the computing device may calculate, for the feature map of each channel, the average value over the length dimension of all regions, thereby obtaining the attention features of the plurality of regions of the face corresponding to the image after global average pooling over the length dimension.
For example, the dimension parameters of the attention feature may include a color value and a vector dimension. For example, the size of the attention feature may be 256 × 8 (color value × vector dimension).
Therefore, in the embodiment of the disclosure, after the computing device obtains the three-dimensional facial features corresponding to each image in the image sequence and the attention weight matrices of the plurality of regions of the face corresponding to the image, it can fuse the two to obtain attention features with higher accuracy. These attention features represent the degree of correlation between each region of the face and the face in the image, and hence the weight of each face region in video detection, which in turn yields a more accurate video detection result.
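For illustration, the following sketch follows S410 to S430: the 3D facial features are weighted by each region's attention map, summed over height and width, and averaged over the temporal (length) dimension. The tensor shapes follow the examples given in the text, but the einsum formulation is an implementation assumption.

```python
import torch

def fuse_attention(features, weights):
    # features: (B, C, T, H, W)   e.g. a 256-channel feature map
    # weights:  (B, M, T, H, W)   e.g. M = 8 face regions
    attn_feat = torch.einsum('bcthw,bmthw->bcmthw', features, weights)  # per-region weighted features
    summed = attn_feat.sum(dim=(-1, -2))           # sum over height and width -> (B, C, M, T)
    return summed.mean(dim=-1)                     # global average pooling over length -> (B, C, M)
```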
In other embodiments of the present disclosure, the computing device may further obtain attention features of a plurality of regions of the face corresponding to the image through the ResNet18 network model.
The ResNet18 network model contains 17 convolutional layers and 1 fully-connected layer.
Specifically, after acquiring the three-dimensional facial features corresponding to the image, the computing device may input the facial features into a ResNet18 network model, perform 17 convolution operations on them, and add the operation results to obtain the attention features of the plurality of regions of the face corresponding to the image.
The following describes in detail a method for acquiring, by a computing device, a time-series relationship feature between a plurality of regions of a face corresponding to an image sequence by using a time-series graph convolutional neural network.
Fig. 5 is a schematic flow chart of a method for acquiring a time series relationship characteristic according to an embodiment of the disclosure. As shown in fig. 5, the time-series relationship characteristic acquisition method may include the following steps.
S510, for each image, according to the attention characteristics of the plurality of regions of the face corresponding to the image, graph structure data of the plurality of regions of the face corresponding to the image is constructed.
In the embodiment of the present disclosure, after obtaining the attention feature of the plurality of regions of the face corresponding to each image in the image sequence, the computing device may construct the graph structure data of the plurality of regions of the face corresponding to each image according to the attention feature.
Specifically, the attention feature may take the form of a matrix. The attention feature is divided along the channel dimension of the matrix into a plurality of attention feature vectors, which are used as the nodes of the graph structure data; an adjacency matrix is defined as the initial relationship features between the attention feature vectors, and these relationship features are used as the edges of the graph structure data. The nodes and edges corresponding to each image are then combined to obtain the graph structure data of the plurality of regions of the face corresponding to the image.
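For illustration only, the following sketch builds graph structure data from one image's attention features by splitting the feature matrix into one vector per face region and pairing the nodes with an initial adjacency matrix; the orientation of the split and the fully connected initialization are assumptions made for this example.

```python
import torch

def build_graph(attn_feature):                     # attn_feature: (C, M), e.g. 256 x 8
    nodes = [attn_feature[:, j] for j in range(attn_feature.shape[1])]  # one node per face region
    num_nodes = len(nodes)
    adjacency = torch.ones(num_nodes, num_nodes)   # initial relationship features (graph edges)
    return torch.stack(nodes), adjacency           # (M, C) node matrix and (M, M) adjacency
```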
S520, carrying out time sequence relation construction processing on the graph structure data of the plurality of regions of the face corresponding to each image to obtain time sequence relation characteristics among the plurality of regions of the face corresponding to the image sequence.
In the embodiment of the present disclosure, after obtaining the graph structure data of the multiple regions of the face corresponding to each image, the computing device may perform graph convolution operation on the graph structure data to obtain the time sequence relationship characteristic between the multiple regions of the face corresponding to the image sequence.
Specifically, the computing device may input the graph structure data of the multiple regions of the face corresponding to each image into the time-series graph neural network model, and update the relationship features of the graph structure data according to the input graph structure data, so as to obtain the time-series relationship features between the multiple regions of the face corresponding to the image sequence.
The timing diagram neural network model can be understood as a neural network model which can represent dynamic properties in the form of a graph.
Therefore, in the embodiment of the disclosure, the computing device may perform time sequence relation feature construction processing based on the attention features of the plurality of regions of the face corresponding to each image to obtain the time sequence relation features among the plurality of regions of the face corresponding to the image sequence, and use these features to detect the timing inconsistency of the face in the image sequence, and hence in the video, thereby further improving the accuracy of forged face video detection.
Further, in this embodiment of the present disclosure, in step S520, time sequence relation construction processing based on a timing graph neural network model may be performed on the graph structure data of the plurality of regions of the face corresponding to each image, and the last hidden state graph obtained for the image sequence may be used as the time sequence relation feature among the plurality of regions of the face corresponding to the image sequence, as described below with reference to fig. 6.
Fig. 6 is a schematic flowchart of another time series relationship characteristic obtaining method according to an embodiment of the present disclosure. As shown in fig. 6, the time-series relationship characteristic acquisition method may include the following steps.
S610, based on the time sequence diagram neural network model, time sequence relationship construction processing is carried out on diagram structure data of multiple regions of the face corresponding to each image, and a last hidden state diagram corresponding to the image sequence is obtained.
In the embodiment of the disclosure, after obtaining the graph structure data of the plurality of regions of the face corresponding to each image in the image sequence, the computing device processes the images in the order of the image sequence. It first inputs the graph structure data corresponding to the first image into the timing graph neural network model to obtain a hidden state graph that represents the time sequence relation feature of that image. It then inputs the hidden state graph corresponding to the first image together with the graph structure data corresponding to the second image into the timing graph neural network model to obtain a hidden state graph that represents the time sequence relation features of the second image and the first image, and so on, sequentially inputting all images of the image sequence into the model, until the graph structure data corresponding to the last image and the hidden state graph obtained from the preceding images are input into the model together, yielding the last hidden state graph corresponding to the image sequence.
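The disclosure does not specify the internal form of the timing graph neural network model, so the following is only a sketch of one common realization of the recurrence just described, assuming a simple graph-convolution message step followed by a GRU-style per-node update.

```python
import torch
import torch.nn as nn

class TimingGraphCell(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)      # shared graph-convolution weight
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, nodes, adjacency, hidden):
        # nodes, hidden: (num_nodes, dim); adjacency: (num_nodes, num_nodes)
        norm_adj = adjacency / adjacency.sum(dim=-1, keepdim=True)
        message = self.gcn_weight(norm_adj @ nodes)        # aggregate neighboring face regions
        return self.gru(message, hidden)                    # updated hidden state graph

# Feeding the image sequence frame by frame yields the last hidden state graph:
#   hidden = torch.zeros(num_nodes, dim)
#   for nodes, adj in graph_sequence:
#       hidden = cell(nodes, adj, hidden)
```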
In the embodiment of the present disclosure, the timing graph neural network model may be trained with a back-propagation method using a cross-entropy loss function. In the training process, the parameters of the timing graph neural network model are adjusted through the cross-entropy loss function until the loss value corresponding to the cross-entropy loss function is smaller than the corresponding loss value threshold.
Wherein the cross-entropy loss function can be written as:
L_ce = - [ y · log φ(H^(T)) + (1 - y) · log(1 - φ(H^(T))) ],
where φ denotes the classifier, y denotes the authenticity label of the video to be detected, and H^(T) denotes the hidden state graph obtained after the last key video frame is input.
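For illustration, a minimal sketch of the classification step and this training objective follows; the simple linear-plus-sigmoid classifier and the tensor shapes are assumptions, not the architecture specified by the disclosure.

```python
import torch
import torch.nn as nn

num_nodes, dim = 8, 256                            # example hidden state graph size
classifier = nn.Sequential(nn.Flatten(), nn.Linear(num_nodes * dim, 1), nn.Sigmoid())

def cross_entropy_loss(last_hidden, label):        # last_hidden: (1, num_nodes, dim), label: 0 or 1
    prob = classifier(last_hidden).squeeze()       # probability that the face is forged
    target = torch.tensor(float(label))
    return nn.functional.binary_cross_entropy(prob, target)
```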
And S620, taking the last hidden state diagram as a time sequence relation characteristic among a plurality of regions of the face corresponding to the image sequence.
In the embodiment of the present disclosure, the last hidden state graph, obtained by sequentially inputting the graph structure data corresponding to each image in the image sequence into the timing graph neural network model, can represent the time sequence relation between the same face regions across all images in the image sequence. Therefore, the computing device may use the last hidden state graph output by the timing graph neural network model as the time sequence relation feature among the plurality of regions of the face corresponding to the image sequence.
Therefore, in the embodiment of the disclosure, after obtaining the graph structure data of the multiple regions of the face for each image in the image sequence, the computing device can input the graph structure data of each image into the timing diagram neural network model in sequence order to obtain the last hidden state diagram corresponding to the image sequence. This hidden state diagram represents the time-series relationship of the same face regions across different images in the sequence, so time-series inconsistencies of the face in the image sequence, and hence in the video, can be detected through the time-series relationship feature, further improving the accuracy of forged-face video detection.
In an embodiment of the present disclosure, the three-dimensional facial feature extractor, the three-dimensional attention neural network model, the fusion module, the graph structure construction module, the timing diagram neural network model, and the classifier may together form a video detection model, and the video detection method provided in the embodiment of the present disclosure may be implemented based on this model. The structure and principle of the video detection model for detecting face videos are described in detail below with reference to Figs. 7 to 9.
Fig. 7 is a schematic diagram of a video detection model according to an embodiment of the present disclosure. As shown in Fig. 7, after extracting a face image sequence 710 from a face video, the computing device may input the face image sequence 710 into the video detection model. The three-dimensional facial feature extractor in the model performs three-dimensional facial feature extraction on each face image to obtain the three-dimensional facial features 720 corresponding to each face image. The three-dimensional attention neural network model then calculates the attention weight matrices 730 of the multiple regions of the face corresponding to each face image, and the fusion module fuses the three-dimensional facial features 720 of each face image with the corresponding attention weight matrices 730 to obtain the attention features 740 of each face image. Next, a graph structure construction module (not shown in the figure) constructs the graph structure data 750 corresponding to each attention feature 740, and the timing diagram neural network model generates the time-series relationship feature 760 of the image sequence from the graph structure data 750. Finally, the classifier obtains the detection result, namely the probability that the video to be detected is a forged-face video, based on the time-series relationship feature 760.
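A compact sketch of the data flow in Fig. 7 is given below; the component and variable names are illustrative stand-ins for the modules described above, not identifiers from the patent.

```python
def detect_forged_face(face_image_sequence, model):
    attention_features = []
    for face_image in face_image_sequence:
        feat3d = model.face_feature_extractor(face_image)        # three-dimensional facial features 720
        weights = model.attention_network(feat3d)                 # attention weight matrix 730
        attention_features.append(model.fusion(feat3d, weights))  # attention feature 740
    graphs = [model.build_graph(a) for a in attention_features]   # graph structure data 750
    relation = model.timing_graph_network(graphs)                  # time-series relationship feature 760
    return model.classifier(relation)                              # probability of a forged-face video
```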
Further, the specific structure and principle of the three-dimensional attention neural network model in fig. 7 can be explained with reference to fig. 8.
Fig. 8 is a schematic diagram of a three-dimensional attention neural network model according to an embodiment of the present disclosure. As shown in Fig. 8, the three-dimensional attention neural network model 800 may include three three-dimensional convolution processing modules 820, and each three-dimensional convolution processing module 820 may include a three-dimensional convolution layer 821, a three-dimensional batch normalization layer 822, and a three-dimensional activation layer 823.
The first three-dimensional convolution processing module 820 may process the three-dimensional face features 810 corresponding to any face image to obtain a first feature matrix 830, the second three-dimensional convolution processing module 820 may process the first feature matrix 830 to obtain a second feature matrix 840, and the third three-dimensional convolution processing module 820 may process the second feature matrix 840 to obtain an attention weight matrix 850.
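A minimal sketch of the three stacked modules is shown below, assuming PyTorch; the channel and kernel sizes are illustrative assumptions, since the patent does not specify them.

```python
import torch.nn as nn

class Attention3D(nn.Module):
    def __init__(self, in_channels=512, mid_channels=256):
        super().__init__()
        def block(c_in, c_out):
            # One three-dimensional convolution processing module:
            # 3D convolution + 3D batch normalization + activation layer.
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
            )
        self.module1 = block(in_channels, mid_channels)   # -> first feature matrix
        self.module2 = block(mid_channels, mid_channels)  # -> second feature matrix
        self.module3 = block(mid_channels, in_channels)   # -> attention weight matrix

    def forward(self, face_features_3d):
        x = self.module1(face_features_3d)
        x = self.module2(x)
        return self.module3(x)
```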
Further, the specific structure and principle of the graph structure building module and the timing diagram convolutional neural network model in fig. 7 can be explained with reference to fig. 9.
In the embodiment of the present disclosure, after obtaining the attention feature corresponding to any face image, the fusion module may divide the attention feature into 8 attention feature vectors along a second dimension, such as the channel dimension, where the size of each attention feature vector is 1 × 256. Next, the graph structure building module may define an adjacency matrix as the initial relationship feature between the attention feature vectors, where the adjacency matrix is of size 8 × 8. Finally, the graph structure building module may take the attention feature vectors as the nodes V and the relationship features as the edges E to construct a graph G = <V, E> for the graph convolution operations.
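Before turning to the detailed unit-by-unit process below, the splitting-and-graph-construction step can be sketched as follows. The sketch assumes PyTorch; making the adjacency matrix a learnable parameter and initializing it uniformly are illustrative assumptions, not requirements of the patent.

```python
import torch
import torch.nn as nn

class GraphBuilder(nn.Module):
    def __init__(self, num_nodes=8, node_dim=256):
        super().__init__()
        # Adjacency matrix E used as the initial relationship feature (edges).
        self.adjacency = nn.Parameter(torch.ones(num_nodes, num_nodes) / num_nodes)
        self.num_nodes, self.node_dim = num_nodes, node_dim

    def forward(self, attention_feature):
        # Split one image's attention feature along the channel dimension into
        # 8 attention feature vectors of size 256: these are the nodes V.
        nodes = attention_feature.reshape(self.num_nodes, self.node_dim)
        return self.adjacency, nodes  # graph G = <V, E> as (edges, nodes)
```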
Specifically, the process by which the graph structure building module constructs the graph is as follows:
the graph structure building module first defines a graph convolution parameter matrix W_g of size 256 × 384. The parameter matrix W_g 910 is input into the first graph convolution operation unit 920, and the first graph convolution operation unit 920 performs the graph convolution operation gconv(G) = E·V·W_g, obtaining a result matrix of size 8 × 384. The result matrix is input into the first vector splitting unit 930, which splits it along a second dimension, such as the channel dimension, into 3 hidden vectors G_r, G_z, G_h, each of size 8 × 128. Next, the graph structure building module may define an initial hidden state of size 8 × 128 and a graph convolution parameter matrix W_h of the hidden state of size 128 × 384, where the initial value of the hidden state is 0. Then, the graph structure building module may input the data 940 to be processed, formed by the hidden state and the shared relationship features, into the second graph convolution operation unit 950, so that the second graph convolution operation unit 950 performs the graph convolution operation gconv(H) = E·H·W_h on the data 940 to be processed, resulting in a hidden state diagram 960. Finally, the graph structure building module may input the hidden state diagram 960 into the second vector splitting unit 970, so that the second vector splitting unit 970 splits the hidden state diagram 960 along a second dimension, such as the channel dimension, into 3 hidden vectors H_r, H_z, H_h, each of size 8 × 128.
The timing diagram neural network model may set initial bias parameters 980 and input the hidden state diagram 960, the hidden vectors G_r, G_z, G_h, H_r, H_z, H_h, and the initial bias parameters 980 into the gate control unit 990. The initial bias parameters 980 are input into the bias parameter adjustment subunit 991 and split into bias parameter vectors b_r, b_z, b_h. The hidden vectors H_r, G_r and the bias vector b_r are input into the reset gate operation subunit 992, which executes r = sigmoid(G_r + H_r + b_r); the resulting r can be understood as the reset result 993. The reset result 993 and the hidden vector H_h are input into the first product operation subunit 994 to obtain the first product result 995. The hidden vectors H_z, G_z and the bias vector b_z are input into the update gate operation subunit 996, which executes z = sigmoid(G_z + H_z + b_z); the resulting z can be understood as the update result 997. The update result 997 and the hidden state diagram 960 are input into the second product operation subunit 998 to obtain the second product result 999, and the update result 997 is also input into the operation unit 9910, which performs the operation 1 − z to obtain the calculated result 9911. The first product result 995 and the vectors G_h, b_h are input into the candidate hidden state operation subunit 9912, which computes the candidate hidden state H̃ 9913. The candidate hidden state 9913 and the calculated result 9911 are input into the third product operation subunit 9914 to obtain the third product result 9915, and the second product result 999 and the third product result 9915 are input into the addition subunit 9916, which adds them to execute H ← z ⊙ H + (1 − z) ⊙ H̃. The obtained H can be understood as the updated hidden state diagram 9917. The updated hidden state diagram 9917 is input into the hidden state diagram replacing unit 9100, which replaces the hidden state diagram 960 output by the second graph convolution operation unit 950, and the updated hidden state diagram 9917 is also input into the bias parameter adjustment subunit 991 to adjust the bias parameters, so as to obtain bias parameters that improve the effect of the timing diagram neural network model in extracting the time-series relationship features of the image sequence.
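One gated update step can be sketched as follows, assuming PyTorch. The weight shapes follow the sizes given above; the gate equations and the final combination follow the description, while the tanh activation for the candidate hidden state is an assumption (the original candidate-state formula is provided only as an image).

```python
import torch
import torch.nn as nn

class TimingGraphCell(nn.Module):
    def __init__(self, num_nodes=8, node_dim=256, hidden_dim=128):
        super().__init__()
        # Graph convolution parameter matrices: W_g is 256 x 384, W_h is 128 x 384.
        self.W_g = nn.Parameter(torch.randn(node_dim, 3 * hidden_dim) * 0.01)
        self.W_h = nn.Parameter(torch.randn(hidden_dim, 3 * hidden_dim) * 0.01)
        # Bias parameter vectors b_r, b_z, b_h obtained by splitting the bias parameters.
        self.b_r = nn.Parameter(torch.zeros(num_nodes, hidden_dim))
        self.b_z = nn.Parameter(torch.zeros(num_nodes, hidden_dim))
        self.b_h = nn.Parameter(torch.zeros(num_nodes, hidden_dim))

    def forward(self, E, V, H):
        # E: (8, 8) adjacency, V: (8, 256) node features, H: (8, 128) hidden state diagram.
        G_r, G_z, G_h = (E @ V @ self.W_g).chunk(3, dim=-1)  # gconv(G) = E V W_g, split by channel
        H_r, H_z, H_h = (E @ H @ self.W_h).chunk(3, dim=-1)  # gconv(H) = E H W_h, split by channel
        r = torch.sigmoid(G_r + H_r + self.b_r)              # reset gate
        z = torch.sigmoid(G_z + H_z + self.b_z)              # update gate
        H_cand = torch.tanh(G_h + r * H_h + self.b_h)        # candidate hidden state (tanh assumed)
        return z * H + (1.0 - z) * H_cand                    # updated hidden state diagram
```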
Then, the graph structure building module may receive the attention feature corresponding to the next face image, divide it along a second dimension, such as the channel dimension, into 8 attention feature vectors to obtain the nodes V corresponding to that image, and complete the next update of the hidden state diagram based on these nodes. This continues until the attention feature of every image in the image sequence has been input into the timing diagram neural network model, at which point the resulting hidden state diagram can represent the time-series relationship feature of the whole image sequence.
Further, in order to verify the accuracy of the video detection model shown in Figs. 7 to 9 in detecting forged-face videos, its forged-face video detection performance can be tested on the FaceForensics++ (FF++), Celeb-DF v2, and Deepfake Detection Challenge (DFDC) datasets.
The FF++ dataset tampers with the faces of 1000 real videos using four deep forgery algorithms, Deepfakes, Face2Face, FaceSwap, and NeuralTextures, generating 1000 forged videos per algorithm for a total of 4000 forged videos. The FF++ dataset includes a High Quality (HQ) version and a Low Quality (LQ) version.
The Celeb-DF v2 dataset applies an improved deep forgery algorithm to the faces in 590 real celebrity videos to obtain 5639 forged videos.
The DFDC dataset was produced by Facebook in 2019; 1131 real videos shot by Facebook were subjected to deep forgery processing to obtain 4119 forged videos.
In the tests, the area under the curve (AUC) is used as the model evaluation index, and the video detection model is evaluated with a model accuracy test and a generalization ability test. The model accuracy test evaluates the detection accuracy of the video detection model and uses the same dataset as both training set and test set; the generalization ability test evaluates how well the video detection model adapts to unseen samples and uses different datasets as training set and test set.
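As an illustration, the AUC index can be computed with scikit-learn as sketched below; the array names and the label convention are assumptions for illustration.

```python
from sklearn.metrics import roc_auc_score

def evaluate_auc(true_labels, forged_probs):
    # true_labels: 1 for forged-face videos, 0 for real videos (assumed convention);
    # forged_probs: the probabilities output by the video detection model on the test set.
    return roc_auc_score(true_labels, forged_probs)
```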
In the model accuracy test, the FF++ dataset is used, and the test results are shown in Table 3.
TABLE 3
[Table image: detection accuracy of each detection method on the FF++ (HQ) and FF++ (LQ) datasets]
Here, w/o (without) denotes a comparison method in which a specific submodel is removed in order to verify the effectiveness of that submodel; for example, "video detection model w/o three-dimensional attention neural network model" denotes the video detection model with the three-dimensional attention neural network model removed. FF++ (HQ) and FF++ (LQ) denote the test results on the high-quality and low-quality versions of the FF++ dataset, respectively; the percentage value is the probability that the test result matches the real result and is taken as the detection accuracy of the corresponding detection method on the corresponding dataset. In the model accuracy test, the video detection model achieves the highest detection accuracy, and its detection accuracy on the FF++ (LQ) dataset is 18.19% higher than that of the baseline MC3 method. Removing either the three-dimensional attention neural network model or the timing diagram neural network model from the video detection model reduces the detection accuracy.
In the generalization ability test, the FF++ (HQ) dataset is used to train the corresponding video detection methods, and the FF++ (HQ) dataset, the Celeb-DF v2 dataset, and the DFDC dataset are used as test sets; the test results are shown in Table 4.
TABLE 4
[Table image: generalization test results of each detection method on the FF++ (HQ), Celeb-DF v2, and DFDC test sets]
The notation in Table 4 has the same meaning as in Table 3, and Celeb-DF v2 and DFDC denote the test results of each method on the Celeb-DF v2 dataset and the DFDC dataset, respectively. In the generalization ability test, the video detection model achieves the highest generalization performance, and its detection accuracy on the Celeb-DF v2 dataset is 11.92% higher than that of the baseline MC3 method. Removing either the three-dimensional attention neural network model or the timing diagram neural network model from the video detection model reduces the accuracy of forged-face video detection.
In summary, after the accuracy test and the generalization ability test of the video detection model in the embodiment of the present disclosure, the test results show that the video detection model provided in the embodiment of the present disclosure, and the video detection method implemented based on it, achieve the highest detection accuracy among the compared video detection models and methods on multiple datasets. The test results therefore indicate that the video detection method provided in the embodiment of the present disclosure can improve the accuracy and generalization ability of forged-face video detection.
Fig. 10 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present disclosure.
In the embodiment of the present disclosure, the video detection apparatus may be provided in a computing device. The computing device may include, but is not limited to, an electronic device, a server, and the like. The electronic device may include a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a wearable electronic device, an all-in-one machine, an intelligent home device, and other devices having a computing function, and may also be a virtual machine or a simulator-simulated device. The server can be an independent server or a cluster of a plurality of servers, and can comprise a server built in the local and a server built in the cloud.
As shown in fig. 10, the video detection apparatus 1000 includes: an image acquisition module 1010, a non-linear variation module 1020, a feature construction module 1030, and a probability calculation module 1040.
The image obtaining module 1010 may be configured to obtain an image sequence to be detected, where the image sequence includes at least two video frames in the same video.
The non-linear variation module 1020 may be configured to perform a non-linear transformation process on the facial features of the images for each image in the image sequence to obtain attention features of multiple regions of the face corresponding to the images.
The feature construction module 1030 can be configured to construct a time-series relationship feature between multiple regions of the face corresponding to the image sequence based on the attention features of the multiple regions of the face corresponding to the respective images.
The probability calculation module 1040 may be configured to calculate the probability that the video is a video of a fake face based on the time-series relationship characteristics.
In the embodiment of the disclosure, after an image sequence to be detected, including at least two video frames of the same video, is acquired, the facial features of each image in the image sequence are subjected to nonlinear transformation processing to obtain the attention features of the multiple regions of the face corresponding to each image. A time-series relationship feature among the multiple regions of the face corresponding to the image sequence is constructed based on these attention features, and the probability that the video is a forged-face video is then calculated based on the time-series relationship feature; this probability can be used to judge whether the video is a forged-face video. Because the probability is calculated based on the time-series relationship feature among the multiple regions of the face corresponding to the image sequence, the time-series relationship among the multiple regions of the face is taken into account when calculating the probability, so time-series inconsistencies of the face in the video can be detected. The probability calculation result is therefore more accurate and generalizes better, further improving the accuracy of forged-face video detection.
In some embodiments of the present disclosure, the image acquisition module 1010 may include a video segmentation unit, a key frame extraction unit, and a key frame ordering unit.
The video segmentation unit may be configured to divide the video into a plurality of video segments.
The key frame extracting unit may be configured to extract a key video frame of each video segment according to a preset frame extracting manner.
The key frame sorting unit may be configured to sort the key video frames according to the playing time sequence to obtain an image sequence.
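A minimal sketch of this frame-extraction flow is shown below. It assumes OpenCV and uses "take the middle frame of each segment" as an illustrative preset frame extraction manner; the patent does not fix either choice.

```python
import cv2

def extract_image_sequence(video_path, num_segments=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_segments):
        # Divide the video into segments and take one key video frame per segment.
        start = i * total // num_segments
        end = (i + 1) * total // num_segments
        middle = (start + end) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # key video frames, already in playing-time order
```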
In some embodiments of the present disclosure, the non-linear variation module 1020 may include a feature extraction unit, a three-dimensional convolution operation unit, and an attention feature construction unit.
The feature extraction unit may be configured to perform three-dimensional facial feature extraction processing on the image to obtain a three-dimensional facial feature corresponding to the image.
The three-dimensional convolution operation unit can be used for performing three-dimensional convolution operation on the three-dimensional face features corresponding to the image to obtain the attention weight matrix of a plurality of areas of the face corresponding to the image.
The attention feature construction unit may be configured to generate the attention features of the plurality of regions of the face corresponding to the image based on the three-dimensional face feature corresponding to the image and the attention weight matrix of the plurality of regions of the face corresponding to the image.
In some embodiments of the present disclosure, the attention feature construction unit may include a feature matrix construction subunit, a summation subunit, and a pooling subunit.
The feature matrix construction subunit may be configured to take a product of the three-dimensional facial feature corresponding to the image and an attention weight matrix of a plurality of regions of the face corresponding to the image as an attention feature matrix of the plurality of regions of the face corresponding to the image.
The summation subunit may be configured to sum the height dimension and the width dimension of the attention feature matrix of the multiple regions of the face corresponding to the image, respectively, to obtain a summed attention feature matrix.
The pooling subunit may be configured to perform global average pooling on the length dimension of the summed attention feature matrix to obtain attention features of multiple regions of the face corresponding to the image.
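For illustration, the fusion and reduction steps performed by these subunits can be sketched as follows, assuming PyTorch and an assumed tensor layout of (channels, length, height, width).

```python
import torch

def build_attention_feature(face_features_3d, attention_weights):
    # Element-wise product gives the attention feature matrix of the face regions.
    attn = face_features_3d * attention_weights
    # Sum over the height and width dimensions -> (channels, length).
    attn = attn.sum(dim=(2, 3))
    # Global average pooling over the length dimension -> (channels,).
    return attn.mean(dim=1)
```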
In some embodiments of the present disclosure, the attention feature construction unit may include a weight matrix construction subunit.
The weight matrix construction subunit may be used to perform a three-dimensional convolution operation on the three-dimensional facial features corresponding to the image based on the three-dimensional attention neural network model, to obtain the attention weight matrix of the multiple regions of the face corresponding to the image. The three-dimensional attention neural network model includes three attention mechanism modules, each of which includes a three-dimensional convolution layer, a three-dimensional batch normalization layer, and an activation layer, and the loss function of the three-dimensional attention neural network model includes a loss function based on time-series continuity and a sparse attention contrast loss function. The three-dimensional attention neural network model is trained based on the time-series continuity loss function and the sparse attention contrast loss function.
In some embodiments of the present disclosure, the feature construction module may include a graph structure data construction unit and a time series relationship feature construction unit.
The graph structure data construction unit may be configured to construct, for each image, graph structure data of the multiple regions of the face corresponding to the image according to the attention features of the multiple regions of the face corresponding to the image.
The time sequence relation feature construction unit may be configured to perform time sequence relation construction processing on the graph structure data of the multiple regions of the face corresponding to each image, so as to obtain a time sequence relation feature between the multiple regions of the face corresponding to the image sequence.
In some embodiments of the present disclosure, the timing relationship feature construction unit may include an implicit state diagram construction subunit and a timing relationship feature acquisition subunit.
The hidden state diagram construction subunit may be used to perform time-series relationship construction processing on the graph structure data of the multiple regions of the face corresponding to each image based on the timing diagram neural network model, to obtain the last hidden state diagram corresponding to the image sequence.
The time sequence relation feature obtaining subunit may be configured to use the last hidden state diagram as a time sequence relation feature between multiple regions of the face corresponding to the image sequence.
It should be noted that the video detection apparatus 1000 shown in fig. 10 may perform each step in the method embodiments shown in fig. 1 to fig. 5, and implement each process and effect in the method embodiments shown in fig. 1 to fig. 5, which are not described herein again.
Embodiments of the present disclosure also provide a video detection device that may include a processor and a memory, which may be used to store executable instructions. The processor may be configured to read the executable instructions from the memory and execute the executable instructions to implement the video detection method in the foregoing embodiments.
Fig. 11 shows a schematic structural diagram of a video detection device provided by an embodiment of the present disclosure. Referring now specifically to fig. 11, a schematic diagram of a video detection device 1100 suitable for use in implementing embodiments of the present disclosure is shown.
In embodiments of the present disclosure, video detection device 1100 may be a computing device. The computing device may include, but is not limited to, an electronic device, a server, and the like. The electronic device may include a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a wearable electronic device, an all-in-one machine, an intelligent home device, and other devices having a computing function, and may also be a virtual machine or a simulator-simulated device. The server can be an independent server or a cluster of a plurality of servers, and can comprise a server built in the local and a server built in the cloud.
As shown in fig. 11, the video detection device 1100 provided in the embodiment of the present disclosure may implement the processing flow of the above method embodiments, and includes: a memory 1110, a processor 1120, a computer program, and a communication interface 1130; wherein the computer program is stored in the memory 1110 and configured to be executed by the processor 1120 to perform the video detection method described above.
It should be noted that the video detection device 1100 shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of the embodiments of the present disclosure.
In addition, the embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the video detection method described in the above embodiment.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A video detection method, comprising:
acquiring an image sequence to be detected, wherein the image sequence comprises at least two video frames in the same video;
carrying out nonlinear transformation processing on the facial features of the images aiming at each image in the image sequence to obtain the attention features of a plurality of regions of the face corresponding to the images;
constructing a time sequence relation characteristic among a plurality of regions of the face corresponding to the image sequence based on the attention characteristics of the plurality of regions of the face corresponding to each image;
and calculating the probability that the video is the video with the forged face based on the time sequence relation characteristics.
2. The method of claim 1, wherein the acquiring the sequence of images to be detected comprises:
dividing the video into a plurality of video segments;
extracting key video frames of each video segment according to a preset frame extraction mode;
and sequencing all the key video frames according to the playing time sequence to obtain the image sequence.
3. The method according to claim 1, wherein the performing the non-linear transformation on the facial features of the image to obtain attention features of a plurality of regions of the face corresponding to the image comprises:
carrying out three-dimensional facial feature extraction processing on the image to obtain three-dimensional facial features corresponding to the image;
performing three-dimensional convolution operation on the three-dimensional facial features corresponding to the image to obtain an attention weight matrix of a plurality of regions of the face corresponding to the image;
and generating attention characteristics of a plurality of regions of the face corresponding to the image based on the three-dimensional face characteristics corresponding to the image and the attention weight matrix of the plurality of regions of the face corresponding to the image.
4. The method of claim 3, wherein generating the attention features of the plurality of regions of the face corresponding to the image based on the three-dimensional facial features corresponding to the image and the attention weight matrix of the plurality of regions of the face corresponding to the image comprises:
taking the product of the three-dimensional facial features corresponding to the image and the attention weight matrix of the plurality of regions of the face corresponding to the image as the attention feature matrix of the plurality of regions of the face corresponding to the image;
respectively summing the height dimension and the width dimension of the attention feature matrix of a plurality of regions of the face corresponding to the image to obtain a summed attention feature matrix;
and performing global average pooling on the length dimension of the summed attention feature matrix to obtain the attention features of a plurality of regions of the face corresponding to the image.
5. The method according to claim 3, wherein the performing a three-dimensional convolution operation on the three-dimensional facial features corresponding to the image to obtain an attention weight matrix of a plurality of regions of the face corresponding to the image comprises:
performing three-dimensional convolution operation on the three-dimensional facial features corresponding to the image based on a three-dimensional attention neural network model to obtain an attention weight matrix of a plurality of regions of the face corresponding to the image; the three-dimensional attention neural network model comprises three attention mechanism modules, each attention mechanism module comprises a three-dimensional convolution layer, a three-dimensional batch normalization layer and an activation layer, and a loss function of the three-dimensional attention neural network model comprises a time sequence continuity-based loss function and a sparse attention contrast loss function.
6. The method according to claim 1, wherein the constructing a time-series relationship feature between the plurality of regions of the face corresponding to the image sequence based on the attention feature of the plurality of regions of the face corresponding to each of the images comprises:
for each image, constructing graph structure data of a plurality of regions of the face corresponding to the image according to the attention features of the plurality of regions of the face corresponding to the image;
and performing time sequence relation construction processing on the graph structure data of the plurality of regions of the face corresponding to each image to obtain the time sequence relation characteristics among the plurality of regions of the face corresponding to the image sequence.
7. The method according to claim 6, wherein the performing a time series relationship construction process on the graph structure data of the plurality of regions of the face corresponding to each image to obtain a time series relationship characteristic between the plurality of regions of the face corresponding to the image sequence comprises:
based on a time sequence diagram neural network model, carrying out time sequence relationship construction processing on diagram structure data of a plurality of regions of the face corresponding to each image to obtain a last hidden state diagram corresponding to the image sequence;
and taking the last hidden state diagram as a time sequence relation characteristic among a plurality of regions of the face corresponding to the image sequence.
8. A video detection apparatus, comprising:
the device comprises an image acquisition module, a detection module and a processing module, wherein the image acquisition module is used for acquiring an image sequence to be detected, and the image sequence comprises at least two video frames in the same video;
the non-linear change module is used for carrying out non-linear transformation processing on the facial features of the images aiming at each image in the image sequence to obtain the attention features of a plurality of regions of the face corresponding to the images;
the feature construction module is used for constructing a time sequence relation feature among a plurality of regions of the face corresponding to the image sequence based on the attention feature of the plurality of regions of the face corresponding to each image;
and the probability calculation module is used for calculating the probability that the video is the video with the forged face based on the time sequence relation characteristics.
9. A video detection device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the video detection method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, causes the processor to implement a video detection method as claimed in any of the preceding claims 1-7.
CN202210369060.5A 2022-04-08 2022-04-08 Video detection method, device, equipment and medium Pending CN114724218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210369060.5A CN114724218A (en) 2022-04-08 2022-04-08 Video detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210369060.5A CN114724218A (en) 2022-04-08 2022-04-08 Video detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114724218A true CN114724218A (en) 2022-07-08

Family

ID=82242663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210369060.5A Pending CN114724218A (en) 2022-04-08 2022-04-08 Video detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114724218A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452787A (en) * 2023-06-13 2023-07-18 北京中科闻歌科技股份有限公司 Virtual character processing system driven by vision
CN116452787B (en) * 2023-06-13 2023-10-10 北京中科闻歌科技股份有限公司 Virtual character processing system driven by vision
CN117473120A (en) * 2023-12-27 2024-01-30 南京邮电大学 Video retrieval method based on lens features

Similar Documents

Publication Publication Date Title
CN108875676B (en) Living body detection method, device and system
CN108537743B (en) Face image enhancement method based on generation countermeasure network
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108764085B (en) Crowd counting method based on generation of confrontation network
Pavani et al. Haar-like features with optimally weighted rectangles for rapid object detection
Yang et al. Detecting fake images by identifying potential texture difference
CN104424634B (en) Object tracking method and device
Richao et al. Detection of object-based manipulation by the statistical features of object contour
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN114724218A (en) Video detection method, device, equipment and medium
CN111611873A (en) Face replacement detection method and device, electronic equipment and computer storage medium
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
Fang et al. Deep3DSaliency: Deep stereoscopic video saliency detection model by 3D convolutional networks
Rehman et al. Deep learning for face anti-spoofing: An end-to-end approach
Duan et al. Age estimation using aging/rejuvenation features with device-edge synergy
CN111259792A (en) Face living body detection method based on DWT-LBP-DCT characteristics
Yu et al. Deep forgery discriminator via image degradation analysis
CN112818774A (en) Living body detection method and device
Zeng et al. A framework of camera source identification Bayesian game
Li et al. Sharpness and brightness quality assessment of face images for recognition
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN113554685A (en) Method and device for detecting moving target of remote sensing satellite, electronic equipment and storage medium
CN112257688A (en) GWO-OSELM-based non-contact palm in-vivo detection method and device
Girish et al. Inter-frame video forgery detection using UFS-MSRC algorithm and LSTM network
Ding et al. DeepFake Videos Detection via Spatiotemporal Inconsistency Learning and Interactive Fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination