CN117690061B - Deepfake video detection method, apparatus, device and storage medium - Google Patents

Deepfake video detection method, apparatus, device and storage medium

Info

Publication number
CN117690061B
CN117690061B CN202311827980.8A
Authority
CN
China
Prior art keywords
video
data
model
detected
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311827980.8A
Other languages
Chinese (zh)
Other versions
CN117690061A (en)
Inventor
朱威
黎健和
陈盛福
温世欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Consumer Finance Co ltd
Original Assignee
China Post Consumer Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Consumer Finance Co ltd filed Critical China Post Consumer Finance Co ltd
Priority to CN202311827980.8A priority Critical patent/CN117690061B/en
Publication of CN117690061A publication Critical patent/CN117690061A/en
Application granted granted Critical
Publication of CN117690061B publication Critical patent/CN117690061B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention relates to the technical field of video detection and discloses a deepfake video detection method, apparatus, device, and storage medium. The method comprises: performing face authentication on a video to be detected to obtain face actions; segmenting, from the video to be detected, the device monitoring data corresponding to each face action; preprocessing the device monitoring data to obtain preprocessed device monitoring data; and detecting the preprocessed device monitoring data with a preset detection model and judging, based on the detection result, whether the video to be detected is a deepfake video, the preset detection model being obtained by combining and tuning a deep neural network model and a generalized linear model. Because the video to be detected is checked against the device monitoring data segmented from it, using a detection model built by jointly tuning a deep neural network model and a generalized linear model, the method can accurately judge whether the video to be detected is a deepfake video.

Description

Deepfake video detection method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of video detection technologies, and in particular to a deepfake video detection method, apparatus, device, and storage medium.
Background
With the rapid development of the internet, the financial industry has gradually moved online, and people can open bank accounts online to meet their financial needs. This shift, however, is accompanied by growing network security problems; in particular, fraudsters use various means during financial transactions. For example, when a financial service is handled online, a fraudster can deceive the face authentication system by hijacking the camera of the device, so that the system does not receive frames from the real camera but instead receives an injected deepfake video.
Current deepfake video detection methods rely mainly on analyzing visual differences. As deepfake technology keeps improving, the visual difference between forged video and real video becomes ever smaller, making deepfake video difficult to detect accurately. The industry therefore needs a method that can detect deepfake video accurately.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the present invention is to provide a deepfake video detection method, apparatus, device, and storage medium, aiming to solve the technical problem that the prior art has difficulty detecting deepfake video accurately.
To achieve the above purpose, the present invention provides a deepfake video detection method comprising the following steps:
performing face authentication on a video to be detected to obtain face actions, wherein the face actions include a mouth-opening action, a blinking action, a head-shaking action, a nodding action, and a static action;
segmenting, from the video to be detected, the device monitoring data corresponding to the face action, wherein the device monitoring data corresponds to the video segment in which a complete face action is performed;
preprocessing the device monitoring data to obtain preprocessed device monitoring data; and
detecting the preprocessed device monitoring data with a preset detection model and judging, based on the detection result, whether the video to be detected is a deepfake video, wherein the preset detection model is obtained by combining and tuning a deep neural network model and a generalized linear model.
Optionally, the step of performing face authentication on the video to be detected to obtain face actions includes:
obtaining the mouth aspect ratio of the face in the video to be detected, and judging, based on the mouth aspect ratio, whether a mouth-opening action exists in the video to be detected;
obtaining the eye aspect ratio of the face in the video to be detected, and judging, based on the eye aspect ratio, whether a blinking action exists in the video to be detected; and
obtaining the change in cheek width and the change in nose-to-chin distance of the face in the video to be detected, and judging, based on these changes, whether a head-shaking action and/or a nodding action exists in the video to be detected.
Optionally, the step of performing face authentication on the video to be detected to obtain face actions further includes:
if none of the mouth-opening, blinking, head-shaking, and nodding actions exists in the video to be detected, and the motion amplitude of the face in the video to be detected is smaller than a preset amplitude threshold, judging that a static action exists in the video to be detected.
Optionally, the preprocessed device monitoring data includes preprocessed time-series data and preprocessed behavior data, and the step of preprocessing the device monitoring data to obtain preprocessed device monitoring data includes:
extracting frames from the time-series data in the device monitoring data at a fixed frequency and in a fixed number, and applying max-min normalization to the frame-extracted time-series data to obtain the preprocessed time-series data; and
computing the mean and standard deviation of the behavior data in the device monitoring data, and applying standard normalization to the behavior data based on the mean and standard deviation to obtain the preprocessed behavior data.
Optionally, after the step of preprocessing the device monitoring data to obtain preprocessed device monitoring data, the method further includes:
selecting positive samples and negative samples from the device monitoring data, wherein a positive sample is device monitoring data containing the face action and a negative sample is device monitoring data not containing the face action; and
calculating a loss function based on the positive and negative samples, and tuning an initial model based on the loss function to obtain the preset detection model, wherein the initial model is formed by combining a deep neural network model and a generalized linear model.
Optionally, the step of detecting the preprocessed device monitoring data with the preset detection model and judging, based on the detection result, whether the video to be detected is a deepfake video includes:
inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result; and
judging, based on the model output result, whether the video to be detected is a deepfake video.
Optionally, the preset detection model includes a one-dimensional convolution module, a deep neural network model, and a generalized linear model, and the step of inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result includes:
inputting the preprocessed time-series data into the one-dimensional convolution module to output flattened data;
inputting the flattened data into the deep neural network model to obtain a first model output value;
inputting the preprocessed behavior data into the deep neural network model to obtain a second model output value;
concatenating the flattened data with the preprocessed behavior data to obtain spliced data, and inputting the spliced data into the generalized linear model to obtain a third model output value; and
adding the first model output value, the second model output value, and the third model output value to obtain the model output result.
In addition, to achieve the above purpose, the present invention further provides a deepfake video detection apparatus, comprising:
a face authentication module, configured to perform face authentication on a video to be detected to obtain face actions, wherein the face actions include a mouth-opening action, a blinking action, a head-shaking action, a nodding action, and a static action;
a data segmentation module, configured to segment, from the video to be detected, the device monitoring data corresponding to the face action, wherein the device monitoring data corresponds to the video segment in which a complete face action is performed;
a data processing module, configured to preprocess the device monitoring data to obtain preprocessed device monitoring data; and
a data detection module, configured to detect the preprocessed device monitoring data with a preset detection model and judge, based on the detection result, whether the video to be detected is a deepfake video, wherein the preset detection model is obtained by combining and tuning a deep neural network model and a generalized linear model.
In addition, to achieve the above purpose, the present invention further provides a deepfake video detection device, comprising: a memory, a processor, and a deepfake video detection program stored in the memory and executable on the processor, the deepfake video detection program being configured to implement the steps of the deepfake video detection method described above.
In addition, to achieve the above purpose, the present invention further provides a storage medium storing a deepfake video detection program which, when executed by a processor, implements the steps of the deepfake video detection method described above.
The method performs face authentication on the video to be detected to obtain face actions, including mouth-opening, blinking, head-shaking, nodding, and static actions; segments, from the video to be detected, the device monitoring data corresponding to each face action, i.e., the segment covering a complete face action; preprocesses the device monitoring data; and detects the preprocessed device monitoring data with a preset detection model obtained by combining and tuning a deep neural network model and a generalized linear model, judging from the detection result whether the video to be detected is a deepfake video. Unlike conventional deepfake detection methods that rely mainly on visual differences, the invention segments the device monitoring data corresponding to the face actions from the video to be detected and detects the video with the preset detection model built from a deep neural network model and a generalized linear model, thereby avoiding the over-reliance on visual-difference analysis in the prior art and accurately judging whether the video to be detected is a deepfake video.
Drawings
Fig. 1 is a schematic structural diagram of a deepfake video detection device in a hardware operating environment according to an embodiment of the present invention;
Fig. 2 is a flowchart of a first embodiment of the deepfake video detection method of the present invention;
Fig. 3 is a flowchart of a second embodiment of the deepfake video detection method of the present invention;
Fig. 4 is a flowchart of a third embodiment of the deepfake video detection method of the present invention;
Fig. 5 is a schematic diagram of a first flow for obtaining the model output result in the deepfake video detection method of the present invention;
Fig. 6 is a schematic diagram of a second flow for obtaining the model output result in the deepfake video detection method of the present invention;
Fig. 7 is a block diagram of a first embodiment of the deepfake video detection apparatus of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of a deepfake video detection device in a hardware operating environment according to an embodiment of the present invention.
As shown in Fig. 1, the deepfake video detection device may include a processor 1001 such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable nonvolatile memory (NVM) such as disk storage. Optionally, the memory 1005 may also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in Fig. 1 does not limit the deepfake video detection device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as one type of storage medium, may include an operating system, a network communication module, a user interface module, and a deepfake video detection program.
In the deepfake video detection device shown in Fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 invokes the deepfake video detection program stored in the memory 1005 and executes the deepfake video detection method provided by the embodiments of the present invention.
An embodiment of the present invention provides a deepfake video detection method; referring to Fig. 2, Fig. 2 is a flowchart of a first embodiment of the deepfake video detection method of the present invention.
In this embodiment, the deepfake video detection method includes the following steps:
Step S10: performing face authentication on the video to be detected to obtain face actions, wherein the face actions include a mouth-opening action, a blinking action, a head-shaking action, a nodding action, and a static action.
It should be noted that the execution body of the method of this embodiment may be a terminal device with face authentication, data processing, and program execution functions, such as a smartphone or a smartwatch, or an electronic device with the same or similar functions, such as the above-mentioned deepfake video detection device. This embodiment and the following embodiments are described using the deepfake video detection device (hereinafter, the detection device) as an example.
It can be understood that the video to be detected may be a video recorded by the current user holding a video recording device and containing face actions, where a face action may be a mouth-opening action, a blinking action, a head-shaking action, a nodding action, a static action, or any other action that reflects the state of the face.
In a specific implementation, face authentication may be performed on the video to be detected with a conventional face authentication model, for example an Eigenfaces model or an LBPH (Local Binary Patterns Histogram) model, which is not limited in this embodiment.
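As an illustration only (not part of the patent text), the following Python sketch shows how a conventional LBPH recognizer from OpenCV could back this face-authentication step; the enrollment image paths, the single-user label, and the distance threshold are assumptions.

```python
# Sketch only: conventional face authentication with OpenCV's LBPH recognizer.
# Requires opencv-contrib-python; image paths and the threshold are assumptions.
import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()

# Enroll grayscale face crops of the legitimate user, all labeled 0.
enroll_faces = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["user_0.png", "user_1.png"]]
recognizer.train(enroll_faces, np.array([0, 0]))

def authenticate(face_gray, max_distance=60.0):
    """Return True if the probe face crop matches the enrolled user."""
    label, distance = recognizer.predict(face_gray)  # smaller distance = closer match
    return label == 0 and distance < max_distance
```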
Step S20: and cutting out equipment monitoring data corresponding to the face action from the video to be detected, wherein the equipment monitoring data is a video segment for making a complete face action in the video to be detected.
It should be noted that, the device monitoring data and the face actions are in one-to-one correspondence. Illustratively, the face action a corresponds to the device monitoring data a, the face action B corresponds to the device monitoring data B, and the face action C corresponds to the device monitoring data C.
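For illustration, a minimal Python sketch of this one-to-one slicing, assuming the sensor stream is sampled at a known rate and the action's start and end times come from the face-authentication step (all names and values below are assumptions):

```python
# Sketch only: cut the device-monitoring segment covering one complete face action.
import numpy as np

def slice_monitoring_data(sensor_stream, sample_rate_hz, action_start_s, action_end_s):
    """sensor_stream: (num_samples, num_channels) array, e.g. accelerometer and
    gyroscope channels recorded while the video was captured."""
    start = int(action_start_s * sample_rate_hz)
    end = int(action_end_s * sample_rate_hz)
    return sensor_stream[start:end]

# Usage: a blink detected between 3.2 s and 4.0 s of the video (dummy 100 Hz stream).
stream = np.random.randn(6000, 6)
blink_segment = slice_monitoring_data(stream, 100, 3.2, 4.0)   # shape (80, 6)
```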
Step S30: and preprocessing the equipment monitoring data to obtain preprocessed equipment monitoring data.
In a specific implementation, because the sensor data in the device monitoring data has a time sequence relationship, the data volume of the device monitoring data can be huge and the required model reasoning time is short. Therefore, the frame extraction frequency can be fixed through the preprocessing, so that the equipment monitoring data is compressed, and a large amount of time and required memory for model training reasoning are saved. Meanwhile, the data enhancement can be performed through the preprocessing, namely, different initial frames are selected for multiple times to start to sequentially extract frames and noise data is added, so that the generalization capability of the model is enhanced. In addition, the pre-processing can also be used for carrying out standardized processing on behavior data (registration and login modes, buried point data and the like) and compressed sensor data in the equipment monitoring data.
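A minimal Python sketch of the fixed-frequency frame extraction and the augmentation described above (the offset choices, noise level, and array shapes are assumptions, not values from the patent):

```python
# Sketch only: fixed-frequency frame extraction and noise-based augmentation.
import numpy as np

def extract_frames(series, step, target_len, start=0):
    """Keep every `step`-th sample starting at `start`, up to a fixed number,
    compressing the time-series part of the device monitoring data."""
    return series[start::step][:target_len]

def augment(series, step, target_len, n_views=4, noise_std=0.01, seed=0):
    """Start the sequential frame extraction at several different initial frames
    and add Gaussian noise, as described above, to strengthen generalization."""
    rng = np.random.default_rng(seed)
    views = []
    for offset in rng.integers(0, step, size=n_views):
        view = extract_frames(series, step, target_len, start=int(offset))
        views.append(view + rng.normal(0.0, noise_std, size=view.shape))
    return np.stack(views)

segments = augment(np.random.randn(1000, 6), step=10, target_len=80)  # (4, 80, 6)
```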
Step S40: and detecting the preprocessed equipment monitoring data through a preset detection model, judging whether the video to be detected is a depth fake video or not based on a detection result, and combining and adjusting a depth neural network model and a generalized linear model to obtain the preset detection model.
It should be noted that the generalized linear model may be used to capture interactions between features, and the deep neural network model may be used to learn more complex feature representations, for modeling and predictive classification of device monitoring data.
In a specific implementation, multiple face actions in one video and whether a plurality of models make corresponding face actions or not can be integrated, and whether the video to be detected is a deep fake video injected by hijacking a mobile phone camera or not can be analyzed.
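As a hedged sketch of how the per-action judgments might be combined (the action names, scores, and voting rule below are assumptions):

```python
# Sketch only: combine per-action model scores into one verdict for the video.
def video_is_deepfake(action_scores, score_threshold=0.5, min_confirmed=None):
    """action_scores maps each prompted face action to the model's probability,
    predicted from its device-monitoring segment, that the action was really
    performed on this device; too few confirmed actions flags an injected video."""
    if min_confirmed is None:
        min_confirmed = len(action_scores)
    confirmed = sum(score >= score_threshold for score in action_scores.values())
    return confirmed < min_confirmed  # True -> suspected deepfake injection

print(video_is_deepfake({"open_mouth": 0.91, "blink": 0.87, "shake_head": 0.12}))
```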
This embodiment performs face authentication on the video to be detected to obtain face actions, including mouth-opening, blinking, head-shaking, nodding, and static actions; segments, from the video to be detected, the device monitoring data corresponding to each face action, i.e., the segment covering a complete face action; preprocesses the device monitoring data; and detects the preprocessed device monitoring data with a preset detection model obtained by combining and tuning a deep neural network model and a generalized linear model, judging from the detection result whether the video to be detected is a deepfake video. Unlike conventional deepfake detection methods that rely mainly on visual differences, this embodiment segments the device monitoring data corresponding to the face actions from the video to be detected and detects the video with the preset detection model, thereby avoiding the over-reliance on visual-difference analysis in the prior art and accurately judging whether the video to be detected is a deepfake video.
Referring to Fig. 3, Fig. 3 is a flowchart of a second embodiment of the deepfake video detection method of the present invention.
Based on the first embodiment, in this embodiment, in order to acquire the face actions in the video to be detected more accurately, step S10 may include:
Step S101: obtaining the mouth aspect ratio of the face in the video to be detected, and judging, based on the mouth aspect ratio, whether a mouth-opening action exists in the video to be detected.
Step S102: obtaining the eye aspect ratio of the face in the video to be detected, and judging, based on the eye aspect ratio, whether a blinking action exists in the video to be detected.
Step S103: obtaining the change in cheek width and the change in nose-to-chin distance of the face in the video to be detected, and judging, based on these changes, whether a head-shaking action and/or a nodding action exists in the video to be detected.
In a specific implementation, whether a mouth-opening action exists in the video to be detected can be judged by calculating the mouth aspect ratio (MAR, Mouth Aspect Ratio): when the MAR exceeds a set threshold, the mouth is considered open, so monitoring the MAR in real time reveals whether the mouth has been opened. Whether a blinking action exists can be determined by calculating the eye aspect ratio (EAR, Eye Aspect Ratio): while the eye is open the EAR fluctuates around a certain value, and when the eye closes the EAR drops rapidly and theoretically approaches zero, so monitoring the change of the EAR reveals whether the eyes blink. Head-shaking and nodding are detected by calculating the change in the width of the left and right cheeks and the change in the nose-to-chin distance: when the cheek widths change greatly and the nose-to-chin distance also changes obviously, the corresponding head action is judged to have occurred. Monitoring these facial-feature changes allows the face action to be judged in real time.
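For illustration, a Python sketch of the MAR/EAR computations described above, assuming 2-D facial landmarks in a dlib-style point ordering; the thresholds are illustrative, not values specified in the patent:

```python
# Sketch only: EAR/MAR from 2-D landmarks (dlib-style point ordering assumed).
import numpy as np

def _aspect_ratio(points, vertical_pairs, horizontal_pair):
    """Mean vertical distance divided by horizontal distance."""
    heights = [np.linalg.norm(points[a] - points[b]) for a, b in vertical_pairs]
    width = np.linalg.norm(points[horizontal_pair[0]] - points[horizontal_pair[1]])
    return float(np.mean(heights) / width)

def eye_aspect_ratio(eye):      # eye: (6, 2) array of landmarks
    return _aspect_ratio(eye, [(1, 5), (2, 4)], (0, 3))

def mouth_aspect_ratio(mouth):  # mouth: (8, 2) array of inner-lip landmarks
    return _aspect_ratio(mouth, [(2, 6), (3, 5)], (0, 4))

# Illustrative thresholds; the patent does not fix their values.
def is_blinking(ear_per_frame, ear_threshold=0.2):
    return min(ear_per_frame) < ear_threshold   # EAR collapses when the eye closes

def is_mouth_open(mar_per_frame, mar_threshold=0.6):
    return max(mar_per_frame) > mar_threshold   # MAR rises when the mouth opens
```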
Step S104: if none of the mouth-opening, blinking, head-shaking, and nodding actions exists in the video to be detected, and the motion amplitude of the face in the video to be detected is smaller than a preset amplitude threshold, judging that a static action exists in the video to be detected.
Further, based on the first embodiment, the preprocessed device monitoring data includes preprocessed time-series data and preprocessed behavior data, and step S30 may include:
Step S301: extracting frames from the time-series data in the device monitoring data at a fixed frequency and in a fixed number, and applying max-min normalization to the frame-extracted time-series data to obtain the preprocessed time-series data.
In a specific implementation, the maximum and minimum normalization processing can be performed on the time sequence data after frame extraction based on the following formula:
Wherein x is the time sequence data after frame extraction, x min is the minimum value of the time sequence data after frame extraction in the time dimension, x max is the maximum value of the time sequence data after frame extraction in the time dimension, and x' is the time sequence data after preprocessing.
Step S302: and counting the average value and standard deviation of the behavior data in the equipment monitoring data, and carrying out standard normalization processing on the behavior data based on the average value and the standard deviation to obtain the preprocessed behavior data.
In a specific implementation, the behavior data can be normalized based on the following formula:
x* = (x - x_mean) / x_std
where x is the behavior data in the device monitoring data, x_mean is its mean, x_std is its standard deviation, and x* is the preprocessed behavior data.
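Both normalizations can be written compactly; the following Python sketch (array shapes and the small epsilon are assumptions) mirrors the two formulas above:

```python
# Sketch only: the two normalizations applied during preprocessing.
import numpy as np

def min_max_normalize(time_series):
    """x' = (x - x_min) / (x_max - x_min), computed per channel along the
    time dimension of the frame-extracted time-series data (shape (T, C))."""
    x_min = time_series.min(axis=0, keepdims=True)
    x_max = time_series.max(axis=0, keepdims=True)
    return (time_series - x_min) / (x_max - x_min + 1e-8)   # epsilon guards constant channels

def standard_normalize(behavior, mean=None, std=None):
    """x* = (x - x_mean) / x_std, using the mean and standard deviation counted
    over the behavior data (shape (N, D))."""
    mean = behavior.mean(axis=0) if mean is None else mean
    std = behavior.std(axis=0) if std is None else std
    return (behavior - mean) / (std + 1e-8)
```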
In this embodiment, the mouth aspect ratio of the face in the video to be detected is obtained, and whether a mouth-opening action exists is judged based on it; the eye aspect ratio is obtained, and whether a blinking action exists is judged based on it; the change in cheek width and the change in nose-to-chin distance are obtained, and whether a head-shaking and/or nodding action exists is judged based on them; if none of these actions exists and the motion amplitude of the face is smaller than the preset amplitude threshold, a static action is judged to exist; the time-series data in the device monitoring data is frame-extracted at a fixed frequency and in a fixed number and max-min normalized to obtain the preprocessed time-series data; and the mean and standard deviation of the behavior data are computed and used for standard normalization to obtain the preprocessed behavior data. Compared with conventional deepfake video detection methods, this embodiment recognizes the face actions in the video to be detected from the mouth aspect ratio, eye aspect ratio, cheek width change, nose-to-chin distance change, and face motion amplitude, which improves the accuracy of face action recognition.
Referring to Fig. 4, Fig. 4 is a flowchart of a third embodiment of the deepfake video detection method of the present invention.
Based on the above embodiments, in this embodiment, after step S30, the method may further include:
Step S31: selecting positive samples and negative samples from the device monitoring data, wherein a positive sample is device monitoring data containing the face action and a negative sample is device monitoring data not containing the face action.
Step S32: calculating a loss function based on the positive and negative samples, and tuning an initial model based on the loss function to obtain the preset detection model, wherein the initial model is formed by combining a deep neural network model and a generalized linear model.
In a specific implementation, the loss function can be calculated based on the following formulas:
L_i = -[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
L = (1/N) * Σ_i L_i
where y_i denotes the label of sample i (1 for a positive sample, 0 for a negative sample), p_i denotes the probability that the model predicts sample i to be a positive sample, 1 - p_i denotes the probability that the model predicts sample i to be a negative sample, N denotes the number of samples, L_i denotes the loss value of the model on sample i, and L denotes the overall loss value of the model.
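This is the standard binary cross-entropy; a minimal Python sketch under that reading (the clipping epsilon and the example values are assumptions):

```python
# Sketch only: binary cross-entropy over positive/negative monitoring samples.
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """L = (1/N) * sum_i -[ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ],
    with y_i = 1 for a positive sample (action present) and 0 for a negative."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return per_sample.mean()

print(binary_cross_entropy(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.4])))
```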
Further, in this embodiment, the preset detection model includes a one-dimensional convolution module, a deep neural network model, and a generalized linear model, and step S40 may include:
Step S401: inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result.
Step S402: judging, based on the model output result, whether the video to be detected is a deepfake video.
In a specific implementation, the video to be detected may be scored based on the model output result, and when the score exceeds a preset score the video to be detected is judged to be a deepfake video.
Further, in this embodiment, step S401 may include:
Step S4021: inputting the preprocessed time-series data into the one-dimensional convolution module to output flattened data.
Step S4022: inputting the flattened data into the deep neural network model to obtain a first model output value.
Step S4023: inputting the preprocessed behavior data into the deep neural network model to obtain a second model output value.
Step S4024: concatenating the flattened data with the preprocessed behavior data to obtain spliced data, and inputting the spliced data into the generalized linear model to obtain a third model output value.
Step S4025: adding the first model output value, the second model output value, and the third model output value to obtain the model output result.
In a specific implementation, reference may be made to Fig. 5, which is a schematic diagram of the first flow for obtaining the model output result in the deepfake video detection method of the present invention. In Fig. 5, the preprocessed device monitoring data is divided into time-series data and behavior data. The time-series data is input into the Encoder (i.e., the one-dimensional convolution module) to obtain flattened data, which is then concatenated with the behavior data to obtain spliced data. The flattened data and the behavior data are each input into a Predictor model (i.e., the deep neural network model), yielding the first and second model output values respectively, while the spliced data is input into another Predictor model (i.e., the generalized linear model), yielding the third model output value. Finally, the first, second, and third model output values are added to obtain the model output result.
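A PyTorch sketch of this fusion, under illustrative layer sizes and input shapes; it is one reading of Fig. 5, not the patented implementation itself:

```python
# Sketch only: 1-D conv encoder + two deep predictors + a generalized-linear predictor.
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    """All layer sizes and the sigmoid readout are illustrative assumptions."""
    def __init__(self, ts_channels=6, ts_len=80, behavior_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # one-dimensional convolution module
            nn.Conv1d(ts_channels, ts_channels, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.Flatten(),                                  # flattened data
        )
        flat_dim = ts_channels * ((ts_len - 5) // 2 + 1)
        self.dnn_ts = nn.Sequential(nn.Linear(flat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.dnn_behavior = nn.Sequential(nn.Linear(behavior_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.glm = nn.Linear(flat_dim + behavior_dim, 1)   # generalized linear part on the spliced data

    def forward(self, ts, behavior):
        flat = self.encoder(ts)                            # (B, flat_dim)
        out1 = self.dnn_ts(flat)                           # first model output value
        out2 = self.dnn_behavior(behavior)                 # second model output value
        out3 = self.glm(torch.cat([flat, behavior], dim=1))  # third model output value
        return torch.sigmoid(out1 + out2 + out3)           # model output result as a score

# Usage with dummy tensors.
model = FusionDetector()
score = model(torch.randn(2, 6, 80), torch.randn(2, 16))
```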
In this embodiment, positive and negative samples are selected from the device monitoring data, a positive sample being device monitoring data containing the face action and a negative sample being device monitoring data not containing it; a loss function is calculated from these samples and used to tune an initial model formed by combining a deep neural network model and a generalized linear model, yielding the preset detection model; the preprocessed time-series data and preprocessed behavior data are input into the preset detection model; the preprocessed time-series data passes through the one-dimensional convolution module to output flattened data; the flattened data is input into the deep neural network model to obtain a first model output value; the preprocessed behavior data is input into the deep neural network model to obtain a second model output value; the flattened data and the preprocessed behavior data are concatenated and input into the generalized linear model to obtain a third model output value; and the three output values are added to obtain the model output result. Compared with conventional deepfake video detection methods, this embodiment feeds the preprocessed time-series data and behavior data into a preset detection model comprising a one-dimensional convolution module, a deep neural network model, and a generalized linear model, so a more comprehensive model output result is obtained and the detection accuracy for deepfake video is further improved.
Furthermore, in another embodiment, deepfake video detection may also be implemented as follows.
Part 1: segmenting the device monitoring data corresponding to each action. During face authentication in a loan application, the existing face authentication model recognizes the applicant performing various face actions such as opening the mouth, blinking, shaking the head, and staying still, and indirectly recognizes hand actions of moving the device forward, backward, up, and down while the face remains centered in the device camera. The device monitoring data corresponding to each action is segmented out; it includes phone sensor data and behavior data from operating the device. The label of each segment is whether the corresponding face action or hand movement was actually made (one action corresponds to one detection model). Forward and backward hand movement is detected from the change in the ratio of the face area to the whole frame: a ratio that keeps growing indicates forward movement, and a ratio that keeps shrinking indicates backward movement. Up and down hand movement is detected from the distance between the nose and the top edge of the frame: a distance that keeps growing indicates downward movement, and a distance that keeps shrinking indicates upward movement, as sketched below.
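A Python sketch of the two hand-movement rules above (the trend threshold and the per-frame measurements are assumptions):

```python
# Sketch only: forward/backward and up/down device movement from face geometry.
import numpy as np

def _trend(values, min_relative_change=0.15):
    """'increasing', 'decreasing' or 'stable' for a per-frame measurement."""
    values = np.asarray(values, dtype=float)
    change = (values[-1] - values[0]) / (abs(values[0]) + 1e-8)
    if change > min_relative_change:
        return "increasing"
    if change < -min_relative_change:
        return "decreasing"
    return "stable"

def hand_movements(face_area_ratio, nose_to_top_distance):
    """face_area_ratio: per-frame face-box area over frame area;
    nose_to_top_distance: per-frame distance from the nose landmark to the top
    edge of the frame. The mappings follow the description above."""
    moves = []
    fb = _trend(face_area_ratio)
    if fb == "increasing":
        moves.append("forward")
    elif fb == "decreasing":
        moves.append("backward")
    ud = _trend(nose_to_top_distance)
    if ud == "increasing":
        moves.append("down")
    elif ud == "decreasing":
        moves.append("up")
    return moves

print(hand_movements([0.10, 0.14, 0.19], [120, 121, 119]))   # ['forward']
```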
Part 2: data preprocessing. The sensor data in the device monitoring data is an embedding over time; its volume is huge while the allowed model inference time is short, so the device monitoring data is compressed by fixing the frame-extraction frequency, which saves a large amount of training and inference time and memory. Data augmentation is performed by selecting different initial frames several times to start sequential frame extraction and by adding noise, which helps strengthen the generalization ability of the model. In addition, the behavior data (registration method, event-tracking data, etc.) and the compressed sensor data in the device monitoring data are normalized.
Part 3: model construction. The model of this embodiment consists mainly of a time-series data model and a behavior data model. The time-series data model is an improved Wide & Deep model: the Deep layer of Wide & Deep is replaced by a recurrent neural network suited to time-series data, and self-attention is added on top of DeepFM to strengthen the interactions captured between features. The Deep layer first applies a one-dimensional convolution module (Encoder) that slides a window over the device monitoring data along the time dimension, extracting local temporal features that feed the recurrent neural network, so the deep features of the time-series data are fully mined. The Wide layer extracts statistical features of the time-series data, and its transformed result is added to the Deep result. The behavior data model adds self-attention on top of DeepFM to better capture the interactions between features. The whole model combines the results of the time-series data model and the behavior data model to classify the device monitoring data. Referring to Fig. 6, Fig. 6 is a schematic diagram of the second flow for obtaining the model output result in the deepfake video detection method of the present invention. In Fig. 6, model 1 uses the one-dimensional convolution module (Encoder) to slide a window over the time-series data along the time dimension; the convolution kernel length equals the number of frames in 2.5 seconds, the stride equals the number of frames in 1 second, and the number of output channels equals the number of input channels. The resulting 2-dimensional data is the input of the recurrent neural network (LSTM) in the time-series data model. The statistical features of the time-series data are the mean, standard deviation, maximum, minimum, range, quartiles, mean absolute deviation, and median absolute deviation of each feature attribute, flattened into 1-dimensional data; this is the statistical-feature extraction of model 1. The input of model 2 is one-dimensional sparse behavior data, which is compressed after the DeepFM module; the self-attention mechanism and a linear transformation of the self-attention output then yield the prediction for the behavior data. The prediction of the whole model is the normalized result of model 1 plus the normalized result of model 2.
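A PyTorch sketch of model 1 as described above: a Conv1d encoder whose kernel covers 2.5 seconds of samples with a 1-second stride and channel count equal to the input, an LSTM deep path, and a wide path over per-channel statistics. The sampling rate, hidden size, and the exact statistics implementation are assumptions, and the DeepFM-with-self-attention behavior model (model 2) is omitted.

```python
# Sketch only: improved Wide & Deep path for the time-series data (model 1).
import torch
import torch.nn as nn

class TimeSeriesWideDeep(nn.Module):
    def __init__(self, channels=6, sample_rate_hz=50, hidden=64, n_stats=8):
        super().__init__()
        k = int(2.5 * sample_rate_hz)                      # kernel = frames in 2.5 s
        s = int(1.0 * sample_rate_hz)                      # stride = frames in 1 s
        self.encoder = nn.Conv1d(channels, channels, kernel_size=k, stride=s)
        self.lstm = nn.LSTM(input_size=channels, hidden_size=hidden, batch_first=True)
        self.deep_head = nn.Linear(hidden, 1)
        self.wide = nn.Linear(channels * n_stats, 1)       # statistics flattened to 1-D

    @staticmethod
    def statistics(x):
        """mean, std, max, min, range, interquartile range, mean abs deviation,
        median abs deviation per channel, flattened (x: (B, C, T))."""
        q75, q25 = torch.quantile(x, 0.75, dim=2), torch.quantile(x, 0.25, dim=2)
        med = x.median(dim=2).values
        stats = [x.mean(2), x.std(2), x.amax(2), x.amin(2), x.amax(2) - x.amin(2),
                 q75 - q25, (x - x.mean(2, keepdim=True)).abs().mean(2),
                 (x - med.unsqueeze(2)).abs().median(dim=2).values]
        return torch.cat(stats, dim=1)

    def forward(self, x):                                  # x: (B, C, T)
        windows = self.encoder(x).transpose(1, 2)          # (B, T', C) for the LSTM
        _, (h, _) = self.lstm(windows)
        deep_out = self.deep_head(h[-1])                   # deep result
        wide_out = self.wide(self.statistics(x))           # wide result on statistics
        return deep_out + wide_out                         # wide and deep results added

model = TimeSeriesWideDeep()
out = model(torch.randn(2, 6, 400))                        # 8 s of 6-channel data at 50 Hz
```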
Part 4: result analysis. The judgments of the multiple models as to whether the corresponding face and hand actions were made, according to the device monitoring data, are combined across the several face actions and hand actions in one video to analyze whether the video is a deepfake video injected by hijacking the phone camera.
In addition, an embodiment of the present invention further provides a storage medium storing a deepfake video detection program which, when executed by a processor, implements the steps of the deepfake video detection method described above.
Referring to Fig. 7, Fig. 7 is a block diagram of a first embodiment of the deepfake video detection apparatus of the present invention.
As shown in Fig. 7, the deepfake video detection apparatus according to the embodiment of the present invention includes:
a face authentication module 701, configured to perform face authentication on a video to be detected to obtain face actions, wherein the face actions include a mouth-opening action, a blinking action, a head-shaking action, a nodding action, and a static action;
a data segmentation module 702, configured to segment, from the video to be detected, the device monitoring data corresponding to the face action, wherein the device monitoring data corresponds to the video segment in which a complete face action is performed;
a data processing module 703, configured to preprocess the device monitoring data to obtain preprocessed device monitoring data; and
a data detection module 704, configured to detect the preprocessed device monitoring data with a preset detection model and judge, based on the detection result, whether the video to be detected is a deepfake video, wherein the preset detection model is obtained by combining and tuning a deep neural network model and a generalized linear model.
This embodiment performs face authentication on the video to be detected to obtain face actions, including mouth-opening, blinking, head-shaking, nodding, and static actions; segments, from the video to be detected, the device monitoring data corresponding to each face action, i.e., the segment covering a complete face action; preprocesses the device monitoring data; and detects the preprocessed device monitoring data with a preset detection model obtained by combining and tuning a deep neural network model and a generalized linear model, judging from the detection result whether the video to be detected is a deepfake video. Unlike conventional deepfake detection methods that rely mainly on visual differences, this embodiment segments the device monitoring data corresponding to the face actions from the video to be detected and detects the video with the preset detection model, thereby avoiding the over-reliance on visual-difference analysis in the prior art and accurately judging whether the video to be detected is a deepfake video.
Based on the first embodiment of the deepfake video detection apparatus of the present invention, a second embodiment of the deepfake video detection apparatus is proposed.
In this embodiment, the face authentication module 701 is further configured to obtain the mouth aspect ratio of the face in the video to be detected and judge, based on it, whether a mouth-opening action exists; to obtain the eye aspect ratio of the face and judge, based on it, whether a blinking action exists; and to obtain the change in cheek width and the change in nose-to-chin distance of the face and judge, based on them, whether a head-shaking action and/or a nodding action exists in the video to be detected.
Further, the face authentication module 701 is further configured to judge that a static action exists in the video to be detected if none of the mouth-opening, blinking, head-shaking, and nodding actions exists and the motion amplitude of the face is smaller than the preset amplitude threshold.
Further, the preprocessed device monitoring data includes preprocessed time-series data and preprocessed behavior data, and the data processing module 703 is further configured to extract frames from the time-series data in the device monitoring data at a fixed frequency and in a fixed number, apply max-min normalization to the frame-extracted time-series data to obtain the preprocessed time-series data, compute the mean and standard deviation of the behavior data in the device monitoring data, and apply standard normalization to the behavior data based on them to obtain the preprocessed behavior data.
Further, the data processing module 703 is further configured to select positive and negative samples from the device monitoring data, a positive sample being device monitoring data containing the face action and a negative sample being device monitoring data not containing it, calculate a loss function based on these samples, and tune an initial model formed by combining a deep neural network model and a generalized linear model based on the loss function to obtain the preset detection model.
Further, the data detection module 704 is further configured to input the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result, and to judge, based on the model output result, whether the video to be detected is a deepfake video.
Further, the preset detection model includes a one-dimensional convolution module, a deep neural network model, and a generalized linear model, and the data detection module 704 is further configured to input the preprocessed time-series data into the one-dimensional convolution module to output flattened data, input the flattened data into the deep neural network model to obtain a first model output value, input the preprocessed behavior data into the deep neural network model to obtain a second model output value, concatenate the flattened data with the preprocessed behavior data and input the spliced data into the generalized linear model to obtain a third model output value, and add the first, second, and third model output values to obtain the model output result.
Other embodiments or specific implementations of the deepfake video detection apparatus of the present invention may refer to the above method embodiments and are not repeated here.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is preferred. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as read-only memory/random-access memory, a magnetic disk, or an optical disc) and including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention and does not limit the scope of the patent; any equivalent structure or process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (7)

1. A deepfake video detection method, comprising the following steps:
performing face authentication on a video to be detected to obtain face actions, wherein the face actions comprise a mouth-opening action, a blinking action, a head-shaking action, a nodding action, and a static action;
segmenting, from the video to be detected, device monitoring data corresponding to the face action, wherein the device monitoring data corresponds to the video segment in which a complete face action is performed;
preprocessing the device monitoring data to obtain preprocessed device monitoring data; and
detecting the preprocessed device monitoring data with a preset detection model and judging, based on the detection result, whether the video to be detected is a deepfake video, wherein the preset detection model is obtained by combining and tuning a deep neural network model and a generalized linear model;
wherein the step of preprocessing the device monitoring data to obtain preprocessed device monitoring data comprises:
extracting frames from time-series data in the device monitoring data at a fixed frequency and in a fixed number, and applying max-min normalization to the frame-extracted time-series data to obtain preprocessed time-series data; and
computing the mean and standard deviation of behavior data in the device monitoring data, and applying standard normalization to the behavior data based on the mean and standard deviation to obtain preprocessed behavior data;
wherein the step of detecting the preprocessed device monitoring data with the preset detection model and judging, based on the detection result, whether the video to be detected is a deepfake video comprises:
inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result; and
judging, based on the model output result, whether the video to be detected is a deepfake video;
wherein the step of inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result comprises:
inputting the preprocessed time-series data into a one-dimensional convolution module to output flattened data;
inputting the flattened data into the deep neural network model to obtain a first model output value;
inputting the preprocessed behavior data into the deep neural network model to obtain a second model output value;
concatenating the flattened data with the preprocessed behavior data to obtain spliced data, and inputting the spliced data into the generalized linear model to obtain a third model output value; and
adding the first model output value, the second model output value, and the third model output value to obtain the model output result.
2. The deepfake video detection method according to claim 1, wherein the step of performing face authentication on the video to be detected to obtain face actions comprises:
obtaining the mouth aspect ratio of the face in the video to be detected, and judging, based on the mouth aspect ratio, whether a mouth-opening action exists in the video to be detected;
obtaining the eye aspect ratio of the face in the video to be detected, and judging, based on the eye aspect ratio, whether a blinking action exists in the video to be detected; and
obtaining the change in cheek width and the change in nose-to-chin distance of the face in the video to be detected, and judging, based on these changes, whether a head-shaking action and/or a nodding action exists in the video to be detected.
3. The deepfake video detection method according to claim 2, wherein the step of performing face authentication on the video to be detected to obtain face actions further comprises:
if none of the mouth-opening, blinking, head-shaking, and nodding actions exists in the video to be detected, and the motion amplitude of the face in the video to be detected is smaller than a preset amplitude threshold, judging that a static action exists in the video to be detected.
4. The deepfake video detection method according to claim 1, wherein after the step of preprocessing the device monitoring data to obtain preprocessed device monitoring data, the method further comprises:
selecting positive samples and negative samples from the device monitoring data, wherein a positive sample is device monitoring data containing the face action and a negative sample is device monitoring data not containing the face action; and
calculating a loss function based on the positive and negative samples, and tuning an initial model based on the loss function to obtain the preset detection model, wherein the initial model is formed by combining a deep neural network model and a generalized linear model.
5. A deep fake video detection device, the device comprising:
a face authentication module, configured to perform face authentication on a video to be detected to obtain a face action, wherein the face action comprises a mouth-opening action, a blinking action, a head-shaking action, a nodding action and a static action;
a data segmentation module, configured to segment, from the video to be detected, device monitoring data corresponding to the face action, wherein the device monitoring data is the video segment of the video to be detected in which a complete face action is performed;
a data processing module, configured to preprocess the device monitoring data to obtain preprocessed device monitoring data; and
a data detection module, configured to detect the preprocessed device monitoring data through a preset detection model and to judge, based on a detection result, whether the video to be detected is a deep fake video, wherein the preset detection model is obtained by combining and adjusting a deep neural network model and a generalized linear model;
wherein the step of preprocessing the device monitoring data to obtain preprocessed device monitoring data comprises:
performing frame extraction on time-series data in the device monitoring data at a fixed frequency and with a fixed number of frames, and performing max-min normalization on the extracted time-series data to obtain preprocessed time-series data;
calculating the mean and standard deviation of behavior data in the device monitoring data, and performing standard normalization on the behavior data based on the mean and the standard deviation to obtain preprocessed behavior data;
the step of detecting the preprocessed device monitoring data through the preset detection model and judging, based on the detection result, whether the video to be detected is a deep fake video comprises:
inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result;
judging, based on the model output result, whether the video to be detected is a deep fake video;
the step of inputting the preprocessed time-series data and the preprocessed behavior data into the preset detection model to obtain a model output result comprises:
inputting the preprocessed time-series data into the one-dimensional convolution module and outputting flattened data;
inputting the flattened data into the deep neural network model to obtain a first model output value;
inputting the preprocessed behavior data into the deep neural network model to obtain a second model output value;
concatenating the flattened data and the preprocessed behavior data to obtain concatenated data, and inputting the concatenated data into the generalized linear model to obtain a third model output value; and
adding the first model output value, the second model output value and the third model output value to obtain the model output result.
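To make the preprocessing step recited in claim 5 concrete, the following NumPy sketch performs fixed-frequency, fixed-count frame extraction with max-min normalization for the time-series data and mean/standard-deviation standardization for the behavior data. The sampling step, frame count, array shapes and padding strategy are illustrative assumptions.

# Sketch of the preprocessing step: fixed-frequency/fixed-count frame extraction
# with max-min normalization for time-series data, and mean/std standardization
# for behavior data. Shapes and parameters are illustrative assumptions.
import numpy as np

def preprocess_time_series(ts: np.ndarray, step: int = 5, n_frames: int = 64) -> np.ndarray:
    """ts: (T, C) per-frame time-series signals from the device monitoring data."""
    sampled = ts[::step][:n_frames]                   # fixed frequency, fixed count
    if len(sampled) < n_frames:                       # pad short clips by repeating the last frame
        pad = np.repeat(sampled[-1:], n_frames - len(sampled), axis=0)
        sampled = np.concatenate([sampled, pad], axis=0)
    mn, mx = sampled.min(axis=0), sampled.max(axis=0)
    return (sampled - mn) / (mx - mn + 1e-8)          # max-min normalization to [0, 1]

def preprocess_behavior(beh: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """beh: (D,) behavior features; mean/std are statistics of the behavior data."""
    return (beh - mean) / (std + 1e-8)                # standard (z-score) normalization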
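The data flow through the preset detection model described in claim 5 (one-dimensional convolution, flattening, two deep branches, a generalized linear part over the concatenated features, and summation of the three output values) could be realised along the lines of the PyTorch sketch below. Layer widths, kernel size, the use of two separate deep branches and the final sigmoid are assumptions; the claims fix only the overall structure.

# Sketch of the preset detection model: a 1-D convolution module feeding a deep
# branch, a second deep branch for behavior data, and a generalized linear part
# over the concatenated features; the three output values are summed.
import torch
from torch import nn

class HybridDetector(nn.Module):
    def __init__(self, ts_channels: int, ts_len: int, beh_dim: int):
        super().__init__()
        # One-dimensional convolution module; its output is flattened.
        self.conv = nn.Sequential(
            nn.Conv1d(ts_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        flat_dim = 16 * ts_len
        # Deep branch over the flattened convolution output -> first output value.
        self.deep_ts = nn.Sequential(nn.Linear(flat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Deep branch over the behavior data -> second output value.
        self.deep_beh = nn.Sequential(nn.Linear(beh_dim, 32), nn.ReLU(), nn.Linear(32, 1))
        # Generalized linear model over the concatenated features -> third output value.
        self.glm = nn.Linear(flat_dim + beh_dim, 1)

    def forward(self, ts: torch.Tensor, beh: torch.Tensor) -> torch.Tensor:
        # ts: (B, ts_channels, ts_len); beh: (B, beh_dim)
        flat = self.conv(ts)
        out1 = self.deep_ts(flat)
        out2 = self.deep_beh(beh)
        out3 = self.glm(torch.cat([flat, beh], dim=1))
        return out1 + out2 + out3      # summed model output (a logit)

# Illustrative usage: a clip is judged fake or genuine by thresholding the output.
# model = HybridDetector(ts_channels=6, ts_len=64, beh_dim=12)
# score = torch.sigmoid(model(ts_batch, beh_batch))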
6. A deep fake video detection equipment, the equipment comprising: a memory, a processor, and a deep fake video detection program stored on the memory and executable on the processor, wherein the deep fake video detection program is configured to implement the steps of the deep fake video detection method according to any one of claims 1 to 4.
7. A storage medium, on which a deep fake video detection program is stored, wherein the deep fake video detection program, when executed by a processor, implements the steps of the deep fake video detection method according to any one of claims 1 to 4.
CN202311827980.8A 2023-12-27 2023-12-27 Depth fake video detection method, device, equipment and storage medium Active CN117690061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311827980.8A CN117690061B (en) 2023-12-27 2023-12-27 Depth fake video detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117690061A (en) 2024-03-12
CN117690061B (en) 2024-05-17

Family

ID=90131929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311827980.8A Active CN117690061B (en) 2023-12-27 2023-12-27 Depth fake video detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117690061B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818915A (en) * 2021-02-25 2021-05-18 华南理工大学 Depth counterfeit video detection method and system based on 3DMM soft biological characteristics
CN113627256A (en) * 2021-07-09 2021-11-09 武汉大学 Method and system for detecting counterfeit video based on blink synchronization and binocular movement detection
CN114078119A (en) * 2021-11-18 2022-02-22 厦门市美亚柏科信息股份有限公司 Depth-forged video detection method and system based on optical flow method
CN114550268A (en) * 2022-03-01 2022-05-27 北京赛思信安技术股份有限公司 Depth-forged video detection method utilizing space-time characteristics
CN114627412A (en) * 2022-03-07 2022-06-14 公安部第三研究所 Method, device and processor for realizing unsupervised depth forgery video detection processing based on error reconstruction and computer storage medium thereof
CN114898269A (en) * 2022-05-20 2022-08-12 公安部第三研究所 System, method, device, processor and storage medium for realizing deep forgery fusion detection based on eye features and face features
CN115273186A (en) * 2022-07-18 2022-11-01 中国人民警察大学 Depth-forged face video detection method and system based on image feature fusion
CN116994175A (en) * 2023-07-14 2023-11-03 中国科学院软件研究所 Space-time combination detection method, device and equipment for depth fake video
CN117197857A (en) * 2023-05-04 2023-12-08 支付宝(杭州)信息技术有限公司 Face counterfeiting attack detection and face recognition method, device and equipment
CN117275064A (en) * 2023-09-19 2023-12-22 中国科学院计算技术研究所 Face video depth forging detection method and device based on face time sequence information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A high-image-quality virtual viewpoint rendering method and GPU acceleration; Chen Luyao et al.; 《小型微型计算机系统》 (Journal of Chinese Computer Systems); 2020-10-30; Vol. 41, No. 10, pp. 2212-2218 *
Unsupervised face forgery video detection based on reconstruction error; Xu Zhe et al.; 《计算机应用》 (Journal of Computer Applications); 2023-05-10; Vol. 43, No. 5, pp. 1571-1577 *
Forged face video detection method fusing global temporal and local spatial features; Chen Peng et al.; 《信息安全学报》 (Journal of Cyber Security); 2020-03-31; Vol. 5, No. 2, pp. 73-83 *

Also Published As

Publication number Publication date
CN117690061A (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN108717663B (en) Facial tag fraud judging method, device, equipment and medium based on micro expression
CN109858375B (en) Living body face detection method, terminal and computer readable storage medium
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
TW202004637A (en) Risk prediction method and apparatus, storage medium, and server
JP7454105B2 (en) Facial image quality evaluation method and device, computer equipment and computer program
CN109815797B (en) Living body detection method and apparatus
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
WO2021088640A1 (en) Facial recognition technology based on heuristic gaussian cloud transformation
CN110472693B (en) Image processing and classifying method and system
US9378406B2 (en) System for estimating gender from fingerprints
CN115186303B (en) Financial signature safety management method and system based on big data cloud platform
CN110795714A (en) Identity authentication method and device, computer equipment and storage medium
CN115376559A (en) Emotion recognition method, device and equipment based on audio and video
CN112307937A (en) Deep learning-based identity card quality inspection method and system
Garhawal et al. A study on handwritten signature verification approaches
CN117690061B (en) Depth fake video detection method, device, equipment and storage medium
CN113591603A (en) Certificate verification method and device, electronic equipment and storage medium
CN114373213A (en) Juvenile identity recognition method and device based on face recognition
CN113657498A (en) Biological feature extraction method, training method, authentication method, device and equipment
Jalal et al. Facial Mole Detection Approach for Suspect Face Identification using ResNeXt-50
Majidpour et al. Unreadable offline handwriting signature verification based on generative adversarial network using lightweight deep learning architectures
CN114760484B (en) Live video identification method, live video identification device, computer equipment and storage medium
CN111738012B (en) Method, device, computer equipment and storage medium for extracting semantic alignment features
CN115953819A (en) Training method, device and equipment of face recognition model and storage medium
Chetty et al. Multimodal feature fusion for video forgery detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant