CN115424163A - Lip-shape modified counterfeit video detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115424163A
CN115424163A
Authority
CN
China
Prior art keywords
video
lip
random forest
image
modification
Prior art date
Legal status
Pending
Application number
CN202210938861.9A
Other languages
Chinese (zh)
Inventor
张盛辉
赵明明
李海涛
张磊
王静
Current Assignee
Wuhan University WHU
Wuhan Fiberhome Technical Services Co Ltd
Original Assignee
Wuhan University WHU
Wuhan Fiberhome Technical Services Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU, Wuhan Fiberhome Technical Services Co Ltd filed Critical Wuhan University WHU
Priority to CN202210938861.9A priority Critical patent/CN115424163A/en
Publication of CN115424163A publication Critical patent/CN115424163A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection

Abstract

The invention discloses a lip-modification forged-video detection method, device, equipment and storage medium. The method comprises: extracting a lip region image from a video to be detected and acquiring image features in the lip region image; training a random forest model with a training set from a video data set to obtain a trained random forest model; and inputting the image features into the trained random forest to judge whether the video to be detected is a genuine video or a forged video. The method can quickly judge the authenticity of a video, speeds up training on video data, prevents image jitter in the video from affecting feature extraction, improves the accuracy of lip-motion detection and the speed of per-frame feature extraction, reduces the influence of noise or outlier points on lip detection, maintains high precision in detecting a person's lips in video, and improves both the accuracy and the speed of lip-modification forgery detection.

Description

Lip-shape modified counterfeit video detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of computer vision recognition, and in particular to a method, device, equipment and storage medium for detecting lip-modification forged videos.
Background
Deepfake deep forgery now has abundant face-transformation applications. The technology was first developed by a user of the entertainment, social and news website Reddit, and can fit the facial image of any person into a variety of portrait videos.
To prevent modified videos from being put to illegal use, several detection web services are already available in China for detecting face modification in video: for example, the "H5 video liveness detection" of an open artificial intelligence (AI) platform, Application Programming Interfaces (APIs) for recognizing composite face images, the Sharp-MIL (S-MIL) detection algorithm of Alibaba Security's Turing Lab for detecting deep-fake works, and the AntFakes fake-face discrimination technology of Tencent Cloud AI Vision for face-authenticity analysis and image/video risk-level assessment.
In 2020, a team from IIIT Hyderabad, India and the University of Bath, UK published the paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild" at ACM MM 2020. In it, they propose an AI model called Wav2Lip which requires only a video of a person and a target speech clip to merge the audio and video so that the person's mouth shape matches the audio perfectly.
At present, little research addresses lip-modification detection, and good models or algorithms are urgently needed to detect the authenticity of lip-modified videos.
Disclosure of Invention
The main aim of the invention is to provide a method, device, equipment and storage medium for detecting lip-modification forged videos, in order to solve the technical problem that the prior art lacks a method for authenticating lip-modified videos, so that the authenticity of such videos cannot be judged and the video recognition error rate is high.
In a first aspect, the present invention provides a lip modification spoofing video detection method, including the steps of:
extracting a lip region image of a video to be detected, and acquiring image characteristics in the lip region image;
training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and inputting the image features into a trained random forest, and judging whether the video to be detected is a genuine video or a forged video.
Optionally, the extracting a lip region image of a video to be detected, and acquiring image features in the lip region image includes:
carrying out graying processing and normalization on each frame of image in a video to be detected to obtain a target detection video with standardized gray value data;
extracting a lip region image of the target detection video;
extracting image features in the lip region image using an underlying feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
Optionally, the extracting a lip region image of a video to be detected includes:
extracting key points of the face in the video to be detected by using a face key point detection algorithm;
lip key points are obtained from the key points according to lip region coordinates, and region video images are extracted from the video to be detected according to the lip key points and serve as lip region images.
Optionally, the extracting image features in the lip region image by using an essential feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net includes:
acquiring face key feature point coordinates in the lip region image, and taking the face key feature point coordinates as corner features;
obtaining histograms of the three RGB channels of the lip region image through the calcHist function, determining an image binarization threshold according to the histograms, and obtaining data features of the grey distribution in the three colours corresponding to RGB according to the image binarization threshold;
extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net;
and extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
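The RGB-histogram step above can be sketched as follows. This is a rough stand-in, not the patent's implementation: `np.histogram` replaces OpenCV's `cv2.calcHist`, and since the patent does not say how the binarization threshold is derived from the histograms, Otsu's method is used here as one plausible assumption.

```python
import numpy as np

def channel_histograms(img):
    """Per-channel 256-bin histograms of an H x W x 3 uint8 image
    (NumPy stand-in for one cv2.calcHist call per channel)."""
    return [np.histogram(img[..., c], bins=256, range=(0, 256))[0]
            for c in range(3)]

def otsu_threshold(hist):
    """Binarization threshold from a 256-bin histogram via Otsu's method.
    The patent does not spell out its thresholding rule, so this is an
    illustrative choice, not the claimed method."""
    probs = hist / hist.sum()
    cum_p = np.cumsum(probs)                         # class-0 weight up to t
    cum_mean = np.cumsum(probs * np.arange(256))     # cumulative mean
    mean_all = cum_mean[-1]
    best_t, best_var = 0, 0.0
    for t in range(1, 255):
        w0, w1 = cum_p[t], 1.0 - cum_p[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t] / w0
        m1 = (mean_all - cum_mean[t]) / w1
        var = w0 * w1 * (m0 - m1) ** 2               # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

On a bimodal lip image the returned threshold separates the two dominant grey populations, which is what the subsequent grey-distribution features are computed from.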
Optionally, before the training set in the video data set is used to train a random forest model and a trained random forest model is obtained, the lip modification forgery video detection method further includes:
the method comprises the steps of using a Markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator and a MesoInceptation-4 discriminator as five decision trees, and forming a random forest model according to the five decision trees.
Optionally, the training the random forest model with a training set in the video data set to obtain a trained random forest model includes:
acquiring a corresponding data set from a video data set according to a preset random proportion to serve as a training set and a test set;
and training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
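The random train/test split described above can be sketched as below; the 80/20 proportion and the fixed seed are illustrative assumptions, since the patent only says the proportion is preset.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Randomly split a list of (video_features, label) samples into a
    training set and a test set by a preset proportion. The 80/20 ratio
    and the seed are illustrative, not taken from the patent."""
    rng = random.Random(seed)
    shuffled = samples[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Each discriminator in the ensemble would then be fitted on the training partition and evaluated on the test partition.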
Optionally, the inputting the image features into a trained random forest, and determining that the video to be detected is a true video or a fake video includes:
inputting the image features into a trained random forest, obtaining the mode of the outputs of all decision trees in the random forest, and judging whether the video to be detected is a genuine video or a forged video according to that mode.
In a second aspect, to achieve the above object, the present invention further proposes a lip modification forgery video detection device including:
the characteristic extraction module is used for extracting a lip region image of a video to be detected and acquiring image characteristics in the lip region image;
the training module is used for training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and the judging module is used for inputting the image characteristics into the trained random forest and judging the video to be detected as a true video or a forged video.
In a third aspect, to achieve the above object, the present invention further proposes a lip modification forgery video detection apparatus including: a memory, a processor, and a lip modification spoof video detection program stored on the memory and executable on the processor, the lip modification spoof video detection program configured to implement the steps of the lip modification spoof video detection method as described above.
In a fourth aspect, to achieve the above object, the present invention further proposes a storage medium having stored thereon a lip-modification-counterfeit video detection program, which when executed by a processor implements the steps of the lip-modification-counterfeit video detection method as described above.
The lip-modification forged-video detection method provided by the invention comprises: extracting a lip region image of a video to be detected and acquiring image features in the lip region image; training a random forest model with a training set from a video data set to obtain a trained random forest model; and inputting the image features into the trained random forest to judge whether the video to be detected is a genuine video or a forged video. The method can quickly judge the authenticity of a video, speeds up training on video data, prevents image jitter in the video from affecting feature extraction, improves the accuracy of lip-motion detection and the speed of per-frame feature extraction, reduces the influence of noise or outlier points on lip detection, maintains high precision in detecting a person's lips in video, and improves both the accuracy and the speed of lip-modification forgery detection.
Drawings
Fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a lip-modified counterfeit video detection method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a lip-modified counterfeit video detection method according to the present invention;
FIG. 4 is a flowchart illustrating a third embodiment of a lip-modified counterfeit video detection method according to the present invention;
FIG. 5 is a flowchart illustrating a fourth embodiment of a lip-modified counterfeit video detection method according to the present invention;
fig. 6 is a flowchart illustrating a fifth embodiment of the lip-modified counterfeit video detection method according to the present invention;
FIG. 7 is a flowchart illustrating a sixth embodiment of a lip-modified counterfeit video detection method according to the present invention;
fig. 8 is a flowchart illustrating a seventh embodiment of a lip-modified counterfeit video detection method according to the present invention;
fig. 9 is a functional block diagram of a first embodiment of the lip-modified counterfeit video detection apparatus according to the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The solution of the embodiment of the invention is mainly as follows: extracting a lip region image from the video to be detected to obtain image features in the lip region image; training a random forest model with a training set from a video data set to obtain a trained random forest model; and inputting the image features into the trained random forest to judge whether the video to be detected is a genuine video or a forged video. This quickly judges the authenticity of a video, speeds up training on video data, prevents image jitter in the video from affecting feature extraction, improves the accuracy of lip-motion detection and the speed of per-frame feature extraction, reduces the influence of noise or outlier points on lip detection, and maintains high precision in detecting a person's lips in video, improving both the accuracy and the speed of lip-modification forgery detection. It thereby solves the technical problem that the prior art lacks a method for authenticating lip-modified videos, so that video authenticity cannot be judged and the recognition error rate is high.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the apparatus may include a processor 1001 (e.g. a CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 implements connection and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g. a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory such as a disk memory; alternatively, the memory 1005 may be a storage device separate from the aforementioned processor 1001.
Those skilled in the art will appreciate that the configuration of the device shown in fig. 1 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, as a storage medium, may contain an operating system, a network communication module, a user interface module and a lip-modification forged-video detection program.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and performs the following operations:
extracting a lip region image of a video to be detected, and acquiring image characteristics in the lip region image;
training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and inputting the image features into a trained random forest, and judging whether the video to be detected is a genuine video or a forged video.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
carrying out graying processing and normalization on each frame of image in a video to be detected to obtain a target detection video with standardized gray value data;
extracting a lip region image of the target detection video;
extracting image features in the lip region image using a base feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
extracting key points of the face in the video to be detected by using a face key point detection algorithm;
lip key points are obtained from the key points according to lip region coordinates, and region video images are extracted from the video to be detected according to the lip key points and serve as lip region images.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
acquiring face key feature point coordinates in the lip region image, and taking the face key feature point coordinates as corner features;
obtaining histograms of the three RGB channels of the lip region image through the calcHist function, determining an image binarization threshold according to the histograms, and obtaining data features of the grey distribution in the three colours corresponding to RGB according to the image binarization threshold;
extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net;
and extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
the method comprises the steps of using a Markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator and a MesoInception-4 discriminator as five decision trees, and forming a random forest model from these five decision trees.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
acquiring a corresponding data set from a video data set according to a preset random proportion to serve as a training set and a test set;
and training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
The apparatus of the present invention calls the lip-modified falsification video detection program stored in the memory 1005 through the processor 1001, and further performs the following operations:
inputting the image features into a trained random forest, obtaining the mode of the outputs of all decision trees in the random forest, and judging whether the video to be detected is a genuine video or a forged video according to that mode.
According to this scheme, the lip region image of the video to be detected is extracted and the image features in the lip region image are obtained; a random forest model is trained with a training set from a video data set to obtain a trained random forest model; and the image features are input into the trained random forest to judge whether the video to be detected is a genuine video or a forged video. The method can quickly judge the authenticity of a video, speeds up training on video data, prevents image jitter in the video from affecting feature extraction, improves the accuracy of lip-motion detection and the speed of per-frame feature extraction, reduces the influence of noise or outlier points on lip detection, maintains high precision in detecting a person's lips in video, and improves both the accuracy and the speed of lip-modification forgery detection.
Based on the hardware structure, the embodiment of the lip-modified forged video detection method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a lip-modified counterfeit video detection method according to a first embodiment of the present invention.
In a first embodiment, the lip modification spoofed video detection method includes the steps of:
and S10, extracting a lip region image of the video to be detected, and acquiring image characteristics in the lip region image.
It should be noted that the video to be detected is a video whose authenticity needs to be determined. To reduce the amount of computation, the lip region may be extracted on its own; that is, the lip region image of the video to be detected is extracted, after which the lip image features in the lip region image can be extracted.
And S20, training the random forest model by using the training set in the video data set to obtain the trained random forest model.
It can be understood that the video data set is an original real video image data set collected in advance, a training set can be obtained from the video data set, and a random forest model can be trained through the training set, so that a trained random forest model is obtained.
It should be noted that a random forest is a typical bagging algorithm in ensemble learning. Ensemble learning means that the overall model is composed of multiple weakly supervised models, each of which performs well only in some respect; when these supervised algorithms are combined into a whole, a stable model that performs well in all respects is obtained. The random forest is a typical representative of ensemble learning: the "forest" refers to the many decision trees contained inside the model, so a model containing multiple decision trees can be regarded as a forest. Each decision tree in a random forest is trained on a randomly sampled part of the data set, i.e. it looks at the problem from a different angle, which ensures that the outputs of the trees are similar but not identical; finally, the outputs of all decision trees are aggregated to obtain the final output.
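The aggregation step described above can be sketched as a majority vote. This is a toy illustration, not the patent's implementation: the five real members (Markov discriminator, SVM, DenseNet, XceptionNet, MesoInception-4) are replaced by stand-in callables, and the final label is the mode of their votes.

```python
from collections import Counter

class HeterogeneousForest:
    """Majority-vote ensemble in the spirit of the patent's 'random
    forest' of five unlike discriminators. Members are any callables
    mapping a feature vector to 0 (genuine) or 1 (forged)."""

    def __init__(self, discriminators):
        self.discriminators = list(discriminators)

    def predict(self, features):
        votes = [d(features) for d in self.discriminators]
        # the final output is the mode of the per-tree outputs
        return Counter(votes).most_common(1)[0][0]

# toy stand-ins for the five trained discriminators
forest = HeterogeneousForest([
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
    lambda x: 1,   # a member that always says 'forged'
    lambda x: 0,   # a member that always says 'genuine'
])
label = forest.predict([0.9, 0.8])   # 4 of 5 members vote 'forged'
```

Because the members disagree in different regions of feature space, the mode is more stable than any single discriminator, which is the motivation the patent gives for the ensemble.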
And S30, inputting the image characteristics into a trained random forest, and judging whether the video to be detected is a true video or a forged video.
It should be understood that the image features are input into the trained random forest to obtain a corresponding output result, and the video to be detected can be judged to be a true video or a fake video according to the output result.
According to this scheme, the lip region image of the video to be detected is extracted and the image features in the lip region image are obtained; a random forest model is trained with a training set from a video data set to obtain a trained random forest model; and the image features are input into the trained random forest to judge whether the video to be detected is a genuine video or a forged video. The method can quickly judge the authenticity of a video, speeds up training on video data, prevents image jitter in the video from affecting feature extraction, improves the accuracy of lip-motion detection and the speed of per-frame feature extraction, reduces the influence of noise or outlier points on lip detection, maintains high precision in detecting a person's lips in video, and improves both the accuracy and the speed of lip-modification forgery detection.
Further, fig. 3 is a flowchart illustrating a second embodiment of the lip modification falsification video detection method according to the present invention, and as shown in fig. 3, the second embodiment of the lip modification falsification video detection method according to the present invention is proposed based on the first embodiment, in this embodiment, the step S10 specifically includes the following steps:
and S11, performing graying processing and normalization on each frame of image in the video to be detected to obtain a target detection video with standardized gray value data.
It should be noted that image data preprocessing may be performed before extracting the lip region image, that is, graying and normalizing each frame image in the video to be detected to obtain the target detection video with normalized grayscale data.
In a specific implementation, to reduce energy consumption and keep the algorithm running efficiently when detecting each frame image in the video, every frame is first converted to grey scale and normalized so that the grey-value data are standardized. Let the sample be F = (f(x_0, y_0), f(x_0, y_1), …, f(x_i, y_i)), where f(x_m, y_n) is a grey value in the sample, f_min is the minimum grey value and f_max is the maximum grey value. The normalization is then carried out as:

Normal = 2 × (f(x_m, y_n) − f_min) / (f_max − f_min) − 1

where Normal is the standardized grey value. After processing, all grey values lie between -1 and 1, which eliminates the influence of abnormal grey values on the neural network and reduces the numeric differences between grey values, thereby shortening training time.
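A minimal sketch of this grey-value standardization, mapping each frame's grey values into [-1, 1]; the divide-by-zero guard for constant frames is an added assumption not stated in the source.

```python
import numpy as np

def normalize_frame(gray):
    """Standardize a grey-scale frame to the range [-1, 1]:
    Normal = 2 * (f - f_min) / (f_max - f_min) - 1."""
    g = gray.astype(np.float64)
    f_min, f_max = g.min(), g.max()
    if f_max == f_min:           # constant frame: avoid division by zero
        return np.zeros_like(g)
    return 2.0 * (g - f_min) / (f_max - f_min) - 1.0
```

Applied per frame, this removes the absolute grey-level offset between frames before features are fed to the network.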
And S12, extracting a lip region image of the target detection video.
It is understood that a region image corresponding to the face lip in the target detection video may be extracted.
And S13, extracting image features in the lip region image by using a basic feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
It should be understood that the base Feature Extraction can be performed by using a base Feature Extraction network (FFE-Net), and the lip Feature Extraction can be performed by using a lip Representative Feature Extraction and Classification network (RC-Net), that is, the image features in the lip region image are extracted by using the FFE-Net and the RC-Net.
According to this scheme, each frame image in the video to be detected is converted to grey scale and normalized to obtain a target detection video with standardized grey-value data; the lip region image of the target detection video is extracted; and image features in the lip region image are extracted using the basic feature extraction network FFE-Net and the lip representative feature extraction and classification network RC-Net. This reduces the processing cost of detection data, extracts image features accurately, maintains high precision in detecting a person's lips in video, and improves the speed and efficiency of lip-modification forgery detection.
Further, fig. 4 is a flowchart illustrating a third embodiment of the lip modification falsification video detection method according to the present invention, and as shown in fig. 4, the third embodiment of the lip modification falsification video detection method according to the present invention is proposed based on the second embodiment, in this embodiment, the step S12 specifically includes the following steps:
and S121, extracting key points of the face in the video to be detected by using a face key point detection algorithm.
It should be noted that the key points of the face can be extracted by using a face key point detection algorithm in a face key point detection model. In actual operation, the face key point detection model can adopt a deep residual network (ResNet), whose accuracy and depth outperform an ordinary convolutional neural network.
And S122, obtaining lip key points from the key points according to lip region coordinates, and extracting a region video image from the video to be detected as a lip region image according to the lip key points.
It can be understood that the lip key points are obtained from the face key points according to the lip region coordinates, and the region video image extracted from the video to be detected according to these key points serves as the lip region image. For example, if the 81 face key points are numbered 1 to 81 and the numbers corresponding to the lip region are detected to be 3 to 13, the lip region image corresponding to the lip key points with those numbers can be obtained.
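The index-based selection can be sketched in a few lines; the landmark numbering (3–13 for the lip region) follows the example in the text, and the crop margin is an illustrative choice:

```python
# Select lip key points from a numbered landmark list and derive a crop box.
# Landmarks are (number, x, y) tuples; numbers 3-13 are assumed to cover the
# lip region, following the example above.
LIP_NUMBERS = range(3, 14)

def lip_region_box(landmarks, margin=5):
    """Bounding box around the lip key points, padded by a small margin."""
    lip_pts = [(x, y) for n, x, y in landmarks if n in LIP_NUMBERS]
    xs = [p[0] for p in lip_pts]
    ys = [p[1] for p in lip_pts]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)

# Dummy 81-point landmark list: point n placed at (n, n) for illustration.
landmarks = [(n, n, n) for n in range(1, 82)]
box = lip_region_box(landmarks)
```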
According to the scheme, the key points of the face in the video to be detected are extracted by using a face key point detection algorithm; the lip key points are obtained from the key points according to the lip region coordinates, and the region video images are extracted from the video to be detected according to the lip key points to serve as the lip region images, so that the consumption of detection data processing can be reduced, the image characteristics can be accurately extracted, the high lip detection precision of video characters is ensured, and the speed and the efficiency of lip modification and forgery video detection are improved.
Further, fig. 5 is a flowchart illustrating a fourth embodiment of the lip modification falsification video detection method according to the present invention, and as shown in fig. 5, the fourth embodiment of the lip modification falsification video detection method according to the present invention is proposed based on the second embodiment, in this embodiment, the step S13 specifically includes the following steps:
step S131, obtaining face key feature point coordinates in the lip region image, and taking the face key feature point coordinates as corner features.
It should be noted that the extracted feature data is coordinates of the face key feature points, that is, coordinates of the face key feature points in the lip region image are obtained, so that the coordinates of the face key feature points are used as corner features.
Step S132, obtaining histograms of three channels of RGB of the lip region image through a calchist function, determining an image binarization threshold value according to the histograms, and obtaining data characteristics of gray distribution in three colors corresponding to RGB according to the image binarization threshold value.
It should be understood that histograms of the three RGB channels of the picture are then obtained through the calcHist function and used to determine the image binarization threshold, from which the gray distribution data characteristics in the three RGB colors can be obtained.
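A minimal sketch of the per-channel histogram and threshold step, using NumPy's histogram as a stand-in for OpenCV's calcHist; the intensity-weighted-mean threshold is an illustrative placeholder for whatever binarization rule the method actually applies:

```python
import numpy as np

def channel_histograms(img: np.ndarray):
    """Return a 256-bin histogram for each of the R, G, B channels."""
    return [np.histogram(img[..., c], bins=256, range=(0, 256))[0]
            for c in range(3)]

def binarize_threshold(hist: np.ndarray) -> float:
    """Illustrative threshold: intensity-weighted mean of the histogram."""
    levels = np.arange(256)
    return float((levels * hist).sum() / max(hist.sum(), 1))

rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 200  # constant red channel for a checkable example
hists = channel_histograms(rgb)
thr = binarize_threshold(hists[0])  # 200.0 for the constant channel
```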
And S133, extracting the basic features in the lip region image by adopting a basic feature extraction network FFE-Net.
It is understood that a Fundamental Feature Extraction network (FFE-Net) is used for base feature extraction; the FFE-Net emphasizes lip motion.
In a specific implementation, in order to better adapt to an optical flow graph, the FFE-Net adopts a layered architecture divided into two stages: a contraction stage and an expansion stage. In the contraction stage, useful features are extracted by 4 convolution layers at different resolutions; a max-pooling layer with a spatial stride of 2 × 2 is placed after each of the first 3 convolution layers, halving the feature map in both dimensions. In the expansion stage, 3 deconvolution layers generate expanded feature maps doubled in both dimensions. Motion maps of different sizes are generated by an output network: at the minimum resolution of W/8 × H/8, a convolution layer converts the feature map into a motion map of the correct size; a deconvolution layer then constructs a feature map doubled in both dimensions, transferring the motion information from the lower scale to the higher scale. Feature maps at corresponding scales in the contraction and expansion networks are integrated by an addition operation to generate a higher-resolution W/4 × H/4 feature map, and a motion map at this scale is likewise obtained by applying a convolution layer. The same operation is repeated 3 times to obtain predicted motion features at different scales; finally, the motion map at the original size is taken as the output. Optical flow maps at different scales serve as supervision information for predicting the motion map at each scale. The FFE-Net input is a picture sequence, and the output is a dynamic feature map sequence.
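The size arithmetic of the contraction and expansion stages can be checked with a short sketch; this is pure shape bookkeeping rather than the network itself, with layer counts taken from the description above:

```python
def ffe_net_scales(w: int, h: int):
    """Track feature-map sizes through 3 stride-2 poolings (contraction)
    and 3 deconvolutions that double the size back (expansion)."""
    contraction = [(w, h)]
    for _ in range(3):            # pooling after each of the first 3 convs
        w, h = w // 2, h // 2     # 2 x 2 stride halves both dimensions
        contraction.append((w, h))
    expansion = []
    for _ in range(3):            # 3 deconvolution layers
        w, h = w * 2, h * 2       # each doubles both dimensions
        expansion.append((w, h))
    return contraction, expansion

contraction, expansion = ffe_net_scales(64, 64)
# The minimum resolution is W/8 x H/8, as stated in the text.
```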
And S134, extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
It should be understood that lip feature extraction and classification are performed by a lip Representative feature extraction and Classification network (RC-Net); the RC-Net can extract higher-dimensional lip features by learning lip speaking patterns.
In the specific implementation, the RC-Net extracts high-dimensional representative features and verifies the identity of the speaker. Its structure has three key parts: a feature extraction module, a reconstruction branch, and a classification branch. The feature extraction module adopts a layered architecture to describe lip-motion features at different spatial scales; 4 convolution layers extract features at different resolutions, and max-pooling layers with a stride of 2 × 2 in 3 spatial domains perform down-sampling (discarding some elements to realize image scaling). Because the model input is a lip feature map sequence, three-dimensional network elements, including a three-dimensional convolution layer and a three-dimensional max-pooling layer, are used to describe the spatio-temporal lip dynamics. The reconstruction branch reconstructs the lip motion from the extracted features; its structure mirrors the feature extraction part, with four convolution layers and three up-sampling layers. The classification branch performs a two-class classification task with two fully connected layers and one global average pooling layer. Two kinds of supervision information are applied in RC-Net: a lip chart guiding the reconstruction branch, and label information guiding the classification branch.
According to the scheme, the coordinates of the key feature points of the human face in the lip region image are obtained, and the coordinates of the key feature points of the human face are used as the corner features; obtaining histograms of RGB three channels of the lip region image through a calchist function, determining an image binarization threshold value according to the histograms, and obtaining data characteristics corresponding to gray distribution in RGB three colors according to the image binarization threshold value; extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net; the lip representative feature extraction and classification network RC-Net is adopted to extract and classify lip features in the lip region image, so that the image features can be accurately extracted, the higher lip detection precision of video characters is ensured, and the speed and efficiency of lip modification and counterfeiting video detection are improved.
Further, fig. 6 is a flowchart illustrating a fifth embodiment of the lip modification forgery video detection method according to the present invention, and as shown in fig. 6, the fifth embodiment of the lip modification forgery video detection method according to the present invention is proposed based on the first embodiment, and in this embodiment, before the step S20, the lip modification forgery video detection method further includes the following steps:
step S201, a Markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator and a MesoInceptation-4 discriminator are used as five decision trees, and a random forest model is formed according to the five decision trees.
It should be noted that a random forest model can be formed by using a markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator, and a mesoinclusion-4 discriminator as five decision trees.
It can be understood that the decision trees are composed of various types of discriminators. The first discriminator uses a Markov discriminator, which is composed of full convolution layers; its structure can be: the full convolution layers output an n × n matrix, and the mean value of the matrix is taken as the final judgment result. Each element of the output matrix represents a receptive field in the original image, i.e., a patch of the original image. In a convolutional neural network, the receptive field is defined as the region of the original image mapped to by a pixel on the feature map output by each layer, and its size is computed as:
l_k = l_(k−1) + (f_k − 1) × ∏_{i=1}^{k−1} s_i
wherein l_(k−1) is the receptive field size of layer k−1, f_k is the convolution kernel size (or pooling layer size) of layer k, and s_i is the stride of the i-th layer;
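The receptive-field recursion above can be sketched directly; the base case l_0 = 1 (a single input pixel sees itself) is the usual convention:

```python
from math import prod

def receptive_field(kernels, strides):
    """Receptive field size after stacking layers with kernel sizes f_k and
    strides s_k, using  l_k = l_{k-1} + (f_k - 1) * prod(s_1 .. s_{k-1})."""
    l = 1  # a pixel of the input sees itself
    for k, f in enumerate(kernels):
        l += (f - 1) * prod(strides[:k])  # prod of an empty slice is 1
    return l

# Two 3x3 convolutions with stride 1 give a receptive field of 5.
rf = receptive_field([3, 3], [1, 1])
```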
the second discriminator may be a linear SVM classifier that extracts features from the high-pass residual image and determines whether each frame of image is counterfeit by measuring the coexisting subtle patterns.
The third discriminator can adopt DenseNet, which builds on the residual network with a deeper CNN structure and effectively promotes the propagation and reuse of features in the network. Each layer in DenseNet takes the feature maps of all preceding layers as input, and its own feature map is in turn used as input by all subsequent layers. By creating shortcuts between input and output, DenseNet facilitates feature propagation and reuse and alleviates the vanishing gradient problem.
The fourth discriminator may use XceptionNet, which employs fully separable filters: in each layer of the network, the three-dimensional filtering of the input feature map is replaced by a depthwise convolution followed by a pointwise (1 × 1) convolution, reducing the parameters the network needs to train while preserving network performance.
The fifth discriminator can adopt a MesoInception-4 discriminator; the MesoInception-4 network introduces an Inception module to replace the first two convolution layers in the Meso-4 network architecture. The Inception module stacks the outputs of several convolution layers with different kernel shapes, thereby enlarging the function space over which the model is optimized.
According to the scheme, the Markov discriminator, the SVM classifier, the DenseNet discriminator, the XceptionNet discriminator and the MesoInception-4 discriminator are used as five decision trees, and a random forest model is formed according to the five decision trees, so that the training speed of the random forest model on high-dimensional data can be increased, and the speed and the efficiency of detecting the lip-shaped modified forged video are further increased.
Further, fig. 7 is a flowchart illustrating a sixth embodiment of the lip modification forged video detection method according to the present invention, and as shown in fig. 7, the sixth embodiment of the lip modification forged video detection method according to the present invention is proposed based on the fifth embodiment, in this embodiment, the step S20 specifically includes the following steps:
and S21, acquiring a corresponding data set from the video data set according to a preset random proportion to serve as a training set and a testing set.
It should be noted that the preset random proportion is a random distribution proportion of preset screening data serving as different data sets, and a corresponding data set can be acquired from a video data set as a training set and a test set through the preset random proportion.
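The split can be sketched with the standard library alone; the 70/30 proportion follows the example given later in this embodiment, and the fixed seed is only for reproducibility:

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Randomly partition samples into a training set and a test set
    according to a preset proportion."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)))
```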
And S22, training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
It should be understood that each discriminator in the random forest model can be trained through the training set, so that each classifier in the random forest model is tested according to the test set, and a trained random forest model is obtained.
In a specific implementation, the above five discriminator models may be trained and tested respectively, with a random 70% of the data set as the training set and the remaining 30% as the test set. The performance of each discriminator is as follows:

Discriminator           Test accuracy
Markov discriminator    79.5%
SVM                     89.63%
DenseNet                84.77%
XceptionNet             89.78%
MesoInception-4         89.6%
Subsequently, a random forest algorithm may be used to train the 5 discriminators respectively, each first trained with a random half of the total training set.
According to the scheme, the corresponding data set is acquired from the video data set through the preset random proportion and is used as the training set and the test set; training each discriminator in the random forest model according to the training set, testing each classifier in the random forest model according to the testing set to obtain a trained random forest model, and further improving the speed and efficiency of lip modification forgery video detection by improving the training speed of the random forest model to high-dimensional data.
Further, fig. 8 is a flowchart illustrating a seventh embodiment of the lip modification falsification video detection method according to the present invention, and as shown in fig. 8, the seventh embodiment of the lip modification falsification video detection method according to the present invention is proposed based on the first embodiment, in this embodiment, the step S30 specifically includes the following steps:
and S31, inputting the image characteristics to a trained random forest, obtaining random forest mode numbers corresponding to decision trees in the random forest, and judging whether the video to be detected is a true video or a forged video according to the random forest mode numbers.
It should be noted that after training of the random forest is completed, the image features extracted from the video to be detected can be input, and the trained random forest judges whether the video to be detected is a true video or a forged video by taking the mode of the outputs: the 5 discriminators judge the same test sample simultaneously, and the mode (majority vote) of their results determines whether the sample is true or false.
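The mode-taking step reduces to a majority vote over the five discriminator outputs; a minimal sketch, where the label convention (1 = genuine, 0 = forged) is illustrative:

```python
from collections import Counter

def random_forest_verdict(votes):
    """Return the mode (most common label) of the discriminator outputs."""
    return Counter(votes).most_common(1)[0][0]

# Three of five discriminators flag the sample as forged (0).
verdict = random_forest_verdict([0, 1, 0, 1, 0])
```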
According to the scheme, the image features are input into the trained random forest, the random forest mode corresponding to each decision tree in the random forest is obtained, the video to be detected is judged to be the true video or the forged video according to the random forest mode, the authenticity of the video can be quickly judged, the training speed and efficiency of video data are accelerated, the phenomenon that image shaking in the video affects feature extraction is avoided, the lip motion detection accuracy of a video frame and the feature extraction speed are improved, the influence of noise points or abnormal points on video lip detection is reduced, the higher lip detection precision of video characters is guaranteed, the detection accuracy of lip modification forged videos is improved, and the speed and efficiency of lip modification forged video detection are improved.
Accordingly, the invention further provides a lip-modified counterfeit video detection device.
Referring to fig. 9, fig. 9 is a functional block diagram of a first embodiment of a lip-modified counterfeit video detection apparatus according to the present invention.
In a first embodiment of the lip-modification counterfeit video detection apparatus of the present invention, the lip-modification counterfeit video detection apparatus includes:
the feature extraction module 10 is configured to extract a lip region image of a video to be detected, and acquire an image feature in the lip region image.
And the training module 20 is configured to train the random forest model by using a training set in the video data set to obtain a trained random forest model.
And the judging module 30 is used for inputting the image characteristics into the trained random forest and judging that the video to be detected is a true video or a forged video.
The feature extraction module 10 is further configured to perform graying processing and normalization on each frame of image in the video to be detected, so as to obtain a target detection video with normalized gray value data; extract a lip region image of the target detection video; and extract image features in the lip region image using a base feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
The feature extraction module 10 is further configured to extract key points of a face in the video to be detected by using a face key point detection algorithm; lip key points are obtained from the key points according to lip region coordinates, and region video images are extracted from the video to be detected according to the lip key points and serve as lip region images.
The feature extraction module 10 is further configured to acquire coordinates of face key feature points in the lip region image, and use the coordinates of the face key feature points as corner features; obtaining histograms of three RGB channels of the lip region image through a calchist function, determining an image binarization threshold value according to the histograms, and obtaining data characteristics of gray distribution in three colors corresponding to RGB according to the image binarization threshold value; extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net; and extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
The training module 20 is further configured to use a markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator, and a mesoinclusion-4 discriminator as five decision trees, and form a random forest model according to the five decision trees.
The training module 20 is further configured to obtain a corresponding data set from the video data set according to a preset random proportion as a training set and a test set; and training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
The judging module 30 is further configured to input the image features into a trained random forest, obtain random forest mode numbers corresponding to decision trees in the random forest, and judge that the video to be detected is a true video or a forged video according to the random forest mode numbers.
The steps implemented by each functional module of the lip-shaped modified and forged video detection device may refer to each embodiment of the lip-shaped modified and forged video detection method of the present invention, and are not described herein again.
In addition, an embodiment of the present invention further provides a storage medium, where a lip modification forgery video detection program is stored on the storage medium, and when executed by a processor, the lip modification forgery video detection program implements the following operations:
extracting a lip region image of a video to be detected, and acquiring image characteristics in the lip region image;
training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and inputting the image characteristics into a trained random forest, and judging that the video to be detected is a true video or a forged video.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
carrying out graying processing and normalization on each frame of image in a video to be detected to obtain a target detection video with standardized gray value data;
extracting a lip region image of the target detection video;
extracting image features in the lip region image using a base feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
extracting key points of the face in the video to be detected by using a face key point detection algorithm;
and obtaining lip key points from the key points according to the lip region coordinates, and extracting a region video image from the video to be detected as a lip region image according to the lip key points.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
acquiring face key feature point coordinates in the lip region image, and taking the face key feature point coordinates as corner features;
obtaining histograms of RGB three channels of the lip region image through a calchist function, determining an image binarization threshold value according to the histograms, and obtaining data characteristics corresponding to gray distribution in RGB three colors according to the image binarization threshold value;
extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net;
and extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
the method comprises the steps of using a Markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator and a MesoInceptation-4 discriminator as five decision trees, and forming a random forest model according to the five decision trees.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
acquiring a corresponding data set from a video data set according to a preset random proportion to serve as a training set and a test set;
and training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
Further, the lip modification spoofing video detection program when executed by the processor further performs the following operations:
inputting the image characteristics into a trained random forest, obtaining random forest mode numbers corresponding to decision trees in the random forest, and judging whether the video to be detected is a true video or a forged video according to the random forest mode numbers.
According to the scheme, the lip region image of the video to be detected is extracted, and the image characteristics in the lip region image are obtained; training the random forest model by using a training set in the video data set to obtain a trained random forest model; the image characteristics are input into a trained random forest, the video to be detected is judged to be a true video or a forged video, the authenticity of the video can be quickly judged, the training speed and efficiency of video data are accelerated, the phenomenon that image shake in the video affects characteristic extraction is avoided, the lip action detection accuracy and the characteristic extraction speed of video frames are improved, the influence of noise points or abnormal points on video lip detection is reduced, the higher video figure lip detection precision is ensured, the detection accuracy of lip modification forged videos is improved, and the speed and the efficiency of lip modification forged video detection are improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A lip modification spoofing video detection method, the lip modification spoofing video detection method comprising:
extracting a lip region image of a video to be detected, and acquiring image characteristics in the lip region image;
training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and inputting the image characteristics into a trained random forest, and judging that the video to be detected is a true video or a forged video.
2. The lip modification forgery video detection method of claim 1, wherein the extracting a lip region image of a video to be detected and acquiring image features in the lip region image comprises:
carrying out graying processing and normalization on each frame of image in a video to be detected to obtain a target detection video with standardized gray value data;
extracting a lip region image of the target detection video;
extracting image features in the lip region image using an underlying feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net.
3. The lip modification counterfeit video detection method of claim 2, wherein the extracting the lip region image of the video to be detected comprises:
extracting key points of the face in the video to be detected by using a face key point detection algorithm;
and obtaining lip key points from the key points according to the lip region coordinates, and extracting a region video image from the video to be detected as a lip region image according to the lip key points.
4. The lip modification counterfeit video detection method of claim 2 wherein extracting image features in the lip region image using a base feature extraction network FFE-Net and a lip representative feature extraction and classification network RC-Net comprises:
acquiring face key feature point coordinates in the lip region image, and taking the face key feature point coordinates as corner features;
obtaining histograms of RGB three channels of the lip region image through a calchist function, determining an image binarization threshold value according to the histograms, and obtaining data characteristics corresponding to gray distribution in RGB three colors according to the image binarization threshold value;
extracting basic features in the lip region image by adopting a basic feature extraction network FFE-Net;
and extracting and classifying lip features in the lip region image by adopting a lip representative feature extraction and classification network RC-Net.
5. The lip modification spoofing video detection method of claim 1, wherein the lip modification spoofing video detection method further comprises, prior to training a random forest model using a training set in the video data set to obtain a trained random forest model:
the method comprises the steps of using a Markov discriminator, an SVM classifier, a DenseNet discriminator, an XceptionNet discriminator and a MesoInceptation-4 discriminator as five decision trees, and forming a random forest model according to the five decision trees.
6. The lip modification spoofing video detection method of claim 5, wherein training a random forest model with a training set in the video data set to obtain a trained random forest model comprises:
acquiring a corresponding data set from a video data set according to a preset random proportion to serve as a training set and a test set;
and training each discriminator in the random forest model according to the training set, and testing each classifier in the random forest model according to the test set to obtain the trained random forest model.
7. The lip modification spoofing video detection method of claim 1, wherein the inputting the image features into a trained random forest and determining the video to be detected as a true video or a spoofing video comprises:
inputting the image characteristics into a trained random forest, obtaining random forest mode numbers corresponding to all decision trees in the random forest, and judging whether the video to be detected is a true video or a forged video according to the random forest mode numbers.
8. A lip modification forgery video detection apparatus, comprising:
the characteristic extraction module is used for extracting a lip region image of a video to be detected and acquiring image characteristics in the lip region image;
the training module is used for training the random forest model by using a training set in the video data set to obtain a trained random forest model;
and the judging module is used for inputting the image characteristics into a trained random forest and judging that the video to be detected is a true video or a forged video.
9. A lip modification spoofing video detection device, the lip modification spoofing video detection device comprising: a memory, a processor, and a lip modification counterfeit video detection program stored on the memory and executable on the processor, the lip modification counterfeit video detection program configured to implement the steps of the lip modification counterfeit video detection method as claimed in any one of claims 1 to 7.
10. A storage medium having stored thereon a lip modification forgery video detection program that, when executed by a processor, implements the steps of the lip modification forgery video detection method according to any one of claims 1 to 7.
CN202210938861.9A 2022-08-05 2022-08-05 Lip-shape modified counterfeit video detection method, device, equipment and storage medium Pending CN115424163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210938861.9A CN115424163A (en) 2022-08-05 2022-08-05 Lip-shape modified counterfeit video detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210938861.9A CN115424163A (en) 2022-08-05 2022-08-05 Lip-shape modified counterfeit video detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115424163A true CN115424163A (en) 2022-12-02

Family

ID=84197270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210938861.9A Pending CN115424163A (en) 2022-08-05 2022-08-05 Lip-shape modified counterfeit video detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115424163A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741198A (en) * 2023-08-15 2023-09-12 Hefei University of Technology Lip synchronization method based on multi-scale dictionary
CN116741198B (en) * 2023-08-15 2023-10-20 Hefei University of Technology Lip synchronization method based on multi-scale dictionary

Similar Documents

Publication Publication Date Title
Rao et al. A deep learning approach to detection of splicing and copy-move forgeries in images
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
Kiran et al. Offline signature recognition using image processing techniques and back propagation neuron network system
CN111401372B (en) Method for extracting and identifying image-text information of scanned document
Kumar et al. Radon-like features and their application to connectomics
Abrahim et al. RETRACTED ARTICLE: Splicing image forgery identification based on artificial neural network approach and texture features
CN109410184B (en) Live broadcast pornographic image detection method based on dense confrontation network semi-supervised learning
Wu et al. Steganalysis via deep residual network
Kruthi et al. Offline signature verification using support vector machine
Han et al. High-order statistics of microtexton for hep-2 staining pattern classification
CN115424163A (en) Lip-shape modified counterfeit video detection method, device, equipment and storage medium
Sakthimohan et al. Detection and Recognition of Face Using Deep Learning
Vaishak et al. Currency and Fake Currency Detection using Machine Learning and Image Processing–An Application for Blind People using Android Studio
CN111275070B (en) Signature verification method and device based on local feature matching
CN117237736A (en) Daqu quality detection method based on machine vision and deep learning
Narra et al. Indian Currency Classification and Fake Note Identification using Feature Ensemble Approach
Hauri Detecting signatures in scanned document images
Sabeena et al. Digital Image Forgery Detection Using Local Binary Pattern (LBP) and Harlick Transform with classification
Abraham Digital image forgery detection approaches: A review and analysis
Suzuki et al. Illumination-invariant face identification using edge-based feature vectors in pseudo-2D Hidden Markov Models
CN112257688A (en) GWO-OSELM-based non-contact palm in-vivo detection method and device
KR20200051903A (en) Fake fingerprint detection method and system
Joshi et al. Development of Classification Framework Using Machine Learning and Pattern Recognition System
Saire et al. Documents counterfeit detection through a deep learning approach
Banerjee et al. A Proposed Method for Fruit Grading from Fruit Images using SVM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination