CN114842399A - Video detection method, and training method and device of video detection model - Google Patents

Video detection method, and training method and device of video detection model

Info

Publication number
CN114842399A
Authority
CN
China
Prior art keywords
video
face
target
sample
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210564026.3A
Other languages
Chinese (zh)
Other versions
CN114842399B (en)
Inventor
李艾仑
王洪斌
吴至友
皮家甜
曾定衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202210564026.3A priority Critical patent/CN114842399B/en
Publication of CN114842399A publication Critical patent/CN114842399A/en
Application granted granted Critical
Publication of CN114842399B publication Critical patent/CN114842399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Neural network architectures; Combinations of networks
    • G06N 3/084: Neural network learning methods; Backpropagation, e.g. using gradient descent
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 40/168: Human faces; Feature extraction; Face representation
    • G06V 40/174: Human faces; Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video detection method and a video detection device, which are used for solving the problems of low detection accuracy and poor universality of existing forged-video detection methods. The video detection method comprises the following steps: acquiring at least one frame of video image of a target face in a video to be detected and multiple frames of first optical flow images of the target face arranged in time sequence; performing feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of the target face; performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain facial action features of the target face; and determining a detection result of the video to be detected based on at least the facial emotion features and the facial action features of the target face.

Description

Video detection method, and training method and device of video detection model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video detection method, a video detection model training method and a video detection model training device.
Background
With the development of deep learning, face forgery technologies have emerged one after another, for example, synthesizing a face that does not exist or replacing the face in a video with another face. Detection of forged video has therefore become very important.
At present, forged-video detection is still at a developmental stage, and most detection methods judge the authenticity of a video based on changes in facial features and the artifacts introduced during forgery. However, such methods easily overfit to deep-forgery features with specific distributions, so they achieve good detection results only on some videos and suffer from low detection accuracy and poor universality.
Disclosure of Invention
The embodiment of the application aims to provide a video detection method and a video detection device, which are used for solving the problems of low detection accuracy and poor universality of the existing video detection method.
In order to achieve the above purpose, the following technical solutions are adopted in the embodiments of the present application:
in a first aspect, an embodiment of the present application provides a video detection method, including:
acquiring at least one frame of video image of a target face in a video to be detected and multiple frames of first optical flow images of the target face arranged in time sequence;
performing feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of the target face;
performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain facial action features of the target face;
and determining a detection result of the video to be detected based on at least the facial emotion features and the facial action features of the target face.
It can be seen that the embodiment of the application exploits the natural law that a real face and a forged face differ in both appearance and dynamic motion. Facial emotion features of the target face are extracted by the video detection model from at least one frame of video image of the target face in the video to be detected, facial action features of the target face are extracted by the video detection model from multiple frames of first optical flow images of the target face arranged in time sequence, and the detection result of the video to be detected is then determined based on at least these two kinds of features. Because facial emotion features are static features in the spatial domain and reflect the appearance of the face, while facial action features are dynamic features in the temporal domain and reflect the motion of the face, performing video detection by combining the two avoids overfitting to deep-forgery features with specific distributions, thereby improving detection accuracy and universality.
In a second aspect, an embodiment of the present application provides a method for training a video detection model, including:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos correspond one-to-one to a plurality of face forgery algorithms, and each forged video is obtained by forging the real video with the corresponding face forgery algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
performing feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain face emotion features of the sample face;
performing feature extraction on a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain the face action features of the sample face;
determining a detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face in the target sample video;
and performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
In the embodiment of the application, a real video and forged videos obtained by forging the real video with multiple face forgery algorithms are used as sample videos, and the initial video detection model is trained with these sample videos and their corresponding authenticity labels, so that the resulting video detection model can learn the characteristics of multiple kinds of forged video. This improves the generalization ability of the video detection model and thus its detection effect on various videos. In the specific model training process, facial emotion features of a sample face are extracted by the initial video detection model from at least one frame of video image of the sample face in a sample video, facial action features of the sample face are extracted by the initial video detection model from multiple frames of optical flow images of the sample face arranged in time sequence, and the sample video is detected based on at least these two kinds of features; the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model. In this way, the initial detection model can fully learn the static features of the sample videos in the spatial domain and accurately extract facial emotion features reflecting the appearance of the face, fully learn the dynamic features of the sample videos in the temporal domain and accurately extract facial action features reflecting the dynamic actions of the face, and thus acquire the ability to accurately identify videos by combining the two kinds of features. This prevents the initial video detection model from overfitting to deep-forgery features with specific distributions, gives the trained video detection model high detection accuracy and universality, and improves the accuracy and universality of video detection based on the video detection model.
In a third aspect, an embodiment of the present application provides a video detection apparatus, including:
the first image acquisition unit is used for acquiring at least one frame of video image of a target face in a video to be detected and multi-frame first optical flow images of the target face based on time sequence arrangement;
the first spatial domain feature extraction unit is used for extracting features of the at least one frame of video image through a video detection model to obtain facial emotion features of a target face;
the first time domain feature extraction unit is used for extracting features of the plurality of frames of first optical flow images through the video detection model to obtain face action features of the target face;
and the first detection unit is used for determining the detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a video detection model, including:
the system comprises a sample acquisition unit, a processing unit and a processing unit, wherein the sample acquisition unit is used for acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos correspond to a plurality of face forging algorithms one to one, and each forged video is obtained by forging the real video based on the corresponding face forging algorithm;
a second image acquisition unit for acquiring at least one frame of video image of a sample face in the target sample video and a plurality of frames of second optical flow images of the sample face arranged based on time sequence;
the second spatial domain feature extraction unit is used for performing feature extraction on at least one frame of video image of a sample face in the target sample video through the initial video detection model to obtain facial emotion features of the sample face;
the second time domain feature extraction unit is used for performing feature extraction on a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain the face action features of the sample face;
the second detection unit is used for determining the detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face in the target sample video;
and the training unit is used for carrying out iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method according to the first aspect or the second aspect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video detection method according to another embodiment of the present application;
fig. 3 is a schematic structural diagram of a spatial flow network according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a training method of a video detection model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training apparatus for a video detection model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. In addition, "and/or" in the description and claims means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects before and after it.
Description of some concepts:
OpenCV: a cross-platform computer vision and machine learning software library distributed under the BSD (Berkeley Software Distribution) open-source license. OpenCV provides a very rich set of vision processing algorithms.
Dlib: a modern C++ toolkit containing machine learning algorithms and tools for creating complex C++ software that solves real-world problems.
Prewitt operator: an edge detection operator that uses the gray-level differences between a pixel and its upper, lower, left and right neighbors to identify pixels with obvious brightness changes in a digital image, thereby obtaining the boundary information of the target in the image (a small code sketch of this operator is given after these concept descriptions).
Freeman chain code: a method of describing a curve or boundary by the coordinates of its starting point and a sequence of direction codes for the boundary points. Freeman chain codes are commonly used to represent curves and region boundaries in image processing, computer graphics, pattern recognition and similar fields. For example, an 8-connected chain code may be used, which encodes the eight neighbors of the central pixel (up, down, left, right and the four diagonal directions). The 8-connected chain code matches the actual arrangement of neighboring points and can accurately describe the information of the central pixel and its neighboring pixels.
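As a rough illustration of the Prewitt operator described above, the following sketch applies the horizontal and vertical Prewitt kernels with OpenCV and marks pixels with obvious brightness changes; the thresholding rule and the function name are illustrative assumptions, not part of the application.

```python
import cv2
import numpy as np

def prewitt_edges(gray):
    # Horizontal and vertical Prewitt kernels: each compares the gray values
    # of the neighboring points on either side of the central pixel.
    kernel_x = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
    kernel_y = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32)
    gx = cv2.filter2D(gray.astype(np.float32), -1, kernel_x)
    gy = cv2.filter2D(gray.astype(np.float32), -1, kernel_y)
    # Pixels with an obvious brightness change get a large gradient magnitude.
    magnitude = cv2.magnitude(gx, gy)
    # Simple statistical threshold (an assumption) to keep only strong edges.
    return (magnitude > magnitude.mean() + 2 * magnitude.std()).astype(np.uint8) * 255
```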
In order to solve the problems of low detection accuracy and poor universality that arise because existing video detection methods achieve good detection results only on some videos, the embodiment of the application provides a video detection method based on a two-stream network architecture, which exploits the natural law that a real face and a forged face differ in both appearance and dynamic motion. Facial emotion features of the target face are extracted by a video detection model from at least one frame of video image of the target face in the video to be detected, facial action features of the target face are extracted by the video detection model from multiple frames of first optical flow images of the target face arranged in time sequence, and the detection result of the video to be detected is then determined based on at least the facial emotion features and the facial action features of the target face. Because facial emotion features are static features in the spatial domain and reflect the appearance of the face, while facial action features are dynamic features in the temporal domain and reflect the motion of the face, performing video detection by combining these two kinds of features avoids overfitting to deep-forgery features with specific distributions, thereby improving detection accuracy and universality.
The embodiment of the application also provides a training method of the video detection model. A real video and forged videos obtained by forging the real video with multiple face forgery algorithms are used as sample videos, and an initial video detection model is trained with these sample videos and their corresponding authenticity labels, so that the resulting video detection model can learn the characteristics of multiple kinds of forged video, which improves its generalization ability and therefore its detection effect on various videos. In the specific model training process, facial emotion features of a sample face are extracted by the initial video detection model from at least one frame of video image of the sample face in a sample video, facial action features of the sample face are extracted by the initial video detection model from multiple frames of optical flow images of the sample face arranged in time sequence, the sample video is detected based on at least these two kinds of features, and the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model. In this way, the initial detection model can fully learn the static features of the sample videos in the spatial domain and accurately extract facial emotion features reflecting the appearance of the face, fully learn the dynamic features of the sample videos in the temporal domain and accurately extract facial action features reflecting the dynamic actions of the face, and thus acquire the ability to accurately identify videos by combining the two kinds of features. This prevents the initial video detection model from overfitting to deep-forgery features with specific distributions, gives the trained video detection model high detection accuracy and universality, and improves the accuracy and universality of video detection based on the video detection model.
It should be understood that the video detection method and the training method of the video detection model provided in the embodiments of the present application may be executed by an electronic device or software installed in the electronic device. The electronic device referred to herein may include a terminal device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent household appliance, an intelligent watch, a vehicle-mounted terminal, an aircraft, or the like; alternatively, the electronic device may further include a server, such as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server providing a cloud computing service.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a video detection method according to an embodiment of the present application is shown, where the method includes the following steps:
s102, at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face arranged based on time sequence are obtained.
In the embodiment of the application, the target face in the video to be detected refers to the main face in the video to be detected. For example, if the video to be detected contains the face of user A and the face of user B, where the face of user A is in the foreground or occupies a larger face area and the face of user B is in the background or occupies a smaller face area, then the face of user A is the target face.
A single-frame video image of the target face can be any frame of the video to be detected that contains the target face; it can also be one frame containing the target face that corresponds to the video to be detected as a whole. The at least one frame of video image of the target face can be any one or more frames of the video to be detected that contain the target face. Since an RGB image contains image data of the three color channels R (red), G (green) and B (blue), it can better reflect the vital signs of the target face in the video to be detected; the single-frame video image may therefore be any RGB image in the video to be detected that contains the target face.
The first optical flow image is an image that expresses changes between video frames and contains information about the motion of the target face. In practical applications, an optical flow image may be obtained by applying an optical flow algorithm to two temporally adjacent video frames, where the optical flow algorithm includes, but is not limited to, the Farneback algorithm, the FlowNet algorithm, or a combination thereof.
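As an illustration of how such a first optical flow image might be computed, the sketch below uses OpenCV's Farneback implementation on two temporally adjacent grayscale frames; the parameter values and the HSV visualization step are common conventions, not values prescribed by this application.

```python
import cv2
import numpy as np

def farneback_flow_image(prev_gray, next_gray):
    # Dense optical flow between two adjacent grayscale frames (Farneback algorithm).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Convert the (dx, dy) field into an HSV-coded image so that it can be handled
    # like an ordinary image by the temporal stream network.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                        # hue encodes motion direction
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```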
In order for the acquired at least one frame of video image to accurately reflect the appearance characteristics of the face (in particular its emotional characteristics) and for the acquired multiple frames of first optical flow images to accurately reflect the motion characteristics of the face (in particular its facial actions), in an optional implementation a segmented random sampling scheme can be used to acquire at least one frame of video image of the target face in the video to be detected and multiple frames of first optical flow images of the target face arranged in time sequence. Of course, in another optional implementation, the whole video to be detected may also be randomly sampled to obtain a single-frame video image of the target face. Specifically, S102 may include: dividing the video to be detected into a plurality of video segments; randomly sampling multiple frames of RGB images of the target face in each video segment to obtain multiple candidate single-frame video images, and determining the at least one frame of video image from these candidates; randomly sampling multiple frames of grayscale images of the target face in each video segment to obtain multiple candidate grayscale images; and further determining a first optical flow image corresponding to each candidate grayscale image based on that candidate grayscale image and its temporally adjacent candidate grayscale images, and then determining the multiple frames of first optical flow images from the first optical flow images corresponding to the candidate grayscale images.
For example, the video to be detected can be divided equally into K segments based on its duration, with each segment having the same length and containing multiple video frames. Then, for each segment, OpenCV is used to convert the segment into multiple time-ordered RGB images and multiple time-ordered grayscale images, and the RGB images of the segment are randomly sampled to obtain a candidate single-frame RGB image, while the grayscale images are randomly sampled to obtain multiple time-ordered candidate grayscale images. The candidate single-frame RGB image of each segment may be used directly as the final single-frame video image, or at least one RGB image with relatively good quality (e.g. high definition, a clear face) may be selected from the candidate single-frame RGB images of the segments as the final single-frame video image. Meanwhile, an optical flow algorithm can be used to compute, for each candidate grayscale image, a first optical flow image based on that image and its temporally adjacent candidate grayscale images; the first optical flow images corresponding to the candidate grayscale images are then taken as the multiple frames of first optical flow images arranged in time sequence.
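A minimal sketch of this segmented random sampling scheme is shown below, assuming OpenCV for frame decoding and reusing a helper such as the farneback_flow_image function sketched earlier; K, the per-segment flow count and all function names are illustrative assumptions.

```python
import random
import cv2

def sample_segments(video_path, k=8, flows_per_segment=5):
    # Decode all frames in time order (assumes the video is short enough to hold in memory
    # and has enough frames for the requested sampling).
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)                      # BGR frames in time order
        ok, frame = cap.read()
    cap.release()

    seg_len = len(frames) // k
    rgb_samples, flow_samples = [], []
    for s in range(k):
        segment = frames[s * seg_len:(s + 1) * seg_len]
        # One randomly sampled candidate RGB frame per segment.
        rgb_samples.append(random.choice(segment))
        # Randomly sampled, time-ordered positions for optical flow computation.
        idx = sorted(random.sample(range(1, len(segment)), flows_per_segment))
        grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in segment]
        flow_samples.append([farneback_flow_image(grays[i - 1], grays[i]) for i in idx])
    return rgb_samples, flow_samples
```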
Optionally, to improve the quality of the single-frame video image and the multi-frame first optical flow image, the RGB images and the grayscale images included in each segment may be preprocessed, such as filtering, before the RGB images and the grayscale images included in each segment are randomly sampled. The specific pretreatment mode can be selected according to actual needs, and the embodiment of the application is not limited to this.
Only one specific implementation of S102 described above is shown here. Of course, it should be understood that S102 may also be implemented in other manners, and this is not limited in this embodiment of the application.
And S104, performing feature extraction on at least one frame of video image of the target face through the video detection model to obtain the face emotion feature of the target face in the video to be detected.
In order to accurately extract the two types of features, as shown in fig. 2, the video detection model of the embodiment of the present application may adopt a dual-flow network architecture, that is, the video detection model includes a spatial flow network and a temporal flow network, wherein the spatial flow network is used to extract facial emotion features of a human face, the temporal flow network is used to extract facial motion features of the human face, and then video detection is performed by combining the two types of features, namely the facial emotion features and the facial motion features of a target human face in a video to be detected.
Specifically, as shown in fig. 2, the above S104 may be implemented as: and performing feature extraction on at least one frame of video image of the target face through a spatial stream network in the video detection model to obtain the face emotional features of the target face.
Illustratively, at least one frame of video image of the target human face is input into a spatial stream network of a video detection model, and feature extraction is carried out on the input single frame of video image by the spatial stream network, so that the facial emotional feature of the target human face is obtained.
In practice, the spatial stream network may adopt any suitable structure. Optionally, because the Inception-V3 convolutional neural network can increase the depth and width of the network and add nonlinearity, the spatial stream network may adopt an Inception-V3 convolutional neural network, which effectively addresses the problem that facial emotion features cannot be accurately extracted due to content differences between single-frame video images. More specifically, to fully utilize the useful information in a single-frame video image and extract rich facial emotion features, as shown in fig. 3, the spatial stream network may include a plurality of convolutional layers, a Gated Recurrent Unit (GRU) layer, a fully connected layer, and the like, where the plurality of convolutional layers may include a two-dimensional invariant convolutional layer, a two-dimensional spectral convolutional layer, and the like, and each convolutional layer may be provided with Batch Normalization (BN), a Rectified Linear Unit (ReLU) activation, and the like. Specifically, each convolutional layer is used to extract facial emotion features of different sizes from the at least one frame of video image; the GRU layer is used to select among the facial emotion features extracted by the convolutional layers and retain those that are useful for video detection; and the fully connected layer is used to integrate the facial emotion features retained by the GRU layer to obtain the final facial emotion features.
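The following PyTorch sketch shows a heavily simplified spatial stream of this kind (convolutional layers with BN and ReLU, a GRU layer and a fully connected layer); the layer sizes and the way per-frame features are passed to the GRU are assumptions and do not reproduce the exact Inception-V3-based structure described above.

```python
import torch
import torch.nn as nn

class SpatialStream(nn.Module):
    """Extracts facial emotion features from one or more RGB face frames (simplified sketch)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(                  # stand-in for the Inception-V3 backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.gru = nn.GRU(64, feat_dim, batch_first=True)  # selects/retains useful per-frame features
        self.fc = nn.Linear(feat_dim, feat_dim)            # integrates them into the final emotion feature

    def forward(self, frames):                 # frames: (batch, num_frames, 3, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1))    # (b*t, 64, 1, 1)
        x = x.flatten(1).view(b, t, -1)        # (b, t, 64) per-frame feature sequence
        _, h = self.gru(x)                     # h: (1, b, feat_dim)
        return self.fc(h[-1])                  # (b, feat_dim) facial emotion features
```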
And S106, performing feature extraction on the multi-frame first optical flow image through the video detection model to obtain the face action feature of the target face.
The facial motion feature of the target face refers to a feature capable of reflecting the facial motion of the target face, and includes, but is not limited to, a feature reflecting the lip motion of the target face, and the like.
Specifically, as shown in fig. 2, the above S106 may be implemented as: and performing feature extraction on the multi-frame first optical flow image through a time flow network in the video detection model to obtain the face action feature of the target face.
Illustratively, a plurality of frames of first optical flow images of the target human face are input into a time flow network in the video detection model, and feature extraction is performed on the plurality of frames of first optical flow images by the time flow network according to the time sequence of the plurality of frames of first optical flow images, so that the facial motion features of the target human face are obtained.
It should be noted that, in practical applications, the spatial stream network and the temporal stream network may have different network structures, for example, a self-attention layer is introduced into the spatial stream network, so that the spatial stream network can focus on key facial emotional features in a single frame video image; alternatively, the spatial stream network and the temporal stream network may have the same network structure.
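For the time flow network, a common two-stream convention is to stack the optical flow frames along the channel dimension and feed them to a 2D CNN; the sketch below follows that convention with assumed layer sizes and is not the application's exact architecture.

```python
import torch.nn as nn

class TemporalStream(nn.Module):
    """Extracts facial action features from a time-ordered stack of optical flow images (sketch)."""
    def __init__(self, num_flow_frames=5, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            # Flow frames are stacked on the channel axis (3 channels each here, since
            # the flow images sketched earlier are color-coded).
            nn.Conv2d(3 * num_flow_frames, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, flow_stack):     # flow_stack: (batch, 3 * num_flow_frames, H, W)
        return self.net(flow_stack)    # (batch, feat_dim) facial action features
```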
And S108, determining a detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
Specifically, the detection result of the video to be detected may indicate whether the video to be detected is a counterfeit video.
In an optional implementation manner, as shown in fig. 2, the video detection model in the embodiment of the present application further includes a classification network, and the classification network has a function of performing face authenticity identification based on an input facial feature. Specifically, the classification network may include an emotion recognition network and a voice recognition network, wherein the emotion recognition network may recognize an emotion expressed by a human face, that is, an emotional state of the human face, and the voice recognition network may recognize a facial action corresponding to the voice data.
Because the facial emotion features of the target face can reflect its facial emotion, the facial action features of the target face can reflect its facial actions, and a real face and a forged face differ in both facial emotion and facial action, in S108 the facial emotion features and facial action features of the target face are input into the classification network of the video detection model, which determines whether the target face is a forged face. If the target face is a forged face, the video to be detected can be determined to be a forged video; if the target face is a real face, the video to be detected can be determined to be a real video.
In another optional implementation manner, considering that the size of the pupil of the human face changes correspondingly when the human face presents different emotions, and the facial action of the user changes correspondingly when the user speaks, in order to accurately identify the authenticity of the video to be detected, as shown in fig. 2, the above S108 may be specifically implemented as:
and S181, determining the pupil size of the target face based on at least one frame of video image of the target face.
In the embodiment of the present application, the pupil size of the target face may be determined in any appropriate manner. Optionally, S181 may be specifically implemented as: segmenting the eye region of the target face from at least one frame of video image of the target face based on a preset image segmentation algorithm; performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain the pupil boundary of the target face; and fitting the pupil boundary of the target face based on a preset fitting algorithm to obtain the pupil size of the target face.
Exemplarily, the Dlib tool can be used to extract the target face from the at least one frame of video image, and one or more image segmentation algorithms commonly used in the art can then be used to detect the key points of the eyes, so as to segment the eye region of the target face from the single-frame video image. The eye region of the target face is then filtered, for example median-filtered with a filtering template of a preset size, to remove normally distributed noise in the eye region. Next, the eye region is binarized based on a one-dimensional maximum-entropy threshold segmentation method and a preset threshold to obtain a binarized eye region. Edge detection is then performed on the thresholded eye region with the Prewitt operator to obtain the pupil boundary of the target face, and the pupil boundary is represented with Freeman chain code encoding. Finally, a Hough circle fitting algorithm is applied to the pupil boundary of the target face: the image space is converted into a parameter space based on the standard Hough transform, the center of the circle in the eye region is detected, and the radius of the circle is derived from that center; this radius is the pupil size of the target face.
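The pupil-size pipeline described above can be approximated with OpenCV alone, as in the hedged sketch below; Otsu thresholding stands in for maximum-entropy thresholding, Canny stands in for the Prewitt operator, and the Hough parameters are illustrative assumptions.

```python
import cv2

def estimate_pupil_radius(eye_region_gray):
    # Median filtering suppresses roughly normally distributed noise in the eye region.
    eye = cv2.medianBlur(eye_region_gray, 5)
    # Simple global threshold as a stand-in for one-dimensional maximum-entropy thresholding.
    _, binary = cv2.threshold(eye, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Edge detection on the thresholded region (Canny here in place of the Prewitt operator).
    edges = cv2.Canny(binary, 50, 150)
    # Hough circle fitting: the radius of the detected circle approximates the pupil size.
    circles = cv2.HoughCircles(edges, cv2.HOUGH_GRADIENT, dp=1, minDist=eye.shape[0],
                               param1=100, param2=10, minRadius=2,
                               maxRadius=eye.shape[0] // 2)
    if circles is None:
        return None
    return float(circles[0, 0, 2])   # radius of the strongest circle
```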
Only one specific implementation of determining pupil size is shown here. Of course, it should be understood that the pupil size may be determined in other manners, which is not limited in the embodiments of the present application.
And S182, determining a first detection result of the video to be detected based on the face emotional characteristics and the pupil size of the target face.
The first detection result of the video to be detected can be used for indicating the authenticity of the video to be detected.
Optionally, considering that the emotion of a real face theoretically matches its pupil size (for example, the pupil of a real face is small when it is happy and large when it is frightened), whereas the emotion of a face forged by existing face forgery technology is difficult to match with its pupil size, in S182 emotion recognition can be performed on the facial emotion features of the target face through an emotion recognition network to obtain the emotional state of the target face, and the first detection result of the video to be detected is then determined based on the matching state between the emotional state of the target face and the pupil size of the target face.
Illustratively, the pupil size that matches the emotional state of the target face can be determined based on that emotional state and a preset correspondence between emotional states and pupil sizes. If the difference between this matching pupil size and the measured pupil size exceeds a preset threshold, the target face can be determined to be a forged face, and a first detection result indicating that the video to be detected is a forged video is obtained; if the difference is smaller than the preset threshold, the target face can be determined to be a real face, and a first detection result indicating that the video to be detected is a real video is obtained.
Of course, in practical applications, the first detection result may also include a probability that the video to be detected is a fake video and/or a probability that the video to be detected is a real video. For example, the probability that the video to be detected is a fake video and/or the probability that the video to be detected is a real video may be determined based on the matching degree value between the emotional state of the target face and the pupil size of the target face, so as to obtain the first detection result.
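A toy version of this matching rule might look as follows; the emotion-to-pupil-size table, the threshold and the probability mapping are assumptions made purely for illustration.

```python
# Assumed reference table: expected relative pupil size per recognized emotional state.
EXPECTED_PUPIL_SIZE = {"happy": 0.30, "neutral": 0.40, "surprised": 0.55, "frightened": 0.60}

def first_detection_result(emotion_state, measured_pupil_size, threshold=0.15):
    # measured_pupil_size is assumed to be normalized to the same relative scale as the table.
    expected = EXPECTED_PUPIL_SIZE.get(emotion_state, 0.40)
    mismatch = abs(expected - measured_pupil_size)
    fake_prob = min(1.0, mismatch / (2 * threshold))   # larger mismatch -> more likely forged
    return {"fake_prob": fake_prob, "is_fake": mismatch > threshold}
```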
Optionally, in order to further obtain a first detection result with higher accuracy, in the above S182, the facial emotion feature of the target face may be input to an emotion recognition network to obtain an emotion state of the target face, and text data corresponding to the voice data of the video to be detected is input to a preset text recognition model to obtain the emotion state of the target face in the video to be detected; further, a first detection result of the video to be detected is determined based on a matching state between the emotion state obtained by the emotion recognition network and the emotion state obtained by the preset text recognition model and a matching state between the emotion state of the target face and the pupil size of the target face.
Exemplarily, if the emotion state obtained based on the emotion recognition network is the same as the emotion state obtained by the preset text recognition model, the emotion state and the preset text recognition model are considered to be matched; further, if the emotion state obtained based on the emotion recognition network is matched with the emotion state obtained by a preset text recognition model, and the emotion state of the target face is matched with the pupil size of the target face, determining that the first detection result is that the video to be detected is a real video; and if the emotion state obtained based on the emotion recognition network is not matched with the emotion state obtained by the preset text recognition model or the emotion state of the target face is not matched with the pupil size of the target face, determining that the first detection result is that the video to be detected is a forged video.
It can be understood that, in the latter implementation manner, it is determined whether the emotional state of the target face matches the pupil size of the target face, and whether the emotional state obtained based on the emotion recognition network matches the emotional state obtained by the preset text recognition model, and the first detection result of the video to be detected is determined by combining the two matching results, so that it is possible to avoid that the first detection result is inaccurate due to matching between the emotion of the fake face and the pupil size.
And S183, determining a second detection result of the video to be detected based on the facial action characteristics of the target face and the voice data of the video to be detected.
Optionally, considering that the voice data of a real video matches the facial actions (especially the lip actions) of the target face in that video, whereas the voice data of a video forged by existing face forgery technology is difficult to match with the facial actions of the face in the video, in S183 voice recognition can be performed on the voice data of the video to be detected through the voice recognition network to obtain the target facial action features corresponding to the voice data, and the second detection result of the video to be detected is then determined based on the matching state between the facial action features of the target face and the target facial action features corresponding to the voice data.
For example, if the facial motion feature of the target face does not match the target facial motion feature corresponding to the voice data, it may be determined that the second detection result is that the video to be detected is a counterfeit video; and if the face action characteristics of the target face are matched with the target face action characteristics corresponding to the voice data, determining that the second detection result is that the video to be detected is a real video.
Of course, in practical applications, the second detection result may also include the probability that the video to be detected is a fake video and/or the probability that the video to be detected is a real video. For example, the probability that the video to be detected is a fake video and/or the probability that the video to be detected is a real video may be determined based on the matching degree value of the face action feature of the target face and the target face action feature corresponding to the voice data, so as to obtain the second detection result.
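The application does not fix how the matching-degree value between the two sets of action features is computed; one simple possibility, shown here purely as an assumption, is a cosine similarity between the two feature vectors.

```python
import numpy as np

def second_detection_result(face_action_feat, voice_action_feat, match_threshold=0.5):
    face = np.asarray(face_action_feat, dtype=np.float32)
    voice = np.asarray(voice_action_feat, dtype=np.float32)
    # Cosine similarity as an assumed matching-degree value between the facial action
    # features and the target facial action features predicted from the speech data.
    sim = float(face @ voice / (np.linalg.norm(face) * np.linalg.norm(voice) + 1e-8))
    fake_prob = float(np.clip(1.0 - sim, 0.0, 1.0))   # higher mismatch -> more likely forged
    return {"fake_prob": fake_prob, "is_fake": sim < match_threshold}
```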
S184, determining the detection result of the video to be detected based on the first detection result and the second detection result of the video to be detected.
Exemplarily, if the first detection result and the second detection result both indicate that the video to be detected is a real video, the video to be detected is finally determined to be a real video; otherwise, the video to be detected is finally determined to be a forged video.
For another example, if the first detection result and the second detection result both include the probability that the video to be detected is the counterfeit video, the first detection result and the second detection result may be subjected to weighted summation to obtain a final probability, and if the final probability exceeds a preset probability threshold, the video to be detected is determined to be the counterfeit video; otherwise, determining that the video to be detected is a real video.
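The weighted-summation fusion of the two results can be sketched as follows; the weights and probability threshold are illustrative assumptions.

```python
def fuse_detection_results(first_fake_prob, second_fake_prob,
                           w_first=0.5, w_second=0.5, prob_threshold=0.5):
    # Weighted sum of the two forgery probabilities gives the final probability.
    final_prob = w_first * first_fake_prob + w_second * second_fake_prob
    return {"fake_prob": final_prob, "is_fake": final_prob > prob_threshold}
```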
Only one specific implementation of S108 described above is shown here. Of course, it should be understood that S108 may also be implemented in other manners, and this is not limited in this embodiment of the application.
It should be noted that, if in S102 a single-frame video image and multiple frames of first optical flow images arranged in time sequence are acquired for each segment of the video to be detected, then S104 to S108 may be executed for each segment based on the single-frame video image and the multiple frames of first optical flow images of that segment to obtain a detection result for the segment; the detection results of all segments are then combined to determine whether the video to be detected is a forged video. For example, if the detection results of more than 1/2 of the segments indicate that the video to be detected is a forged video, the video to be detected is determined to be a forged video; otherwise, it is determined to be a real video.
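When the video is processed segment by segment as described above, the per-segment decisions can be aggregated with a simple majority rule, for example:

```python
def aggregate_segments(segment_is_fake):
    # The video is judged forged if more than half of its segments are judged forged.
    fake_votes = sum(bool(v) for v in segment_is_fake)
    return fake_votes > len(segment_is_fake) / 2
```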
The video detection method provided by the embodiment of the application exploits the natural law that a real face and a forged face differ in both appearance and dynamic motion. Facial emotion features of the target face are extracted by a video detection model from at least one frame of video image of the target face in the video to be detected, facial action features of the target face are extracted by the video detection model from multiple frames of first optical flow images of the target face arranged in time sequence, and the detection result of the video to be detected is then determined based on at least the facial emotion features and facial action features of the target face. Because facial emotion features are static features in the spatial domain and reflect the appearance of the face, while facial action features are dynamic features in the temporal domain and reflect the motion of the face, performing video detection by combining these two kinds of features avoids overfitting to deep-forgery features with specific distributions, thereby improving detection accuracy and universality.
The embodiment of the application further provides a training method of the video detection model, and the trained video detection model can be used for detecting the video to be detected. The following describes the training process of the video inspection model in detail.
Referring to fig. 4, a flow chart of a method for training a video detection model according to an embodiment of the present application is shown, where the method includes the following steps:
s402, acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set.
Wherein the sample video set comprises a real video and a plurality of forged videos. The plurality of forged videos correspond one-to-one to a plurality of face forgery algorithms, and each forged video is obtained by forging the real video with the corresponding face forgery algorithm. In practical applications, the face forgery algorithms may include, but are not limited to, the Face2Face algorithm, the FaceSwap algorithm, the DeepFakes algorithm, the NeuralTextures algorithm, and the like.
The authenticity label corresponding to a sample video is used to indicate whether the sample video is a forged video. In practical applications, the authenticity label can be represented in one-hot form, for example, the authenticity label corresponding to a real video is (1, 0) and the authenticity label corresponding to a forged video is (0, 1). Of course, the authenticity label corresponding to a sample video may also be represented in other manners commonly used in the art, which is not limited in this application.
S404, at least one frame of video image of the sample face in the target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement are obtained.
The specific implementation manner of S404 is similar to that of S102 in the embodiment shown in fig. 1, and is not described herein again.
S406, performing feature extraction on at least one frame of video image of the sample face through the initial video detection model to obtain the face emotion feature of the sample face.
The specific implementation manner of S406 is similar to the specific implementation manner of S104 in the embodiment shown in fig. 1, and is not described herein again.
S408, performing feature extraction on the multi-frame second optical flow image of the sample face through the initial video detection model to obtain the face action feature of the sample face.
The specific implementation manner of S408 is similar to the specific implementation manner of S106 in the embodiment shown in fig. 1, and is not described herein again.
And S410, determining a detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face.
The specific implementation manner of S410 is similar to that of S108 in the embodiment shown in fig. 1, and is not described herein again. For example, S410 may include: determining the pupil size of a sample face in the target sample video based on at least one frame of video image of the sample face; determining a first detection result of the target sample video based on the facial emotion features and the pupil size of the sample face; determining a second detection result of the target sample video based on the facial action features of the sample face and the voice data of the target sample video; and determining whether the target sample video is a forged video based on the first detection result and the second detection result of the target sample video.
And S412, performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain the video detection model.
Specifically, the detection loss of the initial video detection model can be determined based on the detection result, the authenticity label and the preset loss function of each sample video in the sample video set; further, iterative training is carried out on the initial video detection model based on the detection loss of the initial video detection model until a training stopping condition is met, and the video detection model is obtained.
More specifically, the S412 may include: repeatedly executing the following processing until the initial video detection model meets the preset training stop condition: determining the total detection loss of the initial video detection model based on the first detection result and the second detection result of each sample video in the sample video set and the authenticity label of each sample video; based on the total detection loss, model parameters of the initial video detection model are adjusted.
Exemplarily, a first detection loss of the initial video detection model is determined based on a first detection result of each sample video in the sample video set, the authenticity label of each sample video and a first preset loss function; determining a second detection loss of the initial video detection model based on a second detection result of each sample video in the sample video set, the authenticity label of each sample video and a second preset loss function; further, the first detection loss and the second detection loss of the initial video detection model are subjected to weighted summation to obtain the total detection loss of the initial video detection model. The first detection loss is used for representing the loss generated by the initial video detection model for video detection based on the facial emotional characteristics, and the second detection loss is used for representing the loss generated by the initial video detection model for video detection based on the facial action characteristics. The first preset loss function and the second preset loss function may be set according to actual needs, which is not limited in the embodiment of the present application.
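As a minimal sketch of the weighted total detection loss described above, the following assumes both preset loss functions are cross-entropy losses and that the weights w1/w2 are hyperparameters chosen by the practitioner; none of these choices is mandated by the text.

```python
import torch
import torch.nn.functional as F

def total_detection_loss(first_logits, second_logits, labels, w1=0.5, w2=0.5):
    """first_logits: scores from the facial-emotion-feature branch, shape (N, 2).
    second_logits: scores from the facial-action-feature branch, shape (N, 2).
    labels: class indices derived from the one-hot authenticity labels
            (0 = real, 1 = forged), shape (N,)."""
    first_loss = F.cross_entropy(first_logits, labels)    # first detection loss
    second_loss = F.cross_entropy(second_logits, labels)  # second detection loss
    return w1 * first_loss + w2 * second_loss              # weighted summation
```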
It can be understood that determining the total detection loss of the initial video detection model in the above manner comprehensively considers the detection losses incurred when the initial video detection model performs video detection based on different facial features, so that the obtained total detection loss can more accurately reflect the difference between the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video; adjusting the model parameters of the initial video detection model with this total detection loss is therefore beneficial to improving the detection accuracy of the finally obtained video detection model.
For example, a back propagation algorithm may be adopted to determine, based on the detection loss of the initial video detection model and its current model parameters, the detection loss contributed by each network in the initial video detection model; then, the model parameters of each network are adjusted layer by layer with the goal of reducing the detection loss of the initial video detection model. The model parameters of the initial video detection model may specifically include, but are not limited to: the number of nodes of each network in the initial video detection model, the connection relations and connection edge weights among the nodes of different networks, the biases corresponding to the nodes in each network, and the like.
In practical application, the preset loss function and the training stopping condition may be set according to actual needs, for example, the preset loss function may be set as a cross entropy loss function, and the training stopping condition may include that the detection loss of the initial video detection model is smaller than a preset loss threshold or the iteration number reaches a preset number threshold, and the like, which is not limited in the embodiment of the present application.
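A hedged sketch of this iterative training loop is given below, assuming a PyTorch-style model and optimizer and reusing the total_detection_loss helper from the earlier sketch; the loss threshold and iteration cap are illustrative stop conditions, not prescribed values.

```python
def train(model, optimizer, data_loader, loss_threshold=1e-3, max_iterations=10000):
    """model(rgb_frames, flow_frames) is assumed to return the first and second
    detection scores; data_loader yields (rgb_frames, flow_frames, labels)."""
    iteration = 0
    while True:  # repeat until a preset training stop condition is met
        for rgb_frames, flow_frames, labels in data_loader:
            first_logits, second_logits = model(rgb_frames, flow_frames)
            loss = total_detection_loss(first_logits, second_logits, labels)
            optimizer.zero_grad()
            loss.backward()   # back-propagate the total detection loss
            optimizer.step()  # adjust the model parameters
            iteration += 1
            # Stop when the detection loss falls below a preset loss threshold
            # or the iteration count reaches a preset number threshold.
            if loss.item() < loss_threshold or iteration >= max_iterations:
                return model
```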
The embodiment of the present application shows a specific implementation manner of performing iterative training on an initial video detection model. Of course, it should be understood that the initial video detection model may be iteratively trained in other ways in the art, which is not limited in this application.
According to the training method of the video detection model provided by the embodiment of the application, a real video and forged videos obtained by forging the real video with a plurality of face forging algorithms are used as sample videos, and the initial video detection model is trained with the sample videos and their corresponding authenticity labels, so that the obtained video detection model can learn the characteristics of a plurality of kinds of forged videos; this improves the generalization capability of the video detection model and its detection effect on various videos. In the specific model training process, facial emotion features of a sample face are extracted, through the initial video detection model, from at least one frame of video image of the sample face in a sample video, and facial action features of the sample face are extracted, through the initial video detection model, from a plurality of frames of optical flow images of the sample face arranged in time sequence in the sample video; the sample video is detected based on at least the facial emotion features and the facial action features of the sample face, and the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video, to obtain the video detection model. In this way, the initial video detection model can fully learn the static characteristics of the sample videos in the spatial domain and accurately extract the facial emotion features reflecting the appearance of the face, and can also fully learn the dynamic characteristics of the sample videos in the time domain and accurately extract the facial action features reflecting the dynamic actions of the face, thereby gaining the ability to accurately identify videos by combining the two kinds of features. This prevents the initial video detection model from falling into a state of overfitting deep-forgery features of some specific distribution, so that the trained video detection model has high detection accuracy and universality, which in turn improves the accuracy and universality of video detection based on the video detection model.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In addition, corresponding to the video detection method shown in fig. 1, an embodiment of the present application further provides a video detection apparatus. Referring to fig. 5, a schematic structural diagram of a video detection apparatus 500 according to an embodiment of the present application is shown, where the apparatus 500 includes:
a first image obtaining unit 510, configured to obtain at least one frame of video image of a target face in a video to be detected and multiple frames of first optical flow images of the target face arranged based on a time sequence;
a first spatial domain feature extraction unit 520, configured to perform feature extraction on the at least one frame of video image through a video detection model to obtain facial emotion features of a target face;
a first time domain feature extraction unit 530, configured to perform feature extraction on the multiple frames of first optical flow images through the video detection model, so as to obtain a facial motion feature of the target human face;
a first detecting unit 540, configured to determine a detection result of the video to be detected based on at least the facial emotion feature and the facial motion feature of the target human face.
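To make the division of labour among the four units concrete, here is a minimal sketch of how the apparatus could be wired up as a two-stream network; the backbone choices (a small 2D CNN for the spatial stream, a small 3D CNN for the temporal stream), the layer sizes, and the fusion by concatenation are assumptions for illustration, not the patent's prescription.

```python
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Spatial stream: extracts facial emotion features from RGB video frames.
        self.spatial_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Temporal stream: extracts facial action features from stacked optical flow.
        self.temporal_stream = nn.Sequential(
            nn.Conv3d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb_frames, flow_frames):
        # rgb_frames: (N, 3, H, W); flow_frames: (N, 2, T, H, W)
        emotion_features = self.spatial_stream(rgb_frames)
        action_features = self.temporal_stream(flow_frames)
        fused = torch.cat([emotion_features, action_features], dim=1)
        return self.classifier(fused)  # detection scores (real vs. forged)
```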
Optionally, the first detection unit includes:
a pupil determining subunit, configured to determine a pupil size of the target human face based on the at least one frame of video image;
the first detection subunit is used for determining a first detection result of the video to be detected based on the facial emotional characteristics and the pupil size of the target face;
the second detection subunit is used for determining a second detection result of the video to be detected based on the facial action characteristics of the target face and the voice data of the video to be detected;
and the third detection subunit is used for determining the detection result of the video to be detected based on the first detection result and the second detection result.
Optionally, the first detecting subunit is specifically configured to:
performing emotion recognition on the face emotion characteristics of the target face through an emotion recognition network in the video detection model to obtain the emotion state of the target face;
and determining a first detection result of the video to be detected based on the matching state between the emotion state of the target face and the pupil size of the target face.
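Purely as an illustrative sketch of this matching logic, the expected pupil-size ranges per emotion state and the normalization below are invented for the example and are not taken from the patent.

```python
# Hypothetical mapping from recognized emotion state to an expected range of
# normalized pupil size (pupil diameter / iris diameter); values are illustrative.
EXPECTED_PUPIL_RANGE = {
    "neutral":   (0.35, 0.55),
    "surprised": (0.50, 0.75),
    "fearful":   (0.50, 0.80),
}

def first_detection_result(emotion_state, pupil_size):
    low, high = EXPECTED_PUPIL_RANGE.get(emotion_state, (0.0, 1.0))
    matched = low <= pupil_size <= high
    # A mismatch between the emotion state and the pupil size suggests a forged face.
    return "real" if matched else "forged"
```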
Optionally, the pupil determination subunit is specifically configured to:
based on a preset image segmentation algorithm, segmenting the eye region of the target human face from the at least one frame of video image;
performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain a pupil boundary of the target face;
and fitting the eye region based on a preset fitting algorithm and the pupil boundary of the target face to obtain the pupil size of the target face.
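A rough OpenCV-based sketch of the three steps above (segmentation, edge detection, fitting) follows; the Canny thresholds and the choice of ellipse fitting stand in for the "preset" algorithms mentioned in the text, and eye_region is assumed to be a grayscale crop of one already-segmented eye.

```python
import cv2

def estimate_pupil_size(eye_region):
    """eye_region: grayscale image (uint8) of the segmented eye area."""
    # Edge detection on the eye region to expose the pupil boundary.
    edges = cv2.Canny(eye_region, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if len(c) >= 5]  # fitEllipse needs >= 5 points
    if not contours:
        return None
    # Fit an ellipse to the largest boundary and take its mean axis as the pupil size.
    pupil_boundary = max(contours, key=cv2.contourArea)
    (_, _), (major_axis, minor_axis), _ = cv2.fitEllipse(pupil_boundary)
    return (major_axis + minor_axis) / 2.0
```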
Optionally, the second detecting subunit is specifically configured to:
performing voice recognition on the voice data of the video to be detected through the voice recognition network of the video detection model to obtain target face action characteristics corresponding to the voice data;
And determining a second detection result of the video to be detected based on the matching state between the facial action characteristics of the target face and the target facial action characteristics corresponding to the voice data.
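A minimal sketch of this matching step, assuming both feature vectors have already been produced by the respective networks and that cosine similarity against an arbitrary threshold serves as the "matching state"; the threshold value is an assumption.

```python
import numpy as np

def second_detection_result(face_action_feat, target_action_feat, threshold=0.8):
    """Both arguments are 1-D feature vectors of the same dimension."""
    cos_sim = np.dot(face_action_feat, target_action_feat) / (
        np.linalg.norm(face_action_feat) * np.linalg.norm(target_action_feat) + 1e-8)
    # If the observed facial motion matches what the speech implies,
    # this branch judges the video real; otherwise forged.
    return "real" if cos_sim >= threshold else "forged"
```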
Optionally, the first spatial domain feature extraction unit is specifically configured to perform feature extraction on the at least one frame of video image through a spatial stream network in the video detection model to obtain a facial emotion feature of the target face;
the first time domain feature extraction unit is specifically configured to perform feature extraction on the multiple frames of first optical flow images through a time flow network in the video detection model to obtain a facial motion feature of the target face.
Optionally, the acquiring, by the first image acquiring unit, at least one frame of video image of a target face in a video to be detected includes:
dividing the video to be detected into a plurality of video segments;
randomly sampling a plurality of frames of RGB images of the target face in each video segment to obtain a plurality of candidate single-frame video images;
determining the at least one frame of video image from the plurality of candidate single frame video images.
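As a sketch of this sampling procedure, the following assumes the video has already been decoded into a time-ordered list of per-frame RGB face crops and that the number of segments and frames per segment are free parameters.

```python
import random

def sample_rgb_frames(face_frames, num_segments=8, frames_per_segment=1):
    """face_frames: list of RGB face images of the target face in temporal order."""
    segment_len = max(1, len(face_frames) // num_segments)
    candidates = []
    for i in range(num_segments):
        segment = face_frames[i * segment_len:(i + 1) * segment_len]
        if segment:
            # Randomly sample frame(s) of the target face within each video segment.
            candidates.extend(
                random.sample(segment, min(frames_per_segment, len(segment))))
    return candidates  # candidate single-frame video images for the spatial stream
```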
Optionally, the acquiring, by the first image acquiring unit, the plurality of frames of first optical flow images of the target face arranged based on time sequence includes:
dividing the video to be detected into a plurality of video segments;
randomly sampling multi-frame gray level images of the target face in each video clip to obtain multi-frame candidate gray level images;
determining a first optical flow image corresponding to each frame of candidate gray scale image based on each frame of candidate gray scale image and the candidate gray scale image adjacent to the time sequence of each frame of candidate gray scale image;
and obtaining the plurality of frames of first optical flow images based on the first optical flow images respectively corresponding to the plurality of frames of candidate gray level images.
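A sketch of the optical-flow step using OpenCV's dense Farneback algorithm; the text only requires flow between temporally adjacent candidate grayscale images, so the specific algorithm and its parameters here are assumptions.

```python
import cv2
import numpy as np

def optical_flow_images(gray_frames):
    """gray_frames: list of consecutive grayscale face images (time-ordered)."""
    flows = []
    for prev_gray, next_gray in zip(gray_frames[:-1], gray_frames[1:]):
        # Dense optical flow between each candidate frame and its temporal neighbour.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, next_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)  # shape (H, W, 2): horizontal and vertical components
    return np.stack(flows)  # multi-frame first optical flow images, time-ordered
```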
Obviously, the video detection apparatus provided in the embodiment of the present application can serve as the execution subject of the video detection method shown in fig. 1, and can therefore implement the functions of the video detection method in fig. 1. Since the principle is the same, the description will not be repeated here.
The video detection apparatus provided by the embodiment of the application utilizes the natural law that a real face and a forged face differ both in appearance and in dynamic action. Based on the video detection model, facial emotion features of the target face are extracted from at least one frame of video image of the target face in the video to be detected, and facial action features of the target face are extracted from a plurality of frames of first optical flow images of the target face arranged in time sequence in the video to be detected; the detection result of the video to be detected is then determined based on at least the facial emotion features and the facial action features of the target face. Because the facial emotion features are static features in the spatial domain and reflect the appearance of the face, while the facial action features are dynamic features in the time domain and reflect the actions of the face, performing video detection by combining these spatial-domain static features with these time-domain dynamic features avoids overfitting to deep-forgery features of some specific distribution, thereby improving detection accuracy and universality.
In addition, corresponding to the training method of the video detection model shown in fig. 4, the embodiment of the present application further provides a training apparatus of the video detection model. Referring to fig. 6, a schematic structural diagram of a training apparatus 600 for a video detection model according to an embodiment of the present application is provided, where the apparatus 600 includes:
the system comprises a sample acquisition unit 610, a processing unit and a processing unit, wherein the sample acquisition unit 610 is used for acquiring a sample video set and a true-false label corresponding to each sample video in the sample video set, the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos correspond to a plurality of face forging algorithms one by one, and each forged video is obtained by forging the real video based on the corresponding face forging algorithm;
a second image obtaining unit 620, configured to obtain at least one frame of video image of a sample face in the target sample video and multiple frames of second optical flow images of the sample face based on the time sequence arrangement;
a second spatial domain feature extraction unit 630, configured to perform feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model, so as to obtain facial emotional features of the sample face;
a second time domain feature extraction unit 640, configured to perform feature extraction on multiple frames of second optical flow images of a sample face in the target sample video through the initial video detection model, so as to obtain a facial motion feature of the sample face;
a second detection unit 650, configured to determine a detection result of the target sample video based on at least facial emotion characteristics and facial motion characteristics of a sample face in the target sample video;
the training unit 660 is configured to perform iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video, so as to obtain a video detection model.
Optionally, the second detecting unit is specifically configured to:
determining the pupil size of a sample face in the target sample video based on at least one frame of video image of the sample face in the target sample video;
determining a first detection result of the sample face in the target sample video based on the face emotional characteristics and the pupil size of the sample face in the target sample video;
determining a second detection result of the target sample video based on the facial action features of the sample face in the target sample video and the voice data of the target sample video;
and determining whether the target sample video is a fake video or not based on the first detection result and the second detection result of the target sample video.
Optionally, the training unit is specifically configured to:
repeatedly executing the following processing until the initial video detection model meets a preset training stop condition:
determining the total detection loss of the initial video detection model based on the first detection result and the second detection result of each sample video in the sample video set and the authenticity label of each sample video;
based on the total detection loss, adjusting model parameters of the initial video detection model.
Obviously, the training apparatus for the video detection model provided in the embodiment of the present application can serve as the execution subject of the training method of the video detection model shown in fig. 4, and can therefore implement the functions of the training method of the video detection model in fig. 4. Since the principle is the same, the description will not be repeated here.
According to the training apparatus for the video detection model provided by the embodiment of the application, a real video and forged videos obtained by forging the real video with a plurality of face forging algorithms are used as sample videos, and the initial video detection model is trained with the sample videos and their corresponding authenticity labels, so that the obtained video detection model can learn the characteristics of a plurality of kinds of forged videos; this improves the generalization capability of the video detection model and its authenticity detection effect on various videos. In the specific model training process, facial emotion features of a sample face are extracted, through the initial video detection model, from at least one frame of video image of the sample face in a sample video, and facial action features of the sample face are extracted, through the initial video detection model, from a plurality of frames of optical flow images of the sample face arranged in time sequence in the sample video; the sample video is detected based on at least the facial emotion features and the facial action features of the sample face, and the initial video detection model is then iteratively trained based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video, to obtain the video detection model. In this way, the initial video detection model can fully learn the static characteristics of the sample videos in the spatial domain and accurately extract the facial emotion features reflecting the appearance of the face, and can also fully learn the dynamic characteristics of the sample videos in the time domain and accurately extract the facial action features reflecting the dynamic actions of the face, thereby gaining the ability to accurately identify videos by combining the two kinds of features. This prevents the initial video detection model from falling into a state of overfitting deep-forgery features of some specific distribution, so that the trained video detection model has high detection accuracy and universality, which in turn improves the accuracy and universality of video detection based on the video detection model.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
And the memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the video detection device on the logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
performing feature extraction on the at least one frame of video image through a video detection model to obtain face emotion features of the target face;
performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain the facial action features of the target human face;
and determining the detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
Or the processor reads a corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the training device of the video detection model on the logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos are in one-to-one correspondence with a plurality of face forging algorithms, and each forged video is obtained by forging the real video based on its corresponding face forging algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
performing feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain face emotion features of the sample face;
performing feature extraction on a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain the face action features of the sample face;
determining a detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face in the target sample video;
and performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
The method performed by the video detection apparatus disclosed in the embodiment of fig. 1 of the present application, or the training method of the video detection model disclosed in the embodiment of fig. 4 of the present application, can be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EEPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The electronic device may further execute the method shown in fig. 1 and implement the functions of the video detection apparatus in the embodiment shown in fig. 1, or the electronic device may further execute the method shown in fig. 4 and implement the functions of the training apparatus of the video detection model in the embodiment shown in fig. 4, which is not described herein again in this embodiment of the present application.
Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which when executed by a portable electronic device including a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and are specifically configured to:
acquiring at least one frame of video image of a target face in a video to be detected and a multi-frame first optical flow image of the target face based on time sequence arrangement;
performing feature extraction on the at least one frame of video image through a video detection model to obtain face emotion features of the target face;
performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain the facial action features of the target human face;
and determining the detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
Alternatively, the instructions, when executed by a portable electronic device comprising a plurality of application programs, can cause the portable electronic device to perform the method of the embodiment shown in fig. 4, and in particular to perform the following operations:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos are in one-to-one correspondence with a plurality of face forging algorithms, and each forged video is obtained by forging the real video based on its corresponding face forging algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
performing feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain face emotion features of the sample face;
performing feature extraction on a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain the face action features of the sample face;
determining a detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face in the target sample video;
and performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (14)

1. A video detection method, comprising:
acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face based on time sequence arrangement;
performing feature extraction on the at least one frame of video image through a video detection model to obtain face emotion features of the target face;
performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain the facial action features of the target human face;
and determining the detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
2. The method according to claim 1, wherein the determining the detection result of the video to be detected based on at least the facial emotion feature and the facial action feature of the target human face comprises:
determining the pupil size of the target human face based on the at least one frame of video image;
determining a first detection result of the video to be detected based on the face emotional characteristics and the pupil size of the target face;
determining a second detection result of the video to be detected based on the facial action characteristics of the target face and the voice data of the video to be detected;
and determining the detection result of the video to be detected based on the first detection result and the second detection result.
3. The method according to claim 2, wherein the determining a first detection result of the video to be detected based on the emotional facial features and the pupil size of the target face comprises:
performing emotion recognition on the face emotion characteristics of the target face through an emotion recognition network in the video detection model to obtain the emotion state of the target face;
and determining a first detection result of the video to be detected based on the matching state between the emotion state of the target face and the pupil size of the target face.
4. The method of claim 2, wherein the determining the pupil size of the target face based on the at least one frame of video image comprises:
based on a preset image segmentation algorithm, segmenting the eye region of the target human face from the at least one frame of video image;
performing edge detection on the eye region of the target face based on a preset edge detection algorithm to obtain a pupil boundary of the target face;
and fitting the eye region based on a preset fitting algorithm and the pupil boundary of the target face to obtain the pupil size of the target face.
5. The method according to claim 2, wherein the determining a second detection result of the video to be detected based on the facial action feature of the target human face and the voice data of the video to be detected comprises:
performing voice recognition on voice data of the video to be detected through a voice recognition network of the video detection model to obtain target face action characteristics corresponding to the voice data;
and determining a second detection result of the video to be detected based on the matching state between the facial action characteristic of the target face and the target facial action characteristic corresponding to the voice data.
6. The method according to claim 1, wherein said performing feature extraction on the at least one frame of video image through a video detection model to obtain facial emotional features of the target human face comprises:
performing feature extraction on the at least one frame of video image through a spatial stream network in the video detection model to obtain facial emotion features of the target face;
performing feature extraction on the multiple frames of first optical flow images through the video detection model to obtain the facial action features of the target human face, including:
and performing feature extraction on the plurality of frames of first optical flow images through a time flow network in the video detection model to obtain the facial action features of the target face.
7. The method according to claim 1, wherein the acquiring at least one frame of video image of a target face in the video to be detected comprises:
dividing the video to be detected into a plurality of video segments;
randomly sampling a plurality of frames of RGB images of the target face in each video segment to obtain a plurality of candidate single-frame video images;
determining the at least one frame of video image from the plurality of candidate single frame video images.
8. The method according to claim 1, wherein the acquiring a plurality of frames of first optical flow images of the target face arranged based on time sequence comprises:
dividing the video to be detected into a plurality of video segments;
randomly sampling multi-frame gray level images of the target face in each video clip to obtain multi-frame candidate gray level images;
determining a first optical flow image corresponding to each frame of candidate gray scale image based on each frame of candidate gray scale image and the candidate gray scale image adjacent to the time sequence of each frame of candidate gray scale image;
and obtaining the plurality of frames of first optical flow images based on the first optical flow images respectively corresponding to the plurality of frames of candidate gray level images.
9. A training method of a video detection model is characterized by comprising the following steps:
acquiring a sample video set and an authenticity label corresponding to each sample video in the sample video set, wherein the sample video set comprises a real video and a plurality of forged videos, the plurality of forged videos are in one-to-one correspondence with a plurality of face forging algorithms, and each forged video is obtained by forging the real video based on its corresponding face forging algorithm;
acquiring at least one frame of video image of a sample face in a target sample video and a plurality of frames of second optical flow images of the sample face based on time sequence arrangement;
performing feature extraction on at least one frame of video image of a sample face in the target sample video through an initial video detection model to obtain face emotion features of the sample face;
performing feature extraction on a plurality of frames of second optical flow images of the sample face in the target sample video through the initial video detection model to obtain the face action features of the sample face;
determining a detection result of the target sample video at least based on the face emotion characteristics and the face action characteristics of the sample face in the target sample video;
and performing iterative training on the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model.
10. The method of claim 9, wherein determining the detection result of the target sample video based on at least facial emotion characteristics and facial motion characteristics of the sample face in the target sample video comprises:
determining the pupil size of a sample face in the target sample video based on at least one frame of video image of the sample face in the target sample video;
determining a first detection result of the sample face in the target sample video based on the face emotional characteristics and the pupil size of the sample face in the target sample video;
determining a second detection result of the target sample video based on the facial action features of the sample face in the target sample video and the voice data of the target sample video;
and determining whether the target sample video is a fake video or not based on the first detection result and the second detection result of the target sample video.
11. The method of claim 10, wherein the iteratively training the initial video detection model based on the detection result of each sample video in the sample video set and the authenticity label corresponding to each sample video to obtain a video detection model comprises:
repeatedly executing the following processing until the initial video detection model meets a preset training stop condition:
determining the total detection loss of the initial video detection model based on the first detection result and the second detection result of each sample video in the sample video set and the authenticity label of each sample video;
adjusting model parameters of the initial video detection model based on the total detection loss.
12. A video detection apparatus, comprising:
the first image acquisition unit is used for acquiring at least one frame of video image of a target face in a video to be detected and a plurality of frames of first optical flow images of the target face arranged based on time sequence;
the first airspace feature extraction unit is used for extracting the features of the at least one frame of video image through a video detection model to obtain the facial emotion features of the target face;
the first time domain feature extraction unit is used for extracting features of the plurality of frames of first optical flow images through the video detection model to obtain face action features of the target face;
and the first detection unit is used for determining the detection result of the video to be detected at least based on the face emotion characteristics and the face action characteristics of the target face.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 11.
14. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-11.
CN202210564026.3A 2022-05-23 2022-05-23 Video detection method, training method and device for video detection model Active CN114842399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210564026.3A CN114842399B (en) 2022-05-23 2022-05-23 Video detection method, training method and device for video detection model


Publications (2)

Publication Number Publication Date
CN114842399A true CN114842399A (en) 2022-08-02
CN114842399B CN114842399B (en) 2023-07-25

Family

ID=82572784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210564026.3A Active CN114842399B (en) 2022-05-23 2022-05-23 Video detection method, training method and device for video detection model

Country Status (1)

Country Link
CN (1) CN114842399B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040039483A1 (en) * 2001-06-01 2004-02-26 Thomas Kemp Man-machine interface unit control method, robot apparatus, and its action control method
CN109034126A (en) * 2018-08-31 2018-12-18 上海理工大学 A kind of micro- expression recognition method based on light stream principal direction
CN109508638A (en) * 2018-10-11 2019-03-22 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN110287805A (en) * 2019-05-31 2019-09-27 东南大学 Micro- expression recognition method and system based on three stream convolutional neural networks
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks
CN110210429A (en) * 2019-06-06 2019-09-06 山东大学 A method of network is generated based on light stream, image, movement confrontation and improves anxiety, depression, angry facial expression recognition correct rate
CN111353390A (en) * 2020-01-17 2020-06-30 道和安邦(天津)安防科技有限公司 Micro-expression recognition method based on deep learning
CN111753782A (en) * 2020-06-30 2020-10-09 西安深信科创信息技术有限公司 False face detection method and device based on double-current network and electronic equipment
CN113379877A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Face video generation method and device, electronic equipment and storage medium
CN114333078A (en) * 2021-12-01 2022-04-12 马上消费金融股份有限公司 Living body detection method, living body detection device, electronic apparatus, and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAOYUAN MA et al.: "A REGION HISTOGRAM OF ORIENTED OPTICAL FLOW (RHOOF) FEATURE FOR APEX FRAME SPOTTING IN MICRO-EXPRESSION", pages 1 - 6, Retrieved from the Internet (published online) <URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8266489> *
潘仙张; 张石清; 郭文平: "Multi-modal deep convolutional neural network applied to video expression recognition", Optics and Precision Engineering, no. 04, pages 230 - 237 *
陈旭东: "Design of an action recognition system based on SoC FPGA and CNN model", Embedded Technology, vol. 45, no. 2, pages 97 - 104 *

Also Published As

Publication number Publication date
CN114842399B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN107133622B (en) Word segmentation method and device
CN111368764B (en) False video detection method based on computer vision and deep learning algorithm
CN108197644A (en) A kind of image-recognizing method and device
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
CN110570442A (en) Contour detection method under complex background, terminal device and storage medium
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN116562991B (en) Commodity big data information identification method and system for meta-space electronic commerce platform
CN111784675A (en) Method and device for processing article texture information, storage medium and electronic equipment
CN114581710A (en) Image recognition method, device, equipment, readable storage medium and program product
CN110866931B (en) Image segmentation model training method and classification-based enhanced image segmentation method
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
Shu et al. Face spoofing detection based on multi-scale color inversion dual-stream convolutional neural network
CN114639101A (en) Emulsion droplet identification system, method, computer equipment and storage medium
CN111191584B (en) Face recognition method and device
CN112749696A (en) Text detection method and device
CN112818774A (en) Living body detection method and device
CN110633666A (en) Gesture track recognition method based on finger color patches
Kumar et al. PCB defect classification using logical combination of segmented copper and non-copper part
Gibson et al. A no-reference perceptual based contrast enhancement metric for ocean scenes in fog
CN114842399B (en) Video detection method, training method and device for video detection model
CN113158745B (en) Multi-feature operator-based messy code document picture identification method and system
CN111986176B (en) Crack image identification method, system, terminal and readable storage medium
CN111753842B (en) Method and device for detecting text region of bill
CN114283087A (en) Image denoising method and related equipment
CN113420767A (en) Method, system and device for extracting features for font classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant