CN113449543A - Video detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113449543A
CN113449543A (application CN202010213737.7A)
Authority
CN
China
Prior art keywords
face
video
module configured
image frames
facial feature
Prior art date
Legal status
Granted
Application number
CN202010213737.7A
Other languages
Chinese (zh)
Other versions
CN113449543B (en)
Inventor
王洋
刘焱
郝新
吴月升
熊俊峰
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority claimed from CN202010213737.7A
Publication of CN113449543A
Application granted
Publication of CN113449543B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the disclosure relate to a video detection method, apparatus, device, and storage medium in the technical field of computer vision. The method includes acquiring a plurality of image frames relating to a face from a video to be detected, and then extracting a plurality of facial feature representations corresponding to the plurality of image frames. The method also includes detecting the authenticity of the face in the video based on the plurality of facial feature representations. By extracting a plurality of feature representations of a plurality of face image frames in a video and analyzing the degree of association between those feature representations, embodiments of the present disclosure can quickly, efficiently, and accurately detect false face videos.

Description

Video detection method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate generally to the field of computers, and more particularly to the field of computer vision technology.
Background
With the popularization of the mobile internet, online video has become more and more common. Videos may be divided by duration into long videos, short videos, small videos, and the like. A long video is a video of long duration, mainly film and television programming, generally shot by professional production companies. A short video is a video of shorter duration and richer subject matter, usually not shot by the professional or semi-professional teams of a film company. A small video is a video typically between a few seconds and tens of seconds in duration, produced mainly by individual users and usually related to their daily lives.
With the continuous development of artificial intelligence (AI) technology, AI face changing has become an emerging application of AI. Based on neural network models, AI face changing can replace faces in images or videos with the faces of other people. For example, an ordinary user may replace a celebrity's face in a video with his or her own face for entertainment, fraud, and the like. Currently, some AI techniques can make face-changed images or videos so realistic that humans, and even machines, cannot easily distinguish real from fake.
Disclosure of Invention
According to the exemplary embodiments of the present disclosure, a video detection method, apparatus, device, and medium are provided, which can solve the problem of how to quickly, efficiently, and accurately detect a face-change video.
In a first aspect of the disclosure, a video detection method is provided. The method comprises the following steps: acquiring a plurality of image frames related to a face from a video to be detected; extracting a plurality of facial feature representations corresponding to a plurality of image frames; and detecting authenticity of the face in the video based on the plurality of facial feature representations. Embodiments of the present disclosure can quickly, efficiently, and accurately detect false face videos by extracting a plurality of feature representations of a plurality of face image frames in a video and analyzing the degree of association between the plurality of feature representations.
In a second aspect of the present disclosure, a video detection apparatus is provided. The device includes: an image frame acquisition module configured to acquire a plurality of image frames related to a face from a video to be detected; a feature extraction module configured to extract a plurality of facial feature representations corresponding to a plurality of image frames; and a face detection module configured to detect authenticity of a face in the video based on the plurality of facial feature representations.
In a third aspect of the disclosure, an electronic device is provided that includes one or more processors and storage for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the electronic device to implement methods or processes in accordance with embodiments of the disclosure.
In a fourth aspect of the disclosure, a computer-readable medium is provided, on which a computer program is stored, which when executed by a processor, performs a method or process according to an embodiment of the disclosure.
It should be understood that the statements herein reciting aspects are not intended to identify key or critical features of the embodiments of the disclosure, nor are they intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates an example environment for face change video detection in accordance with an embodiment of the present disclosure;
fig. 2 shows a flow diagram of a video detection method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process for determining feature vectors for a face, according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a method for determining the authenticity of a face video in accordance with an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a process for determining a critical threshold for a face-changed video, in accordance with an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a video detection apparatus according to an embodiment of the present disclosure; and
FIG. 7 illustrates a block diagram of an electronic device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the terms "include" and its derivatives should be interpreted as being inclusive, i.e., "including but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". Other explicit and implicit definitions are also possible below.
Currently, neural-network-based AI face-changing techniques have become popular; they can automatically replace the faces in a video with a given person's face. Fake videos produced by AI face changing may be abused for malicious purposes, causing serious trust and security issues. For example, if used improperly, AI face-changing techniques can have serious consequences such as infringing portrait or reputation rights, spreading rumors, and triggering public incidents. One detection method for face-changing videos detects AI-generated fakes by detecting blinking, but this method presumes that the face-changing training data lack images at different stages of blinking, so that the face-changing algorithm never learns to blink. However, blink images are not difficult to acquire and exist in large numbers in ordinary video, so the effectiveness and accuracy of this method are low.
Therefore, embodiments of the present disclosure provide a face-changing video detection method based on facial feature representations. The inventors of the present application observed that, owing to the limitations of AI face-changing technology, face changing is not perfect: in face-changed videos, failures often occur, for example in frames with non-frontal poses, exaggerated expressions, or blur. Therefore, whether a video is a face-changed video can be judged from the different feature vectors of the same face across the video; for example, if those feature vectors deviate from one another to a certain degree, the video can be regarded as a false face video.
Embodiments of the present disclosure can quickly, efficiently, and accurately detect false face videos by extracting a plurality of feature representations (e.g., multi-dimensional vectors) of a plurality of face image frames in a video and analyzing a degree of association (e.g., a degree of spatial clustering) between the plurality of feature representations. Therefore, the embodiment of the disclosure can quickly and effectively identify the face-changing video by detecting the inconsistency among the faces in the face-changing video, thereby ensuring the safety of video use. Some example embodiments of the present disclosure will be described in detail below with reference to fig. 1-7.
Fig. 1 illustrates an example environment 100 for face-changing video detection in accordance with an embodiment of the disclosure. As shown in fig. 1, for a video 110 to be detected, a plurality of face image frames are extracted from the video 110. The video 110 may be a real face video captured by an image capture device, or a false face video generated by an AI face-changing technique. AI face-changing techniques refer to intelligent face replacement implemented through artificial intelligence, including but not limited to the DeepFakes (DF) model, the Face2Face (F2F) model, the FaceSwap (FS) model, and the NeuralTextures (NT) model; these generally replace a face in a video with another face and fuse the changed face into the existing background. Embodiments of the present disclosure aim to quickly and efficiently determine whether the video 110 is a face-changed video. Generally, the video 110 to be detected includes a single face, but embodiments of the present disclosure are also applicable to videos with multiple faces.
In some embodiments, facial image frames 120 may be extracted from the video 110 at a predetermined capture or sampling rate. For example, the sampling rate may be 30 frames per second, so 300 facial image frames may be extracted from a 10-second video. In some embodiments, each extracted image frame includes a face 125. Alternatively, if an extracted image frame does not include a face, the frame may be discarded without performing the subsequent facial recognition actions.
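The sampling arithmetic described above can be sketched as follows (the function name and interface are ours, not from the patent; a real pipeline would read frames with a video library such as OpenCV):

```python
def sample_frame_indices(total_frames: int, video_fps: float, sample_rate: float) -> list:
    """Return indices of frames to extract, taking `sample_rate`
    frames per second from a video recorded at `video_fps`."""
    step = video_fps / sample_rate  # source frames per sampled frame
    indices = []
    i = 0.0
    while round(i) < total_frames:
        indices.append(round(i))
        i += step
    return indices

# A 10-second video at 30 fps, sampled at 30 frames/second,
# yields all 300 frames, matching the example in the text.
print(len(sample_frame_indices(total_frames=300, video_fps=30.0, sample_rate=30.0)))  # 300
```

Sampling the same clip at 10 frames per second would instead keep every third frame (indices 0, 3, 6, ...).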
With continued reference to fig. 1, a plurality of facial feature representations, such as a plurality of feature vectors 130, are generated corresponding to the plurality of facial image frames 120. For example, for each facial image frame, a multi-dimensional vector corresponding to the face in that frame may be extracted and used to indicate the facial features. In some embodiments, the plurality of feature vectors 130 may be obtained via a face recognition system (e.g., FaceNet); each multi-dimensional vector represents the features of the face in the corresponding image frame, and in general the spatial distances between vectors of the same face are relatively small.
Next, at block 140, it is determined whether the face in the video 110 is a real face or a false face, i.e., whether the video 110 is a real face video or a false face video, based on statistical information about the plurality of feature vectors 130. For example, the authenticity of the face 125 in the video 110 may be determined by measuring the degree of clustering or closeness of the plurality of feature vectors. Judging facial similarity by the degree of aggregation improves the accuracy of similarity detection on face images.
Fig. 2 shows a flow diagram of a video detection method 200 according to an embodiment of the present disclosure, and the method 200 may be performed on a user equipment side, or may be performed on a server side, or may be performed partly on the user equipment side and partly on the server side, or may also be performed in a distributed device or cloud.
At block 202, a plurality of image frames relating to a face are acquired from a video to be detected. For example, a plurality of facial image frames 120 are extracted from the video 110 of fig. 1, each image frame including a face such as a human face, and if a face is not included in an extracted image frame, the image frame may be discarded. In some embodiments, image frames may be extracted from a video at a predetermined sampling rate. If the video is a small video or a short video, image frame extraction may be performed on the entire video; if the video is a long video, a short segment of video may be first extracted from the long video, and then image frames may be extracted from this short segment of video.
At block 204, a plurality of facial feature representations corresponding to a plurality of image frames are extracted. For a plurality of facial image frames 120, feature vectors 130 of faces therein are determined, respectively. For example, a neural network model may be trained over a large number of facial images, and then the trained neural network model is used to generate facial feature vectors in each image frame. In some embodiments, the feature vector may be obtained by a face recognition system.
At block 206, the authenticity of the face in the video is detected based on the plurality of facial feature representations. In some embodiments, the authenticity of the face 125 in the video 110 may be determined by measuring how closely the plurality of feature vectors aggregate in space. For example, the degree of aggregation of the points corresponding to the feature vectors may be computed: if the points cluster together in space, the video is a real face video; conversely, if the points are relatively scattered in space, the video is a false face video.
Therefore, the method 200 of the embodiment of the present disclosure can quickly, effectively and accurately detect a false face video by extracting a plurality of feature representations of a plurality of face image frames in a video and analyzing the degree of association between the plurality of feature representations.
The video to be detected is typically short and includes only a single face, i.e., a video of one person's face, such as a lecture video, an interview video, or a film clip. In some embodiments, if two or more faces are identified in an image frame, a target face may be selected first and the feature vectors associated with that target face determined. In this way, embodiments of the present disclosure can also be applied to videos with multiple faces, improving the applicability and scope of the disclosed scheme. Alternatively, the multiple faces in each image frame may be recognized separately, the face that appears in the most image frames identified, and false face detection then performed on that face.
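The text does not specify how the target face is chosen among several detected faces. One plausible heuristic, sketched below purely as an assumption, is to pick the detected face with the largest bounding box:

```python
def select_target_face(boxes):
    """Pick a target face from several detected boxes (x, y, w, h).
    Heuristic (our assumption, not stated in the patent): choose the
    face with the largest bounding-box area."""
    return max(boxes, key=lambda b: b[2] * b[3])

# Three hypothetical detections; the 120x140 box is the largest face.
boxes = [(10, 10, 40, 50), (100, 20, 120, 140), (300, 5, 30, 30)]
print(select_target_face(boxes))  # (100, 20, 120, 140)
```

Other choices (e.g., the face closest to the frame center, or the face tracked across the most frames) would fit the text equally well.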
Fig. 3 shows a schematic diagram of a process 300 for determining feature vectors of a face, according to an embodiment of the disclosure. It should be appreciated that the process 300 may be an example implementation of step 204 described above with reference to fig. 2, and that the image frames 311, 312, 313 may be frames at different times among the plurality of image frames 120, each presenting a dynamic picture of a face; for example, the head pose in image frames 312 and 313 begins to tilt compared with image frame 311.
As shown in fig. 3, a face region 331 in the image frame 311 is detected at block 321, and then the face region 331 is face-recognized at block 341, generating a corresponding feature vector 351. Similarly, a face region 332 is detected in the image frame 312 at block 322, and then the face region 332 is face-recognized at block 342 to generate a corresponding feature vector 352. The face region 333 in the image frame 313 is detected at block 323, and then the face region 333 is face-recognized at block 343 to generate a corresponding feature vector 353. In general, if the faces in two images are similar, the distance of the corresponding two feature vectors is close, e.g., a euclidean distance can be calculated.
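The distance comparison mentioned above can be illustrated with a small helper (toy 4-dimensional vectors; real face embeddings would typically be 128-dimensional or larger):

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two facial feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings: two frames of the same face, one other face.
same_face_a = [0.10, 0.20, 0.30, 0.40]
same_face_b = [0.12, 0.19, 0.31, 0.41]
other_face  = [0.90, 0.10, 0.70, 0.05]

# Similar faces yield nearby vectors; different faces lie farther apart.
print(euclidean_distance(same_face_a, same_face_b) <
      euclidean_distance(same_face_a, other_face))  # True
```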
In some embodiments, the face detection action may be performed by a face detection system, such as a Multi-task Cascaded Convolutional Network (MTCNN). The MTCNN model performs face detection and five-point landmark localization within one network, realizing multi-task learning through a cascade of CNN models. The model runs in three stages: in the first stage, a shallow CNN quickly generates a series of candidate windows; in the second stage, a more capable CNN filters out most non-face candidate windows; and in the third stage, a still more powerful network locates five landmark points on the face. In this way, the face region in an image can be detected quickly and accurately.
In some embodiments, the face recognition action may be performed by a face recognition system, such as a FaceNet model. The FaceNet model may be used for face verification, such as verifying whether different faces are the same person, for face recognition to identify corresponding identities, and for clustering people of similar faces. The FaceNet model uses a method of mapping images to euclidean space through convolutional neural network learning. Spatial distance is directly related to facial image similarity: the different images of the same face have small spatial distance, and the images of different faces have larger distance in space. In this way, the features of the respective face images can be accurately identified and used to identify whether different face images correspond to the same person.
Alternatively, the face detection action and the face recognition action may be implemented by the same face recognition system. For example, a face recognition system may integrate both a face detection algorithm and a face recognition algorithm, thereby enabling direct input of image frames and output of corresponding feature vectors. The feature vectors 351, 352, 353 may be multidimensional vectors, such as 128-dimensional vectors, 256-dimensional vectors, 512-dimensional vectors, and so on. In this way, the feature vectors of the face in each image frame can be obtained more accurately by means of the existing face recognition model, for example, calling the existing model or an interface of the system, thereby improving the accuracy of face change video detection.
At block 360, false face analysis may be performed on the plurality of feature vectors 351, 352, 353 of the face at different times in the video. It should be appreciated that although only 3 image frames and feature vectors are shown in fig. 3, false face analysis typically covers more image frames, because with more image frames it is statistically more likely that inconsistencies of the face across frames will be found.
Fig. 4 shows a flowchart of an example method 400 for determining the authenticity of a face video, in accordance with an embodiment of the present disclosure. It should be appreciated that method 400 may be an example implementation of step 206 described above with reference to fig. 2.
At block 402, a plurality of multidimensional vectors are mapped to a plurality of points in facial feature space, e.g., 300 multidimensional vectors generated by the extracted 300 frames of images may be mapped to 300 points in feature space. At block 404, centroids of a plurality of points in the facial feature space are determined, e.g., calculating a centroid of 300 points. At block 406, a plurality of distances between the plurality of points and the centroid is determined, for example, 300 distances of the 300 points from the centroid are calculated. At block 408, a variance of the plurality of distances is determined. For example, a variance of 300 distances is calculated. By mapping a plurality of multi-dimensional vectors to the same feature space, the space aggregation degree among the multi-dimensional vectors can be more conveniently determined, and therefore the false face video can be more accurately judged.
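The pipeline at blocks 402-408 (map vectors to points, find their centroid, measure each point's distance to it, take the variance of those distances) can be sketched in plain Python. The vectors below are toy 2-D examples, and the function name is ours:

```python
import math

def dispersion_variance(vectors):
    """Variance of the distances from each feature-vector point to
    the points' centroid (the statistic of blocks 402-408)."""
    n, dim = len(vectors), len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    dists = [math.sqrt(sum((v[d] - centroid[d]) ** 2 for d in range(dim)))
             for v in vectors]
    mean_dist = sum(dists) / n
    return sum((d - mean_dist) ** 2 for d in dists) / n

# Tightly clustered vectors (a consistent face) vs. scattered ones:
tight = [[0.1, 0.1], [0.11, 0.09], [0.09, 0.1], [0.1, 0.11]]
loose = [[0.0, 0.0], [1.0, 0.2], [0.3, 2.0], [2.5, 0.1]]
print(dispersion_variance(tight) < dispersion_variance(loose))  # True
```

A real face video produces something like `tight`; a face-changed video, where the swap fails in some frames, produces something like `loose`.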
At block 410, it is determined whether the calculated variance is greater than a predetermined threshold, where the threshold is derived from statistics over a large number of sample face videos. If the variance is determined at block 410 to be greater than the predetermined threshold, the 300 points are dispersed in the feature space, and the video can be determined at block 412 to be a false face video. If the variance is determined at block 410 to be less than or equal to the predetermined threshold, the 300 points are concentrated in space, and the video can be determined at block 414 to be a real face video. In this way, real face videos and false face videos can be distinguished more accurately by computing the variance value, further improving the accuracy of face-changing video detection.
In some embodiments, if the video is determined to be a false face video, the most obviously fake of the face image frames may further be presented to the user, who can confirm the detection result by inspecting it. For example, the point farthest from the centroid among the plurality of points may be determined, the image frame corresponding to that farthest point identified, and the identified frame then presented to the user as false-face evidence. In this way, the user can visually understand the basis of the detection result, improving the user experience.
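Locating the "most fake" frame as the point farthest from the centroid can be sketched as follows (illustrative names, toy 2-D vectors):

```python
import math

def most_fake_frame(vectors):
    """Index of the frame whose feature vector lies farthest from the
    centroid; its image frame serves as the false-face evidence."""
    n, dim = len(vectors), len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    def dist(v):
        return math.sqrt(sum((v[d] - centroid[d]) ** 2 for d in range(dim)))
    return max(range(n), key=lambda i: dist(vectors[i]))

# Frame 2 is the outlier, so it would be shown to the user.
vectors = [[0.1, 0.1], [0.12, 0.11], [0.9, 0.95], [0.11, 0.09]]
print(most_fake_frame(vectors))  # 2
```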
Fig. 5 shows a schematic diagram of a process 500 for determining a critical threshold for a face-changed video, according to an embodiment of the present disclosure. As shown in fig. 5, the set of face-changing videos includes a set of real-face videos 511 and a set of false-face videos 512, where the set of real-face videos 511 includes a large number of real-face videos, and the set of false-face videos 512 includes a large number of false-face videos. The videos in the face change video set typically have a certain length, for example, about 10 seconds, or may be longer or shorter videos, which typically contain a face of a person, and the real-face videos and/or fake-face videos may be collected from the public data set, or the real-face videos may be changed into fake-face videos by an AI face change technique.
For each video in the face-changing video set, a plurality of image frames are extracted, where the faces in the frames correspond to the same person, and the feature vector of each face image frame is mapped to a point in the feature space. Because of the imperfect generation quality of face changing, the feature vectors of false face samples are less similar to one another than those of real face samples and are dispersed in space, so real faces differ statistically from false faces. If the feature vectors deviate to a certain degree, the video is considered a face-changed video. This criterion is obtained by statistical analysis of the video data set.
As shown in fig. 5, statistical information 521 is obtained for the real-face videos in the real-face video set 511, and statistical information 522 for the false-face videos in the false-face video set 512. A critical threshold 530 for distinguishing real-face videos from false-face videos is then determined from the statistical information 521 and 522. For example, based on the real-face video set 511, a distribution of the variance of the distances between the points of each real-face video's facial feature vectors and their centroid may be computed; similarly, based on the false-face video set 512, the corresponding variance distribution for false-face videos may be computed. An appropriate critical threshold 530 is then determined from these two variance distributions, and with this threshold, whether a video is a false face video can be judged from its variance. In this way, the statistical difference between real-face and false-face videos is measured over the two video sets, yielding a more accurate critical threshold and thus a more accurate judgment criterion, further improving the accuracy of face-changing video detection.
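How the critical threshold 530 is derived from the two variance distributions is not fully specified in the text. One simple heuristic, sketched here purely as an assumption, is the midpoint between the mean variances of the two sets:

```python
def choose_threshold(real_variances, fake_variances):
    """Pick a critical threshold separating real-face from false-face
    videos. Heuristic (our assumption): the midpoint between the mean
    variance of the real set and the mean variance of the fake set."""
    real_mean = sum(real_variances) / len(real_variances)
    fake_mean = sum(fake_variances) / len(fake_variances)
    return (real_mean + fake_mean) / 2

real = [0.01, 0.02, 0.015, 0.012]   # tight clusters -> small variance
fake = [0.20, 0.35, 0.28, 0.31]     # scattered points -> large variance
threshold = choose_threshold(real, fake)
print(all(v <= threshold for v in real) and all(v > threshold for v in fake))  # True
```

In practice the threshold would more likely be tuned on the two distributions to trade off false positives against false negatives (e.g., via an ROC curve) rather than taken as a simple midpoint.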
Alternatively, other statistics of the feature vectors may be used. For example, the average distance between the points of the facial feature vectors of the same video may be computed (the average of the distances between any two points, using e.g. the Euclidean distance as the metric), and if the average distance is too large, the video is considered a false face video. In addition, the criterion for judging real versus false face videos may be determined from the covariance of the distances from each facial feature vector's point to the centroid, and the like.
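The alternative mean-pairwise-distance statistic can be sketched as follows (illustrative function name, toy 2-D vectors):

```python
import itertools
import math

def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs of feature-vector
    points from the same video; a large value suggests a false face."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
             for u, v in itertools.combinations(vectors, 2)]
    return sum(dists) / len(dists)

tight = [[0.1, 0.1], [0.11, 0.1], [0.1, 0.11]]   # consistent face
loose = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]     # inconsistent face
print(mean_pairwise_distance(tight) < mean_pairwise_distance(loose))  # True
```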
Fig. 6 shows a block diagram of a video detection apparatus 600 according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes an image frame acquisition module 610, a feature extraction module 620, and a face detection module 630. The image frame acquisition module 610 is configured to acquire a plurality of image frames related to a face from a video to be detected. The feature extraction module 620 is configured to extract a plurality of facial feature representations corresponding to a plurality of image frames. The face detection module 630 is configured to detect the authenticity of a face in the video based on the plurality of facial feature representations.
In some embodiments, the face detection module 630 comprises: a face determination module configured to determine the authenticity of a face in the video by counting the degree of aggregation of the plurality of facial feature representations.
In some embodiments, wherein the feature extraction module 620 comprises: a transmitting module configured to transmit a plurality of image frames to a face recognition system; and a receiving module configured to receive the plurality of multi-dimensional vectors as a plurality of facial feature representations from the facial recognition system.
In some embodiments, the face detection module 630 comprises: a mapping module configured to map a plurality of multi-dimensional vectors to a plurality of points in facial feature space; a centroid determination module configured to determine a centroid of a plurality of points in the facial feature space; a distance determination module configured to determine a plurality of distances between the plurality of points and the centroid; a variance determination module configured to determine a variance of the plurality of distances; and an authenticity determination module configured to determine authenticity of a face in the video based on the variance.
In some embodiments, wherein the authenticity determination module comprises: a determination module configured to determine whether the variance is greater than a predetermined threshold; a false face determination module configured to determine that the video is a false face video according to a determination that the variance is greater than a predetermined threshold; and a real face determination module configured to determine that the video is a real face video according to a determination that the variance is less than or equal to a predetermined threshold.
In some embodiments, the apparatus 600 further comprises: an obtaining module configured to obtain a plurality of real face videos and a plurality of false face videos from a face exchange dataset; and a threshold determination module configured to determine a predetermined threshold based on statistical information of the plurality of real face videos and the plurality of false face videos.
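One simple way to derive the predetermined threshold from statistics of labelled videos is sketched below. The midpoint rule is an assumption made for illustration; the description only states that the threshold is determined from statistical information of both video sets:

```python
import numpy as np

def calibrate_threshold(real_variances, fake_variances):
    """Pick a decision threshold between two variance populations.

    real_variances / fake_variances: distance-variance values computed (as in
    the detection step) for known real face videos and known false face
    videos from a face exchange dataset.
    Returns the midpoint between the two population means, one plausible
    statistic separating the two classes.
    """
    real_mean = float(np.mean(real_variances))
    fake_mean = float(np.mean(fake_variances))
    return (real_mean + fake_mean) / 2.0  # assumed rule: halfway point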
In some embodiments, the false face determination module comprises: a farthest point determination module configured to determine a point of the plurality of points that is farthest from the centroid; an image frame determination module configured to determine an image frame corresponding to a farthest point among the plurality of image frames; and a false face presentation module configured to present the determined image frame to a user as a false face credential.
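The farthest-point evidence selection described above might look like this in outline (function and variable names are illustrative):

```python
import numpy as np

def evidence_frame(frames, embeddings):
    """Return the frame whose embedding lies farthest from the centroid.

    frames: list of image frames, aligned index-by-index with embeddings.
    embeddings: array-like of shape (n_frames, dim).
    The farthest frame is the one most inconsistent with the rest and
    serves as the false face credential presented to the user.
    """
    points = np.asarray(embeddings, dtype=float)
    centroid = points.mean(axis=0)
    distances = np.linalg.norm(points - centroid, axis=1)
    idx = int(np.argmax(distances))  # index of the farthest point
    return frames[idx]
```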
In some embodiments, the feature extraction module 620 comprises: a face selection module configured to select a target face from two or more faces in accordance with a determination that the two or more faces are present in a given image frame of the plurality of image frames; and a feature determination module configured to determine a facial feature representation corresponding to the target face.
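When a frame contains several faces, one must be chosen as the target. The description above does not fix the selection criterion; picking the largest detected face (likely the main subject) is one plausible rule, assumed here for illustration:

```python
def select_target_face(face_boxes):
    """Pick one face when a frame contains several detections.

    face_boxes: list of (x, y, width, height) bounding boxes for one frame
    (the box format is an assumption; any detector output with comparable
    sizes would do).
    Returns the box with the largest area.
    """
    return max(face_boxes, key=lambda box: box[2] * box[3])  # largest area wins
```

Other criteria (the face closest to the frame center, or the face tracked across the most frames) would fit the same module interface.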
Therefore, the apparatus 600 of the embodiment of the present disclosure can quickly, effectively, and accurately detect a false face video by extracting a plurality of feature representations from a plurality of face image frames in a video and analyzing the degree of aggregation among those feature representations.
Fig. 7 illustrates a schematic block diagram of an example device 700 that may be used to implement embodiments of the present disclosure. As shown, device 700 includes a Central Processing Unit (CPU) 701 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or computer program instructions loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processing unit 701 performs the various methods and processes described above. For example, in some embodiments, the methods and processes may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by CPU 701, one or more acts or steps of the methods described above may be performed. Alternatively, in other embodiments, CPU 701 may be configured by any other suitable means (e.g., by way of firmware) to perform the various methods of embodiments of the present disclosure.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Further, while acts or steps are depicted in a particular order, this should not be understood as requiring that such acts or steps be performed in the particular order shown or in sequential order, or that all illustrated acts or steps be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although embodiments of the disclosure have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (18)

1. A video detection method, comprising:
acquiring a plurality of image frames related to a face from a video to be detected;
extracting a plurality of facial feature representations corresponding to the plurality of image frames; and
detecting authenticity of the face in the video based on the plurality of facial feature representations.
2. The method of claim 1, wherein detecting authenticity of the face in the video comprises:
determining authenticity of the face in the video by statistically analyzing a degree of aggregation of the plurality of facial feature representations.
3. The method of claim 1, wherein extracting a plurality of facial feature representations corresponding to the plurality of image frames comprises:
transmitting the plurality of image frames to a face recognition system; and
receiving a plurality of multi-dimensional vectors as the plurality of facial feature representations from the facial recognition system.
4. The method of claim 3, wherein detecting authenticity of the face in the video comprises:
mapping the plurality of multi-dimensional vectors to a plurality of points in facial feature space;
determining a centroid of the plurality of points in the facial feature space;
determining a plurality of distances between the plurality of points and the centroid;
determining a variance of the plurality of distances; and
determining authenticity of the face in the video based on the variance.
5. The method of claim 4, wherein determining the authenticity of the face in the video comprises:
determining whether the variance is greater than a predetermined threshold;
in accordance with a determination that the variance is greater than the predetermined threshold, determining that the video is a false face video; and
in accordance with a determination that the variance is less than or equal to the predetermined threshold, determining that the video is a real face video.
6. The method of claim 5, further comprising:
obtaining a plurality of real face videos and a plurality of false face videos from a face exchange dataset; and
determining the predetermined threshold based on statistical information of the plurality of real face videos and the plurality of false face videos.
7. The method of claim 5, wherein determining that the video is a false face video comprises:
determining a point of the plurality of points that is farthest from the centroid;
determining an image frame corresponding to the farthest point among the plurality of image frames; and
presenting the determined image frame to a user as a false face credential.
8. The method of claim 1, wherein extracting a plurality of facial feature representations corresponding to the plurality of image frames comprises:
in accordance with a determination that two or more faces are present in a given image frame of the plurality of image frames, selecting a target face from the two or more faces; and
determining a facial feature representation corresponding to the target face.
9. A video detection apparatus comprising:
an image frame acquisition module configured to acquire a plurality of image frames related to a face from a video to be detected;
a feature extraction module configured to extract a plurality of facial feature representations corresponding to the plurality of image frames; and
a face detection module configured to detect an authenticity of the face in the video based on the plurality of facial feature representations.
10. The apparatus of claim 9, wherein the face detection module comprises:
a face determination module configured to determine authenticity of the face in the video by statistically analyzing a degree of aggregation of the plurality of facial feature representations.
11. The apparatus of claim 9, wherein the feature extraction module comprises:
a transmitting module configured to transmit the plurality of image frames to a face recognition system; and
a receiving module configured to receive a plurality of multi-dimensional vectors as the plurality of facial feature representations from the facial recognition system.
12. The apparatus of claim 11, wherein the face detection module comprises:
a mapping module configured to map the plurality of multi-dimensional vectors to a plurality of points in facial feature space;
a centroid determination module configured to determine a centroid of the plurality of points in the facial feature space;
a distance determination module configured to determine a plurality of distances between the plurality of points and the centroid;
a variance determination module configured to determine a variance of the plurality of distances; and
an authenticity determination module configured to determine authenticity of the face in the video based on the variance.
13. The apparatus of claim 12, wherein the authenticity determination module comprises:
a determination module configured to determine whether the variance is greater than a predetermined threshold;
a false face determination module configured to determine that the video is a false face video in accordance with a determination that the variance is greater than the predetermined threshold; and
a real face determination module configured to determine that the video is a real face video in accordance with a determination that the variance is less than or equal to the predetermined threshold.
14. The apparatus of claim 13, further comprising:
an obtaining module configured to obtain a plurality of real face videos and a plurality of false face videos from a face exchange dataset; and
a threshold determination module configured to determine the predetermined threshold based on statistical information of the plurality of real face videos and the plurality of false face videos.
15. The apparatus of claim 13, wherein the false face determination module comprises:
a farthest point determination module configured to determine a point of the plurality of points that is farthest from the centroid;
an image frame determination module configured to determine an image frame of a plurality of image frames corresponding to the farthest point; and
a false face presentation module configured to present the determined image frame to a user as a false face credential.
16. The apparatus of claim 9, wherein the feature extraction module comprises:
a face selection module configured to select a target face from two or more faces in accordance with a determination that the two or more faces are present in a given image frame of the plurality of image frames; and
a feature determination module configured to determine a facial feature representation corresponding to the target face.
17. An electronic device, the electronic device comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the method of any of claims 1-8.
18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202010213737.7A 2020-03-24 2020-03-24 Video detection method, device, equipment and storage medium Active CN113449543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010213737.7A CN113449543B (en) 2020-03-24 2020-03-24 Video detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113449543A true CN113449543A (en) 2021-09-28
CN113449543B CN113449543B (en) 2022-09-27


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080260212A1 (en) * 2007-01-12 2008-10-23 Moskal Michael D System for indicating deceit and verity
CN101739555A (en) * 2009-12-01 2010-06-16 北京中星微电子有限公司 Method and system for detecting false face, and method and system for training false face model
US9176987B1 (en) * 2014-08-26 2015-11-03 TCL Research America Inc. Automatic face annotation method and system
CN106778586A (en) * 2016-12-08 2017-05-31 武汉理工大学 Offline handwriting signature verification method and system
CN109670491A (en) * 2019-02-25 2019-04-23 百度在线网络技术(北京)有限公司 Identify method, apparatus, equipment and the storage medium of facial image
CN110472491A (en) * 2019-07-05 2019-11-19 深圳壹账通智能科技有限公司 Abnormal face detecting method, abnormality recognition method, device, equipment and medium
CN110543584A (en) * 2018-05-29 2019-12-06 腾讯科技(深圳)有限公司 method, device, processing server and storage medium for establishing face index
AU2019101186A4 (en) * 2019-10-02 2020-01-23 Guo, Zhongliang MR A Method of Video Recognition Network of Face Tampering Based on Deep Learning
CN110807396A (en) * 2019-10-28 2020-02-18 华南理工大学 Face changing video tampering detection method and system based on illumination direction consistency
CN110826440A (en) * 2019-10-28 2020-02-21 华南理工大学 Face changing video tampering detection method and system based on eye movement characteristics


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AKASH KUMAR et al.: "Detecting Deepfakes with Metric Learning", arXiv *
GAO Yifei (高逸飞) et al.: "Performance Analysis and Comparison of Five Popular Fake Face Video Detection Networks" (5种流行假脸视频检测网络性能分析和比较), Journal of Applied Sciences (《应用科学学报》) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant