WO2022110806A1 - Video detection method, apparatus, device, and computer-readable storage medium - Google Patents

Video detection method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2022110806A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
frame
result
detection
video stream
Application number
PCT/CN2021/103766
Other languages
English (en)
French (fr)
Inventor
方正
石华峰
殷国君
陈思禹
邵婧
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 上海商汤智能科技有限公司
Priority to JP2022531515A (publication JP2023507898A)
Priority to KR1020227018065A (publication KR20220093157A)
Publication of WO2022110806A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/95: Pattern authentication; Markers therefor; Forgery detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • the present disclosure relates to computer vision technology, and in particular, to a video detection method, apparatus, device, and computer-readable storage medium.
  • Embodiments of the present disclosure provide a video detection solution.
  • a video detection method includes: acquiring a plurality of first video frames in a video to be processed, and a first video stream corresponding to the video to be processed; acquiring a single-frame detection result of performing authenticity detection on each first video frame; acquiring a video stream detection result of performing authenticity detection on the first video stream; and determining an authenticity discrimination result of the video to be processed according to the single-frame detection results of the plurality of first video frames and the video stream detection result of the first video stream.
  • the acquiring a plurality of first video frames in the video to be processed includes: performing frame extraction processing on the video to be processed with a set frame number span to obtain the plurality of first video frames.
  • the obtaining of a single-frame detection result of performing authenticity detection on each first video frame includes: performing authenticity detection on the first video frame through a first authenticity classification network to obtain a single-frame detection result of the first video frame, wherein the single-frame detection result is used to represent the confidence that the first video frame is forged.
  • the obtaining of a video stream detection result of performing authenticity detection on the first video stream includes: performing authenticity detection on the first video stream through a second authenticity classification network according to the video frames contained in the first video stream and the relationships between those frames, to obtain a video stream detection result of the first video stream, wherein the video stream detection result is used to represent the confidence that the first video stream is forged.
  • the determining of the authenticity discrimination result of the video to be processed according to the single-frame detection results of the plurality of first video frames and the video stream detection result of the first video stream includes: fusing the respective single-frame detection results of the plurality of first video frames to obtain a fusion result; and determining the authenticity discrimination result of the video to be processed according to the fusion result and the video stream detection result.
  • the fusing of the respective single-frame detection results of the multiple first video frames to obtain the fusion result includes: grouping the single-frame detection results of the multiple first video frames to obtain a plurality of result groups each including one or more single-frame detection results; acquiring the average detection result of each result group; mapping the average detection result of each result group to a first probability through a first setting function to obtain a plurality of first probabilities, wherein the first setting function is a nonlinear mapping function; and obtaining the fusion result according to the average detection results of the result groups and the plurality of first probabilities.
  • obtaining a fusion result according to the average detection result of each result group and the multiple first probabilities includes at least one of the following: in response to the ratio of first upper probabilities (those greater than a first set threshold) among the multiple first probabilities being greater than a first set ratio, obtaining the fusion result according to the average detection results of the result groups corresponding to the first upper probabilities; in response to the ratio of first lower probabilities (those smaller than a second set threshold) among the multiple first probabilities being greater than a second set ratio, obtaining the fusion result according to the average detection results of the result groups corresponding to the first lower probabilities; wherein the first set threshold is greater than the second set threshold.
  • the determining of the authenticity discrimination result of the video to be processed according to the fusion result and the video stream detection result includes: performing a weighted average of the fusion result and the video stream detection result to obtain a weighted average result; and determining the authenticity discrimination result of the video to be processed according to the weighted average result.
  • the first video frame includes a plurality of human faces, and the acquiring of a single-frame detection result of performing authenticity detection on each first video frame includes: acquiring face detection frames corresponding to the multiple faces in the first video frame; determining the single-person detection result of the corresponding face according to the image area corresponding to each face detection frame; mapping the single-person detection result of each face to a second probability through a second setting function to obtain a plurality of second probabilities, wherein the second setting function is a nonlinear mapping function; and obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the plurality of second probabilities.
  • the obtaining of the single-frame detection result of the first video frame according to the single-person detection results of the faces and the plurality of second probabilities includes at least one of the following: in response to there being, among the plurality of second probabilities, a second probability greater than a third set threshold, acquiring the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; in response to the plurality of second probabilities all being greater than a fourth set threshold, acquiring the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; in response to the plurality of second probabilities all being smaller than a fifth set threshold, acquiring the smallest single-person detection result in the first video frame as the single-frame detection result of the first video frame; wherein the third set threshold is greater than the fourth set threshold, and the fourth set threshold is greater than the fifth set threshold.
  • the first authenticity classification network includes authenticity classification networks of multiple structures; performing authenticity detection on the first video frame through the first authenticity classification network to obtain the single-frame detection result of the first video frame includes: performing authenticity detection on the first video frame through the authenticity classification networks of the multiple structures to obtain multiple sub-single-frame detection results; mapping the multiple sub-single-frame detection results to third probabilities through a third setting function to obtain multiple third probabilities, wherein the third setting function is a nonlinear mapping function; and determining the single-frame detection result of the first video frame by at least one of the following: in response to the ratio of third upper probabilities (those greater than a sixth set threshold) among the multiple third probabilities being greater than a third set ratio, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third upper probabilities; in response to the ratio of third lower probabilities (those smaller than a seventh set threshold) among the multiple third probabilities being greater than a fourth set ratio, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third lower probabilities; wherein the sixth set threshold is greater than the seventh set threshold.
  • the second authenticity classification network includes authenticity classification networks of multiple structures; performing authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the relationships between frames, to obtain the video stream detection result of the first video stream, includes: performing authenticity detection on the first video stream through the authenticity classification networks of the multiple structures according to the video frames contained in the first video stream and the relationships between frames, to obtain multiple sub-video stream detection results; mapping the multiple sub-video stream detection results to fourth probabilities through a fourth setting function to obtain multiple fourth probabilities, wherein the fourth setting function is a nonlinear mapping function; and determining the video stream detection result of the first video stream by at least one of the following: in response to the ratio of fourth upper probabilities (those greater than an eighth set threshold) among the multiple fourth probabilities being greater than a fifth set ratio, obtaining the video stream detection result according to the sub-video stream detection results corresponding to the fourth upper probabilities; in response to the ratio of fourth lower probabilities (those smaller than a ninth set threshold) among the multiple fourth probabilities being greater than a sixth set ratio, obtaining the video stream detection result according to the sub-video stream detection results corresponding to the fourth lower probabilities; wherein the eighth set threshold is greater than the ninth set threshold.
  • the single-frame detection result of the first video frame indicates whether the face image in the first video frame is a face-changing image; the video stream detection result of the first video stream Indicates whether the face image in the first video stream is a face-changing image; the authenticity determination result of the video to be processed indicates whether the to-be-processed video is a face-changing video.
  • a video detection apparatus includes: a first acquisition unit configured to acquire a plurality of first video frames in a video to be processed and a first video stream corresponding to the video to be processed; a second acquisition unit configured to acquire a single-frame detection result of performing authenticity detection on each first video frame; a third acquisition unit configured to acquire a video stream detection result of performing authenticity detection on the first video stream; and a determining unit configured to determine the authenticity discrimination result of the video to be processed according to the respective single-frame detection results of the plurality of first video frames and the video stream detection result of the first video stream.
  • an electronic device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the video detection method described in any embodiment of the present disclosure when executing the computer instructions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the video detection method described in any embodiment of the present disclosure.
  • a computer program includes computer-readable code which, when executed in an electronic device, causes a processor in the electronic device to perform the video detection method described in any embodiment of the present disclosure.
  • in the embodiments of the present disclosure, authenticity detection is performed both on multiple first video frames in the video to be processed and on the first video stream corresponding to the video to be processed, so as to obtain the single-frame detection results of the first video frames and the video stream detection result of the first video stream respectively; the authenticity discrimination result of the video to be processed is then determined according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream, so that a video in which only some frames are forged can still be detected, improving video detection accuracy.
  • FIG. 1 is a flowchart of a video detection method shown in at least one embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a video detection method shown in at least one embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a video detection apparatus shown in at least one embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an electronic device shown in at least one embodiment of the present disclosure.
  • Embodiments of the present disclosure may be applied to computer systems/servers that are operable with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the foregoing, among others.
  • FIG. 1 is a flowchart of a video detection method according to at least one embodiment of the present disclosure. As shown in FIG. 1 , the method includes steps 101 to 104 .
  • In step 101, a plurality of first video frames in the video to be processed and a first video stream corresponding to the video to be processed are acquired.
  • the plurality of first video frames may be video frames corresponding to the original video sequence contained in the video to be processed, or may be video frames obtained by performing frame extraction processing on the original video sequence.
  • the first video stream corresponding to the video to be processed may be a video stream formed by the original video sequence contained in the video to be processed, or may be a video stream formed from video frames obtained by performing frame extraction processing on the original video sequence, for example, a video stream formed by the plurality of first video frames.
  • In step 102, a single-frame detection result of performing authenticity detection on each first video frame is acquired.
  • the authenticity detection of the first video frame may be performed through a first authenticity classification network to obtain a single-frame detection result of the first video frame, wherein the single-frame detection result is used to indicate the confidence that the first video frame is forged; for example, the single-frame detection result includes a single-frame confidence score.
  • the first authenticity classification network may be a pre-trained authenticity classification network that detects video frames independently, such as ResNet (Residual Network), DenseNet (Densely Connected Convolutional Network), EfficientNet, Xception, SENet (Squeeze-and-Excitation Network), and so on.
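  • As a minimal illustration of such a frame-level classifier, the sketch below wraps a torchvision ResNet-18 backbone with a two-class head. The backbone choice and head are assumptions for illustration, since the disclosure only requires some pre-trained frame-level authenticity classification network; the returned scalar logit is the "fake" score, whose two-class softmax equals its sigmoid.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameAuthenticityNet(nn.Module):
    """Illustrative frame-level authenticity classifier (ResNet-18 backbone)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # weights would come from forgery-detection training
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # [real, fake] logits
        self.backbone = backbone

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (N, 3, H, W) float tensor -> per-frame 'fake' logit of shape (N,)."""
        logits = self.backbone(frames)         # (N, 2)
        # softmax([a, b])[fake] == sigmoid(b - a), so the scalar logit suffices
        return logits[:, 1] - logits[:, 0]
```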
  • In step 103, a video stream detection result of performing authenticity detection on the first video stream is acquired.
  • the second authenticity classification network may be used to perform authenticity detection on the first video stream according to the frame sequence corresponding to the first video stream and the relationships between frames, to obtain the video stream detection result of the first video stream, wherein the video stream detection result is used to represent the confidence that the first video stream is forged; for example, the video stream detection result includes a video stream confidence score.
  • the second authenticity classification network may be a pre-trained authenticity classification network that detects video streams while considering the relationships between frames, such as the C3D (3D ConvNets) network, the SlowFast network, the X3D (Extensible 3D) network, and so on.
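  • A corresponding clip-level sketch follows; torchvision's r3d_18 is used here as a stand-in for the clip networks named above (C3D, SlowFast, X3D), and the two-class head is again an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ClipAuthenticityNet(nn.Module):
    """Illustrative clip-level authenticity classifier over frame sequences."""

    def __init__(self):
        super().__init__()
        net = r3d_18(weights=None)  # weights would come from forgery-detection training
        net.fc = nn.Linear(net.fc.in_features, 2)  # [real, fake] logits
        self.net = net

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        """clip: (N, 3, T, H, W) float tensor -> per-clip 'fake' logit of shape (N,)."""
        logits = self.net(clip)
        return logits[:, 1] - logits[:, 0]
```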
  • In step 104, the authenticity discrimination result of the video to be processed is determined according to the respective single-frame detection results of the plurality of first video frames and the video stream detection result of the first video stream.
  • In this way, the single-frame detection results of the plurality of first video frames and the video stream detection result of the first video stream are jointly used to determine the authenticity discrimination result of the video to be processed, so that forged video frames present in only part of the video can still be detected and the video detection accuracy is improved.
  • frame extraction processing may be performed on the to-be-processed video with a set frame number span to obtain the plurality of first video frames.
  • the set frame number span may be determined according to the frame number of the video to be processed.
  • the set frame number span may be positively correlated with the total number of video frames contained in the video to be processed, so that the span adapts to the length of the video and a reasonable number of first video frames can be extracted, improving the video detection effect.
  • For example, frame extraction may be performed with a frame number span of 2, that is, 1 frame is extracted every 2 frames.
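  • A minimal sketch of this span-based frame extraction with OpenCV follows. The rule used to derive the span from the total frame count (roughly one thirtieth of the video, at least 1) is an illustrative assumption, since the text only requires the span to grow with the video length.

```python
import cv2

def extract_frames(video_path, span=None):
    """Extract 1 frame every `span` frames from the video at `video_path`."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if span is None:
        span = max(1, total // 30)  # assumed rule: span grows with video length
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % span == 0:  # span=2 keeps frames 0, 2, 4, ... (1 in every 2)
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```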
  • the single-frame detection results of the multiple first video frames may be first fused to obtain a fusion result, and then the video to be processed is determined according to the fusion result and the video stream detection result. Authenticity judgment results.
  • the fusion result reflects the influence of each single-frame detection result; using the fusion result together with the video stream detection result to determine the authenticity discrimination result of the video to be processed can therefore improve the video detection effect.
  • the fusion result may be obtained by fusing the detection results of the single frames of the first video frames in the following manner.
  • the single-frame detection results of the multiple first video frames are grouped to obtain a plurality of result groups respectively including one or more single-frame detection results; the average detection results of each of the result groups are obtained.
  • the average detection result for each group may include the average confidence score for multiple frames within the group.
  • For example, every M adjacent first video frames may be taken as one group, so that the plurality of first video frames are divided into N groups.
  • M and N are positive integers.
  • Those skilled in the art should understand that in the case where the total number of the multiple first video frames is not an integer multiple of M, there may be groups in which the number of the first video frames is not M.
  • every 5 adjacent first video frames may be grouped, so that the plurality of first video frames in the video to be processed are divided into 6 groups.
  • the average detection result of each of the result groups is mapped to a first probability through a first setting function to obtain a plurality of the first probabilities, wherein the first setting function is a nonlinear mapping function.
  • the first setting function may be, for example, a normalized exponential Softmax function, through which the average single-frame confidence score of each group is mapped to the first probability.
  • the single-frame detection result of the first video frame is a logit output value in the (-∞, +∞) interval.
  • the average detection result of each group is mapped to the first probability in the [0,1] interval by the Softmax function, which can reflect the distribution of the average detection result of each group.
  • a fusion result is obtained according to the average detection result of each of the result groups and the plurality of first probabilities.
  • the fusion result can be obtained as follows: in response to the ratio of first upper probabilities (those greater than the first set threshold) among the plurality of first probabilities being greater than the first set ratio, the fusion result is obtained according to the average detection results of the result groups corresponding to the first upper probabilities. That is, when the proportion of first probabilities exceeding the first set threshold is greater than the first set ratio, the fusion result is calculated from the average detection results of the result groups corresponding to these first upper probabilities, for example, by taking the mean of those average detection results as the fusion result.
  • For example, when the first set threshold is 0.85 and the first set ratio is 0.7: if the ratio of first upper probabilities greater than 0.85 exceeds 0.7, the mean of the average detection results of the result groups corresponding to those first upper probabilities is used as the fusion result.
  • In this case, the few lower group detection results may be misjudgments by the neural network.
  • Similarly, the fusion result may be obtained as follows: in response to the ratio of first lower probabilities (those smaller than the second set threshold) among the plurality of first probabilities being greater than the second set ratio, the fusion result is obtained according to the average detection results of the result groups corresponding to the first lower probabilities. That is, when the proportion of first probabilities below the second set threshold is greater than the second set ratio, the fusion result is calculated from the average detection results of the result groups corresponding to these first lower probabilities, for example, by taking the mean of those average detection results as the fusion result.
  • the first set threshold is greater than the second set threshold.
  • the first set ratio and the second set ratio may be the same or different, which is not limited in this embodiment of the present disclosure.
  • For example, when the second set threshold is 0.15 and the second set ratio is 0.7: if the ratio of first lower probabilities smaller than 0.15 exceeds 0.7, the mean of the average detection results of the result groups corresponding to those first lower probabilities is used as the fusion result.
  • In this case, the few higher group detection results may be misjudgments by the neural network; handling them in the above manner reduces the influence of such misjudgments on the video detection result.
  • When the proportion of first upper probabilities greater than the first set threshold is less than or equal to the first set ratio, and the proportion of first lower probabilities smaller than the second set threshold is less than or equal to the second set ratio, the fusion result may instead be obtained directly from the respective single-frame detection results of the plurality of first video frames.
  • the average value of the respective single-frame detection results of the plurality of first video frames may be used as the fusion result.
  • In this case, the fusion result is calculated from the single-frame detection results of all the first video frames, and each first video frame contributes equally to the final authenticity discrimination result.
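  • The grouping, nonlinear mapping, ratio rules, and plain-average fallback described above can be sketched as follows. The group size, thresholds, and ratios are the example values from the text (groups of 5, thresholds 0.85/0.15, ratios 0.7), and sigmoid is used as the mapping because the two-class softmax of a scalar logit reduces to its sigmoid.

```python
import math

def sigmoid(x):
    """Two-class softmax of a scalar logit: maps (-inf, +inf) to [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse_single_frame_results(scores, group_size=5,
                              upper_thr=0.85, lower_thr=0.15,
                              upper_ratio=0.7, lower_ratio=0.7):
    """scores: non-empty list of per-frame logits. Returns the fusion result."""
    # 1) group adjacent frames and average each group
    groups = [scores[i:i + group_size] for i in range(0, len(scores), group_size)]
    means = [sum(g) / len(g) for g in groups]
    # 2) map each group's average detection result to a probability in [0, 1]
    probs = [sigmoid(m) for m in means]
    # 3) ratio rules: fuse over only the confidently-high or confidently-low groups
    high = [m for m, p in zip(means, probs) if p > upper_thr]
    low = [m for m, p in zip(means, probs) if p < lower_thr]
    if len(high) / len(probs) > upper_ratio:
        return sum(high) / len(high)
    if len(low) / len(probs) > lower_ratio:
        return sum(low) / len(low)
    # 4) fallback: plain average over all single-frame results
    return sum(scores) / len(scores)
```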
  • a weighted average of the fusion result and the video stream detection result may be computed, and the authenticity discrimination result of the video to be processed is determined according to the weighted average result.
  • For example, the weighted average result may be compared with a set discrimination threshold: when the weighted average result is less than the set discrimination threshold, the video to be processed is determined to be real, that is, not a forged video; when the weighted average result is greater than or equal to the set discrimination threshold, the video to be processed is determined to be a forged video.
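  • A minimal sketch of this decision step, with illustrative equal weights and an assumed discrimination threshold of 0.5 over confidences in [0, 1]:

```python
def discriminate(fusion_prob, stream_prob,
                 w_frames=0.5, w_stream=0.5, threshold=0.5):
    """Weighted average of the two confidences, then threshold comparison."""
    weighted = w_frames * fusion_prob + w_stream * stream_prob
    return "fake" if weighted >= threshold else "real"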
  • in a case where the first video frame includes multiple human faces, the single-person detection results of the multiple faces may be fused to obtain the single-frame detection result of the first video frame.
  • the face detection frames may be obtained by performing face detection on the first video frame through a face detection network, such as RetinaFace; for video frames after the first video frame on which face detection has been carried out, the faces may be tracked through a face tracking network, such as a Siamese network, to obtain the face detection frames.
  • a corresponding face detection frame can be generated for each face; each face detection frame has a frame number and can be marked with a corresponding face number to distinguish the multiple faces included in the first video frame. For example, when the first video frame includes 3 faces, face detection frames with frame numbers A, B, and C are generated and marked with face numbers 1, 2, and 3, respectively.
  • the face detection frame includes the coordinate information of its four vertices, or its width and height information.
  • According to the image area corresponding to each face detection frame, the single-person detection result of the face corresponding to that frame is determined; for example, the single-person detection results of faces 1, 2, and 3 can be obtained respectively.
  • For the video to be processed, an input tensor of [face number, frame number, height, width, channel] can be generated, so that the crops of the multiple faces are concatenated by face number into a video frame set; in this way each face in the video to be processed can be detected individually, and the single-person detection result corresponding to each face number can be obtained.
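  • A sketch of assembling that [face number, frame number, height, width, channel] tensor is shown below; the `crops` dictionary layout (face id mapped to a list of equally sized per-frame crops) is an illustrative assumption.

```python
import numpy as np

def build_face_tensor(crops, num_frames, size=(224, 224)):
    """crops: {face_id: [HxWx3 uint8 crop per frame, ...]} with crops resized to `size`."""
    face_ids = sorted(crops.keys())
    h, w = size
    tensor = np.zeros((len(face_ids), num_frames, h, w, 3), dtype=np.uint8)
    for fi, face_id in enumerate(face_ids):
        for t, crop in enumerate(crops[face_id][:num_frames]):
            tensor[fi, t] = crop  # one tracked face's crop sequence per row
    return tensor
```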
  • the single-person detection results of each of the faces are mapped to second probabilities through a second setting function to obtain a plurality of the second probabilities, wherein the second setting function is a nonlinear mapping function.
  • For example, the single-person detection result of each face can be mapped by the Softmax function to a second probability in the [0,1] interval, reflecting the distribution of the single-person detection results of the multiple faces contained in the video to be processed.
  • a single-frame detection result of the first video frame is obtained according to the single-person detection results of each of the faces and a plurality of second probabilities.
  • In this way, each face in the video to be processed is detected individually, the influence of each face's single-person detection result on the authenticity discrimination result of the video can be evaluated more accurately, and the accuracy of video detection is improved.
  • the fusion result of multiple faces may be obtained as follows: in response to there being, among the multiple second probabilities, a second probability greater than the third set threshold, or in response to the multiple second probabilities all being greater than the fourth set threshold, the maximum value among the single-person detection results of the first video frame is acquired as the single-frame detection result of the first video frame.
  • the third set threshold is greater than the fourth set threshold.
  • For example, when the third set threshold is 0.9 and the fourth set threshold is 0.6: if any second probability exceeds 0.9, or all second probabilities exceed 0.6, the maximum single-person confidence score in the first video frame is taken as the single-frame detection result of the frame.
  • the fusion result of multiple faces may also be obtained as follows: in response to the multiple second probabilities all being smaller than the fifth set threshold, the minimum value among the single-person detection results of the first video frame is acquired as the single-frame detection result. That is, when the second probabilities corresponding to all faces in the first video frame are smaller than the fifth set threshold, the confidence of every face detection result in the frame is low, so the smallest single-person detection result is used as the single-frame detection result, giving the entire first video frame a low single-frame detection result.
  • the fourth set threshold is greater than the fifth set threshold.
  • For example, when the fifth set threshold is 0.4: if all second probabilities are below 0.4, the minimum single-person confidence score in the first video frame is taken as the frame's single-frame detection result.
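  • The multi-face fusion rules above can be sketched as follows, reusing the sigmoid helper from the earlier fusion sketch and the example thresholds 0.9/0.6/0.4; the plain-average fallback for the remaining case is an assumption, as the text does not specify it.

```python
def fuse_faces(face_scores, thr3=0.9, thr4=0.6, thr5=0.4):
    """face_scores: per-face logits for one frame -> the frame's detection result."""
    probs = [sigmoid(s) for s in face_scores]  # second probabilities in [0, 1]
    if any(p > thr3 for p in probs) or all(p > thr4 for p in probs):
        return max(face_scores)   # one confidently fake face dominates the frame
    if all(p < thr5 for p in probs):
        return min(face_scores)   # all faces look real: keep the frame score low
    return sum(face_scores) / len(face_scores)  # fallback: assumed, not stated in the text
```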
  • In the above manner, the single-person detection result of each face is acquired, and the single-person detection results of the multiple faces are fused to obtain the single-frame detection result of the first video frame, so that the authenticity discrimination result of the video takes into account the detection results of different faces, improving the video detection effect.
  • In some embodiments, the first authenticity classification network includes authenticity classification networks of multiple structures, and authenticity detection is performed on the first video frame through each of these networks to obtain multiple sub-single-frame detection results; this is equivalent to performing authenticity detection on the first video frame with multiple methods, and the single-frame detection result of the first video frame can be obtained by fusing the multiple sub-single-frame detection results.
  • the detection results of multiple sub-single frames corresponding to the first video frame may be fused by the following method.
  • the plurality of sub-single frame detection results are respectively mapped to third probabilities through a third setting function to obtain a plurality of third probabilities.
  • For example, each sub-single-frame detection result can be mapped by the Softmax function to a third probability in the [0,1] interval, reflecting the distribution of the sub-single-frame detection results obtained by the multiple authenticity classification methods.
  • a single-frame detection result is obtained according to the multiple sub-single-frame detection results and the multiple third probabilities.
  • In response to the ratio of third upper probabilities (those greater than the sixth set threshold) among the plurality of third probabilities being greater than the third set ratio, the single-frame detection result of the first video frame is obtained according to the sub-single-frame detection results corresponding to the third upper probabilities. That is, when the third upper probabilities exceeding the third set ratio are all greater than the sixth set threshold, the single-frame detection result of the first video frame is calculated from the sub-single-frame detection results corresponding to these third upper probabilities, for example, by taking their average as the single-frame detection result.
  • For example, when the sixth set threshold is 0.8 and the third set ratio is 0.7: if the ratio of third upper probabilities greater than 0.8 exceeds 0.7, the average of the corresponding sub-single-frame confidence scores is used as the single-frame detection result.
  • Similarly, in response to the ratio of third lower probabilities (those smaller than the seventh set threshold) among the plurality of third probabilities being greater than the fourth set ratio, the single-frame detection result of the first video frame is obtained according to the sub-single-frame detection results corresponding to the third lower probabilities, for example, by taking their average as the single-frame detection result. The sixth set threshold is greater than the seventh set threshold.
  • the third set ratio and the fourth set ratio may be the same or different, which is not limited in this embodiment of the present disclosure.
  • For example, when the seventh set threshold is 0.2 and the fourth set ratio is 0.7: if the ratio of third lower probabilities smaller than 0.2 exceeds 0.7, the average of the corresponding sub-single-frame confidence scores is used as the single-frame detection result.
  • When most sub-single-frame detection results are low, the few higher sub-single-frame detection results may be misjudgments by the authenticity classification network of the corresponding structure; handling them in the above manner reduces the influence of such misjudgments on the video detection result.
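  • The same ratio rule can be reused for the multi-structure ensemble by treating each model's sub-single-frame result as a group of size 1, with the example thresholds 0.8/0.2; `ensemble` and `frame_batch` below are hypothetical names for a list of trained frame-level models of different structures and a (1, 3, H, W) input tensor.

```python
# Each model's output is its own "group" of size 1, so the ratio rule above
# fuses the confidently-agreeing sub-single-frame results directly.
sub_results = [float(model(frame_batch)) for model in ensemble]
frame_score = fuse_single_frame_results(sub_results, group_size=1,
                                        upper_thr=0.8, lower_thr=0.2)
```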
  • Likewise, in some embodiments the second authenticity classification network includes authenticity classification networks of multiple structures, and authenticity detection is performed on the first video stream through each of these networks to obtain multiple sub-video stream detection results; this is equivalent to performing authenticity detection on the first video stream with multiple methods, and the video stream detection result of the first video stream can be obtained by fusing the multiple sub-video stream detection results.
  • the detection results of multiple sub-video streams corresponding to the first video stream may be fused by the following method.
  • the plurality of sub-video stream detection results are respectively mapped to fourth probabilities through a fourth setting function to obtain a plurality of fourth probabilities.
  • For example, each sub-video stream detection result can be mapped by the Softmax function to a fourth probability in the [0,1] interval, reflecting the distribution of the sub-video stream detection results obtained by the multiple authenticity classification methods. The video stream detection result of the first video stream is then obtained according to the multiple sub-video stream detection results and the multiple fourth probabilities.
  • In response to the ratio of fourth upper probabilities (those greater than the eighth set threshold) among the plurality of fourth probabilities being greater than the fifth set ratio, the video stream detection result of the first video stream is obtained according to the sub-video stream detection results corresponding to the fourth upper probabilities, for example, by taking the average of these sub-video stream detection results as the video stream detection result.
  • For example, when the eighth set threshold is 0.8 and the fifth set ratio is 0.7: if the ratio of fourth upper probabilities greater than 0.8 exceeds 0.7, the average of the corresponding sub-video stream confidence scores is used as the video stream detection result of the first video stream.
  • Similarly, in response to the ratio of fourth lower probabilities (those smaller than the ninth set threshold) among the plurality of fourth probabilities being greater than the sixth set ratio, the video stream detection result of the first video stream is calculated from the sub-video stream detection results corresponding to these fourth lower probabilities, for example, by taking their average as the video stream detection result.
  • the eighth set threshold is greater than the ninth set threshold.
  • the fifth set ratio and the sixth set ratio may be the same or different, which is not limited in this embodiment of the present disclosure.
  • For example, when the ninth set threshold is 0.2 and the sixth set ratio is 0.7: if the ratio of fourth lower probabilities smaller than 0.2 exceeds 0.7, the average of the corresponding sub-video stream confidence scores is used as the video stream detection result of the first video stream.
  • When most sub-video stream detection results are low, the few higher sub-video stream detection results may be misjudgments by the authenticity classification network of the corresponding structure; handling them in the above manner reduces the influence of such misjudgments on the video detection result.
  • each set threshold and each set ratio may be determined according to the accuracy requirements of the video detection result, which are not limited herein.
  • For the multiple first video frames in the video to be processed, both the fusion over multiple faces and the fusion over the sub-single-frame detection results obtained by multiple methods may be performed, and the two fusion results are weighted and averaged to obtain the final single-frame detection result.
  • FIG. 2 shows a schematic diagram of a video detection method according to at least one embodiment of the present disclosure.
  • a plurality of first video frames in the video to be processed and a first video stream formed by the plurality of first video frames are acquired.
  • the first video frame is processed to obtain a single-frame detection result of the first video frame.
  • On one hand, authenticity detection is performed on the multiple faces contained in the first video frame, and the single-person detection results corresponding to the faces are fused to obtain a face fusion result; on the other hand, authenticity detection is performed on the first video frame through various methods, and the sub-single-frame detection results corresponding to the methods are fused to obtain a method fusion result. The face fusion result and the method fusion result are then weighted and averaged to obtain the single-frame detection result of the first video frame.
  • The fusion result corresponding to the plurality of first video frames is obtained by fusing their respective single-frame detection results.
  • the first video stream is processed to obtain a video stream detection result of the first video stream.
  • The authenticity detection of the first video stream can be performed by various methods, and the sub-video stream detection results corresponding to the various methods are fused to obtain the video stream detection result.
  • In this way, the authenticity discrimination result of the video to be processed is obtained by combining multiple fusion methods; for videos that contain both real and forged video frames, or both real and forged faces, effective authenticity detection can be performed and highly accurate video detection results obtained.
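  • A minimal end-to-end sketch wiring together the helpers from the earlier sketches (extract_frames, fuse_single_frame_results, sigmoid, discriminate, and trained frame/clip networks) follows; preprocessing such as resizing, BGR-to-RGB conversion, and normalization is omitted for brevity, and the weights and thresholds remain illustrative.

```python
import torch

def detect_video(video_path, frame_net, clip_net):
    """frame_net/clip_net: trained instances of the frame- and clip-level networks."""
    frames = extract_frames(video_path)                          # step 101
    batch = torch.stack([torch.as_tensor(f).permute(2, 0, 1).float() / 255.0
                         for f in frames])                       # (N, 3, H, W)
    with torch.no_grad():
        frame_scores = frame_net(batch).tolist()                 # step 102: per-frame logits
        clip = batch.permute(1, 0, 2, 3).unsqueeze(0)            # (1, 3, T, H, W)
        stream_score = clip_net(clip).item()                     # step 103: clip logit
    fusion = fuse_single_frame_results(frame_scores)             # group-wise fusion
    return discriminate(sigmoid(fusion), sigmoid(stream_score))  # step 104
```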
  • the authenticity detection performed on the first video frame may be face-swap detection, and the obtained single-frame detection result indicates whether the face image in the first video frame is a face-swapped image; for example, the higher the score included in the detection result, the higher the confidence that the face image in the first video frame is face-swapped.
  • Similarly, the authenticity detection performed on the first video stream may be face-swap detection, and the obtained video stream detection result indicates whether the face images in the first video stream are face-swapped. According to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream, a determination of whether the video to be processed is a face-swapped video can be obtained.
  • FIG. 3 shows a schematic diagram of a video detection apparatus according to an embodiment of the present disclosure.
  • the device includes a first obtaining unit 301 for obtaining a plurality of first video frames in a video to be processed, and a first video stream corresponding to the video to be processed; a second obtaining unit 302, for acquiring a single frame detection result of performing authenticity detection on each of the first video frames; a third acquiring unit 303 for acquiring a video stream detection result for performing authenticity detection on the first video stream; determining unit 304 is used to determine the authenticity discrimination result of the video to be processed according to the respective single frame detection results of the multiple first video frames and the video stream detection results of the first video stream.
  • the first obtaining unit is specifically configured to: perform frame extraction processing on the video to be processed with a set frame number span to obtain the plurality of first video frames, wherein the set frame The number span is positively related to the total number of video frames contained in the video to be processed.
  • the second obtaining unit is specifically configured to: perform authenticity detection on each of the first video frames through a first authenticity classification network, and obtain a single-frame detection result of each of the first video frames, The single-frame detection result is used to represent the confidence that the first video frame is forged.
  • the third obtaining unit is specifically configured to: perform authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the relationships between frames, to obtain the video stream detection result of the first video stream, wherein the video stream detection result is used to represent the confidence that the first video stream is forged.
  • the determining unit is specifically configured to: fuse the respective single-frame detection results of the multiple first video frames to obtain a fusion result; determine according to the fusion result and the video stream detection result The authenticity discrimination result of the video to be processed.
  • when the determining unit is configured to fuse the respective single-frame detection results of the multiple first video frames to obtain a fusion result, it is specifically configured to: group the single-frame detection results of the multiple first video frames to obtain multiple result groups each including one or more single-frame detection results; acquire the average detection result of each result group; map the average detection result of each result group to a first probability through the first setting function to obtain a plurality of first probabilities, wherein the first setting function is a nonlinear mapping function; and obtain the fusion result according to the average detection results of the result groups and the plurality of first probabilities.
  • when the determining unit is configured to obtain the fusion result according to the average detection result of each result group and the multiple first probabilities, it is specifically configured to: in response to the ratio of first upper probabilities greater than the first set threshold among the multiple first probabilities being greater than the first set ratio, obtain the fusion result according to the average detection results of the result groups corresponding to the first upper probabilities; and/or, in response to the ratio of first lower probabilities smaller than the second set threshold among the multiple first probabilities being greater than the second set ratio, obtain the fusion result according to the average detection results of the result groups corresponding to the first lower probabilities;
  • the first set threshold is greater than the second set threshold.
  • when the determining unit is configured to determine the authenticity discrimination result of the video to be processed according to the fusion result and the video stream detection result, it is specifically configured to: perform a weighted average of the fusion result and the video stream detection result, and determine the authenticity discrimination result of the video to be processed according to the obtained weighted average result.
  • in a case where the first video frame includes multiple faces, the second obtaining unit is specifically configured to: obtain the face detection frames corresponding to the multiple faces in the first video frame; determine the single-person detection result of the corresponding face according to the image area corresponding to each face detection frame; map the single-person detection result of each face to a second probability through the second setting function to obtain a plurality of second probabilities, wherein the second setting function is a nonlinear mapping function; and obtain the single-frame detection result of the first video frame according to the single-person detection results of the faces and the plurality of second probabilities.
  • in a case where the first authenticity classification network includes authenticity classification networks of multiple structures, when the second obtaining unit is configured to perform authenticity detection on the first video frame through the first authenticity classification network to obtain the single-frame detection result of the first video frame, it is specifically configured to: perform authenticity detection on the first video frame through the authenticity classification networks of the multiple structures to obtain multiple sub-single-frame detection results; map the multiple sub-single-frame detection results to third probabilities through the third setting function to obtain multiple third probabilities, wherein the third setting function is a nonlinear mapping function; in response to the ratio of third upper probabilities greater than the sixth set threshold among the multiple third probabilities being greater than the third set ratio, obtain the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third upper probabilities; and/or, in response to the ratio of third lower probabilities smaller than the seventh set threshold among the multiple third probabilities being greater than the fourth set ratio, obtain the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third lower probabilities, wherein the sixth set threshold is greater than the seventh set threshold.
  • in a case where the second authenticity classification network includes authenticity classification networks of multiple structures, when the third obtaining unit is configured to perform authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the relationships between frames, to obtain the video stream detection result of the first video stream, it is specifically configured to: perform authenticity detection on the first video stream through the authenticity classification networks of the multiple structures according to the video frames contained in the first video stream and the relationships between frames, to obtain multiple sub-video stream detection results; map the multiple sub-video stream detection results to fourth probabilities through the fourth setting function to obtain a plurality of fourth probabilities, wherein the fourth setting function is a nonlinear mapping function; in response to the ratio of fourth upper probabilities greater than the eighth set threshold among the multiple fourth probabilities being greater than the fifth set ratio, obtain the video stream detection result of the first video stream according to the sub-video stream detection results corresponding to the fourth upper probabilities; and/or, in response to the ratio of fourth lower probabilities smaller than the ninth set threshold among the multiple fourth probabilities being greater than the sixth set ratio, obtain the video stream detection result of the first video stream according to the sub-video stream detection results corresponding to the fourth lower probabilities, wherein the eighth set threshold is greater than the ninth set threshold.
  • the single-frame detection result indicates whether the face image in the first video frame is a face-changing image; the video stream detection result of the first video stream indicates whether the face images in the first video stream are face-changing images; and the authenticity discrimination result of the video to be processed indicates whether the video to be processed is a face-changing video.
  • FIG. 4 shows an electronic device provided by at least one embodiment of the present disclosure. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the video detection method described in any implementation of the present disclosure when executing the computer instructions.
  • At least one embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the video detection method described in any implementation manner of the present disclosure.
  • one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
  • Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to them, or both.
  • however, a computer need not have such devices.
  • the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory may be supplemented by or incorporated in special purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A video detection method, apparatus, device, and computer-readable storage medium. The method includes: acquiring multiple first video frames in a to-be-processed video, and a first video stream corresponding to the to-be-processed video (101); acquiring a single-frame detection result of performing authenticity detection on each of the first video frames (102); acquiring a video stream detection result of performing authenticity detection on the first video stream (103); and determining an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream (104).

Description

Video detection method, apparatus, device, and computer-readable storage medium
Cross-reference to related applications
This patent application claims priority to the Chinese patent application No. 202011365074.7, filed on November 27, 2020 and entitled "Video detection method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to computer vision technology, and in particular, to a video detection method, apparatus, device, and computer-readable storage medium.
Background
Based on the adversarial optimization principle of Generative Adversarial Networks (GANs) or on graphics methods, forged videos of extremely high realism can be generated, and negative applications of forged videos bring many adverse effects. It is therefore necessary to perform authenticity detection on videos to guarantee their genuineness.
Summary
Embodiments of the present disclosure provide a video detection solution.
According to an aspect of the present disclosure, a video detection method is provided. The method includes: acquiring multiple first video frames in a to-be-processed video, and a first video stream corresponding to the to-be-processed video; acquiring a single-frame detection result of performing authenticity detection on each of the first video frames; acquiring a video stream detection result of performing authenticity detection on the first video stream; and determining an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
In combination with any implementation manner provided by the present disclosure, the acquiring multiple first video frames in the to-be-processed video includes: performing frame extraction on the to-be-processed video at a set frame-number span to obtain the multiple first video frames, where the set frame-number span is positively correlated with the total number of video frames contained in the to-be-processed video.
In combination with any implementation manner provided by the present disclosure, the acquiring a single-frame detection result of performing authenticity detection on each of the first video frames includes: performing authenticity detection on the first video frame through a first authenticity classification network to obtain the single-frame detection result of the first video frame, where the single-frame detection result is used to represent the confidence that the first video frame is forged.
In combination with any implementation manner provided by the present disclosure, the acquiring a video stream detection result of performing authenticity detection on the first video stream includes: performing authenticity detection on the first video stream through a second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships, to obtain the video stream detection result of the first video stream, where the video stream detection result is used to represent the confidence that the first video stream is forged.
In combination with any implementation manner provided by the present disclosure, the determining the authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream includes: fusing the respective single-frame detection results of the multiple first video frames to obtain a fusion result; and determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result.
In combination with any implementation manner provided by the present disclosure, the fusing the respective single-frame detection results of the multiple first video frames to obtain a fusion result includes: grouping the respective single-frame detection results of the multiple first video frames to obtain multiple result groups each including one or more single-frame detection results; obtaining an average detection result of each result group; mapping the average detection result of each result group to a first probability through a first set function to obtain multiple first probabilities, where the first set function is a nonlinear mapping function; and obtaining the fusion result according to the average detection results of the result groups and the multiple first probabilities.
In combination with any implementation manner provided by the present disclosure, the obtaining the fusion result according to the average detection results of the result groups and the multiple first probabilities includes at least one of the following: in response to the proportion of first upper probabilities greater than a first set threshold among the multiple first probabilities being greater than a first set proportion, obtaining the fusion result according to the average detection results of the result groups corresponding to the first upper probabilities; in response to the proportion of first lower probabilities smaller than a second set threshold among the multiple first probabilities being greater than a second set proportion, obtaining the fusion result according to the average detection results of the result groups corresponding to the first lower probabilities; where the first set threshold is greater than the second set threshold.
In combination with any implementation manner provided by the present disclosure, the determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result includes: performing a weighted average on the fusion result and the video stream detection result to obtain a weighted average result; and determining the authenticity determination result of the to-be-processed video according to the obtained weighted average result.
In combination with any implementation manner provided by the present disclosure, the first video frame includes multiple faces, and the acquiring a single-frame detection result of performing authenticity detection on each of the first video frames includes: acquiring face detection boxes corresponding to the multiple faces in the first video frame; determining a single-person detection result of the corresponding face according to the image region corresponding to each face detection box; mapping the single-person detection result of each face to a second probability through a second set function to obtain multiple second probabilities, where the second set function is a nonlinear mapping function; and obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities.
In combination with any implementation manner provided by the present disclosure, the obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities includes at least one of the following: in response to the existence, among the multiple second probabilities, of a second probability greater than a third set threshold, taking the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; in response to all of the multiple second probabilities being greater than a fourth set threshold, taking the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; in response to all of the multiple second probabilities being smaller than a fifth set threshold, taking the smallest single-person detection result in the first video frame as the single-frame detection result of the first video frame; where the third set threshold is greater than the fourth set threshold, and the fourth set threshold is greater than the fifth set threshold.
In combination with any implementation manner provided by the present disclosure, the first authenticity classification network includes authenticity classification networks of multiple structures, and the performing authenticity detection on the first video frame through the first authenticity classification network to obtain the single-frame detection result of the first video frame includes: performing authenticity detection on the first video frame through the authenticity classification networks of multiple structures to obtain multiple sub-single-frame detection results; mapping the multiple sub-single-frame detection results respectively to third probabilities through a third set function to obtain multiple third probabilities, where the third set function is a nonlinear mapping function; and determining the single-frame detection result of the first video frame through at least one of the following: in response to the proportion of third upper probabilities greater than a sixth set threshold among the multiple third probabilities being greater than a third set proportion, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third upper probabilities; in response to the proportion of third lower probabilities smaller than a seventh set threshold among the multiple third probabilities being greater than a fourth set proportion, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third lower probabilities, where the sixth set threshold is greater than the seventh set threshold.
In combination with any implementation manner provided by the present disclosure, the second authenticity classification network includes authenticity classification networks of multiple structures, and the performing authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships to obtain the video stream detection result of the first video stream includes: performing authenticity detection on the first video stream through the authenticity classification networks of multiple structures according to the video frames contained in the first video stream and the inter-frame relationships, to obtain multiple sub-video-stream detection results; mapping the multiple sub-video-stream detection results respectively to fourth probabilities through a fourth set function to obtain multiple fourth probabilities, where the fourth set function is a nonlinear mapping function; and determining the video stream detection result of the first video stream through at least one of the following: in response to the proportion of fourth upper probabilities greater than an eighth set threshold among the multiple fourth probabilities being greater than a fifth set proportion, obtaining the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth upper probabilities; in response to the proportion of fourth lower probabilities smaller than a ninth set threshold among the multiple fourth probabilities being greater than a sixth set proportion, obtaining the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth lower probabilities, where the eighth set threshold is greater than the ninth set threshold.
In combination with any implementation manner provided by the present disclosure, the single-frame detection result of the first video frame indicates whether the face image in the first video frame is a face-swapped image; the video stream detection result of the first video stream indicates whether the face images in the first video stream are face-swapped images; and the authenticity determination result of the to-be-processed video indicates whether the to-be-processed video is a face-swapped video.
According to an aspect of the present disclosure, a video detection apparatus is provided. The apparatus includes: a first acquisition unit configured to acquire multiple first video frames in a to-be-processed video, and a first video stream corresponding to the to-be-processed video; a second acquisition unit configured to acquire a single-frame detection result of performing authenticity detection on each of the first video frames; a third acquisition unit configured to acquire a video stream detection result of performing authenticity detection on the first video stream; and a determination unit configured to determine an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the video detection method described in any implementation manner of the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; the program, when executed by a processor, implements the video detection method described in any implementation manner of the present disclosure.
A computer program is provided, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the video detection method described in any implementation manner of the present disclosure.
In the embodiments of the present disclosure, authenticity detection is performed on both the multiple first video frames in the to-be-processed video and the first video stream corresponding to the to-be-processed video, the single-frame detection results of the first video frames and the video stream detection result of the first video stream are obtained respectively, and the authenticity determination result of the to-be-processed video is determined according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream, so that partially forged video frames present in the to-be-processed video can be detected and the accuracy of video detection is improved.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings here are incorporated into and constitute a part of this specification; they show embodiments consistent with this specification and, together with the specification, serve to explain the principles of this specification.
FIG. 1 is a flowchart of a video detection method according to at least one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a video detection method according to at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a video detection apparatus according to at least one embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description involves the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; on the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Embodiments of the present disclosure may be applied to a computer system/server, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, large computer systems, distributed cloud computing technology environments including any of the above systems, and the like.
FIG. 1 is a flowchart of a video detection method according to at least one embodiment of the present disclosure. As shown in FIG. 1, the method includes steps 101 to 104.
In step 101, multiple first video frames in a to-be-processed video and a first video stream corresponding to the to-be-processed video are acquired.
In the embodiments of the present disclosure, the multiple first video frames may be the video frames corresponding to the original video sequence contained in the to-be-processed video, or may be video frames obtained by performing frame extraction on the original video sequence. The first video stream corresponding to the to-be-processed video may be a video stream formed by the original video sequence contained in the to-be-processed video, or may be a video stream formed by the video frames obtained by performing frame extraction on the original video sequence, for example, a video stream formed by the multiple first video frames.
In step 102, a single-frame detection result of performing authenticity detection on each of the first video frames is acquired.
In the embodiments of the present disclosure, authenticity detection may be performed on the first video frame through a first authenticity classification network to obtain the single-frame detection result of the first video frame, where the single-frame detection result is used to represent the confidence that the first video frame is forged; for example, the single-frame detection result includes a single-frame confidence score.
In one example, the first authenticity classification network may be a pre-trained authenticity classification network that detects video frames independently, such as ResNet (Residual Neural Network), DenseNet (Densely Connected Convolutional Networks), EfficientNet, Xception, SENet (Squeeze-and-Excitation Network), and so on.
In step 103, a video stream detection result of performing authenticity detection on the first video stream is acquired.
In the embodiments of the present disclosure, authenticity detection may be performed on the first video stream through a second authenticity classification network according to the frame sequence corresponding to the first video stream and the inter-frame relationships, to obtain the video stream detection result of the first video stream, where the video stream detection result is used to represent the confidence that the first video stream is forged; for example, the video stream detection result includes a video stream confidence score.
In one example, the second authenticity classification network may be a pre-trained authenticity classification network that detects video streams while taking inter-frame relationships into account, such as a C3D (3D ConvNets) network, a SlowFast network, an X3D (Extensible 3D) network, and so on.
In step 104, the authenticity determination result of the to-be-processed video is determined according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
In the embodiments of the present disclosure, by performing authenticity detection on both the multiple first video frames in the to-be-processed video and the first video stream corresponding to the to-be-processed video, the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream are obtained, and the authenticity determination result of the to-be-processed video is determined according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream, so that partially forged video frames present in the to-be-processed video can be detected and the accuracy of video detection is improved.
In some embodiments, frame extraction may be performed on the to-be-processed video at a set frame-number span to obtain the multiple first video frames. The set frame-number span may be determined according to the number of frames of the to-be-processed video; for example, the set frame-number span may be positively correlated with the total number of video frames contained in the to-be-processed video, so that the span adapts itself to the length of the to-be-processed video and a reasonable number of first video frames can be extracted, improving the effect of video detection. For example, for a 10-second video with 160 to 320 frames, frame extraction may be performed with a frame-number span of 2, i.e., one frame is extracted out of every 2 frames.
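To make the adaptive span concrete, the following is a minimal Python sketch. OpenCV is assumed for frame access; the disclosure fixes only the example of a span of 2 for a 160-320 frame clip, so the growth rule used for other lengths is an assumption of this sketch.

```python
import math
import cv2  # assumption: frames are decoded with OpenCV


def choose_span(total_frames: int) -> int:
    """Frame-number span positively correlated with clip length.

    Matches the worked example above (span 2 for 160-320 frames);
    the scaling beyond that range is an assumption of this sketch.
    """
    if total_frames < 160:
        return 1
    return 1 + math.ceil(total_frames / 320)  # 160-320 frames -> span 2


def sample_first_frames(video_path: str):
    """Extract every span-th frame of the to-be-processed video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    span = choose_span(total)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % span == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```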
In some embodiments, the respective single-frame detection results of the multiple first video frames may first be fused to obtain a fusion result, and the authenticity determination result of the to-be-processed video may then be determined according to the fusion result and the video stream detection result. Fusing the respective single-frame detection results of the multiple first video frames lets the fusion result reflect the influence of each single-frame detection result; determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result can then improve the effect of video detection.
In one example, the respective single-frame detection results of the multiple first video frames may be fused in the following manner to obtain the fusion result.
First, the respective single-frame detection results of the multiple first video frames are grouped to obtain multiple result groups each including one or more single-frame detection results, and the average detection result of each result group is obtained. For example, the average detection result of a group may include the average confidence score of the frames within the group.
By dividing every M adjacent first video frames among the multiple first video frames into one group, the multiple first video frames can be divided into N groups, where M and N are positive integers. Those skilled in the art should understand that, when the total number of the multiple first video frames is not an integer multiple of M, a group whose number of first video frames is not M may occur.
For example, when the total number of first video frames is 30, every 5 adjacent first video frames may be grouped, so that the multiple first video frames in the to-be-processed video are divided into 6 groups.
Next, the average detection result of each result group is mapped to a first probability through a first set function to obtain multiple first probabilities, where the first set function is a nonlinear mapping function. The first set function may be, for example, the normalized exponential Softmax function, through which the average single-frame confidence score of each group is mapped to a first probability.
In the embodiments of the present disclosure, when the first authenticity classification network is a logistic regression network, the single-frame detection result of the first video frame is a logit value in the interval (-∞, +∞). Mapping the average detection results of the groups to first probabilities in the interval [0, 1] through the Softmax function can reflect the distribution of the groups' average detection results.
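A small sketch of the grouping and mapping steps described above (NumPy assumed; the group size of 5 follows the 30-frame example):

```python
import numpy as np


def group_averages(frame_logits, group_size=5):
    """Average per-frame logits within groups of adjacent frames;
    a trailing group may hold fewer than group_size results."""
    groups = [frame_logits[i:i + group_size]
              for i in range(0, len(frame_logits), group_size)]
    return np.array([np.mean(g) for g in groups])


def softmax(x):
    """Nonlinear map of the (-inf, +inf) group averages into [0, 1];
    the result reflects how the groups' judgements are distributed."""
    e = np.exp(x - np.max(x))  # shift by max for numerical stability
    return e / e.sum()


# e.g., 30 per-frame logits -> 6 group averages -> 6 first probabilities
first_probs = softmax(group_averages(np.random.randn(30)))
```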
Afterwards, the fusion result is obtained according to the average detection results of the result groups and the multiple first probabilities.
By fusing the average detection results of the multiple groups according to the distribution of the groups' average detection results, the influence of each group's detection result on the authenticity determination result of the to-be-processed video can be evaluated more accurately, which can improve the accuracy of video detection.
In one example, the fusion result may be obtained as follows: in response to the proportion of first upper probabilities greater than the first set threshold among the multiple first probabilities being greater than the first set proportion, the fusion result is obtained according to the average detection results of the result groups corresponding to the first upper probabilities. That is, when more than the first set proportion of the first probabilities are greater than the first set threshold, the fusion result is calculated according to the average detection results of the result groups respectively corresponding to these first upper probabilities; for example, the mean of these average detection results is taken as the fusion result.
For example, when the first set threshold is 0.85 and the first set proportion is 0.7, if the proportion of first upper probabilities greater than 0.85 exceeds 0.7, the mean of the average detection results of the result groups corresponding to the first upper probabilities is taken as the fusion result.
When the average detection results of most result groups are high, the few lower group detection results may be the result of misjudgment by the neural network. The above fusion manner can reduce the influence of the neural network's misjudgments on the video detection result.
In another example, the fusion result may be obtained as follows: in response to the proportion of first lower probabilities smaller than the second set threshold among the multiple first probabilities being greater than the second set proportion, the fusion result is obtained according to the average detection results of the result groups corresponding to the first lower probabilities. That is, when more than the second set proportion of the first probabilities are smaller than the second set threshold, the fusion result is calculated according to the average detection results of the result groups respectively corresponding to these first lower probabilities; for example, the mean of these average detection results is taken as the fusion result. The first set threshold is greater than the second set threshold.
In the above examples, the first set proportion and the second set proportion may be the same or different, which is not limited by the embodiments of the present disclosure.
For example, when the second set threshold is 0.15 and the second set proportion is 0.7, if the proportion of first lower probabilities smaller than 0.15 exceeds 0.7, the mean of the average detection results of the result groups corresponding to the first lower probabilities is taken as the fusion result.
When the average detection results of most result groups are low, the few higher group detection results may be the result of misjudgment by the neural network. The above fusion manner can reduce the influence of the neural network's misjudgments on the video detection result.
In one example, when the proportion of first upper probabilities greater than the first set threshold is smaller than or equal to the first set proportion, and the proportion of first lower probabilities smaller than the second set threshold is smaller than or equal to the second set proportion, the fusion result may be obtained according to the respective single-frame detection results of the multiple first video frames. For example, the mean of the respective single-frame detection results of the multiple first video frames may be taken as the fusion result.
When the average detection results of the multiple result groups show no consistent trend, that is, the neural network's judgments of the groups are not consistent, the fusion result is calculated from the single-frame detection results of the individual first video frames, so that each first video frame contributes equally to the final authenticity determination result.
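The three branches above might be sketched as follows (the 0.85/0.15 thresholds and the 0.7 proportion come from the worked examples; taking the mean of the selected group averages is one of the options the text mentions, and the equal-contribution fallback follows the no-consensus case):

```python
import numpy as np


def fuse_groups(group_avgs, first_probs,
                upper_thr=0.85, lower_thr=0.15, proportion=0.7):
    """Fuse group averages according to the distribution of their
    first probabilities."""
    group_avgs = np.asarray(group_avgs, dtype=float)
    first_probs = np.asarray(first_probs, dtype=float)
    upper = first_probs > upper_thr
    lower = first_probs < lower_thr
    if upper.mean() > proportion:    # most groups judged confidently fake
        return group_avgs[upper].mean()
    if lower.mean() > proportion:    # most groups judged confidently real
        return group_avgs[lower].mean()
    return group_avgs.mean()         # no consensus: equal contribution
```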
In some embodiments, a weighted average may be performed on the fusion result and the video stream detection result to obtain a weighted average result of the two, and the authenticity determination result of the to-be-processed video is determined according to the weighted average result.
In one example, the weighted average result may be compared with a set determination threshold: when the weighted average result is smaller than the set determination threshold, the to-be-processed video is determined to be genuine, that is, the to-be-processed video is determined not to be a forged video; when the weighted average result is greater than or equal to the set determination threshold, the to-be-processed video is determined to be a forged video.
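As an illustration of this final step (the weights and the determination threshold are not fixed by the disclosure, so the values below are assumptions of this sketch):

```python
def judge(fusion_result: float, stream_result: float,
          w_frames: float = 0.5, w_stream: float = 0.5,
          decision_thr: float = 0.5) -> bool:
    """Return True if the to-be-processed video is judged forged."""
    weighted = (w_frames * fusion_result + w_stream * stream_result) \
               / (w_frames + w_stream)
    return weighted >= decision_thr
```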
When multiple faces exist in at least one of the multiple first video frames, fusion may be performed over the multiple faces to obtain the single-frame detection result of the corresponding first video frame.
First, the face detection boxes corresponding to the multiple faces in the first video frame are acquired. The face detection boxes may be obtained by performing face detection on the first video frame using a face detection network, for example RetinaFace; for the video frames following the first video frame on which face detection has been performed, the face detection boxes may be obtained by tracking the faces through a face tracking network, for example a Siamese network.
For the multiple faces contained in the first video frame, a corresponding face detection box may be generated for each face. Each face detection box has a corresponding box number, and the face detection box may be annotated with the corresponding face number to distinguish the multiple faces contained in the first video frame. For example, when the first video frame includes 3 faces, face detection boxes with box numbers A, B, and C are generated respectively, and the face detection boxes A, B, and C are annotated with face numbers 1, 2, and 3, respectively.
The face detection box includes the coordinate information of the four vertices of the face detection box, or the length and height information of the face detection box.
Next, the single-person detection result of the corresponding face is determined according to the image region corresponding to the face detection box.
In one example, authenticity detection is performed on the image region corresponding to the face detection box through the first authenticity classification network, and the single-person detection result of the face corresponding to the face detection box can be obtained.
For example, authenticity detection is performed on the image regions corresponding to face detection boxes A, B, and C respectively through the first authenticity classification network, and the single-person detection results of faces 1, 2, and 3 can be obtained respectively.
For the multiple faces in the first video frame, an input tensor of [face number, box number, height, width, channel] may be generated, so that the multiple faces present in the to-be-processed video are concatenated into frame sets according to face numbers; each face in the to-be-processed video can thus be detected separately, and the single-person detection result corresponding to each face number can be obtained.
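A sketch of assembling this per-face input. The tracker output format and the 224x224 crop size are assumptions of this sketch, and OpenCV is assumed for cropping and resizing:

```python
import numpy as np
import cv2  # assumption: OpenCV used for resizing the face crops


def build_face_frame_sets(frames, tracks, crop_size=224):
    """Concatenate the crops of each tracked face across frames.

    `tracks` is assumed to map face_number -> list of
    (frame_index, x, y, w, h) boxes from the detector/tracker,
    so each face can be scored independently by the classifier.
    """
    frame_sets = {}
    for face_number, boxes in tracks.items():
        crops = [cv2.resize(frames[i][y:y + h, x:x + w],
                            (crop_size, crop_size))
                 for i, x, y, w, h in boxes]
        # shape [num_boxes, height, width, channel] for this face number
        frame_sets[face_number] = np.stack(crops)
    return frame_sets
```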
Then, the single-person detection result of each face is mapped to a second probability through a second set function to obtain multiple second probabilities, where the second set function is a nonlinear mapping function.
Similar to the above mapping of the average detection results of the result groups corresponding to the first video frames to the first probabilities, the single-person detection result of each face may be mapped through the Softmax function to a second probability in the interval [0, 1], to reflect the distribution of the single-person detection results of the multiple faces contained in the to-be-processed video.
Finally, the single-frame detection result of the first video frame is obtained according to the single-person detection results of the faces and the multiple second probabilities.
By fusing the single-person detection results of the multiple faces according to the distribution of the detection results corresponding to the faces, each face in the to-be-processed video can be detected separately, and the influence of each face's single-person detection result on the authenticity determination result of the to-be-processed video can be evaluated more accurately, which can improve the accuracy of video detection.
In one example, the fusion result of the multiple faces may be obtained as follows: in response to the existence, among the multiple second probabilities, of a second probability greater than the third set threshold, or in response to the multiple second probabilities of the first video frame all being greater than the fourth set threshold, the maximum of the single-person detection results of the first video frame is taken as the single-frame detection result of the first video frame. That is, when there is a face in the first video frame whose corresponding second probability is greater than the third set threshold, indicating that a face detection result of high confidence exists in the first video frame, the largest single-person detection result in the first video frame may be taken as the single-frame detection result, so that the whole first video frame has a high single-frame detection result; when the multiple second probabilities are all greater than the fourth set threshold, indicating that the confidences of all the face detection results in the first video frame are high, the largest single-person detection result in the first video frame is likewise taken as the single-frame detection result of the first video frame, so that the whole first video frame has a high single-frame detection result. The third set threshold is greater than the fourth set threshold.
For example, when the third set threshold is 0.9 and the fourth set threshold is 0.6, if a second probability greater than 0.9 exists in the first video frame, or all the second probabilities of the first video frame are greater than 0.6, the maximum of the single-person confidence scores in the first video frame is taken as the single-frame detection result of the frame.
In another example, the fusion result of the multiple faces may be obtained as follows: in response to the multiple second probabilities all being smaller than the fifth set threshold, the minimum of the single-person detection results of the first video frame is taken as the single-frame detection result of the first video frame. That is, when the second probabilities corresponding to all the faces in the first video frame are smaller than the fifth set threshold, indicating that the confidences of all the face detection results in the first video frame are low, the smallest single-person detection result in the first video frame may be taken as the single-frame detection result of the first video frame, so that the whole first video frame has a low single-frame detection result. The fourth set threshold is greater than the fifth set threshold.
For example, when the fifth set threshold is 0.4, if all the second probabilities of the first video frame are smaller than 0.4, the minimum of the single-person confidence scores in the first video frame is taken as the single-frame detection result of the frame.
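The multi-face rules above, as a sketch (the 0.9/0.6/0.4 thresholds come from the worked examples; the fallback when none of the three conditions holds is an assumption, since the disclosure leaves that case open):

```python
import numpy as np


def fuse_faces(single_scores, second_probs,
               thr3=0.9, thr4=0.6, thr5=0.4):
    """Per-frame fusion over the single-person detection results."""
    single_scores = np.asarray(single_scores, dtype=float)
    second_probs = np.asarray(second_probs, dtype=float)
    if (second_probs > thr3).any() or (second_probs > thr4).all():
        return single_scores.max()  # some/all faces confidently fake
    if (second_probs < thr5).all():
        return single_scores.min()  # all faces look confidently real
    return single_scores.mean()     # assumed fallback for the open case
```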
In the embodiments of the present disclosure, for the case where multiple faces exist in a first video frame of the to-be-processed video, the single-person detection result corresponding to each face is acquired, and the single-person detection results of the multiple faces are fused to obtain the single-frame detection result of the first video frame, so that the authenticity determination result of the video takes into account the influence of the detection results of different faces, improving the effect of video detection.
In some embodiments, the first authenticity classification network includes authenticity classification networks of multiple structures. Performing authenticity detection on the first video frame through the authenticity classification networks of multiple structures to obtain multiple sub-single-frame detection results is equivalent to obtaining sub-single-frame detection results of performing authenticity detection on the first video frame with multiple methods, and the single-frame detection result of the first video frame can be obtained by fusing the multiple sub-single-frame detection results corresponding to the first video frame.
In some embodiments, the multiple sub-single-frame detection results corresponding to a first video frame may be fused in the following manner.
First, the multiple sub-single-frame detection results are respectively mapped to third probabilities through a third set function to obtain multiple third probabilities.
Similar to the above mapping of the average detection results of the groups corresponding to the first video frames to the first probabilities, each sub-single-frame detection result may be mapped through the Softmax function to a third probability in the interval [0, 1], to reflect the distribution of the sub-single-frame detection results obtained by the multiple authenticity classification methods.
Next, the single-frame detection result is obtained according to the multiple sub-single-frame detection results and the multiple third probabilities.
In one example, in response to the proportion of third upper probabilities greater than the sixth set threshold among the multiple third probabilities being greater than the third set proportion, the single-frame detection result of the first video frame is obtained according to the sub-single-frame detection results corresponding to the third upper probabilities. That is, when more than the third set proportion of the third probabilities are greater than the sixth set threshold, the single-frame detection result of the first video frame is calculated according to the sub-single-frame detection results respectively corresponding to these third upper probabilities; for example, the mean of these sub-single-frame detection results is taken as the single-frame detection result.
For example, when the sixth set threshold is 0.8 and the third set proportion is 0.7, if the proportion of third upper probabilities greater than 0.8 exceeds 0.7, the mean of the sub-single-frame confidence scores corresponding to the third upper probabilities is taken as the single-frame detection result.
When most of the sub-single-frame detection results are high, the few lower sub-single-frame detection results may result from misjudgment by the authenticity classification network of the corresponding structure. The above fusion manner can reduce the influence of such misjudgments on the video detection result.
In another example, in response to the proportion of third lower probabilities smaller than the seventh set threshold among the multiple third probabilities being greater than the fourth set proportion, the single-frame detection result of the first video frame is obtained according to the sub-single-frame detection results corresponding to the third lower probabilities. That is, when more than the fourth set proportion of the third probabilities are smaller than the seventh set threshold, the single-frame detection result of the first video frame is calculated according to the sub-single-frame detection results respectively corresponding to these third lower probabilities; for example, the mean of these sub-single-frame detection results is taken as the single-frame detection result. The sixth set threshold is greater than the seventh set threshold.
In the above examples, the third set proportion and the fourth set proportion may be the same or different, which is not limited by the embodiments of the present disclosure.
For example, when the seventh set threshold is 0.2 and the fourth set proportion is 0.7, if the proportion of third lower probabilities smaller than 0.2 exceeds 0.7, the mean of the sub-single-frame confidence scores corresponding to the third lower probabilities is taken as the single-frame detection result.
When most of the sub-single-frame detection results are low, the few higher sub-single-frame detection results may result from misjudgment by the authenticity classification network of the corresponding structure. The above fusion manner can reduce the influence of such misjudgments on the video detection result.
In some embodiments, the second authenticity classification network includes authenticity classification networks of multiple structures. Performing authenticity detection on the first video stream through the authenticity classification networks of multiple structures to obtain multiple sub-video-stream detection results is equivalent to obtaining sub-video-stream detection results of performing authenticity detection on the first video stream with multiple methods, and the video stream detection result of the first video stream can be obtained by fusing the multiple sub-video-stream detection results.
In some embodiments, the multiple sub-video-stream detection results corresponding to the first video stream may be fused in the following manner.
First, the multiple sub-video-stream detection results are respectively mapped to fourth probabilities through a fourth set function to obtain multiple fourth probabilities.
Similar to the above mapping of the average detection results of the groups corresponding to the first video frames to the first probabilities, each sub-video-stream detection result may be mapped through the Softmax function to a fourth probability in the interval [0, 1], to reflect the distribution of the sub-video-stream detection results obtained by the multiple authenticity classification methods.
Next, the video stream detection result of the first video stream is obtained according to the multiple sub-video-stream detection results and the fourth probabilities.
In one example, in response to the proportion of fourth upper probabilities greater than the eighth set threshold among the multiple fourth probabilities being greater than the fifth set proportion, the video stream detection result of the first video stream is obtained according to the sub-video-stream detection results corresponding to the fourth upper probabilities. That is, when more than the fifth set proportion of the fourth probabilities are greater than the eighth set threshold, the video stream detection result of the first video stream is calculated according to the sub-video-stream detection results respectively corresponding to these fourth upper probabilities; for example, the mean of these sub-video-stream detection results is taken as the video stream detection result of the first video stream.
For example, when the eighth set threshold is 0.8 and the fifth set proportion is 0.7, if the proportion of fourth upper probabilities greater than 0.8 exceeds 0.7, the mean of the sub-video-stream confidence scores corresponding to the fourth upper probabilities is taken as the video stream detection result of the first video stream.
When most of the sub-video-stream detection results are high, the few lower sub-video-stream detection results may result from misjudgment by the authenticity classification network of the corresponding structure. The above fusion manner can reduce the influence of such misjudgments on the video detection result.
In another example, in response to the proportion of fourth lower probabilities smaller than the ninth set threshold among the multiple fourth probabilities being greater than the sixth set proportion, the video stream detection result of the first video stream is obtained according to the sub-video-stream detection results corresponding to the fourth lower probabilities. That is, when more than the sixth set proportion of the fourth probabilities are smaller than the ninth set threshold, the video stream detection result of the first video stream is calculated according to the sub-video-stream detection results respectively corresponding to these fourth lower probabilities; for example, the mean of these sub-video-stream detection results is taken as the video stream detection result of the first video stream. The eighth set threshold is greater than the ninth set threshold.
In the above examples, the fifth set proportion and the sixth set proportion may be the same or different, which is not limited by the embodiments of the present disclosure.
For example, when the ninth set threshold is 0.2 and the sixth set proportion is 0.7, if the proportion of fourth lower probabilities smaller than 0.2 exceeds 0.7, the mean of the sub-video-stream confidence scores corresponding to the fourth lower probabilities is taken as the video stream detection result of the first video stream.
When most of the sub-video-stream detection results are low, the few higher sub-video-stream detection results may result from misjudgment by the authenticity classification network of the corresponding structure. The above fusion manner can reduce the influence of such misjudgments on the video detection result.
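Both the frame-level fusion over sub-single-frame results and the stream-level fusion over sub-video-stream results follow the same consensus pattern, sketched once below (the 0.8/0.2 thresholds and the 0.7 proportion come from the worked examples; the no-consensus fallback is an assumption of this sketch):

```python
import numpy as np


def fuse_sub_results(sub_results, upper_thr=0.8, lower_thr=0.2,
                     proportion=0.7):
    """Fuse the outputs of differently structured classifiers; usable
    for both per-frame and per-stream sub-results."""
    sub_results = np.asarray(sub_results, dtype=float)
    e = np.exp(sub_results - sub_results.max())
    probs = e / e.sum()              # Softmax over the sub-results
    upper = probs > upper_thr
    lower = probs < lower_thr
    if upper.mean() > proportion:    # most classifiers say fake
        return sub_results[upper].mean()
    if lower.mean() > proportion:    # most classifiers say real
        return sub_results[lower].mean()
    return sub_results.mean()        # no consensus among classifiers
```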
In the embodiments of the present disclosure, the specific values of the set thresholds and set proportions may be determined according to the accuracy requirements on the video detection result, and are not limited here.
In some embodiments, for the multiple first video frames in the to-be-processed video, fusion may be performed both over the multiple faces and over the sub-single-frame detection results obtained by multiple methods, and the final single-frame detection result is obtained by taking a weighted average of the results of the two kinds of fusion.
Moreover, the order in which the two kinds of fusion are performed is not limited.
FIG. 2 is a schematic diagram of a video detection method according to at least one embodiment of the present disclosure.
As shown in FIG. 2, multiple first video frames in the to-be-processed video, and a first video stream formed by the multiple first video frames, are first acquired.
In one aspect, the first video frames are processed to obtain the single-frame detection results of the first video frames. First, authenticity detection is performed separately on the multiple faces contained in a first video frame, and the single-person detection results corresponding to the faces are fused to obtain a face fusion result; afterwards, authenticity detection is performed on the first video frame through multiple methods, the sub-single-frame detection results corresponding to the methods are fused to obtain a method fusion result, and a weighted average of the face fusion result and the method fusion result is taken to obtain the single-frame detection result corresponding to the first video frame. The respective single-frame detection results of the multiple first video frames are then fused to obtain the fusion result corresponding to the multiple first video frames.
In another aspect, the first video stream is processed to obtain the video stream detection result of the first video stream. Authenticity detection may be performed on the first video stream through multiple methods, and the sub-video-stream detection results corresponding to the methods are fused to obtain the video stream detection result.
Finally, a weighted average of the fusion result corresponding to the multiple first video frames and the video stream detection result corresponding to the first video stream is taken, and the authenticity determination result of the to-be-processed video is determined according to the weighted average result.
In the embodiments of the present disclosure, the authenticity determination result of the to-be-processed video is obtained by combining multiple fusion manners. For videos in which genuine and forged video frames coexist, and videos in which genuine and forged faces coexist, effective authenticity detection can be performed and highly accurate video detection results can be obtained.
Since the application of face-swapping techniques in videos may cause problems at multiple levels, such as portrait rights and copyright, detecting whether a video is a face-swapped video is of great significance.
In some embodiments, the authenticity detection performed on the first video frames may be face-swap detection, and the obtained single-frame detection result indicates whether the face image in the first video frame is a face-swapped face image. For example, the higher the score contained in the detection result, the higher the confidence that the face image in the first video frame is a face-swapped face image. Likewise, the authenticity detection performed on the first video stream may also be face-swap detection, and the obtained video stream detection result indicates whether the face images in the first video stream are face-swapped face images. According to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream, a determination result of whether the to-be-processed video is a face-swapped video can be obtained.
FIG. 3 is a schematic diagram of a video detection apparatus according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes: a first acquisition unit 301 configured to acquire multiple first video frames in a to-be-processed video and a first video stream corresponding to the to-be-processed video; a second acquisition unit 302 configured to acquire a single-frame detection result of performing authenticity detection on each of the first video frames; a third acquisition unit 303 configured to acquire a video stream detection result of performing authenticity detection on the first video stream; and a determination unit 304 configured to determine an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
In some embodiments, the first acquisition unit is specifically configured to: perform frame extraction on the to-be-processed video at a set frame-number span to obtain the multiple first video frames, where the set frame-number span is positively correlated with the total number of video frames contained in the to-be-processed video.
In some embodiments, the second acquisition unit is specifically configured to: perform authenticity detection on each first video frame through a first authenticity classification network to obtain the single-frame detection result of each first video frame, where the single-frame detection result is used to represent the confidence that the first video frame is forged.
In some embodiments, the third acquisition unit is specifically configured to: perform authenticity detection on the first video stream through a second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships, to obtain the video stream detection result of the first video stream, where the video stream detection result is used to represent the confidence that the first video stream is forged.
In some embodiments, the determination unit is specifically configured to: fuse the respective single-frame detection results of the multiple first video frames to obtain a fusion result; and determine the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result.
In some embodiments, when fusing the respective single-frame detection results of the multiple first video frames to obtain the fusion result, the determination unit is specifically configured to: group the respective single-frame detection results of the multiple first video frames to obtain multiple result groups each including one or more single-frame detection results; obtain an average detection result of each result group; map the average detection result of each result group to a first probability through a first set function to obtain multiple first probabilities, where the first set function is a nonlinear mapping function; and obtain the fusion result according to the average detection results of the result groups and the multiple first probabilities.
In some embodiments, when obtaining the fusion result according to the average detection results of the result groups and the multiple first probabilities, the determination unit is specifically configured to: in response to the proportion of first upper probabilities greater than a first set threshold among the multiple first probabilities being greater than a first set proportion, obtain the fusion result according to the average detection results of the result groups corresponding to the first upper probabilities; and/or, in response to the proportion of first lower probabilities smaller than a second set threshold among the multiple first probabilities being greater than a second set proportion, obtain the fusion result according to the average detection results of the result groups corresponding to the first lower probabilities; the first set threshold is greater than the second set threshold.
In some embodiments, when determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result, the determination unit is specifically configured to: perform a weighted average on the fusion result and the video stream detection result, and determine the authenticity determination result of the to-be-processed video according to the obtained weighted average result.
In some embodiments, the first video frame includes multiple faces, and the second acquisition unit is specifically configured to: acquire face detection boxes corresponding to the multiple faces in the first video frame; determine a single-person detection result of the corresponding face according to the image region corresponding to the face detection box; map the single-person detection result of each face to a second probability through a second set function to obtain multiple second probabilities, where the second set function is a nonlinear mapping function; and obtain the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities.
In some embodiments, when obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities, the second acquisition unit is specifically configured to: in response to the existence, among the multiple second probabilities, of a second probability greater than a third set threshold, take the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; and/or, in response to all of the multiple second probabilities being greater than a fourth set threshold, take the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame; and/or, in response to all of the multiple second probabilities being smaller than a fifth set threshold, take the smallest single-person detection result in the first video frame as the single-frame detection result of the first video frame; where the third set threshold is greater than the fourth set threshold, and the fourth set threshold is greater than the fifth set threshold.
In some embodiments, the first authenticity classification network includes authenticity classification networks of multiple structures. When performing authenticity detection on the first video frame through the first authenticity classification network to obtain the single-frame detection result of the first video frame, the second acquisition unit is specifically configured to: perform authenticity detection on the first video frame through the authenticity classification networks of multiple structures to obtain multiple sub-single-frame detection results; map the multiple sub-single-frame detection results respectively to third probabilities through a third set function to obtain multiple third probabilities, where the third set function is a nonlinear mapping function; in response to the proportion of third upper probabilities greater than a sixth set threshold among the multiple third probabilities being greater than a third set proportion, obtain the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third upper probabilities; and/or, in response to the proportion of third lower probabilities smaller than a seventh set threshold among the multiple third probabilities being greater than a fourth set proportion, obtain the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third lower probabilities, where the sixth set threshold is greater than the seventh set threshold.
In some embodiments, the second authenticity classification network includes authenticity classification networks of multiple structures. When performing authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships to obtain the video stream detection result of the first video stream, the third acquisition unit is specifically configured to: perform authenticity detection on the first video stream through the authenticity classification networks of multiple structures according to the video frames contained in the first video stream and the inter-frame relationships, to obtain multiple sub-video-stream detection results; map the multiple sub-video-stream detection results respectively to fourth probabilities through a fourth set function to obtain multiple fourth probabilities, where the fourth set function is a nonlinear mapping function; in response to the proportion of fourth upper probabilities greater than an eighth set threshold among the multiple fourth probabilities being greater than a fifth set proportion, obtain the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth upper probabilities; and/or, in response to the proportion of fourth lower probabilities smaller than a ninth set threshold among the multiple fourth probabilities being greater than a sixth set proportion, obtain the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth lower probabilities, where the eighth set threshold is greater than the ninth set threshold.
In some embodiments, the single-frame detection result indicates whether the face image in the first video frame is a face-swapped image; the video stream detection result of the first video stream indicates whether the face images in the first video stream are face-swapped images; and the authenticity determination result of the to-be-processed video indicates whether the to-be-processed video is a face-swapped video.
FIG. 4 shows an electronic device provided by at least one embodiment of the present disclosure. The device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the video detection method described in any implementation manner of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; the program, when executed by a processor, implements the video detection method described in any implementation manner of the present disclosure.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the data processing device embodiment is substantially similar to the method embodiments, its description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., as one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random- or serial-access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and an apparatus can also be implemented as special-purpose logic circuitry.
Computers suitable for the execution of a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random-access memory. The essential components of a computer include a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what is claimed; they are mainly used to describe features of specific embodiments of particular inventions. Certain features described in multiple embodiments within this specification can also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment can also be implemented separately in multiple embodiments or in any suitable sub-combination. Moreover, although features may function in certain combinations as described above and may even be initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variant of a sub-combination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or sequentially, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall be included within the scope of protection of one or more embodiments of this specification.

Claims (17)

  1. A video detection method, comprising:
    acquiring multiple first video frames in a to-be-processed video, and a first video stream corresponding to the to-be-processed video;
    acquiring a single-frame detection result of performing authenticity detection on each of the first video frames;
    acquiring a video stream detection result of performing authenticity detection on the first video stream;
    determining an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
  2. The method according to claim 1, wherein the acquiring multiple first video frames in the to-be-processed video comprises:
    performing frame extraction on the to-be-processed video at a set frame-number span to obtain the multiple first video frames,
    wherein the set frame-number span is positively correlated with the total number of video frames contained in the to-be-processed video.
  3. The method according to claim 1 or 2, wherein the acquiring a single-frame detection result of performing authenticity detection on each of the first video frames comprises:
    performing authenticity detection on the first video frame through a first authenticity classification network to obtain the single-frame detection result of the first video frame,
    wherein the single-frame detection result is used to represent the confidence that the first video frame is forged.
  4. The method according to any one of claims 1 to 3, wherein the acquiring a video stream detection result of performing authenticity detection on the first video stream comprises:
    performing authenticity detection on the first video stream through a second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships, to obtain the video stream detection result of the first video stream,
    wherein the video stream detection result is used to represent the confidence that the first video stream is forged.
  5. The method according to any one of claims 1 to 4, wherein the determining the authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream comprises:
    fusing the respective single-frame detection results of the multiple first video frames to obtain a fusion result;
    determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result.
  6. The method according to claim 5, wherein the fusing the respective single-frame detection results of the multiple first video frames to obtain a fusion result comprises:
    grouping the respective single-frame detection results of the multiple first video frames to obtain multiple result groups each including one or more single-frame detection results;
    obtaining an average detection result of each of the result groups;
    mapping the average detection result of each of the result groups to a first probability through a first set function to obtain multiple first probabilities, wherein the first set function is a nonlinear mapping function;
    obtaining the fusion result according to the average detection results of the result groups and the multiple first probabilities.
  7. The method according to claim 6, wherein the obtaining the fusion result according to the average detection results of the result groups and the multiple first probabilities comprises at least one of the following:
    in response to the proportion of first upper probabilities greater than a first set threshold among the multiple first probabilities being greater than a first set proportion, obtaining the fusion result according to the average detection results of the result groups corresponding to the first upper probabilities;
    in response to the proportion of first lower probabilities smaller than a second set threshold among the multiple first probabilities being greater than a second set proportion, obtaining the fusion result according to the average detection results of the result groups corresponding to the first lower probabilities;
    wherein the first set threshold is greater than the second set threshold.
  8. The method according to any one of claims 5 to 7, wherein the determining the authenticity determination result of the to-be-processed video according to the fusion result and the video stream detection result comprises:
    performing a weighted average on the fusion result and the video stream detection result to obtain a weighted average result;
    determining the authenticity determination result of the to-be-processed video according to the obtained weighted average result.
  9. The method according to any one of claims 1 to 8, wherein the first video frame includes multiple faces, and the acquiring a single-frame detection result of performing authenticity detection on each of the first video frames comprises:
    acquiring face detection boxes corresponding to the multiple faces in the first video frame;
    determining a single-person detection result of the corresponding face according to the image region corresponding to each of the face detection boxes;
    mapping the single-person detection result of each of the faces to a second probability through a second set function to obtain multiple second probabilities, wherein the second set function is a nonlinear mapping function;
    obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities.
  10. The method according to claim 9, wherein the obtaining the single-frame detection result of the first video frame according to the single-person detection results of the faces and the multiple second probabilities comprises at least one of the following:
    in response to the existence, among the multiple second probabilities, of a second probability greater than a third set threshold, taking the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame;
    in response to all of the multiple second probabilities being greater than a fourth set threshold, taking the largest single-person detection result in the first video frame as the single-frame detection result of the first video frame;
    in response to all of the multiple second probabilities being smaller than a fifth set threshold, taking the smallest single-person detection result in the first video frame as the single-frame detection result of the first video frame;
    wherein the third set threshold is greater than the fourth set threshold, and the fourth set threshold is greater than the fifth set threshold.
  11. The method according to claim 3, wherein the first authenticity classification network includes authenticity classification networks of multiple structures, and the performing authenticity detection on the first video frame through the first authenticity classification network to obtain the single-frame detection result of the first video frame comprises:
    performing authenticity detection on the first video frame through the authenticity classification networks of multiple structures to obtain multiple sub-single-frame detection results;
    mapping the multiple sub-single-frame detection results respectively to third probabilities through a third set function to obtain multiple third probabilities, wherein the third set function is a nonlinear mapping function;
    determining the single-frame detection result of the first video frame through at least one of the following:
    in response to the proportion of third upper probabilities greater than a sixth set threshold among the multiple third probabilities being greater than a third set proportion, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third upper probabilities;
    in response to the proportion of third lower probabilities smaller than a seventh set threshold among the multiple third probabilities being greater than a fourth set proportion, obtaining the single-frame detection result of the first video frame according to the sub-single-frame detection results corresponding to the third lower probabilities, wherein the sixth set threshold is greater than the seventh set threshold.
  12. The method according to claim 4, wherein the second authenticity classification network includes authenticity classification networks of multiple structures, and the performing authenticity detection on the first video stream through the second authenticity classification network according to the video frames contained in the first video stream and the inter-frame relationships to obtain the video stream detection result of the first video stream comprises:
    performing authenticity detection on the first video stream through the authenticity classification networks of multiple structures according to the video frames contained in the first video stream and the inter-frame relationships, to obtain multiple sub-video-stream detection results;
    mapping the multiple sub-video-stream detection results respectively to fourth probabilities through a fourth set function to obtain multiple fourth probabilities, wherein the fourth set function is a nonlinear mapping function;
    determining the video stream detection result of the first video stream through at least one of the following:
    in response to the proportion of fourth upper probabilities greater than an eighth set threshold among the multiple fourth probabilities being greater than a fifth set proportion, obtaining the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth upper probabilities;
    in response to the proportion of fourth lower probabilities smaller than a ninth set threshold among the multiple fourth probabilities being greater than a sixth set proportion, obtaining the video stream detection result of the first video stream according to the sub-video-stream detection results corresponding to the fourth lower probabilities, wherein the eighth set threshold is greater than the ninth set threshold.
  13. The method according to any one of claims 1 to 12, wherein the single-frame detection result of the first video frame indicates whether the face image in the first video frame is a face-swapped image; the video stream detection result of the first video stream indicates whether the face images in the first video stream are face-swapped images; and the authenticity determination result of the to-be-processed video indicates whether the to-be-processed video is a face-swapped video.
  14. A video detection apparatus, comprising:
    a first acquisition unit configured to acquire multiple first video frames in a to-be-processed video, and a first video stream corresponding to the to-be-processed video;
    a second acquisition unit configured to acquire a single-frame detection result of performing authenticity detection on each of the first video frames;
    a third acquisition unit configured to acquire a video stream detection result of performing authenticity detection on the first video stream;
    a determination unit configured to determine an authenticity determination result of the to-be-processed video according to the respective single-frame detection results of the multiple first video frames and the video stream detection result of the first video stream.
  15. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 13 when executing the computer instructions.
  16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 13.
  17. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 13.
PCT/CN2021/103766 2020-11-27 2021-06-30 Video detection method, apparatus, device, and computer-readable storage medium WO2022110806A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022531515A JP2023507898A (ja) 2020-11-27 2021-06-30 Video detection method, apparatus, device, and computer-readable storage medium
KR1020227018065A KR20220093157A (ko) 2020-11-27 2021-06-30 Video detection method, apparatus, device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011365074.7A CN112329730B (zh) 2020-11-27 2020-11-27 Video detection method, apparatus, device, and computer-readable storage medium
CN202011365074.7 2020-11-27

Publications (1)

Publication Number Publication Date
WO2022110806A1 true WO2022110806A1 (zh) 2022-06-02

Family

ID=74309312

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103766 WO2022110806A1 (zh) 2020-11-27 2021-06-30 视频检测方法、装置、设备及计算机可读存储介质

Country Status (4)

Country Link
JP (1) JP2023507898A (zh)
KR (1) KR20220093157A (zh)
CN (1) CN112329730B (zh)
WO (1) WO2022110806A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366198A (zh) * 2024-04-23 2024-07-19 天翼爱音乐文化科技有限公司 Tracking-based face swapping method, system, device, and medium for multi-person scenes

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329730B (zh) 2020-11-27 2024-06-11 上海商汤智能科技有限公司 Video detection method, apparatus, device, and computer-readable storage medium
CN113792701B (zh) * 2021-09-24 2024-08-13 北京市商汤科技开发有限公司 Liveness detection method and apparatus, computer device, and storage medium
CN114359811A (zh) 2022-01-11 2022-04-15 北京百度网讯科技有限公司 Data forgery identification method and apparatus, electronic device, and storage medium
CN115412726B (zh) 2022-09-02 2024-03-01 北京瑞莱智慧科技有限公司 Video authenticity detection method and apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150208025A1 (en) * 2014-01-21 2015-07-23 Huawei Technologies Co., Ltd. Video Processing Method and Apparatus
CN111444881A (zh) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 Forged face video detection method and apparatus
CN111444873A (zh) * 2020-04-02 2020-07-24 北京迈格威科技有限公司 Method and apparatus for detecting authenticity of persons in videos, electronic device, and storage medium
CN111967427A (zh) * 2020-08-28 2020-11-20 广东工业大学 Forged face video identification method, system, and readable storage medium
CN112329730A (zh) * 2020-11-27 2021-02-05 上海商汤智能科技有限公司 Video detection method, apparatus, device, and computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299650B (zh) * 2018-07-27 2021-09-07 东南大学 Video-based nonlinear online expression pre-detection method and apparatus
US10810725B1 (en) * 2018-12-07 2020-10-20 Facebook, Inc. Automated detection of tampered images
CN110059542A (zh) * 2019-03-04 2019-07-26 平安科技(深圳)有限公司 Face liveness detection method based on improved ResNet and related device
CN113646806A (zh) * 2019-03-22 2021-11-12 日本电气株式会社 Image processing device, image processing method, and recording medium storing a program
CN111783632B (zh) * 2020-06-29 2022-06-10 北京字节跳动网络技术有限公司 Face detection method and apparatus for video streams, electronic device, and storage medium


Also Published As

Publication number Publication date
JP2023507898A (ja) 2023-02-28
CN112329730A (zh) 2021-02-05
CN112329730B (zh) 2024-06-11
KR20220093157A (ko) 2022-07-05

Similar Documents

Publication Publication Date Title
WO2022110806A1 (zh) Video detection method, apparatus, device, and computer-readable storage medium
US20230041233A1 (en) Image recognition method and apparatus, computing device, and computer-readable storage medium
CN107871130B (zh) 图像处理
CN106415594B (zh) 用于面部验证的方法和系统
WO2018121157A1 (zh) 一种网络流量异常检测方法及装置
US9485204B2 (en) Reducing photo-tagging spam
CN110853033B (zh) 基于帧间相似度的视频检测方法和装置
WO2022160591A1 (zh) Crowd behavior detection method and apparatus, electronic device, storage medium, and computer program product
WO2021130546A1 (en) Target object identification system, method and apparatus, electronic device and storage medium
CN111160555B (zh) 基于神经网络的处理方法、装置及电子设备
Kharrazi et al. Improving steganalysis by fusion techniques: A case study with image steganography
CN112468487B (zh) 实现模型训练的方法、装置、实现节点检测的方法及装置
CN111968625A (zh) 融合文本信息的敏感音频识别模型训练方法及识别方法
TW201944291A (zh) 人臉辨識方法
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
US20220398400A1 (en) Methods and apparatuses for determining object classification
US11295457B2 (en) Tracking apparatus and computer readable medium
CN113095257A (zh) 异常行为检测方法、装置、设备及存储介质
US20230283622A1 (en) Anomaly detection method, anomaly detection device, and recording medium
US8737696B2 (en) Human face recognition method and apparatus
WO2023185693A1 (zh) Image processing method, and related apparatus and system
WO2023019970A1 (zh) Attack detection method and apparatus
CN114513473B (zh) Traffic class detection method, apparatus, and device
WO2023273227A1 (zh) Nail recognition method and apparatus, device, and storage medium
CN103426171B (zh) Method and apparatus for matching corresponding fingertip points in a binocular stereo vision system

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022531515

Country of ref document: JP

Kind code of ref document: A

Ref document number: 20227018065

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896314

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896314

Country of ref document: EP

Kind code of ref document: A1