CN114419693A - Method and device for face deep forgery detection - Google Patents

Method and device for face deep forgery detection

Info

Publication number
CN114419693A
CN114419693A (application CN202111569608.2A)
Authority
CN
China
Prior art keywords
face
feature information
forgery detection
deep forgery
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111569608.2A
Other languages
Chinese (zh)
Inventor
梁涛
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202111569608.2A
Publication of CN114419693A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The application provides a method and a device for face deep forgery detection, wherein the method comprises the following steps: extracting multiple layers of first feature information from an input image; obtaining multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information; performing semantic fusion on each layer of second feature information to obtain multiple layers of third feature information; and inputting each layer of third feature information into a multi-task deep forgery detection network to obtain a deep forgery detection result output by the network, wherein the tasks cooperatively executed in the multi-task deep forgery detection network comprise face classification, face localization and fake face detection. In the scheme of the application, multiple tasks such as face classification, face localization and fake face detection are completed synchronously in a single stage, which greatly improves both the speed and the accuracy of face deep forgery detection.

Description

Method and device for face deep forgery detection
Technical Field
The application relates to the field of computer technology, and in particular to a one-stage, multi-task cooperative technical scheme for face deep forgery detection.
Background
With the progress of machine learning and computer vision technology, deep forgery (deepfake) technology has also developed rapidly. Deep forgery refers to creating or synthesizing visual and audio content (such as images, audio, video and text) by intelligent methods such as deep learning. The abuse of deeply forged data brings numerous safety and privacy hazards, so detection tasks aimed at deeply forged data are receiving more and more attention. Face Forgery refers specifically to face-oriented tampering techniques within deep forgery, and face deep forgery detection is the task of judging whether a face contained in a given picture was generated by forgery.
In the prior art, face deep forgery detection generally adopts a two-stage process. In the first stage, an existing face detection tool detects the face frame and crops out the face region; in the second stage, a face deep forgery analysis tool judges the authenticity of the face obtained in the previous step.
Disclosure of Invention
The application aims to provide a technical scheme for face deep forgery detection.
According to an embodiment of the present application, a method for face deep forgery detection is provided, wherein the method comprises:
extracting multiple layers of first feature information from an input image;
obtaining multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information;
and performing semantic fusion on each layer of second feature information in the multiple layers of second feature information respectively to obtain multiple layers of third feature information, and inputting each layer of third feature information into a multi-task deep forgery detection network respectively to obtain a deep forgery detection result output by the multi-task deep forgery detection network, wherein the tasks cooperatively executed in the multi-task deep forgery detection network comprise face classification, face localization and fake face detection.
According to another embodiment of the present application, an apparatus for face deep forgery detection is provided, wherein the apparatus includes:
a module for extracting multiple layers of first feature information from an input image;
a module for obtaining multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information;
and a module for performing semantic fusion on each layer of second feature information in the multiple layers of second feature information respectively to obtain multiple layers of third feature information, and inputting each layer of third feature information into the multi-task deep forgery detection network respectively to obtain a deep forgery detection result output by the multi-task deep forgery detection network, wherein the tasks cooperatively executed in the multi-task deep forgery detection network comprise face classification, face localization and fake face detection.
According to another embodiment of the present application, a computer apparatus is provided, wherein the computer apparatus includes: a memory for storing one or more programs; and one or more processors coupled with the memory, where the one or more programs, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
extracting multiple layers of first feature information from an input image;
obtaining multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information;
and performing semantic fusion on each layer of second feature information in the multiple layers of second feature information respectively to obtain multiple layers of third feature information, and inputting each layer of third feature information into a multi-task deep forgery detection network respectively to obtain a deep forgery detection result output by the multi-task deep forgery detection network, wherein the tasks cooperatively executed in the multi-task deep forgery detection network comprise face classification, face localization and fake face detection.
According to another embodiment of the present application, a computer-readable storage medium is also provided, on which a computer program is stored, the computer program being executable by a processor to:
extract multiple layers of first feature information from an input image;
obtain multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information;
and perform semantic fusion on each layer of second feature information in the multiple layers of second feature information respectively to obtain multiple layers of third feature information, and input each layer of third feature information into a multi-task deep forgery detection network respectively to obtain a deep forgery detection result output by the multi-task deep forgery detection network, wherein the tasks cooperatively executed in the multi-task deep forgery detection network comprise face classification, face localization and fake face detection.
Compared with the prior art, the application has the following advantages: multiple tasks such as face classification, face localization and fake face detection are completed synchronously in a single stage, which greatly improves the speed of face deep forgery detection; the limit that the first-stage face localization places on the second-stage forgery detection in the two-stage strategy is removed, which improves detection accuracy; and through the one-stage multi-task deep forgery detection network, multi-task collaborative learning improves the discriminability of the identification features. Moreover, the training data of the deep forgery detection task can generate labeling data for tasks such as classification, segmentation and object detection at nearly zero cost, providing richer supervision for the detection model to extract more robust and discriminative features.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 shows a schematic flowchart of a method for face deep forgery detection according to an embodiment of the present application;
FIG. 2 illustrates a design flow framework for face deep forgery detection according to an example of the present application;
FIG. 3 illustrates a schematic diagram of a semantic fusion module of one example of the present application;
fig. 4 is a schematic structural diagram of an apparatus for face deep forgery detection according to an embodiment of the present application;
FIG. 5 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "device" in this context refers to an intelligent electronic device that can perform predetermined processes such as numerical calculations and/or logic calculations by executing predetermined programs or instructions, and may include a processor and a memory, wherein the predetermined processes are performed by the processor executing program instructions prestored in the memory, or performed by hardware such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or performed by a combination of the above two.
The technical scheme of the application is mainly realized by computer devices, which comprise network devices and user devices. Network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud of numerous computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a collection of loosely coupled computers. User devices include, but are not limited to, PCs, tablets, smartphones, IPTVs, PDAs, wearable devices, and the like. A computer device can operate independently to realize the application, or can access a network and realize the application through interaction with other computer devices in the network. The network in which the computer device is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, wireless ad hoc networks, and the like.
It should be noted that the above-mentioned computer devices are only examples, and other computer devices that are currently available or that may come into existence in the future, such as may be applicable to the present application, are also included within the scope of the present application and are incorporated herein by reference.
The methodologies discussed hereinafter, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The applicant finds that the existing face deep forgery detection process has the following defects: 1. two-stage inference is slow and does not meet the need for fast judgment in deep forgery detection scenarios; 2. the result of the second-stage analysis is affected by the first-stage face localization, which limits the upper bound of the model; that is, inaccurate face localization causes deviations in the second-stage judgment; 3. a face deep forgery detection data set can generate labeling data for tasks such as classification, segmentation and object detection at almost zero cost, but the two-stage strategy cannot simultaneously exploit the features produced by these three tasks. In view of the above shortcomings, the present application provides a one-stage, multi-task cooperative technical scheme for face deep forgery detection, which completes face forgery detection and localization through multi-task cooperation in only one stage.
The embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a method for face deep forgery detection according to an embodiment of the present application. The method according to the present embodiment includes step S11, step S12 and step S13. In step S11, the computer device extracts multiple layers of first feature information from the input image; in step S12, the computer device performs feature fusion on the multiple layers of first feature information to obtain multiple layers of second feature information; in step S13, the computer device performs semantic fusion on each layer of second feature information to obtain multiple layers of third feature information, and inputs each layer of third feature information into the multi-task deep forgery detection network to obtain the deep forgery detection result it outputs, where the tasks cooperatively executed in the multi-task deep forgery detection network include face classification, face localization and fake face detection.
In step S11, the computer device extracts multiple layers of first feature information from the input image. In some embodiments, the multi-layer first feature information includes, but is not limited to, any shallow features, such as color or texture. In some embodiments, multiple layers of first feature information are extracted from the input image through a feature extraction network (e.g., the base network ResNet50, EfficientNet, Xception, etc.); for example, an RGB image (i.e., the input image) of size W × H × 3 is input, and shallow first feature information such as color and texture is extracted through the base network ResNet50.
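As an illustrative sketch only (the application does not prescribe a specific implementation), extracting the multi-level first feature information with a ResNet50 backbone might look as follows in PyTorch; the C3-C5 naming follows the FIG. 2 example, while the use of torchvision and the exact stage slicing are assumptions:

```python
import torch
from torchvision.models import resnet50

class Backbone(torch.nn.Module):
    """Extracts multi-level first feature information (C3, C4, C5) from an RGB image."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # pretrained weights optional
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2, self.layer3, self.layer4 = (
            net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, x):          # x: (N, 3, H, W)
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)       # stride 8,  512 channels
        c4 = self.layer3(c3)       # stride 16, 1024 channels
        c5 = self.layer4(c4)       # stride 32, 2048 channels
        return {"c3": c3, "c4": c4, "c5": c5}
```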
In step S12, the computer device obtains multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information. In some embodiments, feature fusion is performed on the shallow features of each level extracted from the input image to obtain the fused multi-layer second feature information. In some embodiments, feature fusion is performed through an FPN (Feature Pyramid Network) layer. In some embodiments, step S12 further includes: performing feature fusion on the multi-layer first feature information to obtain fused multi-layer second feature information; and downsampling the highest-level second feature information among the fused layers to obtain second feature information of an even higher level. As an example, an RGB image of size W × H × 3 is input; shallow first feature information such as color and texture is extracted through the base network ResNet50; feature fusion is then performed on the extracted layers to obtain the fused multi-layer second feature information; and, to improve detection of large target faces, the highest-level second feature information is further downsampled to obtain a higher-level representation.
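Continuing the sketch under the same assumptions, the feature fusion and the extra higher level could be realized with torchvision's FeaturePyramidNetwork plus a stride-2 convolution; the patent only states that the top level is further sampled, so the concrete downsampling operator is an assumption:

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

class FusionNeck(torch.nn.Module):
    """Feature fusion: C3-C5 -> P3-P5 (256 channels each), plus a higher level P6."""
    def __init__(self):
        super().__init__()
        self.fpn = FeaturePyramidNetwork([512, 1024, 2048], out_channels=256)
        self.p6 = torch.nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

    def forward(self, feats):        # feats: dict with keys c3, c4, c5
        p = self.fpn(OrderedDict(feats))
        p["p6"] = self.p6(p["c5"])   # downsample the fused top level for large faces
        return p
```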
In step S13, the computer device performs semantic fusion on each layer of second feature information to obtain multiple layers of third feature information, and inputs each layer of third feature information into the multi-task deep forgery detection network to obtain the deep forgery detection result it outputs, where the tasks cooperatively executed in the network include face classification, face localization and fake face detection. Based on the multi-task deep forgery detection network, several tasks can be executed cooperatively in one stage; different tasks are sensitive to different features, and these features reinforce one another, so the tasks promote each other and jointly improve the speed, precision and recall of face deep forgery detection.
In some embodiments, for the multiple layers of second feature information, semantic fusion is performed on each layer within the feature pyramid through a semantic fusion module to obtain the third feature information; enlarging the feature receptive field in this way enhances the model's ability to represent rigid objects. In some embodiments, the channel transformations in the semantic fusion module are performed using 3 × 3 deformable convolutions with stride 1 to enhance the model's ability to represent non-rigid objects (facial muscle movements during expressions, etc.).
In some embodiments, the tasks cooperatively executed in the multi-task deep forgery detection network further include forged region localization, so that the 4 tasks of face classification, face localization, fake face detection and forged region localization (segmentation) can be completed in one step. In some embodiments, each task in the multi-task deep forgery detection network independently adopts a 4-layer fully convolutional network to obtain its target output.
Fig. 2 shows a design flow framework for face deep forgery detection according to an example of the present application. The specific flow is as follows: an RGB image (i.e., the input image) of size W × H × 3 is input; features C3, C4 and C5 are extracted from the image through the base network ResNet50, and then further fused by an FPN layer to obtain P3, P4 and P5. To improve detection of large target faces, P5 (i.e., the fused highest-level features) is further downsampled to obtain a higher semantic representation P6, where the channel dimensions of P3, P4, P5 and P6 are all 256. A semantic fusion module based on deformable convolution is then applied to strengthen each of the four feature maps: fusing multi-scale features enlarges the receptive field and enhances the model's ability to represent rigid objects, while the deformable convolutions enhance its ability to represent non-rigid objects. The four feature maps of different sizes produced by the semantic fusion modules are each fed into a multi-task detection head (i.e., the multi-task deep forgery detection network) to cooperatively execute the 4 tasks of face classification, fake face detection, face localization and forged region localization. Each detection head comprises four 3 × 3 convolutions with stride 1 and unchanged channels, followed by a 1 × 1 convolution with stride 1; each task independently adopts this 4-layer fully convolutional network to obtain its target output. For the face classification task, the fully convolutional target output is w' × h' × 1; for the fake face detection task, it is w' × h' × 1; for the face localization task, it is w' × h' × 4; for the forged region localization task, it is w' × h' × 1. It should be noted that the number of feature layers, the number of tasks, etc. in the above example are only illustrative and do not limit the application; those skilled in the art can adjust the parameters of the framework based on actual requirements.
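The multi-task detection head just described can be sketched as follows; the tower depth (four 3 × 3 stride-1 convolutions with unchanged channels, plus a final 1 × 1 convolution) and the per-task output channels follow the text, while the remaining details (activations, module names) are illustrative assumptions:

```python
import torch

def _tower(channels=256, out_channels=1):
    """Four 3x3 stride-1 convs (channels unchanged), then a 1x1 projection."""
    layers = []
    for _ in range(4):
        layers += [torch.nn.Conv2d(channels, channels, 3, stride=1, padding=1),
                   torch.nn.ReLU(inplace=True)]
    layers.append(torch.nn.Conv2d(channels, out_channels, 1, stride=1))
    return torch.nn.Sequential(*layers)

class MultiTaskHead(torch.nn.Module):
    """One head applied per pyramid level; four cooperating task branches."""
    def __init__(self, channels=256):
        super().__init__()
        self.face_cls    = _tower(channels, 1)  # w' x h' x 1: face vs. background
        self.forgery     = _tower(channels, 1)  # w' x h' x 1: real vs. fake face
        self.face_box    = _tower(channels, 4)  # w' x h' x 4: box regression
        self.fake_region = _tower(channels, 1)  # w' x h' x 1: forged-region map

    def forward(self, feat):
        return {"cls": self.face_cls(feat), "forg": self.forgery(feat),
                "box": self.face_box(feat), "region": self.fake_region(feat)}
```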
Fig. 3 shows a schematic diagram of a semantic fusion module according to an example of the present application, in which all channel transformations are performed by 3 × 3 deformable convolutions with stride 1. The channel dimension of each layer of second feature information is 256; after input to the semantic fusion module, the channels are transformed from 256 to 128 (i.e., a 3 × 3 deformable convolution with stride 1, input channels 256, output channels 128), then from 128 to 64, and then from 64 to 64; finally, the outputs of the three channel transformations are concatenated (concat) to produce an output with channel dimension 256 (128 + 64 + 64).
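A sketch of this semantic fusion module under the stated channel plan (256 to 128 to 64 to 64, concatenated back to 256): torchvision's DeformConv2d requires an explicit offset branch that the text does not specify, so the plain 3 × 3 offset convolutions below are assumptions:

```python
import torch
from torchvision.ops import DeformConv2d

class DeformBlock(torch.nn.Module):
    """3x3 deformable conv, stride 1; offsets predicted by a plain 3x3 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.offset = torch.nn.Conv2d(c_in, 2 * 3 * 3, 3, padding=1)  # (dx, dy) per tap
        self.dconv = DeformConv2d(c_in, c_out, 3, stride=1, padding=1)

    def forward(self, x):
        return self.dconv(x, self.offset(x))

class SemanticFusion(torch.nn.Module):
    """256 -> 128 -> 64 -> 64; concatenating the three outputs restores 256 channels."""
    def __init__(self):
        super().__init__()
        self.t1 = DeformBlock(256, 128)
        self.t2 = DeformBlock(128, 64)
        self.t3 = DeformBlock(64, 64)

    def forward(self, x):
        y1 = self.t1(x)
        y2 = self.t2(y1)
        y3 = self.t3(y2)
        return torch.cat([y1, y2, y3], dim=1)  # 128 + 64 + 64 = 256
```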
In some embodiments, after model training is complete, a target image to be detected is input into the trained model, which performs steps S11-S13 to predict and output the deep forgery detection result corresponding to the target image.
In some embodiments, steps S11-S13 are operations in the model training process, during which, for each sample image in the training sample set, the model predicts and outputs the deep forgery detection result corresponding to that sample image by performing the above steps S11-S13.
In some embodiments, the method further comprises: according to the deep forgery detection result, adopting a label assignment strategy to give positive and negative sample labels to the anchor points or anchor boxes in the deep forgery detection result, and calculating a loss function according to the labels, wherein face classification adopts binary cross-entropy loss, the cross-entropy loss of fake face detection is calculated only on samples classified as faces (positive samples), and the face localization task adopts intersection-over-union (IoU) loss. In some embodiments, for a face box on the input image (a sparse ground-truth box, i.e., the label), many anchor points or anchor boxes are placed during model design to cover the faces present in the picture as fully as possible (only some of these anchors are positive samples; many others are negative), and the label assignment strategy determines, based on the ground-truth box of the picture label, which anchor points or anchor boxes are positive samples. In some embodiments, during model training, the deep forgery detection result output by the model contains the prediction of each task; the prediction for the face localization task contains many anchor points or anchor boxes that carry no ground-truth labels, and according to the deep forgery detection result and the sample labels, a label can be assigned to each anchor so that the loss function can be computed to measure the distance between the model's prediction and the true distribution. In some embodiments, face authenticity classification is built on top of face classification, and forged region localization on top of face localization, so the label assignment strategy targets only the face classification and face localization tasks; specifically, this scheme selects positive and negative samples using the dynamic sample assignment strategy proposed by ATSS (Adaptive Training Sample Selection). The ATSS dynamic label assignment process is: for the feature map corresponding to each layer of second feature information, the top-k anchor points (anchor boxes and anchor points correspond one-to-one) whose centers have the smallest Euclidean distance to the center of the ground-truth box are taken as the candidate positive sample set; the IoU between each candidate anchor box and the ground-truth box is computed; and a sample whose IoU exceeds the threshold of the candidate set (the mean plus the standard deviation of the candidate IoUs; since the candidate set is dynamic, this threshold is also dynamic) is taken as a positive sample.
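The ATSS-style assignment described above can be sketched per feature level as follows; the tensor layouts, the xyxy box format and the value of k are illustrative assumptions:

```python
import torch
from torchvision.ops import box_iou

def atss_assign(anchors, gt_box, k=9):
    """anchors: (A, 4) xyxy boxes for one level; gt_box: (4,) xyxy ground truth.
    Returns a boolean mask marking the anchors assigned as positive samples."""
    centers = (anchors[:, :2] + anchors[:, 2:]) / 2           # (A, 2) anchor centers
    gt_center = (gt_box[:2] + gt_box[2:]) / 2
    dist = (centers - gt_center).norm(dim=1)                  # Euclidean center distance
    cand = dist.topk(min(k, len(anchors)), largest=False).indices  # top-k nearest anchors
    ious = box_iou(anchors[cand], gt_box[None])[:, 0]         # IoU with the GT box
    thresh = ious.mean() + ious.std()                         # dynamic threshold
    pos = torch.zeros(len(anchors), dtype=torch.bool)
    pos[cand[ious > thresh]] = True
    return pos
```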
In some embodiments, said calculating a loss function according to the labels comprises: first weighting the positive samples according to their Euclidean distance from the center point to obtain weighted labels, and then calculating the loss function according to the labels corresponding to the input image. In some embodiments, within the positive sample set obtained by ATSS assignment (the negative set is unchanged), the distances from the samples to the center of the face ground-truth box differ, and the prior assumption is that samples closer to the center have better features; the positive samples can therefore be further weighted by their Euclidean distance to the center (the smaller the distance, the larger the weight), and this weighting improves the model's recall.
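The center-distance weighting could then be sketched as, for example, an inverse-distance weight over the positive anchors; the text does not specify the exact weighting function, so this form is an assumption (dist and the positive mask are the tensors from the assignment sketch above):

```python
def center_weights(dist, pos_mask, eps=1e-6):
    """Smaller center distance -> larger weight; normalized over the positive samples."""
    w = 1.0 / (dist[pos_mask] + eps)
    return w / w.sum()
```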
In some embodiments, the tasks cooperatively executed in the multi-task deep forgery detection network include face classification, face localization and fake face detection, where face classification adopts binary cross-entropy loss, the cross-entropy loss of fake face detection is calculated only on locations that the model classifies as faces (positive samples; locations classified as background are not judged for authenticity), and the face localization task adopts IoU loss. In some embodiments, the cooperatively executed tasks further include forged region localization, which is a fine-grained face authenticity classification problem and likewise adopts binary cross-entropy loss (calculated on the basis of face localization; that is, forged region localization computes loss only inside the predicted face box, with no loss in other background regions).
In some embodiments, the method further comprises: weighting the loss function corresponding to each task according to a preset ratio to obtain the multi-task collaborative loss function. In some embodiments, the multi-task deep forgery detection network cooperatively executes the 4 tasks of face classification, face localization, fake face detection and forged region localization, and the loss function is divided into 4 parts: face classification adopts binary cross-entropy loss; the fake face detection cross-entropy loss is calculated only on samples classified as faces; face localization adopts IoU loss; and forged region localization adopts binary cross-entropy loss. Finally, the 4 loss terms are weighted in a fixed ratio and used to guide the learning process of the model. In some embodiments, model parameters are updated and optimized using backpropagation with a gradient descent strategy.
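A sketch of the combined objective under the conventions above: binary cross-entropy for face classification and forged region localization, fake face detection loss restricted to positive face locations, and IoU loss for localization, weighted by a preset ratio. The weight values, tensor layouts, and the use of the positive mask as a proxy for "inside the face box" are assumptions:

```python
import torch.nn.functional as F
from torchvision.ops import box_iou

def multitask_loss(pred, target, pos, w=(1.0, 1.0, 1.0, 1.0)):
    """pred/target: dicts with 'cls', 'forg', 'region' (logits) and 'box' (decoded
    xyxy boxes); pos: boolean mask of positive (face) samples."""
    l_cls = F.binary_cross_entropy_with_logits(pred["cls"], target["cls"])
    # forgery loss only where the location is a positive face sample
    l_forg = F.binary_cross_entropy_with_logits(pred["forg"][pos], target["forg"][pos])
    iou = box_iou(pred["box"][pos], target["box"][pos]).diagonal()
    l_box = (1.0 - iou).mean()                      # IoU loss on positive boxes
    # forged-region loss restricted to face locations as a proxy for "inside the face box"
    l_region = F.binary_cross_entropy_with_logits(pred["region"][pos], target["region"][pos])
    return w[0]*l_cls + w[1]*l_forg + w[2]*l_box + w[3]*l_region
```

The resulting scalar would then drive the gradient-descent update mentioned above (loss.backward() followed by an optimizer step).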
Fig. 4 shows a schematic structural diagram of an apparatus for face deep forgery detection according to an embodiment of the present application. The apparatus for face deep forgery detection (hereinafter the "face deep forgery detection apparatus 1") includes: a module for extracting multiple layers of first feature information from an input image (hereinafter "module 11"), a module for obtaining multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information (hereinafter "module 12"), and a module for performing semantic fusion on each layer of second feature information to obtain multiple layers of third feature information and inputting each layer of third feature information into the multi-task deep forgery detection network to obtain the deep forgery detection result it outputs (hereinafter "module 13"), where the tasks cooperatively executed in the multi-task deep forgery detection network include face classification, face localization and fake face detection.
Module 11 extracts multiple layers of first feature information from the input image. In some embodiments, the multi-layer first feature information includes, but is not limited to, any shallow features, such as color or texture. In some embodiments, multiple layers of first feature information are extracted from the input image through a feature extraction network (e.g., the base network ResNet50, EfficientNet, Xception, etc.); for example, an RGB image (i.e., the input image) of size W × H × 3 is input, and shallow first feature information such as color and texture is extracted through the base network ResNet50.
Module 12 obtains multiple layers of second feature information by performing feature fusion on the multiple layers of first feature information. In some embodiments, feature fusion is performed on the shallow features of each level extracted from the input image to obtain the fused multi-layer second feature information. In some embodiments, feature fusion is performed through an FPN (Feature Pyramid Network) layer. In some embodiments, module 12 is further configured to: perform feature fusion on the multi-layer first feature information to obtain fused multi-layer second feature information; and downsample the highest-level second feature information among the fused layers to obtain second feature information of an even higher level. As an example, an RGB image of size W × H × 3 is input; shallow first feature information such as color and texture is extracted through the base network ResNet50; feature fusion is then performed on the extracted layers; and, to improve detection of large target faces, the highest-level second feature information is further downsampled to obtain a higher-level representation.
Module 13 performs semantic fusion on each layer of second feature information to obtain multiple layers of third feature information, and inputs each layer of third feature information into the multi-task deep forgery detection network to obtain the deep forgery detection result it outputs, where the tasks cooperatively executed in the network include face classification, face localization and fake face detection. Based on the multi-task deep forgery detection network, several tasks can be executed cooperatively in one stage; different tasks are sensitive to different features, and these features reinforce one another, so the tasks promote each other and jointly improve the speed, precision and recall of face deep forgery detection.
In some embodiments, for the multiple layers of second feature information, semantic fusion is performed on each layer within the feature pyramid through a semantic fusion module to obtain the third feature information; enlarging the feature receptive field in this way enhances the model's ability to represent rigid objects. In some embodiments, the channel transformations in the semantic fusion module are performed using 3 × 3 deformable convolutions with stride 1 to enhance the model's ability to represent non-rigid objects (facial muscle movements during expressions, etc.).
In some embodiments, the tasks cooperatively executed in the multi-task deep forgery detection network further include forged region localization, so that the 4 tasks of face classification, face localization, fake face detection and forged region localization (segmentation) can be completed in one step. In some embodiments, each task in the multi-task deep forgery detection network independently adopts a 4-layer fully convolutional network to obtain its target output.
Fig. 2 shows a design flow framework for face deep forgery detection according to an example of the present application. The specific flow is as follows: an RGB image (i.e., the input image) of size W × H × 3 is input; features C3, C4 and C5 are extracted from the image through the base network ResNet50, and then further fused by an FPN layer to obtain P3, P4 and P5. To improve detection of large target faces, P5 (i.e., the fused highest-level features) is further downsampled to obtain a higher semantic representation P6, where the channel dimensions of P3, P4, P5 and P6 are all 256. A semantic fusion module based on deformable convolution is then applied to strengthen each of the four feature maps: fusing multi-scale features enlarges the receptive field and enhances the model's ability to represent rigid objects, while the deformable convolutions enhance its ability to represent non-rigid objects. The four feature maps of different sizes produced by the semantic fusion modules are each fed into a multi-task detection head (i.e., the multi-task deep forgery detection network) to cooperatively execute the 4 tasks of face classification, fake face detection, face localization and forged region localization. Each detection head comprises four 3 × 3 convolutions with stride 1 and unchanged channels, followed by a 1 × 1 convolution with stride 1; each task independently adopts this 4-layer fully convolutional network to obtain its target output. For the face classification task, the fully convolutional target output is w' × h' × 1; for the fake face detection task, it is w' × h' × 1; for the face localization task, it is w' × h' × 4; for the forged region localization task, it is w' × h' × 1. It should be noted that the number of feature layers, the number of tasks, etc. in the above example are only illustrative and do not limit the application; those skilled in the art can adjust the parameters of the framework based on actual requirements.
Fig. 3 shows a schematic diagram of a semantic fusion module according to an example of the present application, in which all channel transformations are performed by 3 × 3 deformable convolutions with stride 1. The channel dimension of each layer of second feature information is 256; after input to the semantic fusion module, the channels are transformed from 256 to 128 (i.e., a 3 × 3 deformable convolution with stride 1, input channels 256, output channels 128), then from 128 to 64, and then from 64 to 64; finally, the outputs of the three channel transformations are concatenated (concat) to produce an output with channel dimension 256 (128 + 64 + 64).
In some embodiments, after model training is complete, a target image to be detected is input into the trained model, which, by triggering modules 11 to 13, predicts and outputs the deep forgery detection result corresponding to the target image.
In some embodiments, during the model training process, for each sample image in the training sample set, the model predicts and outputs the deep forgery detection result corresponding to that sample image through the operations performed by modules 11 to 13.
In some embodiments, the face deep forgery detection apparatus 1 further includes: a module for giving positive and negative sample labels to the anchor points or anchor boxes in the deep forgery detection result by adopting a label assignment strategy according to the deep forgery detection result, and calculating a loss function according to the labels, wherein face classification adopts binary cross-entropy loss, the cross-entropy loss of fake face detection is calculated only on samples classified as faces (positive samples), and the face localization task adopts intersection-over-union (IoU) loss. In some embodiments, for a face box on the input image (a sparse ground-truth box, i.e., the label), many anchor points or anchor boxes are placed during model design to cover the faces present in the picture as fully as possible (only some of these anchors are positive samples; many others are negative), and the label assignment strategy determines, based on the ground-truth box of the picture label, which anchor points or anchor boxes are positive samples. In some embodiments, during model training, the deep forgery detection result output by the model contains the prediction of each task; the prediction for the face localization task contains many anchor points or anchor boxes that carry no ground-truth labels, and according to the deep forgery detection result and the sample labels, a label can be assigned to each anchor so that the loss function can be computed to measure the distance between the model's prediction and the true distribution. In some embodiments, face authenticity classification is built on top of face classification, and forged region localization on top of face localization, so the label assignment strategy targets only the face classification and face localization tasks; specifically, this scheme selects positive and negative samples using the dynamic sample assignment strategy proposed by ATSS. The ATSS dynamic label assignment process is: for the feature map corresponding to each layer of second feature information, the top-k anchor points (anchor boxes and anchor points correspond one-to-one) whose centers have the smallest Euclidean distance to the center of the ground-truth box are taken as the candidate positive sample set; the IoU between each candidate anchor box and the ground-truth box is computed; and a sample whose IoU exceeds the threshold of the candidate set (the mean plus the standard deviation of the candidate IoUs; since the candidate set is dynamic, this threshold is also dynamic) is taken as a positive sample.
In some embodiments, said calculating a loss function according to the labels comprises: first weighting the positive samples according to their Euclidean distance from the center point to obtain weighted labels, and then calculating the loss function according to the labels corresponding to the input image. In some embodiments, within the positive sample set obtained by ATSS assignment (the negative set is unchanged), the distances from the samples to the center of the face ground-truth box differ, and the prior assumption is that samples closer to the center have better features; the positive samples can therefore be further weighted by their Euclidean distance to the center (the smaller the distance, the larger the weight), and this weighting improves the model's recall.
In some embodiments, the tasks cooperatively executed in the multi-task deep forgery detection network include face classification, face localization and fake face detection, where face classification adopts binary cross-entropy loss, the cross-entropy loss of fake face detection is calculated only on locations that the model classifies as faces (positive samples; locations classified as background are not judged for authenticity), and the face localization task adopts intersection-over-union (IoU) loss. In some embodiments, the cooperatively executed tasks further include forged region localization, which is a fine-grained face authenticity classification problem and likewise adopts binary cross-entropy loss (calculated on the basis of face localization; that is, forged region localization computes loss only inside the predicted face box, with no loss in other background regions).
In some embodiments, the face deep forgery detection apparatus 1 further includes: a module for weighting the loss function corresponding to each task according to a preset ratio to obtain the multi-task collaborative loss function. In some embodiments, the multi-task deep forgery detection network cooperatively executes the 4 tasks of face classification, face localization, fake face detection and forged region localization, and the loss function is divided into 4 parts: face classification adopts binary cross-entropy loss; the fake face detection cross-entropy loss is calculated only on samples classified as faces; face localization adopts IoU loss; and forged region localization adopts binary cross-entropy loss. Finally, the 4 loss terms are weighted in a fixed ratio and used to guide the learning process of the model. In some embodiments, model parameters are updated and optimized using backpropagation with a gradient descent strategy.
According to the scheme of the application, multiple tasks such as face classification, face localization and fake face detection can be completed synchronously in a single stage, which greatly improves the speed of face deep forgery detection; the limit that the first-stage face localization places on the second-stage forgery detection in the two-stage strategy is removed, which improves detection precision; and through the one-stage multi-task deep forgery detection network, multi-task collaborative learning improves the discriminability of the identification features, while the training data of the deep forgery detection task can generate labeling data for tasks such as classification, segmentation and object detection at nearly zero cost, providing richer supervision for the detection model to extract more robust and discriminative features.
The present application further provides a computer device, wherein the computer device includes: a memory for storing one or more programs; and one or more processors coupled to the memory, where the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for face deep forgery detection described herein.
The present application also provides a computer-readable storage medium on which a computer program is stored, the computer program being executable by a processor to perform the method for face deep forgery detection described herein.
The present application further provides a computer program product which, when executed by a device, causes the device to perform the method for face deep forgery detection described herein.
FIG. 5 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
In some embodiments, system 1000 can be implemented as any of the processing devices in the embodiments of the present application. In some embodiments, system 1000 may include one or more computer-readable media (e.g., system memory or NVM/storage 1020) having instructions and one or more processors (e.g., processor(s) 1005) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.
For one embodiment, system control module 1010 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1005 and/or to any suitable device or component in communication with system control module 1010.
The system control module 1010 may include a memory controller module 1030 to provide an interface to the system memory 1015. Memory controller module 1030 may be a hardware module, a software module, and/or a firmware module.
System memory 1015 may be used to load and store data and/or instructions, for example, for system 1000. For one embodiment, system memory 1015 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 1015 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 1010 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 1020 and communication interface(s) 1025.
For example, NVM/storage 1020 may be used to store data and/or instructions. NVM/storage 1020 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drive(s) (HDD (s)), one or more Compact Disc (CD) drive(s), and/or one or more Digital Versatile Disc (DVD) drive (s)).
NVM/storage 1020 may include storage resources that are physically part of a device on which system 1000 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 1020 may be accessed over a network via communication interface(s) 1025.
Communication interface(s) 1025 may provide an interface for system 1000 to communicate over one or more networks and/or with any other suitable device. System 1000 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 1005 may be packaged together with logic for one or more controller(s) of the system control module 1010, e.g., memory controller module 1030. For one embodiment, at least one of the processor(s) 1005 may be packaged together with logic for one or more controller(s) of the system control module 1010 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1005 may be integrated on the same die with logic for one or more controller(s) of the system control module 1010. For one embodiment, at least one of the processor(s) 1005 may be integrated on the same die with logic of one or more controllers of the system control module 1010 to form a system on a chip (SoC).
In various embodiments, system 1000 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 1000 may have more or fewer components and/or different architectures. For example, in some embodiments, system 1000 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (15)

1. A method for face deep false detection, wherein the method comprises:
extracting multiple layers of first characteristic information from an input image;
obtaining multilayer second characteristic information by performing characteristic fusion on the multilayer first characteristic information;
and performing semantic fusion on each layer of second feature information in the multiple layers of second feature information respectively to obtain multiple layers of third feature information, and inputting each layer of third feature information into a multi-task deep false detection network respectively to obtain a deep false detection result output by the multi-task deep false detection network, wherein the multiple tasks cooperatively executed in the multi-task deep false detection network comprise face classification, face positioning and false face detection.
2. The method of claim 1, wherein the multitasking in the multitasking deep false detection network further comprises fake region localization.
3. The method according to claim 1, wherein the obtaining of the multi-layer second feature information by feature fusion of the multi-layer first feature information comprises:
performing feature fusion on the multilayer first feature information to obtain fused multilayer second feature information;
and sampling the second characteristic information of the highest level in the fused multilayer second characteristic information to obtain the second characteristic information of a higher level.
4. The method of any of claims 1 to 3, wherein the method further comprises:
and according to the deep false detection result and the sample labels, adopting a label distribution strategy to endow a plurality of anchor points or anchor frames in the deep false detection result with labels of positive and negative samples, and calculating a loss function according to the labels, wherein the face classification adopts two-classification cross entropy loss, the cross entropy loss of face forgery detection is calculated on the basis of the face classification as the positive samples, and the face positioning task adopts cross-over loss.
5. The method of claim 4, wherein the multitasking cooperatively executed in the multitasking deep false detection network further comprises fake region localization, and a loss function of fake region localization adopts binary cross entropy loss.
6. The method of claim 4 or 5, wherein the method further comprises:
and weighting the loss function corresponding to each task according to a preset proportion and summing the weighted losses to obtain a multi-task collaborative loss function.
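A one-line sketch of claim 6; the weights are preset hyperparameters and the example values are illustrative only (a region_loss term would come from the fake region localization of claim 5).

```python
def multitask_collaborative_loss(losses, weights):
    # Weight each task's loss by its preset proportion, then sum.
    return sum(w * l for w, l in zip(weights, losses))

# Illustrative use, with assumed proportions:
# total = multitask_collaborative_loss(
#     [cls_loss, fake_loss, box_loss, region_loss], [1.0, 1.0, 2.0, 1.0])
```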
7. The method of claim 4, wherein said calculating a loss function according to said labels comprises:
first weighting the positive samples according to their Euclidean distance from the center point to obtain weighted labels, and then calculating the loss function according to the labels corresponding to the input image.
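One possible reading of claim 7, sketched with an assumed Gaussian fall-off; the claim only requires that positive samples be weighted by Euclidean distance to the center point, so the kernel and sigma below are illustrative.

```python
import torch

def center_weighted_labels(anchor_centers, face_centers, pos_mask,
                           sigma=8.0):
    """anchor_centers, face_centers: (N, 2) coordinates per anchor;
    pos_mask: bool per anchor. Returns per-anchor label weights."""
    dist = torch.linalg.norm(anchor_centers - face_centers, dim=1)
    weights = torch.exp(-dist ** 2 / (2 * sigma ** 2))  # closer => heavier
    return weights * pos_mask.float()  # negatives keep zero weight

# The weighted labels then scale the per-anchor loss terms, e.g.:
# per_anchor = F.binary_cross_entropy_with_logits(logits, targets,
#                                                 reduction="none")
# loss = (weights * per_anchor).mean()
```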
8. An apparatus for face deep false detection, wherein the apparatus comprises:
a module for extracting a plurality of layers of first feature information from an input image;
a module for obtaining a plurality of layers of second feature information by performing feature fusion on the plurality of layers of first feature information;
and a module for performing semantic fusion on each layer of second feature information in the plurality of layers of second feature information respectively to obtain a plurality of layers of third feature information, and inputting each layer of third feature information into the multi-task deep false detection network respectively to obtain a deep false detection result output by the multi-task deep false detection network, wherein the multiple tasks cooperatively executed in the multi-task deep false detection network comprise face classification, face positioning and fake face detection.
9. The apparatus of claim 8, wherein the multiple tasks cooperatively executed in the multi-task deep false detection network further comprise fake region localization.
10. The apparatus of claim 8, wherein the module for obtaining the plurality of layers of second feature information by performing feature fusion on the plurality of layers of first feature information is configured to:
perform feature fusion on the plurality of layers of first feature information to obtain fused multiple layers of second feature information;
and down-sample the highest-level second feature information among the fused layers to obtain second feature information of an even higher level.
11. The apparatus of any one of claims 8 to 10, wherein the apparatus further comprises:
and a module for applying, according to the deep false detection result and the sample labels, a label assignment strategy to assign positive and negative sample labels to a plurality of anchor points or anchor boxes in the deep false detection result, and calculating a loss function according to the labels, wherein the face classification adopts binary cross-entropy loss, the cross-entropy loss of fake face detection is calculated only on the samples that the face classification marks as positive, and the face positioning task adopts intersection-over-union (IoU) loss.
12. The apparatus of claim 11, wherein the multiple tasks cooperatively executed in the multi-task deep false detection network further comprise fake region localization, and the loss function of fake region localization adopts binary cross-entropy loss.
13. The apparatus of claim 11 or 12, wherein the apparatus further comprises:
and a module for weighting the loss function corresponding to each task according to a preset proportion and summing the weighted losses to obtain a multi-task collaborative loss function.
14. A computer device, wherein the computer device comprises:
a memory for storing one or more programs;
one or more processors coupled to the memory,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method recited in any one of claims 1 to 7.
15. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor to perform the method according to any one of claims 1 to 7.
CN202111569608.2A 2021-12-21 2021-12-21 Method and device for detecting face deep false Pending CN114419693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111569608.2A CN114419693A (en) 2021-12-21 2021-12-21 Method and device for detecting face deep false

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111569608.2A CN114419693A (en) 2021-12-21 2021-12-21 Method and device for detecting face deep false

Publications (1)

Publication Number Publication Date
CN114419693A (en) 2022-04-29

Family

ID=81267637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111569608.2A Pending CN114419693A (en) 2021-12-21 2021-12-21 Method and device for detecting face deep false

Country Status (1)

Country Link
CN (1) CN114419693A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998337A (en) * 2022-08-03 2022-09-02 联宝(合肥)电子科技有限公司 Scratch detection method, device, equipment and storage medium
CN114998337B (en) * 2022-08-03 2022-11-04 联宝(合肥)电子科技有限公司 Scratch detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10740640B2 (en) Image processing method and processing device
WO2021203863A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
CN111563502B (en) Image text recognition method and device, electronic equipment and computer storage medium
Dewi et al. Weight analysis for various prohibitory sign detection and recognition using deep learning
US20180114071A1 (en) Method for analysing media content
CN110782420A (en) Small target feature representation enhancement method based on deep learning
WO2022033095A1 (en) Text region positioning method and apparatus
Wang et al. Tree leaves detection based on deep learning
Sirisha et al. Statistical analysis of design aspects of various YOLO-based deep learning models for object detection
CN111242129A (en) Method and device for end-to-end character detection and identification
CN110263877B (en) Scene character detection method
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
WO2023075863A1 (en) Adversarial contrastive learning and active adversarial contrastive mix (adversemix) for semi-supervised semantic segmentation
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN116235209A (en) Sparse optical flow estimation
CN114419693A (en) Method and device for detecting face deep false
Diwate et al. Optimization in object detection model using YOLO. v3
CN112825116A (en) Method, device, medium and equipment for detecting and tracking face of monitoring video image
Zendehdel et al. Real-time tool detection in smart manufacturing using You-Only-Look-Once (YOLO) v5
CN116824291A (en) Remote sensing image learning method, device and equipment
CN110610184A (en) Method, device and equipment for detecting salient object of image
Ma et al. LAYN: Lightweight Multi-Scale Attention YOLOv8 Network for Small Object Detection
Qin et al. Flower species recognition system combining object detection and attention mechanism
CN116541549B (en) Subgraph segmentation method, subgraph segmentation device, electronic equipment and computer readable storage medium
Dinesh Reddy et al. Deep Neural Transfer Network Technique for Lung Cancer Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination