CN112749686A - Image detection method, image detection device, computer equipment and storage medium - Google Patents

Image detection method, image detection device, computer equipment and storage medium

Info

Publication number: CN112749686A (application CN202110127828.3A; granted publication CN112749686B)
Authority: CN (China)
Prior art keywords: probability, image, images, sequence, fusion
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112749686B
Inventors: 姚太平, 陈燊, 陈阳, 丁守鸿, 李季檩, 黄飞跃
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Events: priority to CN202110127828.3A; publication of CN112749686A; application granted; publication of CN112749686B; legal status Active; anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The application relates to an image detection method, an image detection device, computer equipment and a storage medium in the technical field of image processing. The method comprises the following steps: performing attention-based fusion processing on image features of at least two images through a first detection model to obtain fused image features of a target image sequence, and processing the fused image features to obtain a first probability for the target image sequence; processing the image features of the at least two images separately through a second detection model to obtain a second probability for each of the at least two images; and acquiring a sequence detection result for the target image sequence based on the first probability and the second probabilities of the at least two images. In this scheme, both the timing information between the images and the artifact traces within the images are taken into account in the process of detecting forged images, so the accuracy of forged-image detection is improved.

Description

Image detection method, image detection device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image detection method and apparatus, a computer device, and a storage medium.
Background
Image processing technologies such as Artificial Intelligence (AI) face swapping have promoted the development of the entertainment and cultural communication industries, but they also pose serious security threats to image-based applications such as face detection.
In the related art, forged images are detected by judging specific forgery traces in the fake content, such as blinking patterns or biological characteristics; for example, the human eye region is extracted from a video sequence, and a neural network is then used to model the eye sequence to distinguish whether the video is a forged face video.
However, as image processing techniques such as face swapping mature, the generated forged images increasingly exhibit biological patterns consistent with real images, so the detection accuracy of the solutions in the related art is low.
Disclosure of Invention
The embodiments of the application provide an image detection method and apparatus, a computer device, and a storage medium, which can improve the accuracy of detecting forged images. The technical scheme is as follows:
in one aspect, an image detection method is provided, and the method includes:
acquiring a target image sequence, wherein the target image sequence comprises at least two images in the same video;
performing fusion processing based on an attention mechanism on the image features of the at least two images through a first detection model to obtain fusion image features of the target image sequence;
processing the fused image feature by the first detection model to obtain a first probability of the target image sequence, the first probability indicating a probability that the target image sequence is a counterfeit image sequence;
respectively processing the image characteristics of the at least two images through a second detection model to obtain respective second probabilities of the at least two images, wherein the second probabilities are used for indicating the probability that the corresponding images are forged images;
acquiring a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, wherein the sequence detection result is used for indicating whether the target image sequence is a forged image sequence;
wherein the first detection model and the second detection model are obtained by training on an image sequence sample set; the image sequence sample set comprises at least two image sequence sample pairs, each pair comprising an image sequence positive sample and an image sequence negative sample, and each image sequence sample has a corresponding sample label indicating whether the corresponding image sequence sample is a forged image sequence sample.
In another aspect, there is provided an image detection apparatus, the apparatus including:
the image sequence acquisition module is used for acquiring a target image sequence, and the target image sequence comprises at least two images in the same video;
the feature fusion module is used for performing fusion processing based on an attention mechanism on the image features of the at least two images through a first detection model to obtain fusion image features of the target image sequence;
a first feature processing module, configured to process the fused image feature through the first detection model to obtain a first probability of the target image sequence, where the first probability is used to indicate a probability that the target image sequence is a forged image sequence;
the second feature processing module is configured to process the image features of the at least two images through a second detection model, so as to obtain respective second probabilities of the at least two images, where the second probabilities are used to indicate probabilities that corresponding images are forged images;
a detection result obtaining module, configured to obtain a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, where the sequence detection result is used to indicate whether the target image sequence is a forged image sequence;
wherein the first detection model and the second detection model are obtained by training on an image sequence sample set; the image sequence sample set comprises at least two image sequence sample pairs, each pair comprising an image sequence positive sample and an image sequence negative sample, and each image sequence sample has a corresponding sample label indicating whether the corresponding image sequence sample is a forged image sequence sample.
In a possible implementation manner, the detection result obtaining module is configured to perform weighting processing on the first probability and the second probability of each of the at least two images to obtain the sequence detection result.
In a possible implementation manner, the detection result obtaining module includes:
a probability fusion unit, configured to fuse the second probabilities of the at least two images to obtain a fusion probability;
and a weighting processing unit, configured to perform weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability, and obtain the sequence detection result.
In a possible implementation manner, the probability fusion unit is configured to:
take the median of the second probabilities of the at least two images to obtain the fusion probability;
alternatively,
average the second probabilities of the at least two images to obtain the fusion probability;
alternatively,
process the second probabilities of the at least two images through a gated recurrent unit (GRU) to obtain the fusion probability.
In one possible implementation, the weight of the first probability is the same as the weight of the fusion probability.
In one possible implementation, the apparatus further includes:
an image quality acquisition module, configured to, before the weighting processing unit performs weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result, acquire image quality information of the target image sequence, and acquire the weight of the first probability and the weight of the fusion probability based on the image quality information.
In one possible implementation, the apparatus further includes:
a probability difference obtaining module, configured to obtain a probability difference between the first probability and the fusion probability before a weighting processing unit performs weighting processing on the first probability and the fusion probability based on a weight of the first probability and a weight of the fusion probability to obtain the sequence detection result;
and the weighting processing unit is configured to, in response to the probability difference being smaller than a difference threshold, execute a weighting process on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result.
In one possible implementation, the apparatus further includes:
and the information output module is used for responding to the probability difference value not less than the difference value threshold value and outputting detection failure information.
In a possible implementation manner, the detection result obtaining module is further configured to,
in response to the probability difference being not less than the difference threshold, obtain the sequence detection result based on the first probability;
alternatively,
in response to the probability difference being not less than the difference threshold, obtain the sequence detection result based on the fusion probability.
In one possible implementation manner, the feature fusion module is configured to,
performing feature extraction on the at least two images through a first feature extraction network in the first detection model to obtain first image features of the at least two images;
processing first image features of the at least two images through an attention network in the first detection model to obtain respective attention maps of the at least two images, wherein the attention maps are used for indicating weights of the first image features of the corresponding images;
and weighting the first image features of the at least two images through the attention network based on the attention diagrams of the at least two images to obtain the fused image features.
In one possible implementation manner, the feature fusion module is configured to,
process the first image features of the at least two images through at least one feature processing submodule connected in sequence in the first detection model to obtain the respective attention maps of the at least two images;
wherein the feature processing submodule comprises a fully connected layer, a batch normalization layer and a hyperbolic tangent function connected in sequence.
In one possible implementation manner, the second feature processing module is configured to,
performing feature extraction on the at least two images through a second feature extraction network in the second detection model to obtain second image features of the at least two images;
and processing the second image characteristics of the at least two images through a classification network in the second detection model to obtain respective second probabilities of the at least two images.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein at least one computer program, the at least one computer program being loaded and executed by the processor to implement the image detection method described above.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the computer program being loaded and executed by a processor to implement the above-mentioned image detection method.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the image detection method provided in the various alternative implementations described above.
The technical scheme provided by the application can comprise the following beneficial effects:
through the two models, a first probability that the target image sequence is a forged image sequence is obtained based on the timing information between at least two images in the target image sequence, second probabilities that the at least two images are forged images are obtained based on their respective image features, and the two probabilities are then combined to comprehensively determine whether the target image sequence is a forged image sequence; in this way, both the timing information between the images and the artifact traces within the images are considered in the process of detecting forged images, so the accuracy of forged-image detection is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram illustrating the system architecture of an image detection system according to an exemplary embodiment of the present application;
FIG. 2 illustrates a flowchart of an image detection method provided by an exemplary embodiment of the present application;
FIG. 3 illustrates a framework diagram of detection model training and image detection provided by an exemplary embodiment of the present application;
FIG. 4 illustrates a flowchart of a detection model training and image detection method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of an attention network according to the embodiment shown in FIG. 4;
FIG. 6 illustrates an image detection flowchart provided by an exemplary embodiment of the present application;
FIG. 7 is a block diagram of an image detection apparatus provided in an exemplary embodiment of the present application;
FIG. 8 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application, as detailed in the appended claims.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The embodiment of the application provides an image detection method, which can improve the accuracy of detecting forged images in videos. For ease of understanding, several terms referred to in this application are explained below.
1) Artificial intelligence AI
Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of technologies, at both the hardware level and the software level. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning. The solution provided in the embodiments of the present application mainly relates to computer vision and machine learning/deep learning.
2) Machine Learning (Machine Learning, ML)
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
3) Attention Mechanism (Attention Mechanism)
The attention mechanism is essentially a mechanism by which a network autonomously learns a set of weighting coefficients and, in a "dynamic weighting" manner, emphasizes regions of interest while suppressing irrelevant background regions. In the field of computer vision, attention mechanisms can be broadly divided into two categories: hard attention and soft attention.
The attention mechanism is often applied to a Recurrent Neural Network (RNN). When processing a target image, an RNN with an attention mechanism processes, at each step, only the part of the image's pixels that is attended to according to the previous state, rather than all pixels of the target image, which reduces the processing complexity of the task.
Fig. 1 is a schematic diagram illustrating the system architecture of an image detection system according to an exemplary embodiment of the present application. As shown in Fig. 1, the system includes: a server 110 and a terminal 120.
The server 110 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
In one possible implementation, multiple servers may be grouped into a blockchain, and the server 110 may be a node on the blockchain.
The terminal 120 is a terminal having an image detection function. For example, the terminal 120 may be an intelligent cash register, an intelligent counter, a smartphone, a tablet computer, an e-book reader, smart glasses, a smart watch, a smart television, an intelligent vehicle-mounted device, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, and the like.
Optionally, the system includes one or more servers 110 and a plurality of terminals 120. The number of the servers 110 and the terminals 120 is not limited in the embodiment of the present application.
The terminal and the server are connected through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless or wired network described above uses standard communication technologies and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats such as Hypertext Markup Language (HTML) and Extensible Markup Language (XML). All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above. The application is not limited thereto.
Fig. 2 shows a flowchart of an image detection method provided by an exemplary embodiment of the present application, where the method is executed by a computing device, the computing device may be implemented as a terminal or a server, and the terminal or the server may be the terminal or the server shown in fig. 1, and as shown in fig. 2, the image detection method includes the following steps:
step 210, a target image sequence is obtained, where the target image sequence includes at least two images in the same video.
In this embodiment, the target image sequence may be obtained by a computer device extracting image frames from an input video at a certain sampling rate.
In a possible implementation manner, the at least two images in the target image sequence are either the complete image frames sampled from the input video or partial regions of those frames.
When the at least two images in the target image sequence are partial regions of the sampled image frames, they may be the regions in which a corresponding target object is located in each sampled frame; for example, the target object may be a target face, a target item, or the like.
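As an illustrative aside (not part of the patent text), the frame-sampling step described above could be sketched as follows in Python with OpenCV; the function name and the fixed frame count are assumptions made for the example.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list:
    """Evenly sample `num_frames` frames from a video (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```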
And step 220, performing fusion processing based on an attention mechanism on the image features of the at least two images through the first detection model to obtain fusion image features of the target image sequence.
In this embodiment, for the target image sequence, the computer device may perform attention-based fusion on image features of at least two images in the target image sequence through a trained machine learning model, so as to extract the fused image features having the image features of the at least two images and timing information between the at least two images.
Step 230, processing the fused image feature through the first detection model to obtain a first probability of the target image sequence, where the first probability is used to indicate a probability that the target image sequence is a forged image sequence.
After the first detection model processes the fusion image features, a probability value can be output, wherein the probability value can be the predicted probability that the target image sequence is a forged image sequence, so that whether the image is forged or not is detected by combining time sequence information among images in the target image sequence.
Step 240, processing the image features of the at least two images respectively through a second detection model to obtain respective second probabilities of the at least two images, where the second probabilities are used to indicate the probability that the corresponding images are forged images.
After the second detection model processes the image features of each of the at least two images, a probability value can be output for each image; this probability value may be the predicted probability that the single image is a forged image, so that whether the images are forged is detected in combination with the artifact traces of the images in the target image sequence.
Wherein the first detection model and the second detection model are obtained by training on an image sequence sample set; the image sequence sample set comprises at least two image sequence sample pairs, each pair comprising an image sequence positive sample and an image sequence negative sample, and each image sequence sample has a corresponding sample label indicating whether the corresponding image sequence sample is a forged image sequence sample.
In this embodiment of the application, the first detection model and the second detection model may be models obtained by training based on the same image sequence sample set, or the first detection model and the second detection model may also be models obtained by training based on different image sequence sample sets.
Step 250, obtaining a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, where the sequence detection result is used to indicate whether the target image sequence is a forged image sequence.
In this embodiment of the application, after acquiring a first probability obtained based on timing information between at least two images and a second probability obtained based on artifact traces of the at least two images, the computer device may determine a sequence detection result by combining the first probability and the second probability, that is, may predict whether a target image sequence is a forged image sequence by simultaneously combining the timing information between the at least two images and the artifact traces of the at least two images.
In summary, the image detection method provided in the embodiment of the present application uses two models: a first probability that the target image sequence is a forged image sequence is obtained based on the timing information between at least two images in the sequence, and a second probability that each of the at least two images is a forged image is obtained based on the respective image features. The two probabilities are then combined to comprehensively determine whether the target image sequence is a forged image sequence. In this way, both the timing information between the images and the artifact traces within the images are considered in the process of detecting forged images, which improves the accuracy of forged-image detection.
In the scheme of the embodiment of the application, the method can be used in any scene needing to detect whether the image is a forged image. For example, the application scenarios of the above scheme include, but are not limited to, the following:
1. financial industry scenarios.
For example, consider an online financial-service transaction scenario. When a user needs to perform a large online transaction, face images must be acquired and monitored in real time to verify the user's identity. If an illegal user impersonates a legitimate user with a processed face image or face video, the legitimate user suffers economic loss. In this case, for the face image sequence in the recorded face-verification video, the image detection method provided by the present application works as follows: on one hand, after attention-based fusion processing is performed on each image in the face image sequence by a video-level model (i.e., the first detection model), the probability that the faces in the face image sequence are forged is output; on the other hand, each image in the face image sequence is processed by an image-level model (i.e., the second detection model), and the probability that the face in each image is forged is output. The probabilities output by the two models are then combined to determine whether the face in the recorded face-verification video is a forged face, thereby realizing face-authenticity detection on the video being verified. When the detection result indicates that the image sequence in the face video is a non-forged face image sequence, identity verification is performed on the user in the face video, thereby protecting the economic security of the legitimate user.
2. Information registration scenario.
With the popularization of network applications, information registration is no longer limited to offline registration and can also be completed through related applications. During information registration, the authenticity of the registration often needs to be verified; for example, a user registering information may upload a face video containing a specified action to verify the authenticity of the registration information. In this case, the authenticity of the face image sequence in the face video uploaded by the user can be verified by the image detection method provided by the present application, improving the accuracy of face verification.
3. And (5) network friend making scene.
Online friend-making has become an important means of social contact. When making friends online, replacing one's face image or face video through face-forgery technology (for example, face swapping) adds interest to the interaction, but it also lowers the authenticity of online friend-making. To verify the authenticity of the face information in the real-time videos of online friends, the authenticity of the face image sequences in the real-time videos obtained during online interaction can be detected by the image detection method provided by the present application, so that the interest of online friend-making is preserved while the user's real information is reflected.
4. And (5) identifying scenes of video authenticity.
With the continuous popularization of video-shooting applications, video has gradually become an important means of information transmission in social media. Identifying the authenticity of videos can prevent users from being misled by forged videos spread on social/media platforms. In this scenario, a social/media platform may extract a video image sequence from a video propagated on the platform and apply the image detection method provided by the present application: on one hand, after attention-based fusion processing is performed on each image in the video image sequence by a video-level model, the probability that the video image sequence is composed of forged video images is output; on the other hand, each image in the video image sequence is processed by an image-level model, and the probability that each image is a forged image is output. The probabilities output by the two models are then combined to determine whether the video is forged, so that forged videos can be intercepted or flagged.
The scheme shown in the embodiments of the present application is also helpful for the verification of evidence by police and judicial authorities and for preventing criminal suspects from forging evidence. On multimedia platforms, the wide spread of face-swapped videos continuously reduces the credibility of media and easily misleads users. Through the scheme shown in the embodiments of the present application, a platform can screen videos and add a clear mark to detected forged videos, such as "made by xxx (image processing software name)", thereby ensuring the credibility of video content and maintaining public trust. In general, the scheme shown in the embodiments of the present application can be applied to products such as face verification, judicial-authentication tools, and image/video authentication.
The scheme related to the application comprises a detection model training phase and an image detection phase. Fig. 3 illustrates a framework diagram of detection model training and image detection provided in an exemplary embodiment of the present application, and as shown in fig. 3, in a detection model training stage, a model training device 310 obtains a first detection model and a second detection model through preset training samples (including an image sequence sample set and sample labels corresponding to respective image sequence samples in the image sequence sample set). In the image detection stage, the image detection device 320 detects the falsification probability of the input target image sequence based on the first detection model and the second detection model, and determines whether the target image sequence is a falsified image sequence.
The model training device 310 and the image detection device 320 may be computer devices, for example, the computer devices may be stationary computer devices such as a personal computer and a server, or the computer devices may also be mobile computer devices such as a tablet computer and an e-book reader.
Alternatively, the model training device 310 and the image detection device 320 may be the same device, or they may be different devices. When they are different devices, the model training device 310 and the image detection device 320 may be the same type of device; for example, both may be servers. Alternatively, they may be different types of devices; for example, the image detection device 320 may be a personal computer or a terminal, while the model training device 310 may be a server. The embodiment of the present application does not limit the specific types of the model training device 310 and the image detection device 320.
Fig. 4 shows a flowchart of a detection model training and image detection method provided by an exemplary embodiment of the present application, where the method is executed by a computer device, and the computer device may be implemented as a terminal or a server, and the terminal or the server may be the terminal or the server shown in fig. 1, and the detection model training and image detection method includes the following steps, as shown in fig. 4:
step 410, obtaining a sample set of an image sequence; the image sequence sample set comprises at least two image sample sequence pairs, each image sample sequence pair comprises an image sequence positive sample and an image sequence negative sample, each image sample sequence has a corresponding sample label, and the sample label is used for indicating whether the corresponding image sample sequence is a forged image sample sequence.
In an exemplary scheme of the embodiment of the application, taking a face image sequence as an example, a face image sequence sample in an image sequence positive sample represents a real face image sequence, and a face image sequence sample in an image sequence negative sample represents a forged face image sequence. The image sequence positive samples and the image sequence negative samples correspond one to one, that is, an image sequence positive sample and its corresponding image sequence negative sample form an image sequence sample pair. Alternatively, the image sequence negative sample is obtained by performing image falsification processing (such as face swapping, retouching, and the like) on the image sequence positive sample.
Because most forged-video data sets exhibit class imbalance, that is, the number of forged videos is usually larger than the number of real videos, the videos corresponding to the image sequence samples can be obtained by down-sampling: for each real video, only one of its corresponding forged videos is sampled, thereby ensuring the balance of positive and negative samples in the training sample set. Taking a face image sequence as an example, for each real face video, only one corresponding forged face video is obtained when collecting negative face sample images.
Taking a face image sequence as an example, in the process of collecting image sequence samples, N frames can be sampled at equal intervals, in frame order, from each real face video and each forged face video to form an image sequence positive sample and a corresponding image sequence negative sample in the training sample set.
For example, after obtaining a face video, a computer device (e.g., the model training device) may first sample 50 frames at equal intervals from the video by using OpenCV, then locate the region where the face is located by using a face detection technique, and enlarge that region by a factor of 1.2 about its center, so that the cropping result contains the whole face and part of the background region around it. If several faces exist in the video, the face image sequences corresponding to the detected faces can be stored separately.
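A minimal sketch of the crop-and-enlarge step just described, assuming a face detector has already produced a bounding box (x1, y1, x2, y2); the 1.2 factor follows the text, while the function signature and the border clamping are illustrative assumptions.

```python
def crop_face(frame, box, scale: float = 1.2):
    """Crop a face region enlarged by `scale` around its center (sketch)."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2                # box center
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2
    # Clamp the enlarged box to the image borders.
    nx1, ny1 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    nx2, ny2 = min(w, int(cx + half_w)), min(h, int(cy + half_h))
    return frame[ny1:ny2, nx1:nx2]
```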
In a possible implementation manner, the image sequence sample label may be represented by 0 and 1; for example, 0 represents a real-face label and 1 represents a forged-face label.
Step 420, training the first detection model and the second detection model through the image sequence sample set.
The first detection model may include a first feature extraction network, an attention network, and a first classifier. The first feature extraction network is used for extracting first image features of each image in the input image sequence/image sequence sample; the attention network is used for fusing the first image features of the images based on an attention mechanism, so that the first detection model outputs, through the first classifier and according to the fused features, a probability indicating whether the input image sequence is a forged image sequence.
In a possible implementation, the attention network includes at least one feature processing sub-module connected in sequence, where the at least one feature processing sub-module connected in sequence is configured to output an attention map of each image after processing the first image feature of each image, where the attention map is used to indicate a weight of the first image feature of the corresponding image; the subsequent attention network may perform weighting processing on the first image features of the respective images based on the respective attention diagrams of the respective images to obtain the features obtained by the fusion, thereby implementing fusion of the first image features of the respective images based on the attention mechanism.
For example, please refer to fig. 5, which shows a schematic structural diagram of an attention network according to an embodiment of the present application. As shown in fig. 5, the attention network includes two feature processing sub-modules 51 connected in sequence, and a Softmax function 52 (normalized exponential function), each sub-module 51 includes FC (fully connected layer), BN (batch normalization layer), and Tanh (hyperbolic tangent function), wherein the SumFusion operation 53 represents weighted fusion of the original features and the learned weights.
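As an illustrative aside, the structure in fig. 5 might be sketched in PyTorch as follows; the hidden width, the choice of one scalar weight per frame, and the tensor shapes are assumptions not fixed by the patent text.

```python
import torch
import torch.nn as nn

class FeatureBlock(nn.Module):
    """FC -> BN -> Tanh, the feature processing sub-module of fig. 5 (sketch)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):                        # x: (B*T, in_dim)
        return self.tanh(self.bn(self.fc(x)))

class AttentionFusion(nn.Module):
    """Attention-based fusion of per-frame features (illustrative sketch)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.block1 = FeatureBlock(feat_dim, hidden_dim)
        self.block2 = FeatureBlock(hidden_dim, 1)  # one weight per frame
        self.softmax = nn.Softmax(dim=1)           # normalise over frames

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        b, t, c = feats.shape
        scores = self.block2(self.block1(feats.reshape(b * t, c)))
        attn = self.softmax(scores.view(b, t))     # attention map: (B, T)
        # SumFusion: weighted sum of the original per-frame features.
        fused = (attn.unsqueeze(-1) * feats).sum(dim=1)   # (B, feat_dim)
        return fused, attn
```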
The second detection model may include a second feature extraction network and a second classifier. The second feature extraction network is used for extracting second image features of each image in the input image sequence/image sequence sample, so that the second detection model outputs, through the second classifier and according to the second image features of each image, second probabilities respectively indicating whether each image is a forged image.
The first feature extraction network or the second feature extraction network may use an existing trained model, such as EfficientNet, or may use a network structure designed or modified as needed; after the network weights are initialized with weights pre-trained on ImageNet, the network is retrained together with the other parameters of the first detection model or the second detection model.
The first feature extraction network and the second feature extraction network may adopt the same model structure or different model structures.
In an exemplary aspect of the embodiment of the present application, the first classifier and the second classifier may each consist of two consecutive fully connected layers, with a ReLU (rectified linear unit) function included in the classifier to improve its nonlinear fitting capability.
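A sketch of such a classifier follows; the hidden width and the single-logit output convention (sigmoid applied later to obtain a probability) are assumptions for the example.

```python
import torch.nn as nn

def make_classifier(feat_dim: int, hidden_dim: int = 256) -> nn.Sequential:
    """Two fully connected layers with a ReLU in between (sketch); the single
    output logit can be passed through a sigmoid to obtain a probability."""
    return nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 1),
    )
```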
In this embodiment of the application, the first detection model is a video-level model, and the second detection model is an image-level model, where the image-level model and the video-level model can be trained separately.
Taking model training for detecting forged face image sequences as an example: for the image-level model, the scheme shown in the application can randomly sample B face images from the data set based on the mini-batch method; after data enhancement operations such as random flipping, blurring and JPEG compression, the images are passed through the image-level model to obtain prediction probabilities, the cross-entropy loss is calculated against the ground-truth labels of the images, and the model parameters are updated with the calculated cross-entropy loss.
For the video-level model, the scheme can randomly sample B × T pictures (i.e., the above image sequence samples) from the data set based on the mini-batch method, where T is the number of frames per video sample. After the prediction probability of the picture sequence is obtained through the video-level model, the cross-entropy loss is calculated against the ground-truth label of the picture sequence, and the model parameters are updated with the calculated cross-entropy loss.
In this scheme, optimization algorithms such as Adam can be used to update the network parameters, iterating the optimization multiple times. Model selection and learning-rate decay can be performed according to the accuracy on the validation set during training, and other techniques can also be used to prevent the model from overfitting. When the validation set and the training set are constructed, it can be ensured that the identities of the persons in them do not overlap.
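One epoch of the mini-batch training described above might look like the following sketch; BCEWithLogitsLoss is the binary form of the cross-entropy loss mentioned in the text, and the data-loader contract (batches of inputs and 0/1 labels) is an assumption.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of mini-batch training (illustrative sketch).

    `loader` is assumed to yield (inputs, labels) batches with labels
    0 = real and 1 = forged, matching the label convention in the text.
    """
    criterion = nn.BCEWithLogitsLoss()       # binary cross-entropy loss
    model.train()
    for inputs, labels in loader:
        inputs = inputs.to(device)
        labels = labels.float().to(device)
        logits = model(inputs).squeeze(-1)   # predicted logits per sample
        loss = criterion(logits, labels)     # compare with ground truth
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                     # update the model parameters
```

The optimizer, e.g. torch.optim.Adam(model.parameters(), lr=1e-4), would be constructed once outside the epoch loop, with learning-rate decay applied between epochs according to validation accuracy, as the text describes.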
Step 430, a target image sequence is obtained, where the target image sequence includes at least two images in the same video.
Taking a video detection scene for AI face swapping as an example, after the first detection model and the second detection model are trained, a computer device (for example, the image detection device) may sample video frames at equal intervals from an input face-swapped video by using OpenCV, where the number of sampled frames may be increased appropriately, depending on the speed of the platform on which the model is actually deployed, so as to include more video information; a face detection technique such as RetinaFace is then used to locate the face region in the video, and the region is enlarged by a factor of 1.2 about its center to obtain the target image sequence.
And step 440, performing fusion processing based on an attention mechanism on the image features of the at least two images through the first detection model to obtain fusion image features of the target image sequence.
In a possible implementation manner, the computer device may perform feature extraction on the at least two images through a first feature extraction network in the first detection model to obtain first image features of the at least two images; processing the first image features of the at least two images through an attention network in the first detection model to obtain respective attention maps of the at least two images, wherein the attention maps are used for indicating the weight of the first image features of the corresponding images; then, based on the attention diagrams of the at least two images, the first image features of the at least two images are weighted through the attention network, and the fused image features are obtained.
In a possible implementation manner, when the first image features of the at least two images are processed through the attention network in the first detection model to obtain the respective attention diagrams of the at least two images, the computer device may process the first image features of the at least two images through at least one feature processing sub-module sequentially connected in the first detection model to obtain the respective attention diagrams of the at least two images; the feature processing submodule comprises a full connection layer, a batch normalization layer and a hyperbolic tangent function which are connected in sequence.
For example, the computer device performs feature extraction on at least two input images through a first feature extraction network in a first detection model to obtain respective first image features of the at least two images, then inputs the respective first image features of the at least two images into an attention network as shown in fig. 5, and after the respective first image features are sequentially processed by two feature processing sub-modules 51 in the attention network, inputs the processing results into a Softmax function 52 to obtain respective attention diagrams of the at least two images, and then the attention network performs weighted summation on the respective first image features of the at least two images and the respective attention diagrams of the at least two images to obtain the fused image features.
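Reusing the AttentionFusion and make_classifier sketches above, the forward pass of the video-level model described in this step could look like the following; the EfficientNet backbone via the timm library, the input resolution, and the feature width (1280 for EfficientNet-B0) are assumptions, not mandated by the text.

```python
import timm
import torch

backbone = timm.create_model("efficientnet_b0", pretrained=True,
                             num_classes=0)            # feature extractor only
fusion = AttentionFusion(feat_dim=1280)                # sketch defined earlier
classifier = make_classifier(feat_dim=1280)            # sketch defined earlier

frames = torch.randn(2, 8, 3, 224, 224)                # (B, T, C, H, W) sequence
b, t = frames.shape[:2]
feats = backbone(frames.flatten(0, 1)).view(b, t, -1)  # first image features
fused, attn = fusion(feats)                            # attention-based fusion
first_prob = torch.sigmoid(classifier(fused))          # P(sequence is forged)
```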
Step 450, processing the fused image feature through the first detection model to obtain a first probability of the target image sequence, where the first probability is used to indicate a probability that the target image sequence is a forged image sequence.
In this embodiment of the application, after the first detection model obtains the above-mentioned feature of the fused image, the feature of the fused image may be input to the first classifier, so as to obtain the first probability output by the first classifier.
Step 460, processing the image features of the at least two images through a second detection model respectively to obtain respective second probabilities of the at least two images, where the second probabilities are used to indicate the probability that the corresponding images are forged images.
In a possible implementation manner, the computer device may perform feature extraction on the at least two images through a second feature extraction network in the second detection model to obtain second image features of the at least two images; and then processing the second image characteristics of the at least two images through a classification network in the second detection model to obtain respective second probabilities of the at least two images.
For example, the computer device performs feature extraction on at least two input images through a second feature extraction network in the second detection model to obtain respective second image features of the at least two images, and then inputs the respective second image features of the at least two images into the second classifier respectively to obtain respective second probabilities of the at least two images, which are output by the second classifier respectively.
That is, in the embodiment of the present application, each of the at least two images corresponds to the respective second probability.
Step 470, obtaining a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, where the sequence detection result is used to indicate whether the target image sequence is a forged image sequence.
In a possible implementation manner, the computer device may perform weighting processing on the first probability and the second probability of each of the at least two images to obtain the sequence detection result.
The computer device may perform weighting processing on the first probability and the second probabilities of the at least two images to obtain a weighted probability, and determine the sequence detection result according to the weighted probability. For example, the computer device may compare the weighted probability with a probability threshold: when the weighted probability is greater than the probability threshold, a sequence detection result indicating that the target image sequence is a forged image sequence is obtained; conversely, when the weighted probability is not greater than the probability threshold, a sequence detection result indicating that the target image sequence is not a forged image sequence is obtained.
In this embodiment of the application, the computer device may perform weighted summation processing on the first probability obtained by combining the timing information and the second probability obtained by combining the artifact trace to realize fusion of the first probability and the second probability, and further realize prediction of whether the target image sequence is a forged image sequence by simultaneously combining the timing information between the at least two images and the artifact traces of the at least two images.
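A sketch of this weighting-and-threshold step, assuming the per-image second probabilities have already been fused into a single value (as discussed next); the equal default weights and the 0.5 threshold are assumptions, not fixed by the text.

```python
def sequence_decision(first_prob: float, fused_second: float,
                      w1: float = 0.5, w2: float = 0.5,
                      threshold: float = 0.5) -> bool:
    """Weighted combination of the two probabilities (illustrative sketch)."""
    weighted = w1 * first_prob + w2 * fused_second
    return weighted > threshold        # True -> forged image sequence
```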
In a possible implementation manner, when the first probability and the second probability of each of the at least two images are weighted to obtain the sequence detection result, the computer device may fuse the second probabilities of each of the at least two images to obtain a fusion probability; then, based on the weight of the first probability and the weight of the fusion probability, the first probability and the fusion probability are weighted to obtain the sequence detection result.
Since the at least two images have respective second probabilities, and the target image sequence has only one first probability, the at least two images having respective second probabilities may be fused and then weighted with the first probabilities in order to weight the first and second probabilities.
In a possible implementation manner, the second probabilities of the at least two images may be fused to obtain the fusion probability in any of the following ways (a code sketch of the three options follows this list):
1) Take the median of the second probabilities of the at least two images to obtain the fusion probability.
In this embodiment of the present application, because there may be certain differences between the at least two images, there may also be large differences between their second probabilities obtained by the second detection model. For example, the second probabilities of a small number of images may differ greatly from those of the other images (such second probabilities may be called outliers), and outliers are usually inaccurate. To avoid the influence of outliers on the fusion result, the computer device may take the median of the second probabilities of the at least two images as a key probability value and use it as the fusion probability.
2) Average the second probabilities of the at least two images to obtain the fusion probability.
In another possible implementation manner, the computer device may also average the second probabilities of the at least two images, and use the average as the fusion probability.
3) Processing the second probabilities of the at least two images with a Gated Recurrent Unit (GRU) to obtain the fusion probability.
In another possible implementation manner, the computer device may also input the second probability of each of the at least two images into the GRU to obtain the fusion probability.
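The three fusion strategies might be sketched as follows; PyTorch is used only as an example framework, and the GRU hidden size and the final sigmoid projection are assumptions for illustration:

```python
# Sketches of the three fusion strategies for the per-image second
# probabilities: median, mean, and a gated recurrent unit (GRU).
import statistics
import torch
import torch.nn as nn

def fuse_median(second_probs):
    # The median suppresses outlier predictions on individual frames.
    return statistics.median(second_probs)

def fuse_mean(second_probs):
    return statistics.mean(second_probs)

class GRUFusion(nn.Module):
    """Fuse the per-image probabilities sequentially with a GRU."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, second_probs):            # (batch, num_images)
        x = second_probs.unsqueeze(-1)          # (batch, num_images, 1)
        _, h = self.gru(x)                      # h: (1, batch, hidden_size)
        return self.head(h[-1]).squeeze(-1)     # one fusion probability per sequence

probs = [0.7, 0.75, 0.9, 0.2, 0.8]              # 0.2 is an outlier
print(fuse_median(probs), fuse_mean(probs))     # the median resists the outlier
print(GRUFusion()(torch.tensor([probs])))       # untrained, shown for shape only
```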
In one possible implementation, the weight of the first probability is the same as the weight of the fusion probability.
In an exemplary aspect of this embodiment of the present application, the computer device may assign the same weight to the first probability and the fusion probability.
In a possible implementation manner, before performing weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result, the computer device may further obtain image quality information of the target image sequence, and obtain the weight of the first probability and the weight of the fusion probability based on the image quality information.
The image quality information may include image sharpness, image resolution, and the like.
In another exemplary aspect of this embodiment of the present application, the computer device may determine the weight of the first probability and the weight of the fusion probability according to the image quality; that is, the image quality decides whether the detection of forged images should lean more on the artifact traces in the images or on the timing information between them.
For example, the higher the image quality, the more easily the model can accurately pick out artifact traces from an image, i.e., the more accurate detection of a forged image sequence through artifact traces becomes; in this case, the weight of the fusion probability can be increased and the weight of the first probability decreased.
Conversely, the lower the image quality, the harder it is for the model to accurately pick out artifact traces, whereas the timing information is comparatively unaffected; that is, detecting whether the target image sequence is a forged image sequence through the timing information becomes relatively more accurate. In this case, the weight of the fusion probability can be decreased and the weight of the first probability increased.
In a possible implementation of this embodiment of the application, the computer device may pre-store a correspondence between various kinds of image quality information and the corresponding weight of the first probability and weight of the fusion probability; the computer device may then query this correspondence with the image quality information of the target image sequence to obtain the two weights.
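A sketch of such a pre-stored correspondence follows; the quality buckets, resolution cut-offs, and weight values are invented purely for illustration and are not fixed by this application:

```python
# Hypothetical correspondence between image quality and the two weights.

QUALITY_TO_WEIGHTS = {
    # bucket: (weight of first probability, weight of fusion probability)
    "high":   (0.3, 0.7),   # high quality: artifact traces are easier to spot
    "medium": (0.5, 0.5),
    "low":    (0.7, 0.3),   # low quality: rely more on timing information
}

def weights_for(image_quality_info):
    """Map quality information (here: resolution) to a quality bucket."""
    width, height = image_quality_info["resolution"]
    if min(width, height) >= 720:
        return QUALITY_TO_WEIGHTS["high"]
    if min(width, height) >= 360:
        return QUALITY_TO_WEIGHTS["medium"]
    return QUALITY_TO_WEIGHTS["low"]

print(weights_for({"resolution": (1280, 720)}))  # -> (0.3, 0.7)
```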
In a possible implementation, before performing weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result, the computer device may further obtain a probability difference between the first probability and the fusion probability, and, in response to the probability difference being smaller than a difference threshold, perform the weighting processing on the first probability and the fusion probability based on the two weights to obtain the sequence detection result.
In this embodiment of the application, when the first probability and the fusion probability are close (for example, their difference is less than 0.5), both models can be considered to have predicted relatively accurately whether the target image sequence is a forged image sequence; the computer device may then weight the first probability and the fusion probability so as to decide whether the target image sequence is a forged image sequence by combining the timing information and the artifact traces.
In one possible implementation, detection failure information is output in response to the probability difference being not less than the difference threshold.
In this embodiment of the application, when the first probability and the fusion probability differ greatly (for example, their difference is not less than 0.5), one or both of the two models may have failed to accurately predict whether the target image sequence is a forged image sequence; the computer device may then output a prompt indicating that the detection has failed, so as to avoid outputting an erroneous result.
In one possible implementation, in response to the probability difference being not less than the difference threshold, obtaining the sequence detection result based on the first probability; or, in response to the probability difference not being less than the difference threshold, obtaining the sequence detection result based on the fusion probability.
In this embodiment of the application, when the first probability and the fusion probability differ greatly, the computer device may instead obtain the sequence detection result from only one of the first probability and the fusion probability.
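The probability-difference gate described in the last few paragraphs might be sketched as follows; the difference threshold of 0.5 follows the example above, and outputting a failure message is one of the options this application names (falling back to a single probability is the other):

```python
# Sketch of the probability-difference gate; thresholds are assumed values.

def gated_detection(first_prob, fusion_prob, w1=0.5, w2=0.5,
                    diff_threshold=0.5, prob_threshold=0.5):
    if abs(first_prob - fusion_prob) < diff_threshold:
        # The two models agree well enough: fuse by weighted summation.
        weighted = w1 * first_prob + w2 * fusion_prob
        return "forged" if weighted > prob_threshold else "not forged"
    # Large disagreement: report failure (or fall back to one probability).
    return "detection failed"

print(gated_detection(0.9, 0.8))   # close probabilities -> fused decision
print(gated_detection(0.9, 0.1))   # large difference -> detection failed
```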
In summary, in the image detection method provided in this embodiment of the application, two models are used: a first probability that the target image sequence is a forged image sequence is obtained from the timing information between at least two images in the sequence, and second probabilities that the individual images are forged images are obtained from the images' respective features. The two kinds of probability are then combined to decide comprehensively whether the target image sequence is a forged image sequence, so that both the timing information between the images and the artifact traces within the images are considered during detection, which improves the accuracy of detecting forged images.
The embodiments of the present application provide an image detection method based on a multi-modal mixing strategy. With the development of generative adversarial networks (GANs), face-swapping technology has matured rapidly, and face-swapped videos bring huge potential risks to society. Because a face-swapped video is generated by operating on single frames, the inherent timing information of the video modality is ignored; in addition, subtle artifacts may remain in the single-frame image modality of the forgery. The multi-modal mixing strategy of the present application is aimed at exactly these two phenomena.
Referring to fig. 6, a block diagram of an image detection process according to an exemplary embodiment of the present application is shown. As shown in fig. 6, taking AI face-swap video detection as an example, for an input video 61, the scheme shown in the present application first crops the face in each video frame by using a face detector 62 to obtain a face image sequence 63; the face image sequence 63 is then sent to an image-level model 64 and a video-level model 65, respectively.
The image-level model 64 predicts on the face image sequence 63 frame by frame, using a deep neural network to mine the forgery traces in each face image; a classifier yields the forgery probability (corresponding to the second probability) of each image in the face image sequence 63, and these probabilities are reduced by a median operation into a fusion probability 66, thereby avoiding the influence of outliers on the prediction.
The video-level model 65 directly takes the whole face image sequence 63 as input, models the timing information in the video by combining a deep neural network with an attention mechanism, and obtains the forgery probability 67 (corresponding to the first probability) of the face image sequence 63 through a classifier.
Finally, the computer device performs a weighted summation of the two models' predictions (the fusion probability 66 and the forgery probability 67) to obtain a detection result 68 indicating whether the face image sequence 63 comes from an AI face-swap video.
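The overall flow of fig. 6 might be sketched as follows; the face detector and the two models are placeholders standing in for any concrete implementations, and only the control flow mirrors the description above:

```python
# End-to-end sketch of the detection flow in fig. 6 (placeholders throughout).
import statistics

def detect_face_swap(video_frames, face_detector, image_model, video_model,
                     w_video=0.5, w_image=0.5, threshold=0.5):
    # 1) Crop the face from every frame to build the face image sequence.
    faces = [face_detector(frame) for frame in video_frames]
    # 2) Image-level model: frame-by-frame forgery probabilities,
    #    reduced by a median to suppress outliers.
    fusion_prob = statistics.median(image_model(face) for face in faces)
    # 3) Video-level model: one forgery probability for the whole sequence.
    forgery_prob = video_model(faces)
    # 4) Weighted summation of the two predictions.
    return w_video * forgery_prob + w_image * fusion_prob > threshold

# Toy stand-ins only show the call shape; real models would replace them.
print(detect_face_swap(
    ["frame1", "frame2", "frame3"],
    face_detector=lambda frame: frame,   # cropping would happen here
    image_model=lambda face: 0.8,        # per-frame forgery probability
    video_model=lambda faces: 0.9,       # sequence-level forgery probability
))  # -> True (detected as an AI face-swap video)
```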
In addition, the scheme is not tied to any specific face-swapping algorithm and can serve as a general forged-video detection method: it can identify various kinds of forged video data, such as face-swapped videos and edited videos, and has a certain cross-domain generalization capability. Moreover, applying the scheme requires no user interaction and little time, which improves the user experience.
Fig. 7 is a block diagram illustrating a configuration of an image detection apparatus according to an exemplary embodiment. The image detection device can implement all or part of the steps in the method provided by the embodiment shown in fig. 3 or fig. 4. The apparatus may include:
an image sequence obtaining module 701, configured to obtain a target image sequence, where the target image sequence includes at least two images in a same video;
a feature fusion module 702, configured to perform fusion processing based on an attention mechanism on image features of the at least two images through a first detection model to obtain fusion image features of the target image sequence;
a first feature processing module 703, configured to process the fused image feature through the first detection model to obtain a first probability of the target image sequence, where the first probability is used to indicate a probability that the target image sequence is a forged image sequence;
a second feature processing module 704, configured to process, through a second detection model, image features of the at least two images respectively, to obtain respective second probabilities of the at least two images, where the second probabilities are used to indicate probabilities that corresponding images are forged images;
a detection result obtaining module 705, configured to obtain a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, where the sequence detection result is used to indicate whether the target image sequence is a forged image sequence;
wherein the first detection model and the second detection model are obtained by training of an image sequence sample set; the image sequence sample set comprises at least two image sample sequence pairs, each image sample sequence pair comprises an image sequence positive sample and an image sequence negative sample, each image sample sequence has a corresponding sample label, and the sample label is used for indicating whether the corresponding image sample sequence is a forged image sample sequence.
In a possible implementation manner, the detection result obtaining module 705 is configured to perform weighting processing on the first probability and the second probability of each of the at least two images to obtain the sequence detection result.
In a possible implementation manner, the detection result obtaining module 705 includes:
a probability fusion unit, configured to fuse the second probabilities of the at least two images to obtain a fusion probability;
and a weighting processing unit, configured to perform weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability, and obtain the sequence detection result.
In a possible implementation manner, the probability fusion unit is configured to,
taking a median value of the second probabilities of the at least two images to obtain the fusion probability;
or,
averaging the second probabilities of the at least two images to obtain the fusion probability;
or,
and processing the second probabilities of the at least two images through a gated recurrent unit (GRU) to obtain the fusion probability.
In one possible implementation, the weight of the first probability is the same as the weight of the fusion probability.
In one possible implementation, the apparatus further includes:
an image quality acquisition module, configured to, before the weighting processing unit performs weighting processing on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result:
acquire image quality information of the target image sequence;
and acquire the weight of the first probability and the weight of the fusion probability based on the image quality information.
In one possible implementation, the apparatus further includes:
a probability difference obtaining module, configured to obtain a probability difference between the first probability and the fusion probability before a weighting processing unit performs weighting processing on the first probability and the fusion probability based on a weight of the first probability and a weight of the fusion probability to obtain the sequence detection result;
and the weighting processing unit is configured to, in response to the probability difference being smaller than a difference threshold, execute a weighting process on the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result.
In one possible implementation, the apparatus further includes:
an information output module, configured to output detection failure information in response to the probability difference being not less than the difference threshold.
In a possible implementation manner, the detection result obtaining module 705 is further configured to,
responsive to the probability difference being not less than the difference threshold, obtaining the sequence detection result based on the first probability;
or,
responsive to the probability difference being not less than the difference threshold, obtaining the sequence detection result based on the fusion probability.
In one possible implementation, the feature fusion module 702 is configured to,
performing feature extraction on the at least two images through a first feature extraction network in the first detection model to obtain first image features of the at least two images;
processing first image features of the at least two images through an attention network in the first detection model to obtain respective attention maps of the at least two images, wherein the attention maps are used for indicating weights of the first image features of the corresponding images;
and weighting the first image features of the at least two images through the attention network based on the attention maps of the at least two images to obtain the fused image features.
In one possible implementation, the feature fusion module 702 is configured to,
processing the first image features of the at least two images through at least one feature processing submodule connected in sequence in the first detection model to obtain the respective attention maps of the at least two images;
the feature processing submodule comprises a full connection layer, a batch normalization layer and a hyperbolic tangent function which are sequentially connected.
In one possible implementation manner, the second feature processing module 704 is configured to,
performing feature extraction on the at least two images through a second feature extraction network in the second detection model to obtain second image features of the at least two images;
and processing the second image features of the at least two images through a classification network in the second detection model to obtain the respective second probabilities of the at least two images.
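Likewise, the second detection model can be sketched as a feature extraction backbone followed by a classification network; the backbone architecture and all sizes below are placeholders, not specified by this application:

```python
# Sketch of the image-level (second) detection model: backbone + classifier.
import torch
import torch.nn as nn

class ImageLevelModel(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Any CNN backbone could serve as the second feature extraction network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Classification network: second probability that an image is forged.
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, images):                  # images: (batch, 3, H, W)
        return self.classifier(self.backbone(images)).squeeze(-1)

print(ImageLevelModel()(torch.randn(4, 3, 224, 224)).shape)  # torch.Size([4])
```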
In summary, with the image detection scheme provided in this embodiment of the application, two models are used: a first probability that the target image sequence is a forged image sequence is obtained from the timing information between at least two images in the sequence, and second probabilities that the individual images are forged images are obtained from the images' respective features. The two kinds of probability are then combined to decide comprehensively whether the target image sequence is a forged image sequence, so that both the timing information between the images and the artifact traces within the images are considered during detection, which improves the accuracy of detecting forged images.
Fig. 8 illustrates a block diagram of a computer device 800 according to an exemplary embodiment of the present application. The computer device may be implemented as a terminal or a server in the above-mentioned aspects of the present application. The computer device 800 includes a Central Processing Unit (CPU) 801, a system memory 804 including a Random Access Memory (RAM) 802 and a Read-Only Memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the CPU 801. The computer device 800 further includes a mass storage device 806 for storing an operating system 809, application programs 810 and other program modules 811.
The mass storage device 806 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 806 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 806 may include a computer-readable medium (not shown) such as a hard disk or Compact Disc-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 804 and mass storage device 806 as described above may be collectively referred to as memory.
According to various embodiments of the present disclosure, the computer device 800 may also run by means of a remote computer connected through a network such as the Internet. That is, the computer device 800 may be connected to the network 808 through the network interface unit 807 attached to the system bus 805, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 807.
The memory further stores at least one computer program; the central processing unit 801 executes the at least one computer program to implement all or part of the steps of the methods in the above embodiments.
In an exemplary embodiment, a computer readable storage medium is also provided for storing at least one computer program, which is loaded and executed by a processor to implement all or part of the steps of the above method. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises computer instructions, which are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform all or part of the steps of the method.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. An image detection method, characterized in that the method comprises:
acquiring a target image sequence, wherein the target image sequence comprises at least two images in the same video;
performing fusion processing based on an attention mechanism on the image features of the at least two images through a first detection model to obtain fusion image features of the target image sequence;
processing the fused image feature by the first detection model to obtain a first probability of the target image sequence, the first probability indicating a probability that the target image sequence is a counterfeit image sequence;
respectively processing the image characteristics of the at least two images through a second detection model to obtain respective second probabilities of the at least two images, wherein the second probabilities are used for indicating the probability that the corresponding images are forged images;
acquiring a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, wherein the sequence detection result is used for indicating whether the target image sequence is a forged image sequence;
wherein the first detection model and the second detection model are obtained by training of an image sequence sample set; the image sequence sample set comprises at least two image sample sequence pairs, each image sample sequence pair comprises an image sequence positive sample and an image sequence negative sample, each image sample sequence has a corresponding sample label, and the sample label is used for indicating whether the corresponding image sample sequence is a forged image sample sequence.
2. The method of claim 1, wherein obtaining the sequence detection result for the sequence of target images based on the first probability and the second probability for each of the at least two images comprises:
and weighting the first probability and the second probability of each of the at least two images to obtain the sequence detection result.
3. The method of claim 2, wherein weighting the first probability and the second probability of each of the at least two images to obtain the sequence detection result comprises:
fusing the second probabilities of the at least two images to obtain a fusion probability;
and weighting the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result.
4. The method according to claim 3, wherein said fusing the second probabilities of the respective at least two images to obtain a fused probability comprises:
taking a median value of the second probabilities of the at least two images to obtain the fusion probability;
or,
averaging the second probabilities of the at least two images to obtain the fusion probability;
or,
and processing the second probabilities of the at least two images through a gated recurrent unit to obtain the fusion probability.
5. The method of claim 3, wherein the weight of the first probability is the same as the weight of the fusion probability.
6. The method according to claim 3, wherein before the weighting the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result, the method further comprises:
acquiring image quality information of the target image sequence;
and acquiring the weight of the first probability and the weight of the fusion probability based on the image quality information.
7. The method according to claim 3, wherein before the weighting the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result, the method further comprises:
acquiring a probability difference value between the first probability and the fusion probability;
the weighting processing of the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result includes:
in response to the probability difference being smaller than a difference threshold, performing the step of weighting the first probability and the fusion probability based on the weight of the first probability and the weight of the fusion probability to obtain the sequence detection result.
8. The method of claim 7, further comprising:
and outputting detection failure information in response to the probability difference not being smaller than the difference threshold.
9. The method of claim 7, further comprising:
responsive to the probability difference being not less than the difference threshold, obtaining the sequence detection result based on the first probability;
or,
and responding to the probability difference value not smaller than the difference value threshold value, and acquiring the sequence detection result based on the fusion probability.
10. The method according to any one of claims 1 to 9, wherein the obtaining of the fused image feature of the target image sequence by performing attention-based fusion processing on the image features of the at least two images through the first detection model comprises:
performing feature extraction on the at least two images through a first feature extraction network in the first detection model to obtain first image features of the at least two images;
processing first image features of the at least two images through an attention network in the first detection model to obtain respective attention maps of the at least two images, wherein the attention maps are used for indicating weights of the first image features of the corresponding images;
and weighting the first image features of the at least two images through the attention network based on the attention maps of the at least two images to obtain the fused image features.
11. The method of claim 10, wherein the processing first image features of the at least two images through an attention network in the first detection model to obtain respective attention maps of the at least two images comprises:
processing the first image features of the at least two images through at least one feature processing submodule connected in sequence in the first detection model to obtain the respective attention maps of the at least two images;
the feature processing submodule comprises a full connection layer, a batch normalization layer and a hyperbolic tangent function which are sequentially connected.
12. The method according to any one of claims 1 to 9, wherein the processing the image features of the at least two images through the second detection model to obtain the second probabilities of the at least two images respectively comprises:
performing feature extraction on the at least two images through a second feature extraction network in the second detection model to obtain second image features of the at least two images;
and processing the second image characteristics of the at least two images through a classification network in the second detection model to obtain respective second probabilities of the at least two images.
13. An image detection apparatus, characterized in that the apparatus comprises:
the image sequence acquisition module is used for acquiring a target image sequence, and the target image sequence comprises at least two images in the same video;
the feature fusion module is used for performing fusion processing based on an attention mechanism on the image features of the at least two images through a first detection model to obtain fusion image features of the target image sequence;
a first feature processing module, configured to process the fused image feature through the first detection model to obtain a first probability of the target image sequence, where the first probability is used to indicate a probability that the target image sequence is a forged image sequence;
the second feature processing module is configured to process the image features of the at least two images through a second detection model, so as to obtain respective second probabilities of the at least two images, where the second probabilities are used to indicate probabilities that corresponding images are forged images;
a detection result obtaining module, configured to obtain a sequence detection result of the target image sequence based on the first probability and the second probability of each of the at least two images, where the sequence detection result is used to indicate whether the target image sequence is a forged image sequence;
wherein the first detection model and the second detection model are obtained by training of an image sequence sample set; the image sequence sample set comprises at least two image sample sequence pairs, each image sample sequence pair comprises an image sequence positive sample and an image sequence negative sample, each image sample sequence has a corresponding sample label, and the sample label is used for indicating whether the corresponding image sample sequence is a forged image sample sequence.
14. A computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the image detection method according to any one of claims 1 to 12.
15. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to implement the image detection method according to any one of claims 1 to 12.
CN202110127828.3A 2021-01-29 2021-01-29 Image detection method, image detection device, computer equipment and storage medium Active CN112749686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110127828.3A CN112749686B (en) 2021-01-29 2021-01-29 Image detection method, image detection device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112749686A true CN112749686A (en) 2021-05-04
CN112749686B CN112749686B (en) 2021-10-29

Family

ID=75653382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110127828.3A Active CN112749686B (en) 2021-01-29 2021-01-29 Image detection method, image detection device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112749686B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100134250A1 (en) * 2008-12-02 2010-06-03 Electronics And Telecommunications Research Institute Forged face detecting method and apparatus thereof
US20100158319A1 (en) * 2008-12-22 2010-06-24 Electronics And Telecommunications Research Institute Method and apparatus for fake-face detection using range information
US20130163829A1 (en) * 2011-12-21 2013-06-27 Electronics And Telecommunications Research Institute System for recognizing disguised face using gabor feature and svm classifier and method thereof
CN105893920A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Human face vivo detection method and device
CN109359502A (en) * 2018-08-13 2019-02-19 北京市商汤科技开发有限公司 False-proof detection method and device, electronic equipment, storage medium
CN109344747A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 A kind of recognition methods that distorting figure, storage medium and server
CN110414350A (en) * 2019-06-26 2019-11-05 浙江大学 The face false-proof detection method of two-way convolutional neural networks based on attention model
CN111191549A (en) * 2019-12-23 2020-05-22 浙江大学 Two-stage face anti-counterfeiting detection method
CN111144314A (en) * 2019-12-27 2020-05-12 北京中科研究院 Method for detecting tampered face video
CN111160286A (en) * 2019-12-31 2020-05-15 中国电子科技集团公司信息科学研究院 Video authenticity identification method
CN111767784A (en) * 2020-05-11 2020-10-13 山东财经大学 False face intrusion detection method based on face potential blood vessel distribution
CN111652878A (en) * 2020-06-16 2020-09-11 腾讯科技(深圳)有限公司 Image detection method, image detection device, computer equipment and storage medium
CN111461089A (en) * 2020-06-17 2020-07-28 腾讯科技(深圳)有限公司 Face detection method, and training method and device of face detection model
CN111967344A (en) * 2020-07-28 2020-11-20 南京信息工程大学 Refined feature fusion method for face forgery video detection
CN111986180A (en) * 2020-08-21 2020-11-24 中国科学技术大学 Face forged video detection method based on multi-correlation frame attention mechanism
CN111986179A (en) * 2020-08-21 2020-11-24 中国科学技术大学 Face tampering image detector
CN111738244A (en) * 2020-08-26 2020-10-02 腾讯科技(深圳)有限公司 Image detection method, image detection device, computer equipment and storage medium
CN112183501A (en) * 2020-11-27 2021-01-05 北京智源人工智能研究院 Depth counterfeit image detection method and device

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
D. H. Choi et al., "Fake Video Detection With Certainty-Based Attention Network," 2020 IEEE International Conference on Image Processing.
Daniel Mas Montserrat et al., "Deepfakes Detection with Automatic Face Weighting," https://arxiv.org/abs/2004.12027v2.
Hao Dang et al., "On the Detection of Digital Face Manipulation," https://arxiv.org/abs/1910.01717v5.
Qian Y. et al., "Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues," Computer Vision – ECCV 2020.
Wang X. et al., "Face Manipulation Detection via Auxiliary Supervision," Neural Information Processing, ICONIP 2020.
Xiaodan Li et al., "Sharp Multiple Instance Learning for DeepFake Video Detection," in Proceedings of the 28th ACM International Conference on Multimedia.
Cao Yuhong et al., "A Survey of Intelligent Face Forgery and Detection," Engineering Studies (Engineering in an Interdisciplinary Perspective).
Chen Peng et al., "Forged Face Video Detection Fusing Global Temporal and Local Spatial Features," Journal of Cyber Security.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420806A (en) * 2021-06-21 2021-09-21 西安电子科技大学 Face detection quality scoring method and system
CN113420806B (en) * 2021-06-21 2023-02-03 西安电子科技大学 Face detection quality scoring method and system

Also Published As

Publication number Publication date
CN112749686B (en) 2021-10-29

Similar Documents

Publication Publication Date Title
US20230041233A1 (en) Image recognition method and apparatus, computing device, and computer-readable storage medium
US20230116801A1 (en) Image authenticity detection method and device, computer device, and storage medium
CN111461089B (en) Face detection method, and training method and device of face detection model
WO2022161286A1 (en) Image detection method, model training method, device, medium, and program product
Wang et al. A video is worth more than 1000 lies. Comparing 3DCNN approaches for detecting deepfakes
CN108229335A (en) It is associated with face identification method and device, electronic equipment, storage medium, program
CN114331829A (en) Countermeasure sample generation method, device, equipment and readable storage medium
CN113014566B (en) Malicious registration detection method and device, computer readable medium and electronic device
CN111222500A (en) Label extraction method and device
CN110929806B (en) Picture processing method and device based on artificial intelligence and electronic equipment
WO2021184754A1 (en) Video comparison method and apparatus, computer device and storage medium
CN111291863B (en) Training method of face changing identification model, face changing identification method, device and equipment
CN111666901A (en) Living body face detection method and device, electronic equipment and storage medium
CN112749686B (en) Image detection method, image detection device, computer equipment and storage medium
CN111062019A (en) User attack detection method and device and electronic equipment
Wang et al. Mutuality-oriented reconstruction and prediction hybrid network for video anomaly detection
CN112651333B (en) Silence living body detection method, silence living body detection device, terminal equipment and storage medium
CN113762034A (en) Video classification method and device, storage medium and electronic equipment
Chowdhury et al. Review on deep fake: A looming technological threat
US11935331B2 (en) Methods and systems for real-time electronic verification of content with varying features in data-sparse computer environments
CN114842411A (en) Group behavior identification method based on complementary space-time information modeling
CN113518061A (en) Data transmission method, device, apparatus, system and medium in face recognition
CN114760484B (en) Live video identification method, live video identification device, computer equipment and storage medium
CN109670470B (en) Pedestrian relationship identification method, device and system and electronic equipment
CN112215180B (en) Living body detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40044208

Country of ref document: HK

GR01 Patent grant