CN114078274A - Face image detection method and device, electronic equipment and storage medium - Google Patents

Face image detection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114078274A
CN114078274A CN202111272607.1A
Authority
CN
China
Prior art keywords
feature vector
semantic representation
representation model
face image
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111272607.1A
Other languages
Chinese (zh)
Inventor
缪长涛
王强昌
谭资昌
郭国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111272607.1A priority Critical patent/CN114078274A/en
Publication of CN114078274A publication Critical patent/CN114078274A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The present disclosure provides a face image detection method and apparatus, an electronic device, and a storage medium, which relate to the technical field of artificial intelligence, and in particular to the fields of natural language processing, computer vision, and deep learning. The specific implementation scheme is as follows: an image vector of a face image to be detected is input into a first semantic representation model and a second semantic representation model respectively, to obtain a first feature vector and a second feature vector output by the two ith-stage networks; feature fusion processing is performed on the first feature vector and the second feature vector, and the fused first feature vector and the fused second feature vector are input into the (i+1)th-stage network of the corresponding model; and the detection result of the face image is determined according to the first detection result and the second detection result output by the two Nth-stage networks. In this way, features can be extracted by combining multiple semantic representation models, the extracted features are fused at multiple levels, and a real-or-fake judgment is made on the fused features, thereby improving the accuracy of face forgery detection.

Description

Face image detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of natural language processing, computer vision, and deep learning, and specifically to a face image detection method and apparatus, an electronic device, and a storage medium.
Background
Currently, face forgery detection in the industry refers to methods for judging whether a face image has been tampered with by deep learning techniques (i.e., whether the meaning of the original content has been changed). In the related art, it is difficult to judge whether a face image tampered with by deep learning techniques is authentic or forged.
Disclosure of Invention
The disclosure provides a face image detection method, a face image detection device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a face image detection method, including: respectively inputting an image vector of a human face image to be detected into a first semantic representation model and a second semantic representation model to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1; performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector; inputting the fused first feature vector into an i +1 stage network in the first semantic representation model, and inputting the fused second feature vector into an i +1 stage network in the second semantic representation model; and determining the detection result of the face image according to the first detection result of the Nth stage network in the first semantic representation model and the second detection result of the Nth stage network in the second semantic representation model.
According to another aspect of the present disclosure, there is provided a training method of a joint model, including: constructing an initial joint model, wherein the joint model comprises: n stage networks of a first semantic representation model, a head network of a second semantic representation model, N stage networks, N-1 feature fusion networks and a head feature fusion network; the ith feature fusion network is respectively connected with the two ith stage networks and the two (i + 1) th stage networks; the head feature fusion network is respectively connected with the head network and the two first-stage networks; i is a positive integer from 1 to N-1; acquiring training data, wherein the training data comprises a sample face image and a corresponding label, and the label represents that the sample face image is real or forged; and training the combined model by taking the image vector of the sample face image as the input of the combined model and taking the label corresponding to the sample face image as the output of the combined model.
According to another aspect of the present disclosure, there is provided a face image detection apparatus including: the first input module is used for respectively inputting the image vector of the face image to be detected into a first semantic representation model and a second semantic representation model so as to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by an ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1; the feature fusion module is used for performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector; the second input module is used for inputting the fused first feature vector into the (i + 1) th stage network in the first semantic representation model and inputting the fused second feature vector into the (i + 1) th stage network in the second semantic representation model; and the first determining module is used for determining the detection result of the face image according to the first detection result of the Nth stage network in the first semantic representation model and the second detection result of the Nth stage network in the second semantic representation model.
According to another aspect of the present disclosure, there is provided a training apparatus of a joint model, including: a construction module configured to construct an initial joined model, wherein the joined model comprises: n stage networks of a first semantic representation model, a head network of a second semantic representation model, N stage networks, N-1 feature fusion networks and a head feature fusion network; the ith feature fusion network is respectively connected with the two ith stage networks and the two (i + 1) th stage networks; the head feature fusion network is respectively connected with the head network and the two first-stage networks; i is a positive integer from 1 to N-1; the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring training data, the training data comprises a sample face image and a corresponding label, and the label represents the truth or the falseness of the sample face image; and the training module is used for training the combined model by taking the image vector of the sample face image as the input of the combined model and taking the label corresponding to the sample face image as the output of the combined model.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the face image detection method set forth in the above aspect of the present disclosure, or the training method of the joint model set forth in the above aspect of the present disclosure.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the face image detection method set forth in the above aspect of the present disclosure, or the training method of the joint model set forth in the above aspect of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the face image detection method set forth in the above aspect of the present disclosure, or the steps of the training method of the joint model set forth in the above aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a joint model;
FIG. 5 is a schematic diagram of a feature fusion network in the joint model;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device implementing an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Currently, face forgery detection in the industry refers to methods for judging whether a face image has been tampered with by deep learning techniques (i.e., whether the meaning of the original content has been changed). In the related art, a simple neural network is mainly used for real-versus-fake classification, or a deep learning method is used to automatically extract forgery features and judge whether a face image is real or forged, and the accuracy of face forgery detection is poor.
In order to solve the above problems, the present disclosure provides a face image detection method, an apparatus, an electronic device, and a storage medium.
Fig. 1 is a schematic diagram of a first embodiment of the present disclosure, and it should be noted that the face image detection method according to the embodiment of the present disclosure can be applied to a face image detection apparatus, which can be configured in an electronic device, so that the electronic device can perform a face image detection function.
The electronic device may be any device with computing capability, for example a personal computer (PC), a mobile terminal, or a server; the mobile terminal may be, for example, a hardware device with an operating system, a touch screen, and/or a display screen, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, or a wearable device.
As shown in fig. 1, the face image detection method may include the following steps:
step 101, respectively inputting an image vector of a human face image to be detected into a first semantic representation model and a second semantic representation model to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1.
In the embodiment of the present disclosure, before step 101, the method may further include a process of determining an image vector of a face image to be detected. The process may specifically comprise the steps of: determining a human face image to be detected; carrying out blocking processing on the face image to obtain a plurality of image blocks; respectively carrying out vector conversion processing on a plurality of image blocks, and determining a vector of each image block; and determining a vector sequence formed by the vectors of the image blocks as an image vector of the human face image.
In the embodiment of the present disclosure, the blocking processing of the face image and the vector conversion processing of the blocks can obtain the vectors of the plurality of image blocks, retain more features in the face image, facilitate subsequent processing, and further improve the accuracy of face forgery detection.
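A minimal sketch of this blocking-and-vectorization step, assuming a NumPy image array and simple patch flattening (a real embedding would typically add a learned linear projection on top of each flattened patch):

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an H x W x C face image into non-overlapping patches and
    flatten each patch into a vector; the resulting sequence plays the
    role of the 'image vector' described above."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    patches = (
        image.reshape(rows, patch_size, cols, patch_size, c)
        .transpose(0, 2, 1, 3, 4)               # group the patch grid first
        .reshape(rows * cols, patch_size * patch_size * c)
    )
    return patches  # shape: (num_patches, patch_dim)

# Example: a 224x224 RGB image with 16x16 patches -> 196 vectors of length 768
img = np.zeros((224, 224, 3), dtype=np.float32)
vecs = image_to_patch_vectors(img, 16)
print(vecs.shape)  # (196, 768)
```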
In the embodiment of the present disclosure, the first semantic representation model may be used to extract global features of the face image, and the second semantic representation model may be used to extract local features of the face image. The first semantic representation model can be, for example, a transformer model, and the transformer model has global feature extraction capability and can establish a long-distance relationship among a plurality of image blocks in the face image from a global perspective. The second semantic representation model may be, for example, a CNN model, which is capable of extracting local artifact features in the face image. Therefore, the comprehensiveness of the extracted features is ensured through the extraction of the global features and the local features, and the accuracy of face forgery detection is further improved.
In the embodiment of the present disclosure, taking the first semantic representation model as a Transformer model as an example, the model may include 4 stage networks, that is, N is 4, each stage network includes 3 Transformer network layers, and the model includes 12 Transformer network layers in total.
In the embodiment of the present disclosure, taking the second semantic representation model as a CNN model as an example, the model may include a header network and 4 stage networks, each of which includes 3 CNN network layers.
In the embodiment of the present disclosure, in a case where the first semantic representation model is not provided with a head network, and the second semantic representation model is provided with a head network, the process of the human face image detection apparatus executing step 101 may be, for example, inputting an image vector into the head network of the second semantic representation model to obtain a head feature vector output by the head network; performing feature fusion processing according to the image vector and the head feature vector to obtain a fused image vector and a fused head feature vector; inputting the fused image vector into a first semantic representation model, and inputting the fused head feature vector into a first-stage network of a second semantic representation model to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model.
The head feature vector is obtained by processing the second semantic representation model, and local features of the face image can be embodied. The image vector can embody global features as well as local features. And performing feature fusion processing according to the image vector and the head feature vector, so that the fused feature vector can embody global features and local features, and more features in the face image can be utilized, thereby further improving the accuracy of face forgery detection.
And 102, performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector.
In the embodiment of the disclosure, the first feature vector is obtained by processing the first semantic representation model, and can embody the global features of the face image; the second feature vector is obtained by processing the second semantic representation model, and can embody the local features of the face image. And performing feature fusion processing according to the global features and the local features of the face image, so that the fused feature vectors can embody the global features and the local features, and more features in the face image can be utilized, thereby further improving the accuracy of face forgery detection.
Step 103, inputting the fused first feature vector into the i +1 stage network in the first semantic representation model, and inputting the fused second feature vector into the i +1 stage network in the second semantic representation model.
In the embodiment of the present disclosure, i is a positive integer from 1 to N-1, with an initial value of 1; each time step 102 and step 103 are executed, i is incremented by 1. Taking N as 4 as an example: after the image vector of the face image to be detected is input into the first semantic representation model and the second semantic representation model respectively, the first feature vector output by the first-stage network in the first semantic representation model and the second feature vector output by the first-stage network in the second semantic representation model are first obtained; feature fusion processing is performed to obtain the fused first feature vector and the fused second feature vector; the fused first feature vector is input into the second-stage network in the first semantic representation model; and the fused second feature vector is input into the second-stage network in the second semantic representation model.
Then, the first feature vector output by the second-stage network in the first semantic representation model and the second feature vector output by the second-stage network in the second semantic representation model are obtained; feature fusion processing is performed to obtain the fused first feature vector and the fused second feature vector; the fused first feature vector is input into the third-stage network in the first semantic representation model; and the fused second feature vector is input into the third-stage network in the second semantic representation model.
Then, the first feature vector output by the third-stage network in the first semantic representation model and the second feature vector output by the third-stage network in the second semantic representation model are obtained; feature fusion processing is performed to obtain the fused first feature vector and the fused second feature vector; the fused first feature vector is input into the fourth-stage network in the first semantic representation model; and the fused second feature vector is input into the fourth-stage network in the second semantic representation model.
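The alternating stage-and-fusion forward pass walked through above can be sketched with toy stand-ins for the stage networks and the fusion operation (the random linear maps and the cross-weighting fusion rule here are hypothetical placeholders, not the patent's actual networks):

```python
import numpy as np

rng = np.random.default_rng(0)
N, DIM = 4, 8  # 4 stage networks per model, toy feature dimension

# Stand-ins for the per-stage networks of the two models: random linear maps.
stages_a = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N)]
stages_b = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N)]

def fuse(v1, v2):
    """Toy feature fusion: mix a little of each branch into the other."""
    return v1 + 0.1 * v2, v2 + 0.1 * v1

x_a = x_b = rng.standard_normal(DIM)  # image vector fed to both branches
for i in range(N):
    x_a, x_b = stages_a[i] @ x_a, stages_b[i] @ x_b  # i-th stage outputs
    if i < N - 1:            # fusion happens only between stages 1..N-1
        x_a, x_b = fuse(x_a, x_b)
# x_a and x_b now play the role of the two N-th stage features
print(x_a.shape, x_b.shape)
```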
And 104, determining a detection result of the face image according to a first detection result of the Nth-stage network in the first semantic representation model and a second detection result of the Nth-stage network in the second semantic representation model.
In the disclosed embodiment, the first detection result may include a forgery probability and a true probability; the second detection result may include a forgery probability and a true probability. Correspondingly, in order to improve the detection accuracy of the detection result, the face image detection apparatus may execute the process of step 104, for example, to determine the forgery probability of the face image according to the forgery probability in the first detection result and the forgery probability in the second detection result; determining the real probability of the face image according to the real probability in the first detection result and the real probability in the second detection result; and determining the detection result of the face image according to the true probability and the forgery probability of the face image.
The forgery probability of the face image may be determined, for example, by averaging the forgery probability in the first detection result and the forgery probability in the second detection result, and taking the average value as the forgery probability of the face image. Likewise, the true probability of the face image may be determined by averaging the true probability in the first detection result and the true probability in the second detection result, and taking the average value as the true probability of the face image.
The process of determining the detection result of the face image according to the true probability and the forgery probability of the face image may be, for example, determining that the face image is true if the true probability is greater than the forgery probability; if the true probability is smaller than the counterfeiting probability, determining the counterfeiting of the face image; and if the true probability is equal to the forgery probability, determining that the face image is true or forged, or detecting the face image again and the like.
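A hedged sketch of this averaging-and-comparison decision rule (the dictionary keys `real`/`fake` are illustrative names, not from the disclosure):

```python
def combine_detections(first: dict, second: dict) -> str:
    """Average the real/fake probabilities from the two model heads,
    then compare them, per the decision rule described above."""
    fake = (first["fake"] + second["fake"]) / 2
    real = (first["real"] + second["real"]) / 2
    if real > fake:
        return "real"
    if real < fake:
        return "fake"
    return "undetermined"  # tie case: re-detect, as noted above

print(combine_detections({"real": 0.7, "fake": 0.3},
                         {"real": 0.6, "fake": 0.4}))  # prints "real"
```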
The face image detection method of the embodiment of the disclosure inputs the image vector of the face image to be detected into the first semantic representation model and the second semantic representation model respectively, to obtain the first feature vector output by the ith-stage network in the first semantic representation model and the second feature vector output by the ith-stage network in the second semantic representation model, where the number of stage networks in each model is N and i is a positive integer from 1 to N-1; performs feature fusion processing on the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector; inputs the fused first feature vector into the (i+1)th-stage network in the first semantic representation model, and inputs the fused second feature vector into the (i+1)th-stage network in the second semantic representation model; and determines the detection result of the face image according to the first detection result of the Nth-stage network in the first semantic representation model and the second detection result of the Nth-stage network in the second semantic representation model. In this way, features extracted by multiple semantic representation models can be combined and fused at multiple levels, and a real-or-fake judgment is made based on the fused features, thereby improving the accuracy of face forgery detection.
In order to further improve the accuracy of face forgery detection, as shown in fig. 2, fig. 2 is a schematic diagram according to a second embodiment of the disclosure, in the embodiment of the disclosure, a weight adjustment value of two eigenvectors may be determined according to the first eigenvector and the second eigenvector; and determining to obtain two fused feature vectors by combining the weight adjustment value and the weight initial value, and further determining a detection result. The embodiment shown in fig. 2 may include the following steps:
step 201, respectively inputting an image vector of a human face image to be detected into a first semantic representation model and a second semantic representation model to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1.
Step 202, determining a weight adjustment value of the first feature vector and a weight adjustment value of the second feature vector according to the first feature vector and the second feature vector.
In the embodiment of the present disclosure, the process of the face image detection apparatus executing step 202 may be, for example, to perform frequency domain conversion on the first feature vector and the second feature vector respectively to obtain a first frequency domain feature vector and a second frequency domain feature vector; determining a weight adjustment value of a second feature vector according to the first frequency domain feature vector; and determining a weight adjustment value of the first characteristic vector according to the second frequency domain characteristic vector.
Performing frequency-domain conversion on the first feature vector yields the first frequency-domain feature vector, which carries a larger amount of information than the first feature vector, especially information about forged features. The weight adjustment value of the second feature vector is then determined according to the first frequency-domain feature vector: if the amount of forged-feature information in the first frequency-domain feature vector is large, the weight of the first feature vector should be increased, and the weight adjustment value of the second feature vector can be reduced accordingly. By determining the weight adjustment value of the second feature vector from the first frequency-domain feature vector and, symmetrically, the weight adjustment value of the first feature vector from the second frequency-domain feature vector, the weights of the first feature vector and the second feature vector can be determined accurately, which further improves the accuracy of face forgery detection.
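One possible reading of this scheme, sketched with a NumPy FFT and a sigmoid squashing step (the sigmoid and the mean-magnitude reduction are assumptions for illustration; the disclosure does not specify how the adjustment value is computed from the frequency-domain vector):

```python
import numpy as np

def weight_adjustments(v1: np.ndarray, v2: np.ndarray):
    """Derive each feature vector's weight adjustment value from the
    OTHER vector's frequency-domain representation, as described above."""
    f1 = np.abs(np.fft.fft(v1))  # first frequency-domain feature vector
    f2 = np.abs(np.fft.fft(v2))  # second frequency-domain feature vector
    # Squash the mean frequency-domain magnitude into (0, 1) via a sigmoid.
    adj_for_v2 = 1.0 / (1.0 + np.exp(-f1.mean()))  # from the first freq. vector
    adj_for_v1 = 1.0 / (1.0 + np.exp(-f2.mean()))  # from the second freq. vector
    return adj_for_v1, adj_for_v2

a1, a2 = weight_adjustments(np.ones(8), np.zeros(8))
print(round(a1, 3), round(a2, 3))  # 0.5 0.731
```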
Step 203, determining the fused first feature vector according to the first feature vector and the corresponding weight initial value and weight adjustment value.
In the embodiment of the present disclosure, the initial weight value is 1, and in an example, the first eigenvector may be multiplied by the weight adjustment value, and the first eigenvector is added to the multiplication result, so as to obtain the fused first eigenvector. In another example, the weight adjustment value may be added to the weight initial value, and the first feature vector is multiplied by the addition result to obtain a fused first feature vector.
And 204, determining the fused second feature vector according to the second feature vector and the corresponding weight initial value and weight adjustment value.
In the embodiment of the present disclosure, the initial weight value is 1, and in an example, the weight adjustment value may be multiplied by the second feature vector, and the second feature vector is added to the multiplication result to obtain the fused second feature vector. In another example, the weight adjustment value may be added to the weight initial value, and the second eigenvector may be multiplied by the addition result to obtain the fused second eigenvector.
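The two fusion variants described in steps 203 and 204 can be written out directly; note that with a weight initial value of 1 they coincide, since v + w·v = (1 + w)·v:

```python
import numpy as np

def fuse_residual(v: np.ndarray, w_adj: float) -> np.ndarray:
    """First example: multiply the vector by the weight adjustment value
    and add the original vector back: v + w_adj * v."""
    return v + w_adj * v

def fuse_scaled(v: np.ndarray, w_adj: float, w_init: float = 1.0) -> np.ndarray:
    """Second example: add the adjustment to the initial weight and scale
    the vector by the sum: (w_init + w_adj) * v."""
    return (w_init + w_adj) * v

v = np.array([1.0, 2.0, 3.0])
fused = fuse_residual(v, 0.5)
print(fused)  # [1.5 3.  4.5]
```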
Step 205, inputting the fused first feature vector into the i +1 stage network in the first semantic representation model, and inputting the fused second feature vector into the i +1 stage network in the second semantic representation model.
And step 206, determining a detection result of the face image according to a first detection result of the Nth-stage network in the first semantic representation model and a second detection result of the Nth-stage network in the second semantic representation model.
In the embodiment of the present disclosure, the detailed descriptions of step 201, step 205, and step 206 may refer to the detailed descriptions of step 101, step 103, and step 104 in the implementation shown in fig. 1, and are not described in detail here.
The face image detection method of the embodiment of the disclosure inputs the image vector of the face image to be detected into the first semantic representation model and the second semantic representation model respectively, to obtain the first feature vector output by the ith-stage network in the first semantic representation model and the second feature vector output by the ith-stage network in the second semantic representation model, where the number of stage networks in each model is N and i is a positive integer from 1 to N-1; determines the weight adjustment value of the first feature vector and the weight adjustment value of the second feature vector according to the two feature vectors; determines the fused first feature vector according to the first feature vector and its corresponding weight initial value and weight adjustment value; determines the fused second feature vector according to the second feature vector and its corresponding weight initial value and weight adjustment value; inputs the fused first feature vector into the (i+1)th-stage network in the first semantic representation model, and inputs the fused second feature vector into the (i+1)th-stage network in the second semantic representation model; and determines the detection result of the face image according to the first detection result of the Nth-stage network in the first semantic representation model and the second detection result of the Nth-stage network in the second semantic representation model. In this way, features extracted by multiple semantic representation models can be combined, weight-adjusted, and fused at multiple levels, and a real-or-fake judgment is made based on the fused features, thereby further improving the accuracy of face forgery detection.
Fig. 3 is a schematic diagram of a third embodiment of the present disclosure, and it should be noted that the training method of the joint model according to the embodiment of the present disclosure is applicable to a training apparatus of the joint model, and the apparatus may be configured in an electronic device, so that the electronic device may perform a training function of the joint model.
As shown in fig. 3, the training method of the joint model may include the following steps:
step 301, constructing an initial joint model, wherein the joint model includes: the N stage networks of a first semantic representation model; the head network and the N stage networks of a second semantic representation model; N-1 feature fusion networks; and a head feature fusion network. The ith feature fusion network is connected to the two ith stage networks and the two (i+1)th stage networks, respectively; the head feature fusion network is connected to the head network and the two first-stage networks, respectively; i is a positive integer from 1 to N-1.
In the embodiment of the present disclosure, the first semantic representation model may be used to extract global features of the face image, and the second semantic representation model may be used to extract local features of the face image. The first semantic representation model may be, for example, a Transformer model, which has global feature extraction capability and can establish long-distance relationships among the image blocks of the face image from a global perspective. The second semantic representation model may be, for example, a CNN model, which is able to extract local forgery-artifact features in the face image.
In the embodiment of the present disclosure, a schematic diagram of the joint model may be as shown in fig. 4. In fig. 4, the first semantic representation model may be a Transformer model that includes 4 stage networks (that is, N is 4), each stage network including 3 Transformer network layers, for a total of 12 Transformer network layers. The second semantic representation model may be a CNN model that includes a head network and 4 stage networks, each stage network including 3 CNN network layers.
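The composition just described can be recorded as a structural sketch (a wiring inventory, not a runnable network): a Transformer branch with N stage networks, a CNN branch with a head network plus N stage networks, N-1 feature fusion networks between consecutive stages, and one head feature fusion network. All names below are illustrative.

```python
def build_joint_model_skeleton(n_stages=4, layers_per_stage=3):
    """Structural sketch of the joint model of fig. 4: records which
    sub-networks exist and how the fusion networks bridge the two
    branches. Defaults follow the N = 4, 3-layers-per-stage example."""
    return {
        "transformer_stages": [f"transformer_stage_{i + 1}"
                               for i in range(n_stages)],
        "cnn_head": "bottleneck",
        "cnn_stages": [f"cnn_stage_{i + 1}" for i in range(n_stages)],
        "head_fusion": "head_feature_fusion",
        # the ith fusion network sits between the two ith stage
        # networks and the two (i+1)th stage networks
        "stage_fusions": [f"feature_fusion_{i + 1}"
                          for i in range(n_stages - 1)],
        "layers_per_stage": layers_per_stage,
    }
```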
In fig. 4, the blocking and vector conversion block represents the blocking processing and vector conversion processing of the face image, which yields a vector sequence used as the image vector of the face image. In fig. 4, the inputs to the head feature fusion network include: the image vector, and a head feature vector output by the head network (bottleneck) of the CNN model. The fused image vector output by the head feature fusion network is provided to the first-stage network of the Transformer model, and the fused head feature vector output by the head feature fusion network is provided to the first-stage network of the CNN model.
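The blocking and vector conversion step can be sketched as follows for a single-channel image. The patch size and function name are illustrative, and a real implementation would typically also apply a learned linear projection to each flattened block.

```python
def image_to_patch_vectors(image, patch):
    """Split an H x W image (a list of pixel rows) into non-overlapping
    patch x patch blocks and flatten each block into a vector; the
    resulting vector sequence serves as the image vector."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    sequence = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = [image[r + dr][c + dc]
                     for dr in range(patch)
                     for dc in range(patch)]
            sequence.append(block)
    return sequence
```

For a 4x4 image with patch size 2, this yields 4 block vectors of length 4 each.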
In the embodiment of the present disclosure, a schematic diagram of a feature fusion network in the joint model may be as shown in fig. 5. In fig. 5, taking as an example the feature fusion network connected to the two first-stage networks and the two second-stage networks, the inputs of the feature fusion network include: a first feature vector (Xg) output by the first-stage network of the Transformer model, and a second feature vector (Xl) output by the first-stage network of the CNN model. The first FFR module in the feature fusion network performs frequency domain conversion on the first feature vector, determines the weight adjustment value of the second feature vector based on the conversion result, and then computes the fused second feature vector; the second FFR module performs frequency domain conversion on the second feature vector, determines the weight adjustment value of the first feature vector based on the conversion result, and then computes the fused first feature vector.
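The cross-branch weighting performed by the FFR modules can be sketched as below. The patent only states that each feature vector is converted to the frequency domain and that the conversion result yields the other branch's weight adjustment value; the naive DFT, the sigmoid squashing, the initial weight of 1.0, and all names here are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(vec):
    """Naive discrete Fourier transform; returns the magnitude
    spectrum of a real-valued feature vector."""
    n = len(vec)
    return [abs(sum(vec[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def weight_adjustment_from(other_vec):
    """Map the other branch's frequency-domain energy to a scalar
    adjustment in (0, 1) (assumed squashing; not from the patent)."""
    mags = dft_magnitudes(other_vec)
    mean_mag = sum(mags) / len(mags)
    return 1.0 / (1.0 + math.exp(-mean_mag))  # sigmoid

def ffr_fuse(xg, xl, w_init=1.0):
    """Cross fusion: Xg's spectrum adjusts Xl's weight and vice versa,
    mirroring the two FFR modules of fig. 5."""
    adj_l = weight_adjustment_from(xg)  # first FFR: adjusts Xl
    adj_g = weight_adjustment_from(xl)  # second FFR: adjusts Xg
    fused_g = [(w_init + adj_g) * v for v in xg]
    fused_l = [(w_init + adj_l) * v for v in xl]
    return fused_g, fused_l
```

A production model would use a fast FFT and learn the mapping from spectrum to adjustment value rather than a fixed sigmoid.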
Step 302, training data is obtained, wherein the training data includes a sample face image and a corresponding label, and the label represents that the sample face image is real or forged.
step 303, training the joint model by taking the image vector of the sample face image as the input of the joint model and taking the label corresponding to the sample face image as the output of the joint model.
In the embodiment of the disclosure, specifically, the image vector of the sample face image is input to the first-stage network of the first semantic representation model and to the head network of the second semantic representation model; a first prediction result output by the last-stage network of the first semantic representation model is obtained, and a second prediction result output by the last-stage network of the second semantic representation model is obtained; a value of the loss function is determined according to the first prediction result, the second prediction result, the label of the sample face image, and a preset loss function; and coefficients in the joint model are adjusted according to the value of the loss function, thereby realizing the training of the joint model.
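The loss computation above can be sketched minimally. The patent leaves the preset loss function unspecified, so this sketch assumes it is the sum of the two branches' binary cross-entropies against the shared real/forged label; the function names and the choice of cross-entropy are illustrative.

```python
import math

def binary_cross_entropy(pred_forged, label_forged):
    """Cross-entropy of one branch's forged-probability prediction;
    label_forged is 1 for a forged sample, 0 for a real one."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(pred_forged, eps), 1.0 - eps)
    return -(label_forged * math.log(p)
             + (1 - label_forged) * math.log(1.0 - p))

def joint_loss(first_pred, second_pred, label_forged):
    """Assumed preset loss: sum of the two branches' losses; the
    joint model's coefficients would be adjusted to reduce it."""
    return (binary_cross_entropy(first_pred, label_forged)
            + binary_cross_entropy(second_pred, label_forged))
```

More confident (more correct) predictions from both branches yield a smaller joint loss, which is the signal used to adjust the model coefficients.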
In the training method of the joint model of the embodiment of the disclosure, an initial joint model is constructed, wherein the joint model includes: the N stage networks of a first semantic representation model; the head network and the N stage networks of a second semantic representation model; N-1 feature fusion networks; and a head feature fusion network; the ith feature fusion network is connected to the two ith stage networks and the two (i+1)th stage networks, respectively; the head feature fusion network is connected to the head network and the two first-stage networks, respectively; i is a positive integer from 1 to N-1. Training data is acquired, where the training data includes a sample face image and a corresponding label, and the label represents that the sample face image is real or forged; and the joint model is trained by taking the image vector of the sample face image as the input of the joint model and the label corresponding to the sample face image as the output of the joint model. In this way, the trained joint model can combine the features extracted by multiple semantic representation models, fuse the extracted features at multiple levels, and make a real-or-forged judgment based on the fused features, improving the accuracy of face forgery detection.
In order to implement the above embodiments, the present disclosure further provides a face image detection apparatus.
As shown in fig. 6, fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. The face image detection apparatus 600 includes: a first input module 610, a feature fusion module 620, a second input module 630, and a first determination module 640.
The first input module 610 is configured to input an image vector of a face image to be detected into a first semantic representation model and a second semantic representation model respectively, so as to obtain a first feature vector output by an i-th stage network in the first semantic representation model and a second feature vector output by the i-th stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1;
a feature fusion module 620, configured to perform feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector;
a second input module 630, configured to input the fused first feature vector into an i +1 th stage network in the first semantic representation model, and input the fused second feature vector into an i +1 th stage network in the second semantic representation model;
the first determining module 640 is configured to determine a detection result of the facial image according to a first detection result of the nth stage network in the first semantic representation model and a second detection result of the nth stage network in the second semantic representation model.
As a possible implementation manner of the embodiment of the present disclosure, the apparatus further includes: the device comprises a second determining module, a partitioning module, a vector converting module and a third determining module; the second determining module is used for determining the face image to be detected; the blocking module is used for carrying out blocking processing on the face image to obtain a plurality of image blocks; the vector conversion module is used for respectively carrying out vector conversion processing on the image blocks and determining the vector of each image block; the third determining module is configured to determine a vector sequence formed by the vectors of the plurality of image blocks as an image vector of the face image.
As a possible implementation manner of the embodiment of the present disclosure, the first semantic representation model is used to extract global features of the face image; the second semantic representation model is used for extracting local features of the face image.
As a possible implementation manner of the embodiment of the present disclosure, the first semantic representation model is not provided with a head network, and the second semantic representation model is provided with a head network; the first input module 610 is specifically configured to input the image vector into a head network of the second semantic representation model to obtain a head feature vector output by the head network; performing feature fusion processing according to the image vector and the head feature vector to obtain a fused image vector and a fused head feature vector; and inputting the fused image vector into the first semantic representation model, and inputting the fused head feature vector into the first-stage network of the second semantic representation model to obtain a first feature vector output by the ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model.
As a possible implementation manner of the embodiment of the present disclosure, the feature fusion module 620 is specifically configured to determine a weight adjustment value of the first feature vector and a weight adjustment value of the second feature vector according to the first feature vector and the second feature vector; determining the fused first feature vector according to the first feature vector and the corresponding weight initial value and weight adjustment value; and determining the fused second feature vector according to the second feature vector and the corresponding weight initial value and weight adjustment value.
As a possible implementation manner of the embodiment of the present disclosure, the feature fusion module 620 is specifically configured to perform frequency domain conversion on the first feature vector and the second feature vector, respectively, to obtain a first frequency domain feature vector and a second frequency domain feature vector; determining a weight adjustment value of the second feature vector according to the first frequency domain feature vector; and determining a weight adjustment value of the first feature vector according to the second frequency domain feature vector.
As a possible implementation manner of the embodiment of the present disclosure, the first determining module 640 is specifically configured to determine a forgery probability of the face image according to a forgery probability in the first detection result and a forgery probability in the second detection result; determining the real probability of the face image according to the real probability in the first detection result and the real probability in the second detection result; and determining the detection result of the face image according to the true probability and the false probability of the face image.
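The decision step performed by the first determining module can be sketched as below. The patent does not fix how the two branches' probabilities are combined, so the element-wise average and the argmax verdict used here are assumptions, and the dictionary keys are illustrative.

```python
def combine_detections(first_result, second_result):
    """first_result / second_result: dicts holding the 'real' and
    'forged' probabilities output by the two branches' Nth stage
    networks. Assumed combination: average each class, then pick
    the larger combined probability as the verdict."""
    forged = (first_result["forged"] + second_result["forged"]) / 2.0
    real = (first_result["real"] + second_result["real"]) / 2.0
    verdict = "forged" if forged > real else "real"
    return {"real": real, "forged": forged, "verdict": verdict}
```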
In the face image detection device of the embodiment of the disclosure, an image vector of a face image to be detected is input into a first semantic representation model and a second semantic representation model respectively, so as to obtain a first feature vector output by an ith stage network in the first semantic representation model and a second feature vector output by an ith stage network in the second semantic representation model, where the number of stage networks in each of the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1; feature fusion processing is performed according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector; the fused first feature vector is input into the (i+1)th stage network in the first semantic representation model, and the fused second feature vector is input into the (i+1)th stage network in the second semantic representation model; and the detection result of the face image is determined according to a first detection result of the Nth stage network in the first semantic representation model and a second detection result of the Nth stage network in the second semantic representation model. In this way, the features extracted by multiple semantic representation models can be combined and fused at multiple levels, and a real-or-forged judgment can be made based on the fused features, improving the accuracy of face forgery detection.
In order to implement the above embodiments, the present disclosure further provides a training apparatus for a combined model.
As shown in fig. 7, fig. 7 is a schematic diagram according to a fifth embodiment of the present disclosure. The training apparatus 700 of the joint model includes: a build module 710, an acquisition module 720, and a training module 730.
A building module 710, configured to build an initial joint model, where the joint model includes: the N stage networks of a first semantic representation model; the head network and the N stage networks of a second semantic representation model; N-1 feature fusion networks; and a head feature fusion network; the ith feature fusion network is connected to the two ith stage networks and the two (i+1)th stage networks, respectively; the head feature fusion network is connected to the head network and the two first-stage networks, respectively; i is a positive integer from 1 to N-1;
an obtaining module 720, configured to obtain training data, where the training data includes a sample face image and a corresponding label, and the label represents that the sample face image is real or forged;
and a training module 730, configured to train the joint model by taking the image vector of the sample face image as the input of the joint model and the label corresponding to the sample face image as the output of the joint model.
In the training device of the joint model of the embodiment of the disclosure, an initial joint model is constructed, wherein the joint model includes: the N stage networks of a first semantic representation model; the head network and the N stage networks of a second semantic representation model; N-1 feature fusion networks; and a head feature fusion network; the ith feature fusion network is connected to the two ith stage networks and the two (i+1)th stage networks, respectively; the head feature fusion network is connected to the head network and the two first-stage networks, respectively; i is a positive integer from 1 to N-1. Training data is acquired, where the training data includes a sample face image and a corresponding label, and the label represents that the sample face image is real or forged; and the joint model is trained by taking the image vector of the sample face image as the input of the joint model and the label corresponding to the sample face image as the output of the joint model. In this way, the trained joint model can combine the features extracted by multiple semantic representation models, fuse the extracted features at multiple levels, and make a real-or-forged judgment based on the fused features, improving the accuracy of face forgery detection.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the personal information of the users involved are all performed with the consent of the users, comply with the relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. Various programs and data required for the operation of the electronic device 800 can also be stored in the RAM 803. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the methods and processes described above, such as the face image detection method or the training method of the joint model. For example, in some embodiments, the face image detection method or the training method of the joint model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the face image detection method, or one or more steps of the training method of the joint model, may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the face image detection method or the training method of the joint model in any other suitable way (for example, by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A face image detection method comprises the following steps:
respectively inputting an image vector of a human face image to be detected into a first semantic representation model and a second semantic representation model to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1;
performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector;
inputting the fused first feature vector into an i +1 stage network in the first semantic representation model, and inputting the fused second feature vector into an i +1 stage network in the second semantic representation model;
and determining the detection result of the face image according to the first detection result of the Nth stage network in the first semantic representation model and the second detection result of the Nth stage network in the second semantic representation model.
2. The method according to claim 1, wherein before inputting the image vectors of the face image to be detected into the first and second semantic representation models, respectively, the method further comprises:
determining the face image to be detected;
carrying out blocking processing on the face image to obtain a plurality of image blocks;
respectively carrying out vector conversion processing on the image blocks, and determining a vector of each image block;
and determining a vector sequence formed by the vectors of the image blocks as the image vector of the face image.
3. The method of claim 1, wherein the first semantic representation model is used to extract global features of the face image;
the second semantic representation model is used for extracting local features of the face image.
4. The method of claim 1, wherein the first semantic representation model is not provided with a head network and the second semantic representation model is provided with a head network;
the step of respectively inputting the image vectors of the human face image to be detected into a first semantic representation model and a second semantic representation model comprises the following steps:
inputting the image vector into a head network of the second semantic representation model to obtain a head feature vector output by the head network;
performing feature fusion processing according to the image vector and the head feature vector to obtain a fused image vector and a fused head feature vector;
and inputting the fused image vector into the first semantic representation model, and inputting the fused head feature vector into the first-stage network of the second semantic representation model to obtain a first feature vector output by the ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model.
5. The method according to claim 1, wherein the performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector comprises:
determining a weight adjustment value of the first feature vector and a weight adjustment value of the second feature vector according to the first feature vector and the second feature vector;
determining the fused first feature vector according to the first feature vector and the corresponding weight initial value and weight adjustment value;
and determining the fused second feature vector according to the second feature vector and the corresponding weight initial value and weight adjustment value.
6. The method of claim 5, wherein the determining a weight adjustment value for the first eigenvector and a weight adjustment value for the second eigenvector from the first eigenvector and the second eigenvector comprises:
respectively carrying out frequency domain conversion on the first feature vector and the second feature vector to obtain a first frequency domain feature vector and a second frequency domain feature vector;
determining a weight adjustment value of the second feature vector according to the first frequency domain feature vector;
and determining a weight adjustment value of the first feature vector according to the second frequency domain feature vector.
7. The method of claim 1, wherein the determining the detection result of the face image according to the first detection result of the Nth stage network in the first semantic representation model and the second detection result of the Nth stage network in the second semantic representation model comprises:
determining the forgery probability of the face image according to the forgery probability in the first detection result and the forgery probability in the second detection result;
determining the real probability of the face image according to the real probability in the first detection result and the real probability in the second detection result;
and determining the detection result of the face image according to the true probability and the false probability of the face image.
8. A method of training a joint model, comprising:
constructing an initial joint model, wherein the joint model comprises: the N stage networks of a first semantic representation model; the head network and the N stage networks of a second semantic representation model; N-1 feature fusion networks; and a head feature fusion network; the ith feature fusion network is connected to the two ith stage networks and the two (i+1)th stage networks, respectively; the head feature fusion network is connected to the head network and the two first-stage networks, respectively; i is a positive integer from 1 to N-1;
acquiring training data, wherein the training data comprises a sample face image and a corresponding label, and the label represents that the sample face image is real or forged;
and training the joint model by taking the image vector of the sample face image as the input of the joint model and taking the label corresponding to the sample face image as the expected output of the joint model.
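The joint-model wiring described in claim 8 can be sketched as a toy forward pass. The `stage` and `fuse` functions below are hypothetical stand-ins for the stage networks and feature fusion networks; a real implementation would use learned layers and the frequency-domain fusion of claims 5-6.

```python
def stage(vec, scale):
    """Toy stand-in for one stage network (hypothetical)."""
    return [scale * x for x in vec]

def fuse(a, b):
    """Toy stand-in for a feature fusion network: exchange information
    between branches by averaging (hypothetical simplification)."""
    mixed = [(x + y) / 2 for x, y in zip(a, b)]
    return mixed, mixed

def joint_forward(image_vec, n_stages=3):
    # Head network of the second semantic representation model
    # (the first model has no head network, per claims 4/12).
    head = stage(image_vec, 0.5)
    # Head feature fusion network feeds both first-stage networks.
    a, b = fuse(image_vec, head)
    for i in range(n_stages):
        a = stage(a, 1.1)  # stage i+1 of the first model
        b = stage(b, 0.9)  # stage i+1 of the second model
        if i < n_stages - 1:  # the N-1 feature fusion networks
            a, b = fuse(a, b)
    return a, b  # fed to the two detection heads
```

The structure mirrors the claim: one head feature fusion network before the first stages, then a fusion network between each consecutive pair of stages, for N-1 fusion networks in total.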
9. A face image detection apparatus comprising:
the first input module is used for respectively inputting the image vector of the face image to be detected into a first semantic representation model and a second semantic representation model so as to obtain a first feature vector output by an ith-stage network in the first semantic representation model and a second feature vector output by an ith-stage network in the second semantic representation model; the number of the stage networks in the first semantic representation model and the second semantic representation model is N, and i is a positive integer from 1 to N-1;
the feature fusion module is used for performing feature fusion processing according to the first feature vector and the second feature vector to obtain a fused first feature vector and a fused second feature vector;
the second input module is used for inputting the fused first feature vector into the (i + 1) th stage network in the first semantic representation model and inputting the fused second feature vector into the (i + 1) th stage network in the second semantic representation model;
and the first determining module is used for determining the detection result of the face image according to the first detection result of the Nth stage network in the first semantic representation model and the second detection result of the Nth stage network in the second semantic representation model.
10. The apparatus of claim 9, wherein the apparatus further comprises: a second determining module, a blocking module, a vector conversion module and a third determining module;
the second determining module is used for determining the face image to be detected;
the blocking module is used for carrying out blocking processing on the face image to obtain a plurality of image blocks;
the vector conversion module is used for respectively carrying out vector conversion processing on the image blocks and determining the vector of each image block;
the third determining module is configured to determine a vector sequence formed by the vectors of the plurality of image blocks as an image vector of the face image.
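The blocking and vectorization pipeline of claim 10 can be illustrated with a minimal sketch. Non-overlapping square patches and raster-order flattening are hypothetical choices here; the claim does not fix the block shape or the vector conversion.

```python
def to_image_vector(image, block=2):
    """Split an H x W image into non-overlapping block x block patches,
    flatten each patch to a vector, and return the vector sequence
    used as the image vector of the face image."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            patch = [image[r + dr][c + dc]
                     for dr in range(block)
                     for dc in range(block)]
            patches.append(patch)
    return patches

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
seq = to_image_vector(img)  # 4 patches, each flattened to length 4
```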
11. The apparatus of claim 9, wherein the first semantic representation model is configured to extract global features of the face image;
the second semantic representation model is used for extracting local features of the face image.
12. The apparatus of claim 9, wherein the first semantic representation model is not provided with a head network and the second semantic representation model is provided with a head network; the first input module is specifically configured to,
inputting the image vector into a head network of the second semantic representation model to obtain a head feature vector output by the head network;
performing feature fusion processing according to the image vector and the head feature vector to obtain a fused image vector and a fused head feature vector;
and inputting the fused image vector into the first semantic representation model, and inputting the fused head feature vector into the first-stage network of the second semantic representation model to obtain a first feature vector output by the ith-stage network in the first semantic representation model and a second feature vector output by the ith-stage network in the second semantic representation model.
13. The apparatus of claim 9, wherein the feature fusion module is specifically configured to,
determining a weight adjustment value of the first feature vector and a weight adjustment value of the second feature vector according to the first feature vector and the second feature vector;
determining the fused first feature vector according to the first feature vector and the corresponding weight initial value and weight adjustment value;
and determining the fused second feature vector according to the second feature vector and the corresponding weight initial value and weight adjustment value.
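The weighted fusion of claim 13 can be illustrated as follows. Combining the initial weight and the adjustment value additively, then scaling the feature vector, is a hypothetical form; the claim does not fix how the two weights are combined.

```python
def fused_feature(feat, w_init, w_adj):
    """Fuse a feature vector with its initial weight and the weight
    adjustment derived from the other branch (additive combination
    is a hypothetical choice)."""
    w = w_init + w_adj
    return [w * x for x in feat]

# Initial weight 0.5, cross-branch adjustment 0.25 → effective weight 0.75.
fused = fused_feature([1.0, 2.0], 0.5, 0.25)  # → [0.75, 1.5]
```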
14. The apparatus of claim 13, wherein the feature fusion module is specifically configured to,
respectively carrying out frequency domain conversion on the first feature vector and the second feature vector to obtain a first frequency domain feature vector and a second frequency domain feature vector;
determining a weight adjustment value of the second feature vector according to the first frequency domain feature vector;
and determining a weight adjustment value of the first feature vector according to the second frequency domain feature vector.
15. The apparatus of claim 9, wherein the first determining module is specifically configured to,
determining the forgery probability of the face image according to the forgery probability in the first detection result and the forgery probability in the second detection result;
determining the real probability of the face image according to the real probability in the first detection result and the real probability in the second detection result;
and determining the detection result of the face image according to the real probability and the forgery probability of the face image.
16. A training apparatus for a joint model, comprising:
a construction module configured to construct an initial joint model, wherein the joint model comprises: N stage networks of a first semantic representation model, a head network and N stage networks of a second semantic representation model, N-1 feature fusion networks, and a head feature fusion network; the ith feature fusion network is respectively connected with the two ith-stage networks and the two (i + 1) th-stage networks; the head feature fusion network is respectively connected with the head network and the two first-stage networks; i is a positive integer from 1 to N-1;
an acquisition module configured to acquire training data, wherein the training data comprises a sample face image and a corresponding label, and the label represents that the sample face image is real or forged;
and a training module configured to train the joint model by taking the image vector of the sample face image as the input of the joint model and taking the label corresponding to the sample face image as the expected output of the joint model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7; or, performing the method of claim 8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7; or, performing the method of claim 8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7; or implements the method of claim 8.
CN202111272607.1A 2021-10-29 2021-10-29 Face image detection method and device, electronic equipment and storage medium Pending CN114078274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111272607.1A CN114078274A (en) 2021-10-29 2021-10-29 Face image detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114078274A true CN114078274A (en) 2022-02-22

Family

ID=80283557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111272607.1A Pending CN114078274A (en) 2021-10-29 2021-10-29 Face image detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114078274A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331048A (en) * 2022-07-29 2022-11-11 北京百度网讯科技有限公司 Image classification method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN115063875B (en) Model training method, image processing method and device and electronic equipment
CN112861885B (en) Image recognition method, device, electronic equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
US20220374678A1 (en) Method for determining pre-training model, electronic device and storage medium
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN112949767A (en) Sample image increment, image detection model training and image detection method
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113657395A (en) Text recognition method, and training method and device of visual feature extraction model
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN114494747A (en) Model training method, image processing method, device, electronic device and medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN114220163B (en) Human body posture estimation method and device, electronic equipment and storage medium
US20230154077A1 (en) Training method for character generation model, character generation method, apparatus and storage medium
CN113191261B (en) Image category identification method and device and electronic equipment
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN113792876A (en) Backbone network generation method, device, equipment and storage medium
CN114445682A (en) Method, device, electronic equipment, storage medium and product for training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination