CN113936343A - Face image false distinguishing method based on multi-local feature voting - Google Patents

Face image false distinguishing method based on multi-local feature voting

Info

Publication number
CN113936343A
Authority
CN
China
Prior art keywords
face image
face
analyzed
image
surrounding frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111557690.7A
Other languages
Chinese (zh)
Inventor
杨理想
王银瑞
张侨
居思刚
张连海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Xingyao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xingyao Intelligent Technology Co ltd filed Critical Nanjing Xingyao Intelligent Technology Co ltd
Priority to CN202111557690.7A priority Critical patent/CN113936343A/en
Publication of CN113936343A publication Critical patent/CN113936343A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                                • G06F18/2155 Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
                            • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
                        • G06F18/23 Clustering techniques
                            • G06F18/232 Non-hierarchical techniques
                                • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
                                    • G06F18/23213 Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
                        • G06F18/24 Classification techniques
                        • G06F18/25 Fusion techniques
                            • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
                            • G06F18/259 Fusion by voting

Abstract

The invention discloses a face image forgery detection method based on multi-local-feature voting. The method preprocesses a face image to obtain the image together with the bounding-box coordinates of its key regions and uses both as the input for feature extraction, which effectively reduces the computational complexity of the recognition process. A convolutional neural network extracts more discriminative local and global features, which acts as a form of sample augmentation and provides some robustness to occlusion changes in small-sample settings. In addition, the method samples the features extracted by the convolutional neural network at the key regions, which removes the noise features that the real background introduces into the feature map, effectively improving the recognition performance and the accuracy of the result.

Description

Face image false distinguishing method based on multi-local feature voting
Technical Field
The invention belongs to the technical field of computer image processing, and in particular relates to a face image forgery detection method based on multi-local-feature voting.
Background
Existing methods distinguish AI-synthesized face images with a deep neural network: the image is fed into a mainstream convolutional neural network for feature extraction, and a fully connected layer then classifies the high-dimensional features as real or fake. Specifically, real and fake face data are prepared first. After a face image is read into the computer, preprocessing operations such as random cropping, random horizontal flipping and normalization are applied. The training set is then iterated: in each step a fixed-size batch of augmented face images and their labels is fed into a mainstream convolutional neural network such as ResNet50, whose 1000-way fully connected output layer is replaced by a 2-way one. The loss between the ResNet50 output and the ground truth is computed with a cross-entropy function, the gradient of every parameter in the network is derived, and the parameters are updated with stochastic gradient descent (SGD). This iteration is repeated until the performance of the network model no longer improves. However, such a method classifies every part of the extracted feature map even though some parts correspond only to the real background; this effectively introduces noise features, so its face-image recognition performance is low.
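For illustration only, a minimal sketch of this baseline, assuming PyTorch/torchvision (the data pipeline and hyperparameters are assumptions, not taken from the patent):

```python
# Minimal sketch of the prior-art baseline, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # replace the 1000-way head with a 2-way real/fake head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_epoch(loader):
    """loader yields (images, labels): augmented face crops and 0/1 fake/real labels."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # cross entropy between output and ground truth
        loss.backward()                          # gradients for every ResNet50 parameter
        optimizer.step()                         # SGD update
```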
Disclosure of Invention
In view of this, the invention provides a face image forgery detection method based on multi-local-feature voting, which can effectively determine whether a face image is genuine or forged.
The invention provides a face image forgery detection method based on multi-local-feature voting, comprising the following steps:
step 1, labeling the forged regions in known face images, and analyzing them to determine the key regions needed for face image forgery detection;
step 2, preprocessing a video to be analyzed to obtain all the face images to be analyzed contained in it; according to the key regions determined in step 1, determining the bounding-box coordinates corresponding to those key regions in every face image to be analyzed;
step 3, establishing a face image forgery detection model based on multi-local-feature voting, comprising: a feature extraction unit, a sampling-pooling unit, a classification unit and an evaluation unit;
the feature extraction unit extracts features from the face image to be analyzed to obtain a feature map; the sampling-pooling unit quantitatively samples the feature map using the bounding-box coordinates and, after pooling, obtains a local feature map for each key region; the classification unit performs classification prediction on each local feature map to obtain a set of prediction results; the evaluation unit evaluates these prediction results and iteratively updates the parameters of the forgery detection model according to the evaluation until a preset constraint condition is met;
step 4, preprocessing a sample video as in step 2 to obtain sample face images and the sample bounding-box coordinates of every key region in them; using the sample face images and sample bounding-box coordinates as training inputs and the genuineness of each sample face image as its label, constructing a training sample data set; and training the forgery detection model on this data set;
step 5, in application, obtaining all the face images to be analyzed and their bounding-box coordinates from the video to be analyzed as in step 2 and feeding them into the forgery detection model trained in step 4; the model outputs a prediction result for each local feature map; these predictions are then converted into true/false verdicts, the verdicts are put to a vote, and the verdict whose vote count exceeds the threshold is output as the final forgery detection result.
Further, the preprocessing of the video to be analyzed in step 2 also includes an alignment operation on the face images to be analyzed, comprising the following steps:
step 2.1, applying an affine transformation to the pixels of the face region in the face image to be analyzed to complete the alignment and obtain a frontal face image to be analyzed;
step 2.2, according to the key regions determined in step 1, obtaining from the frontal face image the frontal bounding-box coordinates corresponding to each key region;
step 2.3, taking the frontal face image as the face image to be analyzed and the frontal bounding-box coordinates as the bounding-box coordinates, and outputting both.
Further, the face detection model may employ a face detector model, a 68-dimensional face key point detection model, or a face feature extraction model.
Further, the key regions in step 1 include: the eyebrows, nose, mouth, left cheek and right cheek.
Further, the feature extraction unit is implemented with the convolutional neural network ResNet18.
Further, the classification unit consists of fully connected layers equal in number to the key regions.
Further, the evaluation unit computes a cross-entropy loss over the prediction results to obtain a loss value, computes the gradients of the network through the automatic differentiation mechanism of the neural network framework, and updates the parameters of the forgery detection model with stochastic gradient descent.
Beneficial effects:
1. Preprocessing the face image yields the image together with the bounding-box coordinates of its key regions, and both are used as the input for feature extraction, which effectively reduces the computational complexity of the recognition process. A convolutional neural network extracts more discriminative local and global features, which acts as a form of sample augmentation and provides some robustness to occlusion changes in small-sample settings. In addition, sampling the CNN features at the key regions removes the noise features that the real background introduces into the feature map, effectively improving both the recognition performance and the accuracy of the result.
2. The invention aligns the face according to its key points, thereby locating the key regions in the face image, and extracts image features from the aligned image and the key-region coordinates, which effectively improves the accuracy of the computation in the recognition process.
Drawings
Fig. 1 is a schematic processing flow diagram of the face image forgery detection method based on multi-local-feature voting according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a face image forgery detection method based on multi-local-feature voting, whose overall flow is shown in Fig. 1 and which comprises the following steps:
Step 1: label the forged regions in the face images and, by analyzing them, determine the regions that are forged most frequently as the key regions needed for forgery detection. A forged region is the forged part of a tampered face image, and may also be the forged part of a generated (synthesized) face image. Experimental analysis shows that the key regions generally include the eyebrows, nose, mouth, left cheek and right cheek; the invention uses the data of these key regions as the multi-local features of the face image.
The key regions determined in this step are the principal basis for detecting forged face images and the decisive factor in detection accuracy. Specifically, a bounding box can be used to frame every forged region visible to the human eye in a tampered or generated face image, thereby fixing the position of every forged region; the bounding-box coordinates serve as the unique identifier and mathematical representation of a forged region. The regions forged most frequently can then be obtained by unsupervised learning over all the forged regions; these are named the key regions.
The unsupervised learning over all the forged regions can be carried out with the K-means clustering method, e.g. as sketched below.
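A minimal sketch of this step, assuming the forged-region annotations are available as (x1, y1, x2, y2) bounding boxes and that scikit-learn's KMeans is used; the patent names only the K-means method itself:

```python
# Sketch of key-region discovery by clustering annotated forged-region boxes.
import numpy as np
from sklearn.cluster import KMeans

boxes = np.load("forged_region_boxes.npy")        # hypothetical file, shape (N, 4): x1, y1, x2, y2
centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                    (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(centers)
# Each cluster centre approximates one frequently forged region; in the patent's
# experiments these correspond to the eyebrows, nose, mouth and the two cheeks.
print(kmeans.cluster_centers_)
```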
Step 2: preprocess the video to be analyzed, extract all the face images to be analyzed that it contains, and on this basis determine the bounding-box coordinates of each key region in all of those images.
The training sample data set is constructed with the same preprocessing: preprocess a sample video to obtain sample face images, determine the bounding-box coordinates of the key regions in all of them based on the key regions from step 1, use the sample face images and bounding-box coordinates as training inputs, and use the genuineness of each sample face image as its label.
Further, because the face in an image to be analyzed may be tilted relative to the frame, which degrades recognition accuracy, a face alignment operation can be added to the preprocessing to address this and further improve accuracy. Specifically, the preprocessing comprises the following steps:
and 2.1, extracting a face image to be analyzed from the video to be analyzed by adopting the existing face detection model, and performing affine transformation on pixel points of a face region in the face image to be analyzed to complete the alignment operation of the face image to be analyzed so as to obtain a face front image to be analyzed.
For example, existing face detection models include: a face detector model, a 68-dimensional face key point detection model, a face feature extraction model and the like.
Step 2.2: according to the key regions determined in step 1, obtain the bounding-box coordinates corresponding to each key region in the face image to be analyzed, i.e. the bounding boxes of its five key regions: the eyebrows, nose, mouth, left cheek and right cheek.
Step 2.3: output the frontal face image to be analyzed together with its corresponding bounding-box coordinates.
Step 3: establish the face image forgery detection model based on multi-local-feature voting, which comprises a feature extraction unit, a sampling-pooling unit, a classification unit and an evaluation unit.
The feature extraction unit extracts features from the face image to be analyzed obtained in step 2, producing its feature map. In the invention, five convolution blocks connected in series constitute the feature extraction unit: the face image to be analyzed I is passed through the five blocks in sequence, and forward iteration yields the final convolution feature map F1. The process is expressed by the formula

F1 = Conv(I)

where Conv() denotes the forward pass through the five convolution blocks.
The sampling-pooling unit quantitatively samples the feature map output by the feature extraction unit using the bounding-box coordinates obtained in step 2, producing the local feature map of each key region.
The sampling-pooling unit obtains the local feature maps as follows: a region-feature aggregation operation (e.g. ROI Align) uses the scale invariance between the face image to be analyzed I and its feature map F1 to determine the feature region in F1 corresponding to each key region of the image, and the feature map is then pooled, according to the size of each feature region, into one local feature map per key region.
Further, the frontal face image obtained after alignment is processed in the same way: region-feature aggregation determines the feature region and its coordinates in the frontal image's feature map for each key region, and the feature map is pooled into one local feature map per key region, e.g. as in the sketch below.
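As an illustration of such region-feature aggregation, torchvision's roi_align operator can produce the per-region local feature maps; the 3x3 output size and the 1/32 scale are taken from the embodiment below, while the remaining configuration (and the box values) is an assumption:

```python
# Sketch of the sampling-pooling unit using torchvision's ROI Align.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 512, 8, 8)              # F1 for one 256x256 frontal face image
# one box per key region: (batch_index, x1, y1, x2, y2) in original-image coordinates
boxes = torch.tensor([[0., 96., 60., 160., 90.],     # e.g. an eyebrow region (values illustrative)
                      [0., 100., 90., 156., 150.]])  # e.g. a nose region
local_feats = roi_align(feature_map, boxes,
                        output_size=(3, 3),          # each key region pooled to 3x3
                        spatial_scale=1 / 32,        # image-to-feature-map scale
                        sampling_ratio=2)            # 2x2 bilinear samples per pooled bin
print(local_feats.shape)                             # (2, 512, 3, 3): one local map per box
```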
The classification unit performs classification prediction on the local feature maps output by the sampling-pooling unit. In the invention, one classifier per key region predicts on the corresponding local feature map, yielding one prediction result per region; each classifier can be implemented as a fully connected layer.
The evaluation unit trains the forgery detection model: during the training stage it evaluates the model's outputs and iteratively updates the model parameters according to the evaluation until the preset constraint condition is met, completing the training. For example, a cross-entropy loss is computed on the outputs of the classification unit to obtain a loss value, the gradients of the network are computed through the automatic differentiation mechanism of the neural network framework, and the parameters are updated and optimized with stochastic gradient descent.
Step 4: train the forgery detection model based on multi-local-feature voting with the training sample data set generated by the preprocessing of step 2.
Specifically, the inputs of the training data set serve as the model's training inputs: each sample face image is fed into the feature extraction unit and, together with its bounding-box coordinates, into the sampling-pooling unit; the evaluation unit computes a loss value from the outputs of the classification unit and optimizes the model parameters accordingly; training ends once the loss value satisfies the constraint condition, yielding the trained forgery detection model. One such training step might look like the sketch below.
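In this sketch the tensor shapes and the summing of the five per-region cross-entropy losses are assumptions; the patent only states that cross entropy is computed on the prediction results:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def train_step(backbone, heads, optimizer, image, boxes, label):
    """backbone: feature extraction unit; heads: five nn.Linear(512 * 9, 2) classifiers
    (sizes assumed); image: (1, 3, 256, 256) sample face image; boxes: (5, 5) with
    batch index first; label: (1,) tensor, 1 = real, 0 = fake."""
    f1 = backbone(image)                                             # convolution feature map F1
    local_maps = roi_align(f1, boxes, (3, 3), spatial_scale=1 / 32)  # five 3x3 local maps
    loss = torch.zeros(())
    for head, feat in zip(heads, local_maps):            # one classifier per key region
        logits = head(feat.flatten().unsqueeze(0))       # (1, 2) prediction result
        loss = loss + nn.functional.cross_entropy(logits, label)
    optimizer.zero_grad()
    loss.backward()                                      # autodiff computes all gradients
    optimizer.step()                                     # stochastic gradient descent update
    return float(loss)
```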
Step 5: in application, first preprocess the video to be analyzed as in step 2 to obtain its face images and the bounding-box coordinates of the key regions in them; next, feed the face images and bounding-box coordinates into the forgery detection model based on multi-local-feature voting trained in step 4, whose output is a prediction result for each local feature map of every face image contained in the video; convert all prediction results into true/false verdicts; finally, put the verdicts to a vote, and the verdict whose vote count exceeds the threshold is the final classification result for the face images contained in the video.
For example, five fully connected layers are chosen as the classifiers FC1, FC2, FC3, FC4 and FC5 of the local feature maps of the five key regions. The five local feature maps are fed into the five classifiers respectively, which perform classification prediction and produce the five prediction results R1, R2, R3, R4 and R5; the classification process can be expressed by the formula

Ri = FCi(Pi), i = 1, ..., 5.

Then an argmax operation converts each of the five prediction results into a true/false verdict; finally the five verdicts are put to a vote, and when "true" or "false" receives a majority (at least three of the five votes) the image is judged accordingly.
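A minimal sketch of this argmax-and-vote decision; the class ordering (index 0 = false, index 1 = true) is an assumption:

```python
import torch

def vote(predictions):
    """predictions: the five (1, 2) logit tensors R1..R5 from FC1..FC5."""
    verdicts = [int(torch.argmax(r, dim=1)) for r in predictions]  # 1 = true, 0 = false
    return "real" if sum(verdicts) >= 3 else "fake"   # majority of the five votes decides
```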
Embodiment:
In this embodiment, 64 key points are used to describe the face image, and the face image forgery detection method based on multi-local-feature voting provided by the invention proceeds as follows:
Step 101: first, label the forged regions on the collected tampered or generated face data, framing each forged defect visible to the human eye with a bounding box to obtain its position. By unsupervised learning over the forged regions of all forged face images, concretely with the K-means clustering method, the five regions in which forged defects appear most frequently are obtained as the key regions; the five regions found experimentally are the eyebrows, nose, mouth, left cheek and right cheek.
Step 102: generate the training data required by the model, namely face images and the bounding-box coordinates of their five key regions. A face image is obtained from a video as follows: a face detection model extracts the 64 key points of the face in the video, and the bounding box of the face region in the image is derived from these 64 key points. Because the face region may be tilted relative to the image, it must be aligned; applying an affine transformation to the key points of the face region yields an image in which the face is frontal.
The key points of the face region are denoted P, the frontal-face reference points are denoted Q, and the affine transformation matrix C satisfies (in the least-squares sense)

Q = C · P

with P and Q taken in homogeneous coordinates.
the matrix C is subjected to parameter estimation by using a least square method, for example, each parameter in the matrix C may be calculated by using a built-in function numpy. After the parameters of the matrix C are obtained, carrying out affine transformation on each pixel in the face image to obtain a face front imageI
Meanwhile, according to 64 key points representing the face image, surrounding frame coordinates of five key areas, namely an eye brow, a nose, a mouth, a left cheek and a right cheek, are obtained, and then an affine transformation matrix C is adopted to transform the surrounding frame coordinates into surrounding frame coordinates B1, B2, B3, B4 and B5 of each key area of the face frontal image.
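A sketch of this estimation and transformation, assuming (64, 2) point arrays and OpenCV for the pixel warp; the patent itself names only numpy's least-squares routine:

```python
import numpy as np
import cv2  # the use of OpenCV's warpAffine is an assumption

def estimate_affine(key_pts, ref_pts):
    """Least-squares fit of the 2x3 affine matrix C mapping key_pts (64, 2) to ref_pts (64, 2)."""
    src_h = np.hstack([key_pts, np.ones((len(key_pts), 1))])  # homogeneous coordinates (64, 3)
    C, *_ = np.linalg.lstsq(src_h, ref_pts, rcond=None)       # least-squares parameters (3, 2)
    return C.T                                                # 2x3, as expected by warpAffine

def align_face(image, key_pts, ref_pts, boxes):
    """Warp the image to a frontal face I and map the key-region boxes B1..B5 along with it."""
    C = estimate_affine(key_pts, ref_pts)
    frontal = cv2.warpAffine(image, C, (image.shape[1], image.shape[0]))
    ones = np.ones((len(boxes), 1))
    p1 = np.hstack([boxes[:, 0:2], ones]) @ C.T  # transform the two corners of each box
    p2 = np.hstack([boxes[:, 2:4], ones]) @ C.T  # (exact only up to an axis-aligned approximation)
    return frontal, np.hstack([p1, p2])
```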
Step 103: extract features from the frontal face image I with the convolutional neural network ResNet18. ResNet18 contains five convolution blocks; the image I is passed through the five blocks in sequence, and forward iteration yields the final convolution feature map F1. The specific operation is F1 = Conv(I).
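A sketch of this feature extractor, built from torchvision's resnet18 by dropping its pooling and classification layers; whether this truncation matches the patent's five convolution blocks exactly is an assumption:

```python
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])  # stem + layer1..layer4, no avgpool/fc

image = torch.randn(1, 3, 256, 256)  # frontal face image I
F1 = backbone(image)
print(F1.shape)                      # torch.Size([1, 512, 8, 8]): the convolution feature map
```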
Step 104: quantitatively sample the convolution feature map F1 obtained in step 103 with the bounding boxes B1, B2, B3, B4 and B5 of the five key regions generated in step 102. The specific process is as follows: using the RoIAlign operation, the coordinates of the region occupied by each bounding box on the feature map are obtained from the coordinate relationship between the boxes and the original image together with the scale invariance between the original image and the feature map; the features of each region are then pooled to a size of 3x3, finally producing the five local feature maps P1, P2, P3, P4 and P5.
For example: the feature extraction network used in step 103 is ResNet18 with feat_stride = 32, meaning the picture is reduced to 1/32 of its original size after passing through the network layers; if the original image is 256x256, the final feature map is 8x8. Suppose a local region of the original image is 150x150; mapped onto the feature map its size is 150/32 = 4.68, i.e. 4.68x4.68, and the floating-point value is retained. With a pooling kernel of 3, the region is fixed to a 3x3 feature map after the pooling layer, so the 4.68x4.68 region mapped onto F1 is divided into 9 equally sized bins, each of size 4.68/3 = 1.56, i.e. 1.56x1.56. Each 1.56x1.56 bin is divided into four equal parts, the center of each part is taken as a sampling point, and the pixel value at each sampling point is computed by bilinear interpolation, giving four pixel values per bin. The maximum of the four values is taken as the value of that bin; proceeding in this way, the 9 bins yield 9 values, which form a 3x3 feature map. The five local regions thus yield the five 3x3 local feature maps P1, P2, P3, P4 and P5, as in the numeric sketch below.
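The arithmetic of this example can be reproduced directly; this is pure illustration, and the example's 4.68 is the truncated value of 150/32 = 4.6875:

```python
feat_stride = 32                               # ResNet18 reduces the image by a factor of 32
orig_size, region_size, pool = 256, 150, 3

feat_size = orig_size / feat_stride            # 8.0  -> the 8x8 final feature map
region_on_map = region_size / feat_stride      # 4.6875 (~4.68): region size on the feature map
bin_size = region_on_map / pool                # ~1.56: each of the 9 pooled bins
print(feat_size, region_on_map, bin_size)
```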
Step 105: construct five fully connected layers as the classifiers FC1, FC2, FC3, FC4 and FC5 of the five local regions and perform classification prediction on the five local feature maps respectively, obtaining the five prediction results R1, R2, R3, R4 and R5 by the operation

Ri = FCi(Pi), i = 1, ..., 5.
Step 106: compute the cross-entropy loss over the five prediction results R1, R2, R3, R4 and R5. The loss function used in this embodiment is

L = -(1/N) · Σ_i [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]

where y_i is the label of sample i (1 represents true and 0 represents false) and p_i is the predicted probability that sample i is true.
In actual use, steps 102 to 105 are applied to a face image under test to obtain the five prediction results R1, R2, R3, R4 and R5. The argmax operation converts these five results into true/false verdicts, the five verdicts are put to a vote, and when "true" or "false" receives a majority (at least three of the five votes) the image is judged to belong to the corresponding category.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A face image forgery detection method based on multi-local-feature voting, characterized by comprising the following steps:
step 1, labeling the forged regions in known face images, and analyzing them to determine the key regions needed for face image forgery detection;
step 2, preprocessing a video to be analyzed to obtain all the face images to be analyzed contained in it; according to the key regions determined in step 1, determining the bounding-box coordinates corresponding to those key regions in every face image to be analyzed;
step 3, establishing a face image forgery detection model based on multi-local-feature voting, comprising: a feature extraction unit, a sampling-pooling unit, a classification unit and an evaluation unit;
the feature extraction unit extracts features from the face image to be analyzed to obtain a feature map; the sampling-pooling unit quantitatively samples the feature map using the bounding-box coordinates and, after pooling, obtains a local feature map for each key region; the classification unit performs classification prediction on each local feature map to obtain a set of prediction results; the evaluation unit evaluates these prediction results and iteratively updates the parameters of the forgery detection model according to the evaluation until a preset constraint condition is met;
step 4, preprocessing a sample video as in step 2 to obtain sample face images and the sample bounding-box coordinates of every key region in them; using the sample face images and sample bounding-box coordinates as training inputs and the genuineness of each sample face image as its label, constructing a training sample data set; and training the forgery detection model on this data set;
step 5, in application, obtaining all the face images to be analyzed and their bounding-box coordinates from the video to be analyzed as in step 2 and feeding them into the forgery detection model trained in step 4; the model outputs a prediction result for each local feature map; these predictions are then converted into true/false verdicts, the verdicts are put to a vote, and the verdict whose vote count exceeds the threshold is output as the final forgery detection result.
2. The face image forgery detection method based on multi-local-feature voting of claim 1, characterized in that the preprocessing of the video to be analyzed in step 2 also includes an alignment operation on the face images to be analyzed, comprising the following steps:
step 2.1, applying an affine transformation to the pixels of the face region in the face image to be analyzed to complete the alignment and obtain a frontal face image to be analyzed;
step 2.2, according to the key regions determined in step 1, obtaining from the frontal face image the frontal bounding-box coordinates corresponding to each key region;
step 2.3, taking the frontal face image as the face image to be analyzed and the frontal bounding-box coordinates as the bounding-box coordinates, and outputting both.
3. The face image forgery detection method based on multi-local-feature voting of claim 2, characterized in that the face detection model can be a face detector model, a 68-dimensional face key point detection model or a face feature extraction model.
4. The face image forgery detection method based on multi-local-feature voting of claim 1, characterized in that the key regions in step 1 comprise: the eyebrows, nose, mouth, left cheek and right cheek.
5. The face image forgery detection method based on multi-local-feature voting of claim 1, characterized in that the feature extraction unit is implemented with the convolutional neural network ResNet18.
6. The face image forgery detection method based on multi-local-feature voting of claim 1, characterized in that the classification unit consists of fully connected layers equal in number to the key regions.
7. The face image forgery detection method based on multi-local-feature voting of claim 1, characterized in that the evaluation unit computes a cross-entropy loss over the prediction results to obtain a loss value, computes the gradients of the network model through the automatic differentiation mechanism of the neural network framework, and updates and optimizes the parameters of the forgery detection model with stochastic gradient descent.
CN202111557690.7A 2021-12-20 2021-12-20 Face image false distinguishing method based on multi-local feature voting Pending CN113936343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111557690.7A CN113936343A (en) 2021-12-20 2021-12-20 Face image false distinguishing method based on multi-local feature voting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111557690.7A CN113936343A (en) 2021-12-20 2021-12-20 Face image false distinguishing method based on multi-local feature voting

Publications (1)

Publication Number Publication Date
CN113936343A (en) 2022-01-14

Family

ID=79289348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111557690.7A Pending CN113936343A (en) 2021-12-20 2021-12-20 Face image false distinguishing method based on multi-local feature voting

Country Status (1)

Country Link
CN (1) CN113936343A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014001610A1 (en) * 2012-06-25 2014-01-03 Nokia Corporation Method, apparatus and computer program product for human-face features extraction
CN107729835A (en) * 2017-10-10 2018-02-23 浙江大学 A kind of expression recognition method based on face key point region traditional characteristic and face global depth Fusion Features
CN109858467A (en) * 2019-03-01 2019-06-07 北京视甄智能科技有限公司 A kind of face identification method and device based on the fusion of key point provincial characteristics
CN113537173A (en) * 2021-09-16 2021-10-22 中国人民解放军国防科技大学 Face image authenticity identification method based on face patch mapping

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIN XIAO: "Image splicing forgery detection combining coarse to refined convolutional neural network and adaptive clustering", Information Sciences *
赵秀锋: "Research on multi-region image splicing tampering detection based on noise distribution", China Master's Theses Full-text Database (Information Science and Technology) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220114)