CN114299218A - System for searching real human face based on hand-drawing sketch - Google Patents


Info

Publication number
CN114299218A
Authority
CN
China
Prior art keywords
face
module
network
human face
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111514828.5A
Other languages
Chinese (zh)
Inventor
陈玫玫
王辰萱
何文俊
吴思嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202111514828.5A priority Critical patent/CN114299218A/en
Publication of CN114299218A publication Critical patent/CN114299218A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the technical field of heterogeneous face recognition and relates to a system for retrieving a real face from a hand-drawn sketch. A sketch image is input into the face reconstruction module; the face detection module selects a face frame and passes it to the face comparison module; the face comparison module performs real-time face comparison with a deep neural network through the steps of face key-point detection, feature extraction, and face similarity judgment. The face reconstruction module adds a fine-restoration module, residual blocks, and an attention mechanism to a conventional generative adversarial network, improving the accuracy of generated image detail. Combined with spectral normalization, it keeps deep-network training controllable, obtains good results on smaller data sets, and reduces the required training-set size while improving training efficiency.

Description

System for searching real human face based on hand-drawing sketch
Technical Field
The invention belongs to the technical field of heterogeneous face recognition and relates to a system for retrieving a real face from a hand-drawn sketch.
Background
Sketch-based face recognition has wide application in many fields. Because of the difference in image style, a real face cannot be identified directly from a sketch image; style conversion must be performed first, generating a real face from the sketch before recognition. The quality of the generated image therefore has a direct and significant influence on the recognition result. Early image style conversion was treated mainly as a texture-synthesis problem: synthesizing texture from an original image while constraining the synthesis to preserve the original image's semantic information. Traditional machine-learning approaches to texture synthesis include resampling a given source texture to synthesize new natural texture and controlling the texture transfer with edge-direction information. However, when the training set and test set differ substantially, these traditional methods struggle to produce satisfactory results, and their accuracy is poor.
Disclosure of Invention
The invention aims to provide a system for retrieving a real face from a hand-drawn sketch, solving the problem of poor accuracy.
The present invention is achieved as follows:
a system for finding a real face based on a hand-drawn sketch, the system comprising:
a face reconstruction module, which generates a face from the acquired hand-drawn sketch through a generator, the generator being based on a U-net network;
a face detection module, which collects the face generated by the face reconstruction module in real time, scales the image, uses a resnet18 network to judge the probability that a face exists at each position, and frames the face with a non-maximum-suppression algorithm;
a face comparison module, which comprises a face key-point detection module, a feature extraction module, and a face similarity judgment module, used respectively for face key-point detection, feature extraction, and face similarity judgment.
Furthermore, there are two generators, both using U-net networks. Down-sampling performs four rounds of convolution, pooling, and activation, with a 3 × 3 convolution kernel, 2 × 2 max pooling, and LeakyReLU as the activation function; weights are shared during down-sampling. Up-sampling performs four rounds of deconvolution, pooling, activation, and residual connection, where a residual connection splices feature tensors of the same size; the input to the self-attention submodule is the feature space obtained by down-sampling.
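As an illustration of the down-sampling operations named above, the following pure-Python sketch implements 2 × 2 max pooling and the LeakyReLU activation. A real system would use a deep-learning framework; the 0.01 negative slope is an assumed default, not specified in the patent.

```python
def max_pool_2x2(img):
    """2 x 2 max pooling with stride 2 over a 2-D list, as in the U-net down-sampling path."""
    h, w = len(img), len(img[0])
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, w - 1, 2)]
            for i in range(0, h - 1, 2)]


def leaky_relu(x, slope=0.01):
    """LeakyReLU: identity for x >= 0, a small negative slope otherwise (slope value assumed)."""
    return x if x >= 0 else slope * x
```

Each pooling round halves both spatial dimensions, which is why four rounds reduce a 64 × 64 input to a 4 × 4 bottleneck.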
Further, the face reconstruction module also comprises a discriminator which, during training, judges the similarity between the face generated from an input sketch and the corresponding real face of the input pair and converges it through a loss function; the discriminator is a Markov discriminator, and the loss function comprises network loss, style loss, and feature loss.
Further, the discriminator splits the input image into blocks and passes them through a VGG network with convolution, pooling, and activation operations, where the convolution kernel is 3 × 3, pooling is 2 × 2 max pooling, and the activation function is tanh. The VGG input yields a multidimensional feature space, and a convolution operation produces an N × N matrix in which each entry represents the probability that one image block is real; averaging the entries gives the probability that the whole image is judged real. The sigmoid function of the discriminator's last layer is replaced with a spectral normalization operation.
Further, the loss function is:

$$L_{GAN}(G,D)=\mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x,y)]+\mathbb{E}_{x\sim p_{noise}(x)}[\log(1-D(x,G(x)))]$$

$$L_{pixel}=\frac{1}{N_D}\sum_{i=1}^{N_D}\bigl|y_i-G(x)_i\bigr|$$

$$L_{content}=\frac{1}{N_Q}\sum_{i=1}^{N_Q}\bigl|\phi(y)_i-\phi(G(x))_i\bigr|$$

$$L=L_{GAN}+\lambda L_{content}+\mu L_{pixel}$$

where $p_{data}(x,y)$ denotes the real sample distribution, $p_{noise}(x)$ the random-noise distribution, $N_D$ and $N_Q$ the numbers of pixels in the original image and in the generated image's feature space, and $\phi(\cdot)$ the VGG feature mapping; $G(x)$ denotes the generator function and $D(x)$ the discriminator function, with the parameters set as:
λ = 1; μ = 100.
further, the face detection module selects a face frame by using a resnet18 model, the resnet18 model is a convolutional neural network composed of 4 residual blocks, each residual block has 3 × 3 convolutional layers and 1 convolutional layer with 1 × 1 identity mapping, the first 7 × 7 convolutional layer and the last full-connected layer are added, 18 layers are provided, the model optimization adopts an Adam method, the cross entropy loss adopted by the loss function is realized, the input of the trained network resnet18 model is a face with the size of 64 × 64, a rectangular frame with the size of 224 × 224 slides from left to right on a zoomed picture from top to bottom, and the probability of the face existing at each position is calculated; and finally, removing redundant frames and reserving a best frame by using a non-maximum value inhibition method for at least one face frame obtained.
Further, the face key-point detection module detects key points with two cascaded CNNs. The face key points are divided into internal key points and contour key points. The internal key points use a four-level cascade network: the first-level network obtains the bounding boxes of the facial organs; the second-level network outputs the predicted positions of the internal key points; the third-level network performs coarse-to-fine localization for each organ; the fourth-level network takes a rotated version of the third-level output as input and finally outputs the internal key-point positions. The contour key points use a two-level cascade network: the first-level network obtains the bounding box of the contour, and the second-level network predicts the contour key-point positions. The outputs of the two cascaded CNNs are superposed to give the final output, and a similarity transformation based on the located key points yields the aligned face picture.
Further, the feature extraction module finds a projection feature space with PCA, performs projection, and computes the feature coefficients of a face: the picture matrix of each face in the database is read, and its columns, taken in order from first to last, are connected end to end into one column vector. Each face image is thus represented as a column vector; the column vectors are concatenated and transposed to form a face sample matrix. The covariance of this matrix is computed, along with its eigenvalues and the corresponding mutually orthogonal eigenvectors; arranging the eigenvectors by descending eigenvalue yields an orthogonal matrix, i.e., the face projection space.
Further, the face similarity judgment module computes the Euclidean distance between the feature coefficients of the face collected in real time and those of the faces in the database, sets a threshold, and judges whether the two faces belong to the same person: a K-L transform projects each database sample into the face projection space to obtain its projection feature coefficients; the projection feature coefficients of the face output by the face detection module are computed in the same way; the Euclidean distance between these coefficients and those of each database sample is then computed, and the similarity between the real-time detected face and the database faces is judged from the computed Euclidean distances.
Further, in the training stage, the faces in the database are real faces corresponding to the input sketch.
Compared with the prior art, the invention has the beneficial effects that:
the system disclosed by the invention is tested, when the CUFK data set is used, the accuracy of face recognition is 100%, and the similarity between a generated image and a real image is more than 95%.
Drawings
Fig. 1 is a schematic structural diagram of a face reconstruction module;
FIG. 2 is a detection flow chart of the face keypoint detection module;
fig. 3 is a block diagram of the system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 in conjunction with fig. 2, a system for finding a real face based on a hand-drawn sketch comprises:
a face reconstruction module, which generates a face from the acquired hand-drawn sketch through a generator, the generator being based on a U-net network;
a face detection module, which collects the face generated by the face reconstruction module in real time, scales the image, uses a resnet18 network to judge the probability that a face exists at each position, and frames the face with a non-maximum-suppression algorithm;
a face comparison module, which comprises a face key-point detection module, a feature extraction module, and a face similarity judgment module, used respectively for face key-point detection, feature extraction, and face similarity judgment.
The face reconstruction module further comprises a discriminator which, during training, judges the similarity between the face generated from the input sketch and the corresponding real face of the input pair and converges it through a loss function; the discriminator is a Markov discriminator, and the loss function comprises network loss, style loss, and feature loss. In fig. 2, G1 and G3 are generators, D1 and D2 are discriminators, B1 and B2 are residual blocks, and C is a self-attention module.
The face reconstruction module is trained before use. In the training stage, the CUFS data set is used; the sketches and faces used to train the face reconstruction module are first cropped to 64 × 64 and paired. The data are split into training and test sets at a ratio of 95:5.
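The 95:5 split of paired training data described above can be sketched as follows; the fixed seed and the `split_train_test` helper name are illustrative assumptions, not part of the patent.

```python
import random


def split_train_test(pairs, train_ratio=0.95, seed=0):
    """Shuffle the (sketch, photo) pairs and split them 95:5 into training and test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed: assumed, for reproducibility
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]
```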
The generator is based on a U-net network combined with residual blocks and a conditional self-attention model; the discriminator is based on a Markov discriminator combined with spectral normalization, and the loss function comprises network loss, style loss, and feature loss.
Two generators are needed, both using U-net networks with structurally symmetric architectures. Down-sampling performs four rounds of convolution, pooling, and activation, with a 3 × 3 convolution kernel, 2 × 2 max pooling, and LeakyReLU as the activation function; sharing weights during down-sampling improves affine invariance and training efficiency. Up-sampling performs four rounds of deconvolution, pooling, activation, and residual connection, where a residual connection splices feature tensors of the same size and helps avoid vanishing gradients. A self-attention submodule is added to the generator: because faces have strong positional structure, the self-attention submodule attends well to the position information of facial features, whereas a traditional convolutional neural network attends to local detail and ignores position information. The input to the self-attention module is the feature obtained by down-sampling. This feature is passed through three different 1 × 1 convolutions (with different numbers of kernels) to obtain three feature subspaces F, G, and H; one subspace is transposed and multiplied by another, and the result is spliced with the remaining one. For example, the transpose of F is multiplied by G and spliced with H to obtain a new feature subspace carrying the attended position information; this subspace is fed to the up-sampling path to obtain the reconstructed image. The outputs of the two generators are superposed.
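The transpose-multiply-and-splice step of the self-attention submodule can be illustrated on small matrices in pure Python. This is a structural sketch of the F, G, H combination described above, not the patent's implementation, which operates on convolutional feature maps.

```python
def matmul(a, b):
    """Plain matrix product of two 2-D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a 2-D list."""
    return [list(r) for r in zip(*m)]


def attention_combine(F, G, H):
    """Transpose one feature subspace, multiply it by a second, and splice the third
    onto the result row-wise: the F^T * G then splice-with-H step described above."""
    att = matmul(transpose(F), G)  # attention map over positions
    return [row_a + row_h for row_a, row_h in zip(att, H)]
```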
A discriminator is used during training. The discriminator passes the image blocks input to it through a VGG network with convolution, pooling, and activation operations, where the convolution kernel is 3 × 3, pooling is 2 × 2 max pooling, and the activation function is tanh. The VGG input yields a multidimensional feature space, and a convolution operation produces an N × N matrix; each entry of the matrix is the probability that one image block is real, and averaging the entries gives the probability that the whole image is judged real. Enlarging the receptive field improves the resolution of the generated image; the receptive field is set to 8 × 8. The invention removes the sigmoid function of the discriminator's last layer and adds a spectral normalization operation. Spectral normalization prevents the discriminator's gradients from vanishing or exploding and constrains the slope between any two points of the same channel in the feature space to [-1, 1].
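Spectral normalization divides a weight matrix by its largest singular value, which bounds the layer's Lipschitz constant and so keeps the slope within the range described above. A minimal pure-Python power-iteration estimate of that singular value is sketched below; the iteration count is an assumed choice.

```python
def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W by power iteration on W^T W.
    Spectral normalization divides W by this value so the layer is 1-Lipschitz."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]   # u = W v
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]   # v = W^T u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]                                             # re-normalize
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(x * x for x in u) ** 0.5
```

In practice each discriminator weight matrix would be divided by this estimate at every training step.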
Convergence uses a loss function comprising the adversarial loss of the generative adversarial network, $L_{GAN}$, the content loss $L_{content}$, and the pixel loss $L_{pixel}$. The adversarial loss $L_{GAN}$ focuses on the distance between the generated network as a whole and the optimal solution; the pixel loss focuses on the detailed distance between the original image and the generated image; the content loss focuses on the distance between the original and generated images in the feature space obtained through the VGG network, including position information. The three terms are given different weights to obtain the loss function of the face reconstruction module, whose specific formula is:

$$L_{GAN}(G,D)=\mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x,y)]+\mathbb{E}_{x\sim p_{noise}(x)}[\log(1-D(x,G(x)))]$$

$$L_{pixel}=\frac{1}{N_D}\sum_{i=1}^{N_D}\bigl|y_i-G(x)_i\bigr|$$

$$L_{content}=\frac{1}{N_Q}\sum_{i=1}^{N_Q}\bigl|\phi(y)_i-\phi(G(x))_i\bigr|$$

$$L=L_{GAN}+\lambda L_{content}+\mu L_{pixel}$$

where $p_{data}(x,y)$ denotes the real sample distribution, $p_{noise}(x)$ the random-noise distribution, $N_D$ and $N_Q$ the numbers of pixels in the original image and in the generated image's feature space, and $\phi(\cdot)$ the VGG feature mapping; $G(x)$ denotes the generator function and $D(x)$ the discriminator function.
According to experience and test results, the parameters set in the invention are as follows:
λ=1;μ=100
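With these weights, combining the three losses is simple arithmetic. The sketch below assumes a mean-absolute-difference pixel loss, consistent with the formula above; the helper names are hypothetical.

```python
def pixel_loss(original, generated):
    """Mean absolute per-pixel difference between original and generated images (flat lists)."""
    return sum(abs(o - g) for o, g in zip(original, generated)) / len(original)


def total_loss(l_gan, l_content, l_pixel, lam=1.0, mu=100.0):
    """Weighted combination L = L_GAN + lambda * L_content + mu * L_pixel,
    with the weights lambda = 1 and mu = 100 given above."""
    return l_gan + lam * l_content + mu * l_pixel
```

The large μ emphasizes per-pixel fidelity, which matches the patent's concern with accurate detail generation.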
the network uses ADAM optimization, and to ensure balance between the generator and the arbiter, the learning rate of the generator is set to 0.0001 and the learning rate of the arbiter is set to 0.0004.
The face detection module comprises a face acquisition module and a face framing module; when no sketch is used, a real face can be captured through the face acquisition module. The face detection module uses a resnet18 model: a convolutional neural network composed of 4 residual blocks, each containing 3 × 3 convolutional layers and 1 convolutional layer with a 1 × 1 identity mapping, which together with the first 7 × 7 convolutional layer and the final fully connected layer gives 18 layers. The model is optimized with Adam, and the loss function is cross-entropy loss. Because the trained resnet18 network takes 64 × 64 faces as input while the size of a face detected in real time is not fixed, the picture must be scaled. A 224 × 224 rectangular frame slides left to right and top to bottom over the scaled picture, and the probability of a face at each position is computed. More than one face frame may result, so a non-maximum-suppression algorithm removes the redundant frames and keeps the best one.
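The non-maximum-suppression step that removes redundant face frames can be sketched in pure Python; the 0.5 IoU threshold is an assumed value, not specified in the patent.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def non_max_suppression(boxes, scores, thresh=0.5):
    """Keep the highest-scoring frames and drop overlapping duplicates; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep box i only if it does not overlap a better-scoring kept box too much
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```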
Referring to fig. 2, face key-point detection is performed on the face generated from the sketch. The 68 face key points, internal and external, are detected with two cascaded CNNs. The 51 internal key points use a four-level cascade network. The first-level network mainly obtains the bounding boxes of the facial organs; the second-level network outputs 51 predicted key-point positions, a coarse localization that initializes the third-level network; the third-level network performs coarse-to-fine localization for each organ; the fourth-level network takes a rotated version of the third-level output as input and finally outputs the 51 key-point positions. The 17 external key points use only a two-level cascade network. The first-level network serves the same role as in internal key-point detection, obtaining the bounding box of the contour; the second-level network predicts the 17 key points directly, without a coarse-to-fine stage: because the contour key-point regions are large, the network maintains extraction accuracy while improving training efficiency. The final 68 face key points are obtained by superposing the outputs of the two cascaded CNNs, and a similarity transformation based on the 68 located key points yields the aligned face picture.
The feature extraction module takes the aligned face pictures. For each face in the database, the columns of its picture matrix, taken in order from first to last, are connected end to end into one column vector, so each face image is represented as a column vector; the column vectors are concatenated and transposed to form a face sample matrix. The covariance of this matrix is computed, along with its eigenvalues and the corresponding mutually orthogonal eigenvectors; arranging the eigenvectors by descending eigenvalue yields an orthogonal matrix, i.e., the face projection space. A K-L transform then projects each sample to obtain its projection feature coefficients in the face projection space. The projection feature coefficients of the face output by the face detection module are computed in the same way, and the Euclidean distance between the projection feature coefficients of each pair of faces is computed. The similarity between the real-time detected face and the database faces is judged from the Euclidean distance: the threshold is set to 1.20, and when the distance is below 1.20, the face is considered to match one in the database. The face comparison result is visualized on a micro embedded system.
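The column-wise flattening, Euclidean distance, and 1.20-threshold decision described above can be sketched as follows; the helper names are illustrative, and the eigendecomposition itself is omitted.

```python
def flatten_columns(img):
    """Connect the columns of an image matrix end to end into a single vector,
    as the feature extraction module does before projection."""
    return [img[i][j] for j in range(len(img[0])) for i in range(len(img))]


def euclidean(a, b):
    """Euclidean distance between two coefficient vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def same_person(coeff_a, coeff_b, thresh=1.20):
    """Decision rule of the similarity module: a match when the distance is below 1.20."""
    return euclidean(coeff_a, coeff_b) < thresh
```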
In tests of the system, face recognition accuracy is 100%, and the similarity between the generated image and the real image exceeds 95%.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A system for finding a real face based on a hand-drawn sketch, the system comprising:
a face reconstruction module, which generates a face from the acquired hand-drawn sketch through a generator, the generator being based on a U-net network;
a face detection module, which collects the face generated by the face reconstruction module in real time, scales the image, uses a resnet18 network to judge the probability that a face exists at each position, and frames the face with a non-maximum-suppression algorithm;
a face comparison module, which comprises a face key-point detection module, a feature extraction module, and a face similarity judgment module, used respectively for face key-point detection, feature extraction, and face similarity judgment.
2. The system of claim 1, wherein there are two generators, both using U-net networks; down-sampling performs four rounds of convolution, pooling, and activation, with a 3 × 3 convolution kernel, 2 × 2 max pooling, and LeakyReLU as the activation function, and weights are shared during down-sampling; up-sampling performs four rounds of deconvolution, pooling, activation, and residual connection, wherein a residual connection splices feature tensors of the same size, and the input to the self-attention submodule is the feature space obtained by down-sampling.
3. The system of claim 1, wherein the face reconstruction module further comprises a discriminator which, during training, judges the similarity between the face generated from an input sketch and the corresponding real face of the input pair and converges it through a loss function; the discriminator is a Markov discriminator, and the loss function comprises network loss, style loss, and feature loss.
4. The system of claim 3, wherein the discriminator splits the input image into blocks and passes them through a VGG network with convolution, pooling, and activation operations, wherein the convolution kernel is 3 × 3, pooling is 2 × 2 max pooling, and the activation function is tanh; the VGG input yields a multidimensional feature space, and a convolution operation produces an N × N matrix in which each entry represents the probability that one image block is real, and averaging the entries gives the probability that the whole image is judged real; the sigmoid function of the discriminator's last layer is replaced with a spectral normalization operation.
5. The system of claim 3, wherein the loss function is:
$$L_{GAN}(G,D)=\mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x,y)]+\mathbb{E}_{x\sim p_{noise}(x)}[\log(1-D(x,G(x)))]$$

$$L_{pixel}=\frac{1}{N_D}\sum_{i=1}^{N_D}\bigl|y_i-G(x)_i\bigr|$$

$$L_{content}=\frac{1}{N_Q}\sum_{i=1}^{N_Q}\bigl|\phi(y)_i-\phi(G(x))_i\bigr|$$

$$L=L_{GAN}+\lambda L_{content}+\mu L_{pixel}$$

where $p_{data}(x,y)$ denotes the real sample distribution, $p_{noise}(x)$ the random-noise distribution, $N_D$ and $N_Q$ the numbers of pixels in the original image and in the generated image's feature space, and $\phi(\cdot)$ the VGG feature mapping; $G(x)$ denotes the generator function and $D(x)$ the discriminator function, with the following parameters:
λ = 1; μ = 100.
6. The system of claim 1, wherein the face detection module frames the face with a resnet18 model, the resnet18 model being a convolutional neural network composed of 4 residual blocks, each containing 3 × 3 convolutional layers and 1 convolutional layer with a 1 × 1 identity mapping, which together with the first 7 × 7 convolutional layer and the final fully connected layer gives 18 layers; the model is optimized with Adam, and the loss function is cross-entropy loss; the trained resnet18 network takes 64 × 64 faces as input, a 224 × 224 rectangular frame slides left to right and top to bottom over the scaled picture, and the probability of a face at each position is computed; finally, non-maximum suppression removes redundant frames from the at least one obtained face frame and keeps the best one.
7. The system of claim 1, wherein the face key-point detection module detects key points with two cascaded CNNs, the face key points being divided into internal key points and contour key points; the internal key points use a four-level cascade network, wherein the first-level network obtains the bounding boxes of the facial organs; the second-level network outputs the predicted positions of the internal key points; the third-level network performs coarse-to-fine localization for each organ; the fourth-level network takes a rotated version of the third-level output as input and finally outputs the internal key-point positions; the contour key points use a two-level cascade network, wherein the first-level network obtains the bounding box of the contour and the second-level network predicts the contour key-point positions; the outputs of the two cascaded CNNs are superposed to give the final output, and a similarity transformation based on the located key points yields the aligned face picture.
8. The system of claim 1, wherein the feature extraction module finds a projection feature space with PCA, performs projection, and computes the feature coefficients of a face: the picture matrix of each face in the database is read, and its columns, taken in order from first to last, are connected end to end into one column vector; each face image is thus represented as a column vector, and the column vectors are concatenated and transposed to form a face sample matrix; the covariance of this matrix is computed, along with its eigenvalues and the corresponding mutually orthogonal eigenvectors, and arranging the eigenvectors by descending eigenvalue yields an orthogonal matrix, i.e., the face projection space.
9. The system of claim 8, wherein the face similarity judgment module computes the Euclidean distance between the feature coefficients of the face collected in real time and those of the faces in the database, sets a threshold, and judges whether the two faces belong to the same person: a K-L transform projects each database sample into the face projection space to obtain its projection feature coefficients; the projection feature coefficients of the face output by the face detection module are computed in the same way; the Euclidean distance between these coefficients and those of each database sample is then computed, and the similarity between the real-time detected face and the database faces is judged from the computed Euclidean distances.
10. The system of claim 9, wherein during the training phase, the faces in the database are real faces corresponding to the input sketch.
CN202111514828.5A 2021-12-13 2021-12-13 System for searching real human face based on hand-drawing sketch Pending CN114299218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111514828.5A CN114299218A (en) 2021-12-13 2021-12-13 System for searching real human face based on hand-drawing sketch


Publications (1)

Publication Number Publication Date
CN114299218A true CN114299218A (en) 2022-04-08

Family

ID=80968195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111514828.5A Pending CN114299218A (en) 2021-12-13 2021-12-13 System for searching real human face based on hand-drawing sketch

Country Status (1)

Country Link
CN (1) CN114299218A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496824A (en) * 2022-09-27 2022-12-20 北京航空航天大学 Multi-class object-level natural image generation method based on hand drawing
CN115496824B (en) * 2022-09-27 2023-08-18 北京航空航天大学 Multi-class object-level natural image generation method based on hand drawing


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination