CN110189255A

CN110189255A - Method for detecting human face based on hierarchical detection

Info

Publication number: CN110189255A
Application number: CN201910455695.5A
Authority: CN
Inventors: 于力; 刘意文; 邹见效; 杨瞻远; 徐红兵
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-05-29
Filing date: 2019-05-29
Publication date: 2019-08-30
Anticipated expiration: 2039-05-29
Also published as: CN110189255B

Abstract

The invention discloses a face detection method based on two-stage (hierarchical) detection. A face detection model and a GAN-based super-resolution reconstruction model are first trained separately. A face image to be detected is then input into the face detection model to obtain the coordinate information of each candidate region of a face target and the confidence value that the candidate region belongs to a face, and a preliminary judgment is made according to the confidence value. Each face target still to be determined is then input into the generator of the GAN-based super-resolution reconstruction model for further judgment. By using two-stage detection, the invention can effectively improve the detection rate for low-resolution face images.

Description

Face detection method based on two-stage detection

Technical Field

The invention belongs to the technical field of low-resolution face detection, and particularly relates to a face detection method based on two-stage detection.

Background

Face detection first appeared as a sub-problem of face recognition systems and, with further research, gradually became an independent subject. Current face detection technology draws on machine learning, computer vision, pattern recognition, artificial intelligence and other fields; it is the basis of all face image analysis and derivative applications, and strongly influences the response speed and detection accuracy of derivative systems. As the application scenarios of face detection keep expanding, input face images that are too small or of too low quality are increasingly encountered, and for such low-resolution face images the accuracy of a face detection system often drops sharply. The detection of low-quality, small-size face images is commonly referred to as low-resolution face detection.

At its core, current face detection is a binary classification problem: effective features are extracted from a region to be detected, and the features are then used to judge whether a face is present. Low-resolution face detection is studied on the same basis. A low-resolution face has three characteristics: little information, high noise, and few available tools. From the perspective of feature expression, conventional methods therefore cannot extract enough effective features from a candidate region to express a low-resolution face. Deep neural networks have an inherent deficiency here as well: the earlier convolutional layers cannot provide a sufficiently powerful feature map, and the later convolutional layers cannot provide enough features of the low-resolution face region, which makes low-resolution faces very difficult to detect.

To address low-resolution face detection, many researchers have carried out targeted work. Overall, researchers at home and abroad focus on three directions: finding resolution-robust feature expressions for face regions, designing new classifiers suited to the characteristics of low-resolution faces, and image super-resolution methods. Research on low-resolution small-face detection is still developing, and many problems remain. On the one hand, how to effectively extract the context information of a low-resolution face and integrate it into the detection network needs further exploration in order to improve low-resolution face detectors. On the other hand, a complete face detection system must detect faces at all scales, so the ability to detect faces of other scales must be considered when handling low-resolution faces. In practice, fusing multi-scale detection tends to lower either the accuracy or the processing speed of a low-resolution face detection system, which is a problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a face detection method based on two-stage detection.

In order to achieve the above object, the face detection method based on two-stage detection of the present invention comprises the following steps:

S1: acquiring a plurality of face image training samples, wherein each training sample comprises a face image and face target information, and training a face detection model with the face image training samples;

S2: acquiring a plurality of super-resolution face image reconstruction training samples, wherein each training sample comprises a low-resolution image containing a face and the corresponding high-resolution image, and training a GAN-based super-resolution reconstruction model with these samples, the GAN-based super-resolution reconstruction model comprising a generator G and a discriminator D;

S3: inputting the face image to be detected into the face detection model to obtain the coordinate information of each candidate region of a face target and a confidence value C that the candidate region belongs to a face; presetting confidence thresholds T₁ and T₂ with 0 < T₁ < T₂ < 1; for each candidate region, if the corresponding confidence value C ≥ T₂, judging that a face target exists in the candidate region and outputting it as a face target region; if T₁ ≤ C < T₂, taking the candidate region as a face target to be determined; otherwise, judging that no face target exists in the candidate region and not outputting it;

S4: inputting each face target to be determined into the generator G of the GAN-based super-resolution reconstruction model to generate a super-resolution reconstructed image R, and then inputting the image R into the discriminator D, which judges whether R is a qualified super-resolution reconstructed image and whether it contains a face target; if R is a qualified super-resolution reconstructed image and contains a face target, judging that a face target exists in the corresponding candidate region and outputting the candidate region as a face target region; otherwise, judging that no face target exists.

The invention relates to a face detection method based on two-stage detection, which comprises the steps of firstly respectively training a face detection model and a super-resolution reconstruction model based on a GAN network, then inputting a face image to be detected into the face detection model to obtain coordinate information of each candidate region of a face target and a confidence value that the candidate region belongs to a face, carrying out primary judgment according to the confidence value, and then inputting the face target to be determined into a generator in the super-resolution reconstruction model based on the GAN network for further judgment. The invention adopts two-stage detection, and can effectively improve the detection rate of the low-resolution face image.

Drawings

FIG. 1 is a flow chart of an embodiment of a face detection method based on two-stage detection according to the present invention;

FIG. 2 is a schematic diagram of the structure of an R-FCN network;

FIG. 3 is a flowchart of an improved frame regression algorithm in the present embodiment;

FIG. 4 is a block diagram of the generator in the SRGAN network;

FIG. 5 is a block diagram of the discriminator in the SRGAN network;

FIG. 6 is a PR graph of three methods in this experimental verification;

FIG. 7 is an exemplary diagram of the detection result of the SFD face detection method in the present experimental verification;

FIG. 8 is an exemplary diagram of the detection results of the R-FCN face detection method in the experimental verification;

FIG. 9 is an exemplary diagram of the test results of the present invention in this experimental verification;

FIG. 10 is a PR graph showing the face detection of a clear detection sample set by three methods in the present experimental verification;

FIG. 11 is a PR graph of face detection on a general fuzzy detection sample set by three methods in the experimental verification;

FIG. 12 is a PR graph of face detection on a severely blurred detection sample set by three methods in the experimental verification.

Detailed Description

The following describes specific embodiments of the present invention with reference to the accompanying drawings, so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.

Examples

Fig. 1 is a flowchart of a specific embodiment of a face detection method based on two-stage detection according to the present invention. As shown in fig. 1, the two-stage detection-based face detection method of the present invention specifically includes the following steps:

S101: training a face detection model:

the method comprises the steps of obtaining a plurality of face image training samples, wherein each training sample comprises a face image and face target information, and training a face detection model by adopting the face image training samples.

S102: training a super-resolution reconstruction model:

A plurality of super-resolution face image reconstruction training samples are obtained, each comprising a low-resolution image containing a face and the corresponding high-resolution image. These samples are used to train a super-resolution reconstruction model based on a GAN (Generative Adversarial Network); the model comprises a generator G and a discriminator D.

S103: adopting a face detection model to carry out preliminary detection:

The face image to be detected is input into the face detection model to obtain the coordinate information of each candidate region of a face target and a confidence value C that the candidate region belongs to a face. Confidence thresholds T₁ and T₂ are preset, with 0 < T₁ < T₂ < 1. For each candidate region, if the corresponding confidence value C ≥ T₂, it is judged that a face target exists in the candidate region, and the region is output as a face target region; if T₁ ≤ C < T₂, the candidate region is taken as a face target to be determined; otherwise, it is judged that no face target exists in the candidate region, and the region is not output.

S104: detecting by adopting a super-resolution reconstruction model:

inputting each face target to be determined into a generator G in a super-resolution reconstruction model based on a GAN network to generate a super-resolution reconstruction image SR, then inputting the super-resolution reconstruction image SR into a discriminator D, judging whether the image SR is a qualified super-resolution reconstruction image and whether the image SR contains the face target by the discriminator, if the image SR is the qualified super-resolution reconstruction image and contains the face target, judging that the face target exists in a corresponding candidate area, outputting the face target as a face target area, and otherwise, judging that the face target does not exist in the corresponding candidate area.

In the face detection method based on two-stage detection, the super-resolution reconstruction model assists the face detection model by further examining the candidate regions whose confidence is low, which reduces missed detections and false detections of face targets and improves detection performance.
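The decision logic of steps S103 and S104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `two_stage_filter`, the callback `second_stage` (standing in for the generator G plus discriminator D check), and the threshold values used in the example are all hypothetical.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def two_stage_filter(
    candidates: List[Tuple[Box, float]],
    t1: float,
    t2: float,
    second_stage: Callable[[Box], bool],
) -> List[Box]:
    """Return the candidate regions accepted as face target regions.

    candidates   -- (box, confidence C) pairs from the face detection model
    t1, t2       -- confidence thresholds with 0 < t1 < t2 < 1
    second_stage -- hypothetical stand-in for the super-resolution check:
                    returns True if the reconstructed region is judged to
                    be a qualified image that contains a face
    """
    if not 0.0 < t1 < t2 < 1.0:
        raise ValueError("thresholds must satisfy 0 < t1 < t2 < 1")

    faces = []
    for box, conf in candidates:
        if conf >= t2:
            faces.append(box)          # confident detection: output directly
        elif conf >= t1:
            if second_stage(box):      # "to be determined": ask stage two
                faces.append(box)
        # conf < t1: judged to contain no face, not output
    return faces


if __name__ == "__main__":
    cands = [((0, 0, 10, 10), 0.9), ((5, 5, 8, 8), 0.5), ((1, 1, 2, 2), 0.1)]
    # Illustrative thresholds only; the source does not give numeric values.
    print(two_stage_filter(cands, 0.3, 0.8, lambda box: True))
```

Only candidates in the middle band reach the second stage, so the comparatively expensive super-resolution pass is not run on every region.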

Any suitable face detection model can be selected as needed. In this embodiment, an R-FCN network is selected as the face detection model and is modified for low-resolution face detection to improve the detection effect. The R-FCN network is a modification of the Faster R-CNN structure. Its core design idea is to introduce position-sensitive information on top of the RPN (Region Proposal Network) proposed in Faster R-CNN, move the ROI layer backwards, and use position-sensitive score maps to calculate the probability that each entity in the image to be detected belongs to each category, which greatly increases detection speed while maintaining high localization accuracy. FIG. 2 is a schematic diagram of the structure of an R-FCN network. As shown in FIG. 2, the workflow of the R-FCN can be briefly described as follows:

The image is input into a pre-trained classification network (in FIG. 2, the part of the ResNet-101 network before Conv4), whose network parameters are fixed. Three branches operate on the feature map obtained from the last convolutional layer of the pre-trained network:

The 1st branch performs the RPN operation on the feature map to obtain the corresponding candidate regions (ROIs), as follows: anchor frames (anchors) are generated on the feature map according to preset parameters; the anchor frames are a set of regions with different sizes and aspect ratios spread across the input image. The anchor frames containing foreground are then identified and converted into target bounding boxes by a bounding box regression algorithm, so that each bounding box fits the foreground object it contains more closely.

The 2nd branch obtains a K × K × (C + 1)-dimensional position-sensitive score map on the feature map for classification.

The 3rd branch obtains a 4 × K-dimensional position-sensitive score map on the feature map for regression.

Finally, position-sensitive ROI pooling (Position-Sensitive RoI Pooling) is performed on the K × K × (C + 1)-dimensional position-sensitive score map and the 4 × K-dimensional position-sensitive score map respectively to obtain the confidence and position information of each candidate region, and the corresponding category is then determined from the confidence.

In this embodiment, the generation parameters of the anchor frames are improved first. In the conventional R-FCN network, anchor frames are generated with three scales and three aspect ratios, by default {128 × 128, 256 × 256, 512 × 512} and {1:1, 1:2, 2:1}, giving 9 anchor frames. When the detection target is a small face, small face regions are easily missed. Therefore, in this embodiment the anchor scales are changed to the five scales {16 × 16, 32 × 32, 128 × 128, 256 × 256, 512 × 512}, and each scale generates three anchor frames with aspect ratios {1:1, 1:2, 2:1}, for 15 anchor frames in total. The two added small scales are for detecting small faces, and the three retained scales are used to extract face regions of regular size.
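The anchor count can be checked with a short sketch. It assumes, as is common in Faster R-CNN style implementations but not stated in the source, that each aspect ratio is read as width:height and that an anchor keeps the area of its scale; the function name `generate_anchors` is hypothetical.

```python
import math

DEFAULT_SCALES = [128, 256, 512]             # conventional R-FCN scales
EMBODIMENT_SCALES = [16, 32, 128, 256, 512]  # scales used in this embodiment
RATIOS = [(1, 1), (1, 2), (2, 1)]            # assumed to mean width:height


def generate_anchors(scales, ratios):
    """Return one (width, height) pair per scale/ratio combination.

    Assumption: every anchor preserves the area scale * scale, and only
    its shape changes with the aspect ratio.
    """
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for rw, rh in ratios:
            width = math.sqrt(area * rw / rh)
            height = area / width
            anchors.append((width, height))
    return anchors


if __name__ == "__main__":
    print(len(generate_anchors(DEFAULT_SCALES, RATIOS)))     # 3 scales x 3 ratios
    print(len(generate_anchors(EMBODIMENT_SCALES, RATIOS)))  # 5 scales x 3 ratios
```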

For the frame regression algorithm, the prior art mostly adopts the NMS (Non-Maximum Suppression) algorithm. Its core idea is to find local maxima and suppress non-maxima: iteratively, the anchor frame with the highest confidence is used to calculate the intersection over union (IoU, which represents the overlap ratio of a candidate frame and a reference frame) with the other anchor frames, and the frames with a large IoU are filtered out. However, the NMS algorithm has the following problems:

1) The NMS algorithm forcibly sets the confidence of adjacent candidate frames that overlap the selected frame to 0; that is, it directly deletes every candidate frame whose IoU exceeds the threshold. If a real target to be detected lies in the overlapping area, its detection will very likely fail, which raises the miss rate and lowers the average detection rate.

2) When the NMS algorithm is used for frame regression, the optimal value of the IoU threshold N_t is difficult to determine: setting it too large increases the false detection rate, and setting it too small increases the missed detection rate.

In order to solve the above problem, the frame regression algorithm is improved based on the NMS algorithm in the present embodiment. Fig. 3 is a flowchart of the improved bounding box regression algorithm in this embodiment. As shown in fig. 3, the specific steps of the improved border regression algorithm in this embodiment include:

S301: initializing data:

Denote the set of anchor frames containing the background as B = {b₁, b₂, …, b_N}, where b_n (n = 1, 2, …, N) is the n-th anchor frame, N is the number of anchor frames containing the background, and the confidence of anchor frame b_n is s_n. Initialize the retained anchor frame set D = ∅.

S302: selecting a current optimal anchor frame:

Select the anchor frame with the highest confidence from the current anchor frame set B and record it as the current optimal anchor frame b′; add b′ to the retained anchor frame set D and delete it from the anchor frame set B.

S303: judging whether the anchor frame set B is empty; if so, the frame regression ends; otherwise, go to step S304.

S304: updating the confidence:

For each anchor frame b_n in the current anchor frame set B, calculate its intersection over union iou(b′, b_n) with the current optimal anchor frame b′, and then update the confidence s_n of each anchor frame b_n using the following formula:

where N_t is a preset IoU threshold.

And then returns to step S302.
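The loop of steps S301 to S304 can be sketched as below. The confidence update formula itself is not reproduced in this text, so the sketch assumes the linear decay used by Soft-NMS: a frame whose IoU with b′ is at least N_t has its confidence multiplied by (1 − IoU), and other frames are left unchanged. That choice, the optional `min_score` pruning, and the function names are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, n_t, min_score=0.001):
    """Return the retained (box, confidence) pairs in order of selection.

    Assumed update rule (linear Soft-NMS decay, not taken from the source):
        s_n <- s_n * (1 - iou(b', b_n))   if iou(b', b_n) >= n_t
        s_n <- s_n                        otherwise
    Frames whose confidence falls below min_score are dropped; this
    pruning is also an assumption.
    """
    remaining = list(zip(boxes, scores))    # set B (S301)
    retained = []                           # set D, initially empty
    while remaining:                        # S303: stop when B is empty
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best = remaining.pop(best_idx)      # S302: move b' from B to D
        retained.append(best)
        updated = []
        for box, score in remaining:        # S304: decay overlapping frames
            overlap = iou(best[0], box)
            if overlap >= n_t:
                score *= 1.0 - overlap
            if score >= min_score:
                updated.append((box, score))
        remaining = updated
    return retained
```

Unlike hard NMS, a heavily overlapping frame is kept with a reduced confidence instead of being deleted, so a real face inside the overlap region can still be output if its decayed confidence stays high enough.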

For the GAN-based super-resolution reconstruction model, the SRGAN network is used in this embodiment. SRGAN is a widely used and effective super-resolution image reconstruction model trained on the basis of a GAN (Generative Adversarial Network). The SRGAN network consists of a generator G and a discriminator D. FIG. 4 is a block diagram of the generator in the SRGAN network. FIG. 5 is a block diagram of the discriminator in the SRGAN network. The core of the generator is a series of residual blocks, each containing two 3 × 3 convolutional layers followed by batch normalization (BN) layers, with PReLU as the activation function; two 2× sub-pixel convolutional layers are used to increase the feature size. The discriminator D uses a network structure similar to VGG19 but without max pooling. It contains 8 convolutional layers; as the network deepens, the number of features keeps increasing and the feature size keeps decreasing. LeakyReLU is used as the activation function, and two fully connected layers followed by a final sigmoid activation give the probability that the input is a real sample.
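The rearrangement step of a sub-pixel convolutional layer can be illustrated with plain nested lists. The sketch below covers only the "pixel shuffle" that turns C·r² channels of size H × W into C channels of size rH × rW; the preceding convolution is omitted, and the channel ordering follows a common convention that the source does not specify. The function name `pixel_shuffle` is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed channel ordering: input channel c*r*r + i*r + j supplies the
    output pixel at row offset i and column offset j inside each r x r cell.
    """
    channels = len(x)
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = channels // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


if __name__ == "__main__":
    # Four 1 x 1 channels become one 2 x 2 channel.
    print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2))
```

Applying such a 2× stage twice enlarges each spatial dimension by a factor of 4, which matches the overall up-sampling factor used in this embodiment.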

The existing SRGAN network suffers from models that are difficult to train and from the distribution overlap problem. Research shows that these problems arise because the conventional SRGAN network uses the KL divergence and the JS divergence to measure the distance between the real sample distribution and the generated sample distribution. In this embodiment, the EM divergence is adopted to solve these problems. The EM divergence is a symmetric divergence defined as follows:

Let Ω ⊂ Rⁿ be a bounded connected open set, and let S be the set of all Radon probability distributions on Ω. For some p ≠ 1 and k > 0, the EM divergence is calculated as follows:

where P_r and P_g denote two different probability distributions, inf denotes the infimum, x denotes a sample obeying the distribution P_r, x̃ denotes a sample obeying the distribution P_g, x̂ denotes a random linear combination of the samples x and x̃, P_u denotes the probability distribution of the sample x̂, k and p are constants, C_c¹(Ω) is the function space of all first-order differentiable functions with compact support on Ω, and ‖·‖ denotes the norm.

The advantage of the EM divergence is that, for two different distributions, it still reflects the distance between them even if they do not overlap. Meaningful gradients are therefore available throughout training, so the whole SRGAN network can be trained stably, and problems such as mode collapse caused by vanishing gradients, which may occur when training the original SRGAN network, are effectively mitigated. In this embodiment, the objective function in model training is improved on the basis of the EM divergence. The min-max optimization objective function of the SRGAN network improved with the EM divergence is:

where x denotes a real high-resolution sample, z denotes a low-resolution sample input to the generator G, G(z) is the super-resolution reconstructed sample generated by the generator G, P_g denotes the probability distribution of the super-resolution reconstructed samples, P_r denotes the probability distribution of the real high-resolution samples, D(x) and D(G(z)) denote the probabilities with which the discriminator D judges the high-resolution sample and the super-resolution reconstructed sample, respectively, to be real samples, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the real high-resolution sample x and the super-resolution reconstructed sample G(z), P_u denotes the probability distribution of the sample x̂, and k and p are constants.

In the training process, the optimization objective function is decomposed into two optimization problems:

1. optimization of the discriminator D:

2. optimization of generator G:

Based on the above derivation, the invention improves the training method of the SRGAN model to obtain a better SRGAN model, thereby improving the quality of the super-resolution face image reconstruction results. The specific training method is as follows:

First, a plurality of high-resolution face images I^HR are obtained, and the corresponding low-resolution face images I^LR are obtained by down-sampling. Each high-resolution face image I^HR and its corresponding low-resolution face image I^LR form a training sample, giving a training sample set. In this embodiment, down-sampling uses a Gaussian pyramid: the original image is taken as the bottom-layer image G0 (layer 0 of the Gaussian pyramid), convolved with a 5 × 5 Gaussian kernel, and then down-sampled by removing the even rows and columns to obtain the upper-layer image G1; this is iterated until the image has been down-sampled by a factor of 4.
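A minimal sketch of this down-sampling on a grayscale image stored as nested lists is given below. Several details are assumptions rather than facts from the source: the 5 × 5 kernel is the separable binomial approximation [1, 4, 6, 4, 1]/16 often used for Gaussian pyramids, border pixels are handled by clamping indices, "even rows and columns" is read with 1-based counting, and a factor of 4 is taken to mean two pyramid levels. All function names are hypothetical.

```python
KERNEL_1D = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # assumed 5-tap Gaussian


def _clamp(index, size):
    """Clamp an index into [0, size - 1] (assumed border handling)."""
    return min(max(index, 0), size - 1)


def gaussian_blur_5x5(img):
    """Blur with the separable 5 x 5 kernel: horizontal pass, then vertical."""
    h, w = len(img), len(img[0])
    horiz = [
        [sum(KERNEL_1D[k] * img[y][_clamp(x + k - 2, w)] for k in range(5))
         for x in range(w)]
        for y in range(h)
    ]
    return [
        [sum(KERNEL_1D[k] * horiz[_clamp(y + k - 2, h)][x] for k in range(5))
         for x in range(w)]
        for y in range(h)
    ]


def pyramid_down(img):
    """One pyramid level: blur, then drop every second row and column.

    Keeping 0-based indices 0, 2, 4, ... removes the even rows and columns
    when they are counted from 1.
    """
    blurred = gaussian_blur_5x5(img)
    return [row[::2] for row in blurred[::2]]


def downsample_by_4(img):
    """Two pyramid levels, i.e. an overall down-sampling factor of 4."""
    return pyramid_down(pyramid_down(img))
```

For example, an 8 × 8 image becomes 4 × 4 after one level and 2 × 2 after two.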

Then, training the SRGAN network by using the obtained training sample set, wherein the optimization objective function of the generator G in the training process is as follows:

the optimized objective function of the discriminator D is:

where x denotes a real high-resolution face image, z denotes a low-resolution face image input to the generator G, G(z) is the super-resolution reconstructed face image generated by the generator G, P_g denotes the probability distribution of the super-resolution reconstructed face images, P_r denotes the probability distribution of the real high-resolution face images, D(x) and D(G(z)) denote the probabilities with which the discriminator D judges the high-resolution face image and the super-resolution reconstructed face image, respectively, to be real face images, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the real high-resolution face image x and the super-resolution reconstructed face image G(z), P_u denotes the probability distribution of the sample x̂, and k and p are constants.

In the training of the SRGAN network, the generator G first performs super-resolution reconstruction on the low-resolution face image I^LR in each training sample X: the generator G up-samples the low-resolution face image I^LR in the training sample X to obtain a super-resolution reconstructed face image I^SR. Because in this embodiment the high-resolution face image I^HR is down-sampled by a factor of 4 to obtain the low-resolution face image I^LR, the up-sampling factor used when generating the super-resolution reconstructed face image I^SR is also 4.

Then the high-resolution face image I^HR corresponding to the low-resolution face image I^LR and the super-resolution reconstructed face image I^SR generated by the generator G are input into the discriminator D, and the loss function L_SR of the training sample is calculated according to the following formula:

Here the content loss function of the training sample is calculated as follows:

where the content loss function based on the mean square error is calculated as follows:

where W denotes the width of the high-resolution face image I^HR, H denotes the height of the high-resolution face image I^HR, r denotes the down-sampling factor, I^HR_{x,y} denotes the pixel value at coordinates (x, y) in the high-resolution face image I^HR, and I^SR_{x,y} denotes the pixel value at coordinates (x, y) in the super-resolution reconstructed face image I^SR.

The VGG loss is calculated as follows:

where i denotes the index of a max pooling layer of the VGG-19 network in the discriminator D, and j denotes the index of a convolutional layer between the i-th max pooling layer and the (i+1)-th max pooling layer. In the existing VGG-19 network there are 5 max pooling layers, and there are 2 or 4 convolutional layers between two adjacent max pooling layers. φ_i,j denotes the feature map obtained by the j-th convolutional layer after the i-th max pooling layer of the VGG-19 network in the discriminator D, W_i,j denotes the width of the feature map φ_i,j, and H_i,j denotes the height of the feature map φ_i,j.

The adversarial loss pushes the SRGAN network, by "fooling" the discriminator, to produce outputs closer to natural images. It is calculated as follows:

where D_θ_D(G_θ_G(I^LR)) denotes the probability with which the discriminator D judges the super-resolution reconstructed face image generated by the generator (i.e., I^SR) to be a real high-resolution face image, the subscripts θ_D and θ_G denote the network parameters of the discriminator D and the generator G respectively, w = 1, 2, …, W is the index of the network parameter dimension, and W denotes the number of dimensions of the network parameters.

In the invention, the super-resolution reconstruction model must also detect whether the super-resolution reconstructed image contains a face target. To better meet this requirement, a classification loss L_clc is added when the loss function is calculated, as follows:

where {y₁, y₂, …, y_v, …, y_V} denotes the calibration data indicating whether each marked region of the high-resolution face image I^HR is a face, V denotes the number of face regions marked in the high-resolution face image I^HR, and each y_v takes a value in {0, 1}.

Since the improved optimization objective function in this embodiment has no log term, the Adam optimization algorithm can be used to optimize the objective functions of the generator G and the discriminator D, thereby improving training efficiency. For the generator G, the Adam optimization algorithm is used to update the weight w_G of the generator G by gradient descent:

where ∇_w_G denotes the descent gradient with respect to the weight w_G, z_m denotes the value of the m-th pixel of the super-resolution reconstructed face image I^SR, m = 1, 2, …, M, M denotes the number of pixels, D(G(z_m)) denotes the probability with which the discriminator D judges the m-th pixel of the super-resolution reconstructed face image I^SR to be a pixel of the high-resolution face image I^HR, α denotes the learning rate, β₁ denotes the exponential decay rate of the first moment estimate, and β₂ denotes the exponential decay rate of the second moment estimate. Typical values of these three parameters of the Adam optimization algorithm are α = 0.00001, β₁ = 0.9 and β₂ = 0.999.

The Adam optimization algorithm is likewise used to update the weight w_D of the discriminator D by gradient descent:

where ∇_w_D denotes the descent gradient with respect to the weight w_D, x_m denotes the value of the m-th pixel of the high-resolution face image I^HR, D(x_m) denotes the probability with which the discriminator D judges the m-th pixel of the high-resolution face image I^HR to be a pixel of the high-resolution face image I^HR, ∇_x̂_m denotes the descent gradient with respect to x̂_m, μ_m = m/M is the combination coefficient of the sample x̂_m, and D(x̂_m) denotes the probability with which the discriminator D judges x̂_m to be a pixel of the high-resolution face image I^HR.

In this embodiment, the weight w_G of the generator G and the weight w_D of the discriminator D are preferably updated alternately: the parameters of the generator G are first fixed while the parameters of the discriminator D are updated, then the parameters of the discriminator D are fixed while the parameters of the generator G are updated, and so on.

To better illustrate the technical effect of the invention, the invention is experimentally verified on a group of low-resolution face images. In the experimental verification, the face detection model is the R-FCN model of this embodiment with improved anchor frame generation parameters and the improved frame regression algorithm, and the GAN-based super-resolution reconstruction model is the SRGAN model obtained by the improved training method of this embodiment. The WIDER FACE training sample set is used to train the face detection model and the GAN-based super-resolution reconstruction model; 10 images are randomly drawn from each of the 61 categories, giving 610 detection images in total. For comparison of technical effects, the SFD face detection method and the R-FCN face detection method are selected as comparison methods in the experimental verification.

In order to evaluate the technical effects of the face detection method and the comparison method, a PR curve is selected as an evaluation standard. The PR curve is a curve drawn with Precision (Precision) as the ordinate and Recall (Recall) as the abscissa.
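How a PR curve and an AP value can be obtained from scored detections is sketched below. The source does not state which AP variant was used, so the all-point interpolation here is an assumption, as is the premise that each detection has already been matched against the ground truth and labelled as a true or false positive. Function names are hypothetical.

```python
def pr_curve(scores, is_true_positive, num_ground_truth):
    """Return (precisions, recalls), one point per detection.

    Detections are visited in order of decreasing confidence, so each
    point corresponds to one possible confidence cut-off.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the PR curve with all-point interpolation (assumed variant)."""
    interp = list(precisions)
    # Make precision non-increasing as recall grows.
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(interp, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A curve that stays closer to the upper right corner encloses a larger area and therefore yields a higher AP; mAP averages the AP values over the evaluated sets.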

Fig. 6 is a PR curve diagram of the three methods in this experimental verification. As shown in Fig. 6, among the three face detection methods, the PR curve of the present invention is closest to the upper right corner overall, and its mAP (mean average precision, i.e., the average AP value) is 0.947, which is also the best of the three sets of data.

Fig. 7 is an example of a detection result of the SFD face detection method in this experimental verification. Fig. 8 is an example of a detection result of the R-FCN face detection method in this experimental verification. Fig. 9 is an example of a detection result of the present invention in this experimental verification. Comparing Figs. 7 to 9, the present invention detects 14 faces in total, whereas the other two methods detect 11 and 9 faces respectively, so the present invention shows better detection performance.

Face detection is then carried out on image samples of different sharpness. The WIDER FACE training sample set annotates a blur attribute for each face target, with three levels: clear, generally blurred, and severely blurred. Accordingly, samples are extracted from the images of each blur level to form the detection sample sets. Fig. 10 is a PR curve diagram of face detection by the three methods on the clear detection sample set in this experimental verification. Fig. 11 is a PR curve diagram of face detection by the three methods on the generally blurred detection sample set. Fig. 12 is a PR curve diagram of face detection by the three methods on the severely blurred detection sample set. As shown in Figs. 10 to 12, when the samples are clear, all three methods detect faces well, the differences among them are small, and the mAP values are very high; in the generally blurred test group, the mAP values of the three methods drop slightly but still exceed 97%, which indicates that general blur does not pose much of a challenge to any of the three methods.
Meanwhile, under general blur the invention has some advantage over SFD and R-FCN, but the advantage is not pronounced. Under severe blur, the differences among the three methods begin to show: SFD performs worst, with its mAP dropping by about 10 percentage points relative to the clear sample set; the method of the invention shows the smallest drop, about 5 percentage points, and in this case its mAP is about 2 percentage points higher than that of the original R-FCN model, while its PR curve clearly encloses the PR curves of the two comparison methods. Compared with the other two methods, the invention is therefore more stable and achieves a higher detection rate at low resolution.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A face detection method based on two-stage detection is characterized by comprising the following steps:

S3: inputting a face image to be detected into the face detection model to obtain the coordinate information of each candidate region of a face target and the confidence value C that the candidate region belongs to a face; presetting confidence thresholds T₁ and T₂ with 0 < T₁ < T₂ < 1; for each candidate region, if the corresponding confidence value C ≥ T₂, judging that a face target exists in the candidate region and outputting it as a face target region; if the corresponding confidence value satisfies T₁ ≤ C < T₂, taking the candidate region as a face target to be determined; otherwise, judging that no face target exists in the candidate region and not outputting it;

S4: inputting each face target to be determined into the generator G of the super-resolution reconstruction model based on the GAN network to generate a super-resolution reconstructed image SR, then inputting the image SR into the discriminator D, which judges whether the image SR is a qualified super-resolution reconstructed image and whether it contains a face target; if the image SR is a qualified super-resolution reconstructed image and contains a face target, judging that a face target exists in the corresponding candidate region and outputting it as a face target region; otherwise, judging that no face target exists in the corresponding candidate region.
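Steps S3 and S4 amount to a three-way split on the confidence value C, with candidates in the band T₁ ≤ C < T₂ treated as face targets to be determined and passed to the second stage. A minimal Python sketch of this decision logic follows. The function name `two_stage_detect`, the box layout, and the `second_stage` callable (standing in for the generator G followed by the discriminator D) are hypothetical and serve only to illustrate the control flow.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x1, y1, x2, y2) layout


def two_stage_detect(
    candidates: Sequence[Tuple[Box, float]],
    t1: float,
    t2: float,
    second_stage: Callable[[Box], bool],
) -> List[Box]:
    """Output confident boxes directly; send uncertain ones to the second stage."""
    if not 0 < t1 < t2 < 1:
        raise ValueError("thresholds must satisfy 0 < T1 < T2 < 1")
    faces = []
    for box, c in candidates:
        if c >= t2:
            faces.append(box)   # accepted by the face detection model alone
        elif c >= t1 and second_stage(box):
            faces.append(box)   # accepted after reconstruction and discrimination
        # c < t1, or rejected by the second stage: not output
    return faces
```

Only the candidates in the middle band incur the cost of super-resolution reconstruction, which is the practical point of using two thresholds rather than one.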

2. The face detection method of claim 1, wherein the face detection model uses an R-FCN network.

3. The face detection method of claim 2, wherein the anchor frames in the R-FCN network are generated at five scales {16×16, 32×32, 128×128, 256×256, 512×512} and three aspect ratios {1:1, 1:2, 2:1}.
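Five scales and three aspect ratios give 15 anchor frames per position. The following Python sketch enumerates them under stated assumptions: each ratio a:b is read as width:height, and the anchor area is kept equal to the square of the scale. The claim fixes neither convention, and the name `make_anchors` is hypothetical.

```python
import math

SCALES = [16, 32, 128, 256, 512]    # side lengths of the square reference areas
RATIOS = [(1, 1), (1, 2), (2, 1)]   # aspect ratios as listed in the claim


def make_anchors(cx, cy):
    """Return 15 anchors (x1, y1, x2, y2) centred at (cx, cy).

    Assumption: ratio a:b means width:height, and every anchor keeps the
    area scale * scale regardless of its aspect ratio.
    """
    anchors = []
    for s in SCALES:
        for a, b in RATIOS:
            w = s * math.sqrt(a / b)
            h = s * math.sqrt(b / a)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```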

4. The face detection method of claim 2, wherein the frame regression algorithm in the R-FCN network comprises the following specific steps:

1) denoting the set of anchor frames containing the background as B = {b₁, b₂, …, b_N}, with anchor frames b_n, n = 1, 2, …, N, where N represents the number of anchor frames containing the background and the confidence of each anchor frame is s_n; initializing the set of reserved anchor frames D as an empty set;

2) selecting the anchor frame with the maximum confidence from the current anchor frame set B, recording it as the current optimal anchor frame b′, adding the current optimal anchor frame b′ to the reserved anchor frame set D, and deleting it from the anchor frame set B;

3) judging whether the anchor frame set B is empty, if so, finishing frame regression, and otherwise, entering the step 4);

4) for each anchor frame b_n in the current anchor frame set B, calculating the intersection over union iou(b′, b_n) between it and the current optimal anchor frame b′, and then updating the confidence s_n of each anchor frame b_n using the following formula:

wherein N_t is a preset intersection-over-union threshold;

and then returning to step 2).
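The loop formed by steps 1) to 4) can be sketched in Python as follows. The confidence update used here is an assumption: anchors whose intersection over union with b′ reaches N_t have their confidence scaled by (1 − iou), which is the linear Soft-NMS rule, and the formula of the claim is not reproduced. The names `iou` and `rescore_anchors` and the default threshold are likewise illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rescore_anchors(boxes, scores, n_t=0.3):
    """Repeatedly move the best anchor into the reserved set and rescore the rest.

    Assumption: linear decay s_n * (1 - iou) when iou >= n_t, otherwise unchanged.
    """
    remaining = list(zip(boxes, scores))   # set B with confidences s_n
    kept = []                              # reserved set D, initially empty
    while remaining:                       # step 3): stop once B is empty
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)   # step 2)
        kept.append((best_box, best_score))
        remaining = [                      # step 4): update each confidence
            (b, s * (1.0 - iou(best_box, b)) if iou(best_box, b) >= n_t else s)
            for b, s in remaining
        ]
    return kept
```

Unlike hard suppression, an overlapping anchor is not discarded here; it stays in the result with a reduced confidence, which a later threshold can still filter.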

5. The face detection method of claim 1, wherein the GAN network-based super-resolution reconstruction model adopts an SRGAN network.

6. The face detection method of claim 5, wherein the SRGAN network is trained by the following method:

firstly, obtaining a plurality of high-resolution face images I^HR and obtaining the corresponding low-resolution face images I^LR by down-sampling, each high-resolution face image I^HR and its corresponding low-resolution face image I^LR forming a training sample, thereby obtaining a training sample set;

the optimized objective function of the discriminator D is:

wherein x denotes a real high-resolution face image, z denotes a low-resolution face image input to the generator G, G(z) is the super-resolution reconstructed face image generated by the generator G, P_g denotes the probability distribution of super-resolution reconstructed face images, P_r denotes the probability distribution of real high-resolution face images, D(x) and D(G(z)) denote the probabilities with which the discriminator D judges the high-resolution face image and the super-resolution reconstructed face image, respectively, to be real face images, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the real high-resolution face image x and the super-resolution reconstructed face image G(z), and k and p each denote a constant.
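Two data operations named in this claim, building a low-resolution image by down-sampling and forming a random linear combination of x and G(z), can be sketched in Python as follows. The choices made here are assumptions: box averaging with a factor of 4 for the down-sampling, and a single coefficient drawn uniformly from [0, 1) for the combination. The claim specifies neither, and the function names are hypothetical.

```python
import random


def downsample(hr, factor=4):
    """Box-average an image (list of rows of floats) by an integer factor.

    Assumption: average pooling; the claim only says "down-sampling".
    """
    h, w = len(hr) // factor, len(hr[0]) // factor
    return [
        [
            sum(hr[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(w)
        ]
        for r in range(h)
    ]


def random_combination(x, gz, rng=random):
    """Random linear combination eps * x + (1 - eps) * G(z) of equal-size images."""
    eps = rng.random()   # assumption: one coefficient drawn uniformly from [0, 1)
    return [
        [eps * a + (1 - eps) * b for a, b in zip(row_x, row_g)]
        for row_x, row_g in zip(x, gz)
    ]
```

Each pair (image, downsample(image)) would then form one training sample, with the down-sampled image fed to the generator.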

7. The face detection method of claim 6, wherein in the SRGAN network training process, the loss function L_SR of a training sample is calculated according to the following formula:

wherein the loss function L_SR comprises the content loss function of the training sample, the adversarial loss, and the classification loss L_clc.

8. The face detection method of claim 6, wherein in the SRGAN network training process, an Adam optimization algorithm is adopted to realize objective function optimization of a generator G and a discriminator D, and the specific method is as follows:

updating the weight w_G of the generator G by gradient descent using the Adam optimization algorithm:

wherein ∇_{w_G} denotes the gradient with respect to the weight w_G, z_m denotes the value of the m-th pixel of the super-resolution reconstructed face image I^SR, m = 1, 2, …, M, M denotes the number of pixels, D(G(z_m)) denotes the probability with which the discriminator D judges the m-th pixel of the super-resolution reconstructed face image I^SR to be a pixel of the high-resolution face image I^HR, α denotes the learning rate, β₁ denotes the exponential decay rate of the first-moment estimate, and β₂ denotes the exponential decay rate of the second-moment estimate;

wherein ∇_{w_D} denotes the gradient with respect to the weight w_D, x_m denotes the value of the m-th pixel of the high-resolution face image I^HR, D(x_m) denotes the probability with which the discriminator D judges the m-th pixel of I^HR to be a pixel of the high-resolution face image I^HR, ∇_{x̂_m} denotes the gradient with respect to x̂_m, μ_m = m/M, and D(x̂_m) denotes the probability with which the discriminator D judges x̂_m to be a pixel of the high-resolution face image I^HR.

9. The super-resolution facial image reconstruction method according to claim 8, wherein the weight w_G of the generator G and the weight w_D of the discriminator D are updated alternately during the optimization of the objective functions of the generator G and the discriminator D.