CN110189255B - Face detection method based on two-stage detection - Google Patents

Face detection method based on two-stage detection

Info

Publication number
CN110189255B
Authority
CN
China
Prior art keywords
face
resolution
image
super
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910455695.5A
Other languages
Chinese (zh)
Other versions
CN110189255A (en
Inventor
于力
刘意文
邹见效
杨瞻远
徐红兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910455695.5A priority Critical patent/CN110189255B/en
Publication of CN110189255A publication Critical patent/CN110189255A/en
Application granted granted Critical
Publication of CN110189255B publication Critical patent/CN110189255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution

Abstract

The invention discloses a face detection method based on two-stage detection. A face detection model and a GAN-based super-resolution reconstruction model are first trained separately. The face image to be detected is then input into the face detection model to obtain the coordinate information of each candidate region of a face target and a confidence value that the candidate region belongs to a face; a preliminary judgment is made according to the confidence value, and the face targets still to be determined are input into the generator of the GAN-based super-resolution reconstruction model for further judgment. By adopting two-stage detection, the invention effectively improves the detection rate of low-resolution face images.

Description

Face detection method based on two-stage detection
Technical Field
The invention belongs to the technical field of low-resolution face detection, and particularly relates to a face detection method based on two-stage detection.
Background
The face detection problem originally arose as a sub-problem of face recognition systems and has gradually become an independent subject as research deepened. Current face detection technology draws on machine learning, computer vision, pattern recognition, artificial intelligence, and related fields; it has become the basis of all face image analysis and derivative applications, and it strongly affects the response speed and detection accuracy of the systems built on it. As the application scenarios of face detection keep expanding, input face images that are too small or of too low quality are increasingly encountered, and for such low-resolution face images the accuracy of a face detection system often drops sharply. The problem of detecting low-quality, small-size face images is commonly referred to as low-resolution face detection.
In essence, current face detection algorithms solve a binary classification problem: effective features are extracted from a region to be detected, and those features are used to judge whether a face is present; low-resolution face detection is studied on the same basis. A low-resolution face has three characteristics: little information, heavy noise, and few usable cues. As a result, a candidate region cannot yield enough effective features to represent itself, so conventional methods cannot extract sufficiently expressive features from a low-resolution face. Deep neural networks suffer from an analogous inherent deficiency: the early convolutional layers cannot provide a sufficiently powerful feature map, and the later convolutional layers cannot provide enough features for the low-resolution face region, which makes detecting low-resolution faces very difficult.
To address low-resolution face detection, many excellent scholars have carried out extensive targeted research. Broadly, work at home and abroad concentrates on three directions: finding resolution-robust feature expressions for face regions, designing new classifiers adapted to the characteristics of low-resolution faces, and image super-resolution methods. It should be recognized that research on low-resolution small-face detection is still at an early stage and many problems remain. On the one hand, how to effectively extract the context information of a low-resolution face and integrate it into the detection network still requires further exploration in order to give low-resolution face detectors better performance. On the other hand, a complete face detection system must detect faces at all scales, so a method for low-resolution faces must also preserve detection capability at other scales; in practice, the fusion problems of multi-scale detection leave low-resolution face detection systems with low accuracy or low processing speed, which is a major problem to be solved urgently.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a face detection method based on two-stage detection.
In order to achieve the above purpose, the face detection method based on two-stage detection of the invention comprises the following steps:
s1: acquiring a plurality of face image training samples, wherein each training sample comprises a face image and face target information, and the face image training samples are adopted to train a face detection model;
s2: acquiring a plurality of super-resolution face image reconstruction training samples, wherein each training sample comprises a low-resolution image containing a face and a corresponding high-resolution image, the super-resolution face image reconstruction training samples are adopted to train a super-resolution reconstruction model based on a GAN network, and the super-resolution reconstruction model based on the GAN network comprises a generator G and a discriminator D;
s3: inputting a face image to be detected into the face detection model to obtain the coordinate information of each candidate region of a face target and a confidence value C that the candidate region belongs to a face; presetting confidence thresholds T1 and T2 with 0 < T1 < T2 < 1; for each candidate region, if the corresponding confidence value C ≥ T2, judging that a face target exists in the candidate region and outputting it as a face target region; if T1 ≤ C < T2, taking the candidate region as a face target to be determined; if C < T1, judging that no face target exists in the candidate region and not outputting it;
s4: inputting each face target to be determined into the generator G of the GAN-based super-resolution reconstruction model to generate a super-resolution reconstructed image R; then inputting the image R into the discriminator D, which judges whether R is a qualified super-resolution reconstructed image and whether R contains a face target; if R is a qualified super-resolution reconstructed image and contains a face target, judging that a face target exists in the corresponding candidate region and outputting the candidate region as a face target region, and otherwise judging that no face target exists.
According to the face detection method based on two-stage detection of the invention, a face detection model and a GAN-based super-resolution reconstruction model are first trained separately; the face image to be detected is then input into the face detection model to obtain the coordinate information of each candidate region of a face target and a confidence value that the candidate region belongs to a face; a preliminary judgment is made according to the confidence value, and the face targets to be determined are input into the generator of the GAN-based super-resolution reconstruction model for further judgment. By adopting two-stage detection, the invention effectively improves the detection rate of low-resolution face images.
Drawings
FIG. 1 is a flow chart of an embodiment of a face detection method based on two-stage detection according to the present invention;
FIG. 2 is a schematic diagram of the structure of an R-FCN network;
FIG. 3 is a flowchart of the improved frame regression algorithm in this embodiment;
FIG. 4 is a block diagram of the generator in the SRGAN network;
FIG. 5 is a block diagram of the discriminator in the SRGAN network;
FIG. 6 is a PR graph of three methods in this experimental verification;
FIG. 7 is an exemplary diagram of the detection result of the SFD face detection method in the present experimental verification;
FIG. 8 is an exemplary diagram of the detection results of the R-FCN face detection method in the experimental verification;
FIG. 9 is an exemplary diagram of the test results of the present invention in this experimental verification;
FIG. 10 is a PR graph showing the face detection of a clear detection sample set by three methods in the present experimental verification;
FIG. 11 is a PR graph of face detection on a general fuzzy detection sample set by three methods in the experimental verification;
fig. 12 is a PR curve diagram of face detection performed on a severely blurred detection sample set by three methods in the experimental verification.
Detailed Description
Specific embodiments of the present invention are described below in conjunction with the accompanying drawings so that those skilled in the art can better understand the present invention. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.
Examples
Fig. 1 is a flow chart of a specific embodiment of the face detection method based on two-stage detection according to the present invention. As shown in fig. 1, the two-stage detection-based face detection method of the present invention specifically includes the following steps:
s101: training a face detection model:
the method comprises the steps of obtaining a plurality of face image training samples, wherein each training sample comprises a face image and face target information, and training a face detection model by adopting the face image training samples.
S102: training a super-resolution reconstruction model:
the method comprises the steps of obtaining a plurality of super-resolution face image reconstruction training samples, wherein each training sample comprises a low-resolution image containing a face and a corresponding high-resolution image, training a super-resolution reconstruction model based on a GAN (generic adaptive Network) Network by adopting the super-resolution face image reconstruction training samples, and generating the super-resolution reconstruction model based on the GAN Network, and the super-resolution reconstruction model comprises a generator G and a discriminator D.
S103: adopting a face detection model to carry out preliminary detection:
and inputting the face image to be detected into a face detection model to obtain the coordinate information of each candidate region of the face target and a confidence value C of the candidate region belonging to the face. Presetting confidence threshold T 1 And T 2 And 0 < T 1 <T 2 Is less than 1. For each candidate region, if the corresponding confidence value C ≧ T 2 Judging that the candidate region has a face target, outputting the candidate region as a face target region, and if the corresponding confidence value T is detected 1 ≤C<T 2 If not, judging that the candidate area has no human face target, and not outputting.
S104: detecting by adopting a super-resolution reconstruction model:
Each face target to be determined is input into the generator G of the GAN-based super-resolution reconstruction model to generate a super-resolution reconstructed image SR; the image SR is then input into the discriminator D, which judges whether SR is a qualified super-resolution reconstructed image and whether SR contains a face target. If SR is a qualified super-resolution reconstructed image and contains a face target, it is judged that a face target exists in the corresponding candidate region, which is output as a face target region; otherwise it is judged that no face target exists in the corresponding candidate region.
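To make the two-stage decision logic of S103 and S104 concrete, the following Python sketch wires the two stages together. It is a minimal sketch, not the patented implementation: face_detector, generator and discriminator are hypothetical callables standing in for the trained R-FCN and SRGAN components, and the threshold values are illustrative.

```python
def two_stage_detect(image, face_detector, generator, discriminator,
                     t1=0.4, t2=0.8):
    """Two-stage face detection: confidence gating plus SR re-check.

    face_detector(image) -> list of ((x1, y1, x2, y2), confidence)
    generator(crop)      -> super-resolution reconstruction of the crop
    discriminator(sr)    -> (is_valid_sr, contains_face) booleans
    All three callables are stand-ins for the trained models;
    `image` is assumed to be an H x W (x C) array with integer box coords.
    """
    assert 0.0 < t1 < t2 < 1.0
    accepted = []
    for box, conf in face_detector(image):
        if conf >= t2:                      # confident: accept directly
            accepted.append(box)
        elif conf >= t1:                    # uncertain: second-stage check
            crop = image[box[1]:box[3], box[0]:box[2]]
            sr = generator(crop)            # super-resolution reconstruction
            is_valid_sr, contains_face = discriminator(sr)
            if is_valid_sr and contains_face:
                accepted.append(box)
        # conf < t1: rejected, nothing output
    return accepted
```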
In the two-stage face detection method of the invention, the super-resolution reconstruction model assists the face detection model by further examining the candidate regions with low confidence, which avoids missed detections and false detections of face targets and improves detection performance.
As the face detection model, any suitable model can be selected as needed; in this embodiment an R-FCN network is selected as the face detection model and improved for low-resolution face detection. The R-FCN network is a modification of the traditional Faster R-CNN structure. Its core idea is to introduce position-sensitive information on top of the RPN (Region Proposal Network) proposed in Faster R-CNN and to move the ROI layer backwards: a position-sensitive score map is used to compute the probability that entities in the image to be detected belong to each category, which greatly improves the detection rate while keeping high localization accuracy. FIG. 2 is a schematic diagram of the structure of an R-FCN network. As shown in FIG. 2, the workflow of the R-FCN can be briefly described as follows:
the image is input into a pre-trained classification network (a network before Conv4 of a ResNet-101 network is used in the figure 2), and corresponding network parameters are fixed. There are 3 branches on the feature map (feature map) obtained on the last convolutional layer of the pre-trained network:
the 1 st branch is to perform RPN operation on the feature map to obtain corresponding candidate region ROI, and the specific method is as follows: anchor boxes (Anchors) are generated on the feature map according to preset parameters, the anchor boxes being a set of regions having different sizes and aspect ratios across the input image. And then identifying an anchor frame containing the foreground, and converting the anchor frame into a target Bounding Box (Bounding Box) by using a Bounding Box regression algorithm so that the Bounding Box can more closely fit the contained foreground object.
The 2nd branch obtains a k × k × (C + 1)-dimensional position-sensitive score map on the feature map for classification.
The 3rd branch obtains a 4 × k × k-dimensional position-sensitive score map on the feature map for regression.
Finally, position-sensitive ROI pooling is performed on the k × k × (C + 1)-dimensional and the 4 × k × k-dimensional position-sensitive score maps respectively to obtain the confidence and position information of each candidate region, and the corresponding category is then obtained by confidence judgment.
In this embodiment, the anchor-frame generation parameters are improved first. In a conventional R-FCN network, anchor frames are generated with three scales and three aspect ratios: the three scales default to {128 × 128, 256 × 256, 512 × 512} and the three aspect ratios to {1:1, 1:2, 2:1}. When the detection target is a small face, small face regions are then easily missed. Therefore, in this embodiment the anchor-frame scales are modified to the five scales {16 × 16, 32 × 32, 128 × 128, 256 × 256, 512 × 512}, and each scale generates three anchor frames with aspect ratios {1:1, 1:2, 2:1}, as sketched below. The two added small scales serve to detect small faces, while the three retained larger scales extract face regions of regular size.
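A sketch of the modified anchor generation might look as follows; the centering convention and the interpretation of the aspect ratios as height-to-width ratios at constant area are assumptions, since the text only specifies the scale and ratio sets.

```python
import numpy as np

# Modified anchor parameters from this embodiment
SCALES = [16, 32, 128, 256, 512]      # anchor side lengths (pixels)
RATIOS = [1.0, 0.5, 2.0]              # assumed height/width ratios 1:1, 1:2, 2:1

def anchors_at(cx, cy):
    """Generate the 15 anchors (5 scales x 3 ratios) centred at (cx, cy)."""
    boxes = []
    for s in SCALES:
        for r in RATIOS:
            w = s / np.sqrt(r)        # keep area ~ s*s while varying the ratio
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)
```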
As for the border regression algorithm, the prior art mostly adopts the NMS (Non-Maximum Suppression) algorithm, whose core idea is to find local maxima and suppress non-maximum values: iteratively, the anchor box with the highest confidence is used to compute the Intersection over Union (IoU) with the other anchor boxes, and boxes with large IoU are filtered out. However, the NMS algorithm has been found to have the following problems:
1) The NMS algorithm forcibly sets the confidence of adjacent, overlapping candidate frames to 0; that is, it directly and crudely deletes candidate frames whose IoU value exceeds the threshold. If a real target to be detected appears in the overlapped area, its detection will very likely fail, increasing the missed-detection rate and lowering the average detection rate.
2) When the NMS algorithm is used for frame regression, the optimal value of the IoU judgment threshold Nt is difficult to determine: setting it too large increases the false detection rate, and setting it too small increases the missed detection rate.
In order to solve the above problems, this embodiment improves the frame regression algorithm on the basis of the NMS algorithm. Fig. 3 is a flowchart of the improved bounding-box regression algorithm in this embodiment. As shown in fig. 3, the specific steps of the improved border regression algorithm are:
s301: initializing data:
The anchor frame set B = {b_1, b_2, …, b_N} is given, where b_n, n = 1, 2, …, N, and N denotes the number of anchor frames containing foreground; the confidence of each anchor frame b_n is denoted s_n. The reserved anchor frame set is initialized as D = ∅.
S302: selecting a current optimal anchor frame:
and selecting the anchor frame with the maximum confidence level from the current anchor frame set B, recording the anchor frame as the current optimal anchor frame B ', adding the current optimal anchor frame B ' into the reserved anchor frame set D, and deleting the current optimal anchor frame B ' from the anchor frame set B.
S303: judging whether the anchor frame set B is empty; if so, the frame regression is finished, and otherwise the algorithm proceeds to step S304.
S304: updating the confidence:
for each anchor frame B in the current anchor frame set B n Calculating the intersection ratio iou (b ', b) of the current optimal anchor frame b' and the current optimal anchor frame b i ) Then each anchor frame b is updated using the following formula n S confidence of n
Figure GDA0003917600010000062
Wherein N is t Is a preset intersection ratio threshold value.
And then returns to step S302.
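Assuming the linear-decay update reconstructed above, the improved frame regression can be sketched in Python as a soft-NMS loop; the box format and the threshold value are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, nt=0.3):
    """Improved border regression: decay confidences instead of deleting."""
    boxes, scores = list(boxes), list(scores)
    kept = []
    while boxes:                        # S302/S303: loop until B is empty
        best = int(np.argmax(scores))   # current optimal anchor frame b'
        b = boxes.pop(best)
        scores.pop(best)
        kept.append(b)
        for n, bn in enumerate(boxes):  # S304: update remaining confidences
            o = iou(b, bn)
            if o >= nt:                 # overlap above threshold: linear decay
                scores[n] *= (1.0 - o)
    return kept
```

Compared with hard NMS, overlapping frames keep a reduced confidence and can still be selected in a later iteration, which addresses problem 1) above.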
As the GAN-based super-resolution reconstruction model, this embodiment employs the SRGAN network. The SRGAN network is a widely used and highly effective super-resolution image reconstruction model trained as a GAN (Generative Adversarial Network); it consists of a generator G and a discriminator D. Fig. 4 is a block diagram of the generator in the SRGAN network. Fig. 5 is a block diagram of the discriminator in the SRGAN network. The core of the generator is a number of residual blocks, each containing two 3 × 3 convolutional layers followed by batch normalization (BN) layers with PReLU as the activation function; two 2× sub-pixel convolutional layers are used to increase the feature size. The discriminator D uses a network structure similar to VGG19 but without max-pooling: it comprises 8 convolutional layers, with the number of features increasing and the feature size decreasing as the network deepens, uses LeakyReLU as the activation function, and finally obtains the probability of a sample being real through two fully connected layers and a final sigmoid activation function.
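The generator building blocks described above can be sketched in PyTorch as follows; the channel counts are assumptions, and the two 2× sub-pixel (PixelShuffle) blocks together give the 4× upscaling used later in this embodiment.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One SRGAN generator residual block: conv-BN-PReLU-conv-BN + skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)            # identity skip connection

class Upsample2x(nn.Module):
    """2x sub-pixel upsampling block; two of these give the 4x factor."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)  # rearranges channels into 2x space
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))
```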
The existing SRGAN network is difficult to train and runs into problems when the two distributions barely overlap; research shows that these problems are caused by adopting the KL divergence and JS divergence as the measure of distance between the real sample distribution and the generated sample distribution in the traditional SRGAN network. In this embodiment, the EM divergence is adopted to solve these problems. The EM divergence is a symmetric divergence defined as follows:
Let Ω ⊂ R^n be a bounded continuous open set and S the set of all Radon probability distributions on Ω. For some p ≠ 1 and k > 0, the EM divergence is calculated as

$$W_{k,p}(P_r, P_g) = \inf_{f \in C_c^1(\Omega)} \left\{ \mathbb{E}_{x \sim P_r}\big[f(x)\big] - \mathbb{E}_{\tilde{x} \sim P_g}\big[f(\tilde{x})\big] + k\, \mathbb{E}_{\hat{x} \sim P_u}\Big[\big\|\nabla_{\hat{x}} f(\hat{x})\big\|^{p}\Big] \right\}$$

where P_r and P_g denote two different probability distributions, P_u denotes a random probability distribution, inf denotes the infimum, x denotes a sample obeying the distribution P_r, x̃ denotes a sample obeying the distribution P_g, x̂ denotes a random linear combination of the samples x and x̃, P_u denotes the distribution of the samples x̂, k and p each denote a constant, C_c^1(Ω) is the function space of all first-order differentiable functions with compact support on Ω, and ‖·‖ denotes the norm.
The advantage of the EM divergence is that, for two different distributions, it still reflects the distance between them even when they do not overlap. This means meaningful gradients are available at all times during training, so the whole SRGAN network can be trained stably, effectively avoiding the mode collapse caused by vanishing gradients that can occur when training the original SRGAN network. In this embodiment the objective function used in model training is improved on the basis of the EM divergence. The optimization objective function of the SRGAN network improved with the EM divergence is the following minimax problem:
$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}\big[D(x)\big] - \mathbb{E}_{z}\big[D(G(z))\big] - k\, \mathbb{E}_{\hat{x} \sim P_u}\Big[\big\|\nabla_{\hat{x}} D(\hat{x})\big\|^{p}\Big]$$

where x denotes the true high-resolution sample, z denotes the low-resolution sample input to the generator G, G(z) is the super-resolution reconstructed sample generated by the generator G, P_g denotes the probability distribution of the super-resolution reconstructed samples, P_r denotes the probability distribution of the true high-resolution samples, D(x) and D(G(z)) respectively denote the probabilities that the discriminator D judges the high-resolution sample and the super-resolution reconstructed sample to be real samples, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the true high-resolution sample x and the super-resolution reconstructed sample G(z), P_u denotes the distribution of the samples x̂, and k and p each denote a constant.
In the training process, the optimization objective function is decomposed into two optimization problems:

1. optimization of the discriminator D:

$$\max_D \; \mathbb{E}_{x \sim P_r}\big[D(x)\big] - \mathbb{E}_{z}\big[D(G(z))\big] - k\, \mathbb{E}_{\hat{x} \sim P_u}\Big[\big\|\nabla_{\hat{x}} D(\hat{x})\big\|^{p}\Big]$$

2. optimization of the generator G:

$$\max_G \; \mathbb{E}_{z}\big[D(G(z))\big]$$
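Assuming the objectives reconstructed above, the two optimization problems can be sketched in PyTorch as follows; the gradient penalty is computed with torch.autograd.grad on a random linear combination of real and generated samples, and the default values of k and p are placeholder assumptions, not values given by the text.

```python
import torch

def discriminator_loss(D, real, fake, k=2.0, p=6.0):
    """EM-divergence discriminator loss: score gap plus gradient penalty.

    Minimizing this maximizes E[D(x)] - E[D(G(z))] - k*E[||grad D(x_hat)||^p].
    k and p are placeholder constants.
    """
    u = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (u * real + (1 - u) * fake).detach().requires_grad_(True)
    d_hat = D(x_hat)
    grad = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    penalty = grad.flatten(1).norm(2, dim=1).pow(p).mean()
    return D(fake).mean() - D(real).mean() + k * penalty

def generator_loss(D, fake):
    """Generator loss: maximize the discriminator score on generated samples."""
    return -D(fake).mean()
```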
based on the technical derivation, the invention improves the training method of the SRGAN model to obtain a more advantageous SRGAN model, thereby improving the quality of the super-resolution face image reconstruction result. The specific training method comprises the following steps:
First, a plurality of high-resolution face images I_HR are acquired, and the corresponding low-resolution face images I_LR are obtained through downsampling; each high-resolution face image I_HR and its corresponding low-resolution face image I_LR form a training sample, thereby yielding a training sample set. In this embodiment a Gaussian pyramid is used for downsampling: the original image G0 (layer 0 of the Gaussian pyramid) serves as the bottom layer and is convolved with a 5 × 5 Gaussian kernel; the convolved image is then downsampled (removing the even rows and columns) to obtain the next layer G1; iterating this process completes the 4× downsampling.
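The Gaussian-pyramid downsampling can be sketched with OpenCV, whose cv2.pyrDown performs exactly the 5 × 5 Gaussian blur followed by removal of the even rows and columns; two pyrDown steps give the overall 4× factor. The file path is a placeholder.

```python
import cv2

def downsample_4x(path):
    """Build the low-resolution image I_LR from I_HR via a Gaussian pyramid."""
    g0 = cv2.imread(path)     # layer 0: the high-resolution image I_HR
    g1 = cv2.pyrDown(g0)      # 5x5 Gaussian blur, then drop even rows/columns
    g2 = cv2.pyrDown(g1)      # second step: overall 4x downsampling -> I_LR
    return g2
```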
Then, the SRGAN network is trained with the obtained training sample set. The optimization objective function of the generator G in the training process is:

$$\min_{G} \; -\mathbb{E}_{z}\big[D(G(z))\big]$$

The optimization objective function of the discriminator D is:

$$\min_{D} \; \mathbb{E}_{z}\big[D(G(z))\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + k\, \mathbb{E}_{\hat{x} \sim P_u}\Big[\big\|\nabla_{\hat{x}} D(\hat{x})\big\|^{p}\Big]$$

where x denotes the true high-resolution face image, z denotes the low-resolution face image input to the generator G, G(z) is the super-resolution reconstructed face image generated by the generator G, P_g denotes the probability distribution of the super-resolution reconstructed face images, P_r denotes the probability distribution of the true high-resolution face images, D(x) and D(G(z)) respectively denote the probabilities that the discriminator D judges the high-resolution face image and the super-resolution reconstructed face image to be real face images, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the true high-resolution face image x and the super-resolution reconstructed face image G(z), P_u denotes the distribution of the samples x̂, and k and p each denote a constant.
In the SRGAN training process, the generator G first performs super-resolution reconstruction on the low-resolution face image I_LR in each training sample X. Specifically, the generator G upsamples the low-resolution face image I_LR in training sample X to obtain the super-resolution reconstructed face image I_SR. Because this embodiment downsamples the high-resolution face image I_HR by a factor of 4 to obtain the low-resolution face image I_LR, the upsampling factor used to generate the super-resolution reconstructed face image I_SR is also 4.
Then the high-resolution face image I_HR corresponding to the low-resolution face image I_LR, together with the super-resolution reconstructed face image I_SR generated by the generator G, is input into the discriminator D, and the loss function L^SR of the training sample is calculated according to the following formula:

$$L^{SR} = l^{SR}_{X} + l^{SR}_{Gen} + L_{clc}$$

where l^SR_X denotes the content loss function of the training sample, calculated as:

$$l^{SR}_{X} = l^{SR}_{MSE} + l^{SR}_{VGG}$$

where l^SR_MSE denotes the content loss based on the mean square error, calculated as:

$$l^{SR}_{MSE} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big( I^{HR}_{x,y} - I^{SR}_{x,y} \Big)^{2}$$

where W denotes the width of the high-resolution face image I_HR, H denotes the height of the high-resolution face image I_HR, r denotes the downsampling factor, I^HR_{x,y} denotes the pixel value of the pixel at coordinates (x, y) in the high-resolution face image I_HR, and I^SR_{x,y} denotes the pixel value of the pixel at coordinates (x, y) in the super-resolution reconstructed face image I_SR.
l^SR_VGG denotes the VGG loss, calculated as:

$$l^{SR}_{VGG} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big( \phi_{i,j}\big(I^{HR}\big)_{x,y} - \phi_{i,j}\big(I^{SR}\big)_{x,y} \Big)^{2}$$

where i denotes the index of a max-pooling layer in the VGG-19 network in the discriminator D, and j denotes the index of a convolutional layer between the i-th and the (i+1)-th max-pooling layers; in the existing VGG-19 network there are 5 max-pooling layers, and the number of convolutional layers between two adjacent max-pooling layers is 2 or 4. φ_{i,j} denotes the feature map obtained from the j-th convolutional layer after the i-th max-pooling layer of the VGG-19 network in the discriminator D, W_{i,j} denotes the width of the feature map φ_{i,j}, and H_{i,j} denotes the height of the feature map φ_{i,j}.
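The VGG loss can be sketched with torchvision's pretrained VGG-19. The specific layer choice below (the activation of conv5_4, a common SRGAN choice) is an assumption, since i and j are left generic in the text, and the inputs are assumed to be ImageNet-normalized tensors.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Feature extractor up to an assumed layer phi_{i,j}; slicing the
# `features` Sequential at index 36 ends just after the activation of
# conv5_4 in VGG-19.
_vgg = vgg19(pretrained=True).features[:36].eval()
for prm in _vgg.parameters():
    prm.requires_grad_(False)

def vgg_loss(sr, hr):
    """Mean squared error between VGG-19 feature maps of I_SR and I_HR."""
    return F.mse_loss(_vgg(sr), _vgg(hr))
```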
l^SR_Gen denotes the adversarial loss. This part of the loss function biases the SRGAN network, by "spoofing" the discriminator, towards producing outputs closer to natural images. With the log term removed by the EM-divergence improvement, it is calculated as:

$$l^{SR}_{Gen} = -D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big)$$

where D_{θ_D}(G_{θ_G}(I^{LR})) denotes the probability assigned by the discriminator D that the super-resolution reconstructed face image generated by the generator (i.e., I_SR) is a true high-resolution face image; the subscripts θ_D and θ_G denote the network parameters of the discriminator D and the generator G respectively, w denotes the dimension index of the network parameters, w = 1, 2, …, W, and W denotes the dimensionality of the network parameters.
In order to better meet the requirement that the super-resolution reconstruction model must detect whether the super-resolution reconstructed image contains a face target, a classification loss L_clc is added when calculating the loss function, calculated as:

$$L_{clc} = -\frac{1}{V} \sum_{v=1}^{V} \Big[ y_{v} \log \hat{y}_{v} + (1 - y_{v}) \log\big(1 - \hat{y}_{v}\big) \Big]$$

where {y_1, y_2, …, y_v, …, y_V} denotes the calibration data indicating whether each labeled region of the high-resolution face image I_HR is a face, ŷ_v denotes the corresponding face probability predicted by the discriminator, V denotes the number of face regions labeled in the high-resolution face image I_HR, and the y_v take values in {0, 1}.
Since the improved optimization objective function in this implementation has no log term, the Adam optimization algorithm can preferably be used to optimize the objective functions of the generator G and the discriminator D, improving training efficiency. For the generator G, the weight w_G is updated in the descending direction with the Adam optimization algorithm:

$$w_G \leftarrow w_G - \alpha \cdot \mathrm{Adam}\left( \nabla_{w_G} \frac{1}{M} \sum_{m=1}^{M} -D\big(G(z_m)\big) \right)$$

where Adam(·) denotes the bias-corrected moment update of the Adam algorithm, ∇_{w_G}(1/M)Σ_{m=1}^{M} −D(G(z_m)) denotes the descending gradient of the weight w_G, z_m denotes the value of the m-th pixel of the super-resolution reconstructed face image I_SR, m = 1, 2, …, M, M denotes the number of pixels, D(G(z_m)) denotes the probability that the discriminator D judges the m-th pixel of the super-resolution reconstructed face image I_SR to be a pixel of the high-resolution face image I_HR, α denotes the learning rate, β_1 denotes the exponential decay rate of the first-moment estimate, and β_2 denotes the exponential decay rate of the second-moment estimate. Typical values of the three Adam parameters are α = 0.00001, β_1 = 0.9 and β_2 = 0.999.
The weight w_D of the discriminator D is updated in the descending direction with the Adam optimization algorithm:

$$w_D \leftarrow w_D - \alpha \cdot \mathrm{Adam}\left( \nabla_{w_D} \frac{1}{M} \sum_{m=1}^{M} \Big[ D\big(G(z_m)\big) - D(x_m) + k \big\| \nabla_{\hat{x}_m} D(\hat{x}_m) \big\|^{p} \Big] \right)$$

where ∇_{w_D}(·) denotes the descending gradient of the weight w_D, x_m denotes the value of the m-th pixel of the high-resolution face image I_HR, D(x_m) denotes the probability that the discriminator D judges the m-th pixel of the high-resolution face image I_HR to be a pixel of I_HR, ∇_{x̂_m}D(x̂_m) denotes the gradient with respect to x̂_m, x̂_m = μ_m x_m + (1 − μ_m) G(z_m) with μ_m = m/M, and D(x̂_m) denotes the probability that the discriminator D judges x̂_m to be a pixel of the high-resolution face image I_HR.
In the present embodiment, the weight w_G of the generator G and the weight w_D of the discriminator D are preferably updated alternately: the parameters of the generator G are first fixed while the parameters of the discriminator D are updated, then the parameters of the discriminator D are fixed while the parameters of the generator G are updated, and so on alternately.
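The alternating scheme can be sketched as a single training step in PyTorch, reusing the discriminator_loss and generator_loss sketches from above; the optimizers would be created with the α, β_1, β_2 values given earlier, e.g. torch.optim.Adam(G.parameters(), lr=1e-5, betas=(0.9, 0.999)), and the data pipeline is a placeholder.

```python
import torch

def train_step(G, D, opt_G, opt_D, lr_batch, hr_batch):
    """One alternating update: first D with G frozen, then G with D fixed."""
    # --- update discriminator D (generator parameters fixed) ---
    with torch.no_grad():
        fake = G(lr_batch)              # I_SR, detached from G's graph
    opt_D.zero_grad()
    loss_d = discriminator_loss(D, hr_batch, fake)
    loss_d.backward()
    opt_D.step()

    # --- update generator G (discriminator effectively fixed) ---
    opt_G.zero_grad()
    loss_g = generator_loss(D, G(lr_batch))
    loss_g.backward()                   # only opt_G steps, so D is unchanged
    opt_G.step()
    return loss_d.item(), loss_g.item()
```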
In order to better illustrate the technical effect of the invention, it was experimentally verified on a group of low-resolution face images. In this experimental verification, the face detection model adopts the R-FCN model with the improved anchor-frame generation parameters and the improved frame regression algorithm of this embodiment, and the GAN-based super-resolution reconstruction model adopts the SRGAN model obtained by the improved training method of this embodiment. The face detection model and the GAN-based super-resolution reconstruction model were trained on the WIDER FACE training sample set; 10 images were randomly extracted from each of 61 classifications, giving 610 detection images in total. For comparison of technical effects, the SFD face detection method and the R-FCN face detection method were selected as comparison methods.
In order to evaluate the technical effects of the face detection method of the invention and of the comparison methods, the PR curve is selected as the evaluation standard. The PR curve is drawn with Precision as the ordinate and Recall as the abscissa.
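For reference, precision, recall and the per-class AP that mAP averages can be computed from ranked detections as in the sketch below; matching detections to ground truth (the usual IoU-based assignment) is assumed to have been done upstream.

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_gt):
    """AP = area under the precision-recall curve of ranked detections."""
    order = np.argsort(-np.asarray(confidences))      # most confident first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)     # TP / detections so far
    recall = cum_tp / max(num_gt, 1)                  # TP / all ground truth
    # rectangle-rule integration of precision over recall increments
    return float(np.sum(precision * np.diff(np.concatenate([[0.0], recall]))))
```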
FIG. 6 is a PR graph of the three methods in this experimental verification. As shown in fig. 6, among the three face detection methods the PR curve of the present invention lies closest to the upper right corner as a whole, and its mAP (Mean Average Precision) value of 0.947 is also the best of the three sets of data.
Fig. 7 is an exemplary diagram of the detection result of the SFD face detection method in this experimental verification. Fig. 8 is an exemplary diagram of the detection result of the R-FCN face detection method in this experimental verification. Fig. 9 is an exemplary diagram of the detection result of the present invention in this experimental verification. Comparing fig. 7 to fig. 9, the present invention detects 14 faces in total and shows better detection performance than the other two methods, which detect 11 and 9 faces respectively.
Face detection was then performed on image samples of different definitions. The blur attribute of each face target is annotated in the WIDER FACE training sample set and divided into three classes: clear, generally blurred, and severely blurred; accordingly, a number of samples were extracted from the image samples of each blur degree to form detection sample sets. Fig. 10 is a PR graph of face detection performed on the clear detection sample set by the three methods in this experimental verification. Fig. 11 is a PR graph of face detection performed on the generally blurred detection sample set. Fig. 12 is a PR graph of face detection performed on the severely blurred detection sample set. As shown in figs. 10 to 12, when the sample definition is high, all three methods detect the face parts well, the differences are small, and the mAP values are very high. On the generally blurred test group, the mAP values of the three algorithms decrease slightly but still exceed 97%, indicating that under general blur all three methods retain very good detection capability and the setting poses little challenge to them; here the invention has some advantage over SFD and R-FCN, but it is not obvious. When the detected samples are severely blurred, the differences between the three methods begin to appear: SFD performs worst, with mAP dropping by about 10 percentage points compared with the clear setting, while the drop of the proposed method is the smallest, about 5 percentage points. In this case the mAP value of the proposed method is about 2 percentage points higher than that of the original R-FCN model, and its PR curve clearly encloses the PR curves of the two comparison methods, showing that, compared with the other two methods, the proposed method has better stability and a higher detection rate at low resolution.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions utilizing the inventive concept are protected, as long as the changes are within the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A face detection method based on two-stage detection is characterized by comprising the following steps:
s1: acquiring a plurality of face image training samples, wherein each training sample comprises a face image and face target information, and training a face detection model by adopting the face image training samples;
s2: acquiring a plurality of super-resolution face image reconstruction training samples, wherein each training sample comprises a low-resolution image containing a face and a corresponding high-resolution image, the super-resolution face image reconstruction training samples are adopted to train a super-resolution reconstruction model based on a GAN network, and the super-resolution reconstruction model based on the GAN network comprises a generator G and a discriminator D;
s3: inputting a face image to be detected into the face detection model to obtain coordinate information of each candidate region of a face target and a confidence value C that the candidate region belongs to a face; presetting confidence thresholds T1 and T2 with 0 < T1 < T2 < 1; for each candidate region, if the corresponding confidence value C ≥ T2, judging that a face target exists in the candidate region and outputting it as a face target region; if T1 ≤ C < T2, taking the candidate region as a face target to be determined; if C < T1, judging that no face target exists in the candidate region and not outputting it;
s4: inputting each face target to be determined into the generator G of the GAN-based super-resolution reconstruction model to generate a super-resolution reconstructed image SR; then inputting the super-resolution reconstructed image SR into the discriminator D, which judges whether the image SR is a qualified super-resolution reconstructed image and whether it contains a face target; if the image SR is a qualified super-resolution reconstructed image and contains a face target, judging that a face target exists in the corresponding candidate region and outputting the candidate region as a face target region, and otherwise judging that no face target exists in the candidate region.
2. The face detection method of claim 1, wherein the face detection model uses an R-FCN network.
3. The face detection method according to claim 2, wherein the anchor frames in the R-FCN network are generated with the five scales {16 × 16, 32 × 32, 128 × 128, 256 × 256, 512 × 512}, and each scale generates anchor frames with the three aspect ratios {1:1, 1:2, 2:1}.
4. The face detection method of claim 1, wherein the GAN network-based super-resolution reconstruction model adopts an SRGAN network.
5. The face detection method of claim 4, wherein the SRGAN network is trained by the following method:
first, a plurality of high-resolution face images I_HR are acquired, and the corresponding low-resolution face images I_LR are obtained through downsampling; each high-resolution face image I_HR and its corresponding low-resolution face image I_LR form a training sample, thereby obtaining a training sample set;
then, the SRGAN network is trained with the obtained training sample set, wherein the optimization objective function of the generator G in the training process is:

$$\min_{G} \; -\mathbb{E}_{z}\big[D(G(z))\big]$$

and the optimization objective function of the discriminator D is:

$$\min_{D} \; \mathbb{E}_{z}\big[D(G(z))\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + k\, \mathbb{E}_{\hat{x} \sim P_u}\Big[\big\|\nabla_{\hat{x}} D(\hat{x})\big\|^{p}\Big]$$

where x denotes the true high-resolution face image, z denotes the low-resolution face image input to the generator G, G(z) is the super-resolution reconstructed face image generated by the generator G, P_g denotes the probability distribution of the super-resolution reconstructed face images, P_r denotes the probability distribution of the true high-resolution face images, D(x) and D(G(z)) respectively denote the probabilities that the discriminator D judges the high-resolution face image and the super-resolution reconstructed face image to be real face images, E[·] denotes the mathematical expectation, x̂ denotes a random linear combination of the true high-resolution face image x and the super-resolution reconstructed face image G(z), P_u denotes the distribution of the samples x̂, and k and p each denote a constant.
6. The face detection method of claim 5, wherein in the SRGAN network training process the loss function L^SR of the training sample is calculated according to the following formula:

$$L^{SR} = l^{SR}_{X} + l^{SR}_{Gen} + L_{clc}$$

wherein l^SR_X denotes the content loss function of the training sample, l^SR_Gen denotes the adversarial loss, and L_clc denotes the classification loss.
7. The face detection method according to claim 5, wherein in the SRGAN network training process the Adam optimization algorithm is adopted to optimize the objective functions of the generator G and the discriminator D, specifically:

the weight w_G of the generator G is updated in the descending direction with the Adam optimization algorithm:

$$w_G \leftarrow w_G - \alpha \cdot \mathrm{Adam}\left( \nabla_{w_G} \frac{1}{M} \sum_{m=1}^{M} -D\big(G(z_m)\big) \right)$$

wherein ∇_{w_G}(1/M)Σ_{m=1}^{M} −D(G(z_m)) denotes the descending gradient of the weight w_G, z_m denotes the value of the m-th pixel of the super-resolution reconstructed face image I_SR, m = 1, 2, …, M, M denotes the number of pixels, D(G(z_m)) denotes the probability that the discriminator D judges the m-th pixel of the super-resolution reconstructed face image I_SR to be a pixel of the high-resolution face image I_HR, α denotes the learning rate, β_1 denotes the exponential decay rate of the first-moment estimate, and β_2 denotes the exponential decay rate of the second-moment estimate;

the weight w_D of the discriminator D is updated in the descending direction with the Adam optimization algorithm:

$$w_D \leftarrow w_D - \alpha \cdot \mathrm{Adam}\left( \nabla_{w_D} \frac{1}{M} \sum_{m=1}^{M} \Big[ D\big(G(z_m)\big) - D(x_m) + k \big\| \nabla_{\hat{x}_m} D(\hat{x}_m) \big\|^{p} \Big] \right)$$

wherein ∇_{w_D}(·) denotes the descending gradient of the weight w_D, x_m denotes the value of the m-th pixel of the high-resolution face image I_HR, D(x_m) denotes the probability that the discriminator D judges the m-th pixel of the high-resolution face image I_HR to be a pixel of I_HR, ∇_{x̂_m}D(x̂_m) denotes the gradient with respect to x̂_m, x̂_m = μ_m x_m + (1 − μ_m) G(z_m) with μ_m = m/M, and D(x̂_m) denotes the probability that the discriminator D judges x̂_m to be a pixel of the high-resolution face image I_HR.
8. The face detection method of claim 7, wherein, when optimizing the objective functions, the weight w_G of the generator G and the weight w_D of the discriminator D are updated alternately.
CN201910455695.5A 2019-05-29 2019-05-29 Face detection method based on two-stage detection Active CN110189255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910455695.5A CN110189255B (en) 2019-05-29 2019-05-29 Face detection method based on two-stage detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910455695.5A CN110189255B (en) 2019-05-29 2019-05-29 Face detection method based on two-stage detection

Publications (2)

Publication Number Publication Date
CN110189255A CN110189255A (en) 2019-08-30
CN110189255B true CN110189255B (en) 2023-01-17

Family

ID=67718558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910455695.5A Active CN110189255B (en) 2019-05-29 2019-05-29 Face detection method based on two-stage detection

Country Status (1)

Country Link
CN (1) CN110189255B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705498A (en) * 2019-10-12 2020-01-17 北京泰豪信息科技有限公司 Low-resolution face recognition method
CN110705509A (en) * 2019-10-16 2020-01-17 上海眼控科技股份有限公司 Face direction recognition method and device, computer equipment and storage medium
CN110866484B (en) * 2019-11-11 2022-09-09 珠海全志科技股份有限公司 Driver face detection method, computer device and computer readable storage medium
CN111144215B (en) * 2019-11-27 2023-11-24 北京迈格威科技有限公司 Image processing method, device, electronic equipment and storage medium
CN111222420A (en) * 2019-12-24 2020-06-02 重庆市通信产业服务有限公司 FTP protocol-based low-bandwidth-requirement helmet identification method
CN111339950B (en) * 2020-02-27 2024-01-23 北京交通大学 Remote sensing image target detection method
US11610316B2 (en) * 2020-03-06 2023-03-21 Siemens Healthcare Gmbh Method of computing a boundary
CN113836974A (en) * 2020-06-23 2021-12-24 江苏翼视智能科技有限公司 Monitoring video pedestrian detection method based on super-resolution reconstruction
CN112102234B (en) * 2020-08-06 2022-05-20 复旦大学 Ear sclerosis focus detection and diagnosis system based on target detection neural network
CN112418009B (en) * 2020-11-06 2024-03-22 中保车服科技服务股份有限公司 Image quality detection method, terminal equipment and storage medium
CN112437451B (en) * 2020-11-10 2022-08-02 南京大学 Wireless network flow prediction method and device based on generation countermeasure network
CN112288044B (en) * 2020-12-24 2021-07-27 成都索贝数码科技股份有限公司 News picture attribute identification method of multi-scale residual error network based on tree structure
CN113283306B (en) * 2021-04-30 2023-06-23 青岛云智环境数据管理有限公司 Rodent identification analysis method based on deep learning and migration learning
CN114862683B (en) * 2022-07-07 2022-12-09 浪潮电子信息产业股份有限公司 Model generation method, target detection method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239736A (en) * 2017-04-28 2017-10-10 北京智慧眼科技股份有限公司 Method for detecting human face and detection means based on multitask concatenated convolutional neutral net
CN108090873A (en) * 2017-12-20 2018-05-29 河北工业大学 Pyramid face image super-resolution reconstruction method based on regression model
CN108229381A (en) * 2017-12-29 2018-06-29 湖南视觉伟业智能科技有限公司 Face image synthesis method, apparatus, storage medium and computer equipment

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696848A (en) * 1995-03-09 1997-12-09 Eastman Kodak Company System for creating a high resolution image from a sequence of lower resolution motion images
US6937135B2 (en) * 2001-05-30 2005-08-30 Hewlett-Packard Development Company, L.P. Face and environment sensing watch
KR101308946B1 (en) * 2012-02-02 2013-09-24 한국과학기술연구원 Method for reconstructing three dimensional facial shape
WO2018053340A1 (en) * 2016-09-15 2018-03-22 Twitter, Inc. Super resolution using a generative adversarial network
CN106951867B (en) * 2017-03-22 2019-08-23 成都擎天树科技有限公司 Face identification method, device, system and equipment based on convolutional neural networks
CN106874894B (en) * 2017-03-28 2020-04-14 电子科技大学 Human body target detection method based on regional full convolution neural network
CN107154023B (en) * 2017-05-17 2019-11-05 电子科技大学 Based on the face super-resolution reconstruction method for generating confrontation network and sub-pix convolution
CN107481188A (en) * 2017-06-23 2017-12-15 珠海经济特区远宏科技有限公司 A kind of image super-resolution reconstructing method
EP3438920A1 (en) * 2017-07-31 2019-02-06 Institut Pasteur Method, device, and computer program for improving the reconstruction of dense super-resolution images from diffraction-limited images acquired by single molecule localization microscopy
CN108090417A (en) * 2017-11-27 2018-05-29 上海交通大学 A kind of method for detecting human face based on convolutional neural networks
CN108446617B (en) * 2018-03-09 2022-04-22 华南理工大学 Side face interference resistant rapid human face detection method
CN108805027B (en) * 2018-05-03 2020-03-24 电子科技大学 Face recognition method under low resolution condition
CN108681718B (en) * 2018-05-20 2021-08-06 北京工业大学 Unmanned aerial vehicle low-altitude target accurate detection and identification method
CN109543548A (en) * 2018-10-26 2019-03-29 桂林电子科技大学 A kind of face identification method, device and storage medium
CN109614985B (en) * 2018-11-06 2023-06-20 华南理工大学 Target detection method based on densely connected feature pyramid network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239736A (en) * 2017-04-28 2017-10-10 北京智慧眼科技股份有限公司 Method for detecting human face and detection means based on multitask concatenated convolutional neutral net
CN108090873A (en) * 2017-12-20 2018-05-29 河北工业大学 Pyramid face image super-resolution reconstruction method based on regression model
CN108229381A (en) * 2017-12-29 2018-06-29 湖南视觉伟业智能科技有限公司 Face image synthesis method, apparatus, storage medium and computer equipment

Also Published As

Publication number Publication date
CN110189255A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110189255B (en) Face detection method based on two-stage detection
CN110211045B (en) Super-resolution face image reconstruction method based on SRGAN network
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN110738697B (en) Monocular depth estimation method based on deep learning
CN106940816B (en) CT image pulmonary nodule detection system based on 3D full convolution neural network
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN112016507B (en) Super-resolution-based vehicle detection method, device, equipment and storage medium
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN112132959B (en) Digital rock core image processing method and device, computer equipment and storage medium
CN109840483B (en) Landslide crack detection and identification method and device
CN111784671A (en) Pathological image focus region detection method based on multi-scale deep learning
CN112800964B (en) Remote sensing image target detection method and system based on multi-module fusion
CN112348770A (en) Bridge crack detection method based on multi-resolution convolution network
CN109815931B (en) Method, device, equipment and storage medium for identifying video object
CN113298718A (en) Single image super-resolution reconstruction method and system
WO2023116632A1 (en) Video instance segmentation method and apparatus based on spatio-temporal memory information
CN114445356A (en) Multi-resolution-based full-field pathological section image tumor rapid positioning method
CN116994140A (en) Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN115293966A (en) Face image reconstruction method and device and storage medium
CN112329793B (en) Significance detection method based on structure self-adaption and scale self-adaption receptive fields
CN116311004B (en) Video moving target detection method based on sparse optical flow extraction
CN116758411A (en) Ship small target detection method based on remote sensing image pixel-by-pixel processing
CN116665153A (en) Road scene segmentation method based on improved deep bv3+ network model
CN115358952A (en) Image enhancement method, system, equipment and storage medium based on meta-learning
CN110910332B (en) Visual SLAM system dynamic fuzzy processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant