CN110263731B - Single step human face detection system - Google Patents


Info

Publication number
CN110263731B
CN110263731B
Authority
CN
China
Prior art keywords
convolution module
depth separable
separable convolution
module
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910550738.8A
Other languages
Chinese (zh)
Other versions
CN110263731A (en)
Inventor
徐杰
田野
罗堡文
廖静茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910550738.8A priority Critical patent/CN110263731B/en
Publication of CN110263731A publication Critical patent/CN110263731A/en
Application granted granted Critical
Publication of CN110263731B publication Critical patent/CN110263731B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The invention discloses a single-step face detection system. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; the network contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range. The invention adopts a random cropping strategy better suited to the multi-scale detection structure, so that each detection module can be trained with a sufficient number of samples. The elliptical regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation criterion. The YOMO model maintains strongly competitive detection accuracy, reaches a detection rate of 51 FPS on pictures of 544 × 544 resolution, and occupies only 21 MB of memory.

Description

Single step human face detection system
Technical Field
The invention relates to the technical field of face detection, in particular to a single-step face detection system.
Background
Face detection is a key component of human-centered smart cities and underlies technologies such as identity recognition, personalized services, pedestrian detection and tracking, and crowd counting. Although widely studied, face detection in unconstrained scenes remains an open research problem due to its many challenges.
Early face detection focused on manually designing effective features and building efficient classifiers on top of them. This usually yields a sub-optimal detection model, and detection accuracy may drop sharply as the application scene changes. In recent years, deep learning has been successfully applied to face detection, but producing a real-time face detection model with high accuracy in unconstrained scenes remains a great challenge.
Faster R-CNN replaces the sliding window with a region proposal algorithm and integrates candidate box generation, feature extraction, bounding-box regression and classification into a single network; it is the model with the highest detection rate and accuracy in the R-CNN family. However, the proposal network generates many face candidate boxes, and the complicated network structure brings large computational overhead, so real-time face detection cannot be achieved.
Another class of face detection methods, such as YOLO, converts the detection problem into a regression problem and therefore contains no proposal network: face boxes are regressed directly on the feature maps of the feature extraction network. This gives a faster detection rate, but the detection accuracy needs improvement. To improve accuracy, SSD jointly predicts box categories and positions from multi-scale feature maps located at different layers. Multi-layer feature prediction helps detect faces of different scales, but no stage is trained specifically to handle faces of a particular scale range; that is, during training, faces of all scales contribute to the loss of every detection module. In contrast, each detection module of YOMO is trained only with faces in its appropriate scale range.
Aiming at the problem of small-scale face detection in single-step methods, HR trains multiple separate single-scale detectors on an image pyramid, each detector being responsible for faces of a specific scale. However, at test time the picture is rescaled to multiple scales and the picture at each scale passes through a very deep network, so this multi-step single-scale detector is computationally expensive.
Single-step multi-scale detectors such as S3FD instead detect faces using the multi-scale features of a deep convolutional network, so only a single pass of the picture through the network is required at test time. However, S3FD still has the same problem as SSD: each feature map at a different scale is used for prediction separately, and because the lower layers lack semantic features when predicting small-scale faces, the detection effect of S3FD on small-scale faces is still not ideal.
Disclosure of Invention
Aiming at the above defects in the prior art, the single-step face detection system provided by the invention solves the problem that the detection performance of existing face detection systems is not ideal.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a single-step face detection system, comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal characteristics of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal characteristics of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv 20.
Further: the conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU active layer connected in sequence from top to bottom.
Further:
the input picture of the conventional convolution module conv0 is cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box.
Further: the depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution lu, a butcm nortch layer, a leakyrey relu activation layer, a 1 × 1 point convolution layer, a butcm layer and a leakyy activation layer, which are connected in sequence from top to bottom.
Further: the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
Further: the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
Further: the detection module det-32, the detection module det-16 and the detection module det-8 comprise a conventional convolution layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
Further: the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected with an ellipse regressor, the ellipse regressor converts the output layer prediction frame into an ellipse real frame, and the calculation formula of the ellipse real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
The invention has the beneficial effects that:
1. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; it contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range.
2. The invention adopts a random cutting strategy which is more in line with a multi-scale detection structure, so that each detection module can be trained by a sufficient number of samples.
3. The elliptical regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation criterion.
4. While keeping strongly competitive detection accuracy, the YOMO model provided by the invention reaches a detection rate of 51 FPS on pictures of 544 × 544 resolution.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a graph of the evaluation results in the FDDB dataset according to the present invention;
FIG. 3 is a diagram of the results of the visualization in the WIDER FACE dataset and the FDDB dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art. It should be understood, however, that the present invention is not limited to the scope of the embodiments; to those skilled in the art, various changes are apparent so long as they do not depart from the spirit and scope of the invention as defined in the appended claims, and all matters produced using the inventive concept are protected.
As shown in fig. 1, a single-step face detection system includes a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal characteristics of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal characteristics of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv 20.
The output feature maps of conv14, conv17 and conv20 have downsampling strides of 32, 16 and 8, respectively, compared with the original image. The detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection; the face scale range each detection module is responsible for is shown in Table 1.
TABLE 1 Face scale range each detection module is responsible for

Scale class | det-8 (small-scale face) | det-16 (medium-scale face) | det-32 (large-scale face)
Minimum MinScale | 10 | 40 | 100
Maximum MaxScale | 39 | 99 | 350
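Purely as an illustration of the top-down fusion described above, the branch that feeds det-16 (conv14 → deconvolution conv15, fused with conv16 applied to the conv11 output, then conv17) might be sketched as follows. The channel widths and the plain stand-in convolutions are assumptions of this sketch, not part of the patent; the real conv16/conv17 are the depthwise separable blocks sketched later.

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Sketch of one top-down fusion branch of the YOMO network
    (conv14 -> deconv conv15, merged with conv16(conv11) -> conv17 -> det-16)."""

    def __init__(self, high_ch=1024, lateral_ch=512, out_ch=1024):
        super().__init__()
        # conv15: deconvolution that upsamples the stride-32 map to stride 16
        self.deconv15 = nn.ConvTranspose2d(high_ch, lateral_ch, kernel_size=2, stride=2)
        # conv16: lateral block on the conv11 output (1x1 conv as a stand-in)
        self.conv16 = nn.Conv2d(lateral_ch, lateral_ch, kernel_size=1)
        # conv17: block applied after feature merging (3x3 conv as a stand-in)
        self.conv17 = nn.Conv2d(2 * lateral_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feat_conv14, feat_conv11):
        top_down = self.deconv15(feat_conv14)           # deep, semantic features, upsampled
        lateral = self.conv16(feat_conv11)              # shallow, high-resolution features
        merged = torch.cat([top_down, lateral], dim=1)  # channel-wise feature fusion
        return self.conv17(merged)                      # fed to detection module det-16
```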
The invention trains the network with the RMSProp gradient optimization algorithm using the parameter settings of Table 2. The 3 detection modules are placed on layers with different strides to enhance the multi-scale detection capability of the model. During training, the loss function of each detection module is a multi-task loss containing 5 parts. So that each detection module is responsible only for faces within its corresponding scale range, the anchor with the maximum IoU with a real box is searched for during back-propagation, and only the detection branch to which that anchor belongs generates box-regression loss. To make training more efficient, each real box is matched to the anchor with which it has the highest IoU.
TABLE 2 Training parameter configuration

base_lr | step_value | gamma | batch_size | iter_size | type | weight_decay | max_iter
0.001 | 40000 | 0.1 | 9 | 3 | RMSProp | 0.00005 | 200000
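A minimal sketch of the anchor matching just described, assuming YOLO-style width/height IoU between a real box and 9 anchors (3 per detection branch, as stated in the experiments); the anchor values and function names are illustrative assumptions only.

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one real box and a set of anchors, comparing (w, h) pairs
    as if they shared a common centre (an assumption of this sketch)."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def match_gt_to_anchor(gt_wh, anchors_wh):
    """Index of the highest-IoU anchor for each real box; only the detection
    branch owning that anchor receives box-regression loss."""
    return [int(np.argmax(iou_wh(gt, anchors_wh))) for gt in gt_wh]

# usage: 9 assumed anchors, 3 per branch (det-8, det-16, det-32)
anchors = np.array([[12, 15], [20, 25], [32, 38],
                    [45, 55], [64, 78], [90, 95],
                    [120, 150], [200, 240], [320, 340]], dtype=float)
matches = match_gt_to_anchor(np.array([[30.0, 35.0], [150.0, 180.0]]), anchors)
branches = [m // 3 for m in matches]   # 0 -> det-8, 1 -> det-16, 2 -> det-32
```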
The multi-task loss function of YOMO includes 5 parts, namely a non-target loss, an anchor pre-training loss, a target localization loss, a target confidence loss and a target category loss, as shown in equation (3).
[Equation (3): the multi-task loss, a weighted sum of the non-target, anchor pre-training, localization, confidence and category losses accumulated over the W × H × A prediction grid]
Where W and H are the width and height of the feature map, respectively, A is the number of Anchors, and t is the number of iterations. 1(x) denotes a discriminator whose value is 1 when x is true and 0 otherwise. λ_noobj, λ_prior, λ_coord, λ_obj and λ_class are the weights of the sub-tasks, namely the non-target loss weight, the Anchor pre-training loss weight, the coordinate loss weight, the target loss weight and the category loss weight. b_r denotes the 4 coordinate offsets predicted by the network, and prior_r denotes the 4 coordinates of the Anchor, namely the abscissa x and ordinate y of the box center, the box width w and the box height h. When the IoU of a prediction box with all real boxes is less than or equal to the threshold Thresh, the region of the input image corresponding to that prediction box is a non-target, i.e. background, and its predicted confidence is b_o. So that the network adapts to the Anchors as quickly as possible, the Anchor pre-training loss is introduced in the early stage of training; in the YOMO model the pre-training period is defined as 1 epoch.
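A schematic sketch of how the five weighted terms are combined; the per-term sums over the W × H × A grid are assumed to be computed elsewhere, the exact per-term form is the one given by equation (3), and the weights shown are those reported later in the experiments.

```python
def yomo_multitask_loss(noobj_l, prior_l, coord_l, conf_l, cls_l,
                        lambda_noobj=1.0, lambda_prior=1.0, lambda_coord=1.0,
                        lambda_obj=5.0, lambda_class=1.0, pretraining=False):
    """Weighted composition of the five loss parts described above.
    Inputs are already-accumulated scalar terms; this only shows the weighting."""
    loss = (lambda_noobj * noobj_l + lambda_coord * coord_l
            + lambda_obj * conf_l + lambda_class * cls_l)
    if pretraining:  # anchor pre-training term is active only in the first epoch
        loss = loss + lambda_prior * prior_l
    return loss
```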
The conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU active layer connected in sequence from top to bottom.
Input pictures of the conventional convolution module conv0 are cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box.
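An illustrative sketch of the selection logic of steps S1–S6 (data layout and function names are assumptions; per-class counts of each candidate crop are computed after scaling to the network input size, and the S7 update is shown as a comment).

```python
import random
import numpy as np

def count_by_scale(face_sides, min_scales, max_scales):
    """S1: number of faces per scale class for one (already scaled) crop box."""
    return np.array([sum(1 for s in face_sides if lo <= s <= hi)
                     for lo, hi in zip(min_scales, max_scales)])

def select_crop(crop_counts, train_counts):
    """Semi-soft selection over the M candidate crops (steps S2-S6).

    crop_counts:  list of per-class face counts, one entry per candidate crop
    train_counts: per-class face counts accumulated so far in training
    Returns the index of the selected crop box (SelectCrop_bbox).
    """
    # S2: per-crop classes in descending count order; S3: training classes ascending
    crop_order = [np.argsort(-c) for c in crop_counts]   # most frequent class first
    need_order = np.argsort(train_counts)                # rarest trained class first

    # S4: prefer a crop whose ordering exactly mirrors what training lacks
    exact = [i for i, o in enumerate(crop_order) if np.array_equal(o, need_order)]
    if exact:
        return random.choice(exact)
    # S5: otherwise only require the two extremes to match
    partial = [i for i, o in enumerate(crop_order)
               if o[0] == need_order[0] and o[-1] == need_order[-1]]
    if partial:
        return random.choice(partial)
    # S6: fall back to a purely random crop
    return random.randrange(len(crop_counts))

# S7 (after selection): train_counts += crop_counts[selected_index]
```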
The depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolution layer, a BatchNorm layer and a LeakyReLU activation layer, which are connected in sequence from top to bottom.
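A minimal PyTorch-style sketch of this building block; the depthwise grouping of the 3 × 3 convolution is inferred from the module name "depth separable", and the LeakyReLU negative slope is an assumption of this sketch.

```python
import torch.nn as nn

def depthwise_separable_block(in_ch, out_ch, stride=1, slope=0.1):
    """conv1..conv20 building block as described above:
    3x3 (depthwise) convolution -> BatchNorm -> LeakyReLU,
    then 1x1 pointwise convolution -> BatchNorm -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),               # 3x3 depthwise convolution
        nn.BatchNorm2d(in_ch),
        nn.LeakyReLU(slope, inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # 1x1 pointwise convolution
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(slope, inplace=True),
    )
```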
The number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
The detection module det-32, the detection module det-16 and the detection module det-8 each comprise a conventional convolution layer and an output layer;
the calculation formula of the number of output channels of the conventional convolutional layer is as follows:
num_output = (num_coordinate + num_confidence + num_classes) × num_Anchors
wherein num_coordinate, num_confidence, num_classes and num_Anchors respectively denote the number of box coordinates, confidence values, categories and Anchors. A larger number of Anchors gives better detection accuracy but slows training and testing. Considering that the 3 detection modules of YOMO are each responsible for faces of one scale range, and balancing speed against precision, num_Anchors = 3 is chosen. The number of output channels of the conventional convolutional layer in each detection module is therefore (4 + 1 + 1) × 3 = 18.
The calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
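A small sketch of the decoding defined by the formulas above, with coordinates expressed in grid-cell units; the numeric example values are illustrative only.

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction into (bx, by, bw, bh) following the formulas above.
    (cx, cy) is the top-left corner of the grid cell containing the Anchor,
    (pw, ph) is the Anchor width/height."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# example: an anchor of size 3.1 x 3.9 grid cells located at cell (7, 5)
print(decode_box(0.2, -0.4, 0.1, 0.0, 7, 5, 3.1, 3.9))
```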
The output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected with an ellipse regressor, the ellipse regressor converts the output layer prediction frame into an ellipse real frame, and the calculation formula of the ellipse real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
When training the ellipse regressor, how to match prediction boxes with real boxes is the key. In practice, for each real box of each FDDB picture, the prediction box with the highest IoU is matched, and only the real boxes and their matched prediction boxes are used in training.
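A minimal numpy sketch of fitting the regression matrix W on normalized coordinates and applying it to a detected box; the column ordering of X and Y and the helper names are assumptions of this sketch.

```python
import numpy as np

def fit_ellipse_regressor(X, Y):
    """Least-squares fit of W in Y = XW + eps after z-score normalisation.
    X: (n, 4) matched prediction boxes (bx, by, bw, bh)
    Y: (n, 5) elliptical real boxes (ra, rb, theta, cx, cy)."""
    Ux, sx = X.mean(axis=0), X.std(axis=0)
    Uy, sy = Y.mean(axis=0), Y.std(axis=0)
    Xn, Yn = (X - Ux) / sx, (Y - Uy) / sy
    W, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)   # minimises the mean squared error
    return W, (Ux, sx, Uy, sy)

def predict_ellipse(box, W, stats):
    """Map one detected rectangle to ellipse parameters with the fitted W."""
    Ux, sx, Uy, sy = stats
    return ((box - Ux) / sx) @ W * sy + Uy
```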
The experimental environment of the invention is a 64-bit Ubuntu 14.04 LTS system with 16 GB of memory and an 8-core Intel Core i7-7700K CPU with a single-core frequency of 4.20 GHz. All models are based on the Caffe framework and trained on a single GPU, an NVIDIA GeForce GTX 1080 Ti.
The feature extraction network of the YOMO model was pre-trained on ImageNet and fine-tuned for 200K iterations. Other training parameter settings are shown in Table 2. The anchor with the largest IoU with a real box is a positive example, and anchors with IoU < 0.3 are treated as background. Considering the detection rate and the face scale ranges, each detection module contains 3 anchors, whose values are obtained by clustering on the training set. The weights of the loss terms are λ_noobj = 1, λ_prior = 1, λ_coord = 1, λ_obj = 5 and λ_class = 1. The NMS threshold of each detection module is set to 0.7 for training and 0.45 for testing. The training pictures of all models in the present invention are scaled to 544 × 544 resolution.
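For completeness, a standard greedy non-maximum suppression routine of the kind referred to above; the exact NMS variant used by the model is not specified in the text, so this is only the conventional formulation.

```python
import numpy as np

def nms(boxes, scores, thresh=0.45):
    """Greedy NMS. boxes: (N, 4) arrays of (x1, y1, x2, y2) corners;
    returns the indices of the kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the current best box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]   # drop boxes overlapping too much
    return keep
```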
WIDER FACE is a face detection benchmark dataset whose pictures are collected from the internet and have complex backgrounds. The dataset comprises 32203 pictures with 393703 annotated faces, which vary widely in size, pose and occlusion. It is divided into 61 event classes, and within each class the data are split into training, validation and test sets at a ratio of 40%, 10% and 50%. All models in the invention are trained on the training set.
The pictures in the FDDB dataset are collected from the Faces in the Wild dataset; FDDB comprises 2845 pictures and 5171 faces. It contains difficult cases including occlusion, hard poses, low resolution and defocus, as well as both black-and-white and color pictures. Unlike other face detection datasets, the labeled region is an ellipse rather than a rectangle. All models in the present invention were tested on the FDDB dataset.
When testing on the FDDB dataset, all pictures were scaled with their aspect ratio preserved and embedded in a 544 × 544 black background to ensure that they were not deformed. As shown in FIGS. 2(a) and 2(b), the results of YOMO were compared with the MTCNN, ScaleFace, HR-ER, ICC-CNN and FANet models under DiscROC and ContROC, respectively.
In FIG. 2, YOMO-Fit denotes the detection results of YOMO after the elliptical regressor. According to the FDDB evaluation, with the number of false detections fixed at 1000, YOMO-Fit reaches recall rates of 97.7% under DiscROC and 83.6% under ContROC, lower only than FANet. HR-ER, even using FDDB as 10-fold cross-validated training data, has the same DiscROC recall as YOMO and a ContROC recall 4.9% lower than YOMO-Fit. Notably, the elliptical regressor increases the recall of YOMO under DiscROC and ContROC by 0.1% and 8.6%, respectively.
FIGS. 3(a) and 3(b) are visualizations of tests on individual pictures from the WIDER FACE and FDDB datasets, respectively. The rectangular boxes in FIG. 3(a) are prediction boxes of the YOMO model. The rectangles and ellipses in FIG. 3(b) are the prediction boxes of YOMO and the real boxes, respectively.

Claims (6)

1. A single-step face detection system, which is characterized by comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
an output terminal of the depth separable convolution module conv11 is connected to an input terminal of a depth separable convolution module conv16, an output terminal of the depth separable convolution module conv16 is fused with an output terminal characteristic of a deconvolution layer conv15 and is connected to an input terminal of the depth separable convolution module conv17, an output terminal of the depth separable convolution module conv5 is connected to an input terminal of a depth separable convolution module conv19, and an output terminal of the depth separable convolution module conv19 is fused with an output terminal characteristic of a deconvolution layer conv18 and is connected to an input terminal of a depth separable convolution module conv 20;
the input picture of the conventional convolution module conv0 is cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box;
the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
2. The single-step face detection system of claim 1, wherein the conventional convolution module conv0 comprises a 3 x 3 convolution layer, a BatchNorm layer and a LeakyReLU active layer connected in sequence from top to bottom.
3. The single-step face detection system of claim 1, wherein the depth separable convolution module conv1, depth separable convolution module conv2, depth separable convolution module conv3, depth separable convolution module conv4, depth separable convolution module conv5, depth separable convolution module conv6, depth separable convolution module conv7, depth separable convolution module conv8, depth separable convolution module conv9, depth separable convolution module conv10, depth separable convolution module conv11, depth separable convolution module conv12, depth separable convolution module conv13, depth separable convolution module conv14, depth separable convolution module conv16, depth separable convolution module conv17, depth separable convolution module conv19 and depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolution layer, a BatchNorm layer and a LeakyReLU activation layer connected in sequence from top to bottom.
4. The single-step face detection system of claim 1, wherein the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 are all 1024.
5. The single-step face detection system of claim 1, wherein said detection module det-32, detection module det-16 and detection module det-8 each comprise a conventional convolutional layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
6. The single-step face detection system of claim 5, wherein the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected to an elliptical regression device, the elliptical regression device converts the output layer prediction frame into an elliptical real frame, and the calculation formula of the elliptical real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
CN201910550738.8A 2019-06-24 2019-06-24 Single step human face detection system Active CN110263731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910550738.8A CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910550738.8A CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Publications (2)

Publication Number Publication Date
CN110263731A CN110263731A (en) 2019-09-20
CN110263731B true CN110263731B (en) 2021-03-16

Family

ID=67920979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910550738.8A Active CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Country Status (1)

Country Link
CN (1) CN110263731B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807385B (en) * 2019-10-24 2024-01-12 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111401292B (en) * 2020-03-25 2023-05-26 成都东方天呈智能科技有限公司 Face recognition network construction method integrating infrared image training
CN111489332B (en) * 2020-03-31 2023-03-17 成都数之联科技股份有限公司 Multi-scale IOF random cutting data enhancement method for target detection
CN112699826A (en) * 2021-01-05 2021-04-23 风变科技(深圳)有限公司 Face detection method and device, computer equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866833A (en) * 2015-05-29 2015-08-26 中国科学院上海高等研究院 Video stream face detection method and apparatus thereof
US9392257B2 (en) * 2011-11-28 2016-07-12 Sony Corporation Image processing device and method, recording medium, and program
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN108564030A (en) * 2018-04-12 2018-09-21 广州飒特红外股份有限公司 Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection
CN108647649A (en) * 2018-05-14 2018-10-12 中国科学技术大学 The detection method of abnormal behaviour in a kind of video
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN109272487A (en) * 2018-08-16 2019-01-25 北京此时此地信息科技有限公司 The quantity statistics method of crowd in a kind of public domain based on video
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
WO2019079895A1 (en) * 2017-10-24 2019-05-02 Modiface Inc. System and method for image processing using deep neural networks
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN109919308A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 A kind of neural network model dispositions method, prediction technique and relevant device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599797B (en) * 2016-11-24 2019-06-07 北京航空航天大学 A kind of infrared face recognition method based on local parallel neural network
CN108182397B (en) * 2017-12-26 2021-04-20 王华锋 Multi-pose multi-scale human face verification method
CN108664893B (en) * 2018-04-03 2022-04-29 福建海景科技开发有限公司 Face detection method and storage medium
CN109101899B (en) * 2018-07-23 2020-11-24 苏州飞搜科技有限公司 Face detection method and system based on convolutional neural network
CN109753927A (en) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 A kind of method for detecting human face and device
CN109711384A (en) * 2019-01-09 2019-05-03 江苏星云网格信息技术有限公司 A kind of face identification method based on depth convolutional neural networks

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9392257B2 (en) * 2011-11-28 2016-07-12 Sony Corporation Image processing device and method, recording medium, and program
CN104866833A (en) * 2015-05-29 2015-08-26 中国科学院上海高等研究院 Video stream face detection method and apparatus thereof
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
WO2019079895A1 (en) * 2017-10-24 2019-05-02 Modiface Inc. System and method for image processing using deep neural networks
CN109919308A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 A kind of neural network model dispositions method, prediction technique and relevant device
CN108564030A (en) * 2018-04-12 2018-09-21 广州飒特红外股份有限公司 Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection
CN108647649A (en) * 2018-05-14 2018-10-12 中国科学技术大学 The detection method of abnormal behaviour in a kind of video
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN109272487A (en) * 2018-08-16 2019-01-25 北京此时此地信息科技有限公司 The quantity statistics method of crowd in a kind of public domain based on video
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning Transferable Architectures for Scalable Image Recognition; Barret Zoph et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-31; full text *
Research and Implementation of Face Detection Based on the Adaboost Algorithm; 林鹏; China Masters' Theses Full-text Database, Information Science and Technology; 2007-08-15 (No. 2); full text *

Also Published As

Publication number Publication date
CN110263731A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263731B (en) Single step human face detection system
CN106874894B (en) Human body target detection method based on regional full convolution neural network
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN105095856B (en) Face identification method is blocked based on mask
CN107633226B (en) Human body motion tracking feature processing method
CN111079739B (en) Multi-scale attention feature detection method
CN108960404B (en) Image-based crowd counting method and device
CN106778687A (en) Method for viewing points detecting based on local evaluation and global optimization
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN112949572A (en) Slim-YOLOv 3-based mask wearing condition detection method
CN109858547A (en) A kind of object detection method and device based on BSSD
CN104732248B (en) Human body target detection method based on Omega shape facilities
Li et al. A complex junction recognition method based on GoogLeNet model
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
Lu et al. An improved target detection method based on multiscale features fusion
Zhong et al. Improved localization accuracy by locnet for faster r-cnn based text detection
CN113808166B (en) Single-target tracking method based on clustering difference and depth twin convolutional neural network
CN111553337A (en) Hyperspectral multi-target detection method based on improved anchor frame
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN109284752A (en) A kind of rapid detection method of vehicle
CN111339950B (en) Remote sensing image target detection method
CN109657577B (en) Animal detection method based on entropy and motion offset
CN116092179A (en) Improved Yolox fall detection system
Zhu et al. Real-time traffic sign detection based on YOLOv2
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant