CN110263731B - Single step human face detection system - Google Patents


Info

Publication number
CN110263731B
CN110263731B
Authority
CN
China
Prior art keywords
convolution module
depth separable
separable convolution
module
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910550738.8A
Other languages
Chinese (zh)
Other versions
CN110263731A (en)
Inventor
徐杰
田野
罗堡文
廖静茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910550738.8A priority Critical patent/CN110263731B/en
Publication of CN110263731A publication Critical patent/CN110263731A/en
Application granted granted Critical
Publication of CN110263731B publication Critical patent/CN110263731B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The invention discloses a single-step face detection system. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; the network contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range. The invention adopts a random cropping strategy better suited to the multi-scale detection structure, so that each detection module can be trained with a sufficient number of samples. The elliptical regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation criterion. The YOMO model maintains strongly competitive detection accuracy, reaches a detection rate of 51 FPS on pictures of 544 × 544 resolution, and occupies only 21 MB of memory.

Description

Single step human face detection system
Technical Field
The invention relates to the technical field of face detection, in particular to a single-step face detection system.
Background
Face detection is a key component of human-centered smart cities and underlies technologies such as identity recognition, personalized services, pedestrian detection and tracking, and crowd counting. Although widely studied, face detection in unconstrained scenes remains an open research problem due to its many challenges.
Early face detection focused on manually designing effective features and building efficient classifiers on top of them. This usually yields a sub-optimal detection model, and detection accuracy may drop sharply as the application scene changes. In recent years, deep learning has been successfully applied to face detection, but producing a real-time face detection model with high accuracy in unconstrained scenes remains a great challenge.
Faster R-CNN replaces the sliding window with a region proposal algorithm and integrates candidate box generation, feature extraction, bounding-box regression and classification into a single network; it is the model with the highest detection rate and accuracy in the R-CNN family. However, the proposal network generates many face candidate boxes, and the complicated network structure brings large computational overhead, so real-time face detection cannot be achieved.
Another class of face detection methods, such as YOLO, converts the detection problem into a regression problem and therefore contains no proposal network: face boxes are regressed directly on the feature maps of the feature extraction network. This gives a faster detection rate, but the detection accuracy needs improvement. To improve accuracy, SSD jointly predicts box categories and positions from multi-scale feature maps located at different layers. Multi-layer feature prediction helps detect faces of different scales, but no stage is trained specifically to handle faces of a particular scale range; that is, during training, faces of all scales contribute to the loss of every detection module. In contrast, each detection module of YOMO is trained only with faces in its appropriate scale range.
Aiming at the problem of small-scale face detection in single-step methods, HR trains multiple separate single-scale detectors on an image pyramid, each detector being responsible for faces of a specific scale. However, at test time the picture is rescaled to multiple scales and the picture at each scale passes through a very deep network, so this multi-step single-scale detector is computationally expensive.
Single-step multi-scale detectors such as S3FD instead detect faces using the multi-scale features of a deep convolutional network, so only a single pass of the picture through the network is required at test time. However, S3FD still has the same problem as SSD: each feature map at a different scale is used for prediction separately, and because the lower layers lack semantic features when predicting small-scale faces, the detection effect of S3FD on small-scale faces is still not ideal.
Disclosure of Invention
Aiming at the above defects in the prior art, the single-step face detection system provided by the invention solves the problem that the detection performance of existing face detection systems is not ideal.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a single-step face detection system, comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal characteristics of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal characteristics of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv 20.
Further: the conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU active layer connected in sequence from top to bottom.
Further:
the input picture of the conventional convolution module conv0 is cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box.
Further: the depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution lu, a butcm nortch layer, a leakyrey relu activation layer, a 1 × 1 point convolution layer, a butcm layer and a leakyy activation layer, which are connected in sequence from top to bottom.
Further: the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
Further: the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
Further: the detection module det-32, the detection module det-16 and the detection module det-8 comprise a conventional convolution layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
Further: the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected with an ellipse regressor, the ellipse regressor converts the output layer prediction frame into an ellipse real frame, and the calculation formula of the ellipse real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
The invention has the beneficial effects that:
1. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; it contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range.
2. The invention adopts a random cutting strategy which is more in line with a multi-scale detection structure, so that each detection module can be trained by a sufficient number of samples.
3. The elliptical regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation criterion.
4. While keeping strongly competitive detection accuracy, the YOMO model provided by the invention reaches a detection rate of 51 FPS on pictures of 544 × 544 resolution.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a graph of the evaluation results in the FDDB dataset according to the present invention;
FIG. 3 is a diagram of the results of the visualization in the WIDER FACE dataset and the FDDB dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art. It should be understood, however, that the present invention is not limited to the scope of the embodiments; to those skilled in the art, various changes are apparent so long as they do not depart from the spirit and scope of the invention as defined in the appended claims, and all matters produced using the inventive concept are protected.
As shown in fig. 1, a single-step face detection system includes a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal characteristics of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal characteristics of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv 20.
The output feature maps of conv14, conv17 and conv20 have downsampling strides of 32, 16 and 8, respectively, compared with the original image. The detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection; the face scale range each detection module is responsible for is shown in Table 1.
TABLE 1 Face scale range each detection module is responsible for

Scale class | det-8 (small-scale face) | det-16 (medium-scale face) | det-32 (large-scale face)
Minimum MinScale | 10 | 40 | 100
Maximum MaxScale | 39 | 99 | 350
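Purely as an illustration of the top-down fusion described above, the branch that feeds det-16 (conv14 → deconvolution conv15, fused with conv16 applied to the conv11 output, then conv17) might be sketched as follows. The channel widths and the plain stand-in convolutions are assumptions of this sketch, not part of the patent; the real conv16/conv17 are the depthwise separable blocks sketched later.

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Sketch of one top-down fusion branch of the YOMO network
    (conv14 -> deconv conv15, merged with conv16(conv11) -> conv17 -> det-16)."""

    def __init__(self, high_ch=1024, lateral_ch=512, out_ch=1024):
        super().__init__()
        # conv15: deconvolution that upsamples the stride-32 map to stride 16
        self.deconv15 = nn.ConvTranspose2d(high_ch, lateral_ch, kernel_size=2, stride=2)
        # conv16: lateral block on the conv11 output (1x1 conv as a stand-in)
        self.conv16 = nn.Conv2d(lateral_ch, lateral_ch, kernel_size=1)
        # conv17: block applied after feature merging (3x3 conv as a stand-in)
        self.conv17 = nn.Conv2d(2 * lateral_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feat_conv14, feat_conv11):
        top_down = self.deconv15(feat_conv14)           # deep, semantic features, upsampled
        lateral = self.conv16(feat_conv11)              # shallow, high-resolution features
        merged = torch.cat([top_down, lateral], dim=1)  # channel-wise feature fusion
        return self.conv17(merged)                      # fed to detection module det-16
```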
The invention trains the network with the RMSProp gradient optimization algorithm using the parameter settings of Table 2. The 3 detection modules are placed on layers with different strides to enhance the multi-scale detection capability of the model. During training, the loss function of each detection module is a multi-task loss containing 5 parts. So that each detection module is responsible only for faces within its corresponding scale range, the anchor with the maximum IoU with a real box is searched for during back-propagation, and only the detection branch to which that anchor belongs generates box-regression loss. To make training more efficient, each real box is matched to the anchor with which it has the highest IoU.
TABLE 2 Training parameter configuration

base_lr | step_value | gamma | batch_size | iter_size | type | weight_decay | max_iter
0.001 | 40000 | 0.1 | 9 | 3 | RMSProp | 0.00005 | 200000
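A minimal sketch of the anchor matching just described, assuming YOLO-style width/height IoU between a real box and 9 anchors (3 per detection branch, as stated in the experiments); the anchor values and function names are illustrative assumptions only.

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one real box and a set of anchors, comparing (w, h) pairs
    as if they shared a common centre (an assumption of this sketch)."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def match_gt_to_anchor(gt_wh, anchors_wh):
    """Index of the highest-IoU anchor for each real box; only the detection
    branch owning that anchor receives box-regression loss."""
    return [int(np.argmax(iou_wh(gt, anchors_wh))) for gt in gt_wh]

# usage: 9 assumed anchors, 3 per branch (det-8, det-16, det-32)
anchors = np.array([[12, 15], [20, 25], [32, 38],
                    [45, 55], [64, 78], [90, 95],
                    [120, 150], [200, 240], [320, 340]], dtype=float)
matches = match_gt_to_anchor(np.array([[30.0, 35.0], [150.0, 180.0]]), anchors)
branches = [m // 3 for m in matches]   # 0 -> det-8, 1 -> det-16, 2 -> det-32
```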
The multi-task loss function of YOMO includes 5 parts, namely a non-target loss, an anchor pre-training loss, a target localization loss, a target confidence loss and a target category loss, as shown in equation (3).
[Equation (3): the multi-task loss, a weighted sum of the non-target, anchor pre-training, localization, confidence and category losses accumulated over the W × H × A prediction grid]
Where W and H are the width and height of the feature map, respectively, A is the number of Anchors, and t is the number of iterations. 1(x) denotes a discriminator whose value is 1 when x is true and 0 otherwise. λ_noobj, λ_prior, λ_coord, λ_obj and λ_class are the weights of the sub-tasks, namely the non-target loss weight, the Anchor pre-training loss weight, the coordinate loss weight, the target loss weight and the category loss weight. b_r denotes the 4 coordinate offsets predicted by the network, and prior_r denotes the 4 coordinates of the Anchor, namely the abscissa x and ordinate y of the box center, the box width w and the box height h. When the IoU of a prediction box with all real boxes is less than or equal to the threshold Thresh, the region of the input image corresponding to that prediction box is a non-target, i.e. background, and its predicted confidence is b_o. So that the network adapts to the Anchors as quickly as possible, the Anchor pre-training loss is introduced in the early stage of training; in the YOMO model the pre-training period is defined as 1 epoch.
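A schematic sketch of how the five weighted terms are combined; the per-term sums over the W × H × A grid are assumed to be computed elsewhere, the exact per-term form is the one given by equation (3), and the weights shown are those reported later in the experiments.

```python
def yomo_multitask_loss(noobj_l, prior_l, coord_l, conf_l, cls_l,
                        lambda_noobj=1.0, lambda_prior=1.0, lambda_coord=1.0,
                        lambda_obj=5.0, lambda_class=1.0, pretraining=False):
    """Weighted composition of the five loss parts described above.
    Inputs are already-accumulated scalar terms; this only shows the weighting."""
    loss = (lambda_noobj * noobj_l + lambda_coord * coord_l
            + lambda_obj * conf_l + lambda_class * cls_l)
    if pretraining:  # anchor pre-training term is active only in the first epoch
        loss = loss + lambda_prior * prior_l
    return loss
```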
The conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU active layer connected in sequence from top to bottom.
Input pictures of the conventional convolution module conv0 are cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box.
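An illustrative sketch of the selection logic of steps S1–S6 (data layout and function names are assumptions; per-class counts of each candidate crop are computed after scaling to the network input size, and the S7 update is shown as a comment).

```python
import random
import numpy as np

def count_by_scale(face_sides, min_scales, max_scales):
    """S1: number of faces per scale class for one (already scaled) crop box."""
    return np.array([sum(1 for s in face_sides if lo <= s <= hi)
                     for lo, hi in zip(min_scales, max_scales)])

def select_crop(crop_counts, train_counts):
    """Semi-soft selection over the M candidate crops (steps S2-S6).

    crop_counts:  list of per-class face counts, one entry per candidate crop
    train_counts: per-class face counts accumulated so far in training
    Returns the index of the selected crop box (SelectCrop_bbox).
    """
    # S2: per-crop classes in descending count order; S3: training classes ascending
    crop_order = [np.argsort(-c) for c in crop_counts]   # most frequent class first
    need_order = np.argsort(train_counts)                # rarest trained class first

    # S4: prefer a crop whose ordering exactly mirrors what training lacks
    exact = [i for i, o in enumerate(crop_order) if np.array_equal(o, need_order)]
    if exact:
        return random.choice(exact)
    # S5: otherwise only require the two extremes to match
    partial = [i for i, o in enumerate(crop_order)
               if o[0] == need_order[0] and o[-1] == need_order[-1]]
    if partial:
        return random.choice(partial)
    # S6: fall back to a purely random crop
    return random.randrange(len(crop_counts))

# S7 (after selection): train_counts += crop_counts[selected_index]
```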
The depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolution layer, a BatchNorm layer and a LeakyReLU activation layer, which are connected in sequence from top to bottom.
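A minimal PyTorch-style sketch of this building block; the depthwise grouping of the 3 × 3 convolution is inferred from the module name "depth separable", and the LeakyReLU negative slope is an assumption of this sketch.

```python
import torch.nn as nn

def depthwise_separable_block(in_ch, out_ch, stride=1, slope=0.1):
    """conv1..conv20 building block as described above:
    3x3 (depthwise) convolution -> BatchNorm -> LeakyReLU,
    then 1x1 pointwise convolution -> BatchNorm -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),               # 3x3 depthwise convolution
        nn.BatchNorm2d(in_ch),
        nn.LeakyReLU(slope, inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # 1x1 pointwise convolution
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(slope, inplace=True),
    )
```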
The number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
The detection module det-32, the detection module det-16 and the detection module det-8 each comprise a conventional convolution layer and an output layer;
the calculation formula of the number of output channels of the conventional convolutional layer is as follows:
num_output = (num_coordinate + num_confidence + num_classes) × num_Anchors
wherein num_coordinate, num_confidence, num_classes and num_Anchors respectively denote the number of box coordinates, confidence values, categories and Anchors. A larger number of Anchors gives better detection accuracy but slows training and testing. Considering that the 3 detection modules of YOMO are each responsible for faces of one scale range, and balancing speed against precision, num_Anchors = 3 is chosen. The number of output channels of the conventional convolutional layer in each detection module is therefore (4 + 1 + 1) × 3 = 18.
The calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
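A small sketch of the decoding defined by the formulas above, with coordinates expressed in grid-cell units; the numeric example values are illustrative only.

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction into (bx, by, bw, bh) following the formulas above.
    (cx, cy) is the top-left corner of the grid cell containing the Anchor,
    (pw, ph) is the Anchor width/height."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# example: an anchor of size 3.1 x 3.9 grid cells located at cell (7, 5)
print(decode_box(0.2, -0.4, 0.1, 0.0, 7, 5, 3.1, 3.9))
```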
The output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected with an ellipse regressor, the ellipse regressor converts the output layer prediction frame into an ellipse real frame, and the calculation formula of the ellipse real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
When training the ellipse regressor, how to match prediction boxes with real boxes is the key. In practice, for each real box of each FDDB picture, the prediction box with the highest IoU is matched, and only the real boxes and their matched prediction boxes are used in training.
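A minimal numpy sketch of fitting the regression matrix W on normalized coordinates and applying it to a detected box; the column ordering of X and Y and the helper names are assumptions of this sketch.

```python
import numpy as np

def fit_ellipse_regressor(X, Y):
    """Least-squares fit of W in Y = XW + eps after z-score normalisation.
    X: (n, 4) matched prediction boxes (bx, by, bw, bh)
    Y: (n, 5) elliptical real boxes (ra, rb, theta, cx, cy)."""
    Ux, sx = X.mean(axis=0), X.std(axis=0)
    Uy, sy = Y.mean(axis=0), Y.std(axis=0)
    Xn, Yn = (X - Ux) / sx, (Y - Uy) / sy
    W, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)   # minimises the mean squared error
    return W, (Ux, sx, Uy, sy)

def predict_ellipse(box, W, stats):
    """Map one detected rectangle to ellipse parameters with the fitted W."""
    Ux, sx, Uy, sy = stats
    return ((box - Ux) / sx) @ W * sy + Uy
```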
The experimental environment of the invention is a 64-bit Ubuntu 14.04 LTS system with 16 GB of memory and an 8-core Intel Core i7-7700K CPU with a single-core frequency of 4.20 GHz. All models are based on the Caffe framework and trained on a single GPU, an NVIDIA GeForce GTX 1080 Ti.
The feature extraction network of the YOMO model was pre-trained on ImageNet and fine-tuned for 200K iterations. Other training parameter settings are shown in Table 2. The anchor with the largest IoU with a real box is a positive example, and anchors with IoU < 0.3 are treated as background. Considering the detection rate and the face scale ranges, each detection module contains 3 anchors, whose values are obtained by clustering on the training set. The weights of the loss terms are λ_noobj = 1, λ_prior = 1, λ_coord = 1, λ_obj = 5 and λ_class = 1. The NMS threshold of each detection module is set to 0.7 for training and 0.45 for testing. The training pictures of all models in the present invention are scaled to 544 × 544 resolution.
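For completeness, a standard greedy non-maximum suppression routine of the kind referred to above; the exact NMS variant used by the model is not specified in the text, so this is only the conventional formulation.

```python
import numpy as np

def nms(boxes, scores, thresh=0.45):
    """Greedy NMS. boxes: (N, 4) arrays of (x1, y1, x2, y2) corners;
    returns the indices of the kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the current best box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]   # drop boxes overlapping too much
    return keep
```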
WIDER FACE is a face detection benchmark dataset whose pictures are collected from the internet and have complex backgrounds. The dataset comprises 32203 pictures with 393703 annotated faces, which vary widely in size, pose and occlusion. It is divided into 61 event classes, and within each class the data are split into training, validation and test sets at a ratio of 40%, 10% and 50%. All models in the invention are trained on the training set.
The pictures in the FDDB dataset are collected from the Faces in the Wild dataset; FDDB comprises 2845 pictures and 5171 faces. It contains difficult cases including occlusion, hard poses, low resolution and defocus, as well as both black-and-white and color pictures. Unlike other face detection datasets, the labeled region is an ellipse rather than a rectangle. All models in the present invention were tested on the FDDB dataset.
When testing on the FDDB dataset, all pictures were scaled with their aspect ratio preserved and embedded in a 544 × 544 black background to ensure that they were not deformed. As shown in FIGS. 2(a) and 2(b), the results of YOMO were compared with the MTCNN, ScaleFace, HR-ER, ICC-CNN and FANet models under DiscROC and ContROC, respectively.
In FIG. 2, YOMO-Fit denotes the detection results of YOMO after the elliptical regressor. According to the FDDB evaluation, with the number of false detections fixed at 1000, YOMO-Fit reaches recall rates of 97.7% under DiscROC and 83.6% under ContROC, lower only than FANet. HR-ER, even using FDDB as 10-fold cross-validated training data, has the same DiscROC recall as YOMO and a ContROC recall 4.9% lower than YOMO-Fit. Notably, the elliptical regressor increases the recall of YOMO under DiscROC and ContROC by 0.1% and 8.6%, respectively.
FIGS. 3(a) and 3(b) are visualizations of tests on individual pictures from the WIDER FACE and FDDB datasets, respectively. The rectangular boxes in FIG. 3(a) are prediction boxes of the YOMO model. The rectangles and ellipses in FIG. 3(b) are the prediction boxes of YOMO and the real boxes, respectively.

Claims (6)

1. A single-step face detection system, which is characterized by comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
an output terminal of the depth separable convolution module conv11 is connected to an input terminal of a depth separable convolution module conv16, an output terminal of the depth separable convolution module conv16 is fused with an output terminal characteristic of a deconvolution layer conv15 and is connected to an input terminal of the depth separable convolution module conv17, an output terminal of the depth separable convolution module conv5 is connected to an input terminal of a depth separable convolution module conv19, and an output terminal of the depth separable convolution module conv19 is fused with an output terminal characteristic of a deconvolution layer conv18 and is connected to an input terminal of a depth separable convolution module conv 20;
the input picture of the conventional convolution module conv0 is cropped and used for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with an aspect ratio of 1 through a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid real boxes inside each crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1..K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of scale class c; bbox_k is the side length of the k-th valid real box in the crop box after scaling; and K is the total number of such real boxes;
S2, for each crop box, sorting the face scale classes in descending order of their face counts:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic denotes one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale class accumulated so far in actual network training, and arranging the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c denotes one of the N face scale classes;
S4, among the face-scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching, among the face-scale-class orderings of the M crop boxes Crop_bboxes, for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the per-class face count A_c used in actual training, namely:
A_c^(s) = A_c^(s-1) + Num_sc
in the above formula, A_c^(s-1) is the number of faces of scale class c accumulated in previous training, and s denotes the sequence number of the selected crop box;
the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
2. The single-step face detection system of claim 1, wherein the conventional convolution module conv0 comprises a 3 x 3 convolution layer, a BatchNorm layer and a LeakyReLU active layer connected in sequence from top to bottom.
3. The single-step face detection system of claim 1, wherein the depth separable convolution module conv1, depth separable convolution module conv2, depth separable convolution module conv3, depth separable convolution module conv4, depth separable convolution module conv5, depth separable convolution module conv6, depth separable convolution module conv7, depth separable convolution module conv8, depth separable convolution module conv9, depth separable convolution module conv10, depth separable convolution module conv11, depth separable convolution module conv12, depth separable convolution module conv13, depth separable convolution module conv14, depth separable convolution module conv16, depth separable convolution module conv17, depth separable convolution module conv19 and depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolution layer, a BatchNorm layer and a LeakyReLU activation layer connected in sequence from top to bottom.
4. The single-step face detection system of claim 1, wherein the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 are all 1024.
5. The single-step face detection system of claim 1, wherein said detection module det-32, detection module det-16 and detection module det-8 each comprise a conventional convolutional layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the calculation formula of the center coordinate and the side length of the output layer prediction frame is as follows:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y
b_w = p_w·e^(t_w), b_h = p_h·e^(t_h)
in the above formula, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the predicted width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the Anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the Anchor.
6. The single-step face detection system of claim 5, wherein the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected to an elliptical regression device, the elliptical regression device converts the output layer prediction frame into an elliptical real frame, and the calculation formula of the elliptical real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the elliptical real box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the abscissa c_x and ordinate c_y of the center point; X is the coordinate vector of the output-layer prediction box, comprising the center coordinates b_x and b_y and the width b_w and height b_h of the prediction box; W is the regression coefficient matrix; and ε is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
W = argmin_W J(X′W, Y′)
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
X′ = (X − U_X) / σ_X
Y′ = (Y − U_Y) / σ_Y
in the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real box coordinate vector Y.
CN201910550738.8A 2019-06-24 2019-06-24 Single step human face detection system Active CN110263731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910550738.8A CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910550738.8A CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Publications (2)

Publication Number Publication Date
CN110263731A CN110263731A (en) 2019-09-20
CN110263731B true CN110263731B (en) 2021-03-16

Family

ID=67920979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910550738.8A Active CN110263731B (en) 2019-06-24 2019-06-24 Single step human face detection system

Country Status (1)

Country Link
CN (1) CN110263731B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807385B (en) * 2019-10-24 2024-01-12 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111401292B (en) * 2020-03-25 2023-05-26 成都东方天呈智能科技有限公司 Face recognition network construction method integrating infrared image training
CN111489332B (en) * 2020-03-31 2023-03-17 成都数之联科技股份有限公司 Multi-scale IOF random cutting data enhancement method for target detection
CN112699826A (en) * 2021-01-05 2021-04-23 风变科技(深圳)有限公司 Face detection method and device, computer equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866833A (en) * 2015-05-29 2015-08-26 中国科学院上海高等研究院 Video stream face detection method and apparatus thereof
US9392257B2 (en) * 2011-11-28 2016-07-12 Sony Corporation Image processing device and method, recording medium, and program
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN108564030A (en) * 2018-04-12 2018-09-21 广州飒特红外股份有限公司 Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection
CN108647649A (en) * 2018-05-14 2018-10-12 中国科学技术大学 The detection method of abnormal behaviour in a kind of video
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN109272487A (en) * 2018-08-16 2019-01-25 北京此时此地信息科技有限公司 The quantity statistics method of crowd in a kind of public domain based on video
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
WO2019079895A1 (en) * 2017-10-24 2019-05-02 Modiface Inc. System and method for image processing using deep neural networks
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN109919308A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 A kind of neural network model dispositions method, prediction technique and relevant device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599797B (en) * 2016-11-24 2019-06-07 北京航空航天大学 A kind of infrared face recognition method based on local parallel neural network
CN108182397B (en) * 2017-12-26 2021-04-20 王华锋 Multi-pose multi-scale human face verification method
CN108664893B (en) * 2018-04-03 2022-04-29 福建海景科技开发有限公司 Face detection method and storage medium
CN109101899B (en) * 2018-07-23 2020-11-24 苏州飞搜科技有限公司 Face detection method and system based on convolutional neural network
CN109753927A (en) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 A kind of method for detecting human face and device
CN109711384A (en) * 2019-01-09 2019-05-03 江苏星云网格信息技术有限公司 A kind of face identification method based on depth convolutional neural networks

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9392257B2 (en) * 2011-11-28 2016-07-12 Sony Corporation Image processing device and method, recording medium, and program
CN104866833A (en) * 2015-05-29 2015-08-26 中国科学院上海高等研究院 Video stream face detection method and apparatus thereof
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
WO2019079895A1 (en) * 2017-10-24 2019-05-02 Modiface Inc. System and method for image processing using deep neural networks
CN109919308A (en) * 2017-12-13 2019-06-21 腾讯科技(深圳)有限公司 A kind of neural network model dispositions method, prediction technique and relevant device
CN108564030A (en) * 2018-04-12 2018-09-21 广州飒特红外股份有限公司 Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection
CN108647649A (en) * 2018-05-14 2018-10-12 中国科学技术大学 The detection method of abnormal behaviour in a kind of video
CN109284670A (en) * 2018-08-01 2019-01-29 清华大学 A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN109272487A (en) * 2018-08-16 2019-01-25 北京此时此地信息科技有限公司 The quantity statistics method of crowd in a kind of public domain based on video
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN109815886A (en) * 2019-01-21 2019-05-28 南京邮电大学 A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning Transferable Architectures for Scalable Image Recognition; Barret Zoph et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-31; full text *
Research and Implementation of Face Detection Based on the Adaboost Algorithm; 林鹏; China Masters' Theses Full-text Database, Information Science and Technology; 2007-08-15 (No. 2); full text *

Also Published As

Publication number Publication date
CN110263731A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263731B (en) Single step human face detection system
CN106874894B (en) Human body target detection method based on regional full convolution neural network
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN105095856B (en) Face identification method is blocked based on mask
CN107633226B (en) Human body motion tracking feature processing method
CN111079739B (en) Multi-scale attention feature detection method
CN108960404B (en) Image-based crowd counting method and device
CN106778687A (en) Method for viewing points detecting based on local evaluation and global optimization
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN112949572A (en) Slim-YOLOv 3-based mask wearing condition detection method
CN109858547A (en) A kind of object detection method and device based on BSSD
CN104732248B (en) Human body target detection method based on Omega shape facilities
Li et al. A complex junction recognition method based on GoogLeNet model
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
Lu et al. An improved target detection method based on multiscale features fusion
Zhong et al. Improved localization accuracy by locnet for faster r-cnn based text detection
CN113808166B (en) Single-target tracking method based on clustering difference and depth twin convolutional neural network
CN111553337A (en) Hyperspectral multi-target detection method based on improved anchor frame
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN109284752A (en) A kind of rapid detection method of vehicle
CN111339950B (en) Remote sensing image target detection method
CN109657577B (en) Animal detection method based on entropy and motion offset
CN116092179A (en) Improved Yolox fall detection system
Zhu et al. Real-time traffic sign detection based on YOLOv2
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant