CN110263731B - Single step human face detection system - Google Patents
- Publication number
- CN110263731B (application CN201910550738.8A)
- Authority
- CN
- China
- Prior art keywords
- convolution module
- depth separable
- separable convolution
- module
- face
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The invention discloses a single-step face detection system. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; it contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range. The invention adopts a random cropping strategy better suited to the multi-scale detection structure, so that each detection module can be trained with a sufficient number of samples. The ellipse regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation standard. The detection accuracy of the proposed YOMO model remains strongly competitive, the detection rate on pictures of 544 × 544 resolution is 51 FPS, and the model occupies only 21 MB of memory.
Description
Technical Field
The invention relates to the technical field of face detection, in particular to a single-step face detection system.
Background
Face detection is a key component of human-centered smart cities, underpinning technologies such as identity recognition, personalized services, pedestrian detection and tracking, and crowd counting. Although widely studied, face detection in unconstrained scenes remains an open research problem due to its many challenges.
Early face detection focused on manually designing effective features and building efficient classifiers on top of them. This usually yields a sub-optimal detection model whose accuracy can degrade sharply when the application scene changes. In recent years, deep learning has been successfully applied to the face detection task, but producing a real-time, high-accuracy detection model for unconstrained scenes remains highly challenging.
Faster R-CNN replaces the sliding window with a region proposal algorithm and integrates candidate-box generation, feature extraction, bounding-box regression and classification into a single network; it is the model with the highest detection rate and accuracy in the R-CNN series. However, the proposal network generates many face candidate boxes, and the complicated network structure incurs a large computational overhead, so real-time face detection cannot be achieved.
Another type of face detection method, such as YOLO, converts detection into a regression problem and therefore contains no proposal network; face boxes are regressed directly from the feature maps of the feature extraction network. This yields a faster detection rate, but the detection accuracy needs improvement. To improve accuracy, SSD jointly predicts box categories and positions using multi-scale feature maps located at different layers. Multi-layer feature prediction helps detect faces of different scales, but no stage is trained specifically to handle faces of a particular scale range; that is, during training, faces of all scales contribute loss in every detection module. In contrast, each detection module of YOMO is trained only with faces in the appropriate scale range.
Aiming at the problem of small-scale face detection in single-step methods, HR trains several separate single-scale detectors on an image pyramid, each detector responsible for faces of a specific scale. However, in the testing stage the picture is scaled to multiple scales and each scaled picture passes through a very deep network, so this multi-pass single-scale detector is computationally expensive.
Single-step multi-scale detectors such as S3FD, in contrast, detect faces using the multi-scale features of a deep convolutional network, so a picture only needs to pass through the network once in the testing phase. However, S3FD still has the same problem as SSD: each feature map of a different scale is used for prediction separately, and when small-scale faces are predicted from the shallow layers, the lack of semantic features leaves S3FD's detection of small-scale faces unsatisfactory.
Disclosure of Invention
Aiming at the above defects in the prior art, the single-step face detection system provided by the invention addresses the unsatisfactory detection performance of existing face detection systems.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a single-step face detection system, comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal features of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal features of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv20.
Further: the conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU active layer connected in sequence from top to bottom.
Further:
the input picture of the conventional convolution module conv0 is cropped and used for training according to a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crops_bboxes with aspect ratio 1 through a random cropping algorithm, cropping the original image to obtain a cropped image, scaling the cropped image to the input size required by the network, scaling the valid real frames inside the crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1}^{K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
In the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, equal to 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is an indicator whose value is 1 when the condition is true and 0 otherwise; MinScale_c and MaxScale_c are the minimum and maximum boundary values of the class-c face scale; bbox_k is the side length of the k-th frame; and K is the total number of frames;
S2, for each crop box, sorting the face scale classes in descending order of the number of faces of each class:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
where i is the crop box index and S_ic is one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale seen so far in actual network training, and sorting the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
where A_c is one of the N face scale classes;
S4, searching the M sorted face-scale class orderings of the crop boxes Crops_bboxes for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching the M orderings for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting a crop box as SelectCrop_bbox;
S7, adding the number Num_sc of faces of each scale class in SelectCrop_bbox to the counts used in actual training, namely updating each class count obtained from previous training by Num_sc, where s denotes the index of the selected crop box.
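One possible reading of steps S1–S7 in plain Python; the matching rule of S4 is interpreted here as the crop box's descending class ordering mirroring the training counts' ascending ordering, and all function names and scale boundaries are illustrative, not taken from the patent:

```python
import random

# Illustrative scale boundaries (MinScale_c, MaxScale_c) per class.
SCALE_BOUNDS = [(10, 39), (40, 99), (100, 350)]

def count_scales(face_sides, bounds=SCALE_BOUNDS):
    """Num_ic: count the faces per scale class inside one crop box (step S1)."""
    return [sum(lo <= s <= hi for s in face_sides) for (lo, hi) in bounds]

def select_crop(crops, train_counts):
    """Semi-soft selection, steps S2-S7.

    crops        -- list of per-crop face counts [Num_i1, ..., Num_iN]
    train_counts -- running per-class counts seen so far in training
    """
    n = len(train_counts)
    # S3: class indices in ascending order of training counts (A_1 <= ... <= A_N).
    asc = sorted(range(n), key=lambda c: train_counts[c])
    # S2: per-crop class indices in descending order (S_i1 >= ... >= S_iN).
    desc = [sorted(range(n), key=lambda c: -counts[c]) for counts in crops]
    # S4: crops whose descending order exactly mirrors the ascending need.
    cand = [i for i, d in enumerate(desc) if d == asc]
    if not cand:
        # S5: relax to matching only the most- and least-needed classes.
        cand = [i for i, d in enumerate(desc)
                if d[0] == asc[0] and d[-1] == asc[-1]]
    if not cand:
        # S6: fall back to a uniformly random crop box.
        cand = list(range(len(crops)))
    i = random.choice(cand)
    # S7: fold the selected crop's counts into the running totals.
    for c in range(n):
        train_counts[c] += crops[i][c]
    return i
```

The running `train_counts` makes later selections favor the scale classes that have so far been under-represented, which is the point of the semi-soft strategy.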
Further: the depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, and each include a 3 × 3 convolution lu, a butcm nortch layer, a leakyrey relu activation layer, a 1 × 1 point convolution layer, a butcm layer and a leakyy activation layer, which are connected in sequence from top to bottom.
Further: the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
Further: the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
Further: the detection module det-32, the detection module det-16 and the detection module det-8 comprise a conventional convolution layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the calculation formulas for the center coordinates and side lengths of the output-layer prediction box are:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y, b_w = p_w·e^{t_w}, b_h = p_h·e^{t_h}
In the above formulas, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the width and height offsets, (C_x, C_y) are the coordinates of the top-left corner of the grid cell containing the anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the anchor.
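A decoding sketch following these formulas; the exponential width/height mapping is the standard YOLO convention and is assumed here:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network offsets (t_x, t_y, t_w, t_h) to a box, YOLO-style.

    (cx, cy) -- top-left corner of the grid cell holding the anchor
    (pw, ph) -- anchor width and height
    """
    bx = sigmoid(tx) + cx      # b_x = sigma(t_x) + C_x
    by = sigmoid(ty) + cy      # b_y = sigma(t_y) + C_y
    bw = pw * math.exp(tw)     # assumed: b_w = p_w * e^{t_w}
    bh = ph * math.exp(th)     # assumed: b_h = p_h * e^{t_h}
    return bx, by, bw, bh

# Zero offsets place the box at the cell center with the anchor's size.
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=7, cy=3, pw=2.0, ph=3.0)
```

The sigmoid keeps the predicted center inside its grid cell, which stabilizes training relative to unconstrained offsets.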
Further: the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected with an ellipse regressor, the ellipse regressor converts the output layer prediction frame into an ellipse real frame, and the calculation formula of the ellipse real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the ellipse real frame, including the major semi-axis raShort semi-axis rbAngle θ, center point abscissa cxAnd ordinate cyX is a coordinate vector of the output layer prediction frame, including the center coordinate b of the prediction framex、byWidth of prediction frame bwAnd high bhW is a regression coefficient matrix, and epsilon is a random error;
the calculation formula of the regression coefficient matrix W is as follows:
in the above formula, J (-) represents the mean square error function, X 'is the normalized coordinate vector of the prediction frame, and Y' is the normalized coordinate vector of the real frame;
in the above formula, UXAnd σXMean and standard deviation, U, of X, respectively, of the prediction frame coordinate vectorYAnd σYRespectively, the mean and standard deviation of the real box coordinate vector Y.
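The ellipse regressor amounts to ordinary linear least squares on normalized coordinates; a minimal NumPy sketch on synthetic data (all names illustrative):

```python
import numpy as np

def fit_ellipse_regressor(X, Y):
    """Fit W minimizing the mean squared error between X'W and Y'.

    X -- (n, 4) prediction-box vectors (b_x, b_y, b_w, b_h)
    Y -- (n, 5) ellipse vectors (r_a, r_b, theta, c_x, c_y)
    Returns W and the normalization statistics needed at test time.
    """
    ux, sx = X.mean(axis=0), X.std(axis=0)
    uy, sy = Y.mean(axis=0), Y.std(axis=0)
    Xn, Yn = (X - ux) / sx, (Y - uy) / sy        # X' and Y'
    W, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)  # argmin_W ||X'W - Y'||^2
    return W, (ux, sx, uy, sy)

def predict_ellipse(box, W, stats):
    ux, sx, uy, sy = stats
    return (((box - ux) / sx) @ W) * sy + uy     # de-normalize the output

# Synthetic check: ellipses generated as an exact linear map of the boxes.
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 4))
Y = X @ rng.normal(size=(4, 5))
W, stats = fit_ellipse_regressor(X, Y)
pred = predict_ellipse(X[0], W, stats)
```

Normalizing both sides keeps the regression well-conditioned when box coordinates and ellipse parameters live on very different scales.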
The invention has the beneficial effects that:
1. The invention provides a real-time face detection network, YOMO, built from depthwise separable convolutions; it contains several top-down feature fusion structures, and each detection module is responsible only for detecting faces within its corresponding scale range.
2. The invention adopts a random cropping strategy better suited to the multi-scale detection structure, so that each detection module can be trained with a sufficient number of samples.
3. The ellipse regressor provided by the invention greatly improves the detection recall rate under the ContROC evaluation standard.
4. The proposed YOMO model remains strongly competitive in detection accuracy while achieving a detection rate of 51 FPS on pictures of 544 × 544 resolution.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a graph of the evaluation results in the FDDB dataset according to the present invention;
FIG. 3 is a diagram of the results of the visualization in the WIDER FACE dataset and the FDDB dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes may be made without departing from the spirit and scope of the invention as defined by the appended claims, and everything produced using the inventive concept is protected.
As shown in fig. 1, a single-step face detection system includes a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, which are connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output terminal of the depth separable convolution module conv16 is merged with the output terminal characteristics of the deconvolution layer conv15 and connected to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output terminal of the depth separable convolution module conv19 is merged with the output terminal characteristics of the deconvolution layer conv18 and connected to the input terminal of the depth separable convolution module conv 20.
The output feature maps of conv14, conv17 and conv20 have downsampling strides of 32, 16 and 8, respectively, relative to the original image. The detection module det-32 is used for large-scale face detection, det-16 for medium-scale face detection and det-8 for small-scale face detection; the face scale range each module is responsible for is shown in Table 1.
TABLE 1 face Scale Range for which the detection Module is responsible
Scale class | det-8 (small-scale face) | det-16 (medium-scale face) | det-32 (large-scale face)
Minimum MinScale | 10 | 40 | 100
Maximum MaxScale | 39 | 99 | 350
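The scale-to-module assignment in Table 1 can be expressed directly (the function name is illustrative):

```python
# Scale ranges from Table 1: (module, MinScale, MaxScale).
SCALE_RANGES = [("det-8", 10, 39), ("det-16", 40, 99), ("det-32", 100, 350)]

def module_for_face(side):
    """Return the detection module responsible for a face of the given
    side length, or None when the face falls outside every trained range."""
    for name, lo, hi in SCALE_RANGES:
        if lo <= side <= hi:
            return name
    return None
```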
The invention trains the network using the RMSProp gradient optimization algorithm with the parameter settings of Table 2. The 3 detection modules are placed on layers with different strides to enhance the model's multi-scale detection capability. In training, the loss function of each detection module is a multitask loss function with 5 parts. So that each detection module is responsible only for faces within its scale range, the anchor with the largest IoU with the real frame is found during gradient backpropagation, and only the detection branch that anchor belongs to generates box regression loss. To make training more efficient, each real box is matched to the anchor with the highest IoU.
TABLE 2 training file parameter configuration Table
base_lr | step_value | gamma | batch_size | iter_size | type | weight_decay | max_iter
0.001 | 40000 | 0.1 | 9 | 3 | RMSProp | 0.00005 | 200000
The multitask loss function of YOMO comprises 5 parts: a non-target loss, an anchor pre-training loss, a target localization loss, a target confidence loss and a target class loss, as shown in formula (3).
Here W and H are the width and height of the feature map, A is the number of anchors, and t is the number of iterations. 1(x) denotes an indicator whose value is 1 when x is true and 0 otherwise. λ_noobj, λ_prior, λ_coord, λ_obj and λ_class are the weights of the sub-tasks: the non-target loss weight, the anchor pre-training loss weight, the coordinate loss weight, the target loss weight and the class loss weight, respectively. b_r denotes the 4 coordinate offset values predicted by the network, and prior_r the 4 coordinates of the anchor, namely the abscissa x and ordinate y of the box center, the box width w and the box height h. When the IoU of a prediction box with all real boxes is less than or equal to the threshold Thresh, the input-image region corresponding to that prediction box is a non-target, i.e. background, and its confidence prediction is b_o. To let the network adapt to the anchors as quickly as possible, an anchor pre-training loss is introduced in the early stage of training; the YOMO model defines 1 epoch as the pre-training period.
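Schematically, the total loss is a weighted sum of the five parts; the sketch below uses placeholder loss values together with the λ weights reported in the training settings (λ_obj = 5, all others 1), while the exact per-term forms are those of the patent's formula (3):

```python
def yomo_total_loss(losses, weights):
    """Weighted sum of the 5 sub-losses.

    losses, weights -- dicts keyed by: noobj, prior, coord, obj, cls
    """
    return sum(weights[k] * losses[k] for k in losses)

# Weights as stated in the training configuration of this document.
weights = {"noobj": 1, "prior": 1, "coord": 1, "obj": 5, "cls": 1}

# Placeholder sub-loss values, purely for illustration.
total = yomo_total_loss(
    {"noobj": 0.2, "prior": 0.1, "coord": 0.3, "obj": 0.4, "cls": 0.05},
    weights,
)
```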
The conventional convolution module conv0 includes a 3 × 3 convolution layer, a BatchNorm layer, and a LeakyReLU activation layer connected in sequence from top to bottom.
Input pictures of the conventional convolution module conv0 are cropped and used for training according to a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which specifically comprises the following steps:
S1, generating a plurality of crop boxes Crops_bboxes with aspect ratio 1 through a random cropping algorithm, cropping the original image to obtain a cropped image, scaling the cropped image to the input size required by the network, scaling the valid real frames inside the crop box by the same ratio, and counting the number of faces of each scale according to the face scale ranges, with the statistical formula:
Num_ic = Σ_{k=1}^{K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)
In the above formula, Num_ic is the number of faces of scale class c in the i-th crop box; N is the number of face scale classes, equal to 3 (small-scale, medium-scale and large-scale faces); M is the total number of crop boxes; 1(·) is an indicator whose value is 1 when the condition is true and 0 otherwise; MinScale_c and MaxScale_c are the minimum and maximum boundary values of the class-c face scale; bbox_k is the side length of the k-th frame; and K is the total number of frames;
S2, for each crop box, sorting the face scale classes in descending order of the number of faces of each class:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
where i is the crop box index and S_ic is one of the N face scale classes of the i-th crop box;
S3, counting the number of faces of each scale seen so far in actual network training, and sorting the face scale classes in ascending order of these counts:
A_1 ≤ A_2 ≤ … ≤ A_N
where A_c is one of the N face scale classes;
S4, searching the M sorted face-scale class orderings of the crop boxes Crops_bboxes for crop boxes satisfying S_ic = A_c for every class c, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching the M orderings for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box meeting the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting a crop box as SelectCrop_bbox;
S7, adding the number Num_sc of faces of each scale class in SelectCrop_bbox to the counts used in actual training, namely updating each class count obtained from previous training by Num_sc, where s denotes the index of the selected crop box.
The depth separable convolution module conv1, the depth separable convolution module conv2, the depth separable convolution module conv3, the depth separable convolution module conv4, the depth separable convolution module conv5, the depth separable convolution module conv6, the depth separable convolution module conv7, the depth separable convolution module conv8, the depth separable convolution module conv9, the depth separable convolution module conv10, the depth separable convolution module conv11, the depth separable convolution module conv12, the depth separable convolution module conv13, the depth separable convolution module conv14, the depth separable convolution module conv16, the depth separable convolution module conv17, the depth separable convolution module conv19 and the depth separable convolution module conv20 have the same structure, each comprising a 3 × 3 depthwise convolution layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolution layer, a BatchNorm layer and a LeakyReLU activation layer, connected in sequence from top to bottom.
The number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 is 1024.
The detection module det-32, the detection module det-16 and the detection module det-8 comprise a conventional convolution layer and an output layer;
the number of output channels of the conventional convolution layer is calculated as:
num_output = (num_coordinate + num_confidence + num_classes) × num_Anchors
where coordinate, confidence, classes and Anchors denote the box coordinate values, the confidence, the classes and the anchors, respectively. A larger number of anchors yields better detection accuracy but slows training and testing. Considering that the 3 detection modules of YOMO are responsible for faces of 3 scales, and balancing speed against precision, num_Anchors = 3. The number of output channels of the conventional convolution layer in each detection module is therefore 18.
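The channel count works out as follows, taking num_coordinate = 4, num_confidence = 1 and num_classes = 1 (faces only), which is the combination consistent with the stated 18 channels:

```python
def det_output_channels(num_coordinate=4, num_confidence=1, num_classes=1,
                        num_anchors=3):
    # num_output = (num_coordinate + num_confidence + num_classes) * num_Anchors
    return (num_coordinate + num_confidence + num_classes) * num_anchors

channels = det_output_channels()  # (4 + 1 + 1) * 3 = 18
```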
The calculation formulas for the center coordinates and side lengths of the output-layer prediction box are:
b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y, b_w = p_w·e^{t_w}, b_h = p_h·e^{t_h}
In the above formulas, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the width and height offsets, (C_x, C_y) are the coordinates of the top-left corner of the grid cell containing the anchor, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the anchor.
The output ends of the detection modules det-32, det-16 and det-8 are all connected to an ellipse regressor, which converts the output-layer prediction box into an elliptical real frame calculated as:
Y = XW + ε
where Y is the coordinate vector of the elliptical real frame, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the center abscissa c_x and ordinate c_y; X is the coordinate vector of the output-layer prediction box, comprising the prediction box center coordinates b_x, b_y and its width b_w and height b_h; W is the regression coefficient matrix; and ε is a random error.
The regression coefficient matrix W is obtained by minimizing the mean squared error:
W = argmin_W J(X′W, Y′)
where J(·) denotes the mean squared error function, X′ is the normalized prediction-box coordinate vector and Y′ is the normalized real-frame coordinate vector:
X′ = (X − U_X)/σ_X, Y′ = (Y − U_Y)/σ_Y
where U_X and σ_X are the mean and standard deviation of the prediction-box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the real-frame coordinate vector Y.
When training the ellipse regressor, the key is how to match prediction boxes with real boxes. In practice, each real frame of each FDDB picture is matched with the prediction box having the highest IoU, and only real frames and their matched prediction boxes are considered in training.
The experimental environment is a 64-bit Ubuntu 14.04 LTS system with 16 GB of memory and an 8-core Intel Core i7-7700K CPU with a single-core frequency of 4.20 GHz. All models are based on the Caffe framework and trained on a single NVIDIA GeForce GTX 1080 Ti GPU.
The feature extraction network of the YOMO model was pre-trained on ImageNet and fine-tuned for 200K iterations. Other training parameter settings are shown in Table 2. The anchor with the largest IoU with a real box is a positive example, and anchors with IoU < 0.3 are treated as background. Considering the detection rate and the face scale range, each detection module contains 3 anchors, whose values are obtained by clustering on the training set. The loss weights are λ_noobj = 1, λ_prior = 1, λ_coord = 1, λ_obj = 5 and λ_class = 1. The NMS threshold of each detection module is set to 0.7 for training and 0.45 for testing. The training pictures of all models in the invention are scaled to 544 × 544 resolution.
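The IoU-based anchor assignment described above can be sketched as follows (boxes given as (x1, y1, x2, y2) corners; the helper names and the corner representation are illustrative, while the 0.3 background threshold comes from the text):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, bg_thresh=0.3):
    """Positive = anchor with the highest IoU for some ground-truth box;
    background = anchors whose best IoU stays below bg_thresh."""
    positive = {max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
                for gt in gt_boxes}
    background = {i for i in range(len(anchors)) if i not in positive
                  and max(iou(anchors[i], g) for g in gt_boxes) < bg_thresh}
    return positive, background
```

Anchors that are neither positive nor background (best IoU in [0.3, max)) are simply ignored by the box-regression loss, matching the one-anchor-per-real-box rule in the text.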
WIDER FACE is a face detection benchmark dataset whose pictures, collected from the Internet, have complex backgrounds. The dataset comprises 32203 pictures with 393703 labeled faces exhibiting large variations in scale, pose and occlusion. It is divided into 61 event classes, and within each class the data are split into training, validation and test sets at a ratio of 40%, 10% and 50%. All models in the invention are trained on the training set.
The pictures of the FDDB dataset were collected from the Faces in the Wild dataset; FDDB comprises 2845 pictures with 5171 faces. It presents difficulties including occlusion, hard poses, low resolution and defocus, and contains both black-and-white and color pictures. Unlike other face detection datasets, the labeled regions are ellipses rather than rectangles. All models in the invention are tested on the FDDB dataset.
When testing on the FDDB dataset, all pictures are scaled with their aspect ratio preserved and embedded in a black 544 × 544 background so that they are not deformed. As shown in FIGS. 2(a) and 2(b), the results of YOMO are compared with the MTCNN, ScaleFace, HR-ER, ICC-CNN and FANet models under DiscROC and ContROC, respectively.
In FIG. 2, YOMO-Fit denotes the detection result of YOMO after the ellipse regressor. According to the FDDB evaluation, with the number of false positives fixed at 1000, the recall rates of YOMO-Fit under the DiscROC and ContROC criteria are 97.7% and 83.6% respectively, second only to FANet. HR-ER, even using FDDB as 10-fold cross-validation training data, has the same DiscROC recall as YOMO and a ContROC recall 4.9% lower than YOMO-Fit. Notably, the ellipse regressor raises YOMO's recall under DiscROC and ContROC by 0.1% and 8.6%, respectively.
Fig. 3(a) and (b) are visualizations of single-picture test results on the WIDER FACE and FDDB datasets, respectively. The rectangular boxes in FIG. 3(a) are prediction boxes of the YOMO model. The rectangles and ellipses in FIG. 3(b) are YOMO's prediction boxes and the real frames, respectively.
Claims (6)
1. A single-step face detection system, characterized by comprising a conventional convolution module conv0, a depth separable convolution module conv1, a depth separable convolution module conv2, a depth separable convolution module conv3, a depth separable convolution module conv4, a depth separable convolution module conv5, a depth separable convolution module conv6, a depth separable convolution module conv7, a depth separable convolution module conv8, a depth separable convolution module conv9, a depth separable convolution module conv10, a depth separable convolution module conv11, a depth separable convolution module conv12, a depth separable convolution module conv13, a depth separable convolution module conv14, a deconvolution layer conv15, a depth separable convolution module conv16, a depth separable convolution module conv17, a deconvolution layer conv18, a depth separable convolution module conv19 and a depth separable convolution module conv20, connected in sequence from left to right;
the output terminal of the depth separable convolution module conv14 is connected to the detection module det-32, the output terminal of the depth separable convolution module conv17 is connected to the detection module det-16, and the output terminal of the depth separable convolution module conv20 is connected to the detection module det-8;
the output terminal of the depth separable convolution module conv11 is connected to the input terminal of the depth separable convolution module conv16, the output features of the depth separable convolution module conv16 are fused with those of the deconvolution layer conv15 and fed to the input terminal of the depth separable convolution module conv17, the output terminal of the depth separable convolution module conv5 is connected to the input terminal of the depth separable convolution module conv19, and the output features of the depth separable convolution module conv19 are fused with those of the deconvolution layer conv18 and fed to the input terminal of the depth separable convolution module conv20;
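The top-down feature-fusion topology described above can be sketched as follows. This is our illustrative reading of the claim: module internals are abstracted into callables, and "fusion" is modelled as element-wise addition (an assumption — the claim does not specify the fusion operation):

```python
def forward(x, conv, det):
    """Trace the YOMO connection topology of the claim.

    conv: dict mapping layer name -> callable (conv0..conv20);
    det:  dict mapping head name -> callable (det-32, det-16, det-8).
    """
    f, feats = x, {}
    for i in range(15):                       # backbone: conv0 .. conv14
        f = conv[f"conv{i}"](f)
        feats[f"conv{i}"] = f
    p32 = det["det-32"](feats["conv14"])      # large-scale faces
    up16 = conv["conv15"](feats["conv14"])    # deconvolution (upsample)
    lat16 = conv["conv16"](feats["conv11"])   # lateral branch from conv11
    f16 = conv["conv17"](up16 + lat16)        # fuse, then conv17
    p16 = det["det-16"](f16)                  # medium-scale faces
    up8 = conv["conv18"](f16)                 # deconvolution (upsample)
    lat8 = conv["conv19"](feats["conv5"])     # lateral branch from conv5
    f8 = conv["conv20"](up8 + lat8)           # fuse, then conv20
    p8 = det["det-8"](f8)                     # small-scale faces
    return p32, p16, p8
```

With identity placeholders for every module, the sketch simply counts how many fusions each detection path passes through, which makes the wiring easy to verify.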
the input picture of the conventional convolution module conv0 is cropped for training with a crop box SelectCrop_bbox selected by a semi-soft random cropping algorithm, which comprises the following steps:
S1, generating a plurality of crop boxes Crop_bboxes with aspect ratio 1 by a random cropping algorithm, cropping the original image with them to obtain cropped images, scaling each cropped image to the input size required by the network, scaling the valid ground-truth boxes inside each crop box by the same ratio, and counting the number of faces at every scale according to the face scale ranges, the statistical formula being:

Num_ic = Σ_{k=1}^{K} 1(MinScale_c ≤ bbox_k ≤ MaxScale_c)

in the above formula, Num_ic is the number of faces of scale class c in the i-th crop box, N is the number of face scale classes, taken as 3 (small-scale, medium-scale and large-scale faces), M is the total number of crop boxes, 1(·) is the indicator function whose value is 1 when the condition is true and 0 otherwise, MinScale_c and MaxScale_c are the lower and upper boundaries of scale class c, bbox_k is the side length of the k-th valid ground-truth box, and K is the total number of valid ground-truth boxes;
S2, for each crop box, sorting the face scale classes in descending order of face count:
S_i1 ≥ S_i2 ≥ … ≥ S_iN
in the above formula, i is the crop box index, and S_ic is one of the N face scale classes in the i-th crop box;
S3, counting the number of faces of each scale seen so far during actual network training, and sorting the face scale classes in ascending order of count:
A_1 ≤ A_2 ≤ … ≤ A_N
in the above formula, A_c is one of the N face scale classes;
S4, among the scale-class orderings of the M crop boxes Crop_bboxes, searching for crop boxes satisfying S_ic = A_c, and randomly selecting one crop box that meets the condition as SelectCrop_bbox;
S5, when no crop box satisfying step S4 is found, searching among the scale-class orderings of the M crop boxes Crop_bboxes for crop boxes satisfying S_i1 = A_1 and S_iN = A_N, and randomly selecting one crop box that meets the condition as SelectCrop_bbox;
S6, when no crop box satisfying step S5 is found, randomly selecting one of the crop boxes Crop_bboxes as SelectCrop_bbox;
S7, counting the number Num_sc of faces of each scale class in SelectCrop_bbox and adding it to the running counts used in actual training, namely:

A_c ← A_c + Num_sc, c = 1, …, N

in the above formula, A_c is the number of faces of each scale class accumulated from previous training, and s is the index of the selected crop box;
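The selection procedure of steps S1–S7 can be sketched as follows. This is our illustrative reading (the function name, the list-based bookkeeping, and summarizing each crop by its per-class face counts are assumptions, not part of the claim):

```python
import random

def select_crop(crop_counts, totals):
    """Semi-soft crop selection over steps S2-S7.

    crop_counts: per-crop face counts, one [n_small, n_medium, n_large]
                 list per candidate crop box (output of step S1).
    totals:      running per-class counts from training; updated in place.
    Returns the index of the selected crop box.
    """
    n = len(totals)
    # S2: descending class order per crop; S3: ascending global class order.
    per_crop = [sorted(range(n), key=lambda c: -cnt[c]) for cnt in crop_counts]
    global_order = sorted(range(n), key=lambda c: totals[c])
    # S4: crops whose full ordering matches; S5: match only at head and tail.
    exact = [i for i, o in enumerate(per_crop) if o == global_order]
    partial = [i for i, o in enumerate(per_crop)
               if o[0] == global_order[0] and o[-1] == global_order[-1]]
    # S6: fall back to a fully random choice if neither condition holds.
    pool = exact or partial or list(range(len(crop_counts)))
    s = random.choice(pool)
    for c in range(n):                        # S7: update running totals
        totals[c] += crop_counts[s][c]
    return s
```

The effect is that crops rich in whichever face scales are currently under-represented in training are preferred, so each detection module keeps receiving samples.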
the detection module det-32 is used for large-scale face detection, the detection module det-16 is used for medium-scale face detection, and the detection module det-8 is used for small-scale face detection.
2. The single-step face detection system of claim 1, wherein the conventional convolution module conv0 comprises a 3 × 3 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer connected in sequence from top to bottom.
3. The single-step face detection system of claim 1, wherein the depth separable convolution module conv1, depth separable convolution module conv2, depth separable convolution module conv3, depth separable convolution module conv4, depth separable convolution module conv5, depth separable convolution module conv6, depth separable convolution module conv7, depth separable convolution module conv8, depth separable convolution module conv9, depth separable convolution module conv10, depth separable convolution module conv11, depth separable convolution module conv12, depth separable convolution module conv13, depth separable convolution module conv14, depth separable convolution module conv16, depth separable convolution module conv17, depth separable convolution module conv19 and depth separable convolution module conv20 have the same structure, each comprising, connected in sequence from top to bottom, a 3 × 3 depthwise convolutional layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 pointwise convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.
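A depthwise separable block of this form (3 × 3 depthwise conv → 1 × 1 pointwise conv) replaces one standard 3 × 3 convolution at a fraction of the parameters, which is what keeps the YOMO model small. A quick weight-only count illustrates the saving (our sketch, not part of the patent; biases and BatchNorm parameters are omitted):

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a standard k x k convolution: one k x k x c_in filter
    per output channel."""
    return k * k * c_in * c_out

def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a depthwise separable replacement."""
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1 x 1 conv mixing channels
    return depthwise + pointwise
```

At 512 input and output channels the standard convolution needs 2,359,296 weights versus 266,752 for the separable block, roughly an 8.8× reduction.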
4. The single-step face detection system of claim 1, wherein the number of output channels of the depth separable convolution module conv14, the depth separable convolution module conv17 and the depth separable convolution module conv20 are all 1024.
5. The single-step face detection system of claim 1, wherein said detection module det-32, detection module det-16 and detection module det-8 each comprise a conventional convolutional layer and an output layer;
the number of output channels of the conventional convolutional layer is 18;
the center coordinates and side lengths of the prediction box of the output layer are computed as:

b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y

b_w = p_w · e^{t_w}, b_h = p_h · e^{t_h}

in the above formulas, (b_x, b_y) are the center coordinates of the prediction box, b_w and b_h are the width and height of the prediction box, t_x and t_y are the offsets of the abscissa and ordinate of the prediction box center, t_w and t_h are the width and height offsets, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell where the anchor is located, σ(·) is the sigmoid function, and p_w and p_h are the width and height of the anchor, respectively.
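The decode step can be sketched as follows. The b_x/b_y equations are given in the text; for b_w/b_h we assume the standard YOLO-style exponential form with anchor priors p_w, p_h, which matches the variables the text defines:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network offsets to a prediction box (center, width, height).

    (cx, cy): upper-left corner of the grid cell holding the anchor;
    (pw, ph): anchor width and height.
    """
    bx = sigmoid(tx) + cx          # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # exponential scaling of the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With zero offsets the box sits at the center of its grid cell with exactly the anchor's dimensions, which is a convenient sanity check.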
6. The single-step face detection system of claim 5, wherein the output ends of the detection module det-32, the detection module det-16 and the detection module det-8 are all connected to an elliptical regression device, the elliptical regression device converts the output layer prediction frame into an elliptical real frame, and the calculation formula of the elliptical real frame is as follows:
Y=XW+ε
in the above formula, Y is the coordinate vector of the ellipse ground-truth box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the center abscissa c_x and ordinate c_y; X is the coordinate vector of the output-layer prediction box, comprising the prediction box center coordinates b_x and b_y and the prediction box width b_w and height b_h; W is the regression coefficient matrix; and ε is a random error;
the regression coefficient matrix W is computed as:

W = argmin_W J(X′W, Y′)

in the above formula, J(·) denotes the mean squared error function, X′ is the normalized coordinate vector of the prediction box and Y′ is the normalized coordinate vector of the ground-truth box, with

X′ = (X − U_X) / σ_X, Y′ = (Y − U_Y) / σ_Y

in the above formulas, U_X and σ_X are respectively the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are respectively the mean and standard deviation of the ground-truth box coordinate vector Y.
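The fit of W can be illustrated in one dimension: regress one normalized ellipse coordinate (say r_a) on one normalized box coordinate (say b_w) by least squares, mirroring W = argmin_W J(X′W, Y′). The 1-D restriction, the function names, and the use of population standard deviation are our assumptions:

```python
from statistics import mean, pstdev

def normalize(v):
    """Return (normalized values, mean, std), i.e. v' = (v - U_v) / sigma_v."""
    m, s = mean(v), pstdev(v)
    return [(x - m) / s for x in v], m, s

def fit_1d(x, y):
    """Least-squares slope on normalized data: minimizes sum((x'w - y')^2).

    Closed form for one coefficient: w = sum(x'y') / sum(x'^2).
    """
    xn, _, _ = normalize(x)
    yn, _, _ = normalize(y)
    return sum(a * b for a, b in zip(xn, yn)) / sum(a * a for a in xn)
```

For perfectly correlated data the normalized slope is 1, regardless of the original scales, which is the point of normalizing before the regression.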
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910550738.8A CN110263731B (en) | 2019-06-24 | 2019-06-24 | Single step human face detection system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263731A CN110263731A (en) | 2019-09-20 |
CN110263731B true CN110263731B (en) | 2021-03-16 |
Family
ID=67920979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910550738.8A Active CN110263731B (en) | 2019-06-24 | 2019-06-24 | Single step human face detection system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263731B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807385B (en) * | 2019-10-24 | 2024-01-12 | 腾讯科技(深圳)有限公司 | Target detection method, target detection device, electronic equipment and storage medium |
CN111401292B (en) * | 2020-03-25 | 2023-05-26 | 成都东方天呈智能科技有限公司 | Face recognition network construction method integrating infrared image training |
CN111489332B (en) * | 2020-03-31 | 2023-03-17 | 成都数之联科技股份有限公司 | Multi-scale IOF random cutting data enhancement method for target detection |
CN112699826A (en) * | 2021-01-05 | 2021-04-23 | 风变科技(深圳)有限公司 | Face detection method and device, computer equipment and storage medium |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866833A (en) * | 2015-05-29 | 2015-08-26 | 中国科学院上海高等研究院 | Video stream face detection method and apparatus thereof |
US9392257B2 (en) * | 2011-11-28 | 2016-07-12 | Sony Corporation | Image processing device and method, recording medium, and program |
CN106709568A (en) * | 2016-12-16 | 2017-05-24 | 北京工业大学 | RGB-D image object detection and semantic segmentation method based on deep convolution network |
CN108564030A (en) * | 2018-04-12 | 2018-09-21 | 广州飒特红外股份有限公司 | Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection |
CN108647649A (en) * | 2018-05-14 | 2018-10-12 | 中国科学技术大学 | The detection method of abnormal behaviour in a kind of video |
WO2018213841A1 (en) * | 2017-05-19 | 2018-11-22 | Google Llc | Multi-task multi-modal machine learning model |
CN109272487A (en) * | 2018-08-16 | 2019-01-25 | 北京此时此地信息科技有限公司 | The quantity statistics method of crowd in a kind of public domain based on video |
CN109284670A (en) * | 2018-08-01 | 2019-01-29 | 清华大学 | A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism |
CN109598290A (en) * | 2018-11-22 | 2019-04-09 | 上海交通大学 | A kind of image small target detecting method combined based on hierarchical detection |
WO2019079895A1 (en) * | 2017-10-24 | 2019-05-02 | Modiface Inc. | System and method for image processing using deep neural networks |
CN109815886A (en) * | 2019-01-21 | 2019-05-28 | 南京邮电大学 | A kind of pedestrian and vehicle checking method and system based on improvement YOLOv3 |
CN109919097A (en) * | 2019-03-08 | 2019-06-21 | 中国科学院自动化研究所 | Face and key point combined detection system, method based on multi-task learning |
CN109919308A (en) * | 2017-12-13 | 2019-06-21 | 腾讯科技(深圳)有限公司 | A kind of neural network model dispositions method, prediction technique and relevant device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599797B (en) * | 2016-11-24 | 2019-06-07 | 北京航空航天大学 | A kind of infrared face recognition method based on local parallel neural network |
CN108182397B (en) * | 2017-12-26 | 2021-04-20 | 王华锋 | Multi-pose multi-scale human face verification method |
CN108664893B (en) * | 2018-04-03 | 2022-04-29 | 福建海景科技开发有限公司 | Face detection method and storage medium |
CN109101899B (en) * | 2018-07-23 | 2020-11-24 | 苏州飞搜科技有限公司 | Face detection method and system based on convolutional neural network |
CN109753927A (en) * | 2019-01-02 | 2019-05-14 | 腾讯科技(深圳)有限公司 | A kind of method for detecting human face and device |
CN109711384A (en) * | 2019-01-09 | 2019-05-03 | 江苏星云网格信息技术有限公司 | A kind of face identification method based on depth convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
Learning Transferable Architectures for Scalable Image Recognition; Barret Zoph et al.; 《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》; 20181231; full text *
Research and Implementation of Face Detection Based on the Adaboost Algorithm; Lin Peng; 《China Masters' Theses Full-text Database, Information Science and Technology》; 20070815 (No. 2); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110263731A (en) | 2019-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263731B (en) | Single step human face detection system | |
CN106874894B (en) | Human body target detection method based on regional full convolution neural network | |
CN108764085B (en) | Crowd counting method based on generation of confrontation network | |
CN105095856B (en) | Face identification method is blocked based on mask | |
CN107633226B (en) | Human body motion tracking feature processing method | |
CN111079739B (en) | Multi-scale attention feature detection method | |
CN108960404B (en) | Image-based crowd counting method and device | |
CN106778687A (en) | Method for viewing points detecting based on local evaluation and global optimization | |
CN110543906B (en) | Automatic skin recognition method based on Mask R-CNN model | |
CN112949572A (en) | Slim-YOLOv 3-based mask wearing condition detection method | |
CN109858547A (en) | A kind of object detection method and device based on BSSD | |
CN104732248B (en) | Human body target detection method based on Omega shape facilities | |
Li et al. | A complex junction recognition method based on GoogLeNet model | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
Lu et al. | An improved target detection method based on multiscale features fusion | |
Zhong et al. | Improved localization accuracy by locnet for faster r-cnn based text detection | |
CN113808166B (en) | Single-target tracking method based on clustering difference and depth twin convolutional neural network | |
CN111553337A (en) | Hyperspectral multi-target detection method based on improved anchor frame | |
CN111444816A (en) | Multi-scale dense pedestrian detection method based on fast RCNN | |
CN109284752A (en) | A kind of rapid detection method of vehicle | |
CN111339950B (en) | Remote sensing image target detection method | |
CN109657577B (en) | Animal detection method based on entropy and motion offset | |
CN116092179A (en) | Improved Yolox fall detection system | |
Zhu et al. | Real-time traffic sign detection based on YOLOv2 | |
CN112347967B (en) | Pedestrian detection method fusing motion information in complex scene |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||