CN110263731A

CN110263731A - A single-step face detection system

Info

Publication number: CN110263731A
Application number: CN201910550738.8A
Authority: CN
Inventors: 徐杰; 田野; 罗堡文; 廖静茹
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2019-09-20
Anticipated expiration: 2039-06-24
Also published as: CN110263731B

Abstract

The invention discloses a single-step face detection system. It proposes YOMO, a real-time face detection network built from depthwise separable convolutions that contains multiple top-down feature fusion structures; each detection module is responsible only for detecting faces within its corresponding scale range. By using a random cropping strategy that better suits the multi-scale detection structure, the invention lets each detection module train on a more sufficient number of samples. The proposed ellipse regressor substantially improves detection recall under the ContROC evaluation criterion. While remaining competitive in detection accuracy, the proposed YOMO model detects pictures of 544 × 544 resolution at 51 FPS, and the model occupies only 21M of memory.

Description

A single-step face detection system

Technical field

The present invention relates to the field of face detection technology, and in particular to a single-step face detection system.

Background art

Face detection is a key component of people-centred smart cities and underlies technologies such as identity recognition, personalized services, pedestrian detection and tracking, and crowd counting. Although it has been studied extensively, face detection in unconstrained scenes remains an open research problem because of the many challenges involved.

Early face detection work focused mainly on hand-designing effective features and building efficient classifiers on top of them. This usually yields suboptimal detection models, and detection accuracy can drop considerably when the application scene changes. In recent years deep learning has been applied to face detection with notable success, but producing a real-time face detection model that works in unconstrained scenes with high accuracy remains a considerable challenge.

Faster R-CNN replaces the sliding window with a region proposal algorithm and integrates candidate box generation, feature extraction, bounding-box regression and classification into a single network; it is the fastest and most accurate model in the R-CNN family. However, because the proposal network generates many face candidate boxes and the complex network structure incurs a large computational cost, it cannot detect faces in real time.

Another class of face detection methods, such as YOLO, converts the detection problem into a regression problem. These methods contain no proposal network and regress face boxes directly on the feature map of the feature extraction network, so they detect faster, but their accuracy leaves room for improvement. To improve accuracy, SSD uses multi-scale feature maps from different layers to jointly predict box classes and positions. Multi-layer feature prediction helps detect faces of different scales, but no stage is trained specifically to handle faces in a particular scale range. In other words, during training, faces of all scales produce a loss in every detection module. In contrast, each detection module of YOMO is trained only on faces in its appropriate scale range.

To address the small-scale face detection problem of single-step detection methods, HR uses an image pyramid and trains multiple separate single-scale detectors, each responsible for faces of a particular scale. At test time, however, the picture must be rescaled to multiple sizes and each size passed through a very deep network, so this multi-step single-scale approach is computationally very expensive.

Single-step multi-scale detectors such as S3FD detect faces using the multi-scale features of a deep convolutional network and need only a single forward pass of the picture at test time. S3FD nevertheless shares a problem with SSD: the feature map of each scale is used for prediction on its own, and when small-scale faces are predicted from the lower layers of the network, the lack of semantic features leaves S3FD's detection of small-scale faces unsatisfactory.

Summary of the invention

In view of the above shortcomings of the prior art, the single-step face detection system provided by the invention solves the problem of unsatisfactory detection performance in face detection systems.

To achieve the above object, the technical solution adopted by the invention is a single-step face detection system comprising, connected in sequence from left to right: conventional convolution module conv0; depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13 and conv14; deconvolution layer conv15; depthwise separable convolution modules conv16 and conv17; deconvolution layer conv18; and depthwise separable convolution modules conv19 and conv20;

The output of depthwise separable convolution module conv14 is connected to detection module det-32, the output of depthwise separable convolution module conv17 is connected to detection module det-16, and the output of depthwise separable convolution module conv20 is connected to detection module det-8;

The output of depthwise separable convolution module conv11 is connected to the input of depthwise separable convolution module conv16; the output of conv16 is feature-fused with the output of deconvolution layer conv15 and connected to the input of depthwise separable convolution module conv17; the output of depthwise separable convolution module conv5 is connected to the input of depthwise separable convolution module conv19; and the output of conv19 is feature-fused with the output of deconvolution layer conv18 and connected to the input of depthwise separable convolution module conv20.
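The wiring described above can be written down as a small edge list, which makes the two top-down fusion points easy to check. This is only an illustrative sketch: the names `EDGES` and `inputs_of` are hypothetical, and it assumes (from the listing order, not from an explicit statement) that deconvolution layer conv15 takes the output of conv14 and deconvolution layer conv18 takes the output of conv17. The fusion operation itself (for example concatenation or addition) is not specified in the text and is not modelled here.

```python
# Hypothetical edge list for the modules named in the text.
BACKBONE = ["conv%d" % i for i in range(15)]  # conv0 .. conv14, chained in order

EDGES = [(BACKBONE[i], BACKBONE[i + 1]) for i in range(14)]
EDGES += [
    ("conv14", "det-32"),   # large-scale detection branch
    ("conv14", "conv15"),   # assumption: the deconvolution layer upsamples conv14
    ("conv11", "conv16"),   # lateral connection stated in the text
    ("conv15", "conv17"),   # fused with conv16 at the input of conv17
    ("conv16", "conv17"),
    ("conv17", "det-16"),   # medium-scale detection branch
    ("conv17", "conv18"),   # assumption: the deconvolution layer upsamples conv17
    ("conv5", "conv19"),    # lateral connection stated in the text
    ("conv18", "conv20"),   # fused with conv19 at the input of conv20
    ("conv19", "conv20"),
    ("conv20", "det-8"),    # small-scale detection branch
]


def inputs_of(node):
    """Return the sorted names of all modules feeding the given module."""
    return sorted(src for src, dst in EDGES if dst == node)
```

For example, `inputs_of("conv17")` lists the two branches that are fused before conv17.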

Further: the conventional convolution module conv0 comprises, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.

Further: the input picture of the conventional convolution module conv0 is cropped with the crop box SelectCrop_bbox selected by the medium-soft random cropping algorithm and then used for training, the specific steps being:

S1. Generate several crop boxes Sampled_bboxes with an aspect ratio of 1 using the random cropping algorithm; crop the original picture to obtain cropped pictures; scale each cropped picture to the input size required by the network, scaling the valid ground-truth boxes inside the crop box by the same ratio; then count the number of faces of each scale according to the face scale ranges, using the following counting formula:

In the above formula, Num_ic is the number of faces of the c-th scale class in the i-th crop box; N is the number of face scale classes, N=3, namely small-scale, medium-scale and large-scale faces; M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of the c-th face scale class; bbox_k is the side length of the crop box; and K is the total number of crop boxes generated;

S2. For each crop box, sort the face scale classes in descending order of their face counts:

S_i1≥S_i2≥…≥S_iN

In the above formula, i is the index of the crop box and S_ic is one of the N face scale classes in the i-th crop box;

S3. Count the number of faces of each scale class seen so far in actual network training, and sort the face scale classes in ascending order of these counts:

A₁≤A₂≤…≤A_N

In the above formula, A_c is one of the N face scale classes;

S4. Among the M face scale class orderings of the crop boxes Sampled_bboxes, look for crop boxes satisfying S_ic=A_c, and randomly select one crop box that meets the condition as SelectCrop_bbox;

S5. When no crop box satisfying step S4 is found, look among the M face scale class orderings of the crop boxes Sampled_bboxes for crop boxes satisfying S_i1=A₁ and S_iN=A_N, and randomly select one crop box that meets the condition as SelectCrop_bbox;

S6. When no crop box satisfying step S5 is found, randomly select one crop box from Sampled_bboxes as SelectCrop_bbox;

S7. Add the face counts Num_sc of each scale class in SelectCrop_bbox to the counts of each face scale class in actual training, namely:

In the above formula, the count on the right-hand side is the number of faces of each scale class obtained in the preceding training step, and s denotes the index of the selected crop box.
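Steps S1 to S7 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the function names are hypothetical, the scale ranges are taken from Table 1 of the description, the condition S_ic=A_c is read as "the crop box's descending class order equals the training set's ascending class order", ties are broken by class index, and the update in S7 is assumed to be a plain per-class addition.

```python
import random

# (MinScale, MaxScale) for the small, medium and large classes (Table 1).
SCALE_RANGES = [(10, 39), (40, 99), (100, 350)]


def count_scales(face_sizes):
    """S1: count the faces of one crop box per scale class (sizes after scaling)."""
    return [sum(lo <= s <= hi for s in face_sizes) for lo, hi in SCALE_RANGES]


def descending_classes(counts):
    """S2: class indices ordered from most to fewest faces in a crop box."""
    return sorted(range(len(counts)), key=lambda c: -counts[c])


def ascending_classes(seen):
    """S3: class indices ordered from fewest to most faces seen in training."""
    return sorted(range(len(seen)), key=lambda c: seen[c])


def select_crop(crop_counts, seen, rng=random):
    """S4 to S6: return the index of the crop box chosen as SelectCrop_bbox."""
    target = ascending_classes(seen)
    orders = [descending_classes(c) for c in crop_counts]
    full = [i for i, o in enumerate(orders) if o == target]        # S4
    ends = [i for i, o in enumerate(orders)
            if o[0] == target[0] and o[-1] == target[-1]]          # S5
    pool = full or ends or list(range(len(crop_counts)))           # S6 fallback
    return rng.choice(pool)


def update_seen(seen, counts):
    """S7 (assumed form): add the selected crop box's counts to the totals."""
    return [a + n for a, n in zip(seen, counts)]
```

With running totals `[50, 30, 5]` (large faces are rarest), a crop box whose counts are `[1, 2, 3]` is preferred over one whose counts are `[2, 1, 0]`, because it supplies most of the class that training has seen least.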

Further: depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13, conv14, conv16, conv17, conv19 and conv20 have the same structure, each comprising, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.
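To illustrate why such a module is cheap, the sketch below compares weight counts for a standard 3 × 3 convolution and a depthwise separable one. It assumes, as the module name suggests, that the 3 × 3 layer is depthwise (one filter per input channel) and the 1 × 1 layer is pointwise; biases and BatchNorm parameters are ignored, and the 512 input channels in the usage note are a hypothetical figure, not one given in the text.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of an ordinary k x k convolution."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights of a depthwise k x k layer followed by a pointwise 1 x 1 layer."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise
```

For a hypothetical layer with 512 input channels and the 1024 output channels mentioned for conv14, the separable form needs 528,896 weights against 4,718,592 for the standard form, roughly a ninefold reduction.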

Further: the number of output channels of depthwise separable convolution modules conv14, conv17 and conv20 is 1024.

Further: detection module det-32 is used for large-scale face detection, detection module det-16 for medium-scale face detection, and detection module det-8 for small-scale face detection.

Further: detection modules det-32, det-16 and det-8 each comprise a conventional convolutional layer and an output layer;

The number of output channels of the conventional convolutional layer is 18;

The centre coordinates and side lengths of the prediction box of the output layer are computed as:

b_x=σ (t_x)+C_x,b_y=σ (t_y)+C_y

In the above formula, (b_x, b_y) are the centre coordinates of the prediction box; b_w and b_h are the width and height of the prediction box; t_x and t_y are the offsets of the abscissa and ordinate of the prediction box centre; (C_x, C_y) are the coordinates of the top-left corner of the grid cell containing the anchor; σ(·) is the sigmoid function; and p_w and p_h are the width and height of the anchor.
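A minimal decoding sketch follows. The centre formulas are those given above. The width and height formulas do not appear in this text even though b_w, b_h, p_w and p_h are defined, so the sketch assumes the exponential form used by YOLOv2-style detectors, with hypothetical offsets `t_w` and `t_h`; treat that part as an assumption.

```python
import math


def sigmoid(t):
    """Logistic function, written sigma(.) in the text."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network offsets into a box in grid-cell units."""
    b_x = sigmoid(t_x) + c_x     # from the text
    b_y = sigmoid(t_y) + c_y     # from the text
    b_w = p_w * math.exp(t_w)    # assumption: YOLOv2-style width formula
    b_h = p_h * math.exp(t_h)    # assumption: YOLOv2-style height formula
    return b_x, b_y, b_w, b_h
```

With all offsets at zero, the box sits at the centre of its grid cell and keeps the anchor's size.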

Further: the outputs of detection modules det-32, det-16 and det-8 are each connected to an ellipse regressor, which converts the prediction box of the output layer into an elliptical ground-truth box; the elliptical ground-truth box is computed as:

Y=XW+ ε

In the above formula, Y is the coordinate vector of the elliptical ground-truth box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the centre abscissa c_x and ordinate c_y; X is the coordinate vector of the prediction box of the output layer, comprising the centre coordinates b_x and b_y, the width b_w and the height b_h of the prediction box; W is the regression coefficient matrix; and ε is the random error;

Wherein the regression coefficient matrix W is computed as:

In the above formula, J(·) denotes the mean-square-error function, X′ is the normalized coordinate vector of the prediction box, and Y′ is the normalized coordinate vector of the ground-truth box;

In the above formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the ground-truth box coordinate vector Y.
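The formulas for W and for the normalization are not reproduced in this text. Based only on the symbol definitions given (a mean-square-error function J, normalized vectors X′ and Y′, and per-vector means and standard deviations), one plausible reading is an ordinary least-squares fit on standardized coordinates. The block below is a reconstruction under that assumption, not the wording of the original formulas:

```latex
W = \arg\min_{W} J\left(X'W,\; Y'\right), \qquad
X' = \frac{X - U_X}{\sigma_X}, \qquad
Y' = \frac{Y - U_Y}{\sigma_Y}
```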

The beneficial effects of the invention are:

1. The invention proposes YOMO, a real-time face detection network built from depthwise separable convolutions that contains multiple top-down feature fusion structures; each detection module is responsible only for detecting faces within its corresponding scale range.

2. The invention uses a random cropping strategy that better suits the multi-scale detection structure, so that each detection module is trained on a more sufficient number of samples.

3. The ellipse regressor proposed by the invention substantially improves detection recall under the ContROC evaluation criterion.

4. While remaining competitive in detection accuracy, the YOMO model proposed by the invention detects pictures of 544 × 544 resolution at 51 FPS.

Brief description of the drawings

Fig. 1 is a structural diagram of the invention;

Fig. 2 shows the evaluation results of the invention on the FDDB data set;

Fig. 3 shows visualization results of the invention on the WIDER FACE and FDDB data sets.

Specific embodiment

Specific embodiments of the invention are described below to help those skilled in the art understand the invention. It should be clear, however, that the invention is not limited to the scope of the specific embodiments. To those of ordinary skill in the art, various changes are obvious so long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

As shown in Fig. 1, a single-step face detection system comprises, connected in sequence from left to right: conventional convolution module conv0; depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13 and conv14; deconvolution layer conv15; depthwise separable convolution modules conv16 and conv17; deconvolution layer conv18; and depthwise separable convolution modules conv19 and conv20;

Relative to the original picture, the output feature maps of conv14, conv17 and conv20 have downsampling strides of 32, 16 and 8, respectively. Detection module det-32 is used for large-scale face detection, detection module det-16 for medium-scale face detection, and detection module det-8 for small-scale face detection; the face scale range each detection module is responsible for is shown in Table 1.

Table 1. Face scale ranges handled by each detection module

Scale class	det-8 (small-scale faces)	det-16 (medium-scale faces)	det-32 (large-scale faces)
Minimum MinScale	10	40	100
Maximum MaxScale	39	99	350
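The strides and Table 1 together determine the grid size of each detection module and which module handles a given face. The sketch below is illustrative only: the names `MODULES`, `grid_size` and `responsible_module` are hypothetical, the 544 input size is the one used in the experiments, and face size is assumed to be expressed in the same unit as Table 1.

```python
INPUT_SIZE = 544

# name: (stride, MinScale, MaxScale), taken from the text and Table 1
MODULES = {
    "det-8": (8, 10, 39),
    "det-16": (16, 40, 99),
    "det-32": (32, 100, 350),
}


def grid_size(name):
    """Side length of the feature map seen by a detection module."""
    return INPUT_SIZE // MODULES[name][0]


def responsible_module(face_size):
    """Return the module whose scale range contains the face, if any."""
    for name, (_, lo, hi) in MODULES.items():
        if lo <= face_size <= hi:
            return name
    return None  # outside every range listed in Table 1
```

For a 544 × 544 input this gives grids of 68, 34 and 17 cells per side for det-8, det-16 and det-32.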

The invention trains the network with the RMSProp gradient optimization method, using the parameters in Table 2. The 3 detection modules are placed on layers of different strides to strengthen the multi-scale detection ability of the model. During training, the loss function of each detection module is a multi-task loss comprising 5 parts. To make each detection module responsible only for faces in its corresponding scale range, during gradient back-propagation the detection branch containing the anchor with the highest IoU with the ground-truth box is found, and only that anchor produces a bounding-box regression loss. To make training more effective, each ground-truth box is matched to the one anchor with which it has the highest IoU.

Table 2. Training configuration parameters

base_lr	step_value	gamma	batch_size	iter_size	type	weight_decay	max_iter
0.001	40000	0.1	9	3	RMSProp	0.00005	200000
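As a reading aid for Table 2, the sketch below shows what these values imply under two assumptions that the table does not state: that the learning rate drops once by the factor gamma when step_value is reached, and that iter_size accumulates gradients over several mini-batches before each update, as it does in Caffe solvers. The function names are hypothetical.

```python
BASE_LR, STEP_VALUE, GAMMA = 0.001, 40000, 0.1
BATCH_SIZE, ITER_SIZE, MAX_ITER = 9, 3, 200000


def learning_rate(iteration):
    """Assumed schedule: a single drop by GAMMA once STEP_VALUE is reached."""
    return BASE_LR * (GAMMA if iteration >= STEP_VALUE else 1.0)


def effective_batch_size():
    """Pictures contributing to one weight update under gradient accumulation."""
    return BATCH_SIZE * ITER_SIZE
```

Under these assumptions each update uses 27 pictures, and the learning rate falls from 0.001 to about 0.0001 at iteration 40000.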

The multi-task loss function of YOMO comprises 5 parts: the non-target loss, the anchor pre-training loss, the target localization loss, the target confidence loss and the target class loss, as shown in formula (3).

Here W and H are the width and height of the feature map, A is the number of anchors, and t is the iteration count. 1(x) is the indicator function, equal to 1 when x is true and 0 otherwise. λ_noobj, λ_prior, λ_coord, λ_obj and λ_class are the weights of the sub-tasks: the non-target loss weight, the anchor pre-training loss weight, the coordinate loss weight, the target loss weight and the class loss weight, respectively. b^r denotes the 4 coordinate offsets predicted by the network and prior^r denotes the 4 coordinates of the anchor, namely the abscissa x and ordinate y of the box centre, the box width w and the box height h. When the IoU between a prediction box and every ground-truth box is less than or equal to the threshold Thresh, the region of the input picture corresponding to that prediction box is non-target, i.e. background, and its predicted confidence is b^o. To let the network adapt to the anchors as early as possible, the anchor pre-training loss weight is introduced in the early stage of training. In the YOMO model, the early stage of training is defined as 1 epoch.

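The conventional convolution module conv0 comprises, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.

The input picture of the conventional convolution module conv0 is cropped with the crop box SelectCrop_bbox selected by the medium-soft random cropping algorithm and then used for training. The specific steps are steps S1 to S7 set out in the Summary of the invention, in which the face scale classes of each crop box are sorted in descending order of face count:

S_i1≥S_i2≥…≥S_iN

and the face scale classes seen in actual training are sorted in ascending order of face count:

A₁≤A₂≤…≤A_N

In the above formula, A_c is one of the N face scale classes;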

Depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13, conv14, conv16, conv17, conv19 and conv20 have the same structure, each comprising, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.

The number of output channels of depthwise separable convolution modules conv14, conv17 and conv20 is 1024.

Detection modules det-32, det-16 and det-8 each comprise a conventional convolutional layer and an output layer;

The number of output channels of the conventional convolutional layer is computed as:

num_output=(num_coordinate+num_confidence+num_classes)×num_Anchors

Here coordinate, confidence, classes and Anchors denote the box coordinates, the confidence, the classes and the anchors, respectively. More anchors give the network better detection accuracy but slow down training and testing. Since YOMO has 3 detection modules responsible for faces of 3 scales, num_Anchors=3 is chosen to balance speed and accuracy. The number of output channels of the conventional convolutional layer in every detection module is therefore 18.
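The channel count can be checked with a one-line calculation. The default values below (4 box coordinates, 1 confidence value, 1 class for "face") are assumptions chosen to be consistent with the stated result of 18; the text itself only fixes num_Anchors=3 and the total.

```python
def num_output(num_coordinate=4, num_confidence=1, num_classes=1, num_anchors=3):
    """Output channels of the conventional convolutional layer in a detection module."""
    return (num_coordinate + num_confidence + num_classes) * num_anchors
```

Calling `num_output()` with these defaults gives (4 + 1 + 1) × 3 = 18.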

The centre coordinates of the prediction box of the output layer are computed as:

b_x=σ (t_x)+C_x,b_y=σ (t_y)+C_y

The outputs of detection modules det-32, det-16 and det-8 are each connected to an ellipse regressor, which converts the prediction box of the output layer into an elliptical ground-truth box; the elliptical ground-truth box is computed as:

Y=XW+ ε

When training the ellipse regressor, the key is how to match prediction boxes to ground-truth boxes. In practice, each ground-truth box in every FDDB picture is matched to the prediction box with the highest IoU, and only ground-truth boxes and their matched prediction boxes are considered during training.

The experimental environment is a 64-bit Ubuntu 14.04 LTS system with 16 GB of RAM and an 8-core Intel Core i7-7700K CPU with a single-core frequency of 4.20 GHz. All models are based on the Caffe framework and are trained on a single GPU, an NVIDIA GeForce GTX 1080 Ti.

The feature extraction network of the YOMO model is pre-trained on ImageNet and then fine-tuned for 200K iterations. The other training parameters are listed in Table 2. The anchor with the highest IoU with a ground-truth box is a positive example, and anchors with IoU < 0.3 are regarded as background. Taking detection speed and the face scale ranges into account, each detection module contains 3 anchors, whose values are obtained by clustering on the training set. The weights of the loss terms are λ_noobj=1, λ_prior=1, λ_coord=1, λ_obj=5 and λ_class=1. The NMS threshold of each detection module is set to 0.7 during training and 0.45 during testing. The training pictures of all models in the invention are scaled to a resolution of 544 × 544.
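The anchor labelling rule in the preceding paragraph can be sketched as follows for a single ground-truth box. The names `iou` and `label_anchors` are hypothetical, boxes are assumed to be given as corner coordinates, and the "ignored" label for anchors that are neither the best match nor below the background threshold is an assumption, since the text does not say how such anchors are treated.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, truth, bg_thresh=0.3):
    """Label each anchor against one ground-truth box."""
    scores = [iou(a, truth) for a in anchors]
    best = max(range(len(anchors)), key=lambda i: scores[i])
    labels = []
    for i, score in enumerate(scores):
        if i == best:
            labels.append("positive")      # highest IoU with the ground truth
        elif score < bg_thresh:
            labels.append("background")    # IoU < 0.3, as stated in the text
        else:
            labels.append("ignored")       # assumption: not specified in the text
    return labels
```

Only the single best-matching anchor is treated as positive, which is what restricts each face to one detection branch during training.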

WIDER FACE is a face detection benchmark data set whose pictures were collected from the internet and have fairly complex backgrounds. The data set contains 32203 pictures with 393703 labelled faces in total, and the faces vary widely in size, pose and occlusion. The pictures are organized into 61 event classes, and within each class the data are split into a training set, a validation set and a test set in the proportions 40%, 10% and 50%. All models in the invention are trained on the training set.

The pictures in the FDDB data set were collected from the Faces in the Wild data set and comprise 2845 pictures with 5171 faces. The data set is fairly difficult: it includes occlusion, difficult poses, low resolution and out-of-focus faces, as well as both black-and-white and colour images. Unlike other face detection data sets, its annotated regions are ellipses rather than rectangles. All models in the invention are tested on the FDDB data set.

When testing on the FDDB data set, every picture is scaled with its aspect ratio preserved and embedded in a black background of size 544 × 544, so that the picture is not distorted. As shown in Fig. 2(a) and 2(b), YOMO is compared with the MTCNN, ScaleFace, HR, HR-ER, ICC-CNN and FANet models under the DiscROC and ContROC criteria, respectively.
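The aspect-preserving embedding described above can be sketched as a small geometry helper. The function name `letterbox` is hypothetical, and centring the scaled picture on the black canvas is an assumption; the text only says that the picture is embedded in a 544 × 544 black background.

```python
def letterbox(width, height, target=544):
    """Compute the scale, scaled size and padding for an aspect-preserving fit."""
    scale = target / max(width, height)   # the longer side fills the canvas
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_x = (target - new_w) // 2         # assumption: picture is centred
    pad_y = (target - new_h) // 2
    return scale, (new_w, new_h), (pad_x, pad_y)
```

A 1088 × 544 picture, for instance, is halved to 544 × 272 and padded with 136 black rows above it under the centring assumption.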

In Fig. 2, YOMO-Fit denotes the detection results of YOMO after the ellipse regressor. The FDDB evaluation shows that, with the number of false positives fixed at 1000, YOMO-Fit reaches recalls of 97.7% and 83.6% under the DiscROC and ContROC criteria respectively, second only to FANet. Even though HR-ER uses FDDB as training data with 10-fold cross-validation, its recall under DiscROC is the same as YOMO's, and its recall under ContROC is 4.9% lower than that of YOMO-Fit. Notably, the ellipse regressor raises the recall of YOMO under DiscROC and ContROC by 0.1% and 8.6%, respectively.

Fig. 3(a) and (b) show visualization results on individual pictures from the WIDER FACE and FDDB data sets, respectively. The rectangular boxes in Fig. 3(a) are the prediction boxes of the YOMO model. In Fig. 3(b), the rectangles and ellipses are the prediction boxes of YOMO and the ground-truth boxes, respectively.

Claims

1. A single-step face detection system, characterized by comprising, connected in sequence from left to right: conventional convolution module conv0; depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13 and conv14; deconvolution layer conv15; depthwise separable convolution modules conv16 and conv17; deconvolution layer conv18; and depthwise separable convolution modules conv19 and conv20;

The output of depthwise separable convolution module conv14 is connected to detection module det-32, the output of depthwise separable convolution module conv17 is connected to detection module det-16, and the output of depthwise separable convolution module conv20 is connected to detection module det-8;

The output of depthwise separable convolution module conv11 is connected to the input of depthwise separable convolution module conv16; the output of conv16 is feature-fused with the output of deconvolution layer conv15 and connected to the input of depthwise separable convolution module conv17; the output of depthwise separable convolution module conv5 is connected to the input of depthwise separable convolution module conv19; and the output of conv19 is feature-fused with the output of deconvolution layer conv18 and connected to the input of depthwise separable convolution module conv20.

2. The single-step face detection system according to claim 1, characterized in that the conventional convolution module conv0 comprises, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.

3. The single-step face detection system according to claim 1, characterized in that the input picture of the conventional convolution module conv0 is cropped with the crop box SelectCrop_bbox selected by the medium-soft random cropping algorithm and then used for training, the specific steps being:

S1. Generate several crop boxes Sampled_bboxes with an aspect ratio of 1 using the random cropping algorithm; crop the original picture to obtain cropped pictures; scale each cropped picture to the input size required by the network, scaling the valid ground-truth boxes inside the crop box by the same ratio; then count the number of faces of each scale according to the face scale ranges, using the following counting formula:

In the above formula, Num_ic is the number of faces of the c-th scale class in the i-th crop box; N is the number of face scale classes, N=3, namely small-scale, medium-scale and large-scale faces; M is the total number of crop boxes; 1(·) is the indicator function, equal to 1 when its condition is true and 0 otherwise; MinScale_c and MaxScale_c are the lower and upper bounds of the c-th face scale class; bbox_k is the side length of the crop box; and K is the total number of crop boxes generated;

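S2. For each crop box, sort the face scale classes in descending order of their face counts:

S_i1≥S_i2≥…≥S_iN

In the above formula, i is the index of the crop box and S_ic is one of the N face scale classes in the i-th crop box;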

S3. Count the number of faces of each scale class seen so far in actual network training, and sort the face scale classes in ascending order of these counts:

A₁≤A₂≤…≤A_N

In the above formula, A_c is one of the N face scale classes;

S4. Among the M face scale class orderings of the crop boxes Sampled_bboxes, look for crop boxes satisfying S_ic=A_c, and randomly select one crop box that meets the condition as SelectCrop_bbox;

S5. When no crop box satisfying step S4 is found, look among the M face scale class orderings of the crop boxes Sampled_bboxes for crop boxes satisfying S_i1=A₁ and S_iN=A_N, and randomly select one crop box that meets the condition as SelectCrop_bbox;

S6. When no crop box satisfying step S5 is found, randomly select one crop box from Sampled_bboxes as SelectCrop_bbox;

S7. Add the face counts Num_sc of each scale class in SelectCrop_bbox to the counts of each face scale class in actual training, namely:

In the above formula, the count on the right-hand side is the number of faces of each scale class obtained in the preceding training step, and s denotes the index of the selected crop box.

4. The single-step face detection system according to claim 1, characterized in that depthwise separable convolution modules conv1, conv2, conv3, conv4, conv5, conv6, conv7, conv8, conv9, conv10, conv11, conv12, conv13, conv14, conv16, conv17, conv19 and conv20 have the same structure, each comprising, connected in sequence from top to bottom, a 3 × 3 convolutional layer, a BatchNorm layer, a LeakyReLU activation layer, a 1 × 1 convolutional layer, a BatchNorm layer and a LeakyReLU activation layer.

5. The single-step face detection system according to claim 1, characterized in that the number of output channels of depthwise separable convolution modules conv14, conv17 and conv20 is 1024.

6. The single-step face detection system according to claim 1, characterized in that detection module det-32 is used for large-scale face detection, detection module det-16 for medium-scale face detection, and detection module det-8 for small-scale face detection.

7. The single-step face detection system according to claim 1, characterized in that detection modules det-32, det-16 and det-8 each comprise a conventional convolutional layer and an output layer;

The number of output channels of the conventional convolutional layer is 18;

The centre coordinates and side lengths of the prediction box of the output layer are computed as:

b_x=σ (t_x)+C_x,b_y=σ (t_y)+C_y

In the above formula, (b_x, b_y) are the centre coordinates of the prediction box; b_w and b_h are the width and height of the prediction box; t_x and t_y are the offsets of the abscissa and ordinate of the prediction box centre; (C_x, C_y) are the coordinates of the top-left corner of the grid cell containing the anchor; σ(·) is the sigmoid function; and p_w and p_h are the width and height of the anchor.

8. The single-step face detection system according to claim 7, characterized in that the outputs of detection modules det-32, det-16 and det-8 are each connected to an ellipse regressor, which converts the prediction box of the output layer into an elliptical ground-truth box; the elliptical ground-truth box is computed as:

Y=XW+ ε

In the above formula, Y is the coordinate vector of the elliptical ground-truth box, comprising the semi-major axis r_a, the semi-minor axis r_b, the angle θ, and the centre abscissa c_x and ordinate c_y; X is the coordinate vector of the prediction box of the output layer, comprising the centre coordinates b_x and b_y, the width b_w and the height b_h of the prediction box; W is the regression coefficient matrix; and ε is the random error;

Wherein, in the formula for computing the regression coefficient matrix W, J(·) denotes the mean-square-error function, X′ is the normalized coordinate vector of the prediction box, and Y′ is the normalized coordinate vector of the ground-truth box;

and, in the normalization formula, U_X and σ_X are the mean and standard deviation of the prediction box coordinate vector X, and U_Y and σ_Y are the mean and standard deviation of the ground-truth box coordinate vector Y.