CN112949499A - Improved MTCNN face detection method based on ShuffleNet - Google Patents


Info

Publication number
CN112949499A
Authority
CN
China
Prior art keywords
net
convolution
shufflenet
point
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110242262.9A
Other languages
Chinese (zh)
Inventor
徐成
秦振
刘宏哲
徐冰心
潘卫国
代松银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202110242262.9A
Publication of CN112949499A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention discloses an improved MTCNN face detection method based on ShuffleNet, comprising the following steps. First, the image is transformed at different scales to construct an image pyramid, so that faces of different sizes can be detected. In the first stage, P-Net generates candidate face-region bounding boxes from the original picture. In the second stage, R-Net takes the original picture and the bounding boxes generated by P-Net as input and produces corrected, more accurate bounding boxes. In the third stage, the original picture and the bounding boxes output by R-Net are fed to O-Net, which generates the final face-region bounding boxes. The model is improved with the channel-shuffle idea and the point-by-point group convolution technique from ShuffleNet: it is based on MTCNN, and channel shuffling is applied during the convolution operations, so that the network can detect faces quickly and accurately.

Description

Improved MTCNN face detection method based on ShuffleNet
Technical Field
The invention relates to the field of deep-learning object detection, and in particular to an MTCNN face detection method improved based on ShuffleNet.
Background
The rapid growth in the number of motor vehicles has brought great convenience to people's lives and travel, but the road traffic accidents that come with it cause enormous losses of life, property, and national economic output in every country each year, and fatigued driving is one of the main causes of traffic accidents. If facial fatigue can be recognized efficiently, the driver's real-time facial state can be monitored to warn of and prevent fatigued driving, reducing the likelihood of traffic accidents; such a system has potential economic value and broad application prospects.
Existing driver fatigue detection methods suffer from a series of problems. Methods based on physiological parameters require the driver to wear invasive experimental equipment, which not only affects comfort but also interferes with actual driving. Methods based on the driver's operating behavior are strongly affected by individual factors such as driving habits and proficiency, and suffer from poor robustness and low detection accuracy. Methods based on vehicle operating parameters place requirements on the driving environment and are not robust on unstructured roads. Methods based on facial behavior are non-invasive, low-cost, and offer good real-time performance, but are strongly affected by the driving environment and individual differences.
In recent years, deep learning has developed continuously and achieved major breakthroughs, with convolutional neural networks extracting target features automatically. Thanks to the strong feature extraction capability of convolutional neural networks, the accuracy of face detection algorithms has improved greatly, their robustness is stronger, and they can adapt to more complex recognition scenes.
The introduction of AlexNet in 2012 spurred the development of deep learning, and VGGNet in 2014 made deep neural networks practical, but the vanishing-gradient problem appeared as networks deepened. In 2015, ResNet addressed this problem with residual connections, reducing model convergence time and allowing networks to go deeper without gradient vanishing.
The Multi-task Cascaded Convolutional Neural Network (MTCNN) is a cascaded model based on a coarse-to-fine idea that performs face detection and facial keypoint detection simultaneously, and it is currently one of the most widely used detectors in the face detection field. It exploits the internal correlation between face detection and keypoint detection to improve the performance of both, is one of the few detectors that can be deployed on conventional hardware, and achieves high accuracy on the face detection task. Because MTCNN outputs only 5 calibrated facial keypoints, while accurately locating facial parts (such as the eyes and mouth) and computing their fatigue features in driving-fatigue detection requires more keypoints, the invention uses only MTCNN's face detection function. The cascade comprises three CNN models: P-Net (Proposal Network), R-Net (Refinement Network), and O-Net (Output Network). P-Net is a fully convolutional network (FCN) [5] that rapidly generates a series of candidate face windows; R-Net filters out most of the non-face candidates generated by P-Net and further corrects the bounding-box coordinates of candidates that may be faces; O-Net functions similarly to R-Net, except that it takes a larger input, has a more complex network structure and better performance, and produces the final face window and the facial keypoint positions.
The MTCNN model has the following speed bottlenecks: the larger the resolution of the input image to the first-stage P-Net, the more time is consumed; and the more faces in the image, the longer the second and third stages (R-Net and O-Net) take.
To address these problems, the channel shuffle and point-by-point group convolution of ShuffleNet are added to the MTCNN model, preserving face detection accuracy while increasing detection speed.
Disclosure of Invention
In order to solve the above problems, an embodiment of the present invention provides an MTCNN face detection method improved based on ShuffleNet, aiming to improve face detection accuracy and speed, comprising the following steps:
step one, transform the image at different scales to construct an image pyramid;
step two, input all pictures of the image pyramid into P-Net, perform three convolutions, one pooling, and two channel shuffles, and output a large number of Bounding box coordinates;
step three, crop a picture from the original picture according to the Bounding box coordinates and resize it to 24 × 24;
step four, input the 24 × 24 picture into R-Net, perform three conventional convolutions, two poolings, two channel shuffles, and one point-by-point group convolution, and output corrected, more accurate Bounding box coordinates;
step five, crop a picture from the original picture according to the Bounding box coordinates and resize it to 48 × 48;
step six, input the 48 × 48 picture into O-Net, perform four conventional convolutions, three poolings, three channel shuffles, and one point-by-point group convolution, and output accurate Bounding box coordinates;
Preferably, when the convolution operations in steps two, four, and six are performed, the model is improved with the channel-shuffle idea: the feature channels are evenly distributed into different groups, so that each group's features obtain information from the other groups during convolution, strengthening the correlation between feature maps of different channels. This strategy reduces the amount of computation as much as possible while maintaining the model's detection accuracy.
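As a rough illustration of the channel-shuffle idea (a sketch, not code from the patent; the function name and shapes are illustrative), the operation reduces to a reshape–transpose–reshape over the channel axis:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle the channels of a (C, H, W) feature map across groups,
    as in ShuffleNet: reshape -> transpose group axes -> flatten back."""
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)  # interleave channels from different groups
    return x.reshape(c, h, w)

# 6 channels in 2 groups: [0 1 2 | 3 4 5] interleaves to [0 3 1 4 2 5],
# so each group's output mixes channels from both input groups.
x = np.arange(6.0).reshape(6, 1, 1) * np.ones((6, 2, 2))
y = channel_shuffle(x, groups=2)
print(y[:, 0, 0])  # [0. 3. 1. 4. 2. 5.]
```

After this shuffle, a subsequent group convolution sees channels originating from every group, which is exactly the cross-group information flow the text describes.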
Preferably, point-by-point group convolution is performed in steps four and six. Point-by-point group convolution is the combined application of group convolution (Group Convolution) and point-by-point convolution (Pointwise Convolution); group convolution reduces the number of parameters and can be regarded as structured sparsity (Structured Sparse). The point-by-point convolution is formulated as:
$$x_i^{\ell} = \sum_{k} \omega_k \cdot \frac{1}{\left|\Omega_i(k)\right|} \sum_{p_j \in \Omega_i(k)} x_j^{\ell-1}$$

where k iterates over all sub-domains of the kernel support, p_i is the coordinate of the i-th point, Ω_i(k) is the k-th sub-domain around p_i, |·| denotes the number of points in a sub-domain, ω_k is the kernel weight of the k-th sub-domain, and ℓ−1 and ℓ index the input and output layers.
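To make the parameter saving of the grouped 1 × 1 (point-by-point group) convolution concrete, here is a minimal NumPy sketch; the shapes and function name are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def pointwise_group_conv(x, weights, groups):
    """1x1 group convolution on a (C_in, H, W) feature map.
    weights has shape (groups, C_out_per_group, C_in_per_group): each group
    mixes only its own slice of input channels with a 1x1 kernel."""
    c_in, h, w = x.shape
    g, c_out_g, c_in_g = weights.shape
    assert g == groups and c_in_g * groups == c_in
    outs = []
    for i in range(groups):
        # flatten spatial dims: a 1x1 conv is a per-pixel matrix multiply
        xg = x[i * c_in_g:(i + 1) * c_in_g].reshape(c_in_g, -1)
        outs.append((weights[i] @ xg).reshape(c_out_g, h, w))
    return np.concatenate(outs, axis=0)  # (groups * C_out_per_group, H, W)

x = np.ones((4, 2, 2))
w = np.ones((2, 3, 2))  # 2 groups, each mapping 2 -> 3 channels
y = pointwise_group_conv(x, w, groups=2)
print(y.shape, y[0, 0, 0])  # (6, 2, 2) 2.0
```

A full 1 × 1 convolution from C_in to C_out channels needs C_in·C_out weights; splitting it into g groups reduces this to C_in·C_out/g, which is the parameter reduction the text attributes to group convolution.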
Compared with the prior art, the embodiment of the invention improves on the MTCNN network by adding ShuffleNet's channel shuffle and point-by-point group convolution, increasing detection speed while maintaining face detection accuracy.
Drawings
FIG. 1 is the inference flow chart of the improved MTCNN face detection method based on ShuffleNet of the present invention;
FIG. 2 is a schematic diagram of the improved P-Net of the method;
FIG. 3 is a schematic diagram of the improved R-Net of the method;
FIG. 4 is a schematic diagram of the improved O-Net of the method;
FIG. 5 is a schematic diagram of the channel shuffling technique adopted by the method;
FIG. 6 is a schematic diagram of the group convolution adopted by the method;
FIG. 7 is a schematic diagram of the point-by-point convolution adopted by the method;
FIG. 8 shows detection results of the method.
Detailed Description
The model scheme in the embodiments of the present invention will be described fully below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an improved MTCNN face detection method based on ShuffleNet, and the example of the present invention includes the following steps:
Step one: transform the image at different scales to construct an image pyramid. The picture is repeatedly resized by a resize_factor (for example 0.70; the value is chosen according to the face-size distribution of the dataset, and values between 0.70 and 0.80 are usually appropriate — a larger factor lengthens inference time, while a smaller one tends to miss small and medium faces) until the size reaches the 12 × 12 minimum required by P-Net. This yields the original image, the original image × resize_factor, the original image × resize_factor², and so on. Note that all of these images are fed into P-Net one by one to obtain candidates.
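The pyramid construction above can be sketched in a few lines; the function name is illustrative, and the 0.70 factor and 12 × 12 floor come from the text:

```python
def pyramid_scales(height, width, resize_factor=0.70, min_size=12):
    """Scale factors for the image pyramid: keep multiplying by
    resize_factor until the shorter side would fall below P-Net's 12x12."""
    scales, scale = [], 1.0
    while min(height, width) * scale >= min_size:
        scales.append(scale)
        scale *= resize_factor
    return scales

print([round(s, 4) for s in pyramid_scales(100, 120)])
# [1.0, 0.7, 0.49, 0.343, 0.2401, 0.1681]
```

Each returned scale corresponds to one resized copy of the original image that is pushed through P-Net.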
Step two: input the picture pyramid into P-Net, as shown in FIG. 2, to obtain a large number of candidates. Every picture of the pyramid obtained in step one is fed into P-Net, producing an output map of shape (m, n, 16). Most candidates are screened out by their classification score; each bounding box (bbox) is then calibrated with the 4 regressed offsets to obtain its top-left and bottom-right coordinates, and non-maximum suppression (NMS) based on the IoU value removes most of the remaining candidates. In detail, the candidates are sorted by classification score from high to low into a tensor of shape (num_left, 4), i.e. the absolute top-left and bottom-right coordinates of num_left bboxes. Each iteration computes the IoU between the highest-scoring bbox in the queue and the remaining ones, discards every box whose IoU exceeds 0.6 (a preset threshold), and moves the highest-scoring box to the final result. Repeating this removes many heavily overlapping bboxes and finally yields (num_left_after_nms, 16) candidates. A channel shuffling step is added after the convolutions; its principle is shown in FIG. 5;
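The greedy NMS loop described above can be sketched as follows (illustrative, not the patent's code; boxes are (N, 4) arrays of x1, y1, x2, y2, and the 0.6 threshold is the one named in the text):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.6):
    """Greedy NMS: keep the top-scoring box, drop remaining boxes whose IoU
    with it exceeds the threshold, then repeat on what is left."""
    order = np.argsort(scores)[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 with IoU ~0.68
```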
Step three: crop a picture from the original picture according to the Bounding box coordinates and resize it to 24 × 24;
Step four: according to the coordinates output by P-Net, crop pictures from the original image (a detail: each crop is a square whose side equals the longer side of the bbox, which avoids distortion during resizing and keeps more detail around the face frame), resize them to 24 × 24, and input them into R-Net for fine adjustment, as shown in FIG. 3. R-Net still outputs a two-class one-hot (2 outputs), bbox coordinate offsets (4 outputs), and landmarks (10 outputs). After adjusting the bboxes by their offsets (simply shifting the top-left and bottom-right x, y coordinates) and ranking by the two-class score, the IoU-based NMS of the P-Net stage is repeated to remove most candidates. R-Net finally outputs (num_left_after_Rnet, 16). A point-by-point group convolution step is added to accelerate detection; its principle is shown in FIG. 6 and FIG. 7;
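The square-crop detail mentioned above — expanding each bbox to a square along its longer side before resizing, so the face is not distorted — can be sketched as (illustrative function name and shapes):

```python
import numpy as np

def bbox_to_square(boxes):
    """Expand (N, 4) boxes (x1, y1, x2, y2) to squares with side max(w, h),
    centered on the original box, so a later resize to 24x24 or 48x48
    preserves the aspect ratio of the face."""
    boxes = boxes.astype(float).copy()
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    side = np.maximum(w, h)
    boxes[:, 0] += (w - side) / 2  # shift left edge out symmetrically
    boxes[:, 1] += (h - side) / 2  # shift top edge out symmetrically
    boxes[:, 2] = boxes[:, 0] + side
    boxes[:, 3] = boxes[:, 1] + side
    return boxes

print(bbox_to_square(np.array([[10, 10, 30, 50]])))  # [[ 0. 10. 40. 50.]]
```

In a full pipeline the square would additionally be clipped or padded at the image border before cropping.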
Step five: crop images from the original image according to the bbox coordinates for input to O-Net, again cropping squares along the longer side to avoid distortion and keep more detail;
Step six: input the pictures remaining after R-Net's filtering into O-Net, which outputs accurate bbox coordinates and landmark coordinates, as shown in FIG. 4. The process generally repeats that of P-Net, except that this time the landmark coordinates are output in addition to the bbox coordinates (the landmark output mainly helps make the bbox more accurate; in other words, P-Net and R-Net at the inference stage need not output landmarks at all — only O-Net does).
In tests, compared with the traditional MTCNN network, the embodiment of the invention achieves good detection results on the WIDER FACE dataset: the average precision reaches 90.3% and the average speed reaches 232 FPS, which, compared with MTCNN's 25 FPS, meets the requirement of real-time detection.
In summary, the embodiment of the invention builds on the MTCNN model and adds ShuffleNet's channel shuffle and point-by-point group convolution, maintaining face detection accuracy while increasing detection speed.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. An MTCNN face detection method improved based on ShuffleNet, characterized in that the model is improved with the channel shuffling method of ShuffleNet, and the method comprises the following steps:
step one, transform the image at different scales to construct an image pyramid;
step two, input all pictures of the image pyramid into P-Net, perform three convolutions, one pooling, and two channel shuffles, and output a large number of Bounding box coordinates;
step three, crop a picture from the original picture according to the Bounding box coordinates and resize it to 24 × 24;
step four, input the 24 × 24 picture into R-Net, perform three conventional convolutions, two poolings, two channel shuffles, and one point-by-point group convolution, and output corrected, more accurate Bounding box coordinates;
step five, crop a picture from the original picture according to the Bounding box coordinates and resize it to 48 × 48;
step six, input the 48 × 48 picture into O-Net, perform four conventional convolutions, three poolings, three channel shuffles, and one point-by-point group convolution, and output accurate Bounding box coordinates.
2. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein an image pyramid is used in step one to solve the multi-scale problem, i.e. the original image is scaled multiple times by a fixed factor to obtain images at multiple scales.
3. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein in step two P-Net is a fully convolutional network, whose convolution, pooling, and nonlinear activation operations all accept inputs of arbitrary size.
4. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein the convolution operations of steps two, four, and six are improved with the channel-shuffle idea: the feature channels are evenly distributed into different groups, so that each group's features obtain information from the other groups during convolution, strengthening the correlation between feature maps of different channels.
5. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein point-by-point group convolution is performed in steps four and six; point-by-point group convolution is the combined application of group convolution and point-by-point convolution, and group convolution reduces the number of parameters and can be regarded as structured sparsity.
6. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein the improved MTCNN is still trained on three tasks — face classification, face bounding-box regression, and facial keypoint regression — of which face bounding-box regression is the main task.
7. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein for each candidate window the offset from the nearest manually labeled bounding box is predicted; the learning objective is formulated as a regression problem, and the Euclidean distance is used to compute the loss for each sample:
$$L_i^{box} = \left\|\hat{y}_i^{box} - y_i^{box}\right\|_2^2$$

where \(\hat{y}_i^{box}\) is the bounding-box value regressed by the network and \(y_i^{box}\) is the ground-truth face position, comprising the top-left corner coordinates, height, and width of the face frame in the original image.
8. The ShuffleNet-based improved MTCNN face detection method of claim 1, wherein multi-source training is implemented directly with a sample-type indicator, and the overall learning objective is formulated as:
$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^{j} \, L_i^{j}$$

where N is the number of samples in the whole training set; α_j denotes the importance of each learning task — when training P-Net and R-Net, α_j is set to (α_det = 1, α_box = 0.5, α_landmark = 0.5), and when training O-Net, to make keypoint localization more accurate, α_j is set to (α_det = 1, α_box = 0.5, α_landmark = 1); β_i^j ∈ {0, 1} is the sample-type indicator. Stochastic gradient descent (SGD) is used during training to optimize the CNN parameters of each stage.
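The weighted objective of claim 8 can be illustrated numerically; this is a hedged sketch (names, shapes, and dummy loss values are assumptions), not training code from the patent:

```python
import numpy as np

def total_loss(losses, betas, stage="pnet"):
    """Weighted multi-task objective: sum over tasks j of
    alpha_j * sum_i beta_i^j * L_i^j. losses/betas are dicts over the tasks
    'det', 'box', 'landmark', each holding a per-sample array of shape (N,)."""
    alphas = ({"det": 1.0, "box": 0.5, "landmark": 0.5}
              if stage in ("pnet", "rnet")
              else {"det": 1.0, "box": 0.5, "landmark": 1.0})  # O-Net weights
    return sum(alphas[j] * np.sum(betas[j] * losses[j]) for j in alphas)

# two dummy samples; beta = 0 switches a task off for a sample
losses = {"det": np.array([1.0, 2.0]), "box": np.array([4.0, 0.0]),
          "landmark": np.array([2.0, 2.0])}
betas = {"det": np.array([1, 1]), "box": np.array([1, 0]),
         "landmark": np.array([1, 0])}
print(total_loss(losses, betas, "pnet"))  # 1*3 + 0.5*4 + 0.5*2 = 6.0
```

The sample-type indicator β lets one mini-batch mix detection-only, box-only, and landmark-annotated samples, which is the multi-source training the claim describes.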
Application CN202110242262.9A, filed 2021-03-04 by Beijing Union University; published as CN112949499A on 2021-06-11 (legal status: Pending; family ID 76247781).

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619319A (en) * 2019-09-27 2019-12-27 北京紫睛科技有限公司 Improved MTCNN model-based face detection method and system
CN110705357A (en) * 2019-09-02 2020-01-17 深圳中兴网信科技有限公司 Face recognition method and face recognition device
WO2020020472A1 (en) * 2018-07-24 2020-01-30 Fundación Centro Tecnoloxico De Telecomunicacións De Galicia A computer-implemented method and system for detecting small objects on an image using convolutional neural networks
CN111161543A (en) * 2019-11-14 2020-05-15 南京行者易智能交通科技有限公司 Automatic snapshot method and system for bus front violation behavior based on image recognition
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss
CN112069993A (en) * 2020-09-04 2020-12-11 西安西图之光智能科技有限公司 Dense face detection method and system based on facial features mask constraint and storage medium
CN112313666A (en) * 2019-03-21 2021-02-02 因美纳有限公司 Training data generation for artificial intelligence based sequencing


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALI GHOFRANI et al.: "Realtime Face-Detection and Emotion Recognition Using MTCNN and miniShuffleNet V2", 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), page 2 *
JIANG Hang; DONG Lanfang: "Fast and Accurate Face Detection Method in a CPU Environment", Journal of Chinese Computer Systems, no. 01, pages 157-162 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination