Background
With the rise of deep learning, intelligent analysis technologies for human faces have become a key focus of research in artificial intelligence. New algorithms continually raise the benchmark scores on face-related tasks, current face recognition technology has surpassed the best human performance, and face-related technology has the widest range of industrial applications. For example, applications of face detection include intelligent security, city brain platforms, safe driving, and China's Skynet system; applications of face recognition include face payment, intelligent access control, face attendance, and face verification on various intelligent terminal devices, so face-related technology is closely tied to the security of these systems. Face-related technology is also increasingly applied in everyday life, for example in searching for missing children and in intelligent education. Further, as computing power improves and 5G networks are deployed, the cost of data storage and the delay of data transmission keep falling, and face-related applications are deployed on more and more intelligent terminals, helping to realize an intelligent society that benefits people. Face detection means that an intelligent terminal determines whether a face is present in an input image and locates it. A prerequisite for face detection technology is that faces can be detected accurately regardless of the image background. As a basic and core technology for face-related tasks, face detection has therefore attracted wide attention from researchers.
A face detection model based on the SSD algorithm can identify faces in natural-scene images quickly and accurately and has a high detection speed, but its recall rate for small faces in natural or unnatural scenes still has considerable room for improvement. A new network model, MDSSD (Mix resolution Single Shot Multi Box Detector), is therefore constructed for face detection. The MDSSD algorithm addresses several shortcomings of the SSD algorithm in face detection, including the model structure, the detection feature maps, the parameter configuration, and the loss function, and it configures the model by a machine learning method to reduce reliance on human experience, thereby greatly improving the detection performance of the model.
Disclosure of Invention
This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.
The present invention has been made in view of the above-mentioned problems in the prior art.
Therefore, the invention provides an MDSSD face detection method based on an improved loss function, which can solve the problem of low recall rate of small face detection in natural or unnatural scenes.
In order to solve the above technical problems, the invention provides the following technical scheme: the MDSSD network detects face regions by using a prior box mechanism, and classifies and regresses candidate regions; cluster analysis is performed on the Ground Truth boxes by k-means to find the optimal number, sizes, and aspect ratios of the prior boxes; and the MDSSD network replaces the cross entropy loss function in the classification network with the Focal loss function, and classifies the prior boxes obtained from the cluster analysis as face or background.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the MDSSD network includes: padding a deep feature map or a deep fusion layer with zeros and applying a 3×3 convolution to the padded feature map as a deconvolution operation, doubling the resolution of the feature map while keeping the receptive field range unchanged; using a number of convolution kernels equal to the channel dimension of the shallow feature map, so that the output dimension of the deconvolution operation matches the dimension of the shallow feature map to be fused; during MDSSD feature fusion, performing only an element-wise addition at corresponding positions of the shallow feature map and the deconvolved feature map, so as to enhance effective context information; and adding an activation layer after the fusion layer to perform a nonlinear mapping, the activated fusion layer being used as the final detection feature map.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: the MDSSD network takes SSD as its base network model; the dropout layers of Block6 and Block7 in the SSD network are removed; a multilayer fusion module Mixed_layer3 and single-layer fusion modules Mixed_layer4 and Mixed_layer7 are added; and the MDSSD network model also adds an L2Normalization layer to reduce the difference from the later detection layers.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: the prior box mechanism needs to take the effective receptive field into account; the layers of a convolutional neural network are locally connected, so a neuron cannot perceive all the information of the original image; the larger the receptive field, the more global information is acquired, that is, the more global and high-level the semantic features contained in the feature map; the smaller the receptive field of a neuron, the lower-level the features contained in the feature map and the more local and texture-oriented the information.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: the prior boxes need to be matched with the Ground Truth boxes to divide positive and negative samples; the larger the difference in size and aspect ratio between a prior box and the real Ground Truth box, the larger the error in the computed intersection over union (IOU); the smaller the difference in size and aspect ratio between the prior box and the real Ground Truth box, the smaller the error in the computed IOU.
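The matching of prior boxes to Ground Truth boxes can be illustrated with a minimal sketch. The box format `(xmin, ymin, xmax, ymax)`, the function names, and the 0.5 positive threshold are assumptions made for illustration; the text does not state a threshold.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_priors(priors, gt_boxes, pos_threshold=0.5):
    """Label each prior 1 (face) or 0 (background).

    pos_threshold is an assumed value, not taken from the text.
    """
    return [
        1 if any(iou(p, g) >= pos_threshold for g in gt_boxes) else 0
        for p in priors
    ]


gt = [(10.0, 10.0, 30.0, 34.0)]
priors = [
    (10.0, 10.0, 30.0, 30.0),  # close in size and ratio to the Ground Truth box
    (12.0, 8.0, 60.0, 80.0),   # far too large, so the IOU is low
]
labels = label_priors(priors, gt)
```

The second prior overlaps the face but differs greatly in size, so its IOU is low and it is treated as background, which is why prior sizes and aspect ratios that follow the data matter.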
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: performing the cluster analysis using a custom IOU distance as the distance metric, including,
$$d_{IOU}(box, centroid) = 1 - IOU(box, centroid)$$
the clustering loss is the IOU distance between a Ground Truth box and its cluster center, and the smaller the IOU distance, the larger the IOU value; defining the number of clusters k and randomly initializing the cluster centers $(W_i, H_i)$, $i \in \{1, 2, \ldots, k\}$, where $W_i$ and $H_i$ denote the width and height of the i-th cluster center, respectively; placing the cluster centers and the centers of the Ground Truth boxes at the coordinate origin and calculating the IOU distance between each Ground Truth box and each cluster center; assigning each Ground Truth box to the cluster with the minimum IOU distance, and recalculating the cluster centers after all the Ground Truth boxes have been assigned; and repeating the update until the cluster centers no longer change, the median of each cluster being taken as the final prior box size and aspect ratio.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: determining the optimal number of clusters by the elbow method, wherein when k = 17 the decrease of the clustering loss slows and the loss levels off, and, taking the settings of all detection layers into account, the optimal number of clusters is determined to be 17.
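One simple way to automate an elbow decision is sketched below, assuming the clustering loss has already been computed for a range of k. The function name, the relative-drop criterion, the tolerance, and the loss curve are all hypothetical; the curve is synthetic and is not data from this document.

```python
def elbow_k(ks, losses, tol=0.02):
    """Return the first k after which the relative loss drop falls below tol.

    tol is an assumed tolerance; the text only says the loss "levels off".
    """
    for idx in range(len(ks) - 1):
        relative_drop = (losses[idx] - losses[idx + 1]) / losses[idx]
        if relative_drop < tol:
            return ks[idx]
    return ks[-1]


# Synthetic loss curve for illustration only.
ks = [2, 4, 6, 8, 10]
losses = [0.50, 0.35, 0.30, 0.295, 0.292]
best_k = elbow_k(ks, losses)
```

In practice the elbow is often read off a plot, and the text additionally weighs the settings of the detection layers, which a single threshold like this does not capture.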
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: the loss function is the Focal loss, wherein x is the sample label, y' is the model output value, α is the sample balance factor, and γ is the sample weight adjustment factor.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: when x = 1, that is, when the input is a positive sample, the larger the predicted value, the easier the sample is to classify and the smaller the sample weight.
As a preferred scheme of the MDSSD face detection method based on the improved loss function, the present invention further includes: the sample balance factor α adjusts the relative weight of positive and negative samples in the loss function, and α = 0.25 and γ = 2 are typically set during model training.
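For reference, the Focal loss in its commonly published binary form, written with the symbols x, y', α, and γ defined above, reads as follows. This is the standard form and is assumed, not confirmed, to be the expression intended here.

$$
L_{fl} =
\begin{cases}
-\alpha \,(1 - y')^{\gamma} \log y', & x = 1 \\
-(1 - \alpha)\, y'^{\gamma} \log (1 - y'), & x = 0
\end{cases}
$$

With x = 1, a larger y' makes the factor $(1 - y')^{\gamma}$ smaller, which matches the statement that easily classified positive samples receive a smaller weight.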
The invention has the following beneficial effects: to address the shortcomings of the SSD algorithm in face detection, such as sample imbalance, low classification confidence, and a low recall rate for small faces, the invention improves the SSD network structure, the loss function, and the model presets, and provides the MDSSD algorithm, which redesigns the network structure and the detection module, moves the detection layers forward, and performs cluster analysis on the Ground Truth boxes of the labeled faces to find the optimal number and aspect ratios of prior boxes for each detection layer. The MDSSD model was trained, tested, and analyzed, and the experimental results show that, compared with SSD, the MDSSD algorithm achieves a higher recall rate on small and blurred faces while retaining a high detection speed.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
The present invention is described in detail with reference to the drawings. For convenience of illustration, cross-sectional views showing the structure of a device may be partially enlarged out of general scale; the drawings are only exemplary and should not be construed as limiting the scope of the present invention. In addition, the three dimensions of length, width, and depth should be included in actual fabrication.
Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted", "coupled", and "connected" in the present invention are to be understood broadly unless otherwise explicitly specified or limited. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific case.
Example 1
Referring to fig. 1 to 4, a first embodiment of the present invention provides an MDSSD face detection method based on an improved loss function, including:
S1: the MDSSD network detects face regions by using a prior box mechanism, and classifies and regresses the candidate regions. It should be noted that the MDSSD network includes:
padding the deep feature map or the deep fusion layer with zeros and applying a 3×3 convolution to the padded feature map as a deconvolution operation, doubling the resolution of the feature map while keeping the receptive field range unchanged;
using a number of convolution kernels equal to the channel dimension of the shallow feature map, so that the output dimension of the deconvolution operation matches the dimension of the shallow feature map to be fused;
during MDSSD feature fusion, performing only an element-wise addition at corresponding positions of the shallow feature map and the deconvolved feature map, to enhance effective context information;
adding an activation layer after the fusion layer to perform a nonlinear mapping, and using the activated fusion layer as the final detection feature map.
Further, the method also includes the following:
the MDSSD network takes SSD as its base network model;
the dropout layers of Block6 and Block7 in the SSD network are removed;
a multilayer fusion module Mixed_layer3 and single-layer fusion modules Mixed_layer4 and Mixed_layer7 are added;
the MDSSD network model also adds an L2Normalization layer to reduce the difference from the later detection layers.
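What an L2 normalization layer computes can be sketched in a few lines. The sketch assumes normalization across channels at each spatial position followed by a scale factor; the default scale of 20 mirrors common SSD implementations and is an assumption, since the text gives no value, and in a trained network the scale would normally be learnable.

```python
import math


def l2_normalize(feature_map, scale=20.0, eps=1e-12):
    """Normalize feature_map[c][h][w] to unit L2 norm across channels, then scale.

    scale=20.0 is an assumed default, not a value stated in the text.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            norm = math.sqrt(
                sum(feature_map[c][h][w] ** 2 for c in range(channels))
            ) + eps
            for c in range(channels):
                out[c][h][w] = scale * feature_map[c][h][w] / norm
    return out


# Two channels, one row, two columns.
fmap = [[[3.0, 0.0]], [[4.0, 2.0]]]
normed = l2_normalize(fmap, scale=1.0)
```

After normalization every spatial position has the same channel-wise norm, so a shallow layer with large activations no longer dominates the later detection layers.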
S2: cluster analysis is performed on the Ground Truth boxes by k-means to find the optimal number, sizes, and aspect ratios of the prior boxes. It should be noted that in this step the prior box mechanism needs to take the effective receptive field into account, including:
the layers of a convolutional neural network are locally connected, so a neuron cannot perceive all the information of the original image;
the larger the receptive field, the more global information is acquired, that is, the more global and high-level the semantic features contained in the feature map;
the smaller the receptive field of a neuron, the lower-level the features contained in the feature map and the more local and texture-oriented the information;
the prior boxes need to be matched with the Ground Truth boxes to divide positive and negative samples;
the larger the difference in size and aspect ratio between a prior box and the real Ground Truth box, the larger the error in the computed intersection over union (IOU);
the smaller the difference in size and aspect ratio between the prior box and the real Ground Truth box, the smaller the error in the computed IOU.
Specifically, the cluster analysis is performed using a custom IOU distance as the distance metric, as follows:
$$d_{IOU}(box, centroid) = 1 - IOU(box, centroid)$$
the clustering loss is the IOU distance between a Ground Truth box and its cluster center, and the smaller the IOU distance, the larger the IOU value;
defining the number of clusters k and randomly initializing the cluster centers $(W_i, H_i)$, $i \in \{1, 2, \ldots, k\}$, where $W_i$ and $H_i$ denote the width and height of the i-th cluster center, respectively;
placing the cluster centers and the centers of the Ground Truth boxes at the coordinate origin and calculating the IOU distance between each Ground Truth box and each cluster center;
assigning each Ground Truth box to the cluster with the minimum IOU distance, and recalculating the cluster centers after all the Ground Truth boxes have been assigned;
repeating the update until the cluster centers no longer change, and taking the median of each cluster as the final prior box size and aspect ratio.
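The clustering steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: boxes are `(width, height)` pairs, initial centers are sampled from the boxes themselves, and the center of each cluster is recomputed as the per-dimension median of its members, which is one reading of the median rule in the text. The function names and the sample boxes are hypothetical.

```python
import random
from statistics import median


def iou_wh(box, centroid):
    """IOU of two (w, h) boxes whose centers both sit at the origin."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, seed=0, max_iter=100):
    """Cluster (w, h) boxes with the distance d = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumed initialization strategy
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            dists = [1.0 - iou_wh(box, c) for c in centroids]
            clusters[dists.index(min(dists))].append(box)
        # Median update; an empty cluster keeps its previous center.
        new_centroids = [
            (median(b[0] for b in cl), median(b[1] for b in cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Hypothetical Ground Truth box sizes: three small faces and three large ones.
boxes = [(10, 12), (11, 13), (12, 14), (50, 60), (52, 64), (54, 62)]
priors = kmeans_iou(boxes, k=2)
```

Using 1 - IOU instead of Euclidean distance keeps large boxes from dominating the loss, since the distance depends on relative overlap rather than absolute size.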
S3: the MDSSD network replaces the cross entropy loss function in the classification network with the Focal loss, and classifies the prior boxes obtained from the cluster analysis as face or background. It should be further noted that, for the loss function:
x is the sample label, y' is the model output value, α is the sample balance factor, and γ is the sample weight adjustment factor;
the optimal number of clusters is determined by the elbow method: when k = 17 the decrease of the clustering loss slows and the loss levels off, and, taking the settings of all detection layers into account, the optimal number of clusters is 17;
when x = 1, that is, when the input is a positive sample, the larger the predicted value, the easier the sample is to classify and the smaller the sample weight;
the sample balance factor α adjusts the relative weight of positive and negative samples in the loss function, and α = 0.25 and γ = 2 are typically set during model training.
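The effect of the two factors can be seen in a small per-sample sketch. It assumes the commonly published binary form of the Focal loss, with x the label, y_pred the model output y', and the defaults α = 0.25 and γ = 2 from the text; the function names and the clipping constant eps are illustrative choices.

```python
import math


def focal_loss(x, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-sample binary Focal loss (standard form, assumed here)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    if x == 1:
        return -alpha * (1.0 - y_pred) ** gamma * math.log(y_pred)
    return -(1.0 - alpha) * y_pred ** gamma * math.log(1.0 - y_pred)


def cross_entropy(x, y_pred, eps=1e-7):
    """Per-sample binary cross entropy, for comparison."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -math.log(y_pred) if x == 1 else -math.log(1.0 - y_pred)


easy_positive = focal_loss(1, 0.9)    # confidently correct face
hard_positive = focal_loss(1, 0.3)    # face the model is unsure about
easy_negative = focal_loss(0, 0.05)   # confidently correct background
```

For the easy positive the Focal loss is the cross entropy multiplied by α(1 - y')^γ = 0.25 × 0.01, so the many easy samples contribute far less to the gradient than the hard ones.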
Referring to fig. 2, the inverse process of the convolution operation is called transposed convolution or deconvolution. It is a special upsampling convolution operation with learnable parameters, and it is the inverse of convolution in the sense that the forward and backward propagation of the two operations are interchanged. This inverse relation, however, means only that the transposed convolution can restore the size of the input feature map; it cannot restore the feature values of the original feature map, so the main use of the transposed convolution is upsampling. A convolution with a stride larger than 1 performs equidistant downsampling, so its output feature map is smaller than its input, whereas the transposed convolution uses a convolution with a stride smaller than 1 to upsample, so the feature map becomes larger. Traditional upsampling applies interpolation or hand-crafted rules, whereas the transposed convolution lets the network learn a suitable transformation from data without human intervention. To implement the transposed convolution, s-1 zeros are first inserted between adjacent elements of the input feature map to form a new input feature map, and a convolution is then applied to it. When the input feature map has size i×i, the transposed convolution stride is s, the convolution kernel has size k×k, and the padding is p, the output feature map of the transposed convolution has size s(i-1)+k-2p. In the MDSSD algorithm, the low-resolution feature maps are upsampled by transposed convolution so that shallow texture features can be fused with high-level semantic features.
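The zero-insertion procedure and the size formula s(i-1)+k-2p can be checked with a one-dimensional sketch. The function names are hypothetical, the kernel here is fixed rather than learned, and the sketch assumes p ≤ k-1.

```python
def transposed_conv_output_size(i, s, k, p):
    """Output size of a transposed convolution: s(i-1) + k - 2p."""
    return s * (i - 1) + k - 2 * p


def transposed_conv1d(signal, kernel, stride, padding=0):
    """1-D transposed convolution via zero insertion (assumes padding <= k-1)."""
    # Step 1: insert stride-1 zeros between neighbouring input elements.
    stretched = []
    for idx, value in enumerate(signal):
        stretched.append(value)
        if idx < len(signal) - 1:
            stretched.extend([0.0] * (stride - 1))
    # Step 2: pad both ends with k-1-p zeros.
    k = len(kernel)
    pad = k - 1 - padding
    padded = [0.0] * pad + stretched + [0.0] * pad
    # Step 3: slide the flipped kernel over the padded signal with stride 1.
    flipped = kernel[::-1]
    return [
        sum(padded[start + j] * flipped[j] for j in range(k))
        for start in range(len(padded) - k + 1)
    ]


out = transposed_conv1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0, 1.0], stride=2)
expected_len = transposed_conv_output_size(i=3, s=2, k=3, p=0)
```

With i = 3, s = 2, k = 3, and p = 0 the output has 2 × 2 + 3 = 7 elements, and the three inputs are spread over overlapping windows of the output, which is the upsampling effect described above.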
Referring to fig. 3, the SSD network sets up a plurality of detection feature maps and detects faces of different sizes on different feature maps. A shallow feature map has a small receptive field and is therefore suited to small face detection; as the number of convolution layers increases, the receptive field grows and the resolution of the feature map falls, so a deep feature map is better suited to large face detection. In a deep convolutional neural network, a shallow feature map contains few semantic features, and a small face offers limited resolution and information, so small face detection is a challenging task. A shallow feature map has high resolution and contains more low-level texture features, but its features have not been extracted thoroughly, so it contains fewer semantic features and more noise; a deep feature map has passed through many convolution operations and carries rich semantic information, but it perceives low-level features such as texture poorly. The MDSSD algorithm therefore improves small face detection by introducing context information, that is, it improves face detection performance by fusing a plurality of feature layers. Introducing context information layer by layer can markedly improve the detection capability of the model and enrich the semantic information of the shallow feature maps, but introducing too much context information also introduces a large amount of noise, which harms the detection of low-resolution small faces. This embodiment therefore designs two feature fusion strategies for the face detection task, multilayer fusion and single-layer fusion, and adds feature fusion modules only to the shallow detection layers used for detecting small faces. The multilayer fusion strategy is used for the lower-level feature map, that is, the feature map is fused with the deconvolution layer of a deeper fusion module; the higher-level feature maps undergo only single-layer fusion, that is, each feature layer is fused only with the deconvolution layer of the next module.
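The fusion step itself, an element-wise addition followed by an activation, can be sketched on a single channel. ReLU is assumed as the activation because the text does not name one, and the function names and sample values are illustrative.

```python
def relu(value):
    """Rectified linear unit (assumed activation; the text does not specify one)."""
    return value if value > 0 else 0.0


def fuse(shallow, upsampled):
    """Add two H x W feature maps position by position, then apply the activation."""
    same_shape = len(shallow) == len(upsampled) and all(
        len(a) == len(b) for a, b in zip(shallow, upsampled)
    )
    if not same_shape:
        raise ValueError("feature maps must have matching shapes before fusion")
    return [
        [relu(a + b) for a, b in zip(row_s, row_u)]
        for row_s, row_u in zip(shallow, upsampled)
    ]


shallow = [[0.5, -1.0], [2.0, 0.0]]     # shallow, high-resolution map
upsampled = [[0.25, 0.5], [-3.0, 1.0]]  # deep map after deconvolution
fused = fuse(shallow, upsampled)
```

The shape check reflects why the deconvolution must double the resolution and match the channel count first: addition is only defined when the two maps line up exactly.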
Referring to fig. 4, the MDSSD network uses SSD as its base network, still uses VGG16 as the backbone network, and keeps the original number of convolution kernels and the original model structure, but it removes the dropout layers of Block6 and Block7 of the SSD network. The network also adds two types of feature fusion module for face detection: one multilayer fusion module, Mixed_layer3, and two single-layer fusion modules, Mixed_layer4 and Mixed_layer7. Because the Conv3_3 layer of VGG16 is a shallow layer and the faces detected by this layer have low resolution, fusing it with the Conv4_4 layer alone cannot effectively bring in useful semantic features. The MDSSD network therefore fuses Conv3_3 with the fusion module Mixed_layer4, where Mixed_layer4 is the fusion of Conv4_3 and a convolution layer of Block7, thereby realizing fusion of the convolution layers with the fusion module Mixed_layer7 for face detection. Because Mixed_layer3 and Mixed_layer4 sit early in the network, their data have a larger scale, so the MDSSD model adds an L2Normalization layer after the detection module to reduce the difference from the later detection layers and to increase the differences among the data within the layer.
It should be noted that one-stage face detection algorithms such as SSD remove the region proposal stage, which greatly improves face detection speed but also causes a relatively serious sample imbalance problem. In a one-stage face detection algorithm, an input face image may generate thousands of preselected boxes, of which only a few are candidate boxes containing real faces. The training samples therefore contain a large number of negative samples, that is, background regions; these negative samples contribute most of the loss during training and dominate the direction of the gradient updates, so the model cannot classify faces and backgrounds well. The MDSSD algorithm replaces the cross entropy loss function in the classification network with the Focal loss, which addresses hard-sample learning and positive/negative sample imbalance during model training by adding two balance factors.
Example 2
Referring to fig. 5 and 6, a second embodiment of the present invention, which is different from the first embodiment, provides a verification method of an MDSSD face detection method based on an improved loss function, including:
In this embodiment, clustering the labeled images shows that the cluster centers have 4 different aspect ratios, {0.55, 0.65, 0.75, 1}. When a 300 × 300 face image is input, face pose and the data augmentation of the model cause faces of different scales to correspond to Ground Truth boxes of different aspect ratios: the ratios of small-scale faces are close to {0.65, 0.75, 1}, and the ratios of large-scale faces are close to {0.55, 0.65, 1}.
The number of prior boxes for each detection layer is determined by calculating the scale of the Ground Truth box at each cluster center, and the 17 prior boxes are distributed over 7 different detection layers according to the receptive field size of each detection layer. The specific detection layer settings are shown in the following table:
Table 1: MDSSD detection layer parameter configuration.
To better verify and explain the technical effects of the method of the present invention, this embodiment carries out a comparison test between the conventional SSD algorithm and the method of the present invention, and compares the test results by scientific demonstration to verify the actual effect of the method.
To verify that the method has a higher recall rate and a better detection effect than the conventional method, the conventional SSD algorithm and the method of the present invention are each tested on small faces in an unnatural scene, and the randomly measured results are compared.
Test environment: (1) DELL tower server, Windows 10 operating system, NVIDIA GTX 1080Ti GPU, and Intel Core i7-8700 @ 3.20 GHz;
(2) 32 GB of memory and 8 GB of video memory;
(3) both the SSD model and the MDSSD model were implemented in Python 3.6 on the TensorFlow 1.14 framework.
Table 2: Parameter settings.
Parameter | SSD network | MDSSD network
Backbone network initialization method | VGG16 | SSD
Batch size | 32 | 32
Optimization method | Adam | Adam
Adam_beta1 | 0.9 | 0.9
Adam_beta2 | 0.999 | 0.999
Learning rate | 0.001 | 0.001
Learning rate decay rate | 0.90 | 0.90
Number of iterations | 50000 | 50000
Table 3: Recall rate comparison of the two methods.
Referring to tables 2 and 3, the recall rates of the conventional method and the method of the present invention can be compared directly under the same parameter settings: as the number of training iterations increases, the recall rate of the conventional method gradually falls, whereas that of the method of the present invention remains stable and is always higher than that of the conventional method, which verifies the actual technical effect of the method of the present invention.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.