CN109886159B - Face detection method under non-limited condition - Google Patents


Info

Publication number
CN109886159B
CN109886159B (application CN201910091271.5A)
Authority
CN
China
Prior art keywords
image
network
face
face detection
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910091271.5A
Other languages
Chinese (zh)
Other versions
CN109886159A (en)
Inventor
王慧燕 (Wang Huiyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN201910091271.5A priority Critical patent/CN109886159B/en
Publication of CN109886159A publication Critical patent/CN109886159A/en
Application granted granted Critical
Publication of CN109886159B publication Critical patent/CN109886159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a face detection method under non-limited conditions, comprising the following steps: S1) image preprocessing; S2) design of a face detection network based on deep convolution; S3) forward propagation through the face detection network; S4) non-maximum suppression; S5) output of the final detection result. The advantages of the invention are that the method has a wide application range, achieves state-of-the-art accuracy and speed, improves the accuracy of face coordinates, reduces the probability of false detection, alleviates the vanishing-gradient problem of deep networks, and accelerates network convergence.

Description

Face detection method under non-limited condition
Technical Field
The invention relates to the technical field of computer image processing, and in particular to a face detection method under non-limited (unconstrained) conditions based on a deep convolutional neural network combined with a multi-scale feature pyramid.
Background
Face detection is the basis of many visual tasks and occupies a very important position in image processing and pattern recognition. In recent years, with the rapid development of neural-network-based artificial intelligence, face detection has been applied to ever more visual tasks; for example, ID verification, conference check-in, face-recognition access gates, and face recognition all take a high-precision, high-accuracy face detection method as their precondition.
Early face detection techniques relied on manually constructed features combined with traditional machine learning. For example, the well-known detector based on Haar features and the AdaBoost algorithm scans the image with a sliding window, extracts the Haar features of the target inside the window, and classifies the target with AdaBoost. This approach is both time-consuming and unsatisfactory in accuracy.
With the continuous progress of artificial intelligence, neural-network-based target detection methods have diversified; the most representative are MTCNN, YOLO, SSD, and Faster R-CNN. MTCNN uses small neural networks for classification together with a sliding window for rapid detection; it achieves good accuracy and speed, but generalizes poorly and must be retrained for each specific scene. YOLO and SSD use a deep convolutional network that classifies and regresses offsets for the anchors of each feature map in a single pass; they are fast and generalize well, but lose accuracy. The two-stage Faster R-CNN performs best among these algorithms, but the involvement of fully connected layers greatly increases the computational cost, making it hard to apply in industrial scenarios that require real-time performance.
Disclosure of Invention
The invention aims to provide a face detection method under non-limited conditions that improves the accuracy of face coordinates, reduces the probability of false detection, alleviates the vanishing-gradient problem of deep networks, and accelerates network convergence.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a face detection method under non-limiting conditions comprises the following steps:
s1) image preprocessing
For public data sets collected from the Internet, horizontally flip all images for data augmentation;
for self-collected 1080p video data, extract the video frames usable for face detection, label the faces in each frame manually or by machine, scale down the frame for each face appearing in it, and crop an image containing that face from the reduced frame as a final training image;
s2) designing face detection network based on deep convolution
Extract features with VGG16 as the convolutional backbone, delete the final fully connected layers, and add an Inception structure and a dilated-convolution structure; the whole network comprises 6 blocks; a feature pyramid is used to fuse the features of the 3 lower blocks, whose feature maps are relatively large; a 3 × 3 convolutional layer is added after each fused convolution module; classification and regression loss functions are then attached to the 3 fused lower-layer outputs and to the last-layer outputs of the other 3 convolution modules that are not feature-fused;
subtract the ImageNet mean (104, 117, 123) from the manually labelled multi-scale image samples, normalize to [0, 1], and feed the samples into the network through the data layer for training; initialize the weights of the per-layer classification loss functions from a uniform distribution; optimize the network weights with stochastic gradient descent and train the network model until convergence;
s3) face detection network forward propagation
Input the training images obtained in step S1) into the trained network model for forward propagation, and compute the classification and regression results output at each level of the feature pyramid, where the classification result is the probability that the anchor target is a face or the background, and the regression result is the offset of the prediction relative to the anchor;
screen out the anchors whose face probability exceeds a threshold, take their corresponding predicted offsets, and correct the anchor coordinates with these offsets to obtain a preliminary prediction result P1;
s4) applying a non-maximum suppression algorithm
Sort the preliminary detections P1 by probability and filter out the detections whose probability is not a local maximum, obtaining the prediction result P2;
s5) obtaining the final detection result
The prediction result consists of the coordinates (x1, y1) of the top-left corner and (x2, y2) of the bottom-right corner of the target box; predictions extending beyond the image are clipped:
if x2 is greater than the image width, set x2 to the image width; if y2 is greater than the image height, set y2 to the image height; if x1 or y1 is less than 0, set it to 0;
the final detection result P3 was obtained.
Further, in step S1), the scale reduction of each face appearing in the video frames reduces the original image to 0.9×, 0.8×, and 0.7× of its size; a 700 × 700 pixel image containing the face is cropped from the reduced image, and a 640 × 640 pixel image is then cropped from the 700 × 700 image as the final training image.
Compared with the prior art, the invention has the following advantages:
the invention relates to a face detection method under non-limited conditions, which takes VGG16 as a convolution layer to extract features, uses a feature pyramid structure to design a face detection network, and combines a classification loss function and a regression loss function of each layer to improve the face detection effect, in particular the small target face detection effect. The invention can be applied to various types of video monitoring and detecting systems, particularly to the monitoring of a face bayonet camera, has wide application range, can achieve the effect and speed of state-of-art level, is beneficial to improving the coordinate accuracy of pedestrians, reducing the occurrence probability of false detection, relieving the gradient dispersion problem of a deep network and accelerating the network convergence process.
Drawings
Fig. 1 is a schematic flow chart of the face detection method under non-limited conditions according to the present invention.
Fig. 2 is a schematic diagram of the face detection network structure of the face detection method under non-limited conditions according to the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
According to the face detection method under non-limited conditions of the invention, end-to-end face detection is realized through the trained network model; when a video frame enters the network, the probabilities of the detection results and the position information of the targets are output; after non-maximum suppression and probability screening, the exact coordinates of each face are obtained.
A face detection method under non-limited conditions comprises the following steps:
s1) image preprocessing
For public data sets collected from the Internet, horizontal flipping is adopted for data augmentation;
for self-collected 1080p video data, extract the video frames usable for face detection, label the faces in each frame manually or by machine, and scale down the frame for each face appearing in it: reduce the original image to 0.9×, 0.8×, and 0.7× of its size, crop a 700 × 700 pixel image containing the face from the reduced image, and then crop a 640 × 640 pixel image from the 700 × 700 image as the final training image.
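The multi-scale crop scheme above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `face_center` annotation is an assumed labelling format, and the nearest-neighbour downscale stands in for a real image-library resize (e.g. OpenCV).

```python
import numpy as np

def scale_and_crop(img, face_center, scales=(0.9, 0.8, 0.7)):
    """Shrink the frame by each scale, cut a 700x700 window around a
    labelled face centre, then take the 640x640 training crop inside it."""
    crops = []
    h, w = img.shape[:2]
    for s in scales:
        nh, nw = int(h * s), int(w * s)
        # nearest-neighbour downscale (stand-in for a real resize)
        ys = np.minimum((np.arange(nh) / s).astype(int), h - 1)
        xs = np.minimum((np.arange(nw) / s).astype(int), w - 1)
        small = img[ys][:, xs]
        # 700x700 window roughly centred on the (scaled) face position
        cx, cy = int(face_center[0] * s), int(face_center[1] * s)
        x0 = int(np.clip(cx - 350, 0, max(nw - 700, 0)))
        y0 = int(np.clip(cy - 350, 0, max(nh - 700, 0)))
        win = small[y0:y0 + 700, x0:x0 + 700]
        crops.append(win[:640, :640])  # final 640x640 training image
    return crops
```

For a 1080p frame, even the smallest scale (0.7) leaves more than 700 pixels in each dimension, so every crop is a full 640 × 640 image.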
S2) designing face detection network based on deep convolution
Extract features with VGG16 as the convolutional backbone, delete the final fully connected layers, and add an Inception structure and a dilated-convolution structure. The whole network comprises 6 blocks, obtained by cutting the network into 6 parts whose feature maps differ in size. A feature pyramid, i.e. features at multiple scales, is used to fuse the 3 lower blocks. A 3 × 3 convolutional layer is added after each fusion module; it does not change the feature-map size and prevents aliasing after features from two different levels are fused. Classification and regression loss functions are then attached to the 3 fused lower-layer outputs and to the last-layer outputs of the other 3 convolution modules that are not feature-fused.
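The top-down fusion of the pyramid levels can be sketched as follows. This is an illustrative sketch only: the channel counts and map sizes are assumptions rather than the patent's exact VGG16 dimensions, and the 3 × 3 anti-aliasing convolution that follows each fusion is omitted.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fuse_top_down(features):
    """Top-down feature-pyramid fusion: each higher-level (smaller) map is
    upsampled and added to the next lower (larger) one, so low-level maps
    gain high-level semantics. `features` is ordered largest to smallest."""
    fused = [features[-1]]  # start from the deepest, smallest map
    for f in reversed(features[:-1]):
        top = upsample2x(fused[0])
        # crop in case of odd sizes, then fuse by elementwise addition
        fused.insert(0, f + top[:, :f.shape[1], :f.shape[2]])
    return fused
```

Each element of the returned list keeps its original spatial size, which is what lets a per-level classification and regression head be attached afterwards.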
subtract the ImageNet mean (104, 117, 123) from the manually labelled multi-scale image samples, normalize to [0, 1], and feed the samples into the network through the data layer for training; the training data comprise the images and the coordinates of all faces in them; initialize the weights of the per-layer classification loss functions from a uniform distribution; optimize the network weights with stochastic gradient descent and train the network model until convergence.
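The input normalization amounts to a per-channel mean subtraction followed by scaling. The text leaves the exact order and scaling factor ambiguous; the sketch below assumes mean subtraction followed by division by 255, which is one common reading.

```python
import numpy as np

# (104, 117, 123) is the BGR channel mean commonly used with VGG-style
# models, matching the mean listed in the text.
IMAGENET_MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def preprocess(img):
    """Subtract the per-channel mean, then scale by 1/255. The division
    factor is an assumption; the patent only says 'normalize to [0, 1]'."""
    x = img.astype(np.float32) - IMAGENET_MEAN
    return x / 255.0
```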
S3) face detection network forward propagation
Input the training images obtained in step S1) into the trained network model for forward propagation. The classification result output at each level of the feature pyramid is computed with a Softmax function, and the regression result with a logistic regression function; the classification result is the probability that the anchor target is a face or the background, and the regression result is the offset of the prediction relative to the anchor. The regression offset is offset = {dx, dy, dw, dh}, where dx and dy are the horizontal and vertical offsets of the prediction from the currently set anchor box, and dw and dh are its width and height multipliers. An anchor is anchor = {x, y, w, h}, where (x, y) is the centre coordinate of the anchor box and w and h are its width and height;
screen out the anchors whose face probability exceeds the threshold, take their corresponding predicted offsets, and correct the anchor coordinates with these offsets to obtain the preliminary prediction P1 = {x + dx − w × dw/2, y + dy − h × dh/2, x + dx + w × dw/2, y + dy + h × dh/2}, i.e. the coordinates of the top-left and bottom-right corners.
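The P1 correction formula can be written directly as code; a minimal NumPy sketch assuming the anchor = {x, y, w, h} and offset = {dx, dy, dw, dh} layouts defined above, with one row per anchor:

```python
import numpy as np

def decode(anchors, offsets):
    """Apply predicted offsets {dx, dy, dw, dh} to anchors {x, y, w, h}
    (centre format), returning (x1, y1, x2, y2) corner boxes exactly as
    in the P1 formula."""
    x, y, w, h = anchors.T
    dx, dy, dw, dh = offsets.T
    cx, cy = x + dx, y + dy          # shifted centre
    hw, hh = w * dw / 2, h * dh / 2  # half of the scaled width / height
    return np.stack([cx - hw, cy - hh, cx + hw, cy + hh], axis=1)
```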
S4) applying a non-maximum suppression algorithm
Sort the preliminary detections P1 by probability and perform non-maximum suppression: traverse the candidate boxes and remove every lower-probability box whose intersection-over-union with a higher-probability box exceeds 0.35, thereby filtering out the detections whose probability is not a local maximum and obtaining the prediction result P2.
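This suppression step can be sketched as standard greedy NMS with the 0.35 IoU threshold given above (a minimal NumPy sketch, not the patent's exact implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.35):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box whose IoU with it exceeds the threshold,
    and repeat. Boxes are rows of (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```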
S5) obtaining the final detection result
The prediction result P2 consists of the coordinates (x1, y1) of the top-left corner and (x2, y2) of the bottom-right corner of the target box; predictions extending beyond the image are clipped:
if x2 is greater than the image width, set x2 to the image width; if y2 is greater than the image height, set y2 to the image height; if x1 or y1 is less than 0, set it to 0;
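The clipping rules amount to a coordinate clamp; a NumPy sketch, treating (x2, y2) as the bottom-right corner consistently with the P1 formula:

```python
import numpy as np

def clip_boxes(boxes, width, height):
    """Clamp (x1, y1, x2, y2) boxes to the image: negative corners go to 0,
    and corners past the right/bottom edge go to the image width/height."""
    boxes = boxes.copy()
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, width)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, height)
    return boxes
```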
the final detection result P3 was obtained.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the spirit of the present invention, and these modifications and improvements should also be considered as within the scope of the present invention.

Claims (2)

1. A face detection method under non-limited conditions, characterized by comprising the following steps:
s1) image preprocessing
For public data sets collected from the Internet, horizontally flip all images for data augmentation;
for self-collected 1080p video data, extract the video frames usable for face detection, label the faces in each frame manually or by machine, scale down the frame for each face appearing in it, and crop an image containing that face from the reduced frame as a final training image;
s2) designing face detection network based on deep convolution
Extract features with VGG16 as the convolutional backbone, delete the final fully connected layers, and add an Inception structure and a dilated-convolution structure; the whole network comprises 6 blocks; a feature pyramid is used to fuse the features of the 3 lower blocks, whose feature maps are relatively large; a 3 × 3 convolutional layer is added after each fused convolution module; classification and regression loss functions are then attached to the 3 fused lower-layer outputs and to the last-layer outputs of the other 3 convolution modules that are not feature-fused;
subtract the ImageNet mean (104, 117, 123) from the manually labelled multi-scale image samples, normalize to [0, 1], and feed the samples into the network through the data layer for training; initialize the weights of the per-layer classification loss functions from a uniform distribution; optimize the network weights with stochastic gradient descent and train the network model until convergence;
s3) face detection network forward propagation
Input the training images obtained in step S1) into the trained network model for forward propagation, and compute the classification and regression results output at each level of the feature pyramid, where the classification result is the probability that the anchor target is a face or the background, and the regression result is the offset of the prediction relative to the anchor;
screen out the anchors whose face probability exceeds a threshold, take their corresponding predicted offsets, and correct the anchor coordinates with these offsets to obtain a preliminary prediction result P1;
s4) applying a non-maximum suppression algorithm
Sort the preliminary detections P1 by probability and filter out the detections whose probability is not a local maximum, obtaining the prediction result P2;
s5) obtaining the final detection result
The prediction result consists of the coordinates (x1, y1) of the top-left corner and (x2, y2) of the bottom-right corner of the target box; predictions extending beyond the image are clipped:
if x2 is greater than the image width, set x2 to the image width; if y2 is greater than the image height, set y2 to the image height; if x1 or y1 is less than 0, set it to 0;
the final detection result P3 was obtained.
2. The method according to claim 1, characterized in that in step S1), the scale reduction of each face appearing in the video frames reduces the original image to 0.9×, 0.8×, and 0.7× of its size; a 700 × 700 pixel image containing the face is cropped from the reduced image, and a 640 × 640 pixel image is then cropped from the 700 × 700 image as the final training image.
CN201910091271.5A 2019-01-30 2019-01-30 Face detection method under non-limited condition Active CN109886159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910091271.5A CN109886159B (en) 2019-01-30 2019-01-30 Face detection method under non-limited condition

Publications (2)

Publication Number Publication Date
CN109886159A CN109886159A (en) 2019-06-14
CN109886159B true CN109886159B (en) 2021-03-26

Family

ID=66927482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910091271.5A Active CN109886159B (en) 2019-01-30 2019-01-30 Face detection method under non-limited condition

Country Status (1)

Country Link
CN (1) CN109886159B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827432B (en) * 2019-11-11 2021-12-28 深圳算子科技有限公司 Class attendance checking method and system based on face recognition
CN111985439A (en) * 2020-08-31 2020-11-24 中移(杭州)信息技术有限公司 Face detection method, device, equipment and storage medium
CN112734682B (en) * 2020-12-31 2023-08-01 杭州芯炬视人工智能科技有限公司 Face detection surface vector data acceleration method, system, computer device and storage medium
CN112967254A (en) * 2021-03-08 2021-06-15 中国计量大学 Lung disease identification and detection method based on chest CT image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830196A (en) * 2018-05-31 2018-11-16 上海贵和软件技术有限公司 Pedestrian detection method based on feature pyramid network
CN108898078A * 2018-06-15 2018-11-27 University of Shanghai for Science and Technology Real-time traffic sign detection and recognition method based on a multi-scale deconvolutional neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157814B2 (en) * 2016-11-15 2021-10-26 Google Llc Efficient convolutional neural networks and techniques to reduce associated computational costs
WO2018140596A2 (en) * 2017-01-27 2018-08-02 Arterys Inc. Automated segmentation utilizing fully convolutional networks
CN107403141B (en) * 2017-07-05 2020-01-10 中国科学院自动化研究所 Face detection method and device, computer readable storage medium and equipment
CN107944442B * 2017-11-09 2019-08-13 北京智芯原动科技有限公司 Object detection device and method based on an improved convolutional neural network
CN108520219B (en) * 2018-03-30 2020-05-12 台州智必安科技有限责任公司 Multi-scale rapid face detection method based on convolutional neural network feature fusion
CN108734211B (en) * 2018-05-17 2019-12-24 腾讯科技(深圳)有限公司 Image processing method and device
CN109034090A * 2018-08-07 2018-12-18 Nantong University Emotion recognition system and method based on body movement
CN109145854A * 2018-08-31 2019-01-04 Southeast University Face detection method based on a cascaded convolutional neural network structure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Very Deep Convolutional Networks for Text Classification";Holger Schwenk等;《Computer Science 》;20170127;1-10 *
"基于深度学习的非限定条件下人脸识别研究";夏洋洋;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170715;I138-825 *
"深度学习辅助的多行人跟踪算法";王慧燕等;《中国图象图形学报》;20170316;第22卷(第3期);349-357 *

Also Published As

Publication number Publication date
CN109886159A (en) 2019-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant