CN108875624B - Face detection method based on multi-scale cascade dense connection neural network - Google Patents

Face detection method based on multi-scale cascade dense connection neural network

Info

Publication number
CN108875624B
Authority
CN
China
Prior art keywords
network
face
dense connection
convolution
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810605067.6A
Other languages
Chinese (zh)
Other versions
CN108875624A (en)
Inventor
秦华标
黄波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810605067.6A priority Critical patent/CN108875624B/en
Publication of CN108875624A publication Critical patent/CN108875624A/en
Application granted granted Critical
Publication of CN108875624B publication Critical patent/CN108875624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face detection method based on a multi-scale cascade dense connection neural network. It belongs to the fields of image processing and computer vision and is suitable for intelligent systems such as face recognition, facial expression recognition, and driver fatigue detection. The invention comprises a construction method for a regional nomination network and a construction method for a multi-level dense connection convolution network model, and specifically comprises the following steps: collecting face pictures annotated with face rectangular frame (bounding box) information to form training data sets that conform to the input conditions of each sub-network; constructing a cascade dense connection neural network with strong generalization capability; training each sub-network with its training data set to obtain an overall network model; and finally detecting multi-pose human faces in pictures with the overall network model. By introducing a dense connection mode into the network, the invention enables the network to fully extract face feature information and improves the accuracy of face detection under multiple poses.

Description

Face detection method based on multi-scale cascade dense connection neural network
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a face detection method based on a multi-scale cascade dense connection neural network.
Background
Human face images contain rich information, and their research and analysis are an important direction and research hotspot in the field of computer vision. In artificial intelligence applications such as face recognition, crowd monitoring, photography, human-computer interaction, and fatigue-driving detection, face detection is the key first step: only after a face has been detected can the later analysis and research be of value.
In recent decades, many scholars have intensively studied multi-pose face detection algorithms, which generally fall into two categories: methods based on traditional machine learning and methods based on deep learning.
Traditional machine learning algorithms generally train a classifier on a large number of samples to judge whether a region contains a face. In the testing phase, the most common approach is a sliding-window algorithm. First, the input image is scaled to various sizes to create an image pyramid. Then, at each position of each pyramid level, a patch of fixed size, called a window, is taken. Next, features are extracted from this window. Finally, the trained classifier judges whether the window contains a face. The number of windows a face detection algorithm has to classify is very large; for a picture with a resolution of 640 × 480 there are roughly hundreds of thousands of windows, and processing them accurately in a short time is a problem every face detection algorithm must consider. In addition, during feature extraction, traditional machine learning algorithms rely on hand-crafted features such as Haar features, Local Binary Pattern (LBP) features, and Histogram of Oriented Gradients (HOG) features. Because hand-crafted features encode the prior knowledge of their designers, they achieve high accuracy only for faces against certain specific backgrounds and are difficult to apply to complex conditions such as multi-pose faces in three-dimensional space. Methods based on deep learning now dominate the field of face detection; the main neural network architectures include the Convolutional Neural Network (CNN), the Deep Belief Network (DBN), and the Auto-encoder, among which the convolutional neural network has been used most successfully in face detection. For example, cascaded convolutional neural networks (Cascade CNN) and the multi-task convolutional neural network (MTCNN) use convolutional layers to automatically extract stable face features, and their detection performance is greatly improved over traditional machine learning algorithms. However, current deep-learning-based face detection models are data-driven: the network fits the training data set, its generalization ability is weak, and it is difficult to detect multi-pose faces when no multi-pose training data set participates in training.
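As a rough illustration of the sliding-window count mentioned above, the following sketch enumerates window positions over an image pyramid (the 12 × 12 window, stride of 2, and pyramid factor of 0.709 are assumptions chosen for the example; the text only states the order of magnitude).

```python
def count_sliding_windows(width=640, height=480, win=12, stride=2, factor=0.709):
    """Count all sliding-window positions over an image pyramid."""
    total = 0
    while min(width, height) >= win:
        cols = (width - win) // stride + 1
        rows = (height - win) // stride + 1
        total += cols * rows
        width, height = int(width * factor), int(height * factor)
    return total

# count_sliding_windows() is on the order of 1e5 windows for a 640 x 480 picture
```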
Therefore, a multi-pose face detection algorithm with stronger generalization performance is needed, one that can improve the face detection rate even when no multi-pose face data set participates in training.
Disclosure of Invention
The invention aims to solve the problem that face detection is easily affected by pose changes and provides a face detection method based on a multi-scale cascade dense connection neural network. The invention designs a cascade dense connection network with stronger feature extraction and generalization capability, trains the network model with the collected and processed training data sets, and finally detects human faces with the trained model, thereby realizing an algorithm that performs well on faces under multiple poses.
The invention is realized by at least one of the following technical solutions.
A face detection method based on a multi-scale cascade dense connection neural network comprises a construction method of a regional nomination network and a construction method of a multi-level dense connection convolution network model:
the construction method of the regional nomination network comprises the following steps: on several convolution layers of the regional nomination network, scores and bounding frames are predicted for regions that may contain a face; region blocks whose scores are smaller than a set threshold are then eliminated, and non-maximum suppression is applied to the remaining region blocks to obtain the final regions that may contain a face; finally, the predicted face regions are sent into the second-level dense connection convolution network;
the construction method of the multi-level dense connection convolution network model comprises the following steps: convolution layers continuously extract increasingly abstract face features, while the features extracted by lower convolution layers are concatenated with those extracted by higher convolution layers; a global average pooling layer is then attached after the last convolution layer, and fine classification and frame regression are performed on the face regions predicted by the previous level; finally, the remaining face regions are sent to the third-level dense connection convolution network for even finer classification and frame regression, yielding the final predicted face regions.
Further, different convolution layers of the regional nomination network are used to extract more high-quality candidate regions containing faces (that is, candidate regions in which the face occupies as large a proportion as possible), preventing missed detections caused by extracting too few candidate regions; a classification layer and a regression layer are attached to each of the last two convolution layers of the regional nomination network to predict face region scores and perform frame regression; finally, candidate frames whose scores are lower than a threshold T1 are eliminated, and non-maximum suppression is applied to the remaining candidate frames to obtain the final prediction result; T1 ranges from 0 to 1.
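The score thresholding and non-maximum suppression described above can be sketched as follows (illustrative NumPy code; the box layout and the overlap threshold of 0.5 are assumptions not fixed by the text).

```python
import numpy as np

def filter_and_nms(boxes, scores, score_thresh, iou_thresh=0.5):
    """Drop boxes scoring below score_thresh, then apply non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2] candidate face frames
    scores: (N,)   array of face classification scores in [0, 1]
    """
    keep_mask = scores >= score_thresh          # eliminate low-score region blocks
    boxes, scores = boxes[keep_mask], scores[keep_mask]

    order = scores.argsort()[::-1]              # process highest-scoring boxes first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # intersection of the current best box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only boxes that do not overlap the current box too much
        order = order[1:][iou <= iou_thresh]
    return boxes[kept], scores[kept]
```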
Furthermore, a global average pooling layer is introduced to replace the traditional fully connected layer for face classification and regression; the global average pooling layer is attached after the last convolution layer of each level of dense connection network and computes the overall average value of each feature map output by the preceding convolution layers, so that local face information is fully learned while the overfitting caused by introducing spatial structure information is avoided; finally, a multi-class (softmax) layer is attached after the global average pooling layer to classify and regress the face regions predicted by the previous stage.
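A minimal sketch of such a head in PyTorch (illustrative only; the channel count and the use of 1 × 1 convolutions are assumptions): the last convolution produces one feature map per output, global average pooling reduces each map to its overall average, and a softmax layer gives the face scores.

```python
import torch
import torch.nn as nn

class GlobalAvgPoolHead(nn.Module):
    """Classification/regression head built on global average pooling instead of a
    fully connected layer: the last convolution produces one feature map per class,
    each map is reduced to its overall average, and softmax gives the class scores."""

    def __init__(self, in_channels=128, num_classes=2):
        super().__init__()
        self.cls_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # one map per class
        self.reg_conv = nn.Conv2d(in_channels, 4, kernel_size=1)            # one map per frame offset
        self.gap = nn.AdaptiveAvgPool2d(1)   # overall average value of each feature map

    def forward(self, feature_maps):
        scores = torch.softmax(self.gap(self.cls_conv(feature_maps)).flatten(1), dim=1)
        offsets = self.gap(self.reg_conv(feature_maps)).flatten(1)
        return scores, offsets
```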
Furthermore, a cascaded dense connection convolution network is constructed to extract face features and perform fine classification and regression. Each level of dense connection network may contain several dense connection blocks, each dense connection block is composed of several convolution layers, and the convolution layers within the same dense connection block produce feature maps of the same size; within the same dense connection block, the input of each convolution layer is the concatenation of the feature maps produced by all preceding convolution layers; two adjacent dense connection blocks are joined by a transition layer; the second-level and third-level networks consist of dense connection convolution networks containing two and three dense connection blocks respectively, which step by step eliminate and positionally refine the face regions predicted by the first level; the transition layer comprises a convolution layer and a pooling layer.
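A minimal sketch of a dense connection block and transition layer in PyTorch (illustrative only; the growth rate, layer counts, kernel sizes, and the choice of max pooling are assumptions not fixed by the text).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connection block: every convolution layer takes the concatenation of
    all previously produced feature maps as input, and all maps share one size."""

    def __init__(self, in_channels, growth_rate=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # input = concatenation of all earlier feature maps in this block
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """Transition between adjacent dense blocks: one convolution plus one pooling layer."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))
```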
Further, the face detection method based on the multi-scale cascade dense connection neural network comprises the following steps: (1) collecting face pictures annotated with face rectangular frame information to form an initial training data set D1, and using D1 to generate a sub-training data set D2 that conforms to the first-level network input format; (2) designing a regional nomination network capable of extracting more high-quality candidate regions, training this sub-network model with the sub-training data set D2, then sending the initial training data set D1 into the sub-network model for detection and generating the next-level training data D3 from the detection results; (3) designing a cascaded (two-stage) dense connection network with stronger feature extraction and generalization capability, sending D3 into the first-stage dense connection network for training to obtain its sub-network model, then sending D1 into the network formed by the regional nomination network and the first-level dense connection network for detection, generating the training data set D4 of the next-level dense connection network from the detection results, and using D4 to train the second-stage cascaded dense connection network; (4) detecting multi-pose human faces in the picture under test with the trained network model.
Further, step (1) includes: the face data set D1 is preprocessed into the sub-training data set D2 that conforms to the input format of the first-level network in the cascade, with a resolution of 12 × 12. The sub-training data set contains three types of training pictures: face images, partial-face images, and non-face images. Their label information is made as follows: face images are labeled 1, partial-face images are labeled -1, and non-face images are labeled 0. Face and partial-face images are also annotated with face rectangular frame information, while the face rectangular frame information of non-face images is set to -1.
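A minimal sketch of how one training record could be encoded under this labeling convention (the field names and data layout are assumptions for illustration; only the label values 1 / -1 / 0, the -1 frame for non-face crops, and the 12 × 12 resolution come from the text).

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np

@dataclass
class TrainingSample:
    """One 12 x 12 training crop for the first-level network."""
    image: np.ndarray                        # 12 x 12 x 3 cropped picture
    label: int                               # 1 = face, -1 = partial face, 0 = non-face
    box: Tuple[float, float, float, float]   # face rectangular frame; -1 entries for non-face crops

def make_sample(image: np.ndarray, label: int, box=None) -> TrainingSample:
    # non-face crops carry -1 as their face rectangular frame information
    if label == 0 or box is None:
        box = (-1.0, -1.0, -1.0, -1.0)
    return TrainingSample(image=image, label=label, box=box)
```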
Further, step (2) includes: extracting more high-quality candidate regions containing faces with different convolution layers of the regional nomination network, preventing missed detections caused by extracting too few candidate regions. A classification layer and a regression layer are attached to each of the last two convolution layers of the regional nomination network to predict face region scores and perform frame regression; finally, candidate frames whose scores are lower than the threshold T1 (T1 ranges from 0 to 1; the method takes 0.9) are eliminated, and non-maximum suppression is applied to the remaining candidate frames to obtain the final prediction result. The regional nomination network is then trained with the preprocessed data set D2; after training, D1 is fed into the network for detection, and the Intersection over Union (IoU) between each detected face rectangular frame and the ground-truth face rectangular frame of the corresponding picture in D1 is computed: detections with IoU > 0.85 are labeled face samples, 0.55 < IoU < 0.7 partial-face samples, and IoU < 0.35 non-face samples, generating the training data set D3 of the next-level network. The image resolution of D3 is 24 × 24.
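A sketch of the IoU computation and sample-labeling rule (illustrative; the function names are assumptions, while the thresholds follow the text; detections falling in the unlisted IoU ranges are left unassigned here because the text does not label them).

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] rectangles."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_detection(detected_box, ground_truth_box):
    """Assign the next-level training label from the IoU with the ground truth."""
    overlap = iou(detected_box, ground_truth_box)
    if overlap > 0.85:
        return 1        # face sample
    if 0.55 < overlap < 0.7:
        return -1       # partial-face sample
    if overlap < 0.35:
        return 0        # non-face sample
    return None         # other IoU ranges are not assigned a label by the text
```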
Further, step (3) includes: constructing a cascaded dense connection convolution network to extract face features and perform fine classification and regression. Each level of dense connection network may contain several dense connection blocks, each dense connection block is composed of several convolution layers, and the convolution layers within the same dense connection block produce feature maps of the same size; within the same dense connection block, the input of each convolution layer is the concatenation of the feature maps produced by all preceding convolution layers; two adjacent dense connection blocks are joined by a transition layer (a convolution layer and a pooling layer); a global average pooling layer is attached after the last convolution layer of the dense connection network and computes the overall average value of each feature map output by the preceding convolution layers, the number of feature maps being consistent with the number of classification categories. The first-level dense connection network is trained with D3; after training, D1 is sent into the network formed by the regional nomination network and the first-level dense connection network for cascade detection, and the sub-training data set D4 is generated, with an image resolution of 48 × 48, using the same method as in step (2). Finally, the second-level dense connection network is trained with D4.
Further, step (4) includes: cascading the regional nomination network and the two-stage dense connection network into a three-stage cascaded network. A new picture is then transformed with pyramid scaling at a scaling ratio of 0.709, and the transformed pictures are input into the first-stage regional nomination network model, which produces a large number of face classification scores and face rectangular frame regression vectors; face rectangular frames whose scores are lower than the threshold T1 (T1 ranges from 0 to 1; the method takes 0.9) are eliminated, and non-maximum suppression is applied to the remaining face rectangular frames to obtain this stage's prediction result. The prediction results are then input into the second-level network model, face rectangular frames whose scores are lower than the threshold T2 (T2 ranges from 0 to 1; the method takes 0.7) are again eliminated, and the non-maximum suppression algorithm is used to remove face rectangular frames with large overlap. Finally, the prediction results are input into the third-level network model, which outputs face classification scores and face rectangular frame information; face rectangular frames whose scores are lower than the threshold T3 (T3 ranges from 0 to 1; the method takes 0.8) are eliminated, and the non-maximum suppression algorithm is used to remove face rectangular frames with large overlap, yielding the final prediction result.
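The image-pyramid construction with a scaling ratio of 0.709 can be sketched as follows (illustrative code; the minimum side of 12 pixels is an assumption tied to the 12 × 12 input of the first-level network, not an explicit figure from the text).

```python
import cv2

def build_pyramid_scales(width, height, scale_factor=0.709, min_side=12):
    """Return the list of scales for pyramid transformation of one picture."""
    scales, scale = [], 1.0
    while min(width, height) * scale >= min_side:
        scales.append(scale)
        scale *= scale_factor            # each level is 0.709 times the previous one
    return scales

def build_pyramid(image, scale_factor=0.709, min_side=12):
    """Resize the picture to every pyramid scale before the first-stage network."""
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(w * s), int(h * s)))
            for s in build_pyramid_scales(w, h, scale_factor, min_side)]
```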
Compared with the prior art, the invention has the following advantages and effects: by making the regional nomination network predict more candidate regions, missed face detections are effectively prevented; at the same time, a dense connection network with stronger feature extraction capability is introduced and a global average pooling layer replaces the fully connected layer, which further improves the generalization capability of the network. The model of the invention therefore performs better under multiple poses.
Drawings
Fig. 1a and 1b are flow charts of a training phase and a testing phase, respectively.
Fig. 2a, 2b, and 2c are network configuration diagrams of three sub-networks, respectively.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings, but the practice of the present invention is not limited thereto. It is noted that the following processes, if not described in particular detail, are all realizable by one skilled in the art with reference to the prior art.
In this embodiment, the multi-pose face detection algorithm based on the multi-scale cascade dense connection neural network can, to a certain extent, overcome the influence of pose variation.
In this embodiment, the training phase, shown in Fig. 1a, proceeds as follows.
Step 1: First, a training subset D2 conforming to the first-level network input format, with a resolution of 12 × 12, is made by randomly cropping three types of sub-image blocks from the existing face data set D1: face images, partial-face images, and non-face images. Their label information is produced as follows: face images are labeled 1, partial-face images are labeled -1, and non-face images are labeled 0. Face and partial-face images are also annotated with face rectangular frame information, while the rectangular frame information of non-face images is set to -1. The 12 × 12 sub-training data set D2 is then input into the first-level network (the regional nomination network), whose parameters are updated by stochastic gradient descent for 22 epochs in total (one full pass over the training data set is called an epoch); the initial learning rate is set to 0.01, lowered to 0.001 at the 6th epoch and to 0.0001 at the 16th epoch, until training is completed.
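A minimal sketch of this stochastic gradient descent schedule in PyTorch (illustrative only; the momentum value, model interface, and data-loader format are assumptions, and loss_fn stands for the objective given in formulas (1)–(3) below).

```python
import torch

def train_first_level(model, train_loader, loss_fn, epochs=22):
    """SGD training with the step schedule 0.01 -> 0.001 (epoch 6) -> 0.0001 (epoch 16)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for epoch in range(1, epochs + 1):
        if epoch == 6:
            for group in optimizer.param_groups:
                group["lr"] = 0.001
        elif epoch == 16:
            for group in optimizer.param_groups:
                group["lr"] = 0.0001
        for images, labels, boxes in train_loader:
            optimizer.zero_grad()
            scores, offsets = model(images)
            loss = loss_fn(scores, offsets, labels, boxes)
            loss.backward()
            optimizer.step()
```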
The objective function of the regional nomination network is as follows:

$$\min \sum_{i=1}^{N} \sum_{j=1}^{2} \alpha_j \, \beta_i^{j} L_i^{j} \qquad (1)$$

where N represents the number of training samples, j = 1 denotes the classification task and j = 2 the bounding-box regression task, i indexes the i-th sample, $\alpha_j$ is the weight of each task, $\beta_i^{j} \in \{0, 1\}$ is the sample-type indicator of sample $x_i$, and $L_i^{j}$ is the loss function of the corresponding task: (2) is the loss of the classification task and (3) the loss of the bounding-box regression task.

$$L_i^{1} = -\left( y_i^{1} \log p_i + (1 - y_i^{1}) \log (1 - p_i) \right) \qquad (2)$$

where $y_i^{1}$ is the label of sample $x_i$, taking the value 0 or 1 (0 for non-face, 1 for face), and $p_i$ is the probability, predicted by the network, that sample $x_i$ is a face.

$$L_i^{2} = \left\| \hat{y}_i^{2} - y_i^{2} \right\|_2^{2} \qquad (3)$$

where i in formula (3) denotes the i-th sample, $\hat{y}_i^{2}$ is the bounding-box position increment predicted by the network for each candidate window, and $y_i^{2}$ is the ground-truth bounding-box position increment, a four-dimensional real vector.
Step 2: The trained regional nomination network model is used to generate the training data D3 of the second-level network, and the second-level network is trained. First, D1 is sent to the regional nomination network for detection, yielding face score predictions and face rectangular frame predictions; face rectangular frames whose scores are below the threshold T1 (T1 ranges from 0 to 1; the method takes 0.9) are culled, and non-maximum suppression is applied to the remaining face rectangular frames to obtain the prediction result. The Intersection over Union (IoU) between each face rectangular frame in the prediction result and the ground-truth face rectangular frame of the corresponding picture in D1 is then computed: IoU > 0.85 is labeled a face sample, 0.55 < IoU < 0.7 a partial-face sample, and IoU < 0.35 a non-face sample, generating the training data set D3 of the next-level network; the image resolution of D3 is 24 × 24. The generated training data set D3 is sent to the second-level network (the first-level dense connection network) for training, for 18 epochs in total; the initial learning rate is set to 0.01, lowered to 0.001 at the 6th epoch and to 0.0001 at the 12th epoch, until training is finished. The same loss function is used as in the first-stage network.
Step 3: The models trained in the first two levels are used to generate the training data set D4 of the third-level network and to complete the training of the third-level network. D1 is sent into the network formed by the regional nomination network and the first-level dense connection network for cascade detection, and the sub-training data set D4, with an image resolution of 48 × 48, is generated by the same method as in step 2. Finally, the second-level dense connection network is trained with D4 for 18 epochs; the initial learning rate is set to 0.01, lowered to 0.001 at the 6th epoch and to 0.0001 at the 12th epoch, until training is finished. The same loss function is used as in the first-stage network.
In this embodiment, the testing stage, shown in Fig. 1b, proceeds as follows: a new picture is transformed with pyramid scaling at a ratio of 0.709, and the transformed pictures are input into the first-stage regional nomination network model, which produces a large number of face classification scores and face rectangular frame regression vectors; face rectangular frames whose scores are below the threshold T1 (T1 ranges from 0 to 1; the method takes 0.9) are culled, and non-maximum suppression is applied to the remaining face rectangular frames to obtain this stage's prediction result. The prediction results are then input into the second-level network model, face rectangular frames whose scores are below the threshold T2 (T2 ranges from 0 to 1; the method takes 0.7) are again culled, and the non-maximum suppression algorithm removes face rectangular frames with large overlap. Finally, the prediction results are input into the third-level network model, which outputs face classification scores and face rectangular frame information; face rectangular frames whose scores are below the threshold T3 (T3 ranges from 0 to 1; the method takes 0.8) are culled, and the non-maximum suppression algorithm removes face rectangular frames with large overlap, yielding the final prediction result.
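A minimal sketch of this three-stage test-time cascade (illustrative orchestration only; the function and model names and their interfaces are assumptions, and build_pyramid and filter_and_nms refer to the earlier sketches; the thresholds 0.9 / 0.7 / 0.8 follow the text).

```python
import numpy as np

def detect_faces(image, rpn_model, dense_net_1, dense_net_2, t1=0.9, t2=0.7, t3=0.8):
    """Three-stage cascade: regional nomination network, then two dense connection networks."""
    # Stage 1: run the regional nomination network over the image pyramid (ratio 0.709)
    all_boxes, all_scores = [], []
    for scaled in build_pyramid(image, scale_factor=0.709):
        boxes, scores = rpn_model(scaled)   # frames assumed mapped back to original coordinates
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    boxes, scores = filter_and_nms(boxes, scores, score_thresh=t1)

    # Stage 2: first-level dense connection network rescores and refines the surviving frames
    boxes, scores = dense_net_1(image, boxes)
    boxes, scores = filter_and_nms(boxes, scores, score_thresh=t2)

    # Stage 3: second-level dense connection network produces the final frames
    boxes, scores = dense_net_2(image, boxes)
    return filter_and_nms(boxes, scores, score_thresh=t3)
```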

Claims (7)

1. A face detection method based on a multi-scale cascade dense connection neural network is characterized by comprising a construction method of a regional nomination network and a construction method of a multi-level dense connection convolution network model:
the construction method of the regional nomination network comprises the following steps: on several convolution layers of the regional nomination network, predicting scores and bounding frames for regions that may contain a face; then eliminating region blocks whose scores are smaller than a set threshold, and applying non-maximum suppression to the remaining region blocks to obtain the final regions that may contain a face; finally, sending the predicted face regions into the second-level dense connection convolution network;
the construction method of the multi-level dense connection convolution network model comprises the following steps: using convolution layers to continuously extract increasingly abstract face features while concatenating the features extracted by lower convolution layers with those extracted by higher convolution layers; then attaching a global average pooling layer after the last convolution layer and performing fine classification and frame regression on the face regions predicted by the previous level; finally, sending the remaining face regions to the third-level dense connection convolution network for even finer classification and frame regression, so that the final face regions are obtained through prediction;
the method specifically comprises the following steps:
step (1), collecting the face picture marked with face rectangular frame information to form an initial training data set D1By using D1Generating a sub-training data set D that conforms to a first-level network input format2
Step (2), designing a region nomination network model capable of extracting more high-quality candidate regions, and utilizing a sub-training data set D2Training the region nomination network model, and then collecting an initial training data set D1Sending the data into the region nomination network model for detection, and generating training data D of the next level according to the detection result3(ii) a The method specifically comprises the following steps: extracting more high-quality candidate regions containing human faces by using different convolution layers of the region nomination network, and preventing missing detection caused by too few extracted candidate regions; respectively connecting a classification layer and a regression layer to the last two convolution layers of the regional nomination network to predict the face region score and carry out frame regression; finally the elimination score is lower than the threshold value T1The remaining candidate frames are subjected to non-maximum suppression to obtain a final prediction result; then using the preprocessed data set D2Training the area nomination network, and after the training is finished, D1Inputting the area nomination network for detection, and combining the face rectangle frame and D in the detection result1Calculating the cross-over ratio and the cross-over ratio according to the real face rectangular frame information of the corresponding picture>0.85 as a face sample, 0.55<Cross ratio of<0.7, labeled as partial face samples, cross-over ratio<0.35 marking as a non-human face sample, generating a next level networkTraining data set D of3,D3The image resolution of (a) is 24 × 24;
step (3), designing a cascading dense connection network with stronger feature extraction capability and generalization capability, and connecting D3Sending the network into the first stage of dense connection network to train and generate sub-network model, and then sending D1Sending the data into a network consisting of a regional nomination network and a first-level dense connection network for detection, and generating a training data set D of a next-level dense connection network according to the detection result4Reuse of D4Training a second-stage cascaded dense connection network;
and (4) detecting the multi-pose human face in the picture to be tested by using the network model obtained by training.
2. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 1, characterized in that more high-quality candidate regions containing faces are extracted by using different convolution layers of the regional nomination network, thereby preventing missed detections caused by extracting too few candidate regions; a classification layer and a regression layer are attached to each of the last two convolution layers of the regional nomination network to predict face region scores and perform frame regression; finally, candidate frames whose scores are lower than the threshold T1 are eliminated, and non-maximum suppression is applied to the remaining candidate frames to obtain the final prediction result; T1 ranges from 0 to 1.
3. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 1, characterized in that a global average pooling layer is introduced to replace the traditional fully connected layer for face classification and regression; the global average pooling layer is attached after the last convolution layer of each level of dense connection network and computes the overall average value of each feature map output by the preceding convolution layers, so that local face information is fully learned while the overfitting caused by introducing spatial structure information is avoided; finally, a softmax layer is attached after the global average pooling layer to classify and regress the face regions predicted by the previous stage.
4. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 3, characterized in that a cascaded dense connection convolution network is constructed to extract face features and perform fine classification and regression; each level of dense connection network may contain several dense connection blocks, each dense connection block is composed of several convolution layers, and the convolution layers of the same dense connection block must generate feature maps of the same size; within the same dense connection block, the input of each convolution layer is the concatenation of the feature maps generated by all preceding convolution layers; two adjacent dense connection blocks are joined by a transition layer; the second-level and third-level networks consist of dense connection convolution networks containing two and three dense connection blocks respectively, which step by step eliminate and positionally refine the face regions predicted by the first level; the transition layer comprises a convolution layer and a pooling layer.
5. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 1, wherein step (1) specifically comprises: preprocessing the face data set D1 into the sub-training data set D2 that conforms to the input format of the first-level network in the cascade, with a resolution of 12 × 12; the sub-training data set D2 contains three types of training pictures: face images, partial-face images, and non-face images; label information for the three types of pictures is made as follows: face images are labeled 1, partial-face images are labeled -1, and non-face images are labeled 0; face and partial-face images are also annotated with face rectangular frame information, while the face rectangular frame information of non-face images is set to -1.
6. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 1, wherein step (3) specifically comprises: constructing a cascaded dense connection convolution network to extract face features and perform fine classification and regression, wherein each level of dense connection network contains several dense connection blocks, each dense connection block is built from several convolution layers, and the convolution layers of the same dense connection block must generate feature maps of the same size; within the same dense connection block, the input of each convolution layer is the concatenation of the feature maps generated by all preceding convolution layers; two adjacent dense connection blocks are joined by a transition layer; a global average pooling layer is attached after the last convolution layer of the dense connection network and computes the overall average value of each feature map output by the preceding convolution layers, the number of feature maps being consistent with the number of classification categories; the first-level dense connection network is trained with D3, and after training, D1 is sent into the network formed by the regional nomination network and the first-level dense connection network for cascade detection, generating the sub-training data set D4, whose image resolution is 48 × 48, by the same method as in step (2); finally, the second-level dense connection network is trained with D4.
7. The face detection method based on the multi-scale cascade dense connection neural network as claimed in claim 1, wherein step (4) specifically comprises: cascading the regional nomination network and the two-stage dense connection network into a three-stage cascaded network; then performing pyramid scale transformation on a new picture at a scaling ratio of 0.709, inputting the transformed pictures into the first-stage regional nomination network model to generate a large number of face classification scores and face rectangular frame regression vectors, eliminating face rectangular frames whose scores are lower than the threshold T1, and applying non-maximum suppression to the remaining face rectangular frames to obtain this stage's prediction result; the prediction results are then input into the second-level network model, face rectangular frames whose scores are lower than the threshold T2 are again eliminated, and the non-maximum suppression algorithm is used to screen out face rectangular frames with large overlap; finally, the prediction results are input into the third-level network model, which outputs face classification scores and face rectangular frame information, face rectangular frames whose scores are lower than the threshold T3 are eliminated, and the non-maximum suppression algorithm is used to screen out face rectangular frames with large overlap to obtain the final prediction result; T1, T2, and T3 range from 0 to 1.
CN201810605067.6A 2018-06-13 2018-06-13 Face detection method based on multi-scale cascade dense connection neural network Active CN108875624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810605067.6A CN108875624B (en) 2018-06-13 2018-06-13 Face detection method based on multi-scale cascade dense connection neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810605067.6A CN108875624B (en) 2018-06-13 2018-06-13 Face detection method based on multi-scale cascade dense connection neural network

Publications (2)

Publication Number Publication Date
CN108875624A CN108875624A (en) 2018-11-23
CN108875624B (en) 2022-03-25

Family

ID=64338103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810605067.6A Active CN108875624B (en) 2018-06-13 2018-06-13 Face detection method based on multi-scale cascade dense connection neural network

Country Status (1)

Country Link
CN (1) CN108875624B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543648B (en) * 2018-11-30 2022-06-17 公安部交通管理科学研究所 Method for extracting face in car passing picture
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method based on cascade detector, target detection model and system
CN110059584B (en) * 2019-03-28 2023-06-02 中山大学 Event naming method combining boundary distribution and correction
CN110163102A (en) * 2019-04-18 2019-08-23 麦克奥迪(厦门)医疗诊断系统有限公司 A kind of cervical cell image classification recognition methods based on convolutional neural networks
CN110335244A (en) * 2019-05-17 2019-10-15 杭州数据点金科技有限公司 A kind of tire X-ray defect detection method based on more Iterative classification devices
CN111986278B (en) 2019-05-22 2024-02-06 富士通株式会社 Image encoding device, probability model generating device, and image compression system
CN112001205B (en) * 2019-05-27 2023-10-31 北京君正集成电路股份有限公司 Network model sample acquisition method for secondary face detection
CN111027382B (en) * 2019-11-06 2023-06-23 华中师范大学 Attention mechanism-based lightweight face detection method and model
CN110866484B (en) * 2019-11-11 2022-09-09 珠海全志科技股份有限公司 Driver face detection method, computer device and computer readable storage medium
CN111080576B (en) * 2019-11-26 2023-09-26 京东科技信息技术有限公司 Key point detection method and device and storage medium
CN110859624A (en) * 2019-12-11 2020-03-06 北京航空航天大学 Brain age deep learning prediction system based on structural magnetic resonance image
CN113051960A (en) * 2019-12-26 2021-06-29 深圳市光鉴科技有限公司 Depth map face detection method, system, device and storage medium
CN111274886B (en) * 2020-01-13 2023-09-19 天地伟业技术有限公司 Deep learning-based pedestrian red light running illegal behavior analysis method and system
CN111368707B (en) * 2020-03-02 2023-04-07 佛山科学技术学院 Face detection method, system, device and medium based on feature pyramid and dense block
CN111428661A (en) * 2020-03-28 2020-07-17 北京工业大学 Method for processing face image based on intelligent human-computer interaction
CN113569991B (en) * 2021-08-26 2024-05-28 深圳市捷顺科技实业股份有限公司 Person evidence comparison model training method, computer equipment and computer storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247956B (en) * 2016-10-09 2020-03-27 成都快眼科技有限公司 Rapid target detection method based on grid judgment
CN107748858A (en) * 2017-06-15 2018-03-02 华南理工大学 A kind of multi-pose eye locating method based on concatenated convolutional neutral net
CN107688786A (en) * 2017-08-30 2018-02-13 南京理工大学 A kind of method for detecting human face based on concatenated convolutional neutral net
CN108090918A (en) * 2018-02-12 2018-05-29 天津天地伟业信息系统集成有限公司 A kind of Real-time Human Face Tracking based on the twin network of the full convolution of depth

Also Published As

Publication number Publication date
CN108875624A (en) 2018-11-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant