CN108229442B - Method for rapidly and stably detecting human face in image sequence based on MS-KCF - Google Patents


Info

Publication number
CN108229442B
CN108229442B (application CN201810124952.2A)
Authority
CN
China
Prior art keywords
detection
network
kcf
convolution
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810124952.2A
Other languages
Chinese (zh)
Other versions
CN108229442A (en)
Inventor
李小霞
李旻择
叶远征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN201810124952.2A priority Critical patent/CN108229442B/en
Publication of CN108229442A publication Critical patent/CN108229442A/en
Application granted granted Critical
Publication of CN108229442B publication Critical patent/CN108229442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/207 - Analysis of motion for motion estimation over a hierarchy of resolutions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face


Abstract

The invention provides a method for rapidly and stably detecting human faces in an image sequence based on MS-KCF. Aiming at the problem of detecting faces with large angle changes and severe occlusion in an image sequence, the invention proposes a new automatic Detection-Tracking-Detection (DTD) mode that integrates a fast and accurate target detection model, MobileNet-SSD (MS), with a fast tracking model, the Kernelized Correlation Filter (KCF), namely the MS-KCF face detection model. The method comprises the following steps: step 1, building the MS detection network; step 2, detecting the target with the MS network; step 3, updating the tracking model to predict the position of the face target in the next frame; step 4, after tracking for a number of frames, updating the MS detection network and re-detecting and localizing the face target; and step 5, comparing and analyzing the experimental results. Experiments show that the MS-KCF model not only keeps face detection stable under large angle changes and severe occlusion in an image sequence, but also greatly improves the detection speed.

Description

Method for rapidly and stably detecting human face in image sequence based on MS-KCF
Technical Field
The invention belongs to the technical field of target detection of machine vision, and particularly relates to a method for rapidly and stably detecting a human face in an image sequence based on MS-KCF.
Background
With the continuous development of computer technology and the steady improvement of computing power, face detection, an important branch of computer vision, has made great breakthroughs and is now widely applied in access control systems, intelligent surveillance, smart cameras and the like. Face detection nevertheless remains challenging: detecting, stably and in real time, faces that undergo large angle changes and severe occlusion in an image sequence has become a problem that urgently needs to be solved in applications. Traditional methods based on shallow features no longer meet these requirements, so deep Convolutional Neural Networks (CNN) are the focus and hot spot of current detection research.
Traditional face detection methods are numerous, but they share the following characteristics: first, features must be selected manually, the process is complex, and the quality of the detection result depends on the prior knowledge of the researchers; second, the target is detected by sliding a window over the image, which produces many redundant windows and high time complexity, and the detection effect on faces with large angle changes and severe occlusion in an image sequence is poor.
In recent years, CNNs have made great breakthroughs in the field of target detection and are now the most advanced target detection methods. The landmark breakthrough of CNN in target detection was the R-CNN (Region-based CNN) network proposed by Ross Girshick et al. in 2014, whose mean Average Precision (mAP) on VOC is twice that of the HOG-based DPM (Deformable Parts Model) target detection algorithm proposed by Felzenszwalb et al. Since the advent of R-CNN, CNN-based target detection has dominated the VOC benchmarks and can be broadly divided into two categories: (1) target detection based on candidate regions, characterized by high detection precision but a speed that cannot meet real-time requirements, represented by R-FCN in 2016, Faster R-CNN in 2017, Mask R-CNN in 2017, and so on; (2) target detection based on regression, characterized by high speed but lower detection accuracy, represented by YOLO (You Only Look Once) in 2015, SSD (Single Shot MultiBox Detector) in 2016, and so on. Jonathan Huang et al. elaborated in 2016 on how to trade off detection accuracy and speed among meta-structures (SSD, Faster R-CNN and R-FCN). In addition, some cascaded face detection methods also perform well: the Joint Cascade method proposed by Chen et al. in 2014 cascades face detection with facial landmark detection and achieves a high detection rate among traditional face detection methods; the MTCNN proposed by Zhang et al. in 2016 cascades three convolutional networks, and its coarse-to-fine structure gives multitask face detection a high recall rate, but training requires three different datasets, which is cumbersome; the Faceness network proposed by Yang et al. in 2016 judges whether a detected target is a face using five attributes, namely nose, mouth, eyes, hair and beard, and has high detection precision but does not meet the real-time criterion. Deep learning is also moving toward embedded devices such as mobile phones, where the parameter count of the base network is highly constrained by real-time requirements; for this purpose MobileNet was proposed by Andrew G. Howard et al. in 2017, trading a small loss in classification accuracy for a large reduction in parameters. MobileNet has about 1/33 the parameters of VGG16, and its ImageNet-1000 classification accuracy is 70.6%, only 0.9% lower than that of VGG16. In summary, achieving both speed and accuracy remains a difficult point in the field of target detection.
Disclosure of Invention
In practical engineering applications, faces are mostly detected in image sequences, and the system is required to detect, stably and in real time, faces with large angle changes and severe occlusion. Therefore, the fast MobileNet base network is improved and combined with the SSD face detection framework to form the MS (MobileNet-SSD) detection network, which balances detection speed and precision well; the MS network parameters are adjusted for the two-class face detection task (face target and background); and the Kernelized Correlation Filter (KCF) algorithm is then used to track the detected face stably, forming a Detection-Tracking-Detection (DTD) cyclic update mode, namely the MS-KCF face detection model. The model not only solves the stability problem of detecting faces with large angle changes and severe occlusion, but also greatly improves the detection speed for face targets in an image sequence.
The technical scheme of the present invention is as follows: a method for rapidly and stably detecting human faces in an image sequence based on MS-KCF, mainly comprising the following steps:
step 1, building the MS (MobileNet-SSD) detection network;
step 2, reading an image sequence, and detecting the image by using an MS detection network;
step 3, updating the tracking model: passing the coordinate information of the detected face target to the KCF tracker as its base sample box, and sampling and training near this box to predict the position of the face target in the next frame;
step 4, in order to prevent the loss of the human face target during tracking, after tracking for a plurality of frames, updating the MS detection network, and re-detecting and positioning the human face target;
and 5, comparing and analyzing the experimental result with the current advanced face detection method.
Drawings
FIG. 1 is a general flow chart of the system of the present invention
FIG. 2 is a diagram of a MS network architecture in accordance with the present invention
FIG. 3 is a diagram of the improved MobileNet convolution structure of the present invention
FIG. 4 is a pyramid of the convolution characteristics of the MS network of the present invention
FIG. 5 is a graph showing the test results of the MS-KCF model of the present invention
FIG. 6 is a ROC curve comparison graph of Girl image sequences of the present invention
FIG. 7 is a ROC curve comparison of the faceOcc1 image sequence of the present invention.
Detailed Description
The MS-KCF-based method for rapidly and stably detecting human faces in an image sequence is described in further detail below with reference to the examples and the accompanying drawings.
As shown in FIG. 1, the system of the present invention comprises an image sequence acquisition module, an MS detection network module, a KCF tracking module, and a model update module. Together, these modules form a new automatic Detection-Tracking-Detection (DTD) cyclic update mode in the whole network, namely the MS-KCF face detection model.
Step 1, building the MS (MobileNet-SSD) detection network. As shown in fig. 2, the MS detection network structure includes four parts: the first part is the input layer, used to input pictures; the second part is the improved MobileNet convolution network, used to extract features from the input picture; the third part is the SSD meta-structure, used for classification regression and bounding-box regression; the fourth part is the output layer, used to output the detection result. Table 1 shows the overall architecture of the MS detection network, where Conv_BN_ReLU6 denotes a standard convolutional layer, Conv1_Dw_Pw denotes a depthwise separable convolutional layer, and 'v' marks the feature maps of convolutional layers that are used for both classification regression and bounding-box regression. Since face targets are small, the feature map output by the shallow layer Conv7_Dw_Pw is taken.
TABLE 1 MS Overall architecture
The MS detection network comprises two parts: the improved MobileNet convolution network and the SSD meta-structure.
(1) Feature extraction with the improved MobileNet convolutional network. Fig. 3 shows the improved MobileNet convolution structure: Conv_Dw_Pw is a depthwise separable convolution (Depthwise Separable Convolutions), Dw is the 3x3 depthwise convolution layer (Depthwise Layers), Pw is the 1x1 pointwise convolution layer (Pointwise Layers), and every convolution operation is followed by a Batch Normalization (BN) algorithm and the ReLU6 activation function. The invention replaces the ReLU activation function of the MobileNet network with ReLU6, which, together with the BN algorithm that automatically adjusts the data distribution, speeds up the convergence of training. Equation (1) is the ReLU6 activation function:
y = min(max(x, 0), 6)        (1)

where x is the input to the activation function and y is the output.
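For illustration, the following is a minimal PyTorch sketch of the Conv_Dw_Pw building block described above (an assumed implustration-only implementation, not the patent's code): a 3x3 depthwise convolution followed by BN and ReLU6, then a 1x1 pointwise convolution followed by BN and ReLU6.

    import torch.nn as nn

    def conv_dw_pw(in_ch, out_ch, stride=1):
        """Depthwise separable block: Dw (3x3) -> BN -> ReLU6, then Pw (1x1) -> BN -> ReLU6."""
        return nn.Sequential(
            # Dw: 3x3 depthwise convolution, one filter per input channel (groups=in_ch)
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            # Pw: 1x1 pointwise convolution, recombines the channels
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
        )

    block = conv_dw_pw(32, 64, stride=2)   # e.g. 32 -> 64 channels, downsampling by 2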
The improved MobileNet convolution structure is skillfully designed.
First, the depth separable convolution structure greatly reduces the amount of computation and speeds up the convergence rate during training for the following reasons:
when performing the calculation of the standard convolution, assume that the size of the input image is
Figure 647578DEST_PATH_IMAGE003
MRepresenting an input imageThe number of the channels of (a) is,Nrepresenting the number of channels of the convolution output, the standard convolution kernel size being
Figure 615534DEST_PATH_IMAGE004
. If the calculation cost is expressed by the parameter number, the calculation cost of the method is as follows:
Figure 118191DEST_PATH_IMAGE005
however, for the depth separable convolution formula in MobileNet, the required size of the depth convolution kernel at the first half Dw stage is the same as the above-mentioned input and output
Figure 829795DEST_PATH_IMAGE006
In the latter half Pw stage, the required size of the point convolution kernel is
Figure 226141DEST_PATH_IMAGE007
The computation cost of the depth separable convolution at this point is:
Figure 364999DEST_PATH_IMAGE008
at the cost of standard convolution calculations
Figure 948427DEST_PATH_IMAGE009
And (4) doubling.
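As a quick arithmetic check of the ratio above (the layer sizes below are illustrative values, not taken from Table 1):

    # Cost of a standard vs. a depthwise separable convolution, in multiply operations.
    # Illustrative sizes: 3x3 kernel, 64 input channels, 128 output channels, 38x38 feature map.
    Dk, M, N, Df = 3, 64, 128, 38
    standard = Dk * Dk * M * N * Df * Df                  # standard convolution
    separable = Dk * Dk * M * Df * Df + M * N * Df * Df   # depthwise + pointwise
    print(separable / standard)        # ~0.119
    print(1 / N + 1 / Dk ** 2)         # same ratio, ~0.119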
Second, during the training of a convolutional neural network, the distribution of the data changes after each convolutional layer. If the data fall on the saturated edges of the activation function, the gradient vanishes and the parameters are no longer updated. The BN algorithm adjusts the distribution of the data (towards an approximately standard normal distribution) through two learnable parameters, avoiding the vanishing-gradient phenomenon and the complicated setting of hyperparameters (learning rate, Dropout ratio, and so on) during training.
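A NumPy sketch of the batch-normalization forward pass described above, with the two learnable parameters gamma (scale) and beta (shift); this is an illustrative training-mode forward only, with moving averages and backpropagation omitted:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch, features); statistics are computed over the batch dimension
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # approximately standard-normal per feature
        return gamma * x_hat + beta             # learnable rescaling and shifting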
(2) SSD meta-structure regression. The SSD network is a regression model that performs classification regression and bounding-box regression on the features output by different convolution layers; it relieves the contradiction between translation invariance and translation variability and achieves a good compromise between detection precision and speed, that is, it raises the detection speed while keeping the detection precision high. For a 300 × 300 input, tested on the VOC2007 dataset in a Titan X GPU hardware environment, its detection speed is 59 fps with a mean average precision (mAP) of 74.3%. SSD is an end-to-end training model whose total loss function L used in training includes the confidence loss of the classification regression and the location loss of the bounding-box regression, defined as:
L(x, c, l, g) = (1/P) [ L_conf(x, c) + α L_loc(x, l, g) ]        (2)

In equation (2), x represents the input; c represents the classification confidence; l represents the predicted offsets, including the translation offsets of the center-point coordinates and the scaling offsets of the bounding-box width and height; g is the ground-truth box giving the actual position of the target; L_conf(x, c) is the confidence loss of the classification regression; L_loc(x, l, g) is the location loss of the bounding-box regression; α is a parameter balancing the two losses; and P is the number of matched default boxes, the total loss being set to 0 when P is 0.
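A hedged sketch of a loss of this form using standard PyTorch losses as stand-ins (the patent does not specify the exact loss terms; a full SSD loss additionally restricts the location loss to matched default boxes and uses hard negative mining, which are omitted here; all tensor names are assumptions for illustration):

    import torch
    import torch.nn.functional as F

    def ssd_total_loss(conf_pred, cls_target, loc_pred, loc_target, num_matched, alpha=1.0):
        # conf_pred: (boxes, num_classes) class scores; cls_target: (boxes,) class indices
        # loc_pred, loc_target: (boxes, 4) offsets of center x/y and width/height
        if num_matched == 0:
            return torch.zeros((), device=conf_pred.device)                # P = 0 -> total loss is 0
        l_conf = F.cross_entropy(conf_pred, cls_target, reduction="sum")   # confidence loss
        l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction="sum")    # location loss
        return (l_conf + alpha * l_loc) / num_matched                      # cf. eq. (2)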
Step 2, reading the image sequence and detecting each image with the MS detection network. Fig. 4 shows the MS network convolution feature pyramid. To satisfy the translation variability required by the detection task, the invention takes two feature maps from the improved MobileNet and four feature maps from the additional standard convolutional layers to form a feature-map pyramid, convolves them with different 3x3 convolution kernels, and uses the convolved results as the final features for classification regression and bounding-box regression. The invention takes a 300 × 300 picture as input; the numbers of default boxes per feature cell in the six-level convolution feature-map pyramid are 4, 6 and 6, respectively; and the 3x3, stride-1 convolution kernels used for different layers and different tasks all have different parameters.
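A minimal sketch of attaching 3x3 prediction heads to such a feature-map pyramid, one classification head and one bounding-box regression head per level; the channel widths and default-box counts below are illustrative placeholders, not the values of Table 1:

    import torch.nn as nn

    NUM_CLASSES = 2          # two-class face detection: face target and background

    def make_heads(channels, boxes_per_cell):
        cls_heads, loc_heads = nn.ModuleList(), nn.ModuleList()
        for ch, k in zip(channels, boxes_per_cell):
            cls_heads.append(nn.Conv2d(ch, k * NUM_CLASSES, 3, padding=1))  # classification regression
            loc_heads.append(nn.Conv2d(ch, k * 4, 3, padding=1))            # bounding-box regression
        return cls_heads, loc_heads

    # six pyramid levels: illustrative channel counts and default boxes per feature cell
    cls_heads, loc_heads = make_heads([512, 1024, 512, 256, 256, 256], [4, 6, 6, 6, 4, 4])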
Step 3, updating the tracking model: the coordinate information of the detected face target is passed to the KCF tracker and used as the tracker's base sample box, and samples are drawn and trained near this box to predict the position of the face target in the next frame.
Large angle changes, severe occlusion and other problems of a face moving in an image sequence can cause missed detections during face detection. KCF is a fast target tracking algorithm, so the model is updated during the face detection process: once the MS detection network detects a face, the KCF algorithm is started for continuous, stable tracking, and after 10 tracked frames the face detection model is used again to update the target position so as to avoid tracking loss (a minimal sketch of this update loop is given after the two points below). The KCF algorithm therefore:
(1) enhances the robustness of face detection in an image sequence to changes in pose, angle and the like;
(2) acts as the link and accelerator in the DTD model, greatly improving the detection speed of the whole system.
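The following is a minimal sketch of this detection-tracking-detection update loop, assuming a hypothetical detector function ms_detect_face(frame) that stands in for the MS detection network and returns a face box (x, y, w, h) or None, and using OpenCV's KCF tracker (opencv-contrib-python) as the tracking component:

    import cv2

    REDETECT_EVERY = 10                      # re-run the detector after this many tracked frames

    def run_dtd(video_path, ms_detect_face):
        cap = cv2.VideoCapture(video_path)
        tracker, tracked_frames, box = None, 0, None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if tracker is None or tracked_frames >= REDETECT_EVERY:
                box = ms_detect_face(frame)                  # detection step (accurate, slower)
                tracker, tracked_frames = None, 0
                if box is not None:
                    tracker = cv2.TrackerKCF_create()        # re-seed the KCF tracker
                    tracker.init(frame, tuple(int(v) for v in box))
            else:
                found, box = tracker.update(frame)           # tracking step (fast)
                tracked_frames += 1
                if not found:
                    tracker, box = None, None                # lost target: fall back to detection
            if box is not None:
                x, y, w, h = (int(v) for v in box)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.imshow("MS-KCF", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        cap.release()
        cv2.destroyAllWindows()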
Let x_i be the inputs and y_i the corresponding labels, so that the training sample set is {(x_i, y_i)} with R samples in total. The purpose of regression is to find a mapping f such that f(x_i) ≈ y_i. The linear regression function is f(x) = w^T x, where w represents the weight coefficients. Equation (3) is the error function used by the algorithm:

min_w Σ_i ( f(x_i) - y_i )^2 + λ ||w||^2        (3)

where the coefficient λ controls the structural complexity of the system so as to guarantee the generalization ability of the classifier. Solving equation (3) by the least-squares method gives the optimal weight coefficients w:

w = (X^T X + λI)^(-1) X^T y        (4)

In equation (4), T denotes the transpose, I is the identity matrix, and each row of X is a sample feature vector. Equation (5) is the complex-field form of equation (4):

w = (X^H X + λI)^(-1) X^H y        (5)

where X^H denotes the complex conjugate transpose of X. At this point, the computational time complexity of solving w is O(n^3).
In the KCF algorithm, the training samples and test samples are all generated from a base sample x through a circulant matrix:

X = C(x)        (6)

The circulant matrix X in equation (6) can be diagonalized by the discrete Fourier matrix F of equation (7):

F_jk = (1/√n) e^(-2πi·jk/n)        (7)

X = F diag(x̂) F^H        (8)

X^H X = F diag(x̂*) diag(x̂) F^H        (9)

X^H X = F diag(x̂* ⊙ x̂) F^H        (10)

In equation (8), x̂ is the discrete Fourier transform of the base sample x, and F^H denotes the complex conjugate transpose of F. In equation (9), x̂* is the Hermitian (complex conjugate) of x̂, and "diag" is the matrix diagonalization operation. Equation (10) is a rewriting of equation (9), where "⊙" is the element-by-element multiplication operation. Applying the discrete Fourier transform to both sides of equation (5) and using equations (8)-(10) gives:

ŵ = (x̂* ⊙ ŷ) / (x̂* ⊙ x̂ + λ)        (11)

In equation (11), ŷ is the discrete Fourier transform of y, the division is element-wise, and w is obtained from ŵ by the inverse Fourier transform. The computational time complexity of solving w through equation (11) is O(n), and that of the discrete Fourier transform is O(n log n); compared with the earlier O(n^3) complexity of solving w directly, the time complexity of the whole system is greatly reduced.
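As a minimal NumPy illustration of this speed-up (illustrative code, not the patent's implementation), the ridge-regression solution of equation (5) for a circulant data matrix can be checked against the Fourier-domain solution of equation (11); the circulant matrix below is built so that equation (11) holds with NumPy's FFT sign convention (with other shift or DFT conventions the complex conjugate moves to the other factor):

    import numpy as np

    n, lam = 16, 0.1
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n)        # base sample (e.g. a flattened image patch)
    y = rng.standard_normal(n)        # regression labels

    # Circulant data matrix X = C(x) of eq. (6): column j is x cyclically shifted by j.
    X = np.stack([np.roll(x, j) for j in range(n)], axis=1)

    # Direct solution of eq. (5): w = (X^H X + lam*I)^(-1) X^H y, costing O(n^3)
    w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    # Fourier-domain solution of eq. (11): only FFTs and element-wise operations, O(n log n)
    x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
    w_hat = np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + lam)
    w_fourier = np.real(np.fft.ifft(w_hat))

    print(np.allclose(w_direct, w_fourier))   # True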
The aim of the KCF algorithm is to reduce the computational time complexity of the regression through the circulant matrix in Fourier space, thereby achieving a substantial speed-up.
Step 4, in order to prevent loss of the face target during tracking, after tracking for 10 frames the MS detection model is updated and the face target is detected and localized again, so that the whole network forms a new automatic Detection-Tracking-Detection (DTD) mode, namely the MS-KCF face detection model, and the whole detection process balances speed and precision.
Step 5, comparing the experimental results with current state-of-the-art face detection methods and analyzing them.
The tests were evaluated on a GTX1080 GPU, with the input pictures all scaled to a size of 300 x 300.
Table 2 compares the average detection rate and average speed of different methods on the standard FDDB static face detection dataset. It shows that the proposed method achieves a better detection rate, and that the detection speed of the MS detection network is 2.8 times faster than MTCNN and 9.3 times faster than Faceness. The method therefore combines a high detection rate with a fast detection speed on this static face detection database.
TABLE 2 Average detection rate and average speed of different methods on the FDDB dataset
Fig. 5(a) and 5(b) show the test results on two image sequences (Girl and FaceOcc1) from the VOT2016 dynamic face tracking dataset. Girl is an image sequence with large face angle changes, and FaceOcc1 is an image sequence with heavy occlusion. In fig. 5(a) and 5(b), the first two rows of each image sequence are the detection results of the MS model and the last two rows are the detection results of the MS-KCF model. Clearly, the MS-KCF model has better detection performance for faces with larger angle changes and more severe occlusion in an image sequence.
Fig. 6 and 7 are ROC curve comparisons on the Girl and FaceOcc1 image sequences of the VOT2016 dataset, respectively. As can be seen from fig. 6 and 7, for the face detection task in an image sequence, the detection performance of the MS-KCF method with the model update function is superior to that of the MS method with only the detection function.
TABLE 3 Average speed of different methods on the VOT2016 dataset
Table 3 compares the average speeds of the different methods on the VOT2016 dataset. As can be seen from Table 3, the MS-KCF method with the model update function is fast: its detection speed is 2.3 times faster than the MS method with only the detection function, 6.4 times faster than MTCNN, and 21.4 times faster than Faceness.

Claims (4)

1. A method for rapidly and stably detecting human faces in an image sequence based on MS-KCF comprises the following five steps:
step 1, building the MS (MobileNet-SSD) detection network, whose structure includes four parts: the first part is the input layer, used to input pictures; the second part is the improved MobileNet convolution network, used to extract features from the input picture; the third part is the SSD meta-structure, used for classification regression and bounding-box regression; the fourth part is the output layer, used to output the detection result; in the improved MobileNet convolution structure, Conv_Dw_Pw is a depthwise separable convolution (Depthwise Separable Convolutions), Dw is the 3x3 depthwise convolution layer (Depthwise Layers), Pw is the 1x1 pointwise convolution layer (Pointwise Layers), and every convolution operation is followed by a Batch Normalization (BN) algorithm and the ReLU6 activation function;
step 2, reading an image sequence, and detecting the image by using an MS network;
step 3, updating the tracking model: the coordinate information of the detected face target is passed to the Kernelized Correlation Filter (KCF) tracker and used as the tracker's base sample box, and samples are drawn and trained near this box to predict the position of the face target in the next frame;
step 4, in order to prevent the loss of the human face target during tracking, after tracking for a plurality of frames, updating the MS detection network, and re-detecting and positioning the human face target;
and 5, comparing and analyzing the experimental result with the current advanced face detection method.
2. The method as claimed in claim 1, wherein the MS detection network in step 1 replaces the VGG base network of the original SSD model with the improved, fast and accurate MobileNet network; since the Pw structure in the original MobileNet changes the distribution of the data output by the Dw structure and thus reduces the detection accuracy, the fully connected layer of the original MobileNet is removed and 8 standard convolutional layers are additionally added to enlarge the receptive field of the feature maps, adjust the data distribution and enhance the translation invariance required by the classification task; to prevent the gradient from vanishing, a Batch Normalization (BN) layer is added after each convolutional layer; and the activation function is changed from ReLU to ReLU6.
3. The method for rapidly and stably detecting the human face in the image sequence based on the MS-KCF as claimed in claim 1, wherein, in order to satisfy the translation variability required by the detection task, the MS detection network proposed in step 2 takes two feature maps from the improved MobileNet and four feature maps from the additional standard convolutional layers to form a feature-map pyramid, convolves them with different 3x3 convolution kernels, and uses the convolved results as the final features for classification regression and bounding-box regression.
4. The method for rapidly and stably detecting the human face in the image sequence based on the MS-KCF as claimed in claim 1, wherein the two model updates in step 3 and step 4 realize accurate detection and localization of the face target, so that the whole network forms a new automatic Detection-Tracking-Detection (DTD) cyclic update mode, namely the MS-KCF face detection model, which not only keeps face detection stable under large angle changes and severe occlusion in an image sequence but also greatly improves the detection speed.
CN201810124952.2A 2018-02-07 2018-02-07 Method for rapidly and stably detecting human face in image sequence based on MS-KCF Active CN108229442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810124952.2A CN108229442B (en) 2018-02-07 2018-02-07 Method for rapidly and stably detecting human face in image sequence based on MS-KCF


Publications (2)

Publication Number Publication Date
CN108229442A CN108229442A (en) 2018-06-29
CN108229442B true CN108229442B (en) 2022-03-11

Family

ID=62671130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810124952.2A Active CN108229442B (en) 2018-02-07 2018-02-07 Method for rapidly and stably detecting human face in image sequence based on MS-KCF

Country Status (1)

Country Link
CN (1) CN108229442B (en)

Families Citing this family (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146967A (en) * 2018-07-09 2019-01-04 上海斐讯数据通信技术有限公司 The localization method and device of target object in image
CN109086678B (en) * 2018-07-09 2022-02-25 天津大学 Pedestrian detection method for extracting image multilevel features based on deep supervised learning
CN109118519A (en) * 2018-07-26 2019-01-01 北京纵目安驰智能科技有限公司 Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
CN109271848B (en) * 2018-08-01 2022-04-15 深圳市天阿智能科技有限责任公司 Face detection method, face detection device and storage medium
CN109063666A (en) * 2018-08-14 2018-12-21 电子科技大学 The lightweight face identification method and system of convolution are separated based on depth
CN109034119A (en) * 2018-08-27 2018-12-18 苏州广目信息技术有限公司 A kind of method for detecting human face of the full convolutional neural networks based on optimization
CN109145836B (en) * 2018-08-28 2021-04-16 武汉大学 Ship target video detection method based on deep learning network and Kalman filtering
CN109344731B (en) * 2018-09-10 2022-05-03 电子科技大学 Lightweight face recognition method based on neural network
CN109409210B (en) * 2018-09-11 2020-11-24 苏州飞搜科技有限公司 Face detection method and system based on SSD (solid State disk) framework
CN109558877B (en) * 2018-10-19 2023-03-07 复旦大学 KCF-based offshore target tracking algorithm
CN109492674B (en) * 2018-10-19 2020-11-03 北京京东尚科信息技术有限公司 Generation method and device of SSD (solid State disk) framework for target detection
CN111104817A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Fatigue detection method based on deep learning
CN109523476B (en) * 2018-11-02 2022-04-05 武汉烽火众智数字技术有限责任公司 License plate motion blur removing method for video detection
CN109583443B (en) * 2018-11-15 2022-10-18 四川长虹电器股份有限公司 Video content judgment method based on character recognition
CN109472315B (en) * 2018-11-15 2021-09-24 江苏木盟智能科技有限公司 Target detection method and system based on depth separable convolution
CN109598742A (en) * 2018-11-27 2019-04-09 湖北经济学院 A kind of method for tracking target and system based on SSD algorithm
CN109711332B (en) * 2018-12-26 2021-03-26 浙江捷尚视觉科技股份有限公司 Regression algorithm-based face tracking method and application
CN109993052B (en) * 2018-12-26 2021-04-13 上海航天控制技术研究所 Scale-adaptive target tracking method and system under complex scene
CN109754071B (en) * 2018-12-29 2020-05-05 中科寒武纪科技股份有限公司 Activation operation method and device, electronic equipment and readable storage medium
CN109840502B (en) * 2019-01-31 2021-06-15 深兰科技(上海)有限公司 Method and device for target detection based on SSD model
CN111582007A (en) 2019-02-19 2020-08-25 富士通株式会社 Object identification method, device and network
CN109903507A (en) * 2019-03-04 2019-06-18 上海海事大学 A kind of fire disaster intelligent monitor system and method based on deep learning
CN109828251B (en) * 2019-03-07 2022-07-12 中国人民解放军海军航空大学 Radar target identification method based on characteristic pyramid light-weight convolution neural network
CN109978045A (en) * 2019-03-20 2019-07-05 深圳市道通智能航空技术有限公司 A kind of method for tracking target, device and unmanned plane
CN110009015A (en) * 2019-03-25 2019-07-12 西北工业大学 EO-1 hyperion small sample classification method based on lightweight network and semi-supervised clustering
CN110298225A (en) * 2019-03-28 2019-10-01 电子科技大学 A method of blocking the human face five-sense-organ positioning under environment
CN111860046B (en) * 2019-04-26 2022-10-11 四川大学 Facial expression recognition method for improving MobileNet model
CN110287849B (en) * 2019-06-20 2022-01-07 北京工业大学 Lightweight depth network image target detection method suitable for raspberry pi
CN110378239A (en) * 2019-06-25 2019-10-25 江苏大学 A kind of real-time traffic marker detection method based on deep learning
CN110414371A (en) * 2019-07-08 2019-11-05 西南科技大学 A kind of real-time face expression recognition method based on multiple dimensioned nuclear convolution neural network
CN110490899A (en) * 2019-07-11 2019-11-22 东南大学 A kind of real-time detection method of the deformable construction machinery of combining target tracking
CN110580445B (en) * 2019-07-12 2023-02-07 西北工业大学 Face key point detection method based on GIoU and weighted NMS improvement
CN110363137A (en) * 2019-07-12 2019-10-22 创新奇智(广州)科技有限公司 Face datection Optimized model, method, system and its electronic equipment
CN110348423A (en) * 2019-07-19 2019-10-18 西安电子科技大学 A kind of real-time face detection method based on deep learning
CN110647810A (en) * 2019-08-16 2020-01-03 西北大学 Method and device for constructing and identifying radio signal image identification model
CN110619279B (en) * 2019-08-22 2023-03-17 天津大学 Road traffic sign instance segmentation method based on tracking
CN110443247A (en) * 2019-08-22 2019-11-12 中国科学院国家空间科学中心 A kind of unmanned aerial vehicle moving small target real-time detecting system and method
CN110495962A (en) * 2019-08-26 2019-11-26 赫比(上海)家用电器产品有限公司 The method and its toothbrush and equipment of monitoring toothbrush position
CN112487852A (en) * 2019-09-12 2021-03-12 上海齐感电子信息科技有限公司 Face detection method and device for embedded equipment, storage medium and terminal
CN110956082B (en) * 2019-10-17 2023-03-24 江苏科技大学 Face key point detection method and detection system based on deep learning
CN110909688B (en) * 2019-11-26 2020-07-28 南京甄视智能科技有限公司 Face detection small model optimization training method, face detection method and computer system
CN111160269A (en) * 2019-12-30 2020-05-15 广东工业大学 Face key point detection method and device
CN111339832B (en) * 2020-02-03 2023-09-12 中国人民解放军国防科技大学 Face synthetic image detection method and device
CN111339858B (en) * 2020-02-17 2022-07-29 电子科技大学 Oil and gas pipeline marker identification method based on neural network
CN111325157A (en) * 2020-02-24 2020-06-23 高新兴科技集团股份有限公司 Face snapshot method, computer storage medium and electronic device
CN111667505B (en) * 2020-04-30 2023-04-07 北京捷通华声科技股份有限公司 Method and device for tracking fixed object
CN111814827B (en) * 2020-06-08 2024-06-11 湖南腓腓动漫有限责任公司 YOLO-based key point target detection method
CN112581506A (en) * 2020-12-31 2021-03-30 北京澎思科技有限公司 Face tracking method, system and computer readable storage medium
CN112801117B (en) * 2021-02-03 2022-07-12 四川中烟工业有限责任公司 Multi-channel receptive field guided characteristic pyramid small target detection network and detection method


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017088050A1 (en) * 2015-11-26 2017-06-01 Sportlogiq Inc. Systems and methods for object tracking and localization in videos with adaptive image representation
CN106127776A (en) * 2016-06-28 2016-11-16 北京工业大学 Based on multiple features space-time context robot target identification and motion decision method
CN106204638A (en) * 2016-06-29 2016-12-07 西安电子科技大学 A kind of based on dimension self-adaption with the method for tracking target of taking photo by plane blocking process
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN106960446A (en) * 2017-04-01 2017-07-18 广东华中科技大学工业技术研究院 A kind of waterborne target detecting and tracking integral method applied towards unmanned boat

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
High-speed tracking with kernelized correlation filters; J. F. Henriques; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2014-08-01; Vol. 37, No. 3; 583-596 *
Fast and stable face detection in image sequences based on the MS-KCF model (基于MS-KCF模型的图像序列中人脸快速稳定检测); Ye Yuanzheng et al.; Journal of Computer Applications (计算机应用); 2018-08-10; Vol. 38, No. 8; 2192-2204 *
Research on kernelized correlation filter tracking algorithms based on depth information (基于深度信息的核相关性滤波跟踪算法研究); Wang Yang; China Master's Theses Full-text Database, Information Science and Technology; 2018-01-15; I138-1602 *

Also Published As

Publication number Publication date
CN108229442A (en) 2018-06-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant