CN111898410A - Face detection method based on context reasoning under unconstrained scene - Google Patents

Face detection method based on context reasoning under unconstrained scene

Info

Publication number
CN111898410A
Authority
CN
China
Prior art keywords
face
anchor point
training
network
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010531633.0A
Other languages
Chinese (zh)
Inventor
徐琴珍
杨哲
邵文韬
刘茵茵
侯坤林
朱颖
杨绿溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010531633.0A
Publication of CN111898410A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a face detection scheme based on context reasoning for unconstrained scenes, belonging to the field of multimedia signal processing. The training set is first augmented; VGGNet-16 is used as the base feature-extraction network, features from different layers are fused in a weighted manner through a low-level feature pyramid network, and a context-assisted prediction module expands the sub-networks of the prediction stage so as to deepen and widen the network model; a data-enhancement scheme based on adaptive anchor sampling and a multi-scale training method are introduced to strengthen the model's adaptability to scale. The method extracts the most expressive descriptive information, compensates for insufficiently extracted facial features, and improves the utilization of those features. It is suited to unconstrained scenes where detection is difficult, and in particular achieves accurate detection of tiny, blurred, and occluded faces.

Description

Face detection method based on context reasoning under unconstrained scene
Technical Field
The invention belongs to the technical field of image processing, and relates to a face detection method based on context reasoning in an unconstrained scene.
Background
The popularity of intelligent terminal devices has profoundly influenced the way people think and has given these devices a new social role. Face detection is one of the computer-vision applications closest to daily life: it frees people from heavy visual-processing work by letting machines analyze and summarize specific information in images and videos, and it has deeply influenced the development of modern society. On smartphones, the iPhone X and the Mate 20 Pro introduced 3D face-recognition unlocking on the iOS and Android platforms respectively, better protecting user privacy; in security surveillance, face recognition is used to track and capture lawbreakers, strengthening public safety; for property security, Alipay was the first to roll out face-scanning payment and identity authentication for credit loans, improving efficiency while also ensuring security.
Early mainstream face detection methods were mostly based on hand-designed template matching. They detect clear, unoccluded frontal faces well, are easy to implement, and are hardly affected by illumination or image quality, but because of the high variability of the human face no template can fully adapt to changes in pose, scale, and the like, so their accuracy is limited. Traditional face detection methods, which decide whether an image contains a face merely by mechanically comparing the correlation between hand-crafted features and the target face, are therefore unsuitable for unconstrained scenes.
With the rapid development of deep learning, face detection methods based on convolutional neural networks, with their strong representation-learning and nonlinear-modeling capability, have gradually replaced traditional methods; detection performance has improved markedly, and accuracy on clear, unoccluded faces approaches one hundred percent. However, unconstrained faces in natural scenes are easily disturbed by external factors such as occlusion, illumination, expression, and pose, so facial features are insufficiently extracted and utilized. In addition, small, low-resolution faces remain a bottleneck: densely sampling small faces with small anchors easily produces an excess of background negative samples and raises the false-detection rate. The accuracy of existing face detection methods in unconstrained scenes is therefore still insufficient, and a satisfactory result cannot yet be obtained.
Disclosure of Invention
In order to solve the above problems, the present invention provides a face detection method based on context reasoning in unconstrained scenes, focusing on two improvements. First, facial features, especially the most expressive descriptive information, are extracted as fully as possible: features from different levels are fused in a weighted manner through a low-level feature pyramid network, and a context-assisted prediction module expands the sub-networks of the prediction stage, so that the deeper and wider network model compensates for insufficiently extracted facial features. Second, a data-enhancement scheme based on adaptive anchor sampling and a multi-scale training method are introduced, strengthening the model's adaptability to scale and improving the utilization of facial features.
In order to achieve the purpose, the invention provides the following technical scheme:
the face detection method based on context inference under the unconstrained scene comprises the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set (currently the most authoritative face detection benchmark);
step 2, based on the augmented pictures from step 1, taking VGGNet-16 (a classical deep convolutional neural network) as the basic feature extraction network, fusing features of different layers in a weighted manner through a low-level feature pyramid network, and adopting a context-assisted prediction module to expand sub-networks in the prediction stage, thereby deepening and widening the network model;
step 3, after the training parameters are initialized, guiding the model's autonomous learning process with a multi-scale training method, saving the model after the loss converges, and performing detection.
Further, the step 1 specifically includes the following sub-steps:
step 1.1: horizontally flip and randomly crop the pictures in the WIDER FACE training set as preliminary preprocessing. The specific operations are as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: simulate interference in unconstrained scenes by color dithering and noise perturbation, further enhancing the preliminary preprocessing result x_preprocess obtained in step 1.1 to different degrees, so as to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
where the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
Step 1.3: the face in a certain image is reshaped by adopting a self-adaptive anchor point sampling method, so that a larger face with higher probability is introduced, and the method specifically comprises the following operations: selecting a size s in a certain imagefaceThe anchor point scale s on the ith layer feature map (i is 0,1, …,5) is presetiAs shown in the following formula:
s_i = 2^(4+i)
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
where s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
The scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
The original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling.
Further, the step 2 specifically includes the following sub-steps:
step 2.1: performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5;
step 2.2: fuse the low-level detail features and the high-level semantic features in a weighted manner through the low-level feature pyramid network, so that more expressive descriptive information can be extracted. Let φ_i and φ_{i+1} denote a shallow and a deep feature map used for prediction in step 2.1, H denote the 2× upsampling operation applied to the higher-level feature map, and θ its parameters; the new feature map generated by weighted fusion can be represented as:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
where α and β are hyper-parameters balancing the two terms; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached;
step 2.3: feed the weighted-fusion feature maps obtained above into a context-assisted prediction module, and fuse the sub-networks by concatenation (channel-wise parallel connection), thereby deepening and widening the network model.
Further, the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: apply a multi-scale training method: training uses three scales corresponding to images of different resolutions, and the region of interest at each resolution has a designated range. If the size of a ground-truth box lies within that range it is marked as valid, otherwise as invalid. When anchor points are generated and labeled, it is first checked whether the fraction of the anchor overlapped by an invalid ground-truth box exceeds a given proportion; if so, the anchor is regarded as invalid, otherwise valid. Anchors judged invalid are disabled during training and do not enter back-propagation or affect the parameters;
step 3.3: supervise position regression and class scoring with the smooth L1 loss and the softmax loss respectively; when the sum of the two losses stops changing and stabilizes within a small range, stop training, save the model, and perform detection; otherwise return to step 3.1.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention compensates for the prior art's neglect of more expressive information: low-level detail information and high-level semantic information are fused in a weighted manner through a low-level feature pyramid network, the context-assisted prediction module deepens and widens the network model by expanding sub-networks, and insufficiently extracted facial features are thereby remedied.
2. The invention further strengthens the model's sensitivity and adaptability to scale: a data-enhancement scheme based on adaptive anchor-point sampling and a multi-scale training method are introduced, improving the utilization of facial features and yielding a clear gain.
3. When facing faces in unconstrained scenes with attributes such as varying scale, blur, strong or weak illumination, different poses, facial occlusion and make-up, the invention maintains high detection accuracy and strong resistance to interference, and is highly adaptable and general.
Drawings
FIG. 1 is a flow chart of the face detection method based on context inference according to the present invention.
FIG. 2 is a network model diagram of the face detection method based on context inference.
Fig. 3 is a schematic diagram of a human face image processing enhancement mode.
FIG. 4 is a comparison graph of training sample data distribution before and after adaptive anchor point sampling.
FIG. 5 is a schematic diagram of a low-level feature pyramid network structure.
FIG. 6 is a low-level feature pyramid network fusion feature visualization.
FIG. 7 is a block diagram of a context assisted prediction module.
Fig. 8 is a diagram illustrating the detection effect of the trained model on WIDER FACE face samples in the test set.
FIG. 9 shows the detection accuracy of the trained model on the Easy, Medium, Hard validation set of WIDER FACE.
Fig. 10 is a diagram illustrating the effect of detecting an unconstrained face by using a trained model.
The photographs in the drawings are originally color pictures and have been converted to grayscale to meet patent-filing requirements.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking the WIDER FACE data set (the most authoritative face detection benchmark) as an example, the specific implementation of the face detection method based on context reasoning in unconstrained scenes is described in detail below with reference to the accompanying drawings. The flow of the method is shown in Fig. 1 and comprises the following steps:
step 1: the data augmentation of WIDER FACE training set mainly includes the following three aspects:
step 1.1: horizontally flip and randomly crop the pictures in the WIDER FACE training set as preliminary preprocessing. The specific operations are as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640. An example of these data-enhancement operations is shown in Fig. 3: the first row shows original input images of arbitrary size, the second row shows the corresponding images expanded to 4 times their original size, and the third and fourth rows show the preliminary preprocessing results of flipping and cropping some of the samples.
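By way of illustration only (this sketch is not part of the claimed method), the preliminary preprocessing x_preprocess = Crop(Flip(Extend(x_input))) can be written in Python roughly as follows; the placement of the original image inside the mean-padded canvas and the handling of images smaller than 640 × 640 are assumptions.

    import numpy as np

    def extend(img, factor=4):
        """Pad the image to `factor` times its size, filling with the per-channel mean."""
        h, w, c = img.shape
        canvas = np.empty((h * factor, w * factor, c), dtype=img.dtype)
        canvas[:] = img.mean(axis=(0, 1), keepdims=True).astype(img.dtype)
        top = np.random.randint(0, h * (factor - 1) + 1)     # random placement (assumption)
        left = np.random.randint(0, w * (factor - 1) + 1)
        canvas[top:top + h, left:left + w] = img
        return canvas

    def flip(img, p=0.5):
        """Mirror the image horizontally with probability p."""
        return img[:, ::-1] if np.random.rand() < p else img

    def crop(img, size=640):
        """Randomly cut out a size x size region (smaller if the image is smaller)."""
        h, w, _ = img.shape
        top = np.random.randint(0, max(h - size, 0) + 1)
        left = np.random.randint(0, max(w - size, 0) + 1)
        return img[top:top + size, left:left + size]

    def preprocess(x_input):
        # x_preprocess = Crop(Flip(Extend(x_input)))
        return crop(flip(extend(x_input)))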
Step 1.2: and simulating the interference in an unconstrained scene by adopting a color dithering and noise disturbance mode. These two data enhancement modes are briefly described below:
Color dithering: to account for different illumination intensities, background atmospheres, shooting conditions and so on, the saturation, brightness, contrast and sharpness of the input image are each adjusted by randomly generated factors.
Noise disturbance: the method mainly relates to the addition of Gaussian white noise and salt and pepper noise, wherein the Gaussian noise refers to that the noise amplitude obeys Gaussian distribution, namely the number of noise points with certain intensity is the largest, and the number of noise points which are farther away from the intensity is smaller, so that the noise is additive noise; the salt and pepper noise is an impulse noise, and the alternating black and white bright and dark point noise can be generated on an original image by randomly changing the values of some pixel points, so that the salt and pepper noise is vivid, is just like spreading salt and pepper on the image, and is a logic noise.
In summary, the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
where the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively. An example is shown in Fig. 3: the fifth row applies color dithering to the pictures cropped in the fourth row, and the sixth and seventh rows add Gaussian noise and salt-and-pepper noise of different degrees to those pictures, enhancing the model's robustness to arbitrary external environmental factors.
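As an illustrative sketch only (the jitter range and noise levels below are assumptions, since the patent does not fix them, and Pillow is used purely for convenience), the color dithering and the two noise perturbations could be implemented as:

    import numpy as np
    from PIL import Image, ImageEnhance

    def color_jitter(img):
        """Randomly perturb saturation, brightness, contrast and sharpness."""
        pil = Image.fromarray(img)
        for enhancer in (ImageEnhance.Color, ImageEnhance.Brightness,
                         ImageEnhance.Contrast, ImageEnhance.Sharpness):
            factor = np.random.uniform(0.6, 1.4)          # illustrative jitter range
            pil = enhancer(pil).enhance(factor)
        return np.asarray(pil)

    def gaussian_noise(img, sigma=10.0):
        """Additive noise whose amplitude follows a Gaussian distribution."""
        noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def salt_pepper_noise(img, amount=0.01):
        """Impulse noise: set a random fraction of pixels to pure black or white."""
        out = img.copy()
        mask = np.random.rand(*img.shape[:2])
        out[mask < amount / 2] = 0            # pepper
        out[mask > 1 - amount / 2] = 255      # salt
        return out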
Step 1.3: the face in a certain image is reshaped by adopting a self-adaptive anchor point sampling method, so that a larger face with higher probability is introduced, and the method specifically comprises the following operations: selecting a size s in a certain imagefaceThe anchor point scale s on the ith layer feature map (i is 0,1, …,5) is presetiAs shown in the following formula:
s_i = 2^(4+i)
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
where s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
The scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
The original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling. Fig. 4 shows the influence of adaptive anchor sampling on the WIDER FACE training data distribution, compared by the pose, occlusion, blur, and illumination attributes; the dotted lines represent the distribution of the original samples for each attribute, and the solid lines represent the corresponding distributions after adaptive anchor sampling.
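A rough Python sketch of the adaptive anchor-point sampling logic is given below for illustration; the rule for drawing s_result around the selected anchor scale is an assumption (the original defines it only in an equation reproduced as an image), and the function name is not from the patent.

    import numpy as np

    ANCHOR_SCALES = [2 ** (4 + i) for i in range(6)]    # s_i = 2^(4+i): 16, 32, ..., 512

    def adaptive_anchor_scale(s_face):
        """Return the whole-image scaling ratio s* for a face of size s_face."""
        # index of the anchor scale closest to the face size
        i_anchor = int(np.argmin([abs(s - s_face) for s in ANCHOR_SCALES]))
        # pick a neighbouring index from the set described above
        lo, hi = max(0, i_anchor - 1), min(5, i_anchor + 1)
        i_result = np.random.randint(lo, hi + 1)
        # ASSUMPTION: draw s_result uniformly around the chosen anchor scale
        s_result = np.random.uniform(ANCHOR_SCALES[i_result] / 2.0,
                                     ANCHOR_SCALES[i_result] * 2.0)
        return s_result / s_face              # s* = s_result / s_face

    # usage: scale the picture by adaptive_anchor_scale(s_face), then randomly crop
    # a 640 x 640 region that contains the selected face.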
Step 2: based on the augmented picture in the step 1, VGGNet-16 is used as a basic feature extraction network, features of different layers are fused in a weighted mode through a low-level feature pyramid network, a context auxiliary prediction module is adopted in a prediction link to expand a sub-network, and then a network model is deepened and widened, and the method mainly comprises the following steps:
step 2.1: and performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected to be used for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5.
Step 2.2: through the weighted fusion of the low-level detail features and the high-level semantic features through the low-level feature pyramid network, description information with more expressive power can be extracted, and the shallow feature graphs and the deep feature graphs used for prediction in the step 2.1 are recorded as phi respectivelyi、φi+1H denotes action at a higher levelAnd 2 times of upsampling operation on the feature map, and theta represents a relevant parameter of the upsampling operation, the new feature map generated after weighted fusion can be represented as follows:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
where α and β are hyper-parameters balancing the two terms, assigned the values 4 and 1 respectively, mainly because the feature maps that discriminate small and medium-scale faces well should play a larger role, which helps to weaken the negative interference caused by weaker feature maps; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached. Fig. 5 illustrates the structure of the low-level feature pyramid network, taking conv5_3 and conv_fc7 of VGGNet-16 as an example, whose feature map sizes are 40 × 40 and 20 × 20 respectively. Fig. 6 visualizes the features after fusion by the low-level feature pyramid network: the extracted high-level features are quite abstract and unfavorable for tiny, blurred and partially occluded faces, and fusing the high-level and low-level features largely restores the detailed information of the faces.
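For illustration, the weighted fusion φ'_i = α·φ_i + β·H(φ_{i+1}; θ) can be sketched in PyTorch as follows; the 1 × 1 convolution used to align channel counts is an assumption, since the patent does not specify how channels are matched.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowLevelFPNBlock(nn.Module):
        """Weighted fusion of a shallow feature map with an upsampled deeper one."""
        def __init__(self, deep_channels, shallow_channels, alpha=4.0, beta=1.0):
            super().__init__()
            self.alpha, self.beta = alpha, beta
            # 1x1 convolution to match channel counts before fusion (assumption)
            self.lateral = nn.Conv2d(deep_channels, shallow_channels, kernel_size=1)

        def forward(self, phi_i, phi_i_plus_1):
            # H: 2x upsampling of the deeper feature map, then channel alignment
            up = F.interpolate(phi_i_plus_1, scale_factor=2, mode='bilinear',
                               align_corners=False)
            up = self.lateral(up)
            return self.alpha * phi_i + self.beta * up

    # example: fuse conv_fc7 (20 x 20) into conv5_3 (40 x 40), then recurse downward
    block = LowLevelFPNBlock(deep_channels=1024, shallow_channels=512)
    fused = block(torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))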
Step 2.3: and (3) sending the weighted and fused feature graph obtained in the step into a context auxiliary prediction module, on one hand, expanding the receptive field to enable the prediction module to be deeper for auxiliary classification, on the other hand, adding a residual-free submodule to enable the model to be wider for auxiliary positioning, and selecting a splicing mode (splice) for each sub-network to fuse and realize channel parallel connection, thereby deepening and widening the network model. Fig. 7 is a structural diagram of a context-assisted prediction module, which not only retains rich context information, but also makes up for the deficiency of the characterization capability of a low-level feature map to some extent, although the low-level feature is helpful for detecting a small and medium-sized face.
And step 3: after the training parameters are initialized, a multi-scale training method is applied to guide the autonomous learning process of the model, and the model is stored and detected after loss convergence, and the method mainly comprises the following steps:
step 3.1: the training parameters are initialized, and the specific settings are shown in table 1 below.
TABLE 1 training parameter settings
[Table 1 is rendered as an image in the original; the key settings are described in the text below.]
The optimizer is stochastic gradient descent (SGD) with a momentum of 0.9; to prevent overfitting, the weight decay is set to 10^-5. Considering that the network learning process continually deepens, the learning rate is scheduled as follows: as the number of iterations increases and reaches each value in the step list {30000, 40000, 50000}, the learning rate decays to 0.1 of its previous value, which prevents an overly large learning rate from overshooting the optimum when the network parameters are already close to the global optimal solution.
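To make the schedule concrete, a minimal training-loop fragment consistent with these settings is sketched below; the initial learning rate and the stand-in model are assumptions, since Table 1 is reproduced only as an image.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 2, 3, padding=1)                      # stand-in for the detection network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,   # initial LR is an assumption
                                momentum=0.9, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30000, 40000, 50000], gamma=0.1)

    for iteration in range(60000):                   # placeholder iteration count
        images = torch.randn(4, 3, 640, 640)         # placeholder batch
        loss = model(images).mean()                  # placeholder; the real loss is L_loc + L_conf
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                             # stepped per iteration to hit the milestones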
Step 3.2: by applying a multi-scale training method, three scales are divided in the training process and respectively correspond to images with different resolutions, and the region of interest under each resolution has an appointed range: if the size of the true value box is in the range, the true value box is marked as correct, otherwise, the true value box is marked as error; when generating an anchor point and distributing a label for the anchor point, firstly detecting whether the proportion of the anchor point to an overlapped part of a true value frame marked as an error exceeds 30%, if so, the anchor point is regarded as an error anchor point, otherwise, the anchor point is a correct anchor point; the anchor points determined to be wrong are invalidated during training and are not added into the back propagation process to influence the parameters. The specific settings are shown in table 2 below.
TABLE 2 Multi-Scale training method parameter settings
Resolution             0~16      16~128     >128
Scale transformation   ×2.0      ×1.0       ×0.5
Region of interest     16~32     32~64      64~256
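The anchor-invalidation rule of step 3.2 can be sketched as follows; measuring the overlap as the fraction of the anchor's own area covered by the out-of-range ground-truth box is an interpretation of the wording above, not a formula given in the patent.

    import numpy as np

    def anchor_valid_mask(anchors, invalid_boxes, thresh=0.30):
        """True where an anchor may take part in training; boxes are (x1, y1, x2, y2)."""
        valid = np.ones(len(anchors), dtype=bool)
        for idx, (ax1, ay1, ax2, ay2) in enumerate(anchors):
            a_area = max(ax2 - ax1, 0) * max(ay2 - ay1, 0)
            if a_area == 0:
                valid[idx] = False
                continue
            for bx1, by1, bx2, by2 in invalid_boxes:
                iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
                ih = max(0.0, min(ay2, by2) - max(ay1, by1))
                if iw * ih / a_area > thresh:   # more than 30% of the anchor is covered
                    valid[idx] = False
                    break
        return valid

    # anchors marked False receive no label and are excluded from back-propagation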
Step 3.3: training by adopting smooth L1 loss guide position regression, wherein the expression is as follows:
L_loc = Σ_(i∈Ω) smooth_L1( ŷ^(i) − y^(i) )

smooth_L1(x) = 0.5·x² if |x| < 1, and |x| − 0.5 otherwise

where y^(i) denotes the ground-truth location label, ŷ^(i) denotes the coordinate label information predicted by the model, and Ω denotes the set of regions whose prior boxes are positive samples.

The softmax loss guides the training of class scoring, with the following expressions:
f(z_m) = exp(z_m) / Σ_(k=1..T) exp(z_k)

L_conf = − Σ_(k=1..T) x_k · log f(z_k)

where x_k denotes the actual class label, z_m denotes the input of the softmax layer, f(z_m) denotes the predicted output of the softmax layer, and T is the number of classes in the training data set.
The loss sum L of both can be expressed as:
L = L_loc + L_conf
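Using the standard PyTorch primitives, the two terms of L can be sketched as follows (the reduction mode and the positive-anchor masking are assumptions about details the text leaves open):

    import torch
    import torch.nn.functional as F

    def detection_loss(pred_offsets, target_offsets, pred_scores, target_labels):
        """L = L_loc + L_conf; regression is computed on positive (face) anchors only."""
        pos_mask = target_labels == 1
        l_loc = F.smooth_l1_loss(pred_offsets[pos_mask], target_offsets[pos_mask],
                                 reduction='sum')      # L_loc: smooth L1 position regression
        l_conf = F.cross_entropy(pred_scores, target_labels,
                                 reduction='sum')      # L_conf: softmax class scoring
        return l_loc + l_conf

    # shapes: pred_offsets (N, 4), target_offsets (N, 4), pred_scores (N, 2),
    # target_labels (N,) integer class indices with 1 = face and 0 = background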
in summary, the overall network structure of the face detection method based on context inference is shown in fig. 2.
Step 3.4: when the progressive loss no longer rises and settles in a smaller range (e.g., (0, 1)), the training may be stopped, otherwise, the process returns to step 3.1.
Step 3.5: stopping training, saving the model and detecting. The trained model is used for detecting partial human face samples related to attributes of inconsistent scales, fuzziness, strong and weak illumination, different postures, facial occlusion and makeup in the WIDER FACE test set, and the human face is marked by a rectangular frame, so that higher detection accuracy can be still maintained in the high-difficulty unconstrained scenes as shown in FIG. 8. The model size of the invention is 91M, the accuracy on Easy, Medium and Hard verification sets of the published WIDER FACE respectively reaches 93.8%, 92.5% and 86.7%, and good gain is obtained within the same-level model size range as shown in FIG. 9. The method has wide application scenes, is suitable for face detection tasks in various unconstrained scenes, has extremely high comprehensiveness and generalization, and still has higher accuracy when the method is used for detecting the arbitrarily captured unconstrained faces as shown in figure 10. The invention can detect 51 pictures per second on a GPU (graphic processing unit) platform, and can detect 39 pictures per second under the condition of only using a CPU (central processing unit), thereby meeting the real-time requirement in a human face detection task.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (4)

1. The face detection method based on context inference under the unconstrained scene is characterized by comprising the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set;
step 2, based on the augmented picture in the step 1, taking VGGNet-16 as a basic feature extraction network, fusing features of different layers in a weighted manner through a low-level feature pyramid network, and adopting a context auxiliary prediction module to expand a sub-network in a prediction link so as to deepen and widen a network model;
and 3, after the training parameters are initialized, guiding the autonomous learning process of the model by using a multi-scale training method, storing the model after loss convergence, and detecting.
2. The face detection method based on context inference under the unconstrained scene according to claim 1, wherein the step 1 specifically comprises the following sub-steps:
step 1.1: horizontally flipping and randomly cropping the pictures in the WIDER FACE training set as preliminary preprocessing, the specific operations being as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
wherein x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: simulating interference in unconstrained scenes by color dithering and noise perturbation, and further enhancing the preliminary preprocessing result x_preprocess obtained in step 1.1 to different degrees to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
wherein the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively;
step 1.3: reshaping the faces in an image with an adaptive anchor-point sampling method, so that larger faces are introduced with higher probability, the specific operations being as follows: a face of size s_face is selected in the image, and the preset anchor-point scale s_i on the i-th layer feature map is given by:
s_i = 2^(4+i)
wherein i = 0, 1, …, 5;
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
wherein s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
the scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
the original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling.
3. The face detection method based on context inference under an unconstrained scene according to claim 1, wherein the step 2 specifically comprises the following sub-steps:
step 2.1: performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5;
step 2.2: the low-level feature pyramid network performs weighted fusion of the low-level detail features and the high-level semantic features to extract more expressive descriptive information; the shallow and deep feature maps used for prediction in step 2.1 are denoted φ_i and φ_{i+1}, H denotes the 2× upsampling operation applied to the higher-level feature map, and θ denotes its parameters; the new feature map generated by weighted fusion can be represented as:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
wherein α and β are hyper-parameters balancing the two terms; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached;
step 2.3: feeding the weighted-fusion feature maps obtained above into a context-assisted prediction module, and fusing the sub-networks by concatenation to realize channel-wise parallel connection, thereby deepening and widening the network model.
4. The face detection method based on context inference under an unconstrained scene according to claim 1, wherein the step 3 specifically comprises the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: applying a multi-scale training method: training uses three scales corresponding to images of different resolutions, and the region of interest at each resolution has a designated range; if the size of a ground-truth box lies within that range it is marked as valid, otherwise as invalid; when anchor points are generated and labeled, it is first checked whether the fraction of the anchor overlapped by an invalid ground-truth box exceeds a given proportion, in which case the anchor is regarded as invalid, otherwise valid; anchors judged invalid are disabled during training and do not enter back-propagation or affect the parameters;
step 3.3: supervising position regression and class scoring with the smooth L1 loss and the softmax loss respectively; when the sum of the two losses stops changing and stabilizes within a small range, training is stopped, the model is saved, and detection is performed; otherwise the procedure returns to step 3.1.
CN202010531633.0A 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene Pending CN111898410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010531633.0A CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010531633.0A CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Publications (1)

Publication Number Publication Date
CN111898410A true CN111898410A (en) 2020-11-06

Family

ID=73207405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010531633.0A Pending CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Country Status (1)

Country Link
CN (1) CN111898410A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633065A (en) * 2020-11-19 2021-04-09 特斯联科技集团有限公司 Face detection method, system, storage medium and terminal based on data enhancement
CN113221907A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886159A (en) * 2019-01-30 2019-06-14 浙江工商大学 It is a kind of it is non-limiting under the conditions of method for detecting human face

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886159A (en) * 2019-01-30 2019-06-14 浙江工商大学 It is a kind of it is non-limiting under the conditions of method for detecting human face

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633065A (en) * 2020-11-19 2021-04-09 特斯联科技集团有限公司 Face detection method, system, storage medium and terminal based on data enhancement
CN113221907A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113221907B (en) * 2021-06-01 2024-05-31 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113673616B (en) * 2021-08-26 2023-09-29 南通大学 Light-weight small target detection method coupling attention and context
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN113837058B (en) * 2021-09-17 2022-09-30 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network

Similar Documents

Publication Publication Date Title
CN111898410A (en) Face detection method based on context reasoning under unconstrained scene
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN104050471B (en) Natural scene character detection method and system
EP1693782B1 (en) Method for facial features detection
US11790499B2 (en) Certificate image extraction method and terminal device
CN112614136A (en) Infrared small target real-time instance segmentation method and device
CN111951154B (en) Picture generation method and device containing background and medium
CN111553230A (en) Feature enhancement based progressive cascade face detection method under unconstrained scene
CN111553227A (en) Lightweight face detection method based on task guidance
CN112818952A (en) Coal rock boundary recognition method and device and electronic equipment
Yu et al. Pedestrian detection based on improved Faster RCNN algorithm
Cai et al. Vehicle Detection Based on Deep Dual‐Vehicle Deformable Part Models
Xu et al. License plate recognition system based on deep learning
CN111179289B (en) Image segmentation method suitable for webpage length graph and width graph
CN116310358B (en) Method, storage medium and equipment for detecting bolt loss of railway wagon
CN117392375A (en) Target detection algorithm for tiny objects
CN117078603A (en) Semiconductor laser chip damage detection method and system based on improved YOLO model
CN111898454A (en) Weight binarization neural network and transfer learning human eye state detection method and device
CN111553217A (en) Driver call monitoring method and system
CN114549809A (en) Gesture recognition method and related equipment
CN114038030A (en) Image tampering identification method, device and computer storage medium
Sami et al. Improved semantic inpainting architecture augmented with a facial landmark detector.
Tang et al. Component recognition method based on deep learning and machine vision
JP2003173436A (en) Image processing device, image processing method, and recording medium recording program readably by computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination