CN111898410A - Face detection method based on context reasoning under unconstrained scene - Google Patents

Face detection method based on context reasoning under unconstrained scene

Info

Publication number
CN111898410A
Authority
CN
China
Prior art keywords
face
anchor point
training
network
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010531633.0A
Other languages
Chinese (zh)
Inventor
徐琴珍
杨哲
邵文韬
刘茵茵
侯坤林
朱颖
杨绿溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010531633.0A
Publication of CN111898410A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a face detection scheme based on context reasoning for unconstrained scenes, belonging to the field of multimedia signal processing. The training set is first augmented; VGGNet-16 is used as the base feature-extraction network, features from different layers are fused in a weighted manner through a low-level feature pyramid network, and a context-assisted prediction module expands the sub-networks of the prediction stage so as to deepen and widen the network model; a data-enhancement scheme based on adaptive anchor sampling and a multi-scale training method are introduced to strengthen the model's adaptability to scale. The method extracts the most expressive descriptive information, compensates for insufficiently extracted facial features, and improves the utilization of those features. It is suited to unconstrained scenes where detection is difficult, and in particular achieves accurate detection of tiny, blurred, and occluded faces.

Description

Face detection method based on context reasoning under unconstrained scene
Technical Field
The invention belongs to the technical field of image processing, and relates to a face detection method based on context reasoning in an unconstrained scene.
Background
The popularity of intelligent terminal devices has profoundly influenced the way people think and has given these devices a new social role. Face detection is one of the computer-vision applications closest to daily life: it frees people from heavy visual-processing work by letting machines analyze and summarize specific information in images and videos, and it has deeply influenced the development of modern society. On smartphones, the iPhone X and the Mate 20 Pro introduced 3D face-recognition unlocking on the iOS and Android platforms respectively, better protecting user privacy; in security surveillance, face recognition is used to track and capture lawbreakers, strengthening public safety; for property security, Alipay was the first to roll out face-scanning payment and identity authentication for credit loans, improving efficiency while also ensuring security.
Early mainstream face detection methods were mostly based on hand-designed template matching. They detect clear, unoccluded frontal faces well, are easy to implement, and are hardly affected by illumination or image quality, but because of the high variability of the human face no template can fully adapt to changes in pose, scale, and the like, so their accuracy is limited. Traditional face detection methods, which decide whether an image contains a face merely by mechanically comparing the correlation between hand-crafted features and the target face, are therefore unsuitable for unconstrained scenes.
With the rapid development of deep learning, face detection methods based on convolutional neural networks, with their strong representation-learning and nonlinear-modeling capability, have gradually replaced traditional methods; detection performance has improved markedly, and accuracy on clear, unoccluded faces approaches one hundred percent. However, unconstrained faces in natural scenes are easily disturbed by external factors such as occlusion, illumination, expression, and pose, so facial features are insufficiently extracted and utilized. In addition, small, low-resolution faces remain a bottleneck: densely sampling small faces with small anchors easily produces an excess of background negative samples and raises the false-detection rate. The accuracy of existing face detection methods in unconstrained scenes is therefore still insufficient, and a satisfactory result cannot yet be obtained.
Disclosure of Invention
In order to solve the above problems, the present invention provides a face detection method based on context reasoning in unconstrained scenes, focusing on two improvements. First, facial features, especially the most expressive descriptive information, are extracted as fully as possible: features from different levels are fused in a weighted manner through a low-level feature pyramid network, and a context-assisted prediction module expands the sub-networks of the prediction stage, so that the deeper and wider network model compensates for insufficiently extracted facial features. Second, a data-enhancement scheme based on adaptive anchor sampling and a multi-scale training method are introduced, strengthening the model's adaptability to scale and improving the utilization of facial features.
In order to achieve the purpose, the invention provides the following technical scheme:
the face detection method based on context inference under the unconstrained scene comprises the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set (currently the most authoritative face detection benchmark);
step 2, based on the augmented pictures from step 1, taking VGGNet-16 (a classical deep convolutional neural network) as the basic feature extraction network, fusing features of different layers in a weighted manner through a low-level feature pyramid network, and adopting a context-assisted prediction module to expand sub-networks in the prediction stage, thereby deepening and widening the network model;
step 3, after the training parameters are initialized, guiding the model's autonomous learning process with a multi-scale training method, saving the model after the loss converges, and performing detection.
Further, the step 1 specifically includes the following sub-steps:
step 1.1: horizontally flip and randomly crop the pictures in the WIDER FACE training set as preliminary preprocessing. The specific operations are as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: simulate interference in unconstrained scenes by color dithering and noise perturbation, further enhancing the preliminary preprocessing result x_preprocess obtained in step 1.1 to different degrees, so as to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
where the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively.
Step 1.3: the face in a certain image is reshaped by adopting a self-adaptive anchor point sampling method, so that a larger face with higher probability is introduced, and the method specifically comprises the following operations: selecting a size s in a certain imagefaceThe anchor point scale s on the ith layer feature map (i is 0,1, …,5) is presetiAs shown in the following formula:
s_i = 2^(4+i)
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
where s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
The scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
The original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling.
Further, the step 2 specifically includes the following sub-steps:
step 2.1: performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5;
step 2.2: fuse the low-level detail features and the high-level semantic features in a weighted manner through the low-level feature pyramid network, so that more expressive descriptive information can be extracted. Let φ_i and φ_{i+1} denote a shallow and a deep feature map used for prediction in step 2.1, H denote the 2× upsampling operation applied to the higher-level feature map, and θ its parameters; the new feature map generated by weighted fusion can be represented as:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
where α and β are hyper-parameters balancing the two terms; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached;
step 2.3: feed the weighted-fusion feature maps obtained above into a context-assisted prediction module, and fuse the sub-networks by concatenation (channel-wise parallel connection), thereby deepening and widening the network model.
Further, the step 3 specifically includes the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: apply a multi-scale training method: training uses three scales corresponding to images of different resolutions, and the region of interest at each resolution has a designated range. If the size of a ground-truth box lies within that range it is marked as valid, otherwise as invalid. When anchor points are generated and labeled, it is first checked whether the fraction of the anchor overlapped by an invalid ground-truth box exceeds a given proportion; if so, the anchor is regarded as invalid, otherwise valid. Anchors judged invalid are disabled during training and do not enter back-propagation or affect the parameters;
step 3.3: supervise position regression and class scoring with the smooth L1 loss and the softmax loss respectively; when the sum of the two losses stops changing and stabilizes within a small range, stop training, save the model, and perform detection; otherwise return to step 3.1.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention compensates for the prior art's neglect of more expressive information: low-level detail information and high-level semantic information are fused in a weighted manner through a low-level feature pyramid network, the context-assisted prediction module deepens and widens the network model by expanding sub-networks, and insufficiently extracted facial features are thereby remedied.
2. The invention further strengthens the model's sensitivity and adaptability to scale: a data-enhancement scheme based on adaptive anchor-point sampling and a multi-scale training method are introduced, improving the utilization of facial features and yielding a clear gain.
3. When facing faces in unconstrained scenes with attributes such as varying scale, blur, strong or weak illumination, different poses, facial occlusion and make-up, the invention maintains high detection accuracy and strong resistance to interference, and is highly adaptable and general.
Drawings
FIG. 1 is a flow chart of the face detection method based on context inference according to the present invention.
FIG. 2 is a network model diagram of the face detection method based on context inference.
Fig. 3 is a schematic diagram of a human face image processing enhancement mode.
FIG. 4 is a comparison graph of training sample data distribution before and after adaptive anchor point sampling.
FIG. 5 is a schematic diagram of a low-level feature pyramid network structure.
FIG. 6 is a low-level feature pyramid network fusion feature visualization.
FIG. 7 is a block diagram of a context assisted prediction module.
Fig. 8 is a diagram illustrating the detection effect of the trained model on WIDER FACE face samples in the test set.
FIG. 9 shows the detection accuracy of the trained model on the Easy, Medium, Hard validation set of WIDER FACE.
Fig. 10 is a diagram illustrating the effect of detecting an unconstrained face by using a trained model.
The photographs in the drawings are originally color pictures and have been converted to grayscale to meet patent-filing requirements.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
Taking the WIDER FACE data set (the most authoritative face detection benchmark) as an example, the specific implementation of the face detection method based on context reasoning in unconstrained scenes is described in detail below with reference to the accompanying drawings. The flow of the method is shown in Fig. 1 and comprises the following steps:
step 1: the data augmentation of WIDER FACE training set mainly includes the following three aspects:
step 1.1: horizontally flip and randomly crop the pictures in the WIDER FACE training set as preliminary preprocessing. The specific operations are as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
where x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640. An example of these data-enhancement operations is shown in Fig. 3: the first row shows original input images of arbitrary size, the second row shows the corresponding images expanded to 4 times their original size, and the third and fourth rows show the preliminary preprocessing results of flipping and cropping some of the samples.
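By way of illustration only (this sketch is not part of the claimed method), the preliminary preprocessing x_preprocess = Crop(Flip(Extend(x_input))) can be written in Python roughly as follows; the placement of the original image inside the mean-padded canvas and the handling of images smaller than 640 × 640 are assumptions.

    import numpy as np

    def extend(img, factor=4):
        """Pad the image to `factor` times its size, filling with the per-channel mean."""
        h, w, c = img.shape
        canvas = np.empty((h * factor, w * factor, c), dtype=img.dtype)
        canvas[:] = img.mean(axis=(0, 1), keepdims=True).astype(img.dtype)
        top = np.random.randint(0, h * (factor - 1) + 1)     # random placement (assumption)
        left = np.random.randint(0, w * (factor - 1) + 1)
        canvas[top:top + h, left:left + w] = img
        return canvas

    def flip(img, p=0.5):
        """Mirror the image horizontally with probability p."""
        return img[:, ::-1] if np.random.rand() < p else img

    def crop(img, size=640):
        """Randomly cut out a size x size region (smaller if the image is smaller)."""
        h, w, _ = img.shape
        top = np.random.randint(0, max(h - size, 0) + 1)
        left = np.random.randint(0, max(w - size, 0) + 1)
        return img[top:top + size, left:left + size]

    def preprocess(x_input):
        # x_preprocess = Crop(Flip(Extend(x_input)))
        return crop(flip(extend(x_input)))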
Step 1.2: and simulating the interference in an unconstrained scene by adopting a color dithering and noise disturbance mode. These two data enhancement modes are briefly described below:
Color dithering: to account for different illumination intensities, background atmospheres, shooting conditions and so on, the saturation, brightness, contrast and sharpness of the input image are each adjusted by randomly generated factors.
Noise disturbance: the method mainly relates to the addition of Gaussian white noise and salt and pepper noise, wherein the Gaussian noise refers to that the noise amplitude obeys Gaussian distribution, namely the number of noise points with certain intensity is the largest, and the number of noise points which are farther away from the intensity is smaller, so that the noise is additive noise; the salt and pepper noise is an impulse noise, and the alternating black and white bright and dark point noise can be generated on an original image by randomly changing the values of some pixel points, so that the salt and pepper noise is vivid, is just like spreading salt and pepper on the image, and is a logic noise.
In summary, the preliminary preprocessing result x_preprocess obtained in step 1.1 is further enhanced to different degrees to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
where the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively. An example is shown in Fig. 3: the fifth row applies color dithering to the pictures cropped in the fourth row, and the sixth and seventh rows add Gaussian noise and salt-and-pepper noise of different degrees to those pictures, enhancing the model's robustness to arbitrary external environmental factors.
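As an illustrative sketch only (the jitter range and noise levels below are assumptions, since the patent does not fix them, and Pillow is used purely for convenience), the color dithering and the two noise perturbations could be implemented as:

    import numpy as np
    from PIL import Image, ImageEnhance

    def color_jitter(img):
        """Randomly perturb saturation, brightness, contrast and sharpness."""
        pil = Image.fromarray(img)
        for enhancer in (ImageEnhance.Color, ImageEnhance.Brightness,
                         ImageEnhance.Contrast, ImageEnhance.Sharpness):
            factor = np.random.uniform(0.6, 1.4)          # illustrative jitter range
            pil = enhancer(pil).enhance(factor)
        return np.asarray(pil)

    def gaussian_noise(img, sigma=10.0):
        """Additive noise whose amplitude follows a Gaussian distribution."""
        noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def salt_pepper_noise(img, amount=0.01):
        """Impulse noise: set a random fraction of pixels to pure black or white."""
        out = img.copy()
        mask = np.random.rand(*img.shape[:2])
        out[mask < amount / 2] = 0            # pepper
        out[mask > 1 - amount / 2] = 255      # salt
        return out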
Step 1.3: the face in a certain image is reshaped by adopting a self-adaptive anchor point sampling method, so that a larger face with higher probability is introduced, and the method specifically comprises the following operations: selecting a size s in a certain imagefaceThe anchor point scale s on the ith layer feature map (i is 0,1, …,5) is presetiAs shown in the following formula:
s_i = 2^(4+i)
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
where s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
The scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
The original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling. Fig. 4 shows the influence of adaptive anchor sampling on the WIDER FACE training data distribution, compared by the pose, occlusion, blur, and illumination attributes; the dotted lines represent the distribution of the original samples for each attribute, and the solid lines represent the corresponding distributions after adaptive anchor sampling.
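A rough Python sketch of the adaptive anchor-point sampling logic is given below for illustration; the rule for drawing s_result around the selected anchor scale is an assumption (the original defines it only in an equation reproduced as an image), and the function name is not from the patent.

    import numpy as np

    ANCHOR_SCALES = [2 ** (4 + i) for i in range(6)]    # s_i = 2^(4+i): 16, 32, ..., 512

    def adaptive_anchor_scale(s_face):
        """Return the whole-image scaling ratio s* for a face of size s_face."""
        # index of the anchor scale closest to the face size
        i_anchor = int(np.argmin([abs(s - s_face) for s in ANCHOR_SCALES]))
        # pick a neighbouring index from the set described above
        lo, hi = max(0, i_anchor - 1), min(5, i_anchor + 1)
        i_result = np.random.randint(lo, hi + 1)
        # ASSUMPTION: draw s_result uniformly around the chosen anchor scale
        s_result = np.random.uniform(ANCHOR_SCALES[i_result] / 2.0,
                                     ANCHOR_SCALES[i_result] * 2.0)
        return s_result / s_face              # s* = s_result / s_face

    # usage: scale the picture by adaptive_anchor_scale(s_face), then randomly crop
    # a 640 x 640 region that contains the selected face.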
Step 2: based on the augmented picture in the step 1, VGGNet-16 is used as a basic feature extraction network, features of different layers are fused in a weighted mode through a low-level feature pyramid network, a context auxiliary prediction module is adopted in a prediction link to expand a sub-network, and then a network model is deepened and widened, and the method mainly comprises the following steps:
step 2.1: and performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected to be used for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5.
Step 2.2: through the weighted fusion of the low-level detail features and the high-level semantic features through the low-level feature pyramid network, description information with more expressive power can be extracted, and the shallow feature graphs and the deep feature graphs used for prediction in the step 2.1 are recorded as phi respectivelyi、φi+1H denotes action at a higher levelAnd 2 times of upsampling operation on the feature map, and theta represents a relevant parameter of the upsampling operation, the new feature map generated after weighted fusion can be represented as follows:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
where α and β are hyper-parameters balancing the two terms, assigned the values 4 and 1 respectively, mainly because the feature maps that discriminate small and medium-scale faces well should play a larger role, which helps to weaken the negative interference caused by weaker feature maps; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached. Fig. 5 illustrates the structure of the low-level feature pyramid network, taking conv5_3 and conv_fc7 of VGGNet-16 as an example, whose feature map sizes are 40 × 40 and 20 × 20 respectively. Fig. 6 visualizes the features after fusion by the low-level feature pyramid network: the extracted high-level features are quite abstract and unfavorable for tiny, blurred and partially occluded faces, and fusing the high-level and low-level features largely restores the detailed information of the faces.
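For illustration, the weighted fusion φ'_i = α·φ_i + β·H(φ_{i+1}; θ) can be sketched in PyTorch as follows; the 1 × 1 convolution used to align channel counts is an assumption, since the patent does not specify how channels are matched.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowLevelFPNBlock(nn.Module):
        """Weighted fusion of a shallow feature map with an upsampled deeper one."""
        def __init__(self, deep_channels, shallow_channels, alpha=4.0, beta=1.0):
            super().__init__()
            self.alpha, self.beta = alpha, beta
            # 1x1 convolution to match channel counts before fusion (assumption)
            self.lateral = nn.Conv2d(deep_channels, shallow_channels, kernel_size=1)

        def forward(self, phi_i, phi_i_plus_1):
            # H: 2x upsampling of the deeper feature map, then channel alignment
            up = F.interpolate(phi_i_plus_1, scale_factor=2, mode='bilinear',
                               align_corners=False)
            up = self.lateral(up)
            return self.alpha * phi_i + self.beta * up

    # example: fuse conv_fc7 (20 x 20) into conv5_3 (40 x 40), then recurse downward
    block = LowLevelFPNBlock(deep_channels=1024, shallow_channels=512)
    fused = block(torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))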
Step 2.3: and (3) sending the weighted and fused feature graph obtained in the step into a context auxiliary prediction module, on one hand, expanding the receptive field to enable the prediction module to be deeper for auxiliary classification, on the other hand, adding a residual-free submodule to enable the model to be wider for auxiliary positioning, and selecting a splicing mode (splice) for each sub-network to fuse and realize channel parallel connection, thereby deepening and widening the network model. Fig. 7 is a structural diagram of a context-assisted prediction module, which not only retains rich context information, but also makes up for the deficiency of the characterization capability of a low-level feature map to some extent, although the low-level feature is helpful for detecting a small and medium-sized face.
And step 3: after the training parameters are initialized, a multi-scale training method is applied to guide the autonomous learning process of the model, and the model is stored and detected after loss convergence, and the method mainly comprises the following steps:
step 3.1: the training parameters are initialized, and the specific settings are shown in table 1 below.
TABLE 1 training parameter settings
[Table 1 is rendered as an image in the original; the key settings are described in the text below.]
The optimizer is stochastic gradient descent (SGD) with a momentum of 0.9; to prevent overfitting, the weight decay is set to 10^-5. Considering that the network learning process continually deepens, the learning rate is scheduled as follows: as the number of iterations increases and reaches each value in the step list {30000, 40000, 50000}, the learning rate decays to 0.1 of its previous value, which prevents an overly large learning rate from overshooting the optimum when the network parameters are already close to the global optimal solution.
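To make the schedule concrete, a minimal training-loop fragment consistent with these settings is sketched below; the initial learning rate and the stand-in model are assumptions, since Table 1 is reproduced only as an image.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 2, 3, padding=1)                      # stand-in for the detection network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,   # initial LR is an assumption
                                momentum=0.9, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30000, 40000, 50000], gamma=0.1)

    for iteration in range(60000):                   # placeholder iteration count
        images = torch.randn(4, 3, 640, 640)         # placeholder batch
        loss = model(images).mean()                  # placeholder; the real loss is L_loc + L_conf
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                             # stepped per iteration to hit the milestones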
Step 3.2: by applying a multi-scale training method, three scales are divided in the training process and respectively correspond to images with different resolutions, and the region of interest under each resolution has an appointed range: if the size of the true value box is in the range, the true value box is marked as correct, otherwise, the true value box is marked as error; when generating an anchor point and distributing a label for the anchor point, firstly detecting whether the proportion of the anchor point to an overlapped part of a true value frame marked as an error exceeds 30%, if so, the anchor point is regarded as an error anchor point, otherwise, the anchor point is a correct anchor point; the anchor points determined to be wrong are invalidated during training and are not added into the back propagation process to influence the parameters. The specific settings are shown in table 2 below.
TABLE 2 Multi-Scale training method parameter settings
Resolution             0~16      16~128     >128
Scale transformation   ×2.0      ×1.0       ×0.5
Region of interest     16~32     32~64      64~256
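The anchor-invalidation rule of step 3.2 can be sketched as follows; measuring the overlap as the fraction of the anchor's own area covered by the out-of-range ground-truth box is an interpretation of the wording above, not a formula given in the patent.

    import numpy as np

    def anchor_valid_mask(anchors, invalid_boxes, thresh=0.30):
        """True where an anchor may take part in training; boxes are (x1, y1, x2, y2)."""
        valid = np.ones(len(anchors), dtype=bool)
        for idx, (ax1, ay1, ax2, ay2) in enumerate(anchors):
            a_area = max(ax2 - ax1, 0) * max(ay2 - ay1, 0)
            if a_area == 0:
                valid[idx] = False
                continue
            for bx1, by1, bx2, by2 in invalid_boxes:
                iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
                ih = max(0.0, min(ay2, by2) - max(ay1, by1))
                if iw * ih / a_area > thresh:   # more than 30% of the anchor is covered
                    valid[idx] = False
                    break
        return valid

    # anchors marked False receive no label and are excluded from back-propagation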
Step 3.3: training by adopting smooth L1 loss guide position regression, wherein the expression is as follows:
L_loc = Σ_(i∈Ω) smooth_L1( ŷ^(i) − y^(i) )

smooth_L1(x) = 0.5·x² if |x| < 1, and |x| − 0.5 otherwise

where y^(i) denotes the ground-truth location label, ŷ^(i) denotes the coordinate label information predicted by the model, and Ω denotes the set of regions whose prior boxes are positive samples.

The softmax loss guides the training of class scoring, with the following expressions:
f(z_m) = exp(z_m) / Σ_(k=1..T) exp(z_k)

L_conf = − Σ_(k=1..T) x_k · log f(z_k)

where x_k denotes the actual class label, z_m denotes the input of the softmax layer, f(z_m) denotes the predicted output of the softmax layer, and T is the number of classes in the training data set.
The loss sum L of both can be expressed as:
L = L_loc + L_conf
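Using the standard PyTorch primitives, the two terms of L can be sketched as follows (the reduction mode and the positive-anchor masking are assumptions about details the text leaves open):

    import torch
    import torch.nn.functional as F

    def detection_loss(pred_offsets, target_offsets, pred_scores, target_labels):
        """L = L_loc + L_conf; regression is computed on positive (face) anchors only."""
        pos_mask = target_labels == 1
        l_loc = F.smooth_l1_loss(pred_offsets[pos_mask], target_offsets[pos_mask],
                                 reduction='sum')      # L_loc: smooth L1 position regression
        l_conf = F.cross_entropy(pred_scores, target_labels,
                                 reduction='sum')      # L_conf: softmax class scoring
        return l_loc + l_conf

    # shapes: pred_offsets (N, 4), target_offsets (N, 4), pred_scores (N, 2),
    # target_labels (N,) integer class indices with 1 = face and 0 = background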
in summary, the overall network structure of the face detection method based on context inference is shown in fig. 2.
Step 3.4: when the progressive loss no longer rises and settles in a smaller range (e.g., (0, 1)), the training may be stopped, otherwise, the process returns to step 3.1.
Step 3.5: stopping training, saving the model and detecting. The trained model is used for detecting partial human face samples related to attributes of inconsistent scales, fuzziness, strong and weak illumination, different postures, facial occlusion and makeup in the WIDER FACE test set, and the human face is marked by a rectangular frame, so that higher detection accuracy can be still maintained in the high-difficulty unconstrained scenes as shown in FIG. 8. The model size of the invention is 91M, the accuracy on Easy, Medium and Hard verification sets of the published WIDER FACE respectively reaches 93.8%, 92.5% and 86.7%, and good gain is obtained within the same-level model size range as shown in FIG. 9. The method has wide application scenes, is suitable for face detection tasks in various unconstrained scenes, has extremely high comprehensiveness and generalization, and still has higher accuracy when the method is used for detecting the arbitrarily captured unconstrained faces as shown in figure 10. The invention can detect 51 pictures per second on a GPU (graphic processing unit) platform, and can detect 39 pictures per second under the condition of only using a CPU (central processing unit), thereby meeting the real-time requirement in a human face detection task.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (4)

1. The face detection method based on context inference under the unconstrained scene is characterized by comprising the following steps:
step 1, carrying out data augmentation on the WIDER FACE training set;
step 2, based on the augmented picture in the step 1, taking VGGNet-16 as a basic feature extraction network, fusing features of different layers in a weighted manner through a low-level feature pyramid network, and adopting a context auxiliary prediction module to expand a sub-network in a prediction link so as to deepen and widen a network model;
and 3, after the training parameters are initialized, guiding the autonomous learning process of the model by using a multi-scale training method, storing the model after loss convergence, and detecting.
2. The face detection method based on context inference under the unconstrained scene according to claim 1, wherein the step 1 specifically comprises the following sub-steps:
step 1.1: horizontally flipping and randomly cropping the pictures in the WIDER FACE training set as preliminary preprocessing, the specific operations being as follows: first, the input image is expanded to 4 times its original size; then each picture is mirrored horizontally; finally a 640 × 640 region is randomly cropped, i.e. the following formula is applied:
x_preprocess = Crop(Flip(Extend(x_input)))
wherein x_input denotes an input training-set picture, the Extend operation enlarges the picture using mean-value padding, the Flip operation performs a random horizontal flip, the Crop operation is random, and x_preprocess denotes the corresponding preliminary preprocessing result, whose size is unified to 640 × 640;
step 1.2: simulating interference in unconstrained scenes by color dithering and noise perturbation, and further enhancing the preliminary preprocessing result x_preprocess obtained in step 1.1 to different degrees to obtain the fully processed augmented picture x_process, as shown in the following formula:
[Equation rendered as an image in the original: x_process is obtained by applying the Color, Noise(Gaussian) and Noise(Salt & pepper) operations to x_preprocess.]
wherein the Color operation denotes color dithering, and the Noise(Gaussian) and Noise(Salt & pepper) operations denote adding Gaussian noise and salt-and-pepper noise to the picture, respectively;
step 1.3: reshaping the faces in an image with an adaptive anchor-point sampling method, so that larger faces are introduced with higher probability, the specific operations being as follows: a face of size s_face is selected in the image, and the preset anchor-point scale s_i on the i-th layer feature map is given by:
s_i = 2^(4+i)
wherein i = 0, 1, …, 5;
the index of the anchor point whose scale is closest to the face size s_face is denoted:
i_anchor = argmin_i | s_i − s_face |
wherein s_i is the anchor-point scale of the i-th layer feature map;
an index i_result is then selected from the set { max(0, i_anchor − 1), …, min(5, i_anchor + 1) }, and the face size s_face in the original picture is finally adjusted to s_result:
[Equation rendered as an image in the original: s_result is determined from the anchor-point scale s_(i_result).]
the scaling ratio s* for the whole image is thus obtained:
s* = s_result / s_face
the original sample picture is scaled by s*, and a 640 × 640 region containing the selected face is then randomly cropped; this is the training sample picture after adaptive anchor-point sampling.
3. The face detection method based on context inference under an unconstrained scene according to claim 1, wherein the step 2 specifically comprises the following sub-steps:
step 2.1: performing basic feature extraction on the augmented input picture through VGGNet-16, wherein conv3_3, conv4_3, conv5_3, conv _ fc7, conv6_2 and conv7_2 are respectively selected for final prediction, and the feature map sizes are respectively 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10 and 5 × 5;
step 2.2: the low-level feature pyramid network performs weighted fusion of the low-level detail features and the high-level semantic features to extract more expressive descriptive information; the shallow and deep feature maps used for prediction in step 2.1 are denoted φ_i and φ_{i+1}, H denotes the 2× upsampling operation applied to the higher-level feature map, and θ denotes its parameters; the new feature map generated by weighted fusion can be represented as:
φ'_i = α·φ_i + β·H(φ_{i+1}; θ)
wherein α and β are hyper-parameters balancing the two terms; the new feature map on the left-hand side then recursively enters the low-level feature pyramid network together with the next lower-level feature map, until the lowest level is reached;
step 2.3: feeding the weighted-fusion feature maps obtained above into a context-assisted prediction module, and fusing the sub-networks by concatenation to realize channel-wise parallel connection, thereby deepening and widening the network model.
4. The face detection method based on context inference under an unconstrained scene according to claim 1, wherein the step 3 specifically comprises the following sub-steps:
step 3.1: initializing the training parameters;
step 3.2: applying a multi-scale training method: training uses three scales corresponding to images of different resolutions, and the region of interest at each resolution has a designated range; if the size of a ground-truth box lies within that range it is marked as valid, otherwise as invalid; when anchor points are generated and labeled, it is first checked whether the fraction of the anchor overlapped by an invalid ground-truth box exceeds a given proportion, in which case the anchor is regarded as invalid, otherwise valid; anchors judged invalid are disabled during training and do not enter back-propagation or affect the parameters;
step 3.3: supervising position regression and class scoring with the smooth L1 loss and the softmax loss respectively; when the sum of the two losses stops changing and stabilizes within a small range, training is stopped, the model is saved, and detection is performed; otherwise the procedure returns to step 3.1.
CN202010531633.0A 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene Pending CN111898410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010531633.0A CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010531633.0A CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Publications (1)

Publication Number Publication Date
CN111898410A true CN111898410A (en) 2020-11-06

Family

ID=73207405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010531633.0A Pending CN111898410A (en) 2020-06-11 2020-06-11 Face detection method based on context reasoning under unconstrained scene

Country Status (1)

Country Link
CN (1) CN111898410A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633065A (en) * 2020-11-19 2021-04-09 特斯联科技集团有限公司 Face detection method, system, storage medium and terminal based on data enhancement
CN113221907A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886159A (en) * 2019-01-30 2019-06-14 浙江工商大学 It is a kind of it is non-limiting under the conditions of method for detecting human face

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886159A (en) * 2019-01-30 2019-06-14 浙江工商大学 It is a kind of it is non-limiting under the conditions of method for detecting human face

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633065A (en) * 2020-11-19 2021-04-09 特斯联科技集团有限公司 Face detection method, system, storage medium and terminal based on data enhancement
CN113221907A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113221907B (en) * 2021-06-01 2024-05-31 平安科技(深圳)有限公司 Vehicle part segmentation method, device, equipment and storage medium
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113673616B (en) * 2021-08-26 2023-09-29 南通大学 Light-weight small target detection method coupling attention and context
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN113837058B (en) * 2021-09-17 2022-09-30 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network

Similar Documents

Publication Publication Date Title
CN111898410A (en) Face detection method based on context reasoning under unconstrained scene
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN104751142B (en) A kind of natural scene Method for text detection based on stroke feature
CN104050471B (en) Natural scene character detection method and system
EP1693782B1 (en) Method for facial features detection
US11790499B2 (en) Certificate image extraction method and terminal device
CN112614136A (en) Infrared small target real-time instance segmentation method and device
CN111951154B (en) Picture generation method and device containing background and medium
CN111553230A (en) Feature enhancement based progressive cascade face detection method under unconstrained scene
CN111553227A (en) Lightweight face detection method based on task guidance
CN112818952A (en) Coal rock boundary recognition method and device and electronic equipment
Yu et al. Pedestrian detection based on improved Faster RCNN algorithm
Cai et al. Vehicle Detection Based on Deep Dual‐Vehicle Deformable Part Models
Xu et al. License plate recognition system based on deep learning
CN111179289B (en) Image segmentation method suitable for webpage length graph and width graph
CN116310358B (en) Method, storage medium and equipment for detecting bolt loss of railway wagon
CN117392375A (en) Target detection algorithm for tiny objects
CN117078603A (en) Semiconductor laser chip damage detection method and system based on improved YOLO model
CN111898454A (en) Weight binarization neural network and transfer learning human eye state detection method and device
CN111553217A (en) Driver call monitoring method and system
CN114549809A (en) Gesture recognition method and related equipment
CN114038030A (en) Image tampering identification method, device and computer storage medium
Sami et al. Improved semantic inpainting architecture augmented with a facial landmark detector.
Tang et al. Component recognition method based on deep learning and machine vision
JP2003173436A (en) Image processing device, image processing method, and recording medium recording program readably by computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination