CN112132839B - Multi-scale rapid face segmentation method based on deep convolution cascade network - Google Patents

Multi-scale rapid face segmentation method based on deep convolution cascade network

Info

Publication number
CN112132839B
CN112132839B (granted publication of application CN202010878450.6A)
Authority
CN
China
Prior art keywords
face
confidence
neural network
network model
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010878450.6A
Other languages
Chinese (zh)
Other versions
CN112132839A (en)
Inventor
徐联伯
彭珂凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Eagle Zhida Technology Co ltd
Original Assignee
Hangzhou Eagle Zhida Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Eagle Zhida Technology Co ltd filed Critical Hangzhou Eagle Zhida Technology Co ltd
Priority to CN202010878450.6A priority Critical patent/CN112132839B/en
Publication of CN112132839A publication Critical patent/CN112132839A/en
Application granted granted Critical
Publication of CN112132839B publication Critical patent/CN112132839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to face segmentation technology and discloses a multi-scale rapid face segmentation method based on a deep convolution cascade network. First, an image pyramid is built for the input image to be segmented and its levels are input one by one into a first convolutional neural network model; the local peak points of each confidence heat map are calculated and the repeated face frames at each scale are removed. The resulting face frames are input one by one into a second convolutional neural network model, which filters out frames whose confidence is below the face confidence threshold and further refines the positions of the frames that pass. The refined face frames are then input one by one into a third convolutional neural network model, which filters the frames again, adjusts their positions, and segments the whole face region as well as local regions such as the eyes, nose and mouth. The face segmentation technique designed by the invention performs face segmentation rapidly at multiple scales with high segmentation precision, reduces the overall network computation cost, is suitable for embedded platforms with limited computing resources, and also supports recognition of specific persons through the key parts of the face.

Description

Multi-scale rapid face segmentation method based on deep convolution cascade network
Technical Field
The invention relates to a face segmentation technology, in particular to a multi-scale rapid face segmentation method based on a deep convolution cascade network.
Background
Face detection and segmentation are crucial in a face recognition system. Before effective face recognition can be carried out, the influence of illumination, pose, expression, image quality and the like must be overcome, the faces present in the target scene must be detected accurately and effectively, and the positions of key parts of the face such as the eyes, nose and mouth must be located precisely. Accurate positioning of these key parts improves the face correction effect and thereby the face recognition accuracy. The prior art cannot perform multi-scale face segmentation, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable.
For example, the patent titled "An infrared face segmentation method using annular shortest path", application number CN201610090345.X, filed 2016-02-18, relates to a face segmentation method, device and equipment, wherein the method comprises: acquiring a face image comprising the face region to be segmented; extracting key point information of the face region to be segmented; determining semantic prior layer information corresponding to the face region to be segmented according to the key point information and the correspondence between each segmented object and the key point information, the semantic prior layer information representing the constrained segmentation area corresponding to each segmentation object in the face region to be segmented; and inputting the semantic prior layer information and the face image into a pre-trained network model to obtain the segmentation result corresponding to each segmentation object in the face region to be segmented.
The face segmentation method, device and equipment provided in the prior art can improve segmentation accuracy. However, when face segmentation is carried out with the annular-shortest-path infrared face segmentation method, multi-scale face segmentation cannot be performed, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable.
Disclosure of Invention
Aiming at the problems in the prior art that multi-scale face segmentation cannot be performed, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable, the invention provides a multi-scale rapid face segmentation method based on a deep convolution cascade network.
In order to solve the above technical problems, the invention adopts the following technical scheme:
a multi-scale rapid face segmentation method based on a deep convolution cascade network comprises the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
when the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented;
S4: result output, namely outputting the segmentation result obtained in step S3; the overall flow of steps S1 to S4 is sketched in the example below.
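The following Python sketch illustrates only the control flow of steps S1 to S4; the network callables (pnet, rnet, segnet), their signatures, the pyramid parameters and the threshold values are illustrative assumptions and are not part of the patented implementation.

```python
# Minimal sketch of the three-stage cascade (steps S1-S4); the three networks
# are supplied as callables, and all names/signatures here are assumptions.
import cv2
import numpy as np

def image_pyramid(image, min_size=24, factor=0.79):
    """Yield (scale, resized image) pairs until the short side drops below min_size."""
    h, w = image.shape[:2]
    scale = 1.0
    while min(h, w) * scale >= min_size:
        yield scale, cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
        scale *= factor

def crop(image, box):
    """Crop a face frame (x1, y1, x2, y2), clipped to the image bounds."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    h, w = image.shape[:2]
    return image[max(0, y1):min(h, y2), max(0, x1):min(w, x2)]

def cascade_segment(image, pnet, rnet, segnet, conf_thresh=0.7):
    # S1: multi-scale face candidate frames from the first (pyramid) network
    candidates = []
    for scale, scaled in image_pyramid(image):
        for box in pnet(scaled):                  # boxes in pyramid-level coordinates
            candidates.append(np.asarray(box, dtype=np.float64) / scale)
    # S2: confidence filtering and box refinement by the second network
    refined = []
    for box in candidates:
        conf, box_adj = rnet(crop(image, box), box)
        if conf >= conf_thresh:
            refined.append(box_adj)
    # S3: final filtering, position adjustment and whole-face / eyes-nose-mouth masks
    outputs = []
    for box in refined:
        conf, box_adj, masks = segnet(crop(image, box), box)
        if conf >= conf_thresh:
            outputs.append((box_adj, masks))
    return outputs                                # S4: segmented result
```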
Preferably, the first convolutional neural network model in S1 is a lightweight feature pyramid network model with a network depth of 6 layers and a maximum of 32 channels; the feature pyramid network model receives an input image and obtains confidence heat maps at three different depth layers, the heat maps being used to represent the face distribution; the local peak points of each confidence heat map are calculated, and the repeated face frames at each scale are removed according to these local peak points.
And simultaneously predicting the face candidate frames with different scales at different depths to realize multi-scale face frame prediction.
Preferably, the calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer. In the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
By calculating the local peak points of each heat map, the repeated face frames at each scale are removed rapidly, which effectively reduces the time spent on the NMS (non-maximum suppression) operation when face candidate frames from different scales are subsequently merged, and thus reduces the network computation cost.
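A minimal NumPy sketch of the local peak point computation of formulas 1 to 3 is given below; the 3×3 pooling window size is an assumption, since the text only specifies a stride-1 maximum pooling.

```python
# Sketch of the local peak point computation (formulas 1-3) on a 2-D heat map;
# the 3x3 window is an assumed size, the stride-1 pooling follows the text.
import numpy as np

def local_peak_map(heatmap, window=3):
    """Return D = A * C, where C marks positions where A equals its stride-1 max pooling B."""
    pad = window // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Formula 1: stride-1 max pooling keeps the resolution of the heat map.
    pooled = np.max(
        np.lib.stride_tricks.sliding_window_view(padded, (window, window)),
        axis=(2, 3),
    )
    # Formula 2: 1 where the heat map equals its pooled maximum, 0 elsewhere.
    peak_positions = (heatmap == pooled).astype(heatmap.dtype)
    # Formula 3: keep the confidence value only at local peak positions.
    return heatmap * peak_positions

# Example: candidate face positions are the nonzero peaks above a confidence threshold.
heat = np.random.rand(32, 32).astype(np.float32)
ys, xs = np.nonzero(local_peak_map(heat) > 0.7)
```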
Preferably, for each input face a segmentation feature map of dimension K×m² is output, that is, K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
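The per-pixel activation and segmentation loss of formulas 4 and 5 can be sketched as follows; the (K, m, m) array layout follows the description above, while averaging over pixels is an assumption.

```python
# Sketch of the per-pixel sigmoid (formula 4) and the sigmoid cross-entropy
# segmentation loss (formula 5); averaging over pixels is an assumption.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, labels, eps=1e-7):
    """logits: (K, m, m) raw network outputs; labels: (K, m, m) with values in {0, 1}."""
    s = np.clip(sigmoid(logits), eps, 1.0 - eps)
    # Formula 5: -[y*log(S) + (1-y)*log(1-S)], averaged over all pixels and categories.
    return float(np.mean(-(labels * np.log(s) + (1.0 - labels) * np.log(1.0 - s))))
```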
Preferably, the second convolutional neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the second convolutional neural network model is a downsampled convolutional neural network model; and S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
Preferably, the segmentation in S3 is performed by pixel-level classification, and in the output feature map, the segmentation is performed by performing pixel-level classification on whether each position is a whole region of a face or a partial region such as an eye, nose, or mouth.
Preferably, the first convolutional neural network model and the second convolutional neural network model simultaneously realize the two tasks of face classification and face frame regression, and the multi-task loss function is calculated by formula 6:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value.
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
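A trivial sketch of how the component losses of formulas 6 and 7 are combined follows; the three component losses are assumed to be computed elsewhere.

```python
# Sketch of the multi-task losses of formulas 6 and 7.
def total_loss_stage_1_2(l_cls, l_box):
    return l_cls + l_box            # formula 6: classification + box regression

def total_loss_stage_3(l_cls, l_box, l_mask):
    return l_cls + l_box + l_mask   # formula 7: adds the segmentation loss
```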
Preferably, the method further comprises designing a 32 × 32 convolutional neural network for extracting local region features from the key parts of the segmented image. Extracting local face features from the segmented image facilitates fine-grained comparison and recognition of specific persons.
Preferably, for the local region feature extraction, the same local region of each person is treated as one category; a softmax cross-entropy loss function is used, and the classification probability distribution is calculated by the softmax function of formula 8:
P_i = e^(X_i) / Σ_{j=1..k} e^(X_j)    (formula 8)
In formula 8, X_i represents the projection size of a face local region onto the ith person, P_i represents the probability that the face local region belongs to the ith person, and k represents the number of categories of the classification task;
The softmax cross-entropy loss function is calculated by formula 9:
L_cls = -Σ_i y_i^cls · log(p_i)    (formula 9)
In formula 9, y_i^cls represents the label of the real sample, y_i^cls ∈ {0, 1}, and p_i, output by the neural network, represents the probability that a sample is the ith person.
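A short NumPy sketch of formulas 8 and 9 follows; the vector shape and the one-hot label convention are assumptions for illustration.

```python
# Sketch of the softmax probability (formula 8) and its cross-entropy loss (formula 9).
import numpy as np

def softmax(x):
    """x: (k,) projections of a face local region onto the k persons."""
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / np.sum(e)

def softmax_cross_entropy(x, true_index):
    """Formula 9 with a one-hot label: -log(p_i) for the true person index."""
    p = softmax(x)
    return float(-np.log(p[true_index] + 1e-12))
```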
Owing to the adoption of the above technical scheme, the invention has remarkable technical effects: an image pyramid is made for the input image to be segmented and its levels are input one by one into the first convolutional neural network model, which obtains confidence heat maps at three different depth layers to represent the face distribution. The local peak points of each confidence heat map are then calculated, the repeated face frames at each scale are removed rapidly through these local peak points, and the time spent on non-maximum suppression when merging face candidate frames from different scales is reduced, thereby effectively lowering the network computation cost.
And inputting the face frames generated by the first convolutional neural network model into a second convolutional neural network model one by one, predicting the confidence coefficient and the frame regression value of the face frames one by one, filtering the face frames with the confidence coefficient smaller than the confidence coefficient threshold value of the face, and further optimizing the face frame positions larger than or equal to the confidence coefficient threshold value.
And inputting the face frames subjected to the adjustment of the second convolutional neural network model into a third convolutional neural network model one by one, and when the confidence coefficient of the face frames is greater than or equal to a confidence coefficient threshold value of the face frames, adjusting the positions of the face frames by the third convolutional neural network model, and dividing the whole area and the local area of the face of the adjusted face frames.
The face segmentation technology designed by the invention can rapidly carry out face segmentation in multiple scales, has high segmentation precision, can reduce the overall network calculation cost, and is suitable for an embedded platform with limited calculation resources.
By utilizing the local region segmentation of the invention, the key regions of the face can be extracted, and the corresponding local features can be extracted with the corresponding convolutional neural network, so that specific persons can be compared and recognized at fine granularity.
Drawings
FIG. 1 is a schematic diagram of the composition of the present invention.
Fig. 2 is a diagram of a first convolutional neural network model of the present invention.
Fig. 3 is a diagram of a second convolutional neural network model of the present invention.
Fig. 4 is a diagram of a third convolutional neural network model of the present invention.
Fig. 5 is a flowchart of the heat map local peak point calculation of the present invention.
FIG. 6 is a confidence heat map of the present invention.
FIG. 7 is a diagram of an automated training and testing system according to example 5 of the present invention.
Fig. 8 is a structural diagram of embodiment 6 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Example 1
A multi-scale rapid face segmentation method based on a deep convolution cascade network comprises the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
Confidence heat maps are obtained at three different depth layers and represent the face distribution. The local peak points of each confidence heat map are then calculated; a local peak point is the position with the maximum confidence value within a local area of the confidence heat map, and the repeated face frames at each scale are removed rapidly through these local peak points. This effectively reduces the time spent on the subsequent NMS operation when face candidate frames from different scales are merged. The NMS operation steps are as follows:
Step 1, sorting all the obtained face frames in descending order by confidence value and selecting the face frame with the largest confidence value;
Step 2, traversing all remaining face frames in turn, and deleting a face frame if its overlapping area with the face frame of maximum confidence is larger than a certain threshold; the threshold is set in the range 0.5 to 0.9 according to the actual application scene;
Step 3, selecting the face frame with the maximum confidence value from the frames not yet processed and repeating step 2.
The amount of computation of the non-maximum suppression operation grows rapidly as the number of face frames increases; because the repeated frames have already been removed through the local peak points, the number of candidate frames entering NMS stays small and the amount of computation is much reduced. A sketch of the NMS operation is given below.
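The following sketch implements the three NMS steps above; boxes are assumed to be (x1, y1, x2, y2, confidence) rows, and intersection-over-union is used as the overlap measure, which is one common reading of "overlapping area".

```python
# Sketch of the NMS steps above; IoU is used as the overlap measure (assumption).
import numpy as np

def nms(boxes, overlap_thresh=0.5):
    """boxes: array-like of rows (x1, y1, x2, y2, confidence)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    if boxes.size == 0:
        return boxes
    order = np.argsort(-boxes[:, 4])          # Step 1: sort by confidence, descending
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Step 2: overlap of the best frame with every remaining frame
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-12)
        # Step 3: keep only frames below the overlap threshold and repeat
        order = rest[iou <= overlap_thresh]
    return boxes[keep]
```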
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
When the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame; the confidence threshold of the face frame is set to be 0.7;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented; the confidence threshold of the face frame is set to be 0.7;
S4: and outputting the result, namely outputting the result after the segmentation in the step S3.
The first convolutional neural network model in S1 is a lightweight feature pyramid network model with a network depth of 6 layers and a maximum of 32 channels; the feature pyramid network model receives an input image and obtains confidence heat maps at three different depth layers, the heat maps being used to represent the face distribution; the local peak points of the confidence heat maps are calculated, and the repeated face frames at each scale are removed according to these local peak points. Face candidate frames of different scales are predicted simultaneously at different depths, realizing multi-scale face frame prediction.
The calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer. In the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Since the confidence heat map only represents the face distribution, k is 1, and the stride of the maximum pooling layer is set to 1; after the maximum pooling calculation, the confidence heat map and the confidence heat map maximum feature map have the same resolution;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width; the confidence heat map maximum feature map and the confidence heat map local peak point position feature map have the same resolution;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
Assuming that the confidence heat map is a matrix A_MN, the confidence heat map maximum feature map is a matrix B_MN, the confidence heat map local peak point position feature map is a matrix C_MN, and the confidence heat map local peak point feature map is a matrix D_MN:
when A[i][j] is equal to B[i][j], C[i][j] is 1; otherwise C[i][j] is 0;
D[i][j] = A[i][j] × C[i][j]
where i = 0 … M-1 and j = 0 … N-1.
The confidence heat map, the confidence heat map maximum feature map, the confidence heat map local peak point position feature map and the confidence heat map local peak point feature map all have the same resolution.
By calculating the local peak point of each heat map, the repeated face frames of each scale can be removed rapidly, the time consumption of NMS (non-maximum suppression) operation when face candidate frames are combined under different subsequent scales is reduced effectively, and the network calculation cost is reduced.
For each input face, a segmentation feature map of dimension K×m² is output, namely K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
The second convolutional neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the convolutional neural network model is a downsampled convolutional neural network model. And S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
The segmentation in S3 is realized through pixel level classification, and in the output feature map, the pixel level classification is carried out on whether each position is a whole region of a human face, a partial region such as eyes, nose and mouth and the like, so that the segmentation is realized.
For the first convolutional neural network model and the second convolutional neural network model, both models simultaneously realize the two tasks of face classification and face frame regression, and the multi-task loss function is defined as:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value.
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
According to fig. 6, the region in black in the middle of white represents the region with high confidence of the face, namely the face distribution region.
Maximum pooling is a prior-art technique that takes the maximum value within a local receptive field.
Example 2
Based on embodiment 1 and the first convolutional neural network structure shown in fig. 2, the fully convolutional network input size is 24×24 resolution; conv denotes a convolutional layer, pool denotes a pooling layer, and a conv/dw layer combined with a conv layer forms a depthwise separable convolutional layer.
① First, the input image passes through convolution layer conv1 and maximum pooling layer pool1 to obtain the first-layer output feature map;
② The second-layer output feature map is obtained through the depthwise separable convolution layers conv2-1/dw and conv2;
③ The third-layer output feature map is obtained through the depthwise separable convolution layers conv3-1/dw and conv3;
④ The third-layer output feature map has three output branches: the first passes through convolution layer conv4-1 to obtain the feature map representing the face classification confidence, namely the confidence heat map; the second passes through convolution layer conv4-2 to obtain the feature map representing the face frame regression position; the third passes through the depthwise separable convolution layers conv5-1/dw and conv5 to obtain the fourth-layer output feature map;
⑤ The fourth-layer output feature map, like the third-layer output feature map, also has three output branches: the first passes through convolution layer conv6-1 to obtain the feature map representing the face classification confidence, the second passes through convolution layer conv6-2 to obtain the feature map representing the face frame regression position, and the third passes through the depthwise separable convolution layers conv7-1/dw and conv7 to obtain the fifth-layer output feature map;
⑥ From the fifth-layer output feature map, the feature map representing the face classification confidence is obtained through convolution layer conv8-1, and the feature map representing the face frame regression position is obtained through convolution layer conv8-2.
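A PyTorch sketch of the layer layout of steps ① to ⑥ is given below; the channel counts, kernel sizes and strides are assumptions, since the text only specifies a depth of 6 layers and at most 32 channels, so this is an illustration rather than the exact patented network.

```python
# Illustrative PyTorch sketch of the Fig. 2 layout; channel counts/strides are assumed.
import torch
import torch.nn as nn

def dw_sep(cin, cout, stride=1):
    """Depthwise separable block: a convN-1/dw layer followed by a 1x1 convN layer."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1), nn.ReLU(inplace=True),
    )

class PyramidNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.MaxPool2d(2))              # conv1 + pool1
        self.block2 = dw_sep(16, 32, stride=2)                  # conv2-1/dw, conv2
        self.block3 = dw_sep(32, 32, stride=2)                  # conv3-1/dw, conv3
        self.block5 = dw_sep(32, 32, stride=2)                  # conv5-1/dw, conv5
        self.block7 = dw_sep(32, 32, stride=2)                  # conv7-1/dw, conv7
        # per-scale heads: face-confidence heat map (1 ch) and box regression (4 ch)
        self.cls4, self.box4 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)
        self.cls6, self.box6 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)
        self.cls8, self.box8 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)

    def forward(self, x):
        f3 = self.block3(self.block2(self.stem(x)))             # third-layer feature map
        f4 = self.block5(f3)                                    # fourth-layer feature map
        f5 = self.block7(f4)                                    # fifth-layer feature map
        return [(torch.sigmoid(self.cls4(f3)), self.box4(f3)),
                (torch.sigmoid(self.cls6(f4)), self.box6(f4)),
                (torch.sigmoid(self.cls8(f5)), self.box8(f5))]
```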
Example 3
Based on embodiment 1 and embodiment 2, and according to the second convolutional neural network structure shown in fig. 3, a series of convolution and pooling operations yields a face confidence and a face frame regression value; face candidate frames below the face confidence threshold are filtered out, the face frames that pass the threshold are further regressed so that their positions become more accurate, and the position-regressed face frames are used as the input of the third convolutional neural network model. The output position of a face frame is determined by the frame regression values, calculated as follows:
T_X1 = (G_X1 - P_X1) / (P_X2 - P_X1)
T_Y1 = (G_Y1 - P_Y1) / (P_Y2 - P_Y1)
T_X2 = (G_X2 - P_X2) / (P_X2 - P_X1)
T_Y2 = (G_Y2 - P_Y2) / (P_Y2 - P_Y1)
In the formulas, T_X1, T_Y1, T_X2, T_Y2 are the frame regression values, G_X1, G_Y1, G_X2, G_Y2 are the real face frame coordinates, and P_X1, P_Y1, P_X2, P_Y2 are the predicted face frame coordinates.
The face frame position is adjusted accordingly; the trained model is evaluated on the test set, and the optimal face frame values are determined through the test precision and recall indices, thereby determining the best trained model. Because the second convolutional neural network model is more complex than the first convolutional neural network model and has stronger learning ability, the adjusted face frame position is closer to the real face.
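The regression encoding above and its inverse (used when a predicted frame is adjusted) can be sketched as follows; the function names are illustrative.

```python
# Sketch of the frame-regression encoding above and its inverse; variable names
# mirror the formulas (P = predicted frame, G = real frame, T = regression values).
def encode_box(P, G):
    px1, py1, px2, py2 = P
    gx1, gy1, gx2, gy2 = G
    w, h = px2 - px1, py2 - py1
    return ((gx1 - px1) / w, (gy1 - py1) / h,
            (gx2 - px2) / w, (gy2 - py2) / h)

def decode_box(P, T):
    """Apply regression values T = (tx1, ty1, tx2, ty2) to a predicted frame P."""
    px1, py1, px2, py2 = P
    tx1, ty1, tx2, ty2 = T
    w, h = px2 - px1, py2 - py1
    return (px1 + tx1 * w, py1 + ty1 * h, px2 + tx2 * w, py2 + ty2 * h)
```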
Example 4
Based on embodiments 1, 2 and 3, and according to fig. 4, the third convolutional neural network structure is a lightweight convolutional neural network. Images are input one by one into the face segmentation network structure; through a series of convolution, pooling and deconvolution operations, the face confidence and face frame regression values are obtained in the intermediate layers, face frames whose confidence is below the face confidence threshold continue to be filtered out, and fine regression is performed on the remaining face frames at the same time so that their positions become more accurate. Segmentation of the face image is achieved in the last layer: the face segmentation CNN network can segment the whole face region and, at the same time, local face regions such as the eyes, nose and mouth.
Example 5
On the basis of embodiment 1 and according to fig. 7, an automated training and testing system for face segmentation is formed. The system comprises a training set, a test set, face data, a model training module, a model screening module, a face segmentation CNN network and a segmentation output module. Based on the face segmentation network structure designed in embodiment 1, a model is trained with the training set and the trained model is tested with the test set; unqualified models are removed and qualified models are retained. The collected and organized face data are input into the face segmentation network model to obtain segmentation results, and face data with poor segmentation results are labeled, added to the training set, and used for iterative training.
The model training module receives the training set, performs end-to-end training on the training set to form a training model, and stores the training model;
The model training module is connected with the model screening module and transmits the training model to the model screening module; the model screening module receives the test set and tests the training model according to the test set; when the test index of the training model is within the test set threshold, reserving the training model, otherwise deleting the training model;
The model screening module is connected with the face segmentation CNN network and transmits the reserved training model to the face segmentation CNN network;
The face segmentation CNN network automatically collects and organizes face data and inputs them into the retained face segmentation model to perform the face segmentation operation; the face segmentation CNN network realizes not only whole-face region segmentation but also local region segmentation of the eyes, nose, mouth and the like; face data with poor segmentation results are labeled, added to the training set, and used for iterative training.
The training set comprises the WIDER FACE dataset and a self-collected and labeled face training set; the test set comprises the FDDB test set and a self-collected and labeled face test set; the test indices comprise the recall rate, false detection rate and segmentation accuracy. The thresholds for the face recall rate, false detection rate and segmentation precision are set according to the specific application scene.
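The screening step of this system can be sketched as below; the metric names and the example threshold and metric values are arbitrary placeholders, not figures from the patent.

```python
# Sketch of the model screening step: keep a trained model only if its test-set
# metrics clear the configured thresholds; all numbers below are placeholders.
def screen_model(metrics, thresholds):
    """metrics / thresholds: dicts with recall, false_detection_rate, seg_accuracy."""
    return (metrics["recall"] >= thresholds["recall"]
            and metrics["false_detection_rate"] <= thresholds["false_detection_rate"]
            and metrics["seg_accuracy"] >= thresholds["seg_accuracy"])

keep_model = screen_model(
    {"recall": 0.94, "false_detection_rate": 0.02, "seg_accuracy": 0.91},   # placeholder metrics
    {"recall": 0.90, "false_detection_rate": 0.05, "seg_accuracy": 0.85},   # placeholder thresholds
)
```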
Example 6
Based on the above embodiment, this embodiment is shown in fig. 8, and further includes a convolutional neural network of 32×32 for extracting local region features for the segmented image key parts.
A local region picture is obtained according to the final face local region segmentation result and input into the corresponding convolutional neural network to obtain local features. For example, the mouth pictures of two faces are input to obtain the corresponding features F1 and F2 of dimension M; to improve the recognition effect, F1 and F2 are normalized to G1 and G2 respectively, where G1 and G2 each have M elements. Formula 10 calculates the similarity of G1 and G2:
SIM = Σ_K G1_K × G2_K    (formula 10)
In formula 10, G1_K is an element of feature G1 and G2_K is an element of feature G2; whether the two features belong to the same person is judged from the magnitude of the similarity SIM, giving a more accurate recognition effect for a specific face among similar faces.
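A NumPy sketch of the comparison of formula 10 follows; L2 normalization is assumed (the text only says "normalized"), and the decision threshold is an arbitrary illustration.

```python
# Sketch of formula 10: normalize local features and compare by inner product.
import numpy as np

def local_feature_similarity(f1, f2, eps=1e-12):
    g1 = f1 / (np.linalg.norm(f1) + eps)    # normalize F1 -> G1 (L2 norm assumed)
    g2 = f2 / (np.linalg.norm(f2) + eps)    # normalize F2 -> G2
    return float(np.sum(g1 * g2))           # SIM = sum_K G1_K * G2_K

# Example decision with an arbitrary threshold.
same_person = local_feature_similarity(np.random.rand(128), np.random.rand(128)) > 0.6
```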

Claims (8)

1. A multi-scale rapid face segmentation method based on a deep convolution cascade network is characterized by comprising the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
when the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented;
s4: outputting a result, namely outputting the result after segmentation in the step S3; the first convolutional neural network model in the S1 is a lightweight characteristic pyramid network model, the network depth is 6 layers, and the maximum number of channels of the network is 32;
The feature pyramid network model receives an input image, and obtains heat maps with different confidence coefficients at three different depth layers, wherein the heat maps are used for representing the distribution situation of the human face; and calculating local peak points of the confidence heat maps according to the different confidence heat maps, and removing repeated face frames of each scale according to the local peak points of the confidence heat maps.
2. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein the calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer; in the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
3. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein for each input face a segmentation feature map of dimension K×m² is output, namely K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
4. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein the second convolution neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the second convolution neural network model is a downsampled convolution neural network model; and S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
5. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein segmentation in the step S3 is achieved through pixel level classification, and in the output feature map, pixel level classification is conducted on whether each position is a face whole region, a local region such as an eye, nose and mouth, and the like, so that segmentation is achieved.
6. The multi-scale rapid face segmentation method based on the deep convolutional cascade network as claimed in claim 1, wherein the first convolutional neural network model and the second convolutional neural network model simultaneously realize two tasks of face classification and face frame regression, and the multi-task loss function is defined as:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value;
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
7. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, further comprising the step of designing a 32 x 32 convolution neural network for local region feature extraction for the segmented image key parts.
8. The multi-scale rapid face segmentation method based on the deep convolutional cascade network of claim 7, wherein,
The local region features are extracted by treating the same local region of each person as one category; a softmax cross-entropy loss function is used, and the classification probability distribution is calculated by the softmax function of formula 8:
P_i = e^(X_i) / Σ_{j=1..k} e^(X_j)    (formula 8)
In formula 8, X_i represents the projection size of a face local region onto the ith person, P_i represents the probability that the face local region belongs to the ith person, and k represents the number of categories of the classification task;
The softmax cross-entropy loss function is calculated by formula 9:
L_cls = -Σ_i y_i^cls · log(p_i)    (formula 9)
In formula 9, y_i^cls represents the label of the real sample, y_i^cls ∈ {0, 1}, and p_i, output by the neural network, represents the probability that a sample is the ith person.
CN202010878450.6A 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network Active CN112132839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878450.6A CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878450.6A CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Publications (2)

Publication Number Publication Date
CN112132839A CN112132839A (en) 2020-12-25
CN112132839B true CN112132839B (en) 2024-04-30

Family

ID=73847509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878450.6A Active CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Country Status (1)

Country Link
CN (1) CN112132839B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364846B (en) * 2021-01-12 2021-04-30 深圳市一心视觉科技有限公司 Face living body identification method and device, terminal equipment and storage medium
CN112967204A (en) * 2021-03-23 2021-06-15 新疆爱华盈通信息技术有限公司 Noise reduction processing method and system for thermal imaging and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109766887A (en) * 2019-01-16 2019-05-17 中国科学院光电技术研究所 A kind of multi-target detection method based on cascade hourglass neural network
CN110136745A (en) * 2019-05-08 2019-08-16 西北工业大学 A kind of vehicle whistle recognition methods based on convolutional neural networks
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109766887A (en) * 2019-01-16 2019-05-17 中国科学院光电技术研究所 A kind of multi-target detection method based on cascade hourglass neural network
CN110136745A (en) * 2019-05-08 2019-08-16 西北工业大学 A kind of vehicle whistle recognition methods based on convolutional neural networks
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss

Also Published As

Publication number Publication date
CN112132839A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN109359559B (en) Pedestrian re-identification method based on dynamic shielding sample
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN109614921B (en) Cell segmentation method based on semi-supervised learning of confrontation generation network
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN111340123A (en) Image score label prediction method based on deep convolutional neural network
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN108039044B (en) Vehicle intelligent queuing system and method based on multi-scale convolutional neural network
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN112132839B (en) Multi-scale rapid face segmentation method based on deep convolution cascade network
CN111145145B (en) Image surface defect detection method based on MobileNet
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN113435407B (en) Small target identification method and device for power transmission system
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN111540203B (en) Method for adjusting green light passing time based on fast-RCNN
CN111626357B (en) Image identification method based on neural network model
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN108154199B (en) High-precision rapid single-class target detection method based on deep learning
CN114170422A (en) Coal mine underground image semantic segmentation method
CN115830514B (en) Whole river reach surface flow velocity calculation method and system suitable for curved river channel
CN110738113B (en) Object detection method based on adjacent scale feature filtering and transferring
CN114119382A (en) Image raindrop removing method based on attention generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant