CN112132839B - Multi-scale rapid face segmentation method based on deep convolution cascade network - Google Patents

Multi-scale rapid face segmentation method based on deep convolution cascade network

Info

Publication number
CN112132839B
CN112132839B (granted publication of application CN202010878450.6A)
Authority
CN
China
Prior art keywords
face
confidence
neural network
network model
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010878450.6A
Other languages
Chinese (zh)
Other versions
CN112132839A (en)
Inventor
徐联伯
彭珂凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Eagle Zhida Technology Co ltd
Original Assignee
Hangzhou Eagle Zhida Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Eagle Zhida Technology Co ltd filed Critical Hangzhou Eagle Zhida Technology Co ltd
Priority to CN202010878450.6A priority Critical patent/CN112132839B/en
Publication of CN112132839A publication Critical patent/CN112132839A/en
Application granted granted Critical
Publication of CN112132839B publication Critical patent/CN112132839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to face segmentation technology and discloses a multi-scale rapid face segmentation method based on a deep convolution cascade network. First, an image pyramid is built for the input image to be segmented and its levels are input one by one into a first convolutional neural network model; the local peak points of each confidence heat map are calculated and the repeated face frames at each scale are removed. The resulting face frames are input one by one into a second convolutional neural network model, which filters out frames whose confidence is below the face confidence threshold and further refines the positions of the frames that pass. The refined face frames are then input one by one into a third convolutional neural network model, which filters the frames again, adjusts their positions, and segments the whole face region as well as local regions such as the eyes, nose and mouth. The face segmentation technique designed by the invention performs face segmentation rapidly at multiple scales with high segmentation precision, reduces the overall network computation cost, is suitable for embedded platforms with limited computing resources, and also supports recognition of specific persons through the key parts of the face.

Description

Multi-scale rapid face segmentation method based on deep convolution cascade network
Technical Field
The invention relates to a face segmentation technology, in particular to a multi-scale rapid face segmentation method based on a deep convolution cascade network.
Background
Face detection and segmentation are crucial in a face recognition system. Before effective face recognition can be carried out, the influence of illumination, pose, expression, image quality and the like must be overcome, the faces present in the target scene must be detected accurately and effectively, and the positions of key parts of the face such as the eyes, nose and mouth must be located precisely. Accurate positioning of these key parts improves the face correction effect and thereby the face recognition accuracy. The prior art cannot perform multi-scale face segmentation, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable.
For example, the patent titled "An infrared face segmentation method using annular shortest path", application number CN201610090345.X, filed 2016-02-18, relates to a face segmentation method, device and equipment, wherein the method comprises: acquiring a face image comprising the face region to be segmented; extracting key point information of the face region to be segmented; determining semantic prior layer information corresponding to the face region to be segmented according to the key point information and the correspondence between each segmented object and the key point information, the semantic prior layer information representing the constrained segmentation area corresponding to each segmentation object in the face region to be segmented; and inputting the semantic prior layer information and the face image into a pre-trained network model to obtain the segmentation result corresponding to each segmentation object in the face region to be segmented.
The face segmentation method, device and equipment provided in the prior art can improve segmentation accuracy. However, when face segmentation is carried out with the annular-shortest-path infrared face segmentation method, multi-scale face segmentation cannot be performed, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable.
Disclosure of Invention
Aiming at the problems in the prior art that multi-scale face segmentation cannot be performed, the segmentation speed is low, the network computation cost is high, and embedded platforms with limited resources are not suitable, the invention provides a multi-scale rapid face segmentation method based on a deep convolution cascade network.
In order to solve the above technical problems, the invention adopts the following technical scheme:
a multi-scale rapid face segmentation method based on a deep convolution cascade network comprises the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
when the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented;
S4: result output, namely outputting the segmentation result obtained in step S3; the overall flow of steps S1 to S4 is sketched in the example below.
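The following Python sketch illustrates only the control flow of steps S1 to S4; the network callables (pnet, rnet, segnet), their signatures, the pyramid parameters and the threshold values are illustrative assumptions and are not part of the patented implementation.

```python
# Minimal sketch of the three-stage cascade (steps S1-S4); the three networks
# are supplied as callables, and all names/signatures here are assumptions.
import cv2
import numpy as np

def image_pyramid(image, min_size=24, factor=0.79):
    """Yield (scale, resized image) pairs until the short side drops below min_size."""
    h, w = image.shape[:2]
    scale = 1.0
    while min(h, w) * scale >= min_size:
        yield scale, cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
        scale *= factor

def crop(image, box):
    """Crop a face frame (x1, y1, x2, y2), clipped to the image bounds."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    h, w = image.shape[:2]
    return image[max(0, y1):min(h, y2), max(0, x1):min(w, x2)]

def cascade_segment(image, pnet, rnet, segnet, conf_thresh=0.7):
    # S1: multi-scale face candidate frames from the first (pyramid) network
    candidates = []
    for scale, scaled in image_pyramid(image):
        for box in pnet(scaled):                  # boxes in pyramid-level coordinates
            candidates.append(np.asarray(box, dtype=np.float64) / scale)
    # S2: confidence filtering and box refinement by the second network
    refined = []
    for box in candidates:
        conf, box_adj = rnet(crop(image, box), box)
        if conf >= conf_thresh:
            refined.append(box_adj)
    # S3: final filtering, position adjustment and whole-face / eyes-nose-mouth masks
    outputs = []
    for box in refined:
        conf, box_adj, masks = segnet(crop(image, box), box)
        if conf >= conf_thresh:
            outputs.append((box_adj, masks))
    return outputs                                # S4: segmented result
```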
Preferably, the first convolutional neural network model in S1 is a lightweight feature pyramid network model with a network depth of 6 layers and a maximum of 32 channels; the feature pyramid network model receives an input image and obtains confidence heat maps at three different depth layers, the heat maps being used to represent the face distribution; the local peak points of each confidence heat map are calculated, and the repeated face frames at each scale are removed according to these local peak points.
And simultaneously predicting the face candidate frames with different scales at different depths to realize multi-scale face frame prediction.
Preferably, the calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer. In the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
By calculating the local peak points of each heat map, the repeated face frames at each scale are removed rapidly, which effectively reduces the time spent on the NMS (non-maximum suppression) operation when face candidate frames from different scales are subsequently merged, and thus reduces the network computation cost.
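A minimal NumPy sketch of the local peak point computation of formulas 1 to 3 is given below; the 3×3 pooling window size is an assumption, since the text only specifies a stride-1 maximum pooling.

```python
# Sketch of the local peak point computation (formulas 1-3) on a 2-D heat map;
# the 3x3 window is an assumed size, the stride-1 pooling follows the text.
import numpy as np

def local_peak_map(heatmap, window=3):
    """Return D = A * C, where C marks positions where A equals its stride-1 max pooling B."""
    pad = window // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Formula 1: stride-1 max pooling keeps the resolution of the heat map.
    pooled = np.max(
        np.lib.stride_tricks.sliding_window_view(padded, (window, window)),
        axis=(2, 3),
    )
    # Formula 2: 1 where the heat map equals its pooled maximum, 0 elsewhere.
    peak_positions = (heatmap == pooled).astype(heatmap.dtype)
    # Formula 3: keep the confidence value only at local peak positions.
    return heatmap * peak_positions

# Example: candidate face positions are the nonzero peaks above a confidence threshold.
heat = np.random.rand(32, 32).astype(np.float32)
ys, xs = np.nonzero(local_peak_map(heat) > 0.7)
```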
Preferably, for each input face a segmentation feature map of dimension K×m² is output, that is, K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
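The per-pixel activation and segmentation loss of formulas 4 and 5 can be sketched as follows; the (K, m, m) array layout follows the description above, while averaging over pixels is an assumption.

```python
# Sketch of the per-pixel sigmoid (formula 4) and the sigmoid cross-entropy
# segmentation loss (formula 5); averaging over pixels is an assumption.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, labels, eps=1e-7):
    """logits: (K, m, m) raw network outputs; labels: (K, m, m) with values in {0, 1}."""
    s = np.clip(sigmoid(logits), eps, 1.0 - eps)
    # Formula 5: -[y*log(S) + (1-y)*log(1-S)], averaged over all pixels and categories.
    return float(np.mean(-(labels * np.log(s) + (1.0 - labels) * np.log(1.0 - s))))
```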
Preferably, the second convolutional neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the second convolutional neural network model is a downsampled convolutional neural network model; and S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
Preferably, the segmentation in S3 is performed by pixel-level classification, and in the output feature map, the segmentation is performed by performing pixel-level classification on whether each position is a whole region of a face or a partial region such as an eye, nose, or mouth.
Preferably, the first convolutional neural network model and the second convolutional neural network model simultaneously realize the two tasks of face classification and face frame regression, and the multi-task loss function is calculated by formula 6:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value.
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
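A trivial sketch of how the component losses of formulas 6 and 7 are combined follows; the three component losses are assumed to be computed elsewhere.

```python
# Sketch of the multi-task losses of formulas 6 and 7.
def total_loss_stage_1_2(l_cls, l_box):
    return l_cls + l_box            # formula 6: classification + box regression

def total_loss_stage_3(l_cls, l_box, l_mask):
    return l_cls + l_box + l_mask   # formula 7: adds the segmentation loss
```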
Preferably, the method further comprises designing a 32 × 32 convolutional neural network for extracting local region features from the key parts of the segmented image. Extracting local face features from the segmented image facilitates fine-grained comparison and recognition of specific persons.
Preferably, for the local region feature extraction, the same local region of each person is treated as one category; a softmax cross-entropy loss function is used, and the classification probability distribution is calculated by the softmax function of formula 8:
P_i = e^(X_i) / Σ_{j=1..k} e^(X_j)    (formula 8)
In formula 8, X_i represents the projection size of a face local region onto the ith person, P_i represents the probability that the face local region belongs to the ith person, and k represents the number of categories of the classification task;
The softmax cross-entropy loss function is calculated by formula 9:
L_cls = -Σ_i y_i^cls · log(p_i)    (formula 9)
In formula 9, y_i^cls represents the label of the real sample, y_i^cls ∈ {0, 1}, and p_i, output by the neural network, represents the probability that a sample is the ith person.
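A short NumPy sketch of formulas 8 and 9 follows; the vector shape and the one-hot label convention are assumptions for illustration.

```python
# Sketch of the softmax probability (formula 8) and its cross-entropy loss (formula 9).
import numpy as np

def softmax(x):
    """x: (k,) projections of a face local region onto the k persons."""
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / np.sum(e)

def softmax_cross_entropy(x, true_index):
    """Formula 9 with a one-hot label: -log(p_i) for the true person index."""
    p = softmax(x)
    return float(-np.log(p[true_index] + 1e-12))
```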
Owing to the adoption of the above technical scheme, the invention has remarkable technical effects: an image pyramid is made for the input image to be segmented and its levels are input one by one into the first convolutional neural network model, which obtains confidence heat maps at three different depth layers to represent the face distribution. The local peak points of each confidence heat map are then calculated, the repeated face frames at each scale are removed rapidly through these local peak points, and the time spent on non-maximum suppression when merging face candidate frames from different scales is reduced, thereby effectively lowering the network computation cost.
And inputting the face frames generated by the first convolutional neural network model into a second convolutional neural network model one by one, predicting the confidence coefficient and the frame regression value of the face frames one by one, filtering the face frames with the confidence coefficient smaller than the confidence coefficient threshold value of the face, and further optimizing the face frame positions larger than or equal to the confidence coefficient threshold value.
And inputting the face frames subjected to the adjustment of the second convolutional neural network model into a third convolutional neural network model one by one, and when the confidence coefficient of the face frames is greater than or equal to a confidence coefficient threshold value of the face frames, adjusting the positions of the face frames by the third convolutional neural network model, and dividing the whole area and the local area of the face of the adjusted face frames.
The face segmentation technology designed by the invention can rapidly carry out face segmentation in multiple scales, has high segmentation precision, can reduce the overall network calculation cost, and is suitable for an embedded platform with limited calculation resources.
By utilizing the local region segmentation of the invention, the key regions of the face can be extracted, and the corresponding local features can be extracted with the corresponding convolutional neural network, so that specific persons can be compared and recognized at fine granularity.
Drawings
FIG. 1 is a schematic diagram of the composition of the present invention.
Fig. 2 is a diagram of a first convolutional neural network model of the present invention.
Fig. 3 is a diagram of a second convolutional neural network model of the present invention.
Fig. 4 is a diagram of a third convolutional neural network model of the present invention.
Fig. 5 is a flowchart of the heat map local peak point calculation of the present invention.
FIG. 6 is a confidence heat map of the present invention.
FIG. 7 is a diagram of an automated training and testing system according to example 5 of the present invention.
Fig. 8 is a structural diagram of embodiment 6 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Example 1
A multi-scale rapid face segmentation method based on a deep convolution cascade network comprises the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
Confidence heat maps are obtained at three different depth layers and represent the face distribution. The local peak points of each confidence heat map are then calculated; a local peak point is the position with the maximum confidence value within a local area of the confidence heat map, and the repeated face frames at each scale are removed rapidly through these local peak points. This effectively reduces the time spent on the subsequent NMS operation when face candidate frames from different scales are merged. The NMS operation steps are as follows:
Step 1, sorting all the obtained face frames in descending order by confidence value and selecting the face frame with the largest confidence value;
Step 2, traversing all remaining face frames in turn, and deleting a face frame if its overlapping area with the face frame of maximum confidence is larger than a certain threshold; the threshold is set in the range 0.5 to 0.9 according to the actual application scene;
Step 3, selecting the face frame with the maximum confidence value from the frames not yet processed and repeating step 2.
The amount of computation of the non-maximum suppression operation grows rapidly as the number of face frames increases; because the repeated frames have already been removed through the local peak points, the number of candidate frames entering NMS stays small and the amount of computation is much reduced. A sketch of the NMS operation is given below.
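The following sketch implements the three NMS steps above; boxes are assumed to be (x1, y1, x2, y2, confidence) rows, and intersection-over-union is used as the overlap measure, which is one common reading of "overlapping area".

```python
# Sketch of the NMS steps above; IoU is used as the overlap measure (assumption).
import numpy as np

def nms(boxes, overlap_thresh=0.5):
    """boxes: array-like of rows (x1, y1, x2, y2, confidence)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    if boxes.size == 0:
        return boxes
    order = np.argsort(-boxes[:, 4])          # Step 1: sort by confidence, descending
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Step 2: overlap of the best frame with every remaining frame
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-12)
        # Step 3: keep only frames below the overlap threshold and repeat
        order = rest[iou <= overlap_thresh]
    return boxes[keep]
```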
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
When the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame; the confidence threshold of the face frame is set to be 0.7;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented; the confidence threshold of the face frame is set to be 0.7;
S4: and outputting the result, namely outputting the result after the segmentation in the step S3.
The first convolutional neural network model in S1 is a lightweight feature pyramid network model with a network depth of 6 layers and a maximum of 32 channels; the feature pyramid network model receives an input image and obtains confidence heat maps at three different depth layers, the heat maps being used to represent the face distribution; the local peak points of the confidence heat maps are calculated, and the repeated face frames at each scale are removed according to these local peak points. Face candidate frames of different scales are predicted simultaneously at different depths, realizing multi-scale face frame prediction.
The calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer. In the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Since the confidence heat map only represents the face distribution, k is 1, and the stride of the maximum pooling layer is set to 1; after the maximum pooling calculation, the confidence heat map and the confidence heat map maximum feature map have the same resolution;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width; the confidence heat map maximum feature map and the confidence heat map local peak point position feature map have the same resolution;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
Assuming that the confidence heat map is a matrix A_MN, the confidence heat map maximum feature map is a matrix B_MN, the confidence heat map local peak point position feature map is a matrix C_MN, and the confidence heat map local peak point feature map is a matrix D_MN:
when A[i][j] is equal to B[i][j], C[i][j] is 1; otherwise C[i][j] is 0;
D[i][j] = A[i][j] × C[i][j]
where i = 0 … M-1 and j = 0 … N-1.
The confidence heat map, the confidence heat map maximum feature map, the confidence heat map local peak point position feature map and the confidence heat map local peak point feature map all have the same resolution.
By calculating the local peak point of each heat map, the repeated face frames of each scale can be removed rapidly, the time consumption of NMS (non-maximum suppression) operation when face candidate frames are combined under different subsequent scales is reduced effectively, and the network calculation cost is reduced.
For each input face, a segmentation feature map of dimension K×m² is output, namely K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
The second convolutional neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the convolutional neural network model is a downsampled convolutional neural network model. And S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
The segmentation in S3 is realized through pixel level classification, and in the output feature map, the pixel level classification is carried out on whether each position is a whole region of a human face, a partial region such as eyes, nose and mouth and the like, so that the segmentation is realized.
For the first convolutional neural network model and the second convolutional neural network model, both models simultaneously realize the two tasks of face classification and face frame regression, and the multi-task loss function is defined as:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value.
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
According to fig. 6, the region in black in the middle of white represents the region with high confidence of the face, namely the face distribution region.
Maximum pooling is a prior-art technique that takes the maximum value within a local receptive field.
Example 2
Based on embodiment 1 and the first convolutional neural network structure shown in fig. 2, the fully convolutional network input size is 24×24 resolution; conv denotes a convolutional layer, pool denotes a pooling layer, and a conv/dw layer combined with a conv layer forms a depthwise separable convolutional layer.
① First, the input image passes through convolution layer conv1 and maximum pooling layer pool1 to obtain the first-layer output feature map;
② The second-layer output feature map is obtained through the depthwise separable convolution layers conv2-1/dw and conv2;
③ The third-layer output feature map is obtained through the depthwise separable convolution layers conv3-1/dw and conv3;
④ The third-layer output feature map has three output branches: the first passes through convolution layer conv4-1 to obtain the feature map representing the face classification confidence, namely the confidence heat map; the second passes through convolution layer conv4-2 to obtain the feature map representing the face frame regression position; the third passes through the depthwise separable convolution layers conv5-1/dw and conv5 to obtain the fourth-layer output feature map;
⑤ The fourth-layer output feature map, like the third-layer output feature map, also has three output branches: the first passes through convolution layer conv6-1 to obtain the feature map representing the face classification confidence, the second passes through convolution layer conv6-2 to obtain the feature map representing the face frame regression position, and the third passes through the depthwise separable convolution layers conv7-1/dw and conv7 to obtain the fifth-layer output feature map;
⑥ From the fifth-layer output feature map, the feature map representing the face classification confidence is obtained through convolution layer conv8-1, and the feature map representing the face frame regression position is obtained through convolution layer conv8-2.
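A PyTorch sketch of the layer layout of steps ① to ⑥ is given below; the channel counts, kernel sizes and strides are assumptions, since the text only specifies a depth of 6 layers and at most 32 channels, so this is an illustration rather than the exact patented network.

```python
# Illustrative PyTorch sketch of the Fig. 2 layout; channel counts/strides are assumed.
import torch
import torch.nn as nn

def dw_sep(cin, cout, stride=1):
    """Depthwise separable block: a convN-1/dw layer followed by a 1x1 convN layer."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1), nn.ReLU(inplace=True),
    )

class PyramidNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.MaxPool2d(2))              # conv1 + pool1
        self.block2 = dw_sep(16, 32, stride=2)                  # conv2-1/dw, conv2
        self.block3 = dw_sep(32, 32, stride=2)                  # conv3-1/dw, conv3
        self.block5 = dw_sep(32, 32, stride=2)                  # conv5-1/dw, conv5
        self.block7 = dw_sep(32, 32, stride=2)                  # conv7-1/dw, conv7
        # per-scale heads: face-confidence heat map (1 ch) and box regression (4 ch)
        self.cls4, self.box4 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)
        self.cls6, self.box6 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)
        self.cls8, self.box8 = nn.Conv2d(32, 1, 1), nn.Conv2d(32, 4, 1)

    def forward(self, x):
        f3 = self.block3(self.block2(self.stem(x)))             # third-layer feature map
        f4 = self.block5(f3)                                    # fourth-layer feature map
        f5 = self.block7(f4)                                    # fifth-layer feature map
        return [(torch.sigmoid(self.cls4(f3)), self.box4(f3)),
                (torch.sigmoid(self.cls6(f4)), self.box6(f4)),
                (torch.sigmoid(self.cls8(f5)), self.box8(f5))]
```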
Example 3
Based on embodiment 1 and embodiment 2, and according to the second convolutional neural network structure shown in fig. 3, a series of convolution and pooling operations yields a face confidence and a face frame regression value; face candidate frames below the face confidence threshold are filtered out, the face frames that pass the threshold are further regressed so that their positions become more accurate, and the position-regressed face frames are used as the input of the third convolutional neural network model. The output position of a face frame is determined by the frame regression values, calculated as follows:
T_X1 = (G_X1 - P_X1) / (P_X2 - P_X1)
T_Y1 = (G_Y1 - P_Y1) / (P_Y2 - P_Y1)
T_X2 = (G_X2 - P_X2) / (P_X2 - P_X1)
T_Y2 = (G_Y2 - P_Y2) / (P_Y2 - P_Y1)
In the formulas, T_X1, T_Y1, T_X2, T_Y2 are the frame regression values, G_X1, G_Y1, G_X2, G_Y2 are the real face frame coordinates, and P_X1, P_Y1, P_X2, P_Y2 are the predicted face frame coordinates.
The face frame position is adjusted accordingly; the trained model is evaluated on the test set, and the optimal face frame values are determined through the test precision and recall indices, thereby determining the best trained model. Because the second convolutional neural network model is more complex than the first convolutional neural network model and has stronger learning ability, the adjusted face frame position is closer to the real face.
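The regression encoding above and its inverse (used when a predicted frame is adjusted) can be sketched as follows; the function names are illustrative.

```python
# Sketch of the frame-regression encoding above and its inverse; variable names
# mirror the formulas (P = predicted frame, G = real frame, T = regression values).
def encode_box(P, G):
    px1, py1, px2, py2 = P
    gx1, gy1, gx2, gy2 = G
    w, h = px2 - px1, py2 - py1
    return ((gx1 - px1) / w, (gy1 - py1) / h,
            (gx2 - px2) / w, (gy2 - py2) / h)

def decode_box(P, T):
    """Apply regression values T = (tx1, ty1, tx2, ty2) to a predicted frame P."""
    px1, py1, px2, py2 = P
    tx1, ty1, tx2, ty2 = T
    w, h = px2 - px1, py2 - py1
    return (px1 + tx1 * w, py1 + ty1 * h, px2 + tx2 * w, py2 + ty2 * h)
```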
Example 4
Based on embodiments 1, 2 and 3, and according to fig. 4, the third convolutional neural network structure is a lightweight convolutional neural network. Images are input one by one into the face segmentation network structure; through a series of convolution, pooling and deconvolution operations, the face confidence and face frame regression values are obtained in the intermediate layers, face frames whose confidence is below the face confidence threshold continue to be filtered out, and fine regression is performed on the remaining face frames at the same time so that their positions become more accurate. Segmentation of the face image is achieved in the last layer: the face segmentation CNN network can segment the whole face region and, at the same time, local face regions such as the eyes, nose and mouth.
Example 5
On the basis of embodiment 1 and according to fig. 7, an automated training and testing system for face segmentation is formed. The system comprises a training set, a test set, face data, a model training module, a model screening module, a face segmentation CNN network and a segmentation output module. Based on the face segmentation network structure designed in embodiment 1, a model is trained with the training set and the trained model is tested with the test set; unqualified models are removed and qualified models are retained. The collected and organized face data are input into the face segmentation network model to obtain segmentation results, and face data with poor segmentation results are labeled, added to the training set, and used for iterative training.
The model training module receives the training set, performs end-to-end training on the training set to form a training model, and stores the training model;
The model training module is connected with the model screening module and transmits the training model to the model screening module; the model screening module receives the test set and tests the training model according to the test set; when the test index of the training model is within the test set threshold, reserving the training model, otherwise deleting the training model;
The model screening module is connected with the face segmentation CNN network and transmits the reserved training model to the face segmentation CNN network;
The face segmentation CNN network automatically collects and organizes face data and inputs them into the retained face segmentation model to perform the face segmentation operation; the face segmentation CNN network realizes not only whole-face region segmentation but also local region segmentation of the eyes, nose, mouth and the like; face data with poor segmentation results are labeled, added to the training set, and used for iterative training.
The training set comprises the WIDER FACE dataset and a self-collected and labeled face training set; the test set comprises the FDDB test set and a self-collected and labeled face test set; the test indices comprise the recall rate, false detection rate and segmentation accuracy. The thresholds for the face recall rate, false detection rate and segmentation precision are set according to the specific application scene.
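The screening step of this system can be sketched as below; the metric names and the example threshold and metric values are arbitrary placeholders, not figures from the patent.

```python
# Sketch of the model screening step: keep a trained model only if its test-set
# metrics clear the configured thresholds; all numbers below are placeholders.
def screen_model(metrics, thresholds):
    """metrics / thresholds: dicts with recall, false_detection_rate, seg_accuracy."""
    return (metrics["recall"] >= thresholds["recall"]
            and metrics["false_detection_rate"] <= thresholds["false_detection_rate"]
            and metrics["seg_accuracy"] >= thresholds["seg_accuracy"])

keep_model = screen_model(
    {"recall": 0.94, "false_detection_rate": 0.02, "seg_accuracy": 0.91},   # placeholder metrics
    {"recall": 0.90, "false_detection_rate": 0.05, "seg_accuracy": 0.85},   # placeholder thresholds
)
```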
Example 6
Based on the above embodiment, this embodiment is shown in fig. 8, and further includes a convolutional neural network of 32×32 for extracting local region features for the segmented image key parts.
A local region picture is obtained according to the final face local region segmentation result and input into the corresponding convolutional neural network to obtain local features. For example, the mouth pictures of two faces are input to obtain the corresponding features F1 and F2 of dimension M; to improve the recognition effect, F1 and F2 are normalized to G1 and G2 respectively, where G1 and G2 each have M elements. Formula 10 calculates the similarity of G1 and G2:
SIM = Σ_K G1_K × G2_K    (formula 10)
In formula 10, G1_K is an element of feature G1 and G2_K is an element of feature G2; whether the two features belong to the same person is judged from the magnitude of the similarity SIM, giving a more accurate recognition effect for a specific face among similar faces.
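A NumPy sketch of the comparison of formula 10 follows; L2 normalization is assumed (the text only says "normalized"), and the decision threshold is an arbitrary illustration.

```python
# Sketch of formula 10: normalize local features and compare by inner product.
import numpy as np

def local_feature_similarity(f1, f2, eps=1e-12):
    g1 = f1 / (np.linalg.norm(f1) + eps)    # normalize F1 -> G1 (L2 norm assumed)
    g2 = f2 / (np.linalg.norm(f2) + eps)    # normalize F2 -> G2
    return float(np.sum(g1 * g2))           # SIM = sum_K G1_K * G2_K

# Example decision with an arbitrary threshold.
same_person = local_feature_similarity(np.random.rand(128), np.random.rand(128)) > 0.6
```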

Claims (8)

1. A multi-scale rapid face segmentation method based on a deep convolution cascade network is characterized by comprising the following steps:
s1: generating a face candidate frame; making an image pyramid on an input image to be segmented, and inputting the image pyramid into a first convolutional neural network model one by one, wherein the first convolutional neural network model predicts face candidate frames with different scales at different depth layers, so as to realize multi-scale face candidate frame prediction;
S2: classifying and regressing face frames; inputting the face frames predicted in the step S1 into a second convolutional neural network model, wherein the second convolutional neural network model predicts confidence degrees and frame regression values of the face frames one by one;
when the confidence coefficient of the face frame is smaller than the confidence coefficient threshold value of the face frame, the face frame is filtered by the second convolutional neural network model; otherwise, the position of the face frame is optimized; and outputting the adjusted face frame;
S3: inputting the face frame adjusted in the step S2 into a third convolutional neural network model, and filtering the face frame by the third convolutional neural network model according to the step S2;
When the confidence coefficient of the face frame is larger than or equal to a threshold value of the confidence coefficient of the face frame, the face frame position is adjusted by the third convolutional neural network model, and the face whole area and the local area of the adjusted face frame are segmented;
s4: outputting a result, namely outputting the result after segmentation in the step S3; the first convolutional neural network model in the S1 is a lightweight characteristic pyramid network model, the network depth is 6 layers, and the maximum number of channels of the network is 32;
The feature pyramid network model receives an input image, and obtains heat maps with different confidence coefficients at three different depth layers, wherein the heat maps are used for representing the distribution situation of the human face; and calculating local peak points of the confidence heat maps according to the different confidence heat maps, and removing repeated face frames of each scale according to the local peak points of the confidence heat maps.
2. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein the calculation of the local peak point of the confidence heat map comprises the following steps:
Step 1, carrying out a maximum pooling operation on the confidence heat map according to formula 1 to obtain the confidence heat map maximum feature map:
x_k^l = β_k^l · down(x_k^(l-1)) + b_k^l    (formula 1)
In formula 1, x_k^l and x_k^(l-1) represent the kth feature map of the current layer and of the previous layer respectively, down(·) is the downsampling function, β_k^l represents the weighting coefficient of the kth feature map of the current layer, and b_k^l represents the bias of the kth feature map of the current pooling layer; in the confidence heat map maximum feature map calculation, x_k^l and x_k^(l-1) represent the confidence heat map maximum feature map and the confidence heat map respectively;
Step 2, comparing the value at each position of the confidence heat map with the value at the corresponding position of the confidence heat map maximum feature map from step 1; when the two values are the same, the position is assigned 1, and when they differ, the position is assigned 0, giving the confidence heat map local peak point position feature map according to formula 2:
C_ij = 1 if A_ij = B_ij, otherwise C_ij = 0    (formula 2)
In formula 2, A_ij represents the pixel value in row i and column j of the confidence heat map, B_ij represents the pixel value in row i and column j of the confidence heat map maximum feature map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width;
Step 3, multiplying the value at each position of the confidence heat map by the value at the corresponding position of the local peak point position feature map from step 2, generating the confidence heat map local peak point feature map according to formula 3:
D_ij = A_ij × C_ij    (formula 3)
In formula 3, A_ij represents the pixel value in row i and column j of the confidence heat map, C_ij represents the pixel value in row i and column j of the confidence heat map local peak point position feature map, D_ij represents the pixel value in row i and column j of the confidence heat map local peak point feature map, i ranges over [0, M-1], j ranges over [0, N-1], M represents the height of the confidence heat map and N represents its width.
3. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein for each input face a segmentation feature map of dimension K×m² is output, namely K face segmentation branch task output feature maps with a resolution of m×m, where K represents the number of categories; a sigmoid function is applied to each pixel, and the calculation formula of the sigmoid function is as follows:
S_i = 1 / (1 + e^(-i))    (formula 4)
In formula 4, i represents a pixel value in the output feature map, and S_i represents the probability that the pixel belongs to a face or a face local region;
For the segmentation task loss function, a sigmoid cross-entropy loss function is used, calculated as follows:
L_mask = -Σ_i [ y_i^mask · log(S_i) + (1 - y_i^mask) · log(1 - S_i) ]    (formula 5)
In formula 5, y_i^mask represents the true label of the pixel in the feature map, y_i^mask ∈ {0, 1}, and S_i, output by the convolutional neural network, represents the probability that the pixel belongs to a face or a face local region.
4. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein the second convolution neural network model in the S2 is a face classification and regression network structure, the network depth is 5 layers, and the second convolution neural network model is a downsampled convolution neural network model; and S3, a third convolutional neural network model is a face segmentation network model, and the convolutional neural network model is obtained through downsampling and then upsampling.
5. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, wherein segmentation in the step S3 is achieved through pixel level classification, and in the output feature map, pixel level classification is conducted on whether each position is a face whole region, a local region such as an eye, nose and mouth, and the like, so that segmentation is achieved.
6. The multi-scale rapid face segmentation method based on the deep convolutional cascade network as claimed in claim 1, wherein the first convolutional neural network model and the second convolutional neural network model simultaneously realize two tasks of face classification and face frame regression, and the multi-task loss function is defined as:
L_loss = L_cls + L_box    (formula 6)
In formula 6, L_loss represents the total loss value of the face classification and face frame regression tasks, L_cls represents the face classification task loss value, and L_box represents the face frame regression task loss value;
For a third convolutional neural network model, the model simultaneously realizes three tasks of face classification, face frame regression and face segmentation, and a multi-task loss function is defined as follows:
L_loss = L_cls + L_box + L_mask    (formula 7)
In formula 7, L_loss represents the total loss value of face classification, face frame regression and face segmentation, L_cls represents the face classification task loss value, L_box represents the face frame regression task loss value, and L_mask represents the face segmentation task loss value.
7. The multi-scale rapid face segmentation method based on the deep convolution cascade network according to claim 1, further comprising the step of designing a 32 x 32 convolution neural network for local region feature extraction for the segmented image key parts.
8. The multi-scale rapid face segmentation method based on the deep convolutional cascade network of claim 7, wherein,
The local region features are extracted by treating the same local region of each person as one category; a softmax cross-entropy loss function is used, and the classification probability distribution is calculated by the softmax function of formula 8:
P_i = e^(X_i) / Σ_{j=1..k} e^(X_j)    (formula 8)
In formula 8, X_i represents the projection size of a face local region onto the ith person, P_i represents the probability that the face local region belongs to the ith person, and k represents the number of categories of the classification task;
The softmax cross-entropy loss function is calculated by formula 9:
L_cls = -Σ_i y_i^cls · log(p_i)    (formula 9)
In formula 9, y_i^cls represents the label of the real sample, y_i^cls ∈ {0, 1}, and p_i, output by the neural network, represents the probability that a sample is the ith person.
CN202010878450.6A 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network Active CN112132839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878450.6A CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878450.6A CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Publications (2)

Publication Number Publication Date
CN112132839A CN112132839A (en) 2020-12-25
CN112132839B true CN112132839B (en) 2024-04-30

Family

ID=73847509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878450.6A Active CN112132839B (en) 2020-08-27 2020-08-27 Multi-scale rapid face segmentation method based on deep convolution cascade network

Country Status (1)

Country Link
CN (1) CN112132839B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364846B (en) * 2021-01-12 2021-04-30 深圳市一心视觉科技有限公司 Face living body identification method and device, terminal equipment and storage medium
CN112967204A (en) * 2021-03-23 2021-06-15 新疆爱华盈通信息技术有限公司 Noise reduction processing method and system for thermal imaging and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109766887A (en) * 2019-01-16 2019-05-17 中国科学院光电技术研究所 A kind of multi-target detection method based on cascade hourglass neural network
CN110136745A (en) * 2019-05-08 2019-08-16 西北工业大学 A kind of vehicle whistle recognition methods based on convolutional neural networks
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109766887A (en) * 2019-01-16 2019-05-17 中国科学院光电技术研究所 A kind of multi-target detection method based on cascade hourglass neural network
CN110136745A (en) * 2019-05-08 2019-08-16 西北工业大学 A kind of vehicle whistle recognition methods based on convolutional neural networks
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111401257A (en) * 2020-03-17 2020-07-10 天津理工大学 Non-constraint condition face recognition method based on cosine loss

Also Published As

Publication number Publication date
CN112132839A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN109359559B (en) Pedestrian re-identification method based on dynamic shielding sample
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN109614921B (en) Cell segmentation method based on semi-supervised learning of confrontation generation network
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN111340123A (en) Image score label prediction method based on deep convolutional neural network
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN108039044B (en) Vehicle intelligent queuing system and method based on multi-scale convolutional neural network
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN112132839B (en) Multi-scale rapid face segmentation method based on deep convolution cascade network
CN111145145B (en) Image surface defect detection method based on MobileNet
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN113435407B (en) Small target identification method and device for power transmission system
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN111540203B (en) Method for adjusting green light passing time based on fast-RCNN
CN111626357B (en) Image identification method based on neural network model
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN108154199B (en) High-precision rapid single-class target detection method based on deep learning
CN114170422A (en) Coal mine underground image semantic segmentation method
CN115830514B (en) Whole river reach surface flow velocity calculation method and system suitable for curved river channel
CN110738113B (en) Object detection method based on adjacent scale feature filtering and transferring
CN114119382A (en) Image raindrop removing method based on attention generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant