Single-stage face segmentation method
Technical Field
The invention relates to the technical field of image processing, in particular to a single-stage face segmentation method.
Background
Face segmentation technology is mainly used to locate and segment face parts such as the eyes, nose, and lips, and is an important component of face recognition technology. The usual face segmentation method adopts a two-stage scheme, i.e., detect first, then segment; for example, Mask-RCNN can be used to segment the face. The bounding box of each part in the face image is detected first, the image inside the bounding box is then cropped, and the mask of the face part is segmented. This scheme splits the task into two subtasks, detection and segmentation, each completed independently, so the detection quality of each face part directly affects the subsequent segmentation accuracy. Moreover, the detect-then-segment scheme weakens the global context information shared by the face parts and cannot adapt well to face occlusion or complex face poses.
The invention discloses a single-stage face segmentation method, which can complete the localization and segmentation of every face part with only a one-stage network. The method requires only instance mask labels during training, needs no instance bounding box information, and can be learned end to end, which simplifies model training and achieves a better segmentation effect. The segmentation network is divided into two branches: one branch predicts whether a face part exists at a given position in the image and, if so, gives its position and category; it is called the classification branch. The other branch generates the mask of the face part and is called the mask branch.
Spatial consistency exists among the parts of a human face: the two eyes lie to the upper left and upper right of the nose, the lips lie below the nose, this relationship does not change with the pose or position of the face, and the center points of the parts are separated by certain distances. The segmentation method disclosed by the invention is therefore well suited to face segmentation, and combining the two branch networks achieves a face segmentation effect that is simpler, faster, and more accurate than the traditional two-stage segmentation method.
Disclosure of Invention
Against this background, the invention provides a single-stage face segmentation method to improve the speed and accuracy of face segmentation.
In order to achieve the above object, the method for single-stage face segmentation provided by the present invention specifically comprises the following implementation steps:
Step 1: uniformly grid the input face image with a Grid module, providing grid class labels for the classification branch and reference masks for the mask branch;
Specifically, the Grid module uniformly grids the input face image by dividing it evenly into S rows and S columns, i.e., S² grid subimages. Parts such as the left eye, right eye, nose, and lips, plus the background, serve as the reference segmentation labels, 5 classes in total. The grid in which the center point of a face part falls is labeled with that part's class, and all remaining grids are labeled as background, finally producing an S×S matrix used as the reference label of the classification branch. Correspondingly, each grid corresponds to one mask image; since there are S² grids, the mask branch outputs S² channels, each channel corresponding to one grid, and each mask image corresponding to only one object of one class.
Furthermore, the classification branch and the mask branch are two different tasks that jointly complete face segmentation, and the two branches use different loss functions during training. The classification branch uses the Cross Entropy Loss, denoted L_CE, as shown in Formula 1; the mask branch adopts the Dice Loss, denoted L_Dice, as shown in Formula 2:

L_CE = -log( exp(x[class]) / Σ_j exp(x[j]) )    (Formula 1)

L_Dice = 1 - 2·Σ_(x,y) p_(x,y)·q_(x,y) / ( Σ_(x,y) p_(x,y)² + Σ_(x,y) q_(x,y)² )    (Formula 2)

In Formula 1, x[j] and x[class] are outputs of the prediction layer, where x[class] is the output value at the true class;

In Formula 2, p_(x,y) is the pixel value at location (x, y) in the predicted mask, and q_(x,y) is the pixel value at location (x, y) in the ground-truth mask;
During training, the overall loss function is as shown in Formula 3:

L = L_CE + λ·L_Dice    (Formula 3)

In Formula 3, λ is a loss function coefficient used to balance the loss weights of the two branches.
Step 2: extract basic features of the gridded face image with an FCN module;
Specifically, the FCN module consists of one 3×3 convolutional layer and 4 Block layers, where each Block layer contains four 3×3 convolutional layers, one 1×1 convolutional layer, and two summation layers. The input image first passes through the 3×3 convolutional layer with stride 2, then sequentially through the 4 Block layers, each of which performs one feature dimension reduction; finally, the outputs of the last three Block layers are taken as the three-way input of the subsequent JFU module.
Step 3: fuse the three levels of features output by the FCN module with a JFU module to obtain richer feature information;
Specifically, the JFU module consists of convolutional layers, upsampling layers, cascade (concatenation) layers, and dilated convolutional layers. Its three inputs correspond to the outputs of the three Block layers of the FCN module and have different channel numbers and feature plane sizes. Each input passes through a convolutional layer to obtain feature planes with the same channel number, then through an upsampling layer to obtain feature planes of the same size. The three feature planes are cascaded and then passed through dilated convolutional layers with dilation coefficients of 1, 2, 4, and 8, respectively; finally, the four resulting feature planes are cascaded to serve as the input of the next module.
Step 4: process the output of the JFU module with a Category module to realize the classification branch, predicting whether each grid image contains a face part and, if so, giving the corresponding grid position, category, and confidence, thereby providing reference information for the mask branch;
Specifically, the Category module consists of N convolutional layers, 1 prediction layer, and 1 Softmax layer, where the prediction layer is a single convolutional layer with C output channels. The input feature plane passes through the N convolutional layers, then the prediction layer, and finally the Softmax layer. The network output of the classification branch has size S×S×C, i.e., a feature plane of size S×S with C channels; at each grid position the C classes are predicted and the class with the highest confidence is selected, forming the S×S grid prediction result.
Step 5: process the output of the JFU module with a Mask module to obtain the segmentation mask of each face part, predicting the mask image corresponding to each grid.
Specifically, the Mask module mainly consists of M convolutional layers, 1 prediction layer, and 1 upsampling layer, where the prediction layer is a single convolutional layer with S² output channels. The Mask module realizes the mask branch and predicts the mask image corresponding to each grid; the output size of the mask branch is H×W×S², i.e., a mask feature plane of size H×W with S² channels. During segmentation prediction, the grid position (i, j) and class C are first located in the prediction result of the classification branch, and the channel position of the corresponding mask image in the mask branch is found as i×S + j; that mask image is the mask image of the face part of class C.
In the technical scheme provided by the invention, the overall network design avoids the limitation of two-stage segmentation algorithms, namely detecting first and then segmenting; this reduces the complexity of model training, fuses more global context information, and improves the accuracy of model prediction. The parts of a human face are spatially consistent: their relative positions are fixed and the center points of the parts are separated by certain distances. Based on these characteristics of the face parts, the face segmentation network is decomposed into two branches, a classification branch and a mask branch.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a block diagram of the FCN module of the present invention;
FIG. 3 is a Block diagram of the Block module of the present invention;
FIG. 4 is a block diagram of the JFU module of the present invention;
FIG. 5 is a block diagram of a Category module of the present invention;
FIG. 6 is a structural diagram of the Mask module of the present invention.
Detailed Description
In order to make the purpose and technical solution of the present invention clearer, the technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein merely illustrate the technical solution of the present invention, and other embodiments obtained by those skilled in the art without inventive work shall fall within the scope of the present invention.
Fig. 1 is a flowchart illustrating an implementation of the single-stage face segmentation method according to an exemplary embodiment. As shown in fig. 1, the method according to the embodiment of the present invention specifically comprises the following implementation steps:
Step 1: uniformly grid the input face image with a Grid module, providing grid class labels for the classification branch and reference masks for the mask branch;
As shown in fig. 1, the Grid module uniformly grids the input face image by dividing it evenly into S rows and S columns, i.e., S² grid subimages. Parts such as the left eye, right eye, nose, and lips, plus the background, serve as the reference segmentation labels, 5 classes in total. The grid in which the center point of a face part falls is labeled with that part's class, and all remaining grids are labeled as background, finally producing an S×S matrix used as the reference label of the classification branch. Correspondingly, each grid corresponds to one mask image; since there are S² grids, the mask branch outputs S² channels, each channel corresponding to one grid, and each mask image corresponding to only one object of one class.
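By way of illustration only, the following is a minimal Python sketch of how such an S×S reference label could be built; the function name build_grid_labels, the class numbering, and the pixel-coordinate convention are assumptions for the example, not details fixed by the invention:

```python
import numpy as np

# Illustrative class indices (an assumption): 0 = background,
# 1 = left eye, 2 = right eye, 3 = nose, 4 = lips.
BACKGROUND = 0

def build_grid_labels(part_centers, image_size, S):
    """Build the SxS reference label matrix for the classification branch.

    part_centers: dict mapping class index -> (cx, cy) center point in pixels.
    image_size:   (height, width) of the input face image.
    S:            number of grid rows/columns.
    """
    h, w = image_size
    labels = np.full((S, S), BACKGROUND, dtype=np.int64)
    for cls, (cx, cy) in part_centers.items():
        i = min(int(cy * S / h), S - 1)  # grid row holding the center point
        j = min(int(cx * S / w), S - 1)  # grid column holding the center point
        labels[i, j] = cls               # that grid takes the part's class
    return labels

# Example: a 256x256 image divided into S=8 rows and columns (64 grids).
centers = {1: (80, 100), 2: (176, 100), 3: (128, 150), 4: (128, 200)}
grid_labels = build_grid_labels(centers, (256, 256), S=8)
```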
The classification branch and the mask branch are two different tasks that jointly complete face segmentation, and the two branches use different loss functions during training. The classification branch uses the Cross Entropy Loss, denoted L_CE, as shown in Formula 1; the mask branch adopts the Dice Loss, denoted L_Dice, as shown in Formula 2:

L_CE = -log( exp(x[class]) / Σ_j exp(x[j]) )    (Formula 1)

L_Dice = 1 - 2·Σ_(x,y) p_(x,y)·q_(x,y) / ( Σ_(x,y) p_(x,y)² + Σ_(x,y) q_(x,y)² )    (Formula 2)

In Formula 1, x[j] and x[class] are outputs of the prediction layer, where x[class] is the output value at the true class;

In Formula 2, p_(x,y) is the pixel value at location (x, y) in the predicted mask, and q_(x,y) is the pixel value at location (x, y) in the ground-truth mask;
During training, the overall loss function is as shown in Formula 3:

L = L_CE + λ·L_Dice    (Formula 3)

In Formula 3, λ is a loss function coefficient used to balance the loss weights of the two branches.
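As a hedged illustration of Formulas 1 through 3, the following PyTorch sketch computes the combined loss; the tensor shapes, the sigmoid applied to the mask logits, the default λ value, and all names are assumptions for the example:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_mask, true_mask, eps=1e-6):
    """Dice loss (Formula 2): 1 - 2*sum(p*q) / (sum(p^2) + sum(q^2))."""
    p = pred_mask.flatten(1)
    q = true_mask.flatten(1)
    num = 2.0 * (p * q).sum(dim=1)
    den = (p * p).sum(dim=1) + (q * q).sum(dim=1) + eps
    return (1.0 - num / den).mean()

def total_loss(cls_logits, cls_labels, mask_logits, true_masks, lam=1.0):
    """Overall loss (Formula 3): L = L_CE + lambda * L_Dice.

    cls_logits:  (batch, C, S, S) classification-branch outputs.
    cls_labels:  (batch, S, S) integer grid labels.
    mask_logits: (batch, S*S, H, W) mask-branch outputs.
    true_masks:  (batch, S*S, H, W) reference masks.
    """
    # Cross entropy over the SxS grid (Formula 1).
    l_ce = F.cross_entropy(cls_logits, cls_labels)
    l_dice = dice_loss(torch.sigmoid(mask_logits), true_masks)
    return l_ce + lam * l_dice
```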
Step 2: extract basic features of the gridded face image with an FCN module;
Fig. 2 is a structural diagram of the FCN module and fig. 3 is a structural diagram of the Block module. The FCN module consists of one 3×3 convolutional layer and 4 Block layers, where each Block layer contains four 3×3 convolutional layers, one 1×1 convolutional layer, and two summation layers. The input image first passes through the 3×3 convolutional layer with stride 2, then sequentially through the 4 Block layers, each of which performs one feature dimension reduction; finally, the outputs of the last three Block layers are taken as the three-way input of the subsequent JFU module.
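The exact wiring inside a Block is not spelled out beyond the layer counts, so the following PyTorch sketch is one plausible reading (two residual units whose shortcut uses the single 1×1 convolution); the channel widths and activation choices are likewise assumptions:

```python
import torch.nn as nn

class Block(nn.Module):
    """One Block layer: four 3x3 convs, one 1x1 conv, two summation points.

    Interpreted here as two residual units; the 1x1 conv projects the input
    of the first unit so shapes match (this wiring is an assumption).
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=2)  # the 1x1 conv
        self.conv3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv2(self.relu(self.conv1(x))))
        y = y + self.proj(x)                  # first summation layer
        z = self.relu(self.conv4(self.relu(self.conv3(y))))
        return z + y                          # second summation layer

class FCN(nn.Module):
    """Stem 3x3 conv (stride 2) followed by 4 Block layers; the outputs of
    the last three Blocks form the three-way input to the JFU module."""
    def __init__(self, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        self.blocks = nn.ModuleList(
            Block(channels[i], channels[i + 1]) for i in range(4))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats[1], feats[2], feats[3]   # last three Block outputs
```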
Step 3: fuse the three levels of features output by the FCN module with a JFU module to obtain richer feature information;
Fig. 4 is a structural diagram of the JFU module. The JFU module consists of convolutional layers, upsampling layers, cascade (concatenation) layers, and dilated convolutional layers. Its three inputs correspond to the outputs of the three Block layers of the FCN module and have different channel numbers and feature plane sizes. Each input passes through a convolutional layer to obtain feature planes with the same channel number, then through an upsampling layer to obtain feature planes of the same size. The three feature planes are cascaded and then passed through dilated convolutional layers with dilation coefficients of 1, 2, 4, and 8, respectively; finally, the four resulting feature planes are cascaded to serve as the input of the next module.
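A minimal PyTorch sketch of such a fusion module follows; the 1×1 alignment convolutions, bilinear upsampling, and channel widths are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JFU(nn.Module):
    """Fuse three FCN outputs: align channels, upsample to one size,
    concatenate, apply parallel dilated 3x3 convs (dilation 1, 2, 4, 8),
    and concatenate the four results."""
    def __init__(self, in_chs=(256, 512, 1024), mid_ch=256):
        super().__init__()
        # 1x1 convs bring all three inputs to the same channel count.
        self.align = nn.ModuleList(nn.Conv2d(c, mid_ch, 1) for c in in_chs)
        # Parallel dilated convs over the concatenated feature planes.
        self.dilated = nn.ModuleList(
            nn.Conv2d(3 * mid_ch, mid_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8))

    def forward(self, f1, f2, f3):
        target = f1.shape[-2:]                 # largest feature plane
        feats = []
        for conv, f in zip(self.align, (f1, f2, f3)):
            f = conv(f)
            f = F.interpolate(f, size=target, mode='bilinear',
                              align_corners=False)   # upsampling layer
            feats.append(f)
        fused = torch.cat(feats, dim=1)        # cascade (concatenate)
        outs = [conv(fused) for conv in self.dilated]
        return torch.cat(outs, dim=1)          # four planes cascaded
```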
Step 4: process the output of the JFU module with a Category module to realize the classification branch, predicting whether each grid image contains a face part and, if so, giving the corresponding grid position, category, and confidence, thereby providing reference information for the mask branch;
Fig. 5 is a structural diagram of the Category module. The Category module consists of N convolutional layers, 1 prediction layer, and 1 Softmax layer, where N = 4 in this embodiment and the prediction layer is a single convolutional layer with C output channels. The input feature plane passes through the N convolutional layers, then the prediction layer, and finally the Softmax layer. The network output of the classification branch has size S×S×C, i.e., a feature plane of size S×S with C channels; at each grid position the C classes are predicted and the class with the highest confidence is selected, forming the S×S grid prediction result.
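A hedged sketch of the classification branch follows; how the feature plane is brought down to S×S resolution is not specified above, so the adaptive pooling used here is an assumption, as are the channel widths:

```python
import torch.nn as nn
import torch.nn.functional as F

class CategoryBranch(nn.Module):
    """Classification branch: N conv layers, one prediction layer with C
    output channels, then Softmax; output size is S x S x C."""
    def __init__(self, in_ch, C=5, N=4, mid_ch=256, S=8):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(N):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.convs = nn.Sequential(*layers)
        self.predict = nn.Conv2d(mid_ch, C, 1)   # prediction layer
        self.softmax = nn.Softmax(dim=1)
        self.S = S

    def forward(self, x):
        x = self.convs(x)
        # Bring the feature plane to S x S before prediction (assumption).
        x = F.adaptive_avg_pool2d(x, self.S)
        return self.softmax(self.predict(x))     # (batch, C, S, S)
```

At inference time, taking the argmax over the channel dimension would give the S×S grid of predicted classes, with the corresponding softmax value serving as the confidence.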
Step 5: process the output of the JFU module with a Mask module to obtain the segmentation mask of each face part, predicting the mask image corresponding to each grid;
Fig. 6 is a structural diagram of the Mask module. The Mask module mainly consists of M convolutional layers, 1 prediction layer, and 1 upsampling layer, where M = 4 in this embodiment and the prediction layer is a single convolutional layer with S² output channels. The Mask module realizes the mask branch and predicts the mask image corresponding to each grid; the output size of the mask branch is H×W×S², i.e., a mask feature plane of size H×W with S² channels. During segmentation prediction, the grid position (i, j) and class C are first located in the prediction result of the classification branch, and the channel position of the corresponding mask image in the mask branch is found as i×S + j; that mask image is the mask image of the face part of class C.
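A hedged sketch of the mask branch and the channel lookup follows; the sigmoid, bilinear upsampling, and channel widths are assumptions, while the i×S + j channel indexing follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskBranch(nn.Module):
    """Mask branch: M conv layers, a prediction layer with S^2 output
    channels, and an upsampling layer; output size is H x W x S^2."""
    def __init__(self, in_ch, S=8, M=4, mid_ch=256, out_size=(256, 256)):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(M):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.convs = nn.Sequential(*layers)
        self.predict = nn.Conv2d(mid_ch, S * S, 1)   # S^2 output channels
        self.out_size = out_size

    def forward(self, x):
        x = self.predict(self.convs(x))
        # Upsampling layer restores the H x W mask resolution.
        x = F.interpolate(x, size=self.out_size, mode='bilinear',
                          align_corners=False)
        return torch.sigmoid(x)                       # (batch, S*S, H, W)

def lookup_mask(masks, i, j, S):
    """Given grid (i, j) predicted as class C by the classification branch,
    channel i*S + j of the mask output is that part's mask image."""
    return masks[:, i * S + j]
```

The lookup mirrors the channel layout of the grid: each of the S² mask channels is bound to one grid cell, so the classification result at (i, j) directly selects the mask for the part of class C found there.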
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. It will be understood that the disclosure is not limited to the embodiments described and disclosed above, but is intended to cover all modifications and changes that may be made without departing from the scope of the disclosure, as defined in the appended claims.