Single-stage face segmentation method
Technical Field
The invention relates to the technical field of image processing, in particular to a single-stage face segmentation method.
Background
Face segmentation technology is mainly used to locate and segment face parts such as the eyes, nose, and lips, and is an important component of face recognition technology. The usual face segmentation method adopts a two-stage scheme, i.e., detect first, then segment; for example, Mask-RCNN can be used to segment the face. The bounding box of each part in the face image is detected first, the image inside the bounding box is then cropped, and the mask of the face part is segmented. This scheme splits the task into two subtasks, detection and segmentation, each completed independently, so the detection quality of each face part directly affects the subsequent segmentation accuracy. Moreover, the detect-then-segment scheme weakens the global context information shared by the face parts and cannot adapt well to face occlusion or complex face poses.
The invention discloses a single-stage face segmentation method, which can complete the localization and segmentation of every face part with only a one-stage network. The method requires only instance mask labels during training, needs no instance bounding box information, and can be learned end to end, which simplifies model training and achieves a better segmentation effect. The segmentation network is divided into two branches: one branch predicts whether a face part exists at a given position in the image and, if so, gives its position and category; it is called the classification branch. The other branch generates the mask of the face part and is called the mask branch.
Spatial consistency exists among the parts of a human face: the two eyes lie to the upper left and upper right of the nose, the lips lie below the nose, this relationship does not change with the pose or position of the face, and the center points of the parts are separated by certain distances. The segmentation method disclosed by the invention is therefore well suited to face segmentation, and combining the two branch networks achieves a face segmentation effect that is simpler, faster, and more accurate than the traditional two-stage segmentation method.
Disclosure of Invention
Against this background, the invention provides a single-stage face segmentation method to improve the speed and accuracy of face segmentation.
In order to achieve the above object, the method for single-stage face segmentation provided by the present invention specifically comprises the following implementation steps:
Step 1: uniformly grid the input face image with a Grid module, providing grid class labels for the classification branch and reference masks for the mask branch;
Specifically, the Grid module uniformly grids the input face image by dividing it evenly into S rows and S columns, i.e., S² grid subimages. Parts such as the left eye, right eye, nose, and lips, plus the background, serve as the reference segmentation labels, 5 classes in total. The grid in which the center point of a face part falls is labeled with that part's class, and all remaining grids are labeled as background, finally producing an S×S matrix used as the reference label of the classification branch. Correspondingly, each grid corresponds to one mask image; since there are S² grids, the mask branch outputs S² channels, each channel corresponding to one grid, and each mask image corresponding to only one object of one class.
Furthermore, the classification branch and the mask branch are two different tasks that jointly complete face segmentation, and the two branches use different loss functions during training. The classification branch uses the Cross Entropy Loss, denoted L_CE, as shown in Formula 1; the mask branch adopts the Dice Loss, denoted L_Dice, as shown in Formula 2:

L_CE = -log( exp(x[class]) / Σ_j exp(x[j]) )    (Formula 1)

L_Dice = 1 - 2·Σ_(x,y) p_(x,y)·q_(x,y) / ( Σ_(x,y) p_(x,y)² + Σ_(x,y) q_(x,y)² )    (Formula 2)

In Formula 1, x[j] and x[class] are outputs of the prediction layer, where x[class] is the output value at the true class;

In Formula 2, p_(x,y) is the pixel value at location (x, y) in the predicted mask, and q_(x,y) is the pixel value at location (x, y) in the ground-truth mask;
During training, the overall loss function is as shown in Formula 3:

L = L_CE + λ·L_Dice    (Formula 3)

In Formula 3, λ is a loss function coefficient used to balance the loss weights of the two branches.
Step 2: extract basic features of the gridded face image with an FCN module;
Specifically, the FCN module consists of one 3×3 convolutional layer and 4 Block layers, where each Block layer contains four 3×3 convolutional layers, one 1×1 convolutional layer, and two summation layers. The input image first passes through the 3×3 convolutional layer with stride 2, then sequentially through the 4 Block layers, each of which performs one feature dimension reduction; finally, the outputs of the last three Block layers are taken as the three-way input of the subsequent JFU module.
Step 3: fuse the three levels of features output by the FCN module with a JFU module to obtain richer feature information;
Specifically, the JFU module consists of convolutional layers, upsampling layers, cascade (concatenation) layers, and dilated convolutional layers. Its three inputs correspond to the outputs of the three Block layers of the FCN module and have different channel numbers and feature plane sizes. Each input passes through a convolutional layer to obtain feature planes with the same channel number, then through an upsampling layer to obtain feature planes of the same size. The three feature planes are cascaded and then passed through dilated convolutional layers with dilation coefficients of 1, 2, 4, and 8, respectively; finally, the four resulting feature planes are cascaded to serve as the input of the next module.
Step 4: process the output of the JFU module with a Category module to realize the classification branch, predicting whether each grid image contains a face part and, if so, giving the corresponding grid position, category, and confidence, thereby providing reference information for the mask branch;
Specifically, the Category module consists of N convolutional layers, 1 prediction layer, and 1 Softmax layer, where the prediction layer is a single convolutional layer with C output channels. The input feature plane passes through the N convolutional layers, then the prediction layer, and finally the Softmax layer. The network output of the classification branch has size S×S×C, i.e., a feature plane of size S×S with C channels; at each grid position the C classes are predicted and the class with the highest confidence is selected, forming the S×S grid prediction result.
Step 5: process the output of the JFU module with a Mask module to obtain the segmentation mask of each face part, predicting the mask image corresponding to each grid.
Specifically, the Mask module mainly consists of M convolutional layers, 1 prediction layer, and 1 upsampling layer, where the prediction layer is a single convolutional layer with S² output channels. The Mask module realizes the mask branch and predicts the mask image corresponding to each grid; the output size of the mask branch is H×W×S², i.e., a mask feature plane of size H×W with S² channels. During segmentation prediction, the grid position (i, j) and class C are first located in the prediction result of the classification branch, and the channel position of the corresponding mask image in the mask branch is found as i×S + j; that mask image is the mask image of the face part of class C.
In the technical scheme provided by the invention, the overall network design avoids the limitation of two-stage segmentation algorithms, namely detecting first and then segmenting; this reduces the complexity of model training, fuses more global context information, and improves the accuracy of model prediction. The parts of a human face are spatially consistent: their relative positions are fixed and the center points of the parts are separated by certain distances. Based on these characteristics of the face parts, the face segmentation network is decomposed into two branches, a classification branch and a mask branch.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a block diagram of the FCN module of the present invention;
FIG. 3 is a Block diagram of the Block module of the present invention;
FIG. 4 is a block diagram of the JFU module of the present invention;
FIG. 5 is a block diagram of a Category module of the present invention;
FIG. 6 is a structural diagram of the Mask module of the present invention.
Detailed Description
In order to make the purpose and technical solution of the present invention clearer, the technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein merely illustrate the technical solution of the present invention, and other embodiments obtained by those skilled in the art without inventive work shall fall within the scope of the present invention.
Fig. 1 is a flowchart illustrating an implementation of the single-stage face segmentation method according to an exemplary embodiment. As shown in fig. 1, the method according to the embodiment of the present invention specifically comprises the following implementation steps:
Step 1: uniformly grid the input face image with a Grid module, providing grid class labels for the classification branch and reference masks for the mask branch;
As shown in fig. 1, the Grid module uniformly grids the input face image by dividing it evenly into S rows and S columns, i.e., S² grid subimages. Parts such as the left eye, right eye, nose, and lips, plus the background, serve as the reference segmentation labels, 5 classes in total. The grid in which the center point of a face part falls is labeled with that part's class, and all remaining grids are labeled as background, finally producing an S×S matrix used as the reference label of the classification branch. Correspondingly, each grid corresponds to one mask image; since there are S² grids, the mask branch outputs S² channels, each channel corresponding to one grid, and each mask image corresponding to only one object of one class.
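By way of illustration only, the following is a minimal Python sketch of how such an S×S reference label could be built; the function name build_grid_labels, the class numbering, and the pixel-coordinate convention are assumptions for the example, not details fixed by the invention:

```python
import numpy as np

# Illustrative class indices (an assumption): 0 = background,
# 1 = left eye, 2 = right eye, 3 = nose, 4 = lips.
BACKGROUND = 0

def build_grid_labels(part_centers, image_size, S):
    """Build the SxS reference label matrix for the classification branch.

    part_centers: dict mapping class index -> (cx, cy) center point in pixels.
    image_size:   (height, width) of the input face image.
    S:            number of grid rows/columns.
    """
    h, w = image_size
    labels = np.full((S, S), BACKGROUND, dtype=np.int64)
    for cls, (cx, cy) in part_centers.items():
        i = min(int(cy * S / h), S - 1)  # grid row holding the center point
        j = min(int(cx * S / w), S - 1)  # grid column holding the center point
        labels[i, j] = cls               # that grid takes the part's class
    return labels

# Example: a 256x256 image divided into S=8 rows and columns (64 grids).
centers = {1: (80, 100), 2: (176, 100), 3: (128, 150), 4: (128, 200)}
grid_labels = build_grid_labels(centers, (256, 256), S=8)
```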
The classification branch and the mask branch are two different tasks that jointly complete face segmentation, and the two branches use different loss functions during training. The classification branch uses the Cross Entropy Loss, denoted L_CE, as shown in Formula 1; the mask branch adopts the Dice Loss, denoted L_Dice, as shown in Formula 2:

L_CE = -log( exp(x[class]) / Σ_j exp(x[j]) )    (Formula 1)

L_Dice = 1 - 2·Σ_(x,y) p_(x,y)·q_(x,y) / ( Σ_(x,y) p_(x,y)² + Σ_(x,y) q_(x,y)² )    (Formula 2)

In Formula 1, x[j] and x[class] are outputs of the prediction layer, where x[class] is the output value at the true class;

In Formula 2, p_(x,y) is the pixel value at location (x, y) in the predicted mask, and q_(x,y) is the pixel value at location (x, y) in the ground-truth mask;
During training, the overall loss function is as shown in Formula 3:

L = L_CE + λ·L_Dice    (Formula 3)

In Formula 3, λ is a loss function coefficient used to balance the loss weights of the two branches.
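As a hedged illustration of Formulas 1 through 3, the following PyTorch sketch computes the combined loss; the tensor shapes, the sigmoid applied to the mask logits, the default λ value, and all names are assumptions for the example:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_mask, true_mask, eps=1e-6):
    """Dice loss (Formula 2): 1 - 2*sum(p*q) / (sum(p^2) + sum(q^2))."""
    p = pred_mask.flatten(1)
    q = true_mask.flatten(1)
    num = 2.0 * (p * q).sum(dim=1)
    den = (p * p).sum(dim=1) + (q * q).sum(dim=1) + eps
    return (1.0 - num / den).mean()

def total_loss(cls_logits, cls_labels, mask_logits, true_masks, lam=1.0):
    """Overall loss (Formula 3): L = L_CE + lambda * L_Dice.

    cls_logits:  (batch, C, S, S) classification-branch outputs.
    cls_labels:  (batch, S, S) integer grid labels.
    mask_logits: (batch, S*S, H, W) mask-branch outputs.
    true_masks:  (batch, S*S, H, W) reference masks.
    """
    # Cross entropy over the SxS grid (Formula 1).
    l_ce = F.cross_entropy(cls_logits, cls_labels)
    l_dice = dice_loss(torch.sigmoid(mask_logits), true_masks)
    return l_ce + lam * l_dice
```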
Step 2: extract basic features of the gridded face image with an FCN module;
Fig. 2 is a structural diagram of the FCN module and fig. 3 is a structural diagram of the Block module. The FCN module consists of one 3×3 convolutional layer and 4 Block layers, where each Block layer contains four 3×3 convolutional layers, one 1×1 convolutional layer, and two summation layers. The input image first passes through the 3×3 convolutional layer with stride 2, then sequentially through the 4 Block layers, each of which performs one feature dimension reduction; finally, the outputs of the last three Block layers are taken as the three-way input of the subsequent JFU module.
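The exact wiring inside a Block is not spelled out beyond the layer counts, so the following PyTorch sketch is one plausible reading (two residual units whose shortcut uses the single 1×1 convolution); the channel widths and activation choices are likewise assumptions:

```python
import torch.nn as nn

class Block(nn.Module):
    """One Block layer: four 3x3 convs, one 1x1 conv, two summation points.

    Interpreted here as two residual units; the 1x1 conv projects the input
    of the first unit so shapes match (this wiring is an assumption).
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=2)  # the 1x1 conv
        self.conv3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv2(self.relu(self.conv1(x))))
        y = y + self.proj(x)                  # first summation layer
        z = self.relu(self.conv4(self.relu(self.conv3(y))))
        return z + y                          # second summation layer

class FCN(nn.Module):
    """Stem 3x3 conv (stride 2) followed by 4 Block layers; the outputs of
    the last three Blocks form the three-way input to the JFU module."""
    def __init__(self, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        self.blocks = nn.ModuleList(
            Block(channels[i], channels[i + 1]) for i in range(4))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats[1], feats[2], feats[3]   # last three Block outputs
```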
Step 3: fuse the three levels of features output by the FCN module with a JFU module to obtain richer feature information;
Fig. 4 is a structural diagram of the JFU module. The JFU module consists of convolutional layers, upsampling layers, cascade (concatenation) layers, and dilated convolutional layers. Its three inputs correspond to the outputs of the three Block layers of the FCN module and have different channel numbers and feature plane sizes. Each input passes through a convolutional layer to obtain feature planes with the same channel number, then through an upsampling layer to obtain feature planes of the same size. The three feature planes are cascaded and then passed through dilated convolutional layers with dilation coefficients of 1, 2, 4, and 8, respectively; finally, the four resulting feature planes are cascaded to serve as the input of the next module.
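A minimal PyTorch sketch of such a fusion module follows; the 1×1 alignment convolutions, bilinear upsampling, and channel widths are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JFU(nn.Module):
    """Fuse three FCN outputs: align channels, upsample to one size,
    concatenate, apply parallel dilated 3x3 convs (dilation 1, 2, 4, 8),
    and concatenate the four results."""
    def __init__(self, in_chs=(256, 512, 1024), mid_ch=256):
        super().__init__()
        # 1x1 convs bring all three inputs to the same channel count.
        self.align = nn.ModuleList(nn.Conv2d(c, mid_ch, 1) for c in in_chs)
        # Parallel dilated convs over the concatenated feature planes.
        self.dilated = nn.ModuleList(
            nn.Conv2d(3 * mid_ch, mid_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8))

    def forward(self, f1, f2, f3):
        target = f1.shape[-2:]                 # largest feature plane
        feats = []
        for conv, f in zip(self.align, (f1, f2, f3)):
            f = conv(f)
            f = F.interpolate(f, size=target, mode='bilinear',
                              align_corners=False)   # upsampling layer
            feats.append(f)
        fused = torch.cat(feats, dim=1)        # cascade (concatenate)
        outs = [conv(fused) for conv in self.dilated]
        return torch.cat(outs, dim=1)          # four planes cascaded
```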
Step 4: process the output of the JFU module with a Category module to realize the classification branch, predicting whether each grid image contains a face part and, if so, giving the corresponding grid position, category, and confidence, thereby providing reference information for the mask branch;
Fig. 5 is a structural diagram of the Category module. The Category module consists of N convolutional layers, 1 prediction layer, and 1 Softmax layer, where N = 4 in this embodiment and the prediction layer is a single convolutional layer with C output channels. The input feature plane passes through the N convolutional layers, then the prediction layer, and finally the Softmax layer. The network output of the classification branch has size S×S×C, i.e., a feature plane of size S×S with C channels; at each grid position the C classes are predicted and the class with the highest confidence is selected, forming the S×S grid prediction result.
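A hedged sketch of the classification branch follows; how the feature plane is brought down to S×S resolution is not specified above, so the adaptive pooling used here is an assumption, as are the channel widths:

```python
import torch.nn as nn
import torch.nn.functional as F

class CategoryBranch(nn.Module):
    """Classification branch: N conv layers, one prediction layer with C
    output channels, then Softmax; output size is S x S x C."""
    def __init__(self, in_ch, C=5, N=4, mid_ch=256, S=8):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(N):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.convs = nn.Sequential(*layers)
        self.predict = nn.Conv2d(mid_ch, C, 1)   # prediction layer
        self.softmax = nn.Softmax(dim=1)
        self.S = S

    def forward(self, x):
        x = self.convs(x)
        # Bring the feature plane to S x S before prediction (assumption).
        x = F.adaptive_avg_pool2d(x, self.S)
        return self.softmax(self.predict(x))     # (batch, C, S, S)
```

At inference time, taking the argmax over the channel dimension would give the S×S grid of predicted classes, with the corresponding softmax value serving as the confidence.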
Step 5: process the output of the JFU module with a Mask module to obtain the segmentation mask of each face part, predicting the mask image corresponding to each grid;
Fig. 6 is a structural diagram of the Mask module. The Mask module mainly consists of M convolutional layers, 1 prediction layer, and 1 upsampling layer, where M = 4 in this embodiment and the prediction layer is a single convolutional layer with S² output channels. The Mask module realizes the mask branch and predicts the mask image corresponding to each grid; the output size of the mask branch is H×W×S², i.e., a mask feature plane of size H×W with S² channels. During segmentation prediction, the grid position (i, j) and class C are first located in the prediction result of the classification branch, and the channel position of the corresponding mask image in the mask branch is found as i×S + j; that mask image is the mask image of the face part of class C.
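A hedged sketch of the mask branch and the channel lookup follows; the sigmoid, bilinear upsampling, and channel widths are assumptions, while the i×S + j channel indexing follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskBranch(nn.Module):
    """Mask branch: M conv layers, a prediction layer with S^2 output
    channels, and an upsampling layer; output size is H x W x S^2."""
    def __init__(self, in_ch, S=8, M=4, mid_ch=256, out_size=(256, 256)):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(M):
            layers += [nn.Conv2d(ch, mid_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.convs = nn.Sequential(*layers)
        self.predict = nn.Conv2d(mid_ch, S * S, 1)   # S^2 output channels
        self.out_size = out_size

    def forward(self, x):
        x = self.predict(self.convs(x))
        # Upsampling layer restores the H x W mask resolution.
        x = F.interpolate(x, size=self.out_size, mode='bilinear',
                          align_corners=False)
        return torch.sigmoid(x)                       # (batch, S*S, H, W)

def lookup_mask(masks, i, j, S):
    """Given grid (i, j) predicted as class C by the classification branch,
    channel i*S + j of the mask output is that part's mask image."""
    return masks[:, i * S + j]
```

The lookup mirrors the channel layout of the grid: each of the S² mask channels is bound to one grid cell, so the classification result at (i, j) directly selects the mask for the part of class C found there.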
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. It will be understood that the disclosure is not limited to the embodiments described and disclosed above, but is intended to cover all modifications and changes that may be made without departing from the scope of the disclosure, as defined in the appended claims.