CN111899169B - Method for segmenting face images based on semantic segmentation - Google Patents

Method for segmenting face images based on semantic segmentation

Info

Publication number
CN111899169B
CN111899169B (application CN202010628571.5A)
Authority
CN
China
Prior art keywords
convolution
module
network
resolution
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010628571.5A
Other languages
Chinese (zh)
Other versions
CN111899169A (en)
Inventor
杨海东
李泽辉
陈俊杰
黄坤山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority to CN202010628571.5A
Publication of CN111899169A
Application granted
Publication of CN111899169B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T3/4076 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution using the original low-resolution images to iteratively correct the high-resolution images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20172 Image enhancement details
    • G06T2207/20192 Edge enhancement; Edge preservation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for segmenting face images based on semantic segmentation, which comprises the following steps: obtaining an image dataset; constructing a deep convolutional network structure for segmentation; training on the data with this network structure to obtain a trained model; validating and tuning parameters on a validation set to select the optimal model; and testing the selected optimal model on a test set. The invention adopts a lightweight model that combines a spatial channel with a context-information channel: starting from the original spatial network structure, sub-networks running from high to low resolution are added stage by stage and connected in parallel, yielding the information interaction module of the invention. Multi-scale fusion is then performed several times, so that each representation, from high resolution to low, repeatedly receives information from the other parallel representations, producing rich high-resolution representations. Because the parallel connections maintain a high-resolution representation throughout, the predictions are more spatially accurate.

Description

Method for segmenting face images based on semantic segmentation
Technical Field
The invention relates to the technical field of image processing, and in particular to a method for segmenting face images based on semantic segmentation.
Background
Thanks to the development and application of convolutional neural networks (CNNs), many tasks in computer vision have advanced greatly. Image segmentation is one such task; its purpose is to divide an image into regions occupied by different targets and to label them. Semantic segmentation goes further and labels the image at the pixel level, assigning each pixel its corresponding category; because every pixel is considered, semantic segmentation is a dense prediction task. Various approaches to semantic segmentation exist, such as patch classification, fully convolutional methods, and encoder-decoder architectures; the encoder-decoder architecture is currently popular, and the deep convolutional network of this invention is designed on that architecture.
However, when designing a semantic segmentation model, the pursuit of accuracy often leads to complicated backbones, which bring a heavy computational burden and large memory footprint. Because of this backbone complexity, such models are difficult to deploy in practical applications. Balancing the efficiency and speed of a segmentation network, and providing a simpler solution for multi-task segmentation, is therefore an important problem in the current semantic segmentation field.
At present, a variety of beautification and makeup software is on the market. To process a particular part of the human face, each part must first be segmented, and beautification or makeup is then applied to the individual parts. The present invention addresses the segmentation of face images, mainly segmenting the facial region and the hair, while preserving and appropriately denoising the edge between face and hair so that the processed image looks more natural and softer.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for segmenting face images based on semantic segmentation.
The aim of the invention is achieved by the following technical scheme:
a method for segmenting a network of face images based on semantic segmentation mainly comprises the following specific steps:
step S1: corresponding image data sets for face segmentation are obtained after a series of operations.
Further, the step S1 further includes the following steps:
and S11, adopting a face recognition data set Labeled Faces in the wild (LFW) to divide the training set, the verification set and the test set in proportion.
Further, the series of operations in step S1 includes the operations of averaging, defogging, and cutting.
Step S2: and constructing a segmented deep convolution network structure.
Further, the step S2 further includes the following steps:
step S21: the deep convolutional network adopts an encoder-decoder architecture as a network structure, which includes an encoder module and a decoder module.
Step S22: the encoder module comprises three parts, namely a rapid downsampling module, an information interaction module and an expansion group.
Further, step S22 further includes the following steps:
Step S221: the rapid downsampling module consists of three convolutional layers of two forms: standard convolution and depthwise separable convolution. Depthwise separable convolution effectively reduces the parameter quantity of the model, thereby reducing the computational burden. Each convolutional layer is followed by a BN layer and uses a ReLU activation function.
Step S222: the information interaction module adopts the inverted residual bottleneck block (inverted bottleneck residual block) of MobileNetV2; a refined output is obtained through information interaction and fusion among feature maps of different scales.
Step S223: the dilation group applies spatial dilated convolution to the feature-fused output of the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel, so that context information at more levels is captured.
Step S23: the decoder module mainly comprises a bilinear upsampling layer and a convolutional layer, the convolutional layer being followed by a softmax layer for pixel-level classification.
Step S24: the output of the decoder module is post-processed with a guided filter to preserve face and hair edge detail and reduce noise.
Step S3: train on the data using the network structure of step S2 to obtain the corresponding trained model;
Step S4: validate and tune parameters on the validation set, and select the optimal model;
Step S5: test the selected optimal model on the test set and evaluate the model's performance.
The working process and principle of the invention are as follows: the method for segmenting face images based on semantic segmentation adopts a lightweight model that combines a spatial channel with a context-information channel. Starting from the original spatial network structure, sub-networks running from high to low resolution are added stage by stage to form more stages, and the multi-resolution sub-networks are connected in parallel to obtain the information interaction module of the invention. Multi-scale fusion is then performed several times, so that each representation, from high resolution to low, repeatedly receives information from the other parallel representations, yielding rich high-resolution representations. Because the parallel connections maintain a high-resolution representation throughout, the predictions are more spatially accurate.
Compared with the prior art, the invention has the following advantages:
(1) The method for segmenting face images based on semantic segmentation provided by the invention improves speed without reducing network performance, and its efficiency is markedly improved compared with the prior art.
(2) With the picture processed by the network, the method obtains the face and hair regions, after which corresponding operations such as hair dyeing can be carried out according to the user's needs; the operation is simple, convenient, and fast.
Drawings
Fig. 1 is a flowchart of the image segmentation method provided by the present invention.
Fig. 2 is a schematic diagram of the structure of the whole face image segmentation network of the present invention.
Fig. 3 is a schematic diagram of the inverted residual block of the MobileNetV2 network structure used by the present invention.
Fig. 4 is a schematic diagram of the network structure provided by the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and examples.
Example 1:
As shown in Figs. 1 to 4, this embodiment discloses a method for segmenting face images based on semantic segmentation, which mainly comprises the following specific steps:
Step S1: obtain a corresponding image dataset for face segmentation through a series of operations.
Further, step S1 further includes the following steps:
Step S11: adopt the face recognition dataset Labeled Faces in the Wild (LFW) and divide it proportionally into a training set, a validation set, and a test set.
Further, the series of operations in step S1 includes mean subtraction, dehazing, and cropping.
Step S2: construct the deep convolutional network structure for segmentation.
Further, step S2 further includes the following steps:
Step S21: the deep convolutional network adopts an encoder-decoder architecture as its network structure, which includes an encoder module and a decoder module.
Step S22: the encoder module comprises three parts, namely a rapid downsampling module, an information interaction module, and a dilation group.
Further, step S22 further includes the following steps:
Step S221: the rapid downsampling module consists of three convolutional layers of two forms: standard convolution and depthwise separable convolution. Depthwise separable convolution effectively reduces the parameter quantity of the model, thereby reducing the computational burden. Each convolutional layer is followed by a BN layer and uses a ReLU activation function.
Step S222: the information interaction module adopts the inverted residual bottleneck block (inverted bottleneck residual block) of MobileNetV2; a refined output is obtained through information interaction and fusion among feature maps of different scales.
Step S223: the dilation group applies spatial dilated convolution to the feature-fused output of the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel, so that context information at more levels is captured.
Step S23: the decoder module mainly comprises a bilinear upsampling layer and a convolutional layer, the convolutional layer being followed by a softmax layer for pixel-level classification.
Step S24: the output of the decoder module is post-processed with a guided filter to preserve face and hair edge detail and reduce noise.
Step S3: train on the data using the network structure of step S2 to obtain the corresponding trained model;
Step S4: validate and tune parameters on the validation set, and select the optimal model;
Step S5: test the selected optimal model on the test set and evaluate the model's performance.
The working process and principle of the invention are as follows: the method for segmenting face images based on semantic segmentation adopts a lightweight model that combines a spatial channel with a context-information channel. Starting from the original spatial network structure, sub-networks running from high to low resolution are added stage by stage to form more stages, and the multi-resolution sub-networks are connected in parallel to obtain the information interaction module of the invention. Multi-scale fusion is then performed several times, so that each representation, from high resolution to low, repeatedly receives information from the other parallel representations, yielding rich high-resolution representations. Because the parallel connections maintain a high-resolution representation throughout, the predictions are more spatially accurate.
Example 2:
Referring to Figs. 1 to 4, this embodiment discloses a method for segmenting face images based on semantic segmentation, comprising the following steps:
Step S1: obtain a corresponding image dataset for face segmentation.
Further, step S1 includes:
Step S11: the well-known face recognition dataset Labeled Faces in the Wild (LFW), a popular public dataset containing more than 13,000 face pictures, is adopted. We train on its extended version, Part Labels, whose labels were produced with a superpixel segmentation algorithm, so the resulting dataset is already annotated. The dataset is then divided into a training set, a validation set, and a test set of 1500, 500, and 1000 images, respectively.
A series of preprocessing operations follows, including mean subtraction, dehazing, and cropping. The result is a 224×224 RGB input image awaiting model training.
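As a concrete illustration of steps S11 and S1, the following Python sketch splits an image list 1500/500/1000 and produces 224×224 RGB inputs. The directory layout, the per-image mean subtraction, and the omission of the dehazing step are our assumptions; the patent does not specify them.

```python
import random
from pathlib import Path

import cv2
import numpy as np

random.seed(0)
images = sorted(Path('part_labels/images').glob('*.jpg'))  # hypothetical dataset path
random.shuffle(images)
train, val, test = images[:1500], images[1500:2000], images[2000:3000]

def preprocess(path: Path) -> np.ndarray:
    """Crop/resize to 224x224 RGB and subtract the image mean.
    The dehazing step mentioned in the patent is omitted here."""
    img = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    return img.astype(np.float32) - img.mean(axis=(0, 1))
```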
Step S2: construct the structure of the segmented deep convolutional network.
Further, step S2 includes:
Step S21: the network structure adopted by the invention is the encoder-decoder architecture that is popular in the semantic segmentation field, so the network can be divided into two parts.
Step S22: the encoder part is the main body of the network. It comprises three small parts: a rapid downsampling module, an information interaction module, and a dilation group. The architecture mainly draws on the learning-to-downsample part of Fast-SCNN, the inverted bottleneck residual block module of MobileNetV2, and the FFM (Feature Fusion Module) attention mechanism of the lightweight BiSeNet, all currently popular in the semantic segmentation field. A spatial channel and a context channel run in parallel: the spatial channel's high-resolution feature map, which retains semantic information, is fused with the context channel's low-resolution feature map, obtained by rapid downsampling to enlarge the receptive field. This arrangement improves network performance considerably.
Step S221: taking each small part in turn, the rapid downsampling module is composed of three convolutional layers of two forms: standard convolution and depthwise separable convolution. Depthwise separable convolution effectively reduces the parameter quantity of the model, thereby reducing the computational burden. All three layers employ 3×3 convolution kernels with a stride of 2, each followed by a BN layer and a ReLU activation function. The 224×224×3 picture passes through the first convolutional layer, Conv2D(3,3) with stride=2, giving a 112×112×32 feature map; this is input to the second convolutional layer, DWConv2D(3,3) with stride=2, giving a 56×56×64 feature map; and then to the third convolutional layer, DWConv2D(3,3) with stride=2, giving a 28×28×64 feature map.
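For concreteness, the following is a minimal PyTorch sketch of this downsampling module under the shapes just described. The class names are illustrative, and placing the BN/ReLU only after the pointwise convolution of each separable layer is a simplifying assumption, not a detail stated in the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class RapidDownsampling(nn.Module):
    """Step S221: 224x224x3 -> 112x112x32 -> 56x56x64 -> 28x28x64."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        self.dwconv2 = DepthwiseSeparableConv(32, 64)
        self.dwconv3 = DepthwiseSeparableConv(64, 64)

    def forward(self, x):
        return self.dwconv3(self.dwconv2(self.conv1(x)))

print(RapidDownsampling()(torch.randn(1, 3, 224, 224)).shape)  # [1, 64, 28, 28]
```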
Step S222: the structure of the information interaction module follows the inverted residual bottleneck block (inverted bottleneck residual block) of MobileNetV2 and is designed with the structure of Fig. 3. A more refined output is obtained through information interaction and fusion among feature maps of different scales. Moreover, the upsampling layers use bilinear interpolation, which requires no learnable parameters and thereby avoids the heavy computation that transposed convolution would bring. The 28×28×64 feature map obtained after step S221 is downsampled by three different factors through Conv2D(3,3) convolutional layers, while bilinear upsampling modules serve as the upsampling layers. Strides of 1, 2, and 4 are chosen according to how much the resolution is to be reduced, giving three feature maps (1), (2), (3) of different resolutions. Map (1) is then downsampled with convolutions of the corresponding strides to obtain three further feature maps (4), (5), (6) of different resolutions. Map (2) is upsampled 2× and, together with the high-resolution map (1), fused into map (4); map (3) is upsampled 4× and fused into map (4). Map (3) is also upsampled 2× and fused into map (5), and is fused directly with map (6) through a convolution block. Maps (5) and (6) are added, the result is added to map (4), and the output after the feature fusion module is a 28×28×64 feature map.
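The building block named here is defined in the MobileNetV2 paper; below is a minimal PyTorch sketch of such an inverted bottleneck residual block (cf. Fig. 3). The multi-resolution fusion wiring among maps (1)–(6) is omitted, and the expansion ratio of 6 is MobileNetV2's default rather than a value stated in the patent.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted bottleneck residual block:
    1x1 expansion -> 3x3 depthwise -> 1x1 linear projection,
    with a skip connection when stride == 1 and channels match."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # linear bottleneck: no activation
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return x + self.block(x) if self.use_skip else self.block(x)
```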
Step S223: the dilation group applies spatial dilated convolution to the feature-fused output of the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel so as to capture context information at more levels. The feature map obtained in step S222 is passed through dilated convolutions with dilation rates of 2, 4, and 8 respectively, enlarging its receptive field and producing three feature maps with different receptive fields; these are then added, and the resulting feature map has size 28×28×32.
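A minimal PyTorch sketch of such a dilation group follows, assuming three parallel 3×3 branches whose outputs are summed; the patent states the rates 2, 4, 8 and the 28×28×32 output, but the exact branch layout is our assumption.

```python
import torch
import torch.nn as nn

class DilationGroup(nn.Module):
    """Three parallel 3x3 dilated convolutions (rates 2, 4, 8), summed.
    With padding == dilation, each branch preserves the 28x28 spatial size."""
    def __init__(self, in_ch: int = 64, out_ch: int = 32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in (2, 4, 8))

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

print(DilationGroup()(torch.randn(1, 64, 28, 28)).shape)  # [1, 32, 28, 28]
```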
Step S23: for the decoder module, pixel-level classification is performed by a bilinear upsampling layer and a convolutional layer followed by a softmax layer. The feature map obtained in step S223 is passed through a bilinear upsampling module to reach a size of 224×224×32, and is then classified by a Conv2D convolutional layer plus a softmax layer, giving an output feature map of size 224×224×3.
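A minimal PyTorch sketch of this decoder is shown below. The 1×1 kernel of the classifier and the reading of the three output channels as face/hair/background are assumptions; the patent only specifies the 224×224×3 output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Bilinear upsampling to 224x224, then a convolution + softmax
    producing per-pixel probabilities over the 3 classes."""
    def __init__(self, in_ch: int = 32, num_classes: int = 3):
        super().__init__()
        self.classifier = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, x):
        x = F.interpolate(x, size=(224, 224), mode='bilinear', align_corners=False)
        return F.softmax(self.classifier(x), dim=1)

print(Decoder()(torch.randn(1, 32, 28, 28)).shape)  # [1, 3, 224, 224]
```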
Step S24: post-processing is performed on the output image. A post-processing mechanism can generally improve edge detail and texture fidelity while remaining highly consistent with the global information. The decoder output is post-processed with a guided filter to preserve face and hair edge detail and reduce noise; the guided filter effectively suppresses distortion and smooths edge contours, producing edges that look natural and comfortable.
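A sketch of guided-filter post-processing is given below, using the guidedFilter implementation from the opencv-contrib package (cv2.ximgproc); the radius and epsilon values are illustrative choices, not parameters from the patent.

```python
import cv2
import numpy as np

def refine_mask(rgb_image: np.ndarray, soft_mask: np.ndarray) -> np.ndarray:
    """Smooth a predicted probability mask (e.g. one channel of the decoder's
    softmax output) using the original RGB image as the guide, preserving
    face/hair edges while suppressing noise."""
    guide = rgb_image.astype(np.float32) / 255.0
    src = soft_mask.astype(np.float32)
    return cv2.ximgproc.guidedFilter(guide, src, 8, 1e-3)  # radius=8, eps=1e-3
```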
Step S3: train on the data using the network of step S2 to obtain the corresponding trained model.
Step S4: validate on the validation set, tune parameters, and select the optimal model.
Step S5: test the selected model on the test set and evaluate its performance. In the test stage, we first use an MTCNN model with high recall to extract the face ROI; then, because the surrounding environmental information helps somewhat in segmenting the background, the ROI is expanded by 0.8 times in both the horizontal and vertical directions. We use four metrics standard for fully convolutional networks (mIoU, fwIoU, pixelAcc, mPixelAcc) to evaluate the model's performance. Comparing the experimental results of the model with SOTA models such as VGG and U-Net shows that the network structure of the invention balances speed and performance well and achieves good results. Although its accuracy falls short of some SOTA networks in certain respects, it has a great advantage in speed, making it a lightweight network architecture that offers both speed and performance.
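The four metrics named above can be computed from a class confusion matrix in the standard FCN fashion; the following sketch is our illustration of those formulas, not code from the patent.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray) -> dict:
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(np.float64)   # correctly classified pixels per class
    gt = conf.sum(axis=1)                   # ground-truth pixels per class
    pred = conf.sum(axis=0)                 # predicted pixels per class
    iou = tp / np.maximum(gt + pred - tp, 1)
    freq = gt / conf.sum()
    return {
        'pixelAcc': tp.sum() / conf.sum(),              # overall pixel accuracy
        'mPixelAcc': (tp / np.maximum(gt, 1)).mean(),   # mean per-class accuracy
        'mIoU': iou.mean(),                             # mean intersection-over-union
        'fwIoU': (freq * iou).sum(),                    # frequency-weighted IoU
    }
```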
The above examples are preferred embodiments of the present invention, but embodiments of the invention are not limited thereto; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention is an equivalent replacement and falls within the protection scope of the invention.

Claims (3)

1. A method for segmenting face images based on semantic segmentation, comprising the steps of:
step S1: obtaining a corresponding image dataset for face segmentation through a series of operations;
step S2: constructing a segmented deep convolutional network structure;
step S3: training on the data using the network structure of step S2 to obtain a corresponding trained model;
step S4: validating and tuning parameters on a validation set, and selecting an optimal model;
step S5: testing the selected optimal model on a test set, and evaluating the performance of the model;
the step S2 further includes the steps of:
step S21: the deep convolutional network adopts an encoder-decoder architecture as its network structure, the network structure comprising an encoder module and a decoder module;
step S22: the encoder module comprises three parts, namely a rapid downsampling module, an information interaction module, and a dilation group;
step S23: the decoder module mainly comprises a bilinear upsampling layer and a convolutional layer, the convolutional layer being followed by a softmax layer for pixel-level classification;
step S24: post-processing the output of the decoder module with a guided filter to preserve face and hair edge details and reduce noise;
the step S22 further includes the steps of:
step S221: the rapid downsampling module consists of three convolutional layers of two forms, one being standard convolution and the other depthwise separable convolution; depthwise separable convolution effectively reduces the parameter quantity of the model, thereby reducing the computational burden; each convolutional layer is followed by a BN layer and uses a ReLU activation function;
step S222: the information interaction module adopts the inverted residual bottleneck block (inverted bottleneck residual block) of MobileNetV2, and a refined output is obtained through information interaction and fusion among feature maps of different scales;
step S223: the dilation group applies spatial dilated convolution to the feature-fused output of the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel, so that context information at more levels is captured.
2. The method for segmenting face images based on semantic segmentation according to claim 1, wherein said step S1 further comprises the steps of:
step S11: adopting the face recognition dataset Labeled Faces in the Wild (LFW) and dividing it proportionally into a training set, a validation set, and a test set.
3. The method of claim 1, wherein the series of operations in step S1 includes mean subtraction, dehazing, and cropping operations.
CN202010628571.5A 2020-07-02 2020-07-02 Method for segmenting face images based on semantic segmentation Active CN111899169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010628571.5A CN111899169B (en) Method for segmenting face images based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010628571.5A CN111899169B (en) Method for segmenting face images based on semantic segmentation

Publications (2)

Publication Number Publication Date
CN111899169A CN111899169A (en) 2020-11-06
CN111899169B (en) 2024-01-26

Family

ID=73191423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010628571.5A Active CN111899169B (en) Method for segmenting face images based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN111899169B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508032A (en) * 2021-01-29 2021-03-16 成都东方天呈智能科技有限公司 Face image segmentation method and segmentation network for context information of association
CN113096133A (en) * 2021-04-30 2021-07-09 佛山市南海区广工大数控装备协同创新研究院 Method for constructing semantic segmentation network based on attention mechanism
CN113269786B (en) * 2021-05-19 2022-12-27 青岛理工大学 Assembly image segmentation method and device based on deep learning and guided filtering
CN114267062B (en) * 2021-12-07 2022-12-16 合肥的卢深视科技有限公司 Training method of face analysis model, electronic equipment and storage medium
CN118154984A (en) * 2024-04-09 2024-06-07 山东财经大学 Method and system for generating non-supervision neighborhood classification superpixels by fusing guided filtering

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480707A (en) * 2017-07-26 2017-12-15 天津大学 A kind of deep neural network method based on information lossless pond
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN109543838A (en) * 2018-11-01 2019-03-29 浙江工业大学 A kind of image Increment Learning Algorithm based on variation self-encoding encoder
CN109801232A (en) * 2018-12-27 2019-05-24 北京交通大学 A kind of single image to the fog method based on deep learning
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110837683A (en) * 2019-05-20 2020-02-25 全球能源互联网研究院有限公司 Training and predicting method and device for prediction model of transient stability of power system
CN110895814A (en) * 2019-11-30 2020-03-20 南京工业大学 Intelligent segmentation method for aero-engine hole detection image damage based on context coding network
CN111080650A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Method for detecting looseness and loss faults of small part bearing blocking key nut of railway wagon

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144889B2 (en) * 2016-04-06 2021-10-12 American International Group, Inc. Automatic assessment of damage and repair costs in vehicles

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480707A (en) * 2017-07-26 2017-12-15 天津大学 A kind of deep neural network method based on information lossless pond
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN109543838A (en) * 2018-11-01 2019-03-29 浙江工业大学 A kind of image Increment Learning Algorithm based on variation self-encoding encoder
CN109801232A (en) * 2018-12-27 2019-05-24 北京交通大学 A kind of single image to the fog method based on deep learning
CN110837683A (en) * 2019-05-20 2020-02-25 全球能源互联网研究院有限公司 Training and predicting method and device for prediction model of transient stability of power system
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110895814A (en) * 2019-11-30 2020-03-20 南京工业大学 Intelligent segmentation method for aero-engine hole detection image damage based on context coding network
CN111080650A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Method for detecting looseness and loss faults of small part bearing blocking key nut of railway wagon

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Deep Convolutional Encoder-Decoders with Aggregated Multi-Resolution Skip Connections for Skin Lesion Segmentation";Ahmed H. Shahin,et al.,;《2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019)》;全文 *
"基于深度神经网络的车型识别设计与实现";石维康;《中国优秀硕士学位论文全文数据库 (信息科技辑)》(第2期);全文 *

Also Published As

Publication number Publication date
CN111899169A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111899169B (en) Method for segmenting face images based on semantic segmentation
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
KR102640237B1 (en) Image processing methods, apparatus, electronic devices, and computer-readable storage media
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN110276354B (en) High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
Liu et al. Robust single image super-resolution via deep networks with sparse prior
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111767979A (en) Neural network training method, image processing method, and image processing apparatus
CN111784623A (en) Image processing method, image processing device, computer equipment and storage medium
Singla et al. A review on Single Image Super Resolution techniques using generative adversarial network
CN112508960A (en) Low-precision image semantic segmentation method based on improved attention mechanism
US11769227B2 (en) Generating synthesized digital images utilizing a multi-resolution generator neural network
CN110751111A (en) Road extraction method and system based on high-order spatial information global automatic perception
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN116645592A (en) Crack detection method based on image processing and storage medium
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN117575915A (en) Image super-resolution reconstruction method, terminal equipment and storage medium
CN114445418A (en) Skin mirror image segmentation method and system based on convolutional network of multitask learning
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN117292017A (en) Sketch-to-picture cross-domain synthesis method, system and equipment
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant