CN111899169A - Network segmentation method of face image based on semantic segmentation - Google Patents
- Publication number
- CN111899169A (application number CN202010628571.5A)
- Authority
- CN
- China
- Prior art keywords
- module
- convolution
- network
- model
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4053—Super resolution, i.e. output image resolution higher than sensor resolution
- G06T3/4076—Super resolution, i.e. output image resolution higher than sensor resolution by iteratively correcting the provisional high resolution image using the original low-resolution image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
-
- G06T5/70—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/40—Analysis of texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20172—Image enhancement details
- G06T2207/20192—Edge enhancement; Edge preservation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a network segmentation method for face images based on semantic segmentation, comprising the following steps: obtain an image dataset; construct a segmentation deep convolutional network structure; train on the data with this network structure to obtain a trained model; verify and tune parameters with the verification set, and select the optimal model; and test the selected optimal model with the test set. The invention adopts a lightweight model that combines a spatial channel with a context-information channel: on the basis of the original spatial network structure, sub-networks are gradually added from high resolution to low resolution to form more stages, and the multi-resolution sub-networks are connected in parallel to form the information interaction module. Multi-scale fusion is then performed repeatedly, so that each representation, from high resolution to low resolution, repeatedly receives information from the other parallel representations, yielding rich high-resolution representations. Because parallel connections are used, a high-resolution representation is maintained throughout, so the prediction is spatially more accurate.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a network segmentation method for face images based on semantic segmentation.
Background
Thanks to the development and application of convolutional neural networks (CNNs), many tasks in the field of computer vision have advanced greatly. Image segmentation is one such task; its purpose is to divide and label an image according to the regions in which different targets lie. Semantic segmentation goes further: it labels the image at the pixel level, assigning each pixel its corresponding category. Because every pixel is considered, semantic segmentation is a form of dense prediction. Many approaches to semantic segmentation exist, such as patch classification, fully convolutional methods, and encoder-decoder architectures; the encoder-decoder architecture is currently the most popular, and this document also adopts it to design a deep convolutional network.
However, when designing semantic segmentation models, accuracy is often pursued excessively and complicated backbones are introduced, bringing a heavy computational burden and memory footprint. The complexity of the backbone network also makes such models difficult to deploy in practical applications. Solving this problem is therefore an important task in the current field of semantic segmentation; the aim is to balance the efficiency and speed of the segmentation network and to provide a simpler solution for multi-task segmentation.
At present, various beautification and makeup applications exist on the market. To process a particular part of the face, each part of the face must first be segmented, after which the different parts can be beautified and made up separately. The invention handles the segmentation of face images, mainly separating the facial region from the hair, preserving the edge between them, and denoising appropriately, so that the processed image looks more natural and softer.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a network segmentation method for face images based on semantic segmentation.
The purpose of the invention is realized by the following technical scheme:
A network segmentation method for face images based on semantic segmentation mainly comprises the following specific steps:
step S1: after a series of operations, a corresponding image data set for face segmentation is obtained.
Further, the step S1 further includes the following steps:
Step S11: adopt the face recognition dataset Labeled Faces in the Wild (LFW), and divide it into a training set, a verification set, and a test set in proportion.
Further, the series of operations in step S1 includes averaging, defogging, and cropping operations.
Step S2: and constructing a segmented deep convolutional network structure.
Further, the step S2 further includes the following steps:
step S21: the deep convolutional network adopts an encoder-decoder architecture as a network structure, and the network structure comprises an encoder module and a decoder module.
Step S22: the encoder module comprises three parts, namely a fast down-sampling module, an information interaction module and an expansion group.
Further, the step S22 further includes the following steps:
step S221: the fast down-sampling module consists of three convolution layers, and the adopted convolution layers have two forms, one is a standard convolution layer, and the other is a depth-separable convolution layer; the convolution layer with separable depth can effectively reduce the parameter quantity of the model, thereby reducing the calculation burden; after each convolutional layer there is a BN layer and a RELU activation function is used.
Step S222: the structure of the information interaction module adopts an Inverted residual bottleneck block (Inverted bottle residual block) of MobileNet V2, and beautiful output is obtained through information interaction and combination of feature graphs with different dimensions.
Step S223: the expansion group is used for performing spatial expansion convolution on the module subjected to feature fusion through the information interaction module, and the expansion convolution can increase the receptive field of a convolution kernel and capture more levels of context information.
Step S23: the decoder module mainly comprises a bilinear up-sampling layer and a convolution layer, wherein the convolution layer is followed by a softmax layer to classify the pixel level.
Step S24: post-processing the output of the decoder module, preserving details of face and hair edges by employing a guided filter, and reducing noise.
Step S3: train on the data using the network structure of step S2 to obtain a corresponding trained model;
Step S4: verify and tune parameters using the verification set, and select the optimal model;
Step S5: test the selected optimal model on the test set to evaluate its performance.
The working process and principle of the invention are as follows: the invention provides a method for segmenting a network based on a semantic segmentation face image, which adopts a lightweight model, adopts a combination of a space channel and a context information channel, gradually increases subnets from high resolution to low resolution on the basis of an original space network structure to form more stages, and connects the subnets with multiple resolutions in parallel to obtain the information interaction module. And then, multi-scale fusion is carried out for multiple times, so that each high-resolution to low-resolution representation repeatedly receives information from other parallel representations, and abundant high-resolution representations are obtained. Because parallel connections are used, a high resolution representation can be maintained, and thus the prediction is spatially more accurate.
Compared with the prior art, the invention also has the following advantages:
(1) The network segmentation method for face images based on semantic segmentation improves speed without reducing the performance of the network, and its efficiency is significantly higher than that of the prior art.
(2) With the image processed by the network, the hair and face regions can be obtained, after which operations such as hair dyeing can be performed according to the user's requirements; the operation is simple, convenient, and fast.
Drawings
FIG. 1 is a flow chart of an image segmentation method provided by the present invention.
Fig. 2 is a schematic structural diagram of the whole face image segmentation network provided by the present invention.
Fig. 3 is a schematic structural diagram of the inverted residual block of the MobileNetV2 network structure provided by the present invention.
Fig. 4 is a schematic diagram of a network structure provided by the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and examples.
Example 1:
As shown in Figs. 1 to 4, the present embodiment discloses a network segmentation method for face images based on semantic segmentation, which mainly includes the following specific steps:
step S1: after a series of operations, a corresponding image data set for face segmentation is obtained.
Further, the step S1 further includes the following steps:
Step S11: adopt the face recognition dataset Labeled Faces in the Wild (LFW), and divide it into a training set, a verification set, and a test set in proportion.
Further, the series of operations in step S1 includes averaging, defogging, and cropping operations.
Step S2: and constructing a segmented deep convolutional network structure.
Further, the step S2 further includes the following steps:
step S21: the deep convolutional network adopts an encoder-decoder architecture as a network structure, and the network structure comprises an encoder module and a decoder module.
Step S22: the encoder module comprises three parts, namely a fast down-sampling module, an information interaction module and an expansion group.
Further, the step S22 further includes the following steps:
step S221: the fast down-sampling module consists of three convolution layers, and the adopted convolution layers have two forms, one is a standard convolution layer, and the other is a depth-separable convolution layer; the convolution layer with separable depth can effectively reduce the parameter quantity of the model, thereby reducing the calculation burden; after each convolutional layer there is a BN layer and a RELU activation function is used.
Step S222: the structure of the information interaction module adopts an Inverted residual bottleneck block (Inverted bottle residual block) of MobileNet V2, and beautiful output is obtained through information interaction and combination of feature graphs with different dimensions.
Step S223: the expansion group is used for performing spatial expansion convolution on the module subjected to feature fusion through the information interaction module, and the expansion convolution can increase the receptive field of a convolution kernel and capture more levels of context information.
Step S23: the decoder module mainly comprises a bilinear up-sampling layer and a convolution layer, wherein the convolution layer is followed by a softmax layer to classify the pixel level.
Step S24: post-processing the output of the decoder module, preserving details of face and hair edges by employing a guided filter, and reducing noise.
Step S3: using the network structure of the step S2 to train data to obtain a corresponding training model;
step S4: carrying out verification and parameter adjustment by using the verification set, and selecting an optimal model;
step S5: and testing the selected optimal model by using a test set to evaluate the performance of the model.
The working process and principle of the invention are as follows: the invention provides a method for segmenting a network based on a semantic segmentation face image, which adopts a lightweight model, adopts a combination of a space channel and a context information channel, gradually increases subnets from high resolution to low resolution on the basis of an original space network structure to form more stages, and connects the subnets with multiple resolutions in parallel to obtain the information interaction module. And then, multi-scale fusion is carried out for multiple times, so that each high-resolution to low-resolution representation repeatedly receives information from other parallel representations, and abundant high-resolution representations are obtained. Because parallel connections are used, a high resolution representation can be maintained, and thus the prediction is spatially more accurate.
Example 2:
Referring to Figs. 1 to 4, the present embodiment discloses a network segmentation method for face images based on semantic segmentation, which includes the following steps:
in step S1, a corresponding image dataset for face segmentation is obtained.
Further, the step S1 includes:
Step S11: adopt the well-known face recognition dataset Labeled Faces in the Wild (LFW) for training. LFW is a popular face recognition dataset containing more than 13,000 face pictures. Its extended version, Part Labels, is used for training; the labels of this dataset were annotated with the superpixel segmentation algorithm included with it, so the data come pre-labeled. The dataset is then divided into a training set, validation set, and test set of 1500, 500, and 1000 images, respectively.
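The 1500/500/1000 partition described above can be sketched as follows; the file-name pattern and shuffling seed are hypothetical, and only the split sizes come from the text:

```python
import random

def split_dataset(paths, n_train=1500, n_val=500, n_test=1000, seed=0):
    """Shuffle file paths and partition them into the 1500/500/1000 split."""
    assert len(paths) >= n_train + n_val + n_test
    rng = random.Random(seed)          # fixed seed for a reproducible split
    paths = sorted(paths)
    rng.shuffle(paths)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Hypothetical file names; LFW contains more than 13,000 face pictures.
files = [f"img_{i:05d}.jpg" for i in range(13000)]
train, val, test = split_dataset(files)
print(len(train), len(val), len(test))
```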
A series of operations is then performed, including averaging, defogging, and cropping. The resulting pictures are 224 × 224 RGB inputs, ready for training the model.
Step S2, the structure of the segmented deep convolutional network is constructed.
Further, the step S2 includes:
In step S21, the network structure adopted by the invention is the encoder-decoder architecture that is popular in the current semantic segmentation field, so the network can be roughly divided into two parts.
Step S22: the encoder portion is the main body of this network. It comprises three parts: a fast down-sampling module, an information interaction module, and a dilation group. The architectures referred to include the down-sampling learning part of Fast-SCNN, the inverted bottleneck residual block module of MobileNetV2, and the feature fusion module (FFM) attention mechanism of the lightweight BiSeNet, which is currently popular in semantic segmentation. A spatial channel and a context channel are adopted in parallel: the high-resolution feature map of the spatial channel, which retains spatial detail, is fused with the low-resolution feature map of the context channel, obtained by rapid down-sampling to enlarge the receptive field; this fusion improves the performance of the network well.
In step S221, the fast down-sampling module consists of three convolutional layers of two forms: a standard convolutional layer and depthwise-separable convolutional layers. The depthwise-separable layers effectively reduce the number of model parameters, thereby reducing the computational load. All three layers use 3 × 3 kernels with stride 2, and each is followed by a BN layer and a ReLU activation function. A 224 × 224 × 3 picture passes through the first convolutional layer, Conv2D(3,3) with stride 2, giving a 112 × 112 × 32 feature map; the second layer, DWConv2D(3,3) with stride 2, gives 56 × 56 × 64; and the third layer, DWConv2D(3,3) with stride 2, gives 28 × 28 × 64.
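The feature-map sizes quoted above follow from the standard convolution output-size formula; a minimal sketch, assuming padding 1 for the 3 × 3 stride-2 layers (padding is not stated in the text):

```python
import math

def conv_out(size, kernel=3, stride=2, padding=1):
    # Standard convolution output-size formula:
    # floor((H + 2p - k) / s) + 1
    return math.floor((size + 2 * padding - kernel) / stride) + 1

size = 224
for layer in ("Conv2D", "DWConv2D", "DWConv2D"):
    size = conv_out(size)
    print(layer, size)   # 112, then 56, then 28
```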
In step S222, the structure of the information interaction module follows the inverted bottleneck residual block of MobileNetV2, designed as in Fig. 3. Through the information interaction and combination of feature maps of different dimensions, a more refined output is obtained. Moreover, because the up-sampling layers use bilinear interpolation, no learned parameters are required, which greatly reduces the heavy computation that transposed convolution would bring. The 28 × 28 × 64 feature map obtained in step S221 is sampled at three different scales; the convolutional layers are Conv2D(3,3), and the subsequent down-sampling layers use the same kernel. The up-sampling layers use bilinear up-sampling modules. The strides are chosen according to the required resolution reduction, namely 1, 2, and 4, giving three feature maps (1), (2), and (3) of different resolutions. Map (1) is then down-sampled by convolutions with strides of different scales to obtain three further feature maps (4), (5), and (6) of different resolutions. Map (2), after two-times up-sampling, is fused with the high-resolution map (1) into map (4), into which map (3) is also fused after four-times up-sampling. Map (3) is then up-sampled by a factor of two and merged into map (5), and map (3) is merged directly with map (6) through a convolution block. Maps (5) and (6) are added, the result is added to map (4), and after the feature fusion module the output is a 28 × 28 × 64 feature map.
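Bilinear up-sampling, as noted, needs no learned parameters. The following pure-Python sketch uses align-corners-style sampling (an assumption; real frameworks differ in how they map output to input coordinates):

```python
def bilinear_upsample(grid, out_h, out_w):
    """Bilinear interpolation of a 2-D grid, align_corners=True style."""
    in_h, in_w = len(grid), len(grid[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        y = i * sy
        y0 = min(int(y), in_h - 2) if in_h > 1 else 0
        dy = y - y0
        for j in range(out_w):
            x = j * sx
            x0 = min(int(x), in_w - 2) if in_w > 1 else 0
            dx = x - x0
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            # Weighted sum of the four surrounding samples.
            out[i][j] = (grid[y0][x0] * (1 - dy) * (1 - dx)
                         + grid[y0][x1] * (1 - dy) * dx
                         + grid[y1][x0] * dy * (1 - dx)
                         + grid[y1][x1] * dy * dx)
    return out

up = bilinear_upsample([[0.0, 3.0], [6.0, 9.0]], 4, 4)
# Corners are preserved; interior values interpolate linearly.
print(up[0][0], up[3][3], up[1][1])
```

Because the weights are fixed functions of position, nothing here is trained, in contrast to a transposed convolution whose kernel must be learned.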
In step S223, the dilation group performs spatial dilated convolution on the feature-fused output of the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel and captures context information at more levels. The feature map obtained after step S222 is convolved with dilation coefficients of 2, 4, and 8 respectively, increasing the receptive field and producing feature maps with three different receptive fields; these are then added, giving a feature map of size 28 × 28 × 32.
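The receptive-field growth from dilation can be checked with the usual formula: a k × k kernel with dilation d covers a window of width d·(k−1)+1. A minimal sketch, assuming the 3 × 3 kernels used elsewhere in the network:

```python
def effective_kernel(k, dilation):
    # A k x k kernel with dilation d spans a (d*(k-1)+1)-wide window.
    return dilation * (k - 1) + 1

for d in (2, 4, 8):   # the three dilation coefficients from the text
    print(d, effective_kernel(3, d))   # windows of width 5, 9, 17
```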
In step S23, in the decoder module, pixel-level classification is performed by a bilinear up-sampling layer and convolutional layers followed by a softmax layer. The feature map from step S223 is brought to size 224 × 224 × 32 by the bilinear up-sampling module; a Conv2D layer followed by a softmax layer then yields the output feature map of size 224 × 224 × 3.
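The pixel-level classification performed by the softmax layer amounts to a per-pixel argmax over class scores; a minimal sketch with a hypothetical 2 × 2 × 3 logit map (the real maps are 224 × 224 × 3):

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_pixels(logit_map):
    """logit_map: H x W x C logits -> H x W class indices (per-pixel argmax)."""
    return [[max(range(len(px)), key=lambda c, p=px: softmax(p)[c]) for px in row]
            for row in logit_map]

logits = [[[2.0, 0.5, 0.1], [0.0, 3.0, 0.2]],
          [[0.1, 0.1, 5.0], [1.0, 0.9, 0.8]]]
print(classify_pixels(logits))   # [[0, 1], [2, 0]]
```

Since softmax is monotonic, the argmax of the probabilities equals the argmax of the raw logits; the softmax matters at training time for the loss.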
In step S24, the output image is post-processed. Post-processing generally improves image edge detail and texture fidelity while maintaining a high degree of consistency with the global information. The decoder output is post-processed with a guided filter, which preserves details of the face and hair edges and reduces noise; the guided filter effectively suppresses distortion and smooths the edge contours, producing edges that are comfortable to look at.
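A guided filter smooths while preserving edges by fitting a local linear model of the output on the guide image. The following 1-D sketch is illustrative only (the patent applies a 2-D filter to images); the radius and eps values are assumptions:

```python
def box_mean(xs, r):
    """Mean over a window of radius r, clamped at the borders."""
    n = len(xs)
    out = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def guided_filter_1d(guide, src, r=2, eps=1e-3):
    """Guided filter on 1-D signals: output = a*guide + b, fit per window."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    corr_ip = box_mean([i * p for i, p in zip(guide, src)], r)
    corr_ii = box_mean([i * i for i in guide], r)
    # Local linear coefficients; eps regularizes flat regions toward a = 0.
    a = [(cip - mi * mp) / (cii - mi * mi + eps)
         for cip, cii, mi, mp in zip(corr_ip, corr_ii, mean_i, mean_p)]
    b = [mp - ai * mi for mp, ai, mi in zip(mean_p, a, mean_i)]
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    return [ma * i + mb for ma, mb, i in zip(mean_a, mean_b, guide)]

# In a flat region the local variance is zero, so a -> 0 and the filter
# simply averages: a constant signal passes through unchanged.
out = guided_filter_1d([1.0] * 8, [4.0] * 8)
print(out)
```

Near an edge in the guide, the local variance is large, so a stays close to 1 and the edge is carried through rather than blurred, which is the edge-preserving behavior the patent relies on.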
Step S3: train on the data using the network of step S2 to obtain a corresponding trained model.
Step S4: verify using the verification set, tune parameters, and select the optimal model.
Step S5: test the selected model on the test set to evaluate its performance. In the testing phase, the MTCNN model, which has high recall, is first used to extract the face ROI; the ROI is then enlarged by 0.8 times in both the horizontal and vertical directions, because the surrounding environmental information contributes somewhat to segmenting the background. Four metrics commonly used for fully convolutional networks (mIoU, fwIoU, pixelAcc, mPixelAcc) are used to evaluate the model. Comparing the experimental results with certain SOTA models, such as VGG and U-Net, shows that the proposed network balances speed and performance well and achieves good results. Although its accuracy falls short of some SOTA networks in some respects, it has a large advantage in speed and is a lightweight network architecture that delivers both speed and performance.
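The mIoU metric mentioned above averages the per-class intersection-over-union; a minimal sketch on a hypothetical 2 × 2 label map (classes absent from both prediction and ground truth are skipped):

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    pairs = list(zip([p for row in pred for p in row],
                     [g for row in gt for g in row]))
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        if union:                      # skip classes absent everywhere
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [[0, 0], [1, 1]]
gt   = [[0, 1], [1, 1]]
print(mean_iou(pred, gt, num_classes=2))
```

fwIoU is the same quantity with each class weighted by its pixel frequency instead of weighted equally; pixelAcc is simply the fraction of matching pixels.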
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (5)
1. A network segmentation method for a face image based on semantic segmentation, characterized by comprising the following steps:
step S1: obtaining a corresponding image data set for face segmentation after a series of operations;
step S2: constructing a segmented deep convolutional network structure;
step S3: train on the data using the network structure of step S2 to obtain a corresponding trained model;
step S4: verify and tune parameters using the verification set, and select the optimal model;
step S5: test the selected optimal model on the test set to evaluate its performance.
2. The network segmentation method for a face image based on semantic segmentation according to claim 1, wherein step S1 further comprises the following step:
step S11: adopting the face recognition dataset Labeled Faces in the Wild (LFW), and dividing it into a training set, a verification set, and a test set in proportion.
3. The network segmentation method for a face image based on semantic segmentation according to claim 1, wherein the series of operations in step S1 includes averaging, defogging, and cropping operations.
4. The network segmentation method for a face image based on semantic segmentation according to claim 1, wherein step S2 further comprises the following steps:
step S21: the deep convolutional network adopts an encoder-decoder architecture as a network structure, and the network structure comprises an encoder module and a decoder module;
step S22: the encoder module comprises three parts: a fast down-sampling module, an information interaction module, and a dilation group;
step S23: the decoder module mainly comprises a bilinear up-sampling layer and convolutional layers, with a softmax layer after the final convolutional layer performing pixel-level classification;
step S24: post-processing the output of the decoder module, using a guided filter to preserve details of the face and hair edges and to reduce noise.
5. The network segmentation method for a face image based on semantic segmentation according to claim 4, wherein step S22 further comprises the following steps:
step S221: the fast down-sampling module consists of three convolutional layers, which take two forms: standard convolutional layers and depthwise-separable convolutional layers; the depthwise-separable convolutional layers effectively reduce the number of model parameters, thereby reducing the computational burden; each convolutional layer is followed by a BN layer and a ReLU activation function;
step S222: the structure of the information interaction module adopts the inverted bottleneck residual block of MobileNetV2, and a refined output is obtained through the information interaction and combination of feature maps of different dimensions;
step S223: the dilation group performs spatial dilated convolution on the features fused by the information interaction module; dilated convolution enlarges the receptive field of the convolution kernel and captures context information at more levels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010628571.5A CN111899169B (en) | 2020-07-02 | 2020-07-02 | Method for segmenting network of face image based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111899169A true CN111899169A (en) | 2020-11-06 |
CN111899169B CN111899169B (en) | 2024-01-26 |
Family
ID=73191423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010628571.5A Active CN111899169B (en) | 2020-07-02 | 2020-07-02 | Method for segmenting network of face image based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111899169B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180260793A1 (en) * | 2016-04-06 | 2018-09-13 | American International Group, Inc. | Automatic assessment of damage and repair costs in vehicles |
CN107480707A (en) * | 2017-07-26 | 2017-12-15 | 天津大学 | A kind of deep neural network method based on information lossless pond |
CN108268870A (en) * | 2018-01-29 | 2018-07-10 | 重庆理工大学 | Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study |
CN109543502A (en) * | 2018-09-27 | 2019-03-29 | 天津大学 | A kind of semantic segmentation method based on the multiple dimensioned neural network of depth |
CN109543838A (en) * | 2018-11-01 | 2019-03-29 | 浙江工业大学 | A kind of image Increment Learning Algorithm based on variation self-encoding encoder |
CN109801232A (en) * | 2018-12-27 | 2019-05-24 | 北京交通大学 | A kind of single image to the fog method based on deep learning |
CN110837683A (en) * | 2019-05-20 | 2020-02-25 | 全球能源互联网研究院有限公司 | Training and predicting method and device for prediction model of transient stability of power system |
CN110188817A (en) * | 2019-05-28 | 2019-08-30 | 厦门大学 | A kind of real-time high-performance street view image semantic segmentation method based on deep learning |
CN110895814A (en) * | 2019-11-30 | 2020-03-20 | 南京工业大学 | Intelligent segmentation method for aero-engine hole detection image damage based on context coding network |
CN111080650A (en) * | 2019-12-12 | 2020-04-28 | 哈尔滨市科佳通用机电股份有限公司 | Method for detecting looseness and loss faults of small part bearing blocking key nut of railway wagon |
Non-Patent Citations (2)
Title |
---|
AHMED H. SHAHIN,ET AL.,: ""Deep Convolutional Encoder-Decoders with Aggregated Multi-Resolution Skip Connections for Skin Lesion Segmentation"", 《2019 IEEE 16TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2019)》 * |
SHI WEIKANG: "Design and Implementation of Vehicle Model Recognition Based on Deep Neural Networks", China Master's Theses Full-text Database (Information Science and Technology), no. 2 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112508032A (en) * | 2021-01-29 | 2021-03-16 | 成都东方天呈智能科技有限公司 | Face image segmentation method and segmentation network for context information of association |
CN113096133A (en) * | 2021-04-30 | 2021-07-09 | 佛山市南海区广工大数控装备协同创新研究院 | Method for constructing semantic segmentation network based on attention mechanism |
CN113269786A (en) * | 2021-05-19 | 2021-08-17 | 青岛理工大学 | Assembly image segmentation method and device based on deep learning and guided filtering |
CN114267062A (en) * | 2021-12-07 | 2022-04-01 | 北京的卢深视科技有限公司 | Model training method, electronic device, and computer-readable storage medium |
CN114267062B (en) * | 2021-12-07 | 2022-12-16 | 合肥的卢深视科技有限公司 | Training method of face analysis model, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111899169B (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102640237B1 (en) | Image processing methods, apparatus, electronic devices, and computer-readable storage media | |
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
Bashir et al. | A comprehensive review of deep learning-based single image super-resolution | |
CN111899169B (en) | Method for segmenting network of face image based on semantic segmentation | |
US20190220746A1 (en) | Image processing method, image processing device, and training method of neural network | |
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation | |
CN110276354B (en) | High-resolution streetscape picture semantic segmentation training and real-time segmentation method | |
Daihong et al. | Multi-scale generative adversarial network for image super-resolution | |
CN110223304B (en) | Image segmentation method and device based on multipath aggregation and computer-readable storage medium | |
CN112508960A (en) | Low-precision image semantic segmentation method based on improved attention mechanism | |
Singla et al. | A review on Single Image Super Resolution techniques using generative adversarial network | |
CN112884668A (en) | Lightweight low-light image enhancement method based on multiple scales | |
CN111353544A (en) | Improved Mixed Pooling-Yolov 3-based target detection method | |
Wang et al. | 4k-nerf: High fidelity neural radiance fields at ultra high resolutions | |
CN115131797A (en) | Scene text detection method based on feature enhancement pyramid network | |
Zhang et al. | Dynamic selection of proper kernels for image deblurring: a multistrategy design | |
US20230153965A1 (en) | Image processing method and related device | |
CN115512100A (en) | Point cloud segmentation method, device and medium based on multi-scale feature extraction and fusion | |
CN115115860A (en) | Image feature point detection matching network based on deep learning | |
CN114445418A (en) | Skin mirror image segmentation method and system based on convolutional network of multitask learning | |
US20230053588A1 (en) | Generating synthesized digital images utilizing a multi-resolution generator neural network | |
CN113012045B (en) | Generation countermeasure network for synthesizing medical image | |
Zhang et al. | Perception-oriented single image super-resolution network with receptive field block | |
Lin et al. | Image super-resolution by estimating the enhancement weight of self example and external missing patches | |
CN113947530B (en) | Image redirection method based on relative saliency detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||