CN115457263A - Lightweight portrait segmentation method based on deep learning

Lightweight portrait segmentation method based on deep learning

Info

Publication number
CN115457263A
Authority
CN
China
Prior art keywords
layer
input
output
convolution
normalization layer
Prior art date
Legal status
Pending
Application number
CN202210998603.XA
Other languages
Chinese (zh)
Inventor
林镇源
张文辉
谢胜勇
蒋小莲
许轩瑞
刘晨旭
Current Assignee
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202210998603.XA
Publication of CN115457263A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight portrait segmentation method based on deep learning, which segments portraits with a lightweight model that combines SegFormer with a MobileNetV3 backbone and a fused coordinate attention (CA) mechanism. The model replaces the SegFormer encoder with a MobileNetV3 network and improves MobileNetV3 by adding a multilayer CA attention mechanism, which raises the accuracy of portrait segmentation; replacing the SE module of the original network with CA accelerates fitting during training, reduces the model size, suppresses redundant features and improves model precision, finally outputting an accurate semantic segmentation image. Compared with other lightweight networks, the SegFormer-MobileNetV3 network achieves more accurate segmentation and higher segmentation efficiency. The invention can be applied to portrait image processing such as portrait matting, online meetings and background replacement.

Description

Lightweight portrait segmentation method based on deep learning
Technical Field
The invention relates to the technical field of image segmentation, in particular to a lightweight portrait segmentation method based on deep learning.
Background
Image segmentation is an important problem in computer vision and has attracted considerable attention in fields such as medicine, agriculture and the military. Image segmentation divides an image into several sub-regions according to intrinsic properties of the image, such as pixel gray values, so that each region has distinctive characteristics and contrasts clearly with the others. Portrait segmentation is an important branch of image segmentation; it is widely applied in automatic driving, pedestrian detection, intelligent search and rescue, medical imaging and other fields, where it plays an extremely important role. Traditional portrait segmentation methods mainly use low-level, visual-layer semantic information such as the shape and color of the image, and threshold-based and edge-based segmentation are typical traditional methods. Threshold-based segmentation classifies pixels by comparing the gray value of each image pixel with a computed gray threshold; if the difference between the portrait pixels to be separated and the background pixels is small, edge boundary information is lost. Edge-based segmentation applies different edge detection operators, but image noise strongly affects these operators, so the approach is only suitable for low-noise images. Because traditional methods suffer from segmentation noise, rough edges and similar problems and struggle with fine segmentation tasks, researchers have proposed portrait semantic segmentation techniques based on deep learning.
Disclosure of Invention
The invention aims to solve the problem that the existing portrait segmentation method is low in precision, and provides a lightweight portrait segmentation method based on deep learning.
In order to solve the problems, the invention is realized by the following technical scheme:
a lightweight portrait segmentation method based on deep learning comprises the following steps:
step 1, constructing a lightweight portrait segmentation model based on deep learning;
the lightweight portrait segmentation model based on deep learning is composed of an input layer, a convolution coordinate attention layer, 3 block units, 4 multilayer sensing layers, a convolution normalization and activation layer, a convolution layer and an output layer; inputting the input layer as the input of a lightweight portrait segmentation model based on deep learning; the output of the input layer is connected with the input of the convolution coordinate attention layer; one output of the convolution coordinate attention layer is connected with the input of the first block unit, and the other output of the convolution coordinate attention layer is connected with the input of the fourth multilayer sensing layer; one output of the first block unit is connected with the input of the second block unit, and the other output of the first block unit is connected with the input of the third multilayer sensing layer; one output of the second block unit is connected with the input of the third block unit, and the other output of the second block unit is connected with the input of the second multilayer sensing layer; the output of the third block unit is connected with the input of the first multilayer sensing layer; the outputs of the 4 multilayer sensing layers are simultaneously connected with the inputs of the convolution normalization and activation layers, and the outputs of the convolution normalization and activation layers are connected with the inputs of the convolution layers; the output of the convolution layer is connected with the input of the output layer; the output of the output layer is used as the output of a lightweight portrait segmentation model based on deep learning;
step 2, training the lightweight portrait segmentation model based on deep learning constructed in the step 1 by using the segmented sample image set to obtain a trained lightweight portrait segmentation model based on deep learning;
step 3, sending the image to be segmented into the trained lightweight portrait segmentation model based on deep learning obtained in step 2, and outputting the segmented picture by the trained lightweight portrait segmentation model based on deep learning.
In the scheme, the convolution coordinate attention layer consists of a convolution layer and a coordinate attention layer; the input of the convolutional layer is used as the input of the convolutional coordinate attention layer, the output of the convolutional layer is connected with the input of the coordinate attention layer, and the output of the coordinate attention layer is used as the output of the convolutional coordinate attention layer.
In the scheme, the first block unit consists of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer; the input of a first convolution normalization layer is used as the input of a first block unit, the output of the first convolution normalization layer is connected with the input of a second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of a third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer; the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer; and the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the first block unit.
In the scheme, the second block unit consists of 12 convolution normalization layers, 4 fusion layers and 1 coordinate attention layer; the input of a first convolution normalization layer is used as the input of a second block unit, the output of the first convolution normalization layer is connected with the input of a second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of a third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer; the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer; the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, the output of the eighth convolution normalization layer is connected with the input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is connected with the input of the tenth convolution normalization layer; the output of the tenth convolution normalization layer is connected with the input of the eleventh convolution normalization layer, one output of the eleventh convolution normalization layer is connected with one input of the twelfth convolution normalization layer, the other output of the eleventh convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the twelfth convolution normalization layer, the output of the twelfth convolution normalization layer is connected with one input of the fourth fusion layer, the other input of the fourth fusion layer is connected with the input of the tenth convolution normalization layer, and the output of the fourth fusion layer is used as the output of the second block unit.
In the scheme, the third block unit consists of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer; the input of the first convolution normalization layer is used as the input of the third block unit, the output of the first convolution normalization layer is connected with the input of the second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of the third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer; the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer; the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the third block unit.
In the scheme, the convolution normalization layer consists of a convolution layer and a BN normalization layer; the input of the convolution layer is used as the input of the convolution normalization layer, the output of the convolution layer is connected with the input of the BN normalization layer, and the output of the BN normalization layer is used as the output of the convolution normalization layer.
In the scheme, each multilayer perceptron layer consists of a linear layer and an LN normalization layer; the input of the linear layer is used as the input of the multilayer perceptron layer, the output of the linear layer is connected with the input of the LN normalization layer, and the output of the LN normalization layer is used as the output of the multilayer perceptron layer.
In the scheme, the convolution normalization and activation layer consists of a convolution layer, a BN normalization layer and a ReLu activation layer; the input of the convolution layer is used as the input of the convolution normalization and activation layer, the output of the convolution layer is connected with the input of the BN normalization layer, the output of the BN normalization layer is connected with the input of the ReLu activation layer, and the output of the ReLu activation layer is used as the output of the convolution normalization and activation layer.
Compared with the prior art, the invention provides a MobileNetV3 lightweight portrait segmentation algorithm based on SegFormer and a fused coordinate attention mechanism to realize portrait segmentation. The SegFormer encoder is replaced with a MobileNetV3 network; at the same time, the MobileNetV3 network is improved by adding a multilayer Coordinate Attention (CA) mechanism, which improves the accuracy of portrait segmentation. The SE module in the original network is replaced with CA, which accelerates fitting during training, reduces the model size, suppresses redundant features and improves model precision, finally outputting an accurate semantic segmentation image. Compared with other lightweight networks, the SegFormer-MobileNetV3 network has a more accurate segmentation effect and higher segmentation efficiency. The invention can be applied to portrait image processing such as portrait matting, online meetings and background replacement.
Drawings
FIG. 1 is a structural diagram of a SegFormer.
FIG. 2 is a MobileNetV3 block structure diagram.
FIG. 3 is a diagram of the Coordinate Attention structure.
FIG. 4 is a structural diagram of a lightweight portrait segmentation model SegFormer-MobileNet V3 based on deep learning.
Fig. 5 is a first block unit configuration diagram.
Fig. 6 is a second block unit configuration diagram.
Fig. 7 is a diagram showing the structure of a third block unit.
Fig. 8 shows a first set of effect maps (Matting) of the segmentation mask for different algorithms.
Fig. 9 is a second set of effect maps (EG 1800) for different algorithm split masks.
FIG. 10 is a third set of effect graphs (P3M-10K) for different algorithm segmentation masks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
A lightweight portrait segmentation method based on deep learning specifically comprises the following steps:
step 1, constructing a lightweight portrait segmentation model based on deep learning.
The SegFormer network consists of an Encoder and a Decoder, as shown in fig. 1. The Transformer module within the MiT encoder uses an Overlap Patch Embedding (OPE) structure to extract features and downsample the input image. The resulting features are fed into an Efficient Multi-head Self-Attention (EMSA) layer and a mixed feed-forward (MixFFN) layer. A standard convolutional layer is used for the overlapping patch embedding calculation. After the two-dimensional features are spatially flattened into one-dimensional features, they are input into the EMSA layer for self-attention calculation and feature enhancement. To replace the positional encoding of the ordinary Transformer, a 3 × 3 two-dimensional convolution layer is added between the two linear transformation layers of the ordinary feed-forward layer to fuse spatial position information. The Transformer block stacks multiple EMSA and MixFFN layers to deepen the network and extract rich detail and semantic features, and self-attention is computed in the EMSA at each scale. Although the SegFormer network can integrate information across all scales and keep the self-attention mechanism at each scale purer than other conventional networks based on convolutional neural networks, the existing SegFormer network suffers from an oversized model and insufficient segmentation precision.
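As an illustration of the MixFFN idea described above (a 3 × 3 convolution inserted between the two linear layers of the feed-forward block so that positional information is fused spatially), a minimal sketch is given below. The patent's experiments use PaddlePaddle; this PyTorch reconstruction, the use of a depthwise 3 × 3 convolution and the chosen dimensions are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Feed-forward layer with a 3x3 convolution between its two linear
    projections, used instead of explicit positional encodings
    (illustrative sketch; depthwise conv and sizes are assumptions)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens flattened from an (h, w) feature map
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)  # restore 2-D layout for the conv
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)           # flatten back to tokens
        return self.fc2(self.act(x))
```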
The MobileNet series is widely applied to downstream image processing tasks because of its fast detection and high accuracy, and has become representative of lightweight networks. MobileNetV3 was published in 2019; it combines the depthwise separable convolution of MobileNetV1 with the Inverted Residuals and Linear Bottleneck of MobileNetV2 together with SE modules, and uses NAS (neural architecture search) to search the configuration and parameters of the network. The main innovation of MobileNetV3 is the introduction of an SE module into the MobileNetV2 block, where the SE module is a lightweight channel attention module. After the depthwise convolution, the features pass through a pooling layer and a first fc layer that reduces the number of channels by a factor of 4; after a second fc layer the number of channels is transformed back (expanded by a factor of 4), and the result is multiplied with the depthwise output, as shown in fig. 2.
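The SE module described above can be sketched as follows; this is an illustrative PyTorch reconstruction of the standard squeeze-and-excitation block (reduction factor 4, hard-sigmoid gate), not code taken from the patent.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation block of the MobileNetV3 bottleneck: global
    pooling, an fc layer that shrinks the channels by 4x, a second fc layer
    that restores them, and channel-wise rescaling of the depthwise output."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.gate = nn.Hardsigmoid()

    def forward(self, x):
        w = self.pool(x)               # squeeze: (B, C, 1, 1)
        w = self.relu(self.fc1(w))     # reduce channels by a factor of 4
        w = self.gate(self.fc2(w))     # restore channels, gate to [0, 1]
        return x * w                   # excite: rescale the input feature map
```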
Coordinate Attention (CA) is a new attention module designed for lightweight networks, proposed by Qibin Hou et al. of the National University of Singapore, as shown in fig. 3. It encodes channel relationships and long-range dependence with precise position information and is divided into two steps: coordinate information embedding and coordinate attention generation. To enable the attention module to capture remote spatial interactions with precise location information, Qibin Hou et al. decomposed global pooling according to the following formula into a pair of one-dimensional feature encoding operations:
z_c = (1/(H × W)) · Σ_{i=1..H} Σ_{j=1..W} x_c(i, j)    (1)
given an input X, each channel is first encoded along a horizontal and vertical coordinate, respectively, using a posing kernel of size (H, 1) or (1, W). Thus, the output of the c-th channel with height h can be expressed as:
z_c^h(h) = (1/W) · Σ_{0 ≤ i < W} x_c(h, i)    (2)
Similarly, the output of the c-th channel at width w can be written as:
z_c^w(w) = (1/H) · Σ_{0 ≤ j < H} x_c(j, w)    (3)
These two transformations aggregate features along the two spatial directions respectively, yielding a pair of direction-aware feature maps. They also allow the attention module to capture long-range dependencies along one spatial direction while preserving precise position information along the other, which helps the network locate the subject in the image more accurately. Experiments verify that the CA module adds very little computational overhead and does not increase the final size of the model.
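Formulas (1)-(3) together with the attention-generation step translate into a small module. The following PyTorch sketch follows the published Coordinate Attention design of Hou et al.; the reduction ratio and the intermediate activation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: pool along H and W separately, share a 1x1 conv
    over the concatenated descriptors, then split into two direction-aware
    attention maps that rescale the input (illustrative sketch)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                          # encode along the height
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # encode along the width
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        return x * a_h * a_w
```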
Based on the above analysis, the overall network architecture of the proposed lightweight portrait segmentation model SegFormer-MobileNetV3 based on deep learning is shown in figure 4 and comprises an input layer, a convolution coordinate attention layer, 3 block units, 4 multilayer perceptron (MLP) layers, a convolution normalization and activation layer, a convolution layer and an output layer; the input of the input layer serves as the input of the lightweight portrait segmentation model based on deep learning; the output of the input layer is connected with the input of the convolution coordinate attention layer; one output of the convolution coordinate attention layer is connected with the input of the first block unit, and the other output of the convolution coordinate attention layer is connected with the input of the fourth multilayer perceptron layer; one output of the first block unit is connected with the input of the second block unit, and the other output of the first block unit is connected with the input of the third multilayer perceptron layer; one output of the second block unit is connected with the input of the third block unit, and the other output of the second block unit is connected with the input of the second multilayer perceptron layer; the output of the third block unit is connected with the input of the first multilayer perceptron layer; the outputs of the 4 multilayer perceptron layers are simultaneously connected with the inputs of the convolution normalization and activation layer, and the output of the convolution normalization and activation layer is connected with the input of the convolution layer; the output of the convolution layer is connected with the input of the output layer; and the output of the output layer serves as the output of the lightweight portrait segmentation model based on deep learning.
The convolution coordinate attention layer consists of a convolution layer and a coordinate attention layer; the input of the convolutional layer is used as the input of the convolutional coordinate attention layer, the output of the convolutional layer is connected with the input of the coordinate attention layer, and the output of the coordinate attention layer is used as the output of the convolutional coordinate attention layer.
The first block unit consists of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer; as shown in fig. 5. The input of the first convolution normalization layer is used as the input of the first block unit, the output of the first convolution normalization layer is connected with the input of the second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of the third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of the first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of the fourth convolution normalization layer. The output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer. And the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the first block unit.
The second block unit consists of 12 convolution normalization layers, 4 fusion layers and 1 coordinate attention layer; as shown in fig. 6. The input of the first convolution normalization layer is used as the input of the second block unit, the output of the first convolution normalization layer is connected with the input of the second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of the third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of the first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of the fourth convolution normalization layer. The output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer. The output of the seventh convolution normalization layer is connected to the input of the eighth convolution normalization layer, the output of the eighth convolution normalization layer is connected to the input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected to one input of the third fusion layer, the other input of the third fusion layer is connected to the input of the seventh convolution normalization layer, and the output of the third fusion layer is connected to the input of the tenth convolution normalization layer. And the output of the tenth convolution normalization layer is connected with the input of the eleventh convolution normalization layer, one output of the eleventh convolution normalization layer is connected with one input of the twelfth convolution normalization layer, the other output of the eleventh convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the twelfth convolution normalization layer, the output of the twelfth convolution normalization layer is connected with one input of the fourth fusion layer, the other input of the fourth fusion layer is connected with the input of the tenth convolution normalization layer, and the output of the fourth fusion layer is used as the output of the second block unit.
The third block unit consists of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer; as shown in fig. 7. The input of the first convolution normalization layer is used as the input of the third block unit, the output of the first convolution normalization layer is connected with the input of the second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of the third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of the first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of the fourth convolution normalization layer. The output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer. The output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the third block unit.
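Read as code, each block unit stacks residual groups of three convolution normalization layers, with the last group gated by the coordinate attention layer before its final convolution and fusion. The sketch below covers the first and third block units (nine convolution normalization layers, three fusions, one coordinate attention layer); it is an illustrative PyTorch reconstruction in which the constant channel width and the interpretation of the coordinate attention layer as gating the eighth layer's output before the ninth layer are assumptions. The second block unit would simply add one more three-layer residual group.

```python
import torch
import torch.nn as nn

def conv_bn(cin, cout, k=3, s=1):
    # convolution normalization layer: convolution followed by BN
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
    )

class BlockUnit(nn.Module):
    """First/third block unit: 9 conv-BN layers, 3 additive fusion layers
    and 1 coordinate attention layer (illustrative sketch; channel widths
    are assumptions)."""
    def __init__(self, channels, ca_layer):
        super().__init__()
        self.cbn = nn.ModuleList([conv_bn(channels, channels) for _ in range(9)])
        self.ca = ca_layer  # e.g. CoordinateAttention(channels) from the sketch above

    def forward(self, x):
        # first residual group: conv-BN x3, fused (added) with the group input
        y = self.cbn[2](self.cbn[1](self.cbn[0](x))) + x
        # second residual group: conv-BN x3, fused with the group input
        z = self.cbn[5](self.cbn[4](self.cbn[3](y))) + y
        # third group: conv-BN x2, the coordinate attention layer gates the
        # 8th output, then the 9th conv-BN and a final fusion close the unit
        t = self.cbn[7](self.cbn[6](z))
        return self.cbn[8](self.ca(t)) + z
```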
The convolution normalization layer consists of a convolution layer and a BN normalization layer; the input of the convolution layer is used as the input of the convolution normalization layer, the output of the convolution layer is connected with the input of the BN normalization layer, and the output of the BN normalization layer is used as the output of the convolution normalization layer.
Each multilayer perceptron layer consists of a linear layer and an LN normalization layer; the input of the linear layer is used as the input of the multilayer perceptron layer, the output of the linear layer is connected with the input of the LN normalization layer, and the output of the LN normalization layer is used as the output of the multilayer perceptron layer.
The convolution normalization and activation layer consists of a convolution layer, a BN normalization layer and a ReLu activation layer; the input of the convolution layer is used as the input of the convolution normalization and activation layer, the output of the convolution layer is connected with the input of the BN normalization layer, the output of the BN normalization layer is connected with the input of the ReLu activation layer, and the output of the ReLu activation layer is used as the output of the convolution normalization and activation layer.
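These auxiliary layers map to very small modules; the following is an illustrative PyTorch sketch (the patent's experiments use PaddlePaddle), with dimensions left as parameters:

```python
import torch.nn as nn

class MLPLayer(nn.Module):
    """Multilayer perceptron layer of the feature pyramid: a linear
    projection followed by LayerNorm (illustrative sketch)."""
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):          # x: (B, N, C) flattened feature map
        return self.norm(self.proj(x))

def conv_bn_relu(cin, cout, k=1):
    """Convolution normalization and activation layer: conv + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )
```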
The overall architecture of the deep-learning-based lightweight portrait segmentation model SegFormer-MobileNetV3 is divided into two parts: Encoder and Decoder. The Encoder pre-processes the image with ConvCABlock and then passes the feature map to MobileNetV3 and to the fourth layer of the feature pyramid. MobileNetV3 is divided into three blocks, each composed of several residual units, and the output of each block becomes the input of the feature pyramid and of the next block. In the feature pyramid, an improved MLP layer serves as the basic module of the network; the output of each pyramid level is upsampled by bilinear interpolation to 1/4 of the original image size, the outputs are mixed together and passed through a ConvBNReLu layer followed by a conv1 × 1 layer, and bilinear interpolation finally produces a feature map of the same size as the input image as the prediction result.
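The decoding path just described (per-stage MLP projection, bilinear upsampling to 1/4 resolution, fusion through ConvBNReLu, a 1 × 1 convolution and a final upsampling to the input size) can be sketched as below; this is an illustrative PyTorch function whose module arguments (mlps, fuse, classifier) are assumed to be built from the building blocks sketched earlier.

```python
import torch
import torch.nn.functional as F

def decode_head(features, mlps, fuse, classifier, out_size):
    """Feature-pyramid decoder: project each stage with its MLP layer,
    upsample every stage to 1/4 of the input resolution, fuse them with a
    ConvBNReLU layer, classify with a 1x1 conv and upsample the logits to
    the input size (illustrative sketch)."""
    target = (out_size[0] // 4, out_size[1] // 4)
    outs = []
    for f, mlp in zip(features, mlps):
        b, c, h, w = f.shape
        t = mlp(f.flatten(2).transpose(1, 2))        # (B, N, embed_dim) tokens
        t = t.transpose(1, 2).reshape(b, -1, h, w)   # back to a 2-D feature map
        outs.append(F.interpolate(t, size=target,
                                  mode="bilinear", align_corners=False))
    x = fuse(torch.cat(outs, dim=1))                 # ConvBNReLU fusion layer
    logits = classifier(x)                           # 1x1 convolution layer
    return F.interpolate(logits, size=out_size,
                         mode="bilinear", align_corners=False)
```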
After an image is input, ConvCABlock first applies a conv3 × 3 layer, so that the model captures global feature information at the outset and can determine the main features it needs to focus on. To give the module a stronger ability to capture global features, the channel attention (SEModule) inside the block is replaced, compared with the previous MobileNetV3, by the Coordinate Attention mechanism, and this attention mechanism forms the most basic MobileNetV3 unit; experiments finally verify that this further compresses the model size and thus guarantees the final lightweight model. For better feature fusion, each MLP layer of the feature pyramid uses a Linear layer for feature extraction and a LayerNorm layer (formula (4)) to normalize the output feature map; LayerNorm smooths the magnitude relationship between different samples while preserving the magnitude relationship between different features, and is used to reduce the internal covariate shift (ICS) phenomenon. A GELU function (formula (5)) is adopted for activation; the Gaussian error linear unit adds the idea of stochastic regularization to the activation, makes up for the shortcomings of ReLU6's nonlinearity, and can effectively avoid the vanishing gradient problem.
y = ((x − μ) / √(σ² + ε)) · γ + β, where μ and σ² are the mean and variance over the features and γ, β are learnable parameters    (4)
GELU(x) = x · Φ(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))    (5)
SegFormer-MobileNetV3 adopts the cross-entropy loss commonly used in deep-learning image semantic segmentation as its training loss function. The cross-entropy loss curve is monotonic, and a larger loss yields a larger gradient, which speeds up optimization during backpropagation and thus helps ensure a good model evaluation result. For the binary problem, the corresponding loss formula is as follows:
L = -(y log P + (1 - y) log(1 - P))    (6)
where y is the label and P is the probability of correct prediction.
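A minimal sketch of this loss is given below; the per-pixel cross-entropy call is the standard PyTorch routine, and binary_ce is a direct transcription of formula (6) added only for illustration.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, labels):
    # logits: (B, 2, H, W) raw class scores; labels: (B, H, W) long tensor in {0, 1}
    return F.cross_entropy(logits, labels)

def binary_ce(p, y, eps=1e-7):
    # direct transcription of formula (6): L = -(y*log P + (1 - y)*log(1 - P))
    p = p.clamp(eps, 1 - eps)
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
```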
Step 2, training the lightweight portrait segmentation model based on deep learning constructed in step 1 using the segmented sample image set to obtain the trained lightweight portrait segmentation model based on deep learning.
Step 3, sending the image to be segmented into the trained lightweight portrait segmentation model based on deep learning obtained in step 2; the trained model outputs the segmented picture.
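Step 3 amounts to a single forward pass; a minimal sketch, assuming a model that outputs two-class logits at the input resolution:

```python
import torch

@torch.no_grad()
def segment(model, image_tensor):
    """Feed a preprocessed (1, 3, 512, 512) tensor to the trained model and
    return the per-pixel mask (0 = background, 1 = portrait); illustrative."""
    model.eval()
    logits = model(image_tensor)     # (1, 2, 512, 512)
    return logits.argmax(dim=1)      # (1, 512, 512) segmentation mask
```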
The performance of the invention is illustrated below by means of specific examples:
the experimental hardware platform is 1 Tesla V100 GPU with 32GB video memory to a strong Intel processor with 24 cores. The software environments are ubuntu16.04, python3.7.4, paddlepaddle2.3.1. The optimal hyper-parameters of the algorithm of the invention are shown in table 1. In table 1, the variable Epoch represents the number of training iterations; batchsize represents the number of pictures input to the model per round of training; the optimizer represents the network parameters used to adjust the parameters that affect the model training; lr represents a learning rate; factor represents the rate of decay of the learning rate when the loss function no longer falls; the Embed _ dim is the number of output channels and influences the size of the channels of the intermediate layer.
TABLE 1 optimal hyper-parameters of the algorithm of the present invention
The training data set was collected by crawling portrait busts from the web and manually annotating mask labels, yielding 18509 portraits. To meet the data-volume requirements of the MixVisionTransformer in SegFormer, public portrait data sets such as Matting, Supervisely, P3M-10K and EG1800 were also collected; the final portrait training set contains 88229 images, and the validation set consists of 5811 Matting images, 425 EG1800 images and 1000 P3M-10K images, 7236 images in total. Because the collected pictures differ in size, a reshape function resizes the training and validation pictures to [512, 512].
The training set is augmented with random horizontal flipping, random image distortion, random Gaussian blur and other transforms. This increases the diversity of the training set, makes the input images more complex, ensures the model can adapt to more complex environments, and improves its robustness.
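A minimal sketch of this preprocessing and augmentation, assuming OpenCV arrays and with all probabilities and distortion ranges chosen arbitrarily for illustration (the patent does not specify them):

```python
import random
import cv2
import numpy as np

def augment(image, mask, size=(512, 512)):
    """Resize to 512x512 and apply random horizontal flip, random image
    distortion and random Gaussian blur (illustrative sketch)."""
    image = cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)
    if random.random() < 0.5:        # random horizontal flip
        image, mask = image[:, ::-1].copy(), mask[:, ::-1].copy()
    if random.random() < 0.5:        # random photometric distortion
        image = np.clip(image.astype(np.float32) * random.uniform(0.8, 1.2)
                        + random.uniform(-10, 10), 0, 255).astype(np.uint8)
    if random.random() < 0.3:        # random Gaussian blur
        image = cv2.GaussianBlur(image, (5, 5), 0)
    return image, mask
```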
To evaluate the method properly, measurement indexes commonly used in image semantic segmentation are adopted: the mean intersection over union (MIoU), the accuracy (Acc), the model consistency (Kappa) and the image segmentation coefficient (Dice) are used as evaluation indexes.
Mean intersection over union (MIoU): in semantic segmentation, one set is the ground truth and the other is the prediction; the IoU of each class is computed and then averaged, namely:
IoU_i = p_ii / (Σ_{j=0..k} p_ij + Σ_{j=0..k} p_ji - p_ii)
MIoU = (1/(k + 1)) · Σ_{i=0..k} IoU_i
where k is the number of categories, i denotes the true class, j denotes the predicted class, pij is the number of pixels of class i predicted as class j, and pji is the number of pixels of class j predicted as class i.
Accuracy (Acc): the proportion of correctly classified pixels, calculated as:
Acc = (TP + TN) / (TP + TN + FP + FN)
wherein, TP, TN, FP, FN belong to the concept of confusion matrix, and the meanings thereof are shown in Table 2.
TABLE 2 confusion matrix
                     Predicted positive      Predicted negative
Actual positive      TP (true positive)      FN (false negative)
Actual negative      FP (false positive)     TN (true negative)
Model consistency (Kappa): for the classification problem, consistency measures whether the model prediction result agrees with the actual classification result.
The calculation formula of the Kappa coefficient based on the confusion matrix is as follows:
Kappa = (p_o - p_e) / (1 - p_e)
p_o = (TP + TN) / N
p_e = ((TP + FP)·(TP + FN) + (FN + TN)·(FP + TN)) / N², where N = TP + TN + FP + FN
image segmentation coefficient (Dice): the value range of the index applied to the pixel level is [0,1], and the closer the Dice is to 1, the better the model effect is.
Dice = 2·|X ∩ Y| / (|X| + |Y|)
where X is the set of predicted pixels and Y is the set of ground-truth pixels.
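The four indexes can be computed from a single confusion matrix; the following NumPy sketch (two classes) mirrors the formulas above and is given only for illustration:

```python
import numpy as np

def evaluate(pred, gt, num_classes=2):
    """Compute MIoU, Acc, Kappa and mean Dice from integer prediction and
    ground-truth label maps (illustrative sketch)."""
    cm = np.bincount(gt.ravel().astype(np.int64) * num_classes
                     + pred.ravel().astype(np.int64),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(np.float64)
    iou = tp / (cm.sum(1) + cm.sum(0) - tp)              # per-class IoU
    acc = tp.sum() / cm.sum()                            # overall pixel accuracy
    pe = (cm.sum(1) * cm.sum(0)).sum() / cm.sum() ** 2   # chance agreement
    kappa = (acc - pe) / (1 - pe)
    dice = 2 * tp / (cm.sum(1) + cm.sum(0))              # per-class Dice
    return iou.mean(), acc, kappa, dice.mean()
```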
To test the effectiveness of the proposed method, it was validated on three test sets against FCN-HRNet-W18 and SegFormer with different encoders; the results are shown in Tables 3-5.
(1) Matting dataset
TABLE 3 verification of results on Matting data sets by different methods
As can be seen from Table 3, on the 5811-image Matting test set the method maintains a smaller model size, 11.1 M and 1.4 M smaller than SegFormer-MiT-B0 and SegFormer-MobileNetV3 respectively, while the four evaluation indexes MIoU, Acc, Kappa and Dice are all better than those of the comparison models, exceeding the SegFormer-MiT-B0 model by 1.30% in MIoU and 0.66% in Acc. Comparing SegFormer-MiT-B0 with SegFormer-MiT-B0-OursMLP shows that our MLP does not reduce the running speed of the model. Compared with SegFormer-MobileNetV3, our model not only reduces the model size but also raises the maximum FPS, and experiments show that it maintains a stable frame rate above 30 FPS on the 5811 pictures.
As can be seen from the input image in fig. 8, the portrait is highly similar in color to the background, so all algorithms show considerable misjudgment: the clothes at the lower left corner of the portrait are confused with the background, the gap between the background and the portrait is also segmented as portrait, and the hair on the left side of the portrait is partially missing. The main reason is that the portrait masks provided by Matting are not labeled very accurately but rather coarsely. On the Matting dataset, the result in fig. 8 (d) is inferior to that of fig. 8 (c), but our model is still superior to fig. 8 (e) and the other models.
(2) EG1800 dataset
TABLE 4 verification of results on EG1800 datasets by different methods
As can be seen from Table 4, our model exceeds the SegFormer-MiT-B0 model by 1.83% in MIoU and 0.90% in Acc on the 425-image EG1800 test set. Since the EG1800 data set is not large, relatively little feature information can be obtained during model training. This highlights the advantage of our model: it is lightweight and does not require a particularly large training set to perform well, whereas larger models cost more memory overhead in both training and transfer. Comparing SegFormer-MiT-B0-OursMLP with SegFormer-MiT-B0 also shows that SegFormer-MiT-B0-OursMLP performs better when the data volume is insufficient. Compared with SegFormer-MobileNetV3, the improved MobileNetV3 achieves a higher MIoU on the small data set, exceeding the unmodified model by 0.85%, and obtains a faster model frame rate.
From figs. 9 (c) and 9 (d), we find that SegFormer-MiT-B0-OursMLP performs better, which again shows that our improved MLP is more suitable for small data volumes. Comparing fig. 9 (b) with the other models, fig. 9 (b) employs the high-resolution HRNet but still misjudges where the portrait meets the chair that intersects the background, whereas our model makes fewer misjudgments over a smaller area.

Claims (8)

1. A lightweight portrait segmentation method based on deep learning is characterized by comprising the following steps:
step 1, constructing a lightweight portrait segmentation model based on deep learning;
the lightweight portrait segmentation model based on deep learning is composed of an input layer, a convolution coordinate attention layer, 3 block units, 4 multilayer perceptron layers, a convolution normalization and activation layer, a convolution layer and an output layer; the input of the input layer serves as the input of the lightweight portrait segmentation model based on deep learning; the output of the input layer is connected with the input of the convolution coordinate attention layer; one output of the convolution coordinate attention layer is connected with the input of the first block unit, and the other output of the convolution coordinate attention layer is connected with the input of the fourth multilayer perceptron layer; one output of the first block unit is connected with the input of the second block unit, and the other output of the first block unit is connected with the input of the third multilayer perceptron layer; one output of the second block unit is connected with the input of the third block unit, and the other output of the second block unit is connected with the input of the second multilayer perceptron layer; the output of the third block unit is connected with the input of the first multilayer perceptron layer; the outputs of the 4 multilayer perceptron layers are simultaneously connected with the inputs of the convolution normalization and activation layer, and the output of the convolution normalization and activation layer is connected with the input of the convolution layer; the output of the convolution layer is connected with the input of the output layer; the output of the output layer serves as the output of the lightweight portrait segmentation model based on deep learning;
step 2, training the lightweight portrait segmentation model based on deep learning constructed in the step 1 by using the segmented sample image set to obtain a trained lightweight portrait segmentation model based on deep learning;
step 3, sending the image to be segmented into the trained lightweight portrait segmentation model based on deep learning obtained in step 2, and outputting the segmented picture by the trained lightweight portrait segmentation model based on deep learning.
2. The lightweight human image segmentation method based on deep learning of claim 1, wherein the convolution coordinate attention layer is composed of a convolution layer and a coordinate attention layer; the input of the convolutional layer is used as the input of the convolutional coordinate attention layer, the output of the convolutional layer is connected with the input of the coordinate attention layer, and the output of the coordinate attention layer is used as the output of the convolutional coordinate attention layer.
3. The method for lightweight human image segmentation based on deep learning of claim 1, wherein the first block unit is composed of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer;
the input of a first convolution normalization layer is used as the input of a first block unit, the output of the first convolution normalization layer is connected with the input of a second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of a third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer;
the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer;
and the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the first block unit.
4. The method for segmenting the lightweight human image based on the deep learning of claim 1, wherein the second block unit is composed of 12 convolution normalization layers, 4 fusion layers and 1 coordinate attention layer;
the input of a first convolution normalization layer is used as the input of a second block unit, the output of the first convolution normalization layer is connected with the input of a second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of a third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer;
the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer;
the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, the output of the eighth convolution normalization layer is connected with the input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is connected with the input of the tenth convolution normalization layer;
and the output of the tenth convolution normalization layer is connected with the input of the eleventh convolution normalization layer, one output of the eleventh convolution normalization layer is connected with one input of the twelfth convolution normalization layer, the other output of the eleventh convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the twelfth convolution normalization layer, the output of the twelfth convolution normalization layer is connected with one input of the fourth fusion layer, the other input of the fourth fusion layer is connected with the input of the tenth convolution normalization layer, and the output of the fourth fusion layer is used as the output of the second block unit.
5. The lightweight portrait segmentation method based on deep learning according to claim 1, wherein the third block unit is composed of 9 convolution normalization layers, 3 fusion layers and 1 coordinate attention layer;
the input of a first convolution normalization layer is used as the input of the third block unit, the output of the first convolution normalization layer is connected with the input of the second convolution normalization layer, the output of the second convolution normalization layer is connected with the input of the third convolution normalization layer, the output of the third convolution normalization layer is connected with one input of a first fusion layer, the other input of the first fusion layer is connected with the input of the first convolution normalization layer, and the output of the first fusion layer is connected with the input of a fourth convolution normalization layer;
the output of the fourth convolution normalization layer is connected with the input of the fifth convolution normalization layer, the output of the fifth convolution normalization layer is connected with the input of the sixth convolution normalization layer, the output of the sixth convolution normalization layer is connected with one input of the second fusion layer, the other input of the second fusion layer is connected with the input of the fourth convolution normalization layer, and the output of the second fusion layer is connected with the input of the seventh convolution normalization layer;
the output of the seventh convolution normalization layer is connected with the input of the eighth convolution normalization layer, one output of the eighth convolution normalization layer is connected with one input of the ninth convolution normalization layer, the other output of the eighth convolution normalization layer is connected with the input of the coordinate attention layer, the output of the coordinate attention layer is connected with the other input of the ninth convolution normalization layer, the output of the ninth convolution normalization layer is connected with one input of the third fusion layer, the other input of the third fusion layer is connected with the input of the seventh convolution normalization layer, and the output of the third fusion layer is used as the output of the third block unit.
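Claim 5's third block unit follows the same pattern with two plain residual stages before the attention-terminated stage, and the tail of claim 3 shows the first block unit ending the same way, so both could be obtained from the parameterized BlockUnit sketched after claim 4 with one fewer plain stage. The channel widths below are hypothetical.

```python
# Hypothetical instantiations; the claims do not specify channel widths.
first_block = BlockUnit(ch=64, num_plain_stages=2)    # 9 ConvBN, 3 fusions, 1 coordinate attention
third_block = BlockUnit(ch=128, num_plain_stages=2)   # 9 ConvBN, 3 fusions, 1 coordinate attention
```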
6. The lightweight portrait segmentation method based on deep learning according to any one of claims 3 to 5, wherein the convolution normalization layer is composed of a convolution layer and a BN normalization layer; the input of the convolution layer is used as the input of the convolution normalization layer, the output of the convolution layer is connected with the input of the BN normalization layer, and the output of the BN normalization layer is used as the output of the convolution normalization layer.
7. The lightweight portrait segmentation method based on deep learning according to claim 1, wherein each multi-layer perception layer is composed of a linear layer and an LN normalization layer; the input of the linear layer is used as the input of the multi-layer perception layer, the output of the linear layer is connected with the input of the LN normalization layer, and the output of the LN normalization layer is used as the output of the multi-layer perception layer.
8. The lightweight portrait segmentation method based on deep learning according to claim 1, wherein the convolution normalization and activation layer is composed of a convolution layer, a BN normalization layer and a ReLU activation layer; the input of the convolution layer is used as the input of the convolution normalization and activation layer, the output of the convolution layer is connected with the input of the BN normalization layer, the output of the BN normalization layer is connected with the input of the ReLU activation layer, and the output of the ReLU activation layer is used as the output of the convolution normalization and activation layer.
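To make the three building blocks of claims 6 to 8 concrete, the PyTorch sketch below implements them directly. Kernel sizes, strides and feature dimensions are not given in the claims and are chosen here only for illustration; the class names (including MLPLayer) are likewise illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class ConvBN(nn.Sequential):
    """Convolution normalization layer (claim 6): Conv2d followed by BatchNorm2d."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
        )

class MLPLayer(nn.Sequential):
    """Multi-layer perception layer (claim 7): Linear followed by LayerNorm."""
    def __init__(self, in_dim, out_dim):
        super().__init__(
            nn.Linear(in_dim, out_dim),
            nn.LayerNorm(out_dim),
        )

class ConvBNReLU(nn.Sequential):
    """Convolution normalization and activation layer (claim 8):
    Conv2d -> BatchNorm2d -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

# Quick shape check with hypothetical sizes.
x = torch.randn(1, 32, 64, 64)
print(ConvBNReLU(32, 64)(x).shape)   # torch.Size([1, 64, 64, 64])
```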
CN202210998603.XA 2022-08-19 2022-08-19 Lightweight portrait segmentation method based on deep learning Pending CN115457263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210998603.XA CN115457263A (en) 2022-08-19 2022-08-19 Lightweight portrait segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998603.XA CN115457263A (en) 2022-08-19 2022-08-19 Lightweight portrait segmentation method based on deep learning

Publications (1)

Publication Number Publication Date
CN115457263A (en) 2022-12-09

Family

ID=84298722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998603.XA Pending CN115457263A (en) 2022-08-19 2022-08-19 Lightweight portrait segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN115457263A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620170A (en) * 2022-12-16 2023-01-17 自然资源部第三航测遥感院 Multi-source optical remote sensing image water body extraction method based on SegFormer


Similar Documents

Publication Publication Date Title
WO2022000426A1 (en) Method and system for segmenting moving target on basis of twin deep neural network
CN109840556B (en) Image classification and identification method based on twin network
CN109919204B (en) Noise image-oriented deep learning clustering method
CN111339903A (en) Multi-person human body posture estimation method
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN111079539B (en) Video abnormal behavior detection method based on abnormal tracking
WO2022083335A1 (en) Self-attention mechanism-based behavior recognition method
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN115082293A (en) Image registration method based on Swin Transformer and CNN double-branch coupling
CN112560865B (en) Semantic segmentation method for point cloud under outdoor large scene
CN112288776B (en) Target tracking method based on multi-time step pyramid codec
JP7337268B2 (en) Three-dimensional edge detection method, device, computer program and computer equipment
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
García-González et al. Background subtraction by probabilistic modeling of patch features learned by deep autoencoders
CN115457263A (en) Lightweight portrait segmentation method based on deep learning
CN113435370B (en) Method and device for acquiring vehicle queuing length based on image feature fusion
CN116519106B (en) Method, device, storage medium and equipment for determining weight of live pigs
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN115631412A (en) Remote sensing image building extraction method based on coordinate attention and data correlation upsampling
CN115830637A (en) Method for re-identifying shielded pedestrian based on attitude estimation and background suppression
CN114863132A (en) Method, system, equipment and storage medium for modeling and capturing image spatial domain information
Ma PANet: parallel attention network for remote sensing image semantic segmentation
CN114862685A (en) Image noise reduction method and image noise reduction module
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction
CN117830788B (en) Image target detection method for multi-source information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination