CN112906706A - Improved image semantic segmentation method based on coder-decoder - Google Patents

Improved image semantic segmentation method based on coder-decoder

Info

Publication number
CN112906706A
CN112906706A
Authority
CN
China
Prior art keywords
network
boundary
decoder
image
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110344753.4A
Other languages
Chinese (zh)
Inventor
张红英 (Zhang Hongying)
李鑫 (Li Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202110344753.4A priority Critical patent/CN112906706A/en
Publication of CN112906706A publication Critical patent/CN112906706A/en
Pending legal-status Critical Current

Classifications

    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention provides an improved image semantic segmentation method based on an encoder-decoder. First, an improved atrous spatial pyramid pooling (ASPP) module in the encoder extracts multi-scale image features, and the decoder then performs cross-layer fusion of the extracted high-level and low-level semantic information; next, a visual activation function is used to improve the spatial context modeling capability of the encoder-decoder network; finally, an optimization branch is introduced: a boundary branch and a direction branch generate an offset map containing per-pixel offset information in each direction, and the coarse prediction map produced by the decoder is refined through coordinate mapping to yield the final semantic segmentation result map. The invention extracts and fuses multi-scale image features with the encoder-decoder and refines class boundaries with the offset-map information, achieving excellent semantic segmentation performance and wide applicability.

Description

Improved image semantic segmentation method based on coder-decoder
Technical Field
The invention relates to image processing technology, in particular to an image semantic segmentation method that adopts a codec (encoder-decoder) structure to extract and fuse multi-scale features and to optimize class boundaries.
Background
Image semantic segmentation is a foundational computer-vision task, distinct from object detection and image classification, in which each pixel in an image is assigned a predefined label representing its semantic category, i.e. a pixel-level classification task. Concretely, semantic segmentation determines at the pixel level both what each target object in an image is and where it is: targets in the image are detected, the contour between each individual object and the scene is delineated, and finally the objects are classified, with objects of the same class rendered in the same color. In recent years, with the development of deep learning in computer vision, image semantic segmentation has been widely applied in areas such as autonomous driving and intelligent medicine. Deep neural networks, with their inherent invariance, can learn dense abstract features, and their performance far exceeds that of systems designed around traditional hand-crafted features.
The codec network performs cross-layer fusion, through the decoder, of the feature-map information at different scales acquired by the encoder, so that high-level semantic information and low-level spatial information are fused effectively. However, a plain codec structure tends to lose small-scale targets and blur boundaries in image segmentation, so acquiring rich spatial-context feature information and optimizing segmentation boundaries have become key research topics in image semantic segmentation.
Disclosure of Invention
The invention aims to solve the image semantic segmentation problem: each pixel of an image is accurately classified by a deep learning network, so that the semantic information of the image can be segmented.
To achieve the above object, the present invention provides an image semantic segmentation method that uses a codec to extract and fuse multi-scale features and to optimize class boundaries. The method comprises five parts: the first preprocesses the data set; the second performs feature extraction and cross-layer fusion on the input image; the third performs coarse semantic segmentation of the image; the fourth performs boundary optimization of the coarse prediction map; and the fifth performs network training and testing to predict the final segmentation result map.
The first part comprises two steps:
step 1, downloading a semantic segmentation public data set, and selecting images with complex scenes, various details and complete categories as training samples;
step 2, randomly scaling the training images within the range [0.5, 2], then randomly cropping them to enhance the randomness of the training samples and prevent overfitting, forming the final training set;
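The step-2 augmentation (random scaling in [0.5, 2] followed by random cropping) can be sketched as follows for a single-channel image; the crop size, the nearest-neighbour resize, and the zero padding are illustrative assumptions, not details stated in the patent.

```python
import numpy as np

def random_scale_crop(img, crop=64, rng=np.random.default_rng(0)):
    """Randomly scale a 2-D image by a factor in [0.5, 2],
    then take a random crop (zero-padding first if too small)."""
    s = rng.uniform(0.5, 2.0)
    H, W = img.shape
    nH, nW = max(1, int(H * s)), max(1, int(W * s))
    # nearest-neighbour resize via index mapping (stand-in for bilinear)
    ys = np.arange(nH) * H // nH
    xs = np.arange(nW) * W // nW
    scaled = img[ys][:, xs]
    # pad with zeros if the scaled image is smaller than the crop window
    pH, pW = max(0, crop - nH), max(0, crop - nW)
    scaled = np.pad(scaled, ((0, pH), (0, pW)))
    nH, nW = scaled.shape
    y0 = rng.integers(0, nH - crop + 1)
    x0 = rng.integers(0, nW - crop + 1)
    return scaled[y0:y0 + crop, x0:x0 + crop]
```

In practice the same random scale and crop window would also be applied to the label map so that image and annotation stay aligned.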
the second part comprises two steps:
and 3, inputting the training samples in the step 2 into a codec network for multi-scale feature extraction and cross-layer fusion to obtain a fused feature map. The specific implementation is as follows:
(1) the encoder network is used for feature extraction and multi-scale feature fusion and comprises a downsampling operation and an improved Spatial Pyramid Pooling (ASPP) module. And taking a residual error network as a backbone network, performing 1/4 down-sampling on the input samples to generate a low-level spatial feature map, transmitting the low-level spatial feature map into a decoder for standby, and taking a feature map with the size of 1/16 generated by continuous down-sampling as the input of an improved ASPP module to acquire high-level semantic information. Improved ASPP module in encoder
Figure 501034DEST_PATH_IMAGE002
Convolutional layer, four
Figure 214912DEST_PATH_IMAGE004
The expansion convolution layer (the expansion rates are respectively 4, 8, 12 and 24) and the average pooling layer, multi-scale feature extraction is carried out on the input feature map, nonlinear activation is carried out by using a FRELU activation function, and finally, Concatenate fusion is carried out;
(2) the decoder network performs cross-layer fusion of the different-level features from the encoder;
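The dilated (atrous) convolutions and Concatenate fusion described in (1) can be sketched for a single channel as follows; the NumPy loop, the delta test kernel, and the mean-pooling branch are illustrative assumptions — the actual module uses learned multi-channel kernels plus a 1×1 convolution branch.

```python
import numpy as np

def dilated_conv2d(x, w, rate):
    """'Same'-padded single-channel 2-D convolution with dilation `rate`."""
    k = w.shape[0]
    eff = k + (k - 1) * (rate - 1)   # effective receptive field (odd for odd k)
    pad = eff // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x, dtype=float)
    for a in range(k):
        for b in range(k):
            out += w[a, b] * xp[a * rate:a * rate + H, b * rate:b * rate + W]
    return out

def aspp(x, w, rates=(4, 8, 12, 24)):
    """Parallel dilated branches + a pooling branch, then 'Concatenate' fusion."""
    branches = [dilated_conv2d(x, w, r) for r in rates]
    branches.append(np.full_like(x, x.mean(), dtype=float))  # pooling branch
    return np.stack(branches)        # concatenation along a new channel axis
```

With a 3×3 kernel, dilation rate r gives an effective receptive field of 2r + 1 pixels, which is how the four parallel branches observe the input at four different scales without extra parameters.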
step 4, inputting the training samples in the step 2 into a boundary optimization network, and extracting a high-resolution feature map as the input of boundary branches and direction branches through a parallel network HRNet;
the third part comprises two steps:
step 5, adjusting the channel number of the decoder's standby feature map from step 3, and performing Concatenate fusion of it with the improved ASPP module's output feature map after a deconvolution up-sampling operation;
step 6, mapping the feature map subjected to cross-layer fusion in the step 5 to an RGB space through convolution, and recovering to the resolution of the input image through deconvolution operation;
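The deconvolution (transposed convolution) used in step 6 to recover the input resolution can be illustrated for a single channel as below; the stride-2, 2×2 all-ones kernel is a hedged example, not the patent's actual configuration.

```python
import numpy as np

def transposed_conv2d(x, w, stride=2):
    """Single-channel transposed convolution: each input pixel 'stamps' the
    kernel onto a stride-spaced output grid, producing an upsampled map."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((stride * (H - 1) + k, stride * (W - 1) + k))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out
```

With stride 2 and a 2×2 kernel the stamps tile the output exactly, doubling the resolution; a learned kernel additionally interpolates between the stamped values.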
the fourth section comprises two steps:
and 7, taking the feature graph extracted in the step 4 as the input of the boundary branch and the direction branch, generating an offset graph with different offset information in each direction, and optimizing a rough result. The specific implementation is as follows:
(1) with boundary branches supervised by a binary cross-entropy function
Figure 73278DEST_PATH_IMAGE002
Convolution, BN normalization and ReLU activation function sum
Figure 61962DEST_PATH_IMAGE002
The linear classifier formed by convolution is formed, boundary division is carried out through a preset threshold value, all offsets are rescaled by artificial scaling factors, and false pixel prediction is reduced;
(2) the direction branch, supervised by a standard categorical cross-entropy loss function, is composed of a 1×1 convolution, BN normalization, a ReLU activation function and a linear classifier formed by a 1×1 convolution; the real scene map is divided by discrete partitions;
(3) masking the discrete direction map output by (2) with the boundary map output by (1) to generate an offset map with different offset information in each direction;
(4) spatially mapping the offset map output by (3) onto the coarse segmentation map output in step 6 for boundary optimization;
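Sub-steps (1)-(4) above can be sketched as follows: the discrete direction map is masked by the boundary map, converted into rescaled pixel offsets, and the coarse prediction is re-read at the interior position each offset points to. The 8-way direction partition and the scaling factor 2 are taken from the embodiment below; the array shapes and boundary clipping are illustrative assumptions.

```python
import numpy as np

# discrete partition m = 8: unit offsets (dy, dx) for the 8 directions
DIRS = np.array([(-1, 0), (-1, 1), (0, 1), (1, 1),
                 (1, 0), (1, -1), (0, -1), (-1, -1)])

def refine(coarse, boundary, direction, scale=2):
    """Replace each boundary pixel's label with the coarse label found at
    the interior position its (rescaled) offset vector points to."""
    out = coarse.copy()
    H, W = coarse.shape
    for y, x in zip(*np.nonzero(boundary)):       # boundary mask
        dy, dx = DIRS[direction[y, x]] * scale    # rescaled offset, a = 2
        ty = np.clip(y + dy, 0, H - 1)
        tx = np.clip(x + dx, 0, W - 1)
        out[y, x] = coarse[ty, tx]                # coordinate mapping
    return out
```

Interior pixels are left untouched; only pixels flagged by the boundary branch are re-labelled, which is why the refinement is cheap relative to the main network.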
the fifth part comprises two steps:
step 8, debugging the network structure hyper-parameters of steps 3 to 7 and setting the network model parameters: the initial learning rate is set to 0.01, the backbone network uses 1/10 of the initial learning rate, a poly learning-rate adjustment strategy is used, epochs are set to 80 and the batch size to 8, obtaining the final training model;
step 9, inputting the test set from step 1 into the training model from step 8, and segmenting the image semantics.
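The step-8 schedule (initial learning rate 0.01, backbone at 1/10 of it, poly adjustment strategy) can be expressed as below; the power 0.9 is the value commonly used with the poly policy and is an assumption here, since the patent does not state it.

```python
def poly_lr(base_lr, it, max_it, power=0.9):
    """Poly learning-rate policy: decays from base_lr to 0 over max_it steps."""
    return base_lr * (1.0 - it / max_it) ** power

base = 0.01            # initial learning rate from step 8
backbone = base / 10   # the backbone trains at 1/10 of the initial rate
# the schedule runs over 80 epochs x (steps per epoch) iterations with batch size 8
```

At iteration 0 the rate equals the base value and it decays smoothly to zero at the final iteration, which is why poly is preferred over step decay for dense prediction.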
The invention provides a codec image semantic segmentation method integrating multi-scale features and boundary optimization. First, the improved ASPP module in the encoder extracts and fuses the multi-scale features of the image; then, the channel numbers of the different-level feature maps are adjusted in the decoder, the channel-adjusted high-level semantic information and low-level spatial information are fused across layers, and a coarse semantic prediction result map is generated by mapping to RGB space through convolution, while the visual activation function FReLU in the codec improves the efficiency of capturing spatial context; finally, the pixel offset map generated in the optimization branch by masking the target boundary map with the discrete direction map is used to perform boundary optimization on the coarse prediction result, generating a fine semantic segmentation result map. The invention uses the improved codec to fuse the multi-scale features of the image and optimizes the class boundaries, achieving excellent semantic segmentation performance, high accuracy and good robustness.
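The visual activation function FReLU used throughout the network computes max(x, T(x)), where T(·) is a learned depthwise ("funnel") convolution over a local window. A single-channel sketch, with a fixed 3×3 box filter standing in for the learned filter (an illustrative assumption):

```python
import numpy as np

def frelu(x, k=3):
    """FReLU sketch: y = max(x, T(x)), with T(x) here a k x k box filter
    (a stand-in for the learned depthwise funnel convolution)."""
    pad = k // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    t = np.zeros_like(x, dtype=float)
    for a in range(k):
        for b in range(k):
            t += xp[a:a + H, b:b + W]
    t /= k * k                      # local spatial context T(x)
    return np.maximum(x, t)         # pixel-wise max with the input
```

Because T(x) looks at a spatial neighbourhood, the activation itself carries spatial context, which is the property the patent relies on for "activating spatially insensitive information".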
Drawings
FIG. 1 is a network overall framework diagram of the present invention;
FIG. 2 is a block diagram of an improved spatial pyramid pooling of the present invention;
FIG. 3 is an optimization schematic of the present invention;
FIG. 4 is an original input image;
FIG. 5 is a semantic segmentation image of FIG. 4 predicted using the present invention.
Detailed Description
For a better understanding of the present invention, the codec image semantic segmentation method with multi-scale features and boundary optimization according to the present invention is described in more detail below with reference to specific embodiments. In the following description, detailed descriptions of the known prior art are omitted where they would obscure the subject matter of the present invention.
Step 1, downloading a semantic segmentation public data set, and selecting training samples with complex scenes, various details and complete categories;
step 2, randomly scaling the training images within the range [0.5, 2], then randomly cropping them to enhance the randomness of the training samples and prevent overfitting, forming the final training set sample 101;
fig. 1 is a network model diagram of a codec image semantic segmentation method based on multi-scale feature and boundary optimization, which is performed according to the following steps in the present embodiment:
and 3, inputting the training samples in the step 2 into a codec network for multi-scale feature extraction and cross-layer fusion to obtain a fused feature map. The specific implementation is as follows:
(1) the encoder network is used for feature extraction and multi-scale feature fusion and consists of a downsampling operation and an improved ASPP module. And taking a residual error network as a backbone network, performing 1/4 down-sampling on the input samples to generate a low-level spatial feature map 102, transmitting the low-level spatial feature map into a decoder for standby, and taking a feature map 103 with the size of 1/16 generated by continuous down-sampling as the input of an improved ASPP module to acquire high-level semantic information. The improved ASPP module in the encoder is shown in fig. 2, and performs multi-scale feature extraction on an input feature map 201, which is obtained by
Figure 244792DEST_PATH_IMAGE002
Convolutional layer, four
Figure 37167DEST_PATH_IMAGE004
The expansion convolution layer (the expansion rates are respectively 4, 8, 12 and 24) and the average pooling layer form 202, the FRELU activation function is used for nonlinear activation, and finally the Concatenate fusion 203 is carried out;
(2) the decoder network performs cross-layer fusion of the different-level features from the encoder;
step 4, inputting the training samples in the step 2 into a boundary optimization network, and extracting a high-resolution feature map as the input of boundary branches and direction branches through a parallel network HRNet;
the third part comprises two steps:
step 5, adjusting 104 the channel number of the decoder's standby feature map from step 3, and performing Concatenate fusion 106 of the standby feature map with the improved ASPP module's output feature map 105 after a deconvolution up-sampling operation;
step 6, mapping the feature map subjected to cross-layer fusion in the step 5 to an RGB space through convolution, and recovering to the resolution 107 of the input image through deconvolution operation;
the fourth section comprises two steps:
step 7, as shown in fig. 3, the optimization branch takes the feature map 301 extracted in step 4 as the input of the boundary branch 302 and the direction branch 303, generates an offset map 305 with different offset information in each direction, and optimizes the coarse result. The specific implementation is as follows:
(1) the boundary branch 302, supervised by a binary cross-entropy loss function, is composed of a 1×1 convolution, BN normalization, a ReLU activation function and a linear classifier formed by a 1×1 convolution; a preset threshold N = 5 divides the boundary, and an artificial scaling factor a = 2 is set to rescale all offsets, reducing false pixel predictions;
(2) the direction branch 303, supervised by a standard categorical cross-entropy loss function, is composed of a 1×1 convolution, BN normalization, a ReLU activation function and a linear classifier formed by a 1×1 convolution; the real scene map is divided using discrete partitions m = 8;
(3) masking the discrete direction map output by (2) with the boundary map 304 output by (1) to generate an offset map 305 with different offset information in each direction;
(4) spatially mapping the offset map output by (3) onto the coarse segmentation map 107 output in step 6 for boundary optimization 306;
L̃(pᵢ) = L(pᵢ + Δpᵢ)

where L̃ is the refined label map, pᵢ represents the position of boundary pixel i, Δpᵢ represents the generated offset vector pointing to an interior pixel, and pᵢ + Δpᵢ represents the position of the identified interior pixel;
the fifth part comprises two steps:
step 8, debugging the network structure hyper-parameters of steps 3 to 7 and setting the network model parameters: the initial learning rate is set to 0.01, the backbone network uses 1/10 of the initial learning rate, a poly learning-rate adjustment strategy is used, epochs are set to 80 and the batch size to 8, obtaining the final training model;
step 9, inputting the test image into the pre-trained model, and predicting the semantic segmentation image 108 shown in fig. 5.
In accordance with the structural characteristics of the codec and deep-learning-based image semantic segmentation methods, the invention provides a codec image semantic segmentation method integrating multi-scale features and boundary optimization. The cross-layer fusion characteristics of the codec are exploited more fully: the spatial pyramid pooling module of the encoder is improved to obtain multi-scale image features, spatially insensitive information is activated through a visual activation function, high-level semantic information and low-level spatial information are fused by the decoder, and the resolution of the predicted image is then restored through deconvolution; finally, the optimization branch performs boundary-pixel optimization on the generated coarse prediction map to produce the final semantic segmentation prediction result. The method has a simple algorithm, strong operability and wide applicability.
While the invention has been described with reference to illustrative embodiments thereof, it is to be understood that the invention is not limited thereto, but is intended to cover the various changes and modifications that are obvious to those skilled in the art and that fall within the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. An improved image semantic segmentation method based on a coder-decoder is characterized in that a coder-decoder structure is adopted to extract and fuse multi-scale features and optimize class boundaries, and the method comprises five parts, namely data set preprocessing, feature extraction and cross-layer fusion of input images, semantic rough segmentation, boundary optimization, network training and testing;
the first part comprises two steps:
step 1, downloading a semantic segmentation public data set, and selecting images with complex scenes, various details and complete categories as training samples;
step 2, randomly scaling the training images within the range [0.5, 2], then randomly cropping them to enhance the randomness of the training samples and prevent overfitting, forming the final training set;
the second part comprises two steps:
and 3, inputting the training samples in the step 2 into a codec network for multi-scale feature extraction and cross-layer fusion to obtain a fused feature map. The specific implementation is as follows:
(1) the encoder network is used for feature extraction and multi-scale feature fusion and consists of a downsampling operation and an improved ASPP module; using a residual error network as a backbone network, performing 1/4 down-sampling on an input sample to generate a low-level spatial feature map, transmitting the low-level spatial feature map into a decoder for standby, and using a feature map with the size of 1/16 generated by continuous down-sampling as the input of an improved ASPP module to acquire high-level semantic information; improved ASPP module in encoder
Figure DEST_PATH_IMAGE001
Convolutional layer, four
Figure 405508DEST_PATH_IMAGE002
Expansion convolutional layer (expansion ratio of 4, 8, 12, 24, respectively) and globalThe method comprises the steps of average pooling layer composition, multi-scale feature extraction is conducted on an input feature graph, nonlinear activation is conducted through an FRELU activation function, and finally Concatenate fusion is conducted;
(2) the decoder network performs cross-layer fusion of the different-level features from the encoder;
step 4, inputting the training samples in the step 2 into a boundary optimization network, and extracting a high-resolution feature map as the input of boundary branches and direction branches through a parallel network HRNet;
the third part comprises two steps:
step 5, adjusting the channel number of the decoder's standby feature map from step 3, and performing Concatenate fusion of it with the feature map output by the improved ASPP module after a deconvolution up-sampling operation;
step 6, mapping the feature map subjected to cross-layer fusion in the step 5 to an RGB space through convolution, and recovering the feature map into the resolution of the input image through deconvolution operation;
the fourth section comprises two steps:
step 7, taking the feature map extracted in step 4 as the input of the boundary branch and the direction branch, generating an offset map with different offset information in each direction, and optimizing the coarse result; the specific implementation is as follows:
(1) the boundary branch, supervised by a binary cross-entropy loss function, is composed of a 1×1 convolution, BN normalization, a ReLU activation function and a linear classifier formed by a 1×1 convolution; the boundary is divided by a preset threshold, and all offsets are rescaled by an artificial scaling factor to reduce false pixel predictions;
(2) the direction branch, supervised by a standard categorical cross-entropy loss function, is composed of a 1×1 convolution, BN normalization, a ReLU activation function and a linear classifier formed by a 1×1 convolution; the real scene map is divided by discrete partitions;
(3) masking the discrete direction map output by (2) with the boundary map output by (1) to generate an offset map with different offset information in each direction;
(4) spatially mapping the offset map output by (3) onto the coarse segmentation map output in step 6 for boundary optimization;
the fifth part comprises two steps:
step 8, debugging the network structure hyper-parameters of steps 3 to 7 and setting the network model parameters, wherein the initial learning rate is set to 0.01, the backbone network uses 1/10 of the initial learning rate, a poly learning-rate adjustment strategy is used, epochs are set to 80 and the batch size to 8, obtaining the final training model;
step 9, inputting the test set from step 1 into the training model from step 8, and segmenting the image semantics.
2. The improved image semantic segmentation method based on a coder-decoder as claimed in claim 1, characterized in that the improved ASPP module of step 3 (1) is used, wherein the dilation rates of the ASPP module are set to 4, 8, 12 and 24; in step 3, the visual activation function FReLU is used to acquire spatially insensitive information; and the codec structure of step 3 is used for multi-scale extraction and cross-layer fusion of image features.
3. The improved codec-based image semantic segmentation method according to claim 1, wherein step 7 (1) is used to extract the boundary map of the input image, with the threshold set to 5 and the artificial scaling factor to 2; step 7 (2) is used to extract the discrete direction map, with the discrete partition set to 8; and in step 7 (3) the offset map is used for image segmentation optimization.
4. The method of claim 1, wherein in step 9, epochs are set to 80, the backbone network learning rate is 1/10 of the initial learning rate, a poly learning-rate adjustment strategy is used, and the batch size is set to 8.
CN202110344753.4A 2021-03-31 2021-03-31 Improved image semantic segmentation method based on coder-decoder Pending CN112906706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344753.4A CN112906706A (en) 2021-03-31 2021-03-31 Improved image semantic segmentation method based on coder-decoder


Publications (1)

Publication Number Publication Date
CN112906706A true CN112906706A (en) 2021-06-04

Family

ID=76109601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344753.4A Pending CN112906706A (en) 2021-03-31 2021-03-31 Improved image semantic segmentation method based on coder-decoder

Country Status (1)

Country Link
CN (1) CN112906706A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN112183635A (en) * 2020-09-29 2021-01-05 南京农业大学 Method for realizing segmentation and identification of plant leaf lesions by multi-scale deconvolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUHUI YUAN et al.: "SegFix: Model-Agnostic Boundary Refinement for Segmentation", arXiv *
YUKUN ZHU et al.: "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation", arXiv *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516640A (en) * 2021-07-05 2021-10-19 首都师范大学 CT image fine crack segmentation device and method based on classification branches
CN113705463A (en) * 2021-08-30 2021-11-26 安徽大学 Factory footprint extraction method and system based on multi-scale gating dense connection
CN113705463B (en) * 2021-08-30 2024-02-20 安徽大学 Factory footprint extraction method and system based on multi-scale gate control intensive connection
CN113971660A (en) * 2021-09-30 2022-01-25 哈尔滨工业大学 Computer vision method for bridge health diagnosis and intelligent camera system
CN114037674A (en) * 2021-11-04 2022-02-11 天津大学 Industrial defect image segmentation detection method and device based on semantic context
CN114037674B (en) * 2021-11-04 2024-04-26 天津大学 Industrial defect image segmentation detection method and device based on semantic context
CN114419036A (en) * 2022-03-28 2022-04-29 北京矩视智能科技有限公司 Surface defect region segmentation method and device based on boundary information fusion
CN114419036B (en) * 2022-03-28 2022-06-24 北京矩视智能科技有限公司 Surface defect region segmentation method and device based on boundary information fusion
CN114648668A (en) * 2022-05-18 2022-06-21 浙江大华技术股份有限公司 Method and apparatus for classifying attributes of target object, and computer-readable storage medium
CN116661530A (en) * 2023-07-31 2023-08-29 山西聚源生物科技有限公司 Intelligent control system and method in edible fungus industrial cultivation
CN116661530B (en) * 2023-07-31 2023-09-29 山西聚源生物科技有限公司 Intelligent control system and method in edible fungus industrial cultivation
CN117237644A (en) * 2023-11-10 2023-12-15 广东工业大学 Forest residual fire detection method and system based on infrared small target detection
CN117237644B (en) * 2023-11-10 2024-02-13 广东工业大学 Forest residual fire detection method and system based on infrared small target detection
CN117409329A (en) * 2023-12-15 2024-01-16 深圳安德空间技术有限公司 Method and system for reducing false alarm rate of underground cavity detection by three-dimensional ground penetrating radar
CN117409329B (en) * 2023-12-15 2024-04-05 深圳安德空间技术有限公司 Method and system for reducing false alarm rate of underground cavity detection by three-dimensional ground penetrating radar

Similar Documents

Publication Publication Date Title
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN111932553B (en) Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
Chen et al. Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature
CN112232349A (en) Model training method, image segmentation method and device
Kim et al. Multi-task convolutional neural network system for license plate recognition
CN111046880A (en) Infrared target image segmentation method and system, electronic device and storage medium
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN110263786A (en) Road multi-target recognition system and method based on feature-dimension fusion
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN114037640A (en) Image generation method and device
Zang et al. Traffic lane detection using fully convolutional neural network
CN116052016A (en) Fine segmentation detection method for remote sensing image cloud and cloud shadow based on deep learning
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
Han et al. A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection
CN115861756A (en) Earth background small target identification method based on cascade combination network
Sulehria et al. Vehicle number plate recognition using mathematical morphology and neural networks
CN116012395A (en) Multi-scale fusion smoke segmentation method based on depth separable convolution
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210604