CN113033570B - Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion - Google Patents

Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion

Info

Publication number
CN113033570B
CN113033570B CN202110344461.0A
Authority
CN
China
Prior art keywords
image
convolution
feature
output
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110344461.0A
Other languages
Chinese (zh)
Other versions
CN113033570A (en)
Inventor
高世伟
张长柱
张皓
王祝萍
黄超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110344461.0A priority Critical patent/CN113033570B/en
Publication of CN113033570A publication Critical patent/CN113033570A/en
Application granted granted Critical
Publication of CN113033570B publication Critical patent/CN113033570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Abstract

The invention relates to an image semantic segmentation method with improved dilated convolution and multi-level feature information fusion, comprising the following steps: extracting image features in a deep convolutional neural network by using an improved dilated convolution method; cascading and fusing the extracted deep feature images with shallow feature images to compensate for the loss of spatial information; learning boundary information of the multi-stage processed feature images through boundary refinement, fusing them, and restoring the original image resolution to generate a predicted segmentation map; and training the network with a cross-entropy loss function and evaluating model performance with mIoU. The invention improves on the existing way of using dilated convolution and designs a deformable spatial pyramid structure, thereby improving the image feature extraction of the model. Meanwhile, a multi-level feature information fusion structure is designed for image resolution recovery, the local and global information contained at different levels is fully exploited, and boundary refinement is introduced, effectively improving the accuracy of image semantic segmentation.

Description

Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion
Technical Field
The invention relates to the field of computer vision and pattern recognition intelligent systems, and in particular to an image semantic segmentation method with improved dilated convolution and multi-level feature information fusion.
Background
Automated scene understanding is an important goal in the field of modern computer vision. Image semantic segmentation is a fundamental scene understanding task in computer vision: it takes raw data (e.g., a flat image) as input and converts it into a mask with highlighted regions of interest, dividing the image into multiple regions carrying different semantic information. In recent years, owing to the excellent performance of deep convolutional neural networks on the semantic segmentation task, segmentation quality has improved markedly over traditional methods such as GrabCut and N-Cut. Good segmentation algorithms are crucial for many practical applications, such as autonomous driving, medical image processing, computational photography, image search engines, and augmented reality. These applications all require very accurate pixel predictions.
However, in current semantic segmentation methods based on deep convolutional neural networks, repeated pooling and downsampling reduce image resolution and discard global context information, so the segmentation result cannot reach high pixel-classification accuracy.
Disclosure of Invention
The invention aims to provide an image semantic segmentation method with improved dilated convolution and multi-level feature information fusion, which can effectively improve the information utilization and effectiveness of feature extraction, enrich shallow semantic information, learn the global context information of an image, and improve the accuracy of semantic segmentation on two-dimensional images.
The structure based on the improved dilated convolution method and multi-level feature information fusion improves the image segmentation effect without significantly increasing the computational load of the system. Compared with simply stacking convolutional networks, the method designs a more suitable structure for image feature extraction and spatial information compensation, reduces the loss of feature information during downsampling, effectively improves pixel prediction accuracy, and enhances the image semantic segmentation effect.
To achieve the above aim, the invention adopts the following technical scheme:
An image semantic segmentation method with improved dilated convolution and multi-level feature information fusion comprises the following steps:
S1: extracting image features in a deep convolutional neural network by using an improved dilated convolution method;
S2: cascading and fusing the extracted deep feature images with shallow feature images to compensate for the loss of spatial information;
S3: learning boundary information of the multi-stage processed feature images through boundary refinement, fusing them, and restoring the original image resolution to generate a predicted segmentation map;
S4: training the network with a cross-entropy loss function and evaluating model performance with mIoU.
The specific implementation of S1 comprises the following steps:
S1.1: ResNet-101 is used as the backbone network, and an improved skip-connected dilated convolution module is attached after the third downsampling module; the module comprises three dilated convolution layers whose dilation rates are chosen according to the resolution of the input image, and skip connections are established between different convolution layers in the forward direction, which further enlarges the receptive field without further shrinking the image and reduces information loss;
S1.2: the image processed by the skip-connected dilated convolution module is input into an improved deformable spatial pyramid pooling module, which combines the ability of deformable convolution to adapt the receptive field to target scale changes and flexibly aggregate information with the ability of multi-scale dilated convolution on a regular sampling grid to classify arbitrary regions of the image effectively, improving the model's capacity to learn target deformation at low cost in model complexity;
S1.3: the feature information of different levels contained in feature images of different resolutions at different stages of the downsampling process is retained.
The specific implementation of S2 comprises the following steps:
S2.1: the feature layer processed by the skip-connected dilated convolution module is passed through a 1 × 1 convolution and combined with the feature image extracted at the deepest layer to compensate for the semantic information of the shallow feature image; the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output;
S2.2: the feature map output in S2.1 is combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature map is passed through a 1 × 1 convolution and taken as this layer's output;
S2.3: the feature image output in S2.2 is upsampled by a factor of two via bilinear interpolation, combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output;
S2.4: the feature image output in S2.3 is upsampled by a factor of two via bilinear interpolation, combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output.
The specific implementation of S3 comprises the following steps:
S3.1: the deepest output feature image is upsampled by a factor of four via bilinear interpolation;
S3.2: the layer output feature image from S2.1 is upsampled by a factor of four via bilinear interpolation;
S3.3: the layer output feature image from S2.2 is upsampled by a factor of four via bilinear interpolation;
S3.4: the layer output feature image from S2.3 is upsampled by a factor of two via bilinear interpolation;
S3.5: the boundaries of the output feature images from S2.4, S3.1, S3.2, S3.3, and S3.4 are refined by a BR module; the refined images are fused and then processed by a 3 × 3 convolution followed by four-times bilinear upsampling, restoring the original image resolution and yielding the final predicted segmentation map.
The specific implementation of S4 comprises the following steps:
S4.1: the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map in the data set is computed, the parameter weights in the model are updated with the backpropagation algorithm, and the final semantic segmentation model is obtained by training on the training set;
S4.2: the prediction performance of the model is evaluated with the mIoU metric on the test set of the data set.
Owing to the above technical scheme, the invention has the following advantages and effects compared with the prior art:
The method fully considers the benefits and drawbacks of dilated convolution for semantic segmentation, improves on the existing way of using dilated convolution, and designs a deformable spatial pyramid structure, which improves the image feature extraction of the model. Meanwhile, compared with common upsampling methods, a multi-level feature information fusion structure is designed for recovering image resolution; the local and global information contained at different levels is fully exploited, boundary refinement is introduced, and the accuracy of image semantic segmentation is effectively improved.
Drawings
FIG. 1 is a flow chart of the overall semantic segmentation method proposed by the present invention;
FIG. 2 is a network model diagram of the overall semantic segmentation algorithm proposed by the present invention;
FIG. 3 is a structural diagram of the skip-connected dilated convolution module in the network architecture of the present invention;
FIG. 4 shows visualization results of the algorithm of the present invention on the Cityscapes data set.
Detailed Description
The invention is further described below with reference to the accompanying drawings and embodiments:
An image semantic segmentation method with improved dilated convolution and multi-level feature information fusion, as shown in FIG. 1, comprises the following steps:
S1: image features are extracted in a deep convolutional neural network using the improved dilated convolution method, as shown inside the dashed box "S1" in FIG. 2:
S1.1: first, ResNet-101 is used as the backbone network, and the improved skip-connected dilated convolution module is attached after the third downsampling module, where "Conv" stands for "Convolution" and denotes a convolution layer. FIG. 3 shows the specific structure of the module, which comprises three consecutive dilated convolution layers; the dilation rate of each convolution layer is chosen according to the resolution of the input image (the rates of the three layers in FIG. 3 are 2, 4, and 8 in turn), and skip connections are established between different convolution layers in the forward direction, which further enlarges the receptive field without further shrinking the image and reduces information loss. A sketch of this block follows;
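A minimal PyTorch sketch of the skip-connected dilated convolution block is given below (the patent's experiments use TensorFlow; PyTorch is used here only for brevity). The channel width, the batch normalization, and the element-wise additive merge of the skip connections are assumptions — the patent specifies only three consecutive dilated convolutions with rates 2, 4, and 8 joined by forward skip connections.

```python
import torch
import torch.nn as nn

class SkipDilatedBlock(nn.Module):
    """Three consecutive dilated 3x3 convolutions (rates 2, 4, 8) joined
    by forward skip connections, per S1.1. Channel width, BatchNorm, and
    the additive merge are assumptions not fixed by the patent."""
    def __init__(self, channels=1024, rates=(2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                          bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])

    def forward(self, x):
        out = x
        for layer in self.layers:
            # forward skip: each layer's input is carried past it by addition
            out = layer(out) + out
        return out

# usage: a feature map such as ResNet-101 would emit after its third
# downsampling stage (the shape here is illustrative)
feats = torch.randn(1, 1024, 64, 64)
print(SkipDilatedBlock()(feats).shape)  # torch.Size([1, 1024, 64, 64])
```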
S1.2: the image processed by the skip-connected dilated convolution module is input into the improved deformable spatial pyramid pooling module, which in FIG. 2 consists of three dilated convolution layers, one deformable convolution layer, and a max pooling layer; the ability of deformable convolution to adapt the receptive field to target scale changes and flexibly aggregate information is combined with the ability of multi-scale dilated convolution on a regular sampling grid to classify arbitrary regions of the image effectively, improving the model's capacity to learn target deformation at low cost in model complexity. A sketch of this module follows;
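The deformable spatial pyramid pooling module might be sketched as follows, using torchvision's DeformConv2d. The dilation rates (6, 12, 18), the channel widths, and the use of global max pooling with bilinear re-expansion for the pooling branch are assumptions; the patent fixes only the branch types (three dilated convolutions, one deformable convolution, one max pooling layer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformableSPP(nn.Module):
    """Deformable spatial pyramid pooling per S1.2: three dilated 3x3
    branches, one deformable 3x3 branch, and a max-pooling branch,
    concatenated and projected by a 1x1 convolution. Rates, widths, and
    the global-pooling design are assumptions, not taken from the patent."""
    def __init__(self, in_ch=1024, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.dilated = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        # sampling offsets for the deformable branch are predicted from x
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, 3, padding=1)
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [conv(x) for conv in self.dilated]
        branches.append(self.deform(x, self.offset(x)))
        # image-level max pooling, re-expanded to the feature map size
        pooled = self.pool_proj(F.adaptive_max_pool2d(x, 1))
        branches.append(F.interpolate(pooled, size=(h, w),
                                      mode='bilinear', align_corners=False))
        return self.project(torch.cat(branches, dim=1))
```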
S1.3: the feature information of different levels contained in feature images of different resolutions at different stages of the downsampling process is retained.
S2: the extracted deep feature images are cascaded and fused with the shallow feature images to compensate for the loss of spatial information, as shown inside the dashed box "S2" in FIG. 2;
S2.1: as shown in FIG. 2, the feature layer processed by the skip-connected dilated convolution module is passed through a 1 × 1 convolution and combined with the feature image extracted at the deepest layer of the network model, where "C" in the figure denotes concatenation, i.e., the fusion of feature images from different levels, used to compensate for the semantic information of the shallow feature image; the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output;
S2.2: the feature map output in S2.1 is combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature map is passed through a 1 × 1 convolution and taken as this layer's output;
S2.3: the feature image output in S2.2 is upsampled by a factor of two via bilinear interpolation ("upsample by 2" in FIG. 2), combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output;
S2.4: the feature image output in S2.3 is upsampled by a factor of two via bilinear interpolation, combined with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and the combined feature image is passed through a 1 × 1 convolution and taken as this layer's output. A sketch of one such fusion stage follows.
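Each of the four stages S2.1–S2.4 repeats one pattern: optionally upsample the deeper map by a factor of two, fuse it with the shallower map, and project with a 1 × 1 convolution. A hedged sketch of one such stage, with concatenation assumed as the fusion operator ("C" in FIG. 2) and placeholder channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStage(nn.Module):
    """One S2 fusion stage: optionally upsample the deeper map by 2x
    (bilinear, as in S2.3/S2.4), concatenate it with the shallower map
    ("C" in FIG. 2), and project with a 1x1 convolution. Channel counts
    are placeholders; the patent specifies the operations, not widths."""
    def __init__(self, deep_ch, shallow_ch, out_ch, upsample=False):
        super().__init__()
        self.upsample = upsample
        self.proj = nn.Conv2d(deep_ch + shallow_ch, out_ch, kernel_size=1)

    def forward(self, deep, shallow):
        if self.upsample:
            deep = F.interpolate(deep, scale_factor=2, mode='bilinear',
                                 align_corners=False)
        return self.proj(torch.cat([deep, shallow], dim=1))

# usage: fuse a deep 1/8-resolution map into a shallow 1/4-resolution map
stage = FusionStage(deep_ch=256, shallow_ch=256, out_ch=256, upsample=True)
out = stage(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 256, 64, 64])
```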
S3: boundary information of the multi-stage processed feature images is learned through boundary refinement; the images are fused and restored to the original image resolution to generate the predicted segmentation map, as shown inside the dashed box "S3" in FIG. 2;
S3.1: the deepest output feature image is upsampled by a factor of four via bilinear interpolation;
S3.2: the layer output feature image from S2.1 is upsampled by a factor of four via bilinear interpolation;
S3.3: the layer output feature image from S2.2 is upsampled by a factor of four via bilinear interpolation;
S3.4: the layer output feature image from S2.3 is upsampled by a factor of two via bilinear interpolation;
S3.5: the boundaries of the output feature images from S2.4, S3.1, S3.2, S3.3, and S3.4 are refined by a BR (Boundary Refinement) module; the refined images are fused and then processed by a 3 × 3 convolution followed by four-times bilinear upsampling, restoring the original image resolution and yielding the final predicted segmentation map. A sketch of the BR module and the fusion head follows.
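The patent names the BR module without disclosing its internals; the residual form below (x plus a conv–ReLU–conv branch), familiar from boundary-refinement blocks in the literature, is an assumption, as is summation as the fusion operator across levels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryRefine(nn.Module):
    """BR block. The residual form x + conv(relu(conv(x))) is a common
    boundary-refinement design and is assumed here."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class FusionHead(nn.Module):
    """S3.5 sketch: refine each level (all brought to 1/4 of the input
    resolution by S3.1-S3.4), fuse by summation (an assumption), then a
    3x3 convolution and 4x bilinear upsampling back to full resolution."""
    def __init__(self, ch=256, num_classes=21):
        super().__init__()
        self.br = BoundaryRefine(ch)  # shared across levels for brevity
        self.head = nn.Conv2d(ch, num_classes, 3, padding=1)

    def forward(self, levels):
        fused = sum(self.br(f) for f in levels)
        return F.interpolate(self.head(fused), scale_factor=4,
                             mode='bilinear', align_corners=False)
```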
S4: the network is trained with a cross-entropy loss function, and model performance is evaluated with mIoU.
S4.1: the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map in the data set is computed, the parameter weights in the model are updated with the backpropagation algorithm, and the final semantic segmentation model is obtained by training on the training set of the data set;
S4.2: the prediction performance of the model is evaluated on the test set of the data set using pixel accuracy and the mIoU metric. A sketch of these two steps follows.
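A sketch of the training and evaluation steps in S4; the optimizer, the ignore label (255, the standard void label in PASCAL VOC), and the confusion-matrix bookkeeping are assumptions beyond what the patent states:

```python
import torch
import torch.nn as nn

def train_step(model, images, labels, optimizer):
    """S4.1 sketch: cross-entropy loss between the predicted segmentation
    map and the ground-truth map, with backpropagation. The optimizer and
    the ignore label (255, the PASCAL VOC void label) are assumptions."""
    criterion = nn.CrossEntropyLoss(ignore_index=255)
    logits = model(images)            # (N, C, H, W) class scores
    loss = criterion(logits, labels)  # labels: (N, H, W) class indices
    optimizer.zero_grad()
    loss.backward()                   # backpropagation updates the weights
    optimizer.step()
    return loss.item()

def mean_iou(confusion):
    """S4.2 sketch: mIoU from a (C, C) confusion matrix accumulated over
    the test set; per-class IoU is intersection over union, then averaged."""
    intersection = confusion.diag().float()
    union = (confusion.sum(0) + confusion.sum(1) - confusion.diag()).float()
    return (intersection / union.clamp(min=1)).mean().item()
```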
The following experiments were conducted in accordance with the method of the present invention to illustrate its prediction performance.
Test environment: Ubuntu 16.04; NVIDIA GTX 1080Ti GPU; Python 3.5; TensorFlow framework.
Test data set: the selected data set is PASCAL VOC 2012, an image data set for image segmentation in computer vision tasks, covering four categories (vehicles, household objects, animals, and persons), further subdivided into 20 sub-classes plus a background class. The data set contains 1464 training images, 1449 validation images, and 1456 test images.
Test metric: the invention uses mIoU as the performance evaluation metric. mIoU is the mean, taken over all classes, of the ratio of the intersection of the predicted and ground-truth regions to their union. Comparing this metric with values computed by prior-art algorithms demonstrates the improved results obtained in the field of image semantic segmentation.
The test results were as follows:
table 1 shows the performance comparison of the method under different cavity convolution cavity rates of the deformable space pyramid pooling module design, and the proper parameter setting can improve the network performance through comparison
Figure BDA0002996698810000051
Table 2. Performance comparison of the invention with the multi-level feature information fusion and boundary refinement modules added, demonstrating the effectiveness of the network design.
[Table 2 is rendered as an image in the original publication; its data are not reproduced here.]
Table 3. Performance comparison of the invention with other algorithms on the PASCAL VOC 2012 data set.
[Table 3 is rendered as an image in the original publication; its data are not reproduced here.]
As can be seen from the comparison data, the mIoU of the invention is significantly improved compared with existing algorithms.
It is emphasized that the examples described herein are illustrative and are intended to enable one of ordinary skill in the art to understand and practice the disclosed invention; the invention is not limited to the examples given in the detailed description. All equivalent changes or modifications made according to the spirit of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. An image semantic segmentation method with improved dilated convolution and multi-level feature information fusion, characterized in that the method comprises the following steps:
S1: extracting image features in a deep convolutional neural network by using an improved dilated convolution method;
S2: cascading and fusing the extracted deep feature images with shallow feature images to compensate for the loss of spatial information;
S3: learning boundary information of the multi-stage processed feature images through boundary refinement, fusing them, and restoring the original image resolution to generate a predicted segmentation map;
S4: training the network with a cross-entropy loss function and evaluating model performance with mIoU;
the specific implementation of S1 comprises the following steps:
S1.1, using ResNet-101 as the backbone network and attaching an improved skip-connected dilated convolution module after the third downsampling module, the module comprising three consecutive dilated convolution layers, choosing the dilation rate of the convolution layers according to the resolution of the input image, and establishing skip connections between different convolution layers in the forward direction, thereby further enlarging the receptive field without further shrinking the image and reducing information loss;
S1.2, inputting the image processed by the skip-connected dilated convolution module into an improved deformable spatial pyramid pooling module, combining the ability of deformable convolution to adapt the receptive field to target scale changes and flexibly aggregate information with the ability of multi-scale dilated convolution on a regular sampling grid to classify arbitrary regions of the image effectively, and improving the model's capacity to learn target deformation at low cost in model complexity;
S1.3, retaining the feature information of different levels contained in feature images of different resolutions at different stages of the downsampling process.
2. The image semantic segmentation method with improved dilated convolution and multi-level feature information fusion according to claim 1, characterized in that the specific implementation of S2 comprises the following steps:
S2.1, passing the feature layer processed by the skip-connected dilated convolution module through a 1 × 1 convolution, combining it with the feature image extracted at the deepest layer to compensate for the semantic information of the shallow feature image, and passing the combined feature image through a 1 × 1 convolution as this layer's output;
S2.2, combining the feature map output in S2.1 with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and passing the combined feature map through a 1 × 1 convolution as this layer's output;
S2.3, upsampling the feature image output in S2.2 by a factor of two via bilinear interpolation, combining it with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and passing the combined feature image through a 1 × 1 convolution as this layer's output;
S2.4, upsampling the feature image output in S2.3 by a factor of two via bilinear interpolation, combining it with the feature image output by the preceding module to compensate for the semantic information of the shallow feature image, and passing the combined feature image through a 1 × 1 convolution as this layer's output.
3. The image semantic segmentation method with improved dilated convolution and multi-level feature information fusion according to claim 1, characterized in that, in order to fuse feature images of different levels carrying different spatial and semantic information at the same resolution, the output feature images of multiple levels are uniformly brought to one quarter of the original image resolution by bilinear interpolation, and the specific implementation of S3 comprises the following steps:
S3.1, upsampling the deepest output feature image by a factor of four via bilinear interpolation;
S3.2, upsampling the layer output feature image from S2.1 by a factor of four via bilinear interpolation;
S3.3, upsampling the layer output feature image from S2.2 by a factor of four via bilinear interpolation;
S3.4, upsampling the layer output feature image from S2.3 by a factor of two via bilinear interpolation;
S3.5, refining the boundaries of the output feature images from S2.4, S3.1, S3.2, S3.3, and S3.4 with a BR module, fusing them, then applying a 3 × 3 convolution followed by four-times bilinear upsampling, and restoring the original image resolution to obtain the final predicted segmentation map.
4. The image semantic segmentation method with improved dilated convolution and multi-level feature information fusion according to claim 1, characterized in that the specific implementation of S4 comprises the following steps:
S4.1, computing the cross-entropy loss between the predicted segmentation map and the ground-truth segmentation map in the data set, updating the parameter weights in the model with the backpropagation algorithm, and training on the training set of the data set to obtain the final semantic segmentation model;
S4.2, evaluating the prediction performance of the model with the mIoU metric on the test set of the data set.
CN202110344461.0A 2021-03-29 2021-03-29 Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion Active CN113033570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344461.0A CN113033570B (en) 2021-03-29 2021-03-29 Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110344461.0A CN113033570B (en) 2021-03-29 2021-03-29 Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion

Publications (2)

Publication Number Publication Date
CN113033570A (en) 2021-06-25
CN113033570B true (en) 2022-11-11

Family

ID=76452856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344461.0A Active CN113033570B (en) 2021-03-29 2021-03-29 Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion

Country Status (1)

Country Link
CN (1) CN113033570B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506310B (en) * 2021-07-16 2022-03-01 首都医科大学附属北京天坛医院 Medical image processing method and device, electronic equipment and storage medium
CN113658200B (en) * 2021-07-29 2024-01-02 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113762476B (en) * 2021-09-08 2023-12-19 中科院成都信息技术股份有限公司 Neural network model for text detection and text detection method thereof
CN113762396A (en) * 2021-09-10 2021-12-07 西南科技大学 Two-dimensional image semantic segmentation method
CN113920099B (en) * 2021-10-15 2022-08-30 深圳大学 Polyp segmentation method based on non-local information extraction and related components
CN113936006A (en) * 2021-10-29 2022-01-14 天津大学 Segmentation method and device for processing high-noise low-quality medical image
CN114419449B (en) * 2022-03-28 2022-06-24 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN115829980B (en) * 2022-12-13 2023-07-25 深圳核韬科技有限公司 Image recognition method, device and equipment for fundus photo and storage medium
CN117211758B (en) * 2023-11-07 2024-04-02 克拉玛依市远山石油科技有限公司 Intelligent drilling control system and method for shallow hole coring

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876793A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN109190752A (en) * 2018-07-27 2019-01-11 国家新闻出版广电总局广播科学研究院 The image, semantic dividing method of global characteristics and local feature based on deep learning
CN109190626A (en) * 2018-07-27 2019-01-11 国家新闻出版广电总局广播科学研究院 A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN109461157A (en) * 2018-10-19 2019-03-12 苏州大学 Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110706239A (en) * 2019-09-26 2020-01-17 哈尔滨工程大学 Scene segmentation method fusing full convolution neural network and improved ASPP module
CN111242138A (en) * 2020-01-11 2020-06-05 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711413B (en) * 2018-12-30 2023-04-07 陕西师范大学 Image semantic segmentation method based on deep learning
CN110188817B (en) * 2019-05-28 2021-02-26 厦门大学 Real-time high-performance street view image semantic segmentation method based on deep learning
CN111369563B (en) * 2020-02-21 2023-04-07 华南理工大学 Semantic segmentation method based on pyramid void convolutional network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN108876793A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN109190752A (en) * 2018-07-27 2019-01-11 国家新闻出版广电总局广播科学研究院 The image, semantic dividing method of global characteristics and local feature based on deep learning
CN109190626A (en) * 2018-07-27 2019-01-11 国家新闻出版广电总局广播科学研究院 A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN109461157A (en) * 2018-10-19 2019-03-12 苏州大学 Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN110706239A (en) * 2019-09-26 2020-01-17 哈尔滨工程大学 Scene segmentation method fusing full convolution neural network and improved ASPP module
CN111242138A (en) * 2020-01-11 2020-06-05 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN112446890A (en) * 2020-10-14 2021-03-05 浙江工业大学 Melanoma segmentation method based on void convolution and multi-scale fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Peng Zhou et al.; "DeepStrip: High Resolution Boundary Refinement"; arXiv; 2020-03-25; entire document *
Sachin Mehta et al.; "ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation"; SpringerLink; 2018-12-31; entire document *
Liang-Chieh Chen et al.; "Rethinking Atrous Convolution for Semantic Image Segmentation"; arXiv; 2017-12-05; entire document *

Also Published As

Publication number Publication date
CN113033570A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113033570B (en) Image semantic segmentation method with improved dilated convolution and multi-level feature information fusion
CN109461157B (en) Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
CN112329658B (en) Detection algorithm improvement method for YOLOV3 network
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN112435282A (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112329801B (en) Convolutional neural network non-local information construction method
CN113743269B (en) Method for recognizing human body gesture of video in lightweight manner
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113516133A (en) Multi-modal image classification method and system
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN115410030A (en) Target detection method, target detection device, computer equipment and storage medium
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN112232102A (en) Building target identification method and system based on deep neural network and multitask learning
CN113450364B (en) Tree-shaped structure center line extraction method based on three-dimensional flux model
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant