CN113902925A - Semantic segmentation method and system based on deep convolutional neural network

Info

Publication number: CN113902925A
Application number: CN202111245617.6A
Authority: CN
Inventors: 汪春梅; 李康; 袁非牛
Original assignee: Shanghai Normal University
Current assignee: Shanghai Normal University; University of Shanghai for Science and Technology
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-01-07

Abstract

The invention relates to a semantic segmentation method and system based on a deep convolutional neural network. The method comprises model training and model application. Model training comprises the following steps: acquiring a semantic segmentation image data set and preprocessing its images; building a semantic segmentation network model that uses a modified ResNet50 backbone network as the encoder and a multi-scale mixed pool structure together with a feature attention fusion module as the decoder; and training the model and setting its network parameters with the preprocessed data set to obtain the trained semantic segmentation network model. Compared with the prior art, the method effectively links encoding and decoding, fully extracts correlated information both across stages and within a stage, effectively fuses low-level and high-level features, captures long-range dependencies and rich context information, and segments images efficiently and accurately.

Description

Semantic segmentation method and system based on deep convolutional neural network

Technical Field

The invention relates to the technical field of image processing and neural networks, in particular to a semantic segmentation method and a semantic segmentation system based on a deep convolutional neural network.

Background

Semantic segmentation is an important research area in computer vision and one of the key technologies for scene understanding. It aims to assign a semantic category label to every pixel in an image, parsing a scene image into regions associated with semantic categories. It is widely used in autonomous driving, medical image analysis and object detection. Semantic segmentation is challenging because it must combine dense pixel-level accuracy with multi-scale contextual reasoning, must cope with large variations in content, shape and scale within the same class of object as well as high similarity between different classes, and must also handle objects whose boundary regions are easily confused.

In recent years, most state-of-the-art semantic segmentation methods have relied on deep convolutional neural networks (CNNs), which show impressive capability on a variety of complex challenges and achieve end-to-end, full-image segmentation with higher accuracy than traditional methods. A deep CNN extracts dense semantic representations from the input image and predicts pixel-level labels; with proper training, it acquires rich scene information through stacked convolution, pooling and nonlinear activation operations. However, because convolution is a local operation, convolutional features typically have a limited receptive field. Even in methods that use a large receptive field, the extracted features mainly describe the core region of an object and largely ignore the context around its boundary. Furthermore, objects of different classes may have similar local features; for example, tables and chairs may share similar local textures. Such detail information makes learning difficult for deep networks and in turn degrades segmentation quality.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a semantic segmentation method and a semantic segmentation system based on a deep convolutional neural network.

The purpose of the invention can be realized by the following technical scheme:

a semantic segmentation method based on a deep convolutional neural network comprises model training and model application, wherein the model training comprises the following steps:

acquiring a semantic segmentation image data set, and preprocessing images in the semantic segmentation data set; building a semantic segmentation network model, taking a modified ResNet50 backbone network as an encoder, and taking a multi-scale mixed pool structure and a feature attention fusion module as a decoder, wherein the encoder performs feature extraction on an input image to obtain a feature map, and the decoder obtains a segmentation result map based on the feature map; performing model training and network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set to obtain a trained semantic segmentation network model;

the model application specifically comprises: performing semantic segmentation on an image by using the trained semantic segmentation network model to obtain a segmentation result map.

Further, the constructed semantic segmentation network model uses a variant ResNet50 backbone network as the encoder. The ResNet50 backbone network comprises four layers; the input image is fed to the first layer, and the output of each layer is fed to the next. The variant adds a channel attention module and uses dilated (atrous, or "hole") convolution in the third and fourth layers. The outputs of the second and third layers of the backbone network are fed to the channel attention module; the output of the channel attention module is combined with the output of the third layer by element-wise sum, and the result is fed to the fourth layer. The output of the fourth layer of the backbone network is the feature map extracted by the encoder.
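Dilated convolution (also rendered as "hole" or atrous convolution) spaces the kernel taps apart so that each output sees a wider span of the input without extra weights. The following is a minimal one-dimensional sketch of that idea only, not the patent's implementation; the function name `dilated_conv1d`, the 3-tap kernel and the sample signal are assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D cross-correlation with a dilated kernel.

    Returns the outputs and the number of input samples spanned by one output.
    """
    span = (len(kernel) - 1) * dilation + 1
    outputs = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for i, weight in enumerate(kernel):
            # Taps are `dilation` samples apart; the gaps are the "holes".
            acc += weight * signal[start + i * dilation]
        outputs.append(acc)
    return outputs, span


# Illustrative data (assumed): a ramp signal and a difference kernel.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]

dense, dense_span = dilated_conv1d(signal, kernel, dilation=1)
holes, holes_span = dilated_conv1d(signal, kernel, dilation=2)
```

With the same three weights, a dilation of 2 widens the span from 3 to 5 input samples, which is why dilation in the later layers enlarges the receptive field without further downsampling.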

Further, the constructed semantic segmentation network model uses a multi-scale mixed pool structure and a feature attention fusion module as the decoder. The input of the multi-scale mixed pool structure is the output of the fourth layer of the backbone network, and the output of the multi-scale mixed pool structure is fed, together with the output of the third layer of the backbone network, to the feature attention fusion module. The multi-scale mixed pool structure comprises a conventional pooling module and an unconventional pooling module, each suited to the characteristics of different objects, which improves the segmentation of irregularly shaped objects to a certain extent.

Further, the conventional pooling module comprises m adaptive average pooling blocks of size k × k, where k > 0 (for example k = 1, 2, …, 6); the feature map output by the fourth layer of the backbone network is reduced in dimension and then fed to the m adaptive average pooling blocks. The unconventional pooling module comprises an s × 1 pooling block and a 1 × s pooling block; the feature map output by the fourth layer of the backbone network is reduced in dimension and then fed to the s × 1 and 1 × s pooling blocks, and the output of the unconventional pooling module is restored to size s × s through matrix multiplication and convolution. Finally, the feature maps output by the conventional pooling module and the unconventional pooling module are upsampled back to the size of the input feature map, then merged and reduced in dimension to give the output of the multi-scale mixed pool structure, namely the mixed-pooled feature map.

Further, the feature attention fusion module is specifically described as follows: the output of the multi-scale mixed pool structure is a feature map X₁, and the output of the third layer of the backbone network is a feature map X₂. In the feature attention fusion module, X₂ passes through global average pooling, convolution, batch normalization and a Sigmoid activation function to generate attention coefficients, which are used to reweight each channel of X₁. In other words, the module applies squeeze and excitation to the feature map output by the third layer of the backbone network and uses the resulting attention coefficients to weight the feature map output by the multi-scale mixed pool structure, which further improves the segmentation accuracy of the semantic segmentation network model.
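The channel reweighting of X₁ by coefficients derived from X₂ can be illustrated with a small sketch. It rests on simplifying assumptions: feature maps are nested lists indexed as channel, row, column; the convolution on the pooled vector is modelled as a plain linear map (a 1 × 1 convolution on a 1 × 1 map); and batch normalization is treated as identity. The function names and the weights are hypothetical.

```python
import math


def global_avg_pool(fmaps):
    """Squeeze: one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmaps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feature_attention_fusion(x1, x2, weight, bias):
    """Scale each channel of x1 by a coefficient computed from x2.

    weight has one row per channel of x1 and one column per channel of x2.
    Batch normalization is omitted (assumed identity) in this sketch.
    """
    squeezed = global_avg_pool(x2)
    logits = [sum(w * v for w, v in zip(w_row, squeezed)) + b
              for w_row, b in zip(weight, bias)]
    coeffs = [sigmoid(z) for z in logits]
    fused = [[[c * v for v in row] for row in ch] for ch, c in zip(x1, coeffs)]
    return fused, coeffs


# Illustrative inputs (assumed): two channels each, 2 x 2 spatial size.
x1 = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
x2 = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
weight = [[0.0, 0.0], [1.0, -0.5]]
bias = [0.0, 0.0]

fused, coeffs = feature_attention_fusion(x1, x2, weight, bias)
```

Because the coefficients lie between 0 and 1, the module can only attenuate or pass channels of X₁; it does not change their spatial layout.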

Furthermore, the constructed semantic segmentation network model also comprises a context embedding block, a global average pooling layer and a classification module. The feature map output by the encoder is fed to the context embedding block and then passes through the global average pooling layer and the classification module in sequence. The classification module comprises a fully connected layer and a Sigmoid function; during network training, the fully connected layer learns global information about the corresponding features so as to emphasize useful features and suppress useless ones. The output of the classification module is multiplied element-wise with the output of the feature attention fusion module, which adjusts the relevant parameters of the semantic segmentation network model, optimizes network performance, and refines the segmentation of the image to produce the segmentation result map.

Furthermore, two types of auxiliary loss are introduced. For classification assistance, the output of the classification module is compared with the Classification layer (the classification label), which further refines the segmentation result of the image. For segmentation assistance, the output of the semantic segmentation network model, namely the segmentation result map, is compared with the Segmentation layer (the semantic label), which adjusts the training parameters of the semantic segmentation network model.
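One way the two supervision signals might be combined is sketched below. The text does not name the loss functions or their weighting, so everything specific here is an assumption: pixel-wise cross-entropy for the segmentation term, binary cross-entropy for the classification term, image-level class-presence labels derived from the mask, and a hypothetical weight `cls_weight=0.4`.

```python
import math


def presence_labels(mask, num_classes):
    """Image-level label: 1.0 for every class that occurs in the mask (assumed)."""
    present = {label for row in mask for label in row}
    return [1.0 if c in present else 0.0 for c in range(num_classes)]


def binary_cross_entropy(probs, targets, eps=1e-7):
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def pixel_cross_entropy(prob_maps, mask, eps=1e-7):
    """prob_maps[c][r][col] is the predicted probability of class c at a pixel."""
    total, count = 0.0, 0
    for r, row in enumerate(mask):
        for col, label in enumerate(row):
            total -= math.log(max(prob_maps[label][r][col], eps))
            count += 1
    return total / count


def total_loss(prob_maps, cls_probs, mask, num_classes, cls_weight=0.4):
    """Segmentation term plus a weighted classification term (weight assumed)."""
    seg = pixel_cross_entropy(prob_maps, mask)
    cls = binary_cross_entropy(cls_probs, presence_labels(mask, num_classes))
    return seg + cls_weight * cls


# Illustrative 2 x 2 example with three classes (assumed values).
mask = [[0, 1], [1, 1]]
perfect_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 1.0], [1.0, 1.0]],
                [[0.0, 0.0], [0.0, 0.0]]]
uniform_maps = [[[1 / 3, 1 / 3], [1 / 3, 1 / 3]] for _ in range(3)]

good = total_loss(perfect_maps, [1.0, 1.0, 0.0], mask, num_classes=3)
bad = total_loss(uniform_maps, [0.5, 0.5, 0.5], mask, num_classes=3)
```

The sketch only shows that both terms feed a single scalar objective; the actual loss definitions used in the embodiment are not stated.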

Further, preprocessing the images in the semantic segmentation data set specifically includes: applying cropping, flipping, translation and scaling operations to each image and its mask in the semantic segmentation data set, thereby expanding the data set.
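Because each image is transformed together with its mask, the same random parameters have to be applied to both so that pixels and labels stay aligned. The sketch below illustrates this for flipping and cropping only (translation and scaling are omitted); the function names, the 50% flip probability and the crop size are assumptions.

```python
import random


def hflip(grid):
    """Mirror a 2-D grid left to right."""
    return [list(reversed(row)) for row in grid]


def crop(grid, top, left, height, width):
    return [row[left:left + width] for row in grid[top:top + height]]


def augment_pair(image, mask, crop_size, rng):
    """Apply one shared random flip and crop to an image and its mask."""
    if rng.random() < 0.5:  # flip probability is an assumed value
        image, mask = hflip(image), hflip(mask)
    crop_h, crop_w = crop_size
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return (crop(image, top, left, crop_h, crop_w),
            crop(mask, top, left, crop_h, crop_w))


# Illustrative data (assumed): the mask is a known function of the image,
# so alignment can be checked after augmentation.
image = [[10 * r + c for c in range(6)] for r in range(5)]
mask = [[value % 2 for value in row] for row in image]

aug_image, aug_mask = augment_pair(image, mask, (3, 4), random.Random(0))
```

Drawing the random parameters once and reusing them for both inputs is the essential point; applying independent random transforms to the image and the mask would corrupt the labels.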

Further, the model training and the network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set specifically comprise:

taking the images in the preprocessed semantic segmentation image data set as input images and resizing them to a uniform size; the weights of the semantic segmentation network model are initialized with Kaiming initialization, and the model is trained with stochastic gradient descent (SGD) with momentum, using 30000 iterations, a weight decay of 1e-5, a momentum of 0.9, a batch size of 4, an initial learning rate of 0.001, and the Poly learning-rate schedule.
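The stated hyperparameters can be collected into a small sketch of the schedule and update rule. The constants come from the text; the Poly exponent `power=0.9` is an assumption (a commonly used value, not given in the text), and the update follows the usual SGD-with-momentum form with weight decay added to the gradient. Function names are hypothetical.

```python
import math

BASE_LR = 0.001       # initial learning rate
MAX_ITERS = 30000     # number of training iterations
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-5
BATCH_SIZE = 4


def poly_lr(iteration, base_lr=BASE_LR, max_iters=MAX_ITERS, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power


def kaiming_std(fan_in):
    """Standard deviation of Kaiming normal initialization for ReLU layers."""
    return math.sqrt(2.0 / fan_in)


def sgd_momentum_step(w, grad, velocity, lr,
                      momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One scalar SGD-with-momentum update; returns (new_w, new_velocity)."""
    grad = grad + weight_decay * w
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity


lr_start = poly_lr(0)
lr_mid = poly_lr(MAX_ITERS // 2)
lr_end = poly_lr(MAX_ITERS)
new_w, new_v = sgd_momentum_step(1.0, 0.5, 0.0, lr=0.1, weight_decay=0.0)
```

The schedule starts at the initial learning rate and decays smoothly to zero at the final iteration, so the later iterations make progressively smaller updates.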

A semantic segmentation system based on a deep convolutional neural network, comprising:

the image acquisition module is used for acquiring a scene image and preprocessing the scene image;

the semantic segmentation module is used for performing semantic segmentation on the scene image by using the trained semantic segmentation network model and outputting a segmentation result graph;

the training process of the semantic segmentation network model specifically comprises the following steps:

acquiring a semantic segmentation image data set, and preprocessing images in the semantic segmentation data set; building a semantic segmentation network model, taking a modified ResNet50 backbone network as an encoder, and taking a multi-scale mixed pool structure and a feature attention fusion module as a decoder, wherein the encoder performs feature extraction on an input image to obtain a feature map, and the decoder obtains a segmentation result map based on the feature map; and performing model training and network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set to obtain the trained semantic segmentation network model.

Further, the constructed semantic segmentation network model uses a variant ResNet50 backbone network as the encoder. The variant is a ResNet50 backbone network that comprises a channel attention module and uses dilated (atrous, or "hole") convolution in the third and fourth layers. The outputs of the second and third layers of the backbone network are fed to the channel attention module; the output of the channel attention module is combined with the output of the third layer by element-wise sum, and the result is fed to the fourth layer. The output of the fourth layer of the backbone network is the feature map extracted by the encoder.

Compared with the prior art, the invention improves the encoder-decoder structure and designs dedicated modules that fully extract correlated information both between features and within features. Together, these modules extract rich multi-scale information and global information from different images, enabling dense classification of every pixel. Specifically, the two-input channel attention module captures internal correlations within the encoding stage; the multi-scale mixed pool structure effectively captures multi-scale features of different images; the feature attention fusion module captures external dependencies between the encoding and decoding stages; and the classification module further optimizes network performance and refines the segmentation result of the image. The invention can therefore perform semantic segmentation of images efficiently and accurately.

Drawings

FIG. 1 is a schematic diagram of a semantic segmentation network model;

FIG. 2 is a schematic diagram of the multi-scale mixed pool structure;

FIG. 3 is a flow chart of training of a semantic segmentation network model;

FIG. 4 is a graph of segmentation results obtained using a semantic segmentation network model;

FIG. 5 is a graph of segmentation results obtained using a semantic segmentation network model;

FIG. 6 is a graph of segmentation results obtained using a semantic segmentation network model;

FIG. 7 is a graph of segmentation results obtained using a semantic segmentation network model;

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

Example 1:

A semantic segmentation method based on a deep convolutional neural network comprises model training and model application. Model training is shown in FIG. 3. This embodiment is implemented in Python, using the open-source PyTorch framework to build the semantic segmentation network model; the model is trained on a semantic segmentation image data set to find the optimal model parameters. The method comprises the following steps:

acquiring a semantic segmentation image data set, and preprocessing images in the semantic segmentation data set; building a semantic segmentation network model, as shown in fig. 1, taking a ResNet50 backbone network of a variant as an encoder, and taking a multi-scale mixed pool structure and a feature attention fusion module as a decoder, wherein the encoder performs feature extraction on an input image to obtain a feature map, and the decoder obtains a segmentation result map based on the feature map; performing model training and network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set to obtain a trained semantic segmentation network model;

The preprocessing of the images in the semantic segmentation data set specifically comprises: cropping, flipping, translating and scaling each image and its mask in the data set (for example, scaling the length and width by a preset ratio). This expands the semantic segmentation data set and can improve the generalization ability of the semantic segmentation network model.

The constructed semantic segmentation network model is shown in FIG. 1:

1) A variant ResNet50 backbone network is used as the encoder. The ResNet50 backbone network comprises four layers; the input image is fed to the first layer, and the output of each layer is fed to the next. The variant adds a channel attention module and uses dilated (atrous, or "hole") convolution in the third and fourth layers. The outputs of the second and third layers of the backbone network are fed to the channel attention module; the output of the channel attention module is combined with the output of the third layer by element-wise sum, and the result is fed to the fourth layer. The output of the fourth layer of the backbone network is the feature map extracted by the encoder.

2) A multi-scale mixed pool structure and a feature attention fusion module are used as the decoder. The input of the multi-scale mixed pool structure is the output of the fourth layer of the backbone network, and the output of the multi-scale mixed pool structure is fed, together with the output of the third layer of the backbone network, to the feature attention fusion module. The multi-scale mixed pool structure comprises a conventional pooling module and an unconventional pooling module, each suited to the characteristics of different objects, which improves the segmentation of irregularly shaped objects to a certain extent.

The multi-scale mixed pool structure is shown in FIG. 2. The feature map output by the encoder is fed into the multi-scale mixed pool structure. The conventional pooling module comprises m adaptive average pooling blocks of size k × k, where k > 0 (for example k = 1, 2, …, 6); the feature map output by the fourth layer of the backbone network is reduced in dimension and then fed to the m adaptive average pooling blocks. The unconventional pooling module comprises an s × 1 pooling block and a 1 × s pooling block; the feature map output by the fourth layer of the backbone network is reduced in dimension and then fed to the s × 1 and 1 × s pooling blocks, and the output of the unconventional pooling module is restored to size s × s through matrix multiplication and convolution. Finally, the feature maps output by the conventional pooling module and the unconventional pooling module are upsampled back to the size of the input feature map, that is, the size of the feature map output by the encoder, then merged and reduced in dimension to give the output of the multi-scale mixed pool structure, namely the mixed-pooled feature map, which is fed to the feature attention fusion module.

The feature attention fusion module is described in detail as follows: the output of the multi-scale mixed pool structure is a feature map X₁, and the output of the third layer of the backbone network is a feature map X₂. In the feature attention fusion module, X₂ passes through global average pooling, convolution, batch normalization and a Sigmoid activation function to generate attention coefficients, which are used to reweight each channel of X₁. In other words, the module applies squeeze and excitation to the feature map output by the third layer of the backbone network and uses the resulting attention coefficients to weight the feature map output by the multi-scale mixed pool structure, which further improves the segmentation accuracy of the semantic segmentation network model.

3) The constructed semantic segmentation network model also comprises a context embedding block, a global average pooling layer and a classification module. The feature map output by the encoder is fed to the context embedding block and then passes through the global average pooling layer and the classification module in sequence. The classification module comprises a fully connected layer and a Sigmoid function; during network training, the fully connected layer learns global information about the corresponding features so as to emphasize useful features and suppress useless ones. The output of the classification module and the output of the feature attention fusion module are combined by element-wise product, which adjusts the relevant parameters of the semantic segmentation network model, optimizes network performance, and refines the segmentation of the image to produce the segmentation result map; the segmentation result map is then restored to the size of the input image.

Two types of auxiliary loss are introduced, where the Classification layer is the classification label and the Segmentation layer is the semantic label. For classification assistance, the output of the classification module is compared with the Classification layer to further refine the segmentation result of the image; for segmentation assistance, the output of the semantic segmentation network model, namely the segmentation result map, is compared with the Segmentation layer to adjust the training parameters of the semantic segmentation network model.

The model training and network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set specifically comprise:

acquiring a semantic segmentation image data set, and preprocessing images in the semantic segmentation data set; building a semantic segmentation network model, taking a modified ResNet50 backbone network as an encoder, taking a multi-scale mixed pool structure and a feature attention fusion module as a decoder, carrying out feature extraction on an input image by the encoder to obtain a feature map, and obtaining a segmentation result map by the decoder based on the feature map; and performing model training and network parameter setting of the semantic segmentation network model by using the preprocessed semantic segmentation image data set to obtain the trained semantic segmentation network model.

Taking real-time semantic segmentation of video as an example: the camera function of the OpenCV (Open Source Computer Vision Library) library is called to read a live picture; the captured video is then processed frame by frame, with the image format converted as needed; each preprocessed frame is input to the trained semantic segmentation network model, which outputs the corresponding segmentation mask in real time. FIGS. 4, 5, 6 and 7 show the segmentation results for different street-view images.
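The frame-by-frame flow can be sketched without a camera by accepting any capture object whose `read()` method returns a success flag and a frame, which is the shape of OpenCV's `cv2.VideoCapture.read()`. The sketch below makes several assumptions: `FakeCapture` is a hypothetical stand-in for a real camera, the format conversion is taken to be a BGR-to-RGB channel swap, and `model` is any callable that maps a frame to a mask.

```python
def read_frames(capture):
    """Yield frames until the capture object reports that reading failed."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        yield frame


def bgr_to_rgb(frame):
    """Reverse the channel order of every pixel (assumed format conversion)."""
    return [[pixel[::-1] for pixel in row] for row in frame]


def segment_stream(capture, model, preprocess=bgr_to_rgb):
    """Yield one model output per captured frame, in arrival order."""
    for frame in read_frames(capture):
        yield model(preprocess(frame))


class FakeCapture:
    """Hypothetical stand-in for a camera; replays a fixed list of frames."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


# Two 1 x 1 "frames" with BGR pixels, and a toy model that returns the
# first channel of the first pixel after conversion (illustrative only).
frames = [[[(1, 2, 3)]], [[(4, 5, 6)]]]
outputs = list(segment_stream(FakeCapture(frames), lambda img: img[0][0][0]))
```

With OpenCV installed, the capture object would be `cv2.VideoCapture(0)` and the conversion would normally be `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)`; the generator structure stays the same.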

The invention provides a segmentation method based on a variant ResNet50 backbone network and mixed pooling. During training, the preprocessed semantic segmentation images are fed to a deeper encoder so that the network can learn more image feature information. In addition, a channel attention module is added, and transfer learning is used to initialize the network, which speeds up the learning of image features and hence model convergence.

In the decoding stage, a multi-scale mixed pool structure is designed, comprising a conventional pooling module and an unconventional pooling module; mixed pool modules (MMP) of different scales are connected in parallel to aggregate multi-scale feature maps efficiently. A feature attention fusion module is built after the multi-scale mixed pool structure: it applies squeeze and excitation to the feature map output by the third layer of the backbone network in the encoding stage and generates attention coefficients that weight the feature map output by the multi-scale mixed pool structure in the decoding stage, further improving the segmentation accuracy of the network.

In addition, a context embedding block, a global average pooling layer and a classification module are established in parallel with the decoder; the output of the classification module and the output of the feature attention fusion module are multiplied element-wise to obtain the segmentation result map, which enhances the segmentation effect.

Overall, although objects of different classes may have similar local features and the associated detail information makes segmentation harder, the method effectively links encoding and decoding, fully extracts correlated information across stages and within a stage, effectively fuses low-level and high-level features, and captures long-range dependencies and rich context information. The invention can therefore segment images efficiently and accurately.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A semantic segmentation method based on a deep convolutional neural network is characterized by comprising model training and model application, wherein the model training comprises the following steps:

2. The semantic segmentation method based on a deep convolutional neural network according to claim 1, wherein the constructed semantic segmentation network model uses a variant ResNet50 backbone network as the encoder; the variant ResNet50 backbone network is a ResNet50 backbone network that comprises a channel attention module and uses dilated (atrous) convolution in the third and fourth layers; the outputs of the second and third layers of the backbone network are fed to the channel attention module; the output of the channel attention module and the output of the third layer of the backbone network are combined by element-wise sum and fed to the fourth layer of the backbone network; and the output of the fourth layer of the backbone network is the feature map extracted by the encoder.

3. The deep convolutional neural network-based semantic segmentation method according to claim 2, wherein a built semantic segmentation network model adopts a multi-scale mixed pool structure and a feature attention fusion module as a decoder, the multi-scale mixed pool structure comprises a conventional pooling module and an unconventional pooling module, the input of the multi-scale mixed pool structure is the output of the fourth layer of the backbone network, and the output of the multi-scale mixed pool structure and the output of the third layer of the backbone network are sent to the feature attention fusion module together.

4. The semantic segmentation method based on a deep convolutional neural network according to claim 3, wherein the conventional pooling module comprises m adaptive average pooling blocks of size k × k, k > 0, and the unconventional pooling module uses s × 1 and 1 × s pooling.

5. The semantic segmentation method based on a deep convolutional neural network according to claim 3, wherein the feature attention fusion module is specifically described as follows: the output of the multi-scale mixed pool structure is a feature map X₁, and the output of the third layer of the backbone network is a feature map X₂; in the feature attention fusion module, the feature map X₂ passes through global average pooling, convolution, batch normalization and a Sigmoid activation function to generate attention coefficients, which are used to reweight each channel of the feature map X₁.

6. The semantic segmentation method based on a deep convolutional neural network according to claim 1, wherein the constructed semantic segmentation network model further comprises a context embedding block, a global average pooling layer and a classification module; the feature map output by the encoder is fed to the context embedding block and then passes through the global average pooling layer and the classification module in sequence, wherein the classification module comprises a fully connected layer and a Sigmoid function; and the output of the classification module and the output of the feature attention fusion module are multiplied element-wise to obtain the segmentation result map.

7. The semantic segmentation method based on a deep convolutional neural network according to claim 1, wherein preprocessing the images in the semantic segmentation data set specifically comprises: applying cropping, flipping, translation and scaling operations to each image and its mask in the semantic segmentation data set, thereby expanding the semantic segmentation data set.

8. The semantic segmentation method based on the deep convolutional neural network as claimed in claim 1, wherein the model training and network parameter setting of the semantic segmentation network model using the preprocessed semantic segmentation image dataset specifically comprise:

9. A semantic segmentation system based on a deep convolutional neural network, which is based on the semantic segmentation method based on the deep convolutional neural network as claimed in any one of claims 1 to 8, and comprises:

10. The semantic segmentation system based on a deep convolutional neural network according to claim 9, wherein the constructed semantic segmentation network model uses a variant ResNet50 backbone network as the encoder; the variant ResNet50 backbone network is a ResNet50 backbone network that comprises a channel attention module and uses dilated (atrous) convolution in the third and fourth layers; the outputs of the second and third layers of the backbone network are fed to the channel attention module; the output of the channel attention module and the output of the third layer of the backbone network are combined by element-wise sum and fed to the fourth layer of the backbone network; and the output of the fourth layer of the backbone network is the feature map extracted by the encoder.