CN112241959A - Attention mechanism generation semantic segmentation method based on superpixels - Google Patents

Attention mechanism generation semantic segmentation method based on superpixels

Info

Publication number: CN112241959A
Authority: CN (China)
Prior art keywords: pixel, attention mechanism, superpixel, channel, pooling
Legal status: Pending (an assumed status, not a legal conclusion)
Application number: CN202011011881.9A
Original language: Chinese (zh)
Inventors: 李亮 (Li Liang), 李亚军 (Li Yajun), 王凯 (Wang Kai), 彭俊杰 (Peng Junjie)
Current assignee: Tianjin University
Original assignee: Tianjin University
Filing date / priority date: 2020-09-23
Publication date: 2021-01-19
Application filed by Tianjin University

Classifications

    • G06T 7/11 Region-based segmentation (image analysis: segmentation; edge detection)
    • G06F 18/22 Matching criteria, e.g. proximity measures (pattern recognition)
    • G06F 18/23 Clustering techniques (pattern recognition)
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (pattern recognition)
    • G06F 18/25 Fusion techniques (pattern recognition)
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/08 Learning methods (neural networks)
    • G06T 2207/10004 Still image; photographic image (indexing scheme for image analysis: image acquisition modality)


Abstract

The invention relates to deep learning technology and semantic segmentation, and provides a semantic segmentation generation method with low computational cost. In the adopted technical scheme, semantic segmentation is generated by a superpixel-based attention mechanism: the similarity originally computed between each pixel and all other pixels is instead computed between each pixel and all superpixels; the results of spatial attention encoding and channel attention encoding are fused to produce the final semantic segmentation. The method is mainly applied to semantic segmentation scenarios.

Description

Attention mechanism generation semantic segmentation method based on superpixels
Technical Field
The invention relates to deep learning technology, in particular to superpixels and the attention mechanism in deep learning, and completes semantic segmentation by combining the characteristics of the two.
Background
Semantic segmentation is a fundamental task in computer vision whose purpose is to classify the pixels in an image, assigning a class label to each pixel. The image segmentation problem has attracted growing interest in computer vision in recent years, and more and more application scenarios, such as autonomous driving, virtual reality, and intelligent robotics, require accurate and efficient segmentation techniques.
The earliest successful deep learning technique applied to semantic segmentation was the fully convolutional network (FCN). It uses a convolutional neural network as the basic feature-extraction framework and converts a classification network model (such as VGG-16) into a fully convolutional model: the fully connected layers are converted into convolutional layers to generate dense pixel-level features, and high-level and low-level semantic features are then combined to generate pixel-level labels. This work is seen as a landmark, showing how a CNN (convolutional neural network) can be trained end-to-end for semantic segmentation. In subsequent work, dilated (atrous) convolution and multi-scale methods were adopted to obtain contextual semantic information, greatly improving segmentation accuracy.
The 2017 paper Non-local Neural Networks (Wang et al.) proposed a self-attention method to obtain global context information. The self-attention mechanism computes the similarity between each pixel vector and all other pixels, so that context information over the whole image is introduced at each local position. This greatly improves the accuracy of semantic segmentation, but it also introduces a new problem: computing semantic information between every position and all other positions to generate the attention map greatly increases the computation of the network, and this is the problem the present invention sets out to solve.
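To make the saving concrete, compare the size of the attention map in the two cases (the feature-map and superpixel counts below are illustrative assumptions, not figures from the patent):

```latex
% Pixel-to-pixel attention vs. pixel-to-superpixel attention (illustrative):
\mathcal{O}(N^{2}C) \;\longrightarrow\; \mathcal{O}(NKC),
\qquad N = H \times W,\quad K \ll N
% Example: a 96 x 96 feature map gives N = 9216; with K = 256 superpixels,
% the attention map shrinks from N^2 = 84{,}934{,}656 entries to
% NK = 2{,}359{,}296, a 36x reduction.
```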
Disclosure of Invention
In order to overcome the defects of the prior art and solve the problem of excessive network computation, the invention aims to provide a semantic segmentation generation method with low computational cost. The technical scheme adopted is a superpixel-based attention mechanism for generating semantic segmentation: the similarity originally computed between each pixel and all other pixels is instead computed between each pixel and all superpixels; the results of spatial attention encoding and channel attention encoding are then fused to finally generate the semantic segmentation.
The method comprises the following specific steps:
step 1, feature extraction: features are extracted with the residual network ResNet-101; the network has 101 layers in total, with a stride-2 convolution or pooling structure in layers 1, 2, and 7, so that the final feature map is 1/8 the size of the original image;
step 2, superpixel embedding: superpixels are generated with the simple linear iterative clustering (SLIC) method, a superpixel layer is embedded into the residual network ResNet structure, the feature map is pooled through the superpixel layer to obtain superpixel features, and the pooled superpixel features are embedded into the attention network;
step 3, attention mechanism: the attention mechanism is divided into a spatial attention mechanism and a channel attention mechanism; the spatial attention mechanism obtains global context information by computing the similarity between each pixel vector and the feature vectors at all other positions; the channel attention mechanism obtains semantic information among channels by computing the similarity between channels; the results of the spatial and channel attention mechanisms are then fused to finally obtain the semantic segmentation result;
in step 2, superpixel embedding: superpixels are generated with the simple linear iterative clustering algorithm SLIC, a superpixel layer is embedded into the ResNet structure, and the feature map is pooled through the superpixel layer to obtain a feature vector $g_i$ corresponding to each superpixel; the feature vector is the average pooling of the region corresponding to the superpixel:

$$g_i = \frac{1}{S_i} \sum_{k=1}^{S_i} x_k^i \tag{1}$$

where $x_k^i$ denotes the kth feature vector in the ith superpixel region and $S_i$ denotes the number of pixels in the ith superpixel region; this pooling operation is called superpixel pooling;
in step 3:

the spatial attention mechanism is as follows:

first, a feature map $A \in \mathbb{R}^{C\times H\times W}$ is acquired through the ResNet-101 network; A is then fed into three 1×1 convolutional layers to obtain three new feature maps B, C, D, where $\{B, C, D\} \in \mathbb{R}^{C\times H\times W}$; these are reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$; B and D are input to the superpixel pooling layer to obtain $v$ and $\theta$ respectively, where $\{v, \theta\} \in \mathbb{R}^{K\times C}$ and K is the number of superpixels on each map; a spatial attention matrix $S \in \mathbb{R}^{N\times K}$ is then computed by applying a normalized softmax layer:

$$s_{ij} = \frac{\exp(v_i \cdot C_j)}{\sum_{i=1}^{K} \exp(v_i \cdot C_j)} \tag{2}$$

where $v_i$ is the pooled feature of the ith superpixel from equation (1) and $C_j$ is the jth pixel in the feature map; $s_{ij}$ is the similarity between the jth pixel and the ith superpixel. Equation (2) yields an attention map S of size $\mathbb{R}^{N\times K}$, where N is the number of pixels and K the number of superpixels in the feature map, so S encodes the similarity between each pixel and the pooled feature of each superpixel. The feature vectors are then weighted by the computed similarities and added to the corresponding pixel positions, so that each pixel vector obtains, through this weighting, semantic information from the whole space:

$$E_j = \alpha \sum_{i=1}^{K} s_{ij}\,\theta_i + A_j \tag{3}$$

where α is a learnable weight parameter initialized to 0; the final output E aggregates global semantic information;
the channel attention mechanism is as follows:
first, the initial feature map $A \in \mathbb{R}^{C\times H\times W}$ is taken as the input of the channel attention module, and $v \in \mathbb{R}^{C\times K}$ is obtained through the superpixel pooling layer; a channel attention map $X \in \mathbb{R}^{C\times C}$ is then computed by matrix multiplication:

$$x_{ij} = \frac{\exp(v_i \cdot v_j)}{\sum_{j=1}^{C} \exp(v_i \cdot v_j)} \tag{4}$$

where $x_{ij}$ is the similarity between the ith and jth channels of $v$; the computed inter-channel similarities are used to weight the corresponding channels, which are then accumulated onto each local channel, so that every channel obtains information from all other channels, giving the final output $D \in \mathbb{R}^{C\times H\times W}$:

$$D_j = \beta \sum_{i=1}^{C} x_{ij} A_i + A_j \tag{5}$$

where β is a learnable weight parameter analogous to α. The spatial attention feature map obtained by equation (3) and the channel attention feature map obtained by equation (5) are fused to finally obtain the semantic segmentation map.
The characteristics and beneficial effects of the invention are as follows:
The invention proposes an attention network based on superpixel pooling to generate semantic segmentation; the proposed network reduces the computation of semantic segmentation and improves its speed.
description of the drawings:
FIG. 1 is a schematic diagram of a network architecture according to the present invention.
FIG. 2 shows the attention modules of the present invention: (a) the spatial attention mechanism; (b) the channel attention mechanism.
FIG. 3 shows example results of the present invention: (a) original image; (b) semantic segmentation result; (c) original image; (d) semantic segmentation result.
Detailed Description
In order to solve the problem of excessive network computation, the invention provides an attention module based on superpixel pooling. Superpixels are regions of spatially adjacent pixels with similar color, texture, and brightness. Since the pixels within each superpixel region are similar, we use a superpixel pooling layer to average the features within each superpixel region, pooling the number of features per superpixel from n down to 1. When the attention map is then computed, the similarity between each pixel and all other pixels becomes the similarity between each pixel and all superpixels. This greatly reduces the complexity of the network without affecting its accuracy.
The invention provides a deeply embedded network for end-to-end semantic segmentation. Its contributions are: first, a superpixel-based attention architecture that yields an end-to-end trainable model; second, an attention computation split into two steps: 1. superpixel pooling to obtain the semantic information within each superpixel; 2. computing the similarity between each pixel and the superpixels, thereby obtaining global information. The method comprises the following steps:
step 1:
the basic network architecture of the present invention adopts a structure of ResNet-101. An identity connection structure is adopted in the residual error network to relieve the problem of gradient disappearance in the deep neural network. ResNet-101 is a model adopting a 101-layer residual error network;
ResNet-101 network structure:
1. 7×7 conv, 64 channels, stride 2
2. 3×3 max pool, stride 2
3. [1×1 conv, 64; 3×3 conv, 64; 1×1 conv, 256] × 3
4. [1×1 conv, 128; 3×3 conv, 128; 1×1 conv, 512] × 4
5. [1×1 conv, 256; 3×3 conv, 256; 1×1 conv, 1024] × 23
6. [1×1 conv, 512; 3×3 conv, 512; 1×1 conv, 2048] × 3
7. average pool, stride 2
This gives 101 layers in total; a stride-2 convolution or pooling structure is used in layers 1, 2, and 7, and the resulting feature map is 1/8 the size of the original image.
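For illustration only, a minimal PyTorch sketch of such a stride-8 feature extractor (the patent names no framework; `replace_stride_with_dilation` is torchvision's standard way to keep the last two stages at 1/8 resolution, and the final average-pooling layer is omitted here because dense per-pixel features are needed):

```python
# Sketch of a stride-8 ResNet-101 feature extractor (assumed PyTorch/torchvision
# setting; not the patent's reference implementation).
import torch
import torchvision

backbone = torchvision.models.resnet101(
    weights=None,  # plug in ImageNet weights in practice
    replace_stride_with_dilation=[False, True, True],
)

def extract_features(x: torch.Tensor) -> torch.Tensor:
    """Return the 2048-channel feature map at 1/8 of the input resolution."""
    x = backbone.conv1(x)    # 7x7 conv, stride 2
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)  # 3x3 max pool, stride 2
    x = backbone.layer1(x)
    x = backbone.layer2(x)   # stride 2: overall stride is now 8
    x = backbone.layer3(x)   # dilated, stride kept at 8
    x = backbone.layer4(x)   # dilated, stride kept at 8
    return x

feats = extract_features(torch.randn(1, 3, 512, 512))
print(feats.shape)  # torch.Size([1, 2048, 64, 64])
```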
Step 2:
embedding the super-pixels: the method uses a slic (simple linear clustering algorithm) method to generate the super pixels, then embeds a super pixel layer into a ResNet network structure, performs pooling on a feature map through the super pixel layer, and then obtains a feature vector corresponding to each super pixel
Figure BDA0002697801160000045
The feature vector is an average pooling performed by the region corresponding to the superpixel:
Figure BDA0002697801160000046
wherein
Figure BDA0002697801160000047
Representing the kth feature vector, S, in the ith super-pixel regioniIndicating the number of pixels in the ith super-pixel region; this pooling operation is referred to as superpixel pooling.
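As a minimal sketch of equation (1), assuming scikit-image for SLIC and PyTorch for the pooling (the function and variable names below are ours, not the patent's):

```python
# Sketch of SLIC superpixel generation plus the superpixel average pooling of
# equation (1).
import numpy as np
import torch
from skimage.segmentation import slic

def superpixel_pool(feats: torch.Tensor, labels: torch.Tensor, K: int) -> torch.Tensor:
    """Average a (C, H, W) feature map inside each of K superpixel regions,
    returning the (K, C) pooled vectors g_i of equation (1)."""
    C = feats.shape[0]
    flat = feats.reshape(C, -1).t()                    # (N, C), N = H*W
    idx = labels.reshape(-1)                           # (N,) region index per pixel
    sums = torch.zeros(K, C).index_add_(0, idx, flat)  # feature sum per region
    counts = torch.zeros(K).index_add_(0, idx, torch.ones_like(idx, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)     # divide by S_i

# Usage: run SLIC on the image, bring the label map to the 1/8-resolution
# feature map, then pool. Here both are 64x64 for simplicity.
image = np.random.rand(64, 64, 3)
labels = torch.from_numpy(slic(image, n_segments=256, compactness=10, start_label=0))
feats = torch.randn(512, 64, 64)
g = superpixel_pool(feats, labels, K=int(labels.max()) + 1)
print(g.shape)  # torch.Size([K, 512])
```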
Step 3:
calculating a self-attention mechanism: in the conventional method, global context information is obtained by calculating the similarity between each pixel vector and the pixel vectors of all other positions. However, although the context information obtained by this method can obtain accurate results, it takes a lot of computation time and consumes much GPU memory. Our proposed superpixel-based computation method improves computational efficiency without sacrificing final accuracy.
The spatial attention mechanism is as follows:
the spatial attention mechanism encodes global context semantic information to the pixels of each location, which enhances the representation capability of the semantic information. Firstly, a characteristic diagram A e R acquired through a ResNet-101 networkC×H×WThen, a is input into three 1 × 1 convolutional layers to obtain three new feature maps B, C, D. Wherein { B, C, D }. belongs to RC×H×W. Then convert them to RC×NWherein N ═ hxw. We input B and D to the superpixel pooling layer to get v and θ, respectively, where { v, θ }. epsilon.RK×CAnd K represents the number of superpixels on each map. Then we compute a spatial attention moment map S e R by applying a softmax (normalized) layerN×K
Figure BDA0002697801160000051
Wherein viRepresents the feature obtained by pooling the ith super pixel in equation 1, and CjRepresenting the jth pixel in the feature map;
Sijrepresenting the similarity between the jth pixel and the ith super pixel. Therefore, the effect of obtaining the global context information can be achieved by calculating each pixel vector and all the super pixel vectors on the feature map, and meanwhile, the time complexity of calculation can be greatly reduced.
By the formula 2, we finally obtain an attention diagram S, wherein S is RN×KWhere N represents the number of pixels in the feature map and K represents the number of superpixels in the feature map, so S represents the pooled feature of each pixel and each superpixelThe similarity of (2); meanwhile, the feature vectors are weighted according to the calculated similarity and added to the corresponding pixel positions, so that each pixel vector can obtain semantic information of all spaces through weighting:
Figure BDA0002697801160000052
where α is a weight parameter initialized to 0 and used for learning; finally, the output E obtained by us gathers global semantic information.
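A sketch of how equations (2) and (3) could be realized, reusing the superpixel_pool function above (assumed PyTorch implementation with our own module names; batch size 1 for clarity):

```python
# Sketch of the superpixel spatial attention branch, equations (2) and (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperpixelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.proj_b = nn.Conv2d(channels, reduced, 1)   # B: pooled into v
        self.proj_c = nn.Conv2d(channels, reduced, 1)   # C: per-pixel queries
        self.proj_d = nn.Conv2d(channels, channels, 1)  # D: pooled into theta
        self.alpha = nn.Parameter(torch.zeros(1))       # weight initialized to 0

    def forward(self, a: torch.Tensor, labels: torch.Tensor, K: int) -> torch.Tensor:
        _, c, h, w = a.shape
        v = superpixel_pool(self.proj_b(a)[0], labels, K)      # (K, reduced)
        theta = superpixel_pool(self.proj_d(a)[0], labels, K)  # (K, C)
        q = self.proj_c(a)[0].reshape(-1, h * w).t()           # (N, reduced)
        s = F.softmax(q @ v.t(), dim=1)                        # (N, K), eq. (2)
        e = (s @ theta).t().reshape(1, c, h, w)                # weighted theta per pixel
        return self.alpha * e + a                              # eq. (3)
```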
The channel attention mechanism is as follows:
in semantic segmentation, channels can be regarded as responses to each class feature, and by exploring interdependencies among the channels, the characteristics of the interdependencies can be highlighted to improve the expression of semantic features. Therefore, a channel attention mechanism is constructed to acquire semantic information between channels.
First, we take the initial feature map $A \in \mathbb{R}^{C\times H\times W}$ as the input of the channel attention module and obtain $v \in \mathbb{R}^{C\times K}$ through the superpixel pooling layer; a channel attention map $X \in \mathbb{R}^{C\times C}$ is then computed by matrix multiplication:

$$x_{ij} = \frac{\exp(v_i \cdot v_j)}{\sum_{j=1}^{C} \exp(v_i \cdot v_j)} \tag{4}$$

where $x_{ij}$ is the similarity between the ith and jth channels of $v$. The computed inter-channel similarities are used to weight the corresponding channels, which are then accumulated onto each local channel, so that every channel obtains information from all other channels. The final output is $D \in \mathbb{R}^{C\times H\times W}$:

$$D_j = \beta \sum_{i=1}^{C} x_{ij} A_i + A_j \tag{5}$$

where β is a learnable weight parameter analogous to α.
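A matching sketch of equations (4) and (5), under the same assumptions and continuing the imports above:

```python
# Sketch of the superpixel channel attention branch, equations (4) and (5);
# beta plays the same role as alpha in the spatial branch.
class SuperpixelChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable weight, initialized to 0

    def forward(self, a: torch.Tensor, labels: torch.Tensor, K: int) -> torch.Tensor:
        _, c, h, w = a.shape
        v = superpixel_pool(a[0], labels, K).t()  # (C, K) pooled map
        x = F.softmax(v @ v.t(), dim=1)           # (C, C) channel similarities, eq. (4)
        out = (x @ a[0].reshape(c, -1)).reshape(1, c, h, w)  # weight channels by similarity
        return self.beta * out + a                # eq. (5)
```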
The spatial attention feature map obtained by equation (3) and the channel attention feature map obtained by equation (5) are fused to finally obtain the semantic segmentation map.
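Putting the two branches together; the patent does not specify the fusion operator, so this sketch assumes a simple element-wise sum followed by a 1×1 classifier:

```python
# Sketch of the fusion of the two attention outputs (assumed design, for
# illustration only).
class SuperpixelAttentionHead(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.spatial = SuperpixelSpatialAttention(channels, reduced=channels // 8)
        self.channel = SuperpixelChannelAttention()
        self.classify = nn.Conv2d(channels, num_classes, 1)

    def forward(self, a: torch.Tensor, labels: torch.Tensor, K: int) -> torch.Tensor:
        fused = self.spatial(a, labels, K) + self.channel(a, labels, K)
        return self.classify(fused)  # per-pixel scores; upsample 8x for the final map
```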
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A superpixel-based attention mechanism semantic segmentation generation method, characterized in that the similarity originally computed between each pixel and all other pixels is computed between each pixel and all superpixels; and the results of spatial attention encoding and channel attention encoding are fused to finally generate the semantic segmentation.
2. The superpixel-based attention mechanism semantic segmentation generation method as claimed in claim 1, characterized by comprising the following steps:
step 1, feature extraction: features are extracted with the residual network ResNet-101; the network has 101 layers in total, with a stride-2 convolution or pooling structure in layers 1, 2, and 7, so that the final feature map is 1/8 the size of the original image;
step 2, superpixel embedding: superpixels are generated with the simple linear iterative clustering (SLIC) method, a superpixel layer is embedded into the residual network ResNet structure, the feature map is pooled through the superpixel layer to obtain superpixel features, and the pooled superpixel features are embedded into the attention network;
step 3, attention mechanism: the attention mechanism is divided into a spatial attention mechanism and a channel attention mechanism; the spatial attention mechanism obtains global context information by computing the similarity between each pixel vector and the feature vectors at all other positions; the channel attention mechanism obtains semantic information among channels by computing the similarity between channels; the results of the spatial and channel attention mechanisms are then fused to finally obtain the semantic segmentation result.
3. The superpixel-based attention mechanism semantic segmentation generation method as claimed in claim 1, wherein in step 2, superpixel embedding: superpixels are generated with the simple linear iterative clustering algorithm SLIC, a superpixel layer is embedded into the ResNet structure, and the feature map is pooled through the superpixel layer to obtain a feature vector $g_i$ corresponding to each superpixel; the feature vector is the average pooling of the region corresponding to the superpixel:

$$g_i = \frac{1}{S_i} \sum_{k=1}^{S_i} x_k^i \tag{1}$$

where $x_k^i$ denotes the kth feature vector in the ith superpixel region and $S_i$ denotes the number of pixels in the ith superpixel region; this pooling operation is called superpixel pooling;
in step 3:

the spatial attention mechanism is as follows:

first, a feature map $A \in \mathbb{R}^{C\times H\times W}$ is acquired through the ResNet-101 network; A is then fed into three 1×1 convolutional layers to obtain three new feature maps B, C, D, where $\{B, C, D\} \in \mathbb{R}^{C\times H\times W}$; these are reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$; B and D are input to the superpixel pooling layer to obtain $v$ and $\theta$ respectively, where $\{v, \theta\} \in \mathbb{R}^{K\times C}$ and K denotes the number of superpixels on each map; a spatial attention matrix $S \in \mathbb{R}^{N\times K}$ is then computed by applying a normalized softmax layer:

$$s_{ij} = \frac{\exp(v_i \cdot C_j)}{\sum_{i=1}^{K} \exp(v_i \cdot C_j)} \tag{2}$$

where $v_i$ denotes the pooled feature of the ith superpixel from equation (1) and $C_j$ denotes the jth pixel in the feature map; $s_{ij}$ denotes the similarity between the jth pixel and the ith superpixel; equation (2) yields an attention map S of size $\mathbb{R}^{N\times K}$, where N denotes the number of pixels and K the number of superpixels in the feature map, so S denotes the similarity between each pixel and the pooled feature of each superpixel; meanwhile, the feature vectors are weighted according to the computed similarities and added to the corresponding pixel positions, so that each pixel vector obtains, through this weighting, semantic information from the whole space:

$$E_j = \alpha \sum_{i=1}^{K} s_{ij}\,\theta_i + A_j \tag{3}$$

where α is a learnable weight parameter initialized to 0; the final output E aggregates global semantic information;
the channel attention mechanism is as follows:
first, the initial feature map $A \in \mathbb{R}^{C\times H\times W}$ is taken as the input feature of the channel attention module, and $v \in \mathbb{R}^{C\times K}$ is obtained through the superpixel pooling layer; a channel attention map $X \in \mathbb{R}^{C\times C}$ is then computed by matrix multiplication:

$$x_{ij} = \frac{\exp(v_i \cdot v_j)}{\sum_{j=1}^{C} \exp(v_i \cdot v_j)} \tag{4}$$

where $x_{ij}$ denotes the similarity between the ith channel and the jth channel of $v$; the computed inter-channel similarities are used to weight the corresponding channels, which are then accumulated onto each local channel, so that each channel obtains the information of all other channels, giving the final output $D \in \mathbb{R}^{C\times H\times W}$:

$$D_j = \beta \sum_{i=1}^{C} x_{ij} A_i + A_j \tag{5}$$

where β is a learnable weight parameter analogous to α; the spatial attention feature map obtained by equation (3) and the channel attention feature map obtained by equation (5) are fused to finally obtain the semantic segmentation map.
CN202011011881.9A 2020-09-23 2020-09-23 Attention mechanism generation semantic segmentation method based on superpixels Pending CN112241959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011011881.9A CN112241959A (en) 2020-09-23 2020-09-23 Attention mechanism generation semantic segmentation method based on superpixels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011011881.9A CN112241959A (en) 2020-09-23 2020-09-23 Attention mechanism generation semantic segmentation method based on superpixels

Publications (1)

Publication Number Publication Date
CN112241959A true CN112241959A (en) 2021-01-19

Family

ID=74171258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011011881.9A Pending CN112241959A (en) 2020-09-23 2020-09-23 Attention mechanism generation semantic segmentation method based on superpixels

Country Status (1)

Country Link
CN (1) CN112241959A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140050391A1 (en) * 2012-08-17 2014-02-20 Nec Laboratories America, Inc. Image segmentation for large-scale fine-grained recognition
CN110414377A (en) * 2019-07-09 2019-11-05 武汉科技大学 A kind of remote sensing images scene classification method based on scale attention network
CN110533045A (en) * 2019-07-31 2019-12-03 中国民航大学 A kind of luggage X-ray contraband image, semantic dividing method of combination attention mechanism
CN111160311A (en) * 2020-01-02 2020-05-15 西北工业大学 Yellow river ice semantic segmentation method based on multi-attention machine system double-flow fusion network
CN111259936A (en) * 2020-01-09 2020-06-09 北京科技大学 Image semantic segmentation method and system based on single pixel annotation
CN111626300A (en) * 2020-05-07 2020-09-04 南京邮电大学 Image semantic segmentation model and modeling method based on context perception

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. FU ET AL.: "Dual Attention Network for Scene Segmentation", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
KAI WANG ET AL.: "End-to-end trainable network for superpixel and image segmentation", 《PATTERN RECOGNITION LETTERS》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119627A (en) * 2021-10-19 2022-03-01 北京科技大学 High-temperature alloy microstructure image segmentation method and device based on deep learning
CN114119627B (en) * 2021-10-19 2022-05-17 北京科技大学 High-temperature alloy microstructure image segmentation method and device based on deep learning
CN116630820A (en) * 2023-05-11 2023-08-22 北京卫星信息工程研究所 Optical remote sensing data on-satellite parallel processing method and device
CN116630820B (en) * 2023-05-11 2024-02-06 北京卫星信息工程研究所 Optical remote sensing data on-satellite parallel processing method and device
CN118053051A (en) * 2024-04-16 2024-05-17 南京信息工程大学 Hyperspectral remote sensing image classification method based on superpixel self-attention mechanism


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 2021-01-19)