CN112508960A - Low-precision image semantic segmentation method based on improved attention mechanism - Google Patents
- Publication number
- CN112508960A (application number CN202011521916.3A)
- Authority
- CN
- China
- Prior art keywords
- attention
- feature
- network
- low
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
Abstract
The invention discloses a low-precision image semantic segmentation method based on an improved attention mechanism, which comprises the following steps: S1, collecting images under different scenes to compose a data set, and dividing the data set into a training set, a verification set and a test set; S2, performing feature extraction on the preprocessed training set images by using an improved MobileNet v2 network, and performing up-sampling or down-sampling on the resolution of feature maps of different layers; S3, aggregating the multi-scale information of the feature maps after up-sampling or down-sampling in S2 by using a GASPP structure with a global attention feature module; S4, fusing the low-level detail features extracted by the MobileNet v2 backbone network with the multi-scale features obtained by the aggregation in step S3, and fusing the obtained fused features through a decoder module with a selective attention mechanism; and S5, decoding the feature map through bilinear interpolation upsampling to obtain the final segmentation image.
Description
Technical Field
The invention belongs to the field of deep learning and computer vision, and particularly relates to a low-precision image semantic segmentation method based on an improved attention mechanism.
Background
Since the beginning of the 21st century, how to realize intelligent driving has become an increasingly popular topic. Among the common scenarios faced by intelligent vehicles, semantic segmentation is a key technology for identifying different objects in urban roads, such as obstacles, drivable areas and traffic lights. Semantic segmentation is classification at the pixel level: pixels belonging to the same class are grouped into one class, so semantic segmentation understands an image from the pixel level.
Before deep learning methods using convolutional neural networks became mainstream, semantic segmentation methods such as TextonForest and random forest classifiers were widely used. These methods are simple in design and easy to implement, but the feature extraction step is mainly performed manually, and the classification performance is poor.
Deep learning methods have achieved great success in semantic segmentation, and they can be summarized into several approaches to the problem.
In 2014, the Fully Convolutional Network (FCN) emerged. The FCN replaces the fully connected layers of a network with convolutions, making input of any image size possible. First, an RGB image is input into a convolutional neural network, and a series of feature maps is obtained through multiple convolution and pooling operations. Then, a deconvolution layer is used to up-sample the feature map produced by the last convolutional layer so that it matches the original image in size; in this way, a prediction is produced for each pixel while the spatial position information of each pixel value in the original image is preserved. Finally, pixel-by-pixel classification is performed on the up-sampled feature map, and the softmax classification loss is calculated pixel by pixel.
The encoder-decoder is an FCN-based structure. The encoder gradually reduces the spatial dimensions through pooling, while the decoder gradually restores the spatial dimensions and detail information. There is usually also a shortcut connection (i.e., a connection across layers) from the encoder to the decoder.
The dilated/atrous convolution architecture replaces pooling: on the one hand it preserves spatial resolution, and on the other hand it integrates context information well because it enlarges the receptive field.
There is also a method of post-processing the segmentation results, namely Conditional Random Fields (CRFs), to improve the segmentation. The DeepLab series articles basically adopt the post-processing method, and can better improve the segmentation result.
Existing networks such as U-Net and VGG suffer from insufficient real-time performance, while lightweight networks such as the MobileNet series suffer from insufficient accuracy. How to improve accuracy while ensuring the real-time performance of image segmentation is the key problem this method aims to solve.
Disclosure of Invention
The invention aims to provide a low-precision image semantic segmentation method based on an improved attention mechanism, which can improve the image segmentation accuracy in a low-precision network.
The object of the invention is achieved by at least one of the following solutions.
A low-precision image semantic segmentation method based on an improved attention mechanism comprises the following steps:
s1, collecting and preprocessing images in different scenes, labeling the images to form a data set, and dividing the data set into a training set, a verification set and a test set;
s2, performing feature extraction on the preprocessed training set images by using an improved MobileNet v2 network, and performing up-sampling or down-sampling on the resolution of feature images of different layers;
s3, aggregating the multi-scale information of the feature graph after up-sampling or down-sampling in the step S2 by using a GASPP structure network with a global attention feature module;
s4, fusing the low-level detail features extracted by the MobileNet v2 network and the multi-scale features obtained by aggregation in the step S3, and fusing the obtained fused features through a decoder module (SAM) with a selective attention mechanism;
and S5, decoding the feature map through bilinear interpolation upsampling to obtain a final segmentation image.
Preferably, the improved MobileNet v2 network described in step S2 is a MobileNet v2 network with the last three layers deleted.
Preferably, the GASPP structure network with the global attention feature module in step S3 includes the atrous spatial pyramid pooling (ASPP) module with atrous convolution based on DeepLab v3+, and the ASPP module adopts a global average pooling operation;
each branch of the GASPP structure network contains 256 channels, a global attention mechanism module (GAM) is introduced, 3 convolution modules of 3 × 3 are added after each atrous convolution branch, and the original 1 × 1 convolution is retained.
Preferably, the improved MobileNet v2 network retains only one two-dimensional convolution layer and seven linear bottleneck layers. The GAM takes the last-layer feature map of the MobileNet v2 backbone network as input and expands the feature map to size C×HW, where the parameters C, W and H respectively represent the number of channels, the width and the height of the feature map. Two global attention masks, of sizes C×HW and HW×C, are extracted by transformation mapping, and the correlation between features, computed as the dot product of the two global attention masks, is taken as the input of the normalization function Sparsemax, shown in formula (1):

sparsemax_i(z) = max(0, z_i − τ(z)) (1)

where the attention feature map vector is z = [z_1, z_2, …, z_k], z_k represents the attention feature vector of the k-th channel, and the vector values are sorted so that z_(1) ≥ z_(2) ≥ … ≥ z_(k). The threshold τ(z) is:

τ(z) = (Σ_{j=1}^{f(z)} z_(j) − 1) / f(z)

where

f(z) = max{ k′ ∈ {1, …, k} : 1 + k′·z_(k′) > Σ_{j=1}^{k′} z_(j) }

where k represents the total number of channels, j represents the current channel index, z_(j) and z_(k′) respectively represent the sorted attention feature map vector values of the j-th and k′-th channels, and f(z) is the largest index k′ satisfying the above condition, i.e. the number of channels that retain non-zero attention.
Preferably, the GASPP is calculated as follows:
Z = GAM(X) ⊙ P_{3,6}(P_3(X)) ⊙ P_{3,12}(P_5(X)) ⊙ P_{3,18}(P_7(X)) ⊙ P_1(X) (2)

where Z represents the output of GASPP, GAM(X) represents the global attention operation, P_k(X) represents a convolution operation with kernel size k × k, P_{3,r}(·) represents a 3 × 3 atrous convolution with dilation rate r, and ⊙ represents channel-wise concatenation. After all feature maps are concatenated, the concatenated feature maps are passed through a 1 × 1 convolution to reduce the number of channels.
Preferably, in step S4, the fusion of the low-level features and the multi-scale features is performed using the decoder module SAM, which includes a squeeze-and-excitation network (SENet); after the selective attention calculation, the SAM performs an up-sampling operation, the output size is restored to the input size, and a pixel distribution map is obtained accordingly.
Preferably, the selective attention module in the decoder module with selective attention mechanism SAM is divided into two different branches, wherein one branch is from the multi-scale aggregation high-level feature information of the GASPP structure network with global attention feature module; the other branch is from the detail feature of the MobileNet v2 network, using a 1 × 1 convolutional layer to reduce the number of channels.
Preferably, in step S4, the decoder-fused feature map is decoded by bilinear interpolation upsampling, where bilinear interpolation performs linear calculations based on the values of known points.
Preferably, the linear calculation is as follows:

f(R1) = ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)
f(R2) = ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)
f(P) = ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)

where the intermediate points A and B are R1 = (x, y1) and R2 = (x, y2) respectively, with values f(R1) and f(R2) as given above; the coordinate points of the four corners, Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2), are known points, P(x, y) is the point to be evaluated, x represents the x-axis coordinate and y represents the y-axis coordinate.
Preferably, the preprocessing process mainly comprises flipping, rotating, scaling and cropping.
Compared with the prior art, the invention has the following beneficial effects:
(1) Aiming at the problem of insufficient semantic segmentation accuracy in low-precision networks, the method designs GASPP, an ASPP structure with global attention information, together with the decoder module SAM, which effectively improves the algorithm's precision.
(2) The method can effectively segment roads in various scenes and suppress noise; it consumes little time and achieves high accuracy in the semantic segmentation of lane pictures, adapts well to environments with blurred lane lines, rain, heavy fog, large area rate and the like, and has practical significance in traffic application scenarios.
Drawings
FIG. 1 is a schematic structural diagram of a low-precision image semantic segmentation method based on an improved attention mechanism according to this embodiment;
fig. 2 is a diagram of a GASPP network model structure according to the embodiment;
fig. 3 is a diagram illustrating the structure of the GAM module according to the present embodiment;
FIG. 4 is a flow chart of a decoder module with selective attention according to the present embodiment;
fig. 5 is a schematic overall flow chart of the low-precision image semantic segmentation method based on the improved attention mechanism according to the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a low-precision image semantic segmentation method based on an improved attention mechanism, and a neural network model structure chart is shown in fig. 1 and mainly comprises a backbone network, a GASPP structure network with a global attention feature module and a decoder module with a selective attention mechanism.
As shown in fig. 5, the low-precision image semantic segmentation method based on the improved attention mechanism of the embodiment includes the following steps:
step 1, collecting images of lanes under different scenes, labeling vehicles, roads, obstacles and the like of the images respectively to form a data set, and processing the data set according to the following steps of 8: 1: the proportion of 1 is divided into a training set, a verification set and a test set, wherein the training set is used for training the deep convolutional network, the verification set is used for selecting an optimal training model, and the test set is used for testing the performance of the design model at the later stage.
The preprocessing process mainly comprises flipping, rotation, scaling, cropping and the like. These operations improve the accuracy of the model, enhance its stability, prevent over-fitting, and increase the fault tolerance of the data set through controlled scale transformations.
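Two of the preprocessing operations above (flipping and cropping) can be illustrated with NumPy; `random_flip_crop` is a hypothetical helper, and rotation and scaling would follow the same pattern:

```python
import numpy as np

def random_flip_crop(img, crop_h, crop_w, rng=None):
    """Randomly flip an image horizontally with probability 0.5,
    then take a random crop of size crop_h x crop_w."""
    if rng is None:
        rng = np.random.default_rng(0)
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)       # random crop origin
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]
```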
Step 2, using an improved MobileNet v2 network to perform feature extraction on the picture preprocessed by the training set, and performing up-sampling or down-sampling on the resolution of feature maps of different layers;
the improved MobileNet v2 network is obtained by improvement and simplification based on MobileNet v2, the original MobileNet v2 network structure comprises a two-dimensional convolution layer, seven linear bottleneck layers, a 1 × 1 two-dimensional convolution layer, a 7 × 7 average pooling layer, a 1 × 1 two-dimensional convolution layer and three deleted layers, the obtained network model parameters are shown in table 2, only one two-dimensional convolution layer and seven linear bottleneck layers are included, the network calculation amount is greatly reduced, and the image segmentation and feature extraction speed is higher.
TABLE 2 network model parameters
The GASPP structure network with the global attention feature module is based on the existing atrous spatial pyramid pooling (ASPP) module with atrous convolution in DeepLab v3+, which adopts a global average pooling operation. Each branch of the GASPP adopted by the invention contains 256 channels, a global attention mechanism module (GAM) is introduced, 3 convolution modules of 3 × 3 are added after each atrous convolution branch, and the original 1 × 1 convolution is retained; the structure is shown in fig. 2.
The calculation formula of GASPP is as follows:
Z = GAM(X) ⊙ P_{3,6}(P_3(X)) ⊙ P_{3,12}(P_5(X)) ⊙ P_{3,18}(P_7(X)) ⊙ P_1(X) (1)

where Z represents the output of GASPP, GAM(X) represents the global attention operation, P_k(X) represents a convolution operation with kernel size k × k, P_{3,r}(·) represents a 3 × 3 atrous convolution with dilation rate r, and ⊙ represents channel-wise concatenation. After all feature maps are concatenated, the concatenated feature maps are passed through a 1 × 1 convolution to reduce the number of channels to 128.
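The channel-wise concatenation and 1 × 1 channel-reduction step of formula (1) can be illustrated at the shape level with NumPy (the random projection stands in for learned 1 × 1 convolution weights; `gaspp_merge` is an illustrative name, not from the patent):

```python
import numpy as np

def gaspp_merge(branches, out_channels=128, seed=0):
    """Concatenate branch feature maps along the channel axis, then
    apply a 1x1 convolution (a per-pixel linear map over channels,
    here with random stand-in weights) to reduce the channel count."""
    z = np.concatenate(branches, axis=0)          # (sum of C_i, H, W)
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, z.shape[0]))
    # a 1x1 convolution mixes channels independently at each pixel
    return np.einsum('oc,chw->ohw', w, z)

# five branches of 256 channels each, as in the GASPP structure
branches = [np.zeros((256, 8, 8)) for _ in range(5)]
out = gaspp_merge(branches)
```

The concatenated 5 × 256 = 1280 channels are reduced to 128, matching the description above.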
As shown in fig. 3, the GAM takes the last-layer feature map of the improved MobileNet v2 network as input and expands the feature map to size C×HW, where the parameters C, W and H respectively represent the number of channels, the width and the height of the feature map. Global attention masks of sizes C×HW and HW×C are extracted by transformation mapping, and the correlation between features, computed as the dot product between the two global attention masks, is taken as the input of the normalization function Sparsemax, shown in formula (2):

sparsemax_i(z) = max(0, z_i − τ(z)) (2)

where the attention feature map vector is z = [z_1, z_2, …, z_k], z_k represents the attention feature vector of the k-th channel, and the vector values are sorted so that z_(1) ≥ z_(2) ≥ … ≥ z_(k). The threshold τ(z) is:

τ(z) = (Σ_{j=1}^{f(z)} z_(j) − 1) / f(z)

where

f(z) = max{ k′ ∈ {1, …, k} : 1 + k′·z_(k′) > Σ_{j=1}^{k′} z_(j) }

where k represents the total number of channels, j represents the current channel index, z_(j) and z_(k′) respectively represent the sorted attention feature map vector values of the j-th and k′-th channels, and f(z) is the largest index k′ satisfying the above condition, i.e. the number of channels that retain non-zero attention.
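The Sparsemax normalization of formula (2) can be sketched as follows (a NumPy illustration of the standard Sparsemax computation; the patent applies it to flattened attention maps rather than the small vectors used here):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax normalization: sparsemax_i(z) = max(0, z_i - tau(z)).
    Unlike softmax, it projects onto the simplex and can return
    exact zeros, yielding a sparse attention distribution."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]               # sort in descending order
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    # support size: largest k with 1 + k*z_(k) > cumulative sum
    support = ks[1 + ks * z_sorted > cumsum]
    k_z = support[-1]
    tau = (cumsum[k_z - 1] - 1) / k_z         # threshold tau(z)
    return np.maximum(z - tau, 0.0)
```

The output always sums to 1, and dominated entries are driven exactly to zero rather than merely made small.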
The feature maps generated by the two branches (the backbone network and the GASPP) carry different levels of information: the backbone network branch provides rich low-level detail information, while the GASPP branch mainly provides high-level semantic information. The decoder module SAM is used to fuse the low-level features and the multi-scale features. The SAM is an improvement on SENet, and its structure is shown in FIG. 4. The selective attention module in the SAM can be divided into two different branches: one branch comes from the multi-scale aggregated high-level feature information of the GASPP module; the other branch comes from the detail features of the backbone network, where a 1 × 1 convolutional layer is used to reduce the number of channels. The fused features are merged by channel, connected through a global average pooling layer, expanded through a fully connected layer and a ReLU layer, and recalibrated through a fully connected layer and a Sigmoid layer. After the selective attention calculation of the SAM, an up-sampling operation restores the output size to the input size, from which a pixel distribution map is obtained.
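The squeeze-and-excitation style recalibration inside the SAM can be sketched as follows (a minimal NumPy illustration; the weight matrices `w1` and `w2` stand in for learned fully connected layers and are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(feat, w1, w2):
    """SENet-style channel recalibration: global average pooling
    (squeeze), FC + ReLU then FC + Sigmoid (excitation), and
    channel-wise rescaling of the (C, H, W) input feature map."""
    squeeze = feat.mean(axis=(1, 2))          # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)    # FC + ReLU
    scale = sigmoid(w2 @ hidden)              # FC + Sigmoid, in (0, 1)
    return feat * scale[:, None, None]        # recalibrate each channel
```

Each channel of the input is multiplied by a learned importance weight between 0 and 1, so informative channels are emphasized and the rest suppressed.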
The selected feature map is decoded through bilinear interpolation upsampling, formula (4), to obtain the final segmentation image.
The intermediate points A and B are R1 = (x, y1) and R2 = (x, y2) respectively, and the interpolation is calculated as:

f(R1) = ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)
f(R2) = ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)
f(P) = ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2) (4)

where the coordinate points of the four corners, Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2), are known points, P(x, y) is the point to be evaluated, x represents the x-axis coordinate and y represents the y-axis coordinate.
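The bilinear interpolation used for upsampling can be written directly in Python (a minimal sketch; the function name and argument order are illustrative):

```python
def bilinear_interp(q11, q12, q21, q22, x1, x2, y1, y2, x, y):
    """Interpolate the value at P(x, y) from the four known corner
    values Q11=(x1,y1), Q12=(x1,y2), Q21=(x2,y1), Q22=(x2,y2):
    two linear interpolations along x (points R1, R2), then one
    along y between them."""
    # interpolate along x at y1 and y2 (intermediate points R1 and R2)
    r1 = q11 * (x2 - x) / (x2 - x1) + q21 * (x - x1) / (x2 - x1)
    r2 = q12 * (x2 - x) / (x2 - x1) + q22 * (x - x1) / (x2 - x1)
    # interpolate along y between R1 and R2
    return r1 * (y2 - y) / (y2 - y1) + r2 * (y - y1) / (y2 - y1)
```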
Step 3, setting and adjusting the parameters of the low-precision semantic segmentation network model based on the improved attention mechanism. The GPU used in the method is a GTX 2080Ti, and in consideration of picture resolution, the input picture size is set to 512 × 1024.
To enlarge the data set when training the model, the RGB channels of the input original picture are first normalized by mean and variance, and augmentation such as random scaling in the range of 0.5 to 2.0 and random horizontal flipping is adopted during training. During testing, no random horizontal flipping, random cropping or similar operations are applied to the test image; the image is fed into the network model after mean subtraction.
The network adopts the existing Poly learning rate strategy, which does not use a fixed step-size parameter; instead, the learning rate decays from the initial learning rate by an attenuation factor at each iteration, and the calculation formula is as follows:

lr = lr_base × (1 − epoch / max_epoch)^power

where epoch represents the current iteration cycle in the training process, max_epoch represents the maximum number of iteration cycles, the initial learning rate lr_base is set to 0.01, and the exponent coefficient power is set to 0.9.
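The Poly schedule above can be sketched as a one-line Python function with the stated defaults lr_base = 0.01 and power = 0.9:

```python
def poly_lr(epoch, max_epoch, lr_base=0.01, power=0.9):
    """Poly learning-rate schedule: lr decays from lr_base to 0
    as epoch approaches max_epoch, following (1 - t)^power."""
    return lr_base * (1 - epoch / max_epoch) ** power
```

The rate starts at lr_base, decreases monotonically, and reaches 0 at the final epoch.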
the embodiment adopts an ASPP structure GASPP with global attention information and a decoder module SAM, thereby effectively improving the algorithm precision. The method can effectively segment roads under various scenes and inhibit noise, consumes less time and has high accuracy for semantic segmentation of road pictures, has better adaptability in the environments of fuzzy roads, rainy days, heavy fog, large area rate and the like, and has practical significance in traffic application scenes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A low-precision image semantic segmentation method based on an improved attention mechanism is characterized by comprising the following steps:
s1, collecting and preprocessing images in different scenes, labeling the images to form a data set, and dividing the data set into a training set, a verification set and a test set;
s2, performing feature extraction on the preprocessed training set images by using an improved MobileNet v2 network, and performing up-sampling or down-sampling on the resolution of feature images of different layers;
s3, aggregating the multi-scale information of the feature graph after up-sampling or down-sampling in the step S2 by using a GASPP structure network with a global attention feature module;
s4, fusing the low-level detail features extracted by the MobileNet v2 network and the multi-scale features obtained by aggregation in the step S3, and fusing the obtained fused features through a decoder module (SAM) with a selective attention mechanism;
and S5, decoding the feature map through bilinear interpolation upsampling to obtain a final segmentation image.
2. The method for semantic segmentation of low-precision images based on the improved attention mechanism as claimed in claim 1, wherein the improved MobileNet v2 network in step S2 is a MobileNet v2 network with the last three layers deleted.
3. The improved attention mechanism-based low-precision image semantic segmentation method according to claim 2, wherein the GASPP structure network with the global attention feature module in step S3 includes the atrous spatial pyramid pooling (ASPP) module with atrous convolution based on DeepLab v3+, and the ASPP module adopts a global average pooling operation;
each branch of the GASPP structure network contains 256 channels, a global attention mechanism module (GAM) is introduced, 3 convolution modules of 3 × 3 are added after each atrous convolution branch, and the original 1 × 1 convolution is retained.
4. The method as claimed in claim 3, wherein the improved MobileNet v2 network retains only one two-dimensional convolution layer and seven linear bottleneck layers; the GAM takes the last-layer feature map of the MobileNet v2 backbone network as input and expands the feature map to size C×HW, where the parameters C, W and H respectively represent the number of channels, the width and the height of the feature map; two global attention masks, of sizes C×HW and HW×C, are extracted by transformation mapping, and the correlation between features, computed as the dot product between the two global attention masks, is taken as the input of the normalization function Sparsemax, shown in formula (1):

sparsemax_i(z) = max(0, z_i − τ(z)) (1)

where the attention feature map vector is z = [z_1, z_2, …, z_k], z_k represents the attention feature vector of the k-th channel, and the vector values are sorted so that z_(1) ≥ z_(2) ≥ … ≥ z_(k); the threshold τ(z) is:

τ(z) = (Σ_{j=1}^{f(z)} z_(j) − 1) / f(z)

where

f(z) = max{ k′ ∈ {1, …, k} : 1 + k′·z_(k′) > Σ_{j=1}^{k′} z_(j) }

where k represents the total number of channels, j represents the current channel index, z_(j) and z_(k′) respectively represent the sorted attention feature map vector values of the j-th and k′-th channels, and f(z) is the largest index k′ satisfying the above condition.
5. The method for semantic segmentation of low-precision images based on the improved attention mechanism as claimed in claim 4, wherein the GASPP is calculated as follows:

Z = GAM(X) ⊙ P_{3,6}(P_3(X)) ⊙ P_{3,12}(P_5(X)) ⊙ P_{3,18}(P_7(X)) ⊙ P_1(X) (2)

where Z represents the output of GASPP, GAM(X) represents the global attention operation, P_k(X) represents a convolution operation with kernel size k × k, P_{3,r}(·) represents a 3 × 3 atrous convolution with dilation rate r, and ⊙ represents channel-wise concatenation; after all feature maps are concatenated, the concatenated feature maps are passed through a 1 × 1 convolution to reduce the number of channels.
6. The method for semantic segmentation of low-precision images based on the improved attention mechanism as claimed in claim 5, wherein in step S4 the low-level features and the multi-scale features are fused using the decoder module SAM, which includes a squeeze-and-excitation network (SENet); after the selective attention calculation is completed, the SAM performs an up-sampling operation, the output size is restored to the input size, and a pixel distribution map is obtained accordingly.
7. The method according to claim 6, wherein the selective attention module in the decoder module with selective attention mechanism SAM is divided into two different branches, one branch being from the multi-scale aggregation high-level feature information of the GASPP structure network with global attention feature module; the other branch is from the detail feature of the MobileNet v2 network, using a 1 × 1 convolutional layer to reduce the number of channels.
8. The method for semantic segmentation of low-precision images based on the improved attention mechanism as claimed in claim 7, wherein in step S4 the decoder-fused feature map is decoded by bilinear interpolation upsampling, and bilinear interpolation performs linear calculations based on the values of known points.
9. The method for semantically segmenting the low-precision image based on the improved attention mechanism as claimed in claim 8, wherein the linear calculation is as follows:
f(R1) = ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)
f(R2) = ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)
f(P) = ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)

where the intermediate points A and B are R1 = (x, y1) and R2 = (x, y2) respectively, with values f(R1) and f(R2) as given above; the coordinate points of the four corners, Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2), are known points, P(x, y) is the point to be evaluated, x represents the x-axis coordinate and y represents the y-axis coordinate.
10. The method of claim 9, wherein the preprocessing process mainly comprises flipping, rotating, scaling, and cropping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011521916.3A CN112508960A (en) | 2020-12-21 | 2020-12-21 | Low-precision image semantic segmentation method based on improved attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011521916.3A CN112508960A (en) | 2020-12-21 | 2020-12-21 | Low-precision image semantic segmentation method based on improved attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112508960A true CN112508960A (en) | 2021-03-16 |
Family
ID=74922878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011521916.3A Pending CN112508960A (en) | 2020-12-21 | 2020-12-21 | Low-precision image semantic segmentation method based on improved attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112508960A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188817A (en) * | 2019-05-28 | 2019-08-30 | 厦门大学 | A kind of real-time high-performance street view image semantic segmentation method based on deep learning |
CN110197182A (en) * | 2019-06-11 | 2019-09-03 | 中国电子科技集团公司第五十四研究所 | Remote sensing image semantic segmentation method based on contextual information and attention mechanism |
CN111127493A (en) * | 2019-11-12 | 2020-05-08 | 中国矿业大学 | Remote sensing image semantic segmentation method based on attention multi-scale feature fusion |
CN111797779A (en) * | 2020-07-08 | 2020-10-20 | 兰州交通大学 | Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion |
Worldwide applications
2020-12-21 | CN | CN202011521916.3A (CN112508960A) | Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113095185A (en) * | 2021-03-31 | 2021-07-09 | 新疆爱华盈通信息技术有限公司 | Facial expression recognition method, device, equipment and storage medium |
CN113076920A (en) * | 2021-04-20 | 2021-07-06 | 同济大学 | Intelligent fault diagnosis method based on asymmetric domain confrontation self-adaptive model |
CN113076920B (en) * | 2021-04-20 | 2022-06-03 | 同济大学 | Intelligent fault diagnosis method based on asymmetric domain confrontation self-adaptive model |
CN113469050A (en) * | 2021-07-01 | 2021-10-01 | 安徽大学 | Flame detection method based on image subdivision classification |
CN113361537B (en) * | 2021-07-23 | 2022-05-10 | 人民网股份有限公司 | Image semantic segmentation method and device based on channel attention |
CN113361537A (en) * | 2021-07-23 | 2021-09-07 | 人民网股份有限公司 | Image semantic segmentation method and device based on channel attention |
CN113592026A (en) * | 2021-08-13 | 2021-11-02 | 大连大学 | Binocular vision stereo matching method based on void volume and cascade cost volume |
CN113592026B (en) * | 2021-08-13 | 2023-10-03 | 大连大学 | Binocular vision stereo matching method based on cavity volume and cascade cost volume |
CN113920411A (en) * | 2021-10-09 | 2022-01-11 | 成都信息工程大学 | Improved SOLOv2-based campus scene image segmentation method |
CN114092815A (en) * | 2021-11-29 | 2022-02-25 | 自然资源部国土卫星遥感应用中心 | Remote sensing intelligent extraction method for large-range photovoltaic power generation facility |
CN114092815B (en) * | 2021-11-29 | 2022-04-15 | 自然资源部国土卫星遥感应用中心 | Remote sensing intelligent extraction method for large-range photovoltaic power generation facility |
CN115205300A (en) * | 2022-09-19 | 2022-10-18 | 华东交通大学 | Fundus blood vessel image segmentation method and system based on cavity convolution and semantic fusion |
CN115205300B (en) * | 2022-09-19 | 2022-12-09 | 华东交通大学 | Fundus blood vessel image segmentation method and system based on cavity convolution and semantic fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112508960A (en) | Low-precision image semantic segmentation method based on improved attention mechanism | |
CN111259905B (en) | Feature fusion remote sensing image semantic segmentation method based on downsampling | |
CN110276354B (en) | High-resolution streetscape picture semantic segmentation training and real-time segmentation method | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN111539887B (en) | Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution | |
CN111563909B (en) | Semantic segmentation method for complex street view image | |
CN111462013B (en) | Single-image rain removing method based on structured residual learning | |
CN111126359A (en) | High-definition image small target detection method based on self-encoder and YOLO algorithm | |
CN111899169B (en) | Method for segmenting network of face image based on semantic segmentation | |
CN110717921B (en) | Full convolution neural network semantic segmentation method of improved coding and decoding structure | |
CN111583384A (en) | Hair reconstruction method based on adaptive octree hair convolutional neural network | |
CN110706239A (en) | Scene segmentation method fusing full convolution neural network and improved ASPP module | |
CN113066089B (en) | Real-time image semantic segmentation method based on attention guide mechanism | |
CN111401379A (en) | DeepLabv3plus-IRCNet image semantic segmentation algorithm based on coding and decoding structure | |
CN113160062A (en) | Infrared image target detection method, device, equipment and storage medium | |
CN112819000A (en) | Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium | |
CN111652081A (en) | Video semantic segmentation method based on optical flow feature fusion | |
CN111832453A (en) | Unmanned scene real-time semantic segmentation method based on double-path deep neural network | |
CN116486074A (en) | Medical image segmentation method based on local and global context information coding | |
CN112884893A (en) | Cross-view-angle image generation method based on asymmetric convolutional network and attention mechanism | |
CN114782949B (en) | Traffic scene semantic segmentation method for boundary guide context aggregation | |
CN113052776A (en) | Unsupervised image defogging method based on multi-scale depth image prior | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN116503709A (en) | Vehicle detection method based on improved YOLOv5 in haze weather | |
CN116486080A (en) | Lightweight image semantic segmentation method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||