CN113642390B - Street view image semantic segmentation method based on local attention network - Google Patents
- Publication number
- CN113642390B (application CN202110763344.8A, filed 2021-07-06)
- Authority
- CN
- China
- Prior art keywords
- feature map
- convolution
- input
- network
- image data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Pattern recognition; classification techniques
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a street view image semantic segmentation method based on a local attention network, comprising the following specific implementation steps: step 1, randomly select partial image data from the public dataset Cityscapes and divide the selected data into a training set, a verification set and a test set; step 2, construct a MobileNetV2 network model using inverted residual modules and dilated convolution; step 3, design a local attention module and a residual block, and construct a coding network; step 4, construct a decoding network, gradually recover the image resolution, and output the final semantic segmentation result; step 5, train the model using the training set and verification set, and verify the segmentation effect of the model on the test set. The method addresses the prior-art problem that local information cannot be fully preserved during feature extraction, which causes inconsistent segmentation results within a category.
Description
Technical Field
The invention belongs to the field of digital image processing methods, and particularly relates to a street view image semantic segmentation method based on a local attention network.
Background
Vision is an important way for humans to recognize and accept external information, and humans tend to acquire information directly from images rather than from linguistic text descriptions. For computers, however, performing scene understanding tasks as the human eye does, such as accurately classifying image pixels, remains challenging. The objective of semantic segmentation is to correctly classify each pixel in an image using a computer; it is a pixel-by-pixel classification task. Scene understanding uses a computer to perceive and understand the environment much as a human does, and semantic segmentation, as a necessary path toward scene understanding, is a key and fundamental technology.
In urban road scenes, semantic segmentation is a key technology for understanding different kinds of objects such as vehicles, sidewalks, roads and traffic lights. However, street scenes are usually complicated and unstructured: illumination and seasonal weather vary, target scales can be very small, and objects may be occluded, so diverse targets are usually present on roads. Visual understanding and semantic segmentation in street scenes is therefore a very complex and serious challenge.
Disclosure of Invention
The invention aims to provide a street view image semantic segmentation method based on a local attention network, which addresses the prior-art problem that local information cannot be fully preserved during feature extraction and thereby resolves inconsistent segmentation results within a category.
The invention adopts the technical scheme that the street view image semantic segmentation method based on the local attention network comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, first construct an inverted residual module using depthwise separable convolution and a residual structure, then construct a MobileNetV2 network model using the inverted residual module and dilated convolution; input the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; apply four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
step 3, because local context information is likely to be lost as features are extracted step by step, design a local attention module and a residual block and construct a coding network that extracts image features while restoring the local context information of the input image data;
step 4, construct a decoding network: sequentially perform upsampling, concatenation and upsampling on the output of the coding network and the low-level feature map F_low, gradually recover the image resolution, and finally output the semantic segmentation result;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
The present invention is also characterized in that,
the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, enhance the data using random flipping, random cropping and random Gaussian blur, and finally normalize the training-set image data;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
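As a rough illustration of the preprocessing in steps 1.2-1.3, the sketch below performs random cropping and per-channel normalization in NumPy; the mean/std values are hypothetical ImageNet-style statistics, not taken from the patent, and the bilinear scaling step is omitted.

```python
import numpy as np

def random_crop(img, size=513):
    # Randomly crop a size x size patch from an H x W x C image (H, W >= size).
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size]

def normalize(img, mean, std):
    # Per-channel normalization: (pixel / 255 - mean) / std.
    return (img.astype(np.float32) / 255.0 - mean) / std

rgb = np.random.randint(0, 256, (513, 1026, 3), dtype=np.uint8)  # a scaled image
patch = random_crop(rgb)                                          # 513 x 513 crop
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # hypothetical stats
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
out = normalize(patch, mean, std)
print(out.shape)  # (513, 513, 3)
```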
The specific process of the step 2 is as follows:
step 2.1, construct an inverted residual module using depthwise separable convolution and a residual network structure: first a 1×1 convolution raises the dimension, then a 3×3 depthwise separable convolution is applied to reduce computation, and finally a 1×1 convolution lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
where x represents the input data, and max() and min() are functions returning the maximum and the minimum of their inputs;
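Formula (1) can be checked with a few lines of NumPy (a minimal sketch, not part of the patented method):

```python
import numpy as np

def relu6(x):
    # ReLU6(x) = min{max(0, x), 6}, as in formula (1)
    return np.minimum(np.maximum(0.0, x), 6.0)

print(relu6(np.array([-3.0, 2.5, 7.0])))  # values 0.0, 2.5, 6.0
```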
step 2.2, construct the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extract image features layer by layer using dilated convolution, which skips across pixels; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the fourth-layer output is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent kernel size k' of a dilated convolution is defined as shown in formula (2):
k'=k+(k-1)×(d-1) (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF i+1 =RF i +(k'-1)×S i (3)
where i denotes the index of the network layer, RF_i denotes the receptive field of the i-th layer, and S_i denotes the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = Stride_1 × Stride_2 × ⋯ × Stride_i    (4)
where Stride_i represents the stride of the i-th layer;
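Formulas (2)-(4) combine into a small receptive-field calculator; the helper names below are illustrative only:

```python
def equivalent_kernel(k, d):
    # k' = k + (k - 1) * (d - 1), formula (2)
    return k + (k - 1) * (d - 1)

def receptive_field(kernels, dilations, strides):
    # RF_{i+1} = RF_i + (k' - 1) * S_i, formula (3),
    # where S_i is the product of all preceding strides, formula (4)
    rf, s = 1, 1
    for k, d, stride in zip(kernels, dilations, strides):
        rf += (equivalent_kernel(k, d) - 1) * s
        s *= stride
    return rf

# A 3x3 convolution with dilation rate 2 behaves like a 5x5 convolution:
print(equivalent_kernel(3, 2))              # 5
print(receptive_field([3, 3], [1, 1], [1, 1]))  # 5
```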
step 2.3, for F_high, first use dilated convolutions with four dilation rates d = 0, 1, 2, 3 to obtain four feature maps F_1, F_2, F_3, F_4; then use one global average pooling to obtain a feature map F_p; the output size N of a convolution is computed as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
where W is the input size, F is the kernel size, S is the stride, and P is the padding size.
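Formula (5) is the standard convolution output-size relation; a quick sanity check (the function name is an assumption):

```python
def conv_output_size(w, f, s, p):
    # N = (W - F + 2P) / S + 1, formula (5)
    return (w - f + 2 * p) // s + 1

# A 3x3 kernel with stride 1 and padding 1 preserves spatial size:
print(conv_output_size(65, 3, 1, 1))   # 65
# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel input:
print(conv_output_size(224, 7, 2, 3))  # 112
```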
The specific process of the step 3 is as follows:
step 3.1, construct a local attention module: first, concatenate the input feature maps f_a and f_b, then apply batch normalization and a 1×1 convolution to obtain f_b'; next, sequentially apply global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to f_b' to obtain f_b''; multiply f_b'' with f_a to obtain f_a'; finally, add f_a' and f_b' to produce the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function Sigmoid(x) are defined as in formulas (6) and (7):
ReLU(x) = max(0, x)    (6)
Sigmoid(x) = 1 / (1 + e^(−x))    (7)
where x represents an input value;
the convolution operation assigns each pixel a probability value for every class, and the final probability F of each class is obtained by summing over all feature maps, as in formula (8):
F = Σ_{k=0}^{K} w_k × d_k    (8)
where d represents a feature map output by the network, w represents the convolution weights, D represents the set of all pixel positions, k ∈ {0, 1, …, K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̂ is given by formula (9):
F̂ = α × Σ_{k=0}^{K} w_k × d_k    (9)
step 3.2, construct a residual block: first, apply a 1×1 convolution to the input feature map to obtain feature map f_c1; then sequentially apply a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution to obtain feature map f_c2; finally, add f_c1 and f_c2 and apply a ReLU activation function to obtain the output of the residual block;
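The channel-reweighting data flow of the local attention module in step 3.1 can be sketched in NumPy as below. This is a simplified stand-in, not the patented implementation: batch normalization is omitted, and the two 1×1 convolutions are replaced by hypothetical weight matrices w1 and w2 acting on channels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(fa, fb, w1, w2):
    # fa, fb: feature maps of shape (C, H, W); w1, w2: stand-ins for 1x1 convs.
    fb_cat = np.concatenate([fa, fb], axis=0)      # concatenate f_a and f_b
    fb1 = np.tensordot(w1, fb_cat, axes=1)         # 1x1 conv -> f_b' (BN omitted)
    pooled = fb1.mean(axis=(1, 2))                 # global pooling -> (C,)
    att = sigmoid(w2 @ np.maximum(0.0, pooled))    # ReLU -> 1x1 conv -> Sigmoid
    fa1 = att[:, None, None] * fa                  # f_a' = f_b'' * f_a
    return fa1 + fb1                               # output = f_a' + f_b'

C, H, W = 4, 8, 8
fa = np.random.rand(C, H, W)
fb = np.random.rand(C, H, W)
w1 = np.random.rand(C, 2 * C) * 0.1  # hypothetical 1x1-conv weights
w2 = np.random.rand(C, C) * 0.1
out = local_attention(fa, fb, w1, w2)
print(out.shape)  # (4, 8, 8)
```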
step 3.3, construct the coding network: first generate four feature maps F_4', F_3', F_2', F_1' as follows:
1) F_4' is generated as follows: input the feature map F_p output in step 2.3 and the feature map F_4 into the local attention module to output feature map F_4_1; input F_4_1 into a residual block to obtain feature map F_4_2; apply a dilated convolution with dilation rate d = 1 to F_4_2 to obtain F_4';
2) F_3' is generated as follows: input feature maps F_4_2 and F_3 into the local attention module to obtain feature map F_3_1; input F_3_1 into a residual block to obtain F_3_2; apply a dilated convolution with dilation rate d = 1 to F_3_2 to obtain F_3';
3) F_2' is generated as follows: input feature maps F_3_2 and F_2 into the local attention module to obtain feature map F_2_1; input F_2_1 into a residual block to obtain F_2_2; apply a dilated convolution with dilation rate d = 1 to F_2_2 to obtain F_2';
4) F_1' is generated as follows: input feature maps F_2_2 and F_1 into the local attention module to obtain feature map F_1_1; input F_1_1 into a residual block to obtain F_1_2; apply a dilated convolution with dilation rate d = 1 to F_1_2 to obtain F_1';
then concatenate the four feature maps F_1', F_2', F_3', F_4', and finally apply one 1×1 convolution to the concatenation to obtain the output F_encoder of the coding network.
The specific process of step 4 is as follows: first, apply a 1×1 convolution to the low-level feature map F_low to obtain feature map F_low'; upsample the coding network output F_encoder using bilinear sampling to obtain F_encoder'; then concatenate F_low' and F_encoder' and apply a 3×3 convolution; finally obtain the segmentation result through 4× bilinear upsampling.
The specific process of the step 5 is as follows:
step 5.1, train the model using the image data of the training set, and evaluate the segmentation effect of the model with the verification set during training; the verification set does not participate in training; the model is trained with a cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as in formula (11):
Loss_ce = −(1/N) Σ_{i=1}^{N} Σ_{t} y_{i,t} × log(p_{i,t})    (11)
where t is the true label value, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as label t, and y_{i,t} represents the true probability that the i-th sample has label t, with i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
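With one-hot labels y_{i,t}, formula (11) reduces to averaging the negative log of the predicted probability of each sample's true class; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    # Loss_ce = -(1/N) * sum_i log p_{i, t_i}, formula (11) with one-hot y.
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

probs = np.array([[0.7, 0.2, 0.1],   # per-class probabilities for 2 samples
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])            # true class index of each sample
print(cross_entropy_loss(probs, labels))
```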
step 5.2, use the mean intersection over union and the accuracy, common evaluation indexes in semantic segmentation, to evaluate the model: input the image data of the test set into the model one by one; the model output is the semantic segmentation result of each image, and the time used to segment each image is output at the same time.
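The two evaluation indexes of step 5.2 can be computed from predicted and ground-truth label maps as follows (a generic sketch, not the patent's exact evaluation code):

```python
import numpy as np

def miou_and_accuracy(pred, gt, num_classes):
    # Mean intersection-over-union across classes, plus pixel accuracy.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    acc = (pred == gt).mean()
    return float(np.mean(ious)), float(acc)

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
miou, acc = miou_and_accuracy(pred, gt, 2)
print(miou, acc)  # class 0: IoU 1/2, class 1: IoU 2/3 -> mIoU 7/12; acc 3/4
```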
The beneficial effects of the invention are as follows:
(1) The method of the invention is based on an encoder-decoder segmentation structure: the encoder network extracts features layer by layer, and the decoder gradually restores the image resolution through upsampling, so that each pixel in the image is classified.
(2) In a convolution operation, every feature map is treated by the same kernel. The method of the invention instead assigns a different weight to each feature map through the local attention module: feature maps that contribute to segmentation receive larger weights, while redundant feature maps receive smaller weights. The method can therefore significantly improve the network model's ability to discriminate each category, reduce segmentation inconsistency within categories, and improve the visual smoothness of the semantic segmentation.
Drawings
FIG. 1 is a flow chart of the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 2 is a schematic diagram of a local attention module architecture used in the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 3 is a schematic diagram of a residual block structure used in the local attention network based street view image semantic segmentation method of the present invention;
FIG. 4 is a diagram showing a comparison of a first original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
FIG. 5 is a diagram showing a comparison of a second original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
fig. 6 is a comparison chart of a third original image, a real label and a segmentation result obtained randomly in a test set in the embodiment of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a street view image semantic segmentation method based on a local attention network, which comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
the specific process of the step 1 is as follows:
step 1.1, randomly selecting 1000, 500 and 166 images from a public data set Cityscapes respectively as image data of a training set, a testing set and a verification set;
step 1.2, for all image data of the training set, enhance the data using random flipping, random cropping and random Gaussian blur, and finally normalize the training-set image data;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
Step 2, first construct an inverted residual module using depthwise separable convolution and a residual structure, then construct a MobileNetV2 network model using the inverted residual module and dilated convolution; the detailed structure of the MobileNetV2 network model is shown in Table 1. Input the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; apply four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
TABLE 1 detailed structure of MobileNet V2 network model
The specific process of the step 2 is as follows:
step 2.1, construct an inverted residual module using depthwise separable convolution and a residual network structure: first a 1×1 convolution raises the dimension, then a 3×3 depthwise separable convolution is applied to reduce computation, and finally a 1×1 convolution lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
where x represents the input data, and max() and min() are functions returning the maximum and the minimum of their inputs;
step 2.2, construct the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer; the specific model structure is shown in Table 1. To enlarge the receptive field of the convolution without losing information, dilated convolution, which skips across pixels, is adopted to extract image features layer by layer; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the fourth-layer output is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent kernel size k' of a dilated convolution is defined as shown in formula (2):
k'=k+(k-1)×(d-1) (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF i+1 =RF i +(k'-1)×S i (3)
where i represents the index of the network layer, RF represents the receptive field of the i-th layer, and S_i represents the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = Stride_1 × Stride_2 × ⋯ × Stride_i    (4)
where Stride_i represents the stride of the i-th layer;
step 2.3, for F_high, first use dilated convolutions with four dilation rates d = 0, 1, 2, 3 to obtain four feature maps F_1, F_2, F_3, F_4; then use one global average pooling to obtain a feature map F_p; the output size N of a convolution is computed as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
where W is the input size, F is the kernel size, S is the stride, and P is the padding size.
Step 3, design a local attention module and a residual block as shown in fig. 2 and fig. 3, and construct a coding network that extracts image features while recovering the local context information of the input image data, because local context information is likely to be lost as features are extracted step by step;
the specific process of the step 3 is as follows:
step 3.1, construct a local attention module (Local Attention Block, LAB): first, concatenate the input feature maps f_a and f_b, then apply batch normalization and a 1×1 convolution to obtain f_b'; next, sequentially apply global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to f_b' to obtain f_b''; multiply f_b'' with f_a to obtain f_a'; finally, add f_a' and f_b' to produce the output of the local attention module. A block diagram of the local attention module is shown in fig. 2. Its purpose is to assign a different weight to each channel: the convolution operation assigns each pixel a probability value for every class, and a weight parameter is set to optimize the highest probability.
In step 3.1, the Relu activation function ReLu (x) and the Sigmoid activation function are defined as formula (6) and formula (7):
wherein x represents an input value;
the convolution operation assigns each pixel a probability value for every class, and the final probability F of each class is obtained by summing over all feature maps, as in formula (8):
F = Σ_{k=0}^{K} w_k × d_k    (8)
where d represents a feature map output by the network, w represents the convolution weights, D represents the set of all pixel positions, k ∈ {0, 1, …, K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̂ is given by formula (9):
F̂ = α × Σ_{k=0}^{K} w_k × d_k    (9)
step 3.2, construct a residual block: first, apply a 1×1 convolution to the input feature map to obtain feature map f_c1; then sequentially apply a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution to obtain feature map f_c2; finally, add f_c1 and f_c2 and apply a ReLU activation function to obtain the output of the residual block. The block diagram of the residual block is shown in fig. 3; its purpose is to combine the information of all channels so as to refine the feature map.
step 3.3, construct the coding network: first generate four feature maps F_4', F_3', F_2', F_1' as follows:
1) F_4' is generated as follows: input the feature map F_p output in step 2.3 and the feature map F_4 into the local attention module to output feature map F_4_1; input F_4_1 into a residual block to obtain feature map F_4_2; apply a dilated convolution with dilation rate d = 1 to F_4_2 to obtain F_4';
2) F_3' is generated as follows: input feature maps F_4_2 and F_3 into the local attention module to obtain feature map F_3_1; input F_3_1 into a residual block to obtain F_3_2; apply a dilated convolution with dilation rate d = 1 to F_3_2 to obtain F_3';
3) F_2' is generated as follows: input feature maps F_3_2 and F_2 into the local attention module to obtain feature map F_2_1; input F_2_1 into a residual block to obtain F_2_2; apply a dilated convolution with dilation rate d = 1 to F_2_2 to obtain F_2';
4) F_1' is generated as follows: input feature maps F_2_2 and F_1 into the local attention module to obtain feature map F_1_1; input F_1_1 into a residual block to obtain F_1_2; apply a dilated convolution with dilation rate d = 1 to F_1_2 to obtain F_1';
then concatenate the four feature maps F_1', F_2', F_3', F_4', and finally apply one 1×1 convolution to the concatenation to obtain the output F_encoder of the coding network.
Step 4, constructing a decoding network: output of coding network and low-level characteristic diagram F low Sequentially performing upsampling, splicing and upsampling operations, gradually recovering the image resolution, and finally outputting the semantic segmentation result;
the specific process of the step 4 is as follows: first, a low-level feature map F low Performing a 1×1 convolution operation to obtain a feature map F low ' output feature map F of coding network encoder Upsampling using bilinear sampling method to obtain F encoder 'A'; then F is carried out low ' and F encoder ' splicing, and performing 3×3 convolution operation; finally obtaining a segmentation result through bilinear upsampling by 4 times.
Step 5, based on the semantic segmentation network model constructed in steps 2-4 and shown in fig. 1, train the model using the training set and the verification set, and verify the segmentation effect of the model on the test set; the performance of the model on the training set and verification set is shown in Table 2.
The specific process of the step 5 is as follows:
step 5.1, the semantic segmentation model structure based on the local attention network is shown in fig. 1. Train the model using the image data of the training set, and evaluate the segmentation effect of the model with the verification set during training; the verification set does not participate in training. The model is trained with a cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as in formula (11):
Loss_ce = −(1/N) Σ_{i=1}^{N} Σ_{t} y_{i,t} × log(p_{i,t})    (11)
where t is the true label value, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as label t, and y_{i,t} represents the true probability that the i-th sample has label t, with i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
step 5.2, use the mean intersection over union (MIoU) and the accuracy (Acc), common evaluation indexes in semantic segmentation, to evaluate the model: input the image data of the test set into the model one by one; the model output is the semantic segmentation result of each image, and the time used to segment each image is output at the same time. The performance of the model on the training set and verification set is shown in Table 2. Overall the model performs well: the mean intersection over union over all categories reaches 0.613, the accuracy reaches 0.942, and a segmentation result for an image with a resolution of 512×1024 is obtained within 0.5 seconds.
Table 2 model performance effects on training set and validation set
As shown in fig. 4 to 6, the original pictures are three images randomly acquired in the test set, the three original pictures are processed by using a semantic segmentation model based on a local attention network, the second column is a real label corresponding to the original pictures, and the third column is a semantic segmentation result obtained by using the model processing on the three original pictures.
Comparing the real labels with the segmentation results shows that the model's segmentation is accurate, the visual effect is good, and no large-area classification errors occur. In particular, large-area categories (roads, buildings, vehicles, etc.) are segmented accurately, category edges show no jagging, and no information is lost within categories; small targets can be segmented into approximate outlines, though their specific details need further refinement.
Claims (4)
1. A street view image semantic segmentation method based on a local attention network is characterized by comprising the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, firstly constructing an inverted residual module by using depth separable convolution and a residual structure, and then constructing a MobileNetV2 network model by using the inverted residual module and dilated (hole) convolution; inputting the image data of the training set into the MobileNetV2 network model to extract image features, and outputting a low-level feature map F_low and a high-level feature map F_high; for F_high, four dilated convolutions with different dilation rates and one global average pooling are used to obtain five feature maps;
the specific process of the step 2 is as follows:
step 2.1, constructing an inverted residual module using depth separable convolution and a residual network structure:
first, a convolution with kernel size 1×1 raises the dimension; then a depth separable convolution with kernel size 3×3 is applied in order to reduce the computational cost; finally, a convolution with kernel size 1×1 lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
wherein x represents the input data, and max() and min() are functions that return the maximum and minimum values of their inputs;
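Formula (1) is straightforward to implement; a minimal sketch:

```python
def relu6(x):
    """ReLU6(x) = min{max(0, x), 6}, as in formula (1)."""
    return min(max(0.0, x), 6.0)

# values in [0, 6] pass through unchanged; values outside are clipped
values = [relu6(v) for v in (-3.0, 2.5, 10.0)]
```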
step 2.2, constructing the MobileNetV2 network model by using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extracting image features layer by layer using dilated convolutions that sample across pixels;
all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the output of the fourth layer is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent convolution kernel size k' of a dilated convolution is defined as shown in formula (2):
k' = k + (k − 1) × (d − 1)    (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_(i+1) of the (i+1)-th dilated convolution layer is defined as shown in formula (3):
RF_(i+1) = RF_i + (k' − 1) × S_i    (3)
wherein i is the index of the network layer, RF_i is the receptive field of the i-th layer, and S_i is the product of the strides of all preceding layers, calculated as shown in formula (4):
S_i = Stride_1 × Stride_2 × … × Stride_i    (4)
wherein Stride_j represents the stride of the j-th layer;
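Formulas (2) through (4) can be checked with a few lines of Python; the layer-list representation (kernel, dilation, stride) is an assumption for illustration:

```python
def equivalent_kernel(k, d):
    """Formula (2): effective kernel size of a dilated convolution."""
    return k + (k - 1) * (d - 1)

def receptive_fields(layers):
    """Formulas (3)/(4): propagate the receptive field through (k, d, stride) layers.
    s holds the running product of the strides of the layers already processed (S_i)."""
    rf, s = 1, 1
    for k, d, stride in layers:
        rf += (equivalent_kernel(k, d) - 1) * s   # formula (3)
        s *= stride                               # formula (4)
    return rf

# three 3x3 layers with stride 1 and dilation 1 give a 7x7 receptive field
rf = receptive_fields([(3, 1, 1)] * 3)
```

A dilation of 2 widens a 3×3 kernel to an effective 5×5, which is how the model enlarges its receptive field without extra parameters.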
step 2.3, for F_high, four feature maps F_1, F_2, F_3, F_4 are first obtained using dilated convolutions with four dilation rates d = 0, 1, 2, 3; then one global average pooling is used to obtain a feature map F_p; the output size N of F_p is calculated as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
wherein W is the input size, F is the kernel size, S is the stride, and P is the padding size;
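Formula (5) is the standard convolution/pooling output-size rule; a quick sketch (integer division assumed when the division is not exact):

```python
def conv_output_size(w, f, s, p):
    """Formula (5): N = (W - F + 2P) / S + 1 for one spatial dimension."""
    return (w - f + 2 * p) // s + 1

# a 3x3 convolution with stride 1 and padding 1 preserves a 512-wide input
same_pad = conv_output_size(512, 3, 1, 1)
# a 3x3 convolution with stride 2 and no padding shrinks a 7-wide input to 3
strided = conv_output_size(7, 3, 2, 0)
```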
step 3, designing a local attention module and a residual block, and constructing a coding network;
the specific process of the step 3 is as follows:
step 3.1, constructing a local attention module: first, the input feature map f_a and feature map f_b are spliced, and batch normalization followed by a 1×1 convolution operation yields f_b'; then f_b' is passed sequentially through global pooling, a ReLU activation function, a 1×1 convolution, and a Sigmoid activation function to obtain f_b''; f_b'' is multiplied with f_a to obtain f_a'; finally, f_a' and f_b' are added to give the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function Sigmoid(x) are defined as formula (6) and formula (7):
ReLU(x) = max(0, x)    (6)
Sigmoid(x) = 1 / (1 + e^(−x))    (7)
wherein x represents the input value;
the convolution operation gives each pixel a probability value for each class, and the probability F_k of each class is finally obtained by summing over all feature maps, as in formula (8):
F_k = Σ_(i∈D) (w * d)_k(i)    (8)
wherein d represents a feature map output by the network, w represents the convolution operation, D represents the set of all pixel positions, k ∈ {0, 1, …, K} indexes the channels, and F_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̃_k is as shown in formula (9):
F̃_k = α × F_k    (9)
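A simplified numpy sketch of the local attention module of step 3.1 follows; batch normalization is omitted for brevity, and the channel counts, random weights, and the matrix form of the 1×1 convolutions are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution as per-pixel channel mixing; x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(fa, fb, w1, w2):
    """Step 3.1 sketch: splice fa and fb -> 1x1 conv -> fb';
    fb' -> global pooling -> ReLU -> 1x1 conv -> Sigmoid -> fb'';
    fa' = fb'' * fa; output = fa' + fb'."""
    spliced = np.concatenate([fa, fb], axis=0)           # channel-wise splice
    fb1 = conv1x1(spliced, w1)                           # fb' (BN omitted here)
    pooled = fb1.mean(axis=(1, 2), keepdims=True)        # global average pooling
    gate = sigmoid(conv1x1(np.maximum(pooled, 0), w2))   # ReLU -> 1x1 conv -> Sigmoid: fb''
    fa1 = gate * fa                                      # fa', gate broadcast over H, W
    return fa1 + fb1                                     # module output

C, H, W = 4, 8, 8
fa = rng.normal(size=(C, H, W))
fb = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C, 2 * C))   # splice doubles the channel count
w2 = rng.normal(size=(C, C))
out = local_attention(fa, fb, w1, w2)
```

The per-channel gate fb'' rescales fa globally, which is the attention effect the module relies on; spatial shape is preserved throughout.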
step 3.2, constructing a residual block: firstly, the input feature map is subjected to a 1×1 convolution to obtain a feature map f_c1; then it is passed sequentially through a 3×3 convolution with a ReLU activation function, batch normalization, and another 3×3 convolution to obtain a feature map f_c2; finally, f_c1 and f_c2 are added, and a ReLU activation function gives the output of the residual block;
step 3.3, constructing the coding network: first, four feature maps F_4', F_3', F_2', F_1' are generated as follows:
1) feature map F_4': the feature map F_p output in step 2.3 and the feature map F_4 are input to the local attention module; its output feature map F_4_1 is input to the residual block to obtain feature map F_4_2; F_4_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_4';
2) feature map F_3': feature map F_4_2 and feature map F_3 are input to the local attention module to obtain feature map F_3_1; F_3_1 is input to the residual block to obtain feature map F_3_2; F_3_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_3';
3) feature map F_2': feature map F_3_2 and feature map F_2 are input to the local attention module to obtain feature map F_2_1; F_2_1 is input to the residual block to obtain feature map F_2_2; F_2_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_2';
4) feature map F_1': feature map F_2_2 and feature map F_1 are input to the local attention module to obtain feature map F_1_1; F_1_1 is input to the residual block to obtain feature map F_1_2; F_1_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_1';
then the four feature maps F_1', F_2', F_3', F_4' are spliced, and finally one 1×1 convolution operation is performed on the splicing result to obtain the output F_encoder of the coding network;
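The repeated pattern of step 3.3 (local attention, then residual block, then dilated convolution, with each stage's residual output feeding the next stage's attention) can be traced symbolically; the helper names below are string placeholders, not real network modules:

```python
def encoder_trace(names):
    """Symbolic trace of step 3.3: each stage feeds the local attention module (LA)
    with the previous stage's residual output (F_p for the first stage) and F_i,
    then applies a residual block (Res) and a dilated convolution (DConv, d = 1)."""
    outputs, prev = [], "F_p"
    for name in names:                      # processed in order F4, F3, F2, F1
        att = f"LA({prev},{name})"          # -> F_i_1
        res = f"Res({att})"                 # -> F_i_2
        outputs.append(f"DConv({res})")     # -> F_i'
        prev = res                          # next stage consumes F_i_2
    # splice F_1', F_2', F_3', F_4' and apply the final 1x1 convolution
    return f"Conv1x1(Concat({', '.join(reversed(outputs))}))"

trace = encoder_trace(["F4", "F3", "F2", "F1"])
```

Printing `trace` makes the wiring explicit: every later stage's expression contains the earlier stages' residual outputs, which is why the four branches are not independent.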
step 4, constructing a decoding network: the output of the coding network and the low-level feature map F_low are sequentially subjected to upsampling, splicing and upsampling operations to gradually recover the image resolution, and finally the semantic segmentation result is output;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
2. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, the data are enhanced using random flipping, random cropping and random Gaussian blur, and finally the image data of the training set are normalized;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
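Step 1.3 can be sketched for a single-channel image as follows; the crop position and the min-max normalization are illustrative assumptions (the claim does not specify them), and a real pipeline would operate on RGB images with dataset mean/std statistics:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear interpolation for a single-channel image of shape (H, W)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def preprocess(img):
    """Claim 2, step 1.3 sketch: scale to 513x1026, crop 513x513, normalize."""
    img = bilinear_resize(img, 513, 1026)
    img = img[:513, :513]                    # crop position is an assumption
    return (img - img.min()) / max(img.max() - img.min(), 1e-12)

# a fake 1024x2048 Cityscapes-sized single-channel image
sample = np.arange(1024 * 2048, dtype=float).reshape(1024, 2048)
out = preprocess(sample)
```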
3. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of step 4 is as follows: first, the low-level feature map F_low is subjected to a 1×1 convolution operation to obtain a feature map F_low', and the output feature map F_encoder of the coding network is upsampled using the bilinear sampling method to obtain F_encoder'; then F_low' and F_encoder' are spliced, and a 3×3 convolution operation is performed; finally, the segmentation result is obtained through 4× bilinear upsampling.
4. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 5 is as follows:
step 5.1, training the model by using the image data of the training set, and evaluating the segmentation effect of the model with the verification set during the training process, wherein the verification set does not participate in training; the model is trained with the cross entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross entropy loss function Loss_ce is defined as formula (11):
Loss_ce = −(1/N) Σ_i Σ_t y_(i,t) log(p_(i,t))    (11)
wherein N is the total number of samples, t indexes the real label values, p_(i,t) represents the probability that the i-th sample is predicted as the t-th label value, and y_(i,t) represents the true probability that the i-th sample has the t-th label, where i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
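Formula (11) for a batch of predicted probabilities can be sketched as follows; the toy probabilities and one-hot labels are illustrative:

```python
import numpy as np

def cross_entropy_loss(p, y, eps=1e-12):
    """Formula (11): Loss_ce = -(1/N) * sum_i sum_t y[i,t] * log(p[i,t]),
    with p the predicted class probabilities and y the one-hot true labels."""
    n = p.shape[0]
    return -np.sum(y * np.log(p + eps)) / n   # eps guards against log(0)

# two samples, three classes; both samples predicted mostly correctly
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
loss = cross_entropy_loss(p, y)
```

Only the true-class probabilities contribute, so here the loss is −(ln 0.7 + ln 0.8)/2 ≈ 0.290.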
and step 5.2, using the mean intersection over union and the accuracy in the semantic segmentation method as evaluation indexes to evaluate the model, inputting the image data in the test set into the model one by one, wherein the output of the model is the semantic segmentation result of each image, and the time used to segment each image is output at the same time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110763344.8A CN113642390B (en) | 2021-07-06 | 2021-07-06 | Street view image semantic segmentation method based on local attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642390A CN113642390A (en) | 2021-11-12 |
CN113642390B true CN113642390B (en) | 2024-02-13 |
Family
ID=78416754
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110763344.8A Active CN113642390B (en) | 2021-07-06 | 2021-07-06 | Street view image semantic segmentation method based on local attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642390B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114266977B (en) * | 2021-12-27 | 2023-04-07 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
CN114332839A (en) * | 2021-12-30 | 2022-04-12 | 福州大学 | Streetscape text detection method based on multi-space joint perception |
CN116055174A (en) * | 2023-01-10 | 2023-05-02 | 吉林大学 | Internet of vehicles intrusion detection method based on improved MobileNet V2 |
CN116843696B (en) * | 2023-04-27 | 2024-04-09 | 山东省人工智能研究院 | Cardiac MRI (magnetic resonance imaging) segmentation method based on feature similarity and super-parameter convolution attention |
CN116612122B (en) * | 2023-07-20 | 2023-10-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Image significance region detection method and device, storage medium and electronic equipment |
CN116721302B (en) * | 2023-08-10 | 2024-01-12 | 成都信息工程大学 | Ice and snow crystal particle image classification method based on lightweight network |
CN117409030B (en) * | 2023-12-14 | 2024-03-22 | 齐鲁工业大学(山东省科学院) | OCTA image blood vessel segmentation method and system based on dynamic tubular convolution |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183258A (en) * | 2020-09-16 | 2021-01-05 | 太原理工大学 | Remote sensing image road segmentation method based on context information and attention mechanism |
CN112183360A (en) * | 2020-09-29 | 2021-01-05 | 上海交通大学 | Lightweight semantic segmentation method for high-resolution remote sensing image |
CN112330681A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Attention mechanism-based lightweight network real-time semantic segmentation method |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11461998B2 (en) * | 2019-09-25 | 2022-10-04 | Samsung Electronics Co., Ltd. | System and method for boundary aware semantic segmentation |
2021-07-06 (CN): application CN202110763344.8A, patent CN113642390B/en, status Active
Non-Patent Citations (2)
Title |
---|
严广宇; 刘正熙. Real-time semantic segmentation algorithm based on hybrid attention (基于混合注意力的实时语义分割算法). Modern Computer (现代计算机), 2020, No. 10, full text. *
任天赐; 黄向生; 丁伟利; 安重阳; 翟鹏博. Semantic segmentation algorithm based on a global bilateral network (全局双边网络的语义分割算法). Computer Science (计算机科学), 2020, No. S1, full text. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||