CN113642390B - Street view image semantic segmentation method based on local attention network - Google Patents
- Publication number
- CN113642390B (application CN202110763344.8A, filed 2021-07-06)
- Authority
- CN
- China
- Prior art keywords
- feature map
- convolution
- input
- network
- image data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Pattern recognition; classification techniques
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a street view image semantic segmentation method based on a local attention network, comprising the following specific implementation steps: step 1, randomly select partial image data from the public dataset Cityscapes and divide the selected data into a training set, a verification set and a test set; step 2, construct a MobileNetV2 network model using inverted residual modules and dilated convolution; step 3, design a local attention module and a residual block, and construct a coding network; step 4, construct a decoding network, gradually recover the image resolution, and output the final semantic segmentation result; step 5, train the model using the training set and verification set, and verify the segmentation effect of the model on the test set. The method addresses the prior-art problem that local information cannot be fully preserved during feature extraction, which causes inconsistent segmentation results within a category.
Description
Technical Field
The invention belongs to the field of digital image processing methods, and particularly relates to a street view image semantic segmentation method based on a local attention network.
Background
Vision is an important way for humans to recognize and accept external information, and humans tend to acquire information directly from images rather than from linguistic text descriptions. For computers, however, performing scene understanding tasks as the human eye does, such as accurately classifying image pixels, remains challenging. The objective of semantic segmentation is to correctly classify each pixel in an image using a computer; it is a pixel-by-pixel classification task. Scene understanding uses a computer to perceive and understand the environment much as a human does, and semantic segmentation, as a necessary path toward scene understanding, is a key and fundamental technology.
In urban road scenes, semantic segmentation is a key technology for understanding different kinds of objects such as vehicles, sidewalks, roads and traffic lights. However, street scenes are usually complicated and unstructured: illumination and seasonal weather vary, target scales can be very small, and objects may be occluded, so diverse targets are usually present on roads. Visual understanding and semantic segmentation in street scenes is therefore a very complex and serious challenge.
Disclosure of Invention
The invention aims to provide a street view image semantic segmentation method based on a local attention network, which addresses the prior-art problem that local information cannot be fully preserved during feature extraction and thereby resolves inconsistent segmentation results within a category.
The invention adopts the technical scheme that the street view image semantic segmentation method based on the local attention network comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, first construct an inverted residual module using depthwise separable convolution and a residual structure, then construct a MobileNetV2 network model using the inverted residual module and dilated convolution; input the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; apply four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
step 3, because local context information is likely to be lost as features are extracted step by step, design a local attention module and a residual block and construct a coding network that extracts image features while restoring the local context information of the input image data;
step 4, construct a decoding network: sequentially perform upsampling, concatenation and upsampling on the output of the coding network and the low-level feature map F_low, gradually recover the image resolution, and finally output the semantic segmentation result;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
The present invention is also characterized in that,
the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, enhance the data using random flipping, random cropping and random Gaussian blur, and finally normalize the training-set image data;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
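As a rough illustration of the preprocessing in steps 1.2-1.3, the sketch below performs random cropping and per-channel normalization in NumPy; the mean/std values are hypothetical ImageNet-style statistics, not taken from the patent, and the bilinear scaling step is omitted.

```python
import numpy as np

def random_crop(img, size=513):
    # Randomly crop a size x size patch from an H x W x C image (H, W >= size).
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size]

def normalize(img, mean, std):
    # Per-channel normalization: (pixel / 255 - mean) / std.
    return (img.astype(np.float32) / 255.0 - mean) / std

rgb = np.random.randint(0, 256, (513, 1026, 3), dtype=np.uint8)  # a scaled image
patch = random_crop(rgb)                                          # 513 x 513 crop
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # hypothetical stats
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
out = normalize(patch, mean, std)
print(out.shape)  # (513, 513, 3)
```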
The specific process of the step 2 is as follows:
step 2.1, construct an inverted residual module using depthwise separable convolution and a residual network structure: first a 1×1 convolution raises the dimension, then a 3×3 depthwise separable convolution is applied to reduce computation, and finally a 1×1 convolution lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
where x represents the input data, and max() and min() are functions returning the maximum and the minimum of their inputs;
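Formula (1) can be checked with a few lines of NumPy (a minimal sketch, not part of the patented method):

```python
import numpy as np

def relu6(x):
    # ReLU6(x) = min{max(0, x), 6}, as in formula (1)
    return np.minimum(np.maximum(0.0, x), 6.0)

print(relu6(np.array([-3.0, 2.5, 7.0])))  # values 0.0, 2.5, 6.0
```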
step 2.2, construct the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extract image features layer by layer using dilated convolution, which skips across pixels; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the fourth-layer output is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent kernel size k' of a dilated convolution is defined as shown in formula (2):
k'=k+(k-1)×(d-1) (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF i+1 =RF i +(k'-1)×S i (3)
where i denotes the index of the network layer, RF_i denotes the receptive field of the i-th layer, and S_i denotes the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = Stride_1 × Stride_2 × ⋯ × Stride_i    (4)
where Stride_i represents the stride of the i-th layer;
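Formulas (2)-(4) combine into a small receptive-field calculator; the helper names below are illustrative only:

```python
def equivalent_kernel(k, d):
    # k' = k + (k - 1) * (d - 1), formula (2)
    return k + (k - 1) * (d - 1)

def receptive_field(kernels, dilations, strides):
    # RF_{i+1} = RF_i + (k' - 1) * S_i, formula (3),
    # where S_i is the product of all preceding strides, formula (4)
    rf, s = 1, 1
    for k, d, stride in zip(kernels, dilations, strides):
        rf += (equivalent_kernel(k, d) - 1) * s
        s *= stride
    return rf

# A 3x3 convolution with dilation rate 2 behaves like a 5x5 convolution:
print(equivalent_kernel(3, 2))              # 5
print(receptive_field([3, 3], [1, 1], [1, 1]))  # 5
```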
step 2.3, for F_high, first use dilated convolutions with four dilation rates d = 0, 1, 2, 3 to obtain four feature maps F_1, F_2, F_3, F_4; then use one global average pooling to obtain a feature map F_p; the output size N of a convolution is computed as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
where W is the input size, F is the kernel size, S is the stride, and P is the padding size.
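Formula (5) is the standard convolution output-size relation; a quick sanity check (the function name is an assumption):

```python
def conv_output_size(w, f, s, p):
    # N = (W - F + 2P) / S + 1, formula (5)
    return (w - f + 2 * p) // s + 1

# A 3x3 kernel with stride 1 and padding 1 preserves spatial size:
print(conv_output_size(65, 3, 1, 1))   # 65
# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel input:
print(conv_output_size(224, 7, 2, 3))  # 112
```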
The specific process of the step 3 is as follows:
step 3.1, construct a local attention module: first, concatenate the input feature maps f_a and f_b, then apply batch normalization and a 1×1 convolution to obtain f_b'; next, sequentially apply global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to f_b' to obtain f_b''; multiply f_b'' with f_a to obtain f_a'; finally, add f_a' and f_b' to produce the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function Sigmoid(x) are defined as in formulas (6) and (7):
ReLU(x) = max(0, x)    (6)
Sigmoid(x) = 1 / (1 + e^(−x))    (7)
where x represents an input value;
the convolution operation assigns each pixel a probability value for every class, and the final probability F of each class is obtained by summing over all feature maps, as in formula (8):
F = Σ_{k=0}^{K} w_k × d_k    (8)
where d represents a feature map output by the network, w represents the convolution weights, D represents the set of all pixel positions, k ∈ {0, 1, …, K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̂ is given by formula (9):
F̂ = α × Σ_{k=0}^{K} w_k × d_k    (9)
step 3.2, construct a residual block: first, apply a 1×1 convolution to the input feature map to obtain feature map f_c1; then sequentially apply a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution to obtain feature map f_c2; finally, add f_c1 and f_c2 and apply a ReLU activation function to obtain the output of the residual block;
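The channel-reweighting data flow of the local attention module in step 3.1 can be sketched in NumPy as below. This is a simplified stand-in, not the patented implementation: batch normalization is omitted, and the two 1×1 convolutions are replaced by hypothetical weight matrices w1 and w2 acting on channels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(fa, fb, w1, w2):
    # fa, fb: feature maps of shape (C, H, W); w1, w2: stand-ins for 1x1 convs.
    fb_cat = np.concatenate([fa, fb], axis=0)      # concatenate f_a and f_b
    fb1 = np.tensordot(w1, fb_cat, axes=1)         # 1x1 conv -> f_b' (BN omitted)
    pooled = fb1.mean(axis=(1, 2))                 # global pooling -> (C,)
    att = sigmoid(w2 @ np.maximum(0.0, pooled))    # ReLU -> 1x1 conv -> Sigmoid
    fa1 = att[:, None, None] * fa                  # f_a' = f_b'' * f_a
    return fa1 + fb1                               # output = f_a' + f_b'

C, H, W = 4, 8, 8
fa = np.random.rand(C, H, W)
fb = np.random.rand(C, H, W)
w1 = np.random.rand(C, 2 * C) * 0.1  # hypothetical 1x1-conv weights
w2 = np.random.rand(C, C) * 0.1
out = local_attention(fa, fb, w1, w2)
print(out.shape)  # (4, 8, 8)
```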
step 3.3, construct the coding network: first generate four feature maps F_4', F_3', F_2', F_1' as follows:
1) F_4' is generated as follows: input the feature map F_p output in step 2.3 and the feature map F_4 into the local attention module to output feature map F_4_1; input F_4_1 into a residual block to obtain feature map F_4_2; apply a dilated convolution with dilation rate d = 1 to F_4_2 to obtain F_4';
2) F_3' is generated as follows: input feature maps F_4_2 and F_3 into the local attention module to obtain feature map F_3_1; input F_3_1 into a residual block to obtain F_3_2; apply a dilated convolution with dilation rate d = 1 to F_3_2 to obtain F_3';
3) F_2' is generated as follows: input feature maps F_3_2 and F_2 into the local attention module to obtain feature map F_2_1; input F_2_1 into a residual block to obtain F_2_2; apply a dilated convolution with dilation rate d = 1 to F_2_2 to obtain F_2';
4) F_1' is generated as follows: input feature maps F_2_2 and F_1 into the local attention module to obtain feature map F_1_1; input F_1_1 into a residual block to obtain F_1_2; apply a dilated convolution with dilation rate d = 1 to F_1_2 to obtain F_1';
then concatenate the four feature maps F_1', F_2', F_3', F_4', and finally apply one 1×1 convolution to the concatenation to obtain the output F_encoder of the coding network.
The specific process of step 4 is as follows: first, apply a 1×1 convolution to the low-level feature map F_low to obtain feature map F_low'; upsample the coding network output F_encoder using bilinear sampling to obtain F_encoder'; then concatenate F_low' and F_encoder' and apply a 3×3 convolution; finally obtain the segmentation result through 4× bilinear upsampling.
The specific process of the step 5 is as follows:
step 5.1, train the model using the image data of the training set, and evaluate the segmentation effect of the model with the verification set during training; the verification set does not participate in training; the model is trained with a cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as in formula (11):
Loss_ce = −(1/N) Σ_{i=1}^{N} Σ_{t} y_{i,t} × log(p_{i,t})    (11)
where t is the true label value, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as label t, and y_{i,t} represents the true probability that the i-th sample has label t, with i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
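With one-hot labels y_{i,t}, formula (11) reduces to averaging the negative log of the predicted probability of each sample's true class; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    # Loss_ce = -(1/N) * sum_i log p_{i, t_i}, formula (11) with one-hot y.
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

probs = np.array([[0.7, 0.2, 0.1],   # per-class probabilities for 2 samples
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])            # true class index of each sample
print(cross_entropy_loss(probs, labels))
```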
step 5.2, use the mean intersection over union and the accuracy, common evaluation indexes in semantic segmentation, to evaluate the model: input the image data of the test set into the model one by one; the model output is the semantic segmentation result of each image, and the time used to segment each image is output at the same time.
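The two evaluation indexes of step 5.2 can be computed from predicted and ground-truth label maps as follows (a generic sketch, not the patent's exact evaluation code):

```python
import numpy as np

def miou_and_accuracy(pred, gt, num_classes):
    # Mean intersection-over-union across classes, plus pixel accuracy.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    acc = (pred == gt).mean()
    return float(np.mean(ious)), float(acc)

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
miou, acc = miou_and_accuracy(pred, gt, 2)
print(miou, acc)  # class 0: IoU 1/2, class 1: IoU 2/3 -> mIoU 7/12; acc 3/4
```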
The beneficial effects of the invention are as follows:
(1) The method of the invention is based on an encoder-decoder segmentation structure: the encoder network extracts features layer by layer, and the decoder gradually restores the image resolution through upsampling, so that each pixel in the image is classified.
(2) In a convolution operation, every feature map is treated by the same kernel. The method of the invention instead assigns a different weight to each feature map through the local attention module: feature maps that contribute to segmentation receive larger weights, while redundant feature maps receive smaller weights. The method can therefore significantly improve the network model's ability to discriminate each category, reduce segmentation inconsistency within categories, and improve the visual smoothness of the semantic segmentation.
Drawings
FIG. 1 is a flow chart of the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 2 is a schematic diagram of a local attention module architecture used in the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 3 is a schematic diagram of a residual block structure used in the local attention network based street view image semantic segmentation method of the present invention;
FIG. 4 is a diagram showing a comparison of a first original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
FIG. 5 is a diagram showing a comparison of a second original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
fig. 6 is a comparison chart of a third original image, a real label and a segmentation result obtained randomly in a test set in the embodiment of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a street view image semantic segmentation method based on a local attention network, which comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
the specific process of the step 1 is as follows:
step 1.1, randomly selecting 1000, 500 and 166 images from a public data set Cityscapes respectively as image data of a training set, a testing set and a verification set;
step 1.2, for all image data of the training set, enhance the data using random flipping, random cropping and random Gaussian blur, and finally normalize the training-set image data;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
Step 2, first construct an inverted residual module using depthwise separable convolution and a residual structure, then construct a MobileNetV2 network model using the inverted residual module and dilated convolution; the detailed structure of the MobileNetV2 network model is shown in Table 1. Input the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; apply four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
TABLE 1 detailed structure of MobileNet V2 network model
The specific process of the step 2 is as follows:
step 2.1, construct an inverted residual module using depthwise separable convolution and a residual network structure: first a 1×1 convolution raises the dimension, then a 3×3 depthwise separable convolution is applied to reduce computation, and finally a 1×1 convolution lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
where x represents the input data, and max() and min() are functions returning the maximum and the minimum of their inputs;
step 2.2, construct the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer; the specific model structure is shown in Table 1. To enlarge the receptive field of the convolution without losing information, dilated convolution, which skips across pixels, is adopted to extract image features layer by layer; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the fourth-layer output is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent kernel size k' of a dilated convolution is defined as shown in formula (2):
k'=k+(k-1)×(d-1) (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF i+1 =RF i +(k'-1)×S i (3)
where i represents the index of the network layer, RF represents the receptive field of the i-th layer, and S_i represents the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = Stride_1 × Stride_2 × ⋯ × Stride_i    (4)
where Stride_i represents the stride of the i-th layer;
step 2.3, for F_high, first use dilated convolutions with four dilation rates d = 0, 1, 2, 3 to obtain four feature maps F_1, F_2, F_3, F_4; then use one global average pooling to obtain a feature map F_p; the output size N of a convolution is computed as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
where W is the input size, F is the kernel size, S is the stride, and P is the padding size.
Step 3, design a local attention module and a residual block as shown in fig. 2 and fig. 3, and construct a coding network that extracts image features while recovering the local context information of the input image data, because local context information is likely to be lost as features are extracted step by step;
the specific process of the step 3 is as follows:
step 3.1, construct a local attention module (Local Attention Block, LAB): first, concatenate the input feature maps f_a and f_b, then apply batch normalization and a 1×1 convolution to obtain f_b'; next, sequentially apply global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to f_b' to obtain f_b''; multiply f_b'' with f_a to obtain f_a'; finally, add f_a' and f_b' to produce the output of the local attention module. A block diagram of the local attention module is shown in fig. 2. Its purpose is to assign a different weight to each channel: the convolution operation assigns each pixel a probability value for every class, and a weight parameter is set to optimize the highest probability.
In step 3.1, the Relu activation function ReLu (x) and the Sigmoid activation function are defined as formula (6) and formula (7):
wherein x represents an input value;
the convolution operation assigns each pixel a probability value for every class, and the final probability F of each class is obtained by summing over all feature maps, as in formula (8):
F = Σ_{k=0}^{K} w_k × d_k    (8)
where d represents a feature map output by the network, w represents the convolution weights, D represents the set of all pixel positions, k ∈ {0, 1, …, K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̂ is given by formula (9):
F̂ = α × Σ_{k=0}^{K} w_k × d_k    (9)
step 3.2, construct a residual block: first, apply a 1×1 convolution to the input feature map to obtain feature map f_c1; then sequentially apply a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution to obtain feature map f_c2; finally, add f_c1 and f_c2 and apply a ReLU activation function to obtain the output of the residual block. The block diagram of the residual block is shown in fig. 3; its purpose is to combine the information of all channels so as to refine the feature map.
step 3.3, construct the coding network: first generate four feature maps F_4', F_3', F_2', F_1' as follows:
1) F_4' is generated as follows: input the feature map F_p output in step 2.3 and the feature map F_4 into the local attention module to output feature map F_4_1; input F_4_1 into a residual block to obtain feature map F_4_2; apply a dilated convolution with dilation rate d = 1 to F_4_2 to obtain F_4';
2) F_3' is generated as follows: input feature maps F_4_2 and F_3 into the local attention module to obtain feature map F_3_1; input F_3_1 into a residual block to obtain F_3_2; apply a dilated convolution with dilation rate d = 1 to F_3_2 to obtain F_3';
3) F_2' is generated as follows: input feature maps F_3_2 and F_2 into the local attention module to obtain feature map F_2_1; input F_2_1 into a residual block to obtain F_2_2; apply a dilated convolution with dilation rate d = 1 to F_2_2 to obtain F_2';
4) F_1' is generated as follows: input feature maps F_2_2 and F_1 into the local attention module to obtain feature map F_1_1; input F_1_1 into a residual block to obtain F_1_2; apply a dilated convolution with dilation rate d = 1 to F_1_2 to obtain F_1';
then concatenate the four feature maps F_1', F_2', F_3', F_4', and finally apply one 1×1 convolution to the concatenation to obtain the output F_encoder of the coding network.
Step 4, constructing a decoding network: output of coding network and low-level characteristic diagram F low Sequentially performing upsampling, splicing and upsampling operations, gradually recovering the image resolution, and finally outputting the semantic segmentation result;
the specific process of the step 4 is as follows: first, a low-level feature map F low Performing a 1×1 convolution operation to obtain a feature map F low ' output feature map F of coding network encoder Upsampling using bilinear sampling method to obtain F encoder 'A'; then F is carried out low ' and F encoder ' splicing, and performing 3×3 convolution operation; finally obtaining a segmentation result through bilinear upsampling by 4 times.
Step 5, based on the semantic segmentation network model constructed in steps 2-4 and shown in fig. 1, train the model using the training set and the verification set, and verify the segmentation effect of the model on the test set; the performance of the model on the training set and verification set is shown in Table 2.
The specific process of the step 5 is as follows:
step 5.1, the semantic segmentation model structure based on the local attention network is shown in fig. 1. Train the model using the image data of the training set, and evaluate the segmentation effect of the model with the verification set during training; the verification set does not participate in training. The model is trained with a cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as in formula (11):
Loss_ce = −(1/N) Σ_{i=1}^{N} Σ_{t} y_{i,t} × log(p_{i,t})    (11)
where t is the true label value, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as label t, and y_{i,t} represents the true probability that the i-th sample has label t, with i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
step 5.2, use the mean intersection over union (MIoU) and the accuracy (Acc), common evaluation indexes in semantic segmentation, to evaluate the model: input the image data of the test set into the model one by one; the model output is the semantic segmentation result of each image, and the time used to segment each image is output at the same time. The performance of the model on the training set and verification set is shown in Table 2. Overall the model performs well: the mean intersection over union over all categories reaches 0.613, the accuracy reaches 0.942, and a segmentation result for an image with a resolution of 512×1024 is obtained within 0.5 seconds.
Table 2 model performance effects on training set and validation set
As shown in fig. 4 to 6, the original pictures are three images randomly acquired in the test set, the three original pictures are processed by using a semantic segmentation model based on a local attention network, the second column is a real label corresponding to the original pictures, and the third column is a semantic segmentation result obtained by using the model processing on the three original pictures.
Comparing the real labels with the segmentation results shows that the model's segmentation is accurate, the visual effect is good, and no large-area classification errors occur. In particular, large-area categories (roads, buildings, vehicles, etc.) are segmented accurately, category edges show no jagging, and no information is lost within categories; small targets can be segmented into approximate outlines, though their specific details need further refinement.
Claims (4)
1. A street view image semantic segmentation method based on a local attention network is characterized by comprising the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, firstly constructing an inverted residual module by using depth separable convolution and a residual structure, and then constructing a MobileNetV2 network model by using the inverted residual module and dilated (hole) convolution; inputting the image data of the training set into the MobileNetV2 network model to extract image features, and outputting a low-level feature map F_low and a high-level feature map F_high; for F_high, four dilated convolutions with different dilation rates and one global average pooling are used to obtain five feature maps;
the specific process of the step 2 is as follows:
step 2.1, constructing an inverted residual module using depth separable convolution and a residual network structure:
first, a convolution with kernel size 1×1 raises the dimension; then a depth separable convolution with kernel size 3×3 is applied in order to reduce the computational cost; finally, a convolution with kernel size 1×1 lowers the dimension; two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}    (1)
wherein x represents the input data, and max() and min() are functions that return the maximum and minimum values of their inputs;
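Formula (1) is straightforward to implement; a minimal sketch:

```python
def relu6(x):
    """ReLU6(x) = min{max(0, x), 6}, as in formula (1)."""
    return min(max(0.0, x), 6.0)

# values in [0, 6] pass through unchanged; values outside are clipped
values = [relu6(v) for v in (-3.0, 2.5, 10.0)]
```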
step 2.2, constructing the MobileNetV2 network model by using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extracting image features layer by layer using dilated convolutions that sample across pixels;
all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d = 1; the output of the fourth layer is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent convolution kernel size k' of a dilated convolution is defined as shown in formula (2):
k' = k + (k − 1) × (d − 1)    (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_(i+1) of the (i+1)-th dilated convolution layer is defined as shown in formula (3):
RF_(i+1) = RF_i + (k' − 1) × S_i    (3)
wherein i is the index of the network layer, RF_i is the receptive field of the i-th layer, and S_i is the product of the strides of all preceding layers, calculated as shown in formula (4):
S_i = Stride_1 × Stride_2 × … × Stride_i    (4)
wherein Stride_j represents the stride of the j-th layer;
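Formulas (2) through (4) can be checked with a few lines of Python; the layer-list representation (kernel, dilation, stride) is an assumption for illustration:

```python
def equivalent_kernel(k, d):
    """Formula (2): effective kernel size of a dilated convolution."""
    return k + (k - 1) * (d - 1)

def receptive_fields(layers):
    """Formulas (3)/(4): propagate the receptive field through (k, d, stride) layers.
    s holds the running product of the strides of the layers already processed (S_i)."""
    rf, s = 1, 1
    for k, d, stride in layers:
        rf += (equivalent_kernel(k, d) - 1) * s   # formula (3)
        s *= stride                               # formula (4)
    return rf

# three 3x3 layers with stride 1 and dilation 1 give a 7x7 receptive field
rf = receptive_fields([(3, 1, 1)] * 3)
```

A dilation of 2 widens a 3×3 kernel to an effective 5×5, which is how the model enlarges its receptive field without extra parameters.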
step 2.3, for F_high, four feature maps F_1, F_2, F_3, F_4 are first obtained using dilated convolutions with four dilation rates d = 0, 1, 2, 3; then one global average pooling is used to obtain a feature map F_p; the output size N of F_p is calculated as shown in formula (5):
N = (W − F + 2P) / S + 1    (5)
wherein W is the input size, F is the kernel size, S is the stride, and P is the padding size;
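Formula (5) is the standard convolution/pooling output-size rule; a quick sketch (integer division assumed when the division is not exact):

```python
def conv_output_size(w, f, s, p):
    """Formula (5): N = (W - F + 2P) / S + 1 for one spatial dimension."""
    return (w - f + 2 * p) // s + 1

# a 3x3 convolution with stride 1 and padding 1 preserves a 512-wide input
same_pad = conv_output_size(512, 3, 1, 1)
# a 3x3 convolution with stride 2 and no padding shrinks a 7-wide input to 3
strided = conv_output_size(7, 3, 2, 0)
```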
step 3, designing a local attention module and a residual block, and constructing a coding network;
the specific process of the step 3 is as follows:
step 3.1, constructing a local attention module: first, the input feature map f_a and feature map f_b are spliced, and batch normalization followed by a 1×1 convolution operation yields f_b'; then f_b' is passed sequentially through global pooling, a ReLU activation function, a 1×1 convolution, and a Sigmoid activation function to obtain f_b''; f_b'' is multiplied with f_a to obtain f_a'; finally, f_a' and f_b' are added to give the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function Sigmoid(x) are defined as formula (6) and formula (7):
ReLU(x) = max(0, x)    (6)
Sigmoid(x) = 1 / (1 + e^(−x))    (7)
wherein x represents the input value;
the convolution operation gives each pixel a probability value for each class, and the probability F_k of each class is finally obtained by summing over all feature maps, as in formula (8):
F_k = Σ_(i∈D) (w * d)_k(i)    (8)
wherein d represents a feature map output by the network, w represents the convolution operation, D represents the set of all pixel positions, k ∈ {0, 1, …, K} indexes the channels, and F_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability; the new predicted value F̃_k is as shown in formula (9):
F̃_k = α × F_k    (9)
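A simplified numpy sketch of the local attention module of step 3.1 follows; batch normalization is omitted for brevity, and the channel counts, random weights, and the matrix form of the 1×1 convolutions are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution as per-pixel channel mixing; x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(fa, fb, w1, w2):
    """Step 3.1 sketch: splice fa and fb -> 1x1 conv -> fb';
    fb' -> global pooling -> ReLU -> 1x1 conv -> Sigmoid -> fb'';
    fa' = fb'' * fa; output = fa' + fb'."""
    spliced = np.concatenate([fa, fb], axis=0)           # channel-wise splice
    fb1 = conv1x1(spliced, w1)                           # fb' (BN omitted here)
    pooled = fb1.mean(axis=(1, 2), keepdims=True)        # global average pooling
    gate = sigmoid(conv1x1(np.maximum(pooled, 0), w2))   # ReLU -> 1x1 conv -> Sigmoid: fb''
    fa1 = gate * fa                                      # fa', gate broadcast over H, W
    return fa1 + fb1                                     # module output

C, H, W = 4, 8, 8
fa = rng.normal(size=(C, H, W))
fb = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C, 2 * C))   # splice doubles the channel count
w2 = rng.normal(size=(C, C))
out = local_attention(fa, fb, w1, w2)
```

The per-channel gate fb'' rescales fa globally, which is the attention effect the module relies on; spatial shape is preserved throughout.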
step 3.2, constructing a residual block: firstly, the input feature map is subjected to a 1×1 convolution to obtain a feature map f_c1; then it is passed sequentially through a 3×3 convolution with a ReLU activation function, batch normalization, and another 3×3 convolution to obtain a feature map f_c2; finally, f_c1 and f_c2 are added, and a ReLU activation function gives the output of the residual block;
step 3.3, constructing the coding network: first, four feature maps F_4', F_3', F_2', F_1' are generated as follows:
1) feature map F_4': the feature map F_p output in step 2.3 and the feature map F_4 are input to the local attention module; its output feature map F_4_1 is input to the residual block to obtain feature map F_4_2; F_4_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_4';
2) feature map F_3': feature map F_4_2 and feature map F_3 are input to the local attention module to obtain feature map F_3_1; F_3_1 is input to the residual block to obtain feature map F_3_2; F_3_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_3';
3) feature map F_2': feature map F_3_2 and feature map F_2 are input to the local attention module to obtain feature map F_2_1; F_2_1 is input to the residual block to obtain feature map F_2_2; F_2_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_2';
4) feature map F_1': feature map F_2_2 and feature map F_1 are input to the local attention module to obtain feature map F_1_1; F_1_1 is input to the residual block to obtain feature map F_1_2; F_1_2 is passed through a dilated convolution with dilation rate d = 1 to obtain feature map F_1';
then the four feature maps F_1', F_2', F_3', F_4' are spliced, and finally one 1×1 convolution operation is performed on the splicing result to obtain the output F_encoder of the coding network;
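The repeated pattern of step 3.3 (local attention, then residual block, then dilated convolution, with each stage's residual output feeding the next stage's attention) can be traced symbolically; the helper names below are string placeholders, not real network modules:

```python
def encoder_trace(names):
    """Symbolic trace of step 3.3: each stage feeds the local attention module (LA)
    with the previous stage's residual output (F_p for the first stage) and F_i,
    then applies a residual block (Res) and a dilated convolution (DConv, d = 1)."""
    outputs, prev = [], "F_p"
    for name in names:                      # processed in order F4, F3, F2, F1
        att = f"LA({prev},{name})"          # -> F_i_1
        res = f"Res({att})"                 # -> F_i_2
        outputs.append(f"DConv({res})")     # -> F_i'
        prev = res                          # next stage consumes F_i_2
    # splice F_1', F_2', F_3', F_4' and apply the final 1x1 convolution
    return f"Conv1x1(Concat({', '.join(reversed(outputs))}))"

trace = encoder_trace(["F4", "F3", "F2", "F1"])
```

Printing `trace` makes the wiring explicit: every later stage's expression contains the earlier stages' residual outputs, which is why the four branches are not independent.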
step 4, constructing a decoding network: the output of the coding network and the low-level feature map F_low are sequentially subjected to upsampling, splicing and upsampling operations to gradually recover the image resolution, and finally the semantic segmentation result is output;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
2. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, the data are enhanced using random flipping, random cropping and random Gaussian blur, and finally the image data of the training set are normalized;
step 1.3, for the verification set and the test set, firstly scaling the image size to 513×1026 pixels by using bilinear interpolation method for all image data; then cutting into 513×513 images; and finally, carrying out normalization operation on all image data of the verification set and the test set.
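Step 1.3 can be sketched for a single-channel image as follows; the crop position and the min-max normalization are illustrative assumptions (the claim does not specify them), and a real pipeline would operate on RGB images with dataset mean/std statistics:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear interpolation for a single-channel image of shape (H, W)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def preprocess(img):
    """Claim 2, step 1.3 sketch: scale to 513x1026, crop 513x513, normalize."""
    img = bilinear_resize(img, 513, 1026)
    img = img[:513, :513]                    # crop position is an assumption
    return (img - img.min()) / max(img.max() - img.min(), 1e-12)

# a fake 1024x2048 Cityscapes-sized single-channel image
sample = np.arange(1024 * 2048, dtype=float).reshape(1024, 2048)
out = preprocess(sample)
```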
3. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of step 4 is as follows: first, the low-level feature map F_low is subjected to a 1×1 convolution operation to obtain a feature map F_low', and the output feature map F_encoder of the coding network is upsampled using the bilinear sampling method to obtain F_encoder'; then F_low' and F_encoder' are spliced, and a 3×3 convolution operation is performed; finally, the segmentation result is obtained through 4× bilinear upsampling.
4. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 5 is as follows:
step 5.1, training the model by using the image data of the training set, and evaluating the segmentation effect of the model with the verification set during the training process, wherein the verification set does not participate in training; the model is trained with the cross entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross entropy loss function Loss_ce is defined as formula (11):
Loss_ce = −(1/N) Σ_i Σ_t y_(i,t) log(p_(i,t))    (11)
wherein N is the total number of samples, t indexes the real label values, p_(i,t) represents the probability that the i-th sample is predicted as the t-th label value, and y_(i,t) represents the true probability that the i-th sample has the t-th label, where i ∈ {0, 1, …, 1000} and t ∈ {0, 1, …, 19};
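Formula (11) for a batch of predicted probabilities can be sketched as follows; the toy probabilities and one-hot labels are illustrative:

```python
import numpy as np

def cross_entropy_loss(p, y, eps=1e-12):
    """Formula (11): Loss_ce = -(1/N) * sum_i sum_t y[i,t] * log(p[i,t]),
    with p the predicted class probabilities and y the one-hot true labels."""
    n = p.shape[0]
    return -np.sum(y * np.log(p + eps)) / n   # eps guards against log(0)

# two samples, three classes; both samples predicted mostly correctly
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
loss = cross_entropy_loss(p, y)
```

Only the true-class probabilities contribute, so here the loss is −(ln 0.7 + ln 0.8)/2 ≈ 0.290.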
and step 5.2, using the mean intersection over union and the accuracy in the semantic segmentation method as evaluation indexes to evaluate the model, inputting the image data in the test set into the model one by one, wherein the output of the model is the semantic segmentation result of each image, and the time used to segment each image is output at the same time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110763344.8A CN113642390B (en) | 2021-07-06 | 2021-07-06 | Street view image semantic segmentation method based on local attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642390A CN113642390A (en) | 2021-11-12 |
CN113642390B true CN113642390B (en) | 2024-02-13 |
Family
ID=78416754
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110763344.8A Active CN113642390B (en) | 2021-07-06 | 2021-07-06 | Street view image semantic segmentation method based on local attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642390B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114266977B (en) * | 2021-12-27 | 2023-04-07 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
CN114332839A (en) * | 2021-12-30 | 2022-04-12 | 福州大学 | Streetscape text detection method based on multi-space joint perception |
CN116055174A (en) * | 2023-01-10 | 2023-05-02 | 吉林大学 | Internet of vehicles intrusion detection method based on improved MobileNet V2 |
CN116843696B (en) * | 2023-04-27 | 2024-04-09 | 山东省人工智能研究院 | Cardiac MRI (magnetic resonance imaging) segmentation method based on feature similarity and super-parameter convolution attention |
CN116612122B (en) * | 2023-07-20 | 2023-10-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Image significance region detection method and device, storage medium and electronic equipment |
CN116721302B (en) * | 2023-08-10 | 2024-01-12 | 成都信息工程大学 | Ice and snow crystal particle image classification method based on lightweight network |
CN117409030B (en) * | 2023-12-14 | 2024-03-22 | 齐鲁工业大学(山东省科学院) | OCTA image blood vessel segmentation method and system based on dynamic tubular convolution |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183258A (en) * | 2020-09-16 | 2021-01-05 | 太原理工大学 | Remote sensing image road segmentation method based on context information and attention mechanism |
CN112183360A (en) * | 2020-09-29 | 2021-01-05 | 上海交通大学 | Lightweight semantic segmentation method for high-resolution remote sensing image |
CN112330681A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Attention mechanism-based lightweight network real-time semantic segmentation method |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11461998B2 (en) * | 2019-09-25 | 2022-10-04 | Samsung Electronics Co., Ltd. | System and method for boundary aware semantic segmentation |
2021-07-06 (CN): application CN202110763344.8A, patent CN113642390B/en, status Active
Non-Patent Citations (2)
Title |
---|
严广宇; 刘正熙. Real-time semantic segmentation algorithm based on hybrid attention (基于混合注意力的实时语义分割算法). Modern Computer (现代计算机), 2020, No. 10, full text. *
任天赐; 黄向生; 丁伟利; 安重阳; 翟鹏博. Semantic segmentation algorithm based on a global bilateral network (全局双边网络的语义分割算法). Computer Science (计算机科学), 2020, No. S1, full text. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||