CN113642390B - Street view image semantic segmentation method based on local attention network - Google Patents

Street view image semantic segmentation method based on local attention network

Info

Publication number
CN113642390B
CN113642390B (application CN202110763344.8A)
Authority
CN
China
Prior art keywords
feature map
convolution
input
network
image data
Prior art date
Legal status
Active
Application number
CN202110763344.8A
Other languages
Chinese (zh)
Other versions
CN113642390A (en)
Inventor
赵明华
郅宇星
王睿
胡静
都双丽
石程
李鹏
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110763344.8A priority Critical patent/CN113642390B/en
Publication of CN113642390A publication Critical patent/CN113642390A/en
Application granted granted Critical
Publication of CN113642390B publication Critical patent/CN113642390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a street view image semantic segmentation method based on a local attention network, which comprises the following specific implementation steps: step 1, randomly selecting partial image data from the public data set Cityscapes and dividing the selected image data into a training set, a verification set and a test set; step 2, constructing a MobileNetV2 network model using an inverted residual module and dilated convolution; step 3, designing a local attention module and a residual block, and constructing a coding network; step 4, constructing a decoding network, gradually recovering the image resolution, and finally outputting the semantic segmentation result; and step 5, training the model using the training set and the verification set, and verifying the segmentation effect of the model on the test set. The method addresses the problem in the prior art that local information cannot be fully retained during feature extraction, which leads to inconsistent segmentation results within a category.

Description

Street view image semantic segmentation method based on local attention network
Technical Field
The invention belongs to the field of digital image processing methods, and particularly relates to a street view image semantic segmentation method based on a local attention network.
Background
Vision is an important way for humans to recognize and receive external information, and humans tend to acquire information directly from images rather than from linguistic text descriptions. For computers, however, scene understanding tasks that the human eye performs easily, such as accurately classifying image pixels, remain challenging. The objective of the semantic segmentation task is to correctly classify each pixel in an image using a computer; it is a pixel-by-pixel classification task. Scene understanding uses a computer to perceive and understand the environment in a manner similar to humans, and semantic segmentation, as a necessary path toward scene understanding, is a key and fundamental technology.
In urban road scenes, semantic segmentation is a key technology for understanding the different kinds of objects in a street scene, such as vehicles, sidewalks, roads and traffic lights. However, street scenes are usually complicated and unstructured: illumination and seasonal weather change, target scales can be very small, and objects may be occluded, so diverse targets are usually present in the road. Visual understanding and semantic segmentation of street scenes therefore become a very complex and serious challenge.
Disclosure of Invention
The invention aims to provide a street view image semantic segmentation method based on a local attention network, which addresses the problem in the prior art that local information cannot be fully retained during feature extraction, leading to inconsistent segmentation results within a category.
The invention adopts the technical scheme that the street view image semantic segmentation method based on the local attention network comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, firstly constructing an inverted residual module using a depth separable convolution and a residual structure, and then constructing a MobileNetV2 network model using the inverted residual module and dilated convolution; inputting the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; applying four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
step 3, designing a local attention module and a residual block and constructing a coding network that extracts image features while restoring the local context information of the input image data, since local context information is easily lost as features are extracted layer by layer;
step 4, constructing a decoding network: the output of the coding network and the low-level feature map F_low are sequentially subjected to upsampling, concatenation and upsampling operations to gradually recover the image resolution, and the semantic segmentation result is finally output;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
The present invention is also characterized in that,
the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, enhancing the data using random flipping, random cropping and random Gaussian blur, and finally normalizing the image data of the training set;
step 1.3, for the verification set and the test set, first scaling all image data to 513×1026 pixels using bilinear interpolation, then cropping to 513×513 images, and finally normalizing all image data of the verification set and the test set.
The specific process of the step 2 is as follows:
step 2.1, constructing an inverted residual module using a depth separable convolution and a residual network structure: first a convolution with kernel size 1×1 raises the dimension, then a depth separable convolution with kernel size 3×3 is applied in order to reduce the computational effort, finally a convolution with kernel size 1×1 lowers the dimension, and two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}   (1)
where x represents the input data, and max() and min() return the maximum and the minimum of their inputs, respectively;
step 2.2, constructing the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extracting image features layer by layer using dilated convolution, which samples across pixels; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d=1; the output of the fourth layer is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent convolution kernel size k' of a dilated convolution is computed as shown in formula (2):
k' = k + (k-1)×(d-1)   (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF_{i+1} = RF_i + (k'-1)×S_i   (3)
where i denotes the index of the network layer, RF_i denotes the receptive field of the i-th layer, and S_i denotes the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = ∏_{j=1}^{i} Stride_j   (4)
where Stride_j represents the stride of the j-th layer;
step 2.3, for F_high, four feature maps F_1, F_2, F_3, F_4 are first obtained using dilated convolutions with the four dilation rates d = 0, 1, 2, 3, and then one global average pooling is used to obtain a feature map F_p; the output size N of F_p is computed as shown in formula (5):
N = (W - F + 2P)/S + 1   (5)
where W is the input size, F is the kernel size, S is the stride and P is the padding size.
The specific process of the step 3 is as follows:
step 3.1, constructing a local attention module: first, the input feature map f_a and feature map f_b are concatenated and then passed through batch normalization and a 1×1 convolution to obtain f_b'; then f_b' is passed sequentially through global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to obtain f_b''; f_b'' is multiplied by f_a to obtain f_a'; finally f_a' and f_b' are added to give the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function are defined as formula (6) and formula (7):
ReLU(x) = max(0, x)   (6)
Sigmoid(x) = 1/(1 + e^(-x))   (7)
where x represents the input value;
the convolution operation gives each pixel a probability value for each class, and the probability F of each class is finally obtained by summing over all feature maps, as in formula (8):
F(i) = Σ_k w_k · d_k(i),  i ∈ D   (8)
where d represents a feature map output by the network, w represents the convolution operation, D represents the set of all pixel positions, k ∈ {0, 1, ..., K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability, yielding the new predicted value of formula (9);
step 3.2, constructing a residual block: first, the input feature map is passed through a 1×1 convolution to obtain feature map f_c1; then a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution are applied in sequence to obtain feature map f_c2; finally f_c1 and f_c2 are added, and a ReLU activation function gives the output of the residual block;
step 3.3, constructing a coding network: first, four feature maps F_4', F_3', F_2', F_1' are generated as follows:
1) Feature map F_4' is generated as follows: the feature map F_p output in step 2.3 and the feature map F_4 are input to the local attention module; the output feature map F_4_1 is input to the residual block to obtain feature map F_4_2; F_4_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_4';
2) Feature map F_3' is generated as follows: feature map F_4_2 and feature map F_3 are input to the local attention module to obtain feature map F_3_1; feature map F_3_1 is input to the residual block to obtain feature map F_3_2; F_3_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_3';
3) Feature map F_2' is generated as follows: feature map F_3_2 and feature map F_2 are input to the local attention module to obtain feature map F_2_1; feature map F_2_1 is input to the residual block to obtain feature map F_2_2; F_2_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_2';
4) Feature map F_1' is generated as follows: feature map F_2_2 and feature map F_1 are input to the local attention module to obtain feature map F_1_1; feature map F_1_1 is input to the residual block to obtain feature map F_1_2; F_1_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_1';
the four feature maps F_1', F_2', F_3', F_4' are then concatenated, and finally one 1×1 convolution operation is applied to the concatenation result to obtain the output F_encoder of the coding network.
The specific process of step 4 is as follows: first, the low-level feature map F_low is passed through a 1×1 convolution to obtain feature map F_low'; the output feature map F_encoder of the coding network is upsampled using bilinear interpolation to obtain F_encoder'; then F_low' and F_encoder' are concatenated and a 3×3 convolution is applied; finally, the segmentation result is obtained by 4× bilinear upsampling.
The specific process of the step 5 is as follows:
step 5.1, training the model with the image data of the training set, and evaluating the segmentation effect of the model with the verification set during training, where the verification set does not participate in the training process; the model is trained with the cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as formula (11):
Loss_ce = -(1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} y_{i,t} · log(p_{i,t})   (11)
where T is the number of label values, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as the t-th label value, and y_{i,t} represents the true probability that the i-th sample belongs to the t-th label, where i ∈ {0, 1, ..., 1000}, t ∈ {0, 1, ..., 19};
step 5.2, using the mean intersection over union and the accuracy commonly used in semantic segmentation as evaluation indexes to evaluate the model: the image data in the test set are input into the model one by one, the output of the model is the semantic segmentation result of each image, and the time taken to segment each image is also output.
The beneficial effects of the invention are as follows:
(1) The method of the invention is based on the segmentation structure of the encoder-decoder, the encoder network extracts the characteristics layer by layer, and the decoder gradually restores the resolution of the image through up-sampling, thereby achieving the purpose of classifying each pixel in the image.
(2) In a convolution operation, every feature map is treated identically by the same kernel. The method of the invention assigns a different weight to each feature map through the local attention module: feature maps that contribute to the segmentation receive larger weights, while redundant feature maps receive smaller weights. The method can therefore significantly improve the ability of the network model to discriminate each category, reduce segmentation inconsistency within a category, and improve the visual smoothness of the semantic segmentation.
Drawings
FIG. 1 is a flow chart of the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 2 is a schematic diagram of a local attention module architecture used in the street view image semantic segmentation method based on the local attention network of the present invention;
FIG. 3 is a schematic diagram of a residual block structure used in the local attention network based street view image semantic segmentation method of the present invention;
FIG. 4 is a diagram showing a comparison of a first original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
FIG. 5 is a diagram showing a comparison of a second original image, a real label and a segmentation result randomly obtained in a test set in an embodiment of the present invention;
fig. 6 is a comparison chart of a third original image, a real label and a segmentation result obtained randomly in a test set in the embodiment of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a street view image semantic segmentation method based on a local attention network, which comprises the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
the specific process of the step 1 is as follows:
step 1.1, randomly selecting 1000, 500 and 166 images from a public data set Cityscapes respectively as image data of a training set, a testing set and a verification set;
step 1.2, for all image data of the training set, enhancing the data using random flipping, random cropping and random Gaussian blur, and finally normalizing the image data of the training set;
step 1.3, for the verification set and the test set, first scaling all image data to 513×1026 pixels using bilinear interpolation, then cropping to 513×513 images, and finally normalizing all image data of the verification set and the test set.
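To make steps 1.2-1.3 concrete, a minimal preprocessing sketch in PyTorch/torchvision is given below. It is illustrative only: the normalization statistics, the blur kernel size and the use of a centre crop are assumptions, and in practice the segmentation labels must undergo the same geometric transforms as the images.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

# Assumed ImageNet statistics; the patent does not specify the normalization values.
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

def train_transform(img):
    # Step 1.2: random flip, random crop and random Gaussian blur, then normalization.
    if random.random() < 0.5:
        img = TF.hflip(img)
    i, j, h, w = T.RandomCrop.get_params(img, output_size=(513, 513))
    img = TF.crop(img, i, j, h, w)
    if random.random() < 0.5:
        img = TF.gaussian_blur(img, kernel_size=5)
    return TF.normalize(TF.to_tensor(img), MEAN, STD)

def eval_transform(img):
    # Step 1.3: bilinear resize to 513x1026, crop to 513x513, then normalization.
    img = TF.resize(img, (513, 1026), interpolation=T.InterpolationMode.BILINEAR)
    img = TF.center_crop(img, (513, 513))
    return TF.normalize(TF.to_tensor(img), MEAN, STD)
```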
Step 2, firstly constructing an inverted residual module using a depth separable convolution and a residual structure, and then constructing a MobileNetV2 network model using the inverted residual module and dilated convolution; the detailed structure of the MobileNetV2 network model is shown in table 1. The image data of the training set are input into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; four dilated convolutions with different dilation rates and one global average pooling are applied to F_high to obtain five feature maps;
TABLE 1 detailed structure of MobileNet V2 network model
The specific process of the step 2 is as follows:
step 2.1, constructing an inverted residual module using a depth separable convolution and a residual network structure: first a convolution with kernel size 1×1 raises the dimension, then a depth separable convolution with kernel size 3×3 is applied in order to reduce the computational effort, finally a convolution with kernel size 1×1 lowers the dimension, and two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}   (1)
where x represents the input data, and max() and min() return the maximum and the minimum of their inputs, respectively;
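A minimal PyTorch sketch of the inverted residual module of step 2.1 is shown below. It is an illustration rather than the exact module of table 1: the expansion ratio of 6, the use of batch normalization and the skip-connection condition are assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of the inverted residual module: 1x1 expand, 3x3 depthwise, 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6, dilation=1):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # 1x1 convolution raises the dimension
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                              # ReLU6(x) = min(max(0, x), 6), formula (1)
            nn.Conv2d(hidden, hidden, 3, stride=stride,          # 3x3 depthwise convolution (groups=hidden)
                      padding=dilation, dilation=dilation,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),            # 1x1 convolution lowers the dimension
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y                     # residual connection when shapes match
```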
Step 2.2, the MobileNetV2 network model is constructed using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer; the specific model structure is shown in table 1. In order to increase the receptive field of the convolution without losing information, dilated convolution, which samples across pixels, is adopted to extract image features layer by layer; all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d=1; the output of the fourth layer is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high.
In step 2.2, the equivalent convolution kernel size k' of a dilated convolution is computed as shown in formula (2):
k' = k + (k-1)×(d-1)   (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF_{i+1} = RF_i + (k'-1)×S_i   (3)
where i denotes the index of the network layer, RF_i denotes the receptive field of the i-th layer, and S_i denotes the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = ∏_{j=1}^{i} Stride_j   (4)
where Stride_j represents the stride of the j-th layer;
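As a worked illustration of formulas (2)-(4), the small helper below computes the equivalent kernel size and the receptive field for an assumed stack of 3×3 layers (the kernel sizes, dilation rates and strides in the example are illustrative values, not the configuration of table 1).

```python
def equivalent_kernel(k, d):
    # Formula (2): k' = k + (k - 1) * (d - 1)
    return k + (k - 1) * (d - 1)

def receptive_field(kernels, dilations, strides):
    # Formulas (3)-(4): RF_{i+1} = RF_i + (k' - 1) * S_i, where S_i is the product
    # of the strides of all preceding layers.
    rf, s = 1, 1
    for k, d, stride in zip(kernels, dilations, strides):
        rf += (equivalent_kernel(k, d) - 1) * s
        s *= stride
    return rf

# Example: three 3x3 layers, all with dilation 1, strides 2, 1, 1 -> receptive field 11.
print(receptive_field([3, 3, 3], [1, 1, 1], [2, 1, 1]))
```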
step 2.3, for F_high, four feature maps F_1, F_2, F_3, F_4 are first obtained using dilated convolutions with the four dilation rates d = 0, 1, 2, 3, and then one global average pooling is used to obtain a feature map F_p; the output size N of F_p is computed as shown in formula (5):
N = (W - F + 2P)/S + 1   (5)
where W is the input size, F is the kernel size, S is the stride and P is the padding size.
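Step 2.3 can be sketched as follows. This is a hedged illustration: the output channel count of 256, the treatment of dilation rate 0 as a plain 1×1 convolution, and the bilinear upsampling of the pooled branch are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDilationHead(nn.Module):
    """Sketch of step 2.3: four dilated convolutions over F_high plus one global average pooling."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        # Branch for "dilation rate 0", interpreted here as a plain 1x1 convolution (assumption).
        self.branch0 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        # Dilated 3x3 branches with rates 1, 2 and 3; padding keeps the spatial size unchanged.
        self.branch1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1, bias=False)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2, bias=False)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3, bias=False)
        # Global average pooling branch producing F_p.
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, f_high):
        f1, f2, f3, f4 = (self.branch0(f_high), self.branch1(f_high),
                          self.branch2(f_high), self.branch3(f_high))
        f_p = F.interpolate(self.pool(f_high), size=f_high.shape[2:],
                            mode='bilinear', align_corners=False)
        return f1, f2, f3, f4, f_p   # the five feature maps F_1..F_4 and F_p
```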
Step 3, designing a local attention module and a residual block, shown in fig. 2 and fig. 3, and constructing a coding network that extracts image features while recovering the local context information of the input image data, since local context information is easily lost as features are extracted layer by layer;
the specific process of the step 3 is as follows:
step 3.1, constructing a local attention module (Local Attention Block, LAB): first, the input feature map f_a and feature map f_b are concatenated and then passed through batch normalization and a 1×1 convolution to obtain f_b'; then f_b' is passed sequentially through global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to obtain f_b''; f_b'' is multiplied by f_a to obtain f_a'; finally f_a' and f_b' are added to give the output of the local attention module. A block diagram of the local attention module is shown in fig. 2. Its purpose is to assign a different weight to each channel; the convolution operation assigns each pixel a probability value for each class, and a weight parameter is set to correct the highest probability.
In step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function are defined as formula (6) and formula (7):
ReLU(x) = max(0, x)   (6)
Sigmoid(x) = 1/(1 + e^(-x))   (7)
where x represents the input value;
the convolution operation gives each pixel a probability value for each class, and the probability F of each class is finally obtained by summing over all feature maps, as in formula (8):
F(i) = Σ_k w_k · d_k(i),  i ∈ D   (8)
where d represents a feature map output by the network, w represents the convolution operation, D represents the set of all pixel positions, k ∈ {0, 1, ..., K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability, yielding the new predicted value of formula (9).
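A minimal PyTorch sketch of the local attention module of step 3.1 is given below; it only mirrors the operation order described above (concatenate, batch normalization and 1×1 convolution, then global pooling, ReLU, 1×1 convolution and Sigmoid to re-weight f_a before the final addition). Making f_b' share the channel count of f_a is an assumption.

```python
import torch
import torch.nn as nn

class LocalAttentionBlock(nn.Module):
    """Sketch of the local attention module (LAB) described in step 3.1."""
    def __init__(self, ch_a, ch_b):
        super().__init__()
        self.fuse = nn.Sequential(                          # concat -> batch norm -> 1x1 conv  => f_b'
            nn.BatchNorm2d(ch_a + ch_b),
            nn.Conv2d(ch_a + ch_b, ch_a, 1, bias=False),    # f_b' matches the channel count of f_a (assumption)
        )
        self.attn = nn.Sequential(                          # global pool -> ReLU -> 1x1 conv -> Sigmoid => f_b''
            nn.AdaptiveAvgPool2d(1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch_a, ch_a, 1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, f_a, f_b):
        f_b1 = self.fuse(torch.cat([f_a, f_b], dim=1))      # f_b'
        f_b2 = self.attn(f_b1)                              # f_b'' (per-channel weights)
        f_a1 = f_a * f_b2                                   # f_a' = f_a re-weighted channel-wise
        return f_a1 + f_b1                                  # output = f_a' + f_b'
```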
Step 3.2, constructing a residual block: first, the input feature map is passed through a 1×1 convolution to obtain feature map f_c1; then a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution are applied in sequence to obtain feature map f_c2; finally f_c1 and f_c2 are added, and a ReLU activation function gives the output of the residual block. The block diagram of the residual block is shown in fig. 3; its purpose is to combine the information of all channels so as to refine the feature map.
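Similarly, the residual block of step 3.2 can be sketched as below; keeping the channel count unchanged and feeding the 3×3 branch with f_c1 (rather than the raw input) are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the residual block of step 3.2."""
    def __init__(self, ch):
        super().__init__()
        self.conv1x1 = nn.Conv2d(ch, ch, 1, bias=False)        # produces f_c1
        self.branch = nn.Sequential(                            # 3x3 conv -> ReLU -> BN -> 3x3 conv => f_c2
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(ch),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f_c1 = self.conv1x1(x)
        f_c2 = self.branch(f_c1)      # whether the branch takes x or f_c1 is an assumption
        return self.relu(f_c1 + f_c2)
```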
Step 3.3, constructing a coding network: first, four feature maps F_4', F_3', F_2', F_1' are generated as follows:
1) Feature map F_4' is generated as follows: the feature map F_p output in step 2.3 and the feature map F_4 are input to the local attention module; the output feature map F_4_1 is input to the residual block to obtain feature map F_4_2; F_4_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_4';
2) Feature map F_3' is generated as follows: feature map F_4_2 and feature map F_3 are input to the local attention module to obtain feature map F_3_1; feature map F_3_1 is input to the residual block to obtain feature map F_3_2; F_3_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_3';
3) Feature map F_2' is generated as follows: feature map F_3_2 and feature map F_2 are input to the local attention module to obtain feature map F_2_1; feature map F_2_1 is input to the residual block to obtain feature map F_2_2; F_2_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_2';
4) Feature map F_1' is generated as follows: feature map F_2_2 and feature map F_1 are input to the local attention module to obtain feature map F_1_1; feature map F_1_1 is input to the residual block to obtain feature map F_1_2; F_1_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_1';
the four feature maps F_1', F_2', F_3', F_4' are then concatenated, and finally one 1×1 convolution operation is applied to the concatenation result to obtain the output F_encoder of the coding network.
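The coding network of step 3.3 chains the two modules above. The abbreviated sketch below reuses the LocalAttentionBlock and ResidualBlock classes from the previous sketches; the uniform channel count of 256 is an assumption.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of step 3.3: cascade of LAB + residual block + dilated conv, then concatenation."""
    def __init__(self, ch=256):
        super().__init__()
        self.labs = nn.ModuleList([LocalAttentionBlock(ch, ch) for _ in range(4)])
        self.res = nn.ModuleList([ResidualBlock(ch) for _ in range(4)])
        # Dilated convolutions with dilation rate d=1 producing F_4'..F_1'.
        self.dil = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1, dilation=1) for _ in range(4)])
        self.out_conv = nn.Conv2d(4 * ch, ch, 1)           # final 1x1 convolution over the concatenation

    def forward(self, f1, f2, f3, f4, f_p):
        outs, prev = [], f_p                                # the first LAB takes F_p and F_4
        for lab, res, dil, f in zip(self.labs, self.res, self.dil, [f4, f3, f2, f1]):
            mid = res(lab(prev, f))                         # F_k_1 -> F_k_2
            outs.append(dil(mid))                           # F_k'
            prev = mid                                      # F_k_2 feeds the next LAB
        return self.out_conv(torch.cat(outs, dim=1))        # F_encoder
```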
Step 4, constructing a decoding network: the output of the coding network and the low-level feature map F_low are sequentially subjected to upsampling, concatenation and upsampling operations to gradually recover the image resolution, and the semantic segmentation result is finally output;
the specific process of step 4 is as follows: first, the low-level feature map F_low is passed through a 1×1 convolution to obtain feature map F_low'; the output feature map F_encoder of the coding network is upsampled using bilinear interpolation to obtain F_encoder'; then F_low' and F_encoder' are concatenated and a 3×3 convolution is applied; finally, the segmentation result is obtained by 4× bilinear upsampling.
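The decoding network of step 4 can be sketched as follows; the 48-channel reduction of F_low and the number of output classes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of step 4: 1x1 conv on F_low, upsample F_encoder, concatenate, 3x3 conv, 4x upsample."""
    def __init__(self, low_ch, enc_ch, num_classes=19):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, 48, 1, bias=False)             # F_low -> F_low'
        self.fuse = nn.Conv2d(48 + enc_ch, num_classes, 3, padding=1)    # 3x3 conv on the concatenation

    def forward(self, f_low, f_encoder):
        f_low1 = self.low_proj(f_low)
        f_enc1 = F.interpolate(f_encoder, size=f_low.shape[2:],          # bilinear upsample to F_low size
                               mode='bilinear', align_corners=False)
        x = self.fuse(torch.cat([f_low1, f_enc1], dim=1))
        return F.interpolate(x, scale_factor=4, mode='bilinear',         # final 4x bilinear upsampling
                             align_corners=False)
```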
Step 5, based on the local-attention-network semantic segmentation model constructed in steps 2-4 and shown in fig. 1, training the model using the training set and the verification set, and verifying the segmentation effect of the model on the test set; the performance of the model on the training set and the verification set is shown in table 2.
The specific process of the step 5 is as follows:
Step 5.1, the structure of the semantic segmentation model based on the local attention network is shown in fig. 1. The model is trained with the image data of the training set, and the segmentation effect of the model is evaluated with the verification set during training, where the verification set does not participate in the training process; the model is trained with the cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as formula (11):
Loss_ce = -(1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} y_{i,t} · log(p_{i,t})   (11)
where T is the number of label values, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as the t-th label value, and y_{i,t} represents the true probability that the i-th sample belongs to the t-th label, where i ∈ {0, 1, ..., 1000}, t ∈ {0, 1, ..., 19};
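A minimal training-loop sketch for step 5.1 follows. Only the cross-entropy loss and the initial learning rate of 0.007 with polynomial decay come from the description; the decay power of 0.9, the SGD optimizer with momentum and the ignore_index value are assumptions.

```python
import torch
import torch.nn as nn

def poly_lr(base_lr, step, max_steps, power=0.9):
    # Polynomial decay strategy: lr = base_lr * (1 - step / max_steps) ** power
    return base_lr * (1 - step / max_steps) ** power

def train(model, train_loader, max_steps, device="cuda"):
    criterion = nn.CrossEntropyLoss(ignore_index=255)         # Loss_ce, formula (11); ignore_index is assumed
    optimizer = torch.optim.SGD(model.parameters(), lr=0.007, momentum=0.9)
    step = 0
    model.train()
    while step < max_steps:
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            for g in optimizer.param_groups:
                g["lr"] = poly_lr(0.007, step, max_steps)      # initial learning rate 0.007 with poly decay
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
```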
Step 5.2, the mean intersection over union (MIoU) and the accuracy (Acc) commonly used in semantic segmentation are used as evaluation indexes. The image data in the test set are input into the model one by one; the output of the model is the semantic segmentation result of each image, and the time taken to segment each image is also output. The performance of the model on the training set and the verification set is shown in table 2. Overall the model performs well: the MIoU over all categories reaches 0.613, the accuracy reaches 0.942, and the segmentation result for an image with a resolution of 512×1024 is obtained within 0.5 seconds.
Table 2 model performance effects on training set and validation set
As shown in fig. 4 to fig. 6, the original pictures are three images randomly taken from the test set and processed with the semantic segmentation model based on the local attention network; the second column shows the real labels corresponding to the original pictures, and the third column shows the semantic segmentation results obtained with the model.
Comparing the real labels with the segmentation results shows that the model produces accurate segmentation with a good visual effect and without large-area classification errors. In particular, large-area categories (roads, buildings, vehicles and the like) are segmented accurately, category edges show no jagged artifacts, and information within categories is not lost. Small targets are segmented into approximate outlines, although their finer details still require further refinement.

Claims (4)

1. A street view image semantic segmentation method based on a local attention network is characterized by comprising the following specific implementation steps:
step 1, firstly randomly selecting partial image data from a public data set Cityscapes, dividing the selected partial image data into a training set, a verification set and a test set, and finally carrying out data enhancement and preprocessing operations on all image data of the training set, the verification set and the test set respectively;
step 2, firstly constructing an inverted residual module using a depth separable convolution and a residual structure, and then constructing a MobileNetV2 network model using the inverted residual module and dilated convolution; inputting the image data of the training set into the MobileNetV2 network model to extract image features and output a low-level feature map F_low and a high-level feature map F_high; applying four dilated convolutions with different dilation rates and one global average pooling to F_high to obtain five feature maps;
the specific process of the step 2 is as follows:
step 2.1, constructing an inverted residual module using a depth separable convolution and a residual network structure:
first a convolution with kernel size 1×1 raises the dimension, then a depth separable convolution with kernel size 3×3 is applied in order to reduce the computational effort, finally a convolution with kernel size 1×1 lowers the dimension, and two ReLU6 activation functions are used;
in step 2.1, the ReLU6 activation function ReLU6(x) is defined as shown in formula (1):
ReLU6(x) = min{max(0, x), 6}   (1)
where x represents the input data, and max() and min() return the maximum and the minimum of their inputs, respectively;
step 2.2, constructing the MobileNetV2 network model using 3 convolution layers, 7 inverted residual modules and 1 average pooling layer, and extracting image features layer by layer using dilated convolution, which samples across pixels;
all convolution operations used by the MobileNetV2 network model are dilated convolutions with dilation rate d=1; the output of the fourth layer is the low-level feature map F_low, and the output of the overall network model is the high-level feature map F_high;
in step 2.2, the equivalent convolution kernel size k' of a dilated convolution is computed as shown in formula (2):
k' = k + (k-1)×(d-1)   (2)
where k is the convolution kernel size and d is the dilation rate of the dilated convolution; the receptive field RF_{i+1} of the (i+1)-th dilated convolution layer is computed as shown in formula (3):
RF_{i+1} = RF_i + (k'-1)×S_i   (3)
where i denotes the index of the network layer, RF_i denotes the receptive field of the i-th layer, and S_i denotes the product of the strides of all preceding layers; S_i is computed as shown in formula (4):
S_i = ∏_{j=1}^{i} Stride_j   (4)
where Stride_j represents the stride of the j-th layer;
step 2.3, for F_high, four feature maps F_1, F_2, F_3, F_4 are first obtained using dilated convolutions with the four dilation rates d = 0, 1, 2, 3, and then one global average pooling is used to obtain a feature map F_p; the output size N of F_p is computed as shown in formula (5):
N = (W - F + 2P)/S + 1   (5)
where W is the input size, F is the kernel size, S is the stride and P is the padding size;
step 3, designing a local attention module and a residual block, and constructing a coding network;
the specific process of the step 3 is as follows:
step 3.1, constructing a local attention module: first, the input feature map f_a and feature map f_b are concatenated and then passed through batch normalization and a 1×1 convolution to obtain f_b'; then f_b' is passed sequentially through global pooling, a ReLU activation function, a 1×1 convolution and a Sigmoid activation function to obtain f_b''; f_b'' is multiplied by f_a to obtain f_a'; finally f_a' and f_b' are added to give the output of the local attention module;
in step 3.1, the ReLU activation function ReLU(x) and the Sigmoid activation function are defined as formula (6) and formula (7):
ReLU(x) = max(0, x)   (6)
Sigmoid(x) = 1/(1 + e^(-x))   (7)
where x represents the input value;
the convolution operation gives each pixel a probability value for each class, and the probability F of each class is finally obtained by summing over all feature maps, as in formula (8):
F(i) = Σ_k w_k · d_k(i),  i ∈ D   (8)
where d represents a feature map output by the network, w represents the convolution operation, D represents the set of all pixel positions, k ∈ {0, 1, ..., K}, and d_k represents the value of the k-th channel;
a weight parameter α = Sigmoid(d; w) is introduced to correct the highest predicted probability, yielding the new predicted value of formula (9);
step 3.2, constructing a residual block: first, the input feature map is passed through a 1×1 convolution to obtain feature map f_c1; then a 3×3 convolution, a ReLU activation function, batch normalization and another 3×3 convolution are applied in sequence to obtain feature map f_c2; finally f_c1 and f_c2 are added, and a ReLU activation function gives the output of the residual block;
step 3.3, constructing a coding network: first, four feature maps F_4', F_3', F_2', F_1' are generated as follows:
1) Feature map F_4' is generated as follows: the feature map F_p output in step 2.3 and the feature map F_4 are input to the local attention module; the output feature map F_4_1 is input to the residual block to obtain feature map F_4_2; F_4_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_4';
2) Feature map F_3' is generated as follows: feature map F_4_2 and feature map F_3 are input to the local attention module to obtain feature map F_3_1; feature map F_3_1 is input to the residual block to obtain feature map F_3_2; F_3_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_3';
3) Feature map F_2' is generated as follows: feature map F_3_2 and feature map F_2 are input to the local attention module to obtain feature map F_2_1; feature map F_2_1 is input to the residual block to obtain feature map F_2_2; F_2_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_2';
4) Feature map F_1' is generated as follows: feature map F_2_2 and feature map F_1 are input to the local attention module to obtain feature map F_1_1; feature map F_1_1 is input to the residual block to obtain feature map F_1_2; F_1_2 is passed through a dilated convolution with dilation rate d=1 to obtain feature map F_1';
the four feature maps F_1', F_2', F_3', F_4' are then concatenated, and finally one 1×1 convolution operation is applied to the concatenation result to obtain the output F_encoder of the coding network;
step 4, constructing a decoding network: the output of the coding network and the low-level feature map F_low are sequentially subjected to upsampling, concatenation and upsampling operations to gradually recover the image resolution, and the semantic segmentation result is finally output;
and 5, training the model by using the training set and the verification set, and verifying the segmentation effect of the model on the test set.
2. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 1 is as follows:
step 1.1, randomly selecting partial image data from a public data set Cityscapes, and dividing the selected partial image data into a training set, a verification set and a test set according to the proportion of 6:3:1;
step 1.2, for all image data of the training set, enhancing the data using random flipping, random cropping and random Gaussian blur, and finally normalizing the image data of the training set;
step 1.3, for the verification set and the test set, first scaling all image data to 513×1026 pixels using bilinear interpolation, then cropping to 513×513 images, and finally normalizing all image data of the verification set and the test set.
3. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of step 4 is as follows: first, the low-level feature map F_low is passed through a 1×1 convolution to obtain feature map F_low'; the output feature map F_encoder of the coding network is upsampled using bilinear interpolation to obtain F_encoder'; then F_low' and F_encoder' are concatenated and a 3×3 convolution is applied; finally, the segmentation result is obtained by 4× bilinear upsampling.
4. The street view image semantic segmentation method based on the local attention network according to claim 1, wherein the specific process of the step 5 is as follows:
step 5.1, training the model with the image data of the training set, and evaluating the segmentation effect of the model with the verification set during training, where the verification set does not participate in the training process; the model is trained with the cross-entropy loss function Loss_ce, the initial learning rate is set to 0.007, and a polynomial decay strategy is adopted;
in step 5.1, the cross-entropy loss function Loss_ce is defined as formula (11):
Loss_ce = -(1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} y_{i,t} · log(p_{i,t})   (11)
where T is the number of label values, N is the total number of samples, p_{i,t} represents the probability that the i-th sample is predicted as the t-th label value, and y_{i,t} represents the true probability that the i-th sample belongs to the t-th label, where i ∈ {0, 1, ..., 1000}, t ∈ {0, 1, ..., 19};
step 5.2, using the mean intersection over union and the accuracy commonly used in semantic segmentation as evaluation indexes to evaluate the model: the image data in the test set are input into the model one by one, the output of the model is the semantic segmentation result of each image, and the time taken to segment each image is also output.
CN202110763344.8A 2021-07-06 2021-07-06 Street view image semantic segmentation method based on local attention network Active CN113642390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110763344.8A CN113642390B (en) 2021-07-06 2021-07-06 Street view image semantic segmentation method based on local attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110763344.8A CN113642390B (en) 2021-07-06 2021-07-06 Street view image semantic segmentation method based on local attention network

Publications (2)

Publication Number Publication Date
CN113642390A CN113642390A (en) 2021-11-12
CN113642390B true CN113642390B (en) 2024-02-13

Family

ID=78416754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110763344.8A Active CN113642390B (en) 2021-07-06 2021-07-06 Street view image semantic segmentation method based on local attention network

Country Status (1)

Country Link
CN (1) CN113642390B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266977B (en) * 2021-12-27 2023-04-07 青岛澎湃海洋探索技术有限公司 Multi-AUV underwater target identification method based on super-resolution selectable network
CN114332839A (en) * 2021-12-30 2022-04-12 福州大学 Streetscape text detection method based on multi-space joint perception
CN116055174A (en) * 2023-01-10 2023-05-02 吉林大学 Internet of vehicles intrusion detection method based on improved MobileNet V2
CN116843696B (en) * 2023-04-27 2024-04-09 山东省人工智能研究院 Cardiac MRI (magnetic resonance imaging) segmentation method based on feature similarity and super-parameter convolution attention
CN116612122B (en) * 2023-07-20 2023-10-10 湖南快乐阳光互动娱乐传媒有限公司 Image significance region detection method and device, storage medium and electronic equipment
CN116721302B (en) * 2023-08-10 2024-01-12 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN117409030B (en) * 2023-12-14 2024-03-22 齐鲁工业大学(山东省科学院) OCTA image blood vessel segmentation method and system based on dynamic tubular convolution

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183258A (en) * 2020-09-16 2021-01-05 太原理工大学 Remote sensing image road segmentation method based on context information and attention mechanism
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461998B2 (en) * 2019-09-25 2022-10-04 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183258A (en) * 2020-09-16 2021-01-05 太原理工大学 Remote sensing image road segmentation method based on context information and attention mechanism
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严广宇; 刘正熙. Real-time semantic segmentation algorithm based on hybrid attention. Modern Computer (现代计算机), 2020, No. 10, full text. *
任天赐; 黄向生; 丁伟利; 安重阳; 翟鹏博. Semantic segmentation algorithm with a global bilateral network. Computer Science (计算机科学), 2020, No. S1, full text. *

Also Published As

Publication number Publication date
CN113642390A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN108171701B (en) Significance detection method based on U network and counterstudy
CN111325751A (en) CT image segmentation system based on attention convolution neural network
CN111523553B (en) Central point network multi-target detection method based on similarity matrix
CN111915627A (en) Semantic segmentation method, network, device and computer storage medium
CN111612008A (en) Image segmentation method based on convolution network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN113723377A (en) Traffic sign detection method based on LD-SSD network
CN116311214B (en) License plate recognition method and device
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN113269224A (en) Scene image classification method, system and storage medium
CN113762396A (en) Two-dimensional image semantic segmentation method
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN113674288A (en) Automatic segmentation method for non-small cell lung cancer digital pathological image tissues
CN117079276B (en) Semantic segmentation method, system, equipment and medium based on knowledge distillation
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN112818774A (en) Living body detection method and device
TWI803243B (en) Method for expanding images, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant