CN114419449A - Self-attention multi-scale feature fusion remote sensing image semantic segmentation method - Google Patents

Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Info

Publication number
CN114419449A
Authority
CN
China
Prior art keywords
feature
module
swin
scale
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210308387.1A
Other languages
Chinese (zh)
Other versions
CN114419449B (en)
Inventor
符颖
郭丹青
文武
吴锡
周激流
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202210308387.1A
Publication of CN114419449A
Application granted
Publication of CN114419449B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a self-attention multi-scale feature fusion method for semantic segmentation of remote sensing images. The segmentation network comprises a feature encoder and a decoder. The encoder passes the feature maps of its first three stages, which have different scales, to the corresponding self-attention multi-scale feature fusion modules in the decoder. The decoder upsamples starting from the last-stage feature map and superimposes it with the fused feature maps, continuing stage by stage until the scale of the first-stage feature map is reached. Finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result.

Description

Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
Technical Field
The invention relates to the field of remote sensing image processing, and in particular to a self-attention multi-scale feature fusion method for semantic segmentation of remote sensing images.
Background
With the continuous development of remote sensing technology, processing of remote sensing images has become increasingly important, and semantic segmentation is one of its key research directions. For both natural images and remote sensing images, semantic segmentation is a pixel-level classification task in which a label is assigned to every pixel. Compared with natural images, remote sensing images have higher resolution, more complex content and larger differences in object scale, and practical applications place higher demands on segmentation accuracy. With the great success of deep learning in computer vision in recent years, researchers have continuously applied deep learning to remote sensing image semantic segmentation; the results show that it outperforms most traditional methods and has gradually become the mainstream approach. The earliest deep learning methods applied to remote sensing semantic segmentation were mostly based on convolutional neural networks, the most classical being the fully convolutional network (FCN) for end-to-end image semantic segmentation. Its general idea is to replace fully connected layers with convolutional layers and to compute the score of each category with 1 x 1 convolutions, thus achieving pixel-level prediction. However, the FCN directly upsamples the feature map of the last convolutional layer to the original image size, so the segmentation edges are incomplete and the accuracy is low.
To better restore the feature map to the original image size, the semantic segmentation network U-Net, based on an encoder-decoder framework, was proposed. U-Net is a U-shaped symmetrical structure consisting of an encoder that extracts features and a decoder that restores the scale; skip connections are added in the decoder to further fuse the features extracted by the encoder and to improve the extraction of object edge features. These improvements give U-Net better segmentation accuracy and robustness with fewer training samples. Semantic segmentation models based on convolutional neural networks usually require repeated downsampling to obtain a larger receptive field, but this lowers the resolution of the image and loses position information. To address this problem, the semantic segmentation network DeepLab, which combines dilated (atrous) convolution with a conditional random field, was proposed on the basis of VGG16; the dilated convolution enlarges the receptive field while keeping the resolution unchanged, yielding higher segmentation accuracy.
Self-attention and the Transformer became familiar to researchers after Google's Transformer model, introduced in 2017, achieved striking results in natural language processing. It is precisely because of the Transformer's remarkable achievements in natural language processing and its powerful modeling capability, especially its excellent handling of global information, that researchers began to apply it to the field of computer vision.
The prior art scheme has the following defects:
1. Insufficient ability to extract semantic information from remote sensing images with complex content and background
Traditional convolutional neural networks have a limited ability to extract semantic information. Faced with remote sensing images whose content and background are complex, they cannot effectively extract the key information required for semantic segmentation, which greatly degrades the segmentation result.
2. Correlations between features of different scales are not considered during feature fusion
Because object scales in remote sensing images differ greatly, the semantic information of many small and medium-sized objects cannot be propagated into the deep layers of the network, which seriously affects the segmentation result. Previous multi-scale feature fusion methods only superimpose features from top to bottom in a simple way, without considering the correlations between different scales, so multi-scale semantic information is not well exploited.
3. Poor generalization on remote sensing data with unevenly distributed objects
Because of differences between shooting areas, remote sensing images have complex content and uneven object distribution. Traditional methods segment well only on images of a single scene type, such as urban or rural, and generalize poorly.
Disclosure of Invention
Aiming at the defects of the prior art, a self-attention multi-scale feature fusion remote sensing image semantic segmentation method is provided. A feature encoder transmits the feature maps of four stages, which have different scales, to a decoder; the decoder upsamples starting from the last-stage feature map and superimposes it with the self-attention multi-scale fused feature maps, upsampling step by step until the scale of the first-stage feature map is reached; finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result. The method can effectively fuse remote sensing semantic features of different scales and improves segmentation performance. The specific steps are as follows:
step 1: construct a remote sensing semantic segmentation network comprising a feature encoder and a decoder; train a Swin-T network on the ImageNet data set in advance, use the trained Swin-T network as the encoder, and use a pyramid-structured network with self-attention multi-scale feature fusion as the decoder;
step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder;
step 3: the decoder comprises three self-attention multi-scale feature fusion modules, namely a first feature fusion module, a second feature fusion module and a third feature fusion module; the feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module;
step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module;
step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module;
step 4: each feature fusion module further comprises an attention computing module; in each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores; the three correlation scores are multiplied with the corresponding feature maps before global pooling, and the multiplied feature maps are finally spliced along the channel dimension and the channel number is adjusted to be consistent with the channel number of the current stage;
step 5: perform prediction classification on each pixel, specifically:
as shown in Fig. 1, the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module; the three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
According to a preferred embodiment, the feature adjustment module works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
According to a preferred embodiment, step 5 further comprises: during segmentation prediction a multi-scale strategy is also adopted; the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed.
The invention has the beneficial effects that:
1. the self-attention multi-scale feature fusion module provided by the invention can effectively fuse features among different scales, can extract useful semantic features from the remote sensing image with complex background content, and can obtain better segmentation results on the semantic segmentation of the remote sensing image with complex background, variable object scales and uneven distribution.
2. The feature adjusting module provided by the invention can effectively transmit the large-scale features to the attention multi-scale feature fusion module, better retains the main features of the large-scale feature map and improves the segmentation precision.
3. The correlations between feature maps of different scales are taken into account: self-attention is used to compute the correlations, and the resulting correlation scores can be understood as weights. Fully considering these cross-scale correlations improves the segmentation accuracy for objects with large scale differences in remote sensing image semantic segmentation, in particular buildings and water bodies.
Drawings
FIG. 1 is a schematic diagram of the structure of a semantic segmentation network of the present invention;
FIG. 2 is a schematic diagram of a self-attention multi-scale module according to the present invention;
FIG. 3 is a block diagram of a feature adjustment module;
FIG. 4 is a graph comparing the results of the experiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The Swin-T network of the first stage represents the Swin-T first module.
The Swin-T network of the second stage represents the Swin-T second module.
The Swin-T network of the third stage represents the Swin-T third module.
The Swin-T network of the fourth stage represents the Swin-T fourth module.
The Swin Transformer comes in four sizes (Swin-T, Swin-S, Swin-B and Swin-L); their structures are the same and they differ only in the parameter settings that determine the network size. Swin-T is the smallest variant and comprises four Swin Transformer stages, i.e., stages one through four.
The following detailed description is made with reference to the accompanying drawings.
The invention mainly solves the problems of incomplete semantic segmentation and low accuracy caused by the complex content, large differences in object scale and uneven distribution of remote sensing images, and provides a self-attention multi-scale feature fusion remote sensing semantic segmentation method. Fig. 1 is a schematic structural diagram of the semantic segmentation network of the present invention, and as shown in Fig. 1 the specific steps of the invention include:
Step 1: construct a remote sensing semantic segmentation network comprising a feature encoder and a decoder; train a Swin-T network on the ImageNet data set in advance, use the trained Swin-T network as the encoder, and use a pyramid-structured network with self-attention multi-scale feature fusion as the decoder. Specifically, a Swin-T model pre-trained on ImageNet is used as the feature extractor, and feature maps of 1/4, 1/8 and 1/16 of the original image size are extracted and passed to the self-attention multi-scale feature fusion modules for fusion. During training, the original image is randomly rescaled to 0.5, 0.75, 1.0, 1.25, 1.5 or 2.0 times its size for multi-scale augmentation.
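A minimal sketch of this multi-scale training augmentation is given below; the interpolation modes (bilinear for the image, nearest-neighbour for the label map) are assumptions, since the patent only specifies the set of scale factors.

```python
import random

import torch
import torch.nn.functional as F

# Scale factors given in the description for multi-scale training augmentation.
SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 2.0)

def random_rescale(image: torch.Tensor, label: torch.Tensor):
    """Randomly rescale an (N, C, H, W) image tensor and its (N, H, W) label map.

    Bilinear interpolation for the image and nearest-neighbour for the label
    are assumptions; the patent only specifies the scale factors.
    """
    s = random.choice(SCALES)
    image = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
    label = F.interpolate(label.unsqueeze(1).float(), scale_factor=s, mode="nearest")
    return image, label.squeeze(1).long()
```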
Step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder.
Step 3: the decoder comprises three self-attention multi-scale feature fusion modules; as shown in Fig. 1, the first, second and third feature fusion modules are arranged from left to right. The feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
Step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module.
Step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module.
Step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module.
Step 4: each feature fusion module further comprises an attention computing module. In each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores. The globally pooled results of the three feature maps remain independent and correspond one-to-one to the feature maps before pooling, and the self-attention computation does not disturb this correspondence. The three correlation scores are multiplied with the corresponding feature maps before global pooling; the multiplied feature maps are finally spliced along the channel dimension, and the channel number is adjusted to be consistent with the channel number of the current stage. The purpose of multiplying the correlation scores with the pre-pooling feature maps is the following: correlations exist between feature maps of different scales, self-attention is used to compute these correlations, the resulting correlation scores can be understood as weights, and multiplying a correlation score with the corresponding pre-pooling feature map is therefore a weighting operation.
Fig. 2 is a schematic diagram of the self-attention multi-scale fusion module of the present invention, taking the second stage as an example: its feature map is 64 × 64 pixels with 192 channels, and the self-attention multi-scale feature fusion process is shown in Fig. 2.
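A minimal PyTorch sketch of this fusion module for the 64 × 64, 192-channel second stage is given below. Reading "splice, compute self-attention, then split" as attention over a three-token sequence, using nn.MultiheadAttention, and squashing the correlation scores with a sigmoid are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Self-attention multi-scale feature fusion (step 4), sketched for the second
    stage (64 x 64 feature maps, 192 channels). The three inputs are assumed to be
    already aligned in scale and channel count by steps 31-33."""

    def __init__(self, channels: int = 192, heads: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                           # global average pooling
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)  # back to the stage width

    def forward(self, f1, f2, f3):
        feats = (f1, f2, f3)
        tokens = torch.stack([self.pool(f).flatten(1) for f in feats], dim=1)  # (B, 3, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)                # self-attention over the three scales
        scores = torch.sigmoid(attn_out).unbind(dim=1)                 # split into three correlation scores
        weighted = [f * s[:, :, None, None] for f, s in zip(feats, scores)]
        return self.fuse(torch.cat(weighted, dim=1))                   # splice on channels, adjust channel count
```

With three aligned inputs of shape (B, 192, 64, 64), the module returns a fused map of the same shape.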
Directly pooling the large-scale feature maps to the target size has the advantage of requiring no additional parameters, but it does not retain effective information well, which degrades the multi-scale feature fusion. In order to better transfer the features of the large-scale feature map to the self-attention multi-scale feature fusion module, a feature adjustment module is proposed, as shown in Fig. 3. It works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
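A possible realization of the feature adjustment module is sketched below; the BatchNorm/ReLU placement and taking the residual from the 1 × 1 projection are assumptions, as the text only names the pooling, the convolutions and the residual connection.

```python
import torch.nn as nn

class FeatureAdjust(nn.Module):
    """Feature adjustment module (Fig. 3): 2x2 max pooling, a 1x1 convolution to set
    the channel count, two 3x3 convolutions for selective feature extraction, and a
    residual connection."""

    def __init__(self, in_ch: int, out_ch: int, down: int = 2):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=down, stride=down)  # 2x2 max pooling (down=4 when two stages apart)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.proj(self.pool(x))
        return self.act(x + self.body(x))  # residual connection against gradient explosion/vanishing
```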
Step 5: perform prediction classification on each pixel, specifically:
As shown in Fig. 1, the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module. The three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
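The top-down decoding and four-scale prediction fusion could be sketched as follows. The 1 × 1 channel-alignment convolutions before each addition, the per-scale 1 × 1 classification heads, and the class count of 7 (as in LoveDA) are assumptions added only to make the sketch self-contained; the patent states only that each map is doubled in resolution, superimposed, and that the four per-scale predictions are fused.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownDecoder(nn.Module):
    """Top-down decoding and four-scale prediction fusion of step 5 (a sketch)."""

    def __init__(self, chs=(96, 192, 384, 768), num_classes: int = 7):
        super().__init__()
        self.red43 = nn.Conv2d(chs[3], chs[2], kernel_size=1)   # stage 4 -> stage 3 width (assumed alignment)
        self.red32 = nn.Conv2d(chs[2], chs[1], kernel_size=1)   # stage 3 -> stage 2 width
        self.red21 = nn.Conv2d(chs[1], chs[0], kernel_size=1)   # stage 2 -> stage 1 width
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for c in (chs[3], chs[2], chs[1], chs[0])
        )

    def forward(self, f4, fused3, fused2, fused1, out_size):
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = self.red43(up(f4)) + fused3          # double and superimpose with fusion-module output
        d2 = self.red32(up(d3)) + fused2
        d1 = self.red21(up(d2)) + fused1
        logits = [
            F.interpolate(h(x), size=out_size, mode="bilinear", align_corners=False)
            for h, x in zip(self.heads, (f4, d3, d2, d1))
        ]
        return sum(logits) / len(logits)          # fuse the four per-scale predictions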
Step 5 also includes a multi-scale strategy at prediction time: the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed. In the prediction stage this multi-scale strategy improves segmentation accuracy: the remote sensing image is rescaled to each of the six ratios and fed into the network to obtain six results, which are then averaged and fused with weights to obtain a more accurate segmentation result.
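A sketch of this multi-scale prediction strategy is given below; equal-weight averaging of softmax probabilities is an assumption, since the text mentions both averaging and weighted fusion of the six results.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_predict(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
    """Run the network on rescaled copies of the (N, C, H, W) input, resize the
    class probabilities back to the input size and average them."""
    _, _, h, w = image.shape
    fused = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        probs = model(x).softmax(dim=1)                               # (N, classes, s*h, s*w)
        fused = fused + F.interpolate(probs, size=(h, w), mode="bilinear", align_corners=False)
    return (fused / len(scales)).argmax(dim=1)                        # per-pixel class map
```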
To evaluate the performance of the proposed remote sensing image semantic segmentation method, intersection over union (IoU) and mean intersection over union (mIoU), which are commonly used in remote sensing semantic segmentation, are adopted as evaluation metrics. The IoU of a class is the ratio of the intersection to the union of the predicted region and the ground truth for that class, and the mIoU is the average of the IoU over all classes. The IoU is computed as follows:
IoU = TP / (TP + FP + FN)
where TP is the number of pixels that are actually true and predicted true, FP is the number of pixels that are actually false but predicted true, and FN is the number of pixels that are actually true but predicted false. The higher the IoU, the better the model's semantic segmentation of the remote sensing image.
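The IoU and mIoU metrics can be computed directly from predicted and ground-truth label maps, as in the following sketch.

```python
import numpy as np

def iou_scores(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Per-class IoU = TP / (TP + FP + FN) and the mean IoU (mIoU).
    pred and target are integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        union = tp + fp + fn
        ious.append(tp / union if union > 0 else float("nan"))
    return ious, float(np.nanmean(ious))
```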
To verify the effectiveness of the proposed remote sensing image semantic segmentation method, it is compared on the test set with the benchmark scores of the network models reported for the LoveDA dataset, including FCN8s, DeepLabV3+, PAN, UNet++, Semantic-FPN, PSPNet, LinkNet, FarSeg, FactSeg and HRNet.
Comparison is first carried out at a single scale; the results are shown in Table 1. The proposed method achieves the best IoU in every category, and its mean IoU is 2.98% higher than the second-best result, showing that the model performs well on semantic segmentation of complex remote sensing images.
Table 1. Comparison of semantic segmentation results on the LoveDA dataset
[Table 1 is provided as an image in the original publication.]
For remote sensing images with large scale differences, using a multi-scale strategy during semantic segmentation training and testing can effectively improve model performance. Therefore, the method is also compared under the multi-scale setting with DeepLabV3+, UNet and HRNet on the LoveDA dataset, using the scales 0.5, 0.75, 1.0, 1.25, 1.5 and 1.75 uniformly. The comparison results are shown in Table 2: the multi-scale strategy significantly improves the performance of the different methods, and the proposed method reaches a mean IoU of 54.19%, 1.47% higher than the second-best result, which is the best result on the LoveDA dataset.
Table 2. Comparison of multi-scale training and multi-scale testing results
[Table 2 is provided as an image in the original publication.]
To further analyze the role of the individual modules in the overall network model, ablation studies were performed. The experiments compare ResNet50 + FPN, Swin-T + FPN, self-attention multi-scale feature fusion without the feature adjustment module, and self-attention multi-scale feature fusion with the feature adjustment module; the results are shown in Table 3. From Table 3 it can be seen that: 1) thanks to its powerful modeling capability, the Swin Transformer performs better on semantic segmentation of complex remote sensing images; 2) in self-attention multi-scale feature fusion, the improvement is limited when the feature adjustment module is not used, because the large-scale features are crudely pooled directly to the same scale and the features therefore do not match; 3) after the feature adjustment module is added, features of different scales are selectively adjusted before being input to the self-attention multi-scale feature fusion module, where their correlations are computed and they are then spliced and fused, which effectively improves segmentation performance at both single scale and multiple scales.
Table 3. Ablation experiments
[Table 3 is provided as an image in the original publication.]
Fig. 4 shows segmentation results of different models. The segmentation results of the proposed model are smoother, with more complete edges and no broken branches. Together with the experimental results and analysis above, this shows that the Swin-Transformer-based self-attention multi-scale feature fusion module can effectively fuse features of different scales and improve the model's performance on remote sensing image semantic segmentation. As can also be seen from Fig. 4, the proposed method generalizes well and maintains good performance on remote sensing images with uneven object distribution and large differences in image content.
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (3)

1. A self-attention multi-scale feature fusion remote sensing image semantic segmentation method, characterized in that a feature encoder transmits the feature maps of four stages, which have different scales, to a decoder; the decoder upsamples starting from the last-stage feature map and superimposes it with the self-attention multi-scale fused feature maps, upsampling step by step until the scale of the first-stage feature map is reached; finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result; the method can effectively fuse remote sensing semantic features of different scales and improves segmentation performance; the specific steps comprise:
step 1: constructing a remote sensing semantic segmentation network comprising a feature encoder and a decoder; training a Swin-T network on the ImageNet data set in advance, taking the trained Swin-T network as the encoder, and taking a pyramid-structured network with self-attention multi-scale feature fusion as the decoder;
step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder;
step 3: the decoder comprises three self-attention multi-scale feature fusion modules, namely a first feature fusion module, a second feature fusion module and a third feature fusion module; the feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module;
step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module;
step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module;
step 4: each feature fusion module further comprises an attention computing module; in each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores; the three correlation scores are multiplied with the corresponding feature maps before global pooling, and the multiplied feature maps are finally spliced along the channel dimension and the channel number is adjusted to be consistent with the channel number of the current stage;
step 5: performing prediction classification on each pixel, specifically:
the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module; the three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
2. The self-attention multi-scale feature fusion remote sensing image semantic segmentation method according to claim 1, characterized in that the feature adjustment module works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
3. The self-attention multi-scale feature fusion remote sensing image semantic segmentation method according to claim 1, characterized in that step 5 further comprises: during remote sensing semantic segmentation prediction a multi-scale strategy is also adopted; the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed.
CN202210308387.1A 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method Active CN114419449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308387.1A CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210308387.1A CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Publications (2)

Publication Number Publication Date
CN114419449A true CN114419449A (en) 2022-04-29
CN114419449B CN114419449B (en) 2022-06-24

Family

ID=81263512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308387.1A Active CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN114419449B (en)



Patent Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
US20210019562A1 (en) * 2019-07-18 2021-01-21 Beijing Sensetime Technology Development Co., Ltd. Image processing method and apparatus and storage medium
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method
US20220051056A1 (en) * 2019-11-12 2022-02-17 Tencent Technology (Shenzhen) Company Limited Semantic segmentation network structure generation method and apparatus, device, and storage medium
US20210201499A1 (en) * 2019-12-30 2021-07-01 Medo Dx Pte. Ltd Apparatus and method for image segmentation using a deep convolutional neural network with a nested u-structure
CN111145170A (en) * 2019-12-31 2020-05-12 电子科技大学 Medical image segmentation method based on deep learning
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion
CN112329800A (en) * 2020-12-03 2021-02-05 河南大学 Salient object detection method based on global information guiding residual attention
CN112418176A (en) * 2020-12-09 2021-02-26 江西师范大学 Remote sensing image semantic segmentation method based on pyramid pooling multilevel feature fusion network
CN112560733A (en) * 2020-12-23 2021-03-26 上海交通大学 Multitasking system and method for two-stage remote sensing image
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033570A (en) * 2021-03-29 2021-06-25 同济大学 Image semantic segmentation method for improving fusion of void volume and multilevel characteristic information
CN113256649A (en) * 2021-05-11 2021-08-13 国网安徽省电力有限公司经济技术研究院 Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN113516126A (en) * 2021-07-02 2021-10-19 成都信息工程大学 Adaptive threshold scene text detection method based on attention feature fusion
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN113705675A (en) * 2021-08-27 2021-11-26 合肥工业大学 Multi-focus image fusion method based on multi-scale feature interaction network
CN113679426A (en) * 2021-09-14 2021-11-23 上海市第六人民医院 Ultrasonic image processing system
CN113850825A (en) * 2021-09-27 2021-12-28 太原理工大学 Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN113688813A (en) * 2021-10-27 2021-11-23 长沙理工大学 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN113902751A (en) * 2021-11-10 2022-01-07 南京大学 Intestinal neuron dysplasia identification method based on Swin-Unet algorithm
CN114066902A (en) * 2021-11-22 2022-02-18 安徽大学 Medical image segmentation method, system and device based on convolution and transformer fusion
CN114202550A (en) * 2021-11-24 2022-03-18 重庆邮电大学 Brain tumor MRI image three-dimensional segmentation method based on RAPNet network
CN114140472A (en) * 2022-02-07 2022-03-04 湖南大学 Cross-level information fusion medical image segmentation method
CN114240004A (en) * 2022-02-23 2022-03-25 武汉纺织大学 Garment fashion trend prediction method and system based on multi-source information fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SITONG WU et al.: "Fully Transformer Networks for Semantic Image Segmentation", arXiv:2106.04108v3, 28 December 2021, pages 1-12 *
ZEYU CHENG et al.: "Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation", IEEE Sensors Journal, vol. 21, no. 23, 1 December 2021, pages 26912-26920 *
WU Yuhang: "Multi-scale fusion liver organ segmentation based on an attention mechanism" (基于注意力机制的多尺度融合肝脏器官分割), Modern Computer (《现代计算机》), no. 13, 31 May 2021, pages 90-96 *
ZHENG Tingyue et al.: "Multi-scale retinal vessel segmentation based on fully convolutional neural networks" (基于全卷积神经网络的多尺度视网膜血管分割), Acta Optica Sinica (《光学学报》), vol. 39, no. 2, 28 February 2019, page 0211002-1 *
ZHONG Jianping: "Research on video object detection based on feature fusion and attention mechanism" (基于特征融合与注意力机制的视频目标检测研究), China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), no. 3, 15 March 2022, pages 138-2321 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231329A1 (en) * 2022-05-30 2023-12-07 湖南大学 Medical image semantic segmentation method and apparatus
CN115019182A (en) * 2022-07-28 2022-09-06 北京卫星信息工程研究所 Remote sensing image target fine-grained identification method, system, equipment and storage medium
CN115578406A (en) * 2022-12-13 2023-01-06 四川大学 CBCT jaw bone region segmentation method and system based on context fusion mechanism
CN116229065A (en) * 2023-02-14 2023-06-06 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116295469A (en) * 2023-05-19 2023-06-23 九识(苏州)智能科技有限公司 High-precision map generation method, device, equipment and storage medium
CN116295469B (en) * 2023-05-19 2023-08-15 九识(苏州)智能科技有限公司 High-precision map generation method, device, equipment and storage medium
CN116580241A (en) * 2023-05-22 2023-08-11 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN116580241B (en) * 2023-05-22 2024-05-14 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN117315460A (en) * 2023-09-15 2023-12-29 生态环境部卫星环境应用中心 FarSeg algorithm-based dust source extraction method for construction sites of urban construction area

Also Published As

Publication number Publication date
CN114419449B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN114419449B (en) Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN112541503A (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN112801117B (en) Multi-channel receptive field guided characteristic pyramid small target detection network and detection method
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN113780132A (en) Lane line detection method based on convolutional neural network
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN116755090A (en) SAR ship detection method based on novel pyramid structure and mixed pooling channel attention mechanism
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN113313180B (en) Remote sensing image semantic segmentation method based on deep confrontation learning
CN114926826A (en) Scene text detection system
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN117058386A (en) Asphalt road crack detection method based on improved deep Labv3+ network
CN110136098B (en) Cable sequence detection method based on deep learning
CN117197663A (en) Multi-layer fusion picture classification method and system based on long-distance dependency mechanism
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant