CN113283435A - Remote sensing image semantic segmentation method based on multi-scale attention fusion - Google Patents

Remote sensing image semantic segmentation method based on multi-scale attention fusion

Info

Publication number
CN113283435A
Authority
CN
China
Prior art keywords
fusion
remote sensing
scale
image
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110528206.1A
Other languages
Chinese (zh)
Other versions
CN113283435B (en)
Inventor
雷涛 (Lei Tao)
李林泽 (Li Linze)
加小红 (Jia Xiaohong)
薛丁华 (Xue Dinghua)
张月 (Zhang Yue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi University of Science and Technology
Original Assignee
Shaanxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi University of Science and Technology filed Critical Shaanxi University of Science and Technology
Priority to CN202110528206.1A
Publication of CN113283435A
Application granted
Publication of CN113283435B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image semantic segmentation method based on multi-scale attention fusion. It mainly relates to image segmentation technology and aims at the semantic segmentation of high-resolution remote sensing images. The method fuses multi-modal data to address the difficulty of classifying targets in remote sensing images; an attention mechanism is introduced to redistribute resources in the feature extraction stage, so that redundant features are avoided; a multi-scale spatial context module is adopted to handle the large variation of target scales in remote sensing images; and a residual skip connection strategy is used to retain and optimize the encoder-side information, solving the problem of image features being lost during down-sampling. The invention not only realizes semantic segmentation of high-resolution remote sensing images but also achieves high classification accuracy, providing objective and accurate data for the understanding and analysis of high-resolution remote sensing images.

Description

Remote sensing image semantic segmentation method based on multi-scale attention fusion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a remote sensing image semantic segmentation method based on multi-scale attention fusion.
Background
High-resolution remote sensing images provide abundant ground geometric information and details of ground targets, and are therefore widely used for classifying and identifying ground targets in complex scenes. Semantic segmentation of high-resolution remote sensing images requires assigning a specific semantic label to each pixel, such as: building, pavement, car, tree, low vegetation, and so on. Unlike single-object recognition, image semantic segmentation can identify multiple objects in an image simultaneously. It is therefore widely applied in fields such as military target detection, city planning, building identification and road extraction. However, existing semantic segmentation techniques for high-resolution remote sensing images face the following challenges: first, the imaging of high-resolution remote sensing images is uniquely complex, so manually distinguishing ground targets is difficult and inefficient; second, interference from shadows, clouds and illumination causes large classification errors. Compared with natural images, the semantic segmentation task for remote sensing images is more complex. On the one hand, high-resolution remote sensing images typically contain many more complex scenes. On the other hand, remote sensing images exhibit high intra-class variance and low inter-class variance, and different ground objects such as trees and low vegetation can show similar characteristics in the spectral image, which challenges the semantic segmentation of high-resolution remote sensing images. Fortunately, a Digital Surface Model (DSM) containing rich geographic information provides supplementary information for ground feature classification, and experiments show that fully utilizing DSM data can significantly improve segmentation accuracy. Therefore, research on semantic segmentation networks with strong robustness and high accuracy is of great significance for understanding the complex scenes of high-resolution remote sensing images.
Many conventional machine learning methods have been used for remote sensing image analysis. Although these methods can achieve object detection and recognition in remote sensing images, their accuracy is limited because reliable feature extraction is difficult. In recent years, with the rapid development of deep learning, convolutional neural networks have achieved great success in the field of image semantic segmentation. It is well known that convolutional neural networks provide hierarchical feature representations and learn deep semantic features by stacking convolutional layers, which is very important for improving model performance. In addition, convolutional neural networks can effectively suppress noise interference and thereby enhance robustness.
Unlike traditional semantic segmentation of high-resolution remote sensing images, adding multi-modal data further improves classification accuracy. To make reasonable use of data from two modalities, the related art discloses an end-to-end DSM Fusion Network (DSMFNet), which designs four interaction modes to fuse and process multi-modal data. Its most accurate model inherits the strong performance of DeepLab v3+ for extracting RGB image features, designs a lightweight depthwise separable convolution module to extract DSM image features separately, and fuses the information of the different modalities before upsampling. However, DSM images contain less information and a simple network model cannot extract their deeper features, and simply superimposing red, green, blue (RGB) spectral images and DSM images does not take full advantage of the relationship between multi-modal information, but instead introduces redundant features.
In the semantic segmentation task for high-resolution remote sensing images, the scales of the targets to be segmented differ greatly. Aiming at the multi-scale target problem, the related art also discloses a Multi-scale Adaptive Feature Fusion Network (MANet), which uses ResNet101 as the backbone network to extract image features and passes the high-level semantic features to a context extraction module, thereby addressing the problem that targets in remote sensing images vary greatly in size and are difficult to segment. Its adaptive fusion module fuses high-level and low-level semantic information and redistributes the fused resources, achieving adaptive combination weights while avoiding redundant information. However, this algorithm does not exploit the spatial relationship between channels when extracting multi-scale features, and for targets with similar semantic features it cannot emphasize the relevance of intra-class features, so the segmentation accuracy is low.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a remote sensing image semantic segmentation method based on multi-scale attention fusion, which realizes semantic segmentation of high-resolution remote sensing images, avoids redundant features, achieves high classification accuracy, and provides objective and accurate data for understanding and analyzing high-resolution remote sensing images.
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:
1) cropping a data set;
2) inputting the cropped IRRG image and the cropped DSM image into a multi-modal fusion module to obtain the multi-modal fusion features F0, F1, F2 and F3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse the features and to assign weight resources;
3) using a multi-scale spatial context enhancement module to integrate and improve the multi-modal fusion feature F3, then performing one up-sampling;
4) using a residual skip connection strategy to optimize the encoder-side multi-modal fusion features F0, F1 and F2, fusing them with the decoder-side features of the corresponding scale, and continuing to up-sample until a segmentation map is output;
5) splicing the segmentation maps according to the size of the original image to complete the semantic segmentation of the remote sensing image.
Further, the IRRG image, the DSM image, and their corresponding label maps are cropped using a sliding window in step 1), and the cropped image size is 256 × 256.
Further, the multi-modal fusion module includes an optical branch, a depth branch and a coding fusion branch; each of the optical branch and the depth branch provides a set of feature maps at each module stage, and the coding fusion branch takes the fusion of the optical branch and the depth branch as input before down-sampling and processes the fused data.
Further, the multi-modal fusion module implementation includes:
1) the features I0 of the IRRG image and the features D0 of the DSM image are input to two ResNet50 networks pre-trained on ImageNet, and the fusion feature M0 of I0 and D0 is input to a third ResNet50 pre-trained on ImageNet; the initial-stage model MFM-0 is given by the corresponding formula (shown as an image in the original publication), where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module;
2) the feature maps output by the three branches in the first stage are used as the input of the second stage, and the fused output of stage MFM-1 follows the corresponding formula;
3) the feature maps output by the three branches in the second stage are used as the input of the third stage, and the fused output of stage MFM-2 follows the corresponding formula;
4) the feature maps output by the three branches in the third stage are used as the input of the fourth stage, and the fused output of stage MFM-3 follows the corresponding formula.
further, the module implementation based on channel attention includes:
1) inputting a characteristic diagram A ═ a1,a2,...,ac]Viewed as channel ai∈RH×WThe vector G is obtained after the whole local average pooling1×1×CAnd kthElement, the model is:
Figure BDA0003067100990000041
integrating global information into a vector G;
2) converting vector G into
Figure BDA0003067100990000042
Wherein, O1∈R1×1×C/2, O2∈R1×1×CDenotes two fully connected convolutional layers, in O1After that, an Activation function is added, and the Activation function is further added through a Sigmoid function sigma (·)
Figure BDA0003067100990000043
Activate, constrain it at [0, 1];
3) A is mixed with
Figure BDA0003067100990000044
Performing outer product to obtain
Figure BDA0003067100990000045
The model is
Figure BDA0003067100990000046
Further, the multi-scale spatial context enhancement module comprises an ASPP module and a non-local module; F represents the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F3)).
Further, the multi-scale spatial context enhancement module implementation comprises:
1) the multi-modal fusion feature F3 of the last stage of the multi-modal fusion module is input to the multi-scale spatial context enhancement module to extract multi-scale information; 3 × 3 convolutions with dilation rates of 3, 6 and 9 are combined with a standard 1 × 1 convolution for multi-scale information extraction, and an image average pooling branch is added to integrate global context information;
2) after multi-scale information fusion with the ASPP module, the number of channels is reduced to 256 by a 1 × 1 convolution, and the result then enters the non-local module;
3) the non-local model is
Fi = (1/C(X)) Σ_{j=1..N} f(xi, xj) g(xj),
where the feature map X = [x1, x2, ..., xN] is the input, xi ∈ R^(1×1×C) and xj ∈ R^(1×1×C) are the feature vectors at positions i and j respectively, N = H × W is the number of pixels, H × W is the spatial dimension, F has the same number of channels as X, C(X) is the normalization operation, and g(xj) = Wv·xj is implemented as a 1 × 1 convolution in the network; f(xi, xj)/C(X) is the normalized correlation of the vectors xi and xj, used to compute spatial similarity, and is modeled as
f(xi, xj) = exp(m(xi)ᵀ n(xj)),
where m(xi) and n(xj) are linear transformation matrices, m(xi) = Wq·xi and n(xj) = Wk·xj, all implemented as 1 × 1 convolutions in the network;
4) F is up-sampled once by bilinear interpolation.
Further, the residual skip connection model is
f_{l+1} = DSC(Activation(Tconv(f_l)) ⊕ f_{l-1}),
where f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ is pixel-level addition, f_{l-1} is the feature of the (l-1)-th layer before down-sampling, and f_{l+1} is the result of the residual skip connection processing.
Further, the residual skip connection implementation includes:
1) the features of the l-th layer are restored, through a learned transposed convolution, to the same size as the features of the (l-1)-th layer;
2) the features of the (l-1)-th layer that have not been down-sampled are extracted separately and added to them;
3) the features are learned again using a depthwise separable convolution and transmitted to the (l+1)-th layer after its up-sampling.
Further, the features optimized by the residual skip connection strategy are gradually fused with the decoder features and continuously up-sampled by bilinear interpolation until the segmentation map is output.
Compared with the prior art, the invention addresses the facts that the scale of segmented targets differs greatly, the segmented scenes are highly complex, semantic segmentation of high-resolution remote sensing images is difficult, and the segmentation effect cannot be improved using only spectral data and a traditional multi-scale feature extraction module. Based on an encoding-decoding structure, the invention first uses the multi-modal fusion module to process IRRG and DSM data from different modalities separately and introduces the channel-attention-based module to redistribute resources between high-level and low-level semantic features, thereby improving the fusion result of the multi-modal data, solving the problem that targets in remote sensing images are difficult to classify, and avoiding redundant features. Extracting and fusing features from the IRRG image and the DSM image separately with the multi-modal fusion module not only solves the problem that DSM image information cannot be fully utilized due to an unbalanced network structure, but also avoids the feature redundancy that easily occurs in two modality-specific encoders, so better segmentation results can be obtained than with mainstream semantic segmentation algorithms for high-resolution remote sensing images. Secondly, the fused features are improved by the multi-scale spatial context enhancement module, which addresses the large difference of target scales in remote sensing images. Finally, the residual skip connection strategy retains and optimizes the multi-modal fusion information of the encoder side, solves the problem of image features being lost during down-sampling, optimizes the encoder output, and provides effective feature mappings for the decoder side, improving the accuracy of target contours in high-resolution remote sensing images while optimizing the multi-modal information fusion. Compared with mainstream remote sensing image semantic segmentation algorithms, the invention on the one hand improves pixel classification accuracy by using image depth data and the multi-scale spatial context enhancement module, and on the other hand uses the residual skip connection strategy to improve the contour accuracy of segmented targets while locating them accurately. The method realizes semantic segmentation of high-resolution remote sensing images with high classification accuracy, and has broad application prospects in scene understanding and analysis of high-resolution remote sensing images.
Drawings
FIG. 1 is a diagram of a network architecture of the present invention;
FIG. 2a is a block diagram of the MFM-0 stage of the multimodal fusion module of the present invention; FIG. 2b is a block diagram of the MFM-n (n ∈ [1,3]) stage of the multimodal fusion module of the present invention;
FIG. 3 is a block diagram of the channel-attention-based module of the present invention;
FIG. 4 is a block diagram of the multi-scale spatial context enhancement module of the present invention;
FIG. 5 is a diagram of the residual skip connection strategy architecture of the present invention;
FIG. 6 is a sample image of a Potsdam dataset and a Vaihingen dataset;
FIG. 7 is a graph comparing the segmentation results of the Potsdam dataset slices according to the present invention and the prior art method;
FIG. 8 is a graph comparing the segmentation results of Potsdam datasets for the present invention and prior art methods;
FIG. 9 is a graph comparing the segmentation results of slices of the Vaihingen data set according to the present invention and the prior art method;
FIG. 10 is a graph comparing the results of the present invention and prior art methods in the segmentation of the Vaihingen data set.
Detailed Description
The present invention will be further explained with reference to the drawings and specific examples in the specification, and it should be understood that the examples described are only a part of the examples of the present application, and not all examples. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The invention provides a remote sensing image semantic segmentation method based on multi-scale attention fusion and designs a Multi-scale Attention Fusion Network (MAFNet) for high-resolution remote sensing image semantic segmentation. It involves technologies such as convolutional neural networks and multi-modal data fusion, can be applied to the semantic segmentation of high-resolution remote sensing images, and lays a foundation for scene understanding of high-resolution remote sensing images.
Referring to fig. 1, the Multi-scale Attention Fusion network of the present invention is based on an encoding-decoding structure, and first uses a Multi-Modal Fusion Module (MFM) to process IRRG and DSM data from different modalities respectively, and introduces a Channel Attention-based Module (MCA) to redistribute resources between high-level semantic features and low-level semantic features, thereby improving the Fusion result of Multi-modal data, solving the problem that objects in remote sensing images are difficult to classify, and avoiding redundant features. And secondly, a Multi-scale Spatial Context Enhancement Module (MSCEM) is used for improving the characteristics after fusion, so that the problem of large difference of target scales in the remote sensing image is solved. And finally, a Residual Skip Connection strategy (RSC) is utilized to reserve and optimize multi-mode fusion information of the encoding end, and the problem that image characteristics are lost during down-sampling is solved. Compared with a mainstream remote sensing image semantic segmentation algorithm, the method improves the pixel classification precision by utilizing the image depth data and the multi-scale spatial context enhancement module on one hand, and improves the outline precision of the segmented target while accurately positioning the target by utilizing a residual jump connection strategy on the other hand. The method can realize semantic segmentation of the high-resolution remote sensing image, has higher classification precision, and has wide application prospect in the field of scene understanding and analysis of the high-resolution remote sensing image.
The present invention takes into account that using near infrared, red, green (IRRG) data yields higher segmentation accuracy than RGB data, and therefore uses IRRG images and normalized digital surface model (DSM) images as data sources. A reasonably designed symmetrical encoder-decoder structure is adopted. First, in the encoding process, the proposed model uses the multi-modal fusion module to extract features from the IRRG image and the DSM image separately, and the multi-modal data are fused at each sampling stage. Secondly, a multi-scale spatial context enhancement module is introduced at the end of the encoding stage to integrate and improve the global spatial information of targets at different scales. Finally, the multi-modal features are learned and optimized with the residual skip connection strategy, fused with the decoding features, and up-sampled to output a segmentation map. The method comprises the following steps:
(1) cropping a data set;
(2) inputting the cropped IRRG image and the cropped DSM image into the multi-modal fusion module to obtain the multi-modal fusion features F0, F1, F2 and F3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse the features and to assign weight resources;
(3) using the multi-scale spatial context enhancement module to integrate and improve the multi-modal fusion feature F3, then performing one up-sampling;
(4) using the residual skip connection strategy to optimize the encoder-side multi-modal fusion features F0, F1 and F2, fusing them with the decoder-side features of the corresponding scale, and continuing to up-sample until a segmentation map is output;
(5) splicing and outputting the segmentation maps according to the size of the original image, thereby completing the semantic segmentation of the remote sensing image.
The invention is described in detail below, comprising the steps of:
(1) the IRRG image, the DSM image, and their corresponding label maps are cropped using a sliding window, and the image size after cropping is 256 × 256.
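As an illustration only, a minimal sliding-window cropping sketch is given below; the 128-pixel stride matches the overlap reported later in the experimental section, and all function and variable names are hypothetical, not part of the invention. Edge strips that do not align with the stride are omitted in this simplified version.

```python
import numpy as np

def sliding_window_crop(image, tile=256, stride=128):
    """Crop an H x W (x C) array into overlapping tile x tile patches."""
    h, w = image.shape[:2]
    patches, positions = [], []
    for top in range(0, max(h - tile, 0) + 1, stride):
        for left in range(0, max(w - tile, 0) + 1, stride):
            patches.append(image[top:top + tile, left:left + tile])
            positions.append((top, left))
    return np.stack(patches), positions

# Example: crop an IRRG tile and its DSM counterpart with the same grid
# irrg_patches, pos = sliding_window_crop(irrg_image)
# dsm_patches, _ = sliding_window_crop(dsm_image)
```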
(2) The cropped IRRG image and the cropped DSM image are input to the multi-modal fusion module, which includes an optical branch, a depth branch and a coding fusion branch. Image features are extracted in the optical branch and the depth branch, and each of them provides a set of feature maps at each module stage. On this basis, a third branch, the coding fusion branch, is introduced to process the fused data. Referring to fig. 2a, the coding fusion branch takes the fusion of the optical branch and the depth branch as input before down-sampling; residual learning is performed by a convolution block, and the residual uses the sum of the feature maps of the other two encoders until stage MFM-3. MFM-n (n ∈ [1,3]) is structured as shown in fig. 2b: three pre-trained ResNet50 networks extract features from the three branches, and the features are then fused in the same pattern as in the MFM-0 stage. Before the feature maps are added, the channel-attention-based module is applied, and the last down-sampling is discarded in the encoding stage. The specific implementation is as follows (an illustrative sketch of one fusion stage is given after step (d)):
(a) The features I0 of the IRRG image and the features D0 of the DSM image are input to two ResNet50 networks pre-trained on ImageNet, and the fusion feature M0 of I0 and D0 is input to a third ResNet50 pre-trained on ImageNet. The initial-stage model MFM-0 is given by the corresponding formula (shown as an image in the original publication), where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module.
(b) The feature maps output by the three branches in the first stage are used as the input of the second stage, and the fused output of stage MFM-1 follows the corresponding formula.
(c) The feature maps output by the three branches in the second stage are used as the input of the third stage, and the fused output of stage MFM-2 follows the corresponding formula.
(d) The feature maps output by the three branches in the third stage are used as the input of the fourth stage, and the fused output of stage MFM-3 follows the corresponding formula.
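A purely illustrative PyTorch sketch of how one such fusion stage could be wired is given below. It assumes the three branch encoders are the corresponding stages of three ResNet50 networks and that the attention modules follow the channel-attention description in the next subsection; all names are hypothetical, and the exact formulas of the invention are given only as images in the original.

```python
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    """One MFM-style stage: an optical branch, a depth branch and a coding fusion branch,
    combined by pixel-level addition after channel attention (one reading of figs. 2a-2b)."""
    def __init__(self, opt_stage, dep_stage, fus_stage, mca_opt, mca_dep):
        super().__init__()
        self.opt_stage, self.dep_stage, self.fus_stage = opt_stage, dep_stage, fus_stage
        self.mca_opt, self.mca_dep = mca_opt, mca_dep  # channel-attention modules (see next subsection)

    def forward(self, irrg_feat, dsm_feat, fused_feat):
        i_out = self.opt_stage(irrg_feat)              # optical (IRRG) branch
        d_out = self.dep_stage(dsm_feat)               # depth (DSM) branch
        m_out = self.fus_stage(fused_feat)             # coding fusion branch
        # the fusion-branch output is summed, pixel-wise, with the attention-recalibrated branch outputs
        f_out = m_out + self.mca_opt(i_out) + self.mca_dep(d_out)
        return i_out, d_out, f_out

# Hypothetical wiring: opt_stage / dep_stage / fus_stage could be the matching layer groups
# of three torchvision ResNet50 models pre-trained on ImageNet.
```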
(3) MCA(·) denotes the channel-attention-based module, which is used to extract, reorganize and fuse features and to assign weight resources to the more meaningful feature maps. As shown in fig. 3, the specific implementation is as follows (an illustrative code sketch follows step (c)):
(a) The input feature map A = [a1, a2, ..., aC] is viewed as a combination of channels ak ∈ R^(H×W). First, Global Average Pooling (GAP) is applied to obtain a vector G ∈ R^(1×1×C), whose k-th element is modeled as
Gk = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} ak(i, j).
This operation integrates global information into the vector G.
(b) Next, the vector G is converted into G' = σ(O2(δ(O1(G)))), where O1 ∈ R^(1×1×C/2) and O2 ∈ R^(1×1×C) denote two fully connected convolutional layers; an activation function δ is added after O1, which creates channel dependencies in the feature extraction. The Sigmoid function σ(·) then activates G' and constrains it to [0, 1].
(c) Finally, A and G' are combined by an outer product to obtain the recalibrated feature map A', modeled as A'k = G'k · ak. The ReLU remaps the original channels to new channels and adds non-linearity in an adaptive fashion, so the network fits better. During network learning, this module suppresses redundant features and recalibrates the weights so that optimization concentrates on the more meaningful feature maps.
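A minimal PyTorch sketch of such a channel-attention block is shown below (global average pooling, two fully connected 1 × 1 convolutions with a ReLU in between, Sigmoid gating, then channel-wise rescaling). The C → C/2 → C layer sizes follow the description above; the exact layer configuration of the invention may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: GAP -> FC(C/2) -> ReLU -> FC(C) -> Sigmoid -> scale."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                      # G in R^(1x1xC)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),  # O1
            nn.ReLU(inplace=True),                              # activation after O1
            nn.Conv2d(channels // 2, channels, kernel_size=1),  # O2
            nn.Sigmoid(),                                       # constrain weights to [0, 1]
        )

    def forward(self, x):
        weights = self.fc(self.gap(x))   # per-channel weights G'
        return x * weights               # recalibrated feature map A'
```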
(4) Referring to fig. 4, the multi-scale spatial context enhancement module is composed of an ASPP (Atrous Spatial Pyramid Pooling) module and a non-local module; F represents the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F3)). The specific implementation is as follows:
(a) The fusion feature F3 of the last stage of the multi-modal fusion module is input to the multi-scale spatial context enhancement module to extract multi-scale information. Since the size of F3 is 16 × 16, and dilated convolutions with different dilation rates operate on the same feature map before their outputs are fused, the fusion should cover the whole feature map; therefore, 3 × 3 convolutions with dilation rates of 3, 6 and 9 are combined with a standard 1 × 1 convolution for multi-scale information extraction, and an image average pooling branch is added to integrate global context information. This yields better efficiency and performance without increasing the number of parameters.
(b) After multi-scale information fusion with the ASPP module, the number of channels is reduced to 256 by a 1 × 1 convolution, and the result then enters the non-local module.
(c) The non-local model is
Fi = (1/C(X)) Σ_{j=1..N} f(xi, xj) g(xj).
Let the feature map X = [x1, x2, ..., xN] be the input, where xi ∈ R^(1×1×C) and xj ∈ R^(1×1×C) are the feature vectors at positions i and j, respectively. N = H × W is the number of pixels and H × W is the spatial dimension. F has the same number of channels as X, C(X) is the normalization operation, and g(xj) = Wv·xj is implemented as a 1 × 1 convolution in the network. Second, f(xi, xj)/C(X) is the normalized correlation of the vectors xi and xj, used to compute spatial similarity, and is modeled as
f(xi, xj) = exp(m(xi)ᵀ n(xj)),
where m(xi) and n(xj) are linear transformation matrices, m(xi) = Wq·xi and n(xj) = Wk·xj, all implemented as 1 × 1 convolutions in the network. This module establishes a relationship between any two spatial positions and improves the semantic feature expression.
(d) F is up-sampled once by bilinear interpolation.
The application of a global-context long-range dependence strategy is important for semantic segmentation of multi-class high-resolution remote sensing images; to better utilize the spatial information of the multi-scale feature map, the non-local module is introduced after the multi-scale information is integrated. The DSM data also provide auxiliary physical properties for specific classes in the remote sensing image, and the spatial relationships can enhance the local properties of the feature map by aggregating dependencies on other pixel positions. For targets with similar semantic features, this relation-context strategy strengthens the relevance of intra-class features, and the module combines global and local information to make the semantic segmentation results of high-resolution remote sensing images more accurate.
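For illustration, a compact PyTorch sketch of such an ASPP-plus-non-local head is given below. The dilation rates 3/6/9, the reduction to 256 channels and the query/key/value projections follow the description above, but the exact layer widths, normalization and residual wiring of the invention may differ, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCEMHead(nn.Module):
    """ASPP (rates 3, 6, 9 + 1x1 + image pooling) followed by a non-local attention block."""
    def __init__(self, in_ch, mid_ch=256):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 1)] +
            [nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in (3, 6, 9)]
        )
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, mid_ch, 1))
        self.project = nn.Conv2d(5 * mid_ch, mid_ch, 1)   # fuse branches, reduce to 256 channels
        self.wq = nn.Conv2d(mid_ch, mid_ch // 2, 1)        # m(x) = Wq x
        self.wk = nn.Conv2d(mid_ch, mid_ch // 2, 1)        # n(x) = Wk x
        self.wv = nn.Conv2d(mid_ch, mid_ch, 1)             # g(x) = Wv x

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        y = self.project(torch.cat(feats, dim=1))          # multi-scale fusion
        q = self.wq(y).flatten(2).transpose(1, 2)          # B x N x C'
        k = self.wk(y).flatten(2)                          # B x C' x N
        v = self.wv(y).flatten(2).transpose(1, 2)          # B x N x C
        attn = torch.softmax(q @ k, dim=-1)                # normalized spatial similarity f / C(X)
        out = (attn @ v).transpose(1, 2).reshape(y.shape)  # aggregate g(xj) over all positions
        return out
```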
(5) FIG. 5 shows the residual skip connection strategy: the features of the (l-1)-th layer are transferred to the (l+1)-th layer via the skip connection, while they are also passed to the l-th layer by down-sampling and further transmitted to the (l+1)-th layer by up-sampling; this process is repeated. Losing low-resolution information blurs the segmentation boundaries. A conventional skip connection transmits the high-resolution feature map directly to the decoder without any convolutional learning, and the effective information of the encoder side is continuously lost during down-sampling, so the finally learned network model cannot map the high-resolution information effectively. The residual skip connection model is
f_{l+1} = DSC(Activation(Tconv(f_l)) ⊕ f_{l-1}).
The specific implementation mode is as follows:
(a) The features of the l-th layer are restored, through a learned transposed convolution, to the same size as the features of the (l-1)-th layer.
(b) The features of the (l-1)-th layer that have not been down-sampled are extracted separately and added to them.
(c) The features are learned again using a depthwise separable convolution and transmitted to the (l+1)-th layer after its up-sampling.
The features F0, F1 and F2 of the first three stages obtained by the multi-modal fusion module are fed back to the decoder in their entirety using this strategy, which allows higher-level features to be exploited. Here f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ is pixel-level addition, f_{l-1} is the feature of the (l-1)-th layer before down-sampling, and f_{l+1} is the result of the residual skip connection processing.
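A minimal PyTorch reading of this residual skip connection is sketched below, with the depthwise separable convolution written as a depthwise 3 × 3 followed by a pointwise 1 × 1; layer settings and names are illustrative assumptions rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class ResidualSkipConnection(nn.Module):
    """f_{l+1} = DSC(ReLU(Tconv(f_l)) + f_{l-1}), one possible reading of the strategy."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # restore spatial size
        self.act = nn.ReLU(inplace=True)
        self.dsc = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),  # depthwise
            nn.Conv2d(out_ch, out_ch, 1),                            # pointwise
        )

    def forward(self, f_l, f_l_minus_1):
        up = self.act(self.tconv(f_l))       # bring f_l back to the (l-1)-th layer resolution
        return self.dsc(up + f_l_minus_1)    # pixel-level addition, then depthwise separable conv
```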
(6) The features optimized by the residual skip connection strategy are gradually fused with the decoder features and continuously up-sampled by bilinear interpolation until the segmentation map is output.
(7) The segmentation maps are stitched and output according to the size of the original image.
In order to verify the effectiveness of the semantic segmentation of high-resolution remote sensing images, urban ground feature classification experiments are carried out on two public data sets, the Vaihingen data set and the Potsdam data set, and the performance of the model is verified using evaluation indexes.
The Potsdam dataset includes 38 images, each having three bands corresponding to near infrared (IR), red (R) and green (G), respectively. The data set also provides a digital surface model and a normalized digital surface model corresponding to each image slice. The image slices have a spatial resolution of 5 cm and are all 6000 × 6000 pixels in size. Six categories (road, building, low vegetation, trees, cars and clutter) are labeled pixel by pixel on 24 labeled images. The present invention uses both the IRRG and DSM data types. Image numbers 5_12, 6_7 and 7_9 are selected for validation, image numbers 5_10 and 6_8 for testing, and the remaining images for training.
The Vaihingen data set includes 33 images with a spatial resolution of 9 centimeters. The bands of each image are the same as in the Potsdam dataset, with an average size of 2494 × 2064 pixels. Only 16 images have ground truth labels, which contain the same six categories as the Potsdam dataset. Both IRRG and DSM data types are also used. Five images (numbers 11, 15, 28, 30 and 34) are used as a test set to evaluate the network model of the invention, three images (numbers 7, 23 and 37) as a validation set, and the remaining images for training. Fig. 6 shows sample images, digital surface models and corresponding labels from these two datasets.
Due to GPU memory limitations, the size of the images in the data sets needs to be changed to fit the network model of the invention. Each image is cropped to 256 × 256 pixels with an overlap of 128 pixels, and the prediction results are finally stitched. Data augmentation is used to reduce the risk of overfitting, including random flipping (vertical and horizontal) and random rotation (0°, 90°, 180°, 270°) of all training images. The augmented data effectively prevent the model from overfitting and improve its robustness. The model is built with the deep learning framework PyTorch, with ResNet50 pre-trained on ImageNet as the backbone network. The operating system is Windows 10, the processor is an Intel(R) Xeon(R) CPU E5-1620 v4, and the proposed MAFNet is trained on two NVIDIA GeForce GTX 1080 graphics processors, each with 8 GB of memory. The network is optimized with a cross-entropy loss and a stochastic gradient descent optimizer with momentum 0.9 and weight decay 0.004. The initial learning rate is 1e-3 and is multiplied by 0.98 at the end of each epoch. The total batch size is set to 16, and 250 epochs are used to train the network.
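The optimizer and schedule described above could be set up roughly as follows in PyTorch; this is a sketch under the stated hyperparameters only, and `MAFNet` and `train_loader` are placeholders for the network and data pipeline, which are not reproduced here.

```python
import torch
import torch.nn as nn

model = MAFNet()                                        # placeholder for the proposed network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=0.004)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)  # lr x 0.98 per epoch

for epoch in range(250):
    for irrg, dsm, label in train_loader:               # batches of cropped 256 x 256 patches
        optimizer.zero_grad()
        logits = model(irrg, dsm)
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```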
In order to further show the superiority of the invention, it is compared with mainstream deep-learning-based semantic segmentation algorithms for remote sensing images, and the results are visualized: DeepLab v3+, APPD, MANet, DSMFNet and REMSNet.
In order to compare the performance of the different algorithms, the Overall Accuracy (OA) and the F1 Score are selected as evaluation indexes; the larger the values of OA and F1 Score, the better the segmentation result.
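As a reference, the two indexes can be computed from a confusion matrix as sketched below; this is a generic formulation rather than code from the invention, and the per-class F1 values are averaged to obtain the mean F1 reported in the tables.

```python
import numpy as np

def overall_accuracy(conf):
    """OA = correctly classified pixels / all pixels, from a C x C confusion matrix."""
    return np.trace(conf) / conf.sum()

def f1_scores(conf):
    """Per-class F1 = 2 * precision * recall / (precision + recall)."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# mean_f1 = f1_scores(conf).mean()
```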
Comparative experiments for different methods were performed on the Potsdam dataset and the results are shown in table 1:
TABLE 1 results of comparative experiments on Potsdam data set for different methods
Method Imp.Surf. Building Low veg. Tree Car Mean F1 OA
DeepLab v3+ 89.88 93.78 83.23 81.66 93.50 88.41 87.72
APPD 90.80 94.56 84.37 85.14 94.42 89.86 88.42
MANet 91.33 95.91 85.88 87.01 91.46 90.32 89.19
DSMFNet 93.03 95.75 86.33 86.46 94.88 91.29 90.36
REMSNet 93.48 96.17 87.52 87.97 95.03 92.03 90.79
MAFNet 93.61 96.26 87.87 88.65 95.32 92.34 91.04
In the experiments on the Potsdam dataset, the F1 Score of each class, the mean F1 Score and the Overall Accuracy (OA) were calculated. As shown in Table 1, the mean F1 Score and the overall accuracy of the method reach 92.34% and 91.04%, respectively, which is superior to the other algorithms in all evaluation indexes. The Potsdam dataset scenes are relatively complex, and trees and low vegetation are difficult to classify; compared with DeepLab v3+, the classification of trees is improved by 7.0%, and the other categories are correspondingly improved, which shows that MAFNet can capture targets of different scales by using global context spatial information.
FIG. 7 visualizes the segmentation results of the invention and other methods on Potsdam dataset slices. From the unlabeled original images it can be seen that trees and low vegetation are very similar, and in some areas even the human eye can no longer classify them correctly. However, the comparison of the dashed boxes shows that the invention obtains better segmentation results for trees and low vegetation, which also verifies its superior performance. Secondly, the segmentation results of the proposed MAFNet are also more refined for small targets such as vehicles. The residual skip connection strategy provided by the invention solves the problem that small targets are easily misclassified because of the information lost during down-sampling, and for large targets it also enhances the semantic features of intra-class attributes and reduces misclassification.
Fig. 8 is a visualization of the overall classification result of the 5_10 region in the Potsdam dataset, which can clearly distinguish the regions and distribution rules of all categories, and has practical significance for city planning.
Comparative experiments for different methods were performed on the Vaihingen dataset, see table 2 for comparative results:
TABLE 2 results of comparative experiments on the Vaihingen data set for different methods
Method Imp.Surf. Building Low veg. Tree Car Mean F1 OA
DeepLab v3+ 87.67 93.95 79.17 86.26 80.34 85.48 87.22
APPD 88.78 93.38 80.43 86.76 80.88 86.05 87.71
MANet 90.12 94.08 81.01 87.21 81.16 86.72 88.17
DSMFNet 91.47 95.08 82.11 88.61 81.01 87.66 89.80
REMSNet 92.01 95.67 82.35 89.73 81.26 88.20 90.08
MAFNet 92.06 96.12 82.71 90.01 82.13 88.61 90.27
For Table 2, the F1 Score of each category, the mean F1 Score and the Overall Accuracy (OA) were calculated as evaluation results. As shown in Table 2, the mean F1 Score and the overall accuracy of the proposed MAFNet are 88.61% and 90.27%, respectively, which is superior to the other algorithms. Especially for the car category, the residual skip connection strategy provided by the invention effectively retains the information of small objects. DSM data are added to the network model input, and the classes assisted by physical spatial height information are also improved, which alleviates the difficulty of classifying targets in high-resolution remote sensing images and verifies that fusing data of different modalities helps the classification of ground features in remote sensing images. The results show that the method is strong in complex high-resolution remote sensing scenes; the multi-scale spatial context enhancement module handles the large scale differences of segmented targets and effectively extracts the features of targets of different scales, and correct segmentation can still be achieved even when a target occupies only a small proportion of a region and is highly similar to targets of other classes.
Figure 9 visualizes the ground feature classification results of different algorithms on the Vaihingen test set. Trees and low vegetation have high similarity and are therefore difficult to classify. The dashed bounding boxes show that the invention not only distinguishes regions with high similarity better but also retains all the information of small objects. The proposed MAFNet can also reduce interference from factors such as lighting and shadows to some extent; for example, in the fourth row, trees under shadow are also correctly classified.
FIG. 10 shows the complete regions after slice stitching; the second row compares the results of other algorithms with those of the invention. The experimental results show that the invention performs well in the scene analysis of complex high-resolution remote sensing images.
The proposed modules are decomposed and combined, and the effectiveness of the different modules is further verified with the F1 Score and the overall accuracy. The ablation experiments use the Vaihingen dataset. First, the baseline model uses two ResNet50 networks to extract features from the different modal data separately; the features are fused after the last residual block and the segmentation map is output through continuous up-sampling, with no interaction between the different data during feature extraction. Second, to verify the MFM, the multi-modal fusion module is added: the two ResNet50 networks use the attention mechanism to allocate feature resources reasonably during feature extraction and continuously fuse information, a third ResNet50 is introduced to process the fusion branch, and the feature maps fused by the three branches are finally up-sampled continuously to obtain the final segmentation map. In the third model, the ResNet50-fused feature maps are input into the multi-scale spatial context enhancement module in the encoding stage to obtain new feature maps, which are up-sampled continuously in the decoding stage until the final output. The fourth model combines ResNet50 with the residual skip connection strategy to verify the effectiveness of fusion at the decoder side while retaining the encoder-side information: using the residual learning strategy, the information from the first three down-sampling steps of feature extraction is matched in turn to the up-sampling stages, and the final prediction is output. Finally, all modules are integrated together; all results of the ablation experiments are shown in table 3:
TABLE 3 results of ablation experiments performed on the Vaihingen dataset
Models Imp.Surf. Building Low veg. Tree Car Mean F1 OA
Res50 86.94 89.67 75.83 84.42 77.40 82.85 84.98
Res50+MFM 88.15 93.84 76.49 86.48 78.02 84.60 86.66
Res50+MSCEM 88.79 93.09 79.79 85.55 80.38 85.52 87.35
Res50+RSC 90.11 92.97 80.24 86.04 81.14 86.10 87.82
MAFNet 92.06 96.12 82.71 90.01 82.13 88.61 90.27
The results in Table 3 show that the mean F1 Score of "Res50+MFM" is improved by 1.8% and the overall accuracy by 1.7% compared with ResNet50; the attention mechanism introduced before fusing data of different modalities solves the problem of weight distribution between feature maps, verifies the validity of the multi-modal data information, and shows that efficiently fusing the features improves segmentation accuracy. The mean F1 Score and the overall accuracy of "Res50+MSCEM" are improved by 2.7% and 2.4% compared with ResNet50; the multi-scale spatial context enhancement module improves the performance of the backbone network, effectively captures all the information in the image, strengthens the relevance between different categories, and addresses the difficulty of extracting multi-scale targets in remote sensing images. Compared with ResNet50, the mean F1 Score of "Res50+RSC" is improved by 3.3% and the overall accuracy by 2.8%; compared with an ordinary skip connection, the new residual skip connection strategy not only enhances the features output by the encoder side but also provides better feature fusion for the decoder side. In addition, with all modules integrated, the mean F1 Score and the overall accuracy of the proposed MAFNet are improved by 5.8% and 5.3%, respectively, compared with the initial network model, showing that the semantic segmentation of high-resolution remote sensing images can be remarkably improved.
In conclusion, the remote sensing image semantic segmentation method based on multi-scale attention fusion solves the problem that targets in remote sensing images are difficult to classify by fusing multi-modal data; an attention mechanism is introduced to redistribute resources in the feature extraction stage, so that redundant features are avoided; a multi-scale spatial context module is adopted to handle the large difference of target scales in remote sensing images; and a residual skip connection strategy is used to retain and optimize the encoder-side information, solving the problem of image features being lost during down-sampling. The invention not only realizes semantic segmentation of high-resolution remote sensing images but also achieves high classification accuracy, providing objective and accurate data for the understanding and analysis of high-resolution remote sensing images.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A remote sensing image semantic segmentation method based on multi-scale attention fusion is characterized by comprising the following steps:
1) cropping a data set;
2) inputting the cropped IRRG image and the cropped DSM image into a multi-modal fusion module to obtain the multi-modal fusion features F0, F1, F2 and F3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse the features and to assign weight resources;
3) using a multi-scale spatial context enhancement module to integrate and improve the multi-modal fusion feature F3, then performing one up-sampling;
4) using a residual skip connection strategy to optimize the encoder-side multi-modal fusion features F0, F1 and F2, fusing them with the decoder-side features of the corresponding scale, and continuing to up-sample until a segmentation map is output;
5) splicing the segmentation maps according to the size of the original image to complete the semantic segmentation of the remote sensing image.
2. The method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 1, wherein the IRRG image, the DSM image and the corresponding label map are cut by using a sliding window in the step 1), and the size of the cut image is 256 x 256.
3. The method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 1, wherein the multi-modal fusion module comprises an optical branch, a depth branch and a coding fusion branch, each of the optical branch and the depth branch provides a set of feature maps at each module stage, and the coding fusion branch takes the fusion of the optical branch and the depth branch as input before down-sampling and processes the fused data.
4. The method for semantic segmentation of remote sensing images based on multi-scale attention fusion as claimed in claim 3, characterized in that the implementation manner of the multi-modal fusion module comprises:
1) the features I0 of the IRRG image and the features D0 of the DSM image are input to two ResNet50 networks pre-trained on ImageNet, and the fusion feature M0 of I0 and D0 is input to a third ResNet50 pre-trained on ImageNet; the initial-stage model MFM-0 is given by the corresponding formula (shown as an image in the original publication), where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module;
2) the feature maps output by the three branches in the first stage are used as the input of the second stage, and the fused output of stage MFM-1 follows the corresponding formula;
3) the feature maps output by the three branches in the second stage are used as the input of the third stage, and the fused output of stage MFM-2 follows the corresponding formula;
4) the feature maps output by the three branches in the third stage are used as the input of the fourth stage, and the fused output of stage MFM-3 follows the corresponding formula.
5. the method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 4, wherein the module implementation manner based on the channel attention comprises:
1) the input feature map A = [a1, a2, ..., aC] is viewed as a combination of channels ak ∈ R^(H×W); global average pooling yields a vector G ∈ R^(1×1×C), whose k-th element is modeled as
Gk = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} ak(i, j),
which integrates global information into the vector G;
2) the vector G is converted into G' = σ(O2(δ(O1(G)))), where O1 ∈ R^(1×1×C/2) and O2 ∈ R^(1×1×C) denote two fully connected convolutional layers, an activation function δ is added after O1, and the Sigmoid function σ(·) activates G' and constrains it to [0, 1];
3) A and G' are combined by an outer product to obtain the recalibrated feature map A', modeled as A'k = G'k · ak.
6. The method for semantically segmenting the remote sensing image based on multi-scale attention fusion as claimed in claim 1, wherein the multi-scale spatial context enhancement module comprises an ASPP module and a non-local module, F represents the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F3)).
7. The method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 6, wherein the implementation manner of the multi-scale spatial context enhancement module comprises:
1) the multi-modal fusion feature F3 of the last stage of the multi-modal fusion module is input to the multi-scale spatial context enhancement module to extract multi-scale information; 3 × 3 convolutions with dilation rates of 3, 6 and 9 are combined with a standard 1 × 1 convolution for multi-scale information extraction, and an image average pooling branch is added to integrate global context information;
2) after multi-scale information fusion with the ASPP module, the number of channels is reduced to 256 by a 1 × 1 convolution, and the result then enters the non-local module;
3) the non-local model is
Fi = (1/C(X)) Σ_{j=1..N} f(xi, xj) g(xj),
where the feature map X = [x1, x2, ..., xN] is the input, xi ∈ R^(1×1×C) and xj ∈ R^(1×1×C) are the feature vectors at positions i and j respectively, N = H × W is the number of pixels, H × W is the spatial dimension, F has the same number of channels as X, C(X) is the normalization operation, and g(xj) = Wv·xj is implemented as a 1 × 1 convolution in the network; f(xi, xj)/C(X) is the normalized correlation of the vectors xi and xj, used to compute spatial similarity, and is modeled as
f(xi, xj) = exp(m(xi)ᵀ n(xj)),
where m(xi) and n(xj) are linear transformation matrices, m(xi) = Wq·xi and n(xj) = Wk·xj, all implemented as 1 × 1 convolutions in the network;
4) F is up-sampled once by bilinear interpolation.
8. The method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 1, wherein the residual jump connection model is
f_{l+1} = DSC(Activation(Tconv(f_l)) ⊕ f_{l-1})
wherein f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ denotes pixel-level addition, f_{l-1} is the un-downsampled feature of the (l-1)-th layer, and f_{l+1} is the result of the residual jump connection.
9. The method for semantically segmenting the remote sensing image based on the multi-scale attention fusion as claimed in claim 8, wherein the implementation manner of residual jump connection comprises:
1) the features of the l-th layer are restored, via transposed convolution learning, to the same size as the features of the (l-1)-th layer;
2) the un-downsampled features of the (l-1)-th layer are extracted separately and added;
3) the features are learned again using a depthwise separable convolution and transmitted to the (l+1)-th layer after upsampling.
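One possible reading of the residual jump connection in claims 8 and 9 is sketched below in PyTorch: a transposed convolution restores the l-th layer features to the spatial size of the (l-1)-th layer, the un-downsampled (l-1)-th layer features are added pixel-wise, and a depthwise separable convolution learns the fused features. The kernel sizes, the stride-2 upsampling, and the ReLU placement after the transposed convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 followed by pointwise 1x1 convolution (DSC in claim 8)."""
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class ResidualSkipConnection(nn.Module):
    """Illustrative residual jump connection: Tconv -> ReLU -> add f_{l-1} -> DSC."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # restore spatial size
        self.act = nn.ReLU(inplace=True)
        self.dsc = DepthwiseSeparableConv(out_ch)

    def forward(self, f_l: torch.Tensor, f_lm1: torch.Tensor) -> torch.Tensor:
        up = self.act(self.tconv(f_l))   # features restored to the (l-1)-th layer size
        return self.dsc(up + f_lm1)      # pixel-level addition, then depthwise separable conv
```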
10. The remote sensing image semantic segmentation method based on multi-scale attention fusion of claim 9, characterized in that the features optimized by the residual jump connection strategy are gradually fused with the decoder features and continuously upsampled by bilinear interpolation until the segmentation map is output.
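Claim 10 only states that the residual-skip-optimized features are gradually fused with the decoder features while the decoder upsamples bilinearly until the segmentation map is produced. A minimal decoder-stage sketch, assuming concatenation as the fusion operation and a 3×3 convolution for refinement (neither is specified in the claim), might look as follows; stacking such stages and ending with a 1×1 convolution onto the class channels would yield the segmentation map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One illustrative decoder stage: bilinear upsample, fuse with skip features, refine."""
    def __init__(self, dec_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Conv2d(dec_ch + skip_ch, out_ch, 3, padding=1)  # fusion conv (assumed)

    def forward(self, dec: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # upsample decoder features to the skip-feature resolution, then fuse by concatenation
        dec = F.interpolate(dec, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return torch.relu(self.fuse(torch.cat([dec, skip], dim=1)))
```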
CN202110528206.1A 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion Active CN113283435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110528206.1A CN113283435B (en) 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion


Publications (2)

Publication Number Publication Date
CN113283435A true CN113283435A (en) 2021-08-20
CN113283435B CN113283435B (en) 2023-08-22

Family

ID=77279332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110528206.1A Active CN113283435B (en) 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion

Country Status (1)

Country Link
CN (1) CN113283435B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200357143A1 (en) * 2019-05-09 2020-11-12 Sri International Semantically-aware image-based visual localization
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN112132006A (en) * 2020-09-21 2020-12-25 西南交通大学 Intelligent forest land and building extraction method for cultivated land protection

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850824B (en) * 2021-09-27 2024-03-29 太原理工大学 Remote sensing image road network extraction method based on multi-scale feature fusion
CN113850824A (en) * 2021-09-27 2021-12-28 太原理工大学 Remote sensing image road network extraction method based on multi-scale feature fusion
CN113887470A (en) * 2021-10-15 2022-01-04 浙江大学 High-resolution remote sensing image ground object extraction method based on multitask attention mechanism
CN114387439B (en) * 2022-01-13 2023-09-12 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on optical and PolSAR feature fusion
CN114387439A (en) * 2022-01-13 2022-04-22 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on fusion of optical and PolSAR (polar synthetic Aperture Radar) features
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN115546649A (en) * 2022-10-24 2022-12-30 中国矿业大学(北京) Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method
CN115546649B (en) * 2022-10-24 2023-04-18 中国矿业大学(北京) Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method
CN115424023A (en) * 2022-11-07 2022-12-02 北京精诊医疗科技有限公司 Self-attention mechanism module for enhancing small target segmentation performance
CN116452936A (en) * 2023-04-22 2023-07-18 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information
CN116452936B (en) * 2023-04-22 2023-09-29 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information
CN116188497A (en) * 2023-04-27 2023-05-30 成都国星宇航科技股份有限公司 Method, device, equipment and storage medium for optimizing generation of DSM (digital image model) of stereo remote sensing image pair
CN116188497B (en) * 2023-04-27 2023-07-07 成都国星宇航科技股份有限公司 Method, device, equipment and storage medium for optimizing generation of DSM (digital image model) of stereo remote sensing image pair
CN116307267B (en) * 2023-05-15 2023-07-25 成都信息工程大学 Rainfall prediction method based on convolution
CN116307267A (en) * 2023-05-15 2023-06-23 成都信息工程大学 Rainfall prediction method based on convolution
CN116363134B (en) * 2023-06-01 2023-09-05 深圳海清智元科技股份有限公司 Method and device for identifying and dividing coal and gangue and electronic equipment
CN116363134A (en) * 2023-06-01 2023-06-30 深圳海清智元科技股份有限公司 Method and device for identifying and dividing coal and gangue and electronic equipment
CN116740362B (en) * 2023-08-14 2023-11-21 南京信息工程大学 Attention-based lightweight asymmetric scene semantic segmentation method and system
CN116740362A (en) * 2023-08-14 2023-09-12 南京信息工程大学 Attention-based lightweight asymmetric scene semantic segmentation method and system
CN117274608A (en) * 2023-11-23 2023-12-22 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117274608B (en) * 2023-11-23 2024-02-06 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117635953A (en) * 2024-01-26 2024-03-01 泉州装备制造研究所 Multi-mode unmanned aerial vehicle aerial photography-based real-time semantic segmentation method for power system
CN117635953B (en) * 2024-01-26 2024-04-26 泉州装备制造研究所 Multi-mode unmanned aerial vehicle aerial photography-based real-time semantic segmentation method for power system

Also Published As

Publication number Publication date
CN113283435B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN113283435B (en) Remote sensing image semantic segmentation method based on multi-scale attention fusion
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN110458844B (en) Semantic segmentation method for low-illumination scene
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN114187520B (en) Building extraction model construction and application method
CN114266794B (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN114638994B (en) Multi-modal image classification system and method based on attention multi-interaction network
CN115565071A (en) Hyperspectral image transform network training and classifying method
CN111696136A (en) Target tracking method based on coding and decoding structure
Jiang et al. Forest-CD: Forest change detection network based on VHR images
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
Gao A method for face image inpainting based on generative adversarial networks
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
CN117115641B (en) Building information extraction method and device, electronic equipment and storage medium
CN116894820B (en) Pigment skin disease classification detection method, device, equipment and storage medium
CN116543165B (en) Remote sensing image fruit tree segmentation method based on dual-channel composite depth network
Ma et al. MSFNET: multi-stage fusion network for semantic segmentation of fine-resolution remote sensing data
CN116977747A (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN111274936A (en) Multispectral image ground object classification method, system, medium and terminal
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN110991617B (en) Construction method of kaleidoscope convolution network
CN112329647A (en) Land use type identification method based on U-Net neural network
CN114998363B (en) High-resolution remote sensing image progressive segmentation method
CN116778294B (en) Remote sensing change detection method for contexts in combined image and between images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant