CN113283435B - Remote sensing image semantic segmentation method based on multi-scale attention fusion - Google Patents

Remote sensing image semantic segmentation method based on multi-scale attention fusion Download PDF

Info

Publication number
CN113283435B
Authority
CN
China
Prior art keywords
fusion
remote sensing
scale
image
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110528206.1A
Other languages
Chinese (zh)
Other versions
CN113283435A (en)
Inventor
雷涛
李林泽
加小红
薛丁华
张月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi University of Science and Technology
Original Assignee
Shaanxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi University of Science and Technology filed Critical Shaanxi University of Science and Technology
Priority to CN202110528206.1A priority Critical patent/CN113283435B/en
Publication of CN113283435A publication Critical patent/CN113283435A/en
Application granted granted Critical
Publication of CN113283435B publication Critical patent/CN113283435B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a remote sensing image semantic segmentation method based on multi-scale attention fusion, which mainly relates to image segmentation technology and aims to solve the problem of semantic segmentation of high-resolution remote sensing images. The method addresses the difficulty of classifying targets in remote sensing images by fusing multi-modal data; it introduces an attention mechanism to reallocate resources in the feature extraction stage, thereby avoiding redundant features; it adopts a multi-scale spatial context module to handle the large scale differences among targets in remote sensing images; and it uses a residual skip connection strategy to retain and optimize the information of the encoding end, solving the problem of image feature loss during downsampling. The application not only realizes semantic segmentation of high-resolution remote sensing images, but also achieves higher classification accuracy, providing objective and accurate data for understanding and analyzing high-resolution remote sensing images.

Description

Remote sensing image semantic segmentation method based on multi-scale attention fusion
Technical Field
The application belongs to the technical field of image processing, and particularly relates to a remote sensing image semantic segmentation method based on multi-scale attention fusion.
Background
Because high-resolution remote sensing images provide rich ground geometric information and ground target details, they have been widely applied to ground target classification and identification in complex scenes. Semantic segmentation of high-resolution remote sensing images requires assigning a specific semantic label to each pixel, such as buildings, road surfaces, cars, trees, and low vegetation. Compared with single-object recognition, image semantic segmentation can identify multiple objects in an image simultaneously. Therefore, it has been widely applied in fields such as military target detection, city planning, building identification, and road extraction. However, existing high-resolution remote sensing image semantic segmentation techniques face the following challenges: first, the imaging technology of high-resolution remote sensing images has unique complexity, so manually distinguishing ground targets is difficult and inefficient; second, interference from shadows, clouds, and illumination results in large classification errors. Compared with natural images, the semantic segmentation task for remote sensing images is more complex. On the one hand, high-resolution remote sensing images typically contain many more complex scenes. On the other hand, remote sensing images are characterized by high intra-class variance and low inter-class variance; different ground features such as trees and low vegetation may show similar characteristics in the spectral image, which challenges the semantic segmentation task of high-resolution remote sensing images. However, the digital surface model (Digital Surface Model, DSM), which contains rich geographic information, provides supplementary information for the classification of ground features, and experiments indicate that fully utilizing DSM data can significantly improve segmentation accuracy. Therefore, research on a semantic segmentation network with strong robustness and high accuracy has important significance for understanding complex scenes in high-resolution remote sensing images.
A number of conventional machine learning methods have been used for remote sensing image analysis. Although these methods can achieve target detection and recognition in remote sensing images, their accuracy is low because reliable feature extraction is difficult. In recent years, with the rapid development of deep learning technology, convolutional neural networks have achieved great success in the field of image semantic segmentation. It is well known that convolutional neural networks can provide hierarchical feature representations and learn deep semantic features by stacking convolutional layers, which is important for improving model performance. In addition, convolutional neural networks can effectively suppress noise interference, thereby enhancing robustness.
Different from traditional semantic segmentation of high-resolution remote sensing images, adding multi-modal data further improves classification accuracy. Aiming at how to reasonably utilize data of two modalities, the related art discloses an end-to-end DSM fusion network (DSM Fusion Network, DSMFNet). The network designs four interaction modes to fuse and process multi-modal data; the model with the highest accuracy inherits the strong performance of DeepLab v3+ when extracting RGB image features, and designs a lightweight depthwise separable convolution module to independently extract DSM image features and fuse information of different modalities before upsampling. However, because the DSM image contains less information, a simple network model cannot extract deeper features, and simply superimposing the red, green, blue (RGB) spectral images and the DSM image does not make full use of the relationship between multi-modal information but instead introduces redundant features.
In the semantic segmentation task of high-resolution remote sensing images, the scale difference among targets to be segmented is large. Aiming at the multi-scale problem of targets, the related art also discloses a multi-scale adaptive feature fusion network (Multi-scale Adaptive Feature Fusion Network, MANet), which uses ResNet101 as the backbone network to extract image features and transmits high-level semantic features to a context extraction module, thus addressing the problem that targets in remote sensing images differ greatly in size and are difficult to segment. An adaptive fusion module fuses high-level and low-level semantic information and reallocates the fused resources, realizing adaptive combination weights while avoiding redundant information. However, this algorithm does not utilize the spatial relationship among channels when extracting multi-scale features; for targets with similar semantic features, it cannot emphasize the intra-class correlation of features, so the segmentation accuracy is low.
Disclosure of Invention
In order to solve the problems in the prior art, the application provides a remote sensing image semantic segmentation method based on multi-scale attention fusion, which can realize semantic segmentation of high-resolution remote sensing images while avoiding redundant features, achieves higher classification accuracy, and provides objective and accurate data for understanding and analyzing high-resolution remote sensing images.
In order to achieve the above object, the technical scheme adopted by the application comprises the following steps:
1) Cropping the data set;
2) Inputting the cropped IRRG image and the cropped DSM image into a multi-modal fusion module to obtain the multi-modal fusion features F_0, F_1, F_2 and F_3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse features and to allocate weight resources;
3) Integrating and improving the multi-modal fusion feature F_3 using a multi-scale spatial context enhancement module, then performing a first upsampling;
4) Optimizing the encoding-end multi-modal fusion features F_0, F_1 and F_2 with a residual skip connection strategy, fusing them with the features of the corresponding scale at the decoding end, and continuously upsampling to output a segmentation map;
5) Stitching the segmentation maps according to the original image size, thereby completing the semantic segmentation of the remote sensing image.
Further, in the step 1), the IRRG image, the DSM image, and the label corresponding thereto are cut out using a sliding window, and the cut-out image size is 256×256.
Further, the multi-modal fusion module includes an optical branch, a depth branch, and an encoding fusion branch; the optical branch and the depth branch each provide a set of feature maps at each module stage, and the encoding fusion branch takes the fusion of the optical branch and the depth branch as input before downsampling and processes the fused data.
Further, the multi-modal fusion module implementation comprises:
1) The IRRG image feature I_0 and the DSM image feature D_0 are respectively input into two ResNet50 networks pre-trained on ImageNet, and the fusion feature M_0 of I_0 and D_0 is input into a third ResNet50 pre-trained on ImageNet; the initial stage model MFM-0 fuses the three branch features by pixel-level addition, where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module applied before the addition;
2) The feature maps output by the three branches in the first stage are taken as the input of the second stage, and the fused output of stage MFM-1 takes the same form;
3) The feature maps output by the three branches in the second stage are taken as the input of the third stage, and the fused output of stage MFM-2 takes the same form;
4) The feature maps output by the three branches in the third stage are taken as the input of the fourth stage, and the fused output of stage MFM-3 takes the same form.
further, the channel attention-based module implementation includes:
1) The input feature map A = [a_1, a_2, ..., a_C] is regarded as a combination of channels a_i ∈ R^{H×W}; global average pooling is applied to obtain a vector G ∈ R^{1×1×C}, whose k-th element is modeled as $G_k=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_k(i,j)$, integrating the global information into the vector G;
2) The vector G is converted into $\tilde{G}=\sigma\big(O_2\,\delta(O_1 G)\big)$, where O_1 ∈ R^{1×1×C/2} and O_2 ∈ R^{1×1×C} represent two fully connected convolutional layers, a ReLU activation function $\delta(\cdot)$ is added after O_1, and the result is further activated by the Sigmoid function $\sigma(\cdot)$, constraining it to [0, 1];
3) A and $\tilde{G}$ are combined by channel-wise multiplication to obtain the recalibrated feature map $\tilde{A}$, modeled as $\tilde{a}_i=\tilde{G}_i\cdot a_i$.
Further, the multi-scale spatial context enhancement module includes an ASPP module and a non-local module; F denotes the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F_3)).
Further, the multi-scale spatial context enhancement module implementation comprises:
1) The multi-modal fusion feature F_3 of the last stage of the multi-modal fusion module is input into the multi-scale spatial context enhancement module to extract multi-scale information; 3×3 dilated convolutions with dilation rates of 3, 6 and 9 are combined with one standard 1×1 convolution to extract the multi-scale information, and an image average pooling branch is added to integrate global context information;
2) Multi-scale information fusion is performed using the ASPP module, the number of channels is then reduced to 256 using a 1×1 convolution, and the result enters the non-local module;
3) The non-local model is $F_i=\sum_{j=1}^{n}\frac{f(x_i,x_j)}{C(X)}\,g(x_j)$, taking the feature map X = [x_1, x_2, ..., x_n] as input, where x_i ∈ R^{1×1×C} and x_j ∈ R^{1×1×C} are the feature vectors at position i and position j respectively, n = H×W is the number of pixels, H×W is the spatial dimension, and F has the same number of channels as X; C(X) is a normalization operation, g(x_j) = W_v x_j is represented in the network as a 1×1 convolution, and f(x_i, x_j)/C(X) computes the normalized spatial similarity between the vectors x_i and x_j, modeled as $f(x_i,x_j)=m(x_i)^{\mathrm{T}}n(x_j)$, where m(x_i) and n(x_j) are linear transformation matrices, m(x_i) = W_q x_i and n(x_j) = W_k x_j, both 1×1 convolutions in the network;
4) F is upsampled once by bilinear interpolation.
Further, the residual skip connection model is $f_{l+1}=\mathrm{DSC}\big(\mathrm{Activation}(\mathrm{Tconv}(f_l))\oplus f_{l-1}\big)$, where f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ is a pixel-level addition operation, f_{l-1} is the feature of the (l-1)-th layer that has not been downsampled, and f_{l+1} is the result after the residual skip connection processing.
Further, the residual skip connection implementation comprises:
1) The features of the l-th layer are restored, through transposed convolution learning, to the same size as the features of the (l-1)-th layer;
2) The features of the (l-1)-th layer that have not been downsampled are extracted separately and added to the above features;
3) The features are relearned using the depthwise separable convolution and transmitted to the (l+1)-th layer after its upsampling.
Further, the features optimized by the residual skip connection strategy are gradually fused with the decoder features and continuously upsampled by bilinear interpolation until the segmentation map is output.
Compared with the prior art, the application addresses the problems that the scale difference among segmented targets is large, the complexity of the segmented scenes is high, semantic segmentation of high-resolution remote sensing images is difficult, and using only spectral data together with a traditional multi-scale feature extraction module cannot improve the semantic segmentation effect. First, the multi-modal fusion module processes IRRG and DSM data from different modalities and introduces a channel-attention-based module to redistribute resources between high-level and low-level semantic features, which improves the fusion of multi-modal data, addresses the difficulty of classifying targets in remote sensing images, and avoids redundant features. Second, the fused features are improved by the multi-scale spatial context enhancement module, which addresses the large scale differences among targets in remote sensing images. Finally, the residual skip connection strategy retains and optimizes the multi-modal fusion information of the encoding end, solving the problem of image feature loss during downsampling; the strategy optimizes the output of the encoder, provides effective feature mapping for the decoding end, optimizes multi-modal information fusion, and improves the accuracy of target contours in high-resolution remote sensing images. Compared with mainstream remote sensing image semantic segmentation algorithms, the application on the one hand improves pixel classification accuracy by utilizing image depth data and the multi-scale spatial context enhancement module, and on the other hand improves the contour accuracy of segmented targets while accurately locating them by utilizing the residual skip connection strategy. The application can realize semantic segmentation of high-resolution remote sensing images with high classification accuracy, and has broad application prospects in scene understanding and analysis of high-resolution remote sensing images.
Drawings
FIG. 1 is a network block diagram of the present application;
FIG. 2a is a block diagram of the MFM-0 phase of the multi-modal fusion module of the present application; FIG. 2b is a block diagram of the MFM-n (n ∈ [1, 3]) phase of the multi-modal fusion module of the present application;
FIG. 3 is a block diagram of the channel attention based module of the present application;
FIG. 4 is a block diagram of a multi-scale spatial context enhancement module of the present application;
FIG. 5 is a block diagram of a residual skip connection strategy of the present application;
FIG. 6 is a sample image of the Potsdam dataset and the Vaihingen dataset;
FIG. 7 is a graph comparing the segmentation results of the present application and prior art methods on Potsdam dataset slices;
FIG. 8 is a graph comparing segmentation results of the present application and prior art methods in a Potsdam dataset;
FIG. 9 is a graph comparing the segmentation results of the present application and prior art methods on Vaihingen dataset slices;
fig. 10 is a graph comparing the segmentation results of the present application and the prior art method in the Vaihingen dataset.
Detailed Description
The present application will be further illustrated by the following description of the drawings and specific embodiments, wherein it is apparent that the embodiments described are some, but not all, of the embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to fall within the scope of the present application.
The application provides a remote sensing image semantic segmentation method based on Multi-scale attention fusion, designs a Multi-scale attention fusion network (Multi-scale Attention Fusion Network, MAFNet) for semantic segmentation of a high-resolution remote sensing image, relates to a convolutional neural network and Multi-mode data fusion and other technologies, can be applied to the semantic segmentation problem of the high-resolution remote sensing image, and lays a foundation for scene understanding of the high-resolution remote sensing image.
Referring to FIG. 1, the multi-scale attention fusion network of the application is based on an encoding-decoding structure. First, a multi-modal fusion module (Multi-modal Fusion Module, MFM) is used to process IRRG and DSM data from different modalities respectively, and a channel-attention-based module (Module based on Channel Attention, MCA) is introduced to redistribute resources between high-level and low-level semantic features, which improves the fusion of the multi-modal data, addresses the difficulty of classifying objects in remote sensing images, and avoids redundant features. Second, the fused features are improved by a multi-scale spatial context enhancement module (Multi-scale Spatial Context Enhancement Module, MSCEM), which addresses the large scale differences among targets in remote sensing images. Finally, the multi-modal fusion information of the encoding end is retained and optimized by a residual skip connection strategy (Residual Skip Connection strategy, RSC), solving the problem of image feature loss during downsampling. Compared with mainstream remote sensing image semantic segmentation algorithms, the application on the one hand improves pixel classification accuracy by utilizing image depth data and the multi-scale spatial context enhancement module, and on the other hand improves the contour accuracy of segmented targets while accurately locating them by utilizing the residual skip connection strategy. The application can realize semantic segmentation of high-resolution remote sensing images with high classification accuracy, and has broad application prospects in scene understanding and analysis of high-resolution remote sensing images.
Considering that higher segmentation accuracy is obtained with near infrared, red, green (IRRG) data than with RGB data, the application uses the IRRG image and the normalized digital surface model (DSM) as data sources, and draws on a reasonably designed symmetrical encoder-decoder structure. First, in the encoding process, the proposed model uses the multi-modal fusion module to extract features from the IRRG image and the DSM image respectively, and fuses the multi-modal data at each downsampling stage; second, a multi-scale spatial context enhancement module is introduced at the end of the encoding stage to integrate and improve the global spatial information of targets at different scales. Finally, the residual skip connection strategy is used to learn and optimize the multi-modal features, which are fused with the decoding features for upsampling and outputting the segmentation map. The method comprises the following steps:
(1) Cropping the data set;
(2) Inputting the cropped IRRG image and the cropped DSM image into the multi-modal fusion module to obtain the multi-modal fusion features F_0, F_1, F_2 and F_3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse features and to allocate weight resources;
(3) Integrating and improving the multi-modal fusion feature F_3 using the multi-scale spatial context enhancement module, then performing a first upsampling;
(4) Optimizing the encoding-end multi-modal fusion features F_0, F_1 and F_2 with the residual skip connection strategy, fusing them with the features of the corresponding scale at the decoding end, and continuously upsampling to output a segmentation map;
(5) Stitching the segmentation maps according to the original image size and outputting them, thereby completing the semantic segmentation of the remote sensing image.
The application is described in detail below, including the following steps:
(1) The IRRG image, the DSM image and the corresponding label map are cropped using a sliding window, and the cropped image size is 256×256.
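By way of illustration, the following is a minimal sketch of one possible sliding-window cropping routine for step (1); the 128-pixel overlap follows the stitching setup described in the experimental section, and the function and variable names are assumptions rather than the patent's own code.

```python
import numpy as np

def sliding_window_crop(image: np.ndarray, patch: int = 256, overlap: int = 128):
    """Crop an H×W×C image into patch×patch tiles with the given overlap.

    Assumes the image is at least patch×patch. Returns the tiles together with
    their top-left coordinates so predictions can later be stitched back.
    """
    stride = patch - overlap
    h, w = image.shape[:2]
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    # make sure the bottom and right borders are covered
    if ys[-1] != h - patch:
        ys.append(h - patch)
    if xs[-1] != w - patch:
        xs.append(w - patch)
    tiles, coords = [], []
    for y in ys:
        for x in xs:
            tiles.append(image[y:y + patch, x:x + patch])
            coords.append((y, x))
    return tiles, coords
```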
(2) The cropped IRRG image and the cropped DSM image are input into the multi-modal fusion module, which comprises an optical branch, a depth branch and an encoding fusion branch. Image features are extracted in the optical branch and the depth branch respectively, and each of these two branches provides a set of feature maps at each module stage; on this basis, a third, encoding fusion branch is introduced to process the fused data. Referring to FIG. 2a, the encoding fusion branch takes the fusion of the optical branch and the depth branch as input before downsampling, residual learning is carried out through a convolution block, and the residual uses the sum of the feature maps of the other two encoders until stage MFM-3. The structure of MFM-n (n ∈ [1, 3]) is shown in FIG. 2b: three pre-trained ResNet50 networks extract features for the three branches, which are then fused in the same pattern as the MFM-0 stage. Before the feature-map addition, the channel-attention-based module is introduced, and the last downsampling is discarded in the encoding stage. The specific implementation is as follows:
(a) The IRRG image feature I_0 and the DSM image feature D_0 are respectively input into two ResNet50 networks pre-trained on ImageNet, and the fusion feature M_0 of I_0 and D_0 is input into a third ResNet50 pre-trained on ImageNet. The initial stage model MFM-0 fuses the three branch features by pixel-level addition, where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module applied before the addition.
(b) The feature maps output by the three branches in the first stage are taken as the input of the second stage, and the fused output of stage MFM-1 takes the same form.
(c) The feature maps output by the three branches in the second stage are taken as the input of the third stage, and the fused output of stage MFM-2 takes the same form.
(d) The feature maps output by the three branches in the third stage are taken as the input of the fourth stage, and the fused output of stage MFM-3 takes the same form.
(3) MCA(·) denotes the channel-attention-based module, which is used to extract, reorganize and fuse features, assigning weight resources to the more meaningful feature maps. As shown in FIG. 3, the specific implementation is as follows:
(a) The input feature map A = [a_1, a_2, ..., a_C] is regarded as a combination of channels a_i ∈ R^{H×W}. First, global average pooling (GAP) is applied to obtain a vector G ∈ R^{1×1×C}, whose k-th element is modeled as $G_k=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_k(i,j)$; this operation integrates the global information into the vector G.
(b) Second, the vector G is converted into $\tilde{G}=\sigma\big(O_2\,\delta(O_1 G)\big)$, where O_1 ∈ R^{1×1×C/2} and O_2 ∈ R^{1×1×C} represent two fully connected convolutional layers; a ReLU activation function $\delta(\cdot)$ is added after O_1, which creates channel dependencies for feature extraction, and the result is further activated by the Sigmoid function $\sigma(\cdot)$, constraining it to [0, 1].
(c) Finally, A and $\tilde{G}$ are combined by channel-wise multiplication to obtain the recalibrated feature map $\tilde{A}$, modeled as $\tilde{a}_i=\tilde{G}_i\cdot a_i$. The ReLU remaps the original channels into new channels, adding nonlinearity in an adaptive fashion and allowing the network to fit better. During network learning, the module suppresses redundant features and recalibrates the weights so that the more meaningful feature maps are further optimized.
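A minimal PyTorch sketch of the channel-attention-based module and of how one fusion stage of steps (2)-(3) might combine the three branches is given below. It is an interpretation under the stated assumptions (attention applied to the optical and depth features before pixel-level addition with the fusion branch); class names such as MCA and MFMStage are illustrative, not the patent's own code.

```python
import torch
import torch.nn as nn

class MCA(nn.Module):
    """Channel-attention-based module: GAP -> O1 -> ReLU -> O2 -> Sigmoid -> rescale."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                            # G in R^{1x1xC}
        self.o1 = nn.Conv2d(channels, channels // 2, kernel_size=1)   # O_1
        self.o2 = nn.Conv2d(channels // 2, channels, kernel_size=1)   # O_2
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        g = self.sigmoid(self.o2(self.relu(self.o1(self.gap(a)))))
        return a * g                                                  # channel-wise recalibration

class MFMStage(nn.Module):
    """One stage of the multi-modal fusion module (schematic reading)."""
    def __init__(self, irrg_stage: nn.Module, dsm_stage: nn.Module,
                 fuse_stage: nn.Module, channels: int):
        super().__init__()
        # the three stages would be the corresponding stages of three
        # ImageNet-pretrained ResNet50 encoders
        self.irrg_stage, self.dsm_stage, self.fuse_stage = irrg_stage, dsm_stage, fuse_stage
        self.mca_i, self.mca_d = MCA(channels), MCA(channels)

    def forward(self, i, d, m):
        i, d, m = self.irrg_stage(i), self.dsm_stage(d), self.fuse_stage(m)
        fused = self.mca_i(i) + self.mca_d(d) + m                     # pixel-level addition
        return i, d, fused                                            # fused feature of this stage
```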
(4) Referring to FIG. 4, the multi-scale spatial context enhancement module is composed of an ASPP (Atrous Spatial Pyramid Pooling) module and a non-local module; F denotes the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F_3)). The specific implementation is as follows:
(a) The fusion feature F_3 of the final stage of the multi-modal fusion module is input into the multi-scale spatial context enhancement module to extract multi-scale information. Because the dilated convolutions with different dilation rates computed on F_3 are fused into the same output feature map, and the fusion should cover the whole feature map, 3×3 convolutions with dilation rates of 3, 6 and 9 are combined with a standard 1×1 convolution to extract multi-scale information, and an image average pooling branch is then added to integrate global context information; this gives better efficiency and performance without increasing the number of parameters.
(b) Multi-scale information fusion is performed using the ASPP module, the number of channels is then reduced to 256 using a 1×1 convolution, and the result enters the non-local module.
(c) The non-local model is $F_i=\sum_{j=1}^{n}\frac{f(x_i,x_j)}{C(X)}\,g(x_j)$, taking the feature map X = [x_1, x_2, ..., x_n] as input, where x_i ∈ R^{1×1×C} and x_j ∈ R^{1×1×C} are the feature vectors at position i and position j respectively; n = H×W is the number of pixels and H×W is the spatial dimension; F has the same number of channels as X; C(X) is a normalization operation, and g(x_j) = W_v x_j is represented in the network as a 1×1 convolution. Next, f(x_i, x_j)/C(X) computes the normalized spatial similarity between the vectors x_i and x_j, modeled as $f(x_i,x_j)=m(x_i)^{\mathrm{T}}n(x_j)$, where m(x_i) and n(x_j) are linear transformation matrices, m(x_i) = W_q x_i and n(x_j) = W_k x_j, both 1×1 convolutions in the network. The module establishes relations between any two spatial positions and improves the semantic feature expression.
(d) F is upsampled once by bilinear interpolation.
The global-context long-range dependency strategy is important in semantic segmentation of multi-class high-resolution remote sensing images; in order to better utilize the spatial information of the multi-scale feature map, non-local information is introduced after the multi-scale information is integrated. The DSM data also provide auxiliary physical properties for specific categories in the remote sensing image, and spatial relationships can enhance the local properties of the feature map by aggregating dependencies on other pixel locations. For targets with similar semantic features, the context-connecting strategy enhances the intra-class correlation of features; the module combines global and local information, making the semantic segmentation results of high-resolution remote sensing images more accurate.
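The following sketch gives one possible PyTorch realization of the multi-scale spatial context enhancement module of step (4): an ASPP head with dilation rates 3, 6 and 9 plus image pooling, a reduction to 256 channels, and a non-local block. The exact layer ordering, normalization and names are assumptions; it is not the patent's own implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP head: 1x1 conv, 3x3 dilated convs (rates 3, 6, 9) and image pooling."""
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        rates = (3, 6, 9)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)  # reduce channels to 256

    def forward(self, x):
        size = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=size,
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))

class NonLocal(nn.Module):
    """Non-local block: aggregates features from all positions weighted by similarity."""
    def __init__(self, ch: int):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 2, 1)   # m(x_i) = W_q x_i
        self.k = nn.Conv2d(ch, ch // 2, 1)   # n(x_j) = W_k x_j
        self.v = nn.Conv2d(ch, ch, 1)        # g(x_j) = W_v x_j

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # B x N x C/2
        k = self.k(x).flatten(2)                          # B x C/2 x N
        v = self.v(x).flatten(2).transpose(1, 2)          # B x N x C
        attn = torch.softmax(q @ k, dim=-1)               # normalized similarity f/C(X)
        y = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + y                                      # residual aggregation

class MSCEM(nn.Module):
    """F = NL(ASPP(F_3)), followed by one bilinear upsampling."""
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.aspp, self.nl = ASPP(in_ch, out_ch), NonLocal(out_ch)

    def forward(self, f3):
        f = self.nl(self.aspp(f3))
        return F.interpolate(f, scale_factor=2, mode='bilinear', align_corners=False)
```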
(5) FIG. 5 is a schematic diagram of the residual skip connection strategy. In a conventional skip connection, the features of the (l-1)-th layer are transferred to the (l+1)-th layer via a skip connection, while they are also downsampled and passed to the l-th layer and then upsampled to the (l+1)-th layer; this process is iterative. Loss of low-resolution information results in blurred segmentation boundaries. The conventional skip connection transmits the high-resolution feature map directly to the decoder without any learning by convolutional layers, the effective information of the encoding end is continuously lost during downsampling, and the network model obtained by the final learning cannot effectively map high-resolution information. The residual skip connection model is $f_{l+1}=\mathrm{DSC}\big(\mathrm{Activation}(\mathrm{Tconv}(f_l))\oplus f_{l-1}\big)$, and the specific implementation is as follows:
(a) The features of the l-th layer are restored, through transposed convolution learning, to the same size as the features of the (l-1)-th layer.
(b) The features of the (l-1)-th layer that have not been downsampled are extracted separately and added to the above features.
(c) The features are relearned using the depthwise separable convolution and transmitted to the (l+1)-th layer after its upsampling.
The features F_0, F_1 and F_2 of the first three stages obtained by the multi-modal fusion module are fed back in their entirety to the decoder using this strategy, which keeps the features at a higher level. In the model above, f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ is a pixel-level addition operation, f_{l-1} is the feature of the (l-1)-th layer that has not been downsampled, and f_{l+1} is the result after the residual skip connection processing.
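A minimal sketch of the residual skip connection under the reading above (transposed convolution, ReLU, pixel-level addition with the un-downsampled feature, then a depthwise separable convolution) follows; the class names and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class ResidualSkipConnection(nn.Module):
    """f_{l+1} = DSC(ReLU(Tconv(f_l)) + f_{l-1})  (schematic reading of the patent)."""
    def __init__(self, in_ch: int, skip_ch: int):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(in_ch, skip_ch, kernel_size=2, stride=2)
        self.act = nn.ReLU(inplace=True)
        self.dsc = DepthwiseSeparableConv(skip_ch)

    def forward(self, f_l: torch.Tensor, f_lm1: torch.Tensor) -> torch.Tensor:
        # restore f_l to the spatial size of f_{l-1}, add, then relearn with DSC
        return self.dsc(self.act(self.tconv(f_l)) + f_lm1)
```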
(6) The features optimized through the residual skip connection strategy are gradually fused with the decoder features and continuously upsampled by bilinear interpolation until the segmentation map is output.
(7) The segmentation maps are stitched and output according to the original image size.
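Complementing the cropping sketch in step (1), a possible stitching routine that reassembles overlapping patch predictions by averaging the per-class scores is outlined below; averaging is an assumption about how step (7) could be realized, not the patent's own procedure.

```python
import numpy as np

def stitch_predictions(patch_logits, coords, out_shape, patch: int = 256):
    """Average overlapping per-class score patches back to the full image, then argmax.

    patch_logits: list of (num_classes, patch, patch) arrays
    coords:       list of (y, x) top-left positions from the cropping step
    out_shape:    (H, W) of the original image
    """
    num_classes = patch_logits[0].shape[0]
    scores = np.zeros((num_classes,) + tuple(out_shape), dtype=np.float32)
    counts = np.zeros(out_shape, dtype=np.float32)
    for logit, (y, x) in zip(patch_logits, coords):
        scores[:, y:y + patch, x:x + patch] += logit
        counts[y:y + patch, x:x + patch] += 1.0
    scores /= np.maximum(counts, 1.0)
    return scores.argmax(axis=0)          # per-pixel class map for the whole image
```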
In order to verify the validity of the application for semantic segmentation of high-resolution remote sensing images, urban ground object classification experiments are carried out on two public datasets, the Vaihingen dataset and the Potsdam dataset, and the performance of the model is verified using evaluation indexes.
The Potsdam dataset includes 38 images, each having three bands, corresponding to near Infrared (IR), red (R), and green (G), respectively. The dataset also provides a digital surface model and a normalized digital surface model corresponding to the image slice. The spatial resolution of the image slices is 5 cm and the sizes are 6000 x 6000 pixels. Six categories (pavement, buildings, low vegetation, trees, cars, and clutter) have been marked pixel by pixel on 24 tagged images. The application uses both IRRG and DSM data types. Image numbers 5_12, 6_7, and 7_9 are selected for verification, image numbers 5_10 and 6_8 are selected for testing, and the remaining images are selected for training.
The Vaihingen dataset comprises 33 images with a spatial resolution of 9 cm. The bands of each image are the same as in the Potsdam dataset, with an average size of 2494×2064 pixels. Only 16 images have ground truth labels, which contain the same six categories as the Potsdam dataset. Both IRRG and DSM data types are also used. Five images, numbered 11, 15, 28, 30 and 34, are used as the test set to evaluate the network model of the application; 3 images, numbered 7, 23 and 37, are used as the validation set, and the remaining images are used for training. FIG. 6 shows sample images, digital surface models and corresponding labels from the two datasets.
Due to GPU memory limitations, the size of the images in the dataset needs to be changed to fit the network model of the application. Each image is cropped to 256×256 pixels with a 128-pixel overlap, and the prediction results are finally stitched together. The application uses data augmentation to reduce the risk of overfitting, including random flipping (vertical and horizontal) and random rotation (0°, 90°, 180°, 270°) on all training images; the augmented data effectively prevent the model from overfitting and improve its robustness. The application is built using the deep learning framework PyTorch. A ResNet50 pre-trained on ImageNet serves as the backbone network. The operating system is Windows 10, the processor is an Intel(R) Xeon(R) CPU E5-1620 v4, and the proposed MAFNet is trained on two NVIDIA GeForce GTX 1080 graphics processors, each with 8 GB of memory. The application uses a stochastic gradient descent optimizer with cross-entropy loss, a momentum of 0.9 and a weight decay of 0.004 to optimize the network. The initial learning rate is 1e-3, multiplied by 0.98 at the end of each epoch. The total batch size is set to 16, and the network is trained for 250 epochs.
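Under the settings listed above (SGD with momentum 0.9 and weight decay 0.004, cross-entropy loss, initial learning rate 1e-3 multiplied by 0.98 each epoch, batch size 16, 250 epochs), a training loop could be sketched as follows; the dataset object and the two-input model are placeholders assumed for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda", epochs=250, batch_size=16):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=0.004)
    # multiply the learning rate by 0.98 at the end of every epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
    model.to(device).train()
    for epoch in range(epochs):
        for irrg, dsm, label in loader:          # assumed (IRRG, DSM, label) triplets
            irrg, dsm, label = irrg.to(device), dsm.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(irrg, dsm), label)
            loss.backward()
            optimizer.step()
        scheduler.step()
```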
In order to further demonstrate the superiority of the application, it is compared with other mainstream deep-learning-based remote sensing image semantic segmentation algorithms and the results are visualized; the compared methods include DeepLab v3+, APPD, MANet, DSMFNet, and REMSNet.
In order to compare the performances of different algorithms, the application selects the Overall Accuracy (OA) and the F1 Score as evaluation indexes, and the larger the values of the OA and the F1 Score are, the better the segmentation result is.
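For reference, the two evaluation indexes can be computed from a confusion matrix as sketched below; this is a generic implementation, not code taken from the patent.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    # pred, gt: integer label maps of the same shape; rows = ground truth, cols = prediction
    k = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[k].astype(int) + pred[k].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(cm):
    return np.diag(cm).sum() / cm.sum()

def f1_per_class(cm):
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)
```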
Comparative experiments of the different methods were performed on the Potsdam dataset, see Table 1 for comparative results:
Table 1 Results of comparative experiments of different methods on the Potsdam dataset
Method Imp.Surf. Building Low veg. Tree Car Mean F1 OA
DeepLab v3+ 89.88 93.78 83.23 81.66 93.50 88.41 87.72
APPD 90.80 94.56 84.37 85.14 94.42 89.86 88.42
MANet 91.33 95.91 85.88 87.01 91.46 90.32 89.19
DSMFNet 93.03 95.75 86.33 86.46 94.88 91.29 90.36
REMSNet 93.48 96.17 87.52 87.97 95.03 92.03 90.79
MAFNet 93.61 96.26 87.87 88.65 95.32 92.34 91.04
In the experiments on the Potsdam dataset, the F1 Score for each category, the average F1 Score and the Overall Accuracy (OA) are calculated. As shown in Table 1, the application achieves an average F1 Score of 92.34% and an overall accuracy of 91.04%, which is superior to the other algorithms on all evaluation criteria. The Potsdam dataset has relatively complex scenes, and classification of trees and low vegetation is difficult; compared with DeepLab v3+, the classification of trees is improved by 7.0%, and the classification of the other categories is correspondingly improved, showing that MAFNet can capture targets of different scales by using global contextual spatial information.
FIG. 7 is a visualization of the segmentation results of the application and other methods on a Potsdam dataset slice. From the unlabeled original image, it can be seen that trees and low vegetation are very similar; in some areas even the human eye cannot classify them correctly. However, the comparison within the dashed boxes shows that the application obtains better segmentation results in the classification of trees and low vegetation, which also verifies its superior performance. Second, for small targets such as vehicles, the segmentation results of the proposed MAFNet are also more refined. The residual skip connection strategy provided by the application addresses the problem that small targets are easily mis-segmented because of the information loss caused by downsampling; similarly, for large targets, the semantic features of intra-class attributes are enhanced and misclassification is reduced.
FIG. 8 shows the overall classification result of the 5_10 region in the Potsdam dataset; the regions and distribution patterns of all categories can be clearly distinguished, which has practical significance for city planning.
Comparative experiments of the different methods were performed on the Vaihingen dataset, see table 2 for comparative results:
Table 2 Results of comparative experiments of different methods on the Vaihingen dataset
Method Imp.Surf. Building Low veg. Tree Car Mean F1 OA
DeepLab v3+ 87.67 93.95 79.17 86.26 80.34 85.48 87.22
APPD 88.78 93.38 80.43 86.76 80.88 86.05 87.71
MANet 90.12 94.08 81.01 87.21 81.16 86.72 88.17
DSMFNet 91.47 95.08 82.11 88.61 81.01 87.66 89.80
REMSNet 92.01 95.67 82.35 89.73 81.26 88.20 90.08
MAFNet 92.06 96.12 82.71 90.01 82.13 88.61 90.27
The evaluation results in Table 2 report the F1 Score for each category, the average F1 Score and the Overall Accuracy (OA). As shown in Table 2, the mean F1 Score and overall accuracy of the proposed MAFNet are 88.61% and 90.27%, respectively, superior to the other algorithms. In particular, for the car category, the residual skip connection strategy provided by the application effectively preserves the information of small objects. Adding DSM data to the network input brings improvements in the classification of the other categories with the aid of auxiliary physical height information, addresses the difficulty of classifying targets in high-resolution remote sensing images, and verifies that data fusion between different modalities assists the classification of ground features in remote sensing images. The results show that the method has stronger capability in complex high-resolution remote sensing scenes; the multi-scale spatial context enhancement module handles the large scale differences among segmented targets and effectively extracts the features of targets at different scales, and even when a target occupies a small proportion of an area and has high similarity to targets of other categories, the method can still achieve correct segmentation.
FIG. 9 is a visualization of the ground feature classification results of different algorithms on the Vaihingen test set. Trees and low vegetation are difficult to classify because of their high similarity. The dashed bounding boxes show that the application can better distinguish areas with high similarity and retain all the information of small objects. For factors such as illumination and shadow, the proposed MAFNet can also weaken the interference to a certain extent; for example, in the fourth row, trees under shadow occlusion can also be correctly classified.
FIG. 10 is the complete area map after slicing and stitching, where the second row presents the comparison with the results of the application; the experimental results show that the application performs well in scene analysis of complex high-resolution remote sensing images.
The application decomposes and combines the proposed modules and further verifies the validity of the different modules with the F1 Score and overall accuracy. The ablation experiments use the Vaihingen dataset. First, the baseline model uses two ResNet50 networks to extract the features of the different modality data respectively, fuses them after the last residual block, and outputs the segmentation map through continuous upsampling; no interaction between the different data is performed during feature extraction. Second, to verify the MFM, the multi-modal fusion module is added: during feature extraction the two ResNet50 networks use the attention mechanism to reasonably allocate feature resources and continuously fuse information, a third ResNet50 is introduced to process the fusion branch, and the feature maps fused from the three branches are finally upsampled continuously to obtain the segmentation map. In the third model, the encoding stage inputs the feature map fused by ResNet50 into the multi-scale spatial context enhancement module to obtain a new feature map, and the decoding stage upsamples continuously until the final output. The fourth model combines ResNet50 with the residual skip connection strategy to verify the validity of fusing information at the decoding end while retaining the encoding-end information; the residual learning strategy makes the first three downsampled features correspond in turn to the upsampling stages and outputs the final prediction. Finally, all modules are integrated together, and all results of the ablation experiments are shown in Table 3:
Table 3 Results of ablation experiments performed on the Vaihingen dataset
Models Imp.Surf. Building Low veg. Tree Car Mean F1 OA
Res50 86.94 89.67 75.83 84.42 77.40 82.85 84.98
Res50+MFM 88.15 93.84 76.49 86.48 78.02 84.60 86.66
Res50+MSCEM 88.79 93.09 79.79 85.55 80.38 85.52 87.35
Res50+RSC 90.11 92.97 80.24 86.04 81.14 86.10 87.82
MAFNet 92.06 96.12 82.71 90.01 82.13 88.61 90.27
The results in Table 3 show that Res50+MFM improves the average F1 Score by 1.8% and the overall accuracy by 1.7% compared with ResNet50; introducing the attention mechanism before fusing the data of different modalities solves the weight allocation problem of the feature maps, verifies the validity of multi-modal data information, and shows that efficient feature fusion can improve segmentation accuracy. Res50+MSCEM improves the average F1 Score and the overall accuracy by 2.7% and 2.4% compared with ResNet50; the multi-scale spatial context enhancement module improves the performance of the backbone network, effectively acquires the information in the image, enhances the relevance among different categories, and addresses the difficulty of extracting multi-scale targets in remote sensing images. Res50+RSC improves the average F1 Score by 3.3% and the overall accuracy by 2.8% compared with ResNet50; compared with an ordinary skip connection, the new residual skip connection strategy not only enhances the output features of the encoding end but also provides better feature fusion for the decoding end. In addition, when all modules are integrated, MAFNet improves the average F1 Score and the overall accuracy by 5.8% and 5.3% compared with the initial network model, which shows that the semantic segmentation effect on high-resolution remote sensing images can be significantly improved.
In summary, the remote sensing image semantic segmentation method based on multi-scale attention fusion addresses the difficulty of classifying targets in remote sensing images by fusing multi-modal data; it introduces an attention mechanism to reallocate resources in the feature extraction stage, thereby avoiding redundant features; it adopts a multi-scale spatial context module to handle the large scale differences among targets in remote sensing images; and it uses the residual skip connection strategy to retain and optimize the information of the encoding end, solving the problem of image feature loss during downsampling. The application not only realizes semantic segmentation of high-resolution remote sensing images, but also achieves higher classification accuracy, providing objective and accurate data for understanding and analyzing high-resolution remote sensing images.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limited thereto; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present application.

Claims (10)

1. A remote sensing image semantic segmentation method based on multi-scale attention fusion is characterized by comprising the following steps:
1) Cropping the data set;
2) Inputting the cropped IRRG image and the cropped DSM image into a multi-modal fusion module to obtain the multi-modal fusion features F_0, F_1, F_2 and F_3 of each stage; a channel-attention-based module is introduced into the multi-modal fusion module to extract, reorganize and fuse features and to allocate weight resources;
3) Integrating and improving the multi-modal fusion feature F_3 using a multi-scale spatial context enhancement module, then performing a first upsampling;
4) Optimizing the encoding-end multi-modal fusion features F_0, F_1 and F_2 with a residual skip connection strategy, fusing them with the features of the corresponding scale at the decoding end, and continuously upsampling to output a segmentation map;
5) Stitching the segmentation maps according to the original image size, thereby completing the semantic segmentation of the remote sensing image.
2. The semantic segmentation method of remote sensing images based on multi-scale attention fusion according to claim 1, wherein in the step 1), the IRRG image, the DSM image and the label map corresponding to the IRRG image and the DSM image are cut by using a sliding window, and the cut image size is 256×256.
3. The method of claim 1, wherein the multi-modal fusion module includes an optical branch, a depth branch, and an encoding fusion branch; the optical branch and the depth branch each provide a set of feature maps at each module stage, and the encoding fusion branch takes the fusion of the optical branch and the depth branch as input before downsampling and processes the fused data.
4. A remote sensing image semantic segmentation method based on multi-scale attention fusion according to claim 3, wherein the multi-modal fusion module implementation comprises:
1) The IRRG image feature I_0 and the DSM image feature D_0 are respectively input into two ResNet50 networks pre-trained on ImageNet, and the fusion feature M_0 of I_0 and D_0 is input into a third ResNet50 pre-trained on ImageNet; the initial stage model MFM-0 fuses the three branch features by pixel-level addition, where ⊕ denotes pixel-level addition and MCA(·) denotes the channel-attention-based module applied before the addition;
2) The feature maps output by the three branches in the first stage are taken as the input of the second stage, and the fused output of stage MFM-1 takes the same form;
3) The feature maps output by the three branches in the second stage are taken as the input of the third stage, and the fused output of stage MFM-2 takes the same form;
4) The feature maps output by the three branches in the third stage are taken as the input of the fourth stage, and the fused output of stage MFM-3 takes the same form.
5. the method for semantic segmentation of remote sensing images based on multi-scale attention fusion according to claim 4, wherein the module implementation manner based on channel attention comprises:
1) The input feature map A = [a_1, a_2, ..., a_C] is regarded as a combination of channels a_i ∈ R^{H×W}; global average pooling is applied to obtain a vector G ∈ R^{1×1×C}, whose k-th element is modeled as $G_k=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_k(i,j)$, integrating the global information into the vector G;
2) The vector G is converted into $\tilde{G}=\sigma\big(O_2\,\delta(O_1 G)\big)$, where O_1 ∈ R^{1×1×C/2} and O_2 ∈ R^{1×1×C} represent two fully connected convolutional layers, a ReLU activation function $\delta(\cdot)$ is added after O_1, and the result is further activated by the Sigmoid function $\sigma(\cdot)$, constraining it to [0, 1];
3) A and $\tilde{G}$ are combined by channel-wise multiplication to obtain the recalibrated feature map $\tilde{A}$, modeled as $\tilde{a}_i=\tilde{G}_i\cdot a_i$.
6. The method for semantic segmentation of a remote sensing image based on multi-scale attention fusion according to claim 1, wherein the multi-scale spatial context enhancement module comprises an ASPP module and a non-local module; F denotes the feature map processed by the multi-scale spatial context enhancement module, and the model is F = NL(ASPP(F_3)).
7. The remote sensing image semantic segmentation method based on multi-scale attention fusion according to claim 6, wherein the multi-scale spatial context enhancement module implementation manner comprises:
1) The multi-modal fusion feature F_3 of the last stage of the multi-modal fusion module is input into the multi-scale spatial context enhancement module to extract multi-scale information; 3×3 dilated convolutions with dilation rates of 3, 6 and 9 are combined with one standard 1×1 convolution to extract the multi-scale information, and an image average pooling branch is added to integrate global context information;
2) Multi-scale information fusion is performed using the ASPP module, the number of channels is then reduced to 256 using a 1×1 convolution, and the result enters the non-local module;
3) The non-local model is $F_i=\sum_{j=1}^{n}\frac{f(x_i,x_j)}{C(X)}\,g(x_j)$, taking the feature map X = [x_1, x_2, ..., x_n] as input, where x_i ∈ R^{1×1×C} and x_j ∈ R^{1×1×C} are the feature vectors at position i and position j respectively, n = H×W is the number of pixels, H×W is the spatial dimension, and F has the same number of channels as X; C(X) is a normalization operation, g(x_j) = W_v x_j is represented in the network as a 1×1 convolution, and f(x_i, x_j)/C(X) computes the normalized spatial similarity between the vectors x_i and x_j, modeled as $f(x_i,x_j)=m(x_i)^{\mathrm{T}}n(x_j)$, where m(x_i) and n(x_j) are linear transformation matrices, m(x_i) = W_q x_i and n(x_j) = W_k x_j, both 1×1 convolutions in the network;
4) F is upsampled once by bilinear interpolation.
8. The remote sensing image semantic segmentation method based on multi-scale attention fusion according to claim 1, wherein the residual skip connection model is $f_{l+1}=\mathrm{DSC}\big(\mathrm{Activation}(\mathrm{Tconv}(f_l))\oplus f_{l-1}\big)$, where f_l is the feature of the l-th layer, Tconv is the transposed convolution, Activation is the ReLU activation function, DSC denotes the depthwise separable convolution, ⊕ is a pixel-level addition operation, f_{l-1} is the feature of the (l-1)-th layer that has not been downsampled, and f_{l+1} is the result after the residual skip connection processing.
9. The remote sensing image semantic segmentation method based on multi-scale attention fusion according to claim 8, wherein the residual skip connection implementation comprises:
1) The features of the l-th layer are restored, through transposed convolution learning, to the same size as the features of the (l-1)-th layer;
2) The features of the (l-1)-th layer that have not been downsampled are extracted separately and added to the above features;
3) The features are relearned using the depthwise separable convolution and transmitted to the (l+1)-th layer after its upsampling.
10. The remote sensing image semantic segmentation method based on multi-scale attention fusion according to claim 9, wherein the features optimized by the residual skip connection strategy are gradually fused with the decoder features and continuously upsampled by bilinear interpolation until the segmentation map is output.
CN202110528206.1A 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion Active CN113283435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110528206.1A CN113283435B (en) 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110528206.1A CN113283435B (en) 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion

Publications (2)

Publication Number Publication Date
CN113283435A CN113283435A (en) 2021-08-20
CN113283435B true CN113283435B (en) 2023-08-22

Family

ID=77279332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110528206.1A Active CN113283435B (en) 2021-05-14 2021-05-14 Remote sensing image semantic segmentation method based on multi-scale attention fusion

Country Status (1)

Country Link
CN (1) CN113283435B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850824B (en) * 2021-09-27 2024-03-29 太原理工大学 Remote sensing image road network extraction method based on multi-scale feature fusion
CN113887470B (en) * 2021-10-15 2024-06-14 浙江大学 High-resolution remote sensing image ground object extraction method based on multitask attention mechanism
CN114387439B (en) * 2022-01-13 2023-09-12 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on optical and PolSAR feature fusion
CN114677412B (en) * 2022-03-18 2023-05-12 苏州大学 Optical flow estimation method, device and equipment
CN115761478A (en) * 2022-10-17 2023-03-07 苏州大学 Building extraction model lightweight method based on SAR image in cross-mode
CN115546649B (en) * 2022-10-24 2023-04-18 中国矿业大学(北京) Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method
CN115424023B (en) * 2022-11-07 2023-04-18 北京精诊医疗科技有限公司 Self-attention method for enhancing small target segmentation performance
CN116452936B (en) * 2023-04-22 2023-09-29 安徽大学 Rotation target detection method integrating optics and SAR image multi-mode information
CN116188497B (en) * 2023-04-27 2023-07-07 成都国星宇航科技股份有限公司 Method, device, equipment and storage medium for optimizing generation of DSM (digital image model) of stereo remote sensing image pair
CN116307267B (en) * 2023-05-15 2023-07-25 成都信息工程大学 Rainfall prediction method based on convolution
CN116363134B (en) * 2023-06-01 2023-09-05 深圳海清智元科技股份有限公司 Method and device for identifying and dividing coal and gangue and electronic equipment
CN116740362B (en) * 2023-08-14 2023-11-21 南京信息工程大学 Attention-based lightweight asymmetric scene semantic segmentation method and system
CN117274608B (en) * 2023-11-23 2024-02-06 太原科技大学 Remote sensing image semantic segmentation method based on space detail perception and attention guidance
CN117635953B (en) * 2024-01-26 2024-04-26 泉州装备制造研究所 Multi-mode unmanned aerial vehicle aerial photography-based real-time semantic segmentation method for power system
CN118135239B (en) * 2024-05-10 2024-07-05 南京信息工程大学 Fusion filtering multi-scale high-resolution remote sensing glacier extraction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN112132006A (en) * 2020-09-21 2020-12-25 西南交通大学 Intelligent forest land and building extraction method for cultivated land protection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11361470B2 (en) * 2019-05-09 2022-06-14 Sri International Semantically-aware image-based visual localization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN112132006A (en) * 2020-09-21 2020-12-25 西南交通大学 Intelligent forest land and building extraction method for cultivated land protection

Also Published As

Publication number Publication date
CN113283435A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN113283435B (en) Remote sensing image semantic segmentation method based on multi-scale attention fusion
Chen et al. Symmetrical dense-shortcut deep fully convolutional networks for semantic segmentation of very-high-resolution remote sensing images
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN114266794B (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN115423734B (en) Infrared and visible light image fusion method based on multi-scale attention mechanism
CN111833273A (en) Semantic boundary enhancement method based on long-distance dependence
CN111861945A (en) Text-guided image restoration method and system
Liu et al. CT-UNet: Context-transfer-UNet for building segmentation in remote sensing images
Jiang et al. Forest-CD: Forest change detection network based on VHR images
CN115376019A (en) Object level change detection method for heterogeneous remote sensing image
CN114708455A (en) Hyperspectral image and LiDAR data collaborative classification method
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
Chen et al. Continuous cross-resolution remote sensing image change detection
Gao A method for face image inpainting based on generative adversarial networks
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
CN116543165B (en) Remote sensing image fruit tree segmentation method based on dual-channel composite depth network
CN117115641B (en) Building information extraction method and device, electronic equipment and storage medium
Bashmal et al. Language Integration in Remote Sensing: Tasks, datasets, and future directions
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN115082778A (en) Multi-branch learning-based homestead identification method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant