CN111932553A - Remote sensing image semantic segmentation method based on area description self-attention mechanism - Google Patents
- Publication number: CN111932553A (Application number: CN202010732126.3A)
- Authority
- CN
- China
- Prior art keywords: feature map, remote sensing image, layer, region
- Prior art date: 2020-07-27
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/11—Region-based segmentation
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06T2207/10032—Satellite or aerial image; Remote sensing
Abstract
The invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism. A visible light remote sensing image is input into an encoder, which extracts its high-level semantic features and produces feature maps of different levels; based on these feature maps, a self-attention module performs global scene extraction and essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map; the scene guide feature map and the noise-free feature map are input into a decoder, upsampled back to the original image size, and classified pixel by pixel to obtain the semantic segmentation result of the remote sensing image. With an encoder that extracts semantic features, a self-attention module that strengthens the internal relations within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the invention enlarges the receptive field of the model, adapts to the scale changes of the data, and alleviates the class imbalance problem.
Description
Technical Field
The invention relates to the technical field of remote sensing and computer vision, and in particular to a remote sensing image semantic segmentation method based on a region description self-attention mechanism.
Background
Applied to the remote sensing field, semantic segmentation technology takes a remote sensing image as input and outputs a class label for each pixel in the image, which greatly helps the understanding of remote sensing images. For example, in land planning, if the ground cover type of each pixel on a satellite image (city, road, forest, farmland, river, etc.) can be identified, the distribution and occupied area of each type become immediately clear, which benefits overall planning. Likewise, intelligent identification of buildings can quickly reveal whether illegal buildings exist and can greatly reduce manual effort.
At present, most deep learning work on semantic segmentation derives from the fully convolutional network (FCN). The FCN converts well-known classification networks into fully convolutional ones by replacing the fully connected layers with convolutional layers and fusing deep features with shallow features through deconvolution operations, so that both high-level semantic features and low-level positional features are taken into account and the accuracy of the model is improved. However, the FCN still cannot be applied to all scenes, mainly because the local nature of convolution leads to a small receptive field, so context information over a larger range cannot be considered. To solve these problems, a family of semantic segmentation networks has been derived on the basis of the FCN.
Traditional remote sensing image semantic segmentation extracts features only by stacking convolutions. Owing to the physical nature of convolution filters, the receptive field of a convolution operation is limited: convolution filters aim to capture local features and relations; in other words, information in a convolutional neural network flows only within local regions, which greatly hinders the understanding of complex scenes. This is especially true for remote sensing tasks, where the scenes in a remote sensing image are numerous and complex, and long-range dependencies need to be gathered to help predict the current position. Therefore, the invention introduces the self-attention mechanism into the traditional remote sensing image semantic segmentation task for the first time, enlarging the receptive field of the model, letting information flow globally, and discovering long-range dependencies in the remote sensing image, such as the containment relation between an airplane and an airport, or the relation a car should have with other cars or roads; such relations can greatly help the semantic features of remote sensing images.
However, the conventional point-to-point self-attention mechanism adds context information by computing the relation between each pixel and every other pixel. Its disadvantage is that point-to-point similarity favors objects with identical features: a red car and another red car will produce a considerable degree of association, but since the label is only the car class, one would rather have the red car associated with all cars and even with roads. In other words, for the current task of recognizing cars the color features are redundant: the color in the feature map is noise, the color of a car does not help the current task, and the feature differences may even produce the opposite effect.
Therefore, how to provide a remote sensing image semantic segmentation method based on a region description self-attention mechanism is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which introduces a self-attention mechanism based on region descriptors into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, lets information flow globally, and can discover long-range dependencies in the remote sensing image.
In order to achieve the purpose, the invention adopts the following technical scheme:
The remote sensing image semantic segmentation method based on the region description self-attention mechanism comprises the following steps:
Step 1: input the visible light remote sensing image into an encoder, extract its high-level semantic features, and obtain feature maps of different levels;
Step 2: based on the feature maps of different levels, respectively perform global scene extraction and self-attention-based essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map;
Step 3: input the scene guide feature map and the noise-free feature map into a decoder, upsample back to the size of the visible light remote sensing image, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
Further, the encoder includes a feature extraction network ResNet-101.
Further, the visible light remote sensing image I is input into the feature extraction network ResNet-101; the fourth layer outputs a feature map F4 and the fifth layer outputs a feature map F5.
Further, the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: according to the ground truth, divide the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category and N denotes the set of soft regions;
Step 22: weight and aggregate the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels of the feature map F4;
apply a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t;
Step 23: after reducing the dimensionality of the feature map F5, obtain the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight;
Step 24: based on the self-attention weights, compute the degree of association between each point and the regions:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtain the noise-free feature map from the association between the points and the region descriptors:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the feature of the noise-free feature map.
Further, the specific steps of generating the scene guide feature map in step 2 are:
Step 26: reduce the feature dimensionality of the feature map F5 with a 1x1 convolution to obtain local features, then pass them through spatial global average pooling and a 1x1 convolution in sequence to obtain a global scene descriptor;
Step 27: fuse the local features with the global scene descriptor to obtain the scene guide feature map Fg.
Further, step 3 specifically comprises:
Step 31: convolve the noise-free feature map with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3;
Step 32: concatenate the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

Step 33: convolve the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsample it to the size of the visible light remote sensing image by bilinear interpolation, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
Further, the feature extraction network ResNet-101 comprises layer 0 res0, layer 1 res1, layer 2 res2, layer 3 res3 and layer 4 res4;
layer 0 res0 comprises three 3x3 convolutional layers, layer 1 res1 comprises 2 bottleneck layers, layer 2 res2 comprises 3 bottleneck layers, layer 3 res3 comprises 22 bottleneck layers, and layer 4 res4 comprises 2 bottleneck layers.
Further, the bottleneck layers in layer 1 res1 and layer 2 res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, while the bottleneck layers in layer 3 res3 and layer 4 res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
Further, the dilation rate of the dilated convolutional layers is 2.
Further, the decoder comprises 3x3 convolutional layers, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
According to the above technical scheme, compared with the prior art, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism. With an encoder that extracts semantic features, a self-attention module that strengthens the internal relations within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the method enlarges the receptive field of the model, adapts to the scale changes of the data, and alleviates the class imbalance problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a remote sensing image semantic segmentation method based on a region description self-attention mechanism provided by the invention.
Fig. 2 is a specific flowchart of the processing performed by the self-attention module according to the present invention.
Fig. 3 is a specific flowchart of the processing performed by the decoder according to the present invention.
Fig. 4 is a detailed structural diagram of an encoder provided by the present invention.
Fig. 5 is a detailed block diagram of a decoder according to the present invention.
FIG. 6 is a region descriptor attention map provided by the present invention. The first column shows the visible light remote sensing image; the second column shows the first region descriptor, which describes the features of buildings; the third column shows the second region descriptor, which describes the features of vegetation; the fourth column shows the third region descriptor, which describes the features of roads.
FIG. 7 is a graph showing the effect of a comparative test on the consistency of same-class targets. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 8 is a graph of the effect of a comparative experiment on small-target performance. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 9 is a graph illustrating the effect of a comparative experiment on the segmentation of multiple complex scenes. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 10 is a graph showing the effect of a comparative experiment on the accuracy of edge segmentation. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which comprises the following steps:
Step 1: input the visible light remote sensing image into an encoder and extract its high-level semantic features, where the encoder adopts the feature extraction network ResNet-101; the fourth layer outputs the feature map F4 and the fifth layer outputs the feature map F5.
Step 2: the self-attention module uses the ground truth as supervision to find the region descriptors. The core idea is to divide the global context pixels into several soft object regions under the supervision of the ground truth, where each region corresponds to one class in the ground truth. Within each soft region, a representation of the whole current region is obtained by aggregating the pixels in the region; this representation is the region descriptor. Finally, for each pixel, a weighted sum over the region descriptors with the pixel's attention weights yields a new feature of the pixel related to the region descriptors.
As shown in fig. 2, specifically:
Step 21: the feature map F4 roughly divides the predicted object regions: according to the ground truth, divide the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category. A cross-entropy loss function is used during training to learn the soft region generation from the ground truth, which can be regarded as introducing an auxiliary loss branch.
Step 22: weight and aggregate the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i, and \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, normalized with softmax so that the probabilities of all pixels in the whole image belonging to the k-th soft region sum to 1.
Apply a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t.
Step 23: using softmax on the dimension-reduced feature map F5, obtain the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight.
Step 24: after reducing the dimensionality of the feature map F5, obtain the degree of association between each pixel and the regions from the region-descriptor-related features of each pixel:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions.
Step 25: the final pixel feature consists of two parts, one being the original feature x_i and the other being the feature y_i represented by the weighted sum of the region descriptors; the resulting feature of the noise-free feature map is z_i:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, and x_i denotes the feature of pixel p_i in the feature map F4.
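To make steps 21-25 concrete, the following is a minimal PyTorch sketch of the region-descriptor self-attention module. The patent publishes no implementation, so the module name, channel sizes and the dot-product similarity used for the attention weights are illustrative assumptions; for simplicity the sketch also operates on a single input feature map, whereas the patent aggregates descriptors from F4 and computes the attention on a dimension-reduced F5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionDescriptorAttention(nn.Module):
    """Sketch of the region-descriptor self-attention (steps 21-25).
    Channel sizes and region count are assumptions, not patent values."""

    def __init__(self, in_ch=2048, mid_ch=512, num_regions=6):
        super().__init__()
        # Step 21: per-pixel soft-region scores, supervised by the ground
        # truth through an auxiliary cross-entropy loss.
        self.region_head = nn.Conv2d(in_ch, num_regions, kernel_size=1)
        # 1x1 transform turning coarse descriptors f_k into descriptors r_t.
        self.descriptor_transform = nn.Conv1d(in_ch, mid_ch, kernel_size=1)
        # Dimension reduction of the pixel features before the attention.
        self.pixel_transform = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # rho(.) applied to the descriptor-weighted sum y_i.
        self.rho = nn.Conv2d(mid_ch, mid_ch, kernel_size=1)
        # g(.): 1x1 convolution + BN + ReLU over the concatenation [x_i, y_i].
        self.g = nn.Sequential(
            nn.Conv2d(in_ch + mid_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        region_logits = self.region_head(x)                      # (B, K, H, W)
        # Normalize over space so each region's probabilities over the whole
        # image sum to 1 (softmax over the HW positions).
        m = F.softmax(region_logits.view(b, -1, h * w), dim=2)   # (B, K, HW)
        # Step 22: coarse descriptors f_k = sum_i m_{k,i} x_i, then 1x1 conv.
        feats = x.view(b, c, h * w)                              # (B, C, HW)
        f = torch.bmm(m, feats.transpose(1, 2))                  # (B, K, C)
        r = self.descriptor_transform(f.transpose(1, 2))         # (B, C', K)
        # Step 23: W_it = softmax_t(x_i . r_t) on reduced pixel features.
        p = self.pixel_transform(x).view(b, -1, h * w)           # (B, C', HW)
        att = F.softmax(torch.bmm(p.transpose(1, 2), r), dim=2)  # (B, HW, K)
        # Step 24: y_i = rho(sum_t W_it * phi(r_t)); phi is folded into the
        # descriptor transform above for brevity.
        y = torch.bmm(att, r.transpose(1, 2))                    # (B, HW, C')
        y = self.rho(y.transpose(1, 2).reshape(b, -1, h, w))
        # Step 25: z_i = g([x_i, y_i]).
        z = self.g(torch.cat([x, y], dim=1))
        return z, region_logits  # logits feed the auxiliary loss
```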
The process of extracting the global scene based on the feature map F5 is as follows:
the feature map F5 output by the fifth layer of ResNet-101 undergoes feature dimensionality reduction through a 1x1 convolution to obtain the local features {x_i}; spatial global average pooling followed by another 1x1 convolution then yields a global scene descriptor g(x); the local features {x_i} are fused with the global vector g(x) to obtain the scene guide feature map Fg.
Step 3: input the scene guide feature map and the noise-free feature map into a decoder, upsample back to the original image size, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image. The specific steps, as shown in fig. 3, are:
Step 31: convolve the noise-free feature map output by the self-attention module with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3.
Step 32: concatenate the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

where d1, d2 and d3 denote the different dilation rates of the dilated convolutions, with d1 = 16, d2 = 24 and d3 = 36. Finally, convolve the fused feature map H with a 3x3 convolution to obtain a multi-scale feature map with attention, upsample it to the same size as the original image by bilinear interpolation, and then classify pixel by pixel to obtain the final semantic segmentation result.
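A minimal sketch of this decoder path is given below; only the dilation rates (16, 24, 36), the 3x3 convolutions and the bilinear upsampling come from the text, while the channel counts and class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder (steps 31-33): parallel dilated convolutions over
    the noise-free map, concatenation with the scene guide map, and bilinear
    upsampling back to the input size."""

    def __init__(self, in_ch=512, mid_ch=256, num_classes=6):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        # Three dilated 3x3 convolutions with the rates given in the patent.
        self.d1 = nn.Conv2d(in_ch, mid_ch, 3, padding=16, dilation=16)
        self.d2 = nn.Conv2d(in_ch, mid_ch, 3, padding=24, dilation=24)
        self.d3 = nn.Conv2d(in_ch, mid_ch, 3, padding=36, dilation=36)
        # Fusion over [F_g; F_c; F_d1; F_d2; F_d3]; F_g assumed in_ch wide.
        self.fuse = nn.Conv2d(mid_ch * 4 + in_ch, mid_ch, 3, padding=1)
        self.classify = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, z, fg, out_size):
        # Step 31: ordinary and dilated convolutions in parallel.
        fc = self.conv3x3(z)
        fd1, fd2, fd3 = self.d1(z), self.d2(z), self.d3(z)
        # Step 32: H = [F_g; F_c; F_d1; F_d2; F_d3] (channel concatenation).
        h = self.fuse(torch.cat([fg, fc, fd1, fd2, fd3], dim=1))
        # Step 33: pixel-wise classification, then bilinear upsampling.
        logits = self.classify(h)
        return F.interpolate(logits, size=out_size,
                             mode='bilinear', align_corners=False)
```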
As shown in fig. 4, the encoder structure consists mainly of 3 convolutional layers and 29 bottleneck layers that extract the features of the image, where the last 24 bottleneck layers replace ordinary convolution with dilated convolution with a dilation rate of 2.
The invention adopts the deep residual network ResNet-101 as the encoder because, thanks to its residual structure, ResNet has stronger feature extraction capability than VGG and GoogLeNet. Meanwhile, considering the large scale range of remote sensing images, ordinary convolution is replaced by dilated convolution in deep-level feature extraction.
The original ResNet-101 starts with a 7x7 convolution; to save computation while obtaining the same receptive field as the 7x7 convolution, it is replaced by a stack of three 3x3 convolutions, reducing the number of convolution kernel parameters per channel from 49 to 27. This stack can be referred to as layer 0 res0. The subsequent layers 1, 2, 3 and 4 are all composed of bottleneck layers: layer 1 res1 comprises 1 basic block (BasicBlock) and 2 bottleneck layers, layer 2 res2 comprises 1 basic block and 3 bottleneck layers, layer 3 res3 comprises 1 basic block and 22 bottleneck layers, and layer 4 res4 comprises 1 basic block and 2 bottleneck layers. The deeper network structure extracts the complex features in remote sensing images better, and the residual structure prevents the degradation problem caused by an overly deep network.
When extracting high-level features, pooling is often used to enlarge the receptive field, but its drawback is that small objects are easily lost. Since small targets in remote sensing images often deserve extra attention, and to prevent small-target information from becoming impossible to reconstruct, the pooling layers used by ResNet-101 in deep-level feature extraction are removed and ordinary convolution is replaced by dilated convolution.
In layer 3 res3 and layer 4 res4, the pooling layers used for downsampling are removed, and the ordinary 3x3 convolutions in all bottleneck layers are replaced by dilated convolutions with the dilation rate set to 2, achieving the same receptive-field enlargement as the removed pooling layers. In layer 3 res3 and layer 4 res4, each bottleneck first reduces the dimensionality of the high-dimensional features with a 1x1 convolution, then extracts features with the dilated convolution, then maps the feature dimensionality back with another 1x1 convolution, and finally adds the result directly to the original input to obtain the bottleneck output.
Because the model uses ResNet-101 only for feature extraction, the final average pooling layer and fully connected layer of the original ResNet-101 are removed and replaced by a 3x3 convolutional layer.
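The modified bottleneck of layers res3 and res4 can be sketched as follows; the dilation rate of 2 comes from the text, while the channel counts follow common ResNet conventions and are assumptions.

```python
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Sketch of a res3/res4 bottleneck: 1x1 reduce, dilated 3x3, 1x1 expand,
    with an identity shortcut and no downsampling."""

    def __init__(self, channels=1024, reduced=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, 1, bias=False),      # reduce dims
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, 3, padding=dilation,  # dilated conv
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1, bias=False),      # map dims back
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual addition with the original input.
        return self.relu(self.body(x) + x)
```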
As shown in FIG. 5, considering the large scale variation of remote sensing images, the decoder uses three dilated convolutions with different dilation rates to process the feature maps in parallel, capturing objects and context information in the image at multiple scales, so that multi-scale images obtain a better segmentation effect. Moreover, remote sensing images have the characteristic of large and rich scenes, so classifying the overall scene of a remote sensing image can help its semantic segmentation. For example, if the scene is a city, buildings and roads are more likely to appear in it, while the probability of oil tanks and airplanes appearing in a city scene is very low. Therefore, the decoder design not only considers the multi-scale problem of targets but also introduces a global scene descriptor to help the model make better predictions.
Experimental verification section:
Table 1 compares the algorithm of the present invention with methods commonly used in semantic segmentation tasks; the method designed by the present invention achieves the highest score on the DroneDeploy remote sensing image dataset.
Table 1 Performance comparison with existing algorithms
Note: * marks methods with an added self-attention mechanism.
Among the models without a self-attention mechanism, RefineNet combines coarse high-level semantic features with fine low-level features; PSPNet adopts a pyramid pooling module (PPM) to aggregate the context information of different regions; DeepLab fuses segmentation results at different resolutions with an atrous spatial pyramid pooling (ASPP) module; DenseASPP connects ASPP modules densely in the manner of DenseNet, yielding a larger receptive field and denser sampling points. What they have in common is the fusion of feature maps at different levels and the use of context information to generate a larger receptive field.
The self-attention mechanism is also a good way to enlarge the receptive field: by associating each position with all information globally, information can be transmitted across the whole image, giving a global receptive field. DANet combines spatial correlation with channel correlation; PSANet transmits information through the whole feature map in two steps, collection and distribution; EncNet encodes context information and takes scene characteristics into consideration; CCNet acquires global context information with a criss-cross structure; OCNet aggregates pixels and then partitions them. All of these are extensions of the self-attention mechanism, and it can be observed that the asterisked algorithms with a self-attention mechanism outperform the algorithms without one overall, which shows the effectiveness of the self-attention mechanism for the semantic segmentation task.
In fig. 6, the region descriptor attention maps of the intermediate results are visualized; each map represents the attention weights of a different region descriptor when reconstructing the feature map. Each region descriptor corresponds to specific semantic information, not merely to the foreground or background.
In fig. 7, it can be seen that the algorithm of the invention brings great advantages in segmenting objects whose pixels have different features but belong to the same class. In the first example, the pixels belonging to the same land class have two general appearances, a dark color scheme and a light color scheme. Owing to this feature difference, the reference model splits them into two classes, although in reality both belong to the same land class; the algorithm of the invention correctly groups them into one class.
In fig. 8, the algorithm of the invention segments small targets very accurately, and the assigned labels are also correct. By contrast, the reference model struggles to extract enough features from small targets, is easily disturbed by redundant features, often segments small targets defectively, and tends to merge them into one mass.
In fig. 9, the images share the characteristics of complex scenes, a wide variety of classes, and unclear boundaries between different objects. In such complex scenes, the algorithm of the invention segments objects in an orderly manner, in clear contrast to the disordered segmentation of the reference model.
In fig. 10, the algorithm of the invention is more accurate at the edges of object segmentation than the reference model. This benefit comes from reconstructing each pixel with the region descriptors: with distinct essential features, the boundaries between different objects can be distinguished more accurately.
Therefore, the final predictions of the region-based self-attention model achieve a better segmentation effect than the reference model without a self-attention mechanism; the advantage of the algorithm of the invention is especially clear in the regions marked with white boxes in the figures.
The invention has the following advantages:
1) The method introduces the self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, lets information flow globally, and discovers long-range dependencies in the remote sensing image, which can greatly help the semantic features of remote sensing images.
2) The point-to-point self-attention mechanism is extended into a point-to-region self-attention mechanism, with which a noise-free feature map can be reconstructed.
The core idea is to map the data from a noisy high-dimensional space onto a compact subspace that captures the most essential semantic concepts, then compute the degree of correlation between each pixel point and the captured semantic features, and finally reassign a value to each pixel point with this set of descriptors, thereby reconstructing a noise-free feature map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A remote sensing image semantic segmentation method based on a region description self-attention mechanism, characterized by comprising the following steps:
Step 1: inputting a visible light remote sensing image into an encoder, extracting high-level semantic features, and obtaining feature maps of different levels;
Step 2: based on the feature maps of different levels, respectively performing global scene extraction and self-attention-based essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling back to the size of the visible light remote sensing image, and classifying pixel by pixel to obtain a semantic segmentation result of the remote sensing image.
2. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 1, wherein the encoder comprises a feature extraction network ResNet-101.
3. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 2, wherein the visible light remote sensing image I is input into the feature extraction network ResNet-101, the fourth layer of which outputs a feature map F4 and the fifth layer of which outputs a feature map F5.
4. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 3, wherein the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: according to the ground truth, dividing the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category and N denotes the set of soft regions;
Step 22: weighting and aggregating the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels of the feature map F4;
applying a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t;
Step 23: after reducing the dimensionality of the feature map F5, obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight;
Step 24: based on the self-attention weights, computing the degree of association between each point and the regions:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map from the association between the points and the region descriptors:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the feature of the noise-free feature map.
5. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 4, wherein the specific steps of generating the scene guide feature map in step 2 are:
Step 26: reducing the feature dimensionality of the feature map F5 with a 1x1 convolution to obtain local features, then passing them through spatial global average pooling and a 1x1 convolution in sequence to obtain a global scene descriptor;
Step 27: fusing the local features with the global scene descriptor to obtain the scene guide feature map Fg.
6. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 5, wherein step 3 specifically comprises:
Step 31: convolving the noise-free feature map with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3;
Step 32: concatenating the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

Step 33: convolving the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and classifying pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
7. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 6, wherein the feature extraction network ResNet-101 comprises layer 0 res0, layer 1 res1, layer 2 res2, layer 3 res3 and layer 4 res4;
layer 0 res0 comprises three 3x3 convolutional layers, layer 1 res1 comprises 2 bottleneck layers, layer 2 res2 comprises 3 bottleneck layers, layer 3 res3 comprises 22 bottleneck layers, and layer 4 res4 comprises 2 bottleneck layers.
8. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 7, wherein the bottleneck layers in layer 1 res1 and layer 2 res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, and the bottleneck layers in layer 3 res3 and layer 4 res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
9. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 8, wherein the dilation rate of the dilated convolutional layers is 2.
10. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to any one of claims 1 to 9, wherein the decoder comprises 3x3 convolutional layers, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010732126.3A (granted as CN111932553B) | 2020-07-27 | 2020-07-27 | Remote sensing image semantic segmentation method based on area description self-attention mechanism
Publications (2)
Publication Number | Publication Date
---|---
CN111932553A | 2020-11-13
CN111932553B | 2022-09-06
Family
ID=73315343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202010732126.3A | Remote sensing image semantic segmentation method based on area description self-attention mechanism | 2020-07-27 | 2020-07-27
Country Status (1)
Country | Link
---|---
CN | CN111932553B (en)
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant