CN111932553B - Remote sensing image semantic segmentation method based on area description self-attention mechanism - Google Patents

Remote sensing image semantic segmentation method based on area description self-attention mechanism

Info

Publication number
CN111932553B
CN111932553B
Authority
CN
China
Prior art keywords
feature map
layer
remote sensing
sensing image
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010732126.3A
Other languages
Chinese (zh)
Other versions
CN111932553A (en)
Inventor
赵丹培
王晨旭
史振威
姜志国
张浩鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010732126.3A priority Critical patent/CN111932553B/en
Publication of CN111932553A publication Critical patent/CN111932553A/en
Application granted granted Critical
Publication of CN111932553B publication Critical patent/CN111932553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/11 — PHYSICS; COMPUTING; IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06F 18/241 — ELECTRIC DIGITAL DATA PROCESSING; Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/047 — Neural networks; Probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/10032 — Indexing scheme for image analysis or image enhancement; Image acquisition modality; Satellite or aerial image; Remote sensing

Abstract

The invention discloses a remote sensing image semantic segmentation method based on a region-description self-attention mechanism. The method comprises: inputting a visible light remote sensing image into an encoder, extracting its high-level semantic features, and obtaining feature maps of different levels; performing global scene extraction and self-attention-based intrinsic feature extraction on the feature maps of different levels to obtain, correspondingly, a scene guide feature map and a noise-free feature map; and inputting the scene guide feature map and the noise-free feature map into a decoder, which upsamples them back to the original image size and classifies pixel by pixel to obtain the semantic segmentation result of the remote sensing image. By combining an encoder that extracts semantic features, a self-attention module that strengthens internal connections within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the invention enlarges the receptive field of the model, adapts to scale changes in the data, and alleviates the class-imbalance problem.

Description

Remote sensing image semantic segmentation method based on area description self-attention mechanism
Technical Field
The invention relates to the technical field of remote sensing and computer vision, in particular to remote sensing image semantic segmentation based on a region description self-attention mechanism.
Background
When semantic segmentation technology is applied to the field of remote sensing, a remote sensing image can be taken as input and a class label can be output for every pixel in the image, which greatly helps the understanding of remote sensing imagery. For example, in land-use planning, if the surface coverage type (city, road, forest, farmland, river, etc.) of each pixel on a satellite image can be identified, the distribution and area of each type can be known clearly, which benefits overall planning. As another example, intelligent identification of buildings makes it possible to quickly discover illegal constructions and greatly reduce labor consumption.
At present, most deep-learning work on semantic segmentation derives from the Fully Convolutional Network (FCN). The FCN converts well-known classification networks into fully convolutional networks by replacing the fully connected layers with convolutional layers, and fuses deep features with shallow features through deconvolution operations, so that both high-level semantic features and low-level positional features are taken into account and the accuracy of the model is improved. However, the FCN still cannot be applied to all scenes, mainly because the local nature of convolution leads to a small receptive field, so that contextual information over a larger range cannot be considered. To solve these problems, a family of semantic segmentation networks has been derived on the basis of the FCN.
Traditional remote sensing image semantic segmentation extracts features only by stacking convolutions. Owing to the physical properties of the convolution filter, the receptive field of a convolution operation is limited: the filter is designed to capture local features and relations. In other words, the information flow in a convolutional neural network is only propagated within local regions, which greatly impairs the understanding of complex scenes. This is especially true for remote sensing tasks, where the scenes are numerous and complex and long-range dependencies need to be gathered to help predict the current position. Therefore, the invention introduces, for the first time, the self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarging the receptive field of the model, allowing information to flow globally, and discovering long-range dependencies in the remote sensing image, such as the containment relation between an airplane and an airport, or the relation that an automobile tends to have with other automobiles or roads; such relations provide great help for the semantic features of the remote sensing image.
However, the conventional point-to-point self-attention mechanism augments context information by calculating the relationship between each pixel and every other pixel. Its drawback is that the point-to-point similarity it computes focuses on objects with identical characteristics. For example, a red car and another red car will produce a considerable degree of association, but since the label is only the car category, one would rather have the red car establish associations with all cars and even with roads. In other words, for the present task of identifying vehicles, the color features are redundant: color in the feature map is noise, the color of a vehicle does not help the current task, and feature differences may even produce the opposite effect.
Therefore, how to provide a remote sensing image semantic segmentation method based on a region description self-attention mechanism is a problem which needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which introduces a self-attention mechanism based on region descriptors into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, allows information to flow globally, and can discover long-range dependencies in the remote sensing image.
In order to achieve the purpose, the invention adopts the following technical scheme:
The remote sensing image semantic segmentation method based on the area description self-attention mechanism comprises the following steps:
Step 1: inputting a visible light remote sensing image into an encoder, extracting high-level semantic features of the visible light remote sensing image, and obtaining feature maps of different levels;
Step 2: performing global scene extraction and self-attention-based intrinsic feature extraction, respectively, on the feature maps of different levels, to obtain a scene guide feature map and a noise-free feature map correspondingly;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible light remote sensing image, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
Further, the encoder includes a feature extraction network ResNet-101.
Further, the visible light remote sensing image I is input into the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5.
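As an illustration only, the following minimal PyTorch sketch shows how two backbone feature maps corresponding to F4 and F5 might be taken from a ResNet-101. It uses torchvision's standard ResNet-101 stages (layer3 and layer4) as a stand-in and therefore omits the modified 3x3 stem and the dilated res3/res4 stages described later in this document; the class and variable names are assumptions of the sketch, not part of the patent.

```python
import torch
import torchvision


class Encoder(torch.nn.Module):
    """Minimal sketch: return fourth- and fifth-level feature maps (F4, F5)."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        # stem + layer1/layer2 prepare the input; layer3 yields F4, layer4 yields F5
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

    def forward(self, x):
        x = self.stem(x)
        x = self.layer2(self.layer1(x))
        f4 = self.layer3(x)    # F4: 1024 channels
        f5 = self.layer4(f4)   # F5: 2048 channels
        return f4, f5


# usage: f4, f5 = Encoder()(torch.randn(1, 3, 512, 512))
```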
Further, the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: dividing the feature map F4 into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category and N denotes the set of soft regions;
Step 22: weighting and aggregating the pixel values within each soft region to obtain a coarse region descriptor of the current region, calculated as:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels taken from the feature map F4;
the coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t;
Step 23: reducing the dimension of the feature map F5, and obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight;
Step 24: based on the self-attention weights, calculating the degree of association between each point and the regions:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map based on the degree of association between the points and the region descriptors:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the representation of the noise-free feature map.
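For illustration only, the following PyTorch sketch implements steps 21-25 as described above. The channel widths, the number of soft regions, the specific forms of δ(·), ρ(·) and of the pixel–descriptor relation (an inner product here), and the assumption that F4 and F5 share the same spatial resolution (as in the dilated encoder described later) are all assumptions of the sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn


class RegionDescriptorAttention(nn.Module):
    """Sketch of steps 21-25: soft regions -> region descriptors -> pixel-region
    attention -> noise-free feature map. Assumes F4 and F5 have equal spatial size."""

    def __init__(self, c4=1024, c5=2048, dim=256, num_regions=6):
        super().__init__()
        # step 21: per-pixel soft-region scores (supervised by the ground truth)
        self.region_head = nn.Conv2d(c4, num_regions, 1)
        # 1x1 transform producing the final region descriptors r_t (here T = K)
        self.to_descriptor = nn.Conv1d(c4, dim, 1)
        # dimension reduction of F5 before the pixel-region relation (step 23)
        self.reduce_f5 = nn.Sequential(nn.Conv2d(c5, dim, 1),
                                       nn.BatchNorm2d(dim), nn.ReLU())
        # delta(.) and rho(.) of step 24 (illustrative choices)
        self.delta = nn.Conv1d(dim, dim, 1)
        self.rho = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                 nn.BatchNorm2d(dim), nn.ReLU())
        # g(.) of step 25: 1x1 convolution + batch norm + ReLU on [x_i ; y_i]
        self.g = nn.Sequential(nn.Conv2d(c4 + dim, dim, 1),
                               nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, f4, f5):
        b, _, h, w = f4.shape
        scores = self.region_head(f4)                               # B x K x H x W
        # steps 21/22: alpha_ik normalised over pixels, f_k = sum_i alpha_ik x_i
        alpha = scores.flatten(2).softmax(dim=-1)                   # B x K x HW
        coarse = torch.bmm(alpha, f4.flatten(2).transpose(1, 2))    # B x K x C4
        r = self.to_descriptor(coarse.transpose(1, 2))              # B x dim x T
        # step 23: W_it = softmax_t( <reduced pixel feature, r_t> )
        q = self.reduce_f5(f5).flatten(2)                           # B x dim x HW
        attn = torch.bmm(q.transpose(1, 2), r).softmax(dim=-1)      # B x HW x T
        # step 24: y_i = rho( sum_t W_it * delta(r_t) )
        y = torch.bmm(attn, self.delta(r).transpose(1, 2))          # B x HW x dim
        y = self.rho(y.transpose(1, 2).reshape(b, -1, h, w))        # B x dim x H x W
        # step 25: noise-free feature z_i = g([x_i ; y_i])
        z = self.g(torch.cat([f4, y], dim=1))
        return z, scores   # scores feed the auxiliary ground-truth supervision
```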
Further, the specific steps of generating the scene guide feature map in step 2 are as follows:
Step 26: performing feature dimension reduction on the feature map F5 through a 1x1 convolution to obtain local features, and then applying spatial global average pooling followed by a 1x1 convolution to the local features to obtain a global scene descriptor;
Step 27: fusing the local features and the global scene descriptor to obtain the scene guide feature map F_g.
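A minimal sketch of steps 26-27 is given below. The reduced channel width and the choice of fusion by broadcast addition are assumptions of the sketch; the patent only states that the local features and the global scene descriptor are fused.

```python
import torch
import torch.nn as nn


class SceneGuidance(nn.Module):
    """Sketch of steps 26-27: 1x1 reduction of F5, global average pooling plus a
    second 1x1 convolution for the scene descriptor, then fusion into F_g."""

    def __init__(self, c5=2048, dim=256):
        super().__init__()
        self.local = nn.Conv2d(c5, dim, 1)       # 1x1 reduction -> local features x_i
        self.global_fc = nn.Conv2d(dim, dim, 1)  # 1x1 convolution on the pooled vector

    def forward(self, f5):
        x = self.local(f5)                                     # B x dim x H x W
        g = self.global_fc(x.mean(dim=(2, 3), keepdim=True))   # B x dim x 1 x 1
        return x + g                                           # scene guide feature map F_g
```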
Further, the step 3 specifically comprises:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated (atrous) convolutions with different dilation rates to the noise-free feature map, to obtain feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
Step 33: applying a 3x3 convolution to the fused feature map to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
Further, the feature extraction network ResNet-101 comprises a 0th layer res0, a 1st layer res1, a 2nd layer res2, a 3rd layer res3 and a 4th layer res4;
the 0th layer res0 comprises three 3x3 convolution layers, the 1st layer res1 comprises 2 bottleneck layers, the 2nd layer res2 comprises 3 bottleneck layers, the 3rd layer res3 comprises 22 bottleneck layers, and the 4th layer res4 comprises 2 bottleneck layers.
Further, the bottleneck layers in the 1st layer res1 and the 2nd layer res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, while the bottleneck layers in the 3rd layer res3 and the 4th layer res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
Further, the dilation rate of the dilated convolutional layer is 2.
Further, the decoder comprises a 3x3 convolutional layer, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
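The decoder described in steps 31-33 and in the preceding paragraph could be sketched as follows. The dilation rates 16, 24 and 36 follow the values given above; the channel width and the number of output classes are illustrative assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Decoder(nn.Module):
    """Sketch of steps 31-33: parallel 3x3 and dilated 3x3 convolutions on the
    noise-free map, concatenation with the scene guide map, 3x3 fusion, bilinear
    upsampling to the input size and pixel-by-pixel classification."""

    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)               # F_c
        self.d1 = nn.Conv2d(dim, dim, 3, padding=16, dilation=16)   # F_d1
        self.d2 = nn.Conv2d(dim, dim, 3, padding=24, dilation=24)   # F_d2
        self.d3 = nn.Conv2d(dim, dim, 3, padding=36, dilation=36)   # F_d3
        self.fuse = nn.Conv2d(dim * 5, dim, 3, padding=1)           # 3x3 fusion convolution
        self.classify = nn.Conv2d(dim, num_classes, 1)              # per-pixel classifier

    def forward(self, z, f_g, out_size):
        h = torch.cat([f_g, self.conv(z), self.d1(z), self.d2(z), self.d3(z)], dim=1)
        h = self.fuse(h)                                            # multi-scale feature map
        h = F.interpolate(h, size=out_size, mode='bilinear', align_corners=False)
        return self.classify(h)                                     # logits per pixel


# usage sketch: logits = Decoder()(z, f_g, out_size=(512, 512))
```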
According to the above technical scheme, compared with the prior art, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism. Through an encoder that extracts semantic features, a self-attention module that strengthens internal connections within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the receptive field of the model is enlarged, the method can adapt to scale changes in the data, and the class-imbalance problem can be alleviated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a remote sensing image semantic segmentation method based on a region description self-attention mechanism provided by the invention.
Fig. 2 is a specific flowchart of the processing performed by the self-attention module according to the present invention.
Fig. 3 is a specific flowchart of the processing performed by the decoder according to the present invention.
Fig. 4 is a detailed structural diagram of an encoder provided by the present invention.
Fig. 5 is a detailed block diagram of a decoder according to the present invention.
FIG. 6 is a regional descriptor attention diagram provided by the present invention. The first column represents a visible light remote sensing image, the first area descriptor in the second column describes the characteristics of buildings, the second area descriptor in the third column describes the characteristics of vegetation, and the third area descriptor in the fourth column describes the characteristics of roads.
FIG. 7 is a graph showing the effect of a comparative experiment on the consistency of objects of the same class. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 8 is a graph showing the effect of a comparative experiment on small-target performance. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 9 is a graph showing the effect of a comparative experiment on the segmentation of multiple complex scenes. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 10 is a graph showing the effect of a comparative experiment on the accuracy of edge segmentation. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which comprises the following steps:
Step 1: inputting the visible light remote sensing image into an encoder and extracting its high-level semantic features, wherein the encoder adopts the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5.
Step 2: the self-attention module uses the ground truth as supervision to find the region descriptors. The core idea is to divide the pixels of the global context into a number of soft object regions under the supervision of the ground truth, where each region corresponds to a class in the ground truth. Within each soft region, a representation of the region as a whole is obtained by aggregating the pixels in the region; this representation is the region descriptor. Finally, each pixel is weighted and summed with the attention weights over the region descriptors to obtain new features of the pixel related to the region descriptors.
As shown in fig. 2, specifically:
Step 21: the feature map F4 is used to roughly predict the object regions, and is divided into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category; a cross-entropy loss function is used during training to learn the soft-region generation from the ground truth, which can be regarded as introducing an auxiliary loss branch.
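As an illustration of the auxiliary loss branch mentioned above, the soft-region scores predicted from F4 could be supervised against the ground-truth labels as sketched below; the interpolation to label resolution and the loss weight of 0.4 are assumptions of the sketch, not values given in the patent.

```python
import torch.nn.functional as F


def auxiliary_region_loss(region_scores, labels, weight=0.4):
    """region_scores: B x K x h x w raw soft-region scores; labels: B x H x W class ids."""
    scores = F.interpolate(region_scores, size=labels.shape[-2:],
                           mode='bilinear', align_corners=False)
    # cross-entropy supervision learns the soft-region generation from the ground truth
    return weight * F.cross_entropy(scores, labels)
```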
Step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, wherein the calculation formula is as follows:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i, and \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region; the probabilities are normalized with softmax so that, over the whole image, the probabilities of all pixels belonging to the k-th soft region sum to 1.
The coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t.
Step 23: after reducing the dimension of the feature map F5, the self-attention weight of each pixel with respect to the region descriptors is computed with softmax from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight.
Step 24: after reducing the dimension of the feature map F5, the degree of association between each pixel and the regions is obtained from the features of each pixel related to the region descriptors:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions.
Step 25: the final pixel feature is composed of two parts, one being the original feature x_i and the other being the feature y_i obtained from the weighted sum over the region descriptors; the representation of the resulting noise-free feature map is z_i:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, and x_i denotes the feature of pixel p_i in the feature map F4.
The process of extracting the global scene based on the feature map F5 is as follows:
The feature map F5 output by the fifth layer of ResNet-101 undergoes feature dimension reduction through a 1x1 convolution to obtain local features x_i; a global scene descriptor g(x) is then obtained by spatial global average pooling followed by another 1x1 convolution; and the local features {x_i} are fused with the global vector g(x) to obtain the scene-guided feature map F_g.
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling back to the original image size, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image; the specific steps are shown in fig. 3:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated convolutions with different dilation rates to the noise-free feature map output by the self-attention module, to obtain the feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
where d1, d2 and d3 denote the different dilation rates of the dilated convolutions, with d1 = 16, d2 = 24 and d3 = 36. Finally, a 3x3 convolution is applied to the fused feature map H to obtain a multi-scale feature map with attention, which is upsampled to the original image size by bilinear interpolation and then classified pixel by pixel to obtain the final semantic segmentation result.
As shown in fig. 4, the encoder is mainly composed of 3 convolutional layers and 29 bottleneck layers for extracting image features, wherein the last 24 bottleneck layers replace the ordinary convolution with a dilated convolution whose dilation rate is 2.
The invention adopts the deep residual network ResNet-101 as the encoder, because the residual structure gives ResNet a stronger feature extraction capability than VGG and GoogLeNet. Meanwhile, considering the large scale range of remote sensing images, ordinary convolutions are replaced by dilated convolutions in deep-level feature extraction.
The original ResNet-101 starts with a 7x7 convolution; to save computation while achieving the same receptive field size as the 7x7 convolution, it is replaced by a stack of three 3x3 convolutions, reducing the number of convolution kernel parameters per channel from 49 to 27; this stack can be referred to as the 0th layer res0. The following layers 1, 2, 3 and 4 are all composed of bottleneck layers: the 1st layer res1 comprises 1 basic block (BasicBlock) and 2 bottleneck layers, the 2nd layer res2 comprises 1 basic block and 3 bottleneck layers, the 3rd layer res3 comprises 1 basic block and 22 bottleneck layers, and the 4th layer res4 comprises 1 basic block and 2 bottleneck layers. The deeper network structure can better extract the complex features in remote sensing images, and the residual structure prevents the degradation problem caused by an overly deep network.
In high-level feature extraction, pooling is often used to enlarge the receptive field, but its disadvantage is that small targets are easily lost. Because small targets in remote sensing images often deserve more attention, and to prevent small-target information from becoming unrecoverable, the pooling layers used by ResNet-101 in deep-level feature extraction are removed and the ordinary convolutions are replaced by dilated convolutions.
In the 3rd layer res3 and the 4th layer res4, the pooling layer used for downsampling is removed and the ordinary 3x3 convolution in all bottleneck layers is replaced with a dilated convolution whose dilation rate is set to 2, so as to achieve the same receptive-field-enlarging effect as the removed pooling layer. In the 3rd layer res3 and the 4th layer res4, the high-dimensional features are first reduced in dimension by a 1x1 convolution, features are then extracted with the dilated convolution, the feature dimension is restored with another 1x1 convolution, and finally the result is added directly to the original input to obtain the bottleneck output.
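A minimal sketch of such a dilated bottleneck is shown below: a 1x1 reduction, a 3x3 convolution with dilation rate 2 in place of the ordinary 3x3, a 1x1 expansion back to the input width, and a residual addition. The channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DilatedBottleneck(nn.Module):
    """Sketch of a res3/res4-style bottleneck with a dilation-2 convolution."""

    def __init__(self, channels=1024, mid=256, dilation=2):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                    nn.BatchNorm2d(mid), nn.ReLU())
        self.dilated = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=dilation,
                                               dilation=dilation),
                                     nn.BatchNorm2d(mid), nn.ReLU())
        self.expand = nn.Sequential(nn.Conv2d(mid, channels, 1),
                                    nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        # 1x1 reduce -> dilated 3x3 -> 1x1 expand -> residual addition
        return self.relu(x + self.expand(self.dilated(self.reduce(x))))
```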
Because the model uses ResNet-101 only for feature extraction, the final average pooling layer and fully connected layer of the original ResNet-101 are removed and replaced with a 3x3 convolutional layer.
As shown in FIG. 5, considering that remote sensing images exhibit large scale changes, the decoder uses three dilated convolutions with different dilation rates to process the feature maps in parallel, capturing objects and context information in the image at multiple scales so that multi-scale images obtain a better segmentation effect. Moreover, remote sensing images are characterized by large and rich scenes, so classifying the overall scene of a remote sensing image can help its semantic segmentation. For example, if the scene is a city, buildings and roads are more likely to appear in it, while the probability of oil tanks and airplanes appearing in a city scene is very low. Therefore, the design of the decoder not only considers the multi-scale problem of the target but also introduces a global scene descriptor to help the model make better predictions.
Experimental validation section:
Table 1 compares the algorithm of the present invention with methods commonly used in semantic segmentation tasks; the method designed by the present invention achieves the highest score on the DroneDeploy remote sensing image dataset.
Table 1: performance comparison with existing algorithms (the table is reproduced as an image in the original publication).
Note: * marks methods that use a self-attention mechanism.
Among the models without a self-attention mechanism, RefineNet combines higher-level coarse semantic features with lower-level fine semantic features; PSPNet adopts a Pyramid Pooling Module (PPM) to aggregate context information from different regions; DeepLab fuses segmentation results at different resolutions using the Atrous Spatial Pyramid Pooling (ASPP) module; and DenseASPP densely connects ASPP modules in the style of DenseNet, resulting in a larger receptive field and denser sampling points. What they have in common is the fusion of feature maps at different levels and the use of context information to obtain a larger receptive field.
The self-attention mechanism is also a good way to enlarge the receptive field: by associating each position with all global information, information can be propagated across the whole image and the receptive field covers the entire image. DANet combines spatial correlation with channel correlation; PSANet uses two steps, distribution and collection, to propagate information across the whole image; EncNet encodes context information, taking scene characteristics into consideration; CCNet acquires global context information using a criss-cross structure; OCNet aggregates pixels and then segments them. These are all extensions of the self-attention mechanism, and it can be observed that the asterisked algorithms with a self-attention mechanism are better overall than the algorithms without one, which shows the effectiveness of the self-attention mechanism for the semantic segmentation task.
In fig. 6, the region descriptor attention diagrams of the intermediate results are visualized, each representing the attention weight of a different region descriptor in reconstructing the feature map. Each region descriptor corresponds to some specific semantic information, not just to the foreground or background.
In fig. 7, it can be seen that the algorithm of the present invention has a great advantage in segmenting objects whose pixels have different characteristics but belong to the same class. In the first example, pixels belonging to the same ground class have two appearances, dark and light. Because of this feature difference, the reference model splits them into two classes, even though both in fact belong to the land class, whereas the algorithm of the invention correctly groups them into one class.
In fig. 8, the algorithm of the present invention segments small objects very accurately, and the assigned labels are also correct. In contrast, the reference model has difficulty extracting enough features from small targets and is easily disturbed by redundant features, so small targets are often segmented incompletely and easily merged together.
In fig. 9, the images share the characteristics that the scenes are complex, the object categories are diverse, and the boundaries between different objects are not clear. In such complex scenes, the algorithm of the invention segments the objects in a regular way, in clear contrast to the disordered segmentation of the reference model.
In fig. 10, the algorithm of the present invention is more accurate at object edges than the reference model. This benefit arises because each pixel is reconstructed from the region descriptors, so the boundaries of different objects can be distinguished more accurately through their different essential characteristics.
Therefore, the final prediction of the region-based self-attention model achieves a better segmentation effect than the reference model without a self-attention mechanism; in particular, the algorithm disclosed by the invention shows great advantages when attention is paid to the regions marked with white boxes in the figures.
The invention has the following advantages:
1) The method introduces a self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, allows information to flow globally, and discovers long-range dependencies in the remote sensing image, which provides great help for the semantic features of the remote sensing image.
2) Starting from the self-attention mechanism that computes point-to-point relations, it is extended to a self-attention mechanism that computes point-to-region relations, and a noise-free feature map can be reconstructed.
The core idea is to map data from a noisy high-dimensional space to a compact subspace that captures the most essential semantic concepts, then calculate the degree of correlation between each pixel and the captured semantic features, and reassign values to each pixel using this set of descriptors so as to reconstruct a noise-free feature map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. The remote sensing image semantic segmentation method based on the area description self-attention mechanism is characterized by comprising the following steps of:
Step 1: inputting the visible light remote sensing image into an encoder, extracting high-level semantic features, and obtaining feature maps of different levels;
Step 2: performing global scene extraction and self-attention-based intrinsic feature extraction, respectively, on the feature maps of different levels, to obtain a scene guide feature map and a noise-free feature map correspondingly;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible light remote sensing image, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image;
the encoder comprises a feature extraction network ResNet-101;
the visible light remote sensing image I is input into the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5;
the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: dividing the feature map F4 into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category and N denotes the set of soft regions;
step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, wherein the calculation formula is as follows:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels taken from the feature map F4;
the coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t;
Step 23: reducing the dimension of the feature map F5, and obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight;
Step 24: based on the self-attention weights, calculating the degree of association between each point and the regions:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map based on the degree of association between the points and the region descriptors:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the representation of the noise-free feature map;
the specific steps of generating the scene guide feature map in step 2 are as follows:
Step 26: performing feature dimension reduction on the feature map F5 through a 1x1 convolution to obtain local features, and then applying spatial global average pooling followed by a 1x1 convolution to the local features to obtain a global scene descriptor;
Step 27: fusing the local features and the global scene descriptor to obtain the scene guide feature map F_g;
the step 3 specifically comprises:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated convolutions with different dilation rates to the noise-free feature map to obtain feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
Step 33: applying a 3x3 convolution to the fused feature map to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
2. The remote sensing image semantic segmentation method based on the area description self-attention mechanism as claimed in claim 1, wherein the feature extraction network ResNet-101 comprises a 0th layer res0, a 1st layer res1, a 2nd layer res2, a 3rd layer res3 and a 4th layer res4;
the 0th layer res0 comprises three 3x3 convolution layers, the 1st layer res1 comprises 2 bottleneck layers, the 2nd layer res2 comprises 3 bottleneck layers, the 3rd layer res3 comprises 22 bottleneck layers, and the 4th layer res4 comprises 2 bottleneck layers.
3. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 2, wherein the bottleneck layers in the 1st layer res1 and the 2nd layer res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, and the bottleneck layers in the 3rd layer res3 and the 4th layer res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
4. The remote sensing image semantic segmentation method based on the area description self-attention mechanism according to claim 3, wherein the dilation rate of the dilated convolutional layer is 2.
5. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to any one of claims 1 to 4, wherein the decoder comprises a 3x3 convolutional layer, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24 and a dilated convolutional layer with a dilation rate of 36.
CN202010732126.3A 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism Active CN111932553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732126.3A CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010732126.3A CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111932553A (en) 2020-11-13
CN111932553B (en) 2022-09-06

Family

ID=73315343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010732126.3A Active CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111932553B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487927B (en) * 2020-11-26 2024-02-13 深圳市人工智能与机器人研究院 Method and system for realizing indoor scene recognition based on object associated attention
CN112528803B (en) * 2020-12-03 2023-12-19 中国地质大学(武汉) Road feature extraction method, device, equipment and storage medium
CN112580649B (en) * 2020-12-15 2022-08-02 重庆邮电大学 Semantic segmentation method based on regional context relation module
CN112699937B (en) * 2020-12-29 2022-06-21 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network
CN112749736B (en) * 2020-12-30 2022-09-13 华南师范大学 Image recognition method, control device and storage medium
CN113065586B (en) * 2021-03-23 2022-10-18 四川翼飞视科技有限公司 Non-local image classification device, method and storage medium
CN113223008A (en) * 2021-04-16 2021-08-06 山东师范大学 Fundus image segmentation method and system based on multi-scale guide attention network
CN113421259B (en) * 2021-08-20 2021-11-16 北京工业大学 OCTA image analysis method based on classification network
CN113537254B (en) * 2021-08-27 2022-08-26 重庆紫光华山智安科技有限公司 Image feature extraction method and device, electronic equipment and readable storage medium
CN113807206B (en) * 2021-08-30 2023-04-07 电子科技大学 SAR image target identification method based on denoising task assistance
CN113989511B (en) * 2021-12-29 2022-07-01 中科视语(北京)科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN115170934B (en) * 2022-09-05 2022-12-23 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115937742B (en) * 2022-11-28 2024-04-12 北京百度网讯科技有限公司 Video scene segmentation and visual task processing methods, devices, equipment and media
CN115810020B (en) * 2022-12-02 2023-06-02 中国科学院空间应用工程与技术中心 Semantic guidance-based coarse-to-fine remote sensing image segmentation method and system
CN116229277B (en) * 2023-05-08 2023-08-08 中国海洋大学 Strong anti-interference ocean remote sensing image semantic segmentation method based on semantic correlation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394B (en) * 2018-03-06 2021-08-10 华南理工大学 Multi-scale image semantic segmentation method
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
WO2020093210A1 (en) * 2018-11-05 2020-05-14 中国科学院计算技术研究所 Scene segmentation method and system based on contenxtual information guidance
US10929665B2 (en) * 2018-12-21 2021-02-23 Samsung Electronics Co., Ltd. System and method for providing dominant scene classification by semantic segmentation
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110322446B (en) * 2019-07-01 2021-02-19 华中科技大学 Domain self-adaptive semantic segmentation method based on similarity space alignment
CN111047551B (en) * 2019-11-06 2023-10-31 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm

Also Published As

Publication number Publication date
CN111932553A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN111932553B (en) Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110909642A (en) Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN110751111B (en) Road extraction method and system based on high-order spatial information global automatic perception
Asokan et al. Machine learning based image processing techniques for satellite image analysis-a survey
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
Liu et al. Remote sensing data fusion with generative adversarial networks: State-of-the-art methods and future research directions
Sulehria et al. Vehicle number plate recognition using mathematical morphology and neural networks
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN113762396A (en) Two-dimensional image semantic segmentation method
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113221814A (en) Road traffic sign identification method, equipment and storage medium
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN114332780A (en) Traffic man-vehicle non-target detection method for small target
CN114155165A (en) Image defogging method based on semi-supervision
CN114782949A (en) Traffic scene semantic segmentation method for boundary guide context aggregation
Xi et al. High Resolution Remote Sensing Image Classification Using Hybrid Ensemble Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant