CN111932553B - Remote sensing image semantic segmentation method based on area description self-attention mechanism - Google Patents

Remote sensing image semantic segmentation method based on area description self-attention mechanism

Info

Publication number
CN111932553B
CN111932553B
Authority
CN
China
Prior art keywords
feature map
layer
remote sensing
sensing image
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010732126.3A
Other languages
Chinese (zh)
Other versions
CN111932553A (en)
Inventor
赵丹培
王晨旭
史振威
姜志国
张浩鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010732126.3A priority Critical patent/CN111932553B/en
Publication of CN111932553A publication Critical patent/CN111932553A/en
Application granted granted Critical
Publication of CN111932553B publication Critical patent/CN111932553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/11 — PHYSICS; COMPUTING; IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06F 18/241 — ELECTRIC DIGITAL DATA PROCESSING; Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/047 — Neural networks; Probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/10032 — Indexing scheme for image analysis or image enhancement; Image acquisition modality; Satellite or aerial image; Remote sensing

Abstract

The invention discloses a remote sensing image semantic segmentation method based on a region-description self-attention mechanism. The method comprises: inputting a visible light remote sensing image into an encoder, extracting its high-level semantic features, and obtaining feature maps of different levels; performing global scene extraction and self-attention-based intrinsic feature extraction on the feature maps of different levels to obtain, correspondingly, a scene guide feature map and a noise-free feature map; and inputting the scene guide feature map and the noise-free feature map into a decoder, which upsamples them back to the original image size and classifies pixel by pixel to obtain the semantic segmentation result of the remote sensing image. By combining an encoder that extracts semantic features, a self-attention module that strengthens internal connections within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the invention enlarges the receptive field of the model, adapts to scale changes in the data, and alleviates the class-imbalance problem.

Description

Remote sensing image semantic segmentation method based on area description self-attention mechanism
Technical Field
The invention relates to the technical field of remote sensing and computer vision, in particular to remote sensing image semantic segmentation based on a region description self-attention mechanism.
Background
When semantic segmentation technology is applied to the field of remote sensing, a remote sensing image can be taken as input and a class label can be output for every pixel in the image, which greatly helps the understanding of remote sensing imagery. For example, in land-use planning, if the surface coverage type (city, road, forest, farmland, river, etc.) of each pixel on a satellite image can be identified, the distribution and area of each type can be known clearly, which benefits overall planning. As another example, intelligent identification of buildings makes it possible to quickly discover illegal constructions and greatly reduce labor consumption.
At present, most deep-learning work on semantic segmentation derives from the Fully Convolutional Network (FCN). The FCN converts well-known classification networks into fully convolutional networks by replacing the fully connected layers with convolutional layers, and fuses deep features with shallow features through deconvolution operations, so that both high-level semantic features and low-level positional features are taken into account and the accuracy of the model is improved. However, the FCN still cannot be applied to all scenes, mainly because the local nature of convolution leads to a small receptive field, so that contextual information over a larger range cannot be considered. To solve these problems, a family of semantic segmentation networks has been derived on the basis of the FCN.
Traditional remote sensing image semantic segmentation extracts features only by stacking convolutions. Owing to the physical properties of the convolution filter, the receptive field of a convolution operation is limited: the filter is designed to capture local features and relations. In other words, the information flow in a convolutional neural network is only propagated within local regions, which greatly impairs the understanding of complex scenes. This is especially true for remote sensing tasks, where the scenes are numerous and complex and long-range dependencies need to be gathered to help predict the current position. Therefore, the invention introduces, for the first time, the self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarging the receptive field of the model, allowing information to flow globally, and discovering long-range dependencies in the remote sensing image, such as the containment relation between an airplane and an airport, or the relation that an automobile tends to have with other automobiles or roads; such relations provide great help for the semantic features of the remote sensing image.
However, the conventional point-to-point self-attention mechanism augments context information by calculating the relationship between each pixel and every other pixel. Its drawback is that the point-to-point similarity it computes focuses on objects with identical characteristics. For example, a red car and another red car will produce a considerable degree of association, but since the label is only the car category, one would rather have the red car establish associations with all cars and even with roads. In other words, for the present task of identifying vehicles, the color features are redundant: color in the feature map is noise, the color of a vehicle does not help the current task, and feature differences may even produce the opposite effect.
Therefore, how to provide a remote sensing image semantic segmentation method based on a region description self-attention mechanism is a problem which needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which introduces a self-attention mechanism based on region descriptors into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, allows information to flow globally, and can discover long-range dependencies in the remote sensing image.
In order to achieve the purpose, the invention adopts the following technical scheme:
The remote sensing image semantic segmentation method based on the area description self-attention mechanism comprises the following steps:
Step 1: inputting a visible light remote sensing image into an encoder, extracting high-level semantic features of the visible light remote sensing image, and obtaining feature maps of different levels;
Step 2: performing global scene extraction and self-attention-based intrinsic feature extraction, respectively, on the feature maps of different levels, to obtain a scene guide feature map and a noise-free feature map correspondingly;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible light remote sensing image, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
Further, the encoder includes a feature extraction network ResNet-101.
Further, the visible light remote sensing image I is input into the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5.
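As an illustration only, the following minimal PyTorch sketch shows how two backbone feature maps corresponding to F4 and F5 might be taken from a ResNet-101. It uses torchvision's standard ResNet-101 stages (layer3 and layer4) as a stand-in and therefore omits the modified 3x3 stem and the dilated res3/res4 stages described later in this document; the class and variable names are assumptions of the sketch, not part of the patent.

```python
import torch
import torchvision


class Encoder(torch.nn.Module):
    """Minimal sketch: return fourth- and fifth-level feature maps (F4, F5)."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        # stem + layer1/layer2 prepare the input; layer3 yields F4, layer4 yields F5
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

    def forward(self, x):
        x = self.stem(x)
        x = self.layer2(self.layer1(x))
        f4 = self.layer3(x)    # F4: 1024 channels
        f5 = self.layer4(f4)   # F5: 2048 channels
        return f4, f5


# usage: f4, f5 = Encoder()(torch.randn(1, 3, 512, 512))
```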
Further, the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: dividing the feature map F4 into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category and N denotes the set of soft regions;
Step 22: weighting and aggregating the pixel values within each soft region to obtain a coarse region descriptor of the current region, calculated as:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels taken from the feature map F4;
the coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t;
Step 23: reducing the dimension of the feature map F5, and obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight;
Step 24: based on the self-attention weights, calculating the degree of association between each point and the regions:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map based on the degree of association between the points and the region descriptors:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the representation of the noise-free feature map.
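For illustration only, the following PyTorch sketch implements steps 21-25 as described above. The channel widths, the number of soft regions, the specific forms of δ(·), ρ(·) and of the pixel–descriptor relation (an inner product here), and the assumption that F4 and F5 share the same spatial resolution (as in the dilated encoder described later) are all assumptions of the sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn


class RegionDescriptorAttention(nn.Module):
    """Sketch of steps 21-25: soft regions -> region descriptors -> pixel-region
    attention -> noise-free feature map. Assumes F4 and F5 have equal spatial size."""

    def __init__(self, c4=1024, c5=2048, dim=256, num_regions=6):
        super().__init__()
        # step 21: per-pixel soft-region scores (supervised by the ground truth)
        self.region_head = nn.Conv2d(c4, num_regions, 1)
        # 1x1 transform producing the final region descriptors r_t (here T = K)
        self.to_descriptor = nn.Conv1d(c4, dim, 1)
        # dimension reduction of F5 before the pixel-region relation (step 23)
        self.reduce_f5 = nn.Sequential(nn.Conv2d(c5, dim, 1),
                                       nn.BatchNorm2d(dim), nn.ReLU())
        # delta(.) and rho(.) of step 24 (illustrative choices)
        self.delta = nn.Conv1d(dim, dim, 1)
        self.rho = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                 nn.BatchNorm2d(dim), nn.ReLU())
        # g(.) of step 25: 1x1 convolution + batch norm + ReLU on [x_i ; y_i]
        self.g = nn.Sequential(nn.Conv2d(c4 + dim, dim, 1),
                               nn.BatchNorm2d(dim), nn.ReLU())

    def forward(self, f4, f5):
        b, _, h, w = f4.shape
        scores = self.region_head(f4)                               # B x K x H x W
        # steps 21/22: alpha_ik normalised over pixels, f_k = sum_i alpha_ik x_i
        alpha = scores.flatten(2).softmax(dim=-1)                   # B x K x HW
        coarse = torch.bmm(alpha, f4.flatten(2).transpose(1, 2))    # B x K x C4
        r = self.to_descriptor(coarse.transpose(1, 2))              # B x dim x T
        # step 23: W_it = softmax_t( <reduced pixel feature, r_t> )
        q = self.reduce_f5(f5).flatten(2)                           # B x dim x HW
        attn = torch.bmm(q.transpose(1, 2), r).softmax(dim=-1)      # B x HW x T
        # step 24: y_i = rho( sum_t W_it * delta(r_t) )
        y = torch.bmm(attn, self.delta(r).transpose(1, 2))          # B x HW x dim
        y = self.rho(y.transpose(1, 2).reshape(b, -1, h, w))        # B x dim x H x W
        # step 25: noise-free feature z_i = g([x_i ; y_i])
        z = self.g(torch.cat([f4, y], dim=1))
        return z, scores   # scores feed the auxiliary ground-truth supervision
```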
Further, the specific steps of generating the scene guide feature map in step 2 are as follows:
Step 26: performing feature dimension reduction on the feature map F5 through a 1x1 convolution to obtain local features, and then applying spatial global average pooling followed by a 1x1 convolution to the local features to obtain a global scene descriptor;
Step 27: fusing the local features and the global scene descriptor to obtain the scene guide feature map F_g.
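A minimal sketch of steps 26-27 is given below. The reduced channel width and the choice of fusion by broadcast addition are assumptions of the sketch; the patent only states that the local features and the global scene descriptor are fused.

```python
import torch
import torch.nn as nn


class SceneGuidance(nn.Module):
    """Sketch of steps 26-27: 1x1 reduction of F5, global average pooling plus a
    second 1x1 convolution for the scene descriptor, then fusion into F_g."""

    def __init__(self, c5=2048, dim=256):
        super().__init__()
        self.local = nn.Conv2d(c5, dim, 1)       # 1x1 reduction -> local features x_i
        self.global_fc = nn.Conv2d(dim, dim, 1)  # 1x1 convolution on the pooled vector

    def forward(self, f5):
        x = self.local(f5)                                     # B x dim x H x W
        g = self.global_fc(x.mean(dim=(2, 3), keepdim=True))   # B x dim x 1 x 1
        return x + g                                           # scene guide feature map F_g
```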
Further, the step 3 specifically comprises:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated (atrous) convolutions with different dilation rates to the noise-free feature map, to obtain feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
Step 33: applying a 3x3 convolution to the fused feature map to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
Further, the feature extraction network ResNet-101 comprises a 0th layer res0, a 1st layer res1, a 2nd layer res2, a 3rd layer res3 and a 4th layer res4;
the 0th layer res0 comprises three 3x3 convolution layers, the 1st layer res1 comprises 2 bottleneck layers, the 2nd layer res2 comprises 3 bottleneck layers, the 3rd layer res3 comprises 22 bottleneck layers, and the 4th layer res4 comprises 2 bottleneck layers.
Further, the bottleneck layers in the 1st layer res1 and the 2nd layer res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, while the bottleneck layers in the 3rd layer res3 and the 4th layer res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
Further, the dilation rate of the dilated convolutional layer is 2.
Further, the decoder comprises a 3x3 convolutional layer, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
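The decoder described in steps 31-33 and in the preceding paragraph could be sketched as follows. The dilation rates 16, 24 and 36 follow the values given above; the channel width and the number of output classes are illustrative assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Decoder(nn.Module):
    """Sketch of steps 31-33: parallel 3x3 and dilated 3x3 convolutions on the
    noise-free map, concatenation with the scene guide map, 3x3 fusion, bilinear
    upsampling to the input size and pixel-by-pixel classification."""

    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)               # F_c
        self.d1 = nn.Conv2d(dim, dim, 3, padding=16, dilation=16)   # F_d1
        self.d2 = nn.Conv2d(dim, dim, 3, padding=24, dilation=24)   # F_d2
        self.d3 = nn.Conv2d(dim, dim, 3, padding=36, dilation=36)   # F_d3
        self.fuse = nn.Conv2d(dim * 5, dim, 3, padding=1)           # 3x3 fusion convolution
        self.classify = nn.Conv2d(dim, num_classes, 1)              # per-pixel classifier

    def forward(self, z, f_g, out_size):
        h = torch.cat([f_g, self.conv(z), self.d1(z), self.d2(z), self.d3(z)], dim=1)
        h = self.fuse(h)                                            # multi-scale feature map
        h = F.interpolate(h, size=out_size, mode='bilinear', align_corners=False)
        return self.classify(h)                                     # logits per pixel


# usage sketch: logits = Decoder()(z, f_g, out_size=(512, 512))
```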
According to the above technical scheme, compared with the prior art, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism. Through an encoder that extracts semantic features, a self-attention module that strengthens internal connections within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the receptive field of the model is enlarged, the method can adapt to scale changes in the data, and the class-imbalance problem can be alleviated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a remote sensing image semantic segmentation method based on a region description self-attention mechanism provided by the invention.
Fig. 2 is a specific flowchart of the processing performed by the self-attention module according to the present invention.
Fig. 3 is a specific flowchart of the processing performed by the decoder according to the present invention.
Fig. 4 is a detailed structural diagram of an encoder provided by the present invention.
Fig. 5 is a detailed block diagram of a decoder according to the present invention.
FIG. 6 is a regional descriptor attention diagram provided by the present invention. The first column represents a visible light remote sensing image, the first area descriptor in the second column describes the characteristics of buildings, the second area descriptor in the third column describes the characteristics of vegetation, and the third area descriptor in the fourth column describes the characteristics of roads.
FIG. 7 is a graph showing the effect of a comparative experiment on the consistency of objects of the same class. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 8 is a graph showing the effect of a comparative experiment on small-target performance. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 9 is a graph showing the effect of a comparative experiment on the segmentation of multiple complex scenes. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
FIG. 10 is a graph showing the effect of a comparative experiment on the accuracy of edge segmentation. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation produced by the reference model, and the fourth column shows the segmentation produced by the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which comprises the following steps:
Step 1: inputting the visible light remote sensing image into an encoder and extracting its high-level semantic features, wherein the encoder adopts the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5.
Step 2: the self-attention module uses the ground truth as supervision to find the region descriptors. The core idea is to divide the pixels of the global context into a number of soft object regions under the supervision of the ground truth, where each region corresponds to a class in the ground truth. Within each soft region, a representation of the region as a whole is obtained by aggregating the pixels in the region; this representation is the region descriptor. Finally, each pixel is weighted and summed with the attention weights over the region descriptors to obtain new features of the pixel related to the region descriptors.
As shown in fig. 2, specifically:
Step 21: the feature map F4 is used to roughly predict the object regions, and is divided into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category; a cross-entropy loss function is used during training to learn the soft-region generation from the ground truth, which can be regarded as introducing an auxiliary loss branch.
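As an illustration of the auxiliary loss branch mentioned above, the soft-region scores predicted from F4 could be supervised against the ground-truth labels as sketched below; the interpolation to label resolution and the loss weight of 0.4 are assumptions of the sketch, not values given in the patent.

```python
import torch.nn.functional as F


def auxiliary_region_loss(region_scores, labels, weight=0.4):
    """region_scores: B x K x h x w raw soft-region scores; labels: B x H x W class ids."""
    scores = F.interpolate(region_scores, size=labels.shape[-2:],
                           mode='bilinear', align_corners=False)
    # cross-entropy supervision learns the soft-region generation from the ground truth
    return weight * F.cross_entropy(scores, labels)
```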
Step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, wherein the calculation formula is as follows:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i, and \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region; the probabilities are normalized with softmax so that, over the whole image, the probabilities of all pixels belonging to the k-th soft region sum to 1.
The coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t.
Step 23: after reducing the dimension of the feature map F5, the self-attention weight of each pixel with respect to the region descriptors is computed with softmax from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight.
Step 24: after reducing the dimension of the feature map F5, the degree of association between each pixel and the regions is obtained from the features of each pixel related to the region descriptors:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions.
Step 25: the final pixel feature is composed of two parts, one being the original feature x_i and the other being the feature y_i obtained from the weighted sum over the region descriptors; the representation of the resulting noise-free feature map is z_i:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, and x_i denotes the feature of pixel p_i in the feature map F4.
The process of extracting the global scene based on the feature map F5 is as follows:
The feature map F5 output by the fifth layer of ResNet-101 undergoes feature dimension reduction through a 1x1 convolution to obtain local features x_i; a global scene descriptor g(x) is then obtained by spatial global average pooling followed by another 1x1 convolution; and the local features {x_i} are fused with the global vector g(x) to obtain the scene-guided feature map F_g.
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling back to the original image size, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image; the specific steps are shown in fig. 3:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated convolutions with different dilation rates to the noise-free feature map output by the self-attention module, to obtain the feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
where d1, d2 and d3 denote the different dilation rates of the dilated convolutions, with d1 = 16, d2 = 24 and d3 = 36. Finally, a 3x3 convolution is applied to the fused feature map H to obtain a multi-scale feature map with attention, which is upsampled to the original image size by bilinear interpolation and then classified pixel by pixel to obtain the final semantic segmentation result.
As shown in fig. 4, the encoder is mainly composed of 3 convolutional layers and 29 bottleneck layers for extracting image features, wherein the last 24 bottleneck layers replace the ordinary convolution with a dilated convolution whose dilation rate is 2.
The invention adopts the deep residual network ResNet-101 as the encoder, because the residual structure gives ResNet a stronger feature extraction capability than VGG and GoogLeNet. Meanwhile, considering the large scale range of remote sensing images, ordinary convolutions are replaced by dilated convolutions in deep-level feature extraction.
The original ResNet-101 starts with a 7x7 convolution; to save computation while achieving the same receptive field size as the 7x7 convolution, it is replaced by a stack of three 3x3 convolutions, reducing the number of convolution kernel parameters per channel from 49 to 27; this stack can be referred to as the 0th layer res0. The following layers 1, 2, 3 and 4 are all composed of bottleneck layers: the 1st layer res1 comprises 1 basic block (BasicBlock) and 2 bottleneck layers, the 2nd layer res2 comprises 1 basic block and 3 bottleneck layers, the 3rd layer res3 comprises 1 basic block and 22 bottleneck layers, and the 4th layer res4 comprises 1 basic block and 2 bottleneck layers. The deeper network structure can better extract the complex features in remote sensing images, and the residual structure prevents the degradation problem caused by an overly deep network.
In high-level feature extraction, pooling is often used to enlarge the receptive field, but its disadvantage is that small targets are easily lost. Because small targets in remote sensing images often deserve more attention, and to prevent small-target information from becoming unrecoverable, the pooling layers used by ResNet-101 in deep-level feature extraction are removed and the ordinary convolutions are replaced by dilated convolutions.
In the 3rd layer res3 and the 4th layer res4, the pooling layer used for downsampling is removed and the ordinary 3x3 convolution in all bottleneck layers is replaced with a dilated convolution whose dilation rate is set to 2, so as to achieve the same receptive-field-enlarging effect as the removed pooling layer. In the 3rd layer res3 and the 4th layer res4, the high-dimensional features are first reduced in dimension by a 1x1 convolution, features are then extracted with the dilated convolution, the feature dimension is restored with another 1x1 convolution, and finally the result is added directly to the original input to obtain the bottleneck output.
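A minimal sketch of such a dilated bottleneck is shown below: a 1x1 reduction, a 3x3 convolution with dilation rate 2 in place of the ordinary 3x3, a 1x1 expansion back to the input width, and a residual addition. The channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DilatedBottleneck(nn.Module):
    """Sketch of a res3/res4-style bottleneck with a dilation-2 convolution."""

    def __init__(self, channels=1024, mid=256, dilation=2):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                    nn.BatchNorm2d(mid), nn.ReLU())
        self.dilated = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=dilation,
                                               dilation=dilation),
                                     nn.BatchNorm2d(mid), nn.ReLU())
        self.expand = nn.Sequential(nn.Conv2d(mid, channels, 1),
                                    nn.BatchNorm2d(channels))
        self.relu = nn.ReLU()

    def forward(self, x):
        # 1x1 reduce -> dilated 3x3 -> 1x1 expand -> residual addition
        return self.relu(x + self.expand(self.dilated(self.reduce(x))))
```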
Because the model uses ResNet-101 only for feature extraction, the final average pooling layer and fully connected layer of the original ResNet-101 are removed and replaced with a 3x3 convolutional layer.
As shown in FIG. 5, considering that remote sensing images exhibit large scale changes, the decoder uses three dilated convolutions with different dilation rates to process the feature maps in parallel, capturing objects and context information in the image at multiple scales so that multi-scale images obtain a better segmentation effect. Moreover, remote sensing images are characterized by large and rich scenes, so classifying the overall scene of a remote sensing image can help its semantic segmentation. For example, if the scene is a city, buildings and roads are more likely to appear in it, while the probability of oil tanks and airplanes appearing in a city scene is very low. Therefore, the design of the decoder not only considers the multi-scale problem of the target but also introduces a global scene descriptor to help the model make better predictions.
Experimental validation section:
Table 1 compares the algorithm of the present invention with methods commonly used in semantic segmentation tasks; the method designed by the present invention achieves the highest score on the DroneDeploy remote sensing image dataset.
Table 1: performance comparison with existing algorithms (the table is reproduced as an image in the original publication).
Note: * marks methods that use a self-attention mechanism.
Among the models without a self-attention mechanism, RefineNet combines higher-level coarse semantic features with lower-level fine semantic features; PSPNet adopts a Pyramid Pooling Module (PPM) to aggregate context information from different regions; DeepLab fuses segmentation results at different resolutions using the Atrous Spatial Pyramid Pooling (ASPP) module; and DenseASPP densely connects ASPP modules in the style of DenseNet, resulting in a larger receptive field and denser sampling points. What they have in common is the fusion of feature maps at different levels and the use of context information to obtain a larger receptive field.
The self-attention mechanism is also a good way to enlarge the receptive field: by associating each position with all global information, information can be propagated across the whole image and the receptive field covers the entire image. DANet combines spatial correlation with channel correlation; PSANet uses two steps, distribution and collection, to propagate information across the whole image; EncNet encodes context information, taking scene characteristics into consideration; CCNet acquires global context information using a criss-cross structure; OCNet aggregates pixels and then segments them. These are all extensions of the self-attention mechanism, and it can be observed that the asterisked algorithms with a self-attention mechanism are better overall than the algorithms without one, which shows the effectiveness of the self-attention mechanism for the semantic segmentation task.
In fig. 6, the region descriptor attention diagrams of the intermediate results are visualized, each representing the attention weight of a different region descriptor in reconstructing the feature map. Each region descriptor corresponds to some specific semantic information, not just to the foreground or background.
In fig. 7, it can be seen that the algorithm of the present invention has a great advantage in segmenting objects whose pixels have different characteristics but belong to the same class. In the first example, pixels belonging to the same ground class have two appearances, dark and light. Because of this feature difference, the reference model splits them into two classes, even though both in fact belong to the land class, whereas the algorithm of the invention correctly groups them into one class.
In fig. 8, the algorithm of the present invention segments small objects very accurately, and the assigned labels are also correct. In contrast, the reference model has difficulty extracting enough features from small targets and is easily disturbed by redundant features, so small targets are often segmented incompletely and easily merged together.
In fig. 9, the images share the characteristics that the scenes are complex, the object categories are diverse, and the boundaries between different objects are not clear. In such complex scenes, the algorithm of the invention segments the objects in a regular way, in clear contrast to the disordered segmentation of the reference model.
In fig. 10, the algorithm of the present invention is more accurate at object edges than the reference model. This benefit arises because each pixel is reconstructed from the region descriptors, so the boundaries of different objects can be distinguished more accurately through their different essential characteristics.
Therefore, the final prediction of the region-based self-attention model achieves a better segmentation effect than the reference model without a self-attention mechanism; in particular, the algorithm disclosed by the invention shows great advantages when attention is paid to the regions marked with white boxes in the figures.
The invention has the following advantages:
1) The method introduces a self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, allows information to flow globally, and discovers long-range dependencies in the remote sensing image, which provides great help for the semantic features of the remote sensing image.
2) Starting from the self-attention mechanism that computes point-to-point relations, it is extended to a self-attention mechanism that computes point-to-region relations, and a noise-free feature map can be reconstructed.
The core idea is to map data from a noisy high-dimensional space to a compact subspace that captures the most essential semantic concepts, then calculate the degree of correlation between each pixel and the captured semantic features, and reassign values to each pixel using this set of descriptors so as to reconstruct a noise-free feature map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. The remote sensing image semantic segmentation method based on the area description self-attention mechanism is characterized by comprising the following steps of:
Step 1: inputting the visible light remote sensing image into an encoder, extracting high-level semantic features, and obtaining feature maps of different levels;
Step 2: performing global scene extraction and self-attention-based intrinsic feature extraction, respectively, on the feature maps of different levels, to obtain a scene guide feature map and a noise-free feature map correspondingly;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible light remote sensing image, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image;
the encoder comprises a feature extraction network ResNet-101;
the visible light remote sensing image I is input into the feature extraction network ResNet-101, whose fourth layer outputs a feature map F4 and whose fifth layer outputs a feature map F5;
the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: dividing the feature map F4 into K soft regions N = {M_1, M_2, …, M_K} according to the ground truth, wherein each soft region M_k belongs to the k-th category and N denotes the set of soft regions;
step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, wherein the calculation formula is as follows:
f_k = \sum_{i \in I} \alpha_{ik} \, x_i
where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \alpha_{ik} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels taken from the feature map F4;
the coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t;
Step 23: reducing the dimension of the feature map F5, and obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:
W_{it} = \frac{\exp(\tilde{x}_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(\tilde{x}_i^{\top} r_{t'})}
where T denotes the number of soft regions, r_t denotes a region descriptor, \tilde{x}_i denotes the feature of the i-th pixel in the dimension-reduced feature map, and W_{it} denotes the self-attention weight;
Step 24: based on the self-attention weights, calculating the degree of association between each point and the regions:
y_i = \rho\!\left( \sum_{t=1}^{T} W_{it} \, \delta(r_t) \right)
where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \delta(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map based on the degree of association between the points and the region descriptors:
z_i = g([x_i, y_i])
where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, [\cdot,\cdot] denotes concatenation, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the representation of the noise-free feature map;
the specific steps of generating the scene guide feature map in step 2 are as follows:
Step 26: performing feature dimension reduction on the feature map F5 through a 1x1 convolution to obtain local features, and then applying spatial global average pooling followed by a 1x1 convolution to the local features to obtain a global scene descriptor;
Step 27: fusing the local features and the global scene descriptor to obtain the scene guide feature map F_g;
the step 3 specifically comprises:
Step 31: applying, in parallel, an ordinary 3x3 convolution and three dilated convolutions with different dilation rates to the noise-free feature map to obtain feature maps F_c, F_d1, F_d2 and F_d3;
Step 32: splicing the feature vectors of the scene guide feature map F_g, the feature map F_c, the feature map F_d1, the feature map F_d2 and the feature map F_d3 to obtain a fused feature map H:
H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})
Step 33: applying a 3x3 convolution to the fused feature map to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
2. The remote sensing image semantic segmentation method based on the area description self-attention mechanism as claimed in claim 1, wherein the feature extraction network ResNet-101 comprises a 0th layer res0, a 1st layer res1, a 2nd layer res2, a 3rd layer res3 and a 4th layer res4;
the 0th layer res0 comprises three 3x3 convolution layers, the 1st layer res1 comprises 2 bottleneck layers, the 2nd layer res2 comprises 3 bottleneck layers, the 3rd layer res3 comprises 22 bottleneck layers, and the 4th layer res4 comprises 2 bottleneck layers.
3. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 2, wherein the bottleneck layers in the 1st layer res1 and the 2nd layer res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, and the bottleneck layers in the 3rd layer res3 and the 4th layer res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
4. The remote sensing image semantic segmentation method based on the area description self-attention mechanism according to claim 3, wherein the dilation rate of the dilated convolutional layer is 2.
5. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to any one of claims 1 to 4, wherein the decoder comprises a 3x3 convolutional layer, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24 and a dilated convolutional layer with a dilation rate of 36.
CN202010732126.3A 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism Active CN111932553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732126.3A CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010732126.3A CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111932553A (en) 2020-11-13
CN111932553B (en) 2022-09-06

Family

ID=73315343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010732126.3A Active CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111932553B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487927B (en) * 2020-11-26 2024-02-13 深圳市人工智能与机器人研究院 Method and system for realizing indoor scene recognition based on object associated attention
CN112528803B (en) * 2020-12-03 2023-12-19 中国地质大学(武汉) Road feature extraction method, device, equipment and storage medium
CN112580649B (en) * 2020-12-15 2022-08-02 重庆邮电大学 Semantic segmentation method based on regional context relation module
CN112699937B (en) * 2020-12-29 2022-06-21 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network
CN112749736B (en) * 2020-12-30 2022-09-13 华南师范大学 Image recognition method, control device and storage medium
CN113065586B (en) * 2021-03-23 2022-10-18 四川翼飞视科技有限公司 Non-local image classification device, method and storage medium
CN113223008A (en) * 2021-04-16 2021-08-06 山东师范大学 Fundus image segmentation method and system based on multi-scale guide attention network
CN113421259B (en) * 2021-08-20 2021-11-16 北京工业大学 OCTA image analysis method based on classification network
CN113537254B (en) * 2021-08-27 2022-08-26 重庆紫光华山智安科技有限公司 Image feature extraction method and device, electronic equipment and readable storage medium
CN113807206B (en) * 2021-08-30 2023-04-07 电子科技大学 SAR image target identification method based on denoising task assistance
CN113989511B (en) * 2021-12-29 2022-07-01 中科视语(北京)科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN115170934B (en) * 2022-09-05 2022-12-23 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115937742B (en) * 2022-11-28 2024-04-12 北京百度网讯科技有限公司 Video scene segmentation and visual task processing methods, devices, equipment and media
CN115810020B (en) * 2022-12-02 2023-06-02 中国科学院空间应用工程与技术中心 Semantic guidance-based coarse-to-fine remote sensing image segmentation method and system
CN116229277B (en) * 2023-05-08 2023-08-08 中国海洋大学 Strong anti-interference ocean remote sensing image semantic segmentation method based on semantic correlation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394B (en) * 2018-03-06 2021-08-10 华南理工大学 Multi-scale image semantic segmentation method
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
WO2020093210A1 (en) * 2018-11-05 2020-05-14 中国科学院计算技术研究所 Scene segmentation method and system based on contenxtual information guidance
US10929665B2 (en) * 2018-12-21 2021-02-23 Samsung Electronics Co., Ltd. System and method for providing dominant scene classification by semantic segmentation
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110322446B (en) * 2019-07-01 2021-02-19 华中科技大学 Domain self-adaptive semantic segmentation method based on similarity space alignment
CN111047551B (en) * 2019-11-06 2023-10-31 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm

Also Published As

Publication number Publication date
CN111932553A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN111932553B (en) Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110909642A (en) Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN110751111B (en) Road extraction method and system based on high-order spatial information global automatic perception
Asokan et al. Machine learning based image processing techniques for satellite image analysis-a survey
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
Liu et al. Remote sensing data fusion with generative adversarial networks: State-of-the-art methods and future research directions
Sulehria et al. Vehicle number plate recognition using mathematical morphology and neural networks
CN115661777A (en) Semantic-combined foggy road target detection algorithm
CN113762396A (en) Two-dimensional image semantic segmentation method
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113221814A (en) Road traffic sign identification method, equipment and storage medium
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN114332780A (en) Traffic man-vehicle non-target detection method for small target
CN114155165A (en) Image defogging method based on semi-supervision
CN114782949A (en) Traffic scene semantic segmentation method for boundary guide context aggregation
Xi et al. High Resolution Remote Sensing Image Classification Using Hybrid Ensemble Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant