CN111932553A - Remote sensing image semantic segmentation method based on area description self-attention mechanism - Google Patents
- Publication number: CN111932553A (Application number: CN202010732126.3A)
- Authority
- CN
- China
- Prior art keywords: feature map, remote sensing image, layer, region
- Prior art date: 2020-07-27
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/11—Region-based segmentation
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
- G06T2207/10032—Satellite or aerial image; Remote sensing
Abstract
The invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism. A visible light remote sensing image is input into an encoder, which extracts its high-level semantic features and produces feature maps of different levels; based on these feature maps, a self-attention module performs global scene extraction and essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map; the scene guide feature map and the noise-free feature map are input into a decoder, upsampled back to the original image size, and classified pixel by pixel to obtain the semantic segmentation result of the remote sensing image. With an encoder that extracts semantic features, a self-attention module that strengthens the internal relations within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the invention enlarges the receptive field of the model, adapts to the scale changes of the data, and alleviates the class imbalance problem.
Description
Technical Field
The invention relates to the technical field of remote sensing and computer vision, and in particular to a remote sensing image semantic segmentation method based on a region description self-attention mechanism.
Background
Applied to the remote sensing field, semantic segmentation technology takes a remote sensing image as input and outputs a class label for each pixel in the image, which greatly helps the understanding of remote sensing images. For example, in land planning, if the ground cover type of each pixel on a satellite image (city, road, forest, farmland, river, etc.) can be identified, the distribution and occupied area of each type become immediately clear, which benefits overall planning. Likewise, intelligent identification of buildings can quickly reveal whether illegal buildings exist and can greatly reduce manual effort.
At present, most deep learning work on semantic segmentation derives from the fully convolutional network (FCN). The FCN converts well-known classification networks into fully convolutional ones by replacing the fully connected layers with convolutional layers and fusing deep features with shallow features through deconvolution operations, so that both high-level semantic features and low-level positional features are taken into account and the accuracy of the model is improved. However, the FCN still cannot be applied to all scenes, mainly because the local nature of convolution leads to a small receptive field, so context information over a larger range cannot be considered. To solve these problems, a family of semantic segmentation networks has been derived on the basis of the FCN.
Traditional remote sensing image semantic segmentation extracts features only by stacking convolutions. Owing to the physical nature of convolution filters, the receptive field of a convolution operation is limited: convolution filters aim to capture local features and relations; in other words, information in a convolutional neural network flows only within local regions, which greatly hinders the understanding of complex scenes. This is especially true for remote sensing tasks, where the scenes in a remote sensing image are numerous and complex, and long-range dependencies need to be gathered to help predict the current position. Therefore, the invention introduces the self-attention mechanism into the traditional remote sensing image semantic segmentation task for the first time, enlarging the receptive field of the model, letting information flow globally, and discovering long-range dependencies in the remote sensing image, such as the containment relation between an airplane and an airport, or the relation a car should have with other cars or roads; such relations can greatly help the semantic features of remote sensing images.
However, the conventional point-to-point self-attention mechanism adds context information by computing the relation between each pixel and every other pixel. Its disadvantage is that point-to-point similarity favors objects with identical features: a red car and another red car will produce a considerable degree of association, but since the label is only the car class, one would rather have the red car associated with all cars and even with roads. In other words, for the current task of recognizing cars the color features are redundant: the color in the feature map is noise, the color of a car does not help the current task, and the feature differences may even produce the opposite effect.
Therefore, how to provide a remote sensing image semantic segmentation method based on a region description self-attention mechanism is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which introduces a self-attention mechanism based on region descriptors into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, lets information flow globally, and can discover long-range dependencies in the remote sensing image.
In order to achieve the purpose, the invention adopts the following technical scheme:
The remote sensing image semantic segmentation method based on the region description self-attention mechanism comprises the following steps:
Step 1: input the visible light remote sensing image into an encoder, extract its high-level semantic features, and obtain feature maps of different levels;
Step 2: based on the feature maps of different levels, respectively perform global scene extraction and self-attention-based essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map;
Step 3: input the scene guide feature map and the noise-free feature map into a decoder, upsample back to the size of the visible light remote sensing image, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
Further, the encoder includes a feature extraction network ResNet-101.
Further, the visible light remote sensing image I is input into the feature extraction network ResNet-101; the fourth layer outputs a feature map F4 and the fifth layer outputs a feature map F5.
Further, the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: according to the ground truth, divide the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category and N denotes the set of soft regions;
Step 22: weight and aggregate the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels of the feature map F4;
apply a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t;
Step 23: after reducing the dimensionality of the feature map F5, obtain the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight;
Step 24: based on the self-attention weights, compute the degree of association between each point and the regions:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtain the noise-free feature map from the association between the points and the region descriptors:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the feature of the noise-free feature map.
Further, the specific steps of generating the scene guide feature map in step 2 are:
Step 26: reduce the feature dimensionality of the feature map F5 with a 1x1 convolution to obtain local features, then pass them through spatial global average pooling and a 1x1 convolution in sequence to obtain a global scene descriptor;
Step 27: fuse the local features with the global scene descriptor to obtain the scene guide feature map Fg.
Further, step 3 specifically comprises:
Step 31: convolve the noise-free feature map with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3;
Step 32: concatenate the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

Step 33: convolve the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsample it to the size of the visible light remote sensing image by bilinear interpolation, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
Further, the feature extraction network ResNet-101 comprises layer 0 res0, layer 1 res1, layer 2 res2, layer 3 res3 and layer 4 res4;
layer 0 res0 comprises three 3x3 convolutional layers, layer 1 res1 comprises 2 bottleneck layers, layer 2 res2 comprises 3 bottleneck layers, layer 3 res3 comprises 22 bottleneck layers, and layer 4 res4 comprises 2 bottleneck layers.
Further, the bottleneck layers in layer 1 res1 and layer 2 res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, while the bottleneck layers in layer 3 res3 and layer 4 res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
Further, the dilation rate of the dilated convolutional layers is 2.
Further, the decoder comprises 3x3 convolutional layers, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
According to the above technical scheme, compared with the prior art, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism. With an encoder that extracts semantic features, a self-attention module that strengthens the internal relations within the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the method enlarges the receptive field of the model, adapts to the scale changes of the data, and alleviates the class imbalance problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a remote sensing image semantic segmentation method based on a region description self-attention mechanism provided by the invention.
Fig. 2 is a specific flowchart of the processing performed by the self-attention module according to the present invention.
Fig. 3 is a specific flowchart of the processing performed by the decoder according to the present invention.
Fig. 4 is a detailed structural diagram of an encoder provided by the present invention.
Fig. 5 is a detailed block diagram of a decoder according to the present invention.
FIG. 6 is a region descriptor attention map provided by the present invention. The first column shows the visible light remote sensing image; the second column shows the first region descriptor, which describes the features of buildings; the third column shows the second region descriptor, which describes the features of vegetation; the fourth column shows the third region descriptor, which describes the features of roads.
FIG. 7 is a graph showing the effect of a comparative test on the consistency of same-class targets. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 8 is a graph of the effect of a comparative experiment on small-target performance. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 9 is a graph illustrating the effect of a comparative experiment on the segmentation of multiple complex scenes. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
FIG. 10 is a graph showing the effect of a comparative experiment on the accuracy of edge segmentation. The first column shows the original image, the second column shows the ground truth pixel segmentation, the third column shows the segmentation effect of the reference model, and the fourth column shows the segmentation effect of the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which comprises the following steps:
Step 1: input the visible light remote sensing image into an encoder and extract its high-level semantic features, where the encoder adopts the feature extraction network ResNet-101; the fourth layer outputs the feature map F4 and the fifth layer outputs the feature map F5.
Step 2: the self-attention module uses the ground truth as supervision to find the region descriptors. The core idea is to divide the global context pixels into several soft object regions under the supervision of the ground truth, where each region corresponds to one class in the ground truth. Within each soft region, a representation of the whole current region is obtained by aggregating the pixels in the region; this representation is the region descriptor. Finally, for each pixel, a weighted sum over the region descriptors with the pixel's attention weights yields a new feature of the pixel related to the region descriptors.
As shown in fig. 2, specifically:
Step 21: the feature map F4 roughly divides the predicted object regions: according to the ground truth, divide the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category. A cross-entropy loss function is used during training to learn the soft region generation from the ground truth, which can be regarded as introducing an auxiliary loss branch.
Step 22: weight and aggregate the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i, and \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, normalized with softmax so that the probabilities of all pixels in the whole image belonging to the k-th soft region sum to 1.
Apply a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t.
Step 23: using softmax on the dimension-reduced feature map F5, obtain the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight.
Step 24: after reducing the dimensionality of the feature map F5, obtain the degree of association between each pixel and the regions from the region-descriptor-related features of each pixel:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions.
Step 25: the final pixel feature consists of two parts, one being the original feature x_i and the other being the feature y_i represented by the weighted sum of the region descriptors; the resulting feature of the noise-free feature map is z_i:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, and x_i denotes the feature of pixel p_i in the feature map F4.
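To make steps 21-25 concrete, the following is a minimal PyTorch sketch of the region-descriptor self-attention module. The patent publishes no implementation, so the module name, channel sizes and the dot-product similarity used for the attention weights are illustrative assumptions; for simplicity the sketch also operates on a single input feature map, whereas the patent aggregates descriptors from F4 and computes the attention on a dimension-reduced F5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionDescriptorAttention(nn.Module):
    """Sketch of the region-descriptor self-attention (steps 21-25).
    Channel sizes and region count are assumptions, not patent values."""

    def __init__(self, in_ch=2048, mid_ch=512, num_regions=6):
        super().__init__()
        # Step 21: per-pixel soft-region scores, supervised by the ground
        # truth through an auxiliary cross-entropy loss.
        self.region_head = nn.Conv2d(in_ch, num_regions, kernel_size=1)
        # 1x1 transform turning coarse descriptors f_k into descriptors r_t.
        self.descriptor_transform = nn.Conv1d(in_ch, mid_ch, kernel_size=1)
        # Dimension reduction of the pixel features before the attention.
        self.pixel_transform = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        # rho(.) applied to the descriptor-weighted sum y_i.
        self.rho = nn.Conv2d(mid_ch, mid_ch, kernel_size=1)
        # g(.): 1x1 convolution + BN + ReLU over the concatenation [x_i, y_i].
        self.g = nn.Sequential(
            nn.Conv2d(in_ch + mid_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        region_logits = self.region_head(x)                      # (B, K, H, W)
        # Normalize over space so each region's probabilities over the whole
        # image sum to 1 (softmax over the HW positions).
        m = F.softmax(region_logits.view(b, -1, h * w), dim=2)   # (B, K, HW)
        # Step 22: coarse descriptors f_k = sum_i m_{k,i} x_i, then 1x1 conv.
        feats = x.view(b, c, h * w)                              # (B, C, HW)
        f = torch.bmm(m, feats.transpose(1, 2))                  # (B, K, C)
        r = self.descriptor_transform(f.transpose(1, 2))         # (B, C', K)
        # Step 23: W_it = softmax_t(x_i . r_t) on reduced pixel features.
        p = self.pixel_transform(x).view(b, -1, h * w)           # (B, C', HW)
        att = F.softmax(torch.bmm(p.transpose(1, 2), r), dim=2)  # (B, HW, K)
        # Step 24: y_i = rho(sum_t W_it * phi(r_t)); phi is folded into the
        # descriptor transform above for brevity.
        y = torch.bmm(att, r.transpose(1, 2))                    # (B, HW, C')
        y = self.rho(y.transpose(1, 2).reshape(b, -1, h, w))
        # Step 25: z_i = g([x_i, y_i]).
        z = self.g(torch.cat([x, y], dim=1))
        return z, region_logits  # logits feed the auxiliary loss
```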
The process of extracting the global scene based on the feature map F5 is as follows:
the feature map F5 output by the fifth layer of ResNet-101 undergoes feature dimensionality reduction through a 1x1 convolution to obtain the local features {x_i}; spatial global average pooling followed by another 1x1 convolution then yields a global scene descriptor g(x); the local features {x_i} are fused with the global vector g(x) to obtain the scene guide feature map Fg.
Step 3: input the scene guide feature map and the noise-free feature map into a decoder, upsample back to the original image size, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image. The specific steps, as shown in fig. 3, are:
Step 31: convolve the noise-free feature map output by the self-attention module with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3.
Step 32: concatenate the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

where d1, d2 and d3 denote the different dilation rates of the dilated convolutions, with d1 = 16, d2 = 24 and d3 = 36. Finally, convolve the fused feature map H with a 3x3 convolution to obtain a multi-scale feature map with attention, upsample it to the same size as the original image by bilinear interpolation, and then classify pixel by pixel to obtain the final semantic segmentation result.
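A minimal sketch of this decoder path is given below; only the dilation rates (16, 24, 36), the 3x3 convolutions and the bilinear upsampling come from the text, while the channel counts and class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder (steps 31-33): parallel dilated convolutions over
    the noise-free map, concatenation with the scene guide map, and bilinear
    upsampling back to the input size."""

    def __init__(self, in_ch=512, mid_ch=256, num_classes=6):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        # Three dilated 3x3 convolutions with the rates given in the patent.
        self.d1 = nn.Conv2d(in_ch, mid_ch, 3, padding=16, dilation=16)
        self.d2 = nn.Conv2d(in_ch, mid_ch, 3, padding=24, dilation=24)
        self.d3 = nn.Conv2d(in_ch, mid_ch, 3, padding=36, dilation=36)
        # Fusion over [F_g; F_c; F_d1; F_d2; F_d3]; F_g assumed in_ch wide.
        self.fuse = nn.Conv2d(mid_ch * 4 + in_ch, mid_ch, 3, padding=1)
        self.classify = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, z, fg, out_size):
        # Step 31: ordinary and dilated convolutions in parallel.
        fc = self.conv3x3(z)
        fd1, fd2, fd3 = self.d1(z), self.d2(z), self.d3(z)
        # Step 32: H = [F_g; F_c; F_d1; F_d2; F_d3] (channel concatenation).
        h = self.fuse(torch.cat([fg, fc, fd1, fd2, fd3], dim=1))
        # Step 33: pixel-wise classification, then bilinear upsampling.
        logits = self.classify(h)
        return F.interpolate(logits, size=out_size,
                             mode='bilinear', align_corners=False)
```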
As shown in fig. 4, the encoder structure consists mainly of 3 convolutional layers and 29 bottleneck layers that extract the features of the image, where the last 24 bottleneck layers replace ordinary convolution with dilated convolution with a dilation rate of 2.
The invention adopts the deep residual network ResNet-101 as the encoder because, thanks to its residual structure, ResNet has stronger feature extraction capability than VGG and GoogLeNet. Meanwhile, considering the large scale range of remote sensing images, ordinary convolution is replaced by dilated convolution in deep-level feature extraction.
The original ResNet-101 starts with a 7x7 convolution; to save computation while obtaining the same receptive field as the 7x7 convolution, it is replaced by a stack of three 3x3 convolutions, reducing the number of convolution kernel parameters per channel from 49 to 27. This stack can be referred to as layer 0 res0. The subsequent layers 1, 2, 3 and 4 are all composed of bottleneck layers: layer 1 res1 comprises 1 basic block (BasicBlock) and 2 bottleneck layers, layer 2 res2 comprises 1 basic block and 3 bottleneck layers, layer 3 res3 comprises 1 basic block and 22 bottleneck layers, and layer 4 res4 comprises 1 basic block and 2 bottleneck layers. The deeper network structure extracts the complex features in remote sensing images better, and the residual structure prevents the degradation problem caused by an overly deep network.
When extracting high-level features, pooling is often used to enlarge the receptive field, but its drawback is that small objects are easily lost. Since small targets in remote sensing images often deserve extra attention, and to prevent small-target information from becoming impossible to reconstruct, the pooling layers used by ResNet-101 in deep-level feature extraction are removed and ordinary convolution is replaced by dilated convolution.
In layer 3 res3 and layer 4 res4, the pooling layers used for downsampling are removed, and the ordinary 3x3 convolutions in all bottleneck layers are replaced by dilated convolutions with the dilation rate set to 2, achieving the same receptive-field enlargement as the removed pooling layers. In layer 3 res3 and layer 4 res4, each bottleneck first reduces the dimensionality of the high-dimensional features with a 1x1 convolution, then extracts features with the dilated convolution, then maps the feature dimensionality back with another 1x1 convolution, and finally adds the result directly to the original input to obtain the bottleneck output.
Because the model uses ResNet-101 only for feature extraction, the final average pooling layer and fully connected layer of the original ResNet-101 are removed and replaced by a 3x3 convolutional layer.
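The modified bottleneck of layers res3 and res4 can be sketched as follows; the dilation rate of 2 comes from the text, while the channel counts follow common ResNet conventions and are assumptions.

```python
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Sketch of a res3/res4 bottleneck: 1x1 reduce, dilated 3x3, 1x1 expand,
    with an identity shortcut and no downsampling."""

    def __init__(self, channels=1024, reduced=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, 1, bias=False),      # reduce dims
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, 3, padding=dilation,  # dilated conv
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1, bias=False),      # map dims back
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual addition with the original input.
        return self.relu(self.body(x) + x)
```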
As shown in FIG. 5, considering the large scale variation of remote sensing images, the decoder uses three dilated convolutions with different dilation rates to process the feature maps in parallel, capturing objects and context information in the image at multiple scales, so that multi-scale images obtain a better segmentation effect. Moreover, remote sensing images have the characteristic of large and rich scenes, so classifying the overall scene of a remote sensing image can help its semantic segmentation. For example, if the scene is a city, buildings and roads are more likely to appear in it, while the probability of oil tanks and airplanes appearing in a city scene is very low. Therefore, the decoder design not only considers the multi-scale problem of targets but also introduces a global scene descriptor to help the model make better predictions.
Experimental verification section:
Table 1 compares the algorithm of the present invention with methods commonly used in semantic segmentation tasks; the method designed by the present invention achieves the highest score on the DroneDeploy remote sensing image dataset.
Table 1 Performance comparison with existing algorithms
Note: * marks methods with an added self-attention mechanism.
Among the models without a self-attention mechanism, RefineNet combines coarse high-level semantic features with fine low-level features; PSPNet adopts a pyramid pooling module (PPM) to aggregate the context information of different regions; DeepLab fuses segmentation results at different resolutions with an atrous spatial pyramid pooling (ASPP) module; DenseASPP connects ASPP modules densely in the manner of DenseNet, yielding a larger receptive field and denser sampling points. What they have in common is the fusion of feature maps at different levels and the use of context information to generate a larger receptive field.
The self-attention mechanism is also a good way to enlarge the receptive field: by associating each position with all information globally, information can be transmitted across the whole image, giving a global receptive field. DANet combines spatial correlation with channel correlation; PSANet transmits information through the whole feature map in two steps, collection and distribution; EncNet encodes context information and takes scene characteristics into consideration; CCNet acquires global context information with a criss-cross structure; OCNet aggregates pixels and then partitions them. All of these are extensions of the self-attention mechanism, and it can be observed that the asterisked algorithms with a self-attention mechanism outperform the algorithms without one overall, which shows the effectiveness of the self-attention mechanism for the semantic segmentation task.
In fig. 6, the region descriptor attention maps of the intermediate results are visualized; each map represents the attention weights of a different region descriptor when reconstructing the feature map. Each region descriptor corresponds to specific semantic information, not merely to the foreground or background.
In fig. 7, it can be seen that the algorithm of the invention brings great advantages in segmenting objects whose pixels have different features but belong to the same class. In the first example, the pixels belonging to the same land class have two general appearances, a dark color scheme and a light color scheme. Owing to this feature difference, the reference model splits them into two classes, although in reality both belong to the same land class; the algorithm of the invention correctly groups them into one class.
In fig. 8, the algorithm of the invention segments small targets very accurately, and the assigned labels are also correct. By contrast, the reference model struggles to extract enough features from small targets, is easily disturbed by redundant features, often segments small targets defectively, and tends to merge them into one mass.
In fig. 9, the images share the characteristics of complex scenes, a wide variety of classes, and unclear boundaries between different objects. In such complex scenes, the algorithm of the invention segments objects in an orderly manner, in clear contrast to the disordered segmentation of the reference model.
In fig. 10, the algorithm of the invention is more accurate at the edges of object segmentation than the reference model. This benefit comes from reconstructing each pixel with the region descriptors: with distinct essential features, the boundaries between different objects can be distinguished more accurately.
Therefore, the final predictions of the region-based self-attention model achieve a better segmentation effect than the reference model without a self-attention mechanism; the advantage of the algorithm of the invention is especially clear in the regions marked with white boxes in the figures.
The invention has the following advantages:
1) The method introduces the self-attention mechanism into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, lets information flow globally, and discovers long-range dependencies in the remote sensing image, which can greatly help the semantic features of remote sensing images.
2) The point-to-point self-attention mechanism is extended into a point-to-region self-attention mechanism, with which a noise-free feature map can be reconstructed.
The core idea is to map the data from a noisy high-dimensional space onto a compact subspace that captures the most essential semantic concepts, then compute the degree of correlation between each pixel point and the captured semantic features, and finally reassign a value to each pixel point with this set of descriptors, thereby reconstructing a noise-free feature map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A remote sensing image semantic segmentation method based on a region description self-attention mechanism, characterized by comprising the following steps:
Step 1: inputting a visible light remote sensing image into an encoder, extracting high-level semantic features, and obtaining feature maps of different levels;
Step 2: based on the feature maps of different levels, respectively performing global scene extraction and self-attention-based essential feature extraction to correspondingly obtain a scene guide feature map and a noise-free feature map;
Step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling back to the size of the visible light remote sensing image, and classifying pixel by pixel to obtain a semantic segmentation result of the remote sensing image.
2. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 1, wherein the encoder comprises a feature extraction network ResNet-101.
3. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 2, wherein the visible light remote sensing image I is input into the feature extraction network ResNet-101, the fourth layer of which outputs a feature map F4 and the fifth layer of which outputs a feature map F5.
4. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 3, wherein the specific steps of generating the noise-free feature map in step 2 are as follows:
Step 21: according to the ground truth, dividing the feature map F4 into K soft regions N = {M1, M2, …, MK}, where each soft region Mk belongs to the k-th category and N denotes the set of soft regions;
Step 22: weighting and aggregating the pixel values within each soft region to obtain a coarse region descriptor of the current region, with the calculation formula:

f_k = \sum_{i \in I} \tilde m_{k,i} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in the feature map F4, \tilde m_{k,i} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels of the feature map F4;
applying a 1x1 convolution transformation to the coarse region descriptors to obtain the final T region descriptors r_t;
Step 23: after reducing the dimensionality of the feature map F5, obtaining the self-attention weight of each pixel with respect to the region descriptors from the relation between each pixel and the region descriptors:

W_{it} = \frac{\exp(x_i^\top r_t)}{\sum_{t'=1}^{T} \exp(x_i^\top r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes a self-attention weight;
Step 24: based on the self-attention weights, computing the degree of association between each point and the regions:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it}\, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
Step 25: obtaining the noise-free feature map from the association between the points and the region descriptors:

z_i = g([x_i, y_i])

where g(\cdot) is a transformation function composed of a 1x1 convolution, a batch normalization layer and a ReLU activation function, x_i denotes the feature of pixel p_i in the feature map F4, and z_i denotes the feature of the noise-free feature map.
5. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 4, wherein the specific steps of generating the scene guide feature map in step 2 are:
Step 26: reducing the feature dimensionality of the feature map F5 with a 1x1 convolution to obtain local features, then passing them through spatial global average pooling and a 1x1 convolution in sequence to obtain a global scene descriptor;
Step 27: fusing the local features with the global scene descriptor to obtain the scene guide feature map Fg.
6. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 5, wherein step 3 specifically comprises:
Step 31: convolving the noise-free feature map with an ordinary 3x3 convolution and, in parallel, with three dilated convolutions of different dilation rates, obtaining the feature map Fc and the feature maps Fd1, Fd2 and Fd3;
Step 32: concatenating the feature vectors of the scene guide feature map Fg, the feature map Fc, the feature map Fd1, the feature map Fd2 and the feature map Fd3 to obtain the fused feature map H:

H = [F_g; F_c; F_{d1}; F_{d2}; F_{d3}]

Step 33: convolving the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and classifying pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
7. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 6, wherein the feature extraction network ResNet-101 comprises layer 0 res0, layer 1 res1, layer 2 res2, layer 3 res3 and layer 4 res4;
layer 0 res0 comprises three 3x3 convolutional layers, layer 1 res1 comprises 2 bottleneck layers, layer 2 res2 comprises 3 bottleneck layers, layer 3 res3 comprises 22 bottleneck layers, and layer 4 res4 comprises 2 bottleneck layers.
8. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 7, wherein the bottleneck layers in layer 1 res1 and layer 2 res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, and the bottleneck layers in layer 3 res3 and layer 4 res4 each comprise a 1x1 convolutional layer, a dilated convolutional layer and a 1x1 convolutional layer.
9. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to claim 8, wherein the dilation rate of the dilated convolutional layers is 2.
10. The remote sensing image semantic segmentation method based on the region description self-attention mechanism according to any one of claims 1 to 9, wherein the decoder comprises 3x3 convolutional layers, a dilated convolutional layer with a dilation rate of 16, a dilated convolutional layer with a dilation rate of 24, and a dilated convolutional layer with a dilation rate of 36.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010732126.3A (granted as CN111932553B) | 2020-07-27 | 2020-07-27 | Remote sensing image semantic segmentation method based on area description self-attention mechanism
Publications (2)
Publication Number | Publication Date
---|---
CN111932553A | 2020-11-13
CN111932553B | 2022-09-06
Family
ID=73315343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202010732126.3A | Remote sensing image semantic segmentation method based on area description self-attention mechanism | 2020-07-27 | 2020-07-27
Country Status (1)
Country | Link
---|---
CN | CN111932553B (en)
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant