CN111932553A - Remote sensing image semantic segmentation method based on area description self-attention mechanism - Google Patents


Info

Publication number
CN111932553A
Authority
CN
China
Prior art keywords
feature map
remote sensing
layer
sensing image
region
Prior art date
Legal status
Granted
Application number
CN202010732126.3A
Other languages
Chinese (zh)
Other versions
CN111932553B (en)
Inventor
赵丹培 (Zhao Danpei)
王晨旭 (Wang Chenxu)
史振威 (Shi Zhenwei)
姜志国 (Jiang Zhiguo)
张浩鹏 (Zhang Haopeng)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010732126.3A
Publication of CN111932553A
Application granted; publication of CN111932553B
Legal status: Active

Classifications

    • G06T 7/11 — Image analysis; segmentation; region-based segmentation
    • G06F 18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 — Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/047 — Neural networks; architecture; probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods
    • G06T 2207/10032 — Image acquisition modality; satellite or aerial image; remote sensing


Abstract

The invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism. A visible-light remote sensing image is input into an encoder, which extracts its high-level semantic features and produces feature maps at different levels. Based on these feature maps, global scene extraction and self-attention-based intrinsic feature extraction are performed, yielding a scene guide feature map and a noise-free feature map, respectively. The scene guide feature map and the noise-free feature map are then input into a decoder, upsampled back to the original image size, and classified pixel by pixel to obtain the semantic segmentation result of the remote sensing image. Through the encoder that extracts semantic features, the self-attention module that strengthens the internal connections of the image, and the decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the invention enlarges the receptive field of the model, adapts to the scale variation of the data, and alleviates the class imbalance problem.

Description

Remote sensing image semantic segmentation method based on area description self-attention mechanism
Technical Field
The invention relates to the technical field of remote sensing and computer vision, and in particular to remote sensing image semantic segmentation based on a region description self-attention mechanism.
Background
Applying semantic segmentation technology to the remote sensing field makes it possible to input a remote sensing image and output a class label for every pixel in it, which greatly aids the understanding of remote sensing images. For example, in territorial planning, if the land-cover type of each pixel on a satellite image (city, road, forest, farmland, river, etc.) can be identified, the distribution and occupied area of each type can be known clearly, which benefits overall planning. As another example, intelligent identification of buildings can quickly reveal whether illegal constructions exist, greatly reducing manpower consumption.
At present, most deep-learning work on semantic segmentation derives from the Fully Convolutional Network (FCN). The FCN converts well-known classification networks into fully convolutional ones by replacing the fully connected layers with convolution layers, and fuses deep features with shallow features through deconvolution, so that both high-level semantic features and low-level positional features are taken into account and model accuracy is improved. However, the FCN still cannot be applied to all scenes, mainly because the local nature of convolution results in a small receptive field, so that context information over a larger range cannot be considered. To solve these problems, a variety of semantic segmentation networks have been derived on the basis of the FCN.
Traditional remote sensing image semantic segmentation extracts features using only stacked convolutions. Owing to the physical nature of the convolution filter, the receptive field of the convolution operation is limited: the filter aims to capture local features and relations; in other words, information in a convolutional neural network flows only within local areas, which greatly hampers the understanding of complex scenes. This is especially true for the remote sensing task, where images contain many complex scenes and long-range dependencies need to be gathered to help predict the current position. Therefore, the invention introduces the self-attention mechanism into the traditional remote sensing image semantic segmentation task for the first time, enlarging the receptive field of the model, letting information flow globally, and discovering long-range dependencies in the remote sensing image, such as the containment relationship between an airplane and an airport, or the fact that a car should bear some relation to other cars or to roads; such relationships greatly enrich the semantic features of the remote sensing image.
However, the conventional point-to-point self-attention mechanism increases context information by computing the relationship between each pixel and every other pixel. Its drawback is that point-to-point similarity favors objects with identical characteristics: for example, one red car and another red car will be strongly associated, but since the label is simply the car class, it would be preferable for the red car to be associated with all cars and even with roads. In other words, for the current task of identifying cars, the color features are redundant; the colors in the feature map are noise, since a car's color does not help the current task and may even have the opposite effect because of the feature differences it introduces.
Therefore, how to provide a remote sensing image semantic segmentation method based on a region description self-attention mechanism is a problem which needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which introduces a self-attention mechanism based on region descriptors into the traditional remote sensing image semantic segmentation task, enlarges the receptive field of the model, lets information flow globally, and can discover long-range dependencies in the remote sensing image.
In order to achieve the purpose, the invention adopts the following technical scheme:
the remote sensing image semantic segmentation method based on the area description self-attention mechanism comprises the following steps:
step 1: inputting a visible-light remote sensing image into an encoder, extracting its high-level semantic features, and obtaining feature maps at different levels;
step 2: based on the feature maps at different levels, respectively performing global scene extraction and self-attention-based intrinsic feature extraction, correspondingly obtaining a scene guide feature map and a noise-free feature map;
step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible-light remote sensing image, and classifying pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
Further, the encoder includes the feature extraction network ResNet-101.
Further, the visible-light remote sensing image I is input into the feature extraction network ResNet-101, whose fourth layer outputs feature map F4 and whose fifth layer outputs feature map F5.
Further, the specific steps of generating the noise-free feature map in step 2 are as follows:
step 21: according to the ground truth, divide the feature map F4 into K soft regions N = {M1, M2, ..., MK}, where each soft region Mk belongs to the k-th category and N denotes the soft region set;
step 22: weight and aggregate the pixel values in each soft region to obtain a coarse region descriptor of the current region, calculated as:

f_k = \sum_{i \in I} \tilde{m}_{ki} x_i

where f_k denotes the coarse region descriptor, x_i denotes the feature of pixel p_i in feature map F4, \tilde{m}_{ki} denotes the probability that pixel p_i belongs to the k-th soft region, and I denotes the set of pixels of feature map F4. The coarse region descriptors are then transformed by a 1x1 convolution to obtain the final T region descriptors r_t;
step 23: after reducing the dimension of feature map F5, obtain the self-attention weight of each pixel with respect to each region descriptor from the relation between the pixel and the descriptor:

W_{it} = \frac{\exp(x_i^{\top} r_t)}{\sum_{t'=1}^{T} \exp(x_i^{\top} r_{t'})}

where T denotes the number of soft regions, r_t denotes a region descriptor, and W_{it} denotes the self-attention weight;
step 24: based on the self-attention weights, calculate the degree of association between each point and the regions:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it} \, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association between the point and the regions;
step 25: obtain the noise-free feature map from the association between each point and the region descriptors:

z_i = g([x_i ; y_i])

where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, x_i denotes the feature of pixel p_i in feature map F4, and z_i denotes the feature of the noise-free feature map.
Further, the specific steps of generating the scene guide feature map in step 2 are:
step 26: reduce the feature dimension of feature map F5 through a 1x1 convolution to obtain local features, then obtain a global scene descriptor through spatial global average pooling followed by a 1x1 convolution;
step 27: fuse the local features with the global scene descriptor to obtain the scene guide feature map Fg.
Further, step 3 specifically includes:
step 31: convolve the noise-free feature map with a conventional 3x3 convolution and, in parallel, with three dilated (atrous) 3x3 convolutions of different dilation rates, obtaining feature maps Fc, Fd1, Fd2 and Fd3;
step 32: concatenate the scene guide feature map Fg with the feature maps Fc, Fd1, Fd2 and Fd3 to obtain a fused feature map H:

H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})

step 33: convolve the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsample it to the size of the visible-light remote sensing image by bilinear interpolation, and classify pixel by pixel to obtain the semantic segmentation result of the remote sensing image.
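The bilinear upsampling in step 33 can be sketched directly. The following is a minimal NumPy illustration, not the patent's implementation: the align-corners convention, the single-channel toy score map, and the helper name `bilinear_upsample` are assumptions made for the sketch.

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Upsample a (H, W) map to (out_h, out_w) by bilinear interpolation
    (align-corners convention, chosen here for simplicity)."""
    H, W = x.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]        # vertical interpolation weights
    wx = (xs - x0)[None, :]        # horizontal interpolation weights
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

scores = np.arange(4.0).reshape(2, 2)  # tiny 2x2 single-class score map
up = bilinear_upsample(scores, 4, 4)   # back to "original" 4x4 size
print(up.shape)  # (4, 4)
```

In the full method, one such score map per class is upsampled and the per-pixel argmax over classes gives the segmentation labels.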
Further, the feature extraction network ResNet-101 comprises layer 0 (res0), layer 1 (res1), layer 2 (res2), layer 3 (res3) and layer 4 (res4).
Layer res0 comprises three 3x3 convolution layers; res1 comprises 2 bottleneck layers; res2 comprises 3 bottleneck layers; res3 comprises 22 bottleneck layers; and res4 comprises 2 bottleneck layers.
Further, the bottleneck layers in res1 and res2 each comprise a 1x1 convolution layer, a 3x3 convolution layer and a 1x1 convolution layer, while the bottleneck layers in res3 and res4 each comprise a 1x1 convolution layer, a dilated convolution layer and a 1x1 convolution layer.
Further, the dilation rate of the dilated convolution layers is 2.
Further, the decoder includes 3x3 convolution layers and dilated convolution layers with dilation rates of 16, 24 and 36.
According to the technical scheme, compared with the prior art, the invention provides a remote sensing image semantic segmentation method based on a region description self-attention mechanism. Through an encoder that extracts semantic features, a self-attention module that strengthens the internal connections of the image, and a decoder that maps the attention-weighted semantic features back to the original space for pixel-by-pixel classification, the method enlarges the receptive field of the model, adapts to the scale variation of the data, and alleviates the class imbalance problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a remote sensing image semantic segmentation method based on a region description self-attention mechanism provided by the invention.
Fig. 2 is a specific flowchart of the processing performed by the self-attention module according to the present invention.
Fig. 3 is a specific flowchart of the processing performed by the decoder according to the present invention.
Fig. 4 is a detailed structural diagram of an encoder provided by the present invention.
Fig. 5 is a detailed block diagram of a decoder according to the present invention.
FIG. 6 is a region descriptor attention map provided by the invention. The first column shows the visible-light remote sensing image; the first region descriptor (second column) describes building features, the second region descriptor (third column) describes vegetation features, and the third region descriptor (fourth column) describes road features.
FIG. 7 is a graph showing the effect of a comparative experiment on the consistency of same-class targets. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation result of the baseline model, and the fourth column shows the segmentation result of the algorithm of the present invention.
FIG. 8 is a graph showing the effect of a comparative experiment on small-target performance. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation result of the baseline model, and the fourth column shows the segmentation result of the algorithm of the present invention.
FIG. 9 is a graph showing the effect of a comparative experiment on the segmentation of multiple complex scenes. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation result of the baseline model, and the fourth column shows the segmentation result of the algorithm of the present invention.
FIG. 10 is a graph showing the effect of a comparative experiment on edge segmentation accuracy. The first column shows the original image, the second column shows the ground-truth pixel segmentation, the third column shows the segmentation result of the baseline model, and the fourth column shows the segmentation result of the algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image semantic segmentation method based on a region description self-attention mechanism, which comprises the following steps:
step 1: input the visible-light remote sensing image into the encoder and extract its high-level semantic features. The encoder adopts the feature extraction network ResNet-101, whose fourth layer outputs feature map F4 and whose fifth layer outputs feature map F5.
step 2: the self-attention module uses the ground truth as supervision to find the region descriptors. The core idea is to divide the pixels of the global context into a plurality of soft object regions under ground-truth supervision, where each region corresponds to one class in the ground truth. Within each soft region, a representation of the region as a whole is obtained by aggregating the pixels inside it; this representation is the region descriptor. Finally, each pixel's attention weights over the region descriptors are used in a weighted sum to obtain a new, region-descriptor-related feature for each pixel.
As shown in fig. 2, specifically:
step 21: feature map F4 is roughly divided into predicted object regions: according to the ground truth, F4 is divided into K soft regions N = {M1, M2, ..., MK}, where each soft region Mk belongs to the k-th category. A cross-entropy loss function is used during training to learn the soft-region generation from the ground truth, which can be regarded as introducing an auxiliary loss branch.
Step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, wherein the calculation formula is as follows:
Figure BDA0002603703350000061
in the formula (f)kDenotes a coarse area descriptor, xiRepresenting a pixel piIs characterized in that it is a mixture of two or more of the above-mentioned components,
Figure BDA0002603703350000062
representing a pixel piAnd (3) the probability of belonging to the k-th soft area is normalized by using softmax to ensure that the sum of the probabilities of all pixels in the whole image belonging to the k-th soft area is 1.
Carrying out convolution transformation on the coarse region descriptors by 1x1 to obtain the final T region descriptors rt
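The aggregation in step 22 can be sketched in a few lines. The following is a minimal NumPy illustration rather than the patent's implementation: the toy shapes, the random stand-in mask logits (which in the method are learned under ground-truth supervision), and the helper name `region_descriptors` are all assumptions.

```python
import numpy as np

def region_descriptors(features, region_logits):
    """features: (N, C) pixel features from F4, flattened spatially.
    region_logits: (K, N) unnormalized scores m_ki of pixel i for soft region k.
    Returns (K, C) coarse descriptors f_k = sum_i softmax_i(m_ki) * x_i."""
    # Softmax over the pixel axis, so each region's weights sum to 1
    # over the whole image, as in the normalization above.
    m = np.exp(region_logits - region_logits.max(axis=1, keepdims=True))
    m /= m.sum(axis=1, keepdims=True)
    return m @ features  # (K, N) @ (N, C) -> (K, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 pixels, 8 channels (toy sizes)
logits = rng.normal(size=(3, 16))  # 3 soft regions, random stand-in masks
desc = region_descriptors(feats, logits)
print(desc.shape)  # (3, 8): one coarse descriptor per soft region
```

With uniform mask logits each descriptor reduces to the mean pixel feature, which makes the weighted-average nature of the aggregation easy to check.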
Step 23: calculation of feature map F Using softmax5After dimension reduction, obtaining the self-attention weight value of each pixel relative to the area descriptor according to the relation between each pixel and the area descriptor:
Figure BDA0002603703350000071
wherein T represents the number of soft regions, rtDenotes a region descriptor, WitRepresenting a self-attention weight;
step 24: after reducing the dimension of feature map F5, the degree of association of each pixel with the regions is obtained from its region-descriptor-related features:

y_i = \rho\Big( \sum_{t=1}^{T} W_{it} \, \varphi(r_t) \Big)

where r_t denotes the t-th region descriptor, W_{it} denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, \varphi(\cdot) and \rho(\cdot) are two transformation functions, and y_i denotes the degree of association of the point with the regions;
step 25: the final pixel feature consists of two parts: one is the original feature x_i, the other is the feature y_i represented by the weighted sum of region descriptors. The resulting feature of the noise-free feature map is z_i:

z_i = g([x_i ; y_i])

where g(\cdot) is a transformation function comprising a 1x1 convolution, a batch normalization layer and a ReLU activation function, and x_i denotes the feature of pixel p_i in feature map F4.
The process of extracting the global scene from feature map F5 is as follows:
the feature map F5 output by the fifth layer of ResNet-101 undergoes feature dimension reduction through a 1x1 convolution to obtain local features x_i; spatial global average pooling followed by another 1x1 convolution then yields a global scene descriptor g(x); the local features {x_i} are fused with the global vector g(x) to obtain the scene-guided feature map Fg.
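A minimal NumPy sketch of the scene-guidance branch follows. It is illustrative only: the 1x1 convolutions are omitted, the toy shapes are assumptions, and broadcast-addition is one plausible fusion operator, since the patent does not specify how the local features and the global vector are fused.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 8, 4, 4
f5 = rng.normal(size=(C, H, W))  # toy stand-in for the reduced F5

scene = f5.mean(axis=(1, 2))     # spatial global average pooling -> (C,)
fg = f5 + scene[:, None, None]   # broadcast the scene vector back and fuse
print(fg.shape)  # (8, 4, 4)
```

Every spatial position thus receives the same image-level summary vector, which is how scene statistics can bias the per-pixel predictions.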
step 3: the scene guide feature map and the noise-free feature map are input into the decoder, upsampled back to the original image size, and classified pixel by pixel to obtain the semantic segmentation result of the remote sensing image. The specific steps are shown in fig. 3:
step 31: the noise-free feature map output by the self-attention module is convolved with a conventional 3x3 convolution and, in parallel, with three dilated (atrous) convolutions of different dilation rates, obtaining feature maps Fc, Fd1, Fd2 and Fd3;
step 32: the scene guide feature map Fg is concatenated with the feature maps Fc, Fd1, Fd2 and Fd3 to obtain the fused feature map H:

H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})

where d1, d2 and d3 denote the different dilation rates of the dilated convolutions, with d_1 = 16, d_2 = 24 and d_3 = 36. Finally, the fused feature map H is convolved with a 3x3 convolution to obtain a multi-scale feature map with attention, upsampled to the same size as the original image by bilinear interpolation, and classified pixel by pixel to obtain the final semantic segmentation result.
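The parallel-branch fusion above can be sketched with toy arrays. This is a hedged illustration, not the patent's decoder: single-channel maps, an all-ones (unlearned) kernel, small dilation rates (2, 3, 4) instead of 16/24/36 so an 8x8 input is meaningful, and the helper name `conv3x3` are all assumptions.

```python
import numpy as np

def conv3x3(x, dilation=1):
    """'Same'-padded 3x3 convolution of a single-channel map with the
    given dilation and an all-ones kernel (illustrative, not learned)."""
    H, W = x.shape
    d = dilation
    pad = np.pad(x, d)        # zero padding of width d keeps the size
    out = np.zeros_like(x)
    for dy in (-d, 0, d):     # the 3x3 taps sit d pixels apart
        for dx in (-d, 0, d):
            out += pad[d + dy:d + dy + H, d + dx:d + dx + W]
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(8, 8))   # noise-free feature map (one channel)
fg = rng.normal(size=(8, 8))  # scene guide feature map (one channel)
fc = conv3x3(x)               # ordinary 3x3 branch
fd1, fd2, fd3 = (conv3x3(x, d) for d in (2, 3, 4))  # dilated branches
Hmap = np.stack([fg, fc, fd1, fd2, fd3])            # channel concatenation
print(Hmap.shape)  # (5, 8, 8)
```

Each branch sees the same input at a different effective scale, and concatenation leaves the choice among scales to the following 3x3 convolution.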
As shown in fig. 4, the encoder is mainly composed of 3 convolution layers and 29 bottleneck layers for extracting image features, wherein the last 24 bottleneck layers replace ordinary convolution with dilated convolution at a dilation rate of 2.
The invention adopts the deep residual network ResNet-101 as the encoder because, owing to its residual structure, ResNet has stronger feature extraction capability than VGG and GoogLeNet. Meanwhile, considering the large scale range of remote sensing images, conventional convolution is replaced with dilated convolution in the deep-level feature extraction.
The original ResNet-101 begins with a 7x7 convolution; to save computation while achieving the same receptive field as the 7x7 convolution, it is replaced by a stack of three 3x3 convolutions, reducing the number of convolution kernel parameters per channel from 49 to 27. This stack can be referred to as layer 0 (res0). The subsequent layers 1-4 are composed of bottleneck layers: res1 comprises 1 basic block (BasicBlock) and 2 bottleneck layers, res2 comprises 1 basic block and 3 bottleneck layers, res3 comprises 1 basic block and 22 bottleneck layers, and res4 comprises 1 basic block and 2 bottleneck layers. The deeper network structure better extracts the complex features in remote sensing images, and the residual structure prevents the degradation problem caused by overly deep networks.
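The parameter and receptive-field claim above is simple arithmetic, checked in a couple of lines (the receptive-field formula for stride-1 stacks is standard, not taken from the patent):

```python
# Weights per channel for a single 7x7 convolution vs. three stacked 3x3s.
k7 = 7 * 7
k3_stack = 3 * (3 * 3)

# Receptive field of n stacked 3x3 convolutions at stride 1: 2n + 1 pixels.
n = 3
rf = 2 * n + 1

print(k7, k3_stack, rf)  # 49 27 7
```

So the three-layer stack covers the same 7x7 window with roughly 45% fewer weights per channel, plus two extra nonlinearities between the layers.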
When extracting high-level features, pooling is often used to enlarge the receptive field, but its drawback is that small objects are easily lost. Since small targets in remote sensing images often require extra attention, and to prevent small-target information from becoming unrecoverable, the pooling layers used by ResNet-101 in deep feature extraction are removed and conventional convolution is replaced with dilated convolution.
In layer 3 (res3) and layer 4 (res4), the pooling layer used for downsampling is removed, and the conventional 3x3 convolution in all bottleneck layers is replaced with a dilated convolution whose dilation rate is set to 2, achieving the same receptive-field enlargement as the removed pooling layer. In res3 and res4, a 1x1 convolution first reduces the dimension of the high-dimensional features, the dilated convolution then extracts features, another 1x1 convolution adjusts the feature dimension, and finally the result is added directly to the input to obtain the Bottleneck output.
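The receptive-field trade made by the dilated bottleneck can be stated as one formula; this small check uses the standard dilated-kernel span formula, which is general knowledge rather than text from the patent:

```python
def dilated_span(kernel=3, dilation=1):
    """Pixels per side covered by a `kernel` x `kernel` convolution
    whose taps are spaced `dilation` pixels apart."""
    return dilation * (kernel - 1) + 1

# Rate 2 makes a 3x3 kernel (9 weights) cover a 5x5 extent,
# matching the enlarged view of the pooling layer it replaces.
print(dilated_span(3, 1), dilated_span(3, 2))  # 3 5
```

The kernel keeps its 9 weights regardless of the rate, which is why dilation enlarges the view without adding parameters or discarding resolution the way pooling does.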
Because the model uses ResNet-101 only for feature extraction, the final average pooling layer and fully connected layer of the original ResNet-101 are removed and replaced with a 3x3 convolution layer.
As shown in FIG. 5, considering the large scale variation of remote sensing images, the decoder processes the feature maps in parallel with three dilated convolutions of different dilation rates, capturing objects and context information at multiple scales so that multi-scale images obtain a better segmentation effect. Moreover, remote sensing images are characterized by large, rich scenes, so classifying the overall scene of a remote sensing image can assist its semantic segmentation. For example, if the scene is a city, buildings and roads are likely to appear in it, while the probability of oil tanks and airplanes appearing in a city scene is very low. Therefore, in the design of the decoder, not only is the multi-scale problem of targets considered, but a global scene descriptor is also introduced to help the model make better predictions.
Experimental verification section:
table 1 compares the algorithm of the present invention with a commonly used method in semantic segmentation tasks, and the method designed by the present invention achieves the highest score on a droneDeploy remote sensing image dataset.
Table 1 compares performance with existing algorithms
[Table 1 image not reproduced: per-method scores on the DroneDeploy dataset]
Note: methods marked with * incorporate a self-attention mechanism.
For the models without a self-attention mechanism: RefineNet combines coarse high-level semantic features with fine low-level features; PSPNet adopts a pyramid pooling module (PPM) to aggregate context information from different regions; DeepLab fuses segmentation results at different resolutions using an atrous spatial pyramid pooling (ASPP) module; and DenseASPP densely connects ASPP modules in the manner of DenseNet, yielding a larger receptive field and denser sampling points. What they have in common is the fusion of feature maps at different levels and the use of context information to generate a larger receptive field.
The self-attention mechanism is also a good way to enlarge the receptive field: by associating each position with information from the entire image, information can propagate across the whole image, giving a global receptive field. DANet combines spatial correlation with channel correlation; PSANet transmits information across the whole map in two steps of distribution and collection; EncNet encodes context information to take scene characteristics into account; CCNet acquires global context information with a criss-cross structure; OCNet aggregates pixels and then partitions them. These are all extensions of the self-attention mechanism, and it can be observed that the asterisked algorithms with a self-attention mechanism outperform those without one overall, which demonstrates the effectiveness of the self-attention mechanism for the semantic segmentation task.
In FIG. 6, the region-descriptor attention maps of the intermediate results are visualized; each map shows the attention weights of a different region descriptor during feature-map reconstruction. Each region descriptor corresponds to specific semantic information, not merely to foreground or background.
FIG. 7 shows that the algorithm of the present invention has a clear advantage when segmenting objects whose pixels have different appearances but belong to the same class. In the first example, pixels belonging to the same land class exhibit two general appearances, one dark and one light. Because of this difference, the baseline model splits them into two categories, although both in fact belong to the same land class; the algorithm of the present invention correctly groups them into one.
FIG. 8 shows that the algorithm of the present invention segments small objects very accurately and assigns them the correct labels. The baseline model, by contrast, struggles to extract sufficient features from small targets and is easily disturbed by redundant features: small targets are often fragmented or incorrectly merged into a single region.
The images in FIG. 9 share complex scenes with a wide variety of object types and unclear boundaries between different objects. In such complex scenes, the algorithm of the present invention produces regular, well-formed segments, in clear contrast to the disordered segmentation of the baseline model.
FIG. 10 shows that the algorithm of the present invention is more accurate than the baseline model at object boundaries. Because each pixel is reconstructed from the region descriptors, the differing intrinsic features make the boundaries between different objects easier to distinguish.
In summary, the final predictions of the region-based self-attention model segment better than those of the baseline model without self-attention; the advantage of the algorithm of the present invention is especially apparent in the regions marked with white boxes in the figures.
The invention has the following advantages:
1) The method introduces a self-attention mechanism into the semantic segmentation of remote sensing images, enlarging the model's receptive field so that information flows globally and long-range dependencies in the image are discovered, which greatly benefits the semantic features of remote sensing imagery.
2) The conventional point-to-point self-attention mechanism is extended to a point-to-region self-attention mechanism, which can reconstruct a noise-free feature map.
The core idea is to map the data from a noisy high-dimensional space to a compact subspace that captures the most essential semantic concepts, compute the degree of correlation between each pixel and the captured semantic features, and then re-assign each pixel's value from this set of descriptors, thereby reconstructing a noise-free feature map.
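A toy NumPy sketch of this core idea (the descriptor construction, noise level and all dimensions are invented for illustration): noisy features are correlated with a small set of "essential" descriptors spanning a compact subspace and then re-assigned from them, which pulls each pixel back toward its clean semantic concept:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
C, K, N = 8, 2, 50
centers = np.eye(K, C) * 5.0                    # compact subspace: K essential concepts
labels = rng.integers(0, K, size=N)
X = centers[labels] + 0.3 * rng.standard_normal((N, C))   # noisy high-dimensional features

W = softmax(X @ centers.T)                      # correlation of each pixel with each concept
Z = W @ centers                                 # re-assignment from the descriptor set

err_noisy = np.linalg.norm(X - centers[labels])
err_clean = np.linalg.norm(Z - centers[labels])  # reconstruction suppresses the noise
```

Because the reconstruction is a convex combination of the clean descriptors, any component of the noise that lies outside the descriptor subspace is discarded.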
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The remote sensing image semantic segmentation method based on the area description self-attention mechanism is characterized by comprising the following steps of:
step 1: inputting the visible light remote sensing image into an encoder and extracting high-level semantic features to obtain feature maps of different levels;
step 2: performing global scene extraction and self-attention-based intrinsic feature extraction on the feature maps of different levels, to obtain a scene guide feature map and a noise-free feature map, respectively;
step 3: inputting the scene guide feature map and the noise-free feature map into a decoder, upsampling them back to the size of the visible light remote sensing image, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
2. The method for semantically segmenting the remote sensing image based on the area description self-attention mechanism according to claim 1, wherein the encoder comprises a feature extraction network ResNet-101.
3. The remote sensing image semantic segmentation method based on the area description self-attention mechanism according to claim 2, characterized in that the visible light remote sensing image I is input into the feature extraction network ResNet-101, which outputs a feature map F4 at the fourth layer and a feature map F5 at the fifth layer.
4. The remote sensing image semantic segmentation method based on the region description attention mechanism according to claim 3, wherein the specific steps of generating the noise-free feature map in the step 2 are as follows:
step 21: according to the ground truth, dividing the feature map F4 into K soft regions N = {M1, M2, …, MK}, wherein each soft region Mk belongs to the k-th category and N denotes the set of soft regions;
step 22: weighting and aggregating the pixel values in each soft region to obtain a coarse region descriptor of the current region, calculated as:

$$f_k = \sum_{p_i \in I} M_k(p_i)\, x_i$$

where $f_k$ denotes the coarse region descriptor, $x_i$ denotes the feature of pixel $p_i$ in the feature map F4, $M_k(p_i)$ denotes the probability that pixel $p_i$ belongs to the k-th soft region, and $I$ denotes the set of pixels of the feature map F4;
performing a 1x1 convolution transformation on the coarse region descriptors to obtain the final T region descriptors $r_t$;
step 23: reducing the dimension of the feature map F5, and obtaining the self-attention weight of each pixel with respect to each region descriptor from the relation between the pixel and the region descriptor:

$$W_{it} = \frac{\exp\!\big(\theta(x_i)^{\top} r_t\big)}{\sum_{t'=1}^{T} \exp\!\big(\theta(x_i)^{\top} r_{t'}\big)}$$

where T denotes the number of soft regions, $r_t$ denotes the t-th region descriptor, and $W_{it}$ denotes the self-attention weight;
step 24: based on the self-attention weights, calculating the degree of association between each point and the regions:

$$y_i = \sum_{t=1}^{T} W_{it}\, \rho(r_t)$$

where $r_t$ denotes the t-th region descriptor, $W_{it}$ denotes the self-attention weight between the i-th pixel of the dimension-reduced feature map F5 and the t-th region descriptor, $\theta(\cdot)$ and $\rho(\cdot)$ are two transformation functions, and $y_i$ denotes the degree of association between the point and the regions;
step 25: obtaining the noise-free feature map from the degree of association between the points and the region descriptors:

$$z_i = x_i + g(y_i)$$

where $g(\cdot)$ is a transformation function consisting of a 1x1 convolution, a batch normalization layer and a ReLU activation function, $x_i$ denotes the feature of pixel $p_i$ in the feature map F4, and $z_i$ denotes the representation of the noise-free feature map.
5. The remote sensing image semantic segmentation method based on the region description attention mechanism according to claim 4, wherein the specific steps of generating the scene guide feature map in step 2 are as follows:
step 26: performing feature dimension reduction on the feature map F5 with a 1x1 convolution to obtain local features, and applying spatial global average pooling followed by a 1x1 convolution to the local features to obtain a global scene descriptor;
step 27: fusing the local features with the global scene descriptor to obtain the scene guide feature map Fg.
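Steps 26–27 can be sketched in NumPy as follows. The 1x1 convolutions are stood in by plain matrix multiplies over the channel axis, and the fusion is done by broadcast addition — both are assumptions for illustration, since the claim does not fix these operators:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, Cr = 8, 8, 16, 4
F5 = rng.standard_normal((H, W, C))           # encoder feature map

W_red = rng.standard_normal((C, Cr)) * 0.1    # stand-in for the 1x1 reduction conv
local = F5 @ W_red                            # local features, H x W x Cr

g = local.mean(axis=(0, 1))                   # spatial global average pooling
W_g = rng.standard_normal((Cr, Cr)) * 0.1     # stand-in for the 1x1 conv on the pooled vector
scene = g @ W_g                               # global scene descriptor

Fg = local + scene                            # broadcast fusion: scene-guided feature map
```

The broadcast addition injects the same scene-level vector into every spatial position, so every pixel's prediction is conditioned on the overall scene.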
6. The remote sensing image semantic segmentation method based on the region description attention mechanism according to claim 5, wherein the step 3 specifically comprises:
step 31: convolving the noise-free feature map with a 3x3 convolution and, in parallel, with three atrous convolutions of different dilation rates, to obtain a feature map Fc and feature maps Fd1, Fd2 and Fd3;
step 32: performing feature-vector concatenation of the scene guide feature map Fg with the feature maps Fc, Fd1, Fd2 and Fd3 to obtain a fused feature map H:

$$H = \mathrm{concat}(F_g, F_c, F_{d1}, F_{d2}, F_{d3})$$
step 33: convolving the fused feature map with a 3x3 convolution to obtain a multi-scale feature map with attention, upsampling it to the size of the visible light remote sensing image by bilinear interpolation, and performing pixel-by-pixel classification to obtain the semantic segmentation result of the remote sensing image.
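The bilinear upsampling of step 33 can be sketched as follows (a single-channel, align-corners-style toy resize, not the patented decoder):

```python
import numpy as np

def bilinear_upsample(img, out_h, out_w):
    """Resize a single-channel map with bilinear interpolation (align-corners style)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)          # target sample coordinates in source space
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None]                    # fractional offsets
    wx = (xs - x0)[None, :]
    tl = img[np.ix_(y0, x0)]                   # four neighbouring corners
    tr = img[np.ix_(y0, x0 + 1)]
    bl = img[np.ix_(y0 + 1, x0)]
    br = img[np.ix_(y0 + 1, x0 + 1)]
    top = tl * (1 - wx) + tr * wx
    bot = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bot * wy

small = np.array([[0.0, 2.0], [4.0, 6.0]])
big = bilinear_upsample(small, 3, 3)           # values interpolated between the four corners
```

Applied channel-by-channel, the same interpolation restores the prediction map to the input image resolution before pixel-wise classification.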
7. The remote sensing image semantic segmentation method based on the area description self-attention mechanism according to claim 6, wherein the feature extraction network ResNet-101 comprises a 0th layer res0, a 1st layer res1, a 2nd layer res2, a 3rd layer res3 and a 4th layer res4;
the 0th layer res0 comprises 3 convolution layers of 3x3, the 1st layer res1 comprises 2 bottleneck layers, the 2nd layer res2 comprises 3 bottleneck layers, the 3rd layer res3 comprises 22 bottleneck layers, and the 4th layer res4 comprises 2 bottleneck layers.
8. The method for semantically segmenting the remote sensing image based on the area description self-attention mechanism according to claim 7, wherein the bottleneck layers in the 1st layer res1 and the 2nd layer res2 each comprise a 1x1 convolutional layer, a 3x3 convolutional layer and a 1x1 convolutional layer, and the bottleneck layers in the 3rd layer res3 and the 4th layer res4 each comprise a 1x1 convolutional layer, an atrous convolutional layer and a 1x1 convolutional layer.
9. The method for semantically segmenting the remote sensing image based on the area description self-attention mechanism according to claim 8, wherein the dilation rate of the atrous convolutional layers is 2.
10. The remote sensing image semantic segmentation method based on the area description self-attention mechanism according to any one of claims 1 to 9, wherein the decoder comprises a 3x3 convolutional layer, an atrous convolutional layer with a dilation rate of 16, an atrous convolutional layer with a dilation rate of 24 and an atrous convolutional layer with a dilation rate of 36.
CN202010732126.3A 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism Active CN111932553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732126.3A CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism


Publications (2)

Publication Number Publication Date
CN111932553A true CN111932553A (en) 2020-11-13
CN111932553B CN111932553B (en) 2022-09-06

Family

ID=73315343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010732126.3A Active CN111932553B (en) 2020-07-27 2020-07-27 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111932553B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110322446A (en) * 2019-07-01 2019-10-11 华中科技大学 A kind of domain adaptive semantic dividing method based on similarity space alignment
CN111047551A (en) * 2019-11-06 2020-04-21 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm
WO2020093210A1 (en) * 2018-11-05 2020-05-14 中国科学院计算技术研究所 Scene segmentation method and system based on contenxtual information guidance
US20200202128A1 (en) * 2018-12-21 2020-06-25 Samsung Electronics Co., Ltd. System and method for providing dominant scene classification by semantic segmentation


Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487927B (en) * 2020-11-26 2024-02-13 深圳市人工智能与机器人研究院 Method and system for realizing indoor scene recognition based on object associated attention
CN112487927A (en) * 2020-11-26 2021-03-12 深圳市人工智能与机器人研究院 Indoor scene recognition implementation method and system based on object associated attention
CN112528803A (en) * 2020-12-03 2021-03-19 中国地质大学(武汉) Road feature extraction method, device, equipment and storage medium
CN112528803B (en) * 2020-12-03 2023-12-19 中国地质大学(武汉) Road feature extraction method, device, equipment and storage medium
CN112580649A (en) * 2020-12-15 2021-03-30 重庆邮电大学 Semantic segmentation method based on regional context relation module
CN112699937A (en) * 2020-12-29 2021-04-23 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network
CN112699937B (en) * 2020-12-29 2022-06-21 江苏大学 Apparatus, method, device, and medium for image classification and segmentation based on feature-guided network
US11763542B2 (en) 2020-12-29 2023-09-19 Jiangsu University Apparatus and method for image classification and segmentation based on feature-guided network, device, and medium
WO2022141723A1 (en) * 2020-12-29 2022-07-07 江苏大学 Image classification and segmentation apparatus and method based on feature guided network, and device and medium
CN112749736A (en) * 2020-12-30 2021-05-04 华南师范大学 Image recognition method, control device and storage medium
CN113065586A (en) * 2021-03-23 2021-07-02 四川翼飞视科技有限公司 Non-local image classification device, method and storage medium
CN113223008A (en) * 2021-04-16 2021-08-06 山东师范大学 Fundus image segmentation method and system based on multi-scale guide attention network
CN113421259A (en) * 2021-08-20 2021-09-21 北京工业大学 OCTA image analysis method based on classification network
CN113537254A (en) * 2021-08-27 2021-10-22 重庆紫光华山智安科技有限公司 Image feature extraction method and device, electronic equipment and readable storage medium
CN113537254B (en) * 2021-08-27 2022-08-26 重庆紫光华山智安科技有限公司 Image feature extraction method and device, electronic equipment and readable storage medium
CN113807206B (en) * 2021-08-30 2023-04-07 电子科技大学 SAR image target identification method based on denoising task assistance
CN113807206A (en) * 2021-08-30 2021-12-17 电子科技大学 SAR image target identification method based on denoising task assistance
CN113989511B (en) * 2021-12-29 2022-07-01 中科视语(北京)科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN113989511A (en) * 2021-12-29 2022-01-28 中科视语(北京)科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN115170934A (en) * 2022-09-05 2022-10-11 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115937742A (en) * 2022-11-28 2023-04-07 北京百度网讯科技有限公司 Video scene segmentation and visual task processing method, device, equipment and medium
CN115937742B (en) * 2022-11-28 2024-04-12 北京百度网讯科技有限公司 Video scene segmentation and visual task processing methods, devices, equipment and media
CN115810020A (en) * 2022-12-02 2023-03-17 中国科学院空间应用工程与技术中心 Remote sensing image segmentation method and system from coarse to fine based on semantic guidance
CN116229277A (en) * 2023-05-08 2023-06-06 中国海洋大学 Strong anti-interference ocean remote sensing image semantic segmentation method based on semantic correlation
CN116229277B (en) * 2023-05-08 2023-08-08 中国海洋大学 Strong anti-interference ocean remote sensing image semantic segmentation method based on semantic correlation

Also Published As

Publication number Publication date
CN111932553B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN111932553B (en) Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN110909642A (en) Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN112581409B (en) Image defogging method based on end-to-end multiple information distillation network
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN114155371A (en) Semantic segmentation method based on channel attention and pyramid convolution fusion
CN113762396A (en) Two-dimensional image semantic segmentation method
CN115346071A (en) Image classification method and system for high-confidence local feature and global feature learning
CN114782949A (en) Traffic scene semantic segmentation method for boundary guide context aggregation
CN111062347A (en) Traffic element segmentation method in automatic driving, electronic device and storage medium
Wang Remote sensing image semantic segmentation algorithm based on improved ENet network
CN114155165A (en) Image defogging method based on semi-supervision
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN114332780A (en) Traffic man-vehicle non-target detection method for small target
Xi et al. High Resolution Remote Sensing Image Classification Using Hybrid Ensemble Learning
Norelyaqine et al. Architecture of Deep Convolutional Encoder-Decoder Networks for Building Footprint Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant