CN114419449A - Self-attention multi-scale feature fusion remote sensing image semantic segmentation method - Google Patents

Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Info

Publication number
CN114419449A
Authority
CN
China
Prior art keywords
feature
module
swin
scale
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210308387.1A
Other languages
Chinese (zh)
Other versions
CN114419449B (en)
Inventor
符颖
郭丹青
文武
吴锡
周激流
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202210308387.1A
Publication of CN114419449A
Application granted
Publication of CN114419449B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a self-attention multi-scale feature fusion method for semantic segmentation of remote sensing images. The segmentation network comprises a feature encoder and a decoder. The encoder passes the feature maps of its first three stages, which have different scales, to the corresponding self-attention multi-scale feature fusion modules in the decoder. The decoder upsamples starting from the last-stage feature map and superimposes it with the fused feature maps, continuing stage by stage until the scale of the first-stage feature map is reached. Finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result.

Description

Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
Technical Field
The invention relates to the field of remote sensing image processing, and in particular to a self-attention multi-scale feature fusion method for semantic segmentation of remote sensing images.
Background
With the continuous development of remote sensing technology, processing of remote sensing images has become increasingly important, and semantic segmentation is one of its key research directions. For both natural images and remote sensing images, semantic segmentation is a pixel-level classification task in which a label is assigned to every pixel. Compared with natural images, remote sensing images have higher resolution, more complex content and larger differences in object scale, and practical applications place higher demands on segmentation accuracy. With the great success of deep learning in computer vision in recent years, researchers have continuously applied deep learning to remote sensing image semantic segmentation; the results show that it outperforms most traditional methods and has gradually become the mainstream approach. The earliest deep learning methods applied to remote sensing semantic segmentation were mostly based on convolutional neural networks, the most classical being the fully convolutional network (FCN) for end-to-end image semantic segmentation. Its general idea is to replace fully connected layers with convolutional layers and to compute the score of each category with 1 x 1 convolutions, thus achieving pixel-level prediction. However, the FCN directly upsamples the feature map of the last convolutional layer to the original image size, so the segmentation edges are incomplete and the accuracy is low.
To better restore the feature map to the original image size, the semantic segmentation network U-Net, based on an encoder-decoder framework, was proposed. U-Net is a U-shaped symmetrical structure consisting of an encoder that extracts features and a decoder that restores the scale; skip connections are added in the decoder to further fuse the features extracted by the encoder and to improve the extraction of object edge features. These improvements give U-Net better segmentation accuracy and robustness with fewer training samples. Semantic segmentation models based on convolutional neural networks usually require repeated downsampling to obtain a larger receptive field, but this lowers the resolution of the image and loses position information. To address this problem, the semantic segmentation network DeepLab, which combines dilated (atrous) convolution with a conditional random field, was proposed on the basis of VGG16; the dilated convolution enlarges the receptive field while keeping the resolution unchanged, yielding higher segmentation accuracy.
Self-attention and the Transformer became familiar to researchers after Google's Transformer model, introduced in 2017, achieved striking results in natural language processing. It is precisely because of the Transformer's remarkable achievements in natural language processing and its powerful modeling capability, especially its excellent handling of global information, that researchers began to apply it to the field of computer vision.
The prior art scheme has the following defects:
1. Insufficient ability to extract semantic information from remote sensing images with complex content and background
Traditional convolutional neural networks have a limited ability to extract semantic information. Faced with remote sensing images whose content and background are complex, they cannot effectively extract the key information required for semantic segmentation, which greatly degrades the segmentation result.
2. Correlations between features of different scales are not considered during feature fusion
Because object scales in remote sensing images differ greatly, the semantic information of many small and medium-sized objects cannot be propagated into the deep layers of the network, which seriously affects the segmentation result. Previous multi-scale feature fusion methods only superimpose features from top to bottom in a simple way, without considering the correlations between different scales, so multi-scale semantic information is not well exploited.
3. Poor generalization on remote sensing data with unevenly distributed objects
Because of differences between shooting areas, remote sensing images have complex content and uneven object distribution. Traditional methods segment well only on images of a single scene type, such as urban or rural, and generalize poorly.
Disclosure of Invention
Aiming at the defects of the prior art, a self-attention multi-scale feature fusion remote sensing image semantic segmentation method is provided. A feature encoder transmits the feature maps of four stages, which have different scales, to a decoder; the decoder upsamples starting from the last-stage feature map and superimposes it with the self-attention multi-scale fused feature maps, upsampling step by step until the scale of the first-stage feature map is reached; finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result. The method can effectively fuse remote sensing semantic features of different scales and improves segmentation performance. The specific steps are as follows:
step 1: construct a remote sensing semantic segmentation network comprising a feature encoder and a decoder; train a Swin-T network on the ImageNet data set in advance, use the trained Swin-T network as the encoder, and use a pyramid-structured network with self-attention multi-scale feature fusion as the decoder;
step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder;
step 3: the decoder comprises three self-attention multi-scale feature fusion modules, namely a first feature fusion module, a second feature fusion module and a third feature fusion module; the feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module;
step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module;
step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module;
step 4: each feature fusion module further comprises an attention computing module; in each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores; the three correlation scores are multiplied with the corresponding feature maps before global pooling, and the multiplied feature maps are finally spliced along the channel dimension and the channel number is adjusted to be consistent with the channel number of the current stage;
step 5: perform prediction classification on each pixel, specifically:
as shown in Fig. 1, the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module; the three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
According to a preferred embodiment, the feature adjustment module works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
According to a preferred embodiment, step 5 further comprises: during segmentation prediction a multi-scale strategy is also adopted; the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed.
The invention has the beneficial effects that:
1. the self-attention multi-scale feature fusion module provided by the invention can effectively fuse features among different scales, can extract useful semantic features from the remote sensing image with complex background content, and can obtain better segmentation results on the semantic segmentation of the remote sensing image with complex background, variable object scales and uneven distribution.
2. The feature adjusting module provided by the invention can effectively transmit the large-scale features to the attention multi-scale feature fusion module, better retains the main features of the large-scale feature map and improves the segmentation precision.
3. The correlations between feature maps of different scales are taken into account: self-attention is used to compute the correlations, and the resulting correlation scores can be understood as weights. Fully considering these cross-scale correlations improves the segmentation accuracy for objects with large scale differences in remote sensing image semantic segmentation, in particular buildings and water bodies.
Drawings
FIG. 1 is a schematic diagram of the structure of a semantic segmentation network of the present invention;
FIG. 2 is a schematic diagram of a self-attention multi-scale module according to the present invention;
FIG. 3 is a block diagram of a feature adjustment module;
FIG. 4 is a graph comparing the results of the experiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The Swin-T network of the first stage represents the Swin-T first module.
The Swin-T network of the second stage represents the Swin-T second module.
The Swin-T network of the third stage represents the Swin-T third module.
The Swin-T network of the fourth stage represents the Swin-T fourth module.
The Swin Transformer comes in four sizes (Swin-T, Swin-S, Swin-B and Swin-L); their structures are the same and they differ only in the parameter settings that determine the network size. Swin-T is the smallest variant and comprises four Swin Transformer stages, i.e., stages one through four.
The following detailed description is made with reference to the accompanying drawings.
The invention mainly solves the problems of incomplete semantic segmentation and low accuracy caused by the complex content, large differences in object scale and uneven distribution of remote sensing images, and provides a self-attention multi-scale feature fusion remote sensing semantic segmentation method. Fig. 1 is a schematic structural diagram of the semantic segmentation network of the present invention, and as shown in Fig. 1 the specific steps of the invention include:
Step 1: construct a remote sensing semantic segmentation network comprising a feature encoder and a decoder; train a Swin-T network on the ImageNet data set in advance, use the trained Swin-T network as the encoder, and use a pyramid-structured network with self-attention multi-scale feature fusion as the decoder. Specifically, a Swin-T model pre-trained on ImageNet is used as the feature extractor, and feature maps of 1/4, 1/8 and 1/16 of the original image size are extracted and passed to the self-attention multi-scale feature fusion modules for fusion. During training, the original image is randomly rescaled to 0.5, 0.75, 1.0, 1.25, 1.5 or 2.0 times its size for multi-scale augmentation.
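A minimal sketch of this multi-scale training augmentation is given below; the interpolation modes (bilinear for the image, nearest-neighbour for the label map) are assumptions, since the patent only specifies the set of scale factors.

```python
import random

import torch
import torch.nn.functional as F

# Scale factors given in the description for multi-scale training augmentation.
SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 2.0)

def random_rescale(image: torch.Tensor, label: torch.Tensor):
    """Randomly rescale an (N, C, H, W) image tensor and its (N, H, W) label map.

    Bilinear interpolation for the image and nearest-neighbour for the label
    are assumptions; the patent only specifies the scale factors.
    """
    s = random.choice(SCALES)
    image = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
    label = F.interpolate(label.unsqueeze(1).float(), scale_factor=s, mode="nearest")
    return image, label.squeeze(1).long()
```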
Step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder.
Step 3: the decoder comprises three self-attention multi-scale feature fusion modules; as shown in Fig. 1, the first, second and third feature fusion modules are arranged from left to right. The feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
Step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module.
Step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module.
Step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module.
Step 4: each feature fusion module further comprises an attention computing module. In each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores. The globally pooled results of the three feature maps remain independent and correspond one-to-one to the feature maps before pooling, and the self-attention computation does not disturb this correspondence. The three correlation scores are multiplied with the corresponding feature maps before global pooling; the multiplied feature maps are finally spliced along the channel dimension, and the channel number is adjusted to be consistent with the channel number of the current stage. The purpose of multiplying the correlation scores with the pre-pooling feature maps is the following: correlations exist between feature maps of different scales, self-attention is used to compute these correlations, the resulting correlation scores can be understood as weights, and multiplying a correlation score with the corresponding pre-pooling feature map is therefore a weighting operation.
Fig. 2 is a schematic diagram of the self-attention multi-scale fusion module of the present invention, taking the second stage as an example: its feature map is 64 × 64 pixels with 192 channels, and the self-attention multi-scale feature fusion process is shown in Fig. 2.
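A minimal PyTorch sketch of this fusion module for the 64 × 64, 192-channel second stage is given below. Reading "splice, compute self-attention, then split" as attention over a three-token sequence, using nn.MultiheadAttention, and squashing the correlation scores with a sigmoid are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Self-attention multi-scale feature fusion (step 4), sketched for the second
    stage (64 x 64 feature maps, 192 channels). The three inputs are assumed to be
    already aligned in scale and channel count by steps 31-33."""

    def __init__(self, channels: int = 192, heads: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                           # global average pooling
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)  # back to the stage width

    def forward(self, f1, f2, f3):
        feats = (f1, f2, f3)
        tokens = torch.stack([self.pool(f).flatten(1) for f in feats], dim=1)  # (B, 3, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)                # self-attention over the three scales
        scores = torch.sigmoid(attn_out).unbind(dim=1)                 # split into three correlation scores
        weighted = [f * s[:, :, None, None] for f, s in zip(feats, scores)]
        return self.fuse(torch.cat(weighted, dim=1))                   # splice on channels, adjust channel count
```

With three aligned inputs of shape (B, 192, 64, 64), the module returns a fused map of the same shape.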
Directly pooling the large-scale feature maps to the target size has the advantage of requiring no additional parameters, but it does not retain effective information well, which degrades the multi-scale feature fusion. In order to better transfer the features of the large-scale feature map to the self-attention multi-scale feature fusion module, a feature adjustment module is proposed, as shown in Fig. 3. It works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
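A possible realization of the feature adjustment module is sketched below; the BatchNorm/ReLU placement and taking the residual from the 1 × 1 projection are assumptions, as the text only names the pooling, the convolutions and the residual connection.

```python
import torch.nn as nn

class FeatureAdjust(nn.Module):
    """Feature adjustment module (Fig. 3): 2x2 max pooling, a 1x1 convolution to set
    the channel count, two 3x3 convolutions for selective feature extraction, and a
    residual connection."""

    def __init__(self, in_ch: int, out_ch: int, down: int = 2):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=down, stride=down)  # 2x2 max pooling (down=4 when two stages apart)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.proj(self.pool(x))
        return self.act(x + self.body(x))  # residual connection against gradient explosion/vanishing
```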
Step 5: perform prediction classification on each pixel, specifically:
As shown in Fig. 1, the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module. The three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
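The top-down decoding and four-scale prediction fusion could be sketched as follows. The 1 × 1 channel-alignment convolutions before each addition, the per-scale 1 × 1 classification heads, and the class count of 7 (as in LoveDA) are assumptions added only to make the sketch self-contained; the patent states only that each map is doubled in resolution, superimposed, and that the four per-scale predictions are fused.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownDecoder(nn.Module):
    """Top-down decoding and four-scale prediction fusion of step 5 (a sketch)."""

    def __init__(self, chs=(96, 192, 384, 768), num_classes: int = 7):
        super().__init__()
        self.red43 = nn.Conv2d(chs[3], chs[2], kernel_size=1)   # stage 4 -> stage 3 width (assumed alignment)
        self.red32 = nn.Conv2d(chs[2], chs[1], kernel_size=1)   # stage 3 -> stage 2 width
        self.red21 = nn.Conv2d(chs[1], chs[0], kernel_size=1)   # stage 2 -> stage 1 width
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for c in (chs[3], chs[2], chs[1], chs[0])
        )

    def forward(self, f4, fused3, fused2, fused1, out_size):
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = self.red43(up(f4)) + fused3          # double and superimpose with fusion-module output
        d2 = self.red32(up(d3)) + fused2
        d1 = self.red21(up(d2)) + fused1
        logits = [
            F.interpolate(h(x), size=out_size, mode="bilinear", align_corners=False)
            for h, x in zip(self.heads, (f4, d3, d2, d1))
        ]
        return sum(logits) / len(logits)          # fuse the four per-scale predictions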
Step 5 also includes a multi-scale strategy at prediction time: the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed. In the prediction stage this multi-scale strategy improves segmentation accuracy: the remote sensing image is rescaled to each of the six ratios and fed into the network to obtain six results, which are then averaged and fused with weights to obtain a more accurate segmentation result.
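A sketch of this multi-scale prediction strategy is given below; equal-weight averaging of softmax probabilities is an assumption, since the text mentions both averaging and weighted fusion of the six results.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_predict(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
    """Run the network on rescaled copies of the (N, C, H, W) input, resize the
    class probabilities back to the input size and average them."""
    _, _, h, w = image.shape
    fused = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        probs = model(x).softmax(dim=1)                               # (N, classes, s*h, s*w)
        fused = fused + F.interpolate(probs, size=(h, w), mode="bilinear", align_corners=False)
    return (fused / len(scales)).argmax(dim=1)                        # per-pixel class map
```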
To evaluate the performance of the proposed remote sensing image semantic segmentation method, intersection over union (IoU) and mean intersection over union (mIoU), which are commonly used in remote sensing semantic segmentation, are adopted as evaluation metrics. The IoU of a class is the ratio of the intersection to the union of the predicted region and the ground truth for that class, and the mIoU is the average of the IoU over all classes. The IoU is computed as follows:
IoU = TP / (TP + FP + FN)
where TP is the number of pixels that are actually true and predicted true, FP is the number of pixels that are actually false but predicted true, and FN is the number of pixels that are actually true but predicted false. The higher the IoU, the better the model's semantic segmentation of the remote sensing image.
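The IoU and mIoU metrics can be computed directly from predicted and ground-truth label maps, as in the following sketch.

```python
import numpy as np

def iou_scores(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Per-class IoU = TP / (TP + FP + FN) and the mean IoU (mIoU).
    pred and target are integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        union = tp + fp + fn
        ious.append(tp / union if union > 0 else float("nan"))
    return ious, float(np.nanmean(ious))
```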
To verify the effectiveness of the proposed remote sensing image semantic segmentation method, it is compared on the test set with the benchmark scores of the network models reported for the LoveDA dataset, including FCN8s, DeepLabV3+, PAN, UNet++, Semantic-FPN, PSPNet, LinkNet, FarSeg, FactSeg and HRNet.
Comparison is first carried out at a single scale; the results are shown in Table 1. The proposed method achieves the best IoU in every category, and its mean IoU is 2.98% higher than the second-best result, showing that the model performs well on semantic segmentation of complex remote sensing images.
Table 1. Comparison of semantic segmentation results on the LoveDA dataset
[Table 1 is provided as an image in the original publication.]
For remote sensing images with large scale differences, using a multi-scale strategy during semantic segmentation training and testing can effectively improve model performance. Therefore, the method is also compared under the multi-scale setting with DeepLabV3+, UNet and HRNet on the LoveDA dataset, using the scales 0.5, 0.75, 1.0, 1.25, 1.5 and 1.75 uniformly. The comparison results are shown in Table 2: the multi-scale strategy significantly improves the performance of the different methods, and the proposed method reaches a mean IoU of 54.19%, 1.47% higher than the second-best result, which is the best result on the LoveDA dataset.
Table 2. Comparison of multi-scale training and multi-scale testing results
[Table 2 is provided as an image in the original publication.]
To further analyze the role of the individual modules in the overall network model, ablation studies were performed. The experiments compare ResNet50 + FPN, Swin-T + FPN, self-attention multi-scale feature fusion without the feature adjustment module, and self-attention multi-scale feature fusion with the feature adjustment module; the results are shown in Table 3. From Table 3 it can be seen that: 1) thanks to its powerful modeling capability, the Swin Transformer performs better on semantic segmentation of complex remote sensing images; 2) in self-attention multi-scale feature fusion, the improvement is limited when the feature adjustment module is not used, because the large-scale features are crudely pooled directly to the same scale and the features therefore do not match; 3) after the feature adjustment module is added, features of different scales are selectively adjusted before being input to the self-attention multi-scale feature fusion module, where their correlations are computed and they are then spliced and fused, which effectively improves segmentation performance at both single scale and multiple scales.
Table 3. Ablation experiments
[Table 3 is provided as an image in the original publication.]
Fig. 4 shows segmentation results of different models. The segmentation results of the proposed model are smoother, with more complete edges and no broken branches. Together with the experimental results and analysis above, this shows that the Swin-Transformer-based self-attention multi-scale feature fusion module can effectively fuse features of different scales and improve the model's performance on remote sensing image semantic segmentation. As can also be seen from Fig. 4, the proposed method generalizes well and maintains good performance on remote sensing images with uneven object distribution and large differences in image content.
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (3)

1. A self-attention multi-scale feature fusion remote sensing image semantic segmentation method, characterized in that a feature encoder transmits the feature maps of four stages, which have different scales, to a decoder; the decoder upsamples starting from the last-stage feature map and superimposes it with the self-attention multi-scale fused feature maps, upsampling step by step until the scale of the first-stage feature map is reached; finally, the feature maps of all scales are each upsampled to the original image size, every pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing image semantic segmentation result; the method can effectively fuse remote sensing semantic features of different scales and improves segmentation performance; the specific steps comprise:
step 1: constructing a remote sensing semantic segmentation network comprising a feature encoder and a decoder; training a Swin-T network on the ImageNet data set in advance, taking the trained Swin-T network as the encoder, and taking a pyramid-structured network with self-attention multi-scale feature fusion as the decoder;
step 2: the feature encoder comprises four sequentially connected Swin-T modules, namely a Swin-T first module, a Swin-T second module, a Swin-T third module and a Swin-T fourth module; the four Swin-T modules extract features from the input remote sensing image in turn, and the resulting four feature maps of different scales are transmitted to the decoder;
step 3: the decoder comprises three self-attention multi-scale feature fusion modules, namely a first feature fusion module, a second feature fusion module and a third feature fusion module; the feature maps generated by the Swin-T first, second and third modules are all input into the three feature fusion modules, and each fusion module performs feature fusion according to the scale and channel number of its current stage, specifically:
step 31: the first feature fusion module takes the scale and channel number of the feature map generated by the Swin-T first module as the standard, performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T second module, and performs 4-times upsampling and reduces the channel number to 1/4 for the feature map generated by the Swin-T third module;
step 32: the second feature fusion module takes the scale and channel number of the feature map generated by the Swin-T second module as the standard, adjusts the channel number and scale of the feature map generated by the Swin-T first module through the feature adjustment module, and performs 2-times upsampling and halves the channel number of the feature map generated by the Swin-T third module;
step 33: the third feature fusion module takes the scale and channel number of the feature map generated by the Swin-T third module as the standard, and adjusts the channel number and scale of the feature maps generated by the Swin-T first module and the Swin-T second module through the feature adjustment module;
step 4: each feature fusion module further comprises an attention computing module; in each fusion module, the attention computing module performs global average pooling on the three adjusted feature maps, splices the pooled results and computes self-attention over them, then splits the self-attention result in the same manner to obtain three correlation scores; the three correlation scores are multiplied with the corresponding feature maps before global pooling, and the multiplied feature maps are finally spliced along the channel dimension and the channel number is adjusted to be consistent with the channel number of the current stage;
step 5: performing prediction classification on each pixel, specifically:
the feature map generated by the Swin-T fourth module is upsampled by a factor of two and superimposed with the output of the third feature fusion module; the superimposed map is upsampled by two and superimposed with the output of the second feature fusion module; that result is upsampled by two and superimposed with the output of the first feature fusion module; the three superimposed feature maps and the feature map generated by the Swin-T fourth module are upsampled to the original image size, each pixel is classified by prediction, and the prediction results of the four scales are fused to obtain the final remote sensing semantic segmentation result.
2. The self-attention multi-scale feature fusion remote sensing image semantic segmentation method according to claim 1, characterized in that the feature adjustment module works as follows: the input feature map is first passed through 2 × 2 max pooling, which better retains the main features of the large-scale feature map; the channel number is then adjusted with a 1 × 1 convolution; two 3 × 3 convolutions selectively extract features; and a residual connection is used to avoid gradient explosion and vanishing and to accelerate network convergence.
3. The self-attention multi-scale feature fusion remote sensing image semantic segmentation method according to claim 1, characterized in that step 5 further comprises: during remote sensing semantic segmentation prediction a multi-scale strategy is also adopted; the input remote sensing image is segmented and predicted at scales of 0.5, 0.75, 1.0, 1.25, 1.5 and 2.0, and the segmentation results at all scales are finally superimposed.
CN202210308387.1A 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method Active CN114419449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308387.1A CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210308387.1A CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Publications (2)

Publication Number Publication Date
CN114419449A true CN114419449A (en) 2022-04-29
CN114419449B CN114419449B (en) 2022-06-24

Family

ID=81263512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308387.1A Active CN114419449B (en) 2022-03-28 2022-03-28 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN114419449B (en)



Patent Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
US20210019562A1 (en) * 2019-07-18 2021-01-21 Beijing Sensetime Technology Development Co., Ltd. Image processing method and apparatus and storage medium
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method
US20220051056A1 (en) * 2019-11-12 2022-02-17 Tencent Technology (Shenzhen) Company Limited Semantic segmentation network structure generation method and apparatus, device, and storage medium
US20210201499A1 (en) * 2019-12-30 2021-07-01 Medo Dx Pte. Ltd Apparatus and method for image segmentation using a deep convolutional neural network with a nested u-structure
CN111145170A (en) * 2019-12-31 2020-05-12 电子科技大学 Medical image segmentation method based on deep learning
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion
CN112329800A (en) * 2020-12-03 2021-02-05 河南大学 Salient object detection method based on global information guiding residual attention
CN112418176A (en) * 2020-12-09 2021-02-26 江西师范大学 Remote sensing image semantic segmentation method based on pyramid pooling multilevel feature fusion network
CN112560733A (en) * 2020-12-23 2021-03-26 上海交通大学 Multitasking system and method for two-stage remote sensing image
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033570A (en) * 2021-03-29 2021-06-25 同济大学 Image semantic segmentation method for improving fusion of void volume and multilevel characteristic information
CN113256649A (en) * 2021-05-11 2021-08-13 国网安徽省电力有限公司经济技术研究院 Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN113516126A (en) * 2021-07-02 2021-10-19 成都信息工程大学 Adaptive threshold scene text detection method based on attention feature fusion
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN113705675A (en) * 2021-08-27 2021-11-26 合肥工业大学 Multi-focus image fusion method based on multi-scale feature interaction network
CN113679426A (en) * 2021-09-14 2021-11-23 上海市第六人民医院 Ultrasonic image processing system
CN113850825A (en) * 2021-09-27 2021-12-28 太原理工大学 Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN113688813A (en) * 2021-10-27 2021-11-23 长沙理工大学 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN113902751A (en) * 2021-11-10 2022-01-07 南京大学 Intestinal neuron dysplasia identification method based on Swin-Unet algorithm
CN114066902A (en) * 2021-11-22 2022-02-18 安徽大学 Medical image segmentation method, system and device based on convolution and transformer fusion
CN114202550A (en) * 2021-11-24 2022-03-18 重庆邮电大学 Brain tumor MRI image three-dimensional segmentation method based on RAPNet network
CN114140472A (en) * 2022-02-07 2022-03-04 湖南大学 Cross-level information fusion medical image segmentation method
CN114240004A (en) * 2022-02-23 2022-03-25 武汉纺织大学 Garment fashion trend prediction method and system based on multi-source information fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SITONG WU et al.: "Fully Transformer Networks for Semantic Image Segmentation", arXiv:2106.04108v3, 28 December 2021, pages 1-12 *
ZEYU CHENG et al.: "Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation", IEEE Sensors Journal, vol. 21, no. 23, 1 December 2021, pages 26912-26920 *
WU Yuhang: "Multi-scale fusion liver organ segmentation based on an attention mechanism" (基于注意力机制的多尺度融合肝脏器官分割), Modern Computer (《现代计算机》), no. 13, 31 May 2021, pages 90-96 *
ZHENG Tingyue et al.: "Multi-scale retinal vessel segmentation based on fully convolutional neural networks" (基于全卷积神经网络的多尺度视网膜血管分割), Acta Optica Sinica (《光学学报》), vol. 39, no. 2, 28 February 2019, page 0211002-1 *
ZHONG Jianping: "Research on video object detection based on feature fusion and attention mechanism" (基于特征融合与注意力机制的视频目标检测研究), China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), no. 3, 15 March 2022, pages 138-2321 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231329A1 (en) * 2022-05-30 2023-12-07 湖南大学 Medical image semantic segmentation method and apparatus
CN115019182A (en) * 2022-07-28 2022-09-06 北京卫星信息工程研究所 Remote sensing image target fine-grained identification method, system, equipment and storage medium
CN115578406A (en) * 2022-12-13 2023-01-06 四川大学 CBCT jaw bone region segmentation method and system based on context fusion mechanism
CN116229065A (en) * 2023-02-14 2023-06-06 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116295469A (en) * 2023-05-19 2023-06-23 九识(苏州)智能科技有限公司 High-precision map generation method, device, equipment and storage medium
CN116295469B (en) * 2023-05-19 2023-08-15 九识(苏州)智能科技有限公司 High-precision map generation method, device, equipment and storage medium
CN116580241A (en) * 2023-05-22 2023-08-11 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN116580241B (en) * 2023-05-22 2024-05-14 内蒙古农业大学 Image processing method and system based on double-branch multi-scale semantic segmentation network
CN117315460A (en) * 2023-09-15 2023-12-29 生态环境部卫星环境应用中心 FarSeg algorithm-based dust source extraction method for construction sites of urban construction area

Also Published As

Publication number Publication date
CN114419449B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN114419449B (en) Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN112541503A (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN112801117B (en) Multi-channel receptive field guided characteristic pyramid small target detection network and detection method
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113239825B (en) High-precision tobacco beetle detection method in complex scene
CN113780132A (en) Lane line detection method based on convolutional neural network
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN116755090A (en) SAR ship detection method based on novel pyramid structure and mixed pooling channel attention mechanism
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN113313180B (en) Remote sensing image semantic segmentation method based on deep confrontation learning
CN114926826A (en) Scene text detection system
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN117058386A (en) Asphalt road crack detection method based on improved deep Labv3+ network
CN110136098B (en) Cable sequence detection method based on deep learning
CN117197663A (en) Multi-layer fusion picture classification method and system based on long-distance dependency mechanism
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant