CN116109920A - Remote sensing image building extraction method based on Transformer

Remote sensing image building extraction method based on Transformer

Info

Publication number
CN116109920A (application number CN202211597465.0A)
Authority
CN
China
Prior art keywords
feature
layer structure
feature map
map
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211597465.0A
Other languages
Chinese (zh)
Inventor
雷艳静
王渊
产思贤
卢雅婷
孟祥路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-05-12
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211597465.0A
Publication of CN116109920A
Current legal status: Pending

Classifications

    • G06V 20/176 — Scenes; scene-specific elements: terrestrial scenes; urban or other man-made structures
    • G06N 3/084 — Neural networks; learning methods: backpropagation, e.g. using gradient descent
    • G06V 10/26 — Image preprocessing: segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V 10/806 — Pattern recognition or machine learning: fusion of extracted features
    • G06V 10/82 — Pattern recognition or machine learning using neural networks
    • G06V 20/70 — Scenes; scene-specific elements: labelling scene content, e.g. deriving syntactic or semantic representations


Abstract

The invention discloses a Transformer-based remote sensing image building extraction method, which belongs to the technical field of image segmentation and comprises the following steps: acquiring a remote sensing image and preprocessing it; inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result. The invention designs an asymmetric network structure on top of the original Swin Transformer scheme and adopts a newly designed multi-branch weighted pyramid pooling module in the skip connections, so that feature information is mined further and the difficulty of recognizing buildings of many types and different scales in remote sensing images is resolved.

Description

Remote sensing image building extraction method based on Transformer
Technical Field
The invention belongs to the technical field of image segmentation and particularly relates to a Transformer-based remote sensing image building extraction method.
Background
Building extraction is an important subtask of semantic segmentation in computer vision and is of great significance to military reconnaissance, precision guidance, and many civilian applications. Unlike semantic segmentation of natural images, existing building extraction methods for remote sensing images face the problem that dense prediction with features extracted by a convolutional neural network struggles to enlarge the effective receptive field and to establish long-range dependencies. Networks built mainly on the Transformer, in turn, suffer from a large computational cost and easily overfit on small remote sensing datasets.
Disclosure of Invention
The invention aims to provide a Transformer-based remote sensing image building extraction method that designs an asymmetric network structure on top of the original Swin Transformer scheme, adopts a newly designed multi-branch weighted pyramid pooling module in the skip connections, further mines feature information, and solves the difficulty of recognizing buildings of many types and different scales in remote sensing images.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A Transformer-based remote sensing image building extraction method, comprising:
acquiring a remote sensing image and preprocessing the remote sensing image;
inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result;
The deep learning model has an asymmetric network structure and comprises a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder and a multi-level feature cascade fusion module, wherein:
the encoder Swin transducer receives the preprocessed remote sensing image and outputs four-dimensional characteristic images F through a four-layer structure 1 、F 2 、F 3 、F 4
Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42;
The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and processes them to output feature map F22′; the third layer acquires feature maps F22′ and F1 and processes them to output feature map F1′.
The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
Several alternatives are provided below; they are not additional limitations on the overall scheme above, but only further additions or preferences. Each alternative may be combined with the overall scheme individually, and multiple alternatives may be combined with one another, provided no technical or logical contradiction arises.
Preferably, the preprocessing includes scaling the remote sensing image to a size of 512×512.
Preferably, the four layers of the Swin Transformer encoder are defined, along the data flow direction, as the first, second, third and fourth layer structures;
The first layer structure comprises a patch partition block, a linear embedding block and a Swin Transformer block connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer and a Swin Transformer block connected in sequence from the data input side to the output side;
wherein the first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
Preferably, processing the feature maps F2, F3, F4 and outputting feature maps F22, F32, F42 comprises:
for the characteristic diagram F 4 The corresponding multi-branch pyramid pooling module sets the branch number as 4, and the characteristic diagram F is compared with the branch number 4 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 4 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 42
For the characteristic diagram F 3 The corresponding multi-branch pyramid pooling module sets the branch number as 4, and the characteristic diagram F is compared with the branch number 3 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 3 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 32
For the characteristic diagram F 2 The corresponding multi-branch pyramid pooling module sets the branch number as 5, and the characteristic diagram F is compared with the set branch number 2 Making different scalesPooling the degrees to obtain feature graphs with feature sizes of 1, 4, 8, 16 and 32 respectively corresponding to each branch, giving weights of 0.1, 0.2 and 0.4 to each corresponding branch, and recovering the feature graphs and the feature graphs F through bilinear interpolation after depth separable convolution 2 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 22
Preferably, the three layers of the decoder are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer and a shunted attention module.
Preferably, the first layer of the decoder acquiring feature maps F42 and F32 and outputting feature map F32′, the second layer acquiring feature maps F32′ and F22 and outputting feature map F22′, and the third layer acquiring feature maps F22′ and F1 and outputting feature map F1′ comprises:

Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure;

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure;

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
Preferably, the multi-level feature cascade fusion module fusing feature map F22′ with feature map F32′ to obtain feature map F22″, then fusing feature map F1′ with feature map F22″ to obtain the final segmentation map taken as the building segmentation result, comprises:

First, bilinear interpolation is applied to feature map F32′ to obtain feature map F32″ of the same size as F22′; feature map F32″ and feature map F22′ are then added, and bilinear interpolation is applied again to obtain feature map F1″ of the same size as F1′; finally, feature map F1″ and feature map F1′ are fused to obtain the final segmentation map.
In the Transformer-based remote sensing image building extraction method of the invention, a new decoder is designed on top of the existing Swin Transformer scheme to form a new asymmetric network structure, and redesigned multi-branch weighted pyramid pooling modules are added between the encoder and the decoder to deeply mine the semantic information of the features, which effectively improves the accuracy of building extraction from remote sensing images.
Drawings
FIG. 1 is a flow chart of the Transformer-based remote sensing image building extraction method according to the present invention;
FIG. 2 is a schematic diagram of a deep learning model according to the present invention;
FIG. 3 is a schematic diagram of a multi-branch weighted pyramid pooling module according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
To overcome the defects of the prior art in remote sensing image segmentation, this embodiment provides a Transformer-based remote sensing image building extraction method that improves the accuracy of building extraction from remote sensing images.
As shown in FIG. 1, the Transformer-based remote sensing image building extraction method of this embodiment includes the following steps:
and step 1, acquiring a remote sensing image, and preprocessing the remote sensing image.
To meet the input requirements of the neural network, this embodiment scales the remote sensing image to a size of 512×512. The preprocessing in this embodiment focuses on scaling the image size; in other embodiments, other image preprocessing may be performed to improve image quality.
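For illustration, a minimal preprocessing sketch is given below, assuming an OpenCV-based pipeline; only the 512×512 target size is fixed by this embodiment, so the interpolation mode and normalization here are assumptions.

```python
import cv2

def preprocess(path: str):
    image = cv2.imread(path)                              # H x W x 3 remote sensing tile
    image = cv2.resize(image, (512, 512),                 # scale to the network input size
                       interpolation=cv2.INTER_LINEAR)
    return image.astype("float32") / 255.0                # assumed [0, 1] normalization
```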
Step 2: input the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result.
This embodiment introduces a deep learning model to perform the semantic segmentation, which reduces the complexity of the segmentation operation and improves its accuracy and stability. As shown in FIG. 2, the deep learning model of this embodiment has an asymmetric network structure comprising a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder, and a multi-level feature cascade fusion module. The model is described below module by module.
1) Swin Transformer encoder: receives the preprocessed remote sensing image and outputs feature maps F1, F2, F3, F4 of four sizes through its four-layer structure.
The encoder of this embodiment is a Swin Transformer; for convenience of description, its four layers are defined, along the data flow direction, as the first, second, third and fourth layer structures.
The first layer structure comprises a patch partition block (Patch Partition), a linear embedding block (Linear Embedding) and a Swin Transformer block (Swin Transformer Block) connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer (Patch Merging) and a Swin Transformer block connected in sequence from the data input side to the output side. The first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
The Swin Transformer model as a whole adopts a hierarchical design with four layer structures; each layer reduces the resolution of the input feature map, expanding the receptive field layer by layer as a CNN does. At the input, a Patch Partition cuts the picture into individual 4×4 tiles, each of which is stretched into a one-dimensional vector and passed through the linear embedding. The Patch Merging module mainly reduces the picture resolution at the beginning of each layer structure, while the Swin Transformer block consists mainly of LayerNorm, an MLP, Window Attention, and Shifted Window Attention.
Note that the Swin Transformer is an existing network structure; its specific structure and workflow are not described in detail in this embodiment.
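For orientation only, a minimal sketch of the patch partition and linear embedding step follows, using the strided-convolution formulation common in standard Swin implementations; the embedding width of 96 follows the usual Swin-T configuration and is an assumption here.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch partition + linear embedding: cut the image into 4x4 patches,
    flatten each, and linearly project it to the embedding dimension."""
    def __init__(self, in_ch: int = 3, embed_dim: int = 96, patch: int = 4):
        super().__init__()
        # a strided conv is the usual equivalent of partition + linear embed
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 3, 512, 512)
        x = self.proj(x)                        # (B, 96, 128, 128)
        return x.flatten(2).transpose(1, 2)     # (B, L = 128*128, 96) token sequence
```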
2) Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42.
The multi-branch pyramid pooling module first converts the feature maps F2, F3, F4 from the dimensions B (batch size), L (number of patches), C (number of channels) into the form B (batch size), H (height), W (width), C (number of channels). The overall structure of the module is shown in FIG. 3. The module is an improvement on the conventional PPM pyramid pooling module; the difference is that, for each layer of the codec, the multi-branch pyramid pooling module is given different branches in a targeted manner, and each branch is assigned a weight reflecting its importance. In addition, depthwise separable convolution is used instead of the conventional convolution operation to reduce the computational cost. The modules are specifically arranged as follows:
for the characteristic diagram F 4 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 1) sets the branch number as 4 for the feature map F 4 Pooling operations of different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation (upsamples) after depth separable convolution (Depthwise Separable Convmodule or Depthwise Separable Convolution) is carried out 4 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 42
For the characteristic diagram F 3 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 2) sets the branch number as 4 for the feature map F 3 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 3 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 32
For the characteristic diagram F 2 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 3) sets the branch number to 5, corresponding to the feature map F 2 Pooling operations of different scales are carried out to obtain feature graphs with feature sizes of 1, 4, 8, 16 and 32 respectively corresponding to each branch, weights given to the corresponding branches are 0.1, 0.2 and 0.4 respectively, and after depth separable convolution is carried out on the feature graphs on each branch, the feature graphs and the feature graphs F are restored through bilinear interpolation 2 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 22
It should be noted that this embodiment adds a weight to each branch so that important feature information is better reflected and exerts an effective influence. When the features are weighted and combined, the features output by the depthwise separable convolution may be multiplied by the branch weights first and then concatenated after restoration by bilinear interpolation; alternatively, the features output by the depthwise separable convolution may be taken directly and then weighted and concatenated after restoration by bilinear interpolation. The weights in FIG. 3 are only an example and are not a specific limitation.
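For illustration, a minimal PyTorch sketch of such a multi-branch weighted pyramid pooling module follows. Concatenating the unpooled input map alongside the branches follows standard PPM practice and is an assumption, as is the fourth branch weight in the usage example, since only three weight values are listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution per channel followed by a 1x1 pointwise conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiBranchWeightedPPM(nn.Module):
    """Per branch: adaptive pooling -> depthwise separable conv -> weighting ->
    bilinear upsampling; branches are concatenated with the input map and
    fused by an ordinary convolution."""
    def __init__(self, channels: int, pool_sizes, weights):
        super().__init__()
        assert len(pool_sizes) == len(weights)
        self.pool_sizes, self.weights = pool_sizes, weights
        self.branch_convs = nn.ModuleList(
            DepthwiseSeparableConv(channels) for _ in pool_sizes)
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 3, padding=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        h, w = x.shape[2:]
        outs = [x]
        for size, conv, weight in zip(self.pool_sizes, self.branch_convs, self.weights):
            y = F.adaptive_avg_pool2d(x, size)              # pool to size x size
            y = conv(y) * weight                            # branch importance weight
            y = F.interpolate(y, (h, w), mode="bilinear", align_corners=False)
            outs.append(y)                                  # restored to H x W
        return self.fuse(torch.cat(outs, dim=1))            # concatenate, then convolve

# Example setting for F4 (768 channels assumed for the deepest Swin stage);
# the text lists weights 0.1, 0.2 and 0.5 for four branches, so the second
# 0.2 below is a placeholder assumption.
ppm4 = MultiBranchWeightedPPM(channels=768, pool_sizes=(1, 2, 3, 6),
                              weights=(0.1, 0.2, 0.2, 0.5))
```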
3) The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and outputs feature map F22′; the third layer acquires feature maps F22′ and F1 and outputs feature map F1′.
The three layers of the decoder of this embodiment are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer (Patch Expanding) and a shunted attention module (Shunted Transformer Block).
The patch expanding layer is the patch expanding layer of Swin-Unet, and the shunted attention module is based on the Shunted Transformer Block of the Shunted Transformer. The aggregation rate of the shunted attention module in each layer is set to two; the three decoding layers attend to the features differently, and the number of shunted streams set for each layer differs according to its features.
Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure.

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure.

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
In the patch expanding layer of this embodiment, a feature map with dimensions B (batch size), L (number of patches), C (number of channels) is first converted to B, L, 2C by a linear mapping, and then converted to B, 4L, C/2 by a reshaping that expands the spatial dimension. The shunted attention module projects vectors of the form {B, L, C} into tensors Q, K and V, aggregates K and V with different aggregation rates r, and then computes attention to obtain output vectors fused with global context information.
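A compact sketch of the patch expanding step just described follows, under the Swin-Unet formulation; rearranging the doubled channel dimension into a 2×2 spatial factor is the standard implementation and is assumed here. The shunted attention computation itself is omitted and follows the Shunted Transformer.

```python
import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """Patch expanding: (B, L, C) -> linear -> (B, L, 2C) -> rearrange ->
    (B, 4L, C/2), doubling the spatial resolution."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, l, c = x.shape
        assert l == h * w
        x = self.expand(x)                      # (B, L, 2C)
        x = x.view(b, h, w, 2, 2, c // 2)       # split 2C into 2x2 spatial factors
        x = x.permute(0, 1, 3, 2, 4, 5)         # interleave the factors into H and W
        return x.reshape(b, 4 * l, c // 2)      # (B, 4L, C/2)
```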
4) The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
To ensure the fusion effect, this embodiment first applies bilinear interpolation to feature map F32′ to obtain feature map F32″ of the same size as F22′, adds feature map F32″ and feature map F22′, applies bilinear interpolation again to obtain feature map F1″ of the same size as F1′, and finally fuses feature map F1″ with feature map F1′ to obtain the final segmentation map.
The feature fusion operation is a conventional one and is not described in detail in this embodiment.
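Under those conventions, the cascade can be sketched as follows on feature maps in (B, C, H, W) form; treating the final fusion of F1″ with F1′ as an element-wise addition is an assumption, since the exact fusion operation is left open above.

```python
import torch
import torch.nn.functional as F

def cascade_fuse(f32p: torch.Tensor, f22p: torch.Tensor, f1p: torch.Tensor) -> torch.Tensor:
    """Multi-level cascade fusion: upsample F32' to F22''s size and add,
    upsample the sum to F1''s size, then fuse with F1'."""
    f32pp = F.interpolate(f32p, size=f22p.shape[2:], mode="bilinear", align_corners=False)
    f22pp = f32pp + f22p                       # F32'' + F22'
    f1pp = F.interpolate(f22pp, size=f1p.shape[2:], mode="bilinear", align_corners=False)
    return f1pp + f1p                          # assumed element-wise final fusion
```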
To ensure the application effect of the deep learning model, it needs to be trained in advance; this embodiment provides the following training procedure:
step S1, acquiring a remote sensing image training data set with a building segmentation mask, randomly overturning the remote sensing image training data set, enhancing data of photometric distortion, and then starting training in batches.
This embodiment applies random flipping and photometric distortion data enhancement to the training dataset. Random flipping, i.e., flipping the pictures of each training batch horizontally or vertically with a probability of 50%, increases robustness on a small dataset and yields better accuracy for remote sensing segmentation. Photometric distortion data enhancement, i.e., adjusting the brightness, chromaticity, contrast and saturation of an image and adding noise, makes the dataset better match remote sensing images obtained at various times.
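A minimal augmentation sketch, assuming a torchvision pipeline: the 50% flip probability is stated above, while the photometric distortion ranges are illustrative assumptions; in practice the geometric flips must be applied jointly to the image and its segmentation mask.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # horizontal flip with 50% probability
    T.RandomVerticalFlip(p=0.5),          # vertical flip with 50% probability
    T.ColorJitter(brightness=0.3,         # photometric distortion: brightness,
                  contrast=0.3,           # contrast,
                  saturation=0.3,         # saturation,
                  hue=0.1),               # and chromaticity (hue)
])
```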
Step S2: adjust the input picture size of the enhanced training dataset to 512×512 and input it to the network's Swin Transformer encoder, obtaining the feature maps F1, F2, F3, F4 of four sizes output by stages 1, 2, 3 and 4 of the encoder.
The application adopts a Swin Transformer as the backbone network for feature extraction. The Swin Transformer is initialized with pre-training weights trained on ADE20K. Training proceeds in batches with a batch size of 8 (i.e., 8 pictures per batch); the optimizer is AdamW with an initial learning rate of 6e-05, momentum 0.9 and weight decay 0.01, and the learning rate schedule is the poly strategy with the power set to 1.
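A sketch of the stated training setup follows; interpreting the quoted momentum 0.9 as AdamW's first beta coefficient is an assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # AdamW with the stated initial learning rate and weight decay
    return torch.optim.AdamW(model.parameters(), lr=6e-05,
                             betas=(0.9, 0.999), weight_decay=0.01)

def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 1.0) -> float:
    """Poly schedule; with power = 1 the rate decays linearly to zero."""
    return base_lr * (1.0 - step / max_steps) ** power
```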
The 512×512 pictures are fed into the Swin Transformer encoder; after the four encoder layers, feature maps F1, F2, F3, F4 of the four sizes 128×128, 64×64, 32×32 and 16×16 are output in sequence.
The sizes of the feature maps are determined by the Swin Transformer backbone and are not detailed here. The original Swin Transformer uses only the features extracted by the last layer for the final segmentation prediction. This embodiment instead uses all the features output by the Swin Transformer: the feature information is deeply mined by the multi-branch weighted pyramid pooling modules, put into the decoder to be spliced in turn with the bottom-level features after decoding, and finally the multi-scale features are cascade-fused to generate the segmentation prediction mask.
Step S3: select the last three of the four feature maps, F2, F3, F4, and process them in the multi-branch weighted pyramid pooling modules to obtain the processed feature maps F22, F32, F42.
It should be noted that, to reduce the computation and parameter count, the depthwise separable convolution in the multi-branch weighted pyramid pooling module comprises a depthwise convolution and a pointwise convolution. The depthwise convolution performs a grouped convolution on each input channel, and the pointwise convolution is a 1×1 convolution. Processing the features of each scale with the multi-branch weighted pyramid pooling module further mines the semantic information of the features and facilitates subsequent processing.
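A quick arithmetic check of that saving, for an illustrative 3×3 kernel with 256 input and 256 output channels:

```python
# standard 3x3 convolution vs. depthwise + pointwise factorization
standard = 256 * 256 * 3 * 3            # 589,824 weights
separable = 256 * 3 * 3 + 256 * 256     # 2,304 + 65,536 = 67,840 weights
print(separable / standard)             # ~0.115, i.e. roughly 8.7x fewer weights
```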
Step S4: put the processed feature maps F1, F22, F32, F42 into their corresponding decoder layers. The bottom-most processed feature F42 must first pass through the patch expanding layer, which doubles its size and restores the size of the layer above; a feature map is then obtained through the decoder processing, and the final segmentation prediction mask is obtained by cascade fusion. The loss is computed and back-propagated to update the network parameters, completing the training of the network.
For the last three feature maps F32′, F22′, F1′: first, bilinear interpolation is applied to F32′ to obtain F32″ of the same size as F22′; F32″ and F22′ are added, and bilinear interpolation is applied again to obtain F1″ of the same size as F1′; F1″ and F1′ are fused to obtain the final feature map, which is converted from the B, L, C format into the B, H, W, C format for prediction. In addition, since the building extraction task has only two labels, building and background, the final number of channels is only 2.
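The format conversion mentioned here is a plain reshape; a minimal sketch:

```python
import torch

def tokens_to_map(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Convert a (B, L, C) token sequence to a (B, H, W, C) map, L == H * W,
    as required before the final 2-channel (building / background) prediction."""
    b, l, c = x.shape
    assert l == h * w, "token count must equal H * W"
    return x.view(b, h, w, c)
```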
The loss between the segmentation prediction map and the ground-truth label is calculated with the CrossEntropyLoss function, whose specific formula is:
CELoss=-(ylog(p(x))+(1-y)log(1-p(x)))
where y indicates whether the pixel is a target, taking the value 1 or 0, and p(x) is the predicted target score.
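A minimal sketch of this loss on the two-channel prediction, assuming the standard PyTorch cross-entropy, which reduces to the binary formula above for two classes:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the 2-channel prediction mask and the label mask.
    logits: (B, 2, H, W); target: (B, H, W) integer class indices in {0, 1}."""
    return F.cross_entropy(logits, target)
```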
It should be noted that the calculation of the classification loss, target score loss and box loss is already a relatively mature technology in the art and is not described here.
This yields the loss between the predicted values and the ground truth; before each batch completes, back-propagation is performed to reduce the loss while the network parameters are updated, and training of the next batch begins, until the training data of all batches have been used. The trained weights are finally obtained, and all updated parameters are stored in the output weight files.
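Tying these pieces together, a condensed sketch of the batch loop follows; model, loader and num_epochs are assumed to exist, and build_optimizer and segmentation_loss refer to the sketches given earlier.

```python
import torch

optimizer = build_optimizer(model)               # AdamW setup sketched above
for epoch in range(num_epochs):
    for images, masks in loader:                 # batches of 8 preprocessed images
        logits = model(images)                   # (B, 2, 512, 512) prediction mask
        loss = segmentation_loss(logits, masks)
        optimizer.zero_grad()
        loss.backward()                          # back-propagate to reduce the loss
        optimizer.step()                         # update the network parameters
torch.save(model.state_dict(), "weights.pth")    # store all updated parameters
```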
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of technical features, they should be considered within the scope of this description.
The above examples merely represent a few embodiments of the invention and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the invention, and these all fall within the protection scope of the invention. Accordingly, the protection scope of the invention should be subject to the appended claims.

Claims (7)

1. A Transformer-based remote sensing image building extraction method, characterized in that it comprises:
acquiring a remote sensing image and preprocessing the remote sensing image;
inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result;
The deep learning model has an asymmetric network structure and comprises a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder and a multi-level feature cascade fusion module, wherein:

The Swin Transformer encoder receives the preprocessed remote sensing image and outputs feature maps F1, F2, F3, F4 of four sizes through its four-layer structure;

Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42;

The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and processes them to output feature map F22′; the third layer acquires feature maps F22′ and F1 and processes them to output feature map F1′;

The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
2. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the preprocessing includes scaling the remote sensing image to a size of 512×512.
3. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the four layers of the Swin Transformer encoder are defined, along the data flow direction, as the first, second, third and fourth layer structures;

The first layer structure comprises a patch partition block, a linear embedding block and a Swin Transformer block connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer and a Swin Transformer block connected in sequence from the data input side to the output side;

wherein the first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
4. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that processing the feature maps F2, F3, F4 and outputting feature maps F22, F32, F42 comprises:

For feature map F4, the corresponding multi-branch pyramid pooling module sets the number of branches to 4 and applies pooling operations of different scales to F4, obtaining feature maps of sizes 1, 2, 3 and 6 on the respective branches, with branch weights of 0.1, 0.2 and 0.5 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F4, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F42;

For feature map F3, the corresponding multi-branch pyramid pooling module sets the number of branches to 4 and applies pooling operations of different scales to F3, obtaining feature maps of sizes 1, 2, 3 and 6 on the respective branches, with branch weights of 0.1, 0.2 and 0.5 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F3, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F32;

For feature map F2, the corresponding multi-branch pyramid pooling module sets the number of branches to 5 and applies pooling operations of different scales to F2, obtaining feature maps of sizes 1, 4, 8, 16 and 32 on the respective branches, with branch weights of 0.1, 0.2 and 0.4 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F2, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F22.
5. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the three layers of the decoder are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer and a shunted attention module.
6. The Transformer-based remote sensing image building extraction method of claim 5, characterized in that the first layer of the decoder acquiring feature maps F42 and F32 and outputting feature map F32′, the second layer acquiring feature maps F32′ and F22 and outputting feature map F22′, and the third layer acquiring feature maps F22′ and F1 and outputting feature map F1′ comprises:

Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure;

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure;

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
7. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the multi-level feature cascade fusion module fusing feature map F22′ with feature map F32′ to obtain feature map F22″, then fusing feature map F1′ with feature map F22″ to obtain the final segmentation map taken as the building segmentation result, comprises:

First, bilinear interpolation is applied to feature map F32′ to obtain feature map F32″ of the same size as F22′; feature map F32″ and feature map F22′ are then added, and bilinear interpolation is applied again to obtain feature map F1″ of the same size as F1′; finally, feature map F1″ and feature map F1′ are fused to obtain the final segmentation map.
CN202211597465.0A (priority date 2022-12-12, filing date 2022-12-12) — Remote sensing image building extraction method based on Transformer — Pending — CN116109920A (en)

Priority Applications (1)

CN202211597465.0A — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Applications Claiming Priority (1)

CN202211597465.0A — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Publications (1)

CN116109920A — publication date 2023-05-12

Family

ID=86260652

Family Applications (1)

CN202211597465.0A (Pending, CN116109920A) — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Country Status (1)

Country Link
CN (1) CN116109920A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665053A (en) * 2023-05-30 2023-08-29 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN116665053B (en) * 2023-05-30 2023-11-07 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN116993756A (en) * 2023-07-05 2023-11-03 石河子大学 Method for dividing verticillium wilt disease spots of field cotton

Similar Documents

Publication Publication Date Title
CN116109920A (en) Remote sensing image building extraction method based on Transformer
CN111178316B (en) High-resolution remote sensing image land coverage classification method
CN111915619A (en) Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
CN109241972A (en) Image semantic segmentation method based on deep learning
CN111898439A (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN111401379A (en) DeepLabv3plus-IRCNet image semantic segmentation algorithm based on encoder-decoder structure
CN109597998A (en) Image feature construction method with joint embedding of visual features and semantic characterization
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN115311555A (en) Remote sensing image building extraction model generalization method based on batch style mixing
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113870286A (en) Foreground segmentation method based on multi-level feature and mask fusion
CN111046738B (en) Precision improvement method of light u-net for finger vein segmentation
CN110633706B (en) Semantic segmentation method based on pyramid network
CN116612283A (en) Image semantic segmentation method based on large convolution kernel backbone network
CN116778318A (en) Convolutional neural network remote sensing image road extraction model and method
CN117557856A (en) Pathological full-slice feature learning method based on self-supervision learning
CN115375922B (en) Light-weight significance detection method based on multi-scale spatial attention
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN112733777A (en) Road extraction method, device, equipment and storage medium for remote sensing image
CN116091918A (en) Land utilization classification method and system based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination