CN116109920A - Remote sensing image building extraction method based on Transformer

Remote sensing image building extraction method based on Transformer

Info

Publication number
CN116109920A (application number CN202211597465.0A)
Authority
CN
China
Prior art keywords
feature
layer structure
feature map
map
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211597465.0A
Other languages
Chinese (zh)
Inventor
雷艳静
王渊
产思贤
卢雅婷
孟祥路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-05-12
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211597465.0A
Publication of CN116109920A
Current legal status: Pending

Classifications

    • G06V 20/176 — Scenes; scene-specific elements: terrestrial scenes; urban or other man-made structures
    • G06N 3/084 — Neural networks; learning methods: backpropagation, e.g. using gradient descent
    • G06V 10/26 — Image preprocessing: segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V 10/806 — Pattern recognition or machine learning: fusion of extracted features
    • G06V 10/82 — Pattern recognition or machine learning using neural networks
    • G06V 20/70 — Scenes; scene-specific elements: labelling scene content, e.g. deriving syntactic or semantic representations


Abstract

The invention discloses a Transformer-based remote sensing image building extraction method, which belongs to the technical field of image segmentation and comprises the following steps: acquiring a remote sensing image and preprocessing it; inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result. The invention designs an asymmetric network structure on top of the original Swin Transformer scheme and adopts a newly designed multi-branch weighted pyramid pooling module in the skip connections, so that feature information is mined further and the difficulty of recognizing buildings of many types and different scales in remote sensing images is resolved.

Description

Remote sensing image building extraction method based on Transformer
Technical Field
The invention belongs to the technical field of image segmentation and particularly relates to a Transformer-based remote sensing image building extraction method.
Background
Building extraction is an important subtask of semantic segmentation in computer vision and is of great significance to military reconnaissance, precision guidance, and many civilian applications. Unlike semantic segmentation of natural images, existing building extraction methods for remote sensing images face the problem that dense prediction with features extracted by a convolutional neural network struggles to enlarge the effective receptive field and to establish long-range dependencies. Networks built mainly on the Transformer, in turn, suffer from a large computational cost and easily overfit on small remote sensing datasets.
Disclosure of Invention
The invention aims to provide a Transformer-based remote sensing image building extraction method that designs an asymmetric network structure on top of the original Swin Transformer scheme, adopts a newly designed multi-branch weighted pyramid pooling module in the skip connections, further mines feature information, and solves the difficulty of recognizing buildings of many types and different scales in remote sensing images.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A Transformer-based remote sensing image building extraction method, comprising:
acquiring a remote sensing image and preprocessing the remote sensing image;
inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result;
The deep learning model has an asymmetric network structure and comprises a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder and a multi-level feature cascade fusion module, wherein:
the encoder Swin transducer receives the preprocessed remote sensing image and outputs four-dimensional characteristic images F through a four-layer structure 1 、F 2 、F 3 、F 4
Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42;
The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and processes them to output feature map F22′; the third layer acquires feature maps F22′ and F1 and processes them to output feature map F1′.
The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
Several alternatives are provided below; they are not additional limitations on the overall scheme above, but only further additions or preferences. Each alternative may be combined with the overall scheme individually, and multiple alternatives may be combined with one another, provided no technical or logical contradiction arises.
Preferably, the preprocessing includes scaling the remote sensing image to a size of 512×512.
Preferably, the four layers of the Swin Transformer encoder are defined, along the data flow direction, as the first, second, third and fourth layer structures;
The first layer structure comprises a patch partition block, a linear embedding block and a Swin Transformer block connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer and a Swin Transformer block connected in sequence from the data input side to the output side;
wherein the first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
Preferably, processing the feature maps F2, F3, F4 and outputting feature maps F22, F32, F42 comprises:
for the characteristic diagram F 4 The corresponding multi-branch pyramid pooling module sets the branch number as 4, and the characteristic diagram F is compared with the branch number 4 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 4 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 42
For the characteristic diagram F 3 The corresponding multi-branch pyramid pooling module sets the branch number as 4, and the characteristic diagram F is compared with the branch number 3 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 3 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 32
For the characteristic diagram F 2 The corresponding multi-branch pyramid pooling module sets the branch number as 5, and the characteristic diagram F is compared with the set branch number 2 Making different scalesPooling the degrees to obtain feature graphs with feature sizes of 1, 4, 8, 16 and 32 respectively corresponding to each branch, giving weights of 0.1, 0.2 and 0.4 to each corresponding branch, and recovering the feature graphs and the feature graphs F through bilinear interpolation after depth separable convolution 2 Weighting and splicing the same size, and performing a convolution operation on the spliced feature images to obtain a feature image F 22
Preferably, the three layers of the decoder are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer and a shunted attention module.
Preferably, the first layer of the decoder acquiring feature maps F42 and F32 and outputting feature map F32′, the second layer acquiring feature maps F32′ and F22 and outputting feature map F22′, and the third layer acquiring feature maps F22′ and F1 and outputting feature map F1′ comprises:

Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure;

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure;

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
Preferably, the multi-level feature cascade fusion module fusing feature map F22′ with feature map F32′ to obtain feature map F22″, then fusing feature map F1′ with feature map F22″ to obtain the final segmentation map taken as the building segmentation result, comprises:

First, bilinear interpolation is applied to feature map F32′ to obtain feature map F32″ of the same size as F22′; feature map F32″ and feature map F22′ are then added, and bilinear interpolation is applied again to obtain feature map F1″ of the same size as F1′; finally, feature map F1″ and feature map F1′ are fused to obtain the final segmentation map.
In the Transformer-based remote sensing image building extraction method of the invention, a new decoder is designed on top of the existing Swin Transformer scheme to form a new asymmetric network structure, and redesigned multi-branch weighted pyramid pooling modules are added between the encoder and the decoder to deeply mine the semantic information of the features, which effectively improves the accuracy of building extraction from remote sensing images.
Drawings
FIG. 1 is a flow chart of the Transformer-based remote sensing image building extraction method according to the present invention;
FIG. 2 is a schematic diagram of a deep learning model according to the present invention;
FIG. 3 is a schematic diagram of a multi-branch weighted pyramid pooling module according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
To overcome the defects of the prior art in remote sensing image segmentation, this embodiment provides a Transformer-based remote sensing image building extraction method that improves the accuracy of building extraction from remote sensing images.
As shown in FIG. 1, the Transformer-based remote sensing image building extraction method of this embodiment includes the following steps:
and step 1, acquiring a remote sensing image, and preprocessing the remote sensing image.
To meet the input requirements of the neural network, this embodiment scales the remote sensing image to a size of 512×512. The preprocessing in this embodiment focuses on scaling the image size; in other embodiments, other image preprocessing may be performed to improve image quality.
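For illustration, a minimal preprocessing sketch is given below, assuming an OpenCV-based pipeline; only the 512×512 target size is fixed by this embodiment, so the interpolation mode and normalization here are assumptions.

```python
import cv2

def preprocess(path: str):
    image = cv2.imread(path)                              # H x W x 3 remote sensing tile
    image = cv2.resize(image, (512, 512),                 # scale to the network input size
                       interpolation=cv2.INTER_LINEAR)
    return image.astype("float32") / 255.0                # assumed [0, 1] normalization
```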
Step 2: input the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result.
This embodiment introduces a deep learning model to perform the semantic segmentation, which reduces the complexity of the segmentation operation and improves its accuracy and stability. As shown in FIG. 2, the deep learning model of this embodiment has an asymmetric network structure comprising a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder, and a multi-level feature cascade fusion module. The model is described below module by module.
1) Swin Transformer encoder: receives the preprocessed remote sensing image and outputs feature maps F1, F2, F3, F4 of four sizes through its four-layer structure.
The encoder of this embodiment is a Swin Transformer; for convenience of description, its four layers are defined, along the data flow direction, as the first, second, third and fourth layer structures.
The first layer structure comprises a patch partition block (Patch Partition), a linear embedding block (Linear Embedding) and a Swin Transformer block (Swin Transformer Block) connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer (Patch Merging) and a Swin Transformer block connected in sequence from the data input side to the output side. The first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
The Swin Transformer model as a whole adopts a hierarchical design with four layer structures; each layer reduces the resolution of the input feature map, expanding the receptive field layer by layer as a CNN does. At the input, a Patch Partition cuts the picture into individual 4×4 tiles, each of which is stretched into a one-dimensional vector and passed through the linear embedding. The Patch Merging module mainly reduces the picture resolution at the beginning of each layer structure, while the Swin Transformer block consists mainly of LayerNorm, an MLP, Window Attention, and Shifted Window Attention.
Note that the Swin Transformer is an existing network structure; its specific structure and workflow are not described in detail in this embodiment.
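For orientation only, a minimal sketch of the patch partition and linear embedding step follows, using the strided-convolution formulation common in standard Swin implementations; the embedding width of 96 follows the usual Swin-T configuration and is an assumption here.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch partition + linear embedding: cut the image into 4x4 patches,
    flatten each, and linearly project it to the embedding dimension."""
    def __init__(self, in_ch: int = 3, embed_dim: int = 96, patch: int = 4):
        super().__init__()
        # a strided conv is the usual equivalent of partition + linear embed
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 3, 512, 512)
        x = self.proj(x)                        # (B, 96, 128, 128)
        return x.flatten(2).transpose(1, 2)     # (B, L = 128*128, 96) token sequence
```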
2) Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42.
The multi-branch pyramid pooling module first converts the feature maps F2, F3, F4 from the dimensions B (batch size), L (number of patches), C (number of channels) into the form B (batch size), H (height), W (width), C (number of channels). The overall structure of the module is shown in FIG. 3. The module is an improvement on the conventional PPM pyramid pooling module; the difference is that, for each layer of the codec, the multi-branch pyramid pooling module is given different branches in a targeted manner, and each branch is assigned a weight reflecting its importance. In addition, depthwise separable convolution is used instead of the conventional convolution operation to reduce the computational cost. The modules are specifically arranged as follows:
for the characteristic diagram F 4 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 1) sets the branch number as 4 for the feature map F 4 Pooling operations of different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation (upsamples) after depth separable convolution (Depthwise Separable Convmodule or Depthwise Separable Convolution) is carried out 4 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 42
For the characteristic diagram F 3 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 2) sets the branch number as 4 for the feature map F 3 Pooling operations with different scales are carried out to obtain feature graphs with feature sizes of 1, 2, 3 and 6 respectively corresponding to each branch, weights respectively assigned to the corresponding branches are 0.1, 0.2 and 0.5, and the feature graphs on each branch are restored to the feature graph F through bilinear interpolation after depth separable convolution 3 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 32
For the characteristic diagram F 2 The corresponding multi-branch pyramid pooling module (Multi branch PPM Block 3) sets the branch number to 5, corresponding to the feature map F 2 Pooling operations of different scales are carried out to obtain feature graphs with feature sizes of 1, 4, 8, 16 and 32 respectively corresponding to each branch, weights given to the corresponding branches are 0.1, 0.2 and 0.4 respectively, and after depth separable convolution is carried out on the feature graphs on each branch, the feature graphs and the feature graphs F are restored through bilinear interpolation 2 Weighting and splicing the same size, and carrying out a traditional common convolution operation on the spliced characteristic diagram to obtain a characteristic diagram F 22
It should be noted that this embodiment adds a weight to each branch so that important feature information is better reflected and exerts an effective influence. When the features are weighted and combined, the features output by the depthwise separable convolution may be multiplied by the branch weights first and then concatenated after restoration by bilinear interpolation; alternatively, the features output by the depthwise separable convolution may be taken directly and then weighted and concatenated after restoration by bilinear interpolation. The weights in FIG. 3 are only an example and are not a specific limitation.
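For illustration, a minimal PyTorch sketch of such a multi-branch weighted pyramid pooling module follows. Concatenating the unpooled input map alongside the branches follows standard PPM practice and is an assumption, as is the fourth branch weight in the usage example, since only three weight values are listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution per channel followed by a 1x1 pointwise conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiBranchWeightedPPM(nn.Module):
    """Per branch: adaptive pooling -> depthwise separable conv -> weighting ->
    bilinear upsampling; branches are concatenated with the input map and
    fused by an ordinary convolution."""
    def __init__(self, channels: int, pool_sizes, weights):
        super().__init__()
        assert len(pool_sizes) == len(weights)
        self.pool_sizes, self.weights = pool_sizes, weights
        self.branch_convs = nn.ModuleList(
            DepthwiseSeparableConv(channels) for _ in pool_sizes)
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 3, padding=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        h, w = x.shape[2:]
        outs = [x]
        for size, conv, weight in zip(self.pool_sizes, self.branch_convs, self.weights):
            y = F.adaptive_avg_pool2d(x, size)              # pool to size x size
            y = conv(y) * weight                            # branch importance weight
            y = F.interpolate(y, (h, w), mode="bilinear", align_corners=False)
            outs.append(y)                                  # restored to H x W
        return self.fuse(torch.cat(outs, dim=1))            # concatenate, then convolve

# Example setting for F4 (768 channels assumed for the deepest Swin stage);
# the text lists weights 0.1, 0.2 and 0.5 for four branches, so the second
# 0.2 below is a placeholder assumption.
ppm4 = MultiBranchWeightedPPM(channels=768, pool_sizes=(1, 2, 3, 6),
                              weights=(0.1, 0.2, 0.2, 0.5))
```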
3) The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and outputs feature map F22′; the third layer acquires feature maps F22′ and F1 and outputs feature map F1′.
The three layers of the decoder of this embodiment are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer (Patch Expanding) and a shunted attention module (Shunted Transformer Block).
The patch expanding layer is the patch expanding layer of Swin-Unet, and the shunted attention module is based on the Shunted Transformer Block of the Shunted Transformer. The aggregation rate of the shunted attention module in each layer is set to two; the three decoding layers attend to the features differently, and the number of shunted streams set for each layer differs according to its features.
Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure.

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure.

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
In the patch expanding layer of this embodiment, a feature map with dimensions B (batch size), L (number of patches), C (number of channels) is first converted to B, L, 2C by a linear mapping, and then converted to B, 4L, C/2 by a reshaping that expands the spatial dimension. The shunted attention module projects vectors of the form {B, L, C} into tensors Q, K and V, aggregates K and V with different aggregation rates r, and then computes attention to obtain output vectors fused with global context information.
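A compact sketch of the patch expanding step just described follows, under the Swin-Unet formulation; rearranging the doubled channel dimension into a 2×2 spatial factor is the standard implementation and is assumed here. The shunted attention computation itself is omitted and follows the Shunted Transformer.

```python
import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """Patch expanding: (B, L, C) -> linear -> (B, L, 2C) -> rearrange ->
    (B, 4L, C/2), doubling the spatial resolution."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, l, c = x.shape
        assert l == h * w
        x = self.expand(x)                      # (B, L, 2C)
        x = x.view(b, h, w, 2, 2, c // 2)       # split 2C into 2x2 spatial factors
        x = x.permute(0, 1, 3, 2, 4, 5)         # interleave the factors into H and W
        return x.reshape(b, 4 * l, c // 2)      # (B, 4L, C/2)
```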
4) The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
To ensure the fusion effect, this embodiment first applies bilinear interpolation to feature map F32′ to obtain feature map F32″ of the same size as F22′, adds feature map F32″ and feature map F22′, applies bilinear interpolation again to obtain feature map F1″ of the same size as F1′, and finally fuses feature map F1″ with feature map F1′ to obtain the final segmentation map.
The feature fusion operation is a conventional one and is not described in detail in this embodiment.
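Under those conventions, the cascade can be sketched as follows on feature maps in (B, C, H, W) form; treating the final fusion of F1″ with F1′ as an element-wise addition is an assumption, since the exact fusion operation is left open above.

```python
import torch
import torch.nn.functional as F

def cascade_fuse(f32p: torch.Tensor, f22p: torch.Tensor, f1p: torch.Tensor) -> torch.Tensor:
    """Multi-level cascade fusion: upsample F32' to F22''s size and add,
    upsample the sum to F1''s size, then fuse with F1'."""
    f32pp = F.interpolate(f32p, size=f22p.shape[2:], mode="bilinear", align_corners=False)
    f22pp = f32pp + f22p                       # F32'' + F22'
    f1pp = F.interpolate(f22pp, size=f1p.shape[2:], mode="bilinear", align_corners=False)
    return f1pp + f1p                          # assumed element-wise final fusion
```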
To ensure the application effect of the deep learning model, it needs to be trained in advance; this embodiment provides the following training procedure:
step S1, acquiring a remote sensing image training data set with a building segmentation mask, randomly overturning the remote sensing image training data set, enhancing data of photometric distortion, and then starting training in batches.
This embodiment applies random flipping and photometric distortion data enhancement to the training dataset. Random flipping, i.e., flipping the pictures of each training batch horizontally or vertically with a probability of 50%, increases robustness on a small dataset and yields better accuracy for remote sensing segmentation. Photometric distortion data enhancement, i.e., adjusting the brightness, chromaticity, contrast and saturation of an image and adding noise, makes the dataset better match remote sensing images obtained at various times.
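A minimal augmentation sketch, assuming a torchvision pipeline: the 50% flip probability is stated above, while the photometric distortion ranges are illustrative assumptions; in practice the geometric flips must be applied jointly to the image and its segmentation mask.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # horizontal flip with 50% probability
    T.RandomVerticalFlip(p=0.5),          # vertical flip with 50% probability
    T.ColorJitter(brightness=0.3,         # photometric distortion: brightness,
                  contrast=0.3,           # contrast,
                  saturation=0.3,         # saturation,
                  hue=0.1),               # and chromaticity (hue)
])
```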
Step S2: adjust the input picture size of the enhanced training dataset to 512×512 and input it to the network's Swin Transformer encoder, obtaining the feature maps F1, F2, F3, F4 of four sizes output by stages 1, 2, 3 and 4 of the encoder.
The application adopts a Swin Transformer as the backbone network for feature extraction. The Swin Transformer is initialized with pre-training weights trained on ADE20K. Training proceeds in batches with a batch size of 8 (i.e., 8 pictures per batch); the optimizer is AdamW with an initial learning rate of 6e-05, momentum 0.9 and weight decay 0.01, and the learning rate schedule is the poly strategy with the power set to 1.
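A sketch of the stated training setup follows; interpreting the quoted momentum 0.9 as AdamW's first beta coefficient is an assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # AdamW with the stated initial learning rate and weight decay
    return torch.optim.AdamW(model.parameters(), lr=6e-05,
                             betas=(0.9, 0.999), weight_decay=0.01)

def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 1.0) -> float:
    """Poly schedule; with power = 1 the rate decays linearly to zero."""
    return base_lr * (1.0 - step / max_steps) ** power
```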
The 512×512 pictures are fed into the Swin Transformer encoder; after the four encoder layers, feature maps F1, F2, F3, F4 of the four sizes 128×128, 64×64, 32×32 and 16×16 are output in sequence.
The sizes of the feature maps are determined by the Swin Transformer backbone and are not detailed here. The original Swin Transformer uses only the features extracted by the last layer for the final segmentation prediction. This embodiment instead uses all the features output by the Swin Transformer: the feature information is deeply mined by the multi-branch weighted pyramid pooling modules, put into the decoder to be spliced in turn with the bottom-level features after decoding, and finally the multi-scale features are cascade-fused to generate the segmentation prediction mask.
Step S3: select the last three of the four feature maps, F2, F3, F4, and process them in the multi-branch weighted pyramid pooling modules to obtain the processed feature maps F22, F32, F42.
It should be noted that, to reduce the computation and parameter count, the depthwise separable convolution in the multi-branch weighted pyramid pooling module comprises a depthwise convolution and a pointwise convolution. The depthwise convolution performs a grouped convolution on each input channel, and the pointwise convolution is a 1×1 convolution. Processing the features of each scale with the multi-branch weighted pyramid pooling module further mines the semantic information of the features and facilitates subsequent processing.
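A quick arithmetic check of that saving, for an illustrative 3×3 kernel with 256 input and 256 output channels:

```python
# standard 3x3 convolution vs. depthwise + pointwise factorization
standard = 256 * 256 * 3 * 3            # 589,824 weights
separable = 256 * 3 * 3 + 256 * 256     # 2,304 + 65,536 = 67,840 weights
print(separable / standard)             # ~0.115, i.e. roughly 8.7x fewer weights
```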
Step S4: put the processed feature maps F1, F22, F32, F42 into their corresponding decoder layers. The bottom-most processed feature F42 must first pass through the patch expanding layer, which doubles its size and restores the size of the layer above; a feature map is then obtained through the decoder processing, and the final segmentation prediction mask is obtained by cascade fusion. The loss is computed and back-propagated to update the network parameters, completing the training of the network.
For the last three feature maps F32′, F22′, F1′: first, bilinear interpolation is applied to F32′ to obtain F32″ of the same size as F22′; F32″ and F22′ are added, and bilinear interpolation is applied again to obtain F1″ of the same size as F1′; F1″ and F1′ are fused to obtain the final feature map, which is converted from the B, L, C format into the B, H, W, C format for prediction. In addition, since the building extraction task has only two labels, building and background, the final number of channels is only 2.
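The format conversion mentioned here is a plain reshape; a minimal sketch:

```python
import torch

def tokens_to_map(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Convert a (B, L, C) token sequence to a (B, H, W, C) map, L == H * W,
    as required before the final 2-channel (building / background) prediction."""
    b, l, c = x.shape
    assert l == h * w, "token count must equal H * W"
    return x.view(b, h, w, c)
```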
The loss between the segmentation prediction map and the ground-truth label is calculated with the CrossEntropyLoss function, whose specific formula is:
CELoss=-(ylog(p(x))+(1-y)log(1-p(x)))
where y indicates whether the pixel is a target, taking the value 1 or 0, and p(x) is the predicted target score.
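A minimal sketch of this loss on the two-channel prediction, assuming the standard PyTorch cross-entropy, which reduces to the binary formula above for two classes:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the 2-channel prediction mask and the label mask.
    logits: (B, 2, H, W); target: (B, H, W) integer class indices in {0, 1}."""
    return F.cross_entropy(logits, target)
```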
It should be noted that the calculation of the classification loss, target score loss and box loss is already a relatively mature technology in the art and is not described here.
This yields the loss between the predicted values and the ground truth; before each batch completes, back-propagation is performed to reduce the loss while the network parameters are updated, and training of the next batch begins, until the training data of all batches have been used. The trained weights are finally obtained, and all updated parameters are stored in the output weight files.
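Tying these pieces together, a condensed sketch of the batch loop follows; model, loader and num_epochs are assumed to exist, and build_optimizer and segmentation_loss refer to the sketches given earlier.

```python
import torch

optimizer = build_optimizer(model)               # AdamW setup sketched above
for epoch in range(num_epochs):
    for images, masks in loader:                 # batches of 8 preprocessed images
        logits = model(images)                   # (B, 2, 512, 512) prediction mask
        loss = segmentation_loss(logits, masks)
        optimizer.zero_grad()
        loss.backward()                          # back-propagate to reduce the loss
        optimizer.step()                         # update the network parameters
torch.save(model.state_dict(), "weights.pth")    # store all updated parameters
```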
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of technical features, they should be considered within the scope of this description.
The above examples merely represent a few embodiments of the invention and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that several variations and improvements can be made by those skilled in the art without departing from the concept of the invention, and these all fall within the protection scope of the invention. Accordingly, the protection scope of the invention should be subject to the appended claims.

Claims (7)

1. A Transformer-based remote sensing image building extraction method, characterized in that it comprises:
acquiring a remote sensing image and preprocessing the remote sensing image;
inputting the preprocessed remote sensing image into a deep learning model for semantic segmentation to obtain a building segmentation result;
The deep learning model has an asymmetric network structure and comprises a four-layer Swin Transformer encoder, multi-branch weighted pyramid pooling modules, a three-layer decoder and a multi-level feature cascade fusion module, wherein:

The Swin Transformer encoder receives the preprocessed remote sensing image and outputs feature maps F1, F2, F3, F4 of four sizes through its four-layer structure;

Three multi-branch pyramid pooling modules are provided, which respectively take the feature maps F2, F3, F4 output by the last three layers of the Swin Transformer encoder, process them, and output feature maps F22, F32, F42;

The first layer of the decoder acquires feature maps F42 and F32 and processes them to output feature map F32′; the second layer acquires feature maps F32′ and F22 and processes them to output feature map F22′; the third layer acquires feature maps F22′ and F1 and processes them to output feature map F1′;

The multi-level feature cascade fusion module fuses feature map F22′ with feature map F32′ to obtain feature map F22″, then fuses feature map F1′ with feature map F22″ to obtain the final segmentation map, which is taken as the building segmentation result.
2. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the preprocessing includes scaling the remote sensing image to a size of 512×512.
3. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the four layers of the Swin Transformer encoder are defined, along the data flow direction, as the first, second, third and fourth layer structures;

The first layer structure comprises a patch partition block, a linear embedding block and a Swin Transformer block connected in sequence from the data input side to the output side; each of the second, third and fourth layer structures comprises a patch merging layer and a Swin Transformer block connected in sequence from the data input side to the output side;

wherein the first layer structure outputs feature map F1, the second layer structure outputs feature map F2, the third layer structure outputs feature map F3, and the fourth layer structure outputs feature map F4.
4. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that processing the feature maps F2, F3, F4 and outputting feature maps F22, F32, F42 comprises:

For feature map F4, the corresponding multi-branch pyramid pooling module sets the number of branches to 4 and applies pooling operations of different scales to F4, obtaining feature maps of sizes 1, 2, 3 and 6 on the respective branches, with branch weights of 0.1, 0.2 and 0.5 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F4, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F42;

For feature map F3, the corresponding multi-branch pyramid pooling module sets the number of branches to 4 and applies pooling operations of different scales to F3, obtaining feature maps of sizes 1, 2, 3 and 6 on the respective branches, with branch weights of 0.1, 0.2 and 0.5 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F3, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F32;

For feature map F2, the corresponding multi-branch pyramid pooling module sets the number of branches to 5 and applies pooling operations of different scales to F2, obtaining feature maps of sizes 1, 4, 8, 16 and 32 on the respective branches, with branch weights of 0.1, 0.2 and 0.4 assigned respectively; after a depthwise separable convolution, the feature map on each branch is restored by bilinear interpolation to the same size as F2, weighted and concatenated, and a convolution operation is applied to the concatenated feature map to obtain feature map F22.
5. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the three layers of the decoder are defined, along the data flow direction, as the first, second and third layer structures, and each layer structure includes a patch expanding layer and a shunted attention module.
6. The Transformer-based remote sensing image building extraction method of claim 5, characterized in that the first layer of the decoder acquiring feature maps F42 and F32 and outputting feature map F32′, the second layer acquiring feature maps F32′ and F22 and outputting feature map F22′, and the third layer acquiring feature maps F22′ and F1 and outputting feature map F1′ comprises:

Feature map F42 is input to the patch expanding layer in the first layer structure; after the expanding processing, it is concatenated with feature map F32 along the channel dimension; the concatenated feature map is input to the shunted attention module of the first layer structure, which outputs feature map F32′; meanwhile, feature map F32′ is input to the patch expanding layer in the second layer structure;

Feature map F32′, after the expanding processing of the patch expanding layer in the second layer structure, is concatenated with feature map F22 along the channel dimension; the concatenated feature map is input to the shunted attention module of the second layer structure, which outputs feature map F22′; meanwhile, feature map F22′ is input to the patch expanding layer in the third layer structure;

Feature map F22′, after the expanding processing of the patch expanding layer in the third layer structure, is concatenated with feature map F1 along the channel dimension; the concatenated feature map is input to the shunted attention module of the third layer structure, which outputs feature map F1′.
7. The Transformer-based remote sensing image building extraction method of claim 1, characterized in that the multi-level feature cascade fusion module fusing feature map F22′ with feature map F32′ to obtain feature map F22″, then fusing feature map F1′ with feature map F22″ to obtain the final segmentation map taken as the building segmentation result, comprises:

First, bilinear interpolation is applied to feature map F32′ to obtain feature map F32″ of the same size as F22′; feature map F32″ and feature map F22′ are then added, and bilinear interpolation is applied again to obtain feature map F1″ of the same size as F1′; finally, feature map F1″ and feature map F1′ are fused to obtain the final segmentation map.
CN202211597465.0A (priority date 2022-12-12, filing date 2022-12-12) — Remote sensing image building extraction method based on Transformer — Pending — CN116109920A (en)

Priority Applications (1)

CN202211597465.0A — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Applications Claiming Priority (1)

CN202211597465.0A — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Publications (1)

CN116109920A — publication date 2023-05-12

Family

ID=86260652

Family Applications (1)

CN202211597465.0A (Pending, CN116109920A) — priority date 2022-12-12 — filing date 2022-12-12 — Remote sensing image building extraction method based on Transformer

Country Status (1)

Country Link
CN (1) CN116109920A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665053A (en) * 2023-05-30 2023-08-29 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN116665053B (en) * 2023-05-30 2023-11-07 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN116993756A (en) * 2023-07-05 2023-11-03 石河子大学 Method for dividing verticillium wilt disease spots of field cotton

Similar Documents

Publication Publication Date Title
CN116109920A (en) Remote sensing image building extraction method based on Transformer
CN111178316B (en) High-resolution remote sensing image land coverage classification method
CN111915619A (en) Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
CN109241972A (en) Image semantic segmentation method based on deep learning
CN111898439A (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN111401379A (en) DeepLabv3plus-IRCNet image semantic segmentation algorithm based on encoder-decoder structure
CN109597998A (en) Image feature construction method with joint embedding of visual features and semantic characterization
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN115311555A (en) Remote sensing image building extraction model generalization method based on batch style mixing
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN113870286A (en) Foreground segmentation method based on multi-level feature and mask fusion
CN111046738B (en) Precision improvement method of light u-net for finger vein segmentation
CN110633706B (en) Semantic segmentation method based on pyramid network
CN116612283A (en) Image semantic segmentation method based on large convolution kernel backbone network
CN116778318A (en) Convolutional neural network remote sensing image road extraction model and method
CN117557856A (en) Pathological full-slice feature learning method based on self-supervision learning
CN115375922B (en) Light-weight significance detection method based on multi-scale spatial attention
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN112733777A (en) Road extraction method, device, equipment and storage medium for remote sensing image
CN116091918A (en) Land utilization classification method and system based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination