CN113688813B - Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage - Google Patents


Info

Publication number
CN113688813B
CN113688813B
Authority
CN
China
Prior art keywords
module
remote sensing
sensing image
feature map
layer
Prior art date
Legal status
Active
Application number
CN202111252286.9A
Other languages
Chinese (zh)
Other versions
CN113688813A (en)
Inventor
王威
唐琛
王新
刘冠群
Current Assignee
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202111252286.9A priority Critical patent/CN113688813B/en
Publication of CN113688813A publication Critical patent/CN113688813A/en
Application granted granted Critical
Publication of CN113688813B publication Critical patent/CN113688813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage medium. The method comprises the following steps: obtaining a remote sensing image and labeling it to obtain training samples; constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises: an input network that divides a training sample into patches of fixed size, unfolds the patches into one-dimensional vectors and embeds position codes to obtain an input sequence; an encoder that extracts features of different layers from the input sequence with a multi-layer Transformer module; and a decoder that obtains the sample prediction result by fusing the multi-scale feature maps; training the network with the training samples to obtain a trained multi-scale feature fusion remote sensing image segmentation network; and using the trained network to obtain the prediction result of a remote sensing image to be detected. The method makes full use of the multi-scale feature maps extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the complex and variable targets in remote sensing images.

Description

Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
Technical Field
The application relates to the technical field of remote sensing image processing, in particular to a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage.
Background
With the continuous development of remote sensing technology, massive high-resolution remote sensing image data can be acquired. Semantic segmentation of remote sensing images is one of the main means of processing such data and has many applications in forest land coverage detection, urban change detection, urban planning, crop monitoring and other fields. Remote sensing image segmentation is a specific task within semantic segmentation; segmenting a remote sensing image extracts the rich information it contains for researchers to use, so the quality of the segmentation directly determines the quality of the information extraction. Remote sensing images contain abundant category information with irregular spatial distributions, which poses great challenges for the segmentation task.
At present, most segmentation research on remote sensing images uses the Fully Convolutional Network (FCN). The FCN was the pioneering work in applying CNNs to semantic segmentation: it classifies every pixel of the image using only convolutional layers, innovatively adopts an end-to-end structure, and laid the foundation for the subsequent encoder-decoder architectures. In the FCN-8s structure, the size of the input image is $H \times W \times 3$, where $H \times W$ specifies the image size in pixels and 3 represents the three RGB channels of the image. The input to a subsequent layer $i$ is a three-dimensional tensor of size $H_i \times W_i \times C_i$, where $C_i$ is the number of channels of the feature map. The feature maps of the next layer are obtained by convolving the feature maps of the previous layer, and the region of the previous layer that each such convolution is connected to defines its receptive field. Through repeated convolution and pooling operations, the size of the feature map is continuously reduced while the number of channels increases. Due to the locality of the convolution operation, the receptive field grows only linearly with the depth of the layer and is closely related to the size of the convolution kernel (usually $3 \times 3$). Therefore, in the FCN architecture, shallow feature maps focus on local features of the image while deep feature maps focus on global features. FCN-8s fuses shallow and deep features through skip connections and outputs the prediction result through full convolution, so that the model can integrate both the global and the local structure when predicting. However, studies have shown that once a certain depth is reached, the benefit of adding more layers diminishes rapidly. The limited receptive field of an ordinary CNN is therefore an inherent limitation of the FCN architecture and affects the segmentation of remote sensing images.
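To make the receptive-field limitation concrete, the following minimal sketch (not from the original disclosure; plain Python with no dependencies) computes the receptive field of a stack of 3×3, stride-1 convolutions and shows that it grows only linearly with depth.

```python
# Illustrative sketch: receptive field of stacked equal-size convolutions.
# Uses the standard recurrence r_l = r_{l-1} + (k - 1) * (product of earlier strides).

def receptive_field(num_layers: int, kernel_size: int = 3, stride: int = 1) -> int:
    """Receptive field (in pixels) after `num_layers` stacked convolutions."""
    rf, jump = 1, 1          # current receptive field and cumulative stride ("jump")
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

if __name__ == "__main__":
    for depth in (1, 5, 10, 50):
        # 3x3, stride 1: 3, 11, 21, 101 pixels -> linear growth with depth
        print(depth, receptive_field(depth))
```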
Disclosure of Invention
In view of the foregoing, it is desirable to provide a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage medium.
A multi-scale feature fusion remote sensing image segmentation method comprises the following steps:
and obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample.
Constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and a sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into a trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected.
A multi-scale feature fusion remote sensing image segmentation apparatus, the apparatus comprising:
and the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample.
The multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And the prediction result determining module is used for acquiring the remote sensing image to be detected and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain the prediction result of the remote sensing image to be detected.
The method comprises the steps of obtaining a high-resolution remote sensing image, marking the high-resolution remote sensing image to obtain a training sample, and constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network divides the training sample into a plurality of small images with fixed size, and the small images are unfolded into one-dimensional vectors and embedded into position codes to obtain an input sequence; the encoder extracts the characteristics of different levels of an input sequence by using a multi-layer Transformer module; the decoder obtains the features of different scales through convolution operation after the features of different levels are subjected to shape adjustment, fuses the features of different scales through splicing operation, and finally obtains a sample prediction result through multiple times of convolution and up-sampling; training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network; and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected. The method can fully utilize the multi-scale characteristic diagram extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the characteristic that the target in the remote sensing image is complex and changeable.
Drawings
FIG. 1 is a schematic flow chart illustrating a multi-scale feature fusion remote sensing image segmentation method according to an embodiment;
FIG. 2 is a schematic diagram of a multi-scale feature fusion remote sensing image segmentation network in another embodiment;
FIG. 3 is a diagram illustrating a structure of a decoder based on multi-scale feature map fusion in another embodiment;
FIG. 4 is a schematic flow chart of a feature transformation method in another embodiment, wherein (a) is a first feature transformation method and (b) is a second feature transformation method;
FIG. 5 is a block diagram of a multi-scale feature fusion remote sensing image segmentation apparatus according to an embodiment;
fig. 6 is an internal structural diagram of the apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a multi-scale feature fusion remote sensing image segmentation method is provided, and the method includes the following steps:
step 100: and obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample.
Step 102: and constructing a multi-scale feature fusion remote sensing image segmentation network.
The multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion.
The input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
Specifically, the multi-scale feature fusion remote sensing image segmentation network (SETR-MFPD) uses a Vision Transformer (ViT) as its encoder; the encoder comprises b Transformer modules built on a multi-head self-attention mechanism. The decoder gives the sample prediction result by fusing the image features extracted at different layers.
The multi-scale feature fusion remote sensing image segmentation network has the following advantages: the features of different layers extracted by the encoder are transformed to different sizes, and feature maps with different numbers of channels are fed into the decoder, which facilitates feature fusion in the decoder and improves the segmentation of unevenly distributed targets of different sizes; and the multi-scale feature maps of different layers are fused in the decoder through a splicing operation, which improves the decoder's perception of local and global information.
The SEgmentation TRansformer (SETR) applies the Transformer to the semantic segmentation task: SETR uses ViT to extract image features, restores the image features to a multi-channel feature map by reshaping them, and finally feeds the restored feature map to a CNN-based decoder to perform semantic segmentation.
Step 104: and training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
Step 106: and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected.
In the multi-scale feature fusion remote sensing image segmentation method, a high-resolution remote sensing image is obtained and marked to obtain a training sample; constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling; training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network; and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected. The method can fully utilize the multi-scale characteristic diagram extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the characteristic that the target in the remote sensing image is complex and changeable.
In one embodiment, step 104 includes: inputting a training sample into an input network, dividing the training sample into a plurality of small images with fixed sizes, expanding the small images into one-dimensional vectors, then adjusting the dimensionality of the one-dimensional vectors through linear connection mapping, and embedding position codes into the dimensionality-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training a multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the encoder comprises b Transformer modules connected in series, each with the same structure; a Transformer module consists of a multi-head self-attention module, a layer normalization module and a multi-layer perceptron module, where b is an integer greater than or equal to 1. Step 104 further comprises: inputting the input sequence into the first Transformer module, processing it with the layer normalization module to obtain a normalized input sequence, extracting features from the normalized input sequence with the multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features with the layer normalization module, feeding the normalized result into the multi-layer perceptron module, and fusing the resulting output features with the attention fusion features to obtain the output features of the first Transformer module; taking the output features of the first Transformer module as the input sequence of the second Transformer module and inputting it into the second Transformer module to obtain the output features of the second Transformer module, and so on, to obtain b image features from shallow to deep; and performing feature selection on the b image features from shallow to deep at equal layer intervals to obtain the features of different layers.
Wherein: the b image features from shallow to deep comprise the output features of the first Transformer module through the b-th Transformer module, namely the output features of the first Transformer module, the output features of the second Transformer module, ..., and the output features of the b-th Transformer module.
In one embodiment, the decoder based on multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; step 104 further comprises: inputting the features of different levels into a multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into an image size recovery module to obtain a prediction result; and training the multi-scale feature fusion remote sensing image segmentation network according to the prediction result and the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the features of different levels include: the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module, wherein s is the layer interval of the feature extraction, s is an integer greater than 1, and b is greater than 3s. Step 104 further comprises: inputting the output features of the s-th, 2s-th, 3s-th and b-th Transformer modules into the multi-scale feature fusion module; transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module with a first feature transformation method to obtain the feature map of the 3s layer and the first feature map of the b layer, both of size $K \times H \times W$, where $K$ is the vector length and $H$ and $W$ are the height and width of the feature map; the first feature transformation method adjusts each column vector of the feature matrix into a two-dimensional feature map. The output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module are transformed with a second feature transformation method to obtain the second feature map of the b layer, the feature map of the s layer and the feature map of the 2s layer, each of size $\frac{K}{n} \times \sqrt{n}H \times \sqrt{n}W$ for a layer-specific splicing factor $n$; the second feature transformation method adjusts every $n$ column vectors into $n$ feature maps and splices these $n$ feature maps together, yielding $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly. Convolution operations are then performed on the feature map of the s layer, the feature map of the 2s layer, the feature map of the 3s layer and the first feature map of the b layer to obtain the convolution feature map of the s layer, the convolution feature map of the 2s layer, the convolution feature map of the 3s layer and the first convolution feature map of the b layer; the four convolution feature maps are upsampled and then spliced with the second feature map of the b layer to obtain the multi-scale fusion feature map.
The first feature transformation method and the second feature transformation method are both methods for transforming two-dimensional features into three-dimensional multi-scale feature maps.
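A small PyTorch sketch of the two feature transformation methods is given below, assuming an input feature of shape (N, K) with N = H×W token positions; the function names are illustrative and not part of the disclosure.

```python
# Sketch of the two transformations from a 2-D feature (N, K) to multi-scale maps.
# Method (a): each of the K columns becomes one H x W map -> (K, H, W).
# Method (b): groups of n columns (n a perfect square dividing K) are tiled into
#             one larger map -> (K // n, sqrt(n)*H, sqrt(n)*W).
import torch

def transform_a(z: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """z: (N, K) with N == h*w  ->  (K, h, w)."""
    n_tokens, k = z.shape
    assert n_tokens == h * w
    return z.transpose(0, 1).reshape(k, h, w)

def transform_b(z: torch.Tensor, h: int, w: int, n: int) -> torch.Tensor:
    """z: (N, K)  ->  (K // n, sqrt(n)*h, sqrt(n)*w) by splicing n maps into one."""
    r = int(n ** 0.5)
    assert r * r == n and z.shape[1] % n == 0
    maps = transform_a(z, h, w)                 # (K, h, w)
    k = maps.shape[0]
    maps = maps.reshape(k // n, r, r, h, w)     # group n consecutive maps per output map
    maps = maps.permute(0, 1, 3, 2, 4)          # (K/n, r, h, r, w)
    return maps.reshape(k // n, r * h, r * w)   # splice the r x r grid into one map
```

For example, with N = 256 (H = W = 16) and K = 768, method (a) yields 768 maps of size 16×16, while method (b) with n = 4 yields 192 maps of size 32×32.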
In one embodiment, step 104 further comprises: inputting the multi-scale fusion feature map into an image size recovery module, and recovering the multi-scale fusion feature map to the size of an original image by adopting progressive convolution and up-sampling operation to obtain a sample prediction result; the progressive convolution is a three-fold serial convolution operation.
In one embodiment, the multi-layered perceptron module comprises a fully-connected layer of two hidden layers and a GELU activation function; the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
In one embodiment, a multi-scale feature fusion remote sensing image segmentation method is provided; the SETR-MFPD network adopts an encoding-decoding structure, as shown in FIG. 2. The encoder is a complete Transformer network comprising b Transformer modules built on a multi-head self-attention mechanism. The decoder gives the prediction result by fusing the image features extracted at different layers. In the decoder of FIG. 2, convolution 1 is a 3×3 convolution with stride 1, convolution 2 is the same as convolution 1, convolution 3 is a 1×1 convolution with stride 1, and the upsampling enlarges the feature map size by a factor of 2.
As can be seen from the network architecture diagram shown in fig. 2: a given input image is firstly divided into small blocks with fixed sizes, then the small blocks are unfolded into one-dimensional vectors and embedded with position codes, and then the vectors are input into an encoder consisting of b transform modules to extract features of different layers of the image. The method comprises the steps of reducing feature map channels through convolution once after the shapes of features of different layers are adjusted, fusing the features of different scales through splicing operation, and finally generating a prediction result through multiple times of convolution and up-sampling.
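For reference, a minimal PyTorch-style sketch of the decoder building blocks named in FIG. 2 is given below; the helper names are illustrative, and the channel numbers are left as arguments because they are fixed only later in the description.

```python
# Sketch of the FIG. 2 decoder building blocks:
# "convolution 1" / "convolution 2": 3x3 convolution, stride 1
# "convolution 3": 1x1 convolution, stride 1
# "upsampling": doubles the feature-map height and width
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int) -> nn.Module:      # convolution 1 / convolution 2
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)

def conv1x1(in_ch: int, out_ch: int) -> nn.Module:      # convolution 3
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)

def upsample_x2() -> nn.Module:                          # enlarges H and W by a factor of 2
    return nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```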
(1) An encoder:
the self-attention mechanism used in the Transformer network integrates well the local and global information of the input sequence. Therefore, the problem of limited scope of the FCN framework can be solved by using the Vision transform as the encoder for dividing tasks.
1) Input processing
In natural language processing tasks, the input to the Transformer network is a set of one-dimensional vectors, whereas the input in a vision task is a two-dimensional image. In order to use the Transformer network as the encoder of the segmentation task, the dimensions of the input image must therefore be adjusted. An input image $X \in \mathbb{R}^{H \times W \times 3}$ is first divided along its height and width into $N$ image blocks $x_i \in \mathbb{R}^{Y \times Y \times 3}$ of equal length and width, where $Y$ is the side length of a block and $N = \frac{HW}{Y^2}$. The $N$ image blocks are then flattened along their height and width to obtain $N$ vectors of length $K = 3Y^2$, forming a sequence $\{x_1, \dots, x_N\}$, where $K$ is the vector length. The value of $N$ has an important influence on the performance of the Transformer network. For a $256 \times 256 \times 3$ RGB input, taking $N = 4$ would divide the image into four one-dimensional vectors of dimension 49,152. The Multi-Layer Perceptron (MLP) structure used by the Transformer cannot support such high-dimensional vectors as its input, because the fully connected layers in the MLP consume a great deal of time and memory when processing such high-dimensional inputs.

In the semantic segmentation task, an encoder usually obtains multi-scale feature maps $X_i \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times C}$ by downsampling, where $C$ is the number of channels of the feature map and the index $i$ corresponds to feature maps of different scales. To facilitate the feature-shape adjustment in the decoder and to account for the performance of the Transformer, we take $N = 256$ and cut the $256 \times 256$ input image into 256 image blocks of size $16 \times 16$, which are flattened into a sequence of 256 one-dimensional vectors of 768 dimensions, denoted $Z = \{z_1, \dots, z_{256}\}$ with $z_i \in \mathbb{R}^{768}$, where $i$ is the index of the block. In ViT, before being input into the Transformer network, the vectors are first mapped through a linear projection $E$ that adjusts the dimension of the input vectors, and trained class-code and position-code parameters are then embedded, where the class code is embedded by extending the input dimension by the number of classes and the position code is embedded by addition to the input vectors. However, the dimensional change introduced by the embedded class code would make the subsequent shape adjustment difficult, so only the position code is embedded in the input vectors, and the final input can be represented as $Z_0 = \{z_1 E, \dots, z_N E\} + E_{pos}$, where $E_{pos}$ denotes the embedded position code.
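A minimal PyTorch sketch of this input processing is shown below, under the shapes stated above (a 256×256×3 image cut into 256 blocks of 16×16, flattened to 768-dimensional vectors, linearly projected and summed with a learned position code); the class and parameter names are illustrative, not the patent's.

```python
# Patch splitting, flattening, linear projection E, and position code E_pos
# (no class token is used, as explained above).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size: int = 256, patch_size: int = 16,
                 in_ch: int = 3, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # linear projection E applied to each flattened patch
        self.proj = nn.Linear(patch_size * patch_size * in_ch, embed_dim)
        # learned position code E_pos
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch_size
        # cut into non-overlapping p x p blocks and flatten each block to a 1-D vector
        x = x.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(x) + self.pos_embed              # Z0 = {z_i E} + E_pos
```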
2) Transformer network
The method in this embodiment serializes the image to obtain the initial input $Z_0$ and uses the Transformer network as the encoder to extract image features. A Transformer network consists of a number of Transformer modules connected in series; each module has the same structure, and the output of one module is the input of the next. Each module is composed of a multi-head self-attention (MSA) module, a Layer Normalization (LN) module and an MLP module. Suppose the input sequence of the $l$-th layer is $Z_{l-1}$. The input sequence first passes through the LN module to obtain the normalized sequence $Z'_{l-1}$, as shown in Equation (1):

$Z'_{l-1} = \mathrm{LN}(Z_{l-1})$ (1)

$Z'_{l-1}$ serves as the input of the MSA module, which is composed of $h$ self-attention (SA) modules. The inputs of an SA module are the three matrices $Q$, $K$ and $V$; they are computed as in Equation (2), and the SA module itself as in Equation (3):

$Q = Z'_{l-1} W_Q, \quad K = Z'_{l-1} W_K, \quad V = Z'_{l-1} W_V$ (2)

$\mathrm{SA}(Z'_{l-1}) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$ (3)

where $W_Q, W_K, W_V \in \mathbb{R}^{K \times d}$ are trainable weight parameters and $d$ determines the size of the weight parameters. The computation of the MSA module is shown in Equation (4):

$\mathrm{MSA}(Z'_{l-1}) = \mathrm{concat}\big(\mathrm{SA}_1(Z'_{l-1}), \dots, \mathrm{SA}_h(Z'_{l-1})\big) W_O$ (4)

where concat splices the $h$ matrices of size $N \times d$ along the row dimension into a matrix of size $N \times hd$, $W_O \in \mathbb{R}^{hd \times K}$ is a trainable weight parameter, and $d$ is generally set to $K/h$. $\mathrm{SA}_i(Z'_{l-1})$ denotes the output of the $i$-th self-attention module, computed as in Equation (3) with the trainable weight parameters $W_Q^{i}, W_K^{i}, W_V^{i}$ of the $i$-th SA module.

The complete computation of a Transformer module is given by Equations (5) to (7):

$Z'_{l-1} = \mathrm{LN}(Z_{l-1})$ (5)

$Z''_{l} = \mathrm{MSA}(Z'_{l-1}) + Z_{l-1}$ (6)

$Z_{l} = \mathrm{MLP}\big(\mathrm{LN}(Z''_{l})\big) + Z''_{l}$ (7)

where the MLP consists of a fully connected block with two hidden layers and a GELU activation function, $l = 1 \dots b$, and $b$ is the number of Transformer modules in the Transformer network, i.e. the number of layers of the Transformer network. Through $b$ iterations, the Transformer network extracts $b$ image features from shallow to deep, denoted $Z_1, \dots, Z_b$. By selecting different numbers of linear projection dimensions, heads $h$, layers $b$ and hidden-layer sizes, Dosovitskiy et al. designed three different ViT models, as shown in Table 1.
TABLE 1 configuration of the different ViT models
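The computation in Equations (1) to (7) can be sketched in PyTorch as follows; the hyper-parameter values (12 heads, hidden size 3072) follow the common ViT-Base configuration and are assumptions rather than values fixed by the disclosure.

```python
# Sketch of one Transformer module: LN, multi-head self-attention with a residual
# connection (Equations (5)-(6)), then an MLP with GELU and a second residual (Equation (7)).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.d = heads, dim // heads           # per-head width d = K / h
        self.qkv = nn.Linear(dim, dim * 3, bias=False)     # W_Q, W_K, W_V stacked
        self.out = nn.Linear(dim, dim, bias=False)         # W_O

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (B, N, K)
        b, n, k = z.shape
        q, k_, v = self.qkv(z).chunk(3, dim=-1)
        def split(t):                                      # split into h heads of width d
            return t.reshape(b, n, self.heads, self.d).transpose(1, 2)
        q, k_, v = split(q), split(k_), split(v)
        attn = torch.softmax(q @ k_.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, k)  # concat of the h heads
        return self.out(out)

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = MultiHeadSelfAttention(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = self.msa(self.ln1(z)) + z        # Equations (5)-(6)
        return self.mlp(self.ln2(z)) + z     # Equation (7)
```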
(2) Multi-scale feature fusion decoder
The features of different levels of the input sequence extracted by the encoder all have the same shape. In this embodiment, a multi-level feature fusion decoder similar to a feature pyramid network is designed; its structure is shown in FIG. 3. The difference from the feature pyramid network is that the features extracted by the Transformer modules need to be adjusted into pyramid shapes, while the features of different layers extracted by the encoder do not need to be resized through pooling operations, so information loss is avoided.
This embodiment selects four features of different layers extracted by the Transformer network, $Z_s$, $Z_{2s}$, $Z_{3s}$ and $Z_b$, where $s$ determines the layer interval at which features are selected. FIG. 4 shows the two methods for transforming the two-dimensional features into three-dimensional multi-scale feature maps. The first feature transformation method, shown in FIG. 4 (a), directly adjusts each column vector into a two-dimensional feature map, yielding $K$ feature maps. The second feature transformation method, shown in FIG. 4 (b), adjusts every $n$ column vectors into $n$ feature maps and splices these $n$ feature maps into one larger feature map, thereby obtaining $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly. The deep features $Z_{3s}$ and $Z_b$ are resized with the method shown in FIG. 4 (a) into feature maps of size $K \times H \times W$. $Z_b$ is additionally resized with the method shown in FIG. 4 (b) to obtain the second feature map of the b layer. The shallow features $Z_s$ and $Z_{2s}$ are resized with the method shown in FIG. 4 (b) into feature maps with larger spatial sizes and $\frac{K}{n}$ channels each. Convolution kernels of size $3 \times 3$ with stride 1 and output channels of 256, 128, 64 and 32 are then applied to these four feature maps. Finally, the four convolved feature maps are upsampled and spliced with the second feature map of the b layer to obtain the multi-scale fusion feature map. In order to utilize the information of the multi-scale fusion feature map to the greatest extent, the image size is restored by progressive convolution and upsampling, and the original image size is finally recovered through three serial convolution and upsampling operations.
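A condensed PyTorch sketch of this fusion decoder is given below. The splicing factors n used by method (b) for the different layers, the channel ordering of the 3×3 convolutions, and the final classification convolution are placeholders chosen only to make the shapes consistent (input features of shape (B, 256, 768), output restored to 256×256); they are assumptions, not values fixed by the figures.

```python
# Sketch of the multi-scale fusion decoder: reshape the four selected encoder
# outputs, reduce channels with 3x3 stride-1 convolutions (256/128/64/32),
# upsample, splice with the b-layer second feature map, then recover the image
# size with three serial convolution + x2 upsampling stages.
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_maps_a(z):                              # method (a): (B, N, K) -> (B, K, 16, 16)
    b, n, k = z.shape
    h = int(n ** 0.5)
    return z.transpose(1, 2).reshape(b, k, h, h)

def to_maps_b(z, n_splice):                    # method (b): splice groups of n_splice maps
    maps = to_maps_a(z)                        # (B, K, h, h)
    b, k, h, _ = maps.shape
    r = int(n_splice ** 0.5)
    maps = maps.reshape(b, k // n_splice, r, r, h, h).permute(0, 1, 2, 4, 3, 5)
    return maps.reshape(b, k // n_splice, r * h, r * h)

class FusionDecoder(nn.Module):
    def __init__(self, k=768, num_classes=7, n_splice=(4, 4, 4)):   # placeholder factors
        super().__init__()
        self.n_splice = n_splice
        # 3x3, stride-1 convolutions reducing the channels of the four maps
        self.reduce = nn.ModuleList([
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1)
            for c_in, c_out in zip((k // n_splice[0], k // n_splice[1], k, k),
                                   (32, 64, 128, 256))])
        fused_ch = 32 + 64 + 128 + 256 + k // n_splice[2]
        # progressive recovery: three serial convolution + x2 upsampling stages,
        # followed by an assumed 1x1 classification convolution
        up = lambda: nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.recover = nn.Sequential(
            nn.Conv2d(fused_ch, 256, 3, 1, 1), up(),
            nn.Conv2d(256, 128, 3, 1, 1), up(),
            nn.Conv2d(128, 64, 3, 1, 1), up(),
            nn.Conv2d(64, num_classes, 1))

    def forward(self, z_s, z_2s, z_3s, z_b):   # each: (B, 256, 768)
        f_s  = to_maps_b(z_s,  self.n_splice[0])
        f_2s = to_maps_b(z_2s, self.n_splice[1])
        f_3s = to_maps_a(z_3s)
        f_b1 = to_maps_a(z_b)
        f_b2 = to_maps_b(z_b, self.n_splice[2])            # splicing target of the fusion
        maps = [conv(f) for conv, f in zip(self.reduce, (f_s, f_2s, f_3s, f_b1))]
        size = f_b2.shape[-2:]
        maps = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
                for m in maps]
        fused = torch.cat(maps + [f_b2], dim=1)            # multi-scale fusion feature map
        return self.recover(fused)                         # restored to the input size
```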
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a multi-scale feature fusion remote sensing image segmentation apparatus, including: the remote sensing image acquisition module, the multi-scale feature fusion remote sensing image segmentation network construction module, the multi-scale feature fusion remote sensing image segmentation network training module and the prediction result determination module of the remote sensing image to be measured are as follows:
and the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample.
The multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the label of the training sample and a sample prediction result obtained by inputting the training sample into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And the prediction result determining module is used for acquiring the remote sensing image to be detected, inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network, and obtaining the prediction result of the remote sensing image to be detected.
In one embodiment, the multi-scale feature fusion remote sensing image segmentation network training module is further configured to input a training sample into an input network, segment the training sample into a plurality of small block images with fixed sizes, expand the small block images into one-dimensional vectors, then adjust the dimensions of the one-dimensional vectors through linear connection mapping, and embed position codes in the dimension-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training a multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the encoder comprises b Transformer modules connected in series, each with the same structure; a Transformer module consists of a multi-head self-attention module, a layer normalization module and a multi-layer perceptron module, where b is an integer greater than or equal to 1. The multi-scale feature fusion remote sensing image segmentation network training module is also used for: inputting the input sequence into the first Transformer module, processing it with the layer normalization module to obtain a normalized input sequence, extracting features from the normalized input sequence with the multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features with the layer normalization module, feeding the normalized result into the multi-layer perceptron module, and fusing the resulting output features with the attention fusion features to obtain the output features of the first Transformer module; taking the output features of the first Transformer module as the input sequence of the second Transformer module and inputting it into the second Transformer module to obtain the output features of the second Transformer module, and so on, to obtain b image features from shallow to deep; and performing feature selection on the b image features from shallow to deep at equal layer intervals to obtain the features of different layers.
In one embodiment, the decoder based on multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; the multi-scale feature fusion remote sensing image segmentation network training module is also used for inputting features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into an image size recovery module to obtain a sample prediction result; and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the features of different levels include: the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module, wherein s is the layer interval of the feature extraction, s is an integer greater than 1, and b is greater than 3s. The multi-scale feature fusion remote sensing image segmentation network training module is further used for: inputting the output features of the s-th, 2s-th, 3s-th and b-th Transformer modules into the multi-scale feature fusion module; transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module with the first feature transformation method to obtain the feature map of the 3s layer and the first feature map of the b layer, both of size $K \times H \times W$, where $K$ is the vector length and $H$ and $W$ are the height and width of the feature map, the first feature transformation method adjusting each column vector of the feature map into a two-dimensional feature map; transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module with the second feature transformation method to obtain the second feature map of the b layer, the feature map of the s layer and the feature map of the 2s layer, each of size $\frac{K}{n} \times \sqrt{n}H \times \sqrt{n}W$ for a layer-specific splicing factor $n$, the second feature transformation method adjusting every $n$ column vectors into $n$ feature maps and splicing these $n$ feature maps, yielding $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly; performing convolution operations on the feature map of the s layer, the feature map of the 2s layer, the feature map of the 3s layer and the first feature map of the b layer to obtain the convolution feature map of the s layer, the convolution feature map of the 2s layer, the convolution feature map of the 3s layer and the first convolution feature map of the b layer; and upsampling the four convolution feature maps and splicing them with the second feature map of the b layer to obtain the multi-scale fusion feature map.
In one embodiment, the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the multi-scale fusion feature map into the image size recovery module, and recovering the multi-scale fusion feature map to the size of an original image by adopting progressive convolution and up-sampling operations to obtain a sample prediction result; the progressive convolution is a three-fold serial convolution operation.
In one embodiment, the multi-layer perceptron module in the device comprises a fully-connected layer of two hidden layers and a GELU activation function; the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
For specific limitations of the multi-scale feature fusion remote sensing image segmentation device, reference may be made to the above limitations of the multi-scale feature fusion remote sensing image segmentation method, which are not described herein again. All modules in the multi-scale feature fusion remote sensing image segmentation device can be completely or partially realized through software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The apparatus includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the device is configured to provide computing and control capabilities. The memory of the device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a multi-scale feature fusion remote sensing image segmentation method. The display screen of the device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the device can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer device, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, an apparatus is provided, comprising a memory storing a computer program and a processor implementing the steps of the above-described method embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
In one validated embodiment, the Gaofen-2 Chenzhou (GF2-CZ) dataset was used as the experimental dataset. The original images of the GF2-CZ dataset come from remote sensing images of six towns in the Chenzhou area captured by Gaofen-2; the spatial resolution of each image is 0.8 m, and pixel-level labels were produced for each remote sensing image. According to the landform characteristics of Chenzhou, the labels comprise seven categories: background, woodland, wetland, river, building, road and hilly region. A $256 \times 256$ sampling window was used to randomly sample the six remote sensing images, and 10,000 training pictures and 2,000 test pictures were finally obtained through data enhancement means such as rotation and blurring. Specific information on the GF2-CZ dataset is shown in Table 2.
TABLE 2 GF2-CZ dataset
In this embodiment, the Pixel Accuracy (PA), Mean Intersection over Union (MIoU) and Frequency Weighted Intersection over Union (FWIoU) commonly used in semantic segmentation tasks are selected as the performance metrics of the model. Assuming that there are $k$ target classes and 1 background class, PA, MIoU and FWIoU are computed as shown in Equations (8), (9) and (10):

$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$ (8)

$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (9)

$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (10)

where $p_{ij}$ denotes the total number of pixels belonging to class $i$ but predicted as class $j$; specifically, $p_{ii}$ is the number of correctly classified pixels, while $p_{ij}$ and $p_{ji}$ (for $i \neq j$) are the numbers of misclassified pixels.
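These three metrics can be sketched from a confusion matrix as follows; this is a small NumPy example, not part of the disclosure.

```python
# p[i, j] counts pixels of class i predicted as class j, for k target classes
# plus one background class, so p has shape (k+1, k+1).
import numpy as np

def segmentation_metrics(p: np.ndarray):
    total = p.sum()
    pa = np.diag(p).sum() / total                           # Equation (8)
    union = p.sum(axis=1) + p.sum(axis=0) - np.diag(p)      # per-class union of prediction and label
    iou = np.diag(p) / np.maximum(union, 1)                 # guard against empty classes
    miou = iou.mean()                                       # Equation (9)
    freq = p.sum(axis=1) / total                            # per-class pixel frequency
    fwiou = (freq * iou).sum()                              # Equation (10)
    return pa, miou, fwiou
```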
The embodiment uses the open-source MMSegmentation library to build the experimental platform. MMSegmentation is a PyTorch-based semantic segmentation open-source toolkit that is part of the OpenMMLab project. MMSegmentation integrates a number of semantic segmentation methods such as PSPNet, DeepLabV3 and SETR, and provides a uniform benchmark platform for users.
This embodiment uses cross entropy as the loss function, selects SGD as the optimizer, sets the initial learning rate to 0.001, gradually decays the learning rate according to a polynomial schedule, sets the momentum and weight decay coefficients to 0.9 and 0.0005, respectively, and sets the batch size to 2.
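A minimal PyTorch sketch of this training configuration is shown below; the total number of iterations and the polynomial decay power are placeholders, and `model` and `train_loader` are assumed to exist.

```python
# Cross-entropy loss, SGD (lr 0.001, momentum 0.9, weight decay 0.0005),
# polynomial learning-rate decay; batch size 2 is set when building train_loader.
import torch
import torch.nn as nn

def train(model, train_loader, device, max_iters=40_000, power=0.9):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    # lr = lr0 * (1 - iter / max_iters) ** power
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1 - it / max_iters) ** power)
    model.to(device).train()
    it = 0
    while it < max_iters:
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            it += 1
            if it >= max_iters:
                break
```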
This example performed comparative experiments on the GF2-CZ dataset, comparing the performance of SETR-MFPD with FCN-8s, PSPNet, DeepLabV3, SETR-Naive, SETR-MLA and SETR-PUP. The network weights in the encoder were initialized with the weights of a model pre-trained on ImageNet. The results of the experiment are shown in Table 3.
TABLE 3 Experimental results of different segmentation methods
As can be seen from Table 3, the depth of the network used by the encoder has a large impact on the segmentation result: deep networks outperform shallow ones. The methods using a Transformer as the encoder are higher than the CNN-based methods in accuracy, MIoU and FWIoU. Specifically, PSPNet, which uses multi-level pyramid fusion, has an accuracy of 90.49%; although this is slightly lower than the 91.36% of FCN-8s, its MIoU reaches 55.66%, indicating that a decoder with a multi-level pyramid structure classifies targets more accurately as a whole. Although DeepLabV3 uses the ASPP module, its accuracy (89.20%) and MIoU (50.62%) are not high; considering its performance on other datasets, the feature extraction capability of DeepLabV3 on remote sensing images is not good. DFCN121-C, which uses a modified decoder module with DenseNet-121 as its encoder, performed best among the CNN-based segmentation methods in the comparative experiments, with 91.54% accuracy and 53.58% MIoU. Among the Transformer-based methods, SETR-PUP has the highest accuracy (91.66%), and the improved SETR-MFPD achieves the highest MIoU (60.13%). The multi-level feature fusion decoder used in this embodiment therefore achieves a better overall classification effect on the targets in remote sensing images. The methods using the Transformer are superior to the CNN-based methods both in segmentation accuracy and in overall segmentation effect. The experimental results also show that the SETR methods segment objects such as forest land, water areas and buildings better than FCN and PSPNet, and SETR-MFPD is better than SETR-MLA in the segmentation accuracy of woodland.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but it shall not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. A multi-scale feature fusion remote sensing image segmentation method is characterized by comprising the following steps:
obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample;
constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling;
training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and a sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain a trained multi-scale feature fusion remote sensing image segmentation network;
acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into a trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected;
wherein, the steps are as follows: training the multi-scale feature fusion remote sensing image segmentation network according to the labeling of the training sample and a sample prediction result obtained by inputting the training sample into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network, and the method comprises the following steps of:
inputting the training sample into the input network, dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors, then adjusting the dimensions of the one-dimensional vectors through linear connection mapping, and embedding position codes into the vectors with the adjusted dimensions to obtain an input sequence;
inputting the input sequence into an encoder to obtain features of different layers;
inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain a trained multi-scale feature fusion remote sensing image segmentation network;
the encoder comprises b Transformer modules which are connected in series, and the structures of the Transformer modules are the same; the Transformer module consists of a multi-head self-attention module, a layer standardization module and a multi-layer perceptron module; wherein b is an integer greater than or equal to 1;
the method comprises the following steps: inputting the input sequence into an encoder to obtain features of different layers, wherein the features of different layers comprise:
inputting the input sequence into a first Transformer module, processing the input sequence by a layer standardization module to obtain a standardized input sequence, extracting features of the standardized input sequence by a multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features by the layer standardization module, inputting an obtained standardized processing result into a multi-layer perceptron module, and fusing the obtained perceptron output features with the attention fusion features to obtain first Transformer module output features;
taking the output characteristics of the first Transformer module as an input sequence of a second Transformer module, inputting the input sequence into the second Transformer module to obtain the output characteristics of the second Transformer module, and so on to obtain b image characteristics from shallow to deep;
selecting features from the b image features from shallow to deep according to the same layer interval to obtain features of different layers;
the decoder based on the multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module;
the method comprises the following steps: inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network, wherein the method comprises the following steps:
inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map;
inputting the multi-scale fusion feature map into the image size recovery module to obtain a sample prediction result;
training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network;
wherein the different levels of features include: the output characteristics of the s-th Transformer module, the output characteristics of the 2 s-th Transformer module, the output characteristics of the 3 s-th Transformer module and the output characteristics of the b-th Transformer module; wherein s is the interval of the layers of the feature extraction, s is an integer greater than 1, and b is greater than 3 s;
the method comprises the following steps: inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map, which comprises:
inputting the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module into the multi-scale feature fusion module, and respectively transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module by adopting a first feature transformation method to obtain a feature map of the 3s layer and a first feature map of the b layer; wherein the feature map of the 3s layer and the first feature map of the b layer are each of size K × H × W, where K is the vector length and H and W are respectively the height and width of the feature map; the first feature transformation method is to adjust each column vector of the feature map into a two-dimensional feature map;
respectively transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module by adopting a second feature transformation method to obtain a second feature map of the b layer, a feature map of the s layer and a feature map of the 2s layer, each being a feature map of size (K/n) × √n·H × √n·W as determined by the second feature transformation; the second feature transformation method is to adjust every n column vectors of the feature map into n two-dimensional feature maps and splice the n feature maps into one, thereby obtaining K/n feature maps in total; wherein n is a perfect square that divides K evenly;
respectively performing convolution operation on the s-layer feature map, the 2 s-layer feature map, the 3 s-layer feature map and the b-layer first feature map to obtain an s-layer convolution feature map, a 2 s-layer convolution feature map, a 3 s-layer convolution feature map and a b-layer first convolution feature map;
and performing up-sampling on the s-layer convolution feature map, the 2s-layer convolution feature map, the 3s-layer convolution feature map and the b-layer first convolution feature map, and then splicing the up-sampled convolution feature maps with the b-layer second feature map to obtain a multi-scale fusion feature map.
2. The method of claim 1, wherein inputting the multi-scale fused feature map into the image size recovery module to obtain a sample prediction result comprises:
inputting the multi-scale fusion feature map into the image size recovery module, and recovering the multi-scale fusion feature map to the size of the original image by progressive convolution and up-sampling operations to obtain a sample prediction result; the progressive convolution is three serial convolution operations.
3. The method according to any one of claims 1-2, wherein the multi-layer perceptron module comprises a fully connected network with two hidden layers and one GELU activation function;
the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
4. A multi-scale feature fusion remote sensing image segmentation device is characterized by comprising:
the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample;
the multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling;
the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network;
the prediction result determining module is used for acquiring the remote sensing image to be detected and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain the prediction result of the remote sensing image to be detected;
the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the training sample into the input network, segmenting the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors, then adjusting the dimensionality of the one-dimensional vectors through linear connection mapping, and embedding position codes into the dimensionality-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain a trained multi-scale feature fusion remote sensing image segmentation network; the encoder comprises b Transformer modules which are connected in series, and the structures of the Transformer modules are the same; the Transformer module consists of a multi-head self-attention module, a layer standardization module and a multi-layer perceptron module; wherein b is an integer greater than or equal to 1;
the method comprises the following steps: the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the input sequence into a first transform module, processing the input sequence through a layer standardization module to obtain a standardized input sequence, extracting features of the standardized input sequence through a multi-head self-attention module to obtain attention features, fusing the attention features and the input sequence to obtain attention fusion features, processing the attention fusion features through the layer standardization module, inputting the obtained standardized processing result into a multilayer perceptron module, and fusing the obtained perceptron output features and the attention fusion features to obtain first transform module output features; taking the output characteristics of the first Transformer module as an input sequence of a second Transformer module, inputting the input sequence into the second Transformer module to obtain the output characteristics of the second Transformer module, and so on to obtain b image characteristics from shallow to deep; selecting features from the b image features from shallow to deep according to the same layer interval to obtain features of different layers;
the decoder based on the multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; the multi-scale feature fusion remote sensing image segmentation network training module is also used for inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into the image size recovery module to obtain a sample prediction result; training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network; wherein the different levels of features include: the output characteristics of the s-th Transformer module, the output characteristics of the 2 s-th Transformer module, the output characteristics of the 3 s-th Transformer module and the output characteristics of the b-th Transformer module; wherein s is the interval of the layers of the feature extraction, s is an integer greater than 1, and b is greater than 3 s;
the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module into the multi-scale feature fusion module, and respectively transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module by adopting a first feature transformation method to obtain a feature map of the 3s layer and a first feature map of the b layer; wherein the feature map of the 3s layer and the first feature map of the b layer are each of size K × H × W, where K is the vector length and H and W are respectively the height and width of the feature map; the first feature transformation method is to adjust each column vector of the feature map into a two-dimensional feature map; respectively transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module by adopting a second feature transformation method to obtain a second feature map of the b layer, a feature map of the s layer and a feature map of the 2s layer, each being a feature map of size (K/n) × √n·H × √n·W as determined by the second feature transformation; the second feature transformation method is to adjust every n column vectors of the feature map into n two-dimensional feature maps and splice the n feature maps into one, thereby obtaining K/n feature maps in total; wherein n is a perfect square that divides K evenly; respectively performing convolution operation on the s-layer feature map, the 2s-layer feature map, the 3s-layer feature map and the b-layer first feature map to obtain an s-layer convolution feature map, a 2s-layer convolution feature map, a 3s-layer convolution feature map and a b-layer first convolution feature map;
and performing up-sampling on the s-layer convolution feature map, the 2s-layer convolution feature map, the 3s-layer convolution feature map and the b-layer first convolution feature map, and then splicing the up-sampled convolution feature maps with the b-layer second feature map to obtain a multi-scale fusion feature map.
5. An apparatus comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 3 when executing the computer program.
6. A computer-readable memory, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN202111252286.9A 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage Active CN113688813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111252286.9A CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111252286.9A CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Publications (2)

Publication Number Publication Date
CN113688813A CN113688813A (en) 2021-11-23
CN113688813B true CN113688813B (en) 2022-01-04

Family

ID=78588237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111252286.9A Active CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Country Status (1)

Country Link
CN (1) CN113688813B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359554A (en) * 2021-11-25 2022-04-15 河南农业大学 Image semantic segmentation method based on multi-receptive-field context semantic information
CN114037899A (en) * 2021-12-01 2022-02-11 福州大学 VIT-based hyperspectral remote sensing image-oriented classification radial accumulation position coding system
CN114022788B (en) * 2022-01-05 2022-03-04 长沙理工大学 Remote sensing image change detection method and device, computer equipment and storage medium
CN114092833B (en) * 2022-01-24 2022-05-27 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium
CN114419449B (en) * 2022-03-28 2022-06-24 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114913339B (en) * 2022-04-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for feature map extraction model
CN114758360B (en) * 2022-04-24 2023-04-18 北京医准智能科技有限公司 Multi-modal image classification model training method and device and electronic equipment
CN114943963B (en) * 2022-04-29 2023-07-04 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN114842312B (en) * 2022-05-09 2023-02-10 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN114972220B (en) * 2022-05-13 2023-02-21 北京医准智能科技有限公司 Image processing method and device, electronic equipment and readable storage medium
CN114998653B (en) * 2022-05-24 2024-04-26 电子科技大学 ViT network-based small sample remote sensing image classification method, medium and equipment
CN115019182B (en) * 2022-07-28 2023-03-24 北京卫星信息工程研究所 Method, system, equipment and storage medium for identifying fine granularity of remote sensing image target
CN115147606B (en) * 2022-08-01 2024-05-14 深圳技术大学 Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
CN115761383B (en) * 2023-01-06 2023-04-18 北京匠数科技有限公司 Image classification method and device, electronic equipment and medium
CN116188431B (en) * 2023-02-21 2024-02-09 北京长木谷医疗科技股份有限公司 Hip joint segmentation method and device based on CNN and transducer
CN116310840A (en) * 2023-05-11 2023-06-23 天地信息网络研究院(安徽)有限公司 Winter wheat remote sensing identification method integrating multiple key weather period spectral features
CN117173525A (en) * 2023-09-05 2023-12-05 北京交通大学 Universal multi-mode image fusion method and device
CN117709580A (en) * 2023-11-29 2024-03-15 广西科学院 Ocean disaster-bearing body vulnerability evaluation method based on SETR and geographic grid
CN117789042B (en) * 2024-02-28 2024-05-14 中国地质大学(武汉) Road information interpretation method, system and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN113111835A (en) * 2021-04-23 2021-07-13 长沙理工大学 Semantic segmentation method and device for satellite remote sensing image, electronic equipment and storage medium
CN113191285A (en) * 2021-05-08 2021-07-30 山东大学 River and lake remote sensing image segmentation method and system based on convolutional neural network and Transformer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN113111835A (en) * 2021-04-23 2021-07-13 长沙理工大学 Semantic segmentation method and device for satellite remote sensing image, electronic equipment and storage medium
CN113191285A (en) * 2021-05-08 2021-07-30 山东大学 River and lake remote sensing image segmentation method and system based on convolutional neural network and Transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers; Sixiao Zheng et al.; arXiv:2012.15840v1 [cs.CV]; 2020-12-31; pp. 1-12 *

Also Published As

Publication number Publication date
CN113688813A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113688813B (en) Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN111160335B (en) Image watermark processing method and device based on artificial intelligence and electronic equipment
Dewi et al. Weight analysis for various prohibitory sign detection and recognition using deep learning
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
Xu et al. High-resolution remote sensing image change detection combined with pixel-level and object-level
US20190205700A1 (en) Multiscale analysis of areas of interest in an image
CN114022788B (en) Remote sensing image change detection method and device, computer equipment and storage medium
Chen et al. A landslide extraction method of channel attention mechanism U-Net network based on Sentinel-2A remote sensing images
CN116258976A (en) Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system
Zhao et al. Multiscale object detection in high-resolution remote sensing images via rotation invariant deep features driven by channel attention
US20200034664A1 (en) Network Architecture for Generating a Labeled Overhead Image
CN116524369B (en) Remote sensing image segmentation model construction method and device and remote sensing image interpretation method
CN113034495A (en) Spine image segmentation method, medium and electronic device
CN117597703A (en) Multi-scale converter for image analysis
CN116740422A (en) Remote sensing image classification method and device based on multi-mode attention fusion technology
Huang et al. Attention-guided label refinement network for semantic segmentation of very high resolution aerial orthoimages
CN111179272B (en) Rapid semantic segmentation method for road scene
CN116310916A (en) Semantic segmentation method and system for high-resolution remote sensing city image
CN115147606A (en) Medical image segmentation method and device, computer equipment and storage medium
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN113111885B (en) Dynamic resolution instance segmentation method and computer readable storage medium
Hou et al. BFFNet: a bidirectional feature fusion network for semantic segmentation of remote sensing objects
CN115713624A (en) Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image
Qi et al. JAED-Net: joint attention encoder–decoder network for road extraction from remote sensing images
Kulishova et al. Impact of the textbooks’ graphic design on the augmented reality applications tracking ability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant