CN113688813B - Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage - Google Patents


Info

Publication number
CN113688813B
CN113688813B
Authority
CN
China
Prior art keywords
module
remote sensing
sensing image
feature map
layer
Prior art date
Legal status
Active
Application number
CN202111252286.9A
Other languages
Chinese (zh)
Other versions
CN113688813A (en)
Inventor
王威
唐琛
王新
刘冠群
Current Assignee
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202111252286.9A priority Critical patent/CN113688813B/en
Publication of CN113688813A publication Critical patent/CN113688813A/en
Application granted granted Critical
Publication of CN113688813B publication Critical patent/CN113688813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage medium. The method comprises the following steps: obtaining a remote sensing image and labeling it to obtain training samples; constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises: an input network that divides a training sample into patches of fixed size, unfolds the patches into one-dimensional vectors and embeds position codes to obtain an input sequence; an encoder that extracts features of different layers from the input sequence with a multi-layer Transformer module; and a decoder that obtains the sample prediction result by fusing the multi-scale feature maps; training the network with the training samples to obtain a trained multi-scale feature fusion remote sensing image segmentation network; and using the trained network to obtain the prediction result of a remote sensing image to be detected. The method makes full use of the multi-scale feature maps extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the complex and variable targets in remote sensing images.

Description

Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
Technical Field
The application relates to the technical field of remote sensing image processing, in particular to a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage.
Background
With the continuous development of remote sensing technology, massive high-resolution remote sensing image data can be acquired. Semantic segmentation of remote sensing images is one of the main means of processing such data and has many applications in forest land coverage detection, urban change detection, urban planning, crop monitoring and other fields. Remote sensing image segmentation is a specific task within semantic segmentation; segmenting a remote sensing image extracts the rich information it contains for researchers to use, so the quality of the segmentation directly determines the quality of the information extraction. Remote sensing images contain abundant category information with irregular spatial distributions, which poses great challenges for the segmentation task.
At present, most segmentation research on remote sensing images uses the Fully Convolutional Network (FCN). The FCN was the pioneering work in applying CNNs to semantic segmentation: it classifies every pixel of the image using only convolutional layers, innovatively adopts an end-to-end structure, and laid the foundation for the subsequent encoder-decoder architectures. In the FCN-8s structure, the size of the input image is $H \times W \times 3$, where $H \times W$ specifies the image size in pixels and 3 represents the three RGB channels of the image. The input to a subsequent layer $i$ is a three-dimensional tensor of size $H_i \times W_i \times C_i$, where $C_i$ is the number of channels of the feature map. The feature maps of the next layer are obtained by convolving the feature maps of the previous layer, and the region of the previous layer that each such convolution is connected to defines its receptive field. Through repeated convolution and pooling operations, the size of the feature map is continuously reduced while the number of channels increases. Due to the locality of the convolution operation, the receptive field grows only linearly with the depth of the layer and is closely related to the size of the convolution kernel (usually $3 \times 3$). Therefore, in the FCN architecture, shallow feature maps focus on local features of the image while deep feature maps focus on global features. FCN-8s fuses shallow and deep features through skip connections and outputs the prediction result through full convolution, so that the model can integrate both the global and the local structure when predicting. However, studies have shown that once a certain depth is reached, the benefit of adding more layers diminishes rapidly. The limited receptive field of an ordinary CNN is therefore an inherent limitation of the FCN architecture and affects the segmentation of remote sensing images.
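To make the receptive-field limitation concrete, the following minimal sketch (not from the original disclosure; plain Python with no dependencies) computes the receptive field of a stack of 3×3, stride-1 convolutions and shows that it grows only linearly with depth.

```python
# Illustrative sketch: receptive field of stacked equal-size convolutions.
# Uses the standard recurrence r_l = r_{l-1} + (k - 1) * (product of earlier strides).

def receptive_field(num_layers: int, kernel_size: int = 3, stride: int = 1) -> int:
    """Receptive field (in pixels) after `num_layers` stacked convolutions."""
    rf, jump = 1, 1          # current receptive field and cumulative stride ("jump")
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

if __name__ == "__main__":
    for depth in (1, 5, 10, 50):
        # 3x3, stride 1: 3, 11, 21, 101 pixels -> linear growth with depth
        print(depth, receptive_field(depth))
```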
Disclosure of Invention
In view of the foregoing, it is desirable to provide a multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage medium.
A multi-scale feature fusion remote sensing image segmentation method comprises the following steps:
and obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample.
Constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and a sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into a trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected.
A multi-scale feature fusion remote sensing image segmentation apparatus, the apparatus comprising:
and the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample.
The multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And the prediction result determining module is used for acquiring the remote sensing image to be detected and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain the prediction result of the remote sensing image to be detected.
The method comprises the steps of obtaining a high-resolution remote sensing image, marking the high-resolution remote sensing image to obtain a training sample, and constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network divides the training sample into a plurality of small images with fixed size, and the small images are unfolded into one-dimensional vectors and embedded into position codes to obtain an input sequence; the encoder extracts the characteristics of different levels of an input sequence by using a multi-layer Transformer module; the decoder obtains the features of different scales through convolution operation after the features of different levels are subjected to shape adjustment, fuses the features of different scales through splicing operation, and finally obtains a sample prediction result through multiple times of convolution and up-sampling; training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network; and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected. The method can fully utilize the multi-scale characteristic diagram extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the characteristic that the target in the remote sensing image is complex and changeable.
Drawings
FIG. 1 is a schematic flow chart illustrating a multi-scale feature fusion remote sensing image segmentation method according to an embodiment;
FIG. 2 is a schematic diagram of a multi-scale feature fusion remote sensing image segmentation network in another embodiment;
FIG. 3 is a diagram illustrating a structure of a decoder based on multi-scale feature map fusion in another embodiment;
FIG. 4 is a schematic flow chart of a feature transformation method in another embodiment, wherein (a) is a first feature transformation method and (b) is a second feature transformation method;
FIG. 5 is a block diagram of a multi-scale feature fusion remote sensing image segmentation apparatus according to an embodiment;
fig. 6 is an internal structural diagram of the apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a multi-scale feature fusion remote sensing image segmentation method is provided, and the method includes the following steps:
step 100: and obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample.
Step 102: and constructing a multi-scale feature fusion remote sensing image segmentation network.
The multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion.
The input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
Specifically, the multi-scale feature fusion remote sensing image segmentation network (SETR-MFPD) uses a Vision Transformer (ViT) as its encoder; the encoder comprises b Transformer modules built on a multi-head self-attention mechanism. The decoder gives the sample prediction result by fusing the image features extracted at different layers.
The multi-scale feature fusion remote sensing image segmentation network has the following advantages: the features of different layers extracted by the encoder are transformed to different sizes, and feature maps with different numbers of channels are fed into the decoder, which facilitates feature fusion in the decoder and improves the segmentation of unevenly distributed targets of different sizes; and the multi-scale feature maps of different layers are fused in the decoder through a splicing operation, which improves the decoder's perception of local and global information.
The SEgmentation TRansformer (SETR) applies the Transformer to the semantic segmentation task: SETR uses ViT to extract image features, restores the image features to a multi-channel feature map by reshaping them, and finally feeds the restored feature map to a CNN-based decoder to perform semantic segmentation.
Step 104: and training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
Step 106: and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected.
In the multi-scale feature fusion remote sensing image segmentation method, a high-resolution remote sensing image is obtained and marked to obtain a training sample; constructing a multi-scale feature fusion remote sensing image segmentation network, wherein the network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling; training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network; and acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected. The method can fully utilize the multi-scale characteristic diagram extracted by the encoder, combines local classification with hierarchical segmentation, and can adapt to the characteristic that the target in the remote sensing image is complex and changeable.
In one embodiment, step 104 includes: inputting a training sample into an input network, dividing the training sample into a plurality of small images with fixed sizes, expanding the small images into one-dimensional vectors, then adjusting the dimensionality of the one-dimensional vectors through linear connection mapping, and embedding position codes into the dimensionality-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training a multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the encoder comprises b Transformer modules connected in series, each with the same structure; a Transformer module consists of a multi-head self-attention module, a layer normalization module and a multi-layer perceptron module, where b is an integer greater than or equal to 1. Step 104 further comprises: inputting the input sequence into the first Transformer module, processing it with the layer normalization module to obtain a normalized input sequence, extracting features from the normalized input sequence with the multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features with the layer normalization module, feeding the normalized result into the multi-layer perceptron module, and fusing the resulting output features with the attention fusion features to obtain the output features of the first Transformer module; taking the output features of the first Transformer module as the input sequence of the second Transformer module and inputting it into the second Transformer module to obtain the output features of the second Transformer module, and so on, to obtain b image features from shallow to deep; and performing feature selection on the b image features from shallow to deep at equal layer intervals to obtain the features of different layers.
Wherein: the b image features from shallow to deep comprise the output features of the first Transformer module through the b-th Transformer module, namely the output features of the first Transformer module, the output features of the second Transformer module, ..., and the output features of the b-th Transformer module.
In one embodiment, the decoder based on multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; step 104 further comprises: inputting the features of different levels into a multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into an image size recovery module to obtain a prediction result; and training the multi-scale feature fusion remote sensing image segmentation network according to the prediction result and the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the features of different levels include: the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module, wherein s is the layer interval of the feature extraction, s is an integer greater than 1, and b is greater than 3s. Step 104 further comprises: inputting the output features of the s-th, 2s-th, 3s-th and b-th Transformer modules into the multi-scale feature fusion module; transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module with a first feature transformation method to obtain the feature map of the 3s layer and the first feature map of the b layer, both of size $K \times H \times W$, where $K$ is the vector length and $H$ and $W$ are the height and width of the feature map; the first feature transformation method adjusts each column vector of the feature matrix into a two-dimensional feature map. The output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module are transformed with a second feature transformation method to obtain the second feature map of the b layer, the feature map of the s layer and the feature map of the 2s layer, each of size $\frac{K}{n} \times \sqrt{n}H \times \sqrt{n}W$ for a layer-specific splicing factor $n$; the second feature transformation method adjusts every $n$ column vectors into $n$ feature maps and splices these $n$ feature maps together, yielding $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly. Convolution operations are then performed on the feature map of the s layer, the feature map of the 2s layer, the feature map of the 3s layer and the first feature map of the b layer to obtain the convolution feature map of the s layer, the convolution feature map of the 2s layer, the convolution feature map of the 3s layer and the first convolution feature map of the b layer; the four convolution feature maps are upsampled and then spliced with the second feature map of the b layer to obtain the multi-scale fusion feature map.
The first feature transformation method and the second feature transformation method are both methods for transforming two-dimensional features into three-dimensional multi-scale feature maps.
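A small PyTorch sketch of the two feature transformation methods is given below, assuming an input feature of shape (N, K) with N = H×W token positions; the function names are illustrative and not part of the disclosure.

```python
# Sketch of the two transformations from a 2-D feature (N, K) to multi-scale maps.
# Method (a): each of the K columns becomes one H x W map -> (K, H, W).
# Method (b): groups of n columns (n a perfect square dividing K) are tiled into
#             one larger map -> (K // n, sqrt(n)*H, sqrt(n)*W).
import torch

def transform_a(z: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """z: (N, K) with N == h*w  ->  (K, h, w)."""
    n_tokens, k = z.shape
    assert n_tokens == h * w
    return z.transpose(0, 1).reshape(k, h, w)

def transform_b(z: torch.Tensor, h: int, w: int, n: int) -> torch.Tensor:
    """z: (N, K)  ->  (K // n, sqrt(n)*h, sqrt(n)*w) by splicing n maps into one."""
    r = int(n ** 0.5)
    assert r * r == n and z.shape[1] % n == 0
    maps = transform_a(z, h, w)                 # (K, h, w)
    k = maps.shape[0]
    maps = maps.reshape(k // n, r, r, h, w)     # group n consecutive maps per output map
    maps = maps.permute(0, 1, 3, 2, 4)          # (K/n, r, h, r, w)
    return maps.reshape(k // n, r * h, r * w)   # splice the r x r grid into one map
```

For example, with N = 256 (H = W = 16) and K = 768, method (a) yields 768 maps of size 16×16, while method (b) with n = 4 yields 192 maps of size 32×32.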
In one embodiment, step 104 further comprises: inputting the multi-scale fusion feature map into an image size recovery module, and recovering the multi-scale fusion feature map to the size of an original image by adopting progressive convolution and up-sampling operation to obtain a sample prediction result; the progressive convolution is a three-fold serial convolution operation.
In one embodiment, the multi-layered perceptron module comprises a fully-connected layer of two hidden layers and a GELU activation function; the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
In one embodiment, a multi-scale feature fusion remote sensing image segmentation method is provided; the SETR-MFPD network adopts an encoding-decoding structure, as shown in FIG. 2. The encoder is a complete Transformer network comprising b Transformer modules built on a multi-head self-attention mechanism. The decoder gives the prediction result by fusing the image features extracted at different layers. In the decoder of FIG. 2, convolution 1 is a 3×3 convolution with stride 1, convolution 2 is the same as convolution 1, convolution 3 is a 1×1 convolution with stride 1, and the upsampling enlarges the feature map size by a factor of 2.
As can be seen from the network architecture diagram shown in fig. 2: a given input image is firstly divided into small blocks with fixed sizes, then the small blocks are unfolded into one-dimensional vectors and embedded with position codes, and then the vectors are input into an encoder consisting of b transform modules to extract features of different layers of the image. The method comprises the steps of reducing feature map channels through convolution once after the shapes of features of different layers are adjusted, fusing the features of different scales through splicing operation, and finally generating a prediction result through multiple times of convolution and up-sampling.
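For reference, a minimal PyTorch-style sketch of the decoder building blocks named in FIG. 2 is given below; the helper names are illustrative, and the channel numbers are left as arguments because they are fixed only later in the description.

```python
# Sketch of the FIG. 2 decoder building blocks:
# "convolution 1" / "convolution 2": 3x3 convolution, stride 1
# "convolution 3": 1x1 convolution, stride 1
# "upsampling": doubles the feature-map height and width
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int) -> nn.Module:      # convolution 1 / convolution 2
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)

def conv1x1(in_ch: int, out_ch: int) -> nn.Module:      # convolution 3
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)

def upsample_x2() -> nn.Module:                          # enlarges H and W by a factor of 2
    return nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```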
(1) An encoder:
the self-attention mechanism used in the Transformer network integrates well the local and global information of the input sequence. Therefore, the problem of limited scope of the FCN framework can be solved by using the Vision transform as the encoder for dividing tasks.
1) Input processing
In natural language processing tasks, the input to the Transformer network is a set of one-dimensional vectors, whereas the input in a vision task is a two-dimensional image. In order to use the Transformer network as the encoder of the segmentation task, the dimensions of the input image must therefore be adjusted. An input image $X \in \mathbb{R}^{H \times W \times 3}$ is first divided along its height and width into $N$ image blocks $x_i \in \mathbb{R}^{Y \times Y \times 3}$ of equal length and width, where $Y$ is the side length of a block and $N = \frac{HW}{Y^2}$. The $N$ image blocks are then flattened along their height and width to obtain $N$ vectors of length $K = 3Y^2$, forming a sequence $\{x_1, \dots, x_N\}$, where $K$ is the vector length. The value of $N$ has an important influence on the performance of the Transformer network. For a $256 \times 256 \times 3$ RGB input, taking $N = 4$ would divide the image into four one-dimensional vectors of dimension 49,152. The Multi-Layer Perceptron (MLP) structure used by the Transformer cannot support such high-dimensional vectors as its input, because the fully connected layers in the MLP consume a great deal of time and memory when processing such high-dimensional inputs.

In the semantic segmentation task, an encoder usually obtains multi-scale feature maps $X_i \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times C}$ by downsampling, where $C$ is the number of channels of the feature map and the index $i$ corresponds to feature maps of different scales. To facilitate the feature-shape adjustment in the decoder and to account for the performance of the Transformer, we take $N = 256$ and cut the $256 \times 256$ input image into 256 image blocks of size $16 \times 16$, which are flattened into a sequence of 256 one-dimensional vectors of 768 dimensions, denoted $Z = \{z_1, \dots, z_{256}\}$ with $z_i \in \mathbb{R}^{768}$, where $i$ is the index of the block. In ViT, before being input into the Transformer network, the vectors are first mapped through a linear projection $E$ that adjusts the dimension of the input vectors, and trained class-code and position-code parameters are then embedded, where the class code is embedded by extending the input dimension by the number of classes and the position code is embedded by addition to the input vectors. However, the dimensional change introduced by the embedded class code would make the subsequent shape adjustment difficult, so only the position code is embedded in the input vectors, and the final input can be represented as $Z_0 = \{z_1 E, \dots, z_N E\} + E_{pos}$, where $E_{pos}$ denotes the embedded position code.
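A minimal PyTorch sketch of this input processing is shown below, under the shapes stated above (a 256×256×3 image cut into 256 blocks of 16×16, flattened to 768-dimensional vectors, linearly projected and summed with a learned position code); the class and parameter names are illustrative, not the patent's.

```python
# Patch splitting, flattening, linear projection E, and position code E_pos
# (no class token is used, as explained above).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size: int = 256, patch_size: int = 16,
                 in_ch: int = 3, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # linear projection E applied to each flattened patch
        self.proj = nn.Linear(patch_size * patch_size * in_ch, embed_dim)
        # learned position code E_pos
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch_size
        # cut into non-overlapping p x p blocks and flatten each block to a 1-D vector
        x = x.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(x) + self.pos_embed              # Z0 = {z_i E} + E_pos
```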
2) Transformer network
The method in this embodiment serializes the image to obtain the initial input $Z_0$ and uses the Transformer network as the encoder to extract image features. A Transformer network consists of a number of Transformer modules connected in series; each module has the same structure, and the output of one module is the input of the next. Each module is composed of a multi-head self-attention (MSA) module, a Layer Normalization (LN) module and an MLP module. Suppose the input sequence of the $l$-th layer is $Z_{l-1}$. The input sequence first passes through the LN module to obtain the normalized sequence $Z'_{l-1}$, as shown in Equation (1):

$Z'_{l-1} = \mathrm{LN}(Z_{l-1})$ (1)

$Z'_{l-1}$ serves as the input of the MSA module, which is composed of $h$ self-attention (SA) modules. The inputs of an SA module are the three matrices $Q$, $K$ and $V$; they are computed as in Equation (2), and the SA module itself as in Equation (3):

$Q = Z'_{l-1} W_Q, \quad K = Z'_{l-1} W_K, \quad V = Z'_{l-1} W_V$ (2)

$\mathrm{SA}(Z'_{l-1}) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$ (3)

where $W_Q, W_K, W_V \in \mathbb{R}^{K \times d}$ are trainable weight parameters and $d$ determines the size of the weight parameters. The computation of the MSA module is shown in Equation (4):

$\mathrm{MSA}(Z'_{l-1}) = \mathrm{concat}\big(\mathrm{SA}_1(Z'_{l-1}), \dots, \mathrm{SA}_h(Z'_{l-1})\big) W_O$ (4)

where concat splices the $h$ matrices of size $N \times d$ along the row dimension into a matrix of size $N \times hd$, $W_O \in \mathbb{R}^{hd \times K}$ is a trainable weight parameter, and $d$ is generally set to $K/h$. $\mathrm{SA}_i(Z'_{l-1})$ denotes the output of the $i$-th self-attention module, computed as in Equation (3) with the trainable weight parameters $W_Q^{i}, W_K^{i}, W_V^{i}$ of the $i$-th SA module.

The complete computation of a Transformer module is given by Equations (5) to (7):

$Z'_{l-1} = \mathrm{LN}(Z_{l-1})$ (5)

$Z''_{l} = \mathrm{MSA}(Z'_{l-1}) + Z_{l-1}$ (6)

$Z_{l} = \mathrm{MLP}\big(\mathrm{LN}(Z''_{l})\big) + Z''_{l}$ (7)

where the MLP consists of a fully connected block with two hidden layers and a GELU activation function, $l = 1 \dots b$, and $b$ is the number of Transformer modules in the Transformer network, i.e. the number of layers of the Transformer network. Through $b$ iterations, the Transformer network extracts $b$ image features from shallow to deep, denoted $Z_1, \dots, Z_b$. By selecting different numbers of linear projection dimensions, heads $h$, layers $b$ and hidden-layer sizes, Dosovitskiy et al. designed three different ViT models, as shown in Table 1.
TABLE 1 configuration of the different ViT models
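The computation in Equations (1) to (7) can be sketched in PyTorch as follows; the hyper-parameter values (12 heads, hidden size 3072) follow the common ViT-Base configuration and are assumptions rather than values fixed by the disclosure.

```python
# Sketch of one Transformer module: LN, multi-head self-attention with a residual
# connection (Equations (5)-(6)), then an MLP with GELU and a second residual (Equation (7)).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.d = heads, dim // heads           # per-head width d = K / h
        self.qkv = nn.Linear(dim, dim * 3, bias=False)     # W_Q, W_K, W_V stacked
        self.out = nn.Linear(dim, dim, bias=False)         # W_O

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (B, N, K)
        b, n, k = z.shape
        q, k_, v = self.qkv(z).chunk(3, dim=-1)
        def split(t):                                      # split into h heads of width d
            return t.reshape(b, n, self.heads, self.d).transpose(1, 2)
        q, k_, v = split(q), split(k_), split(v)
        attn = torch.softmax(q @ k_.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, k)  # concat of the h heads
        return self.out(out)

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = MultiHeadSelfAttention(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = self.msa(self.ln1(z)) + z        # Equations (5)-(6)
        return self.mlp(self.ln2(z)) + z     # Equation (7)
```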
(2) Multi-scale feature fusion decoder
The features of different levels of the input sequence extracted by the encoder all have the same shape. In this embodiment, a multi-level feature fusion decoder similar to a feature pyramid network is designed; its structure is shown in FIG. 3. The difference from the feature pyramid network is that the features extracted by the Transformer modules need to be adjusted into pyramid shapes, while the features of different layers extracted by the encoder do not need to be resized through pooling operations, so information loss is avoided.
This embodiment selects four features of different layers extracted by the Transformer network, $Z_s$, $Z_{2s}$, $Z_{3s}$ and $Z_b$, where $s$ determines the layer interval at which features are selected. FIG. 4 shows the two methods for transforming the two-dimensional features into three-dimensional multi-scale feature maps. The first feature transformation method, shown in FIG. 4 (a), directly adjusts each column vector into a two-dimensional feature map, yielding $K$ feature maps. The second feature transformation method, shown in FIG. 4 (b), adjusts every $n$ column vectors into $n$ feature maps and splices these $n$ feature maps into one larger feature map, thereby obtaining $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly. The deep features $Z_{3s}$ and $Z_b$ are resized with the method shown in FIG. 4 (a) into feature maps of size $K \times H \times W$. $Z_b$ is additionally resized with the method shown in FIG. 4 (b) to obtain the second feature map of the b layer. The shallow features $Z_s$ and $Z_{2s}$ are resized with the method shown in FIG. 4 (b) into feature maps with larger spatial sizes and $\frac{K}{n}$ channels each. Convolution kernels of size $3 \times 3$ with stride 1 and output channels of 256, 128, 64 and 32 are then applied to these four feature maps. Finally, the four convolved feature maps are upsampled and spliced with the second feature map of the b layer to obtain the multi-scale fusion feature map. In order to utilize the information of the multi-scale fusion feature map to the greatest extent, the image size is restored by progressive convolution and upsampling, and the original image size is finally recovered through three serial convolution and upsampling operations.
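A condensed PyTorch sketch of this fusion decoder is given below. The splicing factors n used by method (b) for the different layers, the channel ordering of the 3×3 convolutions, and the final classification convolution are placeholders chosen only to make the shapes consistent (input features of shape (B, 256, 768), output restored to 256×256); they are assumptions, not values fixed by the figures.

```python
# Sketch of the multi-scale fusion decoder: reshape the four selected encoder
# outputs, reduce channels with 3x3 stride-1 convolutions (256/128/64/32),
# upsample, splice with the b-layer second feature map, then recover the image
# size with three serial convolution + x2 upsampling stages.
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_maps_a(z):                              # method (a): (B, N, K) -> (B, K, 16, 16)
    b, n, k = z.shape
    h = int(n ** 0.5)
    return z.transpose(1, 2).reshape(b, k, h, h)

def to_maps_b(z, n_splice):                    # method (b): splice groups of n_splice maps
    maps = to_maps_a(z)                        # (B, K, h, h)
    b, k, h, _ = maps.shape
    r = int(n_splice ** 0.5)
    maps = maps.reshape(b, k // n_splice, r, r, h, h).permute(0, 1, 2, 4, 3, 5)
    return maps.reshape(b, k // n_splice, r * h, r * h)

class FusionDecoder(nn.Module):
    def __init__(self, k=768, num_classes=7, n_splice=(4, 4, 4)):   # placeholder factors
        super().__init__()
        self.n_splice = n_splice
        # 3x3, stride-1 convolutions reducing the channels of the four maps
        self.reduce = nn.ModuleList([
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1)
            for c_in, c_out in zip((k // n_splice[0], k // n_splice[1], k, k),
                                   (32, 64, 128, 256))])
        fused_ch = 32 + 64 + 128 + 256 + k // n_splice[2]
        # progressive recovery: three serial convolution + x2 upsampling stages,
        # followed by an assumed 1x1 classification convolution
        up = lambda: nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.recover = nn.Sequential(
            nn.Conv2d(fused_ch, 256, 3, 1, 1), up(),
            nn.Conv2d(256, 128, 3, 1, 1), up(),
            nn.Conv2d(128, 64, 3, 1, 1), up(),
            nn.Conv2d(64, num_classes, 1))

    def forward(self, z_s, z_2s, z_3s, z_b):   # each: (B, 256, 768)
        f_s  = to_maps_b(z_s,  self.n_splice[0])
        f_2s = to_maps_b(z_2s, self.n_splice[1])
        f_3s = to_maps_a(z_3s)
        f_b1 = to_maps_a(z_b)
        f_b2 = to_maps_b(z_b, self.n_splice[2])            # splicing target of the fusion
        maps = [conv(f) for conv, f in zip(self.reduce, (f_s, f_2s, f_3s, f_b1))]
        size = f_b2.shape[-2:]
        maps = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
                for m in maps]
        fused = torch.cat(maps + [f_b2], dim=1)            # multi-scale fusion feature map
        return self.recover(fused)                         # restored to the input size
```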
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a multi-scale feature fusion remote sensing image segmentation apparatus, including: the remote sensing image acquisition module, the multi-scale feature fusion remote sensing image segmentation network construction module, the multi-scale feature fusion remote sensing image segmentation network training module and the prediction result determination module of the remote sensing image to be measured are as follows:
and the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample.
The multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding the one-dimensional vectors into position codes to obtain an input sequence; the encoder is used for extracting the characteristics of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling.
And the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the label of the training sample and a sample prediction result obtained by inputting the training sample into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
And the prediction result determining module is used for acquiring the remote sensing image to be detected, inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network, and obtaining the prediction result of the remote sensing image to be detected.
In one embodiment, the multi-scale feature fusion remote sensing image segmentation network training module is further configured to input a training sample into an input network, segment the training sample into a plurality of small block images with fixed sizes, expand the small block images into one-dimensional vectors, then adjust the dimensions of the one-dimensional vectors through linear connection mapping, and embed position codes in the dimension-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training a multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the encoder comprises b Transformer modules connected in series, each with the same structure; a Transformer module consists of a multi-head self-attention module, a layer normalization module and a multi-layer perceptron module, where b is an integer greater than or equal to 1. The multi-scale feature fusion remote sensing image segmentation network training module is also used for: inputting the input sequence into the first Transformer module, processing it with the layer normalization module to obtain a normalized input sequence, extracting features from the normalized input sequence with the multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features with the layer normalization module, feeding the normalized result into the multi-layer perceptron module, and fusing the resulting output features with the attention fusion features to obtain the output features of the first Transformer module; taking the output features of the first Transformer module as the input sequence of the second Transformer module and inputting it into the second Transformer module to obtain the output features of the second Transformer module, and so on, to obtain b image features from shallow to deep; and performing feature selection on the b image features from shallow to deep at equal layer intervals to obtain the features of different layers.
In one embodiment, the decoder based on multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; the multi-scale feature fusion remote sensing image segmentation network training module is also used for inputting features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into an image size recovery module to obtain a sample prediction result; and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network.
In one embodiment, the features of different levels include: the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module, wherein s is the layer interval of the feature extraction, s is an integer greater than 1, and b is greater than 3s. The multi-scale feature fusion remote sensing image segmentation network training module is further used for: inputting the output features of the s-th, 2s-th, 3s-th and b-th Transformer modules into the multi-scale feature fusion module; transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module with the first feature transformation method to obtain the feature map of the 3s layer and the first feature map of the b layer, both of size $K \times H \times W$, where $K$ is the vector length and $H$ and $W$ are the height and width of the feature map, the first feature transformation method adjusting each column vector of the feature map into a two-dimensional feature map; transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module with the second feature transformation method to obtain the second feature map of the b layer, the feature map of the s layer and the feature map of the 2s layer, each of size $\frac{K}{n} \times \sqrt{n}H \times \sqrt{n}W$ for a layer-specific splicing factor $n$, the second feature transformation method adjusting every $n$ column vectors into $n$ feature maps and splicing these $n$ feature maps, yielding $\frac{K}{n}$ feature maps, where $n$ is a perfect square that divides $K$ exactly; performing convolution operations on the feature map of the s layer, the feature map of the 2s layer, the feature map of the 3s layer and the first feature map of the b layer to obtain the convolution feature map of the s layer, the convolution feature map of the 2s layer, the convolution feature map of the 3s layer and the first convolution feature map of the b layer; and upsampling the four convolution feature maps and splicing them with the second feature map of the b layer to obtain the multi-scale fusion feature map.
In one embodiment, the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the multi-scale fusion feature map into the image size recovery module, and recovering the multi-scale fusion feature map to the size of an original image by adopting progressive convolution and up-sampling operations to obtain a sample prediction result; the progressive convolution is a three-fold serial convolution operation.
In one embodiment, the multi-layer perceptron module in the device comprises a fully-connected layer of two hidden layers and a GELU activation function; the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
For specific limitations of the multi-scale feature fusion remote sensing image segmentation device, reference may be made to the above limitations of the multi-scale feature fusion remote sensing image segmentation method, which are not described herein again. All modules in the multi-scale feature fusion remote sensing image segmentation device can be completely or partially realized through software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The apparatus includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the device is configured to provide computing and control capabilities. The memory of the device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a multi-scale feature fusion remote sensing image segmentation method. The display screen of the device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the device can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer device, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, an apparatus is provided, comprising a memory storing a computer program and a processor implementing the steps of the above-described method embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
In one validated embodiment, the Gaofen-2 Chenzhou (GF2-CZ) dataset was used as the experimental dataset. The original images of the GF2-CZ dataset come from remote sensing images of six towns in the Chenzhou area captured by Gaofen-2; the spatial resolution of each image is 0.8 m, and pixel-level labels were produced for each remote sensing image. According to the landform characteristics of Chenzhou, the labels comprise seven categories: background, woodland, wetland, river, building, road and hilly region. A $256 \times 256$ sampling window was used to randomly sample the six remote sensing images, and 10,000 training pictures and 2,000 test pictures were finally obtained through data enhancement means such as rotation and blurring. Specific information on the GF2-CZ dataset is shown in Table 2.
TABLE 2 GF2-CZ dataset
In this embodiment, the Pixel Accuracy (PA), Mean Intersection over Union (MIoU) and Frequency Weighted Intersection over Union (FWIoU) commonly used in semantic segmentation tasks are selected as the performance metrics of the model. Assuming that there are $k$ target classes and 1 background class, PA, MIoU and FWIoU are computed as shown in Equations (8), (9) and (10):

$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$ (8)

$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (9)

$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (10)

where $p_{ij}$ denotes the total number of pixels belonging to class $i$ but predicted as class $j$; specifically, $p_{ii}$ is the number of correctly classified pixels, while $p_{ij}$ and $p_{ji}$ (for $i \neq j$) are the numbers of misclassified pixels.
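These three metrics can be sketched from a confusion matrix as follows; this is a small NumPy example, not part of the disclosure.

```python
# p[i, j] counts pixels of class i predicted as class j, for k target classes
# plus one background class, so p has shape (k+1, k+1).
import numpy as np

def segmentation_metrics(p: np.ndarray):
    total = p.sum()
    pa = np.diag(p).sum() / total                           # Equation (8)
    union = p.sum(axis=1) + p.sum(axis=0) - np.diag(p)      # per-class union of prediction and label
    iou = np.diag(p) / np.maximum(union, 1)                 # guard against empty classes
    miou = iou.mean()                                       # Equation (9)
    freq = p.sum(axis=1) / total                            # per-class pixel frequency
    fwiou = (freq * iou).sum()                              # Equation (10)
    return pa, miou, fwiou
```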
The embodiment uses the open-source MMSegmentation library to build the experimental platform. MMSegmentation is a PyTorch-based semantic segmentation open-source toolkit that is part of the OpenMMLab project. MMSegmentation integrates a number of semantic segmentation methods such as PSPNet, DeepLabV3 and SETR, and provides a uniform benchmark platform for users.
This embodiment uses cross entropy as the loss function, selects SGD as the optimizer, sets the initial learning rate to 0.001, gradually decays the learning rate according to a polynomial schedule, sets the momentum and weight decay coefficients to 0.9 and 0.0005, respectively, and sets the batch size to 2.
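A minimal PyTorch sketch of this training configuration is shown below; the total number of iterations and the polynomial decay power are placeholders, and `model` and `train_loader` are assumed to exist.

```python
# Cross-entropy loss, SGD (lr 0.001, momentum 0.9, weight decay 0.0005),
# polynomial learning-rate decay; batch size 2 is set when building train_loader.
import torch
import torch.nn as nn

def train(model, train_loader, device, max_iters=40_000, power=0.9):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    # lr = lr0 * (1 - iter / max_iters) ** power
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1 - it / max_iters) ** power)
    model.to(device).train()
    it = 0
    while it < max_iters:
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            it += 1
            if it >= max_iters:
                break
```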
This example performed comparative experiments on the GF2-CZ dataset, comparing the performance of SETR-MFPD with FCN-8s, PSPNet, DeepLabV3, SETR-Naive, SETR-MLA and SETR-PUP. The network weights in the encoder were initialized with the weights of a model pre-trained on ImageNet. The results of the experiment are shown in Table 3.
TABLE 3 Experimental results of different segmentation methods
As can be seen from Table 3, the depth of the network used by the encoder has a large impact on the segmentation result: deep networks outperform shallow ones. The methods using a Transformer as the encoder are higher than the CNN-based methods in accuracy, MIoU and FWIoU. Specifically, PSPNet, which uses multi-level pyramid fusion, has an accuracy of 90.49%; although this is slightly lower than the 91.36% of FCN-8s, its MIoU reaches 55.66%, indicating that a decoder with a multi-level pyramid structure classifies targets more accurately as a whole. Although DeepLabV3 uses the ASPP module, its accuracy (89.20%) and MIoU (50.62%) are not high; considering its performance on other datasets, the feature extraction capability of DeepLabV3 on remote sensing images is not good. DFCN121-C, which uses a modified decoder module with DenseNet-121 as its encoder, performed best among the CNN-based segmentation methods in the comparative experiments, with 91.54% accuracy and 53.58% MIoU. Among the Transformer-based methods, SETR-PUP has the highest accuracy (91.66%), and the improved SETR-MFPD achieves the highest MIoU (60.13%). The multi-level feature fusion decoder used in this embodiment therefore achieves a better overall classification effect on the targets in remote sensing images. The methods using the Transformer are superior to the CNN-based methods both in segmentation accuracy and in overall segmentation effect. The experimental results also show that the SETR methods segment objects such as forest land, water areas and buildings better than FCN and PSPNet, and SETR-MFPD is better than SETR-MLA in the segmentation accuracy of woodland.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but it shall not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. A multi-scale feature fusion remote sensing image segmentation method is characterized by comprising the following steps:
obtaining a remote sensing image with high resolution, and marking the remote sensing image to obtain a training sample;
constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling;
training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and a sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain a trained multi-scale feature fusion remote sensing image segmentation network;
acquiring a remote sensing image to be detected, and inputting the remote sensing image to be detected into a trained multi-scale feature fusion remote sensing image segmentation network to obtain a prediction result of the remote sensing image to be detected;
wherein, the steps are as follows: training the multi-scale feature fusion remote sensing image segmentation network according to the labeling of the training sample and a sample prediction result obtained by inputting the training sample into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network, and the method comprises the following steps of:
inputting the training sample into the input network, dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors, then adjusting the dimensions of the one-dimensional vectors through linear connection mapping, and embedding position codes into the vectors with the adjusted dimensions to obtain an input sequence;
inputting the input sequence into an encoder to obtain features of different layers;
inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain a trained multi-scale feature fusion remote sensing image segmentation network;
the encoder comprises b Transformer modules which are connected in series, and the structures of the Transformer modules are the same; the Transformer module consists of a multi-head self-attention module, a layer standardization module and a multi-layer perceptron module; wherein b is an integer greater than or equal to 1;
the method comprises the following steps: inputting the input sequence into an encoder to obtain features of different layers, wherein the features of different layers comprise:
inputting the input sequence into a first Transformer module, processing the input sequence by a layer standardization module to obtain a standardized input sequence, extracting features of the standardized input sequence by a multi-head self-attention module to obtain attention features, fusing the attention features with the input sequence to obtain attention fusion features, processing the attention fusion features by the layer standardization module, inputting an obtained standardized processing result into a multi-layer perceptron module, and fusing the obtained perceptron output features with the attention fusion features to obtain first Transformer module output features;
taking the output characteristics of the first Transformer module as an input sequence of a second Transformer module, inputting the input sequence into the second Transformer module to obtain the output characteristics of the second Transformer module, and so on to obtain b image characteristics from shallow to deep;
selecting features from the b image features from shallow to deep according to the same layer interval to obtain features of different layers;
the decoder based on the multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module;
the method comprises the following steps: inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network, wherein the method comprises the following steps:
inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map;
inputting the multi-scale fusion feature map into the image size recovery module to obtain a sample prediction result;
training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network;
wherein the different levels of features include: the output characteristics of the s-th Transformer module, the output characteristics of the 2 s-th Transformer module, the output characteristics of the 3 s-th Transformer module and the output characteristics of the b-th Transformer module; wherein s is the interval of the layers of the feature extraction, s is an integer greater than 1, and b is greater than 3 s;
the method comprises the following steps: inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map, which comprises:
inputting the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module into the multi-scale feature fusion module, and respectively transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module by adopting a first feature transformation method to obtain a feature map of the 3s layer and a first feature map of the b layer; wherein the feature map of the 3s layer and the first feature map of the b layer are each of size K × H × W, where K is the vector length and H and W are respectively the height and width of the feature map; the first feature transformation method is to adjust each column vector of the feature map into a two-dimensional feature map;
respectively transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module by adopting a second feature transformation method to obtain a second feature map of the b layer, a feature map of the s layer and a feature map of the 2s layer, each being a feature map of size (K/n) × √n·H × √n·W as determined by the second feature transformation; the second feature transformation method is to adjust every n column vectors of the feature map into n two-dimensional feature maps and splice the n feature maps into one, thereby obtaining K/n feature maps in total; wherein n is a perfect square that divides K evenly;
respectively performing convolution operation on the s-layer feature map, the 2 s-layer feature map, the 3 s-layer feature map and the b-layer first feature map to obtain an s-layer convolution feature map, a 2 s-layer convolution feature map, a 3 s-layer convolution feature map and a b-layer first convolution feature map;
and performing up-sampling on the s-layer convolution feature map, the 2s-layer convolution feature map, the 3s-layer convolution feature map and the b-layer first convolution feature map, and then splicing the up-sampled convolution feature maps with the b-layer second feature map to obtain a multi-scale fusion feature map.
2. The method of claim 1, wherein inputting the multi-scale fused feature map into the image size recovery module to obtain a sample prediction result comprises:
inputting the multi-scale fusion feature map into the image size recovery module, and recovering the multi-scale fusion feature map to the size of the original image by progressive convolution and up-sampling operations to obtain a sample prediction result; the progressive convolution is three serial convolution operations.
3. The method according to any one of claims 1-2, wherein the multi-layer perceptron module comprises a fully connected network with two hidden layers and one GELU activation function;
the multi-head self-attention module consists of h self-attention modules; wherein h is an integer greater than 1.
4. A multi-scale feature fusion remote sensing image segmentation device is characterized by comprising:
the remote sensing image acquisition module is used for acquiring a high-resolution remote sensing image and marking the remote sensing image to obtain a training sample;
the multi-scale feature fusion remote sensing image segmentation network construction module is used for constructing a multi-scale feature fusion remote sensing image segmentation network; the multi-scale feature fusion remote sensing image segmentation network comprises an input network, an encoder and a decoder based on multi-scale feature map fusion; the input network is used for dividing the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors and embedding position codes to obtain an input sequence; the encoder is used for extracting features of different layers of the input sequence by utilizing a multi-layer Transformer module; the decoder is used for adjusting the shapes of the features of different layers, obtaining the features of different scales through convolution operation, fusing the features of different scales through splicing operation, and finally obtaining a sample prediction result through multiple times of convolution and up-sampling;
the multi-scale feature fusion remote sensing image segmentation network training module is used for training the multi-scale feature fusion remote sensing image segmentation network according to the labels of the training samples and the sample prediction result obtained by inputting the training samples into the multi-scale feature fusion remote sensing image segmentation network to obtain the trained multi-scale feature fusion remote sensing image segmentation network;
the prediction result determining module is used for acquiring the remote sensing image to be detected and inputting the remote sensing image to be detected into the trained multi-scale feature fusion remote sensing image segmentation network to obtain the prediction result of the remote sensing image to be detected;
the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the training sample into the input network, segmenting the training sample into a plurality of small block images with fixed sizes, unfolding the small block images into one-dimensional vectors, then adjusting the dimensionality of the one-dimensional vectors through linear connection mapping, and embedding position codes into the dimensionality-adjusted vectors to obtain an input sequence; inputting the input sequence into an encoder to obtain features of different layers; inputting the features of different levels into a decoder based on multi-scale feature map fusion to obtain a sample prediction result, and training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain a trained multi-scale feature fusion remote sensing image segmentation network; the encoder comprises b Transformer modules which are connected in series, and the structures of the Transformer modules are the same; the Transformer module consists of a multi-head self-attention module, a layer standardization module and a multi-layer perceptron module; wherein b is an integer greater than or equal to 1;
the method comprises the following steps: the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the input sequence into a first transform module, processing the input sequence through a layer standardization module to obtain a standardized input sequence, extracting features of the standardized input sequence through a multi-head self-attention module to obtain attention features, fusing the attention features and the input sequence to obtain attention fusion features, processing the attention fusion features through the layer standardization module, inputting the obtained standardized processing result into a multilayer perceptron module, and fusing the obtained perceptron output features and the attention fusion features to obtain first transform module output features; taking the output characteristics of the first Transformer module as an input sequence of a second Transformer module, inputting the input sequence into the second Transformer module to obtain the output characteristics of the second Transformer module, and so on to obtain b image characteristics from shallow to deep; selecting features from the b image features from shallow to deep according to the same layer interval to obtain features of different layers;
the decoder based on the multi-scale feature map fusion is composed of a multi-scale feature fusion module and an image size recovery module; the multi-scale feature fusion remote sensing image segmentation network training module is also used for inputting the features of different levels into the multi-scale feature fusion module to obtain a multi-scale fusion feature map; inputting the multi-scale fusion feature map into the image size recovery module to obtain a sample prediction result; training the multi-scale feature fusion remote sensing image segmentation network according to the sample prediction result and the label of the remote sensing image to obtain the trained multi-scale feature fusion remote sensing image segmentation network; wherein the different levels of features include: the output characteristics of the s-th Transformer module, the output characteristics of the 2 s-th Transformer module, the output characteristics of the 3 s-th Transformer module and the output characteristics of the b-th Transformer module; wherein s is the interval of the layers of the feature extraction, s is an integer greater than 1, and b is greater than 3 s;
the multi-scale feature fusion remote sensing image segmentation network training module is further used for inputting the output features of the s-th Transformer module, the output features of the 2s-th Transformer module, the output features of the 3s-th Transformer module and the output features of the b-th Transformer module into the multi-scale feature fusion module, and respectively transforming the output features of the 3s-th Transformer module and the output features of the b-th Transformer module by adopting a first feature transformation method to obtain a feature map of the 3s layer and a first feature map of the b layer; wherein the feature map of the 3s layer and the first feature map of the b layer are each of size K × H × W, where K is the vector length and H and W are respectively the height and width of the feature map; the first feature transformation method is to adjust each column vector of the feature map into a two-dimensional feature map; respectively transforming the output features of the b-th Transformer module, the output features of the s-th Transformer module and the output features of the 2s-th Transformer module by adopting a second feature transformation method to obtain a second feature map of the b layer, a feature map of the s layer and a feature map of the 2s layer, each being a feature map of size (K/n) × √n·H × √n·W as determined by the second feature transformation; the second feature transformation method is to adjust every n column vectors of the feature map into n two-dimensional feature maps and splice the n feature maps into one, thereby obtaining K/n feature maps in total; wherein n is a perfect square that divides K evenly; respectively performing convolution operation on the s-layer feature map, the 2s-layer feature map, the 3s-layer feature map and the b-layer first feature map to obtain an s-layer convolution feature map, a 2s-layer convolution feature map, a 3s-layer convolution feature map and a b-layer first convolution feature map;
and performing up-sampling on the s-layer convolution feature map, the 2s-layer convolution feature map, the 3s-layer convolution feature map and the b-layer first convolution feature map, and then splicing the up-sampled convolution feature maps with the b-layer second feature map to obtain a multi-scale fusion feature map.
5. An apparatus comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 3 when executing the computer program.
6. A computer-readable memory, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN202111252286.9A 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage Active CN113688813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111252286.9A CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111252286.9A CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Publications (2)

Publication Number Publication Date
CN113688813A CN113688813A (en) 2021-11-23
CN113688813B true CN113688813B (en) 2022-01-04

Family

ID=78588237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111252286.9A Active CN113688813B (en) 2021-10-27 2021-10-27 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Country Status (1)

Country Link
CN (1) CN113688813B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359554A (en) * 2021-11-25 2022-04-15 河南农业大学 Image semantic segmentation method based on multi-receptive-field context semantic information
CN114037899A (en) * 2021-12-01 2022-02-11 福州大学 VIT-based hyperspectral remote sensing image-oriented classification radial accumulation position coding system
CN114022788B (en) * 2022-01-05 2022-03-04 长沙理工大学 Remote sensing image change detection method and device, computer equipment and storage medium
CN114092833B (en) * 2022-01-24 2022-05-27 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium
CN114419449B (en) * 2022-03-28 2022-06-24 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114913339B (en) * 2022-04-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for feature map extraction model
CN114758360B (en) * 2022-04-24 2023-04-18 北京医准智能科技有限公司 Multi-modal image classification model training method and device and electronic equipment
CN114943963B (en) * 2022-04-29 2023-07-04 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN114842312B (en) * 2022-05-09 2023-02-10 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN114972220B (en) * 2022-05-13 2023-02-21 北京医准智能科技有限公司 Image processing method and device, electronic equipment and readable storage medium
CN114998653B (en) * 2022-05-24 2024-04-26 电子科技大学 ViT network-based small sample remote sensing image classification method, medium and equipment
CN115019182B (en) * 2022-07-28 2023-03-24 北京卫星信息工程研究所 Method, system, equipment and storage medium for identifying fine granularity of remote sensing image target
CN115147606B (en) * 2022-08-01 2024-05-14 深圳技术大学 Medical image segmentation method, medical image segmentation device, computer equipment and storage medium
CN115761383B (en) * 2023-01-06 2023-04-18 北京匠数科技有限公司 Image classification method and device, electronic equipment and medium
CN116188431B (en) * 2023-02-21 2024-02-09 北京长木谷医疗科技股份有限公司 Hip joint segmentation method and device based on CNN and transducer
CN116310840A (en) * 2023-05-11 2023-06-23 天地信息网络研究院(安徽)有限公司 Winter wheat remote sensing identification method integrating multiple key weather period spectral features
CN117173525A (en) * 2023-09-05 2023-12-05 北京交通大学 Universal multi-mode image fusion method and device
CN117709580A (en) * 2023-11-29 2024-03-15 广西科学院 Ocean disaster-bearing body vulnerability evaluation method based on SETR and geographic grid
CN117789042B (en) * 2024-02-28 2024-05-14 中国地质大学(武汉) Road information interpretation method, system and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN113111835A (en) * 2021-04-23 2021-07-13 长沙理工大学 Semantic segmentation method and device for satellite remote sensing image, electronic equipment and storage medium
CN113191285A (en) * 2021-05-08 2021-07-30 山东大学 River and lake remote sensing image segmentation method and system based on convolutional neural network and Transformer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN113111835A (en) * 2021-04-23 2021-07-13 长沙理工大学 Semantic segmentation method and device for satellite remote sensing image, electronic equipment and storage medium
CN113191285A (en) * 2021-05-08 2021-07-30 山东大学 River and lake remote sensing image segmentation method and system based on convolutional neural network and Transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers; Sixiao Zheng et al.; arXiv:2012.15840v1 [cs.CV]; 2020-12-31; pp. 1-12 *

Also Published As

Publication number Publication date
CN113688813A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113688813B (en) Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage
CN111160335B (en) Image watermark processing method and device based on artificial intelligence and electronic equipment
Dewi et al. Weight analysis for various prohibitory sign detection and recognition using deep learning
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
Xu et al. High-resolution remote sensing image change detection combined with pixel-level and object-level
US20190205700A1 (en) Multiscale analysis of areas of interest in an image
CN114022788B (en) Remote sensing image change detection method and device, computer equipment and storage medium
Chen et al. A landslide extraction method of channel attention mechanism U-Net network based on Sentinel-2A remote sensing images
CN116258976A (en) Hierarchical transducer high-resolution remote sensing image semantic segmentation method and system
Zhao et al. Multiscale object detection in high-resolution remote sensing images via rotation invariant deep features driven by channel attention
US20200034664A1 (en) Network Architecture for Generating a Labeled Overhead Image
CN116524369B (en) Remote sensing image segmentation model construction method and device and remote sensing image interpretation method
CN113034495A (en) Spine image segmentation method, medium and electronic device
CN117597703A (en) Multi-scale converter for image analysis
CN116740422A (en) Remote sensing image classification method and device based on multi-mode attention fusion technology
Huang et al. Attention-guided label refinement network for semantic segmentation of very high resolution aerial orthoimages
CN111179272B (en) Rapid semantic segmentation method for road scene
CN116310916A (en) Semantic segmentation method and system for high-resolution remote sensing city image
CN115147606A (en) Medical image segmentation method and device, computer equipment and storage medium
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN113111885B (en) Dynamic resolution instance segmentation method and computer readable storage medium
Hou et al. BFFNet: a bidirectional feature fusion network for semantic segmentation of remote sensing objects
CN115713624A (en) Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image
Qi et al. JAED-Net: joint attention encoder–decoder network for road extraction from remote sensing images
Kulishova et al. Impact of the textbooks’ graphic design on the augmented reality applications tracking ability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant