CN117611817A - Remote sensing image semantic segmentation method and system based on stacked depth residual error network - Google Patents

Remote sensing image semantic segmentation method and system based on stacked depth residual error network

Info

Publication number
CN117611817A
CN117611817A
Authority
CN
China
Prior art keywords
network
remote sensing
sensing image
semantic segmentation
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311609856.4A
Other languages
Chinese (zh)
Inventor
陈一平
谢相依
李军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202311609856.4A priority Critical patent/CN117611817A/en
Publication of CN117611817A publication Critical patent/CN117611817A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/34: Smoothing or thinning of the pattern; morphological operations; skeletonisation
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/13: Satellite images
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/09: Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Remote Sensing (AREA)
  • Databases & Information Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of computer vision and remote sensing image processing, and in particular to a remote sensing image semantic segmentation method and system based on a stacked deep residual network. The method comprises the following steps: constructing a stacked deep residual network and extracting depth features of the remote sensing image; scaling the depth of the stacked deep residual network using residual learning; aggregating multi-scale context features using dilated residual blocks; performing supervised learning on the stacked deep residual network using intermediate losses; and performing semantic segmentation on remote sensing image data using the supervised-trained stacked deep residual network. The invention adopts a computationally efficient stacked deep residual network to improve the network model for the image land-cover classification problem, and extracts semantic features from different layers of the network backbone to improve semantic segmentation performance.

Description

Remote sensing image semantic segmentation method and system based on a stacked deep residual network
Technical Field
The invention relates to the fields of computer vision and remote sensing image processing, and in particular to a remote sensing image semantic segmentation method and system based on a stacked deep residual network.
Background
Land-cover maps of the Earth's surface generated by semantic segmentation of high-resolution remote sensing images provide key decision and technical support for urban planning, resource management, and the formulation of social development policies. However, high inter-class similarity and high intra-class variation are prevalent among land-cover types, which complicates the classification task. In addition, in high-resolution remote sensing images large objects often occlude smaller ones, and such occlusion poses a great challenge to accurate semantic segmentation, making objects of certain categories difficult to distinguish.
Traditional methods based on prior knowledge have difficulty distinguishing the complex features in high-resolution remote sensing images. Most shallow classifiers do not fully exploit the rich context information in high-resolution imagery and are inefficient, time-consuming, and dependent on prior knowledge when performing classification tasks. Some shallow classifiers achieve good results in certain situations but may fail to generalize to others.
Convolutional neural networks (CNNs) have complex, deep network structures and can accurately segment remote sensing image data that is information-rich and drawn from multiple heterogeneous sources. CNNs use alternately connected convolution layers, sub-sampling layers, and activation functions to learn complex image features hierarchically and automatically; they have strong learning capacity for identifying complex relations in high-resolution spatial data and can efficiently recognize and analyze fine-grained images. Furthermore, CNNs can leverage existing computing resources, using GPUs and distributed computing to accelerate computation in parallel or distributed fashion.
Deep convolutional neural networks (DCNNs) are deeper variants of CNNs, with more layers, intended to learn high-level semantic representations from large amounts of data. The superior performance of DCNNs in image-related tasks (such as object recognition, object detection, and semantic segmentation) and their ability to process complex remote sensing big data have made them far more popular than shallow models in remote sensing image analysis.
In most CNN architectures, the sub-sampling layers gradually reduce the spatial detail of the image through pooling operations, thereby capturing a larger field of view and reducing the number of learnable parameters. However, repeated downsampling in CNNs can greatly erode image feature detail, resulting in coarse feature maps. Deeper networks are effective at modeling the complex nonlinear relations in the data and retain more of the spatial detail information necessary for accurate semantic segmentation, so they are widely used to extract detailed structural features from high-resolution images. Furthermore, residual learning is typically used to solve the gradient vanishing problem in DCNNs. Researchers have improved the accuracy of semantic segmentation using fully convolutional networks, SegNet, and the like, but how to acquire sufficient context information and fully exploit spatial detail remains a difficulty in semantic segmentation of high-resolution images, so a robust semantic segmentation network model for remote sensing data is needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a remote sensing image semantic segmentation method and system based on a stacked deep residual network (SDRNet), a computationally efficient framework that improves the network model to address the image land-cover classification problem and extracts semantic features from different layers of the network backbone to improve semantic segmentation performance.
The segmentation method specifically adopts the following technical scheme. A remote sensing image semantic segmentation method based on a stacked deep residual network comprises the following steps:
s1, constructing a stacked depth residual error network, and extracting depth features of a remote sensing image;
s2, zooming the depth of the stacked depth residual error network by using residual error learning;
s3, utilizing the expansion residual block to aggregate multi-scale context characteristics;
s4, performing supervised learning on the stacked depth residual error network by using the intermediate loss;
s5, semantic segmentation is carried out on the remote sensing image data by using the stacking depth residual error network after supervised learning.
Preferably, the stacked deep residual network of step S1 comprises a backbone network and two hierarchically connected sub-networks (first and second), each sub-network comprising an encoder, an extraction unit, and a decoder, wherein the extraction unit comprises a dilated residual module and an attention module; the output of the decoder of the first sub-network is transmitted directly to the encoder of the second sub-network.
Further preferably, the encoder of the first sub-network is constructed from five structural blocks, and the decoder of the first sub-network enhances the high-level image semantics to perform the up-sampling task, up-sampling the feature maps back to the initial image size;
feature encoding is performed on the input image by the encoders of the first and second sub-networks;
the encoder and decoder of the second sub-network are constructed from four structural blocks, and the decoder of the second sub-network up-samples the feature maps back to the initial image size;
skip connections are constructed from the encoder of the first sub-network to the decoder of the second sub-network, concatenating feature maps from the encoder onto the output feature maps; after each skip connection, two additional convolution operations are performed;
a self-attention mechanism is constructed using the inputs of layers 3, 4, and 5 of the first sub-network's encoder and the inputs of layers 2, 3, and 4 of the second sub-network's encoder to form an attention module that extracts multi-level features.
Preferably, step S3 comprises:
integrating the multi-scale context information using the dilated residual module;
obtaining a global receptive field using progressive dilation rates in the different layers of the dilated residual module.
The invention also discloses a remote sensing image semantic segmentation system based on a stacked deep residual network, which specifically adopts the following technical scheme and comprises the following modules:
a feature extraction module, which constructs a stacked deep residual network and extracts depth features of the remote sensing image;
a depth scaling module, which scales the depth of the stacked deep residual network using residual learning;
a feature aggregation module, which aggregates multi-scale context features using dilated residual blocks;
a supervised learning module, which performs supervised learning on the stacked deep residual network using intermediate losses;
and a semantic segmentation module, which performs semantic segmentation on remote sensing image data using the supervised-trained stacked deep residual network.
After the technical scheme is adopted, compared with the prior art, the invention has the following advantages:
1. The invention provides a new stacked deep residual network to address the challenging task of semantic segmentation of high-resolution images. The architecture employs stacked encoder-decoder sub-networks to advance multi-level feature learning and an attention mechanism to refine the basic learnable features.
2. The dilated residual blocks in the stacked deep residual network enlarge the receptive field and further enrich context features through multi-scale reasoning, improving the accuracy of semantic segmentation.
3. Extensive experiments show that the lightweight residual network framework provided by the invention performs well in semantic segmentation tasks and outperforms the prior art.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an SDRNet framework according to the present invention;
fig. 3 is a block diagram of a DRB according to the present invention;
fig. 4 is a block diagram of a core of the DRB proposed by the present invention;
FIG. 5 is a graph showing the segmentation results on a Vaihingen dataset according to the present invention; wherein (a) is an input image, (b) is ground truth, and (c) is a prediction result;
FIG. 6 is a graph of the segmentation results on a Potsdam dataset of the present invention, where (a) is the input image, (b) is the ground truth, and (c) is the predicted result.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
The implementation flow of the remote sensing image semantic segmentation method based on the stacked depth residual error network in this embodiment can be seen in fig. 1, and specifically includes the following steps:
s1, constructing a stacked depth residual error network, and extracting depth features of the remote sensing image.
The remote sensing images are high-resolution remote sensing data; the ISPRS Vaihingen and Potsdam datasets are used to train, validate, and test the model. Both datasets are labeled with six common land-cover categories: impervious surface (white), building (blue), low vegetation (cyan), tree (green), car (yellow), and clutter/background (red). The Potsdam dataset contains 38 images of 6000 x 6000 pixels with a spatial resolution of 5 cm; the RGB images of 14 of them were selected as test images in the experiments. The Vaihingen dataset contains 33 images of 2494 x 2064 pixels with a spatial resolution of 9 cm; the experiments use its three bands of near-infrared, red, and green.
The specific steps of depth feature extraction are as follows:
s11, a stacked depth residual network adopted by the embodiment is also called a stacked encoder-decoder network, and comprises a main network and two sub-networks connected in a layered manner, wherein the main network uses a ResNet50 model. Each sub-network comprises an encoder, an extraction unit comprising an expansion residual module (dilated residual blocks, DRB) and an attention module, and a decoder.
In stacked depth residual networks, two encoders help generate robust features from an input image, and two decoders enable reconstruction of spatial detail. Furthermore, the output of the decoder 1 of the first sub-network is directly transmitted to the encoder 2 of the second sub-network, which reduces the feature loss. The SDRNet framework in this embodiment is shown in FIG. 2.
S12, a pre-trained ResNet-50 network is used in encoder 1 of the first sub-network, which enriches the basic feature-learning capacity of the network and reduces its demand for massive training labels.
Since the remote sensing scenes differ from the ordinary images in the original pre-training dataset, this embodiment redesigns encoder 2 of the second sub-network to activate the deeper layers and effectively learn higher-level features.
In the stacked deep residual network of this embodiment, symmetric encoders and decoders are stacked to form a spatial-reconstruction sub-network, and the abundant spatial details in the image are encoded to generate new feature maps. Encoder 1 of the first sub-network uses five structural blocks, gradually reducing the spatial dimensions of the image as the channel features increase; the symmetric decoder 1 of the first sub-network enhances the high-level image semantics to perform the up-sampling task, up-sampling the feature maps back to the original image size. A 2 x 2 bilinear up-sampling operation is performed in the decoder.
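The 2 x 2 bilinear up-sampling performed in the decoders can be illustrated with the following minimal NumPy sketch, which doubles the spatial size of a single-channel feature map (the function name and toy input are illustrative assumptions, not part of the disclosure):

```python
import numpy as np

def upsample_bilinear_2x(x):
    # Double the spatial size of an (H, W) map by bilinear interpolation
    # (align-corners style: the four corner values are preserved exactly).
    h, w = x.shape
    rows = np.linspace(0, h - 1, 2 * h)
    cols = np.linspace(0, w - 1, 2 * w)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, h - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[:, None]   # fractional row offsets
    fc = (cols - c0)[None, :]   # fractional column offsets
    top = x[np.ix_(r0, c0)] * (1 - fc) + x[np.ix_(r0, c1)] * fc
    bot = x[np.ix_(r1, c0)] * (1 - fc) + x[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
up = upsample_bilinear_2x(feat)
print(up.shape)  # (4, 4)
```

In the network each decoder block would apply such an operation channel by channel, so a stack of them restores the feature maps to the original image size.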
S13, feature encoding is performed on the input image by encoder 1 of the first sub-network and encoder 2 of the second sub-network. Each convolution block in each encoder performs a convolution with a 3 x 3 kernel followed by batch normalization, which reduces internal covariate shift, accelerates network training, and enhances the stability of the model.
S14, a ReLU activation function is applied to enhance the nonlinearity of the model, and a max-pooling operation is performed to reduce the size of the feature maps, reducing the number of parameters and the computational cost of the model while retaining the most important feature information.
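The per-block pipeline of steps S13-S14 (3 x 3 convolution, batch normalization, ReLU, 2 x 2 max pooling) can be sketched in NumPy as follows. This is a toy single-channel version; the function names and the inference-style normalization without learned scale/shift are assumptions for illustration:

```python
import numpy as np

def relu(x):
    # ReLU activation of step S14
    return np.maximum(x, 0.0)

def max_pool2d(x, k=2):
    # Non-overlapping k x k max pooling over an (H, W) feature map
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def batch_norm(x, eps=1e-5):
    # Per-map normalization (inference-style, no learned scale/shift)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def conv3x3(x, w):
    # 'Same'-padded 3 x 3 convolution over a single-channel map
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def encoder_block(x, w):
    # conv -> batch norm -> ReLU -> 2 x 2 max pool, halving spatial size
    return max_pool2d(relu(batch_norm(conv3x3(x, w))))

x = np.arange(16.0).reshape(4, 4)
w = np.zeros((3, 3)); w[1, 1] = 1.0   # identity kernel for illustration
y = encoder_block(x, w)
print(y.shape)  # (2, 2)
```

Each encoder block thus halves the spatial dimensions, which is why five (respectively four) stacked blocks progressively shrink the feature maps as described above.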
S15, encoder 2 and decoder 2 of the second sub-network are constructed from four structural blocks to reduce the number of learnable features and the network complexity. Since each block in the decoder performs a 2 x 2 bilinear up-sampling operation on its feature input, the spatial dimensions of the input feature map are doubled. Decoder 2 of the second sub-network up-samples the feature maps back to the original image size.
S16, skip connections are constructed from encoder 1 of the first sub-network to decoder 2 of the second sub-network to maintain high spatial resolution and improve the overall quality of the output feature maps. After each skip connection, two additional 3 x 3 convolution operations are performed, each followed by a batch normalization step and a ReLU activation; finally, the output is mapped through a softmax function.
Thus, both decoders up-sample the feature maps to the original image size before they are connected to the multi-class softmax function.
This embodiment concatenates the feature maps from the encoders with the output feature maps through skip connections. There are skip connections from encoder 1 of the first sub-network to decoder 1 of the first sub-network, and from encoder 1 of the first sub-network and encoder 2 of the second sub-network to decoder 2 of the second sub-network, in order to maintain higher spatial resolution and improve the overall quality of the output feature maps.
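A minimal sketch of the channel-wise concatenation performed at each skip connection and of the final multi-class softmax mapping (the shapes, the six-class example, and the function names are illustrative assumptions):

```python
import numpy as np

def skip_concat(decoder_feat, encoder_feat):
    # Concatenate the encoder feature map onto the decoder feature map
    # along the channel axis, as done at each skip connection.
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

def softmax(logits, axis=0):
    # Numerically stable multi-class softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

dec = np.zeros((64, 8, 8))
enc = np.ones((64, 8, 8))
fused = skip_concat(dec, enc)     # channel count doubles to 128

logits = np.zeros((6, 8, 8))      # six land-cover classes per pixel
probs = softmax(logits, axis=0)   # uniform 1/6 everywhere for zero logits
print(fused.shape, probs[0, 0, 0])
```

In the full network the fused maps would pass through the two additional 3 x 3 convolutions (with batch normalization and ReLU) before the softmax mapping.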
S17, a self-attention mechanism is constructed using the inputs of layers 3, 4, and 5 of encoder 1 of the first sub-network and the inputs of layers 2, 3, and 4 of encoder 2 of the second sub-network, forming an attention module that extracts multi-level features.
Since high-resolution remote sensing images contain redundant and unwanted features that are unrelated to the output classes, the attention mechanism suppresses irrelevant areas of the input image to highlight the features of specific classes. Inspired by the human visual system, the self-attention mechanism lets the network focus on specific important areas, significantly reducing the cost of learning features from unrelated regions and redundant data. Furthermore, since it is practically impossible to activate all the learnable parameters in a deep convolutional neural network, the network is constrained to use only the features relevant to a particular class.
This embodiment avoids using layer 1 (i.e., the initial layer) of the encoders, because the initial layer learns simple, basic features that do not provide enough information to complete the complex segmentation task.
The present embodiment defines the self-attention mechanism as:
M_s(F) = sigma(F^{a x a}([AvgPool(F); MaxPool(F)]))
where sigma denotes the sigmoid function, F^{a x a} denotes a convolution operation with a filter of size a x a, AvgPool(F) denotes an average pooling operation, and MaxPool(F) denotes a max pooling operation.
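The formula above has the shape of a CBAM-style spatial attention map: the feature tensor is pooled channel-wise by average and max, the two pooled maps are stacked and convolved, and a sigmoid produces a mask in (0, 1). A toy NumPy rendering (the 3 x 3 filter size, shapes, and function names are assumptions, not the patent's exact configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(f, w):
    # f: (C, H, W) feature map; w: (2, a, a) filter over the two pooled maps.
    avg = f.mean(axis=0)           # AvgPool over channels -> (H, W)
    mx = f.max(axis=0)             # MaxPool over channels -> (H, W)
    stacked = np.stack([avg, mx])  # [AvgPool(F); MaxPool(F)] -> (2, H, W)
    a = w.shape[1]; p = a // 2
    xp = np.pad(stacked, ((0, 0), (p, p), (p, p)))
    h, wd = avg.shape
    out = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[:, i:i + a, j:j + a] * w)
    return sigmoid(out)            # attention mask M_s(F), values in (0, 1)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 5, 5))
mask = spatial_attention(f, rng.standard_normal((2, 3, 3)) * 0.1)
print(mask.shape)  # (5, 5)
```

Multiplying a feature map element-wise by such a mask suppresses irrelevant regions, which is the role the attention module plays at the selected encoder layers.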
S2, the depth of the stacked deep residual network is scaled using residual learning.
The residual learning strategy is used to solve the gradient vanishing problem: stacked convolution blocks are augmented with identity mappings so that a deep network unaffected by vanishing gradients can be built. Through the shortcut connections, gradients can propagate without passing through a nonlinear activation function, thereby mitigating gradient explosion or vanishing. In addition, the shortcut connections improve gradient flow during back-propagation, i.e., they improve the backward gradient flow and accelerate the convergence of deep networks.
Accordingly, by decomposing the task into an encoding stage and a decoding stage, and gradually adding more detailed information in the decoding stage, the original image can be reconstructed better.
The residual function is defined as:
y = F(x, W_i) + x
where x denotes the input, y denotes the output feature map, F(x, W_i) denotes the residual mapping, and W_i denotes the weight parameters.
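The identity shortcut can be sketched with a toy two-layer residual branch. With zero-initialized weights the block reduces exactly to the identity mapping, which is what keeps gradients flowing through very deep stacks (the names and shapes are illustrative, not from the patent):

```python
import numpy as np

def residual_block(x, w1, w2):
    # y = F(x, W_i) + x: a two-layer residual branch with ReLU,
    # plus the identity shortcut.
    fx = np.maximum(x @ w1, 0.0) @ w2
    return fx + x

x = np.array([1.0, -2.0, 3.0, 0.5])
# Zero weights make F(x) = 0, so the block is exactly the identity;
# training then only needs to learn the residual correction on top of it.
y = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
print(y)  # same as x
```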
S3, multi-scale context features are aggregated using the dilated residual block (DRB). The specific steps are as follows:
S31, multi-scale context information is integrated using the dilated residual module.
To obtain more contextual feature information and achieve multi-scale context aggregation without sacrificing image resolution, dilated convolution uses larger, sparse kernels in place of the dense kernels of the pooling and convolution layers.
In this embodiment, the dilated residual module is built on the two-dimensional dilated convolution operation, defined as:
y(m, n) = sum_i sum_j x(m + r*i, n + r*j) w(i, j)
where y(m, n) is the output of the dilated convolution at position (m, n), x(m, n) is its input, w(i, j) is a filter of size i x j, and the parameter r denotes the dilation rate.
When the dilation rate equals 1, dilated convolution reduces to ordinary convolution; when the dilation rate is less than 1, it corresponds to a finer, fractional sampling of the feature map, which requires more training time; when the dilation rate is greater than 1, the receptive field grows without increasing the number of parameters or the computational cost. The receptive field can thus be adjusted through different dilation rates.
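The dilated convolution and the growth of its receptive field with the dilation rate r can be checked with a small NumPy sketch (valid-only output; the function names and toy inputs are illustrative):

```python
import numpy as np

def dilated_conv2d(x, w, r):
    # y(m, n) = sum_{i,j} x(m + r*i, n + r*j) * w(i, j), valid region only.
    kh, kw = w.shape
    h, wd = x.shape
    oh = h - r * (kh - 1)
    ow = wd - r * (kw - 1)
    out = np.zeros((oh, ow))
    for m in range(oh):
        for n in range(ow):
            # Sample the input with stride r inside the kernel window
            out[m, n] = np.sum(x[m:m + r * kh:r, n:n + r * kw:r] * w)
    return out

def receptive_field(k, r):
    # Effective extent of a k x k kernel with dilation rate r
    return r * (k - 1) + 1

x = np.ones((9, 9))
w = np.ones((3, 3))
print(dilated_conv2d(x, w, 1).shape)   # (7, 7)
print(dilated_conv2d(x, w, 2).shape)   # (5, 5)
print(receptive_field(3, 1), receptive_field(3, 2), receptive_field(3, 4))
```

Note the parameter count of the kernel is 3 x 3 = 9 regardless of r, while the effective extent grows from 3 to 5 to 9, which is exactly the trade-off the text describes.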
S32, a global receptive field is obtained using progressive dilation rates in the different layers of the dilated residual module.
Based on the concept of dilated convolution, dilated residual blocks are used as the dilated residual module, which applies progressive dilation rates in different layers to achieve a full receptive field without loss of resolution or coverage.
This embodiment adopts dilation kernels with progressive dilation rates together with skip connections to enhance the information flow between connected layers; by ensuring that the resulting dilated kernels receive responses from all regions of the image without increasing the number of kernel parameters, the gridding effect and the loss of information continuity are reduced.
The gridding problem is the loss of information continuity that arises, in a checkerboard-like pattern, because not all pixels receive a kernel response. Since a large dilation rate can cause the gridding effect and suppress the performance of the dilated kernels, the DRB (dilated residual block) uses dilation kernels with progressive dilation rates to complete the pixel-level dense prediction task. Successive layers are connected through skip connections, which strengthens the information flow between connected layers and transfers information efficiently. By ensuring that the generated dilated kernels obtain responses from all regions of the image without increasing the number of kernel parameters, a significant improvement in image-processing effect is achieved.
The present embodiment defines the constraint on the convolution kernels of the dilated residual block (DRB) as:
M_i = max[M_{i+1} - 2r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i]
where r_i is the dilation rate of the i-th layer, M_i is the maximum spacing between two nonzero values in the i-th layer's effective kernel, and the total number of layers is n (with M_n = r_n). In this embodiment, the structure of the dilated residual block is shown in fig. 3, and its core is shown in fig. 4.
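The constraint can be evaluated for a candidate sequence of dilation rates with a short pure-Python sketch. The rate sequences below are illustrative examples, not the rates prescribed by the patent; the design goal is that M_2 stay within the kernel size k so every pixel receives a kernel response and gridding is avoided:

```python
def max_spacing(rates):
    # M_i = max[M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i],
    # computed backwards from M_n = r_n. Returns [M_1, ..., M_n].
    m = rates[-1]
    ms = [m]
    for r in reversed(rates[:-1]):
        m = max(m - 2 * r, 2 * r - m, r)
        ms.append(m)
    return list(reversed(ms))

# For a 3 x 3 kernel (k = 3): progressive rates such as [1, 2, 5]
# keep M_2 <= 3, while a geometric sequence like [2, 4, 8] gives
# M_2 = 4 > 3 and therefore suffers from the gridding effect.
print(max_spacing([1, 2, 5]))
print(max_spacing([2, 4, 8]))
```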
And S4, performing supervised learning on the stacked depth residual error network by using the intermediate loss.
In this embodiment, an intermediate loss is introduced at the end of the sub-network to overcome the gradient attenuation problem and improve effective learning from the deep layers back to the shallow layers of a deep network. By computing the performance gap between the model and the ground truth, the optimization of the intermediate layers is improved, which improves gradient flow and strengthens learning during back-propagation. The specific steps are as follows:
s41, respectively placing loss functions at the two sub-network ends, and defining the loss functions as:
where N is the number of categories, W j Representing the weight of category j, p j And p k Representing the predicted value and the true value of category j, respectively.
S42, the total loss function is defined as:
L_T = (alpha x MainLoss) + (beta x InterL1)
where alpha and beta are the respective weights in the network, InterL1 is the loss value of the output layer, and MainLoss is the loss value at the end of the first sub-network (the weights are scaled 1:1).
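The per-branch loss of S41 and the combination of S42 can be sketched numerically. The exact loss form is not spelled out in the text, so the weighted cross-entropy below is an assumption for illustration, as are the names and the toy two-pixel example:

```python
import numpy as np

def weighted_cross_entropy(probs, target, class_weights):
    # Weighted cross-entropy over P pixels: averages -W_j * log(p_j)
    # at each pixel's true category j.
    # probs: (P, N) softmax outputs; target: (P,) true class indices.
    p_true = probs[np.arange(len(target)), target]
    w = class_weights[target]
    return float(np.mean(-w * np.log(p_true + 1e-12)))

def total_loss(main_loss, inter_loss, alpha=1.0, beta=1.0):
    # L_T = (alpha x MainLoss) + (beta x InterL1), with 1:1 weights here.
    return alpha * main_loss + beta * inter_loss

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
target = np.array([0, 1])
weights = np.ones(3)
main = weighted_cross_entropy(probs, target, weights)
print(total_loss(main, main))  # twice the single-branch loss at 1:1 weights
```

During training both branch losses would be back-propagated together, which is what strengthens the gradient flow to the intermediate layers.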
S5, semantic segmentation is carried out on the remote sensing image data by using the stacking depth residual error network after supervised learning.
In this embodiment, the segmentation results on the Vaihingen dataset and the Potsdam dataset are shown in fig. 5 and fig. 6.
Example 2
This embodiment, based on the same inventive concept as embodiment 1, provides a remote sensing image semantic segmentation system based on a stacked deep residual network, comprising the following modules:
a feature extraction module, which constructs a stacked deep residual network and extracts depth features of the remote sensing image;
a depth scaling module, which scales the depth of the stacked deep residual network using residual learning;
a feature aggregation module, which aggregates multi-scale context features using dilated residual blocks;
a supervised learning module, which performs supervised learning on the stacked deep residual network using intermediate losses;
and a semantic segmentation module, which performs semantic segmentation on remote sensing image data using the supervised-trained stacked deep residual network. The modules of this embodiment implement the respective steps of embodiment 1; see embodiment 1 for the detailed implementation procedures.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A remote sensing image semantic segmentation method based on a stacked deep residual network, characterized by comprising the following steps:
S1, constructing a stacked deep residual network and extracting depth features of the remote sensing image;
S2, scaling the depth of the stacked deep residual network using residual learning;
S3, aggregating multi-scale context features using dilated residual blocks;
S4, performing supervised learning on the stacked deep residual network using intermediate losses;
S5, performing semantic segmentation on remote sensing image data using the supervised-trained stacked deep residual network.
2. The remote sensing image semantic segmentation method according to claim 1, wherein the stacked depth residual network of step S1 comprises a backbone network and two hierarchically connected sub-networks, a first sub-network and a second sub-network, each sub-network comprising an encoder, an extraction unit and a decoder, wherein the extraction unit comprises a dilated residual module and an attention module; the output of the decoder of the first sub-network is passed directly to the encoder of the second sub-network.
3. The remote sensing image semantic segmentation method according to claim 2, wherein the encoder of the first sub-network is constructed from five structural blocks, and the decoder of the first sub-network refines high-level image semantics to perform the up-sampling task, restoring the feature maps to the initial image size;
the encoder of the first sub-network and the encoder of the second sub-network perform feature encoding on the input image;
the encoder and decoder of the second sub-network are constructed from four structural blocks, and the decoder of the second sub-network up-samples the feature maps to the initial image size;
skip connections are constructed from the encoder of the first sub-network to the decoder of the second sub-network, connecting feature maps from the encoder to the output feature maps; after each skip connection, two additional convolution operations are performed;
a self-attention mechanism is constructed using the inputs of layers 3, 4 and 5 of the encoder of the first sub-network and the inputs of layers 2, 3 and 4 of the encoder of the second sub-network to form the attention module for extracting multi-level features.
4. The remote sensing image semantic segmentation method according to claim 3, characterized in that the self-attention mechanism is defined as:
M_s(F) = σ(F^{a×a}([AvrPool(F); MaxPool(F)]))
where σ denotes the sigmoid function, F^{a×a} denotes a convolution operation with a filter of size a×a, AvrPool(F) denotes the average pooling operation, and MaxPool(F) denotes the maximum pooling operation.
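As an illustration of this formula, the spatial-attention map can be sketched in NumPy. This is a hypothetical stand-alone function, not the patent's implementation: channel-wise average and max pooling are stacked, convolved with an a×a filter, and passed through a sigmoid.

```python
import numpy as np

def spatial_attention(F, kernel):
    """Sketch of M_s(F) = sigmoid(conv_{a*a}([AvrPool(F); MaxPool(F)])).

    F: feature map of shape (C, H, W); kernel: conv weights of shape (2, a, a),
    with odd a so zero padding preserves the spatial size.
    """
    avg = F.mean(axis=0)             # channel-wise average pooling -> (H, W)
    mx = F.max(axis=0)               # channel-wise max pooling     -> (H, W)
    stacked = np.stack([avg, mx])    # concatenate along channels   -> (2, H, W)

    a = kernel.shape[-1]
    p = a // 2                       # same-padding for an odd a*a filter
    padded = np.pad(stacked, ((0, 0), (p, p), (p, p)))

    H, W = avg.shape
    out = np.zeros((H, W))
    for i in range(H):               # naive sliding-window convolution
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + a, j:j + a] * kernel)

    return 1.0 / (1.0 + np.exp(-out))  # sigmoid -> attention map in (0, 1)
```

With a zero feature map and zero kernel the convolution output is 0 everywhere, so the attention map is uniformly sigmoid(0) = 0.5.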
5. The method of claim 1, wherein in step S2 the stacked convolution blocks are replaced by identity mappings to construct a deep network unaffected by the vanishing-gradient problem.
6. The method of claim 1, wherein in step S2 the residual function is defined as:
y = F(x, W_i) + x
where x denotes the input, y denotes the output feature map, F(x, W_i) denotes the residual mapping function, and W_i denotes the weight parameters.
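A minimal sketch of this residual function, with a hypothetical two-layer mapping standing in for the residual branch F(x, W_i) (the patent does not fix the form of F):

```python
import numpy as np

def residual_block(x, weights, activation=np.tanh):
    """y = F(x, W_i) + x : residual branch plus identity shortcut.

    weights = (W1, W2) parameterize a hypothetical two-layer residual
    mapping F(x) = W2 @ activation(W1 @ x); the shortcut adds x unchanged,
    which is what lets gradients flow through very deep stacks.
    """
    W1, W2 = weights
    f = W2 @ activation(W1 @ x)  # residual mapping F(x, W_i)
    return f + x                 # identity shortcut
```

When the residual branch is zero (all-zero weights), the block reduces exactly to the identity mapping mentioned in claim 5.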
7. The method of claim 1, wherein step S3 comprises:
integrating multi-scale context information using the dilated residual module;
obtaining a global receptive field using progressive dilation rates in different layers of the dilated residual module.
8. The remote sensing image semantic segmentation method according to claim 7, wherein the dilated residual module is based on a two-dimensional dilated convolution operation, defined as:
y(m, n) = Σ_i Σ_j x(m + r·i, n + r·j) · w(i, j)
where y(m, n) is the output of the dilated convolution at position (m, n), x(m, n) is the corresponding input, w(i, j) is the filter of the dilated convolution indexed by i and j, and the parameter r denotes the dilation rate.
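The two-dimensional dilated convolution can be sketched directly from this definition. The naive NumPy loop below covers only the valid output region (no padding); real implementations would use optimized library kernels, and the function name is illustrative:

```python
import numpy as np

def dilated_conv2d(x, w, r):
    """y(m, n) = sum_i sum_j x(m + r*i, n + r*j) * w(i, j).

    x: 2-D input of shape (H, W); w: filter of shape (M, N);
    r: dilation rate. With r = 1 this is an ordinary convolution
    (cross-correlation form); larger r samples the input on a sparser
    grid, enlarging the receptive field without extra parameters.
    """
    M, N = w.shape
    H, W = x.shape
    out_h = H - r * (M - 1)   # valid output height
    out_w = W - r * (N - 1)   # valid output width
    y = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            for i in range(M):
                for j in range(N):
                    y[m, n] += x[m + r * i, n + r * j] * w[i, j]
    return y
```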
9. The remote sensing image semantic segmentation method according to claim 7, wherein a dilated residual block is adopted as the dilated residual module, using progressive dilation rates in different layers to realize a complete receptive field;
the convolution kernel of the dilated residual block DRB is defined as:
M_i = max[M_{i+1} - 2r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i]
where r_i is the dilation rate of the i-th layer, M_i is the maximum distance between two nonzero values at the i-th layer, and the total number of layers is n.
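This recurrence is the same one used in hybrid dilated convolution design. Assuming the usual convention M_n = r_n (not stated explicitly here), it can be evaluated backwards to check whether a progressive schedule of dilation rates leaves no holes in the receptive field; the function names below are illustrative:

```python
def max_nonzero_distance(rates):
    """Evaluate M_i = max[M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i]
    backwards from M_n = r_n (assumed convention) for a list of per-layer
    dilation rates r_1..r_n. Returns [M_1, ..., M_n]."""
    n = len(rates)
    M = [0] * n
    M[-1] = rates[-1]
    for i in range(n - 2, -1, -1):
        r = rates[i]
        M[i] = max(M[i + 1] - 2 * r, M[i + 1] - 2 * (M[i + 1] - r), r)
    return M

def covers_receptive_field(rates, kernel_size):
    """A schedule avoids gridding (gaps in the receptive field) when the
    distance at the second layer satisfies M_2 <= kernel size - a common
    design criterion for progressive dilation rates."""
    return max_nonzero_distance(rates)[1] <= kernel_size
```

For a 3×3 kernel, the progressive schedule (1, 2, 5) satisfies the criterion, while (1, 2, 9) does not.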
10. A remote sensing image semantic segmentation system based on a stacked depth residual network, characterized by comprising the following modules:
a feature extraction module for constructing a stacked depth residual network and extracting depth features of a remote sensing image;
a depth scaling module for scaling the depth of the stacked depth residual network using residual learning;
a feature aggregation module for aggregating multi-scale context features using dilated residual blocks;
a supervised learning module for performing supervised learning on the stacked depth residual network using intermediate losses;
and a semantic segmentation module for performing semantic segmentation on remote sensing image data using the stacked depth residual network after supervised learning.
CN202311609856.4A 2023-11-29 2023-11-29 Remote sensing image semantic segmentation method and system based on stacked depth residual error network Pending CN117611817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311609856.4A CN117611817A (en) 2023-11-29 2023-11-29 Remote sensing image semantic segmentation method and system based on stacked depth residual error network

Publications (1)

Publication Number Publication Date
CN117611817A true CN117611817A (en) 2024-02-27

Family

ID=89945911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311609856.4A Pending CN117611817A (en) 2023-11-29 2023-11-29 Remote sensing image semantic segmentation method and system based on stacked depth residual error network

Country Status (1)

Country Link
CN (1) CN117611817A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200167930A1 (en) * 2017-06-16 2020-05-28 Ucl Business Ltd A System and Computer-Implemented Method for Segmenting an Image
CN111797703A (en) * 2020-06-11 2020-10-20 武汉大学 Multi-source remote sensing image classification method based on robust deep semantic segmentation network
WO2021041772A1 (en) * 2019-08-30 2021-03-04 The Research Foundation For The State University Of New York Dilated convolutional neural network system and method for positron emission tomography (pet) image denoising
CN115423828A (en) * 2022-06-04 2022-12-02 哈尔滨理工大学 Retina blood vessel image segmentation method based on MRNet
WO2023077816A1 (en) * 2021-11-03 2023-05-11 中国华能集团清洁能源技术研究院有限公司 Boundary-optimized remote sensing image semantic segmentation method and apparatus, and device and medium
CN116630824A (en) * 2023-06-06 2023-08-22 北京星视域科技有限公司 Satellite remote sensing image boundary perception semantic segmentation model oriented to power inspection mechanism


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MD. RAYHAN AHMED et al.: "DoubleU-NetPlus: A Novel Attention and Context Guided Dual U-Net with Multi-Scale Residual Feature Fusion Network for Semantic Segmentation of Medical Images", Neural Computing and Applications, vol. 35, 31 December 2022, pages 14379-14401 *
TONG, Chang: "Research on Image Shadow Removal Based on Generative Adversarial Networks", China Masters' Theses Full-text Database, Information Science and Technology, vol. 2022, no. 3, 15 March 2022, pages 28-29 *
HOU, Xingxing: "Research on Plateau Lake Extraction from Medium-Resolution Remote Sensing Images Based on DoubleU-Net", China Masters' Theses Full-text Database, Basic Sciences, vol. 2022, no. 11, 15 November 2022, pages 51-57 *
LYU, Zongwang et al.: "Recognition and Detection of Wheat Grains with Different Bulk Densities Based on Improved U-Net", Journal of Henan Agricultural Sciences, 20 September 2023, page 6 *

Similar Documents

Publication Publication Date Title
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN111914907B (en) Hyperspectral image classification method based on deep learning space-spectrum combined network
CN108171701B (en) Significance detection method based on U network and counterstudy
CN110210539B (en) RGB-T image saliency target detection method based on multi-level depth feature fusion
CN113240580A (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN115797931A (en) Remote sensing image semantic segmentation method based on double-branch feature fusion
CN111242238B (en) RGB-D image saliency target acquisition method
CN112581409B (en) Image defogging method based on end-to-end multiple information distillation network
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN110738663A (en) Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method
CN114359293A (en) Three-dimensional MRI brain tumor segmentation method based on deep learning
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN115601236A (en) Remote sensing image super-resolution reconstruction method based on characteristic information distillation network
CN117058160A (en) Three-dimensional medical image segmentation method and system based on self-adaptive feature fusion network
CN111222453A (en) Remote sensing image change detection method based on dense connection and geometric structure constraint
CN117809200A (en) Multi-scale remote sensing image target detection method based on enhanced small target feature extraction
CN117314808A (en) Infrared and visible light image fusion method combining transducer and CNN (carbon fiber network) double encoders
CN117191268A (en) Oil and gas pipeline leakage signal detection method and system based on multi-mode data
CN116778165A (en) Remote sensing image disaster detection method based on multi-scale self-adaptive semantic segmentation
CN117058367A (en) Semantic segmentation method and device for high-resolution remote sensing image building
CN116168418A (en) Multi-mode target perception and re-identification method for image
CN115797181A (en) Image super-resolution reconstruction method for mine fuzzy environment
CN116071645A (en) High-resolution remote sensing image building change detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination