CN116524189A - High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization


Info

Publication number
CN116524189A
CN116524189A CN202310496605.3A CN202310496605A
Authority
CN
China
Prior art keywords
remote sensing
sensing image
decoding
coding
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310496605.3A
Other languages
Chinese (zh)
Inventor
于纯妍 (Yu Chunyan)
李东霖 (Li Donglin)
王玉磊 (Wang Yulei)
赵恩宇 (Zhao Enyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202310496605.3A priority Critical patent/CN116524189A/en
Publication of CN116524189A publication Critical patent/CN116524189A/en
Pending legal-status Critical Current

Classifications

    • G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/10 Terrestrial scenes
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization, which comprises the following steps: acquiring and amplifying a remote sensing image set, normalizing the remote sensing images, constructing a remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization, training the segmentation model on a training set, obtaining a prediction label for each pixel in the remote sensing image, calculating a loss from the truth labels and the prediction labels, and judging whether the loss value meets a threshold: if it does not, the parameters of the segmentation model are updated; if it does, the trained segmentation model is obtained. A processed remote sensing image is then input into the trained segmentation model, which outputs a semantic segmentation result graph of the remote sensing image. The method improves the extraction and processing of ground object edge features, raises the recognition accuracy of small-size objects and complex boundary information in remote sensing images, and achieves accurate semantic segmentation of remote sensing ground object edges.

Description

High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a high-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization.
Background
The high-spatial-resolution remote sensing image (high-resolution remote sensing image) is an important component of modern remote sensing imagery; it is characterized by high spatial resolution, high definition, high timeliness and large information content, and can clearly and intuitively present rich ground feature detail and the relationships between adjacent ground features. At present, semantic segmentation of images is a research hotspot in computer vision; the essence of the task is category identification of image regions, namely assigning a category label to each pixel in the image. Semantic segmentation of high-resolution remote sensing images, an important branch of this direction, can automatically extract surface features from remote sensing images and assign semantic categories to ground object targets. It has wide application in fields such as disaster assessment and prediction, environmental protection, urban planning, traffic navigation and military security.
In recent years, deep learning, particularly deep convolutional neural network technology, has developed rapidly and found wide application. It shows striking feature extraction capability in tasks such as image classification, object detection and semantic segmentation, can adaptively extract shallow and deep features from images, and has especially good understanding of complex scenes. Applying deep learning to semantic segmentation of high-resolution remote sensing images therefore has important practical significance and brings new opportunities to remote sensing image processing. However, high-resolution remote sensing images are typically composed of large, complex scenes and heterogeneous objects, and the occlusion and shading caused by illumination conditions and imaging angles during acquisition leave existing deep remote sensing segmentation models with poor results at object edges. In addition, for small objects edge pixels account for a higher proportion of the object's total pixels, so if segmentation at the edge is not ideal, segmentation of the whole object suffers.
Disclosure of Invention
The invention provides a high-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization, which aims to overcome the technical problems.
A semantic segmentation method of high-resolution remote sensing images based on coding and decoding indexing edge characterization comprises the steps of,
step one, acquiring a remote sensing image set, amplifying the remote sensing image set, namely rotating the remote sensing image at any angle and storing the remote sensing image into the remote sensing image set, respectively carrying out normalization processing on the remote sensing image, dividing the remote sensing image set into a training set and a testing set,
step two, constructing a remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation, training the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to a training set, obtaining a prediction label of each pixel in the remote sensing image, calculating loss according to the truth label and the prediction label of each pixel, judging whether the value of the loss meets a threshold value, optimizing parameters of the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to the difference between the value of the loss and the threshold value if the value of the loss does not meet the threshold value, obtaining the trained remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation if the value of the loss meets the threshold value,
the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation comprises a multi-scale feature encoder, a separable pyramid unit, a coding and decoding indexing edge representation unit and an up-sampling decoder,
the multi-scale feature encoder is used for generating four initial feature matrixes according to the size h of the remote sensing image, the sizes of the four initial feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the separable pyramid unit is used for obtaining four context feature matrixes according to four initial feature matrixes of the remote sensing image, the sizes of the four context feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the coding and decoding indexing edge characterization unit is used for acquiring a first coding index and a first decoding index according to a context feature matrix with the size of h/2, acquiring a second coding index and a second decoding index according to a context feature matrix with the size of h/4, fusing the first coding index with the context feature matrix with the size of h/2, fusing the second coding index with the context feature matrix with the size of h/4, acquiring a fused context feature matrix with the size of h/2 and a fused context feature matrix with the size of h/4,
the up-sampling decoder is used for decoding and up-sampling the four context feature matrixes according to the order from small to large in size to obtain a semantic segmentation result graph of the remote sensing image, the semantic segmentation result graph comprises a prediction label of each pixel in the remote sensing image,
and thirdly, acquiring the processed remote sensing image, inputting the processed remote sensing image into a trained remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization, and outputting a semantic segmentation result graph of the remote sensing image.
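Step three can be sketched in PyTorch (the framework implied by the description's torch.cat and pixel_shuffle calls); the model interface below is hypothetical, and taking the argmax of the raw class scores is equivalent to taking it after the softmax activation:

```python
import torch

def predict(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Run a trained segmentation model on a processed remote sensing image
    and return per-pixel prediction labels (hypothetical model interface)."""
    model.eval()
    with torch.no_grad():
        scores = model(image)    # (batch, num_classes, h, w) class scores
    # the argmax over the class dimension is unchanged by softmax, so it
    # yields the prediction label of each pixel directly
    return scores.argmax(dim=1)  # (batch, h, w)
```

The returned tensor holds one integer class label per pixel, i.e. the prediction labels that make up the semantic segmentation result graph.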
Preferably, the up-sampling decoder is configured to decode and up-sample the context feature matrix of size h/16 to obtain an output feature matrix x_d1 of size h/8, and to splice x_d1 with the context feature matrix of size h/8 along dimension 1 using the torch.cat function to form a new feature matrix x_m1;
x_m1 is decoded and up-sampled to obtain an output feature matrix x_d2 of size h/4, x_d2 is spliced with the context feature matrix of size h/4 along dimension 1 using the torch.cat function to form a new feature matrix x_m2, and a matrix product of the second decoding index and the feature matrix x_m2 yields an output feature matrix x_n2;
x_n2 is decoded and up-sampled to obtain an output feature matrix of size h/2, which is spliced with the context feature matrix of size h/2 along dimension 1 using the torch.cat function to form a new feature matrix x_m3, and a matrix product of the first decoding index and the feature matrix x_m3 yields an output feature matrix x_n3;
x_n3 is decoded and up-sampled to obtain a feature matrix x_d4 of size h, and x_d4 is passed through one convolution and then into a softmax activation function to obtain the semantic segmentation result graph.
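One decode-and-upsample step followed by the dimension-1 splice can be sketched as follows; the channel counts, the bilinear upsampling mode and the 3x3 convolution are assumptions, since the text only fixes the x2 spatial growth and the torch.cat concatenation:

```python
import torch
import torch.nn as nn

class DecodeUpsample(nn.Module):
    """One decode-and-upsample step: double the spatial size, then project
    channels with a convolution (layer shape choices are assumptions)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.up(x))

# From the h/16 context feature matrix to x_d1 (size h/8), then the
# dimension-1 splice with the h/8 context feature matrix to form x_m1:
f16 = torch.randn(1, 64, 16, 16)     # context feature matrix, size h/16
f8 = torch.randn(1, 32, 32, 32)      # context feature matrix, size h/8
x_d1 = DecodeUpsample(64, 32)(f16)   # output feature matrix, size h/8
x_m1 = torch.cat([x_d1, f8], dim=1)  # new feature matrix x_m1
```

The remaining steps repeat the same pattern at sizes h/4, h/2 and h.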
Preferably, said calculating the loss from the truth label and the prediction label of each pixel comprises calculating the loss according to equation (1),
Loss_focal = -(1 - p_t)^γ · log(p_t)   (1)
where p_t is the predicted probability of the true-class label, obtained from the truth label and the prediction label, γ is a hyperparameter, and Loss_focal denotes the focal loss function.
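Equation (1) translates directly into code; this scalar sketch uses γ = 2 as an illustrative default (the patent leaves the hyperparameter open), and a batched implementation would vectorize it over all pixels:

```python
import math

def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Eq. (1): Loss_focal = -(1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true-class label."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With γ = 0 this reduces to ordinary cross-entropy on the true class; as p_t approaches 1 the loss vanishes, which down-weights easy pixels and concentrates training on hard ones such as object edges.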
Preferably, the multi-scale context feature encoder comprises a spatial feature extraction branch for extracting local feature information of the remote sensing image, a self-attention feature extraction branch for extracting global feature information of the remote sensing image, and a fusion branch for fusing the local feature information and the global feature information according to equation (2),
x = concatenate(Conv2d(x_ci), Conv2d(x_si))
y = sigmoid(Conv2d(ReLU(Conv2d(AdaptiveAvgPool2d(x)))))
x_fi = x × reshape(y)   (2)
where x_si denotes the i-th stage feature matrix of the self-attention feature extraction branch, x_ci denotes the i-th stage feature matrix of the spatial feature extraction branch, x_fi denotes the fused feature, Conv2d(·) denotes a 2-D convolution, AdaptiveAvgPool2d(·) denotes an adaptive pooling function, sigmoid(·) denotes the sigmoid activation function, ReLU(·) denotes the ReLU activation function, concatenate(·) denotes splicing two matrices along dimension 1, and reshape(·) denotes a shape-change function.
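A sketch of the fusion branch of equation (2); the channel counts and the shapes of the two inner 1x1 convolutions are assumptions, since equation (2) fixes only the operator sequence:

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Equation (2): fuse spatial (x_ci) and self-attention (x_si) features;
    channel counts and the two inner 1x1 convolutions are assumptions."""
    def __init__(self, c_ci: int, c_si: int, c_out: int):
        super().__init__()
        self.proj_c = nn.Conv2d(c_ci, c_out, kernel_size=1)
        self.proj_s = nn.Conv2d(c_si, c_out, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)            # AdaptiveAvgPool2d(x)
        self.fc1 = nn.Conv2d(2 * c_out, c_out // 2, kernel_size=1)
        self.fc2 = nn.Conv2d(c_out // 2, 2 * c_out, kernel_size=1)

    def forward(self, x_ci: torch.Tensor, x_si: torch.Tensor) -> torch.Tensor:
        # x = concatenate(Conv2d(x_ci), Conv2d(x_si)) along dimension 1
        x = torch.cat([self.proj_c(x_ci), self.proj_s(x_si)], dim=1)
        # y = sigmoid(Conv2d(ReLU(Conv2d(AdaptiveAvgPool2d(x)))))
        y = torch.sigmoid(self.fc2(torch.relu(self.fc1(self.pool(x)))))
        # x_fi = x * reshape(y): broadcast channel weights over h x w
        return x * y.reshape(y.shape[0], -1, 1, 1)
```

The pooled-then-sigmoid path acts as a per-channel gate, so the fused output keeps the spatial size of the inputs while reweighting each channel globally.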
Preferably, the obtaining of the first encoding index and the first decoding index from the context feature matrix of size h/2 comprises:
S11, denoting the context feature matrix of size h/2 as x_i and obtaining the shape parameters of x_i, the shape parameters comprising the batch size value batchsize, the number of channels c, the height h and the width w;
S12, inputting x_i into a Conv2d function to obtain x_i1, inputting x_i1 into a BatchNorm2d function to obtain x_i2, inputting x_i2 into a BatchNorm2d function to obtain x_i3, and inputting x_i3 into a BatchNorm2d function to obtain x_i4;
S13, performing a maximum pooling operation on each of x_i1, x_i2, x_i3 and x_i4 to obtain four initial indexes x_1, x_2, x_3, x_4;
S14, splicing the initial indexes x_1, x_2, x_3, x_4 into a new matrix along dimension 1 with the torch.cat function, and passing the new matrix to a sigmoid activation function to obtain an initial decoding index y;
S15, passing the initial decoding index y through a softmax function to obtain an initial encoding index z, and adjusting the shape parameters of the initial decoding index y and the initial encoding index z with the view function to batchsize, c×4, h/2 and w/2 to obtain the adjusted initial decoding index y and initial encoding index z;
S16, reorganizing the adjusted initial decoding index y and initial encoding index z back to their size before adjustment with the pixel_shuffle function to obtain the first encoding index and the first decoding index.
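The index-generation steps can be sketched as a PyTorch module; the Conv2d kernel size and the 2x2 max-pooling window are assumptions (the text does not fix them), chosen so that the pixel_shuffle at the end restores the input size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndexGenerator(nn.Module):
    """Index-generation steps as a module; kernel size and pooling window
    are assumed."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.bn1, self.bn2, self.bn3 = (nn.BatchNorm2d(c) for _ in range(3))

    def forward(self, x_i: torch.Tensor):
        b, c, h, w = x_i.shape                         # shape parameters
        x_i1 = self.conv(x_i)                          # Conv2d ...
        x_i2 = self.bn1(x_i1)                          # ... then three BatchNorm2d
        x_i3 = self.bn2(x_i2)
        x_i4 = self.bn3(x_i3)
        # max pooling halves the spatial size of each intermediate
        x1, x2, x3, x4 = (F.max_pool2d(t, 2) for t in (x_i1, x_i2, x_i3, x_i4))
        y = torch.sigmoid(torch.cat([x1, x2, x3, x4], dim=1))  # decoding index
        z = torch.softmax(y, dim=1)                            # encoding index
        y = y.view(b, c * 4, h // 2, w // 2)                   # view adjustment
        z = z.view(b, c * 4, h // 2, w // 2)
        # pixel_shuffle(., 2) reorganizes back to (b, c, h, w)
        return F.pixel_shuffle(z, 2), F.pixel_shuffle(y, 2)
```

Because the four pooled tensors each have c channels at half resolution, pixel_shuffle with upscale factor 2 exactly undoes the c×4, h/2, w/2 shape adjustment.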
The invention provides a high-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization, which extracts multi-scale semantic features of images through a multi-scale feature encoder in a remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization, can capture spatial context information in parallel, strengthens the segmentation effect of remote sensing feature edge information through extracting coding and decoding indexes containing edge information, improves feature extraction and processing of the remote sensing feature edge information, improves recognition precision of small-size objects and complex boundary information in a remote sensing image, and realizes accurate semantic segmentation of the remote sensing feature edge.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic flow chart of a semantic segmentation model of a remote sensing image based on coding and decoding indexing edge characterization;
FIG. 3 is a schematic diagram of a multi-scale feature encoder of the present invention;
FIG. 4 is a schematic diagram of the structure of a separable pyramid unit of the present invention;
FIG. 5 is a schematic diagram of the structure of the generated index of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of the method of the present invention, as shown in FIG. 1, the method of the present embodiment may include:
step one, acquiring a remote sensing image set, amplifying the remote sensing image set, namely rotating the remote sensing image at any angle and storing the remote sensing image into the remote sensing image set, respectively carrying out normalization processing on the remote sensing image, dividing the remote sensing image set into a training set and a testing set,
step two, constructing a remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation, training the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to a training set, obtaining a prediction label of each pixel in the remote sensing image, calculating loss according to the truth label and the prediction label of each pixel, judging whether the value of the loss meets a threshold value, optimizing parameters of the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to the difference between the value of the loss and the threshold value if the value of the loss does not meet the threshold value, obtaining the trained remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation if the value of the loss meets the threshold value,
the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation comprises a multi-scale feature encoder, a separable pyramid unit, a coding and decoding indexing edge representation unit and an up-sampling decoder,
the multi-scale feature encoder is used for generating four initial feature matrixes according to the size h of the remote sensing image, the sizes of the four initial feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the separable pyramid unit is used for obtaining four context feature matrixes according to four initial feature matrixes of the remote sensing image, the sizes of the four context feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the coding and decoding indexing edge characterization unit is used for acquiring a first coding index and a first decoding index according to a context feature matrix with the size of h/2, acquiring a second coding index and a second decoding index according to a context feature matrix with the size of h/4, fusing the first coding index with the context feature matrix with the size of h/2, fusing the second coding index with the context feature matrix with the size of h/4, acquiring a fused context feature matrix with the size of h/2 and a fused context feature matrix with the size of h/4,
the up-sampling decoder is used for decoding and up-sampling the context feature matrix according to the order from small to large in size to obtain a semantic segmentation result graph of the remote sensing image, the semantic segmentation result graph comprises a prediction label of each pixel in the remote sensing image,
and thirdly, acquiring the processed remote sensing image, inputting the processed remote sensing image into a trained remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization, and outputting a semantic segmentation result graph of the remote sensing image.
Based on the scheme, the multi-scale semantic features of the remote sensing image are extracted through the multi-scale feature encoder in the remote sensing image semantic segmentation model based on the encoding and decoding indexing edge characterization, the separable pyramid units capture spatial context information in parallel, the segmentation effect of the remote sensing ground object edge information is enhanced through the encoding and decoding indexing which extracts the edge information, the feature extraction and processing of the remote sensing ground object edge information are improved, the recognition precision of small-size objects and complex boundary information in the remote sensing image is improved, and the accurate semantic segmentation of the remote sensing ground object edge is realized.
Step one, a remote sensing image set is acquired and amplified, namely the remote sensing images are rotated at arbitrary angles and stored back into the remote sensing image set. Data amplification is used to prevent the model from overfitting and to improve its robustness, and generally comprises operations such as vertical and horizontal flipping and rotation by 90 degrees. Each remote sensing image is then normalized; normalization is a data preprocessing operation that maps the original remote sensing image data into the range 0-1. Finally, the remote sensing image set is divided into a training set and a testing set.
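A minimal NumPy sketch of the amplification and normalization in step one; the particular set of flips and rotations is illustrative, since the text allows rotation by any angle:

```python
import numpy as np

def augment_and_normalize(img: np.ndarray) -> list:
    """Amplify one remote sensing image with flips and a 90-degree rotation,
    then min-max normalize every view into [0, 1]; the view list is
    illustrative, and a constant image (max == min) is not handled."""
    views = [img, np.flipud(img), np.fliplr(img), np.rot90(img)]
    lo, hi = float(img.min()), float(img.max())
    return [(v - lo) / (hi - lo) for v in views]
```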
step two, constructing a remote sensing image semantic segmentation model based on coding and decoding indexing edge representation, as shown in fig. 2, training the remote sensing image semantic segmentation model based on coding and decoding indexing edge representation according to a training set to obtain a prediction label of each pixel in the remote sensing image, calculating loss according to the truth label and the prediction label of each pixel, judging whether the value of the loss meets a threshold value, if not, updating parameters of the remote sensing image semantic segmentation model based on coding and decoding indexing edge representation according to the difference between the loss and the threshold value, if yes, obtaining the trained remote sensing image semantic segmentation model based on coding and decoding indexing edge representation, wherein calculating the loss according to the truth label and the prediction label of each pixel comprises calculating the loss according to a formula (1),
Loss_focal = -(1 - p_t)^γ · log(p_t)   (1)
where p_t is the predicted probability of the true-class label, obtained from the truth label and the prediction label, γ is a hyperparameter, and Loss_focal denotes the focal loss function.
The remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation comprises a multi-scale feature encoder, a separable pyramid unit, a coding and decoding indexing edge representation unit and an up-sampling decoder,
the multi-scale feature encoder is used for generating four initial feature matrixes according to the size h of the remote sensing image, the sizes of the four initial feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the multi-scale context feature encoder comprises a spatial feature extraction branch, a self-attention feature extraction branch and a fusion branch, wherein the spatial feature extraction branch is used for extracting local feature information of a remote sensing image, the self-attention feature extraction branch is used for extracting global feature information of the remote sensing image, the fusion branch is used for fusing the local feature information and the global feature information according to a formula (2),
x = concatenate(Conv2d(x_ci), Conv2d(x_si))
y = sigmoid(Conv2d(ReLU(Conv2d(AdaptiveAvgPool2d(x)))))
x_fi = x × reshape(y)   (2)
where x_si denotes the i-th stage feature matrix of the self-attention feature extraction branch, x_ci denotes the i-th stage feature matrix of the spatial feature extraction branch, x_fi denotes the fused feature, Conv2d(·) denotes a 2-D convolution, AdaptiveAvgPool2d(·) denotes an adaptive pooling function, sigmoid(·) denotes the sigmoid activation function, ReLU(·) denotes the ReLU activation function, concatenate(·) denotes splicing two matrices along dimension 1, and reshape(·) denotes a shape-change function.
Specifically, when the multi-scale context feature encoder is adopted to extract global and local feature information of the remote sensing image, the following modes are adopted:
the first five stages of the lightweight characteristic extraction model EfficientNet-B5 are used as spatial characteristic extraction branches to extract local characteristic information of remote sensing images, and EfficientNet-B5 is one of convolutional neural network structures and belongs to EfficientNet series models. The design goal of the EfficientNet series model is to provide better performance while keeping the computational cost low. EfficienientNet B-5 is the fifth model in the EfficienientNet series;
the Swin transducer is used as a self-attention feature extraction branch to extract global feature information of the remote sensing image, is a novel neural network architecture, and adopts a transducer-based method to solve the computer vision task. The innovation is that a method called "sliding window" (Shifted Windows) is used to process the image, which enables it to more efficiently perform calculations when processing large-scale image data;
the separable pyramid unit is used for obtaining four context feature matrices from the four initial feature matrices of the remote sensing image, the sizes of the four context feature matrices being h/2, h/4, h/8 and h/16 respectively; the structure of the separable pyramid unit is shown in FIG. 4, and its function is to capture context feature matrices of the same size using separable dilated convolutions with different dilation rates, the dilation rates being 0, 1, 6 and 12;
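One branch of the separable pyramid unit can be sketched as a depthwise-separable dilated convolution; treating dilation rate 0 as a plain 1x1 convolution is an assumption, since PyTorch requires a dilation of at least 1:

```python
import torch
import torch.nn as nn

class SeparableDilatedConv(nn.Module):
    """Depthwise-separable 3x3 dilated convolution for the separable pyramid
    unit; interpreting dilation rate 0 as a plain 1x1 convolution is an
    assumption (PyTorch requires dilation >= 1)."""
    def __init__(self, c: int, dilation: int):
        super().__init__()
        if dilation == 0:
            self.depthwise = nn.Conv2d(c, c, kernel_size=1, groups=c)
        else:
            # padding = dilation keeps the spatial size for a 3x3 kernel
            self.depthwise = nn.Conv2d(c, c, kernel_size=3, padding=dilation,
                                       dilation=dilation, groups=c)
        self.pointwise = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# The four pyramid branches use dilation rates 0, 1, 6 and 12, all of which
# preserve the size of the context feature matrix:
branches = [SeparableDilatedConv(8, d) for d in (0, 1, 6, 12)]
```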
The coding and decoding indexing edge characterization unit is configured to obtain a first encoding index and a first decoding index from the context feature matrix of size h/2; the structure that generates the index is shown in FIG. 5. Specifically, obtaining the first encoding index and the first decoding index from the context feature matrix of size h/2 comprises:
S11, denoting the context feature matrix of size h/2 as x_i and obtaining the shape parameters of x_i, the shape parameters comprising the batch size value batchsize, the number of channels c, the height h and the width w;
S12, inputting x_i into a Conv2d function to obtain x_i1, inputting x_i1 into a BatchNorm2d function to obtain x_i2, inputting x_i2 into a BatchNorm2d function to obtain x_i3, and inputting x_i3 into a BatchNorm2d function to obtain x_i4;
S13, performing a maximum pooling operation on each of x_i1, x_i2, x_i3 and x_i4 to obtain four initial indexes x_1, x_2, x_3, x_4;
S14, splicing the initial indexes x_1, x_2, x_3, x_4 into a new matrix along dimension 1 with the torch.cat function, and passing the new matrix to a sigmoid activation function to obtain an initial decoding index y;
S15, passing the initial decoding index y through a softmax function to obtain an initial encoding index z, and adjusting the shape parameters of the initial decoding index y and the initial encoding index z with the view function to batchsize, c×4, h/2 and w/2 to obtain the adjusted initial decoding index y and initial encoding index z;
S16, reorganizing the adjusted initial decoding index y and initial encoding index z back to their size before adjustment with the pixel_shuffle function to obtain the first encoding index and the first decoding index.
The second coding index and second decoding index are obtained from the context feature matrix of size h/4 by processing it according to steps S11-S16.
The first coding index is fused with the context feature matrix of size h/2 and the second coding index with the context feature matrix of size h/4, yielding a fused context feature matrix of size h/2 and a fused context feature matrix of size h/4.
Specifically, the separable pyramid unit captures multi-scale spatial context information of the initial feature map in parallel as follows:
The separable pyramid unit replaces all 3×3 convolutions in the dilated pyramid module with depthwise separable convolutions, and one separable pyramid unit is built and applied for each of the four feature matrices of different sizes. The dilated pyramid module is a convolutional neural network module commonly used in deep learning to extract features at different scales from an image. Its structure is as follows:
input layer: accepting input data from a previous layer.
Convolution layer: and carrying out convolution operation on the input data by using convolution cores with different sizes, and extracting features with different scales.
Expansion layer: and expanding the characteristic diagram output by the convolution layer to obtain a larger receptive field.
Fusion layer: and fusing the feature graphs with different scales to obtain richer feature representations.
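A minimal sketch of the depthwise separable dilated convolution and its parallel-branch arrangement described above. The channel widths are illustrative assumptions; and since the patent lists dilation rates 0, 1, 6 and 12 while PyTorch requires dilation >= 1, this sketch substitutes a 1×1 branch plus rates 1, 6 and 12:

```python
import torch
import torch.nn as nn

class SeparableDilatedConv(nn.Module):
    """Depthwise separable 3x3 convolution with a given dilation rate,
    replacing the plain 3x3 convolutions of the dilated pyramid module."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch)
        # pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SeparablePyramid(nn.Module):
    """Parallel branches at several dilation rates, concatenated then fused
    (a 1x1 branch stands in for the listed rate 0 -- an assumption)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch0 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches = nn.ModuleList(
            SeparableDilatedConv(in_ch, out_ch, d) for d in (1, 6, 12))
        self.fuse = nn.Conv2d(out_ch * 4, out_ch, 1)

    def forward(self, x):
        feats = [self.branch0(x)] + [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```

Because `padding` equals `dilation` for a 3×3 kernel, every branch preserves the spatial size, so the outputs can be concatenated along the channel dimension directly.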
The up-sampling decoder decodes and up-samples the context feature matrices in order from the smallest size to the largest to obtain the semantic segmentation result graph of the remote sensing image; the graph contains a prediction label for each pixel of the image.
Specifically, the up-sampling decoder decodes and up-samples the context feature matrix of size h/16 to obtain an output feature matrix x_d1 of size h/8, and splices x_d1 with the context feature matrix of size h/8 along dimension 1 using the torch.cat function to form a new feature matrix x_m1.
x_m1 is decoded and up-sampled to obtain an output feature matrix x_d2 of size h/4; x_d2 is spliced with the context feature matrix of size h/4 along dimension 1 using the torch.cat function to form a new feature matrix x_m2; after a matrix product operation between the second decoding index and the feature matrix x_m2, a feature matrix x_n2 is output.
x_n2 is decoded and up-sampled to obtain an output feature matrix of size h/2, which is spliced with the context feature matrix of size h/2 along dimension 1 using the torch.cat function to form a new feature matrix x_m3; after a matrix product operation between the first decoding index and the feature matrix x_m3, a feature matrix x_n3 is output.
x_n3 is decoded and up-sampled to obtain a feature matrix x_d4 of size h; after one convolution, x_d4 is input into a softmax activation function to obtain the semantic segmentation result graph.
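One decode-and-up-sample step with its skip connection and decoding-index fusion can be sketched as follows. The stride-2 transposed-convolution upsampler, the channel/spatial sizes, and the reading of the "matrix product" as an element-wise (Hadamard) product are all assumptions, since the text does not pin them down:

```python
import torch
import torch.nn as nn

class UpDecodeStep(nn.Module):
    """One decode-and-up-sample step; a stride-2 transposed convolution is
    assumed (the patent only names the operator for the final stage)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        return self.up(x)

# Hypothetical sizes for the h/16 -> h/8 step:
step = UpDecodeStep(256, 128)
x16 = torch.randn(1, 256, 8, 8)       # context features at size h/16
c8 = torch.randn(1, 128, 16, 16)      # context features at size h/8
x_d1 = step(x16)                      # decode and up-sample to h/8
x_m1 = torch.cat([x_d1, c8], dim=1)   # skip connection along dimension 1
# Decoding-index fusion: interpreted here as an element-wise product with
# an index tensor of matching shape -- an interpretation, not stated fact.
index = torch.rand(1, 256, 16, 16)
x_fused = index * x_m1
```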
Based on the above technical scheme, the remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization uses an image-based deep learning classification framework in the segmentation model, and the remote sensing image is input into the model for training and prediction. First, features of the remote sensing image are extracted and fused by the multi-scale feature encoder. Second, multi-scale spatial context information of the initial feature map is captured in parallel by separable pyramid units with different dilation rates, implemented with dilated convolutions. Third, the two largest feature maps are input into the index-generation module to extract coding and decoding indexes containing edge feature information, and the coding indexes are integrated into the feature matrices by matrix product. Fourth, the smallest feature map is decoded and up-sampled four times; the output of each decode-and-up-sample step is merged with the feature map of the corresponding size through skip connections, the two decoding indexes are merged into the results of the second and third decode-and-up-sample steps by matrix product, and the result of the fourth step is finally classified with one transposed convolution and one softmax activation function to output the final semantic segmentation result graph. The model thereby improves the extraction and processing of ground-object edge information and raises the recognition accuracy of small objects and complex boundary information in remote sensing images.
In this embodiment, real remote sensing image data are used for the experiments: two public real remote sensing image datasets are used to test, analyze and evaluate the application effect of the remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization.
1. Data set and parameter settings
This embodiment uses two published high-resolution remote sensing image datasets from the ISPRS 2D Semantic Labeling Contest (the Potsdam dataset and the Vaihingen dataset) for experiments and analysis. The datasets provide digital surface models (DSMs) generated from high-resolution orthophotos with corresponding dense image matching techniques.
The ISPRS Vaihingen dataset comprises 33 images of varying sizes, with an average size of 2494×2064 pixels and a spatial resolution of 9 cm; the images contain three bands: near infrared (NIR), red (R) and green (G). The label categories are impervious surfaces, buildings, low vegetation, trees, cars and clutter, six categories in total.
The ISPRS Potsdam dataset contains 38 images, each 6000×6000 pixels, with a spatial resolution of 5 cm, using three bands: red (R), green (G) and blue (B). The label categories and their number are consistent with the Vaihingen dataset.
During model training the batch size is set to 16 and 300 epochs are trained each time; the learning rate is dynamically adjusted with cosine annealing, the initial learning rate is set to 1e-3, the learning-rate decay coefficient is 0.2 with a decay interval of 5, and an AdamW optimizer is used to optimize the parameters.
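The stated training configuration might be set up in PyTorch roughly as follows. The model is a placeholder layer, and only the stated batch size, epoch count, initial learning rate, cosine annealing and AdamW are reflected; how the decay coefficient 0.2 and interval 5 combine with the cosine schedule is not specified in the text, so they are omitted here:

```python
import torch
from torch import nn

# Stated hyper-parameters: batch size 16, 300 epochs, AdamW,
# initial lr 1e-3, cosine-annealing learning-rate schedule.
model = nn.Conv2d(3, 6, kernel_size=1)  # placeholder for the segmentation model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(3):  # 300 in the experiments; shortened for illustration
    # ... forward pass, focal-loss computation, backward pass go here ...
    optimizer.step()
    scheduler.step()    # lr decays along the cosine curve each epoch
```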
2. Experimental evaluation index
Overall accuracy (OA) is a performance index for evaluating classification models; in the image semantic segmentation task it is the proportion of correctly classified pixels to the total number of pixels. The calculation formula is OA = N_correct / N_total, where N_correct is the number of correctly classified pixels and N_total the total number of pixels.
The F1 score and the mean F1 score (mF1) are indexes measuring classification performance, commonly used to evaluate the accuracy of two-class or multi-class models. The intersection over union (IoU) and the mean intersection over union (mIoU) are commonly used indexes for measuring the performance of target detection and semantic segmentation models.
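A small sketch of how OA, mIoU and mF1 could be computed from predicted and ground-truth label maps; these are the standard confusion-matrix definitions, not code from the patent:

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """OA, mIoU and mF1 from a confusion matrix.
    pred/target: integer label arrays of the same shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class i but wrong
    fn = cm.sum(axis=1) - tp          # class i missed
    oa = tp.sum() / cm.sum()          # correctly classified / total pixels
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return oa, iou.mean(), f1.mean()
```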
3. Analysis and evaluation of experimental results
The results of the experiments with the remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization on the two remote sensing image datasets are shown in Table 1 and Table 2.
Table 1 Comparative experimental results on the Vaihingen dataset
Table 2 Comparative experimental results on the Potsdam dataset
The experiments compare, on the same datasets, the DCNN-based models UNet and SegNet, the Transformer-based improved model TransUNet, and the CapsUNet model. From the classification results, the following conclusions can be drawn:
As the tables show, the proposed segmentation model achieves the best results. Because the amount of remote sensing image data is limited, the TransUNet experimental results are relatively poor. TransUNet uses a Transformer as the encoder to model long-range dependencies and adds low-level detail information to the feature maps in the decoder through skip connections. However, because Transformer models require large amounts of data to train, TransUNet does not perform as well in this experiment as the proposed remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization. Compared with the DCNN-based improved models (UNet, SegNet, CapsUNet) and the Transformer-based improved model TransUNet, the proposed model obtains better performance.
Overall beneficial effects:
The invention provides a high-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization. The method extracts multi-scale semantic features of the image through the multi-scale feature encoder of the segmentation model and captures spatial context information in parallel. By extracting coding and decoding indexes that contain edge information, it strengthens the segmentation of ground-object edges, improves the extraction and processing of edge information, raises the recognition accuracy of small objects and complex boundary information in remote sensing images, and achieves accurate semantic segmentation of ground-object edges.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (5)

1. A semantic segmentation method of a high-resolution remote sensing image based on coding and decoding indexing edge characterization is characterized by comprising the following steps of,
Step one, acquire a remote sensing image set and augment it, namely rotate the remote sensing images by arbitrary angles and store them in the remote sensing image set; normalize the remote sensing images respectively, and divide the remote sensing image set into a training set and a testing set;
step two, constructing a remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation, training the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to a training set, obtaining a prediction label of each pixel in the remote sensing image, calculating loss according to the truth label and the prediction label of each pixel, judging whether the value of the loss meets a threshold value, optimizing parameters of the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation according to the difference between the value of the loss and the threshold value if the value of the loss does not meet the threshold value, obtaining the trained remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation if the value of the loss meets the threshold value,
the remote sensing image semantic segmentation model based on the coding and decoding indexing edge representation comprises a multi-scale feature encoder, a separable pyramid unit, a coding and decoding indexing edge representation unit and an up-sampling decoder,
the multi-scale feature encoder is used for generating four initial feature matrixes according to the size h of the remote sensing image, the sizes of the four initial feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the separable pyramid unit is used for obtaining four context feature matrixes according to four initial feature matrixes of the remote sensing image, the sizes of the four context feature matrixes are h/2, h/4, h/8 and h/16 respectively,
the coding and decoding indexing edge characterization unit is used for acquiring a first coding index and a first decoding index according to a context feature matrix with the size of h/2, acquiring a second coding index and a second decoding index according to a context feature matrix with the size of h/4, fusing the first coding index with the context feature matrix with the size of h/2, fusing the second coding index with the context feature matrix with the size of h/4, acquiring a fused context feature matrix with the size of h/2 and a fused context feature matrix with the size of h/4,
the up-sampling decoder is used for decoding and up-sampling the four context feature matrixes according to the order from small to large in size to obtain a semantic segmentation result graph of the remote sensing image, the semantic segmentation result graph comprises a prediction label of each pixel in the remote sensing image,
Step three, acquire the processed remote sensing image, input it into the trained remote sensing image semantic segmentation model based on coding and decoding indexing edge characterization, and output the semantic segmentation result graph of the remote sensing image.
2. The method for semantic segmentation of high-resolution remote sensing images based on coding and decoding indexed edge representation according to claim 1, wherein the up-sampling decoder decodes and up-samples the context feature matrix of size h/16 to obtain an output feature matrix x_d1 of size h/8, and splices x_d1 with the context feature matrix of size h/8 along dimension 1 using the torch.cat function to form a new feature matrix x_m1;
x_m1 is decoded and up-sampled to obtain an output feature matrix x_d2 of size h/4; x_d2 is spliced with the context feature matrix of size h/4 along dimension 1 using the torch.cat function to form a new feature matrix x_m2; after a matrix product operation between the second decoding index and the feature matrix x_m2, a feature matrix x_n2 is output;
x_n2 is decoded and up-sampled to obtain an output feature matrix of size h/2, which is spliced with the context feature matrix of size h/2 along dimension 1 using the torch.cat function to form a new feature matrix x_m3; after a matrix product operation between the first decoding index and the feature matrix x_m3, a feature matrix x_n3 is output;
x_n3 is decoded and up-sampled to obtain a feature matrix x_d4 of size h; after one convolution, x_d4 is input into a softmax activation function to obtain the semantic segmentation result graph.
3. The method of claim 1, wherein calculating the loss from the truth label and the predictive label for each pixel comprises calculating the loss according to equation (1),
Loss_focal = -(1 - p_t)^γ · log(p_t)    (1)
wherein p_t is the predicted probability of the truth label, obtained from the truth label and the prediction label, γ is a hyper-parameter, and Loss_focal denotes the focal loss function.
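Equation (1) can be sketched in PyTorch as follows. The γ default of 2.0 and the numerical-stability clamp are assumptions not given in the claim:

```python
import torch

def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """Focal loss of equation (1): -(1 - p_t)^gamma * log(p_t), where p_t
    is the predicted probability of the true class.
    probs: (N, C) softmax probabilities; target: (N,) class indices."""
    # gather the probability assigned to each pixel's true class
    p_t = probs.gather(1, target.unsqueeze(1)).squeeze(1).clamp_min(eps)
    return (-(1 - p_t) ** gamma * torch.log(p_t)).mean()
```

For a confidently correct prediction (p_t near 1) the modulating factor (1 - p_t)^γ drives the loss toward zero, so training focuses on hard, poorly classified pixels such as object edges.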
4. The method of claim 1, wherein the multi-scale context feature encoder comprises a spatial feature extraction branch for extracting local feature information of the remote sensing image, a self-attention feature extraction branch for extracting global feature information of the remote sensing image, and a fusion branch for fusing the local feature information and the global feature information according to formula (2),
x = concatenate(Conv2d(x_ci), Conv2d(x_si))
y = sigmoid(Conv2d(ReLU(Conv2d(AdaptiveAvgPool2d(x)))))
x_fi = x × reshape(y)    (2)
wherein x_si denotes the i-th stage feature matrix of the self-attention feature extraction branch, x_ci denotes the i-th stage feature matrix of the spatial feature extraction branch, x_fi denotes the fused feature matrix, Conv2d(·) denotes a 2D convolution, AdaptiveAvgPool2d(·) denotes an adaptive pooling function, sigmoid(·) denotes the sigmoid activation function, ReLU(·) denotes the ReLU activation function, concatenate(·) denotes splicing two matrices along dimension 1, and reshape(·) denotes a shape change function.
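Formula (2) describes a squeeze-and-excitation-style gating of the concatenated branch features: pool globally, pass through two convolutions with a ReLU in between, gate with sigmoid, and reweight. A sketch under assumed kernel sizes and channel widths (the claim fixes the operators but not these parameters):

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Sketch of formula (2); kernel sizes and channel widths are assumptions."""
    def __init__(self, ch):
        super().__init__()
        self.proj_c = nn.Conv2d(ch, ch, 1)   # Conv2d(x_ci)
        self.proj_s = nn.Conv2d(ch, ch, 1)   # Conv2d(x_si)
        self.pool = nn.AdaptiveAvgPool2d(1)  # AdaptiveAvgPool2d(x)
        self.fc1 = nn.Conv2d(2 * ch, 2 * ch, 1)
        self.fc2 = nn.Conv2d(2 * ch, 2 * ch, 1)

    def forward(self, x_ci, x_si):
        # x = concatenate(Conv2d(x_ci), Conv2d(x_si)) along dimension 1
        x = torch.cat([self.proj_c(x_ci), self.proj_s(x_si)], dim=1)
        # y = sigmoid(Conv2d(ReLU(Conv2d(AdaptiveAvgPool2d(x)))))
        y = torch.sigmoid(self.fc2(torch.relu(self.fc1(self.pool(x)))))
        # x_fi = x * reshape(y): per-channel reweighting by broadcasting
        return x * y.reshape(y.shape[0], -1, 1, 1)
```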
5. The method of claim 1, wherein the obtaining the first encoding index and the first decoding index according to the context feature matrix with the size of h/2 comprises,
S11, denote the context feature matrix of size h/2 as x_i, and obtain the shape parameters of x_i: batch size, channel number c, height h and width w;
S12, input x_i into a Conv2d function to obtain x_i1, input x_i1 into a BatchNorm2d function to obtain x_i2, input x_i2 into a BatchNorm2d function to obtain x_i3, and input x_i3 into a BatchNorm2d function to obtain x_i4;
S13, apply a max pooling operation to each of x_i1, x_i2, x_i3 and x_i4 to obtain four initial indexes x_1, x_2, x_3, x_4;
S14, splice the initial indexes x_1, x_2, x_3, x_4 into a new matrix along dimension 1 with the torch.cat function, and pass the new matrix through a sigmoid activation function to obtain the initial decoding index y;
S15, pass the initial decoding index y through a softmax function to obtain the initial coding index z, then use the view function to adjust the shape parameters of y and z to (batch size, c×4, h/2, w/2), obtaining the adjusted initial decoding index y and initial coding index z;
S16, reorganize the adjusted initial decoding index y and initial coding index z back to their pre-adjustment size with the pixel_shuffle function to obtain the first coding index and the first decoding index.
CN202310496605.3A 2023-05-05 2023-05-05 High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization Pending CN116524189A (en)


Publications (1)

Publication Number Publication Date
CN116524189A true CN116524189A (en) 2023-08-01


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740364A (en) * 2023-08-16 2023-09-12 长春大学 Image semantic segmentation method based on reference mechanism
CN117237623A (en) * 2023-08-04 2023-12-15 山东大学 Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle


