CN118135239A - Fusion filtering multi-scale high-resolution remote sensing glacier extraction method - Google Patents


Info

Publication number: CN118135239A (granted as CN118135239B)
Application number: CN202410571994.6A
Authority: CN (China)
Prior art keywords: feature, glacier, channel, remote sensing, matrix
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 夏琛威, 张秀再, 张昊
Current and original assignee: Nanjing University of Information Science and Technology
Application filed by Nanjing University of Information Science and Technology
Priority to CN202410571994.6A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a fusion-filtering multi-scale high-resolution remote sensing glacier extraction method. First, high-resolution glacier remote sensing images are preprocessed to build a deep learning semantic segmentation dataset, which is enriched by data augmentation to ensure robust model training. To address the weak recognition of scattered, fine glaciers, a gated multi-scale filter layer (G-MsFL) is designed: it provides the model with multi-scale feature extraction and fusion paths for capturing fine glaciers, while a gating mechanism effectively filters out useless feature information. To address blurred glacier contours, a parallel dual-channel attention module (P-DAM) is designed: it encodes the rich context information of glacier boundaries into the local features of the feature map, enhancing their feature expression capability. The method effectively supports large-area glacier extraction in plateau regions.

Description

Fusion filtering multi-scale high-resolution remote sensing glacier extraction method
Technical Field
The invention belongs to the field of deep learning and algorithm improvement, and particularly relates to a fusion filtering multi-scale high-resolution remote sensing glacier extraction method.
Background
In mountainous and high-latitude areas, the climate is severe, the annual mean temperature is below 0 °C, and snow persists throughout the year. When snow accumulation exceeds ablation, the surface snow thickens year by year, gradually compacting into firn and then into bluish glacier ice. Glacier ice slowly flows downslope under its own gravity or under the pressure of the overlying ice layers, forming a glacier. Glaciers are among the most valuable natural resources on Earth: they are an indicator of climate change, one of the largest freshwater reservoirs on Earth, and their formation and change are closely related to the geological structure and evolution of the Earth. Under the trend of global warming, glaciers are on the whole in a state of accelerated ablation; this is reflected not only in rapid retreat and thinning, but also in increased glacier instability, which aggravates the risk of associated disasters. Glacier disasters are closely related to glacier changes; they endanger the life and property of local residents and damage roads, infrastructure, and major engineering works.
In high-latitude areas, the ablation zones of many glaciers are covered with large amounts of supraglacial debris, which alters the melt rate and spatial pattern of the glacier and increases the likelihood of glacial outburst floods. Factors such as global warming destabilize newly formed glacial lakes, moraines, and hillslopes, so that glacial lake outburst floods and debris flows occur around mountain ranges. Glacier identification and research is therefore of profound importance to earth science, climatology, and environmental management.
Conventional mountain glacier identification methods fall roughly into two categories: visual interpretation and automatic computer identification. Automatic computer identification methods include band-ratio thresholding, the normalized difference snow index (NDSI), supervised and unsupervised classification, object-oriented information extraction, and the like. Visual interpretation requires field investigation of glaciers and manual classification based on expert knowledge and experience; it consumes large amounts of manpower, material resources, and time. Researchers at home and abroad have compared the advantages and disadvantages of glacier extraction methods on different plateaus and selected the most accurate method for different conditions. However, the performance of traditional glacier identification methods degrades in complex land-cover environments, where identification errors are large. In addition, traditional methods rely on manually designed features, extraction rules, and calculations, making it difficult to capture complex spectral information or to identify targets at large scale. In summary, conventional glacier identification is time-consuming, labor-intensive, and of limited overall accuracy.
In recent years, with the development of remote sensing technology, high-resolution remote sensing images have come to contain abundant land-cover information, and different land-cover types can be identified by segmenting them. This has important application prospects in urban planning, building extraction, road extraction, vehicle detection, and related fields. However, because high-resolution remote sensing images have high spatial resolution and complex detail, existing segmentation techniques designed for natural images cannot be applied directly to their semantic segmentation. Deep-learning-based semantic segmentation of remote sensing images has therefore been widely applied to plateau glacier identification. Semantic segmentation is a pixel-level classification method that assigns an object label to every pixel of an image, performing pixel-by-pixel label prediction on the input image.
Disclosure of Invention
Purpose of the invention: in view of the deficiencies of the prior art, the invention aims to provide a fusion-filtering multi-scale high-resolution remote sensing glacier extraction method. The method addresses problems that commonly arise in mountain glacier identification, such as occlusion, shadow, scattered glaciers, and incomplete extraction of glacier boundary information.
The method comprises the following steps:
Step 1, acquiring high-resolution remote sensing image data and preprocessing it to obtain a glacier dataset and the corresponding label data, and dividing the glacier dataset into a training set, a validation set, and a test set according to a ratio (such as 7:2:1);
Step 2, building a high-resolution remote sensing glacier extraction model, inputting the training set and validation set into the model for training and validation, and obtaining a trained remote sensing glacier extraction model;
Step 3, inputting the test images in the test set into the model obtained in step 2 to perform glacier identification on remote sensing images and evaluate the accuracy of the model.
In step 1, preprocessing the high-resolution remote sensing image data specifically includes: a self-built Anyemaqen plateau glacier dataset (acquired by the Landsat-9 satellite) is used. The Anyemaqen Mountains, the study area, were formed by Variscan folding and later shaped by the Himalayan orogeny. The range lies at the junction of Gansu and Qinghai provinces, is about 350 km long and 50–60 km wide; its highest section, west of Maqin County and trending northwest–southeast, is known as Maji Snow Mountain, with 18 peaks above 5000 m in altitude and about 30 developed modern glaciers. The area has a continental climate with changeable weather, dominated by strong wind and snow before the end of April each year; the mountain range contains many types of glaciers, which appear clearly in remote sensing images. Remote sensing images composed of combinations of bands 1–6 are selected for cropping and augmentation (the Landsat-9 satellite provides 11 bands, and different bands respond distinctly to specific land-cover types; bands 1–6 are selected here). The original large glacier remote sensing TIFF images are cropped into 512 × 512 TIFF tiles using a sliding-window cropping scheme with an overlap rate of 0.1, converted to JPEG format, and combined into RGB images;
The RGB images are randomly flipped horizontally (prob = 0.5) and vertically (prob = 0.1), random noise (prob = 0.4) and random blur (prob = 0.1) are added, random distortion of brightness, contrast, and saturation (each prob = 0.5) is applied, and images containing no glacier targets are removed. The final training set therefore contains not only the original image tiles but also their augmented versions, enriching the dataset. The same operations are applied to the label images, which form data pairs with the processed original images. A sketch of such an augmentation pipeline is given below.
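A minimal sketch of the paired image/label augmentation described above, using the stated probabilities. The function names, noise/blur magnitudes, and jitter ranges are assumptions for illustration, not values given in the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def augment_pair(image, label):
    """image: HxWx3 uint8 RGB tile; label: HxW mask. Returns an augmented pair."""
    img, lab = image.astype(np.float32), label.copy()
    if rng.random() < 0.5:                                   # random horizontal flip
        img, lab = img[:, ::-1], lab[:, ::-1]
    if rng.random() < 0.1:                                   # random vertical flip
        img, lab = img[::-1], lab[::-1]
    if rng.random() < 0.4:                                   # random noise
        img = img + rng.normal(0.0, 10.0, img.shape)
    if rng.random() < 0.1:                                   # random blur
        img = gaussian_filter(img, sigma=(1.0, 1.0, 0.0))
    if rng.random() < 0.5:                                   # brightness/contrast/saturation-style jitter
        mean = img.mean()
        img = (img - mean) * rng.uniform(0.8, 1.2) + mean + rng.uniform(-20.0, 20.0)
    return np.clip(img, 0, 255).astype(np.uint8), lab

def keep_tile(label, glacier_value=1):
    """Drop tiles whose label contains no glacier pixels."""
    return bool((label == glacier_value).any())
```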
In step 2, the high-resolution remote sensing glacier extraction model is the Glacier-Unet model, which adopts an encoder-decoder structure comprising a left half and a right half;
The left half is the encoder and the right half is the decoder. The encoder is divided into five scales; each scale comprises two convolution layers with the same number of output channels and 3×3 kernels, followed by a rectified linear unit (ReLU) and a max pooling with stride 2. The feature map is refined by scale-by-scale convolutional sampling (during feature extraction, each part of the input feature map is enlarged scale by scale and features are then extracted from the enlarged map; this process is called feature-map refinement), so that feature information at different scales is extracted and retained from the feature-map context. The number of feature-map channels is doubled at each scale, and the encoder is finally skip-linked to the decoder through a convolution layer, completing the contracting path. The sampling operation helps filter out unimportant high-frequency information, and the repeated convolution and pooling operations fully extract the high-level information of the remote sensing image in the corresponding feature dimensions;
The decoder in the right half constructs the segmentation map from the encoder features and gradually restores the high-level features of the remote sensing image. Each decoder layer up-samples the feature map: a 2×2 up-convolution halves the number of channels, the result is fused with the feature map of the corresponding encoder level, and two 3×3 convolutions follow, each followed by a ReLU. The skip link, i.e., the direct connection from encoder to decoder, concatenates the encoder output with the output of the up-sampling operation in the decoder and maps the concatenated features to the next layer; in this way as much detail as possible is preserved, improving the resolution and edge-detection accuracy of the final segmentation result. A sketch of one encoder stage and one decoder stage is given below.
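A minimal sketch of one encoder stage and one decoder stage of the U-shaped backbone described above. PyTorch is used purely for illustration (the patent reports a PaddlePaddle implementation); padding choices and class names are assumptions.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU (one encoder/decoder stage)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class EncoderStage(nn.Module):
    """DoubleConv followed by stride-2 max pooling; channels double per scale."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)
    def forward(self, x):
        skip = self.conv(x)              # kept for the skip link
        return self.pool(skip), skip

class DecoderStage(nn.Module):
    """2x2 up-convolution halves the channels, then fuse with the skip feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)   # in_ch = out_ch (upsampled) + out_ch (skip)
    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([skip, x], dim=1)         # concatenation at the skip link
        return self.conv(x)
```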
In step 2, a position attention module and a channel attention module are arranged at the skip link to adjust feature weights and emphasize the learning of glacier features.
In step 2, the position attention module and the channel attention module are connected in parallel to form the parallel dual-channel attention module P-DAM. Each module multiplies its output element-wise with its input feature map through a skip connection (the matrix containing different weight information that is generated at each step of model feature extraction is called a feature map). Given the specificity of the research target, the parallel connection is more flexible: it can adaptively adjust the position and channel weight parameters for specific problems, emphasize the information with higher weights, and effectively improve the generalization ability of the model. Glaciers of different shapes and areas require different feature extraction modes; the parallel connection provides more varied combinations of the output feature weight matrices, emphasizes the weights of the important feature extraction parts, and improves the effectiveness of the attention mechanism.
A gated multi-scale filter layer G-MsFL is constructed at the skip link; it provides the model with multi-scale feature extraction and fusion paths for capturing fine glaciers, while its gating mechanism effectively filters out useless feature information. The parallel dual-channel attention module P-DAM encodes the context information of glacier boundaries into the local features of the feature map, enhancing its feature expression capability;
At the skip link, the feature information down-sampled by the encoder is first filtered by the gated multi-scale filter layer G-MsFL. The filtered feature maps are fed into a position convolution block and a channel convolution block, which apply channel-wise linear transformations that change the number and dimension of feature-map channels, and the results are fed into the position attention module and the channel attention module respectively, so that features can be reconstructed in the subsequent attention modules. Semantic segmentation is a challenging task that requires both solid, consistent global context information and rich spatial information. Existing methods neglect the adaptive capture of effective features, and the lack of useful multi-scale information filtering prevents the generation of clear feature information. A gated multi-scale filtering path is therefore constructed to adaptively capture useful information. Gating is a common operation in recurrent networks: by generating a weight map and multiplying it with the features, the required feature information can be retained and noise removed; the gate measures the usefulness of each feature vector and controls the propagation of semantic information. The overall skip-link wiring is sketched together with the G-MsFL example below.
In step 2, in the position attention module, the local feature A ∈ ℝ^(C×H×W) is fed into convolution layers to generate three new feature maps B, C and D respectively, where B, C, D ∈ ℝ^(C×H×W); here C denotes the number of channels, H the image height, and W the image width. The three feature maps B, C and D are then reshaped into the ℝ^(C×N) space, where N = H × W is the number of pixels, yielding the feature matrices B1, C1 and D1 of the three feature maps B, C and D. The feature matrix B1 is matrix-multiplied with the feature matrix C1 and a softmax layer is applied to compute the spatial feature attention map S ∈ ℝ^(N×N), where s_{ji} represents the effect of position i on position j. A matrix multiplication is then performed between the transpose of the spatial feature attention map S and the feature matrix D1, the resulting feature map is reshaped back into the ℝ^(C×H×W) format, multiplied by a scale parameter α, and summed with the input feature map, finally generating the feature map E ∈ ℝ^(C×H×W); the scale parameter α is initialized to 0. The resulting feature map E, which contains position information, is a weighted sum of all position features and the original features. s_{ji} and E_j are calculated as follows:

$$s_{ji}=\frac{\exp(B1_i\cdot C1_j)}{\sum_{i=1}^{N}\exp(B1_i\cdot C1_j)},\qquad E_j=\alpha\sum_{i=1}^{N}\left(s_{ji}\,D1_i\right)+A_j$$

where exp(B1_i · C1_j) represents the fusion operation of the feature matrices B1 and C1; A_j denotes the input local feature map at position j; B1_i the entry of the feature matrix B1 at position i; C1_j the entry of the feature matrix C1 at position j; D1_i the entry of the reshaped feature matrix D1 at position i; and E_j the finally obtained feature at position j, with j ranging from 1 to N and j ≠ i.
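A minimal sketch of the position attention module described above. PyTorch is used for illustration (the patent reports a PaddlePaddle implementation), and the 1×1 convolutions producing B, C, and D are an assumption.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, 1)   # produces B
        self.conv_c = nn.Conv2d(channels, channels, 1)   # produces C
        self.conv_d = nn.Conv2d(channels, channels, 1)   # produces D
        self.alpha = nn.Parameter(torch.zeros(1))        # scale parameter, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        n_batch, c, h, w = a.shape
        n = h * w
        b1 = self.conv_b(a).reshape(n_batch, c, n)              # C x N
        c1 = self.conv_c(a).reshape(n_batch, c, n)
        d1 = self.conv_d(a).reshape(n_batch, c, n)
        s = self.softmax(torch.bmm(b1.transpose(1, 2), c1))     # N x N spatial attention map
        out = torch.bmm(d1, s.transpose(1, 2)).reshape(n_batch, c, h, w)
        return self.alpha * out + a                             # weighted sum with the input A
```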
In step 2, in the channel attention module, the local feature A ∈ ℝ^(C×H×W) is directly reshaped into the ℝ^(C×N) space. The transposed feature matrices C1 and D1 are first matrix-multiplied, and a softmax layer is then applied to obtain the channel attention map X ∈ ℝ^(C×C), where x_{ji} represents the effect of the i-th channel on the j-th channel. A matrix multiplication is performed between the channel attention map X and the feature matrix B1, and the result is multiplied by a scale coefficient β and reshaped into the ℝ^(C×H×W) format. Finally, a fused addition with the input A gives the output E ∈ ℝ^(C×H×W). The weight β is learned gradually from 0, and the final feature of each channel is a weighted sum of all reshaped channel features and the original features:

$$x_{ji}=\frac{\exp(C1_i\cdot D1_j)}{\sum_{i=1}^{C}\exp(C1_i\cdot D1_j)},\qquad E_j=\beta\sum_{i=1}^{C}\left(x_{ji}\,B1_i\right)+A_j$$

where E_j denotes the feature map of the j-th channel, A_j (A_i) the feature map of the local j-th (i-th) channel, and j ranges from 1 to C with j ≠ i.
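A minimal sketch of the channel attention module. PyTorch for illustration (the patent reports a PaddlePaddle implementation); here, as the patent states that the local feature A is reshaped directly, the attention map is computed from the reshaped input itself, which is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # scale coefficient, learned from 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        n_batch, c, h, w = a.shape
        a1 = a.reshape(n_batch, c, h * w)                      # C x N
        x = self.softmax(torch.bmm(a1, a1.transpose(1, 2)))    # C x C channel attention map
        out = torch.bmm(x, a1).reshape(n_batch, c, h, w)
        return self.beta * out + a                             # fused addition with the input A
```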
In step 2, the gated multi-scale filter layer G-MsFL works as follows: a 1×1 ordinary convolution is applied to the feature input X that the encoder passes to the decoder through the down-sampling skip link, and a normalized gating tensor X1 is then generated by a Sigmoid activation function; X1 lies in the range 0–1 and is used to dynamically adjust the weights of the outputs of the subsequent dilated convolution layers. Next, dilated convolution layers with dilation rates [2, 4, 8, 16] are applied to the input X, and batch normalization is applied to each generated multi-scale filtered feature map, producing four feature maps {X2, X3, X4, X5}. The gating tensor X1 is then multiplied element-wise with the feature input X and with the four multi-scale feature maps {X2, X3, X4, X5} to obtain {X11, X12, X13, X14, X15}, and finally {X11, X12, X13, X14, X15} are fused by weighting to complete the gated multi-scale filtering. A sketch of this layer and of the overall skip-link wiring is given below.
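A minimal sketch of the G-MsFL and of one skip link combining it with the parallel P-DAM branches, using the PositionAttention and ChannelAttention classes sketched above. PyTorch for illustration (the patent reports a PaddlePaddle implementation); implementing the weighted fusion as a plain sum and using 1×1 position/channel convolution blocks are assumptions.

```python
import torch
import torch.nn as nn

class GMsFL(nn.Module):
    """Gated multi-scale filter layer: a Sigmoid gate over the input plus four dilated branches."""
    def __init__(self, channels, rates=(2, 4, 8, 16)):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                nn.BatchNorm2d(channels),
            )
            for r in rates
        ])

    def forward(self, x):
        x1 = self.gate(x)                                        # normalized gating tensor X1 in (0, 1)
        feats = [x] + [branch(x) for branch in self.branches]    # X, X2, X3, X4, X5
        gated = [x1 * f for f in feats]                          # X11 ... X15
        return torch.stack(gated, dim=0).sum(dim=0)              # weighted fusion of the gated maps

class SkipLink(nn.Module):
    """One skip link: G-MsFL filtering followed by the parallel P-DAM branches."""
    def __init__(self, channels):
        super().__init__()
        self.gmsfl = GMsFL(channels)
        self.pos_conv = nn.Conv2d(channels, channels, 1)     # position convolution block
        self.chn_conv = nn.Conv2d(channels, channels, 1)     # channel convolution block
        self.pam = PositionAttention(channels)               # sketched above
        self.cam = ChannelAttention()                        # sketched above

    def forward(self, encoder_feat):
        x = self.gmsfl(encoder_feat)                         # filter useless multi-scale information
        return self.pam(self.pos_conv(x)) + self.cam(self.chn_conv(x))
```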
In step 2, inputting the training set and the validation set into the model for training and validation comprises: a freeze-training method is adopted to speed up model training, the number of training epochs is set to 300, and the batch size is set to 8;
The learning rate is optimized with an Adam optimizer, with the maximum learning rate set to 10⁻³ and the minimum learning rate to 10⁻³ × 0.01; the model uses a weight-decay strategy to prevent overfitting, with the weight decay set to 5 × 10⁻⁴. Around epoch 255, the loss function reaches a low value and fluctuates only slightly, all validation-set metrics reach their highest values, the optimal model is saved, training is stopped, and a model for identifying plateau glaciers in remote sensing images is obtained. A sketch of such a training configuration is given below.
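An illustrative training-configuration sketch in PyTorch (the patent reports a PaddlePaddle implementation; the stand-in model and the cosine schedule between the stated maximum and minimum learning rates are assumptions).

```python
import torch
import torch.nn as nn

EPOCHS, BATCH_SIZE = 300, 8
MAX_LR, MIN_LR, WEIGHT_DECAY = 1e-3, 1e-3 * 0.01, 5e-4

model = nn.Conv2d(3, 3, 3, padding=1)      # stand-in for the Glacier-Unet model
optimizer = torch.optim.Adam(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)
# The patent only states the maximum and minimum learning rates; sweeping between
# them with a cosine schedule is an assumption used here for illustration.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=MIN_LR)

# Per epoch: train, validate, step the scheduler, and save the checkpoint whenever
# the validation metrics reach a new best (around epoch 255 in the experiments above).
```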
Step 3 further comprises: evaluating model accuracy using the Kappa coefficient, pixel accuracy (PA), and mean intersection-over-union (MIoU):

$$PA=\frac{TP+TN}{TP+TN+FP+FN},\qquad Kappa=\frac{Num\sum_{m=1}^{n}x_{mm}-\sum_{m=1}^{n}x_{m+}\,x_{+m}}{Num^{2}-\sum_{m=1}^{n}x_{m+}\,x_{+m}},\qquad MIoU=\frac{1}{n}\sum_{m=1}^{n}\frac{TP_m}{TP_m+FP_m+FN_m}$$

where n is the total number of columns of the confusion matrix, i.e., the total number of sample categories (the total number of categories to be distinguished in the task, including background); x_{mm} is the element in the m-th row and m-th column of the confusion matrix, i.e., the number of correctly classified samples; x_{m+} and x_{+m} are the total numbers of samples in the m-th row and the m-th column respectively; Num is the total number of samples used for accuracy evaluation; TP denotes the targets in the test samples (from the test set) that are correctly predicted as targets; TN the non-targets correctly predicted as non-targets; FP the non-targets incorrectly predicted as targets; and FN the targets incorrectly predicted as non-targets.
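A minimal sketch of computing PA, Kappa, and MIoU from a confusion matrix with NumPy (the helper name and the example matrix are assumptions; the metric definitions follow the formulas above).

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[m, k] = number of pixels of true class m predicted as class k."""
    num = cm.sum()
    tp = np.diag(cm)                       # correctly classified samples per class
    row = cm.sum(axis=1)                   # x_{m+}
    col = cm.sum(axis=0)                   # x_{+m}
    pa = tp.sum() / num                    # overall pixel accuracy
    p_e = (row * col).sum() / num ** 2
    kappa = (pa - p_e) / (1 - p_e)         # equivalent to the Num-based formula above
    iou = tp / (row + col - tp)            # TP / (TP + FP + FN) per class
    return pa, kappa, iou.mean()

# example: a small 3-class confusion matrix (background, glacier, other)
cm = np.array([[50, 3, 2],
               [4, 40, 1],
               [2, 2, 30]], dtype=float)
print(metrics_from_confusion(cm))
```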
Beneficial effects: the data augmentation preprocessing of the plateau glacier image dataset enriches the glacier dataset so that the model can cope with different kinds of interference during training, improving its robustness and laying a foundation for glacier extraction over complex landforms. Two modules are designed, G-MsFL and P-DAM, which effectively mine fine and scattered glacier information in the image during model feature extraction while filtering out interference information, improving the efficiency of feature extraction.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is a flow diagram of fused filtered multi-scale high resolution remote sensing glacier extraction according to one embodiment.
FIG. 2 is a diagram of a fusion filtering multi-scale high resolution remote sensing glacier extraction model of one embodiment.
FIG. 3 is a schematic diagram of a parallel dual channel attention module according to one embodiment.
FIG. 4 is a schematic diagram of a position attention module of one embodiment.
FIG. 5 is a schematic diagram of a channel attention module of one embodiment.
FIG. 6 is a schematic diagram of a gated multi-scale filtration layer of one embodiment.
FIG. 7 is a visual comparison of Landsat-9 remote sensing images and the glacier recognition results of seven methods (colors distinguish the segmented content).
Detailed Description
In an embodiment of the present invention, a fusion filtering multi-scale high-resolution remote sensing glacier extraction method is provided, as shown in fig. 1, including the following steps:
Training phase: Step 1, data preprocessing of the dataset.
Because the glacier remote sensing images are too large to be fed directly into the network for training, they must be cropped to a specified size. In this embodiment, remote sensing images composed of combinations of bands 1–6 are selected for cropping and augmentation. The self-built dataset is cropped with a sliding-window scheme with an overlap rate of 0.1, cutting the original large glacier remote sensing TIFF images into 512 × 512 TIFF tiles. Multi-channel TIFF images contain information from many channels, so the files are large and require considerable storage and computing resources when large amounts of data are processed and stored, which is inconvenient for model training and validation. Converting the TIFF images to JPEG format and combining them into RGB images makes them easier to process with existing deep learning frameworks; although some information is lost, the requirements of model training are generally met. After cropping and format conversion, 1344 images of 512 × 512 pixels are obtained.
The training data and label data are read and divided into a training set, a validation set, and a test set in a 7:2:1 ratio. To address the shortage of deep learning training data, data augmentation is applied to the output images: random horizontal flipping (prob = 0.5) and random vertical flipping (prob = 0.1), random noise (prob = 0.4), random blurring (prob = 0.1), and random distortion of brightness, contrast, and saturation (each prob = 0.5). The final training images therefore include not only the original tiles but also their augmented versions, enriching the dataset. The same operations are applied to the label images, which form data pairs with the processed original images. Tiles produced during slicing that contain no glacier targets are removed, simplifying the dataset. A sketch of the cropping and splitting steps is given below.
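A minimal sketch of the 512 × 512 sliding-window cropping (overlap rate 0.1) and the 7:2:1 train/validation/test split. Helper names and the file-pair layout are assumptions for illustration.

```python
import random

def sliding_windows(height, width, tile=512, overlap=0.1):
    """Yield the top-left corners of 512x512 tiles with 10% overlap between tiles."""
    stride = int(tile * (1 - overlap))          # 460-pixel step for a 512 tile
    for top in range(0, max(height - tile, 0) + 1, stride):
        for left in range(0, max(width - tile, 0) + 1, stride):
            yield top, left

def split_dataset(pairs, ratios=(0.7, 0.2, 0.1), seed=0):
    """pairs: list of (image_path, label_path); returns train/val/test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * ratios[0])
    n_val = int(len(pairs) * ratios[1])
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]
```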
Step 2, constructing the high-resolution remote sensing glacier extraction model: the Glacier-Unet model algorithm is proposed, as shown in FIG. 2.
The high-resolution glacier remote sensing images contain complex land-cover information, including clouds, glaciers, vegetation, and mountain shadows. The experiment targets only glacier information, and all other features are classified uniformly as background. Complex background information strongly affects glacier extraction accuracy. Adding an attention mechanism allows part of the irrelevant information to be selectively ignored while the remaining glacier information undergoes weighted aggregation, raising the feature weights so that the model focuses on target-relevant information during feature extraction. The proposed parallel dual-channel attention module (P-DAM) is therefore added at the skip-link part of the U-Net network to adjust feature weights, emphasize the learning of glacier features, and strengthen the network's ability to extract glaciers of different areas. To improve the efficiency of integrating feature maps of different resolutions and strengthen the model's multi-scale information filtering, a gated multi-scale filter layer (G-MsFL) is constructed at the skip link.
To allow the position attention module and the channel attention module to learn the positional and channel information of the feature map synchronously and independently, the two attention modules are connected in parallel. Given the specificity of the research target, the parallel connection is more flexible: it can adaptively adjust the position and channel weight parameters for specific problems, emphasize the information with higher weights, and effectively improve the generalization ability of the model. Glaciers of different shapes and areas require different feature extraction modes; the parallel connection provides more varied combinations of the output feature weight matrices, emphasizes the weights of the important feature extraction parts, and improves the effectiveness of the attention mechanism. At the skip link, the feature information down-sampled by the encoder is filtered by the gated multi-scale filter layer; the filtered feature maps are fed into a position convolution block and a channel convolution block, which apply channel-wise linear transformations that change the number and dimension of feature-map channels, and are then fed into the position attention module and the channel attention module respectively, so that features can be reconstructed in the subsequent attention modules.
The two independent attention modules each multiply their output element-wise with their input feature map through a skip connection, which effectively improves the interaction between the original input feature information and the information produced by the attention modules, lets the model learn residual and incremental information simultaneously, and generates richer feature maps. This residual connection also helps improve model performance: the original information it introduces lets the model flexibly compare and select the feature information with higher weights and facilitates gradient propagation and model optimization. The parallel dual-channel attention module is shown in FIG. 3.
High-resolution glacier remote sensing images often show large intra-class differences, and the loss of context information around local features during identification easily leads to false detections and missed detections. The parallel dual-channel attention module therefore builds in a position attention module. Each element of the input sequence has a corresponding weight; this weight information is learned continuously by the model and used to compute weighted sums. The positional information of feature-map elements is important in model learning, since elements at different positions carry different semantic information. The position attention module better captures the local and global relations among elements of the sequence data and can encode rich context information into the local features of the feature map, thereby enhancing its feature expression capability, as shown in FIG. 4.
First, in the position attention module, the local feature A ∈ ℝ^(C×H×W) (the local feature A fed into the position attention module is produced by the position convolution block in FIG. 3) is fed into convolution layers to generate three new feature maps B, C and D respectively, where B, C, D ∈ ℝ^(C×H×W); here C denotes the number of channels, H the image height, and W the image width. Second, the three feature maps B, C and D are reshaped into the ℝ^(C×N) space, where N = H × W is the number of pixels, yielding the feature matrices B1, C1 and D1 of the three feature maps B, C and D. The feature matrix B1 is matrix-multiplied with the feature matrix C1 and a softmax layer is applied to compute the spatial feature attention map S ∈ ℝ^(N×N), where s_{ji} represents the effect of position i on position j. A matrix multiplication is then performed between the transpose of the spatial feature attention map S and the feature matrix D1, the resulting feature map is reshaped back into the ℝ^(C×H×W) format, multiplied by a scale parameter α, and summed with the input feature map, finally generating the feature map E ∈ ℝ^(C×H×W); the scale parameter α is initialized to 0. The resulting feature map E, which contains position information, is a weighted sum of all position features and the original features; it has a global contextual view and selectively aggregates context information by means of the spatial attention map, so that similar semantic features reinforce each other, improving intra-class compactness and semantic consistency. s_{ji} and E_j are calculated as follows:

$$s_{ji}=\frac{\exp(B1_i\cdot C1_j)}{\sum_{i=1}^{N}\exp(B1_i\cdot C1_j)},\qquad E_j=\alpha\sum_{i=1}^{N}\left(s_{ji}\,D1_i\right)+A_j$$

where exp(B1_i · C1_j) represents the fusion operation of the feature matrices B1 and C1; A_j denotes the input local feature map at position j; B1_i the entry of the feature matrix B1 at position i; C1_j the entry of the feature matrix C1 at position j; D1_i the entry of the reshaped feature matrix D1 at position i; and E_j the finally obtained feature at position j, with j ranging from 1 to N and j ≠ i.
High-resolution glacier remote sensing images contain complicated land-cover backgrounds, which make it difficult for the network to concentrate on the target area and cause omissions during glacier feature extraction. Meanwhile, features of some local areas may be missing during identification because of illumination, cloud and fog occlusion, and the like, degrading the accuracy of semantic segmentation. The parallel dual-channel attention module therefore also builds in a channel attention module. During feature extraction, the feature map is a stack of feature matrices from multiple channels, each containing different feature information. Channel attention analyzes the weight of each channel, highlighting the important channels that contain more information related to the identification target and suppressing unimportant ones. By exploiting the interdependencies among channel maps, interdependent feature maps are emphasized, improving the feature representation of specific semantics and the representational capability of the model, as shown in FIG. 5.
Unlike position attention, in the channel attention module the local feature A ∈ ℝ^(C×H×W) is directly reshaped into the ℝ^(C×N) space. The transposed feature matrices C1 and D1 are first matrix-multiplied, and a softmax layer is then applied to obtain the channel attention map X ∈ ℝ^(C×C), where x_{ji} represents the effect of the i-th channel on the j-th channel. A matrix multiplication is performed between the channel attention map X and the feature matrix B1, and the result is multiplied by a scale coefficient β and reshaped into the ℝ^(C×H×W) format. Finally, a fused addition with the input A gives the output E ∈ ℝ^(C×H×W). The weight β is learned gradually from 0, and the final feature of each channel is a weighted sum of all reshaped channel features and the original features:

$$x_{ji}=\frac{\exp(C1_i\cdot D1_j)}{\sum_{i=1}^{C}\exp(C1_i\cdot D1_j)},\qquad E_j=\beta\sum_{i=1}^{C}\left(x_{ji}\,B1_i\right)+A_j$$

where E_j denotes the feature map of the j-th channel, A_j (A_i) the feature map of the local j-th (i-th) channel, and j ranges from 1 to C with j ≠ i.
Semantic segmentation is a challenging task that requires both solid, consistent global context information and rich spatial information. Recent approaches neglect the adaptive capture of effective features, and the lack of useful multi-scale information filtering prevents the generation of clear feature information. A gated multi-scale filtering path is therefore constructed to adaptively capture useful information. Gating is a common operation in recurrent networks: by generating a weight map and multiplying it with the features, the required feature information can be retained and noise removed; the gate measures the usefulness of each feature vector and controls the propagation of semantic information.
Gated multi-scale filter layer G-MsFL: its structure is shown in FIG. 6. A 1×1 ordinary convolution is applied to the feature input X that the encoder down-samples and passes to the decoder through the skip connection, and a normalized gating tensor X1 is then generated by a Sigmoid activation function. This gating tensor lies in the range 0–1 and dynamically adjusts the weights of the outputs of the subsequent dilated convolution layers. Next, dilated convolution layers with dilation rates [2, 4, 8, 16] are applied to the input X to form {X2, X3, X4, X5}, with batch normalization applied to each generated multi-scale filtered feature map to improve training stability. The gating tensor X1 is then multiplied element-wise with the input X and with the four multi-scale outputs {X2, X3, X4, X5} to obtain {X11, X12, X13, X14, X15}, which are finally fused by weighting to complete the gated multi-scale filtering.
The left-to-right gated multi-scale path selects and filters the desired semantic context information; the multi-scale pyramid acts as a filter that controls information propagation and removes useless context information. Typically, the information between the encoder and decoder is connected by feature-map summation or concatenation, which means the decoder receives these features unfiltered. In fact, among the feature maps of different resolutions produced during down-sampling, feature information at different positions contributes differently to the final feature map. A gated multi-scale filter layer is therefore constructed at the skip link to select feature-map information of different resolutions efficiently. In short, the relation between the information entropy of feature maps at different resolutions and the label error map is studied, and on this basis a multi-scale gating mechanism is embedded to integrate the feature maps more effectively.
Because the method solves a three-class problem, the class probability of each pixel in the feature map is evaluated with a Softmax function, and the loss value is computed from it; the loss function is defined as:

$$Loss=-\frac{1}{U}\sum_{u=1}^{U}\sum_{q=1}^{Q}1\{y_u=q\}\,\log\frac{\exp\!\left(\theta_q^{T}x_u\right)}{\sum_{l=1}^{Q}\exp\!\left(\theta_l^{T}x_u\right)}$$

where y_u denotes the category label, which takes Q values; since the method studies three-class semantic segmentation, Q = 3; θ denotes the model parameters and θ^T their transpose; x_u denotes the u-th element of the pixel observation vector of the input image; and 1{·} is the indicator function.
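A minimal sketch of the per-pixel three-class softmax cross-entropy corresponding to the loss above, in PyTorch for illustration (the tensor shapes are assumptions).

```python
import torch
import torch.nn as nn

Q = 3                                    # three-class semantic segmentation
loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood, averaged over pixels

logits = torch.randn(2, Q, 512, 512)     # model output: (batch, classes, H, W)
labels = torch.randint(0, Q, (2, 512, 512))
loss = loss_fn(logits, labels)
```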
In this embodiment the programming language is Python and the deep learning framework is PaddlePaddle. The model is optimized with the Adam optimizer, with decay rates set to β₁ = 0.9 and β₂ = 0.999 to prevent overfitting. The maximum learning rate is set to 10⁻³ and the batch size to 8. Around epoch 255 (one epoch is one pass over all samples of the training set), the loss function levels off and the best model is saved. Six other models are chosen as comparison methods.
Quantitative results of the Glacier-Unet model on the test dataset are given using four semantic segmentation evaluation metrics: accuracy (Accuracy), the Dice coefficient (Dice), the Kappa coefficient, and the mean intersection-over-union (MIoU). Table 1 shows the quantitative results of the seven methods on the self-built test set. As shown in Table 1, SegNet has the worst glacier identification performance; because its feature extraction and feature fusion are too simple, its glacier identification capability is insufficient and easily affected by complex environments. Comparing DeeplabV3+ models with different feature extraction networks, the model using Resnet50 as the feature extraction network scores better than the one using Resnet18, indicating that the depth of the feature extraction network is directly related to glacier identification accuracy and that deeper feature extraction helps mine glacier information in complex environments. Comparing U-Net and the DeeplabV3+ series, which share an encoder-decoder structure, the test metrics of U-Net are almost the same as those of the DeeplabV3+_Resnet50 model. Analysis of the feature extraction of the two models shows that U-Net reduces the resolution of the feature map through several down-sampling steps, extracts feature information at different resolutions, fuses it with the up-sampled feature maps of the decoder through skip connections, and restores the feature map; DeeplabV3+_Resnet50 extracts deeper features with its backbone to generate shallow and deep feature maps, applies enhanced feature extraction to the deep feature maps to generate feature maps containing information at different scales, fuses them with the shallow features, and finally up-samples and restores them. Thanks to this enhanced feature extraction with different receptive fields, the problem of insufficient extraction of tiny glaciers can be alleviated in glacier remote sensing image extraction; the G-MsFL module is therefore embedded at the skip link of the U-Net model, extracting the encoder feature maps at different scales while effectively filtering interference information on the multi-scale feature maps, and the results show that the U-Net model with G-MsFL embedded improves clearly on all metrics. To further improve glacier recognition performance, the P-DAM combined attention module is designed; it helps the model adaptively emphasize the feature information of the dimensions with higher weights and provides selectable feature maps for decoder feature fusion, and all metrics of the U-Net model with P-DAM embedded improve as well. The Glacier-Unet model, which uses both the G-MsFL and P-DAM modules, has clear advantages in the overall evaluation metrics compared with the other methods, with an accuracy of 85.4%, a Dice coefficient of 80.03%, a Kappa coefficient of 0.5879, and a mean intersection-over-union of 0.6602, outperforming the remaining networks and showing excellent overall performance in the glacier recognition task.
TABLE 1
Method Accuracy/% Dice/% Kappa MIoU
SegNet 76.3 72.47 0.4232 0.5638
DeeplabV3+_Resnet18 78.1 74.32 0.4506 0.5842
DeeplabV3+_Resnet50 79.6 75.41 0.4633 0.5967
U-Net 79.3 74.98 0.4979 0.5941
U-Net + G-MsFL 81.8 77.02 0.532 0.6132
U-Net + P-DAM 82.9 77.92 0.5403 0.6244
Glacier-Unet 85.4 80.03 0.5879 0.6602
The proposed Glacier-Unet method and the six deep learning-based comparison methods were qualitatively analyzed on the test dataset, and the Glacier-Unet model was compared both laterally and longitudinally. The first row shows the remote sensing image slices and the corresponding ground-truth labels; the remaining methods are, in order: SegNet, DeeplabV3+_Resnet18, DeeplabV3+_Resnet50, U-Net, U-Net + G-MsFL, U-Net + P-DAM. FIG. 7 is a visual comparison of the identification of Landsat-9 remote sensing glaciers by the seven methods. As can be seen from the figure, the SegNet and DeeplabV3+_Resnet18 methods have a poor identification effect in glacier identification with serious loss of detail, because these two methods use fewer downsampling steps and a shallower feature extraction depth and therefore cannot acquire high-level semantic information well. The figure also shows that increasing the number of network downsampling and upsampling rounds and deepening the network feature extraction effectively improves glacier identification capability to a certain extent; however, when the four baseline methods are compared with the real labels, it is found that scattered tiny glaciers cannot be extracted thoroughly and that glacier contour details are identified insufficiently. The figure shows that the fine and scattered glacier extraction capability of the method embedding the gated multi-scale filter layer (G-MsFL) is greatly improved on the basis of the U-net network. The method embedding the parallel dual-channel attention module (P-DAM) also helps greatly with the restoration of glacier contour information. Finally, when the two modules are applied simultaneously to the U-net network, the resulting method is found to be closest to the ground-truth label: extraction of small glaciers inside the marked boxes is basically complete, glacier contour and edge features inside the boxes are extracted completely, and the method shows a certain robustness. By comparing learning rates across the modified models, it is found that a higher learning rate is not suitable for glacier extraction from complex remote sensing images, where feature extraction is difficult under the interference of complex environmental and ground-object factors; therefore, the learning rate is appropriately reduced compared with a general recognition task and the number of training rounds is increased, which is beneficial to extracting complex ground-object information.
The invention provides a fusion filtering multi-scale high-resolution remote sensing glacier extraction method. There are many ways to realize the technical scheme, and the above description is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also considered to fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented by using the prior art.

Claims (10)

1. The fusion filtering multi-scale high-resolution remote sensing glacier extraction method is characterized by comprising the following steps of:
Step 1, acquiring high-resolution remote sensing image data, preprocessing the high-resolution remote sensing image data to obtain glacier data sets and corresponding label data, and dividing the glacier data sets into training sets, verification sets and test sets according to a proportion;
Step 2, a high-resolution remote sensing glacier extraction model is built, a training set and a verification set are input into the model for training and verification, and a trained remote sensing glacier extraction model is obtained;
And step 3, inputting the test pictures in the test set into the model obtained in step 2 to perform glacier identification on the remote sensing pictures and to evaluate the accuracy of the model.
2. The method according to claim 1, wherein in step 1, the preprocessing of the high-resolution remote sensing image data specifically includes: cropping and augmenting the remote sensing images, cutting the original large-size glacier remote sensing tiff image into tiff image tiles by a sliding cropping mode, converting the tiff tiles into JPEG format, and combining them into RGB images;

the RGB images are randomly flipped horizontally and vertically, random noise and random blur are added, random warping is performed, and images that do not contain glacier targets are deleted.
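As a non-limiting illustration of the preprocessing in claim 2, the following NumPy sketch shows sliding-window cropping, random flips, additive noise and removal of glacier-free tiles; the tile size, noise level and helper names are assumptions, and random blur/warping are only indicated in comments.

```python
import numpy as np

def sliding_crop(image, tile=512, stride=512):
    """Cut a large remote-sensing array of shape (H, W, C) into fixed-size tiles."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles

def augment(img, mask, rng=np.random.default_rng()):
    """Random horizontal/vertical flip plus additive Gaussian noise on the image."""
    if rng.random() < 0.5:
        img, mask = np.fliplr(img).copy(), np.fliplr(mask).copy()
    if rng.random() < 0.5:
        img, mask = np.flipud(img).copy(), np.flipud(mask).copy()
    if rng.random() < 0.5:
        img = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(img.dtype)
    # Random blur and random warping would follow the same pattern (e.g. with OpenCV);
    # they are omitted here to keep the sketch short.
    return img, mask

def keep_glacier_tiles(tiles, masks):
    """Drop image tiles whose label contains no glacier pixels."""
    return [(t, m) for t, m in zip(tiles, masks) if m.any()]
```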
3. The method according to claim 2, wherein in step 2, the high-resolution remote sensing Glacier extraction model is a Glacier-Unet model, and the model adopts an encoding and decoding structure, including a left half and a right half;
Wherein the left half is the encoder portion and the right half is the decoding portion; the left half is divided into five scales, each scale comprises two convolution layers with the same number of output channels and a convolution kernel size of 3×3, each followed by a rectified linear unit (ReLU), and then a max pooling with stride 2; the feature maps are progressively downsampled by the pooled convolutions, extracting and retaining feature information at different scales from the context of the feature mapping, the number of channels of the feature map is doubled at each downsampling, and finally the encoding part is connected to the decoding part through a convolution layer by skip connections, completing the contracting path;

The decoder in the right half constructs a segmentation map from the features of the encoding part and gradually restores the high-level features of the remote sensing image; each layer in the decoding part comprises an upsampling of the feature mapping, in which a 2×2 up-convolution halves the number of channels of the feature map, the result is fused with the encoder feature map of the corresponding scale, and two 3×3 convolutions follow, each followed by a rectified linear unit (ReLU); the skip connection is a design that directly connects the encoder to the decoder, concatenating the encoder output with the output of the decoder upsampling operation and mapping the concatenated features to the next layer.
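As a non-limiting illustration, one encoder scale and one decoder scale of the encoding-decoding structure in claim 3 might look as follows in PaddlePaddle; the channel counts, class names and the assumption that the decoder input has twice the channels of its skip feature are illustrative, not taken from the claim.

```python
import paddle
import paddle.nn as nn

class DoubleConv(nn.Layer):
    """Two 3x3 convolutions with the same output channel count, each followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2D(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2D(out_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.block(x)

class EncoderStage(nn.Layer):
    """One encoder scale: double convolution, then 2x max pooling with stride 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        self.pool = nn.MaxPool2D(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.conv(x)              # kept for the skip connection
        return self.pool(skip), skip

class DecoderStage(nn.Layer):
    """One decoder scale: 2x2 up-convolution halves the channels, then fuse with the skip feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Conv2DTranspose(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)   # in_ch = out_ch (skip) + out_ch (upsampled)

    def forward(self, x, skip):
        x = self.up(x)
        x = paddle.concat([skip, x], axis=1)    # skip connection: concatenate encoder feature
        return self.conv(x)
```

Stacking five such encoder scales (doubling the channels each time) and four decoder scales reproduces the contracting and expanding paths described above.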
4. The method according to claim 3, wherein in step 2, a position attention module and a channel attention module are provided at the skip connection for adjusting feature weights and learning glacier features with emphasis.

5. The method of claim 4, wherein in step 2, the position attention module and the channel attention module are connected in parallel to form a parallel dual-channel attention module P-DAM; the position attention module and the channel attention module each multiply their output element by element with the input feature map in a skip-connection manner;

A gated multi-scale filter layer G-MsFL is constructed at the skip connection, providing the model with extraction and feature fusion modes at different scales to capture fine glaciers, while the gating mechanism effectively filters out useless feature information; the parallel dual-channel attention module P-DAM encodes the context information of the glacier boundary as local features of the feature map;

The feature information downsampled by the encoder part is filtered at the skip connection by the gated multi-scale filter layer G-MsFL; the filtered feature maps are input into a position convolution block and a channel convolution block respectively, a channel-wise linear transformation is applied to the input feature maps to change their channel number and channel dimension, and the resulting feature maps are fed into the position attention module and the channel attention module correspondingly.
6. The method of claim 5, wherein in step 2, in the position attention module, a local feature $A \in \mathbb{R}^{C \times H \times W}$ is input into convolution layers to generate three new feature maps B, C and D respectively, where $\mathbb{R}^{C \times H \times W}$ denotes the real space of dimension $C \times H \times W$, $C$ denotes the number of channels, H denotes the height of the image, and W denotes the width of the image; second, the three feature maps B, C and D are reconstructed into the real space $\mathbb{R}^{C \times N}$, thereby obtaining the feature matrices B1, C1 and D1 of the three feature maps B, C and D;

where $N = H \times W$ is the number of pixels; the feature matrix B1 is matrix-multiplied with the feature matrix C1 and a softmax layer is used to compute the spatial feature attention map $S \in \mathbb{R}^{N \times N}$, with $s_{ji}$ representing the effect of position i on position j; matrix multiplication is performed between the transposed matrix of the spatial feature attention map S and the feature matrix D1, and the generated feature map is reconstructed into the $\mathbb{R}^{C \times H \times W}$ format; the reconstructed feature map is multiplied by a scale parameter $\alpha$ and summed with the input feature map, finally generating the feature map $E \in \mathbb{R}^{C \times H \times W}$, where the scale parameter $\alpha$ is initially 0; finally, the feature map E containing position information is obtained as a weighted sum of all position features and the original features; $s_{ji}$ and $E_j$ are calculated as follows:

$$s_{ji}=\frac{\exp\left(B1_i \cdot C1_j\right)}{\sum_{i=1}^{N}\exp\left(B1_i \cdot C1_j\right)},\qquad E_j=\alpha\sum_{i=1}^{N}\left(s_{ji}\,D1_i\right)+A_j$$

where $B1_i \cdot C1_j$ represents the fusion operation of the feature matrices B1 and C1; $A_j$ represents the input local feature map at position j; $B1_i$ represents the matrix of the feature matrix B1 at position i; $C1_j$ represents the matrix of the feature matrix C1 at position j; $D1_i$ represents the matrix of the reconstructed feature matrix D1 at position i; $E_j$ represents the finally obtained feature map at position j, where j takes values from 1 to N and $j \neq i$.
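As a non-limiting sketch, the position attention computation of claim 6 might be written as follows in PaddlePaddle (the framework named in the embodiment). The use of 1×1 convolutions to generate B, C and D, the batch dimension handling, and the class and variable names are assumptions made for the sake of a runnable example.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class PositionAttention(nn.Layer):
    """Position attention: an N x N spatial attention map reweights D1, scaled by alpha."""
    def __init__(self, channels):
        super().__init__()
        # Three 1x1 convolutions produce the feature maps B, C and D of the claim.
        self.conv_b = nn.Conv2D(channels, channels, 1)
        self.conv_c = nn.Conv2D(channels, channels, 1)
        self.conv_d = nn.Conv2D(channels, channels, 1)
        # Scale parameter alpha, initialised to 0 and learned during training.
        self.alpha = self.create_parameter(
            shape=[1], dtype='float32',
            default_initializer=nn.initializer.Constant(0.0))

    def forward(self, a):                                   # a: (batch, C, H, W)
        n, c, h, w = a.shape
        b1 = self.conv_b(a).reshape([n, c, h * w])          # feature matrices in R^{C x N}
        c1 = self.conv_c(a).reshape([n, c, h * w])
        d1 = self.conv_d(a).reshape([n, c, h * w])
        # s_ji: softmax over positions of B1_i . C1_j -> (N, N) spatial attention map
        energy = paddle.matmul(b1.transpose([0, 2, 1]), c1)
        s = F.softmax(energy, axis=-1)
        # Weighted sum of D1 by the transposed attention map, reshaped back to (C, H, W)
        out = paddle.matmul(d1, s.transpose([0, 2, 1])).reshape([n, c, h, w])
        # E_j = alpha * sum_i s_ji * D1_i + A_j
        return self.alpha * out + a
```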
7. The method of claim 6, wherein in step 2, in the channel attention module, the local feature $A$ is directly reconstructed into $\mathbb{R}^{C \times N}$; first, the transposes of the feature matrices C1 and D1 are matrix-multiplied, and a softmax layer is then applied to obtain the channel attention map $X \in \mathbb{R}^{C \times C}$, with $x_{ji}$ representing the effect of the i-th channel on the j-th channel; a matrix multiplication operation is performed between the channel attention map X and the feature matrix B1, and the result is multiplied by a scale coefficient $\beta$ and reconstructed into the $\mathbb{R}^{C \times H \times W}$ format; finally, a fusion addition operation is performed with the input A to obtain the output $E \in \mathbb{R}^{C \times H \times W}$; $\beta$ gradually learns its weight starting from 0, and the final feature of each channel is a weighted sum of all reconstructed channel features and the original features:

$$x_{ji}=\frac{\exp\left(C1_i \cdot D1_j\right)}{\sum_{i=1}^{C}\exp\left(C1_i \cdot D1_j\right)},\qquad E_j=\beta\sum_{i=1}^{C}\left(x_{ji}\,B1_i\right)+A_j$$

where $E_j$ represents the feature map of the j-th channel and $B1_i$ represents the feature map of the local i-th channel, with j taking values from 1 to C and $j \neq i$.
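A corresponding non-limiting sketch of the channel attention computation in claim 7 is given below. Here B1, C1 and D1 are all taken to be the directly reshaped local feature A, which is an assumption made so the example is self-contained; names are hypothetical.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class ChannelAttention(nn.Layer):
    """Channel attention: a C x C attention map reweights the channel features, scaled by beta."""
    def __init__(self):
        super().__init__()
        # Scale coefficient beta, initialised to 0 and learned gradually during training.
        self.beta = self.create_parameter(
            shape=[1], dtype='float32',
            default_initializer=nn.initializer.Constant(0.0))

    def forward(self, a):                                   # a: (batch, C, H, W)
        n, c, h, w = a.shape
        # The claim reshapes the local feature A directly into a C x N matrix; B1, C1 and D1
        # are taken here as that same reshaped matrix (an assumption for this sketch).
        flat = a.reshape([n, c, h * w])
        energy = paddle.matmul(flat, flat.transpose([0, 2, 1]))   # (batch, C, C)
        x_map = F.softmax(energy, axis=-1)                        # x_ji: channel attention map
        out = paddle.matmul(x_map, flat).reshape([n, c, h, w])
        # E_j = beta * sum_i x_ji * B1_i + A_j
        return self.beta * out + a
```

In the P-DAM of claim 5, this module and the position attention module run in parallel on the filtered feature maps and their outputs are combined with the input feature map.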
8. The method of claim 7, wherein in step 2, the gated multi-scale filter layer G-MsFL is used to: perform a 1×1 ordinary convolution operation on the feature input X that the encoder passes to the decoder part through the downsampling skip connection, and then generate a normalized gating tensor X1 through a Sigmoid activation function, the gating tensor X1 lying in the range between 0 and 1 and being used to dynamically adjust the weight of the output of the subsequent dilated convolution layers; then, dilated convolution layers with dilation rates [2, 4, 8, 16] are applied to the input X, and a batch normalization operation is applied to each generated multi-scale filtering feature map, producing four feature maps {X2, X3, X4, X5}; then, the gating tensor X1 is multiplied element by element with the feature input X and the four multi-scale feature maps {X2, X3, X4, X5} to obtain {X11, X12, X13, X14, X15}; finally, the weighted fusion of {X11, X12, X13, X14, X15} completes the gated multi-scale filtering.
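A hedged PaddlePaddle sketch of the gated multi-scale filter layer described in claim 8 follows. Realising the final weighted fusion as concatenation followed by a 1×1 convolution is an assumption, since the claim does not specify how the fusion weights are obtained; class and variable names are illustrative.

```python
import paddle
import paddle.nn as nn

class GatedMultiScaleFilterLayer(nn.Layer):
    """G-MsFL sketch: a sigmoid gate modulates the input and four dilated-convolution branches."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution + Sigmoid produces the normalized gating tensor X1 in (0, 1).
        self.gate = nn.Sequential(nn.Conv2D(channels, channels, 1), nn.Sigmoid())
        # Dilated convolution branches with dilation rates [2, 4, 8, 16], each batch-normalized.
        self.branches = nn.LayerList([
            nn.Sequential(
                nn.Conv2D(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2D(channels))
            for d in (2, 4, 8, 16)])
        # Fusion of the five gated maps via concatenation + 1x1 convolution (an assumption).
        self.fuse = nn.Conv2D(5 * channels, channels, 1)

    def forward(self, x):
        x1 = self.gate(x)                                    # gating tensor X1
        gated = [x1 * x] + [x1 * branch(x) for branch in self.branches]   # X11 .. X15
        return self.fuse(paddle.concat(gated, axis=1))       # weighted fusion
```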
9. The method according to claim 8, wherein in step 2, inputting the training set and the verification set into the model for training and verification includes: a freeze-training method is adopted to speed up model training, the number of training rounds (epochs) is set to 300, and the batch size (batchsize) is set to 8; the learning rate is optimized using the Adam optimizer, with the maximum learning rate set to 10^-3 and the minimum learning rate set to 10^-3 × 0.01; the model uses a weight decay strategy to prevent overfitting, with the weight decay value set to 5 × 10^-4.
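Claim 9 mentions a freeze-training method to speed up model training. A minimal sketch of freezing and later unfreezing an encoder in PaddlePaddle is shown below, assuming the model exposes an `encoder` sub-layer (the attribute name is hypothetical). A typical schedule trains with the encoder frozen for the first part of the 300 epochs and then unfreezes it, although the claim does not specify the split.

```python
import paddle

def set_encoder_frozen(model, frozen=True):
    """Freeze-training helper: stop gradients through the encoder during the frozen
    stage; call again with frozen=False to fine-tune the whole network."""
    for param in model.encoder.parameters():
        param.stop_gradient = frozen
```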
10. The method of claim 9, wherein step 3 further comprises: evaluating the model accuracy using the Kappa coefficient, the pixel accuracy and the mean intersection over union MIoU:

$$\mathrm{Kappa}=\frac{num\sum_{m=1}^{n}x_{mm}-\sum_{m=1}^{n}x_{m+}\,x_{+m}}{num^{2}-\sum_{m=1}^{n}x_{m+}\,x_{+m}}$$

$$\mathrm{pixel\ accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{MIoU}=\frac{1}{n}\sum_{m=1}^{n}\frac{x_{mm}}{x_{m+}+x_{+m}-x_{mm}}$$

where n is the total number of columns of the confusion matrix, i.e., the total number of sample categories; $x_{mm}$ is the number of samples in the m-th row and m-th column of the confusion matrix, i.e., the number of correctly classified samples; $x_{m+}$ and $x_{+m}$ are the total number of samples in the m-th row and the total number of samples in the m-th column, respectively; num is the total number of samples used for precision assessment; TP represents the probability that a target in a test sample is correctly predicted as a target; TN represents the probability that a non-target in a test sample is correctly predicted as a non-target; FP represents the probability that a non-target in a test sample is incorrectly predicted as a target; FN represents the probability that a target in a test sample is incorrectly predicted as a non-target.
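A small, non-limiting NumPy sketch of the three indexes above, computed from an n × n confusion matrix whose rows are reference categories and whose columns are predicted categories; the function names are illustrative.

```python
import numpy as np

def kappa(confusion):
    """Kappa coefficient from an n x n confusion matrix (rows: reference, columns: prediction)."""
    num = confusion.sum()
    observed = np.trace(confusion)                                  # sum of x_mm
    chance = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum()  # sum of x_m+ * x_+m
    return (num * observed - chance) / (num ** 2 - chance)

def pixel_accuracy(tp, tn, fp, fn):
    """Pixel accuracy from the TP / TN / FP / FN counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def miou(confusion):
    """Mean intersection over union: IoU_m = x_mm / (x_m+ + x_+m - x_mm), averaged over classes."""
    diag = np.diag(confusion).astype(float)
    rows = confusion.sum(axis=1)
    cols = confusion.sum(axis=0)
    return np.mean(diag / (rows + cols - diag))
```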
CN202410571994.6A 2024-05-10 2024-05-10 Fusion filtering multi-scale high-resolution remote sensing glacier extraction method Active CN118135239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410571994.6A CN118135239B (en) 2024-05-10 2024-05-10 Fusion filtering multi-scale high-resolution remote sensing glacier extraction method

Publications (2)

Publication Number Publication Date
CN118135239A true CN118135239A (en) 2024-06-04
CN118135239B CN118135239B (en) 2024-07-05

Family

ID=91244439

Country Status (1)

Country Link
CN (1) CN118135239B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368843A (en) * 2020-03-06 2020-07-03 电子科技大学 Method for extracting lake on ice based on semantic segmentation
CN113283435A (en) * 2021-05-14 2021-08-20 陕西科技大学 Remote sensing image semantic segmentation method based on multi-scale attention fusion
CN115049936A (en) * 2022-08-12 2022-09-13 武汉大学 High-resolution remote sensing image-oriented boundary enhancement type semantic segmentation method
CN115830466A (en) * 2022-11-19 2023-03-21 山东科技大学 Glacier change remote sensing detection method based on deep twin neural network
CN117078943A (en) * 2023-10-17 2023-11-17 太原理工大学 Remote sensing image road segmentation method integrating multi-scale features and double-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张继平; 刘林山; 张镱锂; 聂勇; 张学儒; 张琴琴: "Object-oriented extraction of water body and glacier information in extremely high-altitude regions: a case study of the core zone of the Qomolangma National Nature Reserve", Journal of Geo-information Science (地球信息科学学报), no. 04, 15 August 2010 (2010-08-15) *

Also Published As

Publication number Publication date
CN118135239B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
CN113011427B (en) Remote sensing image semantic segmentation method based on self-supervision contrast learning
US11521379B1 (en) Method for flood disaster monitoring and disaster analysis based on vision transformer
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN111598174B (en) Model training method based on semi-supervised antagonistic learning and image change analysis method
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN112288647B (en) Remote sensing image cloud and shadow restoration method based on gating convolution
CN113343789A (en) High-resolution remote sensing image land cover classification method based on local detail enhancement and edge constraint
CN114444791A (en) Flood disaster remote sensing monitoring and evaluation method based on machine learning
CN113312993B (en) Remote sensing data land cover classification method based on PSPNet
CN112232328A (en) Remote sensing image building area extraction method and device based on convolutional neural network
CN114494821A (en) Remote sensing image cloud detection method based on feature multi-scale perception and self-adaptive aggregation
CN115631162A (en) Landslide hidden danger identification method, system, medium and equipment
CN116630818A (en) Plateau lake boundary online extraction method and system based on GEE and deep learning
Thati et al. A systematic extraction of glacial lakes for satellite imagery using deep learning based technique
CN114724023A (en) Twin network-based water body change detection method
CN115984714B (en) Cloud detection method based on dual-branch network model
CN118135239B (en) Fusion filtering multi-scale high-resolution remote sensing glacier extraction method
CN117058367A (en) Semantic segmentation method and device for high-resolution remote sensing image building
CN115760866A (en) Crop drought detection method based on remote sensing image
CN115661677A (en) Light-weight satellite image cloud detection method based on dark channel feature guidance
CN114998587A (en) Remote sensing image building semantic segmentation method and system
Gu et al. Muti-path Muti-scale Attention Network for Cloud and Cloud shadow segmentation
CN113343861A (en) Neural network model-based remote sensing image water body region extraction method
Wen Based on the improved depth residual Unet high-resolution remote sensing road extraction method
CN117671437B (en) Open stope identification and change detection method based on multitasking convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant