CN115527027A - Remote sensing image ground object segmentation method based on multi-feature fusion mechanism - Google Patents

Remote sensing image ground object segmentation method based on multi-feature fusion mechanism Download PDF

Info

Publication number
CN115527027A
Authority
CN
China
Prior art keywords
feature
layer
network
encoder
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210210255.5A
Other languages
Chinese (zh)
Inventor
崔梦天
李凯
郭曌阳
余伟
王琳
罗洪
李裕岚
赵海军
贺春林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Minzu University
Original Assignee
Southwest Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Minzu University
Priority to CN202210210255.5A
Publication of CN115527027A
Legal status: Pending

Links

Images

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a remote sensing image ground object segmentation method based on a multi-feature fusion mechanism, comprising the following steps: establishing a deep learning model based on an encoder-decoder architecture; inputting the image under test into the encoder for layer-by-layer convolution, and screening low-level semantic information through an attention mechanism to highlight target features and suppress background noise; the decoder receives the encoder's output, performs deconvolution, retains the intermediate results, and passes them upward; at the final output of the decoding network, the feature map retained independently at each layer is fused with the decoding network's result through multi-feature fusion, improving feature restoration accuracy; the resulting semantic label map, the same size as the original image, is then mapped onto the original image to visualize the segmentation result. By means of a lightweight channel attention mechanism and a deep multi-feature fusion mechanism, the invention fully exploits the network's multi-scale features and substantially improves segmentation accuracy without increasing the size of the data set required for training.

Description

Remote sensing image ground object segmentation method based on multi-feature fusion mechanism
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a remote sensing image ground object segmentation method based on a multi-feature fusion mechanism.
Background
Ground object segmentation of remote sensing images plays an important role in many geographic information applications, such as smart city construction, public transportation management, and environmental monitoring. The cost of acquiring high-definition remote sensing data keeps falling, for example through unmanned aerial vehicle (UAV) remote sensing imaging. Remote sensing platforms acquire massive volumes of image data through a variety of sensors, such as hyperspectral data and radar data, making the observed ground object information more timely and comprehensive. In recent years, with the wide application of deep learning in production and daily life, many researchers have used neural networks to extract information about target regions from image data; such research is of notable significance in the field of remote sensing image applications.
The rapid development of remote sensing sensors has steadily raised image resolution. Higher resolution means that images contain ever more ground object detail and far greater complexity: the edges of different ground objects overlap to some extent, the texture details of many complex ground objects are obscured by shadows and become hard to distinguish, and individual small-scale ground objects may fail to be recognized at all. These problems pose real technical difficulties for the semantic segmentation of high-resolution remote sensing images.
Disclosure of Invention
The invention aims to address the low accuracy and frequent misclassification of conventional remote sensing image ground object segmentation techniques by providing a remote sensing image ground object segmentation method based on a multi-feature fusion mechanism.
In order to solve the above technical problems, the invention adopts the following technical scheme: a remote sensing image ground object segmentation method based on a multi-feature fusion mechanism, comprising the following steps:
S1, establishing a deep neural network model based on an encoder-decoder architecture that combines a channel attention mechanism with a multi-feature fusion mechanism;
S2, inputting the original image under test into the encoder for layer-by-layer feature extraction and downsampling, and screening the large volume of low-level information through a channel attention mechanism to highlight target features, suppress background noise, and obtain global semantic information;
S3, receiving the encoder's output in the decoder and upsampling it, retaining and independently restoring the feature map from each upsampling step as a secondary analysis of the decoded information, performing feature fusion at the final output of the decoding network between the three independently retained groups of feature maps and the decoding network's own result, thereby improving feature restoration accuracy, and finally obtaining a semantic label map the same size as the original image.
Preferably, the encoder in step S2 consists of four network layers and is mainly responsible for encoding the feature map and extracting information. Each layer contains a standard 2D convolution, a depthwise separable convolution, and a standard pooling layer; the layers pass their outputs on in sequence from the first to the fourth, each working at 1/2 the scale of the previous layer's image, and the four layers' outputs are screened by a channel attention mechanism to obtain the final output feature maps of the four network layers, {F1, F2, F3, F4}. The network supports input images of any size. The convolution kernels in all four network layers are 3 × 3 with stride 1, and padding of 1 ensures that the convolution layers do not change the feature map size. The pooling layers use average pooling with kernel size 2, stride 2, and padding 1, so that each pooled feature map is 1/2 its original size.
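As an illustration, one encoder stage might be implemented as follows. This is a minimal sketch in Python/PyTorch, not the patent's own code: the class names and channel counts are assumptions, and only the layer types and their kernel, stride, and padding parameters come from the description above. Note that average pooling with kernel 2, stride 2, and padding 1 would actually yield a map slightly larger than half size, so the sketch uses padding 0 to realize the stated halving.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=1,
                                   padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class EncoderStage(nn.Module):
    # One of the four encoder layers: standard conv -> separable conv -> avg pool.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.sep_conv = DepthwiseSeparableConv(out_ch, out_ch)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # halves H and W

    def forward(self, x):
        feat = self.sep_conv(self.conv(x))  # same spatial size as the input
        return feat, self.pool(feat)        # feat feeds the attention screening;
                                            # the pooled map feeds the next stage

x = torch.randn(1, 3, 512, 512)
feat, down = EncoderStage(3, 64)(x)
print(feat.shape, down.shape)  # torch.Size([1, 64, 512, 512]) torch.Size([1, 64, 256, 256])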
Preferably, the decoding network in step S3 comprises four network layers, each consisting of two 3 × 3 depthwise separable convolutions and an upsampling layer. The decoder's input is the feature maps {F1, F2, F3, F4} output by the encoder's four network layers. The feature sizes of each two adjacent stages are aligned using a 1 × 1 convolution and upsampling; the two groups of feature maps are added and passed to the next network layer, yielding {U1, U2, U3, U4}; finally, a 3 × 3 depthwise separable convolution adjusts the channel count of {U1, U2, U3, U4} without changing the feature map size. The upsampling network on the backbone uses upsampling convolution with a scale factor of 2, so each layer doubles the feature map size and the output after the four layers is the original image size. The three independent upsampling restoration branches of the multi-feature fusion mechanism use sub-pixel convolution with a scale factor of 2, so each branch layer likewise doubles the feature map size, and at the final output of the backbone upsampling path the four groups of feature maps are fused to obtain the final result.
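A sub-pixel restoration branch could be sketched as below, again as a hedged Python/PyTorch sketch: sub-pixel convolution is realized with nn.PixelShuffle, and the helper names and channel counts are illustrative rather than taken from the patent.

import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    # One 2x sub-pixel upsampling step: a conv expands the channels 4x, then
    # PixelShuffle rearranges them into a feature map twice as large.
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.conv(x))

def make_branch(ch, num_doublings):
    # An independent restoration branch: stacks as many 2x sub-pixel steps as
    # needed to bring a retained feature map back to the original resolution.
    return nn.Sequential(*[SubPixelUp(ch) for _ in range(num_doublings)])

u2 = torch.randn(1, 64, 128, 128)  # e.g. a retained map at 1/4 scale
branch = make_branch(64, 2)        # two doublings: 128 -> 256 -> 512
print(branch(u2).shape)            # torch.Size([1, 64, 512, 512])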
Preferably, the remote sensing image ground object segmentation method in step S4 further comprises a training step after the step of establishing the deep neural network model based on the encoder-decoder architecture; the training step comprises inputting the training data into the neural network model and training it to obtain the optimal network weights.
Preferably, the training in step S5 specifically comprises: evaluating the model on two public high-resolution remote sensing image data sets, Potsdam and Vaihingen, each divided into three parts: 70% training set, 20% validation set, and 10% test set; cropping the data set images sequentially with a 512 × 512 sliding window, ensuring a 75% overlap between each window position and the previous one; applying random scaling, random horizontal flipping, and Gaussian and salt-and-pepper noise interference as preprocessing to prevent overfitting; setting the number of epochs to 100, initializing the learning rate to 0.00001, setting the batch size to 2, and setting the weight decay to 0.0001; and selecting the cross-entropy loss function commonly used in semantic segmentation, L(F, Y) = Loss(softmax(U(F)), Y), and the Adam optimizer commonly used in semantic segmentation tasks for training, where F is the encoder output, U is the decoder, Y is the ground-truth label map, and Loss is the cross-entropy loss function.
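A hedged sketch of this training setup in Python/PyTorch; the function name train and the train_loader argument are illustrative, and the loop assumes nn.CrossEntropyLoss, which applies softmax internally and therefore matches L(F, Y) = Loss(softmax(U(F)), Y).

import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, lr=1e-5, weight_decay=1e-4):
    # Hyperparameters as stated above: 100 epochs, learning rate 0.00001,
    # weight decay 0.0001; the batch size of 2 is configured on the DataLoader.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    for epoch in range(epochs):
        for images, labels in train_loader:   # labels: (B, H, W) class indices
            logits = model(images)            # (B, num_classes, H, W)
            loss = criterion(logits, labels)  # cross-entropy over all pixels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()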
In summary, with the above technical scheme, the remote sensing image ground object segmentation method comprises a model construction unit S1, a training unit S2, and a visualization unit S3. The model construction unit S1 establishes a deep neural network model consisting of an encoder S11 and a decoder S12: the encoder S11 performs layer-by-layer feature extraction and downsampling on the input original image and screens the large volume of low-level information through a channel attention mechanism to highlight target features, suppress background noise, and obtain global semantic information; the decoder S12 receives the encoder's output and upsamples it, retaining and independently restoring the feature map from each upsampling step as a secondary analysis of the decoded information, and at the final output of the decoding network fuses the independently retained feature maps with the decoding network's result, improving feature restoration accuracy and finally producing a semantic label map the same size as the original image. The training unit S2 operates as follows:
S21, the model is evaluated on two public high-resolution remote sensing image data sets, Potsdam and Vaihingen, each divided into three parts: 70% training set, 20% validation set, and 10% test set; S22, the data set images are cropped sequentially with a 512 × 512 sliding window, ensuring a 75% overlap between each window position and the previous one; S23, random scaling, random horizontal flipping, and Gaussian and salt-and-pepper noise interference are applied as preprocessing to prevent overfitting; S24, the number of epochs is set to 100, the learning rate to 0.00001, the batch size to 2, and the weight decay to 0.0001; S25, the cross-entropy loss function commonly used in semantic segmentation, L(F, Y) = Loss(softmax(U(F)), Y), is selected, and the Adam optimizer commonly used in semantic segmentation tasks is used for training, where F is the encoder output, U is the decoder, Y is the ground-truth label map, and Loss is the cross-entropy loss function. The visualization unit S3 maps the semantic label map onto the original image to visualize the segmentation result.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the overall network model of the present invention;
FIG. 3 is a block diagram of an encoder;
FIG. 4 is a block diagram of a decoder;
FIG. 5 is an example of the data augmentation methods;
FIG. 6 is a visualization of results on the Potsdam data set;
FIG. 7 is a visualization of results on the Vaihingen data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except where the features or steps are mutually exclusive.
Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
As shown in fig. 1, in the embodiment of the present invention, the remote sensing image ground object segmentation method comprises the following steps:
S1, establishing a deep neural network model based on an encoder-decoder architecture that combines a channel attention mechanism with a multi-feature fusion mechanism;
S2, acquiring a public remote sensing image data set;
S3, inputting the training data set into the deep neural network model for training to obtain the optimal network weights;
S4, inputting the original image under test into the encoder for layer-by-layer feature extraction and downsampling, and screening the large volume of low-level information through a channel attention mechanism to highlight target features, suppress background noise, and obtain global semantic information; receiving the encoder's output in the decoder and upsampling it, retaining and independently restoring the feature map from each upsampling step as a secondary analysis of the decoded information, performing multi-feature fusion at the final output of the decoding network between the independently retained feature maps and the decoding network's own result, and finally obtaining a semantic label map the same size as the original image;
S5, mapping the semantic label map onto the original image to visualize the segmentation result.
In this example, the overall structure of the network is shown in fig. 2: after the remote sensing image is input into the network, the semantic classification map is generated by the encoder and decoder. Step S1 specifically includes:
S101: the encoding network based on the channel attention mechanism consists of four network layers and is mainly responsible for encoding the feature map and extracting information. Each layer contains a standard 2D convolution and a depthwise separable convolution; the layers pass their outputs on in sequence from the first to the fourth, each working at 1/2 the scale of the previous layer's image, and the four layers' outputs are screened by a channel attention mechanism to obtain the final output feature maps of the four network layers, {F1, F2, F3, F4}.
S102: the decoding network based on the multi-feature fusion mechanism comprises four network layers, each consisting of two 3 × 3 depthwise separable convolutions and an upsampling layer. The decoder's input is the feature maps {F1, F2, F3, F4} output by the encoder's four network layers. The feature sizes of each two adjacent stages are aligned using a 1 × 1 convolution and upsampling; the two groups of feature maps are added and passed to the next network layer, yielding {U1, U2, U3, U4}; finally, {U1, U2, U3, U4} are brought to the original size by a 3 × 3 depthwise separable convolution and upsampling, and the four groups of feature maps are added to obtain the final feature map.
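The adjacent-stage fusion might look as follows; this is a sketch under assumptions, since the patent specifies only the 1 × 1 convolution, the 2× upsampling, and the element-wise addition, while the channel counts and the bilinear interpolation mode here are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_adjacent(deep, shallow, align):
    # A 1x1 convolution (align) matches the channel count of the deeper,
    # smaller map; 2x upsampling matches its spatial size; the two adjacent
    # stages are then added element-wise.
    up = F.interpolate(align(deep), scale_factor=2, mode='bilinear',
                       align_corners=False)
    return up + shallow

align = nn.Conv2d(128, 64, kernel_size=1)  # illustrative channel counts
f4 = torch.randn(1, 128, 32, 32)           # deeper stage output
f3 = torch.randn(1, 64, 64, 64)            # adjacent shallower stage output
print(fuse_adjacent(f4, f3, align).shape)  # torch.Size([1, 64, 64, 64])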
Step S2, acquiring the public remote sensing image data sets, specifically comprises the following:
the model evaluation method uses two high-resolution remote sensing image data sets Potsdam data sets and Vaihingen data sets disclosed by ISPRS to evaluate the model. The two data sets are two high-resolution data sets obtained by aerial photography of an airborne remote sensing device, and both data sets comprise real orthographic pictures with very high resolution and corresponding digital label images derived from intensive image matching technology. Both data set regions cover urban scenes. Vaihingen is a relatively small village with numerous individual buildings, small multi-storied buildings, and Potsdam is a typical famous city of historical culture with narrow streets and dense living structures. The Potsdam dataset contains 38 images in total, both RGB3 channel images in tif format, with both images having a spatial resolution of 5cm and all pixel sizes. The data set provides a corresponding label image for each image, the size and format of the label image are consistent with those of the original image, and the labels are divided into 6 types, namely, impervious surfaces, buildings, short plants, trees, vehicles and the like. The Vaihingen dataset contains 33 images in total, RGB3 channel images in tif format, 9cm spatial resolution of the images, but not uniform in pixel size, with an average pixel size equal to that of the Potsdam dataset, which provides each image with its corresponding label image, size and format identical to the original, and 6 types of labels, including impervious surfaces, buildings, dwarf plants, trees, cars, and others.
The two data sets were divided into three parts, training set 70%, validation set 20%, and test set 10%.
Step S3 specifically includes:
To prevent overfitting, the data are preprocessed, which includes:
cropping the images sequentially with a 512 × 512 sliding window, ensuring a 75% overlap between each window position and the previous one; random scaling, with a scale range of [0.5, 2.0]; random horizontal flipping; and added Gaussian noise and salt-and-pepper noise interference. The data augmentation is illustrated in fig. 5. The loss function is the cross-entropy loss commonly used in semantic segmentation, and the optimizer is the Adam optimizer commonly used in semantic segmentation tasks; the number of epochs is set to 100, the learning rate is initialized to 0.00001, the batch size is set to 2, and the weight decay is set to 0.0001:
L(F, Y) = Loss(softmax(U(F)), Y)
where F is the encoder output, U is the decoder, Y is the ground-truth label map, and Loss is the cross-entropy loss function.
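A sketch of the cropping and augmentation pipeline in Python/NumPy; the noise magnitude and salt-and-pepper density are assumptions, since the patent does not specify them, and the random-scaling step is omitted for brevity.

import numpy as np

def sliding_crops(image, size=512, overlap=0.75):
    # Yields size x size crops; each window moves by size * (1 - overlap)
    # = 128 px, so consecutive positions overlap by 75%, as specified.
    stride = int(size * (1 - overlap))
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

def augment(patch, rng=None):
    # Random horizontal flip plus Gaussian and salt-and-pepper noise.
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        patch = patch[:, ::-1].copy()           # horizontal flip
    patch = patch.astype(np.float64)
    patch += rng.normal(0.0, 5.0, patch.shape)  # Gaussian noise (assumed sigma)
    mask = rng.random(patch.shape[:2])
    patch[mask < 0.005] = 0                     # pepper (assumed density)
    patch[mask > 0.995] = 255                   # salt
    return np.clip(patch, 0, 255).astype(np.uint8)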
In step S4, the original image to be tested is input to obtain a semantic label map the same size as the original image; in the testing stage the input image requires no preprocessing, and the segmented label map is obtained directly.
In step S5, the semantic label map is mapped onto the original image to visualize the segmentation result. Specifically, different semantic classes are mapped to different colors and overlaid on the original image, giving the segmentation result an intuitive visual form. FIG. 6 shows the visualization results on the Potsdam data set, and FIG. 7 shows those on the Vaihingen data set.
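The overlay step could be sketched as below; the palette values are assumptions (they follow the color convention commonly used for the ISPRS labels, but any distinct colors suffice for visualization).

import numpy as np

# One color per class index: impervious surfaces, buildings, low vegetation,
# trees, cars, clutter/background (assumed palette).
PALETTE = np.array([
    [255, 255, 255],
    [0, 0, 255],
    [0, 255, 255],
    [0, 255, 0],
    [255, 255, 0],
    [255, 0, 0],
], dtype=np.uint8)

def visualize(image, label_map, alpha=0.5):
    # Map each predicted class index to its color and blend the color map
    # over the original image.
    color = PALETTE[label_map]  # (H, W) indices -> (H, W, 3) colors
    return ((1 - alpha) * image + alpha * color).astype(np.uint8)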
[Tables: quantitative evaluation results on the Potsdam and Vaihingen data sets; rendered as images in the original publication and not reproduced here.]
For the Potsdam data set, the two tables above show that the SegNet network slightly outperforms the Unet network on all evaluation indexes. The two traditional networks (Unet, SegNet) already achieve good segmentation on three classes of entities, Impervious surfaces, Building, and Car; for Building in particular, segmentation is stable above 90%. For the three classes with low feature discriminability, Low vegetation, Tree, and Clutter, however, the segmentation results are unsatisfactory. Compared with Unet, ASUnet improves the class Low vegetation by 3.69% and the class Tree by 3.31%, showing a clear gain in distinguishing low vegetation and green vegetation features, and it improves background recognition by 7.72%. The proposed model improves further on the basis of ASUnet: its intersection-over-union is improved by up to 9.82%, especially for trees, and on the comprehensive evaluation indexes its F1-Score increases by 2.76%, 4.61%, and 3.14%. Compared with the Unet network, its F1-Score on the comprehensive evaluation indexes increases by 4.94%, 8.04%, and 5.86%.
For the Vaihingen data set, segmentation of the vehicle class is poorer, but segmentation of the Background class is much better than on the Potsdam data set, because the class distribution differs between data sets and the effect on individual classes therefore varies slightly. From the segmentation results in the table, the SegNet network again slightly outperforms the Unet network on all evaluation indexes; the two traditional networks achieve good segmentation on four classes of entities, Impervious surfaces, Building, Tree, and Clutter; and the evaluation results of the four groups of models trend upward across the classes. The proposed model improves on the basis of ASUnet: for the Low vegetation class, whose segmentation is comparatively poor, the intersection-over-union improves by up to 8.85% over the improved ASUnet network, and on the comprehensive evaluation indexes the F1-Score increases by 2.8%, 4.8%, and 3.93%. Compared with the Unet network, the F1-Score on the comprehensive evaluation indexes increases by 5.27%, 8.8%, and 6.72%.
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various combinations, modifications, and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A remote sensing image ground object segmentation method based on a multi-feature fusion mechanism, characterized by comprising the following steps:
establishing a deep neural network model based on an encoder-decoder architecture:
S1, inputting the original image under test into the encoder for layer-by-layer feature extraction and downsampling, and screening the large volume of low-level information through a channel attention mechanism to highlight target features, suppress background noise, and obtain global semantic information;
S2, receiving the encoder's output in the decoder and upsampling it, retaining and independently restoring the feature map from each upsampling step as a secondary analysis of the decoded information, performing multi-feature fusion at the final output of the decoding network between the independently retained feature maps and the decoding network's own result, thereby improving feature restoration accuracy, and finally obtaining a semantic label map the same size as the original image;
S3, mapping the semantic label map onto the original image to visualize the segmentation result.
2. The remote sensing image ground object segmentation method based on a multi-feature fusion mechanism according to claim 1, wherein step S1 specifically comprises:
the encoder consists of 4 network layers and is mainly responsible for encoding the feature map and extracting information; each layer contains a standard 2D convolution, a depthwise separable convolution, and a pooling layer; the layers pass their outputs on in sequence from the first to the fourth, each working at 1/2 the scale of the previous layer's image, and the four layers' outputs are screened by a channel attention mechanism to obtain the final output feature maps of the four network layers, {F1, F2, F3, F4}.
3. The remote sensing image ground object segmentation method based on a multi-feature fusion mechanism according to claim 1, wherein step S2 specifically comprises:
the decoding network comprises four network layers, each consisting of two 3 × 3 depthwise separable convolutions and an upsampling layer; the decoder's input is the feature maps {F1, F2, F3, F4} output by the encoder's four network layers; the feature sizes of each two adjacent stages are aligned using a 1 × 1 convolution and upsampling, and the two groups of feature maps are added and passed to the next network layer, yielding {U1, U2, U3, U4}; finally, {U1, U2, U3, U4} are brought to the original size by a 3 × 3 depthwise separable convolution and upsampling, and the four groups of feature maps are added to obtain the final feature map.
4. The remote sensing image ground object segmentation method based on a multi-feature fusion mechanism according to any one of claims 1-3, characterized in that:
the method further comprises a training step after the step of establishing the deep neural network model based on the encoder-decoder architecture; the training step comprises inputting the training data into the neural network model and training it to obtain the optimal network weights.
5. The remote sensing image ground object segmentation method based on a multi-feature fusion mechanism according to claim 4, characterized in that the training step specifically comprises:
evaluating the model on two public high-resolution remote sensing image data sets, Potsdam and Vaihingen, each divided into three parts: 70% training set, 20% validation set, and 10% test set; cropping the data set images sequentially with a 512 × 512 sliding window, ensuring a 75% overlap between each window position and the previous one; applying random scaling, random horizontal flipping, and Gaussian and salt-and-pepper noise interference as preprocessing to prevent overfitting; setting the number of epochs to 100, initializing the learning rate to 0.00001, setting the batch size to 2, and setting the weight decay to 0.0001; and selecting the cross-entropy loss function commonly used in semantic segmentation, L(F, Y) = Loss(softmax(U(F)), Y), and the Adam optimizer commonly used in semantic segmentation tasks for training, where F is the encoder output, U is the decoder, Y is the ground-truth label map, and Loss is the cross-entropy loss function.
6. A remote sensing image ground object segmentation system based on a multi-feature fusion mechanism, characterized by comprising:
a model construction unit, a training unit, and a visualization unit; the model construction unit is used for establishing a deep neural network model consisting of an encoder and a decoder: the encoder performs layer-by-layer feature extraction and downsampling on the input original image and screens the large volume of low-level information through a channel attention mechanism to highlight target features, suppress background noise, and obtain global semantic information; the decoder receives the encoder's output and upsamples it, retaining and independently restoring the feature map from each upsampling step as a secondary analysis of the decoded information (as shown in the network structure diagram), performing multi-feature fusion at the final output of the decoding network between the independently retained feature maps and the decoding network's own result, thereby improving feature restoration accuracy and finally obtaining a semantic label map the same size as the original image; the training unit evaluates the model on two public high-resolution remote sensing image data sets, Potsdam and Vaihingen, each divided into three parts: 70% training set, 20% validation set, and 10% test set; crops the data set images sequentially with a 512 × 512 sliding window, ensuring a 75% overlap between each window position and the previous one; applies random scaling, random horizontal flipping, and Gaussian and salt-and-pepper noise interference as preprocessing to prevent overfitting; sets the number of epochs to 100, initializes the learning rate to 0.00001, sets the batch size to 2, and sets the weight decay to 0.0001; and selects the cross-entropy loss function commonly used in semantic segmentation, L(F, Y) = Loss(softmax(U(F)), Y), and the Adam optimizer commonly used in semantic segmentation tasks for training, where F is the encoder output, U is the decoder, Y is the ground-truth label map, and Loss is the cross-entropy loss function; and the visualization unit is used for mapping the semantic label map onto the original image to visualize the segmentation result.
CN202210210255.5A 2022-03-04 2022-03-04 Remote sensing image ground object segmentation method based on multi-feature fusion mechanism Pending CN115527027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210210255.5A CN115527027A (en) 2022-03-04 2022-03-04 Remote sensing image ground object segmentation method based on multi-feature fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210210255.5A CN115527027A (en) 2022-03-04 2022-03-04 Remote sensing image ground object segmentation method based on multi-feature fusion mechanism

Publications (1)

Publication Number Publication Date
CN115527027A true CN115527027A (en) 2022-12-27

Family

ID=84693589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210210255.5A Pending CN115527027A (en) 2022-03-04 2022-03-04 Remote sensing image ground object segmentation method based on multi-feature fusion mechanism

Country Status (1)

Country Link
CN (1) CN115527027A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309623A (en) * 2023-05-17 2023-06-23 广东电网有限责任公司湛江供电局 Building segmentation method and system with multi-source information fusion enhancement
CN116912699A (en) * 2023-09-06 2023-10-20 交通运输部天津水运工程科学研究所 Port oil spill diffusion trend prediction method and system based on image processing

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062964A (en) * 2019-11-28 2020-04-24 深圳市华尊科技股份有限公司 Image segmentation method and related device
CN111222580A (en) * 2020-01-13 2020-06-02 西南科技大学 High-precision crack detection method
CN112132156A (en) * 2020-08-18 2020-12-25 山东大学 Multi-depth feature fusion image saliency target detection method and system
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network
CN113469052A (en) * 2021-07-02 2021-10-01 重庆市地理信息和遥感应用中心 Super-resolution building fine identification method based on multi-scale feature deconvolution
CN113657480A (en) * 2021-08-13 2021-11-16 江南大学 Clothing analysis method based on feature fusion network model
CN113688836A (en) * 2021-09-28 2021-11-23 四川大学 Real-time road image semantic segmentation method and system based on deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062964A (en) * 2019-11-28 2020-04-24 深圳市华尊科技股份有限公司 Image segmentation method and related device
CN111222580A (en) * 2020-01-13 2020-06-02 西南科技大学 High-precision crack detection method
CN112132156A (en) * 2020-08-18 2020-12-25 山东大学 Multi-depth feature fusion image saliency target detection method and system
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network
CN113469052A (en) * 2021-07-02 2021-10-01 重庆市地理信息和遥感应用中心 Super-resolution building fine identification method based on multi-scale feature deconvolution
CN113657480A (en) * 2021-08-13 2021-11-16 江南大学 Clothing analysis method based on feature fusion network model
CN113688836A (en) * 2021-09-28 2021-11-23 四川大学 Real-time road image semantic segmentation method and system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢文鑫 (XIE Wenxin): "Left Ventricular Image Segmentation Method Based on Fully Convolutional Neural Networks" ("基于全卷积神经网络的左心室图像分割方法"), 《软件导刊》 (Software Guide) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309623A (en) * 2023-05-17 2023-06-23 广东电网有限责任公司湛江供电局 Building segmentation method and system with multi-source information fusion enhancement
CN116309623B (en) * 2023-05-17 2023-08-18 广东电网有限责任公司湛江供电局 Building segmentation method and system with multi-source information fusion enhancement
CN116912699A (en) * 2023-09-06 2023-10-20 交通运输部天津水运工程科学研究所 Port oil spill diffusion trend prediction method and system based on image processing
CN116912699B (en) * 2023-09-06 2023-12-05 交通运输部天津水运工程科学研究所 Port oil spill diffusion trend prediction method and system based on image processing

Similar Documents

Publication Publication Date Title
CN110136170B (en) Remote sensing image building change detection method based on convolutional neural network
CN111767801B (en) Remote sensing image water area automatic extraction method and system based on deep learning
CN111986099B (en) Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
Wurm et al. Object-based image information fusion using multisensor earth observation data over urban areas
Hu et al. Quantifying the shape of urban street trees and evaluating its influence on their aesthetic functions based on mobile lidar data
CN115527027A (en) Remote sensing image ground object segmentation method based on multi-feature fusion mechanism
CN105069468A (en) Hyper-spectral image classification method based on ridgelet and depth convolution network
CN113312993B (en) Remote sensing data land cover classification method based on PSPNet
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN110866494A (en) Optical remote sensing image-based town group extraction method and system
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN112001293A (en) Remote sensing image ground object classification method combining multi-scale information and coding and decoding network
CN114943902A (en) Urban vegetation unmanned aerial vehicle remote sensing classification method based on multi-scale feature perception network
CN116091937A (en) High-resolution remote sensing image ground object recognition model calculation method based on deep learning
CN111950658B (en) Deep learning-based LiDAR point cloud and optical image priori coupling classification method
CN114612315A (en) High-resolution image missing region reconstruction method based on multi-task learning
Maravelakis et al. Automatic building identification and features extraction from aerial images: Application on the historic 1866 square of Chania Greece
CN115862010A (en) High-resolution remote sensing image water body extraction method based on semantic segmentation model
CN115984603A (en) Fine classification method and system for urban green land based on GF-2 and open map data
CN116385716A (en) Three-dimensional map ground object data automatic production method based on remote sensing map
CN112000758B (en) Three-dimensional urban building construction method
CN115018859A (en) Urban built-up area remote sensing extraction method and system based on multi-scale space nesting
Mahphood et al. Virtual first and last pulse method for building detection from dense LiDAR point clouds
Evans et al. A season independent U-net model for robust mapping of solar arrays using Sentinel-2 imagery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221227