CN111582104A - Semantic segmentation method and device for remote sensing image - Google Patents
- Publication number
- CN111582104A (application number CN202010350688.1A)
- Authority
- CN
- China
- Prior art keywords
- remote sensing
- layer
- sensing image
- segmented
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention relates to a method and a device for semantic segmentation of remote sensing images, comprising the following steps: acquiring a remote sensing image to be segmented; inputting the remote sensing image to be segmented into a pre-established self-attention multi-scale feature aggregation network and obtaining the initial prediction result output by that network; and up-sampling the initial prediction result to the size of the remote sensing image to be segmented to obtain the final prediction result. The technical scheme provided by the invention effectively strengthens the correlation between modal features and spatial information, improves the perception of context information for multi-scale targets, and yields more refined semantic annotation results.
Description
Technical Field
The invention relates to the technical field of remote sensing image processing, and in particular to a method and a device for semantic segmentation of remote sensing images.
Background
In recent years, as deep learning research in image processing has deepened, deep-learning-based image processing methods, particularly fully convolutional neural networks, have developed rapidly in the remote sensing field. In remote sensing image processing, semantic segmentation obtains pixel-level class labels for targets and has broad application prospects in land planning, wartime reconnaissance, environmental monitoring and other fields. However, deep-learning-based semantic segmentation is data-driven and requires a large amount of accurately labeled data. Traditional manual labeling is costly and inefficient, so improving labeling efficiency and accuracy is particularly important.
Existing semantic annotation methods are sensitive to the noise introduced by complex backgrounds in remote sensing scenes and have poor semantic perception of multi-scale ground-feature elements. Atrous (dilated) convolution is commonly used to enlarge the receptive field of a convolutional neural network; however, existing multi-scale atrous structures offer only a limited range and variety of receptive-field sizes, cannot label the complex ground-feature elements of high-resolution remote sensing scenes, and struggle to capture semantic information when multi-scale elements produce large scale differences.
Another way to enhance semantic annotation in remote sensing scenes is to exploit the rich features of multi-modal data. Existing methods, however, simply concatenate or add multi-modal images or features, leaving feature learning entirely to the convolutional neural network. They ignore the inherent differences in data structure and feature complexity between modalities, easily introduce redundant features that degrade labeling performance, and inflate network size and parameter count.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method and a device for semantic segmentation of remote sensing images that effectively strengthen the correlation between modal features and spatial information, improve the perception of context information for multi-scale targets, and yield more refined semantic annotation results.
The purpose of the invention is realized by adopting the following technical scheme:
in a method of semantic segmentation of a remote sensing image, the improvement comprising:
acquiring a remote sensing image to be segmented;
inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network, and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
and up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented, and obtaining the final prediction result of the remote sensing image to be segmented.
Preferably, the process of establishing the pre-established self-attention multi-scale feature aggregation network includes:
step 1, carrying out artificial semantic annotation on remote sensing images in a remote sensing image data set, and dividing the remote sensing image data set into a training set, a verification set and a test set;
step 2, performing data enhancement on the training set;
step 3, slicing the data of the training set, the verification set and the test set into 513×513 slices;
and 4, training the pre-established self-attention multi-scale feature aggregation initial network by utilizing the training set, the verification set and the test set.
Further, the pre-established self-attention multi-scale feature aggregation initial network comprises: the system comprises a deep convolutional neural network, a VGG neural network, a self-attention modal calibration module, a dense multi-scale context aggregation module and a self-attention space calibration module;
the deep convolutional neural network is used for extracting the characteristics of the optical image in the remote sensing image;
the VGG neural network is used for extracting the characteristics of digital surface model data in the remote sensing image;
the self-attention modal calibration module is used for performing feature fusion on the features of the optical image and the features of the digital surface model data to obtain a multi-modal feature fusion map;
the dense multi-scale context aggregation module is used for extracting a multi-scale fusion feature map of the multi-modal feature fusion map;
the self-attention space calibration module is used for obtaining an initial prediction result based on the features of the optical image, the features of the digital surface model data and the multi-scale fusion feature map.
Further, the deep convolutional neural network is an improved Xception network structure, and the improvement process includes: reducing the repeated middle cycle groups of the Xception network structure to 6 groups, removing the last fully connected layer of the Xception network structure, replacing all max pooling layers in the Xception network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the final cycle group of the Xception network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
Further, the VGG neural network is an improved VGG16 network structure, and the improvement process thereof comprises:
replacing all convolutional layers of the VGG16 network structure with depthwise separable convolution layers, removing the last fully connected layer of the VGG16 network structure, replacing all max pooling layers of the VGG16 network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the VGG16 network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
Further, the self-attention modal calibration module comprises, connected in sequence: a first merge connection layer, a first global max pooling layer, a first fully connected layer, a first ReLU function layer, a second fully connected layer and a first Sigmoid function layer.
Further, the dense multi-scale context aggregation module comprises: a 1x1 convolutional layer, a first 3x3 convolutional layer, a second 3x3 convolutional layer, a third 3x3 convolutional layer, and a second merge-connection layer;
the output end of the 1x1 convolutional layer is connected to the input ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer respectively, and the output ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer are each connected to the input end of the second merge-connection layer.
Further, the self-attention space calibration module comprises: a third merge connection layer, a second global max pooling layer, a third fully connected layer, a second ReLU function layer, a fourth fully connected layer and a second Sigmoid function layer.
Further, the step 2 comprises:
performing, in sequence, random horizontal and vertical flips on the training set each with probability 0.5, random image rotation by an angle of -20° to 20° with a 1° step, random rotation by the fixed angles 90°, 180° and 270°, and random scaling of the image size by a factor of 0.25 to 4.
Based on the same invention concept, the invention also provides a remote sensing image semantic segmentation device, and the improvement is that the device comprises:
the acquisition module is used for acquiring a remote sensing image to be segmented;
the segmentation module is used for inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
and the adjusting module is used for up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented and obtaining the final prediction result of the remote sensing image to be segmented.
Compared with the closest prior art, the invention has the following beneficial effects:
the invention relates to a method and a device for semantic segmentation of remote sensing images, comprising the following steps: acquiring a remote sensing image to be segmented; inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network, and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network; up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented, and obtaining the final prediction result of the remote sensing image to be segmented; the relevance of modal characteristics and spatial information can be effectively enhanced, the downward information perception capability of a multi-scale target is improved, and a more precise semantic annotation result is obtained;
the semantic features of the multi-modal data are extracted by using a two-way network in the pre-established self-attention multi-scale feature aggregation network, and the parameter efficiency is improved, the model complexity is reduced, and the generalization capability is improved by using an asymmetric network structure while the annotation precision is improved by using rich modal information.
The self-attention modal calibration module in the pre-established self-attention multi-scale feature aggregation network performs explicit global semantic modeling and association of the features of different modalities; the self-attention mechanism suppresses redundant features and highlights useful ones, and the calibrated modal fusion features improve the accuracy of the labeling result.
The dense multi-scale context aggregation module in the pre-established self-attention multi-scale feature aggregation network uses convolutions with several dilation rates to widen the network's perception range of contextual semantic information, while dense connections make the effective features of the multi-scale feature map denser, which helps refine the labeling result.
The self-attention space calibration module in the pre-established self-attention multi-scale feature aggregation network calibrates high-level features that have lost a large amount of spatial information by introducing spatially rich bottom-level features of the two modalities; after dynamic weighting by the self-attention mechanism, the edge information of large-scale ground-feature elements and some small-scale ground-feature elements can be recovered, improving fine-labeling accuracy.
Drawings
FIG. 1 is a flow chart of a method for semantic segmentation of remote sensing images provided by the present invention;
FIG. 2 is a schematic structural diagram of a self-attention mode calibration module according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a dense multi-scale context aggregation module in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a self-attention space calibration module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a semantic segmentation apparatus for remote sensing images provided by the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems of multi-modal data fusion, difficult multi-scale semantic extraction of a remote sensing scene and the like in the prior art, the invention provides a remote sensing image semantic segmentation method, as shown in figure 1, which comprises the following steps:
101, acquiring a remote sensing image to be segmented;
102, inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network, and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
103, up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented, and obtaining the final prediction result of the remote sensing image to be segmented.
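The up-sampling in step 103 can be illustrated with a minimal sketch. The patent does not name the interpolation method, so bilinear interpolation (a common choice for restoring prediction maps to input size, assumed here) is used, operating on a plain 2-D list for self-containment:

```python
def bilinear_upsample(grid, out_h, out_w):
    """Bilinearly resize a 2-D list `grid` (H x W) to out_h x out_w.

    Uses align-corners-style coordinate mapping; bilinear interpolation
    is an assumption, as the patent only states the target size.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        y = i * (in_h - 1) / max(out_h - 1, 1)   # source row coordinate
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        for j in range(out_w):
            x = j * (in_w - 1) / max(out_w - 1, 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bot = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            out[i][j] = top * (1 - dy) + bot * dy
    return out
```

In practice the same interpolation is applied per class channel of the initial prediction before taking the per-pixel argmax.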
Specifically, the process of establishing the pre-established self-attention multi-scale feature aggregation network includes:
step 1, carrying out artificial semantic annotation on remote sensing images in a remote sensing image data set, and dividing the remote sensing image data set into a training set, a verification set and a test set;
step 2, performing data enhancement on the training set;
step 3, slicing the data of the training set, the verification set and the test set into 513×513 slices;
and 4, training the pre-established self-attention multi-scale feature aggregation initial network by utilizing the training set, the verification set and the test set.
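The 513×513 slicing in step 3 can be sketched as a tiling of each image. The patent states only the slice size, so the covering strategy below (shifting the last row/column of tiles back inside the image, producing overlap when the size is not a multiple of 513) is an assumption:

```python
def tile_coords(height, width, tile=513):
    """Return the top-left (y, x) corners of tile x tile slices
    that cover an image; edge tiles are shifted inward so every
    slice lies fully inside the image (an assumed strategy)."""
    def starts(size):
        if size <= tile:
            return [0]
        s = list(range(0, size - tile, tile))
        s.append(size - tile)  # final tile flush with the border
        return s
    return [(y, x) for y in starts(height) for x in starts(width)]
```

Each coordinate pair then indexes one 513×513 crop of the image and its label map.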
Wherein the pre-established self-attention multi-scale feature aggregation initial network comprises: the system comprises a deep convolutional neural network, a VGG neural network, a self-attention modal calibration module, a dense multi-scale context aggregation module and a self-attention space calibration module;
the deep convolutional neural network is used for extracting the characteristics of the optical image in the remote sensing image;
the VGG neural network is used for extracting the characteristics of digital surface model data in the remote sensing image;
the self-attention modal calibration module is used for performing feature fusion on the features of the optical image and the features of the digital surface model data to obtain a multi-modal feature fusion map;
the dense multi-scale context aggregation module is used for extracting a multi-scale fusion feature map of the multi-modal feature fusion map;
the self-attention space calibration module is used for obtaining an initial prediction result based on the features of the optical image, the features of the digital surface model data and the multi-scale fusion feature map.
Wherein, the deep convolutional neural network is an improved Xception network structure, and the improvement process includes: reducing the repeated middle cycle groups of the Xception network structure to 6 groups, removing the last fully connected layer of the Xception network structure, replacing all max pooling layers in the Xception network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the final cycle group of the Xception network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
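The effect of the dilation rates 1, 3 and 5 at the end of the backbone can be checked with standard convolution arithmetic: a stride-1 k×k convolution with dilation d adds d·(k-1) to the receptive field, so stacking the three 3×3 layers enlarges it substantially without adding parameters. A minimal sketch (the kernel size and rates come from the description above; the formula itself is textbook receptive-field arithmetic):

```python
def stacked_receptive_field(kernel=3, dilations=(1, 3, 5)):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Each k x k conv with dilation d widens the receptive field
    by d * (k - 1); a single undilated 3x3 stack of the same depth
    would reach only 7.
    """
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf
```

With rates (1, 3, 5) the stack sees a 19-pixel extent versus 7 for three plain 3×3 layers, which is the multi-scale context gain the improvement targets.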
The VGG neural network is an improved VGG16 network structure, and the improvement process comprises the following steps:
replacing all convolutional layers of the VGG16 network structure with depthwise separable convolution layers, removing the last fully connected layer of the VGG16 network structure, replacing all max pooling layers of the VGG16 network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the VGG16 network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
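The motivation for the depthwise separable replacement can be quantified: a standard k×k convolution with C_in inputs and C_out outputs needs k·k·C_in·C_out weights, while its depthwise separable counterpart (a per-channel k×k convolution followed by a 1×1 pointwise mix) needs only k·k·C_in + C_in·C_out. A small sketch; the channel counts used in the comparison are illustrative, not taken from the patent:

```python
def conv_params(k, c_in, c_out):
    # standard k x k convolution, bias omitted
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    # depthwise k x k per input channel + 1x1 pointwise mixing
    return k * k * c_in + c_in * c_out
```

For example, a 3×3 layer with 64 inputs and 128 outputs drops from 73,728 weights to 8,768, roughly an 8.4× reduction, which is why the asymmetric two-branch design keeps the DSM branch lightweight.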
In an embodiment of the present invention, as shown in fig. 2, the self-attention modal calibration module comprises, connected in sequence: a first merge connection layer, a first global max pooling layer, a first fully connected layer, a first ReLU function layer, a second fully connected layer and a first Sigmoid function layer.
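The chain above (concat → global max pool → FC → ReLU → FC → Sigmoid → channel reweighting) can be sketched in pure Python, in the spirit of squeeze-and-excitation channel attention. The toy weight matrices and feature values below are assumptions for illustration, not the trained parameters of the patent's network:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def dense(v, w):
    # w is an (out x in) weight matrix; bias omitted for brevity
    return [sum(wi * xi for wi, xi in zip(row, v)) for row in w]

def modal_calibration(feats_a, feats_b, w1, w2):
    """Channel-gated fusion of two modal feature sets.

    feats_*: list of channels, each channel a flat list of values.
    w1, w2: fully connected weights (toy shapes, assumed).
    """
    fused = feats_a + feats_b                 # merge connection (concat)
    pooled = [max(ch) for ch in fused]        # global max pooling
    gate = sigmoid(dense(relu(dense(pooled, w1)), w2))
    # channel-wise reweighting: redundant channels get small gates
    return [[g * x for x in ch] for g, ch in zip(gate, fused)]
```

The Sigmoid output acts as a per-channel gate on the concatenated optical and DSM features, which is how the module weakens redundant channels and highlights useful ones.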
As shown in fig. 3, the dense multi-scale context aggregation module includes: a 1x1 convolutional layer, a first 3x3 convolutional layer, a second 3x3 convolutional layer, a third 3x3 convolutional layer, and a second merge-connection layer;
the output end of the 1x1 convolutional layer is connected to the input ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer respectively, and the output ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer are each connected to the input end of the second merge-connection layer.
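The parallel-branch topology can be illustrated in one dimension: one shared input feeds several same-size kernels at different dilation rates, and the branch outputs are concatenated. The 3-tap all-ones kernel, zero "same" padding, and the rates (1, 3, 5) below are assumptions for illustration (the patent does not state this module's own rates; 1, 3, 5 mirror the backbone's atrous layers):

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1-D dilated convolution with zero padding."""
    n, k = len(signal), len(kernel)
    half = (k - 1) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * dilation   # dilated tap position
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out

def dense_aggregation(signal, dilations=(1, 3, 5)):
    """One shared input, several dilation-rate branches, outputs
    kept separate here where the module would concatenate them."""
    kernel = [1.0, 1.0, 1.0]   # toy 3-tap kernel (assumption)
    return [dilated_conv1d(signal, kernel, d) for d in dilations]
```

A unit impulse makes the multi-scale behavior visible: each branch spreads the same feature over a different spatial extent before the merge connection fuses them.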
As shown in fig. 4, the self-attention space calibration module comprises: a third merge connection layer, a second global max pooling layer, a third fully connected layer, a second ReLU function layer, a fourth fully connected layer and a second Sigmoid function layer.
Further, the step 2 comprises:
performing, in sequence, random horizontal and vertical flips on the training set each with probability 0.5, random image rotation by an angle of -20° to 20° with a 1° step, random rotation by the fixed angles 90°, 180° and 270°, and random scaling of the image size by a factor of 0.25 to 4.
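The augmentation scheme can be sketched as sampling one configuration per training image. Treating "no fixed rotation" as an extra choice of 0° is an assumption (the description lists only 90°, 180° and 270°), as are the dictionary key names:

```python
import random

def sample_augmentation(rng):
    """Sample one augmentation configuration per the scheme above.

    Flips each with probability 0.5, a free rotation in [-20, 20]
    degrees at a 1-degree step, a fixed rotation from
    {0, 90, 180, 270} (0 meaning 'none', an assumption), and a
    scale factor in [0.25, 4].
    """
    return {
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
        "rotate_deg": rng.randrange(-20, 21),        # 1-degree step pitch
        "fixed_rotate": rng.choice([0, 90, 180, 270]),
        "scale": rng.uniform(0.25, 4.0),
    }

# example draw with a seeded generator for reproducibility
cfg = sample_augmentation(random.Random(0))
```

The sampled configuration would then be applied to the image and its label map together so pixel-level annotations stay aligned.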
Based on the same inventive concept, the invention also provides a semantic segmentation device for remote sensing images, as shown in fig. 5, the device comprises:
the acquisition module is used for acquiring a remote sensing image to be segmented;
the segmentation module is used for inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
and the adjusting module is used for up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented and obtaining the final prediction result of the remote sensing image to be segmented.
Preferably, the process of establishing the pre-established self-attention multi-scale feature aggregation network includes:
step 1, carrying out artificial semantic annotation on remote sensing images in a remote sensing image data set, and dividing the remote sensing image data set into a training set, a verification set and a test set;
step 2, performing data enhancement on the training set;
step 3, slicing the data of the training set, the verification set and the test set into 513×513 slices;
and 4, training the pre-established self-attention multi-scale feature aggregation initial network by utilizing the training set, the verification set and the test set.
Further, the pre-established self-attention multi-scale feature aggregation initial network comprises: the system comprises a deep convolutional neural network, a VGG neural network, a self-attention modal calibration module, a dense multi-scale context aggregation module and a self-attention space calibration module;
the deep convolutional neural network is used for extracting the characteristics of the optical image in the remote sensing image;
the VGG neural network is used for extracting the characteristics of digital surface model data in the remote sensing image;
the self-attention modal calibration module is used for performing feature fusion on the features of the optical image and the features of the digital surface model data to obtain a multi-modal feature fusion map;
the dense multi-scale context aggregation module is used for extracting a multi-scale fusion feature map of the multi-modal feature fusion map;
the self-attention space calibration module is used for obtaining an initial prediction result based on the features of the optical image, the features of the digital surface model data and the multi-scale fusion feature map.
Further, the deep convolutional neural network is an improved Xception network structure, and the improvement process includes: reducing the repeated middle cycle groups of the Xception network structure to 6 groups, removing the last fully connected layer of the Xception network structure, replacing all max pooling layers in the Xception network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the final cycle group of the Xception network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
Further, the VGG neural network is an improved VGG16 network structure, and the improvement process thereof comprises:
replacing all convolutional layers of the VGG16 network structure with depthwise separable convolution layers, removing the last fully connected layer of the VGG16 network structure, replacing all max pooling layers of the VGG16 network structure with depthwise separable convolution layers with a stride of 2, and replacing the last three depthwise separable convolution layers of the VGG16 network structure with atrous convolution layers with dilation rates of 1, 3 and 5, respectively.
Further, the self-attention modal calibration module comprises, connected in sequence: a first merge connection layer, a first global max pooling layer, a first fully connected layer, a first ReLU function layer, a second fully connected layer and a first Sigmoid function layer.
Further, the dense multi-scale context aggregation module comprises: a 1x1 convolutional layer, a first 3x3 convolutional layer, a second 3x3 convolutional layer, a third 3x3 convolutional layer, and a second merge-connection layer;
the output end of the 1x1 convolutional layer is connected to the input ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer respectively, and the output ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer are each connected to the input end of the second merge-connection layer.
Further, the self-attention space calibration module comprises: a third merge connection layer, a second global max pooling layer, a third fully connected layer, a second ReLU function layer, a fourth fully connected layer and a second Sigmoid function layer.
Further, the step 2 comprises:
performing, in sequence, random horizontal and vertical flips on the training set each with probability 0.5, random image rotation by an angle of -20° to 20° with a 1° step, random rotation by the fixed angles 90°, 180° and 270°, and random scaling of the image size by a factor of 0.25 to 4.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from its spirit and scope, which are defined by the claims.
Claims (10)
1. A semantic segmentation method for remote sensing images is characterized by comprising the following steps:
acquiring a remote sensing image to be segmented;
inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network, and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
and up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented, and obtaining the final prediction result of the remote sensing image to be segmented.
2. The method of claim 1, wherein the pre-established self-attention multi-scale feature aggregation network establishment procedure comprises:
step 1, manually annotating the remote sensing images in a remote sensing image data set with semantic labels, and dividing the remote sensing image data set into a training set, a verification set and a test set;
step 2, performing data enhancement on the training set;
step 3, slicing the data of the training set, the verification set and the test set into 513x513 patches;
and 4, training the pre-established self-attention multi-scale feature aggregation initial network by utilizing the training set, the verification set and the test set.
3. The method of claim 2, wherein the pre-established self-attention multi-scale feature aggregation initial network comprises: the system comprises a deep convolutional neural network, a VGG neural network, a self-attention modal calibration module, a dense multi-scale context aggregation module and a self-attention space calibration module;
the deep convolutional neural network is used for extracting the characteristics of the optical image in the remote sensing image;
the VGG neural network is used for extracting the characteristics of digital surface model data in the remote sensing image;
the self-attention modal calibration module is used for performing feature fusion on the features of the optical image and the features of the digital surface model data to obtain a multi-modal feature fusion map;
the dense multi-scale context aggregation module is used for extracting a multi-scale fusion feature map of the multi-modal feature fusion map;
the self-attention space calibration module is used for obtaining an initial prediction result based on the characteristics of the optical image, the characteristics of the digital surface model data and the multi-scale fusion characteristic diagram.
4. The method of claim 3, wherein the deep convolutional neural network is a modified Xception network structure, the modification comprising: reducing the repeated structure of the middle cycle group of the Xception network structure to 6 groups, removing the last full connection layer of the Xception network structure, replacing all the maximum pooling layers in the Xception network structure with depth separable convolutional layers with a stride of 2, and replacing the last three depth separable convolutional layers of the cycle group at the end of the Xception network structure with atrous (perforated) convolutional layers with hole rates of 1, 3 and 5 respectively.
5. The method of claim 3, wherein the VGG neural network is a modified VGG16 network structure, the modification comprising:
replacing all convolutional layers of the VGG16 network structure with depth separable convolutional layers, removing the last full connection layer of the VGG16 network structure, replacing all the maximum pooling layers of the VGG16 network structure with depth separable convolutional layers with a stride of 2, and replacing the last three depth separable convolutional layers of the VGG16 network structure with atrous (perforated) convolutional layers with hole rates of 1, 3 and 5 respectively.
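The two substitutions made in claims 4 and 5 have simple arithmetic behind them, sketched below: an atrous ("with holes") kernel enlarges the receptive field without adding parameters, and a depth separable convolution replaces one dense kernel with a depthwise plus a pointwise stage. The channel counts used in the demo call are illustrative, not taken from the patent.

```python
def dilated_receptive_field(kernel=3, rates=(1, 3, 5)):
    """Per-side receptive field of a 3x3 convolution at the hole rates
    named in claims 4 and 5: a dilated kernel spans rate*(k-1)+1 pixels."""
    return {r: r * (kernel - 1) + 1 for r in rates}

def separable_params(c_in, c_out, k=3):
    """Parameter count of a standard convolution versus the depth
    separable convolution substituted for it (depthwise + pointwise)."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable

print(dilated_receptive_field())   # {1: 3, 3: 7, 5: 11}
print(separable_params(256, 256))  # (589824, 67840)
```

Stacking the three rates 1, 3 and 5 thus covers context from 3 to 11 pixels per layer at roughly one ninth of the standard parameter cost.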
6. The method of claim 3, wherein the self-attention modal calibration module comprises: a first merge-connection layer, a first global maximum pooling layer, a first full-connection layer, a first Relu function layer, a second full-connection layer and a first Sigmoid function layer, which are connected in sequence.
7. The method of claim 3, wherein the dense multi-scale context aggregation module comprises: a 1x1 convolutional layer, a first 3x3 convolutional layer, a second 3x3 convolutional layer, a third 3x3 convolutional layer, and a second merge-connection layer;
the output end of the 1x1 convolutional layer is connected to the input ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer respectively, and the output ends of the first 3x3 convolutional layer, the second 3x3 convolutional layer and the third 3x3 convolutional layer are respectively connected to the input end of the second merge-connection layer.
8. The method of claim 3, wherein the self-attention space calibration module comprises: a third merge-connection layer, a second global maximum pooling layer, a third full-connection layer, a second Relu function layer, a fourth full-connection layer and a second Sigmoid function layer.
9. The method of claim 2, wherein step 2 comprises:
sequentially performing, on the training set: random horizontal and vertical flipping, each with a probability of 0.5; random image rotation by an angle of -20 to 20 degrees in steps of 1 degree; random fixed-angle rotation by 90, 180 or 270 degrees; and random scaling of the image size by a factor of 0.25 to 4.
10. A device for semantic segmentation of remote sensing images, the device comprising:
the acquisition module is used for acquiring a remote sensing image to be segmented;
the segmentation module is used for inputting the remote sensing image to be segmented to a pre-established self-attention multi-scale feature aggregation network and obtaining an initial prediction result of the remote sensing image to be segmented output by the pre-established self-attention multi-scale feature aggregation network;
and the adjusting module is used for up-sampling the initial prediction result of the remote sensing image to be segmented to the image size of the remote sensing image to be segmented and obtaining the final prediction result of the remote sensing image to be segmented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010350688.1A CN111582104B (en) | 2020-04-28 | 2020-04-28 | Remote sensing image semantic segmentation method and device based on self-attention feature aggregation network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111582104A true CN111582104A (en) | 2020-08-25 |
CN111582104B CN111582104B (en) | 2021-08-06 |
Family
ID=72120069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010350688.1A Active CN111582104B (en) | 2020-04-28 | 2020-04-28 | Remote sensing image semantic segmentation method and device based on self-attention feature aggregation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111582104B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110232394A (en) * | 2018-03-06 | 2019-09-13 | 华南理工大学 | A kind of multi-scale image semantic segmentation method |
CN108537238A (en) * | 2018-04-13 | 2018-09-14 | 崔植源 | A kind of classification of remote-sensing images and search method |
CN109086668A (en) * | 2018-07-02 | 2018-12-25 | 电子科技大学 | Based on the multiple dimensioned unmanned aerial vehicle remote sensing images road information extracting method for generating confrontation network |
CN110866526A (en) * | 2018-08-28 | 2020-03-06 | 北京三星通信技术研究有限公司 | Image segmentation method, electronic device and computer-readable storage medium |
CN109255334A (en) * | 2018-09-27 | 2019-01-22 | 中国电子科技集团公司第五十四研究所 | Remote sensing image terrain classification method based on deep learning semantic segmentation network |
CN110210608A (en) * | 2019-06-05 | 2019-09-06 | 国家广播电视总局广播电视科学研究院 | The enhancement method of low-illumination image merged based on attention mechanism and multi-level features |
CN110197182A (en) * | 2019-06-11 | 2019-09-03 | 中国电子科技集团公司第五十四研究所 | Remote sensing image semantic segmentation method based on contextual information and attention mechanism |
CN110232696A (en) * | 2019-06-20 | 2019-09-13 | 腾讯科技(深圳)有限公司 | A kind of method of image region segmentation, the method and device of model training |
Non-Patent Citations (3)
Title |
---|
MAOKE YANG et al.: "DenseASPP for Semantic Segmentation in Street Scenes", IEEE *
ZHIYING CAO et al.: "End-to-End DSM Fusion Networks for Semantic Segmentation in High-Resolution Aerial Images", IEEE GEOSCIENCE AND REMOTE SENSING LETTERS *
YU Shuai et al.: "Remote sensing image segmentation method based on multi-level channel attention", Laser & Optoelectronics Progress *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016511A (en) * | 2020-09-08 | 2020-12-01 | 重庆市地理信息和遥感应用中心 | Remote sensing image blue top room detection method based on large-scale depth convolution neural network |
CN112529081A (en) * | 2020-12-11 | 2021-03-19 | 大连大学 | Real-time semantic segmentation method based on efficient attention calibration |
CN112529081B (en) * | 2020-12-11 | 2023-11-07 | 大连大学 | Real-time semantic segmentation method based on efficient attention calibration |
CN112598003A (en) * | 2020-12-18 | 2021-04-02 | 燕山大学 | Real-time semantic segmentation method based on data expansion and full-supervision preprocessing |
CN112598003B (en) * | 2020-12-18 | 2022-11-25 | 燕山大学 | Real-time semantic segmentation method based on data expansion and full-supervision preprocessing |
CN113269787A (en) * | 2021-05-20 | 2021-08-17 | 浙江科技学院 | Remote sensing image semantic segmentation method based on gating fusion |
CN113723411A (en) * | 2021-06-18 | 2021-11-30 | 湖北工业大学 | Feature extraction method and segmentation system for semantic segmentation of remote sensing image |
CN113723411B (en) * | 2021-06-18 | 2023-06-27 | 湖北工业大学 | Feature extraction method and segmentation system for semantic segmentation of remote sensing image |
CN114332636A (en) * | 2022-03-14 | 2022-04-12 | 北京化工大学 | Polarized SAR building region extraction method, equipment and medium |
CN115601605A (en) * | 2022-12-13 | 2023-01-13 | 齐鲁空天信息研究院(Cn) | Surface feature classification method, device, equipment, medium and computer program product |
CN117523410A (en) * | 2023-11-10 | 2024-02-06 | 中国科学院空天信息创新研究院 | Image processing and construction method based on multi-terminal collaborative perception distributed large model |
Also Published As
Publication number | Publication date |
---|---|
CN111582104B (en) | 2021-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111582104B (en) | Remote sensing image semantic segmentation method and device based on self-attention feature aggregation network | |
CN108647585B (en) | Traffic identifier detection method based on multi-scale circulation attention network | |
CN111047551B (en) | Remote sensing image change detection method and system based on U-net improved algorithm | |
CN111127493A (en) | Remote sensing image semantic segmentation method based on attention multi-scale feature fusion | |
CN113011427A (en) | Remote sensing image semantic segmentation method based on self-supervision contrast learning | |
CN114117614B (en) | Automatic generation method and system for building elevation texture | |
CN111489396A (en) | Determining camera parameters using critical edge detection neural networks and geometric models | |
CN111242127A (en) | Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution | |
CN111626176A (en) | Ground object target detection method and system of remote sensing image | |
CN113313094B (en) | Vehicle-mounted image target detection method and system based on convolutional neural network | |
CN112084923A (en) | Semantic segmentation method for remote sensing image, storage medium and computing device | |
CN115908772A (en) | Target detection method and system based on Transformer and fusion attention mechanism | |
CN110008900A (en) | A kind of visible remote sensing image candidate target extracting method by region to target | |
CN114972947B (en) | Depth scene text detection method and device based on fuzzy semantic modeling | |
CN114926734B (en) | Solid waste detection device and method based on feature aggregation and attention fusion | |
CN112686184A (en) | Remote sensing house change detection method based on neural network | |
CN113591614B (en) | Remote sensing image road extraction method based on close-proximity spatial feature learning | |
CN113963333A (en) | Traffic sign board detection method based on improved YOLOF model | |
CN116797830A (en) | Image risk classification method and device based on YOLOv7 | |
CN116416136A (en) | Data amplification method for ship target detection of visible light remote sensing image and electronic equipment | |
CN112488015B (en) | Intelligent building site-oriented target detection method and system | |
Zhang et al. | Feature enhanced centernet for object detection in remote sensing images | |
Ju et al. | Multiscale feature fusion network for automatic port segmentation from remote sensing images | |
CN114119971B (en) | Semantic segmentation method, system and electronic equipment | |
CN118379731A (en) | Shale microscopic substance intelligent detection method and system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||