CN111767810B - Remote sensing image road extraction method based on D-LinkNet - Google Patents


Info

Publication number
CN111767810B
CN111767810B
Authority
CN
China
Prior art keywords
network, convolution, sub, processing, feature
Legal status
Active
Application number
CN202010558654.1A
Other languages
Chinese (zh)
Other versions
CN111767810A (en)
Inventor
兰海燕
李京桦
孙建国
孙鹤玲
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202010558654.1A priority Critical patent/CN111767810B/en
Publication of CN111767810A publication Critical patent/CN111767810A/en
Application granted granted Critical
Publication of CN111767810B publication Critical patent/CN111767810B/en

Classifications

    • G06V 20/182: Scenes; terrestrial scenes; network patterns, e.g. roads or rivers
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/26: Image preprocessing; segmentation of patterns in the image field
    • G06V 10/40: Extraction of image or video features

Abstract

The invention provides a D-LinkNet-based road extraction method for remote sensing images, comprising the following steps. S1: the input image enters the D-LinkNet network and is processed by an encoder sub-network based on a residual network and transfer learning. S2: the feature map output by step S1 is fed into a feature extraction sub-network based on dilated convolution and the convolutional block attention module for feature extraction. S3: the feature map produced by the first two sub-networks enters a decoder sub-network based on transposed convolution, which restores the image. The method can sample road features in the remote sensing image while avoiding network degradation and strengthening road-feature extraction. Dilated convolution enlarges the receptive field, so that road features can be perceived and extracted over a wider range without additional downsampling, which addresses the small proportion of the image that roads occupy.

Description

Remote sensing image road extraction method based on D-LinkNet
Technical Field
The invention relates to a remote sensing image road extraction method, in particular to a remote sensing image road extraction method based on D-LinkNet, and belongs to the field of remote sensing image processing.
Background
Remote sensing (RS) refers to the non-contact, long-distance and real-time acquisition of targets such as earth resources by remote sensors, followed by the extraction, analysis and processing of the resulting data. In recent decades, many scholars at home and abroad have studied the complex road information in remote sensing images extensively, and various road extraction algorithms have been proposed. According to the three relatively mature and commonly used strategies, current road extraction methods for remote sensing images can be divided into three categories: methods based on edge features, on objects, and on deep learning.
Wang et al. use the geometric features of roads as salient features and propose a high-resolution remote sensing image extraction method based on salient features and GVFS (Gradient Vector Flow Snake), solving iteratively through a gradient vector flow model to obtain the road information. This method not only approaches the actual road boundary but also shortens the boundary-search time of the algorithm, noticeably improving extraction accuracy. However, its accuracy depends to a great extent on the acquired saliency map, and the algorithm is easily disturbed by complex remote sensing images.
Cao Yungang et al. propose a new non-road-region removal algorithm together with a tensor voting algorithm: pixel-level features and multi-scale object-level features are first fused to extract an initial road network, the non-road parts are then removed by the new non-road-region removal algorithm, and finally the tensor voting algorithm completes the fine extraction of the road centerline. This method alleviates the adhesion phenomenon of most object-based road extraction methods and achieves high accuracy in road extraction. However, its training samples require a large amount of ground-truth surface data when extracting the road centerline.
Zhang Yonghong et al., in order to improve road extraction accuracy, propose a high-resolution remote sensing road extraction method based on a convolutional neural network: images containing roads are screened out according to spectra, topographic features and a CNN, and abstract road features are then extracted with an improved network model, PPMU-Net. The method attends well to road edge details, extracts small target roads accurately, and achieves high overall accuracy. However, when affected by complex terrain, the extracted roads are still incomplete, and the model cannot cope sensitively with changes in road width.
Road extraction methods based on edge features can effectively extract simple road information, but their anti-interference capability is weak. Object-based methods work well when the image contains little road-like information, but when other ground objects are too similar to roads and spatially adjacent to them, most of these methods suffer from adhesion; for the few methods that address this problem, the extraction process is overly complex or the extraction scale is hard to choose, so road extraction hits a bottleneck. Deep-learning-based methods have strong learning ability and can handle variable road types, complex and diverse backgrounds, and the similarity between road and non-road features, but they still suffer from incomplete road extraction, easy loss of spatial information, and limited generality in scenes with large variation.
Disclosure of Invention
The invention aims to provide a D-LinkNet-based remote sensing image road extraction method in order to complete the automatic extraction and segmentation of roads in remote sensing images.
The purpose of the invention is realized as follows:
A D-LinkNet-based remote sensing image road extraction method comprises the following steps:
s1: the input image enters the D-LinkNet network and is processed by an encoder sub-network based on a residual network and transfer learning;
s2: the feature map output by step S1 is input to a feature extraction sub-network based on dilated convolution and the convolutional block attention module for feature extraction;
s3: the feature map produced by the first two sub-networks enters a decoder sub-network based on transposed convolution, which restores the image.
The invention also includes such features:
the first step is specifically as follows: a convolutional layer first performs a convolution operation on the remote sensing image, maximum pooling is then applied, and the resulting feature map is input to the encoding units containing residual blocks;
(1) Identity Block processing
y = F(x, {W_3}) + x
(2) Convolutional Block processing
y = F(x, {W_3}) + W_s·x
where F denotes the stacked convolutions applied to the input; {W_3} indicates that the processing comprises three convolution operations; y is the final output feature map; W_s·x is the result of one projection convolution applied to the input; and x is the input feature map;
the second step is specifically as follows:
(1) for the feature map input to this sub-network, dilated convolution is used to enlarge the receptive field so that road features are perceived over a wider range:
R = r + (k - 1) × j
where R is the single-side size of the current layer's receptive field, r that of the previous layer, k the single-side size of the convolution kernel, and j the dilation rate;
(2) the input feature map F is processed by the channel attention operation M_c and the spatial attention operation M_s, and the processed feature map F′ is output:
F_1 = M_c(F) ⊗ F,  F′ = M_s(F_1) ⊗ F_1
where ⊗ denotes element-wise multiplication;
the third step is specifically as follows: this sub-network comprises decoding units, a transposed convolution operation and a convolution operation, and the output of each decoding unit has a skip connection to the input of the corresponding encoding unit in the encoder sub-network based on the residual network and transfer learning;
o = (i - 1) × s + k - 2 × p
where o is the single-side size of the feature map output by the transposed convolution, i the single-side size of the input feature map, s the stride, k the size of the transposed-convolution kernel, and p the padding.
Compared with the prior art, the invention has the beneficial effects that:
the remote sensing image road extraction model based on the D-LinkNet can realize the feature extraction of the remote sensing image and the automatic segmentation of the road;
the remote sensing image road extraction model based on D-LinkNet can sample road characteristics in the remote sensing image through ResNet50, thereby well avoiding the degradation problem of the network and enhancing the extraction of the road characteristics;
the remote sensing image road extraction model based on the D-LinkNet can use the expansion convolution to amplify the receptive field, can sense the road characteristics in a larger range without adding down-sampling, and can well solve the problem that the road part in the remote sensing image occupies a small area ratio by extracting the characteristics;
the remote sensing image road extraction model based on the D-LinkNet uses the channel attention mechanism and the space attention mechanism in the convolution block attention module, can enhance the learning of road characteristics from two aspects of channel and space, can effectively solve the problem of wrong division caused by the fact that a road is too similar to other ground objects in the road extraction process, and simultaneously reduces the interference of other dissimilar ground objects on road extraction.
Drawings
FIG. 1 is a flow chart of the D-LinkNet-based remote sensing image road extraction method of the invention, showing the encoder sub-network based on a residual network and transfer learning, the feature extraction sub-network based on dilated convolution and the convolutional block attention module, and the decoder sub-network based on transposed convolution.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention relates to the field of remote sensing image processing, in particular to a remote sensing image road extraction algorithm based on a semantic segmentation network.
In view of the drawbacks of the prior art, the object of the present invention or the technical problem to be solved is:
1. the feature extraction of the remote sensing image and the automatic segmentation of the road are realized;
2. the problem of network degradation is avoided, and the extraction of road characteristics is enhanced;
3. aiming at the problem that roads occupy a small proportion of the remote sensing image, the accuracy of road extraction is improved;
4. aiming at the problem that roads are too similar to ground objects such as rivers and railways, misclassification during road extraction is avoided;
5. aiming at the problems of complex information and various ground feature types in the remote sensing image, the attention to road characteristics is improved, and the attention to other characteristics is reduced.
In order to achieve the purpose, the invention provides a semantic segmentation network of D-LinkNet, which is used for automatically extracting roads in a remote sensing image. The remote sensing image road extraction model based on the D-LinkNet can finally complete automatic extraction and segmentation of roads in the remote sensing image.
The method comprises the following steps:
s1, constructing a residual error network and transfer learning-based encoder sub-network in a D-LinkNet network by using a ResNet50 structure in the residual error network, preventing the D-LinkNet network from generating network degradation in the training process of road extraction, completing the task of continuously down-sampling the remote sensing image, and achieving the purpose of extracting road characteristics in the remote sensing image. And the performance of the encoder sub-network based on the residual error network and the transfer learning is improved by using the transfer learning, so that the encoder sub-network does not need to be trained from the beginning, the training speed of the remote sensing image road extraction model based on the D-LinkNet is increased, and the performance of the whole D-LinkNet network is improved.
For the feature map x input to this sub-network, ResNet50 applies multiple convolutions using two residual block structures; F denotes the stacked convolutions, {W_3} indicates that each block comprises three convolution operations, and y is the resulting output feature map.
(1) Identity Block processing
y = F(x, {W_3}) + x
(2) Convolutional Block processing (W_s·x is the result of one projection convolution applied to the input)
y = F(x, {W_3}) + W_s·x
S2. The feature extraction sub-network based on dilated convolution and the convolutional block attention module exploits the advantage of dilated convolution: the receptive field can be enlarged without extra downsampling and without changing the filter size, so detail is handled better and road features are extracted over a wider range; meanwhile, the channel and spatial attention mechanisms of the convolutional block attention module strengthen the learning of road features, rather than other features, from both the channel and the spatial perspective.
(1) For the feature map input to this sub-network, dilated convolution is used to enlarge the receptive field so that road features are perceived over a wider range.
R = r + (k - 1) × j
where R is the single-side size of the current layer's receptive field, r that of the previous layer, k the single-side size of the convolution kernel, and j the dilation rate.
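Applying the formula layer by layer traces the receptive-field growth of the dilated-convolution chains described below; this small sketch assumes kernel size 3 and stride 1 within the sub-network and is illustrative, not from the patent.

```python
def receptive_field(dilations, k=3, r0=1):
    # R = r + (k - 1) * j, applied once per dilated layer, per the formula above.
    r = r0
    for j in dilations:
        r = r + (k - 1) * j
    return r

# Dilation-rate chains used by the extraction units (longest chain: 1, 4, 5, 4, 1).
print(receptive_field([1]))              # 3
print(receptive_field([1, 4]))           # 11
print(receptive_field([1, 4, 5]))        # 21
print(receptive_field([1, 4, 5, 4]))     # 29
print(receptive_field([1, 4, 5, 4, 1]))  # 31
```

Under these assumptions the longest chain reaches a single-side receptive field of 31 without any downsampling.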
(2) The input feature map F is processed by the channel attention operation M_c and the spatial attention operation M_s, and the processed feature map F′ is output:
F_1 = M_c(F) ⊗ F,  F′ = M_s(F_1) ⊗ F_1
where ⊗ denotes element-wise multiplication.
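A minimal NumPy sketch of the sequential channel-then-spatial gating; the real convolutional block attention module uses a shared MLP inside M_c and a 7 × 7 convolution inside M_s, both simplified away here, so this is an assumption-laden illustration rather than the patented module itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F):
    # M_c: squeeze spatial dims by average- and max-pooling, combine, and
    # gate each channel. (CBAM passes these through a shared MLP; this
    # sketch gates directly on the pooled statistics.)
    return sigmoid(F.mean(axis=(0, 1)) + F.max(axis=(0, 1)))   # shape (C,)

def spatial_attention(F):
    # M_s: pool across channels, combine, and gate each spatial position.
    # (CBAM applies a 7x7 conv on the stacked maps; omitted here.)
    return sigmoid(F.mean(axis=2) + F.max(axis=2))             # shape (H, W)

def cbam(F):
    # F1 = M_c(F) (x) F, then F' = M_s(F1) (x) F1: channel gating first,
    # then spatial gating, both element-wise multiplications.
    F1 = F * channel_attention(F)[None, None, :]
    return F1 * spatial_attention(F1)[:, :, None]

F = np.random.default_rng(1).standard_normal((8, 8, 4))  # toy H x W x C map
F_out = cbam(F)
```

Because both gates lie in (0, 1), the output is a per-channel, per-position attenuation of the input, which is what lets the module suppress non-road responses.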
S3. Drawing on the advantages of the FCN, the decoder sub-network based on transposed convolution is built in the D-LinkNet network; the transposed convolution restores the image, further reducing the computation of the D-LinkNet network and increasing road extraction speed.
o = (i - 1) × s + k - 2 × p
where o is the single-side size of the feature map output by the transposed convolution, i the single-side size of the input feature map, s the stride, k the size of the transposed-convolution kernel, and p the padding.
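A one-line helper makes the formula concrete; the padding values used below are assumptions, since the text does not state them.

```python
def transposed_conv_out(i, s, k, p):
    # o = (i - 1) * s + k - 2 * p
    return (i - 1) * s + k - 2 * p

# Decoding-unit transposed conv from the text: 3x3 kernel, stride 2
# (padding 1 assumed):
print(transposed_conv_out(i=32, s=2, k=3, p=1))    # 63
# Final transposed conv: 4x4 kernel, stride 2; with padding 1 (assumed)
# it doubles the side length exactly:
print(transposed_conv_out(i=512, s=2, k=4, p=1))   # 1024
```

This illustrates why the last upsampling layers use a 4 × 4 kernel: with stride 2 it can restore the full 1024 × 1024 resolution exactly.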
Through steps S1, S2 and S3, the D-LinkNet network can be built. Since the network is used for road extraction from remote sensing images, the resulting D-LinkNet-based road extraction model must segment each image into road and non-road parts. Therefore, the invention combines a binary cross-entropy loss function L_BCE and a Dice loss function L_Dice as the total loss function of the model. During training, the capability of the model is judged by minimizing this total loss, which provides the direction of optimization. The weights of the binary cross-entropy loss and the Dice loss in the combined function are W_1 and W_2 respectively:
L = W_1 · L_BCE + W_2 · L_Dice
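A hedged NumPy sketch of the combined loss; the weight values W_1 = W_2 = 0.5 are placeholders, since the text does not specify them.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy, averaged over pixels.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def dice_loss(y_true, y_pred, eps=1e-7):
    # 1 - Dice coefficient; penalises poor overlap between road masks.
    inter = np.sum(y_true * y_pred)
    return 1.0 - (2 * inter + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def total_loss(y_true, y_pred, w1=0.5, w2=0.5):
    # L = W_1 * L_BCE + W_2 * L_Dice (weights assumed, not given in the text).
    return w1 * bce_loss(y_true, y_pred) + w2 * dice_loss(y_true, y_pred)

y = np.array([0., 1., 1., 0.])                   # toy road mask
perfect = np.array([0.001, 0.999, 0.999, 0.001])  # near-perfect prediction
poor = np.array([0.9, 0.1, 0.1, 0.9])             # inverted prediction
```

The Dice term counters the class imbalance of thin roads against a large background, while the BCE term keeps per-pixel gradients well behaved.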
Tensorflow-GPU is used as the deep learning framework for building the D-LinkNet network, with an NVIDIA GeForce RTX 2080 Ti GPU (11 GB of video memory). Adam is the optimizer in network training, the learning rate is 0.0001, the number of iterations is 8, and the batch size is 2. The dataset comes from the road extraction challenge of the 2018 DeepGlobe Satellite Image Understanding Challenge; each remote sensing image is 1024 × 1024, the training set comprises 5726 remote sensing images with 5726 corresponding label images, and the test set comprises 500 remote sensing images with 500 corresponding label images. To prevent overfitting and improve the generalization ability of the model, the training set is augmented, specifically with left-right, up-down and diagonal flips.
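The three flips can be sketched as follows; interpreting the "diagonal" flip as flipping both axes (a 180° rotation) is an assumption, and the function name is illustrative.

```python
import numpy as np

def augment(img, mask):
    # The three augmentations named above (left-right, up-down, diagonal),
    # applied identically to the image and its label mask.
    out = [(img, mask)]
    out.append((np.fliplr(img), np.fliplr(mask)))   # left-right flip
    out.append((np.flipud(img), np.flipud(mask)))   # up-down flip
    out.append((np.flipud(np.fliplr(img)),          # "diagonal": both axes
                np.flipud(np.fliplr(mask))))
    return out

img = np.arange(16).reshape(4, 4)
mask = (img % 2).astype(np.uint8)
pairs = augment(img, mask)
```

Flipping the mask together with the image keeps every pixel label aligned with its pixel, which is essential for segmentation targets.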
First, the input remote sensing image enters the D-LinkNet network and is processed by the encoder sub-network based on a residual network and transfer learning. This sub-network applies a convolutional layer with a 7 × 7 kernel, 64 filters and a downsampling stride of 2, followed by maximum pooling with a 3 × 3 pooling window and a downsampling stride of 2. The resulting feature map is then input to four encoding units containing ResNet50 residual blocks, where each encoding unit contains exactly one Convolutional Block and the numbers of Identity Blocks in the four encoding units are 2, 3, 5 and 2 respectively.
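Using the standard (non-transposed) convolution output-size formula, the two downsampling steps above take a 1024 × 1024 input to 512 and then 256 per side; the padding values are assumed "same"-style paddings, not stated in the text.

```python
def conv_out(i, k, s, p):
    # Standard convolution / pooling output size:
    # o = floor((i + 2p - k) / s) + 1
    return (i + 2 * p - k) // s + 1

# 1024 x 1024 input -> 7x7 conv, stride 2 (padding 3 assumed):
after_conv = conv_out(1024, k=7, s=2, p=3)
# -> 3x3 max pooling, stride 2 (padding 1 assumed):
after_pool = conv_out(after_conv, k=3, s=2, p=1)
print(after_conv, after_pool)   # 512 256
```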
Second, the feature map output by the encoder sub-network based on a residual network and transfer learning is input to the feature extraction sub-network based on dilated convolution and the convolutional block attention module. This sub-network can be divided into six extraction units (i.e., six processing chains): (1) the input feature map passes directly through the M_c and M_s operations to the output; (2) the input feature map undergoes one dilated convolution with dilation rate 1, then passes through M_c and M_s; (3) the input feature map undergoes two dilated convolutions with dilation rates 1 and 4, then passes through M_c and M_s; (4) the input feature map undergoes three dilated convolutions with dilation rates 1, 4 and 5, then passes through M_c and M_s; (5) the input feature map undergoes four dilated convolutions with dilation rates 1, 4, 5 and 4, then passes through M_c and M_s; (6) the input feature map undergoes five dilated convolutions with dilation rates 1, 4, 5, 4 and 1, then passes through M_c and M_s. The six chains run in parallel, and their six local outputs are added at the end of the sub-network to form its overall output.
Finally, the feature map produced by the two preceding sub-networks enters the decoder sub-network based on transposed convolution, which restores the image. This sub-network comprises 4 decoding units, a transposed convolution operation and a convolution operation, and the output of each decoding unit has a skip connection to the input of the corresponding encoding unit in the encoder sub-network based on a residual network and transfer learning. These skip connections link the encoder and decoder of the D-LinkNet network and recover the spatial information lost by each downsampling in the encoder sub-network; this spatial information benefits the upsampling operations in the decoder sub-network. Within a decoding unit, the transposed convolution has a 3 × 3 kernel and an upsampling stride of 2, and convolutions with 1 × 1 kernels first reduce and then restore the dimension of the input feature map to cut computation; the final transposed convolution of the sub-network has a 4 × 4 kernel, the final convolution has a 4 × 4 kernel, and the upsampling strides are all 2.
Three sub-networks of the D-LinkNet network can realize the feature extraction of the remote sensing image, an image with the same size as the original remote sensing image is output after the processing is finished, the image takes a road part as a foreground and takes a non-road part as a background, and the automatic segmentation of the road part and the non-road part can be finished.
Innovation points protected by the invention: D-LinkNet has a centrally symmetric structure in which each encoding unit and its corresponding decoding unit form a pair, and the output of each decoding unit is connected to the input of the corresponding encoding unit, i.e., a global skip connection. Because the D-LinkNet network uses ResNet50 to build the encoder sub-network based on a residual network and transfer learning, local skip connections are applied within the encoder sub-network. In addition, D-LinkNet places the feature extraction sub-network based on dilated convolution and the convolutional block attention module between its encoder and decoder, so that compared with the original LinkNet network, D-LinkNet adds five local skip connections inside this feature extraction sub-network. In the D-LinkNet-based remote sensing image road extraction model, these skip connections recover road information lost during downsampling without adding parameters to the decoder sub-network based on transposed convolution, and share road information from more perspectives within the feature extraction sub-network.
1. The original LinkNet network uses the idea of a residual network, taking ResNet18 as its encoder. ResNet18 has 18 weighted layers, and each residual block contains two convolutional layers with 3 × 3 kernels plus the corresponding skip connection. In the D-LinkNet network, however, the encoder sub-network based on a residual network and transfer learning serves as the encoder, and it selects ResNet50 as its structure. ResNet50 has 50 weighted layers, a different residual-network structure from ResNet18; being deeper, it performs better than ResNet18.
2. The D-LinkNet network builds the feature extraction sub-network based on dilated convolution and the convolutional block attention module between its encoder and decoder. This sub-network consists of six extraction units, specifically five dilated convolution layers and six convolutional block attention modules. Its main effect is that, without further downsampling, it has a larger receptive field than the original LinkNet network, each convolution output covers information over a wider range, and attention is paid to road features from both the channel and the spatial perspective, so that the D-LinkNet-based remote sensing image road extraction model perceives more comprehensive road information when extracting road features.
In summary: the invention relates to the field of remote sensing image processing, in particular to a remote sensing image road extraction algorithm based on semantic segmentation. The invention provides the D-LinkNet semantic segmentation network, composed of an encoder sub-network based on a residual network and transfer learning, a feature extraction sub-network based on dilated convolution and the convolutional block attention module, and a decoder sub-network based on transposed convolution, and designs a remote sensing image road extraction model on this network. The model downsamples with the residual network, enlarges the receptive field through dilated convolution to perceive roads over a wider range, introduces the convolutional block attention module to focus attention and extract road features, and then uses transposed convolution to restore the image. Finally, the proposed D-LinkNet network can complete the automatic extraction and segmentation of road parts in remote sensing images, reaching higher accuracy in road extraction than the LinkNet and U-Net networks.

Claims (1)

1. A remote sensing image road extraction method based on D-LinkNet is characterized by comprising the following steps:
s1: after the characteristic diagram is input into a D-LinkNet network, processing is completed in a coder sub-network based on a residual error network and transfer learning; the encoder subnetwork firstly performs convolution operation on the Convolutional layer with the Convolutional kernel size of 7 multiplied by 7, the filter number of 64 and the downsampling step length of 2, and then performs maximum pooling processing on the Convolutional layer, wherein the pooling window of the maximum pooling operation is 3 multiplied by 3, the downsampling step length of 2, and then a feature map obtained by processing is input into four coding units containing residual blocks of ResNet50 to be processed, wherein each coding unit only contains one time of Convolitional Block processing, and the number of the blocks containing Identity Block of each coding unit is respectively 2, 3, 5 and 2;
s2: inputting the feature map output in the step S1 into a feature extraction sub-network based on an expansion convolution and convolution block attention module for feature extraction; the feature extraction sub-network is divided into six extraction units: (1) inputting the feature graph of the sub-network to directly pass through and process the output link; (2) inputting the feature diagram of the sub-network, performing expansion convolution operation processing with the expansion rate of 1, and then processing an output link; (3) inputting the feature diagram of the sub-network, sequentially carrying out two times of expansion convolution operation processing with expansion rates of 1 and 4 respectively, and then carrying out processing on the output link; (4) inputting the feature diagram of the sub-network, sequentially carrying out three times of expansion convolution operation processing with expansion rates of 1, 4 and 5, and then carrying out processing on the output link; (5) inputting the feature diagram of the sub-network, sequentially carrying out four times of expansion convolution operations with expansion rates of 1, 4, 5 and 4, and then carrying out processing on the output link; (6) inputting the feature diagram of the sub-network, sequentially carrying out five times of expansion convolution operation processing with expansion rates of 1, 4, 5, 4 and 1, and then carrying out processing on an output link; the six links are mutually connected in parallel, and six local outputs of the six links are added at the tail end of a feature extraction sub-network based on an expansion convolution and convolution block attention module to complete the output of the overall processing of the sub-network;
s3: the feature map produced by the first two sub-networks enters a decoder sub-network based on transposed convolution to recover the image; the decoder sub-network comprises 4 decoding units, a transposed convolution operation and a convolution operation; the output of each decoding unit is skip-connected to the input of the corresponding coding unit in the encoder sub-network based on the residual network and transfer learning; these skip connections link the encoder and decoder of the D-LinkNet network and recover the spatial information lost at each downsampling in the encoder sub-network, which benefits the upsampling operations in the decoder sub-network; within a decoding unit the transposed convolution uses a 3×3 kernel with an upsampling stride of 2, and convolutions with 1×1 kernels reduce and then raise the dimensionality of the input feature map to cut the computational cost; the last transposed convolution of the sub-network uses a 4×4 kernel, the last convolution also uses a 4×4 kernel, and the upsampling stride is 2;
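The six parallel links described in step S2 can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the channel count (64), the 3×3 kernel size and the omission of the attention module are illustrative choices, not taken from the patent text; only the dilation rates and the parallel-sum structure come from the description above.

```python
import torch
import torch.nn as nn

def dilated_branch(channels, rates):
    """One cascade of 3x3 dilated convolutions; padding=rate keeps the spatial size."""
    layers = []
    for r in rates:
        layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class DilatedCenter(nn.Module):
    """Sketch of the six parallel links of step S2 (dilation rates from the text;
    the convolutional block attention module is omitted here for brevity)."""
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Identity(),                              # (1) pass-through link
            dilated_branch(channels, [1]),              # (2) rates: 1
            dilated_branch(channels, [1, 4]),           # (3) rates: 1, 4
            dilated_branch(channels, [1, 4, 5]),        # (4) rates: 1, 4, 5
            dilated_branch(channels, [1, 4, 5, 4]),     # (5) rates: 1, 4, 5, 4
            dilated_branch(channels, [1, 4, 5, 4, 1]),  # (6) rates: 1, 4, 5, 4, 1
        ])

    def forward(self, x):
        # element-wise sum of the six local outputs at the end of the sub-network
        return sum(b(x) for b in self.branches)

x = torch.randn(1, 64, 16, 16)
y = DilatedCenter()(x)   # same shape as the input: every branch preserves size
```
Because each dilated convolution uses padding equal to its dilation rate, all six branch outputs keep the input resolution, so the final element-wise sum is well defined.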
the first step is specifically as follows: a convolutional layer first performs a convolution operation on the remote sensing image, max pooling is then applied, and the resulting feature map is input into the coding units containing residual blocks;
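Assuming a PyTorch implementation, the encoder stem described in step S1 (7×7 convolution with 64 filters and stride 2, followed by 3×3 max pooling with stride 2) might look like the sketch below; the batch-norm/ReLU placement, padding values and the 512×512 input size are assumptions added for a runnable example, and the four ResNet50 coding units are not expanded here.

```python
import torch
import torch.nn as nn

# Sketch of the encoder stem of step S1 (layer hyper-parameters from the text;
# normalization/activation placement follows the common ResNet convention).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # 7x7 conv, 64 filters, stride 2
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 3x3 max pooling, stride 2
)

x = torch.randn(1, 3, 512, 512)  # a hypothetical 512x512 RGB remote-sensing tile
y = stem(x)
print(y.shape)  # torch.Size([1, 64, 128, 128]): two stride-2 stages quarter the resolution
```
The output of this stem would then flow into the four coding units (one Convolutional Block plus 2, 3, 5 and 2 Identity Blocks respectively) described above.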
(1) Identity Block processing
y = F(x, {W_3}) + x
(2) Convolutional Block processing
y = F(x, {W_3}) + W_s · x
where F denotes the stacked convolution processing applied to the input; {W_3} indicates that this processing comprises three convolution operations; y is the output feature map; W_s · x is the result of applying one convolution to the input; and x is the input feature map;
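The two residual forms above can be sketched as a single PyTorch module, where F(x, {W_3}) is a stack of three convolutions and the shortcut is either the identity (Identity Block) or a 1×1 projection W_s (Convolutional Block). The 1×1/3×3/1×1 bottleneck layout and the channel sizes are assumptions following the standard ResNet50 design, not explicit in the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the residual forms: y = F(x, {W_3}) + x (Identity Block)
    or y = F(x, {W_3}) + W_s * x (Convolutional Block, project=True)."""
    def __init__(self, in_ch, out_ch, project=False):
        super().__init__()
        mid = out_ch // 4
        self.F = nn.Sequential(              # {W_3}: three stacked convolutions
            nn.Conv2d(in_ch, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1),
        )
        # W_s: 1x1 projection shortcut; plain identity shortcut when project=False
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1) if project else nn.Identity()

    def forward(self, x):
        return self.F(x) + self.shortcut(x)

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64, 256, project=True)(x)  # Convolutional Block: channels change 64 -> 256
```
The projection shortcut is what lets the Convolutional Block change the channel count while the Identity Blocks that follow it keep it fixed.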
the second step is specifically as follows:
(1) for the feature map input to the feature extraction sub-network, dilated convolution is used to enlarge the receptive field and capture road features over a larger range;
R=r+(k-1)×j
where R denotes the single-side size of the current layer's receptive field, r the single-side size of the previous layer's receptive field, k the single-side size of the convolution kernel, and j the configured dilation rate;
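Applying the formula R = r + (k − 1) × j layer by layer shows how the cascades of step S2 grow the receptive field. The trace below follows link (6) with its rates 1, 4, 5, 4, 1 and a 3×3 kernel; the starting receptive field of 1 and the stride-1 assumption (no jump factor) are simplifications consistent with the formula as stated.

```python
def receptive_field(r_prev, k, j):
    """R = r + (k - 1) * j : receptive-field growth of one dilated conv layer."""
    return r_prev + (k - 1) * j

# Trace link (6) of step S2: five 3x3 dilated convolutions, rates 1, 4, 5, 4, 1.
r = 1
for rate in (1, 4, 5, 4, 1):
    r = receptive_field(r, 3, rate)
print(r)  # 31: the deepest cascade sees a 31x31 neighborhood of the input
```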
(2) the input feature map F is processed by the channel attention operation M_c and the spatial attention operation M_s, and the resulting feature map F' is output;
F' = M_s(M_c(F) ⊗ F) ⊗ (M_c(F) ⊗ F)
where ⊗ denotes element-wise multiplication;
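A minimal sketch of this attention step is shown below. The channel attention M_c and spatial attention M_s are applied in sequence to the feature map; the pooling and shared-MLP internals follow the common CBAM design and are assumptions, since the patent text only names the two operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn

class CBAM(nn.Module):
    """Sketch of the convolutional block attention step: channel attention M_c
    reweights channels, then spatial attention M_s reweights positions."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(            # shared MLP for channel attention
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # M_c(F): channel weights from global average- and max-pooled descriptors
        avg = fn.adaptive_avg_pool2d(x, 1)
        mx = fn.adaptive_max_pool2d(x, 1)
        xc = torch.sigmoid(self.mlp(avg) + self.mlp(mx)) * x
        # M_s(.): spatial weights from channel-wise average and max maps
        s = torch.cat([xc.mean(dim=1, keepdim=True),
                       xc.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.spatial(s)) * xc

x = torch.randn(1, 64, 16, 16)
y = CBAM(64)(x)   # attention reweights values but preserves the tensor shape
```
Both attention maps are broadcast multiplications, so F' keeps the shape of F and can be summed with the other parallel links of the sub-network.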
the third step is specifically as follows: the decoder sub-network comprises decoding units, a transposed convolution operation and a convolution operation; the output of each decoding unit is skip-connected to the input of the corresponding coding unit in the encoder sub-network based on the residual network and transfer learning;
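One decoding unit as described in step S3 might be sketched as follows: a 1×1 convolution reduces the channel dimension, a 3×3 transposed convolution with stride 2 doubles the resolution, and a 1×1 convolution raises the dimension again. The reduction factor of 4, the padding/output_padding values and the additive form of the skip connection are assumptions added to make the sketch runnable.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of a decoding unit: 1x1 reduce -> 3x3 transposed conv (stride 2)
    -> 1x1 raise, with an optional additive skip from the matching encoder unit."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, in_ch // 4, kernel_size=1)   # dimension reduction
        self.up = nn.ConvTranspose2d(in_ch // 4, in_ch // 4,
                                     kernel_size=3, stride=2,
                                     padding=1, output_padding=1)   # exact 2x upsampling
        self.expand = nn.Conv2d(in_ch // 4, out_ch, kernel_size=1)  # dimension raising

    def forward(self, x, skip=None):
        x = self.expand(self.up(self.reduce(x)))
        return x if skip is None else x + skip  # skip connection from the encoder

x = torch.randn(1, 256, 32, 32)
y = DecoderBlock(256, 128)(x)
print(y.shape)  # torch.Size([1, 128, 64, 64]): resolution doubled, channels halved
```
The 1×1 convolutions keep the costly transposed convolution operating on a quarter of the channels, which is the computational saving the text refers to.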
o=(i-1)×s+k-2×p
where o denotes the single-side size of the feature map output by the transposed convolution, i the single-side size of the input feature map, s the configured stride, k the kernel size of the transposed convolution, and p the padding size.
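The output-size formula o = (i − 1) × s + k − 2 × p can be checked directly for the two kernel configurations named in step S3; the input sizes and padding value 1 below are illustrative assumptions.

```python
def transposed_out(i, s, k, p):
    """o = (i - 1) * s + k - 2 * p : single-side output size of a transposed conv."""
    return (i - 1) * s + k - 2 * p

# Decoding units use a 3x3 kernel with stride 2; with p=1 the result falls one
# short of an exact doubling (31, not 32), which is why frameworks offer an
# extra output-padding parameter.
print(transposed_out(16, 2, 3, 1))   # 31

# The final transposed convolution uses a 4x4 kernel with stride 2; with p=1
# this yields an exact 2x upsampling.
print(transposed_out(256, 2, 4, 1))  # 512
```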
CN202010558654.1A 2020-06-18 2020-06-18 Remote sensing image road extraction method based on D-LinkNet Active CN111767810B (en)


Publications (2)

Publication Number Publication Date
CN111767810A CN111767810A (en) 2020-10-13
CN111767810B true CN111767810B (en) 2022-08-02







Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant