CN115471754A - Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network - Google Patents

Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network

Info

Publication number
CN115471754A
Authority
CN
China
Prior art keywords
network
feature
scale
road
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210941960.2A
Other languages
Chinese (zh)
Inventor
陶于祥
何哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210941960.2A priority Critical patent/CN115471754A/en
Publication of CN115471754A publication Critical patent/CN115471754A/en
Pending legal-status Critical Current

Classifications

    • G06V20/13 — Scenes; Terrestrial scenes; Satellite images
    • G06V10/24 — Image preprocessing; Aligning, centring, orientation detection or correction of the image
    • G06V10/75 — Pattern recognition or machine learning; Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/182 — Terrestrial scenes; Network patterns, e.g. roads or rivers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image road extraction method based on a multi-dimensional and multi-scale U-net network. Firstly, the coding network combines a residual structure with an attention feature fusion mechanism to perform multi-scale extraction of road feature information. Secondly, an atrous spatial pyramid pooling (ASPP) module is added to the bridge network to perform multi-scale feature extraction on the road information. Finally, a feature alignment module is added to the decoding network to correct the inaccurate correspondence between high-level and low-level features caused by the non-learnable nature of the upsampling operation and the repeated application of up- and down-sampling. The method calculates the model loss with a composite loss function combining cross entropy and the Dice coefficient, alleviating the imbalance between positive and negative samples in remote sensing road data sets and improving the road extraction results of the model.

Description

Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a remote sensing image road extraction method based on an improved U-net network.
Background
Road information is essential to daily life and travel; roads form the backbone of transportation and have long supported the development of human civilization. Road extraction is of great value in many applications, such as autonomous driving, city planning, intelligent transportation systems, emergency risk management, and the updating of geographical information.
At present, remote sensing image road extraction methods fall mainly into traditional methods and deep-learning-based methods. Traditional methods usually extract roads with manually designed features and can be further divided into pixel-based methods and object-oriented methods. Pixel-based methods extract roads mainly by analyzing differences in spectral features and can generally be classified into spectral analysis, threshold segmentation, and edge detection methods. Pixel-based methods make full use of the gray values of the image and can extract roads well from remote sensing images with clear imagery, simple backgrounds, and sparse road networks. However, such methods are prone to salt-and-pepper noise and do not distinguish background interference well. Object-oriented methods treat the road as an object, take the road object as a whole, and identify it by information modeling; they offer better noise resistance and applicability, but easily confuse and misclassify ground objects that are spatially adjacent and similar in shape.
With the rapid growth of available data and computing power, deep learning techniques have achieved tremendous success in computer vision. Deep learning is increasingly applied to extracting information from high-resolution remote sensing images because of its good performance and generalization capability. Unlike traditional methods, which require road features of the remote sensing image to be designed manually, deep learning actively learns feature representations by training a neural network: it automatically learns shallow features during repeated network iterations and then progressively learns deeper abstract features. Deep learning methods can mine high-level road features to improve the effectiveness of computer vision tasks; they have strong adaptive learning and feature fitting capabilities and offer great advantages in the accuracy and degree of automation of road extraction.
Mnih and Hinton (2013) first applied deep learning to road extraction and constructed the Massachusetts road data set. Jonathan Long et al. subsequently proposed the Fully Convolutional Network (FCN), which moved from simple image classification to pixel-level classification by replacing fully connected layers with standard convolutional layers, preserving the spatial information of the original input image and greatly improving segmentation. Fully convolutional networks extended from the FCN have therefore been applied increasingly, especially to road extraction. However, as the resolution of remote sensing images rises, the detailed features of road areas expressed in the images become more complex, road-surface interference (such as buildings and trees) increases, and many non-road areas (urban buildings, vegetable greenhouses, and the like) exhibit features highly similar to roads. CN110807376A, an urban and suburban road extraction method and apparatus based on remote sensing images, obtains GIS image information from a digital map to produce label data and training/test data; constructs an initial road extraction network model based on the U-Net network; trains the initial model with the label data and training/test data to obtain a road extraction model capable of recognizing roads; and detects remote sensing images with that model to extract road targets automatically. By constructing an improved U-Net network, that application improves the accuracy of urban and suburban road extraction from remote sensing images.
Firstly, in that patent the two branches of the residual structure are fused by feature summation; this fusion merely assigns fixed weights to the features, does not consider changes in feature content, and is inefficient. The method disclosed by the present invention instead fuses the two branches of the residual structure by attention feature fusion, which dynamically and adaptively fuses the received features in a scale-aware manner, compensates for semantic differences between branches, and enhances the network's ability to learn the global and local information of the remote sensing image. Secondly, because up- and down-sampling are applied repeatedly, the high-level and low-level features connected in the decoding network correspond inaccurately, and connecting misaligned features merely by channel concatenation may harm subsequent learning. That patent does not consider this problem. To address it, the present invention adds a feature alignment module to the decoding network, which dynamically establishes the positional correspondence between features of different levels and improves the decoder's ability to reconstruct fine details.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A remote sensing image road extraction method based on a multi-dimensional and multi-scale U-net network is provided. The technical scheme of the invention is as follows:
a remote sensing image road extraction method based on a multi-dimensional and multi-scale U-net network comprises the following steps:
step 1, selecting the public Massachusetts road data set as raw data, and performing preprocessing steps including cropping and data enhancement;
step 2, inputting the preprocessed data into a coding network, wherein the coding network combines a residual structure and an attention feature fusion mechanism to perform multi-scale extraction of road feature information;
step 3, using the output of the coding network as the input of a bridge network, to which an atrous spatial pyramid pooling (ASPP) module is added; the ASPP module comprises parallel atrous convolution layers that are equivalent to several different receptive fields and sample in parallel at multiple scales, achieving multi-scale feature fusion of deep features;
step 4, decoding network stage: the feature map is gradually restored to the input image size by upsampling; a feature alignment module (FAM) is added to the decoding network, which takes the high-level features and the low-level features of the corresponding coding-network layer as input to generate a semantic flow, and uses the semantic flow to adjust the feature maps of two adjacent levels, producing feature output with high resolution and strong semantics;
step 5, finally, changing the number of channels to 2 through a 1 × 1 convolutional layer, and testing the model on the test set of the Massachusetts data set;
and, during model training, calculating the model loss with a composite loss function combining a cross entropy loss function and a Dice loss function.
Further, the step 1 specifically includes:
the image size of the Massachusetts data set is 1500 multiplied by 1500, and a 256 multiplied by 256 area is arranged to cut all the images of the original data set; high-grade data of 256 multiplied by 3 wave bands are input into a built coding network as input data to extract road information.
Further, the step 2 inputs the preprocessed data into a coding network, and the coding network combines a residual structure and an attention feature fusion mechanism to perform multi-scale extraction of the road feature information, specifically including:
the coding network comprises a convolution sequence block (CSB) and an attention residual learning unit (ARLU); the preprocessed RGB image is converted into high-dimensional features by the convolution sequence block, and multi-scale, multi-level features are then generated by the attention residual learning unit; in the attention residual learning unit, a residual unit replaces the ordinary neural network unit, and the identity mapping branch and the residual branch in the residual unit are fused through an attention feature fusion module. Fusing the two branches of the residual structure by attention feature fusion allows the network to extract information at multiple scales from the feature map along the channel dimension while keeping the network lightweight.
Further, in step 3 the ASPP module of the bridge network comprises 5 parallel branches: a 1 × 1 convolution branch, three 3 × 3 dilated convolution branches, and a global average pooling branch; the 1 × 1 convolution branch and the global average pooling branch are equivalent to using the minimum and maximum receptive fields, respectively, to retain the inherent characteristics of the input, while the other three branches are set with different dilation rates to describe image features at different scales.
Further, the step 4 specifically includes: at the decoding network stage the feature map is restored step by step to the input image size by upsampling. A feature alignment module connects the high-level features of the decoding network with the low-level features of the corresponding coding-network layer. In the feature alignment module, the high-level features first pass through a transposed convolution that changes the image size and number of channels; the changed high-level features are then concatenated with the low-level features, and a convolution operation generates the semantic flow. Guided by the semantic flow, the feature alignment module corrects the inaccurate correspondence between high-level and low-level features caused by repeated up- and down-sampling, so that the semantic information in the high-level features flows better into the low-level features; this closes the semantic and resolution gap between the high-level and low-level features and guides the model to recover the initial resolution while retaining rich semantic information.
Further, the model prediction stage in step 5 specifically includes: changing the number of feature map channels to 2 through a 1 × 1 convolutional layer to generate the final prediction map; and inputting the preprocessed test images of the Massachusetts data set into the trained model.
Further, a composite loss function composed of a cross entropy loss function and a Dice loss function is used to calculate the model loss; the cross entropy loss function and the Dice loss function are defined as follows:

L_BCE = -(1/N) Σ_{i=1}^{N} [g_i log p_i + (1 - g_i) log(1 - p_i)]

L_D = 1 - 2 Σ_{i=1}^{N} g_i p_i / (Σ_{i=1}^{N} g_i + Σ_{i=1}^{N} p_i)

wherein N represents the total number of pixels, g_i represents the true label value of pixel i, and p_i represents the predicted value of pixel i;

the composite loss function is defined as follows:

L = L_BCE + L_D
the invention has the following advantages and beneficial effects:
the innovation of the invention mainly comprises the matching of the steps 2, 3 and 4 of the claims. Step 2, residual learning is introduced, so that network training is easier, and the degradation problem of the deep network is solved to a great extent; meanwhile, the attention characteristic fusion module is used for fusing the two branches of the residual error structure, so that the semantic difference between different branches is made up, and the learning capability of the network on the global and local information of the remote sensing image is enhanced. Step 3, the bridge network uses ASPP, a convolution kernel receiving domain is enlarged through parallel expansion convolution layers, information extraction and fusion of multiple scales are further carried out on high-level features, and connectivity of roads is improved. And 4, adding a feature alignment module in the decoding network, dynamically establishing the position corresponding relation between different layers of features, solving the problem of dislocation between different layers of features and improving the capability of reconstructing precise details of the decoding network.
Drawings
FIG. 1 is a diagram of a remote sensing image road extraction model framework based on MMS-UNet according to the preferred embodiment of the invention.
FIG. 2 is the convolution sequence block (CSB) in the coding network.
FIG. 3 is the attention residual learning unit (ARLU) in the coding network.
FIG. 4 is the attention feature fusion module (AFF).
FIG. 5 is the ASPP module.
FIG. 6 is the feature alignment module (FAM).
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the invention aims to solve the problems that in practical application, the model is limited in receptive field, the fusion efficiency of the identity mapping branch and the residual error branch in the residual error structure is low, the fusion efficiency of high-level and low-level features is low and the like. Aiming at the problems, a remote sensing image road extraction method based on multi-dimensional multi-scale U-net is provided. The method fully utilizes information from different layers through an attention feature fusion method, expands an acceptance domain of a convolution kernel by using ASPP, then adds a feature module to align and fuse high-layer and low-layer information, and upsamples the information to the size of an input image to generate a prediction graph. Tests on Massachusetts public road data sets show that the method has good effect of predicting images.
Fig. 1 shows the network structure of the present invention, wherein:
Step (1) is the preprocessing of the model input data: the images of the Massachusetts data set are 1500 × 1500, and a 256 × 256 window is set to crop all images of the original data set. The resulting 256 × 256 × 3 three-band high-resolution data are input into the constructed MMS-UNet network model to extract road information.
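As an illustration, a minimal preprocessing sketch in Python is given below. The 256 × 256 tile size follows the description; the file path, the non-overlapping cropping scheme, and the particular flip/rotation augmentations are assumptions of the sketch.

```python
import numpy as np
from PIL import Image

TILE = 256  # crop size stated in the description

def crop_to_tiles(image: np.ndarray, tile: int = TILE):
    """Cut an image (e.g. 1500 x 1500) into non-overlapping tile x tile patches;
    edge remainders are discarded in this simple variant."""
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def augment(patch: np.ndarray):
    """Simple data enhancement: horizontal/vertical flips and a 90-degree rotation."""
    yield patch
    yield np.fliplr(patch)
    yield np.flipud(patch)
    yield np.rot90(patch)

# hypothetical file name from a local copy of the Massachusetts road data set
img = np.asarray(Image.open("massachusetts/train/sat/example.tiff"))
patches = [a for p in crop_to_tiles(img) for a in augment(p)]
```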
Step (2) combines a residual structure and an attention feature fusion module in the coding network to extract road features. The coding network comprises a convolution sequence block (CSB) and an attention residual learning unit (ARLU). The convolution sequence block is composed of two stacked convolution sequences, each consisting of a 3 × 3 convolution layer, a batch normalization layer, and a ReLU layer. The input RGB image is converted into high-dimensional features by the convolution sequence block, and multi-scale, multi-level features are then generated by the attention residual learning unit. In the attention residual learning unit, a residual unit replaces the ordinary neural network unit, and the identity mapping branch and the residual branch in the residual unit are fused through an attention feature fusion module. The attention feature fusion module first fuses the identity mapping branch output F_id and the residual branch output F_res by feature summation; the fused result A is input into a multi-scale channel attention module (MS-CAM), in which one branch G(A) obtains global channel context information by global average pooling, while the other branch L(A) obtains local channel context information directly by point convolution. The two branches G(A) and L(A) are fused by a summation operation; letting M denote the output of the MS-CAM:

G(A) = B(PW_2(δ(B(PW_1(g(A))))))   (1)

L(A) = B(PW_2(δ(B(PW_1(A)))))   (2)

M(A) = σ(G(A) + L(A))   (3)

where g(·) denotes the global average pooling operation (GAP); PW_1 and PW_2 denote point convolutions with kernel sizes C/r × C × 1 × 1 and C × C/r × 1 × 1, respectively, where r is the channel reduction ratio; B denotes the batch normalization operation (BN), δ the ReLU activation function, and σ the Sigmoid function. The output F of the attention feature fusion module (AFF) can therefore be expressed by equation (4):

F = M(A) ⊗ F_id + (1 − M(A)) ⊗ F_res   (4)

where ⊗ denotes element-wise multiplication. The module adds local information to global information through point convolution, so the network can extract information at multiple scales from the feature map along the channel dimension while remaining lightweight. The attention residual learning unit connects the residual branch and the identity mapping branch through the attention feature fusion module, dynamically and adaptively fusing the received features in a scale-aware manner; this compensates for semantic differences between branch features, enhances the network's learning of the global and local information of the remote sensing image, and improves the accuracy of road recognition.
Step (3) introduces an ASPP module into the bridge network to perform multi-scale feature extraction on deep features. The ASPP module contains 5 branches: a 1 × 1 convolution branch, three 3 × 3 dilated convolution branches, and a global average pooling branch. The 1 × 1 convolution branch and the global average pooling branch are equivalent to using the minimum and maximum receptive fields, respectively, to retain the inherent characteristics of the input; the other three branches are set with dilation rates of 6, 12, and 18 and sample features from the feature map. Finally, the outputs of the five branches are fused by feature concatenation, and the number of channels is adjusted with a 1 × 1 convolutional layer.
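The following is a minimal PyTorch sketch of such an ASPP module under the branch layout just described; the channel counts, the BN/ReLU placement, and the bilinear re-broadcast of the pooled branch are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP with the five parallel branches described above: a 1x1 convolution,
    three 3x3 dilated convolutions with rates 6/12/18, and global average
    pooling; the outputs are concatenated and reduced by a 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()

        def conv_branch(k: int, rate: int) -> nn.Sequential:
            pad = rate if k == 3 else 0
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=rate),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        self.b1x1 = conv_branch(1, 1)
        self.b_r6, self.b_r12, self.b_r18 = (conv_branch(3, r) for r in (6, 12, 18))
        self.b_gap = nn.Sequential(nn.AdaptiveAvgPool2d(1), conv_branch(1, 1))
        self.project = nn.Sequential(  # 1x1 convolution that adjusts the channel count
            nn.Conv2d(5 * out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        # broadcast the pooled branch back to the feature-map size before concatenation
        gap = F.interpolate(self.b_gap(x), size=(h, w), mode="bilinear", align_corners=False)
        feats = [self.b1x1(x), self.b_r6(x), self.b_r12(x), self.b_r18(x), gap]
        return self.project(torch.cat(feats, dim=1))
```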
Step (4): at the decoding network stage, the feature map is restored step by step to the input image size by upsampling. The output F_H of the bridge network first passes through a transposed convolution so that its number of channels and image size match the corresponding coding-network level features F_L; F_H and F_L are then fused by concatenation and input into a 3 × 3 convolutional layer to generate the semantic flow S_L, as shown in equation (1):

S_L = Conv_1(Cat(T(F_H), F_L))   (1)

where Conv_1(·) denotes a 3 × 3 convolution operation, Cat(·) a concatenation operation, and T(·) a transposed convolution operation. The resulting semantic flow S_L ∈ R^(H × W × 2) gives the offset between the high-level and the low-level features in the two spatial directions. Each pixel p_L on the low-level feature map is mapped to the corresponding pixel p_H on the high-level feature map, and the value at p_H is obtained by bilinear interpolation over its four neighboring points, achieving semantic alignment of the high-level and low-level features, as shown in equations (2) and (3):

p_H = p_L + S_L(p_L)   (2)

F_H(p_H) = Σ_{p ∈ N(p_H)} w_p F_H(p)   (3)

where N(p_H) denotes the neighboring points of pixel p_H on the high-level feature map and w_p denotes the bilinear kernel weight estimated from the distances on the warped grid. F_L is passed through a 1 × 1 convolution that changes its number of channels and is then summed with F_H(p_H) to obtain the output F_out of the feature alignment module:

F_out = Conv_2(F_L) + F_H(p_H)   (4)

where Conv_2(·) denotes a 1 × 1 convolution operation. Through the feature alignment module, the semantic and resolution gap between the high-level and low-level features is closed, and the model is guided to recover the initial resolution while retaining rich semantic information.
Step (5): in the final prediction stage of the model, the number of feature map channels is changed to 2 through a 1 × 1 convolutional layer to generate the final prediction map. The test images of the Massachusetts data set are preprocessed and input into the trained model; the experimental results show that the improved U-net network model obtains better road extraction results.
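The prediction stage can be sketched as follows; the 64-channel decoder output is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

decoder_features = torch.randn(1, 64, 256, 256)  # placeholder decoder output; 64 channels assumed
head = nn.Conv2d(64, 2, kernel_size=1)           # 1x1 convolution to 2 classes (background / road)
logits = head(decoder_features)                  # (1, 2, 256, 256)
road_mask = logits.argmax(dim=1)                 # per-pixel class map used as the prediction
```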
Step (6): the road extraction task uses a composite loss function composed of a cross entropy loss function and a Dice loss function. In remote sensing images a road is a narrow region that occupies only a very small proportion of the whole image, so there is a highly imbalanced class distribution between road and background. The cross entropy loss function evaluates every pixel of the segmentation result, which can lead to overfitting toward the class with more samples when the image has a class imbalance problem: when extracting roads from remote sensing images, the network becomes biased toward learning the background, and its ability to extract the foreground target decreases. The Dice coefficient treats all pixels of one class as a whole and computes the proportion of the intersection of the two within that whole, so it is not affected by the large number of background pixels and performs well under sample imbalance. The cross entropy loss function and the Dice loss function are therefore defined as follows:

L_BCE = -(1/N) Σ_{i=1}^{N} [g_i log p_i + (1 - g_i) log(1 - p_i)]   (1)

L_D = 1 - 2 Σ_{i=1}^{N} g_i p_i / (Σ_{i=1}^{N} g_i + Σ_{i=1}^{N} p_i)   (2)

wherein N represents the total number of pixels, g_i represents the true label value of pixel i, and p_i represents the predicted value of pixel i;

the composite loss function is defined as follows:

L = L_BCE + L_D   (3)
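A minimal sketch of this composite loss follows, written for a single-channel road-probability map; with the 2-channel output of step (5), the softmax road channel would serve as p_i (an assumption), and the epsilon term is added for numerical stability.

```python
import torch

def composite_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """L = L_BCE + L_D, Eqs. (1)-(3): pred holds per-pixel road probabilities
    p_i in (0, 1); target holds binary ground-truth labels g_i."""
    p, g = pred.reshape(-1), target.reshape(-1).float()
    l_bce = -(g * torch.log(p + eps) + (1 - g) * torch.log(1 - p + eps)).mean()  # Eq. (1)
    l_dice = 1 - (2 * (p * g).sum()) / (p.sum() + g.sum() + eps)                 # Eq. (2)
    return l_bce + l_dice                                                        # Eq. (3)

# usage on random tensors
pred = torch.sigmoid(torch.randn(2, 1, 256, 256))
target = torch.randint(0, 2, (2, 1, 256, 256))
loss = composite_loss(pred, target)
```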
the systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A remote sensing image road extraction method based on a multi-dimensional and multi-scale U-net network is characterized by comprising the following steps:
step 1, selecting the public Massachusetts road data set as raw data, and performing preprocessing steps including cropping and data enhancement;
step 2, inputting the preprocessed data into a coding network, wherein the coding network combines a residual structure and an attention feature fusion mechanism to perform multi-scale extraction of road feature information;
step 3, using the output of the coding network as the input of a bridge network, to which an atrous spatial pyramid pooling (ASPP) module is added; the ASPP module comprises parallel atrous convolution layers that are equivalent to several different receptive fields and sample in parallel at multiple scales, achieving multi-scale feature fusion of deep features;
step 4, decoding network stage: the feature map is gradually restored to the input image size by upsampling; a feature alignment module (FAM) is added to the decoding network, which takes the high-level features and the low-level features of the corresponding coding-network layer as input to generate a semantic flow, and uses the semantic flow to adjust the feature maps of two adjacent levels, producing feature output with high resolution and strong semantics;
step 5, finally, changing the number of channels to 2 through a 1 × 1 convolutional layer, and testing the model on the test set of the Massachusetts data set;
and, during model training, calculating the model loss with a composite loss function combining a cross entropy loss function and a Dice loss function.
2. The method for extracting the remote sensing image road based on the multi-dimensional multi-scale U-net network according to claim 1, wherein the step 1 specifically comprises:
the image size of the Massachusetts data set is 1500 multiplied by 1500, and a 256 multiplied by 256 area is arranged to cut all the images of the original data set; high-grade data of 256 multiplied by 3 wave bands are input into a built coding network as input data to extract road information.
3. The method for extracting a remote sensing image road based on a multi-dimensional and multi-scale U-net network as claimed in claim 1, wherein the step 2 inputs the preprocessed data into a coding network, and the coding network combines a residual structure and an attention feature fusion mechanism to perform multi-scale extraction of road feature information, specifically comprising:
the coding network comprises a convolution sequence block (CSB) and an attention residual learning unit (ARLU); the preprocessed RGB image is converted into high-dimensional features by the convolution sequence block, and multi-scale, multi-level features are then generated by the attention residual learning unit; in the attention residual learning unit, a residual unit replaces the ordinary neural network unit, and the identity mapping branch and the residual branch in the residual unit are fused through an attention feature fusion module; fusing the two branches of the residual structure by attention feature fusion allows the network to extract information at multiple scales from the feature map along the channel dimension while keeping the network lightweight.
4. The method for extracting a remote sensing image road based on a multi-dimensional and multi-scale U-net network according to claim 1, wherein in step 3 the ASPP module of the bridge network comprises 5 parallel branches: a 1 × 1 convolution branch, three 3 × 3 dilated convolution branches, and a global average pooling branch; the 1 × 1 convolution branch and the global average pooling branch are equivalent to using the minimum and maximum receptive fields, respectively, to retain the inherent characteristics of the input, while the other three branches are set with different dilation rates to describe image features at different scales.
5. The method for extracting a remote sensing image road based on a multi-dimensional and multi-scale U-net network according to claim 1, wherein the step 4 specifically comprises: at the decoding network stage the feature map is restored step by step to the input image size by upsampling, and a feature alignment module connects the high-level features of the decoding network with the low-level features of the corresponding coding-network layer; in the feature alignment module, the high-level features first pass through a transposed convolution that changes the image size and number of channels, the changed high-level features are then concatenated with the low-level features, and a convolution operation generates the semantic flow; guided by the semantic flow, the feature alignment module corrects the inaccurate correspondence between high-level and low-level features caused by repeated up- and down-sampling, so that the semantic information in the high-level features flows better into the low-level features, closing the semantic and resolution gap between them and guiding the model to recover the initial resolution while retaining rich semantic information.
6. The method for extracting a remote sensing image road based on a multi-dimensional multi-scale U-net network as claimed in claim 5, wherein the model prediction stage in step 5 specifically comprises: changing the number of feature map channels to 2 through a 1 × 1 convolutional layer to generate the final prediction map; and inputting the preprocessed test images of the Massachusetts data set into the trained model.
7. The method for extracting the remote sensing image road based on the multi-dimensional multi-scale U-net network according to claim 6, wherein model loss is calculated by using a composite loss function consisting of a cross entropy loss function and a Dice loss function, and the cross entropy loss function and the Dice loss function are respectively defined as follows:
L_BCE = -(1/N) Σ_{i=1}^{N} [g_i log p_i + (1 - g_i) log(1 - p_i)]

L_D = 1 - 2 Σ_{i=1}^{N} g_i p_i / (Σ_{i=1}^{N} g_i + Σ_{i=1}^{N} p_i)

wherein N represents the total number of pixels, g_i represents the true label value of pixel i, and p_i represents the predicted value of pixel i; the composite loss function is defined as follows:

L = L_BCE + L_D
CN202210941960.2A 2022-08-08 2022-08-08 Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network Pending CN115471754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210941960.2A CN115471754A (en) 2022-08-08 2022-08-08 Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210941960.2A CN115471754A (en) 2022-08-08 2022-08-08 Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network

Publications (1)

Publication Number Publication Date
CN115471754A true CN115471754A (en) 2022-12-13

Family

ID=84367584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210941960.2A Pending CN115471754A (en) 2022-08-08 2022-08-08 Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network

Country Status (1)

Country Link
CN (1) CN115471754A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343063A (en) * 2023-05-26 2023-06-27 南京航空航天大学 Road network extraction method, system, equipment and computer readable storage medium
CN116343063B (en) * 2023-05-26 2023-08-11 南京航空航天大学 Road network extraction method, system, equipment and computer readable storage medium
CN116721351A (en) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 Remote sensing intelligent extraction method for road environment characteristics in overhead line channel
CN116994231A (en) * 2023-08-01 2023-11-03 无锡车联天下信息技术有限公司 Method and device for determining left-behind object in vehicle and electronic equipment

Similar Documents

Publication Publication Date Title
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
Guo et al. CDnetV2: CNN-based cloud detection for remote sensing imagery with cloud-snow coexistence
CN115471754A (en) Remote sensing image road extraction method based on multi-dimensional and multi-scale U-net network
CN112668494A (en) Small sample change detection method based on multi-scale feature extraction
CN110555841B (en) SAR image change detection method based on self-attention image fusion and DEC
CN111860233B (en) SAR image complex building extraction method and system based on attention network selection
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113505792B (en) Multi-scale semantic segmentation method and model for unbalanced remote sensing image
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN112561876A (en) Image-based pond and reservoir water quality detection method and system
CN112508079B (en) Fine identification method, system, equipment, terminal and application of ocean frontal surface
CN112733693B (en) Multi-scale residual error road extraction method for global perception high-resolution remote sensing image
CN115471467A (en) High-resolution optical remote sensing image building change detection method
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115272278A (en) Method for constructing change detection model for remote sensing image change detection
CN115830471A (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
Thati et al. A systematic extraction of glacial lakes for satellite imagery using deep learning based technique
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN113378642A (en) Method for detecting illegal occupation buildings in rural areas
CN112686184A (en) Remote sensing house change detection method based on neural network
CN117152435A (en) Remote sensing semantic segmentation method based on U-Net3+
CN115327544B (en) Little-sample space target ISAR defocus compensation method based on self-supervision learning
CN111505738A (en) Method and equipment for predicting meteorological factors in numerical weather forecast
CN114549958B (en) Night and camouflage target detection method based on context information perception mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination