CN118072155A - High-resolution remote sensing image road extraction method based on deformable attention mechanism - Google Patents

Info

Publication number: CN118072155A (legal status: pending)
Application number: CN202211469934.0A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 张广运, 戴菱, 张荣庭
Assignee (original and current): Nanjing Tech University
Prior art keywords: road, pixel, deformable, attention, sampling
Classification

  • Image Processing (AREA)

Abstract

The invention provides a high-resolution remote sensing image road extraction method based on a deformable attention mechanism, which addresses the low extraction accuracy and narrow application range for complex roads in the high-resolution remote sensing field. The implementation steps are as follows: perform data enhancement on an existing high-resolution remote sensing image dataset with sliding-window, horizontal-flip and vertical-flip operations; form a road enhancement module from one-dimensional convolutions in four directions (horizontal, vertical, left diagonal and right diagonal); use a 1×1 convolution module to constrain the learning of the deformable offsets in the self-attention mechanism, forming a deformable attention module that adapts automatically to the geometric transformations of the road; embed the road enhancement module and the deformable attention module between an encoder built on ResNet and a bilinear-interpolation upsampling decoder to extract multi-scale road information; post-process by extracting the road centerline, reconnecting centerline breaks, and generating a centerline vector graph. The invention improves the accuracy of road extraction.

Description

High-resolution remote sensing image road extraction method based on deformable attention mechanism
Technical Field
The invention relates to high-resolution remote sensing image processing technology, and in particular to a high-resolution remote sensing image road extraction method based on a deformable attention mechanism.
Background
In order to quickly acquire the latest road information over a large area, extracting road networks from high-resolution remote sensing images has become a promising approach. High-resolution remote sensing imagery generates massive data and has become the main data source for extracting road areas and updating geospatial databases in real time. Although deep convolutional neural networks perform well in road semantic segmentation and extraction, road extraction still faces difficulties: intra-class differences and inter-class similarity increase the difficulty of extraction. Road types are various, and urban roads, rural roads, railways and the like differ markedly from one another, yet roads are highly similar to other objects (for example urban roads and urban buildings), so road context information is easily lost.
Synthesizing research progress at home and abroad, existing neural network models often cannot capture long-range dependencies in remote sensing images, which greatly limits the extraction effect. Capturing long-range dependencies has therefore long been a notable problem in the deep learning field. Recent work has focused on enlarging the receptive field through multi-layer convolution stacking, atrous convolutions, feature pyramids and multi-scale convolutions. These methods have limitations: atrous convolutions perform poorly on small objects, and stacking many convolutions leads to a surge in computation and inefficient extraction. In addition, there is currently little research on capturing long-range dependencies of roads in high-resolution remote sensing images under limited computing resources.
The self-attention mechanism captures global information to obtain a larger receptive field and context information. Mathematically, the self-attention mechanism learns weights by taking pairwise dot products of the input signals and then sums all signals with their corresponding weights. Although self-attention is somewhat effective at capturing long-range dependencies, its limitations are also apparent. In remote sensing images, spatially adjacent pixels are usually highly similar and strongly dependent, yet self-attention emphasizes the semantic information of the global context while ignoring local information. The weighted-sum operation over global information may scatter the attention weights, letting the range of interest exceed the target range and introducing invalid attention weights. Therefore, to capture the long-range dependencies of roads in high-resolution remote sensing images while reducing the computational complexity of self-attention, a high-resolution remote sensing image road extraction method based on a deformable attention mechanism is needed.
Disclosure of Invention
The invention aims to provide a high-resolution remote sensing image road extraction method based on a deformable attention mechanism that meets the requirements of the high-resolution remote sensing field for accuracy, speed and a wide application range in complex road extraction.
The technical scheme adopted by the invention is as follows: a high-resolution remote sensing image road extraction method based on a deformable attention mechanism is characterized by comprising the following steps:
1) Data enhancement is performed on the high-resolution remote sensing image dataset using sliding-window, horizontal-flip and vertical-flip operations, specifically comprising:
sampling on a high-resolution remote sensing image with a fixed-size window and a set sliding step length; enhancing the diversity of the dataset with horizontal and vertical flipping, the flip probability of each picture being set to a random number between 0 and 1.
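The data-enhancement step above can be sketched in Python; the function names, the default window/stride values and the interpretation of the random flip probability are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def sliding_window_patches(image, window=512, stride=256):
    """Sample fixed-size patches from a large image with a sliding window."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patches.append(image[top:top + window, left:left + window])
    return patches

def random_flip(image, rng):
    """Flip horizontally and/or vertically; the flip probability itself
    is drawn at random in (0, 1), per the description above."""
    p = rng.random()              # per-picture flip probability (assumption)
    if rng.random() < p:
        image = image[:, ::-1]    # horizontal flip
    if rng.random() < p:
        image = image[::-1, :]    # vertical flip
    return image
```

For example, a 1024×1024 scene with a 512-pixel window and a 256-pixel stride yields a 3×3 grid of overlapping patches.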
2) The road enhancement module is formed from one-dimensional convolutions in four directions: horizontal, vertical, left diagonal and right diagonal.
3) A 1×1 convolution module is used to constrain the learning of the deformable offsets in the self-attention mechanism, forming a deformable attention module that adapts automatically to the geometric transformations of the road.
4) A road enhancement module and a deformable attention module are embedded between an encoder based on ResNet and a bilinear interpolation upsampling decoder to extract multi-scale road information.
5) Extracting the road center line of the road network obtained in the step 4), then carrying out disconnection detection and reconnection on the center line, improving the connectivity of the road, carrying out grid vector conversion on the optimized road network, and generating a vector road network.
6) Evaluate the results of the method by the road intersection-over-union (IoU) and the F1 score, and compare with 5 other advanced road extraction models, namely U-Net, DeepLabv3+, D-LinkNet, SGCN-Net and MDANet, to verify the superiority of the method.
According to the high-resolution remote sensing image road extraction method based on the deformable attention mechanism, a road-information-enhanced deformable attention mechanism is designed, with Python as the design language and PyTorch as the algorithm design framework. Complex road features in high-resolution remote sensing images can be better extracted, and long-range context dependencies of complex roads are captured at low computational complexity under limited computing resources, meeting the requirements of the high-resolution remote sensing field for accuracy, timeliness and a wide application range in satellite-image road extraction.
Drawings
FIG. 1 is a flow chart of a method for extracting a road from a high-resolution remote sensing image based on a deformable attention mechanism according to the present invention;
FIG. 2 is a high resolution remote sensing image road extraction model architecture based on deformable road attention;
FIG. 3 is a diagram of the deformable attention module DA (S112 in the model architecture of FIG. 2), in which S201 is a reference pixel, S202 the offset of the sampling point, S203 an input feature map, S204 a sampling-offset feature map, S205 an attention weight, and S206 an output pixel;
Fig. 4 is an S111 road enhancement module RAM of fig. 3;
Fig. 5 is S101 in fig. 2;
fig. 6 is S102 in fig. 2;
Fig. 7 is S103 in fig. 2;
fig. 8 is S104 in fig. 2;
fig. 9 is S105 in fig. 2;
Fig. 10 is an encoder structure BTNK1 in fig. 6-9;
FIG. 11 is an encoder structure BTNK of FIGS. 6-9;
fig. 12 is a structure of S106 to S110 in fig. 2;
FIG. 13 is a residual structure;
FIG. 14 is a tensor voting flow chart;
FIGS. 15 and 16 are binary results on DeepGlobe dataset and CHN6-CUG dataset, (a) column is the original image, (b) column is the truth chart, (c) column is the U-Net extraction result, (D) column is Deeplabv extraction result, (e) column is D-LinkNet extraction result, (f) column is SGCN-Net extraction result, (g) column is MDANet extraction result, and (h) column is the extraction result of the method of the present invention.
Fig. 17 shows a road extraction result diagram, a road center line extraction result diagram, and a center line broken line reconnection optimization diagram from left to right.
FIG. 18 shows the centerline vector diagram displayed in ArcMap 10.8.
Detailed Description
The high-resolution remote sensing image road extraction method based on the deformable attention mechanism of the invention is described in detail below with reference to the embodiments and the accompanying drawings.
The invention discloses a high-resolution remote sensing image road extraction method based on a deformable attention mechanism, which comprises the following steps of:
1) Road semantic information is enhanced using the road enhancement module RAM in FIG. 4, where w ∈ ℝ^(2k+1) represents the one-dimensional convolution kernel. For any convolution kernel position there are two direction components D = (D_H, D_W), and the convolution module is expressed as:

Y_D[i, j] = Σ_{t=−k}^{k} w_t · X[i + t·D_H, j + t·D_W]

where X denotes the input feature map and D the convolution direction vector, taking the four values (0, 1), (1, 0), (1, 1), (−1, 1) for horizontal, vertical, left diagonal and right diagonal respectively. Y_D denotes the output of the convolution, [i, j] is the position of any 1×1 subkernel constituting the convolution kernel, 2k+1 is the number of 1×1 subkernels in the convolution kernel, and t is a plain index with no special meaning.
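The four-direction one-dimensional convolution can be written out directly as a NumPy sketch (illustrative only; the patent builds these as PyTorch convolution layers, and the zero-padding at the borders here is an assumption):

```python
import numpy as np

# The four direction vectors D = (D_H, D_W) of the road enhancement module.
DIRECTIONS = {"horizontal": (0, 1), "vertical": (1, 0),
              "left_diag": (1, 1), "right_diag": (-1, 1)}

def directional_conv1d(X, w, D):
    """One-dimensional convolution of X along direction D = (D_H, D_W).

    w holds the 2k+1 taps w_{-k} .. w_{k}; positions falling outside X
    are treated as zero (padding choice is an assumption)."""
    H, W = X.shape
    k = (len(w) - 1) // 2
    dH, dW = D
    Y = np.zeros_like(X, dtype=float)
    for i in range(H):
        for j in range(W):
            s = 0.0
            for t in range(-k, k + 1):
                ii, jj = i + t * dH, j + t * dW
                if 0 <= ii < H and 0 <= jj < W:
                    s += w[t + k] * X[ii, jj]
            Y[i, j] = s
    return Y
```

With the identity kernel [0, 1, 0] the output reproduces the input for every direction, which is a quick sanity check of the indexing.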
2) The offset of each pixel in the input feature map can be calculated with a 1×1 convolution operation; the offsets of the K output channels are denoted (ΔI_x, ΔI_y), and the sampling process is:

(ΔI_x, ΔI_y) = w_k ⊛ I_Q,  k = 1, …, K

where (ΔI_x, ΔI_y) denotes the deviation in the x and y directions, I_Q denotes a reference pixel in the feature map, K denotes the total number of sampled pixels, and k denotes an arbitrary sampled pixel. Sampling with the 1×1 convolution weight w_k yields the kth offset, which moves the sampling position to a neighbouring pixel. Then the offset is added to the position of the original sampled pixel, generating the new value of the offset pixel:

I_K(x_k, y_k) = I_Q((x + ΔI_x), (y + ΔI_y))

where I_K denotes the feature map Key (K), (x_k, y_k) denotes a pixel position in feature map K, and I_Q denotes a reference pixel in the feature map. Since the position value after adding the offset is generally non-integer, it does not correspond to an actual pixel value on the feature map, so bilinear interpolation is required to obtain the offset pixel value:

I_K(x_k, y_k) = Σ_{(x_int, y_int)} G(x_k, x_int) · G(y_k, y_int) · I_Q(x_int, y_int)

where G(·,·) denotes the bilinear interpolation kernel in the x and y directions, G(a, b) = max(0, 1 − |a − b|), (x_k, y_k) denotes any (fractional) position, and (x_int, y_int) ranges over all integer positions.
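Sampling a feature map at a fractional offset position with the bilinear kernel G can be illustrated with a small NumPy sketch (function name and out-of-bounds handling are assumptions for demonstration):

```python
import numpy as np

def bilinear_sample(I, x, y):
    """Sample I at the fractional position (x, y) using the kernel
    G(a, b) = max(0, 1 - |a - b|), summed over integer neighbours."""
    H, W = I.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = 0.0
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            if 0 <= xi < H and 0 <= yi < W:
                g = max(0.0, 1 - abs(x - xi)) * max(0.0, 1 - abs(y - yi))
                out += g * I[xi, yi]
    return out
```

Sampling the centre of a 2×2 patch returns the mean of its four corners, as expected of bilinear interpolation.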
3) The road enhancement module adapts automatically to geometric changes of the road, but the convolutions in four different directions only account for spatial offsets, not for interactions between adjacent pixels. To address this, a deformable attention module DAM is introduced, as shown in FIG. 3. In the spatial attention module, Key (K) and Value (V) are obtained from the sampling-offset feature map. For ease of notation, I_Q(x, y) is written as I_Q; the correlation between a given pixel and all sampled pixels can then be represented as the dot product of I_Q with the transpose of I_K, and the softmax function limits the correlation scores to between 0 and 1. The attention weight is then expressed as:

Attention_QK = softmax(I_Q · I_K^T)

where the spatial weight matrix Attention_QK measures the influence of a specific reference pixel on the kth sampled pixel, I_Q represents a reference pixel in the feature map, I_K^T represents the transpose of the feature map Key (K) matrix, K represents the total number of sampled pixels, and k represents an arbitrary sampled pixel. In order to selectively aggregate the road context information and preserve more semantic information in the global view, the context information is extracted by summing all sampled pixels I_V weighted by the corresponding attention weight Attention_QK. Finally, the context information is combined with the original reference pixel to preserve certain initial features:

I_output = α · Σ_{k=1}^{K} Attention_QK · I_V + I_Q

where α is a scale parameter that grows gradually from 0 during training, I_output is the output feature map, I_Q represents the reference pixel in the feature map, I_V all sampled pixels, and Attention_QK the spatial weight matrix between the reference pixel and the sampled pixels. Because the input and output dimensions of the module are identical, the deformable spatial self-attention module may be embedded in any network layer.
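A minimal NumPy sketch of this aggregation for a single reference pixel (the per-pixel view, function names and a fixed α are illustrative assumptions; the patent's module operates on whole feature maps in PyTorch, with α learned from 0):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deformable_attention(I_Q, I_K, I_V, alpha=0.1):
    """Aggregate K sampled pixels for one reference pixel.

    I_Q : (C,)   reference-pixel feature
    I_K : (K, C) key features at the K offset sampling positions
    I_V : (K, C) value features at the same positions
    """
    attn = softmax(I_K @ I_Q)      # Attention_QK: one weight per sampled pixel
    context = attn @ I_V           # weighted sum over all sampled pixels
    return alpha * context + I_Q   # residual keeps the initial reference feature
```

With α = 0 the module reduces to the identity on I_Q, which matches the design choice of growing α gradually from 0 so that attention is blended in during training.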
4) Combining steps 1), 2) and 3), based on the classical residual network ResNet, a road enhancement module RAM and a deformable attention module DAM are embedded between each layer of the encoder and decoder as shown in FIG. 2. In the encoder (FIGS. 5 to 9), the activation function ReLU is expressed as:
f(x)=max(0,x)
The decoder then uses the bilinear-interpolation upsampling algorithm shown in FIG. 12. On the premise that the function values at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are known, bilinear interpolation first interpolates in the x direction:

f(x, y1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21)
f(x, y2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22)

and then interpolates in the y direction:

f(x, y) ≈ ((y2 − y)/(y2 − y1)) · f(x, y1) + ((y − y1)/(y2 − y1)) · f(x, y2)
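The two interpolation passes of the decoder upsampling can be sketched as one small Python function (name and argument order are illustrative):

```python
def bilerp(f11, f21, f12, f22, x1, x2, y1, y2, x, y):
    """Bilinear interpolation: first along x, then along y.

    fij is the known function value at the corner (xi, yj)."""
    # interpolate in the x direction at y = y1 and y = y2
    fx_y1 = (x2 - x) / (x2 - x1) * f11 + (x - x1) / (x2 - x1) * f21
    fx_y2 = (x2 - x) / (x2 - x1) * f12 + (x - x1) / (x2 - x1) * f22
    # then interpolate in the y direction
    return (y2 - y) / (y2 - y1) * fx_y1 + (y - y1) / (y2 - y1) * fx_y2
```

At the centre of a unit square with corner values 0, 1, 2, 3 the interpolated value is their mean, 1.5.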
5) Cross entropy is used as the loss function. In the binary-classification case, the activation function sigmoid acting on x is expressed as:

sigmoid(x) = 1 / (1 + e^(−x))

On this basis the loss function can be expressed as:

L = −(1/N) · Σ_{i=1}^{N} [ y_i · log f(x_i) + (1 − y_i) · log(1 − f(x_i)) ]

where N is the number of samples, i is the current sample index, f(x_i) is the predicted value, and y_i is the label value.
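The sigmoid and the binary cross-entropy loss above can be checked numerically with a short pure-Python sketch (function names are assumptions; in practice the patent uses the PyTorch loss):

```python
import math

def sigmoid(x):
    """Logistic activation: maps a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, labels):
    """Binary cross-entropy over N samples; f(x_i) = sigmoid(logit_i)."""
    n = len(logits)
    total = 0.0
    for z, y in zip(logits, labels):
        f = sigmoid(z)
        total += y * math.log(f) + (1 - y) * math.log(1 - f)
    return -total / n
```

For a logit of 0 (prediction 0.5) against a positive label, the loss is log 2 regardless of the label convention, a handy sanity check.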
6) The optimization algorithm employs a random gradient descent algorithm SGD to minimize the loss function value.
7) And processing the road network extraction result by using a morphological image refinement algorithm to obtain a road network center line.
8) The extracted road-network centerline is further optimized with a tensor voting algorithm. When judging a broken-line endpoint in the graph, the neighborhood points of the endpoint are encoded as second-order symmetric tensors, and each tensor matrix is decomposed into a rod (stick) tensor component and a ball tensor component; the process is expressed as:

T = λ1·e1e1^T + λ2·e2e2^T

where λ1 ≥ λ2 ≥ 0 are the eigenvalues of T and e1, e2 are the corresponding eigenvectors. The neighborhood points vote (cast tensor matrices) toward the broken endpoints, and each endpoint accumulates the tensor matrices from its neighborhood to infer its own structural features.
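The split of a 2×2 second-order symmetric tensor into rod (stick) and ball components, T = (λ1 − λ2)·e1e1^T + λ2·(e1e1^T + e2e2^T), can be sketched with NumPy (function name is an assumption):

```python
import numpy as np

def stick_ball_decomposition(T):
    """Decompose a 2x2 symmetric tensor into stick and ball components:
    T = (l1 - l2) e1 e1^T + l2 (e1 e1^T + e2 e2^T), with l1 >= l2."""
    vals, vecs = np.linalg.eigh(T)           # eigenvalues in ascending order
    l1, l2 = vals[1], vals[0]
    e1, e2 = vecs[:, 1], vecs[:, 0]
    stick = (l1 - l2) * np.outer(e1, e1)     # oriented (curve-like) evidence
    ball = l2 * (np.outer(e1, e1) + np.outer(e2, e2))  # isotropic evidence
    return stick, ball
```

A dominant stick component at an endpoint indicates strong orientation evidence along e1, which is what the reconnection step exploits.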
9) And generating a road network vector diagram by adopting a grid-to-vector algorithm.
Specific examples of application of the present invention are given below.
The flow of the embodiment of the invention is shown in FIG. 1 and mainly comprises the data enhancement operation, the network model design and the model result verification. In the embodiment, the parameters are set as follows: input window size 512×512; batch size 2; initial learning rate 1e-4; initial weight 1; number of iterations (epochs) 30.
The hardware conditions for the embodiment are as follows: CPU: AMD Ryzen 7 5800H; GPU: RTX 3060 Ti with 6 GB of video memory. Algorithm design and programming were done with the PyTorch framework on the PyCharm 2021.3 software platform, and vector diagrams are displayed in ArcMap 10.8. Following the flow of FIG. 1, a network model framework as in FIG. 2 is designed, mainly comprising an encoder, a decoder, a road enhancement module RAM and a deformable attention module DAM. The encoder module adopts ResNet as the basic network framework to form a deep neural network; the road enhancement module RAM strengthens road semantic information; the deformable attention module DAM effectively captures long-range dependencies of road pixels while avoiding the computational complexity of global attention; and embedding the DAM and RAM at each layer between encoder and decoder facilitates extracting multi-scale road features.
The specific process of the embodiment of the invention is as follows:
Step one: the CHN6-CUG high-resolution remote sensing image is divided into a training set and a test set, wherein the training set is 2835 images with 512 multiplied by 512 pixels, and the test set is 561 images with 512 multiplied by 512 pixels. The diversity of the data is enhanced by using horizontal and vertical flipping for the data set, and the probability of flipping the pictures in the data set is set to a random number between 0 and 1.
Step two: the test dataset was put into the network model shown in fig. 2 for training, setting the initial learning rate, initial weights in the loss function, values of batchsize, and the number of iterations epoch.
In the present invention, ResNet50 is adopted as the basic network framework, in which the residual structure helps the network overcome the difficulty of extracting deep features. The residual structure is shown in FIG. 13. Regard the input-output relationship of a module in the network as y = H(x); solving H(x) directly by gradient methods runs into the degradation problem of network extraction capability. With such a shortcut structure, the optimization target of the learnable part is no longer H(x): letting F(x) denote the part to be optimized, H(x) = F(x) + x, that is, F(x) = H(x) − x. Since x corresponds to the observed value under the identity-mapping assumption, F(x) corresponds to the residual, and the residual structure only has to learn the difference between input and output, i.e. H(x) − x, the amount by which the output changes relative to the input.
The road semantic information is reinforced by the road enhancement module RAM. Let X ∈ ℝ^(H×W×C) be the input of the convolution module, where H, W and C represent the height, width and number of channels of the image. X is fed into four parallel paths, each containing a convolution in one (non-repeating) direction, yielding four feature maps of the same size. Let w ∈ ℝ^(2k+1) represent the convolution kernel:

Y_D[i, j] = Σ_{t=−k}^{k} w_t · X[i + t·D_H, j + t·D_W]  (1)

For any convolution kernel position there are two direction components D = (D_H, D_W). D represents the convolution direction vector, taking the four values (0, 1), (1, 0), (1, 1), (−1, 1) for horizontal, vertical, left diagonal and right diagonal respectively. Y_D represents the output of the convolution, [i, j] is the position of any 1×1 subkernel constituting the convolution kernel, 2k+1 is the number of 1×1 subkernels in the convolution kernel, and t is a plain index with no special meaning.
In order to mimic a 3×3 convolution filter, k = 4 is set so that each convolution has 9 parameters. Any position in the output feature map may be associated with multiple positions along the four directions in the input feature map.
On the basis of formula (1), the sampling process that computes the offset of each pixel in the input feature map with a 1×1 convolution is given by formulas (2) to (4):

(ΔI_x, ΔI_y) = w_k ⊛ I_Q  (2);

I_K(x_k, y_k) = I_Q((x + ΔI_x), (y + ΔI_y))  (3);

I_K(x_k, y_k) = Σ_{(x_int, y_int)} G(x_k, x_int) · G(y_k, y_int) · I_Q(x_int, y_int)  (4);

where (ΔI_x, ΔI_y) denotes the deviation in the x and y directions, I_Q denotes a reference pixel in the feature map, K denotes the total number of sampled pixels, and k denotes an arbitrary sampled pixel. Sampling with the 1×1 convolution weight w_k yields the kth offset, moving the sampling position to a neighbouring pixel; I_K denotes the feature map Key (K) and (x_k, y_k) any (fractional) pixel position in feature map K. The generated offset is added to the original sampled pixel value to obtain the new offset-pixel value via formula (3); since the position value after adding the offset is generally non-integer, formula (4) obtains the offset pixel value by bilinear interpolation, where G(·,·) denotes the bilinear interpolation operation in the x and y directions and (x_int, y_int) ranges over the set of all integer positions.
In the deformable attention shown in FIG. 3, the correlation between a given pixel and all sampled pixels is represented as the dot product of I_Q with the transpose of I_K, and the softmax function limits the correlation scores to between 0 and 1. The attention weight is then expressed as:

Attention_QK = softmax(I_Q · I_K^T)  (5);

where the spatial weight matrix Attention_QK measures the effect of a particular reference pixel on the kth sampled pixel, I_Q denotes the reference pixel in the feature map, I_K^T denotes the transpose of the feature map Key (K) matrix, K denotes the total number of sampled pixels, and k denotes an arbitrary sampled pixel. The context information is combined with the original reference pixel by formula (6) to preserve some initial features:

I_output = α · Σ_{k=1}^{K} Attention_QK · I_V + I_Q  (6);

where I_output is the output feature map, I_V denotes all sampled pixels, and Attention_QK is the spatial weight matrix between the reference pixel and the sampled pixels. Model parameters are optimized with the stochastic gradient descent algorithm SGD to minimize the loss function value. Bilinear interpolation serves as the upsampling algorithm in the decoder, and deformable attention is embedded between the encoder and decoder of each layer to extract multi-scale road information.
Step three: testing the trained model in the second step on test data, extracting a road network center line by using a morphological image refinement algorithm (a thining () function in python library OpenCV), optimizing the extracted road network center line by using a tensor voting algorithm flow in fig. 14, coding a field point of the model into a second-order symmetrical tensor when judging a broken line end point in the graph, decomposing each tensor matrix, and expressing the process into a formula (7):
T=λ1e1e1 T2e2e2 T (7);
Wherein lambda 1≥λ2 is more than or equal to 0 and is a characteristic value of T, e 1 and e 2 are corresponding characteristic vectors, voting (tensor matrix) is carried out from the field points to the disconnection end points, and the disconnection end points accumulate tensor matrix from the field points for self structural characteristic inference. And (3) taking the road intersection ratio (IoU) and the F1 score as the evaluation indexes of the model quality for the optimized road network, and taking 5 advanced road extraction models as comparison objects. Wherein IoU and the F1 fraction are defined as shown in the formula (8) to the formula (9):
IOU=TP/(TP+FP+FN) (8);
F1=2×(precision×recall)/(precision+recall) (9);
where TP represents the number of pixels correctly extracted as road; FP represents the number of pixels of other objects extracted as road; and FN represents the number of road pixels extracted as other objects. The precision and recall in formula (9) are defined in formulas (10) and (11):
precision=TP/(TP+FP) (10);
recall=TP/(TP+FN) (11);
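Formulas (8) to (11) can be verified with a small pure-Python sketch over flattened binary masks (function name and zero-division conventions are assumptions):

```python
def road_metrics(pred, truth):
    """Pixel-wise IoU and F1 for binary road masks given as flat 0/1 lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0      # formula (8)
    precision = tp / (tp + fp) if tp + fp else 0.0          # formula (10)
    recall = tp / (tp + fn) if tp + fn else 0.0             # formula (11)
    f1 = (2 * precision * recall / (precision + recall)     # formula (9)
          if precision + recall else 0.0)
    return iou, f1
```

For example, with one true positive, one false positive and one false negative, IoU = 1/3 while F1 = 1/2, illustrating that F1 is always at least as large as IoU.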
The comparison models were U-Net, DeepLabv3+, D-LinkNet, SGCN-Net and MDANet.
Step four: and (3) carrying out grid conversion vector on the optimized grid road network obtained in the step (III) through a Polygonize () function in a GDAL library in python, and generating a road network vector diagram.
Experiment:
Under the Windows 11 system, algorithm design and programming were completed on the PyCharm 2021.3 software platform using the Python language, and simulation experiments were carried out on the CHN6-CUG and DeepGlobe high-resolution remote sensing road datasets. The experimental parameters were set as follows: input size 512×512 pixels, initial learning rate 1e-4, initial loss-function weight 1, batch size 2, and 30 iterations (epochs). Tables 1 and 2 give the quantitative results on the CHN6-CUG and DeepGlobe datasets respectively; FIGS. 15 and 16 show the binary comparison results on the DeepGlobe and CHN6-CUG datasets respectively; FIG. 17 shows, from left to right, the road extraction result, the road centerline extraction result and the centerline broken-line reconnection optimization; and FIG. 18 shows the centerline vector diagram displayed in ArcMap 10.8.
TABLE 1 quantization results on the CHN6-CUG dataset
Table 2 Quantization results on the DeepGlobe dataset
Although the invention has been described above with reference to the accompanying drawings and examples, it is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many modifications may be made by those of ordinary skill in the art without departing from the spirit of the invention, and such modifications fall within the protection of the invention.

Claims (8)

1. A high-resolution remote sensing image road extraction method based on a deformable attention mechanism is characterized by comprising the following steps:
1) Data enhancement is performed on the high-resolution remote sensing image dataset using sliding-window, horizontal-flip and vertical-flip operations, specifically comprising:
sampling on a high-resolution remote sensing image with a fixed-size window and a set sliding step length; enhancing the diversity of the dataset with horizontal and vertical flipping, the flip probability of each picture being set to a random number between 0 and 1.
2) The road enhancement module is formed from one-dimensional convolutions in four directions: horizontal, vertical, left diagonal and right diagonal.
3) A 1×1 convolution module is used to constrain the learning of the deformable offsets in the self-attention mechanism, forming a deformable attention module that adapts automatically to the geometric transformations of the road.
4) A deformable attention module and a road enhancement module are embedded between an encoder based on ResNet and a bilinear interpolation up-sampling decoder to extract multi-scale road information.
5) And (3) extracting the road center line of the road network obtained in the step (4), detecting and reconnecting the broken line of the extracted center line, improving the connectivity of the road center line, and carrying out grid vector conversion on the optimized road network center line to generate a vector road network.
6) Evaluate the results of the method by the road intersection-over-union (IoU) and the F1 score, and compare with 5 other advanced road extraction models, namely U-Net, DeepLabv3+, D-LinkNet, SGCN-Net and MDANet, to verify the superiority of the method.
2. The method for extracting a road from a high-resolution remote sensing image based on a deformable attention mechanism according to claim 1, wherein in step 2) the road enhancement module RAM is:

Y_D[i, j] = Σ_{t=−k}^{k} w_t · X[i + t·D_H, j + t·D_W]

where w ∈ ℝ^(2k+1) represents the convolution kernel and for any convolution kernel position there are two direction components D = (D_H, D_W). D represents the convolution direction vector, taking the four values (0, 1), (1, 0), (1, 1), (−1, 1) for horizontal, vertical, left diagonal and right diagonal respectively. Y_D represents the output of the convolution, [i, j] is the position of any 1×1 subkernel constituting the convolution kernel, 2k+1 is the number of 1×1 subkernels in the convolution kernel, and t is a plain index with no special meaning.
3. The method for extracting a high-resolution remote sensing image road based on a deformable attention mechanism according to claim 1, wherein in the step 3), the step of restricting learning of the deformable offset in the self-attention mechanism by using a 1×1 convolution module is:
S1: calculating an offset for each sampled pixel using a 1×1 convolution; adding the calculated offset to the original pixel to obtain a new value of the sampled pixel; and, considering that the generated new pixel value may be non-integer and thus not correspond to an actual position in the feature map, obtaining an integer-position value by bilinear interpolation;
S2: calculating a spatial attention weight by using the new value of the sampling pixel obtained in the step S1 and a reference pixel;
S3: and (3) performing dot multiplication operation on the spatial attention weight obtained in the step (S2) and all the sampling pixels to obtain context information, and adding the context information with the original reference pixels to retain certain initial characteristics.
4. The method of claim 3, wherein the step S1 includes sampling each pixel offset:
Where (Δi x,ΔIy) denotes the amount of deviation in the x and y directions, I Q denotes a reference pixel in the feature map, K denotes a total sampling pixel, and K denotes an arbitrary sampling pixel. Sampling weights by 1 x 1 convolution A kth offset can be obtained to move the sampling position to an adjacent pixel. Then, the offset is added to the value of the original sampled pixel using equation (3), generating a new value for the offset pixel:
I_K(x_k, y_k) = I_Q((x + ΔI_x), (y + ΔI_y)) (3);
where I_K denotes the feature map Key (K), (x_k, y_k) denotes a pixel position in the feature map K, and I_Q denotes the reference pixel in the feature map; since the position value after adding the offset is a non-integer, it does not correspond to the position of an actual pixel on the feature map, so bilinear interpolation is required to obtain the offset pixel value, with the formula as follows:

I_K(x_k, y_k) = Σ_{(x_int, y_int)} G(x_k, x_int) · G(y_k, y_int) · I_Q(x_int, y_int) (4)

where G(·,·) denotes the bilinear interpolation kernel in the x and y directions, with G(a, b) = max(0, 1 − |a − b|); (x_k, y_k) denotes any (fractional) position, (x_int, y_int) ranges over the set of all integer positions, and I_Q denotes the reference pixel in the feature map.
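The bilinear interpolation of formula (4) can be sketched as follows. This is an illustrative numpy sketch, not the patented implementation: the kernel G(a, b) = max(0, 1 − |a − b|) restricts the sum to the four surrounding integer pixels, and skipping out-of-bounds neighbors is our assumption.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat at a fractional position (x, y) by bilinear interpolation.

    feat : (H, W) array; x indexes rows, y indexes columns.
    Only the four integer neighbors of (x, y) have nonzero weight
    G(x, xi) * G(y, yi), so the double sum reduces to at most 4 terms.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    h, w = feat.shape

    def g(a, b):
        # 1-D interpolation kernel: max(0, 1 - |a - b|)
        return max(0.0, 1.0 - abs(a - b))

    val = 0.0
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            if 0 <= xi < h and 0 <= yi < w:   # skip out-of-bounds neighbors
                val += g(x, xi) * g(y, yi) * feat[xi, yi]
    return val
```

Sampling the 2×2 map [[0, 1], [2, 3]] at the center (0.5, 0.5) blends all four pixels equally and returns their mean, 1.5.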
5. The method of claim 3, wherein in step S2 the deformable attention module DAM considers not only the spatial offset but also the interaction between adjacent pixels, the attention weight being:

Attention_QK = softmax(I_Q · I_K^T) (5)

where the spatial weight matrix Attention_QK measures the effect of a particular reference pixel on the k-th sampling pixel, I_Q denotes the reference pixel in the feature map, I_K^T denotes the transpose of the feature map Key (K) matrix, K denotes the total number of sampling pixels, and k denotes an arbitrary sampling pixel.
6. The method of claim 3, wherein in step S3, in order to selectively aggregate the road context information and preserve more semantic information in the global view, the context information is extracted as a weighted sum over all the sampling pixels I_V. Finally, the context information is combined with the original reference pixel by formula (6) to retain certain initial features:

I_output = α · Σ_{k=1}^{K} (Attention_QK · I_V) + I_Q (6)

where α is a scale parameter that increases gradually from 0 during training, I_output is the output feature map, I_Q is the reference pixel in the feature map, K is the total number of sampling pixels, k is an arbitrary sampling pixel, I_V denotes the sampling pixels serving as Value, and Attention_QK is the spatial weight matrix between the reference pixel and the sampling pixels defined in claim 5.
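Steps S2 and S3 for a single reference pixel can be sketched as below. This is an illustrative numpy toy, not the patented implementation: we assume the deformably sampled features serve as both Key and Value, and that the attention weights are softmax-normalized; all names are ours.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def deformable_attention_step(q, sampled, alpha=0.1):
    """One reference pixel attends over its K deformably sampled pixels.

    q       : (C,) reference-pixel feature I_Q.
    sampled : (K, C) features at the offset positions (Key = Value here).
    alpha   : scale parameter; in training it would grow gradually from 0.
    Returns alpha * context + q, the residual combination of formula (6).
    """
    attn = softmax(sampled @ q)     # Attention_QK, shape (K,)
    context = attn @ sampled        # weighted sum over the sampled pixels
    return alpha * context + q      # I_output = alpha * context + I_Q
```

With alpha near 0 the module initially passes the reference feature through almost unchanged, then blends in more sampled context as alpha grows during training.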
7. The method of claim 1, wherein step 4) embeds a road enhancement module RAM and a deformable attention module DAM between each layer of the encoder, which takes the classical residual network ResNet as its basic framework, and the bilinear-interpolation up-sampling decoder, to obtain multi-scale road features.
8. The method for extracting a road from a high-resolution remote sensing image based on a deformable attention mechanism according to claim 1, wherein step 5) first extracts the road centerline by means of a morphological thinning algorithm, then uses a tensor voting algorithm to reconnect disconnections so as to improve the connectivity of the road network; when judging the disconnection endpoints in the graph, the neighborhood points of the road centerline are encoded as second-order symmetric tensors, and the tensor matrices are then decomposed, the process being expressed as formula (7):
T = λ_1 e_1 e_1^T + λ_2 e_2 e_2^T (7);
where λ_1 ≥ λ_2 ≥ 0 are the eigenvalues of T, and e_1 and e_2 are the corresponding eigenvectors. The neighborhood points cast votes (tensor matrices) to the disconnection endpoints, and each disconnection endpoint accumulates the tensor matrices from its neighborhood points to infer its own structural characteristics.
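The encoding and decomposition of formula (7) can be illustrated with numpy's symmetric eigendecomposition. This is a sketch under the usual tensor-voting convention that λ_1 − λ_2 measures line ("stick") saliency and λ_2 measures isotropic ("ball") saliency; the function names are ours, not the patent's.

```python
import numpy as np

def encode_stick_tensor(direction):
    """Encode a tangent direction as a second-order stick tensor d d^T."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.outer(d, d)

def decompose(T):
    """Eigendecompose an accumulated 2x2 tensor as in formula (7):
    T = lam1 * e1 e1^T + lam2 * e2 e2^T with lam1 >= lam2 >= 0.
    """
    vals, vecs = np.linalg.eigh(T)        # eigh returns ascending eigenvalues
    lam1, lam2 = vals[1], vals[0]
    e1, e2 = vecs[:, 1], vecs[:, 0]
    return lam1, lam2, e1, e2
```

For example, two neighborhood points voting along the x axis (in either orientation) accumulate to a pure stick tensor: λ_1 = 2, λ_2 = 0, and e_1 recovers the x direction, so the endpoint infers a strong line continuation along x.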
CN202211469934.0A 2022-11-22 2022-11-22 High-resolution remote sensing image road extraction method based on deformable attention mechanism Pending CN118072155A (en)


Publications (1)

Publication Number: CN118072155A (en)
Publication Date: 2024-05-24

Family ID: 91110109



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination