CN114821192A - Remote sensing image elevation prediction method combining semantic information - Google Patents


Info

Publication number
CN114821192A
Authority
CN
China
Prior art keywords
elevation
remote sensing
sensing image
prediction
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210557539.1A
Other languages
Chinese (zh)
Inventor
张永生
王自全
戴晨光
王涛
于英
尚大帅
江志鹏
程彬彬
吕可枫
闵杰
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN202210557539.1A
Publication of CN114821192A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention belongs to the technical field of geographic information, and particularly relates to a remote sensing image elevation prediction method combining semantic information. The elevation extraction network model comprises a shared-weight encoder-decoder, a semantic prediction branch and an elevation prediction branch. The shared-weight encoder-decoder performs a ground-feature classification task and an elevation prediction task on a single-view remote sensing image, learns the prior relationship between the semantic information extracted by the classification task and the geometric information extracted by the elevation prediction task, and uses this prior relationship to obtain a feature vector. The feature data of all channels except the last channel of the feature vector is input into the semantic prediction branch to obtain the ground-feature classification result, and the feature data of the last channel is input into the elevation prediction branch to obtain the elevation prediction result. The method thereby addresses the low accuracy of elevation prediction from single-view remote sensing images.

Description

Remote sensing image elevation prediction method combining semantic information
Technical Field
The invention belongs to the technical field of geographic information, and particularly relates to a remote sensing image elevation prediction method combining semantic information.
Background
In existing remote sensing image elevation prediction methods, several camera lenses are mounted on a flight platform at the same time to capture richer image information from the vertical direction and several oblique directions. The resulting multi-view remote sensing images are then preprocessed, feature points are extracted, and adjustment calculation is performed across the images; this computation extracts the elevation geometry. Because many images must be processed and the computation is complex and slow, obtaining an elevation prediction result is slow. When a single-view remote sensing image is used to recover elevation geometry instead, only one viewing angle is available, and occlusion in the real world leaves parts of the scene hidden in the image, so the accuracy of the subsequently extracted elevation information is low.
Disclosure of Invention
The invention aims to provide a remote sensing image elevation prediction method combined with semantic information, which is used for solving the problem of low accuracy when a single-view remote sensing image is used for elevation prediction.
In order to solve the technical problems, the technical scheme provided by the invention and the corresponding beneficial effects of the technical scheme are as follows:
the invention relates to a remote sensing image elevation prediction method combining semantic information, which comprises the following steps:
acquiring a single-view remote sensing image, and inputting it into an elevation extraction network model to obtain an elevation prediction result and a ground-feature classification result of the single-view remote sensing image; the elevation extraction network model is trained using, as the training set, the single-view remote sensing image together with its corresponding digital elevation model and ground-feature classification result, and comprises a shared-weight encoder-decoder, a semantic prediction branch and an elevation prediction branch; the shared-weight encoder-decoder performs a ground-feature classification task and an elevation prediction task on the single-view remote sensing image, learns the prior relationship between the semantic information extracted by the classification task and the geometric information extracted by the elevation prediction task, and uses this prior relationship to obtain a feature vector; the feature data of all channels except the last channel of the feature vector is input into the semantic prediction branch to extract semantic features and obtain the ground-feature classification result; and the feature data of the last channel of the feature vector is input into the elevation prediction branch to extract elevation features and obtain the elevation prediction result.
The beneficial effects of the above technical scheme are: a shared-weight encoder-decoder carries out the ground-feature classification task and the elevation prediction task; the two tasks share weights and supervise each other, and the network learns the prior relationship between the semantic information extracted by the classification task and the geometric information extracted by the elevation prediction task. For example, the average elevation of a "water area" should be lower than the average elevation of other ground features, the elevation of a "house" (in a flat area) should be higher than that of the surrounding flat terrain, and places where the elevation changes are often the boundaries of ground-feature classes. The invention makes the shared-weight encoder-decoder learn this prior relationship while recognizing ground-feature distribution and predicting elevation, so that the two tasks supervise each other and the classification task helps the elevation prediction task extract geometric information more accurately. On one hand, elevation features extracted through the elevation prediction branch yield a more accurate elevation prediction result; on the other hand, semantic features extracted through the semantic prediction branch yield a more accurate ground-feature classification result, finally improving the cognitive ability of the elevation extraction network model with respect to real ground features. A remote sensing image elevation prediction method combining semantic information is thereby provided, with high prediction accuracy when a single-view remote sensing image is used for elevation prediction and ground-feature classification.
Further, the decoder in the shared-weight encoder-decoder comprises four sequentially connected layers, each of which comprises, in order: a transposed convolution module and two first convolution modules;
the transposed convolution module comprises, in order, a transposed convolution calculation, an activation function, and a post-transposed-convolution padding operation.
The beneficial effects of the above technical scheme are: the shared-weight decoder improves on the U-Net decoder structure, which directly uses a scale-expanding upsample layer. Specifically, the elevation decoder of the invention performs convolutions with kernels elongated along the x and y directions respectively, helping the network perceive changes in terrain gradient and enriching the texture details of the elevation prediction, and an ordinary convolution is added after the transposed convolution to ablate checkerboard shadows. Experimental results show that this improved shared-weight decoder raises the accuracy of elevation prediction by 0.3 m. In addition, the invention adopts a U-Net decoder structure and realizes feature fusion by concatenation, which is simple and stable.
Further, the construction method of the training set comprises the following steps:
acquiring a single-view remote sensing image I and the digital elevation model D of the corresponding geographical range; segmenting I into a number of small images, traversing the pixels of each small image I(i, j) pixel by pixel, and solving, by least squares, the longitude X, latitude Y and elevation H on the digital elevation model D corresponding to each current pixel (r, c), to obtain the training set; the least-squares solution of X, Y and H comprises:
a. selecting the longitude, latitude and elevation offsets X_0, Y_0, H_0 of the rational function imaging model (RFM) corresponding to the single-view remote sensing image I as the initial iteration values;
b. computing the pixel coordinates (r_p, c_p) of the projection of the initial iteration values through the RFM, and computing the difference between these pixel coordinates and the current pixel, recorded as the projection error;
c. computing the partial derivatives of the projection error with respect to longitude and latitude, and constructing the partial-derivative (Jacobian) matrix;
d. solving the corrections in the longitude and latitude directions from the partial-derivative matrix and the projection error;
e. updating the initial iteration values with the corrections, recording them as the current iteration values, obtaining from the digital elevation model, by interpolation, the elevation corresponding to the longitude and latitude in the current iteration values, and updating the elevation value accordingly;
f. repeating steps b-e until convergence, thereby obtaining the elevation value corresponding to the current pixel.
The beneficial effects of the above technical scheme are: in the prior art, training data are acquired from a digital elevation model (DEM); when doing so, on one hand the remote sensing image is not corrected with geographic information, and on the other hand the acquired triple coordinates of longitude, latitude and elevation are obtained by interpolation and are not coordinate values on the actual DEM. The regular triple coordinates obtained after grid sampling therefore cannot correspond one-to-one to the pixels of the remote sensing image, and the resulting deviation makes the training data and the elevation prediction result inaccurate. The present method instead obtains the training set through the rational function imaging model: pixels are traversed one by one, and least squares is used to set an initial iteration value, construct the error, linearly approximate, correct the coordinate error of the triple data and update the iteration value, so that the longitude-latitude-elevation coordinates on the rational function imaging model correspond strictly to each pixel of the remote sensing image. The triple coordinates obtained are thus more accurate, enabling accurate elevation prediction results subsequently.
Further, the encoder in the shared-weight encoder-decoder is a ResNet.
Further, the structure of the semantic prediction branch comprises: a second convolution module and a softmax layer; the second convolution module includes a convolution layer, a batch normalization layer, and an activation function.
Further, when the elevation extraction network model is trained, the loss function used by the semantic prediction branch is the cross-entropy loss function:

L_s = -Σ_{i=1}^{n} y_i · log(p_i)

where y_i is a one-hot classification label: y_i = 1 if the current pixel belongs to class i, otherwise y_i = 0; p is the n×1 output vector of the semantic prediction branch after the softmax layer; and p_i is the probability that the current pixel belongs to class i, i ∈ {1, 2, ..., n}.
Further, the structure of the elevation prediction branch comprises a third convolution module and a tanh () activation function; the third convolution module includes: convolutional layers and batch normalization layers.
Further, when the elevation extraction network model is trained, the loss function used by the elevation prediction branch comprises a scale-invariant elevation error loss function L_g:

L_g = (1/M) · Σ_{(r,c)} d(r, c)² - (1/M²) · (Σ_{(r,c)} d(r, c))²

d = log(h_p) - log(h_t)

where M is the number of sample pixels, (r, c) are the pixel coordinates of the remote sensing image, h_p is the predicted elevation, and h_t is the true elevation.
Further, when the elevation extraction network model is trained, the loss function used by the elevation prediction branch comprises a reprojection loss function L_r:

L_r = (1/M) · Σ [ α · (1 - SSIM(I, Î)) / 2 + (1 - α) · |I - Î| ]

grid = RFM(X′, Y′, h_p)

Î = sample(I, grid)

SSIM(x, y) = (2·μ_x·μ_y + C1)(2·σ_xy + C2) / ((μ_x² + μ_y² + C1)(σ_x² + σ_y² + C2))

where M is the number of sample pixels, SSIM is an image structural-similarity descriptor, μ is the mean operator, σ is the variance operator, C1 and C2 are constants preventing the denominator from being 0, α is a weight control parameter, X′ is the matrix of longitudes in the sample, Y′ is the matrix of latitudes in the sample, h_p is the matrix of predicted elevations, I is the original remote sensing image, RFM denotes the rational function imaging model, grid is the pixel coordinate grid obtained by reprojecting X′, Y′ and h_p through the RFM, and Î is the image generated by sampling the original image with grid.
Drawings
FIG. 1 is a data flow diagram of a method for elevation prediction of remote sensing images incorporating semantic information according to the present invention;
FIG. 2-1 is a first raw remote sensing image in an embodiment of the method of the present invention;
FIG. 2-2 is a true elevation distribution graph corresponding to FIG. 2-1 in an embodiment of a method of the present invention;
FIG. 2-3 is the model-predicted elevation corresponding to FIG. 2-1 in an embodiment of the method of the present invention;
FIG. 2-4 is a map of the terrain classification results output by the model corresponding to FIG. 2-1 in an embodiment of the method of the present invention;
FIG. 3-1 is a second raw remote sensing image in an embodiment of the method of the present invention;
FIG. 3-2 is a true elevation distribution graph corresponding to FIG. 3-1 in an embodiment of a method of the present invention;
FIG. 3-3 is the model-predicted elevation corresponding to FIG. 3-1 in an embodiment of the method of the present invention;
FIG. 3-4 is a map of the terrain classification results output by the model corresponding to FIG. 3-1 in an embodiment of the method of the present invention;
FIG. 4-1 is a third raw remote sensing image in an embodiment of the method of the present invention;
FIG. 4-2 is a true elevation distribution map corresponding to FIG. 4-1 in an embodiment of a method of the present invention;
FIG. 4-3 illustrates a model predicted elevation corresponding to FIG. 4-1 in an embodiment of a method of the present invention;
fig. 4-4 is a map of the terrain classification result output by the model corresponding to fig. 4-1 in the embodiment of the method of the present invention.
Detailed Description
The invention designs a multi-task architecture for ground-feature segmentation and elevation prediction that makes effective use of the semantic information of ground features. As shown in fig. 1, a multi-task strategy is adopted: the input original remote sensing image first passes through a shared-weight encoder-decoder, which learns the relative relationship between the texture features of the image and the underlying elevations; the resulting feature maps are then fed into a semantic prediction module and an elevation prediction module respectively, yielding a ground-feature classification result and an elevation prediction result. By adding semantic information about ground features, the invention provides strong prior guidance for elevation inversion; for example, the average elevation of a water area is generally lower than that of other ground features.
The core idea of the multi-task strategy is that one network simultaneously completes several associated prediction tasks T_1, T_2, ..., T_n, so that the constraints L_1, L_2, ..., L_n imposed from different sides force it to learn the most essential characteristics of the current task; this enhances the generalization ability of the model, and the approach is widely applied across computer vision. For example, if the network must simultaneously complete an associated classification task T_1 and regression task T_2, then L_1 may be a cross-entropy loss function and L_2 a mean-squared-error loss function or a combination of numerical loss functions. Models guided by a multi-task strategy usually have a base module B with shared weights and task-specific prediction branches C_1, C_2, ..., C_n. During optimization, L_1, L_2, ..., L_n compute losses on the predictions of C_1, C_2, ..., C_n, and the resulting gradients update the parameters of B during back-propagation, so that B encodes features that simultaneously serve T_1, T_2, ..., T_n; this mitigates the tendency of a single-task model to fall into local optima.
In addition, since real-world objects are mostly continuous, and the boundary of a semantic judgment is usually also a geometric boundary (and vice versa), a multi-task strategy in which semantics and geometry constrain each other can be used for reconstructing and estimating real-world objects. In the invention, the distribution of ground features is the semantic information and their elevation is the geometric information, and a prior relationship exists between the two. For example, the average elevation of a "water area" should be lower than the average elevation of other ground features, the elevation of a "house" (in a flat area) should be higher than that of the surrounding flat terrain, and places where the elevation changes are often the boundaries of ground-feature classes. The invention aims to make the model learn this prior relationship while recognizing ground features and predicting elevation, so that the two tasks supervise each other, finally improving the model's cognitive ability with respect to real ground features.
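The shared-base mechanism described above can be sketched numerically. The snippet below is an assumption-level illustration, not the patent's network: a single linear layer stands in for the shared base B, with an invented 3-class softmax head (task T_1, cross-entropy loss L_1) and a scalar regression head (task T_2, mean-squared-error loss L_2). It shows that the gradient reaching the shared weights is the sum of the per-task gradients, so B is updated by both constraints at once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # 8 samples, 4 input features
W_b = rng.normal(size=(4, 6))        # shared-weight base module B
W_c = rng.normal(size=(6, 3))        # branch C1: 3-class classifier (task T1)
W_r = rng.normal(size=(6, 1))        # branch C2: scalar regressor (task T2)
y_cls = rng.integers(0, 3, size=8)   # classification labels
y_reg = rng.normal(size=(8, 1))      # regression targets

def losses_and_shared_grads():
    f = x @ W_b                                  # features from shared base B
    logits = f @ W_c                             # task T1 head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)            # softmax probabilities
    onehot = np.eye(3)[y_cls]
    L1 = -np.mean(np.sum(onehot * np.log(p), axis=1))  # cross-entropy loss
    pred = f @ W_r                               # task T2 head
    L2 = np.mean((pred - y_reg) ** 2)            # mean-squared-error loss
    # Gradient of each task's loss with respect to the SHARED weights W_b
    g1 = x.T @ (((p - onehot) / len(x)) @ W_c.T)
    g2 = x.T @ ((2.0 * (pred - y_reg) / len(x)) @ W_r.T)
    return L1, L2, g1, g2

L1, L2, g1, g2 = losses_and_shared_grads()
grad_shared = g1 + g2   # B receives the summed gradient of both constraints
```

In a real training loop the two branch losses are computed separately and back-propagation accumulates their gradients into the shared module exactly as in the last line.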
The following describes a method for predicting elevation of remote sensing image by combining semantic information in detail with reference to the accompanying drawings and embodiments.
The method comprises the following steps:
the embodiment of the method for predicting the elevation of the remote sensing image by combining the semantic information is described in detail below by combining the attached drawings.
Step one, constructing a data set.
Limited by the format size of remote sensing images, equipment computing power and geographic attributes, the method re-extracts training samples using the rational function imaging model (RFM). Given an original large-format remote sensing image I (a single-view remote sensing image) and the digital elevation model D of the corresponding geographic range: first, segment the original remote sensing image I into m × n small images I(i, j), with 0 ≤ i ≤ m and 0 ≤ j ≤ n; then traverse each I(i, j) pixel by pixel, and obtain the longitude-latitude-elevation coordinate tuple (X, Y, H) corresponding to the current pixel (r, c) from the digital elevation model D as follows:
step 01, selecting longitude, latitude and elevation offset X in RFM corresponding to I 0 ,Y 0 ,H 0 Is an initial iteration value;
step 02, calculating and using X 0 ,Y 0 ,H 0 Pixel coordinates (r) projected with RFM p ,c p ) Constructing the projection error L ═ r (r) p ,c p )-(r,c);
Step 03, calculating the partial derivative of the projection error L to X and Y
Figure BDA0003652767480000061
Constructing Jacobian matrix A (partial derivative permutation matrix), wherein
Figure BDA0003652767480000062
Step 04, solving the correction numbers Δ X and Δ Y in the X and Y directions (a) T A) -1 L;
Step 05, update (X) 0 ,Y 0 )←(X 0 ,Y 0 ) + (Δ X, Δ Y) and obtains the current (X) from the digital elevation model D by bilinear interpolation 0 ,Y 0 ) Corresponding elevation H, update H 0 ←H。
And 06, repeating the steps 02-05 until convergence. Dense elevation samples satisfying the RFM projection condition pixel by pixel of the sample I (I, j) can be obtained. And (3) producing all the small-breadth images within the range of i being more than or equal to 0 and less than or equal to m and j being more than or equal to 0 and less than or equal to n, and constructing a DEM elevation prediction training set.
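As a concrete illustration of steps 01-06, the sketch below runs the least-squares iteration against a toy stand-in for the RFM. A real RFM is a ratio of cubic polynomials in (X, Y, H) defined by the image's RPC coefficients; the affine projection and the planar DEM here are invented so the example stays short and checkable.

```python
import numpy as np

# Toy stand-in for the rational function imaging model (RFM); coefficients
# are invented for illustration only.
A_COEF = np.array([[120.0, 0.5, 3.0],    # r = 100 + 120*X + 0.5*Y + 3*H
                   [80.0, 4.0, -0.3]])   # c = 200 +  80*X +   4*Y - 0.3*H

def rfm_project(X, Y, H):
    """Project ground coordinates (X, Y, H) to pixel coordinates (r, c)."""
    return np.array([100.0, 200.0]) + A_COEF @ np.array([X, Y, H])

# Synthetic DEM D on a regular grid (planar terrain, 1-unit spacing).
xs = ys = np.arange(0.0, 11.0)
DEM = 50.0 + 2.0 * xs[None, :] + 1.0 * ys[:, None]

def dem_bilinear(X, Y):
    """Step 05: bilinear interpolation of the DEM at (X, Y)."""
    j, i = int(np.clip(X, 0, 9)), int(np.clip(Y, 0, 9))
    fx, fy = X - j, Y - i
    return ((1 - fy) * (1 - fx) * DEM[i, j] + (1 - fy) * fx * DEM[i, j + 1]
            + fy * (1 - fx) * DEM[i + 1, j] + fy * fx * DEM[i + 1, j + 1])

def solve_XYH(r, c, X0=5.0, Y0=5.0, iters=20):
    """Steps 01-06: least-squares solution of (X, Y, H) for pixel (r, c)."""
    X, Y = X0, Y0
    H = dem_bilinear(X, Y)                            # step 01: initial values
    for _ in range(iters):
        L = rfm_project(X, Y, H) - np.array([r, c])   # step 02: projection error
        eps = 1e-6                                    # step 03: numeric partials
        dX = (rfm_project(X + eps, Y, H) - rfm_project(X, Y, H)) / eps
        dY = (rfm_project(X, Y + eps, H) - rfm_project(X, Y, H)) / eps
        A = np.column_stack([dX, dY])                 # Jacobian matrix
        dXY = np.linalg.solve(A.T @ A, A.T @ L)       # step 04: (A^T A)^-1 A^T L
        X, Y = X - dXY[0], Y - dXY[1]                 # step 05: apply corrections
        H = dem_bilinear(X, Y)                        # step 05: refresh elevation
    return X, Y, H                                    # step 06: after convergence
```

Note the sign convention: the error here is projected minus observed, so the corrections are subtracted; the patent states the equivalent update with the opposite error sign.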
And step two, constructing an elevation extraction network model, and training the elevation extraction network model by using the data set constructed in the step one to obtain the trained elevation extraction network model.
As shown in fig. 1, the elevation extraction network model comprises a shared-weight encoder-decoder, a semantic prediction branch and an elevation prediction branch. The shared-weight encoder-decoder performs a ground-feature classification task and an elevation prediction task on the single-view remote sensing image, learns the prior relationship between the semantic information extracted by the classification task and the geometric information extracted by the elevation prediction task, and uses this prior relationship to obtain a feature vector. The feature data of all channels except the last channel of the feature vector is input into the semantic prediction branch to extract semantic features and obtain the ground-feature classification result; the feature data of the last channel is input into the elevation prediction branch to extract elevation features and obtain the elevation prediction result.
The following describes these structures and the loss function used in training in detail.
1) Feature codecs that share weights.
The decoder in the shared-weight encoder-decoder comprises four sequentially connected layers, each comprising, in order: a transposed convolution module and two first convolution modules. The transposed convolution module comprises, in order, a transposed convolution layer, an activation function and a post-transposed-convolution padding layer. The encoder in the shared-weight encoder-decoder is a ResNet. The structure of the shared-weight feature codec, obtained by combining and improving the ResNet and U-Net structures, is shown in Table 1:
Table 1 Structure of the shared-weight feature codec
[Table 1 is reproduced only as an image in the original publication; it lists the layer-by-layer configuration (Enc/Dec layers, channels, kernel sizes) of the shared-weight feature codec.]
Where Conv denotes the convolutional layer, the parameter list is (number of input channels, number of output channels, convolutional kernel size, edge fill size), BatchNorm denotes the batch normalization layer, the parameter list is (number of feature channels), and ReLU is the activation function.
Pad denotes an edge-filling operation, matched to the stride, that allows the subsequent convolution to complete smoothly. convTr denotes the transposed convolution, with parameter list (number of input channels, number of output channels, convolution kernel size, stride); Pad_Tr denotes the filling operation after the transposed convolution, whose parameter is the stride. conv is the ordinary convolution operation, as above.
Specifically, the dimensions of the input image are [B, C, H, W], where B is the number of images in a single input, C the number of channels, H the image height (pixels) and W the image width (pixels). With n_class classes in the learning process, the final output of the feature codec has dimensions [B, n_class + 1, H, W]. The first n_class channels enter the semantic prediction branch as the category vector to be processed, and the last channel enters the elevation prediction branch.
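The channel split described above amounts to array slicing along the channel axis. A minimal sketch with invented sizes:

```python
import numpy as np

B, n_class, H, W = 2, 5, 64, 64                   # hypothetical sizes
codec_out = np.random.rand(B, n_class + 1, H, W)  # codec output [B, n_class+1, H, W]

semantic_in = codec_out[:, :n_class]   # first n_class channels -> semantic branch
elevation_in = codec_out[:, n_class:]  # last channel -> elevation branch
```

Here `semantic_in` has shape (2, 5, 64, 64) and `elevation_in` has shape (2, 1, 64, 64); concatenating them along the channel axis recovers the codec output.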
The encoding part (layer identifiers beginning with Enc) follows the pattern of the ResNet base layers. The decoding part (the first convolution modules; layer identifiers beginning with Dec) receives the multi-scale features output by the encoder in the U-Net fashion, but differs from U-Net, which directly uses a scale-expanding upsample layer. Specifically: (1) the decoder performs convolutions with kernels (3, 1) and (1, 3) in the x and y directions respectively, helping the network perceive changes in terrain gradient; (2) it adopts transposed convolution layers with learnable parameters and adds an ordinary convolution after each transposed convolution to ablate checkerboard shadows, enriching the predicted texture details. Experimental results show that this improvement contributes a 0.3 m accuracy gain in the elevation direction.
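The checkerboard effect that the follow-up ordinary convolution is meant to ablate can be demonstrated in one dimension: a stride-2 transposed convolution with kernel size 3 covers interior even output positions twice and odd positions once, producing an alternating pattern even from a flat input. A following plain convolution (here with an illustrative binomial kernel, not a value from the patent) evens the pattern out.

```python
import numpy as np

def conv_transpose1d(x, k, stride=2):
    """Transposed convolution: each input element stamps the kernel into the
    output at `stride` spacing (no output trimming, for clarity)."""
    out = np.zeros(stride * (len(x) - 1) + len(k))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(k)] += v * k
    return out

def conv1d_same(x, k):
    """Ordinary 'same' convolution appended after the transposed one."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(k)] @ k for i in range(len(x))])

x = np.ones(8)                         # a flat feature map
up = conv_transpose1d(x, np.ones(3))   # kernel 3, stride 2: uneven overlap
# interior of `up` alternates 2, 1, 2, 1, ... : the checkerboard pattern
smooth = conv1d_same(up, np.array([0.25, 0.5, 0.25]))
# interior of `smooth` is constant: the follow-up convolution ablates it
```

The same uneven-overlap argument applies in 2D, where the artifact appears as the familiar checkerboard shadow.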
2) And (5) semantic prediction of branches and training.
The semantic prediction branch extracts semantic features to output the final ground-feature classification map, where n is the number of classes. Its structure is shown in Table 2 and comprises a second convolution module and a softmax layer; the second convolution module comprises a convolution layer, a batch normalization layer and an activation function.
Table 2 semantic predictive branch structure
[Table 2 is reproduced only as an image in the original publication; it lists the layer configuration of the semantic prediction branch.]
The ground-feature classification training uses the cross-entropy loss function L_s:

L_s = -Σ_{i=1}^{n} y_i · log(p_i)

where y_i is a one-hot classification label: y_i = 1 if the current pixel belongs to class i, otherwise y_i = 0; p is the n×1 output vector of the semantic prediction branch after the softmax layer; and p_i is the probability that the current pixel belongs to class i, i ∈ {1, 2, ..., n}.
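A per-pixel evaluation of L_s can be sketched as follows; the logits are invented values standing in for the branch output at one pixel.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, cls):
    """L_s = -sum_i y_i * log(p_i); with one-hot y this reduces to -log(p_cls)."""
    y = np.eye(len(p))[cls]             # one-hot classification label
    return -np.sum(y * np.log(p))

logits = np.array([2.0, 0.5, -1.0, 0.0])  # branch output for one pixel, n = 4
p = softmax(logits)                        # n x 1 vector after the softmax layer
loss = cross_entropy(p, cls=0)             # current pixel belongs to class index 0
```

Because y is one-hot, the loss is smallest when the predicted probability of the true class is largest.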
3) And (4) performing elevation prediction branching and training.
The elevation prediction branch is mainly used to calculate elevation in combination with the rational function model of the remote sensing image. Its structure is shown in Table 3 and comprises a third convolution module and a tanh() activation function; the third convolution module comprises a convolution layer and a batch normalization layer.
TABLE 3 elevation prediction Branch Structure
[Table 3 is reproduced only as an image in the original publication; it lists the layer configuration of the elevation prediction branch.]
The elevation prediction branches take the following two penalty functions.
First, the scale-invariant elevation error loss L_g:

L_g = (1/M) · Σ_{(r,c)} d(r, c)² - (1/M²) · (Σ_{(r,c)} d(r, c))²

d = log(h_p) - log(h_t)

where M is the number of sample pixels, (r, c) are the pixel coordinates of the remote sensing image, h_p is the predicted elevation, and h_t is the true elevation.
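The scale-invariant property of L_g, whose second term cancels any global log-scale offset of the prediction (this form matches the widely used scale-invariant depth loss of Eigen et al., to which the patent's image-only formula presumably corresponds), can be checked numerically. Elevation values are invented.

```python
import numpy as np

def scale_invariant_loss(h_p, h_t):
    """L_g = (1/M)*sum(d^2) - (1/M^2)*(sum d)^2, with d = log(h_p) - log(h_t)."""
    d = np.log(h_p) - np.log(h_t)
    M = d.size
    return np.sum(d ** 2) / M - np.sum(d) ** 2 / M ** 2

h_t = np.array([100.0, 120.0, 90.0, 150.0])  # true elevations h_t (metres)
h_p = np.array([105.0, 118.0, 95.0, 140.0])  # predicted elevations h_p
loss = scale_invariant_loss(h_p, h_t)
```

Scaling every prediction by a constant shifts every d by the same log amount, which the second term removes, so only the relative (shape) error of the elevation surface is penalized.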
Second, the reprojection loss function L_r:

L_r = (1/M)·Σ_(r,c) [ α·(1 − SSIM(I, Î))/2 + (1 − α)·|I − Î| ]

grid = RFM(X′, Y′, h_p)

Î = sample(I, grid)

SSIM(I, Î) = (2·μ_I·μ_Î + C1)·(2·σ_IÎ + C2) / [(μ_I² + μ_Î² + C1)·(σ_I² + σ_Î² + C2)]

Where M is the number of sample pixels, SSIM is the structural similarity descriptor of the image, μ is the mean operator, σ is the variance operator, and C1 and C2 are constants that prevent the denominator from being 0, typically set to 0.0001 and 0.0003 respectively. α is a weight control parameter, typically 0.85; X′ is the matrix composed of the longitudes in the sample, Y′ the matrix composed of the latitudes, and h_p the matrix composed of the predicted elevations; I is the original remote sensing image; RFM denotes the rational function imaging model; grid is the pixel coordinate grid obtained by reprojecting X′, Y′ and h_p through the RFM; and Î is the image generated by sampling on the original image using grid. Theoretically, when the predicted elevation coincides with the true value, Î is identical to I; the discrepancy between Î and the original remote sensing image I can therefore serve as the error that drives training.
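This comparison can be sketched as below (a simplified global SSIM over the whole image; the RFM resampling step that produces Î is not reproduced, so Î is taken as an input, and the windowing details of SSIM are an assumption):

```python
import numpy as np

def ssim_global(x, y, c1=0.0001, c2=0.0003):
    """Global SSIM between two images; C1, C2 guard zero denominators."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def reprojection_loss(I, I_hat, alpha=0.85):
    """Sketch of L_r: an SSIM term blended with a mean absolute difference.

    I_hat stands for the image resampled from I via grid = RFM(X', Y', h_p);
    that resampling step is outside the scope of this sketch.
    """
    ssim_term = (1.0 - ssim_global(I, I_hat)) / 2.0
    l1_term = np.abs(I - I_hat).mean()
    return float(alpha * ssim_term + (1.0 - alpha) * l1_term)

# when the predicted elevation is exact, I_hat equals I and the loss vanishes
I = np.linspace(0.0, 1.0, 16).reshape(4, 4)
loss_same = reprojection_loss(I, I.copy())
loss_diff = reprojection_loss(I, np.flipud(I))
```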
Step three, acquiring a single-view remote sensing image to be predicted, inputting it into the elevation extraction network model, and obtaining the elevation prediction result and the ground feature classification result.
The validity of the method of the invention is verified below with reference to specific examples. The results of the tests performed on the plain and hill test sets are shown in Table 4:
TABLE 4 elevation prediction results
Part of the visual results are compared across the original remote sensing image, the true elevation distribution map, the model-predicted elevation, and the ground feature classification result output by the model. FIG. 2-1 shows the first original remote sensing image, FIG. 2-2 the corresponding true elevation distribution map, FIG. 2-3 the corresponding model-predicted elevation, and FIG. 2-4 the corresponding ground feature classification result output by the model; FIG. 3-1 shows the second original remote sensing image, FIG. 3-2 the corresponding true elevation distribution map, FIG. 3-3 the corresponding model-predicted elevation, and FIG. 3-4 the corresponding ground feature classification result output by the model; FIG. 4-1 shows the third original remote sensing image, FIG. 4-2 the corresponding true elevation distribution map, FIG. 4-3 the corresponding model-predicted elevation, and FIG. 4-4 the corresponding ground feature classification result output by the model.
The invention aims to enable the shared-weight encoder-decoder to learn the prior relationship between recognizing the ground feature distribution and predicting the elevation, so that the elevation prediction task and the ground feature classification task supervise each other and the ground feature classification task assists the elevation prediction task in extracting geometric information more accurately. On one hand, elevation features are extracted through the elevation prediction branch to obtain a more accurate elevation prediction result; on the other hand, semantic features are extracted through the semantic prediction branch to obtain a more accurate ground feature classification result, ultimately improving the cognitive ability of the elevation extraction network model with respect to the real ground surface.

Claims (9)

1. A remote sensing image elevation prediction method combining semantic information, characterized by comprising the following steps:
acquiring a single-view remote sensing image, and inputting the single-view remote sensing image into an elevation extraction network model to obtain an elevation prediction result and a ground feature classification result of the single-view remote sensing image;
the elevation extraction network model is obtained by training with the single-view remote sensing image, the digital elevation model corresponding to the single-view remote sensing image, and the ground feature classification result as a training set, and comprises a shared-weight encoder-decoder, a semantic prediction branch and an elevation prediction branch; the shared-weight encoder-decoder is used for executing the ground feature classification task and the elevation prediction task on the single-view remote sensing image, learning a priori relationship between the semantic information extracted by the ground feature classification task and the geometric information extracted by the elevation prediction task, and obtaining a feature vector using the priori relationship; the feature data of all channels of the feature vector except the last channel are input into the semantic prediction branch to extract semantic features, obtaining the ground feature classification result; and the feature data of the last channel of the feature vector are input into the elevation prediction branch to extract elevation features, obtaining the elevation prediction result.
2. The method for predicting the elevation of the remote sensing image by combining the semantic information according to claim 1, wherein the method comprises the following steps:
the decoder in the shared-weight encoder-decoder comprises four sequentially connected layers, each layer comprising, in order, a transposed convolution module and two first convolution modules;
the transposed convolution module comprises, arranged in sequence, a transposed convolution layer, an activation function, and a padding operation layer after the transposed convolution.
3. The method for predicting the elevation of the remote sensing image by combining the semantic information according to claim 1, wherein the method comprises the following steps:
the construction method of the training set comprises the following steps:
acquiring a single-view remote sensing image I and the digital elevation model D of the corresponding geographical range; segmenting the single-view remote sensing image I into a plurality of small images; traversing each small image pixel by pixel, and solving, for each current pixel point (r, c), the corresponding longitude X, latitude Y and elevation H in the digital elevation model D by the least square method, obtaining the training set; the method for solving the longitude X, the latitude Y and the elevation H by the least square method comprises:
a. selecting the longitude offset X_0, the latitude offset Y_0 and the elevation offset H_0 in the rational function imaging model RFM corresponding to the single-view remote sensing image I, and recording them as the iteration initial value;
b. calculating the pixel coordinates (r_p, c_p) projected from the iteration initial value through the rational function imaging model RFM, calculating the difference between these pixel coordinates and the current pixel point, and recording it as the projection error;
c. calculating partial derivatives of the projection errors to the longitude and the latitude, and constructing a partial derivative arrangement matrix;
d. solving the correction numbers of the longitude direction and the latitude direction according to the partial derivative arrangement matrix and the projection error;
f. updating an iteration initial value according to the correction number, recording the iteration initial value as a current iteration initial value, obtaining the elevation corresponding to the longitude and the latitude in the current iteration initial value from the digital elevation model by an interpolation method, and updating the elevation value in the current iteration initial value;
g. and repeating the steps b-f until convergence, thereby obtaining the elevation value corresponding to the current pixel point.
4. The method for predicting the elevation of the remote sensing image by combining the semantic information according to claim 1, wherein the method comprises the following steps:
the encoder in the shared-weight encoder-decoder is ResNet.
5. The method for predicting the elevation of the remote sensing image by combining the semantic information according to claim 1, wherein the method comprises the following steps:
the structure of the semantic prediction branch comprises: a second convolution module and a softmax layer; the second convolution module includes a convolution layer, a batch normalization layer, and an activation function.
6. The method for predicting the elevation of the remote sensing image in combination with the semantic information as recited in claim 1 or 5, wherein:
when the elevation extraction network model is trained, the loss function used by the semantic prediction branch is the cross-entropy loss function L_s:

L_s = −Σ_(i=1..n) y_i·log(p_i)

wherein n is the number of classes in the sample to be predicted; y_i denotes a classification label in one-hot form: if the current pixel belongs to the i-th class, y_i = 1, otherwise y_i = 0; p is the output of the semantic prediction branch after the softmax layer, an n × 1 vector, and p_i denotes the probability that the current pixel belongs to class i, i ∈ {1, 2, 3, ..., n}.
7. The method for predicting the elevation of the remote sensing image by combining the semantic information according to claim 1, wherein the method comprises the following steps:
the structure of the elevation prediction branch comprises a third convolution module and an activation function; the third convolution module includes: convolutional layers and batch normalization layers.
8. The method for predicting the elevation of the remote sensing image in combination with the semantic information as recited in claim 1 or 7, wherein:
when the elevation extraction network model is trained, the loss functions used by the elevation prediction branch comprise the elevation error loss function L_g with scale invariance, whose formula is as follows:

L_g = (1/M)·Σ_(r,c) d(r,c)² − (1/M²)·[Σ_(r,c) d(r,c)]²

d = log(h_p) − log(h_t)

wherein M is the number of sample pixels, (r, c) are the pixel coordinates of the remote sensing image, h_p denotes the predicted elevation, and h_t denotes the true elevation value.
9. The method for predicting the elevation of the remote sensing image in combination with the semantic information as recited in claim 1 or 7, wherein:
when the elevation extraction network model is trained, the loss functions used by the elevation prediction branch comprise the reprojection loss function L_r, whose formula is as follows:

L_r = (1/M)·Σ_(r,c) [ α·(1 − SSIM(I, Î))/2 + (1 − α)·|I − Î| ]

grid = RFM(X′, Y′, h_p)

Î = sample(I, grid)

SSIM(I, Î) = (2·μ_I·μ_Î + C1)·(2·σ_IÎ + C2) / [(μ_I² + μ_Î² + C1)·(σ_I² + σ_Î² + C2)]

where M is the number of sample pixels, SSIM is the structural similarity descriptor of the image, μ is the mean operator, σ is the variance operator, C1 and C2 are constants that prevent the denominator from being 0, α is a weight control parameter, X′ is the matrix composed of the longitudes in the sample, Y′ is the matrix composed of the latitudes in the sample, h_p is the matrix composed of the predicted elevations, I is the original remote sensing image, RFM denotes the rational function imaging model, grid is the pixel coordinate grid obtained by reprojecting X′, Y′ and h_p through the RFM, and Î is the image generated by sampling on the original image using grid.
CN202210557539.1A 2022-05-19 2022-05-19 Remote sensing image elevation prediction method combining semantic information Pending CN114821192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210557539.1A CN114821192A (en) 2022-05-19 2022-05-19 Remote sensing image elevation prediction method combining semantic information

Publications (1)

Publication Number Publication Date
CN114821192A true CN114821192A (en) 2022-07-29

Family

ID=82516623


Country Status (1)

Country Link
CN (1) CN114821192A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030189A (en) * 2022-12-20 2023-04-28 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image
CN116030189B (en) * 2022-12-20 2023-07-04 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination