CN116402942A - Large-scale building three-dimensional reconstruction method integrating multi-scale image features - Google Patents

Large-scale building three-dimensional reconstruction method integrating multi-scale image features

Info

Publication number
CN116402942A
Authority
CN
China
Prior art keywords
ray
scale
light
color
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194010.2A
Other languages
Chinese (zh)
Inventor
杨青林
张展
张觅
周桓
杨炳楠
李大宇
刘青瑀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310194010.2A
Publication of CN116402942A
Pending legal-status Critical Current

Classifications

    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06T 15/06 Ray-tracing
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2210/04 Architectural design, interior design


Abstract

The invention discloses a large-scale building three-dimensional reconstruction method that fuses multi-scale image features, comprising the following steps: remote sensing data selection and preprocessing; construction of a basic network module and a residual network module, which are used to obtain a shadow scalar, an uncertainty prediction value, a reflected-light color and an ambient-light color; inputting the rays corresponding to the highest-level image into the basic network module and the rays corresponding to the other scale images layer by layer into residual network modules, with the rays of each image going to one residual network module, then fusing the results of all residual network modules into a unified output, fusing that unified output with the output of the basic network module to obtain the final output, calculating from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel, and updating the network weights through the constructed loss function; and performing three-dimensional reconstruction of the remote sensing image to be reconstructed with the trained network.

Description

Large-scale building three-dimensional reconstruction method integrating multi-scale image features
Technical Field
The invention belongs to the application of deep learning to three-dimensional reconstruction from high-resolution remote sensing images, relates to a large-scale building three-dimensional reconstruction method that fuses multi-scale image features, and particularly relates to a method for automatically generating three-dimensional building models from multi-view remote sensing images.
Background
Three-dimensional reconstruction is a core technology in fields such as smart cities, autonomous driving and virtual reality (document 1). The conventional three-dimensional reconstruction methods most commonly used at present include oblique photogrammetry and close-range photogrammetry (documents 2-3); they are used for fine modeling of scenes, and their main data sources are equipment such as cameras and unmanned aerial vehicles. High-resolution remote sensing images have the characteristics of wide coverage, rich scene features and multiple acquisition epochs; they allow rapid, large-scale reconstruction of extensive remote sensing scenes, while the rich scene feature information and multi-temporal image data help guarantee the quality of the three-dimensional reconstruction, so they have great research and application value and have attracted wide attention. However, owing to the particularities of the remote sensing camera model and the sparsity of target photography, traditional methods require considerable manual editing to determine control-point coordinates and complete image orientation when high-resolution images are used for three-dimensional reconstruction; the workflow is complex, making it difficult to reconstruct remote sensing scenes quickly, at low cost and at large scale.
With the rapid development of artificial intelligence and related hardware, deep learning techniques represented by the MVS-Net network have achieved great success in the field of three-dimensional reconstruction. Building on binocular stereo-matching theory, MVS-Net proposes a cost volume based on differentiable homography, computes the confidence of the depth prediction by constructing this cost volume, trains with maximum confidence as the objective, achieves end-to-end high-quality generation of scene depth maps, and overcomes the dependence of traditional remote sensing three-dimensional reconstruction on large numbers of manual control points (document 4). However, training the model requires depth maps generated by a three-dimensional data acquisition system as ground truth, and depth maps of large-scale remote sensing scenes are difficult and costly to acquire, so the approach is hard to apply widely.
In recent years, Neural Radiance Fields (NeRF) have fitted the radiance and density fields of three-dimensional scenes by creatively using a multi-layer perceptron (MLP) and constructing a loss function that takes the color of image pixels as ground truth, eliminating the dependence of model training on depth maps, achieving high-quality three-dimensional scene representation in a concise manner and greatly advancing the field of three-dimensional reconstruction (document 5). Three-dimensional reconstruction research represented by Block-NeRF, BungeeNeRF and Mega-NeRF achieves city-level, high-precision, large-scale reconstruction from unmanned aerial vehicle aerial images and street-view images, opening a new technical route for large-scale scene reconstruction from satellite remote sensing images (documents 6-8). However, the camera models used by these methods differ greatly from satellite sensor models, so they are difficult to apply directly to remote sensing three-dimensional reconstruction tasks. In addition, the scene complexity, view sparsity and diverse external influence factors of remote sensing images make it difficult to guarantee reconstruction accuracy. Sat-NeRF successfully applied a NeRF-based network to the three-dimensional reconstruction of remote sensing images by introducing the satellite rational polynomial coefficient (Rational Polynomial Coefficients, RPC) camera model (document 9). However, that method does not capture the details of the remote sensing scene well, and its reconstruction accuracy leaves room for further improvement.
[Document 1] Li Mingyang, Chen Wei, Wang Shanshan, et al. Three-dimensional reconstruction methods for visual deep learning [J/OL]. Computer Science and Exploration: 1-26 [2023-02-11]. http://kns.cnki.net/kcms/detail/11.5602.TP.20221020.1347.002.html.
[Document 2] Sun Hongwei. Three-dimensional digital urban modeling based on oblique photogrammetry [J]. Modern Mapping, 2014, 37(1): 18-21.
[Document 3] Miao Zhicheng, Yang Yongchong, Yu Qing, et al. Application of close-range photogrammetry to the detailed modeling of a single building [J]. Remote Sensing Information, 2021, 36(5): 107-113.
[Document 4] Yao Y, Luo Z, Li S, et al. MVSNet: Depth inference for unstructured multi-view stereo [C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 767-783.
[Document 5] Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis [J]. Communications of the ACM, 2021, 65(1): 99-106.
[Document 6] Tancik M, Casser V, Yan X, et al. Block-NeRF: Scalable large scene neural view synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 8248-8258.
[Document 7] Xiangli Y, Xu L, Pan X, et al. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering [C]// Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXII. Cham: Springer Nature Switzerland, 2022: 106-122.
[Document 8] Turki H, Ramanan D, Satyanarayanan M. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 12922-12931.
[Document 9] Marí R, Facciolo G, Ehret T. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 1311-1321.
Disclosure of Invention
Aiming at the defects of existing remote sensing three-dimensional reconstruction methods, the invention constructs the network with the satellite RPC model, adds shadow detection and transient-object handling (for vehicles, pedestrians and the like) to address the shadows, cluttered backgrounds and complex content of remote sensing images, and thereby reduces to some extent the influence of inconsistent imaging conditions on model accuracy. Finally, residual network modules are progressively and dynamically added during training to capture both the local details and the global features of the remote sensing image, improving the accuracy of the result.
The technical scheme adopted by the invention is as follows: a large-scale building three-dimensional reconstruction method integrating multi-scale image features comprises the following steps:
step 1, remote sensing data are selected and preprocessed, and a data set is divided into training and testing data sets according to a certain proportion;
acquiring RGB data, RPC model parameters, sun direction parameters and time-phase data of the remote sensing images from the data set; obtaining, through preprocessing, the rays emitted from the satellite sensor based on the RPC model, each ray being a vector recording the origin coordinates from which the ray is emitted and its direction; and finally organizing the RGB data, the sun direction parameters and the ray vectors into dictionary-type data for later use;
step 2, constructing a basic network module and a residual network module, the main body of each being a multi-layer perceptron (MLP) comprising an input layer, several hidden layers and an output layer; each layer of the MLP has width h, the input of the input layer is the coordinates of any point on a ray together with the ray direction, and the output of the first half of the MLP is the predicted pixel value and the voxel density;
after the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce a shadow scalar s, an uncertainty prediction value β, a reflected-light color c_a and an ambient-light color a;
step 3, up-sampling the remote sensing images of the training data set with several convolution layers to obtain several images of different scales, building an image pyramid from the multi-scale images, and constructing rays for the images in the pyramid; first, the rays corresponding to the highest-level image are input to the basic network module to learn the relatively abstract, macro-scale overall scene information and obtain an output result; the rays corresponding to the other scale images are input layer by layer into residual network modules, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images; the results of all residual network modules are fused into one unified output, which is finally fused with the output of the basic network module to obtain the final output; from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel are calculated, and the network weights are updated through the constructed loss function;
and step 4, inputting the remote sensing images in the test data set into the trained network in the step 3, and realizing three-dimensional reconstruction of the remote sensing images to be reconstructed.
Further, the processing procedure of the first half of the MLP in step 2 is as follows:
(RGB,σ)=F(x,d) (1)
where RGB represents the pixel values predicted by MLP, σ is the voxel density, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
Further, the specific implementation manner of calculating the depth of the corresponding point of the target object and the color of the corresponding pixel in the step 3 is as follows;
step 3.1, first substitute σ_i into formula (2) to compute the opacity α_i and the accumulated transmittance T_i of each spatial point:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (2)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points;
step 3.2, substitute α_i, T_i and t_i into formula (3) to obtain the depth d(r) of the corresponding point on the target object:
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
step 3.3, substitute c_a, a and s_i into formula (4) to obtain the predicted color c_i of the corresponding ray sampling point:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (4)
where c(x_i, ω, t_j) is the color rendered at point i of each ray, i.e. c_i; x_i is the three-dimensional coordinate of point i on the ray; ω is the sun direction angle; c_a(x_i) is the reflected-light color c_a output by the MLP; a(ω) is the ambient-light color output by the MLP; t_j is the time-phase data of the image; and s(x_i, ω) is the shadow scalar, taking values between 0 and 1;
step 3.4, substitute T_i, c_i and α_i into formula (5) to obtain the color c(r) of the pixel corresponding to the ray:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (5)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on the ray, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample of ray r.
Further, the loss function constructed in the step 2 includes a solar ray direction correction term, an MSE loss, and a depth supervision loss, wherein a calculation formula of the solar ray direction correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on the ray, s_i is the shadow scalar of the i-th point on the ray, and R_SC is the set of secondary solar-correction rays.
Further, the calculation formula of the MSE loss is as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being specified empirical values.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
where β_i is the uncertainty prediction of the i-th point of ray r2;
further, in order to learn depth information of a scene, a depth supervision loss is constructed, and a calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
Further, the calculation formula of the final loss function is as follows:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
wherein lambda is SC And lambda is DS Is the weight of the corresponding loss term.
Further, after network training is completed, the network weights are saved in a ckpt file; end-to-end three-dimensional reconstruction is performed directly from the input remote sensing image by means of the ckpt file, an imaging tool reconstructs the three-dimensional model of the whole scene from the output c(r) and d(r), and a corresponding DSM or mp4-type visualization result is output.
Further, each layer of the MLP has a width of 1024.
Further, λ_SC and λ_DS take the values 0.1/3 and 1000/3, respectively.
Further, in step 1, an IEEE GRSS data set is selected, where the data set includes the following parts:
(1) WorldView-3 satellite images, full color and eight-band visible light and near infrared, ground sampling distances of 35 cm and 1.3 m respectively;
(2) Three-dimensional data provided by point clouds or digital surface models DSMs generated by an airborne laser radar with a resolution of 80 centimeters;
(3) Sensor RPC parameter, solar ray direction and shooting time phase information data.
The invention fully utilizes the multi-level characteristics of the high-resolution remote sensing image to realize the efficient automatic three-dimensional reconstruction of the large-scale building. The advantages are as follows:
(1) The RPC model is introduced into the neural radiance field so that camera projection rays can be constructed and remote sensing images can be applied to NeRF; at the same time, the shadow occlusion and transient objects present in remote sensing images are handled accordingly, making the final result finer.
(2) Multi-level features of the image are extracted by convolution, making full use of both the global and the local information of the image and improving the final result.
(3) The whole network model is trained progressively, combining the basic network module and the residual network modules; the outputs of the two are finally fused to obtain the final result, improving reconstruction accuracy.
Drawings
FIG. 1 is a schematic diagram of a neural network model for large-scale three-dimensional reconstruction according to an embodiment of the present invention;
fig. 2 is an overall flow chart of an embodiment of the present invention.
Detailed Description
In order to facilitate an understanding and practice of the invention by those of ordinary skill in the art, the invention will be further described with reference to the drawings and specific examples, it being understood that the examples described herein are for illustration and description only and are not intended to be limiting of the invention.
Referring to fig. 2, the large-scale building three-dimensional reconstruction method fusing multi-scale image features provided by the invention proceeds as follows: the NeRF network is constructed with the satellite RPC model, and a shadow detection module and a transient-object (such as vehicles and pedestrians) detection module are added to address the shadows, cluttered backgrounds and complex content of remote sensing images, thereby eliminating to some extent the influence of inconsistent imaging conditions on model accuracy. Residual blocks are then progressively and dynamically added during training to capture both the local details and the global features of the remote sensing image, improving result accuracy; the overall model structure is shown in fig. 1. Finally, end-to-end three-dimensional reconstruction of large-scale buildings is achieved with the trained network.
Based on this model structure, the embodiment of the invention provides an end-to-end, progressive neural-radiance-field method for large-scale three-dimensional reconstruction from remote sensing images. The specific implementation steps are as follows:
step 1, remote sensing data selection and preprocessing
In this embodiment, the 2019 IEEE GRSS Data Fusion Contest data set is selected to compare and test various building reconstruction methods. The data set contains a number of complex urban scenes with different building densities, spatial extents and surrounding environments, and can well verify the extraction accuracy and reliability of different building three-dimensional reconstruction methods. The data set contains the following:
(1) WorldView-3 satellite images (supplied by Maxar), full-color and eight-band visible and near-infrared, ground sample distances of 35 cm and 1.3 meters, respectively.
(2) The three-dimensional data provided by the point cloud or Digital Surface Models (DSMs) generated by the airborne lidar is 80 cm in resolution.
(3) Metadata such as sensor RPC parameters, solar ray direction, shooting time phase information, etc.
This example selects from the data set 26 Maxar WorldView-3 images collected over Jacksonville, Florida between 2014 and 2016. From these data, a set of RGB image crops of different sizes, approximately 800 × 800 pixels with a minimum resolution of 0.3 m/pixel, is taken as input, each AOI covering an area of 256 × 256 m. The training and test sets are split in a ratio of 8:2, and the RPC camera model of the satellite images is used directly for ray casting; each RPC is defined by a projection function (projecting 3D points onto image pixels) and its inverse, the localization function. The minimum and maximum heights of the scene are denoted h_min and h_max, respectively. A ray passing through the scene and intersecting pixel p of the j-th image is modeled as the straight line between an initial and a final 3D point, x_start and x_end. Using the RPC localization function L_j of the j-th image, pixel p is localized at h_max and h_min to obtain the boundary points:
x_start = L_j(p, h_max)_ECEF;  x_end = L_j(p, h_min)_ECEF
where the subscript ECEF indicates that the 3D points returned by the localization function L_j are converted into the Earth-centered, Earth-fixed coordinate system (geocentric system) so as to operate in a Cartesian reference frame.
Given x_start and x_end, the origin o and direction vector d of the ray r(t) = o + t·d intersecting pixel p are obtained. The height bounds [h_min, h_max] can be chosen in various ways, for example from a coarse elevation model extracted from low-resolution data. For the j-th image, the ray direction is expressed as:
d = (x_end − x_start) / ‖x_end − x_start‖₂
The maximum-height point x_start, nearest to the camera, is taken as the origin o of the ray. The bounds of ray r(t) = o + t·d, i.e. [t_min, t_max], are set to t_min = 0 and t_max = ‖x_end − x_start‖₂. The ECEF coordinates cannot be used directly in practice because their values are large, so the invention normalizes all ray points into the interval [−1, 1] using a procedure similar to the subtract-offset-and-scale step of the RPC functions. The set of 3D points obtained by bounding all pixels of the input images between h_min and h_max is used to compute the offset and scale in each spatial dimension. Finally, the constructed rays, the RGB image information and the other metadata are organized into a dictionary for convenient use.
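The ray construction described above can be sketched as follows. This is a minimal NumPy illustration that assumes the boundary points x_start and x_end have already been obtained from the RPC localization function (which is not reproduced here); the function names and the toy coordinates are illustrative, not part of the patent.

```python
import numpy as np

def build_rays(x_start, x_end):
    """Build rays from per-pixel ECEF boundary points.

    x_start, x_end: (N, 3) arrays of the 3D points obtained by localizing each
    pixel at h_max and h_min with the image's RPC localization function.
    Returns ray origins o, unit directions d and the sampling bounds [t_min, t_max]."""
    diff = x_end - x_start
    t_max = np.linalg.norm(diff, axis=1, keepdims=True)  # ||x_end - x_start||_2
    d = diff / t_max                                     # unit direction of each ray
    o = x_start                                          # origin = highest point, nearest the camera
    t_min = np.zeros_like(t_max)
    return o, d, t_min, t_max

def normalize_points(points):
    """Map coordinates into [-1, 1] per axis with an offset and a scale,
    mimicking the subtract-offset-and-scale step of the RPC functions."""
    offset = (points.max(axis=0) + points.min(axis=0)) / 2.0
    scale = (points.max(axis=0) - points.min(axis=0)) / 2.0
    return (points - offset) / scale, offset, scale

# Toy usage with two rays (coordinates are illustrative, not real ECEF values).
x_start = np.array([[0.0, 0.0, 100.0], [10.0, 0.0, 100.0]])
x_end = np.array([[2.0, 0.0, 0.0], [10.0, 5.0, 0.0]])
o, d, t_min, t_max = build_rays(x_start, x_end)
pts_norm, offset, scale = normalize_points(np.vstack([x_start, x_end]))
```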
Step 2, constructing a basic network module and a residual network module
As shown in fig. 1, the basic network module and the residual network module are identical in structure, with an MLP as the main body. The core of NeRF is the construction and rendering of camera rays, which is computationally expensive; choosing an MLP both simplifies the network structure, reducing computation, and allows gradient updates to be completed faster.
Because the ground features in remote sensing images are complex and many objects, such as pedestrians and vehicles, exist only temporarily, learning from data of different time phases yields larger errors, and the buildings in the images often have parts covered by shadow. Processing operations for shadows and transient objects therefore need to be added to the model to eliminate these effects.
Therefore, when constructing the basic network module and the residual network module, the two modules share the same structure (the block in fig. 1): the main body is a multi-layer perceptron (MLP) in which each layer has width h (1024 by default). The first half of the MLP (the part before σ is output in the block of fig. 1) fits the NeRF representation of the whole scene (i.e. a function in MLP form). Its input is the coordinates of any point on a ray (since the ray's starting point and direction are known, the coordinates of any point on it can be computed) together with the ray direction, and the function can be expressed as
(RGB,σ)=F(x,d) (1)
Where RGB represents the pixel values predicted by the MLP, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
After the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce the shadow scalar s, the uncertainty prediction value β, the reflected-light color c_a and the ambient-light color a (which depends only on the solar ray angle). In other words, adding these extra layers yields MLP-form functions from the inputs to the corresponding outputs. Note that, as shown in the block of fig. 1, the MLP that outputs β requires the time feature t_j as an additional input, the output s requires the solar direction angle ω as an additional input, and the output a is simply a single-layer MLP mapping from ω;
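The block structure described above can be illustrated with a small PyTorch sketch. The exact layer counts, skip connections and positional encodings are not fully specified in the text, so the sizes below, the three-component encoding of the sun direction ω and the wiring of the extra heads are assumptions; only the described inputs and outputs (RGB, σ, s, β, c_a, a) are taken from the patent.

```python
import torch
import torch.nn as nn

class NeRFBlock(nn.Module):
    """One base/residual block: an MLP trunk predicting (RGB, sigma) from a point x
    and ray direction d, plus extra heads for the shadow scalar s, uncertainty beta,
    reflected colour c_a and ambient colour a. Depths and encodings are simplified."""

    def __init__(self, h=1024, d_time=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, h), nn.ReLU(),
                                   nn.Linear(h, h), nn.ReLU())
        self.sigma_head = nn.Sequential(nn.Linear(h, 1), nn.Softplus())
        self.rgb_head = nn.Sequential(nn.Linear(h + 3, h // 2), nn.ReLU(),
                                      nn.Linear(h // 2, 3), nn.Sigmoid())
        self.extra = nn.Sequential(nn.Linear(h, h), nn.ReLU())        # extra hidden layer of width h
        self.ca_head = nn.Sequential(nn.Linear(h, h // 2), nn.ReLU(),
                                     nn.Linear(h // 2, 3), nn.Sigmoid())
        self.s_head = nn.Sequential(nn.Linear(h + 3, h // 2), nn.ReLU(),
                                    nn.Linear(h // 2, 1), nn.Sigmoid())
        self.beta_head = nn.Sequential(nn.Linear(h + d_time, h // 2), nn.ReLU(),
                                       nn.Linear(h // 2, 1), nn.Softplus())
        self.a_head = nn.Sequential(nn.Linear(3, 3), nn.Sigmoid())    # ambient colour from sun direction only

    def forward(self, x, d, omega, t_embed):
        feat = self.trunk(x)
        sigma = self.sigma_head(feat)                                 # voxel density
        rgb = self.rgb_head(torch.cat([feat, d], dim=-1))             # predicted pixel value
        f2 = self.extra(feat)
        c_a = self.ca_head(f2)                                        # reflected-light colour
        s = self.s_head(torch.cat([f2, omega], dim=-1))               # shadow scalar, needs omega
        beta = self.beta_head(torch.cat([f2, t_embed], dim=-1))       # uncertainty, needs time feature t_j
        a = self.a_head(omega)                                        # ambient-light colour a(omega)
        return dict(sigma=sigma, rgb=rgb, c_a=c_a, s=s, beta=beta, a=a)

# Toy usage with a reduced width.
block = NeRFBlock(h=256)
out = block(torch.rand(8, 3), torch.rand(8, 3), torch.rand(8, 3), torch.rand(8, 4))
```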
step 3, progressively training the network
Several convolution layers are used to extract features from the input remote sensing image, yielding several images of different scales: the original input image is up-sampled several times to build an image pyramid, rays are constructed for the images in the pyramid, and the ray directions of the higher-level images are obtained by scaling the ray directions constructed for the original image.
First, the rays constructed from the highest-level image are input to the basic network module to learn the abstract, macro-scale overall scene information; following the basic-module structure, the MLP fits the spatial distribution of the scene and outputs the overall-scene voxel density σ_base, color rendering value c_base, shadow scalar s_base, opacity α_base and uncertainty prediction value β_base.
Then the rays constructed from each level of the pyramid are input layer by layer into residual network modules, down to the original image, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images. Every residual network module outputs a voxel density, a color rendering value, a shadow scalar, an opacity and an uncertainty prediction value, and the results of all residual network modules are fused (this method uses averaging) into one unified output: voxel density σ_res, color rendering value c_res, shadow scalar s_res, opacity α_res and uncertainty prediction value β_res. Finally, the residual output (voxel density σ_res, reflected-light color c_a^res, shadow scalar s_res, ambient-light color a_res, uncertainty prediction value β_res) is fused with the overall-scene output of the basic network module (voxel density σ_base, reflected-light color c_a^base, shadow scalar s_base, ambient-light color a_base, uncertainty prediction value β_base) to obtain the final outputs: voxel density σ_i, reflected-light color c_a, shadow scalar s_i, ambient-light color a_i and uncertainty prediction value β_i.
These outputs are then substituted into the following formulas. σ_i is substituted into formula (4) to compute the opacity α_i and transmittance T_i of each spatial point, and α_i, T_i and t_i are substituted into formula (3) to obtain the depth d(r) of the corresponding point on the target object. c_a, a_i and s_i are substituted into formula (5) to obtain the predicted color c_i of the corresponding ray sample, and T_i, c_i and α_i are substituted into formula (2) to obtain the color c(r) of the pixel corresponding to the ray. The network weights are then updated through the constructed loss function: β_i, T_i and α_i are substituted into formula (8) to obtain β(r), the predicted likelihood that the pixel corresponding to the ray belongs to a transient object; c(r), β(r) and the color c_GT(r) of the corresponding pixel of the input image at the corresponding level are substituted into formula (7) to build the loss term L_RGB; T_i, α_i and s_i are substituted into formula (6) to build the loss term L_SC; d(r) is substituted into formula (9) to build the loss term L_DS; and finally L_RGB, L_SC and L_DS are substituted into formula (10) to obtain the final loss function.
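A minimal sketch of the fusion step follows. Averaging the residual modules' outputs is stated in the text; the weighted average used to fuse the unified residual output with the basic module's output is an assumption, since the patent does not spell out that operator.

```python
import torch

def average_outputs(outputs):
    """Merge the per-module output dictionaries of all residual modules into one
    unified output by averaging each quantity key by key."""
    return {k: torch.mean(torch.stack([o[k] for o in outputs], dim=0), dim=0)
            for k in outputs[0]}

def fuse_base_and_residual(base_out, residual_outs, w_base=0.5):
    """Fuse the unified residual output with the basic-module output.
    A simple weighted average is assumed here."""
    res_out = average_outputs(residual_outs)
    return {k: w_base * base_out[k] + (1.0 - w_base) * res_out[k] for k in base_out}

# Toy usage: one base output and two residual outputs for 8 samples.
base = {"sigma": torch.rand(8, 1), "c_a": torch.rand(8, 3)}
residuals = [{"sigma": torch.rand(8, 1), "c_a": torch.rand(8, 3)} for _ in range(2)]
fused = fuse_base_and_residual(base, residuals)
```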
The basic formula for rendering a ray is as follows:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (2)
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on ray r, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample on ray r, defined as follows:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (4)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points.
The color c_i rendered at point i of each ray is computed with the shadow-aware irradiance model proposed in S-NeRF, as follows:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (5)
wherein c (x) i ,ω,t j ) Substituting c in the rendering formula i Wherein x is i Representing the three-dimensional coordinates of point i on the light, ω being the solar direction angle, c a (x i ) Refers to the reflected light color of the MLP output, i.e., c in FIG. 1 a A (omega) refers to the ambient light color output by the MLP, t j Phase data of the image; s (x) i ω) is a shadow scalar, takes a value between 0 and 1, and determines the shadow region by the albedo of the scene. Ideally, at those 3D points which are directly illuminated by the sun, s.apprxeq.1, the color is entirely determined by the reflectance c a (x) To explain.
The loss function comprises a solar ray direction correction term, MSE loss and depth supervision loss, and the specific construction mode is as follows:
in practice the direction of the solar rays ω is closely related to the acquisition date (in particular the satellite passing at the same time of day). Thus, due to the mixture of phenomena, the finally captured ambient irradiance of a (ω) is not only related to ω but also to the conditions of a particular date, such as weather or seasonal variations. The model can not correct the distortion caused by the change of different sun ray directions from training data, and the invention solves the problem by adding a sun ray direction correction term in a loss function, wherein the calculation formula of the correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on ray r1, s_i is the shadow scalar of the i-th point on ray r1, and R_SC is the set of secondary solar-correction rays, which follow the solar ray direction ω while the other, primary rays follow the viewing direction of the camera. The learnable geometry used by the solar-ray correction term is encoded by the transmittance T_i and the opacity α_i, which further supervises the learning of the shadow-aware scalar s(x, ω). The first part of the formula means that, for each ray r1 of R_SC, the s_i predicted at the i-th point should be similar to T_i; the second part requires the integral of s along r1 to be as close to 1 as possible, since non-occluded, non-shadowed areas must be explained mainly by the albedo in the shadow-aware irradiance model. This corresponds to the "add shadow processing" operation in fig. 2.
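A minimal sketch of this solar-correction term, following the description of formula (6); the per-ray summation and the absence of any extra normalization are assumptions.

```python
import torch

def solar_correction_loss(trans, alpha, s):
    """Solar-direction correction term: along each secondary solar ray the shadow
    scalar s_i should resemble the transmittance T_i, and the accumulated
    alpha_i * s_i should approach 1.  trans, alpha, s: (R_sc, N) tensors."""
    term1 = torch.sum((trans - s) ** 2, dim=-1)   # s_i should match T_i
    term2 = 1.0 - torch.sum(alpha * s, dim=-1)    # integral of s along the ray should reach 1
    return torch.sum(term1 + term2)

# Toy usage with 4 solar-correction rays of 16 samples.
loss_sc = solar_correction_loss(torch.rand(4, 16), torch.rand(4, 16), torch.rand(4, 16))
```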
For the MSE loss, the task-uncertainty learning approach of W-NeRF is adopted to improve the robustness of the model: in this embodiment the uncertainty prediction β weights the contribution of each ray to the MSE between the rendered and known colors, as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being manually specified empirical values; β_min = 0.05 and η = 3 are taken so that the logarithmic term does not become negative. The logarithm in L_RGB prevents β from diverging to infinity. In this way, the model trades off the uncertainty coefficient β against the color difference to reach the final value.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
wherein beta is i Is the uncertainty predictor of the ith point of ray r. The above operation is implemented as the add transient object handling operation in fig. 2.
Meanwhile, in order to learn the depth information of the scene, a depth supervision loss function can be constructed, and the calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
And finally, constructing a final loss function by weighted addition of the loss functions:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
where λ_SC and λ_DS are the weights of the corresponding loss terms, taken as 0.1/3 and 1000/3, respectively.
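A sketch of the depth-supervision term and the final weighted loss, in the spirit of formulas (9) and (10); the squared depth penalty and the reduction over rays are assumptions, while the weights 0.1/3 and 1000/3 are those quoted above.

```python
import torch

def depth_supervision_loss(d_pred, x_kp, o, w):
    """Depth-supervision term: the rendered depth of a ray built from a SIFT keypoint
    should match the distance from the ray origin o(r) to the keypoint coordinate X(r),
    weighted by w(r).  d_pred, w: (R,); x_kp, o: (R, 3)."""
    d_ref = torch.linalg.norm(x_kp - o, dim=-1)
    return torch.sum(w * (d_pred - d_ref) ** 2)

def total_loss(l_rgb, l_sc, l_ds, lam_sc=0.1 / 3, lam_ds=1000.0 / 3):
    """Weighted combination of Eq. (10) with the weights quoted in the text."""
    return l_rgb + lam_sc * l_sc + lam_ds * l_ds

# Toy usage.
w = torch.full((4,), 0.25)
l_ds = depth_supervision_loss(torch.rand(4), torch.rand(4, 3), torch.zeros(4, 3), w)
loss = total_loss(torch.tensor(1.0), torch.tensor(0.2), l_ds)
```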
And 4, realizing three-dimensional reconstruction of the building by utilizing the trained network in the step 3.
After network training is finished, the network weights are saved in a ckpt file; with this file, end-to-end three-dimensional reconstruction can be carried out directly from the input remote sensing image and the final result output.
In specific implementation, the invention can adopt computer software technology to realize automatic operation flow, and the device for operating the flow of the invention should be within the protection scope.
It should be understood that the foregoing description of the preferred embodiments is for illustration only and is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (10)

1. A large-scale building three-dimensional reconstruction method integrating multi-scale image features is characterized by comprising the following steps:
step 1, remote sensing data are selected and preprocessed, and a data set is divided into training and testing data sets according to a certain proportion;
acquiring RGB data, RPC model parameters, sun direction parameters and time-phase data of the remote sensing images from the data set; obtaining, through preprocessing, the rays emitted from the satellite sensor based on the RPC model, each ray being a vector recording the origin coordinates from which the ray is emitted and its direction; and finally organizing the RGB data, the sun direction parameters and the ray vectors into dictionary-type data for later use;
step 2, constructing a basic network module and a residual network module, the main body of each being a multi-layer perceptron (MLP) comprising an input layer, several hidden layers and an output layer; each layer of the MLP has width h, the input of the input layer is the coordinates of any point on a ray together with the ray direction, and the output of the first half of the MLP is the predicted pixel value and the voxel density;
after the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce a shadow scalar s, an uncertainty prediction value β, a reflected-light color c_a and an ambient-light color a;
step 3, up-sampling the remote sensing images of the training data set with several convolution layers to obtain several images of different scales, building an image pyramid from the multi-scale images, and constructing rays for the images in the pyramid; first, the rays corresponding to the highest-level image are input to the basic network module to learn the relatively abstract, macro-scale overall scene information and obtain an output result; the rays corresponding to the other scale images are input layer by layer into residual network modules, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images; the results of all residual network modules are fused into one unified output, which is finally fused with the output of the basic network module to obtain the final output; from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel are calculated, and the network weights are updated through the constructed loss function;
and step 4, inputting the remote sensing images in the test data set into the trained network in the step 3, and realizing three-dimensional reconstruction of the remote sensing images to be reconstructed.
2. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: the processing of the first half of the MLP in step 2 is shown below:
(RGB,σ)=F(x,d) (1)
where RGB represents the pixel values predicted by MLP, σ is the voxel density, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
3. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: the specific implementation manner of calculating the depth of the corresponding point of the target object and the color of the corresponding pixel in the step 3 is as follows;
step 3.1, first substitute σ_i into formula (2) to compute the opacity α_i and the accumulated transmittance T_i of each spatial point:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (2)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points;
step 3.2, substitute α_i, T_i and t_i into formula (3) to obtain the depth d(r) of the corresponding point on the target object:
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
step 3.3, substitute c_a, a and s_i into formula (4) to obtain the predicted color c_i of the corresponding ray sampling point:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (4)
where c(x_i, ω, t_j) is the color rendered at point i of each ray, i.e. c_i; x_i is the three-dimensional coordinate of point i on the ray; ω is the sun direction angle; c_a(x_i) is the reflected-light color c_a output by the MLP; a(ω) is the ambient-light color output by the MLP; t_j is the time-phase data of the image; and s(x_i, ω) is the shadow scalar, taking values between 0 and 1;
step 3.4, substitute T_i, c_i and α_i into formula (5) to obtain the color c(r) of the pixel corresponding to the ray:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (5)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on the ray, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample of ray r.
4. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as claimed in claim 3, wherein: the loss function constructed in the step 2 comprises a solar ray direction correction term, MSE loss and depth supervision loss, wherein the calculation formula of the solar ray direction correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on the ray, s_i is the shadow scalar of the i-th point on the ray, and R_SC is the set of secondary solar-correction rays.
5. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 4, wherein: the calculation formula of the MSE loss is as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being specified empirical values.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
where β_i is the uncertainty prediction of the i-th point of ray r2.
6. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 5, wherein: in order to learn the depth information of the scene, a depth supervision loss is constructed, and the calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
7. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 6, wherein: the calculation formula of the final loss function is as follows:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
wherein lambda is SC And lambda is DS Is the weight of the corresponding loss term.
8. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: after network training is completed, the network weights are saved in a ckpt file; end-to-end three-dimensional reconstruction is performed directly from the input remote sensing image by means of the ckpt file, an imaging tool reconstructs the three-dimensional model of the whole scene from the output c(r) and d(r), and a corresponding DSM or mp4-type visualization result is output.
9. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 7, wherein: λ_SC and λ_DS take the values 0.1/3 and 1000/3, respectively.
10. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: in step 1, an IEEE GRSS data set is selected, wherein the data set comprises the following parts:
(1) WorldView-3 satellite images, full color and eight-band visible light and near infrared, ground sampling distances of 35 cm and 1.3 m respectively;
(2) Three-dimensional data provided by point clouds or digital surface models DSMs generated by an airborne laser radar with a resolution of 80 centimeters;
(3) Sensor RPC parameter, solar ray direction and shooting time phase information data.
CN202310194010.2A 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features Pending CN116402942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194010.2A CN116402942A (en) 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features


Publications (1)

Publication Number Publication Date
CN116402942A true CN116402942A (en) 2023-07-07

Family

ID=87016846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194010.2A Pending CN116402942A (en) 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features

Country Status (1)

Country Link
CN (1) CN116402942A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580161A (en) * 2023-07-13 2023-08-11 湖南省建筑设计院集团股份有限公司 Building three-dimensional model construction method and system based on image and NeRF model
CN116580161B (en) * 2023-07-13 2023-09-22 湖南省建筑设计院集团股份有限公司 Building three-dimensional model construction method and system based on image and NeRF model
CN117765165A (en) * 2023-12-06 2024-03-26 之江实验室 Three-dimensional reconstruction method and device, storage medium and electronic equipment
CN117765172A (en) * 2023-12-12 2024-03-26 之江实验室 Method and device for three-dimensional reconstruction of remote sensing image
CN117765171A (en) * 2023-12-12 2024-03-26 之江实验室 Three-dimensional model reconstruction method and device, storage medium and electronic equipment
CN117765172B (en) * 2023-12-12 2024-05-28 之江实验室 Method and device for three-dimensional reconstruction of remote sensing image
CN117710583A (en) * 2023-12-18 2024-03-15 中铁第四勘察设计院集团有限公司 Space-to-ground image three-dimensional reconstruction method, system and equipment based on nerve radiation field


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination