CN116402942A - Large-scale building three-dimensional reconstruction method integrating multi-scale image features - Google Patents

Large-scale building three-dimensional reconstruction method integrating multi-scale image features

Info

Publication number
CN116402942A
Authority
CN
China
Prior art keywords
ray
scale
light
color
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194010.2A
Other languages
Chinese (zh)
Inventor
杨青林
张展
张觅
周桓
杨炳楠
李大宇
刘青瑀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310194010.2A
Publication of CN116402942A
Pending legal-status Critical Current

Classifications

    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06T 15/06 Ray-tracing
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2210/04 Architectural design, interior design


Abstract

The invention discloses a large-scale building three-dimensional reconstruction method that fuses multi-scale image features, comprising the following steps: remote sensing data selection and preprocessing; construction of a basic network module and a residual network module, which are used to obtain a shadow scalar, an uncertainty prediction value, a reflected-light color and an ambient-light color; inputting the rays corresponding to the highest-level image into the basic network module and the rays corresponding to the other scale images layer by layer into residual network modules, with the rays of each image going to one residual network module, then fusing the results of all residual network modules into a unified output, fusing that unified output with the output of the basic network module to obtain the final output, calculating from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel, and updating the network weights through the constructed loss function; and performing three-dimensional reconstruction of the remote sensing image to be reconstructed with the trained network.

Description

Large-scale building three-dimensional reconstruction method integrating multi-scale image features
Technical Field
The invention belongs to the application of deep learning to three-dimensional reconstruction from high-resolution remote sensing images, relates to a large-scale building three-dimensional reconstruction method that fuses multi-scale image features, and particularly relates to a method for automatically generating three-dimensional building models from multi-view remote sensing images.
Background
Three-dimensional reconstruction is a core technology in fields such as smart cities, autonomous driving and virtual reality (document 1). The conventional three-dimensional reconstruction methods most commonly used at present include oblique photogrammetry and close-range photogrammetry (documents 2-3); they are used for fine modeling of scenes, and their main data sources are equipment such as cameras and unmanned aerial vehicles. High-resolution remote sensing images have the characteristics of wide coverage, rich scene features and multiple acquisition epochs; they allow rapid, large-scale reconstruction of extensive remote sensing scenes, while the rich scene feature information and multi-temporal image data help guarantee the quality of the three-dimensional reconstruction, so they have great research and application value and have attracted wide attention. However, owing to the particularities of the remote sensing camera model and the sparsity of target photography, traditional methods require considerable manual editing to determine control-point coordinates and complete image orientation when high-resolution images are used for three-dimensional reconstruction; the workflow is complex, making it difficult to reconstruct remote sensing scenes quickly, at low cost and at large scale.
With the rapid development of artificial intelligence and related hardware, deep learning techniques represented by the MVS-Net network have achieved great success in the field of three-dimensional reconstruction. Building on binocular stereo-matching theory, MVS-Net proposes a cost volume based on differentiable homography, computes the confidence of the depth prediction by constructing this cost volume, trains with maximum confidence as the objective, achieves end-to-end high-quality generation of scene depth maps, and overcomes the dependence of traditional remote sensing three-dimensional reconstruction on large numbers of manual control points (document 4). However, training the model requires depth maps generated by a three-dimensional data acquisition system as ground truth, and depth maps of large-scale remote sensing scenes are difficult and costly to acquire, so the approach is hard to apply widely.
In recent years, Neural Radiance Fields (NeRF) have fitted the radiance and density fields of three-dimensional scenes by creatively using a multi-layer perceptron (MLP) and constructing a loss function that takes the color of image pixels as ground truth, eliminating the dependence of model training on depth maps, achieving high-quality three-dimensional scene representation in a concise manner and greatly advancing the field of three-dimensional reconstruction (document 5). Three-dimensional reconstruction research represented by Block-NeRF, BungeeNeRF and Mega-NeRF achieves city-level, high-precision, large-scale reconstruction from unmanned aerial vehicle aerial images and street-view images, opening a new technical route for large-scale scene reconstruction from satellite remote sensing images (documents 6-8). However, the camera models used by these methods differ greatly from satellite sensor models, so they are difficult to apply directly to remote sensing three-dimensional reconstruction tasks. In addition, the scene complexity, view sparsity and diverse external influence factors of remote sensing images make it difficult to guarantee reconstruction accuracy. Sat-NeRF successfully applied a NeRF-based network to the three-dimensional reconstruction of remote sensing images by introducing the satellite rational polynomial coefficient (Rational Polynomial Coefficients, RPC) camera model (document 9). However, that method does not capture the details of the remote sensing scene well, and its reconstruction accuracy leaves room for further improvement.
[Document 1] Li Mingyang, Chen Wei, Wang Shanshan, et al. Three-dimensional reconstruction methods for visual deep learning [J/OL]. Computer Science and Exploration: 1-26 [2023-02-11]. http://kns.cnki.net/kcms/detail/11.5602.TP.20221020.1347.002.html.
[Document 2] Sun Hongwei. Three-dimensional digital urban modeling based on oblique photogrammetry [J]. Modern Mapping, 2014, 37(1): 18-21.
[Document 3] Miao Zhicheng, Yang Yongchong, Yu Qing, et al. Application of close-range photogrammetry to the detailed modeling of a single building [J]. Remote Sensing Information, 2021, 36(5): 107-113.
[Document 4] Yao Y, Luo Z, Li S, et al. MVSNet: Depth inference for unstructured multi-view stereo [C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 767-783.
[Document 5] Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis [J]. Communications of the ACM, 2021, 65(1): 99-106.
[Document 6] Tancik M, Casser V, Yan X, et al. Block-NeRF: Scalable large scene neural view synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 8248-8258.
[Document 7] Xiangli Y, Xu L, Pan X, et al. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering [C]// Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXII. Cham: Springer Nature Switzerland, 2022: 106-122.
[Document 8] Turki H, Ramanan D, Satyanarayanan M. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 12922-12931.
[Document 9] Marí R, Facciolo G, Ehret T. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 1311-1321.
Disclosure of Invention
Aiming at the defects of existing remote sensing three-dimensional reconstruction methods, the invention constructs the network with the satellite RPC model, adds shadow detection and transient-object handling (for vehicles, pedestrians and the like) to address the shadows, cluttered backgrounds and complex content of remote sensing images, and thereby reduces to some extent the influence of inconsistent imaging conditions on model accuracy. Finally, residual network modules are progressively and dynamically added during training to capture both the local details and the global features of the remote sensing image, improving the accuracy of the result.
The technical scheme adopted by the invention is as follows: a large-scale building three-dimensional reconstruction method integrating multi-scale image features comprises the following steps:
step 1, remote sensing data are selected and preprocessed, and a data set is divided into training and testing data sets according to a certain proportion;
acquiring RGB data, RPC model parameters, sun direction parameters and time-phase data of the remote sensing images from the data set; obtaining, through preprocessing, the rays emitted from the satellite sensor based on the RPC model, each ray being a vector recording the origin coordinates from which the ray is emitted and its direction; and finally organizing the RGB data, the sun direction parameters and the ray vectors into dictionary-type data for later use;
step 2, constructing a basic network module and a residual network module, the main body of each being a multi-layer perceptron (MLP) comprising an input layer, several hidden layers and an output layer; each layer of the MLP has width h, the input of the input layer is the coordinates of any point on a ray together with the ray direction, and the output of the first half of the MLP is the predicted pixel value and the voxel density;
after the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce a shadow scalar s, an uncertainty prediction value β, a reflected-light color c_a and an ambient-light color a;
step 3, up-sampling the remote sensing images of the training data set with several convolution layers to obtain several images of different scales, building an image pyramid from the multi-scale images, and constructing rays for the images in the pyramid; first, the rays corresponding to the highest-level image are input to the basic network module to learn the relatively abstract, macro-scale overall scene information and obtain an output result; the rays corresponding to the other scale images are input layer by layer into residual network modules, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images; the results of all residual network modules are fused into one unified output, which is finally fused with the output of the basic network module to obtain the final output; from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel are calculated, and the network weights are updated through the constructed loss function;
and step 4, inputting the remote sensing images in the test data set into the trained network in the step 3, and realizing three-dimensional reconstruction of the remote sensing images to be reconstructed.
Further, the processing procedure of the first half of the MLP in step 2 is as follows:
(RGB,σ)=F(x,d) (1)
where RGB represents the pixel values predicted by MLP, σ is the voxel density, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
Further, the specific implementation manner of calculating the depth of the corresponding point of the target object and the color of the corresponding pixel in the step 3 is as follows;
step 3.1, first substitute σ_i into formula (2) to compute the opacity α_i and the accumulated transmittance T_i of each spatial point:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (2)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points;
step 3.2, substitute α_i, T_i and t_i into formula (3) to obtain the depth d(r) of the corresponding point on the target object:
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
step 3.3, substitute c_a, a and s_i into formula (4) to obtain the predicted color c_i of the corresponding ray sampling point:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (4)
where c(x_i, ω, t_j) is the color rendered at point i of each ray, i.e. c_i; x_i is the three-dimensional coordinate of point i on the ray; ω is the sun direction angle; c_a(x_i) is the reflected-light color c_a output by the MLP; a(ω) is the ambient-light color output by the MLP; t_j is the time-phase data of the image; and s(x_i, ω) is the shadow scalar, taking values between 0 and 1;
step 3.4, substitute T_i, c_i and α_i into formula (5) to obtain the color c(r) of the pixel corresponding to the ray:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (5)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on the ray, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample of ray r.
Further, the loss function constructed in the step 2 includes a solar ray direction correction term, an MSE loss, and a depth supervision loss, wherein a calculation formula of the solar ray direction correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on the ray, s_i is the shadow scalar of the i-th point on the ray, and R_SC is the set of secondary solar-correction rays.
Further, the calculation formula of the MSE loss is as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being specified empirical values.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
where β_i is the uncertainty prediction of the i-th point of ray r2;
further, in order to learn depth information of a scene, a depth supervision loss is constructed, and a calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
Further, the calculation formula of the final loss function is as follows:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
wherein lambda is SC And lambda is DS Is the weight of the corresponding loss term.
Further, after network training is completed, the network weights are saved in a ckpt file; end-to-end three-dimensional reconstruction is performed directly from the input remote sensing image by means of the ckpt file, an imaging tool reconstructs the three-dimensional model of the whole scene from the output c(r) and d(r), and a corresponding DSM or mp4-type visualization result is output.
Further, each layer of the MLP has a width of 1024.
Further, λ_SC and λ_DS take the values 0.1/3 and 1000/3, respectively.
Further, in step 1, an IEEE GRSS data set is selected, where the data set includes the following parts:
(1) WorldView-3 satellite images, full color and eight-band visible light and near infrared, ground sampling distances of 35 cm and 1.3 m respectively;
(2) Three-dimensional data provided by point clouds or digital surface models DSMs generated by an airborne laser radar with a resolution of 80 centimeters;
(3) Sensor RPC parameter, solar ray direction and shooting time phase information data.
The invention fully utilizes the multi-level characteristics of the high-resolution remote sensing image to realize the efficient automatic three-dimensional reconstruction of the large-scale building. The advantages are as follows:
(1) The RPC model is introduced into the neural radiance field so that camera projection rays can be constructed and remote sensing images can be applied to NeRF; at the same time, the shadow occlusion and transient objects present in remote sensing images are handled accordingly, making the final result finer.
(2) Multi-level features of the image are extracted by convolution, making full use of both the global and the local information of the image and improving the final result.
(3) The whole network model is trained progressively, combining the basic network module and the residual network modules; the outputs of the two are finally fused to obtain the final result, improving reconstruction accuracy.
Drawings
FIG. 1 is a schematic diagram of a neural network model for large-scale three-dimensional reconstruction according to an embodiment of the present invention;
fig. 2 is an overall flow chart of an embodiment of the present invention.
Detailed Description
In order to facilitate an understanding and practice of the invention by those of ordinary skill in the art, the invention will be further described with reference to the drawings and specific examples, it being understood that the examples described herein are for illustration and description only and are not intended to be limiting of the invention.
Referring to fig. 2, the large-scale building three-dimensional reconstruction method fusing multi-scale image features provided by the invention proceeds as follows: the NeRF network is constructed with the satellite RPC model, and a shadow detection module and a transient-object (such as vehicles and pedestrians) detection module are added to address the shadows, cluttered backgrounds and complex content of remote sensing images, thereby eliminating to some extent the influence of inconsistent imaging conditions on model accuracy. Residual blocks are then progressively and dynamically added during training to capture both the local details and the global features of the remote sensing image, improving result accuracy; the overall model structure is shown in fig. 1. Finally, end-to-end three-dimensional reconstruction of large-scale buildings is achieved with the trained network.
Based on this model structure, the embodiment of the invention provides an end-to-end, progressive neural-radiance-field method for large-scale three-dimensional reconstruction from remote sensing images. The specific implementation steps are as follows:
step 1, remote sensing data selection and preprocessing
In this embodiment, the 2019 IEEE GRSS Data Fusion Contest data set is selected to compare and test various building reconstruction methods. The data set contains a number of complex urban scenes with different building densities, spatial extents and surrounding environments, and can well verify the extraction accuracy and reliability of different building three-dimensional reconstruction methods. The data set contains the following:
(1) WorldView-3 satellite images (supplied by Maxar), full-color and eight-band visible and near-infrared, ground sample distances of 35 cm and 1.3 meters, respectively.
(2) The three-dimensional data provided by the point cloud or Digital Surface Models (DSMs) generated by the airborne lidar is 80 cm in resolution.
(3) Metadata such as sensor RPC parameters, solar ray direction, shooting time phase information, etc.
This example selects from the data set 26 Maxar WorldView-3 images collected over Jacksonville, Florida between 2014 and 2016. From these data, a set of RGB image crops of different sizes, approximately 800 × 800 pixels with a minimum resolution of 0.3 m/pixel, is taken as input, each AOI covering an area of 256 × 256 m. The training and test sets are split in a ratio of 8:2, and the RPC camera model of the satellite images is used directly for ray casting; each RPC is defined by a projection function (projecting 3D points onto image pixels) and its inverse, the localization function. The minimum and maximum heights of the scene are denoted h_min and h_max, respectively. A ray passing through the scene and intersecting pixel p of the j-th image is modeled as the straight line between an initial and a final 3D point, x_start and x_end. Using the RPC localization function L_j of the j-th image, pixel p is localized at h_max and h_min to obtain the boundary points:
x_start = L_j(p, h_max)_ECEF;  x_end = L_j(p, h_min)_ECEF
where the subscript ECEF indicates that the 3D points returned by the localization function L_j are converted into the Earth-centered, Earth-fixed coordinate system (geocentric system) so as to operate in a Cartesian reference frame.
Given x_start and x_end, the origin o and direction vector d of the ray r(t) = o + t·d intersecting pixel p are obtained. The height bounds [h_min, h_max] can be chosen in various ways, for example from a coarse elevation model extracted from low-resolution data. For the j-th image, the ray direction is expressed as:
d = (x_end − x_start) / ‖x_end − x_start‖₂
The maximum-height point x_start, nearest to the camera, is taken as the origin o of the ray. The bounds of ray r(t) = o + t·d, i.e. [t_min, t_max], are set to t_min = 0 and t_max = ‖x_end − x_start‖₂. The ECEF coordinates cannot be used directly in practice because their values are large, so the invention normalizes all ray points into the interval [−1, 1] using a procedure similar to the subtract-offset-and-scale step of the RPC functions. The set of 3D points obtained by bounding all pixels of the input images between h_min and h_max is used to compute the offset and scale in each spatial dimension. Finally, the constructed rays, the RGB image information and the other metadata are organized into a dictionary for convenient use.
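The ray construction described above can be sketched as follows. This is a minimal NumPy illustration that assumes the boundary points x_start and x_end have already been obtained from the RPC localization function (which is not reproduced here); the function names and the toy coordinates are illustrative, not part of the patent.

```python
import numpy as np

def build_rays(x_start, x_end):
    """Build rays from per-pixel ECEF boundary points.

    x_start, x_end: (N, 3) arrays of the 3D points obtained by localizing each
    pixel at h_max and h_min with the image's RPC localization function.
    Returns ray origins o, unit directions d and the sampling bounds [t_min, t_max]."""
    diff = x_end - x_start
    t_max = np.linalg.norm(diff, axis=1, keepdims=True)  # ||x_end - x_start||_2
    d = diff / t_max                                     # unit direction of each ray
    o = x_start                                          # origin = highest point, nearest the camera
    t_min = np.zeros_like(t_max)
    return o, d, t_min, t_max

def normalize_points(points):
    """Map coordinates into [-1, 1] per axis with an offset and a scale,
    mimicking the subtract-offset-and-scale step of the RPC functions."""
    offset = (points.max(axis=0) + points.min(axis=0)) / 2.0
    scale = (points.max(axis=0) - points.min(axis=0)) / 2.0
    return (points - offset) / scale, offset, scale

# Toy usage with two rays (coordinates are illustrative, not real ECEF values).
x_start = np.array([[0.0, 0.0, 100.0], [10.0, 0.0, 100.0]])
x_end = np.array([[2.0, 0.0, 0.0], [10.0, 5.0, 0.0]])
o, d, t_min, t_max = build_rays(x_start, x_end)
pts_norm, offset, scale = normalize_points(np.vstack([x_start, x_end]))
```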
Step 2, constructing a basic network module and a residual network module
As shown in fig. 1, the basic network module and the residual network module are identical in structure, with an MLP as the main body. The core of NeRF is the construction and rendering of camera rays, which is computationally expensive; choosing an MLP both simplifies the network structure, reducing computation, and allows gradient updates to be completed faster.
Because the ground features in remote sensing images are complex and many objects, such as pedestrians and vehicles, exist only temporarily, learning from data of different time phases yields larger errors, and the buildings in the images often have parts covered by shadow. Processing operations for shadows and transient objects therefore need to be added to the model to eliminate these effects.
Therefore, when constructing the basic network module and the residual network module, the two modules share the same structure (the block in fig. 1): the main body is a multi-layer perceptron (MLP) in which each layer has width h (1024 by default). The first half of the MLP (the part before σ is output in the block of fig. 1) fits the NeRF representation of the whole scene (i.e. a function in MLP form). Its input is the coordinates of any point on a ray (since the ray's starting point and direction are known, the coordinates of any point on it can be computed) together with the ray direction, and the function can be expressed as
(RGB,σ)=F(x,d) (1)
Where RGB represents the pixel values predicted by the MLP, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
After the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce the shadow scalar s, the uncertainty prediction value β, the reflected-light color c_a and the ambient-light color a (which depends only on the solar ray angle). In other words, adding these extra layers yields MLP-form functions from the inputs to the corresponding outputs. Note that, as shown in the block of fig. 1, the MLP that outputs β requires the time feature t_j as an additional input, the output s requires the solar direction angle ω as an additional input, and the output a is simply a single-layer MLP mapping from ω;
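The block structure described above can be illustrated with a small PyTorch sketch. The exact layer counts, skip connections and positional encodings are not fully specified in the text, so the sizes below, the three-component encoding of the sun direction ω and the wiring of the extra heads are assumptions; only the described inputs and outputs (RGB, σ, s, β, c_a, a) are taken from the patent.

```python
import torch
import torch.nn as nn

class NeRFBlock(nn.Module):
    """One base/residual block: an MLP trunk predicting (RGB, sigma) from a point x
    and ray direction d, plus extra heads for the shadow scalar s, uncertainty beta,
    reflected colour c_a and ambient colour a. Depths and encodings are simplified."""

    def __init__(self, h=1024, d_time=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, h), nn.ReLU(),
                                   nn.Linear(h, h), nn.ReLU())
        self.sigma_head = nn.Sequential(nn.Linear(h, 1), nn.Softplus())
        self.rgb_head = nn.Sequential(nn.Linear(h + 3, h // 2), nn.ReLU(),
                                      nn.Linear(h // 2, 3), nn.Sigmoid())
        self.extra = nn.Sequential(nn.Linear(h, h), nn.ReLU())        # extra hidden layer of width h
        self.ca_head = nn.Sequential(nn.Linear(h, h // 2), nn.ReLU(),
                                     nn.Linear(h // 2, 3), nn.Sigmoid())
        self.s_head = nn.Sequential(nn.Linear(h + 3, h // 2), nn.ReLU(),
                                    nn.Linear(h // 2, 1), nn.Sigmoid())
        self.beta_head = nn.Sequential(nn.Linear(h + d_time, h // 2), nn.ReLU(),
                                       nn.Linear(h // 2, 1), nn.Softplus())
        self.a_head = nn.Sequential(nn.Linear(3, 3), nn.Sigmoid())    # ambient colour from sun direction only

    def forward(self, x, d, omega, t_embed):
        feat = self.trunk(x)
        sigma = self.sigma_head(feat)                                 # voxel density
        rgb = self.rgb_head(torch.cat([feat, d], dim=-1))             # predicted pixel value
        f2 = self.extra(feat)
        c_a = self.ca_head(f2)                                        # reflected-light colour
        s = self.s_head(torch.cat([f2, omega], dim=-1))               # shadow scalar, needs omega
        beta = self.beta_head(torch.cat([f2, t_embed], dim=-1))       # uncertainty, needs time feature t_j
        a = self.a_head(omega)                                        # ambient-light colour a(omega)
        return dict(sigma=sigma, rgb=rgb, c_a=c_a, s=s, beta=beta, a=a)

# Toy usage with a reduced width.
block = NeRFBlock(h=256)
out = block(torch.rand(8, 3), torch.rand(8, 3), torch.rand(8, 3), torch.rand(8, 4))
```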
step 3, progressively training the network
Several convolution layers are used to extract features from the input remote sensing image, yielding several images of different scales: the original input image is up-sampled several times to build an image pyramid, rays are constructed for the images in the pyramid, and the ray directions of the higher-level images are obtained by scaling the ray directions constructed for the original image.
First, the rays constructed from the highest-level image are input to the basic network module to learn the abstract, macro-scale overall scene information; following the basic-module structure, the MLP fits the spatial distribution of the scene and outputs the overall-scene voxel density σ_base, color rendering value c_base, shadow scalar s_base, opacity α_base and uncertainty prediction value β_base.
Then the rays constructed from each level of the pyramid are input layer by layer into residual network modules, down to the original image, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images. Every residual network module outputs a voxel density, a color rendering value, a shadow scalar, an opacity and an uncertainty prediction value, and the results of all residual network modules are fused (this method uses averaging) into one unified output: voxel density σ_res, color rendering value c_res, shadow scalar s_res, opacity α_res and uncertainty prediction value β_res. Finally, the residual output (voxel density σ_res, reflected-light color c_a^res, shadow scalar s_res, ambient-light color a_res, uncertainty prediction value β_res) is fused with the overall-scene output of the basic network module (voxel density σ_base, reflected-light color c_a^base, shadow scalar s_base, ambient-light color a_base, uncertainty prediction value β_base) to obtain the final outputs: voxel density σ_i, reflected-light color c_a, shadow scalar s_i, ambient-light color a_i and uncertainty prediction value β_i.
These outputs are then substituted into the following formulas. σ_i is substituted into formula (4) to compute the opacity α_i and transmittance T_i of each spatial point, and α_i, T_i and t_i are substituted into formula (3) to obtain the depth d(r) of the corresponding point on the target object. c_a, a_i and s_i are substituted into formula (5) to obtain the predicted color c_i of the corresponding ray sample, and T_i, c_i and α_i are substituted into formula (2) to obtain the color c(r) of the pixel corresponding to the ray. The network weights are then updated through the constructed loss function: β_i, T_i and α_i are substituted into formula (8) to obtain β(r), the predicted likelihood that the pixel corresponding to the ray belongs to a transient object; c(r), β(r) and the color c_GT(r) of the corresponding pixel of the input image at the corresponding level are substituted into formula (7) to build the loss term L_RGB; T_i, α_i and s_i are substituted into formula (6) to build the loss term L_SC; d(r) is substituted into formula (9) to build the loss term L_DS; and finally L_RGB, L_SC and L_DS are substituted into formula (10) to obtain the final loss function.
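A minimal sketch of the fusion step follows. Averaging the residual modules' outputs is stated in the text; the weighted average used to fuse the unified residual output with the basic module's output is an assumption, since the patent does not spell out that operator.

```python
import torch

def average_outputs(outputs):
    """Merge the per-module output dictionaries of all residual modules into one
    unified output by averaging each quantity key by key."""
    return {k: torch.mean(torch.stack([o[k] for o in outputs], dim=0), dim=0)
            for k in outputs[0]}

def fuse_base_and_residual(base_out, residual_outs, w_base=0.5):
    """Fuse the unified residual output with the basic-module output.
    A simple weighted average is assumed here."""
    res_out = average_outputs(residual_outs)
    return {k: w_base * base_out[k] + (1.0 - w_base) * res_out[k] for k in base_out}

# Toy usage: one base output and two residual outputs for 8 samples.
base = {"sigma": torch.rand(8, 1), "c_a": torch.rand(8, 3)}
residuals = [{"sigma": torch.rand(8, 1), "c_a": torch.rand(8, 3)} for _ in range(2)]
fused = fuse_base_and_residual(base, residuals)
```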
The basic formula for rendering a ray is as follows:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (2)
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on ray r, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample on ray r, defined as follows:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (4)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points.
The color c_i rendered at point i of each ray is computed with the shadow-aware irradiance model proposed in S-NeRF, as follows:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (5)
wherein c (x) i ,ω,t j ) Substituting c in the rendering formula i Wherein x is i Representing the three-dimensional coordinates of point i on the light, ω being the solar direction angle, c a (x i ) Refers to the reflected light color of the MLP output, i.e., c in FIG. 1 a A (omega) refers to the ambient light color output by the MLP, t j Phase data of the image; s (x) i ω) is a shadow scalar, takes a value between 0 and 1, and determines the shadow region by the albedo of the scene. Ideally, at those 3D points which are directly illuminated by the sun, s.apprxeq.1, the color is entirely determined by the reflectance c a (x) To explain.
The loss function comprises a solar ray direction correction term, MSE loss and depth supervision loss, and the specific construction mode is as follows:
in practice the direction of the solar rays ω is closely related to the acquisition date (in particular the satellite passing at the same time of day). Thus, due to the mixture of phenomena, the finally captured ambient irradiance of a (ω) is not only related to ω but also to the conditions of a particular date, such as weather or seasonal variations. The model can not correct the distortion caused by the change of different sun ray directions from training data, and the invention solves the problem by adding a sun ray direction correction term in a loss function, wherein the calculation formula of the correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on ray r1, s_i is the shadow scalar of the i-th point on ray r1, and R_SC is the set of secondary solar-correction rays, which follow the solar ray direction ω while the other, primary rays follow the viewing direction of the camera. The learnable geometry used by the solar-ray correction term is encoded by the transmittance T_i and the opacity α_i, which further supervises the learning of the shadow-aware scalar s(x, ω). The first part of the formula means that, for each ray r1 of R_SC, the s_i predicted at the i-th point should be similar to T_i; the second part requires the integral of s along r1 to be as close to 1 as possible, since non-occluded, non-shadowed areas must be explained mainly by the albedo in the shadow-aware irradiance model. This corresponds to the "add shadow processing" operation in fig. 2.
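A minimal sketch of this solar-correction term, following the description of formula (6); the per-ray summation and the absence of any extra normalization are assumptions.

```python
import torch

def solar_correction_loss(trans, alpha, s):
    """Solar-direction correction term: along each secondary solar ray the shadow
    scalar s_i should resemble the transmittance T_i, and the accumulated
    alpha_i * s_i should approach 1.  trans, alpha, s: (R_sc, N) tensors."""
    term1 = torch.sum((trans - s) ** 2, dim=-1)   # s_i should match T_i
    term2 = 1.0 - torch.sum(alpha * s, dim=-1)    # integral of s along the ray should reach 1
    return torch.sum(term1 + term2)

# Toy usage with 4 solar-correction rays of 16 samples.
loss_sc = solar_correction_loss(torch.rand(4, 16), torch.rand(4, 16), torch.rand(4, 16))
```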
For the MSE loss, the task-uncertainty learning approach of W-NeRF is adopted to improve the robustness of the model: in this embodiment the uncertainty prediction β weights the contribution of each ray to the MSE between the rendered and known colors, as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being manually specified empirical values; β_min = 0.05 and η = 3 are taken so that the logarithmic term does not become negative. The logarithm in L_RGB prevents β from diverging to infinity. In this way, the model trades off the uncertainty coefficient β against the color difference to reach the final value.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
wherein beta is i Is the uncertainty predictor of the ith point of ray r. The above operation is implemented as the add transient object handling operation in fig. 2.
Meanwhile, in order to learn the depth information of the scene, a depth supervision loss function can be constructed, and the calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
And finally, constructing a final loss function by weighted addition of the loss functions:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
where λ_SC and λ_DS are the weights of the corresponding loss terms, taken as 0.1/3 and 1000/3, respectively.
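A sketch of the depth-supervision term and the final weighted loss, in the spirit of formulas (9) and (10); the squared depth penalty and the reduction over rays are assumptions, while the weights 0.1/3 and 1000/3 are those quoted above.

```python
import torch

def depth_supervision_loss(d_pred, x_kp, o, w):
    """Depth-supervision term: the rendered depth of a ray built from a SIFT keypoint
    should match the distance from the ray origin o(r) to the keypoint coordinate X(r),
    weighted by w(r).  d_pred, w: (R,); x_kp, o: (R, 3)."""
    d_ref = torch.linalg.norm(x_kp - o, dim=-1)
    return torch.sum(w * (d_pred - d_ref) ** 2)

def total_loss(l_rgb, l_sc, l_ds, lam_sc=0.1 / 3, lam_ds=1000.0 / 3):
    """Weighted combination of Eq. (10) with the weights quoted in the text."""
    return l_rgb + lam_sc * l_sc + lam_ds * l_ds

# Toy usage.
w = torch.full((4,), 0.25)
l_ds = depth_supervision_loss(torch.rand(4), torch.rand(4, 3), torch.zeros(4, 3), w)
loss = total_loss(torch.tensor(1.0), torch.tensor(0.2), l_ds)
```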
And 4, realizing three-dimensional reconstruction of the building by utilizing the trained network in the step 3.
After network training is finished, the network weights are saved in a ckpt file; with this file, end-to-end three-dimensional reconstruction can be carried out directly from the input remote sensing image and the final result output.
In specific implementation, the invention can adopt computer software technology to realize automatic operation flow, and the device for operating the flow of the invention should be within the protection scope.
It should be understood that the foregoing description of the preferred embodiments is for illustration only and is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (10)

1. A large-scale building three-dimensional reconstruction method integrating multi-scale image features is characterized by comprising the following steps:
step 1, remote sensing data are selected and preprocessed, and a data set is divided into training and testing data sets according to a certain proportion;
acquiring RGB data, RPC model parameters, sun direction parameters and time-phase data of the remote sensing images from the data set; obtaining, through preprocessing, the rays emitted from the satellite sensor based on the RPC model, each ray being a vector recording the origin coordinates from which the ray is emitted and its direction; and finally organizing the RGB data, the sun direction parameters and the ray vectors into dictionary-type data for later use;
step 2, constructing a basic network module and a residual network module, the main body of each being a multi-layer perceptron (MLP) comprising an input layer, several hidden layers and an output layer; each layer of the MLP has width h, the input of the input layer is the coordinates of any point on a ray together with the ray direction, and the output of the first half of the MLP is the predicted pixel value and the voxel density;
after the voxel density σ is output, an additional hidden layer of width h is appended, taking the RGB output of the first half of the MLP as input; a further hidden layer of width h/2 and the output layer then produce a shadow scalar s, an uncertainty prediction value β, a reflected-light color c_a and an ambient-light color a;
step 3, up-sampling the remote sensing images of the training data set with several convolution layers to obtain several images of different scales, building an image pyramid from the multi-scale images, and constructing rays for the images in the pyramid; first, the rays corresponding to the highest-level image are input to the basic network module to learn the relatively abstract, macro-scale overall scene information and obtain an output result; the rays corresponding to the other scale images are input layer by layer into residual network modules, the rays of each image going to one residual network module, to learn the more specific and richer scene detail of the lower-level images; the results of all residual network modules are fused into one unified output, which is finally fused with the output of the basic network module to obtain the final output; from the final output the depth of the corresponding point on the target object and the color of the corresponding pixel are calculated, and the network weights are updated through the constructed loss function;
and step 4, inputting the remote sensing images in the test data set into the trained network in the step 3, and realizing three-dimensional reconstruction of the remote sensing images to be reconstructed.
2. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: the processing of the first half of the MLP in step 2 is shown below:
(RGB,σ)=F(x,d) (1)
where RGB represents the pixel values predicted by MLP, σ is the voxel density, x represents the three-dimensional coordinates of the point on the ray, and d is the direction of the ray.
3. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: the specific implementation manner of calculating the depth of the corresponding point of the target object and the color of the corresponding pixel in the step 3 is as follows;
step 3.1, first substitute σ_i into formula (2) to compute the opacity α_i and the accumulated transmittance T_i of each spatial point:
α_i = 1 − exp(−σ_i·δ_i),   T_i = ∏_{j=1}^{i−1} (1 − α_j)   (2)
where σ_i is the target-object voxel density obtained by inputting the coordinates of the i-th point and the ray direction into the first half of the MLP, and δ_i = t_{i+1} − t_i is the distance between two adjacent sampling points;
step 3.2, substitute α_i, T_i and t_i into formula (3) to obtain the depth d(r) of the corresponding point on the target object:
d(r) = Σ_{i=1}^{N} T_i·α_i·t_i   (3)
step 3.3, substitute c_a, a and s_i into formula (4) to obtain the predicted color c_i of the corresponding ray sampling point:
c(x_i, ω, t_j) = c_a(x_i)·(s(x_i, ω) + (1 − s(x_i, ω))·a(ω))   (4)
where c(x_i, ω, t_j) is the color rendered at point i of each ray, i.e. c_i; x_i is the three-dimensional coordinate of point i on the ray; ω is the sun direction angle; c_a(x_i) is the reflected-light color c_a output by the MLP; a(ω) is the ambient-light color output by the MLP; t_j is the time-phase data of the image; and s(x_i, ω) is the shadow scalar, taking values between 0 and 1;
step 3.4, substitute T_i, c_i and α_i into formula (5) to obtain the color c(r) of the pixel corresponding to the ray:
c(r) = Σ_{i=1}^{N} T_i·α_i·c_i   (5)
where c(r) is the color rendered for ray r, d(r) is the depth of the corresponding point on the target object along the ray, N is the number of points sampled on the ray, and t_i, c_i, α_i and T_i are respectively the distance to the camera, the predicted color, the opacity and the accumulated transmittance of the i-th sample of ray r.
4. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as claimed in claim 3, wherein: the loss function constructed in the step 2 comprises a solar ray direction correction term, MSE loss and depth supervision loss, wherein the calculation formula of the solar ray direction correction term is as follows:
L_SC(R_SC) = Σ_{r1∈R_SC} [ Σ_{i=1}^{N_SC} (T_i − s_i)² + 1 − Σ_{i=1}^{N_SC} α_i·s_i ]   (6)
where N_SC is the total number of sampling points on the ray, s_i is the shadow scalar of the i-th point on the ray, and R_SC is the set of secondary solar-correction rays.
5. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 4, wherein: the calculation formula of the MSE loss is as follows:
L_RGB(R) = Σ_{r2∈R} [ ‖c(r2) − c_GT(r2)‖² / (2·β'(r2)²) + (log β'(r2) + η) / 2 ]   (7)
where R is the set of rays input in the current training iteration, c_GT(r2) is the true color of the pixel corresponding to ray r2 in the input image, and β'(r2) = β(r2) + β_min, with β_min and η being specified empirical values.
The β (r 2) associated with ray r2 is obtained by integrating the uncertainty predictions for the N samples of r2, as follows:
β(r2) = Σ_{i=1}^{N} T_i·α_i·β_i   (8)
where β_i is the uncertainty prediction of the i-th point of ray r2.
6. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 5, wherein: in order to learn the depth information of the scene, a depth supervision loss is constructed, and the calculation formula is as follows:
L_DS(R_DS) = Σ_{r3∈R_DS} ω(r3)·(d(r3) − ‖X(r3) − o(r3)‖)²   (9)
where R_DS is the set of rays constructed from the key points extracted from the remote sensing images by the SIFT algorithm, X(r3) is the three-dimensional coordinate of the key point sampled on the ray, o(r3) is the ray origin, and ω(r3) is the contribution weight of X(r3) to the depth supervision information, i.e. the ratio of the absolute difference between each selected X(r3) and o(r3) to the sum of the absolute differences over all rays in the current training iteration.
7. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 6, wherein: the calculation formula of the final loss function is as follows:
L=L RGB (R)+λ SC L SC (R SC )+λ DS L DS (R DS ) (10)
wherein lambda is SC And lambda is DS Is the weight of the corresponding loss term.
8. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: after network training is completed, the network weights are saved in a ckpt file; end-to-end three-dimensional reconstruction is performed directly from the input remote sensing image by means of the ckpt file, an imaging tool reconstructs the three-dimensional model of the whole scene from the output c(r) and d(r), and a corresponding DSM or mp4-type visualization result is output.
9. The method for three-dimensional reconstruction of a large-scale building fused with multi-scale image features according to claim 7, wherein: λ_SC and λ_DS take the values 0.1/3 and 1000/3, respectively.
10. A method for three-dimensional reconstruction of a large-scale building incorporating multi-scale image features as defined in claim 1, wherein: in step 1, an IEEE GRSS data set is selected, wherein the data set comprises the following parts:
(1) WorldView-3 satellite images, full color and eight-band visible light and near infrared, ground sampling distances of 35 cm and 1.3 m respectively;
(2) Three-dimensional data provided by point clouds or digital surface models DSMs generated by an airborne laser radar with a resolution of 80 centimeters;
(3) Sensor RPC parameter, solar ray direction and shooting time phase information data.
CN202310194010.2A 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features Pending CN116402942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194010.2A CN116402942A (en) 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features


Publications (1)

Publication Number Publication Date
CN116402942A true CN116402942A (en) 2023-07-07

Family

ID=87016846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194010.2A Pending CN116402942A (en) 2023-03-02 2023-03-02 Large-scale building three-dimensional reconstruction method integrating multi-scale image features

Country Status (1)

Country Link
CN (1) CN116402942A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580161A (en) * 2023-07-13 2023-08-11 湖南省建筑设计院集团股份有限公司 Building three-dimensional model construction method and system based on image and NeRF model
CN116580161B (en) * 2023-07-13 2023-09-22 湖南省建筑设计院集团股份有限公司 Building three-dimensional model construction method and system based on image and NeRF model
CN117765165A (en) * 2023-12-06 2024-03-26 之江实验室 Three-dimensional reconstruction method and device, storage medium and electronic equipment
CN117765172A (en) * 2023-12-12 2024-03-26 之江实验室 Method and device for three-dimensional reconstruction of remote sensing image
CN117765171A (en) * 2023-12-12 2024-03-26 之江实验室 Three-dimensional model reconstruction method and device, storage medium and electronic equipment
CN117765172B (en) * 2023-12-12 2024-05-28 之江实验室 Method and device for three-dimensional reconstruction of remote sensing image
CN117710583A (en) * 2023-12-18 2024-03-15 中铁第四勘察设计院集团有限公司 Space-to-ground image three-dimensional reconstruction method, system and equipment based on nerve radiation field


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination