CN116958449B - Urban scene three-dimensional modeling method and device and electronic equipment - Google Patents

Urban scene three-dimensional modeling method and device and electronic equipment

Info

Publication number
CN116958449B
CN116958449B (application CN202311174479.6A; published as CN116958449A)
Authority
CN
China
Prior art keywords
image sequence
original
camera parameters
original image
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311174479.6A
Other languages
Chinese (zh)
Other versions
CN116958449A (en)
Inventor
林晓博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202311174479.6A priority Critical patent/CN116958449B/en
Publication of CN116958449A publication Critical patent/CN116958449A/en
Application granted granted Critical
Publication of CN116958449B publication Critical patent/CN116958449B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Remote Sensing (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an urban scene three-dimensional modeling method and device and an electronic device, wherein the method comprises the following steps: acquiring an original image sequence obtained by aerial photography of an urban scene by an unmanned aerial vehicle, together with the corresponding target camera parameters, the target camera parameters being obtained by performing parameter interpolation on the original camera parameters corresponding to the original image sequence; inputting the target camera parameters into a pre-trained neural radiance field model and, through preset processing, outputting an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence; determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence; and performing three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain the three-dimensional model of the urban scene corresponding to the target camera parameters. The method can reduce labor cost and time cost while ensuring accurate modeling of urban scenes.

Description

Urban scene three-dimensional modeling method and device and electronic equipment
Technical Field
The present application relates to the field of three-dimensional modeling technologies, and in particular, to a method and an apparatus for three-dimensional modeling of an urban scene, and an electronic device.
Background
Currently, mainstream three-dimensional modeling methods for large UAV-captured urban scenes generally use SfM, MVS, or SLAM to acquire three-dimensional scene information and combine it with voxel representations to construct a three-dimensional mesh model of the scene, which can produce a satisfactory, high-quality large urban scene model. However, existing professional aerial photography and large-scale scene reconstruction work generally requires collecting images of a large-area scene along planned flight lines. Because large-area aerial survey involves a large data volume, high accuracy requirements, and a long field period, this approach still depends on professional flight-planning support, so labor cost and time cost are high.
Disclosure of Invention
The application aims to provide an urban scene three-dimensional modeling method and device and an electronic device that can automatically optimize an original image sequence obtained by UAV aerial photography into a dense depth image sequence and then perform three-dimensional modeling based on the optimized dense depth image sequence. This reduces the requirements on original image acquisition while ensuring three-dimensional modeling quality: a long image-acquisition campaign by professionals is no longer needed, so the labor cost and time cost of accurate three-dimensional modeling are reduced.
In a first aspect, the present application provides an urban scene three-dimensional modeling method, the method comprising: acquiring an original image sequence obtained by aerial photography of an urban scene by an unmanned aerial vehicle and target camera parameters corresponding to the original image sequence, wherein the target camera parameters are obtained by performing parameter interpolation on the original camera parameters corresponding to the original image sequence; inputting the target camera parameters into a pre-trained neural radiance field model and, through preset processing, outputting an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence, wherein the neural radiance field model is obtained by implicit modeling based on the original image sequence and the original camera parameters; determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence; and performing three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain the three-dimensional model of the urban scene corresponding to the target camera parameters.
Further, the step of obtaining the target camera parameters corresponding to the original image sequence includes: acquiring an original track point sequence corresponding to the original image sequence; determining the track points to be inserted corresponding to each original track point in the original track point sequence; and, within the original camera parameters corresponding to the original image sequence, inserting camera parameters for the corresponding images according to the track points to be inserted to obtain the target camera parameters.
Further, the step of determining the track points to be inserted corresponding to each original track point according to the track points recorded during UAV aerial photography includes: for each original track point, determining, from the image corresponding to that original track point, the images corresponding to all track points within the corresponding camera shooting window; calculating the similarity between the image corresponding to the original track point and the image corresponding to each of these track points; in order of decreasing similarity, detecting a target track point whose corresponding image's two-dimensional coordinate position changes by more than a threshold value in the direction normal to the track; and selecting one track point every preset length along the line connecting the original track point and the target track point as a track point to be inserted for the original track point.
Further, the step of inputting the target camera parameters into the pre-trained neural radiance field model and outputting, through preset processing, the optimized image sequence and the coarse-granularity depth image sequence corresponding to the original image sequence includes: inputting the target camera parameters into the pre-trained neural radiance field model and outputting, through neural rendering, the optimized image sequence corresponding to the original image sequence; and performing guided ray sampling on the optimized image sequence, refining the voxel octree, and synthesizing the coarse-granularity depth image sequence by applying a depth rendering formula.
Further, the step of determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence includes: determining a fine-granularity depth image sequence according to the original image sequence and the optimized image sequence; and completing the fine-granularity depth image sequence with the coarse-granularity depth image sequence to obtain the dense depth image sequence.
Further, the step of determining the fine-granularity depth image sequence according to the original image sequence and the optimized image sequence includes: for each original image in the original image sequence, setting a rectangular window centered on a specified pixel of the original image; for each target pixel in the rectangular window, sequentially matching corresponding pixels in the images of the optimized image sequence using a homography matrix and calculating the normalized cross-correlation value between the target pixel and the matched pixel, to obtain a matching aggregation cost function for the specified pixel based on a color-consistency constraint; and, based on the matching aggregation cost function corresponding to each specified pixel, solving the depth value of each specified pixel with a global optimization method to obtain the fine-granularity depth map sequence.
Further, the training process of the neural radiance field model is as follows: acquiring a training sample set, where the samples in the sample set include an original image sequence and the original camera parameters corresponding to the original image sequence; dividing the original image sequence into a plurality of image groups, each image group corresponding to one block of the pre-partitioned whole urban scene; dividing each image in an image group into a foreground part and a background part to obtain a foreground image group and a background image group; and performing implicit modeling on the foreground image group and the background image group separately with neural radiance fields to obtain the neural radiance field model.
In a second aspect, the present application also provides an urban scene three-dimensional modeling apparatus, the apparatus comprising: a camera parameter acquisition module, configured to acquire an original image sequence obtained by aerial photography of an urban scene by an unmanned aerial vehicle and target camera parameters corresponding to the original image sequence, the target camera parameters being obtained by performing parameter interpolation on the original camera parameters corresponding to the original image sequence; a model processing module, configured to input the target camera parameters into a pre-trained neural radiance field model and output, through preset processing, an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence, the neural radiance field model being obtained by implicit modeling based on the original image sequence and the original camera parameters; a depth image optimization module, configured to determine a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence; and a three-dimensional modeling module, configured to perform three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain the three-dimensional model of the urban scene corresponding to the target camera parameters.
In a third aspect, the present application also provides an electronic device comprising a processor and a memory, the memory storing computer executable instructions executable by the processor, the processor executing the computer executable instructions to implement the method of the first aspect.
In a fourth aspect, the present application also provides a computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of the first aspect.
In the urban scene three-dimensional modeling method and device and the electronic device, an original image sequence obtained by aerial photography of an urban scene by an unmanned aerial vehicle and the target camera parameters corresponding to the original image sequence are first obtained, the target camera parameters being obtained by performing parameter interpolation on the original camera parameters corresponding to the original image sequence; the target camera parameters are then input into a pre-trained neural radiance field model and, through preset processing, an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence are output, the neural radiance field model being obtained by implicit modeling based on the original image sequence and the original camera parameters; a dense depth image sequence is determined based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence; and finally, three-dimensional modeling is performed according to the dense depth image sequence and the corresponding camera parameters to obtain the three-dimensional model of the urban scene corresponding to the target camera parameters. With this approach, the original image sequence obtained by UAV aerial photography of an urban scene can be automatically optimized into a dense depth image sequence, and three-dimensional modeling is then performed on the optimized dense depth image sequence. This reduces the requirements on original image acquisition while ensuring three-dimensional modeling quality: long acquisition campaigns by professionals are not needed, so the labor cost and time cost of accurate three-dimensional modeling are reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a three-dimensional modeling method for urban scene provided by an embodiment of the application;
FIG. 2 is a flowchart of a modeling process in a three-dimensional modeling method for urban scene according to an embodiment of the present application;
FIG. 3 is a flow chart of parameter interpolation in a three-dimensional modeling method of urban scene according to an embodiment of the present application;
FIG. 4 is a block diagram of a three-dimensional modeling device for urban scene according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the present application will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In existing UAV city-level large-scene three-dimensional modeling methods, professionals are required to collect images of the large-area scene along planned flight lines. Even in professional UAV aerial video, a building in the scene may appear in many frames simultaneously, yet certain facades may be covered by only a few images or even missed entirely during flight. Such relatively sparse building-facade image data noticeably affects photometric-consistency measurement, dense point cloud reconstruction, and mesh modeling. Finally, for areas with holes that are difficult to reconstruct or model, manual filling and repair by professionals is still required to obtain a complete and refined model geometry.
On the other hand, urban scene modeling based on fusing depth information into a signed distance field (SDF) is the mainstream method for modern large-scene modeling. However, owing to the L2-regularizing nature of the SDF, reconstructing the building surface texture usually amounts to averaging all colors observed for a given voxel and simply mapping the associated depth values onto the geometry. If the available depth information is overly smooth, missing, or otherwise of poor accuracy, the detail of building facades in the scene can hardly be preserved in the modeling result. This requires computing an accurate surface geometry so that the re-projected colors of the reconstructed texture are globally consistent. For a limited set of UAV images, however, textures are highly repetitive and complex, and the building geometry is not sufficiently clear. Moreover, complex scenes contain low-texture objects (building facades), repeated textures (lawns), and mirror-like surfaces (water). These factors cause deviations or gaps in the predicted three-dimensional information of the target surface and are detrimental to the subsequent mesh construction; with limited UAV images, the resulting models are prone to deformation, over-smoothing, and holes.
Therefore, the existing workflow for UAV large-scene modeling from aerial photography still faces many problems, and operators need sufficient field experience to cope with the various conditions encountered during operation. The whole acquisition and reconstruction process is suitable only for professional operation, and the acquisition process is complex, time-consuming, and costly. UAV building-scene modeling is therefore difficult to popularize among ordinary users and to bring to consumer-level applications.
On this basis, the embodiments of the application provide an urban scene three-dimensional modeling method and device and an electronic device, which can automatically optimize an original image sequence obtained by UAV aerial photography into a dense depth image sequence and then perform three-dimensional modeling based on the optimized dense depth image sequence, reducing the requirements on original image acquisition while ensuring three-dimensional modeling quality; long image-acquisition campaigns by professionals are not required, so the labor cost and time cost of accurate three-dimensional modeling are reduced.
For the sake of understanding the present embodiment, first, a three-dimensional modeling method for urban scene disclosed in the present embodiment is described in detail.
Fig. 1 is a flowchart of a three-dimensional modeling method for urban scene, which specifically includes the following steps:
step S102, obtaining an original image sequence obtained by taking an aerial photo of a city scene by an unmanned aerial vehicle and a target camera parameter corresponding to the original image sequence; the target camera parameters are obtained after parameter interpolation processing is carried out on the basis of the original camera parameters corresponding to the original image sequence.
The original image sequence can be obtained by taking an aerial image of a city scene of the unmanned aerial vehicle, and in the image acquisition, no professional personnel is required to acquire the image sequence through a professional method, namely the method is suitable for most unmanned aerial vehicle image sequences which are not subjected to professional flight path and flight attitude planning.
The original image sequence comprises a plurality of images arranged according to the track point sequence, or a plurality of images arranged according to the shooting time sequence, each image corresponds to a camera parameter, and the camera parameter corresponding to each image in the original image sequence is the original camera parameter corresponding to the original image sequence. The camera parameters include: the camera internal parameter matrix K and the external parameter rotation matrix R and the translation matrix T.
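For reference, these parameters relate a world point X to its pixel projection as x ~ K (R·X + T); the following is a minimal sketch of the standard pinhole model, not an implementation taken from the patent:

```python
import numpy as np

def project(K, R, T, X):
    """Project a 3-D world point X into pixel coordinates using intrinsics K
    and extrinsics (R, T) of the pinhole camera model."""
    Xc = R @ X + T              # world -> camera coordinates
    x = K @ Xc                  # camera -> homogeneous pixel coordinates
    return x[:2] / x[2]         # perspective division
```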
The target camera parameters are obtained by applying an interpolation algorithm to the original camera parameters, in preparation for subsequently compensating and optimizing the view angles whose pose images are missing from the UAV aerial scene.
Step S104, inputting the target camera parameters into a pre-trained neural radiance field model and outputting, through preset processing, an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence; the neural radiance field model is obtained by implicit modeling based on the original image sequence and the original camera parameters.
In the embodiment of the application, the target camera parameters can be processed with a pre-trained neural radiance field model to obtain the optimized image sequence and the coarse-granularity depth image sequence. The training process of the neural radiance field model is described in detail later.
Step S106, determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence.
On the basis of the coarse-granularity depth image sequence, the image sequence is further optimized using the original image sequence and the optimized image sequence to obtain a dense depth image sequence.
And step S108, carrying out three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain a three-dimensional model of the urban scene corresponding to the target camera parameters.
According to the urban scene three-dimensional modeling method provided by the embodiment of the application, a neural radiance field model obtained by implicit modeling from the urban scene original image sequence and the original camera parameters automatically optimizes the original image sequence obtained by UAV aerial photography into a dense depth image sequence, and three-dimensional modeling is then performed on the optimized dense depth image sequence. This reduces the requirements on original image acquisition while ensuring three-dimensional modeling quality: long acquisition campaigns by professionals are not needed, so the labor cost and time cost of accurate three-dimensional modeling are reduced.
The aim is to design a high-quality city-level large-scene three-dimensional modeling method and system based on interactive coupling of implicit and explicit models, balancing accuracy and efficiency, that is suitable for most UAV image sequences without professional flight-path and flight-attitude planning. Missing color and geometric information is recovered with the help of the implicit scene model to construct a more complete, high-quality dense point cloud. In the modeling stage, the key-frame poses of additional images are recovered through the neural radiance field, and more texture images are fused into a signed distance field (SDF) under illumination-model constraints to finally generate a more realistic, high-quality scene mesh model. The method achieves reconstruction of high-quality UAV large-scene mesh models, reduces the operational complexity of UAV remote-sensing modeling, and improves the practical applicability and accessibility of UAV scene modeling technology.
The embodiment of the application also provides another three-dimensional modeling method of the urban scene, which is realized on the basis of the embodiment; the present embodiment focuses on describing a model training process, an image sequence optimizing process, and a three-dimensional modeling process.
Referring to fig. 2, the training process of the neural radiance field model is as follows:
Step S202, a training sample set is obtained, where the sample set includes an original image sequence and the original camera parameters corresponding to the original image sequence; that is, a group of urban scene original image sequences I_n obtained by UAV aerial photography is obtained.
Step S204, dividing the original image sequence into a plurality of image groups; each image group corresponds to a block in the entire city scene divided in advance.
In a specific implementation, the whole urban scene is pre-partitioned into n blocks, and the sequence frames are grouped into n groups by an illumination-based geometric clustering method. Each group of data is used to train a neural radiance field model (implemented as a multi-layer perceptron).
The "geometric clustering method based on illumination" is as follows:
For the neural radiance field representation of a large scene, the scene must be decomposed into units with centroids n ∈ N, n = (n_x, n_y, n_z), and the corresponding model weights f_n initialized. Each weight submodule is a sequence of fully connected layers similar to the NeRF architecture.
An additional appearance embedding vector l(a) is associated with each input image a for computing radiance; it represents the illumination and weather conditions in the image. Based on the appearance embedding vectors, geometric clustering is performed on the original image sequence to complete the partition of the large-scene training data, so that the neural radiance field is more flexible in explaining illumination differences between images, improving its capacity to represent the large scene.
At query time, the neural radiance field uses the model weights f_n closest to the query point to generate the opacity σ and color c = (r, g, b) for a given position x, viewing direction d, and appearance embedding l(a):
f_n(x) = σ;  f_n(x, d, l(a)) = c,
where n = argmin_{n ∈ N} ||n − x||_2.
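For clarity, the nearest-centroid query above can be sketched as follows; the `density`/`color` calls are a hypothetical wrapper around the per-block MLP weights f_n, not an API named in the patent:

```python
import numpy as np

def query_radiance_field(x, d, l_a, centroids, submodels):
    """Pick the submodule whose centroid is closest to the query point x,
    then evaluate opacity and appearance-conditioned color (hypothetical API)."""
    # n = argmin_{n in N} ||n - x||_2
    idx = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    f_n = submodels[idx]
    sigma = f_n.density(x)           # f_n(x) = sigma
    color = f_n.color(x, d, l_a)     # f_n(x, d, l(a)) = c = (r, g, b)
    return sigma, color
```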
Step S206, dividing each image in the image group into a foreground part and a background part to obtain a foreground image group and a background image group.
All images of the UAV aerial scene are subdivided into a foreground part and a background part by a foreground segmentation method based on ray casting. That is, for each image group, a foreground image group and a background image group are obtained.
The ray-casting-based foreground and background division method is as follows:
Using a 4D outer-volume parameterization and the ray-casting formulation, the partition result is improved by using camera poses that more tightly surround the foreground and the related foreground details. Then, using the camera altitude measurements, the sampling boundary of the scene is further refined by terminating rays near the ground, reducing unnecessary sample queries.
Step S208, for the foreground image group and the background image group, implicit modeling is performed separately with neural radiance fields to obtain the neural radiance field model.
Modeling the foreground image group and the background image group implicitly with separate neural radiance fields lets the neural radiance field learn every level of detail in the UAV scene, improving the accuracy of implicit modeling of the large UAV aerial scene.
In the specific implicit modeling, each group of UAV aerial images is trained independently with its own multi-layer perceptron; the size of the training space of each multi-layer perceptron is limited based on pixel correlation, with pixels that reach a certain degree of correlation assigned to the same space for training. All groups are then trained in parallel to obtain the neural radiance field model M of the scene.
The training space of each multi-layer perceptron is limited based on pixel correlation as follows: the point corresponding to each pixel of each training image is sampled along the camera ray, and the pixel is added only to the training set of the spatial cells intersected by that ray; a small overlap factor is then added between cells to further reduce visual artifacts near the boundaries.
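A rough sketch of that assignment rule follows (axis-aligned cells expanded by an overlap factor; the data layout is an assumption, not taken from the patent):

```python
import numpy as np

def cells_hit_by_ray(ray_samples, cell_min, cell_max, overlap=0.05):
    """Return indices of spatial cells whose (slightly expanded) boxes contain
    at least one sample point of the ray; the pixel joins those cells' training sets.
    ray_samples: (S, 3); cell_min, cell_max: (C, 3) per-cell AABB corners."""
    pad = overlap * (cell_max - cell_min)
    lo, hi = cell_min - pad, cell_max + pad
    inside = ((ray_samples[:, None, :] >= lo) & (ray_samples[:, None, :] <= hi)).all(-1)
    return np.where(inside.any(axis=0))[0]
```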
In addition, based on the grouped urban scene sequence images obtained by UAV aerial photography, the whole aerial scene can be divided, group by group, into two-dimensional spatial cells under the top-down view to determine the centroid position of each group of images; the centroid heights are fixed to the same value, completing the selection of the centroid for each group of data. This facilitates the subsequent stitching of the blocks formed by the groups into a complete urban scene.
In this embodiment, the UAV scene is divided into an m × n grid under the top-down view, and the centroid positions n ∈ N, n = (n_x, n_y, n_z), are determined according to the scene partition of the previous step, where n_z is a fixed value; the result is used for the subsequent scene stitching.
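As a rough illustration of the m × n top-view partition with a fixed centroid height (a sketch under assumed inputs, not the patented implementation):

```python
import numpy as np

def make_block_centroids(xy_min, xy_max, m, n, fixed_z):
    """Divide the scene footprint into an m x n grid under the top-down view
    and return one centroid (n_x, n_y, n_z) per block, with n_z fixed."""
    xs = np.linspace(xy_min[0], xy_max[0], m + 1)
    ys = np.linspace(xy_min[1], xy_max[1], n + 1)
    centers_x = (xs[:-1] + xs[1:]) / 2.0
    centers_y = (ys[:-1] + ys[1:]) / 2.0
    gx, gy = np.meshgrid(centers_x, centers_y, indexing="ij")
    gz = np.full_like(gx, fixed_z)
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)  # (m*n, 3)

# Example: a 1 km x 1 km footprint split into 4 x 4 blocks, centroid height 100 m.
centroids = make_block_centroids((0.0, 0.0), (1000.0, 1000.0), 4, 4, 100.0)
```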
The optimization of the original image sequence is described in detail below. First, the target camera parameters corresponding to the original image sequence are obtained: with the original camera parameters of the original image sequence as input, the camera parameters of the view-angle images missing between the t track points are supplemented by an interpolation algorithm based on the UAV flight track. Referring to fig. 3, the specific process includes the following steps:
Step S302, an original track point sequence corresponding to the original image sequence is obtained; the original track point sequence consists of the original track point corresponding to each image of the original image sequence.
Step S304, determining the track points to be inserted corresponding to each original track point in the original track point sequence.
In a specific implementation, for each original track point, the images corresponding to all track points within the corresponding camera shooting window are determined from the image corresponding to that original track point; the similarity between the image corresponding to the original track point and the image corresponding to each of those track points is calculated; in order of decreasing similarity, a target track point whose image's two-dimensional coordinate position changes by more than a threshold value in the direction normal to the track is detected; and one track point is selected every preset length along the line connecting the original track point and the target track point as a track point to be inserted for the original track point.
Namely, the method of the interpolation algorithm based on the unmanned aerial vehicle flight track comprises the following steps:
first, the track points k in the flight path are traversed, and the following operations are carried out for each original track point k_i:
From the image taken at this track point, a shooting window w_i is determined (the shooting window refers to the camera's field of view), all track points {k_j, k_l, ...} within this window are found, and the two-dimensional position coordinates {p_j, p_l, ...} of these track points in the scene are bound to the corresponding images {I_j, I_l, ...}. Track points are then selected according to the following rules.
First, the similarity between the image of the selected track point and the image of each track point in the shooting window is calculated, and the results are sorted by similarity. Starting from the image with the highest similarity, it is checked whether the change of its two-dimensional coordinate position in the direction normal to the track is greater than the threshold; if so, the track point corresponding to that image is selected as the target track point. If it is below the threshold, the remaining images are checked in the same way in order of similarity until one exceeds the threshold, and the track point corresponding to that image is selected as the target track point.
The original track point is connected to the target track point, and points are selected on the connecting line at intervals of a certain length as the track points to be inserted, i.e., the view angles to be inserted are selected.
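The view-to-insert selection can be sketched as below, mirroring the steps above (similarity ranking, a displacement threshold along the track-normal direction, equally spaced points on the connecting segment); the similarity values, threshold, and step length are placeholders, not values from the patent:

```python
import numpy as np

def points_to_insert(p_i, candidates, similarities, normal_dir, threshold, step):
    """p_i: 2-D position of the original track point; candidates: (K, 2) positions
    of track points in the shooting window; similarities: image similarity of each
    candidate to the original image; normal_dir: unit vector normal to the track."""
    order = np.argsort(similarities)[::-1]            # most similar first
    target = None
    for j in order:
        # displacement of the candidate along the track-normal direction
        if abs(np.dot(candidates[j] - p_i, normal_dir)) > threshold:
            target = candidates[j]
            break
    if target is None:
        return []                                      # no suitable target point
    seg = target - p_i
    length = np.linalg.norm(seg)
    if length == 0.0:
        return []
    n_new = int(length // step)
    # one point every `step` units along the segment between p_i and the target
    return [p_i + seg * (k * step / length) for k in range(1, n_new + 1)]
```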
Step S306, in the original camera parameters corresponding to the original image sequence, the camera parameters of the corresponding image are inserted according to the track points to be inserted, and the target camera parameters are obtained.
The target camera parameters are then input into the pre-trained neural radiance field model, and the optimized image sequence and the coarse-granularity depth image sequence corresponding to the original image sequence are output through preset processing. The specific implementation is as follows:
(1) The target camera parameters are input into the pre-trained neural radiance field model, and the optimized image sequence corresponding to the original image sequence is output through neural rendering; specifically, the target camera parameters are input into the neural radiance field model M, and an optimized image sequence covering the missing view angles is synthesized through neural rendering.
(2) Guided ray sampling is performed on the optimized image sequence, the voxel octree is refined, and the coarse-granularity depth image sequence is synthesized by applying a depth rendering formula.
Then, using a neural rendering method (a caching technique exploiting rendering continuity), the rendering results are reused to synthesize all UAV aerial images; guided ray sampling refines the voxel octree, further improving rendering quality, and a depth rendering formula is used to synthesize the coarse-granularity depth images {d_n}_synthesis corresponding to all UAV color images.
Wherein, the method of using the cache technology and the continuity of rendering is as follows:
A coarse cache containing opacity and color is first pre-computed. The spatio-temporal consistency of the acquisition track is exploited, with emphasis on the direction orthogonal to the track: once the information needed to render a given view has been computed, a large portion of it is reused for the next view. The renderer uses the cached output to quickly generate an initial view, then performs additional rounds of model sampling to further refine the image and stores these new values in the cache. Since each subsequent frame overlaps significantly with the previous one, it benefits from previous computations, and only a small incremental effort is needed to maintain rendering quality.
"Guided ray sampling" is performed by:
rays are rendered in a single pass using the weights stored in the octree structure. Since the refined octree provides a high-quality estimate of the scene geometry, only a small number of samples need to be placed near the surfaces of interest. The rendering process is further accelerated by accumulating the transmittance along the ray and terminating sampling once it falls below a threshold. By ensuring that the network is exposed to every depth value in the cone during training, a continuous representation is learned, yielding a finer-granularity sampling range; sampling guided in this way produces a more accurate RGB rendered image.
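Transmittance-based early ray termination is standard volume-rendering practice; a schematic version (with the octree lookup abstracted away, and numpy arrays assumed for the per-sample inputs) is:

```python
import numpy as np

def march_ray_early_stop(sigmas, deltas, colors, t_min=1e-3):
    """Accumulate color along a ray and stop once transmittance falls below t_min.
    sigmas: (N,) densities, deltas: (N,) spacings, colors: (N, 3) sample colors."""
    T = 1.0
    rgb = np.zeros(3)
    for sigma, delta, c in zip(sigmas, deltas, colors):
        alpha = 1.0 - np.exp(-sigma * delta)
        rgb += T * alpha * c
        T *= 1.0 - alpha
        if T < t_min:      # remaining samples can no longer change the result much
            break
    return rgb, T
```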
In some preferred embodiments, the depth rendering formula generates the corresponding depth as follows: the depth map of an input image is rendered by
{d_n}_synthesis = Σ_{i=1..N} T_i (1 − exp(−σ_i δ_i)) z_i,  with  T_i = exp(−Σ_{j=1..i−1} σ_j δ_j),
where {d_n}_synthesis is the depth value of one pixel of the rendered image, N is the number of sample points, T_i is the accumulated transmittance at the i-th sample point, σ_i is the voxel density at the i-th sample point, δ_i is the distance between the i-th and (i+1)-th sample points, and z_i is the depth value of the i-th sample point in the camera coordinate system.
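Written out as code, the depth rendering above amounts to the following sketch (same symbols as in the formula; per-sample quantities are assumed to be arrays):

```python
import numpy as np

def render_depth(sigmas, deltas, zs):
    """d = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * z_i, with
    T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    sigmas, deltas, zs = map(np.asarray, (sigmas, deltas, zs))
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # cumulative transmittance up to (but excluding) each sample
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    return float(np.sum(T * alphas * zs))
```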
After the optimized image sequence and the coarse-granularity depth image sequence are obtained, a dense depth image sequence is further determined based on the original image sequence, the optimized image sequence and the coarse-granularity depth image sequence, and the specific process is as follows:
(1) And determining a fine granularity depth image sequence according to the original image sequence and the optimized image sequence.
Through implicit modeling of the scene, the neural radiance field can estimate depth information for objects of any material in the scene (textured or textureless), but the geometric details of the scene are overly smooth and the overall accuracy is low. To provide better depth information of the UAV scene for subsequent modeling, a high-precision dense depth prediction method based on serialized fragment matching is designed. Taking I_n and the optimized image sequence as input, a fine-granularity depth image sequence {d_n}_serialized is generated by sequential computation over the original image sequence, based on an aggregation cost function combining color and geometric consistency.
The high-precision dense depth prediction method based on serialized fragment matching is specifically as follows:
For each original image in the original image sequence, a rectangular window centered on a specified pixel of the original image is set; for each target pixel in the rectangular window, the corresponding pixels in the images of the optimized image sequence are matched sequentially using a homography matrix, and the normalized cross-correlation value between the target pixel and the matched pixel is calculated, yielding a matching aggregation cost function for the specified pixel based on a color-consistency constraint; based on the matching aggregation cost function corresponding to each specified pixel, the depth value of each specified pixel is solved with a global optimization method to obtain the fine-granularity depth map sequence.
In a specific implementation, following the principle of the original PatchMatch stereo algorithm, a single slanted 3D plane per pixel is used to overcome the bias that arises when reconstructing fronto-parallel surfaces, extended to finding an approximate nearest neighbor over all possible planes. However, for sparse or limited non-professional aerial images, each 3D plane can contain only a small number of pixels, which means the initial guesses are inadequate. The high-precision dense depth prediction method based on serialized fragment matching instead performs depth estimation by considering the original images together with all images synthesized by the neural radiance field through interpolation between flight-track points; the depth information of the current key frame is mutually compensated by multiple depth maps, reducing the error caused by random initialization.
The images I_n of the original shooting track are taken as source images, and the image frames synthesized between track points are taken as reference images. A rectangular window with bounding box B = w × w is set centered on the pixel p of the current image. For each pixel q in B, the homography matrix H_ij(q) is used to sequentially match the corresponding pixels of the image frames synthesized between track points. A matching aggregation cost function m(p, f_p) based on the color-consistency constraint is then obtained for each pixel p, where f_p is the 3D plane estimated at pixel p. The normalized cross-correlation (NCC) value between q and H_ij(q) is calculated as
NCC = Σ_{q∈B} (I(q) − Ī_B)(I'(H_ij(q)) − Ī'_B) / sqrt( Σ_{q∈B} (I(q) − Ī_B)² · Σ_{q∈B} (I'(H_ij(q)) − Ī'_B)² ),
where q is a pixel in window B, Ī_B is the mean of all pixels in window B, H_ij(q) is the pixel obtained after the homography transformation from the source image to the reference image, and Ī'_B is the mean of all pixels in the transformed window B. The closer the NCC value is to 1, the more similar the pixels, i.e., the higher the color consistency.
In addition, to further improve the accuracy of the geometric details of the predicted depth map, a multi-view geometric-consistency constraint is integrated into the cost aggregation. The geometric consistency between the source image and the reference image is defined as the forward-backward reprojection error:
e_geo = || x_j − H_ij(H_ji(x_j)) ||,
where x_j is a pixel on the source image I_n, H_ji denotes the projective backward transformation from the source image to the reference image, H_ij is the homography from the reference image to the source image I_n, and ||·|| denotes the Euclidean distance, i.e., the straight-line distance between the two points. If this distance is small, the pixel positions before and after the transformation do not deviate significantly, i.e., the geometric consistency is high. The reprojection error is added as a penalty to the matching aggregation cost function, so that the cost function accounts for both color consistency and geometric consistency. If the reprojection error is small, the estimated depth and normal are consistent and the corresponding points lie at the same spatial position. In this way the final cost value of each pixel is obtained, ensuring the completeness and accuracy of the depth estimation. According to the cost value of each pixel, a global optimization method is used to solve for the depth value of each pixel to obtain a fine-granularity depth image, and batch processing finally yields the fine-granularity depth image sequence.
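A compact sketch of the two cost terms (NCC for color consistency, forward-backward reprojection error for geometric consistency) follows; the weighting of the two terms is an assumption, not specified above:

```python
import numpy as np

def ncc(patch_src, patch_ref):
    """Normalized cross-correlation of two same-sized windows; 1 means identical."""
    a = patch_src - patch_src.mean()
    b = patch_ref - patch_ref.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8
    return float((a * b).sum() / denom)

def reprojection_error(x, H_ref_to_src, H_src_to_ref):
    """||x - H_ij(H_ji(x))|| for a homogeneous pixel x on the source image."""
    y = H_ref_to_src @ (H_src_to_ref @ x)
    y = y / y[2]
    x = x / x[2]
    return float(np.linalg.norm((x - y)[:2]))

def matching_cost(patch_src, patch_ref, x, H_ij, H_ji, lam=0.1):
    # lower is better: photometric term plus a weighted geometric penalty
    return (1.0 - ncc(patch_src, patch_ref)) + lam * reprojection_error(x, H_ij, H_ji)
```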
(2) The fine-granularity depth image sequence is completed using the coarse-granularity depth image sequence to obtain the dense depth image sequence.
The high-precision dense depth prediction method based on serialized fragment matching yields depth information of higher accuracy, but a purely geometric method may still produce deviations or holes in the depth prediction for regions such as low texture (building facades), repeated texture (lawns), and mirror-like surfaces (water). Therefore, in the embodiment of the application, a depth-information completion method uses the coarse-granularity depth image sequence {d_n}_synthesis to fill the regions of the fine-granularity depth image sequence {d_n}_serialized that contain holes or have no value; the two are finally fused to obtain the final complete dense depth image sequence {d_n}_final.
The depth information complement method specifically comprises the following steps:
First, each pixel of the fine-granularity depth image sequence {d_n}_serialized is traversed; whenever an outlier occurs, its position P_error is recorded. The depth value at the position P_error in {d_n}_synthesis is then substituted at the outlier to fill the hole. Finally, a dense depth image sequence {d_n}_final with complete information is obtained.
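The completion step reduces to replacing invalid entries of the fine-granularity map with the synthesized coarse-granularity values, roughly as follows (the outlier test used here, "non-finite or non-positive", is an assumption):

```python
import numpy as np

def complete_depth(d_serialized, d_synthesis):
    """Fill holes/outliers in the fine-granularity depth map from the coarse one."""
    d_final = d_serialized.copy()
    invalid = ~np.isfinite(d_serialized) | (d_serialized <= 0)   # P_error mask
    d_final[invalid] = d_synthesis[invalid]
    return d_final
```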
Based on the foregoing, the optimized image sequence (i.e., the synthesized new-view color images), the original camera parameters P_n corresponding to each image, the complete dense depth image sequence {d_n}_final, and the original image sequence I_n have been obtained. These serve as the data input of the signed-distance-field large-scene three-dimensional modeling module guided by joint appearance and geometric optimization; through reconstruction and optimization of the mesh model, a high-quality mesh reconstruction result is finally obtained. The specific method comprises the following steps:
Step A100, a voxel-based modeling method is used to fuse the input depth images {d_n}_final into a signed distance field (SDF), yielding a coarse three-dimensional implicit representation. This step uses the information of multiple depth images to eliminate noise and occlusion and obtain a continuous, complete three-dimensional surface.
Step A200, based on the fused signed distance field, a triangular mesh is extracted with an iso-surface extraction algorithm, and some key frames I_kf are selected as texture-mapping candidates by computing the contribution of each image to reconstruction quality. This step converts the SDF into a more manageable mesh representation and reduces the number of images required for texture mapping.
Step A300, for each key frame I_kf selected in step A200, the spatially varying spherical harmonic coefficients of its material and of the scene illumination, together with its camera pose P_kf, are estimated. This step builds more accurate illumination and reflection models m_l and m_r, which must account for possible variations in material and illumination within the scene.
Step A400, the surface shape is recovered from the multiple images according to the illumination and reflection models m_l and m_r using a shape-from-shading (SfS) technique. That is, through alternating iterations, the signed distance field, the textures, the poses, and the spatially varying spherical harmonic coefficients are optimized until convergence, minimizing the reprojection error. This step further improves modeling quality by using the surface normal and reflection information contained in the color images to refine geometry and texture.
Step A500, a high-quality three-dimensional reconstruction result is extracted from the optimized signed distance field, and the textures of the key frames are mapped onto the mesh, finally yielding a three-dimensional model with a realistic appearance.
In some preferred embodiments, the "voxel modeling method" in step a100 is specifically:
The core of the method is a reconstructed curved surface, the reconstructed curved surface is implicitly stored as a sparse truncated symbol distance function, and the sparse truncated symbol distance function is expressed as D. Thus, each voxel stores the original (truncated) symbol distance D (v) to the nearest surface, its color C (v), the integral weight W (v). The depth map is integrated into the SDF using a weighted moving average scheme:
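The weighted moving average update itself is not reproduced legibly in the text; the standard per-voxel form it refers to, with d_i(v) and w_i(v) the truncated distance and weight contributed by the i-th depth map, is (as an assumption drawn from common SDF fusion practice):

D'(v) = (W(v)·D(v) + w_i(v)·d_i(v)) / (W(v) + w_i(v)),   W'(v) = W(v) + w_i(v),

and the color C(v) is updated analogously with the same weights.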
In some preferred embodiments, in step A300, "estimating the spatially varying spherical harmonic coefficients of its material and scene illumination" is performed as follows:
To represent the illumination of the scene, a fully parametric model is used to define the shading of every surface point under the global scene illumination. To make the problem tractable, the scene is assumed to be an ideal diffuse (Lambertian) environment. The shading B at voxel v is then computed from the voxel surface normal n(v), the voxel albedo a(v), and the scene illumination parameters l_m:
B(v) = a(v) · Σ_m l_m H_m(n(v)),
where the H_m are the shading (spherical harmonic) basis functions. This formula defines the forward shading computation, so the inverse rendering problem can be solved by estimating the parameters of B.
To estimate the reflected irradiance B at voxel v, the scene illumination is parameterized with spherical harmonic (SH) basis functions, which are known to be a good smooth approximation of ideal diffuse surface reflection. To overcome the limitation of a single global spherical harmonic defining the scene illumination globally, the reconstructed volume is divided into sub-volumes S = {S_1, ..., S_K} of fixed size t_sv, where K is the number of sub-volumes. Each sub-volume is assigned a spherical harmonic basis and has its own spherical harmonic coefficients. This greatly increases the number of illumination parameters per scene and allows spatially adaptive illumination changes.
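For concreteness, a small sketch of evaluating second-order spherical-harmonics shading for one voxel of one sub-volume (nine coefficients; the basis constants are the standard real-SH values, and the rest of the pipeline is omitted):

```python
import numpy as np

def sh_basis(n):
    """First nine real spherical-harmonics basis values for a unit normal n."""
    x, y, z = n
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ])

def shade(albedo, normal, l_coeffs):
    """B(v) = a(v) * sum_m l_m * H_m(n(v)) for one voxel of one sub-volume."""
    return albedo * float(np.dot(l_coeffs, sh_basis(normal)))
```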
Most non-professionally acquired UAV image sequences suffer from unevenly arranged flight strips, unplanned trajectories, low scene overlap, and few effective frames; as a result, existing methods achieve low modeling accuracy and may even produce obvious holes, deformation, or outright modeling failure. At present, modeling accuracy can only be improved by using professional aerial acquisition and pre-processing the images with professional software to avoid holes, deformation, and failure, which is time-consuming, labor-intensive, and hard to popularize.
To address these problems, the urban scene three-dimensional modeling method provided by the embodiment of the application takes an ordinary UAV aerial image sequence as input and provides a high-quality city-level large-scene three-dimensional modeling method based on interactive coupling of implicit and explicit models of UAV aerial images. High-quality aerial scene color images are synthesized with a neural radiance field to solve the problems of low scene overlap and few effective frames caused by arbitrary trajectories. Meanwhile, the neural radiance field and the serialized fragment matching supervise each other to address the missing depth information and over-smoothed geometry of textureless or repeated-texture regions in complex scenes. Finally, the material and illumination-change models are introduced into the scene voxel modeling framework, and a high-quality (hole-free, undeformed) UAV large-scene mesh model is output. In summary, the scheme provided by the embodiment of the application markedly reduces the difficulty and complexity of UAV image acquisition and urban large-scene three-dimensional modeling: automatic high-quality large-scene three-dimensional modeling can be achieved from a non-professional aerial image sequence, without professional flight planning, image acquisition, or image pre-processing, and the generated three-dimensional model needs no manual post-optimization, improving the accessibility of UAV large-scene three-dimensional modeling.
Based on the above method embodiment, the present application further provides a three-dimensional modeling device for urban scene, as shown in fig. 4, where the device includes: the camera parameter obtaining module 42 is configured to obtain an original image sequence obtained by aerial photographing of a city scene by the unmanned aerial vehicle, and a target camera parameter corresponding to the original image sequence; the target camera parameters are obtained after parameter interpolation processing is carried out on the basis of the original camera parameters corresponding to the original image sequence; the model processing module 44 is configured to input the target camera parameters into a pre-trained neural radiation field model, and output an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence through preset processing; the neural radiation field model is obtained by implicit modeling based on an original image sequence and original camera parameters; a depth image optimization module 46 for determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence; the three-dimensional modeling module 48 is configured to perform three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters, so as to obtain a three-dimensional model of the urban scene corresponding to the target camera parameters.
Further, the camera parameter obtaining module 42 is configured to: obtain an original track point sequence corresponding to the original image sequence; determine, for each original track point in the original track point sequence, the track points to be inserted; and, according to the track points to be inserted, insert camera parameters for the corresponding images into the original camera parameters corresponding to the original image sequence, to obtain the target camera parameters.
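As an illustrative sketch only (the patent does not prescribe an interpolation formula), the following Python fragment shows one way camera parameters could be blended for the track points to be inserted, using a simple linear interpolation of camera centers and viewing directions between two original poses; the dictionary layout and function name are assumptions made for this example.

```python
import numpy as np

def interpolate_camera_params(cam_a, cam_b, insert_points):
    """cam_a/cam_b: dicts with 'center' (3,) and 'view_dir' (3,); insert_points: (K, 3)."""
    a = np.asarray(cam_a["center"], float)
    b = np.asarray(cam_b["center"], float)
    span = np.linalg.norm(b - a)                       # distance between the two original poses
    cams = []
    for p in np.asarray(insert_points, float):
        t = np.linalg.norm(p - a) / span               # fraction of the way from pose a to pose b
        view = (1 - t) * np.asarray(cam_a["view_dir"], float) \
             + t * np.asarray(cam_b["view_dir"], float)
        cams.append({"center": p, "view_dir": view / np.linalg.norm(view)})
    return cams

# Usage: two poses 6 m apart and two track points to be inserted between them.
new_cams = interpolate_camera_params(
    {"center": [0, 0, 50], "view_dir": [0, 0, -1]},
    {"center": [6, 0, 50], "view_dir": [0.1, 0, -1]},
    insert_points=[[2, 0, 50], [4, 0, 50]])
```

In practice a rotation interpolation such as spherical linear interpolation could replace the direction blend; the linear blend is kept here only to keep the sketch short.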
Further, the camera parameter obtaining module 42 is configured to: for each original track point, determine, according to the image corresponding to that original track point, the images corresponding to all track points within the corresponding camera shooting window; calculate the similarity between the image corresponding to the original track point and each of those images; scan the candidates in descending order of similarity and detect a target track point whose corresponding image has a two-dimensional coordinate position change, in the direction normal to the trajectory, that exceeds a threshold value; and select one track point at intervals of a preset length on the connecting line between the original track point and the target track point as the track points to be inserted corresponding to that original track point.
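The following Python fragment is a hedged sketch of this step: candidates are scanned in descending similarity, the first one whose displacement normal to the trajectory exceeds the threshold is taken as the target track point, and points are then sampled every preset length along the connecting line. The similarity values, the trajectory direction and the numeric thresholds are placeholders, not values taken from the patent.

```python
import numpy as np

def track_points_to_insert(origin_xy, candidates_xy, similarities,
                           track_dir, threshold=2.0, spacing=1.0):
    """Return the track points to insert for one original track point (a sketch)."""
    origin = np.asarray(origin_xy, float)
    normal = np.array([-track_dir[1], track_dir[0]], float)
    normal /= np.linalg.norm(normal)                  # unit normal of the trajectory direction
    for idx in np.argsort(similarities)[::-1]:        # scan candidates in descending similarity
        offset = np.asarray(candidates_xy[idx], float) - origin
        if abs(offset @ normal) > threshold:          # displacement normal to the track
            target = np.asarray(candidates_xy[idx], float)
            dist = np.linalg.norm(target - origin)
            steps = np.arange(spacing, dist, spacing)  # one point every `spacing` units
            return [origin + s * (target - origin) / dist for s in steps]
    return []                                          # no candidate exceeded the threshold

# Usage: candidates lying close to the track are skipped; the candidate with a
# large off-track displacement becomes the target and the line to it is sampled.
points = track_points_to_insert(
    origin_xy=[0.0, 0.0],
    candidates_xy=[[1.0, 0.2], [4.0, 6.0], [2.0, -0.1]],
    similarities=[0.9, 0.8, 0.95],
    track_dir=[1.0, 0.0], threshold=2.0, spacing=1.5)
```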
Further, the model processing module 44 is configured to: input the target camera parameters into the pre-trained neural radiation field model and output, through neural rendering, an optimized image sequence corresponding to the original image sequence; and perform guided ray sampling on the optimized image sequence, refine the voxel octree, and synthesize the coarse-granularity depth image sequence by applying a depth rendering formula.
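The depth rendering step can be illustrated with the standard volume-rendering formulation, in which the same weights used to composite color are applied to the sample depths along each ray. The sketch below uses synthetic densities; in the described method the densities would come from the trained neural radiation field and the rays from guided sampling over the refined voxel octree.

```python
import numpy as np

def render_depth(t_samples, sigmas):
    """t_samples: (N,) sample depths along a ray; sigmas: (N,) predicted densities."""
    deltas = np.diff(t_samples, append=t_samples[-1] + 1e10)       # distances between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                        # opacity of each interval
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]])) # accumulated transmittance
    weights = trans * alphas                                       # volume-rendering weights
    return float(np.sum(weights * t_samples))                      # expected depth along the ray

# Usage: a ray whose density peaks around 30 m yields a depth estimate near 30 m.
t = np.linspace(1.0, 60.0, 128)
sigma = 5.0 * np.exp(-0.5 * ((t - 30.0) / 1.5) ** 2)
print(render_depth(t, sigma))
```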
Further, the depth image optimization module 46 is configured to: determine a fine-granularity depth image sequence according to the original image sequence and the optimized image sequence; and complete the fine-granularity depth image sequence using the coarse-granularity depth image sequence to obtain the dense depth image sequence.
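A minimal sketch of the completion step is given below: pixels where the fine-granularity depth could not be solved (represented here as NaN, an assumption made for illustration) are filled from the coarse-granularity depth rendered by the neural radiation field.

```python
import numpy as np

def complete_depth(fine_depth, coarse_depth):
    dense = fine_depth.copy()
    holes = ~np.isfinite(dense)           # pixels the fine-granularity matcher could not solve
    dense[holes] = coarse_depth[holes]    # fall back to the rendered coarse-granularity depth
    return dense

fine = np.array([[10.0, np.nan], [np.nan, 12.0]])
coarse = np.array([[10.5, 11.0], [11.5, 12.5]])
print(complete_depth(fine, coarse))       # -> [[10.  11. ] [11.5 12. ]]
```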
Further, the depth image optimization module 46 is configured to: for each original image in the original image sequence, set a rectangular window centered on a specified pixel of the original image; for each target pixel in the rectangular window, sequentially match the corresponding pixels in the images of the optimized image sequence using a homography matrix, and calculate the normalized cross-correlation between the target pixel and the matched pixels to obtain a matching aggregation cost function for the specified pixel based on a color consistency constraint; and, based on the matching aggregation cost function corresponding to each specified pixel, solve the depth value of each specified pixel with a global optimization method to obtain the fine-granularity depth map sequence.
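The sketch below illustrates the color-consistency cost for a single window and a single source image: window pixels are mapped into an image of the optimized sequence with a plane-induced homography and scored with normalized cross-correlation. The homography construction, the aggregation over multiple source images and the global optimizer are outside the scope of this fragment, and all names here are assumptions for illustration.

```python
import numpy as np

def warp(H, xy):
    """Apply a 3x3 plane-induced homography to an (x, y) pixel coordinate."""
    x, y, w = H @ np.array([xy[0], xy[1], 1.0])
    return np.array([x / w, y / w])

def ncc_cost(ref_img, src_img, center, H, half=3):
    """1 - NCC over a (2*half+1)^2 window; low cost means photo-consistent."""
    ref_patch, src_patch = [], []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            x, y = center[0] + dx, center[1] + dy
            u, v = np.round(warp(H, (x, y))).astype(int)   # matched pixel in the source image
            if not (0 <= y < ref_img.shape[0] and 0 <= x < ref_img.shape[1]):
                continue
            if not (0 <= v < src_img.shape[0] and 0 <= u < src_img.shape[1]):
                continue
            ref_patch.append(ref_img[y, x])
            src_patch.append(src_img[v, u])
    r = np.asarray(ref_patch, float) - np.mean(ref_patch)
    s = np.asarray(src_patch, float) - np.mean(src_patch)
    return 1.0 - (r @ s) / (np.linalg.norm(r) * np.linalg.norm(s) + 1e-8)

# Usage: identical images and an identity homography give a near-zero cost.
img = np.random.rand(32, 32)
print(ncc_cost(img, img, center=(16, 16), H=np.eye(3)))
```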
Further, the device further includes a model training module configured to execute the following training process for the neural radiation field model: obtain a training sample set, where the samples in the sample set include an original image sequence and the original camera parameters corresponding to the original image sequence; divide the original image sequence into a plurality of image groups, where each image group corresponds to one block of the pre-divided whole city scene; divide each image in an image group into a foreground part and a background part to obtain a foreground image group and a background image group; and perform implicit modeling on the foreground image group and the background image group respectively through the neural radiation field to obtain the neural radiation field model.
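The following fragment sketches how the training samples could be organised: images are grouped by the pre-divided city block that their camera center falls into, and each image is split into a foreground and a background part with a binary mask. The block size and the origin of the mask (for example, a segmentation network) are assumptions made for illustration.

```python
import numpy as np

def group_by_block(cam_centers, block_size=200.0):
    """Map each image index to the (i, j) ground-plane block containing its camera."""
    groups = {}
    for idx, c in enumerate(cam_centers):
        key = (int(c[0] // block_size), int(c[1] // block_size))
        groups.setdefault(key, []).append(idx)
    return groups

def split_foreground_background(image, mask):
    """mask: 1 where the pixel belongs to the foreground, 0 for the background."""
    fg = image * mask[..., None]
    bg = image * (1 - mask[..., None])
    return fg, bg

# Usage: three camera centers fall into two 200 m x 200 m blocks.
centers = np.array([[10.0, 20.0, 80.0], [350.0, 40.0, 80.0], [30.0, 60.0, 80.0]])
print(group_by_block(centers))            # -> {(0, 0): [0, 2], (1, 0): [1]}

img = np.ones((4, 4, 3))
m = np.zeros((4, 4)); m[:, :2] = 1        # left half foreground, right half background
fg, bg = split_foreground_background(img, m)
```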
The device provided by the embodiments of the present application has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where the device embodiment does not mention a detail, reference may be made to the corresponding content in the foregoing method embodiment.
An embodiment of the present application further provides an electronic device. As shown in fig. 5, which is a schematic structural diagram of the electronic device, the electronic device includes a processor 51 and a memory 50, where the memory 50 stores computer-executable instructions that can be executed by the processor 51, and the processor 51 executes the computer-executable instructions to implement the above method.
In the embodiment shown in fig. 5, the electronic device further comprises a bus 52 and a communication interface 53, wherein the processor 51, the communication interface 53 and the memory 50 are connected by the bus 52.
The memory 50 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 53 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, etc. may be used. The bus 52 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 52 may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 5, but this does not mean that there is only one bus or one type of bus.
The processor 51 may be an integrated circuit chip with signal processing capabilities. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 51 or by instructions in the form of software. The processor 51 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied as being executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory, and the processor 51 reads the information in the memory and, in combination with its hardware, completes the steps of the method of the foregoing embodiment.
The embodiment of the present application also provides a computer-readable storage medium, which stores computer-executable instructions that, when called and executed by a processor, cause the processor to implement the above method; for specific implementation, reference may be made to the foregoing method embodiment, which will not be repeated here.
The computer program product of the method, the apparatus and the electronic device provided in the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and for specific implementation reference may be made to the method embodiment, which will not be repeated here.
The relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In the description of the present application, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the orientations or positional relationships shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present application. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person familiar with the technical field may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some of the technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for three-dimensional modeling of urban scenes, the method comprising:
acquiring an original image sequence obtained by an unmanned aerial vehicle taking aerial photographs of a city scene, and target camera parameters corresponding to the original image sequence; the target camera parameters are obtained after parameter interpolation processing is carried out on the basis of the original camera parameters corresponding to the original image sequence;
inputting the target camera parameters into a pre-trained neural radiation field model, and outputting an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence through preset processing; the neural radiation field model is obtained by implicit modeling based on the original image sequence and the original camera parameters;
Determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence;
And carrying out three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain the three-dimensional model of the urban scene corresponding to the target camera parameters.
2. The method of claim 1, wherein the step of obtaining target camera parameters corresponding to the original image sequence comprises:
acquiring an original track point sequence corresponding to the original image sequence;
determining a track point to be inserted corresponding to each original track point in the original track point sequence;
and inserting the camera parameters of the corresponding images in the original camera parameters corresponding to the original image sequence according to the track points to be inserted to obtain target camera parameters.
3. The method of claim 2, wherein the step of determining a track point to be inserted for each original track point in the sequence of original track points comprises:
For each original track point, determining images corresponding to all track points in a corresponding camera shooting window according to the images corresponding to the original track points;
calculating the similarity between the image corresponding to the original track point and the images corresponding to all the track points respectively; and detecting, in descending order of similarity, a target track point whose corresponding image has a change value of its two-dimensional coordinate position in the direction normal to the trajectory that exceeds a threshold value;
And selecting a track point at intervals of a preset length on the connecting line of the original track point and the target track point as a track point to be inserted corresponding to the original track point.
4. The method according to claim 1, wherein the step of inputting the target camera parameters into a pre-trained neural radiation field model, outputting an optimized image sequence corresponding to the original image sequence, and a coarse-granularity depth image sequence through a preset process, comprises:
inputting the target camera parameters into a pre-trained neural radiation field model, and outputting an optimized image sequence corresponding to the original image sequence through neural rendering;
and performing guided ray sampling on the optimized image sequence, refining the voxel octree, and synthesizing a coarse-granularity depth image sequence by applying a depth rendering formula.
5. The method of claim 1, wherein determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence comprises:
determining a fine-granularity depth image sequence according to the original image sequence and the optimized image sequence;
And carrying out image complementation on the fine-granularity depth image sequence by applying the coarse-granularity depth image sequence to obtain a dense depth image sequence.
6. The method of claim 5, wherein determining a fine-grained depth image sequence from the original image sequence and the optimized image sequence comprises:
setting, for each original image in the original image sequence, a rectangular window centered on a specified pixel of the original image; for each target pixel in the rectangular window, sequentially matching the corresponding pixels in the images of the optimized image sequence by adopting a homography matrix, and calculating a normalized cross-correlation value of the target pixel and the matched pixels to obtain a matching aggregation cost function of the specified pixel based on a color consistency constraint;
And solving a depth value corresponding to each specified pixel by adopting a global optimization method based on the matching aggregation cost function corresponding to each specified pixel to obtain a depth map sequence with fine granularity.
7. The method of claim 1, wherein the training process of the neural radiation field model is as follows:
Acquiring a training sample set, wherein samples in the sample set comprise the original image sequence and original camera parameters corresponding to the original image sequence;
dividing the original image sequence into a plurality of image groups; each image group corresponds to a block in the whole city scene divided in advance;
Dividing each image in the image group into a foreground part and a background part to obtain a foreground image group and a background image group;
and performing implicit modeling on the foreground image group and the background image group respectively through the neural radiation field to obtain the neural radiation field model.
8. An urban scene three-dimensional modeling apparatus, characterized in that it comprises:
The camera parameter acquisition module is used for acquiring an original image sequence obtained by taking an aerial photo of a city scene by the unmanned aerial vehicle and a target camera parameter corresponding to the original image sequence; the target camera parameters are obtained after parameter interpolation processing is carried out on the basis of the original camera parameters corresponding to the original image sequence;
The model processing module is used for inputting the target camera parameters into a pre-trained neural radiation field model, and outputting an optimized image sequence corresponding to the original image sequence and a coarse-granularity depth image sequence through preset processing; the neural radiation field model is obtained by implicit modeling based on the original image sequence and the original camera parameters;
A depth image optimization module for determining a dense depth image sequence based on the original image sequence, the optimized image sequence, and the coarse-granularity depth image sequence;
And the three-dimensional modeling module is used for carrying out three-dimensional modeling according to the dense depth image sequence and the corresponding camera parameters to obtain a three-dimensional model of the urban scene corresponding to the target camera parameters.
9. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 7.
CN202311174479.6A 2023-09-12 2023-09-12 Urban scene three-dimensional modeling method and device and electronic equipment Active CN116958449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311174479.6A CN116958449B (en) 2023-09-12 2023-09-12 Urban scene three-dimensional modeling method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311174479.6A CN116958449B (en) 2023-09-12 2023-09-12 Urban scene three-dimensional modeling method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN116958449A CN116958449A (en) 2023-10-27
CN116958449B true CN116958449B (en) 2024-04-30

Family

ID=88456753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311174479.6A Active CN116958449B (en) 2023-09-12 2023-09-12 Urban scene three-dimensional modeling method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116958449B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654492A (en) * 2015-12-30 2016-06-08 哈尔滨工业大学 Robust real-time three-dimensional (3D) reconstruction method based on consumer camera
WO2022222077A1 (en) * 2021-04-21 2022-10-27 浙江大学 Indoor scene virtual roaming method based on reflection decomposition
CN115512042A (en) * 2022-09-15 2022-12-23 网易(杭州)网络有限公司 Network training and scene reconstruction method, device, machine, system and equipment
CN115841559A (en) * 2022-12-30 2023-03-24 遥在(山东)数字科技有限公司 Urban large scene reconstruction method based on nerve radiation field
CN116071278A (en) * 2022-11-29 2023-05-05 西安交通大学 Unmanned aerial vehicle aerial image synthesis method, system, computer equipment and storage medium
CN116091705A (en) * 2023-03-21 2023-05-09 清华大学 Variable topology dynamic scene reconstruction and editing method and device based on nerve radiation field
CN116310111A (en) * 2023-03-16 2023-06-23 上海科技大学 Indoor scene three-dimensional reconstruction method based on pseudo-plane constraint

Also Published As

Publication number Publication date
CN116958449A (en) 2023-10-27

Similar Documents

Publication Publication Date Title
US11170569B2 (en) System and method for virtual modeling of indoor scenes from imagery
EP3242275B1 (en) Using photo collections for three dimensional modeling
Buehler et al. Non-metric image-based rendering for video stabilization
CN108335353A (en) Three-dimensional rebuilding method, device and system, server, the medium of dynamic scene
KR20210119417A (en) Depth estimation
US20120081357A1 (en) System and method for interactive painting of 2d images for iterative 3d modeling
CA2813742A1 (en) Rapid 3d modeling
US20040222989A1 (en) System and method for feature-based light field morphing and texture transfer
Won et al. End-to-end learning for omnidirectional stereo matching with uncertainty prior
US11887256B2 (en) Deferred neural rendering for view extrapolation
CN113689578B (en) Human body data set generation method and device
CN116310076A (en) Three-dimensional reconstruction method, device, equipment and storage medium based on nerve radiation field
US20210374986A1 (en) Image processing to determine object thickness
CN116071278A (en) Unmanned aerial vehicle aerial image synthesis method, system, computer equipment and storage medium
Mordohai et al. Real-time video-based reconstruction of urban environments
CN114782628A (en) Indoor real-time three-dimensional reconstruction method based on depth camera
CN114494589A (en) Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and computer-readable storage medium
CN111598927A (en) Positioning reconstruction method and device
Koch Automatic reconstruction of buildings from stereoscopic image sequences
Esteban et al. Fit3d toolbox: multiple view geometry and 3d reconstruction for matlab
CN116152442B (en) Three-dimensional point cloud model generation method and device
Evers‐Senne et al. Image based interactive rendering with view dependent geometry
CN116958449B (en) Urban scene three-dimensional modeling method and device and electronic equipment
CN116704112A (en) 3D scanning system for object reconstruction
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant