CN117726758A - Rapid large-scale three-dimensional reconstruction method based on neural radiance field - Google Patents

Rapid large-scale three-dimensional reconstruction method based on neural radiance field

Info

Publication number
CN117726758A
Authority
CN
China
Prior art keywords
camera
cluster
sub
scene
cameras
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410020121.6A
Other languages
Chinese (zh)
Inventor
劳奕臻
张惠晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202410020121.6A priority Critical patent/CN117726758A/en
Publication of CN117726758A publication Critical patent/CN117726758A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention provides a rapid large-scale three-dimensional reconstruction method based on a neural radiance field under the constraint of limited video memory resources, belonging to the technical field of computer vision. By introducing the divide-and-conquer idea, the invention designs a spatial partitioning strategy for sub-scenes and an automatic rendering strategy based on pixel projection, mainly comprising the following steps: preparing a group of orthographic or oblique aerial survey images captured by an unmanned aerial vehicle; using incremental SfM to estimate the camera poses and reconstruct a sparse point cloud; clustering the cameras with the K-Means algorithm based on their spatial distribution; expanding the preliminarily divided camera clusters with a scaling factor; inputting the images corresponding to all cameras in each camera cluster into a NeRF model for independent training, and saving the model parameters; and applying a novel automatic rendering strategy based on pixel projection to render query camera images automatically. The invention effectively alleviates the long runtime and high video memory consumption of large-scale three-dimensional reconstruction.

Description

Rapid large-scale three-dimensional reconstruction method based on neural radiance field
Technical Field
The invention relates to the technical field of computer vision, and in particular to a rapid large-scale three-dimensional reconstruction method based on a neural radiance field under the constraint of limited video memory resources, which addresses the problems of long runtime and high video memory consumption when performing large-scale three-dimensional reconstruction from large-scale input image sets.
Background
Urban-level large-scale three-dimensional reconstruction is an active and important research topic in the fields of photogrammetry and remote sensing. It generally involves creating an accurate and detailed 3D model of an entire city using various data sources, such as aerial or satellite images, lidar data, and street-level images. With the rapid development of aerial survey technology, obtaining high-resolution images is now easy and inexpensive. Image-based large-scale three-dimensional reconstruction is therefore a very active research area with wide application scenarios, such as urban planning, navigation, disaster management, and historical building preservation.
The existing image-based three-dimensional reconstruction techniques fall mainly into two categories: traditional geometry-based methods and neural-network-based methods. The geometry-based pipeline is largely divided into two steps, structure from motion (Structure From Motion, SfM) and multi-view stereo (Multiple View Stereo, MVS): SfM estimates the camera poses from the images and generates a sparse point cloud, and MVS refines the point cloud into a dense one. The other category is the neural-network-based reconstruction methods represented by neural radiance fields (Neural Radiance Fields, NeRF), a new milestone of three-dimensional reconstruction technology in recent years. NeRF implicitly represents three-dimensional scene information by training network parameters on input images and the corresponding camera poses, and can generate images from new viewing angles. However, city-level large-scale reconstruction faces two major challenges. First, large-scale reconstruction typically requires processing and storing large amounts of data, which can quickly exceed the video memory of a single GPU during reconstruction; this leads to slow processing, out-of-memory errors and other performance problems, and is very unfriendly to users or researchers with insufficient video memory resources. Second, large-scale reconstruction is usually very time-consuming, and with the growing demand for real-time or near-real-time applications such as navigation or disaster management, research on faster and more efficient large-scale three-dimensional reconstruction techniques is all the more urgent.
Given the desire to create more accurate, detailed and useful large-scale 3D models, and the need to address the practical challenges of large datasets and limited computational resources, this patent proposes what is, to date, the fastest large-scale three-dimensional reconstruction method that remains robust under limited video memory resources.
Disclosure of Invention
The invention provides a rapid and efficient large-scale three-dimensional reconstruction method based on a neural radiance field which can effectively process large amounts of input data under limited video memory resources, addressing the problems that existing large-scale reconstruction techniques still train very slowly and demand large amounts of video memory. The method mainly comprises two parts: a spatial partitioning strategy for sub-scenes and an automatic rendering strategy based on pixel projection. Specifically, the invention provides a method for rapid large-scale three-dimensional reconstruction based on a neural radiance field under the constraint of limited video memory resources, characterized by comprising the following steps:
step 1, preparing input images; firstly, a group of N orthographic or oblique images is input, all images having a consistent resolution;
step 2, camera pose estimation and sparse reconstruction; the camera model defaults to a pinhole camera model, and COLMAP is used to perform camera pose estimation and sparse point cloud reconstruction on all input images, yielding the camera pose corresponding to each image and the sparse point cloud information, wherein the camera pose comprises the position of the camera in space and the orientation of the camera;
step 3, initial clustering division of the camera distribution; according to the spatial position distribution of the cameras of all input images, a K-Means clustering algorithm is first used to perform initial clustering and divide the cameras into camera clusters.
step 4, camera cluster expansion; a scaling factor σ is set for the divided camera clusters to perform radius expansion, increasing the number of cameras in each cluster and the degree of overlap between clusters; the images captured by the cameras in each camera cluster together form the training input of the corresponding sub-scene.
step 5, sub-model training; the images corresponding to all cameras in each expanded camera cluster are input into a NeRF model for independent training, and the sub-scene model parameters are saved to the local disk.
step 6, automatic rendering of the query camera image.
The invention provides a novel automatic rendering strategy based on pixel projection: after model training is completed, the corresponding sub-scenes are automatically selected for any query camera viewpoint, so that the viewpoint is rendered partially or completely. When the query camera viewpoint requires several sub-scene models to each render a part of the output, the rendered parts must be aligned and stitched after rendering to obtain the complete query camera output image; when the query camera viewpoint can be rendered entirely by a single sub-scene model, the query camera image is output directly after rendering.
Furthermore, the input image preparation in step 1 first requires flying a rotor-wing unmanned aerial vehicle along a specific route over a given area and shooting at fixed points during the flight; the orientation angle of the camera during shooting may vary within the range of 30-90 degrees, yielding a group of N orthographic or oblique aerial survey images captured by the unmanned aerial vehicle. Images captured by the same unmanned aerial vehicle generally have a consistent resolution; if the resolutions are inconsistent, the images must be resampled to a uniform resolution. The input image resolution is typically 6000 x 4000.
Further, in step 2, COLMAP is used to perform pose estimation on all input images (the position O_i of each camera in space and its orientation R_i) and to reconstruct the sparse point cloud, thereby obtaining the camera pose corresponding to each image and the sparse point cloud information. The default camera model is set to the pinhole camera model (PINHOLE), all cameras are assumed to share the same intrinsic parameters, and incremental reconstruction is performed, yielding the poses of all cameras and the sparse point cloud.
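As an illustrative sketch only (the patent specifies COLMAP but no particular interface), this step can be driven from Python through the pycolmap bindings; the paths below are hypothetical, and the camera_model keyword is assumed to be accepted by the installed pycolmap version:

```python
# Illustrative sketch of step 2 via the pycolmap bindings (paths are hypothetical;
# the camera_model keyword is assumed to exist in the installed pycolmap version).
from pathlib import Path
import pycolmap

database_path = "workspace/database.db"
image_dir = "workspace/images"
sparse_dir = "workspace/sparse"
Path(sparse_dir).mkdir(parents=True, exist_ok=True)

# Feature extraction with one shared PINHOLE camera model for all images.
pycolmap.extract_features(
    database_path, image_dir,
    camera_mode=pycolmap.CameraMode.SINGLE,   # all images share one intrinsic set
    camera_model="PINHOLE",
)
pycolmap.match_exhaustive(database_path)      # feature matching

# Incremental SfM: camera poses (orientation R_i, position O_i) plus sparse points.
maps = pycolmap.incremental_mapping(database_path, image_dir, sparse_dir)
reconstruction = maps[0]                      # first (usually largest) model
print(reconstruction.summary())
```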
Further, the initial clustering division based on camera distribution in step 3 is implemented with a K-Means clustering algorithm according to the spatial position distribution of the cameras of all input images, with the following algorithm steps:
Step 31: set the maximum number of cameras per cluster, MaxNum; the number of clusters in the K-Means algorithm (the K value) can then be set to K = N/MaxNum.
Step 32: select K initial samples (camera positions) as the initial cluster centers (O_1, O_2, O_3, ..., O_K). Denoting the resulting clusters by (C_1, C_2, C_3, ..., C_K), the sum-of-squared-errors objective to be minimized is E = Σ_{j=1}^{K} Σ_{X∈C_j} ||X − O_j||².
Step 33: for each sample X_i in the dataset, compute its distance to each of the K cluster centers and assign it to the class of the nearest center.
Step 34: for each class C_j, recompute its cluster center O_j = (1/|C_j|) Σ_{X∈C_j} X (i.e., the centroid of all samples belonging to that class).
Step 35: repeat steps 33 and 34 until the error falls below the minimum error threshold, obtaining the camera clusters (C_1, C_2, C_3, ..., C_K).
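A minimal sketch of steps 31-35, assuming scikit-learn's KMeans as the implementation (the patent specifies only the generic K-Means procedure):

```python
# Minimal sketch of steps 31-35: cluster cameras by their spatial positions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_cameras(camera_positions: np.ndarray, max_num: int):
    """camera_positions: (N, 3) array of camera centers O_i in world coordinates."""
    n = len(camera_positions)
    k = max(1, n // max_num)                        # step 31: K = N / MaxNum
    km = KMeans(n_clusters=k, n_init=10).fit(camera_positions)  # steps 32-35
    clusters = [np.where(km.labels_ == j)[0] for j in range(k)]
    return clusters, km.cluster_centers_            # (C_1..C_K), centers (O_1..O_K)
```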
Further, the camera cluster expansion in step 4 increases the number of cameras inside each cluster mainly by scaling the intra-cluster radius, with the following algorithm steps:
Step 41: for each initially partitioned camera cluster C_i obtained in step 3, take the cluster center O_i as the circle center, compute the distance from every camera in the cluster to O_i, and take the maximum distance as the intra-cluster radius r_i. Set the upper limit TopNum on the number of cameras per cluster after expansion.
Step 42: set the scaling factor σ = 1.1 and enlarge the intra-cluster radius to r_i' = σ·r_i. Traverse all cameras and compute each camera's distance d_j to the cluster center O_i; if d_j is less than the new intra-cluster radius r_i', add the camera to the cluster, and stop expanding the cluster once the number of cameras in it reaches the upper limit TopNum.
Step 43: repeat steps 41 and 42 until all the camera clusters divided in step 3 have been expanded.
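A minimal sketch of steps 41-43 under the stated parameters (σ = 1.1; TopNum is the user-set cap, the default below is illustrative), continuing from the clustering sketch above:

```python
# Sketch of steps 41-43: enlarge each cluster's radius by sigma and absorb nearby
# cameras (nearest first) until the per-cluster cap TopNum is reached.
import numpy as np

def expand_clusters(positions, clusters, centers, sigma=1.1, top_num=300):
    """positions: (N, 3) camera centers; clusters: list of index arrays;
    centers: (K, 3) cluster centers from cluster_cameras."""
    expanded = []
    for c_idx, center in zip(clusters, centers):
        r = np.linalg.norm(positions[c_idx] - center, axis=1).max()  # step 41
        r_new = sigma * r                                            # step 42
        d = np.linalg.norm(positions - center, axis=1)
        members = set(int(j) for j in c_idx)
        for j in np.argsort(d):                  # nearest candidates first
            if len(members) >= top_num:          # stop at the TopNum cap
                break
            if d[j] < r_new:
                members.add(int(j))
        expanded.append(np.array(sorted(members)))                   # step 43
    return expanded
```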
Further, the sub-model training in step 5 is specifically as follows: NeRF is a scene representation and view synthesis method that allows highly realistic 3D scenes to be synthesized from 2D images. First, a ray is constructed in space from the camera's optical center through each pixel's coordinates, and every 3D point on the ray has a position and a viewing direction. Positionally encoding each 3D point on the ray enables the NeRF model to learn higher-frequency variations. The encoded position and direction of each 3D point are then fed into the NeRF model, which outputs the color and opacity of that point; the color value of the image pixel corresponding to the ray is computed by the volume rendering equation, and a loss against the color value of the original image supervises model training. The result is an implicit representation of the three-dimensional scene, from which a camera view at any viewing angle can be rendered. In this invention, the selected NeRF model is Instant-NGP; each camera cluster obtained in step 4 forms an independently trainable sub-scene dataset, the images corresponding to the cameras contained in each cluster are input into Instant-NGP for independent training, and the resulting model parameters are saved to disk.
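For reference, the volume rendering equation referred to above takes, in discrete form, the standard NeRF formulation (a well-known result quoted here for completeness, not taken verbatim from the patent text):

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{M} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
```

where σ_i and c_i are the opacity (density) and color predicted at the i-th sample along the ray r, δ_i is the distance between adjacent samples, and T_i is the accumulated transmittance.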
Further, the automatic rendering of the query camera image in step 6 is one of the key steps of the invention and is divided into the following sub-steps:
Step 61: plane fitting; first, the point cloud data measured by COLMAP, namely the position coordinates P(x, y, z) of the 3D points in space, are extracted. A ground plane is then fitted to these 3D points, and the plane parameters are solved using the least squares method.
Step 62: the pixels are projected and a sub-scene range box is obtained.
The invention determines the scene range corresponding to an image by projecting the four corner points of the image onto the ground plane fitted from the sparse point cloud. In step 4, all cameras were divided into camera clusters, and the images corresponding to the cameras in each cluster are trained into one sub-scene; that is, each sub-scene comprises several shooting cameras, and the sub-scene range box is composed of the scene range boxes of all cameras contained in that scene.
Step 63: sub-scene selection and query camera image rendering; compute the scene range box of the query camera on the ground plane, traverse all sub-scene range boxes, find every sub-scene range box whose intersection area with the query camera's projection box exceeds a threshold, record these as the rendering scheme, and then call the corresponding trained sub-scene models to render the query camera's image. When the query camera viewpoint requires several sub-scene models to each render a part of the output, the rendered parts are aligned and stitched after rendering to obtain the complete query camera output image; when the query camera viewpoint can be rendered entirely by a single sub-scene model, the query camera image is output directly after rendering.
The invention has the specific advantages that:
the sub-scene space division strategy and the pixel projection-based automatic rendering strategy are designed by introducing the divide-and-conquer algorithm idea, wherein the sub-scene space division strategy comprises initial cluster division of camera distribution and camera cluster expansion, and the rapid large-scale three-dimensional reconstruction under the limited video memory resource is realized. Compared with the prior art, the method has the remarkable advantages of larger input data protocol, higher speed of completing large-scale three-dimensional reconstruction, higher rendering image quality and the like.
Drawings
In order to more clearly illustrate the solution in the present application, a brief description of the drawings that are used in the description of the embodiments of the present application will be provided below.
FIG. 1 is a schematic diagram of the method of the present invention for rapid large-scale three-dimensional reconstruction based on a neural radiance field.
Fig. 2 is a flow chart of the method of the present invention for rapid large-scale three-dimensional reconstruction based on a neural radiance field.
Fig. 3 is a schematic view of plane fitting and pixel projection in the region dividing step of the neural radiance field based rapid large-scale three-dimensional reconstruction method of the present invention.
Fig. 4 is a schematic diagram of sub-scene range box construction in the neural radiance field based rapid large-scale three-dimensional reconstruction method of the present invention.
Fig. 5 is a schematic diagram of rendering scheme construction in the neural radiance field based rapid large-scale three-dimensional reconstruction method of the present invention.
Fig. 6 shows example results of the neural radiance field based rapid large-scale three-dimensional reconstruction method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The description of the exemplary embodiments is presented for purposes of illustration only and is in no way intended to limit the invention, its application, or its uses. According to one embodiment of the invention, a rapid large-scale three-dimensional reconstruction method based on a neural radiance field is provided, mainly addressing large-scale three-dimensional reconstruction and novel-view rendering; an algorithm schematic is shown in fig. 1.
In this embodiment, the method for rapid large-scale three-dimensional reconstruction based on the neural radiance field mainly includes two major parts: a spatial partitioning strategy for sub-scenes and an automatic rendering strategy based on pixel projection. The algorithm flow is shown in fig. 2 and mainly comprises the following steps:
S1, preparing input images; a set of N orthographic or oblique images captured by a rotor-wing unmanned aerial vehicle is first input, all images having a consistent resolution.
The flight path of the rotor-wing unmanned aerial vehicle is set as a rectangular-spiral ("回"-shaped) route flown from the outside inward, with fixed-point shooting during the flight; the camera orientation angle varies within 50-70 degrees for 70% of the shots, within 70-90 degrees for 15% of the shots, and within 20-50 degrees for the remaining 15%, yielding a group of N aerial survey images captured by the unmanned aerial vehicle. Images captured by the same unmanned aerial vehicle generally have a consistent resolution; if the resolutions are inconsistent, the images must be resampled to a uniform resolution. The input image resolution is typically 6000 x 4000.
S2, camera pose estimation and sparse reconstruction; COLMAP is used to perform pose estimation (the position of each camera in space and its orientation) and sparse point cloud reconstruction on all input images, yielding the camera pose and sparse point cloud information corresponding to each image. The default camera model is the pinhole camera model (PINHOLE) with the same intrinsic parameters for all cameras; feature extraction, feature matching, sparse reconstruction, and local and global bundle adjustment are performed in turn, yielding the poses of all cameras and the sparse point cloud.
S3, initial clustering division of the camera distribution; according to the spatial position distribution of the cameras of all input images, a K-Means clustering algorithm is used to initially divide the cameras into camera clusters.
K-Means is a common clustering algorithm based on Euclidean distance, which holds that the closer two targets are, the greater their similarity. Based on the spatial position distribution of the cameras, the invention uses the K-Means algorithm to perform a preliminary clustering division of the cameras, with the following main steps:
S31: set the maximum number of cameras per cluster, MaxNum; the number of clusters in the K-Means algorithm (the K value) can then be set to K = N/MaxNum.
S32: select K initial samples (camera positions) as the initial cluster centers (O_1, O_2, O_3, ..., O_K). Denoting the resulting clusters by (C_1, C_2, C_3, ..., C_K), the sum-of-squared-errors objective to be minimized is E = Σ_{j=1}^{K} Σ_{X∈C_j} ||X − O_j||².
S33: for each sample X_i in the dataset, compute its distance to each of the K cluster centers and assign it to the class of the nearest center;
S34: for each class C_j, recompute its cluster center O_j = (1/|C_j|) Σ_{X∈C_j} X (i.e., the centroid of all samples belonging to that class);
S35: repeat steps S33 and S34 until the error falls below the minimum error threshold, obtaining the initially divided camera clusters (C_1, C_2, C_3, ..., C_K).
S4, camera cluster expansion; a scaling factor σ is set for the divided camera clusters to perform radius expansion, increasing the number of cameras in each cluster and the degree of overlap between clusters; the images captured by the cameras in each camera cluster together form the training input of the corresponding sub-scene. The specific steps are as follows:
s41: for the initially partitioned camera cluster C obtained in step 3 i (i=1, 2,.,. K) calculating camera-to-cluster center O in a camera cluster with the cluster center as the center of the circle i And the maximum distance is taken as the intra-cluster radius r i . The maximum number of cameras per camera cluster after expansion is set to TopNum.
S42: setting the scaling factor σ=1.1, the new cluster inner radius r is amplified i ’=σ*r i Traversing all cameras, computing the cameras to the center O in the cluster i Distance d of (2) j If d j Less than the new intra-cluster radius r i And', adding the camera into the cluster, and stopping expanding the camera cluster when the camera cluster in the cluster reaches the upper limit TopNum.
S43: and repeating the steps 41 and 42 until the camera clusters divided in the step 3 are fully expanded, namely the space division of the sub-scene is completed. S5, training a sub-model; and inputting the shooting images corresponding to all cameras in each camera cluster subjected to expansion into a NeRF model for independent training, and storing relevant model parameters into a local disk.
NeRF introduced a new scene representation and view synthesis method allowing highly realistic 3D scenes to be synthesized from 2D images. Training begins by capturing a set of 2D images of the scene from different viewpoints. For each pixel in an image, NeRF computes a corresponding ray in 3D space. NeRF estimates scene color and opacity at each point along these rays, then compares the predicted color values with the observed ones and adjusts the network parameters to minimize the difference. After training, NeRF can synthesize a new view of the scene by casting rays from the virtual camera location and accumulating color and opacity along each ray using the learned neural radiance field. This allows photographs to be generated from previously unseen viewpoints, providing realistic scene renderings. Instant-NGP is to date the fastest neural radiance field model and is of great engineering value. That work proposes multi-resolution hash encoding to accelerate model training; augmenting the model with multi-resolution hash tables of trainable feature vectors further reduces model size, and the whole system is implemented with fully fused CUDA kernels to exploit parallelism to the utmost.
The sub-scene model training in this embodiment of the invention is a three-dimensional reconstruction method based on a neural radiance field, with Instant-NGP as the specific model. On the basis of the camera clusters divided in step S4, the sub-scene corresponding to each camera cluster is trained independently with the number of training iterations set to 5000; after training, the model network parameters are saved to the local disk. Storing a model as network parameters consumes far less disk space than other representations such as point clouds or meshes.
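A sketch of how the per-cluster training of this embodiment might be orchestrated; it assumes the instant-ngp repository's scripts/run.py together with its --scene, --n_steps and --save_snapshot arguments, which should be verified against the actual checkout:

```python
# Sketch of S5: train one Instant-NGP model per sub-scene and snapshot it to disk.
# The scripts/run.py flags below are assumptions; check your instant-ngp version.
import subprocess
from pathlib import Path

def train_subscenes(scene_dirs, out_dir="snapshots", n_steps=5000):
    """scene_dirs: one directory per camera cluster (images + transforms.json)."""
    Path(out_dir).mkdir(exist_ok=True)
    for i, scene in enumerate(scene_dirs):
        snapshot = Path(out_dir) / f"subscene_{i}.msgpack"
        subprocess.run([
            "python", "scripts/run.py",
            "--scene", str(scene),
            "--n_steps", str(n_steps),        # 5000 iterations, as in the embodiment
            "--save_snapshot", str(snapshot), # sub-scene model parameters on disk
        ], check=True)
```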
S6, automatic rendering of the query camera image; the invention provides a novel automatic rendering strategy based on pixel projection: after model training is completed, the corresponding sub-scenes are automatically selected for any query camera viewpoint, so that the viewpoint is rendered partially or completely. The specific algorithm comprises the following steps:
s61: fitting a plane; first, point cloud data measured by COLMAP, namely position coordinates P (x, y, z) of 3D points in space, are extracted. These 3D points are then fitted to a ground plane and the plane parameters are solved using a least squares method.
S62: pixel projection and sub-scene range box acquisition; as shown in fig. 3, the invention determines the scene range corresponding to an image by projecting the four corner points of the image onto the ground plane fitted from the sparse point cloud. The method comprises the following steps:
s621: for the ith image (i-th image) captured by the pinhole camera model, the coordinates of the four corner points of the image in the camera coordinate system are (p i 1 ,p i 2 ,p i 3 ,p i 4 ) Camera optical center o of shooting camera corresponding to ith image i =[0,0,0]Is required to use(wherein k=1, 2,3,4, r i ,T i Representing camera pose information of a camera corresponding to the ith image in a world coordinate system) converts the four corner points from the camera coordinate system to the world coordinate system, and then obtains four projection points (G) by crossing a ground plane through a connecting line formed by the optical center and the corner points i 1 ,G i 2 ,G i 3 ,G i 4 ). After these intersections are obtained, the smallest rectangle that can wrap the points can be obtained as the range of scenes that the camera can capture.
In step S4, all cameras were divided into camera clusters, and the images corresponding to the cameras in each cluster are trained into one sub-scene; that is, each sub-scene comprises several shooting cameras, and the sub-scene range box is composed of the scene range boxes of all cameras contained in that scene. As shown in fig. 4, step S621 is repeated to compute the projected scene range boxes of all cameras contained in one camera cluster; the smallest rectangle wrapping all of these boxes is the scene range box of the sub-scene.
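Following the corner-projection sketch above, this aggregation reduces to a running min/max over the per-camera footprints; a minimal sketch:

```python
# Sketch of S62's aggregation: the sub-scene range box is the smallest axis-aligned
# rectangle wrapping the footprints of every camera in the cluster.
import numpy as np

def subscene_box(footprints):
    """footprints: list of (min_xy, max_xy) pairs as returned by corner_footprint."""
    mins = np.array([f[0] for f in footprints]).min(axis=0)
    maxs = np.array([f[1] for f in footprints]).max(axis=0)
    return mins, maxs
```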
S63: sub-scene selection and query camera image rendering; compute the projected scene range box of the query camera on the ground plane, traverse all sub-scene range boxes, find every sub-scene range box whose intersection area with the query camera's projection box exceeds a threshold, record these as the rendering scheme, and then call the corresponding trained sub-scene models to render the query camera's image. As shown in fig. 5, when the query camera viewpoint requires several sub-scene models to each render a part of the output, the rendered parts must be aligned and stitched after rendering to obtain the complete query camera output image; when the query camera viewpoint can be rendered entirely by a single sub-scene model, the complete query camera image is output directly after rendering. The specific steps are as follows (a minimal code sketch of the whole procedure is given after step S632):
s631: when the query camera is located at a plurality of scene boundaries (the intersection area of the projection scene range frame and the plurality of sub-scene range frames of the query camera is greater than the threshold alpha), the rendered image needs to be composed of a part of each of the several sub-scene renderings. After different sub-models are called to generate corresponding partial rendering graphs, an image stitching algorithm integrated with multi-panorama, gain compensation, simple blending and multi-band blending is adopted to stitch the partial rendering graphs into a complete rendering result graph for output.
S632: when the query camera lies inside a single scene (the intersection area of the query camera's scene range box with some sub-scene range box equals the area of the query camera's own box), the complete image can be rendered by a single sub-scene, and that sub-scene's rendering is output directly. Fig. 6 shows example results of the rapid large-scale three-dimensional reconstruction method based on the neural radiance field of the present invention.
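A minimal end-to-end sketch of S63, assuming the axis-aligned range boxes produced by the footprint sketches above; the threshold alpha stands for the patent's intersection-area threshold α, render_with_submodel is a hypothetical hook into the trained sub-scene models, and OpenCV's high-level Stitcher stands in for the stitching pipeline of S631 (the patent does not name a particular library):

```python
# Sketch of S63: select sub-scenes whose range boxes overlap the query camera's
# footprint, render each part, and stitch if more than one sub-scene is involved.
import cv2

def intersection_area(box_a, box_b):
    (ax0, ay0), (ax1, ay1) = box_a
    (bx0, by0), (bx1, by1) = box_b
    w = min(ax1, bx1) - max(ax0, bx0)
    h = min(ay1, by1) - max(ay0, by0)
    return max(w, 0.0) * max(h, 0.0)

def render_query(query_box, subscene_boxes, render_with_submodel, alpha):
    # Rendering scheme: every sub-scene whose overlap with the query box > alpha.
    scheme = [i for i, box in enumerate(subscene_boxes)
              if intersection_area(query_box, box) > alpha]
    partials = [render_with_submodel(i) for i in scheme]
    if len(partials) == 1:        # S632: fully covered by a single sub-scene
        return partials[0]
    # S631: stitch the partial renderings (gain compensation and multi-band
    # blending are handled internally by OpenCV's Stitcher).
    stitcher = cv2.Stitcher_create()
    status, pano = stitcher.stitch(partials)
    if status != cv2.Stitcher_OK:
        raise RuntimeError(f"stitching failed with status {status}")
    return pano
```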
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the present teachings should therefore be determined not with reference to the above description, but with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. The disclosures of all articles and references, including patent applications and publications, are incorporated herein by reference for all purposes. The omission from the preceding claims of any aspect of the subject matter disclosed herein is not a disclaimer of such subject matter, nor should it be regarded as meaning that the inventors did not consider such subject matter to be part of the disclosed subject matter.

Claims (10)

1. A rapid large-scale three-dimensional reconstruction method based on a neural radiance field, characterized by comprising the following steps:
S1, preparing input images; firstly, a group of N orthographic or oblique images is input, all images having a consistent resolution;
S2, camera pose estimation and sparse reconstruction; the camera model defaults to a pinhole camera model, and COLMAP is used to perform camera pose estimation and sparse point cloud reconstruction on all input images, yielding the camera pose corresponding to each image and the sparse point cloud information, wherein the camera pose comprises the position of the camera in space and the orientation of the camera;
S3, initial clustering division of the camera distribution; according to the spatial position distribution of the cameras of all input images, a K-Means clustering algorithm is used to perform initial clustering and divide the cameras into camera clusters;
S4, camera cluster expansion; a scaling factor σ is set for the divided camera clusters to perform radius expansion, increasing the number of cameras in each cluster and the degree of overlap between clusters; the images captured by the cameras in each camera cluster together form the training input of the corresponding sub-scene;
S5, sub-model training; the images corresponding to all cameras in each expanded camera cluster are input into a NeRF model for independent training, and the sub-scene model parameters are saved;
S6, automatic rendering of the query camera image; after sub-model training is completed, an automatic rendering strategy based on pixel projection performs the corresponding automatic sub-scene selection for any query camera viewpoint, so as to render the query camera viewpoint partially or completely; when the query camera viewpoint requires several sub-scene models to each render a part of the output, the rendered parts are aligned and stitched after rendering to obtain the complete query camera output image; when the query camera viewpoint is rendered entirely by a single sub-scene model, the query camera image is output directly after rendering.
2. The method of claim 1, wherein the input image preparation in step S1 is implemented as: firstly, a rotor-wing unmanned aerial vehicle flies a scheduled route over the target area with fixed-point shooting during the flight, the camera orientation angle during shooting varying within the range of 30-90 degrees, to obtain a group of N orthographic or oblique aerial survey images of the same resolution captured by the unmanned aerial vehicle.
3. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 1, wherein in step S2 all camera models are set to the same intrinsic parameters and incremental SfM reconstruction is performed, thereby obtaining the poses of all cameras and the sparse point cloud.
4. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 1, wherein the initial clustering in step S3 is implemented with the K-Means clustering algorithm according to the spatial position distribution of the cameras of all input images.
5. The method of claim 1, wherein in step S4 the camera cluster expansion increases the number of intra-cluster cameras mainly by scaling the intra-cluster radius, implemented as follows:
for each camera cluster obtained in step S3, take the cluster center as the circle center and the maximum distance from any camera in the cluster to the cluster center as the radius r_i; scale the radius by σ = 1.1 to obtain r_i' = σ·r_i; cameras not originally in the cluster that fall within the circle of the new radius r_i' are added to the cluster, and expansion stops when all eligible cameras have been added or the set maximum cluster camera capacity TopNum is reached.
6. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 1, wherein in step S5 the images corresponding to all cameras in each expanded camera cluster are input into a NeRF model for independent training, and the resulting model parameters are saved.
7. The method of claim 1, wherein in step S6 the automatic rendering strategy based on pixel projection includes plane fitting, pixel projection and sub-scene range box acquisition, and sub-scene selection and query camera image rendering.
8. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 7, wherein the plane fitting part of the pixel-projection-based automatic rendering strategy is implemented as: firstly, the point cloud data computed by COLMAP are extracted, a ground plane is fitted to the point cloud, and the plane parameters are solved using the least squares method.
9. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 7, wherein the pixel projection and sub-scene range box acquisition part of the pixel-projection-based automatic rendering strategy is implemented as: the projected scene range box of each camera is computed from the projections of the image corner points, and the projected scene range box of each sub-scene is then computed from the cameras contained in each cluster.
10. The method for rapid large-scale three-dimensional reconstruction based on a neural radiance field according to claim 7, wherein the sub-scene selection and query camera image rendering part of the pixel-projection-based automatic rendering strategy is implemented as: sub-scenes are selected according to the intersection of the query camera's projected scene range box with each sub-scene range box, realizing automatic generation of the rendering scheme;
if the query camera viewpoint requires several sub-scene models to each render a part of the output, the rendered parts are aligned and stitched after rendering to obtain the complete query camera output image; if the query camera viewpoint can be rendered entirely by a single sub-scene model, the complete query camera image is output directly after rendering.
CN202410020121.6A 2024-01-07 2024-01-07 Rapid large-scale three-dimensional reconstruction method based on neural radiance field Pending CN117726758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410020121.6A CN117726758A (en) 2024-01-07 2024-01-07 Rapid large-scale three-dimensional reconstruction method based on neural radiance field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410020121.6A CN117726758A (en) 2024-01-07 2024-01-07 Rapid large-scale three-dimensional reconstruction method based on neural radiance field

Publications (1)

Publication Number Publication Date
CN117726758A true CN117726758A (en) 2024-03-19

Family

ID=90201694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410020121.6A Pending CN117726758A (en) 2024-01-07 2024-01-07 Rapid large-scale three-dimensional reconstruction method based on nerve radiation field

Country Status (1)

Country Link
CN (1) CN117726758A (en)

Similar Documents

Publication Publication Date Title
WO2020207512A1 (en) Three-dimensional object modeling method, image processing method, and image processing device
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN112434709B (en) Aerial survey method and system based on unmanned aerial vehicle real-time dense three-dimensional point cloud and DSM
Zhang et al. A UAV-based panoramic oblique photogrammetry (POP) approach using spherical projection
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
WO2008056825A1 (en) Three-dimensional surface creating method
JP2006053694A (en) Space simulator, space simulation method, space simulation program and recording medium
CN205451195U (en) Real -time three -dimensional some cloud system that rebuilds based on many cameras
CN115937461B (en) Multi-source fusion model construction and texture generation method, device, medium and equipment
CN116071278A (en) Unmanned aerial vehicle aerial image synthesis method, system, computer equipment and storage medium
WO2023124676A1 (en) 3d model construction method, apparatus, and electronic device
CN110706332B (en) Scene reconstruction method based on noise point cloud
Cui et al. Dense depth-map estimation based on fusion of event camera and sparse LiDAR
WO2023004559A1 (en) Editable free-viewpoint video using a layered neural representation
CN115375836A (en) Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering
CN116363290A (en) Texture map generation method for large-scale scene three-dimensional reconstruction
CN115631317B (en) Tunnel lining ortho-image generation method and device, storage medium and terminal
CN116704112A (en) 3D scanning system for object reconstruction
CN112991436B (en) Monocular vision SLAM method based on object size prior information
CN117726758A (en) Rapid large-scale three-dimensional reconstruction method based on neural radiance field
WO2022041119A1 (en) Three-dimensional point cloud processing method and apparatus
Yang et al. A high-realistic texture mapping algorithm based on image sequences
Hwang et al. 3D modeling and accuracy assessment-a case study of photosynth
CN116958449B (en) Urban scene three-dimensional modeling method and device and electronic equipment
CN113034671B (en) Traffic sign three-dimensional reconstruction method based on binocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination