CN116543117A - High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images - Google Patents

High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images

Info

Publication number
CN116543117A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
image
scene
nerf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310252401.5A
Other languages
Chinese (zh)
Other versions
CN116543117B (en)
Inventor
余卓渊
金鹏飞
石智杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Geographic Sciences and Natural Resources of CAS
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS filed Critical Institute of Geographic Sciences and Natural Resources of CAS
Priority to CN202310252401.5A priority Critical patent/CN116543117B/en
Publication of CN116543117A publication Critical patent/CN116543117A/en
Application granted granted Critical
Publication of CN116543117B publication Critical patent/CN116543117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images. The method comprises three steps in total and departs from the traditional oblique-photogrammetry three-dimensional modeling method: the unmanned aerial vehicle aerial image set is divided according to circular surrounding tracks; after feature extraction, matching and geometric verification of the divided image sets, the unmanned aerial vehicle camera poses are recovered with an SFM algorithm; sub-NeRFs are then trained, and finally the sub-NeRFs around the target viewing angle are merged, completing the implicit construction of a large-scene three-dimensional model. Experimental tests show that the method achieves good results and can reconstruct ground objects with smooth surfaces and small cross sections.

Description

High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images
Technical Field
The invention relates to the technical field of three-dimensional modeling, in particular to a high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images.
Background
A real-scene three-dimensional system can faithfully and systematically reflect large-scale spatio-temporal information about human production, daily life and ecological space, and is an important new integrated system for promoting the development of smart cities and the intelligent digital economy. Three-dimensional scene construction extends traditional 2D data into 3D data and uses it as the core data structure to realize a real-scene environment, replacing the traditional, purely geometric visualization architecture of points, lines and surfaces. A real-scene three-dimensional system enables a computer to present and perceive the current state and spatial distribution of various natural-resource elements comprehensively and in three dimensions; in addition, it can accurately reflect information such as the spatial distribution of terrain, surface texture details and the morphological characteristics of ground objects in a high-definition, intuitive way. The construction of real-scene three-dimensional models is therefore an emerging technology supporting remote-sensing and mapping theory and applications, has important scientific and practical value, and provides technical support for the development of digital twins and the metaverse. Real-scene three-dimensional modeling is also widely applied in fields such as urban planning, CIM, urban traffic, geological mapping, autonomous driving and virtual geographic environments.
As the means of acquiring geographic information data become increasingly rich, modeling methods that construct three-dimensional scenes from different data sources have emerged one after another. One approach is manual modeling with software such as SketchUp and 3ds Max, or manual BIM construction with software such as Revit; the resulting models are sufficiently fine, but the process is time-consuming, labor-intensive and inefficient, and can hardly meet the demand of large-scale scene modeling. Another approach obtains a building "white model" by extruding two-dimensional vector building footprints in CAD software to the building height; although it requires no manual modeling, accurate building-height data are hard to obtain and the model lacks texture and shape. Laser point-cloud modeling constructs the point cloud of the target object with an airborne LiDAR and then regenerates a triangular surface mesh; it is robust to illumination and strong wind interference and offers high accuracy, but it is costly, and data noise and data inconsistency remain challenges. There are also street-view images captured by moving vehicles, aerial photogrammetry modeling and so on, but none of them can reconstruct a relatively complete three-dimensional scene. As for three-dimensional reconstruction from images collected through web crowdsourcing, its quality depends heavily on how well the web pictures cover the scene.
Therefore, those skilled in the art are working to develop a high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images to solve the above-mentioned deficiencies of the prior art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention aims to solve the technical problem that the conventional photogrammetry three-dimensional modeling cannot reconstruct a ground object with a smooth surface and a small cross section.
In order to achieve the above purpose, the invention provides a high-precision large-scene three-dimensional modeling method of unmanned aerial vehicle images, which comprises the following steps:
step 1, acquiring and processing an unmanned aerial vehicle image;
step 2, constructing a single NeRF;
step 3, merging the NeRF constructed previously to obtain a three-dimensional scene under any view angle;
further, for step 1, the acquisition and processing of the unmanned aerial vehicle images is divided, in order, into unmanned aerial vehicle aerial photography, image set division, batch image import, image feature extraction, image feature matching, geometric verification and camera pose extraction;
the criterion for dividing the image set is that a given scene must be covered and that each subset has a high degree of overlap with its adjacent subsets; the subsets are obtained by dividing the surrounding flight tracks of the unmanned aerial vehicle, and the unmanned aerial vehicle images captured in each circular route form one subset;
the image feature extraction uses the SIFT algorithm (scale-invariant feature transform) to extract image features from the unmanned aerial vehicle aerial images, accelerated with SiftGPU on the graphics card to reach real-time computation speed; the SIFT algorithm searches for feature points across different scale spaces, computes their orientations and generates descriptors at the same time;
the image feature matching uses a brute-force algorithm for feature point matching: the brute-force algorithm traverses every pair of feature points, computes the distance between them and decides according to a threshold whether they form a matching pair; for any two images in the unmanned aerial vehicle aerial image set, matching pairs are searched with the brute-force algorithm using the feature points and descriptors extracted by SiftGPU;
the geometric verification uses the RANSAC algorithm to randomly select matching pairs, computes a fitting matrix and decides whether the matches are reasonable by evaluating the fitting error; the geometric verification effectively improves matching precision and avoids mismatches;
the camera pose extraction computes the camera poses with an incremental SFM algorithm; the incremental SFM algorithm performs three-dimensional reconstruction step by step and handles large-scale image sequences effectively; the incremental SFM algorithm is divided into initialization and incremental reconstruction, and the initialization includes triangulation and matrix decomposition;
further, for step 2, the single NeRF is constructed by building a fully connected neural network (MLP) in advance and setting multi-resolution hash encoding and spherical harmonic encoding rules; during construction of the single NeRF, a ray is cast through every pixel of the images captured by the camera at different positions and orientations, and coarse sampling is performed along the rays as a first round of sampling; the coordinates of these sampling points are encoded and input, together with the appearance embedding vector, into the fully connected neural network; a probability density function built from the first-round sampling points then guides a second, fine round of sampling; the second-round sampling points are coordinate-encoded and input, together with the appearance embedding vector, into the fully connected neural network, which outputs the color and volume density of each sampling point; the colors of the second-round sampling points are accumulated by volume rendering to obtain the pixel color corresponding to each ray, which is compared with the ground truth to compute the LOSS, and this process is iterated until the LOSS drops to a low value;
further, for step 3, the previously constructed NeRFs are merged by selecting sub-NeRFs and rendering the target-view image; the sub-NeRF selection rule is to draw a circle of preset radius centered on the given target viewing angle, and a sub-NeRF is selected if its origin projects inside the circle; the target-view image is rendered by interpolating the renderings of the selected sub-NeRFs with an inverse distance weighting (IDW) algorithm;
further, for step 3, the rendered target-view images are connected into a track, which achieves the effect of roaming through the three-dimensional space;
by adopting the scheme, the high-precision large-scene three-dimensional modeling method for the unmanned aerial vehicle image has the following advantages:
(1) According to the high-precision large-scene three-dimensional modeling method for the unmanned aerial vehicle image, the unmanned aerial vehicle aerial image is selected as a data source, and the advantages of high spatial resolution, wide imaging range and high overlapping rate of the unmanned aerial vehicle image are fully utilized to acquire the image.
(2) According to the high-precision large-scene three-dimensional modeling method for the unmanned aerial vehicle image, a traditional oblique photogrammetry three-dimensional modeling method is avoided, a new scheme for reconstructing a large-scale three-dimensional model by constructing a nerve radiation field NeRF is provided, feature extraction matching and geometric verification are performed after the unmanned aerial vehicle aerial image set is divided, sub NeRF is trained, and finally sub NeRF is combined, so that implicit construction of the large-scene three-dimensional model is completed.
In summary, the high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images disclosed by the invention departs from the traditional oblique-photogrammetry three-dimensional modeling method: the unmanned aerial vehicle aerial image set is divided according to circular surrounding tracks; after feature extraction, matching and geometric verification, the unmanned aerial vehicle camera poses are recovered with an SFM algorithm; sub-NeRFs are trained, and finally the sub-NeRFs around the target viewing angle are merged, completing the implicit construction of a large-scene three-dimensional model. Experimental tests show that the method achieves good results and can reconstruct ground objects with smooth surfaces and small cross sections.
The concept, specific technical scheme and technical effects of the present invention are further described below in conjunction with specific embodiments, so that the objects, features and effects of the invention can be fully understood.
Drawings
FIG. 1 is a flow chart of a high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images;
FIG. 2 is a schematic view of a circular encircling track of the unmanned aerial vehicle;
fig. 3 is a schematic view of scene boundaries.
Detailed Description
The following describes several preferred embodiments of the present invention to make its technical content clearer and easier to understand. The invention may be embodied in many different forms; the embodiments described here are exemplary, and the protection scope of the invention is not limited to them.
Where experimental methods do not specify particular conditions, they are generally carried out under conventional conditions, for example those given in the relevant instructions or manuals.
As shown in fig. 1 to 3, the high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images, provided by the invention, comprises the following specific embodiments:
step 1, the adopted route design scheme is a surrounding route (that is, during flight the unmanned aerial vehicle flies around a circular or elliptical route along a pre-planned path, as shown in fig. 2); the flight task requirements and terrain data are entered into automatic route-planning software, which automatically generates an optimal route scheme; the unmanned aerial vehicle's automatic flight system then executes the flight tasks along the planned route to complete the image acquisition;
after acquisition, the image set is divided according to the tracks of the unmanned aerial vehicle's surrounding routes; on each circular route the camera on the unmanned aerial vehicle points at the same point of interest, and this point of interest is the origin of the corresponding sub-NeRF, so the unmanned aerial vehicle images captured on each circular route are taken as one subset;
the SIFT algorithm is run on a heterogeneous computing system equipped with a GPU to extract image features from the unmanned aerial vehicle aerial images efficiently: SiftGPU processes pixels in parallel to build the Gaussian pyramid and detect difference-of-Gaussian (DoG) key points, builds a compact key-point list with a GPU/CPU hybrid method based on GPU list generation, and finally processes the key points in parallel to obtain their orientations and descriptors; when the brute-force algorithm performs feature point matching, a matching threshold (a distance threshold) is first defined, then all feature point pairs are traversed and the distance between each pair is computed; pairs whose distance is below the threshold are added to the matching-pair list;
first, wrong matching point pairs are eliminated with the RANSAC algorithm and the fundamental matrix is computed: eight matching point pairs are sampled uniformly at random, the fundamental matrix is estimated from them with the normalized eight-point method, the remaining matching pairs are checked against the current fundamental matrix, the number of pairs consistent with it is counted as the score of the current fundamental matrix, and these steps are repeated a set number of times to obtain the fundamental matrix with the highest score; the eight point pairs satisfy p'^T F p = 0, where p and p' are the projections of a three-dimensional point P in the two image planes and F is the fundamental matrix to be solved;
the homography matrix is likewise computed with the RANSAC algorithm to remove wrong matching point pairs: four matching point pairs are sampled uniformly at random, the homography matrix is estimated from them with the four-point method, the remaining matching pairs are checked against the current homography matrix, the number of pairs consistent with it is counted as the score of the current homography matrix, and these steps are repeated a set number of times to obtain the homography matrix with the highest score; the four-point method uses p' = H p, where p and p' are the projections of a three-dimensional point P in the two image planes and H is the homography matrix to be solved; substituting the coordinates of the four point pairs into this formula yields four equations, which are combined into a linear system whose solution gives the homography matrix;
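A hedged Python sketch of this geometric verification, using OpenCV's built-in RANSAC estimators for the fundamental matrix and the homography; the reprojection threshold and confidence are illustrative values:

    import cv2

    def geometric_verification(pts_a, pts_b, thresh_px=1.0, confidence=0.999):
        """Reject wrong matches with RANSAC as described above.

        pts_a, pts_b: Nx2 float arrays of matched pixel coordinates (N >= 8).
        Returns the fundamental matrix F, the homography H and their inlier masks."""
        # Eight-point RANSAC for the fundamental matrix: p'^T F p = 0.
        F, mask_f = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC,
                                           thresh_px, confidence)
        # Four-point RANSAC for the homography: p' = H p.
        H, mask_h = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, thresh_px)
        return F, mask_f.ravel().astype(bool), H, mask_h.ravel().astype(bool)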
the camera poses are computed with incremental SFM; first, the track length of every feature point is computed (the track length is the number of pictures in which the feature point appears); a scene connectivity graph is then built with each picture as a node: if the number of matched feature point pairs between two nodes exceeds a threshold, the two nodes are connected by an edge; one edge is selected whose two seed pictures have a sufficiently large intersection angle (when all corresponding points are triangulated, the median of the ray included angles is generally not larger than 60 degrees and not smaller than 3 degrees) and a sufficiently large number of well-distributed tie points; the essential matrix of this edge (the two seed pictures) is robustly estimated and decomposed to obtain the camera poses (camera extrinsics) of the two seed pictures; the feature point pairs on the two seed pictures whose track length is greater than 2 are selected and triangulated to obtain the initial reconstruction, the edge is deleted from the scene connectivity graph, and the initialization of the SFM is complete;
after the initialization of the incremental SFM is completed, while edges remain in the scene connectivity graph, one edge is selected that maximizes the overlap between the feature points with track length greater than 2 on its two pictures and the already reconstructed 3D points; the camera pose (camera extrinsics) is estimated with a PnP method, the feature point pairs with track length greater than 2 on the newly added pictures of this edge are selected and triangulated, the edge is deleted from the scene connectivity graph, and a BA (bundle adjustment) algorithm is executed; this is repeated until no edge remains in the scene connectivity graph, completing the incremental scene reconstruction and camera pose recovery of the incremental SFM;
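The scene connectivity graph bookkeeping and the bundle adjustment are omitted in the following simplified Python sketch, which only illustrates one incremental registration step (RANSAC-PnP pose estimation plus triangulation) under the assumption of a known, calibrated intrinsic matrix K:

    import cv2
    import numpy as np

    def register_next_image(K, points3d, obs2d, new_pts_a, new_pts_b, R_a, t_a):
        """One simplified incremental-SfM step.

        points3d / obs2d: already reconstructed 3D points and their 2D observations
        in the new image (used for PnP); new_pts_a / new_pts_b: matched pixels between
        an already registered image (pose R_a, t_a) and the new image (for triangulation)."""
        # 1. Estimate the new camera pose from 2D-3D correspondences with RANSAC-PnP.
        ok, rvec, tvec, _ = cv2.solvePnPRansac(points3d, obs2d, K, None)
        assert ok, "PnP failed"
        R_b, _ = cv2.Rodrigues(rvec)

        # 2. Triangulate the new feature tracks between the two images.
        P_a = K @ np.hstack([R_a, t_a.reshape(3, 1)])
        P_b = K @ np.hstack([R_b, tvec.reshape(3, 1)])
        X_h = cv2.triangulatePoints(P_a, P_b, new_pts_a.T, new_pts_b.T)
        X = (X_h[:3] / X_h[3]).T          # homogeneous -> Euclidean 3D points

        # 3. A full pipeline would now run bundle adjustment (BA) over all poses
        #    and points before processing the next edge of the connectivity graph.
        return R_b, tvec, X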
step 2, a NeRF-based spatial scene is expressed as a function whose input is a five-dimensional vector, and the shape of the three-dimensional model in the scene and the color observed from different directions are described by the implicit expression of an MLP network; the inputs of the five-dimensional vector function are a 3D position vector x = (x, y, z) and a 2D direction vector d = (θ, φ), where x = (x, y, z) is the coordinate of a three-dimensional point in the scene, φ is the direction angle of the viewing direction in spherical coordinates (rotation toward the yoz plane measured from the positive x half-axis) and θ is the polar viewing angle (rotation toward the xoy plane measured from the positive z half-axis); the outputs of the function are the color c = (r, g, b) observed when the camera ray reaches the three-dimensional position x along direction d, and the volume density σ(x); the volume density σ(x) represents the differential probability that the ray terminates at an infinitesimal particle at position x; the five-dimensional vector function can be written as:
F_Θ(x, d) = (c, σ)
the training process of the fully connected neural network continuously adjusts the weight parameters Θ of the network model so that, given an input 5D coordinate, the model outputs the corresponding color and volume density; to guarantee multi-view consistency of the network output, the volume density σ is a function σ(x) of the spatial position x only, while the color vector c is a function c(x, d) of both x and d;
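As a hedged illustration of F_Θ(x, d) = (c, σ), the following PyTorch sketch shows an MLP whose density output depends only on the encoded position, while the color output additionally receives the encoded viewing direction and an appearance embedding; the layer widths and encoding sizes are assumed values, not those of the patent:

    import torch
    import torch.nn as nn

    class NeRFMLP(nn.Module):
        """Minimal F_Θ(x, d) = (c, σ): σ from position only, c from position
        features plus viewing direction and appearance embedding (a sketch)."""
        def __init__(self, pos_dim=32, dir_dim=16, app_dim=8, width=64):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim, width), nn.ReLU(),
                nn.Linear(width, width), nn.ReLU())
            self.sigma_head = nn.Linear(width, 1)            # volume density σ(x)
            self.color_head = nn.Sequential(                 # color c(x, d)
                nn.Linear(width + dir_dim + app_dim, width), nn.ReLU(),
                nn.Linear(width, 3), nn.Sigmoid())

        def forward(self, pos_enc, dir_enc, app_embed):
            h = self.trunk(pos_enc)
            sigma = torch.relu(self.sigma_head(h))           # σ depends on x only
            rgb = self.color_head(torch.cat([h, dir_enc, app_embed], dim=-1))
            return rgb, sigma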
the specific flow of the multi-resolution hash encoding is as follows: (1) for a given input coordinate x, find its surrounding voxels at the different resolution levels and map their integer vertex coordinates through a hash function, assigning an index to each voxel vertex; (2) look up the feature vector corresponding to each vertex index in the hash table of each level; (3) according to the relative position of the input coordinate x within the grid of each level, interpolate the vertex feature vectors of that level into a single feature vector by tri-linear interpolation; (4) concatenate the feature vectors of the grids of different resolutions to complete the multi-resolution hash encoding;
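A hedged PyTorch sketch of the four steps above (hashing of the voxel vertices, table lookup, tri-linear interpolation and concatenation); the spatial-hash primes follow the common Instant-NGP choice, and the table sizes and per-level resolutions are assumed to be supplied by the caller:

    import torch

    PRIMES = (1, 2654435761, 805459861)     # spatial-hash primes (Instant-NGP convention)

    def hash_encode(x, tables, resolutions):
        """Multi-resolution hash encoding of points x (N, 3) in [0, 1]^3 — a sketch.

        tables: list of learnable (T, F) feature tables, one per level;
        resolutions: grid resolution of each level."""
        feats = []
        for table, res in zip(tables, resolutions):
            T = table.shape[0]
            xs = x * res
            lo = torch.floor(xs).long()
            frac = xs - lo.float()                           # position inside the voxel

            level_feat = torch.zeros(x.shape[0], table.shape[1], device=x.device)
            for corner in range(8):                          # the 8 voxel vertices
                offs = torch.tensor([(corner >> d) & 1 for d in range(3)], device=x.device)
                idx = lo + offs
                # spatial hash of the integer vertex coordinate -> table index
                h = (idx[:, 0] * PRIMES[0]) ^ (idx[:, 1] * PRIMES[1]) ^ (idx[:, 2] * PRIMES[2])
                # tri-linear interpolation weight of this vertex
                w = torch.prod(torch.where(offs.bool(), frac, 1.0 - frac), dim=-1)
                level_feat += w.unsqueeze(-1) * table[h % T]
            feats.append(level_feat)
        return torch.cat(feats, dim=-1)                      # concatenation over levels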
in graphics rendering, the significance of spherical harmonics is that after the light source rotates, the illumination in the image is guaranteed not to flicker or jump as long as the transformed generalized Fourier coefficients are recomputed synchronously; after separation of variables in the Laplace equation in spherical coordinates, the dependence on the polar viewing angle θ is an associated Legendre polynomial P_l^m(cos θ) and the dependence on the direction angle φ is e^{imφ}, so the spherical harmonics are defined as:
Y_l^m(θ, φ) = A_{l,m} P_l^m(cos θ) e^{imφ}
in the above, (θ, φ) denotes a unit vector pointing to a point on the unit sphere, l denotes the degree and m the order, both integers with l ≥ 0 and −l ≤ m ≤ l; A_{l,m} is a normalization coefficient such that the integral of |Y_l^m|² over the unit sphere equals 1;
Spherical harmonics can be seen as mapping each point on a unit sphere (or each direction in three-dimensional space) to a complex function value; for the input direction d, it can be encoded as a combination of the base function of the spherical harmonic and the spherical harmonic coefficients, and then input into the color MLP network;
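For illustration, a small PyTorch sketch that encodes unit directions with real spherical harmonics up to degree 2 (nine coefficients per direction); the truncation degree and the sign convention of the constants are assumptions, not values fixed by the patent:

    import torch

    def sh_encode_deg2(d):
        """Real spherical-harmonic encoding of unit directions d (N, 3) up to l = 2."""
        x, y, z = d[:, 0], d[:, 1], d[:, 2]
        return torch.stack([
            torch.full_like(x, 0.28209479),                              # l = 0
            0.48860251 * y, 0.48860251 * z, 0.48860251 * x,              # l = 1
            1.09254843 * x * y, 1.09254843 * y * z,                      # l = 2
            0.31539157 * (3.0 * z * z - 1.0),
            1.09254843 * x * z, 0.54627421 * (x * x - y * y),
        ], dim=-1)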
in order to adapt NeRF to different illumination changes, the GLO (Generative Latent Optimization) technique is adopted: each image is assigned a corresponding real-valued appearance embedding vector of length n^(a); the appearance embedding vector is then used as an input to the second (color) MLP network, and the embeddings are optimized together with the weight parameters Θ;
a large cube surrounding the modeled scene is set in the multi-resolution hash encoding, and the values of near and far are determined by computing the intersection points of each camera ray with this large cube, as shown in fig. 3; this fixes the scene boundary range;
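A minimal PyTorch sketch of this ray/cube intersection (the standard slab method) used to obtain near and far for each ray; it assumes the ray directions have no exactly zero components:

    import torch

    def ray_aabb_near_far(origins, dirs, box_min, box_max):
        """Entry (near) and exit (far) distances of rays against the scene cube.

        origins, dirs: (N, 3); box_min, box_max: (3,) corners of the large cube."""
        inv_d = 1.0 / dirs
        t0 = (box_min - origins) * inv_d
        t1 = (box_max - origins) * inv_d
        t_near = torch.minimum(t0, t1).max(dim=-1).values    # last entry into all slabs
        t_far = torch.maximum(t0, t1).min(dim=-1).values     # first exit from any slab
        t_near = torch.clamp(t_near, min=0.0)                # camera may be inside the cube
        hit = t_far > t_near                                 # rays that actually cross the cube
        return t_near, t_far, hit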
for volume rendering, the integral over continuous samples is fitted by a summation over discrete samples: the interval from t_n to t_f is divided into N uniform sub-intervals and one sample point is drawn at random from each sub-interval, so the i-th sample point can be expressed as
t_i ~ U[ t_n + (i−1)/N · (t_f − t_n), t_n + i/N · (t_f − t_n) ];
the discrete sampling summation ensures that, over the course of training and optimization, the MLP is evaluated at positions that are macroscopically continuous, which guarantees the continuity of the scene representation; based on this idea, the integral is converted into the summation
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i, with T_i = exp(−Σ_{j=1}^{i−1} σ_j δ_j),
where δ_i = t_{i+1} − t_i is the distance between adjacent sample points, and σ_i and c_i are the volume density and color of sample point i; computing Ĉ(r) from the set of (σ_i, c_i) values is differentiable, and the opacity (alpha value) of sample point i is 1 − exp(−σ_i δ_i);
For hierarchical sampling, NeRF samples points along each ray in space with a two-stage scheme: coarse sampling is performed first, and its result then guides the subsequent fine sampling; coarse sampling is uniform sampling: N_c points are sampled uniformly within a predefined range from the camera, i.e. the range is divided from near to far into N_c equal intervals and the sampling points are placed uniformly along the ray; the position information of the N_c sampling points is passed through the multi-resolution hash encoding, their direction information is passed through the spherical harmonic encoding and concatenated with the appearance embedding vector, and the result is fed into an MLP network, called the coarse network, which yields the volume density and color of the N_c sampling points; along this ray, the corresponding pixel color can be regarded as a weighted sum of the sample colors, i.e.
Ĉ_c(r) = Σ_{i=1}^{N_c} w_i c_i, with w_i = T_i (1 − exp(−σ_i δ_i));
the magnitude of a sampling weight reflects how close the sampling point is to the surface of the three-dimensional model, so the second sampling pass can concentrate on the regions that received large weights in the first pass: more samples are drawn where the weights are large and fewer where they are small;
these weights are normalized to
ŵ_i = w_i / Σ_{j=1}^{N_c} w_j,
where N_c is the number of coarse sampling points; the normalized weights can be regarded as a probability density function of the object along the ray, from which the distribution of the object along the ray direction can be roughly obtained; next, based on this probability density function, fine sampling, i.e. inverse transform sampling, is used to obtain N_f new sampling points, which cluster where the coarse sampling weights are high and are sparse where they are low, so they lie close to the object surface; the position and direction information of all N_c + N_f sampling points is then encoded and input into an MLP network called the fine network, yielding the volume density and color of the new sampling points, and the corresponding pixel color is computed with the volume rendering formula
Ĉ_f(r) = Σ_{i=1}^{N_c+N_f} T_i (1 − exp(−σ_i δ_i)) c_i;
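A hedged PyTorch sketch of the inverse transform sampling step: the normalized coarse weights are turned into a cumulative distribution function and N_f fine depths are drawn from it (unjittered uniform samples are used here for simplicity):

    import torch

    def sample_pdf(bins, weights, n_fine):
        """Draw N_f fine sample depths from the coarse-weight PDF along each ray.

        bins: (R, Nc + 1) interval edges along the ray; weights: (R, Nc)."""
        pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-10)    # normalized ŵ_i
        cdf = torch.cumsum(pdf, dim=-1)
        cdf = torch.cat([torch.zeros_like(cdf[:, :1]), cdf], dim=-1)   # (R, Nc + 1)

        u = torch.rand(cdf.shape[0], n_fine, device=cdf.device)        # uniform samples
        idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

        cdf_lo = torch.gather(cdf, -1, idx - 1)
        cdf_hi = torch.gather(cdf, -1, idx)
        bin_lo = torch.gather(bins, -1, idx - 1)
        bin_hi = torch.gather(bins, -1, idx)

        frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)                # position within the bin
        return bin_lo + frac * (bin_hi - bin_lo)                       # (R, Nf) fine depths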
the Loss function Loss is calculated from the total squared error between the pixel color output by the coarse network and the pixel color and true pixel color output by the fine network:
where R is a set of rays in each training batch,and->Pixel colors obtained by volume rendering after coarse sampling and fine sampling of camera rays respectively, +.>Then the true value of the image pixel color (GT); even if the final rendering result comes from->Can also make->So that the weight distribution from the coarse network can be used to allocate samples in the fine network;
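A short PyTorch sketch of this loss over one batch of rays:

    import torch

    def nerf_loss(rgb_coarse, rgb_fine, rgb_gt):
        """Total squared error: both the coarse and the fine renderings (R, 3) are
        compared with the ground-truth pixel colors (R, 3)."""
        return ((rgb_coarse - rgb_gt) ** 2).sum() + ((rgb_fine - rgb_gt) ** 2).sum()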
when the value of Loss reaches a low value, training is regarded as finished and the iteration stops; for any viewing angle in the scene a virtual camera can then be placed: the camera emits a number of rays, coarse sampling is performed on each ray and the samples are sent through the coarse network to obtain the weight of every sampling point, fine sampling is then carried out, the two rounds of sampling points are sent through the fine network, and the color of each pixel is obtained by volume rendering; the pixels together form an image, so an image observed from any viewing angle in the scene can be obtained;
step 3, after a NeRF model has been trained for every subset of the unmanned aerial vehicle image set, each sub-NeRF is obtained, with the point of interest targeted by the camera on the circular route as its origin and the large cube constructed for that route as its boundary; to improve efficiency, only the sub-NeRFs relevant to the given target viewing angle are rendered; the color information rendered from the selected sub-NeRFs is then interpolated with inverse distance weighting (Inverse Distance Weighted, IDW) between the target camera origin o and the centers x_i of the selected sub-NeRFs (specifically, each weight is computed as w_i ∝ d(o, x_i)^(−p), where p controls the blending rate between the sub-NeRF renderings and d(o, x_i) is the distance from the target camera origin o to the center x_i of the selected sub-NeRF), finally yielding a new view at the target viewing angle; connecting the views at several target viewing angles into a track achieves the effect of roaming through the three-dimensional space;
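For illustration, a NumPy sketch of the sub-NeRF selection and IDW blending described above; projecting the selection test onto the horizontal plane, and the radius and exponent p, are assumptions made for the sketch:

    import numpy as np

    def blend_sub_nerfs(target_origin, sub_origins, sub_renders, radius, p=4):
        """Select sub-NeRFs near the target view and blend their renderings by IDW.

        sub_origins: (K, 3) sub-NeRF centers x_i; sub_renders: (K, H, W, 3) images
        rendered from each sub-NeRF at the target viewing angle."""
        # horizontal (projected) distance from the target camera origin o to each x_i
        d = np.linalg.norm((sub_origins - target_origin)[:, :2], axis=-1)
        keep = d < radius                         # origin projects inside the circle
        d, renders = np.maximum(d[keep], 1e-6), sub_renders[keep]

        w = d ** (-float(p))                      # w_i ∝ d(o, x_i)^(-p)
        w = w / w.sum()                           # assumes at least one sub-NeRF is kept
        return np.tensordot(w, renders, axes=1)   # IDW-weighted sum of the images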
according to the analysis of embodiment 1, the high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images departs from the traditional oblique-photogrammetry three-dimensional modeling method: the unmanned aerial vehicle aerial image set is divided according to the circular surrounding tracks; feature extraction, matching and geometric verification are performed on the divided image sets and the unmanned aerial vehicle camera poses are recovered with an SFM algorithm; the sub-NeRFs are then trained, and finally the sub-NeRFs around the target viewing angle are merged, completing the implicit construction of the large-scene three-dimensional model; experimental tests show that the method achieves good results and can reconstruct ground objects with smooth surfaces and small cross sections.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by a person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (6)

1. A high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images is characterized in that,
the method comprises the following steps:
step 1, acquiring and processing an unmanned aerial vehicle image;
step 2, constructing a single NeRF;
and 3, merging the NeRF constructed previously to obtain a three-dimensional scene under any view angle.
2. The high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images according to claim 1, wherein,
for step 1, the acquisition and processing of the unmanned aerial vehicle images is divided, in order, into unmanned aerial vehicle aerial photography, image set division, batch image import, image feature extraction, image feature matching, geometric verification and camera pose extraction.
3. The high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images according to claim 2, wherein,
the criterion for dividing the image set is that a given scene must be covered and that each subset has a high degree of overlap with its adjacent subsets; the subsets are obtained by dividing the surrounding flight tracks of the unmanned aerial vehicle, and the unmanned aerial vehicle images captured in each circular route form one subset;
the image feature extraction uses the SIFT algorithm to extract image features from the unmanned aerial vehicle aerial images, accelerated with SiftGPU on the graphics card to reach real-time computation speed; the SIFT algorithm searches for feature points across different scale spaces, computes their orientations and generates descriptors at the same time;
the image feature matching uses a brute-force algorithm for feature point matching: the brute-force algorithm traverses every pair of feature points, computes the distance between them and decides according to a threshold whether they form a matching pair; for any two images in the unmanned aerial vehicle aerial image set, matching pairs are searched with the brute-force algorithm using the feature points and descriptors extracted by SiftGPU;
the geometric verification uses the RANSAC algorithm to randomly select matching pairs, computes a fitting matrix and decides whether the matches are reasonable by evaluating the fitting error; the geometric verification effectively improves matching precision and avoids mismatches;
the camera pose extraction computes the camera poses with an incremental SFM algorithm; the incremental SFM algorithm performs three-dimensional reconstruction step by step and handles large-scale image sequences effectively; the incremental SFM algorithm is divided into initialization and incremental reconstruction, and the initialization includes triangulation and matrix decomposition.
4. The high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images according to claim 1, wherein,
for step 2, the single NeRF is constructed by building a fully connected neural network in advance and setting multi-resolution hash encoding and spherical harmonic encoding rules; during construction of the single NeRF, a ray is cast through every pixel of the images captured by the camera at different positions and orientations, and coarse sampling is performed along the rays as a first round of sampling; the coordinates of these sampling points are encoded and input, together with the appearance embedding vector, into the fully connected neural network; a probability density function built from the first-round sampling points then guides a second, fine round of sampling; the second-round sampling points are coordinate-encoded and input, together with the appearance embedding vector, into the fully connected neural network, which outputs the color and volume density of each sampling point; the colors of the second-round sampling points are accumulated by volume rendering to obtain the pixel color corresponding to each ray, which is compared with the ground truth to compute the LOSS, and this process is iterated until the LOSS drops to a low value.
5. The high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images according to claim 1, wherein,
for step 3, the process of merging the previously constructed NeRFs consists of selecting sub-NeRFs and rendering the target-view image; the sub-NeRF selection rule is to draw a circle of preset radius centered on the given target viewing angle, and a sub-NeRF is selected if its origin projects inside the circle; the target-view image is rendered by interpolating the renderings of the selected sub-NeRFs with an inverse distance weighting (IDW) algorithm.
6. The high-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images according to claim 5, wherein,
for step 3, the images of several target viewing angles are connected into a track, achieving the effect of roaming through the three-dimensional space.
CN202310252401.5A 2023-03-16 2023-03-16 High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images Active CN116543117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310252401.5A CN116543117B (en) 2023-03-16 2023-03-16 High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310252401.5A CN116543117B (en) 2023-03-16 2023-03-16 High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images

Publications (2)

Publication Number Publication Date
CN116543117A 2023-08-04
CN116543117B 2024-01-09

Family

ID=87454947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310252401.5A Active CN116543117B (en) 2023-03-16 2023-03-16 High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images

Country Status (1)

Country Link
CN (1) CN116543117B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689846A (en) * 2024-02-02 2024-03-12 武汉大学 Unmanned aerial vehicle photographing reconstruction multi-cross viewpoint generation method and device for linear target

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085845A (en) * 2020-09-11 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Outdoor scene rapid three-dimensional reconstruction device based on unmanned aerial vehicle image
US20210358206A1 (en) * 2020-05-14 2021-11-18 Star Institute Of Intelligent Systems Unmanned aerial vehicle navigation map construction system and method based on three-dimensional image reconstruction technology
CN115731355A (en) * 2022-11-29 2023-03-03 湖北大学 SuperPoint-NeRF-based three-dimensional building reconstruction method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210358206A1 (en) * 2020-05-14 2021-11-18 Star Institute Of Intelligent Systems Unmanned aerial vehicle navigation map construction system and method based on three-dimensional image reconstruction technology
CN112085845A (en) * 2020-09-11 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Outdoor scene rapid three-dimensional reconstruction device based on unmanned aerial vehicle image
CN115731355A (en) * 2022-11-29 2023-03-03 湖北大学 SuperPoint-NeRF-based three-dimensional building reconstruction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
右边的口袋: "NeRF Neural Radiance Field Study Notes (8) — Interpretation of the Innovations of the Block-NeRF Paper", HTTPS://BLOG.CSDN.NET/WEIXIN_44292547/ARTICLE/DETAILS/126426322

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117689846A (en) * 2024-02-02 2024-03-12 武汉大学 Unmanned aerial vehicle photographing reconstruction multi-cross viewpoint generation method and device for linear target
CN117689846B (en) * 2024-02-02 2024-04-12 武汉大学 Unmanned aerial vehicle photographing reconstruction multi-cross viewpoint generation method and device for linear target

Also Published As

Publication number Publication date
CN116543117B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN110458939B (en) Indoor scene modeling method based on visual angle generation
Budroni et al. Automatic 3D modelling of indoor manhattan-world scenes from laser data
CN111899328B (en) Point cloud three-dimensional reconstruction method based on RGB data and generation countermeasure network
CN110853075A (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN106599053B (en) Three-dimensional model retrieval method
CN110827302A (en) Point cloud target extraction method and device based on depth map convolutional network
CN113066162A (en) Urban environment rapid modeling method for electromagnetic calculation
US11544898B2 (en) Method, computer device and storage medium for real-time urban scene reconstruction
CN116543117B (en) High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images
Condorelli et al. A comparison between 3D reconstruction using nerf neural networks and mvs algorithms on cultural heritage images
CN111028335B (en) Point cloud data block surface patch reconstruction method based on deep learning
CN104751451B (en) Point off density cloud extracting method based on unmanned plane low latitude high resolution image
Rougeron et al. Optimal positioning of terrestrial LiDAR scanner stations in complex 3D environments with a multiobjective optimization method based on GPU simulations
Tian et al. Lidar super-resolution based on segmentation and geometric analysis
Xiang et al. Nisb-map: Scalable mapping with neural implicit spatial block
CN113838199B (en) Three-dimensional terrain generation method
Hyeon et al. Automatic spatial template generation for realistic 3d modeling of large-scale indoor spaces
CN113066161B (en) Modeling method of urban radio wave propagation model
CN114187404A (en) Three-dimensional reconstruction method and system for high resolution of offshore area
Wang et al. State of the art in dense image matching cost computation for high-resolution satellite stereo
Kalisperakis et al. Use of Photogrammetry in a Business Simulation Game
Guo et al. A Local Enhancement Method for Large-Scale Building Facade Depth Images using Densely Matched Point Clouds
Ge et al. 3D Reconstruction of Ancient Buildings Using UAV Images and Neural Radiation Field with Depth Supervision
Zhang et al. Aerial-NeRF: Adaptive Spatial Partitioning and Sampling for Large-Scale Aerial Rendering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant