CN114677479A - Natural landscape multi-view three-dimensional reconstruction method based on deep learning - Google Patents

Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Info

Publication number
CN114677479A
CN114677479A (application CN202210384876.5A)
Authority
CN
China
Prior art keywords
depth
view
feature extraction
dimensional
natural landscape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210384876.5A
Other languages
Chinese (zh)
Inventor
李毅 (Li Yi)
张笑钦 (Zhang Xiaoqin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data And Information Technology Research Institute Of Wenzhou University
Original Assignee
Big Data And Information Technology Research Institute Of Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data And Information Technology Research Institute Of Wenzhou University
Priority to CN202210384876.5A
Publication of CN114677479A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a deep-learning-based multi-view three-dimensional reconstruction method for natural landscapes, comprising the following steps: acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the set; constructing a multi-scale feature extraction network, training it, and using the trained network to extract target key features from the preprocessed two-dimensional images; inputting the target key features into a learning-based patch-matching iterative model that iteratively computes pixel depth matching and outputs the corresponding depth map when the iteration finishes; and inputting the obtained depth map and the source image into a depth residual network for optimization to obtain the final optimized depth map, from which a three-dimensional model of the object is constructed to obtain a stereoscopic view of the natural landscape.

Description

Natural landscape multi-view three-dimensional reconstruction method based on deep learning
Technical Field
The invention relates to the technical field of three-dimensional reconstruction, in particular to a natural landscape multi-view three-dimensional reconstruction method based on deep learning.
Background
Metaverse technology is a virtual-real fusion framework integrating virtual reality, game engines, the mobile internet, blockchain and the like, and provides a highly immersive interactive experience. Applying metaverse technology to real tourism scenes improves the innovative development of tourism products and protects the intellectual property of digital tourism products, which is of great theoretical significance and practical value for the development of the tourism industry. Constructing a virtual-real natural landscape scene requires establishing a simulated virtual scene from image data by multi-view three-dimensional reconstruction. In recent years, three-dimensional reconstruction technology has used depth cameras, infrared sensors, lidar and the like to extract and estimate depth information of real-world scenes, and has been widely applied in fields such as intelligent autonomous driving, AR/VR, satellite remote-sensing mapping and multimedia entertainment. Artificial intelligence can extract, process and transform the features of multi-view two-dimensional images through the high generalization capability of neural network models, making the prediction and estimation of three-dimensional scenes more accurate and efficient. However, the acquisition of sample data for three-dimensional reconstruction of natural scenes is affected by factors such as acquisition equipment, the natural environment, noise and occlusion, so accuracy is not high, which poses a great challenge to the simulated construction of virtual scenes. Improving the accuracy and efficiency of multi-view three-dimensional reconstruction of natural virtual scenes and applying it to the cultural tourism metaverse is therefore one of the research difficulties that urgently needs to be solved.
Multi-view stereo reconstruction is a method for recovering a three-dimensional model from multiple images of the same scene taken from different viewpoints. Deep-learning-based multi-view stereo reconstruction, such as the classical MVSNet architecture, usually constructs a three-dimensional cost volume to regress the depth values of the scene. Using a convolutional neural network for multi-view stereo matching improves the efficiency of traditional matching overall. However, because of the 3D convolutional neural network used for cost-volume regularization, MVSNet is limited by GPU memory in large-scale, high-resolution scenarios. Traditional methods struggle with specular reflection and texture, their reconstruction completeness is low and their speed is slow; natural landscape reconstruction is strongly affected by the environment, feature extraction is insufficient, and parameters must be designed in advance and cannot adapt, so these methods work only for specific scenes and generalize poorly.
In summary, how to overcome the above-mentioned drawbacks is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
To address the above problems and needs, this solution provides a deep-learning-based multi-view three-dimensional reconstruction method for natural landscapes, which solves the above technical problems through the following technical scheme.
To achieve this purpose, the invention provides the following technical scheme. A natural landscape multi-view three-dimensional reconstruction method based on deep learning comprises the following steps: Step1, acquiring a multi-view image set of the natural landscape, and preprocessing the two-dimensional images in the multi-view image set;
Step2, constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and using the trained network to extract features from the preprocessed two-dimensional images to obtain target key features;
Step3, inputting the obtained target key features into a learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and Step4, inputting the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
Further, the preprocessing comprises:
performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, wherein the multi-view image set comprises a source image and reference images from a plurality of corresponding viewing angles;
and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape.
Furthermore, the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure; a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
Furthermore, in the iterative computation, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations, the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration.
Further, the pixel depth matching in the learning-based patch matching iterative model is realized by a preset matching cost calculation method.
Furthermore, the matching cost calculation method computes the cost of each depth hypothesis of each pixel through group-wise correlation.
Specifically, the similarity of each group is computed according to the formula

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)},$$

where $w_i(p)$ denotes the weight of pixel $p$ with respect to reference image $I_i$, $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, $S_i(p,j)\in\mathbb{R}^G$ denotes the similarity vector of the corresponding groups,

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

$F_0(p)^g$ and $F_i(p_{i,j})^g$ respectively denote the $g$-th feature group of the source image and of the $i$-th reference image, $N$ denotes the total number of source and reference images, and $p_{i,j}$ denotes the pixel in reference image $I_i$ corresponding to pixel $p$ under depth hypothesis $j$.
Further, Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into a depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
A natural landscape multi-view three-dimensional reconstruction system based on deep learning adopts the above method and specifically comprises: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module.
The image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set.
The multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features.
The iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes.
And the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from it to obtain a stereoscopic view of the natural landscape.
According to the above technical scheme, the invention has the following beneficial effects: the edges of the depth map are optimized by an edge-processing algorithm based on local region segmentation, so the obtained depth map is more complete and accurate, the local detail precision of the landscape model is higher, reconstruction is more efficient, and generality is stronger.
In addition to the above objects, features and advantages, preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings so that the features and advantages of the present invention can be easily understood.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below; the drawings illustrate only some embodiments of the invention and do not limit all embodiments of the invention to them.
Fig. 1 is a schematic diagram of specific steps of a natural landscape multi-view three-dimensional reconstruction method based on deep learning in the present invention.
Fig. 2 is a schematic diagram of a construction process of the natural landscape multi-view three-dimensional reconstruction method based on deep learning in the invention.
Fig. 3 is a schematic structural diagram of the deep learning-based natural landscape multi-view three-dimensional reconstruction system.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Existing three-dimensional reconstruction methods for natural landscape models perform poorly in low-texture and texture-less regions and suffer from high memory cost and long running time. This application therefore discloses a method that optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation, so that the obtained depth map is more complete and accurate, the local detail precision of the landscape model is higher, reconstruction is more efficient and generality is stronger. As shown in figs. 1 to 3, the method comprises the following steps. Step1: acquire a multi-view image set of the natural landscape and preprocess the two-dimensional images in the multi-view image set.
The specific preprocessing process comprises: performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, where the set comprises a source image and reference images from a plurality of corresponding viewing angles, and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape. Because processing only the edge pixels is clearly insufficient against environmental effects (such as reflections) and edge occlusion of target objects in key reconstruction regions, the edges are dilated with the dilation operator of mathematical morphology to obtain more edge-region pixels. The target object is extracted from the image with the mask generated by image segmentation and combined with the background image as preprocessing, and bilateral filtering is applied to the edge pixels. In regions far from an edge, the pixel-value weights of the pixels in the filter window are similar, so the spatial-distance weight dominates the filtering; in edge regions, the pixel-value weights of pixels on the same side of the edge are similar and far larger than those of pixels on the other side, so pixels on the opposite side hardly influence the filtering result, which protects the edge information. This achieves the preprocessing effects of enhancing the edge features of the reconstruction region and extending the region edges, and the result benefits the subsequent multi-view reconstruction.
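As a concrete illustration of this preprocessing, the following is a minimal sketch assuming OpenCV and a uint8 segmentation mask; the function name and parameter values are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def preprocess_view(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dilate the segmentation edge and bilaterally filter the edge band.

    `image` is a BGR uint8 array; `mask` is a uint8 segmentation mask.
    """
    # Morphological dilation of the mask edge gathers more edge-region
    # pixels around occlusions and reflections.
    edge = cv2.Canny(mask, 50, 150)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    edge_band = cv2.dilate(edge, kernel, iterations=2)

    # Bilateral filtering preserves edges: pixels on the far side of an
    # edge receive negligible weight, protecting edge information.
    smoothed = cv2.bilateralFilter(image, 9, 75, 75)

    # Replace only the dilated edge band with the filtered values.
    out = image.copy()
    out[edge_band > 0] = smoothed[edge_band > 0]
    return out

# The target object can be extracted with the segmentation mask and
# recombined with the background, as the preprocessing describes:
# target = cv2.bitwise_and(image, image, mask=mask)
```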
Feature extraction from the two-dimensional images is very important for multi-view three-dimensional reconstruction and directly affects the completeness and precision of the generated three-dimensional model. If the multi-view two-dimensional images of the model are fed directly into the deep neural network for feature matching and three-dimensional generation, the continuously changing extrinsic camera parameters of the acquisition equipment introduce a large accumulated error into the subsequent feature matching, harming the completeness of the reconstruction and consuming excessive computing resources. Therefore, key-reconstruction-region segmentation is first performed on the input views; this step serves as image preprocessing and can simultaneously perform feature enhancement or occlusion repair according to the environmental factors of the natural landscape. For texture-less and weak-feature regions, the multi-scale feature extraction of the subsequent steps continuously reduces the accumulated matching error of the target feature region. The method optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation.
In general, a pixel on a depth boundary has several potential depths: within similar color regions the depth is the same, and pixels on the same geometric plane usually have similar colors and depths. Coordinate-transform prediction is performed directly with an optical-flow-based approach that lets a pixel select among potential depths rather than an intermediate depth value, so that a point belongs either to the object or to the background, avoiding the ambiguity of propagating depth across depth boundaries. In this process, to better predict the depth boundary, the depth map D obtained by learned patch-matching iteration is upsampled to twice its resolution; the original image of the same resolution is then concatenated to it, and the result is fed to the feature extraction network to obtain intermediate features. To better extract the frontal edges of the depth image and obtain the important features, the corresponding edge features are dilated and eroded accordingly. Finally, the offsets of the corresponding coordinates are obtained by convolution. For each coordinate point $p_{l-1}$ of $D_{l-1}$:

$$D_l(p_l)=\sum_{p\in N(p_{l-1})} w_p\,D_{l-1}(p),$$

where $p_l$ is the coordinate after network processing, $N(p_{l-1})$ denotes the neighborhood around the feature point, and $w_p$ denotes the feature-point distance-estimation weights.
Step2: construct a multi-scale feature extraction network, train it on a training set to obtain a trained multi-scale feature extraction network, and use the trained network to extract features from the preprocessed two-dimensional images to obtain the target key features.
Specifically, the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure: a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
As shown in fig. 2, the network builds on an image pyramid (FPN): several downsampled images are obtained through Gaussian smoothing and sub-sampling, i.e. a (K+1)-level Gaussian pyramid yields K+1 Gaussian images through smoothing and sub-sampling operations. The Gaussian pyramid amounts to a series of low-pass filters whose cut-off frequency increases by a factor of 2 from one level to the next, so the pyramid spans a large frequency range. From one input picture, several images of different scales are obtained; connecting the four corners of these images of different scales yields a structure resembling a real pyramid. This adds a scale (or depth) dimension to the two-dimensional image, from which more useful information can be obtained.
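A minimal sketch of this pyramid construction, assuming OpenCV; the number of levels is an illustrative parameter.

```python
import cv2

def gaussian_pyramid(image, levels: int):
    """Return the K+1 Gaussian images of a (K+1)-level pyramid."""
    pyramid = [image]
    for _ in range(levels):
        # pyrDown applies Gaussian smoothing and drops every other row and
        # column, halving the cut-off frequency at each level.
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid
```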
In this embodiment, the training of the multi-scale feature extraction network through a training set includes:
First, N view pictures of size W×H are input; different scales may be used, such as $\tfrac{W}{2}\times\tfrac{H}{2}$ and $\tfrac{W}{4}\times\tfrac{H}{4}$, with labeled feature maps F1, F2, F3. F1 is passed through the deformable convolution network to obtain a map F1'. The deformable convolution is defined by the formula

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n),$$

where $\mathcal{R}$ is the regular sampling grid of the kernel, $w(p_n)$ are the kernel weights, and $\Delta p_n$ are the learned offsets.
After multi-layer convolution, the extracted features are divided into several parts according to the number of layers; the output of each part is then processed by a deformable convolution module and used as the input of the next convolution layer, and the deformable convolution module lets the network extract the key features of the target. A convolution kernel extracts the features of the input object; a traditional kernel usually has a fixed size, so its adaptability and generalization are poor and its ability to handle unknown variation is weak. Deformable convolution adds, on top of traditional convolution, direction vectors that adjust the convolution kernel so that its shape fits the feature more closely. Its implementation is basically similar to traditional convolution: a single convolution layer is applied to the feature map obtained by convolution to produce the offsets of the deformable convolution, and during training the convolution kernel that generates the output features and the kernel that generates the offsets are learned synchronously.
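The following is a minimal sketch of such a deformable block, assuming PyTorch with torchvision's DeformConv2d; channel sizes are illustrative. The offset-predicting convolution and the feature convolution are trained jointly, as described above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A conv layer predicts per-location offsets; DeformConv2d samples with them."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Two offsets (dx, dy) per kernel tap: 2 * k * k channels.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The offset-generating kernel and the output-feature kernel are
        # learned synchronously, as described above.
        return self.deform(x, self.offset(x))

feat = torch.randn(1, 32, 64, 80)      # feature map from ordinary convolution
out = DeformableBlock(32, 32)(feat)    # deformable refinement of that map
```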
Step3: input the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and output the corresponding depth map after the iterative computation finishes.
Specifically, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration. The pixel depth matching in the learning-based patch-matching iterative model is realized by a preset matching cost calculation method.
The matching cost calculation method computes the cost of each depth hypothesis of each pixel through group-wise correlation.
Specifically, the similarity of each group is computed according to the formula

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)},$$

where $w_i(p)$ denotes the weight of pixel $p$ with respect to reference image $I_i$, $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, $S_i(p,j)\in\mathbb{R}^G$ denotes the similarity vector of the corresponding groups,

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

$F_0(p)^g$ and $F_i(p_{i,j})^g$ respectively denote the $g$-th feature group of the source image and of the $i$-th reference image, $N$ denotes the total number of source and reference images, and $p_{i,j}$ denotes the pixel in reference image $I_i$ corresponding to pixel $p$ under depth hypothesis $j$.
For multi-view stereo vision, this step must integrate information from an arbitrary number of source images into a single depth value for each pixel. To this end, the cost of each depth hypothesis of each pixel is computed by group-wise correlation, and the views are aggregated with pixel-level view weights; visibility information improves robustness during cost aggregation. Finally, a very small network maps the aggregated per-group costs to a single cost for each pixel at each depth layer. The original pixel-level view-weight computation network consists of plain three-dimensional convolution and a nonlinearity; however, an ordinary 3D convolution, with its fixed receptive field, cannot effectively model large motions, which limits the information important for reconstruction.
Therefore, to address this problem, a deformable three-dimensional convolution network that integrates deformable convolution with planar convolution is used to maximize feature extraction at learnable positions in a local region and to improve the capture of image feature regions and continuous motion across views. The deformable three-dimensional convolution is given by

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n),$$

where the sampling grid $\mathcal{R}$ and the learned offsets $\Delta p_n$ are now three-dimensional.
in this embodiment, at the initial iteration of patch matching, a randomly generated depth hypothesis is used as a reference for the first propagation, the random depth hypothesis being defined by a predefined inverse depth range [ d ]min dmax]Determining and generating, for each pixel, a plurality of depth value hypotheses, dividing the depth range into a plurality of intervalsUniformly distributed over a predetermined inverse depth range or the like, and having each interval covered by a depth hypothesis, which on the one hand increases diversity, while enabling our network to operate in complex and large-scale scenarios. For the later iteration, the depth estimation generated in the previous stage is taken as a basis, meanwhile, the depth disturbance is carried out on each pixel in a depth range R, and the disturbance range is gradually reduced along with the continuation of the iteration, so that the depth assumption can be further enriched, meanwhile, the result is further refined, and the error of the previous iteration is corrected.
Experiments show that the results differ from stage to stage, and a single loss term does not represent the training situation well. In the first stage, depth prediction is a simple perturbation without propagation, so its weight is small. At stage 0, although the depth-map refinement module improves the accuracy of the point cloud, it correspondingly reduces completeness; the deformable convolutional neural network greatly improves accuracy, so to improve the performance of the network, the loss weight of the depth-map refinement module is correspondingly reduced. So that the loss function better reflects the entire training process, different weights $\lambda_k$ are assigned at the different stages ($k = 0, 1, 2, 3$). The loss function is defined as

$$L=\sum_{k=0}^{3}\lambda_k\,L_k.$$
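A minimal sketch of this stage-weighted loss, assuming an L1-type per-stage depth loss on valid pixels; the weight values are illustrative, not specified by the patent.

```python
import torch.nn.functional as F

def multi_stage_loss(preds, gts, masks, weights=(0.5, 1.0, 1.0, 2.0)):
    """L = sum_k lambda_k * L_k over the stages k = 0..3."""
    total = 0.0
    for lam, pred, gt, m in zip(weights, preds, gts, masks):
        # Per-stage depth loss, evaluated only where ground truth is valid.
        total = total + lam * F.smooth_l1_loss(pred[m], gt[m])
    return total
```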
Depth values on the same physical surface are usually correlated, so instead of propagating depth hypotheses from a static set of neighboring pixels as before, they are propagated adaptively from the same physical surface using adaptive propagation on a deformable convolution network; this converges faster than before and at the same time collects more useful depth hypotheses in weak-texture and texture-less regions. The model learns additional two-dimensional offsets and applies them to the fixed two-dimensional offsets organized as a grid: a single 2D CNN learns an extra two-dimensional offset per pixel of the target picture, and the depth hypotheses are obtained by bilinear interpolation.
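A minimal sketch of such adaptive propagation, assuming PyTorch; the 8-neighbor grid, the offset network, and bilinear gathering via grid_sample are illustrative choices, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePropagation(nn.Module):
    """Gather depth hypotheses from adaptively offset neighbors."""

    def __init__(self, feat_ch: int, n_neighbors: int = 8):
        super().__init__()
        self.n = n_neighbors
        # A single 2D CNN predicts an extra (dx, dy) per neighbor per pixel.
        self.offset_net = nn.Conv2d(feat_ch, 2 * n_neighbors, 3, padding=1)

    def forward(self, feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        # Fixed 8-neighborhood grid, to which the learned offsets are added.
        fixed = torch.tensor([[-1, -1], [0, -1], [1, -1], [-1, 0],
                              [1, 0], [-1, 1], [0, 1], [1, 1]],
                             dtype=feat.dtype, device=feat.device)  # (n, 2) as (dx, dy)
        learned = self.offset_net(feat).permute(0, 2, 3, 1).reshape(b, h, w, self.n, 2)

        # Base sampling grid in grid_sample's normalized (x, y) coordinates.
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).expand(b, h, w, 2)
        scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)],
                             device=feat.device)

        hyps = []
        for k in range(self.n):
            grid = base + (fixed[k] + learned[..., k, :]) * scale
            # Bilinear interpolation gathers each neighbor's depth hypothesis.
            hyps.append(F.grid_sample(depth.unsqueeze(1), grid,
                                      align_corners=True).squeeze(1))
        return torch.stack(hyps, dim=1)              # (b, n, h, w)
```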
As noted above, for multi-view stereo this step must integrate the cost values from an arbitrary number of source images into a single depth value per pixel. For this purpose, the cost of each hypothesis can be computed by group-wise correlation (reducing the correlation channels of the features to the number of groups, which saves memory). The views are aggregated by pixel-level view weights, and visibility information improves robustness during cost aggregation. Finally, the cost of each group is projected by a small network onto each pixel of the reference frame at each depth.
In the present embodiment, $F_0(p)$ and $F_i(p_{i,j})$ respectively denote the features of the source image and of the reference images. After the feature channels are divided into G groups, $F_0(p)^g$ and $F_i(p_{i,j})^g$ denote the features of group $g$, and their similarity is

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

with $S_i(p,j)\in\mathbb{R}^G$ denoting the similarity vector of the corresponding groups. To compute the pixel-level view weights, the diversity of the initial depth-hypothesis set at stage 3 is exploited, as shown in fig. 2. The visibility information of image $I_i$ at pixel $p$ is denoted $w_i(p)$; the weights are locked after being computed once and are upsampled to the finer stages. A simple pixel-level view-weight computation network consists of a 1×1×1 3D convolution and a nonlinear sigmoid; it takes the initial similarity $S_i$ as input and outputs values $P_i(p,j)$ between 0 and 1. The weight of pixel $p$ with respect to reference image $I_i$ is then expressed as $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, and the final similarity of each group for each pixel is

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)}.$$

A single cost value for each depth hypothesis of each pixel is then obtained with a small network with 1×1×1 3D convolution.
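A minimal sketch of the group-wise similarity and weighted view aggregation defined above, assuming PyTorch tensors with the shapes noted in the comments; names are illustrative.

```python
import torch

def groupwise_similarity(f0: torch.Tensor, fi: torch.Tensor, groups: int):
    """S_i(p, j)^g = (G / C) * <F0(p)^g, Fi(p_ij)^g>.

    f0: (b, c, h, w) source-image features; fi: (b, c, d, h, w) reference
    features warped to d depth hypotheses.
    """
    b, c, d, h, w = fi.shape
    f0g = f0.view(b, groups, c // groups, 1, h, w)   # broadcast over hypotheses
    fig = fi.view(b, groups, c // groups, d, h, w)
    return (groups / c) * (f0g * fig).sum(dim=2)     # (b, G, d, h, w)

def aggregate_views(sims, view_weights):
    """S_bar = sum_i w_i(p) S_i / sum_i w_i(p), each w_i of shape (b, h, w)."""
    num = sum(w[:, None, None] * s for s, w in zip(sims, view_weights))
    den = sum(view_weights)[:, None, None].clamp(min=1e-6)
    return num / den                                  # (b, G, d, h, w)
```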
Step4: input the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and construct a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into the depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
In this embodiment, the depth map obtained in Step3 and the input source image serve as the input of the depth-optimization algorithm and are fed into the designed depth residual network. To avoid a shift of the depth scale, the input depth is first normalized to [0, 1] and restored after refinement. The network extracts features from the processed depth map and the source image, applies deconvolution to the obtained depth features, and upsamples them to the size of the image features. The two sets of features are concatenated and several two-dimensional convolution layers are applied to obtain a depth residual, which is added to the depth estimate of the previous step; the optimized depth map is

$$D_{\text{refined}} = D + \Delta D,$$

where $\Delta D$ is the predicted depth residual.
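A minimal sketch of such a depth-residual refinement network, assuming PyTorch; channel counts and layer depths are illustrative, not specified by the patent.

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Predict a depth residual from the depth map and the source image."""

    def __init__(self):
        super().__init__()
        self.img_feat = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.depth_feat = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                                        nn.ReLU())
        # Deconvolution upsamples depth features back to image-feature size.
        self.up = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)
        self.residual = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, depth, image, d_min, d_max):
        # Normalize depth to [0, 1] to avoid a depth-scale shift; restore after.
        d_norm = (depth - d_min) / (d_max - d_min)
        fused = torch.cat([self.img_feat(image),
                           self.up(self.depth_feat(d_norm))], dim=1)
        refined = d_norm + self.residual(fused)      # add the learned residual
        return refined * (d_max - d_min) + d_min
```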
Based on the above deep-learning-based natural landscape multi-view three-dimensional reconstruction method, this application also discloses a deep-learning-based natural landscape multi-view three-dimensional reconstruction system, which specifically comprises: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module.
The image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set.
The multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features.
The iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes.
And the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from it to obtain a stereoscopic view of the natural landscape.
For natural landscape models, the method adopts a multi-view three-dimensional reconstruction algorithm based on deformable convolution and optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation, so the obtained depth map is more complete and accurate. Given several input pictures and the corresponding camera parameters, the algorithm estimates image depth, performs three-dimensional modelling, and finally obtains a three-dimensional model of the object in the images. This addresses the problems of existing three-dimensional reconstruction methods: poor performance in low-texture and texture-less regions, high memory cost, long running time, strong environmental influence on natural landscape model reconstruction, insufficient feature extraction, and parameters that must be designed in advance, cannot adapt, work only for specific scenes and generalize poorly.
The above illustrates the feasibility and effectiveness of the deep-learning-based natural landscape multi-view three-dimensional reconstruction method provided by this application.
It should be noted that the described embodiments are only preferred ways of implementing the invention; all obvious modifications within the scope of the invention are encompassed by the present general inventive concept.

Claims (8)

1. A natural landscape multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:
Step1, acquiring a multi-view image set of a natural landscape, and preprocessing the two-dimensional images in the multi-view image set;
Step2, constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and using the trained network to extract features from the preprocessed two-dimensional images to obtain target key features;
Step3, inputting the obtained target key features into a learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and Step4, inputting the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
2. The deep learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 1, wherein the preprocessing comprises:
performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, wherein the multi-view image set comprises a source image and reference images from a plurality of corresponding viewing angles;
and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape.
3. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 2, wherein the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure, and a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
4. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 3, wherein in the iterative computation, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations, the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration.
5. The deep learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 4, wherein the pixel depth matching in the learning-based patch matching iterative model is realized by a preset matching cost calculation method.
6. The deep learning-based natural landscape multi-view three-dimensional reconstruction method as claimed in claim 5, wherein the matching cost calculation method calculates the cost of each depth hypothesis value of each pixel through group-wise correlation;
the method specifically comprises the following steps: according to the formula:
Figure FDA0003594516640000021
calculating the similarity of each group, where wi (P) represents the weight of the pixel P to the reference image Ii, wi(p)=max{Pi(p,j)|j=0,1,...,D-1,
Figure FDA0003594516640000022
A similarity vector representing the corresponding group,
Figure FDA0003594516640000023
F0(p)gand Fi(pi,j)gRespectively representing the characteristics of the g-th group of source images and the characteristics of the g-th group of reference images, N representing the total number of the source images and the reference images, Pi,jRepresenting a pixel P of the corresponding source image in the reference image.
7. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 6, wherein Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into a depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
8. A natural landscape multi-view three-dimensional reconstruction system based on deep learning, adopting the deep-learning-based natural landscape multi-view three-dimensional reconstruction method of any one of claims 1 to 7, characterized by comprising: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module;
the image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set;
the multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features;
the iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized final depth map to obtain a stereoscopic view of the natural landscape.
CN202210384876.5A 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning Pending CN114677479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384876.5A CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384876.5A CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Publications (1)

Publication Number Publication Date
CN114677479A true CN114677479A (en) 2022-06-28

Family

ID=82078216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384876.5A Pending CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN114677479A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170746A (en) * 2022-09-07 2022-10-11 中南大学 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN115908730A (en) * 2022-11-11 2023-04-04 南京理工大学 Edge-based three-dimensional scene reconstruction system method for remote control end under low communication bandwidth
CN117197215A (en) * 2023-09-14 2023-12-08 上海智能制造功能平台有限公司 Robust extraction method for multi-vision round hole features based on five-eye camera system
CN117197215B (en) * 2023-09-14 2024-04-09 上海智能制造功能平台有限公司 Robust extraction method for multi-vision round hole features based on five-eye camera system

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN110443842B (en) Depth map prediction method based on visual angle fusion
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN105654492B (en) Robust real-time three-dimensional method for reconstructing based on consumer level camera
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN106780592A (en) Kinect depth reconstruction algorithms based on camera motion and image light and shade
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN109146001B (en) Multi-view ISAR image fusion method
CN113963117B (en) Multi-view three-dimensional reconstruction method and device based on variable convolution depth network
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
CN116229461A (en) Indoor scene image real-time semantic segmentation method based on multi-scale refinement
CN111899295A (en) Monocular scene depth prediction method based on deep learning
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN112734914A (en) Image stereo reconstruction method and device for augmented reality vision
CN114170290A (en) Image processing method and related equipment
Li et al. Deep learning based monocular depth prediction: Datasets, methods and applications
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
CN115205463A (en) New visual angle image generation method, device and equipment based on multi-spherical scene expression
CN113378756B (en) Three-dimensional human body semantic segmentation method, terminal device and storage medium
Jia et al. Depth measurement based on a convolutional neural network and structured light
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination