CN114677479A - Natural landscape multi-view three-dimensional reconstruction method based on deep learning - Google Patents

Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Info

Publication number
CN114677479A
CN114677479A (application CN202210384876.5A)
Authority
CN
China
Prior art keywords
depth
view
feature extraction
dimensional
natural landscape
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210384876.5A
Other languages
Chinese (zh)
Inventor
李毅 (Li Yi)
张笑钦 (Zhang Xiaoqin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data And Information Technology Research Institute Of Wenzhou University
Original Assignee
Big Data And Information Technology Research Institute Of Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data And Information Technology Research Institute Of Wenzhou University
Priority to CN202210384876.5A
Publication of CN114677479A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a deep-learning-based multi-view three-dimensional reconstruction method for natural landscapes, comprising the following steps: acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the set; constructing a multi-scale feature extraction network, training it, and using the trained network to extract target key features from the preprocessed two-dimensional images; inputting the target key features into a learning-based patch-matching iterative model that iteratively computes pixel depth matching and outputs the corresponding depth map when the iteration finishes; and inputting the obtained depth map and the source image into a depth residual network for optimization to obtain the final optimized depth map, from which a three-dimensional model of the object is constructed to obtain a stereoscopic view of the natural landscape.

Description

Natural landscape multi-view three-dimensional reconstruction method based on deep learning
Technical Field
The invention relates to the technical field of three-dimensional reconstruction, in particular to a natural landscape multi-view three-dimensional reconstruction method based on deep learning.
Background
Metaverse technology is a virtual-real fusion framework integrating virtual reality, game engines, the mobile internet, blockchain and the like, and provides a highly immersive interactive experience. Applying metaverse technology to real tourism scenes improves the innovative development of tourism products and protects the intellectual property of digital tourism products, which is of great theoretical significance and practical value for the development of the tourism industry. Constructing a virtual-real natural landscape scene requires establishing a simulated virtual scene from image data by multi-view three-dimensional reconstruction. In recent years, three-dimensional reconstruction technology has used depth cameras, infrared sensors, lidar and the like to extract and estimate depth information of real-world scenes, and has been widely applied in fields such as intelligent autonomous driving, AR/VR, satellite remote-sensing mapping and multimedia entertainment. Artificial intelligence can extract, process and transform the features of multi-view two-dimensional images through the high generalization capability of neural network models, making the prediction and estimation of three-dimensional scenes more accurate and efficient. However, the acquisition of sample data for three-dimensional reconstruction of natural scenes is affected by factors such as acquisition equipment, the natural environment, noise and occlusion, so accuracy is not high, which poses a great challenge to the simulated construction of virtual scenes. Improving the accuracy and efficiency of multi-view three-dimensional reconstruction of natural virtual scenes and applying it to the cultural tourism metaverse is therefore one of the research difficulties that urgently needs to be solved.
Multi-view stereo reconstruction is a method for recovering a three-dimensional model from multiple images of the same scene taken from different viewpoints. Deep-learning-based multi-view stereo reconstruction, such as the classical MVSNet architecture, usually constructs a three-dimensional cost volume to regress the depth values of the scene. Using a convolutional neural network for multi-view stereo matching improves the efficiency of traditional matching overall. However, because of the 3D convolutional neural network used for cost-volume regularization, MVSNet is limited by GPU memory in large-scale, high-resolution scenarios. Traditional methods struggle with specular reflection and texture, their reconstruction completeness is low and their speed is slow; natural landscape reconstruction is strongly affected by the environment, feature extraction is insufficient, and parameters must be designed in advance and cannot adapt, so these methods work only for specific scenes and generalize poorly.
In summary, how to overcome the above-mentioned drawbacks is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
To address the above problems and needs, this solution provides a deep-learning-based multi-view three-dimensional reconstruction method for natural landscapes, which solves the above technical problems through the following technical scheme.
To achieve this purpose, the invention provides the following technical scheme. A natural landscape multi-view three-dimensional reconstruction method based on deep learning comprises the following steps: Step1, acquiring a multi-view image set of the natural landscape, and preprocessing the two-dimensional images in the multi-view image set;
Step2, constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and using the trained network to extract features from the preprocessed two-dimensional images to obtain target key features;
Step3, inputting the obtained target key features into a learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and Step4, inputting the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
Further, the preprocessing comprises:
performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, wherein the multi-view image set comprises a source image and reference images from a plurality of corresponding viewing angles;
and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape.
Furthermore, the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure; a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
Furthermore, in the iterative computation, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations, the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration.
Further, the pixel depth matching in the learning-based patch matching iterative model is realized by a preset matching cost calculation method.
Furthermore, the matching cost calculation method computes the cost of each depth hypothesis of each pixel through group-wise correlation.
Specifically, the similarity of each group is computed according to the formula

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)},$$

where $w_i(p)$ denotes the weight of pixel $p$ with respect to reference image $I_i$, $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, $S_i(p,j)\in\mathbb{R}^G$ denotes the similarity vector of the corresponding groups,

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

$F_0(p)^g$ and $F_i(p_{i,j})^g$ respectively denote the $g$-th feature group of the source image and of the $i$-th reference image, $N$ denotes the total number of source and reference images, and $p_{i,j}$ denotes the pixel in reference image $I_i$ corresponding to pixel $p$ under depth hypothesis $j$.
Further, Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into a depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
A natural landscape multi-view three-dimensional reconstruction system based on deep learning adopts the above method and specifically comprises: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module.
The image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set.
The multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features.
The iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes.
And the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from it to obtain a stereoscopic view of the natural landscape.
According to the above technical scheme, the invention has the following beneficial effects: the edges of the depth map are optimized by an edge-processing algorithm based on local region segmentation, so the obtained depth map is more complete and accurate, the local detail precision of the landscape model is higher, reconstruction is more efficient, and generality is stronger.
In addition to the above objects, features and advantages, preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings so that the features and advantages of the present invention can be easily understood.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below; the drawings illustrate only some embodiments of the invention and do not limit all embodiments of the invention to them.
Fig. 1 is a schematic diagram of specific steps of a natural landscape multi-view three-dimensional reconstruction method based on deep learning in the present invention.
Fig. 2 is a schematic diagram of a construction process of the natural landscape multi-view three-dimensional reconstruction method based on deep learning in the invention.
Fig. 3 is a schematic structural diagram of the deep learning-based natural landscape multi-view three-dimensional reconstruction system.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Existing three-dimensional reconstruction methods for natural landscape models perform poorly in low-texture and texture-less regions and suffer from high memory cost and long running time. This application therefore discloses a method that optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation, so that the obtained depth map is more complete and accurate, the local detail precision of the landscape model is higher, reconstruction is more efficient and generality is stronger. As shown in figs. 1 to 3, the method comprises the following steps. Step1: acquire a multi-view image set of the natural landscape and preprocess the two-dimensional images in the multi-view image set.
The specific preprocessing process comprises: performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, where the set comprises a source image and reference images from a plurality of corresponding viewing angles, and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape. Because processing only the edge pixels is clearly insufficient against environmental effects (such as reflections) and edge occlusion of target objects in key reconstruction regions, the edges are dilated with the dilation operator of mathematical morphology to obtain more edge-region pixels. The target object is extracted from the image with the mask generated by image segmentation and combined with the background image as preprocessing, and bilateral filtering is applied to the edge pixels. In regions far from an edge, the pixel-value weights of the pixels in the filter window are similar, so the spatial-distance weight dominates the filtering; in edge regions, the pixel-value weights of pixels on the same side of the edge are similar and far larger than those of pixels on the other side, so pixels on the opposite side hardly influence the filtering result, which protects the edge information. This achieves the preprocessing effects of enhancing the edge features of the reconstruction region and extending the region edges, and the result benefits the subsequent multi-view reconstruction.
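As a concrete illustration of this preprocessing, the following is a minimal sketch assuming OpenCV and a uint8 segmentation mask; the function name and parameter values are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def preprocess_view(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dilate the segmentation edge and bilaterally filter the edge band.

    `image` is a BGR uint8 array; `mask` is a uint8 segmentation mask.
    """
    # Morphological dilation of the mask edge gathers more edge-region
    # pixels around occlusions and reflections.
    edge = cv2.Canny(mask, 50, 150)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    edge_band = cv2.dilate(edge, kernel, iterations=2)

    # Bilateral filtering preserves edges: pixels on the far side of an
    # edge receive negligible weight, protecting edge information.
    smoothed = cv2.bilateralFilter(image, 9, 75, 75)

    # Replace only the dilated edge band with the filtered values.
    out = image.copy()
    out[edge_band > 0] = smoothed[edge_band > 0]
    return out

# The target object can be extracted with the segmentation mask and
# recombined with the background, as the preprocessing describes:
# target = cv2.bitwise_and(image, image, mask=mask)
```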
Feature extraction from the two-dimensional images is very important for multi-view three-dimensional reconstruction and directly affects the completeness and precision of the generated three-dimensional model. If the multi-view two-dimensional images of the model are fed directly into the deep neural network for feature matching and three-dimensional generation, the continuously changing extrinsic camera parameters of the acquisition equipment introduce a large accumulated error into the subsequent feature matching, harming the completeness of the reconstruction and consuming excessive computing resources. Therefore, key-reconstruction-region segmentation is first performed on the input views; this step serves as image preprocessing and can simultaneously perform feature enhancement or occlusion repair according to the environmental factors of the natural landscape. For texture-less and weak-feature regions, the multi-scale feature extraction of the subsequent steps continuously reduces the accumulated matching error of the target feature region. The method optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation.
In general, a pixel on a depth boundary has several potential depths: within similar color regions the depth is the same, and pixels on the same geometric plane usually have similar colors and depths. Coordinate-transform prediction is performed directly with an optical-flow-based approach that lets a pixel select among potential depths rather than an intermediate depth value, so that a point belongs either to the object or to the background, avoiding the ambiguity of propagating depth across depth boundaries. In this process, to better predict the depth boundary, the depth map D obtained by learned patch-matching iteration is upsampled to twice its resolution; the original image of the same resolution is then concatenated to it, and the result is fed to the feature extraction network to obtain intermediate features. To better extract the frontal edges of the depth image and obtain the important features, the corresponding edge features are dilated and eroded accordingly. Finally, the offsets of the corresponding coordinates are obtained by convolution. For each coordinate point $p_{l-1}$ of $D_{l-1}$:

$$D_l(p_l)=\sum_{p\in N(p_{l-1})} w_p\,D_{l-1}(p),$$

where $p_l$ is the coordinate after network processing, $N(p_{l-1})$ denotes the neighborhood around the feature point, and $w_p$ denotes the feature-point distance-estimation weights.
Step2: construct a multi-scale feature extraction network, train it on a training set to obtain a trained multi-scale feature extraction network, and use the trained network to extract features from the preprocessed two-dimensional images to obtain the target key features.
Specifically, the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure: a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
As shown in fig. 2, the network builds on an image pyramid (FPN): several downsampled images are obtained through Gaussian smoothing and sub-sampling, i.e. a (K+1)-level Gaussian pyramid yields K+1 Gaussian images through smoothing and sub-sampling operations. The Gaussian pyramid amounts to a series of low-pass filters whose cut-off frequency increases by a factor of 2 from one level to the next, so the pyramid spans a large frequency range. From one input picture, several images of different scales are obtained; connecting the four corners of these images of different scales yields a structure resembling a real pyramid. This adds a scale (or depth) dimension to the two-dimensional image, from which more useful information can be obtained.
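A minimal sketch of this pyramid construction, assuming OpenCV; the number of levels is an illustrative parameter.

```python
import cv2

def gaussian_pyramid(image, levels: int):
    """Return the K+1 Gaussian images of a (K+1)-level pyramid."""
    pyramid = [image]
    for _ in range(levels):
        # pyrDown applies Gaussian smoothing and drops every other row and
        # column, halving the cut-off frequency at each level.
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid
```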
In this embodiment, the training of the multi-scale feature extraction network through a training set includes:
First, N view pictures of size W×H are input; different scales may be used, such as $\tfrac{W}{2}\times\tfrac{H}{2}$ and $\tfrac{W}{4}\times\tfrac{H}{4}$, with labeled feature maps F1, F2, F3. F1 is passed through the deformable convolution network to obtain a map F1'. The deformable convolution is defined by the formula

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n),$$

where $\mathcal{R}$ is the regular sampling grid of the kernel, $w(p_n)$ are the kernel weights, and $\Delta p_n$ are the learned offsets.
After multi-layer convolution, the extracted features are divided into several parts according to the number of layers; the output of each part is then processed by a deformable convolution module and used as the input of the next convolution layer, and the deformable convolution module lets the network extract the key features of the target. A convolution kernel extracts the features of the input object; a traditional kernel usually has a fixed size, so its adaptability and generalization are poor and its ability to handle unknown variation is weak. Deformable convolution adds, on top of traditional convolution, direction vectors that adjust the convolution kernel so that its shape fits the feature more closely. Its implementation is basically similar to traditional convolution: a single convolution layer is applied to the feature map obtained by convolution to produce the offsets of the deformable convolution, and during training the convolution kernel that generates the output features and the kernel that generates the offsets are learned synchronously.
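The following is a minimal sketch of such a deformable block, assuming PyTorch with torchvision's DeformConv2d; channel sizes are illustrative. The offset-predicting convolution and the feature convolution are trained jointly, as described above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A conv layer predicts per-location offsets; DeformConv2d samples with them."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Two offsets (dx, dy) per kernel tap: 2 * k * k channels.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The offset-generating kernel and the output-feature kernel are
        # learned synchronously, as described above.
        return self.deform(x, self.offset(x))

feat = torch.randn(1, 32, 64, 80)      # feature map from ordinary convolution
out = DeformableBlock(32, 32)(feat)    # deformable refinement of that map
```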
Step3: input the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and output the corresponding depth map after the iterative computation finishes.
Specifically, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration. The pixel depth matching in the learning-based patch-matching iterative model is realized by a preset matching cost calculation method.
The matching cost calculation method computes the cost of each depth hypothesis of each pixel through group-wise correlation.
Specifically, the similarity of each group is computed according to the formula

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)},$$

where $w_i(p)$ denotes the weight of pixel $p$ with respect to reference image $I_i$, $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, $S_i(p,j)\in\mathbb{R}^G$ denotes the similarity vector of the corresponding groups,

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

$F_0(p)^g$ and $F_i(p_{i,j})^g$ respectively denote the $g$-th feature group of the source image and of the $i$-th reference image, $N$ denotes the total number of source and reference images, and $p_{i,j}$ denotes the pixel in reference image $I_i$ corresponding to pixel $p$ under depth hypothesis $j$.
For multi-view stereo vision, this step must integrate information from an arbitrary number of source images into a single depth value for each pixel. To this end, the cost of each depth hypothesis of each pixel is computed by group-wise correlation, and the views are aggregated with pixel-level view weights; visibility information improves robustness during cost aggregation. Finally, a very small network maps the aggregated per-group costs to a single cost for each pixel at each depth layer. The original pixel-level view-weight computation network consists of plain three-dimensional convolution and a nonlinearity; however, an ordinary 3D convolution, with its fixed receptive field, cannot effectively model large motions, which limits the information important for reconstruction.
Therefore, to address this problem, a deformable three-dimensional convolution network that integrates deformable convolution with planar convolution is used to maximize feature extraction at learnable positions in a local region and to improve the capture of image feature regions and continuous motion across views. The deformable three-dimensional convolution is given by

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n),$$

where the sampling grid $\mathcal{R}$ and the learned offsets $\Delta p_n$ are now three-dimensional.
in this embodiment, at the initial iteration of patch matching, a randomly generated depth hypothesis is used as a reference for the first propagation, the random depth hypothesis being defined by a predefined inverse depth range [ d ]min dmax]Determining and generating, for each pixel, a plurality of depth value hypotheses, dividing the depth range into a plurality of intervalsUniformly distributed over a predetermined inverse depth range or the like, and having each interval covered by a depth hypothesis, which on the one hand increases diversity, while enabling our network to operate in complex and large-scale scenarios. For the later iteration, the depth estimation generated in the previous stage is taken as a basis, meanwhile, the depth disturbance is carried out on each pixel in a depth range R, and the disturbance range is gradually reduced along with the continuation of the iteration, so that the depth assumption can be further enriched, meanwhile, the result is further refined, and the error of the previous iteration is corrected.
Experiments show that the results differ from stage to stage, and a single loss term does not represent the training situation well. In the first stage, depth prediction is a simple perturbation without propagation, so its weight is small. At stage 0, although the depth-map refinement module improves the accuracy of the point cloud, it correspondingly reduces completeness; the deformable convolutional neural network greatly improves accuracy, so to improve the performance of the network, the loss weight of the depth-map refinement module is correspondingly reduced. So that the loss function better reflects the entire training process, different weights $\lambda_k$ are assigned at the different stages ($k = 0, 1, 2, 3$). The loss function is defined as

$$L=\sum_{k=0}^{3}\lambda_k\,L_k.$$
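A minimal sketch of this stage-weighted loss, assuming an L1-type per-stage depth loss on valid pixels; the weight values are illustrative, not specified by the patent.

```python
import torch.nn.functional as F

def multi_stage_loss(preds, gts, masks, weights=(0.5, 1.0, 1.0, 2.0)):
    """L = sum_k lambda_k * L_k over the stages k = 0..3."""
    total = 0.0
    for lam, pred, gt, m in zip(weights, preds, gts, masks):
        # Per-stage depth loss, evaluated only where ground truth is valid.
        total = total + lam * F.smooth_l1_loss(pred[m], gt[m])
    return total
```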
Depth values on the same physical surface are usually correlated, so instead of propagating depth hypotheses from a static set of neighboring pixels as before, they are propagated adaptively from the same physical surface using adaptive propagation on a deformable convolution network; this converges faster than before and at the same time collects more useful depth hypotheses in weak-texture and texture-less regions. The model learns additional two-dimensional offsets and applies them to the fixed two-dimensional offsets organized as a grid: a single 2D CNN learns an extra two-dimensional offset per pixel of the target picture, and the depth hypotheses are obtained by bilinear interpolation.
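A minimal sketch of such adaptive propagation, assuming PyTorch; the 8-neighbor grid, the offset network, and bilinear gathering via grid_sample are illustrative choices, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePropagation(nn.Module):
    """Gather depth hypotheses from adaptively offset neighbors."""

    def __init__(self, feat_ch: int, n_neighbors: int = 8):
        super().__init__()
        self.n = n_neighbors
        # A single 2D CNN predicts an extra (dx, dy) per neighbor per pixel.
        self.offset_net = nn.Conv2d(feat_ch, 2 * n_neighbors, 3, padding=1)

    def forward(self, feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        # Fixed 8-neighborhood grid, to which the learned offsets are added.
        fixed = torch.tensor([[-1, -1], [0, -1], [1, -1], [-1, 0],
                              [1, 0], [-1, 1], [0, 1], [1, 1]],
                             dtype=feat.dtype, device=feat.device)  # (n, 2) as (dx, dy)
        learned = self.offset_net(feat).permute(0, 2, 3, 1).reshape(b, h, w, self.n, 2)

        # Base sampling grid in grid_sample's normalized (x, y) coordinates.
        ys = torch.linspace(-1, 1, h, device=feat.device)
        xs = torch.linspace(-1, 1, w, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).expand(b, h, w, 2)
        scale = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)],
                             device=feat.device)

        hyps = []
        for k in range(self.n):
            grid = base + (fixed[k] + learned[..., k, :]) * scale
            # Bilinear interpolation gathers each neighbor's depth hypothesis.
            hyps.append(F.grid_sample(depth.unsqueeze(1), grid,
                                      align_corners=True).squeeze(1))
        return torch.stack(hyps, dim=1)              # (b, n, h, w)
```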
As noted above, for multi-view stereo this step must integrate the cost values from an arbitrary number of source images into a single depth value per pixel. For this purpose, the cost of each hypothesis can be computed by group-wise correlation (reducing the correlation channels of the features to the number of groups, which saves memory). The views are aggregated by pixel-level view weights, and visibility information improves robustness during cost aggregation. Finally, the cost of each group is projected by a small network onto each pixel of the reference frame at each depth.
In the present embodiment, $F_0(p)$ and $F_i(p_{i,j})$ respectively denote the features of the source image and of the reference images. After the feature channels are divided into G groups, $F_0(p)^g$ and $F_i(p_{i,j})^g$ denote the features of group $g$, and their similarity is

$$S_i(p,j)^g=\frac{G}{C}\,\big\langle F_0(p)^g,\;F_i(p_{i,j})^g\big\rangle,$$

with $S_i(p,j)\in\mathbb{R}^G$ denoting the similarity vector of the corresponding groups. To compute the pixel-level view weights, the diversity of the initial depth-hypothesis set at stage 3 is exploited, as shown in fig. 2. The visibility information of image $I_i$ at pixel $p$ is denoted $w_i(p)$; the weights are locked after being computed once and are upsampled to the finer stages. A simple pixel-level view-weight computation network consists of a 1×1×1 3D convolution and a nonlinear sigmoid; it takes the initial similarity $S_i$ as input and outputs values $P_i(p,j)$ between 0 and 1. The weight of pixel $p$ with respect to reference image $I_i$ is then expressed as $w_i(p)=\max\{P_i(p,j)\mid j=0,1,\dots,D-1\}$, and the final similarity of each group for each pixel is

$$\bar{S}(p,j)^g=\frac{\sum_{i=1}^{N-1} w_i(p)\,S_i(p,j)^g}{\sum_{i=1}^{N-1} w_i(p)}.$$

A single cost value for each depth hypothesis of each pixel is then obtained with a small network with 1×1×1 3D convolution.
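A minimal sketch of the group-wise similarity and weighted view aggregation defined above, assuming PyTorch tensors with the shapes noted in the comments; names are illustrative.

```python
import torch

def groupwise_similarity(f0: torch.Tensor, fi: torch.Tensor, groups: int):
    """S_i(p, j)^g = (G / C) * <F0(p)^g, Fi(p_ij)^g>.

    f0: (b, c, h, w) source-image features; fi: (b, c, d, h, w) reference
    features warped to d depth hypotheses.
    """
    b, c, d, h, w = fi.shape
    f0g = f0.view(b, groups, c // groups, 1, h, w)   # broadcast over hypotheses
    fig = fi.view(b, groups, c // groups, d, h, w)
    return (groups / c) * (f0g * fig).sum(dim=2)     # (b, G, d, h, w)

def aggregate_views(sims, view_weights):
    """S_bar = sum_i w_i(p) S_i / sum_i w_i(p), each w_i of shape (b, h, w)."""
    num = sum(w[:, None, None] * s for s, w in zip(sims, view_weights))
    den = sum(view_weights)[:, None, None].clamp(min=1e-6)
    return num / den                                  # (b, G, d, h, w)
```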
Step4: input the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and construct a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into the depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
In this embodiment, the depth map obtained in Step3 and the input source image serve as the input of the depth-optimization algorithm and are fed into the designed depth residual network. To avoid a shift of the depth scale, the input depth is first normalized to [0, 1] and restored after refinement. The network extracts features from the processed depth map and the source image, applies deconvolution to the obtained depth features, and upsamples them to the size of the image features. The two sets of features are concatenated and several two-dimensional convolution layers are applied to obtain a depth residual, which is added to the depth estimate of the previous step; the optimized depth map is

$$D_{\text{refined}} = D + \Delta D,$$

where $\Delta D$ is the predicted depth residual.
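A minimal sketch of such a depth-residual refinement network, assuming PyTorch; channel counts and layer depths are illustrative, not specified by the patent.

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Predict a depth residual from the depth map and the source image."""

    def __init__(self):
        super().__init__()
        self.img_feat = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.depth_feat = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                                        nn.ReLU())
        # Deconvolution upsamples depth features back to image-feature size.
        self.up = nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1)
        self.residual = nn.Sequential(
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, depth, image, d_min, d_max):
        # Normalize depth to [0, 1] to avoid a depth-scale shift; restore after.
        d_norm = (depth - d_min) / (d_max - d_min)
        fused = torch.cat([self.img_feat(image),
                           self.up(self.depth_feat(d_norm))], dim=1)
        refined = d_norm + self.residual(fused)      # add the learned residual
        return refined * (d_max - d_min) + d_min
```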
Based on the above deep-learning-based natural landscape multi-view three-dimensional reconstruction method, this application also discloses a deep-learning-based natural landscape multi-view three-dimensional reconstruction system, which specifically comprises: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module.
The image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set.
The multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features.
The iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes.
And the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from it to obtain a stereoscopic view of the natural landscape.
For natural landscape models, the method adopts a multi-view three-dimensional reconstruction algorithm based on deformable convolution and optimizes the edges of the depth map with an edge-processing algorithm based on local region segmentation, so the obtained depth map is more complete and accurate. Given several input pictures and the corresponding camera parameters, the algorithm estimates image depth, performs three-dimensional modelling, and finally obtains a three-dimensional model of the object in the images. This addresses the problems of existing three-dimensional reconstruction methods: poor performance in low-texture and texture-less regions, high memory cost, long running time, strong environmental influence on natural landscape model reconstruction, insufficient feature extraction, and parameters that must be designed in advance, cannot adapt, work only for specific scenes and generalize poorly.
The above illustrates the feasibility and effectiveness of the deep-learning-based natural landscape multi-view three-dimensional reconstruction method provided by this application.
It should be noted that the described embodiments are only preferred ways of implementing the invention; all obvious modifications within the scope of the invention are encompassed by the present general inventive concept.

Claims (8)

1. A natural landscape multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:
Step1, acquiring a multi-view image set of a natural landscape, and preprocessing the two-dimensional images in the multi-view image set;
Step2, constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and using the trained network to extract features from the preprocessed two-dimensional images to obtain target key features;
Step3, inputting the obtained target key features into a learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and Step4, inputting the depth map obtained in Step3 and the source image into a depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized depth map to obtain a stereoscopic view of the natural landscape.
2. The deep learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 1, wherein the preprocessing comprises:
performing key-reconstruction-region segmentation on the two-dimensional images in the multi-view image set, wherein the multi-view image set comprises a source image and reference images from a plurality of corresponding viewing angles;
and performing feature enhancement or occlusion repair according to the environmental factors affecting the natural landscape.
3. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 2, wherein the constructed multi-scale feature extraction network is a deformable convolution network based on a feature pyramid (FPN) structure, and a convolution layer is applied to the feature map obtained by ordinary convolution to produce the offsets of the deformable convolution, thereby realizing target key feature extraction.
4. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 3, wherein in the iterative computation, at the first iteration the current single target key feature is used as the input of the initial iteration; in subsequent iterations, the current single target key feature is concatenated with the depth map obtained in the previous iteration as the input of the current iteration.
5. The deep learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 4, wherein the pixel depth matching in the learning-based patch matching iterative model is realized by a preset matching cost calculation method.
6. The deep learning-based natural landscape multi-view three-dimensional reconstruction method as claimed in claim 5, wherein the matching cost calculation method calculates the cost of each depth hypothesis value of each pixel through group-wise correlation;
the method specifically comprises the following steps: according to the formula:
Figure FDA0003594516640000021
calculating the similarity of each group, where wi (P) represents the weight of the pixel P to the reference image Ii, wi(p)=max{Pi(p,j)|j=0,1,...,D-1,
Figure FDA0003594516640000022
A similarity vector representing the corresponding group,
Figure FDA0003594516640000023
F0(p)gand Fi(pi,j)gRespectively representing the characteristics of the g-th group of source images and the characteristics of the g-th group of reference images, N representing the total number of the source images and the reference images, Pi,jRepresenting a pixel P of the corresponding source image in the reference image.
7. The deep-learning-based natural landscape multi-view three-dimensional reconstruction method according to claim 6, wherein Step4 specifically comprises:
first normalizing the input depth to [0, 1] and restoring it after refinement;
inputting the obtained depth map and the source image into a depth residual network to extract features, applying deconvolution to the obtained depth features, and upsampling them to the size of the image features;
concatenating the two sets of features and applying several two-dimensional convolution layers to obtain a depth residual;
adding this residual to the depth estimate obtained in Step3 to finally obtain the optimized depth map;
and constructing a three-dimensional model of the object from the optimized final depth map, thereby obtaining a stereoscopic view of the natural landscape.
8. A natural landscape multi-view three-dimensional reconstruction system based on deep learning, adopting the deep-learning-based natural landscape multi-view three-dimensional reconstruction method of any one of claims 1 to 7, characterized by comprising: an image acquisition module, a multi-scale feature extraction module, an iterative computation module and an optimized reconstruction module;
the image acquisition module is used for acquiring a multi-view image set of a natural landscape and preprocessing the two-dimensional images in the multi-view image set;
the multi-scale feature extraction module is used for constructing a multi-scale feature extraction network, training it on a training set to obtain a trained multi-scale feature extraction network, and extracting features of the preprocessed two-dimensional images with the trained network to obtain target key features;
the iterative computation module is used for inputting the obtained target key features into the learning-based patch-matching iterative model for iterative computation of pixel depth matching, and outputting the corresponding depth map after the iterative computation finishes;
and the optimized reconstruction module is used for inputting the obtained depth map and the source image into the depth residual network for optimization to obtain the final optimized depth map, and constructing a three-dimensional model of the object from the optimized final depth map to obtain a stereoscopic view of the natural landscape.
CN202210384876.5A 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning Pending CN114677479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384876.5A CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384876.5A CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Publications (1)

Publication Number Publication Date
CN114677479A true CN114677479A (en) 2022-06-28

Family

ID=82078216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384876.5A Pending CN114677479A (en) 2022-04-13 2022-04-13 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN114677479A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170746A (en) * 2022-09-07 2022-10-11 中南大学 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN115908730A (en) * 2022-11-11 2023-04-04 南京理工大学 Edge-based three-dimensional scene reconstruction system method for remote control end under low communication bandwidth
CN117197215A (en) * 2023-09-14 2023-12-08 上海智能制造功能平台有限公司 Robust extraction method for multi-vision round hole features based on five-eye camera system
CN117197215B (en) * 2023-09-14 2024-04-09 上海智能制造功能平台有限公司 Robust extraction method for multi-vision round hole features based on five-eye camera system

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN110443842B (en) Depth map prediction method based on visual angle fusion
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN105654492B (en) Robust real-time three-dimensional method for reconstructing based on consumer level camera
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN106780592A (en) Kinect depth reconstruction algorithms based on camera motion and image light and shade
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN109146001B (en) Multi-view ISAR image fusion method
CN113963117B (en) Multi-view three-dimensional reconstruction method and device based on variable convolution depth network
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
CN116229461A (en) Indoor scene image real-time semantic segmentation method based on multi-scale refinement
CN111899295A (en) Monocular scene depth prediction method based on deep learning
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN112734914A (en) Image stereo reconstruction method and device for augmented reality vision
CN114170290A (en) Image processing method and related equipment
Li et al. Deep learning based monocular depth prediction: Datasets, methods and applications
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
CN115205463A (en) New visual angle image generation method, device and equipment based on multi-spherical scene expression
CN113378756B (en) Three-dimensional human body semantic segmentation method, terminal device and storage medium
Jia et al. Depth measurement based on a convolutional neural network and structured light
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination