CN117315169A - Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching


Info

Publication number: CN117315169A
Application number: CN202311141285.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: view, dimensional, image, model, depth
Inventors: 季顺平 (Ji Shunping), 刘瑾 (Liu Jin)
Original and current assignee: Wuhan University (WHU)
Application filed by Wuhan University (WHU)
Priority claimed from CN202311141285.6A
Legal status: Pending

Classifications

    • G06T 17/05 Three-dimensional [3D] modelling: Geographic models
    • G06N 3/0442 Neural networks: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464 Neural networks: Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Neural networks: Learning methods
    • G06T 15/04 3D image rendering: Texture mapping
    • G06V 10/40 Image or video recognition or understanding: Extraction of image or video features
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/766 Pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V 10/82 Pattern recognition or machine learning using neural networks
    • G06T 2200/04 Indexing scheme for image data processing or generation involving 3D image data
    • Y02T 10/40 Engine management systems


Abstract

The invention discloses a live-action three-dimensional model reconstruction method and system based on deep-learning multi-view dense matching. The method performs end-to-end intelligent reconstruction of a live-action three-dimensional model from arbitrary multi-view aerial images; it is more robust to the heavy occlusion, viewpoint changes and large depth of field found in oblique multi-view aerial images, and its depth inference is faster and less resource-intensive. Reconstruction accuracy is superior to most commercial software and open-source solutions. The method also transfers well: a deep-learning model trained on open-source sample data can be applied directly to real aerial images and achieves good reconstruction results without retraining or fine-tuning.

Description

Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
Technical Field
The invention relates to a method for reconstructing a three-dimensional surface model based on deep-learning multi-view dense matching. From a set of oblique multi-view aerial remote-sensing images it effectively reconstructs a live-action three-dimensional model of the Earth's surface in the form of a textured triangular mesh that accurately and truthfully reflects the urban landscape, and it has broad application value in digital earth, smart cities, infrastructure planning, cultural-heritage protection, virtual reality, surveying and mapping, and related fields.
Background
With the continuous expansion of cities, the demand for finer and more three-dimensional representations of the urban landscape keeps growing, and live-action three-dimensional reconstruction technology that recovers the true three-dimensional structure of a city has emerged. A live-action three-dimensional surface model built from multi-view remote-sensing images is usually represented as a textured triangular mesh that carries both geometric structure and colour texture, can accurately reflect the true appearance of a city, and has gradually become a common representation of three-dimensional reconstruction results.
Several commercial three-dimensional reconstruction packages are already widely used in production. For example, ContextCapture (formerly Smart3D), which originated in France, is one of the mainstream three-dimensional reconstruction packages in the industry; Agisoft Metashape (formerly PhotoScan) from Russia supports the generation of accurate three-dimensional models from a wide range of data sources; Pix4D from Switzerland and the Pixel Factory from InfoTerra (France) are also professional photogrammetric systems; in addition, SURE Aerial from Germany targets urban modelling from multi-view aerial images, and the UASMaster software of the Trimble Inpho system is dedicated to recovering three-dimensional scenes from images captured by unmanned aircraft systems. Domestic systems for this task have also been developed successively, such as Reconstruction Master and the intelligent-map and live-action three-dimensional modelling software of Chinese vendors such as Xingzhu. The community has also produced excellent open-source three-dimensional reconstruction solutions. For example, VisualSFM was one of the early open-source graphical applications for three-dimensional reconstruction; later, COLMAP built a complete three-dimensional reconstruction pipeline comprising an incremental structure-from-motion module, a PatchMatch-based joint multi-view depth and normal estimation module, a multi-view fusion module and a triangular-mesh reconstruction module. Besides these general-purpose frameworks from the computer-vision community, the remote-sensing and photogrammetry communities have also open-sourced solutions such as Meshroom, free three-dimensional reconstruction software built on the AliceVision photogrammetric framework, and MicMac, which supports three-dimensional reconstruction of the Earth's surface from aerial or satellite images.
Although each of these packages and solutions has its own strengths, the dense image-matching techniques at their core are mostly based on classical algorithms such as Semi-Global Matching (SGM) and PatchMatch, or variants thereof. These algorithms typically rely on hand-crafted shallow features and handle challenging cases by designing complex matching strategies, so their robustness is clearly limited.
In recent years, deep learning has become popular in many fields and has been introduced into dense matching, saving a great deal of time-consuming manual feature engineering and attracting extensive research. For depth-map-based multi-view dense matching, MVSNet proposed differentiable homography warping, embedding multi-view geometry into a deep network and designing feature-extraction, cost-volume construction, cost-volume regularisation and depth-regression modules to achieve end-to-end multi-view dense matching. Subsequent work proposed a series of improvements on this network architecture, such as RED-Net, CasMVSNet and UCS-Net, improving accuracy while reducing memory consumption and run time. In addition, because these methods rely on ground-truth labels for supervised training, models such as M3VSNet and JDACS-MS exploit constraints derived from the images themselves to achieve unsupervised network training.
In summary, for dense matching the deep-learning approach extracts high-dimensional abstract features from large numbers of samples in a data-driven manner, overcoming the limitations of hand-crafted features. Related studies show that a well-trained deep-learning model can outperform traditional methods, so the approach has great application potential, and data-driven dense matching offers a new solution for three-dimensional scene reconstruction. However, although the advantages of deep-learning dense matching algorithms have been fully validated on some standard datasets, to the best of our knowledge there are still few reports of three-dimensional reconstruction frameworks or software based on deep learning. In other words, deep-learning algorithms currently remain at the stage of methodological research and have not yet been deployed in practical three-dimensional reconstruction engineering. Establishing a new three-dimensional reconstruction framework based on deep-learning algorithms is therefore of great importance.
Disclosure of Invention
To address the limited level of intelligence of existing three-dimensional reconstruction technology, the invention provides a three-dimensional reconstruction method based on deep-learning multi-view dense matching that draws on current advanced deep-learning techniques.
The method reconstructs a city-scale live-action three-dimensional surface model from multi-view aerial images. Taking remote-sensing images and camera parameters from different viewpoints as input, it selects an optimal set of matching views for each image, infers a depth map for each image with a pre-trained deep-learning multi-view dense matching model, fuses all image depth maps into an object-space three-dimensional point cloud, and finally builds a triangular mesh from the point cloud and textures it to obtain the live-action three-dimensional model. The method provides a general, intelligent and well-performing solution for reconstructing city-scale live-action three-dimensional surface models from multi-view aerial images.
To achieve this, the invention adopts a live-action three-dimensional model reconstruction method based on deep-learning multi-view dense matching, comprising the following steps:
step 1, take a set of aerial multi-view images and initial imaging parameters as input, first perform aerial triangulation to solve the imaging parameters of the images and the sparse three-dimensional points of the scene, and then select an optimal set of matching neighbourhood images for each image to be matched according to the number of sparse three-dimensional points;
step 2, construct a deep-learning dense matching network with adaptive multi-view aggregation according to the characteristics of oblique multi-view aerial images, pre-train the network on an open-source aerial dataset, predict each multi-view matching unit with the pre-trained network model, and infer the depth map corresponding to each image;
step 3, using the solved camera intrinsics and position-attitude parameters together with the predicted depth maps, back-project every pixel of each image into three-dimensional object space with the collinearity condition equation, and obtain a three-dimensional point cloud model after multi-view gross-error elimination and fusion;
step 4, construct an initial triangular mesh model from the three-dimensional point cloud, improve model detail by adjusting the quality of the triangular facets in the initial model, and finally attach textures to the model through the mapping relation between the images and the model.
Further, the specific implementation of step 1 includes the following sub-steps:
step 1.1, first solve the accurate camera poses and the sparse three-dimensional points from the set of input images using existing aerial triangulation techniques;
step 1.2, based on the sparse scene information, divide the region to be reconstructed into mutually independent spatial grid blocks with a certain overlap between adjacent blocks, so that the subsequent reconstruction of each block can proceed independently and in parallel;
step 1.3, according to the image orientation parameters, independently select a set of valid images {I_v} for each target block v to be reconstructed, and for each image I_ref in {I_v} establish an optimal neighbourhood view set {I_s} according to three conditions: the overlap between the candidate image and the reference image, the mean intersection angle of the tie points, and the angle between the imaging principal optical axes;
step 1.4, rank the images in {I_s} by the number of tie points shared with I_ref and take the first N-1 images as the optimal matching image set of I_ref (a sketch of this selection is given after this list).
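As an illustration of steps 1.3-1.4, the following sketch (not part of the patent text) ranks candidate neighbour images for one reference image by the number of shared sparse tie points and keeps the top N-1. The data structures and the direction of each threshold comparison are assumptions; the threshold values follow the example settings given later in the description.

```python
# Hypothetical helper: shared_tie_points[(a, b)] holds the number of sparse 3D
# points observed by both image a and image b, taken from the aerial-triangulation
# result. tau1/tau2/tau3 mirror the example thresholds (0.1, 5 deg, 45 deg);
# whether each condition is a lower or upper bound is an assumption.

def candidate_ok(overlap, mean_intersection_deg, axis_angle_deg,
                 tau1=0.1, tau2=5.0, tau3=45.0):
    """Pre-filter a candidate view by overlap, tie-point intersection angle
    and angle between the principal optical axes."""
    return (overlap >= tau1
            and mean_intersection_deg >= tau2
            and axis_angle_deg <= tau3)

def select_matching_views(ref_id, candidate_ids, shared_tie_points, n_views=5):
    """Return the N-1 best-matching neighbour images for one reference image,
    ranked by the number of shared sparse tie points."""
    ranked = sorted(candidate_ids,
                    key=lambda c: shared_tie_points.get((ref_id, c), 0),
                    reverse=True)
    return ranked[: n_views - 1]
```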
Further, the specific implementation manner of the step 2 is as follows,
The adaptive multi-view aggregation deep-learning dense matching network comprises image feature extraction, multi-view feature alignment, matching cost construction, cost-volume regularisation, depth regression and loss computation. The feature-extraction part extracts feature maps from the remote-sensing images with a series of weight-sharing two-dimensional convolution layers; the multi-view feature-alignment part maps the feature maps of the neighbouring views to the reference view through differentiable homography warping; the matching-cost construction part learns the contribution of each neighbouring view to matching (i.e. its aggregation weight) with a view-weight learning module and then aggregates the feature volumes of all views by weighting; the cost-volume regularisation part further aggregates the constructed cost volume with a regularisation module based on 2D convolution layers and convolutional gated recurrent units (ConvGRU); the depth-regression part computes a depth map from the regularised cost volume with a probability-weighted average; the loss part uses a smooth L1 loss, which guides network training until the loss no longer decreases and the network model reaches its optimum.
Further, the image feature-extraction part consists of N weight-sharing feature-extraction modules, where N is the number of input images. Each module consists of a multi-scale encoder-decoder structure and L spatial pyramid pooling modules and extracts multi-scale feature representations from an image. The encoder-decoder consists of 2D convolution and 2D deconvolution layers over L scales, with skip connections between convolution and deconvolution layers of the same scale, and produces feature maps F at L scales. Each scale's feature map then passes through a spatial pyramid pooling module that enlarges the receptive field and fuses multi-level context information. The spatial pyramid pooling module processes the input feature map with three branches: the first two apply pooling, convolution and up-sampling, while the third applies no processing; the outputs of the three branches are concatenated and fused by a convolution layer as the final output at the current scale. All convolution and deconvolution layers are followed by a rectified linear unit (ReLU) and a batch-normalisation layer.
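A minimal PyTorch sketch of one weight-shared feature-extraction branch is given below: a small encoder-decoder with skip connections followed by a spatial-pyramid-pooling (SPP) block per scale. The layer counts, channel sizes, pooling windows and BN/ReLU ordering are simplifications and assumptions, not the exact patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, k=3, s=1):
    # convolution followed by batch normalisation and ReLU (ordering assumed)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class SPP(nn.Module):
    """Pool at 4x4 and 8x8, 1x1-convolve, upsample, concatenate with the input,
    then fuse with a 1x1 convolution (the third branch is the identity)."""
    def __init__(self, c):
        super().__init__()
        self.branch4 = nn.Conv2d(c, c, 1)
        self.branch8 = nn.Conv2d(c, c, 1)
        self.fuse = nn.Conv2d(3 * c, c, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        b4 = F.interpolate(self.branch4(F.avg_pool2d(x, 4)), size=(h, w),
                           mode="bilinear", align_corners=False)
        b8 = F.interpolate(self.branch8(F.avg_pool2d(x, 8)), size=(h, w),
                           mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([x, b4, b8], dim=1))

class FeatureBranch(nn.Module):
    """Encoder-decoder with skip connections producing 3 feature scales."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(conv_bn_relu(3, 8), conv_bn_relu(8, 8))
        self.down1 = conv_bn_relu(8, 16, k=5, s=2)         # 1/2 resolution
        self.enc2 = nn.Sequential(conv_bn_relu(16, 16), conv_bn_relu(16, 16))
        self.down2 = conv_bn_relu(16, 32, k=5, s=2)        # 1/4 resolution
        self.enc3 = nn.Sequential(conv_bn_relu(32, 32), conv_bn_relu(32, 32))
        self.up1 = nn.ConvTranspose2d(32, 16, 4, 2, 1)     # back to 1/2
        self.fuse1 = conv_bn_relu(32, 16)
        self.up2 = nn.ConvTranspose2d(16, 8, 4, 2, 1)      # back to full size
        self.fuse2 = conv_bn_relu(16, 8)
        self.spp = nn.ModuleList([SPP(32), SPP(16), SPP(8)])

    def forward(self, img):
        e1 = self.enc1(img)
        e2 = self.enc2(self.down1(e1))
        e3 = self.enc3(self.down2(e2))                     # coarsest features
        d1 = self.fuse1(torch.cat([self.up1(e3), e2], 1))  # skip connection
        d2 = self.fuse2(torch.cat([self.up2(d1), e1], 1))  # skip connection
        # one SPP block per scale; returned coarse-to-fine
        return [self.spp[0](e3), self.spp[1](d1), self.spp[2](d2)]
```

In the full network, N such branches would share their weights (in PyTorch, simply applying the same FeatureBranch instance to every input image).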
Furthermore, the multi-view feature-alignment part follows the plane-sweep idea: D hypothetical depth planes are uniformly sampled within the depth search range in the view frustum of the reference image, and on each depth plane the neighbouring-view feature maps F_i are mapped to the reference view through a homography so that the multi-view features are aligned. The D homography-warped feature maps are stacked into a feature volume V_i of dimension D×C×H×W, where C is the number of feature channels and H and W are the height and width of the feature map.
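The plane-sweep alignment can be sketched as follows. Rather than forming the homography matrix explicitly, this hedged example back-projects reference pixels onto each hypothesised depth plane, reprojects them into a neighbouring view and bilinearly samples its feature map. The camera convention (x_cam = R·X + t) and the helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_to_ref(src_feat, K_ref, K_src, R_rel, t_rel, depths):
    """src_feat: (C, H, W) neighbour-view feature map; depths: (D,) plane depths
    in the reference frame. R_rel, t_rel map reference-camera coordinates to
    source-camera coordinates. Returns a feature volume of shape (D, C, H, W)."""
    C, H, W = src_feat.shape
    device = src_feat.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    rays = torch.linalg.inv(K_ref) @ pix                 # viewing rays, (3, H*W)
    volume = []
    for d in depths:
        pts_ref = rays * d                               # 3D points on the plane
        pts_src = R_rel @ pts_ref + t_rel[:, None]       # into the source frame
        proj = K_src @ pts_src
        uv = proj[:2] / proj[2:].clamp(min=1e-6)         # pixel coords in source
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,     # normalise to [-1, 1]
                            uv[1] / (H - 1) * 2 - 1], dim=-1).reshape(1, H, W, 2)
        volume.append(F.grid_sample(src_feat[None], grid, mode="bilinear",
                                    padding_mode="zeros", align_corners=True)[0])
    return torch.stack(volume, dim=0)                    # (D, C, H, W)
```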
Further, the matching-cost construction part consists of a view-weight learning module, a multi-view aggregation module and a multi-scale cost-construction module, and aggregates the feature volumes of the N views into a cost volume C. First, the contribution of each neighbouring view to matching, i.e. its aggregation weight, is learned by the view-weight learning module: the module correlates the feature volume of each neighbouring view with the reference feature volume V_ref along the feature dimension to obtain a two-view cost volume; after regularisation by a lightweight 2D encoder-decoder of 4 convolution and 4 deconvolution layers, the regularised cost volume is converted into a probability volume P_s by a softmax operation, and the maximum of P_s along the D dimension is the weight W_s of the current view. Next, the multi-view aggregation module computes the group-wise correlation between each neighbouring feature volume and the reference feature volume, giving N-1 two-view cost volumes of dimension G×D×H×W, where G is the number of groups; according to the weight of each view, all group-correlation cost volumes are weighted and aggregated into the final cost volume C. Finally, in the multi-scale cost construction, weighted aggregated cost volumes are built at three scales from coarse to fine for the subsequent cost-volume regularisation;
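The following sketch illustrates the adaptive aggregation idea: group-wise correlation between each warped neighbour volume and the reference volume, a pixel-wise view weight taken as the maximum of a softmax over depth, and a weighted sum into one cost volume. The tiny 2D network stands in for the 4-layer encoder-decoder described above; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def group_correlation(ref_vol, src_vol, groups):
    """ref_vol, src_vol: (D, C, H, W); the reference features are assumed to be
    broadcast over the D planes. Returns a (G, D, H, W) group-wise correlation."""
    D, C, H, W = ref_vol.shape
    ref = ref_vol.reshape(D, groups, C // groups, H, W)
    src = src_vol.reshape(D, groups, C // groups, H, W)
    return (ref * src).mean(dim=2).permute(1, 0, 2, 3)

class ViewWeight(nn.Module):
    """Per-view pixel weight: regularise the two-view cost, softmax over depth,
    then take the maximum probability at each pixel."""
    def __init__(self, groups):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(groups, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, two_view_cost):                      # (G, D, H, W)
        logits = self.net(two_view_cost.permute(1, 0, 2, 3))  # (D, 1, H, W)
        prob = torch.softmax(logits.squeeze(1), dim=0)     # softmax over depth
        return prob.max(dim=0).values                      # (H, W) view weight

def aggregate_cost(ref_vol, src_vols, weight_net, groups=8):
    """Weighted aggregation of all two-view group-correlation cost volumes."""
    num, den = 0.0, 0.0
    for src in src_vols:
        cost = group_correlation(ref_vol, src, groups)     # (G, D, H, W)
        w = weight_net(cost)                               # (H, W)
        num = num + cost * w
        den = den + w
    return num / (den + 1e-6)                              # final cost volume

# usage sketch: weight_net = ViewWeight(groups=8); C = aggregate_cost(V_ref, V_srcs, weight_net)
```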
The cost-volume regularisation part is a lightweight structural unit consisting of 2D convolution layers, 2D up-convolution layers and 2 ConvGRU layers; a skip connection between the output of the first ConvGRU layer and the output of the first up-convolution layer fuses low-level and high-level information. Each ConvGRU layer contains a state-transition parameter that records the cost information of the current depth plane and passes it to the next depth plane as an initial value. The regularisation modules of the pyramid stages do not share weights, so the aggregated cost volume C_l of each stage is processed independently.
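A ConvGRU-based regularisation step can be sketched as below: a convolutional GRU cell is applied plane by plane along the depth dimension, carrying its hidden state from one depth plane to the next. Channel sizes and the exact gating layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell; the hidden state acts as the recurrent
    'state-transition' memory passed between successive depth planes."""
    def __init__(self, c_in, c_hidden):
        super().__init__()
        self.gates = nn.Conv2d(c_in + c_hidden, 2 * c_hidden, 3, padding=1)
        self.cand = nn.Conv2d(c_in + c_hidden, c_hidden, 3, padding=1)
        self.c_hidden = c_hidden

    def forward(self, x, h):
        if h is None:
            h = torch.zeros(x.shape[0], self.c_hidden, *x.shape[-2:],
                            device=x.device)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

def regularise(cost_volume, cell):
    """cost_volume: (B, C, D, H, W); process the D planes sequentially and
    return the regularised volume of the same depth count."""
    h, out = None, []
    for d in range(cost_volume.shape[2]):
        h = cell(cost_volume[:, :, d], h)
        out.append(h)
    return torch.stack(out, dim=2)
```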
Further, the loss function part is expressed as the smooth L1 loss between the depth labels and the inferred depth values and comprises two parts: the first part is the difference between the N-1 inferred depth maps output by the view-weight learning module and the ground truth, and the second part is the difference between the inferred depth map of each pyramid stage and the ground truth. The total loss function is defined as the weighted sum of these L+N-1 terms:

\mathcal{L}_{total} = \lambda_0 \sum_{i=1}^{N-1} \mathcal{L}^{view}_{i} + \sum_{l=1}^{L} \lambda_l \mathcal{L}^{depth}_{l}

where L is the number of scales, N is the number of input images, and λ_0 and λ_l are the corresponding weight factors.
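A hedged sketch of this loss follows, assuming the per-view estimates are supervised at the coarsest-stage resolution and that invalid ground-truth pixels are masked out (both assumptions); the weight values mirror the example settings given later in the description.

```python
import torch.nn.functional as F

def total_loss(view_depths, stage_depths, gt_depths, lambda0=0.5,
               stage_lambdas=(0.5, 1.0, 2.0)):
    """view_depths: list of N-1 depth maps from the view-weight branch
    (assumed to be at the coarsest-stage resolution);
    stage_depths: list of L depth maps, one per pyramid stage;
    gt_depths: list of ground-truth depth maps at the matching resolutions."""
    loss = 0.0
    for d in view_depths:
        mask = gt_depths[0] > 0                       # valid-label mask
        loss = loss + lambda0 * F.smooth_l1_loss(d[mask], gt_depths[0][mask])
    for lam, d, gt in zip(stage_lambdas, stage_depths, gt_depths):
        mask = gt > 0
        loss = loss + lam * F.smooth_l1_loss(d[mask], gt[mask])
    return loss
```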
Further, the specific implementation manner of the step 3 is as follows,
The depth maps inferred by the deep-learning model are back-projected into three-dimensional object space, mismatches are detected and removed using the consistency relations between the multi-view images, and the depth maps with all outliers filtered are fused into a three-dimensional discrete point cloud. First, depth values whose matching confidence is below 0.2 are filtered as outliers. Then the forward-backward reprojection error between multi-view depth maps, ε_p = ||x_ref − x_reproj||, and the relative depth difference, ε_d = ||d_ref − d_reproj|| / d_ref, are computed; when ε_p < 1 and ε_d < 0.01, the current inferred depth d_ref is regarded as two-view consistent. Here x_ref and d_ref are a pixel on the reference image and its estimated depth; the corresponding point of x_ref on the neighbouring image is reprojected back to the reference image to obtain the reprojected pixel x_reproj and depth value d_reproj. When d_ref is checked for geometric consistency against N_c neighbouring images and satisfies two-view consistency fewer than N_f times, it is filtered as a mismatch. For every valid pixel x_ref and its valid corresponding points {x_c} on the N_c neighbouring images, the three-dimensional coordinates projected into object space are averaged as the fused 3D scene point. Each depth map is processed in turn, and pixels that have already taken part in fusion are marked to avoid repeated processing.
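The two-view geometric consistency test can be sketched as follows. The cam_ref/cam_src objects with project() and backproject() methods are placeholders for whatever camera model is used, not an API defined by the patent; bounds checks are omitted.

```python
import numpy as np

def consistent(d_ref, x_ref, cam_ref, cam_src, depth_src,
               eps_p=1.0, eps_d=0.01):
    """Forward-backward reprojection check between a reference view and one
    neighbouring view, using the thresholds eps_p < 1 px and eps_d < 0.01."""
    X = cam_ref.backproject(x_ref, d_ref)          # pixel + depth -> 3D point
    x_src, _ = cam_src.project(X)                  # into the neighbour view
    d_src = depth_src[int(round(x_src[1])), int(round(x_src[0]))]
    X_back = cam_src.backproject(x_src, d_src)     # neighbour pixel -> 3D
    x_reproj, d_reproj = cam_ref.project(X_back)   # back into the reference
    err_p = np.linalg.norm(np.asarray(x_ref) - np.asarray(x_reproj))
    err_d = abs(d_ref - d_reproj) / d_ref
    return err_p < eps_p and err_d < eps_d
```

A depth estimate that passes this test for fewer than N_f of its N_c neighbours would then be discarded before fusion.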
Further, the specific implementation of step 4 includes the following sub-steps,
step 4.1, perform Delaunay tetrahedralisation on the three-dimensional discrete point cloud to divide space into tetrahedral cells, label each tetrahedron as outside or inside the surface with a graph cut according to visibility information, and take the set of oriented triangular facets at the interface between the two as the surface model to be reconstructed, thereby building an initial triangular mesh model from the three-dimensional point cloud;
step 4.2, on the initial mesh model, adjust the angular quality and size of the triangular facets, remove isolated points and isolated faces, and iteratively adjust vertex positions using the photometric consistency of the mesh vertices and neighbourhood smoothing information to improve model detail;
step 4.3, finally, use a texture-mapping algorithm to establish the mapping between the images and the triangular mesh model, attach textures to the model, and remove seams between adjacent texture patches with a least-squares-based global adjustment and a Poisson-blending-based local adjustment to obtain the textured triangular mesh model.
The invention also provides a real-scene three-dimensional model reconstruction system based on deep learning multi-view dense matching, which comprises the following modules:
a preprocessing module, which takes a set of aerial multi-view images and initial imaging parameters as input, first performs aerial triangulation to solve the imaging parameters of the images and the sparse three-dimensional points of the scene, and then selects an optimal set of matching neighbourhood images for each image to be matched according to the number of sparse three-dimensional points;
a network construction module, which builds a deep-learning dense matching network with adaptive multi-view aggregation according to the characteristics of oblique multi-view aerial images; the network is pre-trained on an open-source aerial dataset, and the pre-trained network model predicts each multi-view matching unit and infers the depth map corresponding to each image;
a three-dimensional point cloud acquisition module, which, using the solved camera intrinsics and position-attitude parameters together with the predicted depth maps, back-projects every pixel of each image into three-dimensional object space with the collinearity condition equation and obtains a three-dimensional point cloud after multi-view gross-error elimination and fusion;
a reconstruction module, which builds an initial triangular mesh model from the three-dimensional point cloud, improves model detail by adjusting the quality of the triangular facets in the initial mesh, and finally attaches textures to the model through the mapping between the images and the model.
The invention has the following advantages:
(1) The established live-action three-dimensional model reconstruction method based on deep-learning multi-view dense matching can perform end-to-end intelligent reconstruction of live-action three-dimensional models from arbitrary multi-view aerial images.
(2) The proposed deep-learning multi-view dense matching network with adaptive cost aggregation is more robust to the heavy occlusion, viewpoint changes and large depth of field found in oblique multi-view aerial images, and its depth inference is faster and less resource-intensive.
(3) The reconstruction accuracy is high: the accuracy of the live-action three-dimensional model reconstructed by the method is superior to that of most commercial software and open-source solutions.
(4) The method transfers well: a deep-learning model trained on open-source sample data can be applied directly to real aerial images and achieves good reconstruction results without retraining or fine-tuning.
Drawings
FIG. 1 is a flow chart of the overall three-dimensional reconstruction method established by the present invention.
Fig. 2 is a schematic diagram of a deep learning multi-view dense matching network structure according to the present invention.
Fig. 3 is a schematic diagram of an image feature extraction module according to the present invention.
Fig. 4 is a real-scene three-dimensional reconstruction model obtained in the embodiment of the present invention.
FIG. 5 is a detailed comparison of three-dimensional reconstruction results obtained by the embodiment of the present invention and other methods and software.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
As shown in fig. 1, the method for reconstructing the live-action three-dimensional model based on deep learning multi-view dense matching provided by the invention comprises the following steps:
step 1, perform aerial triangulation on the aerial multi-view images to solve the imaging parameters of the images and the sparse three-dimensional points of the scene, and select an optimal set of matching neighbourhood images for each image to be matched according to the number of sparse three-dimensional points;
step 2, construct a deep-learning dense matching network with adaptive multi-view aggregation, pre-train it on an open-source aerial dataset, predict each multi-view matching unit with the pre-trained network model, and infer the depth map corresponding to each image;
step 3, perform multi-view gross-error elimination and fusion according to the solved camera intrinsics and position-attitude parameters and the inferred depth maps, and project the filtered depth values into three-dimensional object space to obtain a three-dimensional point cloud model;
step 4, construct an initial triangular mesh model from the three-dimensional point cloud, improve model detail by adjusting the quality of the triangular facets in the initial model, and finally attach textures to the model.
Further, the specific implementation of step 1 includes the following sub-steps,
step 1.1, solve the accurate camera poses and the sparse three-dimensional points from the set of input images using existing aerial triangulation techniques;
step 1.2, divide the region to be reconstructed into mutually independent spatial grid blocks, with a 5% overlap between adjacent blocks;
step 1.3, according to the image orientation parameters, independently select a set of valid images {I_v} for each target block v to be reconstructed, and for each image I_ref in {I_v} establish an optimal neighbourhood view set {I_s};
step 1.4, rank the images in {I_s} by the number of tie points shared with I_ref and take the first N-1 images as the optimal matching image set of I_ref.
Further, as shown in fig. 2, the adaptive multi-view aggregation deep learning dense matching network described in step 2 includes:
image feature extraction, multi-view feature alignment, matching cost construction, cost-volume regularisation, depth regression and loss computation. The feature-extraction part extracts feature maps from the remote-sensing images with a series of weight-sharing two-dimensional convolution layers; the multi-view feature-alignment part maps the feature maps of the neighbouring views to the reference view through differentiable homography warping; the matching-cost construction part learns the contribution of each neighbouring view to matching (i.e. its aggregation weight) with a view-weight learning module and then aggregates the feature volumes of all views by weighting; the cost-volume regularisation part further aggregates the constructed cost volume with a regularisation module based on 2D convolution layers and convolutional gated recurrent units (ConvGRU); the depth-regression part computes a depth map from the regularised cost volume with a probability-weighted average; the loss part uses a smooth L1 loss, which guides network training until the loss no longer decreases and the network model reaches its optimum.
Further, the image feature-extraction part consists of N weight-sharing feature-extraction modules (N is the number of input images), and each module consists of a multi-scale encoder-decoder structure and three spatial pyramid pooling modules, one per scale, used to extract multi-scale feature representations from an image, as shown in fig. 3. The encoder-decoder consists of 2D convolution and 2D deconvolution layers over L = 3 scales: 8 2D convolution layers followed by 2 deconvolution layers, with channel numbers {8, 8, 16, 16, 16, 32, 32, 32, 16, 8}. The 3rd and 6th convolution layers and the two deconvolution layers have stride 2 and 5×5 kernels; the remaining convolution layers have stride 1 and 3×3 kernels. The outputs of the 2nd and 5th convolution layers are concatenated with the outputs of the 1st and 2nd deconvolution layers respectively, and each concatenated feature map is fused and output by a 3×3 convolution layer. This encoder-decoder structure produces feature maps F at 3 scales in total. Each scale's feature map then passes through a spatial pyramid pooling module that enlarges the receptive field and fuses multi-level context information. Specifically, the feature map F_1×1 is first pooled with 4×4 and 8×8 windows and passed through 1×1 convolutions and up-sampling to give two pooled feature maps F_4×4 and F_8×8; then F_1×1, F_4×4 and F_8×8 are concatenated, and the concatenated feature map is fused by a 1×1 convolution layer as the final output of the current scale. All convolution and deconvolution layers are followed by a rectified linear unit (ReLU) and a batch-normalisation layer (BN). The N weight-sharing feature-extraction modules extract three-scale feature maps from the N input images I_i; the feature maps are {1/16, 1/4, 1} of the input image size, with {32, 16, 8} channels respectively.
Furthermore, the multi-view feature-alignment part follows the plane-sweep idea: D hypothetical depth planes are uniformly sampled within the depth search range in the view frustum of the reference image, and on each depth plane the neighbouring-view feature maps F_i are mapped to the reference view through a homography so that the multi-view features are aligned. The D homography-warped feature maps are stacked into a feature volume V_i of dimension D×C×H×W, where C is the number of feature channels and H and W are the height and width of the feature map, respectively.
Further, the matching-cost construction part consists of a view-weight learning module, a multi-view aggregation module and a multi-scale cost-construction module, and aggregates the feature volumes of the N views into a cost volume C. First, the contribution of each neighbouring view to matching (i.e. its aggregation weight) is learned by the view-weight learning module, which correlates the feature volume of each neighbouring view with the reference feature volume V_ref along the feature dimension to obtain a two-view cost volume; after regularisation by a lightweight 2D encoder-decoder of 4 convolution and 4 deconvolution layers, the regularised cost volume is converted into a probability volume P_s by a softmax operation, and the maximum of P_s along the D dimension is the weight W_s of the current view. Next, the multi-view aggregation module computes the group-wise correlation between each neighbouring feature volume and the reference feature volume, giving N-1 two-view cost volumes of dimension G×D×H×W, where G is the number of groups. According to the weight of each view, all group-correlation cost volumes are weighted and aggregated into the final cost volume C; finally, in the multi-scale cost construction, weighted aggregated cost volumes are built at three scales from coarse to fine for the subsequent cost-volume regularisation.
Further, the cost-volume regularisation part is a lightweight structural unit consisting of 2D convolution layers, 2D up-convolution layers and 2 ConvGRU layers. A skip connection between the output of the first ConvGRU layer and the output of the first up-convolution layer fuses low-level and high-level information. Each ConvGRU layer contains a state-transition parameter that records the cost information of the current depth plane and passes it to the next depth plane as an initial value. Weights are not shared among the regularisation modules of the pyramid stages, so the aggregated cost volume C_l of each stage is processed independently.
Further, the depth-regression part applies a softmax function along the depth direction of the regularised cost volume to convert it into a probability volume, in which each value represents the probability that the current pixel lies on depth plane d_i; the depth values of all planes are then averaged, weighted by these probabilities, to obtain the final estimated depth value.
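A minimal sketch of this probability-weighted regression (the "soft argmin") follows, assuming the regularised cost volume is already scaled so that larger values indicate better matches; the per-pixel maximum probability is also returned, since a confidence of this kind is used later for filtering.

```python
import torch

def regress_depth(cost_volume, depth_values):
    """cost_volume: (B, D, H, W) regularised cost/score volume;
    depth_values: (D,) sampled depth-plane values."""
    prob = torch.softmax(cost_volume, dim=1)                     # probability volume
    depth = (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1)   # expected depth
    confidence = prob.max(dim=1).values                          # matching confidence
    return depth, confidence
```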
Further, the loss function part is expressed as the smooth L1 loss between the depth labels and the inferred depth values and comprises two parts: the first part is the difference between the N-1 inferred depth maps output by the view-weight learning module and the ground truth, and the second part is the difference between the inferred depth map of each pyramid stage and the ground truth. The total loss function is defined as the weighted sum of these L+N-1 terms, \mathcal{L}_{total} = \lambda_0 \sum_{i=1}^{N-1} \mathcal{L}^{view}_{i} + \sum_{l=1}^{L} \lambda_l \mathcal{L}^{depth}_{l}.
Further, in the inference stage, for a set of multi-view matching units consisting of a reference image and its neighbouring images, the depth map corresponding to each reference image is inferred with the pre-trained deep-learning model, given the imaging parameters and the depth search range (usually estimated from the sparse three-dimensional scene reconstructed by SfM).
Further, the specific implementation manner of the step 3 is as follows,
The depth maps inferred by the deep-learning model are back-projected into three-dimensional object space, mismatches are detected and removed using the consistency relations between the multi-view images, and for every valid pixel x_ref and its valid corresponding points {x_c} on the N_c neighbouring images the three-dimensional coordinates projected into object space are averaged as the fused 3D scene point. Each depth map is processed in turn, and pixels that have already taken part in fusion are marked to avoid repeated processing.
Further, the specific implementation of step 4 includes the following sub-steps,
step 4.1, perform Delaunay tetrahedralisation on the three-dimensional discrete point cloud to divide space into tetrahedral cells, and, according to visibility information, use a graph cut to label each tetrahedron as outside or inside the surface and take the set of oriented triangular facets at the interface between the two as the surface model to be reconstructed;
step 4.2, on the initial mesh model, adjust the angular quality and size of the triangular facets, remove isolated points and isolated faces, and iteratively adjust vertex positions using the photometric consistency of the mesh vertices and neighbourhood smoothing information to improve model detail;
step 4.3, use a texture-mapping algorithm to establish the mapping between the images and the triangular mesh model, attach textures to the model, and remove seams between adjacent texture patches with a least-squares-based global adjustment and a Poisson-blending-based local adjustment to obtain the textured triangular mesh model.
The embodiment of the invention also provides a live-action three-dimensional model reconstruction system based on deep learning multi-view dense matching, which comprises the following modules:
a preprocessing module, which takes a set of aerial multi-view images and initial imaging parameters as input, first performs aerial triangulation to solve the imaging parameters of the images and the sparse three-dimensional points of the scene, and then selects an optimal set of matching neighbourhood images for each image to be matched according to the number of sparse three-dimensional points;
a network construction module, which builds a deep-learning dense matching network with adaptive multi-view aggregation according to the characteristics of oblique multi-view aerial images; the network is pre-trained on an open-source aerial dataset, and the pre-trained network model predicts each multi-view matching unit and infers the depth map corresponding to each image;
a three-dimensional point cloud acquisition module, which, using the solved camera intrinsics and position-attitude parameters together with the predicted depth maps, back-projects every pixel of each image into three-dimensional object space with the collinearity condition equation and obtains a three-dimensional point cloud after multi-view gross-error elimination and fusion;
a reconstruction module, which builds an initial triangular mesh model from the three-dimensional point cloud, improves model detail by adjusting the quality of the triangular facets in the initial mesh, and finally attaches textures to the model through the mapping between the images and the model.
The specific implementation of each module corresponds to the respective step and is not repeated here.
Examples:
First, a deep-learning multi-view dense matching network for multi-view aerial images is constructed as described in step 2, and the deep-learning model is trained on an open-source multi-view aerial dense-matching dataset. End-to-end reconstruction of the live-action three-dimensional surface model is then performed with the pre-trained model. A set of aerial multi-view images and initial imaging parameters is taken as input; aerial triangulation solves the imaging parameters of the images and the sparse three-dimensional points of the scene, and an optimal set of matching neighbourhood images is selected for each image to be matched according to the number of sparse three-dimensional points. The image to be matched, its best-matching neighbourhood images and the accurate imaging parameters of each image are then fed into the network model, and the depth map of the image to be matched is inferred with the pre-trained deep-learning multi-view matching model. Using the solved camera intrinsics and position-attitude parameters, gross errors in the inferred depth maps are removed according to the geometric and photometric consistency between the views, the filtered depth values are back-projected into three-dimensional object space with the existing collinearity condition equation, and neighbouring and corresponding points are fused to obtain a three-dimensional point cloud model. Finally, a triangular mesh model is built from the point cloud, its details are refined, and textures are attached to obtain the live-action three-dimensional model.
Model training was performed on the training set of the WHU-OMVS dataset, and reconstruction of the experimental area was performed on the test set. The training and validation sets consist of 768×384-pixel oblique five-view images with corresponding depth maps and camera parameters; the simulated flying height is 220 m, the corresponding ground resolution is about 0.1 m, and the forward and side overlaps are 80% and 60%, respectively. They contain more than 30,000 oblique five-view images from 5 independent areas, with image-space depth maps as training labels. The test-set data simulate the state of a real five-lens camera system during actual imaging, acquired at a flying height of about 550 m with a ground resolution of about 0.1 m; the 5 views comprise 268 images, the image size is 3712×5504 pixels, and the covered surface scene is about 850 m × 700 m. Unlike the training set, the test set provides, in addition to the image-space depth map of each image, a ground-truth DSM with 0.2 m grid resolution over the five-view overlap of the test region (about 580 m × 580 m), used to evaluate the accuracy of the reconstructed model in object space.
The deep-learning dense matching network is implemented with PyTorch 1.3.0 and adopts a three-stage pyramid structure. In the training phase the number of depth sampling planes is set to {48, 32, 8}; when training on the WHU-OMVS aerial dataset the depth-plane intervals are set to {(d_max − d_min)/48, 0.2 m, 0.1 m}, where the depth search range {d_min, d_max} is obtained from the sparse scene point cloud computed by aerial triangulation. The number of views taking part in matching is 5, the training batch size is 1, the optimiser is RMSProp, and the learning rate is 0.001; the loss-function weight factors {λ_0, λ_1, λ_2, λ_3} are set to {0.5, 0.5, 1, 2}, respectively. The network model is trained for about 250,000 iterations on each dataset, and the trained model performs the depth-inference step within the three-dimensional reconstruction framework. For oblique aerial images the thresholds in the view-selection procedure are set to τ_1 = 0.1, τ_2 = 5° and τ_3 = 45°, and the consistency-check thresholds in the depth-map fusion procedure are set to N_c = 10 and N_f = 3.
The comparison schemes chosen include three mainstream commercial packages, ContextCapture, Agisoft Metashape and SURE, and two widely used open-source three-dimensional reconstruction solutions, COLMAP and OpenMVS. All experiments were performed on a single NVIDIA TITAN RTX GPU (24 GB) with an Intel Core i9-9900X CPU @ 3.60 GHz. Images were uniformly down-sampled twice before being input to each solution, and the internal down-sampling of each solution was disabled to eliminate differences in reconstruction quality caused by inconsistent data scales.
Quantitative accuracy assessment of the reconstruction results is performed uniformly on DSM products interpolated from the triangular mesh models. The evaluation metrics are: 1) Percentage of Accurate Grids (PAG), i.e. the number of grid cells m_α whose absolute error is smaller than a threshold α, as a percentage of the number of grid cells m that are valid in both the result and the ground truth (α is set to 0.2 m, 0.4 m and 0.6 m in the experiments); 2) Mean Absolute Error (MAE), i.e. the mean of the absolute differences between the ground-truth elevations and the valid elevations in the reconstructed DSM; 3) Root Mean Square Error (RMSE), i.e. the standard deviation between the ground-truth elevations and the valid elevations in the reconstructed DSM. To reduce the influence of extremely large gross errors, grid cells whose absolute difference exceeds a threshold T are not counted in the MAE and RMSE (T is set to 100 times the ground sampling distance).
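A hedged sketch of these three metrics follows, assuming both DSMs are aligned on the same grid and invalid cells are NaN; the gross-error threshold T is passed in explicitly (100 × the ground sampling distance, i.e. 10 m for the 0.1 m imagery used here).

```python
import numpy as np

def dsm_metrics(dsm, dsm_gt, pag_thresholds=(0.2, 0.4, 0.6), gross_T=10.0):
    """Compute PAG_alpha (in percent), MAE and RMSE between a reconstructed DSM
    and a ground-truth DSM on their commonly valid grid cells."""
    valid = np.isfinite(dsm) & np.isfinite(dsm_gt)
    err = np.abs(dsm[valid] - dsm_gt[valid])
    pag = {a: 100.0 * float((err < a).mean()) for a in pag_thresholds}
    kept = err[err <= gross_T]            # exclude gross errors above T
    mae = float(kept.mean())
    rmse = float(np.sqrt((kept ** 2).mean()))
    return pag, mae, rmse
```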
Table 1. Comparison of the reconstruction accuracy of the six schemes on the WHU-OMVS test area
Compared with the three commercial packages, the deep-learning three-dimensional reconstruction framework Deep3D has clear advantages on the MAE and RMSE metrics, and among the three completeness metrics (PAG_α) Deep3D obtains the best results for PAG_0.4m and PAG_0.6m. Second, the Deep3D framework has clear advantages over the two classical open-source solutions on every metric. In terms of three-dimensional model reconstruction time, Deep3D is less efficient than the mature commercial packages ContextCapture and SURE and slightly better than Metashape, but it has a clear advantage over the two open-source solutions. Overall, the proposed Deep3D three-dimensional reconstruction framework performs well; the produced 3D product is shown in fig. 4, and fig. 5 compares details of the three-dimensional models produced by the six solutions. ContextCapture, Deep3D and Metashape obtain the most complete and finest models. The error maps show that all methods recover the surface structure well, while the Deep3D reconstruction shows the fewest error peaks at building edges.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. The method for reconstructing the live-action three-dimensional model based on deep learning multi-view dense matching is characterized by comprising the following steps of:
step 1, taking a group of aviation multi-view images and initial imaging parameters as inputs, firstly carrying out aerial triangulation to solve imaging parameters of the images and sparse three-dimensional points of a scene, and then selecting an optimal matching neighborhood image set for each image to be matched according to the number of the sparse three-dimensional points;
step 2, constructing a deep-learning dense matching network with adaptive multi-view aggregation according to the characteristics of oblique multi-view aerial images, pre-training the network on an open-source aerial dataset, predicting each multi-view matching unit with the pre-trained network model, and inferring the depth map corresponding to each image;
step 3, back-projecting each pixel of the image into three-dimensional object space with the collinearity condition equation according to the solved camera intrinsics and position-attitude parameters combined with the predicted depth map, and performing multi-view gross-error elimination and fusion to obtain a three-dimensional point cloud;
And 4, constructing an initial triangular mesh model from the three-dimensional point cloud, improving model details by adjusting the quality of triangular patches in the initial triangular mesh model, and finally attaching textures to the constructed model through the mapping relation between the images and the model.
2. The method for reconstructing the live-action three-dimensional model based on deep learning multi-view dense matching according to claim 1, which is characterized by comprising the following steps: the specific implementation of the step 1 comprises the following sub-steps:
step 1.1, firstly, solving the accurate pose of a camera from a group of input images in a concentrated way by utilizing the existing aerial triangulation technology, and solving sparse three-dimensional space points;
step 1.2, based on sparse scene information, dividing a region to be reconstructed into mutually independent space grid blocks, setting a certain overlapping degree for adjacent blocks, and independently carrying out a subsequent reconstruction process for each block so as to support parallel processing;
step 1.3, according to the image orientation parameters, independently selecting a set of valid images {I_v} for each target block v to be reconstructed, and establishing an optimal neighbourhood view set {I_s} for each image I_ref in {I_v} according to three conditions: the overlap between the candidate image and the reference image, the mean intersection angle of the tie points, and the angle between the imaging principal optical axes;
step 1.4, ranking the images in {I_s} by the number of tie points shared with I_ref and taking the first N-1 images as the optimal matching image set of I_ref.
3. The method for reconstructing the live-action three-dimensional model based on deep learning multi-view dense matching according to claim 1, which is characterized by comprising the following steps: the specific implementation of step 2 is as follows,
the adaptive multi-view aggregation deep-learning dense matching network comprises image feature extraction, multi-view feature alignment, matching cost construction, cost-volume regularisation, depth regression and loss computation; the image feature-extraction part extracts feature maps from the remote-sensing images with a series of weight-sharing two-dimensional convolution layers; the multi-view feature-alignment part maps the feature maps of the neighbouring views to the reference view through differentiable homography warping; the matching-cost construction part learns the contribution of each neighbouring view to matching with a view-weight learning module and then aggregates the feature volumes of all views by weighting to obtain a cost volume; the cost-volume regularisation part further aggregates the constructed cost volume with a regularisation module based on 2D convolution layers and convolutional gated recurrent units; the depth-regression part computes a depth map from the regularised cost volume with a probability weighting function; the loss part uses a smooth L1 loss, which guides network training until the loss no longer decreases and the network model reaches its optimum.
4. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 3, characterized in that: the image feature extraction part consists of N weight-sharing feature extraction modules, where N is the number of input images; each feature extraction module consists of a multi-scale encoder-decoder structure and L spatial pyramid pooling modules, and is used for extracting feature representations at multiple scales from the images; the encoder-decoder structure consists of 2D convolution layers and 2D deconvolution layers at L scales, with skip connections between convolution layers and deconvolution layers of the same scale, and generates feature maps F at L scales; the feature map of each scale is passed through a spatial pyramid pooling module to expand the receptive field and fuse multi-level context information, and the specific processing procedure of the spatial pyramid pooling module is as follows: the input feature map is processed by three branches, the first two branches undergo pooling, convolution and up-sampling operations while the third branch performs no processing; the feature maps output by the three branches are then concatenated, and the concatenated feature map is fused through a convolution layer as the final output of the current scale; all convolution layers and deconvolution layers are followed by a rectified linear unit (ReLU) and a batch normalization layer.
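A minimal PyTorch sketch of the three-branch spatial pyramid pooling module described above; the pooling sizes and channel counts are assumptions, not values from the patent.

```python
# Two branches pool, convolve and upsample; the third branch is the identity.
# The three outputs are concatenated and fused by a convolution layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.branch1 = nn.Sequential(nn.AvgPool2d(4), nn.Conv2d(channels, channels, 1),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(nn.AvgPool2d(8), nn.Conv2d(channels, channels, 1),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # fusion conv applied to the concatenation of the three branches
        self.fuse = nn.Sequential(nn.Conv2d(3 * channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        b1 = F.interpolate(self.branch1(x), size=(h, w), mode="bilinear", align_corners=False)
        b2 = F.interpolate(self.branch2(x), size=(h, w), mode="bilinear", align_corners=False)
        # third branch: identity, then concatenate and fuse
        return self.fuse(torch.cat([b1, b2, x], dim=1))

feat = torch.randn(1, 32, 96, 128)
print(SpatialPyramidPooling(32)(feat).shape)   # torch.Size([1, 32, 96, 128])
```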
5. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 3, characterized in that: the multi-view feature alignment part adopts the idea of the plane-sweep method, uniformly sampling D hypothetical depth planes within the depth search range of the reference image view frustum; on each depth plane, the multi-view neighborhood feature maps F_i are mapped to the reference image view angle through homography transformation to align the multi-view features, and the feature maps after the D homography warps are stacked into a feature volume V_i of dimension D×C×H×W, where C is the number of feature channels and H and W are the height and width of the feature map.
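The plane-sweep alignment can be sketched as follows: for each hypothetical depth plane, the neighborhood feature map is resampled into the reference view and the D warped maps are stacked into a D×C×H×W volume. The camera convention (poses mapping reference coordinates to the source camera) and the demo parameters are assumptions.

```python
# A minimal plane-sweep warping sketch: neighborhood features are resampled into
# the reference view for D hypothetical depth planes via differentiable sampling.
import torch
import torch.nn.functional as F

def warp_to_reference(src_feat, K_ref, K_src, R, t, depths):
    """src_feat: 1 x C x H x W; R, t: pose of the source relative to the reference;
    depths: D hypothetical depth values.  Returns a 1 x C x D x H x W feature volume."""
    _, C, H, W = src_feat.shape
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).reshape(3, -1)   # 3 x HW
    rays = torch.linalg.inv(K_ref) @ pix                                  # normalized rays
    volume = []
    for d in depths:
        pts = R @ (rays * d) + t.reshape(3, 1)        # 3D points in the source frame
        proj = K_src @ pts
        u, v = proj[0] / proj[2], proj[1] / proj[2]
        grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
        grid = grid.reshape(1, H, W, 2)               # normalized for grid_sample
        volume.append(F.grid_sample(src_feat, grid, align_corners=True))
    return torch.stack(volume, dim=2)                 # 1 x C x D x H x W

K = torch.tensor([[500.0, 0, 64], [0, 500.0, 48], [0, 0, 1]])
feat = torch.randn(1, 8, 96, 128)
vol = warp_to_reference(feat, K, K, torch.eye(3), torch.zeros(3),
                        torch.linspace(20.0, 80.0, steps=16))
print(vol.shape)   # torch.Size([1, 8, 16, 96, 128])
```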
6. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 5, characterized in that: the matching cost construction part consists of a view weight learning module, a multi-view aggregation module and a multi-scale cost construction module, and is used for aggregating the feature volumes of the N views into a cost volume C; firstly, the contribution degree of each neighborhood view to matching, namely the aggregation weight, is learned through the view weight learning module, which correlates the feature volume of each neighborhood view with the feature volume V_ref of the reference image along the feature dimension to obtain a two-view cost volume; the two-view cost volume is regularized by a lightweight 2D encoder-decoder structure formed by 4 convolution layers and 4 deconvolution layers, the regularized cost volume is converted into a probability volume P_s through a softmax operation, and the maximum value of P_s along the D dimension is taken as the weight W_s of the current view; then, group-wise correlation is performed between each neighborhood feature volume and the reference feature volume through the multi-view aggregation module, yielding N-1 two-view cost volumes of dimension G×D×H×W, where G is the number of groups; according to the weight of each view, all the group-correlated cost volumes are weighted and aggregated into the final cost volume C (see the sketch following this claim); finally, in the multi-scale cost construction, weighted aggregation cost volumes are constructed at three scales from coarse to fine for the subsequent cost volume regularization;
the cost volume regularization part is a lightweight structural unit consisting of 2D convolution layers, 2D up-convolution layers and 2 ConvGRU layers, with a skip connection between the output of the first ConvGRU layer and the output of the first up-convolution layer for fusing low-level and high-level information; each ConvGRU layer contains a state transition parameter that records the cost information of the current depth plane and passes it to the next depth plane as an initial value; the regularization modules of the different pyramid stages do not share weights, so the aggregated cost volume C_l of each stage is processed independently.
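A minimal sketch of the adaptive aggregation described in this claim: group-wise correlation between each neighborhood feature volume and the reference volume, followed by per-view weighting. The per-view weight maps below are produced by a trivial stand-in rather than the lightweight 2D encoder-decoder of the view weight learning module, and the group count is an assumption.

```python
# Group-wise correlation of two feature volumes plus view-weighted aggregation.
import torch

def group_correlation(ref_vol, src_vol, groups):
    """ref_vol, src_vol: C x D x H x W -> group-wise correlation of shape G x D x H x W."""
    C, D, H, W = ref_vol.shape
    ref = ref_vol.reshape(groups, C // groups, D, H, W)
    src = src_vol.reshape(groups, C // groups, D, H, W)
    return (ref * src).mean(dim=1)

def aggregate_cost(ref_vol, neighbor_vols, groups=4):
    two_view_costs = [group_correlation(ref_vol, nb, groups) for nb in neighbor_vols]
    # stand-in per-view weight maps W_s (H x W); the real module regularizes a
    # two-view cost volume and takes the maximum softmax probability over D
    weights = [c.mean(dim=(0, 1)).sigmoid() for c in two_view_costs]
    weight_sum = torch.stack(weights).sum(dim=0)
    cost = sum(w * c for w, c in zip(weights, two_view_costs)) / (weight_sum + 1e-6)
    return cost                                            # G x D x H x W

C, D, H, W = 16, 32, 48, 64
ref = torch.randn(C, D, H, W)
cost = aggregate_cost(ref, [torch.randn(C, D, H, W) for _ in range(4)])
print(cost.shape)   # torch.Size([4, 32, 48, 64])
```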
7. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 3, characterized in that: the loss function calculation part is expressed as a smooth L1 loss between the depth label and the inferred depth value and comprises two parts: the first part is the difference Loss_0^i between each of the N-1 inferred depth maps output by the view weight learning module and the ground truth; the second part is the difference Loss_l between the inferred depth map obtained at each stage of the pyramid and the ground truth; the total loss function is defined as the weighted sum of these L+N-1 values:

Loss = λ_0 · Σ_{i=1..N-1} Loss_0^i + Σ_{l=1..L} λ_l · Loss_l

wherein L is the number of scales, N is the number of input images, and λ_0 and λ_l are the weight factors corresponding to the two parts.
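A minimal sketch of this loss, assuming the weighted-sum form reconstructed above; the λ values and the tensor shapes are illustrative.

```python
# Weighted sum of smooth L1 terms: N-1 view-weight-module depth maps plus L pyramid stages.
import torch
import torch.nn.functional as F

def total_loss(view_module_depths, pyramid_depths, gt_depths,
               lambda_0=0.5, lambda_l=(0.5, 1.0, 2.0)):
    """view_module_depths: N-1 depth maps at the coarsest resolution;
    pyramid_depths / gt_depths: L depth maps / labels, ordered coarse to fine."""
    loss = sum(lambda_0 * F.smooth_l1_loss(d, gt_depths[0]) for d in view_module_depths)
    loss += sum(w * F.smooth_l1_loss(d, gt)
                for w, d, gt in zip(lambda_l, pyramid_depths, gt_depths))
    return loss

gt = [torch.rand(1, 32, 40), torch.rand(1, 64, 80), torch.rand(1, 128, 160)]   # L = 3
view_depths = [gt[0] + 0.1 * torch.randn_like(gt[0]) for _ in range(4)]        # N-1 = 4
pyr_depths = [g + 0.1 * torch.randn_like(g) for g in gt]
print(float(total_loss(view_depths, pyr_depths, gt)))
```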
8. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 1, characterized in that the specific implementation of step 3 is as follows:
the depth maps inferred by the deep learning model are back projected to the three-dimensional object space, mismatches are detected and eliminated by utilizing the consistency relation among the multi-view images, and all depth maps with outliers filtered out are fused into a three-dimensional discrete point cloud; firstly, depth values whose matching confidence in the depth estimation is lower than a certain threshold are filtered out as outliers; then the forward-backward reprojection error ε_p = ||x_ref − x_reproj|| and the relative depth difference ε_d = ||d_ref − d_reproj|| / d_ref between multi-view depth maps are calculated; when ε_p < 1 and ε_d < 0.01, the current inferred depth d_ref is considered two-view consistent, where x_ref and d_ref are a pixel point on the reference image and its depth estimate, and the corresponding point of x_ref on the neighborhood image is re-projected back to the reference image to obtain the re-projected pixel x_reproj and depth value d_reproj; when d_ref is tested against N_c neighborhood images for geometric consistency and satisfies the two-view consistency fewer than N_f times, it is regarded as a mismatch and filtered out; every valid pixel x_ref and its valid homonymous image points {x_c} on the N_c neighborhood images are projected to the object space, and their three-dimensional coordinates are averaged as the fused 3D scene point; each depth map is processed in turn, and pixels that have already participated in fusion are marked so as to avoid repeated processing.
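The two-view consistency test of this step can be sketched in NumPy with the 1-pixel and 0.01 thresholds stated above; the camera convention (R, t map world coordinates to the camera frame) and the helper names are assumptions.

```python
# Forward-backward reprojection check between a reference depth and a neighborhood depth map.
import numpy as np

def project(K, R, t, X):
    """World point -> pixel (u, v) and depth in the camera frame."""
    p = K @ (R @ X + t)
    return p[:2] / p[2], p[2]

def backproject(K, R, t, uv, depth):
    """Pixel + depth -> world point."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    return R.T @ (ray * depth - t)

def two_view_consistent(uv_ref, d_ref, depth_src, K_ref, P_ref, K_src, P_src):
    R_ref, t_ref = P_ref
    R_src, t_src = P_src
    # forward: reference pixel -> object space -> neighborhood image
    X = backproject(K_ref, R_ref, t_ref, uv_ref, d_ref)
    uv_src, _ = project(K_src, R_src, t_src, X)
    u, v = int(round(uv_src[0])), int(round(uv_src[1]))
    if not (0 <= v < depth_src.shape[0] and 0 <= u < depth_src.shape[1]):
        return False
    # backward: neighborhood pixel with its own depth -> object space -> reference image
    X_back = backproject(K_src, R_src, t_src, (u, v), depth_src[v, u])
    uv_reproj, d_reproj = project(K_ref, R_ref, t_ref, X_back)
    eps_p = np.linalg.norm(np.asarray(uv_ref) - uv_reproj)   # reprojection error
    eps_d = abs(d_ref - d_reproj) / d_ref                    # relative depth difference
    return eps_p < 1.0 and eps_d < 0.01

K = np.array([[500.0, 0, 64], [0, 500.0, 48], [0, 0, 1]])
depth_src = np.full((96, 128), 50.0)
print(two_view_consistent((64, 48), 50.0, depth_src, K, (np.eye(3), np.zeros(3)),
                          K, (np.eye(3), np.zeros(3))))
```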
9. The live-action three-dimensional model reconstruction method based on deep learning multi-view dense matching according to claim 1, characterized in that the specific implementation of step 4 comprises the following sub-steps:
step 4.1, performing Delaunay triangulation on the three-dimensional discrete point cloud to divide the space into tetrahedral units, classifying each tetrahedral unit as exterior or interior by a graph cut method according to visibility information, and taking the set of oriented triangular patches at the interface between the two as the surface model to be reconstructed, thereby constructing an initial triangular mesh model from the three-dimensional point cloud (a minimal sketch follows this claim);
step 4.2, on the initial mesh model, adjusting the angular quality and size of the triangular patches in the initial model, removing isolated points and isolated faces, and iteratively adjusting the spatial positions of the vertices by utilizing the photometric consistency of the mesh vertices and neighborhood smoothing information, so as to improve the details of the model;
step 4.3, finally, establishing the mapping relation between the images and the triangular mesh model by adopting a texture mapping algorithm, attaching textures to the constructed model, and eliminating seams between adjacent texture patches by utilizing a global adjustment algorithm based on least squares and a local adjustment algorithm based on Poisson blending, so as to obtain the textured triangular mesh model.
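A minimal sketch of step 4.1, assuming a trivial inside/outside labelling (a sphere test) in place of the visibility-driven graph cut; it shows only the Delaunay tetrahedralization and the extraction of the interface triangles, on random demo points.

```python
# Delaunay tetrahedralization of the point cloud, stand-in cell labelling, and
# extraction of triangles shared by one interior and one exterior tetrahedron.
import numpy as np
from scipy.spatial import Delaunay
from itertools import combinations

points = np.random.rand(200, 3)                      # fused 3D scene points (demo data)
tets = Delaunay(points)                              # tetrahedral cells, shape (M, 4)

# Stand-in labelling: a cell is "interior" if its centroid lies inside a sphere;
# the patented method labels cells by graph cut on visibility information instead.
centroids = points[tets.simplices].mean(axis=1)
interior = np.linalg.norm(centroids - 0.5, axis=1) < 0.35

# A surface triangle is a face shared by one interior and one exterior tetrahedron.
face_owner = {}
for cell_id, cell in enumerate(tets.simplices):
    for face in combinations(sorted(cell), 3):
        face_owner.setdefault(face, []).append(cell_id)
surface = [face for face, owners in face_owner.items()
           if len(owners) == 2 and interior[owners[0]] != interior[owners[1]]]
print(f"{len(surface)} interface triangles from {len(tets.simplices)} tetrahedra")
```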
10. A live-action three-dimensional model reconstruction system based on deep learning multi-view dense matching, characterized by comprising the following modules:
the preprocessing module is used for taking a group of aerial multi-view images and initial imaging parameters as inputs, firstly performing aerial triangulation to solve the imaging parameters of the images and the sparse three-dimensional points of the scene, and then selecting an optimal matching neighborhood image set for each image to be matched according to the number of sparse three-dimensional points;
the network construction module is used for constructing a deep learning dense matching network capable of adaptive multi-view aggregation according to the characteristics of oblique multi-view aerial images; the network is pre-trained on an open-source aerial data set, and the pre-trained network model is used to perform prediction on the multi-view matching units and infer a depth map corresponding to each image;
the three-dimensional point cloud acquisition module is used for back projecting each pixel point in the image to the three-dimensional object space by utilizing the collinearity condition equation, according to the solved camera internal parameters and position-posture parameters combined with the predicted depth map, and obtaining a three-dimensional point cloud after multi-view gross error elimination and fusion operations;
the reconstruction module is used for constructing an initial triangular mesh model from the three-dimensional point cloud, improving model details by adjusting the quality of triangular patches in the initial triangular mesh model, and finally attaching textures to the constructed model through the mapping relation between images and the model.
CN202311141285.6A 2023-09-05 2023-09-05 Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching Pending CN117315169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311141285.6A CN117315169A (en) 2023-09-05 2023-09-05 Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311141285.6A CN117315169A (en) 2023-09-05 2023-09-05 Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching

Publications (1)

Publication Number Publication Date
CN117315169A true CN117315169A (en) 2023-12-29

Family

ID=89287526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311141285.6A Pending CN117315169A (en) 2023-09-05 2023-09-05 Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching

Country Status (1)

Country Link
CN (1) CN117315169A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557617A (en) * 2024-01-12 2024-02-13 山东师范大学 Multi-view dense matching method, system and equipment based on plane priori optimization
CN117557617B (en) * 2024-01-12 2024-04-09 山东师范大学 Multi-view dense matching method, system and equipment based on plane priori optimization
CN117974746A (en) * 2024-04-01 2024-05-03 北京理工大学长三角研究院(嘉兴) Point cloud 2D depth plane triangulation composition method, device, system and equipment

Similar Documents

Publication Publication Date Title
CN110738697B (en) Monocular depth estimation method based on deep learning
CN109410307B (en) Scene point cloud semantic segmentation method
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN111325794A (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN106780592A (en) Kinect depth reconstruction algorithms based on camera motion and image light and shade
CN111127538B (en) Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN111898543A (en) Building automatic extraction method integrating geometric perception and image understanding
CN110197505B (en) Remote sensing image binocular stereo matching method based on depth network and semantic information
CN110349087B (en) RGB-D image high-quality grid generation method based on adaptive convolution
Zeng et al. Recognition and extraction of high-resolution satellite remote sensing image buildings based on deep learning
Li et al. ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN111028335A (en) Point cloud data block surface patch reconstruction method based on deep learning
CN117315169A (en) Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
CN113313828A (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
Wen et al. Cooperative indoor 3D mapping and modeling using LiDAR data
CN112163990A (en) Significance prediction method and system for 360-degree image
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Tang et al. Encoder-decoder structure with the feature pyramid for depth estimation from a single image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination