CN116310131A - Three-dimensional reconstruction method considering multi-view fusion strategy - Google Patents
Three-dimensional reconstruction method considering multi-view fusion strategy
- Publication number
- CN116310131A CN116310131A CN202310315104.0A CN202310315104A CN116310131A CN 116310131 A CN116310131 A CN 116310131A CN 202310315104 A CN202310315104 A CN 202310315104A CN 116310131 A CN116310131 A CN 116310131A
- Authority
- CN
- China
- Prior art keywords
- view
- stage
- depth
- map
- cost
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a three-dimensional reconstruction method considering a multi-view fusion strategy. In the first stage, the N-1 source feature maps are each homography-warped against the reference feature map to construct two-view cost volumes; each cost volume is passed into View-Net to obtain a weight map of that view relative to the reference image, and the weight maps are weightedly fused with the cost volumes, finally yielding a cost volume that retains the useful information. The cost volume is regularized by conventional 3D convolution and a low-resolution depth map is output, whose depth information initializes the preset depth of the next stage; after three stages the final predicted depth map is obtained and the three-dimensional reconstruction is completed. The invention provides a new formulation for multi-view cost-volume fusion, improves the usability of matched pixels and reduces the interference of unmatched pixels, thereby obtaining a more accurate depth estimate and improving the completeness of the reconstructed three-dimensional point cloud.
Description
Technical Field
The invention belongs to the technical field of three-dimensional reconstruction, and particularly relates to a three-dimensional reconstruction method considering a multi-view fusion strategy.
Background
With the continuous growth of imaging demands and rising requirements on scene understanding, three-dimensional spatial vision has become increasingly important: observation in three-dimensional space recovers scene structures and detail differences that are missing in a two-dimensional plane, making the user experience more real and reliable. Conventional three-dimensional reconstruction devices can reconstruct the three-dimensional space of pictures shot by a camera, but their cost is very high, which is prohibitive for daily use, so such devices cannot be popularized to the public.
Multi-view stereo (MVS) aims to recover 3D scene geometry from a set of RGB images with known camera poses, obtaining a 3D dense model of a real-world scene from multiple images. It has many important applications, such as document reconstruction, virtual reality, autonomous driving and defect detection. Compared with conventional MVS approaches that rely on hand-crafted matching metrics for image consistency checking, deep-learning-based MVS methods typically use fronto-parallel plane sweeping to evaluate the same set of candidate depths for every pixel, and achieve higher accuracy and completeness on many MVS benchmarks than the prior art. While learning-based MVS achieves remarkable results, many aspects remain to be solved and optimized to further improve the quality of point cloud reconstruction. Convolutional neural networks (CNNs) have been widely used for three-dimensional reconstruction and broader computer vision tasks; recent multi-view stereo matching algorithms typically compute a 3D cost volume over a set of hypothesized depths and apply 3D convolutions to the cost volume to regularize it and regress the final scene depth.
Several studies have demonstrated the importance of cost-volume construction to depth prediction accuracy. In the prevailing MVS pipeline, N-1 source maps and 1 reference map are input, two-view cost volumes are constructed through the homography relation between each source map and the reference map, the N-1 two-view cost volumes are then compressed into a final cost volume by a specific fusion scheme, and finally 3D CNN or RNN regularization is used for depth prediction. One very important issue is the judgment of pixel visibility in each view and the selection of views. The disturbance of irrelevant views and the introduction of erroneous pixels can render the verification of the final score ineffective or even harmful, resulting in erroneous depth estimation.
The original MVSNet uses a view fusion strategy in the traditional sense: it computes the variance distribution across the different two-view cost volumes and uses the variances as weights to combine the per-view cost volumes, and a number of variant algorithms follow this scheme. Other methods apply averaging or max pooling to aggregate matching costs, or directly use specific implicit network layers to obtain so-called adaptive weight maps that lack a reasonable theoretical explanation. Although a network may implicitly learn how to discard unmatched pixels in a view, the interference of those unmatched pixels still inevitably deteriorates the final reconstruction.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a three-dimensional reconstruction method considering a multi-view fusion strategy, so as to solve the problem in existing three-dimensional reconstruction that the interference of irrelevant views and the propagation of erroneous pixels render the verification of the final score ineffective or even harmful, causing erroneous depth estimation.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a three-dimensional reconstruction method taking into account a multi-view fusion strategy, comprising the steps of:
s1, acquiring a plurality of pictures acquired by a camera;
s2, extracting features of the pictures to obtain three different feature maps;
s3, mapping the N-1 source feature maps of a stage onto the hypothetical depth planes of the reference view by homography transformation;
s4, respectively constructing N-1 initial cost volumes based on the N-1 source feature maps and a reference map;
s5, performing adaptive weight training on the N-1 initial cost volumes through View-Net to obtain a weight map corresponding to each initial cost volume;
s6, carrying out weighted fusion on the weight maps and the initial cost volumes to obtain the cost volume to be regularized;
s7, regularizing the cost volume of step S6 to obtain a probability map, and generating the depth map of this stage based on the probability map;
s8, taking the depth map of this stage in step S7 as the initialization of the depth preset of the second stage, and circularly executing steps S3 to S7 to generate the depth map of the second stage;
s9, taking the depth map obtained in the second stage as the initialization of the depth preset of the third stage, and circularly executing steps S3 to S7 to generate the final predicted depth map of the third stage;
and S10, based on the final predicted depth map, completing the three-dimensional reconstruction of the cost volume.
Further, in step S3, the N-1 source feature maps of a stage are mapped onto the hypothetical depth planes of the reference view by homography transformation, obtaining the implicit depth relationship between images of different view angles:

H_i(d) = K_i * R_i * (I - (t_0 - t_i) * n_0^T / d) * R_0^T * K_0^(-1)

wherein H_i(d) is the homography between the feature map of the i-th view and the reference feature map at depth d; K_i, R_i, t_i are the camera intrinsics, rotation and translation of the i-th view; n_0 is the principal axis of the reference camera; and I, K_0, R_0, t_0 are the identity matrix and the intrinsics, rotation and translation of the reference camera, respectively.
Further, the step S4 specifically includes:
according to 1 reference feature map, N-1 source feature maps and the corresponding camera and pose parameters, each of the N-1 source feature maps is homography-warped onto a preset plane of the reference camera, and N-1 initial cost volumes are constructed.
Further, in step S5, the N-1 initial cost volumes are subjected to adaptive weight training through View-Net to obtain the weight map corresponding to each initial cost volume:

V(x) = 1 / (1 + exp(-x))

wherein V(x) is the view weight map and exp(-x) denotes e raised to the power -x.
Further, in step S6, the weight maps and the initial cost volumes are weightedly fused to obtain the cost volume to be regularized:

V_total(k) = sum_{n=1..N-1} View_n(k) * Warp_n(k) / sum_{n=1..N-1} View_n(k)

wherein V_total(k) is the fused cost volume of the k-th stage, k in {1,2,3}; Warp_n(k) is the initial cost volume after homography warping of the n-th source view against the reference view in the k-th stage; and View_n(k) is the one-dimensional weight information output by View-Net for the n-th initial cost volume in the k-th stage.
Further, the smooth L1 Loss is used to calculate the mean absolute difference between the true depth map and the estimated depth map, and the losses of the three stages are accumulated as the final Loss:

Loss = sum_{k=1..3} sum_{p in Omega} SmoothL1( d(p,k) - d'(p,k) )

wherein k is the cascade stage, Omega is the set of valid ground-truth pixels, d(p,k) is the true depth value of pixel p at the k-th stage, and d'(p,k) is the predicted depth value of pixel p at the k-th stage.
The three-dimensional reconstruction method considering the multi-view fusion strategy provided by the invention has the following beneficial effects:
the invention provides a general aggregation network, which uses a general sub-network serving as an MVS pipeline network for training pixel-level confidence of a two-view cost body, and provides a new expression mode for multi-view cost body fusion, so as to improve the usability of matched pixels and reduce the interference of unmatched pixels, further obtain a more accurate depth map and improve the integrity of three-dimensional point cloud reconstruction.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a network architecture of the present invention.
FIG. 3 is a hypothetical depth plane for mapping N-1 source signatures to a reference view by homography transformation in accordance with the present invention.
FIG. 4 is a schematic diagram of a two-view fusion strategy according to the present invention.
FIG. 5 is a two-view weighting module according to the present invention.
Fig. 6 shows, for Scan1 of the DTU test set, the fusion weights obtained by the two-view cost volumes after passing through View-Net.
FIG. 7 plots the depth-map error at different precision levels as the number of input views increases.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art; it should be understood, however, that the invention is not limited to the scope of these embodiments, and all inventions making use of the inventive concept fall within the spirit and scope of the invention as defined in the appended claims.
Example 1
This embodiment provides a three-dimensional reconstruction method considering a multi-view fusion strategy, which addresses the deteriorating factors introduced during cost-volume fusion by the uncertainty of view selection and pixel homography warping in existing methods. Referring to fig. 1, the method specifically comprises the following steps:
s1, acquiring a plurality of pictures acquired by a camera;
referring to fig. 2, the input of the present embodiment includes a plurality of captured pictures and corresponding camera parameters.
S2, extracting features of the pictures to obtain three different feature images;
in this embodiment, an FPN architecture is adopted as the feature extraction layer to obtain three stage feature maps of different sizes (N x C x H x W);
the present embodiment assumes that the depth is uniformly sampled (e.g., 1-192 mm) from a range of depths throughout all stages. The first stage acquires image features at low resolution and constructs a cost volume through homography mapping at a predetermined depth range and larger depth interval, and the subsequent stage uses high spatial resolution, narrower depth range and smaller depth interval to obtain a finer depth prediction map.
In this embodiment, the sub-network is described taking the first stage as an example, where the 0th input is the reference feature map and the remaining N-1 inputs are source feature maps. The N-1 source feature maps are each homography-warped against the reference feature map to construct cost volumes, and each cost volume is passed into View-Net to obtain a weight map of that view relative to the reference map; the weight map suppresses non-matching information in the view and strengthens transferable information, similar in function to an attention mechanism. The procedure specifically comprises steps S3 to S6:
s3, mapping the N-1 source feature images in one stage onto an assumed depth plane of the reference view by adopting homography transformation;
referring to fig. 3, according to the epipolar principle, N-1 source feature maps are mapped onto d hypothetical planes of a reference view through homography transformation warping, thereby obtaining implicit depth relationships between images of different views.
The coordinate mapping is determined by the homography:

H_i(d) = K_i * R_i * (I - (t_0 - t_i) * n_0^T / d) * R_0^T * K_0^(-1)

wherein H_i(d) is the homography between the feature map of the i-th view and the reference feature map at depth d; K_i, R_i, t_i are the camera intrinsics, rotation and translation of the i-th view, and n_0 is the principal axis of the reference camera. The 2D feature maps are then warped onto the hypothetical planes of the reference camera using this differentiable homography transformation to form a plurality of two-view cost volumes.
S4, respectively constructing N-1 initial cost bodies based on the N-1 source characteristic images and a reference view;
referring to FIG. 3, 1 reference feature map and N-1 source feature map are input, and the corresponding camera and pose parameters are respectively input by the method of the first pair ofZhang Yuan characteristic diagram is subjected to homography distortion transformation to a preset plane of a reference camera, and N-1 initial cost bodies are obtained through construction.
In step S5, the two-view weight module of this embodiment performs adaptive weight training on the N-1 initial cost volumes through View-Net to obtain a weight map (1 x H x W) corresponding to each initial cost volume.
referring to FIG. 5, the input of the view-Net module is a two-view cost volume, the number of channels is reduced to 1/2 of the original number through a Conv3d layer, and a BatchNorm3d-Relu is used as the active layer of the layer for fast convergence in the initial stage. The number of channels is then reduced to one dimension (1×d×h×w) by one Conv3D layer. The above operation can be considered here as a temporary cost volume x, and is converted into a final usefulness map (view weight map) V (x) (1×h×w) by the following formula.
Of course, to match the coarse-to-fine training strategy, three independent View-Nets are set up for the three cascade stages; their specific parameters are shown in Table 1.

TABLE 1. View-Net settings of the three stages. The original input resolution is 512 x 640, and the input of each stage is in turn 1/4, 1/2 and 1 times the original resolution.
In step S6, referring to FIG. 4, the two-view fusion strategy of this embodiment performs weighted fusion of the weight maps and the initial cost volumes to obtain the cost volume to be regularized.

Each weight map is unsqueezed (dimension-expanded), multiplied with its cost volume, and the products are summed; the sum is then divided by the accumulated weights to obtain the fused final cost volume:

V_total(k) = sum_{n=1..N-1} View_n(k) * Warp_n(k) / sum_{n=1..N-1} View_n(k)

wherein Warp_n(k) is the initial cost volume after homography warping of the n-th source view against the reference view in the k-th stage, View_n(k) is the one-dimensional weight information output by View-Net for the n-th initial cost volume in the k-th stage, and V_total(k) is the fused cost volume of the k-th stage, k in {1,2,3}.
By adopting this fusion strategy, the sampling of effective matching pixels under different source views is improved and the interference of non-matching pixels is suppressed, finally yielding a fusion weight map of each source view against the reference view; all two-view cost volumes are fused with pixel-level weights, reducing the interference information in the final cost volume.
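The weighted fusion of step S6 amounts to a per-pixel weighted average of the two-view cost volumes. The sketch below follows the formula above; the function name, the epsilon guard against a zero weight sum, and the array layout are illustrative assumptions.

```python
import numpy as np

def fuse_cost_volumes(warped, weights, eps=1e-6):
    """Pixel-wise weighted average of the N-1 two-view cost volumes:

        V_total = sum_n View_n * Warp_n / sum_n View_n

    warped:  (N-1, C, D, H, W) homography-warped two-view cost volumes
    weights: (N-1, H, W) View-Net weight maps; the indexing below is the
             'unsqueeze' step, broadcasting each map over C and D
    """
    w = weights[:, None, None, :, :]                  # (N-1, 1, 1, H, W)
    return (w * warped).sum(axis=0) / (w.sum(axis=0) + eps)
```

A view whose weight map is near zero at some pixel then contributes almost nothing to the fused cost at that pixel, which is the suppression of non-matching pixels described above.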
Step S7, regularizing the cost body in the step S6 to obtain a probability map, and generating a depth map of the stage based on the probability map;
the regularization processing in this step obtains a probability map and generates a predicted depth map in this stage based on the probability map directly by conventional means, so that specific processes are not repeated in this embodiment.
Step S8, taking the depth map of the stage in the step S7 as the initialization of the depth presetting of the second stage, and circularly executing the steps S3 to S7 to generate a predicted depth map of the second stage;
step S9, taking the depth map obtained in the second stage as the initialization of the depth presetting in the third stage, and circularly executing the steps S3 to S7 to generate a final predicted depth map obtained in the third stage;
step S10, based on the final predicted depth map, completing three-dimensional reconstruction of the cost body;
the step is based on the final predicted depth map to complete three-dimensional reconstruction of the cost body, and conventional means in the field are adopted, so that detailed processes thereof are not repeated.
In this embodiment, the loss is calculated on the outputs of the three cascade stages: each stage calculates the mean absolute difference between the real depth map and the estimated (predicted) depth map using the smooth L1 Loss, and the three stage losses are accumulated as the final Loss:

Loss = sum_{k=1..3} sum_{p in Omega} SmoothL1( d(p,k) - d'(p,k) )

wherein k is the cascade stage, Omega is the set of valid ground-truth pixels, d(p,k) is the true depth value of pixel p at the k-th stage, and d'(p,k) is the predicted depth value of pixel p at the k-th stage.
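The accumulated loss can be sketched as follows, assuming the standard smooth-L1 definition with beta = 1 (the patent does not state beta); the function names and the per-stage averaging are illustrative assumptions.

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Standard smooth-L1: quadratic below beta, linear above."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a * a / beta, a - 0.5 * beta)

def cascade_loss(preds, gts, masks):
    """Accumulate the mean smooth-L1 depth error of the cascade stages
    over the valid ground-truth pixels, as described above.

    preds, gts: lists of (H, W) depth maps, one per stage
    masks:      lists of boolean (H, W) validity masks (the set Omega)
    """
    return sum(smooth_l1(p[m] - g[m]).mean()
               for p, g, m in zip(preds, gts, masks))
```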
Example 2
This example is used for performing evaluation verification of the method steps in example 1;
specifically, this embodiment trains on the DTU training set. In the training phase, the number of input images is set to N=3 and the image resolution is 512 x 640, in accordance with conventional MVS practice. For coarse-to-fine regularization, depth hypotheses are sampled from 425 mm to 935 mm; the numbers of plane-sweep depth hypotheses of the three stages are 48, 32 and 8, respectively, and the corresponding depth intervals decay by 0.25 and 0.5 from the coarsest stage to the finer stages. The model is trained with Adam for 16 epochs with an initial learning rate of 0.001, decayed by a factor of 0.5 after epochs 6, 8 and 12, respectively, using a batch size of 2 on one NVIDIA RTX 3090 GPU; one batch occupies 6 GB of memory.
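The learning-rate schedule just described (initial 0.001, halved after epochs 6, 8 and 12) can be expressed as a small step-decay helper; the function name is an illustrative assumption.

```python
def lr_at_epoch(epoch, base_lr=0.001, milestones=(6, 8, 12), gamma=0.5):
    """Step-decay schedule: the learning rate is multiplied by gamma
    once each milestone epoch has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```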
The proposed method is evaluated on the evaluation set of the DTU dataset with the official evaluation criterion; in the evaluation phase, N=5 and the input resolution is 864 x 1152. The quantitative evaluation is shown in Table 2.

TABLE 2. Quantitative evaluation on the DTU dataset (lower is better). The method herein outperforms the baseline network and most other advanced methods in terms of completeness.
The partial multi-view stereo qualitative results on the DTU dataset corresponding to the table above show that the point clouds reconstructed by the method of the present invention are significantly denser and more complete.
To verify the advantages of the proposed fusion strategy and to represent its contribution to MVS visually, the visibility map of each two-view pair after View-Net is visualized in FIG. 6. It is evident from the figure that different viewing angles contribute differently to the reference view: a lighter area indicates that this region matches more pixels of the reference view, while a darker area implies that the region bears little relationship to the reference view. In other words, the amount of usable information obtained when observing the same object from different perspectives differs, which is what allocates the importance of the homography warping of different pixels in the reference view.
from Vis-MVSNet, a good multi-view fusion strategy should be unaffected by the number of views, and not result in a drop in results due to the addition of source views. Performance tests using different numbers of source views on large and small scene data sets, respectively, are therefore performed for the proposed fusion strategy.
Verification tests were performed on the DTU dataset and the BlendedMVS dataset, respectively; the results are shown in Table 3 and FIG. 7. The experiments show that as the number of views increases, the depth estimation accuracy of the method gradually improves; the improvement is especially obvious in the high-accuracy ranges such as 2 mm and 4 mm.
TABLE 3. From left to right, the depth-map errors at different precision thresholds when introducing 2 to 8 views, respectively.
The above verification highlights the filtering of image information by the proposed method, which avoids computation on invalid image pixels and improves the estimation of correct pixels, thereby improving the completeness of the reconstructed point cloud.
Although specific embodiments of the invention have been described in detail with reference to the accompanying drawings, it should not be construed as limiting the scope of protection of the present patent. Various modifications and variations which may be made by those skilled in the art without the creative effort are within the scope of the patent described in the claims.
Claims (6)
1. The three-dimensional reconstruction method taking the multi-view fusion strategy into consideration is characterized by comprising the following steps of:
s1, acquiring a plurality of pictures acquired by a camera;
s2, extracting features of the pictures to obtain three different feature maps;
s3, mapping the N-1 source feature maps of a stage onto the hypothetical depth planes of the reference view by homography transformation;
s4, respectively constructing N-1 initial cost volumes based on the N-1 source feature maps and a reference view;
s5, performing adaptive weight training on the N-1 initial cost volumes through View-Net to obtain a weight map corresponding to each initial cost volume;
s6, carrying out weighted fusion on the weight maps and the initial cost volumes to obtain the cost volume to be regularized;
s7, regularizing the cost volume of step S6 to obtain a probability map, and generating the depth map of this stage based on the probability map;
s8, taking the depth map of this stage in step S7 as the initialization of the depth preset of the second stage, and circularly executing steps S3 to S7 to generate the depth map of the second stage;
s9, taking the depth map obtained in the second stage as the initialization of the depth preset of the third stage, and circularly executing steps S3 to S7 to generate the final predicted depth map of the third stage;
and S10, based on the final predicted depth map, completing the three-dimensional reconstruction of the cost volume.
2. The three-dimensional reconstruction method considering a multi-view fusion strategy according to claim 1, wherein in the step S3, homography transformation is adopted to map the N-1 source feature maps of a stage onto the hypothetical depth planes of the reference view, obtaining the implicit depth relationship between images of different view angles:

H_i(d) = K_i * R_i * (I - (t_0 - t_i) * n_0^T / d) * R_0^T * K_0^(-1)

wherein H_i(d) is the homography between the feature map of the i-th view and the reference feature map at depth d; K_i, R_i, t_i are the camera intrinsics, rotation and translation of the i-th view; n_0 is the principal axis of the reference camera; and I, K_0, R_0, t_0 are the identity matrix and the intrinsics, rotation and translation of the reference camera, respectively.
3. The three-dimensional reconstruction method according to claim 2, wherein the step S4 specifically comprises:
according to 1 reference feature map, N-1 source feature maps and the corresponding camera and pose parameters, each of the N-1 source feature maps is homography-warped onto a preset plane of the reference camera, and N-1 initial cost volumes are constructed.
4. The three-dimensional reconstruction method considering the multi-View fusion strategy according to claim 1, wherein in the step S5, the adaptive weight training is performed on N-1 initial cost volumes through View-Net, so as to obtain a weight graph corresponding to the initial cost volumes:
where V(x) is the view weight map and exp(·) denotes the exponential function with base e.
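The claim's exact V(x) expression is not reproduced in the text; the sketch below shows one plausible exp-based per-pixel weight consistent with the definition of exp(·), where a low matching cost yields a weight near 1 (this specific cost reduction, and the function itself, are assumptions, not the patent's View-Net):

```python
import numpy as np

def view_weight(cost_volume):
    """One plausible exp-based per-pixel view weight for an initial cost
    volume of shape (D, H, W, C): reduce over depth and channel dims to a
    per-pixel score, then map low cost -> weight near 1, high cost -> near 0."""
    score = cost_volume.mean(axis=(0, 3))  # (H, W) per-pixel matching cost
    return np.exp(-score)                  # V(x) in (0, 1]
```

In the full method this hand-crafted reduction is replaced by a learned network, but the output plays the same role: a per-pixel confidence for each source view.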
5. The three-dimensional reconstruction method considering the multi-view fusion strategy according to claim 4, wherein in the step S6, the weight maps and the initial cost volumes are fused by weighting to obtain the cost volume to be regularized:
V_total^(k) = Σ_{n=1}^{N-1} View_n^(k) · Warp_n^(k)

wherein V_total^(k) is the fused cost volume of the k-th stage, k ∈ {1, 2, 3}; Warp_n^(k) is the initial cost volume obtained by homographic warping between the n-th source view and the reference view in the k-th stage; and View_n^(k) is the one-dimensional weight information output by View-Net for the n-th initial cost volume in the k-th stage.
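A sketch of this weighted fusion, assuming a normalized weighted sum over the N-1 volumes (the normalization by the summed weights is an assumption; the claim may intend an unnormalized sum):

```python
import numpy as np

def fuse_cost_volumes(volumes, weights, eps=1e-8):
    """Fuse N-1 initial cost volumes (each (D, H, W, C)) into the single
    cost volume to be regularized, using per-pixel (H, W) view weights.
    eps guards against division by zero where all weights vanish."""
    num = np.zeros_like(volumes[0])
    den = eps
    for vol, w in zip(volumes, weights):
        w4 = w[None, :, :, None]  # broadcast per-pixel weight over depth and channels
        num = num + w4 * vol
        den = den + w4
    return num / den
```

With equal weights the fusion reduces to a plain average, so identical input volumes pass through unchanged, which is a convenient correctness check.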
6. The three-dimensional reconstruction method considering the multi-view fusion strategy according to claim 1, wherein a smooth L1 loss is adopted to calculate the mean absolute difference between the ground-truth depth map and the estimated depth map, and the losses of the three stages are accumulated as the final loss.
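A sketch of the per-stage smooth L1 loss and the accumulated final loss (the beta threshold and equal stage weights are assumptions; the claim does not state them):

```python
import numpy as np

def smooth_l1(pred, gt, beta=1.0):
    """Smooth L1 (Huber-style) loss between the estimated and ground-truth
    depth maps, averaged over all pixels: quadratic for small residuals,
    linear for large ones."""
    diff = np.abs(pred - gt)
    per_pixel = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_pixel.mean()

def total_loss(preds, gts, stage_weights=(1.0, 1.0, 1.0)):
    """Final loss: accumulated smooth L1 losses of the three stages.
    Equal stage weights are an assumption."""
    return sum(w * smooth_l1(p, g)
               for w, p, g in zip(stage_weights, preds, gts))
```

For a residual of 2 with beta = 1, the per-pixel loss is 2 - 0.5 = 1.5, in the linear regime of the loss.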
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310315104.0A CN116310131A (en) | 2023-03-28 | 2023-03-28 | Three-dimensional reconstruction method considering multi-view fusion strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116310131A true CN116310131A (en) | 2023-06-23 |
Family
ID=86816681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310315104.0A Pending CN116310131A (en) | 2023-03-28 | 2023-03-28 | Three-dimensional reconstruction method considering multi-view fusion strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116310131A (en) |
2023-03-28: Application CN202310315104.0A filed in China; patent CN116310131A published, status pending.
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117437363A (en) * | 2023-12-20 | 2024-01-23 | 安徽大学 | Large-scale multi-view stereoscopic method based on depth perception iterator |
CN117437363B (en) * | 2023-12-20 | 2024-03-22 | 安徽大学 | Large-scale multi-view stereoscopic method based on depth perception iterator |
CN117671163A (en) * | 2024-02-02 | 2024-03-08 | 苏州立创致恒电子科技有限公司 | Multi-view three-dimensional reconstruction method and system |
CN117671163B (en) * | 2024-02-02 | 2024-04-26 | 苏州立创致恒电子科技有限公司 | Multi-view three-dimensional reconstruction method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | Dynamic spatial propagation network for depth completion | |
Wang et al. | 360sd-net: 360 stereo depth estimation with learnable cost volume | |
WO2018127007A1 (en) | Depth image acquisition method and system | |
CN103854283B (en) | A kind of mobile augmented reality Tracing Registration method based on on-line study | |
CN116310131A (en) | Three-dimensional reconstruction method considering multi-view fusion strategy | |
CN110070598B (en) | Mobile terminal for 3D scanning reconstruction and 3D scanning reconstruction method thereof | |
CN106023303B (en) | A method of Three-dimensional Gravity is improved based on profile validity and is laid foundations the dense degree of cloud | |
CN106023230B (en) | A kind of dense matching method of suitable deformation pattern | |
CN115205489A (en) | Three-dimensional reconstruction method, system and device in large scene | |
CN110956661B (en) | Method for calculating dynamic pose of visible light and infrared camera based on bidirectional homography matrix | |
CN109544628B (en) | Accurate reading identification system and method for pointer instrument | |
CN114067197B (en) | Pipeline defect identification and positioning method based on target detection and binocular vision | |
CN109859137B (en) | Wide-angle camera irregular distortion global correction method | |
CN111784778A (en) | Binocular camera external parameter calibration method and system based on linear solving and nonlinear optimization | |
CN110910456B (en) | Three-dimensional camera dynamic calibration method based on Harris angular point mutual information matching | |
CN113744337A (en) | Synchronous positioning and mapping method integrating vision, IMU and sonar | |
CN110033461B (en) | Mobile phone anti-shake function evaluation method based on target displacement estimation | |
CN113393439A (en) | Forging defect detection method based on deep learning | |
CN114119739A (en) | Binocular vision-based hand key point space coordinate acquisition method | |
CN116129037B (en) | Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof | |
CN112150518B (en) | Attention mechanism-based image stereo matching method and binocular device | |
CN115601406A (en) | Local stereo matching method based on fusion cost calculation and weighted guide filtering | |
CN113963117A (en) | Multi-view three-dimensional reconstruction method and device based on variable convolution depth network | |
CN115359127A (en) | Polarization camera array calibration method suitable for multilayer medium environment | |
CN111062900B (en) | Binocular disparity map enhancement method based on confidence fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||