CN113345082B - Characteristic pyramid multi-view three-dimensional reconstruction method and system - Google Patents


Publication number
CN113345082B
CN113345082B
Authority
CN
China
Prior art keywords: feature, view, image, depth map, three-dimensional reconstruction
Prior art date
Legal status
Active
Application number
CN202110707399.7A
Other languages
Chinese (zh)
Other versions
CN113345082A (en)
Inventor
Bai Zhengyao (柏正尧)
Li Junjie (李俊杰)
Current Assignee
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202110707399.7A priority Critical patent/CN113345082B/en
Publication of CN113345082A publication Critical patent/CN113345082A/en
Application granted granted Critical
Publication of CN113345082B publication Critical patent/CN113345082B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention relates to a feature pyramid multi-view three-dimensional reconstruction method and system. In the multi-view three-dimensional reconstruction method, after image data and a multi-view three-dimensional reconstruction network model are obtained, the network model takes the image data as input to produce a predicted depth map; 3D point cloud data are then generated from the predicted depth map; finally, the 3D point cloud data are visualized to obtain a reconstructed three-dimensional view. This addresses problems of the existing image three-dimensional reconstruction process, such as the excessive GPU memory occupied by cost-volume computation, the difficulty of estimating high-resolution depth maps, and incomplete reconstructed point clouds. The method can be applied to three-dimensional modeling in fields such as cultural relic protection, autonomous driving, industrial parts, virtual reality, architecture, and medical imaging, and offers significant application value and practical relevance for scientific research and development in those fields.

Description

Characteristic pyramid multi-view three-dimensional reconstruction method and system
Technical Field
The invention relates to the technical field of view reconstruction, in particular to a feature pyramid multi-view three-dimensional reconstruction method and system.
Background
Multi-view stereo (MVS) aims to reconstruct a 3D model of a scene from a series of overlapping images taken by a camera from multiple viewpoints, and is now widely used in fields such as autonomous driving, virtual reality, cultural relic protection, ancient writing, and medical imaging. Compared with active three-dimensional reconstruction methods, which require an expensive depth camera or structured-light camera, MVS is low-cost, convenient, and efficient. High-resolution depth maps can be fused to generate high-quality dense point clouds, so acquiring high-resolution depth maps plays a vital role in generating high-quality dense point clouds by multi-view three-dimensional reconstruction.
Traditional multi-view three-dimensional reconstruction methods introduce hand-crafted similarity metrics for image matching and then optimize the generated dense point cloud; typically the similarity metric is normalized cross-correlation and the optimization is semi-global matching. Although such conventional methods achieve great success on ideal Lambertian scenes, they still suffer from common limitations: low-texture, specular, and reflective regions make dense matching difficult to handle, reducing the completeness of the reconstruction. In recent years, deep convolutional neural networks have made great progress in multi-view stereo vision; they can extract high-level semantic information and encode the global and local information of a scene during feature extraction, greatly improving the robustness of multi-view stereo feature matching. By learning a depth regression from multiple views to the corresponding depth maps, many researchers have proposed end-to-end multi-view stereo learning methods, for example MVSNet, R-MVSNet, CVP-MVSNet, and CasMVSNet. These deep-learning-based methods employ a deep convolutional neural network to infer a depth map for each view and then generate a dense three-dimensional point cloud through multi-view depth-map fusion, thereby building a 3D model. In particular, MVSNet proposed a deep architecture for depth-map estimation that infers a corresponding depth map for each view, greatly improving the completeness and overall quality of the reconstruction. A key step of that model is to build a cost volume through a plane-sweep process and regularize it with a multi-scale 3D convolutional neural network (CNN).
However, this consumes significant GPU memory, and memory usage grows cubically with increasing image resolution. Some methods reduce memory usage by downsampling the input images; although this effectively lowers memory occupancy, it loses feature information, so the estimated depth map has low resolution and reconstruction accuracy and completeness degrade considerably. Furthermore, feature extraction is one key problem of the deep-learning-based multi-view stereo algorithm, and cost-volume generation is another. Most deep-learning-based multi-view three-dimensional reconstruction methods perform feature extraction with stacked CNN blocks, but within a coarse-to-fine strategy such blocks struggle to capture long-range interdependencies between pixels and therefore miss information that is important for the depth-inference task.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a system for multi-view three-dimensional reconstruction of a feature pyramid.
In order to achieve the purpose, the invention provides the following scheme:
a characteristic pyramid multi-view three-dimensional reconstruction method comprises the following steps:
acquiring image data; the image data includes: a reference image and a plurality of source images;
acquiring a multi-view three-dimensional reconstruction network model; the multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters; the deep neural network model comprises a feature pyramid network based on a self-attention mechanism;
obtaining a prediction depth map by using the multi-view three-dimensional reconstruction network model and taking the image data as input;
generating 3D point cloud data according to the predicted depth map;
and carrying out visualization processing on the 3D point cloud data to obtain a reconstructed three-dimensional view.
Preferably, the obtaining a predicted depth map by using the multi-view three-dimensional reconstruction network model and using the image data as input specifically includes:
performing feature extraction on the image data by adopting the feature pyramid network based on the self-attention mechanism to obtain feature maps with different resolutions; the feature map comprises a reference image feature map and a source image feature map;
constructing a cost body according to the extracted features by adopting a grouping correlation method;
generating a probability body by using the cost body and softmax regression;
predicting according to the probability body to obtain an initial depth map;
and upsampling the initial depth map to obtain the predicted depth map.
Preferably, the constructing the cost body according to the extracted features by using a group correlation method specifically includes:
uniformly dividing the characteristic channels in the reference image characteristic diagram and the characteristic channels in the source image characteristic diagram into a plurality of characteristic groups according to channel dimensions;
determining the similarity between the features of the reference image and the features of the source images in each feature group, and compressing the similarity between the reference image and each source image into an initial cost body;
constructing the cost body according to the initial cost body; the cost body is the average value of all the initial cost bodies.
Preferably, the generating 3D point cloud data according to the predicted depth map specifically includes:
and fusing the prediction depth map by adopting a processing method of photometric filtering, geometric consistency filtering and depth fusion to generate 3D point cloud data.
Preferably, the acquiring image data further comprises:
acquiring an image dataset; the image dataset comprises a DTU reference dataset and a Tanks and Temples reference dataset.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method for reconstructing the multi-view three-dimensional of the characteristic pyramid provided by the invention realizes the multi-view three-dimensional reconstruction based on the acquired image data by adopting the multi-view three-dimensional reconstruction network model comprising the characteristic pyramid network based on the self-attention mechanism, solves the problems that a computing cost body occupies too large video memory space, a high-resolution depth map is difficult to estimate, reconstruction point clouds are incomplete and the like in the conventional image three-dimensional reconstruction process, and realizes more complete multi-view three-dimensional reconstruction by relatively lower video memory and more accurate and precise depth prediction. The method can be applied to three-dimensional modeling in the fields of cultural relic protection, automatic driving, industrial parts, virtual reality, buildings, medical images and the like, and brings important application value and practical significance to scientific research and development in the fields.
Corresponding to the provided characteristic pyramid multi-view three-dimensional reconstruction method, the invention also correspondingly provides the following specific implementation system:
a feature pyramid multi-view three-dimensional reconstruction system, comprising:
the image data acquisition module is used for acquiring image data; the image data includes: a reference image and a plurality of source images;
the multi-view three-dimensional reconstruction network model acquisition module is used for acquiring a multi-view three-dimensional reconstruction network model; the multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters; the deep neural network model comprises a feature pyramid network based on a self-attention mechanism;
the prediction depth map determining module is used for obtaining a prediction depth map by using the multi-view three-dimensional reconstruction network model and taking the image data as input;
the 3D point cloud data generation module is used for generating 3D point cloud data according to the predicted depth map;
and the three-dimensional view reconstruction module is used for performing visualization processing on the 3D point cloud data to obtain a reconstructed three-dimensional view.
Preferably, the predicted depth map determination module comprises:
the feature extraction unit is used for extracting features of the image data by adopting the feature pyramid network based on the self-attention mechanism to obtain feature maps with different resolutions; the feature map comprises a reference image feature map and a source image feature map;
the cost body construction unit is used for constructing a cost body according to the extracted features by adopting a grouping correlation method;
a probability body generation unit for generating a probability body using the cost body and softmax regression;
the initial depth map prediction unit is used for obtaining an initial depth map according to the probability body prediction;
and the upsampling unit is used for upsampling the initial depth map to obtain the predicted depth map.
Preferably, the cost construction unit includes:
the feature group dividing subunit is used for uniformly dividing feature channels in the reference image feature map and feature channels in the source image feature map into a plurality of feature groups according to channel dimensions;
the compression subunit is used for determining the similarity between the features of the reference image and the features of the source images in each feature group and compressing the similarity between the reference image and each source image into an initial cost body;
a cost body constructing subunit, configured to construct the cost body according to the initial cost body; the cost body is the average value of all the initial cost bodies.
Preferably, the 3D point cloud data generating module includes:
and the 3D point cloud data generation unit is used for fusing the predicted depth map by adopting a processing method of photometric filtering, geometric consistency filtering and depth fusion to generate 3D point cloud data.
Preferably, the system further comprises:
an image dataset acquisition module for acquiring an image dataset; the image dataset comprises a DTU reference dataset and a Tanks and Temples reference dataset.
Since the technical effect achieved by the feature pyramid multi-view three-dimensional reconstruction system provided by the invention is the same as the technical effect achieved by the feature pyramid multi-view three-dimensional reconstruction method provided by the invention, the details are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a feature pyramid multi-view three-dimensional reconstruction method according to the present invention;
FIG. 2 is a block diagram of a multi-view three-dimensional reconstruction network model provided in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a feature pyramid network feature extraction module based on a self-attention mechanism according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the difference between two-dimensional convolution layers and self-attention layers when the convolution kernel is 3 according to an embodiment of the present invention;
FIG. 5 is a three-dimensional reconstruction qualitative result diagram of different view reconstruction methods of a DTU data set scene 23; wherein, fig. 5 (a) is a three-dimensional reconstruction qualitative result diagram of the MVSNet method; FIG. 5 (b) is a three-dimensional reconstruction qualitative result diagram of the CasMVSNet method; FIG. 5 (c) is a three-dimensional reconstruction qualitative result diagram of the feature pyramid multi-view three-dimensional reconstruction method provided by the present invention;
FIG. 6 is a qualitative result diagram of a point cloud reconstructed from a partial scene of a DTU data set;
FIG. 7 is a diagram of qualitative results of point cloud reconstruction on the Tanks and Temples intermediate dataset;
fig. 8 is a schematic structural diagram of a feature pyramid multi-view three-dimensional reconstruction system provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a method and a system for multi-view three-dimensional reconstruction of a characteristic pyramid, which are used for solving the problems that a computing cost body in the existing method occupies too large video memory space, a high-resolution depth map is difficult to estimate, reconstructed point cloud is incomplete and the like, and further realizing more complete and accurate multi-view three-dimensional reconstruction.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
As shown in fig. 1, the feature pyramid multi-view three-dimensional reconstruction method provided by the present invention includes:
step 100: image data is acquired. The image data includes: a reference image and a plurality of source images. The image data obtained by the invention is mainly from the image data set. The image dataset used preferably comprises a DTU reference dataset and a Tanks and instances reference dataset.
Step 101: a multi-view three-dimensional reconstruction network model is acquired. The multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters, and includes a feature pyramid network based on a self-attention mechanism. Specifically, a reference image and a plurality of source images are input; the camera parameters are combined with the deep neural network by embedding them through a differentiable homography warping operation, and the 2D image feature network is connected with the 3D spatial regularization network, forming an end-to-end deep-learning multi-view three-dimensional reconstruction network model. The specific structure of the multi-view three-dimensional reconstruction network model is shown in fig. 2, where D represents differentiable homography warping, C represents the group-wise correlation method, S represents softmax regression, and arrows represent upsampling.
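The camera-parameter embedding above rests on the plane-induced homography used in plane-sweep MVS. The following is a minimal NumPy sketch (not the patent's PyTorch code) of that homography for a fronto-parallel plane; the camera matrices and the demo point are hypothetical values chosen for illustration.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, d, n=np.array([0.0, 0.0, 1.0])):
    """Homography induced by the fronto-parallel plane n^T X = d, with the
    reference camera canonical (P_ref = K_ref[I|0]) and the source camera
    P_src = K_src[R|t]. Maps reference pixels onto the source image."""
    return K_src @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_ref)

def warp(H, uv):
    """Apply a homography to a pixel (u, v) and dehomogenize."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Demo: a 3D point lying on the depth plane z = 2 must warp consistently.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])        # source camera translated 10 cm
X = np.array([0.5, 0.3, 2.0])        # point on the plane z = 2

uv_ref = (K @ X)[:2] / X[2]                          # reference projection
uv_src_true = (K @ (R @ X + t))[:2] / (R @ X + t)[2]  # direct source projection
uv_src_warp = warp(plane_homography(K, K, R, t, d=2.0), uv_ref)
```

Because the operation is a matrix product, it is differentiable and can be embedded in the network, which is what makes the end-to-end training described above possible.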
Step 102: and obtaining a prediction depth map by using the image data as input by adopting a multi-view three-dimensional reconstruction network model.
Step 103: and generating 3D point cloud data according to the predicted depth map. Specifically, the predicted depth map is fused by adopting a processing method of photometric filtering, geometric consistency filtering and depth fusion to generate 3D point cloud data. The point cloud obtained by the method is more complete and richer in details, and a better result is obtained on the DTU reference data set.
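Before fusion, each predicted depth map can be back-projected into camera-frame 3D points. This hedged NumPy sketch shows only the unprojection step X = d · K⁻¹[u, v, 1]ᵀ; the intrinsics and the constant-depth toy map are illustrative, and the patent's full pipeline additionally applies the filtering and fusion steps described above.

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map of shape (H, W) into camera-frame 3D points
    of shape (H*W, 3): X = d * K^{-1} [u, v, 1]^T for each pixel."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grids
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                            # 3 x (H*W)
    return (rays * depth.reshape(1, -1)).T                   # (H*W) x 3

K = np.array([[100.0, 0, 8], [0, 100.0, 6], [0, 0, 1]])      # toy intrinsics
depth = np.full((12, 16), 2.0)                               # constant depth
pts = depth_to_points(depth, K)
```

For a constant-depth map, every point lies on the plane z = 2, and the pixel at the principal point unprojects to (0, 0, 2).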
Step 104: the 3D point cloud data are visualized to obtain a reconstructed three-dimensional view, where the point cloud is a representation of the three-dimensional reconstruction. The preferred visualization approach of the invention is to visualize the point cloud data with the Open3D open-source library for 3D data processing, obtaining a reconstructed point cloud picture, i.e., the three-dimensional view.
Further, the specific implementation process of the step 102 is as follows:
step 1020: and (4) extracting the features of the image data by adopting a feature pyramid network based on a self-attention mechanism to obtain feature maps with different resolutions. The feature map comprises a reference image feature map and a source image feature map.
Introducing a self-attention mechanism into the hierarchical feature extraction of the feature pyramid allows long-range interdependencies between pixels to be captured: the self-attention layer covers a larger receptive field and gathers more important information across the whole space for depth-inference learning, which helps determine whether corresponding image patches match.
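To make the long-range dependency concrete, here is a minimal single-head self-attention layer over a feature map, sketched in NumPy for clarity (the patent uses PyTorch); the random projection weights are purely illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(feat, Wq, Wk, Wv):
    """Single-head self-attention over a feature map of shape (H, W, C).
    Every spatial position attends to every other position, so the
    receptive field spans the whole image, unlike a 3x3 convolution."""
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])                 # scaled dot product
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                # softmax over pixels
    return (attn @ v).reshape(H, W, -1), attn

C = 8
feat = rng.standard_normal((4, 4, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
out, attn = self_attention(feat, Wq, Wk, Wv)
```

Each row of `attn` is a probability distribution over all 16 pixel positions, which is precisely the "whole space" aggregation described above.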
Specifically, the feature pyramid network based on the self-attention mechanism extracts three feature maps with different resolutions; the network adopts a three-layer structure to construct the three kinds of feature maps, and a self-attention layer is introduced into the hierarchical feature extraction so as to capture a larger receptive field and more important context information. The feature pyramid network consists of a bottom-up part and a top-down part. Bottom-up feature extraction downsamples the image to reduce its resolution, yielding resolutions of 1, 1/4, and 1/16 of the size of the input original image in turn; spatial information is lost, but high-level semantic information can be extracted. The feature extraction module is composed of 5 convolutional layers and 3 self-attention layers, as shown in fig. 3, where the parameters of the convolutional and self-attention layers from the left end to the right end of fig. 3 are: channels 8, kernel 3, stride 1; channels 8, kernel 3, stride 1; channels 16, kernel 5, stride 2; channels 16, kernel 3, stride 1; channels 16, kernel 3, stride 1; channels 32, kernel 5, stride 2; channels 32, kernel 3, stride 1; channels 32, kernel 3, stride 1.
Second, a top-down pass upsamples the features, combining high-level semantic information with high-resolution fine features: lateral connections add the features of each level to the upsampled features of the level above to obtain the features of the next level. The lowest level thus obtains a feature map with both high-level semantic information and high-resolution fine features, and the corresponding feature-map resolutions are 1/16, 1/4, and 1 of the original image size.
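The bottom-up/top-down flow with lateral addition can be sketched shape-wise as follows. This is a hedged illustration: average pooling and nearest-neighbour upsampling stand in for the strided convolutions and learned upsampling of the actual network, and single-channel maps are used so only the pyramid wiring is shown.

```python
import numpy as np

def downsample4(x):
    """4x average-pool per dimension (stand-in for strided conv layers)."""
    H, W = x.shape
    return x.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))

def upsample4(x):
    """4x nearest-neighbour upsampling for the top-down pathway."""
    return x.repeat(4, axis=0).repeat(4, axis=1)

# Bottom-up: feature maps at 1, 1/4, and 1/16 of the input resolution.
img = np.random.default_rng(1).standard_normal((64, 80))
c1 = img                  # full resolution
c2 = downsample4(c1)      # coarser
c3 = downsample4(c2)      # coarsest

# Top-down with lateral connections: upsample the coarser level and add.
p3 = c3
p2 = c2 + upsample4(p3)
p1 = c1 + upsample4(p2)
```

The finest output `p1` keeps the input resolution while having accumulated information from the coarse, semantically rich levels, matching the description above.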
Step 1021: a cost volume is constructed from the extracted features by the group-wise correlation method. Specifically, the higher-spatial-resolution feature maps of the feature pyramid network are used to construct higher-resolution cost volumes. Multi-scale depth features of the input image are extracted with the self-attention feature pyramid network, the 2D feature maps are converted into 3D feature volumes by the homography warping module, and cost volumes are constructed from the feature volumes in a cascaded manner with the group-wise correlation method. The group-wise correlation method, based on a similarity measure, groups the features and computes similarity group by group, which yields more effective features and builds a lightweight cost volume. Compared with variance-based cost-volume construction, the similarity-based group-wise correlation method reduces GPU memory usage and obtains more effective features, thereby improving reconstruction quality. Based on this construction principle, step 1021 is implemented as follows:
and uniformly dividing the characteristic channels in the reference image characteristic diagram and the characteristic channels in the source image characteristic diagram into a plurality of characteristic groups according to the channel dimension.
And determining the similarity between the features of the reference image and the features of the source image in each feature group, and compressing the similarity between the reference image and each source image into an initial cost body.
And constructing the cost body according to the initial cost body. The cost body is the average of all initial cost bodies.
Specifically, the three-layer structure of the feature pyramid network constructs three kinds of cost volumes, and the spatial resolution of the cost volume at each stage doubles together with the resolution of the input feature map. With the 3-level cascaded cost volume, the spatial resolutions of the layers are $\frac{W}{4}\times\frac{H}{4}$, $\frac{W}{2}\times\frac{H}{2}$, and $W\times H$, and the cost volumes are constructed by the group-wise correlation method. In the image matching process, the cost volume is built from a similarity measure; the basic idea is to group the features and compute similarity features group by group. Specifically, for the deep reference-image features and deep source-image features at depth hypothesis plane d, their feature channels are uniformly divided into G groups along the channel dimension, and then the similarity between the reference and source features is computed. When the feature similarity between the reference image and a source image has been computed in all groups, the original feature map is compressed into a G-channel similarity feature map. After the feature similarity has been computed on all depth hypothesis planes, the feature similarity between the reference image and the i-th source image is further compressed into a cost volume C_i; finally, the cost volume C is computed as the average of the cost volumes of all views. The cost volume C is then regularized by a 3D CNN.
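A minimal NumPy sketch of the group-wise correlation step described above (the patent's implementation is in PyTorch; the random feature tensors stand in for warped features, and the sizes C, G, D, V are illustrative):

```python
import numpy as np

def group_correlation(f_ref, f_src, G):
    """Group-wise correlation between reference and (warped) source features
    of shape (C, H, W): channels are split into G groups, and each group is
    compressed to one similarity channel, giving a (G, H, W) map."""
    C, H, W = f_ref.shape
    r = f_ref.reshape(G, C // G, H, W)
    s = f_src.reshape(G, C // G, H, W)
    return (r * s).mean(axis=1)                     # per-group inner product

rng = np.random.default_rng(2)
C, H, W, G, D, V = 32, 8, 8, 4, 6, 3   # channels, map size, groups, depths, views
f_ref = rng.standard_normal((C, H, W))

# One warped source feature map per view and depth hypothesis (random stand-ins
# for the homography-warped features).
cost_views = []
for _ in range(V):
    per_depth = [group_correlation(f_ref, rng.standard_normal((C, H, W)), G)
                 for _ in range(D)]
    cost_views.append(np.stack(per_depth, axis=1))  # C_i of shape (G, D, H, W)
cost = np.mean(cost_views, axis=0)                  # average over all views
```

Note the memory saving: each view contributes a G-channel similarity map per depth plane instead of a full C-channel feature volume, which is why the resulting cost volume is lightweight.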
Step 1022: probability volumes are generated using cost volumes and softmax regression.
Step 1023: an initial depth map is predicted from the probability volume. Specifically, the probability volume gives, for each pixel, an estimate of the probability of each depth; computing these estimates for all pixels at all depths forms a probability distribution, and the distribution along the depth direction also reflects the quality of the depth estimate, so a probability map can be obtained. The threshold parameter for outliers can then be determined from these probabilities, so that the depth map is recovered from the probability map.
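Steps 1022 and 1023 together can be sketched as a softmax over the depth axis followed by an expectation over the depth hypotheses (the soft-argmin common in MVSNet-style networks). This NumPy illustration assumes that lower cost means a better match, so the softmax is taken over the negated cost; the toy cost volume is constructed so every pixel prefers one plane.

```python
import numpy as np

def soft_argmin_depth(cost, depth_values):
    """Turn a cost volume of shape (D, H, W) into a probability volume with
    a softmax along the depth axis, then regress per-pixel depth as the
    expectation over the depth hypotheses."""
    s = -cost                                        # low cost = high score
    p = np.exp(s - s.max(axis=0, keepdims=True))     # numerically stable
    p /= p.sum(axis=0, keepdims=True)                # probability volume
    depth = (depth_values[:, None, None] * p).sum(axis=0)
    return depth, p

depths = np.linspace(2.0, 4.0, 8)        # 8 depth hypothesis planes
cost = np.ones((8, 4, 4))
cost[5] = -10.0                          # every pixel strongly prefers plane 5
depth_map, prob = soft_argmin_depth(cost, depths)
```

The per-pixel maximum of `prob` along the depth axis is exactly the probability map used above to threshold outliers.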
Step 1024: and upsampling the initial depth map to obtain a predicted depth map.
The specific implementation principle of steps 1023 and 1024 is as follows: extracting multi-scale depth features through a feature pyramid network based on a self-attention mechanism, then carrying out depth prediction in three stages, predicting an initial depth map at the coarsest level, and generating a high-resolution depth map at the finest level through gradual up-sampling iterative optimization.
The following describes the advantages of the feature pyramid multi-view three-dimensional reconstruction method provided by the invention, taking experiments on the DTU reference dataset and the Tanks and Temples reference dataset as examples. In practice, the invention may also be implemented on the basis of other image datasets.
In the embodiment, the network is built with the open-source deep learning framework PyTorch, and the experiments are run as Python code on an NVIDIA GeForce GTX 2080 Ti (11 GB) graphics card. For quantitative comparison with existing methods, the invention adopts the official evaluation procedures provided by the public DTU dataset and the Tanks and Temples dataset to evaluate its reconstruction quality. On this basis, during the experiments, the feature pyramid multi-view three-dimensional reconstruction method provided by the invention comprises the following steps:
1) The data sets adopt a DTU reference data set and a Tanks and Temples reference data set, and the two data sets provide a series of complete evaluation procedures for multi-view three-dimensional reconstruction. Where the DTU reference dataset is a large indoor MVS dataset, 124 different scenes are scanned from 49 or 64 view positions under 7 different lighting conditions. The Tanks and Temples reference dataset is a large scale outdoor dataset that is scanned in a more complex environment. It contains two sets of scene data, intermediate set and advanced set.
2) Features are extracted with the feature pyramid network based on the self-attention mechanism; a self-attention layer is introduced into the hierarchical feature extraction to capture a larger receptive field and more important context information, providing important information for the subsequent depth-inference stage. The difference between a two-dimensional convolution layer and a self-attention layer when the convolution kernel k = 3 is shown in fig. 4.
3) A multi-scale cost volume is constructed in a cascaded manner with the group-wise correlation method. The pyramid network adopts a 3-level cascade architecture to construct cost volumes at three resolutions. A lightweight cost volume can be built with the group-wise correlation method; in the image matching process, the cost volume is constructed through a similarity measure. Specifically, for the deep reference-image features and deep source-image features at depth hypothesis plane d, their feature channels are uniformly divided into G groups along the channel dimension, and then the similarity between the reference and source features is computed. When the feature similarity between the reference image and a source image has been computed in all groups, the original feature map is compressed into a G-channel similarity feature map. After the feature similarity has been computed on all depth hypothesis planes, the feature similarity between the reference image and the i-th source image is further compressed into a cost volume C_i; finally, the total cost volume C is computed as the average of the per-view cost volumes C_i.
4) A coarse-to-fine depth map inference strategy is adopted, with depth prediction divided into three stages. Depth inference starts from the coarsest level: a probability volume is generated by regularizing the cost volume with a 3D CNN and applying softmax regression, from which an initial depth map is predicted. Through successive upsampling and iterative refinement, a high-resolution depth map is finally generated at the finest level, namely the predicted depth map.
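The softmax-regression step that turns a regularized cost volume into a depth map can be sketched as a soft argmax over the depth hypotheses. The depth range and volume sizes below are illustrative assumptions, and the 3D CNN regularization itself is omitted; only the probability-volume-to-depth-map step is shown.

```python
import numpy as np

def regress_depth(cost_volume, depth_hypotheses):
    """Turn a (regularized) cost volume into a depth map by soft argmax.

    cost_volume      : (D, H, W) per-hypothesis matching score
                       (assumed higher = better match here)
    depth_hypotheses : (D,) sampled depth values

    Softmax along the depth axis yields the probability volume; the depth map
    is the probability-weighted average of the hypotheses at each pixel.
    """
    c = cost_volume - cost_volume.max(axis=0, keepdims=True)  # stability
    prob = np.exp(c)
    prob /= prob.sum(axis=0, keepdims=True)                   # probability volume
    return np.tensordot(depth_hypotheses, prob, axes=1)       # (H, W) depth map

depths = np.linspace(425.0, 935.0, 48)   # illustrative depth sampling range
rng = np.random.default_rng(2)
cost = rng.standard_normal((48, 4, 4))   # stand-in for a regularized cost volume
depth_map = regress_depth(cost, depths)
print(depth_map.shape)
```

Because the result is a convex combination of the hypotheses, every predicted depth lies inside the sampled range; in the cascade, the finer stages re-sample a narrower depth range around the upsampled coarse prediction.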
5) After the depth maps of all images have been predicted, all predicted depth maps are fused into a dense 3D point cloud representation by a post-processing pipeline of photometric filtering, geometric consistency filtering, and depth fusion. Specifically, the photometric consistency and the geometric consistency between the reference image and the source images are computed, depths and redundant pixels falling below a threshold are filtered out, and then the images with the largest overlapping areas are iteratively selected, paired, and back-projected into three-dimensional space to generate the three-dimensional point cloud model.
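The geometric consistency check can be sketched as a per-pixel mask for one (reference, source) pair. The forward-backward reprojection itself is assumed to have been done already; the thresholds below are common defaults in MVS post-processing, not values specified in the patent.

```python
import numpy as np

def consistency_mask(depth_ref, depth_reproj, reproj_err,
                     rel_depth_thresh=0.01, pix_thresh=1.0):
    """Geometric-consistency filter for one (reference, source) view pair.

    depth_ref    : (H, W) reference-view depth prediction
    depth_reproj : (H, W) reference depth re-estimated by projecting each pixel
                   into the source view and back (forward-backward reprojection)
    reproj_err   : (H, W) pixel distance between each pixel and its reprojection

    A pixel is kept when its reprojection lands within pix_thresh pixels and
    the relative depth difference is below rel_depth_thresh.
    """
    rel_diff = np.abs(depth_reproj - depth_ref) / depth_ref
    return (reproj_err < pix_thresh) & (rel_diff < rel_depth_thresh)

# Tiny synthetic example: pixel (1,0) fails the depth check, (1,1) the pixel check.
depth_ref = np.full((2, 2), 100.0)
depth_reproj = np.array([[100.5, 100.2], [103.0, 100.1]])
reproj_err = np.array([[0.3, 0.5], [0.2, 2.0]])
mask = consistency_mask(depth_ref, depth_reproj, reproj_err)
print(mask)
```

Pixels passing this mask in enough views (together with the photometric filter) are the ones back-projected and fused into the dense point cloud.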
6) The reconstruction result is visualized with Open3D, an open-source library for 3D data processing, to obtain the point cloud rendering, namely the reconstructed three-dimensional view.
Based on the above experimental steps, the overall implementation principle of the feature pyramid multi-view three-dimensional reconstruction method provided by the invention is as follows. First, N images are input, comprising 1 reference image and (N-1) source images; the input images are divided into multiple scales by a feature pyramid network based on a self-attention mechanism, and the 2D feature extraction network extracts three feature maps of different resolutions. Second, each 2D feature map is converted into a 3D feature volume by differentiable homography warping, which maps the features of a source image onto the depth planes of the reference camera's view frustum; analogous to an epipolar search (a geometric constraint), this conversion introduces depth information. After differentiable homography warping, N feature volumes are obtained. Depth prediction is divided into three stages following a coarse-to-fine strategy, predicting multi-resolution depth maps from the multi-scale image features; at each stage, a multi-scale cost volume is constructed from the feature volumes by the cascade architecture combined with the group-wise correlation method. Depth inference starts from the coarsest level: a probability volume is generated by regularizing the cost volume with a 3D CNN and applying softmax regression, and the initial depth map is predicted from it. A high-resolution depth map is finally generated through successive upsampling and iterative refinement. Finally, all predicted depth maps are fused into a dense 3D point cloud representation by the post-processing pipeline of photometric filtering, geometric consistency filtering, and depth fusion.
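The differentiable homography warping step above relies on the standard plane-induced homography between two calibrated views. The following NumPy sketch builds that 3x3 matrix for a fronto-parallel plane at a given depth; the sign and plane-normal conventions vary between formulations, so this is an illustrative sketch rather than the patent's exact warping operator, and the intrinsics used in the demo are invented values.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth, n=np.array([0.0, 0.0, 1.0])):
    """Homography mapping reference-view pixels to source-view pixels via the
    fronto-parallel plane at the given depth (plane-sweep / MVSNet-style).

    K_ref, K_src : (3, 3) camera intrinsics of the reference and source views
    R, t         : rotation (3, 3) and translation (3,) of the source camera
                   relative to the reference camera
    depth        : depth of the sweeping plane in the reference frame
    n            : normal of the sweeping plane in the reference frame

    Standard plane-induced homography: H = K_src (R + t n^T / d) K_ref^{-1}.
    """
    H = K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)
    return H / H[2, 2]                       # normalize the projective scale

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])              # illustrative intrinsics
# Identical cameras and zero baseline must give the identity homography.
H_id = plane_homography(K, K, np.eye(3), np.zeros(3), depth=600.0)
print(np.round(H_id, 6))
```

Applying H(d) for every depth hypothesis d resamples the source feature map onto each depth plane of the reference view frustum, which is exactly what stacks the 2D feature maps into a 3D feature volume; because the warp is built from differentiable matrix operations, gradients flow through it during training.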
As can be seen from fig. 5, under the same input image resolution setting, the point cloud reconstructed by the feature pyramid multi-view three-dimensional reconstruction method provided by the invention is more complete and reproduces details better than those of the MVSNet and CasMVSNet methods.
Table 1 shows the experimental results of evaluating 4 conventional methods, 6 deep-learning-based methods, and the feature pyramid multi-view three-dimensional reconstruction method provided by the invention on the DTU dataset under the same experimental parameter settings. As can be seen from table 1, the proposed method (FPSA-MVSNet) is significantly superior to existing methods in completeness and overall score. The accuracy and completeness of the proposed method reach 0.366 mm and 0.284 mm respectively, and the overall score improves by 8.5% over CasMVSNet, the best existing method, demonstrating the effectiveness of the proposed method. Fig. 6 shows qualitative results of point clouds reconstructed for several DTU scenes.
As shown in table 2, scene point clouds are reconstructed on the Tanks and Temples intermediate dataset using the model trained on the DTU dataset, without any fine-tuning. Under the same experimental parameter settings, the proposed method achieves better results, improving the average F-score by 2.0% over the prior CasMVSNet method. This demonstrates the effectiveness and robustness of the proposed method in complex scenes. FIG. 7 shows qualitative results of point clouds reconstructed on the Tanks and Temples intermediate dataset.
TABLE 1 quantitative results of DTU dataset scene reconstruction
TABLE 2 quantitative results of Tanks and Temples intermediate dataset scene reconstruction
Experimental results show that the feature pyramid multi-view three-dimensional reconstruction method provided by the invention effectively handles multi-view three-dimensional reconstruction of complex scenes, significantly improves the completeness of the reconstructed point cloud, has low GPU memory usage and high efficiency, and achieves the best reconstruction results on the DTU benchmark dataset compared with existing methods. The method has broad application prospects in fields such as cultural relic protection, autonomous driving, industrial parts, virtual reality, architecture, and medical imaging.
In addition, corresponding to the above-provided feature pyramid multi-view three-dimensional reconstruction method, the present invention also provides a feature pyramid multi-view three-dimensional reconstruction system, as shown in fig. 8, where the feature pyramid multi-view three-dimensional reconstruction system includes: the system comprises an image data acquisition module 1, a multi-view three-dimensional reconstruction network model acquisition module 2, a prediction depth map determination module 3, a 3D point cloud data generation module 4 and a three-dimensional view reconstruction module 5.
The image data acquiring module 1 is used for acquiring image data. The image data includes: a reference image and a plurality of source images.
The multi-view three-dimensional reconstruction network model obtaining module 2 is used for obtaining a multi-view three-dimensional reconstruction network model. The multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters. The deep neural network model includes a feature pyramid network based on a self-attention mechanism.
The prediction depth map determining module 3 is configured to obtain a prediction depth map by using the image data as input, using a multi-view three-dimensional reconstruction network model.
The 3D point cloud data generation module 4 is configured to generate 3D point cloud data according to the predicted depth map.
The three-dimensional view reconstruction module 5 is used for performing visualization processing on the 3D point cloud data to obtain a reconstructed three-dimensional view.
Further, the prediction depth map determining module 3 adopted by the present invention includes: a feature extraction unit, a cost volume construction unit, a probability volume generation unit, an initial depth map prediction unit, and an upsampling unit.

The feature extraction unit is used for performing feature extraction on the image data with the feature pyramid network based on the self-attention mechanism to obtain feature maps of different resolutions. The feature maps include a reference image feature map and source image feature maps.

The cost volume construction unit is used for constructing a cost volume from the extracted features by the group-wise correlation method.

The probability volume generation unit is used for generating a probability volume from the cost volume with softmax regression.

The initial depth map prediction unit is used for predicting an initial depth map from the probability volume.

The upsampling unit is used for upsampling the initial depth map to obtain the prediction depth map.

Further, the cost volume construction unit adopted by the invention includes: a feature group division subunit, a compression subunit, and a cost volume construction subunit.

The feature group division subunit is used for evenly dividing the feature channels in the reference image feature map and the feature channels in the source image feature maps into a plurality of feature groups along the channel dimension.

The compression subunit is used for determining the similarity between the reference image features and the source image features in each feature group, and compressing the similarity between the reference image and each source image into an initial cost volume.

The cost volume construction subunit is used for constructing the cost volume from the initial cost volumes. The cost volume is the average of all initial cost volumes.
Further, the 3D point cloud data generating module 4 adopted by the present invention includes: a 3D point cloud data generation unit.
The 3D point cloud data generation unit is used for fusing the predicted depth maps by the processing pipeline of photometric filtering, geometric consistency filtering, and depth fusion to generate 3D point cloud data.
Further, the feature pyramid multi-view three-dimensional reconstruction system provided by the invention further comprises: an image dataset acquisition module.
The image data set acquisition module is used for acquiring an image data set. The image dataset includes a DTU reference dataset and a Tanks and Temples reference dataset.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (6)

1. A feature pyramid multi-view three-dimensional reconstruction method, characterized by comprising the following steps:
acquiring image data; the image data includes: a reference image and a plurality of source images;
acquiring a multi-view three-dimensional reconstruction network model; the multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters; the deep neural network model comprises a feature pyramid network based on a self-attention mechanism;
obtaining a prediction depth map by using the multi-view three-dimensional reconstruction network model and taking the image data as input;
generating 3D point cloud data according to the predicted depth map;
performing visualization processing on the 3D point cloud data to obtain a reconstructed three-dimensional view;
the obtaining a prediction depth map by using the multi-view three-dimensional reconstruction network model with the image data as input specifically comprises:

performing feature extraction on the image data with the feature pyramid network based on the self-attention mechanism to obtain feature maps of different resolutions; the feature maps comprise a reference image feature map and source image feature maps;

constructing a cost volume from the extracted features by a group-wise correlation method;

generating a probability volume using the cost volume and softmax regression;

predicting an initial depth map from the probability volume;

upsampling the initial depth map to obtain the prediction depth map;

the constructing a cost volume from the extracted features by a group-wise correlation method specifically comprises:

evenly dividing the feature channels in the reference image feature map and the feature channels in the source image feature maps into a plurality of feature groups along the channel dimension;

determining the similarity between the reference image features and the source image features in each feature group, and compressing the similarity between the reference image and each source image into an initial cost volume;

constructing the cost volume from the initial cost volumes; the cost volume is the average of all the initial cost volumes.
2. The method of claim 1, wherein generating 3D point cloud data from the predicted depth map specifically comprises:

fusing the predicted depth maps by a processing pipeline of photometric filtering, geometric consistency filtering, and depth fusion to generate the 3D point cloud data.
3. The method of claim 1, wherein the obtaining image data further comprises:
acquiring an image dataset; the image dataset comprises a DTU reference dataset and a Tanks and Temples reference dataset.
4. A feature pyramid multi-view three-dimensional reconstruction system, comprising:
the image data acquisition module is used for acquiring image data; the image data includes: a reference image and a plurality of source images;
the multi-view three-dimensional reconstruction network model acquisition module is used for acquiring a multi-view three-dimensional reconstruction network model; the multi-view three-dimensional reconstruction network model is a deep neural network model fused with camera parameters; the deep neural network model comprises a feature pyramid network based on a self-attention mechanism;
the prediction depth map determining module is used for obtaining a prediction depth map by using the multi-view three-dimensional reconstruction network model and taking the image data as input;
the 3D point cloud data generation module is used for generating 3D point cloud data according to the predicted depth map;
the three-dimensional view reconstruction module is used for performing visualization processing on the 3D point cloud data to obtain a reconstructed three-dimensional view;
the prediction depth map determining module comprises:

a feature extraction unit, configured to perform feature extraction on the image data with the feature pyramid network based on the self-attention mechanism to obtain feature maps of different resolutions; the feature maps comprise a reference image feature map and source image feature maps;

a cost volume construction unit, configured to construct a cost volume from the extracted features by a group-wise correlation method;

a probability volume generation unit, configured to generate a probability volume using the cost volume and softmax regression;

an initial depth map prediction unit, configured to predict an initial depth map from the probability volume;

an upsampling unit, configured to upsample the initial depth map to obtain the prediction depth map;

the cost volume construction unit comprises:

a feature group division subunit, configured to evenly divide the feature channels in the reference image feature map and the feature channels in the source image feature maps into a plurality of feature groups along the channel dimension;

a compression subunit, configured to determine the similarity between the reference image features and the source image features in each feature group, and compress the similarity between the reference image and each source image into an initial cost volume;

a cost volume construction subunit, configured to construct the cost volume from the initial cost volumes; the cost volume is the average of all the initial cost volumes.
5. The feature pyramid multi-view three-dimensional reconstruction system of claim 4, wherein the 3D point cloud data generation module comprises:
and the 3D point cloud data generation unit is used for fusing the predicted depth map by adopting a processing method of photometric filtering, geometric consistency filtering and depth fusion to generate 3D point cloud data.
6. The feature pyramid multi-view three-dimensional reconstruction system of claim 4, further comprising:
an image dataset acquisition module for acquiring an image dataset; the image dataset comprises a DTU reference dataset and a Tanks and Temples reference dataset.
CN202110707399.7A 2021-06-24 2021-06-24 Characteristic pyramid multi-view three-dimensional reconstruction method and system Active CN113345082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707399.7A CN113345082B (en) 2021-06-24 2021-06-24 Characteristic pyramid multi-view three-dimensional reconstruction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707399.7A CN113345082B (en) 2021-06-24 2021-06-24 Characteristic pyramid multi-view three-dimensional reconstruction method and system

Publications (2)

Publication Number Publication Date
CN113345082A CN113345082A (en) 2021-09-03
CN113345082B true CN113345082B (en) 2022-11-11

Family

ID=77478585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707399.7A Active CN113345082B (en) 2021-06-24 2021-06-24 Characteristic pyramid multi-view three-dimensional reconstruction method and system

Country Status (1)

Country Link
CN (1) CN113345082B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463492B (en) * 2022-01-12 2024-03-26 青海师范大学 Self-adaptive channel attention three-dimensional reconstruction method based on deep learning
CN114612541B (en) * 2022-03-23 2023-04-07 江苏万疆高科技有限公司 Implant printing method, device, equipment and medium based on 3D printing technology
CN115393316B (en) * 2022-08-24 2023-06-09 维都利阀门有限公司 Flash valve with erosion state monitoring system and monitoring method thereof
CN115908723B (en) * 2023-03-09 2023-06-16 中国科学技术大学 Polar line guided multi-view three-dimensional reconstruction method based on interval perception
CN116091712B (en) * 2023-04-12 2023-06-27 安徽大学 Multi-view three-dimensional reconstruction method and system for computing resource limited equipment
CN116740158B (en) * 2023-08-14 2023-12-05 小米汽车科技有限公司 Image depth determining method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method
CN111340844A (en) * 2020-02-24 2020-06-26 南昌航空大学 Multi-scale feature optical flow learning calculation method based on self-attention mechanism
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN112862824A (en) * 2020-04-17 2021-05-28 中山仰视科技有限公司 Novel coronavirus pneumonia focus detection method, system, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562243B2 (en) * 2017-11-17 2023-01-24 Meta Platforms, Inc. Machine-learning models based on non-local neural networks
CN112944105A (en) * 2021-01-28 2021-06-11 武汉中仪物联技术股份有限公司 Intelligent pipeline defect detection method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method
CN111340844A (en) * 2020-02-24 2020-06-26 南昌航空大学 Multi-scale feature optical flow learning calculation method based on self-attention mechanism
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN112862824A (en) * 2020-04-17 2021-05-28 中山仰视科技有限公司 Novel coronavirus pneumonia focus detection method, system, device and storage medium
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on intelligent detection of dress-code violations in electric power operations with Self-Attention; Mo Beibei et al.; Computer and Modernization; 2020-02-15 (No. 02); pp. 113-119, 124 *

Also Published As

Publication number Publication date
CN113345082A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113345082B (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
Huang et al. Deepmvs: Learning multi-view stereopsis
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN110443842B (en) Depth map prediction method based on visual angle fusion
Liu et al. A novel recurrent encoder-decoder structure for large-scale multi-view stereo reconstruction from an open aerial dataset
CN111652966B (en) Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
CN113066168B (en) Multi-view stereo network three-dimensional reconstruction method and system
CN107358576A (en) Depth map super resolution ratio reconstruction method based on convolutional neural networks
CN110570522B (en) Multi-view three-dimensional reconstruction method
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN110197505A (en) Remote sensing images binocular solid matching process based on depth network and semantic information
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
CN113077554A (en) Three-dimensional structured model reconstruction method based on any visual angle picture
CN114359509A (en) Multi-view natural scene reconstruction method based on deep learning
CN115359191A (en) Object three-dimensional reconstruction system based on deep learning
Zhang et al. Pa-mvsnet: Sparse-to-dense multi-view stereo with pyramid attention
He et al. Learning scene dynamics from point cloud sequences
CN115035240B (en) Real-time three-dimensional scene reconstruction method and device
CN115035298A (en) City streetscape semantic segmentation enhancement method based on multi-dimensional attention mechanism
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
Liu et al. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images
CN112116646B (en) Depth estimation method for light field image based on depth convolution neural network
CN115564888A (en) Visible light multi-view image three-dimensional reconstruction method based on deep learning
CN115482268A (en) High-precision three-dimensional shape measurement method and system based on speckle matching network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant