CN116310098A - Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network - Google Patents

Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network

Info

Publication number: CN116310098A
Authority: CN (China)
Prior art keywords: depth, network, map, view, feature
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310185694.XA
Other languages: Chinese (zh)
Inventors: 周贵云 (Zhou Guiyun), 周祥 (Zhou Xiang)
Current Assignee: University of Electronic Science and Technology of China (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: University of Electronic Science and Technology of China
Priority date: 2023-03-01 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2023-03-01
Publication date: 2023-06-23
Application filed by University of Electronic Science and Technology of China
Priority to CN202310185694.XA
Publication of CN116310098A

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/08 Learning methods
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
                • G06T 5/00 Image enhancement or restoration
                    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
                • G06T 7/00 Image analysis
                    • G06T 7/50 Depth or shape recovery
                        • G06T 7/55 Depth or shape recovery from multiple images
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
                • Y02T 10/00 Road transport of goods or passengers
                    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
                        • Y02T 10/40 Engine management systems


Abstract

The invention provides a three-dimensional reconstruction method based on deep learning, comprising the following steps: constructing a multi-scale feature extraction network and extracting multi-scale semantic features from the acquired multi-view images to obtain target key features at multiple scales; performing pixel-level iterative depth evaluation in a block matching iterative model based on an attention mechanism and a deformable convolution depth network, and outputting a depth map for each image once the iteration completes; extracting depth residuals through a residual network to optimize the depth map and obtain the final depth map; and filtering depth points through geometric-consistency and illumination-consistency constraints, fusing the depth maps according to a visibility-based fusion algorithm, and outputting the three-dimensional point cloud of the scene. The attention mechanism purifies the target features and depth evaluation is optimized through the deformable convolution depth network, so the resulting depth maps are more accurate and both the completeness and the accuracy of the three-dimensional reconstruction model are improved.

Description

Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network
Technical Field
The invention belongs to the field of geometric three-dimensional reconstruction in computer vision, and relates to a multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network.
Background
In recent years, deep learning has been widely applied in computer vision with great success, showing excellent performance on two-dimensional image tasks such as image classification, object detection, and semantic segmentation. However, an image carries only two-dimensional information and can convey a limited amount of it, so recovering richer three-dimensional information from two-dimensional images has gradually become one of the hot research problems.
Multi-view three-dimensional reconstruction recovers a three-dimensional scene from multiple images of the same scene taken from different viewpoints. Conventional multi-view three-dimensional reconstruction techniques fall into four categories: voxel-based, surface-evolution-based, feature-region-expansion-based, and depth-map-based. Depth-map-based methods are simple and flexible, but conventional depth-map methods struggle to reconstruct weakly textured or specular regions, and their reconstructions are incomplete and slow. Reconstruction methods based on deep learning make better use of global semantic information, significantly improving both the accuracy and the speed of reconstruction.
In multi-view reconstruction methods based on deep learning, a three-dimensional cost volume is generally constructed and depth values of the scene are obtained by depth regression. However, the three-dimensional convolutional regularization used in depth regression exhausts limited GPU memory when processing large-scale or high-resolution scenes. Moreover, the environmental factors affecting the reconstruction of outdoor large-scale scene models are complex: scene features are hard to extract comprehensively, parameter adaptivity is poor, models can only be built for specific scenes, and generalization capability is weak.
In view of the foregoing, these problems urgently need to be solved.
Disclosure of Invention
Addressing the above problems and needs, the present invention provides a multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network, which solves the above technical problems through the following technical scheme.
To achieve the above purpose, the present invention provides the following technical solution: a multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network, comprising the following steps:
step1: constructing a multi-scale feature extraction network based on a recursive feature pyramid structure, and extracting multi-scale semantic features from the acquired multi-view images to obtain complete and accurate target key features;
step2: inputting the features obtained in step1 into a block matching iterative model based on an attention mechanism and a deformable convolution depth network for pixel-level iterative depth evaluation, and outputting a depth map for each image once the iteration completes;
step3: taking the depth map obtained in step2 and the corresponding image as input, extracting depth residuals through a residual network, and optimizing the depth map to obtain the final depth map;
step4: taking all depth maps obtained by the iterative computation as input, filtering depth points through geometric-consistency and illumination-consistency constraints, fusing the depth maps according to a visibility-based fusion algorithm, and outputting the three-dimensional point cloud of the scene.
The recursive feature pyramid mechanism applies an atrous spatial pyramid pooling (ASPP) network to the feature maps output by the original feature pyramid structure to extract target key features, and feeds the obtained target features back into the feature pyramid network to purify them a second time, yielding feature maps with richer information.
During the iteration of the block matching model based on the attention mechanism and the deformable convolution depth network, the first iteration takes the target feature map as the initial input; each subsequent iteration concatenates the target feature map with the depth map output by the previous iteration as its input. In the depth evaluation itself, the target features are first enhanced by the attention mechanism, and the cost of each depth hypothesis of each pixel is then computed through group-wise correlation, as follows:
firstly, the key features of the reference image are strengthened by the attention mechanism, so that the iterative computation focuses on key targets in the image; pixels of each source view are then transformed into the reference-view coordinate system by homography, and the dot product of the reference view and the source view is computed in that coordinate system to obtain the image similarity; the feature channels are divided into groups, and each group similarity is projected by a three-dimensional convolutional network into a cost value for each depth hypothesis of each pixel of the reference image; the depth cost values of all source views are then sum-normalized to obtain the probability of each depth hypothesis of each pixel, and depth regression outputs a depth map; finally, a deformable convolution depth network samples the depth values of pixels similar to each pixel and applies depth filtering to smooth away erroneous depth estimates.
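To make the homography step concrete, the sketch below is a minimal PyTorch implementation of plane-sweep warping under assumed conventions; the function name warp_src_to_ref, the tensor layouts, and the pinhole intrinsics/extrinsics parameterization are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_feat, K_src, K_ref, R, t, depth_hyps):
    """Warp source-view features into the reference view for every depth hypothesis.

    src_feat: [B, C, H, W]; K_src, K_ref: [B, 3, 3]; R: [B, 3, 3]; t: [B, 3, 1];
    depth_hyps: [B, D, H, W]. Returns warped features of shape [B, C, D, H, W].
    """
    B, C, H, W = src_feat.shape
    D = depth_hyps.shape[1]
    # Homogeneous pixel grid of the reference view: [3, H*W]
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], 0).reshape(3, -1).to(src_feat.device)
    # Back-project each pixel to 3D at every hypothesised depth, then project
    # into the source view: K_src (R X + t), with X = d * K_ref^{-1} pix
    cam = torch.inverse(K_ref) @ pix                               # [B, 3, H*W]
    cam = cam.unsqueeze(1) * depth_hyps.reshape(B, D, 1, H * W)    # [B, D, 3, H*W]
    proj = K_src.unsqueeze(1) @ (R.unsqueeze(1) @ cam + t.unsqueeze(1))
    xy = proj[:, :, :2] / proj[:, :, 2:3].clamp(min=1e-6)          # [B, D, 2, H*W]
    # Normalise to [-1, 1] and bilinearly resample the source features
    gx = 2.0 * xy[:, :, 0] / (W - 1) - 1.0
    gy = 2.0 * xy[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, align_corners=True)     # [B, C, D*H, W]
    return warped.reshape(B, C, D, H, W)
```

The dot product of reference features with these warped source features then yields the image similarity described above.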
The optimizing residual network takes the target feature map and the obtained depth map as input, extracts depth residual information, and optimizes the depth map.
Depth map fusion projects the depth points that satisfy the geometric-constraint and illumination-consistency requirements into three-dimensional space to form a three-dimensional point cloud, yielding the reconstructed three-dimensional model.
From the above technical scheme, the beneficial effects of the invention are as follows: target key features with richer semantics are obtained through the recursive feature pyramid structure; the attention mechanism fuses global semantic information into the depth evaluation to compensate for the limitations of the convolutional network; and the depth map is filtered through a deformable convolution network. The resulting depth maps are more complete and accurate, the three-dimensional model is reconstructed with higher precision, and the whole network model generalizes better.
Drawings
FIG. 1 is a schematic diagram showing the steps in an embodiment of the present invention;
FIG. 2 is a diagram of a network architecture in an embodiment of the present invention;
FIG. 3 is a diagram of a recursive feature pyramid network architecture in an embodiment of the present invention;
FIG. 4 is a diagram of the atrous spatial pyramid pooling network structure in an embodiment of the present invention;
FIG. 5 is a diagram of a network architecture of an attention mechanism in an embodiment of the present invention;
FIG. 6 is a three-dimensional reconstruction effect diagram in an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings; the examples are described only to facilitate understanding of the invention and are not limiting.
Existing multi-view three-dimensional reconstruction methods reconstruct weakly textured regions and reflective surfaces poorly, run slowly, consume much memory, and generalize weakly, so the invention discloses a processing method based on an attention mechanism and a deformable convolution depth network. As shown in fig. 1 and 2, the invention provides a multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network, whose specific steps include:
step1, constructing a multi-scale feature extraction network based on a recursive feature pyramid structure, and extracting multi-scale semantic features of the acquired multi-view images to acquire complete and accurate target key features.
Specifically, as shown in fig. 3, a convolutional neural network built on the recursive feature pyramid structure performs multi-scale feature extraction on the multi-view images to obtain original feature maps. Each low-resolution original feature map is passed through an upsampling layer and fused with the original feature map at the next scale to obtain target features rich in semantic information. Applying this upsampling-and-fusion processing to the original feature maps between every two adjacent scales yields feature maps at multiple scales.
Furthermore, the obtained multi-scale feature maps are purified through an atrous spatial pyramid pooling network, and the purified feature maps are fed into the feature extraction network again to re-extract target key features, yielding multi-scale feature maps with richer semantic information.
In this embodiment, the process of training the multi-scale extraction network through the training set includes:
inputting N H W multi-view images, including a reference view I ref And N-1 source views { I } src Acquiring three-scale original feature images through a trunk feature extraction network, and respectively marking the three-scale original feature images
Figure BDA0004103716460000041
Three-scale original feature map is F 1 ,F 2 ,F 3 ,F 1 Obtaining ∈ through an upsampling network>
Figure BDA0004103716460000042
New feature map F of scale 1 ', F 1 ' and F 2 Fusion to give->
Figure BDA0004103716460000043
Final feature map F of scale 2 '. Similarly, F is 2 ' through the upsampling network and with F 3 Fusion, obtain->
Figure BDA0004103716460000044
Scale target key feature F 3 . As shown in FIG. 4, the pyramid network of the cavity space convolution pool can refine and purify the input feature map, and F is the above 1 、F 2 、F 3 Three feature images are input into a feature image F with more abundant obtained information 1 、F 2 、F 3 . Finally, F 1 、F 2 、F 3 Inputting the features into the trunk feature extraction network again for feature extraction again,acquiring more refined feature map F 1 ”’ 、F 2 ”’ 、F 3 ”’ And F is combined with 1 ”’ And F is equal to 1 、F 2 ”’ And F is equal to 2 、F 3 ”’ And F is equal to 3 And respectively fusing to obtain the output multi-scale characteristics. Compared with the traditional convolutional network for extracting the features, the method has the advantages that the repeated extraction and fusion of the target features are carried out through the recursive feature pyramid network, the feature map with richer semantic information and more accurate target features can be obtained, the output multi-scale features can be used for training a subsequent model at different scales, and the robustness and generalization capability of the whole model network are improved.
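For concreteness, the following is a minimal PyTorch sketch of such a recursive feature pyramid with ASPP purification. It is an illustrative assumption rather than the patent's implementation: the backbone interface, the channel width ch=32, and the additive fusions are all hypothetical choices, and the recursive pass is approximated by adding the refined maps back to the backbone features before a second fusion pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convs fused by a 1x1 conv."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([F.relu(b(x)) for b in self.branches], dim=1))

class RecursiveFPN(nn.Module):
    """Two fusion passes over backbone features, with ASPP purification in between."""
    def __init__(self, backbone, ch=32):
        super().__init__()
        self.backbone = backbone                 # assumed to return [F1, F2, F3]
        self.aspp = nn.ModuleList(ASPP(ch) for _ in range(3))

    def top_down(self, feats):
        # Upsample each coarser map and add it to the next finer one
        out = [feats[0]]
        for f in feats[1:]:
            up = F.interpolate(out[-1], size=f.shape[-2:], mode="bilinear",
                               align_corners=False)
            out.append(f + up)
        return out

    def forward(self, img):
        feats = self.backbone(img)                          # F1, F2, F3 (coarse to fine)
        first = self.top_down(feats)                        # first pyramid pass
        refined = [a(f) for a, f in zip(self.aspp, first)]  # ASPP purification
        # Recursive pass: refined maps are added back to the backbone features,
        # standing in for feeding them through the backbone a second time
        second = self.top_down([f + r for f, r in zip(feats, refined)])
        return [a + b for a, b in zip(first, second)]       # fuse the two passes
```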
Step2, performing pixel-level iterative depth evaluation in the block matching iterative model based on the attention mechanism and the deformable convolution depth network, and outputting a depth map for each image once the iteration completes.
For multi-view stereo matching, the correct depth value of any pixel of the reference image is also the correct depth value of the corresponding pixel in every source view. Multi-view depth estimation must therefore integrate visual information from multiple source views to estimate each depth value of every pixel of the reference view. To this end, as shown in fig. 5, global semantic information is integrated through the attention mechanism, fully exploiting the spatial-domain relationships of each pixel to strengthen the key features of the target. A weighted calculation based on the grouping operation then produces the cost value of each depth hypothesis of each pixel; the cost volume is regularized by a three-dimensional convolutional network to obtain the probability of each depth hypothesis, and depth regression yields the depth map. Finally, deformable filtering is applied through the deformable convolution network: unlike a fixed filtering window, deformable filtering can collect the coordinates of pixels lying on the same three-dimensional plane, so averaging them effectively smooths unreasonable depth values in the depth map and produces a more accurate result.
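As one plausible form of this attention step, the following sketch uses a CBAM-style spatial attention to produce per-pixel weights over the reference-view features; the module name and design are assumptions for illustration, since the patent does not fix a specific attention architecture here.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over the reference-view features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat):                        # feat: [B, C, H, W]
        avg = feat.mean(dim=1, keepdim=True)        # channel-average map
        mx, _ = feat.max(dim=1, keepdim=True)       # channel-max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return feat * attn                          # re-weighted features w(p) * F(p)
```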
Specifically, in the iterative computation, the first iteration takes the target feature map as the initial input; each subsequent iteration concatenates the target feature map with the depth map output by the previous iteration as its input.
In the present embodiment, uniform sampling is performed over a predefined inverse depth range [1/d_max, 1/d_min], giving each pixel multiple depth hypotheses, and a random number is added to each hypothesis to randomize it. The coordinates (u, v) of pixels similar to each pixel are acquired through the deformable convolutional neural network, and depth sampling at those coordinates produces new candidate hypotheses. This lets the network model handle large-scale complex scenes and improves its generalization capability. For each subsequent iteration, random numbers within a reasonable depth range R are added to the depth map generated by the previous iteration as a random perturbation, and R gradually shrinks as the iterations proceed. This further refines the depth estimate and corrects erroneous estimates from earlier iterations.
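The hypothesis-generation procedure just described can be sketched as follows; the function names and tensor layout are hypothetical, and the jitter scheme is one reasonable reading of "adding a random number to each depth hypothesis":

```python
import torch

def init_hypotheses(d_min, d_max, num_d, shape):
    """First iteration: per-pixel hypotheses uniform in inverse depth, with jitter."""
    B, H, W = shape
    inv = torch.linspace(1.0 / d_max, 1.0 / d_min, num_d).view(1, num_d, 1, 1)
    step = (1.0 / d_min - 1.0 / d_max) / (num_d - 1)
    jitter = (torch.rand(B, num_d, H, W) - 0.5) * step   # random number per hypothesis
    return 1.0 / (inv + jitter).clamp(min=1e-6)          # depths, [B, D, H, W]

def perturb_hypotheses(prev_depth, R, num_d):
    """Later iterations: random perturbation of the previous depth map within +/- R;
    the caller gradually shrinks R as the iterations proceed."""
    B, H, W = prev_depth.shape
    offsets = (torch.rand(B, num_d, H, W) * 2.0 - 1.0) * R
    return (prev_depth.unsqueeze(1) + offsets).clamp(min=1e-6)
```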
The specific process of depth evaluation based on the attention mechanism and the deformable convolution depth network is as follows. The weight information w_i(p) of each pixel p of the reference view is first acquired through the attention mechanism. The group-wise similarity S_i(p, j) between the reference view and source view I_i at pixel p under the j-th depth hypothesis is then computed per feature group g as

S_i(p, j)_g = (G / C) · ⟨ F_0(p)_g , F_i(p_{i,j})_g ⟩,  g = 1, …, G.

A three-dimensional convolutional network projects S_i(p, j) into the cost value of each depth hypothesis of each pixel of the reference view, and depth regression yields the reference-view depth map. Finally, the deformable convolution network acquires the coordinates of pixels lying on the same local plane in three-dimensional space, depth sampling is performed at those coordinates, and the depth values of all sampling points are averaged to obtain the smoothed depth value. The deformable filtering takes the form

d_p = (1 / n) · Σ_{t=1}^{n} d_t,

wherein F_0(p)_g and F_i(p_{i,j})_g denote the g-th group of features of the reference view and of the source view respectively, C is the number of feature channels and G the number of groups, p_{i,j} denotes the pixel of source view I_i corresponding to pixel p of the reference view under depth hypothesis j, N denotes the total number of views, D_i denotes the depth hypotheses for source view I_i, d_p denotes the smoothed depth value of pixel p, d_t denotes the depth value of a sampled pixel lying in the same plane as p, and n denotes the number of sampling points.
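The two formulas above can be read as the following minimal PyTorch sketch; it is an illustrative assumption in which the group-wise inner product is realized as a per-group channel mean (the (G/C)-scaled dot product), and the deformable filtering samples depths at offsets assumed to come from a deformable-convolution offset branch:

```python
import torch
import torch.nn.functional as F

def groupwise_similarity(ref_feat, warped_src, groups):
    """S_i(p, j): per-group inner product of reference and warped source features.

    ref_feat: [B, C, H, W]; warped_src: [B, C, D, H, W] -> returns [B, G, D, H, W].
    The mean over the C/G channels of each group realises the (G/C)-scaled
    inner product of the formula above.
    """
    B, C, D, H, W = warped_src.shape
    ref = ref_feat.view(B, groups, C // groups, 1, H, W)
    src = warped_src.view(B, groups, C // groups, D, H, W)
    return (ref * src).mean(dim=2)

def deformable_depth_filter(depth, offsets):
    """d_p = (1/n) sum_t d_t: average depths sampled at learned per-pixel offsets.

    depth: [B, 1, H, W]; offsets: [B, n, 2, H, W] in normalised [-1, 1] units.
    """
    B, n, _, H, W = offsets.shape
    y, x = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                          indexing="ij")
    base = torch.stack([x, y], dim=0).to(depth.device)         # [2, H, W]
    grid = (base + offsets).permute(0, 1, 3, 4, 2).reshape(B, n * H, W, 2)
    sampled = F.grid_sample(depth, grid, align_corners=True)   # [B, 1, n*H, W]
    return sampled.view(B, 1, n, H, W).mean(dim=2)             # smoothed depth map
```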
Step3, taking the depth map obtained in Step2 and the corresponding image as input, extracting the depth residual through a residual network, and optimizing the depth map to obtain the final depth map.
Specifically, to avoid the estimate being biased at any particular depth scale, the input depth map is normalized to [0, 1].
Further, the reference-view feature map extracted by the multi-scale feature network is upsampled through a deconvolution network, the normalized depth map is upsampled to the H×W scale, the two resulting features are fused, depth residual information is extracted through a residual network, and finally the depth residual information is fused with the depth map obtained in Step2 to obtain the final optimized depth map.
In the experiments, the depth map output at each stage deviates from the ground truth to a different degree; to better reflect the loss variation over the whole training process, the losses of the individual stages are accumulated into the final loss, so the loss function is defined as the sum over all K stages,

L = Σ_{k=1}^{K} L_k,

where L_k measures the difference between the depth map output at stage k and the ground-truth depth.
in this embodiment, the target key feature and the depth map obtained in Step2 are used as inputs, the feature map and the depth map normalized to [0,1] are respectively up-sampled and combined together, depth residual information is extracted through a residual network, the depth residual information and the normalized depth map are fused, and then the fused result is restored to be within the original depth range, so as to obtain the optimized depth map.
Step4, taking all depth maps obtained through the iterative computation as input, filtering depth points through geometric-consistency and illumination-consistency constraints, fusing the depth maps according to a visibility-based fusion algorithm, and outputting the three-dimensional point cloud of the scene.
Specifically, the depth map obtained in Step3 and the probabilities of the corresponding depth hypotheses serve as input; depth points are filtered according to illumination consistency and geometric-constraint consistency, and those that do not meet the threshold conditions are removed. A visibility-based fusion algorithm then minimizes the influence of occlusion, illumination, and similar factors, so that depth occlusions and conflicts between different views are minimized, a unified point-cloud representation is obtained, and the three-dimensional reconstruction result is output.
In this embodiment, the previously computed depth map and the probabilities of the corresponding depth hypotheses are input, an illumination-consistency threshold α and a geometric-constraint-consistency threshold β are preset, and depth points failing either threshold are removed. To further suppress noise, the depth values of the similar pixels that satisfy the thresholds are averaged to obtain the final visible depth. Finally, as shown in fig. 6, the fused depth map is projected directly into three-dimensional space and the three-dimensional reconstructed point-cloud model is output.
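The filtering rule can be sketched as follows, assuming a relative-depth reprojection error as the geometric test and a minimum number of consistent views; the thresholds, names, and NumPy interface are illustrative assumptions.

```python
import numpy as np

def filter_and_fuse(ref_depth, ref_prob, reproj_depths, alpha=0.8, beta=0.01,
                    min_consistent=3):
    """ref_depth, ref_prob: [H, W]; reproj_depths: [V, H, W], the depth of each
    reference pixel re-estimated through each of V source views. Returns a
    keep-mask and noise-suppressed depths averaged over agreeing views."""
    photo_ok = ref_prob > alpha                              # illumination consistency
    rel_err = np.abs(reproj_depths - ref_depth) / np.maximum(ref_depth, 1e-6)
    agree = rel_err < beta                                   # per-view geometric test
    geo_ok = agree.sum(axis=0) >= min_consistent             # geometric consistency
    keep = photo_ok & geo_ok
    # Average the depth values that agree across views to suppress noise
    fused = (reproj_depths * agree).sum(axis=0) / np.maximum(agree.sum(axis=0), 1)
    return keep, np.where(keep, fused, 0.0)
```

The kept depths are then back-projected into three-dimensional space to form the fused point cloud.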
The multi-view three-dimensional reconstruction algorithm based on the attention mechanism and the deformable convolution depth network introduces the attention mechanism on top of the recursive feature pyramid structure to strengthen the key features of the target, makes better use of global semantic information during depth evaluation, and improves depth-evaluation precision; based on a local-planarity assumption, the depth map is optimized through the deformable convolution depth network, which largely resolves the poor reconstruction of weakly textured regions in existing three-dimensional reconstruction algorithms and meets the three-dimensional reconstruction needs of large, complex scenes.
The above results illustrate the feasibility and effectiveness of the proposed three-dimensional reconstruction algorithm based on the attention mechanism and the deformable convolution depth network.
It should be noted that the above detailed description with reference to the drawings is only a preferred embodiment of the invention; various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention, and obvious modifications fall within its scope of protection.

Claims (5)

1. A multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network, characterized by comprising the following steps:
step1: constructing a multi-scale feature extraction network based on a recursive feature pyramid structure, and extracting multi-scale semantic features of the acquired multi-view images to acquire complete and accurate target key features;
step2: performing pixel-level iterative depth evaluation in a block matching iterative model based on an attention mechanism and a deformable convolution depth network, and outputting a depth map for each image once the iteration completes;
step3: based on a residual depth optimization module, taking the depth map and a corresponding image as input, extracting a depth residual through a residual network, and optimizing the depth map to obtain a final depth map;
step4: based on a depth map fusion module, taking all depth maps obtained by the iterative computation as input, filtering depth points through geometric-consistency and illumination-consistency constraints, fusing the depth maps according to a visibility-based fusion algorithm, and outputting the three-dimensional point cloud of the scene.
2. The multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network according to claim 1, wherein the recursive feature pyramid structure uses a convolutional neural network to extract multi-scale features from the plurality of multi-view images to obtain original feature maps;
an upsampling layer is applied to each low-resolution original feature map, which is fused with the original feature map at the next scale to obtain target features rich in semantic information;
this upsampling-and-fusion processing is applied to the original feature maps between every two adjacent scales to obtain feature maps at multiple scales;
the obtained multi-scale feature maps are purified through an atrous spatial pyramid pooling network, and the purified feature maps are fed into the feature extraction network again to re-extract target key features, yielding multi-scale feature maps with richer semantic information.
3. The multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network according to claim 1, wherein the block matching iterative model based on the attention mechanism and the deformable convolution depth network strengthens the key features of the reference image through the attention mechanism, so that the iterative computation focuses on key targets in the image;
pixels of each source view are then transformed into the reference-view coordinate system by homography, and the dot product of the reference view and the source view is computed in that coordinate system to obtain the image similarity;
the feature channels are divided into groups, and each group similarity is projected by a three-dimensional convolutional network into a cost value for each depth hypothesis of each pixel of the reference image;
the depth cost values of all the source views are then sum-normalized to obtain the probability of each depth hypothesis of each pixel, and depth regression outputs a depth map;
finally, the deformable convolution depth network samples the depth values of the pixels similar to each pixel and applies depth filtering to smooth away erroneous depth estimates;
during the iterative computation, the first iteration takes the target feature map as the initial input;
each subsequent iteration concatenates the target feature map with the depth map output by the previous iteration as its input.
4. The multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network according to claim 1, wherein the residual depth optimization module normalizes the input depth map to [0, 1] to avoid the estimate being biased at any particular depth scale;
the reference-view feature map extracted by the multi-scale feature network is upsampled through a deconvolution network, the normalized depth map is upsampled, the two resulting features are fused, and depth residual information is extracted through a residual network;
finally, the obtained depth residual information is fused with the normalized and upsampled depth map to obtain the final optimized depth map.
5. The multi-view three-dimensional reconstruction method based on an attention mechanism and a deformable convolution depth network according to claim 1, wherein the depth map fusion module filters depth points according to illumination consistency and geometric-constraint consistency, eliminating depth points that do not meet the threshold conditions;
a visibility-based fusion algorithm minimizes the influence of occlusion, illumination, and similar factors, so that depth occlusions and conflicts between different views are minimized, a unified point-cloud representation is obtained, and the three-dimensional reconstruction result is output.
CN202310185694.XA (filed 2023-03-01) Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network · status: Pending · publication: CN116310098A (en)

Priority Applications (1)

CN202310185694.XA · priority date 2023-03-01 · filing date 2023-03-01 · Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network (CN116310098A)

Applications Claiming Priority (1)

CN202310185694.XA · priority date 2023-03-01 · filing date 2023-03-01 · Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network (CN116310098A)

Publications (1)

CN116310098A (en) · published 2023-06-23

Family

ID=86816186

Family Applications (1)

CN202310185694.XA · priority date 2023-03-01 · filing date 2023-03-01 · pending · Multi-view three-dimensional reconstruction method based on attention mechanism and deformable convolution depth network

Country Status (1)

CN: CN116310098A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011466A (en) * 2023-08-08 2023-11-07 苏州三垣航天科技有限公司 Three-dimensional reconstruction method based on piecewise planar algorithm
CN117011466B (en) * 2023-08-08 2024-03-29 苏州三垣航天科技有限公司 Three-dimensional reconstruction method based on piecewise planar algorithm
CN117274351A (en) * 2023-11-02 2023-12-22 华东师范大学 Semantic-containing three-dimensional reconstruction method for multi-scale feature pyramid
CN117437363A (en) * 2023-12-20 2024-01-23 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator
CN117437363B (en) * 2023-12-20 2024-03-22 安徽大学 Large-scale multi-view stereoscopic method based on depth perception iterator

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination