CN116758219A - Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network


Info

Publication number: CN116758219A
Application number: CN202310716212.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: depth, view, image, reconstruction, stereo matching
Legal status: Pending
Inventors: 朱建科 (Jianke Zhu), 李宇 (Yu Li), 张一粟 (Yisu Zhang)
Current/Original Assignee: Zhejiang University (ZJU)
Priority/Filing date: 2023-06-16
Publication date: 2023-09-15

Classifications

    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/08 - Learning methods
    • G06T 7/80 - Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06T 2207/10028 - Range image; depth image; 3D point clouds
    • G06T 2207/20081 - Training; learning
    • G06T 2207/20084 - Artificial neural networks [ANN]


Abstract

The invention discloses a neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method. The invention constructs a region-aware multi-view stereo matching reconstruction framework, trains it to obtain a region-aware multi-view stereo matching reconstruction model, and uses this model to reconstruct the multi-view images of an object to be reconstructed. The invention effectively solves the surface-topology reconstruction accuracy problem of current multi-view three-dimensional imaging methods in texture-deficient regions and at object boundaries, rapidly generates high-precision three-dimensional models, and brings great convenience and efficiency to design, simulation, analysis and interaction in fields such as virtual reality, game development, architectural design, industrial manufacturing, robot vision and medical image processing.

Description

Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network
Technical Field
The invention relates to a multi-view image three-dimensional reconstruction method in the field of computer vision, in particular to a neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method.
Background
Three-dimensional reconstruction techniques convert two-dimensional images or video into three-dimensional models. They have important application value in many fields, such as virtual reality, game development, architectural design, industrial manufacturing, robot vision and medical image processing. Three-dimensional reconstruction can quickly generate high-precision three-dimensional models, bringing great convenience and efficiency to design, simulation, analysis and interaction in these fields. For example, in virtual reality, three-dimensional reconstruction can replicate real-world scenes in the virtual environment, enabling users to perceive it more realistically; in medical image processing, it can convert two-dimensional medical images into three-dimensional models, helping doctors make better diagnoses; in industrial manufacturing, it can quickly generate three-dimensional models of products, providing reliable data support for product design and manufacturing. The application prospects of three-dimensional reconstruction are thus very broad, bringing both opportunities and challenges to the development of many fields.
As a representative technique in three-dimensional reconstruction, multi-view stereo matching can not only recover three-dimensional geometry but also extract texture, color, surface morphology and other information of the scene. It exploits the matching relations and stereo correspondences among overlapping images, computing the position and depth of each pixel under different viewing angles to achieve high-precision three-dimensional reconstruction. Over the development of multi-view stereo matching, many classical algorithms and models have emerged, falling mainly into traditional stereo matching methods and deep-learning-based stereo matching methods.
Traditional multi-view stereo matching methods use various three-dimensional representations, such as meshes, point clouds, voxels and depth maps. Among these representations, depth-map-based methods reconstruct surfaces more completely and with higher robustness: by converting the multi-view reconstruction problem into a depth estimation problem and fusing all depth maps into a single three-dimensional point cloud, the troublesome topology problem is avoided. Among such methods, COLMAP and ACMM obtain stable results: ACMM employs multi-scale geometric consistency to reconstruct features at different scales, while COLMAP uses photometric and geometric priors to estimate pixel-level depths and normal vectors. In complex scenes with large amounts of matching noise and poor correspondences, however, the results of traditional multi-view stereo matching can exhibit significant artifacts.
Deep-learning-based multi-view stereo matching methods attempt to exploit global scene semantics, including environmental illumination and object materials, maintaining high performance under complex lighting conditions and overcoming the limitations of traditional methods. The key to these methods is to warp deep image features into the reference camera frustum through differentiable homographies, building three-dimensional matching cost volumes; the cost volumes are then regularized by three-dimensional convolutional neural networks to predict the depth map. However, two problems remain unsolved in this class of methods: first, the estimation confidence is low in texture-less regions; second, many outliers appear near object boundaries. This is mainly because the surface is treated as a set of uncorrelated sampling points rather than as a surface with topological relations. Since each ray is associated with only one surface sampling point, the method cannot attend to neighbouring regions of the surface: the estimation of each depth value is constrained by a single surface sampling point and cannot reason jointly with the surrounding surface. Unfortunately, in texture-less regions and at object boundaries it is difficult to infer depth without such wider surface information. The existing deep-learning-based multi-view stereo matching methods are therefore limited by a perception range that is too small, and reconstruct poorly in texture-less regions and at object boundaries.
Disclosure of Invention
In order to effectively solve the poor reconstruction of existing methods in texture-less regions and at object boundaries, the invention provides a neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method, which achieves complete estimation of texture-less regions, reduces outliers in object boundary regions, and ultimately improves the accuracy and completeness of object reconstruction in various scenes.
The technical scheme adopted by the invention is as follows:
S1: constructing a multi-view image dataset;
S2: establishing a region-aware multi-view stereo matching reconstruction framework;
S3: training the region-aware multi-view stereo matching reconstruction framework with the multi-view image dataset to obtain a region-aware multi-view stereo matching reconstruction model;
S4: inputting the multi-view images of the object to be reconstructed into the region-aware multi-view stereo matching reconstruction model, and outputting the depth map reconstruction result at the reference image viewing angle.
In the multi-view image dataset, each group of training data consists of several images at different viewing angles and their corresponding depth maps, wherein the image at one viewing angle serves as the reference image and the images at the remaining viewing angles serve as source images.
The region-aware multi-view stereo matching reconstruction framework comprises an image feature extraction network, three matching cost volume construction modules and three depth and signed distance field estimation networks. The input of the framework serves as the input of the image feature extraction network, which is connected to all three matching cost volume construction modules. The first matching cost volume construction module is connected to the first depth and signed distance field estimation network, and the coarse depth map reconstruction result output by the first estimation network is input into the second matching cost volume construction module; the second matching cost volume construction module is connected to the second depth and signed distance field estimation network, and the coarse depth map reconstruction result output by the second estimation network is input into the third matching cost volume construction module; the third matching cost volume construction module is connected to the third depth and signed distance field estimation network.
The image feature extraction network outputs two-dimensional feature maps at different scales and inputs them into the three matching cost volume construction modules respectively.
In the matching cost volume construction module, a hypothetical-plane depth range is first set; homography transformations are then applied, based on this depth range, to the two-dimensional feature maps of the source images in the current group of training data to obtain the corresponding feature volumes at the reference viewing angle; the feature volumes at the reference viewing angle are then aggregated into the matching cost volume at the reference image viewing angle.
The matching cost volume construction module specifically operates as follows:
First, the hypothetical-plane depth range is set, determining the hypothetical planes at different depths. Next, for the hypothetical planes at different depths, the homography matrix that converts each source image to the reference image viewing angle is determined by the following formula:

H_i(d) = K_i · R_i · (I − t_{0→i} · n_0^T / d) · R_0^T · K_0^{-1}

where H_i(d) is the homography matrix of the i-th source image at depth value d, K_i and R_i are the intrinsic matrix and the extrinsic rotation matrix of the i-th source image, K_0 and R_0 are the intrinsic matrix and the extrinsic rotation matrix of the reference image, I is the identity matrix, and n_0 is the principal axis direction of the reference camera;
Then, within the hypothetical planes at different depths, each homography matrix maps each reference-image pixel p to the warped pixel p′ ∼ H_i(d) · p in the two-dimensional feature map of the corresponding source image, and the features sampled there form the feature volume V_i of each source image at the reference viewing angle over the hypothetical-plane depth range, where t_{0→i} in the homography above is the relative camera translation from the reference image to the source image;
Finally, the feature volumes V_i of all source images at the reference viewing angle over the hypothetical-plane depth range are aggregated to obtain the matching cost volume at the reference image viewing angle:

C = (1/N) · Σ_{i=1}^{N} F(V_i ⊙ V_0) ⊙ (V_i ⊙ V_0)

where C is the matching cost volume at the reference image viewing angle, F is a three-dimensional convolutional neural network for element-by-element weight estimation, ⊙ denotes element-by-element multiplication, N is the number of source images, and V_i and V_0 are the features extracted from the source images and the reference image over the hypothetical-plane depth range.
The input of the depth and signed distance field estimation network is a matching cost volume; in the network, regularization and signed-distance estimation are applied to the matching cost volume, yielding a probability volume and a signed distance volume respectively; the two volumes are then fused by a volume fusion module to obtain the depth map reconstruction result of the reference image.
In the depth and signed distance field estimation network, the probability volume P of the matching cost volume is calculated by the following formula:

P = F_softmax(C)

where C denotes the matching cost volume and F_softmax() denotes a softmax-based regularization three-dimensional convolutional neural network;
the signed distance body S of the matching cost body is calculated using the following formula:
S=F tanh (C)
wherein ,Ftanh () Representing a regularized three-dimensional convolutional neural network based on a tanh function;
in the volume fusion module, the depths of the probability volume are fused layer by layer under the supervision of the signed distance volume, specifically: when the signed distance of a layer in S is less than or equal to the distance threshold, the probability-weighted depth of that layer is accumulated; otherwise the layer is not accumulated. The formula is:

D_j = D_{j_prev} + Softmax(P)_j · d_j, when S_j is less than or equal to the distance threshold
D_j = D_{j_prev}, when S_j is greater than the distance threshold

where j is the depth layer number running from the hypothetical-plane minimum depth d_min to the hypothetical-plane maximum depth d_max; j_prev is the layer number of the previous layer before layer j; D_j and D_{j_prev} are the accumulated depth maps at layers j and j_prev; Softmax() is the Softmax probability function; and d_j is the depth of the j-th hypothetical plane;
after the hypothetical planes of all depths in the probability volume are fused, the depth map reconstruction result of the reference image is obtained.
In step S3, the total loss is calculated from the depth map ground truth, the signed distance volume estimate output by the region-aware multi-view stereo matching reconstruction framework, and the depth map reconstruction result, by the following formula:

L = L_d + λ·L_s, with L_d = Σ_{l=1}^{3} ℓ(D_l, D*_l) and L_s = Σ_{l=1}^{3} ℓ(S_l, S*_l)

where L is the total L1 loss value; L_d and L_s are the L1 loss values on the depth map and on the signed distance volume respectively; D*_l and S*_l are the depth map ground truth and the signed distance volume ground truth; D_l and S_l are the depth map reconstruction result and the signed distance volume estimate of stage l; l is the stage number over the three stages; ℓ() is the L1 loss function; and λ is the loss weight.
The signed distance volume ground truth is calculated from the depth map ground truth by a local search module, specifically:
For each query point pt on the different hypothetical planes of each matching cost volume C, the shortest distance from pt to the surface sample points pt′ within a preset range nearest to pt is taken as the signed-distance ground truth of pt, thereby obtaining the signed distance volume ground truth; pt′ denotes the surface points obtained from the depth map ground truth D′, and the number of surface sample points pt′ within the preset range is k×k.
The beneficial effects of the invention are as follows:
by adopting the technical scheme, the image feature extraction network of the method uses the basic structure of a Recursive Feature Pyramid (RFP), so that the object itself needing reconstruction can be focused more.
The invention restores a detailed three-dimensional scene by using depth and signed distance field branching and using a matching cost body, introduces the signed distance from a sampling point to the surface as supervision, expands the perception range of model prediction, realizes the complete estimation of a texture-free area and reduces the outlier of an object boundary area.
The method calculates the signed distance between the point sets based on the triangular meshes to balance accuracy and speed, thereby solving the problem of lack of true value of the model meshes.
The invention ensures that the problem that the existing multi-view stereo matching based on deep learning has poor reconstruction effect in a texture-free area and an object boundary can be effectively solved on the basis of slightly increasing the occupation of a control video memory, and the integrity and the accuracy of the multi-view stereo matching reconstruction are obviously improved.
Drawings
Fig. 1 is a general flow chart of a region-aware multi-view stereo matching three-dimensional reconstruction method based on a neural network according to an embodiment of the present invention.
FIG. 2 is an overall region-aware multi-view stereo matching reconstruction framework of an embodiment of the present invention.
Fig. 3 is a flow chart of a matching cost body construction module according to an embodiment of the present invention.
FIG. 4 is a flow chart of an estimation network for depth and signed distance fields in accordance with an embodiment of the present invention.
Fig. 5 shows region-aware multi-view stereo matching reconstruction results on the DTU dataset according to an embodiment of the present invention.
Fig. 6 shows reconstructions on the Tanks and Temples dataset, presenting the reconstruction results and their corresponding precision and recall errors, according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in fig. 1, the embodiment of the present invention and the implementation process thereof are specifically as follows:
S1: constructing a multi-view image dataset;
In the multi-view image dataset, each group of training data consists of several images at different viewing angles and their corresponding depth maps: the image at one viewing angle serves as the reference image, the images at the remaining viewing angles serve as source images, and the depth maps serve as depth map ground truths at the corresponding viewing angles. Combined with the hypothetical planes in the matching cost volume, the depth map ground truths can further be converted into signed distance volume ground truths for additional supervision.
In this embodiment, two public datasets, DTU and BlendedMVS, are selected; both provide multi-view image information of the objects to be reconstructed. The DTU and BlendedMVS datasets also provide static point clouds, from which static mesh information is obtained through screened Poisson surface reconstruction; depth maps at different angles are then rendered to form the required dataset.
The DTU and BlendedMVS datasets pose strong three-dimensional reconstruction challenges: the former contains many texture-less surfaces, while the latter contains more complex scenes and object boundary information. The original images of the public datasets and the depth maps obtained by the subsequent processing together form the training data.
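As an illustration, the following is a minimal sketch of how one group of training data described above could be packed; all field names and the helper itself are assumptions for illustration, not part of the invention:

```python
def make_sample(images, intrinsics, extrinsics, depths, ref_idx=0):
    """Pack one group of training data: N views, one of which is the
    reference image; the rest are source images; each view carries its
    camera parameters and a rendered ground-truth depth map."""
    src = [i for i in range(len(images)) if i != ref_idx]
    return {
        "ref_image": images[ref_idx],                  # reference view
        "src_images": [images[i] for i in src],        # N-1 source views
        "ref_K": intrinsics[ref_idx], "ref_RT": extrinsics[ref_idx],
        "src_K": [intrinsics[i] for i in src],
        "src_RT": [extrinsics[i] for i in src],
        "depth_gt": depths[ref_idx],                   # supervises the reference view
    }
```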
S2: establishing a regional perception multi-view stereo matching reconstruction frame;
as shown in fig. 2, the region-aware multi-view stereo matching reconstruction framework includes an image feature extraction network, three matching cost body construction modules and three estimation networks of depth and signed distance fields, the input of the region-aware multi-view stereo matching reconstruction framework serves as the input of the image feature extraction network, in a specific implementation, an original image is cut to 512×640 resolution and then input into the network, the image feature extraction networks are all connected with the three matching cost body construction modules, a first matching cost body construction module is connected with the estimation networks of the first depth and signed distance fields, the rough depth map reconstruction results output by the estimation networks of the first depth and signed distance fields are input into a second matching cost body construction module, the second matching cost body construction module is connected with the estimation networks of the second depth and signed distance fields, the rough depth map reconstruction results output by the estimation networks of the second depth and signed distance fields are input into a third matching cost body construction module, and the third matching cost body construction module is connected with the estimation networks of the third depth and signed distance fields. The third matching cost body construction module may output a final exact depth map reconstruction result.
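A minimal PyTorch sketch of this three-stage wiring is given below; the class and argument names are assumptions, and the feature extractor, cost-volume builders and estimation networks are taken as given components:

```python
import torch.nn as nn

class RegionAwareMVSFramework(nn.Module):
    """Coarse-to-fine cascade: shared feature extraction, then per stage a
    cost-volume construction followed by a depth + SDF estimation network,
    each stage refining the previous coarse depth map."""
    def __init__(self, feat_net, cost_modules, est_nets):
        super().__init__()
        self.feat_net = feat_net                          # recursive feature pyramid
        self.cost_modules = nn.ModuleList(cost_modules)   # 3 cost-volume builders
        self.est_nets = nn.ModuleList(est_nets)           # 3 depth/SDF networks

    def forward(self, images, cams):
        feats = self.feat_net(images)     # pyramid: one scale per stage
        depth, outputs = None, []
        for stage in range(3):
            cost = self.cost_modules[stage](feats[stage], cams, prev_depth=depth)
            depth, sdf = self.est_nets[stage](cost)
            outputs.append((depth, sdf))  # coarse -> fine
        return outputs                    # stage-3 depth is the final result
```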
The image feature extraction network outputs two-dimensional feature maps at different scales and feeds them into the three matching cost volume construction modules respectively. The image feature extraction network is specifically a recursive feature pyramid module, which allows the network to focus more on the object to be reconstructed; the two-dimensional feature maps it outputs at different scales form a pyramid of 3 feature maps of different sizes, denoted {F_1, F_2, F_3}.
The multi-stage matching cost volume construction, i.e. the three sequentially connected matching cost volume construction modules, is a coarse-to-fine multi-stage optimization process. The most critical step in this process is determining the hypothetical-plane depth range. Each matching cost volume has several hypothetical planes corresponding to it. At the coarsest stage, the range from the hypothetical-plane minimum depth to the maximum depth is initialized to cover the full depth range. When constructing the matching cost volume of each subsequent stage, its depth range is refined around the lower-scale depth map produced by the previous stage, completing the coarse-to-fine matching cost volume construction.
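The depth-range logic can be sketched as follows; the narrowing factor and tensor shapes are assumptions:

```python
import torch

def depth_hypotheses(d_min, d_max, num_planes, prev_depth=None, shrink=0.5):
    """Depth values of the hypothetical planes for one stage.

    The coarsest stage spans the full [d_min, d_max] range; later stages
    centre a narrowed range (shrink factor is an assumption) on the
    previous, upsampled depth estimate."""
    if prev_depth is None:                      # coarsest stage: full range
        return torch.linspace(d_min, d_max, num_planes)           # [D]
    half = 0.5 * shrink * (d_max - d_min)
    offsets = torch.linspace(-half, half, num_planes)             # [D]
    return prev_depth.unsqueeze(1) + offsets.view(1, -1, 1, 1)    # [B, D, H, W]
```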
As shown in fig. 3, in the matching cost volume construction module, a hypothetical-plane depth range is first set; homography transformations are then applied, based on this depth range, to the two-dimensional feature maps of the source images in the current group of training data, yielding the corresponding feature volumes at the reference viewing angle; the feature volumes at the reference viewing angle are then aggregated into the matching cost volume at the reference image viewing angle.
The matching cost volume construction module specifically operates as follows:
First, the hypothetical-plane depth range is set, determining the hypothetical planes at different depths. For the first matching cost volume construction module, the hypothetical-plane depth range takes its initial value; for the second and third modules, it is obtained by narrowing the current range around the depth map reconstruction result output by the estimation network of the previous stage. Next, for the hypothetical planes at different depths, the homography matrix that converts each source image to the reference image viewing angle is determined by the following formula:

H_i(d) = K_i · R_i · (I − t_{0→i} · n_0^T / d) · R_0^T · K_0^{-1}

where H_i(d) is the homography matrix of the i-th source image at depth value d, K_i and R_i are the intrinsic matrix and the extrinsic rotation matrix of the i-th source image, K_0 and R_0 are the intrinsic matrix and the extrinsic rotation matrix of the reference image, I is the identity matrix, and n_0 is the principal axis direction of the reference camera;
Then, within the hypothetical planes at different depths, each homography matrix maps each reference-image pixel p to the warped pixel p′ ∼ H_i(d) · p in the two-dimensional feature map of the corresponding source image, and the features sampled there form the feature volume V_i ∈ R^{D×C′×H′×W′} of each source image at the reference viewing angle over the hypothetical-plane depth range, where D, C′, H′ and W′ are the number of hypothetical planes, the feature vector length, the image height and the image width respectively, N denotes the number of source images, and t_{0→i} is the relative camera translation from the reference image to the source image;
in particular, in order to be able to process any number of source images, it will be assumed, at last, that all the corresponding features V of each source image at the reference view in the depth range of the plane i After aggregation, a matching cost body under the view angle of the reference image is obtained, and the formula is as follows:
wherein C is a matching cost volume at the reference picture view,is a three-dimensional convolutional neural network for element-by-element weight estimation, as indicated by element-by-element multiplication, V i and V0 Features extracted from the source image and the reference image in the depth range of the assumed plane, respectively.
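The following PyTorch sketch illustrates the warping and aggregation above; the homography sign convention and the exact aggregation form follow the reconstruction given here and are assumptions for illustration rather than the invention's exact implementation:

```python
import torch
import torch.nn.functional as F

def plane_homography(K_src, R_src, t_src, K_ref, R_ref, t_ref, d):
    """H_i(d) for the reference fronto-parallel plane n0.x = d, with the
    world-to-camera convention x_cam = R X + t."""
    n0 = torch.tensor([[0.0, 0.0, 1.0]])              # 1x3 reference principal axis
    R_rel = R_src @ R_ref.T                           # reference -> source rotation
    t_rel = (t_src - R_rel @ t_ref).reshape(3, 1)     # reference -> source translation
    return K_src @ (R_rel + t_rel @ n0 / d) @ torch.inverse(K_ref)

def warp_to_ref(src_feat, H):
    """Sample a source feature map [1, C, H, W] at the warped pixels p' = H p."""
    _, _, h, w = src_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    p = torch.stack([xs, ys, torch.ones_like(xs)], -1).reshape(-1, 3)  # ref pixels
    q = (H @ p.T).T                                    # deformed pixels p'
    q = q[:, :2] / q[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([q[:, 0] / (w - 1) * 2 - 1,     # normalise to [-1, 1]
                        q[:, 1] / (h - 1) * 2 - 1], -1).view(1, h, w, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)

def build_cost_volume(ref_feat, src_feats, homographies, weight_net):
    """C = (1/N) * sum_i F(V_i . V_0) . (V_i . V_0); weight_net is the
    element-wise 3D-CNN weight estimator (its interface is an assumption)."""
    cost = None
    for src_feat, Hs in zip(src_feats, homographies):  # Hs: one H per depth plane
        V_i = torch.stack([warp_to_ref(src_feat, H) for H in Hs], dim=2)  # [1,C,D,H,W]
        sim = V_i * ref_feat.unsqueeze(2)              # element-wise similarity
        term = weight_net(sim) * sim                   # weighted contribution
        cost = term if cost is None else cost + term
    return cost / len(src_feats)
```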
As shown in fig. 4, the input of the depth and signed distance field estimation network is a matching cost volume. In the network, regularization and signed-distance estimation are applied to the matching cost volume, yielding a probability volume and a signed distance volume respectively; the two volumes are then fused by the volume fusion module to obtain the depth map reconstruction result of the reference image. Specifically:
In the depth and signed distance field estimation network, the probability volume P of the matching cost volume, which represents the weights of the hypothetical planes at different depths, is calculated by the following formula:

P = F_softmax(C)

where C denotes the single-scale matching cost volume and F_softmax() denotes a softmax-based regularization three-dimensional convolutional neural network;
the signed distance volume S of the matching cost volume is calculated using the following formula to represent the signed distances of the different depth hypothesis planes:
S=F tanh (C)
wherein ,Ftanh () Representing regularized three dimensions based on a tanh functionThe convolutional neural network can pay more attention to short-distance sampling points of the structural surface, wherein the surface refers to a depth map reconstruction result, namely, the surface under the three-dimensional concept of a model which is finally reconstructed;
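A minimal sketch of the two regularization branches is given below; the single 3D convolution stands in for the full regularization networks, whose architecture is not specified here:

```python
import torch
import torch.nn as nn

class DepthSDFHeads(nn.Module):
    """Two branches over the cost volume C [B, C, D, H, W]: a probability
    volume P (softmax over the D hypothesis planes) and a signed-distance
    volume S (tanh, in [-1, 1]). Channel size is an assumption."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.reg_p = nn.Conv3d(in_ch, 1, 3, padding=1)  # stand-in for F_softmax's CNN
        self.reg_s = nn.Conv3d(in_ch, 1, 3, padding=1)  # stand-in for F_tanh's CNN

    def forward(self, cost):
        P = torch.softmax(self.reg_p(cost).squeeze(1), dim=1)  # [B, D, H, W]
        S = torch.tanh(self.reg_s(cost).squeeze(1))            # [B, D, H, W]
        return P, S
```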
In the volume fusion module, the depths of the probability volume are fused layer by layer under the supervision of the signed distance volume. For each pixel of the reference image there are two fusion cases under the supervision of the signed-distance threshold: if the threshold is exceeded, the probability-weighted depth of that layer is not accumulated for the pixel. Specifically, when the signed distance of a layer in S is less than or equal to the distance threshold, the estimated depth is close to the surface and the layer is fused; otherwise the estimated depth is far from the surface and the layer is not accumulated:

D_j = D_{j_prev} + Softmax(P)_j · d_j, when S_j is less than or equal to the distance threshold
D_j = D_{j_prev}, when S_j is greater than the distance threshold

where j is the depth layer number running from the hypothetical-plane minimum depth d_min to the hypothetical-plane maximum depth d_max; j_prev is the layer number of the previous layer before layer j; D_j and D_{j_prev} are the accumulated depth maps at layers j and j_prev, initialized to 0 at the start; Softmax() is the Softmax probability function; and d_j is the depth of the j-th hypothetical plane.
After the hypothetical planes of all depths in the probability volume are fused, the depth map reconstruction result of the reference image is obtained.
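The fusion rule can be sketched as follows; the threshold value and the optional renormalization of the kept layers are assumptions:

```python
import torch

def fuse_volumes(P, S, plane_depths, tau=0.5):
    """P, S, plane_depths: [B, D, H, W]; returns the fused depth map [B, H, W].

    P is already a softmax probability volume over the D planes; layers with
    |S| > tau are skipped, and the kept probabilities are renormalised per
    pixel (a numerical-sanity assumption) before taking the expectation."""
    near = (S.abs() <= tau).float()        # 1 where the plane is near the surface
    w = P * near
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return (w * plane_depths).sum(dim=1)   # accumulate probability-weighted depths
```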
S3: training the regional perception multi-view stereo matching reconstruction frame by utilizing the multi-view image dataset to obtain a regional perception multi-view stereo matching reconstruction model;
During training of the region-aware multi-view stereo matching reconstruction framework, the image feature extraction network, the element-wise weight-estimation three-dimensional convolutional neural networks in the matching cost volume construction modules, and the depth and signed distance field estimation networks are trained simultaneously.
In step S3, the total loss is calculated from the depth map ground truth, the signed distance volume estimate output by the region-aware multi-view stereo matching reconstruction framework, and the depth map reconstruction result; gradients are then back-propagated from the total loss to train the framework, the parameters are updated to train the whole network model, and training iterates until the network converges, finally yielding the region-aware multi-view stereo matching reconstruction model. In a specific embodiment, the signed distance field estimation network only starts updating from the 10th training epoch, to avoid the slow model convergence caused by introducing it too early. The total loss is calculated as:

L = L_d + λ·L_s, with L_d = Σ_{l=1}^{3} ℓ(D_l, D*_l) and L_s = Σ_{l=1}^{3} ℓ(S_l, S*_l)

where L is the total L1 loss value; L_d and L_s are the L1 loss values on the depth map and on the signed distance volume respectively; D*_l and S*_l are the depth map ground truth and the signed distance volume ground truth (the latter computed from the depth map ground truth); D_l and S_l are the depth map reconstruction result and the signed distance volume estimate of stage l; l is the stage number over the three stages; ℓ() is the L1 loss function; and λ is the loss weight balancing the two losses, set to 0.1.
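A minimal sketch of the total loss, with λ = 0.1 as stated above; the per-stage lists are assumed inputs:

```python
import torch.nn.functional as F

def total_loss(depths, depth_gts, sdfs, sdf_gts, lam=0.1):
    """L = L_d + lambda * L_s, summing the L1 terms over the three stages."""
    L_d = sum(F.l1_loss(d, g) for d, g in zip(depths, depth_gts))
    L_s = sum(F.l1_loss(s, g) for s, g in zip(sdfs, sdf_gts))
    return L_d + lam * L_s
```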
The signed distance volume ground truth is calculated from the depth map ground truth by a local search module, specifically:
For each query point pt on the different hypothetical planes of each matching cost volume C, the shortest distance from pt to the surface sample points pt′ within a preset range nearest to pt is taken as the signed-distance ground truth of pt, thereby obtaining the signed distance volume ground truth; pt′ denotes the surface points obtained from the depth map ground truth D′, and the number of surface sample points pt′ within the preset range is k×k. In this embodiment, k = 5. Such a local search strategy guarantees high efficiency.
In computing the shortest distance, a local search module is used for each search: taking k as the batch size, a small-range search is performed on the surface to be queried; that is, for each query point, only the nearest k×k surface sample points are selected for distance computation. The time complexity is thereby reduced from O(n×H×W) for a global search to O(n×k×k), i.e. O(n), where n is the number of query points, H is the depth image height and W is the depth image width. The object surface on which the above sample points lie is represented by a triangular mesh.
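A minimal sketch of this local search is given below; it computes unsigned shortest distances only (the sign handling is omitted), and the projection of query points to pixels is assumed precomputed:

```python
import torch

def local_search_distance(query_pts, proj_pix, surface_pts, k=5):
    """O(n*k*k) local search: each query point compares only against the
    k x k surface samples around the pixel it projects to.

    query_pts:   [n, 3] points on the hypothesis planes
    proj_pix:    [n, 2] integer (row, col) projections of the query points
    surface_pts: [H, W, 3] points back-projected from the GT depth map"""
    H, W, _ = surface_pts.shape
    r = k // 2
    out = torch.empty(len(query_pts))
    for idx, (q, (i, j)) in enumerate(zip(query_pts, proj_pix)):
        i0, i1 = max(int(i) - r, 0), min(int(i) + r + 1, H)
        j0, j1 = max(int(j) - r, 0), min(int(j) + r + 1, W)
        patch = surface_pts[i0:i1, j0:j1].reshape(-1, 3)
        out[idx] = (patch - q).norm(dim=-1).min()   # shortest distance in window
    return out
```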
An error analysis of the above procedure: for the hypothesized triangular faces, the minimal sphere around a query point that reaches the surface does so in one of three cases: the nearest point lies on the sphere, the nearest edge formed by two points is tangent to the sphere, or the nearest face formed by three points is tangent to the sphere. In these cases the error can be shown to lie within a very small range:

e ≤ max(e_1, e_2, e_3) ≤ d(pt′_1, pt′_2)

where e is the general error of the query point pt; e_1, e_2 and e_3 are the errors in the three cases in which the nearest point lies on the sphere, the nearest edge formed by two points is tangent to the sphere, and the nearest face formed by three points is tangent to the sphere, respectively; pt′_1 and pt′_2 are two adjacent surface points; and d() is the Euclidean distance between two points. The inequality shows that the error e does not exceed the Euclidean distance between the points re-projected from two adjacent pixels.
S4: inputting the multi-view image of the object to be reconstructed into the region-aware multi-view stereo matching reconstruction model, and outputting a depth map reconstruction result under the view angle of the reference image. During predictive reconstruction, a reference image and a source image in a multi-view image of an object to be reconstructed are determined, and the output of an estimation network of a third depth and a signed distance field is taken as a final output.
The depth map reconstruction result of the reference image is obtained through the above model computation. As shown in figs. 5(a)-5(d) and fig. 6, the reconstructions of the model on the DTU and Tanks and Temples datasets have better completeness and accuracy, and show no additional reconstruction failures in the presence of weakly textured surfaces and object boundaries.
Table 1: reconstruction results on DTU dataset (lower better):
Method accuracy (mm) Integrity (mm) Integral (mm)
COLMAP 0.400 0.664 0.532
MVSNet 0.396 0.527 0.462
RA-MVSNet (the invention) 0.326 0.268 0.297
From Table 1 it is clear that, in the quantitative evaluation on the DTU dataset, the method of the invention significantly outperforms both the traditional method and the early deep-learning-based multi-view stereo matching method in terms of accuracy and completeness.
The embodiment of the invention provides a neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method that recovers detailed three-dimensional scenes using matching cost volumes with depth and signed distance field branches. In the invention, signed distance volume supervision lets the depth prediction exploit more hypothetical planes and improves the accuracy of the surface topology, with a particularly marked effect on texture-less regions and object boundaries; the region-aware strategy associates hypothetical planes with surface patches, enlarging the perception range and improving the completeness of reconstruction results; furthermore, representing the training ground truth with signed distances enables efficient computation. The invention effectively solves the surface-topology reconstruction accuracy problem of current multi-view three-dimensional imaging methods in texture-deficient regions and at object boundaries, rapidly generates high-precision three-dimensional models, and brings great convenience and efficiency to design, simulation, analysis and interaction in fields such as virtual reality, game development, architectural design, industrial manufacturing, robot vision and medical image processing.
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto are within the spirit of the invention and the scope of the appended claims.

Claims (10)

1. A region-aware multi-view stereo matching three-dimensional reconstruction method based on a neural network, characterized by comprising the following steps:
S1: constructing a multi-view image dataset;
S2: establishing a region-aware multi-view stereo matching reconstruction framework;
S3: training the region-aware multi-view stereo matching reconstruction framework with the multi-view image dataset to obtain a region-aware multi-view stereo matching reconstruction model;
S4: inputting the multi-view images of the object to be reconstructed into the region-aware multi-view stereo matching reconstruction model, and outputting the depth map reconstruction result at the reference image viewing angle.
2. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 1, characterized in that each group of training data in the multi-view image dataset consists of several images at different viewing angles and their corresponding depth maps, wherein the image at one viewing angle serves as the reference image and the images at the remaining viewing angles serve as source images.
3. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 1, characterized in that the region-aware multi-view stereo matching reconstruction framework comprises an image feature extraction network, three matching cost volume construction modules and three depth and signed distance field estimation networks; the input of the framework serves as the input of the image feature extraction network, which is connected to all three matching cost volume construction modules; the first matching cost volume construction module is connected to the first depth and signed distance field estimation network, and the coarse depth map reconstruction result output by the first estimation network is input into the second matching cost volume construction module; the second matching cost volume construction module is connected to the second depth and signed distance field estimation network, and the coarse depth map reconstruction result output by the second estimation network is input into the third matching cost volume construction module; and the third matching cost volume construction module is connected to the third depth and signed distance field estimation network.
4. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that the image feature extraction network outputs two-dimensional feature maps at different scales and inputs them into the three matching cost volume construction modules respectively.
5. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that in the matching cost volume construction module, a hypothetical-plane depth range is first set; homography transformations are then applied, based on this depth range, to the two-dimensional feature maps of the source images in the current group of training data to obtain the corresponding feature volumes at the reference viewing angle; and the feature volumes at the reference viewing angle are aggregated into the matching cost volume at the reference image viewing angle.
6. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that the matching cost volume construction module specifically operates as follows:
First, the hypothetical-plane depth range is set, determining the hypothetical planes at different depths. Next, for the hypothetical planes at different depths, the homography matrix that converts each source image to the reference image viewing angle is determined by the following formula:

H_i(d) = K_i · R_i · (I − t_{0→i} · n_0^T / d) · R_0^T · K_0^{-1}

where H_i(d) is the homography matrix of the i-th source image at depth value d, K_i and R_i are the intrinsic matrix and the extrinsic rotation matrix of the i-th source image, K_0 and R_0 are the intrinsic matrix and the extrinsic rotation matrix of the reference image, I is the identity matrix, and n_0 is the principal axis direction of the reference camera;
Then, within the hypothetical planes at different depths, each homography matrix maps each reference-image pixel p to the warped pixel p′ ∼ H_i(d) · p in the two-dimensional feature map of the corresponding source image, and the features sampled there form the feature volume V_i of each source image at the reference viewing angle over the hypothetical-plane depth range, where t_{0→i} in the homography above is the relative camera translation from the reference image to the source image;
Finally, the feature volumes V_i of all source images at the reference viewing angle over the hypothetical-plane depth range are aggregated to obtain the matching cost volume at the reference image viewing angle:

C = (1/N) · Σ_{i=1}^{N} F(V_i ⊙ V_0) ⊙ (V_i ⊙ V_0)

where C is the matching cost volume at the reference image viewing angle, F is a three-dimensional convolutional neural network for element-by-element weight estimation, ⊙ denotes element-by-element multiplication, N is the number of source images, and V_i and V_0 are the features extracted from the source images and the reference image over the hypothetical-plane depth range.
7. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that the input of the depth and signed distance field estimation network is a matching cost volume; in the network, regularization and signed-distance estimation are applied to the matching cost volume, yielding a probability volume and a signed distance volume respectively; and the two volumes are then fused by a volume fusion module to obtain the depth map reconstruction result of the reference image.
8. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that in the depth and signed distance field estimation network, the probability volume P of the matching cost volume is calculated by the following formula:

P = F_softmax(C)

where C denotes the matching cost volume and F_softmax() denotes a softmax-based regularization three-dimensional convolutional neural network;

the signed distance volume S of the matching cost volume is calculated by the following formula:

S = F_tanh(C)

where F_tanh() denotes a tanh-based regularization three-dimensional convolutional neural network;
in the volume fusion module, the depths of the probability volume are fused layer by layer under the supervision of the signed distance volume, specifically: when the signed distance of a layer in S is less than or equal to the distance threshold, the probability-weighted depth of that layer is accumulated; otherwise the layer is not accumulated. The formula is:

D_j = D_{j_prev} + Softmax(P)_j · d_j, when S_j is less than or equal to the distance threshold
D_j = D_{j_prev}, when S_j is greater than the distance threshold

where j is the depth layer number running from the hypothetical-plane minimum depth d_min to the hypothetical-plane maximum depth d_max; j_prev is the layer number of the previous layer before layer j; D_j and D_{j_prev} are the accumulated depth maps at layers j and j_prev; Softmax() is the Softmax probability function; and d_j is the depth of the j-th hypothetical plane;
after the hypothetical planes of all depths in the probability volume are fused, the depth map reconstruction result of the reference image is obtained.
9. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 3, characterized in that in S3 the total loss is calculated from the depth map ground truth, the signed distance volume estimate output by the region-aware multi-view stereo matching reconstruction framework, and the depth map reconstruction result, by the following formula:

L = L_d + λ·L_s, with L_d = Σ_{l=1}^{3} ℓ(D_l, D*_l) and L_s = Σ_{l=1}^{3} ℓ(S_l, S*_l)

where L is the total L1 loss value; L_d and L_s are the L1 loss values on the depth map and on the signed distance volume respectively; D*_l and S*_l are the depth map ground truth and the signed distance volume ground truth; D_l and S_l are the depth map reconstruction result and the signed distance volume estimate of stage l; l is the stage number over the three stages; ℓ() is the L1 loss function; and λ is the loss weight.
10. The neural-network-based region-aware multi-view stereo matching three-dimensional reconstruction method according to claim 9, characterized in that the signed distance volume ground truth is calculated from the depth map ground truth by a local search module, specifically:
for each query point pt on the different hypothetical planes of each matching cost volume C, the shortest distance from pt to the surface sample points pt′ within a preset range nearest to pt is taken as the signed-distance ground truth of pt, thereby obtaining the signed distance volume ground truth; pt′ denotes the surface points obtained from the depth map ground truth D′, and the number of surface sample points pt′ within the preset range is k×k.
Application CN202310716212.9A, priority date 2023-06-16, filing date 2023-06-16: Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network. Publication CN116758219A, status Pending.

Priority Applications (1)

Application Number: CN202310716212.9A - Priority Date: 2023-06-16 - Filing Date: 2023-06-16 - Title: Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network


Publications (1)

Publication Number: CN116758219A - Publication Date: 2023-09-15

Family

ID=87950922

Family Applications (1)

Application Number: CN202310716212.9A (pending) - Priority Date: 2023-06-16 - Filing Date: 2023-06-16 - Title: Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network

Country Status (1)

Country: CN - Link: CN116758219A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958455A (en) * 2023-09-21 2023-10-27 北京飞渡科技股份有限公司 Roof reconstruction method and device based on neural network and electronic equipment
CN116958455B (en) * 2023-09-21 2023-12-26 北京飞渡科技股份有限公司 Roof reconstruction method and device based on neural network and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination